Cloudera Administration Handbook – How to become an effective Big Data administrator of large Hadoop clusters – #bookreview

 

 

Cloudera Administration Handbook

Rohit Menon

(Packt Publishing – Kindle, paperback)

The explosive growth and use of Big Data in business, government, science, and other arenas have fueled a strong demand for new Hadoop administrators. An administrator’s key duty is to set up and maintain the Hadoop clusters that help process and analyze massive amounts of information.

New Hadoop administrators, and those looking to join their ranks, will especially want to consider the Cloudera Administration Handbook by Rohit Menon. This is a well-organized, well-written, and solidly illustrated guide to building and maintaining large Apache Hadoop clusters using Cloudera Manager and CDH5.

The author has an extensive computer science background and is a Cloudera Certified Apache Hadoop Developer. He notes that “Cloudera Inc., is a Palo Alto-based American enterprise software company that provides Apache Hadoop-based software, support and services, and training to data-driven enterprises. It is often referred to as the commercial Hadoop company.”

CDH, Menon points out, is the easy shorthand name for a rather awkward software title: “Cloudera’s Distribution Including Apache Hadoop.” CDH is “an enterprise-level distribution including Apache Hadoop and several components of its ecosystem such as Apache Hive, Apache Avro, HBase, and many more. CDH is 100 percent open source,” Menon writes.

The Cloudera Manager, meanwhile, “is a web-browser-based administration tool to manage Apache Hadoop clusters. It is the centralized command center to operate the entire cluster from a single interface. Using Cloudera Manager, the administrator gets visibility for each and every component in the cluster.”

The Cloudera Manager is not explored until nearly halfway into the book, and some readers may wish it had been explained sooner, since they may be trying to learn it on day one of a new job. However, Menon wants readers first to become familiar with “all the steps and operations needed to set up a cluster via the command line” at a terminal. These are, of course, important skills for becoming an effective, knowledgeable, and versatile Hadoop administrator. (You may not always have access to Cloudera Manager while setting up or troubleshooting a cluster.)

The book’s nine chapters show its well-focused range:

  • Chapter 1: Getting Started with Apache Hadoop
  • Chapter 2: HDFS and MapReduce
  • Chapter 3: Cloudera’s Distribution Including Apache Hadoop
  • Chapter 4: Exploring HDFS Federation and Its High Availability
  • Chapter 5: Using Cloudera Manager
  • Chapter 6: Implementing Security Using Kerberos
  • Chapter 7: Managing an Apache Hadoop Cluster
  • Chapter 8: Cluster Monitoring Using Events and Alerts
  • Chapter 9: Configuring Backups

You will have to bring some hardware and software experience and skills to the table, of course. Apache Hadoop runs primarily on Linux. “So having good Linux skills such as monitoring, troubleshooting, configuration, and security is a must” for a Hadoop administrator, Menon points out. Another requirement is being able to work comfortably with the Java Virtual Machine (JVM) and to understand Java exceptions.
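
For readers who have not yet spent much time reading JVM stack traces, here is a minimal, hypothetical sketch of the kind of exception handling a Hadoop administrator runs into; the class name and the /user/hdfs path are invented for illustration, and the HDFS connection details would come from your own cluster’s configuration files.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical check an administrator might script: confirm that a
    // directory exists in HDFS, and print the underlying Java stack trace
    // if the filesystem cannot be reached.
    public class HdfsHealthCheck {
        public static void main(String[] args) {
            try {
                Configuration conf = new Configuration(); // loads core-site.xml (and friends) from the classpath
                FileSystem fs = FileSystem.get(conf);     // connects to the cluster's default filesystem
                Path dir = new Path("/user/hdfs");        // example path; adjust for your cluster
                System.out.println(dir + " exists: " + fs.exists(dir));
            } catch (IOException e) {
                // Being able to read a stack trace like this one is the
                // "understanding Java exceptions" skill Menon refers to.
                e.printStackTrace();
                System.exit(1);
            }
        }
    }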

But those skills and his Cloudera Administration Handbook can take you from “the very basics of Hadoop” to taking up “the responsibilities of a Hadoop administrator and…managing huge Hadoop clusters.”

Si Dunn


Making Sense of NoSQL – A balanced, well-written overview – #bigdata #bookreview

Making Sense of NoSQL

A Guide for Managers and the Rest of Us
Dan McCreary and Ann Kelly
(Manning, paperback)

This is NOT a how-to guide for learning to use NoSQL software and build NoSQL databases. It is a meaty, well-structured overview aimed primarily at “technical managers, [software] architects, and developers.” However, it also is written to appeal to other, not-so-technical readers who are curious about NoSQL databases and where NoSQL could fit into the Big Data picture for their business, institution, or organization.

Making Sense of NoSQL definitely lives up to its subtitle: “A guide for managers and the rest of us.”

Many executives, managers, consultants and others today are dealing with expensive questions related to Big Data, primarily how it affects their current databases, database management systems, and the employees and contractors who maintain them. A variety of problems can fall upon those who operate and update big relational (SQL) databases and their huge arrays of servers pieced together over years or decades.

The authors, Dan McCreary and Ann Kelly, are strong proponents, obviously, of the NoSQL approach. It offers, they note, “many ways to allow you to grow your database without ever having to shut down your servers.” However, they also realize that NoSQL may not be a good, or even affordable, choice in many situations. Indeed, a blending of SQL and NoSQL systems may be a better choice. Or, making changes from SQL to NoSQL may not be financially feasible at all. So they have structured their book into four parts that attempt to help readers “objectively evaluate SQL and NoSQL database systems to see which business problems they solve.”

Part 1 provides an overview of NoSQL, its history, and its potential business benefits. Part 2 focuses on “database patterns,” including “legacy database patterns (which most solution architects are familiar with), NoSQL patterns, and native XML databases.” Part 3 examines “how NoSQL solutions solve the real-world business problems of big data, search, high availability, and agility.” And Part 4 looks at “two advanced topics associated with NoSQL: functional programming and system security.”

McCreary and Kelly observe that “[t]he transition to functional programming requires a paradigm shift away from software designed to control state and toward software that has a focus on independent data transformation.” (Erlang, Scala, and F# are some of the functional languages that they highlight.) And, they contend: “It’s no longer sufficient to design a system that will scale to 2, 4, or 8 core processors. You need to ask if your architecture will scale to 100, 1,000, or even 10,000 processors.”
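
The functional languages they highlight make that shift explicit, but the idea can be sketched in a few lines of any modern language. The snippet below is my own rough illustration, not an example from the book: each record is transformed independently, with no shared mutable state, so the same code can be spread across however many cores the runtime has available.

    import java.util.Arrays;
    import java.util.List;

    // Rough illustration (not from the book) of independent data transformation:
    // because no element depends on another and nothing mutates shared state,
    // the runtime is free to split the work across many cores.
    public class IndependentTransforms {
        public static void main(String[] args) {
            List<String> records = Arrays.asList("10", "20", "30", "40");

            int sumOfSquares = records.parallelStream()   // process elements in parallel
                                      .mapToInt(Integer::parseInt)
                                      .map(n -> n * n)    // transform each value on its own
                                      .sum();             // combine the independent results

            System.out.println("Sum of squares: " + sumOfSquares); // prints 3000
        }
    }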

Meanwhile, various security challenges can arise as a NoSQL database “becomes popular and is used by multiple projects” across “department trust boundaries.”

Computer science students, software developers, and others who are trying to stay knowledgeable about Big Data technology and issues should also consider reading this well-written book.

Si Dunn

The Practice of Network Security Monitoring – You’re compromised, so deal with it. #security #bookreview

The Practice of Network Security Monitoring

Understanding Incident Detection and Response
Richard Bejtlich
(No Starch Press – paperback, Kindle)

Security expert Richard Bejtlich’s focus in his new book is not on “the planning and defense phases of the security cycle.” Instead, he emphasizes how to handle “systems that are already compromised or that are on the verge of being compromised.”

His well-organized, well-written, 341-page book aims to help you “start detecting and responding to digital intrusions using network-centric operations, tools, and techniques.”

Bejtlich has long emphasized a “detection-centered philosophy” built around a straightforward central tenet: “Prevention eventually fails.” No matter how many digital walls and moats you build around your network, someone will find a way to tunnel in, parachute in, or sneak in via an unsuspecting employee’s $9.95 thumb drive.

“It’s becoming smarter,” he writes, “to operate as though your enterprise is always compromised. Incident response is no longer an infrequent, ad-hoc affair. Rather, incident response should be a continuous business process with defined metrics and objectives.”

You may recognize some of Bejtlich’s previous books on network security monitoring (NSM): The Tao of Network Security Monitoring; Extrusion Detection; and Real Digital Forensics.

The Practice of Network Security Monitoring is tailored toward two key audiences: (1) security professionals who have little or no experience with NSM; and (2) “more senior incident handlers, architects, and engineers who need to teach NSM to managers, junior analysts, or others who may be technically less adept.”

Readers, he adds, should understand “the basic use of the Linux and Windows operating systems, TCP/IP networking, and the essentials of network attack and defense.”

The examples in Bejtlich’s book rely on open source and vendor-neutral tools, primarily from Doug Burks’ Security Onion (SO) distribution.

The 13-chapter book is organized into four parts:

  • Part I: Getting Started – Introduces NSM and sensor placement issues.
  • Part II: Security Onion Deployment – Shows how to install and configure SO.
  • Part III: Tools – Examines the “key software shipped with SO and how to use these applications.”
  • Part IV: NSM in Action – Looks at “how to use NSM processes and data to detect and respond to intrusions.”

Following the technical chapters, Bejtlich offers some concluding thoughts on network security management, cloud computing, and establishing an effective workflow for NSM. “NSM isn’t just about tools,” he writes. “NSM is an operation, and that concept implies workflow, metrics, and collaboration. A workflow establishes a series of steps that an analyst follows to perform the detection and response mission. Metrics, like the classification and count of incidents and time elapsed from incident detection to containment, measure the effectiveness of the workflow. Collaboration enables analysts to work smarter and faster.”
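
To make one of those metrics concrete, here is a small, hypothetical sketch (mine, not Bejtlich’s) that computes the average time from detection to containment; the incidents and timestamps are invented for the example.

    import java.time.Duration;
    import java.time.Instant;

    // Hypothetical sketch: compute one NSM workflow metric -- the average
    // time elapsed from incident detection to containment.
    public class ContainmentMetric {
        public static void main(String[] args) {
            // Detection and containment timestamps for two invented incidents.
            Instant[] detected  = { Instant.parse("2013-08-01T10:00:00Z"),
                                    Instant.parse("2013-08-03T09:15:00Z") };
            Instant[] contained = { Instant.parse("2013-08-01T14:30:00Z"),
                                    Instant.parse("2013-08-03T11:00:00Z") };

            long totalMinutes = 0;
            for (int i = 0; i < detected.length; i++) {
                totalMinutes += Duration.between(detected[i], contained[i]).toMinutes();
            }
            double averageMinutes = (double) totalMinutes / detected.length;

            System.out.printf("Average detection-to-containment: %.1f minutes%n", averageMinutes); // 187.5 minutes
        }
    }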

He also observes: “It is possible to defeat adversaries if we stop them before they accomplish their mission. As it has been since the early 1990s, NSM will continue to be a powerful, cost-effective way to counter intruders.”

Si Dunn

Absolute OpenBSD: Unix for the Practical Paranoid, 2nd Edition – A good & long-overdue update – #bookreview

Absolute OpenBSD, 2nd Edition
Unix for the Practical Paranoid
Michael W. Lucas
(No Starch Press – Kindle, paperback)

This updated new edition likely will be hailed — and rightly so — as a major event by many dedicated users of OpenBSD. After all, the first edition of Michael W. Lucas’ book was published a full decade ago, back when, the author concedes, he still had hair.

OpenBSD’s founder and long-time administrator Theo de Raadt has called this new edition both “[t]he definitive book on OpenBSD” and “a long-overdue refresh.” The praise can’t get much higher in OpenBSD-land.

OpenBSD is a highly secure, Unix-like operating system frequently used in Domain Name System (DNS) servers, routers, and firewalls. It also can run on a wide array of computer hardware, ranging from new systems to old VAXes, 386 machines, Apple’s PowerPC Macintoshes, and most products from Sun.

“Old systems can run OpenBSD quite well,” Lucas notes. “I’ve run OpenBSD/i386 quite nicely on a 166 MHz processor with 128MB of memory. You probably have some old system lying around that’s perfectly adequate for learning OpenBSD.”

Indeed, he explains, “As a matter of legacy, OpenBSD will run on hardware that has been obsolete for decades because the hardware was in popular use when OpenBSD started, and the developers try to maintain compatibility and performance when possible.”

The OpenBSD software has an intriguing and complex history that involves the 1980s breakup of AT&T, lots of lawsuits, the Berkeley Software Distribution (BSD) project, the University of California, and the eventual emergence of the “BSD license.” The result was “perhaps the freest of the free operating systems,” Lucas says.

Today, Lucas emphasizes, “OpenBSD strives to be the most secure operating system in the world.” OpenBSD developers constantly work to try to “eliminate [security] problems before they exist,” he states.

“OpenBSD is a gift. You’re free to use it or not. As with any gift, you can do whatever you want with it. But you’re not free to bug the developers for features or support.”

His 491-page second edition offers a heavy dose (23 chapters) of how-to instructions. And readers are encouraged to read OpenBSD’s man (manual) pages online. In a book where the first chapter is titled “Getting Additional Help” and the second is titled “Installation Preparations,” you can guess that this is not aimed at absolute newcomers. Actually, Lucas says: “This book is written for experienced Unix users or system administrators who want to add OpenBSD to their repertoire.”

Still, if you want to learn and use OpenBSD, you will need this book — plus the online documentation and, very likely, some advice from the OpenBSD community. There don’t seem to be any recent introduction-level books floating around, although a few tutorial sites exist, and OpenBSD.org maintains a list of support and consulting specialists. Training also is available from a number of companies that can be found via the Web.

If you want to use OpenBSD but not spend much time learning it, you also can purchase a support contract and let someone else set up and maintain your system. Even then, you likely will want to keep this new edition of Absolute OpenBSD handy for reference, and for learning, just in case you change your mind down the line.

Si Dunn

Hadoop is hot! Three new how-to books for riding the Big Data elephant – #programming #bookreview

In the world of Big Data, Hadoop has become the hard-charging elephant in the room.

Its big-name users now span the alphabet and include such notables as Amazon, eBay, Facebook, Google, the New York Times, and Yahoo. Not bad for software named after a child’s toy elephant.

Computer systems that run Hadoop can store, process, and analyze large amounts of data that have been gathered up in many different formats from many different sources.

According to the Apache Software Foundation’s Hadoop website: “The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.”

The (well-trained) user defines the Big Data problem that Hadoop will tackle. Then the software handles all aspects of job completion, including splitting the problem into small pieces and spreading them across many different computers, or nodes, in the distributed system for more efficient processing. Hadoop also handles individual node failures, and it collects and combines the calculated results from each node.

But you don’t need a collection of hundreds or thousands of computers to run Hadoop. You can learn it, write programs, and do some testing and debugging on a single Linux machine, Windows PC, or Mac. The open source software can be downloaded from the Apache Hadoop website. (Do some research first. You may have to use web searches to find detailed installation instructions for your specific system.)

Hadoop is open-source software that is often described as “a Java-based framework for large-scale data processing.” It has a lengthy learning curve that includes getting familiar with Java, if you don’t already know it.

But if you are now ready and eager to take on Hadoop, Packt Publishing recently has unveiled three excellent how-to books that can help you begin and extend your mastery: Hadoop Beginner’s Guide, Hadoop MapReduce Cookbook, and Hadoop Real-World Solutions Cookbook.

Short reviews of each are presented below.

Hadoop Beginner’s Guide
Garry Turkington
(Packt Publishing – paperback, Kindle)

Garry Turkington’s new book is a detailed, well-structured introduction to Hadoop. It covers everything from the software’s three modes (local standalone, pseudo-distributed, and fully distributed) to running basic jobs, developing simple and advanced MapReduce programs, maintaining clusters of computers, and working with Hive, MySQL, and other tools.

“The developer focuses on expressing the transformation between source and result data sets, and the Hadoop framework manages all aspects of job execution, parallelization, and coordination,” the author writes.

He calls this capability “possibly the most important aspect of Hadoop. The platform takes responsibility for every aspect of executing the processing across the data. After the user defines the key criteria for the job, everything else becomes the responsibility of the system.”
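
To see what that looks like in practice, here is a minimal sketch in the spirit of the classic Hadoop word-count example (general Hadoop usage, not code taken from Turkington’s book): the developer writes a mapper, a reducer, and a short driver, and the framework takes care of distributing, running, and retrying the tasks.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Minimal word-count job: the developer supplies only the map and reduce
    // transformations plus a short driver; Hadoop schedules the tasks,
    // parallelizes them across the cluster, and copes with node failures.
    public class WordCount {

        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);            // emit (word, 1) for each token
                }
            }
        }

        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();                     // add up the 1s for this word
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        // Driver: declare the pieces of the job; the framework does the rest.
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not already exist)
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }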

The 374-page book is well written and provides numerous code samples and illustrations. But it has one drawback for some beginners who want to install and use Hadoop. Turkington offers step-by-step instructions for performing a Linux installation, specifically on Ubuntu. However, he refers Windows and Mac users to an Apache site that offers insufficient how-to information, so web searches become necessary to find more installation details.

Hadoop MapReduce Cookbook
Srinath Perera and Thilina Gunarathne
(Packt Publishing – paperback, Kindle)

MapReduce “jobs” are an essential part of how Hadoop is able to crunch huge chunks of Big Data. The Hadoop MapReduce Cookbook offers “recipes for analyzing large and complex data sets with Hadoop MapReduce.”

MapReduce is a well-known programming model for processing large sets of data. Typically, MapReduce is used within clusters of computers that are configured to perform distributed computing.

In the “Map” portion of the process, a problem is split into many subtasks that are then assigned by a master computer to individual worker computers known as nodes. During the “Reduce” part of the task, the master computer gathers up the processed data from the nodes, combines it, and outputs a response to the problem that was posed. (MapReduce libraries are now available for many different programming languages.)
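
If you would rather see the paradigm stripped of any Hadoop machinery, here is a toy, single-machine illustration of mine (not from the book): the “map” step turns each input line into (word, 1) pairs, and the “reduce” step combines the pairs that share a key.

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    // Toy illustration of the MapReduce idea on one machine: map each line to
    // (word, 1) pairs, then reduce by summing the pairs that share a key.
    public class MapReduceOnOneMachine {
        public static void main(String[] args) {
            List<String> lines = Arrays.asList("big data", "big clusters", "data data");

            Map<String, Integer> counts = new TreeMap<>();
            for (String line : lines) {                   // "map": each line is handled independently
                for (String word : line.split("\\s+")) {
                    counts.merge(word, 1, Integer::sum);  // "reduce": combine values for the same key
                }
            }

            System.out.println(counts); // prints {big=2, clusters=1, data=3}
        }
    }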

“Hadoop is the most widely known and widely used implementation of the MapReduce paradigm,” the two authors note.

Their 284-page book initially shows how to run Hadoop in local mode, which “does not start any servers but does all the work within the same JVM [Java Virtual Machine]” on a standalone computer. Then, as you gain more experience with MapReduce and the Hadoop Distributed File System (HDFS), they guide you into using Hadoop in more complex, distributed-computing environments.

Echoing the Hadoop Beginner’s Guide, the authors explain how to install Hadoop on Linux machines only.

Hadoop Real-World Solutions Cookbook
Jonathan R. Owens, Jon Lentz and Brian Femiano
(Packt Publishing – paperback, Kindle)

The Hadoop Real-World Solutions Cookbook assumes you already have some experience with Hadoop. So it jumps straight into helping “developers become more comfortable with, and proficient at solving problems in, the Hadoop space.”

Its goal is to “teach readers how to build solutions using tools such as Apache Hive, Pig, MapReduce, Mahout, Giraph, HDFS, Accumulo, Redis, and Ganglia.”

The 299-page book is packed with code examples and short explanations that help solve specific types of problems. A few randomly selected problem headings:

  • “Using Apache Pig to filter bot traffic from web server logs.”
  • “Using the distributed cache in MapReduce.”
  • “Trim Outliers from the Audioscrobbler dataset using Pig and datafu.” 
  • “Designing a row key to store geographic events in Accumulo.”
  • “Enabling MapReduce jobs to skip bad records.”

The authors use a simple but effective strategy for presenting problems and solutions. First, the problem is clearly described. Then, under a “Getting Ready” heading, they spell out what you need to solve the problem. That is followed by a “How to do it…” heading where each step is presented and supported by code examples. Then, paragraphs beneath a “How it works…” heading sum up and explain how the problem was solved. Finally, a “There’s more…” heading highlights more explanations and links to additional details.

If you are a Hadoop beginner, consider the first two books reviewed above. If you have some Hadoop experience, you likely can find some useful tips in book number three.

Si Dunn

Getting Started with Mule Cloud Connect – To help sort out the chaos of Internet services – #bookreview

Getting Started with Mule Cloud Connect
Ryan Carter
(O’Reilly – paperback, Kindle)

In a digital world increasingly cluttered with Software-as-a-Service (SaaS) platforms, Open APIs, and social networks, complexity quickly can get out of hand.

“It all starts,” Ryan Carter writes in his new book, “with a simple API that publishes somebody’s status to Facebook, sends a Tweet, or updates a contact in Salesforce. As you start to integrate more and more of these external services with your applications, trying to identify the tasks that one might want to perform when you’re surrounded by SOAP, REST, JSON, XML, GETs, PUTs, POSTs, and DELETEs, can be a real challenge.”

Indeed. But never fear, Mule ESB can ride to your rescue and connect you quickly and easily to the cloud. At least, that’s the marketing claim.

Some truly big-name users, it should be noted, are adding credibility to Mule’s claimed capabilities and usefulness as an Open Source integration platform. They include Adobe, eBay, Hewlett-Packard, J.P. Morgan, T-Mobile, Ericsson, Southwest Airlines, and Nestle, to mention just a few.

Meanwhile, riding Mule to the cloud is the central focus of this compact (105 pages), well-written get-started guide. Its author, Ryan Carter, is both a specialist in integration and APIs and “an appointed Mule champion” who contributes regularly to the MuleSoft community.

“Mule,” Carter points out, “is an integration platform that allows developers to connect applications together quickly and easily, enabling them to exchange data regardless of the different technologies that the applications use. It is also at the core of CloudHub, an Integration Platform as a Service (iPaaS). CloudHub allows you to integrate cross-cloud services, create new APIs on top of existing data sources, and integrate on-premise applications with cloud services.”

The book is structured so you start off by building a simple Mule application that will serve “as the base of our examples and introduce some core concepts for those unfamiliar with Mule.” Then Carter shows and illustrates how to “start taking advantage of Mule Cloud Connectors.” He includes numerous code examples, plus some screenshots and diagrams.

The book’s six chapters are:

  1. Getting Started
  2. Cloud Connectors
  3. OAuth Connectivity
  4. Configuration Management
  5. Real-Time Connectivity
  6. Custom Connectivity

Carter emphasizes: “Mule Cloud Connect offers a more maintainable way to work with APIs. Built on top of the Mule and CloudHub integration platforms, Cloud Connectors are service-specific clients that abstract away the complexities of transports and protocols. Many complex but common processes such as authorization and session management work without you having to write a single line of code. Although service-specific, Cloud Connectors all share a common and consistent interface to configure typical API tasks such as OAuth, WebHooks, and connection management. They remove the pain from working with multiple, individual client libraries.”

If Mule does not have a connector for a resource that you need, the book shows you how to create your own.

Getting Started with Mule Cloud Connect can get you started on a beneficial ride of discovery, and it can take you onto the trail that leads to solutions.

Si Dunn

All for Search and Search for All: 3 New Books for Putting Search to Work – #bookreview

Seek and ye shall find.

That’s the theory behind the still-debated benefits of digging through Big Data to uncover new, overlooked, or forgotten paths to greater profits and greater understanding.

Big Data, however, is here to stay (and get bigger). And search is what we do to find and extract useful nuggets and diamonds and nickels and dimes of information.

O’Reilly Media recently has published three new, enlightening books focused on the processes, application, and management of search: Enterprise Search by Martin White, Mastering Search Analytics by Brent Chaters, and Search Patterns by Peter Morville and Jeffery Callender.

Here are short looks at each.

Enterprise Search
Martin White
(O’Reilly, paperback, Kindle)

Start with this book if you’re just beginning to explore what focused search efforts and search technology may be able to do for your company.

The book’s key goal is “to help business managers, and the IT teams supporting them, understand why effective enterprise-wide search is essential in any organization, and how to go about the process of meeting user requirements.”

You may think, So what’s the big deal? Just put somebody in a cubicle and pay them to use Google, Bing, and a few other search engines to find stuff.

Search involves much more than that. Even small businesses now have large quantities of potentially profitable information stored internally in documents, emails, spreadsheets and other formats. And large corporations are awash in data that can be mined for trends, warnings, new opportunities, new product or service ideas, and new market possibilities, to name just a few.

The goal of Enterprise Search is to help you set up a managed search environment that benefits your business but also enables employees to use search technology to help them do their jobs more efficiently and productively.

Yet, putting search technology within every worker’s reach is not the complete answer, author Martin White emphasizes.

“The reason for the well-documented lack of satisfaction with a search application,” he writes, “is that organizations invest in technology but not staff with the expertise and experience to gain the best possible return on the investment….”

Enterprise Search explains how to determine your firm’s search needs and how to create an effective search support team that can meet the needs of employees, management, and customers.

Curiously, White waits until his final chapter to list 12 “critical success factors” for getting the most from enterprise-wide search capabilities.

Perhaps, in a future edition, this important list will be positioned closer to the front of the book.

Mastering Search Analytics
Brent Chaters
(O’Reilly – paperback, Kindle)

This in-depth and well-illustrated guide details how a unified, focused search strategy can generate greater traffic for your website, increase conversion rates, and bring in more revenue.

Brent Chaters explains how to use search engine optimization (SEO) and paid search as part of an effective, comprehensive approach.

Key to Chaters’ strategy is the importance of bringing together the efforts and expertise of both the SEO specialists and the Search Engine Marketing (SEM) specialists — two groups that often battle each other for supremacy within corporate settings.

“A well-defined search program should utilize both SEO and SEM tactics to provide maximum coverage and exposure to the right person at the right time, to maximize your revenue,” Chaters contends. “I do not believe that SEO and SEM should be optimized from each other; in fact, there should be open sharing and examination of your overall search strategy.”

His book is aimed at three audiences: “the search specialist, the marketer, and the executive” – particularly executives who are in charge of search campaigns and search teams.

If you are a search specialist, the author expects that “you understand the basics of SEO, SEM, and site search (meaning you understand how to set up a paid search campaign, you understand that organic search cannot be bought, and you understand how your site search operates and works.)”

Search Patterns
Peter Morville and Jeffery Callender
(O’Reilly – paperback, Kindle)

“Search applications demand an obsessive attention to detail,” the two authors of this fine book point out. “Simple, fast, and relevant don’t come easy.”

Indeed, they add, “Search is not a solved problem,” but remains, instead, “a wicked problem of terrific consequence. As the choice of first resort for many users and tasks, search is the defining element of the user experience. It changes the way we find everything…it shapes how we learn and what we believe. It informs and influences our decisions, and it flows into every nook and cranny….Search is among the biggest, baddest, most disruptive innovations around. It’s a source of entrepreneurial insight, competitive advantage, and impossible wealth.”

They emphasize: “Unfortunately, it’s also the source of endless frustration. Search is the worst usability problem on the Web….We find too many results or too few, and most regular folks don’t know where to search, or how….business goals are disrupted by failures in findability…” [and] “Mobile search is a mess.”

Ouch!

Colorfully illustrated and well-written, Search Patterns is centered around major aspects in the design of user interfaces for search and discovery. It is aimed at “designers, information architects, students, entrepreneurs, and anyone who cares about the future of search.”

It covers the key bases, “from precision, recall, and relevance to autosuggestion and faceted navigation.” It looks at how search may be reshaped in the future. And, very importantly, it also joins the growing calls for collaboration across disciplines and “tearing down walls to make search better….”
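
For readers new to those first two terms, precision and recall are just ratios computed against a set of results; the sketch below is an illustrative, hypothetical example (not from the book) for a single query whose truly relevant documents are known in advance.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // Illustrative only: precision and recall for one query, given the set of
    // documents a search engine returned and the set that is actually relevant.
    public class PrecisionRecall {
        public static void main(String[] args) {
            Set<String> retrieved = new HashSet<>(Arrays.asList("d1", "d2", "d3", "d4"));
            Set<String> relevant  = new HashSet<>(Arrays.asList("d2", "d4", "d7"));

            Set<String> hits = new HashSet<>(retrieved);
            hits.retainAll(relevant);                     // documents that are both retrieved and relevant

            double precision = (double) hits.size() / retrieved.size(); // 2 / 4 = 0.50
            double recall    = (double) hits.size() / relevant.size();  // 2 / 3 ≈ 0.67

            System.out.printf("precision=%.2f recall=%.2f%n", precision, recall);
        }
    }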

Si Dunn