Optimizing Hadoop for MapReduce – A practical guide to lowering some costs of mining Big Data – #bookreview

Optimizing Hadoop for MapReduce

Learn how to configure your Hadoop cluster to run optimal MapReduce jobs

Khaled Tannir

(Packt Publishing, paperback, Kindle)

Time is money, as the old saying goes. And that saying especially applies to the world of Big Data, where much time, computing power and cash can be consumed while trying to extract profitable information from mountains of data.

This short, well-focused book by veteran software developer Khalid Tannir describes how to achieve a very important, money-saving goal: improve the efficiency of MapReduce jobs that are run with Hadoop.

As Tannir explains in his preface:

“MapReduce is an important parallel processing model for large-scale, data-intensive applications such as data mining and web indexing. Hadoop, an open source implementation of MapReduce, is widely applied to support cluster computing jobs that require low response time.

“Most of the MapReduce programs are written for data analysis and they usually take a long time to finish. Many companies are embracing Hadoop for advanced data analytics over large datasets that require time completion guarantees.

“Efficiency, especially the I/O costs of MapReduce, still needs to be addressed for successful implications. The experience shows that a misconfigured Hadoop cluster can noticeably reduce and significantly downgrade the performance of MapReduce jobs.”

Tannir’s well-focused, seven-chapter book zeroes in on how to find and fix misconfigured Hadoop clusters and numerous other problems. But first, he explains how Hadoop parameters are configured and how MapReduce metrics are monitored.

Two chapters are devoted to learning how to identify system bottlenecks , including CPU bottlenecks, storage bottlenecks, and network bandwidth bottlenecks.

One chapter examines how to properly identify resource weaknesses, particularly in Hadoop clusters. Then, as the book shifts strongly to solutions, Tannir explains how to reconfigure Hadoop clusters for greater efficiency.

Indeed, the final three chapters deliver details and steps that can help you improve how well Hadoop and MapReduce work together in your setting.

For example, the author explains how to make the map and reduce functions operate more efficiently, how to work with small or unsplittable files, how to deal with spilled records (those written to local disk when the allocated memory buffer is full), and ways to tune map and reduce parameters to improve performance.

“Most MapReduce programs are written for data analysis and they usually take a lot of time to finish,” Tannir emphasizes. However: “Many companies are embracing Hadoop for advanced data analytics over large datasets that require completion-time guarantees.” And that means “[e]fficiency, especially the I/O costs of MapReduce, still need(s) to be addressed for successful implications.”

He describes how to use compression, Combiners, the correct Writable types, and quick reuse of types to help improve memory management and the speed of job execution.

And, along with other tips, Tannir presents several “best practices” to help manage Hadoop clusters and make them do their work quicker and with fewer demands on hardware and software resources. 

Tannir notes that “setting up a Hadoop cluster is basically the challenge of combining the requirements of high availability, load balancing, and the individual requirements of the services you aim to get from your cluster servers.”

If you work with Hadoop and MapReduce or are now learning how to help install, maintain or administer Hadoop clusters, you can find helpful information and many useful tips in Khaled Tannir’s Optimizing Hadoop for Map Reduce.

Si Dunn

Hadoop is hot! Three new how-to books for riding the Big Data elephant – #programming #bookreview

In the world of Big Data, Hadoop has become the hard-charging elephant in the room.

Its big-name users now span the alphabet and include such notables as Amazon, eBay, Facebook, Google, the New York Times, and Yahoo. Not bad for software named after a child’s toy elephant.

Computer systems that run Hadoop can store, process, and analyze large amounts of data that have been gathered up in many different formats from many different sources.

According to the Apache Software Foundation’s Hadoop website: “The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.”

The (well-trained) user defines the Big Data problem that Hadoop will tackle. Then the software handles all aspects of the job completion, including spreading out the problem in small pieces to many different computers, or nodes, in the distributed system for more efficient processing. Hadoop also handles individual node failures, and collects and combines the calculated results from each node.

But you don’t need a collection of hundreds or thousands of computers to run Hadoop. You can learn it, write programs, and do some testing and debugging on a single Linux machine, Windows PC or Mac. The Open Source software can be downloaded here. (Do some research first. You may have use web searches to find detailed installation instructions for your specific system.)

Hadoop is open-source software that is often described as “a Java-based framework for large-scale data processing.” It has a lengthy learning curve that includes getting familiar with Java, if you don’t already know it.

But if you are now ready and eager to take on Hadoop, Packt Publishing recently has unveiled three excellent how-to books that can help you begin and extend your mastery: Hadoop Beginner’s Guide, Hadoop MapReduce Cookbook, and Hadoop Real-World Solutions Cookbook.

Short reviews of each are presented below.

Hadoop Beginner’s Guide
Garry Turkington
(Packt Publishing – paperback, Kindle)

Garry Turkington’s new book is a detailed, well-structured introduction to Hadoop. It covers everything from the software’s three modes–local standalone mode, pseudo-distributed mode, and fully distributed mode–to running basic jobs, developing simple and advanced MapReduce programs, maintaining clusters of computers, and working with Hive, MySQL, and other tools.

“The developer focuses on expressing the transformation between source and result data sets, and the Hadoop framework manages all aspects of job execution, parallelization, and coordination,” the author writes.

He calls this capability “possibly the most important aspect of Hadoop. The platform takes responsibility for every aspect of executing the processing across the data. After the user defines the key criteria for the job, everything else becomes the responsibility of the system.”

The 374-page book is written well and provides numerous code samples and illustrations. But it  has one drawback for some beginners who want to install and  use Hadoop.  Turkington offers step-by-step instructions for how to perform a Linux installation, specifically Ubuntu. However, he refers Windows and Mac users to an Apache site where there is insufficient how-to information. Web searches become necessary to find more installation details.

Hadoop MapReduce Cookbook
Srinath Perera and Thilina Gunarathne
(Packt Publishing – paperback, Kindle)

MapReduce “jobs” are an essential part of  how Hadoop is able to crunch huge chunks of Big Data.  The Hadoop MapReduce Cookbook offers “recipes for analyzing large and complex data sets with Hadoop MapReduce.”

MapReduce is a well-known programming model for processing large sets of data. Typically, MapReduce is used within clusters of computers that are configured to perform distributed computing.

In the “Map” portion of the process, a problem is split into many subtasks that are then assigned by a master computer to individual computers known as nodes. (Nodes also can have sub-nodes). During the “Reduce” part of the task, the master computer gathers up the processed data from the nodes, combines it and outputs a response to the problem that was posed to be solved. (MapReduce libraries are now available for many different computer languages, including Hadoop.)

“Hadoop is the most widely known and widely used implementation of the MapReduce paradigm,” the two authors note.

Their 284-page book initially shows how to run Hadoop in local mode, which “does not start any servers but does all the work within the same JVM [Java Virtual Machine]” on a standalone computer. Then, as you gain more experience with MapReduce and the Hadoop Distributed File System (HDFS), they guide you into using Hadoop in more complex, distributed-computing environments.

Echoing the Hadoop Beginner’s Guide, the authors explain how to install Hadoop on Linux machines only.

Hadoop Real-World Solutions Cookbook
Jonathan R. Owens, Jon Lentz and Brian Femiano
(Packt Publishing – paperback, Kindle)

The Hadoop Real-World Solutions Cookbook assumes you already have some experience with Hadoop. So it jumps straight into helping “developers become more comfortable with, and proficient at solving problems in, the Hadoop space.”

Its goal is to “teach readers how to build solutions using tools such as Apache Hive, Pig, MapReduce, Mahout, Giraph, HDFS, Accumulo, Redis, and Ganglia.”

The 299-page book is packed with code examples and short explanations that help solve specific types of problems. A few randomly selected problem headings:

  • “Using Apache Pig to filter bot traffic from web server logs.”
  • “Using the distributed cache in MapReduce.”
  • “Trim Outliers from the Audioscrobbler dataset using Pig and datafu.” 
  • “Designing a row key to store geographic events in Accumulo.”
  • “Enabling MapReduce jobs to skip bad records.”

The authors use a simple but effective strategy for presenting problems and solutions. First, the problem is clearly described. Then, under a “Getting Ready” heading, they spell out what you need to  solve the problem. That is followed by a “How to do it…” heading where each step is presented and supported by code examples. Then, paragraphs beneath a “How it works…” heading sum up and explain how the problem was solved. Finally, a “There’s more…” heading highlights more explanations and links to additional details.

If you are a Hadoop beginner, consider the first two books reviewed above. If you have some Hadoop experience, you likely can find some useful tips in book number three

Si Dunn

MapReduce Design Patterns – For solving Big Data problems – #bookreview #programming #hadoop

MapReduce Design Patterns
Donald Miner and Adam Shook
(O’Reilly –
paperback, Kindle)

“MapReduce is a computing paradigm for processing data that resides on hundreds of computers,” the authors point out. It has been “popularized recently by Google, Hadoop, and many others.”

The MapReduce paradigm is “extraordinarily powerful, but does not provide a general solution to what many are calling ‘big data,” they add, “so while it works particularly well on some problems, some are more challenging.” The authors’ focus in their new book is on using MapReduce design patterns as “templates or general guides to solving problems.”

Their new book definitely can help solve some time-crunch problems for new MapReduce developers. It brings together a variety of solutions that have emerged over time in a patchwork of blogs, books, and research papers and explains them in detail, with code samples, illustrations, and cautions about potential pitfalls.

You can’t simply cut and paste solutions from the chapters. But the two writers do “hope that you will find a pattern to get you at least 90% of the way for just about all of your challenges.”

Six of the book’s eight chapters focus on specific types of design patterns:

  • Summarization Patterns
  • Filtering Patterns
  • Data Organization Patterns
  • Join Patterns
  • Metapatterns
  • Input and Output Patterns

“The MapReduce world is in a state similar to the object-oriented world before 1994,” the authors point out. “Patterns today are scattered across blogs, websites such as StackOverflow, deep inside other books, and inside very advanced technology teams at organizations across the world.”

They add that “[t]he intent of this book is not to provide some groundbreaking new ways to solve problems with MapReduce….” but to offer, instead, a collection of “patterns that have been developed by veterans in the field so they can be shared with everyone else.”

The book’s code samples are all written in Hadoop, and the two writers deal with the question of “why should we use Java MapReduce in Hadoop at all when we have options like Pig and Hive,” which reduce the need for MapReduce patterns.

There is “conceptual value,” they state, “in understanding the lower level workings of a system like MapReduce.” Furthermore, “Using Pig or Hive without understanding MapReduce can lead to some dangerous situations.” And, Pig and Hive cannot yet “tackle all of the problems in the ways that Java MapReduce can. This will surely change over time….”

If you are new to MapReduce, this useful and informative book from Donald Miner and Adam Shook can be the next best thing to having MapReduce experts at your side.

MapReduce Design Patterns can save you time and effort, steer you away from dead ends, and help give you solid understandings of the powerful MapReduce paradigm.

Si Dunn