Learn how to configure your Hadoop cluster to run optimal MapReduce jobs
Time is money, as the old saying goes. And that saying especially applies to the world of Big Data, where much time, computing power and cash can be consumed while trying to extract profitable information from mountains of data.
This short, well-focused book by veteran software developer Khalid Tannir describes how to achieve a very important, money-saving goal: improve the efficiency of MapReduce jobs that are run with Hadoop.
As Tannir explains in his preface:
“MapReduce is an important parallel processing model for large-scale, data-intensive applications such as data mining and web indexing. Hadoop, an open source implementation of MapReduce, is widely applied to support cluster computing jobs that require low response time.
“Most of the MapReduce programs are written for data analysis and they usually take a long time to finish. Many companies are embracing Hadoop for advanced data analytics over large datasets that require time completion guarantees.
“Efficiency, especially the I/O costs of MapReduce, still needs to be addressed for successful implications. The experience shows that a misconfigured Hadoop cluster can noticeably reduce and significantly downgrade the performance of MapReduce jobs.”
Tannir’s well-focused, seven-chapter book zeroes in on how to find and fix misconfigured Hadoop clusters and numerous other problems. But first, he explains how Hadoop parameters are configured and how MapReduce metrics are monitored.
Two chapters are devoted to learning how to identify system bottlenecks , including CPU bottlenecks, storage bottlenecks, and network bandwidth bottlenecks.
One chapter examines how to properly identify resource weaknesses, particularly in Hadoop clusters. Then, as the book shifts strongly to solutions, Tannir explains how to reconfigure Hadoop clusters for greater efficiency.
Indeed, the final three chapters deliver details and steps that can help you improve how well Hadoop and MapReduce work together in your setting.
For example, the author explains how to make the map and reduce functions operate more efficiently, how to work with small or unsplittable files, how to deal with spilled records (those written to local disk when the allocated memory buffer is full), and ways to tune map and reduce parameters to improve performance.
“Most MapReduce programs are written for data analysis and they usually take a lot of time to finish,” Tannir emphasizes. However: “Many companies are embracing Hadoop for advanced data analytics over large datasets that require completion-time guarantees.” And that means “[e]fficiency, especially the I/O costs of MapReduce, still need(s) to be addressed for successful implications.”
He describes how to use compression, Combiners, the correct Writable types, and quick reuse of types to help improve memory management and the speed of job execution.
And, along with other tips, Tannir presents several “best practices” to help manage Hadoop clusters and make them do their work quicker and with fewer demands on hardware and software resources.
Tannir notes that “setting up a Hadoop cluster is basically the challenge of combining the requirements of high availability, load balancing, and the individual requirements of the services you aim to get from your cluster servers.”
If you work with Hadoop and MapReduce or are now learning how to help install, maintain or administer Hadoop clusters, you can find helpful information and many useful tips in Khaled Tannir’s Optimizing Hadoop for Map Reduce.
— Si Dunn