Hadoop: The Definitive Guide, Third Edition – Big Tools for Big Data – #programming #bookreview

Hadoop: The Definitive Guide, Third Edition
Tom White
(O’Reilly, paperback, list price $49.99; Kindle edition, list price $39.99)

“The good news is that Big Data is here,” Tom White writes in this revised and updated third edition of Hadoop’s “definitive guide.” But: “The bad news is that we are struggling to store and analyze it.”

Indeed, Big Data is now being measured in zettabytes, which is “equivalently one thousand exabytes, one million petabytes, or one billion terabytes,” White says. And all of us are creating, storing and trying to benefit from expanding amounts of data each day.

Enter Hadoop, “a reliable shared storage and analysis system. The storage is provided by HDFS [the Hadoop Distributed File System] and the analysis by MapReduce. There are other parts to Hadoop,” White emphasizes, “but these capabilities are its kernel.”

Hadoop (it’s not an acronym; simply the name of a child’s toy elephant) is a complex software framework, not a programming language. But, White says: “Stripped to its core, the tools that Hadoop provides for building distributed systems—for data storage, data analysis, and coordination—are simple. If there’s a common theme, it’s about raising the level of abstraction—to create building blocks for programmers who just happen to have lots of data to store, or lots of data to analyze, or lots of machines to coordinate, and who don’t have the time, the skill, or the inclination to become distributed systems experts to build the infrastructure to handle it.”

This new edition covers recent changes and additions to Hadoop, including the MapReduce API and new MapReduce 2 runtime, “which is built on a new distributed resource management system called YARN.” Several chapters related to MapReduce and other topics also have been added or expanded.

Hadoop can run MapReduce programs written in a variety of languages, including Java, Ruby, Python, and C++. And: “MapReduce programs are inherently parallel, thus putting very large-scale data analysis into the hands of anyone with enough machines at her disposal.” Hadoop, meanwhile, handles the work of distributing those programs across the machines in a cluster.
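For readers who have not seen the MapReduce model before, here is a minimal sketch of the idea in Python. It is not taken from White’s book and it does not use Hadoop’s API; it simply imitates the map, shuffle, and reduce steps inside a single process, using the classic word-count example. The function names (map_words, shuffle, reduce_counts, word_count) are my own, chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group every emitted value under its key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in one process (no cluster involved)."""
    pairs = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big tools for big data"]))
```

Each call to map_words depends only on its own line, and each call to reduce_counts depends only on its own key. That independence is what makes this style of program easy to spread over many machines; in this sketch everything simply runs in sequence on one.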

Hadoop increasingly is being employed by companies and organizations that must deal with processing, analyzing, and storing very large amounts of data. White’s book includes some case studies that explain Hadoop’s role in solving several Big Data challenges.

Hadoop: The Definitive Guide, Third Edition is not a beginner’s how-to book. But it’s definitely recommended for “programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.”

Si Dunn
