Big Data is hothotHOT. And O’Reilly recently has added three new books of potential interest to Big Data workers, as well as those hoping to join their ranks.
Hadoop, Hive and–surprise!—Python are just a few of the hot tools you may encounter in the rapidly expanding sea of data now being gathered, explored, stored, and manipulated by companies, organizations, institutions, governments, and individuals around the planet. Here are the books:
(O’Reilly, paperback – Kindle)
“Companies are storing more data from more sources in more formats than ever before,” writes Eric Sammer, a Hadoop expert who is principal solution architect at Cloudera. But gathering and stockpiling data is only “one half of the equation,” he adds. “Processing that data to produce information is fundamental to the daily operations of every modern business.”
Enter Apache Hadoop, a “pragmatic, cost-effective, scalable infrastructure” that increasingly is being used to develop Big Data applications for storing and processing information.
“Made up of a distributed filesystem called the Hadoop Distributed Filesystem (HDFS) and a computation layer that implements a processing paradigm called MapReduce, Hadoop is an open source, batch data processing system for enormous amounts of data. We live in a flawed world, and Hadoop is designed to survive in it by not only tolerating hardware and software failures, but also treating them as first-class conditions that happen regularly.”
Sammer adds: “Hadoop uses a cluster of plain old commodity servers with no specialized hardware or network infrastructure to form a single, logical, storage and compute platform, or cluster, that can be shared by multiple individuals or groups. Computation in Hadoop MapReduce is performed in parallel, automatically, with a simple abstraction for developers that obviates complex synchronization and network programming. Unlike many other distributed data processing systems, Hadoop runs the user-provided processing logic on the machine where the data lives rather than dragging the data across the network; a huge win for performance.”
Sammer’s new, 282-page book is well written and focuses on running Hadoop in production, including planning its use, installing it, configuring the system and providing ongoing maintenance. He also shows “what works, as demonstrated in crucial deployments.”
If you’re new to Hadoop or still getting a handle on it, you need Hadoop Operations. And even if you’re now an “old” hand at Hadoop, you likely can learn new things from this book. “It’s an extremely exciting time to get into Apache Hadoop,” Sammer states.
Eric Capriolo, Dean Wampler, and Jason Rutherglen
(O’Reilly, paperback – Kindle)
“Hive,” the three authors point out, “provides an SQL dialect, called Hive Query Language (abbreviated HiveQL or just HQL), for querying data stored in a Hadoop cluster.”
They add: “Hive is most suited for data warehouse applications, where relatively static data is analyzed, fast response times are not required, and when data is not changing rapidly.”
Their well-structured and well-written book shows how to install and test Hadoop and Hive on a personal workstation – “a convenient way to learn and experiment with Hadoop.” Then it shows “how to configure Hive for use on Hadoop clusters.”
They also provide a brief overview of Hadoop and MapReduce before diving into Hive’s command-line interface (CLI) and introductory aspects such as how to embed lines of comments in Hive v0.80 and later.
From there, the book flows smoothly into HiveQL and how to use its SQL dialect to query, summarize, and analyze large datasets that Hadoop has stored in its distributed filesystem.
User documentation for Hive and Hadoop has been sparse, so Programming Hive definitely fills a solid need. Significantly, the final chapter presents several “Case Study Examples from the User Trenches” where real companies explain how they have used Hive to solve some very challenging problems involving Big Data.
Python for Data Analysis
(O’Reilly, paperback – Kindle)
No, Python is not the first language many people think of when picturing large data analysis projects. For one thing, it’s an interpreted language, so Python code runs a lot slower than code written in compiled programming languages such as C++ or Java.
Also, the author concedes, “Python is not an ideal language for highly concurrent, multithreaded applications, particularly applications with many CPU-bound threads.” The software’s global interpreter lock (GIL) “prevents the interpreter from executing more than one Python bytecode instruction at a time.”
Thus, Python will not soon be challenging Hadoop to a Big Data petabyte speed duel.
On the other hand, Python is reasonably easy to learn, and it has strong and widespread support within the scientific and academic communities, where a lot of data must get crunched at a reasonable clip, if not at blinding speed.
And Wes McKinney is the main author of pandas, Python’s increasingly popular open source library for data analysis. It (pandas) is “designed to make working with structured data fast, easy, and expressive.”
His book makes a good case for using Python in at least some Big Data situations. “In recent years,” he states, “Python’s improved library support (primarily pandas) has made it a strong alternative for data manipulation tasks. Combined with Python’s strength in general purpose programming, it is an excellent choice as a single language for building data-centric applications.”
Much of this well-written, well-illustrated book “focuses on high-performance array-based computing tools for working with large data sets.” It uses a case-study-examples approach to demonstrate how to tackle a wide range of data analysis problems, using Python libraries that include pandas, NumPy, matplotlib, and IPython, “the component in the standard scientific Python toolset that ties everything together.”
By the way, if you have never programmed in Python, check out the end of McKinney’s book. An appendix titled “Python Language Essentials” gives a good overview of the language, with a specific bias toward “processing and manipulating structured and unstructured data.”
If you do scientific, academic, or business computing and need to crunch and visualize a lot of data, definitely check out Python for Data Analysis.
You may be pleasantly surprised at how well and how easily Python and its data-analysis libraries can do the job.
– Si Dunn