Programming MapReduce with Scalding
A practical guide to designing, testing, and implementing complex MapReduce applications in Scala
Antonio Chalkiopoulos’s new book has three key goals, and it meets each of them in good, readable fashion.
It describes how MapReduce, Hadoop, and Scalding can work together. It suggests some useful design patterns and idioms. And it provides numerous code examples of “real implementations for common use cases.”
The book also briefly introduces the Scala programming language and the Cascading platform, two elements vital to the Scalding framework.
A few brief definitions are in order here.
According to a Wikipedia definition, MapReduce is both a programming model and “an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.”
Meanwhile, the Apache Hadoop website states: “Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.”
And: “The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware” and “is highly fault-tolerant….”
The Cascading.org website, in turn, promotes Cascading as “the proven application development platform for building data applications on Hadoop.” Scalding, it explains, is “an extension to Cascading that enables application development with Scala, a powerful language for solving functional problems.” Indeed, Scalding is “[a] Scala API for Cascading,” and it provides functionality from custom join algorithms to multiple APIs (Fields-based, Type-safe, Matrix) for developers to build robust data applications. Scalding is built and maintained by Twitter.
Scalding “makes MapReduce computations look very similar to Scala’s collection API. It’s also a wrapper for Cascading to simplify jobs, tests and data sources on HDFS or local disk.”
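To make that comparison concrete, here is a minimal word-count sketch written with nothing but Scala's standard collections. It is not taken from the book and it does not use Scalding itself; the object name WordCountSketch and its helper methods are hypothetical, chosen only for illustration. The point is the shape of the computation: a map step that emits (word, 1) pairs, followed by a group-and-sum step that plays the role of the reduce phase.

```scala
// Illustrative only: plain Scala collections, no Hadoop or Scalding dependency.
// All names here (WordCountSketch, mapPhase, reducePhase) are hypothetical.
object WordCountSketch {

  // "Map" step: split every line into lowercase words and emit (word, 1) pairs.
  def mapPhase(lines: Seq[String]): Seq[(String, Int)] =
    lines
      .flatMap(_.toLowerCase.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))

  // "Shuffle and reduce" step: group the pairs by word, then sum the counts.
  def reducePhase(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs
      .groupBy { case (word, _) => word }
      .map { case (word, group) => word -> group.map(_._2).sum }

  // The whole job is simply the two steps chained together.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(mapPhase(lines))

  def main(args: Array[String]): Unit = {
    val lines = Seq("hadoop and scalding", "scalding and scala")
    // Sort by word so the printed order is stable.
    wordCount(lines).toSeq.sortBy(_._1).foreach { case (word, count) =>
      println(s"$word\t$count")
    }
  }
}
```

A real Scalding job would read from and write to data sources on HDFS or local disk rather than an in-memory Seq, but the chain of collection-style operations is the resemblance the quotation describes.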
Okay, that’s a lot to digest, especially if you are making some of your first forays into the world of Big Data.
Fortunately, Programming MapReduce with Scalding offers clear, well-illustrated, smoothly paced how-to steps, as well as easy-to-digest definitions and descriptions. It takes the reader from setting up and running a Hadoop mini-cluster and local-development environment to applying Scalding to real-use cases, as well as developing good test and test-driven development methodologies, running Scalding in production, using external data stores, and applying matrix calculations and machine learning.
The book is written for developers who have “a basic understanding” of Hadoop and MapReduce, but is also intended for experienced Hadoop developers who may be “enlightened by this alternative methodology of developing MapReduce applications with Scalding.”
In this book, “[a] Linux operating system is the preferred environment for Hadoop.” And the author includes instructions for how to install and use a Kiji Bento Box, “a zero-configuration bundle that provides a suitable environment for testing and prototyping projects that use HDFS, MapReduce, and HBase with minimal setup time.” It’s an easy way to get Apache Hadoop up and running in as little as five minutes or so.
Or, if you prefer, you can manually install the required software packages. Either way, you can learn a lot and do a lot with a Hadoop mini-cluster. And, with this book, you can get a very good handle on the Scalding API.
It does help to be somewhat familiar with MapReduce, Scalding, Scala, Hadoop, Maven, Eclipse and the Linux environment. But Antonio Chalkiopoulos does a good job of keeping the examples accessible even when readers are new to some of the packages. Still, be prepared to take your time, do some additional research on the web, and ask questions in forums, particularly if any of the required software is new to you.
(The book can also be purchased directly from Packt Publishing at http://goo.gl/Tyw4Sh.)
– Si Dunn