Principles and best practices of scalable real-time data systems
Nathan Marz, with James Warren
Get this book, whether you are new to working with Big Data or now an old hand at dealing with Big Data’s seemingly never-ending (and steadily expanding) complexities.
You may not agree with all that the authors offer or contend in this well-written “theory” text. But Nathan Marz’s Lambda Architecture is well worth serious consideration, especially if you are now trying to come up with more reliable and more efficient approaches to processing and mining Big Data. The writers’ explanations of some of the power, problems, and possibilities of Big Data are among the clearest and best I have read.
“More than 30,000 gigabytes of data are generated every second, and the rate of data creation is only accelerating,” Marz and Warren point out.
Thus, previous “solutions” for working with Big Data are now getting overwhelmed, not only by the sheer volume of information pouring in but by greater system complexities and failures of overworked hardware that now plague many outmoded systems.
The authors have structured their book to show “how to approach building a solution to any Big Data problem. The principles you’ll learn hold true regardless of the tooling in the current landscape, and you can use these principles to rigorously choose what tools are appropriate for your application.” In other words, they write, you will “learn how to fish, not just how to use a particular fishing rod.”
Marz’s Lambda Architecture also is at the heart of Big Data, the book. It is, the two authors explain, “an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data. It describes a scalable, easy-to-understand approach to Big Data systems that can be built and run by a small team.”
The Lambda Architecture has three layers: the batch layer, the serving layer, and the speed layer.
Not surprisingly, the book likewise is divided into three parts, each focusing on one of the layers:
- In Part 1, chapters 4 through 9 deal with various aspects of the batch layer, such as building a batch layer from end to end and implementing an example batch layer.
- Part 2 has two chapters that zero in on the serving layer. “The serving layer consists of databases that index and serve the results of the batch layer,” the writers explain. “Part 2 is short because databases that don’t require random writes are extraordinarily simple.”
- In Part 3, chapters 12 through 17 explore and explain the Lambda Architecture’s speed layer, which “compensates for the high latency of the batch layer to enable up-to-date results for queries.”
Marz and Warren contend that “[t]he benefits of data systems built using the Lambda Architecture go beyond just scaling. Because your system will be able to handle much larger amounts of data, you’ll be able to collect even more data and get more value out of it. Increasing the amount and types of data you store will lead to more opportunities to mine your data, produce analytics, and build new applications.”
This book requires no previous experience with large-scale data analysis, nor with NoSQL tools. However, it helps to be somewhat familiar with traditional databases. Nathan Marz is the creator of Apache Storm and originator of the Lambda Architecture. James Warren is an analytics architect with a background in machine learning and scientific computing.
If you think the Big Data world already is too much with us, just stick around a while. Soon, it may involve almost every aspect of our lives.