BIG DATA: A well-written look at principles & best practices of scalable real-time data systems – #bookreview

 

 

Big Data

Principles and best practices of scalable real-time data systems

Nathan Marz, with James Warren

Manning – paperback

Get this book, whether you are new to working with Big Data or now an old hand at dealing with Big Data’s seemingly never-ending (and steadily expanding) complexities.

You may not agree with all that the authors offer or contend in this well-written “theory” text. But Nathan Marz’s Lambda Architecture is well worth serious consideration, especially if you are now trying to come up with more reliable and more efficient approaches to processing and mining Big Data. The writers’ explanations of some of the power, problems, and possibilities of Big Data are among the clearest and best I have read.

“More than 30,000 gigabytes of data are generated every second, and the rate of data creation is only accelerating,” Marz and Warren point out.

Thus, previous “solutions” for working with Big Data are now getting overwhelmed, not only by the sheer volume of information pouring in but by greater system complexities and failures of overworked hardware that now plague many outmoded systems.

The authors have structured their book to show “how to approach building a solution to any Big Data problem. The principles you’ll learn hold true regardless of the tooling in the current landscape, and you can use these principles to rigorously choose what tools are appropriate for your application.” In other words, they write, you will “learn how to fish, not just how to use a particular fishing rod.”

Marz’s Lambda Architecture also is at the heart of Big Data, the book. It is, the two authors explain, “an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data. It describes a scalable, easy-to-understand approach to Big Data systems that can be built and run by a small team.”

The Lambda Architecture has three layers: the batch layer, the serving layer, and the speed layer.

Not surprisingly, the book likewise is divided into three parts, each focusing on one of the layers:

  • In Part 1, chapters 4 through 9 deal with various aspects of the batch layer, such as building a batch layer from end to end and implementing an example batch layer.
  • Part 2 has two chapters that zero in on the serving layer. “The serving layer consists of databases that index and serve the results of the batch layer,” the writers explain. “Part 2 is short because databases that don’t require random writes are extraordinarily simple.”
  • In Part 3, chapters 12 through 17 explore and explain the Lambda Architecture’s speed layer, which “compensates for the high latency of the batch layer to enable up-to-date results for queries.”

Marz and Warren contend that “[t]he benefits of data systems built using the Lambda Architecture go beyond just scaling. Because your system will be able to handle much larger amounts of data, you’ll be able to collect even more data and get more value out of it. Increasing the amount and types of data you store will lead to more opportunities to mine your data, produce analytics, and build new applications.”

This book requires no previous experience with large-scale data analysis, nor with NoSQL tools. However, it helps to be somewhat familiar with traditional databases. Nathan Marz is the creator of Apache Storm and originator of the Lambda Architecture. James Warren is an analytics architect with a background in machine learning and scientific computing.

If you think the Big Data world already is too much with us, just stick around a while. Soon, it may involve almost every aspect of our lives.

Si Dunn

Making Sense of NoSQL – A balanced, well-written overview – #bigdata #bookreview

Making Sense of NoSQL

A Guide for Managers and the Rest of Us
Dan McCreary and Ann Kelly
(Manning, paperback)

This is NOT a how-to guide for learning to use NoSQL software and build NoSQL databases. It is a meaty, well-structured overview aimed primarily at “technical managers, [software] architects, and developers.” However, it also is written to appeal to other, not-so-technical readers who are curious about NoSQL databases and where NoSQL could fit into the Big Data picture for their business, institution, or organization.

Making Sense of NoSQL definitely lives up to its subtitle: “A guide for managers and the rest of us.”

Many executives, managers, consultants and others today are dealing with expensive questions related to Big Data, primarily how it affects their current databases, database management systems, and the employees and contractors who maintain them. A variety of  problems can fall upon those who operate and update big relational (SQL) databases and their huge arrays of servers pieced together over years or decades.

The authors, Dan McCreary and Ann Kelly, are strong proponents, obviously, of the NoSQL approach. It offers, they note, “many ways to allow you to grow your database without ever having to shut down your servers.” However, they also realize that NoSQL may not a good, nor affordable, choice in many situations. Indeed, a blending of SQL and NoSQL systems may be a better choice. Or, making changes from SQL to NoSQL may not be financially feasible at all. So they have structured their book into four parts that attempt to help readers “objectively evaluate SQL and NoSQL database systems to see which business problems they solve.”

Part 1 provides an overview of NoSQL, its history, and its potential business benefits. Part 2 focuses on “database patterns,” including “legacy database patterns (which most solution architects are familiar with), NoSQL patterns, and native XML databases.” Part 3 examines “how NoSQL solutions solve the real-world business problems of big data, search, high availability, and agility.” And Part 4 looks at “two advanced topics associated with NoSQL: functional programming and system security.”

McCreary and Kelly observe that “[t]he transition to functional programming requires a paradigm shift away from software designed to control state and toward software that has a focus on independent data transformation.” (Erlang, Scala, and F# are some of the functional languages that they highlight.) And, they contend: “It’s no longer sufficient to design a system that will scale to 2, 4, or 8 core processors. You need to ask if your architecture will scale to 100, 1,000, or even 10,000 processors.”

Meanwhile, various security challenges can arise as a NoSQL database “becomes popular and is used by multiple projects” across “department trust boundaries.”

Computer science students, software developers, and others who are trying to stay knowledgeable about Big Data technology and issues should also consider reading this well-written book.

Si Dunn

Scaling Big Data with Hadoop and Solr – A new how-to guide – #bigdata #java #bookreview

Scaling Big Data with Hadoop and Solr

Learn new ways to build efficient, high performance enterprise search repositories for Big Data using Hadoop and Solr
Hrishikesh Karambelkar
(Packt – paperback, Kindle)

This well-presented, step-by-step guide shows how to use Apache Hadoop and Apache Solr to work with Big Data.  Author and software architect Hrishikesh Karambelkar does a good job of explaining Hadoop and Solr, and he illustrates how they can work together to tackle Big Data enterprise search projects.

“Google faced the problem of storing and processing big data, and they came up with the MapReduce approach, which is basically a divide-and-conquer strategy for distributed data processing,” Karambelkar notes. “MapReduce is widely accepted by many organizations to run their Big Data computations. Apache Hadoop is the most popular open source Apache licensed implementation of MapReduce….Apache Hadoop enables distributed processing of large datasets across a commodity of clustered servers. It is designed to scale up from single server to thousands of commodity hardware machines, each offering partial computational units and data storage.”

Meanwhile, Karambelkar adds, “Apache Solr is an open source enterprise search application which provides user abilities to search structured as well as unstructured data across the organization.”

His book (128 pages in print format) is structured with five chapters and three appendices:

  • Chapter 1: Processing Big Data Using Hadoop MapReduce
  • Chapter 2: Understanding Solr
  • Chapter 3: Making Big Data Work for Hadoop and Solr
  • Chapter 4: Using Big Data to Build Your Large Indexing
  • Chapter 5: Improving Performance of Search while Scaling with Big Data
  • Appendix A: Use Cases for Big Data Search
  • Appendix B: Creating Enterprise Search Using Apache Solr
  • Appendix C: Sample MapReduce Programs to Build the Solr Indexes

Where the book falls short (and I have noted this about many works by computer-book publishers) is that the author simply assumes everything will go well during the process of downloading and setting up the software–and gives almost no troubleshooting hints. This can happen with books written by software experts that are also are reviewed by software experts. Their systems likely are already optimized and may not throw the error messages that less-experienced users may encounter.

For example, the author states: “Installing Hadoop is a straightforward job with a default setup….” Unfortunately, there are many “flavors” and configurations of Linux running in the world. And Google searches can turn up a variety of problems others have encountered when installing, configuring and running Hadoop.  Getting Solr installed and running likewise is not a simple process for everyone.

If you are ready to plunge in and start dealing with Big Data, Scaling Big Data with Hadoop and Solr definitely can give you some well-focused and important information.  But heed the “Who this book is for” statement on page 2: “This book is primarily aimed at Java programmers, who wish to extend Hadoop platform to make it run as an enterprise search without prior knowledge of Apache Hadoop and Solr.”

And don’t surprised if you have to seek additional how-to details and troubleshooting information from websites and other books, as well as from co-workers and friends who may know Linux, Java and NoSQL databases better than you do (whether you want to admit it or not).

Si Dunn

Jump Start Node.js – A well-written guide for learning Node.js quickly – #programming #bookreview

 Jump Start Node.js
Don Nguyen
(SitePoint – paperback, Kindle)

Don Nguyen’s well-written Node.js book has been in print for a few months and is an excellent text for learning how to put Node.js to work in fast, scalable real-time web applications.

You should have some experience with JavaScript before tackling Node.js. But Nguyen says a “server-side engineer who uses another language such as PHP, Python, Ruby, Java, or .NET” can pick up enough JavaScript from his book to get a good feel for its syntax and idiosyncratic features.

What I like most about the book is how it  jumps right into developing a dynamic working Node.js application that you deploy to a production server. The project is a real-time stock market trading engine that streams live prices into a web browser. Along the way, you learn how to set up and use a NoSQL database (with MongoDB), you learn some functional programming techniques, and you work with Ajax, Express, Mocha, Socket.io, Backbone.js, Twitter Bootstrap, GitHub and Heroku.

The author covers a lot of ground, with clear code examples and good explanations, in just 154 pages. “The main goal of this book,” he notes, “is to transfer the skill set rather than the actual project into the real world. There is a narrow domain of ‘hard’ real-time applications such as a stock exchange where specialized software and hardware are required because microseconds count. However, there is a much larger number of ‘soft’ real-time applications such as Facebook, Twitter, and eBay where microseconds are of small consequence. This is Node.js’s speciality, and you’ll understand how to build these types of applications by the end of this book.”

Note: If you are a Windows user, you will have to install Cygwin before you can start using the Mocha testing framework on page 23. If you use Mac OS X, you will need to have the Xcode Command Line Tools installed. More information related to the book can be found at this SitePoint forum.

Si Dunn

Spring Data: Modern Data Access for Enterprise Java – #java #bookreview

Spring Data: Modern Data Access for Enterprise Java
Mark Pollack, Oliver Gierke, Thomas Risberg, Jonathan L. Brisbin and Michael Hunger
(O’Reilly, paperbackKindle)

Big Data keeps getting wider and deeper by the second. And so do the demands for analyzing and profiting from all of those piled-up terabytes.

Meanwhile, the once whiz-bang technology known as the relational database is having a very hard time keeping pace. The sheer amount of data that companies now gather, store, access, and analyze is pushing traditional relational databases to the breaking point.

Many Java developers who are trying to keep these overloaded systems held together with baling wire, also are starting to learn to work with some of the “alternative data stores that are being used in mission-critical enterprise applications,” the authors of Spring Data point out.

A lot of data now is being stored elsewhere and not in relational databases. Yet companies cannot abandon what they have already gathered and invested heavily to access. So they need to keep using and supporting their relational databases, plus some newer, faster, more voracious solutions lumped under the heading “NoSQL databases,” (even though you can query them).

In “the new data access landscape,” the authors note: “there is a revolution taking place, which for data geeks is quite exciting. Relational databases are not dead; they are still central to the operations of many enterprises and will remain so for quite some time. The trends, though, are very clear: new data access technologies are solving problems that traditional relational databases can’t, so we need to broaden our skill set as developers and have a foot in both camps.”

They add: “The Spring Framework has a long history of simplifying the development of Java applications, in particular for writing RDBMS-based data access layers that use Java database connectivity (JDBC) or object-relational mappers.”

Their new book “is intended to give you a hands-on introduction to the Spring Data project, whose core mission is to enable Java developers to use state-of-the-art data processing and manipulation tools but also use traditional databases in a state-of-the-art manner.”

They have organized their 288-page book into six parts and 14 chapters:

Part I – Background

  • Chapter 1 – The Spring Data Project
  • Chapter 2 – Repositories: Convenient Data Access Layers
  • Chapter 3 – Type-Safe Querying Using Querydsl

Part II – Relational Databases

  • Chapter 4 – JPA Repositories
  • Chapter 5 – Type-safe JDBC Programming with Querydsl SQL

Part III – NoSQL

  • Chapter 6 – MongoDB: A Document Store
  • Chapter 7 – Neo4j: A Graph Database
  • Chapter8 – Redis: A Key/Value Store

Part IV – Rapid Application Development

  • Chapter 9 – Persistence Layers with Spring Roo
  • Chapter 10 – REST Repository Exporter

Part V – Big Data

  • Chapter 11 – Spring for Apache Hadoop
  • Chapter 12 – Analyzing Data with Hadoop
  • Chapter 13 – Creating Big Data Pipelines with Spring Batch and Spring Integration

Part 5 – Data Grids

  • Chapter 14 – GemFire: A Distributed Data Grid

“Many of the values that have made Spring the preferred platform for enterprise Java developers deliver particular benefit in a world of fragmented persistence solutions,” states Ron Johnson, creator of Spring Framework. Writing in the book’s foreword, he notes: “Part of the value of Spring is how it brings consistency (without descending to a lowest common denominator) in its approach to different technologies with which it integrates.

“A distinct ‘Spring way’ helps shorten the learning curve for developers and simplifies code maintenance. If you are already familiar with Spring, you will find that Spring Data eases your exploration and adoption of unfamiliar stores. If you aren’t already familiar with Spring, this is a good opportunity to see how Spring can simplify your code and make it more consistent.”

Spring Data definitely is not light reading, but it is well-written, and provides a good blending of procedures, steps, explanations, code samples, screenshots and other illustrations.

Si Dunn