Making Sense of NoSQL – A balanced, well-written overview – #bigdata #bookreview

Making Sense of NoSQL

A Guide for Managers and the Rest of Us
Dan McCreary and Ann Kelly
(Manning, paperback)

This is NOT a how-to guide for learning to use NoSQL software and build NoSQL databases. It is a meaty, well-structured overview aimed primarily at “technical managers, [software] architects, and developers.” However, it also is written to appeal to other, not-so-technical readers who are curious about NoSQL databases and where NoSQL could fit into the Big Data picture for their business, institution, or organization.

Making Sense of NoSQL definitely lives up to its subtitle: “A guide for managers and the rest of us.”

Many executives, managers, consultants, and others today are dealing with expensive questions related to Big Data, primarily how it affects their current databases, database management systems, and the employees and contractors who maintain them. A variety of problems can befall those who operate and update big relational (SQL) databases and their huge arrays of servers pieced together over years or decades.

The authors, Dan McCreary and Ann Kelly, are strong proponents, obviously, of the NoSQL approach. It offers, they note, “many ways to allow you to grow your database without ever having to shut down your servers.” However, they also realize that NoSQL may not be a good, nor affordable, choice in many situations. Indeed, a blending of SQL and NoSQL systems may be a better choice. Or, making changes from SQL to NoSQL may not be financially feasible at all. So they have structured their book into four parts that attempt to help readers “objectively evaluate SQL and NoSQL database systems to see which business problems they solve.”

Part 1 provides an overview of NoSQL, its history, and its potential business benefits. Part 2 focuses on “database patterns,” including “legacy database patterns (which most solution architects are familiar with), NoSQL patterns, and native XML databases.” Part 3 examines “how NoSQL solutions solve the real-world business problems of big data, search, high availability, and agility.” And Part 4 looks at “two advanced topics associated with NoSQL: functional programming and system security.”

McCreary and Kelly observe that “[t]he transition to functional programming requires a paradigm shift away from software designed to control state and toward software that has a focus on independent data transformation.” (Erlang, Scala, and F# are some of the functional languages that they highlight.) And, they contend: “It’s no longer sufficient to design a system that will scale to 2, 4, or 8 core processors. You need to ask if your architecture will scale to 100, 1,000, or even 10,000 processors.”
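To see what that paradigm shift looks like in practice, here is a small illustration of stateless, independent data transformation. It is mine, not the book's, and it is written with Java's stream API rather than the Erlang, Scala, or F# the authors highlight; because no shared state is mutated, the runtime is free to partition the work across however many cores are available:

    import java.util.stream.IntStream;

    public class StatelessTransform {
        public static void main(String[] args) {
            // Each element is transformed independently; there is no shared
            // mutable state, so the same pipeline runs unchanged on 2 cores or 10,000.
            long sumOfSquares = IntStream.rangeClosed(1, 10_000)
                    .parallel()                   // the runtime partitions the work
                    .mapToLong(n -> (long) n * n) // independent data transformation
                    .sum();                       // results are combined at the end
            System.out.println(sumOfSquares);
        }
    }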

Meanwhile, various security challenges can arise as a NoSQL database “becomes popular and is used by multiple projects” across “department trust boundaries.”

Computer science students, software developers, and others who are trying to stay knowledgeable about Big Data technology and issues should also consider reading this well-written book.

Si Dunn

Scaling Big Data with Hadoop and Solr – A new how-to guide – #bigdata #java #bookreview

Scaling Big Data with Hadoop and Solr

Learn new ways to build efficient, high performance enterprise search repositories for Big Data using Hadoop and Solr
Hrishikesh Karambelkar
(Packt – paperback, Kindle)

This well-presented, step-by-step guide shows how to use Apache Hadoop and Apache Solr to work with Big Data.  Author and software architect Hrishikesh Karambelkar does a good job of explaining Hadoop and Solr, and he illustrates how they can work together to tackle Big Data enterprise search projects.

“Google faced the problem of storing and processing big data, and they came up with the MapReduce approach, which is basically a divide-and-conquer strategy for distributed data processing,” Karambelkar notes. “MapReduce is widely accepted by many organizations to run their Big Data computations. Apache Hadoop is the most popular open source Apache licensed implementation of MapReduce….Apache Hadoop enables distributed processing of large datasets across a commodity of clustered servers. It is designed to scale up from single server to thousands of commodity hardware machines, each offering partial computational units and data storage.”
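For readers who have never seen the divide-and-conquer approach in code, the canonical first MapReduce program is word count. This sketch is adapted from the standard Hadoop tutorial, not taken from Karambelkar's book: the mapper emits a (word, 1) pair for every word it finds, and the reducer sums the counts that Hadoop groups under each distinct word.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // "Map": break each line of input into (word, 1) pairs.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // "Reduce": sum the 1s gathered for each distinct word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }
    }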

Meanwhile, Karambelkar adds, “Apache Solr is an open source enterprise search application which provides user abilities to search structured as well as unstructured data across the organization.”
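To give a sense of what querying Solr looks like from Java, here is a minimal SolrJ sketch of my own. It is not from the book, the server URL, collection name, and field names are placeholders for your own setup, and the Builder-style client assumes a newer SolrJ release than the book covers:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class SolrSearchExample {
        public static void main(String[] args) throws Exception {
            // Placeholder URL and collection; point these at your own Solr instance.
            try (HttpSolrClient solr =
                    new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
                QueryResponse response = solr.query(new SolrQuery("title:hadoop"));
                for (SolrDocument doc : response.getResults()) {
                    System.out.println(doc.getFieldValue("id") + " -> " + doc.getFieldValue("title"));
                }
            }
        }
    }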

His book (128 pages in print format) is structured with five chapters and three appendices:

  • Chapter 1: Processing Big Data Using Hadoop MapReduce
  • Chapter 2: Understanding Solr
  • Chapter 3: Making Big Data Work for Hadoop and Solr
  • Chapter 4: Using Big Data to Build Your Large Indexing
  • Chapter 5: Improving Performance of Search while Scaling with Big Data
  • Appendix A: Use Cases for Big Data Search
  • Appendix B: Creating Enterprise Search Using Apache Solr
  • Appendix C: Sample MapReduce Programs to Build the Solr Indexes

Where the book falls short (and I have noted this about many works by computer-book publishers) is that the author simply assumes everything will go well during the process of downloading and setting up the software–and gives almost no troubleshooting hints. This can happen with books that are written by software experts and also are reviewed by software experts. Their systems likely are already optimized and may not throw the error messages that less-experienced users may encounter.

For example, the author states: “Installing Hadoop is a straightforward job with a default setup….” Unfortunately, there are many “flavors” and configurations of Linux running in the world. And Google searches can turn up a variety of problems others have encountered when installing, configuring and running Hadoop.  Getting Solr installed and running likewise is not a simple process for everyone.

If you are ready to plunge in and start dealing with Big Data, Scaling Big Data with Hadoop and Solr definitely can give you some well-focused and important information.  But heed the “Who this book is for” statement on page 2: “This book is primarily aimed at Java programmers, who wish to extend Hadoop platform to make it run as an enterprise search without prior knowledge of Apache Hadoop and Solr.”

And don’t be surprised if you have to seek additional how-to details and troubleshooting information from websites and other books, as well as from co-workers and friends who may know Linux, Java and NoSQL databases better than you do (whether you want to admit it or not).

Si Dunn

Data Science for Business – A serious guide for those who need to know – #bigdata #bookreview

Data Science for Business

What You Need to Know about Data Mining and Data-Analytic Thinking
Foster Provost and Tom Fawcett
(O’Reilly – paperback, Kindle)

This is not an introductory text for casual readers curious about the hoopla over data science and Big Data.

And you definitely won’t find code here for simple screen scrapers written in Python 2.7 or programs that access the Twitter API to scoop up messages containing certain hashtags.

Data Science for Business is based on an MBA course Foster Provost teaches at New York University, and it is aimed at three specific, serious audiences:

  • “Aspiring data scientists”
  • “Developers who will be implementing data science solutions…”
  • “Business people who will be working with data scientists, managing data science-oriented projects, or investing in data science ventures….”

Provost and Fawcett’s book “concentrates on the fundamentals of data science and data mining,” the two authors state. But it specifically avoids “an algorithm-centered approach” and instead focuses on “a relatively small set of fundamental concepts or principles that underlie techniques for extracting useful knowledge from data. These concepts serve as the foundation for many well-known algorithms of data mining.”

“Moreover, these concepts underlie the analysis of data-centered business problems, the creation and evaluation of data science solutions, and the evaluation of general data science strategies and proposals.”

The book is well-written and adequately illustrated with charts, diagrams, mathematical equations and mathematical examples. And the text, while technical and dense in some places, is organized into short sections. Most of the chapters end with insightful summaries that help the lessons stick.

Both authors are veterans of applying data science in business. Their new book includes two helpful appendices: one shows how to “assess potential data mining projects” and “uncover potential flaws in proposals”; the other presents a sample proposal and discusses its flaws.

“If you are a business stakeholder rather than a data scientist,” the authors caution, “don’t let so-called data scientists bamboozle you with jargon: the concepts of this book plus knowledge of your own business and data systems should allow you to understand 80% or more of the data science at a reasonable enough level to be productive for your business.”

They also challenge data scientists to “think deeply about why your work is relevant to helping the business and be able to present it as such.”

Si Dunn

Hadoop is hot! Three new how-to books for riding the Big Data elephant – #programming #bookreview

In the world of Big Data, Hadoop has become the hard-charging elephant in the room.

Its big-name users now span the alphabet and include such notables as Amazon, eBay, Facebook, Google, the New York Times, and Yahoo. Not bad for software named after a child’s toy elephant.

Computer systems that run Hadoop can store, process, and analyze large amounts of data that have been gathered up in many different formats from many different sources.

According to the Apache Software Foundation’s Hadoop website: “The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.”

The (well-trained) user defines the Big Data problem that Hadoop will tackle. Then the software handles all aspects of job completion, including spreading out the problem in small pieces to many different computers, or nodes, in the distributed system for more efficient processing. Hadoop also handles individual node failures, and collects and combines the calculated results from each node.
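In code, “defining the problem” largely means configuring and submitting a job; Hadoop takes over from there. Here is a minimal driver sketch of my own (MyMapper and MyReducer are hypothetical classes you would write, along the lines of the word-count example in the Karambelkar review above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MyJobDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "my analysis job");
            job.setJarByClass(MyJobDriver.class);
            job.setMapperClass(MyMapper.class);     // hypothetical Mapper subclass you supply
            job.setReducerClass(MyReducer.class);   // hypothetical Reducer subclass you supply
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input location
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output location
            // Hadoop handles splitting, scheduling, node failures, and result collection.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }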

But you don’t need a collection of hundreds or thousands of computers to run Hadoop. You can learn it, write programs, and do some testing and debugging on a single Linux machine, Windows PC, or Mac. The open source software can be downloaded from the Apache Hadoop website. (Do some research first. You may have to use web searches to find detailed installation instructions for your specific system.)

Hadoop is open-source software that is often described as “a Java-based framework for large-scale data processing.” It has a lengthy learning curve that includes getting familiar with Java, if you don’t already know it.

But if you are now ready and eager to take on Hadoop, Packt Publishing recently has unveiled three excellent how-to books that can help you begin and extend your mastery: Hadoop Beginner’s Guide, Hadoop MapReduce Cookbook, and Hadoop Real-World Solutions Cookbook.

Short reviews of each are presented below.

Hadoop Beginner’s Guide
Garry Turkington
(Packt Publishing – paperback, Kindle)

Garry Turkington’s new book is a detailed, well-structured introduction to Hadoop. It covers everything from the software’s three modes–local standalone mode, pseudo-distributed mode, and fully distributed mode–to running basic jobs, developing simple and advanced MapReduce programs, maintaining clusters of computers, and working with Hive, MySQL, and other tools.

“The developer focuses on expressing the transformation between source and result data sets, and the Hadoop framework manages all aspects of job execution, parallelization, and coordination,” the author writes.

He calls this capability “possibly the most important aspect of Hadoop. The platform takes responsibility for every aspect of executing the processing across the data. After the user defines the key criteria for the job, everything else becomes the responsibility of the system.”

The 374-page book is well written and provides numerous code samples and illustrations. But it has one drawback for some beginners who want to install and use Hadoop. Turkington offers step-by-step instructions for how to perform a Linux installation, specifically Ubuntu. However, he refers Windows and Mac users to an Apache site where there is insufficient how-to information, and web searches become necessary to find more installation details.

Hadoop MapReduce Cookbook
Srinath Perera and Thilina Gunarathne
(Packt Publishing – paperback, Kindle)

MapReduce “jobs” are an essential part of how Hadoop is able to crunch huge chunks of Big Data. The Hadoop MapReduce Cookbook offers “recipes for analyzing large and complex data sets with Hadoop MapReduce.”

MapReduce is a well-known programming model for processing large sets of data. Typically, MapReduce is used within clusters of computers that are configured to perform distributed computing.

In the “Map” portion of the process, a problem is split into many subtasks that are then assigned by a master computer to individual computers known as nodes. (Nodes also can have sub-nodes.) During the “Reduce” part of the task, the master computer gathers up the processed data from the nodes, combines it, and outputs a response to the problem that was posed. (MapReduce libraries are now available for many different programming languages.)

“Hadoop is the most widely known and widely used implementation of the MapReduce paradigm,” the two authors note.

Their 284-page book initially shows how to run Hadoop in local mode, which “does not start any servers but does all the work within the same JVM [Java Virtual Machine]” on a standalone computer. Then, as you gain more experience with MapReduce and the Hadoop Distributed File System (HDFS), they guide you into using Hadoop in more complex, distributed-computing environments.
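Switching between those modes is mostly a configuration matter. As a rough sketch (mine, not the book's; the property names shown are the Hadoop 2.x ones, while older releases used mapred.job.tracker and fs.default.name), forcing local mode from Java looks like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class LocalModeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("mapreduce.framework.name", "local"); // run all tasks inside this JVM
            conf.set("fs.defaultFS", "file:///");          // use the local filesystem, not HDFS
            Job job = Job.getInstance(conf, "local-mode experiment");
            // ...set the mapper, reducer, and input/output paths as for any other job...
        }
    }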

Echoing the Hadoop Beginner’s Guide, the authors explain how to install Hadoop on Linux machines only.

Hadoop Real-World Solutions Cookbook
Jonathan R. Owens, Jon Lentz and Brian Femiano
(Packt Publishing – paperback, Kindle)

The Hadoop Real-World Solutions Cookbook assumes you already have some experience with Hadoop. So it jumps straight into helping “developers become more comfortable with, and proficient at solving problems in, the Hadoop space.”

Its goal is to “teach readers how to build solutions using tools such as Apache Hive, Pig, MapReduce, Mahout, Giraph, HDFS, Accumulo, Redis, and Ganglia.”

The 299-page book is packed with code examples and short explanations that help solve specific types of problems. A few randomly selected problem headings:

  • “Using Apache Pig to filter bot traffic from web server logs.”
  • “Using the distributed cache in MapReduce.”
  • “Trim Outliers from the Audioscrobbler dataset using Pig and datafu.” 
  • “Designing a row key to store geographic events in Accumulo.”
  • “Enabling MapReduce jobs to skip bad records.”

The authors use a simple but effective strategy for presenting problems and solutions. First, the problem is clearly described. Then, under a “Getting Ready” heading, they spell out what you need to solve the problem. That is followed by a “How to do it…” heading where each step is presented and supported by code examples. Then, paragraphs beneath a “How it works…” heading sum up and explain how the problem was solved. Finally, a “There’s more…” heading highlights more explanations and links to additional details.

If you are a Hadoop beginner, consider the first two books reviewed above. If you have some Hadoop experience, you likely can find some useful tips in book number three.

Si Dunn

Spring Data: Modern Data Access for Enterprise Java – #java #bookreview

Spring Data: Modern Data Access for Enterprise Java
Mark Pollack, Oliver Gierke, Thomas Risberg, Jonathan L. Brisbin and Michael Hunger
(O’Reilly – paperback, Kindle)

Big Data keeps getting wider and deeper by the second. And so do the demands for analyzing and profiting from all of those piled-up terabytes.

Meanwhile, the once whiz-bang technology known as the relational database is having a very hard time keeping pace. The sheer amount of data that companies now gather, store, access, and analyze is pushing traditional relational databases to the breaking point.

Many Java developers who are trying to keep these overloaded systems held together with baling wire also are starting to learn to work with some of the “alternative data stores that are being used in mission-critical enterprise applications,” the authors of Spring Data point out.

A lot of data now is being stored elsewhere, not in relational databases. Yet companies cannot abandon what they have already gathered and invested heavily to access. So they need to keep using and supporting their relational databases, plus some newer, faster, more voracious solutions lumped under the heading “NoSQL databases” (even though you can query them).

In “the new data access landscape,” the authors note: “there is a revolution taking place, which for data geeks is quite exciting. Relational databases are not dead; they are still central to the operations of many enterprises and will remain so for quite some time. The trends, though, are very clear: new data access technologies are solving problems that traditional relational databases can’t, so we need to broaden our skill set as developers and have a foot in both camps.”

They add: “The Spring Framework has a long history of simplifying the development of Java applications, in particular for writing RDBMS-based data access layers that use Java database connectivity (JDBC) or object-relational mappers.”

Their new book “is intended to give you a hands-on introduction to the Spring Data project, whose core mission is to enable Java developers to use state-of-the-art data processing and manipulation tools but also use traditional databases in a state-of-the-art manner.”
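The centerpiece of that mission is Spring Data's repository abstraction. As a minimal sketch (mine, not the book's; Customer is a hypothetical JPA-mapped entity class), a complete data access layer can be as small as this interface:

    import java.util.List;
    import org.springframework.data.repository.CrudRepository;

    // Spring Data generates the implementation at runtime. The finder below is
    // derived from the method name, so no SQL or store-specific query is hand-written.
    public interface CustomerRepository extends CrudRepository<Customer, Long> {
        List<Customer> findByLastName(String lastName); // Customer is a hypothetical entity
    }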

They have organized their 288-page book into six parts and 14 chapters:

Part I – Background

  • Chapter 1 – The Spring Data Project
  • Chapter 2 – Repositories: Convenient Data Access Layers
  • Chapter 3 – Type-Safe Querying Using Querydsl

Part II – Relational Databases

  • Chapter 4 – JPA Repositories
  • Chapter 5 – Type-Safe JDBC Programming with Querydsl SQL

Part III – NoSQL

  • Chapter 6 – MongoDB: A Document Store
  • Chapter 7 – Neo4j: A Graph Database
  • Chapter 8 – Redis: A Key/Value Store

Part IV – Rapid Application Development

  • Chapter 9 – Persistence Layers with Spring Roo
  • Chapter 10 – REST Repository Exporter

Part V – Big Data

  • Chapter 11 – Spring for Apache Hadoop
  • Chapter 12 – Analyzing Data with Hadoop
  • Chapter 13 – Creating Big Data Pipelines with Spring Batch and Spring Integration

Part VI – Data Grids

  • Chapter 14 – GemFire: A Distributed Data Grid

“Many of the values that have made Spring the preferred platform for enterprise Java developers deliver particular benefit in a world of fragmented persistence solutions,” states Rod Johnson, creator of the Spring Framework. Writing in the book’s foreword, he notes: “Part of the value of Spring is how it brings consistency (without descending to a lowest common denominator) in its approach to different technologies with which it integrates.

“A distinct ‘Spring way’ helps shorten the learning curve for developers and simplifies code maintenance. If you are already familiar with Spring, you will find that Spring Data eases your exploration and adoption of unfamiliar stores. If you aren’t already familiar with Spring, this is a good opportunity to see how Spring can simplify your code and make it more consistent.”

Spring Data definitely is not light reading, but it is well-written, and provides a good blending of procedures, steps, explanations, code samples, screenshots and other illustrations.

Si Dunn

Ethics of Big Data – Thoughtful insights into key issues confronting big-data ‘gold mines’ – #management #bookreview

Ethics of Big Data
Kord Davis, with Doug Patterson
(O’Reilly – paperback, Kindle)

“Big Data” and how to mine it for profit are red-hot topics in today’s business world. Many corporations now find themselves sitting atop virtual gold mines of customer information. And even small businesses now are attempting to find new ways to profit from their stashes of sales, marketing, and research data. 

Like it or not, you can’t block all of the cookies or tracking companies or sites that are following you, and each time you surf the web, you leave behind a “data exhaust” trail that has monetary value to others. Indeed, one recent start-up, Enliken (“Data to the People”), is offering a way for computer users to gain some control over their data exhaust trail’s monetary value and choose who benefits from it, including some charities.

Ethics of Big Data does not seek to lay down a “hard-and-fast list of rules for the ethical handling of data.” The new book also doesn’t “tell you what to do with your data.” Its goals are “to help you engage in productive ethical discussions raised by today’s big-data-driven enterprises, propose a framework for thinking and talking about these issues, and introduce a methodology for aligning actions with values within an organization.”

It’s heady stuff, packed into just 64 pages. But the book is well written and definitely thought-provoking. It can serve as a focused guide for corporate leaders and others now hoping to get a grip on their own big-data situations, in ways that will not alienate their customers, partners, and stakeholders.

In the view of the authors: “For both individuals and organizations, four common elements define what can be considered a framework for big data:

  • “Identity – What is the relationship between our offline identity and our online identity?”
  • “Privacy – Who should control access to data?”
  • “Ownership – Who owns data, can rights be transferred, and what are the obligations of people who generate and use that data?”
  • “Reputation – How can we determine what data is trustworthy? Whether about ourselves, others, or anything else, big data exponentially increases the amount of information and ways we can interact with it. This phenomenon increases the complexity of managing how we are perceived and judged.”

Big-data technology itself is “ethically neutral,” the authors contend, and it “has no value framework. Individuals and corporations, however, do have value systems, and it is only by asking and seeking answers to ethical questions that we can ensure big data is used in a way that aligns with those values.”

At the same time: “Big data is pushing corporate action further and more fully into individual lives through the sheer volume, variety, and velocity of the data being generated. Big-data product design, development, sales, and management actions expand their influence and impact over individuals’ lives that may be changing the common meanings of words like privacy, reputation, ownership, and identity.”

What will happen next as (1) big data continues to expand and intrude and (2) people and organizations push back harder is still anybody’s guess. But matters of ethics likely will remain at the center of the conflicts.

Indeed, some big-data gold mines could suffer devastating financial and legal cave-ins if greed is allowed to trump ethics.

Si Dunn