Manning’s ‘MongoDB in Action’ has been updated for version 3.0 – #programming #bookreview

MongoDB in Action, Second Edition

Covers MongoDB version 3.0

Kyle Banker

Manning, paperback

Yes, this updated edition of MongoDB in Action is aimed at software developers. However, the book wisely does not ignore those of us who are more casual users of MongoDB.

Indeed, this is a fine how-to book for MongoDB newcomers and casual users, too, particularly if you are patient and willing to read through an introductory chapter focusing on “MongoDB’s history, design goals, and application use cases.”

Many people, of course, just want to jump straight into downloading the software, running it, and playing with it for a while before getting down to any serious stuff such as application use cases. For them, this book’s Appendix A is the place to go first; it shows how to get MongoDB onto your Linux, Mac, or Windows computer. Then, after MongoDB is installed, you can jump back to Chapter 2 to start learning how to use the JavaScript shell.

After that, things quickly start getting more “practical.” For example, Chapter 3 introduces “Writing programs using MongoDB.” Here, Ruby is employed to work with the MongoDB API. But the author notes: “MongoDB, Inc. provides officially supported, Apache-licensed MongoDB drivers for all of the most popular programming languages. The driver examples in the book use Ruby, but the principles we’ll illustrate are universal and easily transferable to other drivers. Throughout the book we’ll illustrate most commands with the JavaScript shell, but examples of using MongoDB from within an application will be in Ruby.”
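That transferability is easy to see in code. Here is a rough sketch of my own (the book’s application examples are in Ruby, not Java) of the same kind of interaction using the officially supported Java driver; the database and collection names are hypothetical:

```java
import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

// A minimal sketch with the MongoDB 3.x Java driver. The book's application
// examples use Ruby, but the driver concepts map over almost one to one.
public class MongoQuickstart {
    public static void main(String[] args) {
        MongoClient client = new MongoClient("localhost", 27017);
        try {
            MongoDatabase db = client.getDatabase("tutorial");      // hypothetical name
            MongoCollection<Document> users = db.getCollection("users");

            // Insert a document, then read it back.
            users.insertOne(new Document("username", "smith").append("age", 30));
            Document found = users.find(new Document("username", "smith")).first();
            System.out.println(found.toJson());
        } finally {
            client.close();
        }
    }
}
```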

I won’t try to sum up everything in this well-written, 13-chapter book. I have used older 2.x versions of MongoDB in MEAN stack applications. And, separately, I have worked a bit with Ruby and MongoDB. But in each case, I haven’t needed to learn all that much about MongoDB itself, mainly just ensuring that data is stored in the right place and can be accessed, updated, saved, or deleted as needed. So this book, written for the 3.0.x releases (with notes on earlier and later ones), is an eye-opener for me and one that I will keep around for reference and more learning now that I have upgraded to 3.2.

Part 1 of MongoDB in Action, Second Edition “provides a broad, practical introduction to MongoDB.” Part 2 delivers “a deep exploration of MongoDB’s document data model.” Part 3, meanwhile, examines MongoDB “from the database administrator’s perspective. This means we’ll cover all the things you need to know about performance, deployments, fault tolerance, and scalability.”

The book’s author knows that readers with some MongoDB experience will not read the book straight through. Instead, they will tackle chapters in many different orders and will even skip some chapters. And this is okay. MongoDB in Action, Second Edition is a book many of us will be happy to have handy whenever we need to get a better grip on some new aspect of working with this very popular open-source document database.

One cautionary note: The author points out that “as of MongoDB v3.0, 32-bit binaries will no longer be supported.” Of course, some 3.X 32-bit binaries are still out there, and you can install them. But you will get a lot of warning messages from MongoDB. So, download a 64-bit binary if your system will support it.

Si Dunn

BIG DATA: A well-written look at principles & best practices of scalable real-time data systems – #bookreview


Big Data

Principles and best practices of scalable real-time data systems

Nathan Marz, with James Warren

Manning – paperback

Get this book, whether you are new to working with Big Data or an old hand at dealing with its seemingly never-ending (and steadily expanding) complexities.

You may not agree with all that the authors offer or contend in this well-written “theory” text. But Nathan Marz’s Lambda Architecture is well worth serious consideration, especially if you are now trying to come up with more reliable and more efficient approaches to processing and mining Big Data. The writers’ explanations of some of the power, problems, and possibilities of Big Data are among the clearest and best I have read.

“More than 30,000 gigabytes of data are generated every second, and the rate of data creation is only accelerating,” Marz and Warren point out.

Thus, previous “solutions” for working with Big Data are getting overwhelmed, not only by the sheer volume of information pouring in but also by growing system complexity and the failures of overworked hardware that plague many outmoded systems.

The authors have structured their book to show “how to approach building a solution to any Big Data problem. The principles you’ll learn hold true regardless of the tooling in the current landscape, and you can use these principles to rigorously choose what tools are appropriate for your application.” In other words, they write, you will “learn how to fish, not just how to use a particular fishing rod.”

Marz’s Lambda Architecture also is at the heart of Big Data, the book. It is, the two authors explain, “an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data. It describes a scalable, easy-to-understand approach to Big Data systems that can be built and run by a small team.”

The Lambda Architecture has three layers: the batch layer, the serving layer, and the speed layer.

Not surprisingly, the book likewise is divided into three parts, each focusing on one of the layers:

  • In Part 1, chapters 4 through 9 deal with various aspects of the batch layer, such as building a batch layer from end to end and implementing an example batch layer.
  • Part 2 has two chapters that zero in on the serving layer. “The serving layer consists of databases that index and serve the results of the batch layer,” the writers explain. “Part 2 is short because databases that don’t require random writes are extraordinarily simple.”
  • In Part 3, chapters 12 through 17 explore and explain the Lambda Architecture’s speed layer, which “compensates for the high latency of the batch layer to enable up-to-date results for queries.”
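To make the layers concrete, here is a toy sketch of my own, not from the book, of how a query might merge a precomputed batch view with the speed layer’s recent updates:

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of a Lambda Architecture query. The batch layer
// periodically precomputes a view over the entire master dataset; the
// speed layer keeps incremental counts only for data that arrived after
// the last batch run. A query answers by merging the two views.
public class LambdaQuerySketch {
    static final Map<String, Long> batchView = new HashMap<>();    // from batch layer
    static final Map<String, Long> realtimeView = new HashMap<>(); // from speed layer

    static long pageviews(String url) {
        return batchView.getOrDefault(url, 0L)
             + realtimeView.getOrDefault(url, 0L);
    }

    public static void main(String[] args) {
        batchView.put("/home", 10_000L); // precomputed from the master dataset
        realtimeView.put("/home", 42L);  // arrived since the last batch run
        System.out.println(pageviews("/home")); // 10042
    }
}
```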

Marz and Warren contend that “[t]he benefits of data systems built using the Lambda Architecture go beyond just scaling. Because your system will be able to handle much larger amounts of data, you’ll be able to collect even more data and get more value out of it. Increasing the amount and types of data you store will lead to more opportunities to mine your data, produce analytics, and build new applications.”

This book requires no previous experience with large-scale data analysis or NoSQL tools, though it helps to be somewhat familiar with traditional databases. Nathan Marz is the creator of Apache Storm and the originator of the Lambda Architecture. James Warren is an analytics architect with a background in machine learning and scientific computing.

If you think the Big Data world already is too much with us, just stick around a while. Soon, it may involve almost every aspect of our lives.

Si Dunn

HADOOP IN PRACTICE, 2nd Edition – An updated guide to handling some of the ‘trickier and dirtier aspects of Hadoop’ – #programming #bookreview


Hadoop in Practice, Second Edition

Alex Holmes

(Manning – paperback)


The Hadoop world has undergone some big changes lately, and this hefty, updated edition offers excellent coverage of a lot of what’s new. If you currently work with Hadoop and MapReduce or are planning to take them up soon, give serious consideration to adding this well-written book to your technical library. A key feature of the book is its “104 techniques.” These show how to accomplish practical and important tasks when working with Hadoop, MapReduce and their growing array of software “friends.”

The author, Alex Holmes, has been working with Hadoop for more than six years and is a software engineer, author, speaker, and blogger specializing in large-scale Hadoop projects.

His new second edition, he writes, “covers Hadoop 2, which at the time of writing is the current production-ready version of Hadoop. The first edition of the book covered Hadoop 0.22 (Hadoop 1 wasn’t yet out), and Hadoop 2 has turned the world upside-down and opened up the Hadoop platform to processing paradigms beyond MapReduce. YARN, the new scheduler and application manager in Hadoop 2, is complex and new to the community, which prompted me to dedicate a new chapter 2 to covering YARN basics and to discussing how MapReduce now functions as a YARN application.”

In the book, Holmes notes that “Parquet has also recently emerged as a new way to store data in HDFS—its columnar format can yield both space and time efficiencies in your data pipelines, and it’s quickly becoming the ubiquitous way to store data. Chapter 4 includes extensive coverage of Parquet, which includes how Parquet supports sophisticated object models such as Avro and how various Hadoop tools can use Parquet.”

Furthermore, “[h]ow data is being ingested into Hadoop has also evolved since the first edition,” Holmes points out, “and Kafka has emerged as the new data pipeline, which serves as the transport tier between your data producers and data consumers, where a consumer would be a system such as Camus that can pull data from Kafka into HDFS. Chapter 5, which covers moving data into and out of Hadoop, now includes coverage of Kafka and Camus.” [Reviewer’s note: Interesting software names here. Franz Kafka and Albert Camus were writers deeply concerned about finding clarity and meaning in a world that seemed to offer none.]

Holmes adds that “[t]here are many new technologies that YARN now can support side by side in the same cluster, and some of the more exciting and promising technologies are covered in the new part 4, titled ‘Beyond MapReduce,’ where I cover some compelling new SQL technologies such as Impala and Spark SQL. The last chapter, also new for this edition, looks at how you can write your own YARN application, and it’s packed with information about important features to support your YARN application.”

Hadoop and MapReduce have gained reputations (well-earned) for being difficult to set up, use and master. In his new edition, Holmes describes his own early experiences: “The greatest challenge we faced when working with Hadoop, and specifically MapReduce, was relearning how to solve problems with it. MapReduce is its own flavor of parallel programming, and it’s quite different from the in-JVM programming that we were accustomed to. The first big hurdle was training our brains to think MapReduce, a topic which the book Hadoop in Action by Chuck Lam (Manning Publications, 2010) covers well.”

(These days, of course, there are both open source and commercial releases of Hadoop, as well as quickstart virtual machine versions that are good for learning Hadoop.)

Holmes continues: “After one is used to thinking in MapReduce, the next challenge is typically related to the logistics of working with Hadoop, such as how to move data in and out of HDFS and effective and efficient ways to work with data in Hadoop. These areas of Hadoop haven’t received much coverage, and that’s what attracted me to the potential of this book—the chance to go beyond the fundamental word-count Hadoop uses and cover some of the trickier and dirtier aspects of Hadoop.”

If you have difficulty explaining Hadoop to others (such as a manager or executive hesitant to let it be implemented), Holmes offers a succinct summation in his updated book:

“Doug Cutting, the creator of Hadoop, likes to call Hadoop the kernel for big data, and I would tend to agree. With its distributed storage and compute capabilities, Hadoop is fundamentally an enabling technology for working with huge datasets. Hadoop provides a bridge between structured (RDBMS) and unstructured (log files, XML, text) data and allows these datasets to be easily joined together.”

One book cannot possibly cover everything you need to know about Hadoop, MapReduce, Parquet, Kafka, Camus, YARN and other technologies. And this book and its software examples assume that you have some experience with Java, XML and JSON. Yet Hadoop in Practice, Second Edition gives a very good and reasonably deep overview, spanning such major categories as background and fundamentals, data logistics, Big Data patterns, and moving beyond MapReduce.

Si Dunn


Hadoop is hot! Three new how-to books for riding the Big Data elephant – #programming #bookreview

In the world of Big Data, Hadoop has become the hard-charging elephant in the room.

Its big-name users now span the alphabet and include such notables as Amazon, eBay, Facebook, Google, the New York Times, and Yahoo. Not bad for software named after a child’s toy elephant.

Computer systems that run Hadoop can store, process, and analyze large amounts of data that have been gathered up in many different formats from many different sources.

According to the Apache Software Foundation’s Hadoop website: “The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.”

The (well-trained) user defines the Big Data problem that Hadoop will tackle. Then the software handles all aspects of job completion, including splitting the problem into small pieces and spreading them across many different computers, or nodes, in the distributed system for more efficient processing. Hadoop also handles individual node failures and collects and combines the calculated results from each node.

But you don’t need a collection of hundreds or thousands of computers to run Hadoop. You can learn it, write programs, and do some testing and debugging on a single Linux machine, Windows PC, or Mac. The open-source software can be downloaded from the Apache Hadoop website. (Do some research first. You may have to use web searches to find detailed installation instructions for your specific system.)

Hadoop is open-source software that is often described as “a Java-based framework for large-scale data processing.” It has a lengthy learning curve that includes getting familiar with Java, if you don’t already know it.

But if you are now ready and eager to take on Hadoop, Packt Publishing recently has unveiled three excellent how-to books that can help you begin and extend your mastery: Hadoop Beginner’s Guide, Hadoop MapReduce Cookbook, and Hadoop Real-World Solutions Cookbook.

Short reviews of each are presented below.

Hadoop Beginner’s Guide
Garry Turkington
(Packt Publishing – paperback, Kindle)

Garry Turkington’s new book is a detailed, well-structured introduction to Hadoop. It covers everything from the software’s three modes (local standalone mode, pseudo-distributed mode, and fully distributed mode) to running basic jobs, developing simple and advanced MapReduce programs, maintaining clusters of computers, and working with Hive, MySQL, and other tools.

“The developer focuses on expressing the transformation between source and result data sets, and the Hadoop framework manages all aspects of job execution, parallelization, and coordination,” the author writes.

He calls this capability “possibly the most important aspect of Hadoop. The platform takes responsibility for every aspect of executing the processing across the data. After the user defines the key criteria for the job, everything else becomes the responsibility of the system.”

The 374-page book is well written and provides numerous code samples and illustrations. But it has one drawback for some beginners who want to install and use Hadoop. Turkington offers step-by-step instructions for performing a Linux installation, specifically Ubuntu. However, he refers Windows and Mac users to an Apache site that offers insufficient how-to information, making web searches necessary to find more installation details.

Hadoop MapReduce Cookbook
Srinath Perera and Thilina Gunarathne
(Packt Publishing – paperback, Kindle)

MapReduce “jobs” are an essential part of how Hadoop is able to crunch huge chunks of Big Data. The Hadoop MapReduce Cookbook offers “recipes for analyzing large and complex data sets with Hadoop MapReduce.”

MapReduce is a well-known programming model for processing large sets of data. Typically, MapReduce is used within clusters of computers that are configured to perform distributed computing.

In the “Map” portion of the process, a problem is split into many subtasks that are then assigned by a master computer to individual computers known as nodes. (Nodes also can have sub-nodes.) During the “Reduce” part of the task, the master computer gathers up the processed data from the nodes, combines it, and outputs a response to the problem that was posed. (MapReduce libraries are now available for many different programming languages.)

“Hadoop is the most widely known and widely used implementation of the MapReduce paradigm,” the two authors note.
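To make the two phases concrete, here is the canonical word-count job sketched in Hadoop’s Java MapReduce API. This is my illustration, not an example taken from the cookbook:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// "Map" phase: each mapper sees one line of input and emits (word, 1) pairs.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// "Reduce" phase: the framework groups the pairs by word; each reducer
// sums the counts for one word and writes the total.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        context.write(word, new IntWritable(sum));
    }
}
```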

Their 284-page book initially shows how to run Hadoop in local mode, which “does not start any servers but does all the work within the same JVM [Java Virtual Machine]” on a standalone computer. Then, as you gain more experience with MapReduce and the Hadoop Distributed File System (HDFS), they guide you into using Hadoop in more complex, distributed-computing environments.

Echoing the Hadoop Beginner’s Guide, the authors explain how to install Hadoop on Linux machines only.

Hadoop Real-World Solutions Cookbook
Jonathan R. Owens, Jon Lentz and Brian Femiano
(Packt Publishing – paperback, Kindle)

The Hadoop Real-World Solutions Cookbook assumes you already have some experience with Hadoop. So it jumps straight into helping “developers become more comfortable with, and proficient at solving problems in, the Hadoop space.”

Its goal is to “teach readers how to build solutions using tools such as Apache Hive, Pig, MapReduce, Mahout, Giraph, HDFS, Accumulo, Redis, and Ganglia.”

The 299-page book is packed with code examples and short explanations that help solve specific types of problems. A few randomly selected problem headings:

  • “Using Apache Pig to filter bot traffic from web server logs.”
  • “Using the distributed cache in MapReduce.”
  • “Trim Outliers from the Audioscrobbler dataset using Pig and datafu.” 
  • “Designing a row key to store geographic events in Accumulo.”
  • “Enabling MapReduce jobs to skip bad records.”

The authors use a simple but effective strategy for presenting problems and solutions. First, the problem is clearly described. Then, under a “Getting Ready” heading, they spell out what you need to solve the problem. That is followed by a “How to do it…” heading where each step is presented and supported by code examples. Then, paragraphs beneath a “How it works…” heading sum up and explain how the problem was solved. Finally, a “There’s more…” heading highlights more explanations and links to additional details.

If you are a Hadoop beginner, consider the first two books reviewed above. If you have some Hadoop experience, you likely can find some useful tips in book number three.

Si Dunn

Spring Data: Modern Data Access for Enterprise Java – #java #bookreview

Spring Data: Modern Data Access for Enterprise Java
Mark Pollack, Oliver Gierke, Thomas Risberg, Jonathan L. Brisbin and Michael Hunger
(O’Reilly, paperback, Kindle)

Big Data keeps getting wider and deeper by the second. And so do the demands for analyzing and profiting from all of those piled-up terabytes.

Meanwhile, the once whiz-bang technology known as the relational database is having a very hard time keeping pace. The sheer amount of data that companies now gather, store, access, and analyze is pushing traditional relational databases to the breaking point.

Many Java developers who are trying to keep these overloaded systems held together with baling wire also are starting to learn to work with some of the “alternative data stores that are being used in mission-critical enterprise applications,” the authors of Spring Data point out.

A lot of data now is being stored elsewhere, not in relational databases. Yet companies cannot abandon what they have already gathered and invested heavily to access. So they need to keep using and supporting their relational databases, plus some newer, faster, more voracious solutions lumped under the heading “NoSQL databases” (even though you can query them).

In “the new data access landscape,” the authors note: “there is a revolution taking place, which for data geeks is quite exciting. Relational databases are not dead; they are still central to the operations of many enterprises and will remain so for quite some time. The trends, though, are very clear: new data access technologies are solving problems that traditional relational databases can’t, so we need to broaden our skill set as developers and have a foot in both camps.”

They add: “The Spring Framework has a long history of simplifying the development of Java applications, in particular for writing RDBMS-based data access layers that use Java database connectivity (JDBC) or object-relational mappers.”

Their new book “is intended to give you a hands-on introduction to the Spring Data project, whose core mission is to enable Java developers to use state-of-the-art data processing and manipulation tools but also use traditional databases in a state-of-the-art manner.”
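To give a flavor of what that looks like, here is a minimal sketch of my own (the entity and method names are hypothetical) of the repository abstraction the book covers early on: you declare an interface, and Spring Data derives the query from the method name and supplies the implementation:

```java
import java.util.List;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import org.springframework.data.repository.CrudRepository;

// A simple JPA entity to persist.
@Entity
class Customer {
    @Id @GeneratedValue Long id;
    String firstname;
    String lastname;
}

// Declaring the interface is all that is required: Spring Data generates
// the implementation at runtime and derives the query from the method name.
interface CustomerRepository extends CrudRepository<Customer, Long> {
    List<Customer> findByLastname(String lastname);
}
```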

They have organized their 288-page book into six parts and 14 chapters:

Part I – Background

  • Chapter 1 – The Spring Data Project
  • Chapter 2 – Repositories: Convenient Data Access Layers
  • Chapter 3 – Type-Safe Querying Using Querydsl

Part II – Relational Databases

  • Chapter 4 – JPA Repositories
  • Chapter 5 – Type-safe JDBC Programming with Querydsl SQL

Part III – NoSQL

  • Chapter 6 – MongoDB: A Document Store
  • Chapter 7 – Neo4j: A Graph Database
  • Chapter 8 – Redis: A Key/Value Store

Part IV – Rapid Application Development

  • Chapter 9 – Persistence Layers with Spring Roo
  • Chapter 10 – REST Repository Exporter

Part V – Big Data

  • Chapter 11 – Spring for Apache Hadoop
  • Chapter 12 – Analyzing Data with Hadoop
  • Chapter 13 – Creating Big Data Pipelines with Spring Batch and Spring Integration

Part VI – Data Grids

  • Chapter 14 – GemFire: A Distributed Data Grid

“Many of the values that have made Spring the preferred platform for enterprise Java developers deliver particular benefit in a world of fragmented persistence solutions,” states Rod Johnson, creator of the Spring Framework. Writing in the book’s foreword, he notes: “Part of the value of Spring is how it brings consistency (without descending to a lowest common denominator) in its approach to different technologies with which it integrates.

“A distinct ‘Spring way’ helps shorten the learning curve for developers and simplifies code maintenance. If you are already familiar with Spring, you will find that Spring Data eases your exploration and adoption of unfamiliar stores. If you aren’t already familiar with Spring, this is a good opportunity to see how Spring can simplify your code and make it more consistent.”

Spring Data definitely is not light reading, but it is well written and provides a good blend of procedures, steps, explanations, code samples, screenshots, and other illustrations.

Si Dunn

Ethics of Big Data – Thoughtful insights into key issues confronting big-data ‘gold mines’ – #management #bookreview

Ethics of Big Data
Kord Davis, with Doug Patterson
(O’Reilly, paperback, Kindle)

“Big Data” and how to mine it for profit are red-hot topics in today’s business world. Many corporations now find themselves sitting atop virtual gold mines of customer information. And even small businesses now are attempting to find new ways to profit from their stashes of sales, marketing, and research data. 

Like it or not, you can’t block all of the cookies or tracking companies or sites that are following you, and each time you surf the web, you leave behind a “data exhaust” trail that has monetary value to others. Indeed, one recent start-up, Enliken (“Data to the People”), is offering a way for computer users to gain some control over their data exhaust trail’s monetary value and choose who benefits from it, including some charities.

Ethics of Big Data does not seek to lay down a “hard-and-fast list of rules for the ethical handling of data.” The new book also doesn’t “tell you what to do with your data.” Its goals are “to help you engage in productive ethical discussions raised by today’s big-data-driven enterprises, propose a framework for thinking and talking about these issues, and introduce a methodology for aligning actions with values within an organization.”

It’s heady stuff, packed into just 64 pages. But the book is well written and definitely thought-provoking. It can serve as a focused guide for corporate leaders and others now hoping to get a grip on their own big-data situations, in ways that will not alienate their customers, partners, and stakeholders.

In the view of the authors: “For both individuals and organizations, four common elements define what can be considered a framework for big data:

  • “Identity – What is the relationship between our offline identity and our online identity?”
  • “Privacy – Who should control access to data?”
  • “Ownership – Who owns data, can rights be transferred, and what are the obligations of people who generate and use that data?”
  • “Reputation – How can we determine what data is trustworthy? Whether about ourselves, others, or anything else, big data exponentially increases the amount of information and ways we can interact with it. This phenomenon increases the complexity of managing how we are perceived and judged.”

Big-data technology itself is “ethically neutral,” the authors contend, and it “has no value framework. Individuals and corporations, however, do have value systems, and it is only by asking and seeking answers to ethical questions that we can ensure big data is used in a way that aligns with those values.”

At the same time: “Big data is pushing corporate action further and more fully into individual lives through the sheer volume, variety, and velocity of the data being generated. Big-data product design, development, sales, and management actions expand their influence and impact over individuals’ lives that may be changing the common meanings of words like privacy, reputation, ownership, and identity.”

What will happen next as (1) big data continues to expand and intrude and (2) people and organizations push back harder is still anybody’s guess. But matters of ethics likely will remain at the center of the conflicts.

Indeed, some big-data gold mines could suffer devastating financial and legal cave-ins if greed is allowed to trump ethics.

Si Dunn

PostgreSQL: Up and Running – Get a well-guided grip on this powerful, free database software – #bookreview

PostgreSQL: Up and Running
Regina Obe and Leo Hsu
(O’Reilly, paperback, Kindle)

PostgreSQL is both a powerful open source database system and a very flexible application platform.

“PostgreSQL allows you to write stored procedures and functions in several programming languages, and the architecture allows you the flexibility to support more languages,” this book’s two authors point out.

Indeed: “You can have functions written in several different languages participating in one query.”
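As a small illustration of my own, not from the book, here is a PL/pgSQL function being defined and then used in an ordinary SQL query over JDBC; the connection details and names are hypothetical:

```java
import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

// Defines a stored function in PL/pgSQL (one of several procedural languages
// PostgreSQL supports) and calls it from a plain SQL query via JDBC.
public class PgFunctionSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost:5432/demo"; // hypothetical database
        try (Connection conn = DriverManager.getConnection(url, "demo", "secret")) {
            try (Statement st = conn.createStatement()) {
                st.execute(
                    "CREATE OR REPLACE FUNCTION add_tax(amount numeric) " +
                    "RETURNS numeric AS $$ " +
                    "BEGIN RETURN amount * 1.08; END; " +
                    "$$ LANGUAGE plpgsql");
            }
            try (PreparedStatement ps = conn.prepareStatement("SELECT add_tax(?)")) {
                ps.setBigDecimal(1, new BigDecimal("100.00"));
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        System.out.println("With tax: " + rs.getBigDecimal(1));
                    }
                }
            }
        }
    }
}
```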

Release 9.2 of PostgreSQL hit the web Sept. 10, 2012. Regina Obe and Leo Hsu’s fine, 145-page introduction to PostgreSQL focuses on Release 9.1 but describes major 9.2 features, too. And their book definitely can be used to get you up and running. It describes “the unique features of PostgreSQL that make it stand apart from other databases…” and shows “how to use these features to solve real world problems.”

PostgreSQL is not for every database user, the writers emphasize. “PostgreSQL was designed from the ground up to be a server-side database. Many people do use it on the desktop similarly to how they use SQL Server Express or Oracle, but just like those, it cares about security management and doesn’t leave this up to the application connecting to it. As such, it’s not ideal as an embeddable database, like SQLite or Firebird.”

It also “does a lot and a lot can be daunting,” they concede. “It’s not a dumb data store; it’s a smart elephant. If all you need is a key value store or you expect your database to just sit there and hold stuff, it’s probably overkill for your needs.”

But after years of using PostgreSQL, the two writers remain unabashed fans. “Each update,” they state in their book, “treats us to new features, eases usability, brings improvements in speed, and pushes the envelope of what is possible with a database. In the end, you will wonder why you ever used any other relational database, when PostgreSQL does everything you could hope for—and does it for free.”

By the way, users of PostgreSQL 8.3 or older need to upgrade ASAP, Regina Obe and Leo Hsu urge. Release 8.3 “will be reaching end-of-life in early 2013,” making support increasingly difficult and expensive.

Si Dunn