R IN ACTION: Data Analysis and Graphics with R, 2nd Edition – #bookreview

R in Action

Data Analysis and Graphics with R

Robert I. Kabacoff

Manning – paperback

Whether data analysis is your field, your current major or your next career-change ambition, you likely should get this book. Free and open-source R is one of the world’s most popular languages for data analysis and visualization. And Robert I. Kabacoff’s updated new edition is, in my opinion, one of the top books out there for getting a handle on R. (I have used and previously reviewed several R how-to books.)

R is relatively easy to install on Windows, Mac OS X and Linux machines. But it is generally considered difficult to learn. Much of that is because of its rich abundance of features and packages, as well as its ability to create many types of graphs. “The base installation,” Kabacoff writes, “provides hundreds of data-management, statistical, and graphical functions out of the box. But some of its most powerful features come from the thousands of extensions (packages) provided by contributing authors.”

Kabacoff concedes: “It can be hard for new users to get a handle on what R is and what it can do.” And: “Even the most experienced R user is surprised to learn about features they were unaware of.”

R in Action, Second Edition, contains more than 200 pages of new material. And it is nicely structured to meet the needs of R beginners, as well as those of us who have some experience and want to gain more.

The book (579 pages in print format) is divided into five major parts. The first part, “Getting Started,” takes the beginner from installing and trying R to creating data sets, working with graphs, and managing data. Part 2, “Basic Methods,” focuses on “graphical and statistical techniques for obtaining basic information about data.”

Part 3, “Intermediate Methods,” moves the reader well beyond “describing the relationship between two variables.” It introduces regression, analysis of variance, power analysis, intermediate graphs, and resampling statistics and bootstrapping. Part 4 presents “Advanced Methods,” including generalized linear models, principal components and factor analysis, time series, cluster analysis, classification, and advanced methods for missing data.

Part 5, meanwhile, offers how-to information for “Expanding Your Skills.” The topics include: advanced graphics with ggplot2, advanced programming, creating a package, creating dynamic reports, and developing advanced graphics with the lattice package.

A key strength of R in Action, Second Edition is Kabacoff’s use of generally short code examples to illustrate many of the ways that data can be entered, manipulated, analyzed and displayed in graphical form.

The first thing I did, however, was start at the very back of the book, Appendix G, and upgrade my existing version of R to 3.2.1, “World-Famous Astronaut.” The upgrade instructions could have been a little bit clearer, but after hitting a couple of unmentioned prompts and changing a couple of wrong choices, the process turned out to be quick and smooth.

Then I started reading chapters and keying in some of the code examples. I had not used R much recently, so it was fun again to enter some commands and numbers and have nicely formatted graphs suddenly pop open on the screen.

Even better, it is nice to have a LOT of new things to learn, with a well-written, well-illustrated guidebook in hand.

Si Dunn


BIG DATA: A well-written look at principles & best practices of scalable real-time data systems – #bookreview



Big Data

Principles and best practices of scalable real-time data systems

Nathan Marz, with James Warren

Manning – paperback

Get this book, whether you are new to working with Big Data or now an old hand at dealing with Big Data’s seemingly never-ending (and steadily expanding) complexities.

You may not agree with all that the authors offer or contend in this well-written “theory” text. But Nathan Marz’s Lambda Architecture is well worth serious consideration, especially if you are now trying to come up with more reliable and more efficient approaches to processing and mining Big Data. The writers’ explanations of some of the power, problems, and possibilities of Big Data are among the clearest and best I have read.

“More than 30,000 gigabytes of data are generated every second, and the rate of data creation is only accelerating,” Marz and Warren point out.

Thus, previous “solutions” for working with Big Data are now getting overwhelmed, not only by the sheer volume of information pouring in but by greater system complexities and failures of overworked hardware that now plague many outmoded systems.

The authors have structured their book to show “how to approach building a solution to any Big Data problem. The principles you’ll learn hold true regardless of the tooling in the current landscape, and you can use these principles to rigorously choose what tools are appropriate for your application.” In other words, they write, you will “learn how to fish, not just how to use a particular fishing rod.”

Marz’s Lambda Architecture also is at the heart of Big Data, the book. It is, the two authors explain, “an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data. It describes a scalable, easy-to-understand approach to Big Data systems that can be built and run by a small team.”

The Lambda Architecture has three layers: the batch layer, the serving layer, and the speed layer.

Not surprisingly, the book likewise is divided into three parts, each focusing on one of the layers:

  • In Part 1, chapters 4 through 9 deal with various aspects of the batch layer, such as building a batch layer from end to end and implementing an example batch layer.
  • Part 2 has two chapters that zero in on the serving layer. “The serving layer consists of databases that index and serve the results of the batch layer,” the writers explain. “Part 2 is short because databases that don’t require random writes are extraordinarily simple.”
  • In Part 3, chapters 12 through 17 explore and explain the Lambda Architecture’s speed layer, which “compensates for the high latency of the batch layer to enable up-to-date results for queries.”
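
For readers who like to see an idea in code, here is a minimal sketch of how the three layers might fit together. It is my own toy illustration in Python, not code from the book: the page-view data, the function names and the in-memory "views" are all assumptions, and a real system would use distributed storage and processing tools rather than Python lists.

```python
from collections import Counter

# Hypothetical master dataset: immutable page-view events as (url, timestamp).
master_dataset = [("/home", 1), ("/about", 2), ("/home", 3)]

# Hypothetical events that arrived after the last batch run started.
recent_events = [("/home", 4), ("/pricing", 5)]


def build_batch_view(events):
    """Batch layer: recompute the view from all the data it has seen."""
    return Counter(url for url, _ in events)


def build_realtime_view(events):
    """Speed layer: cover only data the batch layer has not yet absorbed."""
    return Counter(url for url, _ in events)


def query_pageviews(url, batch_view, realtime_view):
    """Answer a query by merging the batch view with the real-time view."""
    return batch_view.get(url, 0) + realtime_view.get(url, 0)


# In this sketch the serving layer is simply the indexed batch view.
batch_view = build_batch_view(master_dataset)
realtime_view = build_realtime_view(recent_events)

home_views = query_pageviews("/home", batch_view, realtime_view)
pricing_views = query_pageviews("/pricing", batch_view, realtime_view)
```

The point of the sketch is the last step: a query is answered from both views, so the slow batch recomputation and the fast incremental update can each stay simple.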

Marz and Warren contend that “[t]he benefits of data systems built using the Lambda Architecture go beyond just scaling. Because your system will be able to handle much larger amounts of data, you’ll be able to collect even more data and get more value out of it. Increasing the amount and types of data you store will lead to more opportunities to mine your data, produce analytics, and build new applications.”

This book requires no previous experience with large-scale data analysis, nor with NoSQL tools. However, it helps to be somewhat familiar with traditional databases. Nathan Marz is the creator of Apache Storm and originator of the Lambda Architecture. James Warren is an analytics architect with a background in machine learning and scientific computing.

If you think the Big Data world already is too much with us, just stick around a while. Soon, it may involve almost every aspect of our lives.

Si Dunn

D3.js in Action: A good book packed with data visualization how-to info – #javascript #programming

D3.js in Action

Elijah Meeks

Manning – paperback


The D3.js library is very powerful, and it is full of useful choices and possibilities. But you should not try to tackle Elijah Meeks’s new book if you are a JavaScript newcomer or are not yet comfortable with HTML, CSS and JSON.

It likewise helps to understand how CSV (comma-separated values) files can be used. And you should know how to set up and run local web servers on your computer. Prior knowledge of D3.js and SVG (Scalable Vector Graphics) is not necessary, however.

Some reviewers have remarked on the amount of how-to and technical information packed into D3.js in Action. It is indeed impressive. And, yes, it really can seem like concepts, details and examples are being squirted at you from a fire hose, particularly if you are attempting to race through the text. As Elijah Meeks writes, “[T]he focus of this book is on a more exhaustive explanation of key principles of the library.”

So plan to take your time. Tackle D3.js in small bites, using the d3js.org website and this text. I am pretty new to learning data visualization, and I definitely had never heard of visualizations such as Voronoi diagrams, nor tools such as TopoJSON, until I started working my way through this book. And those are just a few of the available possibilities.

I have not yet tried all of the code examples. But the ones I have tested have worked very well, and they have gotten me thinking about how I can adapt them to use in some of my work.

I am a bit disappointed that the book takes 40 pages to get to the requisite “Hello, world” examples. And once you arrive, the explanations likely will seem a bit murky and incomplete to some readers.

However, that is a minor complaint. D3.js in Action will get frequent use as I dig deeper into data visualization. D3.js and Elijah Meeks’s new book are keepers for the long term in the big world of JavaScript.

Si Dunn

HADOOP IN PRACTICE, 2nd Edition – An updated guide to handling some of the ‘trickier and dirtier aspects of Hadoop’ – #programming #bookreview


Hadoop in Practice, Second Edition

Alex Holmes

(Manning – paperback)


The Hadoop world has undergone some big changes lately, and this hefty, updated edition offers excellent coverage of a lot of what’s new. If you currently work with Hadoop and MapReduce or are planning to take them up soon, give serious consideration to adding this well-written book to your technical library. A key feature of the book is its “104 techniques.” These show how to accomplish practical and important tasks when working with Hadoop, MapReduce and their growing array of software “friends.”

The author, Alex Holmes, has been working with Hadoop for more than six years and is a software engineer, author, speaker, and blogger specializing in large-scale Hadoop projects.

His new second edition, he writes, “covers Hadoop 2, which at the time of writing is the current production-ready version of Hadoop. The first edition of the book covered Hadoop 0.22 (Hadoop 1 wasn’t yet out), and Hadoop 2 has turned the world upside-down and opened up the Hadoop platform to processing paradigms beyond MapReduce. YARN, the new scheduler and application manager in Hadoop 2, is complex and new to the community, which prompted me to dedicate a new chapter 2 to covering YARN basics and to discussing how MapReduce now functions as a YARN application.”

In the book, Holmes notes that “Parquet has also recently emerged as a new way to store data in HDFS—its columnar format can yield both space and time efficiencies in your data pipelines, and it’s quickly becoming the ubiquitous way to store data. Chapter 4 includes extensive coverage of Parquet, which includes how Parquet supports sophisticated object models such as Avro and how various Hadoop tools can use Parquet.”
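
To make the "columnar" idea a bit more concrete, here is a small Python sketch of my own. It is not Parquet's actual file format or API. The records and function names are hypothetical, and run-length encoding is used only as a simple stand-in for the kind of compression that a column-by-column layout makes easier.

```python
# Hypothetical records, stored row by row.
rows = [
    {"user": "ann", "country": "US", "bytes": 120},
    {"user": "bob", "country": "US", "bytes": 300},
    {"user": "cho", "country": "KR", "bytes": 80},
]


def to_columns(records):
    """Pivot row-oriented records into one list of values per field."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse runs of repeated values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = to_columns(rows)

# Time efficiency: an aggregate over one field reads only that column.
total_bytes = sum(columns["bytes"])

# Space efficiency: similar values sit together and compress well.
country_runs = run_length_encode(columns["country"])
```

Summing `bytes` never touches the `user` or `country` values, and the repeated `US` entries collapse into a single run, which is roughly the space-and-time argument Holmes is making.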

Furthermore, “[h]ow data is being ingested into Hadoop has also evolved since the first edition,” Holmes points out, “and Kafka has emerged as the new data pipeline, which serves as the transport tier between your data producers and data consumers, where a consumer would be a system such as Camus that can pull data from Kafka into HDFS. Chapter 5, which covers moving data into and out of Hadoop, now includes coverage of Kafka and Camus.” [Reviewer’s note: Interesting software names here. Franz Kafka and Albert Camus were writers deeply concerned about finding clarity and meaning in a world that seemed to offer none.]

Holmes adds that “[t]here are many new technologies that YARN now can support side by side in the same cluster, and some of the more exciting and promising technologies are covered in the new part 4, titled ‘Beyond MapReduce,’ where I cover some compelling new SQL technologies such as Impala and Spark SQL. The last chapter, also new for this edition, looks at how you can write your own YARN application, and it’s packed with information about important features to support your YARN application.”

Hadoop and MapReduce have gained reputations (well-earned) for being difficult to set up, use and master. In his new edition, Holmes describes his own early experiences: “The greatest challenge we faced when working with Hadoop, and specifically MapReduce, was relearning how to solve problems with it. MapReduce is its own flavor of parallel programming, and it’s quite different from the in-JVM programming that we were accustomed to. The first big hurdle was training our brains to think MapReduce, a topic which the book Hadoop in Action by Chuck Lam (Manning Publications, 2010) covers well.”

(These days, of course, there are both open source and commercial releases of Hadoop, as well as quickstart virtual machine versions that are good for learning Hadoop.)

Holmes continues: “After one is used to thinking in MapReduce, the next challenge is typically related to the logistics of working with Hadoop, such as how to move data in and out of HDFS and effective and efficient ways to work with data in Hadoop. These areas of Hadoop haven’t received much coverage, and that’s what attracted me to the potential of this book—the chance to go beyond the fundamental word-count Hadoop uses and covering some of the trickier and dirtier aspects of Hadoop.”

If you have difficulty explaining Hadoop to others (such as a manager or executive hesitant to let it be implemented), Holmes offers a succinct summation in his updated book:

“Doug Cutting, the creator of Hadoop, likes to call Hadoop the kernel for big data, and I would tend to agree. With its distributed storage and compute capabilities, Hadoop is fundamentally an enabling technology for working with huge datasets. Hadoop provides a bridge between structured (RDBMS) and unstructured (log files, XML, text) data and allows these datasets to be easily joined together.”

One book cannot possibly cover everything you need to know about Hadoop, MapReduce, Parquet, Kafka, Camus, YARN and other technologies. And this book and its software examples assume that you have some experience with Java, XML and JSON. Yet Hadoop in Practice, Second Edition gives a very good and reasonably deep overview, spanning such major categories as background and fundamentals, data logistics, Big Data patterns, and moving beyond MapReduce.

Si Dunn



Cloudera Administration Handbook – How to become an effective Big Data administrator of large Hadoop clusters – #bookreview



Cloudera Administration Handbook

Rohit Menon

Packt Publishing – Kindle, paperback


The explosive growth and use of Big Data in business, government, science and other arenas has fueled a strong demand for new Hadoop administrators. The administrators’ key duty is to set up and maintain Hadoop clusters that help process and analyze massive amounts of information.

New Hadoop administrators and those looking to join their ranks especially will want to give good consideration to The Cloudera Administration Handbook by Rohit Menon. This is a well-organized, well-written and solidly illustrated guide to building and maintaining large Apache Hadoop clusters using Cloudera Manager and CDH5.

The author has an extensive computer science background and is a Cloudera Certified Apache Hadoop Developer. He notes that “Cloudera Inc., is a Palo Alto-based American enterprise software company that provides Apache Hadoop-based software, support and services, and training to data-driven enterprises. It is often referred to as the commercial Hadoop company.”

CDH, Menon points out, is the easy shorthand name for a rather awkward software title: “Cloudera’s Distribution Including Apache Hadoop.” CDH is “an enterprise-level distribution including Apache Hadoop and several components of its ecosystem such as Apache Hive, Apache Avro, HBase, and many more. CDH is 100 percent open source,” Menon writes.

The Cloudera Manager, meanwhile, “is a web-browser-based administration tool to manage Apache Hadoop clusters. It is the centralized command center to operate the entire cluster from a single interface. Using Cloudera Manager, the administrator gets visibility for each and every component in the cluster.”

The Cloudera Manager is not explored until nearly halfway into the book, and some may wish it had been explained sooner, since they may be trying to learn it on day one of their new job. However, Menon wants readers first to become familiar with “all the steps and operations needed to set up a cluster via the command line” at a terminal. And these are, of course, important considerations to becoming an effective, knowledgeable and versatile Hadoop administrator. (You may not always have access to Cloudera Manager while setting up or troubleshooting a cluster.)

The book’s nine chapters show its well-focused range:

  • Chapter 1: Getting Started with Apache Hadoop
  • Chapter 2: HDFS and MapReduce
  • Chapter 3: Cloudera’s Distribution Including Apache Hadoop
  • Chapter 4: Exploring HDFS Federation and Its High Availability
  • Chapter 5: Using Cloudera Manager
  • Chapter 6: Implementing Security Using Kerberos
  • Chapter 7: Managing an Apache Hadoop Cluster
  • Chapter 8: Cluster Monitoring Using Events and Alerts
  • Chapter 9: Configuring Backups

You will have to bring some hardware and software experience and skills to the table, of course. Apache Hadoop primarily is run on Linux. “So having good Linux skills such as monitoring, troubleshooting, configuration, and security is a must” for a Hadoop administrator, Menon points out. Another requirement is being able to work comfortably with the Java Virtual Machine (JVM) and understand Java exceptions.

But those skills and his Cloudera Administration Handbook can take you from “the very basics of Hadoop” to taking up “the responsibilities of a Hadoop administrator and…managing huge Hadoop clusters.”

Si Dunn

Programming MapReduce with Scalding – Using Hadoop & Scala to do some Big Data – #programming #bookreview

Programming MapReduce with Scalding

A practical guide to designing, testing, and implementing complex MapReduce applications in Scala

Antonios Chalkiopoulos

(Packt Publishing – paperback, Kindle)


Antonios Chalkiopoulos’s new book has three key goals, and it meets each of them in good, readable fashion.

It describes how MapReduce, Hadoop, and Scalding can work together. It suggests some useful design patterns and idioms. And, it provides numerous code examples of “real implementations for common use cases.”

The book also briefly introduces the Scala programming language and the Cascading platform, two elements vital to the Scalding framework.

Right here, a few brief definitions need to be offered.

According to a Wikipedia definition, MapReduce is both a programming model and “an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.”

Meanwhile, the Apache Hadoop website states: “Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.”
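
If the programming model itself is new to you, the classic word-count example may help. What follows is a single-machine Python sketch of my own, not Hadoop code and not taken from the book. The function names and sample lines are assumptions, and the whole thing runs in one process, so it shows only the shape of the model and none of the distribution or fault tolerance.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


# Hypothetical input lines standing in for a large data set.
lines = ["big data big clusters", "data on clusters"]

mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

In a real cluster, the map calls and the reduce calls would run in parallel on many nodes, with the framework handling the grouping step in between.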

And: “The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware” and “is highly fault-tolerant….”

Continuing, the Cascading.org website promotes Cascading as “the proven application development platform for building data applications on Hadoop.” Scalding, it explains, is “an extension to Cascading that enables application development with Scala, a powerful language for solving functional problems.” Indeed, Scalding is “[a] Scala API for Cascading,” and it “provides functionality from custom join algorithms to multiple APIs (Fields-based, Type-safe, Matrix) for developers to build robust data applications. Scalding is built and maintained by Twitter.”

Scalding “makes MapReduce computations look very similar to Scala’s collection API. It’s also a wrapper for Cascading to simplify jobs, tests and data sources on HDFS or local disk.”

Okay, that’s a lot to digest, especially if you are making some of your first forays into the world of Big Data.

Fortunately, Programming MapReduce with Scalding offers clear, well-illustrated, smoothly paced how-to steps, as well as easy-to-digest definitions and descriptions. It takes the reader from setting up and running a Hadoop mini-cluster and local-development environment to applying Scalding to real-use cases, as well as developing good test and test-driven development methodologies, running Scalding in production, using external data stores, and applying matrix calculations and machine learning.

The book is written for developers who have “a basic understanding” of Hadoop and MapReduce, but is also intended for experienced Hadoop developers who may be “enlightened by this alternative methodology of developing MapReduce applications with Scalding.”

In this book, “[a] Linux operating system is the preferred environment for Hadoop.” And the author includes instructions for how to install and use a Kiji Bento Box, “a zero-configuration bundle that provides a suitable environment for testing and prototyping projects that use HDFS, MapReduce, and HBase with minimal setup time.” It’s an easy way to get Apache Hadoop up and running in as little as five minutes or so.

Or, if you prefer, you can manually install the required software packages. Either way, you can learn a lot and do a lot with a Hadoop mini-cluster. And, with this book, you can get a very good handle on the Scalding API.

It does help to be somewhat familiar with MapReduce, Scalding, Scala, Hadoop, Maven, Eclipse and the Linux environment. But Antonios Chalkiopoulos does a good job of keeping the examples accessible even when readers are new to some of the packages. Still, be prepared to take your time and be prepared to do some additional research on the web and ask questions in forums, particularly if any of the required software is new to you.

(The book also can be purchased direct from Packt Publishing at http://goo.gl/Tyw4Sh.)


Si Dunn





Mule in Action, 2nd Edition – Want to be an integration developer? Here’s a good start – #bookreview


Mule in Action, Second Edition

David Dossot, John D’Emic, Victor Romero

(Manning – paperback)


An enterprise service bus (ESB) can help you link together many different types of platforms and applications–old and new–and keep them communicating and passing data between each other.

“Mule,” this book’s authors note, “is a lightweight, event-driven enterprise service bus and an integration platform and broker. As such, it resembles more a rich and diverse toolbox than a shrink-wrapped application.”

Mule in Action, Second Edition, is a comprehensive and generally well-written overview of Mule 3 and how to put its open-source building blocks together to create integration solutions and develop them with Mule. The book provides very good focus on sending, receiving, routing, and transforming data, key aspects of an ESB.

More attention, however, could have been paid to clarity and detail in Chapter 1, the all-important chapter that helps Mule newcomers get started and enthused.

This second edition is a recent update of the 2009 first edition. Unfortunately, the Mule screens have changed a bit since the book’s screen shots were created for the new edition. Therefore, some of the how-to instructions and screen images do not match what the user now sees. This gets particularly confusing while trying to learn how to configure a JMS outbound endpoint for the first time, using Mule Studio’s graphical editor. The instructions seem insufficient, and the mismatch of screens can leave a beginner unsure how to proceed.

The same goes for configuring the message setting in the Logger element. The text instructs: “You’ll set the message attribute to print a String followed by the payload of the message, using the Mule Expression Language.” But no example is given. Fortunately, a reviewer on Amazon has posted a correct procedure. In his view, the message attribute should be: We received a message: #[message.payload] (without any quote marks around it). It works.
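
For anyone puzzled by what that attribute value does, here is a toy Python sketch of the general idea: a fixed string with an embedded expression that gets replaced by a value from the current message. This is strictly my own illustration. It is not Mule and not the Mule Expression Language, and the regular expression, function names and sample payload are all assumptions.

```python
import re

# Matches expressions of the form #[name.name] such as #[message.payload].
EXPRESSION = re.compile(r"#\[([A-Za-z_][\w.]*)\]")


def resolve(path, context):
    """Walk a dotted path such as 'message.payload' through nested dicts."""
    value = context
    for part in path.split("."):
        value = value[part]
    return value


def render(template, context):
    """Replace each #[...] expression with its value from the context."""
    return EXPRESSION.sub(
        lambda match: str(resolve(match.group(1), context)), template
    )


# Hypothetical message with a simple string payload.
log_line = render(
    "We received a message: #[message.payload]",
    {"message": {"payload": "order-42"}},
)
```

The literal text passes through untouched and only the bracketed expression is evaluated, which is presumably why the attribute needs no quote marks around it.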

Of course, this book is not really aimed at beginners–it’s for developers, architects, and managers (even though there will be Mule “beginners” in those ranks). Fortunately, it soon moves away from relying solely on Mule Studio’s graphical editor. The book’s examples, as the authors note, “mostly focus on the XML configurations of flows.” Thus, there are many XML code examples to work with, plus occasional screen shots of the flows as they appear in Mule Studio. And you can use other IDEs to work with the XML, if you prefer.

Indeed, the authors note, “no functionality in the CE version of Mule is dependent on Mule Studio.”

Overall, this is a very good book, and it definitely covers a lot of ground, from “discovering” Mule to becoming a Mule developer of integration applications, and using certain tools (such as business process management systems) to augment the applications you develop. I just wish a little more how-to clarity had been delivered in Chapter 1.

Si Dunn