‘Introducing Data Science’ – A good doorway into the world of processing, analyzing & displaying Big Data – #bookreview

Introducing Data Science

Davy Cielen, Arno D. B. Meysman, and Mohamed Ali

Manning – paperback

The three authors of this book note that “[d]ata science is a very wide field, so wide indeed that a book ten times the size of this one wouldn’t be able to cover it all. For each chapter, we picked a different aspect we find interesting. Some hard decisions had to be made to keep this book from collapsing your bookshelf!”

In their decisions and selections, they have made some good choices. Introducing Data Science is well written and generally well-organized (unless you are overly impatient to get to hands-on tasks).

The book appears to be aimed primarily at individual computer users and persons contemplating possible careers in data science–not those already working in, or heading, big data centers. The book also could be good for managers and others trying to wrap their heads around some data science techniques that could help them cope with swelling mountains of business data.

With this book in hand, you may be impatient to open it to the first chapter and dive headfirst into slicing, dicing, and graphing data. Try to curb your enthusiasm for a little while. Books from Manning generally avoid the “jump in now, swim later” approach. Instead, you get some overviews, explanations and theory first. Then you start getting to the heart of the matter. Some like this approach, while others get impatient with it.

In Introducing Data Science, your “First steps in big data” start happening in chapter five, after you’ve first delved into the data science process: 1. Setting the research goal; 2. Retrieving data; 3. Data preparation; 4. Data exploration; 5. Data modeling; and 6. Presentation and automation.

The “First steps” chapter also is preceded by chapters on machine learning and how to handle large data files on a single computer.

Once you get to Chapter 5, however, your “First steps” start moving pretty quickly. You are shown how to work (at the sandbox level) with two big data applications, Hadoop and Spark. And you get examples of how even Python can be used to write big data jobs.
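
To give a flavor of what a Python big data job looks like, here is a minimal PySpark word count of my own, not an example from the book; the "local[*]" master setting and the input path are placeholders for a real cluster and dataset.

    # Minimal PySpark word count (a sketch, not code from the book).
    # Assumes Spark and the pyspark package are installed; "input.txt" is a placeholder path.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "wordcount")

    counts = (sc.textFile("input.txt")                # one RDD element per input line
                .flatMap(lambda line: line.split())   # split lines into words
                .map(lambda word: (word, 1))          # pair each word with a count of 1
                .reduceByKey(lambda a, b: a + b))     # sum the counts for each word

    for word, count in counts.take(10):               # pull a small sample back to the driver
        print(word, count)

    sc.stop()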

From there, you march on to (1) the use of NoSQL databases and graph databases, (2) text mining and text analytics, and (3) data visualization and creating a small data science application.

It should be noted and emphasized, however, that the concluding pages of chapter 1 do present “An introductory working example of Hadoop.” The authors explain how to run “a small [Hadoop] application in a big data context,” using a Hortonworks Sandbox image inside a VirtualBox.

It’s not grand, but it is a start in a book that otherwise would take four chapters to get to the first hands-on part.

Near the beginning of their book, the authors also include a worthy quote from Morpheus in “The Matrix”: “I can only show you the door. You’re the one that has to walk through it.”

This book can be a good entry door to the huge and rapidly changing field of data science, if you are willing to go through it and do the work it presents.

Si Dunn

BIG DATA: A well-written look at principles & best practices of scalable real-time data systems – #bookreview

Big Data

Principles and best practices of scalable real-time data systems

Nathan Marz, with James Warren

Manning – paperback

Get this book, whether you are new to working with Big Data or an old hand at dealing with its seemingly never-ending (and steadily expanding) complexities.

You may not agree with all that the authors offer or contend in this well-written “theory” text. But Nathan Marz’s Lambda Architecture is well worth serious consideration, especially if you are now trying to come up with more reliable and more efficient approaches to processing and mining Big Data. The writers’ explanations of some of the power, problems, and possibilities of Big Data are among the clearest and best I have read.

“More than 30,000 gigabytes of data are generated every second, and the rate of data creation is only accelerating,” Marz and Warren point out.

Thus, previous “solutions” for working with Big Data are now getting overwhelmed, not only by the sheer volume of information pouring in but by greater system complexities and failures of overworked hardware that now plague many outmoded systems.

The authors have structured their book to show “how to approach building a solution to any Big Data problem. The principles you’ll learn hold true regardless of the tooling in the current landscape, and you can use these principles to rigorously choose what tools are appropriate for your application.” In other words, they write, you will “learn how to fish, not just how to use a particular fishing rod.”

Marz’s Lambda Architecture also is at the heart of Big Data, the book. It is, the two authors explain, “an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data. It describes a scalable, easy-to-understand approach to Big Data systems that can be built and run by a small team.”

The Lambda Architecture has three layers: the batch layer, the serving layer, and the speed layer.
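
To see how those layers cooperate at query time, here is a toy Python sketch of my own (not code from the book): the serving layer exposes precomputed batch views, the speed layer holds only the recent delta, and a query merges the two.

    # Toy illustration of a Lambda Architecture query (not code from the book).
    # The batch layer precomputes views over all historical data; the speed layer
    # tracks only what arrived since the last batch run; a query merges both.

    batch_view = {"page_a": 10_000, "page_b": 7_500}   # recomputed every few hours
    speed_view = {"page_a": 42, "page_c": 3}           # covers the last few minutes

    def pageview_count(page):
        # Stale-but-complete batch result plus the fresh real-time delta.
        return batch_view.get(page, 0) + speed_view.get(page, 0)

    print(pageview_count("page_a"))   # 10042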

Not surprisingly, the book likewise is divided into three parts, each focusing on one of the layers:

  • In Part 1, chapters 4 through 9 deal with various aspects of the batch layer, such as building a batch layer from end to end and implementing an example batch layer.
  • Part 2 has two chapters that zero in on the serving layer. “The serving layer consists of databases that index and serve the results of the batch layer,” the writers explain. “Part 2 is short because databases that don’t require random writes are extraordinarily simple.”
  • In Part 3, chapters 12 through 17 explore and explain the Lambda Architecture’s speed layer, which “compensates for the high latency of the batch layer to enable up-to-date results for queries.”

Marz and Warren contend that “[t]he benefits of data systems built using the Lambda Architecture go beyond just scaling. Because your system will be able to handle much larger amounts of data, you’ll be able to collect even more data and get more value out of it. Increasing the amount and types of data you store will lead to more opportunities to mine your data, produce analytics, and build new applications.”

This book requires no previous experience with large-scale data analysis or NoSQL tools. However, it helps to be somewhat familiar with traditional databases. Nathan Marz is the creator of Apache Storm and originator of the Lambda Architecture. James Warren is an analytics architect with a background in machine learning and scientific computing.

If you think the Big Data world already is too much with us, just stick around a while. Soon, it may involve almost every aspect of our lives.

Si Dunn

HADOOP IN PRACTICE, 2nd Edition – An updated guide to handling some of the ‘trickier and dirtier aspects of Hadoop’ – #programming #bookreview

Hadoop in Practice, Second Edition

Alex Holmes

(Manning – paperback)

The Hadoop world has undergone some big changes lately, and this hefty, updated edition offers excellent coverage of a lot of what’s new. If you currently work with Hadoop and MapReduce or are planning to take them up soon, give serious consideration to adding this well-written book to your technical library. A key feature of the book is its “104 techniques.” These show how to accomplish practical and important tasks when working with Hadoop, MapReduce and their growing array of software “friends.”

The author, Alex Holmes, has been working with Hadoop for more than six years and is a software engineer, author, speaker, and blogger specializing in large-scale Hadoop projects.

His new second edition, he writes, “covers Hadoop 2, which at the time of writing is the current production-ready version of Hadoop. The first edition of the book covered Hadoop 0.22 (Hadoop 1 wasn’t yet out), and Hadoop 2 has turned the world upside-down and opened up the Hadoop platform to processing paradigms beyond MapReduce. YARN, the new scheduler and application manager in Hadoop 2, is complex and new to the community, which prompted me to dedicate a new chapter 2 to covering YARN basics and to discussing how MapReduce now functions as a YARN application.”

In the book, Holmes notes that “Parquet has also recently emerged as a new way to store data in HDFS—its columnar format can yield both space and time efficiencies in your data pipelines, and it’s quickly becoming the ubiquitous way to store data. Chapter 4 includes extensive coverage of Parquet, which includes how Parquet supports sophisticated object models such as Avro and how various Hadoop tools can use Parquet.”
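
If you want a quick feel for Parquet without standing up a Hadoop cluster, the short Python sketch below writes and re-reads a columnar file with pandas; it assumes the pandas and pyarrow packages are installed and is my illustration, not one of the book's Java-oriented examples.

    # Small Parquet round trip in Python (my illustration, not from the book).
    # Assumes pandas plus a Parquet engine such as pyarrow are installed.
    import pandas as pd

    df = pd.DataFrame({
        "user_id": [1, 2, 3],
        "country": ["US", "DE", "JP"],
        "clicks":  [12, 7, 31],
    })

    df.to_parquet("events.parquet")    # written in Parquet's compressed, columnar layout

    # Columnar storage pays off on reads: ask for two columns, and only they are scanned.
    subset = pd.read_parquet("events.parquet", columns=["country", "clicks"])
    print(subset)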

Furthermore, “[h]ow data is being ingested into Hadoop has also evolved since the first edition,” Holmes points out, “and Kafka has emerged as the new data pipeline, which serves as the transport tier between your data producers and data consumers, where a consumer would be a system such as Camus that can pull data from Kafka into HDFS. Chapter 5, which covers moving data into and out of Hadoop, now includes coverage of Kafka and Camus.” [Reviewer’s note: Interesting software names here. Franz Kafka and Albert Camus were writers deeply concerned about finding clarity and meaning in a world that seemed to offer none.]
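
For readers who have never touched Kafka, here is a bare-bones sketch of the producer/consumer hand-off Holmes describes, written with the third-party kafka-python package; the broker address and the "clickstream" topic are placeholders, and a tool such as Camus would sit on the consuming side to land the data in HDFS.

    # Bare-bones Kafka hand-off using the kafka-python package (an illustration
    # of the pipeline Holmes describes, not code from the book).
    # "localhost:9092" and "clickstream" are placeholder broker and topic names.
    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("clickstream", b'{"user": 1, "page": "/home"}')   # the data producer side
    producer.flush()

    consumer = KafkaConsumer("clickstream",
                             bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest")
    for message in consumer:          # a consumer (Camus, for example) drains the topic...
        print(message.value)          # ...and would write these records into HDFS
        break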

Holmes adds that “[t]here are many new technologies that YARN now can support side by side in the same cluster, and some of the more exciting and promising technologies are covered in the new part 4, titled ‘Beyond MapReduce,’ where I cover some compelling new SQL technologies such as Impala and Spark SQL. The last chapter, also new for this edition, looks at how you can write your own YARN application, and it’s packed with information about important features to support your YARN application.”

Hadoop and MapReduce have gained reputations (well-earned) for being difficult to set up, use and master. In his new edition, Holmes describes his own early experiences: “The greatest challenge we faced when working with Hadoop, and specifically MapReduce, was relearning how to solve problems with it. MapReduce is its own flavor of parallel programming, and it’s quite different from the in-JVM programming that we were accustomed to. The first big hurdle was training our brains to think MapReduce, a topic which the book Hadoop in Action by Chuck Lam (Manning Publications, 2010) covers well.”

(These days, of course, there are both open source and commercial releases of Hadoop, as well as quickstart virtual machine versions that are good for learning Hadoop.)

Holmes continues: “After one is used to thinking in MapReduce, the next challenge is typically related to the logistics of working with Hadoop, such as how to move data in and out of HDFS and effective and efficient ways to work with data in Hadoop. These areas of Hadoop haven’t received much coverage, and that’s what attracted me to the potential of this book—the chance to go beyond the fundamental word-count Hadoop uses and covering some of the trickier and dirtier aspects of Hadoop.”

If you have difficulty explaining Hadoop to others (such as a manager or executive hesitant to let it be implemented), Holmes offers a succinct summation in his updated book:

“Doug Cutting, the creator of Hadoop, likes to call Hadoop the kernel for big data, and I would tend to agree. With its distributed storage and compute capabilities, Hadoop is fundamentally an enabling technology for working with huge datasets. Hadoop provides a bridge between structured (RDBMS) and unstructured (log files, XML, text) data and allows these datasets to be easily joined together.”

One book cannot possibly cover everything you need to know about Hadoop, MapReduce, Parquet, Kafka, Camus, YARN and other technologies. And this book and its software examples assume that you have some experience with Java, XML and JSON. Yet Hadoop in Practice, Second Edition gives a very good and reasonably deep overview, spanning such major categories as background and fundamentals, data logistics, Big Data patterns, and moving beyond MapReduce.

Si Dunn

Optimizing Hadoop for MapReduce – A practical guide to lowering some costs of mining Big Data – #bookreview

Optimizing Hadoop for MapReduce

Learn how to configure your Hadoop cluster to run optimal MapReduce jobs

Khaled Tannir

(Packt Publishing, paperback, Kindle)

Time is money, as the old saying goes. And that saying especially applies to the world of Big Data, where much time, computing power and cash can be consumed while trying to extract profitable information from mountains of data.

This short, well-focused book by veteran software developer Khaled Tannir describes how to achieve a very important, money-saving goal: improve the efficiency of MapReduce jobs that are run with Hadoop.

As Tannir explains in his preface:

“MapReduce is an important parallel processing model for large-scale, data-intensive applications such as data mining and web indexing. Hadoop, an open source implementation of MapReduce, is widely applied to support cluster computing jobs that require low response time.

“Most of the MapReduce programs are written for data analysis and they usually take a long time to finish. Many companies are embracing Hadoop for advanced data analytics over large datasets that require time completion guarantees.

“Efficiency, especially the I/O costs of MapReduce, still needs to be addressed for successful implications. The experience shows that a misconfigured Hadoop cluster can noticeably reduce and significantly downgrade the performance of MapReduce jobs.”

Tannir’s well-focused, seven-chapter book zeroes in on how to find and fix misconfigured Hadoop clusters and numerous other problems. But first, he explains how Hadoop parameters are configured and how MapReduce metrics are monitored.

Two chapters are devoted to learning how to identify system bottlenecks, including CPU bottlenecks, storage bottlenecks, and network bandwidth bottlenecks.

One chapter examines how to properly identify resource weaknesses, particularly in Hadoop clusters. Then, as the book shifts strongly to solutions, Tannir explains how to reconfigure Hadoop clusters for greater efficiency.

Indeed, the final three chapters deliver details and steps that can help you improve how well Hadoop and MapReduce work together in your setting.

For example, the author explains how to make the map and reduce functions operate more efficiently, how to work with small or unsplittable files, how to deal with spilled records (those written to local disk when the allocated memory buffer is full), and ways to tune map and reduce parameters to improve performance.

“Most MapReduce programs are written for data analysis and they usually take a lot of time to finish,” Tannir emphasizes. However: “Many companies are embracing Hadoop for advanced data analytics over large datasets that require completion-time guarantees.” And that means “[e]fficiency, especially the I/O costs of MapReduce, still need(s) to be addressed for successful implications.”

He describes how to use compression, Combiners, the correct Writable types, and quick reuse of types to help improve memory management and the speed of job execution.
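
The combiner tip is easier to appreciate with a picture of what it saves. The toy Python simulation below (mine, not Tannir's) pre-aggregates one node's map output before the shuffle, which is essentially what a Hadoop combiner does.

    # Why a combiner helps: pre-aggregate map output on each node before the shuffle.
    # A toy simulation (not code from the book) comparing how many records would
    # cross the network with and without local combining.
    from collections import Counter

    map_output = [("error", 1)] * 5000 + [("warn", 1)] * 3000   # one node's raw map output

    # Without a combiner, every (key, 1) pair is shuffled to the reducers.
    records_without_combiner = len(map_output)

    # With a combiner, the node sums its own pairs first and ships one record per key.
    combined = Counter()
    for key, value in map_output:
        combined[key] += value
    records_with_combiner = len(combined)

    print(records_without_combiner, "records shuffled without a combiner")   # 8000
    print(records_with_combiner, "records shuffled with a combiner")         # 2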

And, along with other tips, Tannir presents several “best practices” to help manage Hadoop clusters and make them do their work quicker and with fewer demands on hardware and software resources. 

Tannir notes that “setting up a Hadoop cluster is basically the challenge of combining the requirements of high availability, load balancing, and the individual requirements of the services you aim to get from your cluster servers.”

If you work with Hadoop and MapReduce or are now learning how to help install, maintain or administer Hadoop clusters, you can find helpful information and many useful tips in Khaled Tannir’s Optimizing Hadoop for MapReduce.

Si Dunn

Scaling Big Data with Hadoop and Solr – A new how-to guide – #bigdata #java #bookreview

Scaling Big Data with Hadoop and Solr

Learn new ways to build efficient, high performance enterprise search repositories for Big Data using Hadoop and Solr
Hrishikesh Karambelkar
(Packt – paperback, Kindle)

This well-presented, step-by-step guide shows how to use Apache Hadoop and Apache Solr to work with Big Data. Author and software architect Hrishikesh Karambelkar does a good job of explaining Hadoop and Solr, and he illustrates how they can work together to tackle Big Data enterprise search projects.

“Google faced the problem of storing and processing big data, and they came up with the MapReduce approach, which is basically a divide-and-conquer strategy for distributed data processing,” Karambelkar notes. “MapReduce is widely accepted by many organizations to run their Big Data computations. Apache Hadoop is the most popular open source Apache licensed implementation of MapReduce….Apache Hadoop enables distributed processing of large datasets across a commodity of clustered servers. It is designed to scale up from single server to thousands of commodity hardware machines, each offering partial computational units and data storage.”

Meanwhile, Karambelkar adds, “Apache Solr is an open source enterprise search application which provides user abilities to search structured as well as unstructured data across the organization.”
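
To give a sense of what "search across the organization" looks like from client code, here is a minimal Python sketch that queries Solr's standard /select endpoint with the requests library; the host, the "products" core, and the query are placeholders of mine, not examples from the book.

    # Minimal Solr query over HTTP using the standard /select endpoint
    # (my placeholder host, core, and query, not an example from the book).
    import requests

    resp = requests.get(
        "http://localhost:8983/solr/products/select",
        params={"q": "title:hadoop", "rows": 5, "wt": "json"},   # query, result count, JSON output
    )

    for doc in resp.json()["response"]["docs"]:
        print(doc.get("title"))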

His book (128 pages in print format) is structured with five chapters and three appendices:

  • Chapter 1: Processing Big Data Using Hadoop MapReduce
  • Chapter 2: Understanding Solr
  • Chapter 3: Making Big Data Work for Hadoop and Solr
  • Chapter 4: Using Big Data to Build Your Large Indexing
  • Chapter 5: Improving Performance of Search while Scaling with Big Data
  • Appendix A: Use Cases for Big Data Search
  • Appendix B: Creating Enterprise Search Using Apache Solr
  • Appendix C: Sample MapReduce Programs to Build the Solr Indexes

Where the book falls short (and I have noted this about many works by computer-book publishers) is that the author simply assumes everything will go well during the process of downloading and setting up the software–and gives almost no troubleshooting hints. This can happen with books written by software experts that are also reviewed by software experts. Their systems likely are already optimized and may not throw the error messages that less-experienced users may encounter.

For example, the author states: “Installing Hadoop is a straightforward job with a default setup….” Unfortunately, there are many “flavors” and configurations of Linux running in the world. And Google searches can turn up a variety of problems others have encountered when installing, configuring and running Hadoop. Getting Solr installed and running likewise is not a simple process for everyone.

If you are ready to plunge in and start dealing with Big Data, Scaling Big Data with Hadoop and Solr definitely can give you some well-focused and important information. But heed the “Who this book is for” statement on page 2: “This book is primarily aimed at Java programmers, who wish to extend Hadoop platform to make it run as an enterprise search without prior knowledge of Apache Hadoop and Solr.”

And don’t be surprised if you have to seek additional how-to details and troubleshooting information from websites and other books, as well as from co-workers and friends who may know Linux, Java and NoSQL databases better than you do (whether you want to admit it or not).

Si Dunn

Hadoop is hot! Three new how-to books for riding the Big Data elephant – #programming #bookreview

In the world of Big Data, Hadoop has become the hard-charging elephant in the room.

Its big-name users now span the alphabet and include such notables as Amazon, eBay, Facebook, Google, the New York Times, and Yahoo. Not bad for software named after a child’s toy elephant.

Computer systems that run Hadoop can store, process, and analyze large amounts of data that have been gathered up in many different formats from many different sources.

According to the Apache Software Foundation’s Hadoop website: “The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.”

The (well-trained) user defines the Big Data problem that Hadoop will tackle. Then the software handles all aspects of the job completion, including spreading out the problem in small pieces to many different computers, or nodes, in the distributed system for more efficient processing. Hadoop also handles individual node failures, and collects and combines the calculated results from each node.

But you don’t need a collection of hundreds or thousands of computers to run Hadoop. You can learn it, write programs, and do some testing and debugging on a single Linux machine, Windows PC or Mac. The open source software can be downloaded from the Apache Hadoop website. (Do some research first. You may have to use web searches to find detailed installation instructions for your specific system.)

Hadoop is open-source software that is often described as “a Java-based framework for large-scale data processing.” It has a lengthy learning curve that includes getting familiar with Java, if you don’t already know it.

But if you are now ready and eager to take on Hadoop, Packt Publishing recently has unveiled three excellent how-to books that can help you begin and extend your mastery: Hadoop Beginner’s Guide, Hadoop MapReduce Cookbook, and Hadoop Real-World Solutions Cookbook.

Short reviews of each are presented below.

Hadoop Beginner’s Guide
Garry Turkington
(Packt Publishing – paperback, Kindle)

Garry Turkington’s new book is a detailed, well-structured introduction to Hadoop. It covers everything from the software’s three modes–local standalone mode, pseudo-distributed mode, and fully distributed mode–to running basic jobs, developing simple and advanced MapReduce programs, maintaining clusters of computers, and working with Hive, MySQL, and other tools.

“The developer focuses on expressing the transformation between source and result data sets, and the Hadoop framework manages all aspects of job execution, parallelization, and coordination,” the author writes.

He calls this capability “possibly the most important aspect of Hadoop. The platform takes responsibility for every aspect of executing the processing across the data. After the user defines the key criteria for the job, everything else becomes the responsibility of the system.”

The 374-page book is written well and provides numerous code samples and illustrations. But it has one drawback for some beginners who want to install and use Hadoop. Turkington offers step-by-step instructions for how to perform a Linux installation, specifically Ubuntu. However, he refers Windows and Mac users to an Apache site where there is insufficient how-to information. Web searches become necessary to find more installation details.

Hadoop MapReduce Cookbook
Srinath Perera and Thilina Gunarathne
(Packt Publishing – paperback, Kindle)

MapReduce “jobs” are an essential part of how Hadoop is able to crunch huge chunks of Big Data. The Hadoop MapReduce Cookbook offers “recipes for analyzing large and complex data sets with Hadoop MapReduce.”

MapReduce is a well-known programming model for processing large sets of data. Typically, MapReduce is used within clusters of computers that are configured to perform distributed computing.

In the “Map” portion of the process, a problem is split into many subtasks that are then assigned by a master computer to individual computers known as nodes. (Nodes also can have sub-nodes.) During the “Reduce” part of the task, the master computer gathers up the processed data from the nodes, combines it and outputs a response to the problem that was posed to be solved. (MapReduce libraries are now available for many different programming languages.)
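
The paragraph above compresses a lot of machinery, so here is a toy Python simulation of my own (not from the book): the "master" splits the input, a pool of worker processes stands in for the nodes running the map step, and the results are gathered, grouped by key, and reduced.

    # Toy simulation of the map -> shuffle -> reduce flow described above
    # (an illustration only, not code from the book).
    from multiprocessing import Pool
    from collections import defaultdict

    def map_task(line):
        # Map step: emit a (word, 1) pair for every word in one chunk of input.
        return [(word, 1) for word in line.split()]

    def reduce_task(item):
        # Reduce step: sum all the counts collected for one key.
        key, values = item
        return key, sum(values)

    if __name__ == "__main__":
        lines = ["big data is big", "hadoop crunches big data"]

        with Pool(processes=2) as pool:              # two worker processes play the nodes
            mapped = pool.map(map_task, lines)       # the master hands out the subtasks

        groups = defaultdict(list)                   # "shuffle": group map output by key
        for chunk in mapped:
            for key, value in chunk:
                groups[key].append(value)

        print(dict(map(reduce_task, groups.items())))   # e.g. {'big': 3, 'data': 2, ...}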

“Hadoop is the most widely known and widely used implementation of the MapReduce paradigm,” the two authors note.

Their 284-page book initially shows how to run Hadoop in local mode, which “does not start any servers but does all the work within the same JVM [Java Virtual Machine]” on a standalone computer. Then, as you gain more experience with MapReduce and the Hadoop Distributed File System (HDFS), they guide you into using Hadoop in more complex, distributed-computing environments.

Echoing the Hadoop Beginner’s Guide, the authors explain how to install Hadoop on Linux machines only.

Hadoop Real-World Solutions Cookbook
Jonathan R. Owens, Jon Lentz and Brian Femiano
(Packt Publishing – paperback, Kindle)

The Hadoop Real-World Solutions Cookbook assumes you already have some experience with Hadoop. So it jumps straight into helping “developers become more comfortable with, and proficient at solving problems in, the Hadoop space.”

Its goal is to “teach readers how to build solutions using tools such as Apache Hive, Pig, MapReduce, Mahout, Giraph, HDFS, Accumulo, Redis, and Ganglia.”

The 299-page book is packed with code examples and short explanations that help solve specific types of problems. A few randomly selected problem headings:

  • “Using Apache Pig to filter bot traffic from web server logs.”
  • “Using the distributed cache in MapReduce.”
  • “Trim Outliers from the Audioscrobbler dataset using Pig and datafu.” 
  • “Designing a row key to store geographic events in Accumulo.”
  • “Enabling MapReduce jobs to skip bad records.”

The authors use a simple but effective strategy for presenting problems and solutions. First, the problem is clearly described. Then, under a “Getting Ready” heading, they spell out what you need to  solve the problem. That is followed by a “How to do it…” heading where each step is presented and supported by code examples. Then, paragraphs beneath a “How it works…” heading sum up and explain how the problem was solved. Finally, a “There’s more…” heading highlights more explanations and links to additional details.

If you are a Hadoop beginner, consider the first two books reviewed above. If you have some Hadoop experience, you likely can find some useful tips in book number three.

Si Dunn

MapReduce Design Patterns – For solving Big Data problems – #bookreview #programming #hadoop

MapReduce Design Patterns
Donald Miner and Adam Shook
(O’Reilly – paperback, Kindle)

“MapReduce is a computing paradigm for processing data that resides on hundreds of computers,” the authors point out. It has been “popularized recently by Google, Hadoop, and many others.”

The MapReduce paradigm is “extraordinarily powerful, but does not provide a general solution to what many are calling ‘big data,’” they add, “so while it works particularly well on some problems, some are more challenging.” The authors’ focus in their new book is on using MapReduce design patterns as “templates or general guides to solving problems.”

Their new book definitely can help solve some time-crunch problems for new MapReduce developers. It brings together a variety of solutions that have emerged over time in a patchwork of blogs, books, and research papers and explains them in detail, with code samples, illustrations, and cautions about potential pitfalls.
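
To show what "pattern" means here, the little Python sketch below is my own rendering of one of the simplest, the filtering pattern: a map-only job that keeps matching records, so no reduce phase (and no shuffle) is needed at all.

    # A plain-Python sketch of the filtering pattern (my illustration, not code
    # from the book): a map-only job that emits just the records of interest.
    import re

    log_lines = [
        "2013-01-07 10:01:12 INFO  job started",
        "2013-01-07 10:01:19 ERROR disk full on node-07",
        "2013-01-07 10:02:03 INFO  job finished",
    ]

    pattern = re.compile(r"\bERROR\b")

    def map_filter(record):
        # Emit the record unchanged if it matches; emit nothing otherwise.
        return record if pattern.search(record) else None

    filtered = [r for r in map(map_filter, log_lines) if r is not None]
    print(filtered)   # only the ERROR line survives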

You can’t simply cut and paste solutions from the chapters. But the two writers do “hope that you will find a pattern to get you at least 90% of the way for just about all of your challenges.”

Six of the book’s eight chapters focus on specific types of design patterns:

  • Summarization Patterns
  • Filtering Patterns
  • Data Organization Patterns
  • Join Patterns
  • Metapatterns
  • Input and Output Patterns

“The MapReduce world is in a state similar to the object-oriented world before 1994,” the authors point out. “Patterns today are scattered across blogs, websites such as StackOverflow, deep inside other books, and inside very advanced technology teams at organizations across the world.”

They add that “[t]he intent of this book is not to provide some groundbreaking new ways to solve problems with MapReduce….” but to offer, instead, a collection of “patterns that have been developed by veterans in the field so they can be shared with everyone else.”

The book’s code samples are all written in Hadoop, and the two writers deal with the question of “why should we use Java MapReduce in Hadoop at all when we have options like Pig and Hive,” which reduce the need for MapReduce patterns.

There is “conceptual value,” they state, “in understanding the lower level workings of a system like MapReduce.” Furthermore, “Using Pig or Hive without understanding MapReduce can lead to some dangerous situations.” And, Pig and Hive cannot yet “tackle all of the problems in the ways that Java MapReduce can. This will surely change over time….”

If you are new to MapReduce, this useful and informative book from Donald Miner and Adam Shook can be the next best thing to having MapReduce experts at your side.

MapReduce Design Patterns can save you time and effort, steer you away from dead ends, and help give you solid understandings of the powerful MapReduce paradigm.

Si Dunn

Spring Data: Modern Data Access for Enterprise Java – #java #bookreview

Spring Data: Modern Data Access for Enterprise Java
Mark Pollack, Oliver Gierke, Thomas Risberg, Jonathan L. Brisbin and Michael Hunger
(O’Reilly, paperback, Kindle)

Big Data keeps getting wider and deeper by the second. And so do the demands for analyzing and profiting from all of those piled-up terabytes.

Meanwhile, the once whiz-bang technology known as the relational database is having a very hard time keeping pace. The sheer amount of data that companies now gather, store, access, and analyze is pushing traditional relational databases to the breaking point.

Many Java developers who are trying to keep these overloaded systems held together with baling wire also are starting to learn to work with some of the “alternative data stores that are being used in mission-critical enterprise applications,” the authors of Spring Data point out.

A lot of data now is being stored elsewhere and not in relational databases. Yet companies cannot abandon what they have already gathered and invested heavily to access. So they need to keep using and supporting their relational databases, plus some newer, faster, more voracious solutions lumped under the heading “NoSQL databases” (even though you can query them).

In “the new data access landscape,” the authors note: “there is a revolution taking place, which for data geeks is quite exciting. Relational databases are not dead; they are still central to the operations of many enterprises and will remain so for quite some time. The trends, though, are very clear: new data access technologies are solving problems that traditional relational databases can’t, so we need to broaden our skill set as developers and have a foot in both camps.”

They add: “The Spring Framework has a long history of simplifying the development of Java applications, in particular for writing RDBMS-based data access layers that use Java database connectivity (JDBC) or object-relational mappers.”

Their new book “is intended to give you a hands-on introduction to the Spring Data project, whose core mission is to enable Java developers to use state-of-the-art data processing and manipulation tools but also use traditional databases in a state-of-the-art manner.”

They have organized their 288-page book into six parts and 14 chapters:

Part I – Background

  • Chapter 1 – The Spring Data Project
  • Chapter 2 – Repositories: Convenient Data Access Layers
  • Chapter 3 – Type-Safe Querying Using Querydsl

Part II – Relational Databases

  • Chapter 4 – JPA Repositories
  • Chapter 5 – Type-safe JDBC Programming with Querydsl SQL

Part III – NoSQL

  • Chapter 6 – MongoDB: A Document Store
  • Chapter 7 – Neo4j: A Graph Database
  • Chapter 8 – Redis: A Key/Value Store

Part IV – Rapid Application Development

  • Chapter 9 – Persistence Layers with Spring Roo
  • Chapter 10 – REST Repository Exporter

Part V – Big Data

  • Chapter 11 – Spring for Apache Hadoop
  • Chapter 12 – Analyzing Data with Hadoop
  • Chapter 13 – Creating Big Data Pipelines with Spring Batch and Spring Integration

Part VI – Data Grids

  • Chapter 14 – GemFire: A Distributed Data Grid

“Many of the values that have made Spring the preferred platform for enterprise Java developers deliver particular benefit in a world of fragmented persistence solutions,” states Rod Johnson, creator of the Spring Framework. Writing in the book’s foreword, he notes: “Part of the value of Spring is how it brings consistency (without descending to a lowest common denominator) in its approach to different technologies with which it integrates.

“A distinct ‘Spring way’ helps shorten the learning curve for developers and simplifies code maintenance. If you are already familiar with Spring, you will find that Spring Data eases your exploration and adoption of unfamiliar stores. If you aren’t already familiar with Spring, this is a good opportunity to see how Spring can simplify your code and make it more consistent.”

Spring Data definitely is not light reading, but it is well-written, and provides a good blending of procedures, steps, explanations, code samples, screenshots and other illustrations.

Si Dunn

Big Data Book Blast: Hadoop, Hive…and Python??? – #programming #bookreview

Big Data is hothotHOT. And O’Reilly recently has added three new books of potential interest to Big Data workers, as well as those hoping to join their ranks.

Hadoop, Hive and (surprise!) Python are just a few of the hot tools you may encounter in the rapidly expanding sea of data now being gathered, explored, stored, and manipulated by companies, organizations, institutions, governments, and individuals around the planet. Here are the books:

Hadoop Operations
Eric Sammer
(O’Reilly, paperback, Kindle)

“Companies are storing more data from more sources in more formats than ever before,” writes Eric Sammer, a Hadoop expert who is principal solution architect at Cloudera. But gathering and stockpiling data is only “one half of the equation,” he adds. “Processing that data to produce information is fundamental to the daily operations of every modern business.”

Enter Apache Hadoop, a “pragmatic, cost-effective, scalable infrastructure” that increasingly is being used to develop Big Data applications for storing and processing information.

“Made up of a distributed filesystem called the Hadoop Distributed Filesystem (HDFS) and a computation layer that implements a processing paradigm called MapReduce, Hadoop is an open source, batch data processing system for enormous amounts of data. We live in a flawed world, and Hadoop is designed to survive in it by not only tolerating hardware and software failures, but also treating them as first-class conditions that happen regularly.”

Sammer adds: “Hadoop uses a cluster of plain old commodity servers with no specialized hardware or network infrastructure to form a single, logical, storage and compute platform, or cluster, that can be shared by multiple individuals or groups. Computation in Hadoop MapReduce is performed in parallel, automatically, with a simple abstraction for developers that obviates complex synchronization and network programming. Unlike many other distributed data processing systems, Hadoop runs the user-provided processing logic on the machine where the data lives rather than dragging the data across the network; a huge win for performance.”
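
To make the storage half of that description a little more concrete, here is a tiny sketch that drives the standard "hdfs dfs" command-line tool from Python; it assumes a working Hadoop installation on the path, and the file and directory names are placeholders of mine, not examples from the book.

    # Tiny sketch of basic HDFS operations, driven through the standard
    # "hdfs dfs" command-line tool. Assumes Hadoop is installed and configured;
    # the local file and HDFS paths are placeholders.
    import subprocess

    def hdfs(*args):
        # Run one "hdfs dfs" subcommand and raise an error if it fails.
        subprocess.run(["hdfs", "dfs", *args], check=True)

    hdfs("-mkdir", "-p", "/data/raw")           # create a directory in the cluster
    hdfs("-put", "events.log", "/data/raw/")    # copy a local file into HDFS
    hdfs("-ls", "/data/raw")                    # list what landed there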

Sammer’s new, 282-page book is well written and focuses on running Hadoop in production, including planning its use, installing it, configuring the system and providing ongoing maintenance. He also shows “what works, as demonstrated in crucial deployments.”

If you’re new to Hadoop or still getting a handle on it, you need Hadoop Operations. And even if you’re now an “old” hand at Hadoop, you likely can learn new things from this book. “It’s an extremely exciting time to get into Apache Hadoop,” Sammer states.

Programming Hive
Edward Capriolo, Dean Wampler, and Jason Rutherglen
(O’Reilly, paperback, Kindle)

“Hive,” the three authors point out, “provides an SQL dialect, called Hive Query Language (abbreviated HiveQL or just HQL), for querying data stored in a Hadoop cluster.”

They add: “Hive is most suited for data warehouse applications, where relatively static data is analyzed, fast response times are not required, and when data is not changing rapidly.”

Their well-structured and well-written book shows how to install and test Hadoop and Hive on a personal workstation – “a convenient way to learn and experiment with Hadoop.” Then it shows “how to configure Hive for use on Hadoop clusters.”

They also provide a brief overview of Hadoop and MapReduce before diving into Hive’s command-line interface (CLI) and introductory aspects such as how to embed lines of comments in Hive v0.8.0 and later.

From there, the book flows smoothly into HiveQL and how to use its SQL dialect to query, summarize, and analyze large datasets that Hadoop has stored in its distributed filesystem.
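
For a rough idea of what that looks like from a program rather than Hive's own command line, here is a short sketch that runs a HiveQL aggregation from Python using the third-party PyHive package; the HiveServer2 host and port, the page_views table, and its columns are all placeholders of mine, not examples from the book.

    # Rough sketch of running a HiveQL query from Python with the PyHive package.
    # Assumes a HiveServer2 endpoint on localhost:10000 and a "page_views" table
    # with "country" and "views" columns; all of these are placeholders.
    from pyhive import hive

    conn = hive.connect(host="localhost", port=10000)
    cursor = conn.cursor()

    cursor.execute("""
        SELECT country, SUM(views) AS total_views
        FROM page_views
        GROUP BY country
        ORDER BY total_views DESC
        LIMIT 10
    """)

    for country, total_views in cursor.fetchall():
        print(country, total_views)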

User documentation for Hive and Hadoop has been sparse, so Programming Hive definitely fills a solid need. Significantly, the final chapter presents several “Case Study Examples from the User Trenches” where real companies explain how they have used Hive to solve some very challenging problems involving Big Data.

Python for Data Analysis
Wes McKinney
(O’Reilly, paperback, Kindle)

No, Python is not the first language many people think of when picturing large data analysis projects. For one thing, it’s an interpreted language, so Python code runs a lot slower than code written in compiled programming languages such as C++ or Java.

Also, the author concedes, “Python is not an ideal language for highly concurrent, multithreaded applications, particularly applications with many CPU-bound threads.” The software’s global interpreter lock (GIL) “prevents the interpreter from executing more than one Python bytecode instruction at a time.”

Thus, Python will not soon be challenging Hadoop to a Big Data petabyte speed duel.

On the other hand, Python is reasonably easy to learn, and it has strong and widespread support within the scientific and academic communities, where a lot of data must get crunched at a reasonable clip, if not at blinding speed.

And Wes McKinney is the main author of pandas, Python’s increasingly popular open source library for data analysis. The library is “designed to make working with structured data fast, easy, and expressive.”

His book makes a good case for using Python in at least some Big Data situations. “In recent years,” he states, “Python’s improved library support (primarily pandas) has made it a strong alternative for data manipulation tasks. Combined with Python’s strength in general purpose programming, it is an excellent choice as a single language for building data-centric applications.”

Much of this well-written, well-illustrated book “focuses on high-performance array-based computing tools for working with large data sets.” It uses a case-study-examples approach to demonstrate how to tackle a wide range of data analysis problems, using Python libraries that include pandas, NumPy, matplotlib, and IPython, “the component in the standard scientific Python toolset that ties everything together.”
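
If you have never seen pandas in action, here is a quick, self-contained taste of my own (not an example from the book) of the kind of grouping and summarizing the library makes easy.

    # A quick taste of pandas-style analysis (my example, not one from the book).
    import pandas as pd

    sales = pd.DataFrame({
        "region":  ["north", "south", "north", "south", "west"],
        "product": ["a", "a", "b", "b", "a"],
        "revenue": [120.0, 80.0, 200.0, 150.0, 95.0],
    })

    # Group, aggregate, and sort in one expressive chain.
    summary = (sales.groupby("region")["revenue"]
                    .agg(["count", "sum", "mean"])
                    .sort_values("sum", ascending=False))
    print(summary)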

By the way, if you have never programmed in Python, check out the end of McKinney’s book. An appendix titled “Python Language Essentials” gives a good overview of the language, with a specific bias toward “processing and manipulating structured and unstructured data.”

If you do scientific, academic, or business computing and need to crunch and visualize a lot of data, definitely check out Python for Data Analysis.

You may be pleasantly surprised at how well and how easily Python and its data-analysis libraries can do the job.

Si Dunn

Hadoop: The Definitive Guide, Third Edition – Big Tools for Big Data – #programming #bookreview

Hadoop: The Definitive Guide, Third Edition
Tom White
(O’Reilly, paperback, list price $49.99; Kindle edition, list price $39.99)

“The good news is that Big Data is here,” Tom White writes in this revised and updated third edition to Hadoop’s “definitive guide.” But: “The bad news is that we are struggling to store and analyze it.”

Indeed, Big Data is now being measured in zettabytes, which is “equivalently one thousand exabytes, one million petabytes, or one billion terabytes,” White says. And all of us are creating, storing and trying to benefit from expanding amounts of data each day.

Enter Hadoop, “a reliable shared storage and analysis system. The storage is provided by HDFS [the Hadoop Distributed File System] and the analysis by MapReduce. There are other parts to Hadoop,” White emphasizes, “but these capabilities are its kernel.”

Hadoop (it’s not an acronym; simply the name of a child’s toy elephant) is a complex software framework. But, White says: “Stripped to its core, the tools that Hadoop provides for building distributed systems—for data storage, data analysis, and coordination—are simple. If there’s a common theme, it’s about raising the level of abstraction—to create building blocks for programmers who just happen to have lots of data to store, or lots of data to analyze, or lots of machines to coordinate, and who don’t have the time, the skill, or the inclination to become distributed systems experts to build the infrastructure to handle it.”

This new edition covers recent changes and additions to Hadoop, including the MapReduce API and new MapReduce 2 runtime, “which is built on a new distributed resource management system called YARN.” Several chapters related to MapReduce and other topics also have been added or expanded.

Hadoop can run MapReduce programs written in a variety of languages, including Java, Ruby, Python, and C++. And: “MapReduce programs are inherently parallel, thus putting very large-scale data analysis into the hands of anyone with enough machines at her disposal.” Hadoop, meanwhile, provides powerful parallel processing capabilities.
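
That multi-language support comes through Hadoop Streaming, where the mapper and reducer are ordinary programs that read lines on standard input and write tab-separated key/value pairs on standard output. Below is a bare-bones Python pair, combined into one file for brevity; it is a sketch in the spirit of the book's streaming examples, not copied from it, and it simply finds the largest value recorded for each key.

    # Hadoop Streaming mapper and reducer, sketched in one file (a sketch in the
    # spirit of the book's streaming examples, not copied from it).
    # Streaming tasks read lines on stdin and write "key<TAB>value" lines on stdout.
    import sys

    def mapper(lines):
        # Input lines like "sensor-12 37.4" become "sensor-12<TAB>37.4".
        for line in lines:
            key, value = line.split()
            print(f"{key}\t{value}")

    def reducer(lines):
        # Hadoop sorts map output by key, so all values for one key arrive together.
        current_key, current_max = None, None
        for line in lines:
            key, value = line.rstrip("\n").split("\t")
            if key != current_key:
                if current_key is not None:
                    print(f"{current_key}\t{current_max}")
                current_key, current_max = key, float(value)
            else:
                current_max = max(current_max, float(value))
        if current_key is not None:
            print(f"{current_key}\t{current_max}")

    if __name__ == "__main__":
        role = sys.argv[1] if len(sys.argv) > 1 else "map"
        (mapper if role == "map" else reducer)(sys.stdin)

On a cluster, a pair like this would be handed to the hadoop-streaming JAR with its -input, -output, -mapper, and -reducer options; the point is simply that nothing about the tasks themselves is Java-specific.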

Hadoop increasingly is being employed by companies and organizations that must deal with processing, analyzing, and storing very large amounts of data. White’s book includes some case studies that explain Hadoop’s role in solving several Big Data challenges.

Hadoop: The Definitive Guide, Third Edition is not a beginner’s how-to book. But it’s definitely recommended for “programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.”

Si Dunn