BIG DATA: A well-written look at principles & best practices of scalable real-time data systems – #bookreview

 

 

Big Data

Principles and best practices of scalable real-time data systems

Nathan Marz, with James Warren

Manning – paperback

Get this book, whether you are new to working with Big Data or now an old hand at dealing with Big Data’s seemingly never-ending (and steadily expanding) complexities.

You may not agree with all that the authors offer or contend in this well-written “theory” text. But Nathan Marz’s Lambda Architecture is well worth serious consideration, especially if you are now trying to come up with more reliable and more efficient approaches to processing and mining Big Data. The writers’ explanations of some of the power, problems, and possibilities of Big Data are among the clearest and best I have read.

“More than 30,000 gigabytes of data are generated every second, and the rate of data creation is only accelerating,” Marz and Warren point out.

Thus, previous “solutions” for working with Big Data are now getting overwhelmed, not only by the sheer volume of information pouring in but by greater system complexities and failures of overworked hardware that now plague many outmoded systems.

The authors have structured their book to show “how to approach building a solution to any Big Data problem. The principles you’ll learn hold true regardless of the tooling in the current landscape, and you can use these principles to rigorously choose what tools are appropriate for your application.” In other words, they write, you will “learn how to fish, not just how to use a particular fishing rod.”

Marz’s Lambda Architecture also is at the heart of Big Data, the book. It is, the two authors explain, “an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data. It describes a scalable, easy-to-understand approach to Big Data systems that can be built and run by a small team.”

The Lambda Architecture has three layers: the batch layer, the serving layer, and the speed layer.

Not surprisingly, the book likewise is divided into three parts, each focusing on one of the layers:

  • In Part 1, chapters 4 through 9 deal with various aspects of the batch layer, such as building a batch layer from end to end and implementing an example batch layer.
  • Part 2 has two chapters that zero in on the serving layer. “The serving layer consists of databases that index and serve the results of the batch layer,” the writers explain. “Part 2 is short because databases that don’t require random writes are extraordinarily simple.”
  • In Part 3, chapters 12 through 17 explore and explain the Lambda Architecture’s speed layer, which “compensates for the high latency of the batch layer to enable up-to-date results for queries.”
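
To make the three layers concrete, here is a minimal sketch of my own in Python; it is not taken from the book. The page-view events, the function names, and the SpeedLayer class are all hypothetical, and in-memory counters stand in for the clustered tools the authors actually describe. The batch layer recomputes a view from the full dataset, the speed layer counts only the events the last batch run has not seen, and a query merges the two.

```python
from collections import Counter

# Hypothetical, append-only master dataset of page-view events.
master_dataset = [
    {"url": "/home", "ts": 1},
    {"url": "/about", "ts": 2},
    {"url": "/home", "ts": 3},
]


def build_batch_view(events):
    """Batch layer: recompute the whole view from scratch over all events."""
    return Counter(event["url"] for event in events)


class SpeedLayer:
    """Speed layer: incrementally counts events newer than the last batch run."""

    def __init__(self):
        self.realtime_view = Counter()

    def ingest(self, event):
        self.realtime_view[event["url"]] += 1


def query(batch_view, realtime_view, url):
    """Serving side: answer a query by merging the batch and realtime views."""
    return batch_view[url] + realtime_view[url]


batch_view = build_batch_view(master_dataset)

speed = SpeedLayer()
speed.ingest({"url": "/home", "ts": 4})  # arrived after the batch run

print(query(batch_view, speed.realtime_view, "/home"))  # 2 from batch + 1 realtime = 3
```

In this toy version, the next batch run would absorb the new event into the master dataset and the realtime counter would be discarded, which is roughly how the speed layer "compensates" for batch latency.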

Marz and Warren contend that “[t]he benefits of data systems built using the Lambda Architecture go beyond just scaling. Because your system will be able to handle much larger amounts of data, you’ll be able to collect even more data and get more value out of it. Increasing the amount and types of data you store will lead to more opportunities to mine your data, produce analytics, and build new applications.”

This book requires no previous experience with large-scale data analysis, nor with NoSQL tools. However, it helps to be somewhat familiar with traditional databases. Nathan Marz is the creator of Apache Storm and originator of the Lambda Architecture. James Warren is an analytics architect with a background in machine learning and scientific computing.

If you think the Big Data world already is too much with us, just stick around a while. Soon, it may involve almost every aspect of our lives.

Si Dunn

Making Sense of NoSQL – A balanced, well-written overview – #bigdata #bookreview

Making Sense of NoSQL

A Guide for Managers and the Rest of Us
Dan McCreary and Ann Kelly
(Manning, paperback)

This is NOT a how-to guide for learning to use NoSQL software and build NoSQL databases. It is a meaty, well-structured overview aimed primarily at “technical managers, [software] architects, and developers.” However, it also is written to appeal to other, not-so-technical readers who are curious about NoSQL databases and where NoSQL could fit into the Big Data picture for their business, institution, or organization.

Making Sense of NoSQL definitely lives up to its subtitle: “A guide for managers and the rest of us.”

Many executives, managers, consultants and others today are dealing with expensive questions related to Big Data, primarily how it affects their current databases, database management systems, and the employees and contractors who maintain them. A variety of problems can fall upon those who operate and update big relational (SQL) databases and their huge arrays of servers pieced together over years or decades.

The authors, Dan McCreary and Ann Kelly, are strong proponents, obviously, of the NoSQL approach. It offers, they note, “many ways to allow you to grow your database without ever having to shut down your servers.” However, they also realize that NoSQL may not be a good, or affordable, choice in many situations. Indeed, a blending of SQL and NoSQL systems may be a better choice. Or, making changes from SQL to NoSQL may not be financially feasible at all. So they have structured their book into four parts that attempt to help readers “objectively evaluate SQL and NoSQL database systems to see which business problems they solve.”

Part 1 provides an overview of NoSQL, its history, and its potential business benefits. Part 2 focuses on “database patterns,” including “legacy database patterns (which most solution architects are familiar with), NoSQL patterns, and native XML databases.” Part 3 examines “how NoSQL solutions solve the real-world business problems of big data, search, high availability, and agility.” And Part 4 looks at “two advanced topics associated with NoSQL: functional programming and system security.”

McCreary and Kelly observe that “[t]he transition to functional programming requires a paradigm shift away from software designed to control state and toward software that has a focus on independent data transformation.” (Erlang, Scala, and F# are some of the functional languages that they highlight.) And, they contend: “It’s no longer sufficient to design a system that will scale to 2, 4, or 8 core processors. You need to ask if your architecture will scale to 100, 1,000, or even 10,000 processors.”
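
A small sketch may help show what “independent data transformation” looks like in practice. This is my own illustration in Python, not one of the functional languages the authors highlight, and the normalize function and sample records are hypothetical. Because the function depends only on its input and mutates nothing, the same work can be handed to a pool of workers without changing the result. Threads are used here only for brevity.

```python
from concurrent.futures import ThreadPoolExecutor


def normalize(record):
    """Pure transformation: returns a new dict and never mutates its input."""
    return {
        "name": record["name"].strip().title(),
        "visits": record["visits"] + 1,
    }


# Hypothetical sample records, invented for illustration.
records = [
    {"name": " ada ", "visits": 1},
    {"name": "grace hopper", "visits": 4},
]

# Sequential version: a plain map over the data.
sequential = list(map(normalize, records))

# Same function, spread across two workers; no shared state to coordinate.
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = list(pool.map(normalize, records))

print(sequential == parallel)
```

The design point is that nothing in normalize needs to know how many workers exist, which is the property that lets this style scale out to many processors.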

Meanwhile, various security challenges can arise as a NoSQL database “becomes popular and is used by multiple projects” across “department trust boundaries.”

Computer science students, software developers, and others who are trying to stay knowledgeable about Big Data technology and issues should also consider reading this well-written book.

Si Dunn

Jump Start Sinatra – With this book and a little Ruby, you can make Sinatra sing – #programming #bookreview

Jump Start Sinatra
Get Up to Speed with Sinatra in a Weekend
Darren Jones
(SitePoint – Kindle, Paperback)

Many Ruby developers love Rails for its power and capabilities as a model-view-controller (MVC) framework. But some of them don’t like Rails’ size, complexity, and learning curve.

Meanwhile, many other Rubyists love Sinatra for its simplicity and ease of learning, plus its ability “to create a fully functional web app in just one file,” says Darren Jones in his new book, Jump Start Sinatra. “There are no complicated setup procedures or configuration to worry about. You can just open up a text editor and get started with minimal effort, leaving you to focus on the needs of your application.”

Jones does not temper his enthusiasm for Sinatra, adding that “there isn’t a single line of bloat anywhere in its source code, which weighs in at fewer than 2,000 lines!”

His 150-page book covers a lot of ground, from downloading and installing Sinatra to building websites, working with SQLite, Heroku, Rack, jQuery, and Git, and even using some CoffeeScript (to avoid “getting our hands dirty writing JavaScript…”). He also shows how to create modular Sinatra applications that use separate classes.

“Sinatra makes it easy–trivial almost–to build sites, services, and web apps using Ruby,” the author states. “A Sinatra application is basically made up of one or more Ruby files. You don’t need to be an expert Rubyist to use Sinatra, but the more Ruby you know, the better you’ll be at building Sinatra apps.”

Jones adds: “Unlike Ruby on Rails, Sinatra is definitely not a framework. It’s without conventions and imposes no file structure on you whatsoever. Sinatra apps are basically just Ruby programs; what Sinatra does is connect them to the Web. Rather than hide behind lots of magic, it exposes the way the Web works by making the key concepts of HTTP verbs and URLs an explicit part of it.”

Jump Start Sinatra is a well-written, appropriately illustrated guide to getting started with this popular free software. Ruby newcomers may wish for a few more how-to steps or code examples. But the counterargument is, if you’re brand-new to Ruby, save Sinatra for later; focus on learning Ruby first.

Darren Jones does not buy into a common assessment that’s often heard when developers are asked their views of Rails vs. Sinatra. “Opinions abound that Sinatra can only be used for small applications or simple APIs, but this simply isn’t true,” he argues. “While it is a perfect fit for these tasks, Sinatra also scales impressively, demonstrated by the fact that it’s been used to power some big production sites.”

Some of those “big production sites,” according to Wikipedia, include such notables as Apple, LinkedIn, the BBC, the British government, Heroku, and GitHub.

Si Dunn

Spring Data: Modern Data Access for Enterprise Java – #java #bookreview

Spring Data: Modern Data Access for Enterprise Java
Mark Pollack, Oliver Gierke, Thomas Risberg, Jonathan L. Brisbin and Michael Hunger
(O’Reilly, paperback, Kindle)

Big Data keeps getting wider and deeper by the second. And so do the demands for analyzing and profiting from all of those piled-up terabytes.

Meanwhile, the once whiz-bang technology known as the relational database is having a very hard time keeping pace. The sheer amount of data that companies now gather, store, access, and analyze is pushing traditional relational databases to the breaking point.

Many Java developers who are trying to hold these overloaded systems together with baling wire also are starting to learn to work with some of the “alternative data stores that are being used in mission-critical enterprise applications,” the authors of Spring Data point out.

A lot of data now is being stored elsewhere and not in relational databases. Yet companies cannot abandon what they have already gathered and invested heavily to access. So they need to keep using and supporting their relational databases, plus some newer, faster, more voracious solutions lumped under the heading “NoSQL databases” (even though you can query them).

In “the new data access landscape,” the authors note: “there is a revolution taking place, which for data geeks is quite exciting. Relational databases are not dead; they are still central to the operations of many enterprises and will remain so for quite some time. The trends, though, are very clear: new data access technologies are solving problems that traditional relational databases can’t, so we need to broaden our skill set as developers and have a foot in both camps.”

They add: “The Spring Framework has a long history of simplifying the development of Java applications, in particular for writing RDBMS-based data access layers that use Java database connectivity (JDBC) or object-relational mappers.”

Their new book “is intended to give you a hands-on introduction to the Spring Data project, whose core mission is to enable Java developers to use state-of-the-art data processing and manipulation tools but also use traditional databases in a state-of-the-art manner.”

They have organized their 288-page book into six parts and 14 chapters:

Part I – Background

  • Chapter 1 – The Spring Data Project
  • Chapter 2 – Repositories: Convenient Data Access Layers
  • Chapter 3 – Type-Safe Querying Using Querydsl

Part II – Relational Databases

  • Chapter 4 – JPA Repositories
  • Chapter 5 – Type-safe JDBC Programming with Querydsl SQL

Part III – NoSQL

  • Chapter 6 – MongoDB: A Document Store
  • Chapter 7 – Neo4j: A Graph Database
  • Chapter 8 – Redis: A Key/Value Store

Part IV – Rapid Application Development

  • Chapter 9 – Persistence Layers with Spring Roo
  • Chapter 10 – REST Repository Exporter

Part V – Big Data

  • Chapter 11 – Spring for Apache Hadoop
  • Chapter 12 – Analyzing Data with Hadoop
  • Chapter 13 – Creating Big Data Pipelines with Spring Batch and Spring Integration

Part VI – Data Grids

  • Chapter 14 – GemFire: A Distributed Data Grid

“Many of the values that have made Spring the preferred platform for enterprise Java developers deliver particular benefit in a world of fragmented persistence solutions,” states Rod Johnson, creator of the Spring Framework. Writing in the book’s foreword, he notes: “Part of the value of Spring is how it brings consistency (without descending to a lowest common denominator) in its approach to different technologies with which it integrates.

“A distinct ‘Spring way’ helps shorten the learning curve for developers and simplifies code maintenance. If you are already familiar with Spring, you will find that Spring Data eases your exploration and adoption of unfamiliar stores. If you aren’t already familiar with Spring, this is a good opportunity to see how Spring can simplify your code and make it more consistent.”

Spring Data definitely is not light reading, but it is well-written, and provides a good blending of procedures, steps, explanations, code samples, screenshots and other illustrations.

Si Dunn

PostgreSQL: Up and Running – Get a well-guided grip on this powerful, free database software – #bookreview

PostgreSQL: Up and Running
Regina Obe and Leo Hsu
(O’Reilly, paperback, Kindle)

PostgreSQL is both a powerful open source database system and a very flexible application platform.

“PostgreSQL allows you to write stored procedures and functions in several programming languages, and the architecture allows you the flexibility to support more languages,” this book’s two authors point out.

Indeed: “You can have functions written in several different languages participating in one query.”

Release 9.2 of PostgreSQL hit the web Sept. 10, 2012. Regina Obe and Leo Hsu’s fine, 145-page introduction to PostgreSQL focuses on Release 9.1, but describes major 9.2 features, too. And their book definitely can be used to get you up and running. It describes “the unique features of PostgreSQL that make it stand apart from other databases…” and shows “how to use these features to solve real world problems.”

PostgreSQL is not for every database user, the writers emphasize. “PostgreSQL was designed from the ground up to be a server-side database. Many people do use it on the desktop similarly to how they use SQL Server Express or Oracle, but just like those, it cares about security management and doesn’t leave this up to the application connecting to it. As such, it’s not ideal as an imbeddable database, like SQLite or Firebird.”

It also “does a lot and a lot can be daunting,” they concede. “It’s not a dumb data store; it’s a smart elephant. If all you need is a key value store or you expect your database to just sit there and hold stuff, it’s probably overkill for your needs.”

But after years of using PostgreSQL, the two writers remain unabashed fans. “Each update,” they state in their book, “treats us to new features, eases usability, brings improvements in speed, and pushes the envelope of what is possible with a database. In the end, you will wonder why you ever used any other relational database, when PostgreSQL does everything you could hope for—and does it for free.”

By the way, users of PostgreSQL 8.3 or older need to upgrade ASAP, Regina Obe and Leo Hsu urge. Release 8.3 “will be reaching end-of-life in early 2013,” making support increasingly difficult and expensive.

Si Dunn

The Data Journalism Handbook – Get new skills for a new career that’s actually in demand – #bookreview

The Data Journalism Handbook: How Journalists Can Use Data to Improve the News
Edited by Jonathan Gray, Liliana Bounegru, and Lucy Chambers
(O’Reilly, paperback, Kindle)

Arise, ye downtrodden, unemployed newspaper and magazine writers and editors yearning to be working again as journalists. Data journalism apparently is hiring.

Data journalism? I didn’t know, either, until I read this intriguing and hopeful collection of essays, how-to reports, and case studies written by journalists now working as, or helping train, data journalists in the United States and other parts of the world.

Data journalism, according to Paul Bradshaw of Birmingham City University, combines “the traditional ‘nose for news’ and ability to tell a compelling story with the sheer scale and range of digital information now available.”

Traditional journalists should view that swelling tide of information not as a mind-numbing, overwhelming flood but “as an opportunity,” says Mirko Lorenz of Deutsche Welle. “By using data, the job of journalists shifts its main focus from being the first ones to report to being the ones telling us what a certain development actually means.”

He adds: “Data journalists or data scientists… are already a sought-after group of employees, not only in the media. Companies and institutions around the world are looking for ‘sense makers’ and professionals who know how to dig through data and transform it into something tangible.”

So, how do you transform yourself from an ex-investigative reporter now working at a shoe store into a prizewinning data journalist?

A bit of training. And, a willingness to bend your stubborn brain in a few new directions, according to this excellent and eye-opening book.

Yes, you may still be able to use the inverted-pyramid writing style and the “five W’s and H” you learned in J-school. But more importantly, you will now need to show you have some good skills in (drum roll, please)…Microsoft Excel.

That’s it? No, not quite.

Google Docs, SQL, Python, Django, R, Ruby, Ruby on Rails, screen scrapers, graphics packages – these are just a few more of the working data journalists’ favorite things. Skills in some of these, plus a journalism background, can help you become part of a team that finds, analyzes and presents information in a clear and graphical way.

You may dig up and present accurate data that reveals, for example, how tax dollars are being wasted by a certain school official, or how crime has increased in a particular neighborhood, or how extended drought is causing high unemployment among those who rely on lakes or rivers for income.
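
For a taste of what that analysis can look like, here is a small Python sketch of my own, not an example from the handbook. The neighborhoods and incident counts are invented purely for illustration, and the function name is hypothetical. It reads a few rows of CSV data and computes the year-over-year percentage change for each neighborhood, the kind of figure a crime-trend story might start from.

```python
import csv
import io

# Hypothetical incident counts, invented purely for illustration.
RAW = """neighborhood,year,incidents
Eastside,2011,120
Eastside,2012,150
Westside,2011,80
Westside,2012,76
"""


def percent_change_by_neighborhood(csv_text, start_year, end_year):
    """Return {neighborhood: percent change in incidents between two years}."""
    counts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_year = counts.setdefault(row["neighborhood"], {})
        by_year[int(row["year"])] = int(row["incidents"])
    return {
        name: round(100 * (years[end_year] - years[start_year]) / years[start_year], 1)
        for name, years in counts.items()
    }


changes = percent_change_by_neighborhood(RAW, 2011, 2012)
for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```

A real story would, of course, need verified source data and far more context than a single percentage.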

You might burrow deep into publicly accessible data and come up with a story that changes the course of a major election or alters national discourse.

Who are today’s leading practitioners of data journalism? The New York Times, the Texas Tribune, the Chicago Tribune, the BBC, Zeit Online, and numerous others are cited in this book.

The Data Journalism Handbook grew out of MozFest 2011 and is a project of the European Journalism Centre and the Open Knowledge Foundation.

This book can show you “how data can be either the source of data journalism or a tool with which the story is told—or both.”

If you are looking for new ways to use journalism skills that you thought were outmoded, The Data Journalism Handbook can give you both hope and a clear roadmap toward a possible new career.

Si Dunn

New Books for Windows Phone 7 & 7.5 and Microsoft SQL Server 2012 T-SQL – #bookreview

Microsoft Press recently has released two new books, one for developers who work with Windows Phone 7 & 7.5 and the other for newcomers to Microsoft SQL Server 2012 T-SQL. 

Windows Phone 7 Development Internals
Andrew Whitechapel
(Microsoft Press, paperback, Kindle)

Andrew Whitechapel’s hefty new 809-page development internals guidebook focuses on Windows Phone 7 design and architecture and helps you learn best practices for building Windows Phone 7 applications. It includes numerous screenshots, code examples, and other illustrations.

The book “covers the breadth of application development for the Windows Phone platform, both the major 7 and 7.1/7.5 versions and the minor 7.1.1 version,” Whitechapel writes.

Windows Phone 7 Development Internals is aimed at experienced .NET developers who are familiar with Microsoft Silverlight and want to dig into Windows Phone’s platform design and API surface.

“The Windows Phone 7 release only supports C#,” Whitechapel notes, “and although support for Visual Basic was introduced with the 7.1 SDK, this book focuses purely on C# and XAML.”

In each of the 20 chapters, several features are introduced, and Whitechapel provides “one or more sample [Silverlight] applications and walks you through the significant code (C# and XAML).”

The book’s author is a senior program manager for the Windows Phone Application Platform.

#

Microsoft SQL Server 2012 T-SQL Fundamentals
Itzik Ben-Gan
(Microsoft Press, paperback, Kindle)

Transact-SQL, more commonly known as T-SQL, is the Microsoft SQL Server dialect of the ISO and ANSI standards for SQL. T-SQL code is used to query and modify data in SQL Server 2012.

Itzik Ben-Gan, one of the leading experts on T-SQL, emphasizes that his new book “covers fundamentals [and] is mainly aimed at T-SQL practitioners with little or no experience.” But others who have some T-SQL experience also can find it helpful for filling in gaps in knowledge. The book also is recommended for database administrators, business intelligence (BI) practitioners, report writers, analysts, architects, and SQL Server power users who have “just started working with SQL Server and need to write queries and develop code using Transact-SQL.”

Microsoft SQL Server 2012 T-SQL Fundamentals is structured into 10 chapters. The first chapter provides “Background to T-SQL Querying and Programming.” Chapters 2 through 8 examine “various aspects of querying and modifying data.” Chapter 9 looks at concurrency and transactions, and Chapter 10 provides an overview of programmable objects.

The book’s one appendix shows you how to “get started and set up your environment so that you have everything you need to get the most out of this book.” The major discussions include: “Getting Started with SQL Database”; “Installing an On-Premises Implementation of SQL Server”; “Downloading Source Code and Installing the Sample Database”; “Working with SQL Server Management Studio”; and “Working with SQL Server Books Online.”

#

Si Dunn