R IN ACTION: Data Analysis and Graphics with R, 2nd Edition – #bookreview

R in Action

Data Analysis and Graphics with R

Robert I. Kabacoff

Manning – paperback

Whether data analysis is your field, your current major or your next career-change ambition, you likely should get this book. Free and open source, R is one of the world’s most popular languages for data analysis and visualization. And Robert I. Kabacoff’s updated new edition is, in my opinion, one of the top books out there for getting a handle on R. (I have used and previously reviewed several R how-to books.)

R is relatively easy to install on Windows, Mac OS X and Linux machines. But it is generally considered difficult to learn. Much of that difficulty stems from its abundance of features and packages, as well as its ability to create many types of graphs. “The base installation,” Kabacoff writes, “provides hundreds of data-management, statistical, and graphical functions out of the box. But some of its most powerful features come from the thousands of extensions (packages) provided by contributing authors.”

Kabacoff concedes: “It can be hard for new users to get a handle on what R is and what it can do.” And: “Even the most experienced R user is surprised to learn about features they were unaware of.”

R in Action, Second Edition, contains more than 200 pages of new material. And it is nicely structured to meet the needs of R beginners, as well as those of us who have some experience and want to gain more.

The book (579 pages in print format) is divided into five major parts. The first part, “Getting Started,” takes the beginner from installing and trying R to creating data sets, working with graphs, and managing data. Part 2, “Basic Methods,” focuses on “graphical and statistical techniques for obtaining basic information about data.”

Part 3, “Intermediate Methods,” moves the reader well beyond “describing the relationship between two variables.” It introduces regression, analysis of variance, power analysis, intermediate graphs, and resampling statistics and bootstrapping. Part 4 presents “Advanced Methods,” including generalized linear models, principal components and factor analysis, time series, cluster analysis, classification, and advanced methods for missing data.

Part 5, meanwhile, offers how-to information for “Expanding Your Skills.” The topics include: advanced graphics with ggplot2, advanced programming, creating a package, creating dynamic reports, and developing advanced graphics with the lattice package.

A key strength of R in Action, Second Edition is Kabacoff’s use of generally short code examples to illustrate many of the ways that data can be entered, manipulated, analyzed and displayed in graphical form.

The first thing I did, however, was start at the very back of the book, Appendix G, and upgrade my existing version of R to 3.2.1, “World-Famous Astronaut.” The upgrade instructions could have been a little bit clearer, but after hitting a couple of unmentioned prompts and changing a couple of wrong choices, the process turned out to be quick and smooth.

Then I started reading chapters and keying in some of the code examples. I had not used R much recently, so it was fun again to enter some commands and numbers and have nicely formatted graphs suddenly pop open on the screen.

Even better, it is nice to have a LOT of new things to learn, with a well-written, well-illustrated guidebook in hand.

Si Dunn


D3.js in Action: A good book packed with data visualization how-to info – #javascript #programming

D3.js in Action

Elijah Meeks

Manning – paperback


The D3.js library is very powerful, and it is full of useful choices and possibilities. But you should not try to tackle Elijah Meeks’s new book if you are a JavaScript newcomer who is not also comfortable with HTML, CSS and JSON.

It likewise helps to understand how CSV (comma-separated values) files can be used. And you should know how to set up and run local web servers on your computer. Prior knowledge of D3.js and SVG (Scalable Vector Graphics) is not necessary, however.

Some reviewers have remarked on the amount of how-to and technical information packed into D3.js in Action. It is indeed impressive. And, yes, it really can seem like concepts, details and examples are being squirted at you from a fire hose, particularly if you are attempting to race through the text. As Elijah Meeks writes, “[T]he focus of this book is on a more exhaustive explanation of key principles of the library.”

So plan to take your time. Tackle D3.js in small bites, using the d3js.org website and this text. I am pretty new to learning data visualization, and I definitely had never heard of visualizations such as Voronoi diagrams, nor tools such as TopoJSON, until I started working my way through this book. And those are just a few of the available possibilities.

I have not yet tried all of the code examples. But the ones I have tested have worked very well, and they have gotten me thinking about how I can adapt them to use in some of my work.

I am a bit disappointed that the book takes 40 pages to get to the requisite “Hello, world” examples. And once you arrive, the explanations likely will seem a bit murky and incomplete to some readers.

However, that is a minor complaint. D3.js in Action will get frequent use as I dig deeper into data visualization. D3.js and Elijah Meeks’s new book are keepers for the long term in the big world of JavaScript.

Si Dunn

Optimizing Hadoop for MapReduce – A practical guide to lowering some costs of mining Big Data – #bookreview

Optimizing Hadoop for MapReduce

Learn how to configure your Hadoop cluster to run optimal MapReduce jobs

Khaled Tannir

(Packt Publishing, paperback, Kindle)

Time is money, as the old saying goes. And that saying especially applies to the world of Big Data, where much time, computing power and cash can be consumed while trying to extract profitable information from mountains of data.

This short, well-focused book by veteran software developer Khaled Tannir describes how to achieve a very important, money-saving goal: improve the efficiency of MapReduce jobs that are run with Hadoop.

As Tannir explains in his preface:

“MapReduce is an important parallel processing model for large-scale, data-intensive applications such as data mining and web indexing. Hadoop, an open source implementation of MapReduce, is widely applied to support cluster computing jobs that require low response time.

“Most of the MapReduce programs are written for data analysis and they usually take a long time to finish. Many companies are embracing Hadoop for advanced data analytics over large datasets that require time completion guarantees.

“Efficiency, especially the I/O costs of MapReduce, still needs to be addressed for successful implications. The experience shows that a misconfigured Hadoop cluster can noticeably reduce and significantly downgrade the performance of MapReduce jobs.”

Tannir’s seven-chapter book zeroes in on how to find and fix misconfigured Hadoop clusters and numerous other problems. But first, he explains how Hadoop parameters are configured and how MapReduce metrics are monitored.

Two chapters are devoted to learning how to identify system bottlenecks, including CPU bottlenecks, storage bottlenecks, and network bandwidth bottlenecks.

One chapter examines how to properly identify resource weaknesses, particularly in Hadoop clusters. Then, as the book shifts strongly to solutions, Tannir explains how to reconfigure Hadoop clusters for greater efficiency.

Indeed, the final three chapters deliver details and steps that can help you improve how well Hadoop and MapReduce work together in your setting.

For example, the author explains how to make the map and reduce functions operate more efficiently, how to work with small or unsplittable files, how to deal with spilled records (those written to local disk when the allocated memory buffer is full), and ways to tune map and reduce parameters to improve performance.
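To give a flavor of the tuning involved, spill behavior is controlled by a handful of job parameters. A mapred-site.xml fragment like the one below uses Hadoop 2.x parameter names (names vary across Hadoop versions, and the values shown are purely illustrative, not recommendations from the book) to enlarge the map-side sort buffer so fewer records spill to disk:

```xml
<!-- mapred-site.xml fragment (illustrative values only) -->
<configuration>
  <!-- Size in MB of the in-memory buffer that holds map output
       before it is sorted and spilled to local disk -->
  <property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>256</value>
  </property>
  <!-- Fraction of that buffer that may fill before a spill begins -->
  <property>
    <name>mapreduce.map.sort.spill.percent</name>
    <value>0.80</value>
  </property>
</configuration>
```

Raising the buffer trades task memory for fewer disk writes, which is exactly the kind of trade-off the book’s tuning chapters weigh.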

“Most MapReduce programs are written for data analysis and they usually take a lot of time to finish,” Tannir emphasizes. However: “Many companies are embracing Hadoop for advanced data analytics over large datasets that require completion-time guarantees.” And that means “[e]fficiency, especially the I/O costs of MapReduce, still need(s) to be addressed for successful implications.”

He describes how to use compression, Combiners, the correct Writable types, and quick reuse of types to help improve memory management and the speed of job execution.
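The Combiner idea in particular is easy to see outside Hadoop itself. The sketch below is plain Python, not Hadoop’s Java API, and the sample input is made up; it simply shows how combiner-style local aggregation shrinks the intermediate data that would otherwise be shuffled across the network:

```python
from collections import Counter

def map_with_combiner(lines):
    """Count words locally (combiner-style) before emitting,
    instead of emitting one ('word', 1) pair per occurrence."""
    combined = Counter()
    for line in lines:
        combined.update(line.split())
    return list(combined.items())  # far fewer records to shuffle

records = map_with_combiner(["big data big jobs", "big data"])
print(sorted(records))  # → [('big', 3), ('data', 2), ('jobs', 1)]
```

Here six (word, 1) pairs collapse into three aggregated records before any “shuffle” would occur; on a real cluster, that reduction is what cuts I/O and network cost.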

And, along with other tips, Tannir presents several “best practices” to help manage Hadoop clusters and make them do their work quicker and with fewer demands on hardware and software resources. 

Tannir notes that “setting up a Hadoop cluster is basically the challenge of combining the requirements of high availability, load balancing, and the individual requirements of the services you aim to get from your cluster servers.”

If you work with Hadoop and MapReduce or are now learning how to help install, maintain or administer Hadoop clusters, you can find helpful information and many useful tips in Khaled Tannir’s Optimizing Hadoop for MapReduce.

Si Dunn

Ethics of Big Data – Thoughtful insights into key issues confronting big-data ‘gold mines’ – #management #bookreview

Ethics of Big Data
Kord Davis, with Doug Patterson
(O’Reilly, paperback, Kindle)

“Big Data” and how to mine it for profit are red-hot topics in today’s business world. Many corporations now find themselves sitting atop virtual gold mines of customer information. And even small businesses now are attempting to find new ways to profit from their stashes of sales, marketing, and research data. 

Like it or not, you can’t block all of the cookies or tracking companies or sites that are following you, and each time you surf the web, you leave behind a “data exhaust” trail that has monetary value to others. Indeed, one recent start-up, Enliken (“Data to the People”), is offering a way for computer users to gain some control over their data exhaust trail’s monetary value and choose who benefits from it, including some charities.

Ethics of Big Data does not seek to lay down a “hard-and-fast list of rules for the ethical handling of data.” The new book also doesn’t “tell you what to do with your data.” Its goals are “to help you engage in productive ethical discussions raised by today’s big-data-driven enterprises, propose a framework for thinking and talking about these issues, and introduce a methodology for aligning actions with values within an organization.”

It’s heady stuff, packed into just 64 pages. But the book is well written and definitely thought-provoking. It can serve as a focused guide for corporate leaders and others now hoping to get a grip on their own big-data situations, in ways that will not alienate their customers, partners, and stakeholders.

In the view of the authors: “For both individuals and organizations, four common elements define what can be considered a framework for big data:

  • “Identity – What is the relationship between our offline identity and our online identity?”
  • “Privacy – Who should control access to data?”
  • “Ownership – Who owns data, can rights be transferred, and what are the obligations of people who generate and use that data?”
  • “Reputation – How can we determine what data is trustworthy? Whether about ourselves, others, or anything else, big data exponentially increases the amount of information and ways we can interact with it. This phenomenon increases the complexity of managing how we are perceived and judged.”

Big-data technology itself is “ethically neutral,” the authors contend, and it “has no value framework. Individuals and corporations, however, do have value systems, and it is only by asking and seeking answers to ethical questions that we can ensure big data is used in a way that aligns with those values.”

At the same time: “Big data is pushing corporate action further and more fully into individual lives through the sheer volume, variety, and velocity of the data being generated. Big-data product design, development, sales, and management actions expand their influence and impact over individuals’ lives that may be changing the common meanings of words like privacy, reputation, ownership, and identity.”

What will happen next as (1) big data continues to expand and intrude and (2) people and organizations push back harder is still anybody’s guess. But matters of ethics likely will remain at the center of the conflicts.

Indeed, some big-data gold mines could suffer devastating financial and legal cave-ins if greed is allowed to trump ethics.

Si Dunn

The Data Journalism Handbook – Get new skills for a new career that’s actually in demand – #bookreview

The Data Journalism Handbook: How Journalists Can Use Data to Improve the News
Edited by Jonathan Gray, Liliana Bounegru, and Lucy Chambers
(O’Reilly, paperback, Kindle)

Arise, ye downtrodden, unemployed newspaper and magazine writers and editors yearning to be working again as journalists. Data journalism apparently is hiring.

Data journalism? I didn’t know, either, until I read this intriguing and hopeful collection of essays, how-to reports, and case studies written by journalists now working as, or helping train, data journalists in the United States and other parts of the world.

Data journalism, according to Paul Bradshaw of Birmingham City University, combines “the traditional ‘nose for news’ and ability to tell a compelling story with the sheer scale and range of digital information now available.”

Traditional journalists should view that swelling tide of information not as a mind-numbing, overwhelming flood but “as an opportunity,” says Mirko Lorenz of Deutsche Welle. “By using data, the job of journalists shifts its main focus from being the first ones to report to being the ones telling us what a certain development actually means.”

He adds: “Data journalists or data scientists… are already a sought-after group of employees, not only in the media. Companies and institutions around the world are looking for ‘sense makers’ and professionals who know how to dig through data and transform it into something tangible.”

So, how do you transform yourself from an ex-investigative reporter now working at a shoe store into a prizewinning data journalist?

A bit of training. And, a willingness to bend your stubborn brain in a few new directions, according to this excellent and eye-opening book.

Yes, you may still be able to use the inverted-pyramid writing style and the “five W’s and H” you learned in J-school. But more importantly, you will now need to show you have some good skills in (drum roll, please)…Microsoft Excel.

That’s it? No, not quite.

Google Docs, SQL, Python, Django, R, Ruby, Ruby on Rails, screen scrapers, graphics packages – these are just a few more of the working data journalists’ favorite things. Skills in some of these, plus a journalism background, can help you become part of a team that finds, analyzes and presents information in a clear and graphical way.

You may dig up and present accurate data that reveals, for example, how tax dollars are being wasted by a certain school official, or how crime has increased in a particular neighborhood, or how extended drought is causing high unemployment among those who rely on lakes or rivers for income.
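Much of that digging starts with nothing fancier than a spreadsheet export and a few lines of code. This minimal Python sketch uses only the standard library; the file contents, neighborhoods and figures are invented stand-ins for a real public-records download:

```python
import csv
import io
from collections import defaultdict

# Stand-in for a download from a city open-data portal;
# in practice you would open a real CSV file instead.
data = io.StringIO(
    "neighborhood,year,reported_crimes\n"
    "Riverside,2011,412\n"
    "Riverside,2012,538\n"
    "Oak Hill,2011,203\n"
    "Oak Hill,2012,197\n"
)

# Total reported crimes per neighborhood across all years
totals = defaultdict(int)
for row in csv.DictReader(data):
    totals[row["neighborhood"]] += int(row["reported_crimes"])

for name, count in sorted(totals.items()):
    print(f"{name}: {count}")
```

From totals like these, a reporter can start asking the news questions: why did one neighborhood’s numbers jump while another’s held steady?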

You might burrow deep into publicly accessible data and come up with a story that changes the course of a major election or alters national discourse.

Who are today’s leading practitioners of data journalism? The New York Times, the Texas Tribune, the Chicago Tribune, the BBC, Zeit Online, and numerous others are cited in this book.

The Data Journalism Handbook grew out of MozFest 2011 and is a project of the European Journalism Centre and the Open Knowledge Foundation.

This book can show you “how data can be either the source of data journalism or a tool with which the story is told—or both.”

If you are looking for new ways to use journalism skills that you thought were outmoded, The Data Journalism Handbook can give you both hope and a clear roadmap toward a possible new career.

Si Dunn