‘Introducing Data Science’ – A good doorway into the world of processing, analyzing & displaying Big Data – #bookreview

Introducing Data Science

Davy Cielen, Arno D. B. Meysman, and Mohamed Ali

Manning – paperback

The three authors of this book note that “[d]ata science is a very wide field, so wide indeed that a book ten times the size of this one wouldn’t be able to cover it all. For each chapter, we picked a different aspect we find interesting. Some hard decisions had to be made to keep this book from collapsing your bookshelf!”

In their decisions and selections, they have made some good choices. Introducing Data Science is well written and generally well organized (unless you are overly impatient to get to hands-on tasks).

The book appears to be aimed primarily at individual computer users and persons contemplating possible careers in data science, not those already working in, or heading, big data centers. The book also could be good for managers and others trying to wrap their heads around some data science techniques that could help them cope with swelling mountains of business data.

With this book in hand, you may be impatient to open it to the first chapter and dive headfirst into slicing, dicing, and graphing data. Try to curb your enthusiasm for a little while. Books from Manning generally avoid the “jump in now, swim later” approach. Instead, you get some overviews, explanations and theory first. Then you start getting to the heart of the matter. Some like this approach, while others get impatient with it.

In Introducing Data Science, your “First steps in big data” start happening in chapter five, after you’ve first delved into the data science process: 1. Setting the research goal; 2. Retrieving data; 3. Data preparation; 4. Data exploration; 5. Data modeling; and 6. Presentation and automation.

The “First steps” chapter also is preceded by chapters on machine learning and how to handle large data files on a single computer.

Once you get to Chapter 5, however, your “First steps” start moving pretty quickly. You are shown how to work (at the sandbox level) with two big data applications, Hadoop and Spark. And you get examples of how even Python can be used to write big data jobs.

From there, you march on to (1) the use of NoSQL databases and graph databases, (2) text mining and text analytics, and (3) data visualization and creating a small data science application.

It should be noted and emphasized, however, that the concluding pages of chapter 1 do present “An introductory working example of Hadoop.” The authors explain how to run “a small [Hadoop] application in a big data context,” using a Hortonworks Sandbox image inside VirtualBox.

It’s not grand, but it is a start in a book that otherwise would take four chapters to get to the first hands-on part.

Near the beginning of their book, the authors also include a worthy quote from Morpheus in “The Matrix”: “I can only show you the door. You’re the one that has to walk through it.”

This book can be a good entry door to the huge and rapidly changing field of data science, if you are willing to go through it and do the work it presents.

Si Dunn

R IN ACTION: Data Analysis and Graphics with R, 2nd Edition – #bookreview

R in Action

Data Analysis and Graphics with R

Robert I. Kabacoff

Manning – paperback

Whether data analysis is your field, your current major or your next career-change ambition, you likely should get this book. Free and open-source R is one of the world’s most popular languages for data analysis and visualization. And Robert I. Kabacoff’s updated new edition is, in my opinion, one of the top books out there for getting a handle on R. (I have used and previously reviewed several R how-to books.)

R is relatively easy to install on Windows, Mac OS X and Linux machines. But it is generally considered difficult to learn. Much of that is because of its rich abundance of features and packages, as well as its ability to create many types of graphs. “The base installation,” Kabacoff writes, “provides hundreds of data-management, statistical, and graphical functions out of the box. But some of its most powerful features come from the thousands of extensions (packages) provided by contributing authors.”

Kabacoff concedes: “It can be hard for new users to get a handle on what R is and what it can do.” And: “Even the most experienced R user is surprised to learn about features they were unaware of.”

R in Action, Second Edition, contains more than 200 pages of new material. And it is nicely structured to meet the needs of R beginners, as well as those of us who have some experience and want to gain more.

The book (579 pages in print format) is divided into five major parts. The first part, “Getting Started,” takes the beginner from installing and trying R to creating data sets, working with graphs, and managing data. Part 2, “Basic Methods,” focuses on graphical and statistical techniques for obtaining basic information about data.

Part 3, “Intermediate Methods,” moves the reader well beyond “describing the relationship between two variables.” It introduces regression, analysis of variance, power analysis, intermediate graphs, and resampling statistics and bootstrapping. Part 4 presents “Advanced Methods,” including generalized linear models, principal components and factor analysis, time series, cluster analysis, classification, and advanced methods for missing data.

Part 5, meanwhile, offers how-to information for “Expanding Your Skills.” The topics include: advanced graphics with ggplot2, advanced programming, creating a package, creating dynamic reports, and developing advanced graphics with the lattice package.

A key strength of R in Action, Second Edition is Kabacoff’s use of generally short code examples to illustrate many of the ways that data can be entered, manipulated, analyzed and displayed in graphical form.

The first thing I did, however, was start at the very back of the book, Appendix G, and upgrade my existing version of R to 3.2.1, “World-Famous Astronaut.” The upgrade instructions could have been a little bit clearer, but after hitting a couple of unmentioned prompts and changing a couple of wrong choices, the process turned out to be quick and smooth.

Then I started reading chapters and keying in some of the code examples. I had not used R much recently, so it was fun again to enter some commands and numbers and have nicely formatted graphs suddenly pop open on the screen.

Even better, it is nice to have a LOT of new things to learn, with a well-written, well-illustrated guidebook in hand.

Si Dunn

Data Science for Business – A serious guide for those who need to know – #bigdata #bookreview

Data Science for Business

What You Need to Know about Data Mining and Data-Analytic Thinking
Foster Provost and Tom Fawcett
(O’Reilly – paperback, Kindle)

This is not an introductory text for casual readers curious about the hoopla over data science and Big Data.

And you definitely won’t find code here for simple screen scrapers written in Python 2.7 or programs that access the Twitter API to scoop up messages containing certain hashtags.

Data Science for Business is based on an MBA course Foster Provost teaches at New York University, and it is aimed at three specific, serious audiences:

  • “Aspiring data scientists”
  • “Developers who will be implementing data science solutions…”
  • “Business people who will be working with data scientists, managing data science-oriented projects, or investing in data science ventures….”

Provost and Fawcett’s book “concentrates on the fundamentals of data science and data mining,” the two authors state. But it specifically avoids “an algorithm-centered approach” and instead focuses on “a relatively small set of fundamental concepts or principles that underlie techniques for extracting useful knowledge from data. These concepts serve as the foundation for many well-known algorithms of data mining,” the authors note.

“Moreover, these concepts underlie the analysis of data-centered business problems, the creation and evaluation of data science solutions, and the evaluation of general data science strategies and proposals.”

The book is well-written and adequately illustrated with charts, diagrams, mathematical equations and mathematical examples. And the text, while technical and dense in some places, is organized into short sections. Most of the chapters end with insightful summaries that help the lessons stick.

Both authors are experienced veterans in the use of data science in business. Their new book includes two helpful appendices. One shows how to “assess potential data mining projects” and “uncover potential flaws in proposals.” The second appendix presents a sample proposal and discusses its flaws.

“If you are a business stakeholder rather than a data scientist,” the authors caution, “don’t let so-called data scientists bamboozle you with jargon: the concepts of this book plus knowledge of your own business and data systems should allow you to understand 80% or more of the data science at a reasonable enough level to be productive for your business.”

They also challenge data scientists to “think deeply about why your work is relevant to helping the business and be able to present it as such.”

Si Dunn

The Data Journalism Handbook – Get new skills for a new career that’s actually in demand – #bookreview

The Data Journalism Handbook: How Journalists Can Use Data to Improve the News
Edited by Jonathan Gray, Liliana Bounegru, and Lucy Chambers
(O’Reilly – paperback, Kindle)

Arise, ye downtrodden, unemployed newspaper and magazine writers and editors yearning to be working again as journalists. Data journalism apparently is hiring.

Data journalism? I didn’t know, either, until I read this intriguing and hopeful collection of essays, how-to reports, and case studies written by journalists now working as, or helping train, data journalists in the United States and other parts of the world.

Data journalism, according to Paul Bradshaw of Birmingham City University, combines “the traditional ‘nose for news’ and ability to tell a compelling story with the sheer scale and range of digital information now available.”

Traditional journalists should view that swelling tide of information not as a mind-numbing, overwhelming flood but “as an opportunity,” says Mirko Lorenz of Deutsche Welle. “By using data, the job of journalists shifts its main focus from being the first ones to report to being the ones telling us what a certain development actually means.”

He adds: “Data journalists or data scientists… are already a sought-after group of employees, not only in the media. Companies and institutions around the world are looking for ‘sense makers’ and professionals who know how to dig through data and transform it into something tangible.”

So, how do you transform yourself from an ex-investigative reporter now working at a shoe store into a prizewinning data journalist?

A bit of training. And a willingness to bend your stubborn brain in a few new directions, according to this excellent and eye-opening book.

Yes, you may still be able to use the inverted-pyramid writing style and the “five W’s and H” you learned in J-school. But more importantly, you will now need to show you have some good skills in (drum roll, please)…Microsoft Excel.

That’s it? No, not quite.

Google Docs, SQL, Python, Django, R, Ruby, Ruby on Rails, screen scrapers, graphics packages – these are just a few more of the working data journalists’ favorite things. Skills in some of these, plus a journalism background, can help you become part of a team that finds, analyzes and presents information in a clear and graphical way.

You may dig up and present accurate data that reveals, for example, how tax dollars are being wasted by a certain school official, or how crime has increased in a particular neighborhood, or how extended drought is causing high unemployment among those who rely on lakes or rivers for income.

You might burrow deep into publicly accessible data and come up with a story that changes the course of a major election or alters national discourse.

Who are today’s leading practitioners of data journalism? The New York Times, the Texas Tribune, the Chicago Tribune, the BBC, Zeit Online, and numerous others are cited in this book.

The Data Journalism Handbook grew out of MozFest 2011 and is a project of the European Journalism Centre and the Open Knowledge Foundation.

This book can show you “how data can be either the source of data journalism or a tool with which the story is told, or both.”

If you are looking for new ways to use journalism skills that you thought were outmoded, The Data Journalism Handbook can give you both hope and a clear roadmap toward a possible new career.

Si Dunn