Introducing Data Science
Davy Cielen, Arno D. B. Meysman, and Mohamed Ali
Manning – paperback
The three authors of this book note that “[d]ata science is a very wide field, so wide indeed that a book ten times the size of this one wouldn’t be able to cover it all. For each chapter, we picked a different aspect we find interesting. Some hard decisions had to be made to keep this book from collapsing your bookshelf!”
In their decisions and selections, they have made some good choices. Introducing Data Science is well written and generally well organized (unless you are overly impatient to get to hands-on tasks).
The book appears to be aimed primarily at individual computer users and persons contemplating possible careers in data science, not those already working in, or heading, big data centers. The book could also be good for managers and others trying to wrap their heads around data science techniques that could help them cope with swelling mountains of business data.
With this book in hand, you may be impatient to open it to the first chapter and dive headfirst into slicing, dicing, and graphing data. Try to curb your enthusiasm for a little while. Books from Manning generally avoid the “jump in now, swim later” approach. Instead, you get some overviews, explanations and theory first. Then you start getting to the heart of the matter. Some like this approach, while others get impatient with it.
In Introducing Data Science, your “First steps in big data” begin in Chapter 5, after you’ve delved into the data science process: 1. Setting the research goal; 2. Retrieving data; 3. Data preparation; 4. Data exploration; 5. Data modeling; and 6. Presentation and automation.
The “First steps” chapter is also preceded by chapters on machine learning and how to handle large data files on a single computer.
Once you get to Chapter 5, however, your “First steps” start moving pretty quickly. You are shown how to work (at the sandbox level) with two big data frameworks, Hadoop and Spark. And you get examples of how even Python can be used to write big data jobs.
From there, you march on to (1) the use of NoSQL databases and graph databases, (2) text mining and text analytics, and (3) data visualization and creating a small data science application.
Note, however, that the concluding pages of Chapter 1 do present “An introductory working example of Hadoop.” The authors explain how to run “a small [Hadoop] application in a big data context,” using a Hortonworks Sandbox image inside VirtualBox.
It’s not grand, but it is a start in a book that otherwise would take four chapters to get to the first hands-on part.
Near the beginning of their book, the authors also include a worthy quote from Morpheus in “The Matrix”: “I can only show you the door. You’re the one that has to walk through it.”
This book can be a good entry door to the huge and rapidly changing field of data science, if you are willing to go through it and do the work it presents.