Optimizing Hadoop for MapReduce – A practical guide to lowering some costs of mining Big Data – #bookreview

Optimizing Hadoop for MapReduce

Learn how to configure your Hadoop cluster to run optimal MapReduce jobs

Khaled Tannir

(Packt Publishing, paperback, Kindle)

Time is money, as the old saying goes. And that saying especially applies to the world of Big Data, where much time, computing power and cash can be consumed while trying to extract profitable information from mountains of data.

This short book by veteran software developer Khaled Tannir describes how to achieve a very important, money-saving goal: improve the efficiency of MapReduce jobs that are run with Hadoop.

As Tannir explains in his preface:

“MapReduce is an important parallel processing model for large-scale, data-intensive applications such as data mining and web indexing. Hadoop, an open source implementation of MapReduce, is widely applied to support cluster computing jobs that require low response time.

“Most of the MapReduce programs are written for data analysis and they usually take a long time to finish. Many companies are embracing Hadoop for advanced data analytics over large datasets that require time completion guarantees.

“Efficiency, especially the I/O costs of MapReduce, still needs to be addressed for successful implications. The experience shows that a misconfigured Hadoop cluster can noticeably reduce and significantly downgrade the performance of MapReduce jobs.”

Tannir’s well-focused, seven-chapter book zeroes in on how to find and fix misconfigured Hadoop clusters and numerous other problems. But first, he explains how Hadoop parameters are configured and how MapReduce metrics are monitored.

Two chapters are devoted to learning how to identify system bottlenecks, including CPU bottlenecks, storage bottlenecks, and network bandwidth bottlenecks.

One chapter examines how to properly identify resource weaknesses, particularly in Hadoop clusters. Then, as the book shifts strongly to solutions, Tannir explains how to reconfigure Hadoop clusters for greater efficiency.

Indeed, the final three chapters deliver details and steps that can help you improve how well Hadoop and MapReduce work together in your setting.

For example, the author explains how to make the map and reduce functions operate more efficiently, how to work with small or unsplittable files, how to deal with spilled records (those written to local disk when the allocated memory buffer is full), and ways to tune map and reduce parameters to improve performance.
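To make the idea of spilled records concrete, here is a toy Python sketch (not taken from the book) of a map-side buffer that writes a spill each time it fills past a threshold. The function name, the byte sizes, and the 80 percent threshold are illustrative assumptions; real Hadoop buffering is more involved.

```python
def simulate_spills(record_sizes, buffer_bytes, spill_percent=80):
    """Count the spills a toy in-memory buffer would write to local disk.

    All names and numbers here are hypothetical; this only models the
    idea that a full buffer forces records out to disk.
    """
    threshold = buffer_bytes * spill_percent // 100
    used = 0
    spills = 0
    for size in record_sizes:
        used += size
        if used >= threshold:
            spills += 1   # buffer is "full": write its contents to disk
            used = 0
    if used > 0:
        spills += 1       # final flush of whatever is left at task end
    return spills


records = [100] * 1000    # 1,000 records of 100 bytes each

small_buffer = simulate_spills(records, buffer_bytes=50_000)
large_buffer = simulate_spills(records, buffer_bytes=200_000)
print(small_buffer, large_buffer)
```

In this toy model, a buffer large enough to hold all of a task's output needs only the final flush, which is the intuition behind tuning buffer sizes to cut down on extra disk writes.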

“Most MapReduce programs are written for data analysis and they usually take a lot of time to finish,” Tannir emphasizes. However: “Many companies are embracing Hadoop for advanced data analytics over large datasets that require completion-time guarantees.” And that means “[e]fficiency, especially the I/O costs of MapReduce, still need(s) to be addressed for successful implications.”

He describes how to use compression, Combiners, the correct Writable types, and quick reuse of types to help improve memory management and the speed of job execution.
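A Combiner's benefit can be sketched in a few lines of plain Python. This is a simplified illustration, not Hadoop code and not an example from the book: it counts how many key/value records each input split would send across the network with and without local pre-aggregation. The function names and sample data are assumptions made for the sketch.

```python
from collections import Counter


def map_words(line):
    """Emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def combine(pairs):
    """Pre-aggregate pairs on the map side, one record per distinct key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def shuffled_record_count(splits, use_combiner):
    """Count the records that would leave the map side for the shuffle."""
    total = 0
    for split in splits:
        pairs = [pair for line in split for pair in map_words(line)]
        if use_combiner:
            pairs = combine(pairs)
        total += len(pairs)
    return total


# Two hypothetical input splits, each handled by its own map task.
splits = [
    ["big data big cluster", "big job"],
    ["data data cluster"],
]

print(shuffled_record_count(splits, use_combiner=False))
print(shuffled_record_count(splits, use_combiner=True))
```

The more often keys repeat within a split, the more a combiner shrinks the data that must be written, transferred, and merged.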

And, along with other tips, Tannir presents several “best practices” to help manage Hadoop clusters and make them do their work quicker and with fewer demands on hardware and software resources. 

Tannir notes that “setting up a Hadoop cluster is basically the challenge of combining the requirements of high availability, load balancing, and the individual requirements of the services you aim to get from your cluster servers.”

If you work with Hadoop and MapReduce or are now learning how to help install, maintain or administer Hadoop clusters, you can find helpful information and many useful tips in Khaled Tannir’s Optimizing Hadoop for MapReduce.

Si Dunn

Hadoop is hot! Three new how-to books for riding the Big Data elephant – #programming #bookreview

In the world of Big Data, Hadoop has become the hard-charging elephant in the room.

Its big-name users now span the alphabet and include such notables as Amazon, eBay, Facebook, Google, the New York Times, and Yahoo. Not bad for software named after a child’s toy elephant.

Computer systems that run Hadoop can store, process, and analyze large amounts of data that have been gathered up in many different formats from many different sources.

According to the Apache Software Foundation’s Hadoop website: “The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.”

The (well-trained) user defines the Big Data problem that Hadoop will tackle. Then the software handles all aspects of the job completion, including spreading out the problem in small pieces to many different computers, or nodes, in the distributed system for more efficient processing. Hadoop also handles individual node failures, and collects and combines the calculated results from each node.

But you don’t need a collection of hundreds or thousands of computers to run Hadoop. You can learn it, write programs, and do some testing and debugging on a single Linux machine, Windows PC or Mac. The open-source software can be downloaded from the Apache Hadoop website. (Do some research first. You may have to use web searches to find detailed installation instructions for your specific system.)

Hadoop is open-source software that is often described as “a Java-based framework for large-scale data processing.” It has a lengthy learning curve that includes getting familiar with Java, if you don’t already know it.

But if you are now ready and eager to take on Hadoop, Packt Publishing recently has unveiled three excellent how-to books that can help you begin and extend your mastery: Hadoop Beginner’s Guide, Hadoop MapReduce Cookbook, and Hadoop Real-World Solutions Cookbook.

Short reviews of each are presented below.

Hadoop Beginner’s Guide
Garry Turkington
(Packt Publishing – paperback, Kindle)

Garry Turkington’s new book is a detailed, well-structured introduction to Hadoop. It covers everything from the software’s three modes–local standalone mode, pseudo-distributed mode, and fully distributed mode–to running basic jobs, developing simple and advanced MapReduce programs, maintaining clusters of computers, and working with Hive, MySQL, and other tools.

“The developer focuses on expressing the transformation between source and result data sets, and the Hadoop framework manages all aspects of job execution, parallelization, and coordination,” the author writes.

He calls this capability “possibly the most important aspect of Hadoop. The platform takes responsibility for every aspect of executing the processing across the data. After the user defines the key criteria for the job, everything else becomes the responsibility of the system.”

The 374-page book is written well and provides numerous code samples and illustrations. But it has one drawback for some beginners who want to install and use Hadoop. Turkington offers step-by-step instructions for how to perform a Linux installation, specifically Ubuntu. However, he refers Windows and Mac users to an Apache site where there is insufficient how-to information. Web searches become necessary to find more installation details.

Hadoop MapReduce Cookbook
Srinath Perera and Thilina Gunarathne
(Packt Publishing – paperback, Kindle)

MapReduce “jobs” are an essential part of how Hadoop is able to crunch huge chunks of Big Data. The Hadoop MapReduce Cookbook offers “recipes for analyzing large and complex data sets with Hadoop MapReduce.”

MapReduce is a well-known programming model for processing large sets of data. Typically, MapReduce is used within clusters of computers that are configured to perform distributed computing.

In the “Map” portion of the process, a problem is split into many subtasks that are then assigned by a master computer to individual computers known as nodes. (Nodes also can have sub-nodes.) During the “Reduce” part of the task, the master computer gathers up the processed data from the nodes, combines it and outputs a response to the problem that was posed to be solved. (MapReduce libraries are now available for many different programming languages; Hadoop’s implementation is written in Java.)
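The map, group, and reduce steps described above can be sketched in plain Python on a single machine. This is only a minimal illustration of the programming model, not Hadoop code and not a recipe from the cookbook; the function names and the year/temperature records are made up for the example.

```python
from collections import defaultdict


def map_phase(record):
    """Turn one raw 'year,temperature' record into a (key, value) pair."""
    year, temperature = record.split(",")
    yield year, int(temperature)


def shuffle(pairs):
    """Group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, max(values)


def run_job(records):
    mapped = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


# Hypothetical input: the highest reading per year is the question posed.
readings = ["1949,111", "1950,22", "1949,78", "1950,-11", "1950,0"]
print(run_job(readings))
```

In a real cluster, the map calls run in parallel on the nodes that hold the data, and the grouping step moves data between machines; here everything happens in one process to keep the shape of the computation visible.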

“Hadoop is the most widely known and widely used implementation of the MapReduce paradigm,” the two authors note.

Their 284-page book initially shows how to run Hadoop in local mode, which “does not start any servers but does all the work within the same JVM [Java Virtual Machine]” on a standalone computer. Then, as you gain more experience with MapReduce and the Hadoop Distributed File System (HDFS), they guide you into using Hadoop in more complex, distributed-computing environments.

Echoing the Hadoop Beginner’s Guide, the authors explain how to install Hadoop on Linux machines only.

Hadoop Real-World Solutions Cookbook
Jonathan R. Owens, Jon Lentz and Brian Femiano
(Packt Publishing – paperback, Kindle)

The Hadoop Real-World Solutions Cookbook assumes you already have some experience with Hadoop. So it jumps straight into helping “developers become more comfortable with, and proficient at solving problems in, the Hadoop space.”

Its goal is to “teach readers how to build solutions using tools such as Apache Hive, Pig, MapReduce, Mahout, Giraph, HDFS, Accumulo, Redis, and Ganglia.”

The 299-page book is packed with code examples and short explanations that help solve specific types of problems. A few randomly selected problem headings:

  • “Using Apache Pig to filter bot traffic from web server logs.”
  • “Using the distributed cache in MapReduce.”
  • “Trim Outliers from the Audioscrobbler dataset using Pig and datafu.” 
  • “Designing a row key to store geographic events in Accumulo.”
  • “Enabling MapReduce jobs to skip bad records.”

The authors use a simple but effective strategy for presenting problems and solutions. First, the problem is clearly described. Then, under a “Getting Ready” heading, they spell out what you need to solve the problem. That is followed by a “How to do it…” heading where each step is presented and supported by code examples. Then, paragraphs beneath a “How it works…” heading sum up and explain how the problem was solved. Finally, a “There’s more…” heading highlights more explanations and links to additional details.

If you are a Hadoop beginner, consider the first two books reviewed above. If you have some Hadoop experience, you likely can find some useful tips in book number three.

Si Dunn

Make something new, with MakerBot or Raspberry Pi – #bookreview #programming #diy

O’Reilly has released two new books to help you get started with two hot new products: the MakerBot desktop 3D printer and the Raspberry Pi, a tiny, inexpensive computer the size of a credit card.

Here are short reviews of the two how-to guides:

Getting Started with MakerBot
Bre Pettis, Anna Kaziunas France & Jay Shergill
(O’Reilly – paperback, Kindle)

The MakerBot 3D printer has captured worldwide attention for its ability to replicate objects such as game pieces, knobs and other plastic parts no longer available from manufacturers, and for its use in producing small art works.

“In our consumer-focused, disposable world, a MakerBot is a revitalizing force for all your broken things,” the authors state. (One of them, Bre Pettis, is one of MakerBot’s creators.)

The MakerBot machine, however, also can be a revitalizing force for artistic endeavors and, in some cases, dreams of self-employment. It is, after all, essentially a small factory in a box.

Getting Started with MakerBot introduces the machine and things you can make with it from your own designs or from designs downloaded from the web. “Though the underlying engineering principles behind a MakerBot are quite complex, in a nutshell, a MakerBot is a very precise, robotic hot glue gun mounted to a very precise, robotic positioning system,” the three writers point out.

In 213 pages, the book covers the basics, from history to set-up, and then shows you how to “print 10 useful objects right away.” It also introduces how to design your own 3D objects, using SketchUp, Autodesk 123D, OpenSCAD, and some other tools.

Getting Started with MakerBot is well-written, heavily illustrated, and organized to help you advance from unboxing a MakerBot to turning out products and creations and becoming a significant citizen of the “Thingiverse”—where “one must share designs…but all are welcome to reap the bounty of shared digital designs for physical objects.”

***

Getting Started with Raspberry Pi
Matt Richardson & Shawn Wallace
(O’Reilly – paperback, Kindle)

The Raspberry Pi “is meant as an educational tool to encourage kids to experiment with computers.” But many adults are latching onto the tiny device as well, because it comes preloaded with interpreters and compilers for several programming languages, including Python, Scratch, C, Ruby, Java, and Perl. Its operating system is Raspbian, a Linux distribution.

The Raspberry Pi is not plug-and-play, but it can be connected to – and control – a number of electronic devices. And the list of uses for the microcomputer keeps growing.

Some owners have made their Raspberry Pi devices into game machines. Others have connected many of the units together to create low-budget supercomputers. Some are using them as web servers. And still others work at the “bare metal” of a Raspberry Pi to create and test new operating systems. Intriguing new roles for the Raspberry Pi keep appearing, and the surge will continue as more adults and kids start working with the tiny but powerful device.

Getting Started with Raspberry Pi covers the basics of hooking up, programming and running the device. It also provides several starter projects, including how to use a Raspberry Pi as a web server or in other roles.

Once you know what you’re doing, “You can even create your own JSON API for an electronics project!” the authors promise.
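As a rough idea of what such a JSON API might look like, here is a small sketch that uses only Python's standard library. It is not the book's code, and the book's own approach may differ. The `read_pins` function, the pin numbers, the `/pins` path, and port 8000 are all hypothetical placeholders; on real hardware, `read_pins` would be replaced with actual input readings.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def read_pins():
    """Hypothetical stand-in for reading input pins on the device."""
    return {"17": 1, "27": 0}


def pin_status_json():
    """Build the JSON document the API returns."""
    return json.dumps({"pins": read_pins()}, sort_keys=True)


class PinHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only one (made-up) endpoint is served in this sketch.
        if self.path != "/pins":
            self.send_error(404)
            return
        body = pin_status_json().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=8000):
    """Create the HTTP server without starting it."""
    return HTTPServer(("", port), PinHandler)
```

Calling `make_server().serve_forever()` would start the server, after which a browser or another program on the network could request `/pins` and receive the pin states as JSON.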

The well-written book packs a lot of how-to information into its 160 pages, including working at the command line in Linux, learning to program the device, and creating simple games in Python and Scratch.

– Si Dunn

Programming C# 5.0 – Excellent how-to guide for experienced developers ready to learn C# – #bookreview

Programming C# 5.0
Ian Griffiths
(O’Reilly, paperback, Kindle)

Ian Griffiths’ new book is for “experienced developers,” not for beginners hoping to learn the basics of programming while also learning C#. The focus is “Building Windows 8, Web, and Desktop Applications for the .NET 4.5 Framework.”

Earlier editions in the Programming C# series have “explained some basic concepts such as classes, polymorphism, and collections,” Griffiths notes. But C# also keeps growing in power and size, which means the page counts of its how-to manuals must keep growing, too, to cover “everything.”

The paperback version of Programming C# 5.0 weighs in at 861 pages and more than three pounds. So Griffiths’ choice to sharpen the book’s focus is a smart one. Beginners can learn the basics of programming in other books and other ways before digging into this edition. And experienced developers will find that the author’s explanations and code examples now have space to go “into rather more detail” than would have been possible if chapters explaining the basics of programming had been packed in, as well.

If you have done some programming and know a class from an array, this book can be your well-structured guide to learning C#. The “basics” are gone, but you still are shown how to create a “Hello World” program—primarily so you can see how new C# projects are created in Visual Studio, Microsoft’s development environment.

C# has been around since 2000 and “can be used for many kinds of applications, including websites, desktop applications, games, phone apps, and command-line utilities,” Griffiths says.

“The most significant new feature in C# 5.0,” he emphasizes, “is support for asynchronous programming.” He notes that “.NET has always offered asynchronous APIs (i.e., ones that do not wait for the operation they perform to finish before returning). Asynchrony is particularly important with input/output (I/O) operations, which can take a long time and often don’t require any active involvement from the CPU except at the start and end of an operation. Simple, synchronous APIs that do not return until the operation completes can be inefficient. They tie up a thread while waiting, which can cause suboptimal performance in servers, and they’re also unhelpful in client-side code, where they can make a user interface unresponsive.”

In the past, however, “the more efficient and flexible asynchronous APIs” have been “considerably harder to use than their synchronous counterparts. But now,” Griffiths points out, “if an asynchronous API conforms to a certain pattern, you can write C# code that looks almost as simple as the synchronous alternative would.”
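The general idea can be sketched in Python, whose `async` and `await` keywords work along broadly similar lines. This is an analogy rather than C# code or an example from the book; the `fetch` function and its delays are invented stand-ins for slow I/O operations.

```python
import asyncio
import time


async def fetch(name, delay):
    """Pretend to perform a slow I/O operation without blocking a thread."""
    await asyncio.sleep(delay)   # control returns to the event loop here
    return f"{name} done"


async def main():
    start = time.perf_counter()
    # Start three operations at once; the waits overlap instead of adding up.
    results = await asyncio.gather(
        fetch("a", 0.2),
        fetch("b", 0.2),
        fetch("c", 0.2),
    )
    elapsed = time.perf_counter() - start
    return results, elapsed


results, elapsed = asyncio.run(main())
print(results, round(elapsed, 1))
```

The code reads almost like ordinary sequential code, yet the three waits overlap on a single thread, so the total time is close to one delay rather than three.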

If you are an experienced programmer hoping to add C# to your language skills, Ian Griffiths’ new book covers much of what you need to know, including how to use XAML (pronounced “zammel”) “to create applications of the [touch-screen] style introduced by Windows 8” but also applications for desktop computers and Windows Phone.

Yes, Microsoft created C#, but there are other ways to run it, too, Griffiths adds.

“The open source Mono project (http://www.mono-project.com/) provides tools for building C# applications that run on Linux, Mac OS X, iOS, and Android.”

Si Dunn


Learning Node – A good how-to guide for server-side Web development with Node.js – #programming #bookreview

Learning Node
Shelley Powers
(O’Reilly, paperback, Kindle)

“Node is designed to be used for [server-side] applications that are heavy on input/output (I/O), but light on computation,” veteran Web technology author Shelley Powers notes in Learning Node, her ninth and newest how-to book from O’Reilly.

“Node.js,” she explains, “is a server-side technology that’s based on Google’s V8 JavaScript engine. It’s a highly scalable system that uses asynchronous, event-driven I/O (input/output), rather than threads or separate processes. It’s ideal for web applications that are frequently accessed but computationally simple.”

I’ve criticized some previous Node books (1) for assuming that all of their readers know a lot about Node.js and assorted programming languages and (2) for not giving enough step-by-step installation and start-up information.

Happily, Learning Node is well written, nicely illustrated with code samples and screen shots, and assumes only that you have some working familiarity with JavaScript. It gives a detailed overview of how to set up development environments in Linux (Ubuntu) and Windows 7. “Installation on a Mac should be similar to installation on Linux,” the author adds.

One caveat regarding code examples: “Most were tested in a Linux environment, but should work, as is, in any Node environment.”

The 374-page book has 16 chapters. The first five “cover both getting Node and the package manager (npm) installed, how to use them, creating your first applications, and utilizing modules.”

Shelley Powers notes that she incorporates “the use of the Express framework, which also utilizes the Connect middleware, throughout the book.” So if you have little or no experience with Express, you will need to pay attention to chapters 6 through 8. But: “After these foundation chapters, you can skip around a bit,” she adds.

Some of the additional chapters cover key/value pairs, using MongoDB with Node, and working with Node’s relational database bindings.

Two chapters get into specialized application use. “Chapter 12 focuses purely on graphics and media access, including how to provide media for the new HTML5 video element, as well as working with PDF documents and Canvas,” the author points out. “Chapter 13 covers the very popular Sockets.io module, especially for working with the new web socket functionality.”

The final chapters are crucial, particularly if you want to move from learning Node to working in a production environment. Chapter 14 covers “Testing and Debugging Node Applications.” Chapter 15 “covers issues of security and authority…it is essential that you spend time in this chapter before you roll a Node application out for general use.”

Meanwhile, Chapter 16 describes “how to prepare your application for production use, including how to deploy your Node application not only on your own system, but also in one of the cloud servers that are popping up to host Node applications.”

Learning Node is both an excellent overall introduction to Node.js and a how-to reference guide that you will want to keep close at hand as you develop and deploy Node applications.

Si Dunn


Learning PHP, MySQL, JavaScript, and CSS, 2nd Edition – Dynamic websites #programming #bookreview

Learning PHP, MySQL, JavaScript, and CSS, 2nd Edition
Robin Nixon
(O’Reilly, paperback, Kindle)

Robin Nixon recently has updated and expanded his popular 2009 “step-by-step guide to creating dynamic websites.” The new edition has an added section that focuses on Cascading Style Sheets (CSS), so the book “now covers all four of the most popular web development technologies.”

Nixon notes: “The real beauty of PHP, MySQL, JavaScript, and CSS is the way in which they all work together to produce dynamic web content: PHP handles the main work on the web server, MySQL manages all of the data, and the combination of CSS and JavaScript looks after web page presentation. JavaScript can also talk with your PHP code on the web server whenever it needs to update something (either on the server or on the web page).”

The book’s opening chapters introduce (1) what dynamic web content means and (2) how to set up a development server on your Windows PC, Mac, or Linux machine. After that, Learning PHP, MySQL, JavaScript, & CSS, 2nd Edition follows the structure outlined by its title. First, you get a five-chapter tutorial on PHP programming. Then, two chapters show how to use MySQL. One additional chapter shows how to access MySQL using PHP, and two related chapters deal with (1) form handling and (2) cookies, sessions and authentication, using PHP and MySQL.

Three chapters introduce JavaScript programming. A fourth chapter covers “JavaScript and PHP Validation and Error Handling.” And one additional chapter describes “how to implement Ajax using JavaScript.”

Ajax, Nixon explains, “not only substantially reduces the amount of data that must be sent back and forth [between a browser and a server] but also makes web pages seamlessly dynamic, allowing them to behave more like self-contained applications.”

CSS gets its turn next, with an introductory chapter, a chapter on advanced CSS with CSS3, and a chapter on accessing CSS from JavaScript.

Finally, in the “Bringing It All Together” chapter, Nixon shows how to build a simple social networking site, using all of the tools introduced in the book.

Learning PHP, MySQL, JavaScript, & CSS, 2nd Edition is an excellent how-to guide for web development beginners who have moderate computer skills and a little bit of experience with HTML and static web pages. The book is nicely written and well-illustrated, and the code examples generally are easy to follow. Screen shots and other descriptions of expected results also can help keep you moving forward on the right path.

No book can cover everything you need to know, of course, particularly when several different types of software are involved. You may need occasional help from someone who has used one or more of the described programs. And some of the screen examples may appear a bit different on your machine as new software updates are released. But Robin Nixon’s updated edition can take you a long way toward the goal of learning how to design, create, post, and maintain dynamic web pages, using free, open source tools.

Si Dunn

Ubuntu Made Easy – A simple, well-guided way to try Linux without installing it – #bookreview

Ubuntu Made Easy: A Project-Based Introduction to Linux
Rickford Grant, with Phil Bull
(No Starch Press, paperback, Kindle)

Curious about Linux? (Many of us are.) Wondering if you should put it on one of your PCs and venture out into a different realm that some of our geek friends constantly tell us is “better” (or even “vastly better”) than Windows?

The Ubuntu 12.04 (Precise Pangolin) CD that is packaged with this book “lets you both try Ubuntu without installing it and install Ubuntu to your hard drive once you’re ready,” the writers note. “It’s called a live CD. You can boot your computer from the CD and run Ubuntu directly off the CD without touching your hard disk to see if you like Ubuntu and to make sure that Ubuntu will work with your hardware. If, after running the live CD, you like what you see and everything seems to work, you can install Ubuntu on your computer using the same disc.”

During the installation process, you can choose to install Ubuntu to run within Windows (with slightly limited functionality), using the Wubi installer. Or you can take the full plunge and install Ubuntu outside of Windows. You can put it in a separate partition and create a Windows-Linux dual-boot setup. Or you can replace Windows with Linux, after carefully backing up all of your important data.

Ubuntu Made Easy: A Project-Based Introduction to Linux offers plenty of clear how-to information, screen shots, and step-by-step tips in its 22 chapters and four appendices. One detailed chapter covers how to fix common problems that may be encountered. The book’s cover is goofy, but the contents are solid.

The projects in this book primarily are exercises that help you put your new Linux and Ubuntu knowledge to work. You will learn, the authors state, how to “configure and customize your Ubuntu system.” And the book is organized “so that, as much as possible, you won’t be asked to do something that you haven’t already learned.”

Specifically, the projects range from (1) setting up printers, scanners, flash drives and other devices so they work with Linux, to (2) creating documents, spreadsheets, and presentations with the office-related applications, to (3) installing and playing free games and editing and sharing digital videos and photographs, to (4) using or staying away from the Linux command line.

Ubuntu Made Easy is not a book that will appeal to “seasoned geeks or power users” with Linux experience, the authors concede. But it does take a lot of the mystery out of what “running Linux” actually means.

The book and Ubuntu CD can make it simple and affordable for many computer users to see what much of the hype and hoopla over Linux is all about — and then decide, from first-hand experience, if they want to join in or not.

Si Dunn