‘Introducing Data Science’ – A good doorway into the world of processing, analyzing & displaying Big Data – #bookreview

Introducing Data Science

Davy Cielen, Arno D. B. Meysman, and Mohamed Ali

Manning – paperback

The three authors of this book note that “[d]ata science is a very wide field, so wide indeed that a book ten times the size of this one wouldn’t be able to cover it all. For each chapter, we picked a different aspect we find interesting. Some hard decisions had to be made to keep this book from collapsing your bookshelf!”

In their decisions and selections, they have made some good choices. Introducing Data Science is well written and generally well-organized (unless you are overly impatient to get to hands-on tasks).

The book appears to be aimed primarily at individual computer users and persons contemplating possible careers in data science–not those already working in, or heading, big data centers. The book also could be good for managers and others trying to wrap their heads around some data science techniques that could help them cope with swelling mountains of business data.

With this book in hand, you may be impatient to open it to the first chapter and dive headfirst into slicing, dicing, and graphing data. Try to curb your enthusiasm for a little while. Books from Manning generally avoid the “jump in now, swim later” approach. Instead, you get some overviews, explanations and theory first. Then you start getting to the heart of the matter. Some like this approach, while others get impatient with it.

In Introducing Data Science, your “First steps in big data” start happening in chapter five, after you’ve first delved into the data science process: 1. Setting the research goal; 2. Retrieving data; 3. Data preparation; 4. Data exploration; 5. Data modeling; and 6. Presentation and automation.

The “First steps” chapter also is preceded by chapters on machine learning and how to handle large data files on a single computer.

Once you get to Chapter 5, however, your “First steps” start moving pretty quickly. You are shown how to work (at the sandbox level) with two big data applications, Hadoop and Spark. And you get examples of how even Python can be used to write big data jobs.

From there, you march on to (1) the use of NoSQL databases and graph databases, (2) text mining and text analytics, and (3) data visualization and creating a small data science application.

It should be noted and emphasized, however, that the concluding pages of chapter 1 do present “An introductory working example of Hadoop.” The authors explain how to run “a small [Hadoop] application in a big data context,” using a Hortonworks Sandbox image inside a VirtualBox.

It’s not grand, but it is a start in a book that otherwise would take four chapters to get to the first hands-on part.

Near the beginning of their book, the authors also include a worthy quote from Morpheus in “The Matrix”: “I can only show you the door. You’re the one that has to walk through it.”

This book can be a good entry door to the huge and rapidly changing field of data science, if you are willing to go through it and do the work it presents.

Si Dunn

Agile Metrics in Action: A good how-to guide to getting better performance measurements – #programming #bookreview

Agile Metrics in Action

Christopher W. H. Davis

Manning – paperback

In the rapidly changing world of software development, metrics “represent the data you can get from your application lifecycle as it applies to the performance of software development teams,” Christopher W. H. Davis writes in his well-written, well-structured new book, Agile Metrics in Action.

“A metric can come from a single data source or it can be a combination of data from multiple data sources. Any data point that you track eventually becomes a metric that you can use to measure your team’s performance.”

The goals of agile metrics include collecting and analyzing data from almost every useful and accessible point in the software development life cycle, so team and individual performances can be measured and improved, and processes can be streamlined.

A key aspect of the data collection and analysis process is distributing the resulting information “across the organization in such a way that everyone can get the data they care about at a glance,” Davis says. He explains how and highlights some “traps” that teams can “fall into when they start publishing metrics,” such as “[s]ending all the data to all stakeholders,” many of whom won’t know what to do with most of it.

Metrics remain a controversial topic for many software developers, Davis emphasizes. So any business leader planning to rush his or her company into adopting agile metrics will need to proceed cautiously, instead. It is vital to get buy-in first from developers and their managers, he says.

“There will likely be people in your group who want nothing to do with measuring their work,” he explains. “Usually this stems from the fear of the unknown, fear of Big Brother, or a lack of control. The whole point here is that teams should measure themselves, not have some external person or system tell them what’s good and bad. And who doesn’t want to get better? No one is perfect—we all have a lot to learn and we can always improve.”

The concept of continuous development is a key topic in this book. “In today’s digital world consumers expect the software they interact with every day to continuously improve,” Davis states. “Mobile devices and web interfaces are ubiquitous and are evolving so rapidly that the average consumer of data expects interfaces to continually be updated and improved. To be able to provide your consumers the most competitive products, the development world has adapted by designing deployment systems that continuously integrate, test, and deploy changes. When used to their full potential, continuous practices allow development teams to hone their consumer’s experience multiple times per day.”

Of course, continuous development produces continuous data to measure and manage, as well, using agile metrics techniques.

Many different topics are addressed effectively in this book. And the practices the author presents are organized to work with any development process or tool stack. However, the software tools Davis favors for this book’s code-based examples include Grails, Groovy and MongoDB.

Agile Metrics in Action is structured and written to serve as a how-to book for virtually anyone associated with a software development team that relies on agile metrics. You may not understand all of the text. But if you take your time with this well-illustrated book, you can at least gain a better comprehension of what agile metrics means, how the process works, and why it is important to your employer, your group and your paycheck.

Si Dunn

R IN ACTION: Data Analysis and Graphics with R, 2nd Edition – #bookreview

R in Action

Data Analysis and Graphics with R

Robert I. Kabacoff

Manning – paperback

Whether data analysis is your field, your current major or your next career-change ambition, you likely should get this book. Free and open source R is one of the world’s most popular languages for data analysis and visualization. And Robert I. Kabacoff’s updated new edition is, in my opinion, one of the top books out there for getting a handle on R. (I have used and previously reviewed several R how-to books.)

R is relatively easy to install on Windows, Mac OS X and Linux machines. But it is generally considered difficult to learn. Much of that is because of its rich abundance of features and packages, as well as its ability to create many types of graphs. “The base installation,” Kabacoff writes, “provides hundreds of data-management, statistical, and graphical functions out of the box. But some of its most powerful features come from the thousands of extensions (packages) provided by contributing authors.”

Kabacoff concedes: “It can be hard for new users to get a handle on what R is and what it can do.” And: “Even the most experienced R user is surprised to learn about features they were unaware of.”

R in Action, Second Edition, contains more than 200 pages of new material. And it is nicely structured to meet the needs of R beginners, as well as those of us who have some experience and want to gain more.

The book (579 pages in print format) is divided into five major parts. The first part, “Getting Started,” takes the beginner from installing and trying R to creating data sets, working with graphs, and managing data. Part 2, “Basic Methods,” focuses on “graphical and statistical techniques for obtaining basic information about data.”

Part 3, “Intermediate Methods,” moves the reader well beyond “describing the relationship between two variables.” It introduces regression, analysis of variance, power analysis, intermediate graphs, and resampling statistics and bootstrapping. Part 4 presents “Advanced Methods,” including generalized linear models, principal components and factor analysis, time series, cluster analysis, classification, and advanced methods for missing data.

Part 5, meanwhile, offers how-to information for “Expanding Your Skills.” The topics include: advanced graphics with ggplot2, advanced programming, creating a package, creating dynamic reports, and developing advanced graphics with the lattice package.

A key strength of R in Action, Second Edition is Kabacoff’s use of generally short code examples to illustrate many of the ways that data can be entered, manipulated, analyzed and displayed in graphical form.

The first thing I did, however, was start at the very back of the book, Appendix G, and upgrade my existing version of R to 3.2.1, “World-Famous Astronaut.” The upgrade instructions could have been a little bit clearer, but after hitting a couple of unmentioned prompts and changing a couple of wrong choices, the process turned out to be quick and smooth.

Then I started reading chapters and keying in some of the code examples. I had not used R much recently, so it was fun again to enter some commands and numbers and have nicely formatted graphs suddenly pop open on the screen.

Even better, it is nice to have a LOT of new things to learn, with a well-written, well-illustrated guidebook in hand.

Si Dunn

BIG DATA: A well-written look at principles & best practices of scalable real-time data systems – #bookreview

Big Data

Principles and best practices of scalable real-time data systems

Nathan Marz, with James Warren

Manning – paperback

Get this book, whether you are new to working with Big Data or now an old hand at dealing with Big Data’s seemingly never-ending (and steadily expanding) complexities.

You may not agree with all that the authors offer or contend in this well-written “theory” text. But Nathan Marz’s Lambda Architecture is well worth serious consideration, especially if you are now trying to come up with more reliable and more efficient approaches to processing and mining Big Data. The writers’ explanations of some of the power, problems, and possibilities of Big Data are among the clearest and best I have read.

“More than 30,000 gigabytes of data are generated every second, and the rate of data creation is only accelerating,” Marz and Warren point out.

Thus, previous “solutions” for working with Big Data are now getting overwhelmed, not only by the sheer volume of information pouring in but by greater system complexities and failures of overworked hardware that now plague many outmoded systems.

The authors have structured their book to show “how to approach building a solution to any Big Data problem. The principles you’ll learn hold true regardless of the tooling in the current landscape, and you can use these principles to rigorously choose what tools are appropriate for your application.” In other words, they write, you will “learn how to fish, not just how to use a particular fishing rod.”

Marz’s Lambda Architecture also is at the heart of Big Data, the book. It is, the two authors explain, “an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data. It describes a scalable, easy-to-understand approach to Big Data systems that can be built and run by a small team.”

The Lambda Architecture has three layers: the batch layer, the serving layer, and the speed layer.

Not surprisingly, the book likewise is divided into three parts, each focusing on one of the layers:

  • In Part 1, chapters 4 through 9 deal with various aspects of the batch layer, such as building a batch layer from end to end and implementing an example batch layer.
  • Part 2 has two chapters that zero in on the serving layer. “The serving layer consists of databases that index and serve the results of the batch layer,” the writers explain. “Part 2 is short because databases that don’t require random writes are extraordinarily simple.”
  • In Part 3, chapters 12 through 17 explore and explain the Lambda Architecture’s speed layer, which “compensates for the high latency of the batch layer to enable up-to-date results for queries.”
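For readers who have not met the pattern before, the way the three layers fit together can be sketched in a few lines of Python. This is only a toy illustration, not code from the book: the event lists, the function names, and the idea of answering a query by adding a batch count to a realtime count are assumptions chosen to keep the example small.

```python
from collections import Counter

# Hypothetical page-view events already absorbed by the last batch run.
batch_events = ["home", "about", "home", "pricing"]
# Hypothetical events that arrived after that batch run began.
recent_events = ["home", "pricing"]


def build_batch_view(events):
    """Batch layer: recompute the whole view from the full dataset."""
    return Counter(events)


def build_realtime_view(events):
    """Speed layer: cover only the data the batch layer has not seen yet."""
    return Counter(events)


def query(page, batch_view, realtime_view):
    """Serving side: merge the precomputed batch view with the realtime view."""
    return batch_view[page] + realtime_view[page]


batch_view = build_batch_view(batch_events)
realtime_view = build_realtime_view(recent_events)
home_views = query("home", batch_view, realtime_view)
print(home_views)
```

In a real system the batch view would be rebuilt periodically from the full dataset and the realtime view would then be discarded for the period it covered. Here both are plain in-memory counters.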

Marz and Warren contend that “[t]he benefits of data systems built using the Lambda Architecture go beyond just scaling. Because your system will be able to handle much larger amounts of data, you’ll be able to collect even more data and get more value out of it. Increasing the amount and types of data you store will lead to more opportunities to mine your data, produce analytics, and build new applications.”

This book requires no previous experience with large-scale data analysis, nor with NoSQL tools. However, it helps to be somewhat familiar with traditional databases. Nathan Marz is the creator of Apache Storm and originator of the Lambda Architecture. James Warren is an analytics architect with a background in machine learning and scientific computing.

If you think the Big Data world already is too much with us, just stick around a while. Soon, it may involve almost every aspect of our lives.

Si Dunn

D3.js in Action: A good book packed with data visualization how-to info – #javascript #programming

D3.js in Action

Elijah Meeks

Manning – paperback

The D3.js library is very powerful, and it is full of useful choices and possibilities. But, you should not try to tackle Elijah Meeks’s new book if you are a JavaScript newcomer and not also comfortable with HTML, CSS and JSON.

It likewise helps to understand how CSVs (Comma Separated Values) can be used. And you should know how to set up and run local web servers on your computer. Prior knowledge of D3.js and SVG (Scalable Vector Graphics) is not necessary, however.

Some reviewers have remarked on the amount of how-to and technical information packed into D3.js in Action. It is indeed impressive. And, yes, it really can seem like concepts, details and examples are being squirted at you from a fire hose, particularly if you are attempting to race through the text. As Elijah Meeks writes, “[T]he focus of this book is on a more exhaustive explanation of key principles of the library.”

So plan to take your time. Tackle D3.js in small bites, using the d3js.org website and this text. I am pretty new to learning data visualization, and I definitely had never heard of visualizations such as Voronoi diagrams, nor tools such as TopoJSON, until I started working my way through this book. And those are just a few of the available possibilities.

I have not yet tried all of the code examples. But the ones I have tested have worked very well, and they have gotten me thinking about how I can adapt them to use in some of my work.

I am a bit disappointed that the book takes 40 pages to get to the requisite “Hello, world” examples. And once you arrive, the explanations likely will seem a bit murky and incomplete to some readers.

However, that is a minor complaint. D3.js in Action will get frequent use as I dig deeper into data visualization. D3.js and Elijah Meeks’s new book are keepers for the long term in the big world of JavaScript.

Si Dunn

HADOOP IN PRACTICE, 2nd Edition – An updated guide to handling some of the ‘trickier and dirtier aspects of Hadoop’ – #programming #bookreview

Hadoop in Practice, Second Edition

Alex Holmes

(Manning – paperback)

The Hadoop world has undergone some big changes lately, and this hefty, updated edition offers excellent coverage of a lot of what’s new. If you currently work with Hadoop and MapReduce or are planning to take them up soon, give serious consideration to adding this well-written book to your technical library. A key feature of the book is its “104 techniques.” These show how to accomplish practical and important tasks when working with Hadoop, MapReduce and their growing array of software “friends.”

The author, Alex Holmes, has been working with Hadoop for more than six years and is a software engineer, author, speaker, and blogger specializing in large-scale Hadoop projects.

His new second edition, he writes, “covers Hadoop 2, which at the time of writing is the current production-ready version of Hadoop. The first edition of the book covered Hadoop 0.22 (Hadoop 1 wasn’t yet out), and Hadoop 2 has turned the world upside-down and opened up the Hadoop platform to processing paradigms beyond MapReduce. YARN, the new scheduler and application manager in Hadoop 2, is complex and new to the community, which prompted me to dedicate a new chapter 2 to covering YARN basics and to discussing how MapReduce now functions as a YARN application.”

In the book, Holmes notes that “Parquet has also recently emerged as a new way to store data in HDFS—its columnar format can yield both space and time efficiencies in your data pipelines, and it’s quickly becoming the ubiquitous way to store data. Chapter 4 includes extensive coverage of Parquet, which includes how Parquet supports sophisticated object models such as Avro and how various Hadoop tools can use Parquet.”

Furthermore, “[h]ow data is being ingested into Hadoop has also evolved since the first edition,” Holmes points out, “and Kafka has emerged as the new data pipeline, which serves as the transport tier between your data producers and data consumers, where a consumer would be a system such as Camus that can pull data from Kafka into HDFS. Chapter 5, which covers moving data into and out of Hadoop, now includes coverage of Kafka and Camus.” [Reviewer’s note: Interesting software names here. Franz Kafka and Albert Camus were writers deeply concerned about finding clarity and meaning in a world that seemed to offer none.]

Holmes adds that “[t]here are many new technologies that YARN now can support side by side in the same cluster, and some of the more exciting and promising technologies are covered in the new part 4, titled ‘Beyond MapReduce,’ where I cover some compelling new SQL technologies such as Impala and Spark SQL. The last chapter, also new for this edition, looks at how you can write your own YARN application, and it’s packed with information about important features to support your YARN application.”

Hadoop and MapReduce have gained reputations (well-earned) for being difficult to set up, use and master. In his new edition, Holmes describes his own early experiences: “The greatest challenge we faced when working with Hadoop, and specifically MapReduce, was relearning how to solve problems with it. MapReduce is its own flavor of parallel programming, and it’s quite different from the in-JVM programming that we were accustomed to. The first big hurdle was training our brains to think MapReduce, a topic which the book Hadoop in Action by Chuck Lam (Manning Publications, 2010) covers well.”

(These days, of course, there are both open source and commercial releases of Hadoop, as well as quickstart virtual machine versions that are good for learning Hadoop.)

Holmes continues: “After one is used to thinking in MapReduce, the next challenge is typically related to the logistics of working with Hadoop, such as how to move data in and out of HDFS and effective and efficient ways to work with data in Hadoop. These areas of Hadoop haven’t received much coverage, and that’s what attracted me to the potential of this book—the chance to go beyond the fundamental word-count Hadoop uses and covering some of the trickier and dirtier aspects of Hadoop.”
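For anyone who has not yet seen the “fundamental word-count” example Holmes mentions, the shape of the idea can be sketched in plain Python. This is not Hadoop code and not taken from the book. The function names and sample lines are made up, and everything runs in one process, whereas Hadoop would spread the map and reduce steps across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the list of values for one key into a total."""
    return key, sum(values)


lines = ["big data big jobs", "data jobs"]  # hypothetical input
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

The point of the exercise is the one Holmes makes: the problem has to be restated as independent map calls plus a per-key reduce, which is a different habit of mind from writing a single loop.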

If you have difficulty explaining Hadoop to others (such as a manager or executive hesitant to let it be implemented), Holmes offers a succinct summation in his updated book:

“Doug Cutting, the creator of Hadoop, likes to call Hadoop the kernel for big data, and I would tend to agree. With its distributed storage and compute capabilities, Hadoop is fundamentally an enabling technology for working with huge datasets. Hadoop provides a bridge between structured (RDBMS) and unstructured (log files, XML, text) data and allows these datasets to be easily joined together.”

One book cannot possibly cover everything you need to know about Hadoop, MapReduce, Parquet, Kafka, Camus, YARN and other technologies. And this book and its software examples assume that you have some experience with Java, XML and JSON. Yet Hadoop in Practice, Second Edition gives a very good and reasonably deep overview, spanning such major categories as background and fundamentals, data logistics, Big Data patterns, and moving beyond MapReduce.

Si Dunn

Optimizing Hadoop for MapReduce – A practical guide to lowering some costs of mining Big Data – #bookreview

Optimizing Hadoop for MapReduce

Learn how to configure your Hadoop cluster to run optimal MapReduce jobs

Khaled Tannir

(Packt Publishing, paperback, Kindle)

Time is money, as the old saying goes. And that saying especially applies to the world of Big Data, where much time, computing power and cash can be consumed while trying to extract profitable information from mountains of data.

This short, well-focused book by veteran software developer Khaled Tannir describes how to achieve a very important, money-saving goal: improve the efficiency of MapReduce jobs that are run with Hadoop.

As Tannir explains in his preface:

“MapReduce is an important parallel processing model for large-scale, data-intensive applications such as data mining and web indexing. Hadoop, an open source implementation of MapReduce, is widely applied to support cluster computing jobs that require low response time.

“Most of the MapReduce programs are written for data analysis and they usually take a long time to finish. Many companies are embracing Hadoop for advanced data analytics over large datasets that require time completion guarantees.

“Efficiency, especially the I/O costs of MapReduce, still needs to be addressed for successful implications. The experience shows that a misconfigured Hadoop cluster can noticeably reduce and significantly downgrade the performance of MapReduce jobs.”

Tannir’s well-focused, seven-chapter book zeroes in on how to find and fix misconfigured Hadoop clusters and numerous other problems. But first, he explains how Hadoop parameters are configured and how MapReduce metrics are monitored.

Two chapters are devoted to learning how to identify system bottlenecks, including CPU bottlenecks, storage bottlenecks, and network bandwidth bottlenecks.

One chapter examines how to properly identify resource weaknesses, particularly in Hadoop clusters. Then, as the book shifts strongly to solutions, Tannir explains how to reconfigure Hadoop clusters for greater efficiency.

Indeed, the final three chapters deliver details and steps that can help you improve how well Hadoop and MapReduce work together in your setting.

For example, the author explains how to make the map and reduce functions operate more efficiently, how to work with small or unsplittable files, how to deal with spilled records (those written to local disk when the allocated memory buffer is full), and ways to tune map and reduce parameters to improve performance.

“Most MapReduce programs are written for data analysis and they usually take a lot of time to finish,” Tannir emphasizes. However: “Many companies are embracing Hadoop for advanced data analytics over large datasets that require completion-time guarantees.” And that means “[e]fficiency, especially the I/O costs of MapReduce, still need(s) to be addressed for successful implications.”

He describes how to use compression, Combiners, the correct Writable types, and quick reuse of types to help improve memory management and the speed of job execution.
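Why a Combiner helps can be shown with a small Python sketch. This is an illustration of the general idea only, not Tannir’s code and not the Hadoop API: the function names and the sample log text are assumptions. A combiner pre-aggregates one mapper’s output locally, so fewer records have to be written out and sent to the reducers.

```python
from collections import Counter


def map_words(text):
    """Map: emit one (word, 1) record per word in this mapper's input split."""
    return [(word, 1) for word in text.split()]


def combine(pairs):
    """Combiner: sum counts per word locally, before anything leaves the mapper."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())


split = "error error warn error warn"  # hypothetical input split
raw = map_words(split)
combined = combine(raw)
print(len(raw), "records without a combiner;", len(combined), "with one")
```

Because addition is associative and commutative, the reducer produces the same totals either way. Only the volume of intermediate data changes.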

And, along with other tips, Tannir presents several “best practices” to help manage Hadoop clusters and make them do their work quicker and with fewer demands on hardware and software resources. 

Tannir notes that “setting up a Hadoop cluster is basically the challenge of combining the requirements of high availability, load balancing, and the individual requirements of the services you aim to get from your cluster servers.”

If you work with Hadoop and MapReduce or are now learning how to help install, maintain or administer Hadoop clusters, you can find helpful information and many useful tips in Khaled Tannir’s Optimizing Hadoop for MapReduce.

Si Dunn

Lean Analytics and Lean UX – Two new guides to better business and user experiences – #bookreview

Okay, how are we leaning today? Leaning in? Leaning back? Leaning to the left or right? Leaning over? Or just leaning toward chucking all “hot new” postures that supposedly help us pose ourselves for career success?

Here’s some good news. None of the above leanings are topics in two new books from O’Reilly’s popular “Lean” series, edited by Eric Ries.

Lean Analytics deals with using data to help you determine if there is a profitable need for the product or service you hope to offer with a startup business. Lean UX, meanwhile, deals with the process of designing a better user experience (UX) for a company’s apps, website or other products. Here are short reviews of each book:

Lean Analytics
Use Data to Build a Better Startup Faster
Alistair Croll and Benjamin Yoskovitz
(O’Reilly – hardback, Kindle)

“Entrepreneurs,” the authors state, “are particularly good at lying to themselves. Lying may even be a prerequisite for succeeding as an entrepreneur–after all, you need to convince others that something is true in the absence of good, hard evidence. You need believers to take a leap of faith with you. As an entrepreneur, you need to live in a semi-delusional state just to survive the inevitable rollercoaster ride of running your startup.”

But…you also need cold, hard data. And what you learn from that data may not mesh well with the lie you are living as you try to start a new business from scratch. Yet, it may save you from failing and wasting a lot of money.

“Your delusions,” the authors argue, “no matter how convincing, will wither under the harsh light of data. Analytics is the necessary counterweight to lying, the yin to the yang of hyperbole. Moreover, data-driven learning is the cornerstone of success in startups. It’s how you learn what’s working and iterate toward the right product and market before the money runs out.”

Lean Analytics builds on the Lean Startup process developed by Eric Ries. In today’s digital world, the authors explain, “[w]e’re in the midst of a fundamental shift in how companies are built. It’s vanishingly cheap to create the first version of something. Clouds are free. Social media is free. Competitive research is free. Even billing and transactions are free.”

Taken together, these facilities mean “you can build something, measure its effect, and learn from it to build something better next time. You can iterate quickly, deciding early on if you should double down on your idea or fold and move on to the next one.”

Their 409-page book is not quick reading. But it deserves attention and study, whether you want to start a business, already have started a business, or hope to revamp and improve a business that has been in operation for some time. Lean Analytics presents many examples and case studies that illustrate how you can gather and analyze existing data, then test products or services to determine if they are something that customers actually need, want and will use.

With new data from the tests and the ability to continue testing, you can modify your product or service and focus more resources, energy, and time on improving and refining what will work best for your customers–and your bottom line.

***

Lean UX
Applying Lean Principles to Improve User Experience
Jeff Gothelf with Josh Seiden
(O’Reilly – hardback, Kindle)

“Lean UX is a collaborative process,” the two authors of this book emphasize. “It brings designers and non-designers together in co-creation. It yields ideas that are bigger than those of the individual contributors. But it’s not design-by-committee. Instead, Lean UX increases a team’s ownership over the work by providing an opportunity for all opinions to be heard much earlier in the process.”

For example, forget the notion of a web designer hiding in an office for a week or so and then emerging with what he or she insists will be a “masterpiece” as the company’s new home page.

Particularly in software development, a key aspect of Lean and Agile development theories is the notion of creating a Minimum Viable Product (MVP). “Lean UX makes heavy use of the notion of MVP,” the two authors explain. “MVPs help test our assumptions–will this tactic achieve the desired outcome?–while minimizing the work we put into unproven ideas. The sooner we can find which features are worth investing in, the sooner we can focus our limited resources on the best solutions to our business problems. This concept is an important part of how Lean UX minimizes waste.”

The web designer’s “masterpiece” might work okay, but it also might offer costly confusions for customers and others visiting the website. Instead, Lean UX emphasizes collaboration, teamwork, testing prototypes, analyzing the results, gathering feedback from outsiders, revamping the project, testing it again–and continuing the process.

According to the writers, the most powerful tool in Lean UX is one that is basic to human beings: conversation. Indeed, conversation should be “the primary means of communication among team members.” Some of the other tools for collaboration also are basic: pencils, pens, notepads, whiteboards, blackboards, and simple paper templates that can spur discussions, opinions, and basic designs for the Minimum Viable Product and its successors, before moving the work to computers.

Lean UX is just 130 pages long. But it is rich with how-to examples, process descriptions, short case studies, clear steps, useful illustrations, and good examples that you can adapt and employ to create cheaper, faster, and better user experiences.


Si Dunn

All for Search and Search for All: 3 New Books for Putting Search to Work – #bookreview

Seek and ye shall find.

That’s the theory behind the still-debated benefits of digging through Big Data to uncover new, overlooked, or forgotten paths to greater profits and greater understanding.

Big Data, however, is here to stay (and get bigger). And search is what we do to find and extract useful nuggets and diamonds and nickels and dimes of information.

O’Reilly Media recently has published three new, enlightening books focused on the processes, application, and management of search: Enterprise Search by Martin White, Mastering Search Analytics by Brent Chaters, and Search Patterns by Peter Morville and Jeffery Callender.

Here are short looks at each.

Enterprise Search
Martin White
(O’Reilly, paperback, Kindle)

Start with this book if you’re just beginning to explore what focused search efforts and search technology may be able to do for your company.

The book’s key goal is “to help business managers, and the IT teams supporting them, understand why effective enterprise-wide search is essential in any organization, and how to go about the process of meeting user requirements.”

You may think, So what’s the big deal? Just put somebody in a cubicle and pay them to use Google, Bing, and a few other search engines to find stuff.

Search involves much more than that. Even small businesses now have large quantities of potentially profitable information stored internally in documents, emails, spreadsheets and other formats. And large corporations are awash in data that can be mined for trends, warnings, new opportunities, new product or service ideas, and new market possibilities, to name just a few.

The goal of Enterprise Search is to help you set up a managed search environment that benefits your business but also enables employees to use search technology to help them do their jobs more efficiently and productively.

Yet, putting search technology within every worker’s reach is not the complete answer, author Martin White emphasizes.

“The reason for the well-documented lack of satisfaction with a search application,” he writes, “is that organizations invest in technology but not staff with the expertise and experience to gain the best possible return on the investment….”

Enterprise Search explains how to determine your firm’s search needs and how to create an effective search support team that can meet the needs of employees, management, and customers.

Curiously, White waits until his final chapter to list 12 “critical success factors” for getting the most from enterprise-wide search capabilities.

Perhaps, in a future edition, this important list will be positioned closer to the front of the book.

Mastering Search Analytics
Brent Chaters
(O’Reilly – paperback, Kindle)

This in-depth and well-illustrated guide details how a unified, focused search strategy can generate greater traffic for your website, increase conversion rates, and bring in more revenue.

Brent Chaters explains how to use search engine optimization (SEO) and paid search as part of an effective, comprehensive approach.

Key to Chaters’ strategy is the importance of bringing together the efforts and expertise of both the SEO specialists and the Search Engine Marketing (SEM) specialists — two groups that often battle each other for supremacy within corporate settings.

“A well-defined search program should utilize both SEO and SEM tactics to provide maximum coverage and exposure to the right person at the right time, to maximize your revenue,” Chaters contends. “I do not believe that SEO and SEM should be optimized from each other; in fact, there should be open sharing and examination of your overall search strategy.”

His book is aimed at three audiences: “the search specialist, the marketer, and the executive”–particularly executives who are in charge of search campaigns and search teams.

If you are a search specialist, the author expects that “you understand the basics of SEO, SEM, and site search (meaning you understand how to set up a paid search campaign, you understand that organic search cannot be bought, and you understand how your site search operates and works.)”

Search Patterns
Peter Morville and Jeffery Callender
(O’Reilly – paperback, Kindle)

“Search applications demand an obsessive attention to detail,” the two authors of this fine book point out. “Simple, fast, and relevant don’t come easy.”

Indeed, they add, “Search is not a solved problem,” but remains, instead, “a wicked problem of terrific consequence. As the choice of first resort for many users and tasks, search is the defining element of the user experience. It changes the way we find everything…it shapes how we learn and what we believe. It informs and influences our decisions, and it flows into every nook and cranny….Search is among the biggest, baddest, most disruptive innovations around. It’s a source of entrepreneurial insight, competitive advantage, and impossible wealth.”

They emphasize: “Unfortunately, it’s also the source of endless frustration. Search is the worst usability problem on the Web….We find too many results or too few, and most regular folks don’t know where to search, or how….business goals are disrupted by failures in findability…[and] Mobile search is a mess.”

Ouch!

Colorfully illustrated and well-written, Search Patterns centers on major aspects of designing user interfaces for search and discovery. It is aimed at “designers, information architects, students, entrepreneurs, and anyone who cares about the future of search.”

It covers the key bases, “from precision, recall, and relevance to autosuggestion and faceted navigation.” It looks at how search may be reshaped in the future. And, very importantly, it also joins the growing calls for collaboration across disciplines and “tearing down walls to make search better….”
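
For readers new to two of those terms, here is a minimal sketch in Python of how precision and recall are commonly computed for one query. It is not taken from the book. The function name and the document IDs are made up for illustration, and the definitions assumed are the standard ones: precision is the share of returned results that are relevant, and recall is the share of relevant items that were returned.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs the search engine returned (hypothetical data).
    relevant:  IDs a human judged relevant (hypothetical data).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant items that were actually returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up example: four results returned, three documents truly relevant.
retrieved = ["doc1", "doc2", "doc3", "doc4"]
relevant = ["doc2", "doc4", "doc7"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

Too many results tends to hurt precision, and too few tends to hurt recall, which is one way to read the authors’ complaint that “we find too many results or too few.”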

Si Dunn

Natural Language Annotation for Machine Learning – #programming #bookreview

Natural Language Annotation for Machine Learning
James Pustejovsky and Amber Stubbs
(O’Reilly, paperback, Kindle)

You may not be sure what’s going on here, at first, even after you’ve read the tag line on the book’s cover: “A Guide to Corpus-Building for Applications.”

Fortunately, a few definitions inside this book can enlighten you quickly and might even get you interested in delving deeper into natural language processing and computational linguistics as a career.

“A natural language,” the authors note, “refers to any language spoken by humans, either currently (e.g., English, Chinese, Spanish) or in the past (e.g., Latin, ancient Greek, Sanskrit). Annotation refers to the process of adding metadata information to the text in order to augment a computer’s ability to perform Natural Language Processing (NLP).”

Meanwhile: “Machine learning refers to the area of computer science focusing on the development and implementation of systems that improve as they encounter more data.”

And, finally, what is a corpus? “A corpus,” the authors explain, “is a collection of machine-readable texts that have been produced in a natural communicative setting. They have been sampled to be representative and balanced with respect to particular factors; for example, by genre—newspaper articles, literary fiction, spoken speech, blogs and diaries, and legal documents.”

The Internet is delivering vast amounts of information in many different formats to researchers in the fields of theoretical and computational linguistics. And, in turn, specialists are now working to develop new insights and algorithms “and turn them into functioning, high-performance programs that can impact the ways we interact with computers using language.”

This book’s central focus is on learning how an efficient annotation development cycle works and how you can use such a cycle to add metadata to a training corpus that helps machine-learning algorithms work more effectively.

Natural Language Annotation for Machine Learning is not light reading. But it is well structured, well written and offers detailed examples. Using an effective hands-on approach, it takes the reader from annotation specifications and designs to the use of annotations in machine-learning algorithms. And the final two chapters of the 326-page book “give a complete walkthrough of a single annotation project and how it was recreated with machine learning and rule-based algorithms.”

“[I]t is not enough,” the authors emphasize, “to simply provide a computer with a large amount of data and expect it to learn to speak—the data has to be prepared in such a way that the computer can more easily find patterns and inferences. This is usually done by adding relevant metadata to a dataset. Any metadata tag used to mark up elements of the dataset is called an annotation over the input. However,” they point out, “in order for the algorithms to learn efficiently and effectively, the annotation done on the data must be accurate, and relevant to the task the machine is being asked to perform. For this reason, the discipline of language annotation is a critical link in developing intelligent human language technology.”
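
To make the idea of a metadata tag concrete, here is a rough sketch in Python. It is not the authors’ method or any annotation tool’s format. The `Annotation` class, the `annotate` helper, the sample sentence, and the label names are all hypothetical. The sketch assumes one common arrangement, in which each tag is stored apart from the text as a pair of character offsets plus a label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int  # character offset where the tagged span begins
    end: int    # character offset just past the end of the span
    label: str  # hypothetical tag name, e.g. "PERSON"


def annotate(text, phrases):
    """Tag the first occurrence of each phrase in text with its label."""
    annotations = []
    for phrase, label in phrases.items():
        start = text.find(phrase)
        if start != -1:  # skip phrases that do not appear in the text
            annotations.append(Annotation(start, start + len(phrase), label))
    return sorted(annotations, key=lambda a: a.start)


# Made-up sentence and labels, for illustration only.
text = "Ada visited Paris in 1999."
tags = annotate(text, {"Ada": "PERSON", "Paris": "LOCATION", "1999": "DATE"})
for tag in tags:
    print(tag.label, repr(text[tag.start:tag.end]))
```

A real project would rely on human annotators following a written specification rather than simple string matching, but the result has the same shape: the original text, plus a list of labeled spans that a learning algorithm can train on.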

Si Dunn