Natural Language Annotation for Machine Learning – #programming #bookreview

Natural Language Annotation for Machine Learning
James Pustejovsky and Amber Stubbs
(O’Reilly, paperback and Kindle)

You may not be sure what’s going on here at first, even after you’ve read the tagline on the book’s cover: “A Guide to Corpus-Building for Applications.”

Fortunately, a few definitions inside this book can enlighten you quickly and might even get you interested in delving deeper into natural language processing and computational linguistics as a career.

“A natural language,” the authors note, “refers to any language spoken by humans, either currently (e.g., English, Chinese, Spanish) or in the past (e.g., Latin, ancient Greek, Sanskrit). Annotation refers to the process of adding metadata information to the text in order to augment a computer’s ability to perform Natural Language Processing (NLP).”

Meanwhile: “Machine learning refers to the area of computer science focusing on the development and implementation of systems that improve as they encounter more data.”

And, finally, what is a corpus? “A corpus,” the authors explain, “is a collection of machine-readable texts that have been produced in a natural communicative setting. They have been sampled to be representative and balanced with respect to particular factors; for example, by genre—newspaper articles, literary fiction, spoken speech, blogs and diaries, and legal documents.”

The Internet is delivering vast amounts of information in many different formats to researchers in the fields of theoretical and computational linguistics. And, in turn, specialists are now working to develop new insights and algorithms “and turn them into functioning, high-performance programs that can impact the ways we interact with computers using language.”

This book’s central focus is on learning how an efficient annotation development cycle works and how you can use such a cycle to add metadata to a training corpus that helps machine learning algorithms work more effectively.

Natural Language Annotation for Machine Learning is not light reading. But it is well structured, well written, and offers detailed examples. Using an effective hands-on approach, it takes the reader from annotation specifications and designs to the use of annotations in machine learning algorithms. And the final two chapters of the 326-page book “give a complete walkthrough of a single annotation project and how it was recreated with machine learning and rule-based algorithms.”

“[I]t is not enough,” the authors emphasize, “to simply provide a computer with a large amount of data and expect it to learn to speak—the data has to be prepared in such a way that the computer can more easily find patterns and inferences. This is usually done by adding relevant metadata to a dataset. Any metadata tag used to mark up elements of the dataset is called an annotation over the input. However,” they point out, “in order for the algorithms to learn efficiently and effectively, the annotation done on the data must be accurate, and relevant to the task the machine is being asked to perform. For this reason, the discipline of language annotation is a critical link in developing intelligent human language technology.”
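To make the idea concrete, here is a minimal sketch (my own illustration, not an example from the book) of what annotation as metadata can look like in practice: labels attached to character spans of raw text, which a learning algorithm can then consume alongside the text itself.

```python
# Illustrative sketch of span-based annotation: metadata tags
# attached to character offsets in a raw text string.
text = "The cat sat on the mat."

# Each annotation marks a span of the text with a label (the metadata).
# The labels here are hypothetical part-of-speech tags.
annotations = [
    {"start": 0, "end": 3, "label": "DET"},    # "The"
    {"start": 4, "end": 7, "label": "NOUN"},   # "cat"
    {"start": 8, "end": 11, "label": "VERB"},  # "sat"
]

def annotated_spans(text, annotations):
    """Pair each labeled span with the text it covers."""
    return [(text[a["start"]:a["end"]], a["label"]) for a in annotations]

print(annotated_spans(text, annotations))
# [('The', 'DET'), ('cat', 'NOUN'), ('sat', 'VERB')]
```

The key point the authors make is that the value of such tags depends entirely on their accuracy and their relevance to the task at hand.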

Si Dunn