Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL, 2nd Edition
By Michael Schrenk
(No Starch Press, paperback, list price $39.95; Kindle edition, list price $31.95)
Bots have a bad reputation on the Web, but when used properly and for honest purposes, they can be tools for good, for better business and research efficiency, and for profit.
That’s the major premise behind Michael Schrenk’s popular book, now updated from its 2007 first edition.
He is a specialist in “automated agents (webbots, spiders, and screen scrapers)” that “solve problems” web browsers can’t solve on their own.
“The basic problem with browsers,” Schrenk writes, “is that they’re manual tools. Your browser only downloads and renders websites: You still need to decide if the web page is relevant, if you’ve already seen the information it contains or if you need to follow a link to another web page. What’s worse, your browser can’t think for itself. It can’t notify you when something important happens online, and it certainly won’t anticipate your actions, automatically complete forms, make purchases, or download files for you. To do these things, you’ll need the automation and intelligence only available with a webbot, or a web robot. Once you start thinking about the inherent limitations of browsers, you start to see the endless opportunities that wait around the corner for webbot developers.”
Spiders, by the way, “are specialized webbots that – unlike traditional webbots with well-defined targets – download multiple web pages across multiple websites,” he notes. Meanwhile, screen scraping is not clearly defined in this book, despite being in the subtitle. It generally involves automatically collecting, but not parsing, visual data from a source. Schrenk includes a chapter titled “Scraping Difficult Websites with Browser Macros,” and some purists would call that more a focus on the process known as web scraping rather than screen scraping. But this is minor nitpicking.
Schrenk’s well-written book offers sample scripts (mostly written in PHP) and example projects that show how to design and write webbots. His website for the book also offers several downloadable code libraries. “The functions and declarations in these libraries provide the basis for most of the example scripts used in this book,” he says. Likewise, his example scripts mostly use that website “as targets, or resources for your webbots to download and take action on” for practice and learning.
Before diving into the programming, it is important to take very careful note of his paragraph titled “Learn from My Mistakes.” In it, Schrenk emphasizes: “I’ve written webbots, spiders, and screen scrapers for over 15 years, and in the process I’ve made most of the mistakes someone can make. Because webbots are capable of making unconventional demands on websites, system administrators can confuse webbots’ requests with attempts to hack into their systems. Thankfully, none of my mistakes has ever led to a courtroom, but they have resulted in intimidating phone calls, scary emails, and very awkward moments. Happily, I can say that I’ve learned from these situations, and it’s been a very long time since I’ve been across the desk from an angry system administrator. You can spare yourself a lot of grief by reading my stories and learning from my mistakes.”
The 362-page 2nd edition has 31 chapters and three appendixes, and it is divided into four major parts:
- Part I: Fundamental Concepts and Technologies
- Part II: Projects
- Part III: Advanced Technical Considerations
- Part IV: Larger Considerations
That final part includes a very important chapter on keeping webbots and spiders out of legal trouble.
In other words, have fun but be very careful with what you create. As Schrenk emphasizes: “…it’s up to you to do constructive things with the information in this book and not violate copyright law, disrupt networks, or do anything else that would be troublesome or illegal.” And: “If you have questions, talk to a lawyer before you experiment.”
Words to the wise. And, yes, to the wiseasses, as well.
– Si Dunn is a novelist, screenwriter, freelance book reviewer, and former software technical writer and software/hardware QA test specialist. He is also a former newspaper and magazine photojournalist. His latest book is Dark Signals, a Vietnam War memoir available now in paperback. He is the author of a detective novel, Erwin’s Law; a novella, Jump; and several other books and short stories.