Scraping, cleaning, and selling big data

Infochimps execs discuss the challenges of data scraping.

In 2008, the Austin-based data startup Infochimps released a scrape of Twitter data that was later taken down at the request of the microblogging site because of user privacy concerns. Infochimps has since struck a deal with Twitter to make some datasets available on the site, and the Infochimps marketplace now contains more than 10,000 datasets from a variety of sources. Not all these datasets have been obtained via scraping, but nevertheless, the company’s process of scraping, cleaning, and selling big data is an interesting topic to explore, both technically and legally.

With that in mind, Infochimps CEO Nick Ducoff, CTO Flip Kromer, and business development manager Dick Hall explain the business of data scraping in the following interview.

What are the legal implications of data scraping?

Dick Hall: There are three main areas you need to consider: copyright, terms of service, and “trespass to chattels.”

United States copyright law protects against unauthorized copying of “original works of authorship.” Facts and ideas are not copyrightable. However, expressions or arrangements of facts may be copyrightable. For example, a recipe for dinner is not copyrightable, but a recipe book with a series of recipes selected based on a unifying theme would be copyrightable. This example illustrates the “originality” requirement for copyright.

Let’s apply this to a concrete web-scraping example. The New York Times publishes a blog post that includes the results of an election poll arranged in descending order by percentage. The New York Times can claim a copyright on the blog post, but not the table of poll results. A web scraper is free to copy the data contained in the table without fear of copyright infringement. However, in order to make a copy of the blog post wholesale, the web scraper would have to rely on a defense to infringement, such as fair use. The result is that it is difficult to maintain a copyright over data, because only a specific arrangement or selection of the data will be protected.

Most websites include a page outlining their terms of service (ToS), which defines the acceptable use of the website. For example, YouTube forbids a user from posting copyrighted materials if the user does not own the copyright. Terms of service are based in contract law, but their enforceability is a gray area in US law. A web scraper violating the letter of a site’s ToS may argue that they never explicitly saw or agreed to the terms of service.

Assuming ToS are enforceable, they are a risky issue for web scrapers. First, every site on the Internet will have a different ToS — Twitter, Facebook, and The New York Times may all have drastically different ideas of what is acceptable use. Second, a site may unilaterally change the ToS without notice and maintain that continued use represents acceptance of the new ToS by a web scraper or user. For example, Twitter recently changed its ToS to make it significantly more difficult for outside organizations to store or export tweets for any reason.

There’s also the issue of volume. High-volume web scraping could cause significant monetary damages to the sites being scraped. For example, if a web scraper checks a site for changes several thousand times per second, it is functionally equivalent to a denial of service attack. In this case, the web scraper may be liable for damages under a theory of “trespass to chattels,” because the site owner has a property interest in his or her web servers. A good-natured web scraper should be able to avoid this issue by picking a reasonable frequency for scraping.
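
As a rough illustration of that last point, a scraper that polls on a fixed, generous interval stays well clear of denial-of-service territory. The sketch below is a minimal example rather than Infochimps' own code; the target URL, user-agent string, and one-minute interval are arbitrary choices for the illustration.

    import time
    import requests

    POLL_URL = "https://example.com/page-to-watch"   # hypothetical target
    POLL_INTERVAL = 60                               # seconds between checks

    last_copy = None
    while True:
        response = requests.get(POLL_URL, headers={"User-Agent": "polite-scraper/0.1"})
        if response.ok and response.text != last_copy:
            last_copy = response.text
            print("page changed; process the new copy here")
        time.sleep(POLL_INTERVAL)                    # the whole point: wait before asking again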

What are some of the challenges of acquiring data through scraping?

Flip Kromer: There are several problems with scale and metadata, as well as historical complications.

  • Scale — It’s obvious that terabytes of data will cause problems, but so (on most filesystems) will having tens of millions of files in the same directory tree (a common workaround is sketched after this list).
  • Metadata — It’s a chicken-and-egg problem. Since few programs can draw on rich metadata, there’s not much use in annotating it. But since so few datasets are annotated, it’s not worth writing support into your applications. We have an internal data-description language that we plan to open source as it matures.
  • Historical complications — Statisticians like SPSS files. Semantic web advocates like RDF/XML. Wall Street quants like Mathematica exports. There is no One True Format, and lifting each out of its source domain is time-consuming.
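
To make the scale point concrete, one common workaround for the “tens of millions of files in one directory” problem is to shard files into nested subdirectories keyed on a hash of the filename. This is a generic sketch of that idea, not a description of Infochimps’ actual storage layout; the root path and fan-out are invented for the example.

    import hashlib
    import os

    def sharded_path(root, filename, levels=2, width=2):
        """Map e.g. 'tweets-0001234.json' to root/ab/cd/tweets-0001234.json."""
        digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
        shards = [digest[i * width:(i + 1) * width] for i in range(levels)]
        directory = os.path.join(root, *shards)
        os.makedirs(directory, exist_ok=True)   # each directory stays a manageable size
        return os.path.join(directory, filename)

    print(sharded_path("/tmp/scrapes", "tweets-0001234.json"))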

But the biggest non-obvious problem we see is source domain complexity. This is what we call the “uber” problem. A developer wants the answer to a reasonable question, such as “What was the air temperature in Austin at noon on August 6, 1998?” The obvious answer — “damn hot” — isn’t acceptable. Neither is:

Well, it’s complicated. See, there are multiple weather stations, all reporting temperatures — each with its own error estimate — at different times. So you simply have to take the spatial- and time-average of their reported values across the region. And by the way, did you mean Austin’s city boundary, or its metropolitan area, or its downtown region?
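
As a toy version of that spatial- and time-averaging, the sketch below weights each station’s reading by its distance from a chosen reference point and by how far from noon it was reported. The station data, the downtown reference point, and the inverse-distance weighting are all assumptions made up for the illustration, not Infochimps’ actual method.

    import math

    DOWNTOWN_AUSTIN = (30.2672, -97.7431)   # one of several defensible "Austin" reference points

    # (latitude, longitude, reported temperature in F, minutes offset from noon)
    stations = [
        (30.19, -97.67, 97.3, -4),
        (30.32, -97.77, 96.1,  2),
        (30.40, -97.72, 95.4,  9),
    ]

    def weight(lat, lon, minutes_off_noon):
        distance = math.hypot(lat - DOWNTOWN_AUSTIN[0], lon - DOWNTOWN_AUSTIN[1]) + 1e-6
        staleness = 1.0 + abs(minutes_off_noon) / 30.0   # readings far from noon count less
        return 1.0 / (distance * staleness)

    weights = [weight(lat, lon, dt) for lat, lon, _, dt in stations]
    temps = [temp for _, _, temp, _ in stations]
    estimate = sum(w * t for w, t in zip(weights, temps)) / sum(weights)
    print(f"estimated noon temperature: {estimate:.1f} F")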

There are more than a dozen incompatible yet fundamentally correct ways to measure time: Earth-centered? Leap seconds? Calendrical? Does the length of a day change as the earth’s rotational speed does?
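
For a flavor of that, the same instant can be rendered as a calendar date, as Unix seconds, or as a Julian date, and none of those representations accounts for leap seconds. The snippet below assumes noon Central Daylight Time (UTC-5) for the example; the Unix-epoch-to-Julian-date offset of 2440587.5 is a standard constant.

    from datetime import datetime, timedelta, timezone

    # Noon in Austin on 1998-08-06, assuming Central Daylight Time (UTC-5).
    instant = datetime(1998, 8, 6, 12, 0, tzinfo=timezone(timedelta(hours=-5)))

    unix_seconds = instant.timestamp()
    julian_date = unix_seconds / 86400.0 + 2440587.5   # Unix epoch falls at JD 2440587.5

    print("calendar (UTC):", instant.astimezone(timezone.utc).isoformat())
    print("unix seconds:  ", unix_seconds)
    print("julian date:   ", julian_date)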

Data at “everything” scale is sourced by domain experts, who necessarily live at the “it’s complicated” level. To make it useful to the rest of the world requires domain knowledge, and often a transformation that is simply nonsensical within the source domain.

How will data marketplaces change the work and direction of data startups?

Nick Ducoff: I vividly remember being taught about comparative advantage. This might age me a bit, but the lesson was: Michael Jordan doesn’t mow his own lawn. Why? Because he should spend his time practicing basketball, since that’s what he’s best at and makes a lot of money doing. The same analogy applies to software developers. If you are best at the presentation layer, you don’t want to spend your time futzing around with databases.

Infochimps allows these developers to spend their time doing what they do best — building apps — while we spend ours doing what we do best — making data easy to find and use. What we’re seeing is startups focusing on pieces of the stack. Over time the big cloud providers will buy these companies to integrate into their stacks.

Companies like Heroku (acquired by Salesforce) and CloudKick (acquired by Rackspace) have paved the way for this. Tools like ScraperWiki and Junar will allow anybody to pull down tables off the web, and companies like Mashery, Apigee and 3scale will continue to make APIs more prevalent. We’ll help make these tables and APIs findable and usable. Developers will be able to go from idea to app in hours, not days or weeks.

This interview was edited and condensed.
