Wednesday, August 26, 2009

Web Scraping in Java

There are at least three ways to do web scraping in Java.

First, "manually" use string matching and regular expressions to extract information from the downloaded HTML.
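
As a rough illustration of this first approach, here is a minimal sketch that pulls link targets out of an HTML string with java.util.regex. The class name RegexScrape, the method extractLinks, and the sample markup are made up for this example; the pattern only handles quoted href attributes and will miss or mangle anything less regular, which is the main weakness of scraping by regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of any library.
class RegexScrape {
    // Matches an <a ...> tag and captures the quoted value of its href attribute.
    // Group 1 is the opening quote character, group 2 is the attribute value.
    private static final Pattern HREF = Pattern.compile(
        "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the href values of all anchors found, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        // Assumed sample input; real pages are rarely this tidy.
        String html = "<p><a href=\"a.html\">A</a> <A class='x' HREF='b.html'>B</A></p>";
        System.out.println(extractLinks(html));
    }
}
```

This works for quick one-off jobs, but every change in the page's markup tends to require a new pattern, which is why the parser-based approaches are usually more robust.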

Second, use JTidy to transform HTML to XHTML, and then use XQuery (e.g., Saxon, ...) over the XHTML to extract the required information.

Third, which is what I prefer:
  1. Create a TagSoup HTML parser, which provides a SAX interface;
  2. Use XOM to build a document tree from the HTML using the TagSoup SAX parser;
  3. Use the XPath query facility built into XOM (i.e., Jaxen) to query the XOM document.
A sample code skeleton looks like:

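// Imports needed by the skeleton (TagSoup and XOM must be on the classpath).
import java.io.File;
import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Nodes;
import org.xml.sax.XMLReader;

// Create a TagSoup SAX parser.
XMLReader parser = new org.ccil.cowan.tagsoup.Parser();

// Use the TagSoup parser to build an XOM document from HTML.
Document doc = new Builder(parser).build(new File("index.html"));

// Run an XPath query: find all "table" elements.
Nodes nodes = doc.query("//*[local-name()='table']");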
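
The query above matches on local-name() rather than writing //table. The likely reason is that TagSoup, by default, places the elements it emits in the XHTML namespace, and in XPath 1.0 an unprefixed name such as table only matches elements in no namespace. The sketch below demonstrates that effect using only the JDK's own DOM parser and XPath engine, not TagSoup or XOM, so that it stays self-contained; the class name LocalNameQuery, the method count, and the sample XHTML string are assumptions made for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; uses the JDK's XML APIs, not XOM.
class LocalNameQuery {
    // Parses the given XHTML with namespace awareness and returns how many
    // nodes the XPath expression selects.
    static int count(String xhtml, String xpath) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the document", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample input: elements live in the XHTML namespace.
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
            + "<table><tr><td>a</td></tr></table><p>x</p></body></html>";
        // Unprefixed name: looks for "table" in no namespace.
        System.out.println(count(xhtml, "//table"));
        // Namespace-agnostic match on the local name.
        System.out.println(count(xhtml, "//*[local-name()='table']"));
    }
}
```

Matching on the local name is the simplest way around the namespace; binding a prefix to the XHTML namespace and querying with that prefix is the stricter alternative.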