Ord’s Blog
  • Scraping with Solvent

    Posted on February 21st, 2009 by Ord

    The Solvent extension for Firefox lets us write screen scrapers that turn a web page into RDF data.  Data in this format is intended to feed into the Piggy Bank extension, but for my purposes I will be using the generated scraping script to convert data read by the Crowbar application into RDF.  From there, it becomes easy to convert the data to any format I need and save it to the database.

    Unfortunately, Solvent requires Piggy Bank, which doesn’t work with Firefox 3.  Not being willing to remove my 3.0.6 installation, I needed a way to have version 2 installed alongside it.  I found some simple instructions:  Firefox 2 and 3 Living Together in Harmony.  My install was slightly different from the one described because I wanted to keep my existing profile for version 3 and use the newly created one for version 2.  Also, I did this with the latest version of 2.x, which you can find at www.mozilla.com/en-US/firefox/all-older.html.

    After making a test scraper and uploading it to a web server, I tried it with Crowbar using the -mode=scrape option, and sure enough it returned an RDF/XML document.

    There are a few bugs in Crowbar – the one that immediately caused me problems was ampersand characters that are not encoded as &amp;.  It’s otherwise such a useful tool that I will definitely be spending some time on it.
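
    One possible stopgap for the ampersand problem is to clean up the document before parsing it.  The Python sketch below is only an illustration, not a fix to Crowbar itself: it assumes the unencoded ampersands end up in the returned RDF/XML, and the function name escape_bare_ampersands and the sample document are hypothetical.  It escapes any "&" that does not already start an entity or character reference, then checks that the result parses.

```python
import re
import xml.etree.ElementTree as ET

# An "&" that is NOT the start of a named entity (&amp;), a decimal
# reference (&#38;), or a hex reference (&#x26;).
BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z_][A-Za-z0-9_.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(xml_text: str) -> str:
    """Hypothetical helper: replace bare '&' with '&amp;'."""
    return BARE_AMP.sub("&amp;", xml_text)


# Made-up sample standing in for scraper output; it is not real Crowbar output.
sample = (
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<rdf:Description rdf:about="http://example.org/item?id=1&page=2">'
    "<dc:title>Salt & Pepper &amp; Co</dc:title>"
    "</rdf:Description></rdf:RDF>"
)

fixed = escape_bare_ampersands(sample)

# The cleaned text is now well-formed and can be parsed normally.
root = ET.fromstring(fixed)
title = root.find(".//{http://purl.org/dc/elements/1.1/}title").text
print(title)
```

    This is deliberately crude.  It leaves anything that looks like an entity alone, so a name that XML does not predefine (such as &nbsp;) would still fail to parse, and it would wrongly escape ampersands inside CDATA sections.  Fixing the encoding where the document is generated would be the cleaner solution.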