Scraping with Solvent
Posted on February 21st, 2009
The Solvent extension for Firefox lets us write screen scrapers that turn a web page into RDF data. Data in this format is intended to feed into the Piggy Bank package from the same developers, but for my purposes I will be using the generated scraping script to convert data read by the Crowbar application into RDF. From there, it becomes easy to convert it to any format I need and save it to the database.
Unfortunately, Solvent requires Piggy Bank, which doesn’t work with Firefox 3. Not being willing to remove my 3.0.6 installation, I needed a way to have version 2 installed alongside it. I found some simple instructions: Firefox 2 and 3 Living Together in Harmony. My install was slightly different from the one described because I wanted to keep my existing profile for version 3 and use the newly created one for version 2. Also, I did this with the latest version of 2.x, which you can find at www.mozilla.com/en-US/firefox/all-older.html.
After making a test scraper and uploading it to a web server, I tried it with Crowbar using the -mode=scrape option, and sure enough it returned an RDF/XML document.
There are a few bugs with Crowbar – the one that immediately caused me problems was non-encoded ampersand characters. It’s otherwise such a useful tool that I will definitely be spending some time on it.
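Until that is fixed, one possible workaround on the consuming side is to escape the stray ampersands before handing the document to an XML parser. The following is only a rough sketch in Python, not a fix to Crowbar itself: the function name and the sample RDF/XML are made up for illustration, and it assumes the only problem is bare "&" characters in text and attribute values.

```python
import re
import xml.etree.ElementTree as ET

# Match "&" only when it does NOT start an entity reference such as
# "&amp;", "&#38;" or "&#x26;". Those are left untouched.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(xml_text):
    """Return xml_text with unencoded ampersands replaced by '&amp;'."""
    return BARE_AMP.sub("&amp;", xml_text)


# Hypothetical scraper output with two unencoded ampersands.
raw = (
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<rdf:Description rdf:about="http://example.org/item?id=1&page=2">'
    "<dc:title>Tom & Jerry &amp; friends</dc:title>"
    "</rdf:Description>"
    "</rdf:RDF>"
)

repaired = escape_bare_ampersands(raw)
root = ET.fromstring(repaired)  # parsing the raw string would raise ParseError
title = root.find(".//{http://purl.org/dc/elements/1.1/}title").text
print(title)
```

This is a blunt instrument: it would also rewrite ampersands inside CDATA sections, and named entities that XML does not define (such as &nbsp;) would still trip the parser, so it is best treated as a stopgap.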

