Ord’s Blog
  • HTML Parsers

    Posted on February 20th, 2009 by Ord

    Once again, I find that I need to parse some HTML pages.  I’ve looked at different parsers in the past, and always ended up writing some string or pattern matching routines to get the data I wanted off the page.  This time, the data I want isn’t actually in the HTML pages; it gets written by JavaScript that is included from a third party.  So I need a parser/fetcher that understands JavaScript, can load files, and can wait for the scripts to finish running before giving me the output.
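    To illustrate why pattern matching on the fetched HTML falls short here, the following is a minimal sketch in Python.  The page markup, the `price` element id, and the script URL are all made up for illustration; the only point is that the raw HTML holds an empty placeholder until the third-party script has run.

```python
import re

# Hypothetical page as fetched over HTTP: the value is not in the markup,
# it is filled in later by a third-party script.
RAW_HTML = """
<html><body>
  <div id="price"></div>
  <script src="http://thirdparty.example/widget.js"></script>
</body></html>
"""

# The same hypothetical page after a browser engine has run the script.
RENDERED_HTML = RAW_HTML.replace('<div id="price"></div>',
                                 '<div id="price">42.50</div>')

PRICE_PATTERN = re.compile(r'<div id="price">([^<]*)</div>')


def extract_price(html):
    """Return the text inside the price div, or None if it is absent or empty."""
    match = PRICE_PATTERN.search(html)
    if match is None or not match.group(1).strip():
        return None
    return match.group(1).strip()
```

    Run against the raw page, the routine finds the placeholder but no data; it only succeeds on the output of something that has executed the script, which is why the fetcher has to understand JavaScript.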

    The Cobra toolkit, part of the Lobo Browser project, looked ideal.  It was fairly easy to set up and test, and it generally fetched and parsed pages as I would expect, although often with a lot of warnings.  Unfortunately, the pages I actually wanted to parse didn’t work.  I decided to try the full Lobo Browser to see how it would display those pages, but the installer gave a permissions error at the beginning of the install.  I decided to put this aside for now.  When I have to start troubleshooting installers just to test whether a library can do what I want, I see a huge risk of lost time.

    The Mozilla Parser looked like another good option.  This package is essentially a Java interface to the engine used in the Firefox browser.  Since I knew that Firefox rendered the pages I wanted properly, this seemed like a good choice.  Again, I had some issues getting it installed on the 64-bit Vista platform.  The documentation is limited, and the project seems not to have been touched in over a year.  Rather than spend a lot of time finding out why it wasn’t running, I looked at other options for using the Mozilla engine.

    Gecko, the engine behind Mozilla’s browsers, is well documented for embedding.  There is a whole SDK available, and if this were a bigger project I would probably take that route.  Mozilla also provides XULRunner, which allows XUL applications to run outside of the browser.  (Firefox 3 can also be used to run XUL applications on Windows, by invoking it with the -app parameter.)
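    As a rough sketch of the two ways of launching a XUL application mentioned above, the Python helper below builds the command line for either XULRunner or Firefox with -app.  It assumes that the `xulrunner` or `firefox` executable is on the PATH and that the application ships an `application.ini` file; the path in the usage note is a placeholder, not a real install location.

```python
import subprocess


def xul_launch_command(application_ini, use_firefox=False):
    """Return the argument list that starts a XUL app from its application.ini.

    Assumption: the executables are reachable by these bare names.
    """
    if use_firefox:
        # Firefox 3 can host a XUL application via the -app parameter.
        return ["firefox", "-app", application_ini]
    return ["xulrunner", application_ini]


def launch(application_ini, use_firefox=False):
    """Start the XUL application as a child process and return its handle."""
    return subprocess.Popen(xul_launch_command(application_ini, use_firefox))
```

    For example, `launch("path/to/application.ini", use_firefox=True)` would start the app under Firefox instead of a separate XULRunner install.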

    The Crowbar program is a XUL app that fetches and parses web pages, and since it uses the same engine as Firefox, it parses the pages I need without any problem.  It runs as a server: it accepts HTTP requests and returns the data.  I will do some work with Crowbar; it looks like just a few small changes will make it the ideal tool for this project.
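    Talking to a server like this from a script could look roughly like the Python sketch below.  The local address and port, and the form fields `url` and `delay`, are assumptions on my part and should be checked against the Crowbar documentation before relying on them; the function names are just for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: the Crowbar server listens on this local address and port.
CROWBAR_ENDPOINT = "http://127.0.0.1:10000/"


def build_crowbar_request(page_url, delay_ms=3000, endpoint=CROWBAR_ENDPOINT):
    """Build a POST request asking the server to load and render page_url.

    Assumption: the server reads form fields named `url` and `delay`, where
    `delay` is how long to wait (in milliseconds) for scripts to finish.
    """
    body = urlencode({"url": page_url, "delay": str(delay_ms)}).encode("ascii")
    return Request(endpoint, data=body, method="POST")


def fetch_rendered(page_url, delay_ms=3000, timeout=30):
    """Send the request and return the response body as text."""
    request = build_crowbar_request(page_url, delay_ms)
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

    The delay is what gives the third-party script time to write its data into the page before the result is returned, so it would need tuning for slow pages.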

     
