Ord’s Blog
  • HTML Parsers

    Posted on February 20th, 2009 by Ord

    Once again, I find that I need to parse some HTML pages.  I’ve looked at different parsers in the past, and always ended up writing some string or pattern-matching routines to get the data I wanted off the page.  This time, the data I want isn’t actually in the HTML pages; it gets written by JavaScript that is included from a third party.  So, I need a parser/fetcher that understands JavaScript, can load files, and can wait for the scripts to finish running before giving me the output.

    The Cobra toolkit, part of the Lobo Browser project, looked ideal.  The toolkit itself looked great, and was fairly easy to set up and test.  It generally fetched and parsed pages as I would expect, although often with a lot of warnings.  Unfortunately, the pages I actually wanted to parse didn’t work.  I decided to try the whole Lobo Browser to see how it would display those pages, but the installer gave a permissions error at the beginning of the install.  I decided to put this aside for now.  When I have to start troubleshooting installers just to test whether a library can do what I want, I see a huge risk of lost time.

    The Mozilla Parser looked like another good option.  This package is essentially a Java interface to the engine that is used in the Firefox browser.  Since I knew that Firefox rendered the pages I wanted properly, this seemed like a good choice.  Again, I had some issues getting this installed on the 64-bit Vista platform.  The documentation is limited, and the project seems not to have been touched in over a year.  Rather than spend a lot of time finding out why it wasn’t running, I looked at other options for using the Mozilla engine.

    Gecko, the engine behind Mozilla’s browsers, has a lot of documentation about embedding it.  There is a whole SDK available, and if this were a bigger project I would probably take that route.  Mozilla also provides XULRunner, which runs XUL applications outside of the browser.  (Firefox 3 can also be used to run XUL applications on Windows, by invoking it with the -app parameter.)

    The Crowbar program is a XUL app that fetches and parses web pages, and as it uses the same engine as Firefox, it parses the pages I need without any problem.  It runs as a server: it accepts HTTP requests and returns the data.  I will do some work with Crowbar; it looks like just a few small changes will make it the ideal tool for this project.
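
    Since Crowbar is driven over HTTP, a client can be quite small.  The following is only a rough sketch, not Crowbar's documented API: the local address and port, the form parameter names "url" and "delay", and the class and method names are all assumptions made for illustration and should be checked against the Crowbar version actually installed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class CrowbarClient {
    // Assumption: a Crowbar instance is listening on this local address.
    static final String CROWBAR_ENDPOINT = "http://127.0.0.1:10000/";

    // Builds the form body. The parameter names "url" and "delay" are assumptions.
    static String buildForm(String pageUrl, int delayMillis) {
        try {
            return "url=" + URLEncoder.encode(pageUrl, "UTF-8") + "&delay=" + delayMillis;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    // Posts the form to the server and returns the response body as a string.
    static String fetchRendered(String pageUrl, int delayMillis) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) new URL(CROWBAR_ENDPOINT).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(buildForm(pageUrl, delayMillis).getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            // No page given: just show the request body that would be sent.
            System.out.println(buildForm("http://example.com/", 3000));
        } else {
            // Give the page's scripts three seconds to run before reading the result.
            System.out.println(fetchRendered(args[0], 3000));
        }
    }
}
```

    The delay is the interesting part for my purposes: it is what would give the third-party script time to write its data into the page before the result comes back.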

  • Distance between points in Java

    Posted on February 19th, 2009 by Ord

    I needed to calculate the distance between two postal codes, given the latitude and longitude of each one.  It took some searching around the net to find the right formula, although along the way I found dozens of pages that would calculate the distances for me.

    The method milesBetween in my ZipCode class reports the number of miles between this Zip Code and the supplied argument.

    double milesBetween(ZipCode zip){
      // Differences in degrees; the Math.PI / 180 factor below converts to radians.
      double dLat = this.latitude - zip.latitude;
      double dLon = this.longitude - zip.longitude;
      // The longitude term is scaled by the cosines of the two latitudes,
      // because lines of longitude converge toward the poles.
      double distance = ( 3958 * Math.PI * Math.sqrt(
        dLat * dLat +
        Math.cos( Math.toRadians(this.latitude)) * Math.cos( Math.toRadians(zip.latitude)) *
        dLon * dLon
        )/180);

      return distance;
    }

    The constant 3958 is the (average) radius of the earth in miles.  Changing it to 6371, the average radius in kilometers, would give us the distances in kilometers instead.
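
    The formula treats the small patch of the earth between the two points as flat, so it is an approximation that works best over short distances.  As a sanity check, here is a sketch that compares it with the haversine great-circle formula, which is a different technique and not what milesBetween uses.  The class name, method names, and sample coordinates are made up for illustration.

```java
public class DistanceCheck {
    // Average radius of the earth in miles, as used in milesBetween.
    static final double EARTH_RADIUS_MILES = 3958.0;

    // Flat approximation: same form as milesBetween, taking plain degrees.
    static double approxMiles(double lat1, double lon1, double lat2, double lon2) {
        double dLat = lat1 - lat2;
        double dLon = lon1 - lon2;
        double cosTerm = Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2));
        return EARTH_RADIUS_MILES * Math.PI
            * Math.sqrt(dLat * dLat + cosTerm * dLon * dLon) / 180;
    }

    // Haversine great-circle distance, used here only as a cross-check.
    static double haversineMiles(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double sinLat = Math.sin(dLat / 2);
        double sinLon = Math.sin(dLon / 2);
        double a = sinLat * sinLat
            + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
            * sinLon * sinLon;
        return 2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(a));
    }

    public static void main(String[] args) {
        // Two illustrative points somewhat less than 100 miles apart.
        double lat1 = 40.7128, lon1 = -74.0060;
        double lat2 = 39.9526, lon2 = -75.1652;
        System.out.println("approx:    " + approxMiles(lat1, lon1, lat2, lon2));
        System.out.println("haversine: " + haversineMiles(lat1, lon1, lat2, lon2));
    }
}
```

    For nearby points the two results should agree closely; the gap grows as the points get farther apart, which is the case where a great-circle formula is the safer choice.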

  • Blogging I go

    Posted on February 15th, 2009 by Ord

    OK, after several years of saying “I gotta get that blog thing going soon,” here I am, blogging your interwebs.

    Anyway, I will be doing my best to keep this updated with current projects and so on.