Work continues on PervWatch.org

I'm continuing to expand the set of sex offender registries that I screenscrape for PervWatch.org, and I added Texas this weekend, so the site will have sex offender data mapped for New York, California and Texas. I've also added RSS and Atom feeds to the individual address search results, and I've been tweaking the site to behave better with this additional data - for instance, the zipcode list was getting way too long, so now I'm forcing the user to enter some filter text before I show any.

I thought I'd also share some cautionary tips for people out there who may be working on similar projects.

Internet Explorer still sucks and everyone uses it

I forget this every time I work on web pages. I use Firefox all the time, so I don't run into these problems until the odd day when I'm at a friend's place trying to show them the site. PervWatch was broken under IE. The odd thing was that ordinary pages loaded fine, but the map pages would load completely and then, as soon as they were done, display a modal dialog saying that IE couldn't open the page. After much googling, I realized that IE can't handle Google Maps unless you place the map script code at the end of the HTML body. Always remember: if your script alters the DOM as the page loads (like Google Maps does), place the JavaScript at the end of the page! IE just can't handle it otherwise.

Be careful with live searches

I was very excited to use AJAX to do live searching or filtering of the list of zipcodes, since the list was often large and I don't imagine people want to scroll to their zipcode every time. Taking my cue from Typo, the engine that runs this blog, I set up a text field that was polled every quarter-second, and its text would filter the list of zipcodes. This worked well on my development box, but on the real server it didn't turn out so well.

The problem was that the query and rendering could take some time, and the nature of the filtering meant that the first query would take much longer than later ones, so the results were often out of sync with what the user had typed. For instance, a user would type the first digit, say a '1', and Rails would start fetching all zipcodes beginning with 1. In the meantime the user would enter a couple more digits, so Rails would run more queries. The later queries would often finish first and display their results. Then, a split second later, the first, slow query would return, and its results, containing every zipcode beginning with 1, would overwrite the smaller, more refined list. Rather than deal with temporarily disabling the text field until results appeared, I decided to make the user hit Return/Enter or click a submit button to run the query themselves. I'm still not sure if this is the best way...
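One alternative to forcing a submit would be to tag each query with a sequence number and ignore any response that is older than the one already on screen. Here is a minimal sketch of that idea in Ruby. On the real page this logic would live in the browser-side JavaScript, and the class and method names are made up for illustration.

```ruby
# Hypothetical helper: only lets a response through if it is newer
# than the results currently displayed.
class LatestResultGuard
  attr_reader :shown

  def initialize
    @issued = 0     # sequence number of the newest query sent
    @shown_seq = 0  # sequence number of the results currently displayed
    @shown = nil
  end

  # Call when a query is sent; returns the ticket to hand back with the response.
  def issue
    @issued += 1
  end

  # Call when a response arrives. Returns true if it was displayed.
  def deliver(ticket, results)
    return false if ticket < @shown_seq  # older than what is on screen
    @shown_seq = ticket
    @shown = results
    true
  end
end

guard = LatestResultGuard.new
slow = guard.issue   # user typed "1"
fast = guard.issue   # user typed "100"

guard.deliver(fast, %w[10001 10002])        # refined query returns first
guard.deliver(slow, %w[10001 10002 11201])  # stale response is dropped
puts guard.shown.inspect                    # prints ["10001", "10002"]
```

The slow first query still costs the server time, so this only fixes what the user sees, not the load.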

Screenscraping is a delicate art

Since the sex offender registries don't offer up their data in any programmatic way, I have to screenscrape the contents. I originally tried to do this using open-uri and lots of GET requests, but that limited me to the few registries that didn't use session IDs and cookies to track their users. If you need to do things like this, please check out Michael Neumann's WWW::Mechanize.

It's available as a gem so you should be able to grab it by doing:

# gem install mechanize

It's a port of Perl's WWW::Mechanize library, which lets you pretend to be a browser. One drawback is that it isn't very tolerant of bad HTML, so in some cases you can't use the built-in methods to traverse the page contents if the HTML you grabbed is malformed. You will likely need to fall back on regexes to do a lot of the dirty work.
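To give a flavor of the regex fallback, here is a minimal sketch that pulls name and zipcode pairs out of a deliberately broken table fragment. The markup, the field layout and the method name are all invented for this example; a real registry page will need its own patterns.

```ruby
# Invented sample of sloppy HTML: unclosed <td> tags, mixed case, stray whitespace.
BAD_HTML = <<~HTML
  <TABLE><tr><td>Name: John Doe<td>Zip: 10001
  <tr><TD>Name:  Jane Roe <td>ZIP: 90210</tr>
HTML

# Hypothetical helper: returns an array of [name, zip] pairs.
def scrape_rows(html)
  # Capture the text after "Name:" up to the next <td>, then a 5-digit zip.
  pattern = /Name:\s*([^<]+?)\s*<td>\s*Zip:\s*(\d{5})/i
  html.scan(pattern)
end

scrape_rows(BAD_HTML).each { |name, zip| puts "#{name} (#{zip})" }
```

Patterns like this are brittle, so it's worth checking the number of rows you get back against what the page claims to list.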

A word of caution about Mechanize as well. In my scripts I was using both open-uri and mechanize, but open-uri was being require'd first. Mechanize uses custom Ruby 1.9 versions of net/http and net/https, but if those libraries have already been loaded via require, the custom versions will not get loaded and you'll get complaints about the missing method get_fields. I edited mechanize locally to use the [] and []= methods wherever it called get_fields and add_header, respectively. I'm not sure whether those changes would break anything if there were multiple cookies or values for a given header key, but it worked for what I needed.
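The worry about multi-valued headers can be seen with the net/http that ships with current Ruby. This is a small sketch using a request object and a made-up header name, not the code from Mechanize itself. In current net/http the appending method is called add_field: [] joins repeated values into one string, get_fields returns them separately, and []= replaces whatever was there.

```ruby
require 'net/http'

req = Net::HTTP::Get.new('/')

# add_field appends, so the header ends up with two values.
# 'X-Example' is an invented header name for this demonstration.
req.add_field('X-Example', 'a=1')
req.add_field('X-Example', 'b=2')

p req.get_fields('X-Example')  # each value as its own array element
p req['X-Example']             # the same values joined into one string

# []= overwrites every existing value for the key.
req['X-Example'] = 'c=3'
p req.get_fields('X-Example')
```

So swapping get_fields for [] is probably harmless when a header has one value, but with several Set-Cookie headers you would get them back as one joined string and have to split them yourself.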
