Locked out of the server on Election Day

Nov. 6, 2012, 10 p.m. Journalism

Screenshot of the delawareonline.com homepage on election night.

Election Day at The News Journal was a pretty solid success.

We used PHP and internet duct tape (iframes) to scrape and display results for big races live on our homepage and results for all races on another landing page.

Unlike the Primary Election, our cron job ran smoothly all night and my stress level wasn't through the roof.

The day before, though, was a little more difficult.

We've been having issues with the main LAMP server that we use for interactive content. It's also the server that typically hosts all of our site scrapers, and I, being someone who is much more comfortable with Python than PHP, was planning on repurposing an old election results scraper.

Instead, I had to build from scratch on another server that is dedicated to our Django projects and become a whiz with regular expressions in PHP pretty quickly. If you couldn't guess, it was quite the learning experience.

With Python, regular expressions seemed to come easily to me, but I still probably would have used the DOM parsing library BeautifulSoup to scrape results. With PHP, though, most instructions I found said to stick with regex, and that seemed easier than figuring out how to parse HTML. Here's an example:

If we are looking for the percentage of districts reporting and the HTML looks like "GOVERNOR</td><td>25 of 90 districts reporting</td>", grabbing the numbers in Python goes like this:

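import re

data = "GOVERNOR</td><td>25 of 90 districts reporting</td>"  # sample of the scraped HTML
my_pattern = r"GOVERNOR\D+(\d{2})\D+(\d{2})"
matches = re.search(my_pattern, data)
# 100.0 forces float division, so Python 2 does not truncate the result to 27
percent_reporting = 100.0 * int(matches.group(1)) / int(matches.group(2))
print(percent_reporting)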

What threw me off first with PHP, other than adjusting to the syntax, was the required pattern delimiters. The Python code above looks like this in PHP:

// $data holds the scraped HTML; single quotes keep the backslashes literal
$my_pattern = '~GOVERNOR\D+(\d{2})\D+(\d{2})~';
preg_match($my_pattern, $data, $matches);
$percent_reporting = 100 * (int)$matches[1] / (int)$matches[2];
echo $percent_reporting;

As you can see, the actual expression was pretty much the same. I just had to wrap it in delimiters, tildes in this case. "\D" and "\d" still match non-digit and digit characters respectively, and you can still pull out groups with parentheses. Not too bad, just a little different.
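
The pattern above expects exactly two digits in each count and assumes the match always succeeds. A slightly more defensive Python sketch is below. The function name percent_reporting, its race parameter and the sample string are illustrative assumptions, not part of the original scraper.

```python
import re


def percent_reporting(html, race="GOVERNOR"):
    """Return the percent of districts reporting for `race`, or None if absent."""
    # \d+ accepts counts of any length, unlike the fixed-width \d{2}.
    pattern = re.escape(race) + r"\D+(\d+)\D+(\d+)"
    match = re.search(pattern, html)
    if match is None:
        # The race label (or its numbers) was not found in the page.
        return None
    reported, total = int(match.group(1)), int(match.group(2))
    if total == 0:
        # Avoid dividing by zero on an empty or malformed row.
        return None
    return 100.0 * reported / total


# Hypothetical sample mirroring the snippet quoted earlier in the post.
sample = "GOVERNOR</td><td>25 of 90 districts reporting</td>"
print(round(percent_reporting(sample), 1))  # prints 27.8
```

Returning None rather than raising keeps a missing race from stopping a cron-driven run, though the caller then has to check for it.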

Other than dealing with regular expressions, the rest of the script was pretty straightforward. I grabbed the results with cURL (instead of urllib in Python) and wrote everything to a separate file with fwrite(). Even though I felt like a stranger in a strange land at times, I found a few things in PHP extremely handy, such as using a heredoc with embedded variables to output the whole thing instead of gluing all the text together. I'm still more of a PHP editor than a PHP writer, but it's getting easier every time.
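
For readers coming from the Python side, the same fetch, template and write flow might look roughly like the sketch below. It is only an analogue of the PHP steps described above, not the script that ran on election night. string.Template stands in for the heredoc, and the markup, function names and output filename are all assumptions.

```python
import os
import tempfile
import urllib.request
from string import Template

# Hypothetical page fragment; the $-placeholders play the role of the
# variables embedded in a PHP heredoc.
RESULTS_TEMPLATE = Template("""\
<div class="race">
  <h3>$race</h3>
  <p>$reported of $total districts reporting ($percent%)</p>
</div>
""")


def fetch(url):
    """Download a page as text (the urllib counterpart of the cURL call)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def render_race(race, reported, total):
    """Fill the template in one step instead of concatenating strings."""
    percent = round(100.0 * reported / total, 1) if total else 0.0
    return RESULTS_TEMPLATE.substitute(
        race=race, reported=reported, total=total, percent=percent
    )


def write_results(path, html):
    """Write the rendered fragment to its own file, as fwrite() did in PHP."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(html)


# fetch() is defined but not called here, so the sketch runs offline.
html = render_race("GOVERNOR", 25, 90)
out_path = os.path.join(tempfile.gettempdir(), "results_fragment.html")
write_results(out_path, html)
```

Writing the fragment to its own file means whatever embeds it only ever reads a finished file, which fits the iframe setup described at the top of the post.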

