I’ve been meaning to post this for a while and was reminded when I saw a post from Anthony DeBarros about using xlrd to parse an Excel document.
A FOIA request for voting records for Delaware state legislators returned a ton of MS Word documents, each with a table of votes for a single legislator. The State House uses a Lotus Notes database and actually stores its voting records in this format. So, I had to extract all of the votes from the files and put them in a form I could examine in Excel and later load into MySQL for publication. Here’s the script that I used:
At the heart of this script is the win32com library. The client module allows you to open up Microsoft Office applications with Python and to parse whatever is in them. In this case, we are opening an instance of MS Word in the background with lines 12-13. From there, we open each document inside the writeVotes function with lines 22-23 and select the table with line 24, “table = doc.Tables(1)”. Luckily, each document only had one table that we cared about, though we could have easily iterated through each table in each document.
The rest of the script is pretty straightforward. For each row in the table, we grab the relevant information with “.Range.text” and join it all together.
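Since the script itself isn’t reproduced here, the sketch below shows the general shape of this approach. It is an illustration, not the original script: the folder path, file pattern, and output filename are made-up placeholders, and the line numbers referenced above won’t match this version. Running the full job requires Windows with MS Word installed, though the cell-cleaning helper works anywhere.

```python
import csv
import glob


def clean_cell(text):
    """Strip the end-of-cell markers (\\r and \\x07) that Word appends
    to every table cell's Range.Text, plus surrounding whitespace."""
    return text.replace("\r", " ").replace("\x07", "").strip()


def write_votes(doc, writer):
    """Grab the first (and only) table in the document and write each
    row of votes out as a CSV row."""
    table = doc.Tables(1)  # tables are 1-indexed in the Word COM API
    for row in table.Rows:
        writer.writerow(clean_cell(cell.Range.Text) for cell in row.Cells)


def main(pattern=r"C:\votes\*.doc", out_path="votes.csv"):
    # Imported here so the helpers above stay importable on any platform.
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False  # keep Word running invisibly in the background
    try:
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            for path in glob.glob(pattern):
                doc = word.Documents.Open(path)
                write_votes(doc, writer)
                doc.Close(False)  # close without saving changes
    finally:
        word.Quit()
```

The one non-obvious detail is `clean_cell`: Word’s COM interface terminates each cell’s `Range.Text` with a carriage return plus an end-of-cell marker (`\x07`), so the text has to be scrubbed before it’s joined together.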
I hope others find this helpful. If the data hadn’t come in several dozen Word files, it probably would have been easier to do a copy/paste job, but this script sped things up tremendously and made my life easier.
Election Day at The News Journal was a pretty solid success.
We used PHP and internet duct tape (iframes) to scrape and display results for big races live on our homepage and results for all races on another landing page.
Unlike the Primary Election, our cron job ran smoothly all night and my stress level wasn’t through the roof.
Click here to check out a “still-live” version of our election night banner on our testing stage. [UPDATE: 02/05/2013] That link is a bit busted right now because our PHP server was hacked and we lost a few resources.
The day before, though, was a little more difficult.
Along with likely every other newsroom in the country, we’ve been having a lot of fun with campaign finance data around here. Last week, we tore into Gov. Jack Markell and GOP challenger Jeff Cragg’s reports and were experimenting with how to present the data online.
Our initial idea was just a map, because the story focused heavily on geography after Markell asserted that most of the out-of-state money from surrounding states came from people who commuted to work in Delaware. Leaflet took care of that quickly: I built the map in ArcMap, converted it to GeoJSON and sent it straight to the web.
“But wait,” we told ourselves. “We’ve cleaned up all this data; let’s hand it over to readers and make it searchable.” In comes DataTables. I’ve used DataTables a lot since I joined The News Journal. It’s a fantastic jQuery plugin, and I love that I can easily plug it into a MySQL database with a PHP script.
The News Journal now has a PANDA, a “newsroom data appliance” that will help fuel our reporting and hopefully make lives easier.
Needless to say, I’m pretty stoked. At the NICAR conference last February, I attended a presentation by the developers and knew right away that our newsroom would benefit from a PANDA. It took us a little while to track down a local server we could use, but we ended up re-purposing an old advertising server.
If you’ve never heard of PANDA before, it’s essentially a data “library” for the newsroom. Reporters can load data - or “feed the panda” - and it indexes it and makes it searchable. Already, we have loaded the registered voters database, state credit card payments and a few campaign finance tables. So, now all we have to do is type in a name and PANDA will tell us which databases that person is in. A search for Gov. Jack Markell’s last name shows entries in the registered voters database, the credit card data and the campaign finance reports for U.S. Senator Tom Carper. Once we get more data uploaded, we could conceivably do a relatively thorough background check on sources just by searching for them with PANDA.
In a nutshell, this is going to greatly enhance our newsroom’s ability to capitalize on all the data we collect. Who wouldn’t be excited for that?