Parsing tables in MS Word with Python

Jan. 3, 2013, 10 p.m.Python

I’ve been meaning to post this for a while and was reminded when I saw a post from Anthony DeBarros about using xlrd to parse an Excel document.

A FOIA request for voting records for Delaware state legislators returned a ton of MS Word documents, each with a table of votes for a single legislator. The State House uses a Lotus Notes database and actually stores their voting records in this format. So, I had to extract all of the votes from the files and put them in a form that I could examine with excel and later put into MySQL for publication. Here’s the script that I used:

At the heart of this script is the win32com library. The client module allows you to open up Microsoft Office applications with Python and to parse whatever is in them. In this case, we are opening an instance of MS Word in the background with lines 12-13. From there, we open each document inside the writeVotes function with lines 22-23 and select the table with line 24, “table = doc.Tables(1)”. Luckily, each document only had one table that we cared about, though we could have easily iterated through each table in each document.

The rest of the script is pretty straightforward. For each row in the table, we grab the relevant information with “.Range.text” and join it all together.

I hope others find this helpful. If the data didn’t come in several dozen word files, it probably would have been easier to do a copy/paste job, but this script sped things up tremendously and made my life easier.


Comments

Categories