Project Rhode Island General Assembly Attendance Data
How the Providence Journal cleaned and published the data
Our state legislature still records attendance by legislators the old-fashioned way: on paper. The House and Senate publish a journal after each session that includes a roll call, which is just a list of the legislators who were present, and a list of those absent.
Almost all the journals were published as PDFs on the Assembly website. The Assembly does not tally the results in any fashion, so there was no way to know a legislator’s attendance over a period of time without a lot of manual work.
Wrangling the Data
For this project, I copied all 1,270 House and Senate journals to my PC and converted them to text with PDFtoText (IRE tip sheet). I wrote a program in PHP that used regular expressions to parse the roll call vote from each one, and store the results in a MySQL database.
Regular expressions are a series of characters that serve as a pattern a program tries to match as it reads through text. (Regular expressions are really useful for journalists. Many text editors let you use regular expressions to search for and replace text. Amit Arora offers a good tutorial.)
What seemed like a straightforward task—parsing the names in the document that followed the words “PRESENT” and “ABSENT”—turned out to be one of the most difficult challenges. First of all, since the roll call vote was created manually, the lists sometimes contained typos and missing punctuation. And the formatting wasn’t always consistent, making it hard to come up with a way that would reliably parse the names on the list.
I ended up using a series of three regular expressions. The program would use the most restrictive regular expression first. If that didn’t work, it would then try the next one, and so on.
The parsing process used a lookup table I created in advance that contained each legislator who held office over the past 10 years, comparing the found name with the list of legislators in office the day that particular journal was published. I built the table using legislator contact lists on the General Assembly website, as well as historical election results from the Rhode Island Secretary of State website.
The table contained a field for term-start and term-end to help make sure the program correctly identified the right legislator. It also contained a field for an alias, such as “Mr. Speaker” because sometimes the roll call only identified the person that way.
To catch the errors, the program would output an alert if it couldn’t find a name in the lookup table. It would also add together the number of legislators who were listed as present and absent for each day, and raise a flag if those numbers didn’t add up to 75 members in the House and 38 members in the Senate.
Every time a legislator was listed as present or absent, a new record was created in a database table. After parsing 10 years of attendance records, the table contained nearly 73,000 records.
Some kind folks on the NICAR list helped me write MySQL queries to summarize the data.
For the images of the legislators, I wrote a program that downloaded all the head shots from each legislator’s web page. I used our newspaper’s photo archive for the photos that were missing.
I wanted a way to graphically show which days in the year a legislator was present and absent, and the obvious way to do that was by using a visual tool we are all familiar with: a calendar. I couldn’t find open source code that would create a simple monthly calendar, so I made a jQuery plugin that would do so. I called it miniMonth and have released it as open source.
Here’s the result: General Assembly Attendance 2003-2012.