Like most journalists who have covered politics, I've always had a love/hate relationship with campaign finance data. It's one of the best tools we've got for tracking the influence of money in politics, but it's also a huge headache: the data is dirty, unstandardized and often incomplete. Even answering a simple question like "How much did Bob Perry give to Republicans in 2012?" can lead to days-long odysseys of data querying and cleaning.

That's because almost every database of campaign contributions is organized in the least helpful way possible: by contribution, rather than donor. If you want to look up all of Bob Perry's contributions, you need to know all the ways his name might be spelled (or typoed); you need to know all the addresses he might have filed contributions from; you need to know that Bob Perry the homebuilder from Texas is different from Bob Perry the finance executive from New York. The data would be so much better, so much more useful, if Bob Perry were a canonical figure, whose donations were grouped together and could be tracked over time.

That's what this experiment is trying to do.

For years now, organizations like the Center for Responsive Politics (CRP), the National Institute on Money in State Politics and the Sunlight Foundation have been standardizing donor names using a combination of automated analysis and human review. It's been an amazing service, and one I've used many times, but it's also made me wonder: Could machine learning accurately make the same judgments? Could we model the intuition of these expert standardizers and generalize it to any campaign finance dataset -- federal, state or local?

The short answer is yes. And this writeup will show you how it's done.

Project overview

The purpose of this project is to see how well an entirely automated workflow can mimic CRP's standardization of individual donors in Federal Election Commission data from the 2011-2012 cycle.

Grouping together contributions into donors is surprisingly difficult, especially considering how dirty and incomplete campaign finance data tends to be. Given two contributions by Joe Smith from New York, it's sometimes impossible to tell whether they're indeed from the same person, or from two people who happen to have the same name. There are certain hints: They share the same ZIP code, maybe, or they list the same occupation and/or employer. But at the end of the day, standardization is a binary judgment call based on limited information -- a perfect task, in some ways, for a machine.
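
To make that judgment concrete, here's a minimal sketch in Python of how a pair of contribution records might be turned into comparison features that a classifier could weigh. The field names are hypothetical and the string-similarity measure (Python's built-in difflib) is just one reasonable choice; the feature set actually used in this project is described in the sections below.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Rough string similarity in [0, 1]; 0 if either value is missing."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def pair_features(c1, c2):
    """Turn two contribution records (dicts) into a feature vector
    describing how alike they look. Field names are placeholders."""
    return [
        similarity(c1.get("first_name"), c2.get("first_name")),
        similarity(c1.get("middle_name"), c2.get("middle_name")),
        1.0 if c1.get("zip") == c2.get("zip") else 0.0,
        1.0 if c1.get("city") == c2.get("city") else 0.0,
        1.0 if c1.get("state") == c2.get("state") else 0.0,
        similarity(c1.get("occupation"), c2.get("occupation")),
        similarity(c1.get("employer"), c2.get("employer")),
    ]

# Example: two "Joe Smith" contributions from New York that probably
# belong to the same person, even though the strings don't match exactly.
a = {"first_name": "JOE", "last_name": "SMITH", "zip": "10001",
     "city": "NEW YORK", "state": "NY", "occupation": "ATTORNEY",
     "employer": "SMITH & JONES"}
b = {"first_name": "JOSEPH", "last_name": "SMITH", "zip": "10001",
     "city": "NEW YORK", "state": "NY", "occupation": "LAWYER",
     "employer": "SMITH AND JONES LLP"}
print(pair_features(a, b))
```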

This project uses a combination of machine learning and graph theory to standardize a random sample of 100,000 CRP records pulled from Sunlight's Influence Explorer project. We're using CRP data because it has already been pre-standardized, so we can tell how often our classifier's judgment matches CRP's, which is effectively the gold standard.

In the end, I found that a Random Forest machine learning classifier can produce results that match CRP's between 95 and 99 percent of the time. In some cases, it finds matches that CRP missed. It's not perfect, and it makes some stupid errors, but it's a promising start that will hopefully allow us to generalize CRP's intuition across other datasets, such as local campaign finance records.
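
One simple way to score that kind of agreement -- not necessarily the exact metric used here, but a reasonable sketch of the idea -- is to ask, for every pair of contributions, whether our grouping and CRP's give the same answer ("same donor" or "different donors"). The field names below (`donor_id` for our output, `crp_id` for CRP's pre-assigned identifier) are hypothetical, and in practice you'd score pairs within the smaller chunks described in the method section rather than across all 100,000 records at once.

```python
from itertools import combinations

def pairwise_agreement(records):
    """Fraction of contribution pairs on which our donor grouping and
    CRP's agree: both say 'same donor' or both say 'different donors'."""
    agree = total = 0
    for a, b in combinations(records, 2):
        ours = a["donor_id"] == b["donor_id"]
        crps = a["crp_id"] == b["crp_id"]
        agree += (ours == crps)
        total += 1
    return agree / total if total else 1.0
```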

The data

A few important points to note about the data:

  • I am only standardizing the names of individuals, not companies, PACs or other organizations. The reason is pretty simple: Federal and state governments often standardize organizational records to some degree on their own. For instance, federal PACs have unique IDs that make it easy to show all of the contributions flowing in and out. Not so with individual donors, whose records are typically not standardized at all. That said, this method could also be adapted to work with organizations.

  • As I mentioned above, I'm working with a set of 100,000 random individual contributions classified by CRP and made available through the Sunlight Foundation. The actual slice I used is available in the repo's data folder. Different slices of that data are used for training, testing and cross-validation.

  • Aside from the cleaning I describe later in this writeup, the only change I've made to the raw data is to exclude a handful of fields from the import for simplicity's sake. The imported data still has all the fields common to most campaign finance data, such as name, city, state, zip, occupation and employer (see the loading sketch below). Addresses were excluded, even though they're available in the data, because they often aren't available in other campaign finance datasets and I wanted to see how the system would perform without them.
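
For concreteness, here's a minimal sketch of reading a slice like that into memory, keeping only the fields listed above. The filename and column names are placeholders; the repo's actual import code may organize this differently.

```python
import csv

# Fields common to most campaign finance data; CRP's pre-assigned donor ID
# would be carried along separately so we can check our work against it.
FIELDS = ["name", "city", "state", "zip", "occupation", "employer"]

def load_contributions(path):
    """Read the sample slice into a list of dicts, normalizing case
    so later string comparisons aren't tripped up by capitalization."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        return [{k: (row.get(k) or "").strip().upper() for k in FIELDS}
                for row in reader]

# contributions = load_contributions("data/contributions.csv")  # hypothetical path
```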

The method

The method I use follows four distinct steps:

  • First, preprocess the data: split donor names into first, middle and last; normalize the case of all strings (all lowercase or all uppercase); and add a few fields we'll need for testing later.

  • Second, break our universe of 100,000 campaign contributions into smaller chunks that can be processed more quickly and efficiently.

  • Third, use machine learning to run pairwise comparisons of individual donations to determine whether they are from the same person. If they are, link them together into miniature graphs.

  • And finally, find all of the graphs of multiple connected donations and assign each one a unique donor ID. (A rough sketch of how these steps fit together in code follows this list.)
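
Here's a rough sketch of how steps two through four might hang together, assuming scikit-learn's RandomForestClassifier for the pairwise judgments and networkx for the graph step. The blocking key (last name), the field names and the `pair_features()` helper from the earlier sketch are illustrative placeholders rather than the project's actual code.

```python
from collections import defaultdict
from itertools import combinations

import networkx as nx
from sklearn.ensemble import RandomForestClassifier

def block_by_last_name(contributions):
    """Step 2: split the full set of contributions into smaller chunks.
    Keying on last name is just one plausible blocking choice."""
    blocks = defaultdict(list)
    for c in contributions:
        blocks[c["last_name"]].append(c)
    return blocks.values()

def link_donors(contributions, clf):
    """Steps 3 and 4: classify pairs within each block, link matches
    into a graph, and give each connected component a donor ID."""
    graph = nx.Graph()
    graph.add_nodes_from(c["contribution_id"] for c in contributions)
    for block in block_by_last_name(contributions):
        for a, b in combinations(block, 2):
            # pair_features() is the feature-extraction sketch shown earlier
            if clf.predict([pair_features(a, b)])[0] == 1:
                graph.add_edge(a["contribution_id"], b["contribution_id"])
    # Each connected component of linked contributions becomes one donor
    return {cid: donor_id
            for donor_id, component in enumerate(nx.connected_components(graph))
            for cid in component}

# Training follows the usual scikit-learn pattern: fit on pairs labeled
# using CRP's pre-assigned donor IDs, then link the rest of the data.
# clf = RandomForestClassifier(n_estimators=100).fit(train_features, train_labels)
# donor_ids = link_donors(contributions, clf)
```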

Table of contents

I've written detailed instructions about how each of those steps works in the sections below. Feel free to dig in: