
Before we get to the fun part of analyzing the data, we first need to prepare our raw campaign contributions to be run through our algorithms. Those steps are outlined here:

Importing the data

Relevant code: bin/import.py

The first thing we need to do is import our test data -- a random 100,000-record slice of individual donations to federal campaigns from the 2011-2012 election cycle, as cleaned and processed by the Center for Responsive Politics. There's nothing too special about the import itself. Just a couple of things worth remembering:

First, I'm using Sunlight Labs' name-cleaver project to parse contributor names into first, middle, last, suffix, etc. It's a great tool, but given that a huge chunk of this workflow will be based on its outputs, it should be tested thoroughly on any new input data before it is used for other applications. If it doesn't work properly, this whole workflow is screwed.

And second, a quick optimization point. I'm using Django's ORM to handle the data loading, which means being mindful of transaction management. In this case, I insert records in batches of 5,000 using Django's bulk_create, which lets the whole import wrap up in just a couple of minutes. Later we'll use manual transaction management for the same reason. Don't, under any circumstances, try to save each record one by one -- you will be waiting a very long time.
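To make that concrete, here's a minimal sketch of what batched loading can look like. This is not the code from bin/import.py: the Contribution model, its fields and the CSV column names are all assumptions made for illustration.

```python
import csv

from contributions.models import Contribution  # hypothetical app/model

BATCH_SIZE = 5000  # matches the 5,000-record batches described above


def load_contributions(path):
    """Read raw contribution rows and insert them in bulk, not one at a time."""
    batch = []
    with open(path) as infile:
        for row in csv.DictReader(infile):
            # Field and column names here are illustrative; the real CRP
            # columns and model fields may differ.
            batch.append(Contribution(
                contributor_name=row['contributor'],
                city=row['city'],
                state=row['state'],
                zip_code=row['zip'],
                amount=row['amount'],
            ))
            if len(batch) >= BATCH_SIZE:
                Contribution.objects.bulk_create(batch)  # one insert per 5,000 rows
                batch = []
    if batch:
        Contribution.objects.bulk_create(batch)
```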

Once the data is loaded, you can think of our entire dataset as a large universe filled with individual campaign contributions. Visually, imagine it like this (ignore the axis labels):

[Image: all campaign contributions]

Breaking down the problem space

Relevant code: bin/preprocess.py

As I mentioned in the previous section, the objective of our processing task is to look through each of the contributions in the dataset and connect the ones that seem to come from the same person.

At this point, if we had access to a massively powerful supercomputer, we could naively build those links by comparing every contribution in our universe against every other. But even for our relatively small 100,000-record dataset, that would require about (100,000 ^ 2) / 2, or 5 billion, pairwise comparisons. We can be more efficient than that.

One approach is to break down our huge universe of campaign contributions into small groups that are likely to contain contributions from the same people. This can make a huge difference. For example, if we break the entire dataset into 100 subgroups of 1,000 contributions each, the number of comparisons we'd have to run drops from 5 billion to ((1,000 ^ 2) * 100) / 2, or 50 million -- still a huge number, but a drop of two whole orders of magnitude.

Dividing up our dataset into groups is exactly what we're going to do. But the question is: What's the best way to do it? If our groups are too small and specific, they might miss potential matches. If they're too big, we're stuck dealing with precisely the same computational problems we're trying to avoid. Turns out there are a couple decent options for creating our groups, each with its own pros and cons.

The first option is the most intuitive: Group our records by donor last names. Chances are that most, if not all, of the contributions by Joe Smith would fall within the "Smith" group, right? There are 35,369 unique last name groups in our dataset, each containing an average of about 2.8 records. That ends up breaking down to about 1.8 million individual comparisons -- a whole lot fewer than the 5 billion we started with. The downside is typos: At least one "Smith" in our dataset appears as "Smtih."
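To make the blocking idea concrete, here's a minimal sketch of last-name grouping and the resulting comparison count. It assumes each record is a dict with a parsed last_name key; the project's actual grouping logic lives in bin/preprocess.py.

```python
from collections import defaultdict


def block_by_last_name(contributions):
    """Bucket contribution records by their (normalized) donor last name."""
    groups = defaultdict(list)
    for record in contributions:
        groups[record['last_name'].strip().lower()].append(record)
    return groups


def total_comparisons(groups):
    """Pairs we'd actually score: n * (n - 1) / 2, summed over every group."""
    return sum(len(g) * (len(g) - 1) // 2 for g in groups.values())

# With ~35,000 last-name groups averaging ~2.8 records apiece, this works out
# to roughly 1.8 million comparisons instead of the ~5 billion needed to
# compare every record against every other.
```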

Another technique, known as locality-sensitive hashing (LSH), can mitigate the typo issue. The details of the algorithm are beyond the scope of this writeup (they can be found here), but essentially LSH is a linear-time algorithm that can be used for clustering similar text strings -- in this case, names. Using the parameters assigned in the script, LSH ends up returning 23,771 clusters with an average membership of about 4 donor names. In practice, that also leads to about 1.8 million individual comparisons. Plus, "Smith" and "Smtih" appear in the same group.
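The project's LSH step lives in bin/preprocess.py and uses its own parameters; purely as an illustration of the general technique (not the project's implementation), a MinHash-plus-banding version over name shingles might look something like this:

```python
import hashlib
from collections import defaultdict


def shingles(text, k=2):
    """Overlapping k-character shingles of a lowercased string."""
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def minhash_signature(shingle_set, num_hashes=20):
    """One min-hash value per seeded hash function."""
    if not shingle_set:
        return [0] * num_hashes
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]


def lsh_clusters(names, num_hashes=20, bands=10):
    """Names whose signatures agree on any band land in the same bucket."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for name in names:
        sig = minhash_signature(shingles(name), num_hashes)
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(name)
    # Buckets with more than one member are our candidate clusters.
    return [members for members in buckets.values() if len(members) > 1]

# "smith" and "smtih" share most of their 2-shingles, so their signatures are
# likely to collide in at least one band and fall into the same cluster.
```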

| Grouping process | No. of groups | Avg. group size | Total pairwise comparisons | Pros | Cons |
| --- | --- | --- | --- | --- | --- |
| Last names | 35,369 | 2.8 | 1,836,445 | Easy to understand and explain; fast; works pretty well | Susceptible to typos and variations in last name text |
| LSH | 23,771 | 4 | 1,836,439 | Fuzzy matching deals with things like typos | Less intuitive; requires more DB queries (in my implementation) |

At the end of the day, both of these approaches end up giving us comparable performance. So owing to my bias toward keeping things simple, I'll use the last names approach for the rest of this exercise.

It's worth pointing out that we can actually reduce the potential universe of matches even further if we throw out extremely dissimilar pairs within a given last name group. It's pretty clear that Betty Smith from Utah and Joe Smith from Florida are different people, so why not just throw out that pair from the get-go?

We do this by calculating the Jaccard similarity of the 2-shingles of the first name, city and state of each potential pair. If the two strings are less than 10 percent similar, chances are they'd never be a match (the INITIAL_SIM parameter at the top of our script establishes the cutoff). Our approach also has the benefit of excluding any name that only appears in the dataset once (we'll assign IDs to those later), which pares down our universe of potential matches to less than 400,000.
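As a rough sketch of that pre-filter (the real check, and the actual INITIAL_SIM value, are in bin/preprocess.py; the field names below are assumptions):

```python
INITIAL_SIM = 0.1  # pairs below 10 percent similarity get thrown out up front


def shingles(text, k=2):
    """Same 2-shingle helper as in the LSH sketch above."""
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |A & B| / |A | B|."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def worth_comparing(c1, c2):
    """Cheap screen on first name + city + state before real feature extraction."""
    s1 = shingles(f"{c1['first_name']} {c1['city']} {c1['state']}")
    s2 = shingles(f"{c2['first_name']} {c2['city']} {c2['state']}")
    return jaccard(s1, s2) >= INITIAL_SIM

# Betty Smith of Provo, UT and Joe Smith of Tampa, FL share almost no shingles,
# so that pair gets discarded before it ever reaches the classifier.
```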

Once the script is done running, we'll no longer be dealing with a huge universe of campaign contributions. Instead, we'll be dealing with a bunch of smaller subsets, sort of like the image below. Now we can process those one at a time.

[Image: all contributions, blocked]

Extracting features

Relevant code: bin/preprocess.py, learn/features.py

It's probably unfair and a little confusing to frame feature extraction as a totally different step in this workflow. It's actually done as we're creating and assigning the groups above. But the idea of translating data into features is still conceptually important, so that's what I'll explain here.

Remember that within each subset, we want to look at each pair of donations and determine whether they came from the same person. For that we're going to enlist a machine learning classifier and ask it a simple question: Given various attributes about the two donors in this pair, are they the same person or not?

Machine learning algorithms like to live in a world of numbers, vectors and matrices. They can't just look at the strings "Joe" and "Joseph" and determine how similar they are -- they need to interact with the names on their own terms, which means translating them into rows of numbers. That's exactly what our feature extraction layer is designed to do.

The first thing we want to do is think of all the pieces of information we know about a contribution, and how we might use those to determine whether two donors are similar. Do the two contributors in the pair have the same first name? Do they have the same last name? Middle initial? How similar are their ZIP codes? Do they have similar occupations? And so on.

The format of our feature vector will be a one-dimensional array filled with numeric values -- either booleans (1 or 0) or continuous values between 0 and 1. Each of these values is generated by a function, which you can find in learn/features.py. For the sake of example, let's look at this one, which checks whether the two donors come from the same state:

```python
def same_state(c1, c2):
    # Boolean feature: 1 if both contributions list the same (cleaned) state.
    match = 0
    if clean_str(c1.state) == clean_str(c2.state):
        match = 1
    return match
```

As you can see, the function returns a boolean -- 1 or 0 -- depending on whether the two donors report coming from the same state. You can imagine this would be useful to a classifier trying to distinguish between Joe Smith in Arkansas and Joe Smith in Minnesota.

In all, this project contains 15 of these features, which draw from the parsed and unparsed name fields, city, state, zip code, occupation and employer. CRP data also comes with street addresses, which would almost certainly improve the classifier's accuracy, but we're not including them for now. The ultimate goal is to generalize this classifier to any campaign finance dataset -- local, state or federal -- and outside of federal data, addresses are often hard to come by.

After this round of processing, each pair is assigned a vector with values corresponding to each feature, something like this: [1, 1, 1, 1.0, 0.7142857142857143, 1, 1, 0.09090909090909091, 1, 1, 1, 1.0, 1, 1.0, 1]. The order of the specific features is a little hard to decipher, so it's printed out whenever you run extract_features.py for reference.
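To give a sense of how those 15 functions become a vector, here's a stripped-down sketch. Only same_state mirrors the code above; clean_str, same_first_name and the FEATURES list are stand-ins for what learn/features.py actually defines.

```python
def clean_str(value):
    """Stand-in normalizer; the real helper lives in learn/features.py."""
    return (value or '').strip().lower()


def same_first_name(c1, c2):
    # Hypothetical second boolean feature, for illustration only.
    return 1 if clean_str(c1.first_name) == clean_str(c2.first_name) else 0


def same_state(c1, c2):
    return 1 if clean_str(c1.state) == clean_str(c2.state) else 0

# The real project registers 15 such functions; calling them in a fixed order
# produces the numeric vector the classifier sees for each pair.
FEATURES = [same_first_name, same_state]


def extract_features(c1, c2):
    return [feature(c1, c2) for feature in FEATURES]
```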

In the next section, you'll see how our classifier puts these features to use.