
Defining donor clusters


In the last section, we used a Random Forest classifier to run through every last-name subset of our campaign finance data in order to link contributions that appeared to come from the same person. In this step, we'll go through each of those clusters and identify their connected components (which represent individual donors), mark them with unique IDs and evaluate the results.

Visually, the plan now is to take our connected contributions and turn them into donors, like this:

[Figure: linked contributions grouped into individual donors]

Finding connected components

Relevant code: bin/find_donors.py

In graph theory terms, what we're trying to do is find connected components. Fortunately, doing so is a pretty trivial process. As the Wikipedia page notes, finding connected components algorithmically can be done in linear time -- in other words, blazing fast. To pull that off, we're going to take advantage of Python's NetworkX library, which contains a ton of useful features for analyzing graphs.
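To make this concrete, here's a minimal sketch of what that looks like with NetworkX. The graph and edges below are made up for illustration; in our pipeline, the edges come from pairs the classifier marked as matches.

```python
import networkx as nx

# Nodes are contribution IDs; edges connect pairs the classifier
# said were made by the same person (toy data for illustration).
graph = nx.Graph()
graph.add_edges_from([(1, 2), (2, 3), (7, 8)])

# The one-liner: each connected component is a cluster of
# contributions that appear to come from the same donor.
clusters = list(nx.connected_components(graph))
print(clusters)  # e.g. [{1, 2, 3}, {7, 8}]
```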

The code in question is really only one line long, but it follows a process that bears a little explanation.

Remember that we first divided up our dataset into chunks by last name. Then we iterated, pairwise, through each of the contributions in those chunks and used a Random Forest to determine whether they were made by the same person -- effectively drawing those edges you see in the graph above. Now that we've got some edges connecting our nodes, we're going to loop through each of those clusters again and find the connected components.

In pseudocode, the process looks like this:

for each last name group:
    create a new network graph
    for each matching pair in the last name group:
        add an edge to the graph

    c = get_connected_clusters_from_graph()

    for each cluster in c:
        for each contribution in the cluster:
            d = some donor ID we create, based on cluster number and donor name
            assign the donor ID to the contribution and save

That process, which is documented in the assign_clusters() function here, will assign a donor ID (in this case, a hash) to every contribution in the dataset. Contributions from what appear to be the same donor will have the same IDs. Donors that only appear once aren't handled by this step, but we pick them up later by assigning their IDs in one fell swoop.
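To give a rough sense of what that looks like in code, here's a simplified sketch. The field names (`donor_name`, `donor_id`) and the MD5 hashing scheme are hypothetical stand-ins, not necessarily the exact ones assign_clusters() uses.

```python
import hashlib
import networkx as nx

def assign_donor_ids(matched_pairs, contributions):
    """Toy version of the cluster-assignment step: build a graph from
    matched contribution pairs, find the connected components, and stamp
    every contribution in a component with a shared donor ID."""
    graph = nx.Graph()
    graph.add_edges_from(matched_pairs)

    for cluster_num, cluster in enumerate(nx.connected_components(graph)):
        for contrib_id in cluster:
            contrib = contributions[contrib_id]
            # Hypothetical donor ID: a hash of the cluster number and the
            # contributor's name, so matching contributions share an ID.
            key = '%s|%s' % (cluster_num, contrib['donor_name'])
            contrib['donor_id'] = hashlib.md5(key.encode('utf-8')).hexdigest()
    return contributions
```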

Good news! Once this step is finished, we're done! Now let's see how we did.

Evaluating the errors

Relevant code: tests/match_tests.py

We know from our earlier tests that our Random Forest classifier has a precision, recall and F1 of 0.96 when identifying contributions from matching donors. But how does that translate to our final dataset? In the test run I did for this write-up, which uses a random slice of 10,000 records as training examples, here's how the final results looked.

There were 100,000 records in the entire dataset. Of those, 423 couldn't be properly tested because of missing data or inconsistencies. Within the remaining 99,577, the classifier came back with 1,188 potential false positives and 1,343 potential false negatives. On its face, the error rate looks like it would be about 2.5 percent (1,188 + 1,343 = 2,531 flagged records out of 99,577), but it's probably a bit less than that. Because of the way I'm querying out false positives and negatives, those error sets also include some correctly classified examples, so the real mismatch rate is probably closer to 2 percent. That's a bit better than, but still consistent with, the performance we saw in our earlier tests.

An informal review of the errors reveals a few interesting insights about what the classifier consistently gets right and wrong:

  • Many of the so-called "false positives" actually appear to be correct matches, implying that the Random Forest is catching some things CRP's system didn't. The same thing happens with false negatives, although less often.

  • It has some problems with suffixes like Jr. and Sr., particularly when they aren't parsed correctly. The name-cleaver parser we used in the preprocessing phase normally does a great job, but sometimes it misparses strings like "DE CLEVA, PAUL MR SR" where the suffix (SR) comes after the prefix (MR). This was a pretty typical problem among false positives.

  • It also occasionally groups together people who have very similar or identical names and live in the same city or ZIP code, particularly if the text describing their occupation and employer is similar as well. This doesn't seem to happen a lot, but it pops up from time to time.

  • Among false negatives, the Random Forest sometimes stumbles on people who list addresses in multiple states, even if they're clearly the same person. This happens especially often around the DC area, where people might live in Virginia or Maryland but work in the District. This might be improved by building a feature that calculates actual distance between ZIP codes, as sketched after this list.

  • It also sometimes messes up certain nicknames (Jack/John, William/Bill, etc.), again partially due to the name-cleaver. This could pretty easily be improved by beefing up the cleaver's nickname detection.
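To illustrate the ZIP-distance idea mentioned above, here's one way such a feature could work. The haversine formula below is standard; the `ZIP_CENTROIDS` lookup is a hypothetical placeholder for whatever ZIP-to-latitude/longitude data you'd actually plug in (the Census Bureau publishes ZCTA centroids, for instance).

```python
from math import radians, sin, cos, asin, sqrt

# Hypothetical lookup of ZIP code -> (latitude, longitude) centroid.
ZIP_CENTROIDS = {
    '22314': (38.8048, -77.0469),  # Alexandria, VA
    '20001': (38.9109, -77.0163),  # Washington, DC
}

def zip_distance_km(zip_a, zip_b):
    """Great-circle (haversine) distance between two ZIP centroids, in km."""
    lat1, lon1 = ZIP_CENTROIDS[zip_a]
    lat2, lon2 = ZIP_CENTROIDS[zip_b]
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # mean Earth radius ~6,371 km

# A small-but-nonzero distance like this could become a feature that helps
# the classifier link donors who live and work in different states.
print(round(zip_distance_km('22314', '20001'), 1))
```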

All in all, it seems safe to assume we might be able to boost the classifier's performance by maybe a percentage point with a little further cleanup. We'll talk about that a little more in the next and final section.