Applications and next steps

At the end of the day, our process did a pretty decent job of coming up with results that matched CRP's -- sometimes even finding things the CRP data got wrong. But the question still remains: So what? What good does it do to work with campaign data at the donor level, as opposed to the contribution level? Is it really worth all the trouble?

Of course it is! Here are a few examples of applications that might derive from this process at the local, state and federal levels.

Potential applications

Recall that part of the motivation behind this project was to generalize the standardization of donor names across campaign finance datasets. The main fields in most campaign finance datasets -- local, state or federal -- look pretty much the same: donor name, recipient name, some location information, and often some info about occupation and employer. CRP does a great job cleaning up this data on the national level, and the National Institute for Money in State Politics does something similar for the states, but neither of them is going to be able to standardize your local city council's campaign finance records on demand. So there's one application right there.

But that still leaves the bigger question: What's the point of standardizing this stuff at all? Some of the most instructive and inspirational ideas I've heard along these lines came from a data mining contest we ran late last year. I was with the Center for Investigative Reporting at the time, and we teamed up with IRE to co-host the contest with Kaggle. The point was for data scientists and other non-journalist experts to look at a set of federal campaign finance data for angles and ideas reporters wouldn't think of. You can see most of the entries here, but a few stood out:

  • The winning entry proposed a tool called a behavior stability index for tracking whether and when a particular donor's giving patterns change over time.

  • Another entry proposed linking donors with Wikipedia pages to enrich donor information with useful metadata. The entry also proposed looking at networks and communities of donors, which could reveal interesting patterns.

  • Another proposed using statistical techniques to detect donor coordination. Although the method was proposed to find illegal coordination between candidates and Super PACs, it could also be adapted to reveal donors who tend to work together, which could lead to new and interesting stories.

Common to all of those ideas is the seemingly obvious notion that political influence accrues to individuals and institutions -- not contributions. Unless we have a clear picture of what people and groups are doing, how their behavior changes, and what they get in return, the tracking of money in politics loses a lot of its power. Done right, that's exactly the kind of stuff that a good donor standardization system can enable: new visualizations, analyses and tools that keep better track of money's influence in politics.

Next steps

All in all, I think this was a successful first attempt at automated donor standardization, but obviously it's not perfect -- particularly given that it's making judgments based on campaign finance data that is notoriously flawed and incomplete. The next step for me is to clean up a few things around the edges to see if I can bump up the system's performance another point or so.

There are some easy preprocessing steps that should eliminate some stupid classification mistakes: dealing with zero-padding on ZIP codes, for example, and improving aspects of the name parser we're using, such as how it handles nicknames and certain suffix placements. Knowing now that the CRP data we've been using to train our model contains a few errors of its own, a thorough review of the training data could also be helpful.
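To make that a little more concrete, here's a minimal sketch of the kind of cleanup I have in mind. The field formats, the nickname map and the helper names are just illustrative -- they aren't part of the actual pipeline:

```python
# Illustrative preprocessing fixes: restore ZIP zero-padding and normalize
# nicknames/suffixes before names hit the matcher. Placeholder data only.
import re

# A tiny sample nickname map; a real one would be much larger.
NICKNAMES = {
    "bill": "william",
    "bob": "robert",
    "dick": "richard",
    "liz": "elizabeth",
    "peggy": "margaret",
}

SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}

def clean_zip(raw_zip):
    """Restore leading zeros stripped when ZIPs are read as integers."""
    digits = re.sub(r"\D", "", str(raw_zip))
    if not digits:
        return ""
    if len(digits) > 5:               # looks like a ZIP+4; pad to 9 first
        return digits.zfill(9)[:5]
    return digits.zfill(5)

def normalize_name(raw_name):
    """Lowercase, map common nicknames, and pull suffixes out of the name."""
    tokens = [t.strip(".,") for t in raw_name.lower().split()]
    suffix = ""
    if tokens and tokens[-1] in SUFFIXES:
        suffix = tokens.pop()
    tokens = [NICKNAMES.get(t, t) for t in tokens]
    return " ".join(tokens), suffix

if __name__ == "__main__":
    print(clean_zip(2138))                     # -> "02138"
    print(normalize_name("Bill O'Brien Jr."))  # -> ("william o'brien", "jr")
```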

On the machine learning front, an analysis of bias vs. variance in the model might also be helpful, although Random Forests are designed to be relatively resistant to overfitting. A closer look at which feature combinations work best would also be interesting.
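For what it's worth, here's a rough sketch of how that diagnosis might look with scikit-learn: a learning curve to separate bias from variance, plus a brute-force pass over feature subsets. The feature names and synthetic data below are placeholders, not the project's real features:

```python
# Sketch of bias/variance and feature-combination checks with scikit-learn.
from itertools import combinations

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, learning_curve

def diagnose(X, y, feature_names):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)

    # Learning curve: a large, persistent gap between training and validation
    # scores suggests variance; low scores on both suggest bias.
    sizes, train_scores, val_scores = learning_curve(
        clf, X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)
    for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
        print(f"n={n:5d}  train={tr:.3f}  val={va:.3f}")

    # Brute-force search over feature subsets to see which combinations carry
    # the signal (fine for a handful of features, too slow for many).
    for k in range(1, len(feature_names) + 1):
        for subset in combinations(range(len(feature_names)), k):
            cols = list(subset)
            score = cross_val_score(clf, X[:, cols], y, cv=5).mean()
            names = ", ".join(feature_names[i] for i in cols)
            print(f"{score:.3f}  [{names}]")

if __name__ == "__main__":
    # Synthetic stand-in data just to make the sketch runnable.
    from sklearn.datasets import make_classification
    X, y = make_classification(n_samples=500, n_features=4, random_state=0)
    diagnose(X, y, ["f0", "f1", "f2", "f3"])
```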

And finally, the big next step is to generalize this method to work with any campaign finance dataset -- federal, state or local. In the short term, that should just be a matter of tweaking this process and building an interface for tasks like uploading data, building training sets, and reviewing matches.

That's all for now! I'll continue to post updates to GitHub as I refine the process. In the meantime, if anyone has any thoughts or questions, I'm at chase.davis@gmail.com.