NC Campaign Finance

How to Run

Prerequisites

Docker + Docker Compose
A recent version of NodeJS (I'm using v12.16.3, but any recent version should be fine)

Running

From the data-cleaning directory:

Run cp sample.env .env then edit the values if they need to be changed from the default.

Copy the contributions CSV to ./tmp/data.csv

docker-compose up -d # This starts the database running on port 54322 and launches the PGWeb postgres client on port 8081'
npm install
npm run migrate up
node bin/etl.js # Alternatively `node bin/etl.js /path/to/some/file.csv`, if you to use a file other than ./tmp/data.csv
npm run start:dev

The data is now loaded into the database.
You can view the data in your preferred SQL client or by going to http://localhost:8081 in your browser.
The server is running on http://localhost:3001

Auto Process Records

To run the auto process program do the following

Update your .env DATABASE_URL if you want to run it against a remote DB

node bin/process.js

This will run the auto process with a 0.7 (70%) threshold. To use a different threshold do the following:

node bin/process.js 0.8

Routes

For all api routes providing a x-trigram-limit header will change the match level of the request
The value of the header should be a decimal number. I.E. 0.7 for a 70% match

GET api/contributions/matches/:name/:addr will return records that are similar to the given name and address
Example to get contributions for John Abbott: curl localhost:3000/contributions/matches/john%20abbott/410%20S%20Swing%20Rd
Alternatively to get contributions of Jon Abbott: curl localhost:3000/contributions/matches/jon%20abbott/410%20S%20Swing%20Rd
You will see the that this request returns the same results

GET /api/contributions/raw will return a list of contributions that are a close match EX:

{
    "data": {
        "raw": [
            {
                "id": "150a9ad1-8732-40ae-a614-dbec4b8c37c0",
                "name": "John Smith",
                "street_line_1": "123 Apple St",
                "street_line_2": null,
                "city": "Raleigh",
                "state": "NC",
                "zip_code": "12345",
                "profession": "Business",
                "employer_name": "Business Place",
                "transaction_type": "Individual",
                "committee_name": "Mrs. Politician",
                "committee_sboe_id": "STA-1234-5678",
                "committee_street_1": "PO BOX 12345",
                "committee_street_2": null,
                "committee_city": "RALEIGH",
                "committee_state": "NC",
                "committee_zip_code": "12345",
                "report_name": "2020 First Quarter",
                "date_occurred": "2/3/20",
                "account_code": "Not Available",
                "amount": 500,
                "form_of_payment": "Check",
                "purpose": null,
                "candidate_or_referendum_name": null,
                "declaration": null
            }
        ],
        "clean": [
            {
                "id": "956fddd7-20cd-407f-9c7a-75fce61cfe6b",
                "name": "John Smith",
                "street_line_1": "123 Apple St",
                "street_line_2": null,
                "city": "Raleigh",
                "state": "NC",
                "zip_code": "12345",
                "profession": "Business",
                "employer_name": "Business Place"
            }
        ]
    },
    "count": {
        "raw": 1,
        "clean": 1
    },
    "search": {
        "name": "John Smith",
        "address": "123 Apple ST"
    }
}

POST /api//contributions/clean Send a list of contribution ids that all belong to the same contributor EX:
contributorID is optional. If it is provided, the records will be associated with that ID

{
  "contributorID": "",
  "data": [
    "UUUD1", "UUID2"
  ]
}

Note: Parameters should be URI Encoded

Debugging

There are VSCode debug configurations setup for the etl script and the server

Data Cleaning

Tables

raw_contributions (The raw finance dataset)
- Current fields + UUID
contributors
- Individuals information + UUID
contributions
- Contribution information + ID field + Contributor_ID

Stories

As a data cleaner I should be presented with a list of contributions which are likely matches and be able to indicate which are matches and which are not
- Matches should be unique. IE multiple people should not see already processed contributions
A contributor uuid is created upon record cleaning. This means that if for some reason, the same person came up on two different cleanings, they would be classified as different records

Concerns

We may want to rethink how the ETL script works. Currently it truncates the raw_contributions table and reloads the dataset which means each record will be assigned a new UUID. This could cause issues because the contributions table utilizes that UUID
- One possible solution would be to generate a UUIDV5 for each record using the fields in the record while ingesting the record and before inserting into the database. This probably isn't the cleanest solution, but I wanted to document the idea and concern somewhere.

Future enhancements

Use a UUIDv5 or UUIDv3 and generate the UUID from the raw_contribution. This would give us reproducible UUIDs, which would make future data cleaning easier and more stable
For future cleaning, we can store all known permutations of a given contributors name and address and then make unclean data against it
Define onconflict conditions

How To's

Load data into Heroku

The auto process script chokes when running against Heroku database so to get around it you can do the following:
(NB: I haven't tried running this on a database with data already in it yet)

Load the data locally and run the auto process
Run node bin/exportToCSV.js
1. This should output 3 csv files
2. Optionally run node bin/exportToCSV.js some/other/directory to specify the output directory
Run the node bin/importFromCSV.js table_name file_path for each of the outputted files
1. EX: node bin/importFromCSV.js raw_contributions ./raw_contributions.csv
2. EX: node bin/importFromCSV.js contributions ./contributions.csv
3. EX: node bin/importFromCSV.js contributors ./contributors.csv
4. Remember to set the DATABASE_URL environment variable and to add the ?sslmode=require to load the data into Heroku

Name		Name	Last commit message	Last commit date
Latest commit History 67 Commits
.vscode		.vscode
data-cleaning		data-cleaning
frontend		frontend
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md
package-lock.json		package-lock.json
package.json		package.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.vscode

.vscode

data-cleaning

data-cleaning

frontend

frontend

.gitignore

.gitignore

Makefile

Makefile

README.md

README.md

package-lock.json

package-lock.json

package.json

package.json

Repository files navigation

NC Campaign Finance

How to Run

Prerequisites

Running

Auto Process Records

Routes

Debugging

Data Cleaning

Tables

Stories

Concerns

Future enhancements

How To's

Load data into Heroku

About

Releases

Packages

Contributors 3

Languages

SchaeStewart/nc-campaign-finance-data-cleaning

Folders and files

Latest commit

History

Repository files navigation

NC Campaign Finance

How to Run

Prerequisites

Running

Auto Process Records

Routes

Debugging

Data Cleaning

Tables

Stories

Concerns

Future enhancements

How To's

Load data into Heroku

About

Resources

Stars

Watchers

Forks

Languages