Predict Census Income

This is a collaborative predictive modeling project built on the ballet framework.

This project contains a feature engineering pipeline and associated models that can be used to predict personal income from raw survey responses to the US Census American Community Survey. The model built from features submitted by the community can then be used to optimize administration of the survey, direct public policy interventions, and assist empirical researchers.

See dates/times for upcoming virtual collaboration hours, opportunities to work at the same time as other collaborators and ask/answer questions in the chat.

Join the collaboration

Are you interested in joining the collaboration?

Your task

Your task is to create and submit one feature to the project.

The easiest way to get started and hack on this repository is to launch Assemblé, a interactive development environment built on top of Jupyter Lab and hosted in the cloud. You can read more about this development workflow here.
Alternately, you can use your preferred tools and development environment to create and submit a feature from your own machine. You can read about the local development workflow here.

Getting started

First, get acquainted with the Ballet framework if you are not yet familiar.

Look over the Ballet Contributor Guide
Look over the Ballet Feature Engineering Guide
Check out the features that are currently part of this project, in the contributed features directory (src/predict_census_income/features/contrib).
Check out the Analysis.ipynb notebook for some starter code

Virtual collaboration hours

We are hosting several Virtual Collaboration Hours (VCH) in which we will work together on feature engineering and help each other with ideas and implementation. The VCH will start with a short video presentation aimed at beginners introducing Ballet and this predict-census-income project, with an opportunity for Q&A. Then, we will split off to work and will be chatting in the Gitter chat.

Schedule (check back here to confirm!)

You can also see a recording of a recent VCH.

Dataset

Input data

The input data is the raw survey responses to the 2018 US Census American Community Survey (ACS). This is known as the "Public Use Microdata Sample" because otherwise most numbers from the ACS are reported in aggregate.

The data documentation can be viewed here
The data dictionary can be viewed here in PDF form, or here in CSV form.
Many additional resources about the ACS can be viewed here.
The dataset is created by merging the "household" and "person" parts of the survey. Thus one row of the dataset contains the responses for one person to both the household and person surveys. A person is identified by a unique SERIALNO. A set of "reasonable" rows is filtered as follows: (1) individuals older than 16 (2) personal income greater than $100 (3) hours worked in a typical week greater than 0.

The full script that minimally prepares the data is here.

The resulting training dataset has 30085 rows (people) and 494 columns (raw).

Prediction target

The prediction target is whether an individual respondent will earn more than $84,770 in 2018. Though a bit contrived, this comes from adapting the classic ML "census" dataset to the modern era. The original prediction target is to

determine whether a person makes over 50K a year.

Thus we adjust for inflation from 1994 to 2018.

Getting help

Read over Ballet's Frequently Asked Questions page
Join the project chat:

Usage

To use the feature engineering pipeline:

from predict_census_income.api import api
X_df, y_df = api.load_data()
pipeline = api.pipeline
features = pipeline.fit_transform(X_df, y_df)

To use a sample model:

from predict_census_income.models import train, predict
from predict_census_income.models.logistic_regression import create_logistic_regression_model
X_df, y_df = api.load_data()
model, encoder = train(X_df, y_df, create_logistic_regression_model)  # loads the pipeline/encoder automatically
predict(model, X_df)  # make predictions on training data

Name		Name	Last commit message	Last commit date
Latest commit History 143 Commits
.bin/split		.bin/split
.github		.github
binder		binder
docs		docs
notebooks		notebooks
src/predict_census_income		src/predict_census_income
.cookiecutter_context.json		.cookiecutter_context.json
.gitignore		.gitignore
.travis.yml		.travis.yml
README.md		README.md
ballet.yml		ballet.yml
requirements.txt		requirements.txt
setup.py		setup.py
tasks.py		tasks.py

ballet/predict-census-income

Folders and files

Latest commit

History

Repository files navigation

Predict Census Income

Join the collaboration

Your task

Getting started

Virtual collaboration hours

Dataset

Input data

Prediction target

Getting help

Usage

About

Resources

Stars

Watchers

Forks

Languages