Skip to content

E-A-Griffin/migration-pumas-database

Repository files navigation

Migration PUMAs Database

PostgreSQL database for storing PUMA and migPUMA information. More information about the underlying data can be found at https://www.census.gov/programs-surveys/acs/microdata/documentation.html. This current iteration of the database is based on the 5-year dataset between 2015-2019 conducted from the American Community Survey (ACS) Public Use Microdata Sample (PUMS) https://www2.census.gov/programs-surveys/acs/data/pums/2019/5-Year/. Specifically this dataset includes the person records for all fifty states, Washington D.C., and Puerto Rico.

PostgreSQL version

PostgreSQL 10.19 (Ubuntu 10.19-0ubuntu0.18.04.1)

Dependencies

Most of the dependencies for the project are only needed if one wants to build the database from scratch or add new survey years to the database.

Build from scratch

  • UNIX/Linux (tested on Ubuntu 18.04)
    • sed (tested on 4.4)
    • cat (tested on 8.28)
    • konwert (tested on 1.8)
  • Clojure (tested on 1.10.1)
  • Python (tested on 3.6.9)
  • PostgreSQL (tested on 10.19)
  • psql (tested on 10.19)

Build from release

  • PostgreSQL (tested on 10.19)
  • psql (tested on 10.19)

Add new survey years

  • UNIX/Linux (tested on Ubuntu 18.04)
    • sed (tested on 4.4)
  • Clojure (tested on 1.10.1) or JRE (tested on OpenJDK Runtime Environment (build 11.0.15+10-Ubuntu-0ubuntu0.18.04.1)
  • Python (tested on 3.6.9)
  • PostgreSQL (tested on 10.19)
  • psql (tested on 10.19)

Database creation & initialization

Overview

Converting the CSV files provided by https://www.census.gov/programs-surveys/acs/microdata/access.html into the format ultimately used by this PostgreSQL database involves multiple steps of data preprocessing and transformation. The current iteration of this database is based on the five year individual dataset from years 2015-2019, though this dataset aims to be dynamic in its capacity to be expanded to future releases of individual data. This database is a continuation of the initial work done in https://github.com/qiang-yi/PUMS_migration, thus this Python preprocessing is utilized as an initial step in converting the CSVs to SQL. For convenience, a modified version of the Python code used from this repository is included locally, given collaboration with the original authors of this repository.

Build from scratch

Database creation

Database initialization

Build from release

The easiest way to get the database up and running is via the included release. Once the release is unzipped and a new database has been created (this example assumes the new database is named pumsdb), run the following to import the database

psql -d pumsdb -f pumsdb-dump

You may receive some errors resembling

ERROR:   role "emma_audrey" does not exist

near the end of running psql -d pumsdb -f pumsdb-dump, these can be ignored, they likely arose from changing permissions on the original database the dump is based on.

Semantics of database attributes

Columns of the tables in the database are based mostly on https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2015-2019.pdf. Column names are made to closely resemble their original names in this data dictionary, where applicable. There are general tendencies for converting between the ranges of the values present in the original columns of the CSVs and the respective attributes in the associated SQL tables, for example, boolean/predicate columns in the data dictionary are represented by

b N/A
1 Yes
2 No

such that

b => NULL
1 => true
2 => false

This is more inline with SQL data types, especially given SQL’s support for ternary boolean logic (a bool b in SQL is either true, false, or NULL). This change allows for space and time complexity optimizations over using a numeric type to represent these values.

Future work

Given the Census’ record of changing pumas in accordance with the census conducted each ten years (as they have in 2000 and 2010). Inclusion of future years’ data may require reworking queries and the database itself, rather than simply being able to preprocess the data using the included scripts and then inserting the resultant CSV files into the existing database. This should be possible for expanding the database to include data from previous years, such as the 2010-2014 five year dataset not currently included.

Discrepancies between survey years

One important note when considering adding future or previous years to the database is the change in names from the data dictionary codes to longer names that are easier to read and immediately understand. A downside of this design decision is that when importing a CSV provided by the Census Bureau, there likely is a change in the attributes/columns from the 2015-2019 dataset to whatever year(s) are being added. There could also be changes when switching between 5 year and 1 year dataset, unfortunately, any remedy to this issue involves, at best, a semi-automated removing or padding of attributes to conform to the existent database schema. For example, when attempting to add the 1 year 2020 dataset, the new attribute “HIMRKS” appears, which represents “Subsidized Marketplace Coverage”. If one does not want to update the columns of the underlying database, then the solution is to simply remove this attribute from the CSV altogether. A script preprocess-new-file.sh has been provided to do so automatically, i.e. a program that takes in a list of command line arguments and removes each column matching the attribute name of each argument.

Data Dictionary → Database Column Names

The following data structure shows the mapping from the 2015-2019 ACS PUMS Data Dictionary codes to their respective column names in the database. The hash map shown is also available in data_mapping.edn.

;; Keys are data dictionary codes and values are their respective database column names
{;; Basic Variables
 :RT        "record_type"
 :SERIALNO  "serial_no"
 :DIVISION  "division"
 :SPORDER   "person_number"
 :PUMA      "puma"
 :REGION    "region"
 :ST        "state"
 :ADJINC    "adjustment_factor"
 :PWGTP     "person_weight"
 ;; Person Variables
 :AGEP      "age"
 :CIT       "citizenship_status"
 :CITWP     "naturalization_year"
 :COW       "worker_class"
 :DDRS      "self_care_difficulty"
 :DEAR      "hearing_difficulty"
 :DEYE      "vision_difficulty"
 :DOUT      "independent_living_difficulty"
 :DPHY      "ambulatory_difficulty"
 :DRAT      "vet_disability_rating"
 :DRATX     "vet_disability"
 :DREM      "cognitive_difficulty"
 :ENG       "english_ability"
 :FER       "gave_birth_last_year"
 :GCL       "grandparents_w_grandchildren"
 :GCM       "length_responsible_for_grandchildren"
 :GCR       "grandparents_responsible_for_grandchildren"
 :HINS1     "insurance_through_employer"
 :HINS2     "insurance_purchased_directly"
 :HINS3     "medicare"
 :HINS4     "medicaid_etc"
 :HINS5     "tricare"
 :HINS6     "va"
 :HINS7     "ind_health_service"
 :INTP      "interest_div_rental_income_last_year"
 :JWMNP     "travel_time_to_work"
 :JWRIP     "vehicle_occupancy"
 :JWTRNS    "means_of_transportation_to_work"
 :LANX      "language_other_than_english"
 :MAR       "marital_status"
 :MARHD     "divorced_in_last_year"
 :MARHM     "married_in_last_year"
 :MARHT     "n_times_married"
 :MARHW     "widowed_in_last_year"
 :MARHYP    "year_last_married"
 :MIG       "mobility_status"
 :MIL       "military_service"
 :MLPA      "served_after_09_2001"
 :MLPB      "served_from_08_1990_to_08_2001"
 :MLPCD     "served_from_05_1975_to_07_1990"
 :MLPE      "served_from_08_1964_to_04_1975"
 :MLPFG     "served_from_02_1955_to_07_1964"
 :MLPH      "served_from_07_1950_to_01_1955"
 :MLPI      "served_from_01_1947_to_06_1950"
 :MLPJ      "served_from_12_1941_to_12_1946"
 :MLPK      "served_before_11_1941"
 :NWAB      "temp_absence_from_work"
 :NWAV      "available_for_work"
 :NWLA      "on_layoff"
 :NWLK      "looking_for_work"
 :NWRE      "informed_of_recall"
 :OIP       "other_income_last_year"
 :PAP       "public_assistance_income_last_year"
 :RELSHIPP  "relationship_to_ref_person"
 :RETP      "retirement_income_last_year"
 :SCH       "school_enrollment"
 :SCHG      "grade_attending"
 :SCHL      "educational_attainment"
 :SEMP      "self_employed_income_last_year"
 :SEX       "sex"
 :SSIP      "supp_security_income_last_year"
 :SSP       "social_security_income_last_year"
 :WAGP      "salary_last_year"
 :WKHP      "usual_work_hours"
 :WKL       "last_worked"
 :WKW       "weeks_worked_last_year"
 :WKWN      "weeks_worked_2019_or_later"
 :WRK       "worked_last_week"
 :YOEP      "year_of_entry"
 ;; Recoded Person Variables
 :ANC       "ancestry_recode"
 :ANC1P     "ancestry_recode_first_entry"
 :ANC2P     "ancestry_recode_second_entry"
 :DECADE    "decade_of_entry"
 :DIS       "disability_recode"
 :DRIVESP   "number_of_vehicles"
 :ESP       "parents_employment"
 :ESR       "employment_status_recode"
 :FOD1P     "recode_field_of_degree_first_entry"
 :FOD2P     "recode_field_of_degree_second_entry"
 :HICOV     "health_insurance_recode"
 :HISP      "hispanic_origin_recode"
 :INDP      "industry_recode_for_2018_or_later"
 :JWAP      "time_of_arrival_at_work"
 :JWDP      "time_of_departure_for_work"
 :LANP      "language_other_than_english_code"
 :MIGPUMA   "migration_puma_2010_census"
 :MIGSP     "migration_recode"
 :MSP       "married_spouse_present"
 :NAICSP    "naics_recode_for_2018_or_later"
 :NATIVITY  "nativity"
 :NOP       "nativity_parent"
 :OC        "own_child"
 :OCCP      "occcupation_recode_for_2018_or_later"
 :PAOC      "presence_and_age_of_childen"
 :PERNP     "total_persons_earnings"
 :PINCP     "total_persons_income"
 :POBP      "place_of_birth"
 :POVPIP    "income_to_poverty_recode"
 :POWPUMA   "place_of_work_puma"
 :POWSP     "place_of_work_state_country"
 :PRIVCOV   "private_health_cov_recode"
 :PUBCOV    "public_health_cov_recode"
 :QTRBIR    "quarter_of_birth"
 :RAC1P     "recoded_race_code"
 :RAC2P     "recoded_race_code2"
 :RAC3P     "recoded_race_code3"
 :RACAIAN   "native_american_recode"
 :RACASN    "asian_recode"
 :RACBLK    "black_recode"
 :RACNH     "native_hawaiian_recode"
 :RACNUM    "n_races_represented"
 :RACPI     "pacific_islander_recode"
 :RACSOR    "other_races_recode"
 :RACWHT    "white_recode"
 :RC        "related_child"
 :SCIENGP   "field_of_degree_science_and_engineering"
 :SCIENGRLP "field_of_degree_science_and_engineering_related"
 :SFN       "subfamily_number"
 :SFR       "subfamily_relationship"
 :SOCP      "soc_codes_for_2018_and_later"
 :VPS       "vet_period_of_service"
 :WAOB      "world_area_of_birth"
 ;; Allocation Flags
 :FAGEP     "age_alloc_flag"
 :FANCP     "ancestry_alloc_flag"
 :FCITP     "citizenship_alloc_flag"
 :FCITWP    "year_of_naturalization_alloc_flag"
 :FCOWP     "class_of_worker_alloc_flag"
 :FDDRSP    "self_care_alloc_flag"
 :FDEARP    "hearing_difficulty_alloc_flag"
 :FDEYEP    "vision_difficulty_alloc_flag"
 :FDISP     "disability_recode_alloc_flag"
 :FDOUTP    "independent_living_difficulty_alloc_flag"
 :FDPHYP    "ambulatory_difficulty_alloc_flag"
 :FDRATP    "disability_rating_percentage_alloc_flag"
 :FDRATXP   "disability_rating_checkbox_alloc_flag"
 :FDREMP    "cognitive_difficulty_alloc_flag"
 :FENGP     "ability_to_speak_english_alloc_flag"
 :FESRP     "employment_status_recode_alloc_flag"
 :FFERP     "gave_birth_last_year_alloc_flag"
 :FFODP     "field_of_degree_alloc_flag"
 :FGCLP     "grandparents_living_w_grandchildren_alloc_flag"
 :FGCMP     "length_of_time_responsible_for_grandchildren_alloc_flag"
 :FGCRP     "grandparents_responsible_for_grandchildren_alloc_flag"
 :FHICOVP   "insurance_recode_alloc_flag"
 :FHINS1P   "insurance_through_employer_alloc_flag"
 :FHINS2P   "insurance_direct_alloc_flag"
 :FHINS3C   "medicare_coverage_given_through_eligibility_alloc_flag"
 :FHINS3P   "medicare_65_or_older_certain_disabilities_alloc_flag"
 :FHINS4C   "medicare_coverage_given_through_eligibility_alloc_flag2"
 :FHINS4P   "govt_assistance_alloc_flag"
 :FHINS5C   "tricare_through_eligibility_alloc_flag"
 ;; Typo in here on database side, may be worth correcting in a future release
 :FHINS5P   "trcare_alloc_flag"
 :FHINS6P   "va_alloc_flag"
 :FHINS7P   "ind_health_service_alloc_flag"
 :FHISP     "detailed_hispanic_origin_alloc_flag"
 :FINDP     "industry_alloc_flag"
 :FINTP     "interest_dividend_rental_income_alloc_flag"
 :FJWDP     "time_of_departure_to_work_alloc_flag"
 :FJWMNP    "travel_time_to_work_alloc_flag"
 :FJWRIP    "vehicle_occupancy_alloc_flag"
 :FJWTRNSP  "means_of_transportation_alloc_flag"
 :FLANP     "language_other_than_english_spoken_at_home_alloc_flag"
 :FLANXP    "language_other_than_english_alloc_flag"
 :FMARP     "marital_status_alloc_flag"
 :FMARHDP   "divorced_last_year_alloc_flag"
 :FMARHMP   "married_last_year_alloc_flag"
 :FMARHTP   "times_married_alloc_flag"
 :FMARHWP   "widowed_last_year_alloc_flag"
 :FMARHYP   "year_last_married_alloc_flag"
 :FMIGP     "mobility_status_alloc_flag"
 :FMIGSP    "migration_state_alloc_flag"
 :FMILPP    "military_periods_of_service_alloc_flag"
 :FMILSP    "military_service_alloc_flag"
 :FOCCP     "occupation_alloc_flag"
 :FOIP      "all_other_income_alloc_flag"
 :FPAP      "public_assistance_income_alloc_flag"
 :FPERNP    "total_persons_earnings_alloc_flag"
 :FPINCP    "total_persons_income_alloc_flag"
 :FPOBP     "place_of_birth_alloc_flag"
 :FPOWSP    "place_of_work_state_alloc_flag"
 :FPRIVCOVP "private_health_insurance_coverage_recode_alloc_flag"
 :FPUBCOVP  "public_health_coverage_recode_alloc_flag"
 :FRACP     "detailed_race_alloc_flag"
 :FRELSHIPP "relationship_alloc_flag"
 :FRETP     "retirement_income_alloc_flag"
 :FSCHGP    "grade_attending_alloc_flag"
 :FSCHLP    "highest_education_alloc_flag"
 :FSCHP     "school_enrollment_alloc_flag"
 :FSEMP     "self_employment_income_alloc_flag"
 :FSEXP     "sex_allocation_flag"
 :FSSIP     "supplementary_security_income_alloc_flag"
 :FSSP      "social_security_income_alloc_flag"
 :FWAGP     "wages_and_salary_income_alloc_flag"
 :FWKHP     "usual_hours_worked_per_week_last_year_alloc_flag"
 :FWKLP     "last_worked_alloc_flag"
 :FWKWNP    "weeks_worked_numeric_last_year_alloc_flag"
 :FWKWP     "weeks_worked_last_year_alloc_flag"
 :FWRKP     "worked_last_week_alloc_flag"
 :FYOEP     "year_of_entry_alloc_flag"
 ;; Replicate Weights
 :PWGTP1    "person_weight_replicate_1"
 :PWGTP2    "person_weight_replicate_2"
 :PWGTP3    "person_weight_replicate_3"
 :PWGTP4    "person_weight_replicate_4"
 :PWGTP5    "person_weight_replicate_5"
 :PWGTP6    "person_weight_replicate_6"
 :PWGTP7    "person_weight_replicate_7"
 :PWGTP8    "person_weight_replicate_8"
 :PWGTP9    "person_weight_replicate_9"
 :PWGTP10   "person_weight_replicate_10"
 :PWGTP11   "person_weight_replicate_11"
 :PWGTP12   "person_weight_replicate_12"
 :PWGTP13   "person_weight_replicate_13"
 :PWGTP14   "person_weight_replicate_14"
 :PWGTP15   "person_weight_replicate_15"
 :PWGTP16   "person_weight_replicate_16"
 :PWGTP17   "person_weight_replicate_17"
 :PWGTP18   "person_weight_replicate_18"
 :PWGTP19   "person_weight_replicate_19"
 :PWGTP20   "person_weight_replicate_20"
 :PWGTP21   "person_weight_replicate_21"
 :PWGTP22   "person_weight_replicate_22"
 :PWGTP23   "person_weight_replicate_23"
 :PWGTP24   "person_weight_replicate_24"
 :PWGTP25   "person_weight_replicate_25"
 :PWGTP26   "person_weight_replicate_26"
 :PWGTP27   "person_weight_replicate_27"
 :PWGTP28   "person_weight_replicate_28"
 :PWGTP29   "person_weight_replicate_29"
 :PWGTP30   "person_weight_replicate_30"
 :PWGTP31   "person_weight_replicate_31"
 :PWGTP32   "person_weight_replicate_32"
 :PWGTP33   "person_weight_replicate_33"
 :PWGTP34   "person_weight_replicate_34"
 :PWGTP35   "person_weight_replicate_35"
 :PWGTP36   "person_weight_replicate_36"
 :PWGTP37   "person_weight_replicate_37"
 :PWGTP38   "person_weight_replicate_38"
 :PWGTP39   "person_weight_replicate_39"
 :PWGTP40   "person_weight_replicate_40"
 :PWGTP41   "person_weight_replicate_41"
 :PWGTP42   "person_weight_replicate_42"
 :PWGTP43   "person_weight_replicate_43"
 :PWGTP44   "person_weight_replicate_44"
 :PWGTP45   "person_weight_replicate_45"
 :PWGTP46   "person_weight_replicate_46"
 :PWGTP47   "person_weight_replicate_47"
 :PWGTP48   "person_weight_replicate_48"
 :PWGTP49   "person_weight_replicate_49"
 :PWGTP50   "person_weight_replicate_50"
 :PWGTP51   "person_weight_replicate_51"
 :PWGTP52   "person_weight_replicate_52"
 :PWGTP53   "person_weight_replicate_53"
 :PWGTP54   "person_weight_replicate_54"
 :PWGTP55   "person_weight_replicate_55"
 :PWGTP56   "person_weight_replicate_56"
 :PWGTP57   "person_weight_replicate_57"
 :PWGTP58   "person_weight_replicate_58"
 :PWGTP59   "person_weight_replicate_59"
 :PWGTP60   "person_weight_replicate_60"
 :PWGTP61   "person_weight_replicate_61"
 :PWGTP62   "person_weight_replicate_62"
 :PWGTP63   "person_weight_replicate_63"
 :PWGTP64   "person_weight_replicate_64"
 :PWGTP65   "person_weight_replicate_65"
 :PWGTP66   "person_weight_replicate_66"
 :PWGTP67   "person_weight_replicate_67"
 :PWGTP68   "person_weight_replicate_68"
 :PWGTP69   "person_weight_replicate_69"
 :PWGTP70   "person_weight_replicate_70"
 :PWGTP71   "person_weight_replicate_71"
 :PWGTP72   "person_weight_replicate_72"
 :PWGTP73   "person_weight_replicate_73"
 :PWGTP74   "person_weight_replicate_74"
 :PWGTP75   "person_weight_replicate_75"
 :PWGTP76   "person_weight_replicate_76"
 :PWGTP77   "person_weight_replicate_77"
 :PWGTP78   "person_weight_replicate_78"
 :PWGTP79   "person_weight_replicate_79"
 :PWGTP80   "person_weight_replicate_80"}

Example: adding 1 year 2020 data to database

path-to-2020-data.csv refers to the path where the CSV containing the 1 year 2020 CSV is located and path-to-output.csv refers to the path where the final output file will be written to.

./preprocess-new-file.sh path-to-2020-data.csv path-to-output.csv

Potential considerations

The bash file included preprocess-new-file.sh has #!/usr/bin/env bash and the python file included add-migration-attrs.py has #!/usr/bin/python3 specified as the paths for bash and python3, respectively, if these programs are installed in another location then you will need to edit the top lines in these files’ source codes to point to the correct paths for your setup.

Another thing to be aware of is the size of the input CSV files. Even a single survey year could be large enough that the Python script may run out of memory on certain systems. If you encounter these issues, then run the script for each state, obviously this isn’t ideal to be done manually but writing a script to iterate through each state’s CSV file should be simple to implement if needed.

About

PostgreSQL database containing migration information based on the U.S. Census Bureau's Public Use Microdata Areas (PUMAs) dataset

Topics

Resources

Stars

Watchers

Forks

Packages

No packages published