cfpb/proxy-methodology

BISG_RACE_ETHNICITY

In conducting fair lending analysis in both supervisory and enforcement contexts, the Bureau’s Office of Research (OR) and Division of Supervision, Enforcement, and Fair Lending (SEFL) rely on a Bayesian Improved Surname Geocoding (BISG) proxy method. The method combines geography- and surname-based information into a single proxy probability for race and ethnicity, which is used in fair lending analysis of non-mortgage products. This document describes the steps needed to build the BISG proxies.

The methodology described here is an example of a proxy methodology that OR and SEFL use, although we may alter this methodology in particular analyses, depending on the circumstances involved. In addition, the proxy method may be revised as we become aware of enhancements that would increase accuracy and performance. For more details, see “Using Publicly Available Information to Proxy for Unidentified Race and Ethnicity: A Methodology and Assessment”.

This repository includes a series of Stata scripts and subroutines that prepare the publicly available census geography and surname data and construct the surname-only, geography-only, and BISG proxies for race and ethnicity. The scripts, subroutines, and data provided here do not contain directly identifiable personal information or other confidential information, such as confidential supervisory information.

Please note that all scripts and subroutines are written for execution in Stata 12 on a Linux platform and may need to be modified for other environments. Users must define a number of parameters, including file paths and arguments for subroutines. The scripts that define the subroutines also identify and describe arguments, as required.

Users must supply their own application- or individual-level data, and any geocoding of those data must occur prior to the execution of the script sequence: this code assumes that the input application- or individual-level data are already geocoded with census block group, census tract, and 5-digit ZIP code.

An example is included to show the user how to execute the proxy-building code sequence. It relies on a set of fictitious data constructed by create_test_data.do from the publicly available census surname list and geography data. The example illustrates how main.do is set up to run the proxy-building code and does not reflect any particular individual’s or institution’s information.

A control script, /scripts/main.do, is included to step through the process below. The user will need to change paths and define parameters as required.

  1. Geocode the data in a geocoding software package (for example, ArcGIS) to obtain tract and block group identifiers for each record.
  2. Build name and geography proxies from Census files included in /input_files:
    1. Census surname list:
      1. /scripts/surname_creation_lower.do—takes .csv file of census surnames, formats surnames to be read as all lower case, and imputes any suppressed values. File created by surname_creation_lower.do:
        1. /input_files/created/census_surnames_lower.dta
      2. Basic cleaning of surnames, using regular expressions and other forms of name standardization, is required to prepare the user-defined datasets for use with the Census surname list. The script /scripts/surname_parser.do performs this cleaning. File created by surname_parser.do in user-defined directory:
        1. `dir'/proxy_name.dta
    2. Census geographies:
      1. The script /scripts/create_attr_over18_all_geo_entities.do uses the base information, for individuals age 18 and older, from the Census flat files for block group, tract, and ZIP code[1] and allocates "Some Other Race"[2] to each group in proportion. It creates three files (one each for block group, tract, and ZIP code) with geo probabilities for use in proxy:
        1. /input_files/created/blkgrp_attr_over18.dta
        2. /input_files/created/tract_attr_over18.dta
        3. /input_files/created/zip_attr_over18.dta
  3. Calculate the BISG probabilities following the methodology described in “Using Publicly Available Information to Proxy for Unidentified Race and Ethnicity: A Methodology and Assessment”.
    1. /scripts/geo_name_merger_all_entities_over18.do—this program creates three files (one each for block group, tract, and ZIP code) with BISG probabilities in user-defined directory:
      1. /`maindir'/`inst_name'_proxied_blkgrp.dta
      2. /`maindir'/`inst_name'_proxied_tract.dta
      3. /`maindir'/`inst_name'_proxied_zip.dta
  4. The final step is to merge together the block group, tract, and ZIP code-based BISG proxies and choose the most precise proxy given the precision of geocoding, e.g. block group (if available), then tract (if available), or ZIP code (if block group and tract unavailable) using:
    1. /scripts/combine_probs.do. File created by combine_probs.do in user-defined directory:
      1. /`maindir'/`inst_name'_`file'proxied_final.dta
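
The fallback rule in step 4 can be sketched in a few lines. This is an illustrative Python sketch, not the logic of combine_probs.do itself; the function name, the level keys, and the use of `None` for a missing proxy are all assumptions made for the example.

```python
# Geography levels ordered from most to least precise, following step 4.
PRECISION_ORDER = ("blkgrp", "tract", "zip")


def choose_final_proxy(proxies):
    """Return (level, probabilities) for the most precise level available.

    `proxies` maps a level name to a list of probabilities, or to None
    (or omits the key) when no proxy exists at that level for the record.
    Returns (None, None) when no level has a proxy.
    """
    for level in PRECISION_ORDER:
        probs = proxies.get(level)
        if probs is not None:
            return level, probs
    return None, None


# Hypothetical record that was geocoded to tract but not to block group.
record = {"blkgrp": None, "tract": [0.1, 0.2, 0.7], "zip": [0.3, 0.3, 0.4]}
level, probs = choose_final_proxy(record)
print(level, probs)
```

Ordering the levels in a single tuple keeps the precedence explicit and makes it easy to see which proxy wins when several are available.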

Please direct all questions, comments, and suggestions to: CFPB_proxy_methodology_comments@cfpb.gov.


[1] When referring to ZIP code demographics, we match the institution-based ZIP code information to ZIP Code Tabulation Areas (ZCTAs) as defined by the U.S. Census Bureau.

[2] In the 2010 SF1, the U.S. Census Bureau produced tabulations that report counts of Hispanics and non-Hispanics by race. These tabulations include a “Some Other Race” category. We reallocate the “Some Other Race” counts to each of the remaining six race and ethnicity categories using an Iterative Proportional Fitting procedure to make geography-based demographic categories consistent with those on the census surname list.
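
To give a feel for what reallocating "in proportion" means, the Python sketch below spreads a “Some Other Race” count across the remaining categories according to their existing shares. It is a single-pass simplification, not the Iterative Proportional Fitting procedure used in the Stata scripts, and the function name, category labels, and counts are made up for illustration.

```python
def reallocate_some_other_race(counts, other_key="some_other_race"):
    """Spread the `other_key` count over the remaining categories pro rata.

    `counts` maps category labels to population counts. The result omits
    `other_key`. When the remaining categories sum to zero there is no
    basis for a proportional split, so they are returned unchanged.
    """
    remaining = {k: v for k, v in counts.items() if k != other_key}
    total = sum(remaining.values())
    other = counts.get(other_key, 0)
    if total == 0:
        return dict(remaining)
    return {k: v + other * v / total for k, v in remaining.items()}


# Made-up counts for one geography; the labels are illustrative only.
example = {"hispanic": 200, "white": 500, "black": 200, "some_other_race": 100}
adjusted = reallocate_some_other_race(example)
print(adjusted)
```

Whenever the remaining categories have a nonzero total, the overall population is preserved: the adjusted counts sum to the original total including “Some Other Race”.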


Update to proxy methodology – April 2017

In the summer 2014 edition of Supervisory Highlights,[3] the Bureau reported that examination teams use a Bayesian Improved Surname Geocoding (BISG) proxy methodology for race and ethnicity in their fair lending analysis of non-mortgage credit products. The BISG methodology relies on the distribution of race and ethnicity by place of residence and by surname, both of which are publicly available from the Census Bureau. The method involves constructing a probability of assignment to race and ethnicity based on demographic information associated with surname and then updating this probability using the demographic characteristics of the census block group associated with place of residence. The updating is performed through the application of a Bayesian algorithm, which yields an integrated probability that can be used to proxy for an individual’s race and ethnicity.[4]
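
The Bayesian update described above can be sketched as follows. This is a minimal Python illustration of one common formulation, not the repository's Stata code. It assumes two inputs per record: the share of people with the surname who belong to each race and ethnicity group, and the share of each group's total population that lives in the record's geography. The function name and all numbers are made up.

```python
def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Combine surname-based and geography-based information via Bayes' rule.

    Both arguments are sequences with one entry per race/ethnicity group,
    in the same order. The posterior for each group is proportional to
    (surname-based probability) * (share of that group living in the area),
    normalized so the posteriors sum to 1.
    """
    joint = [p * q for p, q in zip(p_race_given_surname, p_geo_given_race)]
    total = sum(joint)
    if total == 0:
        raise ValueError("surname and geography information are incompatible")
    return [j / total for j in joint]


# Two hypothetical groups: the surname alone is uninformative (50/50), but
# the second group is three times as concentrated in this block group.
print(bisg_posterior([0.5, 0.5], [0.002, 0.006]))
```

In this example the surname prior of 50/50 shifts to roughly 25/75 once the geography information is applied, which is the sense in which geography "updates" the surname-based probability.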

Through March of 2017, examination teams had relied on the surname list derived from the 2000 Decennial Census of the Population in their construction of the BISG proxy for race and ethnicity.[5] In December 2016, the U.S. Census Bureau released a list of the most frequently occurring surnames based on data derived from the 2010 Decennial Census of the Population. The updated 2010 list generally uses the same definitions and formats as the list based on the 2000 Census but includes updated values for total counts and race and ethnicity shares associated with each surname.[6] In total, the new surname list provides information on the 162,253 surnames that appear at least 100 times in the 2010 Census, covering approximately 90% of the population.[7] While 146,516 names appear on both the 2000 and 2010 surname lists, the 2010 list contains 15,737 names that do not appear on the 2000 list, whereas the 2000 list contains 5,155 names that do not appear on the 2010 list.[8]

As of April 2017, examination teams are relying on an updated proxy methodology that reflects the newly available surname data from the Census Bureau. Our updated proxy methodology relies on the race and ethnicity shares for the 162,253 names that appear on the 2010 list and supplements this list with the race and ethnicity shares for the 5,155 names that appear on the 2000 list but not on the 2010 list, resulting in a list of 167,408 surnames in total.[9]
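
The supplementing rule can be illustrated with a small Python sketch: every 2010 entry is kept, and a 2000 entry is used only when the surname is absent from the 2010 list. The function name, dictionary layout, surnames, and shares below are invented for the example and are not taken from the Census files or the Stata scripts.

```python
def combine_surname_lists(list_2010, list_2000):
    """Merge two surname tables, preferring 2010 values on overlap.

    Each argument maps a lower-case surname to its race/ethnicity shares.
    Names found only on the 2000 list are carried over unchanged.
    """
    combined = dict(list_2000)   # start from the older list
    combined.update(list_2010)   # 2010 entries overwrite any overlap
    return combined


# Toy tables: "alpha" is on both lists, "beta" only on 2010, "gamma" only on 2000.
shares_2010 = {"alpha": [0.6, 0.4], "beta": [0.1, 0.9]}
shares_2000 = {"alpha": [0.7, 0.3], "gamma": [0.5, 0.5]}
merged = combine_surname_lists(shares_2010, shares_2000)
print(sorted(merged))      # ['alpha', 'beta', 'gamma']
print(merged["alpha"])     # [0.6, 0.4]
```

The size of the merged table is the number of 2010 names plus the number of names found only on the 2000 list.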

The updated name list, statistical software code written in Stata, and other publicly available data used to build the BISG proxy are now available in this repository.


Update to proxy methodology – April 2024

As of April 2024, examination teams performing fair lending analysis are relying on the newly available demographic data from the Census Bureau. The new demographic data are derived from the 2020 Census Demographic and Housing Characteristics (DHC) Files.[10] To derive the new demographic files, the Bureau pulled DHC Table P11 at the block group, census tract, and ZIP code level through the Census API.[11] This product uses the Census Bureau Data API but is not endorsed or certified by the Census Bureau.
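
For readers who want to see what such a pull might look like, the Python sketch below builds a tract-level request URL for Table P11. The endpoint and query syntax reflect our understanding of the public Census Data API and should be checked against its documentation. The helper name is hypothetical, and this is not necessarily how the files in this repository were produced. Block group and ZIP code (ZCTA) requests use different geography clauses that are not shown here.

```python
from urllib.parse import urlencode

# Assumed base URL for the 2020 Census DHC dataset on the Census Data API.
DHC_BASE_URL = "https://api.census.gov/data/2020/dec/dhc"


def build_p11_tract_url(state_fips):
    """Build a URL requesting all P11 variables for every tract in a state.

    `state_fips` is the two-digit state FIPS code as a string, e.g. "06".
    """
    params = {
        "get": "group(P11)",          # request the whole P11 table
        "for": "tract:*",             # every census tract...
        "in": f"state:{state_fips}",  # ...within the given state
    }
    # Leave the API's punctuation readable instead of percent-encoding it.
    return f"{DHC_BASE_URL}?{urlencode(params, safe='():*')}"


print(build_p11_tract_url("06"))
```

The sketch stops at building the URL so that it runs without network access; fetching and parsing the response is left to the reader's HTTP client of choice.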

The updated demographic data and modified versions of the BISG proxy code using the new 2020 demographic data are all now available in the 2024-update folder in this repository. The repository also continues to contain all of the previous code and data for users who would prefer to continue to generate proxies using the 2010 demographic data.


[3] See Consumer Financial Protection Bureau, Supervisory Highlights: Summer 2014 (Sept. 17, 2014).

[4] For more information on the methodology, see Consumer Financial Protection Bureau, Using publicly available information to proxy for unidentified race and ethnicity (Sept. 2014).

[5] See id.

[6] For more details on the updated 2010 surname list, including revisions to the 2000 methodology and programming, see Joshua Comenetz, Frequently Occurring Surnames in the 2010 Census (Oct. 2016).

[7] The surname data are available on the Census Bureau’s website; see Frequently Occurring Surnames from the 2010 Census (last revised Dec. 27, 2016).

[8] Names must appear at least 100 times in the 2010 Decennial Census in order to be included on the surname list.

[9] Although these names are not on the 2010 list, and thus likely no longer meet the 100-occurrence threshold, we chose to include them so as to incorporate as much available surname information as possible into the proxy.

[10] See 2020 Census Demographic and Housing Characteristics File (DHC) (last revised Sept. 27, 2023).

[11] See 2020 DHC Table P11.