Questions about extremely large C-TAM imputed benefits in the CPS data #281

Closed
martinholmer opened this issue Sep 3, 2018 · 10 comments

Comments

@martinholmer
Contributor

@andersonfrailey and @Amy-Xu, now that taxdata pull requests #178 (fix TANF values), #185 (use Medicare and Medicaid actuarial values) and #278 (ignore veterans benefits in the distribution of other benefits to filing units) have been merged over the past month or so, I've spent some time looking at filing units that have what seem to me to be extremely large benefits.

I've found CPS records to look at by using a two-step process. First, the Python script below is used to find RECID values for filing units that have large benefits in the cps.csv.gz file. Second, the non-zero variables in each of those records are produced using the csv_show.sh bash script, which is part of the Tax-Calculator repository.

One filing unit found in this way is shown in my recent comment on taxdata pull request #135. That filing unit has an imputed TANF benefit of about $136,000 even though the taxpayer and spouse have combined earnings of over one million dollars.

Looking at the filing units with large tanf_ben and vet_ben values raises a question about how the CPS filing units are constructed. Among those with extremely large tanf_ben and vet_ben values are two different groups of fifteen records, all of which appear to have exactly the same demographics and earnings (but different unearned incomes) and exactly the same large benefit. What's going on here? Why do the CPS data include these nearly identical records? Why are there fifteen near replicates? Where are the fifteen near replicates created in the code?
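
One way to surface such near-replicate groups is to group records on the fields that are supposed to be identical and keep only the large groups. The following is a minimal sketch, not code from taxdata or C-TAM: the helper name `replicate_groups`, the choice of key columns (`age_head`, `e00200`, `tanf_ben`) and the toy data are all assumptions made for illustration.

```python
import pandas as pd


def replicate_groups(df, key_cols, min_size=15):
    """Return sizes of groups of records sharing identical values in key_cols.

    Hypothetical helper: only groups with at least min_size records are kept.
    """
    sizes = df.groupby(key_cols).size()
    return sizes[sizes >= min_size]


# Tiny synthetic stand-in for cps.csv.gz; every value here is made up.
toy = pd.DataFrame({
    'RECID': list(range(1, 19)),
    'age_head': [50] * 15 + [40, 41, 42],
    'e00200': [1000000] * 15 + [30000, 45000, 52000],
    'tanf_ben': [136000] * 15 + [0, 0, 0],
})

groups = replicate_groups(toy, ['age_head', 'e00200', 'tanf_ben'])
print(groups)
```

Against the real file, the same call would be made on the DataFrame read from cps.csv.gz, with key columns chosen to match whatever is meant by "same demographics and earnings".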

But quite apart from the groups of fifteen nearly identical filing units, I don't understand how people with high incomes can be thought to be getting TANF benefits. Is that imputation being done in C-TAM code or in taxdata code?
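
A quick way to see how common this pattern is would be to list filing units that combine a positive benefit with high earnings. This is only a sketch under assumptions: the helper name `high_earner_recipients`, the use of `e00200` as the earnings column, the $250,000 cutoff and the toy data are all illustrative choices, not anything taken from the C-TAM or taxdata code.

```python
import pandas as pd


def high_earner_recipients(df, benefit_col, earnings_col='e00200',
                           min_earnings=250000):
    """Return records with a positive benefit and earnings >= min_earnings.

    Hypothetical helper; the default cutoff is an arbitrary illustration.
    """
    mask = (df[benefit_col] > 0) & (df[earnings_col] >= min_earnings)
    return df.loc[mask, ['RECID', earnings_col, benefit_col]]


# Synthetic records; every value here is made up.
toy = pd.DataFrame({
    'RECID': [1, 2, 3, 4],
    'e00200': [1000000, 20000, 400000, 15000],
    'tanf_ben': [136000, 3000, 0, 0],
})

flagged = high_earner_recipients(toy, 'tanf_ben')
print(flagged)
```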

The one filing unit (represented by fifteen near replicates) with a very large vet_ben value could plausibly be a retired three-star general with somewhere around 35 years of service as @feenberg suggested in C-TAM issue 73. The taxpayer is 57 years old and has a vet_ben value of $169,920. That amount includes our estimate of the actuarial value of access to the VA hospital system, which is about $9,890. So, the amount of what seems to be a pension for military service is roughly $160,000 per year.

But it seems to me that including military retirement pensions in vet_ben is incorrect because as taxable income they should be added to the e01700 variable.

And the fact that vet_ben seems to include largely military (retirement or disability) pensions and retiree medical benefits raises another question in my mind. Given that vet_ben is largely deferred compensation for those who served in the military, why would this kind of income ever be considered for repeal as part of a UBI reform? If these benefits are thought to be "welfare" (rather than deferred compensation), why didn't the C-TAM project include the pension benefits and health insurance benefits accruing to retired federal (or state and local) government employees? If retired government employees were not a focus in the C-TAM work because they are getting not "welfare" but deferred compensation, then why was the deferred compensation of those with military service included in the scope of the C-TAM project?

Now the details. First, the Python script called bentab.py:

from __future__ import print_function
import numpy as np
import pandas as pd

data = pd.read_csv('cps.csv.gz')
print('num_filing_units:', data.shape[0])

def big_recids(big):
    """Print the RECID of every filing unit in the big DataFrame."""
    print('   RECIDs:')
    for rid in big['RECID'].tolist():
        print('     ', rid)

big = data[data['XTOT'] >= 14]
print('num_with_XTOT>=14:', big.shape[0])
big_recids(big)

big = data[data['ssi_ben'] >= 48000]
print('num_with_ssi>=$48K:', big.shape[0])
big_recids(big)

big = data[data['tanf_ben'] >= 120000]
print('num_with_tanf>=$120K:', big.shape[0])
big_recids(big)

big = data[data['vet_ben'] >= 156000]
print('num_with_vet>=$156K:', big.shape[0])
big_recids(big)

And now the output from that Python script:

taxdata/cps_data$ python bentab.py
num_filing_units: 456465
num_with_XTOT>=14: 3
   RECIDs:
      83778
      422382
      434766
num_with_ssi>=$48K: 2
   RECIDs:
      280113
      403676
num_with_tanf>=$120K: 20
   RECIDs:
      76509
      76510
      76511
      76512
      76513
      76514
      76515
      76516
      76517
      76518
      76519
      76520
      76521
      76522
      76523
      135454
      191624
      311549
      312738
      315578
num_with_vet>=$156K: 15
   RECIDs:
      119232
      119233
      119234
      119235
      119236
      119237
      119238
      119239
      119240
      119241
      119242
      119243
      119244
      119245
      119246
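
The blocks of fifteen can be picked out of the RECID lists above mechanically by collapsing consecutive values into runs. The sketch below uses only the RECIDs printed in the output; the helper name `consecutive_runs` is an assumption introduced here for illustration.

```python
def consecutive_runs(recids):
    """Collapse integer RECIDs into (first, last, count) runs of consecutive values."""
    runs = []
    for rid in sorted(recids):
        if runs and rid == runs[-1][1] + 1:
            # Extend the current run by one record.
            first, _, count = runs[-1]
            runs[-1] = (first, rid, count + 1)
        else:
            # Start a new run of length one.
            runs.append((rid, rid, 1))
    return runs


# RECIDs copied from the bentab.py output above.
tanf_recids = list(range(76509, 76524)) + [135454, 191624, 311549,
                                           312738, 315578]
vet_recids = list(range(119232, 119247))

tanf_runs = consecutive_runs(tanf_recids)
vet_runs = consecutive_runs(vet_recids)
print(tanf_runs)
print(vet_runs)
```

Fifteen of the twenty tanf_ben records and all fifteen of the vet_ben records fall into a single consecutive block each.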

@MattHJensen @MaxGhenis

@feenberg

feenberg commented Sep 4, 2018 via email

@andersonfrailey
Collaborator

@martinholmer, the duplicate records relate back to how we currently handle top coding. If a record is flagged for top coding then fifteen copies are made each with the same information except in the top coded values. See here.
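
To make the description concrete, here is a rough Python sketch of what "fifteen copies with the same information except in the top coded values" could look like. It is not a translation of the linked SAS code: the helper name `split_topcoded`, the choice of `e00300` as the top-coded column, the replacement values and the RECID numbering are all assumptions, and the sketch does not address how sample weights are handled.

```python
import pandas as pd

N_COPIES = 15  # number of copies per flagged record, as stated in this thread


def split_topcoded(record, topcoded_col, replacement_values, first_recid):
    """Return N_COPIES copies of record differing only in topcoded_col and RECID.

    Hypothetical helper: how replacement values are chosen is not described
    in this thread, so they are passed in by the caller.
    """
    if len(replacement_values) != N_COPIES:
        raise ValueError('need exactly %d replacement values' % N_COPIES)
    copies = pd.DataFrame([record] * N_COPIES)
    copies[topcoded_col] = list(replacement_values)
    copies['RECID'] = list(range(first_recid, first_recid + N_COPIES))
    return copies.reset_index(drop=True)


# Made-up record and made-up replacement values, for illustration only.
record = {'RECID': 0, 'age_head': 50, 'e00200': 400000, 'e00300': 99999}
values = [100000 + 10000 * i for i in range(N_COPIES)]
split = split_topcoded(record, 'e00300', values, first_recid=1001)
print(split[['RECID', 'age_head', 'e00200', 'e00300']])
```

Records produced this way have consecutive RECIDs and identical values in every column except the top-coded one, which is the pattern described earlier in this issue.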

These imputations are all being done in C-TAM and are then merged with TaxData.

@martinholmer
Contributor Author

@feenberg said:

I agree that veterans benefits are deferred comp and not something that could be eliminated by UBI.

Dan, thanks for your thoughts on taxdata issue #281.

@martinholmer
Contributor Author

@andersonfrailey said in taxdata issue #281:

the duplicate records relate back to how we currently handle top coding. If a record is flagged for top coding then fifteen copies are made each with the same information except in the top coded values. See here.

Thanks for the link to the SAS code that splits CPS records with top-coded values into fifteen near-replicate records. I'll look over that code soon, although I have no SAS programming experience.

Am I correct in understanding that we now think splitting these records is unnecessary, and that this splitting will be eliminated sometime after you translate the CPS-creation SAS code into Python code? That's my understanding of taxdata issues #253 and #174. Am I thinking about this correctly?

@martinholmer
Contributor Author

@andersonfrailey said in taxdata issue #281:

These imputations [of tanf_ben and vet_ben] are all being done in C-TAM and are then merged with TaxData.

OK, thanks for the information. I'll try looking at the C-TAM code to see if these high-benefit values are taken straight off the raw CPS files or if they are somehow imputed in the C-TAM code.

@Amy-Xu
Member

Amy-Xu commented Sep 5, 2018

@martinholmer Sorry for the delayed reply -- I'm still in the middle of a chaotic move. I looked into the case you brought up in issue #135, specifically the household with ID (h_seq) 41675. It is indeed a family with imputed TANF receipt.

In the original CPS, this family has four members with a total wage income of $400,000, while the maximum wage income of CPS TANF recipients, prior to imputation, is $330,000. In other words, there are two separate issues here. The first is that, as you point out, some benefits are imputed to high-income families. The second is that some families get moved up the income ladder during the tax-unit creation process. To put it another way, the wage distribution of the raw CPS is not the same as that of the CPS tax units.

The second issue is something we have been aware of for a while, but I don't remember whether we found a clear answer. Maybe Anderson can recall better. Back to the case you brought up: $400,000 is in the high-income range but still well below a million.

The first issue, from my perspective, is intrinsically rooted in the algorithm we use -- we tend to replicate whatever distribution the CPS has at the beginning. In the example of TANF, the proxy we use (paw_val) has high-income recipients, and then we have similar high-income recipients as well. This high-income recipient thing is not unique to TANF. If I remember correctly, SSI and SNAP both have high income recipients in the raw CPS, which is the result after the Census Bureau's data cleaning before they release the datasets.

John has offered a scenario where a person could become unemployed for some periods during the year, and thus become eligible for some programs. This is certainly not a full explanation. When I was doing the imputation for SSI, I took out all the high-income recipients, as you can see in the documentation.
[Screenshot from the documentation; image not available]
But I wasn't sure whether that was proper since this is something quite prevalent across all the programs.

@martinholmer
Contributor Author

@Amy-Xu said in taxdata issue #281:

I looked into the case you brought up in issue #135, specifically the household with ID (h_seq) 41675. It is indeed a family with imputed TANF receipt.

In the original CPS, this family has four members with a total wage income of $400,000, while the maximum wage income of CPS TANF recipients, prior to imputation, is $330,000. In other words, there are two separate issues here. The first is that, as you point out, some benefits are imputed to high-income families. The second is that some families get moved up the income ladder during the tax-unit creation process. To put it another way, the wage distribution of the raw CPS is not the same as that of the CPS tax units.

The second issue is something we have been aware of for a while, but I don't remember whether we found a clear answer. Maybe Anderson can recall better. Back to the case you brought up: $400,000 is in the high-income range but still well below a million.

My understanding (via taxdata issues #174 and #253) is that this second issue is a bug in the CPS data creation logic that will be addressed after the programs that create the CPS tax filing units are converted from SAS to Python.

@Amy-Xu continued:

The first issue, from my perspective, is intrinsically rooted in the algorithm we use -- we tend to replicate whatever distribution the CPS has at the beginning. In the example of TANF, the proxy we use (paw_val) has high-income recipients, and then we have similar high-income recipients as well. This high-income recipient thing is not unique to TANF. If I remember correctly, SSI and SNAP both have high income recipients in the raw CPS, which is the result after the Census Bureau's data cleaning before they release the datasets.

John has offered a scenario where a person could become unemployed for some periods during the year, and thus become eligible for some programs. This is certainly not a full explanation. When I was doing the imputation for SSI, I took out all the high-income recipients, as you can see in the documentation.

You're correct that this is a complex topic. Benefit programs vary in their "filing unit" (which people's circumstances are considered in calculating a benefit) and in their "accounting period" (what period of time is considered in determining the filing unit's circumstances). Many benefit programs (certainly SNAP) have monthly accounting periods, so the point John made about people being low-income in some months of the year is relevant.
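
A small sketch of the accounting-period point: a household can have a fairly high annual income and still fall under a monthly limit in some months. The helper name `eligible_months`, the monthly income pattern and the $1,500 monthly limit are all hypothetical and are not taken from any program's actual rules.

```python
def eligible_months(monthly_income, monthly_limit):
    """Count months in which income is at or below a monthly eligibility limit."""
    return sum(1 for income in monthly_income if income <= monthly_limit)


# Hypothetical household: no earnings for three months, then $10,000 a month.
monthly_income = [0, 0, 0] + [10000] * 9
annual_income = sum(monthly_income)
months = eligible_months(monthly_income, monthly_limit=1500)
print('annual income:', annual_income)
print('months under the monthly limit:', months)
```

With an annual accounting period this household would look like a $90,000 earner; with a monthly one it is under the (made-up) limit for three months.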

Maybe the family we're looking at in #135 is low income in some months of the year, but with earnings of $400,000 (before the top-coding logic makes it $1,000,000) for the year, it is hard to believe they were low-income for much (if any) of the year. But the biggest question about the family in #135 is how they can get a TANF benefit of $136,000 for the year. If they were low-income for just a few months, then the TANF benefit that they received during those few months was enormous. So, for example, if they received TANF benefits for only two months, the annualized rate of the benefit was six times $136,000, or over $800,000 per year. I don't find that believable.
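
The arithmetic generalizes to any number of months of receipt: the annualized rate is the yearly benefit times twelve divided by the number of months in which it was received. A minimal sketch, with the helper name `annualized_benefit_rate` introduced here only for illustration:

```python
def annualized_benefit_rate(annual_benefit, months_received):
    """Annual rate implied by receiving annual_benefit over months_received months."""
    if not 1 <= months_received <= 12:
        raise ValueError('months_received must be between 1 and 12')
    return annual_benefit * 12 / months_received


# The $136,000 TANF benefit discussed above, for several assumed durations.
for months in (1, 2, 3, 6, 12):
    print(months, annualized_benefit_rate(136000, months))

rate_two_months = annualized_benefit_rate(136000, 2)
```

For two months of receipt the implied rate is $816,000 per year, which matches the "over $800,000" figure.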

A final question about your statement: "the [TANF] proxy we use (paw_val)". What does paw_val measure in the raw CPS data?

Thanks for all the explanation. This kind of information needs to be prominent in the C-TAM repository.

Did you ever consider using SIPP data to get at this monthly issue? If my memory is correct, SIPP has information about monthly benefit receipt.

@feenberg

feenberg commented Sep 5, 2018 via email

@Amy-Xu
Member

Amy-Xu commented Sep 8, 2018

@martinholmer paw_val is described as public assistance or welfare in the CPS codebook. Even though this variable includes many programs, there is a related question (paw_typ) asking survey participants to distinguish TANF participation from the rest. Please see more details down below, copied from the general documentation.
[Screenshot from the general documentation; image not available]

In terms of the benefits received, I do agree it looks way too large. A few months ago, I was 'fixing' the TANF imputation because of this issue without contemplating the potential outcomes much. Looking at the extra-large benefit amount, I think it is at least partially due to non-cash benefits. Before the fix, the imputation only included the so-called 'assistance' portion of TANF, which probably includes both cash and non-cash benefits already. After the fix, most of the added benefits, I'm afraid, are not cash.

Originally this UBI project aimed at imputing cash benefits only, because that's how the MTRs (benefit reduction rates) come into play. But later on it seems the cash-only focus quietly faded away as we added programs like housing. In the case of TANF, the imputed cash and non-cash benefits could get really confusing and possibly misleading. One of the difficulties in TANF imputation has been separating cash from non-cash benefits. Now it seems quite debatable whether these non-cash benefits should be assigned to participants, even if cash and non-cash can somehow be separated.

@Amy-Xu
Member

Amy-Xu commented Sep 8, 2018

@martinholmer also suggested:

This kind of information needs to be prominent in the C-TAM repository.

Agreed. I was hesitant mostly because a good portion of the explanation is just speculation about what's going on in the raw CPS. And frankly, I have found no definitive answer to date. What do you think is the best way to highlight this part of the information?
