Are the ~18k zero-weight records in puf_weights.csv intentional? #385

Closed
donboyd5 opened this issue May 21, 2021 · 5 comments · Fixed by #387

Comments

@donboyd5

After constructing PUF weights using an alternative solver (#381), I noticed that approximately 18k records, all at the bottom of the file, appear to have zero weights for all years. I did not examine this formally (I just looked at the file with a log viewer), but most or all of these bottom-of-file records appear to have all-zero weights.

Thinking I did something wrong, I looked at puf_weights.csv.gz in taxdata and it has the same thing.

I then looked back at puf_weights.csv.gz from Aug 2020 and it does not appear to have zero-weight records, again based on an informal look with a log viewer.

With that as background, would someone be able to tell me:

  • Is it possible I am doing something wrong and misinterpreting this?
  • If not (if they really are zero-weight records), is this large number of zero-weight records intended?
  • If so, would you mind giving the reason for having zero-weight records?

Many thanks.

@donboyd5
Author

The code below gives:

  • no all-zero-weight records in the Aug 2020 puf_weights.csv
  • 18,382 all-zero-weight records in the puf_weights.csv downloadable from taxdata

image

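import pandas as pd

# Aug 2020 weights file and the current download from taxdata
OLD = '/media/don/data/puf_files/puf_csv_related_files/PSL/2020-08-20/puf_weights.csv'
NEW = '~/Downloads/puf_weights.csv'

old_weights = pd.read_csv(OLD)
new_weights = pd.read_csv(NEW)

# Keep only the rows whose weight is zero in every column (i.e., every year)
old_zero = old_weights.loc[(old_weights == 0).all(axis=1)]
new_zero = new_weights.loc[(new_weights == 0).all(axis=1)]

# Report how many all-zero-weight records each file contains
print(len(old_zero), len(new_zero))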

@andersonfrailey
Collaborator

@donboyd5, I wouldn't say it's intentional; rather, it's a result of the re-weighting process. One difference between the Aug. 2020 version and the new version is that we switched to the solvers in Julia for stage 2. It's possible that the translation introduced a bug that resulted in all of these zero-weight records, or it's possible that this is just what the solver gives us. I'll dig into it a bit and see if I find anything.

@donboyd5
Author

@andersonfrailey, I think it's happening earlier than that, because the LP solver in effect limits each new weight to within roughly ±55-70% (depending on year) of the initial year's weight. So a weight can only end up at zero if it started at zero.
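To illustrate that reasoning with made-up numbers: if each new weight has to stay within a fractional band around its starting weight, the band for a zero starting weight collapses to zero. This is only a sketch of the bound, not the taxdata solver; `weight_bounds` and the 0.6 tolerance are hypothetical, chosen to mimic a limit of roughly ±60%.

```python
import numpy as np

def weight_bounds(initial_weights, tol):
    """Return (lower, upper) bounds when each new weight must stay
    within +/- tol (as a fraction) of its initial weight.

    Hypothetical helper for illustration; not part of taxdata.
    """
    w0 = np.asarray(initial_weights, dtype=float)
    return w0 * (1.0 - tol), w0 * (1.0 + tol)

# Made-up initial weights, including one record that starts at zero
w0 = [250.0, 100.0, 0.0]
lower, upper = weight_bounds(w0, tol=0.6)

for start, lo, hi in zip(w0, lower, upper):
    print(f"initial={start:6.1f}  allowed range=[{lo:6.1f}, {hi:6.1f}]")
```

The first two records keep a strictly positive lower bound, while the third has both bounds equal to zero, so no choice by the solver could move it off zero or push a positive weight down to zero.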

Before the solver is called, in stage2.py, it reads puf from cps-matched-puf.csv:

puf = pd.read_csv(os.path.join(CUR_PATH, "../data/cps-matched-puf.csv"))

and a few lines later has this line:

puf.s006 = puf.matched_weight * 100

puf.s006 then goes on to serve as the initial weight for the solver.

Here is a screenshot of what is in puf.matched_weight right after reading the puf:

image

As you can see, the zero-weights are at the bottom before we ever get to the solver, so it looks like it is occurring somewhere along the line in creating cps-matched-puf.csv.
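A quick way to confirm this without a screenshot is to count the zero starting weights right after the scaling line. The sketch below uses a tiny made-up frame in place of cps-matched-puf.csv; the `record_id` column and all values are assumptions for illustration only.

```python
import pandas as pd

# Made-up stand-in for cps-matched-puf.csv; the real file has many more columns
puf = pd.DataFrame({
    "record_id": [1, 2, 3, 4],          # hypothetical identifier column
    "matched_weight": [12.5, 3.0, 0.0, 0.0],
})

# Same scaling as the stage2.py line quoted above
puf["s006"] = puf["matched_weight"] * 100

# Flag records that would enter the solver with a zero initial weight
zero_start = puf["s006"] == 0
print(f"{zero_start.sum()} of {len(puf)} records start with zero weight")
print(puf.loc[zero_start, "record_id"].tolist())
```

Running the same two checks on the real frame would show whether the zero weights are already present before the solver is called.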

@donboyd5
Author

donboyd5 commented May 22, 2021

I stepped through the code in createpuf.py.

Here are the key lines of code:

image

Everything seemed fine through line 168 - the relevant files all had positive s006 and positive matched_weight.

Here is what I get for data.matched_weight and data.s006 right after line 166:

image

But after running line 169, here is what I get for them:

image

The next line replaces the NaN values with zero, and away we go.

So the problem occurs when we add nonfilers. As you can see, they do not have a matched_weight column:

image

I'm not sure what the fix is (maybe they should be given s006 as their matched_weight?), but that appears to be the problem.
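The mechanism can be reproduced on toy data. The sketch below assumes the nonfilers are stacked onto the filers with something like `pd.concat` followed by `fillna(0)`; all values are made up, and the patch at the end is just the idea floated above, not necessarily the change merged in #387. It also ignores any rescaling that might be needed between s006 and matched_weight.

```python
import pandas as pd

# Made-up filer records: both s006 and matched_weight are present
filers = pd.DataFrame({"s006": [1200.0, 800.0], "matched_weight": [12.0, 8.0]})
# Made-up nonfiler records: no matched_weight column at all
nonfilers = pd.DataFrame({"s006": [500.0, 300.0]})

# Stacking the two frames leaves matched_weight as NaN for the nonfiler rows
data = pd.concat([filers, nonfilers], sort=False, ignore_index=True)
print(data["matched_weight"].isna().sum())

# Filling NaN with zero then turns those rows into zero-weight records
zeroed = data.fillna(0.0)
print((zeroed["matched_weight"] == 0).sum())

# One possible fix (an assumption): give nonfilers their s006 as
# matched_weight before stacking, so nothing is left to fill with zero
patched = pd.concat(
    [filers, nonfilers.assign(matched_weight=nonfilers["s006"])],
    sort=False,
    ignore_index=True,
).fillna(0.0)
print((patched["matched_weight"] == 0).sum())
```

In the patched frame every record keeps a positive matched_weight, which is what the solver would need in order to give the nonfilers nonzero weights.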

@andersonfrailey
Collaborator

Thanks for looking into this, @donboyd5. I think I see what needs to be fixed. I'll get a PR up when I get it worked out.
