Only parse Schedule A itemizations #45

Open
NickCrews opened this issue Aug 16, 2022 · 4 comments
Comments

@NickCrews

NickCrews commented Aug 16, 2022

Hi! Thanks for this great utility.

I only care about the Schedule A itemizations. In some multi-gigabyte .fec files, the non-Schedule A entries can take up more than half of the file, which really slows down parsing.

Can we add some options to only parse particular itemizations?

In the meantime, I do this. Do you see any problems with it? For example, are Schedule A itemizations always going to come before other schedules?

# filter_fec.sh

# We only want the individual contributions from an FEC file. We don't want
# the other itemizations; they can be gigabytes in size and slow down parsing.

# From the FEC file format documentation:

# The first record of every electronic file that is submitted to the FEC must be an
# HDR record that precedes the main body of the ASCII CSV (comma separated values) data.
# The second record will be a "cover" record for the particular filing, (for example,
# a F3 or and F3X record for a FEC-3 or FEC-3X electronic report). An unlimited number
# of Schedule records (examples: SA, SB, SC/ ...) can follow the first two records of
# an FEC electronic report file. (Electronic files are usually assigned the file
# suffix ".fec".)

# So as soon as we see a line starting with "SB", "SC", or "SD", we stop.
# From https://stackoverflow.com/a/8940829/5156887
awk '/^(SB|SC|SD)/ { exit } { print }'

and use it as: curl https://docquery.fec.gov/dcdev/posted/13360.fec | filter_fec.sh | fastfec 13360

@freedmand
Contributor

Hi @NickCrews, thanks for the question. I agree this is an important and useful feature to add. I'll see how easy it would be to add a flag that takes a regex for filtering form types. Would something like --form-filter make sense as a flag name?

@freedmand
Contributor

> In the meantime, I do this. Do you see any problems with it? For example, are Schedule A itemizations always going to come before other schedules?

I think that will mostly work, but I have observed out-of-order forms in the past (very rare). @chriszs may have more insight.

@chriszs
Contributor

chriszs commented Aug 21, 2022

Dylan's correct. Order is not guaranteed, though files are often ordered that way. For multi-gigabyte files, the limiting factor tends to be download speed. Filtering form types in FastFEC would speed up parsing, but it wouldn't bail halfway through the download as this script does, so it wouldn't have much of an impact on the overall time. Speeding up the download using aria2c -x 4 and then filtering to ^SA might be safer and more effective.
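A rough sketch of what filtering to ^SA could look like, rather than stopping at the first non-SA line. It assumes the header and cover record sit on the first two lines, as the format notes quoted in the original post describe, and that form types are uppercase at the start of each line. The function name filter_schedule_a is made up for illustration and is not part of FastFEC.

```shell
#!/bin/sh
# Sketch only: keep the first two records (HDR and cover record) plus every
# line starting with "SA", wherever it appears in the file.
# Assumptions: the header occupies exactly two lines and form types are
# uppercase. Every other record type is dropped.
filter_schedule_a() {
  awk 'NR <= 2 || /^SA/'
}
```

With a file already downloaded (for example via aria2c -x 4), something like `filter_schedule_a < 13360.fec | fastfec 13360` would mirror the pipeline from the original post. Because it never exits early, out-of-order Schedule A lines are still kept, at the cost of reading the whole file.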

@NickCrews
Author

Thank you for the responses. It makes sense that we can't rely on order, darn. And I can see how, if we need to download the whole file anyway, skipping parsing won't gain much speed. I guess it would save some disk space. So this isn't a must-have for me; if you aren't interested in supporting it, I wouldn't be heartbroken.

I would say that I might prefer explicit table names instead of a regex, since there aren't that many options. (Unless I'm wrong and there are a lot?)
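To illustrate the explicit-names idea outside of FastFEC itself, here is a minimal sketch of a filter that takes schedule names as arguments. The function name filter_forms is hypothetical, and it again assumes the header and cover record are the first two lines.

```shell
#!/bin/sh
# Sketch only: `filter_forms SA SB` keeps the first two records plus any line
# that starts with one of the given names. Matching is by prefix, so "SC"
# also keeps any form type that merely begins with "SC".
filter_forms() {
  [ "$#" -gt 0 ] || { echo "usage: filter_forms NAME..." >&2; return 2; }
  # Join the arguments into an alternation such as "SA|SB".
  names=$(IFS='|'; printf '%s' "$*")
  awk -v pat="^(${names})" 'NR <= 2 || $0 ~ pat'
}
```

The names still end up in a regex internally, but the caller only has to list the schedules they want.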
