Incorrect number of trailing commas when last field(s) are empty #24

Open · afischer opened this issue Jul 12, 2022 · 9 comments

@afischer

FastFEC's export seems to be missing one or more trailing commas on lines that have empty fields at the end of the row.

Using the Homebrew version of fastfec on an M1 MacBook Pro running macOS Monterey 12.4.

For example, you can reproduce this by running fastfec 876050 fastfec_output/ and checking header.csv (there should be an additional trailing comma after report_number 002), SB28A.csv (42 fields in line items vs. 43 in the header), or SB23.csv (43 fields in line items vs. 44 in the header).

header.csv:

record_type,ef_type,fec_version,soft_name,soft_ver,report_id,report_number,comment
HDR,FEC,8.0,Microsoft Navision 3.60 - AVF Consulting,1.00,FEC-840327,002
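
For a quick check, a small Python sketch like this (paths follow the example run above) surfaces the mismatch by comparing each row's field count against the header:

import csv

# hypothetical path: the header.csv produced by the fastfec run above
with open("fastfec_output/header.csv", newline="") as f:
    rows = list(csv.reader(f))

n_headers = len(rows[0])
for i, row in enumerate(rows[1:], start=2):
    if len(row) != n_headers:
        print(f"line {i}: {len(row)} fields, expected {n_headers}")

For the header.csv above, this prints "line 2: 7 fields, expected 8".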
@freedmand (Contributor)

Hey, thanks for opening an issue! At the time I wrote the bulk of the code, this was the intended behavior. There are some errant filings with incorrect numbers of columns per line. To make the code as accurate as possible, I made it just output as many fields as are observed in the filing (and if there were more columns than headers, the remaining would just be printed for completeness). This quite literally reflects what is actually in the filing itself (and could be useful if you care to, say, detect filings with incorrect numbers of columns).

If you rerun FastFEC with the --warn parameter, e.g. fastfec --warn 876050 fastfec_output/, it will show many warning messages corresponding to these column mismatches.

Whether this is the correct behavior may warrant rethinking. Does having fewer columns break downstream tooling for you? Do you think it's more accurate to put in trailing commas for missing fields (in this case, what should be done for extra fields in a given row)? Maybe we could expose an option to pad missing fields with commas, or make this the default behavior. I'm curious to hear your/others' thoughts!

@chriszs (Contributor) commented Jul 18, 2022

I'd vote for outputting the same number of commas as the header, because I suspect some import processes will choke otherwise. I vaguely remember participating in the discussion Dylan's talking about. WinRed seems to output one more separator than you'd expect, for instance. But I feel like a consistent number of commas would be the correct normalization.

@afischer (Author)

Hey @freedmand, thanks for the detailed response! I would also advocate for padding out to the correct number of commas regardless of the contents of the .fec file. I discovered this when attempting to use the Postgres COPY ... FROM command to easily pull a form into a database.
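
For context, a minimal sketch of that failing load (the table name and connection string are hypothetical; psycopg2 is just one way to issue COPY):

import psycopg2

conn = psycopg2.connect("dbname=fec")  # hypothetical connection
with conn, conn.cursor() as cur, open("fastfec_output/SB28A.csv") as f:
    # Postgres rejects rows with fewer columns than the target table,
    # typically with: ERROR: missing data for column "..."
    cur.copy_expert("COPY sb28a FROM STDIN WITH (FORMAT csv, HEADER)", f)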

@freedmand (Contributor)

Great, that makes sense! In the case of too many fields in a given row (i.e. more fields than header columns), do you think just truncating would make sense, Andrew/Chris? We could make that the default and then expose a raw option to preserve the source filing as closely as possible.

@chriszs (Contributor) commented Jul 18, 2022

I'd maybe draw a distinction between empty excess fields (like WinRed's) and non-empty ones. Empty you can eat, no problem. Non-empty typically means something has gone wrong, and it might be useful to output those or error just so we catch it. I could see an argument for truncation regardless, but even then it should warn.

@freedmand (Contributor)

So, essentially (sketched in code below the list):

  • by default, always output exactly as many fields as there are headers
  • don't warn if a row has too few fields, or too many fields where the extras are all empty
  • warn by default otherwise; have a silent option to suppress warnings
  • have a non-default raw option to output exactly what's in the filing, even if it doesn't match the number of header columns
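
A minimal Python sketch of those proposed defaults (the function name and warning text are my own illustration; FastFEC itself is implemented in C):

import sys

def normalize_row(row, n_headers):
    # Pad short rows with empty fields, silently.
    if len(row) < n_headers:
        return row + [""] * (n_headers - len(row))
    # Truncate long rows; warn only when the dropped extras are non-empty.
    if len(row) > n_headers:
        extras = row[n_headers:]
        if any(field.strip() for field in extras):
            print(f"Warning: dropping non-empty extra fields: {extras}", file=sys.stderr)
        return row[:n_headers]
    return row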

@afischer (Author)

That sounds great. I think having, by default, a set of CSVs with a guaranteed number of fields on every row would be best here, though I agree it makes sense to have a "raw" option as well as warnings where helpful.

NickCrews added a commit to NickCrews/FastFEC that referenced this issue Apr 13, 2023
Fixes washingtonpost#24

Before, we printed exactly what was in the .fec file. If a row had more or fewer fields than the schema led us to expect, we printed it as is. This broke downstream tooling, such as loading the .csv files into databases.

Now, by default we
- pad short rows with empty fields
- truncate long rows
You can get back the old behavior by setting the `raw` flag to True. By default it is False.

Note that this could be BREAKING.

This also adjusts the warnings a bit: before, you got a warning for every extra field in a row. Now you get only one warning per row, and we print out the full row (even though that row is a bit mangled by the CSV parser, as it removes quotes and delimiters).

The tests currently only cover the default behavior. A follow-up should adjust how we define test cases: currently we expect a 1:1 correspondence between an input .fec file and an output, but really we want a 1:N relationship, where one .fec file can generate multiple outputs depending on the options passed. That will require updating our test definition format.
@NickCrews

I have this fixed in #58. Please @chriszs @afischer take a look and give @freedmand some help.

@NickCrews

FYI, in the new version of DuckDB's CSV parser, they added an option that makes the parser more resilient to messy CSV data such as this. I am able to parse these CSVs with their missing trailing commas by using the null_padding option:

import duckdb

# path to one of the FastFEC outputs with short rows, e.g. the SB28A.csv above
p = "fastfec_output/SB28A.csv"

conn = duckdb.connect()
# null_padding=True pads rows that have fewer columns than the header with NULLs
conn.read_csv(p, all_varchar=True, quotechar='"', null_padding=True)

So this issue is a lot less pressing for me now.
