Refactor many things, fix extra/missing commas #58

NickCrews · 2023-04-07T00:41:46Z

I wanted to tackle #24, but I got distracted on the way ;) Now this thing is huge, sorry about that. If it's intimidating, take solace in the many tests which I added.

If this is just way too much to review, then I can try to split this up into more manageable pieces. I learned a lot as I went along, so if I re-do it then it should be much more cohesive. But that would be a solid day or two of work for me, so I'd love to avoid it if people are feeling heroic.

PLEASE, anyone leave some reviews.

Overhaul tests so that we have a lot more test cases. They are now defined in a domain specific language, and it's now easier to add a new test case. The actual contents of the results are tested now, which was sorta absurd we weren't testing that before.
Tries to simplify the overuse of *_CONTEXTs everywhere. When an FEC_CONTEXT is passed to a function, with all of its internal state, it's impossible to know what that function is using in the context, and what it could possibly be modifying. It's just one giant global state. I tried to make as many things pure functions as possible.
Improve csv parsing so there is a clear parsing API, and it's way simpler. CSV_FIELD is a first class citizen now.
Pull out lots of little clutter, such as string utils and path utils, into their own files, so that the actual using files are less cluttered.
Store constant regexes statically, so they don't have to be stored inside PERSISTENT_MEMORY. Much simpler to understand their lifecycle.
Extract mappings stuff into mappings.c so there is a clear, well-defined API that returns a FORM_SCHEMA object.

Also adds some tests for this

By using static variables inside helper functions, these regexes will only get compiled/allocated once. Since this only happens once, we don't need to worry about freeing the memory after use. This greatly simplifies init/cleanup of the context struct. While I was at it, make the helper functions more readable at what they are doing.

Also: - remove passing "extension" around excessively in fec.c - stop passing around the full FEC_CONTEXT excessively when you only need access to a single line. - Capitalize CSV_EXTENSION since it's a constant

After I did all that refactoring, I now realize we don't even use this info at all. As we read each line, we see if it contains any asci28. If so, then we treat that as the delimiter. Otherwise default to comma.

It really is a private method of PARSE_CONTEXT, we don't care about the implementation in fec.c

This is an implementation detail of the csv parser, we don't want to have to deal with it.

Another implementation detail of the csv reader.

We never need to share one FIELD_INFO between multiple places, and the lifetimes make sense to be tied to each other. This makes the API and lifecycle management simpler.

If I add "-Werror" build flag, then I found a few errors. Ideally I think we should turn on this flag by default to help catch more bugs.

It was weird to store these regexes in the persistentMemory. We only have one version of the mappings per compilation (and runtime) so we can just store them statically. Next up would be to refactor the lookupMappings() function from fec.c to mappings.c While I'm at it, clean up the regex compilation to share the helper function we already have. This involved creating regex.h to hold that shared regexCompile()

tmpdir is deprecated in pytest.

We never used the load kwarg, and the helper function did not need to be a fixture itself.

We want the reader thread to be stopped if the parent thread stops. Before, my pytest runs sometime hung.

Before, the testing was a bit half-baked. We didn't actually check the contents of the output files. We also weren't very thorough, and only checked some APIs on some examples. Now we test all APIs on all examples. Some example .fec files that were tracked in git weren't actually getting tested at all. For example, the 13360.fec file actually produces a segfault! We need to track down the issue and fix it. It is skipped for now. These changes also revealed an inconsistency in howform types are normalized. See `python/tests/filings

freedmand · 2023-04-08T17:00:54Z

Thank you for this large-scale, monumental PR! From a brief skim, looks to be lots of good ideas and practices in here. I'll take a more thorough look and double-check the perf of these changes early next week. 🎉

We didn't need all the info that PARSE_CONTEXT contained.

So to run the "trailing_commas" test, you can do: pytest -m trailing_commas

Now there is a single understandable `FORM_SCHEMA *lookupSchema(const char *version, int versionLength, const char *form, int formLength);` function. The stuff in fec.c just wraps it with logic specific to FEC_CONTEXT

and similarly, freeSchema() to formSchemaFree()

put the parts in order

We might want to write to a different dir and want control of that outside of WRITE_CONTEXT

This is way more readable I think

It was really hard to tell what each represented

This can be inferred from whether or not filingID is NULL, and now that the WRITE_CONTEXT doesn't depend on the filingId, all the filingId processing can happen in the callers of newFecContext()

These "options" should all be near each other.

Because I'm about to add another that has extra fields and I want to make the distinction

when you hit an assert, pytest will show the values of the args of the encolsing function. Before, you only saw the values to dir1 and dir2, but now you can see file1 and file2.

I thought this test indicated a bug in the lib, but actually I just wasn't testing it correctly.

Now if the supplied value can't be coerced to a Path, you get the error when you first call the function, not when the file is opened int he callback.

make them shorter to conform to line legnth. improve wording a bit. dont repeat defaults that are in the funciton signature.

Now we can see the line of which file.

If we run into a field that is supposed to be a float, but it is an empty string, this isn't actually a problem. Before, we were throwing a warning.

It was hard to tell where the stirng ended before, like if it was an empty string.

Fixes washingtonpost#24 Before, we printed exactly what was in the .fec file. If a row had more fields or fewer fields than we expected from the schema, we just printed it as is. This broke downstream tooling, such as loading the .csv's into databases. Now, by default we - pad short rows with empty fields - truncate long rows You can get back to the old behavior by setting the `raw` flag to True. By default it is False. Note that this could be BREAKING. This also adjusts the warnings a bit: BEfore, you got a warning for every extra field in a row. Now you only get one warning per row, and we print out the full row (even though that row is a bit mangled by the csv parser as it removes quotes and delimiters) The tests only currently test the default behavior. A follow up should adjust how we define test cases. Currently, we expect a 1:1 correspondence between an input .fec file and an output. But really we want a 1:N relationship, where one .fec file can generate multiple outputs depending on the options passed. That will require updating our test definition format.

NickCrews · 2023-05-03T16:37:09Z

Seeing as how no one has wanted to tackle this bear in the last 3 weeks (don't blame them) I can shrink this down so it's way less intimidating. But before I do that I'd love a top-level review so I know that 1. I'm going in the right direction and 2. the next version will actually get reviewed and merged. Thanks! @freedmand (or whoever)

freedmand · 2023-05-03T16:47:42Z

More a function of me being busy! I think I can tackle it in one pass but may be a bit before I can get to it. I really appreciate the contribution and am excited/think I can learn some things from your C style :)

NickCrews added 7 commits April 6, 2023 11:58

ref: improve usesAscii28 parsing from version

eb7ebed

Also adds some tests for this

ref: extract ctxWarn() helper

c218b74

ref: add freeSafe() helper

8e0e914

ref: add _compileRegex() helper

413fbcc

ref: split parseHeader() into *Legacy() and *NonLegacy() versions

cb308c2

ref: move string stuff to string_utils.h

a79be13

Also: - remove passing "extension" around excessively in fec.c - stop passing around the full FEC_CONTEXT excessively when you only need access to a single line. - Capitalize CSV_EXTENSION since it's a constant

NickCrews requested a review from a team as a code owner April 7, 2023 00:41

NickCrews added 14 commits April 6, 2023 19:39

ref: remove unused FEC_CONTEXT.useAscii28

22fcc2c

After I did all that refactoring, I now realize we don't even use this info at all. As we read each line, we see if it contains any asci28. If so, then we treat that as the delimiter. Otherwise default to comma.

ref: Move isParseDone() to csv.h and csv.c

1467a7d

It really is a private method of PARSE_CONTEXT, we don't care about the implementation in fec.c

ref: move initParseContext() into csv.h and csv.c

4ef050a

This is an implementation detail of the csv parser, we don't want to have to deal with it.

ref: Move readField() into csv.h

088bf0b

Another implementation detail of the csv reader.

ref: store FIELD_INFO in PARSE_CONTEXT as value, not pointer

98774db

We never need to share one FIELD_INFO between multiple places, and the lifetimes make sense to be tied to each other. This makes the API and lifecycle management simpler.

ref: clean up a few warnings

88c31c7

If I add "-Werror" build flag, then I found a few errors. Ideally I think we should turn on this flag by default to help catch more bugs.

ref: deduplicate testIncludes in build.zig

efe1ff3

ref: format test files with black

2f0d02f

ref: Replace py.path with pathlib.Path in tests

41672ce

tmpdir is deprecated in pytest.

ref: test: simplify getting fixtures

30b27a7

We never used the load kwarg, and the helper function did not need to be a fixture itself.

ref: blacken client.py

c9ef40c

fix: Make reader thread a daemon

706710b

We want the reader thread to be stopped if the parent thread stops. Before, my pytest runs sometime hung.

NickCrews added 7 commits April 8, 2023 09:10

ref: take char*, not PARSE_CONTEXT, as arg to lookupMappings()

7bc8549

We didn't need all the info that PARSE_CONTEXT contained.

ref: move create lookupType() helper in mappings.c

7427a07

test: add a unique marker for each test case

18451c1

So to run the "trailing_commas" test, you can do: pytest -m trailing_commas

ref: Move all form schema lookup logic into mappings.c

c17412e

Now there is a single understandable `FORM_SCHEMA *lookupSchema(const char *version, int versionLength, const char *form, int formLength);` function. The stuff in fec.c just wraps it with logic specific to FEC_CONTEXT

ref: remove unused collectLineInfo() from encoding.h

46db0b4

ref: rename lookupSchema() to formSchemaLookup()

05dccfa

and similarly, freeSchema() to formSchemaFree()

ref: rename lookupMappings() to updateCurrentFormSchema()

6772cdc

NickCrews added 25 commits April 11, 2023 11:02

ref: make path length calculation make more sense

1d7af46

put the parts in order

ref: move path stuff from writer.c to new path.c

df17f90

ref: add pathJoin() to path.h

6461d00

ref: calculate outputDir outside of WRITE_CONTEXT

4f26872

We might want to write to a different dir and want control of that outside of WRITE_CONTEXT

ref: name args when calling newFecContext()

12726ca

ref: one arg per line for newFecContext()

70776ed

This is way more readable I think

ref: label args to newFecContext() in client.py

f7837de

It was really hard to tell what each represented

ref: improve WriteCache in utils.py

91ade4a

ref: simplify decoding bytes in provide_line_callback()

a524dd7

ref: simplify provide_line_callback()

2a790ee

ref: remove includeFilingId from FEC_CONTEXT

1b8e100

This can be inferred from whether or not filingID is NULL, and now that the WRITE_CONTEXT doesn't depend on the filingId, all the filingId processing can happen in the callers of newFecContext()

BREAKING ref: re-order args to newFecContext()

cc2df27

These "options" should all be near each other.

ref: rename test case trailing_commas to too_few_fields

4919d82

Because I'm about to add another that has extra fields and I want to make the distinction

test: rename case 1550126 to slash_form

827f903

test: ref: add helper assert method to better show errors

6005a40

when you hit an assert, pytest will show the values of the args of the encolsing function. Before, you only saw the values to dir1 and dir2, but now you can see file1 and file2.

test: fix slash_form test

7f5d80b

I thought this test indicated a bug in the lib, but actually I just wasn't testing it correctly.

add types to client.py

0d70f9c

feat: Fail earlier on bad output_directory in parse_as_files()

dc72c90

Now if the supplied value can't be coerced to a Path, you get the error when you first call the function, not when the file is opened int he callback.

docs: improve docstrings of python API

70b3431

make them shorter to conform to line legnth. improve wording a bit. dont repeat defaults that are in the funciton signature.

test: better error message on fail

4618d74

Now we can see the line of which file.

ref: improve comparison between numFieldsRead and schema.numFields

5c4c487

ref: clean up newline handling in ctxWarn()

a244b38

fix: Don't warn on an empty float field

56b954c

If we run into a field that is supposed to be a float, but it is an empty string, this isn't actually a problem. Before, we were throwing a warning.

feat: put quotes around strings in warnings.

5e0cd77

It was hard to tell where the stirng ended before, like if it was an empty string.

NickCrews changed the title ~~Refactor many things~~ Refactor many things, fix extra/missing commas Apr 13, 2023

NickCrews mentioned this pull request Apr 13, 2023

Incorrect number of trailing commas when last field(s) are empty #24

Open

Update VERSION

b7eb213

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor many things, fix extra/missing commas #58

Refactor many things, fix extra/missing commas #58

NickCrews commented Apr 7, 2023 •

edited

freedmand commented Apr 8, 2023

NickCrews commented May 3, 2023

freedmand commented May 3, 2023

Refactor many things, fix extra/missing commas #58

Are you sure you want to change the base?

Refactor many things, fix extra/missing commas #58

Conversation

NickCrews commented Apr 7, 2023 • edited

freedmand commented Apr 8, 2023

NickCrews commented May 3, 2023

freedmand commented May 3, 2023

NickCrews commented Apr 7, 2023 •

edited