ParserError in RCRAInfo #151
@bl-young, any updates on this front? I'm still getting the ParserError.
So I took a look at the CSV file that is generated. If you provide pandas.read_csv with nrows, it successfully reads the data up to a point. I tried counting the number of lines in the CSV using a basic approach:
>>> from stewi.RCRAInfo import DIR_RCRA_BY_YEAR
>>> report_year = 2017
>>> filepath = DIR_RCRA_BY_YEAR.joinpath(f'br_reporting_{str(report_year)}.csv')
>>> with open(filepath, 'r') as f:
...     count = sum(1 for _ in f)
...
>>> print(count)
2119285
I can open this in pandas.
>>> import pandas as pd
>>> from stewi.RCRAInfo import RCRA_DATA_PATH
>>> fieldstokeep = pd.read_csv(RCRA_DATA_PATH.joinpath('RCRA_required_fields.txt'),
... header=None)
>>> df = pd.read_csv(filepath, header=0, usecols=list(fieldstokeep[0]),
... low_memory=False, on_bad_lines='skip', encoding='ISO-8859-1', sep=",",
... nrows=2119285)
>>> df.head()
Handler ID State ... Generation Tons Waste Code Group
0 AK0000384040 AK ... 12.25 K171
1 AK0000384040 AK ... 0.20 K171
2 AK0000384040 AK ... 0.40 K050
3 AK0000384040 AK ... 1.50 K050
4 AK0000384040 AK ... 0.05 K050
>>> df.tail(1).to_dict()
{'Handler ID': {2119284: 'IDD073114654'},
'State': {2119284: 'ID'},
'Handler Name': {2119284: 'US ECOLOGY IDAHO INC SITE B'},
'Location Street Number': {2119284: '20400'},
'Location Street 1': {2119284: 'LEMLEY RD'},
'Location Street 2': {2119284: nan},
'Location City': {2119284: 'GRAND VIEW'},
'Location State': {2119284: 'ID'},
'Location Zip': {2119284: '83624'},
'County Name': {2119284: 'OWYHEE'},
'Generator ID Included in NBR': {2119284: 'Y'},
'Generator Waste Stream Included in NBR': {2119284: 'N'},
'Waste Description': {2119284: '43435-0'},
'Primary NAICS': {2119284: nan},
'Source Code': {2119284: nan},
'Form Code': {2119284: nan},
'Management Method': {2119284: nan},
'Federal Waste Flag': {2119284: nan},
'Generation Tons': {2119284: nan},
'Waste Code Group': {2119284: nan}}
I'm not certain this count is accurate because I was able to read more than that with pandas.
>>> df = pd.read_csv(filepath, header=0, usecols=list(fieldstokeep[0]),
... low_memory=False, on_bad_lines='skip', encoding='ISO-8859-1', sep=",",
... nrows=2367000)
>>> df.tail(1).to_dict()
{'Handler ID': {2366999: 'IDD073114654'},
'State': {2366999: 'ID'},
'Handler Name': {2366999: 'US ECOLOGY IDAHO INC SITE B'},
'Location Street Number': {2366999: '20400'},
'Location Street 1': {2366999: 'LEMLEY RD'},
'Location Street 2': {2366999: nan},
'Location City': {2366999: 'GRAND VIEW'},
'Location State': {2366999: 'ID'},
'Location Zip': {2366999: '83624'},
'County Name': {2366999: 'OWYHEE'},
'Generator ID Included in NBR': {2366999: 'Y'},
'Generator Waste Stream Included in NBR': {2366999: 'N'},
'Waste Description': {2366999: '43435-0'},
'Primary NAICS': {2366999: nan},
'Source Code': {2366999: nan},
'Form Code': {2366999: nan},
'Management Method': {2366999: nan},
'Federal Waste Flag': {2366999: nan},
'Generation Tons': {2366999: nan},
'Waste Code Group': {2366999: nan}}
Not sure where the upper limit is for nrows, and not sure what happens when nrows exceeds the number of rows in the file.
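On the nrows question: as far as I know, pandas treats nrows as an upper bound, so asking for more rows than the file holds returns every available row without raising. A minimal sketch on a made-up in-memory CSV (the sample data and variable names are illustrative, not taken from the RCRAInfo file):

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for the much larger RCRAInfo file.
csv_text = "a,b\n1,2\n3,4\n5,6\n"

# Request far more rows than exist; nrows acts as a cap, not a requirement.
df = pd.read_csv(io.StringIO(csv_text), nrows=1000)

n_rows_read = len(df)
print(n_rows_read)
```

If that holds for the real file, reading with a very large nrows and checking len(df) would give the number of rows pandas can actually parse, which can then be compared against the raw line count.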
No, I have not had a chance to look closely yet. These ParserErrors can be tricky to track down. For consistency, and in the meantime, I would recommend using the already processed versions, such as via
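One possible way to narrow down where such an error comes from, sketched here with only the standard-library csv module: count parsed records separately from physical lines, and flag any record whose field count differs from the header. The helper name find_irregular_rows and the sample data are hypothetical; this is not how stewi processes the file.

```python
import csv
import io


def find_irregular_rows(handle, delimiter=","):
    """Return (record_count, irregular) for an open CSV text handle.

    irregular is a list of (line_number, field_count) for each record whose
    field count differs from the header's. line_number is the physical line
    on which the record ends.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    expected = len(header)
    n_records = 0
    irregular = []
    for row in reader:
        n_records += 1
        if len(row) != expected:
            irregular.append((reader.line_num, len(row)))
    return n_records, irregular


# Made-up sample: one quoted field contains a newline, one row has an extra field.
sample = (
    "id,name,note\n"
    '1,alpha,"spans\ntwo lines"\n'
    "2,beta,ok,extra\n"
    "3,gamma,fine\n"
)

# Physical lines, counted the same way as sum(1 for _ in f).
physical_lines = sum(1 for _ in io.StringIO(sample))
n_records, irregular = find_irregular_rows(io.StringIO(sample))
print(physical_lines, n_records, irregular)
```

For a file on disk, the handle would be opened with newline="" and the same encoding passed to pandas (ISO-8859-1 in the calls above). A quoted field that contains a newline makes the physical line count differ from the record count, which is one reason a plain line count may not match what a CSV parser reports.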
Yep. That seems to work! Thanks again for supporting the daisy chain of kwargs down through stewicombo to getInventory.
Originally posted by @dt-woods in #146 (comment)