ParserError in RCRAInfo #151

Open
bl-young opened this issue Nov 21, 2023 · 4 comments

@bl-young
Collaborator

So I tried accessing other years of RCRAInfo data (2013, 2015, 2017, and 2019). All worked except for one (2017), which produced the following error. I wasn't able to track down the CSV file it keeps crashing on. Maybe there's a debug statement that points to it.
INFO RCRAInfo_2017 not found in ~/stewi/flowbyfacility
INFO requested inventory does not exist in local directory, it will be generated...
INFO file extraction complete
INFO organizing data for BR_REPORTING from 2017...
INFO extracting ~/stewi/RCRAInfo Data Files/BR_REPORTING_2017_0.csv
INFO extracting ~/stewi/RCRAInfo Data Files/BR_REPORTING_2017_1.csv
INFO extracting ~/stewi/RCRAInfo Data Files/BR_REPORTING_2017_2.csv
INFO saving to ~/stewi/RCRAInfo Data Files/RCRAInfo_by_year/br_reporting_2017.csv...
INFO generating inventory files for 2017
---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
Cell In[9], line 1
----> 1 stewi.getInventory('RCRAInfo', 2017)

File ~/Envs/ebm/lib/python3.11/site-packages/stewi/__init__.py:82, in getInventory(inventory_acronym, year, stewiformat, filters, filter_for_LCI, US_States_Only, download_if_missing, keep_sec_cntx)
     66 """Return or generate an inventory in a standard output format.
     67 
     68 :param inventory_acronym: like 'TRI'
   (...)
     79 :return: dataframe with standard fields depending on output format
     80 """
     81 f = ensure_format(stewiformat)
---> 82 inventory = read_inventory(inventory_acronym, year, f,
     83                            download_if_missing)
     85 if (not keep_sec_cntx) and ('Compartment' in inventory):
     86     inventory['Compartment'] = (inventory['Compartment']
     87                                 .str.partition('/')[0])

File ~/Envs/ebm/lib/python3.11/site-packages/stewi/globals.py:268, in read_inventory(inventory_acronym, year, f, download_if_missing)
    265 else:
    266     log.info('requested inventory does not exist in local directory, '
    267              'it will be generated...')
--> 268     generate_inventory(inventory_acronym, year)
    269 inventory = load_preprocessed_output(meta, paths)
    270 if inventory is None:

File ~/Envs/ebm/lib/python3.11/site-packages/stewi/globals.py:313, in generate_inventory(inventory_acronym, year)
    309     RCRAInfo.main(Option = 'A', Year = [year],
    310                   Tables = ['BR_REPORTING', 'HD_LU_WASTE_CODE'])
    311     RCRAInfo.main(Option = 'B', Year = [year],
    312                   Tables = ['BR_REPORTING'])
--> 313     RCRAInfo.main(Option = 'C', Year = [year])
    314 elif inventory_acronym == 'TRI':
    315     import stewi.TRI as TRI

File ~/Envs/ebm/lib/python3.11/site-packages/stewi/RCRAInfo.py:444, in main(**kwargs)
    441     organize_br_reporting_files_by_year(kwargs['Tables'], year)
    443 elif kwargs['Option'] == 'C':
--> 444     Generate_RCRAInfo_files_csv(year)
    446 elif kwargs['Option'] == 'D':
    447     """State totals are compiled from the Trends Analysis website
    448     and stored as csv. New years will be added as data becomes
    449     available"""

File ~/Envs/ebm/lib/python3.11/site-packages/stewi/RCRAInfo.py:219, in Generate_RCRAInfo_files_csv(report_year)
    216 fieldstokeep = pd.read_csv(RCRA_DATA_PATH.joinpath('RCRA_required_fields.txt'),
    217                            header=None)
    218 # on_bad_lines requires pandas >= 1.3
--> 219 df = pd.read_csv(filepath, header=0, usecols=list(fieldstokeep[0]),
    220                  low_memory=False, on_bad_lines='skip',
    221                  encoding='ISO-8859-1')
    223 log.info(f'completed reading {filepath}')
    224 # Checking the Waste Generation Data Health

File ~/Envs/ebm/lib/python3.11/site-packages/pandas/io/parsers/readers.py:948, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
    935 kwds_defaults = _refine_defaults_read(
    936     dialect,
    937     delimiter,
   (...)
    944     dtype_backend=dtype_backend,
    945 )
    946 kwds.update(kwds_defaults)
--> 948 return _read(filepath_or_buffer, kwds)

File ~/Envs/ebm/lib/python3.11/site-packages/pandas/io/parsers/readers.py:617, in _read(filepath_or_buffer, kwds)
    614     return parser
    616 with parser:
--> 617     return parser.read(nrows)

File ~/Envs/ebm/lib/python3.11/site-packages/pandas/io/parsers/readers.py:1748, in TextFileReader.read(self, nrows)
   1741 nrows = validate_integer("nrows", nrows)
   1742 try:
   1743     # error: "ParserBase" has no attribute "read"
   1744     (
   1745         index,
   1746         columns,
   1747         col_dict,
-> 1748     ) = self._engine.read(  # type: ignore[attr-defined]
   1749         nrows
   1750     )
   1751 except Exception:
   1752     self.close()

File ~/Envs/ebm/lib/python3.11/site-packages/pandas/io/parsers/c_parser_wrapper.py:239, in CParserWrapper.read(self, nrows)
    236         data = _concatenate_chunks(chunks)
    238     else:
--> 239         data = self._reader.read(nrows)
    240 except StopIteration:
    241     if self._first_chunk:

File parsers.pyx:825, in pandas._libs.parsers.TextReader.read()

File parsers.pyx:913, in pandas._libs.parsers.TextReader._read_rows()

File parsers.pyx:890, in pandas._libs.parsers.TextReader._check_tokenize_status()

File parsers.pyx:2058, in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

Originally posted by @dt-woods in #146 (comment)

@bl-young added the bug label Nov 21, 2023
@dt-woods

@bl-young, any updates on this front? I'm still getting the ParserError.

@dt-woods

So I took a look at the CSV file that is generated. If you provide pandas.read_csv with nrows, it successfully reads the data up to a point. I counted the number of lines in the CSV using a basic approach:

>>> from stewi.RCRAInfo import DIR_RCRA_BY_YEAR
>>> report_year = 2017
>>> filepath = DIR_RCRA_BY_YEAR.joinpath(f'br_reporting_{str(report_year)}.csv')
>>> with open(filepath, 'r') as f:
...     count = sum(1 for _ in f)
>>> print(count)
2119285

I can open this in pandas.

>>> import pandas as pd
>>> from stewi.RCRAInfo import RCRA_DATA_PATH
>>> fieldstokeep = pd.read_csv(RCRA_DATA_PATH.joinpath('RCRA_required_fields.txt'),
...                            header=None)
>>> df = pd.read_csv(filepath, header=0, usecols=list(fieldstokeep[0]),
...     low_memory=False, on_bad_lines='skip', encoding='ISO-8859-1', sep=",",
...     nrows=2119285)
>>> df.head()
     Handler ID State  ... Generation Tons Waste Code Group
0  AK0000384040    AK  ...           12.25             K171
1  AK0000384040    AK  ...            0.20             K171
2  AK0000384040    AK  ...            0.40             K050
3  AK0000384040    AK  ...            1.50             K050
4  AK0000384040    AK  ...            0.05             K050
>>> df.tail(1).to_dict()
{'Handler ID': {2119284: 'IDD073114654'},
 'State': {2119284: 'ID'},
 'Handler Name': {2119284: 'US ECOLOGY IDAHO INC SITE B'},
 'Location Street Number': {2119284: '20400'},
 'Location Street 1': {2119284: 'LEMLEY RD'},
 'Location Street 2': {2119284: nan},
 'Location City': {2119284: 'GRAND VIEW'},
 'Location State': {2119284: 'ID'},
 'Location Zip': {2119284: '83624'},
 'County Name': {2119284: 'OWYHEE'},
 'Generator ID Included in NBR': {2119284: 'Y'},
 'Generator Waste Stream Included in NBR': {2119284: 'N'},
 'Waste Description': {2119284: '43435-0'},
 'Primary NAICS': {2119284: nan},
 'Source Code': {2119284: nan},
 'Form Code': {2119284: nan},
 'Management Method': {2119284: nan},
 'Federal Waste Flag': {2119284: nan},
 'Generation Tons': {2119284: nan},
 'Waste Code Group': {2119284: nan}}

I'm not certain this count is accurate, because I was able to read more rows than that with pandas.
I can go higher!

>>> df = pd.read_csv(filepath, header=0, usecols=list(fieldstokeep[0]),
...     low_memory=False, on_bad_lines='skip', encoding='ISO-8859-1', sep=",",
...     nrows=2367000)
>>> df.tail(1).to_dict()
{'Handler ID': {2366999: 'IDD073114654'},
 'State': {2366999: 'ID'},
 'Handler Name': {2366999: 'US ECOLOGY IDAHO INC SITE B'},
 'Location Street Number': {2366999: '20400'},
 'Location Street 1': {2366999: 'LEMLEY RD'},
 'Location Street 2': {2366999: nan},
 'Location City': {2366999: 'GRAND VIEW'},
 'Location State': {2366999: 'ID'},
 'Location Zip': {2366999: '83624'},
 'County Name': {2366999: 'OWYHEE'},
 'Generator ID Included in NBR': {2366999: 'Y'},
 'Generator Waste Stream Included in NBR': {2366999: 'N'},
 'Waste Description': {2366999: '43435-0'},
 'Primary NAICS': {2366999: nan},
 'Source Code': {2366999: nan},
 'Form Code': {2366999: nan},
 'Management Method': {2366999: nan},
 'Federal Waste Flag': {2366999: nan},
 'Generation Tons': {2366999: nan},
 'Waste Code Group': {2366999: nan}}

Not sure where the upper limit is for nrows, or what happens once nrows goes past the rows the parser can actually read.
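
In case it helps narrow things down, here's a rough sketch (not stewi code; it reuses filepath and fieldstokeep from above, and the 100,000-row chunk size is arbitrary) that reads the file in chunks and reports roughly how many rows parse before the C parser raises:

>>> import pandas as pd
>>> from pandas.errors import ParserError
>>> rows_read = 0
>>> try:
...     # roughly the same read_csv arguments as above, but streamed in chunks
...     with pd.read_csv(filepath, header=0, usecols=list(fieldstokeep[0]),
...                      encoding='ISO-8859-1', on_bad_lines='skip', sep=",",
...                      chunksize=100000) as reader:
...         for chunk in reader:
...             rows_read += len(chunk)
... except ParserError as err:
...     # the buffer-overflow error should surface here, after the last good chunk
...     print(f'parser failed after ~{rows_read} rows: {err}')

The failing region should then be within one chunk of the printed row count.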

@bl-young
Collaborator Author

@bl-young, any updates on this front? I'm still getting the ParserError.

No, I have not had a chance to look closely yet. These ParserErrors can be tricky to track down.

For consistency, and in the meantime, I would recommend using the already processed versions, such as via
getInventory(..., download_if_missing=True) if that works for your application.
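
For reference, a minimal sketch of that call for the 2017 RCRAInfo inventory discussed above (download_if_missing is the parameter shown in the getInventory signature in the traceback):

>>> import stewi
>>> # pull the pre-processed inventory instead of regenerating it locally
>>> inventory = stewi.getInventory('RCRAInfo', 2017, download_if_missing=True)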

@dt-woods

Yep. That seems to work! Thanks again for supporting the daisy chain of kwargs down through stewicombo to getInventory.
