Inconsistent Data Management #145

Open
dt-woods opened this issue Sep 18, 2023 · 4 comments
Comments

@dt-woods

I'm not sure what the intended pathway is for data management, but given the general buy-in on esupy's data manager, I'm guessing that's the direction you are heading. That said, there doesn't seem to be any commonality between approaches across the main data modules (i.e., eGRID, NEI, RCRAInfo, and TRI).

  1. With the latest fix on esupy (v.0.3.1), eGRID data now download from source (the EPA website) and the metadata files are generated locally.
  2. NEI data are not generated from source; rather, they are pulled from the AWS remote server (here). This seems to work, albeit differently from eGRID.
  3. TRI seems to do its own thing. It attempts to download data from source; however, it references the url key, which does not point to a data file. The data file location is instead stored under the unique zip_url key (not shared by other databases). TRI doesn't use esupy methods, and it fails.
  4. RCRAInfo requires the unique extra packages selenium and webdriver_manager. The latest versions of these packages (4.12 and 4.0, respectively [1]) crash; see below (notice also the typo, "browswer", in the RCRAInfo log message at timestamp 2023-09-18 13:49:45.430).
    • Note also that I am not a Google Chrome user. Please let me know if this is also a prerequisite.
    • Further note that the error log message at the top of RCRAInfo.py does not trigger an actual message; where does it go?
>>> getInventory('RCRAInfo', 2015)
2023-09-18 13:49:45.360:INFO:globals:read_inventory:RCRAInfo_2015 not found in ~/Library/Application Support/stewi/flowbyfacility
2023-09-18 13:49:45.360:INFO:globals:read_inventory:requested inventory does not exist in local directory, it will be generated...
2023-09-18 13:49:45.430:INFO:RCRAInfo:download_and_extract_zip:Initiating download via browswer...
2023-09-18 13:49:45.430:INFO:logger:log:====== WebDriver manager ======
/bin/sh: /Applications/Google Chrome.app/Contents/MacOS/Google Chrome: No such file or directory
/bin/sh: /Applications/Google Chrome.app/Contents/MacOS/Google Chrome: No such file or directory
2023-09-18 13:49:45.507:INFO:logger:log:Get LATEST chromedriver version for google-chrome
2023-09-18 13:49:45.641:INFO:logger:log:About to download new driver from https://chromedriver.storage.googleapis.com/114.0.5735.90/chromedriver_mac64.zip
2023-09-18 13:49:45.724:INFO:logger:log:Driver downloading response is 200
/bin/sh: /Applications/Google Chrome.app/Contents/MacOS/Google Chrome: No such file or directory
2023-09-18 13:49:45.958:INFO:logger:log:Get LATEST chromedriver version for google-chrome
/bin/sh: /Applications/Google Chrome.app/Contents/MacOS/Google Chrome: No such file or directory
2023-09-18 13:49:47.174:INFO:logger:log:Get LATEST chromedriver version for google-chrome
/bin/sh: /Applications/Google Chrome.app/Contents/MacOS/Google Chrome: No such file or directory
2023-09-18 13:49:47.299:INFO:logger:log:Driver has been saved in cache [~/.wdm/drivers/chromedriver/mac64/114.0.5735.90]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[4], line 1
----> 1 getInventory('RCRAInfo', 2015)

File ~/Envs/ebm/lib/python3.11/site-packages/stewi/__init__.py:82, in getInventory(inventory_acronym, year, stewiformat, filters, filter_for_LCI, US_States_Only, download_if_missing, keep_sec_cntx)
     66 """Return or generate an inventory in a standard output format.
     67 
     68 :param inventory_acronym: like 'TRI'
   (...)
     79 :return: dataframe with standard fields depending on output format
     80 """
     81 f = ensure_format(stewiformat)
---> 82 inventory = read_inventory(inventory_acronym, year, f,
     83                            download_if_missing)
     85 if (not keep_sec_cntx) and ('Compartment' in inventory):
     86     inventory['Compartment'] = (inventory['Compartment']
     87                                 .str.partition('/')[0])

File ~/Envs/ebm/lib/python3.11/site-packages/stewi/globals.py:331, in read_inventory(inventory_acronym, year, f, download_if_missing)
    328 else:
    329     log.info('requested inventory does not exist in local directory, '
    330              'it will be generated...')
--> 331     generate_inventory(inventory_acronym, year)
    332 inventory = load_preprocessed_output(meta, paths)
    333 if inventory is None:

File ~/Envs/ebm/lib/python3.11/site-packages/stewi/globals.py:372, in generate_inventory(inventory_acronym, year)
    370 elif inventory_acronym == 'RCRAInfo':
    371     import stewi.RCRAInfo as RCRAInfo
--> 372     RCRAInfo.main(Option = 'A', Year = [year],
    373                   Tables = ['BR_REPORTING', 'HD_LU_WASTE_CODE'])
    374     RCRAInfo.main(Option = 'B', Year = [year],
    375                   Tables = ['BR_REPORTING'])
    376     RCRAInfo.main(Option = 'C', Year = [year])

File ~/Envs/ebm/lib/python3.11/site-packages/stewi/RCRAInfo.py:477, in main(**kwargs)
    473     """If issues in running this option to download the data, go to the
    474     specified url and find the BR_REPORTING_year.zip file and save to
    475     OUTPUT_PATH. Also requires HD_LU_WASTE_CODE.zip"""
    476     query = _config['queries']['Table_of_tables']
--> 477     download_and_extract_zip(tables, query)
    479 elif kwargs['Option'] == 'B':
    480     organize_br_reporting_files_by_year(kwargs['Tables'], year)

File ~/Envs/ebm/lib/python3.11/site-packages/stewi/RCRAInfo.py:165, in download_and_extract_zip(tables, query)
    159 prefs = {'download.default_directory': str(OUTPUT_PATH),
    160         'download.prompt_for_download': False,
    161         'download.directory_upgrade': True,
    162         'safebrowsing_for_trusted_sources_enabled': False,
    163         'safebrowsing.enabled': False}
    164 options.add_experimental_option('prefs', prefs)
--> 165 browser = webdriver.Chrome(ChromeDriverManager().install(),
    166                            options=options)
    167 browser.maximize_window()
    168 browser.set_page_load_timeout(30)

TypeError: WebDriver.__init__() got multiple values for argument 'options'
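
The TypeError above looks consistent with the Selenium 4 constructor change: recent releases of `webdriver.Chrome` appear to take `options` as the first positional parameter and expect the driver path to be wrapped in a `Service` object, so a positional driver path collides with the `options=` keyword. The sketch below reproduces that mechanism with a stand-in class (`FakeChrome` and `reproduce_error` are hypothetical names, not Selenium or stewi code) and shows one likely replacement for the call at RCRAInfo.py line 165. `make_chrome_browser` is also a hypothetical helper and is only usable when selenium and webdriver_manager are installed.

```python
class FakeChrome:
    """Stand-in mimicking the assumed Selenium >= 4.10 constructor signature."""

    def __init__(self, options=None, service=None, keep_alive=True):
        self.options = options
        self.service = service


def reproduce_error():
    """Call the stand-in the same way RCRAInfo.py calls webdriver.Chrome."""
    try:
        # The positional driver path binds to `options`, which is then
        # passed a second time by keyword.
        FakeChrome('/path/to/chromedriver', options={'prefs': {}})
    except TypeError as err:
        return str(err)
    return None


def make_chrome_browser(options):
    """Likely fix: hand the driver path to Chrome through a Service object.

    Imports are local so this sketch loads without selenium installed.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=service, options=options)


print(reproduce_error())
```

This only addresses the constructor call. chromedriver still drives a local Chrome install, so the "No such file or directory" lines in the log would remain a separate problem on a machine without Chrome.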

Footnotes

  1. https://www.selenium.dev/documentation/webdriver/

@bl-young
Collaborator

Yes, that lack of consistency in how the data are accessed is a bit of a relic and needs to be updated. This is especially true given #144, which seems to be affecting multiple sources. The goal will be to shift towards using data calls via esupy for consistency. I was just starting this on a new branch (requests_update) but have not yet finished.
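
As one small illustration of what a shared access path could look like, the sketch below resolves the download location through a single helper regardless of which key an inventory's config happens to use. It assumes config entries shaped like the TRI case described above (a `url` key that is not a data file plus a separate `zip_url` key). The `CONFIGS` dictionary, the placeholder URLs, and `resolve_download_url` are all hypothetical and do not reflect stewi's or esupy's actual code.

```python
# Hypothetical per-inventory config entries. The key names follow the
# issue text; the URLs are placeholders, not real data locations.
CONFIGS = {
    'eGRID': {'url': 'https://example.invalid/egrid/data.zip'},
    'TRI': {
        'url': 'https://example.invalid/tri/landing-page',
        'zip_url': 'https://example.invalid/tri/data.zip',
    },
}

# Keys that may hold a downloadable file, most specific first.
DOWNLOAD_KEYS = ('zip_url', 'url')


def resolve_download_url(inventory, configs=CONFIGS):
    """Return the first populated download key for an inventory."""
    cfg = configs[inventory]
    for key in DOWNLOAD_KEYS:
        if cfg.get(key):
            return cfg[key]
    raise KeyError(f'no download URL configured for {inventory}')


for name in CONFIGS:
    print(name, resolve_download_url(name))
```

Keeping the key lookup in one place means a module like TRI would not need its own download logic just because its config uses a different key name.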

@bl-young
Collaborator

The RCRA selenium issue is one I am aware of but have not yet documented. We regularly have issues accessing RCRA data because of how they are stored and provided. I have added a separate issue, #146.

@WesIngwersen
Collaborator

The lack of consistency came from the origin of the tool as a set of inventory-specific, independent scripts written by authors who each approached data acquisition differently. Yes, I agree we can reevaluate that as resources become available.

This was referenced Sep 29, 2023