Previously unseen geocoded_state values broke covid_hosp_state_timeseries acquisition #1420

Closed
melange396 opened this issue Apr 22, 2024 · 3 comments

@melange396
Collaborator

The "Update COVID Hosp: state timeseries dataset" acquisition job has been failing since apr 4 2024; first failure in cronicle: https://cronicle-prod-01.delphi.cmu.edu/#JobDetails?id=jluljl1fnlb

Error message from Cronicle:

{"url":"https://healthdata.gov/api/views/qqte-vkut/rows.csv","event":"fetching metadata","logger":"delphi.epidata.acquisition.covid_hosp.state_timeseries.database","level":"info","pid":92518,"timestamp":"2024-04-04T18:00:02.164180Z"}
{"newer_than":"Timestamp('2024-04-02 00:00:00')","older_than":"datetime.date(2024, 4, 5)","event":"looking up issues in date range","logger":"delphi.epidata.acquisition.covid_hosp.state_timeseries.database","level":"info","pid":92518,"timestamp":"2024-04-04T18:00:12.430783Z"}
{"newer_than":"2024-04-02 00:00:00","older_than":"2024-04-05","count":2,"event":"issues selected","logger":"delphi.epidata.acquisition.covid_hosp.state_timeseries.database","level":"info","pid":92518,"timestamp":"2024-04-04T18:00:12.434811Z"}
{"event":"already collected revision: https://us-dhhs-aa.s3.us-east-2.amazonaws.com/g62h-syeh_2024-04-03T12-06-56.csv","logger":"delphi.epidata.acquisition.covid_hosp.state_timeseries.database","level":"info","pid":92518,"timestamp":"2024-04-04T18:00:12.443171Z"}
{"event":"including dataset revision: https://us-dhhs-aa.s3.us-east-2.amazonaws.com/g62h-syeh_2024-04-04T07-01-48.csv","logger":"delphi.epidata.acquisition.covid_hosp.state_timeseries.database","level":"info","pid":92518,"timestamp":"2024-04-04T18:00:12.451599Z"}
{"url":"https://us-dhhs-aa.s3.us-east-2.amazonaws.com/g62h-syeh_2024-04-04T07-01-48.csv","event":"fetching dataset","logger":"delphi.epidata.acquisition.covid_hosp.state_timeseries.database","level":"info","pid":92518,"timestamp":"2024-04-04T18:00:12.451892Z"}
{"event":"2 issues checked w/ 2 revisions, resulting in 1 datasets.","logger":"delphi.epidata.acquisition.covid_hosp.state_timeseries.database","level":"info","pid":92518,"timestamp":"2024-04-04T18:00:16.100503Z"}
{"count":80201,"event":"updating values","logger":"delphi.epidata.acquisition.covid_hosp.state_timeseries.database","level":"info","pid":92518,"timestamp":"2024-04-04T18:00:16.193757Z"}
{"publ_date":20240404,"in_lines":[0,5000],"index":4999,"values":["PR",20210430,0.138728323699422,70,692,96,0.66804979253112,71,723,483,0,71,0,0,71,0,7,70,"POINT (-66.468953431 18.222606498)",2,70,1,70,0.052334784329244,70,10108,529,10285,71,6310,71,529,70,0.613514827418571,71,10285,6310,null,null,null,0.085144052792532,70,6213,529,25,0,71,1,71,3,71,6,71,6,71,4,71,2,71,3,71,71,0,71,18,0,70,0,70,0,70,3,70,3,70,2,70,4,70,3,70,70,3,70,4,70,0,70,12,70,0,70,0,70,null,null,null,483,71,96,70,67,71,479,70,451,71,25,70,50,70,25,70,723,71],"exception":"DataError(1406, \"1406 (22001): Data too long for column 'geocoded_state' at row 1\", '22001')","event":"error on insert","logger":"delphi.epidata.acquisition.covid_hosp.state_timeseries.database","level":"error","pid":92518,"timestamp":"2024-04-04T18:00:18.773557Z"}
{"scope":"sys","event":"Top-level exception occurred","logger":"delphi.epidata.acquisition.covid_hosp.state_timeseries.database","level":"error","pid":92518,"timestamp":"2024-04-04T18:00:18.773974Z","exception":"Traceback (most recent call last):\n  File \"/home/automation/.pyenv/versions/3.8.2/lib/python3.8/site-packages/mysql/connector/connection_cext.py\", line 487, in cmd_query\n    self._cmysql.query(query,\n_mysql_connector.MySQLInterfaceError: Data too long for column 'geocoded_state' at row 1\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/home/automation/.pyenv/versions/3.8.2/lib/python3.8/runpy.py\", line 193, in _run_module_as_main\n    return _run_code(code, main_globals, None,\n  File \"/home/automation/.pyenv/versions/3.8.2/lib/python3.8/runpy.py\", line 86, in _run_code\n    exec(code, run_globals)\n  File \"/home/automation/driver/delphi/epidata/acquisition/covid_hosp/state_timeseries/update.py\", line 42, in <module>\n    Utils.launch_if_main(Update.run, __name__)\n  File \"/home/automation/driver/delphi/epidata/acquisition/covid_hosp/common/utils.py\", line 38, in launch_if_main\n    entrypoint()\n  File \"/home/automation/driver/delphi/epidata/acquisition/covid_hosp/state_timeseries/update.py\", line 38, in run\n    return Utils.update_dataset(Database, network)\n  File \"/home/automation/driver/delphi/epidata/acquisition/covid_hosp/common/utils.py\", line 262, in update_dataset\n    db.insert_dataset(issue_int, dataset, logger=logger)\n  File \"/home/automation/driver/delphi/epidata/acquisition/covid_hosp/common/database.py\", line 231, in insert_dataset\n    raise e\n  File \"/home/automation/driver/delphi/epidata/acquisition/covid_hosp/common/database.py\", line 225, in insert_dataset\n    cursor.executemany(sql, many_values)\n  File \"/home/automation/.pyenv/versions/3.8.2/lib/python3.8/site-packages/mysql/connector/cursor_cext.py\", line 357, in executemany\n    self.execute(operation, params)\n  File \"/home/automation/.pyenv/versions/3.8.2/lib/python3.8/site-packages/mysql/connector/cursor_cext.py\", line 264, in execute\n    result = self._cnx.cmd_query(stmt, raw=self._raw,\n  File \"/home/automation/.pyenv/versions/3.8.2/lib/python3.8/site-packages/mysql/connector/connection_cext.py\", line 491, in cmd_query\n    raise errors.get_mysql_exception(exc.errno, msg=exc.msg,\nmysql.connector.errors.DataError: 1406 (22001): Data too long for column 'geocoded_state' at row 1"}

The offending CSV file is https://us-dhhs-aa.s3.us-east-2.amazonaws.com/g62h-syeh_2024-04-04T07-01-48.csv

In our covid_hosp_state_timeseries table, the geocoded_state column is of type VARCHAR(32).
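
The rejected value is 34 characters, two more than that column can hold; a quick sanity check (just a demonstration, not part of the acquisition code):

$ python3 -c 'print(len("POINT (-66.468953431 18.222606498)"))'
34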

The most recently successfully imported CSV has null/empty values in the geocoded_state column; so do the CSV issued immediately after the offending file and the most recently issued CSV. Only the one file appears to contain these long "POINT (..." values. I downloaded the four CSVs in question and ran the following commands to demonstrate:

$ ls -laF g62h-syeh_2024-04-*
-rw-r--r--  1 haff  staff  37787646 Apr 22 13:52 g62h-syeh_2024-04-03T12-06-56.csv
-rw-r--r--  1 haff  staff  40484359 Apr 22 13:53 g62h-syeh_2024-04-04T07-01-48.csv
-rw-r--r--  1 haff  staff  37961499 Apr 22 14:07 g62h-syeh_2024-04-11T15-31-38.csv
-rw-r--r--  1 haff  staff  38136294 Apr 22 14:06 g62h-syeh_2024-04-19T12-06-57.csv

$ grep -l POINT g62h-syeh_2024-04-*
g62h-syeh_2024-04-04T07-01-48.csv

$ grep "PR,2021/04/30" g62h-syeh_2024-04-03T12-06-56.csv 
PR,2021/04/30,0,0,71,0,0,71,2,70,10285,71,6310,71,529,70,25,71,18,70,0,70,12,70,483,71,96,70,67,71,479,70,451,71,50,70,25,70,723,71,0.613514827418571,71,6310,10285,0.085144052792532,70,529,6213,0.052334784329244,70,529,10108,0.138728323699422,70,96,692,0.66804979253112,71,483,723,,0,71,1,71,3,71,6,71,6,71,4,71,2,71,3,71,0,71,0,70,0,70,0,70,3,70,3,70,2,70,4,70,3,70,3,70,7,70,,,,,,,1,70,4,70,0,70,0,70,25,13,70,70,62,8,196,8,,0,,0,,0,,0,,0,12,8,29,8

$ grep "PR,2021/04/30" g62h-syeh_2024-04-04T07-01-48.csv 
PR,2021/04/30,0,0,71,0,0,71,2,70,10285,71,6310,71,529,70,25,71,18,70,0,70,12,70,483,71,96,70,67,71,479,70,451,71,50,70,25,70,723,71,0.613514827418571,71,6310,10285,0.085144052792532,70,529,6213,0.052334784329244,70,529,10108,0.138728323699422,70,96,692,0.66804979253112,71,483,723,POINT (-66.468953431 18.222606498),0,71,1,71,3,71,6,71,6,71,4,71,2,71,3,71,0,71,0,70,0,70,0,70,3,70,3,70,2,70,4,70,3,70,3,70,7,70,,,,,,,1,70,4,70,0,70,0,70,25,13,70,70,62,8,196,8,,0,,0,,0,,0,,0,12,8,29,8
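
A reusable way to triage future files is to scan for over-long values before acquisition runs. A minimal sketch, assuming the CSV header names the column geocoded_state (the script name and invocation are illustrative):

import csv
import sys

LIMIT = 32  # width of the geocoded_state VARCHAR(32) column

# usage: python3 check_geocoded_state.py g62h-syeh_2024-04-04T07-01-48.csv
with open(sys.argv[1], newline="") as f:
    reader = csv.DictReader(f)
    for lineno, row in enumerate(reader, start=2):  # data rows start at file line 2
        value = row.get("geocoded_state") or ""
        if len(value) > LIMIT:
            print(f"line {lineno}: {len(value)} chars: {value!r}")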

Attempt to skip the offending file, similar to the solution in #1369:

INSERT INTO covid_hosp_meta (dataset_name, publication_date, revision_timestamp, hhs_dataset_id, acquisition_datetime, metadata_json) 
VALUES ('covid_hosp_state_timeseries', 20240404, 'https://us-dhhs-aa.s3.us-east-2.amazonaws.com/g62h-syeh_2024-04-04T07-01-48.csv', 'g62h-syeh', 
        '2024-04-21 00:00:00', '{"error":"semi-malformed data file, includes geo point encodings too long for our data column: geocoded_state"}'); 

After adding that row to the covid_hosp_meta table, I re-ran the Cronicle job and am now awaiting the results; with the revision URL recorded there, the job should treat the file as already collected (cf. the "already collected revision" log line above).
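
Longer term, a defensive guard in the acquisition code could drop such values instead of failing the whole batch. A hypothetical sketch, not what was deployed (sanitize_geocoded_state is not an existing function in the codebase):

GEOCODED_STATE_MAX = 32  # matches VARCHAR(32) in covid_hosp_state_timeseries

def sanitize_geocoded_state(value):
    """Return None for values that would overflow the column, e.g. the
    34-character 'POINT (-66.468953431 18.222606498)' seen above."""
    if value is None:
        return None
    value = str(value)
    return None if len(value) > GEOCODED_STATE_MAX else value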

@melange396
Collaborator Author

Some discussion of this took place on Slack: https://delphi-org.slack.com/archives/C013AH5N01E/p1713806839980759

@melange396
Collaborator Author

The Cronicle job was successful after a ~7.5 minute runtime: https://cronicle-prod-01.delphi.cmu.edu/#JobDetails?id=jlvbb7xtax6

Now waiting for the subsequently triggered "HHS" indicator job: https://cronicle-prod-01.delphi.cmu.edu/#JobDetails?id=jlvbbhj50x7
...which should itself trigger a regular acquisition job.

@melange396
Collaborator Author

All done. Here is our dashboard showing HHS data for Apr 12: https://delphi.cmu.edu/covidcast/indicator/?sensor=hhs-confirmed_admissions_influenza_1d_prop_7dav&date=20240412

@melange396 self-assigned this Apr 22, 2024
@melange396 added the bug, data quality, and mysql (mysql database related) labels Apr 22, 2024