UnicodeDecodeError in sanitize_metadata #846

Open
jacaravas opened this issue Jan 27, 2022 · 6 comments
Labels
bug Something isn't working

Comments

@jacaravas

jacaravas commented Jan 27, 2022

Some recent additions to GISAID are causing sanitize_metadata.py to fail on my system. It isn't obvious if this is an issue in the GISAID data, my local system/environment settings, or my data prep script.

The error can be removed by pre-processing the metadata with:
sed -i.bak 's/[\d128-\d255]//g' metadata.tsv
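For readers whose sed rejects that expression, the same byte-stripping idea can be sketched in Python, which does not depend on locale settings. This is only an illustration of the workaround above, not part of the ncov scripts: the function names and the output path argument are hypothetical. Like the sed command, it also discards legitimate non-ASCII characters, so accented lab and author names lose letters.

```python
# Hypothetical helper names; a sketch of the sed workaround, not ncov code.
NON_ASCII_BYTES = bytes(range(128, 256))


def strip_non_ascii_bytes(data: bytes) -> bytes:
    """Drop every byte in the range 128-255 and keep ASCII bytes untouched."""
    return data.translate(None, delete=NON_ASCII_BYTES)


def strip_file(src: str, dst: str, chunk_size: int = 1 << 20) -> None:
    """Copy src to dst in chunks, removing bytes 128-255 along the way.

    Works on raw bytes, so chunk boundaries cannot split anything that matters.
    """
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for chunk in iter(lambda: fin.read(chunk_size), b""):
            fout.write(strip_non_ascii_bytes(chunk))
```

Calling strip_file("metadata.tsv", "metadata.ascii.tsv") leaves the original file in place, which plays the role of the .bak copy that sed -i.bak creates.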

The error text is:

Traceback (most recent call last):
  File "/home/ncov/scripts/sanitize_metadata.py", line 405, in <module>
    database_ids_by_strain = get_database_ids_by_strain(
  File "/home/ncov/scripts/sanitize_metadata.py", line 211, in get_database_ids_by_strain
    for metadata in metadata_reader:
  File "/home/my_conda_envs/nextstrain/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1024, in __next__
    return self.get_chunk()
  File "/home/my_conda_envs/nextstrain/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1074, in get_chunk
    return self.read(nrows=size)
  File "/home/my_conda_envs/nextstrain/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1047, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/home/my_conda_envs/nextstrain/lib/python3.9/site-packages/pandas/io/parsers/python_parser.py", line 246, in read
    content = self._get_lines(rows)
  File "/home/my_conda_envs/nextstrain/lib/python3.9/site-packages/pandas/io/parsers/python_parser.py", line 1049, in _get_lines
    new_rows.append(next(self.data))
  File "/home/my_conda_envs/nextstrain/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 7734: invalid continuation byte
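
The position in that message is most likely an offset into the buffer being decoded rather than into the whole file, so it does not point directly at a row. A minimal sketch for locating the offending rows is to read the file as bytes and try to decode each line separately. The function name is hypothetical, and the sketch assumes rows are separated by "\n", which is safe for UTF-8 because the newline byte never occurs inside a multi-byte sequence.

```python
from typing import Iterator, Tuple


def find_invalid_utf8_lines(path: str) -> Iterator[Tuple[int, int, bytes]]:
    """Yield (line_number, offset_in_line, raw_line) for lines that are not valid UTF-8.

    Line numbers are 1-based; the offset is the index of the first bad byte
    within that line.
    """
    with open(path, "rb") as handle:
        for line_number, raw_line in enumerate(handle, start=1):
            try:
                raw_line.decode("utf-8")
            except UnicodeDecodeError as error:
                yield line_number, error.start, raw_line
```

Iterating over find_invalid_utf8_lines("metadata.tsv") and printing each raw line shows which records and which fields carry the bad bytes.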

The bad character seems to be present in one or more sequences in the following list, probably in the submitting or originating lab fields:

GISAID ID
--
EPI_ISL_8054068
EPI_ISL_8351839
EPI_ISL_8515106
EPI_ISL_8722546
EPI_ISL_8633134
EPI_ISL_8711110
EPI_ISL_8480143
EPI_ISL_8607291
EPI_ISL_8844787
EPI_ISL_8508549
EPI_ISL_8893774
EPI_ISL_8931724
EPI_ISL_8837874
EPI_ISL_8826040
EPI_ISL_8722609
EPI_ISL_8932327
EPI_ISL_8789921
EPI_ISL_8664443
EPI_ISL_8663986
EPI_ISL_8818636
EPI_ISL_8790891
EPI_ISL_8790205
EPI_ISL_8788890
EPI_ISL_8607109
EPI_ISL_9015602
EPI_ISL_8785652
EPI_ISL_8681256
EPI_ISL_8683191
EPI_ISL_8055058
EPI_ISL_8766418
EPI_ISL_8242009
EPI_ISL_8585276
EPI_ISL_8927637
EPI_ISL_9010538
EPI_ISL_8465460
EPI_ISL_8579421
EPI_ISL_8976041
EPI_ISL_8976040
EPI_ISL_8975532
EPI_ISL_8985674
EPI_ISL_8985653
EPI_ISL_8985734
EPI_ISL_8925410
EPI_ISL_8799966
EPI_ISL_8931073
@tsibley
Member

tsibley commented Jan 27, 2022

This seems to me likely to be bad upstream source data from GISAID (e.g. actually invalid UTF-8).

However, it could plausibly be an issue in Pandas' row/line-chunked parsing accidentally splitting a single UTF-8 multi-byte character across reads/decodes.
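
The two hypotheses can be told apart to some extent by the error reason. The sketch below, using only the standard library, shows that in CPython a multi-byte character cut off at the end of a chunk fails with "unexpected end of data" when each chunk is decoded on its own, and does not fail at all with an incremental decoder, while a lead byte such as 0xc4 followed by a non-continuation byte gives "invalid continuation byte". The traceback above appears to pass through an incremental decoder in codecs.py, so this hints at invalid source bytes, though it does not rule out a parsing issue. The function names and sample strings are made up for illustration.

```python
import codecs


def decode_chunks_naively(chunks):
    """Decode each chunk independently; breaks if a character spans two chunks."""
    return "".join(chunk.decode("utf-8") for chunk in chunks)


def decode_chunks_incrementally(chunks):
    """Decode with an incremental decoder, which buffers partial characters."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    text = "".join(decoder.decode(chunk) for chunk in chunks)
    return text + decoder.decode(b"", final=True)


def decode_error_reason(data: bytes) -> str:
    """Return the UnicodeDecodeError reason for data, or "" if it decodes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as error:
        return error.reason
    return ""


# "Ä" is two bytes in UTF-8 (0xc3 0x84); split it across two chunks.
encoded = "Ä".encode("utf-8")
split_chunks = [b"lab " + encoded[:1], encoded[1:] + b" institute"]

# A lone 0xc4 followed by a space, as a legacy single-byte encoding might produce.
invalid_bytes = b"lab \xc4 institute"
```

Here decode_chunks_incrementally(split_chunks) returns the full string, decode_chunks_naively(split_chunks) raises, and decode_error_reason(invalid_bytes) reports the same reason as the traceback.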

Not sure without digging in more and would really need to be able to reproduce locally to diagnose this.

@huddlej
Contributor

huddlej commented Feb 2, 2022

@jacaravas Just to clarify, is this the data flow for the metadata you pass to sanitize_metadata.py?

  1. GISAID API endpoint
  2. internal database
  3. metadata TSV

@jacaravas
Author

jacaravas commented Feb 3, 2022

@huddlej Yes, that is correct. A Python script sits between steps 2 and 3, where extra annotations are applied, names are normalized, etc. I will try to get back to this tomorrow to confirm that these entries are still causing failures for me.

@Gathii

Gathii commented Mar 6, 2022

I struggled with this bug and could not get past the sanitize_metadata.py step. My metadata was retrieved from GISAID using the augur input option. The workaround suggested here, sed -i.bak 's/[\d128-\d255]//g' metadata.tsv, kept failing with the error "invalid collation character". What worked for me was converting the metadata file to UTF-8 encoding with Notepad++ and then using the converted file as my metadata.tsv.
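
A scripted version of that conversion might look like the sketch below. It is not what Notepad++ does internally and not part of the ncov workflow. It assumes the file is mostly valid UTF-8 with some rows in a legacy single-byte encoding, and the choice of latin-1 as that encoding is purely an assumption that should be checked against the actual data. Function names are hypothetical.

```python
def repair_line(raw_line: bytes, fallback_encoding: str = "latin-1") -> str:
    """Decode a line as UTF-8, falling back to an assumed legacy encoding.

    The fallback encoding is a guess; latin-1 never fails to decode, so a
    wrong guess yields wrong characters rather than an error.
    """
    try:
        return raw_line.decode("utf-8")
    except UnicodeDecodeError:
        return raw_line.decode(fallback_encoding)


def rewrite_as_utf8(src: str, dst: str, fallback_encoding: str = "latin-1") -> None:
    """Write a copy of src to dst in which every line is valid UTF-8."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for raw_line in fin:
            fout.write(repair_line(raw_line, fallback_encoding).encode("utf-8"))
```

Unlike stripping bytes 128-255, this keeps accented characters. One limitation: a line that mixes valid UTF-8 characters with a stray legacy byte is decoded entirely with the fallback, which garbles the valid characters on that line.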

@tsibley
Member

tsibley commented Mar 9, 2022

@Gathii I suspect the "invalid collation character" error from that sed command points to something in your locale settings. I'd be curious to see what the locale command prints on your system. In any case, glad you found a workaround and shared it here!

@Gathii

Gathii commented Mar 10, 2022

@tsibley See my locale settings below:

LANG=en_US.utf-8
LC_CTYPE="en_US.utf-8"
LC_NUMERIC="en_US.utf-8"
LC_TIME="en_US.utf-8"
LC_COLLATE="en_US.utf-8"
LC_MONETARY="en_US.utf-8"
LC_MESSAGES="en_US.utf-8"
LC_PAPER="en_US.utf-8"
LC_NAME="en_US.utf-8"
LC_ADDRESS="en_US.utf-8"
LC_TELEPHONE="en_US.utf-8"
LC_MEASUREMENT="en_US.utf-8"
LC_IDENTIFICATION="en_US.utf-8"
LC_ALL=en_US.utf-8

Thanks

Projects
No open projects
Status: Backlog
Development

No branches or pull requests

4 participants