UnicodeDecodeError in sanitize_metadata #846
Comments
This seems to me likely to be bad upstream source data from GISAID (e.g. actually invalid UTF-8). However, it could plausibly be an issue in Pandas' row/line-chunked parsing accidentally splitting a single UTF-8 multi-byte character across reads/decodes. I'm not sure without digging in more, and I would really need to be able to reproduce this locally to diagnose it.
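To make the second hypothesis concrete, here is a minimal sketch of how decoding fixed-size byte chunks independently can raise a UnicodeDecodeError on perfectly valid UTF-8, and how an incremental decoder avoids it. This does not claim that Pandas actually behaves this way; the sample string, the chunk size, and both helper functions are made up for illustration.

```python
import codecs

# Made-up sample value containing multi-byte UTF-8 characters.
text = "Universität Zürich\n"
data = text.encode("utf-8")


def decode_naively(data: bytes, chunk_size: int) -> str:
    """Decode each byte chunk on its own (the hypothesized failure mode)."""
    return "".join(
        data[i:i + chunk_size].decode("utf-8")
        for i in range(0, len(data), chunk_size)
    )


def decode_incrementally(data: bytes, chunk_size: int) -> str:
    """Decode chunk by chunk, carrying incomplete characters over to the next chunk."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = [
        decoder.decode(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]
    # final=True makes the decoder raise if the input ends mid-character.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


if __name__ == "__main__":
    try:
        decode_naively(data, 10)  # a 10-byte chunk ends in the middle of "ä"
    except UnicodeDecodeError as error:
        print("naive chunked decode failed:", error)
    print(decode_incrementally(data, 10) == text)
```

If the file itself contains invalid bytes, both approaches fail, which is one way to tell the two hypotheses apart.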
@jacaravas Just to clarify, is the data flow for the metadata you pass to sanitize metadata like this?
@huddlej Yes, that is correct. There is a Python script between steps 2 and 3 where extra annotations are applied, names are normalized, etc. I will try to get back to this tomorrow to confirm these entries are still causing failures for me.
I struggled with this bug and could not get past the sanitize_metadata.py step. My metadata was retrieved from GISAID using the augur input option. The solution provided here, "sed -i.bak 's/[\d128-\d255]//g' metadata.tsv", kept giving me the error "invalid collation character". What worked for me was converting the metadata file to UTF-8 encoding using Notepad++ and then using the converted version as my metadata.tsv.
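For anyone without Notepad++, a rough scripted stand-in might look like the sketch below. It is not the same operation: rather than converting from a known source encoding, it keeps every valid UTF-8 character and substitutes U+FFFD for any byte sequence that cannot be decoded. The function name and the file names in the usage line are hypothetical.

```python
def rewrite_as_valid_utf8(source_path, destination_path):
    """Copy a text file, replacing undecodable bytes with U+FFFD.

    Returns the number of U+FFFD characters written, which also counts
    any U+FFFD characters that were already present in the input.
    """
    replaced = 0
    # newline="" keeps the original line endings untouched.
    with open(source_path, "r", encoding="utf-8", errors="replace", newline="") as source, \
         open(destination_path, "w", encoding="utf-8", newline="") as destination:
        for line in source:
            replaced += line.count("\ufffd")
            destination.write(line)
    return replaced


if __name__ == "__main__":
    # Hypothetical file names; adjust to your own workflow.
    count = rewrite_as_valid_utf8("metadata.tsv", "metadata.utf8.tsv")
    print(f"replacement characters written: {count}")
```

Because the file is streamed line by line, this should also work on large metadata files without loading them fully into memory.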
@Gathii I suspect that
@tsibley See my locale settings below: LANG=en_US.utf-8. Thanks.
Some recent additions to GISAID are causing sanitize_metadata.py to fail on my system. It isn't obvious if this is an issue in the GISAID data, my local system/environment settings, or my data prep script.
The error can be removed by pre-processing the metadata with:
sed -i.bak 's/[\d128-\d255]//g' metadata.tsv
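The sed expression appears intended to delete every byte in the range 128 to 255, which would also drop legitimately encoded non-ASCII characters. As a less destructive diagnostic, the following sketch only reports which lines contain bytes that are not valid UTF-8, without modifying the file. The function name is hypothetical and the file name is taken from the command above.

```python
def find_invalid_utf8_lines(path):
    """Return (line_number, byte_offset, offending_bytes) for each undecodable line.

    Only the first decoding error on each line is reported.
    """
    bad = []
    # Read in binary mode so that decoding errors can be caught per line.
    # Splitting on b"\n" is safe because that byte never occurs inside a
    # multi-byte UTF-8 sequence.
    with open(path, "rb") as handle:
        for line_number, raw in enumerate(handle, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as error:
                bad.append((line_number, error.start, raw[error.start:error.end]))
    return bad


if __name__ == "__main__":
    for line_number, offset, offending in find_invalid_utf8_lines("metadata.tsv"):
        print(f"line {line_number}, byte offset {offset}: {offending!r}")
```

The reported line numbers can then be matched against the strain names in the metadata to narrow down which records carry the bad characters.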
The error text is:
The bad character seems to be present in one or more sequences in the following list, probably in the submitting or originating lab fields: