
lack of explicit separator in read_csv in /mutate/protocol.py #272

Open
paulaberry opened this issue Feb 3, 2022 · 3 comments


@paulaberry

The function that imports the mutation data table in /mutate/protocol.py does not set an explicit separator, so pandas expects commas as field separators. However, the function in /mutate/calculations.py that deals with that table expects commas to separate the locations of mutations. If you use commas for both field separators and mutation separators, this leads to an error that stops the pipeline. If you instead format the mutation data table as suggested, with semicolons as field separators, pandas raises a KeyError.

I fixed this issue on my installation of evcouplings by changing line 126 of mutate/protocol.py from `data = pd.read_csv(dataset_file, comment="#")` to `data = pd.read_csv(dataset_file, comment="#", sep=";")`.
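To illustrate the KeyError case, here is a minimal sketch of how the two `read_csv` calls above behave on a semicolon-separated table. The column names and values are hypothetical and only stand in for a real mutation data table; the table is read from an in-memory string rather than from `dataset_file`.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated table (column names are illustrative,
# not taken from the evcouplings example file).
raw = "# hypothetical example table\nmutant;effect\nG126A;0.5\nN127D;1.2\n"

# Without sep, pandas splits on commas, so each line becomes a single field.
default = pd.read_csv(io.StringIO(raw), comment="#")

# With sep=";", the header and rows are split into the two intended columns.
explicit = pd.read_csv(io.StringIO(raw), comment="#", sep=";")

print(list(default.columns))
print(list(explicit.columns))
```

In the first case the only column is named `mutant;effect`, so any later lookup of a `mutant` column fails with a KeyError.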

@thomashopf
Contributor

thomashopf commented Feb 4, 2022

Hi @paulaberry,

In your csv file, are the fields containing commas escaped/quoted, like in the following example file?
test_mutants.csv

With this quoting, which is also what pandas' to_csv() function produces when saving, fields can contain commas in a comma-separated file.
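A small sketch of that round trip, assuming a hypothetical two-column table (the column names and values are made up for illustration and are not the contents of test_mutants.csv):

```python
import io

import pandas as pd

# Hypothetical table in which one mutant field itself contains a comma.
df = pd.DataFrame(
    {"mutant": ["G126A", "G126A,N127D"], "effect": [0.5, 1.2]}
)

# By default, to_csv() quotes only the fields that contain the delimiter.
text = df.to_csv(index=False)
print(text)

# Reading the text back with the default comma separator restores the
# comma-containing field as a single value.
round_trip = pd.read_csv(io.StringIO(text), comment="#")
print(round_trip["mutant"].tolist())
```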

@paulaberry
Author

This file, https://github.com/debbiemarkslab/EVcouplings/blob/develop/notebooks/example/PABP_YEAST_Fields2013-singles.csv, was the only example I could find of how to format a mutation effects data table, and it is the one referenced in your mutation effects documentation. Are two different functions used for the same calculations when using EVcouplings as a Python package versus the command line interface, so that the formatting should differ between them?

@thomashopf
Contributor

thomashopf commented Feb 6, 2022

Sorry, this is an unfortunate mismatch between the mutation effect documentation and overall pipeline usage. I tagged this issue to fix the documentation example so it doesn't lead to misunderstandings in the future.

In the example notebook, pd.read_csv() is called with an explicit sep=";" argument, while the pipeline defaults to sep="," when reading the csv file. That default is the intended behaviour, so that csv files are handled consistently across the pipeline; the example file dates back to an older set of files. The actual prediction functions applied afterwards are the same.

As a solution, I propose using a file formatted like the test_mutants.csv file I posted above, where any strings containing commas are wrapped in quotation marks, which I think is the de facto standard for handling this case.
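One possible way to convert an existing semicolon-separated table into that format is sketched below. The helper name `semicolon_to_comma_csv` is hypothetical and not part of evcouplings, and the sample columns are illustrative. Note that lines starting with `#` are dropped during the conversion.

```python
import io

import pandas as pd


def semicolon_to_comma_csv(src, dst):
    """Hypothetical helper: rewrite a ';'-separated table as a quoted ','-separated one.

    src and dst may be file paths or file-like objects.
    """
    table = pd.read_csv(src, sep=";", comment="#")
    # to_csv() quotes any field containing a comma, e.g. multi-mutant entries.
    table.to_csv(dst, index=False)


# Usage with in-memory buffers standing in for real files.
src = io.StringIO("mutant;effect\nG126A;0.5\nG126A,N127D;1.2\n")
dst = io.StringIO()
semicolon_to_comma_csv(src, dst)

# The converted text can be read with the default comma separator.
converted = pd.read_csv(io.StringIO(dst.getvalue()), comment="#")
print(converted)
```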
