Non-UTF byte in CCLE drugInfo #27

Open
ChristopherEeles opened this issue Jul 16, 2021 · 1 comment

Comments


ChristopherEeles commented Jul 16, 2021

There is a non-UTF-8 byte in drugInfo(CCLE)[4, 2], which is in the 'Compound..brand.name.' column. I think it is probably a TM symbol. It breaks a number of things, such as reading the table in as a .csv in Python and some R show methods.

We should have a general mechanism to ensure that only valid UTF-8 strings are stored in a PSet. Base R already has a utility for this: iconv.

We could do something like:

DF$column <- iconv(DF$column, to='UTF-8', sub='')
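To see which entries actually need cleaning before rewriting a column, base R's validUTF8() can flag them. A minimal sketch on a toy vector, not the real CCLE table: the byte 0x99 is an assumption standing in for the suspected TM symbol (that is its Windows-1252 code; the actual byte in drugInfo(CCLE) has not been checked here), and passing from='UTF-8' explicitly is also an assumption, made so the result should not depend on the session locale.

```r
# Toy stand-in for the brand-name column. The second entry is "X" followed
# by the raw byte 0x99, which is not valid UTF-8 on its own.
brand <- c("Tarceva", rawToChar(as.raw(c(0x58, 0x99))))

# validUTF8() returns FALSE for entries that are not valid UTF-8
bad <- !validUTF8(brand)
which(bad)

# Drop the invalid bytes; sub='' replaces each non-convertible byte with nothing
clean <- iconv(brand, from = "UTF-8", to = "UTF-8", sub = "")
```

Note that sub='' silently discards the byte, so the symbol is lost rather than converted. If the original encoding were known, converting from it would keep the symbol.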

ChristopherEeles commented Jul 16, 2021

Here is a data.frame-wide solution:

DF <- S4Vectors::endoapply(DF, FUN=iconv, to='UTF-8', sub='')
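One possible caveat with applying iconv to every column: per ?iconv, non-character input is converted with as.character, so numeric columns would likely come back as character. A base-R sketch that only touches character columns is below. It assumes a plain data.frame, and sanitizeUTF8 is a hypothetical name, not an existing CoreGx or PharmacoGx function.

```r
# Hypothetical helper: strip invalid UTF-8 bytes from character columns only,
# leaving numeric, logical, and other column types untouched.
sanitizeUTF8 <- function(DF) {
    isChar <- vapply(DF, is.character, logical(1))
    if (any(isChar)) {
        DF[isChar] <- lapply(DF[isChar], iconv,
            from = "UTF-8", to = "UTF-8", sub = "")
    }
    DF
}

# Toy data: the second name is "B" followed by an invalid byte (0x99)
DF <- data.frame(
    name = c("A", rawToChar(as.raw(c(0x42, 0x99)))),
    dose = c(1.5, 2),
    stringsAsFactors = FALSE
)
DF <- sanitizeUTF8(DF)
```

After the call, the dose column is still numeric and only the name column has been rewritten.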
