Non-UTF byte in CCLE drugInfo #27

ChristopherEeles · 2021-07-16T19:38:17Z

There is a non-UTF-8 byte in drugInfo(CCLE)[4, 2], in the 'Compound..brand.name.' column. I think it is probably a TM symbol. It breaks a bunch of stuff, such as reading the table in as a .csv in Python, and also some R show methods.

We should have a general mechanism to ensure that only valid UTF-8 strings are stored in a PSet. Base R already has a utility for this: iconv.

We could do something like:

DF$column <- iconv(DF$column, to='UTF-8', sub='')
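Before stripping anything, it can help to see which entries are actually invalid. A minimal sketch using base R's validUTF8() (available since R 3.3.0); the vector `x` is a made-up stand-in for the real column, and the 0x99 byte (the Windows-1252 trademark sign) is only an assumption about what the offending byte might be:

```r
# Hypothetical example data, not the real drugInfo(CCLE) column.
# The second element ends in a lone 0x99 byte, which is not valid UTF-8.
x <- c("Gleevec", rawToChar(as.raw(c(0x54, 0x4d, 0x99))))

# validUTF8() returns one TRUE/FALSE per element.
bad <- !validUTF8(x)

# Indices of the entries that would need cleaning.
bad_idx <- which(bad)
```

Checking first makes it possible to log or review the affected rows instead of silently dropping bytes with sub=''.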


ChristopherEeles · 2021-07-16T19:46:32Z

Here is a data.frame-wide solution:

DF <- S4Vectors::endoapply(DF, FUN=iconv, to='UTF-8', sub='')
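One caveat with applying iconv to every column: iconv converts non-character input with as.character, so numeric columns would come back as character vectors. A possible variant, sketched in base R only, restricts the conversion to character columns. The helper name `sanitize_utf8` and the example data are hypothetical; the iconv arguments are the same as above:

```r
# Hypothetical helper: re-encode only the character columns of a data.frame,
# leaving numeric, logical, and other column types untouched.
sanitize_utf8 <- function(df, sub = "") {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], iconv, to = "UTF-8", sub = sub)
  df
}

# Made-up example: one character column with a stray 0x99 byte, one numeric column.
df <- data.frame(
  drug = c("A", rawToChar(as.raw(c(0x42, 0x99)))),
  dose = c(1.5, 2.0),
  stringsAsFactors = FALSE
)

clean <- sanitize_utf8(df)
```

Running validUTF8() on the cleaned character columns afterwards is a cheap way to confirm the result on a given platform, since iconv behaviour depends on the locale and the underlying implementation.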
