Non ASCII characters support #1

Open
yannnic opened this issue Mar 13, 2021 · 7 comments

@yannnic commented Mar 13, 2021

Hi,
Bravo for this great initiative!
I suppose you already know that non-ASCII strings are not well supported in your data. They seem to be filtered out of the strings: erased or replaced.
Examples from this search (https://openeditors.ooir.org/index.php?editor_query=Nantes):
  • Journal title: 'Archives de Pdiatrie' should be 'Archives de Pédiatrie' > character erased
  • University name: 'Universit de Nantes; Nantes, France' should be 'Université de Nantes; Nantes, France' > character erased
  • Editor name: 'Francois Galgani' should be 'François Galgani' > character 'ç' replaced by 'c'

If all characters could be preserved in Unicode, it would be perfect!

@andreaspacher (Owner) commented Mar 15, 2021

Hello,

Thank you for pointing this out.

It seems to be an encoding issue for which I cannot find a quick fix, but I will keep trying.

Just to make sure:

  • The characters are not erased, but converted to, for example, a <fc> (instead of ü) or an <e9> (instead of é). These look like Latin-1 hex codes (see the sketch below). The conversions, in turn, are not shown in the web version because the browser treats them as HTML tags.
  • As regards the Galgani case, the journals themselves list him as "Francois" (such as here in Marine Pollution Bulletin), so the error lies with the journals and not with Open Editors. Otherwise it would have shown "Franois Galgani", I fear.
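
A minimal sketch of the apparent failure mode (assuming Latin-1 source bytes; this is not the scraper's actual code):

x <- "Universit\xe9 de Nantes"            # 0xe9 is the Latin-1 byte for 'é'
iconv(x, from = "latin1", to = "UTF-8")   # "Université de Nantes"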

Anyway, I will continue looking for solutions, and I thank you for having pointed it out.

@yannnic (Author) commented Mar 16, 2021

Thanks a lot for your reply and your best efforts!
Yann

@bmkramer (Contributor) commented Mar 17, 2021

@andreaspacher I ran into the same issue when working with the CSV files.

Thinking about a solution: could you perhaps try to specify the encoding as UTF-8 when writing the data to CSV with write.csv?

As such:
write.csv(df, "Output/editors.csv", fileEncoding = "UTF-8")

Also, when you read the current file(s) back into R on your system, do the special characters display correctly for you?
E.g. does
df <- read.csv("Output/editors1.csv")
print(df$affiliation[1])
result in
"Children<U+0092>s Health, Dallas, United States" or "Children’s Health, Dallas, United States" ?

Happy to try and help troubleshoot this further, as your data is super useful!

@andreaspacher (Owner) commented Mar 17, 2021

In the CSV files, most of the encoding problems should now be largely fixed (with a few exceptions, e.g. some Chinese characters; I will look into these last few issues soon as well).

I added the fix for the erroneous hex codes in d1fb71b, and for most of the erroneous Unicode escapes in e2448c1.

I resorted to rather manual cleaning, as iconv() and other approaches (e.g. from the stringi library) did not work. The whole encoding was probably already mangled during the scraping (?).

Perhaps the fact that your code, @bmkramer, resulted in Children<U+0092>s Health, Dallas, United States indicates that there was too much "mojibake" to be fixed through automated means (if I am not mistaken; I am still a newbie in these matters).
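
An illustrative sketch of why automated repair may fail here (not the actual pipeline): once the bytes have been flattened into literal <e9>-style text, the strings are plain ASCII, so encoding converters have nothing left to convert:

library(stringi)
x <- "Universit<e9> de Nantes"           # the damage is now literal text
stri_enc_isascii(x)                      # TRUE: no encoding problem left to detect
iconv(x, from = "latin1", to = "UTF-8")  # returns the string unchanged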

And thank you, @bmkramer, for your suggestion regarding explicit reading/writing of CSV files in UTF-8. This is certainly helpful for the future; I have integrated it (e.g. in 5bf111e).

As regards the online version at https://openeditors.ooir.org, I will correct the data in a few days.

@andreaspacher (Owner) commented
I fixed most of the issues in both the CSV and the online web version.

A few Unicode escapes that I could not properly identify remain in the dataset; the same applies to names in Chinese characters, of which there were a few (though most often accompanied by pinyin transcriptions anyway). Most of them belong to the journals Bamboo and Silk, The China Nonprofit Review, and Rural China (all at Brill), as well as to some of the Frontiers journals.

As a note to myself, I used the following code to fix (as an example) the erroneous hex codes for the web version (in MySQL):

library(DBI)
library(RMariaDB)

# Connect to the database (credentials redacted).
dbcon <- dbConnect(MariaDB(), user = "AAXYZ", password = "AAXYZ",
                   dbname = "AAXYZ", host = "AAXYZ")

# Lookup table mapping the literal "<a0>" .. "<ff>" tokens to the Latin-1
# characters they stand for (0xa0-0xff), built programmatically instead of
# listing all 96 rows by hand.
codes <- 160:255
ascii <- data.frame(
  Hex    = sprintf("<%02x>", codes),
  Actual = intToUtf8(codes, multiple = TRUE),
  stringsAsFactors = FALSE
)
ascii$Actual[ascii$Hex == "<a0>"] <- " "    # non-breaking space -> plain space
ascii$Actual[ascii$Hex == "<ad>"] <- "SHY"  # soft hyphen kept as a visible marker

# Replace each token with its character in all four text columns.
for (i in seq_len(nrow(ascii))) {
  QUERY <- paste0("
  UPDATE openeditors SET
    journal = REPLACE(journal, '", ascii$Hex[i], "', '", ascii$Actual[i], "'),
    editor = REPLACE(editor, '", ascii$Hex[i], "', '", ascii$Actual[i], "'),
    role = REPLACE(role, '", ascii$Hex[i], "', '", ascii$Actual[i], "'),
    affiliation = REPLACE(affiliation, '", ascii$Hex[i], "', '", ascii$Actual[i], "');
  ")
  print(QUERY)

  dbExecute(dbcon, QUERY)

  Sys.sleep(3)  # pause between statements to keep the load on the server low
}

dbDisconnect(dbcon)
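
An alternative sketch (not the code actually used): the same token-to-character repair could be applied in R before the data ever reaches MySQL, via stringi's vectorised fixed-string replacement. Here, df is assumed to hold the scraped table:

library(stringi)
fix_tokens <- function(x) {
  # Apply all Hex -> Actual pairs to every string in x.
  stri_replace_all_fixed(x, ascii$Hex, ascii$Actual, vectorize_all = FALSE)
}
df[c("journal", "editor", "role", "affiliation")] <-
  lapply(df[c("journal", "editor", "role", "affiliation")], fix_tokens)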

@bmkramer (Contributor) commented
Thanks @andreaspacher for fixing the encoding issues! Unfortunately, something apparently still happens along the way that causes the CSVs to open with the Unicode/ASCII codes on my system (no idea why...), but the code you included makes it easy to redo the fixes and proceed :-)

I used this in af88e49 as part of a workflow to match editor affiliations to ROR IDs.

@andreaspacher reopened this Mar 20, 2021
@jeroenbaas commented
There seems to be something going on with encoding detection upstream. For instance, the title
Otolaryngology<U+0096>Head and Neck Surgery
is spelled "Otolaryngology–Head and Neck Surgery" on the website, but that dash is U+2013 (an en dash) in UTF-8, not U+0096; 0x96 is, however, the Windows-1252 byte for the en dash.
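
A quick check, assuming the bytes were originally Windows-1252:

x <- "Otolaryngology\x96Head and Neck Surgery"  # raw cp1252 byte for the dash
iconv(x, from = "windows-1252", to = "UTF-8")   # "Otolaryngology–Head and Neck Surgery"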
It looks like these encodings originate from the input journal list, so perhaps they should be flagged on https://github.com/andreaspacher/academic-publishers instead.
See for instance: https://raw.githubusercontent.com/andreaspacher/academic-publishers/main/Output/alljournals-2021-02-05.csv

It may actually stem from the Scopus reader, as that loads an xlsx file with Latin-1 encoding rather than UTF-8 (although I don't see the en dash in the Scopus list for this title, only the short ASCII dash). It is hard to tell exactly where it comes from, as the publishers repo doesn't store the individual CSV outputs, only the final merges.
