Non ASCII characters support #1

Open
yannnic opened this issue Mar 13, 2021 · 7 comments

@yannnic commented Mar 13, 2021

Hi,
Bravo for this great initiative!
I suppose you already know that non-ASCII strings are not well supported in your data. They seem to be filtered out of the strings: erased or replaced.
Examples from this search (https://openeditors.ooir.org/index.php?editor_query=Nantes):
  • Journal title: 'Archives de Pdiatrie' should be 'Archives de Pédiatrie' > character erased
  • University name: 'Universit de Nantes; Nantes, France' should be 'Université de Nantes; Nantes, France' > character erased
  • Editor name: 'Francois Galgani' should be 'François Galgani' > character 'ç' replaced by 'c'

If all characters could be preserved in Unicode, it would be perfect!

@andreaspacher (Owner) commented Mar 15, 2021

Hello,

Thank you for pointing this out.

It seems to be an encoding issue for which I cannot find a quick fix, but I will keep trying.

Just to make sure:

  • The characters are not erased, but converted to, for example, a <fc> (instead of ü) or an <e9> (instead of é). These look like Latin-1 hex codes (see the sketch below). The conversions, in turn, are not shown in the web version because the browser treats them as HTML tags.
  • As regards the Galgani case, the journals themselves list him as "Francois" (such as here in Marine Pollution Bulletin), so the error lies with the journals and not with Open Editors. Otherwise it would have shown "Franois Galgani", I fear.
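
A minimal sketch of the apparent failure mode (assuming Latin-1 source bytes; this is not the scraper's actual code):

x <- "Universit\xe9 de Nantes"            # 0xe9 is the Latin-1 byte for 'é'
iconv(x, from = "latin1", to = "UTF-8")   # "Université de Nantes"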

Anyway, I will continue looking for solutions, and I thank you for having pointed it out.

@yannnic (Author) commented Mar 16, 2021

Thanks a lot for your reply and your best efforts!
Yann

@bmkramer (Contributor) commented Mar 17, 2021

@andreaspacher I ran into the same issue when working with the CSV files.

Thinking about a solution: could you perhaps try to specify the encoding as UTF-8 when writing the data to CSV with write.csv?

As such:
write.csv(df, "Output/editors.csv", fileEncoding = "UTF-8")

Also, when you read the current file(s) back into R on your system, do the special characters display correctly for you?
E.g. does
df <- read.csv("Output/editors1.csv")
print(df$affiliation[1])
result in
"Children<U+0092>s Health, Dallas, United States" or "Children’s Health, Dallas, United States" ?

Happy to try and help troubleshoot this further, as your data is super useful!

@andreaspacher (Owner) commented Mar 17, 2021

In the CSV files, most of the encoding problems should now be largely fixed (with a few exceptions, e.g. some Chinese characters; I will look into these last few issues soon as well).

I added the fix for the erroneous hex codes in d1fb71b, and for most of the erroneous Unicode escapes in e2448c1.

I resorted to rather manual cleaning, as iconv() and other approaches (e.g. from the stringi library) did not work. The whole encoding was probably already mangled during the scraping (?).

Perhaps the fact that your code, @bmkramer, resulted in Children<U+0092>s Health, Dallas, United States indicates that there was too much "mojibake" to be fixed through automated means (if I am not mistaken; I am still a newbie in these matters).
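
An illustrative sketch of why automated repair may fail here (not the actual pipeline): once the bytes have been flattened into literal <e9>-style text, the strings are plain ASCII, so encoding converters have nothing left to convert:

library(stringi)
x <- "Universit<e9> de Nantes"           # the damage is now literal text
stri_enc_isascii(x)                      # TRUE: no encoding problem left to detect
iconv(x, from = "latin1", to = "UTF-8")  # returns the string unchanged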

And thank you, @bmkramer, for your suggestion regarding explicit reading/writing of CSV files in UTF-8. This is certainly helpful for the future; I have integrated it (e.g. in 5bf111e).

As regards the online version at https://openeditors.ooir.org, I will correct the data in a few days.

@andreaspacher (Owner) commented
I fixed most of the issues in both the CSV and the online web version.

A few Unicode escapes that I could not properly identify remain in the dataset; the same applies to names in Chinese characters, of which there were a few (though most often accompanied by pinyin transcriptions anyway). Most of them belong to the journals Bamboo and Silk, The China Nonprofit Review, and Rural China (all at Brill), as well as to some of the Frontiers journals.

As a note to myself, I used the following code to fix (as an example) the erroneous hex codes for the web version (in MySQL):

library(DBI)
library(RMariaDB)

# Connect to the database (credentials redacted).
dbcon <- dbConnect(MariaDB(), user = "AAXYZ", password = "AAXYZ",
                   dbname = "AAXYZ", host = "AAXYZ")

# Lookup table mapping the literal "<a0>" .. "<ff>" tokens to the Latin-1
# characters they stand for (0xa0-0xff), built programmatically instead of
# listing all 96 rows by hand.
codes <- 160:255
ascii <- data.frame(
  Hex    = sprintf("<%02x>", codes),
  Actual = intToUtf8(codes, multiple = TRUE),
  stringsAsFactors = FALSE
)
ascii$Actual[ascii$Hex == "<a0>"] <- " "    # non-breaking space -> plain space
ascii$Actual[ascii$Hex == "<ad>"] <- "SHY"  # soft hyphen kept as a visible marker

# Replace each token with its character in all four text columns.
for (i in seq_len(nrow(ascii))) {
  QUERY <- paste0("
  UPDATE openeditors SET
    journal = REPLACE(journal, '", ascii$Hex[i], "', '", ascii$Actual[i], "'),
    editor = REPLACE(editor, '", ascii$Hex[i], "', '", ascii$Actual[i], "'),
    role = REPLACE(role, '", ascii$Hex[i], "', '", ascii$Actual[i], "'),
    affiliation = REPLACE(affiliation, '", ascii$Hex[i], "', '", ascii$Actual[i], "');
  ")
  print(QUERY)

  dbExecute(dbcon, QUERY)

  Sys.sleep(3)  # pause between statements to keep the load on the server low
}

dbDisconnect(dbcon)
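
An alternative sketch (not the code actually used): the same token-to-character repair could be applied in R before the data ever reaches MySQL, via stringi's vectorised fixed-string replacement. Here, df is assumed to hold the scraped table:

library(stringi)
fix_tokens <- function(x) {
  # Apply all Hex -> Actual pairs to every string in x.
  stri_replace_all_fixed(x, ascii$Hex, ascii$Actual, vectorize_all = FALSE)
}
df[c("journal", "editor", "role", "affiliation")] <-
  lapply(df[c("journal", "editor", "role", "affiliation")], fix_tokens)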

@bmkramer (Contributor) commented
Thanks @andreaspacher for fixing the encoding issues! Unfortunately, something apparently still happens along the way that causes the CSVs to open with the Unicode/ASCII codes on my system (no idea why...), but the code you included makes it easy to redo the fixes and proceed :-)

I used this in af88e49 as part of a workflow to match editor affiliations to ROR IDs.

@andreaspacher reopened this Mar 20, 2021
@jeroenbaas commented
There seems to be something going on with encoding detection upstream. For instance, the title
Otolaryngology<U+0096>Head and Neck Surgery
is spelled "Otolaryngology–Head and Neck Surgery" on the website, but that dash is U+2013 (an en dash) in UTF-8, not U+0096; 0x96 is, however, the Windows-1252 byte for the en dash.
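
A quick check, assuming the bytes were originally Windows-1252:

x <- "Otolaryngology\x96Head and Neck Surgery"  # raw cp1252 byte for the dash
iconv(x, from = "windows-1252", to = "UTF-8")   # "Otolaryngology–Head and Neck Surgery"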
It looks like these encodings originate from the input journal list, so perhaps they should be flagged on https://github.com/andreaspacher/academic-publishers instead.
See for instance: https://raw.githubusercontent.com/andreaspacher/academic-publishers/main/Output/alljournals-2021-02-05.csv

It may actually stem from the Scopus reader, as that loads an xlsx file with Latin-1 encoding rather than UTF-8 (although I don't see the en dash in the Scopus list for this title, only the short ASCII dash). It is hard to tell exactly where it comes from, as the publishers repo doesn't store the individual CSV outputs, only the final merges.
