Est_republicaine Corpus not found #110

BaderEddineB · 2020-08-25T12:42:23Z

Hello,
I'm trying to download the est_republicain corpus to train a French language model with KenLM. When I click on the link, I get this error: "nginx error! The page you are looking for is not found".
Any idea where I can get this corpus?
Thanks

svenha · 2020-08-26T08:48:42Z

This seems to be a problem with https://cnrtl.fr/. I just emailed them a bug report.

BaderEddineB · 2020-08-26T08:56:04Z

OK, thanks. I just found another download link: https://repository.ortolang.fr/api/content/export?&path=/est_republicain/4/&filename=est_republicain&scope=YW5vbnltb3Vz3
Is it the same corpus as the one from cnrtl.fr?

BaderEddineB · 2020-08-27T07:25:44Z

svenha · 2020-08-28T09:52:16Z

Someone from cnrtl.fr answered my question. The official new web site for this corpus is https://www.ortolang.fr/market/corpora/est_republicain
Version 4 from 2020-07-22 is the latest.

BaderEddineB · 2020-08-28T14:36:40Z

Thank you very much. It looks a bit like the one I found (see the pictures above).
But when I run the following command to extract the titles and paragraphs into the text file "est_republicain.txt", the extraction does not work correctly:

xmllint --xpath '//*[local-name()="div"][@type="article"]//*[local-name()="p" or local-name()="head"]/text()' Year*/*.xml | perl -pe 's/^ +//g; s/^(.+)/$1\n/g; chomp' > est_republicain.txt

Here is an example of the resulting "est_republicain.txt" file:

Is this normal? What is the problem?

pguyot · 2022-06-12T07:23:32Z

The file format may have changed. The idea is to extract the text only, and what you get is nearly what we need. You still need to replace all SGML entities with the characters they stand for.

See https://serverfault.com/questions/440805/how-can-i-easily-convert-html-special-entities-from-a-standard-input-stream-in-l
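As a rough illustration of the entity replacement step, here is a minimal Python sketch. It assumes the leftover entities are ordinary HTML/SGML character references (such as `&eacute;` or `&#233;`) that the standard library's `html.unescape` recognizes; the function name `decode_entities` and the sample line are made up for this example and are not taken from the corpus.

```python
import html
from typing import Iterable, Iterator


def decode_entities(lines: Iterable[str]) -> Iterator[str]:
    """Yield each line with character references replaced by real characters."""
    for line in lines:
        # html.unescape handles named (&eacute;) and numeric (&#233;) references.
        yield html.unescape(line)


# Hypothetical sample line, only to show the effect of the conversion.
sample = ["L&apos;Est R&eacute;publicain &amp; la r&#233;gion\n"]
decoded = list(decode_entities(sample))
print(decoded[0], end="")
```

To process the extracted file, the same function could be fed an open file object (for example `open("est_republicain.txt", encoding="utf-8")`) and its output written to a new file. If the corpus uses entities outside the HTML set, they would pass through unchanged and need separate handling.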
