Est_republicaine Corpus not found #110

Open
BaderEddineB opened this issue Aug 25, 2020 · 6 comments

Comments

@BaderEddineB

Hello,
I'm trying to download the est_republicain corpus to train the French language model with KenLM. When I click the link, I get this error: "nginx error! The page you are looking for is not found".
Any idea where I can get this corpus?
Thanks

@svenha
Contributor

svenha commented Aug 26, 2020

This seems to be a problem of https://cnrtl.fr/ . I just mailed them a bug report.

@BaderEddineB
Author

OK, thanks. I just found another download link: https://repository.ortolang.fr/api/content/export?&path=/est_republicain/4/&filename=est_republicain&scope=YW5vbnltb3Vz3
Is it the same corpus as the one on cnrtl.fr?

@BaderEddineB
Author

[Screenshots: est_repeb2, est_repeb]

@svenha
Contributor

svenha commented Aug 28, 2020

Someone from cnrtl.fr answered my question. The official new web site for this corpus is https://www.ortolang.fr/market/corpora/est_republicain
Version 4 from 2020-07-22 is the latest.

@BaderEddineB
Author

BaderEddineB commented Aug 28, 2020

Thank you very much. It looks a bit like the one I found (the pictures above).
But when I run the following command to extract the titles and paragraphs into the text file "est_republicain.txt", the extraction does not go well:

xmllint --xpath '//*[local-name()="div"][@type="article"]//*[local-name()="p" or local-name()="head"]/text()' Year*/*.xml | perl -pe 's/^ +//g; s/^(.+)/$1\n/g; chomp' > est_republicain.txt

Here is an example of the resulting "est_republicain.txt" file:
[Screenshot: Capturekk]

Is this normal? What is the problem?
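
For readers who would rather not depend on the xmllint/perl pipeline, here is a minimal Python sketch of the same extraction idea: take the direct text of every "p" and "head" element inside a "div" whose type is "article", matching on local names so the namespace does not matter. The function names, the sample document, and the TEI namespace in it are assumptions made for illustration; the real corpus files may be laid out differently.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix so matching works like XPath local-name()."""
    return tag.rsplit("}", 1)[-1]


def direct_text(el):
    """Join the text nodes that are direct children of el (what XPath text() selects)."""
    parts = [el.text or ""]
    parts.extend(child.tail or "" for child in el)
    return " ".join("".join(parts).split())  # collapse runs of whitespace


def extract_article_text(root):
    """Return one cleaned line per <p> or <head> found inside <div type="article">."""
    lines = []
    seen = set()  # avoid duplicates if article divs are nested
    for div in root.iter():
        if local_name(div.tag) != "div" or div.get("type") != "article":
            continue
        for el in div.iter():
            if local_name(el.tag) in ("p", "head") and id(el) not in seen:
                seen.add(id(el))
                text = direct_text(el)
                if text:
                    lines.append(text)
    return lines


# Made-up sample (hypothetical), not taken from the corpus.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div type="article"><head>Titre &#xE9;t&#xE9;</head><p>Premier <hi>mot</hi> paragraphe.</p></div>
<div type="other"><p>ignored</p></div>
</body></text></TEI>"""

print(extract_article_text(ET.fromstring(SAMPLE)))
```

For files on disk, `ET.parse(path).getroot()` would supply the root element. Because the XML parser decodes numeric character references such as `&#xE9;` while reading, the returned lines should already contain plain characters. Unlike the shell pipeline, this sketch joins the text nodes of one element into a single line, so the output is close to but not identical with what the original command produces.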

@pguyot
Contributor

pguyot commented Jun 12, 2022

The file format might have changed. The idea is to extract only the text, and what you get is nearly what we need. You need to replace all the SGML entities.

See https://serverfault.com/questions/440805/how-can-i-easily-convert-html-special-entities-from-a-standard-input-stream-in-l
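
One possible way to do that replacement, sketched here in Python as an alternative to the approaches in the linked answer, is the standard library's `html.unescape`, which decodes both named entities and numeric character references. The function name `decode_entities` is hypothetical, and this assumes the leftover entities in est_republicain.txt are ones that `html.unescape` recognizes.

```python
import html


def decode_entities(lines):
    """Yield each line with HTML/SGML-style entities replaced by their characters."""
    for line in lines:
        yield html.unescape(line)


# Made-up input line for illustration.
for decoded in decode_entities(["L&#xE9;t&#233; &amp; l&eacute;\n"]):
    print(decoded, end="")
```

Since the function accepts any iterable of lines, an open file object can be passed to it directly and the decoded lines written to a second file, which keeps memory use low for a large corpus.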
