Suggestion for extracting CNRTL Est Républicain Corpus #99

tattorba87 · 2020-02-28T16:31:04Z

Instead of using:

xmllint --xpath '//*[local-name()="div"][@type="article"]//*[local-name()="p" or local-name()="head"]/text()' Annee*/*.xml | perl -pe 's/^ +//g ; s/^ (.+)/$1\n/g ; chomp' > est_republicain.txt

this seems to work better:

xmlstarlet sel -t -v '//*[local-name()="div"][@type="article"]//*[local-name()="p" or local-name()="head"]/text()' Annee*/*.xml | perl -pe 's/^ +//g ; s/^ (.+)/$1\n/g ; chomp' > est_republicain.txt

xmllint was replacing several French characters with their hex format; xmlstarlet doesn't seem to have this issue.

tattorba87 · 2020-03-22T22:07:40Z

Or even better:

xmlstarlet sel -t -m '//*[local-name()="div"][@type="article"]//*[local-name()="p" or local-name()="head"]/text()' -n --var linebreak -n --break -v 'translate(., $linebreak, "")' Annee*/*.xml | perl -pe 's/^ +//g ; s/^ (.+)/$1\n/g; s/ +/ /g' > est_republicain.txt
