Option to drop section titles/headers #293

Matthieu-Tinycoaching · 2022-09-19T14:13:03Z

Hi,

When extracting a Wikipedia dump to JSON format with:
python -m wikiextractor.WikiExtractor -o simpleWikipedia --templates template --json --bytes 200M simplewiki-20220901-pages-articles.xml.bz2

I would like to remove all subsection titles/headers and keep only the textual paragraphs of the corpus (e.g. remove the "The Month" and "April in poetry" titles from this page: https://simple.wikipedia.org/wiki/April).

Is there an option, or a simple change to the code, that would discard headers/titles?

Thanks!
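
One possible workaround, sketched here rather than taken from wikiextractor itself, is to strip the heading lines from the dump's wikitext before running the extractor. The sketch assumes section headings appear on their own line in the usual MediaWiki form (`== Title ==`, `=== Title ===`, and so on). The names `HEADING_RE`, `drop_heading_lines`, and `filter_dump` are hypothetical helpers, and whether the extractor behaves identically on the filtered dump has not been verified.

```python
import bz2
import re

# Assumption: a heading is a line made of 2-6 '=' signs, a title, and a
# closing run of '=' signs, optionally followed by spaces or tabs.
# Single-'=' (level 1) headings are deliberately not matched.
HEADING_RE = re.compile(r"^(={2,6})[^\n]*?\1[ \t]*$")


def drop_heading_lines(lines):
    """Yield every line except those that look like wikitext section headings."""
    for line in lines:
        if not HEADING_RE.match(line.rstrip("\n")):
            yield line


def filter_dump(src_path, dst_path):
    """Copy a .xml.bz2 dump to dst_path, skipping heading lines (hypothetical helper)."""
    with bz2.open(src_path, "rt", encoding="utf-8") as fin, \
         bz2.open(dst_path, "wt", encoding="utf-8") as fout:
        fout.writelines(drop_heading_lines(fin))
```

The filtered file could then be passed to the same `python -m wikiextractor.WikiExtractor ...` command in place of the original dump. A heading that shares a line with the opening `<text ...>` or closing `</text>` tag would not be caught by this pattern.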

Matthieu-Tinycoaching · 2023-02-22T13:11:30Z

Hi,

@attardi any idea on how to deal with these?

Thanks!
