Duplicate text output #42

adbar · 2021-10-20T17:41:20Z

jusText outputs the title of this webpage twice:

https://wiadomosci.gazeta.pl/wiadomosci/7,114883,27025667,ziemniaki-na-szostej-surowka-na-dziesiatej-jak-pomoc-zeby.html
(archived as https://web.archive.org/web/20211020174043/https://wiadomosci.gazeta.pl/wiadomosci/7,114883,27025667,ziemniaki-na-szostej-surowka-na-dziesiatej-jak-pomoc-zeby.html)

The rest of the extraction is not completely clean either (e.g. "REKLAMA" elements).

miso-belica · 2021-10-21T16:37:14Z

I fixed some issues in the main branch, so if I now run python -m justext -s Polish "https://wiadomosci.gazeta.pl/wiadomosci/7,114883,27025667,ziemniaki-na-szostej-surowka-na-dziesiatej-jak-pomoc-zeby.html" I think it gets you what you expect. The title "Ziemniaki na szóstej, surówka na dziesiątej". Jak pomagać, żeby nie zaszkodzić? [PORADNIK W PIGUŁCE] appears twice in the original HTML too, and there is no deduplication logic. jusText is intended for creating corpora, IMHO, and some duplication there is not so bad. It would be nice to have some deduplication, but I don't have the motivation to do it because I no longer use jusText in my projects.
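
For illustration, a minimal sketch of what caller-side deduplication could look like, applied to paragraph texts after extraction. This is not part of jusText: the helper name deduplicate_paragraphs and the normalization (Unicode NFKC, whitespace collapsing, case folding) are assumptions chosen for the example, and it only catches exact repeats after normalization, not near-duplicates.

```python
import re
import unicodedata


def _normalize(text):
    """Build a comparison key: NFKC-normalize, collapse whitespace, casefold."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip().casefold()


def deduplicate_paragraphs(texts):
    """Return texts with later repeats dropped (hypothetical helper).

    Order is preserved and the first occurrence wins. Paragraphs that are
    empty after normalization are skipped as well.
    """
    seen = set()
    unique = []
    for text in texts:
        key = _normalize(text)
        if not key or key in seen:
            continue
        seen.add(key)
        unique.append(text)
    return unique
```

With jusText, the input list could presumably be built from the paragraphs it returns, e.g. [p.text for p in justext.justext(html, justext.get_stoplist("Polish")) if not p.is_boilerplate], and then passed through deduplicate_paragraphs.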

adbar · 2021-10-21T16:43:18Z

OK, I understand, I'll see what I can do.

miso-belica added the wont-fix label Oct 21, 2021

adbar mentioned this issue Oct 21, 2021

Thoroughly implement and test duplicate detection adbar/trafilatura#3 (Open, 4 tasks)
