Duplicate text output #42

Open
adbar opened this issue Oct 20, 2021 · 2 comments

@adbar
Contributor

adbar commented Oct 20, 2021

jusText outputs the title of this webpage twice:

https://wiadomosci.gazeta.pl/wiadomosci/7,114883,27025667,ziemniaki-na-szostej-surowka-na-dziesiatej-jak-pomoc-zeby.html
(archived as https://web.archive.org/web/20211020174043/https://wiadomosci.gazeta.pl/wiadomosci/7,114883,27025667,ziemniaki-na-szostej-surowka-na-dziesiatej-jak-pomoc-zeby.html)

The rest of the extraction is not completely clean either (e.g. "REKLAMA" elements, i.e. advertisement labels).

@miso-belica
Owner

I fixed some issues in the main branch, and now running python -m justext -s Polish "https://wiadomosci.gazeta.pl/wiadomosci/7,114883,27025667,ziemniaki-na-szostej-surowka-na-dziesiatej-jak-pomoc-zeby.html" gets you, I think, what you expect. The title "Ziemniaki na szóstej, surówka na dziesiątej". Jak pomagać, żeby nie zaszkodzić? [PORADNIK W PIGUŁCE] appears twice in the original HTML too, and jusText has no deduplication logic. IMHO jusText is intended for creating corpora, and some duplication there is not so bad. Some deduplication would be nice, but I don't have the motivation to implement it because I no longer use jusText in my projects.

@adbar
Contributor Author

adbar commented Oct 21, 2021

OK, I understand, I'll see what I can do.
