[BUG] Non-unique entries in validated_sentences.tsv #4408

Open
HarikalarKutusu opened this issue Mar 25, 2024 · 0 comments
Labels
Bug, Text Corpus

Describe the bug
When I analyzed the text corpus files from v17.0, I found that some sentence_id values are duplicated in many of the locales. I don't know the exact cause, or whether there is a systematic source.

If someone can direct me to the related code, I can have a look to find the reason.

To Reproduce
Steps to reproduce the behavior: Load the validated_sentences.tsv files with pandas and list the rows whose sentence_id occurs more than once.
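A minimal sketch of that check is below. The helper name `find_duplicate_ids`, the `sentence` column, and the inline sample rows are made up for illustration; only `sentence_id`, `is_used`, and `clips_count` come from this report. For a real file, the `io.StringIO(...)` argument would be replaced by the path to a locale's validated_sentences.tsv.

```python
import io

import pandas as pd


def find_duplicate_ids(df: pd.DataFrame, id_col: str = "sentence_id") -> pd.DataFrame:
    """Return every row whose id occurs more than once, with copies adjacent."""
    # keep=False marks all copies of a repeated id, not just the later ones
    mask = df[id_col].duplicated(keep=False)
    return df[mask].sort_values(id_col, kind="stable")


# Made-up sample standing in for validated_sentences.tsv (hypothetical rows)
SAMPLE_TSV = (
    "sentence_id\tsentence\tis_used\tclips_count\n"
    "a1\tfoo\t1\t3\n"
    "a1\tfoo\t1\t5\n"
    "b2\tbar\t0\t0\n"
    "c3\tbaz\t1\t2\n"
    "c3\tbaz\t1\t2\n"
)

sample = pd.read_csv(io.StringIO(SAMPLE_TSV), sep="\t", dtype={"sentence_id": str})
dups = find_duplicate_ids(sample)
print(dups)
print("duplicated ids:", dups["sentence_id"].nunique())
```

Reading `sentence_id` as a string avoids any accidental numeric parsing of the hash values.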

Expected behavior
A sentence_id should appear only once (assuming there is no hash collision, which is unlikely).

Screenshots
Example from ka locale:
image

sentence_id values shown in the screenshot:
01c596b07467cfe5c99b5a1341891404ee80d3bcb81521f6421507d464cd50de
01c6552a75163a62e3ca06f8ca68e024083351a03a7d0213d5c0a0947cd95464

Additional context

  • This can be deduplicated at the application level, but be careful: in some of the duplicates the is_used and/or clips_count values seem to differ. So, for example, when using pandas DataFrames, one should not deduplicate whole rows, but extract the sentence_id values and make those unique.
  • If the above procedure is used, some information can be lost, since the differing is_used and clips_count values are not retained.
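
The difference between the two approaches can be sketched as follows. The helper names `split_duplicates` and `unique_ids` and the sample rows are hypothetical; which of the conflicting is_used / clips_count values is the correct one is not known from this report, so the sketch only separates the conflicting ids instead of picking a winner.

```python
import pandas as pd


def split_duplicates(df: pd.DataFrame, id_col: str = "sentence_id"):
    """Split duplicated ids into exact-copy ids and ids with conflicting rows."""
    dup = df[df[id_col].duplicated(keep=False)]
    # After dropping identical rows, an id that still has more than one row
    # differs in at least one other column (e.g. is_used or clips_count)
    distinct_rows = dup.drop_duplicates().groupby(id_col).size()
    exact = set(distinct_rows[distinct_rows == 1].index)
    conflicting = set(distinct_rows[distinct_rows > 1].index)
    return exact, conflicting


def unique_ids(df: pd.DataFrame, id_col: str = "sentence_id") -> list:
    """Unique ids in first-seen order; the other columns are discarded."""
    return df[id_col].drop_duplicates().tolist()


# Made-up rows: "a1" differs in clips_count, "c3" is an exact copy
sample = pd.DataFrame(
    {
        "sentence_id": ["a1", "a1", "b2", "c3", "c3"],
        "is_used": [1, 1, 0, 1, 1],
        "clips_count": [3, 5, 0, 2, 2],
    }
)

exact, conflicting = split_duplicates(sample)
rows_after_row_dedup = len(sample.drop_duplicates())  # "a1" still appears twice
ids = unique_ids(sample)
print(exact, conflicting, rows_after_row_dedup, ids)
```

Whole-row deduplication leaves the conflicting ids in place, while deduplicating on sentence_id alone removes them at the cost of the differing column values. Keeping the conflicting set around makes it possible to inspect those rows before deciding how to resolve them.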