Adding latvian sentence cleaners #126

Open
wants to merge 1 commit into
base: master

Conversation

raivisdejus

Adding Latvian cleaners to filter out sentences with broken encoding.

@HarikalarKutusu
Contributor

Good idea @raivisdejus. I think you are trying to correct this:

The "?" characters inside words were caused by an encoding issue during the import from the old sentence collector: Unicode characters in many languages were replaced by "?". Some of these sentences were still recorded by volunteers because they remained human-readable.

If this is the case, I think it should be corrected for all languages (perhaps not for es when the sentence starts with "?"; are there any other such languages?).

@raivisdejus
Author

@HarikalarKutusu You are correct, I am fixing issues with the encoding of special characters. I created another PR that would validate this case in all languages. Currently it does not include any special handling for Spanish; see the considerations for this in the other PR.
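
For illustration only, the kind of check discussed in this thread could be sketched roughly as follows. This is not the code from either PR: the names `BROKEN_ENCODING`, `has_broken_encoding` and `filter_sentences` are hypothetical, and the sketch assumes that a "?" with a letter directly on both sides is a sign of a lost character. It deliberately ignores a "?" at the start or end of a word, so ordinary questions and the Spanish sentence-initial case are left alone.

```python
import re

# Assumption: one or more "?" with a Unicode letter on each side marks a
# character lost to the encoding issue. [^\W\d_] matches any letter.
BROKEN_ENCODING = re.compile(r"[^\W\d_]\?+[^\W\d_]")


def has_broken_encoding(sentence: str) -> bool:
    """Return True if the sentence contains a "?" embedded inside a word."""
    return BROKEN_ENCODING.search(sentence) is not None


def filter_sentences(sentences):
    """Keep only the sentences that do not look damaged."""
    return [s for s in sentences if not has_broken_encoding(s)]


if __name__ == "__main__":
    samples = ["Labr?t!", "Labrīt!", "Kā tev iet?"]
    for kept in filter_sentences(samples):
        print(kept)
```

A check like this would miss damaged characters at the very start or end of a word, since those cannot be told apart from real punctuation by position alone.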

2 participants