
Korean Language Support #98

Open · wants to merge 1 commit into master
Conversation

StrangeFate commented Jan 14, 2022

#71
Because of the structure of the Korean language, it needs a different tokenizer rather than a whitespace tokenizer. Since konlpy is one of the most popular Korean NLP Python packages, I've added konlpy to handle tokenization and number removal.

test.txt
test_preprocessed.txt

Tested on Colab with 5,000 brief Korean articles.

[Screenshots GF1, GF2: capture from Colab]

I did read the contributing guidelines, but since this is my first contribution on GitHub, some required information may be missing. Feel free to request anything additional.

Want to know why the Korean language can't be tokenized with a whitespace tokenizer?
Check out https://en.wikipedia.org/wiki/Korean_grammar#Substantives
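To illustrate the point: Korean particles attach directly to the noun, so a whitespace tokenizer treats every noun+particle combination as a distinct token. Below is a minimal sketch of the problem and of the number-removal step; the konlpy call is only shown in a comment (it needs a Java runtime and is not part of this runnable snippet), and the example sentence is my own, not from the PR's test data.

```python
import re

# Illustrative sentence: "학교에" = "학교" (school) + particle "에" (to).
# A whitespace tokenizer keeps the particle glued to the noun, so "학교에",
# "학교를", and "학교가" all become distinct tokens despite sharing one stem.
sentence = "나는 학교에 갔다 123"

whitespace_tokens = sentence.split()
# → ['나는', '학교에', '갔다', '123']

# A morphological analyzer such as konlpy's Okt splits stems from particles
# (sketched only, since konlpy requires Java):
#   from konlpy.tag import Okt
#   Okt().morphs("나는 학교에 갔다")  # e.g. ['나', '는', '학교', '에', '갔다']

# Number removal, as in this PR's preprocessing, can be a simple token filter:
def remove_numbers(tokens):
    return [t for t in tokens if not re.fullmatch(r"\d+", t)]

print(remove_numbers(whitespace_tokens))  # ['나는', '학교에', '갔다']
```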

vinid (Contributor) commented Jan 17, 2022

Hello @StrangeFate,

this looks great. I'd like to place this in a specific location in the source and install it via a selective install. The point is, I'd also like to add tokenizers for other languages, but I don't want to force people to install all the tokenizers.

Give me a few days to think about this and thanks for the great work :)
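The selective install vinid describes is commonly done with setuptools "extras" plus a lazy import. The sketch below is purely illustrative: the extra name "ko", the package name in the comment, and the helper function are all assumptions, not the project's actual configuration.

```python
# Hypothetical extras mapping for a setup.py, so konlpy is only installed
# when a user explicitly asks for Korean support. Names are assumptions.
extras_require = {
    "ko": ["konlpy"],      # e.g. pip install contextualized-topic-models[ko]
    # further language extras ("ja", "zh", ...) could be added the same way
}

# Inside the library, the optional dependency is then imported lazily, so
# importing the package itself never requires konlpy:
def get_korean_tokenizer():
    try:
        from konlpy.tag import Okt
    except ImportError as exc:
        raise ImportError(
            "Korean support requires the 'ko' extra: pip install <package>[ko]"
        ) from exc
    return Okt().morphs
```

With this layout, users who never touch Korean text pay no dependency cost, and each new language tokenizer becomes one more extras entry.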

StrangeFate (Author) commented Jan 17, 2022

Hello @vinid,

I agree that forcing people to install all the tokenizers is not the appropriate way; that would be too much for users. (I realized this as soon as I began modifying the code, but I'm something of a newbie to Python packaging, so I couldn't come up with a better solution. :/ ) Anyway, I agree with what you said, and I'd be grateful if you could consider a better way to implement this.

Thanks for the great work!

@StrangeFate StrangeFate deleted the patch-2 branch January 17, 2022 07:55
@StrangeFate StrangeFate restored the patch-2 branch January 17, 2022 07:56
StrangeFate (Author) commented Jan 17, 2022

Had a problem while changing the branch name; reopening.

@StrangeFate StrangeFate reopened this Jan 17, 2022