
Korean Language Support #98

Open · wants to merge 1 commit into master
Conversation

StrangeFate commented Jan 14, 2022

#71
Because of the structure of the Korean language, it needs a different tokenizer rather than a whitespace tokenizer. Since konlpy is one of the most popular Korean NLP Python packages, I've added konlpy to handle tokenization and number removal.

test.txt
test_preprocessed.txt

Tested on Colab with 5,000 brief Korean articles.

[Screenshots GF1, GF2: capture from Colab]

I did read the contributing guidelines, but since this is my first contribution on GitHub, some required information may be missing. Feel free to request anything additional.

Want to know why the Korean language can't be tokenized with a whitespace tokenizer?
Check out https://en.wikipedia.org/wiki/Korean_grammar#Substantives
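To illustrate the point: Korean particles attach directly to the noun, so a whitespace tokenizer treats every noun+particle combination as a distinct token. Below is a minimal sketch of the problem and of the number-removal step; the konlpy call is only shown in a comment (it needs a Java runtime and is not part of this runnable snippet), and the example sentence is my own, not from the PR's test data.

```python
import re

# Illustrative sentence: "학교에" = "학교" (school) + particle "에" (to).
# A whitespace tokenizer keeps the particle glued to the noun, so "학교에",
# "학교를", and "학교가" all become distinct tokens despite sharing one stem.
sentence = "나는 학교에 갔다 123"

whitespace_tokens = sentence.split()
# → ['나는', '학교에', '갔다', '123']

# A morphological analyzer such as konlpy's Okt splits stems from particles
# (sketched only, since konlpy requires Java):
#   from konlpy.tag import Okt
#   Okt().morphs("나는 학교에 갔다")  # e.g. ['나', '는', '학교', '에', '갔다']

# Number removal, as in this PR's preprocessing, can be a simple token filter:
def remove_numbers(tokens):
    return [t for t in tokens if not re.fullmatch(r"\d+", t)]

print(remove_numbers(whitespace_tokens))  # ['나는', '학교에', '갔다']
```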

vinid (Contributor) commented Jan 17, 2022

Hello @StrangeFate,

this looks great. I'd like to place this in a specific location in the source and install it via a selective install. The point is, I'd also like to add tokenizers for other languages, but I don't want to force people to install all the tokenizers.

Give me a few days to think about this and thanks for the great work :)
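The selective install vinid describes is commonly done with setuptools "extras" plus a lazy import. The sketch below is purely illustrative: the extra name "ko", the package name in the comment, and the helper function are all assumptions, not the project's actual configuration.

```python
# Hypothetical extras mapping for a setup.py, so konlpy is only installed
# when a user explicitly asks for Korean support. Names are assumptions.
extras_require = {
    "ko": ["konlpy"],      # e.g. pip install contextualized-topic-models[ko]
    # further language extras ("ja", "zh", ...) could be added the same way
}

# Inside the library, the optional dependency is then imported lazily, so
# importing the package itself never requires konlpy:
def get_korean_tokenizer():
    try:
        from konlpy.tag import Okt
    except ImportError as exc:
        raise ImportError(
            "Korean support requires the 'ko' extra: pip install <package>[ko]"
        ) from exc
    return Okt().morphs
```

With this layout, users who never touch Korean text pay no dependency cost, and each new language tokenizer becomes one more extras entry.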

StrangeFate (Author) commented Jan 17, 2022

Hello @vinid,

I agree that forcing people to install all the tokenizers is not the appropriate way; that would be too much for users. (I realized this as soon as I began modifying the code, but I'm something of a newbie to Python packaging, so I couldn't come up with a better solution. :/ ) Anyway, I agree with what you said, and I'd be grateful if you could consider a better way to implement this.

Thanks for the great work!

@StrangeFate StrangeFate deleted the patch-2 branch January 17, 2022 07:55
@StrangeFate StrangeFate restored the patch-2 branch January 17, 2022 07:56
StrangeFate (Author) commented Jan 17, 2022

Had a problem while changing the branch name; reopening.

@StrangeFate StrangeFate reopened this Jan 17, 2022