NLTK for vietnamese #995

rain1024 · 2015-05-31T04:13:09Z

Does NLTK support the Vietnamese language?

If it doesn't, how can I contribute to make NLTK support Vietnamese?

It would work like this:

>>> import nltk
>>> sentence = "Vào tám giờ sáng thứ sáu, tôi cảm thấy không được khỏe."
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['Vào', 'tám', 'giờ', 'sáng', 'thứ sáu', ',', 'tôi', 'cảm thấy', 'không', 'được', 'khỏe', '.']
>>> tagged = nltk.pos_tag(tokens)
>>> tagged[0:5]
[('Vào', 'IN'), ('tám', 'CD'), ('giờ', 'JJ'), ('sáng', 'NN'), ('thứ sáu', 'NNP')]

longdt219 · 2015-06-08T00:34:11Z

Hi @stevenbird,
What do you think? We could probably port these:
http://jvntextpro.sourceforge.net/

stevenbird · 2015-06-09T23:23:08Z

@rain1024 would you like to do some porting, or contribute wrappers for external Java libraries?

rain1024 · 2015-06-10T01:52:25Z

@stevenbird: Yes, I'd be glad to do this.

@longdt219: can we do this together?

longdt219 · 2015-06-10T03:02:52Z

Yes, sure, @rain1024.

rain1024 · 2015-06-10T05:49:48Z

Hi @longdt219,

Can I have your email? I will contact you for more information 😄

longdt219 · 2015-06-11T06:17:04Z

Hi @rain1024,
I emailed you, but we can probably discuss it here so that others can join the discussion.

manhtai · 2015-06-18T00:55:28Z

@rain1024 @longdt219,

How about porting https://github.com/rockkhuya/DongDu as a first step? It is a word segmentation tool written in C++, by the way.

I don't know C++ or Java, but that tool seems to have the best performance so far, according to http://xltiengviet.wikia.com/wiki/K%E1%BB%B7_l%E1%BB%A5c_t%C3%A1ch_t%E1%BB%AB

manhtai · 2015-06-18T16:52:59Z

Hi, me again,

After searching around for a while, I found that word segmentation in Vietnamese is a really hard problem, not to mention POS tagging.

I had an idea inspired by https://github.com/mesnilgr/is13: using deep learning to learn word embeddings. I'll try to implement it. Something interesting may come of it, or not 😸

manhtai · 2015-06-23T07:20:31Z

I've implemented a neural net for Vietnamese word segmentation here https://github.com/manhtai/vietseg. Have a look!

It's not so good for now. But at least I've tried, huh? 😄

longdt219 · 2015-06-23T08:49:04Z

The performance looks OK. However, what's the baseline?
What are the dependencies? Using network.py from https://github.com/mnielsen/neural-networks-and-deep-learning is probably not a good approach with respect to maintenance and licensing. The idea is that we don't want to rely on external code.
Using Theano (Python-based) for this might be a better (and simpler) solution.

manhtai · 2015-06-23T09:33:31Z

Thanks, I'm looking for a baseline and will add it soon.

Theano may be better, but it is not simpler: network.py is an independent file with fewer than 300 lines of code.

Anyway, it's only a quick and dirty implementation. I've added future work to the README file, and that's for working on in the future 😸

letuananh · 2015-09-14T02:14:29Z

@longdt219 @rain1024 I have been using jvntextpro2 for a while and it's pretty decent. It's written in Java and is also an open-source project. We may choose to port this as well.

alvations · 2016-02-28T22:48:38Z

Bumping the issue ;P

I wrote a JVnTextPro wrapper some time ago. It's not properly documented and the coding style is outdated, but I hope it helps.

It would be great to see wrappers/ports of annotators for other Asian languages too =)
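
For readers unfamiliar with the wrapper approach discussed in this thread, here is a minimal sketch of how a Python wrapper around an external Java tool is commonly structured: build a `java -cp ...` command and pipe text through it with `subprocess`. Everything specific here is an assumption made for illustration. The function names `build_java_command` and `run_java_tool`, the jar path, the main class and the idea that the tool reads stdin and writes stdout are all hypothetical, and none of this reflects JVnTextPro's actual command-line interface or the wrapper mentioned above.

```python
import subprocess
from typing import List, Optional


def build_java_command(jar_path: str, main_class: str,
                       tool_args: List[str],
                       java_bin: str = "java") -> List[str]:
    """Assemble the argument list for launching a Java tool.

    The jar path, main class and tool arguments are placeholders;
    a real tool defines its own entry point and options.
    """
    return [java_bin, "-cp", jar_path, main_class, *tool_args]


def run_java_tool(command: List[str], text: str,
                  timeout: Optional[float] = 60.0) -> str:
    """Send UTF-8 text to the tool on stdin and return its stdout.

    Assumes (hypothetically) that the tool reads from stdin and
    writes its result to stdout. Raises CalledProcessError if the
    process exits with a non-zero status.
    """
    result = subprocess.run(
        command,
        input=text.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=True,
    )
    return result.stdout.decode("utf-8")
```

With a hypothetical jar and entry point, `build_java_command("tool.jar", "example.Main", ["--segment"])` produces the list `["java", "-cp", "tool.jar", "example.Main", "--segment"]`, which can then be passed to `run_java_tool`. The trade-off raised earlier in the thread still applies: a wrapper is quick to write but makes users install Java and the external tool themselves.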

letuananh · 2016-02-29T07:09:33Z

@alvations: are you interested in porting JVnTextPro to NLTK :P ?

alvations · 2017-05-05T04:31:37Z

@letuananh after much thinking, yes. After the new PTB tokenizer is merged, an interface to JVnTextPro will be on my todo list. Care to help?

stevenbird · 2017-05-25T05:05:43Z

👍 It would be great to support Vietnamese

toannguyenle · 2017-06-04T04:03:19Z

wow... this is awesome stuff. Would love to have Vietnamese support!

vincetran96 · 2017-07-27T10:09:43Z

@manhtai do you plan to continue your project? It sounds awesome.

alvations · 2017-09-06T08:08:21Z

Coming back to this issue after the next minor release =)
But meanwhile take a look at https://github.com/magizbox/underthesea

u8621011 · 2018-05-30T01:27:31Z

@rain1024 What happened to your original porting plan? I ended up here because I have ported vnTokenizer to Python and am wondering whether it could be contributed to NLTK. I also saw your ongoing good work on underthesea and would like to ask about your next steps.

alvations · 2018-05-30T01:50:00Z

@u8621011 underthesea isn't my work but they're doing a good job =)

I'm not sure how much mileage we can get if we start porting from JVnTextPro. In any case, I don't think I'll be able to take another try at porting until late July.

Vietnamese support is surely on the list of things I personally would like to see and work on in NLTK.

rain1024 · 2018-05-30T01:57:36Z

@u8621011 Glad you asked. Our next steps in underthesea are integrating more modules, such as speech synthesis, machine translation and a (simple) chatbot for Vietnamese, and improving speed and accuracy in the current modules (word segmentation, POS tagging, chunking, named entity recognition, text classification and sentiment analysis).

As for the porting plan in NLTK, I think that for now we can write pure Python code for the word segmentation task (perhaps with Cython to speed it up). My friend @trungtv and I had a pull request accepted in spaCy 2 months ago.
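
To make the pure Python idea concrete, here is a minimal sketch of one classic baseline: greedy longest matching of syllable sequences against a lexicon. It is not the method used by underthesea, DongDu, vietseg or JVnTextPro. The function names, the two-entry toy lexicon and the `max_len` limit are assumptions chosen for illustration only. The sketch shows why whitespace splitting is not enough for Vietnamese: spaces separate syllables, while words such as "thứ sáu" or "cảm thấy" span several syllables.

```python
import re
import unicodedata
from typing import List, Set


def split_syllables(text: str) -> List[str]:
    """Split text into syllables and punctuation marks.

    The text is normalized to NFC first so that accented
    Vietnamese letters are single code points matched by \\w.
    """
    text = unicodedata.normalize("NFC", text)
    return re.findall(r"\w+|[^\w\s]", text)


def longest_match_segment(syllables: List[str], lexicon: Set[str],
                          max_len: int = 4) -> List[str]:
    """Greedily join syllables into the longest lexicon entry.

    At each position, try the longest candidate first and fall
    back to a single syllable when nothing in the lexicon matches.
    """
    normalized = {unicodedata.normalize("NFC", w).lower() for w in lexicon}
    words: List[str] = []
    i = 0
    while i < len(syllables):
        longest = min(max_len, len(syllables) - i)
        for n in range(longest, 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate.lower() in normalized:
                words.append(candidate)
                i += n
                break
    return words


# Toy lexicon of multi-syllable words (hypothetical, for illustration).
TOY_LEXICON = {"thứ sáu", "cảm thấy"}

sentence = "Vào tám giờ sáng thứ sáu, tôi cảm thấy không được khỏe."
tokens = longest_match_segment(split_syllables(sentence), TOY_LEXICON)
```

A real segmenter would need a large lexicon and a way to resolve ambiguous overlaps, which greedy matching cannot do. That limitation is part of why the statistical and neural approaches mentioned earlier in the thread exist.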

alvations added the enhancement label Nov 17, 2016

alvations added the corpus and nice idea labels Oct 4, 2017

alvations added this to the 3.3 milestone Oct 4, 2017

stevenbird added the inactive label Aug 22, 2019

stevenbird closed this as completed Aug 22, 2019
