NLTK for vietnamese #995

Closed
rain1024 opened this issue May 31, 2015 · 22 comments

Comments

@rain1024

Does NLTK support the Vietnamese language?

If it doesn't, how can I contribute to make NLTK support Vietnamese?

It could work like this:

>>> import nltk
>>> sentence = "Vào tám giờ sáng thứ sáu, tôi cảm thấy không được khỏe."
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['Vào', 'tám', 'giờ', 'sáng', 'thứ sáu', ',', 'tôi', 'cảm thấy', 'không', 'được', 'khỏe', '.']
>>> tagged = nltk.pos_tag(tokens)
>>> tagged[0:5]
[('Vào', 'IN'), ('tám', 'CD'), ('giờ', 'JJ'), ('sáng', 'NN'), ('thứ sáu', 'NNP')]

@longdt219
Contributor

Hi @stevenbird,
What do you think? We could probably port these tools:
http://jvntextpro.sourceforge.net/

@stevenbird
Member

@rain1024 would you like to do some porting, or contribute wrappers for external Java libraries?
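
For readers unfamiliar with the "wrapper" option: a wrapper typically launches the Java tool as a subprocess and exchanges text with it. The sketch below only shows that general pattern. The jar name, main class, and arguments are hypothetical placeholders, and whether a given tool reads from stdin is an assumption; none of this is the actual command line of JVnTextPro or any other tool mentioned in this thread.

```python
import subprocess


def build_java_command(jar_path, main_class, tool_args):
    """Assemble a `java -cp <jar> <MainClass> <args...>` command line."""
    return ["java", "-cp", jar_path, main_class, *tool_args]


def run_java_tool(jar_path, main_class, tool_args, text):
    """Send `text` to the tool on stdin and return what it prints on stdout.

    Assumes (hypothetically) that the tool reads UTF-8 text from stdin.
    Raises CalledProcessError if the Java process exits with an error.
    """
    result = subprocess.run(
        build_java_command(jar_path, main_class, tool_args),
        input=text,
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


# Hypothetical jar, class, and argument, shown only to illustrate the shape
# of the command; nothing is executed here.
command = build_java_command("segmenter.jar", "example.Segmenter", ["input.txt"])
print(command)
```

A port, by contrast, would reimplement the tool's logic in Python and avoid the Java dependency altogether.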

@rain1024
Author

@stevenbird: Yes, I'd be glad to do this.

@longdt219: can we do this together?

@longdt219
Contributor

Yes sure @rain1024

@rain1024
Author

Hi @longdt219,

Can I have your email? I will contact you for more information 😄

@longdt219
Contributor

Hi @rain1024,
I emailed you, but we should probably discuss this here so that others can join the discussion.

@manhtai

manhtai commented Jun 18, 2015

@rain1024 @longdt219,

How about porting https://github.com/rockkhuya/DongDu as a first step? It is a word segmentation tool written in C++, by the way.

I don't know C++ or Java, but that tool seems to have the best performance so far, according to http://xltiengviet.wikia.com/wiki/K%E1%BB%B7_l%E1%BB%A5c_t%C3%A1ch_t%E1%BB%AB

@manhtai

manhtai commented Jun 18, 2015

Hi, me again,

After searching around for a while, I found that word segmentation in Vietnamese is a really hard problem, not to mention POS tagging.

I had an idea, inspired by https://github.com/mesnilgr/is13, to use deep learning to learn word embeddings, and I'll try to implement it. Something interesting may come of it, or not 😸
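
To illustrate why segmentation is hard: written Vietnamese puts spaces between syllables, not words, so a segmenter has to decide which adjacent syllables form one word. The sketch below is a greedy longest-matching segmenter over a toy lexicon. It is a simple dictionary technique shown for illustration only, not the method used by DongDu or any other tool in this thread; the function name and the lexicon entries are made up for the example.

```python
def longest_match_segment(syllables, lexicon, max_len=4):
    """Greedy left-to-right segmentation.

    At each position, take the longest run of syllables (up to max_len)
    found in the lexicon; fall back to a single syllable otherwise.
    """
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, shrinking down to one syllable.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate.lower() in lexicon:
                words.append(candidate)
                i += n
                break
    return words


# Toy lexicon of multi-syllable words (illustrative only).
TOY_LEXICON = {"thứ sáu", "cảm thấy", "học sinh", "sinh học"}

sentence = "Vào tám giờ sáng thứ sáu tôi cảm thấy không được khỏe"
print(longest_match_segment(sentence.split(), TOY_LEXICON))

# An ambiguous case: "học sinh" (student) and "sinh học" (biology) overlap,
# and the greedy matcher commits to the first match it finds.
ambiguous = "học sinh học sinh học"
print(longest_match_segment(ambiguous.split(), TOY_LEXICON))
```

On the second sentence the greedy matcher yields `học sinh | học sinh | học`, whereas the intended reading is `học sinh | học | sinh học` ("students study biology"). Resolving overlaps like this needs context, which is what statistical and neural approaches try to learn.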

@manhtai

manhtai commented Jun 23, 2015

I've implemented a neural net for Vietnamese word segmenting here https://github.com/manhtai/vietseg. Have a look!

It's not so good for now. But at least I've tried, huh? 😄
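
For context, one common way to turn word segmentation into a classification problem is to label every syllable as either beginning a word (`B`) or continuing the previous one (`I`). This is a sketch of that general framing, with hypothetical helper names; it is an assumption for illustration, not a description of how vietseg encodes its labels.

```python
def words_to_labels(words):
    """Turn segmented words into (syllable, label) pairs.

    'B' marks the first syllable of a word, 'I' a continuation.
    """
    pairs = []
    for word in words:
        for k, syllable in enumerate(word.split()):
            pairs.append((syllable, "B" if k == 0 else "I"))
    return pairs


def labels_to_words(pairs):
    """Inverse of words_to_labels: rebuild words from labelled syllables."""
    words = []
    for syllable, label in pairs:
        if label == "I" and words:
            words[-1] += " " + syllable
        else:
            words.append(syllable)
    return words


gold = ["tôi", "cảm thấy", "không", "được", "khỏe"]
pairs = words_to_labels(gold)
print(pairs)
print(labels_to_words(pairs))
```

A classifier (neural or otherwise) then only has to predict one label per syllable, and the predicted labels can be converted back into words with `labels_to_words`.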

@longdt219
Contributor

The performance looks OK. However, what's the baseline?
What are the dependencies? Using network.py from https://github.com/mnielsen/neural-networks-and-deep-learning is probably not a good approach with respect to maintenance and licensing. The idea is that we don't want to rely on external code.
Using Theano (Python-based) for this might be a better (and simpler) solution.
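
As a sketch of what a baseline comparison could measure: segmenters are often scored by word-level precision, recall, and F1, counting a predicted word as correct only if both of its boundaries match the gold segmentation. That metric choice is an assumption here, not necessarily what vietseg reports, and the function names are hypothetical.

```python
def word_spans(words):
    """Map a segmentation to a set of (start, end) syllable-index spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word.split())
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold_words, predicted_words):
    """Word-level precision, recall, and F1 over matching spans."""
    gold = word_spans(gold_words)
    predicted = word_spans(predicted_words)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


gold = ["học sinh", "học", "sinh học"]

# A trivial baseline: treat every syllable as its own word.
syllable_baseline = ["học", "sinh", "học", "sinh", "học"]
print(segmentation_prf(gold, syllable_baseline))

# A hypothetical system output to compare against the baseline.
system_output = ["học sinh", "học sinh", "học"]
print(segmentation_prf(gold, system_output))
```

Reporting a system's scores next to a trivial baseline like this one shows how much the model actually contributes.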

@manhtai

manhtai commented Jun 23, 2015

Thanks, I'm looking for a baseline and will add it soon.

Theano may be better, but it is not simpler: network.py is a single standalone file with fewer than 300 lines of code.

Anyway, it's only a quick and dirty implementation. I've added a list of future work to the README file, and I'll get to it in the future 😸

@letuananh
Contributor

@longdt219 @rain1024 I have been using jvntextpro2 for a while and it's pretty decent. It's written in Java and is also an open-source project. We may choose to port this as well.

@alvations
Contributor

Bumping the issue ;P

I wrote a JVnTextPro wrapper some time ago. It's not properly documented and the coding style is outdated, but I hope it helps.

It would be great to see wrappers/ports of annotators for other Asian languages too =)

@letuananh
Contributor

@alvations: are you interested in porting JVnTextPro to NLTK :P ?

@alvations
Contributor

alvations commented May 5, 2017

@letuananh After much thinking, yes. Once the new PTB tokenizer is merged, an interface to JVN will be on my to-do list. Care to help?

@stevenbird
Member

👍 It would be great to support Vietnamese

@toannguyenle

wow... this is awesome stuff. Would love to have Vietnamese support!

@vincetran96

@manhtai Do you plan to continue your project? It sounds awesome.

@alvations
Contributor

alvations commented Sep 6, 2017

I'll come back to this issue after the next minor release =)
Meanwhile, take a look at https://github.com/magizbox/underthesea

@alvations added this to the 3.3 milestone Oct 4, 2017

@u8621011

@rain1024 What happened to your original porting plan? I ended up here because I have ported vnTokenizer to Python and am wondering whether it could be integrated into NLTK. I have also seen your continued good work on underthesea and would like to ask about your next steps.

@alvations
Contributor

@u8621011 underthesea isn't my work but they're doing a good job =)

I'm not sure how much mileage we can get if we start porting from Jvntextpro. But I think I won't be able to take another try at porting until late July.

Vietnamese support is surely on the list of things I personally would like to see and work on in NLTK.

@rain1024
Author

@u8621011 Glad you asked. Our next steps in underthesea are to integrate more modules for Vietnamese, such as speech synthesis, machine translation, and a (simple) chatbot, and to improve the speed and accuracy of the current modules (word segmentation, POS tagging, chunking, named entity recognition, text classification, and sentiment analysis).

As for the porting plan in NLTK, I think that for now we can write the word segmentation code in pure Python (perhaps with Cython to speed it up). My friend @trungtv and I had a pull request accepted in spaCy two months ago.


9 participants