Integrating word vector support for NLTK #2079
Comments
@alvations, @stevenbird, please comment and share your views. |
Is there a way to integrate word vectors without replicating big chunks of gensim? |
@stevenbird, do you mean without importing the gensim package? |
Can you please explain what you're proposing to do? |
Gensim is fantastic. About the only thing NLTK generally provides that Gensim lacks is dataset management, which is possibly what @53X was getting at? NLTK could add something like this. In fact, I have implemented something in this style in my own project, finntk: https://github.com/frankier/finntk/tree/master/finntk/emb (it's a bit different from NLTK in that it pulls in datasets "on-demand"). For word vectors there's usually the extra step of converting to the KeyedVectors format for quick lookups from disk. However, I'm not sure NLTK is necessarily the best home for it. There are a few people trying to start "dataset package managers". See the manifesto: http://juan.benet.ai/data/2014-03-04/the-case-for-data-package-managers , and an example: https://quiltdata.com/ -- I wonder if there's the possibility of using something like this? |
In terms of repositories of word vectors, there's also: http://vectors.nlpl.eu/repository/ |
@stevenbird, quoting from the spaCy docs:
We can build and integrate something like this in NLTK, and I think this might be the right PR to start integrating DL functionality into NLTK. For that, we might need to add the pre-trained word embeddings in the NL |
My idea might be very crude, but with a bit of refinement from you guys, I think we can make something out of this. |
@53X what are you proposing to do exactly? |
For a start, let's say incorporating something like sentence2vec, doc2vec, etc., so that given two comparable items (words, sentences, or documents) we can say how similar they are. |
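To make "how similar they are" concrete, here is a minimal sketch of the usual measure, cosine similarity, applied to sentence vectors built by averaging word vectors. This is not sentence2vec or doc2vec themselves, just a simple averaging baseline; the toy vectors and the function names `sentence_vector` and `cosine_similarity` are made up for illustration and are not part of NLTK or gensim.

```python
import math

# Hypothetical 3-dimensional toy vectors; real embeddings would come
# from a trained model such as word2vec or GloVe.
TOY_VECTORS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "car": [0.0, 0.1, 0.9],
    "truck": [0.1, 0.0, 0.8],
}


def sentence_vector(tokens, vectors):
    """Average the vectors of the known tokens (a crude sentence embedding)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("none of the tokens has a vector")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


animals = sentence_vector(["cat", "dog"], TOY_VECTORS)
vehicles = sentence_vector(["car", "truck"], TOY_VECTORS)
print(cosine_similarity(TOY_VECTORS["cat"], TOY_VECTORS["dog"]))
print(cosine_similarity(animals, vehicles))
```

With these toy values the word pair scores much higher than the two unrelated sentences, which is the behaviour a similarity API would be expected to expose.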
I think you're suggesting something like https://radimrehurek.com/gensim/models/keyedvectors.html ? My two cents: without the Cython or C code hacks, we won't be able to achieve the |
Also, we haven't figured out a good distribution channel for data management. Personally (on and off for quite some time), I've been trying different styles of reformatting the I've explored Kaggle Datasets, Dropbox, and Zenodo, and even data distribution as PyPI packages. But there's always a limit of
My other two cents: I think we need a better distribution channel for the existing NLTK datasets even before thinking about redistributing word embeddings to be read into a Pythonic API. If anyone is interested in really solving the data distribution problem and dedicating time to it as a pre-step to incorporating word/doc embeddings, ping here and I'll contact you by email to discuss this. |
@alvations, this problem sounds interesting. If none of you object, may I take it up as a challenge? |
My apologies to all of you if the following suggestion sounds foolish or absurd. I was thinking about making a separate repository (inside NLTK) for all these datasets. I think that would make it possible to address the issue @alvations mentioned. Most importantly, we could then directly download (clone) anything new (datasets or pretrained models like GloVe) onto the local machine using simple Python code (that's what I think... I might not be correct :P ). It would also avoid the above-mentioned problem of requiring a user to sign in. |
Oops, sorry .... turns out NLTK already has a repository called |
@alvations Do your objections apply to Quilt? |
@alvations how about posting your comments about datasets to nltk-dev? |
Let's continue the conversation about dataset distribution at https://groups.google.com/forum/?hl=en#!topic/nltk-dev/LjThWkAthwc @53X there are no foolish/absurd suggestions, just a question of how we can make this constructive and work out the next steps =) @frankier I'm not familiar with Quilt, but it looks like a potential place to distribute the data. |
Closing inactive issue. |
Word vectors are currently not supported by NLTK.
Integrating them would be a valuable step, since we often work with them in our day-to-day jobs. It would make NLTK a one-stop shop for many more kinds of NLP tasks.
The following word vectors could be integrated with NLTK:
word2vec
GloVe
If this issue is a go, I can make the PR.
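As a rough illustration of what reading such vectors involves, the sketch below parses the plain-text layout that both formats commonly ship in: one word per line followed by its space-separated float components, with word2vec's text format adding a leading "count dimension" header line. The function name `load_text_vectors` and the sample data are hypothetical; nothing here is an existing NLTK API, and the header check is only a heuristic.

```python
import io


def load_text_vectors(lines):
    """Parse 'word v1 v2 ...' lines into a dict mapping word -> list of floats.

    A first line made of exactly two integers is treated as a word2vec-style
    'count dimension' header and skipped (heuristic assumption).
    """
    vectors = {}
    for i, line in enumerate(lines):
        parts = line.rstrip().split(" ")
        if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue  # word2vec text header, not a vector
        if len(parts) < 2:
            continue  # blank or malformed line
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors


# Tiny made-up samples standing in for real vector files.
GLOVE_SAMPLE = "the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n"
W2V_SAMPLE = "2 3\nthe 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n"

glove = load_text_vectors(io.StringIO(GLOVE_SAMPLE))
w2v = load_text_vectors(io.StringIO(W2V_SAMPLE))
print(glove["cat"], glove == w2v)
```

A pure-Python loader like this is fine for small files, but it reads everything into memory, which is part of why the thread points to gensim's optimized formats for real use.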