
Integrating word vector support for NLTK #2079

Closed
53X opened this issue Aug 7, 2018 · 19 comments
@53X
Contributor

53X commented Aug 7, 2018

Word vectors are currently not supported by NLTK.

Integrating them would be a really good step, since we deal with them so often in our day-to-day work. It would make NLTK a one-stop shop for many more kinds of NLP tasks.

Here is a list of word vectors that could be integrated with NLTK:
word2vec
GloVe

If this issue is a go, then I can make the PR.
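
For concreteness, here's a minimal sketch (not an official NLTK API; the file name is just an example) of what loading GloVe-style text vectors could look like in pure Python:

```python
# Rough sketch: load GloVe-style text vectors into a dict.
# Not an NLTK API; the file path below is hypothetical.
import numpy as np

def load_glove(path):
    """Each line: a token followed by its space-separated float components."""
    vectors = {}
    with open(path, encoding="utf8") as f:
        for line in f:
            token, *components = line.rstrip().split(" ")
            vectors[token] = np.asarray(components, dtype=np.float32)
    return vectors

vecs = load_glove("glove.6B.50d.txt")  # hypothetical local file
print(vecs["king"].shape)              # -> (50,)
```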

@53X
Contributor Author

53X commented Aug 9, 2018

@alvations, @stevenbird please comment and share your views.

@stevenbird
Member

Is there a way to integrate word vectors without replicating big chunks of gensim?

@53X
Contributor Author

53X commented Aug 10, 2018

@stevenbird, do you mean without importing the gensim package?

@stevenbird
Member

Can you please explain what you're proposing to do?

@frankier

Gensim is fantastic. About the only thing NLTK generally provides that Gensim is lacking is the dataset management, which is possibly what 53X was getting at? NLTK could add something like this. In fact, I have implemented something in this style in my own project, finntk: https://github.com/frankier/finntk/tree/master/finntk/emb (it's a bit different from NLTK in that it pulls in datasets "on-demand"). For word vectors there's usually the extra step of converting to the KeyedVectors format for quick lookups from disk.
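
For reference, that conversion step is just gensim's public API; a sketch, with hypothetical file names:

```python
from gensim.models import KeyedVectors

# One-time conversion: parse word2vec-format text vectors and save
# gensim's native format, which supports memory-mapped loading.
kv = KeyedVectors.load_word2vec_format("vectors.txt", binary=False)
kv.save("vectors.kv")

# Later, on demand: mmap the saved file so vectors are paged in from
# disk as needed instead of being read up front.
kv = KeyedVectors.load("vectors.kv", mmap="r")
print(kv.most_similar("king", topn=3))
```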

However, I'm not sure if nltk is necessarily the best home for it. There are a few people trying to start "dataset package managers". See the manifesto: http://juan.benet.ai/data/2014-03-04/the-case-for-data-package-managers , and an example: https://quiltdata.com/ -- I wonder if there's the possibility of using something like this?

@frankier

In terms of repositories of word vectors, there's also: http://vectors.nlpl.eu/repository/

@53X
Contributor Author

53X commented Aug 12, 2018

@stevenbird, quoting from the spaCy documentation:

SpaCy is able to compare two objects, and make a prediction of how similar they are. Predicting similarity is useful for building recommendation systems or flagging duplicates. For example, you can suggest a user content that's similar to what they're currently looking at, or label a support ticket as a duplicate if it's very similar to an already existing one.

We could build and integrate something like this in NLTK, and I think this might be the right PR with which to start integrating DL functionality into NLTK. For that we would probably need to add the pre-trained word embeddings to the NLTK data distribution.
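
For reference, the spaCy behaviour quoted above boils down to a few lines; a sketch, assuming the en_core_web_md model (which ships with vectors) is installed:

```python
import spacy

# Requires: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")
print(doc1.similarity(doc2))  # cosine similarity of the averaged vectors
```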

@53X
Contributor Author

53X commented Aug 15, 2018

My idea might be very crude, but with a bit of refinement from you all, I think we can make something out of this.

@stevenbird
Member

@53X what are you proposing to do exactly?

@53X
Contributor Author

53X commented Aug 15, 2018

Let's say, for a start, incorporating something like sentence2vec, doc2vec, etc., so that given two comparable objects we can say how similar they are.
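
To make that concrete, here's a sketch of document-level similarity using gensim's Doc2Vec (the toy corpus is purely illustrative):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# A toy corpus; real training data would be much larger.
corpus = [
    TaggedDocument(words=["the", "cat", "sat"], tags=[0]),
    TaggedDocument(words=["the", "dog", "barked"], tags=[1]),
]
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# Infer a vector for an unseen document and find the closest one.
vec = model.infer_vector(["the", "cat", "meowed"])
print(model.dv.most_similar([vec], topn=1))  # model.docvecs in older gensim
```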

@alvations
Contributor

I think you're suggesting something like https://radimrehurek.com/gensim/models/keyedvectors.html ?


My 2 cents' worth: without the Cython or C code hacks, we won't be able to achieve gensim/spaCy speed. And unless there's some way to differentiate the code between NLTK and gensim, there's no good reason for the two libraries to be so similar.
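
To illustrate the speed point, a fully vectorised NumPy nearest-neighbour lookup (a sketch, roughly the ceiling for a pure-Python implementation) still delegates all the heavy lifting to BLAS; anything less vectorised would be far slower than gensim's optimised paths:

```python
import numpy as np

def most_similar(query_vec, matrix, vocab, topn=5):
    """Cosine similarity of query_vec against every row of matrix.

    matrix: (n_words, dim) array; vocab: list of n_words tokens.
    """
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec)
    sims = (matrix @ query_vec) / norms
    best = np.argsort(-sims)[:topn]
    return [(vocab[i], float(sims[i])) for i in best]
```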

@alvations
Contributor

Also, we haven't figured out a good distribution channel for data management.

Personally (on and off for quite some time), I've been trying different styles of reformatting nltk.corpus, but I haven't come up with an API design with a suitable Content Delivery Network (CDN) that can handle data management elegantly.

I've explored Kaggle Datasets, Dropbox, Zenodo, and even data distribution as PyPI packages. But there's always a limit on:

  • How available can the data be? I.e. does it require the user to sign up for an account? How many hops/steps does it take before the user can get hold of data readable by nltk.corpus? Up to now, nothing beats the simplicity of pulling zip files from GitHub.

  • How do we track data provenance? I.e. when the data is updated, is there a version? How do we go back to track changes, and possibly have some sort of git-blame mechanism to debug what went wrong if something does?

  • How much support is the CDN going to give? There's always a bandwidth limit for uploading/downloading files, and also a storage size limit. I think the latter is cheap but the former is hard.

My other 2 cents' worth: I think we need a better distribution channel for existing NLTK datasets even before thinking about redistributing word embeddings to be read through a Pythonic API.

If anyone is interested in really solving the data distribution problem and dedicating time to it as a pre-step to incorporating word/doc embeddings, ping here and I'll contact you through email to discuss it.

@53X
Contributor Author

53X commented Aug 15, 2018

@alvations, this problem sounds interesting, and if none of you mind, may I take it up as a challenge?

@53X
Contributor Author

53X commented Aug 15, 2018

My apologies to all of you if the following suggestion sounds foolish or absurd.

I was actually thinking about making a separate repository (inside the NLTK organization) for all these datasets. I think it would then be possible to address the issues that @alvations mentioned. Most importantly, we could then directly download (clone) anything new (datasets or pretrained models like GloVe) onto the local machine using simple Python code, along the lines of the sketch below (that's what I think... I might not be correct :P). The above-mentioned problem of users having to sign in could also be avoided.
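
For illustration, the "simple Python code" could be as little as this sketch; the archive URL and target directory are examples, not a settled design:

```python
import os
import urllib.request
import zipfile

# Example values only; a real design would pin versions/checksums.
URL = "https://github.com/nltk/nltk_data/archive/gh-pages.zip"
TARGET = os.path.expanduser("~/nltk_data")

# Fetch the archive to a temp file and unpack it locally.
archive, _ = urllib.request.urlretrieve(URL)
with zipfile.ZipFile(archive) as zf:
    zf.extractall(TARGET)
```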

@53X
Contributor Author

53X commented Aug 15, 2018

Oops, sorry... it turns out NLTK already has a repository called nltk_data.
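
For context, NLTK's existing downloader already pulls packages from that repository; hosting embeddings there would make them installable the same way (the embedding package name below is hypothetical):

```python
import nltk

nltk.download("punkt")           # an existing nltk_data package
# nltk.download("glove_6b_50d")  # hypothetical embedding package
```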

@frankier

@alvations Do your objections apply to Quilt?

@stevenbird
Member

@alvations how about posting your comments about datasets to nltk-dev?

@alvations
Contributor

alvations commented Aug 17, 2018

Let's continue the conversation about dataset distribution on https://groups.google.com/forum/?hl=en#!topic/nltk-dev/LjThWkAthwc


@53X there are no foolish/absurd suggestions; it's just a matter of making them constructive and knowing what the next steps are =)

@frankier I'm not familiar with Quilt data but it looks like a potential space to distribute the data.

@stevenbird
Member

Closing inactive issue.
