
Integrating word vector support for NLTK #2079

Closed
53X opened this issue Aug 7, 2018 · 19 comments
@53X
Contributor

53X commented Aug 7, 2018

Word vectors are currently not supported by NLTK.

Integrating them would be a really good step, since we deal with them so often in our day-to-day work. It would make NLTK a one-stop shop for many more kinds of NLP tasks.

Here is a list of word vectors that could be integrated with NLTK:
word2vec
GloVe

If this issue is a go, then I can make the PR.
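
For concreteness, here's a minimal sketch (not an official NLTK API; the file name is just an example) of what loading GloVe-style text vectors could look like in pure Python:

```python
# Rough sketch: load GloVe-style text vectors into a dict.
# Not an NLTK API; the file path below is hypothetical.
import numpy as np

def load_glove(path):
    """Each line: a token followed by its space-separated float components."""
    vectors = {}
    with open(path, encoding="utf8") as f:
        for line in f:
            token, *components = line.rstrip().split(" ")
            vectors[token] = np.asarray(components, dtype=np.float32)
    return vectors

vecs = load_glove("glove.6B.50d.txt")  # hypothetical local file
print(vecs["king"].shape)              # -> (50,)
```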

@53X
Contributor Author

53X commented Aug 9, 2018

@alvations, @stevenbird please comment and share your views.

@stevenbird
Member

Is there a way to integrate word vectors without replicating big chunks of gensim?

@53X
Contributor Author

53X commented Aug 10, 2018

@stevenbird, do you mean without importing the gensim package?

@stevenbird
Member

Can you please explain what you're proposing to do?

@frankier

Gensim is fantastic. About the only thing NLTK generally provides that Gensim is lacking is the dataset management, which is possibly what 53X was getting at? NLTK could add something like this. In fact, I have implemented something in this style in my own project, finntk: https://github.com/frankier/finntk/tree/master/finntk/emb (it's a bit different from NLTK in that it pulls in datasets "on-demand"). For word vectors there's usually the extra step of converting to the KeyedVectors format for quick lookups from disk.
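
For reference, that conversion step is just gensim's public API; a sketch, with hypothetical file names:

```python
from gensim.models import KeyedVectors

# One-time conversion: parse word2vec-format text vectors and save
# gensim's native format, which supports memory-mapped loading.
kv = KeyedVectors.load_word2vec_format("vectors.txt", binary=False)
kv.save("vectors.kv")

# Later, on demand: mmap the saved file so vectors are paged in from
# disk as needed instead of being read up front.
kv = KeyedVectors.load("vectors.kv", mmap="r")
print(kv.most_similar("king", topn=3))
```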

However, I'm not sure if nltk is necessarily the best home for it. There are a few people trying to start "dataset package managers". See the manifesto: http://juan.benet.ai/data/2014-03-04/the-case-for-data-package-managers , and an example: https://quiltdata.com/ -- I wonder if there's the possibility of using something like this?

@frankier

In terms of repositories of word vectors, there's also: http://vectors.nlpl.eu/repository/

@53X
Contributor Author

53X commented Aug 12, 2018

@stevenbird, quoting from the spaCy documentation:

SpaCy is able to compare two objects, and make a prediction of how similar they are. Predicting similarity is useful for building recommendation systems or flagging duplicates. For example, you can suggest a user content that's similar to what they're currently looking at, or label a support ticket as a duplicate if it's very similar to an already existing one.

We could build and integrate something like this in NLTK, and I think this might be the right PR with which to start integrating DL functionality into NLTK. For that we would probably need to add the pre-trained word embeddings to the NLTK data distribution.
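
For reference, the spaCy behaviour quoted above boils down to a few lines; a sketch, assuming the en_core_web_md model (which ships with vectors) is installed:

```python
import spacy

# Requires: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")
print(doc1.similarity(doc2))  # cosine similarity of the averaged vectors
```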

@53X
Contributor Author

53X commented Aug 15, 2018

My idea might be very crude, but with a bit of refinement from you all, I think we can make something out of this.

@stevenbird
Member

@53X what are you proposing to do exactly?

@53X
Contributor Author

53X commented Aug 15, 2018

Let's say, for a start, incorporating something like sentence2vec, doc2vec, etc., so that given two comparable objects we can say how similar they are.
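
To make that concrete, here's a sketch of document-level similarity using gensim's Doc2Vec (the toy corpus is purely illustrative):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# A toy corpus; real training data would be much larger.
corpus = [
    TaggedDocument(words=["the", "cat", "sat"], tags=[0]),
    TaggedDocument(words=["the", "dog", "barked"], tags=[1]),
]
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# Infer a vector for an unseen document and find the closest one.
vec = model.infer_vector(["the", "cat", "meowed"])
print(model.dv.most_similar([vec], topn=1))  # model.docvecs in older gensim
```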

@alvations
Contributor

I think you're suggesting something like https://radimrehurek.com/gensim/models/keyedvectors.html ?


My 2 cents' worth: without the Cython or C code hacks, we won't be able to achieve gensim/spaCy speed. And unless there's some way to differentiate the code between NLTK and gensim, there's no good reason for the two libraries to be so similar.
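
To illustrate the speed point, a fully vectorised NumPy nearest-neighbour lookup (a sketch, roughly the ceiling for a pure-Python implementation) still delegates all the heavy lifting to BLAS; anything less vectorised would be far slower than gensim's optimised paths:

```python
import numpy as np

def most_similar(query_vec, matrix, vocab, topn=5):
    """Cosine similarity of query_vec against every row of matrix.

    matrix: (n_words, dim) array; vocab: list of n_words tokens.
    """
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec)
    sims = (matrix @ query_vec) / norms
    best = np.argsort(-sims)[:topn]
    return [(vocab[i], float(sims[i])) for i in best]
```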

@alvations
Contributor

Also, we haven't figured out a good distribution channel for data management.

Personally (on and off for quite some time), I've been trying different styles of reformatting nltk.corpus, but I haven't come up with an API design with a suitable Content Delivery Network (CDN) that can handle data management elegantly.

I've explored Kaggle Datasets, Dropbox, Zenodo, and even data distribution as PyPI packages. But there's always a limit on:

  • How available can the data be? I.e. does it require the user to sign up for an account? How many hops/steps does it take before the user can get hold of data readable by nltk.corpus? Up to now, nothing beats the simplicity of pulling zip files from GitHub.

  • How do we track data provenance? I.e. when the data is updated, is there a version? How do we go back to track changes, and possibly have some sort of git-blame mechanism to debug what went wrong if something does?

  • How much support is the CDN going to give? There's always a bandwidth limit for uploading/downloading files, and also a storage size limit. I think the latter is cheap but the former is hard.

My other 2 cents' worth: I think we need a better distribution channel for existing NLTK datasets even before thinking about redistributing word embeddings to be read through a Pythonic API.

If anyone is interested in really solving the data distribution problem and dedicating time to it as a pre-step to incorporating word/doc embeddings, ping here and I'll contact you through email to discuss it.

@53X
Contributor Author

53X commented Aug 15, 2018

@alvations, this problem sounds interesting, and if none of you mind, may I take it up as a challenge?

@53X
Contributor Author

53X commented Aug 15, 2018

My apologies to all of you if the following suggestion sounds foolish or absurd.

I was actually thinking about making a separate repository (inside the NLTK organization) for all these datasets. I think it would then be possible to address the issues that @alvations mentioned. Most importantly, we could then directly download (clone) anything new (datasets or pretrained models like GloVe) onto the local machine using simple Python code, along the lines of the sketch below (that's what I think... I might not be correct :P). The above-mentioned problem of users having to sign in could also be avoided.
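
For illustration, the "simple Python code" could be as little as this sketch; the archive URL and target directory are examples, not a settled design:

```python
import os
import urllib.request
import zipfile

# Example values only; a real design would pin versions/checksums.
URL = "https://github.com/nltk/nltk_data/archive/gh-pages.zip"
TARGET = os.path.expanduser("~/nltk_data")

# Fetch the archive to a temp file and unpack it locally.
archive, _ = urllib.request.urlretrieve(URL)
with zipfile.ZipFile(archive) as zf:
    zf.extractall(TARGET)
```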

@53X
Contributor Author

53X commented Aug 15, 2018

Oops, sorry... it turns out NLTK already has a repository called nltk_data.
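
For context, NLTK's existing downloader already pulls packages from that repository; hosting embeddings there would make them installable the same way (the embedding package name below is hypothetical):

```python
import nltk

nltk.download("punkt")           # an existing nltk_data package
# nltk.download("glove_6b_50d")  # hypothetical embedding package
```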

@frankier

@alvations Do your objections apply to Quilt?

@stevenbird
Member

@alvations how about posting your comments about datasets to nltk-dev?

@alvations
Contributor

alvations commented Aug 17, 2018

Let's continue the conversation about dataset distribution on https://groups.google.com/forum/?hl=en#!topic/nltk-dev/LjThWkAthwc


@53X there are no foolish/absurd suggestions; it's just a matter of making them constructive and knowing what the next steps are =)

@frankier I'm not familiar with Quilt data but it looks like a potential space to distribute the data.

@stevenbird
Member

Closing inactive issue.
