
add functions to reproduce preprocessing matching GoogleNews, GLoVe, etc pretrained word-vectors #3485

Open
gojomo opened this issue Jul 19, 2023 · 1 comment


gojomo (Collaborator) commented Jul 19, 2023

Suggested on project discussion list (https://groups.google.com/g/gensim/c/CsER2XBs8P4/m/f2EntuXRAgAJ):

Having discovered the undocumented fact that common words like "I'm", "we're", "don't", etc. are OOV in the common GloVe pretrained models

(while words like "o'clock" are in-vocabulary, so you can't just split on apostrophes/single quotes),

and seeing no docs except some vague references that the Stanford parser with undocumented switches MIGHT have been used to generate the common pretrained GloVe models,

and finding ZERO comments from Google about how they preprocessed the text used for Word2Vec's Google News pretrained model,

it seems to me that Gensim would do people a lot of good by providing tokenizers matching each of its most popular included pretrained models, so that users are writing NLP programs that speak the same language as their models rather than comparing apples to oranges.
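For anyone who wants to sanity-check the reported behavior, a quick probe against one of the gensim-data GloVe models (the model name below is just an example; results may differ across the various GloVe releases):

```python
import gensim.downloader as api

# Any of the glove-* (or word2vec-google-news-300) entries listed in
# api.info()['models'] could be substituted here.
kv = api.load("glove-wiki-gigaword-100")

# Words the discussion-list post reports as OOV, plus one it reports as present.
for word in ["i'm", "we're", "don't", "o'clock"]:
    print(word, word in kv.key_to_index)
```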


gojomo (Collaborator, Author) commented Jul 19, 2023

My thoughts:

A desire for help here has come up a lot – & at times I've shared my observations about what can be deduced from the limited statements, & observable contents, of pre-trained vector sets like the 'GoogleNews' release.

However, without disclosures (or better yet code) from the original researchers who prepared such pretrained vectors, all such efforts will only ever gradually approximate their practices, with lingering exceptions & caveats generating more questions.

Also: it often seems to be beginner & small-data projects that are most-eager to re-use pretrained vectors from elsewhere, under the assumption those must be the "right" thing, or better than what they'd achieve. But: many times that's not the case.

For example, GoogleNews was trained on an internal Google corpus of news articles 11+ years ago. It used a statistical model for creating multiword-tokens whose exact parameters/word-frequencies/multigram-frequencies have never been disclosed. For many current projects, word-vectors trained on more-recent domain-specific data via understood & consciously-chosen preprocessing – even with much less data! – will likely generate better vocabulary & relevant-word-sense coverage than Google's old work.

So while I'd see some value in a "best guess" function to mimic the tokenizing choices of those commonly-used pretrained sets – as a research effort, or contribution – I'd also prefer it prominently-disclaimered as non-official, & not-necessarily-an-endorsement of preferring those vectors, and that tokenization, for anyone's particular purpose.

At this point, devising such helpers would be a sort of software-archeology/mystery project, and I'd not see it as any sort of urgent priority. But, it might make a good new-contributor, student, or hackathon project – especially if eventual integration includes good surrounding docs/discussion/demos of the limits/considerations involved in reusing another project's vectors/preprocessing choices.
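For concreteness, the kind of sketch I'd imagine – to be clear, the function name and the split rules below are purely illustrative guesses deduced from the observed vocabulary contents, not anything the GloVe/GoogleNews authors have confirmed:

```python
import re

# Purely illustrative "best guess" at a GloVe-style tokenizer: split off the
# common contraction suffixes (so "i'm" -> "i", "'m"), but leave apostrophes
# inside other words (like "o'clock") alone. NOT a confirmed reproduction of
# the original, undisclosed Stanford preprocessing.
_CONTRACTION_SUFFIXES = re.compile(r"(\w+)('m|'re|'s|'ll|'d|'ve|n't)$")

def guess_glove_tokenize(text):
    """Lowercase, strip surrounding punctuation, peel common contraction suffixes."""
    tokens = []
    for raw in text.lower().split():
        raw = raw.strip(".,!?;:\"()[]")
        if not raw:
            continue
        m = _CONTRACTION_SUFFIXES.match(raw)
        if m:
            tokens.extend([m.group(1), m.group(2)])
        else:
            tokens.append(raw)
    return tokens

print(guess_glove_tokenize("I'm sure we're done by six o'clock."))
# -> ['i', "'m", 'sure', 'we', "'re", 'done', 'by', 'six', "o'clock"]
```

Any real contribution would of course need to be validated token-by-token against the actual pretrained vocabularies, with the remaining mismatches documented.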

gojomo changed the title from "add functions to reproduct preprocessing behind GoogleNews, GLoVe, etc pretrained word-vectors" to "add functions to reproduce preprocessing matching GoogleNews, GLoVe, etc pretrained word-vectors" on Jul 21, 2023