
Do you expose your underlying language model for uni/bigrams? #18

Open
davidbernat opened this issue Nov 4, 2019 · 0 comments

@davidbernat

This library is really superb.

One of the tools I have often wished for is a basic statistical language model (relative frequencies) of unigrams, bigrams, and trigrams. When extracting keywords from text, one failure of TF-IDF is that the scores are not calibrated, so unigram and bigram scores cannot be compared with one another. There is also the trouble of needing document and token frequencies. Instead, I normalize the TF/TF-IDF scores against English corpus statistics, which you have within your models. Usually I use the unwieldy Google NGrams corpus, but yours is succinct and quite helpful. Is it easily accessible?
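
For concreteness, here is a minimal sketch of the kind of normalization I mean. The `BACKGROUND_FREQ` table and its values are hypothetical placeholders for whatever corpus frequencies your model exposes; the point is just that dividing by background frequency puts unigrams and bigrams on a comparable scale:

```python
import math
from collections import Counter

# Hypothetical background relative frequencies for unigrams and bigrams;
# in practice these would come from your library's internal model
# (or from a corpus such as Google NGrams).
BACKGROUND_FREQ = {
    "neural": 3.1e-5,
    "network": 8.7e-5,
    "neural network": 1.2e-6,
}

def ngrams(tokens, n):
    """Return space-joined n-grams from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def calibrated_scores(tokens, max_n=2, floor=1e-9):
    """Score each n-gram by the log-ratio of its in-document frequency
    to its background corpus frequency, so unigram and bigram scores
    can be ranked against one another."""
    scores = {}
    for n in range(1, max_n + 1):
        grams = ngrams(tokens, n)
        counts = Counter(grams)
        total = max(len(grams), 1)
        for gram, count in counts.items():
            doc_freq = count / total
            background = BACKGROUND_FREQ.get(gram, floor)  # floor for unseen n-grams
            scores[gram] = math.log(doc_freq / background)
    return scores

tokens = "a neural network is a neural model".split()
print(calibrated_scores(tokens))
```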

Thanks!
