
Document minimum_probability=0.0 in LDA #2625

Open · wants to merge 1 commit into develop

Conversation

@Hiyorimi (Contributor) commented Oct 7, 2019

Documentation for using the minimum_probability parameter to obtain the full topic distribution. Fixes #2489.
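
For context, a minimal sketch of the usage being documented, on a hypothetical toy corpus (all variable names here are illustrative):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy corpus, purely illustrative
texts = [["human", "computer", "interaction"], ["graph", "trees", "paths"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2)

# minimum_probability=0.0 requests the full per-document topic distribution
# rather than only the topics above the default threshold
print(lda.get_document_topics(corpus[0], minimum_probability=0.0))
```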

@Hiyorimi (Contributor, Author) commented Oct 8, 2019

@mpenkov please take a look

@mpenkov (Collaborator) commented Oct 9, 2019

Does this actually achieve the goal of the PR?

I think the code in ldamodel.py enforces a minimum on the probability: https://github.com/RaRe-Technologies/gensim/blob/2131e3a3b0231c7eb0b0b447539d2ab7c8802633/gensim/models/ldamodel.py#L1313

So even if you specify zero, there's no guarantee that you will get all topics.
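
For reference, the enforcement being referred to looks roughly like this (paraphrased from the linked line, not a verbatim copy):

```python
# Paraphrased sketch of the check in gensim/models/ldamodel.py:
# the user-supplied value is floored at 1e-8, so an explicit 0.0
# silently becomes 1e-8 before any topics are filtered out.
minimum_probability = max(minimum_probability, 1e-8)
```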

@mpenkov (Collaborator) commented Oct 9, 2019

@piskvorky @menshikh-iv I recall discussing this in another ticket/PR... what do you guys think?

Personally, I think the enforcing line is misleading and can surprise the user. If the user passes zero (a non-default), then they mean zero, not 1e-8.

@piskvorky (Owner) commented Oct 9, 2019

IIRC the issue was with segfaults: there's a contract in Gensim that sparse vectors (both BoW lists and scipy.sparse) don't contain explicit zeros.

Breaking the contract made scipy.sparse confused, and it segfaulted at some point: the number of declared elements didn't match the number of actual sparse elements.

This is probably fixable: we could move enforcement of the contract from the point of creating sparse vectors to the point of consuming them, possibly at some (minor) cost to performance.

The solution we've been offering people who "want zeros" is to explicitly convert the sparse vector to a dense one, e.g. with matutils.sparse2full.
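
For example, assuming a trained model `lda` and a bag-of-words vector `bow` (hypothetical names), the conversion looks like this:

```python
from gensim import matutils

# Dense vector of length num_topics; topics trimmed from the sparse
# output show up as explicit 0.0 entries in the dense array.
sparse_topics = lda.get_document_topics(bow, minimum_probability=0.0)
dense_topics = matutils.sparse2full(sparse_topics, lda.num_topics)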

@menshikh-iv (Contributor) commented Oct 9, 2019

> IIRC the issue was with segfaults: there's a contract in Gensim that sparse vectors (both BoW lists and scipy.sparse) don't contain explicit zeros.

Only 0 is an explicit zero, not elements like 5 * 1e-9. More than that, if I pass minimum_probability myself, I know what I'm doing and why. Ignoring this parameter is a bad surprise for me as a user.

> Breaking the contract made scipy.sparse confused, and it segfaulted at some point: the number of declared elements didn't match the number of actual sparse elements.

No, that's not a problem, because we aren't discussing the "training" stage; we're talking only about inference.

> The solution we've been offering people who "want zeros" is to explicitly convert the sparse vector to a dense one, e.g. with matutils.sparse2full.

Not really, because sparse2full fills in zeros for the missing elements (and 0 < x < 1e-8 isn't zero) -> we lose information, and it's not possible to recover it after the minimum_probability trimming.
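
A small sketch of the information loss being described here, with made-up probabilities:

```python
from gensim import matutils

num_topics = 3
# Hypothetical inference result before trimming: topic 2 carries 5e-9
untrimmed = [(0, 0.7), (1, 0.3 - 5e-9), (2, 5e-9)]

# The enforced floor of 1e-8 drops topic 2 from the sparse output ...
trimmed = [(topic, prob) for topic, prob in untrimmed if prob >= 1e-8]

# ... and sparse2full can only fill the gap with an exact 0.0,
# so the original 5e-9 is unrecoverable from the dense vector.
dense = matutils.sparse2full(trimmed, num_topics)
print(dense)  # approximately [0.7, 0.3, 0.0]
```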

@piskvorky (Owner) commented Oct 9, 2019

> Only 0 is an explicit zero, not elements like 5 * 1e-9.

You cannot rely on explicit zeros when programming with floats.

1e-8 is meant as "epsilon" = a value small enough that users don't care. A topic with probability 0.00000001 is as good as probability 0. No practical difference (and if you're relying on such a difference being significant, there's something wrong with your pipeline).

> More than that, if I pass minimum_probability myself, I know what I'm doing and why. Ignoring this parameter is a bad surprise for me as a user.

Yeah, that's what this ticket is about. But keeping zeros in sparse representations ain't the solution. At least not naively like this.

> No, that's not a problem, because we aren't discussing the "training" stage; we're talking only about inference.

It is a problem, for the reasons given. The "training stage" has nothing to do with it.

@menshikh-iv (Contributor) commented

> No practical difference (and if you're relying on such a difference being significant, there's something wrong with your pipeline).

Ask @mpenkov for a concrete practical example (we discussed that issue a long time ago, based on some piece of code). When your model is trained on a small corpus, this can be important.
If the corpus is "large enough", it has no effect, of course.

> But keeping zeros in sparse representations ain't the solution

To me this sounds like a good solution (provided I set minimum_probability small enough); I don't agree with you.

> It is a problem, for the reasons given. The "training stage" has nothing to do with it.

I don't get your point. In your opinion, what exactly is the problem with scipy when we infer a vector? Maybe a code example?

@Hiyorimi (Contributor, Author) commented

@menshikh-iv @piskvorky any ideas on how I can close this PR?

@mpenkov mpenkov added this to Needs triage in PR triage Nov 3, 2019
@piskvorky piskvorky changed the title from "Documentation changes for #2489 #2491" to "Document minimum_probability=0.0 in LDA" Apr 12, 2020
@mpenkov mpenkov added this to Triage in PR triage 2021-06 Jun 22, 2021
@mpenkov mpenkov added the stale Waiting for author to complete contribution, no recent effort label Jun 22, 2021
@mpenkov mpenkov moved this from Unsorted to Handle after next release in PR triage 2021-06 Jun 22, 2021
Labels: stale (Waiting for author to complete contribution, no recent effort)
Projects: PR triage (Needs triage); PR triage 2021-06 (Handle after next release)
Linked issues: Document how to get complete document distribution from LDA