
Document minimum_probability=0.0 in LDA #2625

Open · wants to merge 1 commit into develop

Conversation

@Hiyorimi (Contributor) commented Oct 7, 2019

Documentation for using the minimum_probability parameter to obtain the full topic distribution. Fixes #2489.
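
For context, a minimal sketch of the usage being documented, on a hypothetical toy corpus (all variable names here are illustrative):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy corpus, purely illustrative
texts = [["human", "computer", "interaction"], ["graph", "trees", "paths"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2)

# minimum_probability=0.0 requests the full per-document topic distribution
# rather than only the topics above the default threshold
print(lda.get_document_topics(corpus[0], minimum_probability=0.0))
```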

@Hiyorimi (Contributor, Author) commented Oct 8, 2019

@mpenkov please take a look

@mpenkov (Collaborator) commented Oct 9, 2019

Does this actually achieve the goal of the PR?

I think the code in ldamodel.py enforces a minimum on the probability: https://github.com/RaRe-Technologies/gensim/blob/2131e3a3b0231c7eb0b0b447539d2ab7c8802633/gensim/models/ldamodel.py#L1313

So even if you specify zero, there's no guarantee that you will get all topics.
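
For reference, the enforcement being referred to looks roughly like this (paraphrased from the linked line, not a verbatim copy):

```python
# Paraphrased sketch of the check in gensim/models/ldamodel.py:
# the user-supplied value is floored at 1e-8, so an explicit 0.0
# silently becomes 1e-8 before any topics are filtered out.
minimum_probability = max(minimum_probability, 1e-8)
```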

@mpenkov (Collaborator) commented Oct 9, 2019

@piskvorky @menshikh-iv I recall discussing this in another ticket/PR... what do you guys think?

Personally, I think the enforcing line is misleading and can surprise the user. If the user passes zero (a non-default), then they mean zero, not 1e-8.

@piskvorky (Owner) commented Oct 9, 2019

IIRC the issue was with segfaults: there's a contract in Gensim that sparse vectors (both BoW lists and scipy.sparse) don't contain explicit zeros.

Breaking the contract made scipy.sparse confused, and it segfaulted at some point: the number of declared elements didn't match the number of actual sparse elements.

This is probably fixable: we could move enforcement of the contract from the point of creating sparse vectors to the point of consuming them, possibly at some (minor) cost to performance.

The solution we've been offering people who "want zeros" is to explicitly convert the sparse vector to a dense one, e.g. with matutils.sparse2full.
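
For example, assuming a trained model `lda` and a bag-of-words vector `bow` (hypothetical names), the conversion looks like this:

```python
from gensim import matutils

# Dense vector of length num_topics; topics trimmed from the sparse
# output show up as explicit 0.0 entries in the dense array.
sparse_topics = lda.get_document_topics(bow, minimum_probability=0.0)
dense_topics = matutils.sparse2full(sparse_topics, lda.num_topics)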

@menshikh-iv (Contributor) commented Oct 9, 2019

> IIRC the issue was with segfaults: there's a contract in Gensim that sparse vectors (both BoW lists and scipy.sparse) don't contain explicit zeros.

Only 0 is an explicit zero, not elements like 5 * 1e-9. More than that, if I pass minimum_probability myself, I know what I'm doing and why. Ignoring this parameter is a bad surprise for me as a user.

> Breaking the contract made scipy.sparse confused, and it segfaulted at some point: the number of declared elements didn't match the number of actual sparse elements.

No, that's not a problem, because we aren't discussing the "training" stage; we're talking only about inference.

> The solution we've been offering people who "want zeros" is to explicitly convert the sparse vector to a dense one, e.g. with matutils.sparse2full.

Not really, because sparse2full fills in zeros for the missing elements (and 0 < x < 1e-8 isn't zero) -> we lose information, and it's not possible to recover it after the minimum_probability trimming.
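
A small sketch of the information loss being described here, with made-up probabilities:

```python
from gensim import matutils

num_topics = 3
# Hypothetical inference result before trimming: topic 2 carries 5e-9
untrimmed = [(0, 0.7), (1, 0.3 - 5e-9), (2, 5e-9)]

# The enforced floor of 1e-8 drops topic 2 from the sparse output ...
trimmed = [(topic, prob) for topic, prob in untrimmed if prob >= 1e-8]

# ... and sparse2full can only fill the gap with an exact 0.0,
# so the original 5e-9 is unrecoverable from the dense vector.
dense = matutils.sparse2full(trimmed, num_topics)
print(dense)  # approximately [0.7, 0.3, 0.0]
```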

@piskvorky (Owner) commented Oct 9, 2019

> Only 0 is an explicit zero, not elements like 5 * 1e-9.

You cannot rely on explicit zeros when programming with floats.

1e-8 is meant as "epsilon" = a value small enough that users don't care. A topic with probability 0.00000001 is as good as probability 0. No practical difference (and if you're relying on such a difference being significant, there's something wrong with your pipeline).

> More than that, if I pass minimum_probability myself, I know what I'm doing and why. Ignoring this parameter is a bad surprise for me as a user.

Yeah, that's what this ticket is about. But keeping zeros in sparse representations ain't the solution. At least not naively like this.

> No, that's not a problem, because we aren't discussing the "training" stage; we're talking only about inference.

It is a problem, for the reasons given. The "training stage" has nothing to do with it.

@menshikh-iv (Contributor) commented

> No practical difference (and if you're relying on such a difference being significant, there's something wrong with your pipeline).

Ask @mpenkov for a concrete practical example (we discussed that issue a long time ago, based on some piece of code). When your model is trained on a small corpus, this can be important.
If the corpus is "large enough", it has no effect, of course.

> But keeping zeros in sparse representations ain't the solution

To me this sounds like a good solution (provided I set minimum_probability small enough); I don't agree with you.

> It is a problem, for the reasons given. The "training stage" has nothing to do with it.

I don't get your point. In your opinion, what exactly is the problem with scipy when we infer a vector? Maybe a code example?

@Hiyorimi (Contributor, Author) commented

@menshikh-iv @piskvorky any ideas on how I can close this PR?

@mpenkov mpenkov added this to Needs triage in PR triage Nov 3, 2019
@piskvorky piskvorky changed the title from "Documentation changes for #2489 #2491" to "Document minimum_probability=0.0 in LDA" Apr 12, 2020
@mpenkov mpenkov added this to Triage in PR triage 2021-06 Jun 22, 2021
@mpenkov mpenkov added the stale Waiting for author to complete contribution, no recent effort label Jun 22, 2021
@mpenkov mpenkov moved this from Unsorted to Handle after next release in PR triage 2021-06 Jun 22, 2021
Labels: stale (Waiting for author to complete contribution, no recent effort)
Projects: PR triage (Needs triage); PR triage 2021-06 (Handle after next release)
Linked issues: Document how to get complete document distribution from LDA