Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Account for empty corpus in HDPModel. Fixes #2563 #2639

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

dm20
Copy link

@dm20 dm20 commented Oct 17, 2019

There are a couple of ways to go about preventing the infinite loop mentioned in the Issue from occurring. I thought maybe to user logger.error, or to skip the while loop if the corpus is empty. ButI saw in the inference method that a RunTimeError is thrown when lda_alpha and lda_beta are not initialized. So I thought this seemed consistent.

@@ -339,6 +339,8 @@ def __init__(self, corpus, id2word, max_chunks=None, max_time=None,
corresponding lda model from :meth:`~gensim.models.hdpmodel.HdpModel.suggested_lda_model`

"""
if len(corpus) == 0:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is corpus guaranteed to support len() in HdpModel? What if it's just a generator?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point - would this suffice? @piskvorky

if len(corpus) == 0 or len(list(corpus)) == 0:

Copy link
Owner

@piskvorky piskvorky Oct 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most definitely not. The len exception is still there, plus now you're serializing the entire corpus into RAM! If corpus is a generator, its content will be irretrievably gone afterward.

Does HdpModel support generators? Or are we guaranteed to have a repeatable iterable with len?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah ok got it, well the comments in the file just say

corpus : iterable of list of (int, float)
            Corpus in BoW format.

so I wouldn't call this a guarantee but it does sound like list is the intended type. Should there maybe be type checking to guarantee that list is used?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a list. It's an iterable of something.

That something is a list of tuples, but that still doesn't change the fact that you're dealing with an iterable, and you won't necessarily be able to call len() on it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it maybe be easiest to use the existing utils.is_corpus method in the constructor? since it covers the empty corpus case and works for generators

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'd say calling len(corpus) is probably a bug, because it risks consuming the iterable.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible the HDP model does not support iterables. It is possible it supports list only :( (unlike other models in Gensim)

That's why I ask about the contract of HdpModel – whether it explicitly needs its input data in RAM (~list), or if it supports streamed inputs (~any iterable).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@piskvorky I'll look over the code and also look around for usage of the model with a non-list iterable. if it turns out that only list is supported, would it be helpful to add support for all iterables? or not worth the effort since the model may be deprecated soon

Copy link
Owner

@piskvorky piskvorky Oct 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll look over the code and also look around for usage of the model with a non-list iterable.

That will be great, thanks!

or not worth the effort since the model may be deprecated soon

I'd say "not worth the effort" right now. It is probably possible to clean up and optimize HDP, but we lack a clear motivating use-case. So, before any technical optimizations, we'd need a good strong tutorial / how-to document, showing off why HDP is worth it, how it's better than other models, when to use it.

The performance considerations are secondary to my mind.

@mpenkov mpenkov added this to Needs triage in PR triage Nov 3, 2019
@mpenkov mpenkov moved this from Needs triage to High priority in PR triage Nov 3, 2019
@mpenkov mpenkov added the stale Waiting for author to complete contribution, no recent effort label Jan 23, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale Waiting for author to complete contribution, no recent effort
Projects
PR triage
  
High priority
Development

Successfully merging this pull request may close these issues.

None yet

4 participants