Sequential lda optimization #3172

Open
Sharganov wants to merge 8 commits into develop from sequential_lda_optimization

Conversation


@Sharganov Sharganov commented Jun 14, 2021

Hi team,

This pull request addresses the slow training time of LdaSeqModel (#1545). In its current state it contains Cythonized methods of the SSLM and LdaPost classes. This is not the final version; I plan to add the following changes:

  • cythonize the sslm_counts_init function of the sslm class
  • refactor the code (e.g. for chain.sslm_counts_init(topic_obs_variance, topic_chain_variance, sstats) in the LdaSeqModel class a class method is used, though I don't know why), plus many other spots
  • add more documentation for the functions
  • implement the DIM model
  • tune memory usage and allocation
  • fix a bug I found in the update_obs function of the sslm class; compare the logic in the gensim implementation with Blei's lab implementation

So far the performance improvement is 3-5x on toy datasets, e.g. https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_ldaseqmodel.py. I haven't benchmarked larger datasets yet, since the original implementation takes a long time to train; however, based on the synthetic data I tried, the speedup appears to be much higher.

The topics extracted by the optimized model differ slightly from the original (the gensim tests pass). The exact places in the code that cause this are:

  • the update_zeta function of the sslm class
  • the np.negative at the end of df_obs

Everything except these two places behaves identically. From my perspective this is not a bug in the code, just floating-point precision issues (I hope so); see the illustration below.
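
Such differences can come purely from floating-point summation order: NumPy's vectorized sum uses pairwise summation, while a sequential loop of the kind a Cython port compiles to accumulates left to right, and the two round differently. A minimal, self-contained illustration (not code from the PR):

import numpy as np

rng = np.random.default_rng(0)
x = rng.random(10**6)

vectorized = np.sum(x)  # NumPy applies pairwise summation internally
sequential = 0.0
for v in x:             # accumulation order of a plain C/Cython loop
    sequential += v

print(vectorized - sequential)  # tiny but nonzero difference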

@gojomo @mpenkov @piskvorky What do you think about such changes in gensim in general, and about the code structure and the code itself in particular? I plan to finish the optimization and the bug fixes mentioned above by the end of this month.

P.S. I've created the pull request before finishing to make sure these changes are valuable for the project :)

Reduced time for sslm initialization (it was especially critical for large datasets). Removed duplicated code.
@Sharganov Sharganov force-pushed the sequential_lda_optimization branch from 5768e96 to 2015bdb Compare June 17, 2021 23:02
@mpenkov mpenkov added this to Triage in PR triage 2021-06 Jun 22, 2021
@mpenkov mpenkov moved this from Triage to Needs work in PR triage 2021-06 Jun 22, 2021
@Sharganov
Author

Sharganov commented Jun 23, 2021

I've run a test using the setup from the ldaseqmodel example notebook in the gensim docs. The current time distribution is shown in the picture below. Almost all of the time is spent in scipy's Python optimization code; the time spent in Cython is marked with red rectangles. The times in the rectangles (upper right corner) are in nanoseconds.

The test was done using Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz

[Profiler call-graph screenshot: ldaseq_optimized_news]

The raw output of the profiler, plus a file for kcachegrind, is on gdrive.
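
For reference, one way to produce such a kcachegrind-readable file from Python (a sketch only; the thread doesn't say which profiler was used, so cProfile plus pyprof2calltree, and the variable names, are assumptions):

import cProfile
import pstats
from pyprof2calltree import convert  # pip install pyprof2calltree

from gensim.models import LdaSeqModel

profiler = cProfile.Profile()
profiler.enable()
# corpus, dictionary and time_slices prepared beforehand,
# as in the benchmark script further down this thread
model = LdaSeqModel(corpus=corpus, id2word=dictionary,
                    num_topics=15, time_slice=time_slices)
profiler.disable()

convert(pstats.Stats(profiler), 'callgrind.out.ldaseq')  # open in kcachegrind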

@piskvorky
Owner

Thanks! Do you have any high-level numbers on a larger dataset, to get a sense of the scale of these optimizations, now vs. before?

@Sharganov
Author

I trained both implementations on the first 10 years of the UN General Debate dataset. The code I used:

from gensim.corpora import Dictionary
from gensim import utils
from gensim.models import LdaSeqModel

import pandas as pd

data = pd.read_csv("datasets/un-general-debates.csv")
data["year"] = data["year"].astype(int)
data = data[data["year"] < 1980]

# Split each speech into paragraphs, keeping the year of the speech.
paragraphs = [[(p, year) for p in speech.split("\n") if len(p) > 10]
              for speech, year in zip(data.text, data.year)]
paragraphs = pd.DataFrame(data=sorted([p for s in paragraphs for p in s], key=lambda x: x[1]),
                          columns=["text", "year"])

# Paragraphs are sorted chronologically, so the slice sizes must be in year
# order too; value_counts() alone would order them by frequency instead.
time_slices = paragraphs.year.value_counts().sort_index().values
paragraphs = [utils.simple_preprocess(p) for p in paragraphs.text]

dictionary = Dictionary(paragraphs)
corpus = [dictionary.doc2bow(text) for text in paragraphs]

del data, paragraphs

model = LdaSeqModel(corpus=corpus, id2word=dictionary, num_topics=15, time_slice=time_slices)
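
Right after training, it's worth persisting the model with gensim's standard save/load API (a usage note; the filename is arbitrary):

model.save("ldaseq_new.model")  # LdaSeqModel inherits gensim's SaveLoad
model = LdaSeqModel.load("ldaseq_new.model")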

The results:

Number of docs: 61674
Vocabulary length: 29485
Number of topics: 15
Number of time slices: 10

Old implementation: ~47 hours
New implementation: ~3 hours 40 mins

@piskvorky
Owner

piskvorky commented Jul 3, 2021

Thanks for the timings. That's a massive speedup!

And are the new-vs-old results comparable, quality-wise?

@Sharganov
Author

Sure... though I accidentally overwrote the trained old model :(

So for now I've just trained on two years of data using the same code, plus stopword removal using the nltk set. I've put the resulting topics in the excel table. To me, the topics look very similar.

As for the model trained on 10 years of data, let me train it one more time :) I'll be back in a couple of days. Meanwhile, I'll find a more formal way to compare two SeqLda models; something like the sketch below, perhaps.
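
One option (a sketch only; the helper name, the variable names, and the use of dtm_vis() to extract the topic-term matrices are assumptions, not code from this PR) is to match topics between the two models with the Hungarian algorithm and report the mean Hellinger distance per time slice:

import numpy as np
from scipy.optimize import linear_sum_assignment
from gensim.matutils import hellinger

def compare_seq_models(model_old, model_new, corpus, num_time_slices):
    for t in range(num_time_slices):
        # dtm_vis returns (doc_topic, topic_term, doc_lengths, term_frequency, vocab)
        _, topics_old, _, _, _ = model_old.dtm_vis(t, corpus)
        _, topics_new, _, _, _ = model_new.dtm_vis(t, corpus)
        # cost[i, j] = Hellinger distance between old topic i and new topic j
        cost = np.array([[hellinger(o, n) for n in topics_new] for o in topics_old])
        rows, cols = linear_sum_assignment(cost)  # optimal one-to-one topic matching
        print("time slice %d: mean Hellinger distance %.4f" % (t, cost[rows, cols].mean()))

Identically-behaving implementations should give distances near zero; small values would confirm the two models agree up to numerical noise.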

@Sharganov
Author

I've trained the model one more time. Pickled models are available on gdrive. I calculated the coherence score for both models, as done in the docs tutorial.
I'm not sure a coherence score is the right way to compare the classic vs. optimized models; in a perfect case, I think the scores should be equal. Nevertheless, they are very close for all time slices.

The code I used:

from gensim.models.coherencemodel import CoherenceModel

# u_mass coherence of the topics at each time slice, for one model
[CoherenceModel(topics=model.dtm_coherence(time=i), corpus=corpus,
                dictionary=dictionary, coherence='u_mass').get_coherence()
 for i in range(len(time_slices))]

Results:

Time slice   Old                  New
1            -2.390386933137473   -2.3849280478010373
2            -2.3854065916265434  -2.38503617012231
3            -2.412550400281885   -2.412624753824853
4            -2.403873956103539   -2.402234764724972
5            -2.4228081694247994  -2.416472944689486
6            -2.462690192130912   -2.466086114010153
7            -2.4189737189539     -2.4169199318713814
8            -2.484822837691009   -2.488989346989372
9            -2.5564440634792547  -2.557404323965962
10           -2.551624889028128   -2.54892377159758

@mpenkov mpenkov moved this from Needs work before merge to Handle after next release in PR triage 2021-06 Jul 23, 2021
@lkcao

lkcao commented Sep 1, 2021

Thanks so much, this is helpful. How can we use the new version? Is it built into the package, or do we need to modify the code ourselves?

@piskvorky
Owner

piskvorky commented Sep 1, 2021

Good point. @Sharganov can you fix the bug you found above & tune the memory usage? Let's get your work merged & released!

@mpenkov
Collaborator

mpenkov commented Sep 28, 2021

@Sharganov Ping

@florianlorisch

That's some impressive work, @Sharganov. Any news on whether this is anywhere close to being merged/released?

@Sharganov
Author

Hi everyone, sorry for the long break. I've fixed the bug I mentioned. I'll also look at memory usage.

@piskvorky
Owner

@Sharganov what did you find out about the memory? We're planning a new release, and this looks like a solid candidate for inclusion if finished up.

@Sharganov
Author

@piskvorky I'm planning to look at memory approximately 28/02 - 04/03; I'll have a lot of free time on those dates. I hope that fits your release plans.

@piskvorky
Owner

piskvorky commented Feb 22, 2022

No problem, thanks. If we don't manage to finish the optimization in this release, we can include it in the next.

@jlevy44

jlevy44 commented Mar 8, 2022

This is great. Looking forward to the release! Can this be used in its current state?

@bhargavvader
Contributor

Really looking forward to the faster release, @piskvorky @Sharganov ! Great job!

@jlevy44

jlevy44 commented Jun 16, 2022

Hi everyone, any updates here?

@piskvorky
Owner

@Sharganov how did your memory optimization go? Looks like there is quite some demand for this feature!

@maciejskorski

maciejskorski commented Jun 23, 2023

What's the status? Do you need any help, or a volunteer to bridge the gaps against changes in the code base?

@Fan-chen04

Please, any updates so far? I really need to improve the efficiency of LdaSeqModel. If we set aside the memory-optimization concerns for the time being, would it be possible for me to apply this version through local file modifications? Or does it have errors or conflicts with the current codebase? Thank you for your assistance.

@piskvorky
Owner

piskvorky commented Mar 31, 2024

I think it's fair to say this PR is stale / dead. Unless someone picks it up to push it over the line, it's not happening, sorry.

@piskvorky piskvorky added the stale Waiting for author to complete contribution, no recent effort label Mar 31, 2024
@Fan-chen04

Fan-chen04 commented Apr 2, 2024

> I think it's fair to say this PR is stale / dead. Unless someone picks it up, it's not happening, sorry.

Such a pity. I tried to apply this version through local file modifications, but I'm facing the same problem detailed in #3491 when trying to compile gensim-4.3.2.tar.gz from source. I've followed the recommended steps, including ensuring all prerequisites are met and using the python setup.py build_ext --inplace command, but I still encounter compilation errors. Any guidance or suggestions based on that issue would be greatly appreciated.
