Sequential lda optimization #3172

Open
Sharganov wants to merge 8 commits into develop from sequential_lda_optimization

Conversation


@Sharganov Sharganov commented Jun 14, 2021

Hi team,

This pull request addresses the slow training time of LdaSeqModel (#1545). In its current state it contains Cythonized methods of the SSLM and LdaPost classes. This is not the final version; I plan to add the following changes:

  • cythonize the sslm_counts_init function of the sslm class
  • refactor the code (e.g. for chain.sslm_counts_init(topic_obs_variance, topic_chain_variance, sstats) in the LdaSeqModel class a class method is used, though I don't know why), plus many other spots
  • add more documentation for the functions
  • implement the DIM model
  • tune memory usage and allocation
  • fix a bug I found in the update_obs function of the sslm class; compare the logic in the gensim implementation with Blei's lab implementation

So far the performance improvement is 3-5x on toy datasets, e.g. https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_ldaseqmodel.py. I haven't benchmarked larger datasets yet, since the original implementation takes a long time to train; however, based on the synthetic data I tried, the speedup appears to be much higher.

The topics extracted by the optimized model differ slightly from the original (the gensim tests pass). The exact places in the code that cause this are:

  • the update_zeta function of the sslm class
  • the np.negative at the end of df_obs

Everything except these two places behaves identically. From my perspective this is not a bug in the code, just floating-point precision issues (I hope so); see the illustration below.
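
Such differences can come purely from floating-point summation order: NumPy's vectorized sum uses pairwise summation, while a sequential loop of the kind a Cython port compiles to accumulates left to right, and the two round differently. A minimal, self-contained illustration (not code from the PR):

import numpy as np

rng = np.random.default_rng(0)
x = rng.random(10**6)

vectorized = np.sum(x)  # NumPy applies pairwise summation internally
sequential = 0.0
for v in x:             # accumulation order of a plain C/Cython loop
    sequential += v

print(vectorized - sequential)  # tiny but nonzero difference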

@gojomo @mpenkov @piskvorky What do you think about such changes in gensim in general, and about the code structure and the code itself in particular? I plan to finish the optimization and the bug fixes mentioned above by the end of this month.

P.S. I've created the pull request before finishing to make sure these changes are valuable for the project :)

Reduced time for sslm initialization (it was especially critical for large datasets). Removed duplicated code.
@Sharganov Sharganov force-pushed the sequential_lda_optimization branch from 5768e96 to 2015bdb Compare June 17, 2021 23:02
@mpenkov mpenkov added this to Triage in PR triage 2021-06 Jun 22, 2021
@mpenkov mpenkov moved this from Triage to Needs work in PR triage 2021-06 Jun 22, 2021
@Sharganov
Author

Sharganov commented Jun 23, 2021

I've run a test using the setup from the ldaseqmodel example notebook in the gensim docs. The current time distribution is shown in the picture below. Almost all of the time is spent in scipy's Python optimization code; the time spent in Cython is marked with red rectangles. The times in the rectangles (upper right corner) are in nanoseconds.

The test was done using Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz

[Profiler call-graph screenshot: ldaseq_optimized_news]

The raw output of the profiler, plus a file for kcachegrind, is on gdrive.
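
For reference, one way to produce such a kcachegrind-readable file from Python (a sketch only; the thread doesn't say which profiler was used, so cProfile plus pyprof2calltree, and the variable names, are assumptions):

import cProfile
import pstats
from pyprof2calltree import convert  # pip install pyprof2calltree

from gensim.models import LdaSeqModel

profiler = cProfile.Profile()
profiler.enable()
# corpus, dictionary and time_slices prepared beforehand,
# as in the benchmark script further down this thread
model = LdaSeqModel(corpus=corpus, id2word=dictionary,
                    num_topics=15, time_slice=time_slices)
profiler.disable()

convert(pstats.Stats(profiler), 'callgrind.out.ldaseq')  # open in kcachegrind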

@piskvorky
Owner

Thanks! Do you have any high-level numbers on a larger dataset, to get a sense of the scale of these optimizations, now vs. before?

@Sharganov
Author

I trained both implementations on the first 10 years of the UN General Debate dataset. The code I used:

from gensim.corpora import Dictionary
from gensim import utils
from gensim.models import LdaSeqModel

import pandas as pd

data = pd.read_csv("datasets/un-general-debates.csv")
data["year"] = data["year"].astype(int)
data = data[data["year"] < 1980]

# Split each speech into paragraphs, keeping the year of the speech.
paragraphs = [[(p, year) for p in speech.split("\n") if len(p) > 10]
              for speech, year in zip(data.text, data.year)]
paragraphs = pd.DataFrame(data=sorted([p for s in paragraphs for p in s], key=lambda x: x[1]),
                          columns=["text", "year"])

# Paragraphs are sorted chronologically, so the slice sizes must be in year
# order too; value_counts() alone would order them by frequency instead.
time_slices = paragraphs.year.value_counts().sort_index().values
paragraphs = [utils.simple_preprocess(p) for p in paragraphs.text]

dictionary = Dictionary(paragraphs)
corpus = [dictionary.doc2bow(text) for text in paragraphs]

del data, paragraphs

model = LdaSeqModel(corpus=corpus, id2word=dictionary, num_topics=15, time_slice=time_slices)
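
Right after training, it's worth persisting the model with gensim's standard save/load API (a usage note; the filename is arbitrary):

model.save("ldaseq_new.model")  # LdaSeqModel inherits gensim's SaveLoad
model = LdaSeqModel.load("ldaseq_new.model")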

The results:

Number of docs: 61674
Vocabulary length: 29485
Number of topics: 15
Number of time slices: 10

Old implementation: ~47 hours
New implementation: ~3 hours 40 mins

@piskvorky
Owner

piskvorky commented Jul 3, 2021

Thanks for the timings. That's a massive speedup!

And are the new-vs-old results comparable, quality-wise?

@Sharganov
Author

Sure... though I accidentally overwrote the trained old model :(

So for now I've just trained on two years of data using the same code, plus stopword removal using the nltk set. I've put the resulting topics in the excel table. To me, the topics look very similar.

As for the model trained on 10 years of data, let me train it one more time :) I'll be back in a couple of days. Meanwhile, I'll find a more formal way to compare two SeqLda models; something like the sketch below, perhaps.
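
One option (a sketch only; the helper name, the variable names, and the use of dtm_vis() to extract the topic-term matrices are assumptions, not code from this PR) is to match topics between the two models with the Hungarian algorithm and report the mean Hellinger distance per time slice:

import numpy as np
from scipy.optimize import linear_sum_assignment
from gensim.matutils import hellinger

def compare_seq_models(model_old, model_new, corpus, num_time_slices):
    for t in range(num_time_slices):
        # dtm_vis returns (doc_topic, topic_term, doc_lengths, term_frequency, vocab)
        _, topics_old, _, _, _ = model_old.dtm_vis(t, corpus)
        _, topics_new, _, _, _ = model_new.dtm_vis(t, corpus)
        # cost[i, j] = Hellinger distance between old topic i and new topic j
        cost = np.array([[hellinger(o, n) for n in topics_new] for o in topics_old])
        rows, cols = linear_sum_assignment(cost)  # optimal one-to-one topic matching
        print("time slice %d: mean Hellinger distance %.4f" % (t, cost[rows, cols].mean()))

Identically-behaving implementations should give distances near zero; small values would confirm the two models agree up to numerical noise.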

@Sharganov
Author

I've trained the model one more time. Pickled models are available on gdrive. I calculated the coherence score for both models, as done in the docs tutorial.
I'm not sure a coherence score is the right way to compare the classic vs. optimized models; in a perfect case, I think the scores should be equal. Nevertheless, they are very close for all time slices.

The code I used:

from gensim.models.coherencemodel import CoherenceModel

# u_mass coherence of the topics at each time slice, for one model
[CoherenceModel(topics=model.dtm_coherence(time=i), corpus=corpus,
                dictionary=dictionary, coherence='u_mass').get_coherence()
 for i in range(len(time_slices))]

Results:

Time slice   Old                  New
1            -2.390386933137473   -2.3849280478010373
2            -2.3854065916265434  -2.38503617012231
3            -2.412550400281885   -2.412624753824853
4            -2.403873956103539   -2.402234764724972
5            -2.4228081694247994  -2.416472944689486
6            -2.462690192130912   -2.466086114010153
7            -2.4189737189539     -2.4169199318713814
8            -2.484822837691009   -2.488989346989372
9            -2.5564440634792547  -2.557404323965962
10           -2.551624889028128   -2.54892377159758

@mpenkov mpenkov moved this from Needs work before merge to Handle after next release in PR triage 2021-06 Jul 23, 2021
@lkcao

lkcao commented Sep 1, 2021

Thanks so much, this is helpful. How can we use the new version? Is it built into the package, or do we need to modify the code ourselves?

@piskvorky
Owner

piskvorky commented Sep 1, 2021

Good point. @Sharganov can you fix the bug you found above & tune the memory usage? Let's get your work merged & released!

@mpenkov
Collaborator

mpenkov commented Sep 28, 2021

@Sharganov Ping

@florianlorisch

That's some impressive work, @Sharganov. Any news on whether this is anywhere close to being merged/released?

@Sharganov
Author

Hi everyone, sorry for the long break. I've fixed the bug I mentioned. I'll also look at memory usage.

@piskvorky
Owner

@Sharganov what did you find out about the memory? We're planning a new release, and this looks like a solid candidate for inclusion if finished up.

@Sharganov
Author

@piskvorky I'm planning to look at memory approximately 28/02 - 04/03; I'll have a lot of free time on those dates. I hope that fits your release plans.

@piskvorky
Owner

piskvorky commented Feb 22, 2022

No problem, thanks. If we don't manage to finish the optimization in this release, we can include it in the next.

@jlevy44

jlevy44 commented Mar 8, 2022

This is great. Looking forward to the release! Can this be used in its current state?

@bhargavvader
Contributor

Really looking forward to the faster release, @piskvorky @Sharganov ! Great job!

@jlevy44

jlevy44 commented Jun 16, 2022

Hi everyone, any updates here?

@piskvorky
Owner

@Sharganov how did your memory optimization go? Looks like there is quite some demand for this feature!

@maciejskorski

maciejskorski commented Jun 23, 2023

What's the status? Do you need any help, or a volunteer to bridge the gaps against changes in the code base?

@Fan-chen04

Please, any updates so far? I really need to improve the efficiency of LdaSeqModel. If we set aside the memory-optimization concerns for the time being, would it be possible for me to apply this version through local file modifications? Or does it have errors or conflicts with the current codebase? Thank you for your assistance.

@piskvorky
Owner

piskvorky commented Mar 31, 2024

I think it's fair to say this PR is stale / dead. Unless someone picks it up to push it over the line, it's not happening, sorry.

@piskvorky piskvorky added the stale Waiting for author to complete contribution, no recent effort label Mar 31, 2024
@Fan-chen04

Fan-chen04 commented Apr 2, 2024

> I think it's fair to say this PR is stale / dead. Unless someone picks it up, it's not happening, sorry.

Such a pity. I tried to apply this version through local file modifications, but I'm facing the same problem detailed in #3491 when trying to compile gensim-4.3.2.tar.gz from source. I've followed the recommended steps, including ensuring all prerequisites are met and using the python setup.py build_ext --inplace command, but I still encounter compilation errors. Any guidance or suggestions based on that issue would be greatly appreciated.
