[WIP] Clean up of FlsaModel: fixing bugs + formatting + efficiency #3437

piskvorky · 2023-01-20T11:28:49Z

This is still work-in-progress and needs finishing up. Namely:

1. Missing user-friendly docstrings and overall model motivation: what is this, who should use it? What do the various parameters mean?
2. As input, accept standard streaming corpora in the bag-of-words (BoW) format. Drop all the in-memory handling of the entire corpus in RAM as "list of lists of strings" and "scipy DOK matrix"; that doesn't scale.
3. Complete the cleanup of the code formatting that I started. In particular, use more helpful error messages in ValueErrors, showing what values are expected vs. what the user supplied.
4. Related to that, move all parameter validation to a single place in the code: the module entrypoints where users pass in these parameters. Currently the checks (even the same checks?) appear in multiple places, even in internal methods where we should already be in control of the input values, so we're double-checking ourselves, which makes no sense.
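
A minimal sketch of the last two points, for illustration only. The class, the parameter names (`num_topics`, `word_weighting`) and the allowed values are assumptions, not taken from the actual FlsaModel code. Validation happens once, at the public constructor, and each ValueError states both what was expected and what was supplied:

```python
# Illustrative only: the names and allowed values below are assumptions,
# not the actual FlsaModel signature.
ALLOWED_WEIGHTINGS = ("normal", "idf", "entropy")


def validate_params(num_topics, word_weighting):
    """Validate user-supplied parameters once, at the public entrypoint."""
    is_int = isinstance(num_topics, int) and not isinstance(num_topics, bool)
    if not is_int or num_topics < 1:
        raise ValueError(
            f"num_topics must be a positive integer, got {num_topics!r}"
        )
    if word_weighting not in ALLOWED_WEIGHTINGS:
        raise ValueError(
            f"word_weighting must be one of {ALLOWED_WEIGHTINGS}, "
            f"got {word_weighting!r}"
        )


class SketchModel:
    """Hypothetical model class standing in for the real one."""

    def __init__(self, num_topics=10, word_weighting="normal"):
        # The only place where user input is checked.
        validate_params(num_topics, word_weighting)
        self.num_topics = num_topics
        self.word_weighting = word_weighting

    def _internal_step(self):
        # Internal methods trust the values already validated in __init__.
        return self.num_topics * 2
```

With this layout, `SketchModel(num_topics=0)` fails immediately with a message naming the offending value, and internal helpers such as `_internal_step` carry no repeated checks.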

piskvorky · 2023-01-20T11:29:49Z

CC @ERijck are you able to continue and finish this up?

All the points above, plus all the FIXME notes I left in the code, must be resolved if we are to keep FlsaModel in Gensim.

ERijck · 2023-01-20T13:13:09Z

@piskvorky yes, I will do that.

piskvorky · 2023-01-20T13:39:28Z

Finishing up 1, 3 and 4 will be a great start. I can then assist with 2 (input streaming), to bring flsamodel in line with the rest of Gensim.
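
For context on the input streaming point, here is a rough sketch of what a streamed BoW corpus looks like: an iterable that yields one document at a time as a list of `(token_id, count)` pairs, so the full corpus never has to sit in RAM. The class name, the one-document-per-line file layout and the plain `token2id` dict are assumptions for illustration; in Gensim the mapping would typically come from `gensim.corpora.Dictionary` and its `doc2bow` method.

```python
from collections import Counter


class StreamingBowCorpus:
    """Yield one bag-of-words document at a time instead of loading all of them.

    Hypothetical helper: `path` points to a text file with one
    whitespace-tokenized document per line, and `token2id` maps
    tokens to integer ids.
    """

    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id

    def __iter__(self):
        # Re-opening the file on every pass makes the corpus re-iterable,
        # which multi-pass training needs.
        with open(self.path, encoding="utf8") as lines:
            for line in lines:
                counts = Counter(
                    self.token2id[tok]
                    for tok in line.split()
                    if tok in self.token2id  # skip out-of-vocabulary tokens
                )
                # One BoW document: sorted list of (token_id, count) pairs.
                yield sorted(counts.items())
```

A model that only ever iterates over such an object can process corpora larger than memory, since each document is discarded after it has been consumed.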

ERijck · 2023-01-23T16:21:24Z

To get up to speed with Git, I followed the Codecademy Git&Github pro course today. Afterwards, I just tried to fetch and merge the work in your branch. To do so, I used the following:

I expected to see your code when opening flsamodel.py. However, this is not the case. Then, I tried the following steps:

This does not work. Which command can I use to pull cbfd972257f83d2d64803059e6585c00184f784c refs/heads/flsa_fixes?

piskvorky · 2023-01-23T17:58:52Z

Yeah git can be frustrating when you're starting out.

Probably best to discard any existing mess in your local fork and start fresh:

git checkout develop && git fetch upstream && git reset --hard upstream/develop  # discard local changes in your develop branch, if any.
git branch -D flsa_fixes  # delete your existing local flsa_fixes branch, if any.
git checkout -b my_flsa_fixes  # create a new local branch for your changes, named "my_flsa_fixes"
git reset --hard upstream/flsa_fixes  # set the content of "my_flsa_fixes" to match the remote "flsa_fixes", to begin with.

At that point you should be at commit cbfd97225 on branch my_flsa_fixes so you can make your changes and commit them and push them into your Github fork repository.

When your changes are ready for review, open a new pull request (PR) from your my_flsa_fixes branch against Gensim's flsa_fixes branch. You can do this from Github's UI, no need for CLI at this point.

Let me know how it goes :)

ERijck · 2023-01-24T06:59:56Z

Thank you @piskvorky, I will follow your steps!

victox5 · 2023-02-14T18:01:55Z

Hi guys,

I have been checking licensing in some of my projects and FuzzyTM+pyFUME popped up in one that uses gensim. If I understand correctly, they are licensed under the GPL, so importing them in gensim would make gensim GPL as well, rather than LGPL.

Are you aware of this? If I'm wrong concerning the licensing, please let me know.

Thanks!

damonmerrill · 2023-03-08T15:43:27Z

Plus, FuzzyTM is under a GPL2/3 license, which has a strong copyleft requirement. Recently we let poetry update all our dependencies, and our corporate scan tool flagged this dependency as a high concern. We would not be able to continue to use Gensim if that library stayed in (I believe this would be the case for most companies/organizations whose IP is in software). (Ahh, I see @victox5's comment on this now as well.)

piskvorky · 2023-03-08T16:06:15Z

Gensim itself has a strong copyleft license too: LGPL. I'm afraid freeloading corporate concerns are not our primary motivator when choosing dependencies.

We offer a commercial (paid) dual licensing for such cases.

damonmerrill · 2023-03-08T17:34:34Z

Ahh, thanks for the clarification. A misunderstanding on my part of Gensim's (RaRe-Technologies) position. The company I work for would gladly purchase commercial licensing as needed.

piskvorky · 2023-03-09T12:29:34Z

@damonmerrill that would be great – we welcome contributions on all levels: https://github.com/sponsors/piskvorky

pabs3 · 2023-03-13T09:01:30Z

I note that the license link in the file points at LGPLv3 instead of LGPLv2.1; that should get updated.

piskvorky · 2023-05-10T09:22:17Z

@ERijck can you please fix the merge conflict & update the LGPL link as per @pabs3's comment above? Thanks.

ERijck · 2023-05-10T14:17:09Z

Yes, I will do this tomorrow!

ERijck · 2023-05-11T11:33:21Z

See PR #3471 where I apply the required changes to flsa_fixes

wip: clean up of FlsaModel; fixing bugs + formatting + efficiency

cbfd972

piskvorky added this to the Next release milestone Jan 20, 2023

piskvorky mentioned this pull request Jan 20, 2023

Fixes to FlsaModel #3435

Closed

ERijck added 6 commits January 24, 2023 13:33

Add what, why and how

4f46b95

Address the FIXMEs unrelated to BOW

a46d3fe

Show what values are expected instead of what the user supplied

524fc50

Fix the checks. Now in one place only and not in internal methods.

a69acb9

Improve docstrings

a7ea223

remove methods to obtain topic embeddings and pylint improvements

c211cf0

ERijck mentioned this pull request Jan 25, 2023

My flsa fixes #3438

Merged

ERijck added 2 commits February 6, 2023 22:46

Small fixes as mentioned in #3438

d977854

Pylint fixes

7864ebc

piskvorky mentioned this pull request Mar 7, 2023

remove unused dependency, handle ImportError #3447

Merged

piskvorky mentioned this pull request Mar 13, 2023

Replace copy of FuzzyTM in gensim/models/flsamodel.py with dep #3457

Closed

ERijck and others added 2 commits March 21, 2023 09:48

Merge pull request #1 from ERijck/pylint_fixes

9132451

Pylint fixes

Merge pull request #3438 from ERijck/my_flsa_fixes

4fcda16

My flsa fixes

piskvorky mentioned this pull request May 10, 2023

git rm gensim/models/flsamodel.py #3470

Merged

Update the licence link to LGPLv2.1

dd91b78

Merge pull request #3471 from ERijck/my_flsa_fixes

ce8b45e

Update the licence link to LGPLv2.1
