Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Clean up of FlsaModel: fixing bugs + formatting + efficiency #3437

Open
wants to merge 13 commits into
base: develop
Choose a base branch
from

Conversation

piskvorky
Copy link
Owner

@piskvorky piskvorky commented Jan 20, 2023

Fixes #3423. Supersedes #3435, #3436.

This is still work-in-progress and needs finishing up. Namely:

  1. Missing user-friendly docstrings and overall model motivation: what is this, who should use it? What do the various parameters mean?
  2. As input, accept standard streaming corpora in the bag-of-words (BoW) format. Drop all the in-memory handling of the entire corpus in RAM as "list of list of strinks" and "scipy DOK matrix", that doesn't scale.
  3. Complete the cleanup of the code formatting that I started. Especially use more helpful error messages in ValueErrors, showing what values are expected vs what the user supplied.
  4. Related to that, focus all the parameter validation to a single place in code = the module entrypoints where users pass in these parameters. Currently the checks (even the same checks?) appear in multiple places, even in internal methods, where we should be in control of what the input values are, so we're doublechecking ourselves which makes no sense.

@piskvorky piskvorky added this to the Next release milestone Jan 20, 2023
@piskvorky
Copy link
Owner Author

piskvorky commented Jan 20, 2023

CC @ERijck are you able to continue and finish this up?

All the points above, plus all the FIXME notes I left in the code, must be resolved if we are to keep FlsaModel in Gensim.

@piskvorky piskvorky mentioned this pull request Jan 20, 2023
@ERijck
Copy link
Contributor

ERijck commented Jan 20, 2023

@piskvorky yes, I will do that.

@piskvorky
Copy link
Owner Author

Finishing up 1, 3 and 4 will be a great start. I can then assist with 2 (input streaming), to bring flsamodel in line with the rest of Gensim.

@ERijck
Copy link
Contributor

ERijck commented Jan 23, 2023

To get up to speed with Git, I followed the Codecademy Git&Github pro course today. Afterwards, I just tried to fetch and merge the work in your branch. To do so, I used the following:

image

I assumed to see your code when opening flsamodel.py. However, this is not the case. Then, I tried the following steps:

image

This does not work. Which command can I use to pull cbfd972257f83d2d64803059e6585c00184f784c refs/heads/flsa_fixes?

@piskvorky
Copy link
Owner Author

piskvorky commented Jan 23, 2023

Yeah git can be frustrating when you're starting out.

Probably best to discard any existing mess in your local fork and start fresh:

git checkout develop && git fetch upstream && git reset --hard upstream/develop  # discard local changes in your develop branch, if any.
git branch -D flsa_fixes  # delete your existing local flsa_fixes branch, if any.
git checkout -b my_flsa_fixes  # create a new local branch for your changes, named "my_flsa_fixes"
git reset --hard upstream/flsa_fixes  # set the content of "my_flsa_fixes" to match the remote "flsa_fixes", to begin with.

At that point you should be at commit cbfd97225 on branch my_flsa_fixes so you can make your changes and commit them and push them into your Github fork repository.

When your changes are ready for review, open a new pull request (PR) from your my_flsa_fixes branch against Gensim's flsa_fixes branch. You can do this from Github's UI, no need for CLI at this point.

Let me know how it goes :)

@ERijck
Copy link
Contributor

ERijck commented Jan 24, 2023

Thank you @piskvorky, I will follow your steps!

@ERijck ERijck mentioned this pull request Jan 25, 2023
@victox5
Copy link

victox5 commented Feb 14, 2023

Hi guys,

I have been checking licensing in some of my projects and I got FuzzyTM+pyFUME popping up in one using gensim. If correctly, they are following GPL, importing them in gensim would make gensim GPL as well, rather than LGPL.

Are you aware of this? If I'm wrong concerning the licensing, please let me know.

Thanks!

@damonmerrill
Copy link

damonmerrill commented Mar 8, 2023

Plus, FuzzyTM is a GPL2/3 license which has a strong copy left requirement. Recently we let poetry update all our dependencies and our corporate scan tool reported a high concern to us with the dependency scan. We would not be able to continue to use Gensim if that library stayed in (I believe this would be the case for most companies/organizations where their IP is in software.) (ahh, I see @victox5 comment on this now as well)

@piskvorky
Copy link
Owner Author

piskvorky commented Mar 8, 2023

Gensim itself has a strong copy left license too – LGPL. I'm afraid freeloading corporate concerns are not our primary motivator when choosing dependencies.

We offer a commercial (paid) dual licensing for such cases.

@damonmerrill
Copy link

damonmerrill commented Mar 8, 2023

ahh, thanks for the clarification. A mis-understanding on my part with gensims (RaRe-Technologies) position. The company I work for would gladly purchase commercial licensing as needed.

@piskvorky
Copy link
Owner Author

piskvorky commented Mar 9, 2023

@damonmerrill that would be great – we welcome contributions on all levels: https://github.com/sponsors/piskvorky

@pabs3
Copy link
Contributor

pabs3 commented Mar 13, 2023

I note that the license link in the file points at LGPLv3 instead of LGPLv2.1, that should get updated.

@piskvorky
Copy link
Owner Author

@ERijck can you please fix the merge conflict & update the LGPL link as per @pabs3 's comment above? Thanks.

@ERijck
Copy link
Contributor

ERijck commented May 10, 2023

Yes, I will do this tomorrow!

@ERijck
Copy link
Contributor

ERijck commented May 11, 2023

See PR #3471 where I apply the required changes to flsa_fixes

Update the licence link to  LGPLv2.1
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unnecessary dependency on FuzzyTM pulls in many libraries
5 participants