
ConvergenceWarning during training #1091

Open · NickCrews opened this issue Sep 7, 2022 · 2 comments

@NickCrews (Contributor)

I get this warning during the fitting of the linear model when performing a deduplication task:

```
/Users/nickcrews/Library/Application Support/hatch/env/virtual/noatak-UM6-FHel/noatak/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:444: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
```

I am training on 800 records, manually labeled with cluster ids. Out of these 800 * 800 = 640,000 possible pairs, I'm sampling 50,000 using `dedupe.training_data_dedupe()` and feeding those 50k pairs to `Dedupe.train()`. After expanding the Missing, Categorical, and Interaction variables, the X array that the linear model sees has 32 columns.
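For concreteness, here is a minimal sketch of that setup, assuming the dedupe 2.x API (the field definition and the record loader are placeholders, not my actual project):

```python
import dedupe

# Placeholder schema; the real project has many more fields.
fields = [{"field": "name", "type": "String"}]

# Hypothetical loader: {record_id: record_dict}, where each record carries
# a "cluster_id" key from the manual labeling step.
data = load_records()

deduper = dedupe.Dedupe(fields)

# Sample 50,000 labeled pairs, using the manually assigned cluster ids
# to decide which pairs count as matches vs. distinct.
training_pairs = dedupe.training_data_dedupe(data, "cluster_id", training_size=50_000)

deduper.prepare_training(data)
deduper.mark_pairs(training_pairs)
deduper.train()  # the ConvergenceWarning shows up during this phase
```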

I know this isn't reproducible as yet, but I was hoping to avoid that work of getting everything together, if the information above is enough to give you any insights. If needed, I can try to make something reproducible.

@NickCrews (Contributor, Author)

OK, so if I go in and monkeypatch this block in dedupe/labeler.py (lines 72 to 77 at commit 220efe5):

```python
class MatchLearner(Learner):
    def __init__(self, data_model: DataModel, candidates: TrainingExamples):
        self.data_model = data_model
        self._candidates = candidates.copy()
        self._classifier = sklearn.linear_model.LogisticRegression()
        self._distances = self._calc_distances(self.candidates)
```

to `sklearn.linear_model.LogisticRegression(max_iter=1000)`, increasing `max_iter` from the default of 100 to 1000, then the warning goes away.
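In case it's useful, an untested sketch of how that tweak could be applied without forking dedupe (names as in my sketch above; the patch has to be active while dedupe constructs and fits its classifier):

```python
import functools
from unittest import mock

import sklearn.linear_model

# Temporarily make LogisticRegression default to max_iter=1000. dedupe's
# labeler looks the class up on the sklearn.linear_model module at call
# time, so patching the module attribute is enough.
PatchedLR = functools.partial(sklearn.linear_model.LogisticRegression, max_iter=1000)

with mock.patch("sklearn.linear_model.LogisticRegression", PatchedLR):
    deduper.prepare_training(data)
    deduper.mark_pairs(training_pairs)
    deduper.train()
```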

IDK if this has some downside. LogisticRegression.fit() takes 0.005 seconds without the tweak and half a second with the change, so slower, but totally ignorable.

I'm getting the same accuracy score in both cases, but that is measured after I do some post-processing cleanup, so I'm not sure it reflects the actual accuracy of the classifier. It seems like a classifier that hasn't converged would be bound to be less accurate.

Want me to make a PR that increases max_iter? Or do you think there might be something else causing the problem? It makes me a little nervous that I might not be going after the root cause and the real problem is sitting there unsolved (e.g. the warning suggests pre-processing/scaling the data). But I don't see a downside to increasing max_iter.
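For reference, the scaling route the warning points at would look something like this (again just a sketch; I haven't tried swapping this into dedupe):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize the distance columns before the logistic fit; a Pipeline
# still exposes fit()/predict_proba(), which is what dedupe's labeler
# calls on its classifier.
scaled_classifier = make_pipeline(StandardScaler(), LogisticRegression())
```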

@fgregg (Contributor)

fgregg commented Sep 21, 2022

i think this warning is not really a problem. typically, when you don't have convergence, it acts like a regularizer. i don't have a problem with increasing the max_iter, but there will still be some times where this warning appears.
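a quick illustration on synthetic data (not from this issue): stopping lbfgs early tends to leave the coefficients smaller, much as stronger L2 regularization would.

```python
import warnings

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the 50k-pair, 32-column training matrix above.
X, y = make_classification(n_samples=50_000, n_features=32, random_state=0)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the ConvergenceWarning itself
    early = LogisticRegression(max_iter=5).fit(X, y)
full = LogisticRegression(max_iter=1000).fit(X, y)

# The early-stopped fit typically has a smaller coefficient norm,
# i.e. it behaves as if it were more heavily regularized.
print(np.linalg.norm(early.coef_), np.linalg.norm(full.coef_))
```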
