
ConvergenceWarning during training #1091

Open · NickCrews opened this issue Sep 7, 2022 · 2 comments

@NickCrews (Contributor)

I get this warning during the fitting of the linear model when performing a deduplication task:

```
/Users/nickcrews/Library/Application Support/hatch/env/virtual/noatak-UM6-FHel/noatak/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:444: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
```

I am training on 800 records, manually labeled with cluster ids. Out of these 800 * 800 = 640,000 possible pairs, I'm sampling 50,000 using `dedupe.training_data_dedupe()` and feeding those 50k pairs to `Dedupe.train()`. After expanding the Missing, Categorical, and Interaction variables, the X array that the linear model sees has 32 columns.
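For concreteness, here is a minimal sketch of that setup, assuming the dedupe 2.x API (the field definition and the record loader are placeholders, not my actual project):

```python
import dedupe

# Placeholder schema; the real project has many more fields.
fields = [{"field": "name", "type": "String"}]

# Hypothetical loader: {record_id: record_dict}, where each record carries
# a "cluster_id" key from the manual labeling step.
data = load_records()

deduper = dedupe.Dedupe(fields)

# Sample 50,000 labeled pairs, using the manually assigned cluster ids
# to decide which pairs count as matches vs. distinct.
training_pairs = dedupe.training_data_dedupe(data, "cluster_id", training_size=50_000)

deduper.prepare_training(data)
deduper.mark_pairs(training_pairs)
deduper.train()  # the ConvergenceWarning shows up during this phase
```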

I know this isn't reproducible as yet, but I was hoping to avoid that work of getting everything together, if the information above is enough to give you any insights. If needed, I can try to make something reproducible.

@NickCrews (Contributor, Author)

OK, so if I go in and monkeypatch this block in dedupe/labeler.py (lines 72 to 77 at commit 220efe5):

```python
class MatchLearner(Learner):
    def __init__(self, data_model: DataModel, candidates: TrainingExamples):
        self.data_model = data_model
        self._candidates = candidates.copy()
        self._classifier = sklearn.linear_model.LogisticRegression()
        self._distances = self._calc_distances(self.candidates)
```

to `sklearn.linear_model.LogisticRegression(max_iter=1000)`, increasing `max_iter` from the default of 100 to 1000, then the warning goes away.
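In case it's useful, an untested sketch of how that tweak could be applied without forking dedupe (names as in my sketch above; the patch has to be active while dedupe constructs and fits its classifier):

```python
import functools
from unittest import mock

import sklearn.linear_model

# Temporarily make LogisticRegression default to max_iter=1000. dedupe's
# labeler looks the class up on the sklearn.linear_model module at call
# time, so patching the module attribute is enough.
PatchedLR = functools.partial(sklearn.linear_model.LogisticRegression, max_iter=1000)

with mock.patch("sklearn.linear_model.LogisticRegression", PatchedLR):
    deduper.prepare_training(data)
    deduper.mark_pairs(training_pairs)
    deduper.train()
```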

IDK if this has some downside. LogisticRegression.fit() takes 0.005 seconds without the tweak and half a second with the change, so slower, but totally ignorable.

I'm getting the same accuracy score in both cases, but that is measured after I do some post-processing cleanup, so I'm not sure it reflects the actual accuracy of the classifier. It seems like a classifier that hasn't converged would be bound to be less accurate.

Want me to make a PR that increases max_iter? Or do you think there might be something else causing the problem? It makes me a little nervous that I might not be going after the root cause and the real problem is sitting there unsolved (e.g. the warning suggests pre-processing/scaling the data). But I don't see a downside to increasing max_iter.
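For reference, the scaling route the warning points at would look something like this (again just a sketch; I haven't tried swapping this into dedupe):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize the distance columns before the logistic fit; a Pipeline
# still exposes fit()/predict_proba(), which is what dedupe's labeler
# calls on its classifier.
scaled_classifier = make_pipeline(StandardScaler(), LogisticRegression())
```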

@fgregg (Contributor)

fgregg commented Sep 21, 2022

i think this warning is not really a problem. typically, when you don't have convergence, it acts like a regularizer. i don't have a problem with increasing the max_iter, but there will still be some times where this warning appears.
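a quick illustration on synthetic data (not from this issue): stopping lbfgs early tends to leave the coefficients smaller, much as stronger L2 regularization would.

```python
import warnings

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the 50k-pair, 32-column training matrix above.
X, y = make_classification(n_samples=50_000, n_features=32, random_state=0)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the ConvergenceWarning itself
    early = LogisticRegression(max_iter=5).fit(X, y)
full = LogisticRegression(max_iter=1000).fit(X, y)

# The early-stopped fit typically has a smaller coefficient norm,
# i.e. it behaves as if it were more heavily regularized.
print(np.linalg.norm(early.coef_), np.linalg.norm(full.coef_))
```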
