
[FR] Autologging functionality for scikit-learn #2050

Closed
smurching opened this issue Nov 6, 2019 · 38 comments
Labels
area/tracking Tracking service, tracking client APIs, autologging enhancement New feature or request good first issue Good for newcomers priority/important-soon The issue is worked on by the community currently or will be very soon, ideally in time for the

Comments

@smurching
Collaborator

smurching commented Nov 6, 2019

Describe the proposal


It'd be nice to add an mlflow.sklearn.autolog() API for automatically logging metrics, params & models generated via scikit-learn.

Note that I'm personally not particularly familiar with the scikit-learn APIs, so I'd welcome feedback on the proposal below.

MVP API Proposal

We could patch the BaseEstimator.fit method to log the params of the model being fit (estimator params are accessible via get_params) and also log the fitted model itself.
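For concreteness, a minimal sketch of what such a patch could look like (illustrative only: patch_fit is a hypothetical helper rather than a proposed MLflow API, it patches a single concrete estimator class, and a real implementation would also need to handle active runs, exceptions, and Pipelines):

import functools

import mlflow
import mlflow.sklearn


def patch_fit(estimator_cls):
    # Wrap estimator_cls.fit so that params and the fitted model are logged.
    original_fit = estimator_cls.fit

    @functools.wraps(original_fit)
    def patched_fit(self, *args, **kwargs):
        fitted = original_fit(self, *args, **kwargs)
        # Hyperparameters are accessible via get_params(); sklearn's fit() returns the estimator itself.
        mlflow.log_params(self.get_params(deep=False))
        mlflow.sklearn.log_model(fitted, "model")
        return fitted

    estimator_cls.fit = patched_fit


# Example usage (hypothetical):
#   from sklearn.linear_model import LinearRegression
#   patch_fit(LinearRegression)
#   LinearRegression().fit(X, y)  # params and the fitted model are now logged automatically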

We should take care to ensure the UX is reasonable when working with scikit-learn Pipelines, which compose multiple estimators into a single workflow. There are a few options here:

  1. The sub-estimators simply log nothing (simple to achieve, but may result in too little logged information).
  2. Log params of the sub-estimators comprising a Pipeline under the parent run, but do not log the fitted sub-estimators themselves (log only the enclosing Pipeline model). Note that sub-estimator params can be obtained by passing deep=True to Estimator.get_params.
  3. The sub-estimators comprising a Pipeline & their params are logged under child runs of the parent run.
  4. The sub-estimators comprising a Pipeline & their params are logged under the parent run (IMO this option is a non-starter, as the parent run could become very polluted with models/params).

For example:

import mlflow
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Logs information about `pipeline` under the run named "parent". In the custom
# patched logic that runs when we call `pipeline.fit()`, we can set an
# `mlflow.sklearn.runContainsPipeline` tag on "parent" indicating that it's a run
# containing a Pipeline. As a result, our custom logic for
# fitting our linear regressor knows to create a child run to which to log its
# params (or simply not log anything), by virtue of checking that the
# `mlflow.sklearn.runContainsPipeline` tag is set on the current active run.
with mlflow.start_run(run_name="parent"):
    pipeline = Pipeline([("scaler", StandardScaler()), ("regressor", LinearRegression())])
    pipeline.fit(X_train, y_train)

Motivation

scikit-learn is a popular ML library, and it'd be a big value-add to make it easy for users to add MLflow tracking to their existing scikit-learn code.

Proposed Changes

For user-facing changes, what APIs are you proposing to add or modify? What code paths will need to be modified?
See above - we propose adding a new mlflow.sklearn.autolog API

We can add the definition of the new autolog API in https://github.com/mlflow/mlflow/blob/master/mlflow/sklearn.py, and unit tests under mlflow/tests/sklearn/test_sklearn_autologging.py. See this PR: #1601 as an example of how the same was done for Keras.

@smurching smurching added the enhancement New feature or request label Nov 6, 2019
@smurching smurching added the good first issue Good for newcomers label Nov 15, 2019
@maxconi

maxconi commented Dec 11, 2019

Hi @smurching, have you started working on this? I'd be happy to collaborate!

@smurching
Collaborator Author

@maxconi I haven't, but I'd be happy to review a PR contributing this feature :). Do you have any questions about the API proposal above or how the existing autologging integrations for Keras/TensorFlow work?

I think my preferred option for handling pipelines is 2).

We should also think about how we manage active runs - for example, in Keras autologging we'll automatically start and end a run when keras_model.fit() is called with no active run, as you can see here. I'm not sure the same is the right behavior for scikit-learn, looking at this example which calls fit or fit_transform on multiple estimators in sequence. However, it may be a reasonable place to start - I'd be curious to hear your thoughts :)
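For reference, a rough sketch of the Keras-style run management described above (illustrative only; this is not the actual Keras autologging code, and the helper name is hypothetical):

import mlflow


def fit_with_managed_run(self, original_fit, *args, **kwargs):
    # If no run is active, start one for the duration of this fit() call and
    # end it afterwards; otherwise, log into the user's existing active run.
    auto_started = mlflow.active_run() is None
    if auto_started:
        mlflow.start_run()
    try:
        return original_fit(self, *args, **kwargs)
    finally:
        if auto_started:
            mlflow.end_run()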

@smurching
Collaborator Author

smurching commented Dec 12, 2019

Also btw, if you're on the MLflow Slack feel free to ping me there (my username is "sid") to discuss as well

@smurching
Collaborator Author

smurching commented Jan 3, 2020

Hi @maxconi, hope you enjoyed the holidays! Thanks again for your interest in helping with this - are you still interested in collaborating?

@maxconi

maxconi commented Jan 3, 2020

Hi @smurching, I still haven't had the chance to work on this except for starting to review the scikit-learn API a bit. I should be able to invest more time into this in the coming weeks.

@smurching
Collaborator Author

Of course yeah no worries/rush, that sounds good :). Let me know if/as you have any questions etc

@maxconi

maxconi commented Jan 15, 2020

Hi @smurching,
I forked the repo and started working on the autologging functionality.
A first version can be found here.

I implemented a very simple mlflow.sklearn.autolog function that patches the sklearn.base.BaseEstimator.get_params method. The patch is currently really simple and is just for me to get used to the patching mechanism.

In your initial proposal you mention patching BaseEstimator.fit, which isn't an existing method. From my review of sklearn it seems like the fit method is implemented at a lower level. Does that mean we need to patch every lower-level class that implements the .fit method (like BaseForest for forest-based learners)? In short, it doesn't seem like there is an equivalent of keras.model.fit in sklearn, but maybe that's just my current understanding... what's your opinion?

@a-w-1806

@maxconi Nice work! I am new to the MLflow community and I am also interested in working on this feature. It seems like we do not need to patch the get_params method because we can directly call it on the estimators.

But you make a good point that fit actually is not defined in the BaseEstimator class, though it seems like "every" estimator I have seen has a fit method. I am not sure whether we can still use patching to get it working. Maybe we can try that on a toy example, or please tell me if you find an estimator that does not have a fit method. :)

The get_params method does not seem to return the parameters learned from fitting, so maybe we can just call it once before or after fitting?

And how we deal with fit and its cousins fit_transform and fit_predict still needs further consideration.

What do you think? I am more than happy to listen to your opinions! @maxconi @smurching

@maxconi

maxconi commented Jan 16, 2020

Thanks for your comment @Yuchen-Wang-SH!

I haven't inspected all estimators from sklearn so I'm not sure if they all have the .fit method, even though it looks like they all do.

Forest-based models inherit their fit method from the BaseForest class, while the LinearRegression estimator defines its own fit method. For the SVM estimators, the behavior is defined in the BaseLibSVM class. So even though all estimators seem to have a .fit method, they inherit or define it in different ways.

Do you know how we could use patching in this case? If we don't use patching, what would be a way to still implement autologging when .fit is called on a sklearn model?

I haven't had the chance yet to look into fit_transform & fit_predict.

@smurching
Collaborator Author

Oops, yeah sorry I missed that BaseEstimator doesn't actually define fit(). I'm not sure there's a great silver-bullet solution that'll cover all estimators, but we could consider a) patching some common estimators and Pipelines as a practical workaround, or b) doing something more complex like recursively searching for subclasses of BaseEstimator (using a technique like that suggested in this SO post) & patching them. It could be worth spending time investigating option b), but I'm also happy to move forward with a) if b) is infeasible.
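A rough sketch of what option b) could look like, using the standard __subclasses__() recursion (illustrative only; note that only estimator classes whose modules have already been imported will be discovered this way):

from sklearn.base import BaseEstimator


def all_subclasses(cls):
    # Transitive closure of cls.__subclasses__().
    subclasses = set(cls.__subclasses__())
    for subclass in cls.__subclasses__():
        subclasses |= all_subclasses(subclass)
    return subclasses


def estimators_to_patch():
    # Keep only classes that define their own fit() (BaseEstimator itself does not).
    return [cls for cls in all_subclasses(BaseEstimator) if "fit" in vars(cls)]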

@smurching
Collaborator Author

smurching commented Jan 17, 2020

Also, regarding fit_transform & fit_predict, I'd imagine we could patch those as well? One potential complication though is that fit_predict and fit_transform may call fit themselves (thereby triggering patched logic twice). We could just not patch those methods to start with for simplicity, depending on how common their usage is.
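If we do patch them, a minimal sketch of a reentrancy guard that avoids the double-logging (illustrative and single-threaded only; the flag and helper names are hypothetical):

import functools

_inside_patched_call = False


def guarded(original_method, log_fn):
    @functools.wraps(original_method)
    def wrapper(self, *args, **kwargs):
        global _inside_patched_call
        if _inside_patched_call:
            # e.g. fit() invoked internally by a patched fit_transform(): skip logging.
            return original_method(self, *args, **kwargs)
        _inside_patched_call = True
        try:
            result = original_method(self, *args, **kwargs)
            log_fn(self)
            return result
        finally:
            _inside_patched_call = False

    return wrapper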

@maxconi

maxconi commented Jan 17, 2020

Solution b) looks more future-proof if new estimators are added later (provided that they still inherit from BaseEstimator). I'd like to give it a try.
Thanks for the feedback

@maxconi

maxconi commented Jan 17, 2020

btw @smurching & @Yuchen-Wang-SH, I created a Trello board for this issue. If you want to follow it or participate, send me your Trello ID or email and I'll add you to the board.

@a-w-1806

@smurching @maxconi Thanks for your comments! My email is on my GitHub profile page.

@smurching
Collaborator Author

Yep thanks both! My Trello ID is smurching.

@maxconi

maxconi commented Jan 24, 2020

@smurching @Yuchen-Wang-SH

Hi, I wrote 2 functions to find the estimators to patch.
https://github.com/maxconi/mlflow/blob/scikit_autolog/mlflow/sklearn.py

  1. get_sklearn_estimators_1 which will find any estimator with a fit method inheriting from BaseEstimator. That method requires a from sklearn import * which isn't an amazing way to do it imo

  2. get_sklearn_estimators_2 which will find all sklearn classes with a fit method within the current module. Cleaner but if you call autolog() before you actually import your estimator (can happen if you work in a notebook for instance) that will need to be rerun which isn't exactly user friendly.

May I ask you to review these solutions?

Also, I just realized that logging the metrics of a model (accuracy, recall, MSE, ...) will be tricky. In Keras your model can be evaluated on the train and test sets during training, but in sklearn you'll usually first fit your model, then predict, then score it. How would we make sure that a fit, predict, score sequence belongs to the same experiment in MLflow?

Given what I just mentioned, do you think that autolog for sklearn is still feasible? Maybe the sklearn way of working isn't really compatible with an autologger. Instead, sklearn would maybe benefit from a wrapper around the model that would fit it, make predictions, and score the model?
It would go something like this:
mlflow.sklearn.model_logger(estimator(), X_train, y_train, X_test, y_test)
and it would return a fitted model & its predictions, while having logged the model, its params, and metrics in MLflow. It would still reduce the amount of code needed to use MLflow, but then maybe it goes beyond the purpose of MLflow.
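A rough sketch of what such a wrapper could look like (model_logger is hypothetical, not an existing MLflow API):

import mlflow
import mlflow.sklearn


def model_logger(estimator, X_train, y_train, X_test, y_test):
    with mlflow.start_run():
        fitted = estimator.fit(X_train, y_train)
        predictions = fitted.predict(X_test)
        mlflow.log_params(fitted.get_params(deep=False))
        mlflow.log_metric("score", fitted.score(X_test, y_test))
        mlflow.sklearn.log_model(fitted, "model")
    return fitted, predictions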

Happy to hear your suggestions! :)

@smurching
Collaborator Author

smurching commented Jan 28, 2020

@maxconi nice, thanks for the investigation & updates! Some thoughts inline:

get_sklearn_estimators_1 which will find any estimator with a fit method inheriting from BaseEstimator. That method requires a from sklearn import * which isn't an amazing way to do it imo

Agree that requiring a from sklearn import * is not ideal (& we shouldn't do an import * for the user and then pollute their top-level namespace). However, I think simply performing the from sklearn import * within the get_sklearn_estimators_1 function should work - the imported modules will be defined as local variables & the inspect logic should be able to find them. I'd give that a try.

get_sklearn_estimators_2 which will find all sklearn classes with a fit method within the current module. Cleaner but if you call autolog() before you actually import your estimator (can happen if you work in a notebook for instance) that will need to be rerun which isn't exactly user friendly.

Also agree that it's not super user-friendly to require defining all your sklearn estimators before calling autolog() - one nice property of the existing autolog APIs is that you can call them at the top of a block of ML code, without too much understanding of how the code works/is structured, and view metrics/params/models being logged. You could imagine us eventually exposing a unified framework-agnostic mlflow.autolog() API that users can just add to the top of their ML training code to instrument it. Given that (potential) eventual goal, I think we might also want to steer away from an API like mlflow.sklearn.model_logger(estimator(), X_train, y_train, X_test, y_test).
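As an aside, scikit-learn ships an estimator-discovery helper that could sidestep both issues; a rough sketch (assuming a scikit-learn version that exposes sklearn.utils.all_estimators):

from sklearn.utils import all_estimators


def get_sklearn_estimators():
    # Returns (name, class) pairs for every public scikit-learn estimator,
    # without needing `import *` or depending on what the user has already imported.
    return [(name, cls) for name, cls in all_estimators() if hasattr(cls, "fit")]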

@Weissger

Weissger commented Jan 29, 2020

Hi @smurching @Yuchen-Wang-SH @maxconi,
I would also be happy to collaborate on the proposition described here.
Regarding the logging, I'm currently working on a one-size-fits-all extension for autologging in MLflow and would like to join forces. I'm willing to help implement a sklearn autologger function for MLflow.

As for my project, it is in its infancy, but there is already some support for sklearn. Maybe we can take a similar approach to write an MLflow autologger, or at least use the mapping file to identify critical functions to patch. The current proof-of-concept version is able to track some information regarding sklearn and keras (tested with Python 3.7, sklearn==0.21.3). It features a non-intrusive setup. PyPads uses duck punching while importing the module, in combination with mapping files for libraries, to insert and define hooks for logging functions.

Activating PyPads is as easy as prepending your imports with:

from pypads.base import PyPads
PyPads()

This results in optionally logged parameters, inputs, outputs, CPU information, etc. By using community-managed mapping files, our team hopes to reduce the workload of updating MLflow when the libraries to be logged change. Additionally, we hope to have an extensible ecosystem of logging functions which might be shared across multiple libraries.

Nevertheless, some work remains to be done on PyPads, for example regarding opening and closing runs, as you have described in this thread. If you want to help find a more general solution for autologging, we would also be happy to take contributions on our approach.

@smurching
Collaborator Author

Hey all, sorry for the delay - @Weissger I took a look at PyPads, that's awesome - is there any reason we couldn't merge the functionality exposed by PyPads into MLflow as the implementation of an mlflow.sklearn.autolog API?

The main question I have looking at the repo is the value provided by the mapping files (I would guess human readability?) - IMO it seems sufficient to just patch the relevant functions.

It also seems like it could be simpler to try to dynamically patch scikit-learn estimators' fit functions (as discussed in this thread) rather than hard-code the patching of each one, as PyPads seems to do - did you try out dynamic patching & find it to be difficult to implement correctly etc?

Thanks all for your work on this, also @maxconi let us know if there are any questions etc on your end - thanks!

@Weissger

Weissger commented Feb 19, 2020

Thanks for the answer @smurching. We certainly can include the sklearn logging functionality of PyPads in MLflow as an implementation of mlflow.sklearn.autolog, but we will have to change a few things. PyPads was not developed specifically for sklearn, and therefore not too much thought has been put into what exactly to log for a sklearn pipeline.

A minimal mapping file in PyPads for sklearn looks something like this:

{
  "default_hooks": {
    "modules": {
      "fns": {}
    },
    "classes": {
      "fns": {}
    },
    "fns": {}
  },
  "algorithms": [
    {
      "name": "sklearn classification metrics",
      "other_names": [],
      "implementation": {
        "scikit-learn": "sklearn.metrics.classification"
      },
      "hooks": {
        "pypads_metric": "always"
      }
    },
    {
      "name": "base sklearn estimator",
      "other_names": [],
      "implementation": {
        "scikit-learn": "sklearn.base.BaseEstimator"
      },
      "hooks": {
        "pypads_fit": [
          "fit",
          "fit_predict",
          "fit_transform"
        ],
        "pypads_predict": [
          "fit_predict",
          "predict",
          "score"
        ],
        "pypads_transform": [
          "fit_transform",
          "transform"
        ]
      }
    }
  ]
}

Here we only track the classification metrics and all classes derived from BaseEstimator. Functions are punched dynamically.

As for why we are using mapping files - there are some reasons to use a concept similar to them. I'll try to give them off the top of my head:

  • We want to enable the community / library creators, as well as people writing custom methods, to provide a readable way of defining hooks for logging their algorithms. This results in less overhead for the management of the tracking library itself.
  • Mapping files can evolve with the versions of the libraries themselves. Differing versions of the same library might change implementation specifics that are essential for the duck punching of base classes and methods commonly used in autologging. Who is to say that BaseEstimator won't change in the future, or that Mixins won't be renamed or moved, etc.? Such updates would render our tracking library useless for older or newer versions of a library. To circumvent this we might allow holding multiple mapping files and use the one closest to the currently installed version of the library.
  • Mapping files can be specific for some algorithms we know a lot about and still take into account new algorithms derived from base classes. This allows for multiple levels of abstraction when defining log functionality.
  • By having specific information about algorithms we can include information about parameter types, value ranges, etc. This would give us a way to log uncommon settings or validate inputs the user provides to the library. On top of something like this, an extensive hinting system could be built to identify common mistakes in the usage of some libraries or in machine learning in general. Nevertheless, coming up with validation rules would be a hard topic to solve by itself.
  • The mapping files can provide standardized events for logging. 'Library A' might share common functionality with 'Library B', which should be logged in the same way. This can be represented by calling the same log event. Log events could be defined in a conceptual way and in phases (see for example the standardization efforts of OntoDM).
  • Additionally, it's sometimes not easy to classify abstract superclasses like BaseEstimator semantically, and a mapping file might include some additional information for logging purposes.
  • Speaking of standardization efforts and ontologies: a mapping file could indicate which algorithmic concepts are implemented by the mapped functions by linking to Wikidata entries or other ontologies. Different libraries might be harmonized by mapping the parameters of the respective functions to the same concepts, and therefore be comparable.

I hope this writeup gives a nice overview of why we decided to go forward with some kind of mapping layer. Not to sound too ambitious, but all these factors will hopefully culminate in human-readable reports for a lot of different libraries and also allow you to insert your own methods into a common ecosystem.

Technically, for MLflow we could omit mapping files and stick with hard-coded base classes to extend. If we do that, another technical aspect we might want to consider for duck punching our logging into libraries is extending importlib the way PyPads does. This allows for a more performant way of importing and dynamically duck punching only the estimators the user actually uses.

@amueller
Contributor

Just heard about this in the Databricks keynote (https://www.youtube.com/watch?v=d71ayzZPjas) and would be happy to chat about this.

@juntai-zheng juntai-zheng added the area/tracking Tracking service, tracking client APIs, autologging label Jul 1, 2020
@smurching
Collaborator Author

@amueller that'd be awesome! @dbczumar is leading design/development efforts on this, we'll reach out to set up time to discuss :)

@amueller
Contributor

amueller commented Jul 2, 2020

Cool! There's a discussion that might be relevant here:
scikit-learn/scikit-learn#78

@dbczumar
Collaborator

dbczumar commented Jul 6, 2020

Hi @amueller ! I sent an email to the address listed on your website (https://amueller.github.io/). Looking forward to discussing autologging / scikit-learn metadata collection with you!

@smurching smurching added the priority/important-soon The issue is worked on by the community currently or will be very soon, ideally in time for the label Jul 13, 2020
@abhi9cr7

I was using both mlflow.sklearn.autolog() and mlflow.sklearn.log_model() within my experiment run. This is logging two artifacts under that specific run. When I compare those two artifacts, it looks like the artifact from autolog() is just labelled as 'model' and is smaller in size compared to the artifact from log_model().

When I try to use the artifact from autolog() by registering it as a model, I get an error <'LabelEncoder' object has no attribute 'predict'>. I can use the artifact from log_model() fine without any issues. Am I missing something here?

@dbczumar
Collaborator

@abhi9cr7 Can you provide a snippet of the code that's producing this error?

@abhi9cr7

@dbczumar After training the model, these are the two artifacts I am seeing in the experiment. The LinearSVC-Model is the one logged through mlflow.sklearn.log_model() and "model" is the one logged through mlflow.sklearn.autolog().

[screenshot: the run's two logged artifacts, "LinearSVC-Model" and "model"]

Loading the "LinearSVC-model" artifact works fine with the code below. But loading the artifact "model" as a registered model gives the error.

import mlflow
import numpy as np

w1=np.array(['macronutrients','vitamin'])

loaded_model=mlflow.sklearn.load_model('models:/Production')
model_val=loaded_model.predict(w1)

print(model_val)

For reference - this is all on Azure Databricks Runtime 6.6.

@dbczumar
Collaborator

Got it. Can you provide a snippet of your model training / logging code?

@abhi9cr7

abhi9cr7 commented Sep 11, 2020

mlflow.set_experiment("/Users/ClassificationModel/")
mlflow.sklearn.autolog()

def text_process(mess):
  nopunc = [char for char in mess if char not in string.punctuation]
  nopunc = ''.join(nopunc)
  return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]

with mlflow.start_run():
  data=spark.sql("")
  
  X = data.select('KeyPhrase').collect() #data['KeyPhrase']
  Y = data.select('label').collect() #data['label']
  
  X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2,random_state=0)

  pipeline = Pipeline([('bow', CountVectorizer(analyzer=text_process)),
                       ('tfidf', tfidf_transformer),
                       ('clf', LinearSVC(C=1, class_weight="balanced"))])
  
  LinearSVC = pipeline.fit(X_train, y_train)
  ytest = np.array(y_test)

  for x, y in zip(X_test, ytest):
    ypred = LinearSVC.predict([x])
    if y != ypred[0]:
      print(x, " : ", y , " : ", ypred[0])
  print(classification_report(ytest, LinearSVC.predict(X_test)))
  print(confusion_matrix(ytest, LinearSVC.predict(X_test)))

  mlflow.sklearn.log_model(LinearSVC,'LinearSVC-model')
  
  
  # Evaluate model performance
  predictions = LinearSVC.predict(X_test)
  accuracy = metrics.accuracy_score(ytest, predictions)
  precision = metrics.precision_score(ytest, predictions, average="weighted")
  recall = metrics.recall_score(ytest, predictions, average="weighted")

  mlflow.log_metric("accuracy", accuracy)
  mlflow.log_metric("precision", precision)
  mlflow.log_metric("recall", recall)
  

@dbczumar
Collaborator

@abhi9cr7 I'm a bit confused by the code here. It looks like model is undefined when passed to mlflow.sklearn.log_model. Also, we're reassigning LinearSVC to the output of pipeline.fit(), which I think overrides the sklearn class name in that scope. Is this the exact code you're using for training? Is the type of model you're trying to produce a pipeline or a LinearSVC model?

@abhi9cr7

abhi9cr7 commented Sep 11, 2020

@abhi9cr7 I'm a bit confused by the code here. It looks like model is undefined when passed to mlflow.sklearn.log_model. Also, we're reassigning LinearSVC to the output of pipeline.fit(), which I think overrides the sklearn class name in that scope. Is this the exact code you're using for training? Is the type of model you're trying to produce a pipeline or a LinearSVC model?

Sorry, a typo. I am passing LinearSVC as the model in mlflow.sklearn.log_model(LinearSVC,'LinearSVC-model')

Corrected it in the above snippet as well.

@dbczumar
Collaborator

@abhi9cr7 Got it. What type of model object is returned when you call load_model() on the autologged model and on the model you logged manually? Are they the same? Different?

@abhi9cr7

@dbczumar The autologged model is loaded as sklearn.preprocessing.label.LabelEncoder and the manually logged model is loaded as a pipeline.

@dbczumar
Collaborator

@abhi9cr7 Are you explicitly fitting a LabelEncoder object at any point prior to your pipeline.fit() call? Which sklearn execution backend are you using (single threaded, multithreaded, multiprocessing, etc)?

@abhi9cr7

abhi9cr7 commented Sep 11, 2020

@dbczumar I am using single threaded execution. I haven't explicitly fitted any LabelEncoder object. The above snippet is all of the training code.

@dbczumar
Collaborator

@abhi9cr7 If you remove the call to pipeline.fit(), is autologging still producing a model? Also, where and how is tfidf_transformer defined?

@abhi9cr7

abhi9cr7 commented Sep 11, 2020

@dbczumar It seems like TfidfTransformer() is what is creating that model artifact. When I cleared the cluster state and executed again with just autolog(), it didn't produce any model. But when I called TfidfTransformer() along with autolog(), it produced a model artifact (I didn't include pipeline.fit() in this run).

bow_transformer = CountVectorizer(analyzer=text_process).fit(data.select('KeyPhrase').collect())
key_phrase_bow = bow_transformer.transform(data.select('KeyPhrase').collect())
tfidf_transformer = TfidfTransformer().fit(key_phrase_bow)

Edit: I think as soon as the first .fit() is called, autolog() creates a model artifact for it.

@dbczumar
Collaborator

@abhi9cr7 Got it. Correct - each fit() call creates a run with a model artifact. I would recommend fitting the TfidfTransformer along with the pipeline itself, either by placing an unfit TfidfTransformer in the pipeline definition or by moving the TfidfTransformer fitting outside of the MLflow run that is used for pipeline training.
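For example, something along these lines (a sketch reusing text_process, X_train, and y_train from the snippet above):

import mlflow
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Keep the unfit TfidfTransformer inside the pipeline so that only pipeline.fit()
# triggers autologging for this run.
pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=text_process)),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC(C=1, class_weight="balanced")),
])

with mlflow.start_run():
    pipeline.fit(X_train, y_train)  # a single fit() call -> a single autologged model artifact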

@harupy harupy closed this as completed Feb 26, 2021