[FR] Autologging functionality for scikit-learn #2050
Comments
Hi @smurching, have you started working on this? I'd be happy to collaborate!
@maxconi I haven't - I'd be happy to review a PR contributing this feature :). Do you have any questions about the API proposal above or about how the existing autologging integrations for Keras/TensorFlow work? Of the options for handling pipelines, my preference is 2). We should also think about how we handle managing active runs - for example, in Keras autologging we automatically start and end a run when fit is called without an active run.
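For illustration, a minimal sketch of the run-management behavior described above (a hedged approximation, not MLflow's actual Keras autologging code): start a run only if none is active, and end it only if the wrapper started it.

```python
import mlflow


def with_managed_run(training_fn):
    """Wrap a training call so it always executes inside an MLflow run.

    Starts a new run only when no run is already active, and ends the run
    only if this wrapper started it, so user-managed runs are left alone.
    """
    def wrapper(*args, **kwargs):
        started_here = mlflow.active_run() is None
        if started_here:
            mlflow.start_run()
        try:
            return training_fn(*args, **kwargs)
        finally:
            if started_here:
                mlflow.end_run()
    return wrapper
```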
Also, btw, if you're on the MLflow Slack, feel free to ping me there (my username is "sid") to discuss as well.
Hi @maxconi, hope you enjoyed the holidays! Thanks again for your interest in helping with this - are you still interested in collaborating?
Hi @smurching, I still haven't had the chance to work on this except for starting to review the scikit-learn API a bit. I should be able to invest more time into this in the coming weeks.
Of course, yeah, no worries/rush, that sounds good :). Let me know if/as you have any questions.
Hi @smurching, I implemented a very simple mlflow.sklearn.autolog function that patches the sklearn.base.BaseEstimator.get_params method. The patch is currently really simple and is just for me to get used to it. In your initial proposal you mention patching BaseEstimator.fit, which isn't an existing method. From my review of sklearn, it seems like the fit method is implemented at a lower level. Does that mean we need to patch every lower class that implements the .fit method (like BaseForest for forest-based learners)? In short, it doesn't seem like there is an equivalent of keras.model.fit in sklearn, but maybe it's just my current understanding... what's your opinion?
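For illustration, a rough sketch of the kind of get_params patch described above (a hedged reconstruction, not the actual prototype): replace BaseEstimator.get_params with a wrapper that also logs the returned params to MLflow.

```python
import mlflow
from sklearn.base import BaseEstimator

_original_get_params = BaseEstimator.get_params


def _logging_get_params(self, deep=True):
    # Call the real implementation, then log whatever it returned.
    params = _original_get_params(self, deep=deep)
    mlflow.log_params({k: str(v) for k, v in params.items()})
    return params


def autolog():
    """Monkey-patch BaseEstimator.get_params so every call also logs params."""
    BaseEstimator.get_params = _logging_get_params
```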
@maxconi Nice work! I am new to the MLflow community and I am also interested in working on this feature. It seems like we do not need to patch the But you make a good point that The And how we deal with What do you think? I am more than happy to listen to your opinions! @maxconi @smurching
Thanks for your comment @Yuchen-Wang-SH! I haven't inspected all estimators from sklearn so I'm not sure if they all have the For forest-based models, they inherit their Do you know how we could use patching in this case? If we don't use patching, what would be a way to still implement autologging when I haven't had the chance yet to look into
Oops, yeah sorry, I missed that BaseEstimator doesn't actually define fit.
Also, regarding
Solution b) looks more future-proof if new estimators are added in the future (provided that they still inherit from BaseEstimator).
btw @smurching & @Yuchen-Wang-SH, I created a Trello board for this issue. If you want to follow it or participate, send me your Trello ID or email and I'll add you to the board.
@smurching @maxconi Thanks for your comments! My email is on my GitHub profile page.
Yep, thanks both! My Trello ID is smurching.
@smurching @Yuchen-Wang-SH Hi, I wrote 2 functions to find the estimators to patch.
May I ask you to review these solutions? Also, I just realized that logging the metrics of a model (accuracy, recall, MSE, ...) will be a tricky one. In Keras your model can be evaluated during training on the train and test sets as well. But in sklearn you'll usually first fit your model, then predict, then score it. How would we make sure that a fit, predict, score sequence belongs to the same experiment in MLflow? Given what I just mentioned, do you think that Happy to hear your suggestions! :)
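For illustration, one possible way to enumerate candidate estimators to patch (a hedged sketch, not necessarily the two functions mentioned above). Note that all_estimators lives in sklearn.utils in recent releases (sklearn.utils.testing in older ones), and this only lists classes that define fit directly; patching would also need to cover fit methods inherited from intermediate base classes such as BaseForest.

```python
from sklearn.utils import all_estimators  # sklearn.utils.testing.all_estimators in older versions


def estimators_defining_fit():
    """Return the estimator classes that define fit on the class itself (not inherited)."""
    targets = []
    for name, cls in all_estimators():
        if "fit" in vars(cls):  # defined directly on this class, not on a base class
            targets.append((name, cls))
    return targets


for name, cls in estimators_defining_fit():
    print(name, "->", cls.__module__)
```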
@maxconi nice, thanks for the investigation & updates! Some thoughts inline:
Agree that requiring a
Also agree that it's not super user-friendly to require defining all your sklearn estimators before calling autolog().
Hi @smurching @Yuchen-Wang-SH @maxconi, as for my project, it is in its infancy, but there is already some support for sklearn. Maybe we can take a similar approach to write an MLflow autologger, or at least use the mapping file to identify critical functions to patch. The current proof-of-concept version is able to track some information regarding sklearn and Keras (tested with
Activating PyPads is as easy as prepending your imports with:
This results in optionally logged parameters, input, output, CPU information, etc. By using community-managed mapping files, our team hopes to reduce the workload of updating MLflow if the libraries to be logged change. Additionally, we hope to build an extensible ecosystem of logging functions which might be shared across multiple libraries. Nevertheless, some work remains to be done on PyPads, for example regarding opening and closing runs, as you have described in this thread. If you want to help find a more general solution for autologging, we would also be happy to take contributions on our approach.
Hey all, sorry for the delay - @Weissger, I took a look at PyPads, that's awesome - is there any reason we couldn't merge the functionality exposed by PyPads into MLflow as the implementation of an mlflow.sklearn.autolog API? The main question I have looking at the repo is the value provided by the mapping files (I would guess human readability?) - IMO it seems sufficient to just patch the relevant functions. It also seems like it could be simpler to try to dynamically patch scikit-learn estimators' fit methods. Thanks all for your work on this, also @maxconi let us know if there are any questions etc on your end - thanks!
Thanks for the answer @smurching. We certainly can include the sklearn logging functionality of PyPads into MLflow as an implementation of autolog. A minimal mapping file in PyPads for sklearn looks something like this:

```json
{
"default_hooks": {
"modules": {
"fns": {}
},
"classes": {
"fns": {}
},
"fns": {}
},
"algorithms": [
{
"name": "sklearn classification metrics",
"other_names": [],
"implementation": {
"scikit-learn": "sklearn.metrics.classification"
},
"hooks": {
"pypads_metric": "always"
}
},
{
"name": "base sklearn estimator",
"other_names": [],
"implementation": {
"scikit-learn": "sklearn.base.BaseEstimator"
},
"hooks": {
"pypads_fit": [
"fit",
"fit_predict",
"fit_transform"
],
"pypads_predict": [
"fit_predict",
"predict",
"score"
],
"pypads_transform": [
"fit_transform",
"transform"
]
}
}
]
}
```

Here we only track all classification metrics and all classes derived from BaseEstimator. Functions are punched dynamically. As for why we are using mapping files - there are some reasons to use a concept similar to them; I'll try to give them from the top of my head:
I hope this writeup gives a nice overview of why we decided to go forward with some kind of mapping layer. Not to sound too ambitious, but all these factors will hopefully culminate in human-readable reports for a lot of different libraries and also allow inserting your own methods into a common ecosystem. Technically, for MLflow we could omit mapping files and stay with hard-coded base classes to extend. If we do that, another technical aspect we might want to take into consideration for duck punching our logging into libraries is extending importlib like PyPads does. This allows for a more performant way of importing and for dynamically duck punching only the estimators the user actually uses.
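To make the importlib idea concrete, here is a generic, hedged sketch of patching a module at import time via a meta path finder. This is not PyPads' actual implementation, and the patched target (LogisticRegression.fit) is an arbitrary example.

```python
import functools
import importlib.abc
import importlib.util
import sys


class PatchingLoader(importlib.abc.Loader):
    """Delegate to the real loader, then run a patch function on the loaded module."""

    def __init__(self, wrapped, patch):
        self._wrapped = wrapped
        self._patch = patch

    def create_module(self, spec):
        return self._wrapped.create_module(spec)

    def exec_module(self, module):
        self._wrapped.exec_module(module)
        self._patch(module)


class PatchingFinder(importlib.abc.MetaPathFinder):
    """Intercept the import of a single module and wrap its loader."""

    def __init__(self, module_name, patch):
        self._module_name = module_name
        self._patch = patch

    def find_spec(self, fullname, path=None, target=None):
        if fullname != self._module_name:
            return None
        sys.meta_path.remove(self)  # avoid recursing into ourselves
        try:
            spec = importlib.util.find_spec(fullname)
        finally:
            sys.meta_path.insert(0, self)
        if spec is None or spec.loader is None:
            return None
        spec.loader = PatchingLoader(spec.loader, self._patch)
        return spec


def patch_linear_model(module):
    """Wrap LogisticRegression.fit so every call reports its params before fitting."""
    original_fit = module.LogisticRegression.fit

    @functools.wraps(original_fit)
    def logged_fit(self, *args, **kwargs):
        print("params:", self.get_params(deep=True))  # a real autologger would call mlflow.log_params here
        return original_fit(self, *args, **kwargs)

    module.LogisticRegression.fit = logged_fit


# Install the hook before sklearn.linear_model is imported anywhere.
sys.meta_path.insert(0, PatchingFinder("sklearn.linear_model", patch_linear_model))
```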
Just heard about this in the Databricks keynote (https://www.youtube.com/watch?v=d71ayzZPjas) and would be happy to chat about this.
Cool! There's a discussion that might be relevant here:
Hi @amueller! I sent an email to the address listed on your website (https://amueller.github.io/). Looking forward to discussing autologging / scikit-learn metadata collection with you!
I was using both mlflow.sklearn.autolog() and mlflow.sklearn.log_model() within my experiment run. This is logging two artifacts under that specific run. When I compare those two artifacts, it looks like the artifact from autolog() is just labelled as 'model' and is smaller in size compared to the artifact from log_model(). When I try to use the artifact from autolog() by registering it as a model, I get the error 'LabelEncoder' object has no attribute 'predict'. I can use the artifact from log_model() fine without any issues. Am I missing something here?
@abhi9cr7 Can you provide a snippet of the code that's producing this error?
@dbczumar After training the model, these are the two artifacts I am seeing in the experiment. The LinearSVC-model is the one logged through mlflow.sklearn.log_model() and "model" is the one logged through mlflow.sklearn.autolog(). Loading the "LinearSVC-model" artifact works fine with the code below, but loading the "model" artifact as a registered model gives the error.

```python
import mlflow
import numpy as np
w1=np.array(['macronutrients','vitamin'])
loaded_model=mlflow.sklearn.load_model('models:/Production')
model_val=loaded_model.predict(w1)
print(model_val)
```

For reference, this is all on Azure Databricks Runtime 6.6.
Got it. Can you provide a snippet of your model training / logging code?
mlflow.set_experiment("/Users/ClassificationModel/")
mlflow.sklearn.autolog()
def text_process(mess):
nopunc = [char for char in mess if char not in string.punctuation]
nopunc = ''.join(nopunc)
return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
with mlflow.start_run():
data=spark.sql("")
X = data.select('KeyPhrase').collect() #data['KeyPhrase']
Y = data.select('label').collect() #data['label']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2,random_state=0)
pipeline = Pipeline([('bow', CountVectorizer(analyzer=text_process)),
('tfidf', tfidf_transformer),
('clf', LinearSVC(C=1,class_weight="balanced"))])
LinearSVC = pipeline.fit(X_train, y_train)
ytest = np.array(y_test)
for x, y in zip(X_test, ytest):
ypred = LinearSVC.predict([x])
if y != ypred[0]:
print(x, " : ", y , " : ", ypred[0])
print(classification_report(ytest, LinearSVC.predict(X_test)))
print(confusion_matrix(ytest, LinearSVC.predict(X_test)))
mlflow.sklearn.log_model(LinearSVC,'LinearSVC-model')
# Evaluate model performance
predictions = LinearSVC.predict(X_test)
accuracy = metrics.accuracy_score(ytest, predictions)
precision = metrics.precision_score(ytest, predictions, average="weighted")
recall = metrics.recall_score(ytest, predictions, average="weighted")
mlflow.log_metric("accuracy", accuracy)
mlflow.log_metric("precision", precision)
mlflow.log_metric("recall", recall)
|
@abhi9cr7 I'm a bit confused by the code here. It looks like
Sorry, a typo. I am passing LinearSVC as the model in mlflow.sklearn.log_model(LinearSVC, 'LinearSVC-model'). Corrected it in the above snippet as well.
@abhi9cr7 Got it. What type of model object is returned when you call mlflow.sklearn.load_model on each of the two artifacts?
@dbczumar The autologged model is loaded as sklearn.preprocessing.label.LabelEncoder and the manually logged model is loaded as a pipeline.
@abhi9cr7 Are you explicitly fitting a LabelEncoder object at any point prior to your
@dbczumar I am using single-threaded execution. I haven't explicitly fitted any LabelEncoder object. The above snippet is all of the training code.
@abhi9cr7 If you remove the call to
@dbczumar It seems like

```python
bow_transformer = CountVectorizer(analyzer=text_process).fit(data.select('KeyPhrase').collect())
key_phrase_bow = bow_transformer.transform(data.select('KeyPhrase').collect())
tfidf_transformer = TfidfTransformer().fit(key_phrase_bow)
```

Edit: I think as soon as the first .fit() is called, autolog() creates a model artifact for it.
@abhi9cr7 Got it. Correct - each fit() call creates a model artifact.
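For what it's worth, a small hedged snippet for sanity-checking which logged artifact is which: list the run's artifacts and load each model back to inspect its type before registering it (the run_id value is a placeholder).

```python
import mlflow
from mlflow.tracking import MlflowClient

run_id = "<run_id>"  # placeholder: the run containing both artifacts

# List the artifacts logged to the run (e.g. "model" from autolog,
# "LinearSVC-model" from the explicit log_model call).
for artifact in MlflowClient().list_artifacts(run_id):
    print(artifact.path)

# Load each model back and check its type before registering either one.
autologged = mlflow.sklearn.load_model(f"runs:/{run_id}/model")
explicit = mlflow.sklearn.load_model(f"runs:/{run_id}/LinearSVC-model")
print(type(autologged), type(explicit))
```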
Describe the proposal
Provide a clear high-level description of the feature request in the following sections.
It'd be nice to add an mlflow.sklearn.autolog() API for automatically logging metrics, params & models generated via scikit-learn. Note that I'm personally not particularly familiar with the scikit-learn APIs, so I'd welcome feedback on the proposal below.
MVP API Proposal
We could patch the BaseEstimator.fit method to log the params of the model being fit (estimator params are accessible via get_params) and also log the fitted model itself.

We should take care to ensure the UX is reasonable when working with scikit-learn Pipelines, which allow for defining DAGs of estimators. There are a few options here; one is to log the params of all pipeline stages by passing deep=True to Estimator.get_params.

For example:
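The snippet below is a hedged stand-in rather than the issue's original example: it patches a single estimator's fit to log params via get_params(deep=True) along with the fitted model, and the choice of LogisticRegression and the helper name patch_fit are illustrative only.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression


def patch_fit(estimator_cls):
    """Wrap estimator_cls.fit so each call logs params and the fitted model."""
    original_fit = estimator_cls.fit

    def autologged_fit(self, *args, **kwargs):
        with mlflow.start_run():
            mlflow.log_params(self.get_params(deep=True))
            fitted = original_fit(self, *args, **kwargs)
            mlflow.sklearn.log_model(fitted, "model")
        return fitted

    estimator_cls.fit = autologged_fit


patch_fit(LogisticRegression)

X, y = load_iris(return_X_y=True)
LogisticRegression(max_iter=200).fit(X, y)  # params and the fitted model are logged automatically
```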
Motivation
scikit-learn is a popular ML library, and it'd be a big value-add to make it easy for users to add MLflow tracking to their existing scikit-learn code.
Proposed Changes
For user-facing changes, what APIs are you proposing to add or modify? What code paths will need to be modified?
See above - we propose adding a new mlflow.sklearn.autolog API.

We can add the definition of the new autolog API in https://github.com/mlflow/mlflow/blob/master/mlflow/sklearn.py, and unit tests under mlflow/tests/sklearn/test_sklearn_autologging.py. See this PR: #1601 as an example of how the same was done for Keras.
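For illustration, a hedged sketch of what one unit test for the proposed API could look like; the test name, tracking URI setup, and assertions are illustrative only, not the actual test plan.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression


def test_autolog_logs_params(tmp_path):
    # Use an isolated file-based tracking store for the test.
    mlflow.set_tracking_uri(f"file://{tmp_path}")
    mlflow.sklearn.autolog()

    X, y = load_iris(return_X_y=True)
    with mlflow.start_run() as run:
        LogisticRegression(max_iter=200).fit(X, y)

    logged = mlflow.get_run(run.info.run_id)
    assert "max_iter" in logged.data.params
```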