
Pandas output #39

Merged: 16 commits into EBjerrum:pandas_output on Mar 18, 2024

Conversation

enricogandini
Contributor

Closes #38.

All scikit-mol transformer classes now support the scikit-learn set_output API. They can keep the data in DataFrames with meaningful column names.
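For instance, a minimal sketch of the intended usage (module and class names follow scikit-mol's public API, but treat this as an illustration rather than the PR's test code):

```python
from sklearn.pipeline import Pipeline

from scikit_mol.conversions import SmilesToMolTransformer
from scikit_mol.fingerprints import MorganFingerprintTransformer

pipe = Pipeline([
    ("smiles_to_mol", SmilesToMolTransformer()),
    ("fingerprint", MorganFingerprintTransformer()),
])
# With scikit-learn's set_output API (>= 1.2), each step can now return a
# DataFrame with named columns instead of a bare NumPy array.
pipe.set_output(transform="pandas")
```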

@EBjerrum
Owner

Super, thanks. I'm super busy this week. I'll try to squeeze in a code-review and test-drive anyway.


Also sets the column prefix for use by the transform method with dataframe output.
"""
self._get_column_prefix()
Owner

This is not doing anything as the return is simply discarded???

Contributor Author

Yes, I added it there when I was developing the code, then I forgot about it.

Done in 49d39d0

Owner

@EBjerrum left a comment

Hi Enrico,
First of all, thanks for the pull request; it's nice to see that people want to use and contribute to Scikit-Mol.
However, I won't merge it into main yet; I think we need some discussion on the approach. Your changes highlight a shortcoming/incompatibility in the SMILES and mol converters that needs to be brought into compliance, but I'm reluctant to build column-selection code directly into the transformer classes. You can see my comments, and we could have a brief meeting to discuss.

mols = smilestomol_transformer.transform(to_convert)
assert isinstance(mols, pd.DataFrame)
assert mols.columns.tolist() == ["molecule"]
Owner

I would stay with the PandasTools default of 'ROMol'; alternatively, this needs to be configurable.

Contributor Author

Used the default ROMol.

Done in ac73c12

# contains multiple molecule columns (as if from a column selector transformer).
# In that case, the resulting array should be a concatenation of the fingerprint arrays
# for each molecule column.
return X.loc[:, "molecule"]
Owner

We can't really be sure that the molecule column is called 'molecule'. It should either be configurable or we should stay with the convention that input to the transformer is always a single column. I think we may be going beyond the scope of the classes by building in column selections to them.

If needed this property can be overwritten in the child class.
"""
return np.int8

Owner

Why not keep this simple and add a hidden property directly to the class?

Contributor Author

You are right, I used a property to make sure that users were not tempted to override it.
But I agree that the most Pythonic way to achieve this is just to define private attributes.

Changed in 8cb5cfe
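A minimal sketch of the change being discussed, with assumed class and attribute names (the actual commit may differ):

```python
import numpy as np

# Before: a read-only property, to discourage users from overriding it
class FpTransformerWithProperty:
    @property
    def _fp_dtype(self):
        return np.int8

# After: a plain private class attribute; simpler, and a child class can
# still override it by redefining the attribute
class FpTransformerWithAttribute:
    _fp_dtype = np.int8
```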

@@ -13,6 +13,13 @@ def __init__(self, parallel: Union[bool, int] = False):
self.parallel = parallel
self.start_method = None #TODO implement handling of start_method

def get_feature_names_out(self, input_features=None):
Owner

Is this needed by the base sklearn classes? I can't seem to understand where it's used otherwise.

Contributor Author

This is what makes the "pandas output" mechanism work. If this method is defined, then the base classes will use it to generate the column names, and return the result as a Pandas DataFrame. If this method is not defined, we get an error when trying to set the pandas output for transformers.
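A toy example of that mechanism (not scikit-mol code): once get_feature_names_out is defined, set_output(transform="pandas") can build the output DataFrame; without it, set_output raises an error.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class SmilesLengthTransformer(BaseEstimator, TransformerMixin):
    """Toy transformer: a single feature, the length of each SMILES string."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[len(smiles)] for smiles in X])

    def get_feature_names_out(self, input_features=None):
        # The scikit-learn base classes call this to name the DataFrame columns.
        return np.array(["smiles_length"])

trf = SmilesLengthTransformer().set_output(transform="pandas")
df = trf.fit_transform(["CCO", "c1ccccc1"])  # DataFrame with column 'smiles_length'
```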

Owner

OK, thanks for the clarification

@@ -53,3 +53,18 @@ def sanitize(self, X_smiles_list, y=None):
self.errors = pd.DataFrame({'SMILES':X_errors})

return X_out, X_errors


def get_mols_from_X(X):
Owner

This is not the place to do this, I think. We should not make these kinds of column selections at this level. I would say that we need to fix the input and output of the SMILES converter and mol-from-SMILES converter to accept 2D input and pandas DataFrames, under the assumption that there's only one column. We can assert that and throw a sensible error message about using ColumnTransformer to pick it out of a DataFrame. Currently they accept iterables and return lists or flat arrays, which isn't the way according to sklearn.
Although not completely in the vein of sklearn, I would like to keep the option to input lists and flat arrays, as it's so fast to convert a list of SMILES or mols with these tools.

Owner

Thinking of it, this lack of column-vector or pandas input support is likely contributing to crashing your attempts to use ColumnTransformer.

Contributor Author

> Thinking of it, this lack of column-vector or pandas input support is likely contributing to crashing your attempts to use ColumnTransformer.

Yes, I also thought that if we change how scikit-mol handles transformer input in this PR (by making it more compatible with scikit-learn), ColumnTransformer is more likely to work. I will try after this PR is ready!

@enricogandini
Contributor Author

enricogandini commented Feb 29, 2024

> Hi Enrico, First of all, thanks for the pull request; it's nice to see that people want to use and contribute to Scikit-Mol. However, I won't merge it into main yet; I think we need some discussion on the approach. Your changes highlight a shortcoming/incompatibility in the SMILES and mol converters that needs to be brought into compliance, but I'm reluctant to build column-selection code directly into the transformer classes. You can see my comments, and we could have a brief meeting to discuss.

Thanks for the feedback, @EBjerrum! I agree with your comments; most of them are quite easy to address. I will address them in my next commits.

The main issue that emerged is about the input format for scikit-mol transformers. I implemented my solution with the goal of making as few changes as possible and maintaining backward compatibility. For this reason, I decided to write a quick utility function get_mols_from_X and make the transformer classes call it.

But if you agree that we should implement a more radical change in how the scikit-mol transformers accept input, I would be happy to try to implement it!

The main inconsistency between scikit-mol transformers and scikit-learn transformers is that:

  • scikit-mol transformers are designed to accept "flat lists" (of SMILES or molecules) as input.
  • scikit-learn transformers are designed to accept 2D structures as input. Even if the input is a single list of things, scikit-learn transformers want it in a 2D array:
    • a single-column DataFrame
    • a "vertical" NumPy array (of shape `(N, 1)`).

Otherwise scikit-learn transformers raise `ValueError: Expected 2D array, got 1D array instead`.

So, in my opinion, we could change scikit-mol transformers to make them natively accept 2D arrays with a single column:

  • If a 2D array with multiple columns is given, they should raise an error instead of "trying to select the correct column" (as my code is doing now).
  • If a "flat list" is given, in my opinion we could still accept it. It would be annoying for users to have to create single-column DataFrames as input when all they want is a list of objects! Even though that would mean "incompatibility by design" with scikit-learn.

Or do you think we should strive for maximum compatibility with scikit-learn?
Maybe we could raise a warning for now when flat lists are given, and in the future decide to raise a ValueError, as scikit-learn does.

@EBjerrum
Owner

EBjerrum commented Mar 1, 2024

>> Hi Enrico, First of all, thanks for the pull request; it's nice to see that people want to use and contribute to Scikit-Mol. However, I won't merge it into main yet; I think we need some discussion on the approach. Your changes highlight a shortcoming/incompatibility in the SMILES and mol converters that needs to be brought into compliance, but I'm reluctant to build column-selection code directly into the transformer classes. You can see my comments, and we could have a brief meeting to discuss.
>
> Thanks for the feedback, @EBjerrum! I agree with your comments; most of them are quite easy to address. I will address them in my next commits.
>
> The main issue that emerged is about the input format for scikit-mol transformers. I implemented my solution with the goal of making as few changes as possible and maintaining backward compatibility. For this reason, I decided to write a quick utility function get_mols_from_X and make the transformer classes call it.
>
> But if you agree that we should implement a more radical change in how the scikit-mol transformers accept input, I would be happy to try to implement it!
>
> The main inconsistency between scikit-mol transformers and scikit-learn transformers is that:
>
> * scikit-mol transformers are designed to accept "flat lists" (of SMILES or molecules) as input.
> * scikit-learn transformers are designed to accept 2D structures as input. Even if the input is a single list of things, scikit-learn transformers want it in a 2D array:
>   * a single-column DataFrame
>   * a "vertical" NumPy array (of shape `(N, 1)`).
>
> Otherwise scikit-learn transformers raise `ValueError: Expected 2D array, got 1D array instead`.
>
> So, in my opinion, we could change scikit-mol transformers to make them natively accept 2D arrays with a single column:
>
> * If a 2D array with multiple columns is given, they should raise an error instead of "trying to select the correct column" (as my code is doing now).
> * If a "flat list" is given, in my opinion we could still accept it. It would be annoying for users to have to create single-column DataFrames as input when all they want is a list of objects! Even though that would mean "incompatibility by design" with scikit-learn.
>
> Or do you think we should strive for maximum compatibility with scikit-learn? Maybe we could raise a warning for now when flat lists are given, and in the future decide to raise a ValueError, as scikit-learn does.

Yes, exactly. I think the change is needed to accept pandas input/2D arrays, although I don't want to turn scikit-mol into a pipeline toolkit of its own with complex configurable column-selection code. If we implement the 2D input, I'm pretty sure the scikit-learn ColumnTransformer and similar will work.

It could probably be in the form of a shared input sanitizer function, e.g. a function taking X, checking whether it is a list or other iterable, a 2D column array, or a pandas DataFrame with one column, and then transforming it to the expected format for the rest of the code. I think if we start wanting to support multi-column input to the SMILES converter and other transformers, we open a can of complexity. So maybe instead of overthinking, we make the assumption that it's a single column, assert it, and give a sensible error message if it's not.
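A sketch of such a sanitizer under the single-column assumption just described (the function name and exact behavior are assumptions, not the code that was eventually merged):

```python
import numpy as np
import pandas as pd

def sanitize_transform_input(X):
    """Normalize X to a flat list: accepts a single-column DataFrame,
    a 1D or (N, 1) NumPy array, or a plain list/iterable."""
    if isinstance(X, pd.DataFrame):
        assert X.shape[1] == 1, (
            "Only single-column DataFrames are supported; use scikit-learn's "
            "ColumnTransformer to select one column."
        )
        return X.iloc[:, 0].tolist()
    if isinstance(X, np.ndarray):
        assert X.ndim == 1 or X.shape[1] == 1, "Expected a single column."
        return list(X.ravel())
    return list(X)  # flat lists and other iterables pass through
```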

What are you thinking regarding the integration of this input sanitization: should we simply call the function from each .transform in each class, or is there a solution with a mixin class that wraps the subclass-implemented .transform, or a decorator?

I would really prefer to keep the ability to process flat lists; ideally we should have:

list = trf.transform(list)
np.array = trf.transform(np.array)
pd.DataFrame = trf.transform(pd.DataFrame)

But that's not what scikit-learn is doing; they return NumPy or pandas depending on a configurable global or object switch. It will also be difficult to support the return type of all kinds of standard iterables.

@enricogandini
Contributor Author

enricogandini commented Mar 1, 2024

> Yes, exactly. I think the change is needed to accept pandas input/2D arrays, although I don't want to turn scikit-mol into a pipeline toolkit of its own with complex configurable column-selection code. If we implement the 2D input, I'm pretty sure the scikit-learn ColumnTransformer and similar will work.

I also think that is likely! I will try after this PR is ready.

> It could probably be in the form of a shared input sanitizer function, e.g. a function taking X, checking whether it is a list or other iterable, a 2D column array, or a pandas DataFrame with one column, and then transforming it to the expected format for the rest of the code. I think if we start wanting to support multi-column input to the SMILES converter and other transformers, we open a can of complexity. So maybe instead of overthinking, we make the assumption that it's a single column, assert it, and give a sensible error message if it's not.
>
> What are you thinking regarding the integration of this input sanitization: should we simply call the function from each .transform in each class, or is there a solution with a mixin class that wraps the subclass-implemented .transform, or a decorator?

I also thought about a very simple mixin class, something like InputAdapter. But I don't know; mixins are considered a bad practice by some developers, and abusing mixins may cause hard-to-understand bugs. Scikit-learn already uses them a lot; maybe adding one more mixin class would make everything more complicated.
Maybe a simple function called internally by the transform methods would be easier to understand and less error-prone.

Of course, I now realize that that function should not be in the scikit-mol utilities module: that is meant to be user-facing. I will move that function into another module, for internal use only.

> I would really prefer to keep the ability to process flat lists; ideally we should have:

> list = trf.transform(list)
> np.array = trf.transform(np.array)
> pd.DataFrame = trf.transform(pd.DataFrame)

> But that's not what scikit-learn is doing; they return NumPy or pandas depending on a configurable global or object switch. It will also be difficult to support the return type of all kinds of standard iterables.

I think we should definitely keep supporting lists. But I think "preserving the input type" is a bad idea: that would definitely break compatibility with scikit-learn. In my opinion, we should return what scikit-learn transformers would.
That would mean always returning a 2D NumPy array (with a single column), or a DataFrame (with a single column) if the user has called set_output(transform="pandas"). If users then really needed a flat list, that would be easy. Maybe we could include a user-facing flatten_transformer_output utility function to make it even easier for them?
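Such a helper could be very small; a sketch (the function name comes from the suggestion above and is hypothetical):

```python
import numpy as np

def flatten_transformer_output(X):
    """Collapse a single-column 2D output (NumPy array or DataFrame)
    back into a flat Python list."""
    return np.asarray(X).ravel().tolist()
```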

Btw, now I think I have enough information to write some more code. When I am done, I will kindly ask you to proceed with the review!

@EBjerrum
Owner

EBjerrum commented Mar 2, 2024

>> Yes, exactly. I think the change is needed to accept pandas input/2D arrays, although I don't want to turn scikit-mol into a pipeline toolkit of its own with complex configurable column-selection code. If we implement the 2D input, I'm pretty sure the scikit-learn ColumnTransformer and similar will work.
>
> I also think that is likely! I will try after this PR is ready.
>
>> It could probably be in the form of a shared input sanitizer function, e.g. a function taking X, checking whether it is a list or other iterable, a 2D column array, or a pandas DataFrame with one column, and then transforming it to the expected format for the rest of the code. I think if we start wanting to support multi-column input to the SMILES converter and other transformers, we open a can of complexity. So maybe instead of overthinking, we make the assumption that it's a single column, assert it, and give a sensible error message if it's not.
>>
>> What are you thinking regarding the integration of this input sanitization: should we simply call the function from each .transform in each class, or is there a solution with a mixin class that wraps the subclass-implemented .transform, or a decorator?
>
> I also thought about a very simple mixin class, something like InputAdapter. But I don't know; mixins are considered a bad practice by some developers, and abusing mixins may cause hard-to-understand bugs. Scikit-learn already uses them a lot; maybe adding one more mixin class would make everything more complicated. Maybe a simple function called internally by the transform methods would be easier to understand and less error-prone.

Yeah, mixins are probably not the choice here, although sometimes they are a good solution to a given challenge, irrespective of some developers' spine responses. I would probably go with a function decorator over the in-function call. That way we can control both the input to, and later the output from, the function, and the transform functions will only need to care about their own functionality on the single input type. Have you made decorators before?
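A sketch of the decorator idea (all names assumed; the normalization mirrors the sanitizer sketch earlier in this thread):

```python
import functools
import numpy as np
import pandas as pd

def sanitize_input(transform_method):
    """Wrap a transformer's .transform so input normalization happens once,
    outside the method body; the method then only sees a flat list."""
    @functools.wraps(transform_method)
    def wrapper(self, X, *args, **kwargs):
        if isinstance(X, pd.DataFrame):
            assert X.shape[1] == 1, "Expected a single-column DataFrame."
            X = X.iloc[:, 0].tolist()
        elif isinstance(X, np.ndarray):
            assert X.ndim == 1 or X.shape[1] == 1, "Expected a single column."
            X = list(X.ravel())
        return transform_method(self, X, *args, **kwargs)
    return wrapper

class ExampleTransformer:
    @sanitize_input
    def transform(self, X):
        return [len(x) for x in X]  # stand-in for the real per-molecule work
```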

> Of course, I now realize that that function should not be in the scikit-mol utilities module: that is meant to be user-facing. I will move that function into another module, for internal use only.

>> I would really prefer to keep the ability to process flat lists; ideally we should have:
>>
>> list = trf.transform(list)
>> np.array = trf.transform(np.array)
>> pd.DataFrame = trf.transform(pd.DataFrame)
>>
>> But that's not what scikit-learn is doing; they return NumPy or pandas depending on a configurable global or object switch. It will also be difficult to support the return type of all kinds of standard iterables.

> I think we should definitely keep supporting lists. But I think "preserving the input type" is a bad idea: that would definitely break compatibility with scikit-learn. In my opinion, we should return what scikit-learn transformers would. That would mean always returning a 2D NumPy array (with a single column), or a DataFrame (with a single column) if the user has called set_output(transform="pandas"). If users then really needed a flat list, that would be easy. Maybe we could include a user-facing flatten_transformer_output utility function to make it even easier for them?

In principle I agree with staying compatible, but I would maybe like to experiment with the other solution at some point, maybe with configurable behaviour like the pd output type for scikit-learn. Then have a default "scikit_mode" and an "IO_types" mode for the transformers, which I can switch on when I get tired of not getting flat lists or arrays back. The decorator approach could be promising for enabling that feature in the future.

> Btw, now I think I have enough information to write some more code. When I am done, I will kindly ask you to proceed with the review!

Great! Happy Hacking

@EBjerrum
Owner

EBjerrum commented Mar 4, 2024

Thanks for the effort, looking forward to going through your code later this week!

@EBjerrum
Owner

EBjerrum commented Mar 7, 2024

Hi @enricogandini, I think it really starts to shape up, thanks for the work. I made some edits during my testing of the changes, but I'm not exactly sure how I can contribute changes to your pull request, so I've created a branch pandas_output that contains your changes and my suggestions, and I'll now try to make a pull request for your branch in your clone.
Can you one way or another take a look at the proposed changes? Maybe I misunderstood something.

@enricogandini
Contributor Author

> Hi @enricogandini, I think it really starts to shape up, thanks for the work. I made some edits during my testing of the changes, but I'm not exactly sure how I can contribute changes to your pull request, so I've created a branch pandas_output that contains your changes and my suggestions, and I'll now try to make a pull request for your branch in your clone. Can you one way or another take a look at the proposed changes? Maybe I misunderstood something.

Thanks, @EBjerrum! I will happily have a look at the suggested changes. In the end, did you manage to make a pull request from your pandas_output branch to the pandas_output branch of my fork? Because I don't see one. Otherwise, I'll try to make one myself.

Also, I saw your comment on Slack. Yes, I think it's possible to skip some tests. The missing scikit-learn feature from the Python 3.7 environment was just the set_output API, am I right? In that case, I think I can easily skip the tests that require that API, for Python <= 3.7 (I will check the actual version from the scikit-learn docs, maybe it's higher than that).

@enricogandini
Contributor Author

@EBjerrum I managed to create a pull request from your version of the pandas_output branch to my version of the pandas_output branch. I will have a look at everything you added!

This is the pull request link, if you are interested: enricogandini#2

I am not very familiar with cross-fork pull requests, but I think I managed to do this correctly!

Changes by Esben

I reviewed all your changes, Esben; they all make sense.

I agree that we should also support 1D arrays; I will add a warning as the comment suggests.

Thanks for spotting that useless `.reshape` call at line 56 of `conversions.py`. I probably added it there, then made the relevant changes elsewhere, and forgot about it!

Also thanks for making the Standardizer class accept all the new inputs that the other transformers will support.
@EBjerrum
Owner

EBjerrum commented Mar 8, 2024 via email

So we test that everything works, regardless of the specific name of the pandas container.
I executed the code locally with 3.12.2, and it worked.
@enricogandini
Contributor Author

enricogandini commented Mar 8, 2024

Hi @EBjerrum, I installed a local Python 3.7 environment, and the only tests that failed were the ones that specifically tried to use the scikit-learn set_output API (so, the "Pandas output" tests). I used pytest.skipif to skip those tests when the Python version is < 3.8.

I also added some more tests.

I took the liberty of configuring GitHub test runs for all Python versions <= 3.12. I had been running all the tests on Python 3.12.2, and everything worked! In my opinion this is a great testament to how stable the language and the main scientific libraries have become in recent years.

@EBjerrum
Owner

EBjerrum commented Mar 8, 2024 via email

@enricogandini
Contributor Author

> Yes, it's great with the increased stability. There's a rising focus on the addition of e.g. tests and proper installers with dependencies. Collaboration also helps; two pairs of eyes on a piece of code are better than one. Sounds excellent with the updated tests. Let me know when you think it's ready; I think it's close for this feature addition and API fix. We should maybe afterwards also add a small notebook illustrating the pandas output, and your use case.

Thanks, @EBjerrum! I am thinking about a couple of possible tests to further stabilize the feature, and I'll add a notebook with possible use cases.

> As I understand it, it's a situation of having a DataFrame with existing values (e.g. from QM calculations?) that we then want to add some extra descriptors to, using ColumnFeature to select the SMILES to pass into a pipeline with smilestomol and a descriptor calculator.

Yes, one possible use case is that users may have additional features, computed with some other packages. In this case, keeping everything in DataFrames with well-named features will simplify preprocessing and analysis of the results. Another use case would be to simplify feature importance analysis, as shown here.

But in general, I think that making scikit-mol as compatible as possible with the scikit-learn features is a good idea. Scikit-learn is the best package for generic ML workflows. I think scikit-mol users shouldn't be constrained to a subset of scikit-learn features that supports only the most common QSAR tasks. ML experts should be able to leverage the full flexibility of scikit-learn with the chemical features provided by scikit-mol.

Of course, some exceptions make sense. I think supporting flat lists and 1D arrays / Series in scikit-mol is a good idea, because in the end a lot of common QSAR tasks will need only that.

I'll add an example notebook, possibly improve the tests, and I'll let you know when I think I'm done!

NOTE: this includes an additional mandatory dependency, which is very small and pure Python: the `packaging` package. It's a package from the Python Packaging Authority, and it's the best tool to handle version information. It wasn't strictly necessary for this simple version check, but in the future more complex version checks may be needed, and this package will help a lot.
It's not Python < 3.8 that doesn't support the sklearn set_output API: it's a sklearn version < 1.2.
A sklearn version < 1.2 could be present in an environment with Python > 3.8.
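A sketch of that check (test name hypothetical; packaging.version is the PyPA tool mentioned above):

```python
import pytest
import sklearn
from packaging.version import Version

# The set_output API arrived in scikit-learn 1.2; the Python version is irrelevant.
SKLEARN_HAS_SET_OUTPUT = Version(sklearn.__version__) >= Version("1.2")

@pytest.mark.skipif(
    not SKLEARN_HAS_SET_OUTPUT,
    reason="the set_output API requires scikit-learn >= 1.2",
)
def test_pandas_output():
    ...  # exercise set_output(transform="pandas") here
```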

NOTE: should we raise a warning if the get_feature_names methods are used with sklearn < 1.2? In theory that method could still be used manually, but it would not be used automatically by the sklearn set_output machinery.
Previously, they were created with a single name (the last in the list of names!).
NOTE: there are no tests for the Standardizer class!
I wanted to add a test about the sklearn pandas output for the Standardizer class, but there are no tests for it. Should we open an issue and a PR to add tests for it?
With all transformer classes that compute features (descriptors and fingerprints)
Having fingerprint column names of the same length is better for display purposes: fp_morgan_0001...fp_morgan_0200...fp_morgan_2048.

But this may make it harder to test different fingerprint configurations: the column names will be based on the number of bits!
For instance, if nBits is 512, the first fingerprint is called fp_morgan_001; if nBits is 1024, the first fingerprint is fp_morgan_0001.

Now we will always call the first fingerprint fp_morgan_1. This will simplify processing the transformer outputs in production workflows.

A helper method that returns the nice zero-prefixed column names is included.
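In code, the two naming schemes look roughly like this (prefix and variable names illustrative):

```python
n_bits = 2048
prefix = "fp_morgan"

# Plain names, used as the actual column names: fp_morgan_1 ... fp_morgan_2048
feature_names = [f"{prefix}_{i}" for i in range(1, n_bits + 1)]

# Zero-prefixed names from the helper, nicer for display because they all
# have the same length: fp_morgan_0001 ... fp_morgan_2048
width = len(str(n_bits))
padded_names = [f"{prefix}_{i:0{width}d}" for i in range(1, n_bits + 1)]
```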
Also, add a test dataset with CDDD pre-computed features that can be used with the notebook.

TODO: create a test to demonstrate that ColumnTransformer can now be used.
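A sketch of what such a test could look like (column and class names are assumptions following scikit-mol's public API):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline

from scikit_mol.conversions import SmilesToMolTransformer
from scikit_mol.fingerprints import MorganFingerprintTransformer

data = pd.DataFrame({
    "SMILES": ["CCO", "c1ccccc1"],
    "precomputed_feature": [0.5, 1.2],
})

ct = ColumnTransformer(
    [("fp", make_pipeline(SmilesToMolTransformer(),
                          MorganFingerprintTransformer()), ["SMILES"])],
    remainder="passthrough",
)
ct.set_output(transform="pandas")
features = ct.fit_transform(data)
assert isinstance(features, pd.DataFrame)  # fingerprint bits plus passthrough column
```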
@enricogandini
Contributor Author

Hi @EBjerrum , I think this is ready. Please have another look and let me know if you think this can be merged.

I added an example notebook and some more tests.

@EBjerrum
Owner

Great, thanks for the effort! I'm super busy this week, but I will try to fit it in one afternoon anyway. Otherwise it will have to wait until next week.

@EBjerrum
Owner

Hi @enricogandini,
I'm looking at the code and your notebook. However, it fails for me, as there are 5 rows with NaN values in the CDDD computations. It seems to be empty values in the .csv.gz itself. How did it work for you? Was it an earlier version of the csv file that has since been updated? Otherwise we should either add some NaN imputation or filter the rows away.

@EBjerrum EBjerrum changed the base branch from main to pandas_output March 18, 2024 13:21
@EBjerrum EBjerrum merged commit 077c022 into EBjerrum:pandas_output Mar 18, 2024
9 checks passed
@enricogandini
Contributor Author

enricogandini commented Mar 18, 2024

> Hi @enricogandini, I'm looking at the code and your notebook. However, it fails for me, as there are 5 rows with NaN values in the CDDD computations. It seems to be empty values in the .csv.gz itself. How did it work for you? Was it an earlier version of the csv file that has since been updated? Otherwise we should either add some NaN imputation or filter the rows away.

I'll have a look, @EBjerrum! I didn't experience that issue with the CDDD sample file. Maybe I committed a wrong version.

Btw, I am not sure exactly what you did: I see this Pull Request was closed, but I don't see the changes in main. I also see that you changed the base branch from main to pandas_output.

@EBjerrum
Owner

EBjerrum commented Mar 18, 2024 via email

@enricogandini
Contributor Author

Hey @EBjerrum, I created a new pull request, since the previous one was closed: #40.

Yes, there were a few rows with only NaNs in the CDDD sample file (molecules that didn't pass the CDDD validation). I didn't notice them in the notebook, since I am using a very recent version of scikit-learn, which by default doesn't crash on NaNs.

Now this should work. If you merge it into the current "base" branch of scikit-mol (that is, the pandas_output branch), I then suggest that you merge from pandas_output to main, and then make the main branch the "base" branch again: it would be weird for the "base" branch of a repository to be called pandas_output; the common names are either main or master (main being the more recent trend!).

Linked issue (closed by this PR): Pandas transform output option is ignored or raises error (#38)