
Pandas output #39

Merged: 16 commits into EBjerrum:pandas_output on Mar 18, 2024

Conversation

enricogandini
Contributor

Closes #38.

All scikit-mol transformer classes now support the scikit-learn set_output API. They can keep the data in DataFrames with meaningful column names.
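For instance, a minimal sketch of the intended usage (module and class names follow scikit-mol's public API, but treat this as an illustration rather than the PR's test code):

```python
from sklearn.pipeline import Pipeline

from scikit_mol.conversions import SmilesToMolTransformer
from scikit_mol.fingerprints import MorganFingerprintTransformer

pipe = Pipeline([
    ("smiles_to_mol", SmilesToMolTransformer()),
    ("fingerprint", MorganFingerprintTransformer()),
])
# With scikit-learn's set_output API (>= 1.2), each step can now return a
# DataFrame with named columns instead of a bare NumPy array.
pipe.set_output(transform="pandas")
```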

@EBjerrum
Owner

Super, thanks. I'm super busy this week. I'll try to squeeze in a code-review and test-drive anyway.


Also sets the column prefix for use by the transform method with dataframe output.
"""
self._get_column_prefix()
Owner

This is not doing anything as the return is simply discarded???

Contributor Author

Yes, I added it there when I was developing the code, then I forgot about it.

Done in 49d39d0

Owner

@EBjerrum left a comment

Hi Enrico,
First of all, thanks for the pull request; it's nice to see that people want to use and contribute to Scikit-Mol.
However, I won't merge it into main yet; I think we need some discussion on the approach. Your changes highlight a shortcoming/incompatibility in the SMILES and mol converters that needs to be brought into compliance, but I'm reluctant to build column-selection code directly into the transformer classes. You can see my comments, and we could have a brief meeting to discuss.

mols = smilestomol_transformer.transform(to_convert)
assert isinstance(mols, pd.DataFrame)
assert mols.columns.tolist() == ["molecule"]
Owner

I would stay with the PandasTools default of 'ROMol'; alternatively, this needs to be configurable.

Contributor Author

Used the default ROMol.

Done in ac73c12

# contains multiple molecule columns (as if from a column selector transformer).
# In that case, the resulting array should be a concatenation of the fingerprint arrays
# for each molecule column.
return X.loc[:, "molecule"]
Owner

We can't really be sure that the molecule column is called 'molecule'. It should either be configurable or we should stay with the convention that input to the transformer is always a single column. I think we may be going beyond the scope of the classes by building in column selections to them.

If needed this property can be overwritten in the child class.
"""
return np.int8

Owner

Why not keep this simple and add a hidden property directly to the class?

Contributor Author

You are right, I used a property to make sure that users were not tempted to override it.
But I agree that the most Pythonic way to achieve this is just to define private attributes.

Changed in 8cb5cfe
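A minimal sketch of the change being discussed, with assumed class and attribute names (the actual commit may differ):

```python
import numpy as np

# Before: a read-only property, to discourage users from overriding it
class FpTransformerWithProperty:
    @property
    def _fp_dtype(self):
        return np.int8

# After: a plain private class attribute; simpler, and a child class can
# still override it by redefining the attribute
class FpTransformerWithAttribute:
    _fp_dtype = np.int8
```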

@@ -13,6 +13,13 @@ def __init__(self, parallel: Union[bool, int] = False):
self.parallel = parallel
self.start_method = None #TODO implement handling of start_method

def get_feature_names_out(self, input_features=None):
Owner

Is this needed by the base sklearn classes? I can't seem to understand where it's used otherwise.

Contributor Author

This is what makes the "pandas output" mechanism work. If this method is defined, then the base classes will use it to generate the column names, and return the result as a Pandas DataFrame. If this method is not defined, we get an error when trying to set the pandas output for transformers.
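A toy example of that mechanism (not scikit-mol code): once get_feature_names_out is defined, set_output(transform="pandas") can build the output DataFrame; without it, set_output raises an error.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class SmilesLengthTransformer(BaseEstimator, TransformerMixin):
    """Toy transformer: a single feature, the length of each SMILES string."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[len(smiles)] for smiles in X])

    def get_feature_names_out(self, input_features=None):
        # The scikit-learn base classes call this to name the DataFrame columns.
        return np.array(["smiles_length"])

trf = SmilesLengthTransformer().set_output(transform="pandas")
df = trf.fit_transform(["CCO", "c1ccccc1"])  # DataFrame with column 'smiles_length'
```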

Owner

OK, thanks for the clarification

@@ -53,3 +53,18 @@ def sanitize(self, X_smiles_list, y=None):
self.errors = pd.DataFrame({'SMILES':X_errors})

return X_out, X_errors


def get_mols_from_X(X):
Owner

This is not the place to do this, I think. We should not make these kinds of column selections at this level. I would say that we need to fix the input and output of the SMILES converter and mol-from-SMILES converter to accept 2D input and pandas DataFrames, under the assumption that there's only one column. We can assert that and throw a sensible error message about using ColumnTransformer to pick it out of a DataFrame. Currently they accept iterables and return lists or flat arrays, which isn't the way according to sklearn.
Although not completely in the vein of sklearn, I would like to keep the option to input lists and flat arrays, as it's so fast to convert a list of SMILES or mols with these tools.

Owner

Thinking of it, this lack of column-vector or pandas input support is likely contributing to crashing your attempts to use ColumnTransformer.

Contributor Author

> Thinking of it, this lack of column-vector or pandas input support is likely contributing to crashing your attempts to use ColumnTransformer.

Yes, I also thought that if we change how scikit-mol handles transformer input in this PR (by making it more compatible with scikit-learn), ColumnTransformer is more likely to work. I will try after this PR is ready!

@enricogandini
Contributor Author

enricogandini commented Feb 29, 2024

> Hi Enrico, First of all, thanks for the pull request; it's nice to see that people want to use and contribute to Scikit-Mol. However, I won't merge it into main yet; I think we need some discussion on the approach. Your changes highlight a shortcoming/incompatibility in the SMILES and mol converters that needs to be brought into compliance, but I'm reluctant to build column-selection code directly into the transformer classes. You can see my comments, and we could have a brief meeting to discuss.

Thanks for the feedback, @EBjerrum! I agree with your comments; most of them are quite easy to address. I will address them in my next commits.

The main issue that emerged is about the input format for scikit-mol transformers. I implemented my solution with the goal of making as few changes as possible and maintaining backward compatibility. For this reason, I decided to write a quick utility function get_mols_from_X and make the transformer classes call it.

But if you agree that we should implement a more radical change in how the scikit-mol transformers accept input, I would be happy to try to implement it!

The main inconsistency between scikit-mol transformers and scikit-learn transformers is that:

  • scikit-mol transformers are designed to accept "flat lists" (of SMILES or molecules) as input.
  • scikit-learn transformers are designed to accept 2D structures as input. Even if the input is a single list of things, scikit-learn transformers want it in a 2D array:
    • a single-column DataFrame
    • a "vertical" NumPy array (of shape `(N, 1)`).

Otherwise scikit-learn transformers raise `ValueError: Expected 2D array, got 1D array instead`.

So, in my opinion, we could change scikit-mol transformers to make them natively accept 2D arrays with a single column:

  • If a 2D array with multiple columns is given, they should raise an error instead of "trying to select the correct column" (as my code is doing now).
  • If a "flat list" is given, in my opinion we could still accept it. It would be annoying for users to have to create single-column DataFrames as input when all they want is a list of objects! Even though that would mean "incompatibility by design" with scikit-learn.

Or do you think we should strive for maximum compatibility with scikit-learn?
Maybe we could raise a warning for now when flat lists are given, and in the future decide to raise a ValueError, as scikit-learn does.

@EBjerrum
Owner

EBjerrum commented Mar 1, 2024

>> Hi Enrico, First of all, thanks for the pull request; it's nice to see that people want to use and contribute to Scikit-Mol. However, I won't merge it into main yet; I think we need some discussion on the approach. Your changes highlight a shortcoming/incompatibility in the SMILES and mol converters that needs to be brought into compliance, but I'm reluctant to build column-selection code directly into the transformer classes. You can see my comments, and we could have a brief meeting to discuss.
>
> Thanks for the feedback, @EBjerrum! I agree with your comments; most of them are quite easy to address. I will address them in my next commits.
>
> The main issue that emerged is about the input format for scikit-mol transformers. I implemented my solution with the goal of making as few changes as possible and maintaining backward compatibility. For this reason, I decided to write a quick utility function get_mols_from_X and make the transformer classes call it.
>
> But if you agree that we should implement a more radical change in how the scikit-mol transformers accept input, I would be happy to try to implement it!
>
> The main inconsistency between scikit-mol transformers and scikit-learn transformers is that:
>
> * scikit-mol transformers are designed to accept "flat lists" (of SMILES or molecules) as input.
> * scikit-learn transformers are designed to accept 2D structures as input. Even if the input is a single list of things, scikit-learn transformers want it in a 2D array:
>   * a single-column DataFrame
>   * a "vertical" NumPy array (of shape `(N, 1)`).
>
> Otherwise scikit-learn transformers raise `ValueError: Expected 2D array, got 1D array instead`.
>
> So, in my opinion, we could change scikit-mol transformers to make them natively accept 2D arrays with a single column:
>
> * If a 2D array with multiple columns is given, they should raise an error instead of "trying to select the correct column" (as my code is doing now).
> * If a "flat list" is given, in my opinion we could still accept it. It would be annoying for users to have to create single-column DataFrames as input when all they want is a list of objects! Even though that would mean "incompatibility by design" with scikit-learn.
>
> Or do you think we should strive for maximum compatibility with scikit-learn? Maybe we could raise a warning for now when flat lists are given, and in the future decide to raise a ValueError, as scikit-learn does.

Yes, exactly. I think the change is needed to accept pandas input/2D arrays, although I don't want to turn scikit-mol into a pipeline toolkit of its own with complex configurable column-selection code. If we implement the 2D input, I'm pretty sure the scikit-learn ColumnTransformer and similar will work.

It could probably be in the form of a shared input sanitizer function, e.g. a function taking X, checking whether it is a list or other iterable, a 2D column array, or a pandas DataFrame with one column, and then transforming it to the expected format for the rest of the code. I think if we start wanting to support multi-column input to the SMILES converter and other transformers, we open a can of complexity. So maybe instead of overthinking, we make the assumption that it's a single column, assert it, and give a sensible error message if it's not.
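A sketch of such a sanitizer under the single-column assumption just described (the function name and exact behavior are assumptions, not the code that was eventually merged):

```python
import numpy as np
import pandas as pd

def sanitize_transform_input(X):
    """Normalize X to a flat list: accepts a single-column DataFrame,
    a 1D or (N, 1) NumPy array, or a plain list/iterable."""
    if isinstance(X, pd.DataFrame):
        assert X.shape[1] == 1, (
            "Only single-column DataFrames are supported; use scikit-learn's "
            "ColumnTransformer to select one column."
        )
        return X.iloc[:, 0].tolist()
    if isinstance(X, np.ndarray):
        assert X.ndim == 1 or X.shape[1] == 1, "Expected a single column."
        return list(X.ravel())
    return list(X)  # flat lists and other iterables pass through
```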

What are you thinking regarding the integration of this input sanitization: should we simply call the function from each .transform in each class, or is there a solution with a mixin class that wraps the subclass-implemented .transform, or a decorator?

I would really prefer to keep the ability to process flat lists; ideally we should have:

list = trf.transform(list)
np.array = trf.transform(np.array)
pd.DataFrame = trf.transform(pd.DataFrame)

But that's not what scikit-learn is doing; they return NumPy or pandas depending on a configurable global or object switch. It will also be difficult to support the return type of all kinds of standard iterables.

@enricogandini
Contributor Author

enricogandini commented Mar 1, 2024

> Yes, exactly. I think the change is needed to accept pandas input/2D arrays, although I don't want to turn scikit-mol into a pipeline toolkit of its own with complex configurable column-selection code. If we implement the 2D input, I'm pretty sure the scikit-learn ColumnTransformer and similar will work.

I also think that is likely! I will try after this PR is ready.

> It could probably be in the form of a shared input sanitizer function, e.g. a function taking X, checking whether it is a list or other iterable, a 2D column array, or a pandas DataFrame with one column, and then transforming it to the expected format for the rest of the code. I think if we start wanting to support multi-column input to the SMILES converter and other transformers, we open a can of complexity. So maybe instead of overthinking, we make the assumption that it's a single column, assert it, and give a sensible error message if it's not.
>
> What are you thinking regarding the integration of this input sanitization: should we simply call the function from each .transform in each class, or is there a solution with a mixin class that wraps the subclass-implemented .transform, or a decorator?

I also thought about a very simple mixin class, something like InputAdapter. But I don't know; mixins are considered a bad practice by some developers, and abusing mixins may cause hard-to-understand bugs. Scikit-learn already uses them a lot; maybe adding one more mixin class would make everything more complicated.
Maybe a simple function called internally by the transform methods would be easier to understand and less error-prone.

Of course, I now realize that that function should not be in the scikit-mol utilities module: that is meant to be user-facing. I will move that function into another module, for internal use only.

> I would really prefer to keep the ability to process flat lists; ideally we should have:

> list = trf.transform(list)
> np.array = trf.transform(np.array)
> pd.DataFrame = trf.transform(pd.DataFrame)

> But that's not what scikit-learn is doing; they return NumPy or pandas depending on a configurable global or object switch. It will also be difficult to support the return type of all kinds of standard iterables.

I think we should definitely keep supporting lists. But I think "preserving the input type" is a bad idea: that would definitely break compatibility with scikit-learn. In my opinion, we should return what scikit-learn transformers would.
That would mean always returning a 2D NumPy array (with a single column), or a DataFrame (with a single column) if the user has called set_output(transform="pandas"). If users then really needed a flat list, that would be easy. Maybe we could include a user-facing flatten_transformer_output utility function to make it even easier for them?
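Such a helper could be very small; a sketch (the function name comes from the suggestion above and is hypothetical):

```python
import numpy as np

def flatten_transformer_output(X):
    """Collapse a single-column 2D output (NumPy array or DataFrame)
    back into a flat Python list."""
    return np.asarray(X).ravel().tolist()
```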

Btw, now I think I have enough information to write some more code. When I am done, I will kindly ask you to proceed with the review!

@EBjerrum
Owner

EBjerrum commented Mar 2, 2024

>> Yes, exactly. I think the change is needed to accept pandas input/2D arrays, although I don't want to turn scikit-mol into a pipeline toolkit of its own with complex configurable column-selection code. If we implement the 2D input, I'm pretty sure the scikit-learn ColumnTransformer and similar will work.
>
> I also think that is likely! I will try after this PR is ready.
>
>> It could probably be in the form of a shared input sanitizer function, e.g. a function taking X, checking whether it is a list or other iterable, a 2D column array, or a pandas DataFrame with one column, and then transforming it to the expected format for the rest of the code. I think if we start wanting to support multi-column input to the SMILES converter and other transformers, we open a can of complexity. So maybe instead of overthinking, we make the assumption that it's a single column, assert it, and give a sensible error message if it's not.
>>
>> What are you thinking regarding the integration of this input sanitization: should we simply call the function from each .transform in each class, or is there a solution with a mixin class that wraps the subclass-implemented .transform, or a decorator?
>
> I also thought about a very simple mixin class, something like InputAdapter. But I don't know; mixins are considered a bad practice by some developers, and abusing mixins may cause hard-to-understand bugs. Scikit-learn already uses them a lot; maybe adding one more mixin class would make everything more complicated. Maybe a simple function called internally by the transform methods would be easier to understand and less error-prone.

Yeah, mixins are probably not the choice here, although sometimes they are a good solution to a given challenge, irrespective of some developers' spine responses. I would probably go with a function decorator over the in-function call. That way we can control both the input to, and later the output from, the function, and the transform functions will only need to care about their own functionality on the single input type. Have you made decorators before?
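A sketch of the decorator idea (all names assumed; the normalization mirrors the sanitizer sketch earlier in this thread):

```python
import functools
import numpy as np
import pandas as pd

def sanitize_input(transform_method):
    """Wrap a transformer's .transform so input normalization happens once,
    outside the method body; the method then only sees a flat list."""
    @functools.wraps(transform_method)
    def wrapper(self, X, *args, **kwargs):
        if isinstance(X, pd.DataFrame):
            assert X.shape[1] == 1, "Expected a single-column DataFrame."
            X = X.iloc[:, 0].tolist()
        elif isinstance(X, np.ndarray):
            assert X.ndim == 1 or X.shape[1] == 1, "Expected a single column."
            X = list(X.ravel())
        return transform_method(self, X, *args, **kwargs)
    return wrapper

class ExampleTransformer:
    @sanitize_input
    def transform(self, X):
        return [len(x) for x in X]  # stand-in for the real per-molecule work
```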

> Of course, I now realize that that function should not be in the scikit-mol utilities module: that is meant to be user-facing. I will move that function into another module, for internal use only.

>> I would really prefer to keep the ability to process flat lists; ideally we should have:
>>
>> list = trf.transform(list)
>> np.array = trf.transform(np.array)
>> pd.DataFrame = trf.transform(pd.DataFrame)
>>
>> But that's not what scikit-learn is doing; they return NumPy or pandas depending on a configurable global or object switch. It will also be difficult to support the return type of all kinds of standard iterables.

> I think we should definitely keep supporting lists. But I think "preserving the input type" is a bad idea: that would definitely break compatibility with scikit-learn. In my opinion, we should return what scikit-learn transformers would. That would mean always returning a 2D NumPy array (with a single column), or a DataFrame (with a single column) if the user has called set_output(transform="pandas"). If users then really needed a flat list, that would be easy. Maybe we could include a user-facing flatten_transformer_output utility function to make it even easier for them?

In principle I agree with staying compatible, but I would maybe like to experiment with the other solution at some point, maybe with configurable behaviour like the pd output type for scikit-learn. Then have a default "scikit_mode" and an "IO_types" mode for the transformers, which I can switch on when I get tired of not getting flat lists or arrays back. The decorator approach could be promising for enabling that feature in the future.

> Btw, now I think I have enough information to write some more code. When I am done, I will kindly ask you to proceed with the review!

Great! Happy Hacking

@EBjerrum
Owner

EBjerrum commented Mar 4, 2024

Thanks for the effort, looking forward to going through your code later this week!

@EBjerrum
Owner

EBjerrum commented Mar 7, 2024

Hi @enricogandini, I think it really starts to shape up, thanks for the work. I made some edits during my testing of the changes, but I'm not exactly sure how I can contribute changes to your pull request, so I've created a branch pandas_output that contains your changes and my suggestions, and I'll now try to make a pull request for your branch in your clone.
Can you one way or another take a look at the proposed changes? Maybe I misunderstood something.

@enricogandini
Contributor Author

> Hi @enricogandini, I think it really starts to shape up, thanks for the work. I made some edits during my testing of the changes, but I'm not exactly sure how I can contribute changes to your pull request, so I've created a branch pandas_output that contains your changes and my suggestions, and I'll now try to make a pull request for your branch in your clone. Can you one way or another take a look at the proposed changes? Maybe I misunderstood something.

Thanks, @EBjerrum! I will happily have a look at the suggested changes. In the end, did you manage to make a pull request from your pandas_output branch to the pandas_output branch of my fork? Because I don't see one. Otherwise, I'll try to make one myself.

Also, I saw your comment on Slack. Yes, I think it's possible to skip some tests. The missing scikit-learn feature from the Python 3.7 environment was just the set_output API, am I right? In that case, I think I can easily skip the tests that require that API, for Python <= 3.7 (I will check the actual version from the scikit-learn docs, maybe it's higher than that).

@enricogandini
Contributor Author

@EBjerrum I managed to create a pull request from your version of the pandas_output branch to my version of the pandas_output branch. I will have a look at everything you added!

This is the pull request link, if you are interested: enricogandini#2

I am not very familiar with cross-fork pull requests, but I think I managed to do this correctly!

Changes by Esben

I reviewed all your changes, Esben; they all make sense.

I agree that we should also support 1D arrays; I will add a warning as the comment suggests.

Thanks for spotting that useless `.reshape` call at line 56 of `conversions.py`. I probably added it there, then made the relevant changes elsewhere, and forgot about it!

Also thanks for making the Standardizer class accept all the new inputs that the other transformers will support.
@EBjerrum
Owner

EBjerrum commented Mar 8, 2024 via email

So we test that everything works, regardless of the specific name of the pandas container.
I executed the code locally with 3.12.2, and it worked.
@enricogandini
Contributor Author

enricogandini commented Mar 8, 2024

Hi @EBjerrum, I installed a local Python 3.7 environment, and the only tests that failed were the ones that specifically tried to use the scikit-learn set_output API (so, the "Pandas output" tests). I used pytest.skipif to skip those tests when the Python version is < 3.8.

I also added some more tests.

I took the liberty of configuring GitHub test runs for all Python versions <= 3.12. I had been running all the tests on Python 3.12.2, and everything worked! In my opinion this is a great testament to how stable the language and the main scientific libraries have become in recent years.

@EBjerrum
Owner

EBjerrum commented Mar 8, 2024 via email

@enricogandini
Contributor Author

> Yes, it's great with the increased stability. There's a rising focus on the addition of e.g. tests and proper installers with dependencies. Collaboration also helps; two pairs of eyes on a piece of code are better than one. Sounds excellent with the updated tests. Let me know when you think it's ready; I think it's close for this feature addition and API fix. We should maybe afterwards also add a small notebook illustrating the pandas output, and your use case.

Thanks, @EBjerrum! I am thinking about a couple of possible tests to further stabilize the feature, and I'll add a notebook with possible use cases.

> As I understand it, it's a situation of having a DataFrame with existing values (e.g. from QM calculations?) that we then want to add some extra descriptors to, using ColumnFeature to select the SMILES to pass into a pipeline with smilestomol and a descriptor calculator.

Yes, one possible use case is that users may have additional features, computed with some other packages. In this case, keeping everything in DataFrames with well-named features will simplify preprocessing and analysis of the results. Another use case would be to simplify feature importance analysis, as shown here.

But in general, I think that making scikit-mol as compatible as possible with the scikit-learn features is a good idea. Scikit-learn is the best package for generic ML workflows. I think scikit-mol users shouldn't be constrained to a subset of scikit-learn features that supports only the most common QSAR tasks. ML experts should be able to leverage the full flexibility of scikit-learn with the chemical features provided by scikit-mol.

Of course, some exceptions make sense. I think supporting flat lists and 1D arrays / Series in scikit-mol is a good idea, because in the end a lot of common QSAR tasks will need only that.

I'll add an example notebook, possibly improve the tests, and I'll let you know when I think I'm done!

NOTE: this includes an additional mandatory dependency, which is very small and pure Python: the `packaging` package. It's a package from the Python Packaging Authority, and it's the best tool to handle version information. It wasn't strictly necessary for this simple version check, but in the future more complex version checks may be needed, and this package will help a lot.
It's not Python < 3.8 that doesn't support the sklearn set_output API: it's a sklearn version < 1.2.
A sklearn version < 1.2 could be present in an environment with Python > 3.8.
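A sketch of that check (test name hypothetical; packaging.version is the PyPA tool mentioned above):

```python
import pytest
import sklearn
from packaging.version import Version

# The set_output API arrived in scikit-learn 1.2; the Python version is irrelevant.
SKLEARN_HAS_SET_OUTPUT = Version(sklearn.__version__) >= Version("1.2")

@pytest.mark.skipif(
    not SKLEARN_HAS_SET_OUTPUT,
    reason="the set_output API requires scikit-learn >= 1.2",
)
def test_pandas_output():
    ...  # exercise set_output(transform="pandas") here
```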

NOTE: should we raise a warning if the get_feature_names methods are used with sklearn < 1.2? In theory that method could still be used manually, but it would not be used automatically by the sklearn set_output machinery.
Previously, they were created with a single name (the last in the list of names!).
NOTE: there are no tests for the Standardizer class!
I wanted to add a test about the sklearn pandas output for the Standardizer class, but there are no tests for it. Should we open an issue and a PR to add tests for it?
With all transformer classes that compute features (descriptors and fingerprints)
Having fingerprint column names of the same length is better for display purposes: fp_morgan_0001...fp_morgan_0200...fp_morgan_2048.

But this may make it harder to test different fingerprint configurations: the column names will be based on the number of bits!
For instance, if nBits is 512, the first fingerprint is called fp_morgan_001; if nBits is 1024, the first fingerprint is fp_morgan_0001.

Now we will always call the first fingerprint fp_morgan_1. This will simplify processing the transformer outputs in production workflows.

A helper method that returns the nice zero-prefixed column names is included.
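In code, the two naming schemes look roughly like this (prefix and variable names illustrative):

```python
n_bits = 2048
prefix = "fp_morgan"

# Plain names, used as the actual column names: fp_morgan_1 ... fp_morgan_2048
feature_names = [f"{prefix}_{i}" for i in range(1, n_bits + 1)]

# Zero-prefixed names from the helper, nicer for display because they all
# have the same length: fp_morgan_0001 ... fp_morgan_2048
width = len(str(n_bits))
padded_names = [f"{prefix}_{i:0{width}d}" for i in range(1, n_bits + 1)]
```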
Also, add a test dataset with CDDD pre-computed features that can be used with the notebook.

TODO: create a test to demonstrate that ColumnTransformer can now be used.
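A sketch of what such a test could look like (column and class names are assumptions following scikit-mol's public API):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline

from scikit_mol.conversions import SmilesToMolTransformer
from scikit_mol.fingerprints import MorganFingerprintTransformer

data = pd.DataFrame({
    "SMILES": ["CCO", "c1ccccc1"],
    "precomputed_feature": [0.5, 1.2],
})

ct = ColumnTransformer(
    [("fp", make_pipeline(SmilesToMolTransformer(),
                          MorganFingerprintTransformer()), ["SMILES"])],
    remainder="passthrough",
)
ct.set_output(transform="pandas")
features = ct.fit_transform(data)
assert isinstance(features, pd.DataFrame)  # fingerprint bits plus passthrough column
```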
@enricogandini
Contributor Author

Hi @EBjerrum , I think this is ready. Please have another look and let me know if you think this can be merged.

I added an example notebook and some more tests.

@EBjerrum
Owner

Great, thanks for the effort! I'm super busy this week, but I will try to fit it in one afternoon anyway. Otherwise it will have to wait until next week.

@EBjerrum
Owner

Hi @enricogandini,
I'm looking at the code and your notebook. However, it fails for me, as there are 5 rows with NaN values in the CDDD computations. It seems to be empty values in the .csv.gz itself. How did it work for you? Was it an earlier version of the csv file that has since been updated? Otherwise we should either add some NaN imputation or filter the rows away.

@EBjerrum EBjerrum changed the base branch from main to pandas_output March 18, 2024 13:21
@EBjerrum EBjerrum merged commit 077c022 into EBjerrum:pandas_output Mar 18, 2024
9 checks passed
@enricogandini
Contributor Author

enricogandini commented Mar 18, 2024

> Hi @enricogandini, I'm looking at the code and your notebook. However, it fails for me, as there are 5 rows with NaN values in the CDDD computations. It seems to be empty values in the .csv.gz itself. How did it work for you? Was it an earlier version of the csv file that has since been updated? Otherwise we should either add some NaN imputation or filter the rows away.

I'll have a look, @EBjerrum! I didn't experience that issue with the CDDD sample file. Maybe I committed a wrong version.

Btw, I am not sure exactly what you did: I see this Pull Request was closed, but I don't see the changes in main. I also see that you changed the base branch from main to pandas_output.

@EBjerrum
Owner

EBjerrum commented Mar 18, 2024 via email

@enricogandini
Contributor Author

Hey @EBjerrum, I created a new pull request, since the previous one was closed: #40.

Yes, there were a few rows with only NaNs in the CDDD sample file (molecules that didn't pass the CDDD validation). I didn't notice them in the notebook, since I am using a very recent version of scikit-learn, which by default doesn't crash on NaNs.

Now this should work. If you merge it into the current "base" branch of scikit-mol (that is, the pandas_output branch), I then suggest that you merge from pandas_output to main, and then make the main branch the "base" branch again: it would be weird for the "base" branch of a repository to be called pandas_output; the common names are either main or master (main being the more recent trend!).

Linked issue (closed by this PR): Pandas transform output option is ignored or raises error (#38)