Move pandas dataframe handling to external convert_dataframe module #814

CalCraven · 2024-04-01T14:11:42Z

This PR improves how a topology is converted to a dataframe. The conversion currently lives as a method on the topology; it is now being moved to a convert_dataframe.py module. Several formats are available that give convenient default views of a topology:
- publication: gives all the parameter values you would want in a table for publication. It also removes duplicates, so each parameter is listed only once.
- default: a set of default values that are nice to have.
- remove_duplicates: returns a smaller dataframe with duplicate rows removed.
- specific_columns: lets the user specify what they want in the dataframe.
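
As a rough illustration of what removing duplicate rows does to a parameter table, here is a minimal pandas-only sketch. The column names and values are made up for the example and are not the columns this PR produces; only `DataFrame.drop_duplicates` with `ignore_index=True` is taken from the diff.

```python
import pandas as pd

# Hypothetical bond table: the two C-H bonds share identical parameters.
bonds = pd.DataFrame(
    {
        "Atom0": ["C", "C", "C"],
        "Atom1": ["H", "H", "O"],
        "k": [1000.0, 1000.0, 1200.0],
        "r_eq": [0.109, 0.109, 0.143],
    }
)

# Keep the first occurrence of each distinct row and renumber the index from 0.
unique_bonds = bonds.drop_duplicates(ignore_index=True)
print(unique_bonds)
```

The result keeps one row per distinct set of parameters, which is the kind of compact table the publication and remove_duplicates formats are described as producing.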

There is also an added function that allows you to generate dataframes that cover the parameters for a set of topologies.

Finally, there will be a function that prints the dataframes alongside the rdkit mols, which are labeled to match the dataframes.

TODO Checklist:

gmso/external/convert_dataframe.py

+
+            # handle positions?
+            # handle connection_members
+            pass


chrisjonesBSU

I left some comments after scrolling through the changes. I still need to test this out in a notebook though. I'll be sure to leave another review and/or comment after I do.

Thanks for doing this Cal, this will be a very useful feature!

chrisjonesBSU · 2024-05-13T15:17:16Z

gmso/external/convert_dataframe.py

+        You must also set the `parameter` argument to be one of {"sites", "bonds", "angles", "dihedrals", "impropers"}, not {"all"}
+        See Notes for more details on what this looks like.
+    columns : list of str, optional, default=None
+            List of strings that are attributes of the topology site and can be included as entries in the pandas dataframe.


Looks like there is an extra tab here

chrisjonesBSU · 2024-05-13T15:21:07Z

gmso/external/convert_dataframe.py

+        'publication' will use the default outputs, but remove duplicate values from the dataframes. It adds a column labeled
+        'Atom Indices' to the `sites` dataframe, which enumerates the indices that the atom_type is a part of.
+        `remove_duplicates` will use the labels in passed through the columns argument, and remove duplicates rows in the dataframe.
+        You must also set the `parameter` argument to be one of {"sites", "bonds", "angles", "dihedrals", "impropers"}, not {"all"}


I'm kind of confused about this line, and I don't see anything in the Notes about it. Is this saying that remove_duplicates specifically requires that you don't use "all"?

chrisjonesBSU · 2024-05-13T15:22:56Z

gmso/external/convert_dataframe.py

+        columnsDict = {parameter: columns}
+    else:
+        raise ValueError(
+            f"Please provide formt=['default', 'specific_columns', 'publication']"


Small typo here, I think this should say format

Also, we should add remove_duplicates as an option here, correct? I think this error message could be a little clearer. Something like "Available options for format are 'default', 'specific_columns', 'publication', or 'remove_duplicates'."

Also, the f string isn't doing anything here, do we want to print out what they passed to format?
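
To make the suggestion concrete, here is a minimal sketch of a check that lists every option and echoes back what was passed. The helper name `check_format`, the `VALID_FORMATS` tuple, and the argument name `fmt` are hypothetical and not part of the PR.

```python
# Hypothetical names for illustration; the option strings come from this review thread.
VALID_FORMATS = ("default", "specific_columns", "publication", "remove_duplicates")


def check_format(fmt):
    """Return fmt unchanged if it is supported, otherwise raise ValueError."""
    if fmt not in VALID_FORMATS:
        options = ", ".join(repr(name) for name in VALID_FORMATS)
        # The f-string now does real work: it reports the rejected value.
        raise ValueError(
            f"Unsupported format {fmt!r}. Available options for format are {options}."
        )
    return fmt
```

Keeping the options in one tuple means the error message cannot drift out of sync with the accepted values.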

chrisjonesBSU · 2024-05-13T15:27:20Z

gmso/external/convert_dataframe.py

+            df.drop_duplicates(inplace=True, ignore_index=True)
+
+    ###############
+    # END OF FUNCTION


We could take this out.

chrisjonesBSU · 2024-05-13T15:38:50Z

gmso/external/convert_dataframe.py

+    return list(map(parseFunction, iteritems))
+
+
+def _pandas_from_parameters(


Is this function called or used anywhere? Looks like this is the old method from Topology, I'm guessing we can remove it here?

chrisjonesBSU · 2024-05-13T15:50:05Z

gmso/external/convert_dataframe.py

+    return df
+
+
+def _parse_dataframe_attrs(


We could remove this one as well

chrisjonesBSU · 2024-05-13T17:11:23Z

Do you have an example of how to see and access the units when using handle_unyts="in_headers"? It doesn't seem to be placing the units in the column names, from what I can tell. It does work as expected when using with_data. Also, it's not clear from the docstrings what the expected behavior of all_floats is. Does that mean not to include units at all? Maybe this could be made clearer by using a different option name (e.g. no_units) or by adding an extra sentence to the docstrings.
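
For reference, here is one possible reading of the three option names, written as a small sketch with plain Python values and a unit string instead of unyt objects. This is an assumption about the intended behavior, not what the PR implements; `format_column` and its arguments are hypothetical.

```python
def format_column(name, values, unit, handle_unyts):
    """Return (header, values) under one guessed meaning of each option."""
    if handle_unyts == "in_headers":
        # Units move into the column name; the data stay as bare floats.
        return f"{name} ({unit})", [float(v) for v in values]
    if handle_unyts == "with_data":
        # Units stay attached to every entry (shown here as strings).
        return name, [f"{float(v)} {unit}" for v in values]
    if handle_unyts == "all_floats":
        # Units are dropped entirely.
        return name, [float(v) for v in values]
    raise ValueError(f"Unknown handle_unyts option {handle_unyts!r}.")


header, data = format_column("r_eq", [0.109, 0.143], "nm", "in_headers")
print(header, data)
```

If this matches the intent, a sentence per option in the docstring along these lines would answer the question above.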

chrisjonesBSU · 2024-05-13T17:26:02Z

A couple more thoughts:

Do we want parameter to be passable as a list of parameters? Right now you either pick one or get all of them. Would we prefer something like this to also work?

top_df_dict = to_dataframeDict(
    topology=top,
    parameter=["bonds", "angles"],
    handle_unyts="with_data"
)

We could handle this similarly to what we do with ignore_params in apply
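
A minimal sketch of how a string-or-list argument could be normalized before building the dataframes. The helper `normalize_parameter` is hypothetical; the accepted names are the ones listed in the docstring under review, plus "all".

```python
# Names taken from the docstring in the diff; the helper itself is hypothetical.
VALID_PARAMETERS = ("sites", "bonds", "angles", "dihedrals", "impropers")


def normalize_parameter(parameter):
    """Turn 'all', a single name, or an iterable of names into a validated list."""
    if isinstance(parameter, str):
        requested = list(VALID_PARAMETERS) if parameter == "all" else [parameter]
    else:
        requested = list(parameter)
    unknown = [name for name in requested if name not in VALID_PARAMETERS]
    if unknown:
        raise ValueError(
            f"Unknown parameter(s) {unknown}. Choose from {list(VALID_PARAMETERS)} or 'all'."
        )
    return requested


print(normalize_parameter(["bonds", "angles"]))
```

With this in place, the rest of the function could loop over the returned list regardless of how the user spelled the argument.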

This is a matter of preference and style, but these changes use a lot of camel case in variable names and the main function name (to_dataframeDict), and I think most of the gmso code base is snake case. Should we try to be consistent here as well?

CalCraven and others added 2 commits April 1, 2024 09:03

Move pandas dataframe handling to external convert_dataframe module

e02bf5c

[pre-commit.ci] auto fixes from pre-commit.com hooks

230807d

for more information, see https://pre-commit.ci

github-advanced-security bot found potential problems Apr 1, 2024

daico007 and others added 3 commits April 3, 2024 21:05

Merge branch 'main' into dataframe_simplification

9c7b0f1

Fix docstrings and extraneous imports

61ce3f9

Merge branch 'main' of https://github.com/mosdef-hub/gmso into datafr…

1f0f55d

…ame_simplification

chrisjonesBSU requested changes May 13, 2024
