Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move pandas dataframe handling to external convert_dataframe module #814

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

CalCraven
Copy link
Contributor

This PR looks to improve the handling for converting a topology to a dataframe. This currently lives as a method for topology. It is now being moved to a convert_dataframe.py module. A few different formats are available which give some nice default ways to view a topology. Notably, we have the formats:
-publication which gives all the parameter values you would want to have in a table for publication. This also removes duplicates so each parameter is only listed once.
-default some default values which are nice to have
-remove_duplicates which allows you to get a smaller dataframe with duplicate rows removed.
-specific_columns Allows the user to specify what they want in the dataframe.

There is also an added function that allows you to generate dataframes that cover the parameters for a set of topologies.

Finally, there will be some function that prints the dataframes with the rdkit mols which are labeled to match the dataframes.

TODO Checklist:

  • Error checking on arguments
  • Replace topology.py dataframe methods/tests
  • Doc strings
  • Handle units
  • Handle parameters that return lists
  • Handle parameters that return dictionaries
  • Return unique elements if style is publication
  • Handle parameter "all" better
  • Function to concatenate multiple topogies into one output
  • Remove replicate rows flag -> similar to the publication style, but without the atom_indices section added
  • Function to create a topology with all data from dataframe matching rdkit mol image
    • Could just also make this a format unique_types

gmso/external/convert_dataframe.py Fixed Show fixed Hide fixed
gmso/external/convert_dataframe.py Fixed Show fixed Hide fixed
gmso/tests/test_convert_dataframe.py Fixed Show fixed Hide fixed
gmso/tests/test_convert_dataframe.py Fixed Show fixed Hide fixed
gmso/external/convert_dataframe.py Fixed Show fixed Hide fixed
gmso/external/convert_dataframe.py Fixed Show fixed Hide fixed
gmso/tests/test_convert_dataframe.py Fixed Show fixed Hide fixed
gmso/tests/test_convert_dataframe.py Fixed Show fixed Hide fixed
gmso/tests/test_convert_dataframe.py Fixed Show fixed Hide fixed

# handle positions?
# handle connection_members
pass

Check warning

Code scanning / CodeQL

Unnecessary pass Warning

Unnecessary 'pass' statement.
Copy link
Contributor

@chrisjonesBSU chrisjonesBSU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments after scrolling through the changes. I still need to test this out in a notebook though. I'll be sure to leave another review and/or comment after I do.

Thanks for doing this Cal, this will be a very useful feature!

You must also set the `parameter` argument to be one of {"sites", "bonds", "angles", "dihedrals", "impropers"}, not {"all"}
See Notes for more details on what this looks like.
columns : list of str, optional, default=None
List of strings that are attributes of the topology site and can be included as entries in the pandas dataframe.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there is an extra tab here

'publication' will use the default outputs, but remove duplicate values from the dataframes. It adds a column labeled
'Atom Indices' to the `sites` dataframe, which enumerates the indices that the atom_type is a part of.
`remove_duplicates` will use the labels in passed through the columns argument, and remove duplicates rows in the dataframe.
You must also set the `parameter` argument to be one of {"sites", "bonds", "angles", "dihedrals", "impropers"}, not {"all"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm kind of confused about this line, and I don't see anything in the notes about it. Is this saying that remove_duplicates specifically requires that you don't use all

columnsDict = {parameter: columns}
else:
raise ValueError(
f"Please provide formt=['default', 'specific_columns', 'publication']"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small typo here, I think this should say format

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we should add remove_duplicates as an option here correct? I think this error message could be a little more clear. Something like "Available options for format are 'default', 'specific_columns', 'publication', or 'remove_duplicates'."

Also, the f string isn't doing anything here, do we want to print out what they passed to format?

df.drop_duplicates(inplace=True, ignore_index=True)

###############
# END OF FUNCTION
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could take this out.

return list(map(parseFunction, iteritems))


def _pandas_from_parameters(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this function called or used anywhere? Looks like this is the old method from Topology, I'm guessing we can remove it here?

return df


def _parse_dataframe_attrs(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could remove this one as well

@chrisjonesBSU
Copy link
Contributor

Do you have an example of how to see and access the units when using handle_unyts="in_headers"? It doesn't seem to be placing the units in the column names from what I can tell. It does work as expected when using with_data. Also, its not clear what the expected behavior of all_floats is from the doc strings. Does that mean not to include units at all? Maybe this could be more clear by using a different option name (e.g. no_units) or adding an extra sentence to the doc strings.

@chrisjonesBSU
Copy link
Contributor

chrisjonesBSU commented May 13, 2024

A couple more thoughts:

  1. Do we want parameter to be passable as a list of parameters? Right now, it is either pick one, or get all of them. Would we prefer something like this to also work?
top_df_dict = to_dataframeDict(
    topology=top,
    parameter=["bonds", "angles"],
    handle_unyts="with_data"
)

We could handle this similarly to what we do with ignore_params in apply

  1. This is a matter of preference and style, but these changes use a lot of camel case in variable names and the main function name (to_dataframeDict), and I think most of the gmso code base is snake case. Should we try to be consistent here as well?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants