New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move pandas dataframe handling to external convert_dataframe module #814
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments after scrolling through the changes. I still need to test this out in a notebook though. I'll be sure to leave another review and/or comment after I do.
Thanks for doing this Cal, this will be a very useful feature!
You must also set the `parameter` argument to be one of {"sites", "bonds", "angles", "dihedrals", "impropers"}, not {"all"} | ||
See Notes for more details on what this looks like. | ||
columns : list of str, optional, default=None | ||
List of strings that are attributes of the topology site and can be included as entries in the pandas dataframe. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there is an extra tab here
'publication' will use the default outputs, but remove duplicate values from the dataframes. It adds a column labeled | ||
'Atom Indices' to the `sites` dataframe, which enumerates the indices that the atom_type is a part of. | ||
`remove_duplicates` will use the labels in passed through the columns argument, and remove duplicates rows in the dataframe. | ||
You must also set the `parameter` argument to be one of {"sites", "bonds", "angles", "dihedrals", "impropers"}, not {"all"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm kind of confused about this line, and I don't see anything in the notes about it. Is this saying that remove_duplicates
specifically requires that you don't use all
columnsDict = {parameter: columns} | ||
else: | ||
raise ValueError( | ||
f"Please provide formt=['default', 'specific_columns', 'publication']" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small typo here, I think this should say format
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, we should add remove_duplicates
as an option here correct? I think this error message could be a little more clear. Something like "Available options for format are 'default', 'specific_columns', 'publication', or 'remove_duplicates'."
Also, the f
string isn't doing anything here, do we want to print out what they passed to format
?
df.drop_duplicates(inplace=True, ignore_index=True) | ||
|
||
############### | ||
# END OF FUNCTION |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could take this out.
return list(map(parseFunction, iteritems)) | ||
|
||
|
||
def _pandas_from_parameters( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this function called or used anywhere? Looks like this is the old method from Topology
, I'm guessing we can remove it here?
return df | ||
|
||
|
||
def _parse_dataframe_attrs( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could remove this one as well
Do you have an example of how to see and access the units when using |
A couple more thoughts:
We could handle this similarly to what we do with
|
This PR looks to improve the handling for converting a topology to a dataframe. This currently lives as a method for topology. It is now being moved to a convert_dataframe.py module. A few different formats are available which give some nice default ways to view a topology. Notably, we have the formats:
-
publication
which gives all the parameter values you would want to have in a table for publication. This also removes duplicates so each parameter is only listed once.-
default
some default values which are nice to have-
remove_duplicates
which allows you to get a smaller dataframe with duplicate rows removed.-
specific_columns
Allows the user to specify what they want in the dataframe.There is also an added function that allows you to generate dataframes that cover the parameters for a set of topologies.
Finally, there will be some function that prints the dataframes with the rdkit mols which are labeled to match the dataframes.
TODO Checklist: