Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

H5MDReader/Writer undocumented limitations #4561

Open
edisj opened this issue Apr 8, 2024 · 5 comments
Open

H5MDReader/Writer undocumented limitations #4561

edisj opened this issue Apr 8, 2024 · 5 comments
Labels
Component-Docs Format-H5MD hdf5-based H5MD trajectory format

Comments

@edisj
Copy link
Contributor

edisj commented Apr 8, 2024

This is following up on the points brought up by @ljwoods2 in discussion #4559

The H5MD (https://www.nongnu.org/h5md/h5md.html#h5md-format-specification-version-work-version) format is very flexible in terms of which data the user decides to store from any molecular simulation engine. H5MDReader and H5MDWriter currently handle only a restricted subset of all possible h5md files, and as @ljwoods2 pointed out, this isn't documented/handled clearly.

The two undocumented limitations are:

  1. the /particles group (https://www.nongnu.org/h5md/h5md.html#particles-group) in an h5md file can have 1 or more subgroups that represent any arbitrary selection of atoms from a simulation, however in the current implementation of H5MDReader, it only reads in the first group found:

    # pulls first key out of 'particles'
    # allows for arbitrary name of group1 in 'particles'
    self._particle_group = self._file['particles'][
    list(self._file['particles'])[0]]

  2. the ~/step and ~/time datasets (https://www.nongnu.org/h5md/h5md.html#explicit-step-and-time-storage) for any H5MD element can be unique to that element, e.g. one can write every integration timestep out for position data, but only write every other timestep out for velocity data. Currently, H5MDReader assumes all data in the input h5md file share a common step and time:

    # pulls 'time' and 'step' out of first available parent group
    for name, value in self._has.items():
    if value:
    if 'time' in self._particle_group[name]:
    self.ts.time = self._particle_group[name][
    'time'][self._frame]
    break
    for name, value in self._has.items():
    if value:
    if 'step' in self._particle_group[name]:
    self.ts.data['step'] = self._particle_group[name][
    'step'][self._frame]
    break

Expected behavior

H5MDReader and H5MDWriter should catch these cases and raise an appropriate error

Actual behavior

For limitation 1: the reader just reads the positions, velocities, and forces of the first trajectory found and does not alert the user
For limitation 2: the reader just reads the first step and time found and assumes the same value for other datasets

immediate solution

As per @orbeckst 's suggestion here #4559 (comment), the an immediate solution is to fail explicitly in these cases and document the limitation.

future solution?

  1. Using an H5MD file to store an ensemble of trajectories actually sounds very useful. Could an extra keyword argument like which_trajectory or trajectory_name or something be used here to specify which group in /particles to load for the universe? An exception (or warning) can be thrown if the reader detects multiple trajectories and no argument passed, otherwise it would just load the first group in /particles if len(f['particles']) == 1 like before

  2. I think self.has_positions, self.has_velocities, etc. must be determined and set dynamically on a per-timestep basis if positions, velocities, and forces can be arbitrarily read for any timestep, unless the access pattern is obvious in cases where it's like "velocities on odd frames only". Would that mess with the performance of the buffer arrays in memory for a Timestep? From a quick glance it looks like it shouldn't be an issue because the Timestep will just throw a NoDataError but I'm not sure

@orbeckst orbeckst added Component-Docs Format-H5MD hdf5-based H5MD trajectory format labels Apr 9, 2024
@orbeckst
Copy link
Member

orbeckst commented Apr 9, 2024

The TRR reader handles arbitrary occurrences of x, v, f in a Timestep. However, it's quite a bit slower that others (see Fig 9 in 10.25080/majora-1b6fd038-005). I don't know if the lower performance is (partially) due to checking x/v/f or due to other reasons.

@orbeckst
Copy link
Member

With GROMACS apparently committed to supporting H5MD, including the feature that x,v,f can be stored at separate, arbitrary steps, we should make sure that our implementation can do that properly.

@edisj
Copy link
Contributor Author

edisj commented May 31, 2024

@orbeckst can I ask you about the multiple groups in /particles/ -

is there any case that makes sense for a mda.Universe to load multiple trajectories from multiple particle groups in /particles/? I can't think of how that would work... A universe should only contain a single trajectory, right?

If that's the case, then if a *.h5md file in the wild does contain multiple groups in /particles, the best way I can think of to handle which trajectory MDAnalysis chooses to load into ts.positions is by the user explicitly giving the group name as an argument.

The only other solutions I can think of is to fail right away, which I don't like because it's less flexible with the H5MD format, or to implicitly (and silently?) load only one of the groups, which I like even less...

@orbeckst
Copy link
Member

For right now, having a kwarg group for choosing the particle group such as group="<group 1>" makes most sense to me. group=None could be choosing the first one, as is currently done.

Later one could think about ways to combine groups to match the topology, although, given that these can be sampled at different times, this will likely be messy.

So I'd just start with giving the user the option to pick one of the groups.

@orbeckst
Copy link
Member

Just as a note: The sub kwarg of the XTCReader is describing somewhat related functionality, namely pulling a subsystem matching the topology out of a trajectory but sub requires a raw index array whereas group is a defined name (str).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component-Docs Format-H5MD hdf5-based H5MD trajectory format
Projects
None yet
Development

No branches or pull requests

2 participants