`H5MDReader/Writer` undocumented limitations #4561

edisj · 2024-04-08T00:06:12Z

This is following up on the points brought up by @ljwoods2 in discussion #4559

The H5MD (https://www.nongnu.org/h5md/h5md.html#h5md-format-specification-version-work-version) format is very flexible in terms of which data the user decides to store from any molecular simulation engine. H5MDReader and H5MDWriter currently handle only a restricted subset of all possible h5md files, and as @ljwoods2 pointed out, this isn't documented/handled clearly.

The two undocumented limitations are:

the /particles group (https://www.nongnu.org/h5md/h5md.html#particles-group) in an h5md file can have 1 or more subgroups that represent any arbitrary selection of atoms from a simulation, however in the current implementation of H5MDReader, it only reads in the first group found:

mdanalysis/package/MDAnalysis/coordinates/H5MD.py

Lines 608 to 611 in 73acc9b

    
           # pulls first key out of 'particles' 
        
           # allows for arbitrary name of group1 in 'particles' 
        
           self._particle_group = self._file['particles'][ 
        
               list(self._file['particles'])[0]]

the ~/step and ~/time datasets (https://www.nongnu.org/h5md/h5md.html#explicit-step-and-time-storage) for any H5MD element can be unique to that element, e.g. one can write every integration timestep out for position data, but only write every other timestep out for velocity data. Currently, H5MDReader assumes all data in the input h5md file share a common step and time:

mdanalysis/package/MDAnalysis/coordinates/H5MD.py

Lines 681 to 693 in 73acc9b

    
           # pulls 'time' and 'step' out of first available parent group 
        
           for name, value in self._has.items(): 
        
               if value: 
        
                   if 'time' in self._particle_group[name]: 
        
                       self.ts.time = self._particle_group[name][ 
        
                           'time'][self._frame] 
        
                       break 
        
           for name, value in self._has.items(): 
        
               if value: 
        
                   if 'step' in self._particle_group[name]: 
        
                       self.ts.data['step'] = self._particle_group[name][ 
        
                           'step'][self._frame] 
        
                       break

Expected behavior

H5MDReader and H5MDWriter should catch these cases and raise an appropriate error

Actual behavior

For limitation 1: the reader just reads the positions, velocities, and forces of the first trajectory found and does not alert the user
For limitation 2: the reader just reads the first step and time found and assumes the same value for other datasets

immediate solution

As per @orbeckst 's suggestion here #4559 (comment), the an immediate solution is to fail explicitly in these cases and document the limitation.

future solution?

Using an H5MD file to store an ensemble of trajectories actually sounds very useful. Could an extra keyword argument like which_trajectory or trajectory_name or something be used here to specify which group in /particles to load for the universe? An exception (or warning) can be thrown if the reader detects multiple trajectories and no argument passed, otherwise it would just load the first group in /particles if len(f['particles']) == 1 like before
I think self.has_positions, self.has_velocities, etc. must be determined and set dynamically on a per-timestep basis if positions, velocities, and forces can be arbitrarily read for any timestep, unless the access pattern is obvious in cases where it's like "velocities on odd frames only". Would that mess with the performance of the buffer arrays in memory for a Timestep? From a quick glance it looks like it shouldn't be an issue because the Timestep will just throw a NoDataError but I'm not sure

The text was updated successfully, but these errors were encountered:

orbeckst · 2024-04-09T01:20:51Z

The TRR reader handles arbitrary occurrences of x, v, f in a Timestep. However, it's quite a bit slower that others (see Fig 9 in 10.25080/majora-1b6fd038-005). I don't know if the lower performance is (partially) due to checking x/v/f or due to other reasons.

orbeckst · 2024-05-30T22:44:15Z

With GROMACS apparently committed to supporting H5MD, including the feature that x,v,f can be stored at separate, arbitrary steps, we should make sure that our implementation can do that properly.

edisj · 2024-05-31T00:33:08Z

@orbeckst can I ask you about the multiple groups in /particles/ -

is there any case that makes sense for a mda.Universe to load multiple trajectories from multiple particle groups in /particles/? I can't think of how that would work... A universe should only contain a single trajectory, right?

If that's the case, then if a *.h5md file in the wild does contain multiple groups in /particles, the best way I can think of to handle which trajectory MDAnalysis chooses to load into ts.positions is by the user explicitly giving the group name as an argument.

The only other solutions I can think of is to fail right away, which I don't like because it's less flexible with the H5MD format, or to implicitly (and silently?) load only one of the groups, which I like even less...

orbeckst · 2024-05-31T04:55:57Z

For right now, having a kwarg group for choosing the particle group such as group="<group 1>" makes most sense to me. group=None could be choosing the first one, as is currently done.

Later one could think about ways to combine groups to match the topology, although, given that these can be sampled at different times, this will likely be messy.

So I'd just start with giving the user the option to pick one of the groups.

orbeckst · 2024-05-31T04:59:28Z

Just as a note: The sub kwarg of the XTCReader is describing somewhat related functionality, namely pulling a subsystem matching the topology out of a trajectory but sub requires a raw index array whereas group is a defined name (str).

orbeckst added Component-Docs Format-H5MD hdf5-based H5MD trajectory format labels Apr 9, 2024

edisj mentioned this issue May 30, 2024

H5MD Can not read observables/<group>/<observable>/... #4598

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`H5MDReader/Writer` undocumented limitations #4561

`H5MDReader/Writer` undocumented limitations #4561

edisj commented Apr 8, 2024 •

edited

orbeckst commented Apr 9, 2024

orbeckst commented May 30, 2024

edisj commented May 31, 2024

orbeckst commented May 31, 2024

orbeckst commented May 31, 2024

H5MDReader/Writer undocumented limitations #4561

H5MDReader/Writer undocumented limitations #4561

Comments

edisj commented Apr 8, 2024 • edited

Expected behavior

Actual behavior

immediate solution

future solution?

orbeckst commented Apr 9, 2024

orbeckst commented May 30, 2024

edisj commented May 31, 2024

orbeckst commented May 31, 2024

orbeckst commented May 31, 2024

`H5MDReader/Writer` undocumented limitations #4561

`H5MDReader/Writer` undocumented limitations #4561

edisj commented Apr 8, 2024 •

edited