Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[wip] Storage manager update #679

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open

Conversation

richardjgowers
Copy link
Contributor

@richardjgowers richardjgowers commented Jan 4, 2024

[no ci]

fixes #292

Developers certificate of origin

@pep8speaks
Copy link

pep8speaks commented Jan 4, 2024

Hello @richardjgowers! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 821:80: E501 line too long (81 > 79 characters)
Line 1005:37: E127 continuation line over-indented for visual indent
Line 1006:37: E127 continuation line over-indented for visual indent

Comment last updated at 2024-01-22 14:33:52 UTC

currently fails with:

UnsupportedOperation: not readable
Copy link
Member

@dwhswenson dwhswenson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick overview of current status:

Main thing is that I think you're using file.as_path() more frequently than needed. This should be used primarily for avoiding the automatic registration that comes from pathlib.Path(file), especially for files that already exist in the cloud (user story: don't download a big file from the cloud when all I want is its parent directory, so I use .as_path()).

When the file doesn't exist, using just pathlib.Path(file) will mark it as something to transfer to the appropriate external storage. Mainly, you don't want to do this on files that you delete during the unit (although those really should have been in scratch, not in either shared or permanent). I think it will also do something on transfer if file in that was a directory (either warning or info-level log message; can't remember which).

You're also using str(file.as_path()) frequently. (1) I think in most cases you actually want str(pathlib.Path(file)), which invokes registration, so you don't need to do it manually; and (2) we should probably implement StagingPath.__str__(self): return str(pathlib.Path(self)), so that your usage would actually just be str(file).

Comment on lines 583 to 584
shared_basepath: stagingregistry.StagingRegistry,
permanent_basepath: stagingregistry.StagingRegistry,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think these are StagingRegistry ... aren't they StagingPath?

The input to this method was, e.g., root/$DAG_LABEL/shared_$UNIT_LABEL. StagingRegistry would correspond to root/. The full path comes in as a StagingPath, I believe (this is something that changed at one point, because there was no need to expose both classes to protocol developers; it also simplified serialization, IIRC.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm working off these hints here, these three vars are just the context unpacked: https://github.com/OpenFreeEnergy/gufe/blob/staging-execute-dag/gufe/protocols/protocolunit.py#L30

Comment on lines 597 to 602
scratch_basepath: StagingDirectory
Where to store temporary files
shared_basepath : StagingDirectory
Where to run the calculation
permanent_basepath : StagingDirectory
Where to store files that must persist beyond the DAG
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update typing in docstring

@@ -664,11 +664,13 @@ def run(self, *, dry=False, verbose=True,
else:
ffcache = None

ffcache.register()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Problem 1 (easy): this line and the ffcache.as_path() below will fail if ffcache is None.

Problem 2 (more complicated): How is ffcache supposed to work? Isn't that supposed to be something that we can pull down from some previous unit? shared_basepath is supposed to be for a specific execution of a given unit. That directory might not exist until we're about to start executing this unit. How does a file get in there? How are we currently using that? [EDIT: I'm having a call with @IAlibay tomorrow to clarify what is going on with this]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reading through, I think the only thing that would be cached would be the LJ parameters for the small molecules. We can probably remove this?

chk = sim_settings.checkpoint_storage
reporter = multistate.MultiStateReporter(
storage=nc,
storage=str(nc.as_path()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're using this str(file.as_path()) pattern a lot. When you use .as_path(), you're explicitly bypassing the automatic registration mechanism, meaning that you must explicitly register the object.

I'm not sure if this is implemented, but you should be able to instead just use str(file), which should also register the file. Equivalently, str(pathlib.Path(file)) will definitely work, which means you don't need to manually use .register().

You should only need to use .register() on paths that are created outside your control (like the checkpoint and RTA files, which are paths generated inside OpenMMTools and not exposed to us).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah once it's MultiStateReporter wants a string (not path) and the other is going into a subprocess call, so again it's bypassing file-like objects and is a string.

Comment on lines 992 to 993
'nc': nc.as_path(),
'last_checkpoint': checkpoint.as_path(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably better to return the StagingPath itself, which can then be reloaded even if you move the path. Imagine something running on the EC2, where files will get stored on S3. Knowing the path on your EC2 instance isn't particularly useful.

**analyzer.unit_results_dict
}
else:
return {'debug': {'sampler': sampler}}

@staticmethod
def analyse(where) -> dict:
def analyse(where: stagingregistry.StagingRegistry) -> dict:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is probably a StagingPath

# don't put energy analysis in here, it uses the open file reporter
# whereas structural stuff requires that the file handle is closed
ret = subprocess.run(['openfe_analysis', str(where)],
trjdir = (where / '')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the purpose of doing this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the question I was asking on slack. I needed a .as_path() of a directory and StagingRegistry doesn't have this

trjdir = (where / '')
output = (where / 'results.json')
ret = subprocess.run(['openfe_analysis', 'RFE_analysis',
str(trjdir.as_path()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, has the requirement that the files are already in the directory trjdir. Note that this will only work for situations where this analyse method is called on the same computer where the data was generated.

To run it on a different computer would require downloading all files within the directory. The best way to do that now would be to enumerate them here -- in principle there is a way to obtain all files with a given string prefix, but all of that gets deep into the weeds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method is called from _execute() right after run() so it should always be the same machine

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is a little ugly that it needs to know if the trj was stored in scratch/shared/permanent, so there's probably something there I need to fix

plt.close(f2)

f3 = plotting.plot_ligand_RMSD(data['time(ps)'], data['ligand_RMSD'])
f3.savefig(savedir / "ligand_RMSD.png")
f3.savefig(where / "ligand_RMSD.png")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(not really this line, but this is the closest changed line)

You may want to include the figure filenames in the return value, too. Otherwise it will be harder to download them from the cloud.

@richardjgowers richardjgowers added this to the 1.0.0 milestone Jan 19, 2024
@richardjgowers richardjgowers removed this from the 1.0.0 milestone Feb 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

openmm rbfe - use gufe.Context better
3 participants