Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: issue with aggregated checkpoint output on GLS #2108

Closed
wants to merge 2 commits into from

Conversation

aryarm
Copy link
Member

@aryarm aryarm commented Feb 12, 2023

resolves #2021

Description

PR #1294 (from a year ago) added a simple test for checkpoints on the cloud.

This PR adds tests for a more complicated example where the output of the checkpoint is now aggregated over wildcards. In this case, we are trying to reproduce the behavior in #2021, in which Snakemake might be

  1. rebuilding the DAG when it doesn't need to be and
  2. applying the remote prefix to the checkpoint output more than once (as it did in default remote prefix duplicated for checkpoint output on cloud #574)

Note that this PR does not yet contain a fix for the strange behavior. Our hope is just to reproduce things with the test suite at the moment. As noted in #574 (comment), it can be difficult to debug issues on the cloud because Snakemake redeploys a different version of itself in each cloud instance. But labeling this PR with update-container-image allows us to deploy a temporary version of the code from this PR.

QC

  • The PR contains a test case for the changes or the changes are already covered by an existing test case.
  • The documentation (docs/) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).
    The docs do not need updating in this case, since this PR simply adds tests and fixes a bug.

@aryarm aryarm added update-container-image WIP Work in progress. Not meant for merging. labels Feb 12, 2023
@github-actions
Copy link
Contributor

Please format your code with black: black snakemake tests/*.py.

@sonarcloud
Copy link

sonarcloud bot commented Feb 12, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

@aryarm
Copy link
Member Author

aryarm commented Feb 13, 2023

hmm looks like the tests passed

So at first glance, it seems like we might have failed to reproduce the issue. But then I looked through the logs and couldn't find any sign that our tests were executed.

@johanneskoester, why is this line commented out on main? That seems to be the reason it didn't run.

# pytest -s -v -x tests/test_google_lifesciences.py

@cademirch
Copy link
Contributor

Looking at the comments in #574, I believe @vsoch's proposed solution to check for the remote prefix in self.workflow.apply_default_remote(value) would probably the way to go in order to fix the double prefix in this scenario. Since in this case the user is applying the default prefix in the input function by using an absolute path, I'm not sure I can think of another way.

@cademirch
Copy link
Contributor

It has been a while since I started digging into this, and I think I've come up with a candidate solution.

To use a checkpoint's output we use this function:

def get(self, **wildcards):

Which returns the checkpoint's output for a set of wildcards if it has been produced. However, the function does not check if the checkpoint output exists remotely, and this is why when the DAG is built on the cloud node, it includes the checkpoint before the target job.

My proposed solution is to check if the checkpoint output exists remotely and if so download it, thus allowing the workflow to be run only using the target rule.

Proposed solution:

def get(self, **wildcards):
        missing = self.rule.wildcard_names.difference(wildcards.keys())
        if missing:
            raise WorkflowError(
                "Missing wildcard values for {}".format(", ".join(missing))
            )

        output, _ = self.rule.expand_output(wildcards)
        if self.checkpoints.future_output is not None:
            for iofile in output:
                if iofile.exists_remote and not iofile.exists_local:
                    iofile.download_from_remote()
                    return CheckpointJob(self.rule, output)
                if iofile in self.checkpoints.future_output:
                    break
                if not iofile.exists and not iofile.is_temp:
                    break
            else:
                return CheckpointJob(self.rule, output)
        raise IncompleteCheckpointException(self.rule, checkpoint_target(output[0]))

I have tested this locally by keeping the checkpoint output files remotely and it seems to work well. Would appreciate any feedback @aryarm @johanneskoester. Let me know if I can provide any more info or clarity!

I will commit this change to this PR soon.

@cademirch
Copy link
Contributor

See #2220, @aryarm if you could close this PR in favor for that one, that would be great, thanks!

@aryarm
Copy link
Member Author

aryarm commented Apr 17, 2023

Sure thing! Thanks for taking all of this on, @cademirch . Looking forward to seeing the fix!

@aryarm aryarm closed this Apr 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
update-container-image WIP Work in progress. Not meant for merging.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Strange behavior when using checkpoints in cloud execution
2 participants