fix: issue with aggregated checkpoint output on GLS #2108

aryarm · 2023-02-12T22:29:06Z

resolves #2021

Description

PR #1294 (from a year ago) added a simple test for checkpoints on the cloud.

This PR adds tests for a more complicated example where the output of the checkpoint is now aggregated over wildcards. In this case, we are trying to reproduce the behavior in #2021, in which Snakemake might be

rebuilding the DAG when it doesn't need to be and
applying the remote prefix to the checkpoint output more than once (as it did in default remote prefix duplicated for checkpoint output on cloud #574)

Note that this PR does not yet contain a fix for the strange behavior. Our hope is just to reproduce things with the test suite at the moment. As noted in #574 (comment), it can be difficult to debug issues on the cloud because Snakemake redeploys a different version of itself in each cloud instance. But labeling this PR with update-container-image allows us to deploy a temporary version of the code from this PR.

QC

The PR contains a test case for the changes or the changes are already covered by an existing test case.
The documentation (docs/) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).
The docs do not need updating in this case, since this PR simply adds tests and fixes a bug.

github-actions · 2023-02-12T22:29:53Z

Please format your code with black: black snakemake tests/*.py.

sonarcloud · 2023-02-12T22:33:08Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
No Duplication information

aryarm · 2023-02-13T17:57:26Z

hmm looks like the tests passed

So at first glance, it seems like we might have failed to reproduce the issue. But then I looked through the logs and couldn't find any sign that our tests were executed.

@johanneskoester, why is this line commented out on main? That seems to be the reason it didn't run.

snakemake/.github/workflows/main.yml

Line 231 in 2a4b9b6

# pytest -s -v -x tests/test_google_lifesciences.py

cademirch · 2023-02-13T20:21:06Z

Looking at the comments in #574, I believe @vsoch's proposed solution to check for the remote prefix in self.workflow.apply_default_remote(value) would probably the way to go in order to fix the double prefix in this scenario. Since in this case the user is applying the default prefix in the input function by using an absolute path, I'm not sure I can think of another way.

cademirch · 2023-04-14T07:58:00Z

It has been a while since I started digging into this, and I think I've come up with a candidate solution.

To use a checkpoint's output we use this function:

snakemake/snakemake/checkpoints.py

Line 25 in bdde680

def get(self, **wildcards):

Which returns the checkpoint's output for a set of wildcards if it has been produced. However, the function does not check if the checkpoint output exists remotely, and this is why when the DAG is built on the cloud node, it includes the checkpoint before the target job.

My proposed solution is to check if the checkpoint output exists remotely and if so download it, thus allowing the workflow to be run only using the target rule.

Proposed solution:

def get(self, **wildcards):
        missing = self.rule.wildcard_names.difference(wildcards.keys())
        if missing:
            raise WorkflowError(
                "Missing wildcard values for {}".format(", ".join(missing))
            )

        output, _ = self.rule.expand_output(wildcards)
        if self.checkpoints.future_output is not None:
            for iofile in output:
                if iofile.exists_remote and not iofile.exists_local:
                    iofile.download_from_remote()
                    return CheckpointJob(self.rule, output)
                if iofile in self.checkpoints.future_output:
                    break
                if not iofile.exists and not iofile.is_temp:
                    break
            else:
                return CheckpointJob(self.rule, output)
        raise IncompleteCheckpointException(self.rule, checkpoint_target(output[0]))

I have tested this locally by keeping the checkpoint output files remotely and it seems to work well. Would appreciate any feedback @aryarm @johanneskoester. Let me know if I can provide any more info or clarity!

I will commit this change to this PR soon.

cademirch · 2023-04-17T22:43:58Z

See #2220, @aryarm if you could close this PR in favor for that one, that would be great, thanks!

aryarm · 2023-04-17T22:59:41Z

Sure thing! Thanks for taking all of this on, @cademirch . Looking forward to seeing the fix!

reproduce issue 2021 with MRE

1be511e

aryarm added update-container-image WIP Work in progress. Not meant for merging. labels Feb 12, 2023

refmt with black

5eb35ca

aryarm mentioned this pull request Feb 12, 2023

Strange behavior when using checkpoints in cloud execution #2021

Closed

aryarm closed this Apr 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: issue with aggregated checkpoint output on GLS #2108

fix: issue with aggregated checkpoint output on GLS #2108

aryarm commented Feb 12, 2023 •

edited

github-actions bot commented Feb 12, 2023

sonarcloud bot commented Feb 12, 2023

aryarm commented Feb 13, 2023

cademirch commented Feb 13, 2023

cademirch commented Apr 14, 2023

cademirch commented Apr 17, 2023

aryarm commented Apr 17, 2023 •

edited

fix: issue with aggregated checkpoint output on GLS #2108

fix: issue with aggregated checkpoint output on GLS #2108

Conversation

aryarm commented Feb 12, 2023 • edited

Description

QC

github-actions bot commented Feb 12, 2023

sonarcloud bot commented Feb 12, 2023

aryarm commented Feb 13, 2023

cademirch commented Feb 13, 2023

cademirch commented Apr 14, 2023

cademirch commented Apr 17, 2023

aryarm commented Apr 17, 2023 • edited

aryarm commented Feb 12, 2023 •

edited

aryarm commented Apr 17, 2023 •

edited