
Existing intermediate outputs are mishandled when reading from Google Cloud Storage #1460

Closed
CowanCS1 opened this issue Mar 7, 2022 · 11 comments

@CowanCS1
Contributor

CowanCS1 commented Mar 7, 2022

Snakemake version
7.1, 7.0.1

Describe the bug
When an existing intermediate output is present in the GS remote location, it is not recognized as being present, possibly on purpose due to versioning considerations. When the rule that produces the file is run, it correctly creates the files and uploads them to the remote, but then raises a MissingOutputException, not recognizing that the files are present.

This minimal example is a partial reproduction of the issue. I've seen cases where the pipeline is stuck in an infinite loop or otherwise enters a state where it can be run repeatedly, correctly completing and uploading the outputs each time, without the outputs ever being recognized as present. This minimal example is partial in that it generates the bug on the first run, but then correctly recognizes the intermediates on the second run. I believe the error produced here also underlies the more severe cases.

The error appears to be restricted to intermediate rules that have a list of outputs. In all cases I've seen, it can be circumvented by manually deleting the outputs.

Logs

Minimal example
Instructions

  • Create the environment and Snakefile
  • Create the remote storage location
  • Run snakemake with the command below
  • The pipeline should run cleanly
  • Delete the output files test.txt, blob.txt and pretest.txt from the remote, leaving the file preblob.txt (a sketch for this step follows the list)
  • Run the pipeline to generate the MissingOutputException.
  • Go to GCS and note two important things: preblob.txt has been overwritten, so it is not missing in any sense, and although the log reports that pretest.txt was removed as an output of the failed job, it is actually still present. These are two cases in which operations performed on GS files cause Snakemake to lose track of the files.
  • Running the pipeline a third time will complete the pipeline.
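
For the deletion step, a minimal sketch using the google-cloud-storage Python client (bucket and file names taken from this example; gsutil rm on the same paths works just as well):

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('cowan_snakemake')

# Remove the downstream outputs and pretest.txt, leaving preblob.txt in place
for fname in ['test.txt', 'blob.txt', 'pretest.txt']:
    blob = bucket.blob(fname)
    if blob.exists():
        blob.delete()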

Snakefile

# bucket name; must match --default-remote-prefix in the command below
bucket = 'cowan_snakemake'

rule all:
    input: ["blob.txt", "test.txt"]

rule intermediate:
    input: ["preblob.txt", "pretest.txt"]
    output: ["blob.txt", "test.txt"]
    shell: """
        cp {bucket}/preblob.txt {bucket}/blob.txt
        cp {bucket}/pretest.txt {bucket}/test.txt
        """

rule create:
    output: ["preblob.txt", "pretest.txt"]
    shell: '''
        echo "test file" > {bucket}/preblob.txt
        echo "test file" > {bucket}/pretest.txt
        '''

Command to execute

snakemake --google-lifesciences --default-remote-prefix cowan_snakemake --use-conda --google-lifesciences-regions europe-west1 -j 1 -k --ri --google-lifesciences-location europe-west2 --latency-wait 10

Output of the rule which produces the exception

[Mon Mar  7 10:51:23 2022]
INFO:snakemake.logging:[Mon Mar  7 10:51:23 2022]
rule create:
    output: cowan_snakemake/preblob.txt, cowan_snakemake/pretest.txt
    jobid: 2
    resources: tmpdir=/tmp
INFO:snakemake.logging:rule create:
    output: cowan_snakemake/preblob.txt, cowan_snakemake/pretest.txt
    jobid: 2
    resources: tmpdir=/tmp

INFO:snakemake.logging:

Get status with:
gcloud config set project roska-lab-gcp
gcloud beta lifesciences operations describe projects/roska-lab-gcp/locations/europe-west2/operations/5626651345779294243
gcloud beta lifesciences operations list
Logs will be saved to: cowan_snakemake/cowan_snakemake/google-lifesciences-logs

WARNING:snakemake.logging:Get status with:
gcloud config set project roska-lab-gcp
gcloud beta lifesciences operations describe projects/roska-lab-gcp/locations/europe-west2/operations/5626651345779294243
gcloud beta lifesciences operations list
Logs will be saved to: cowan_snakemake/cowan_snakemake/google-lifesciences-logs

Waiting at most 10 seconds for missing files.
WARNING:snakemake.logging:Waiting at most 10 seconds for missing files.
MissingOutputException in line 11 of /home/cameron.cowan/data/snakemake_test/Snakefile:
Job Missing files after 10 seconds:
cowan_snakemake/preblob.txt
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 2 completed successfully, but some output files are missing. 2
ERROR:snakemake.logging:MissingOutputException in line 11 of /home/cameron.cowan/data/snakemake_test/Snakefile:
Job Missing files after 10 seconds:
cowan_snakemake/preblob.txt
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 2 completed successfully, but some output files are missing. 2
Removing output files of failed job create since they might be corrupted:
cowan_snakemake/pretest.txt
WARNING:snakemake.logging:Removing output files of failed job create since they might be corrupted:
cowan_snakemake/pretest.txt
Job failed, going on with independent jobs.
WARNING:snakemake.logging:Job failed, going on with independent jobs.
shutdown

Additional info
I suspect the issue here is related to the blob handles first being created in a way that doesn't check whether the file exists, and then not being correctly updated within the list of output objects. Here's an illustration of how the blob object changes over certain actions. Note in particular that 1) get_blob returns None only if the file actually doesn't exist, whereas the directly constructed blob reports a generation of None in both cases, and 2) the handles are updated even in the event of an overwrite, which is handled correctly here, but I think the observed behavior could be caused by accidentally passing around and updating a copy of the blob object.

Code

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('cowan_snakemake')

fname = 'test.txt'

print('Creating local blob object...')
blob = bucket.blob(fname)
print('Our blob object: %s'%blob)
print('Fresh blob object: %s\n'%bucket.get_blob(fname))
      
print('Uploading file to blob...')
blob.upload_from_filename(fname)
print('Blob object: %s'%blob)
print('Fresh blob object: %s\n'%bucket.get_blob(fname))

print('Overwriting blob with new file...')
blob.upload_from_filename(fname)
print('Blob object: %s'%blob)
print('Fresh blob object: %s\n'%bucket.get_blob(fname))


print('Creating new local blob object...')
blob = bucket.blob(fname)
print('Our blob object: %s'%blob)
print('Fresh blob object: %s\n'%bucket.get_blob(fname))

Output

Creating local blob object...
Our blob object: <Blob: cowan_snakemake, test.txt, None>
Fresh blob object: None

Uploading file to blob...
Blob object: <Blob: cowan_snakemake, test.txt, 1646648186177456>
Fresh blob object: <Blob: cowan_snakemake, test.txt, 1646648186177456>

Overwriting blob with new file...
Blob object: <Blob: cowan_snakemake, test.txt, 1646648186253457>
Fresh blob object: <Blob: cowan_snakemake, test.txt, 1646648186253457>

Creating new local blob object...
Our blob object: <Blob: cowan_snakemake, test.txt, None>
Fresh blob object: <Blob: cowan_snakemake, test.txt, 1646648186253457>
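
For comparison, a freshly constructed local handle can be brought up to date explicitly; a minimal sketch, continuing from the session above and assuming test.txt still exists in the bucket:

# Continuing from the session above: the new local handle starts empty...
blob = bucket.blob(fname)
print('Our blob object: %s' % blob)        # generation is None, nothing fetched yet

# ...but an explicit reload fetches the metadata from GCS
blob.reload()
print('Reloaded blob object: %s' % blob)   # generation now matches get_blob()

# exists() issues its own request, so it is accurate even on a fresh handle
print('Exists: %s' % blob.exists())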
CowanCS1 added the bug (Something isn't working) label on Mar 7, 2022
@cademirch
Contributor

Related to #1405. I've experienced this quite a bit. Thank you for taking the time to reproduce it and make a MRE.

An observation I'll add is that I've experienced this without having to delete any files. For example, the pipeline will be running fine and one job will hit this issue on its own, with no intervention. The files are present in GCS, but Snakemake thinks they are missing. If I have specified --restart-times, the workflow enters the loop described in the OP.

Then, when I rerun the workflow, those files are marked as 'incomplete' even though they are not: the logs show that the job which produced them finished just fine.

@johanneskoester
Contributor

@CowanCS1 @cademirch please have a look at PR #1531. I am trying to reproduce your issue there.

@johanneskoester
Contributor

johanneskoester commented Mar 30, 2022

@vsoch, based on the test case in PR #1531, which fails as expected, do you think you could have a look and see whether you can find the issue? Feel free to try the fix directly in that PR if you have an idea.

@CowanCS1
Contributor Author

> @CowanCS1 @cademirch please have a look at PR #1531. I am trying to reproduce your issue there.

@johanneskoester Looks correct from my side.

@vsoch
Contributor

vsoch commented Mar 30, 2022

I tested a fix that would try to update the cache when a new file is uploaded (#1541), but got the same error. I'm not sure how I can debug this just by looking at it; I'd need to be able to see / interact with the project to understand what is going on.

@vsoch
Contributor

vsoch commented Mar 30, 2022

@CowanCS1 if you have a Google project to test with (I do not), can you give me more insight as to where you think the issue is? I tried adding a call to update the cache after upload, but we get the same error. I feel kind of helpless about reproducing and debugging this. Stanford had easy access to GCP; my new employer doesn't even use it 😭

@johanneskoester
Contributor

@vsoch I can also give you access to the snakemake-testing project on GCP. Would that help? We run on the free budget, but the workflows are minimal, so it should not be a big issue, right?

@vsoch
Contributor

vsoch commented Mar 31, 2022

Yes hugely! I would just interactively step through the test you created, nothing big.

@johanneskoester
Contributor

Can you send me your google account email address via mail?

@vsoch
Contributor

vsoch commented Apr 2, 2022

ok fixed! #1541

johanneskoester added a commit that referenced this issue Apr 2, 2022
* chore: add testcase for issue #1460

* fix expected results

* try force update of inventory after upload

Signed-off-by: vsoch <vsoch@users.noreply.github.com>

* run gls tests first

Signed-off-by: vsoch <vsoch@users.noreply.github.com>

* pushing test fix

the issue is that the blob object held by a GS remote object can go stale, returning False for blob.exists() when the file clearly exists! To fix this we need to do an additional self.update_blob() and then return the exists check again. I am not sure if this can be made more efficient by only checking under certain conditions, but since it seems likely we cannot perfectly know when the blob has gone stale, the sure way is to always update.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>

* wrong version of black

Signed-off-by: vsoch <vsoch@users.noreply.github.com>

* remove unused remote inventory, was just testing!

Signed-off-by: vsoch <vsoch@users.noreply.github.com>

Co-authored-by: Johannes Köster <johannes.koester@tu-dortmund.de>
Co-authored-by: vsoch <vsoch@users.noreply.github.com>
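
The pattern described in the commit message, refreshing the handle and then re-checking, roughly looks like the following minimal sketch. The GSObject class and its layout here are illustrative only, not the actual snakemake.remote.GS implementation, and the code assumes Google credentials are configured:

from google.cloud import storage

class GSObject:
    """Illustrative stand-in for a remote object that holds a blob handle."""

    def __init__(self, bucket_name, key):
        self.client = storage.Client()
        self.bucket = self.client.bucket(bucket_name)
        self.key = key
        self.blob = self.bucket.blob(key)

    def update_blob(self):
        # Re-fetch the handle so cached metadata cannot be stale;
        # get_blob() returns None if the object does not exist.
        self.blob = self.bucket.get_blob(self.key) or self.bucket.blob(self.key)

    def exists(self):
        if self.blob.exists():
            return True
        # The handle may have gone stale: refresh it and check once more.
        self.update_blob()
        return self.blob.exists()

The commit message suggests always updating rather than only re-checking after a failed exists() call; the conditional re-check above is just one way of sketching the same idea.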
@johanneskoester
Contributor

Fixed by @vsoch, our hero!
