
fix: work around segfault with >100 jobs in google life sciences backend #1451

Merged
merged 7 commits into snakemake:main on Mar 4, 2022

Conversation

cademirch
Contributor

Description

PR to fix the segfault that occurs with --google-lifesciences on larger workflows (at least 100 jobs). See #1444.

QC

  • The PR contains a test case for the changes or the changes are already covered by an existing test case.
  • The documentation (docs/) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).

@vsoch
Contributor

vsoch commented Mar 2, 2022

@moschetti would you mind taking a look at this and we can talk about what might be best practice? E.g., importing and defining inline vs. defining globally vs. limiting the scope vs. making it customizable, etc.

@johanneskoester changed the title from "refactor _retry_request to avoid segfault" to "fix: work around segfault with >100 jobs in google life sciences backend" on Mar 3, 2022
snakemake/executors/google_lifesciences.py
@@ -897,24 +899,27 @@ def _retry_request(self, request, timeout=2, attempts=3):
        attempts: remaining attempts, throw error when hit 0
        """
        import googleapiclient

        import google.auth
        credentials, project_id = google.auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
Contributor

@vsoch do we authenticate this way in the other places as well? Or can there maybe be an auth object for the entire life sciences executor object instead of creating a special one here in the retry?

Contributor

We go about it in different ways, but I suspect that under the hood the methods are similar, since both use the default application credentials (exported to the environment as GOOGLE_APPLICATION_CREDENTIALS). We use the discovery API to build clients that are attached to the entire class, e.g., here:

def _get_services(self):
    """use the Google Discovery Build to generate API clients
    for Life Sciences, and use the google storage python client
    for storage.
    """
    from googleapiclient.discovery import build as discovery_build
    from oauth2client.client import (
        GoogleCredentials,
        ApplicationDefaultCredentialsError,
    )
    from google.cloud import storage

    # Credentials must be exported to environment
    try:
        creds = GoogleCredentials.get_application_default()
    except ApplicationDefaultCredentialsError as ex:
        log_verbose_traceback(ex)
        raise ex

    # Discovery clients for Google Cloud Storage and Life Sciences API
    self._storage_cli = discovery_build(
        "storage", "v1", credentials=creds, cache_discovery=False
    )
    self._compute_cli = discovery_build(
        "compute", "v1", credentials=creds, cache_discovery=False
    )
    self._api = discovery_build(
        "lifesciences", "v2beta", credentials=creds, cache_discovery=False
    )
    self._bucket_service = storage.Client()

And if you look into the source code, the googleapiclient discovery has that same google.auth.default function here, which will call that same auth function via here. Notably, build takes an http object here, and given that we don't provide one it gets generated; I think it might be saved on the returned service objects here. So what I would do is compare the object there (likely directly, or its credentials) to the one created above. If they are comparable, arguably we could grab this object from one of the service objects generated from the discovery build API. The discovery build APIs are going to have different scopes for their respective endpoints, vs. the one above which covers all of Google Cloud. We could also try the reverse: creating one credentials object and then seeing if we can pass it to the services, although we'd want to double check how the scopes line up. And if neither of those ideas work, we could minimally have this call in the same function so they are in the same place!
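
A minimal sketch of that last "reverse" idea, assuming the broad cloud-platform scope is acceptable for all of the discovery clients (variable names here are illustrative, not taken from the PR):

    import google.auth
    from googleapiclient.discovery import build as discovery_build
    from google.cloud import storage

    # One Application Default Credentials object shared by all clients
    credentials, project_id = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )

    # Pass the same credentials to each discovery client instead of letting
    # each build() resolve its own credentials internally
    storage_cli = discovery_build(
        "storage", "v1", credentials=credentials, cache_discovery=False
    )
    compute_cli = discovery_build(
        "compute", "v1", credentials=credentials, cache_discovery=False
    )
    api = discovery_build(
        "lifesciences", "v2beta", credentials=credentials, cache_discovery=False
    )
    bucket_service = storage.Client(credentials=credentials, project=project_id)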

Contributor

The request object is ultimately generated from one of those services, either directly or as a result of doing something like pipelines.run(), so I'm curious why the request object isn't coming with its own http already (or maybe some of them are, but not consistently?). E.g., pipelines.run() here is the one that was originally giving us trouble:

operation = pipelines.run(parent=self.location, body=body)

Maybe @moschetti knows why these endpoints, generated from a discovery endpoint (e.g., here), don't produce a request with an http object comparable to the one we are generating later in _retry_request?
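
For context, a hedged illustration of the deferred-execution behavior in question: a call like pipelines.run() on a discovery client only builds a googleapiclient HttpRequest; the network call happens later on .execute(), which is what _retry_request wraps:

    # pipelines.run() only builds an HttpRequest; no network call happens yet
    operation = pipelines.run(parent=self.location, body=body)
    # the HTTP call is deferred until the request is executed, roughly:
    #     result = operation.execute(http=some_authorized_http)  # inside _retry_request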

Contributor Author

In my first attempt I tried passing the credentials created here:

creds = GoogleCredentials.get_application_default()

to google_auth_httplib2.AuthorizedHttp in _retry_request, but it did not like those credentials, giving this error: AttributeError: '_JWTAccessCredentials' object has no attribute 'before_request'. That led me to this issue, where I found that google.auth should be used for the credentials.
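
A minimal sketch of that working combination, assuming the request is ultimately executed with a per-call AuthorizedHttp (the helper name execute_with_retry and the variable names are illustrative, not necessarily what the PR uses). google.auth credentials implement before_request(), which AuthorizedHttp requires, while the oauth2client credentials above do not:

    import time
    import httplib2
    import google.auth
    import google_auth_httplib2
    import googleapiclient.errors

    def execute_with_retry(request, timeout=2, attempts=3):
        # Application Default Credentials via google.auth (these have before_request)
        credentials, _ = google.auth.default(
            scopes=["https://www.googleapis.com/auth/cloud-platform"]
        )
        # Fresh httplib2.Http per call, wrapped so it attaches the credentials
        http = google_auth_httplib2.AuthorizedHttp(credentials, http=httplib2.Http())
        try:
            return request.execute(http=http)
        except googleapiclient.errors.HttpError:
            if attempts > 0:
                time.sleep(timeout)
                return execute_with_retry(request, timeout * 2, attempts - 1)
            raise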

@cademirch
Contributor Author

cademirch commented Mar 3, 2022

So I pushed some new commits using Google's recommended way to build the service APIs. It seems to work so far on the test case. Still testing on a more complex workflow.

Sorry if the commits are a bit messy. Still new to contributing and using git.
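
For reference, the pattern the googleapiclient documentation recommends for many concurrent requests looks roughly like the sketch below; that the new commits follow it exactly is an assumption, but it matches the description above. httplib2.Http is not thread-safe, so each request gets its own transport instead of sharing the one stored on the service object:

    import httplib2
    import google.auth
    import google_auth_httplib2
    from googleapiclient.discovery import build

    credentials, project_id = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )
    # Build the service once with explicit credentials
    lifesciences = build(
        "lifesciences", "v2beta", credentials=credentials, cache_discovery=False
    )
    # ...then give every request its own authorized http object instead of
    # sharing the one stored on the service
    request = lifesciences.projects().locations().list(name=f"projects/{project_id}")
    response = request.execute(
        http=google_auth_httplib2.AuthorizedHttp(credentials, http=httplib2.Http())
    )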

@vsoch
Contributor

vsoch commented Mar 3, 2022

No need to apologize, you're doing great.

@sonarcloud
sonarcloud bot commented Mar 3, 2022

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (A)
Vulnerabilities: 0 (A)
Security Hotspots: 0 (A)
Code Smells: 0 (A)

No coverage information
0.0% duplication

johanneskoester (Contributor) left a comment

Awesome work @cademirch! So, am I getting it right that basically using google.auth overall solves the segfault issue then?

@johanneskoester
Contributor

Let us know once your larger tests pass. I will wait with the merging until then.

@cademirch
Contributor Author

cademirch commented Mar 4, 2022

@johanneskoester

Awesome work @cademirch! So, am I getting it right that basically using google.auth overall solves the segfault issue then?

Yes, mostly. Though I believe using google_auth_httplib2 for building the requests is the more critical part of the solution.

Let us know once your larger tests pass. I will wait with the merging until then.

My larger workflow ran overnight without error, though it did end prematurely:

Finished job 572.
162 of 269 steps (60%) done
Complete log: /mnt/ccgp_workflow/.snakemake/log/2022-03-03T210937.628868.snakemake.log

I don't think this premature ending is related to this issue, though; I've seen this behavior before when running in cloud execution mode.

@johanneskoester johanneskoester merged commit 2c0fee2 into snakemake:main Mar 4, 2022
@johanneskoester
Contributor


Mhm, that premature ending is kind of worrying as well of course. I have never seen anything like that. Can you file a separate issue for that or is it already open?

@cademirch
Contributor Author


Yeah, I have only seen it a few times. I think it could be related to --until, which I did use when testing my large workflow with this PR. I'll look into it and open an issue.
