
fix: work around segfault with >100 jobs in google life sciences backend #1451

Merged
merged 7 commits into snakemake:main on Mar 4, 2022

Conversation

cademirch
Contributor

Description

PR to fix the segfault that occurs with --google-lifesciences on larger workflows (at least 100 jobs). See #1444.

QC

  • The PR contains a test case for the changes or the changes are already covered by an existing test case.
  • The documentation (docs/) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).

@vsoch
Contributor

vsoch commented Mar 2, 2022

@moschetti would you mind taking a look at this and we can talk about what might be best practice? E.g., importing and defining inline vs. defining globally vs. limiting the scope vs. making it customizable, etc.

@johanneskoester changed the title from "refactor _retry_request to avoid segfault" to "fix: work around segfault with >100 jobs in google life sciences backend" on Mar 3, 2022
snakemake/executors/google_lifesciences.py
@@ -897,24 +899,27 @@ def _retry_request(self, request, timeout=2, attempts=3):
        attempts: remaining attempts, throw error when hit 0
        """
        import googleapiclient

        import google.auth
        credentials, project_id = google.auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
Contributor

@vsoch do we authenticate this way in the other places as well? Or can there maybe be an auth object for the entire life sciences executor object instead of creating a special one here in the retry?

Contributor

We go about it in different ways, but I suspect that under the hood the methods are similar, since both use the default application credentials (exported to the environment as GOOGLE_APPLICATION_CREDENTIALS). We use the discovery API to build clients that are attached to the entire class, e.g., here:

def _get_services(self):
    """use the Google Discovery Build to generate API clients
    for Life Sciences, and use the google storage python client
    for storage.
    """
    from googleapiclient.discovery import build as discovery_build
    from oauth2client.client import (
        GoogleCredentials,
        ApplicationDefaultCredentialsError,
    )
    from google.cloud import storage

    # Credentials must be exported to environment
    try:
        creds = GoogleCredentials.get_application_default()
    except ApplicationDefaultCredentialsError as ex:
        log_verbose_traceback(ex)
        raise ex

    # Discovery clients for Google Cloud Storage and Life Sciences API
    self._storage_cli = discovery_build(
        "storage", "v1", credentials=creds, cache_discovery=False
    )
    self._compute_cli = discovery_build(
        "compute", "v1", credentials=creds, cache_discovery=False
    )
    self._api = discovery_build(
        "lifesciences", "v2beta", credentials=creds, cache_discovery=False
    )
    self._bucket_service = storage.Client()

And if you look into the source code, the googleapiclient discovery has that same google.auth.default function here, which will call that same auth function via here. Notably, build takes an http object here, and given that we don't provide one it gets generated; I think it might be saved on the returned service objects here. So what I would do is compare the object there (likely directly, or its credentials) to the one created above. If they are comparable, arguably we could grab this object from one of the service objects generated from the discovery build API. The discovery build APIs are going to have different scopes for their respective endpoints, vs. the one above which covers all of Google Cloud. We could also try the reverse: creating one credentials object and then seeing if we can pass it to the services, although we'd want to double check how the scopes line up. And if neither of those ideas work, we could minimally have this call in the same function so they are in the same place!
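
A minimal sketch of that last "reverse" idea, assuming the broad cloud-platform scope is acceptable for all of the discovery clients (variable names here are illustrative, not taken from the PR):

    import google.auth
    from googleapiclient.discovery import build as discovery_build
    from google.cloud import storage

    # One Application Default Credentials object shared by all clients
    credentials, project_id = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )

    # Pass the same credentials to each discovery client instead of letting
    # each build() resolve its own credentials internally
    storage_cli = discovery_build(
        "storage", "v1", credentials=credentials, cache_discovery=False
    )
    compute_cli = discovery_build(
        "compute", "v1", credentials=credentials, cache_discovery=False
    )
    api = discovery_build(
        "lifesciences", "v2beta", credentials=credentials, cache_discovery=False
    )
    bucket_service = storage.Client(credentials=credentials, project=project_id)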

Contributor

The request object is ultimately generated from one of those services, either directly or as a result of doing something like pipelines.run(), so I'm curious why the request object isn't coming with its own http already (or maybe some of them are, but not consistently?). E.g., pipelines.run() here is the one that was originally giving us trouble:

operation = pipelines.run(parent=self.location, body=body)

Maybe @moschetti knows why these endpoints, generated from a discovery endpoint (e.g., here), don't produce a request with an http object comparable to the one we are generating later in _retry_request?
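
For context, a hedged illustration of the deferred-execution behavior in question: a call like pipelines.run() on a discovery client only builds a googleapiclient HttpRequest; the network call happens later on .execute(), which is what _retry_request wraps:

    # pipelines.run() only builds an HttpRequest; no network call happens yet
    operation = pipelines.run(parent=self.location, body=body)
    # the HTTP call is deferred until the request is executed, roughly:
    #     result = operation.execute(http=some_authorized_http)  # inside _retry_request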

Contributor Author

In my first attempt I tried passing the credentials created here:

creds = GoogleCredentials.get_application_default()

to google_auth_httplib2.AuthorizedHttp in _retry_request, but it did not like those credentials, giving this error: AttributeError: '_JWTAccessCredentials' object has no attribute 'before_request'. That led me to this issue, where I found that google.auth should be used for the credentials.
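
A minimal sketch of that working combination, assuming the request is ultimately executed with a per-call AuthorizedHttp (the helper name execute_with_retry and the variable names are illustrative, not necessarily what the PR uses). google.auth credentials implement before_request(), which AuthorizedHttp requires, while the oauth2client credentials above do not:

    import time
    import httplib2
    import google.auth
    import google_auth_httplib2
    import googleapiclient.errors

    def execute_with_retry(request, timeout=2, attempts=3):
        # Application Default Credentials via google.auth (these have before_request)
        credentials, _ = google.auth.default(
            scopes=["https://www.googleapis.com/auth/cloud-platform"]
        )
        # Fresh httplib2.Http per call, wrapped so it attaches the credentials
        http = google_auth_httplib2.AuthorizedHttp(credentials, http=httplib2.Http())
        try:
            return request.execute(http=http)
        except googleapiclient.errors.HttpError:
            if attempts > 0:
                time.sleep(timeout)
                return execute_with_retry(request, timeout * 2, attempts - 1)
            raise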

@cademirch
Contributor Author

cademirch commented Mar 3, 2022

So I pushed some new commits using Google's recommended way to build the service APIs. It seems to work so far on the test case. Still testing on a more complex workflow.

Sorry if the commits are a bit messy. Still new to contributing and using git.
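
For reference, the pattern the googleapiclient documentation recommends for many concurrent requests looks roughly like the sketch below; that the new commits follow it exactly is an assumption, but it matches the description above. httplib2.Http is not thread-safe, so each request gets its own transport instead of sharing the one stored on the service object:

    import httplib2
    import google.auth
    import google_auth_httplib2
    from googleapiclient.discovery import build

    credentials, project_id = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )
    # Build the service once with explicit credentials
    lifesciences = build(
        "lifesciences", "v2beta", credentials=credentials, cache_discovery=False
    )
    # ...then give every request its own authorized http object instead of
    # sharing the one stored on the service
    request = lifesciences.projects().locations().list(name=f"projects/{project_id}")
    response = request.execute(
        http=google_auth_httplib2.AuthorizedHttp(credentials, http=httplib2.Http())
    )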

@vsoch
Contributor

vsoch commented Mar 3, 2022

No need to apologize, you're doing great.

@sonarcloud
sonarcloud bot commented Mar 3, 2022

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (A)
Vulnerabilities: 0 (A)
Security Hotspots: 0 (A)
Code Smells: 0 (A)

No coverage information
0.0% duplication

johanneskoester (Contributor) left a comment

Awesome work @cademirch! So, am I getting it right that basically using google.auth overall solves the segfault issue then?

@johanneskoester
Contributor

Let us know once your larger tests pass. I will wait with the merging until then.

@cademirch
Contributor Author

cademirch commented Mar 4, 2022

@johanneskoester

Awesome work @cademirch! So, am I getting it right that basically using google.auth overall solves the segfault issue then?

Yes, mostly. Though I believe using google_auth_httplib2 for building the requests is the more critical part of the solution.

Let us know once your larger tests pass. I will wait with the merging until then.

My larger workflow ran overnight without error, though it did end prematurely:

Finished job 572.
162 of 269 steps (60%) done
Complete log: /mnt/ccgp_workflow/.snakemake/log/2022-03-03T210937.628868.snakemake.log

I don't think this premature ending is related to this issue, though; I've seen this behavior before when running in cloud execution mode.

@johanneskoester johanneskoester merged commit 2c0fee2 into snakemake:main Mar 4, 2022
@johanneskoester
Contributor


Mhm, that premature ending is kind of worrying as well of course. I have never seen anything like that. Can you file a separate issue for that or is it already open?

@cademirch
Contributor Author


Yeah, I have only seen it a few times. I think it could be related to --until, which I did use when testing my large workflow with this PR. I'll look into it and open an issue.
