--google-lifesciences segmentation fault #1444
Comments
I've been seeing this issue on my own workflows, and can confirm I can reproduce it from this example.
@moschetti can you please have a look at this? It seems like an error in one of the Google Life Sciences API packages, because Snakemake itself is pure Python and cannot really generate segfaults.
@cademirch if possible, to speed things up on the Google side, it would be nice if you could narrow it down to a particular API call. The life sciences backend of Snakemake is here. When you add some print statements there, you can probably discover in which line the segfault happens.
@vsoch in case you are interested in this as well, any help would be appreciated, but it is also fine if you leave this to us and Google ;-).
I could probably help over a weekend if I can still access the snakemake-testing GCP project?
I think that should still work.
Thanks a lot, that is really appreciated!
Following the suggestion of @johanneskoester I placed print commands in most of the methods of the life sciences executor. Here's the segmentation fault:
I was also seeing other errors at the same position, such as
and
To make sure the error was actually in
So I think that narrows it down to that call. In addition, I was seeing the occasional
This issue seemed to be more frequent when using a multi-region bucket, compared to a single region, but I was able to reproduce it in both. This is definitely a minor inconvenience compared to the original post, but if you start testing this in earnest you'll probably hit it. I should note that I was twice able to run all 100 jobs without an error. They were back-to-back executions, and qualitatively appeared to have relatively few of those calls. Hope that helps!
Adding to @CowanCS1's findings. I ran the test with faulthandler enabled and got the following traceback:
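For anyone reproducing this, one way to get that kind of traceback (an assumption about how the run above was instrumented) is the standard-library faulthandler; the same effect is available without code changes by exporting PYTHONFAULTHANDLER=1 before invoking snakemake.

```python
# One way to get a traceback like the one above on a hard crash (an assumption
# about how this run was instrumented): enable the standard-library faulthandler
# before the workflow starts, e.g. at the top of a small wrapper script.
import faulthandler

faulthandler.enable(all_threads=True)  # dump every thread's Python stack on SIGSEGV/SIGABRT
```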
I may not need to help if you have the full traceback! Here is where that seems to be in the client code: https://github.com/googleapis/google-api-python-client/blob/8e9750ee4fde0a531d1b7085d660fbe04d8f1e4e/googleapiclient/_helpers.py#L111-L131. I'm seeing that it happens right after ssl.py, so it might be related to that? @CowanCS1 are you able to change the entrypoint to run the same thing with gdb so we can maybe see more of what is going on? E.g., https://wiki.python.org/moin/DebuggingWithGdb. I'm guessing you are already building the container yourself, so you'd want to make sure it's installed.
I think we can't rule them out yet - are you able to run with gdb to get more detail?
@vsoch I gave it a shot earlier today by trying to get gdb to attach to the running snakemake process, but had no success unfortunately. I'll look into it more in the morning, though.
@vsoch @cademirch I launched snakemake and attached gdb to the running process; attaching the gdb interactive output:
SegFault error
py-bt
bt
I'm new to gdb, so let me know if other outputs would be helpful.
So it looks like it's happening when we start with the GLS run() function and create the operation data structure, and then send that to the retry request, which is basically trying to submit it to the GLS API. Question - does this still trigger if you don't include the flag for more than one job / control the threading? I'm wondering if the issue is just that we shouldn't be trying to run the local commands this many times at once, and should instead try to add the scaling to a single / smaller number of API calls? There do seem to be a number of similar issues with controlling SSL connections from multiple threads.
I agree, regardless of the exact error it is always when the job is sent to the GLS api. In my experience, it only errors when running multiple jobs. I ran a limited test of that with 100 jobs using
@CowanCS1 Running the MRE with
@vsoch Where could I make this change you're suggesting? I'm happy to work on that today and test it out.
I put in some additional logging, and am seeing more than a few generic exceptions coming out of that call. @vsoch, could this be a thread safety issue? Regarding your idea to do fewer API calls, also see this idea for maintaining a pool of httplib2 objects (a sketch is below). It might be easier to incorporate with the current code than waiting to bundle a series of requests.
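A minimal sketch of what that pooling idea could look like: a thread-local httplib2 object per executor thread, so nothing shares a connection across threads. The helper name and the cloud-platform scope are illustrative assumptions, not code from Snakemake or the linked docs.

```python
# Thread-local httplib2 pool: each executor thread gets its own authorized Http
# object, since httplib2.Http is not thread safe. The helper name and the
# cloud-platform scope are illustrative assumptions, not Snakemake code.
import threading

import google.auth
import google_auth_httplib2
import httplib2

_local = threading.local()


def get_thread_local_http():
    """Return an authorized Http object unique to the calling thread."""
    if not hasattr(_local, "http"):
        credentials, _project = google.auth.default(
            scopes=["https://www.googleapis.com/auth/cloud-platform"]
        )
        _local.http = google_auth_httplib2.AuthorizedHttp(
            credentials, http=httplib2.Http()
        )
    return _local.http
```

A request would then be executed with request.execute(http=get_thread_local_http()), so each thread reuses its own connection instead of the one cached on the discovery service.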
Sorry you beat me to responding! So I was just looking at something similar - and I'd say first try those approaches that refactor to enable thread safety. Other methods would change the way that the client works - and if you need more than that approach, you could also try testing setting some custom values for GLS of:
It's been a while since I've traced the execution, but I think we have a few good ideas to start! Thanks for offering to take a shot at fixing this @CowanCS1!
So I tried a rather naive solution based on the Google docs @CowanCS1 linked above. With this, I have been able to run the MRE with
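For context, a hedged sketch of what a "fresh http object per request" change could look like; this is an assumption about the shape of the approach, not the code in the actual PR.

```python
# Sketch of the "fresh http per request" idea: build a new authorized Http for
# every API call instead of reusing the one cached on the discovery service.
# This is an assumption about the shape of the fix, not the code in the PR.
import google.auth
import google_auth_httplib2
import httplib2


def execute_with_fresh_http(request):
    """Execute a googleapiclient HttpRequest with its own transport."""
    credentials, _project = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )
    authorized_http = google_auth_httplib2.AuthorizedHttp(
        credentials, http=httplib2.Http()
    )
    # HttpRequest.execute accepts an http argument that overrides the shared one.
    return request.execute(http=authorized_http)
```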
@cademirch this is a great step in the right direction! I suspect there are best practices and I've pinged @moschetti to advise us. This has been a problem for years and we haven't fixed it, so even for a hacky solution I'm pretty pumped about this <3
Thanks @cademirch :) I've tested this version, and can confirm that it eliminates the SIGSEGV, all of the exotic SIGABRT errors, and all of the SSL exceptions. Nice! Unexpectedly, it also eliminates another issue I was seeing with this test script where a subset of output files (5-10%) were either not present in cloud storage or present but not recognized by the job. I saw the latter type of MissingOutputExceptions frequently in my own pipeline. With this version, all of the outputs are present every time. That's actually all of the issues I was tracking, so hurrah! 👍
@vsoch Thanks for reaching out to get advice - this version is probably creating 10-20 connections per each of these quick jobs, and that would increase linearly with time due to the status checks. I hesitate to go fully into implementing a pool of connections, since I could imagine some edge cases like stale connections which we'd have to handle for minimal benefit compared to simpler solutions. My currently favored compromise is to create a single http connection for each call.
Cheers all
Woohoo! So my best guess is that when we create the discovery clients, the request objects aren't getting / inheriting the correct scopes, or if they are, the scopes are perhaps not consistent enough to authenticate everything (e.g., sending to storage?). I think I'd consider this a googleapiclient bug - if I create a discovery service to use an API, my expectation would be that it generates whatever http is needed, especially given that the Google Application Default Credentials are the auth attached (which I believe to be the same credentials generated by google.auth.default). Ping @moschetti in case he can take this to the developers and get some insight!
@CowanCS1 @vsoch Glad I can help! Thank you both for your expertise and help with this. I agree that while this version works, it is not optimal since it creates so many http objects. I do like @CowanCS1's suggestion of creating the http connection in
Sounds good! And check out the service objects created by the discovery API, one for each request, to see if you can grab an already created http object / credentials from there. I will say again it's strange that the library doesn't carry those forward to the request.
Yes, it is indeed strange. I'm curious if building the api as suggested here may be helpful.
Yes! I was thinking we could try something like that - perhaps creating one build request for all the discovery services?
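A possible shape for that, sketched under the assumption that these service names/versions are the ones the executor needs: build() accepts an explicit http argument, so one authorized transport could be shared by all discovery services and credentials are resolved once.

```python
# Sharing one explicitly authorized transport across the discovery services.
# Service names/versions here are illustrative; the point is that build()
# accepts an http argument, so credentials are resolved once.
import google.auth
import google_auth_httplib2
import httplib2
from googleapiclient import discovery

credentials, _project = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
authorized_http = google_auth_httplib2.AuthorizedHttp(credentials, http=httplib2.Http())

lifesciences = discovery.build(
    "lifesciences", "v2beta", http=authorized_http, cache_discovery=False
)
compute = discovery.build("compute", "v1", http=authorized_http, cache_discovery=False)
```

Note that a single shared Http object is still not thread safe, so this would have to be combined with the per-thread or per-request handling discussed above.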
Sounds good, I'll give it a go. Should I make a new branch for this? Or continue on my existing one with the PR? (Sorry, I'm new to contributing.)
Please continue updating the PR! When you push new commits the CI will run again, and you can do that generally until you are content. Some communities are very strict about squash and merge, and require you to rebase or do other kinds of git fu, but here we are good with squash and merge, so contributing is easy. To take a quote from the movie "Signs" and mangle it a bit... push away Cade, push awaaay! :D
@cademirch Have you tried the updated code with a more complex pipeline yet? I moved my main project over to the same virtual environment that is working perfectly for the test example and am seeing a very frequent MissingOutputException. Given that I observed a reduction in this type of error for the test example provided above, it isn't unthinkable that the authentication to cloud storage is somehow negatively impacted when the jobs are longer or more complex. I tried to reproduce it by adding some delays to the minimal example, but no dice.
Does one of the services looking for them have scope for storage?
@CowanCS1 I have not. I think you are right that it has to do with the authentication - I'm curious if it has to do with the fact that new credentials are being generated in every call. This missing-output behavior is frustrating. I've seen it at about the same frequency as you in my own workflows.
I'm working on a different solution and will update my PR. Will also test on a more complex workflow and report back.
I'm confused about this since storage seems to be handled separately from the other services?
That is my doing - the only logic is that the storage client directly worked really well (vs. the other apis, which didn't have established reliable clients). I think the scopes are here: https://github.com/googleapis/python-storage/blob/c6bf78d608832d69031cccb7e4f252dd6241ade7/google/cloud/storage/client.py#L100-L103 and, akin to the others, the credentials are found in the environment, mentioned here: https://github.com/googleapis/python-storage/blob/c6bf78d608832d69031cccb7e4f252dd6241ade7/google/cloud/storage/client.py#L76-L77, and under the hood (the base class of storage.Client) it's doing the same thing: https://github.com/googleapis/python-cloud-core/blob/24b7de49943a49e8235f9dbee6b32693deed8c1f/google/cloud/client/__init__.py#L178
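A small sketch of that point, assuming Application Default Credentials are configured: these two clients end up authorized the same way, because storage.Client resolves credentials from the environment on its own.

```python
# storage.Client resolves credentials and scopes from the environment on its
# own, so these two clients end up authorized the same way (assuming
# Application Default Credentials are configured).
import google.auth
from google.cloud import storage

# 1. Let the client discover credentials itself.
implicit_client = storage.Client()

# 2. Resolve the same default credentials explicitly and hand them over.
credentials, project = google.auth.default()
explicit_client = storage.Client(project=project, credentials=credentials)
```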
I agree, very bizarre. Very confused as to why I can't replicate it in the simple case. @vsoch Not certain - there is an old pull request in python's cloud storage api that may be relevant here. To completely rule it out, I guess we can try the same approach of creating and providing an independent http connection for the cloud storage api; it can be passed in when the client is constructed. Bedtime on this side of the Atlantic :)
Yeah, it totally makes sense to use the storage client. So it does seem that the storage client generates its own credentials from the environment. I guess it could be crosstalk? But that seems unlikely, as @CowanCS1 mentions. Gonna push some commits and test on a larger workflow and see if I can replicate.
Haha, yes, good observations indeed! I absolutely love using Google Cloud, but the APIs are constantly moving targets and there are many ways to do the same thing. I try to choose and make the best decision for the time, but I suspect this also changes over time.
My previous report was based on continuing a long (~1000 jobs) ongoing run, which had been failing consistently prior to the fixes here (~400 jobs in, and I was still slowly making progress on it with the old version). I made a duplicate of the data set and cleaned out all of the in-progress outputs, and now it appears to be working without any of those exceptions. So my working theory is that all of the errors from the previous version left it in a weird state, which was somehow causing issues with the new version. I'll let it run overnight, but wanted to let you know that the existing cloud storage API may be OK.
@CowanCS1 Interesting. Thank you for noting this. I've had issues with MissingOutputExceptions despite the files being present, see #1405. Interestingly, when I run with
@CowanCS1 Also unrelated, but could you comment on how long it takes for the DAG to be built for your larger, more complex workflows when using
The pipeline I described in my last post is still running, over 200 jobs with no crashes, and the only missing-files exceptions had a known cause.
That closely matches what I was seeing earlier. I do have a few observations on things that are going wrong with GS, but I haven't verified these in isolation through minimal examples.
@vsoch @johanneskoester if you want, I can try to create a minimal example of the above and put it in a new issue. Given that I can circumvent the problem by just cleaning the output directories and re-running, and that the present issue being fixed has removed the primary entry point to this little cascade of events, tracking down the underlying bug isn't high on my priority list.
~5 minutes. I've seen what you're describing on a previous version of our pipeline; it was 15-20 minutes for about 600 jobs on 120 inputs. I've built my own file inventories from cloud storage for a superset of what I'm analyzing here, and the inventory was done within seconds, so I suspect it is the calculation of the DAG itself which is time consuming. Are you using wildcards heavily? Our speed-up was part of a larger re-factoring, so I can't 100% attribute it to specific changes, but replacing wildcards with python scripts that read inputs from TSV files was part of it (a sketch of the pattern is below).
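A hypothetical sketch of that TSV-driven pattern (the file name and columns are made up for illustration): the sample table is read once at parse time, and rules take their inputs from explicit lookups rather than wildcard inference.

```python
# Snakefile sketch: sample names and paths come from a table read once at parse
# time, so inputs are explicit lookups instead of wildcard globbing.
# samples.tsv and its columns (sample, fastq) are made-up names.
import csv

with open("samples.tsv") as handle:
    SAMPLES = {
        row["sample"]: row["fastq"]
        for row in csv.DictReader(handle, delimiter="\t")
    }


rule all:
    input:
        expand("results/{sample}.bam", sample=SAMPLES)


rule align:
    input:
        lambda wildcards: SAMPLES[wildcards.sample]
    output:
        "results/{sample}.bam"
    shell:
        "echo aligning {input} > {output}"
```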
Oh yuck. So some thoughts:
This makes sense in a way because when you look at the GS upload function:
snakemake/snakemake/remote/GS.py, lines 253 to 282 at 4586ef7
snakemake/snakemake/remote/GS.py, lines 30 to 53 at 4586ef7
I suspect the call to cancel a job isn't like an immediate kill, but a request (that maybe isn't listened to) so it makes sense they keep on running and create outputs.
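For illustration, a hedged sketch of what a cancellation call looks like through the v2beta discovery client (the project, location, and operation id are placeholders): the call returns as soon as the request is accepted, not when the VM actually stops, which is consistent with jobs continuing to produce outputs afterwards.

```python
# Cancelling a Life Sciences operation through the v2beta discovery client.
# The project, location, and operation id are placeholders. The call only
# *requests* cancellation and returns immediately; the worker VM may keep
# running (and writing outputs) for a while afterwards.
from googleapiclient import discovery

service = discovery.build("lifesciences", "v2beta", cache_discovery=False)
operation = "projects/my-project/locations/us-central1/operations/1234567890"

service.projects().locations().operations().cancel(name=operation, body={}).execute()

# To confirm the job actually stopped, poll operations.get(name=...) until done is True.
```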
Hmm, is this something we shouldn't be allowing users to set? (snakemake/snakemake/executors/google_lifesciences.py, lines 527 to 532 at 4586ef7)
Yeah this is making me think maybe we should clear the namespace before a run like this and not try to re-create, although my understanding is that snakemake is supposed to be able to deal with existing files (re the first part of this discussion).
If you are pooped out on debugging (and we are really appreciative of the help!), I think this minimal example would be great for us and others, to keep the remaining issue in mind and have a reproducer if we're able to test. Thank you @CowanCS1!
Right, those files not being included in the cache caught me off guard a bit - as you say, it was my understanding that this was the expected behavior. But I hadn't purposefully provided intermediate files, so maybe this was just my misunderstanding.
I think the actual "error" that we're hitting comes up when a rule attempts to upload an output file to a blob that already exists. Likewise, that could explain why blobs are not deleted when we hit an error.
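A sketch of that theory using the storage client (bucket name taken from the example command line in this issue, object path made up): a Blob built with bucket.blob() carries no server metadata, while bucket.get_blob() or reload() fetches the current state, so a check against a cached Blob can disagree with what is actually in the bucket after an overwrite.

```python
# Illustration of the stale-blob theory. A Blob built with bucket.blob() is a
# purely local object with no server metadata; bucket.get_blob() / reload()
# fetch the current state. The object path is made up; the bucket name comes
# from the example command line in this issue.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("snake-test")

cached = bucket.blob("results/sample1.txt")       # local object, no API call
fresh = bucket.get_blob("results/sample1.txt")    # None if absent, else current metadata

if fresh is None:
    print("blob is missing on the server")
elif cached.generation != fresh.generation:
    # The cached object has no (or outdated) metadata; refresh it before
    # trusting fields like size, updated, or crc32c.
    cached.reload()
```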
Sounds dangerous, with some wild failure modes where lots of files get irretrievably deleted. Better to just fix it so overwritten files are correctly recognized.
This was exactly my thought - in principle the expected behavior would be to repeat on a preemption event but not on any other type of failed job as we already have a dedicated command line option for that. Not sure how clearly that is signaled by GLS, though. Right now this flag also doubles as a
Sure thing! I just know I have to debug downstream steps that are still on my critical path, so I wanted to try and avoid leaving anyone hanging. Will start with a simple check of the blob theory I described above...
Made some decent progress on this. Looks like I have a minimal example, and learned some conditions under which this does and does not occur. Will test a bit more and create a new issue when I next get a chance.
@CowanCS1 Awesome! Thank you for looking into this. Curious to see what you have found - this issue has plagued my workflows when running on Kubernetes.
Okay, this makes sense to me. My workflow does use quite a few wildcards, so I can see how that blows up when computing the DAG. I'm curious about your method of using python scripts to read TSV files for inputs... could you share an example by chance?
I've provided a partial reproduction of the GS bug in #1460. I think it's the same underlying issue that I've seen in more severe cases, just exacerbated by the more complex pipelines and the multiple failed states from this issue resulting in a weird pipeline state. @cademirch Happy to share. It's in a private repo, but I can send a relevant snippet to your university email - is the lab listed in your github profile current?
@CowanCS1 I'll add my observations to that issue. And yes, you can email me at cade@ucsc.edu - thank you for that!
Gonna close this since #1451 was merged. |
Snakemake version
Tested on 7.0.0, 6.15.5 and 6.15.0

Describe the bug
Segmentation fault (core dumped) when executing with --google-lifesciences.

Logs

Minimal example
Snakefile:

Command line:
snakemake --google-lifesciences --default-remote-prefix snake-test -j10

Additional context
Tried -j10 and -j100, to no effect, still same error.