segalign_repeat_masker still crashes #54

Open
glennhickey opened this issue Jun 15, 2022 · 7 comments

@glennhickey
Collaborator

This is the same command line and overall dataset as #53. While the patch for #53 worked and let most genomes go through (thanks!), there's still at least one problem. I will share the inputs offline, but the error message is:

Chromosome block 2 interval 329/333 (2982000000:2985000000) with ref (570705458:1143705458) rc (4166672753:4169672753)
Chromosome block 2 interval 331/333 (2988000000:2991000000) with ref (570705458:1143705458) rc (4160672753:4163672753)
Chromosome block 2 interval 330/333 (2985000000:2988000000) with ref (570705458:1143705458) rc (4163672753:4166672753)
Chromosome block 2 interval 333/333 (2994000000:2996999981) with ref (570705458:1143705458) rc (4154672772:4157672753)
Chromosome block 2 interval 332/333 (2991000000:2994000000) with ref (570705458:1143705458) rc (4157672753:4160672753)
terminate called after throwing an instance of 'thrust::system::system_error'
terminate called recursively
terminate called recursively
terminate called recursively
what(): CUDA free failed: cudaErrorIllegalAddress: an illegal memory access was encountered
terminate called recursively
Command terminated by signal 6
@RenzoTale88

I'm running into the same issue with 2x A100 GPUs when running segalign_repeat_masker on the GCA_946052875.1 genome:

        Chromosome block 2 interval 96/100 (2950000000:2960000000) with ref (720545186:1330545186) rc (70545185:80545185)
        Chromosome block 2 interval 97/100 (2960000000:2970000000) with ref (720545186:1330545186) rc (60545185:70545185)
        Chromosome block 2 interval 98/100 (2970000000:2980000000) with ref (720545186:1330545186) rc (50545185:60545185)
        Chromosome block 2 interval 99/100 (2980000000:2990000000) with ref (720545186:1330545186) rc (40545185:50545185)
        Chromosome block 2 interval 100/100 (2990000000:2999999981) with ref (720545186:1330545186) rc (30545204:40545185)

        Sending block 3 ...
        Chromosome block 3 interval 1/4 (3000000000:3010000000) with ref (0:330545186) rc (20545185:30545185)
        Chromosome block 3 interval 2/4 (3010000000:3020000000) with ref (0:330545186) rc (10545185:20545185)
        Error: cudaMemcpy of 4 bytes for num_anchors failed with error " invalid argument "
        terminate called after throwing an instance of 'thrust::system::system_error'
          what():  CUDA free failed: cudaErrorCudartUnloading: driver shutting down

        [2023-06-02T23:43:08+0100] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host nodeXXX
<=========

@glennhickey
Collaborator Author

@RenzoTale88 This version of SegAlign (github.com/gsneha26/SegAlign) is broken and, sadly, doesn't seem to be maintained anymore. You may have more luck with the fork that Cactus uses, which can be found here: https://github.com/ComparativeGenomicsToolkit/SegAlign (in addition to the Cactus release GPU docker images). Even this fork has some overflow issues, but it works on most data.
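For anyone triaging similar logs, here is a minimal sketch that pulls the coordinates out of a "Chromosome block" line and reports whether the largest one exceeds the signed or unsigned 32-bit range. It assumes the log format shown in this thread; the helper names `largest_coordinate` and `classify` are made up for illustration and are not part of SegAlign. Whether these limits are actually what triggers the crash is not confirmed here, so treat this as a diagnostic aid only.

```python
import re

INT32_MAX = 2**31 - 1   # 2147483647
UINT32_MAX = 2**32 - 1  # 4294967295

# Assumed log format, copied from the lines posted in this thread:
# Chromosome block 2 interval 329/333 (2982000000:2985000000) with ref (...) rc (...)
LINE_RE = re.compile(
    r"Chromosome block (\d+) interval (\d+)/(\d+) "
    r"\((\d+):(\d+)\) with ref \((\d+):(\d+)\) rc \((\d+):(\d+)\)"
)


def largest_coordinate(log_line):
    """Return the largest coordinate on a 'Chromosome block' line, or None."""
    match = LINE_RE.search(log_line)
    if match is None:
        return None
    # Groups 1-3 are block/interval counters; the remaining six are coordinates.
    return max(int(value) for value in match.groups()[3:])


def classify(value):
    """Say which 32-bit range, if any, a coordinate no longer fits in."""
    if value > UINT32_MAX:
        return "exceeds uint32"
    if value > INT32_MAX:
        return "exceeds int32"
    return "fits in int32"


if __name__ == "__main__":
    line = (
        "Chromosome block 2 interval 329/333 (2982000000:2985000000) "
        "with ref (570705458:1143705458) rc (4166672753:4169672753)"
    )
    print(classify(largest_coordinate(line)))
```

Running it over a full log (one call per line, skipping `None` results) shows at a glance which intervals sit beyond either limit.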

@RenzoTale88

@glennhickey thanks for the quick reply. I have also tried the second repository, but ran into the same issue. I can open a second ticket on the Cactus GitHub page if you prefer?

@glennhickey
Collaborator Author

Did you try the commit that cactus uses? I only just merged it into master. If you did not use that commit (or the current master as of 10 minutes ago), you would see the same error.

@RenzoTale88

I just cloned the repo and am compiling it now. I'll get back to you on this, thanks!

@RenzoTale88

@glennhickey I'm still running into this issue:

        Chromosome block 2 interval 100/100 (2990000000:2999999981) with ref (794006738:1424006738) rc (114006756:124006737)

        Sending block 3 ...
        Chromosome block 3 interval 2/12 (3010000000:3020000000) with ref (0:424006738) rc (94006737:104006737)
        Chromosome block 3 interval 1/12 (3000000000:3010000000) with ref (0:424006738) rc (104006737:114006737)
        Chromosome block 3 interval 4/12 (3030000000:3040000000) with ref (0:424006738) rc (74006737:84006737)
        Chromosome block 3 interval 3/12 (3020000000:3030000000) with ref (0:424006738) rc (84006737:94006737)
        Chromosome block 3 interval 5/12 (3040000000:3050000000) with ref (0:424006738) rc (64006737:74006737)
        Chromosome block 3 interval 6/12 (3050000000:3060000000) with ref (0:424006738) rc (54006737:64006737)
        Chromosome block 3 interval 7/12 (3060000000:3070000000) with ref (0:424006738) rc (44006737:54006737)
        Chromosome block 3 interval 8/12 (3070000000:3080000000) with ref (0:424006738) rc (34006737:44006737)
        Chromosome block 3 interval 9/12 (3080000000:3090000000) with ref (0:424006738) rc (24006737:34006737)
        Chromosome block 3 interval 10/12 (3090000000:3100000000) with ref (0:424006738) rc (14006737:24006737)
        Chromosome block 3 interval 11/12 (3100000000:3110000000) with ref (0:424006738) rc (4006737:14006737)
        Error: cudaMemcpy of 4 bytes for num_anchors failed with error " invalid argument "
        terminate called after throwing an instance of 'thrust::system::system_error'
          what():  CUDA free failed: cudaErrorCudartUnloading: driver shutting down

        [2023-06-06T00:58:05+0100] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host NODE

@gsneha26
Owner

I will get it fixed by the end of the week.
