
Problem with large input #16

Open · wants to merge 3 commits into base: master
Conversation

@adipi71 commented Sep 26, 2023

We have investigated the issue with cgmlst-dists handling large input files (the error was reported with 80k Lm samples): the tool crashes with a segmentation fault.
The bug is due to an incorrect memory allocation for the distance vector. The memory size is calculated as nrow * nrow, which produces a 32-bit integer overflow for large nrow (line 219 in the original version). The maximum value that can be stored in an int variable is 2,147,483,647, while in our case the final dist vector size would be 80,000 × 80,000 = 6,400,000,000 > 2,147,483,647. This arises because the tool stores the matrix as a flat vector, which is otherwise a nice optimization.

We simply included the inttypes.h header and switched the size computation to 64-bit integers to bypass the overflow. We have successfully tested this on 80,000 samples and 1,748 loci.

We look forward to your feedback on this.
Best
Adriano

@wessel-eijs

Dear Adriano,

First of all, thanks for the update. Unfortunately, the allocation problem still occurs. I am trying a 55,000-isolate file with 3,002 loci, but keep getting the allocation problem. Is there a way around this, or do I just have too many loci (columns)?

@andreaderuvo

Dear wessel-eijs,

There are some additional issues to address in the original code. We have released a Python version at:
https://github.com/genpat-it/cgmlst-dists-py

If you give it a try, please provide feedback.

@wessel-eijs

Dear Andrea,

I have given your version a try and it worked on smaller files. Unfortunately, my 55,000-isolate file with 3,002 loci still fails. It does build the Hamming distance matrix, but it stops writing the full 55,000 columns after row 31648. I will paste the error message I get from Slurm below:

Row 31648 has fewer columns than the stated size of 54798

Has this happened to anyone else as well? Or is there a solution already?

Kind regards,
Wessel

@andreaderuvo

> Dear Andrea,
>
> I have given your version a try and it worked on smaller files. Unfortunately, my 55,000-isolate file with 3,002 loci still fails. It does build the Hamming distance matrix, but it stops writing the full 55,000 columns after row 31648. I will paste the error message I get from Slurm below:
>
> Row 31648 has fewer columns than the stated size of 54798
>
> Has this happened to anyone else as well? Or is there a solution already?
>
> Kind regards, Wessel

Hi Wessel,

I have two questions:

  1. Do you know which line generates the exception?
  2. Is the dataset public so I can try it myself?

Best regards,
Andrea
