
Large Count File Dataset -- Memory Issues #44

Open
aportell18 opened this issue Jan 12, 2023 · 3 comments

@aportell18

I am attempting to process a relatively large count file. It is around 10 GB and contains ~170 million unique sequences, each 60 bp in length.

The computer I am using has 64 GB of RAM and an Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz. When I try to run starcode, the process is killed without any error message. I am guessing that the program exceeds the available memory, because when I subset my count file to 25 million sequences, starcode is able to process it, but it uses almost 100% of the computer's memory.

I am wondering if there is any solution to this, or how much RAM you believe would be needed to process a dataset of this size?

Thank you!

@gui11aume
Owner

gui11aume commented Jan 14, 2023

Thanks for reporting the issue. Starcode should terminate gracefully when it runs out of memory, but maybe we made a mistake somewhere. Are you running Starcode in a Docker container or on a cluster? We have sometimes observed strange behavior in these contexts. If you are willing, I would be interested in running your sample on our machine to better understand which statement fails.

As for the required memory, it depends on the parameters you use for the run. Allowing more errors means keeping more branches of the prefix tree in memory, so setting the distance to a small value is not only faster, it also requires less memory. Depending on your goals, you could try lowering the distance and see whether that works for you.

Also, if the sequences consist of a constant region and a barcode, you could try to isolate the barcodes and cluster only those. Reducing the sequence length can yield spectacular performance improvements because it significantly prunes the prefix tree.
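The barcode-isolation idea above can be sketched with a small preprocessing script. This is only an illustration under stated assumptions: it assumes the count file has one "sequence<TAB>count" record per line, and the barcode coordinates (`BC_START`, `BC_END`) are hypothetical placeholders you would replace with the actual position of the variable region in your reads.

```python
# Sketch: trim a count file to the barcode region before clustering.
# Assumptions (adjust to your data): each line is "sequence<TAB>count",
# and the barcode occupies positions 20-39 (0-based) of the 60 bp read.
from collections import defaultdict

BC_START, BC_END = 20, 40  # hypothetical barcode coordinates

def extract_barcodes(lines):
    """Slice out the barcode and merge counts of now-identical barcodes."""
    counts = defaultdict(int)
    for line in lines:
        seq, count = line.rstrip("\n").split("\t")
        counts[seq[BC_START:BC_END]] += int(count)
    return counts

if __name__ == "__main__":
    # Two reads with different constant regions but the same barcode:
    demo = [
        "A" * 20 + "ACGTACGTACGTACGTACGT" + "T" * 20 + "\t3",
        "C" * 20 + "ACGTACGTACGTACGTACGT" + "G" * 20 + "\t2",
    ]
    for barcode, count in extract_barcodes(demo).items():
        print(f"{barcode}\t{count}")  # counts merge on the shared barcode
```

The reduced count file can then be fed back to starcode, possibly together with a lower distance setting as suggested above, so the prefix tree is built over 20 bp barcodes instead of full 60 bp reads.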

Let me know if any of this helps.

@aportell18
Author

aportell18 commented Jan 14, 2023 via email

@gui11aume
Owner

gui11aume commented Jan 15, 2023

Thanks for clarifying. I'd be happy to look at the data on my machine. Can you contact me by email so that we can set up a way to transfer the data? My address is easy to find on the Internet (like here, for instance).
