Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Saving more space during construction. #213

Open
wants to merge 1 commit into
base: lambda3
Choose a base branch
from

Conversation

SGSSGene
Copy link
Collaborator

@SGSSGene SGSSGene commented May 9, 2023

This is an addition to PR #211 .

It replace another relabeling data structure and replaces it with a binary search on the accumulated sizes of the sequences. The created index doesn't change (creates identical binary).

Memory: for 20GB input and sampling Rate of 5, I am expecting to additionally save 32GB of memory during construction.
Runtime: This is very hard to predict. Purely theoretically we replace O(n) with O(n * log(m)), where m is the number of sequences. In my test (386MB input) I have seen a runtime reduction of ~10%.

Update: There was another structure that I could reuse, which would be an additional 32GB of savings.
This should lead in a total of 64GB less memory...With this lambda3 might be the same in memory usage to lambda2

@SGSSGene SGSSGene force-pushed the fix/less_space_construction_2 branch 3 times, most recently from 6f9f55d to 464ff04 Compare May 9, 2023 14:30
@SGSSGene SGSSGene force-pushed the fix/less_space_construction_2 branch from 464ff04 to 0b06f35 Compare May 9, 2023 14:55
@SGSSGene SGSSGene marked this pull request as ready for review May 9, 2023 18:16
@h-2
Copy link
Member

h-2 commented May 10, 2023

Runtime: This is very hard to predict. Purely theoretically we replace O(n) with O(n * log(m)), where m is the number of sequences. In my test (386MB input) I have seen a runtime reduction of ~10%.

Is this a change that affects the construction or also the search?

edit: I assume it is just for construction, because the on-disk format doesn't change..?

@sarahet
Copy link
Member

sarahet commented May 11, 2023

I did some tests and RAM decreased to 170GB from previous 214 GB, but the runtime increased again from previous 40 minutes to now 1:10h. I think we probably need some more tests here and therefore might need to push this a bit back due to our current benchmarking efforts.

@SGSSGene
Copy link
Collaborator Author

This should produce an identical index, resulting in no impact of the search.
The construction itself will have a different runtime, which is hard to predict. An increase of 30min is surprising, this would mean that the majority time is spend outside of the suffix array construction. I will investigate with some larger data and see how it behaves.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants