Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

different runs return different number of unitigs and distinct color subsets #24

Open
jermp opened this issue May 16, 2023 · 6 comments

Comments

@jermp
Copy link
Contributor

jermp commented May 16, 2023

Hello @Guilucand and @alexandrutomescu.
Let me first congratulate with you for this excellent algorithm!

We are currently using GGCAT for building Fulgor, a colored k-mer index based on SSHash (same functionalities offered by Themisto).

I noticed that, when running the same command multiple times, I get small differences in the number of unitigs/color subsets.

For example, if I run:

./fulgor build -l ../test_data/salmonella_10_filenames.txt -o ../test_data/salmonella_10 -k 31 -m 19 -d tmp_dir --verbose --check -t 16

I sometimes get 171 color subsets and 86630 unitigs, or 172 and 86632, or 173 and 86648.

I'm inclined to think this might be an issue with multithreading since I did not notice this if I use 1 thread only.
Or, is it just something that can happen because unitigs are not required to be maximal anyway?

(The correctness of Fulgor is not broken as long as the number of kmers stays the same, as it seems to be.
But I just wanted you to know this potential issue. Ideally, I'd expect to always get the same number across different runs.)

Thank you very much!

CC: @rob-p.

@jermp
Copy link
Contributor Author

jermp commented May 17, 2023

I'm CCing also @jnalanko who might be interested. Have you also notice this?

I can confirm that different runs of themisto may report different number of color sets, e.g., 4,236,355 vs. 4,236,354, for the same issue.

@jnalanko
Copy link
Contributor

Hi, @jermp. No, I have not noticed this. Thankfully, the correctness of Themisto should not be affected either, as long as the unitigs cover all k-mers at least once, and the colors of the unitigs not wrong.

@jermp
Copy link
Contributor Author

jermp commented May 17, 2023

Yes exactly, both Themisto and Fulgor are correct anyway since the num. of kmers always seem to be correct and consistent. Just letting you know.

@Guilucand
Copy link
Collaborator

Hi @jermp, the difference in color subsets is due to a (partially wanted, for performance) race condition when updating the hash table of the colors.

For efficiency and technical reasons, a new color is first written to disk and then added to an in-memory hashmap that is used to deduplicate equal colors, without holding a global lock.
So when two kmers of the same colors (never seen before) are processed at the same time, they can get different color indexes (that are in practice referring to the same set of colors) and they can write them before noticing that the other kmer had the same color.
To avoid this problem ggcat should get a global lock each time it writes a new color to disk and hold it until the in-memory hashmap is updated (with the index returned by the disk write), slowing down parallel insertions.

In practice, this should not lead to incorrect results, unless you check for color equality only by comparing the color subset index without reading the colormap, also the increase in space for storing (few) duplicate colors is very small, and the computed compacted graph has always the same set of maximal unitigs.

The difference in unitigs counts you're seeing is also due to the problem above since when two adjacent kmers in the same maximal unitig have a different color index (even if it refers to the same color set) they are split into multiple unitigs.

Also, this problem tends to be more visible with very small datasets, where there is a high chance that two kmers processed by different threads at the exact same time share the same color.

If you need the color sets to be always distinct I can try to see if there is a way to ensure that, maybe putting this requirement under an optional feature flag if it hurts performance.

@jermp
Copy link
Contributor Author

jermp commented May 17, 2023

Hi @Guilucand and thank you for confirming.
I do not think it is a hard requirement. Indeed correctness is not impacted if we have some duplicated color.
Any thoughts here @rob-p?

@rob-p
Copy link
Contributor

rob-p commented May 17, 2023

I agree it's not a hard requirement, but it would be very nice to have consistency between runs and to know the true number of colors. Differences are small in our example, but maybe could grow larger for very similar pangenomes and tons of threads.

How is color equality checked, currently? As map contention is very likely to be low anyway, perhaps you could try a sharded map like DashMap to avoid almost all lock contention?

Also, thanks for the quick response!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants