Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow custom geometry single-cell protocols #734

Open
wants to merge 17 commits into
base: master
Choose a base branch
from
Open

Conversation

Gaura
Copy link
Collaborator

@Gaura Gaura commented Jan 14, 2022

This PR introduces a new feature that will allow users to specify custom single cell protocols and use with alevin. Custom Geometry (--custom-geo) should be used when:
1. Barcode or umi have variable lengths
2. There is known fixed sequence in the reads
3. There is some sequence to be excluded
From the input peglib spec it creates a regex. Boost regex library is used to parse the reads.

Apart from small tests on multiple outlier cases, it was tested on a sci-RNA-seq3 sample SRR7827207 for speed. For this the spec is --custom-geo 1{b[9-10]f[CAGAGC]u[8]b[10]}2{r}. It says:

  • Read 1 starts with barcode of variable length 9-10 bp, followed by
  • A fixed sequence CAGAGC, then
  • A umi of length 8, and lastly
  • barcode of length 10.
  • The second read is all biological.

The barcodes are concatenated in the output and a padding sequence is added to make the length as max length + 1. The extra base is added so that we don't introduce spurious matches in barcode. For example, if the barcodes have length 3-4 bp and the two barcodes are ATG and ATGA, after padding they will be ATGAC and ATGAA resp. Adding just A to shorter barcode would result in a spurious match.

Since --custom-geo uses regex, it is slower than protocol specific flag. The time with 8 threads on a large Ubuntu 20 machine:

  1. Using --sciseq3:
real    1m0.425s
user    7m21.501s
sys     0m2.964s
  1. Using --custom-geo
real    1m39.887s
user    11m55.602s
sys     0m6.839s

Notably, it is about 66% slower. However, it allows support for almost all current and future protocols.

There will be a tutorial shortly on how and when to use this and how is it different from other flags such as --umi-geometry, --read-geometry and --bc-geomtery. There is scope of speed improvement in the future along with integration of all custom geometry processing protocols.

@Gaura Gaura requested review from k3yavi and rob-p January 14, 2022 14:45
@rob-p
Copy link
Collaborator

rob-p commented Jan 14, 2022

Thanks @Gaura ! Requesting @gmarcais input on the regex speed. Could it be the boost impl, or the way we are using it?

@rob-p
Copy link
Collaborator

rob-p commented Jan 14, 2022

@Gaura one thought, since the regex doesnt change, can we precomputed that just once?

@gmarcais
Copy link
Contributor

I don't have a great handle on the code yet, but precomputing the regex object as Rob suggested is essential.

@Gaura
Copy link
Collaborator Author

Gaura commented Jan 14, 2022

Great suggestion, thanks @rob-p and @gmarcais . Somehow, I missed it. I added it in the latest commit. Speed now from 3 runs:

real    1m19.884s 1m15.891s 1m21.462s                                                                      
user    8m9.189s 9m1.100s 9m48.764s 
sys     0m5.079s 0m5.170s 0m3.477s

50% improvement over the past results, i.e., about 33% slower than specific protocol flag now. Although, ideally I should have ran the earlier tests thrice but the sd is small so results should be valid. Nonetheless, I'll do more speed tests with versions in the future.

Let me know what other thoughts you have and what else have I missed. I have some minor improvements in mind too.

@gmarcais
Copy link
Contributor

Can you give us the regular expression that are generated and used for matching?

@Gaura
Copy link
Collaborator Author

Gaura commented Jan 14, 2022

Sure.

reg[0]: ([ATGC]{9,10})(CAGAGC)([ATGC]{8})([ATGC]{10})
reg[1]: ([ATGC]{1,})

For other custom geometries, these can be obtained by uncommenting lines 1473 and 1474 in src/AlevinUtils.cpp.

@gmarcais
Copy link
Contributor

Does it make sense to have '^' and/or '$' around the regex? Having anchors usually speeds up a regex. Otherwise, I am not sure. Have we tested other libraries than boost::regexp?

@Gaura
Copy link
Collaborator Author

Gaura commented Jan 14, 2022

I can try '^' and '$' after lunch. I didn't test any other library, since boost was already a pre-requisite for salmon and my focus was on getting it to work first. But now that it is done, other libraries can be tried. However, at this moment, I'm not clear about the effort and speed-up ratio.

@Gaura
Copy link
Collaborator Author

Gaura commented Jan 14, 2022

Including '^' didn't change the speed, and '$' can't be added as there are extra bases (not in protocol spec) at end which can vary in number.

real    1m19.392s
user    9m38.357s
sys     0m4.176s

@rob-p
Copy link
Collaborator

rob-p commented Jan 22, 2022

Thanks @Gaura! @gmarcais — I don't have a good sense about how boost stacks up against other regex libraries, but I think it's competitive. Actually modern C++ has built-in regex support in the standard library, but I've heard many bad things about the standard implementation.

I wonder if you think @gmarcais that the approximate difference between hand-rolled extraction code and regex extraction that we see here is in line with what you'd expect. Would you expect a 1/3 slowdown to the overall procedure by using regex extraction?

Copy link
Collaborator

@rob-p rob-p left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left some comments below, and it would be great to get feedback from @Gaura and @gmarcais.

include/SingleCellProtocols.hpp Outdated Show resolved Hide resolved
include/SingleCellProtocols.hpp Outdated Show resolved Hide resolved
include/SingleCellProtocols.hpp Outdated Show resolved Hide resolved
src/AlevinUtils.cpp Outdated Show resolved Hide resolved
src/AlevinUtils.cpp Outdated Show resolved Hide resolved
src/AlevinUtils.cpp Outdated Show resolved Hide resolved
src/AlevinUtils.cpp Outdated Show resolved Hide resolved
src/AlevinUtils.cpp Outdated Show resolved Hide resolved
src/AlevinUtils.cpp Outdated Show resolved Hide resolved
src/AlevinUtils.cpp Outdated Show resolved Hide resolved
@Gaura
Copy link
Collaborator Author

Gaura commented Jan 27, 2022

Hi @rob-p! Thanks for all the great suggestions and comments. I have addressed all of them. I also tested the speed after the changes. The test was done 3 times in the same way as done earlier.

  1. --sciseq3
real    0m58.463s  0m57.884s  0m57.413s
user    7m0.652s  7m1.731s  7m1.278s
sys     0m3.305s  0m3.078s  0m2.665s
  1. --custom-geo
real    1m7.411s  1m14.988s  1m3.868s
user    8m8.795s  8m40.302s  7m49.107s
sys     0m4.194s  0m6.412s  0m2.969s

The real time in case 1. was 57.92 ± 0.57s and for case 2. it was 68.75 ± 5.68s, which is about 19% slower.

@gmarcais
Copy link
Contributor

1/3 loss in performance seems significant, given that presumably the code does something else than just parsing UMIs.

I am looking at Boost own comparison and benchmarks, and on long inputs (20MB) it is competitive with PCRE2. But with short inputs (20-30 characters) PCRE2 is consistently faster (by about 30% 🤔 ). And if PCRE2 is feature full, not sure it is the fastest either, especially for simple regexp.

@rob-p
Copy link
Collaborator

rob-p commented Jan 27, 2022

@Gaura So with the changes implemented, custom geometry is ~19% slower here than the hand-coded sci-seq3 protocol (improved from ~1/3 slower); is that correct? That's a nice improvement. @gmarcais — do you think it's worth testing out PCRE2? Most of these regexes are very short — and if boost is ~20% slower than PCRE2 and we are ~20% slower than the custom parsing code .... maybe that's the whole gap? Any idea how difficult this would be to try?

@Gaura
Copy link
Collaborator Author

Gaura commented Jan 27, 2022

Yes, @rob-p! It's an improvement from ~33% to ~19% slower (I'm not sure why sd is high for custom testing though). Yeah, I'm eager to hear your views too, @gmarcais.

@gmarcais
Copy link
Contributor

I don't know how difficult it would be to use PCRE2, I have never used it directly.

Can we write a thin struct wrapper structure that is defined as the boost object by default, but with#ifdef can be implemented by PCRE2 if we wish? (well, if cmake finds pcre2).

Keeping with boost, we should also try https://www.boost.org/doc/libs/1_78_0/doc/html/xpressive.html . That does not add a new dependency. Don't know about the speed.

@gmarcais
Copy link
Contributor

My little testing: https://github.com/OceanGenomics/RegexBench

Some results on matching 1 million short strings with ~90% positive match and ~10% random strings:

time-manual:	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.06
time-boostregex:	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.28
time-pcre2:	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.23
time-retwo:	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.63
time-boostxpressive:	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:05.76
time-grep:	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.17

retwo is RE2. It is surprisingly highly affected by the number of captures. With 4 captures as above, it is the slowest. Without any it is as fast as grep. And xpressive is very slow, while I expected it to be the fastest!

@rob-p
Copy link
Collaborator

rob-p commented Jan 28, 2022

Very interesting @gmarcais. I wonder if/how allocations have an effect here. Are all methods using pre-allocated space to store their captures? Also, is xpressive drunk?

@gmarcais
Copy link
Contributor

I explicitly preallocate the output array/vector for pcre and re2. Boost regex doesn't seem to offer that (at least, I don't know).

Regarding xpressive: yeah, what a disappointment. And I don't actually save the capture with xpressive. I thought the automaton was entirely generated and optimize at compile time. Apparently creating an automaton with C++ template system must be really hard because the generated code is garbage. Or I am using it wrong. In any case, xpressive as I use it is entirely static (I haven't tested the dynamic version). So it is not useful in our case. I was just curious if it could match hand crafted code. What was I thinking!

@rob-p
Copy link
Collaborator

rob-p commented Jan 28, 2022

I wonder of the preallocation capabilities of pcre2 might offer a bigger benefit over a larger workload. For small patterns processing hundreds of millions of reads, maybe there is a bigger difference?

Gaura added 10 commits March 6, 2022 19:03
- read returned only when read regex search is a success
- only check reads that have barcode and umi for bc and umi
- if regex search fails in bc, return false
- avoid bc and umi string allocation for each read, add them to CustomGeo
- pass hamming distance arguments by ref
@rob-p
Copy link
Collaborator

rob-p commented Feb 28, 2023

Hi all (@Gaura / @gmarcais),

While this has languished somewhat as we try to figure out what regex engine to include and how best to package it, I wanted to mention that I am attempting something similar in rust (which has a canonical regex crate which, I believe, is supposed to be among the fast ones). That repo is over on seq_geom_xform and it relies on seq_geom_parser. I like @Gaura's geometry string specification, so we're going with that for time time being.

If you want to chime in or start a discussion on either of those repos, please do let me know if you have any other thoughts on this generalized scheme. The purpose of the seq_geom_xform crate is actually that it will be both a rust library (to allow parsing complex geometry descriptions as a regex and extract the relevant sequence) and a stand-alone executable that can do streaming sequence transformation from a "complex" barcode geometry (e.g. the sciseq3 above) to a "simple" geometry (fixed position and fixed length barcode and UMI). Thus, one could imagine (at the cost of sticking a rust executable in the invocation here) replacing this feature by a streaming invocation to seq_geom_xform that would take the compressed fastq files as input along with the geometry specification, and which would output two streams (one for each read) with a simplified geometry that could be parsed in the simpler format.

--Rob

@rob-p
Copy link
Collaborator

rob-p commented Mar 8, 2023

Hi @Gaura and @gmarcais,

Just an update, this approach currently seems to be working. We can stream sciseq3 data through our seq_geom_xform tool to convert it to a standard "normalized" geometry, and the quantify using piscem + alevin-fry and get highly-concordant results to what we get with the builtin --sciseq3 flag! In case you are interested, the current grammar that is supported is described here. The syntax is that of the pest parser. In fact, if you copy that grammar into https://pest.rs/#editor you can try out geometry strings live and see how it's parsed!

Base automatically changed from develop to master March 15, 2024 20:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants