Allow custom geometry single-cell protocols #734

Gaura · 2022-01-14T14:45:15Z

This PR introduces a new feature that will allow users to specify custom single cell protocols and use with alevin. Custom Geometry (--custom-geo) should be used when:
1. Barcode or umi have variable lengths
2. There is known fixed sequence in the reads
3. There is some sequence to be excluded
From the input peglib spec it creates a regex. Boost regex library is used to parse the reads.

Apart from small tests on multiple outlier cases, it was tested on a sci-RNA-seq3 sample SRR7827207 for speed. For this the spec is --custom-geo 1{b[9-10]f[CAGAGC]u[8]b[10]}2{r}. It says:

Read 1 starts with barcode of variable length 9-10 bp, followed by
A fixed sequence CAGAGC, then
A umi of length 8, and lastly
barcode of length 10.
The second read is all biological.

The barcodes are concatenated in the output and a padding sequence is added to make the length as max length + 1. The extra base is added so that we don't introduce spurious matches in barcode. For example, if the barcodes have length 3-4 bp and the two barcodes are ATG and ATGA, after padding they will be ATGAC and ATGAA resp. Adding just A to shorter barcode would result in a spurious match.

Since --custom-geo uses regex, it is slower than protocol specific flag. The time with 8 threads on a large Ubuntu 20 machine:

Using --sciseq3:

real    1m0.425s
user    7m21.501s
sys     0m2.964s

Using --custom-geo

real    1m39.887s
user    11m55.602s
sys     0m6.839s

Notably, it is about 66% slower. However, it allows support for almost all current and future protocols.

There will be a tutorial shortly on how and when to use this and how is it different from other flags such as --umi-geometry, --read-geometry and --bc-geomtery. There is scope of speed improvement in the future along with integration of all custom geometry processing protocols.

rob-p · 2022-01-14T16:53:25Z

Thanks @Gaura ! Requesting @gmarcais input on the regex speed. Could it be the boost impl, or the way we are using it?

rob-p · 2022-01-14T16:55:10Z

@Gaura one thought, since the regex doesnt change, can we precomputed that just once?

gmarcais · 2022-01-14T17:03:50Z

I don't have a great handle on the code yet, but precomputing the regex object as Rob suggested is essential.

Gaura · 2022-01-14T17:42:37Z

Great suggestion, thanks @rob-p and @gmarcais . Somehow, I missed it. I added it in the latest commit. Speed now from 3 runs:

real    1m19.884s 1m15.891s 1m21.462s                                                                      
user    8m9.189s 9m1.100s 9m48.764s 
sys     0m5.079s 0m5.170s 0m3.477s

50% improvement over the past results, i.e., about 33% slower than specific protocol flag now. Although, ideally I should have ran the earlier tests thrice but the sd is small so results should be valid. Nonetheless, I'll do more speed tests with versions in the future.

Let me know what other thoughts you have and what else have I missed. I have some minor improvements in mind too.

gmarcais · 2022-01-14T17:54:15Z

Can you give us the regular expression that are generated and used for matching?

Gaura · 2022-01-14T18:00:02Z

Sure.

reg[0]: ([ATGC]{9,10})(CAGAGC)([ATGC]{8})([ATGC]{10})
reg[1]: ([ATGC]{1,})

For other custom geometries, these can be obtained by uncommenting lines 1473 and 1474 in src/AlevinUtils.cpp.

gmarcais · 2022-01-14T18:06:26Z

Does it make sense to have '^' and/or '$' around the regex? Having anchors usually speeds up a regex. Otherwise, I am not sure. Have we tested other libraries than boost::regexp?

Gaura · 2022-01-14T18:14:54Z

I can try '^' and '$' after lunch. I didn't test any other library, since boost was already a pre-requisite for salmon and my focus was on getting it to work first. But now that it is done, other libraries can be tried. However, at this moment, I'm not clear about the effort and speed-up ratio.

Gaura · 2022-01-14T19:04:06Z

Including '^' didn't change the speed, and '$' can't be added as there are extra bases (not in protocol spec) at end which can vary in number.

real    1m19.392s
user    9m38.357s
sys     0m4.176s

rob-p · 2022-01-22T02:05:16Z

Thanks @Gaura! @gmarcais — I don't have a good sense about how boost stacks up against other regex libraries, but I think it's competitive. Actually modern C++ has built-in regex support in the standard library, but I've heard many bad things about the standard implementation.

I wonder if you think @gmarcais that the approximate difference between hand-rolled extraction code and regex extraction that we see here is in line with what you'd expect. Would you expect a 1/3 slowdown to the overall procedure by using regex extraction?

rob-p

I've left some comments below, and it would be great to get feedback from @Gaura and @gmarcais.

include/SingleCellProtocols.hpp

src/AlevinUtils.cpp

Gaura · 2022-01-27T13:37:49Z

Hi @rob-p! Thanks for all the great suggestions and comments. I have addressed all of them. I also tested the speed after the changes. The test was done 3 times in the same way as done earlier.

--sciseq3

real    0m58.463s  0m57.884s  0m57.413s
user    7m0.652s  7m1.731s  7m1.278s
sys     0m3.305s  0m3.078s  0m2.665s

--custom-geo

real    1m7.411s  1m14.988s  1m3.868s
user    8m8.795s  8m40.302s  7m49.107s
sys     0m4.194s  0m6.412s  0m2.969s

The real time in case 1. was 57.92 ± 0.57s and for case 2. it was 68.75 ± 5.68s, which is about 19% slower.

gmarcais · 2022-01-27T15:17:26Z

1/3 loss in performance seems significant, given that presumably the code does something else than just parsing UMIs.

I am looking at Boost own comparison and benchmarks, and on long inputs (20MB) it is competitive with PCRE2. But with short inputs (20-30 characters) PCRE2 is consistently faster (by about 30% 🤔 ). And if PCRE2 is feature full, not sure it is the fastest either, especially for simple regexp.

rob-p · 2022-01-27T15:39:18Z

@Gaura So with the changes implemented, custom geometry is ~19% slower here than the hand-coded sci-seq3 protocol (improved from ~1/3 slower); is that correct? That's a nice improvement. @gmarcais — do you think it's worth testing out PCRE2? Most of these regexes are very short — and if boost is ~20% slower than PCRE2 and we are ~20% slower than the custom parsing code .... maybe that's the whole gap? Any idea how difficult this would be to try?

Gaura · 2022-01-27T15:43:28Z

Yes, @rob-p! It's an improvement from ~33% to ~19% slower (I'm not sure why sd is high for custom testing though). Yeah, I'm eager to hear your views too, @gmarcais.

gmarcais · 2022-01-27T15:58:56Z

I don't know how difficult it would be to use PCRE2, I have never used it directly.

Can we write a thin struct wrapper structure that is defined as the boost object by default, but with#ifdef can be implemented by PCRE2 if we wish? (well, if cmake finds pcre2).

Keeping with boost, we should also try https://www.boost.org/doc/libs/1_78_0/doc/html/xpressive.html . That does not add a new dependency. Don't know about the speed.

gmarcais · 2022-01-28T12:46:25Z

My little testing: https://github.com/OceanGenomics/RegexBench

Some results on matching 1 million short strings with ~90% positive match and ~10% random strings:

time-manual:	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.06
time-boostregex:	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.28
time-pcre2:	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.23
time-retwo:	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.63
time-boostxpressive:	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:05.76
time-grep:	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.17

retwo is RE2. It is surprisingly highly affected by the number of captures. With 4 captures as above, it is the slowest. Without any it is as fast as grep. And xpressive is very slow, while I expected it to be the fastest!

rob-p · 2022-01-28T12:56:25Z

Very interesting @gmarcais. I wonder if/how allocations have an effect here. Are all methods using pre-allocated space to store their captures? Also, is xpressive drunk?

gmarcais · 2022-01-28T13:11:42Z

I explicitly preallocate the output array/vector for pcre and re2. Boost regex doesn't seem to offer that (at least, I don't know).

Regarding xpressive: yeah, what a disappointment. And I don't actually save the capture with xpressive. I thought the automaton was entirely generated and optimize at compile time. Apparently creating an automaton with C++ template system must be really hard because the generated code is garbage. Or I am using it wrong. In any case, xpressive as I use it is entirely static (I haven't tested the dynamic version). So it is not useful in our case. I was just curious if it could match hand crafted code. What was I thinking!

rob-p · 2022-01-28T13:55:36Z

I wonder of the preallocation capabilities of pcre2 might offer a bigger benefit over a larger workload. For small patterns processing hundreds of millions of reads, maybe there is a bigger difference?

- introduce exclude 'x' to parsing - change positions with lengths

- read returned only when read regex search is a success - only check reads that have barcode and umi for bc and umi - if regex search fails in bc, return false

- avoid bc and umi string allocation for each read, add them to CustomGeo - pass hamming distance arguments by ref

rob-p · 2023-02-28T04:17:50Z

Hi all (@Gaura / @gmarcais),

While this has languished somewhat as we try to figure out what regex engine to include and how best to package it, I wanted to mention that I am attempting something similar in rust (which has a canonical regex crate which, I believe, is supposed to be among the fast ones). That repo is over on seq_geom_xform and it relies on seq_geom_parser. I like @Gaura's geometry string specification, so we're going with that for time time being.

If you want to chime in or start a discussion on either of those repos, please do let me know if you have any other thoughts on this generalized scheme. The purpose of the seq_geom_xform crate is actually that it will be both a rust library (to allow parsing complex geometry descriptions as a regex and extract the relevant sequence) and a stand-alone executable that can do streaming sequence transformation from a "complex" barcode geometry (e.g. the sciseq3 above) to a "simple" geometry (fixed position and fixed length barcode and UMI). Thus, one could imagine (at the cost of sticking a rust executable in the invocation here) replacing this feature by a streaming invocation to seq_geom_xform that would take the compressed fastq files as input along with the geometry specification, and which would output two streams (one for each read) with a simplified geometry that could be parsed in the simpler format.

--Rob

rob-p · 2023-03-08T20:32:41Z

Hi @Gaura and @gmarcais,

Just an update, this approach currently seems to be working. We can stream sciseq3 data through our seq_geom_xform tool to convert it to a standard "normalized" geometry, and the quantify using piscem + alevin-fry and get highly-concordant results to what we get with the builtin --sciseq3 flag! In case you are interested, the current grammar that is supported is described here. The syntax is that of the pest parser. In fact, if you copy that grammar into https://pest.rs/#editor you can try out geometry strings live and see how it's parsed!

Gaura requested review from k3yavi and rob-p January 14, 2022 14:45

rob-p reviewed Jan 22, 2022

View reviewed changes

Gaura force-pushed the custom_geometry branch from b293731 to 66b6f68 Compare January 27, 2022 13:12

Gaura added 7 commits March 6, 2022 18:20

initial custom geometry modifications - successful make

b11dc42

add custom barcode extraction

12dda2c

return read sequence

25385fa

update modifyRegex and parsing:

e59434f

- introduce exclude 'x' to parsing - change positions with lengths

add more checks to custom geo

50decde

add padding for bc and umi seqs

52e76b0

get read sequence using regex

6abd281

Gaura added 10 commits March 6, 2022 19:03

make customGeo class members extern instead of global variables

9eb148a

introduce regex checks and avoid logical errors:

d0d56ff

- read returned only when read regex search is a success - only check reads that have barcode and umi for bc and umi - if regex search fails in bc, return false

add code documentation

5a12167

make regex static and initliaze once

52ce33c

modify data types

0019606

update data types and definitions

f46450d

updates:

1323deb

- avoid bc and umi string allocation for each read, add them to CustomGeo - pass hamming distance arguments by ref

make enum class for customGeo readpart type

7b14dde

change addPadding approach

e85be97

clarify comment for bc and umi vector

e9dbddb

Gaura force-pushed the custom_geometry branch from 2100acc to e9dbddb Compare March 25, 2022 17:14

gringer mentioned this pull request Nov 1, 2022

Feature request (Alevin): add rhapsody barcode mode #628

Open

Gaura mentioned this pull request Dec 10, 2022

doc on read geometry #816

Open

Base automatically changed from develop to master March 15, 2024 20:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow custom geometry single-cell protocols #734

Allow custom geometry single-cell protocols #734

Gaura commented Jan 14, 2022

rob-p commented Jan 14, 2022

rob-p commented Jan 14, 2022

gmarcais commented Jan 14, 2022

Gaura commented Jan 14, 2022

gmarcais commented Jan 14, 2022

Gaura commented Jan 14, 2022

gmarcais commented Jan 14, 2022

Gaura commented Jan 14, 2022

Gaura commented Jan 14, 2022

rob-p commented Jan 22, 2022

rob-p left a comment

Gaura commented Jan 27, 2022

gmarcais commented Jan 27, 2022

rob-p commented Jan 27, 2022

Gaura commented Jan 27, 2022

gmarcais commented Jan 27, 2022

gmarcais commented Jan 28, 2022

rob-p commented Jan 28, 2022

gmarcais commented Jan 28, 2022

rob-p commented Jan 28, 2022

rob-p commented Feb 28, 2023

rob-p commented Mar 8, 2023

Allow custom geometry single-cell protocols #734

Are you sure you want to change the base?

Allow custom geometry single-cell protocols #734

Conversation

Gaura commented Jan 14, 2022

rob-p commented Jan 14, 2022

rob-p commented Jan 14, 2022

gmarcais commented Jan 14, 2022

Gaura commented Jan 14, 2022

gmarcais commented Jan 14, 2022

Gaura commented Jan 14, 2022

gmarcais commented Jan 14, 2022

Gaura commented Jan 14, 2022

Gaura commented Jan 14, 2022

rob-p commented Jan 22, 2022

rob-p left a comment

Choose a reason for hiding this comment

Gaura commented Jan 27, 2022

gmarcais commented Jan 27, 2022

rob-p commented Jan 27, 2022

Gaura commented Jan 27, 2022

gmarcais commented Jan 27, 2022

gmarcais commented Jan 28, 2022

rob-p commented Jan 28, 2022

gmarcais commented Jan 28, 2022

rob-p commented Jan 28, 2022

rob-p commented Feb 28, 2023

rob-p commented Mar 8, 2023