
Quadratic time complexity when parsing long recipient stanzas #368

Open

DavidBuchanan314 opened this issue Feb 19, 2023 · 5 comments


DavidBuchanan314 commented Feb 19, 2023

Environment

  • OS: aarch64 Arch Linux
  • rage version: rage 0.9.0

What were you trying to do

Decrypt an age file (attached, with test key) with an unusually long (~4MB) recipient stanza body, followed by a normal X25519 stanza.

Expected outcome (it is a completely valid age file, and works fine with age):

$ age -d -i testkey.txt test.age
Hello, world!

(the above completes near-instantly)

What happened

Parsing takes an extremely long time, and gets progressively slower as it goes. Using pv to monitor the progress like so:

$ pv test.age | rage -d -i testkey.txt
 768KiB 0:02:47 [0.00 B/s] [====>                           ] 18% ETA 0:12:17

(pv's ETA grows constantly)

Something similar also happens with a very long recipient type string. That seems less likely to be a problem in a "real world" scenario, but you might still want to consider it a bug.

Edit: I just learnt about rage's concept of "greasing" - consider this an extreme form of grease, heh

long_recipient.zip

str4d (Owner) commented Feb 19, 2023

I see you've pulled out the high-viscosity grease with a lithium complex soap thickener 😆

The performance issue is almost certainly this parser:

rage/age-core/src/format.rs

Lines 191 to 218 in 3ff541a

fn wrapped_encoded_data(input: &[u8]) -> IResult<&[u8], Vec<&[u8]>> {
    map(
        many_till(
            // Any body lines before the last MUST be full-length.
            terminated(take_while_m_n(64, 64, is_base64_char), newline),
            // Last body line:
            // - MUST be short (empty if necessary).
            // - MUST be a valid Base64 length (i.e. the length must not be 1 mod 4).
            // - MUST NOT leave trailing bits (if the length is 2 or 3 mod 4).
            verify(
                terminated(take_while_m_n(0, 63, is_base64_char), newline),
                |line: &[u8]| match line.len() % 4 {
                    0 => true,
                    1 => false,
                    2 => base64_has_no_trailing_bits_2(line.last().unwrap()),
                    3 => base64_has_no_trailing_bits_3(line.last().unwrap()),
                    // No other cases, but Rust wants an exhaustive match on u8.
                    _ => unreachable!(),
                },
            ),
        ),
        |(full_chunks, partial_chunk): (Vec<&[u8]>, &[u8])| {
            let mut chunks = full_chunks;
            chunks.push(partial_chunk);
            chunks
        },
    )(input)
}

My expectation is that full_chunks reallocates every power-of-two lines. The many_till parser is unaware of the length of the input data, so it:

  • Creates res = Vec::new(), which currently allocates nothing.
  • While a short line has not been found:
    • Finds a full line.
    • Calls res.push(full_line), which grows the Vec in an implementation-defined way, currently:
      • On the first push to the empty Vec, allocating a small initial capacity to avoid repeated tiny reallocations.
      • After that, doubling the capacity whenever it is exhausted.
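To illustrate that growth pattern, here is a small sketch against current std behaviour (the exact capacities are an implementation detail of Vec, not an API guarantee):

```rust
/// Returns every distinct capacity a Vec<u32> passes through while pushing
/// `n` elements one at a time. Illustrative only: Vec's growth strategy is
/// an implementation detail of std, not a guarantee.
fn capacity_steps(n: u32) -> Vec<usize> {
    let mut v: Vec<u32> = Vec::new();
    let mut caps = vec![v.capacity()];
    for i in 0..n {
        v.push(i);
        if v.capacity() != *caps.last().unwrap() {
            caps.push(v.capacity());
        }
    }
    caps
}

fn main() {
    // On current std this is a short list: one initial "small" allocation,
    // then doubling, e.g. [0, 4, 8, 16, 32, 64, 128] for n = 100.
    println!("{:?}", capacity_steps(100));

    // Pre-sizing with with_capacity avoids every intermediate reallocation.
    let mut w: Vec<u32> = Vec::with_capacity(100);
    let cap = w.capacity();
    w.extend(0..100);
    assert_eq!(w.capacity(), cap);
}
```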

We do not know what comes after the stanza (so we can't just break on \n-> or \n---), as this parser is used for parsing both age headers and the v1 plugin IPC protocol. However, now that the age stanza format requires a final short line, we could probably write a variant of many_till that is aware of the newline-separated format, and uses the short-line parser to identify the last line of the wrapped data, and then counts the preceding lines to know what capacity to give the Vec.
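That idea can be sketched without nom as follows. This is a hypothetical illustration only: collect_wrapped is not rage code, and the real parser also validates the Base64 alphabet and trailing bits.

```rust
/// Hypothetical sketch: find the short terminating line first, so the Vec
/// can be sized exactly once. Validation of the Base64 alphabet and of
/// trailing bits is omitted for brevity.
fn collect_wrapped(input: &str) -> Option<Vec<&str>> {
    // First pass: count the full-length (64-char) lines before the short line.
    let full = input.lines().take_while(|l| l.len() == 64).count();

    // Second pass: collect into a Vec with exactly the right capacity.
    let mut lines = input.lines();
    let mut chunks = Vec::with_capacity(full + 1);
    for _ in 0..full {
        chunks.push(lines.next()?);
    }
    let last = lines.next()?;
    // The last line MUST be short, and its length MUST NOT be 1 mod 4.
    if last.len() < 64 && last.len() % 4 != 1 {
        chunks.push(last);
        Some(chunks)
    } else {
        None
    }
}

fn main() {
    let line = "A".repeat(64);
    let input = format!("{line}\nAAAA\n");
    println!("{:?}", collect_wrapped(&input));
}
```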


str4d commented Feb 20, 2023

I captured a flamegraph, and indeed the performance issue is in wrapped_encoded_data, but it's not the allocations!

[flamegraph image]

The bottleneck is this function:

rage/age-core/src/format.rs

Lines 145 to 152 in 3ff541a

fn is_base64_char(c: u8) -> bool {
    // Check against the ASCII values of the standard Base64 character set.
    matches!(
        c,
        // A..=Z | a..=z | 0..=9 | + | /
        65..=90 | 97..=122 | 48..=57 | 43 | 47,
    )
}
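For readability, the same predicate can be written with byte literals instead of raw ASCII code points (an equivalent rewrite for illustration, not rage's code):

```rust
/// Equivalent to the matches! above, using byte literals for the standard
/// Base64 alphabet instead of numeric ASCII values.
fn is_base64_char(c: u8) -> bool {
    matches!(c, b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'+' | b'/')
}

fn main() {
    assert!(is_base64_char(b'A') && is_base64_char(b'/'));
    assert!(!is_base64_char(b'=') && !is_base64_char(b'\n'));
}
```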

For a given slice, is_base64_char is called on every byte twice (due to how many_till works). However, thinking about it, the cause of this behaviour is more fundamentally about my use of parser combinators, and how they interact with io::Read:

rage/age/src/format.rs

Lines 64 to 89 in 3ff541a

impl Header {
    pub(crate) fn read<R: Read>(mut input: R) -> Result<Self, DecryptError> {
        let mut data = vec![];
        loop {
            match read::header(&data) {
                Ok((_, mut header)) => {
                    if let Header::V1(h) = &mut header {
                        h.encoded_bytes = Some(data);
                    }
                    break Ok(header);
                }
                Err(nom::Err::Incomplete(nom::Needed::Size(n))) => {
                    // Read the needed additional bytes. We need to be careful how the
                    // parser is constructed, because if we read more than we need, the
                    // remainder of the input will be truncated.
                    let m = data.len();
                    let new_len = m + n.get();
                    data.resize(new_len, 0);
                    input.read_exact(&mut data[m..new_len])?;
                }
                Err(_) => {
                    break Err(DecryptError::InvalidHeader);
                }
            }
        }
    }

Given an io::Read type, we don't have access to any external buffering, so we cannot over-read (or else the data after the header would be truncated), which means we can't use a buffered reader here. But the combinator-based parsers produce the entire header at once (so we avoid holding partial or invalid header states in memory). So the way we parse files is to run the parser until it tells us how many more bytes it needs, read that many bytes from the io::Read, and repeat.

The upshot is that as header length increases, the number of times we will run is_base64_char on the early bytes of the header increases.
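To put rough numbers on the quadratic behaviour: if each Needed::Size round effectively restarts the parse from byte 0, the total number of line-scans follows the triangular numbers. A back-of-the-envelope model (the one-restart-per-line assumption is mine, matching the behaviour described above):

```rust
/// Total line-scans if the parser restarts from the top once per line:
/// 1 + 2 + ... + n = n * (n + 1) / 2.
fn total_line_scans(lines: u64) -> u64 {
    lines * (lines + 1) / 2
}

fn main() {
    // A ~4 MiB stanza body is ~65,536 full 64-character lines, so the early
    // lines get re-scanned ~2 billion times in aggregate; with is_base64_char
    // run over each byte twice, the byte-level work reaches the hundreds of
    // billions.
    println!("line-scans: {}", total_line_scans(4 * 1024 * 1024 / 64));
}
```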

I suspect the only ways to improve on this are:

  • Break the encapsulation of the current APIs, and make the outer header parser aware of the line-based nature of the header. Then instead of reading only the number of bytes that the parser asks for, we always read to the next newline (effectively buffering lines, like age does).
  • Break the type safety of the current APIs, and store partially-parsed stanzas in-memory instead of only fully-parsed stanzas. I strongly dislike this, but it's seemingly the only way to avoid re-parsing earlier lines of very large stanzas, and it's what age does.


str4d commented Feb 20, 2023

I added a Decryptor::new_buffered API, and implemented a basic change to use BufRead::read_until(b'\n', ..) instead of reading precisely what the parser asks for. The effect on the header-parsing benchmarks (Ryzen 9 5950X, parsing a header containing 1-10 native age recipients):
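The buffered strategy can be sketched like this (illustrative only: the "---" prefix check stands in for re-running the real header parser, and relies on the v1 header ending with a MAC line):

```rust
use std::io::{self, BufRead};

/// Sketch of line-buffered header reading: pull whole lines with read_until
/// instead of asking the parser for exact byte counts. Because read_until
/// stops at the newline, the payload after the header is left in the reader.
fn read_header_lines<R: BufRead>(mut input: R) -> io::Result<Vec<u8>> {
    let mut data = Vec::new();
    loop {
        let start = data.len();
        if input.read_until(b'\n', &mut data)? == 0 {
            break; // EOF
        }
        if data[start..].starts_with(b"---") {
            break; // stand-in for "the header parser succeeded"
        }
    }
    Ok(data)
}

fn main() {
    let cur = io::Cursor::new(b"-> stanza\nBODY\n--- mac\npayload".to_vec());
    let header = read_header_lines(cur).unwrap();
    assert_eq!(header, b"-> stanza\nBODY\n--- mac\n");
}
```

This mirrors the thread's earlier observation that age itself effectively buffers by lines rather than by parser-requested byte counts.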

Before

header/parse/1          time:   [27.608 µs 30.490 µs 33.336 µs]                            
                        thrpt:  [29.998 Kelem/s 32.798 Kelem/s 36.221 Kelem/s]
header/parse/2          time:   [55.729 µs 59.756 µs 63.719 µs]                           
                        thrpt:  [31.388 Kelem/s 33.469 Kelem/s 35.888 Kelem/s]
header/parse/3          time:   [96.203 µs 101.46 µs 106.67 µs]                           
                        thrpt:  [28.124 Kelem/s 29.569 Kelem/s 31.184 Kelem/s]
header/parse/4          time:   [152.19 µs 158.87 µs 165.37 µs]                           
                        thrpt:  [24.189 Kelem/s 25.177 Kelem/s 26.283 Kelem/s]
header/parse/5          time:   [214.20 µs 223.40 µs 232.31 µs]                           
                        thrpt:  [21.523 Kelem/s 22.381 Kelem/s 23.342 Kelem/s]
header/parse/6          time:   [290.83 µs 301.38 µs 312.14 µs]                           
                        thrpt:  [19.222 Kelem/s 19.908 Kelem/s 20.631 Kelem/s]
header/parse/7          time:   [400.78 µs 411.78 µs 423.08 µs]                           
                        thrpt:  [16.546 Kelem/s 16.999 Kelem/s 17.466 Kelem/s]
header/parse/8          time:   [514.58 µs 528.42 µs 542.48 µs]                           
                        thrpt:  [14.747 Kelem/s 15.140 Kelem/s 15.547 Kelem/s]
header/parse/9          time:   [636.98 µs 652.54 µs 667.84 µs]                           
                        thrpt:  [13.476 Kelem/s 13.792 Kelem/s 14.129 Kelem/s]

After

header/parse/1          time:   [1.8364 µs 1.9232 µs 2.0160 µs]                            
                        thrpt:  [496.04 Kelem/s 519.96 Kelem/s 544.54 Kelem/s]
                 change:
                        time:   [-94.062% -93.586% -93.044%] (p = 0.00 < 0.05)
                        thrpt:  [+1337.6% +1459.1% +1584.0%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
header/parse/2          time:   [2.8876 µs 2.9794 µs 3.0753 µs]                            
                        thrpt:  [650.34 Kelem/s 671.27 Kelem/s 692.62 Kelem/s]
                 change:
                        time:   [-95.384% -95.128% -94.844%] (p = 0.00 < 0.05)
                        thrpt:  [+1839.4% +1952.4% +2066.5%]
                        Performance has improved.
header/parse/3          time:   [4.2078 µs 4.3345 µs 4.4631 µs]                            
                        thrpt:  [672.18 Kelem/s 692.13 Kelem/s 712.95 Kelem/s]
                 change:
                        time:   [-95.887% -95.716% -95.525%] (p = 0.00 < 0.05)
                        thrpt:  [+2134.8% +2234.1% +2331.1%]
                        Performance has improved.
header/parse/4          time:   [5.8106 µs 5.9730 µs 6.1503 µs]                            
                        thrpt:  [650.38 Kelem/s 669.68 Kelem/s 688.40 Kelem/s]
                 change:
                        time:   [-96.337% -96.197% -96.043%] (p = 0.00 < 0.05)
                        thrpt:  [+2427.0% +2529.4% +2630.0%]
                        Performance has improved.
header/parse/5          time:   [7.8963 µs 8.0530 µs 8.2136 µs]                            
                        thrpt:  [608.75 Kelem/s 620.89 Kelem/s 633.21 Kelem/s]
                 change:
                        time:   [-96.446% -96.326% -96.200%] (p = 0.00 < 0.05)
                        thrpt:  [+2531.3% +2621.9% +2713.5%]
                        Performance has improved.
header/parse/6          time:   [10.423 µs 10.623 µs 10.840 µs]                            
                        thrpt:  [553.51 Kelem/s 564.82 Kelem/s 575.65 Kelem/s]
                 change:
                        time:   [-96.554% -96.456% -96.357%] (p = 0.00 < 0.05)
                        thrpt:  [+2644.7% +2721.7% +2802.3%]
                        Performance has improved.
header/parse/7          time:   [12.668 µs 12.849 µs 13.049 µs]                            
                        thrpt:  [536.44 Kelem/s 544.79 Kelem/s 552.59 Kelem/s]
                 change:
                        time:   [-96.865% -96.784% -96.702%] (p = 0.00 < 0.05)
                        thrpt:  [+2931.8% +3009.6% +3090.2%]
                        Performance has improved.
header/parse/8          time:   [15.245 µs 15.464 µs 15.709 µs]                            
                        thrpt:  [509.25 Kelem/s 517.32 Kelem/s 524.75 Kelem/s]
                 change:
                        time:   [-97.082% -97.015% -96.947%] (p = 0.00 < 0.05)
                        thrpt:  [+3175.6% +3250.1% +3326.6%]
                        Performance has improved.
header/parse/9          time:   [19.092 µs 19.343 µs 19.607 µs]                            
                        thrpt:  [459.02 Kelem/s 465.29 Kelem/s 471.40 Kelem/s]
                 change:
                        time:   [-97.074% -97.015% -96.950%] (p = 0.00 < 0.05)
                        thrpt:  [+3179.2% +3249.9% +3317.3%]
                        Performance has improved.

Soooo I guess this means I'm going to add buffered versions of my constructors. The very large header is still quadratic after this change, so I'll still need to consider breaking the type safety of the current types, or at least figure out a way to encapsulate the partial parsing.


str4d commented Mar 24, 2023

#377 implements the above, which also required impl io::BufRead for ArmoredReader in order for rage to use it. I still need to implement Decryptor::new_async_buffered and impl AsyncBufRead for ArmoredReader to provide the equivalent performance improvements for the async APIs.

I've also confirmed that this change is insufficient for the long_recipient.zip test case to work with rage. So I'll look into breaking apart the internal type safety next (and hope that I can keep this breakage internal).


str4d commented Jun 12, 2023

#393 implements Decryptor::new_async_buffered and impl AsyncBufRead for ArmoredReader to provide the equivalent performance improvements for the async APIs.

I'm going to deploy these performance improvements in rage 0.9.2, and defer the rest of this issue to later.

@str4d str4d removed this from the rage 0.9.2 milestone Jun 12, 2023