Add flush config options to Gzip and Brotli AsyncOutputStream impls #2027

jasnell · 2024-05-08T15:48:01Z

Adds the ability to configure the per-write flush behavior for both GzipAsyncOutputStream and BrotliAsyncOutputStream, keeping the current defaults in place.

By default, these implementations will flush only when either flush() is explicitly called or their output buffers are full. This is the ideal behavior for optimal compression but can increase perceived latency of the stream. This PR adds options to both Gzip and Brotli implementations to customize the per-write flush setting to permit for frequent flushes.

fhanau · 2024-05-08T16:00:40Z

Will take a look at this soon. Note that compression flushing is a can of worms – Zlib has a number of different flush options of which Z_SYNC_FLUSH should be sane but I'll take another look at them. Brotli is peculiar about when it provides output that led to a hard-to-find bug while developing support for it here, the current implementation should work with or without additional flushing but I want to understand well how it will behave differently.

jasnell · 2024-05-08T23:47:17Z

Yep, the way this is set up the exact flush option can be specified as an option, making it possible for us to evaluate the different options. Please definitely do give this some careful thought. If the brotli implementation is sufficient as is, I'm happy dropping the changes to that as the problem we're looking to address immediately is largely with gzip compression. I just did the brotli changes to keep things symmetrical.

I was going to update this PR to incorporate a deferred flush using kj::evalLater(...) but honestly it adds too much complexity for what seems worthwhile. I'm open to other suggestions on how to implement it. The issue is largely that the implementation assumes that only one call to pump() is active at any given time. The impl in this PR ends up flushing on every write which is... fun... but the alternative means more complicated bookkeeping to ensure that there are no overlapping calls to pump().

c++/src/kj/compat/brotli-test.c++

fhanau · 2024-05-09T15:52:23Z

LGTM – I reviewed the current implementation and the documentation for Z_SYNC_FLUSH and BROTLI_OPERATION_FLUSH, this approach should be correct.
One style concern I have is that this requires external code to specify Z_SYNC_FLUSH when using the gzip implementation – since there are 3 different flush flags supported by zlib this can cause confusion, it would be better to handle this like brotli does and provide a bool writeFlush parameter in GzipAsyncOutputStream::Options instead.

c++/src/kj/compat/brotli-test.c++

jasnell · 2024-05-09T16:27:47Z

@fhanau :

...One style concern I have is that this requires external code to specify Z_SYNC_FLUSH when using the gzip implementation

That was intentional to give us some flexibility later to use one of the other flush flags if necessary.

fhanau · 2024-05-09T17:10:04Z

@fhanau :

...One style concern I have is that this requires external code to specify Z_SYNC_FLUSH when using the gzip implementation

That was intentional to give us some flexibility later to use one of the other flush flags if necessary.

I think that's valid. From the comments in zlib.h it sounded to me like the other flags might be slightly more efficient in some cases and avoid some empty deflate blocks being written. It felt unlikely to be relevant for us though – Z_SYNC_FLUSH is the most widely used option and in cases where compression efficiency is important we will hopefully not be flushing excessively or not be using zlib in the first place... Not a big issue either way, this can go ahead as-is.

kentonv · 2024-05-10T19:33:00Z

I really think we need to be using evalLast (not evalLater) here, otherwise we're forcing a compressor flush on every chunk even if we know more data is coming immediately.

kentonv · 2024-05-10T19:55:07Z

To elaborate, it is quite common to have several write() calls invoked in rapid succession. When pumping from one stream to another, we usually tryRead() with minBytes = 1 into a 4k buffer, then write() that data, before reading again. At best we get 4k at a time, which is an awfully small window for compression. But we could also be reading from, say, a transfer-encoding: chunked body which returns after each chunk, which could be arbitrarily small chunks. Flushing after each of those could be quite awful for compression.

jasnell · 2024-05-13T16:11:11Z

@kentonv ... what is your recommendation on the best way to handle this with evalLast()? Specifically, since calls to pump() cannot overlap we need to introduce a small amount of book keeping to ensure that another call to write()/end()/flush() does not overlap with the scheduled call to evalLast() as well as add a check to ensure that we don't schedule more evalLast() calls when one is already pending. I want to avoid adding too much more to the existing impl here. I can see a couple ways of doing it but want to know which approach you'd prefer

kentonv · 2024-05-13T16:34:17Z

Probably something like...

enum FlushState {
  // No flush is pending.
  NONE,

  // A flush is pending. We're waiting for evalLast() before starting the flush.
  PENDING,

  // A flush is currently running. Any other writes must wait for `flushTask`.
  ACTIVE
};
FlushState flushState = NONE;

// In PENDING or ACTIVE state, this promise contains the evalLast(flush()) task.
kj::Promise<void> flushTask = nullptr;

At the start of a pump():

If the state is ACTIVE, wait for flushTask to complete before pumping.
If the state is PENDING, cancel flushTask. (It hasn't started flushing yet, and we can schedule a new one later.)

At the end of a pump():

Set state to PENDING.
Schedule flushTask = evalLast([this]() {flushState = ACTIVE; flush();});

kentonv · 2024-05-13T16:36:05Z

BTW I think we can always use Z_SYNC_FLUSH. Let's not let clients choose the flag until we have a use case. I'm concerned some of the flush flags also require slightly different control flow around them, so letting people choose those flags won't actually work anyway.

jasnell requested review from fhanau, mikea and kentonv May 8, 2024 15:48

jasnell force-pushed the jsnell/compression-stream-flush-option branch from 77d13a7 to 4e796a9 Compare May 9, 2024 01:18

fhanau reviewed May 9, 2024

View reviewed changes

c++/src/kj/compat/brotli-test.c++ Show resolved Hide resolved

jasnell force-pushed the jsnell/compression-stream-flush-option branch from 4e796a9 to b8d7856 Compare May 9, 2024 15:11

fhanau reviewed May 9, 2024

View reviewed changes

c++/src/kj/compat/brotli-test.c++ Show resolved Hide resolved

fhanau approved these changes May 9, 2024

View reviewed changes

jasnell added 2 commits May 13, 2024 15:16

Add new GzipAsyncOutputStream flush option

61709b7

Add new BrotliAsyncOutputStream flush option

dad7526

jasnell force-pushed the jsnell/compression-stream-flush-option branch from b8d7856 to dad7526 Compare May 13, 2024 22:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add flush config options to Gzip and Brotli AsyncOutputStream impls #2027

Add flush config options to Gzip and Brotli AsyncOutputStream impls #2027

jasnell commented May 8, 2024

fhanau commented May 8, 2024

jasnell commented May 8, 2024

fhanau commented May 9, 2024

jasnell commented May 9, 2024

fhanau commented May 9, 2024

kentonv commented May 10, 2024

kentonv commented May 10, 2024

jasnell commented May 13, 2024

kentonv commented May 13, 2024

kentonv commented May 13, 2024 •

edited

Add flush config options to Gzip and Brotli AsyncOutputStream impls #2027

Are you sure you want to change the base?

Add flush config options to Gzip and Brotli AsyncOutputStream impls #2027

Conversation

jasnell commented May 8, 2024

fhanau commented May 8, 2024

jasnell commented May 8, 2024

fhanau commented May 9, 2024

jasnell commented May 9, 2024

fhanau commented May 9, 2024

kentonv commented May 10, 2024

kentonv commented May 10, 2024

jasnell commented May 13, 2024

kentonv commented May 13, 2024

kentonv commented May 13, 2024 • edited

kentonv commented May 13, 2024 •

edited