Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add flush config options to Gzip and Brotli AsyncOutputStream impls #2027

Open
wants to merge 2 commits into
base: v2
Choose a base branch
from

Conversation

jasnell
Copy link
Collaborator

@jasnell jasnell commented May 8, 2024

Adds the ability to configure the per-write flush behavior for both GzipAsyncOutputStream and BrotliAsyncOutputStream, keeping the current defaults in place.

By default, these implementations will flush only when either flush() is explicitly called or their output buffers are full. This is the ideal behavior for optimal compression but can increase perceived latency of the stream. This PR adds options to both Gzip and Brotli implementations to customize the per-write flush setting to permit for frequent flushes.

@jasnell jasnell requested review from fhanau, mikea and kentonv May 8, 2024 15:48
@fhanau
Copy link
Collaborator

fhanau commented May 8, 2024

Will take a look at this soon. Note that compression flushing is a can of worms – Zlib has a number of different flush options of which Z_SYNC_FLUSH should be sane but I'll take another look at them. Brotli is peculiar about when it provides output that led to a hard-to-find bug while developing support for it here, the current implementation should work with or without additional flushing but I want to understand well how it will behave differently.

@jasnell
Copy link
Collaborator Author

jasnell commented May 8, 2024

Yep, the way this is set up the exact flush option can be specified as an option, making it possible for us to evaluate the different options. Please definitely do give this some careful thought. If the brotli implementation is sufficient as is, I'm happy dropping the changes to that as the problem we're looking to address immediately is largely with gzip compression. I just did the brotli changes to keep things symmetrical.

I was going to update this PR to incorporate a deferred flush using kj::evalLater(...) but honestly it adds too much complexity for what seems worthwhile. I'm open to other suggestions on how to implement it. The issue is largely that the implementation assumes that only one call to pump() is active at any given time. The impl in this PR ends up flushing on every write which is... fun... but the alternative means more complicated bookkeeping to ensure that there are no overlapping calls to pump().

@jasnell jasnell force-pushed the jsnell/compression-stream-flush-option branch from 77d13a7 to 4e796a9 Compare May 9, 2024 01:18
@jasnell jasnell force-pushed the jsnell/compression-stream-flush-option branch from 4e796a9 to b8d7856 Compare May 9, 2024 15:11
@fhanau
Copy link
Collaborator

fhanau commented May 9, 2024

LGTM – I reviewed the current implementation and the documentation for Z_SYNC_FLUSH and BROTLI_OPERATION_FLUSH, this approach should be correct.
One style concern I have is that this requires external code to specify Z_SYNC_FLUSH when using the gzip implementation – since there are 3 different flush flags supported by zlib this can cause confusion, it would be better to handle this like brotli does and provide a bool writeFlush parameter in GzipAsyncOutputStream::Options instead.

@jasnell
Copy link
Collaborator Author

jasnell commented May 9, 2024

@fhanau :

...One style concern I have is that this requires external code to specify Z_SYNC_FLUSH when using the gzip implementation

That was intentional to give us some flexibility later to use one of the other flush flags if necessary.

@fhanau
Copy link
Collaborator

fhanau commented May 9, 2024

@fhanau :

...One style concern I have is that this requires external code to specify Z_SYNC_FLUSH when using the gzip implementation

That was intentional to give us some flexibility later to use one of the other flush flags if necessary.

I think that's valid. From the comments in zlib.h it sounded to me like the other flags might be slightly more efficient in some cases and avoid some empty deflate blocks being written. It felt unlikely to be relevant for us though – Z_SYNC_FLUSH is the most widely used option and in cases where compression efficiency is important we will hopefully not be flushing excessively or not be using zlib in the first place... Not a big issue either way, this can go ahead as-is.

@kentonv
Copy link
Member

kentonv commented May 10, 2024

I really think we need to be using evalLast (not evalLater) here, otherwise we're forcing a compressor flush on every chunk even if we know more data is coming immediately.

@kentonv
Copy link
Member

kentonv commented May 10, 2024

To elaborate, it is quite common to have several write() calls invoked in rapid succession. When pumping from one stream to another, we usually tryRead() with minBytes = 1 into a 4k buffer, then write() that data, before reading again. At best we get 4k at a time, which is an awfully small window for compression. But we could also be reading from, say, a transfer-encoding: chunked body which returns after each chunk, which could be arbitrarily small chunks. Flushing after each of those could be quite awful for compression.

@jasnell
Copy link
Collaborator Author

jasnell commented May 13, 2024

@kentonv ... what is your recommendation on the best way to handle this with evalLast()? Specifically, since calls to pump() cannot overlap we need to introduce a small amount of book keeping to ensure that another call to write()/end()/flush() does not overlap with the scheduled call to evalLast() as well as add a check to ensure that we don't schedule more evalLast() calls when one is already pending. I want to avoid adding too much more to the existing impl here. I can see a couple ways of doing it but want to know which approach you'd prefer

@kentonv
Copy link
Member

kentonv commented May 13, 2024

Probably something like...

enum FlushState {
  // No flush is pending.
  NONE,

  // A flush is pending. We're waiting for evalLast() before starting the flush.
  PENDING,

  // A flush is currently running. Any other writes must wait for `flushTask`.
  ACTIVE
};
FlushState flushState = NONE;

// In PENDING or ACTIVE state, this promise contains the evalLast(flush()) task.
kj::Promise<void> flushTask = nullptr;

At the start of a pump():

  • If the state is ACTIVE, wait for flushTask to complete before pumping.
  • If the state is PENDING, cancel flushTask. (It hasn't started flushing yet, and we can schedule a new one later.)

At the end of a pump():

  • Set state to PENDING.
  • Schedule flushTask = evalLast([this]() {flushState = ACTIVE; flush();});

@kentonv
Copy link
Member

kentonv commented May 13, 2024

BTW I think we can always use Z_SYNC_FLUSH. Let's not let clients choose the flag until we have a use case. I'm concerned some of the flush flags also require slightly different control flow around them, so letting people choose those flags won't actually work anyway.

@jasnell jasnell force-pushed the jsnell/compression-stream-flush-option branch from b8d7856 to dad7526 Compare May 13, 2024 22:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants