Compressed data & chunk size fails fetch #895

Open
voidware opened this issue Sep 18, 2023 · 7 comments
@voidware

I'm having a problem with emscripten sokol_fetch when fetching compressed data with a non-zero chunk size.

Sokol issues a HEAD request and gets the compressed content length:

HTTP/1.1 200 OK
Date: Mon, 18 Sep 2023 12:41:06 GMT
Connection: Keep-Alive
ETag: "1695039706"
Cache-Control: max-age=86400
Content-Encoding: gzip
Content-Length: 18042
Content-Type: application/json
Last-Modified: Mon, 18 Sep 2023 12:21:46 GMT
Accept-Ranges: bytes
Vary: Origin

Sokol then issues a ranged GET and receives uncompressed data:

HTTP/1.1 206 Partial Content
Date: Mon, 18 Sep 2023 12:58:19 GMT
Connection: Keep-Alive
ETag: "1695039706"
Cache-Control: max-age=86400
Content-Length: 1024
Content-Range: bytes 0-1023/52494
Content-Type: application/json
Last-Modified: Mon, 18 Sep 2023 12:21:46 GMT
Accept-Ranges: bytes
Vary: Origin

And the server does not compress the range response (no Content-Encoding field); the requested range is interpreted against the uncompressed data.

So here we get the first 1K of 52K.

But Sokol stops fetching after 18042 bytes of uncompressed data (the compressed length reported by the HEAD request), so the download is incomplete.

I don't know if this is a server problem or a Sokol problem, but it seems the server always has the option to send the data uncompressed anyway, and that is what it is doing here.

Also, would it ever be the case that ranges are compressed? For example, does the server have the option to compress each range separately, and therefore return a Content-Length completely different both from the requested range and from any HEAD request?

And if a range within a file is requested and served uncompressed, how could the buffer ever receive more than the requested number of bytes? So I don't think the fetch buffer ever needs to be bigger than the chunk size, except for chunk_size=0.

@floooh
Owner

floooh commented Sep 18, 2023

Hmm, I'm somewhat sure that I received compressed chunks when experimenting with streaming downloads, otherwise I wouldn't have gone to great lengths describing that scenario here:

sokol/sokol_fetch.h

Lines 600 to 638 in 751fc4c

CHUNK SIZE AND HTTP COMPRESSION
===============================
TL;DR: for streaming scenarios, the provided chunk-size must be smaller
than the provided buffer-size because the web server may decide to
serve the data compressed and the chunk-size must be given in 'compressed
bytes' while the buffer receives 'uncompressed bytes'. It's not possible
in HTTP to query the uncompressed size for a compressed download until
that download has finished.
With vanilla HTTP, it is not possible to query the actual size of a file
without downloading the entire file first (the Content-Length response
header only provides the compressed size). Furthermore, for HTTP
range-requests, the range is given on the compressed data, not the
uncompressed data. So if the web server decides to serve the data
compressed, the content-length and range-request parameters don't
correspond to the uncompressed data that's arriving in the sokol-fetch
buffers, and there's no way from JS or WASM to either force uncompressed
downloads (e.g. by setting the Accept-Encoding field), or access the
compressed data.
This has some implications for sokol_fetch.h, most notably that buffers
can't be provided in the exactly right size, because that size can't
be queried from HTTP before the data is actually downloaded.
When downloading whole files at once, it is basically expected that you
know the maximum files size upfront through other means (for instance
through a separate meta-data-file which contains the file sizes and
other meta-data for each file that needs to be loaded).
For streaming downloads the situation is a bit more complicated. These
use HTTP range-requests, and those ranges are defined on the (potentially)
compressed data which the JS/WASM side doesn't have access to. However,
the JS/WASM side only ever sees the uncompressed data, and it's not possible
to query the uncompressed size of a range request before that range request
has finished.
If the provided buffer is too small to contain the uncompressed data,
the request will fail with error code SFETCH_ERROR_BUFFER_TOO_SMALL.

If the server's HEAD response announces that the data will be sent compressed, but the chunks then arrive uncompressed, then currently sokol_fetch.h indeed cannot know when the download has finished.

The streaming sample here doesn't seem to use compression (e.g. the HEAD request returns with the actual uncompressed data size, probably because compression is deactivated for MPEG files):

https://floooh.github.io/sokol-html5/plmpeg-sapp.html

If it's only about detecting when the streamed download is complete, then I can probably look at the Content-Range response header:

Content-Range: bytes 0-1023/52494

...the part after the slash is the overall size, so it's possible to just check each chunk's Content-Range header for completion.

That sounds like a plan. I need to look into sokol_fetch.h again soonish anyway because of #882.

@voidware
Author

Thanks for looking at this.

It appears that when a HEAD is issued, the Content-Length reflects whether compression is acceptable, since I think the value from HEAD is meant to be the same as the value a GET would return, all things being consistent.

So

curl -I <url> -H "Accept-Encoding: gzip"

Will contain:

Content-Encoding: gzip
Content-Length: 18042

curl -I <url>

Will contain:

Content-Length: 52494

In such cases the Content-Length will then be consistent with a subsequent GET in the non-range case.

For ranges, I'm thinking the server can opt out of compression. I think identity is always implied. I tried to stop it with:

curl <url> -i -H "Accept-Encoding: gzip,identity;q=0" -H "Range: bytes=0-1023"

But it still returned uncompressed data.

@voidware
Author

BTW, if you're going to be looking at fetch sometime, can you take a quick look at the case where a buffer is not pre-assigned? I tried allocating the buffer in the dispatch callback, but my response callback was never invoked. I could only get the pre-allocated buffer method to work.

BTW2, for the short term I have a workaround for the range problem. It turns out I only need chunks for streaming media, which is already compressed. Fetching small text files doesn't need chunks, as they always fit in my buffer anyhow, so for now I just set chunk_size to zero for those files.

BTW3, it would be nice to know whether ranges can indeed be compressed, and whether the server can opt to compress each range separately; I read somewhere that some CDNs do this. I have had a look around and can't find anything definite in this area. It seems to be a bit of a hole in the specifications.

Thanks.

@floooh
Owner

floooh commented Sep 20, 2023

@voidware
Author

Thanks for checking this. I tried it again. Yes, the problem only occurs with a non-zero chunk_size: it blows an assert complaining the buffer is too small for the chunk, because there is no buffer yet!

@voidware
Author

Also, am I right in thinking that assigning the buffer in dispatch will cause an additional frame delay? If so, I'll probably preassign the buffer anyhow.

@floooh
Owner

floooh commented Sep 21, 2023

Also, am I right in thinking that assigning the buffer in dispatch will cause an additional frame delay?

It actually shouldn't: the dispatch callback is 'short-circuited' and invoked as soon as a lane is assigned to the request, before it is enqueued for processing, so there's no extra roundtrip involved. (The channel and lane indices let you pick a buffer which will only be written to by this specific request, because it's guaranteed that no other request is in flight with the same channel/lane combination.)

sokol/sokol_fetch.h

Lines 2485 to 2491 in b803c9a

item->state = _SFETCH_STATE_DISPATCHED;
item->lane = _sfetch_ring_dequeue(&chn->free_lanes);
// if no buffer provided yet, invoke response callback to do so
if (0 == item->buffer.ptr) {
_sfetch_invoke_response_callback(item);
}
_sfetch_ring_enqueue(&chn->user_incoming, slot_id);
