http module - buffer does not work? #712

grubberr · 2022-08-09T06:49:44Z

Hello,

I think the smart_open http module could improve its buffering. Please look at this code sample:

import smart_open
import pandas as pd
import http.client as http_client

# Print every outgoing HTTP request to stdout so the range requests are visible
http_client.HTTPConnection.debuglevel = 1

fp = smart_open.open("https://github.com/airbytehq/airbyte/files/9280856/test.xlsx", mode="rb")
df = pd.read_excel(fp)
print(df)

$ ./test.py | grep airbytehq/airbyte/files/9280856/test.xlsx
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=0-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=8685-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=8665-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=8045-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=0-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=8685-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=8665-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=8045-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=0-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=6478-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=1724-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=6694-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=3832-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=2536-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=3099-\r\n\r\n'

pandas.read_excel reads the file in a random-access way: it makes a lot of seek and read calls.
I expected that if the first HTTP request reads the whole file, subsequent read calls would be served from some internal buffer,
but I still see that the library keeps making HTTP requests under the hood for small byte ranges that were already read by the first HTTP request.

Can we improve this? Can we skip the additional HTTP requests if we already have all the needed data from the first HTTP request?

Versions

print(platform.platform())
Linux-5.14.0-1047-oem-x86_64-with-glibc2.31
print("Python", sys.version)
Python 3.9.11 (main, Aug  9 2022, 09:22:28) 
[GCC 9.4.0]
print("smart_open", smart_open.__version__)
smart_open 6.0.0

Checklist

Before you create the issue, please make sure you have:

Described the problem clearly
Provided a minimal reproducible example, including any required data
Provided the version numbers of the relevant software


mpenkov · 2022-08-12T07:13:32Z

smart_open's main use case is streaming. If your application does a lot of seeking, then it may be better for you to handle buffering separately (e.g. using tempfile).
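One way to read the tempfile suggestion is sketched below: copy the stream once, sequentially, into a local temporary file, then hand that seekable file to the code that does the seeking. The helper name `copy_to_tempfile` is hypothetical and not part of smart_open; an in-memory `io.BytesIO` stands in for the remote stream so the sketch runs offline.

```python
import io
import shutil
import tempfile


def copy_to_tempfile(stream):
    """Copy a forward-only binary stream into a seekable temporary file."""
    local = tempfile.TemporaryFile()  # removed automatically when closed
    shutil.copyfileobj(stream, local)  # one sequential pass over the source
    local.seek(0)  # rewind so the caller starts reading at the beginning
    return local


# Offline stand-in for the remote stream. With smart_open this would be
# the object returned by smart_open.open(url, mode="rb").
remote = io.BytesIO(b"example spreadsheet bytes")
local = copy_to_tempfile(remote)

# Seeks and reads now hit the local file, not the network.
local.seek(8)
chunk = local.read(11)
local.close()
```

With this approach the source is read exactly once, at the cost of storing the whole file locally, so it fits small files such as the spreadsheet in this report better than very large ones.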

Ideally, yes, smart_open would be smart enough to buffer the contents of the stream itself, but how do you determine the ideal size of the buffer? Automatically? Using some sort of parameter? It's a fair bit of work.

grubberr · 2022-08-12T07:21:32Z

In my opinion it could be a buffer of any size with some LRU mechanism.
The main idea is: as much as possible, don't re-read data from upstream if it was already read recently.

Yes, I agree, it can be a pretty complex task that complicates the library too much and can introduce new errors.
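For illustration only, the LRU idea could be sketched as a small wrapper that caches fixed-size blocks of a seekable stream. Everything here is an assumption and not smart_open code: the class name `LRUBlockCache`, the `block_size` and `max_blocks` parameters, and the `fetches` counter are hypothetical. The sketch also assumes the wrapped object supports `seek(0, io.SEEK_END)` and that `read(n)` returns `n` bytes unless it reaches the end of the stream.

```python
import io
from collections import OrderedDict


class LRUBlockCache:
    """Hypothetical read-only wrapper that caches blocks of a seekable stream."""

    def __init__(self, raw, block_size=64 * 1024, max_blocks=16):
        self._raw = raw
        self._block_size = block_size
        self._max_blocks = max_blocks
        self._blocks = OrderedDict()  # block index -> bytes, least recent first
        self._pos = 0
        self._size = raw.seek(0, io.SEEK_END)  # assumes SEEK_END is supported
        self.fetches = 0  # how many reads reached the underlying stream

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        # Seeking only moves the local position; nothing is fetched here.
        base = {io.SEEK_SET: 0, io.SEEK_CUR: self._pos, io.SEEK_END: self._size}[whence]
        self._pos = max(0, base + offset)
        return self._pos

    def _block(self, index):
        if index in self._blocks:
            self._blocks.move_to_end(index)  # mark as most recently used
            return self._blocks[index]
        self._raw.seek(index * self._block_size)
        data = self._raw.read(self._block_size)
        self.fetches += 1
        self._blocks[index] = data
        if len(self._blocks) > self._max_blocks:
            self._blocks.popitem(last=False)  # evict the least recently used block
        return data

    def read(self, size=-1):
        if size is None or size < 0:
            size = self._size - self._pos
        end = min(self._pos + size, self._size)
        chunks = []
        while self._pos < end:
            index, offset = divmod(self._pos, self._block_size)
            piece = self._block(index)[offset:offset + (end - self._pos)]
            if not piece:
                break
            chunks.append(piece)
            self._pos += len(piece)
        return b"".join(chunks)


# Offline demonstration with an in-memory stream standing in for the remote file.
data = bytes(range(256)) * 4
cached = LRUBlockCache(io.BytesIO(data), block_size=256, max_blocks=2)
cached.seek(100)
first = cached.read(10)   # fetches block 0 from the underlying stream
cached.seek(100)
second = cached.read(10)  # served from the cache, no new fetch
```

The open design questions raised above remain: the block size and the number of cached blocks trade memory against repeated requests, and a sketch like this does not choose them automatically.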

grubberr mentioned this issue Aug 9, 2022

Source File: ability to get HTTPS attachments airbytehq/airbyte#5537

Closed
