http module - buffer does not work? #712

Open
3 tasks done
grubberr opened this issue Aug 9, 2022 · 2 comments

grubberr commented Aug 9, 2022

Hello,

I think the smart_open http module could buffer better. Please look at this code sample:

import http.client as http_client

import pandas as pd
import smart_open

# Print every outgoing HTTP request to stdout
http_client.HTTPConnection.debuglevel = 1

with smart_open.open("https://github.com/airbytehq/airbyte/files/9280856/test.xlsx", mode="rb") as fp:
    df = pd.read_excel(fp)
print(df)

$ ./test.py | grep airbytehq/airbyte/files/9280856/test.xlsx
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=0-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=8685-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=8665-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=8045-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=0-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=8685-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=8665-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=8045-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=0-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=6478-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=1724-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=6694-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=3832-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=2536-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=3099-\r\n\r\n'

pandas.read_excel reads the file in a random-access way: it makes a lot of seek and read calls.
I expected that if the first HTTP request reads the entire file contents, subsequent read calls would be served from some internal buffer,
but I still see the library making HTTP requests under the hood for small byte ranges that were already read by the first HTTP request.

Can we improve this? Can we skip the additional HTTP requests if we already have all the needed data from the first HTTP request?
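
For illustration, the behaviour asked for here can be approximated on the caller's side by reading the stream once into memory and handing pandas the in-memory copy. This is only a sketch, not something smart_open does itself; the helper name `load_into_memory` is hypothetical, and it assumes the whole file fits in RAM.

```python
import io
import shutil


def load_into_memory(stream, chunk_size=1024 * 1024):
    """Copy a readable binary stream into a seekable in-memory buffer.

    `load_into_memory` is a hypothetical helper, not part of smart_open.
    Assumes the whole payload fits in memory.
    """
    buffer = io.BytesIO()
    shutil.copyfileobj(stream, buffer, chunk_size)  # one sequential pass
    buffer.seek(0)  # rewind so the consumer starts at the beginning
    return buffer


# Possible usage with the sample above (not executed here, needs network):
#
#   with smart_open.open(url, mode="rb") as fp:
#       df = pd.read_excel(load_into_memory(fp))
```

With this, every seek and read issued by the consumer hits the `io.BytesIO` object, so only the initial sequential download should touch the network.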

Versions

print(platform.platform())
Linux-5.14.0-1047-oem-x86_64-with-glibc2.31
print("Python", sys.version)
Python 3.9.11 (main, Aug  9 2022, 09:22:28) 
[GCC 9.4.0]
print("smart_open", smart_open.__version__)
smart_open 6.0.0

Checklist

Before you create the issue, please make sure you have:

  • Described the problem clearly
  • Provided a minimal reproducible example, including any required data
  • Provided the version numbers of the relevant software

mpenkov (Collaborator) commented Aug 12, 2022

smart_open's main use case is streaming. If your application does a lot of seeking, then it may be better for you to handle buffering separately (e.g. using tempfile).

Ideally, yes, smart_open would be smart enough to buffer the contents of the stream itself, but how do you determine the ideal size of the buffer? Automatically? Using some sort of parameter? It's a fair bit of work.
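
As a rough sketch of the tempfile suggestion above: stream the remote object once into a temporary file, then give the seek-heavy consumer the local file. The helper name `spool_to_tempfile` is an assumption made for this example, and the caller is responsible for closing the returned file.

```python
import shutil
import tempfile


def spool_to_tempfile(stream, chunk_size=1024 * 1024):
    """Copy a readable binary stream into an anonymous temporary file.

    `spool_to_tempfile` is a hypothetical helper, not part of smart_open.
    The temporary file is deleted when the returned object is closed.
    """
    tmp = tempfile.TemporaryFile()  # opened in binary "w+b" mode by default
    shutil.copyfileobj(stream, tmp, chunk_size)  # single streaming pass
    tmp.seek(0)  # rewind for the consumer
    return tmp


# Possible usage (not executed here, needs network):
#
#   with smart_open.open(url, mode="rb") as fp, spool_to_tempfile(fp) as tmp:
#       df = pd.read_excel(tmp)
```

Compared with an in-memory copy, this keeps memory use flat for large files at the cost of local disk I/O.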

grubberr (Author) commented

In my opinion, it could be any buffer size with some LRU mechanism.
The main idea was: as far as possible, don't re-read data from upstream if it was already read recently.

Yes, I agree, it could be a pretty complex task that complicates the library too much and could introduce new errors.
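
To make the LRU idea above concrete, here is a minimal sketch of a block cache placed in front of a range-fetching function. Everything in it is an assumption for illustration: the class name `LRUBlockCache`, the `fetch_range(start, stop)` callable (standing in for an HTTP range request), and the default block size and block count. It is not how smart_open works today.

```python
from collections import OrderedDict
from typing import Callable


class LRUBlockCache:
    """Seekable reader that caches fixed-size blocks with LRU eviction.

    Hypothetical sketch: `fetch_range(start, stop)` must return the bytes
    in [start, stop) from upstream, and `size` is the total object size.
    """

    def __init__(self, fetch_range: Callable[[int, int], bytes], size: int,
                 block_size: int = 64 * 1024, max_blocks: int = 16):
        self._fetch_range = fetch_range
        self._size = size
        self._block_size = block_size
        self._max_blocks = max_blocks
        self._blocks = OrderedDict()  # block index -> bytes, oldest first
        self._pos = 0

    def _block(self, index: int) -> bytes:
        if index in self._blocks:
            self._blocks.move_to_end(index)  # mark as most recently used
            return self._blocks[index]
        start = index * self._block_size
        stop = min(start + self._block_size, self._size)
        data = self._fetch_range(start, stop)  # the only upstream access
        self._blocks[index] = data
        if len(self._blocks) > self._max_blocks:
            self._blocks.popitem(last=False)  # evict least recently used
        return data

    def seek(self, offset: int, whence: int = 0) -> int:
        base = {0: 0, 1: self._pos, 2: self._size}[whence]
        self._pos = max(0, min(base + offset, self._size))
        return self._pos

    def tell(self) -> int:
        return self._pos

    def read(self, n: int = -1) -> bytes:
        stop = self._size if n < 0 else min(self._pos + n, self._size)
        chunks = []
        while self._pos < stop:
            index, within = divmod(self._pos, self._block_size)
            piece = self._block(index)[within:within + (stop - self._pos)]
            if not piece:  # upstream returned less than expected; stop early
                break
            chunks.append(piece)
            self._pos += len(piece)
        return b"".join(chunks)
```

The open question raised earlier in the thread still applies: `block_size` and `max_blocks` have to be chosen somehow, and a poor choice either wastes memory or still re-fetches data.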
