perf bug: Inserting data is O(num versions) #2318

Open
wjones127 (Contributor) opened this issue May 9, 2024 · 0 comments
It appears the time to write data scales linearly with the number of versions. This is not great. On my local computer, it starts off at 10 ms and after a few thousand versions becomes 30 ms. For a higher-latency store, I bet this is more dramatic. One user reported latency of 1.5 sec after 8k versions.

My best guess is that this happens because, to load the latest version, we list all files in the versions directory. We might have to implement the first part of #1362 to fix this.
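For illustration, here is a minimal sketch of that hypothesis. This is not Lance's actual code; the directory layout, the ".manifest" naming, and the pointer file are assumptions. The point is that a list-based lookup costs O(num versions) on every write, while reading a single pointer is O(1):

import os

def latest_version_by_listing(versions_dir: str) -> int:
    # O(num versions): every write pays for a full directory listing,
    # which also means one remote LIST call per page of results.
    manifests = [f for f in os.listdir(versions_dir) if f.endswith(".manifest")]
    return max(int(f.split(".")[0]) for f in manifests)

def latest_version_by_pointer(pointer_file: str) -> int:
    # O(1): one small read, independent of how many versions exist.
    with open(pointer_file) as f:
        return int(f.read().strip())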

Reproduce this

from datetime import timedelta
import time
import pyarrow as pa
import lance

data = pa.table({'a': pa.array([1])})

# Uncomment this block to reset: once old versions are deleted,
# the latency goes back down.
# ds = lance.dataset("test_data")
# ds.cleanup_old_versions(older_than=timedelta(seconds=1), delete_unverified=True)

for i in range(10000):
    start = time.monotonic()
    # Use overwrite mode to rule out the possibility that this is O(num files)
    lance.write_dataset(data, 'test_data', mode='overwrite')
    print(time.monotonic() - start)
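If the listing is the bottleneck, timing a raw listing of the versions directory should grow the same way as the write latency. This assumes the manifests live under test_data/_versions/, which may differ across Lance versions:

import os
import time

# Layout assumption: manifests are stored under test_data/_versions/.
start = time.monotonic()
n = len(os.listdir("test_data/_versions"))
print(f"listed {n} entries in {time.monotonic() - start:.4f} s")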
