Poor performance when reading as_of a date with many early versions deleted #1384

Open
DrNickClarke opened this issue Feb 29, 2024 · 1 comment · May be fixed by #1443
Labels: Bug fixing drive Q1 2024, bug (Something isn't working)

DrNickClarke (Collaborator) commented Feb 29, 2024

Describe the bug

There is a feature called "tombstone all" that is supposed to prevent the version search from having to walk the entire historical version list when the early versions have all been deleted.

It works when as_of is a version number (the read returns quickly, not having found the version).

However, when as_of is a date, the read can be slow. This is much more apparent on AWS, where the latency is higher.

When there are thousands of versions, the read can take several minutes.

Steps/Code to Reproduce

import arcticdb as adb
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

arctic = adb.Arctic()  # connection URI omitted in the report; the storage used was AWS S3
lib = arctic.get_library('adb_bugs', create_if_missing=True)

N = 3
df = pd.DataFrame(
    index=pd.date_range("20240101", periods=N),
    data={'col': np.arange(0., N)}
)

# write 500 versions
sym1 = 'asof_slow_read'
for i in range(500):
    lib.write(sym1, df)

# remove the early versions (tombstone all)
lib.delete(sym1)

# add one more version
lib.write(sym1, df)

# this is slow (12s in my test)
as_of = datetime.now() - timedelta(days=1)
lib.read(sym1, as_of=as_of)

# this is fast (171ms in my test)
lib.read(sym1, as_of=499)
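To quantify the difference between the two paths, a simple timing wrapper around the two reads can be used. The sketch below is illustrative and not part of the original report: the timed_read helper, the time.perf_counter timing, and the generic exception handling around the version-number read (which is expected to raise since version 499 was deleted) are additions.

import time

def timed_read(lib, symbol, as_of):
    """Time a single lib.read call and return the elapsed seconds."""
    start = time.perf_counter()
    try:
        lib.read(symbol, as_of=as_of)
    except Exception as e:  # e.g. the version was tombstoned and cannot be found
        print(f"read raised {type(e).__name__}")
    return time.perf_counter() - start

# as_of by date: walks the whole (deleted) version chain, so it is slow
print("by date:   ", timed_read(lib, sym1, datetime.now() - timedelta(days=1)))

# as_of by version number: short-circuits on the tombstone-all key, so it is fast
print("by version:", timed_read(lib, sym1, 499))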

Expected Results

The results returned are correct; this is purely a performance issue.

OS, Python Version and ArcticDB Version

Python 3.10
Linux (kernel 5.15.133.1-microsoft-standard-WSL2)
arcticdb 4.3.1

Backend storage used

AWS S3

Additional Context

This is possibly related to #1385 (failure to observe a tombstone correctly). It may be easier to solve the two together.

alexowens90 (Collaborator) commented:
This will be a failure to short-circuit on fast tombstone-all keys; the logic is a bit more complex than when searching by an exact version number.
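To illustrate the idea (the actual fix lives in ArcticDB's C++ version-chain code, so the names below are hypothetical Python stand-ins, not the library's internal API): when walking the version chain from newest to oldest, the search can stop as soon as it passes a tombstone-all marker, because every older version is known to be deleted.

# Hypothetical sketch of the early-exit idea; VersionEntry and the chain layout
# are illustrative stand-ins, not ArcticDB internals.
from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass
class VersionEntry:
    version: int
    timestamp: float
    tombstone_all: bool = False  # marks "every version at or below this one is deleted"
    deleted: bool = False

def find_version_as_of(chain: Iterable[VersionEntry], as_of_ts: float) -> Optional[VersionEntry]:
    """Walk the chain newest -> oldest, returning the newest undeleted entry written
    at or before as_of_ts. Stop early once a tombstone-all marker is reached:
    nothing older can be live, so there is no point reading further keys."""
    for entry in chain:
        if entry.tombstone_all:
            return None  # early exit: all older versions are deleted
        if not entry.deleted and entry.timestamp <= as_of_ts:
            return entry
    return None

In terms of this sketch, the bug reported here is that the early-exit check was applied when searching by version number but not when searching by timestamp, so the timestamp path kept reading version keys all the way down the chain (the LOAD_DOWNTO / LOAD_FROM_TIME cases mentioned in the commits below).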

IvoDD added a commit that referenced this issue Mar 20, 2024
- We fail to exit early when following the version chain with both
  LOAD_DOWNTO and LOAD_FROM_TIME. Adds a cpp test to show that
- Fix will be done in follow up commits
IvoDD added a commit that referenced this issue Mar 20, 2024
- Adds "_UNDELETED" suffix to LOAD_DOWNTO and LOAD_FROM_TIME to make it
  obvious that they should only work on undeleted versions
- No behavior change, just plain rename
IvoDD added a commit that referenced this issue Mar 20, 2024
LOAD_DOWNTO

- previously we did the check for tombstone_all only on LOAD_UNDELETED
  and LOAD_LATEST_UNDELETED
- Renamed the functions in version_utils which we use to check when to
  stop following the version chain to less confusing names
- Google test VersionMap.FollowingVersionChainEndEarly now passes
IvoDD added a commit that referenced this issue Mar 22, 2024
LOAD_DOWNTO WORK IN PROGRESS

- previously we did the check for tombstone_all only on LOAD_UNDELETED
  and LOAD_LATEST_UNDELETED
- Renamed the functions in version_utils which we use to check when to
  stop following the version chain to less confusing names
- Google test VersionMap.FollowingVersionChainEndEarly now passes