Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read index API #1568

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from
Draft

Read index API #1568

wants to merge 2 commits into from

Conversation

vasil-pashov
Copy link
Collaborator

@vasil-pashov vasil-pashov commented May 10, 2024

Reference Issues/PRs

relates to: #399

What does this implement or fix?

Add unit tests to showcase the proposed API for reading the index of a symbol.

Proposal
Add new arcticdb.Library API

def read_index(symbol: str, as_of: Optional[AsOf] = None) -> VersionedItem

symbol and as_of have the same meaning as in arcticdb.Library.read

VersionedItem.data will contain the index retrieved from the symbol. Its type will be one of pandas.RangeIndex, pandas.DatetimeIndex, pandas.MultiIndex (it can be expanded to all supported index types) .

TBD

  • Is optional date_range/row_range argument needed? (I would add it only if someone requests it)
  • Do we support categorical indexes should they be considered in the first iteration of the task.
  • Do we need to return VersionedItem? Pros: It's consistent with arcticdb.Library.read. Cons: If VersionedItem starts to grow it can start carrying additional data that's not used

Alternatives
Extend arcticdb.Library.read with additional parameter index_only.
Pros:

  • Less code duplication
  • Move all read related stuff to one function

Cons:

  • VersionedItem.data will be of different type depending on the parameter which might lead to confusion and bugs
  • arcticdb.Library.read might become huge and hard to understand and use

Any other comments?

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

@alexowens90
Copy link
Collaborator

alexowens90 commented May 13, 2024

Mostly looks good, few comments/ideas:

  • We have code to support string indexes, although I'm not sure how widely used it is. As a minimum, we need to behave sensibly for string indexed data.
  • What should the behaviour be for the empty index type?
  • We definitely want date_range, and for completeness I think we should include row_range too.
  • We do not support categorical indexes, so these do not need to be considered.
  • I think VersionedItem is the correct return type, with the index in the data field.
  • NativeVersionStore already has a method called read_index, so we should use something else for this. Maybe get_index?
  • I think it is better as a separate method, rather than an argument to read. When selecting a subset of columns, you always get the index columns too without specifying them. We could have a special case, where columns=[] gets just the index, but it isn't intuitive, and would change the return type of data based on a parameter, which is nasty, particularly if the columns list is generated at runtime and might be empty.
  • We should support a batch version of the call

class TestReadSingleIndex:
@pytest.mark.parametrize("index", [
pd.RangeIndex(start=0, stop=10),
pd.RangeIndex(start=0, stop=10, step=2),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would test with a non-zero start value here.

])
def test_rowrange(self, lmdb_version_store_static_and_dynamic, index):
lmdb_version_store_static_and_dynamic.write("sym", pd.DataFrame({"col": range(0, len(index))}, index=index))
result = lmdb_version_store_static_and_dynamic.read_index("sym")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lmdb_version_store_static_and_dynamic is of type NativeVersionStore, which has a read_index method with different functionality.

def test_as_of(self, lmdb_version_store_static_and_dynamic, indexes):
data = [list(range(0, len(index))) for index in indexes]

lmdb_version_store_static_and_dynamic.write("sym", pd.DataFrame({"col": data[0]}, index=indexes[0]))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to use a fixture with 2 columns per segment, and >2 columns to check that everything works with column slicing.

* Different versions of as_of
* date_rage
* row_range
* Different column slice size
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants