Read index API #1568

vasil-pashov · 2024-05-10T16:38:15Z

Reference Issues/PRs

relates to: #399

What does this implement or fix?

Add unit tests to showcase the proposed API for reading the index of a symbol.

Proposal
Add new arcticdb.Library API

def read_index(symbol: str, as_of: Optional[AsOf] = None) -> VersionedItem

symbol and as_of have the same meaning as in arcticdb.Library.read

VersionedItem.data will contain the index retrieved from the symbol. Its type will be one of pandas.RangeIndex, pandas.DatetimeIndex, pandas.MultiIndex (it can be expanded to all supported index types) .

TBD

Is optional date_range/row_range argument needed? (I would add it only if someone requests it)
Do we support categorical indexes should they be considered in the first iteration of the task.
Do we need to return VersionedItem? Pros: It's consistent with arcticdb.Library.read. Cons: If VersionedItem starts to grow it can start carrying additional data that's not used

Alternatives
Extend arcticdb.Library.read with additional parameter index_only.
Pros:

Less code duplication
Move all read related stuff to one function

Cons:

VersionedItem.data will be of different type depending on the parameter which might lead to confusion and bugs
arcticdb.Library.read might become huge and hard to understand and use

Any other comments?

Checklist

Checklist for code changes...

Have you updated the relevant docstrings, documentation and copyright notice?
Is this contribution tested against all ArcticDB's features?
Do all exceptions introduced raise appropriate error messages?
Are API changes highlighted in the PR description?
Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

alexowens90 · 2024-05-13T09:57:21Z

Mostly looks good, few comments/ideas:

We have code to support string indexes, although I'm not sure how widely used it is. As a minimum, we need to behave sensibly for string indexed data.
What should the behaviour be for the empty index type?
We definitely want date_range, and for completeness I think we should include row_range too.
We do not support categorical indexes, so these do not need to be considered.
I think VersionedItem is the correct return type, with the index in the data field.
NativeVersionStore already has a method called read_index, so we should use something else for this. Maybe get_index?
I think it is better as a separate method, rather than an argument to read. When selecting a subset of columns, you always get the index columns too without specifying them. We could have a special case, where columns=[] gets just the index, but it isn't intuitive, and would change the return type of data based on a parameter, which is nasty, particularly if the columns list is generated at runtime and might be empty.
We should support a batch version of the call

alexowens90 · 2024-05-13T09:57:59Z

python/tests/unit/arcticdb/version_store/test_read_index.py

+class TestReadSingleIndex:
+    @pytest.mark.parametrize("index", [
+        pd.RangeIndex(start=0, stop=10),
+        pd.RangeIndex(start=0, stop=10, step=2),


I would test with a non-zero start value here.

alexowens90 · 2024-05-13T09:58:42Z

python/tests/unit/arcticdb/version_store/test_read_index.py

+    ])
+    def test_rowrange(self, lmdb_version_store_static_and_dynamic, index):
+        lmdb_version_store_static_and_dynamic.write("sym", pd.DataFrame({"col": range(0, len(index))}, index=index))
+        result = lmdb_version_store_static_and_dynamic.read_index("sym")


lmdb_version_store_static_and_dynamic is of type NativeVersionStore, which has a read_index method with different functionality.

alexowens90 · 2024-05-13T10:00:36Z

python/tests/unit/arcticdb/version_store/test_read_index.py

+    def test_as_of(self, lmdb_version_store_static_and_dynamic, indexes):
+        data = [list(range(0, len(index))) for index in indexes]
+
+        lmdb_version_store_static_and_dynamic.write("sym", pd.DataFrame({"col": data[0]}, index=indexes[0]))


Would be good to use a fixture with 2 columns per segment, and >2 columns to check that everything works with column slicing.

* Different versions of as_of * date_rage * row_range * Different column slice size

Add TDD style tests for the new read_index API

88dfc5d

vasil-pashov force-pushed the read-index branch from 68a097f to 88dfc5d Compare May 10, 2024 16:43

alexowens90 reviewed May 13, 2024

View reviewed changes

Expand tests to cover more ground

49a6e69

* Different versions of as_of * date_rage * row_range * Different column slice size

vasil-pashov requested a review from alexowens90 May 21, 2024 23:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Read index API #1568

Read index API #1568

vasil-pashov commented May 10, 2024 •

edited

alexowens90 commented May 13, 2024 •

edited

alexowens90 May 13, 2024

alexowens90 May 13, 2024

alexowens90 May 13, 2024

Read index API #1568

Are you sure you want to change the base?

Read index API #1568

Conversation

vasil-pashov commented May 10, 2024 • edited

Reference Issues/PRs

What does this implement or fix?

Any other comments?

Checklist

alexowens90 commented May 13, 2024 • edited

alexowens90 May 13, 2024

Choose a reason for hiding this comment

alexowens90 May 13, 2024

Choose a reason for hiding this comment

alexowens90 May 13, 2024

Choose a reason for hiding this comment

vasil-pashov commented May 10, 2024 •

edited

alexowens90 commented May 13, 2024 •

edited