Dataset.data property with ArrowTable (aka. MemoryMappedTable) not updated after filter? #6413

Answered by lhoestq
alvarobartt asked this question in Q&A

Hi !

> I just wanted to know whether there's any reason why the `_data` attribute under the `data` property of a `datasets.Dataset` is not being updated after `filter`?

Because `filter` only needs to update the indices of the rows to keep. Recreating a new pyarrow Table can take time and disk space, so the underlying table is left as it is.

> So my other question would be, is there any efficient way to generate a new dataset after a filter without having to serialize to Python dict and then read from Python dict?

You can remove the indices mapping on top of the Arrow table using `ds = ds.flatten_indices()`. This rewrites the table so that it contains only the selected rows.

Answer selected by alvarobartt