SearchIndex: Rearranging the Index class structure #3557

splitbrain · 2021-12-04T15:36:24Z

Incremental Improvement for #3556

This is a first step toward restructuring the indexing classes a bit more.

Some background:

We basically have two different kinds of index files:

a) RowIndex (like page.idx)

Each line in the index contains a single value. The line number is used as primary ID. These files can be very large. Thus an index like that should never be read into memory completely if it can be avoided.

b) TupleIndex (like i12.idx)

Each line contains a list of tuples. The files tend to be smaller so loading them completely for search and replace is easier.
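To make the difference between the two access patterns concrete, here is a minimal sketch in Python (not DokuWiki's PHP, and not the code in this PR). The function names `retrieve_row` and `parse_tuples` are hypothetical, and the `key*count:key*count` line layout is an assumption about how tuples are stored.

```python
from itertools import islice


def retrieve_row(path, rid):
    """Return the value on line `rid` (0-based) without loading the whole file."""
    with open(path, encoding="utf-8") as fh:
        # islice reads lazily, so only one line is held in memory at a time
        for line in islice(fh, rid, rid + 1):
            return line.rstrip("\n")
    return ""  # the row does not exist (yet)


def parse_tuples(line):
    """Split a tuple line into a dict; the ':' and '*' separators are assumed."""
    result = {}
    for part in line.split(":"):
        if not part:
            continue  # skip empty fragments, e.g. from an empty line
        key, _, count = part.partition("*")
        result[key] = int(count) if count else 0
    return result
```

The first function only ever streams the file, which suits a large row-based index. The second works on a single line that has already been read, which is cheap when the whole file fits in memory.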

Since the access patterns are so different, I tried to model them as two separate classes, basically moving the methods from \dokuwiki\Search\AbstractIndex to the new classes.

While doing so, I tried to make the doc blocks, variable names and interface easier to understand. I also added tests for each of the methods.

The old code has not been touched yet. So these classes do not do anything outside of tests currently.

I also think that it might be useful to have a \dokuwiki\Search\Index\PageIndex inheriting from RowIndex providing a few more page-specific methods.

The next step would be to remove \dokuwiki\Search\AbstractIndex and try to model the Fulltext and Metadata indexes as Collections.


When saving word indexes (w*.idx), multiple words of the same length often need to be accessed. This implements a new method that does so efficiently.

Note: this removes the INDEX_MARK_DELETED mechanism for marking deleted entries. Entries are now deleted using empty lines again, which makes the batch handling much simpler. If there is a good reason to keep it, it can be re-added.
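As a rough illustration of why empty-line deletion keeps batch handling simple, the following Python sketch applies several row changes in a single pass over the file. It is not the method added in this commit: `change_rows` is a hypothetical name, and padding rows beyond the end of the file with empty lines is an assumption.

```python
import os
import tempfile


def change_rows(path, changes):
    """Apply {row_id: new_value} in one pass over the file; '' deletes a row."""
    pending = dict(changes)
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as out:
        count = 0  # number of existing rows copied so far
        if os.path.exists(path):
            with open(path, encoding="utf-8") as src:
                for count, line in enumerate(src, start=1):
                    # replace the row if a change is pending, else copy it
                    out.write(pending.pop(count - 1, line.rstrip("\n")) + "\n")
        # rows past the current end: pad the gap with empty lines (assumed)
        for rid in range(count, max(pending, default=-1) + 1):
            out.write(pending.pop(rid, "") + "\n")
    os.replace(tmp_path, path)  # swap the rewritten file into place
```

Because a deleted entry is just an empty string, deletion and replacement go through the same code path, and row IDs of the remaining entries never shift.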

splitbrain · 2021-12-04T19:14:30Z

After working some more on this, I noticed that the distinction between the two types of index files might not be so clear-cut. The "reverse" pageword.idx index is a TupleIndex by content, but might actually be the largest index we have and should not be loaded into memory.

I think we may actually need to split this into MemoryIndex and FileIndex, but have tuple operations available on both. I will update this PR once I have made up my mind ;-)
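One way such a split could look, sketched in Python purely for illustration: the class names MemoryIndex and FileIndex come from the comment above, while `BaseIndex`, the method names and the `key*count:key*count` tuple layout are assumptions and not the design this PR settles on.

```python
import os
from abc import ABC, abstractmethod


class BaseIndex(ABC):
    """Hypothetical base class: subclasses decide how rows are accessed."""

    def __init__(self, path):
        self.path = path

    @abstractmethod
    def retrieve_row(self, rid):
        """Return the raw line for row `rid`, or '' if it does not exist."""

    def retrieve_tuples(self, rid):
        """Tuple operation shared by both access strategies."""
        parts = (p.partition("*") for p in self.retrieve_row(rid).split(":") if p)
        return {key: int(count or 0) for key, _, count in parts}


class FileIndex(BaseIndex):
    """Streams the file; meant for indexes too large to hold in memory."""

    def retrieve_row(self, rid):
        if not os.path.exists(self.path):
            return ""
        with open(self.path, encoding="utf-8") as fh:
            for number, line in enumerate(fh):
                if number == rid:
                    return line.rstrip("\n")
        return ""


class MemoryIndex(BaseIndex):
    """Loads the whole file once; meant for small, frequently searched indexes."""

    def __init__(self, path):
        super().__init__(path)
        self.rows = []
        if os.path.exists(path):
            with open(path, encoding="utf-8") as fh:
                self.rows = [line.rstrip("\n") for line in fh]

    def retrieve_row(self, rid):
        return self.rows[rid] if 0 <= rid < len(self.rows) else ""
```

In this arrangement the class is chosen by how the file should be accessed, and the tuple parsing lives in the base class so that a large tuple file such as pageword.idx can still be streamed.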

splitbrain added 2 commits December 4, 2021 15:53

splitbrain added 4 commits December 8, 2021 16:55

Changed IndexAccessor names based on access method not content

9bd7d62

we need the same access methods in both index types

d6396b6

better method names

8ed3501

retrieveRow needs to pad the index, too

dec2682

splitbrain mentioned this pull request Oct 19, 2022

SearchIndex: Implement Collections (WIP) #3810

Draft
