SearchIndex: Rearranging the Index class structure #3557

Open
wants to merge 6 commits into
base: searchIndex

@splitbrain splitbrain commented Dec 4, 2021

Incremental Improvement for #3556

This is a first step toward restructuring the indexing classes a bit more.

Some background:

We basically have two different kinds of index files:

a) RowIndex (like page.idx)

Each line in the index contains a single value. The line number is used as primary ID. These files can be very large. Thus an index like that should never be read into memory completely if it can be avoided.

b) TupleIndex (like i12.idx)

Each line contains a list of tuples. The files tend to be smaller so loading them completely for search and replace is easier.
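The difference between the two access patterns can be sketched as follows. This is a minimal illustration in Python, not the PHP code from this PR: the function names `read_row` and `parse_tuples` are hypothetical, and the `key*count:key*count` line layout is an assumption about how a tuple line is stored.

```python
import os
import tempfile
from typing import Dict, Optional


def read_row(path: str, row_id: int) -> Optional[str]:
    """Return the value on line `row_id` (0-based) without loading the whole file."""
    with open(path, encoding="utf-8") as handle:
        # Iterating the handle streams one line at a time.
        for line_number, line in enumerate(handle):
            if line_number == row_id:
                return line.rstrip("\n")
    return None  # the file has fewer than row_id + 1 lines


def parse_tuples(line: str) -> Dict[str, int]:
    """Parse one tuple line, assuming a 'key*count:key*count' layout."""
    result: Dict[str, int] = {}
    for record in line.strip().split(":"):
        if not record:
            continue  # skip empty records, e.g. from an empty line
        key, _, count = record.partition("*")
        result[key] = int(count or 0)
    return result


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "page.idx")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("start\nwiki:syntax\nwiki:welcome\n")
        print(read_row(path, 1))
    print(parse_tuples("0*3:2*1"))
```

The row lookup never holds more than one line in memory, while the tuple line is small enough to be turned into a dictionary and edited as a whole.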

Since the access patterns are so different, I tried to model that in two different classes, basically moving the methods from \dokuwiki\Search\AbstractIndex to the new classes.

While doing so, I tried to make the doc blocks, variable names and interface easier to understand. I also added tests for each of the methods.

The old code has not been touched yet. So these classes do not do anything outside of tests currently.

I also think that it might be useful to have a \dokuwiki\Search\Index\PageIndex inheriting from RowIndex providing a few more page-specific methods.

The next step would be to try to remove \dokuwiki\Search\AbstractIndex entirely and to model the Fulltext and Metadata Indexes as Collections.

When saving word indexes (w*.idx), multiple words of the same length
often need to be accessed. This implements a new method that allows that
in an efficient way.

Note: this removes the INDEX_MARK_DELETED mechanism for marking deleted
entries. Entries are now deleted using empty lines again. This makes the
batch handling much simpler. If there is a good reason to keep it, it
can be re-added.
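
A rough sketch of how batch access and empty-line deletion could fit together, again in Python rather than the PR's PHP. The name `change_rows` and its signature are hypothetical and do not come from this PR. The sketch assumes that several rows are rewritten in a single pass over the file and that a deleted row becomes an empty line, so all later line numbers stay valid as IDs.

```python
import os
import tempfile
from typing import Dict


def change_rows(path: str, changes: Dict[int, str]) -> None:
    """Rewrite `path` in one pass, replacing the rows listed in `changes`.

    Keys are 0-based line numbers. An empty string deletes the row but keeps
    the line itself, so the IDs of all following rows do not shift.
    """
    if any(row_id < 0 for row_id in changes):
        raise ValueError("row IDs must not be negative")
    pending = dict(changes)
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as out:
        line_number = 0
        if os.path.exists(path):
            with open(path, encoding="utf-8") as src:
                for line in src:
                    # Use the new value if this row is in the batch.
                    value = pending.pop(line_number, line.rstrip("\n"))
                    out.write(value + "\n")
                    line_number += 1
        # Rows past the old end of file: pad the gap with empty lines.
        while pending:
            out.write(pending.pop(line_number, "") + "\n")
            line_number += 1
    # Swap the rewritten file into place.
    os.replace(tmp_path, path)
```

With a file containing `cat`, `dog`, `fox`, a call like `change_rows(path, {0: "cow", 1: "", 4: "owl"})` replaces row 0, blanks row 1 and appends row 4 after one padding line, all in a single read and write of the file.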

@splitbrain (Collaborator, Author)

After working some more on this, I notice that the distinction between the two types of Index files might not be so clear cut. The "reverse" pageword.idx index is a TupleIndex by content, but might actually be the largest index we have and should not be loaded into memory.

I think we may actually need to split this into MemoryIndex and FileIndex, with tuple operations available on both. I will update this PR when I have made up my mind ;-)
