Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search Index Refactoring #3556

Draft
wants to merge 104 commits into
base: master
Choose a base branch
from
Draft

Search Index Refactoring #3556

wants to merge 104 commits into from

Conversation

splitbrain
Copy link
Collaborator

@splitbrain splitbrain commented Dec 4, 2021

This is work in progress based on #2943

Goals:

  • Plugin authors should be able to reuse the index mechanisms to build their own indexing (eg. the docsearch plugin)
    • This means an index may not have pages as the primary underlying object
  • All index related mechanism should be well covered by tests
  • The overall architecture should be easy to understand with clear doc comments, consistent naming, etc.
  • memory is precious so we need to be aware what can be loaded or not
  • speed is important when reading indexes

To me the Indexing/Search System consists of several building blocks

  • at the bottom are individual index files
    • some are small enough to load them into memory
    • some are too large to load them (remember people have 100k pages sometimes)
  • on top of the individual indexes are what I would call collections
    • The FullTextIndex is such a collection, making use of several index files
    • Collections should take care of all their specific indexing needs (like calculatin word length, etc)
  • finally there is the Indexer that manages the collections and fills them with data
    • I believe the indexer should be responsible for locking, not the lower level classes
    • thus all indexing tasks should go through the indexer
    • I don't think the indexer should do much specific preprocessing though (like splitting a page into words) - maybe we need Collection-specific indexers for that

Concerns between these different levels should be clearly differentiated currently all these things are very much mixed and spread all over the place.

A series of smaller PRs against this branch should be made before this can be merged into master.

idx_cleanName() was called only from Doku_Indexer::addMetaKeys(), lookupKey(), getPages(), histogram()
This function is not called from elsewhere.
Note:  idx_listIndexLengths() is used in inc/infoutils.php file
Note: idx_getIndex() is used in inc/infoutils.php file
This function is not called from elsewhere.
This function is not called from elsewhere.
Note: idx_listIndexLengths() is used in iinc/TaskRunner.php and inc/Remote/ApiCore.php file

Also used in _test files.
class FulltextSearch
class MetaSearch
because used in inc/infoutils.php file
updateTuple() accept int for the second argument
provides convert($query), recert_simple(), and termParser().
No needs to pass $Indexer in method's arguments.
pageLookup() does not use fulltext index, but metadata index
Warning: Parameter 1 to dokuwiki\Search\MetaSearch::callback_pageLookup() expected to be a reference, value given in /path/to/dokuwiki/inc/Extension/Event.php on line 135
PageIndex, PagewordIndex, MetadataIndex inherit the AbstractIndex class
all extending acstract classes should use a static pidCache array
ssahara and others added 25 commits March 14, 2020 14:17
Fixed inconsistent handling of falsy values on fperm setting
Third-party plugins may use this method. The [cloud plugin](https://github.com/dokufreaks/plugin-cloud) uses idx_indexLength().
* master: (111 commits)
  Update translation
  translation update
  don't crush tables too narrow. fixes #3250
  translation update
  Thorough tests for EO, DE, PT and ES
  translation update
  Optimized pageRestoreConfirm function
  Tests for Portuguese and Spanish
  Changes according to revisions in moisesbr-dw#2
  adjust callstack depth for deprecation message further
  better deprecation messages for self required plugin base files
  don't test on old PHP releases anymore
  increase minimum PHP version to 7.2
  fixed tests for cleanID and romanization for Greeklish
  Improved the transliteration from greek to latin.
  extension cli: do not try to upgrade bundled plugins
  Public access to patterns in external link parser
  test the collator fallback always
  cleanup for collator tests
  wrap sorting functions into their own class
  ...
Exceptions are better to handle than errors. What I don't like is that
we now have an unfortunate mix of return code and exception signalling
for errors. Some methods still return false for errors while others
now throw exceptions (always returning true otherwise).
getPID(), saveIndex(), saveIndexKey(), getPageWords() return always true, otherwise exceptions.
Just ignore $value argument if $key argument is array .
Ignore enpty key of $key argument.
Ensure to treat any null value of $key array as empty string.
Indexer, FulltextIndex, MetadataIndex uses common directory to store *.idx files, but this does not mean they should be singleton objects to avoid lock confrictions.
will reduce access to static $pidCache
frequently used in ajax call, singleton is not effective to reduce multiple instantiations.
singleton is not effective to reduce multiple instantiations, especially for MetadataSearch which is frequently used in ajax call.
this was already fixed by 5afd958 on 2021-02-05
@splitbrain
Copy link
Collaborator Author

@ssahara one general question: The INDEX_MARK_DELETED mechanism is new, isn't it? I am wondering what it's good for. Previously deletion simply set an empty line. That means that line positions were never reused, but I think that's fine. Having to look for deleted entries makes things more complicated than needed IMHO. Thoughts?

@ssahara
Copy link
Collaborator

ssahara commented Dec 12, 2021

@ssahara one general question: The INDEX_MARK_DELETED mechanism is new, isn't it? I am wondering what it's good for. Previously deletion simply set an empty line. That means that line positions were never reused, but I think that's fine. Having to look for deleted entries makes things more complicated than needed IMHO. Thoughts?

Yes, the INDEX_MARK_DELETED mechanism is new, because I thought that the 'numeric pageId' should be reused to avoid having sparse page.idx file. The page changelog file is reused when a deleted page is restored to its original file path, therefore it is natural (at least for me) that the same numeric pageId is reused too. For the moment, I wonder it may not be necessary to set empty for deleted page in the page master table page.idx? I need to study further the index mechanism.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants