Search Index Refactoring #3556

splitbrain · 2021-12-04T15:30:47Z

This is work in progress based on #2943

Goals:

- Plugin authors should be able to reuse the index mechanisms to build their own indexing (e.g. the docsearch plugin)
  - This means an index may not have pages as the primary underlying object
- All index-related mechanisms should be well covered by tests
- The overall architecture should be easy to understand, with clear doc comments, consistent naming, etc.
- Memory is precious, so we need to be aware of what can be loaded and what cannot
- Speed is important when reading indexes

To me, the Indexing/Search System consists of several building blocks:

- At the bottom are individual index files
  - some are small enough to load into memory
  - some are too large to load (remember, people sometimes have 100k pages)
- On top of the individual indexes are what I would call collections
  - The FullTextIndex is such a collection, making use of several index files
  - Collections should take care of all their specific indexing needs (like calculating word length, etc.)
- Finally, there is the Indexer that manages the collections and fills them with data
  - I believe the indexer should be responsible for locking, not the lower-level classes
  - thus all indexing tasks should go through the indexer
  - I don't think the indexer should do much specific preprocessing though (like splitting a page into words); maybe we need Collection-specific indexers for that

Concerns between these different levels should be clearly separated. Currently all these things are very much mixed and spread all over the place.
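The layering described above could be sketched roughly as follows. This is only an illustration in Python, not DokuWiki's PHP API: the class names `MemoryIndex`, `FulltextCollection` and `Indexer`, their methods, and the in-process `threading.Lock` (standing in for whatever file locking the real indexer would use) are all assumptions made for the example.

```python
import threading
from collections import defaultdict


class MemoryIndex:
    """Bottom layer: one index file, modeled here as an in-memory list of lines."""

    def __init__(self):
        self.lines = []

    def get_row(self, rid):
        return self.lines[rid] if rid < len(self.lines) else ""

    def find_or_add(self, value):
        """Return the line number of value, appending it if it is missing."""
        if value in self.lines:
            return self.lines.index(value)
        self.lines.append(value)
        return len(self.lines) - 1


class FulltextCollection:
    """Middle layer: combines several index files and owns the word splitting."""

    def __init__(self):
        self.entities = MemoryIndex()      # page names, or any other entity
        self.words = MemoryIndex()         # known words
        self.postings = defaultdict(set)   # word id -> set of entity ids

    def add(self, entity, text):
        eid = self.entities.find_or_add(entity)
        for word in text.lower().split():
            self.postings[self.words.find_or_add(word)].add(eid)

    def lookup(self, word):
        word = word.lower()
        if word not in self.words.lines:
            return []
        wid = self.words.lines.index(word)
        return sorted(self.entities.get_row(e) for e in self.postings[wid])


class Indexer:
    """Top layer: the only writer; it holds the lock while a collection updates."""

    def __init__(self, collection):
        self.collection = collection
        self._lock = threading.Lock()  # assumption: stand-in for a file lock

    def index(self, entity, text):
        with self._lock:
            self.collection.add(entity, text)
```

In this sketch the entity is just a string, so a plugin could index something other than pages, and neither `MemoryIndex` nor `FulltextCollection` knows anything about locking.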

A series of smaller PRs against this branch should be made before this can be merged into master.

* master: (111 commits) Update translation translation update don't crush tables too narrow. fixes #3250 translation update Thorough tests for EO, DE, PT and ES translation update Optimized pageRestoreConfirm function Tests for Portuguese and Spanish Changes according to revisions in moisesbr-dw#2 adjust callstack depth for deprecation message further better deprecation messages for self required plugin base files don't test on old PHP releases anymore increase minimum PHP version to 7.2 fixed tests for cleanID and romanization for Greeklish Improved the transliteration from greek to latin. extension cli: do not try to upgrade bundled plugins Public access to patterns in external link parser test the collator fallback always cleanup for collator tests wrap sorting functions into their own class ...

Exceptions are better to handle than errors. What I don't like is that we now have an unfortunate mix of return code and exception signalling for errors. Some methods still return false for errors while others now throw exceptions (always returning true otherwise).

splitbrain · 2021-12-04T18:44:05Z

@ssahara one general question: The INDEX_MARK_DELETED mechanism is new, isn't it? I am wondering what it's good for. Previously deletion simply set an empty line. That means that line positions were never reused, but I think that's fine. Having to look for deleted entries makes things more complicated than needed IMHO. Thoughts?

ssahara · 2021-12-12T02:37:29Z

> @ssahara one general question: The INDEX_MARK_DELETED mechanism is new, isn't it? I am wondering what it's good for. Previously deletion simply set an empty line. That means that line positions were never reused, but I think that's fine. Having to look for deleted entries makes things more complicated than needed IMHO. Thoughts?

Yes, the INDEX_MARK_DELETED mechanism is new. I thought that the numeric pageId should be reused to avoid a sparse page.idx file. The page changelog file is reused when a deleted page is restored to its original file path, so it seems natural (at least to me) that the same numeric pageId is reused too. For the moment, I wonder whether it is even necessary to set an empty line for a deleted page in the page master table page.idx. I need to study the index mechanism further.
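The two deletion strategies discussed in this exchange can be contrasted with a small sketch. It is written in Python for illustration only and does not reproduce the PR's PHP code: the marker string `DELETED` is a made-up stand-in for INDEX_MARK_DELETED (its real value is not given here), and all four function names are hypothetical.

```python
DELETED = "#deleted:"  # hypothetical marker, stands in for INDEX_MARK_DELETED


def delete_blank(lines, name):
    """Empty-line deletion: the slot is blanked and never handed out again."""
    lines[lines.index(name)] = ""


def add_append(lines, name):
    """Adding a page always appends, so its id is the new last line."""
    lines.append(name)
    return len(lines) - 1


def delete_marked(lines, name):
    """Marker deletion: keep the name so the slot can be recognized later."""
    lines[lines.index(name)] = DELETED + name


def add_reuse(lines, name):
    """Restoring a page scans for its marked slot and reuses that id."""
    marked = DELETED + name
    if marked in lines:
        pid = lines.index(marked)
        lines[pid] = name
        return pid
    return add_append(lines, name)
```

With empty-line deletion a restored page gets a fresh id and the file grows by one line; with the marker it gets its old id back, at the cost of the extra scan that the question above is concerned about.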

ssahara added 30 commits December 16, 2019 19:45

coding style

ddc452a

encapsulate idx_cleanName()

0fb77e9

idx_cleanName() was called only from Doku_Indexer::addMetaKeys(), lookupKey(), getPages(), histogram()

deprecated idx_indexLengths()

1dad69b

This function is not called from elsewhere.

encapsulate idx_listIndexLengths()

4316123

Note: idx_listIndexLengths() is used in inc/infoutils.php file

deprecated idx_getIndex()

861eb32

Note: idx_getIndex() is used in inc/infoutils.php file

deprecated idx_tokenizer()

e15020b

This function is not called from elsewhere.

deprecated idx_lookup()

abfaea2

This function is not called from elsewhere.

encapsulate idx_addPage()

56d1fe9

Note: idx_listIndexLengths() is used in inc/TaskRunner.php and inc/Remote/ApiCore.php. It is also used in _test files.

encapsulate idx_get_version()

8c01949

encapsulate idx_get_stopwords()

0af7b62

encapsulate wordlen() in the Indexer

8896568

make Doku_Indexer singleton

b5daf9f

encapsulate functions into Classes

cd17dbd

class FulltextSearch class MetaSearch

coding style PSR-12

83198f9

deprecated class Doku_Indexer extends \Indexer

d43b19d

new namespace dokuwiki\Search

173bfbc

use Indexer method instead of idx_get_indexer()

c31af4f

make public listIndex() and listIndexLengths() method

fe21229

because used in inc/infoutils.php file

remove compatibility idx_ functions that are not used anywhere

89b6193

remove compatibility ft_ functions that are not used anywhere

0a3e25f

updateTuple() 2nd parameter

115f491

updateTuple() accept int for the second argument

Revert "updateTuple() 2nd parameter"

f9c5d30

This reverts commit 115f491.

define constants inside namespace

48b9265

create QueryParser class

3837ea9

provides convert($query), recert_simple(), and termParser(). No need to pass $Indexer in the method arguments.

move Quicksearch methods into MetaSearch class

677f78a

pageLookup() does not use fulltext index, but metadata index

bug fix and PHP74 warnings

6b6beca

Warning: Parameter 1 to dokuwiki\Search\MetaSearch::callback_pageLookup() expected to be a reference, value given in /path/to/dokuwiki/inc/Extension/Event.php on line 135

change class name to MetadataSearch

fe2d1da

Abstraction Index classes

f076e3f

PageIndex, PagewordIndex, MetadataIndex inherit the AbstractIndex class

separate methods into metadata, Pageword, Page index classes

86fc728

make $pidCache static, refactor getPID()

5aa57cb

all extending abstract classes should use a static pidCache array

ssahara and others added 25 commits March 14, 2020 14:17

Follow up #2985, fperm setting

39f31b6

Fixed inconsistent handling of falsy values on fperm setting

Merge branch 'master' into Refactor_Fulltext

e36bcee

Merge branch 'master' into Refactor_Fulltext

9de2ceb

make FulltextIndex::getIndexLengths() public

558f089

Third-party plugins may use this method. The [cloud plugin](https://github.com/dokufreaks/plugin-cloud) uses idx_indexLength().

fix method name

d42a607

fix deprecated.php

22df765

throw IndexWriteException in saveIndex()/saveIndexkey()

265e2c9

remove unnecessary if blocks

a16bd54

getPID(), saveIndex(), saveIndexKey(), getPageWords() always return true, otherwise they throw exceptions.

avoid null in addMetaKeys()

89e3dd3

Just ignore the $value argument if the $key argument is an array. Ignore empty keys in the $key argument. Ensure that any null value in the $key array is treated as an empty string.

change Index objects to non-singleton

a32da6d

Indexer, FulltextIndex, MetadataIndex use a common directory to store *.idx files, but this does not mean they should be singleton objects to avoid lock conflicts.

instantiate *Index with numeric page id

725e8e5

will reduce access to static $pidCache

fix scrutinizer claims

5792814

change Tokenizer static utility

1755450

frequently used in ajax call, singleton is not effective to reduce multiple instantiations.

change MetadataSearch and FulltextSearch to non-singleton

cc3a3cd

singleton is not effective to reduce multiple instantiations, especially for MetadataSearch which is frequently used in ajax call.

dbglog() for SearchException

72ebc99

add missing namespace to fallbacks

4d69838

Merge branch 'master' into Refactor_Fulltext

9e7aeeb

replace deprecated ft_backlinks() in Ui

bcd7722

Merge remote-tracking branch 'upstream/master' into Refactor_Fulltext

c1803f3

fix undefined array in FulltextSearch

05606ae

this was already fixed by 5afd958 on 2021-02-05

added missing 'notns' related code

fab81cc

catch up #3115 Sort with collator

a02395a

use Logger::debug() instead of deprecated dbglog()

3df1553

splitbrain requested a review from micgro42 December 4, 2021 15:30

splitbrain mentioned this pull request Dec 4, 2021

SearchIndex: Rearranging the Index class structure #3557

Open

splitbrain mentioned this pull request Oct 19, 2022

SearchIndex: Implement Collections (WIP) #3810

Draft
