Separate document and query token pipelines? #105

anentropic · 2019-04-15T11:50:20Z

elasticlunr.Pipelines maintain an ordered list of functions to be applied to both document tokens and query tokens.

Does this mean there is one pipeline that is applied to both document tokens and query tokens?

For some use cases it would be useful to have separate pipelines for documents and queries, for example implementing something like "variations" in Whoosh, where a single token is expanded to a list of tokens at query time:
https://whoosh.readthedocs.io/en/latest/stemming.html#variations
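
To make the idea concrete, here is a minimal sketch of two separate pipelines in plain JavaScript. It does not use the elasticlunr API: `runPipeline`, `VARIATIONS`, and `expandVariations` are hypothetical names, and the variations table is hand-written for illustration rather than generated the way Whoosh does it.

```javascript
// Hypothetical helper, not part of elasticlunr: run tokens through an ordered
// list of functions. Each function returns a token, an array of tokens
// (an expansion), or null/undefined to drop the token.
function runPipeline(steps, tokens) {
  return steps.reduce(
    (current, step) =>
      current.flatMap((token) => {
        const result = step(token);
        if (result === null || result === undefined) return [];
        return Array.isArray(result) ? result : [result];
      }),
    tokens
  );
}

const lowercase = (token) => token.toLowerCase();

// Hypothetical, hand-written variations table (illustration only).
const VARIATIONS = new Map([
  ["render", ["render", "renders", "rendered", "rendering"]],
]);
const expandVariations = (token) => VARIATIONS.get(token) || [token];

// Documents are only normalized; queries are normalized and then expanded.
const documentPipeline = [lowercase];
const queryPipeline = [lowercase, expandVariations];

const docTokens = runPipeline(documentPipeline, ["Rendering", "the", "scene"]);
const queryTokens = runPipeline(queryPipeline, ["Render"]);

console.log(docTokens);
console.log(queryTokens);
```

The point is that the expansion step appears only in `queryPipeline`, so the index never sees the extra tokens.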

Is that possible?

srenauld · 2019-04-26T13:50:01Z

That is an interesting concept. I don't think I've come across this one applied to full-text searching before.

The issue I am seeing with this is that, while the index size would be roughly the same (a stemmer does not create additional terms, but some terms may end up getting merged by stemming), the query cost would potentially skyrocket. Let's use the "render" example from Whoosh.

At query time, this would be expanded into every single possible inflected form. This means that rather than stemming and getting a one-term query, you're looking at a boolean should query with 20+ clauses.
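
A rough sketch of that cost difference, using toy in-memory indexes rather than anything from elasticlunr. The data, the `shouldQuery` helper, and the four-form expansion are all made up for illustration (the real list would be much longer):

```javascript
// Stemmed index: every inflected form was merged into one term at index time.
const stemmedIndex = new Map([["render", new Set([1, 2, 3, 4])]]);

// Unstemmed index: each surface form keeps its own postings list.
const unstemmedIndex = new Map([
  ["render", new Set([1])],
  ["renders", new Set([2])],
  ["rendered", new Set([1, 3])],
  ["rendering", new Set([4])],
]);

// Boolean "should" (OR) query: one postings lookup per clause, results unioned.
function shouldQuery(index, terms) {
  const hits = new Set();
  let lookups = 0;
  for (const term of terms) {
    lookups += 1;
    for (const id of index.get(term) || []) hits.add(id);
  }
  return { hits: [...hits].sort((a, b) => a - b), lookups };
}

const stemmed = shouldQuery(stemmedIndex, ["render"]);
const expanded = shouldQuery(unstemmedIndex, [
  "render", "renders", "rendered", "rendering",
]);

console.log(stemmed.lookups, expanded.lookups);
```

Both queries find the same documents, but the expanded one needs a lookup per variation, and that count grows with the size of the expansion.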

It definitely is an interesting case, although one I think most people would not be leveraging, particularly for non-English languages. In particular, German would get mighty spicy with this, seeing as words can be part of other words.

I've done this on the PR (#106) I have waiting, which substantially changes a whole bunch of things under the hood. When it is merged, you'll be able to use it; feel free to test out the branch if you're curious.

anentropic · 2019-04-26T13:55:31Z

My use case was for synonyms, rather than the stemming-style variations shown in the Whoosh example. So I have a more limited set of variations expanded into the query (which are then matched against stemmed words in the index as normal).

Perhaps there are other use cases where you would want to dynamically change the available variations at runtime without reindexing.
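
Roughly what I have in mind, as a sketch in plain JavaScript. None of this is elasticlunr API: `toyStem` is a crude stand-in for a real stemmer, and the `synonyms` table and `processQuery` are hypothetical.

```javascript
// Toy suffix-stripping stemmer, a stand-in for a real one such as Porter.
const toyStem = (token) => token.replace(/(ing|ed|s)$/, "");

// Hypothetical synonym table. It is only consulted when a query is
// processed, never at index time, so it can be edited at runtime.
const synonyms = new Map([["quick", ["fast", "rapid"]]]);

// Expand each query token with its synonyms, then stem every variant so the
// resulting terms line up with the stemmed terms stored in the index.
function processQuery(text) {
  const terms = new Set();
  for (const token of text.toLowerCase().split(/\s+/).filter(Boolean)) {
    for (const variant of [token, ...(synonyms.get(token) || [])]) {
      terms.add(toyStem(variant));
    }
  }
  return [...terms];
}

const before = processQuery("quick loading");

// Add a synonym at runtime; the index itself is untouched.
synonyms.set("loading", ["fetching"]);
const after = processQuery("quick loading");

console.log(before);
console.log(after);
```

The second call picks up the new synonym without any reindexing, since only the query side changed.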

I will watch your PR 👍

srenauld · 2019-05-25T14:02:43Z

@anentropic there has been no sign of updates on the PR.

Would you like me to prop it up as its own package on npm so you can get a feel for the changes and see if it fits your goals?
