Separate document and query token pipelines? #105

Open
anentropic opened this issue Apr 15, 2019 · 3 comments
Comments

@anentropic

anentropic commented Apr 15, 2019

elasticlunr.Pipelines maintain an ordered list of functions to be applied to both document tokens and query tokens.

Does this mean there is one pipeline that is applied to both document tokens and query tokens?

For some use cases it would be useful to have separate pipelines for documents and queries, for example implementing something like "variations" in Whoosh, where a single token is expanded to a list of tokens at query time:
https://whoosh.readthedocs.io/en/latest/stemming.html#variations

Is that possible?

@srenauld
Collaborator

That is an interesting concept. I don't think I've come across this one applied to full-text searching before.

The issue I see with this is that, while the index size would be roughly the same (a stemmer does not create additional terms, though some terms may end up merged by stemming), the query cost could skyrocket. Let's use the render example from Whoosh.

At query time, this would be expanded into every possible inflected form. This means that rather than stemming and getting a one-term query, you're looking at a boolean query with 20+ should clauses.
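To make the cost difference concrete, here is a rough sketch. The suffix list and both helper functions are hypothetical stand-ins (not Whoosh's actual variations table, and not elasticlunr's API); the list is kept short for readability, whereas a real variations table would produce many more forms.

```javascript
// Toy suffix list; a hypothetical stand-in for a real variations table.
const SUFFIXES = ["", "s", "ed", "er", "ers", "ing", "ings"];

// Stemming approach: the query collapses to a single term.
function stemQuery(term) {
  // Strip the longest matching suffix (naive; not a real stemmer).
  const suffix = SUFFIXES
    .filter((s) => s && term.endsWith(s))
    .sort((a, b) => b.length - a.length)[0];
  return [suffix ? term.slice(0, -suffix.length) : term];
}

// Variations approach: one "should" clause per generated form.
function expandQuery(term) {
  return SUFFIXES.map((s) => term + s);
}

const stemmed = stemQuery("rendering"); // one term to look up
const expanded = expandQuery("render"); // one clause per suffix
console.log(stemmed.length, expanded.length);
```

Each expanded clause is a separate postings lookup, so the query cost grows with the size of the variations list rather than staying constant.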

It is definitely an interesting case, although one I think most people would not use, particularly for non-English languages. German in particular would get mighty spicy with this, since words can be part of other words.

I've done this in the PR (#106) I have waiting, which substantially changes a number of things under the hood. Once it is merged, you'll be able to use it; feel free to test out the branch if you're curious.

@anentropic
Author

My use case was for synonyms, rather than the stemming-style variations shown in the Whoosh example. So I have a more limited set of variations expanded into the query (which are then matched against stemmed words in the index as normal).
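A minimal sketch of that setup, assuming a toy inverted index rather than elasticlunr's actual API. Every name here (`SYNONYMS`, `documentPipeline`, `queryPipeline`, `buildIndex`, `search`) is hypothetical, and the stemmer is a naive plural stripper used only for illustration.

```javascript
// Hypothetical synonym table; it can change at runtime without reindexing.
const SYNONYMS = new Map([["car", ["automobile", "vehicle"]]]);

// Naive plural-stripping stemmer, for illustration only.
const stem = (token) => token.replace(/s$/, "");
const tokenize = (text) => text.toLowerCase().split(/\W+/).filter(Boolean);

// Document pipeline: tokenize, then stem.
const documentPipeline = (text) => tokenize(text).map(stem);

// Query pipeline: tokenize, expand synonyms, then stem each variant.
const queryPipeline = (text) =>
  tokenize(text)
    .flatMap((t) => [t, ...(SYNONYMS.get(t) || [])])
    .map(stem);

// Build a toy inverted index: term -> Set of document ids.
function buildIndex(docs) {
  const index = new Map();
  for (const [id, text] of Object.entries(docs)) {
    for (const term of documentPipeline(text)) {
      if (!index.has(term)) index.set(term, new Set());
      index.get(term).add(id);
    }
  }
  return index;
}

// OR together the postings for every expanded query term.
function search(index, query) {
  const hits = new Set();
  for (const term of queryPipeline(query)) {
    for (const id of index.get(term) || []) hits.add(id);
  }
  return [...hits].sort();
}

const index = buildIndex({
  a: "Used cars for sale",
  b: "Vintage automobiles",
  c: "Bicycle repair",
});
console.log(search(index, "car"));
```

Only the query side knows about synonyms, so the index holds the same stemmed terms it would hold without them.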

Perhaps there are other use cases where you would want to dynamically change the available variations at runtime without reindexing.

I will watch your PR 👍

@srenauld
Collaborator

@anentropic there has been no sign of updates on the PR.

Would you like me to publish it as its own package on npm so you can get a feel for the changes and see if it fits your goals?
