mappings: add ngram analyzer #127
jrcastro2 commented on Feb 21, 2024
- Allows partial matches on search
- Closes CERNDocumentServer/cds-rdm#114 (Some users cannot be found when inviting or setting restrictions)
"type": "text", | ||
"analyzer": "ngram_3_to_5_analyzer", | ||
"fields": { | ||
"keyword": { | ||
"type": "keyword" | ||
} | ||
} | ||
}, | ||
"email": { | ||
"type": "keyword" | ||
"type": "text", | ||
"analyzer": "ngram_3_to_5_analyzer", | ||
"fields": { | ||
"keyword": { | ||
"type": "keyword" | ||
} | ||
} |
Note: this has to be tested and checked throughout the code, because an exact match on the email now has to query email.keyword instead of email.
In principle, when we do this elsewhere (e.g. for custom fields), we use field.keyword only for faceting.
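The two comments above come down to which sub-field a query targets. As a rough sketch (the helper names are hypothetical, and the request bodies are plain dicts in Elasticsearch/OpenSearch query DSL rather than code taken from this PR), exact and partial lookups could be kept apart like this:

```python
def exact_email_query(email):
    """Exact lookup: a term query is not analyzed, so it targets the keyword sub-field."""
    # Assumption: the keyword sub-field has no normalizer, so this is case-sensitive.
    return {"query": {"term": {"email.keyword": email}}}


def partial_email_query(fragment):
    """Partial lookup: a match query runs the fragment through the field's analyzer."""
    return {"query": {"match": {"email": fragment}}}


if __name__ == "__main__":
    print(exact_email_query("jane.doe@example.org"))
    print(partial_email_query("doe"))
```

Any existing code path that filters on email with a term query would need to move to email.keyword, since after this change the email field itself holds n-gram tokens rather than the full address.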
"settings": { | ||
"index": { | ||
"max_ngram_diff": 2 | ||
}, | ||
"analysis": { | ||
"analyzer": { | ||
"ngram_3_to_5_analyzer": { | ||
"type": "custom", | ||
"tokenizer": "ngram_tokenizer_3_to_5", | ||
"filter": ["lowercase"] | ||
} | ||
}, | ||
"tokenizer": { | ||
"ngram_tokenizer_3_to_5": { | ||
"type": "ngram", | ||
"min_gram": 3, | ||
"max_gram": 5 | ||
} | ||
} | ||
} | ||
}, |
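To make the effect of these settings more concrete, here is a minimal pure-Python sketch of the token stream this analyzer should produce. It only approximates the ngram tokenizer (assuming its default of keeping every character) followed by the lowercase filter. The helper name ngram_3_to_5 is hypothetical, and nothing here talks to a cluster.

```python
def ngram_3_to_5(text, min_gram=3, max_gram=5):
    """Approximate ngram_3_to_5_analyzer: character n-grams, then lowercase."""
    tokens = []
    # Slide over every start offset and emit grams from shortest to longest.
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size > len(text):
                break  # longer grams would run past the end of the text
            tokens.append(text[start:start + size].lower())
    return tokens


if __name__ == "__main__":
    print(ngram_3_to_5("Castro"))
```

Because every 3-to-5 character substring becomes a token, a search fragment such as "str" can hit "Castro" even though it is neither a prefix nor a whole word.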
This seems to do a good job of finding partial matches. However, we need to run some performance tests to make sure it does not degrade performance too much.
Have you tested this with 2-letter names, e.g. Lu? I have the feeling that min_gram should be 2...
If I am not mistaken, we were only returning results once 3 letters had been typed. That said, min_gram and max_gram can be adjusted and tested much more, especially to see how much the index grows and how that affects performance. Lowering min_gram to 2 will produce many more matches; it should be tested, and it may well be worth doing 👍
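The trade-off in this thread can be estimated before running any benchmark. The sketch below (a hypothetical helper, pure Python, and only a rough proxy for real index size) counts how many grams a single value yields under the current 3-to-5 setting versus a 2-to-5 setting:

```python
def count_ngrams(text, min_gram, max_gram):
    """Number of character n-grams of each allowed size in `text`."""
    n = len(text)
    return sum(max(0, n - size + 1) for size in range(min_gram, max_gram + 1))


if __name__ == "__main__":
    for value in ("Lu", "jrcastro2"):
        # Columns: value, grams with min_gram=3, grams with min_gram=2
        print(value, count_ngrams(value, 3, 5), count_ngrams(value, 2, 5))
```

A 2-letter value produces no grams at all with min_gram 3, which is why a name like Lu would not be found through this field. Note also that a 2-to-5 range would require index.max_ngram_diff to be raised to at least 3.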
Have you also considered search-as-you-type (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-as-you-type.html), added only to the specific fields that need it, instead of ngram?
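For comparison, a search-as-you-type variant might look roughly like the sketch below. The field name username and the helper are assumptions for illustration only. The mapping type and the bool_prefix multi_match over the generated _2gram and _3gram sub-fields follow the linked documentation, and the request bodies are plain dicts.

```python
FIELD = "username"  # hypothetical field name, not taken from this PR

# Mapping fragment: the field type generates the shingle and prefix sub-fields itself.
mapping = {"properties": {FIELD: {"type": "search_as_you_type"}}}


def as_you_type_query(text, field=FIELD):
    """Build a bool_prefix multi_match over the field and its shingle sub-fields."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "type": "bool_prefix",
                "fields": [field, f"{field}._2gram", f"{field}._3gram"],
            }
        }
    }


if __name__ == "__main__":
    print(as_you_type_query("jrc"))
```

One difference to weigh: as far as I understand it, this field type matches prefixes of terms, so a fragment from the middle of a word would likely not match the way it does with the ngram analyzer.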