mappings: add ngram analyzer #127

jrcastro2 · 2024-02-21T16:53:23Z

* Allows partial matches on search
* Closes CERNDocumentServer/cds-rdm#114 ("Some users cannot be found when inviting or setting restrictions")

jrcastro2 · 2024-02-21T16:54:18Z

invenio_users_resources/records/mappings/os-v1/users/user-v2.0.0.json

+        "type": "text",
+        "analyzer": "ngram_3_to_5_analyzer",
+        "fields": {
+          "keyword": {
+            "type": "keyword"
+          }
+        }
      },
      "email": {
-        "type": "keyword"
+        "type": "text",
+        "analyzer": "ngram_3_to_5_analyzer",
+        "fields": {
+          "keyword": {
+            "type": "keyword"
+          }
+        }


Note: this has to be tested and checked throughout the code, because an exact match on the email now has to query email.keyword instead of email.

zzacharo · 2024-02-22

In principle, when we do that elsewhere (e.g. for custom fields), we use field.keyword only for faceting.
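To make the note above concrete, here is a minimal sketch of how the two kinds of query might look once email is a text field with a keyword sub-field. The helper names and the sample address are hypothetical; the dicts are plain query DSL bodies and are not taken from this PR.

```python
def exact_email_query(email):
    """Exact match: term query against the un-analyzed keyword sub-field."""
    return {"query": {"term": {"email.keyword": email}}}


def partial_email_query(fragment):
    """Partial match: match query against the n-gram analyzed text field."""
    return {"query": {"match": {"email": fragment}}}


# Hypothetical sample values, for illustration only.
exact = exact_email_query("jane.doe@example.org")
partial = partial_email_query("doe")
```

Since the keyword sub-field in this diff has no normalizer, the term query would presumably be case-sensitive, which is one more thing to check wherever the code relied on matching email exactly.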

jrcastro2 · 2024-02-21T16:56:03Z

invenio_users_resources/records/mappings/os-v1/users/user-v2.0.0.json

+  "settings": {
+    "index": {
+      "max_ngram_diff": 2
+    },
+    "analysis": {
+      "analyzer": {
+        "ngram_3_to_5_analyzer": {
+          "type": "custom",
+          "tokenizer": "ngram_tokenizer_3_to_5",
+          "filter": ["lowercase"]
+        }
+      },
+      "tokenizer": {
+        "ngram_tokenizer_3_to_5": {
+          "type": "ngram",
+          "min_gram": 3,
+          "max_gram": 5
+        }
+      }
+    }
+  },


This looks like it does a good job of finding partial matches. However, we would need to run some performance tests to make sure it does not hurt performance too much.
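As a rough illustration of why this gives partial matches, the sketch below approximates in plain Python what the ngram tokenizer (min_gram 3, max_gram 5) plus the lowercase filter would emit. It is a simplification, not the real analyzer. It also assumes the same analyzer runs at search time and that a match query with the default OR operator is used, so a hit only needs one shared n-gram. The function names are hypothetical.

```python
def ngram_tokens(text, min_gram=3, max_gram=5):
    """Approximate the ngram tokenizer followed by the lowercase filter."""
    text = text.lower()
    tokens = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                tokens.append(text[start:start + size])
    return tokens


def would_match(query, indexed_value):
    """True if the query and the indexed value share at least one n-gram."""
    return bool(set(ngram_tokens(query)) & set(ngram_tokens(indexed_value)))


print(ngram_tokens("Castro"))
# ['cas', 'cast', 'castr', 'ast', 'astr', 'astro', 'str', 'stro', 'tro']
print(would_match("astr", "Castro"))  # True
print(would_match("Lu", "Lu"))        # False: shorter than min_gram, no tokens
```

Under these assumptions, any value sharing a single 3-character run with the query counts as a hit, which is good for recall but is also why relevance and performance need testing.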

zzacharo · 2024-02-22

Have you tested this with two-letter names, e.g. Lu? I have the feeling that min_gram should be 2...

jrcastro2 · 2024-02-22

If I am not mistaken, we were only returning results once 3 letters had been typed. min_gram and max_gram can still be adjusted and tested much more, especially to take into account how much the index will grow and how that affects performance. Lowering min_gram to 2 will produce far more matches, but it should be tested and it may well be worth doing 👍
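To get a feel for the trade-off discussed here, the following back-of-the-envelope sketch counts how many n-grams a value of a given length would produce for min_gram 3 versus 2. Raw token counts are only a rough proxy for index growth, since the real index deduplicates terms. The sample value and function names are hypothetical. The sketch also shows that lowering min_gram to 2 with max_gram 5 would need max_ngram_diff raised to 3, because that setting has to cover max_gram - min_gram.

```python
def ngram_count(length, min_gram, max_gram):
    """Number of n-grams emitted for a value of the given length."""
    return sum(max(length - size + 1, 0) for size in range(min_gram, max_gram + 1))


def required_max_ngram_diff(min_gram, max_gram):
    """index.max_ngram_diff must be at least max_gram - min_gram."""
    return max_gram - min_gram


sample = "jane.doe@example.org"  # hypothetical 20-character value
for min_gram in (3, 2):
    print(
        "min_gram:", min_gram,
        "tokens for sample:", ngram_count(len(sample), min_gram, 5),
        "tokens for 'Lu':", ngram_count(len("Lu"), min_gram, 5),
        "max_ngram_diff needed:", required_max_ngram_diff(min_gram, 5),
    )
```

With these numbers, a two-letter name produces no tokens at all with min_gram 3 and exactly one with min_gram 2, at the cost of roughly a third more tokens for the 20-character sample.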

zzacharo · 2024-02-22

Have you also considered search-as-you-type (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-as-you-type.html), added only to the specific fields that need it, instead of ngram?
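For comparison, here is a minimal sketch of what the search-as-you-type alternative could look like, following the pattern in the linked Elasticsearch documentation. The field name is hypothetical and not taken from this PR, and whether the OpenSearch version targeted by the os-v1 mappings supports this field type would need to be checked.

```python
FIELD = "display_name"  # hypothetical field name, not taken from this PR

# Mapping fragment: the field type creates its own shingle sub-fields.
mapping_fragment = {
    FIELD: {"type": "search_as_you_type"}
}


def as_you_type_query(text, field=FIELD):
    """multi_match bool_prefix query over the field and its shingle sub-fields."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "type": "bool_prefix",
                "fields": [field, f"{field}._2gram", f"{field}._3gram"],
            }
        }
    }


query = as_you_type_query("jos")
```

As far as I understand, this approach is geared towards completing the prefix of the last typed term, so matching in the middle of a word, which the ngram analyzer allows, may behave differently and would need to be compared.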

Commit b6792e7: mappings: add ngram analyzer
