Documnet setting with external synonyms file #2647

Closed
MobinaPak opened this issue Jul 25, 2023 · 7 comments
Labels
type: enhancement A general enhancement

Comments

@MobinaPak

MobinaPak commented Jul 25, 2023

Hi.

I was wondering if test case "SynonymRepositoryIntegrationTests" works with setting file "/synonyms/settings-with-external-file.json"?

I want my synonym file to be part of my source code (instead of using a file in Elasticsearch or listing the synonyms in the analyzer filter definition). So I tried the approach from the test case, but it didn't work.

It would be nice to provide a feature so that you can include synonyms in the source code (In a way that the test case would work :))
Thanks

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged label Jul 25, 2023
@sothawo
Collaborator

sothawo commented Jul 25, 2023

When I look at the Elasticsearch documentation, the synonyms are either provided in a file on the server or defined inline in the settings. For Spring Data Elasticsearch this currently leaves two possibilities:

  1. You define them inline in a settings file like in the SynonymRepositoryIntegrationTests, which you do not want to do.
  2. You set the createIndex parameter in the @Document annotation to false and make sure that on application startup you create the index yourself with IndexOperations.create(java.util.Map<java.lang.String,java.lang.Object>, org.springframework.data.elasticsearch.core.document.Document), providing the mapping (which you can obtain with IndexOperations.createMapping()) and the settings. The settings you would need to create yourself.

Thinking about this, one solution might be to add some kind of template resolution when loading a settings or mapping JSON file, but I am hesitant to introduce some random templating pattern like "#synonyms#" into Spring Data Elasticsearch.

But you could try this approach by yourself. Write a settings file that might look like this (using the example from the test):

{
  "index": {
    "number_of_shards": "1",
    "number_of_replicas": "0",
    "analysis": {
      "analyzer": {
        "synonym_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "synonym_filter"
          ]
        }
      },
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "lenient": true,
          "synonyms": [
            $$MY-SYNONYMS_HERE$$
          ]
        }
      }
    }
  }
}

Read this file into a String. Then let's assume that you have a file of synonym definitions, one per line. Read that file line by line, wrap each line in double quotes, and then join the lines with a comma as separator. This joined string would then replace the "$$MY-SYNONYMS_HERE$$" placeholder from the settings file.

After that replacement you can pass the resulting String to the org.springframework.data.elasticsearch.core.index.Settings#parse() method to get a Settings object (which implements Map<String, Object>). With that you can create the index.
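
The string handling described above can be sketched in plain Java. This is only a minimal sketch under a few assumptions: the class and method names are hypothetical, blank lines in the synonym file are skipped, and backslashes and double quotes are escaped so that the result stays valid JSON (neither of those two details is part of the original suggestion).

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Spring Data Elasticsearch.
final class SynonymSettingsTemplate {

    // Placeholder as used in the settings template above.
    static final String PLACEHOLDER = "$$MY-SYNONYMS_HERE$$";

    /** Wraps each synonym line in double quotes and joins the lines with commas. */
    static String toJsonArrayBody(List<String> lines) {
        return lines.stream()
                .map(String::trim)
                .filter(line -> !line.isEmpty()) // assumption: blank lines are ignored
                // assumption: escape backslashes and quotes to keep the JSON valid
                .map(line -> "\"" + line.replace("\\", "\\\\").replace("\"", "\\\"") + "\"")
                .collect(Collectors.joining(", "));
    }

    /** Replaces the placeholder in the template with the quoted, joined synonym lines. */
    static String render(String template, List<String> synonymLines) {
        // String.replace works on literal text, so the '$' characters need no escaping.
        return template.replace(PLACEHOLDER, toJsonArrayBody(synonymLines));
    }
}
```

The template and the synonym lines could be read from the classpath or with java.nio.file.Files.readString and Files.readAllLines; the String returned by render would then be the input for the Settings parse method mentioned above.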

Basically that could be integrated somehow into Spring Data Elasticsearch. But there could be as well something like an include mechanism that might look like this (just writing down an idea):

{
  "index": {
    "number_of_shards": "1",
    "number_of_replicas": "0",
    "analysis": {
      "analyzer": {
        "synonym_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "synonym_filter"
          ]
        }
      },
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "lenient": true,
          "$include synonyms": "synonyms.txt"
        }
      }
    }
  }
}

That would be valid JSON; after reading it into a map we'd need to iterate through the map and replace each property with a name like "$include <name>" and a value of "<filename>" by a new property named <name> whose value is the content of <filename>. That could be repeated recursively.

Not sure about that, have to think some time about that.
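
To make the idea more concrete, here is a rough sketch of such a recursive replacement on a parsed map. Everything in it is hypothetical and only illustrates the idea: the class name, the "$include " prefix handling, and the loader function that maps a file name to its parsed content are assumptions, not an existing Spring Data Elasticsearch API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the "$include <name>" idea, not an existing API.
final class IncludeResolver {

    static final String PREFIX = "$include ";

    /**
     * Returns a copy of the source map in which every entry "$include <name>": "<filename>"
     * is replaced by an entry "<name>" holding whatever the loader returns for "<filename>".
     */
    @SuppressWarnings("unchecked")
    static Map<String, Object> resolve(Map<String, Object> source, Function<String, Object> loader) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : source.entrySet()) {
            String key = entry.getKey();
            Object value = entry.getValue();
            if (key.startsWith(PREFIX) && value instanceof String) {
                Object included = loader.apply((String) value);
                if (included instanceof Map) {
                    // included content may itself contain includes
                    included = resolve((Map<String, Object>) included, loader);
                }
                result.put(key.substring(PREFIX.length()).trim(), included);
            } else if (value instanceof Map) {
                // descend into nested objects
                result.put(key, resolve((Map<String, Object>) value, loader));
            } else {
                result.put(key, value);
            }
        }
        return result;
    }
}
```

For the synonyms example the loader would return the lines of synonyms.txt as a List<String>. The sketch does not descend into lists and does not detect include cycles; a real implementation would have to decide how to handle both.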

@sothawo sothawo added the status: waiting-for-feedback We need additional information before we can continue label Jul 25, 2023
@MobinaPak
Author

Thank you for responding

About the first option: as you mentioned, I can't include the synonyms in the JSON file, because I have more than a hundred synonyms and that reduces the file's readability.

Your second suggestion is a good alternative to solve the problem, but I don't want to create the index myself. It complicates my code, and the automatic way is actually a great fit.

The idea that you mentioned is awesome. It would be great if we could add some extra notations to our settings that Spring Data Elasticsearch would resolve before creating the index. The similar case of a stopword file comes to mind. It might also be useful for other field values. For example, you might want to add a part of your analyzer settings only if some condition is true, or load a field value from configuration properties or environment variables. I'm not sure if it's a feature that all projects would use. Just thinking out loud here :)

@sothawo
Collaborator

sothawo commented Aug 1, 2023

I think providing such an include mechanism could be handy in quite a few situations, so I created #2657 as a general improvement for this.

As for the use with synonyms keep in mind what the documentation states:

However, it is recommended to define large synonyms set in a file using synonyms_path, because specifying them inline increases cluster size unnecessarily.

@sothawo sothawo added type: enhancement A general enhancement and removed status: waiting-for-triage An issue we've not yet triaged status: feedback-provided Feedback has been provided labels Aug 1, 2023
@MobinaPak
Author

Thanks

One part of our conversation was left open, which I think could be a useful feature as well.

It would be useful to have access to config properties in the settings JSON file.

For example :

{
  "index": {
    "number_of_shards": "1",
    "number_of_replicas": "0",
    "analysis": {
      "analyzer": {
        "synonym_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "synonym_filter"
          ]
        }
      },
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "lenient": "${synonyms.lenient}",
          "synonyms": "${synonyms.file.path}"
        }
      }
    }
  }
}

Here "synonyms.lenient" and "synonyms.file.path" are properties that I've defined in my application.yml. This feature is not only for synonyms; it would be suitable for every block of the settings file. Maybe someone wants to fetch "number_of_shards" from config properties as well!

@MobinaPak
Author

Another thing is still unclear to me (as I mentioned when I first created this issue).

I was wondering if test case "SynonymRepositoryIntegrationTests" works with setting file "/synonyms/settings-with-external-file.json"?

If not, please remove the mentioned JSON file from the project's test files, because it's confusing!

Thanks

@sothawo
Collaborator

sothawo commented Aug 6, 2023

As for the use of property values: they would need to be retrieved from the Spring environment. Although this would be possible, a replacement mechanism would be much more complicated. We would not only need to check the keys in a Map parsed from JSON to find a place to include another JSON fragment, but would also need to parse the values of these map entries. And this would not work, for example, for numeric parameters, as

"number_of_shards": ${config.number-of-shards}

would not be parsed into a Map as it is not valid JSON.
If you want to configure index settings parameters via the configuration, I'd suggest that you programmatically create a Settings object and set these values, then call the appropriate IndexOperations method.
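
As a rough illustration of building the settings in code: since Settings implements Map<String, Object>, the sketch below uses a plain nested map so that it runs without Spring on the classpath. The class and method names are hypothetical, and the parameters stand in for values that would be read from the Spring environment (for example via @Value); how they are injected is an assumption left out here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: builds the same structure as the JSON settings file, but from Java values.
final class SynonymIndexSettings {

    static Map<String, Object> build(int shards, int replicas, boolean lenient, List<String> synonyms) {
        Map<String, Object> synonymFilter = new LinkedHashMap<>();
        synonymFilter.put("type", "synonym");
        synonymFilter.put("lenient", lenient);    // a real boolean, no string placeholder needed
        synonymFilter.put("synonyms", synonyms);

        Map<String, Object> analyzer = new LinkedHashMap<>();
        analyzer.put("type", "custom");
        analyzer.put("tokenizer", "whitespace");
        analyzer.put("filter", List.of("synonym_filter"));

        Map<String, Object> analysis = new LinkedHashMap<>();
        analysis.put("analyzer", Map.of("synonym_analyzer", analyzer));
        analysis.put("filter", Map.of("synonym_filter", synonymFilter));

        Map<String, Object> index = new LinkedHashMap<>();
        index.put("number_of_shards", shards);    // numeric values work, unlike in a JSON template
        index.put("number_of_replicas", replicas);
        index.put("analysis", analysis);

        Map<String, Object> settings = new LinkedHashMap<>();
        settings.put("index", index);
        return settings;
    }
}
```

The resulting map would then be handed, together with the mapping, to the IndexOperations.create(Map, Document) method mentioned earlier in this thread, with createIndex set to false on the @Document annotation.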

As for the synonyms/settings-with-external-file.json file: this indeed seems to be a leftover from the times when Spring Data Elasticsearch started its own Elasticsearch instance for the integration tests. I need to check whether we can mount this into the Testcontainers instance and reintroduce a test for it, or else remove it.

@MobinaPak
Author

Thank you for your cooperation and response
