Skip to content
This repository has been archived by the owner on May 27, 2020. It is now read-only.

complex analyzer builder analyzer without requiring a custom jar addition #306

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

rmannibucau
Copy link

Idea is to enable to configure through the json payload passed when creating the index/analyzers some more advanced and open configuration. Here is a sample:

{
  "type": "complex",
  "tokenizer": {
    "class": "ngram",
    "parameters": [
      "1",
      "2"
    ]
  },
  "token_streams": [
    {
      "class": "stop",
      "parameters": [
        null,
        "a,an,and,are,as,at,be,but,by,for,if,in,into,is,it,no,not,of,on,or,such,that,the,their,then,there,these,they,this,to,was,will,with"
      ]
    },
    {
      "class": "org.apache.lucene.analysis.core.LowerCaseFilter",
      "parameters": [
        null
      ]
    },
    {
      "class": "org.apache.lucene.analysis.standard.StandardFilter",
      "parameters": [
        null
      ]
    }
  ]
}

And here is the equivalent java code:

public classMyAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(final String field) {
        final Tokenizer source = new NGramTokenizer(1, 2);
        final TokenStream result = new StopFilter(
                new LowerCaseFilter(new StandardFilter(source)),
                new CharArraySet(asList(/*list of stop words*/), true));
        return new TokenStreamComponents(source, result);
    }
}

…analyzer without requiring a custom jar addition
@ealonsodb
Copy link
Contributor

Hi @rmannibucau:
Please excuse my late answer. Grab a coffee , this is going to be long.

We need to admit that this is an awesome feature. It gives user the ability to create custom analyzers just like elastic. We definitely want this feature included in the project without any doubts.

There are some things we want you to change before merging this pr:

  • You have used the master branch as the starting point to develop this feature but that branch is not upgraded with last changes for a long time. Can you rebase it to branch-3.0.12 please?

  • We don't really like how you have solved the json parsing part. We have always strived to mantain clean and well validated jsons. I would generate a new package named 'com.stratio.cassandra.lucene.schema.analysis.tokenizer' with an abstract class called TokenizerBuilder and one inherited class for every type of tokenizer. I would repeat this process for character filters and token filters with the corespongding unit tests for all of them(correct json parsing, invalid json parsing producing an exception with the correct error message).

I attach two pictures that would explain this better:

abstract_tokenizer_builder

child_tokenizer_builder

  • Also, parameters validation is not very helpful, error messages does not help user to know exactly where the error is. Also, if any user wants to use this project, it would be great if that user does not need to dig into lucene documentation to be able to know which are a certain filter parameter types. So, a new section id doc/documentation.rst must be added.

  • What about including tests for an invalid (tokenizer, char_filter..) type?

  • Also, there is a submodule named builder used by clients to easily generate CQL queries. This functionality must be included in that. It also has unit tests

I have started to develop my own version (what images shows) of this feature in branch feature/build_custom_analyzer. I will upload it

soon.

I can continue with this but it is your decision.
Do you want to change your code to meet our requirements with me as the reviewer or do you want me to develop this with you as reviewer?

Looking forward your answer.

@rmannibucau
Copy link
Author

Hmm, have to admit I dont know how you can achieve json validation and opening of the instantiated instance cause one of the goal was to ensure we have shortcuts for well know instances/types - where this fully makes sense to use inheritance - but also customs or evolutive ones - where you cant validate it anyway.

What would be the plan for such support?

Also not sure I get the versioning (that's surely why i used master): why 3.0 and not 3.10?

If you are on that I guess you'll be faster than me so happy to close that otherwise I can give it a try in a few weeks probably (can't at the moment).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants