complex analyzer builder analyzer without requiring a custom jar addition #306

rmannibucau · 2017-03-24T21:43:09Z

Idea is to enable to configure through the json payload passed when creating the index/analyzers some more advanced and open configuration. Here is a sample:

{
  "type": "complex",
  "tokenizer": {
    "class": "ngram",
    "parameters": [
      "1",
      "2"
    ]
  },
  "token_streams": [
    {
      "class": "stop",
      "parameters": [
        null,
        "a,an,and,are,as,at,be,but,by,for,if,in,into,is,it,no,not,of,on,or,such,that,the,their,then,there,these,they,this,to,was,will,with"
      ]
    },
    {
      "class": "org.apache.lucene.analysis.core.LowerCaseFilter",
      "parameters": [
        null
      ]
    },
    {
      "class": "org.apache.lucene.analysis.standard.StandardFilter",
      "parameters": [
        null
      ]
    }
  ]
}

And here is the equivalent java code:

public classMyAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(final String field) {
        final Tokenizer source = new NGramTokenizer(1, 2);
        final TokenStream result = new StopFilter(
                new LowerCaseFilter(new StandardFilter(source)),
                new CharArraySet(asList(/*list of stop words*/), true));
        return new TokenStreamComponents(source, result);
    }
}

…analyzer without requiring a custom jar addition

ealonsodb · 2017-04-12T13:33:37Z

Hi @rmannibucau:
Please excuse my late answer. Grab a coffee , this is going to be long.

We need to admit that this is an awesome feature. It gives user the ability to create custom analyzers just like elastic. We definitely want this feature included in the project without any doubts.

There are some things we want you to change before merging this pr:

You have used the master branch as the starting point to develop this feature but that branch is not upgraded with last changes for a long time. Can you rebase it to branch-3.0.12 please?
We don't really like how you have solved the json parsing part. We have always strived to mantain clean and well validated jsons. I would generate a new package named 'com.stratio.cassandra.lucene.schema.analysis.tokenizer' with an abstract class called TokenizerBuilder and one inherited class for every type of tokenizer. I would repeat this process for character filters and token filters with the corespongding unit tests for all of them(correct json parsing, invalid json parsing producing an exception with the correct error message).

I attach two pictures that would explain this better:

Also, parameters validation is not very helpful, error messages does not help user to know exactly where the error is. Also, if any user wants to use this project, it would be great if that user does not need to dig into lucene documentation to be able to know which are a certain filter parameter types. So, a new section id doc/documentation.rst must be added.
What about including tests for an invalid (tokenizer, char_filter..) type?
Also, there is a submodule named builder used by clients to easily generate CQL queries. This functionality must be included in that. It also has unit tests

I have started to develop my own version (what images shows) of this feature in branch feature/build_custom_analyzer. I will upload it

soon.

I can continue with this but it is your decision.
Do you want to change your code to meet our requirements with me as the reviewer or do you want me to develop this with you as reviewer?

Looking forward your answer.

rmannibucau · 2017-04-12T13:55:18Z

Hmm, have to admit I dont know how you can achieve json validation and opening of the instantiated instance cause one of the goal was to ensure we have shortcuts for well know instances/types - where this fully makes sense to use inheritance - but also customs or evolutive ones - where you cant validate it anyway.

What would be the plan for such support?

Also not sure I get the versioning (that's surely why i used master): why 3.0 and not 3.10?

If you are on that I guess you'll be faster than me so happy to close that otherwise I can give it a try in a few weeks probably (can't at the moment).

adding a complex analyzer builder to let the user customize more the …

a34ac09

…analyzer without requiring a custom jar addition

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

complex analyzer builder analyzer without requiring a custom jar addition #306

complex analyzer builder analyzer without requiring a custom jar addition #306

rmannibucau commented Mar 24, 2017

ealonsodb commented Apr 12, 2017

rmannibucau commented Apr 12, 2017

complex analyzer builder analyzer without requiring a custom jar addition #306

Are you sure you want to change the base?

complex analyzer builder analyzer without requiring a custom jar addition #306

Conversation

rmannibucau commented Mar 24, 2017

ealonsodb commented Apr 12, 2017

rmannibucau commented Apr 12, 2017