[Feature] Support configurable Lucene analyzer with args and configurable query parser #13003

jackluo923 · 2024-04-24T21:03:38Z

This pull request builds on #12027 [Configurable Lucene Analyzer] with two feature enhancements that make our Lucene-based text processing more flexible:

Enhanced Flexibility for Custom Lucene Analyzers:
We've added backward-compatible support for passing arbitrary arguments, of varying types, to custom Lucene analyzers via reflection. Analyzer behavior can now be customized at runtime from the table config, so the tokenization process can be adapted to different requirements without altering the underlying codebase.
Configurable Lucene Query Parser:
We've added the ability to configure which Lucene query parser to use at runtime. This makes it possible to tailor query parsing to specific use cases, improving the efficiency and relevance of search operations. At the moment, we only support query parsers that inherit from Lucene's QueryParserBase class and implement both a Class(Field f, Analyzer a) constructor and a Query parse(String query) instance method. In the future, we may add support for more complex query parsers such as MultiFieldQueryParser, which does not implement the Query parse(String query) instance method.
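
A minimal sketch of the reflection mechanism described above, assuming that argument values and their type names arrive as two parallel lists of strings. It uses java.lang.StringBuilder as a stand-in target so that it runs without Lucene on the classpath. The class ReflectiveFactorySketch and its helpers resolveType, convert, and instantiate are hypothetical names, not taken from this PR, and the small set of supported types is an assumption.

```java
import java.lang.reflect.Constructor;
import java.util.List;

final class ReflectiveFactorySketch {
  // Hypothetical helper: maps a configured type name to a Class.
  // Assumption: names may carry surrounding spaces or a trailing ".class".
  static Class<?> resolveType(String typeName) throws ClassNotFoundException {
    String name = typeName.trim();
    if (name.endsWith(".class")) {
      name = name.substring(0, name.length() - ".class".length());
    }
    switch (name) {
      case "int":
        return int.class;
      case "boolean":
        return boolean.class;
      default:
        return Class.forName(name);
    }
  }

  // Hypothetical helper: converts a string argument to the requested type.
  static Object convert(String value, Class<?> type) {
    if (type == String.class) {
      return value;
    }
    if (type == int.class || type == Integer.class) {
      return Integer.parseInt(value.trim());
    }
    if (type == boolean.class || type == Boolean.class) {
      return Boolean.parseBoolean(value.trim());
    }
    throw new IllegalArgumentException("Unsupported argument type: " + type.getName());
  }

  // Looks up the constructor matching argTypes and invokes it with the converted args.
  static Object instantiate(String className, List<String> args, List<String> argTypes) {
    if (args.size() != argTypes.size()) {
      throw new IllegalArgumentException("args and argTypes must have the same length");
    }
    try {
      Class<?>[] types = new Class<?>[argTypes.size()];
      Object[] values = new Object[args.size()];
      for (int i = 0; i < types.length; i++) {
        types[i] = resolveType(argTypes.get(i));
        values[i] = convert(args.get(i), types[i]);
      }
      Constructor<?> constructor = Class.forName(className).getConstructor(types);
      return constructor.newInstance(values);
    } catch (ReflectiveOperationException e) {
      throw new IllegalArgumentException("Cannot instantiate " + className, e);
    }
  }

  public static void main(String[] unused) {
    // Equivalent to: new StringBuilder("hello")
    Object built = instantiate("java.lang.StringBuilder", List.of("hello"), List.of("java.lang.String"));
    System.out.println(built);
  }
}
```

With an empty pair of lists the same lookup resolves the no-argument constructor, which is one way a mechanism like this can stay backward compatible with configs that specify only a class name.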

The combination of these enhancements significantly improves our production system's capability to control text tokenization at runtime. This is especially useful for implementing precise log search functionalities, such as supporting concurrent case-sensitive and case-insensitive searches using wildcards and regular expressions. The flexibility to configure both the Lucene analyzer and query parser dynamically ensures that our application can efficiently handle diverse and complex search requirements.

This enhancement was developed and refined by multiple developers (@jackluo923 @Bill-hbrhbr @itschrispeck @lnbest0707-uber) over several iterations; this PR summarizes the internal changes being contributed to OSS.

codecov-commenter · 2024-04-24T22:34:00Z

Codecov Report

Attention: Patch coverage is 42.25352%, with 82 lines in your changes missing coverage. Please review.

Project coverage is 62.16%. Comparing base (59551e4) to head (3129017).
Report is 387 commits behind head on master.

Files	Patch %	Lines
...ot/segment/local/segment/store/TextIndexUtils.java	30.13%	43 Missing and 8 partials ⚠️
...pache/pinot/segment/spi/index/TextIndexConfig.java	50.00%	16 Missing ⚠️
...cal/segment/index/text/TextIndexConfigBuilder.java	0.00%	5 Missing and 3 partials ⚠️
...me/impl/invertedindex/RealtimeLuceneTextIndex.java	58.33%	4 Missing and 1 partial ⚠️
.../org/apache/pinot/segment/spi/utils/CsvParser.java	86.66%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #13003      +/-   ##
============================================
+ Coverage     61.75%   62.16%   +0.41%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2503      +67     
  Lines        133233   136742    +3509     
  Branches      20636    21191     +555     
============================================
+ Hits          82274    85003    +2729     
- Misses        44911    45445     +534     
- Partials       6048     6294     +246

Flag	Coverage Δ
custom-integration1	`<0.01% <0.00%> (-0.01%)`	⬇️
integration	`<0.01% <0.00%> (-0.01%)`	⬇️
integration1	`<0.01% <0.00%> (-0.01%)`	⬇️
integration2	`0.00% <0.00%> (ø)`
java-11	`62.10% <42.25%> (+0.39%)`	⬆️
java-21	`62.03% <42.25%> (+0.41%)`	⬆️
skip-bytebuffers-false	`62.13% <42.25%> (+0.38%)`	⬆️
skip-bytebuffers-true	`62.01% <42.25%> (+34.29%)`	⬆️
temurin	`62.16% <42.25%> (+0.41%)`	⬆️
unittests	`62.15% <42.25%> (+0.41%)`	⬆️
unittests1	`46.70% <20.42%> (-0.20%)`	⬇️
unittests2	`27.94% <21.83%> (+0.21%)`	⬆️

Flags with carried forward coverage won't be shown.


Bill-hbrhbr

Major comments:

Users may pass in an argTypes string containing spaces (e.g. java.lang.String.class, java.lang.String.class). We need to apply CsvParser with trimming everywhere the argTypes config string is passed in.
Use TextIndexConfigBuilder to clarify the code and reduce future maintenance costs (we won't have to change every occurrence of TextIndexConfig when we add or remove a config key).

Nits:

Use more appropriate exception types

...java/org/apache/pinot/segment/local/realtime/impl/invertedindex/RealtimeLuceneTextIndex.java

pinot-spi/src/main/java/org/apache/pinot/spi/config/table/FieldConfig.java

Bill-hbrhbr · 2024-04-29T14:56:20Z

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/utils/CsvParser.java

+     * and other white space characters. These characters are sometimes expected to be part of the actual argument.
+     *
+     * @param input  string to split on comma
+     * @param escapeComma whether we should ignore "\," during splitting, replace it with "," after split


Suggested change

-     * @param escapeComma whether we should ignore "\," during splitting, replace it with "," after split
+     * @param escapeComma if true, we don't split on escaped commas, and we replace "\," with "," after the split
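
The behavior this Javadoc describes can be sketched as follows. This is a hedged illustration under stated assumptions, not the CsvParser implementation from this PR: the class CsvSplitSketch and its split method are hypothetical, and the trim flag is modeled on the reviewer's request to trim argTypes entries.

```java
import java.util.ArrayList;
import java.util.List;

final class CsvSplitSketch {
  // Splits input on commas. When escapeComma is true, the two-character sequence "\,"
  // does not split and is emitted as a plain ",". When trim is true, surrounding
  // whitespace is removed from every token.
  static List<String> split(String input, boolean escapeComma, boolean trim) {
    List<String> tokens = new ArrayList<>();
    if (input == null || input.isEmpty()) {
      return tokens;
    }
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < input.length(); i++) {
      char c = input.charAt(i);
      boolean escapedComma =
          escapeComma && c == '\\' && i + 1 < input.length() && input.charAt(i + 1) == ',';
      if (escapedComma) {
        current.append(',');
        i++; // skip the comma that was just consumed
      } else if (c == ',') {
        tokens.add(finish(current, trim));
        current.setLength(0);
      } else {
        current.append(c);
      }
    }
    tokens.add(finish(current, trim));
    return tokens;
  }

  private static String finish(StringBuilder token, boolean trim) {
    return trim ? token.toString().trim() : token.toString();
  }

  public static void main(String[] unused) {
    // Type names separated by ", " come back without the stray space when trim is true.
    System.out.println(split("java.lang.String.class, java.lang.String.class", false, true));
    // An escaped comma stays inside its token when escapeComma is true.
    System.out.println(split("a\\,b,c", true, false));
  }
}
```

Keeping trim as a separate flag matters because, as the Javadoc notes, whitespace is sometimes part of the actual argument value, while type names never need it.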

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/TextIndexConfig.java

...segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/TextIndexUtils.java

Bill-hbrhbr · 2024-04-29T18:33:08Z

...a/org/apache/pinot/segment/local/realtime/impl/invertedindex/LuceneMutableTextIndexTest.java

+    TextIndexConfig config = new TextIndexConfig(false, null, null, false, false, null, null, true, 500,
+            analyzerClass, analyzerClassArgs, analyzerClassArgTypes, queryParserClass, false);


Use config builder

Suggested change

-    TextIndexConfig config = new TextIndexConfig(false, null, null, false, false, null, null, true, 500,
-            analyzerClass, analyzerClassArgs, analyzerClassArgTypes, queryParserClass, false);
+    TextIndexConfigBuilder builder = new TextIndexConfigBuilder();
+    if (null != analyzerClass) {
+      builder.withLuceneAnalyzerClass(analyzerClass);
+    }
+    if (null != analyzerClassArgs) {
+      builder.withLuceneAnalyzerClassArgs(analyzerClassArgs);
+    }
+    if (null != analyzerClassArgTypes) {
+      builder.withLuceneAnalyzerClassArgTypes(analyzerClassArgTypes);
+    }
+    if (null != queryParserClass) {
+      builder.withLuceneQueryParserClass(queryParserClass);
+    }
+    TextIndexConfig config = builder.withUseANDForMultiTermQueries(false).build();

lnbest0707-uber · 2024-04-30T03:46:31Z

LGTM overall, echo Bill's style comments. Please help check, thanks.

Bill-hbrhbr

Commented on the two things we discussed. There are also some leftover unit tests that could use the config builder, but that's up to you.

Bill-hbrhbr · 2024-04-30T20:50:32Z

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/utils/CsvParser.java

+    public static String serialize(List<String> input, boolean escapeComma, boolean trim) {
+        Stream<String> tokenStream = input.stream();
+        if (escapeComma) {
+            tokenStream = tokenStream.map(s -> s.replaceAll(",", "\\,"));


Suggested change

-            tokenStream = tokenStream.map(s -> s.replaceAll(",", "\\,"));
+            tokenStream = tokenStream.map(s -> s.replaceAll(",", Matcher.quoteReplacement("\\,")));
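
The reason for this suggestion is that String.replaceAll treats a backslash in its replacement argument as an escape character, so the replacement "\\," (the two characters \ and ,) yields a plain comma and the call changes nothing. The sketch below compares the original call, the suggested fix, and String.replace, which takes both arguments literally. The class EscapeCommaSketch and its method names are hypothetical.

```java
import java.util.regex.Matcher;

final class EscapeCommaSketch {
  // Original form: the backslash escapes the comma inside the replacement,
  // so every "," is replaced by "," and the string comes back unchanged.
  static String naive(String s) {
    return s.replaceAll(",", "\\,");
  }

  // Suggested form: quoteReplacement makes the backslash literal in the replacement.
  static String quoted(String s) {
    return s.replaceAll(",", Matcher.quoteReplacement("\\,"));
  }

  // Alternative: String.replace performs a literal, non-regex substitution.
  static String literal(String s) {
    return s.replace(",", "\\,");
  }

  public static void main(String[] unused) {
    System.out.println(naive("a,b"));   // prints a,b
    System.out.println(quoted("a,b"));  // prints a\,b
    System.out.println(literal("a,b")); // prints a\,b
  }
}
```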

Bill-hbrhbr · 2024-04-30T20:50:56Z

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/TextIndexConfig.java

+          CsvParser.serialize(_luceneAnalyzerClassArgs, false, false),
+          CsvParser.serialize(_luceneAnalyzerClassArgTypes, false, false),


Suggested change

-          CsvParser.serialize(_luceneAnalyzerClassArgs, false, false),
-          CsvParser.serialize(_luceneAnalyzerClassArgTypes, false, false),
+          CsvParser.serialize(_luceneAnalyzerClassArgs, true, false),
+          CsvParser.serialize(_luceneAnalyzerClassArgTypes, true, false),

Changed only _luceneAnalyzerClassArgs to escape commas, since the arg types should never contain an escaped comma.

Bill-hbrhbr

Change name

Bill-hbrhbr · 2024-04-30T21:30:57Z

...segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/TextIndexUtils.java

+  public static Analyzer getAnalyzer(TextIndexConfig config) throws ReflectiveOperationException {
+    String luceneAnalyzerClassName = config.getLuceneAnalyzerClass();
+    List<String> luceneAnalyzerClassArgs = config.getLuceneAnalyzerClassArgs();
+    List<String> luceneAnalyzerClassArgsTypes = config.getLuceneAnalyzerClassArgTypes();


Suggested change

-    List<String> luceneAnalyzerClassArgsTypes = config.getLuceneAnalyzerClassArgTypes();
+    List<String> luceneAnalyzerClassArgTypes = config.getLuceneAnalyzerClassArgTypes();

Bill-hbrhbr · 2024-04-30T21:31:33Z

...segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/TextIndexUtils.java

+
+    if (null == luceneAnalyzerClassName || luceneAnalyzerClassName.isEmpty()
+            || (luceneAnalyzerClassName.equals(StandardAnalyzer.class.getName())
+                    && luceneAnalyzerClassArgs.isEmpty() && luceneAnalyzerClassArgsTypes.isEmpty())) {


Suggested change

-                    && luceneAnalyzerClassArgs.isEmpty() && luceneAnalyzerClassArgsTypes.isEmpty())) {
+                    && luceneAnalyzerClassArgs.isEmpty() && luceneAnalyzerClassArgTypes.isEmpty())) {

Bill-hbrhbr · 2024-04-30T21:31:57Z

...segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/TextIndexUtils.java

+              config.getStopWordsInclude(), config.getStopWordsExclude());
+    } else {
+      // Custom analyzer + custom configs via reflection
+      if (luceneAnalyzerClassArgs.size() != luceneAnalyzerClassArgsTypes.size()) {


Suggested change

-      if (luceneAnalyzerClassArgs.size() != luceneAnalyzerClassArgsTypes.size()) {
+      if (luceneAnalyzerClassArgs.size() != luceneAnalyzerClassArgTypes.size()) {

Bill-hbrhbr · 2024-04-30T21:32:06Z

...segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/TextIndexUtils.java

+
+      // Generate args type list
+      List<Class<?>> argClasses = new ArrayList<>();
+      for (String argType : luceneAnalyzerClassArgsTypes) {


Suggested change

-      for (String argType : luceneAnalyzerClassArgsTypes) {
+      for (String argType : luceneAnalyzerClassArgTypes) {

jackluo923 added 2 commits April 25, 2024 04:00

Support custom Lucene analyzer with arguments and custom query parser

5ad8b05

Add Apache license header to CsvParserTest.java

09d53bb

jackluo923 changed the title ~~[Feature] Support custom Lucene analyzer with args and custom query parser~~ [Feature] Support configurable Lucene analyzer with args and configurable query parser Apr 24, 2024

jackluo923 added 5 commits April 25, 2024 05:10

Adjusted dependency import order for javax.annotation.Nullable in CsvParser.java

23d8de3

Applied mvn spotless:apply on CsvParserTest.java

bc84aee

Correct a spelling mistake in the comments in RealtimeLuceneTextIndex.java

610bfff

Adjusted parsing methods naming to improve clarity in TextIndexUtils.java

559548e

Removed redundant reflection logic in TextIndexUtils.java

e72ff6e

jackluo923 added 5 commits April 25, 2024 06:36

Removed redundant string replacement Streaming pipeline in CsvParser.java

7796f0e

Break up long line over 120 char into two lines in TextIndexUtils.java

0e62b97

Adjusted the scenarios of when default Analyzers with custom arguments should be used.

fafdaa4

Emit Class.getName() instead of Class.getCanonicalName() in logs within RealtimeLuceneTextIndex.java

d97a383

Moved || operator in if condition to a new line to resolve style issue.

0b99d64

Bill-hbrhbr suggested changes Apr 29, 2024

Jackie-Jiang added feature enhancement labels Apr 29, 2024

jackluo923 added 4 commits April 30, 2024 13:35

Addressed code review concerns from @Bill-hbrhbr

e4023bf

Fix style error.

5b43fff

Fix unit test.

2ba9812

Run maven spotless:apply

8df04e6

Bill-hbrhbr approved these changes Apr 30, 2024

Bill-hbrhbr reviewed Apr 30, 2024

Patched argument parsing bug.

3129017
