New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature] Support configurable Lucene analyzer with args and configurable query parser #13003
base: master
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #13003 +/- ##
============================================
+ Coverage 61.75% 62.16% +0.41%
+ Complexity 207 198 -9
============================================
Files 2436 2503 +67
Lines 133233 136742 +3509
Branches 20636 21191 +555
============================================
+ Hits 82274 85003 +2729
- Misses 44911 45445 +534
- Partials 6048 6294 +246
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
…s should be used.
…in RealtimeLuceneTextIndex.java
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Major comments:
- User may pass in an argTypes string with space (e.g.
java.lang.String.class, java.land.String.class
). We need to apply CSVparser with trimming everywhere argTypes config string is passed in. - Use
textIndexConfigBuilder
to clarify code and reduce future code maintenance costs (we won't have to change all occurrences oftextIndexConfig
when we add/remove a new config key)
Nits:
- Use more appropriate exception types
...java/org/apache/pinot/segment/local/realtime/impl/invertedindex/RealtimeLuceneTextIndex.java
Outdated
Show resolved
Hide resolved
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/FieldConfig.java
Outdated
Show resolved
Hide resolved
* and other white space characters. These characters are sometimes expected to be part of the actual argument. | ||
* | ||
* @param input string to split on comma | ||
* @param escapeComma whether we should ignore "\," during splitting, replace it with "," after split |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* @param escapeComma whether we should ignore "\," during splitting, replace it with "," after split | |
* @param escapeComma if true, we don't split on escaped commas, and we replace "\," with "," after the split |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/TextIndexConfig.java
Outdated
Show resolved
Hide resolved
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/TextIndexConfig.java
Outdated
Show resolved
Hide resolved
...segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/TextIndexUtils.java
Show resolved
Hide resolved
...segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/TextIndexUtils.java
Show resolved
Hide resolved
...segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/TextIndexUtils.java
Show resolved
Hide resolved
...segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/TextIndexUtils.java
Outdated
Show resolved
Hide resolved
TextIndexConfig config = new TextIndexConfig(false, null, null, false, false, null, null, true, 500, | ||
analyzerClass, analyzerClassArgs, analyzerClassArgTypes, queryParserClass, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use config builder
TextIndexConfig config = new TextIndexConfig(false, null, null, false, false, null, null, true, 500, | |
analyzerClass, analyzerClassArgs, analyzerClassArgTypes, queryParserClass, false); | |
TextIndexConfigBuilder builder = new TextIndexConfigBuilder(); | |
if (null != analyzerClass) { | |
builder.withLuceneAnalyzerClass(analyzerClass); | |
} | |
if (null != analyzerClassArgs) { | |
builder.withLuceneAnalyzerClassArgs(analyzerClassArgs); | |
} | |
if (null != analyzerClassArgTypes) { | |
builder.withLuceneAnalyzerClassArgTypes(analyzerClassArgTypes); | |
} | |
if (null != queryParserClass) { | |
builder.withLuceneQueryParserClass(queryParserClass); | |
} | |
TextIndexConfig config = builder.withUseANDForMultiTermQueries(false).build(); |
LGTM overall, echo Bill's style comments. Please help check, thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented on two things discussed. Also there are some left over unit tests that can use configBuilder but it's up to you
public static String serialize(List<String> input, boolean escapeComma, boolean trim) { | ||
Stream<String> tokenStream = input.stream(); | ||
if (escapeComma) { | ||
tokenStream = tokenStream.map(s -> s.replaceAll(",", "\\,")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tokenStream = tokenStream.map(s -> s.replaceAll(",", "\\,")); | |
tokenStream = tokenStream.map(s -> s.replaceAll(",", Matcher.quoteReplacement("\\,"))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
CsvParser.serialize(_luceneAnalyzerClassArgs, false, false), | ||
CsvParser.serialize(_luceneAnalyzerClassArgTypes, false, false), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CsvParser.serialize(_luceneAnalyzerClassArgs, false, false), | |
CsvParser.serialize(_luceneAnalyzerClassArgTypes, false, false), | |
CsvParser.serialize(_luceneAnalyzerClassArgs, true, false), | |
CsvParser.serialize(_luceneAnalyzerClassArgTypes, true, false), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed only _luceneAnalyzerClassArgs to escape comma as the args type should not contain escaped comma at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change name
public static Analyzer getAnalyzer(TextIndexConfig config) throws ReflectiveOperationException { | ||
String luceneAnalyzerClassName = config.getLuceneAnalyzerClass(); | ||
List<String> luceneAnalyzerClassArgs = config.getLuceneAnalyzerClassArgs(); | ||
List<String> luceneAnalyzerClassArgsTypes = config.getLuceneAnalyzerClassArgTypes(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
List<String> luceneAnalyzerClassArgsTypes = config.getLuceneAnalyzerClassArgTypes(); | |
List<String> luceneAnalyzerClassArgTypes = config.getLuceneAnalyzerClassArgTypes(); |
|
||
if (null == luceneAnalyzerClassName || luceneAnalyzerClassName.isEmpty() | ||
|| (luceneAnalyzerClassName.equals(StandardAnalyzer.class.getName()) | ||
&& luceneAnalyzerClassArgs.isEmpty() && luceneAnalyzerClassArgsTypes.isEmpty())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
&& luceneAnalyzerClassArgs.isEmpty() && luceneAnalyzerClassArgsTypes.isEmpty())) { | |
&& luceneAnalyzerClassArgs.isEmpty() && luceneAnalyzerClassArgTypes.isEmpty())) { |
config.getStopWordsInclude(), config.getStopWordsExclude()); | ||
} else { | ||
// Custom analyzer + custom configs via reflection | ||
if (luceneAnalyzerClassArgs.size() != luceneAnalyzerClassArgsTypes.size()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (luceneAnalyzerClassArgs.size() != luceneAnalyzerClassArgsTypes.size()) { | |
if (luceneAnalyzerClassArgs.size() != luceneAnalyzerClassArgTypes.size()) { |
|
||
// Generate args type list | ||
List<Class<?>> argClasses = new ArrayList<>(); | ||
for (String argType : luceneAnalyzerClassArgsTypes) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for (String argType : luceneAnalyzerClassArgsTypes) { | |
for (String argType : luceneAnalyzerClassArgTypes) { |
In this pull request, we introduce two significant feature enhancements that build upon #12027 [Configurable Lucene Analyzer] and enhance the flexibility and functionality of our text processing capabilities using Lucene:
Enhanced Flexibility for Custom Lucene Analyzers:
We've introduced backward-compatible support that enables the passing of arbitrary arguments and variable types to custom Lucene analyzers via reflection. This enhancement allows for dynamic customization of analyzer behavior based on runtime configurations specified in table configs. This feature is particularly beneficial for adapting the tokenization process to varying requirements without needing to alter the underlying codebase.
Configurable Lucene Query Parser:
We've added the ability to configure the which Lucene query parser to use at run-time. This addition makes it possible to tailor the behavior of the query parser to better align with specific use cases, enhancing the efficiency and relevance of search operations. At the moment, we only support QueryParser which inherits Lucene's
QueryParserBase
class and must implementClass(Field f, Analyzer a)
constructor andQuery parse(String query)
instance method. In the future, we may add the ability to utilize more complex query parsers such as MultiFieldQueryParser which does not implementQuery parse(String query)
instance method.The combination of these enhancements significantly improves our production system's capability to control text tokenization at runtime. This is especially useful for implementing precise log search functionalities, such as supporting concurrent case-sensitive and case-insensitive searches using wildcards and regular expressions. The flexibility to configure both the Lucene analyzer and query parser dynamically ensures that our application can efficiently handle diverse and complex search requirements.
The enhancement is contributed and improved by multiple developers (@jackluo923 @Bill-hbrhbr @itschrispeck @lnbest0707-uber) across multiple iterations and this PR is a summary of the internal changes to be contributed to OSS.