Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Support configurable Lucene analyzer with args and configurable query parser #13003

Open
wants to merge 17 commits into
base: master
Choose a base branch
from

Conversation

jackluo923
Copy link
Contributor

@jackluo923 jackluo923 commented Apr 24, 2024

In this pull request, we introduce two significant feature enhancements that build upon #12027 [Configurable Lucene Analyzer] and enhance the flexibility and functionality of our text processing capabilities using Lucene:

  1. Enhanced Flexibility for Custom Lucene Analyzers:
    We've introduced backward-compatible support that enables the passing of arbitrary arguments and variable types to custom Lucene analyzers via reflection. This enhancement allows for dynamic customization of analyzer behavior based on runtime configurations specified in table configs. This feature is particularly beneficial for adapting the tokenization process to varying requirements without needing to alter the underlying codebase.

  2. Configurable Lucene Query Parser:
    We've added the ability to configure the which Lucene query parser to use at run-time. This addition makes it possible to tailor the behavior of the query parser to better align with specific use cases, enhancing the efficiency and relevance of search operations. At the moment, we only support QueryParser which inherits Lucene's QueryParserBase class and must implement Class(Field f, Analyzer a) constructor and Query parse(String query) instance method. In the future, we may add the ability to utilize more complex query parsers such as MultiFieldQueryParser which does not implement Query parse(String query) instance method.

The combination of these enhancements significantly improves our production system's capability to control text tokenization at runtime. This is especially useful for implementing precise log search functionalities, such as supporting concurrent case-sensitive and case-insensitive searches using wildcards and regular expressions. The flexibility to configure both the Lucene analyzer and query parser dynamically ensures that our application can efficiently handle diverse and complex search requirements.

The enhancement is contributed and improved by multiple developers (@jackluo923 @Bill-hbrhbr @itschrispeck @lnbest0707-uber) across multiple iterations and this PR is a summary of the internal changes to be contributed to OSS.

@jackluo923 jackluo923 changed the title [Feature] Support custom Lucene analyzer with args and custom query parser [Feature] Support configurable Lucene analyzer with args and configurable query parser Apr 24, 2024
@codecov-commenter
Copy link

codecov-commenter commented Apr 24, 2024

Codecov Report

Attention: Patch coverage is 42.25352% with 82 lines in your changes are missing coverage. Please review.

Project coverage is 62.16%. Comparing base (59551e4) to head (3129017).
Report is 387 commits behind head on master.

Files Patch % Lines
...ot/segment/local/segment/store/TextIndexUtils.java 30.13% 43 Missing and 8 partials ⚠️
...pache/pinot/segment/spi/index/TextIndexConfig.java 50.00% 16 Missing ⚠️
...cal/segment/index/text/TextIndexConfigBuilder.java 0.00% 5 Missing and 3 partials ⚠️
...me/impl/invertedindex/RealtimeLuceneTextIndex.java 58.33% 4 Missing and 1 partial ⚠️
.../org/apache/pinot/segment/spi/utils/CsvParser.java 86.66% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #13003      +/-   ##
============================================
+ Coverage     61.75%   62.16%   +0.41%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2503      +67     
  Lines        133233   136742    +3509     
  Branches      20636    21191     +555     
============================================
+ Hits          82274    85003    +2729     
- Misses        44911    45445     +534     
- Partials       6048     6294     +246     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 62.10% <42.25%> (+0.39%) ⬆️
java-21 62.03% <42.25%> (+0.41%) ⬆️
skip-bytebuffers-false 62.13% <42.25%> (+0.38%) ⬆️
skip-bytebuffers-true 62.01% <42.25%> (+34.29%) ⬆️
temurin 62.16% <42.25%> (+0.41%) ⬆️
unittests 62.15% <42.25%> (+0.41%) ⬆️
unittests1 46.70% <20.42%> (-0.20%) ⬇️
unittests2 27.94% <21.83%> (+0.21%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link

@Bill-hbrhbr Bill-hbrhbr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Major comments:

  1. User may pass in an argTypes string with space (e.g. java.lang.String.class, java.land.String.class). We need to apply CSVparser with trimming everywhere argTypes config string is passed in.
  2. Use textIndexConfigBuilder to clarify code and reduce future code maintenance costs (we won't have to change all occurrences of textIndexConfig when we add/remove a new config key)

Nits:

  1. Use more appropriate exception types

* and other white space characters. These characters are sometimes expected to be part of the actual argument.
*
* @param input string to split on comma
* @param escapeComma whether we should ignore "\," during splitting, replace it with "," after split

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @param escapeComma whether we should ignore "\," during splitting, replace it with "," after split
* @param escapeComma if true, we don't split on escaped commas, and we replace "\," with "," after the split

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

Comment on lines 168 to 169
TextIndexConfig config = new TextIndexConfig(false, null, null, false, false, null, null, true, 500,
analyzerClass, analyzerClassArgs, analyzerClassArgTypes, queryParserClass, false);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use config builder

Suggested change
TextIndexConfig config = new TextIndexConfig(false, null, null, false, false, null, null, true, 500,
analyzerClass, analyzerClassArgs, analyzerClassArgTypes, queryParserClass, false);
TextIndexConfigBuilder builder = new TextIndexConfigBuilder();
if (null != analyzerClass) {
builder.withLuceneAnalyzerClass(analyzerClass);
}
if (null != analyzerClassArgs) {
builder.withLuceneAnalyzerClassArgs(analyzerClassArgs);
}
if (null != analyzerClassArgTypes) {
builder.withLuceneAnalyzerClassArgTypes(analyzerClassArgTypes);
}
if (null != queryParserClass) {
builder.withLuceneQueryParserClass(queryParserClass);
}
TextIndexConfig config = builder.withUseANDForMultiTermQueries(false).build();

@lnbest0707-uber
Copy link
Contributor

LGTM overall, echo Bill's style comments. Please help check, thanks.

Copy link

@Bill-hbrhbr Bill-hbrhbr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented on two things discussed. Also there are some left over unit tests that can use configBuilder but it's up to you

public static String serialize(List<String> input, boolean escapeComma, boolean trim) {
Stream<String> tokenStream = input.stream();
if (escapeComma) {
tokenStream = tokenStream.map(s -> s.replaceAll(",", "\\,"));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
tokenStream = tokenStream.map(s -> s.replaceAll(",", "\\,"));
tokenStream = tokenStream.map(s -> s.replaceAll(",", Matcher.quoteReplacement("\\,")));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines 221 to 222
CsvParser.serialize(_luceneAnalyzerClassArgs, false, false),
CsvParser.serialize(_luceneAnalyzerClassArgTypes, false, false),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
CsvParser.serialize(_luceneAnalyzerClassArgs, false, false),
CsvParser.serialize(_luceneAnalyzerClassArgTypes, false, false),
CsvParser.serialize(_luceneAnalyzerClassArgs, true, false),
CsvParser.serialize(_luceneAnalyzerClassArgTypes, true, false),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed only _luceneAnalyzerClassArgs to escape comma as the args type should not contain escaped comma at all.

Copy link

@Bill-hbrhbr Bill-hbrhbr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change name

public static Analyzer getAnalyzer(TextIndexConfig config) throws ReflectiveOperationException {
String luceneAnalyzerClassName = config.getLuceneAnalyzerClass();
List<String> luceneAnalyzerClassArgs = config.getLuceneAnalyzerClassArgs();
List<String> luceneAnalyzerClassArgsTypes = config.getLuceneAnalyzerClassArgTypes();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
List<String> luceneAnalyzerClassArgsTypes = config.getLuceneAnalyzerClassArgTypes();
List<String> luceneAnalyzerClassArgTypes = config.getLuceneAnalyzerClassArgTypes();


if (null == luceneAnalyzerClassName || luceneAnalyzerClassName.isEmpty()
|| (luceneAnalyzerClassName.equals(StandardAnalyzer.class.getName())
&& luceneAnalyzerClassArgs.isEmpty() && luceneAnalyzerClassArgsTypes.isEmpty())) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
&& luceneAnalyzerClassArgs.isEmpty() && luceneAnalyzerClassArgsTypes.isEmpty())) {
&& luceneAnalyzerClassArgs.isEmpty() && luceneAnalyzerClassArgTypes.isEmpty())) {

config.getStopWordsInclude(), config.getStopWordsExclude());
} else {
// Custom analyzer + custom configs via reflection
if (luceneAnalyzerClassArgs.size() != luceneAnalyzerClassArgsTypes.size()) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (luceneAnalyzerClassArgs.size() != luceneAnalyzerClassArgsTypes.size()) {
if (luceneAnalyzerClassArgs.size() != luceneAnalyzerClassArgTypes.size()) {


// Generate args type list
List<Class<?>> argClasses = new ArrayList<>();
for (String argType : luceneAnalyzerClassArgsTypes) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for (String argType : luceneAnalyzerClassArgsTypes) {
for (String argType : luceneAnalyzerClassArgTypes) {

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants