Add preserve original on ascii folding filter. #2126

Open: fmassot wants to merge 2 commits into main

@fmassot (Contributor) commented Jul 17, 2023

This adds a preserve_original option to the ASCII folding filter, the same as in the Lucene filter: https://lucene.apache.org/core/6_4_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html

There is one minor difference between the two filters when preserve_original is set to true and the original token differs from the folded token: the Lucene filter emits the folded token first and then the original token, whereas tantivy emits the original token first and then the folded one.
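The ordering difference described above can be illustrated with a small standalone sketch. This is not tantivy's or Lucene's implementation: `fold`, `Order`, and `emit` are hypothetical names, and the folding table covers only a handful of characters for the sake of the example.

```rust
/// Hypothetical stand-in for a real folding table: only a few accented
/// characters are mapped, everything else passes through unchanged.
fn fold(text: &str) -> String {
    text.chars()
        .map(|c| match c {
            'é' | 'è' | 'ê' => 'e',
            'à' | 'â' => 'a',
            'ç' => 'c',
            other => other,
        })
        .collect()
}

/// Which token comes first when both the original and the folded form are emitted.
enum Order {
    /// Order described for tantivy in this PR.
    OriginalFirst,
    /// Order described for the Lucene filter.
    FoldedFirst,
}

/// Returns the tokens emitted for a single input token.
fn emit(token: &str, preserve_original: bool, order: Order) -> Vec<String> {
    let folded = fold(token);
    // A second token is only emitted when folding actually changed the text.
    if !preserve_original || folded == token {
        return vec![folded];
    }
    match order {
        Order::OriginalFirst => vec![token.to_string(), folded],
        Order::FoldedFirst => vec![folded, token.to_string()],
    }
}
```

Under these assumptions, calling `emit("café", true, Order::OriginalFirst)` yields the original token followed by the folded one, and `Order::FoldedFirst` yields them in the opposite order. A token that folding leaves unchanged is emitted once in either case.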

@fmassot fmassot force-pushed the fmassot/add-preserve-original-ascii-folding branch from 0f95465 to 90b9059 Compare July 17, 2023 07:16
@fmassot fmassot force-pushed the fmassot/add-preserve-original-ascii-folding branch from ab593ca to 36d585a Compare July 17, 2023 07:25
@codecov-commenter

Codecov Report

Merging #2126 (36d585a) into main (5fafe4b) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #2126      +/-   ##
==========================================
+ Coverage   94.37%   94.38%   +0.01%     
==========================================
  Files         321      319       -2     
  Lines       60821    60791      -30     
==========================================
- Hits        57401    57379      -22     
+ Misses       3420     3412       -8     
| Impacted Files | Coverage Δ |
|---|---|
| src/lib.rs | 99.05% <ø> (ø) |
| src/core/index_meta.rs | 96.14% <100.00%> (ø) |
| src/store/compressors.rs | 97.70% <100.00%> (+5.51%) ⬆️ |
| src/store/decompressors.rs | 97.82% <100.00%> (-0.39%) ⬇️ |
| src/store/mod.rs | 99.20% <100.00%> (-0.03%) ⬇️ |
| src/tokenizer/ascii_folding_filter.rs | 99.92% <100.00%> (+0.03%) ⬆️ |

... and 3 files with indirect coverage changes

  if !self.token_mut().text.is_ascii() {
      // ignore its already ascii
-     to_ascii(&self.tail.token().text, self.buffer);
+     text_has_changed = to_ascii(&self.tail.token().text, self.buffer);
Contributor

Can text_has_changed == false happen here even though the is_ascii test above already failed?

Contributor Author

I just checked to_ascii and, as you would expect, it does not change text that is already fully ASCII. I will fix that, thanks.
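For readers following the thread, here is a minimal sketch of how the is_ascii guard and a change flag can interact. It is not the code from this PR: `fold_char`, `to_ascii_sketch`, and `fold_token` are hypothetical names, and the sketch assumes that characters without an ASCII equivalent are passed through unchanged. Under that assumption the flag can stay false even for non-ASCII text.

```rust
/// Hypothetical folding table: maps a few accented characters to ASCII.
fn fold_char(c: char) -> Option<&'static str> {
    match c {
        'é' | 'è' | 'ê' => Some("e"),
        'à' | 'â' => Some("a"),
        'ç' => Some("c"),
        _ => None,
    }
}

/// Writes the folded form of `text` into `buffer` and reports whether
/// at least one character was replaced.
fn to_ascii_sketch(text: &str, buffer: &mut String) -> bool {
    buffer.clear();
    let mut changed = false;
    for c in text.chars() {
        match fold_char(c) {
            Some(ascii) => {
                buffer.push_str(ascii);
                changed = true;
            }
            // Assumption: unmapped characters are copied as-is.
            None => buffer.push(c),
        }
    }
    changed
}

/// Returns the tokens emitted for one input token, original first.
fn fold_token(text: &str, preserve_original: bool) -> Vec<String> {
    // Fast path: folding cannot change text that is already ASCII.
    if text.is_ascii() {
        return vec![text.to_string()];
    }
    let mut buffer = String::new();
    let text_has_changed = to_ascii_sketch(text, &mut buffer);
    if preserve_original && text_has_changed {
        vec![text.to_string(), buffer]
    } else {
        vec![buffer]
    }
}
```

In this sketch a token such as "日本" passes the guard because it is not ASCII, yet no character is replaced, so only one token is emitted. A token such as "café" is emitted twice when `preserve_original` is true.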
