Investigate opensearch regression, not treating spaces as && within fields #349

Open
Tracked by #302
akotlar opened this issue Nov 13, 2023 · 2 comments

Comments

Collaborator

akotlar commented Nov 13, 2023

Fix the OpenSearch regression so that
heterozygotes:(4805 && 1805) cadd > 20
and
heterozygotes:(4805 1805) cadd > 20 (no &&)
return the same results again. Previously, in Elasticsearch 5.6 (b10), these two queries were equivalent.

@akotlar akotlar changed the title Fix opensearch regression making Fix opensearch regression, not treating spaces as && within fields Nov 13, 2023
@akotlar akotlar added this to the Sprint 4 milestone Nov 13, 2023
@akotlar akotlar added the search label Nov 13, 2023
@akotlar akotlar self-assigned this Nov 13, 2023
@akotlar akotlar changed the title Fix opensearch regression, not treating spaces as && within fields Investigate opensearch regression, not treating spaces as && within fields Nov 13, 2023
@akotlar akotlar mentioned this issue Nov 13, 2023
11 tasks
Collaborator Author

akotlar commented Mar 6, 2024

This is related: elastic/elasticsearch#29148

Collaborator Author

akotlar commented Mar 7, 2024

Fixed in https://github.com/bystrogenomics/bystro-web/pull/384 by adding a pre-processor for query_string queries. It wraps each separate term in parentheses, which makes Elasticsearch/OpenSearch search those terms individually, just as before. See the linked PR for more details. We also now have a small test suite that checks the transformation; the first set of cases is:

const testCases = [
    { input: "exonic pathogenic", expected: "(exonic) (pathogenic)" },
    { input: "(exonic pathogenic)", expected: "(exonic pathogenic)" },
    { input: 'refseq.name2:GAA', expected: '(refseq.name2:GAA)' },
    { input: 'refseq.name2:"GAA"', expected: '(refseq.name2:"GAA")' },
    { input: 'gene:"HELLO"', expected: '(gene:"HELLO")' },
    { input: '"Hello"', expected: '("Hello")' },
    { input: '+(chrom:chr17 pos:39580562)', expected: '+(chrom:chr17 pos:39580562)' },
    { input: 'exonic AND cadd:>20.2', expected: '(exonic) AND (cadd:>20.2)' },
    { input: '-(gene:BRCA1) OR +(gene:BRCA2)', expected: '-(gene:BRCA1) OR +(gene:BRCA2)' },
    { input: '*pathogenic*', expected: '(*pathogenic*)' },
    { input: 'BRCA1? AND BRCA2?', expected: '(BRCA1?) AND (BRCA2?)' }
];

As seen above, terms that are already wrapped in parentheses are not affected. This gives us the best of both worlds. By default, queries behave as before, so the user can freely type queries like exonic pathogenic cadd > 20. We also now support synonyms that are phrases of multiple space-separated terms: wrap the phrase in parentheses, (some long disease name), or, for an exact match, in quotes, "some long disease name". I will add documentation for this.
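
For readers who do not want to open the PR, here is a minimal sketch of what such a pre-processor could look like. It is not the code from bystro-web#384. The names splitTopLevel, wrapTopLevelTerms and OPERATORS are hypothetical, and the rules are only inferred from the test cases listed above: split on whitespace outside parentheses and double quotes, leave boolean operators and already-parenthesized groups alone, and wrap everything else.

```javascript
// Illustrative sketch only; names and rules are assumptions, not the PR's code.

// Boolean operators that should pass through untouched (assumed set).
const OPERATORS = new Set(["AND", "OR", "NOT", "&&", "||"]);

// Split a query on whitespace that sits outside parentheses and double quotes.
// Escaped quotes (\") are not handled in this sketch.
function splitTopLevel(query) {
  const tokens = [];
  let current = "";
  let depth = 0;
  let inQuotes = false;

  for (const ch of query) {
    if (ch === '"') {
      inQuotes = !inQuotes;
    } else if (!inQuotes && ch === "(") {
      depth += 1;
    } else if (!inQuotes && ch === ")") {
      depth = Math.max(0, depth - 1);
    }

    if (/\s/.test(ch) && depth === 0 && !inQuotes) {
      // Top-level whitespace ends the current token.
      if (current !== "") tokens.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  if (current !== "") tokens.push(current);
  return tokens;
}

// Wrap each top-level term in parentheses unless it is an operator or
// already a parenthesized group (optionally prefixed with +, - or !).
function wrapTopLevelTerms(query) {
  return splitTopLevel(query)
    .map((token) => {
      if (OPERATORS.has(token)) return token;
      if (/^[+\-!]?\(/.test(token)) return token;
      return `(${token})`;
    })
    .join(" ");
}
```

With this sketch, wrapTopLevelTerms("exonic AND cadd:>20.2") returns "(exonic) AND (cadd:>20.2)", and "+(chrom:chr17 pos:39580562)" comes back unchanged. Note that it collapses runs of whitespace into a single space and does not cover inputs outside the listed cases, such as comparisons written with spaces (cadd > 20).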

This is live on https://bystro-dev.emory.edu
