Seeking a good search engine for PhysioNet #2180

Open
bemoody opened this issue Jan 18, 2024 · 3 comments

bemoody commented Jan 18, 2024

The current PhysioNet search function is not great (previous issues: #349, #1971). We would like to replace it with something based on a "real" information-retrieval engine, while also allowing more powerful and user-friendly queries.

There are a few options; in this issue I'll try to list the advantages and disadvantages of each.

Requirements:

  • Free and open-source software
  • Reasonable security support

Good to have:

  • Django integration - Haystack (https://haystacksearch.org/), for example, makes it easy to index and search objects in the Django ORM
  • Language support - PhysioNet only publishes projects written in English, but we would like the platform to be international
  • Exact word searching - ability to search for a term without stemming or synonyms (often written +foo or "foo")
  • Phrase searching - ability to search for an exact phrase ("foo bar")
  • Range queries - e.g. "projects published between 2021-06-01 and 2021-09-01"
  • Faceting - e.g. "list the distinct authors of matching projects and the number of matching projects for each author"
  • Collapsing - e.g. "search for published projects matching the query, then list distinct core projects ordered by relevance"
  • Synonyms - e.g. treating ecg and electrocardiogram as equivalent
  • User-friendly query parser - if the query parser supports complex syntax, it should provide diagnostics so you can understand why your query isn't working
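
To make the faceting and collapsing items above concrete, here is a minimal pure-Python sketch of what each operation would return. The hit records, field names (`core`, `version`, `authors`) and project names are hypothetical placeholders, not PhysioNet's schema; a real engine would compute these inside the index rather than over a Python list.

```python
from collections import Counter

# Hypothetical search hits, already ordered by relevance (best first).
hits = [
    {"core": "proj-a", "version": "1.1", "authors": ["Author A", "Author B"]},
    {"core": "proj-b", "version": "1.0", "authors": ["Author A"]},
    {"core": "proj-a", "version": "1.0", "authors": ["Author A", "Author B"]},
    {"core": "proj-c", "version": "2.0", "authors": ["Author C"]},
]


def facet_authors(hits):
    """Faceting: count the matching projects for each distinct author."""
    counts = Counter()
    for hit in hits:
        # set() guards against an author listed twice on one project.
        counts.update(set(hit["authors"]))
    return counts


def collapse_by_core(hits):
    """Collapsing: keep only the best-ranked hit for each core project."""
    seen = set()
    collapsed = []
    for hit in hits:
        if hit["core"] not in seen:
            seen.add(hit["core"])
            collapsed.append(hit)
    return collapsed
```

With this toy data, faceting reports three matching projects for "Author A", while collapsing reduces the four hits to one entry per core project, preserving relevance order.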

Some options we might consider:

  • Xapian
  • Whoosh
  • Solr
  • OpenSearch
  • Manticore
  • PostgreSQL

bemoody commented Jan 18, 2024

Xapian (https://xapian.org/)

Implementation language: C++
Latest release: 2023-11-06

  • Free and open-source software: Yes
  • Reasonable security support: probably
  • Django integration: Yes (xapian-haystack)
  • Language support: Armenian, Basque, Catalan, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish
  • Exact word searching: Yes
  • Phrase searching: Yes
  • Range queries: Yes
  • Faceting: Yes
  • Collapsing: Yes
  • Synonyms: Yes
  • User-friendly query parser: No

Xapian is implemented in C++, but it's also a well-established package with security support in Debian. It has a Python wrapper which is maintained by the Xapian developers, but it is not on PyPI (https://trac.xapian.org/ticket/807). I think the most reasonable option would be to use the system-installed bindings through a virtual environment created with --system-site-packages, or something equivalent.

The query parser supports prefixes for field searches, but if you type a prefix it doesn't understand, it seems to be silently ignored. It's possible to dump the AST but this is not super-friendly.
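
One possible workaround for the silently ignored prefixes is to check the query string before handing it to the engine. The following is only a sketch: the prefix names in `KNOWN_PREFIXES` are hypothetical, the check is plain Python rather than anything provided by Xapian, and the regular expression is a rough approximation of query syntax.

```python
import re

# Hypothetical set of field prefixes that the index understands.
KNOWN_PREFIXES = {"author", "title", "topic"}

# A run of letters followed by a colon and a non-space character,
# e.g. "author:moody". Digits before the word prevent a match.
PREFIX_RE = re.compile(r'(?<!\w)([A-Za-z_]+):(?=\S)')

# Quoted phrases are replaced by a placeholder so that colons inside
# a phrase are not mistaken for field prefixes.
QUOTED_RE = re.compile(r'"[^"]*"')


def unknown_prefixes(query, known=KNOWN_PREFIXES):
    """Return the prefixes used in `query` that are not in `known`."""
    stripped = QUOTED_RE.sub("Q", query)
    return [p for p in PREFIX_RE.findall(stripped) if p.lower() not in known]
```

A search view could call `unknown_prefixes(query)` and show a warning such as "unknown field: auther" instead of returning puzzling results. Note that this simple check would also flag a pasted URL, since `http:` looks like a prefix.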

Searching for dates and ranges is possible, but difficult to do correctly.
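
The comment above does not say what makes ranges difficult, so the following is an assumption about one common pitfall rather than a description of the actual problem encountered. Xapian stores document values as strings and compares them bytewise in range queries, so dates have to be serialised in a form whose string order matches date order. The helper names are hypothetical.

```python
from datetime import date


def date_key(d):
    """Serialise a date as zero-padded YYYYMMDD so that string order
    matches chronological order."""
    return f"{d.year:04d}{d.month:02d}{d.day:02d}"


def in_range(key, start, end):
    """Inclusive range test on a serialised key, using plain string
    comparison (the kind of comparison a value range performs)."""
    return date_key(start) <= key <= date_key(end)


# Without zero padding, string comparison gets the order wrong:
# June sorts after October because "6" > "1".
naive_june, naive_october = "2021-6-1", "2021-10-1"
wrong_order = naive_june > naive_october
```

The same reasoning applies to whether the end of a range is inclusive: "published between 2021-06-01 and 2021-09-01" needs an explicit decision about the final day.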

bemoody commented Jan 18, 2024

Whoosh (https://pypi.org/project/Whoosh/)

Implementation language: Python
Latest release: 2016-04-04

  • Free and open-source software: Yes
  • Reasonable security support: doubtful
  • Django integration: Yes (django-haystack)
  • Language support: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish
  • Exact word searching: No
  • Phrase searching: Yes (stemmed only)
  • Range queries: Yes
  • Faceting: Yes
  • Collapsing: No (but parent/child documents might be an alternative)
  • Synonyms: No
  • User-friendly query parser: ???

The Whoosh package is pure Python: it is slow, but less likely to have security problems, and it is available on PyPI. However, it also appears to be unmaintained (the latest release is from 2016).

It appears to index only stemmed forms and to make no distinction between foo and "foo". This might be the fault of Haystack and not Whoosh itself.

Searching for dates and ranges is possible, but difficult to do correctly.

bemoody commented Jan 18, 2024

Solr (https://solr.apache.org/)

Implementation language: Java
Latest release: 2023-10-15

  • Free and open-source software: Yes
  • Reasonable security support: Yes
  • Django integration: Yes (django-haystack)
  • Language support: Arabic, Bulgarian, Catalan, CJK, Czech, Danish, German, Greek, Spanish, Basque, Persian, Finnish, French, Irish, Galician, Hindi, Hungarian, Armenian, Indonesian, Italian, Japanese, Latvian, Dutch, Norwegian, Portuguese, Romanian, Russian, Swedish, Thai, Turkish
  • Exact word searching: No
  • Phrase searching: Yes (stemmed only)
  • Range queries: Yes
  • Faceting: Yes
  • Collapsing: Yes
  • Synonyms: Yes
  • User-friendly query parser: somewhat

Solr is not in Debian; however, it's written in Java, so less likely to have security problems, and it works via an HTTP API so the search engine can run with minimal privileges.

The default query parser will report an error if the input has a syntax error or an unknown field prefix; the "dismax" and "edismax" parsers will not. There's also a debug option that outputs the AST as a string.

Recommendations for "how do I do exact word/phrase searching with Solr" seem to boil down to "define two fields with duplicate data". But there doesn't seem to be a friendly way to handle this with the standard query parsers, and I don't think Haystack supports this directly.
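
If the "two fields" approach were adopted, one way to hide it from users would be to rewrite the query before sending it to Solr. This is a rough sketch, not something Solr or Haystack provides: the field names `text` (stemmed) and `text_exact` (unstemmed) are hypothetical, and the rewriter only handles bare words and double-quoted phrases.

```python
import re

# Hypothetical field names: one analysed with stemming, one without.
STEMMED_FIELD = "text"
EXACT_FIELD = "text_exact"

# Either a double-quoted phrase or a run of non-space characters.
TOKEN_RE = re.compile(r'"([^"]+)"|(\S+)')

# Characters with special meaning in Solr's standard query parser.
SPECIAL_RE = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/])')


def escape(term):
    """Backslash-escape query syntax characters in user input."""
    return SPECIAL_RE.sub(r'\\\1', term)


def rewrite(user_query):
    """Send quoted input to the exact field and bare words to the
    stemmed field, requiring every clause to match."""
    clauses = []
    for quoted, bare in TOKEN_RE.findall(user_query):
        if quoted:
            clauses.append(f'{EXACT_FIELD}:"{escape(quoted)}"')
        else:
            clauses.append(f'{STEMMED_FIELD}:{escape(bare)}')
    return " AND ".join(clauses)
```

For example, `rewrite('"ecg" heart rate')` produces `text_exact:"ecg" AND text:heart AND text:rate`. A real implementation would also need to deal with bare words that are themselves operators (AND, OR, NOT), and the trade-off is that users lose access to the rest of the query syntax.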

Searching for dates and ranges is possible, but difficult to do correctly.
