Seeking a good search engine for PhysioNet #2180
Xapian (https://xapian.org/)
Implementation language: C++
Xapian is implemented in C++, but it's also a well-established package with security support in Debian. It has a Python wrapper which is maintained by the Xapian developers, but which is not on PyPI (https://trac.xapian.org/ticket/807). The most reasonable option, I think, would be to use --system-site-packages or something equivalent. The query parser supports prefixes for field searches, but if you type a prefix it doesn't understand, the prefix seems to be silently ignored. It's possible to dump the AST, but this is not super-friendly. Searching for dates and ranges is possible, but difficult to do correctly.
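Since an unrecognized prefix appears to be dropped without any warning, one way to give users feedback would be to check the query text before handing it to the engine. The following is only a rough sketch, independent of any search library: the `KNOWN_PREFIXES` set and the `unknown_prefixes` helper are hypothetical names, and the scan is deliberately naive (it would also flag something like `http:` inside a pasted URL).

```python
import re

# Hypothetical set of field prefixes the site would register with the parser.
KNOWN_PREFIXES = {"title", "author", "abstract"}

# A run of letters followed by a colon and a non-space character, e.g. title:ecg.
# The lookbehind keeps us from matching in the middle of a word or after a colon.
_PREFIX_RE = re.compile(r"(?<![\w:])([A-Za-z_]+):(?=\S)")


def unknown_prefixes(query, known=KNOWN_PREFIXES):
    """Return prefixes used in `query` that are not in `known`, in order of first use."""
    seen = []
    for match in _PREFIX_RE.finditer(query):
        prefix = match.group(1).lower()
        if prefix not in known and prefix not in seen:
            seen.append(prefix)
    return seen


if __name__ == "__main__":
    # A misspelled prefix is reported instead of being silently ignored.
    print(unknown_prefixes("title:ecg autor:smith"))
```

A non-empty result could be turned into a warning on the search page, whichever engine ends up doing the actual parsing.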
Whoosh (https://pypi.org/project/Whoosh/)
Implementation language: Python
The Whoosh package is pure Python, and is thus slow, less likely to have security problems, and available on PyPI. However, it also appears to be unmaintained. It appears to index only stemmed forms and not make a distinction between ... Searching for dates and ranges is possible, but difficult to do correctly.
Solr (https://solr.apache.org/)
Implementation language: Java
Solr is not in Debian; however, it's written in Java, so less likely to have security problems, and it works via an HTTP API so the search engine can run with minimal privileges. The default query parser will report an error if the input has a syntax error or an unknown field prefix; the "dismax" and "edismax" parsers will not. There's also a debug option that outputs the AST as a string. Recommendations for "how do I do exact word/phrase searching with Solr" seem to boil down to "define two fields with duplicate data". But there doesn't seem to be a friendly way to handle this with the standard query parsers, and I don't think Haystack supports this directly. Searching for dates and ranges is possible, but difficult to do correctly.
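To make the parser and debug options concrete, here is a small sketch that only builds a request URL for Solr's `/select` handler; it does not contact a server. The `defType` values (`lucene`, `dismax`, `edismax`) and the `debugQuery` parameter are standard Solr request parameters, while the base URL, the `physionet` core name, and the `solr_select_url` helper are assumptions made up for illustration.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, query, parser="lucene", debug=False):
    """Build a /select URL for the given query string.

    parser: "lucene" (the default parser, which reports syntax errors),
            or "dismax" / "edismax" (which are more forgiving).
    debug:  if True, ask Solr to include debug output, which contains
            the parsed query as a string.
    """
    params = {"q": query, "defType": parser, "wt": "json"}
    if debug:
        params["debugQuery"] = "true"
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical local core; adjust to the real deployment.
    print(solr_select_url("http://localhost:8983/solr/physionet",
                          "title:ecg", debug=True))
```

Sending the same query once with `parser="lucene"` and once with `parser="edismax"` would be a quick way to compare how each parser treats an unknown field prefix.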
The current PhysioNet search function is not great (previous issues: #349, #1971). We would like to replace it with something based on a "real" information-retrieval engine, while also allowing more powerful and user-friendly queries.
There are a few options and in this issue I'll try to list advantages/disadvantages of each.
Requirements:
Good to have:
- ... (+foo or "foo")
- ... ("foo bar")
- ... ecg and electrocardiogram as equivalent

Some options we might consider: