Proposal: Harvest the power of xapian to provide advanced search and filter capabilities #851

Open
IMayBeABitShy opened this issue Dec 30, 2023 · 3 comments

Comments

@IMayBeABitShy

Advanced search using Xapian

About this proposal

This proposal describes advanced search and filter operations built on Xapian.
It starts with a general overview of the requirements for such a search and then proposes how this could be achieved using Xapian.

Motivation

The ZIM ecosystem thrives, and thanks to the hard work of countless people, new ZIMs are published for a variety of websites. I dare say that ZIM files may soon be the standard for offline websites. However, the increase in the variety of content means that the ZIM technology needs to stay flexible if it wants to stay convenient, both for developers and end users.

As you have probably already guessed by the title, this proposal is about improving the search functionality of ZIM files. The current search is great for searching text, but it lacks the flexibility for searches where metadata is more important.

Let's take a stackexchange/sotoki ZIM as an example: while it is already possible to search the title and text of questions and answers, other attributes may be just as important for a search. A user may want to see only questions with tag A but not tag B, posted between 2012 and 2016, with a score above 64 and an accepted answer, where the title contains "foo" but the text does not contain "bar". For ZIMs or websites which are primarily media focussed, searching for text and values is even more important.

Requirements

I believe the following requirements are important for an improved search:

  1. Support for searching specific fields (e.g. only title)
  2. Support for searching tags
  3. Searching for boolean values
  4. Searching within a range (both date and numeric)
  5. wildcard search
  6. Sorting depending on the value

Of course, this list is likely incomplete, so please feel free to add your own ideas to the discussion.

So, the awesome news: Xapian already supports everything we need and libzim already uses xapian. Adding the new search mostly boils down to adding an API to the ZIM creation process to allow the specification of the exact search metadata, storing additional information about fields in the ZIM and configuring the QueryParser. An example of such a query string would be tag:a AND NOT tag:b AND posted:1.1.2012..31.12.2016 AND score:64.. AND accepted:true title:foo NOT text:bar (NOTE: AND can be set to be optional).

Proposal for including the advanced search in ZIMs

I've written a short proof-of-concept in Python for configuring xapian as needed (excluding any ZIM related logic). It contains the dynamic generation of terms and the configuration of prefixes. You can find it here.

During ZIM creation/indexing

During the ZIM creation, we need a way of telling xapian which terms to add for each document (aka an item). The simplest way I can think of (beware: minimal C experience!) would be if each item had a method which returns an object describing which values to add for which terms. In the previously mentioned proof-of-concept I've used a simple hashmap mapping the user search key to the value, but as additional type info for each field will be needed, a custom data structure may be beneficial.

The ZIM creator could then, depending on the data types, add the various terms/boolean terms as needed. In addition, the mapping of human-readable search prefixes to xapian prefixes, as well as any additional configuration flags, would need to be stored in a separate item, as xapian unfortunately does not seem to store this kind of information within the database.

I propose adding an entry X/fulltext/xapian_fields which should contain said information. At the very least, we need to store the xapian prefix, type and value slot for each human-readable prefix. We should also store additional configuration options (e.g. suffix for numeric ranges, ...). A simple format would be [human readable prefix]\x00[xapian prefix]\x00[value slot as 4 byte unsigned int][flags as 8 bit unsigned int] for each entry, although adding a header with general configuration options for the QueryParser (e.g. should FLAG_AUTO_SYNONYM be used) would probably be beneficial.
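
To make the proposed entry format more concrete, here is a minimal Python sketch of writing and reading such a table. Several details are assumptions of mine and not part of the proposal: the little-endian byte order, the meaning of the individual flag bits, the NO_SLOT marker for fields without a value slot, and the FieldEntry, pack_entries and unpack_entries names.

```python
import struct
from dataclasses import dataclass

# Hypothetical flag bits; the proposal only says "flags as 8 bit unsigned int".
FLAG_TEXT = 0x01
FLAG_BOOLEAN = 0x02
FLAG_NUMERIC_RANGE = 0x04
FLAG_DATE_RANGE = 0x08

NO_SLOT = 0xFFFFFFFF  # assumed marker for "this field has no value slot"


@dataclass(frozen=True)
class FieldEntry:
    human_prefix: str   # what the user types, e.g. "tag"
    xapian_prefix: str  # term prefix used in the index, e.g. "XTAG"
    value_slot: int     # Xapian value slot, or NO_SLOT
    flags: int          # combination of the FLAG_* bits above


def pack_entries(entries):
    """Serialise entries as: name NUL prefix NUL slot(u32) flags(u8)."""
    out = bytearray()
    for e in entries:
        out += e.human_prefix.encode("utf-8") + b"\x00"
        out += e.xapian_prefix.encode("utf-8") + b"\x00"
        # "<IB" = little-endian u32 followed by u8; byte order is an assumption.
        out += struct.pack("<IB", e.value_slot, e.flags)
    return bytes(out)


def unpack_entries(data):
    """Inverse of pack_entries."""
    entries, pos = [], 0
    while pos < len(data):
        end = data.index(b"\x00", pos)
        human = data[pos:end].decode("utf-8")
        pos = end + 1
        end = data.index(b"\x00", pos)
        prefix = data[pos:end].decode("utf-8")
        pos = end + 1
        slot, flags = struct.unpack_from("<IB", data, pos)
        pos += 5  # 4 bytes slot + 1 byte flags
        entries.append(FieldEntry(human, prefix, slot, flags))
    return entries
```

The fixed-width slot and flags are read by position, so NUL bytes inside them do not confuse the parser. A header with general QueryParser options, as suggested above, would simply precede the first entry.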

The generation of the terms would be as follows:

  • as per xapian convention, each field name would start with "X" and be uppercase. For some fields (such as author), specific single-letter field names exist, but utilizing them would make the API somewhat more complex, so let's just ignore them.
  • If the field value is a string, the text needs to be indexed. The crawler should be able to tell the ZIM creator whether the text should be searchable without specifying a field and/or when a field was specified. For indexing without a field, a simple call to index_text (a method of xapian's TermGenerator) is enough. When a text should be searchable with a field (e.g. to restrict search to the title), it needs to be indexed with the X[upper_case_field] prefix.
  • A list of tags can be implemented by adding several boolean terms in the form X[upper_case_field][lower_case_tag]
  • Boolean values behave rather similarly: they can be added as boolean terms as well
  • To register numeric values, we need to add them as (sortable) values and store the index of the field as previously described
  • dates can be stored in a searchable manner if converted to YYYYMMDD, but it seems like xapian is unable to store additional time information (e.g. hour and minute). A custom RangeProcessor could solve this problem, but may not be necessary.

A simplified example for the term generation can be found in the previously mentioned proof-of-concept.
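
Independently of that proof-of-concept, the rules above can be sketched in plain Python. This is only an illustration under stated assumptions: it does not call xapian at all but produces a "plan" of what would be indexed, and the plan_document function, the plan layout and the "true"/"false" term suffixes are hypothetical choices of mine.

```python
from datetime import date


def xapian_prefix(field):
    """Proposal convention: 'X' followed by the upper-cased field name."""
    return "X" + field.upper()


def plan_document(fields, slots):
    """Turn crawler-supplied fields into indexing instructions.

    fields: name -> str | bool | int | float | date | list[str]
    slots:  name -> value slot number, needed for numeric and date fields
    """
    plan = {"text": [], "boolean_terms": [], "values": {}}
    for name, value in fields.items():
        prefix = xapian_prefix(name)
        if isinstance(value, bool):
            # Checked before int, because bool is a subclass of int.
            plan["boolean_terms"].append(prefix + ("true" if value else "false"))
        elif isinstance(value, str):
            # Free text, to be indexed under the field's prefix.
            plan["text"].append((prefix, value))
        elif isinstance(value, list):
            # Tags: one boolean term per tag, X[upper_case_field][lower_case_tag].
            plan["boolean_terms"] += [prefix + tag.lower() for tag in value]
        elif isinstance(value, date):
            # Dates are stored as YYYYMMDD so that they sort as strings.
            plan["values"][slots[name]] = value.strftime("%Y%m%d")
        elif isinstance(value, (int, float)):
            # A real indexer would store xapian.sortable_serialise(value) here.
            plan["values"][slots[name]] = value
        else:
            raise TypeError(f"unsupported type for field {name!r}")
    return plan


plan = plan_document(
    {"title": "How to foo", "tags": ["Python", "xapian"], "accepted": True,
     "score": 64, "posted": date(2012, 1, 1)},
    {"score": 0, "posted": 1},
)
```

With the real bindings, the "text" pairs would go through TermGenerator.index_text with the prefix argument, the boolean terms through Document.add_boolean_term, and the values through Document.add_value.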

During ZIM reading

When opening a ZIM for reading, the reader would have to open and parse the previously discussed file and use the content to dynamically register the prefixes with the QueryParser while also setting the right flags (e.g. wildcard support). In addition, a method to select the value to sort by would have to be provided, as it does not appear like the sort order could be specified via the query string.
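
A rough Python sketch of this reader-side step follows. It is written against any object that offers add_prefix and add_boolean_prefix (both are real xapian.QueryParser methods), so that it runs without the bindings installed. The tuple layout of the decoded entries, the "kind" strings and the RecordingParser stand-in are assumptions made for illustration only.

```python
def configure_parser(parser, entries):
    """Register prefixes on a QueryParser-like object.

    entries: iterable of (human_prefix, xapian_prefix, kind, slot) tuples,
    a hypothetical decoded form of X/fulltext/xapian_fields. An empty
    iterable (e.g. an older ZIM without the entry) leaves the parser as is.
    Returns the fields that still need a range processor, as
    (human_prefix, kind, slot) tuples.
    """
    range_fields = []
    for human, prefix, kind, slot in entries:
        if kind == "text":
            parser.add_prefix(human, prefix)
        elif kind == "boolean":
            parser.add_boolean_prefix(human, prefix)
        elif kind in ("number", "date"):
            range_fields.append((human, kind, slot))
        else:
            raise ValueError(f"unknown field kind: {kind!r}")
    return range_fields


class RecordingParser:
    """Stand-in for xapian.QueryParser that only records the calls made."""

    def __init__(self):
        self.calls = []

    def add_prefix(self, field, prefix):
        self.calls.append(("add_prefix", field, prefix))

    def add_boolean_prefix(self, field, prefix):
        self.calls.append(("add_boolean_prefix", field, prefix))


parser = RecordingParser()
pending = configure_parser(parser, [
    ("title", "XTITLE", "text", None),
    ("tag", "XTAG", "boolean", None),
    ("score", "XSCORE", "number", 0),
])
```

With the real bindings, each returned range field would presumably be turned into a xapian.NumberRangeProcessor or xapian.DateRangeProcessor and registered via QueryParser.add_rangeprocessor, while sorting by a value slot would be set on the Enquire object with set_sort_by_value.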

Compatibility

I am not a xapian expert, but I think these changes should still maintain compatibility with both older readers and older ZIMs, provided that the newer reader handles the missing X/fulltext/xapian_fields entry smartly and falls back to the old behavior.

Other concerns

Adding more search information will obviously make the search index larger. As a result, ZIM files with a lot of metadata may become noticeably larger should they choose to utilize the proposed features. I don't think there'd be any significant size impact if the new features aren't used.

The xapian documentation contains a warning that some queries may be rather slow. A malicious or dumb user using a public ZIM server may enter search queries that could slow down the host system.

Other ideas

Please note that this section isn't really a part of this proposal but more like ideas for further improvements.

The proposed changes should IMO provide a significantly more flexible search. Yet, these changes are mostly background changes. I believe media-based ZIMs may benefit from having a slightly more flexible search frontend. For example, users searching a gutenberg ZIM may find it beneficial if the book cover is also shown.

I've got two ideas how this could be done:

  • provide several search layouts (e.g. the current one, one with an image, an image gallery grid) and let ZIMs specify which one they want. The problems I can see are searches involving multiple ZIM files as well as limited flexibility
  • provide an endpoint from which javascript in a ZIM can perform the search and generate the result HTML live. The great advantage of this would be its flexibility and versatility, as the ZIM itself knows best how the results should be presented to feel natural to the user. However, such communication may be hard to implement in some ZIM readers. This could be achieved using something like a .well-known/zim/search REST endpoint.
@kelson42
Contributor

kelson42 commented Jan 13, 2024

@IMayBeABitShy Thank you for your ticket. I can already say that this ticket is very difficult for us to handle as it talks about many different things (see https://github.com/openzim/overview/wiki/Report-a-bug). Such tickets tend to either be closed very quickly because they are too broad or just die after years of inactivity. I will try to handle it to make the best of it, but I cannot guarantee anything.

@kelson42
Contributor

One part of the answer is that we don't want in ZIM some kind of database possibilities allowing to build sophisticated applications in ZIM. If someone wants to do that, it has to do it application side in Javascript, by using https://pouchdb.com/ for example. This is for example what we do with our Nautilus and Gutenberg scrapers.

@IMayBeABitShy
Author

@kelson42 Thank you for your response. I understand that this is a relatively big task. This issue was mostly meant as a proposal, a way of sharing a possible approach and opening a venue for discussion.

One part of the answer is that we don't want in ZIM some kind of database possibilities allowing to build sophisticated applications in ZIM.

Is there a specific reason for this? While sophisticated applications are definitely not supposed to be distributed as a ZIM file, improving the search functionality could significantly improve the usability and user experience of ZIMs that contain a lot of content/pages and where the user does not know the url/title of the entry they are looking for. This would, for example, be the case in Q&A-based ZIMs like stackoverflow.

Would a pull request implementing such a feature potentially be accepted or is this more of an architectural decision?

If someone wants to do that, it has to do it application side in Javascript, by using https://pouchdb.com/ for example.

Thank you for the recommendation, I'll be sure to check out pouchdb. But generally the problems I see with using JS for search are:

  1. It duplicates features: a perfectly good (but IMO not powerful enough) search already exists; adding a second one would be redundant, may waste storage space and results in a feature disparity between the two
  2. It would likely be less efficient than using xapian
  3. The current search has a nice, well defined "endpoint", an individual solution would, for example, make it harder to access the search via the API. This, in turn, would lead to worse integration into viewers.
  4. General compatibility concerns

this ticket is very difficult for us to handle as it talks about many different things

Sorry about this, I had thought I had done well structuring the proposal. Do you perhaps have any tips for improving the structure? The page you have linked only applies for bug reports and would be relatively hard to apply for proposals.


2 participants