Migrate JabRef search to Lucene #8857

btut · 2022-05-28T16:36:52Z

Currently, JabRef implements its own search syntax and backend for bib-fields. Fulltext PDF files are indexed by a Lucene backend. Since we already manage an index for the fulltext search, we could also index the bib-fields for a faster, more versatile search functionality.
However, this is no easy task, as keeping the index up-to-date raises multiple questions: how to link bib-entries to their index entry, when to update the index, which fields to index, where to store the index, and how to show the results.

I summarize some thoughts below. I would like to work on these ideas over the next weeks and then maybe implement the functionality during JabCon2022.

How to link bib-entries to the index

One problem that I already struggled with when implementing the fulltext search is the absence of a unique key connecting a JabRef bibentry object to a corresponding entry in the Lucene index. Citation keys are not necessarily present. JabRef's entry identifier is volatile and may be different each time JabRef opens. To synchronize the index, however, we need a mechanism to link an entry to the index.

One solution could be hashes. When the user changes an entry, we would need to generate the hash before the change, update the indexed fields and then update the hash to the hash after the change. This would also allow us to easily check which entries need to be re-indexed at startup. Just compare all hashes in the library to all hashes in the index. Hashes that are not found in the index need to be indexed, hashes that are not found in the library need to be deleted.
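A minimal sketch of the startup comparison described above, assuming the hashes are plain `int` values and that both sets fit in memory. `IndexDiff` is a hypothetical name used only for illustration, not an existing JabRef class.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: compares entry hashes in the library with hashes stored in the index. */
final class IndexDiff {
    final Set<Integer> toIndex;   // present in the library, missing from the index
    final Set<Integer> toDelete;  // present in the index, no longer in the library

    private IndexDiff(Set<Integer> toIndex, Set<Integer> toDelete) {
        this.toIndex = toIndex;
        this.toDelete = toDelete;
    }

    static IndexDiff compute(Set<Integer> libraryHashes, Set<Integer> indexHashes) {
        Set<Integer> toIndex = new HashSet<>(libraryHashes);
        toIndex.removeAll(indexHashes);
        Set<Integer> toDelete = new HashSet<>(indexHashes);
        toDelete.removeAll(libraryHashes);
        return new IndexDiff(toIndex, toDelete);
    }
}
```

Entries whose hash appears in both sets are left untouched, so an unchanged library would cause no re-indexing at startup.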

When

Every time an entry changes, the index needs to change with it. This can be:

At startup
When the user changes an entry from JabRef
When the user changes something in the bib-file
Other?

Also, we noticed with the fulltext search that indexing takes too long to be done on the GUI thread. I assume this is not a problem for the normal bib-fields (they amount to a few hundred words at most, and no file needs to be opened and parsed). I suggest (at least trying) to index bib-fields in the foreground and keep the fulltext indexing in the background. A problem that immediately comes to mind: locks. Only one thread may write to the index at a time. If we keep everything in the same index, the background fulltext indexer could block the indexing of the bib-fields. A solution could be to use two indices, but that makes the search more complicated. This problem needs further investigation.

What

ALL bib-fields and linked files (if files can be parsed by JabRef, currently only .pdf but could probably easily be extended to txt, rtf... if that is a valid use-case).

Uncertain: How to treat custom fields. I am unsure whether the set of fields needs to be fixed in the Lucene index or whether fields can be added on the fly. This needs further investigation.

Where

Personally I would prefer having the index close to the bibfile, but the fulltext-index is currently stored in app-data folders (~/.local on linux) and AFAIK that is what programs are supposed to do so I suggest to keep that location.

How to show the results

I would like to highlight search matches in the table. Fulltext-results are currently shown in a tab in the entry editor - which I really do not like. Back when I implemented the feature, @calixtus proposed a way to show the results directly in the table by inserting a row under the corresponding entry that spans the whole table and shows the results. I cannot currently find the link Carl sent back then, but will look it up again. I think that would be a great way to highlight the search results.

ThiloteE · 2022-05-28T17:20:03Z

> Personally I would prefer having the index close to the bibfile, but the fulltext-index is currently stored in app-data folders (~/.local on linux) and AFAIK that is what programs are supposed to do so I suggest to keep that location.

I am sure this could be solved via a preference that allows users to choose where to store such files. This might be especially useful for the fulltext search index, since that one will probably be larger than the index for the normal search. We do have a preference right now to disable the fulltext search index, added after some reports of indexing taking too long and re-indexing being triggered too often for large databases. As the default, I also favour a system folder so as not to pollute the folders holding the library file.

Hashing seems an interesting idea and the post is really well explained. Thanks :)

koppor · 2022-07-07T21:23:59Z

(Working on the ADR on how - feel free to edit this comment to reach a final ADR - I copied the text from the issue)

How to link bib entries to the index

Context and Problem Statement

To synchronize the index with the bibliography database, we need a mechanism to link an entry to the index.
There is no unique key connecting a JabRef bibentry object to a corresponding entry in the Lucene index.

Considered Options

Use org.jabref.model.entry.BibEntry#hashCode
Use BibEntry#getCitationKey (Citation keys)
Use org.jabref.model.entry.BibEntry#getId

Decision Outcome

Chosen option: "Use org.jabref.model.entry.BibEntry#hashCode", because it comes out best (see below).

Pros and Cons of the Options

Use `org.jabref.model.entry.BibEntry#hashCode`

When the user changes an entry, we need to generate the hash before the change, update the indexed fields and then update the hash to the hash after the change.

Good, because it is unique for a BibEntry
Good, because it makes it easy to check which entries need to be re-indexed on startup: Compare all hashes in the library to all hashes in the index. Hashes that are not found in the index need to be indexed, hashes that are not found in the library need to be deleted.
Bad, because it changes when the entry changes
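The re-keying step (take the hash before the change, then store the entry under its new hash) could be sketched as follows. This is a toy stand-in, not Lucene or JabRef code: `HashKeyedIndex` is a hypothetical name, an entry is modelled as a plain `Map` of field names to values, and hash collisions are ignored.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy stand-in for the index: maps an entry hash to a copy of the indexed fields. */
final class HashKeyedIndex {
    private final Map<Integer, Map<String, String>> documents = new HashMap<>();

    void add(Map<String, String> fields) {
        documents.put(fields.hashCode(), new HashMap<>(fields));
    }

    /** Replaces the document stored under the hash taken BEFORE the change. */
    void update(int hashBeforeChange, Map<String, String> fieldsAfterChange) {
        documents.remove(hashBeforeChange);
        add(fieldsAfterChange);
    }

    boolean contains(int hash) {
        return documents.containsKey(hash);
    }

    int size() {
        return documents.size();
    }
}
```

The caller has to capture the old hash before mutating the entry; once the entry has changed, the old key can no longer be derived from it.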

Use `BibEntry#getCitationKey` (Citation keys)

Bad, because a citation key does not always exist

Use `org.jabref.model.entry.BibEntry#getId`

Bad, because volatile: It is generated at each start of JabRef based on the order of entries in the .bib file

koppor · 2022-07-07T21:29:40Z

When

Context and Problem Statement

Every time an entry changes, the index needs to change with it:

At startup
When the user changes an entry from JabRef
When the user changes the file on the file system
When the remote database changes

Decision Outcome

For this, we have the abstract org.jabref.model.entry.event.EntriesEvent, with org.jabref.model.entry.event.EntryChangedEvent, org.jabref.model.database.event.EntriesAddedEvent, and org.jabref.model.database.event.EntriesRemovedEvent. Subscribing to these events lets the index follow changes to the library.
There is also the fulltext index, which could interfere with the search index.
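How an index could follow added, changed, and removed entries can be sketched with stand-in types. Everything below is hypothetical: `IndexUpdater`, `Kind`, and `entryKey` are illustration names and not JabRef's event classes, and the in-memory map stands in for the real index. `entryKey` represents whatever link between entry and index is eventually chosen.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for an index that follows library change events. */
final class IndexUpdater {
    enum Kind { ADDED, CHANGED, REMOVED }

    // entry key -> indexed fields; a real implementation would write to the search index instead
    private final Map<String, Map<String, String>> index = new HashMap<>();

    /** Called once per event, for example from an event subscriber. */
    void onEvent(Kind kind, String entryKey, Map<String, String> fields) {
        switch (kind) {
            case ADDED:
            case CHANGED:
                index.put(entryKey, new HashMap<>(fields)); // (re-)index the entry
                break;
            case REMOVED:
                index.remove(entryKey); // drop the entry from the index
                break;
        }
    }

    /** Returns the indexed fields, or null if the entry is not indexed. */
    Map<String, String> lookup(String entryKey) {
        return index.get(entryKey);
    }
}
```

Treating "added" and "changed" the same way keeps the handler idempotent, which helps if the same event is delivered twice.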

koppor · 2022-07-07T21:32:05Z

Status: Needs testing --> what if 50k entries exist?

How to co-exist with the fulltext index

Context and Problem Statement

We noticed with the fulltext search that indexing takes too long to be done on the GUI thread. We assume that this is not a problem for the normal bib-fields (they amount to a few hundred words at most, and no file needs to be opened and parsed).

Considered Options

Run in the GUI thread and use a separate index
Run in the GUI thread and use the same index

Pros and Cons of the Options

Run in the GUI thread

I suggest (at least trying) to index bib-fields in the foreground and keep the fulltext indexing in the background. A problem that immediately comes to mind: locks. Only one thread may write to the index at a time. If we keep everything in the same index, the background fulltext indexer could block the indexing of the bib-fields. A solution could be to use two indices, but that makes the search more complicated. This problem needs further investigation.
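The contention on a single shared index can be modelled with a plain lock. This sketch only illustrates the problem and is not Lucene code: `WriteLockDemo` is a hypothetical class, and a `ReentrantLock` stands in for whatever write lock the shared index uses.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

/** Models two indexers competing for the write lock of one shared index. */
final class WriteLockDemo {
    private final ReentrantLock writeLock = new ReentrantLock();

    /** Foreground attempt: returns false instead of blocking the calling (GUI) thread. */
    boolean tryIndexBibFields(Runnable work) {
        if (!writeLock.tryLock()) {
            return false; // another writer currently holds the lock
        }
        try {
            work.run();
            return true;
        } finally {
            writeLock.unlock();
        }
    }

    /** Returns {result while the fulltext job holds the lock, result after it finished}. */
    static boolean[] runScenario() {
        WriteLockDemo demo = new WriteLockDemo();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread fulltext = new Thread(() -> {
            demo.writeLock.lock();
            try {
                started.countDown();
                release.await(); // simulates a long-running PDF indexing job
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                demo.writeLock.unlock();
            }
        });
        fulltext.start();
        try {
            started.await();
            boolean whileBusy = demo.tryIndexBibFields(() -> { });
            release.countDown();
            fulltext.join();
            boolean afterwards = demo.tryIndexBibFields(() -> { });
            return new boolean[] {whileBusy, afterwards};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

With two separate indices, each indexer would own its own lock and the foreground attempt would not depend on the fulltext job, at the price of querying two indices.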

koppor · 2022-07-07T21:34:37Z

Status: proposed

Note: I excluded the .rtf discussion. We should focus on .bib only

Which contents of the Bib Library to index?

Context and Problem Statement

Indexed fields need to be transferred to Lucene explicitly.
Which fields should be transferred?

Considered Options

Index known fields
Index all fields (including custom fields)

Decision Outcome

Chosen option: "Index all fields", because the expectation on a search is that everything in the .bib file is searched and this is close enough.
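As far as I know, Lucene does not require fields to be declared up front, so custom fields could be added per document. The toy inverted index below only illustrates that idea with standard-library types; it is not Lucene code, and `FieldIndex` and the `readstatus` field in the usage note are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index: any field name is accepted, no schema is declared in advance. */
final class FieldIndex {
    // "field:term" -> keys of the entries containing that term in that field
    private final Map<String, Set<String>> postings = new HashMap<>();

    void add(String entryKey, Map<String, String> fields) {
        for (Map.Entry<String, String> field : fields.entrySet()) {
            String name = field.getKey().toLowerCase(Locale.ROOT);
            // ASCII-only tokenization, kept simple for brevity
            for (String term : field.getValue().toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(name + ":" + term, k -> new TreeSet<>()).add(entryKey);
                }
            }
        }
    }

    Set<String> search(String field, String term) {
        String key = field.toLowerCase(Locale.ROOT) + ":" + term.toLowerCase(Locale.ROOT);
        return postings.getOrDefault(key, Collections.emptySet());
    }
}
```

An entry with the fields `title` and `readstatus` can be added without any prior setup, and `search("readstatus", "skimmed")` finds it if that value was stored.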

koppor · 2022-07-07T21:37:07Z

Where to store the index

Context and Problem Statement

The search index needs to be stored somewhere.

Considered Options

Use the directory provided by AppDirs
Store close to the .bib file
Make it configurable

Decision Outcome

Chosen option: "Use the directory provided by AppDirs", because the index is generated data, which can be regenerated on demand. AppDirs provides reasonable defaults for application data.

We should implement a cleanup of the AppDirs directory. Maybe an index that has been unused for 3 months should be deleted.
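Such a cleanup could look roughly like this. The sketch assumes that every index lives in its own sub-directory of one root directory and that the directory's last-modified time is a good enough proxy for "last used"; both are assumptions, and `StaleIndexCleaner` is a hypothetical name.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class StaleIndexCleaner {
    /** Deletes every sub-directory of indexRoot not modified within maxAge; returns the deleted paths. */
    static List<Path> deleteUnused(Path indexRoot, Duration maxAge, Instant now) {
        List<Path> stale = new ArrayList<>();
        try {
            // Collect candidates first, so nothing is deleted while the directory is being listed.
            try (DirectoryStream<Path> children = Files.newDirectoryStream(indexRoot, Files::isDirectory)) {
                for (Path indexDir : children) {
                    Instant lastModified = Files.getLastModifiedTime(indexDir).toInstant();
                    if (lastModified.plus(maxAge).isBefore(now)) {
                        stale.add(indexDir);
                    }
                }
            }
            for (Path indexDir : stale) {
                deleteRecursively(indexDir);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return stale;
    }

    private static void deleteRecursively(Path dir) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Deepest paths first, so each directory is empty when it is deleted.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path path : paths) {
            Files.delete(path);
        }
    }
}
```

A call such as `StaleIndexCleaner.deleteUnused(root, Duration.ofDays(90), Instant.now())` would approximate the three-month rule; since the index can be regenerated on demand, deleting too eagerly costs time but no data.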

Pros and Cons of the Options

Use the directory provided by AppDirs

Neutral, because directory grows: Docker and Firefox take much more space there.

Make it configurable

The default setting might be the directory returned by AppDirs

Good, because it offers freedom to pro users
Bad, because this leads to additional code and additional maintenance effort

koppor · 2022-07-07T21:38:16Z

How to show the search results

(No ADR until now)

It should work as it currently works (this needs to be double-checked):

Highlight the words directly in the entry table
Highlight the words in the entry editor
Highlight the words in the entry preview
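Highlighting a search term in a piece of text, independent of where it is displayed, could be sketched as below. `MatchHighlighter` and the marker strings are hypothetical; a real implementation would apply styling in the table, editor, or preview rather than insert text markers, and would probably take the match positions from the search backend.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchHighlighter {
    /** Wraps every case-insensitive occurrence of term in text with the given markers. */
    static String highlight(String text, String term, String open, String close) {
        if (term.isEmpty()) {
            return text;
        }
        // Pattern.quote treats the term literally, so characters like '.' or '(' are safe.
        Pattern pattern = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher matcher = pattern.matcher(text);
        StringBuilder result = new StringBuilder();
        int last = 0;
        while (matcher.find()) {
            result.append(text, last, matcher.start())
                  .append(open).append(matcher.group()).append(close);
            last = matcher.end();
        }
        return result.append(text.substring(last)).toString();
    }
}
```

The original casing of each match is preserved, so the displayed text stays identical apart from the markers.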

ThiloteE added search jabcon labels May 28, 2022

ThiloteE assigned ThiloteE and btut and unassigned ThiloteE May 28, 2022

btut linked a pull request Jul 9, 2022 that will close this issue

Lucene search backend #8963

Draft

16 tasks

koppor added this to the v6.0 milestone Apr 24, 2023

koppor unassigned btut Apr 8, 2024
