Migrate JabRef search to Lucene #8857

Open
btut opened this issue May 28, 2022 · 7 comments · May be fixed by #8963

Comments

@btut
Contributor

btut commented May 28, 2022

Currently, JabRef implements its own search syntax and backend for bib-fields, while fulltext PDF files are indexed by a Lucene backend. Since we already manage an index for the fulltext search, we could also index the bib-fields for a faster, more versatile search.
However, this is no easy task, as keeping the index up to date raises several questions: how to link bib-entries to their index entries, when to update the index, which fields to index, where to store the index, and how to show the results.

I summarize some thoughts below. I would like to work on these ideas over the next weeks and then maybe implement the functionality during JabCon2022.

How to link bib-entries to the index

One problem that I already struggled with when implementing the fulltext search is the absence of a unique key connecting a JabRef bibentry object to its corresponding entry in the Lucene index. Citation keys are not necessarily present. JabRef's entry identifier is volatile and may be different each time JabRef opens. To synchronize the index, however, we need a mechanism to link an entry to the index.

One solution could be hashes. When the user changes an entry, we would need to generate the hash before the change, update the indexed fields and then update the hash to the hash after the change. This would also allow us to easily check which entries need to be re-indexed at startup. Just compare all hashes in the library to all hashes in the index. Hashes that are not found in the index need to be indexed, hashes that are not found in the library need to be deleted.
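
The startup comparison described above can be sketched as two set differences. This is only an illustration under assumptions: hashes are modeled as plain integers, and the class name `IndexSyncPlan` is made up for this example and is not part of JabRef.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch only: hashes are plain ints here; in JabRef they would be derived from the entries.
class IndexSyncPlan {
    final Set<Integer> toIndex;  // hashes present in the library but missing from the index
    final Set<Integer> toDelete; // hashes present in the index but no longer in the library

    IndexSyncPlan(Set<Integer> toIndex, Set<Integer> toDelete) {
        this.toIndex = toIndex;
        this.toDelete = toDelete;
    }

    static IndexSyncPlan compare(Set<Integer> libraryHashes, Set<Integer> indexHashes) {
        Set<Integer> toIndex = new HashSet<>(libraryHashes);
        toIndex.removeAll(indexHashes);
        Set<Integer> toDelete = new HashSet<>(indexHashes);
        toDelete.removeAll(libraryHashes);
        return new IndexSyncPlan(toIndex, toDelete);
    }

    public static void main(String[] args) {
        IndexSyncPlan plan = compare(Set.of(1, 2, 3), Set.of(2, 3, 4));
        // prints "index: [1], delete: [4]"
        System.out.println("index: " + plan.toIndex + ", delete: " + plan.toDelete);
    }
}
```

Entries whose hash appears on both sides are left untouched, which is what would make the startup check cheap.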

When

Every time an entry changes, the index needs to change with it. This can be:

  • At startup
  • When the user changes an entry from JabRef
  • When the user changes something in the bib-file
  • Other?

Also, we noticed with the fulltext search that indexing takes too long to be done on the GUI thread. I assume that this problem does not arise for the normal bib-fields (it's only a few hundred words at most, and no file needs to be opened and parsed). I suggest (at least trying) to index bib-fields in the foreground and keep the fulltext indexing in the background. A problem that immediately comes to mind: locks. Only one thread may write to the index at a time. If we keep everything in the same index, the background fulltext indexer could block the indexing of the bib-fields. One solution could be to use two indices, but that makes the search more complicated. This problem needs further investigation.

What

ALL bib-fields and linked files (if files can be parsed by JabRef, currently only .pdf but could probably easily be extended to txt, rtf... if that is a valid use-case).

Uncertain: How to treat custom fields. I am unsure whether the set of fields needs to be fixed in the Lucene index or whether fields can be added on the fly. This needs further investigation.

Where

Personally I would prefer having the index close to the bibfile, but the fulltext-index is currently stored in app-data folders (~/.local on linux) and AFAIK that is what programs are supposed to do so I suggest to keep that location.

How to show the results

I would like to highlight search matches in the table. Fulltext-results are currently shown in a tab in the entry editor - which I really do not like. Back when I implemented the feature, @calixtus proposed a way to show the results directly in the table by inserting a row under the corresponding entry that spans the whole table and shows the results. I cannot currently find the link Carl sent back then, but will look it up again. I think that would be a great way to highlight the search results.

@ThiloteE
Member

Personally I would prefer having the index close to the bibfile, but the fulltext-index is currently stored in app-data folders (~/.local on linux) and AFAIK that is what programs are supposed to do so I suggest to keep that location.

I am sure this could be solved via a preference that allows users to choose where to store such files. This might be especially useful for the full-text search index, since that one will probably be larger than the normal search index. We do have a preference right now to disable the full-text search index, after some reports of it taking too long and re-indexing being triggered too often for large databases. As a default, I also favour a system folder so as not to pollute the folders holding the library file.

Hashing seems an interesting idea and the post is really well explained. Thanks :)

@koppor
Member

koppor commented Jul 7, 2022

(Working on the ADR on how - feel free to edit this comment to reach a final ADR - I copied the text from the issue)

How to link bib entries to the index

Context and Problem Statement

To synchronize the index with the bibliography database, we need a mechanism to link an entry to the index.
There is no unique key connecting a JabRef bibentry object to its corresponding entry in the Lucene index.

Considered Options

  • Use org.jabref.model.entry.BibEntry#hashCode
  • Use BibEntry#getCitationKey (Citation keys)
  • Use org.jabref.model.entry.BibEntry#getId

Decision Outcome

  • Chosen option: "Use org.jabref.model.entry.BibEntry#hashCode", because it comes out best (see below).

Pros and Cons of the Options

Use org.jabref.model.entry.BibEntry#hashCode

When the user changes an entry, we need to generate the hash before the change, update the indexed fields and then update the hash to the hash after the change.

  • Good, because unique for a BibEntry
  • Good, because this allows us to easily check which entries need to be re-indexed on startup: Compare all hashes in the library to all hashes in the index. Hashes that are not found in the index need to be indexed, hashes that are not found in the library need to be deleted.
  • Bad, because the hash changes whenever the entry changes
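
A minimal sketch of the "hash before, hash after" update, assuming an entry can be modeled as a plain field map and the index as an in-memory map keyed by hash. `HashKeyedIndex` is a hypothetical name for illustration; it is neither JabRef's nor Lucene's API.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: an in-memory stand-in for an index whose documents are keyed by entry hash.
class HashKeyedIndex {
    private final Map<Integer, Map<String, String>> documents = new HashMap<>();

    // Stores a copy of the fields under the hash of their current content.
    // Note: two entries with the same hash would overwrite each other here.
    void add(Map<String, String> fields) {
        documents.put(fields.hashCode(), new HashMap<>(fields));
    }

    // oldHash must be captured BEFORE the entry was modified.
    void update(int oldHash, Map<String, String> changedFields) {
        documents.remove(oldHash);
        add(changedFields);
    }

    boolean contains(int hash) {
        return documents.containsKey(hash);
    }

    int size() {
        return documents.size();
    }

    public static void main(String[] args) {
        HashKeyedIndex index = new HashKeyedIndex();
        Map<String, String> entry = new HashMap<>();
        entry.put("title", "Lucene in Action");
        index.add(entry);

        int oldHash = entry.hashCode(); // remember the key before editing
        entry.put("year", "2010");      // the edit changes the hash
        index.update(oldHash, entry);

        System.out.println(index.contains(oldHash) + " " + index.contains(entry.hashCode()));
    }
}
```

The sketch also shows the downside listed above: once the entry has been edited, the old key can no longer be computed from it, so the caller has to capture it first.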

Use BibEntry#getCitationKey (Citation keys)

  • Bad, because a citation key is not always present

Use org.jabref.model.entry.BibEntry#getId

  • Bad, because volatile: It is generated at each start of JabRef based on the order of entries in the .bib file

@koppor
Member

koppor commented Jul 7, 2022

When

Context and Problem Statement

Every time an entry changes, the index needs to change with it:

  • At startup
  • When the user changes an entry from JabRef
  • When the user changes the file on the file system
  • When the remote database changes

Decision Outcome

For this, we have the abstract org.jabref.model.entry.event.EntriesEvent, with org.jabref.model.entry.event.EntryChangedEvent, org.jabref.model.database.event.EntriesAddedEvent, and org.jabref.model.database.event.EntriesRemovedEvent. Subscribing to that
There is also the fulltext index, which could interfere with the search index.
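
As a rough sketch of how such events could drive index updates: the types below are simplified stand-ins invented for this example (they are not the real JabRef event classes, and no event bus is involved), and the "operations" are just recorded as strings instead of touching an index.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: stand-in event types and a listener that maps each event to an index operation.
class IndexUpdateSketch {
    enum Kind { ADDED, CHANGED, REMOVED }

    // Hypothetical, simplified replacement for the entry event hierarchy.
    static final class EntryEvent {
        final Kind kind;
        final String entryId;

        EntryEvent(Kind kind, String entryId) {
            this.kind = kind;
            this.entryId = entryId;
        }
    }

    // Hypothetical listener; a real one would write to the index instead of a list.
    static final class IndexListener {
        final List<String> operations = new ArrayList<>();

        void onEvent(EntryEvent event) {
            switch (event.kind) {
                case ADDED:
                    operations.add("index " + event.entryId);
                    break;
                case CHANGED:
                    operations.add("reindex " + event.entryId);
                    break;
                case REMOVED:
                    operations.add("delete " + event.entryId);
                    break;
            }
        }
    }

    public static void main(String[] args) {
        IndexListener listener = new IndexListener();
        listener.onEvent(new EntryEvent(Kind.ADDED, "e1"));
        listener.onEvent(new EntryEvent(Kind.CHANGED, "e1"));
        listener.onEvent(new EntryEvent(Kind.REMOVED, "e1"));
        System.out.println(listener.operations);
    }
}
```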

@koppor
Member

koppor commented Jul 7, 2022

Status: Needs testing --> what if 50k entries exist?

How to co-exist with the fulltext index

Context and Problem Statement

We noticed with the fulltext search that indexing takes too long to be done on the GUI thread. We assume that this problem does not arise for the normal bib-fields (it's only a few hundred words at most, and no file needs to be opened and parsed).

Considered Options

  • Run in the GUI thread and use a separate index
  • Run in the GUI thread and use the same index

Pros and Cons of the Options

Run in the GUI thread

I suggest (at least trying) to index bib-fields in the foreground and keep the fulltext indexing in the background. A problem that immediately comes to mind: locks. Only one thread may write to the index at a time. If we keep everything in the same index, the background fulltext indexer could block the indexing of the bib-fields. One solution could be to use two indices, but that makes the search more complicated. This problem needs further investigation.
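
One further possibility, not among the options listed above and only a sketch: keep a single index but funnel every write through one dedicated writer thread, so that bib-field and fulltext updates never compete for the lock. The class `SingleWriterQueue` is hypothetical, and the "writes" are only logged as strings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch only: serializes all index writes on one thread; the "write" is just a log entry.
class SingleWriterQueue {
    private final ExecutorService writer = Executors.newSingleThreadExecutor();
    private final List<String> log = Collections.synchronizedList(new ArrayList<>());

    // Queues a write; tasks run one at a time, in submission order.
    void submit(String description) {
        writer.execute(() -> log.add(description));
    }

    // Waits for the queued writes to finish and returns what was written.
    List<String> shutdownAndGetLog() throws InterruptedException {
        writer.shutdown();
        writer.awaitTermination(10, TimeUnit.SECONDS);
        return new ArrayList<>(log);
    }

    public static void main(String[] args) throws InterruptedException {
        SingleWriterQueue queue = new SingleWriterQueue();
        queue.submit("bib-fields: entry 1");
        queue.submit("fulltext: paper.pdf");
        queue.submit("bib-fields: entry 2");
        System.out.println(queue.shutdownAndGetLog());
    }
}
```

This does not remove the underlying concern: a slow fulltext task still delays every bib-field update queued behind it, so it would need the same testing as the other options.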

@koppor
Member

koppor commented Jul 7, 2022

Status: proposed

Note: I excluded the .rtf discussion. We should focus on .bib only

Which contents of the Bib Library to index?

Context and Problem Statement

Indexed fields need to be transferred to Lucene explicitly.
Which fields should be transferred?

Considered Options

  • Index known fields
  • Index all fields (including custom fields)

Decision Outcome

Chosen option: "Index all fields", because the expectation on a search is that everything in the .bib file is searched and this is close enough.
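
To make the difference between the two options concrete, here is a small sketch assuming an entry is a plain map from field name to value. The `KNOWN_FIELDS` set is an arbitrary subset chosen for the example, and `reading-status` stands in for any custom field.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Sketch only: contrasts "index known fields" with "index all fields" on a plain field map.
class FieldSelection {
    // Hypothetical subset of standard fields, for illustration only.
    static final Set<String> KNOWN_FIELDS = Set.of("author", "title", "year", "journal");

    // Option "Index known fields": custom fields are silently dropped.
    static Map<String, String> knownFieldsOnly(Map<String, String> entryFields) {
        Map<String, String> selected = new TreeMap<>();
        for (Map.Entry<String, String> field : entryFields.entrySet()) {
            if (KNOWN_FIELDS.contains(field.getKey())) {
                selected.put(field.getKey(), field.getValue());
            }
        }
        return selected;
    }

    // Option "Index all fields": everything in the entry becomes searchable.
    static Map<String, String> allFields(Map<String, String> entryFields) {
        return new TreeMap<>(entryFields);
    }

    public static void main(String[] args) {
        Map<String, String> entry = Map.of("title", "Lucene in Action", "reading-status", "unread");
        System.out.println(knownFieldsOnly(entry).keySet()); // prints "[title]"
        System.out.println(allFields(entry).keySet());       // prints "[reading-status, title]"
    }
}
```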

@koppor
Member

koppor commented Jul 7, 2022

Where to store the index

Context and Problem Statement

The search index needs to be stored somewhere

Considered Options

  • Use the directory provided by AppDirs
  • Store close to the .bib file
  • Make it configurable

Decision Outcome

Chosen option: "Use the directory provided by AppDirs", because the index is generated data, which can be regenerated on demand. AppDirs provides reasonable defaults for application data.

We should implement a cleanup of the AppDirs directory. For example, an index that has not been used for 3 months could be deleted.
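
Such a cleanup could look roughly like the sketch below. It assumes that each index lives in its own subdirectory of one root folder and that the directory's last-modified time is an acceptable proxy for "last used"; both are assumptions, and a real implementation might need an explicit last-access marker, since reading an index does not update that timestamp. `StaleIndexCleanup` is a hypothetical name.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch only: finds and deletes index directories that have not been modified for a while.
class StaleIndexCleanup {
    // Returns the subdirectories of root whose last-modified time is older than maxAge.
    static List<Path> findStale(Path root, Duration maxAge, Instant now) throws IOException {
        List<Path> stale = new ArrayList<>();
        Instant cutoff = now.minus(maxAge);
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(root, path -> Files.isDirectory(path))) {
            for (Path dir : dirs) {
                if (Files.getLastModifiedTime(dir).toInstant().isBefore(cutoff)) {
                    stale.add(dir);
                }
            }
        }
        return stale;
    }

    // Deletes a directory tree, children before parents.
    static void deleteRecursively(Path dir) throws IOException {
        List<Path> deepestFirst;
        try (Stream<Path> paths = Files.walk(dir)) {
            deepestFirst = paths.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path path : deepestFirst) {
            Files.delete(path);
        }
    }
}
```

A caller would combine the two, for example `for (Path dir : findStale(root, Duration.ofDays(90), Instant.now())) deleteRecursively(dir);`, where 90 days approximates the 3 months mentioned above.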

Pros and Cons of the Options

Use the directory provided by AppDirs

  • Neutral, because directory grows: Docker and Firefox take much more space there.

Make it configurable

The default setting might be the directory returned by AppDirs

  • Good, because it offers freedom to pro users
  • Bad, because this leads to additional code and additional maintenance effort

@koppor
Member

koppor commented Jul 7, 2022

How to show the search results

(No ADR until now)

It should work as it currently works (this needs double-checking):

  • Highlight the words directly in the entry table
  • Highlight the words in the entry editor
  • Highlight the words in the entry preview
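
Independent of where the highlighting is displayed, marking the matched words in a piece of text can be sketched as follows. The `[[` and `]]` markers are placeholders for whatever styling the table, editor, or preview would apply, and `MatchHighlighter` is a hypothetical name; this is not how JabRef currently does it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: wraps every case-insensitive occurrence of a search term in [[ ]] markers.
class MatchHighlighter {
    static String highlight(String text, String term) {
        if (term.isEmpty()) {
            return text;
        }
        // Pattern.quote treats the term literally, so regex metacharacters are safe.
        Pattern pattern = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (matcher.find()) {
            out.append(text, last, matcher.start());
            out.append("[[").append(matcher.group()).append("]]");
            last = matcher.end();
        }
        return out.append(text.substring(last)).toString();
    }

    public static void main(String[] args) {
        // prints "[[Lucene]] in Action"
        System.out.println(highlight("Lucene in Action", "lucene"));
    }
}
```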

@btut btut linked a pull request Jul 9, 2022 that will close this issue
16 tasks
@koppor koppor added this to the v6.0 milestone Apr 24, 2023
@koppor koppor unassigned btut Apr 8, 2024