Solr-based similarity search #1753

ffont · 2024-01-24T12:41:49Z

Issue(s)
#1714

Description
This PR implements Solr-based similarity search for the web interface (not for the API!). It does not remove any of the previous functionality, and old/new similarity can be toggled with the setting USE_SEARCH_ENGINE_SIMILARITY.

This PR is to investigate different ways in which we can use Solr for similarity search. I have successfully implemented it, ~~but there are some decisions to be made which have important implications~~ (EDIT: I think after #12531aa it is clear how to proceed). To implement what is in this PR, I had to gain a good understanding of how Solr documents work, so it is possible that the explanations below are not very clear if you're not familiar with such concepts.

Essentially this PR adds similarity support by adding a new type of field in the Solr schema which can store "similarity vectors". Currently we only support 100-dimension vectors, but more types of vectors could be added. Then, add index time, we add vector information (embeddings data) to the Solr document just like any other field (or almost like that, see below). We retrieve embeddings from a SoundAnalysis object of a specific type of analyzer (for example, fsd-sinet_v1). All this can be configured using global Django settings.

The index is updated with the same strategy as normal Solr index updates. When a sound gets analyzed with the analyzer used for similarity, then it is marked as "index dirty", so the next time the job post_dirty_sounds_to_search_engine runs, the sound will be updated in the index with the similarity data. Also, I had to add new methods for sounds to get their "similarity state" and decide whether similarity buttons should be shown. I do that by checking the existence of SoundAnalysis objects of the designed analyzer (and I already took care of some optimizations to reduce number of queries). Therefore the similarity_state boolean field in the sound model is not needed. Nevertheless, I'd consider using it again and setting it at index time to avoid having to do some complex queries for knowing if the sound is ready for similarity (and also that method would be a bit more accurate for a number of reasons I won't detail here).

I experimented with two methods of adding the similarity data to the sound index:

The first method is straightforward: we simply add a similarity vector as a field to a Solr document. Using this method, search queries work basically like any other query, but the q parameter is used to define a point of the embedding space as a target for the similarity search. I call this method the "parent" method.
The second method involves the use of Solr "child" documents. This "child" method supports relating several similarity vectors with one single sound, so that we can store similarity vectors computed using different analyzers (i.e. different embedding extractors) and also we an even store different similarity vectors that correspond to different timestamps of the sound. This method therefore potentially allows to find similar sounds "within" a sound (so we can say sound X is similar to sound Y at 1:35), and also allows to easily switch between analyzers (embeddings) for getting similarity results. The downside of this method is that there are a number of query options like grouping and faceting that will not work because the documents that Solr returns are "child" documents and these include the id of the parent (and some other stuff) but not all parent fields (EDIT: this is no longer true, thanks to help from Solr users mailing list I could fix that).

A way to obtain the best of both methods, would be to index child documents also with the fields of the parent documents (or most of them at least), but this would mean having a lot of redundant information in the search index (EDIT: this is no longer true, no redundant information needed). Also, if we include several vectors per sound, it remains to be seen how this would scale in our production environment. Therefore, there are some things to investigate here...

So here are some next steps for this PR:

~~Add tests!~~ Done
~~Choose a method (parent or child) for similarity queries~~ We'll use child method
~~Check that NN returned by Solr are consistent with our experiments~~ This will be better done once we actually have some experiments to compare to. For now we can merge without really activating solr-based similarity
~~Performance analysis~~ This will be better done once deployed with a full index
~~Populate similarity_state Sound field at index time~~ This optimization we can add when completely removing old similarity code, as otherwise there would be conflicts if using the same field.

Deployment steps:
This needs a full reindex of sounds, including removing the freesound Solr core data folder as the schema has changed. Also, the setting USE_SEARCH_ENGINE_SIMILARITY must be set to True for the Solr-based similarity to take over gaia similarity.

This means that now documents can be partially updated instead of always being completely replaced. This features is not used yet anywhere, but it will be useful when including similarity data to the search engine. #1714

…bled We used to store similarity_state field in the sound model, but this is no longer needed when using the search engine based similarity.

This mode complements get_child mode, and will match parent documents instead of child documents

Also fixed an issue with grouping by pack in similarity search

Bomme

Sorry, for commenting so late. I started this review and wanted to take another look, but then never got around to it.
Here are some comments from my first round of review

README.md

Bomme · 2024-02-01T17:06:37Z

search/templatetags/search.py

@@ -92,6 +92,8 @@ def display_facet(context, flt, facet, facet_type, title=""):
            context['sort'] if context['sort'] is not None else '',
            context['weights'] or ''
        )
+        if context['similar_to'] is not None:
+            element['add_filter_url'] += '&similar_to={}'.format(context['similar_to'])


I know it was not started in this PR but we should really use something like urllib.parse.urlencode to build the query args to avoid bugs introduced by typos etc.

For example, drag along a filter_url_args dict and only at the very end:

element['add_filter_url'] = urllib.parse.urlencode(filter_url_args)

Yes I agree, I did not update stuff here because this will be part of the refactoring of the whole search views/template code that I've been planning to do. All URL that we build (filter urls, pagination, etc.) should be built by some object that knows how to handle this and then that object should do the proper encoding.

sounds/views.py

utils/search/__init__.py

Bomme · 2024-02-01T18:36:08Z

utils/search/backends/solr555pysolr.py

@@ -358,10 +149,302 @@ def get_forum_index(self):
                always_commit=True
            )
        return self.forum_index
+
+    # Util functions
+    def transform_document_into_update_document(self, document):


This should be function and not a method of this class since we do not need anything from the class

It is true that we don't use self but it made sense to have it as a method because in this way it is easy to override methods in subclasses. This method does actually not need to be overwritten, but there are other similar methods that don't use self either but that need to be overwritten (e.g. search_process_filter). I decided to keep them all as methods for consistency.

Bomme · 2024-02-01T18:36:57Z

utils/search/backends/solr555pysolr.py

+        new_document.update({key: {'set': value} for key, value in document.items() if key != 'id'})
+        return new_document
+
+    def convert_sound_to_search_engine_document(self, sound, fields_to_include=[]):


This should be function and not a method of this class since we do not need anything from the class

Explained in previous comment.

Bomme · 2024-02-01T18:37:44Z

utils/search/backends/solr555pysolr.py

+
+        return document
+
+    def add_similarity_vectors_to_documents(self, sound_objects, documents):


Actually all this function seem to better suited as functions :D

Explained in previous comment.

ffont · 2024-02-13T09:33:17Z

Thanks for the comments @Bomme!
I'll re-implement some of them in the master branch and commit there. The methods/functions one I'll keep it like it is right know, but we can discuss this further if you want. Thanks!

ffont added 21 commits January 23, 2024 12:30

Add support for search engine document updates

d235fd2

This means that now documents can be partially updated instead of always being completely replaced. This features is not used yet anywhere, but it will be useful when including similarity data to the search engine. #1714

Small fixes in search test command

9e3b678

Rename comments field to num_comments

300201e

Make fields non-required, some cleanups

21349a9

Add tests for update/fields_to_include parameters of add_sounds_to_index

26991be

Add solr-based basic similarity search support

92ea90e

Add more taks to vscode workspace

0d793d1

Replace docker-compose by docker compose in docs

edddb73

Add parameter to chose analyzer for similarity

475a222

Handle case with no valid embeddings

442b3a3

Get appropriate similairty state when search engine similarity is ena…

85fd8d2

…bled We used to store similarity_state field in the sound model, but this is no longer needed when using the search engine based similarity.

Document similar_to new params in search_sound

fc09d38

Add get param to test using a sound as target in search page

5233d6a

Add get_parent similarity search mode

b06c30c

This mode complements get_child mode, and will match parent documents instead of child documents

Add faceting and grouping support in similarity search

12531aa

Add support for similar_to param in search page

c2c228b

Also fixed an issue with grouping by pack in similarity search

Support for limiting max search sounds dynamically

b8a6e98

Use euclidean distance in similarity field type

92461a2

Update default similarity analyzers

be81ad3

Make test pass after new simialr_to parameter

4da141d

Merge branch 'master' into similarity-solr

f9c1e29

ffont mentioned this pull request Feb 2, 2024

Add an advanced search option to "display results in a map" #1754

Merged

5 tasks

Update default analyzer settings

820685e

ffont mentioned this pull request Feb 8, 2024

Reimplement similarity search using Solr #1714

Open

14 tasks

Add tests for search engine-based similarity search

199ce0f

ffont marked this pull request as ready for review February 9, 2024 09:19

ffont merged commit 20d47d9 into master Feb 9, 2024
1 check passed

ffont deleted the similarity-solr branch February 9, 2024 09:24

Bomme reviewed Feb 12, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Solr-based similarity search #1753

Solr-based similarity search #1753

ffont commented Jan 24, 2024 •

edited

Bomme left a comment

Bomme Feb 1, 2024

ffont Feb 13, 2024

Bomme Feb 1, 2024

ffont Feb 13, 2024

Bomme Feb 1, 2024

ffont Feb 13, 2024

Bomme Feb 1, 2024

ffont Feb 13, 2024

ffont commented Feb 13, 2024


		return document

		def add_similarity_vectors_to_documents(self, sound_objects, documents):

Solr-based similarity search #1753

Solr-based similarity search #1753

Conversation

ffont commented Jan 24, 2024 • edited

Bomme left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ffont commented Feb 13, 2024

ffont commented Jan 24, 2024 •

edited