Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue with outdated tag set in document index (#1331) #1332

Merged
merged 8 commits into from
Jul 21, 2023

Conversation

blms
Copy link
Contributor

@blms blms commented Feb 27, 2023

In this PR

Per #1331:

  • Index document, refreshed from DB, after change in m2m tag relationships

Questions

  • Is it ok that this means documents cannot have tags programmatically added or removed during a transaction without discarding the prior changes in that transaction? How can we mitigate?

previous notes

After investigating #1331, it seems that when index_data is called after a document is saved, the updated tag set is not yet present in self.tags.all().

My attempt at a workaround is to check specifically for the related change via a change in TaggedItem. This does trigger a reindex with the correct tag set when you change tags on a document. However, it seems that index_data is called again for the document save after it's called for the TaggedItem save, which means that it overwrites the index data to be the same as before (outdated tag set).

Is there some way to ensure the TaggedItem related save indexing trigger always happens after the document save indexing trigger? Or is the approach wrong?

(Not exactly the same issue, but also wondering if it's related to this: jazzband/django-taggit#818)

@blms blms requested a review from rlskoeser February 27, 2023 17:28
@blms blms changed the base branch from main to develop February 27, 2023 17:28
Copy link
Contributor

@rlskoeser rlskoeser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indexing on taggit relationship post save does seem logical... I'm puzzled about the order you're reporting; could it be that the tags are cached somehow when we save the document?

I just checked, parasolr is listening to the post_save signal on the object itself. I don't know what order that happens in, or why there would be a discrepancy between them.

Possibly this stack overflow is related? Seems to be discussing updating m2m on post save, but the clear behavior may also be relevant for what we're trying todo. https://stackoverflow.com/questions/1925383/issue-with-manytomany-relationships-not-updating-immediately-after-save

@blms
Copy link
Contributor Author

blms commented Mar 7, 2023

Yeah, I'm at a bit of a loss here! It does indeed seem to be referring to an old/cached copy of the relationship on post_save of the Document. Thanks for the link; I'll look into that, and I also see reference there to a signal m2m_changed which I wasn't aware of.

@codecov
Copy link

codecov bot commented Jun 28, 2023

Codecov Report

Merging #1332 (565be05) into develop (05c833c) will increase coverage by 0.11%.
The diff coverage is 100.00%.

@@             Coverage Diff             @@
##           develop    #1332      +/-   ##
===========================================
+ Coverage    98.63%   98.75%   +0.11%     
===========================================
  Files          187      197      +10     
  Lines        10341    11523    +1182     
===========================================
+ Hits         10200    11379    +1179     
- Misses         141      144       +3     

@blms blms changed the title Bugfix/1331 tag indexing Fix issue with outdated tag set in document index (#1331) Jul 6, 2023
@blms
Copy link
Contributor Author

blms commented Jul 6, 2023

@rlskoeser I have a working version of this now, but with one pitfall: because of the refresh_from_db call during the m2m_changed signal handler, programmatically changing tags (using tags.add or tags.remove in code) is potentially dangerous in terms of data loss. Those calls in code will trigger the m2m_changed signal, and thus refresh_from_db, wiping any unsaved changes made to the document before the tags were added/removed.

You can see an example of this in the document merge code changes.

Is this acceptable? I couldn't figure out any other way for the tag set to be up-to-date when the indexing code is called.

In all other contexts, m2m_changed is called after post_save, which means the refresh call won't wipe any data.

@blms blms marked this pull request as ready for review July 6, 2023 16:19
@blms blms requested a review from rlskoeser July 6, 2023 16:19
Copy link
Contributor

@rlskoeser rlskoeser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this, m2m relationships and signals are tricky to think about and handle properly in terms of indexing. (Wondering if there's anything here we could adapt for common parasolr indexing behavior or at least document in the examples...)

one question for you about the refresh_from_db call, otherwise I think this is good

Comment on lines 1222 to 1225
"tags": {
"post_save": DocumentSignalHandlers.related_save,
"pre_delete": DocumentSignalHandlers.related_delete,
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still need these handlers, though, right? If someone renames a tag without changing any m2m relationships, we should reindex all the documents with that renamed tag so they reflect the update

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Added back.

@@ -485,6 +486,15 @@ def unidecode_tag(sender, instance, **kwargs):
"""Convert saved tags to ascii, stripping diacritics."""
instance.name = unidecode(instance.name)

@staticmethod
def tagged_item_change(sender, instance, action, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the instance here is the document with the m2m relation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! Added comment to clarify.

# add any tags from merge document tags to primary doc
# NOTE: This must be done before any other changes to self because it will fire
# m2m_changed signal which calls self.refresh_from_db()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this refresh from db likely to be problematic elsewhere? it seems like it could lead to unexpected behavior that might be hard to troubleshoot

in the signal handler, instead of refreshing, could we get a new instance from the db without modifying the local copy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea, much better than refresh from db, thanks!

Copy link
Contributor Author

@blms blms Jul 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly, this (or adding back the related_save?) broke a unit test where a shelfmark in index data was relied upon, and the way to fix it was switching the order of these two lines in conftest, such that the tags.add line happens first:

doc.tags.add("bill of sale", "real estate")
TextBlock.objects.create(document=doc, fragment=fragment)

Not sure what's going on here because it seems like index_data should have been called after all of these lines anyway with the most recent copy of the document. Indeed, logging the results of index_data actually showed that the shelfmark is present in the index data for all three records even before making this change. But I can't test it more thoroughly because my local Solr (on v9) is still returning 0 results for every query.

@blms blms requested a review from rlskoeser July 20, 2023 17:42
@blms blms merged commit 8fbf70e into develop Jul 21, 2023
15 checks passed
@blms blms deleted the bugfix/1331-tag-indexing branch July 21, 2023 17:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants