This repository has been archived by the owner on May 16, 2022. It is now read-only.

run caching and performance? #7

Open
Pomax opened this issue Jan 18, 2022 · 2 comments


@Pomax

Pomax commented Jan 18, 2022

I got here via https://exchange.adobe.com/creativecloud.details.20053.deduplicator.html, which has a review that notes:

Once you get it working, it works extremely slowly, literally several days of processing, and if it's interrupted, it starts all over again.

Can the readme be updated with a performance benchmark so folks have an idea of what to expect? And does this extension perform run-caching, so it doesn't start from zero but instead resumes with the catalog entries it hasn't checked yet? (If not, is that something that could be added?)

I'm currently using Duplicate Photo Finder outside of LR, with a script that patches the LR catalog sqlite file after duplicates get removed, but it's a pretty crappy workflow.

@teran
Owner

teran commented Jan 21, 2022

Hello @Pomax,

First of all thank you for your feedback.

About caching or any other way to preserve state: the main thing I wanted to avoid when writing imgsum and Deduplicator was producing side effects in the user's environment; they should work in isolation. I've seen similar Lightroom plug-ins, and their main flaw was exactly that: they left their data in the Lightroom catalog (which is actually a SQLite database). Unfortunately, the Lightroom SDK provided no way (at least at the time Deduplicator was written) for a plug-in to track database schema changes, so a plug-in could not fully clean up its data on uninstallation, and that garbage would stay in the Lightroom catalog file forever.

Another approach is to store the state of the analysis separately, in some kind of cache file, which is also not perfect: you might want a dedicated scratch directory, you might not have enough disk space, and so on.

So it's a painful part, and since I couldn't come up with a rock-solid solution, the analysis was implemented as a stateless task; that's why it starts from scratch each time. There's no way to make it resumable without trade-offs.

The only other way I can imagine doing it is with a web service that stores image identifiers bound to a user ID, but that's another story.

A pretty simple workaround: run the analysis in parts, i.e. by year, month, place, etc., manually. This obviously works since it needs nothing from Deduplicator and the user keeps full control.

Now about performance:
It trades off against precision (first) and parallelisation (second).
I'm not sure sacrificing precision is a good idea; that would need some research, or perhaps faster algorithms.
Parallelisation, though, could be done on the imgsum side: it should work fine and speed things up through better CPU utilisation. I've created an issue in the imgsum repository about that, so it's up for grabs since I'm not sure when I'll get to it myself.
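Roughly, the fan-out could look like the sketch below. Treat it as an illustration rather than imgsum's actual code: one worker per CPU core pulls paths off a channel, and files with identical fingerprints are grouped as duplicate candidates. The hashFile helper is hypothetical and uses SHA-256 purely as a stand-in for whatever fingerprint imgsum really computes.

    package main

    // Illustration only: fan fingerprinting out across CPU cores with a
    // worker pool, then group files whose fingerprints match as duplicate
    // candidates. hashFile is a SHA-256 stand-in, not the real fingerprint.

    import (
        "crypto/sha256"
        "encoding/hex"
        "fmt"
        "io"
        "os"
        "runtime"
        "sync"
    )

    type result struct {
        path string
        sum  string
    }

    func hashFile(path string) (string, error) {
        f, err := os.Open(path)
        if err != nil {
            return "", err
        }
        defer f.Close()
        h := sha256.New()
        if _, err := io.Copy(h, f); err != nil {
            return "", err
        }
        return hex.EncodeToString(h.Sum(nil)), nil
    }

    func main() {
        paths := os.Args[1:] // image files passed on the command line

        jobs := make(chan string)
        results := make(chan result)

        var wg sync.WaitGroup
        for i := 0; i < runtime.NumCPU(); i++ { // one worker per core
            wg.Add(1)
            go func() {
                defer wg.Done()
                for p := range jobs {
                    sum, err := hashFile(p)
                    if err != nil {
                        fmt.Fprintln(os.Stderr, "skipping", p, err)
                        continue
                    }
                    results <- result{path: p, sum: sum}
                }
            }()
        }

        go func() {
            for _, p := range paths {
                jobs <- p
            }
            close(jobs)
            wg.Wait()
            close(results)
        }()

        groups := map[string][]string{}
        for r := range results {
            groups[r.sum] = append(groups[r.sum], r.path)
        }
        for sum, files := range groups {
            if len(files) > 1 {
                fmt.Println(sum, files) // candidate duplicates
            }
        }
    }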

And it's a good idea to run a benchmark and publish some numbers; I could probably do that in both the imgsum and Deduplicator READMEs for better transparency in a couple of days.

So that's it. Thanks again for your feedback. Please feel free to post any updates or other issues :)

PS. Not closing the issue since there's no solution at the moment.

@Pomax
Author

Pomax commented Jan 22, 2022

Another approach is to store the state of the analysis separately, in some kind of cache file, which is also not perfect: you might want a dedicated scratch directory, you might not have enough disk space, and so on.

This would be a great solution though, as long as it comes with a checkbox that's disabled by default, so that the plugin keeps doing exactly what it does today after an update, while giving people the option to accept the downsides if they consider that a price worth paying for the upside. E.g.:

[x] Allow deduplication checking in batches, using a separate state cache file until a
    run has been completed. NOTE: this will write a file to S:\ome\location\dot.file that
    may grow quite large. Make sure you have at least 1GB of free space on your drive
    if you wish to enable this feature.

(removing the file after a completed run, so there's never a need to resync based on old data)
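Something like this minimal sketch would cover it (Go is used only for illustration here; the actual plugin would be Lua under the Lightroom SDK, and the cache file name is made up): processed photo IDs are appended to a small progress file, a restarted run skips everything already listed, and the file is deleted when a run completes.

    package main

    // Sketch of the "separate state cache" idea: append each processed photo
    // ID to a small progress file, skip anything already listed when a run
    // restarts, and delete the file once a run completes. The file name and
    // the way photo IDs arrive are hypothetical.

    import (
        "bufio"
        "fmt"
        "os"
    )

    const cacheFile = "deduplicator-progress.txt" // hypothetical scratch file

    // loadDone returns the IDs finished by a previous, interrupted run.
    func loadDone() map[string]bool {
        done := map[string]bool{}
        f, err := os.Open(cacheFile)
        if err != nil {
            return done // no cache yet: start from scratch
        }
        defer f.Close()
        sc := bufio.NewScanner(f)
        for sc.Scan() {
            done[sc.Text()] = true
        }
        return done
    }

    func main() {
        photoIDs := os.Args[1:] // stand-in for the catalog's photo list

        done := loadDone()
        f, err := os.OpenFile(cacheFile, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
        if err != nil {
            panic(err)
        }

        for _, id := range photoIDs {
            if done[id] {
                continue // already analysed before the interruption
            }
            fmt.Println("analysing", id) // placeholder for the real fingerprint step
            fmt.Fprintln(f, id)          // checkpoint after each photo
        }
        f.Close()

        // A fully completed run removes the cache, as suggested above, so a
        // later run never resumes against stale data.
        os.Remove(cacheFile)
    }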

The idea that people who use Lightroom may not have enough disk space is probably an overly-safe one: I have hundreds of gigs of data (even a casual shooter using RAW will have tens of gigs), so Lightroom catalogs tend to live on large drives with plenty of space. If some app or plugin needs to write a temporary file that grows to several hundred MB, I will literally never notice. So I personally consider that perfectly fine, since every application and plugin already has its own settings files and folders.

I do agree that writing it into the lrcat sqlite3 file would be bad. No matter what table name(s) someone comes up with, it's never guaranteed to be a safe name, and it's quite a lot harder to remove the orphaned data if you remove the plugin. Having a real file on disk that the plugin tells you about up front is much better. (Even if that file is also a .sqlite3 database file rather than a flat text file.)
