Image analysis job #307

JMicheli · 2024-04-15T23:11:29Z

This pull request creates a first draft of an image analysis job (#181), which will need to be updated further before merging.

I chose a slightly broader approach to this job, making an AnalyzeMediaJob which includes an AnalyzeImage task. This is intended to be more extensible, as additional stages to media analysis will probably be wanted later on.

The job isn't fully integrated. Although it is written to have four variants (Individual, Library, Series, and MediaGroup), I've only implemented a client-side way to access the Individual analysis command. On that note, I did this by adding an "Analyze Media" button to the "Manage" page for individual media items, which seemed like the right place to put it. Implementing the job fully would involve adding the same button to other menus and linking them up on the axum side. The axum side also needs authentication boilerplate added.
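For readers following along, here is a minimal sketch of how one job with four variants might expand into AnalyzeImage tasks. Only the variant and task names come from the description above; the string IDs, the `build_tasks` function, and the `lookup` callback (a stand-in for whatever database query resolves a library or series to its media IDs) are assumptions made for illustration, not the code in this PR.

```rust
/// Which media items a single analysis run should cover.
/// Variant names follow the PR description; the String payloads are assumed.
#[derive(Debug, Clone, PartialEq)]
pub enum AnalyzeMediaJobVariant {
    Individual(String),
    Library(String),
    Series(String),
    MediaGroup(Vec<String>),
}

/// One unit of work inside the job. Only image analysis exists so far;
/// further analysis stages would become additional variants here.
#[derive(Debug, Clone, PartialEq)]
pub enum AnalyzeMediaTask {
    AnalyzeImage(String),
}

/// Expands a job variant into one task per media ID.
/// `lookup` is a hypothetical stand-in for the query that resolves a
/// library or series to the IDs of the media it contains.
pub fn build_tasks<F>(variant: &AnalyzeMediaJobVariant, lookup: F) -> Vec<AnalyzeMediaTask>
where
    F: Fn(&AnalyzeMediaJobVariant) -> Vec<String>,
{
    let ids = match variant {
        AnalyzeMediaJobVariant::Individual(id) => vec![id.clone()],
        AnalyzeMediaJobVariant::MediaGroup(ids) => ids.clone(),
        // Library and Series need a lookup to find their media.
        other => lookup(other),
    };
    ids.into_iter().map(AnalyzeMediaTask::AnalyzeImage).collect()
}
```

Keeping the task type separate from the variant type is what makes the broader approach extensible: a new analysis stage only adds a task variant, while the four entry points stay the same.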

Finally, the job currently always updates the number of pages on a media item, as opposed to just validating it. It sounds, from the issue, like this isn't the goal, so I can change that alongside other changes.

In the meantime, I think this is a good place to discuss how this could be improved and made ready for integration to develop.

aaronleopold

You said this PR is a first draft towards #181, so I didn't comment on some things I assume you'll fix in followup commits, e.g. unwraps and TODOs. I had a few low priority comments and a couple of nit picks, but overall I think this is looking great and I appreciate the time you've put into this so far!

I'll mark this PR as a draft just so that when you're ready for me to take another look you can just change the status and I'll know to look.

I chose a slightly broader approach to this job, making an AnalyzeMediaJob which includes an AnalyzeImage task. This is intended to be more extensible, as additional stages to media analysis will probably be wanted later on.

I think this is definitely the right approach. I imagine we'll be adding various types of media analysis in the future, so not having to define and maintain separate jobs for each type of analysis is a good idea.

On that note, I did this by adding an "Analyze Media" button to the "Manage" page for individual media items, which seemed like the right place to put it.

I think so, it satisfies your Individual variant for the job. The other variants would need triggers in different places, as you already mentioned, and I think this is a good starting point.

Finally, the job currently always updates the number of pages on a media item, as opposed to just validating it. It sounds, from the issue, like this isn't the goal, so I can change that alongside other changes.

I think this is the eventual goal. One thing that might be good to gather input on is whether we always want to auto-update metadata diffs when a media item is analyzed. An alternative that might be appealing is some sort of reconciliation process, which would require a user to manually review and accept the changes. What do you think?

As for the other functionalities outlined in that feature, things like image dimensions could just be an automatic update to whatever relation winds up holding that information on a media record.

apps/server/src/routers/api/v1/media.rs

core/src/filesystem/media/analyze_media_job.rs

aaronleopold · 2024-04-16T00:31:54Z

core/src/filesystem/media/analyze_media_job.rs

+pub struct AnalyzeMediaOutput {
+	/// The number of images analyzed
+	images_analyzed: u64,
+}


I think you could likely add more state to this, e.g. media_updated, but this is a good start and I'm fine leaving as-is if that is what you decide
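As a rough illustration of the suggestion, the output struct could carry a second counter. The `media_updated` field name comes from the comment above; the public fields, the derives, and the `merge` helper are assumptions for this sketch and may not match how the job framework actually combines task outputs.

```rust
/// Sketch of the job output with the extra counter suggested in review.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct AnalyzeMediaOutput {
    /// The number of images analyzed
    pub images_analyzed: u64,
    /// The number of media items whose records were changed
    pub media_updated: u64,
}

impl AnalyzeMediaOutput {
    /// Folds the result of one task into the running total.
    /// Hypothetical helper, named for illustration only.
    pub fn merge(&mut self, other: &AnalyzeMediaOutput) {
        self.images_analyzed += other.images_analyzed;
        self.media_updated += other.media_updated;
    }
}
```

Tracking updates separately from analyses lets the job report how many items were inspected versus how many were actually written.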

core/src/filesystem/media/analyze_media_job.rs

core/src/filesystem/media/epub.rs

core/src/filesystem/media/rar.rs

core/src/filesystem/media/zip.rs

JMicheli · 2024-04-22T06:32:25Z

Okay, these last several commits get things into a generally workable state. There are probably a few things to do still. A few that come to mind:

Add more doc comments and logging to job.
Any other media analysis tasks we need?
Should the media's page_count always be updated or do we want to only do that if it doesn't already have one?
Need to enforce permissions properly in server when the start analyze media task endpoint is hit.

aaronleopold · 2024-04-24T01:02:00Z

Any other media analysis tasks we need?

I think what you have is good for now with respect to what analysis tasks are performed, and you've built it already to be able to add more tasks as needed which is 🔥

Should the media's page_count always be updated or do we want to only do that if it doesn't already have one?

I'm not 100% sure yet. I think if it flat out doesn't exist, it's fine to set it to the value you read. The exception might be EPUB files.

Otherwise, I think some sort of reconciliation would be neat, e.g. the analysis generates a list of actions to be taken on the frontend. In the case of a mismatch, it would tell you what the current value is and what the analyzed value is, and you can decide the action.

However, that in itself is a huge feature. I'll err on the side of caution and say just persist JobExecuteLog with a WARN level for now. I think folks are generally anxious about automatic operations on their carefully curated metadata library (even if it doesn't change the underlying metadata files)
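A minimal sketch of the "warn instead of write" fallback described above, assuming the stored and analyzed page counts are plain integers. `LogLevel`, `AnalysisLog`, and `page_count_warning` are hypothetical stand-ins invented for this example; they are not the real JobExecuteLog type or its fields.

```rust
/// Hypothetical log level; the real job log type likely has more levels.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum LogLevel {
    Warn,
}

/// Hypothetical stand-in for a persisted job log entry.
#[derive(Debug, Clone, PartialEq)]
pub struct AnalysisLog {
    pub level: LogLevel,
    pub message: String,
}

/// Returns a WARN entry when the stored count disagrees with the analyzed
/// one, and nothing when they match. No data is modified either way.
pub fn page_count_warning(media_id: &str, stored: i32, analyzed: i32) -> Option<AnalysisLog> {
    if stored == analyzed {
        return None;
    }
    Some(AnalysisLog {
        level: LogLevel::Warn,
        message: format!(
            "media {}: stored page count {} differs from analyzed count {}",
            media_id, stored, analyzed
        ),
    })
}
```

Because the function only reports, a later reconciliation flow could reuse the same comparison and present both values to the user instead of logging them.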

JMicheli · 2024-04-27T22:19:13Z

So I was working on this pull request today and I realized that a media item's page is non-null.

Investigating further, to see if perhaps an invalid value might be assigned under some conditions that should be overwritten (versus properly assigned page counts where no update is necessary, per your prior comment), I found something I hadn't considered: the pages for a file are always counted during initial creation during a scan (see the process function for each FileProcessor implementation).

That initially made the job here seem a bit unnecessary - it would never have a new value. However, thinking further, it might actually make the most sense to always update it. This way, should the page counting methodology be updated in the future to correct an error, people can easily correct their databases by running the analyze job. In the interest of this, I am thinking of unifying the get_page_count and process logic so that both use the same code to count pages.

Let me know your thoughts here. If you agree, then I just need to get authentication working before this is finalized on the server side. If the random button I threw on each manage page is good enough for you, then it'll be ready to merge at that point as well.

aaronleopold · 2024-04-27T23:12:56Z

So I was working on this pull request today and I realized that a media item's id is non-null.

I'm not sure I understand, did you mean to refer to the page being non-nullable?

the pages for a file are always counted during initial creation during a scan (see the process function for each FileProcessor implementation).

Yeah good catch, there is a pages field directly embedded into a media file (IIRC it is -1 for EPUBs). This is mostly because there wasn't proper metadata support until a little later on, and that representation has just stuck around.

I think the big goal for the analysis is more for ensuring the metadata matches the actual count, not necessarily the pages field in the media directly (since like you said, that was generated by Stump already and should just be correct). The metadata which is read from misc files within e.g. an archive is more subject to error. However, if someone were to manually edit the file, e.g. adding a page, obviously that would lead to a mismatch in both.

However, thinking further, it might actually make the most sense to always update it. This way, should the page counting methodology be updated in the future to correct an error, people can easily correct their databases by running the analyze job.

I'm good with this 👍 I do like my idea for some sort of reconciliation flow (I'm starting to overuse this word now lol) but I don't think it is necessarily needed for this small of an operation.

In the interest of this, I am thinking of unifying the get_page_count and process logic so that both use the same code to count pages.

Let me know your thoughts here. If you agree, then I just need to get authentication working before this is finalized on the server side. If the random button I threw on each manage page is good enough for you, then it'll be ready to merge at that point as well.

Yeah I think that makes sense, I agree 👍

I also think the random buttons are good for now, especially since this targets the experimental branch.

JMicheli · 2024-04-29T03:29:05Z

This last commit does two things. First, it fixes an error in the way I had the job for rars counting pages (needed to check if it was an image). Second, it addresses a todo in process that suggested using the same validation as in get_page for counting pages (basically checking if the entry is an image).
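The idea in this commit, counting only archive entries that are images, can be sketched with a shared predicate. This is an illustrative assumption rather than the PR's code: it checks file extensions only, the extension list is made up for the example, and the real implementation may inspect content types instead.

```rust
use std::path::Path;

/// Extensions treated as images. An assumed list for illustration only.
const IMAGE_EXTENSIONS: [&str; 6] = ["jpg", "jpeg", "png", "webp", "gif", "avif"];

/// Returns true when an archive entry name looks like an image file.
/// Directory entries and files such as ComicInfo.xml are rejected.
pub fn is_image_entry(name: &str) -> bool {
    Path::new(name)
        .extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| IMAGE_EXTENSIONS.contains(&ext.to_ascii_lowercase().as_str()))
        .unwrap_or(false)
}

/// Counts pages as the number of image entries, so that page counting and
/// page lookup can share one definition of what a page is.
pub fn count_pages<'a, I>(entry_names: I) -> usize
where
    I: IntoIterator<Item = &'a str>,
{
    entry_names
        .into_iter()
        .filter(|name| is_image_entry(name))
        .count()
}
```

Routing every format's counting through one predicate means a future fix to the image check changes the count everywhere at once, which is what makes re-running the analysis job useful.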

Once you've confirmed that this is what you want, this should be good to merge.

aaronleopold

I had a few nit-picks, but the only one I think I care about is the conditional update during analysis. If you have time to tackle any, great, if not I'm not worried. I think this looks great otherwise so I'm approving it. Thanks again!!

Let me know if you'd prefer me to handle the conflicts with the base branch 👍

apps/server/src/routers/api/v1/media.rs

core/src/filesystem/media/analyze_media_job.rs

aaronleopold · 2024-05-01T23:43:58Z

core/src/filesystem/media/analyze_media_job.rs

+				// Update media item in database
+				let _ = ctx
+					.db
+					.media()
+					.update(media::id::equals(id), vec![media::pages::set(page_count)])
+					.exec()
+					.await?;
+				output.media_updated += 1;


This update should probably only hit if the page counts are actually different

JMicheli · May 3, 2024

I went ahead and followed this rule in my latest commit - I check if there's metadata, and if there is and page_count doesn't match the one we calculated, update it. If there isn't, create new metadata with the new page_count.

I'd appreciate if you could check the logic there, it may not have been the most idiomatic prisma usage.
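The rule described in this reply can be written as a pure decision function, kept separate from any database call. This is a sketch under assumptions: `MetadataAction` and `plan_metadata_write` are hypothetical names, the counts are modeled as `i32`, and the nested `Option` models "no metadata record" versus "a record with no page_count". It is not the Prisma code from the commit.

```rust
/// What the analysis should do with a media item's metadata.
#[derive(Debug, Clone, PartialEq)]
pub enum MetadataAction {
    /// No metadata record exists yet, so create one with the analyzed count.
    Create { page_count: i32 },
    /// A record exists but its page_count is missing or different.
    Update { page_count: i32 },
    /// The stored count already matches, so no write is needed.
    Skip,
}

/// `existing` is `None` when the media has no metadata record, and
/// `Some(None)` when a record exists without a page_count.
pub fn plan_metadata_write(existing: Option<Option<i32>>, analyzed: i32) -> MetadataAction {
    match existing {
        None => MetadataAction::Create { page_count: analyzed },
        Some(Some(stored)) if stored == analyzed => MetadataAction::Skip,
        Some(_) => MetadataAction::Update { page_count: analyzed },
    }
}
```

With the decision isolated like this, the job would only issue a write, and only increment media_updated, when the result is not `Skip`.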

JMicheli and others added 6 commits April 14, 2024 16:02

Progress toward image analysis job

9f457a5

Get functional test of analyze image.

4f8bda5

Finish first draft of job.

a541ec5

Merge branch 'experimental' into image_analysis

9e780fb

Merge remote-tracking branch 'origin/experimental' into pr/307

a59eb92

Fix lint errors

ee4d02a

aaronleopold reviewed Apr 16, 2024

aaronleopold marked this pull request as draft April 16, 2024 01:01

JMicheli added 7 commits April 21, 2024 22:40

Clean up code and add doc comments

0ae89fe

Add media analysis api paths and series management button.

74da386

Fix accidentally removed error result.

6e5e70a

Reorganize client API

621d41e

Address comments

8fb180f

Remove errant tracing log in analyze_media_job

185625c

Update images_analyzed count during task

67a5fdf

Only load ids when building list for analysis job.

e40d4ad

JMicheli added 3 commits April 27, 2024 18:18

Add media_updated state and reactivate epub

e0d584a

Enforce permissions

8af74d5

Fix error in rar page counting and address todo

8430152

Fix utopia descriptions.

a58f205

aaronleopold marked this pull request as ready for review May 1, 2024 23:49

aaronleopold approved these changes May 1, 2024

JMicheli added 2 commits May 2, 2024 22:44

Update metadata properly, address comments

5e78342

Update page count if it is None

65f05fa

Merge remote-tracking branch 'stumpapp/experimental' into image_analysis

3558924

aaronleopold merged commit 8a8bd86 into stumpapp:experimental May 13, 2024
