File upload: Add task for file format detection on file commit #553

max-moser · 2024-01-11T14:50:37Z

Overview

Right now, InvenioRDM offers information about the MIME types of files (that are associated with records), but the logic for coming up with this information is quite simple.
This PR intends to improve the file type detection capabilities of InvenioRDM by utilizing the signature-based file format identification tool siegfried (which is permissively licensed under Apache-2.0).

Some more context

Every record file in InvenioRDM has some high-level information stored in its ObjectVersion and some "physical" file-related information in the associated FileInstance (both defined in Invenio-Files-REST).
The former has a field for the MIME type (object_version._mimetype), but there doesn't seem to be a code path which populates this field to any non-null value.
Instead, the object_version.mimetype property usually falls back to guessing the MIME type through the standard library function mimetypes.guess_type().
This function bases its guess purely on the file extension, which is fine as a fallback value but isn't ideal as the primary source.
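
To illustrate the limitation described above, here is a minimal sketch using only the standard library. The helper name guess_from_name and the file names are made up for the example; only mimetypes.guess_type() comes from the text.

```python
import mimetypes


def guess_from_name(filename):
    """Return the MIME type guessed from the file name alone (or None)."""
    # guess_type() never opens the file; it only looks at the extension.
    mimetype, _encoding = mimetypes.guess_type(filename)
    return mimetype


# The same hypothetical document under two different names:
print(guess_from_name("thesis.pdf"))  # application/pdf
print(guess_from_name("thesis.jpg"))  # image/jpeg
```

Because the file content is never inspected, renaming a file is enough to change the reported MIME type.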

Outline of this PR

This PR hooks into the file upload process via the files service.
Whenever a file is "committed", a background task is scheduled which calls the external sf binary on the uploaded file and interprets its output.
The work is done in a background task because the file format identification step is potentially long-running (e.g. for large files).

If a MIME type is reported by siegfried, it will be used to populate the object_version._mimetype field.
Additionally, the PRONOM identifier is stored as an ObjectVersionTag.
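
The "call sf and interpret its output" step could look roughly like the sketch below. It is not the code from this PR. It assumes that sf is invoked with its -json flag and that the report contains a "files" list whose entries carry "matches" with "ns", "id" and "mime" keys, with "UNKNOWN" as the id for unidentified files; the function names, the timeout value and the error handling are illustrative choices.

```python
import json
import subprocess


def parse_sf_json(output):
    """Extract (mimetype, pronom_id) from a siegfried JSON report.

    Assumes the report shape described in the lead-in; returns None for
    values that are missing, empty or reported as unknown.
    """
    report = json.loads(output)
    for file_info in report.get("files", []):
        for match in file_info.get("matches", []):
            if match.get("ns") != "pronom":
                continue
            pronom_id = match.get("id")
            if not pronom_id or pronom_id == "UNKNOWN":
                pronom_id = None
            mimetype = match.get("mime") or None
            return mimetype, pronom_id
    return None, None


def detect_format(path, sf_bin="sf", timeout=300):
    """Run sf on a local file and return (mimetype, pronom_id)."""
    try:
        result = subprocess.run(
            [sf_bin, "-json", path],
            capture_output=True,
            text=True,
            check=True,
            timeout=timeout,
        )
        return parse_sf_json(result.stdout)
    except (OSError, subprocess.SubprocessError, ValueError):
        # Missing binary, non-zero exit code, timeout or unparsable output:
        # report nothing so the caller keeps the extension-based fallback.
        return None, None
```

Returning (None, None) on any failure keeps the existing fallback behaviour intact when siegfried is unavailable.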

Alternatives considered

Feeding the file stream into siegfried during upload

Instead of identifying the file format by calling sf on the file commit, it could potentially be done during the file upload by feeding a duplicate of the upload stream into siegfried on the go.
This could eliminate the waiting time until the MIME type is set for larger files, but would require a deeper integration of the external tool into core functionalities (probably in Invenio-Files-REST).
Given that this functionality is more of a nice-to-have, I'm not sure if that trade-off would be worth it.
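
For reference, "feeding a duplicate of the upload stream" boils down to a tee over the incoming chunks. The following is a generic sketch, not tied to Invenio-Files-REST; the function name tee_chunks and the chunk size are arbitrary. In a real integration, one sink would be the storage backend and the other the input of the identification process.

```python
import io


def tee_chunks(source, sinks, chunk_size=64 * 1024):
    """Copy a binary stream into every sink, chunk by chunk.

    Returns the number of bytes read from the source.
    """
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        for sink in sinks:
            sink.write(chunk)
        total += len(chunk)
    return total


# In-memory stand-ins for the upload stream, the storage and the detector.
upload = io.BytesIO(b"%PDF-1.7\n" + b"\0" * 1000)
storage, detector_input = io.BytesIO(), io.BytesIO()
tee_chunks(upload, [storage, detector_input])
```

The sketch also hints at the trade-off: every slow sink slows down the upload itself, which is one reason why this would have to live deep in the upload path.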

Let external applications handle this information

There are external tools such as Archivematica and FITS which (among other things) specialize in detecting file formats.
Overviews of the file format landscape in InvenioRDM could be built externally with such tools.
However, InvenioRDM has some built-in capabilities for this already, and they are actively being used (even if only for display purposes in the REST API).
Thus, I think enhancing these existing capabilities rather than building external pipelines is worthwhile.

To do

check how non-local files could be handled
some manual testing
write test cases
make the integration configurable (e.g. disabling the step, setting a custom path for the sf binary, etc.)
polish the code
address the comments and remarks

Test results

In a fresh v12 my-site, I created a new draft and uploaded a PDF file (renamed to have a JPG extension).
Siegfried correctly identifies the file format as PDF.

In our own v11 deployment (without this PR), the same file still has a reported MIME type of "image/jpeg", which is incorrect.
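
Why the renamed file is still recognised can be shown with a toy signature check. This is a deliberately simplified illustration and not how siegfried works internally (it matches against the much richer PRONOM signatures); the helper name and the sample bytes are made up.

```python
PDF_SIGNATURE = b"%PDF-"


def starts_with_pdf_signature(data):
    """Return True if the content begins with the PDF header marker."""
    return data.startswith(PDF_SIGNATURE)


# First bytes of a PDF file: the result does not depend on the file name.
renamed_pdf_head = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"
print(starts_with_pdf_signature(renamed_pdf_head))
```

The check looks at the content only, so the JPG extension of the renamed file plays no role.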

max-moser · 2024-01-11T14:56:47Z

invenio_records_resources/services/files/tasks.py

+
+    mimetype, pronom_id = None, None
+    try:
+        sf_bin = "sf"


if sf is installed with go install, the binary ends up in $GOPATH/bin (or $GOBIN), so that directory needs to be on the $PATH for this logic to work
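
One possible way to make the lookup more forgiving is sketched below. It is not part of this PR; the function name find_sf_binary and the env parameter are hypothetical. It relies on go install placing binaries in $GOBIN if set, otherwise in the bin directory of the first $GOPATH entry, with $GOPATH defaulting to ~/go.

```python
import os
import shutil


def find_sf_binary(name="sf", env=None):
    """Return a path to the sf executable, or None if it cannot be found."""
    env = os.environ if env is None else env

    # 1. Regular lookup on the PATH.
    found = shutil.which(name, path=env.get("PATH", ""))
    if found:
        return found

    # 2. Directories where `go install` typically places binaries.
    candidates = []
    if env.get("GOBIN"):
        candidates.append(env["GOBIN"])
    gopath = env.get("GOPATH") or os.path.join(os.path.expanduser("~"), "go")
    first_gopath = gopath.split(os.pathsep)[0]
    candidates.append(os.path.join(first_gopath, "bin"))

    for directory in candidates:
        found = shutil.which(name, path=directory)
        if found:
            return found
    return None
```

A configurable path for the sf binary, as mentioned in the to-do list, could take precedence over this lookup.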


max-moser · 2024-01-11T15:27:46Z

invenio_records_resources/services/files/tasks.py

+    if mimetype is not None:
+        ov.mimetype = mimetype
+    if pronom_id is not None:
+        ObjectVersionTag.create_or_update(ov, "PUID", pronom_id)


is this an appropriate place to store the PRONOM identifier?
or should it be stored somewhere else, e.g. the file "metadata" which is used to report the dimensions of image files and such?

files: add file format detection on file commit

f3badce

* this functionality uses a tool called siegfried https://github.com/richardlehane/siegfried

max-moser force-pushed the mm/file-format-detection branch from ce9ce98 to f3badce Compare January 11, 2024 15:25
