
Fix user exports to deal with s3 storage #3228

Merged: 51 commits into bookwyrm-social:main on Apr 13, 2024

Conversation

@hughrun (Contributor) commented Jan 18, 2024

(updated 29 Jan (AEDT) )

User exports are failing on instances using s3 storage (i.e. most) because we're treating all image files as local files.

This PR:

  • separates export database queries into multiple tasks
  • uses S3Tar library to combine s3 stored images and JSON data into a tar.gz file when using s3 storage
  • uses custom Storage to secure export files from public access
  • uses a signed s3 url with 5 minute timeout when using s3 storage

The main guts of the changes here are in a new class called AddFileToTar in the export job.

This has been manually tested with local and s3 storage and appears to work with both.
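The "signed s3 url with 5 minute timeout" above means export downloads expire rather than staying publicly reachable. As a conceptual illustration only (BookWyrm delegates the real signing to boto3/django-storages; the key and helper names here are hypothetical), an expiring signed URL boils down to an HMAC over the path plus an expiry timestamp:

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"hypothetical-signing-key"  # stands in for the real signing credentials


def sign_url(path, expires_in=300):
    """Return path with an expiry timestamp and an HMAC over path+expiry."""
    expires = int(time.time()) + expires_in
    sig = hmac.new(SECRET, f"{path}:{expires}".encode(), hashlib.sha256).hexdigest()
    return f"{path}?{urlencode({'expires': expires, 'sig': sig})}"


def verify_url(path, expires, sig):
    """Reject the request if the link has expired or the signature is wrong."""
    if int(expires) < time.time():
        return False
    expected = hmac.new(SECRET, f"{path}:{expires}".encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)
```

With S3 the storage backend performs the equivalent of `verify_url` server-side, so the application never has to proxy the file.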

OUTSTANDING ISSUE

  • Azure blob storage is not considered here. I'm not really sure how it works, but I assume it won't work, for the same reason s3 wasn't working.

- custom storages
- tar.gz within bucket using s3_tar
- slightly changes export directory structure
- major problems still outstanding re delivering s3 files to end users
- remove test export files
- check in emblackened files
@hughrun (Contributor, Author) commented Jan 27, 2024

Ok I think I've worked this out. Hopefully will have a fix and a cleaner PR tomorrow.

- use signed url for s3 downloads
- re-arrange tar.gz file to match original
- delete all working files after tarring
- import from s3 export

TODO

- check local export and import
- fix error when avatar missing
- deal with multiple s3 storage options (e.g. Azure)
pulls Mouse's fix for imagefile serialization
@hughrun marked this pull request as ready for review on January 28, 2024 at 09:41
(Resolved review thread on .env.example, now outdated.)
@dato (Contributor) commented Jan 29, 2024

I certainly agree it would be cleaner but I'm not sure how to do it.

os.putenv()/os.environ, as @skobkin suggests, could be made to work, yes; though lines like this one in the s3-tar source make me think it's a fragile approach for s3-tar: that style of coding could easily lead to the variable needing to be set at import time.

Another approach would be to subclass boto3.Session, as done here (i.e. hughrun#3; untested, sorry).

Thoughts?
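For reference, the environment-variable approach under discussion would look something like the following sketch (the endpoint value is hypothetical). The fragility dato points to is ordering: if the library ever read the variable at import time rather than at call time, the assignment would have to precede the import.

```python
import os

# s3-tar 0.1.13 reads S3_ENDPOINT_URL at call time (in its utils module),
# but a defensive caller sets it before importing the library at all, in
# case a future version reads it at import time.
os.environ["S3_ENDPOINT_URL"] = "https://s3.example.com"  # hypothetical endpoint

# ...and only after this point:  from s3_tar import S3Tar
```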

dato and others added 2 commits January 28, 2024 22:21
As of 0.1.13, the s3-tar library uses an environment variable
(`S3_ENDPOINT_URL`) to determine the AWS endpoint. See:
https://github.com/xtream1101/s3-tar/blob/0.1.13/s3_tar/utils.py#L25-L29.

To save BookWyrm admins from having to set it (e.g., through `.env`)
when they are already setting `AWS_S3_ENDPOINT_URL`, we create a Session
class that unconditionally uses that URL, and feed it to S3Tar.
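The commit's approach can be sketched as below. `BaseSession` is a stand-in for `boto3.session.Session` so the snippet runs without boto3 installed, and the endpoint value is hypothetical (BookWyrm reads `AWS_S3_ENDPOINT_URL` from its settings):

```python
AWS_S3_ENDPOINT_URL = "https://s3.example.com"  # hypothetical; from settings in BookWyrm


class BaseSession:
    """Stand-in for boto3.session.Session, which builds service clients."""

    def client(self, service_name, **kwargs):
        return (service_name, kwargs)


class EndpointSession(BaseSession):
    """Every client created from this session gets our endpoint URL,
    so s3-tar never needs the S3_ENDPOINT_URL environment variable."""

    def client(self, service_name, **kwargs):
        # Only fill in the endpoint if the caller didn't supply one.
        kwargs.setdefault("endpoint_url", AWS_S3_ENDPOINT_URL)
        return super().client(service_name, **kwargs)
```

An instance of the real subclass is then handed to S3Tar, as the commit message describes.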
@hughrun (Contributor, Author) commented Jan 29, 2024

> Another approach would be to subclass boto3.Session, as done here (i.e. hughrun#3; untested, sorry).
>
> Thoughts?

I can get overriding os.environ to work, but it feels very hacky. I like this better. I'll test it out and see what happens.

hughrun and others added 3 commits January 29, 2024 13:45
also undoes a line space change in settings.py to make the PR cleaner
Subclass boto3.Session to use AWS_S3_ENDPOINT_URL
@hughrun (Contributor, Author) commented Jan 29, 2024

Ok, almost done. I just need to disable Azure storage, which I forgot about.

@hughrun (Contributor, Author) commented Feb 3, 2024

@bookwyrm-social/code-review this is ready to check.

Conflicts:
	bookwyrm/models/bookwyrm_export_job.py
	requirements.txt
(Resolved review threads on bookwyrm/views/preferences/export.py and bookwyrm/models/bookwyrm_export_job.py, now outdated.)
@Minnozz (Contributor) commented Mar 25, 2024

I've refactored the creation of the tar file a bit: instead of re-using the same File (the one behind the file field) as a temporary file every time one is needed, we now create a temporary file directly.
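A minimal sketch of that idea (names hypothetical; the real code writes the export JSON through Django's file machinery): each use gets its own fresh temporary file instead of borrowing the file behind the model's FileField as scratch space.

```python
import json
import tempfile


def write_temp_json(data):
    """Write data to its own NamedTemporaryFile and return the path,
    rather than round-tripping through the model's FileField."""
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".json", delete=False
    ) as tmp:
        json.dump(data, tmp)
        return tmp.name
```

Keeping the scratch file separate from the field's file avoids clobbering the stored export while intermediate steps are still running.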

@Minnozz (Contributor) commented Mar 26, 2024

BookwyrmExportJob is now just two tasks, which run sequentially. The tests still need to be updated to reflect this.

Creating the export JSON and export TAR are now the only two tasks.
@Minnozz (Contributor) commented Mar 27, 2024

I noticed that we are re-using the IMPORTS queue for exports too (even before this PR). Should those get a separate queue?

@Minnozz (Contributor) commented Mar 27, 2024

I just realised we may need to fix the ACL of the temporary export JSON that is uploaded with the "manual" connection.

Minnozz and others added 8 commits March 28, 2024 13:09
Saving avatars to /images is problematic because it changes the original filepath from avatars/filename to images/avatars/filename.
In this PR prior to this commit, imports failed because they look for a file path beginning with "avatar".
@Minnozz (Contributor) commented Apr 13, 2024

@hughrun Thanks for testing and fixing my additions! Do you consider this PR ready to merge?

@hughrun (Contributor, Author) commented Apr 13, 2024

I do. I've tested the latest changes on s3 and local.

@Minnozz Minnozz merged commit 21a39f8 into bookwyrm-social:main Apr 13, 2024
10 checks passed

5 participants