
Fix user exports to deal with s3 storage #3228

Merged: 51 commits into bookwyrm-social:main on Apr 13, 2024

Conversation

@hughrun (Contributor) commented Jan 18, 2024

(updated 29 Jan (AEDT) )

User exports are failing on instances using s3 storage (i.e. most) because we're treating all image files as local files.

This PR:

  • separates export database queries into multiple tasks
  • uses S3Tar library to combine s3 stored images and JSON data into a tar.gz file when using s3 storage
  • uses custom Storage to secure export files from public access
  • uses a signed s3 url with 5 minute timeout when using s3 storage

The main guts of the changes here are in a new class called AddFileToTar in the export job.

This has been manually tested with local and s3 storage and appears to work with both.
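The "signed s3 url with 5 minute timeout" above means export downloads expire rather than staying publicly reachable. As a conceptual illustration only (BookWyrm delegates the real signing to boto3/django-storages; the key and helper names here are hypothetical), an expiring signed URL boils down to an HMAC over the path plus an expiry timestamp:

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"hypothetical-signing-key"  # stands in for the real signing credentials


def sign_url(path, expires_in=300):
    """Return path with an expiry timestamp and an HMAC over path+expiry."""
    expires = int(time.time()) + expires_in
    sig = hmac.new(SECRET, f"{path}:{expires}".encode(), hashlib.sha256).hexdigest()
    return f"{path}?{urlencode({'expires': expires, 'sig': sig})}"


def verify_url(path, expires, sig):
    """Reject the request if the link has expired or the signature is wrong."""
    if int(expires) < time.time():
        return False
    expected = hmac.new(SECRET, f"{path}:{expires}".encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)
```

With S3 the storage backend performs the equivalent of `verify_url` server-side, so the application never has to proxy the file.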

OUTSTANDING ISSUE

  • Azure blob storage is not considered here. I'm not really sure how it works, but I assume it won't work, for the same reason s3 wasn't working.

- custom storages
- tar.gz within bucket using s3_tar
- slightly changes export directory structure
- major problems still outstanding re delivering s3 files to end users
- remove test export files
- check in emblackened files
@hughrun (Contributor, Author) commented Jan 27, 2024

Ok I think I've worked this out. Hopefully will have a fix and a cleaner PR tomorrow.

- use signed url for s3 downloads
- re-arrange tar.gz file to match original
- delete all working files after tarring
- import from s3 export

TODO

- check local export and import
- fix error when avatar missing
- deal with multiple s3 storage options (e.g. Azure)
pulls Mouse's fix for imagefile serialization
@hughrun marked this pull request as ready for review on January 28, 2024 at 09:41
(Resolved review thread on .env.example, now outdated.)
@dato (Contributor) commented Jan 29, 2024

I certainly agree it would be cleaner but I'm not sure how to do it.

os.putenv()/os.environ, as @skobkin suggests, could be made to work, yes; though lines like this one in the s3-tar source make me think it's a fragile approach for s3-tar: that style of coding could easily lead to the variable needing to be set at import time.

Another approach would be to subclass boto3.Session, as done here (i.e. hughrun#3; untested, sorry).

Thoughts?
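For reference, the environment-variable approach under discussion would look something like the following sketch (the endpoint value is hypothetical). The fragility dato points to is ordering: if the library ever read the variable at import time rather than at call time, the assignment would have to precede the import.

```python
import os

# s3-tar 0.1.13 reads S3_ENDPOINT_URL at call time (in its utils module),
# but a defensive caller sets it before importing the library at all, in
# case a future version reads it at import time.
os.environ["S3_ENDPOINT_URL"] = "https://s3.example.com"  # hypothetical endpoint

# ...and only after this point:  from s3_tar import S3Tar
```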

dato and others added 2 commits January 28, 2024 22:21
As of 0.1.13, the s3-tar library uses an environment variable
(`S3_ENDPOINT_URL`) to determine the AWS endpoint. See:
https://github.com/xtream1101/s3-tar/blob/0.1.13/s3_tar/utils.py#L25-L29.

To save BookWyrm admins from having to set it (e.g., through `.env`)
when they are already setting `AWS_S3_ENDPOINT_URL`, we create a Session
class that unconditionally uses that URL, and feed it to S3Tar.
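The commit's approach can be sketched as below. `BaseSession` is a stand-in for `boto3.session.Session` so the snippet runs without boto3 installed, and the endpoint value is hypothetical (BookWyrm reads `AWS_S3_ENDPOINT_URL` from its settings):

```python
AWS_S3_ENDPOINT_URL = "https://s3.example.com"  # hypothetical; from settings in BookWyrm


class BaseSession:
    """Stand-in for boto3.session.Session, which builds service clients."""

    def client(self, service_name, **kwargs):
        return (service_name, kwargs)


class EndpointSession(BaseSession):
    """Every client created from this session gets our endpoint URL,
    so s3-tar never needs the S3_ENDPOINT_URL environment variable."""

    def client(self, service_name, **kwargs):
        # Only fill in the endpoint if the caller didn't supply one.
        kwargs.setdefault("endpoint_url", AWS_S3_ENDPOINT_URL)
        return super().client(service_name, **kwargs)
```

An instance of the real subclass is then handed to S3Tar, as the commit message describes.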
@hughrun (Contributor, Author) commented Jan 29, 2024

> Another approach would be to subclass boto3.Session, as done here (i.e. hughrun#3; untested, sorry).
>
> Thoughts?

I can get overriding os.environ to work, but it feels very hacky. I like this better. I'll test it out and see what happens.

hughrun and others added 3 commits January 29, 2024 13:45
also undoes a line space change in settings.py to make the PR cleaner
Subclass boto3.Session to use AWS_S3_ENDPOINT_URL
@hughrun (Contributor, Author) commented Jan 29, 2024

Ok, almost done. I just need to disable Azure storage, which I forgot about.

@hughrun (Contributor, Author) commented Feb 3, 2024

@bookwyrm-social/code-review this is ready to check.

Conflicts:
	bookwyrm/models/bookwyrm_export_job.py
	requirements.txt
(Resolved review threads on bookwyrm/views/preferences/export.py and bookwyrm/models/bookwyrm_export_job.py, now outdated.)
@Minnozz (Contributor) commented Mar 25, 2024

I've refactored the creation of the tar file a bit: instead of re-using the same File (the one behind the file field) as a temporary file every time one is needed, we now create a temporary file directly.
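A minimal sketch of that idea (names hypothetical; the real code writes the export JSON through Django's file machinery): each use gets its own fresh temporary file instead of borrowing the file behind the model's FileField as scratch space.

```python
import json
import tempfile


def write_temp_json(data):
    """Write data to its own NamedTemporaryFile and return the path,
    rather than round-tripping through the model's FileField."""
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".json", delete=False
    ) as tmp:
        json.dump(data, tmp)
        return tmp.name
```

Keeping the scratch file separate from the field's file avoids clobbering the stored export while intermediate steps are still running.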

@Minnozz (Contributor) commented Mar 26, 2024

BookwyrmExportJob is now just two tasks, which run sequentially. The tests still need to be updated to reflect this.

Creating the export JSON and export TAR are now the only two tasks.
@Minnozz (Contributor) commented Mar 27, 2024

I noticed that we are re-using the IMPORTS queue for exports too (even before this PR). Should those get a separate queue?

@Minnozz (Contributor) commented Mar 27, 2024

I just realised we may need to fix the ACL of the temporary export JSON that is uploaded with the "manual" connection.

Minnozz and others added 8 commits March 28, 2024 13:09
Saving avatars to /images is problematic because it changes the original filepath from avatars/filename to images/avatars/filename.
In this PR prior to this commit, imports failed because they look for a file path beginning with "avatar".
@Minnozz (Contributor) commented Apr 13, 2024

@hughrun Thanks for testing and fixing my additions! Do you consider this PR ready to merge?

@hughrun (Contributor, Author) commented Apr 13, 2024

I do. I've tested the latest changes on s3 and local.

@Minnozz Minnozz merged commit 21a39f8 into bookwyrm-social:main Apr 13, 2024
10 checks passed

5 participants