Store file locations #839

Open
trevorgerhardt opened this issue Nov 18, 2022 · 0 comments
Labels
cleanup t1 Time level 1: think days

Comments

@trevorgerhardt
Member

The code base is littered with code that generates file locations from metadata. That code is necessary while uploading or creating files, but once a file exists in our system we should no longer need to "create" its filename and path, only retrieve it.

For example,

  • In aggregation areas we have a method that generates the S3 Path.
  • In opportunity datasets we have methods that generate the storage location.
  • For regional analyses, results and locations are generated on demand.
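The scattered pattern might be sketched like this (hypothetical names, assuming each model independently derives a `FileStorageKey`-style value from its own metadata; these are not the actual methods):

```typescript
// Illustrative only: two models each rebuild a storage key from metadata.
type FileStorageKey = { bucket: string; path: string }

// e.g. aggregation areas derive an S3 path from region and mask ids
function aggregationAreaKey(regionId: string, maskId: string): FileStorageKey {
  return { bucket: "analysis", path: `${regionId}/mask/${maskId}.grid` }
}

// ...while opportunity datasets do nearly the same thing with separate logic
function opportunityDatasetKey(regionId: string, datasetId: string): FileStorageKey {
  return { bucket: "grids", path: `${regionId}/${datasetId}.grid` }
}
```

Each copy of this logic must stay in sync with how files were originally written, which is the fragility this issue is about.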

Each instance of generating the storage location is not problematic in and of itself, but these instances add up across the code base. Improving this would require a substantial migration of both the database and the stored files, but I believe it would be well worth it.

Storing file locations

I see two different options for storing the locations:

  1. We can store the paths directly on the models, in a common format that aligns with our "File Storage" implementation.
  2. We create a file collection in the database with an entry for each file.

We've discussed the second option and have partially implemented it with data sources, but data sources attempt to do too much. I think extracting a shared "file" collection would be very beneficial. We could model it like:

```ts
type FileItem = {
  _id: UUID
  name: string

  // Parameters to generate a `FileStorageKey` from:
  bucket: string
  path: string

  // Auth
  accessGroup: string
  createdBy: string

  // Metadata
  bytes: number // File size, in bytes
  isGzipped: boolean
  type: string // MIME Type
}
```

All other types that have a file would reference it by its `_id`. Opportunity datasets and aggregation areas, for example, would gain a `fileItemId` field.
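As a sketch of how that reference could resolve back to a storage key (the `Map` stands in for the database collection, and `storageKeyFor` is a hypothetical helper, not existing code):

```typescript
// Simplified FileItem from the proposal above (string id instead of UUID).
type FileItem = { _id: string; bucket: string; path: string }

// A model now carries only a reference to its file.
type OpportunityDataset = { _id: string; name: string; fileItemId: string }

// In-memory stand-in for the proposed "file" collection.
const files = new Map<string, FileItem>()

// Retrieve, never re-derive, the storage location for a dataset.
function storageKeyFor(dataset: OpportunityDataset): { bucket: string; path: string } {
  const file = files.get(dataset.fileItemId)
  if (!file) throw new Error(`Missing FileItem ${dataset.fileItemId}`)
  return { bucket: file.bucket, path: file.path }
}
```

The key shift: path-building logic runs once at upload time, and every later read is a lookup.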

We would also be able to look up all the files for a specific access group and calculate the total size of that group's uploaded data.
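With every file in one collection, that calculation becomes a filter-and-sum over the `bytes` and `accessGroup` fields proposed above (the array here stands in for the collection; in a real database this would be a single aggregation query):

```typescript
// Minimal FileItem slice needed for the storage-size calculation.
type FileItem = { accessGroup: string; bytes: number }

// Total bytes uploaded by one access group.
function storageUsed(files: FileItem[], accessGroup: string): number {
  return files
    .filter((f) => f.accessGroup === accessGroup)
    .reduce((total, f) => total + f.bytes, 0)
}
```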

There are certain files this would not apply for, like Taui sites, which pre-generate thousands of files.

@trevorgerhardt trevorgerhardt added cleanup t1 Time level 1: think days labels Nov 18, 2022