
File vs. Data Distribution

Data follows an 80/20 rule: 20% of files hold 80% of the data. Many storage services charge by total data stored, but many lower-cost offerings such as HPSS, AWS Glacier, etc. are very sensitive to file (object) count. Thus a project with 1M small source files and 100 huge data files (video, 4K images, NetCDF, etc.) is the worst case: a large data volume and a large number of files.
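As a rough illustration of that skew, here is a minimal sketch (hypothetical, not part of archivetar) that walks a directory tree and reports what share of the total bytes sit in the largest 20% of files:

```python
#!/usr/bin/env python3
"""Sketch: measure how much of a tree's data lives in its largest files."""
import os
import sys


def file_sizes(root):
    """Yield the size in bytes of every regular file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                yield os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass  # file vanished or is unreadable; skip it


def main(root):
    sizes = sorted(file_sizes(root), reverse=True)
    if not sizes:
        print("no files found")
        return
    total = sum(sizes)
    top20 = sizes[: max(1, len(sizes) // 5)]  # largest 20% of files
    share = 100.0 * sum(top20) / total
    print(f"{len(sizes)} files, {total} bytes total")
    print(f"largest 20% of files hold {share:.1f}% of the data")


if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else ".")
```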

The goal of archivetar is to reduce the file count drastically without wasting time re-reading large data files, which these systems already handle well in their current format.

(Figure: filesize-80-20, distribution of data volume vs. file count)

Proposed Workflow

  1. Sort files into small and large
  2. Optionally start uploading large files immediately
  3. Sort the small files into multiple tar lists of a target size
  4. Tar the small-file lists in parallel
  5. Upload the tars as they complete (see the sketch after this list)
  6. Optionally delete the small files once archived
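A minimal sketch of these steps follows. It is not archivetar's actual code: the 10 GB small/large cutoff, the 100 GB tar target, and the `archive-N.tar` output names are arbitrary placeholders, not archivetar's defaults, and printing the tar path stands in for the upload step.

```python
#!/usr/bin/env python3
"""Sketch of the proposed workflow, under assumed (not archivetar) defaults."""
import os
import tarfile
from concurrent.futures import ProcessPoolExecutor

SMALL_CUTOFF = 10 * 2**30   # step 1: files below this are "small" (assumed value)
TAR_TARGET = 100 * 2**30    # step 3: target size of each tar (assumed value)


def partition(root):
    """Step 1: sort files into small and large by SMALL_CUTOFF."""
    small, large = [], []
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            (small if size < SMALL_CUTOFF else large).append((path, size))
    return small, large


def chunk(small):
    """Step 3: group small files into lists of roughly TAR_TARGET bytes."""
    batch, batch_bytes = [], 0
    for path, size in small:
        batch.append(path)
        batch_bytes += size
        if batch_bytes >= TAR_TARGET:
            yield batch
            batch, batch_bytes = [], 0
    if batch:
        yield batch


def make_tar(args):
    """Step 4: tar one list of small files."""
    index, paths = args
    out = f"archive-{index}.tar"
    with tarfile.open(out, "w") as tar:
        for path in paths:
            tar.add(path)
    return out


if __name__ == "__main__":
    small, large = partition(".")
    # step 2 would upload `large` as-is, in parallel with the tarring
    with ProcessPoolExecutor() as pool:
        # step 5: each tar is handed to the uploader as it completes
        for tar_path in pool.map(make_tar, enumerate(chunk(small))):
            print("ready to upload:", tar_path)
```

Tarring in a process pool lets several target-size archives build at once, which is the point of step 4: the small-file lists are independent, so the work parallelizes cleanly.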

(Figure: workflow, diagram of the steps above)
