
Sync s3 to file system target for backup #31

Open
MarkRx opened this issue Apr 24, 2018 · 3 comments

MarkRx commented Apr 24, 2018

I'm looking to back up some S3 buckets to a filesystem. I have been able to successfully sync from S3 to the filesystem, but I can't find a way to clean up unreferenced files on the target. What I would like to do is delete a file on the target if it has not been referenced by the source S3 bucket for over 30 days.

There is a --delete-older-than flag, but it only appears to apply to source objects.

Is this possible (without using force-sync)? I was thinking that if there were an easy way to know when each file was last checked for syncing, it could be done. Files could be purged if their last sync time was more than 30 days ago (as long as sync ran more frequently than every 30 days). It could also be done if the target filesystem syncer had an option to always touch a file as a "liveness" indicator (without redownloading it); files with a timestamp older than X days could then be purged.

A plugin injection point such as a "no-op" hook could be added to SyncStorage
and called here: https://github.com/EMCECS/ecs-sync/blob/master/src/main/java/com/emc/ecs/sync/TargetFilter.java#L78
Plugins could then perform custom logic when a sync is not performed. In this case the filesystem plugin could have an option to force-touch a file.
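
To illustrate the purge half of that idea: if the target filesystem plugin did touch every verified file on each run (no such option exists today, so this is only a sketch under that assumption), cleanup could be a simple mtime-based find. The directory path and cutoff below are placeholders.

#!/bin/bash
# Sketch only: assumes every sync run refreshes the mtime of files that still
# exist on the source, so a stale mtime means the object is gone from S3.
BACKUP_DIR="/backups/my-bucket"   # placeholder: the sync target directory
EXPIRE_CUTOFF=30                  # days since a sync last "saw" the file

# Purge anything a sync has not touched within the cutoff window.
find "$BACKUP_DIR" -type f -mtime "+$EXPIRE_CUTOFF" -print -delete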


MarkRx commented Apr 25, 2018

I believe I may have a solution. I can use the id-logging filter to log which artifacts are processed, and then compare that list after each run against what is on the filesystem. Files that are on the filesystem but not in that list have been deleted from the source.
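
A minimal sketch of that comparison (the log path and its comma-separated format are assumptions here; the full script in the next comment fleshes this out):

# Source IDs the id-logging filter recorded during the last run.
cut -d, -f1 /backups/my-bucket/syncids.log | sort > /tmp/source.txt
# Files currently on the backup target, relative to its root.
find /backups/my-bucket -type f ! -name '*.log' -printf '%P\n' | sort > /tmp/target.txt
# Lines only in the target listing are objects that no longer exist in S3.
comm -13 /tmp/source.txt /tmp/target.txt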


MarkRx commented Apr 26, 2018

This should work as a cron job, so long as none of the files end in .deleted or .log:

#!/bin/bash
# Compares files in S3 against what has been backed up to disk. Files deleted on S3 will
# remain on disk for 30 days before being deleted.

if [ -z "$2" ]; then
  echo "Usage: s3-backup.sh <xml_config> <backup_directory>"
  exit 1
fi

BACKUP_DIR="$2"

# The number of days before we delete the deleted s3 object from disk
EXPIRE_CUTOFF=30

SOURCE_LIST=/tmp/s3b_source.txt
TARGET_LIST=/tmp/s3b_target.txt
DELETED_LIST=/tmp/s3b_deleted.txt
EXPIRED_LIST=/tmp/s3b_expired.txt

# Run s3 backup utility
java -Djavax.net.ssl.trustStore=/etc/pki/java/cacerts -jar /opt/dell/ecs/bin/ecs-sync.jar --no-rest-server --log-level verbose --xml-config "$1"

# Objects in S3 (source)
echo [$(date +"%m/%d/%Y %H:%M:%S")] Creating source list $SOURCE_LIST...
cat "$BACKUP_DIR/syncids.log" | cut -d, -f1 | sort > "$SOURCE_LIST"

# Objects on disk (target)
echo [$(date +"%m/%d/%Y %H:%M:%S")] Creating target list $TARGET_LIST...
find "$BACKUP_DIR" -type f \( ! -iname "*.log" ! -iname "*.deleted" \) -print | sed -r -e 's|^\'"$BACKUP_DIR/"'||' | sort > "$TARGET_LIST"

# Compare to determine files on disk that are not in the s3
echo [$(date +"%m/%d/%Y %H:%M:%S")] Creating deleted list $DELETED_LIST...
comm --check-order -13 "$SOURCE_LIST" "$TARGET_LIST" > "$DELETED_LIST"

# Create marker files to indicate a file was deleted
ndel=0
while IFS= read -r obj; do
  file="$BACKUP_DIR/$obj"
  dfile="$file.deleted"
  
  if [ ! -f "$dfile" ]; then
    echo [$(date +"%m/%d/%Y %H:%M:%S")] Marking $file as deleted
    touch "$dfile"
  fi
  
  ndel=$((ndel+1))
done < "$DELETED_LIST"

echo [$(date +"%m/%d/%Y %H:%M:%S")] There are $ndel objects on disk that are not on the source S3 system.

# Find marker files that are older than 30 days
echo [$(date +"%m/%d/%Y %H:%M:%S")] Searching for deleted files older than 30 days...
find "$BACKUP_DIR" -type f \( -iname "*.deleted" \) -mtime "+$EXPIRE_CUTOFF" -print | sed -r -e 's_^\./__' -e 's_\.deleted$__' -e 's|^\'"$BACKUP_DIR/"'||' | sort > "$EXPIRED_LIST"

# Delete files older than 30 days that are not referenced anymore
nexp=0
while IFS= read -r obj; do
  file="$BACKUP_DIR/$obj"
  dfile="$file.deleted"
  pdir="$(dirname "$file")"
  
  echo [$(date +"%m/%d/%Y %H:%M:%S")] Deleting $obj
  rm "$file"
  rm "$dfile"
  nexp=$((nexp+1))
  
  # Remove now-empty parent directories, but never the backup root itself
  ptarget=$pdir
  while [ "$ptarget" != "$BACKUP_DIR" ] && [ -z "$(ls -A "$ptarget")" ]; do
    echo [$(date +"%m/%d/%Y %H:%M:%S")] Deleting $ptarget
    rmdir "$ptarget"
    ptarget="$(dirname "$ptarget")"
  done
done < "$EXPIRED_LIST"

echo [$(date +"%m/%d/%Y %H:%M:%S")] Deleted $nexp objects that have not existed on the source s3 system for $EXPIRE_CUTOFF days.

@twincitiesguy
Contributor

Unfortunately, it's impossible for ecs-sync to know what was deleted on the source. The only way we can sync deletes is, as you have outlined above, by comparing listings of both storage locations. Given the nature and design of ecs-sync, we have not added this as an option (yet).

We are still tinkering with ideas about efficient ways to accomplish this, but there are many caveats and considerations when deleting data that have curtailed development. The best option I've heard so far is to use the sync database to identify files that definitely were on the source system, were synced at one point, and are now gone. This at least attempts to ensure we don't delete third-party data on the target storage.
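
A rough sketch of what that database-driven check could look like, assuming the job's tracking database were exposed as MySQL and assuming hypothetical table and column names (sync_objects, current_source_listing, source_id, and status are not ecs-sync's actual schema):

#!/bin/bash
# Sketch only: find objects that ecs-sync once copied successfully but that a
# fresh listing of the source no longer contains. Schema names are assumptions.
mysql ecs_sync --batch --skip-column-names <<'SQL' > /tmp/deleted-on-source.txt
SELECT t.source_id
FROM   sync_objects t                    -- objects the job has tracked
LEFT JOIN current_source_listing s       -- a fresh listing of the source bucket
       ON s.source_id = t.source_id
WHERE  t.status = 'Complete'             -- synced successfully at some point
  AND  s.source_id IS NULL;              -- ...but gone from the source now
SQL
# Only IDs in /tmp/deleted-on-source.txt would be candidates for deletion on the
# target, which avoids touching files ecs-sync never wrote there.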

This feature is in the backlog, but not on the roadmap as of now. As always, we welcome suggestions for improvement.
