
Data model

Philipp C. Heckel edited this page Dec 9, 2013 · 2 revisions

This page describes Syncany's data model.

DRAFTING: As a first iteration, we'll draft some use cases and scenarios in which Syncany's data model is used. The purpose is to find out whether the model is still valid and which views on the model are required in which scenarios. This article is closely related to this thread on the mailing list.

Data model

Entities:

  • Database: The database represents the internal file and chunk index of the application. It can be used to reference or load a full local database (local client) or a remote database (from a delta database file of another client).

    A database consists of a sorted list of database versions, i.e. it is a collection of changes to the local file system. For convenience, the class also offers a set of functionality to select objects from the current accumulated database.

  • Database Version: ...

  • File Content: A file content represents the content of a file. It contains a list of references to chunk entries, and identifies a content by its checksum.

    A file content is implicitly referenced by one or many file versions through the checksum attribute. A file content always contains the full list of chunks it consists of. There are no deltas! Unlike the chunk list in a multi chunk entry, the order of the chunks is very important, because a file can only be reconstructed if the order of its chunks is followed.

  • File Version: A file version represents a version of a file at a certain time and captures all of a file's properties.

    A partial file history typically consists of multiple file versions, each of which is an incarnation of the same file, but with either changed properties or changed content. The file version's checksum attribute implicitly links to a file content, which represents the content of the file. Multiple file versions can link to the same file content.

  • Multi Chunk Entry: The multi chunk entry represents the chunk container in which a set of chunk entries is stored. On the file level, a multichunk is represented by a file (container format), and chunks are added to this file. A multichunk is identified by a unique identifier (random, not a checksum), and contains references to chunk entries.

  • Partial file history: A partial file history represents a single file in a repository over a certain period of time/versions. Whenever a file is updated or deleted, a new file version is added to the file history. A file history is identified by a unique random identifier and holds a sorted list of file versions.

    Due to cleanup mechanisms and the delta database concept, the list of file versions is not always complete. The class hence represents a part of the file history.
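
The relationship between file versions, file contents and chunks can be illustrated with a minimal Java sketch. It is not Syncany's actual code: the class name DataModelSketch, the field names, the use of hex strings instead of byte[] checksums, and the reconstruct() helper are all assumptions made for illustration. It only shows that a file version links to a content by checksum, and that a content is rebuilt by following its chunk list in order.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of two entities described above; not Syncany's actual classes.
final class DataModelSketch {

    /** A file content: identified by its checksum, holds the full, ordered chunk list. */
    static final class FileContent {
        final String checksum;              // hex string instead of byte[] for readability
        final List<String> chunkChecksums;  // order matters: it drives reconstruction

        FileContent(String checksum, List<String> chunkChecksums) {
            this.checksum = checksum;
            this.chunkChecksums = Collections.unmodifiableList(new ArrayList<>(chunkChecksums));
        }
    }

    /** A file version: a file's properties at one point in time. */
    static final class FileVersion {
        final long version;
        final String path;
        final String checksum;  // implicit link to the FileContent with the same checksum

        FileVersion(long version, String path, String checksum) {
            this.version = version;
            this.path = path;
            this.checksum = checksum;
        }
    }

    /** Rebuilds a file by concatenating its chunks in the order listed in the content. */
    static byte[] reconstruct(FileContent content, Map<String, byte[]> chunkData) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String chunkChecksum : content.chunkChecksums) {
            byte[] data = chunkData.get(chunkChecksum);
            if (data == null) {
                throw new IllegalStateException("Missing chunk: " + chunkChecksum);
            }
            out.write(data, 0, data.length);
        }
        return out.toByteArray();
    }
}
```

Two FileVersion objects with different paths but the same checksum would share one FileContent, which is how a rename or copy avoids storing the content twice.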

Use cases / scenarios

DRAFTING: This section describes the views required on the database, namely which get*() methods and/or SELECT statements must be implemented in a DAO class.

Up operation

  • Deduplication: chunk lookup by chunk checksum (purpose: only store chunks once)
    db.getChunk(byte[] checksum) -- lookup by chunk checksum (does chunk exist)

  • Versioning: append new file version to existing file history (purpose: guess matching file history)
    db.getFileHistoryByPath(String path) -- matching by path
    db.getFileHistoriesByChecksum(byte[] fileChecksum) -- matching by file checksum
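
As a rough illustration of the three lookups above, here is a minimal in-memory sketch in Java. The class name UpIndexSketch, the add* methods, the return types (a chunk size for getChunk, history identifiers as strings) and the backing maps are assumptions; only the three get* method names and parameters come from the list above. A real DAO would issue SELECT statements instead.

```java
import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory index; class name, add* methods and return types are assumptions.
final class UpIndexSketch {
    // byte[] uses identity-based equals/hashCode, so checksums are wrapped in
    // ByteBuffer, which compares by content.
    private final Map<ByteBuffer, Integer> chunkSizeByChecksum = new HashMap<>();
    private final Map<String, String> historyIdByPath = new HashMap<>();
    private final Map<ByteBuffer, Set<String>> historyIdsByChecksum = new HashMap<>();

    private static ByteBuffer key(byte[] checksum) {
        return ByteBuffer.wrap(checksum.clone()); // copy so later array changes do not corrupt the key
    }

    void addChunk(byte[] checksum, int size) {
        chunkSizeByChecksum.put(key(checksum), size);
    }

    /** Deduplication: returns the stored chunk size, or null if the chunk is unknown. */
    Integer getChunk(byte[] checksum) {
        return chunkSizeByChecksum.get(key(checksum));
    }

    void addFileVersion(String historyId, String path, byte[] fileChecksum) {
        historyIdByPath.put(path, historyId);
        historyIdsByChecksum
                .computeIfAbsent(key(fileChecksum), k -> new LinkedHashSet<>())
                .add(historyId);
    }

    /** Versioning: match an existing history by path (file changed in place). */
    String getFileHistoryByPath(String path) {
        return historyIdByPath.get(path);
    }

    /** Versioning: match existing histories by content (file moved or renamed). */
    Set<String> getFileHistoriesByChecksum(byte[] fileChecksum) {
        Set<String> ids = historyIdsByChecksum.get(key(fileChecksum));
        return ids == null ? Collections.<String>emptySet() : Collections.unmodifiableSet(ids);
    }
}
```

The lookup by checksum returns a set because several histories may share the same content, so the caller still has to guess which one is the best match.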

Down operation

  • Reconciliation: apply winner's branch and prune branch, i.e. add/remove winning database versions to/from the local db (reconciliation was done on a vector clock basis)
    db.getDatabaseVersionByVectorClock(VectorClock)

  • Post-reconciliation: download required multichunks. FileVersion.getChecksum() --> db.getFileContent(checksum) --> for each chunk in FileContent: 1. db.getMultiChunkByChunk(chunk), 2. download multichunk

    db.getFileContent(byte[] fileChecksum) -- mapping from FileVersion to FileContent
    db.getMultiChunkByChunk(byte[] chunkId) -- figure out in which multichunk a chunk is stored

    ---> could be done with one SELECT
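
The two-step lookup above (file checksum to chunk list, then chunk to multichunk) can be sketched in Java as follows. This is an illustration under assumptions: the class DownSketch, the method multiChunksToDownload() and the two maps standing in for db.getFileContent() and db.getMultiChunkByChunk() are hypothetical, and string identifiers replace byte[] for readability. In SQL terms the loop corresponds to a join between the content-to-chunk and chunk-to-multichunk relations, which is why a single SELECT could do the same work.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; the maps stand in for db.getFileContent() and db.getMultiChunkByChunk().
final class DownSketch {

    /**
     * Returns the distinct multichunks needed to rebuild the given file content,
     * in the order in which they are first required.
     */
    static Set<String> multiChunksToDownload(
            String fileChecksum,
            Map<String, List<String>> chunksByFileChecksum,
            Map<String, String> multiChunkByChunk) {

        List<String> chunks = chunksByFileChecksum.get(fileChecksum);
        if (chunks == null) {
            throw new IllegalArgumentException("Unknown file content: " + fileChecksum);
        }

        // A set removes duplicates: many chunks usually live in the same multichunk.
        Set<String> multiChunks = new LinkedHashSet<>();
        for (String chunk : chunks) {
            String multiChunk = multiChunkByChunk.get(chunk);
            if (multiChunk == null) {
                throw new IllegalStateException("No multichunk known for chunk: " + chunk);
            }
            multiChunks.add(multiChunk);
        }
        return multiChunks;
    }
}
```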

Restore operation

  • Get all "current" files at date/time X
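
One possible reading of this view, sketched in Java: for every file history, take the newest version that is not later than X, and drop the file if that version marks a deletion. The class RestoreSketch, the Version fields (path, timestamp, deleted) and the assumption that each history's versions are sorted by ascending timestamp are illustrative choices, not part of the model described above.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; field names and the timestamp-based selection are assumptions.
final class RestoreSketch {

    static final class Version {
        final String path;
        final long timestamp;   // e.g. milliseconds since the epoch
        final boolean deleted;  // true if this version records the file's deletion

        Version(String path, long timestamp, boolean deleted) {
            this.path = path;
            this.timestamp = timestamp;
            this.deleted = deleted;
        }
    }

    /**
     * Returns, per file history id, the version that was current at the given time.
     * Histories with no version yet, or whose latest version is a deletion, are omitted.
     * Each version list is assumed to be sorted by ascending timestamp.
     */
    static Map<String, Version> currentFilesAt(Map<String, List<Version>> histories, long time) {
        Map<String, Version> current = new TreeMap<>();
        for (Map.Entry<String, List<Version>> history : histories.entrySet()) {
            Version latest = null;
            for (Version version : history.getValue()) {
                if (version.timestamp > time) {
                    break; // everything after this point is newer than the requested time
                }
                latest = version;
            }
            if (latest != null && !latest.deleted) {
                current.put(history.getKey(), latest);
            }
        }
        return current;
    }
}
```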

Cleanup operation

  • outdated = older than expiry date, and not used in any "active" object
  • Get file versions by file checksum (content) that are older than expiry date (= outdated)
  • find outdated contents: identify stale file contents (only used in outdated file versions)
  • find outdated chunks: identify if chunks in stale file contents can be removed by checking if they appear in any other contents which are used in file versions that are not outdated
  • find entirely outdated multichunks (= all chunks are outdated, can be deleted)
  • find ratio of outdated to current chunks in multichunk (purpose: "repackage" multichunk if 50% outdated chunks)
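
The cleanup steps above can be sketched in Java as three small set computations. This is a simplified illustration, not Syncany's cleanup logic: the class CleanupSketch, its method names, and the rule that a file version is outdated purely by timestamp are assumptions, and the ratio is computed as outdated chunks divided by all chunks in the multichunk.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; "outdated" is decided by timestamp only, which is a simplification.
final class CleanupSketch {

    static final class Version {
        final String contentChecksum;
        final long timestamp;

        Version(String contentChecksum, long timestamp) {
            this.contentChecksum = contentChecksum;
            this.timestamp = timestamp;
        }
    }

    /** Contents referenced by at least one version that is not older than the expiry date. */
    static Set<String> activeContents(List<Version> versions, long expiry) {
        Set<String> active = new HashSet<>();
        for (Version version : versions) {
            if (version.timestamp >= expiry) {
                active.add(version.contentChecksum);
            }
        }
        return active;
    }

    /** Chunks that appear in stale contents only, i.e. in no active content. */
    static Set<String> outdatedChunks(Map<String, List<String>> chunksByContent,
                                      Set<String> activeContents) {
        Set<String> activeChunks = new HashSet<>();
        Set<String> outdated = new HashSet<>();
        for (Map.Entry<String, List<String>> content : chunksByContent.entrySet()) {
            if (activeContents.contains(content.getKey())) {
                activeChunks.addAll(content.getValue());
            } else {
                outdated.addAll(content.getValue());
            }
        }
        outdated.removeAll(activeChunks); // a chunk still used by an active content must stay
        return outdated;
    }

    /** Share of a multichunk's chunks that are outdated: 1.0 means it can be deleted. */
    static double outdatedRatio(Set<String> multiChunkChunks, Set<String> outdatedChunks) {
        if (multiChunkChunks.isEmpty()) {
            return 0.0;
        }
        int outdated = 0;
        for (String chunk : multiChunkChunks) {
            if (outdatedChunks.contains(chunk)) {
                outdated++;
            }
        }
        return (double) outdated / multiChunkChunks.size();
    }
}
```

A caller could delete multichunks with a ratio of 1.0 and repackage those at or above 0.5, matching the 50% threshold mentioned in the list.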