Storage On Disk File Formats

Pekka Enberg edited this page Feb 5, 2018 · 4 revisions

Introduction

This document describes the on-disk storage file formats: the commit log and SSTables. The formats are binary-compatible with Apache Cassandra.

Data Types

Integer values are in big-endian (network byte order) format unless specified otherwise. They are denoted be16, be32, and be64 for 16-bit, 32-bit, and 64-bit widths respectively.

Byte strings are represented by a length, either 16-bit or 32-bit, followed by the actual bytes.
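
As a minimal sketch of these conventions, the following Python helpers decode big-endian integers and length-prefixed byte strings from a buffer. The function names are hypothetical, and treating the integers as unsigned is an assumption, since signedness is not specified here.

```python
import struct

def read_be16(buf, off):
    """Return (value, next_offset) for a be16 at off (assumed unsigned)."""
    return struct.unpack_from(">H", buf, off)[0], off + 2

def read_be32(buf, off):
    """Return (value, next_offset) for a be32 at off (assumed unsigned)."""
    return struct.unpack_from(">I", buf, off)[0], off + 4

def read_be64(buf, off):
    """Return (value, next_offset) for a be64 at off (assumed unsigned)."""
    return struct.unpack_from(">Q", buf, off)[0], off + 8

def _take(buf, off, n):
    # Slice n bytes and fail loudly on truncated input.
    if off + n > len(buf):
        raise ValueError("byte string runs past end of buffer")
    return bytes(buf[off:off + n]), off + n

def read_bytes16(buf, off):
    """Read a byte string prefixed by a 16-bit length."""
    n, off = read_be16(buf, off)
    return _take(buf, off, n)

def read_bytes32(buf, off):
    """Read a byte string prefixed by a 32-bit length."""
    n, off = read_be32(buf, off)
    return _take(buf, off, n)
```

Each helper returns the decoded value together with the offset of the next field, so calls can be chained while walking a file.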

Commit Log

struct header {
    be32 version;
    be64 id;
    be32 crc;
};

struct sync_marker {
    be32 next_marker;
    be32 crc;
};
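
The two structures above could be decoded along the following lines. This is only a sketch: it assumes the fields are packed back to back with no padding and are unsigned, and the names `parse_header` and `parse_sync_marker` are hypothetical. How the `crc` fields are computed is not described here, so the sketch reads them without verifying them.

```python
import struct
from collections import namedtuple

Header = namedtuple("Header", "version id crc")
SyncMarker = namedtuple("SyncMarker", "next_marker crc")

# ">" selects big-endian with no implicit padding.
HEADER = struct.Struct(">IQI")      # be32 version, be64 id, be32 crc
SYNC_MARKER = struct.Struct(">II")  # be32 next_marker, be32 crc

def parse_header(buf, off=0):
    """Decode a commit log header from buf at off."""
    return Header(*HEADER.unpack_from(buf, off))

def parse_sync_marker(buf, off=0):
    """Decode a sync marker from buf at off."""
    return SyncMarker(*SYNC_MARKER.unpack_from(buf, off))
```

Under these assumptions a header occupies 16 bytes and a sync marker 8 bytes.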

SSTables

SSTables ("sorted string tables") are organized into the following component files; the CRC and Digest files described later in this document accompany them:

  • Index
  • Compression info (optional)
  • Data
  • Filter
  • Statistics
  • Summary

Data

See also: SSTables Data File

The data file contains part of the actual data stored in the database. It is one long list of rows, where each row lists its key and its columns. The data file alone cannot be used to find a row with a specific key efficiently; the "Index" and "Summary" files, described below, exist for that purpose.

Index

The index file maps row keys to their offsets in the data file. It consists of a sequence of variable-length index entries. Note that the data file may be compressed; the offsets refer to positions in the uncompressed data.

struct index_entry {
    be16 key_size;
    char key[key_size];
    be64 position;
    be32 promoted_index_size;
    char promoted_index[promoted_index_size];
};

struct index {
    struct index_entry entries[...];
};
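
To illustrate the layout, here is a hedged Python sketch that walks the raw bytes of an index file and yields one tuple per entry. The function name is hypothetical, the integers are assumed to be unsigned, and the promoted index is returned as opaque bytes because its internal format is not described here.

```python
import struct

def iter_index_entries(buf):
    """Yield (key, position, promoted_index) for each index_entry in buf."""
    off = 0
    while off < len(buf):
        # be16 key_size, then the key bytes
        (key_size,) = struct.unpack_from(">H", buf, off)
        off += 2
        key = bytes(buf[off:off + key_size])
        off += key_size
        # be64 position, be32 promoted_index_size
        position, promoted_size = struct.unpack_from(">QI", buf, off)
        off += 12
        promoted = bytes(buf[off:off + promoted_size])
        off += promoted_size
        yield key, position, promoted
```

Because entries are variable-length, the file can only be scanned sequentially from the start; there is no way to jump to the n-th entry without reading the ones before it.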

CRC

The CRC file is generated only if compression is not defined in the schema. It stores the chunk size followed by a vector of 32-bit checksums, one for each chunk of the data file. The checksum algorithm is Adler-32.

struct crc {
    be32 chunk_size;
    be32 checksums[...];
};
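
A rough sketch of producing and reading this layout with Python's `zlib.adler32` follows. It assumes each checksum covers exactly one `chunk_size` slice of the data file (the last chunk may be shorter) and uses the standard Adler-32 initial value; the function names are hypothetical, and the exact checksumming details of a real implementation may differ.

```python
import struct
import zlib

def build_crc(data, chunk_size):
    """Serialize chunk_size followed by one Adler-32 checksum per chunk."""
    sums = [zlib.adler32(data[i:i + chunk_size]) & 0xFFFFFFFF
            for i in range(0, len(data), chunk_size)]
    return struct.pack(">I%dI" % len(sums), chunk_size, *sums)

def parse_crc(buf):
    """Return (chunk_size, [checksums]) from the raw bytes of a CRC file."""
    (chunk_size,) = struct.unpack_from(">I", buf, 0)
    count = (len(buf) - 4) // 4  # every remaining be32 is one checksum
    return chunk_size, list(struct.unpack_from(">%dI" % count, buf, 4))
```

The number of checksums is not stored explicitly, so the reader derives it from the file length.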

Digest

The Digest file stores the checksum of the complete SSTable data file as a string in decimal form. The checksum algorithm is Adler-32.
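
As an illustration, the digest string could be computed along these lines. This is a sketch only: `digest_of_stream` is a hypothetical helper, and it assumes the checksum runs over the data file bytes exactly as stored, starting from the standard Adler-32 initial value of 1.

```python
import zlib

def digest_of_stream(stream, block_size=1 << 16):
    """Return the Adler-32 of a binary stream as a decimal string."""
    value = 1  # standard Adler-32 initial value
    while True:
        block = stream.read(block_size)
        if not block:
            break
        # Feed the running value back in so large files need not fit in memory.
        value = zlib.adler32(block, value)
    return str(value & 0xFFFFFFFF)
```

A caller would open the data file in binary mode, for example `with open(path, "rb") as f: digest_of_stream(f)`, where `path` is whatever file name the deployment uses.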

Compression Info (optional)

See also: SSTables Compression

The compression info file is optional and is only present if the data file is compressed. It specifies the compression algorithm used and holds metadata for the compressed chunks in the data file.

Filter

Statistics

be32 Size;
struct {
    be32 metadata_type;
    be32 offset; /* offset into this file */ 
} metadata_metadata[Size];
  • NOTE: each metadata_metadata entry corresponds to one metadata component stored in the file.
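
The table at the start of the statistics file could be read as in the sketch below. The function name is hypothetical and the integers are assumed to be unsigned; the meaning of individual `metadata_type` values is not given here, so the sketch returns them uninterpreted.

```python
import struct

def parse_statistics_toc(buf):
    """Return a list of (metadata_type, offset) pairs from a statistics file."""
    (size,) = struct.unpack_from(">I", buf, 0)  # be32 Size
    entries = []
    for i in range(size):
        # Each entry is be32 metadata_type followed by be32 offset.
        entries.append(struct.unpack_from(">II", buf, 4 + 8 * i))
    return entries
```

Each returned offset is relative to the start of the same file, so a reader can seek straight to the metadata component it needs.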

Summary

See also: SSTables Summary File

References

http://distributeddatastore.blogspot.fi/2013/08/cassandra-sstable-storage-format.html
