How content-addressed deduplication works inside Frostholm

Deduplication internals

"Content-addressed storage" sounds like jargon, but the idea is simple: instead of naming data by where you put it (a path, an offset), you name it by what it contains (a hash). Two identical pieces of data always have the same name, so you naturally never store them twice. This post explains exactly how Frostholm applies that idea — from splitting a file into chunks all the way to writing encrypted pack files.

Why not just hash whole files?

The naive version of content-addressed dedup hashes entire files. If file A and file B are identical, you store one copy. Simple — but useless for archival. In practice, the files you care most about backing up change incrementally: a 4 GB video project file has one clip updated. A 200 MB Postgres dump has a few dozen rows changed. Hashing the whole file means treating those as completely new, unrelated objects every time.

Sub-file deduplication requires splitting files into pieces and hashing each piece independently. The question is: how do you split?

Fixed-size chunks: the wrong answer

The obvious answer is fixed-size chunks — say, 4 MB each. Bytes 0–4194303 are chunk 0, bytes 4194304–8388607 are chunk 1, and so on. This works for files that only grow at the end (append-only logs), but fails immediately for any file that has data inserted or deleted near the beginning.

Insert 100 bytes at offset 0 of a 1 GB file, and the content under every fixed chunk shifts by 100 bytes. Chunk 0 now holds the 100 new bytes followed by the old bytes 0–4194203, where before it held the old bytes 0–4194303. That's a different hash. Every chunk in the file is now "new" — you've just stored the entire file again despite it being 99.99% identical.
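
To make that concrete, here's a small self-contained illustration (not Frostholm code): it splits a buffer into fixed 4 MiB pieces, hashes each with the standard library's DefaultHasher as a stand-in for a real content hash, and counts how many pieces survive a 100-byte insertion at the front. The answer is zero.

    // Illustration only (not Frostholm code): fixed 4 MiB chunks get zero reuse
    // after a 100-byte insertion at the front. DefaultHasher stands in for a
    // real content hash purely to keep the demo dependency-free.
    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    const CHUNK: usize = 4 * 1024 * 1024; // 4 MiB fixed chunks

    fn fixed_chunk_hashes(data: &[u8]) -> Vec<u64> {
        data.chunks(CHUNK)
            .map(|c| {
                let mut h = DefaultHasher::new();
                c.hash(&mut h);
                h.finish()
            })
            .collect()
    }

    fn main() {
        // 64 MiB of varied sample data from a tiny xorshift PRNG.
        let mut state = 1u64;
        let original: Vec<u8> = (0..64 * 1024 * 1024)
            .map(|_| {
                state ^= state << 13;
                state ^= state >> 7;
                state ^= state << 17;
                state as u8
            })
            .collect();

        let mut shifted = vec![0x01u8; 100]; // 100 new bytes at offset 0
        shifted.extend_from_slice(&original);

        let before = fixed_chunk_hashes(&original);
        let after = fixed_chunk_hashes(&shifted);
        let reused = after.iter().filter(|h| before.contains(h)).count();
        println!("fixed-size chunks reused: {reused} of {}", after.len()); // 0 of 17
    }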

Content-defined chunking: the right answer

Content-defined chunking (CDC) places chunk boundaries based on the content itself, not on byte position. The classic mechanism is a rolling hash: maintain a hash over a sliding window of N bytes. When the hash meets a certain condition (e.g., the low 21 bits are all zero), that's a chunk boundary.

Because the boundary condition depends only on the local bytes around it, inserting 100 bytes at the beginning of a file shifts boundaries near the insertion point but leaves boundaries further downstream unchanged — they're still determined by the same local byte patterns. After a few kilobytes, the chunk structure resynchronizes to what it was before.

This is the core insight that makes incremental backup practical: resynchronization means that modifying one part of a large file only invalidates a small number of chunks, not all of them.

Frostholm's FastCDC implementation

Frostholm v0.4 uses a FastCDC-inspired chunker. The rolling hash is a gear hash — for each byte b, the state is updated as:

state = (state << 1) ^ GEAR_TABLE[b]

where GEAR_TABLE is a fixed array of 256 random 64-bit values. This is dramatically cheaper than Rabin fingerprinting because it's just a shift, an XOR, and a table lookup — no modular arithmetic.

The chunker uses three thresholds:

  • Min size: 512 KB — no chunk smaller than this regardless of hash condition.
  • Normal size: 2 MB — target average chunk size; checked with a 21-bit mask.
  • Max size: 8 MB — hard cutoff; if no boundary found by here, cut anyway.

FastCDC also normalizes the size distribution with two masks: a stricter mask (more bits, triggers less easily) for the region between the min and normal sizes, and a looser mask (fewer bits, triggers more easily) between the normal and max sizes. This pushes chunk sizes toward the 2 MB target and avoids a pathological distribution where many chunks cluster just above min size or get force-cut at max size.
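
To make the thresholds and the two masks concrete, here's a self-contained sketch of the boundary search. It's illustrative rather than Frostholm's actual chunker: the gear table is generated from a toy PRNG and the exact mask widths are just one plausible normalization choice. The main at the end repeats the 100-byte-insertion experiment from earlier, this time with content-defined boundaries.

    // Self-contained sketch of a FastCDC-style boundary search. This is
    // illustrative, not Frostholm's exact chunker: the gear table comes from a
    // toy xorshift PRNG and the mask widths are one plausible normalization choice.
    use std::collections::HashSet;

    const MIN_SIZE: usize = 512 * 1024;         // 512 KB: no boundary before this
    const NORMAL_SIZE: usize = 2 * 1024 * 1024; // 2 MB: target average size
    const MAX_SIZE: usize = 8 * 1024 * 1024;    // 8 MB: hard cutoff

    // A 21-bit mask targets ~2 MB chunks; the strict/loose variants add or drop
    // two bits to push the size distribution toward NORMAL_SIZE.
    const MASK_STRICT: u64 = (1 << 23) - 1; // harder to hit, used below NORMAL_SIZE
    const MASK_LOOSE: u64 = (1 << 19) - 1;  // easier to hit, used above NORMAL_SIZE

    fn xorshift(state: &mut u64) -> u64 {
        *state ^= *state << 13;
        *state ^= *state >> 7;
        *state ^= *state << 17;
        *state
    }

    fn build_gear_table() -> [u64; 256] {
        // The real table is 256 fixed random 64-bit constants; a PRNG stands in here.
        let mut state: u64 = 0x243F6A8885A308D3;
        let mut table = [0u64; 256];
        for slot in table.iter_mut() {
            *slot = xorshift(&mut state);
        }
        table
    }

    // Length of the next chunk starting at data[0].
    fn next_chunk_len(data: &[u8], gear: &[u64; 256]) -> usize {
        if data.len() <= MIN_SIZE {
            return data.len(); // short tail of the file: emit as-is
        }
        let max = data.len().min(MAX_SIZE);
        let normal = data.len().min(NORMAL_SIZE);
        let mut state = 0u64;
        let mut i = 0;
        // No boundary is allowed below MIN_SIZE. (A production chunker skips ahead
        // instead of hashing these bytes, since only the last 64 bytes of input can
        // still influence the state.)
        while i < MIN_SIZE {
            state = (state << 1) ^ gear[data[i] as usize];
            i += 1;
        }
        // Between MIN_SIZE and NORMAL_SIZE: strict mask, boundaries are rare.
        while i < normal {
            state = (state << 1) ^ gear[data[i] as usize];
            i += 1;
            if state & MASK_STRICT == 0 {
                return i;
            }
        }
        // Between NORMAL_SIZE and MAX_SIZE: loose mask, boundaries are likely.
        while i < max {
            state = (state << 1) ^ gear[data[i] as usize];
            i += 1;
            if state & MASK_LOOSE == 0 {
                return i;
            }
        }
        max // nothing matched: cut at MAX_SIZE (or at end of input)
    }

    fn main() {
        let gear = build_gear_table();
        // Repeat the insertion experiment: 64 MiB of pseudo-random data, then the
        // same data with 100 bytes inserted at the front.
        let mut seed = 1u64;
        let original: Vec<u8> = (0..64 * 1024 * 1024).map(|_| xorshift(&mut seed) as u8).collect();
        let mut shifted = vec![0x01u8; 100];
        shifted.extend_from_slice(&original);

        let chunk_set = |data: &[u8]| -> HashSet<Vec<u8>> {
            let mut set = HashSet::new();
            let mut rest = data;
            while !rest.is_empty() {
                let len = next_chunk_len(rest, &gear);
                set.insert(rest[..len].to_vec());
                rest = &rest[len..];
            }
            set
        };
        let (a, b) = (chunk_set(&original), chunk_set(&shifted));
        // Unlike fixed-size chunking, almost every chunk is shared between the two
        // versions: only the chunk(s) around the insertion point change.
        println!("content-defined chunks reused: {} of {}", a.intersection(&b).count(), b.len());
    }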

BLAKE3 for chunk identity

Each chunk is hashed with BLAKE3. I chose BLAKE3 over SHA-256 for two reasons:

  • Speed: BLAKE3 is roughly 3–5× faster than SHA-256 on modern hardware with SIMD acceleration. On an M2, it sustains ~10 GB/s for large inputs. This matters because chunk hashing is on the critical path of every backup.
  • Collision resistance: BLAKE3 produces 256-bit digests. The probability of two different chunks having the same hash is astronomically small (roughly 2^-128 for a birthday collision across 2^64 chunks). In a backup tool, a hash collision means silently corrupting data — that's unacceptable, so I wanted a hash with a comfortable security margin.

The 32-byte BLAKE3 digest is used as the chunk's storage key. The index maps these keys to locations in pack files.
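
In code, that's a one-liner with the blake3 crate (assumed as a dependency here; a sketch rather than Frostholm's exact code):

    // Minimal sketch of deriving a chunk's storage key with the `blake3` crate
    // (assumed dependency; not Frostholm's exact code).
    fn chunk_key(chunk: &[u8]) -> [u8; 32] {
        *blake3::hash(chunk).as_bytes()
    }

    fn main() {
        let chunk = b"example chunk contents";
        let key = chunk_key(chunk); // the 32-byte key the index maps to a pack location
        println!("storage key: {}", blake3::Hash::from(key).to_hex());
    }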

Pack files

Writing millions of small files to a remote backend is slow and expensive — each chunk would require its own API call. Instead, Frostholm aggregates chunks into pack files of roughly 128 MB.

A pack file has a simple structure:

[ encrypted chunk 0 ][ encrypted chunk 1 ] ... [ encrypted chunk N ][ footer ]

The footer contains a list of (hash, offset, length) tuples — one per chunk in the pack. If the index ever has to be rebuilt from the backend, these footers are downloaded and parsed to reconstruct the in-memory chunk → pack mapping.

Each chunk is independently encrypted with ChaCha20-Poly1305 using a random 12-byte nonce prepended to the ciphertext. This means chunks can be decrypted individually without decrypting the whole pack file — essential for selective restore.
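
Here's a sketch of how a chunk could be encrypted and recorded for the footer, assuming the chacha20poly1305 and blake3 crates. The type names (PackEntry, append_chunk) are mine for illustration; the authoritative binary layout is the one in docs/FORMAT.md.

    // Sketch of per-chunk encryption and footer bookkeeping, assuming the
    // `chacha20poly1305` and `blake3` crates. The types are illustrative, not
    // Frostholm's real ones; the on-disk layout is specified in docs/FORMAT.md.
    use chacha20poly1305::{
        aead::{Aead, AeadCore, KeyInit, OsRng},
        ChaCha20Poly1305,
    };

    struct PackEntry {
        hash: [u8; 32], // BLAKE3 chunk key
        offset: u64,    // where this chunk's nonce+ciphertext starts in the pack
        length: u64,    // length of nonce+ciphertext in bytes
    }

    fn append_chunk(
        pack: &mut Vec<u8>,
        footer: &mut Vec<PackEntry>,
        cipher: &ChaCha20Poly1305,
        chunk: &[u8],
    ) -> Result<(), chacha20poly1305::aead::Error> {
        let nonce = ChaCha20Poly1305::generate_nonce(&mut OsRng); // random 12-byte nonce
        let ciphertext = cipher.encrypt(&nonce, chunk)?;

        let offset = pack.len() as u64;
        pack.extend_from_slice(nonce.as_slice()); // nonce prepended to the ciphertext
        pack.extend_from_slice(&ciphertext);

        footer.push(PackEntry {
            hash: *blake3::hash(chunk).as_bytes(),
            offset,
            length: pack.len() as u64 - offset,
        });
        Ok(())
    }

    fn main() -> Result<(), chacha20poly1305::aead::Error> {
        let key = ChaCha20Poly1305::generate_key(&mut OsRng);
        let cipher = ChaCha20Poly1305::new(&key);

        let (mut pack, mut footer) = (Vec::new(), Vec::new());
        append_chunk(&mut pack, &mut footer, &cipher, b"chunk bytes go here")?;

        // A real pack accumulates ~128 MB of chunks, then serializes the footer
        // (hash, offset, length per chunk) and uploads the whole file to the backend.
        println!("pack: {} bytes, footer entries: {}", pack.len(), footer.len());
        Ok(())
    }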

The index

The index is an in-memory hash map from chunk hash (32 bytes) to pack location (pack_id, offset, length). During backup:

  1. Read source file, produce chunks.
  2. For each chunk, check index: if hash exists, record reference; if not, queue for writing.
  3. Accumulate new chunks into a pack buffer; flush to backend when buffer reaches 128 MB.
  4. Update index with new chunk locations.
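
Pulling steps 2–4 together, a stripped-down sketch of the dedup decision looks like this (the types and names are illustrative, the blake3 crate is assumed, and the encryption and footer handling from the pack section are elided):

    // Outline of the per-chunk dedup decision during backup. Illustrative types,
    // not Frostholm's real structs; encryption and footer writing are elided.
    use std::collections::HashMap;

    #[derive(Clone, Copy)]
    struct PackLocation {
        pack_id: u64,
        offset: u64,
        length: u64,
    }

    const PACK_TARGET: usize = 128 * 1024 * 1024; // flush once the buffer reaches ~128 MB

    struct Backup {
        index: HashMap<[u8; 32], PackLocation>,
        pack_buffer: Vec<u8>,
        next_pack_id: u64,
    }

    impl Backup {
        // Returns where the chunk lives, writing it into the current pack if it's new.
        fn add_chunk(&mut self, hash: [u8; 32], chunk: &[u8]) -> PackLocation {
            if let Some(loc) = self.index.get(&hash) {
                return *loc; // dedup hit: record a reference, store nothing
            }
            let loc = PackLocation {
                pack_id: self.next_pack_id,
                offset: self.pack_buffer.len() as u64,
                length: chunk.len() as u64,
            };
            self.pack_buffer.extend_from_slice(chunk); // real code encrypts first
            self.index.insert(hash, loc);
            if self.pack_buffer.len() >= PACK_TARGET {
                self.flush_pack();
            }
            loc
        }

        fn flush_pack(&mut self) {
            // Stand-in for: append footer, upload the pack to the backend, reset.
            self.pack_buffer.clear();
            self.next_pack_id += 1;
        }
    }

    fn main() {
        let mut backup = Backup {
            index: HashMap::new(),
            pack_buffer: Vec::new(),
            next_pack_id: 0,
        };
        let chunk = b"the same bytes, seen twice";
        let hash = *blake3::hash(chunk).as_bytes();
        backup.add_chunk(hash, chunk);
        backup.add_chunk(hash, chunk); // second call hits the index, stores nothing
        println!("pack buffer holds {} bytes", backup.pack_buffer.len());
    }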

The index is persisted to a local cache directory between runs (since v0.3.7). On the next backup, it's loaded from cache rather than rebuilt from the backend — which would require downloading and parsing all pack footers. On a 200 GB repository this saves 10–30 seconds of startup time.

The cache uses a simple write-ahead log for crash consistency: index updates are written to a WAL file before being applied to the main cache. If the process dies mid-backup, the WAL is replayed on the next run.
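
A minimal sketch of that idea, with an invented record layout rather than Frostholm's actual cache format: each index update is appended and fsynced to the WAL before the in-memory index changes, and on startup every complete record is replayed.

    // Minimal write-ahead-log sketch for index updates. The record layout here
    // (32-byte hash, then pack_id/offset/length as little-endian u64s) is invented
    // for illustration; Frostholm's real cache format may differ.
    use std::collections::HashMap;
    use std::fs::{File, OpenOptions};
    use std::io::{self, Read, Write};

    const RECORD_LEN: usize = 32 + 3 * 8;

    fn log_update(wal: &mut File, hash: &[u8; 32], loc: (u64, u64, u64)) -> io::Result<()> {
        let mut rec = Vec::with_capacity(RECORD_LEN);
        rec.extend_from_slice(hash);
        rec.extend_from_slice(&loc.0.to_le_bytes());
        rec.extend_from_slice(&loc.1.to_le_bytes());
        rec.extend_from_slice(&loc.2.to_le_bytes());
        wal.write_all(&rec)?;
        wal.sync_data()?; // durable before the in-memory index change is applied
        Ok(())
    }

    fn replay(path: &str, index: &mut HashMap<[u8; 32], (u64, u64, u64)>) -> io::Result<()> {
        let mut bytes = Vec::new();
        File::open(path)?.read_to_end(&mut bytes)?;
        // Apply every complete record; a torn final record from a crash is ignored.
        for rec in bytes.chunks_exact(RECORD_LEN) {
            let mut hash = [0u8; 32];
            hash.copy_from_slice(&rec[..32]);
            let u64_at = |i: usize| u64::from_le_bytes(rec[i..i + 8].try_into().unwrap());
            index.insert(hash, (u64_at(32), u64_at(40), u64_at(48)));
        }
        Ok(())
    }

    fn main() -> io::Result<()> {
        let path = "index.wal"; // hypothetical file name in the local cache directory
        let mut wal = OpenOptions::new().create(true).append(true).open(path)?;
        log_update(&mut wal, &[0xAB; 32], (0, 4096, 2_097_152))?; // placeholder entry

        let mut index = HashMap::new();
        replay(path, &mut index)?;
        println!("index entries recovered from WAL: {}", index.len());
        Ok(())
    }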

Dedup across files and snapshots

Because all chunks share a single index per repository, deduplication works globally: if the same 2 MB block of data appears in file_a.pdf and file_b.pdf, it's stored once. If it appears in the same file across 50 snapshots, it's stored once. This is why Frostholm can store a year of daily backups of a code repository in 3–4× the size of a single snapshot rather than 365×.

For workloads with genuine uniqueness (RAW photos, video files, already-compressed archives), dedup won't help and storage approaches 1:1. Frostholm is honest about this — fh backup prints actual dedup ratios so you know exactly what you're getting.

Questions or corrections: hello [at] frostholm.fun. The chunker and pack format are both documented in docs/FORMAT.md in the repository if you want the precise binary layout.