Frostholm v0.4: Backblaze B2 backend and faster chunking
v0.4 is the largest release since the initial public version. It took about four months of evenings and a few weekends. Two features drove most of the work: a native Backblaze B2 backend and a new variable-length chunker. Everything else in this release is a consequence of needing those two things to work well together.
Native Backblaze B2 backend
The S3-compatible endpoint that B2 exposes is fine for most tools. I used it myself for about six months before writing the native backend. The issue isn't correctness — B2's S3-compat layer is well-implemented — it's API call economics.
B2's free tier includes 2,500 Class B transactions per day (downloads, which is what chunk fetches during a restore count as) and 2,500 Class C transactions per day (list operations and similar metadata calls; uploads themselves are Class A and free). When you're using the S3-compat layer, every operation goes through an S3-to-B2 translation that occasionally issues two B2 API calls for one logical S3 call. On a 200 GB repository restore, this was chewing through 30–40% of the daily free quota before I'd moved 20 GB.
The native backend drives B2's large-file API (b2_start_large_file / b2_upload_part / b2_finish_large_file) directly and batches chunk fetches into ranged reads against pack files using HTTP Range headers. On the same 200 GB restore, the API call count dropped by about 60%.
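The range batching itself is plain HTTP. As a rough illustration, here's what fetching a run of adjacent chunks from one pack file with a single ranged GET could look like; the chunkRef type, pack URL, and auth token handling are simplified stand-ins rather than Frostholm's actual backend code, but B2 download URLs do honor standard Range headers:

```go
// Sketch: fetch several adjacent chunks from a pack file with one ranged GET,
// then slice the response back into individual chunks. Simplified stand-in,
// not Frostholm's actual backend code.
package b2sketch

import (
	"fmt"
	"io"
	"net/http"
)

type chunkRef struct {
	offset int64 // byte offset of the chunk inside the pack file
	length int64 // chunk length in bytes
}

func fetchRun(client *http.Client, packURL, authToken string, chunks []chunkRef) ([][]byte, error) {
	if len(chunks) == 0 {
		return nil, nil
	}
	first, last := chunks[0], chunks[len(chunks)-1]
	start := first.offset
	end := last.offset + last.length - 1 // Range end is inclusive

	req, err := http.NewRequest("GET", packURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", authToken)
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", start, end))

	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusPartialContent {
		return nil, fmt.Errorf("expected 206 Partial Content, got %s", resp.Status)
	}

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}

	// One network round trip, many chunks: cut the downloaded span apart locally.
	out := make([][]byte, len(chunks))
	for i, c := range chunks {
		rel := c.offset - start
		out[i] = body[rel : rel+c.length]
	}
	return out, nil
}
```

Adjacent chunks in the same pack then cost a single Class B transaction instead of one each.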
```sh
export B2_APPLICATION_KEY_ID="your-key-id"
export B2_APPLICATION_KEY="your-application-key"
fh init --repo b2://my-bucket/frostholm-home
```
One thing to configure on the B2 side: set a lifecycle rule that deletes hidden file versions after 1 day. When fh prune runs, it removes old pack files by uploading a "hide" marker rather than a real delete (B2 charges for delete API calls), and the lifecycle rule cleans the hidden versions up automatically. Without it, hidden versions accumulate and inflate your storage bill.
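For reference, this is what such a rule looks like in the JSON form B2's lifecycle settings take; daysFromHidingToDeleting: 1 is the part that matters here, and an empty fileNamePrefix applies the rule to the whole bucket:

```json
[
  {
    "fileNamePrefix": "",
    "daysFromUploadingToHiding": null,
    "daysFromHidingToDeleting": 1
  }
]
```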
FastCDC-inspired chunker
The original Frostholm chunker used Rabin fingerprinting — the same algorithm restic uses. It works, but Rabin has a known weakness: the rolling hash is computed over every byte, which creates a throughput ceiling. On modern hardware with fast NVMe storage, the chunker was the bottleneck during backup, not I/O.
The new chunker is based on the FastCDC algorithm from the 2016 USENIX ATC paper (Xia et al.). The key insight is using a gear table (a lookup table of random 64-bit values indexed by byte value) instead of a polynomial hash. The rolling hash can be updated with a shift, an XOR, and a table lookup per byte, which is significantly cheaper than Rabin's multiply-and-mod.
FastCDC also normalizes the chunk-size distribution with two thresholds around the target size: a stricter cut-point mask before it and a looser one after it. This reduces the number of very small chunks without sacrificing deduplication quality.
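A stripped-down sketch of that cut-point logic, to make the gear-table idea concrete. The size constants and mask values below are illustrative rather than Frostholm's actual tuning, and a real chunker ships a fixed gear table; the shape of the inner loop is the point:

```go
// Sketch of a FastCDC-style cut-point finder. Constants are illustrative,
// not Frostholm's actual parameters.
package chunker

import "math/rand"

const (
	minSize    = 512 << 10 // never cut before 512 KiB
	targetSize = 2 << 20   // aim for roughly 2 MiB chunks
	maxSize    = 8 << 20   // force a cut at 8 MiB

	maskStrict uint64 = 0x0000d93003530000 // more bits set: cuts are rarer before targetSize
	maskLoose  uint64 = 0x0000d90003530000 // fewer bits set: cuts come sooner after targetSize
)

// gear is a table of random 64-bit values indexed by byte value. A fixed seed
// keeps the table (and therefore chunk boundaries) stable across runs.
var gear = func() [256]uint64 {
	rng := rand.New(rand.NewSource(1))
	var t [256]uint64
	for i := range t {
		t[i] = rng.Uint64()
	}
	return t
}()

// cutPoint returns the length of the next chunk at the start of data.
func cutPoint(data []byte) int {
	n := len(data)
	if n <= minSize {
		return n
	}
	if n > maxSize {
		n = maxSize
	}
	var h uint64
	// Skip the first minSize bytes entirely: no cut can land there anyway.
	for i := minSize; i < n; i++ {
		h = (h << 1) ^ gear[data[i]] // one shift, one XOR, one table lookup per byte
		mask := maskLoose
		if i < targetSize {
			mask = maskStrict
		}
		if h&mask == 0 {
			return i + 1
		}
	}
	return n
}
```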
Benchmark results on a 180 GB photo library (M2 MacBook Air, reading from internal SSD):
| Chunker | Throughput | Avg chunk size | Dedup ratio |
|---|---|---|---|
| Rabin (v0.3) | 820 MB/s | 1.8 MB | 0.96× |
| FastCDC (v0.4) | 1,140 MB/s | 2.1 MB | 0.97× |
The dedup ratio is nearly identical because the workload (RAW photos) doesn't deduplicate well regardless of chunker. On a code repository the improvement is more visible because FastCDC's larger average chunk size means fewer index entries and less index overhead.
Breaking change: existing repositories can still be read correctly, since the pack file format is unchanged. But new chunks written by v0.4 use the FastCDC layout, so the repository will contain a mix of chunkings until you run fh migrate v4. Migration rewrites all existing pack files with FastCDC chunks. It's optional, and mixed repositories work fine; the main cost of staying mixed is that data chunked under the old scheme won't deduplicate against data chunked under the new one. Migration takes roughly the same time as a full restore plus re-backup.
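Assuming migrate takes the same --repo flag as the other subcommands, migrating the B2 repository from earlier would be:

```sh
fh migrate v4 --repo b2://my-bucket/frostholm-home
```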
Parallel restore
Before v0.4, restore was single-threaded: fetch chunk → decrypt → write → repeat. For local repositories this was fine (memory bandwidth limited). For remote backends it was terrible — each chunk fetch blocked on network latency before the next one started.
v0.4 restore uses a pipeline with configurable parallelism:
```sh
# Default parallelism: 4 concurrent chunk fetches
fh restore --repo b2://my-bucket/frostholm-home --snapshot latest --target /tmp/out

# Increase for high-bandwidth connections
fh restore --repo b2://my-bucket/frostholm-home --snapshot latest --target /tmp/out \
  --parallelism 16
```
On a 1 Gbps connection restoring from B2 Europe, parallelism=16 restored 50 GB in 7 minutes. With parallelism=1 (v0.3 behavior) it took 31 minutes. The difference is almost entirely latency hiding.
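For the curious, the latency hiding is nothing exotic: keep several fetches in flight while writes stay in order. A minimal sketch of that shape follows; fetchChunk and decrypt are hypothetical stand-ins for Frostholm's internals, and the real pipeline streams continuously rather than working in fixed windows, but the effect of having N requests waiting on the network at once instead of one is the same.

```go
// Sketch: restore chunks with up to `parallelism` fetches in flight at once,
// writing results back in their original order. fetchChunk and decrypt are
// hypothetical stand-ins, not Frostholm's real API.
package restoresketch

import (
	"io"

	"golang.org/x/sync/errgroup"
)

func restoreChunks(
	ids []string,
	parallelism int,
	out io.Writer,
	fetchChunk func(id string) ([]byte, error),
	decrypt func(ciphertext []byte) ([]byte, error),
) error {
	if parallelism < 1 {
		parallelism = 1
	}
	for start := 0; start < len(ids); start += parallelism {
		end := start + parallelism
		if end > len(ids) {
			end = len(ids)
		}
		window := ids[start:end]
		plain := make([][]byte, len(window))

		// Fetch and decrypt the whole window concurrently. This is where the
		// time goes on a remote backend: N requests wait on the network at once.
		var g errgroup.Group
		for i, id := range window {
			i, id := i, id
			g.Go(func() error {
				ct, err := fetchChunk(id)
				if err != nil {
					return err
				}
				pt, err := decrypt(ct)
				if err != nil {
					return err
				}
				plain[i] = pt
				return nil
			})
		}
		if err := g.Wait(); err != nil {
			return err
		}

		// Writes stay strictly in order, so the restored data is byte-identical.
		for _, pt := range plain {
			if _, err := out.Write(pt); err != nil {
				return err
			}
		}
	}
	return nil
}
```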
fh diff
A frequently requested feature. Given two snapshot IDs, fh diff shows what changed between them:
```sh
fh diff --repo /mnt/backup/home abc12345 def67890
# M Documents/taxes/2026.pdf (42 KB → 118 KB)
# A Pictures/iceland-trip/DSC_0847.ARW
# A Pictures/iceland-trip/DSC_0848.ARW
# D Downloads/old-stuff.zip
# ...
```
M = modified, A = added, D = deleted. The size column shows old → new for modified files. Pass --json for machine-readable output.
Upgrade
```sh
go install github.com/e-var/frostholm/cmd/fh@latest
# or
brew upgrade frostholm
```
If you're using the B2 backend for the first time, see the backend docs for bucket configuration tips. Full release notes are in the changelog.