Frostholm v0.4: Backblaze B2 backend and faster chunking
v0.4 is the largest release since the initial public version. It took about four months of evenings and a few weekends. Two features drove most of the work: a native Backblaze B2 backend and a new variable-length chunker. Everything else in this release is a consequence of needing those two things to work well together.
Native Backblaze B2 backend
The S3-compatible endpoint that B2 exposes is fine for most tools. I used it myself for about six months before writing the native backend. The issue isn't correctness — B2's S3-compat layer is well-implemented — it's API call economics.
B2's free tier includes 2,500 Class B transactions per day (downloads, which is what chunk fetches during a restore count as) and 2,500 Class C transactions per day (list operations and similar metadata calls; uploads themselves are Class A and free). When you're using the S3-compat layer, every operation goes through an S3-to-B2 translation that occasionally issues two B2 API calls for one logical S3 call. On a 200 GB repository restore, this was chewing through 30–40% of the daily free quota before I'd moved 20 GB.
The native backend drives B2's large-file API (b2_start_large_file / b2_upload_part / b2_finish_large_file) directly and batches chunk fetches into ranged reads against pack files using HTTP Range headers. On the same 200 GB restore, the API call count dropped by about 60%.
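The range batching itself is plain HTTP. As a rough illustration, here's what fetching a run of adjacent chunks from one pack file with a single ranged GET could look like; the chunkRef type, pack URL, and auth token handling are simplified stand-ins rather than Frostholm's actual backend code, but B2 download URLs do honor standard Range headers:

```go
// Sketch: fetch several adjacent chunks from a pack file with one ranged GET,
// then slice the response back into individual chunks. Simplified stand-in,
// not Frostholm's actual backend code.
package b2sketch

import (
	"fmt"
	"io"
	"net/http"
)

type chunkRef struct {
	offset int64 // byte offset of the chunk inside the pack file
	length int64 // chunk length in bytes
}

func fetchRun(client *http.Client, packURL, authToken string, chunks []chunkRef) ([][]byte, error) {
	if len(chunks) == 0 {
		return nil, nil
	}
	first, last := chunks[0], chunks[len(chunks)-1]
	start := first.offset
	end := last.offset + last.length - 1 // Range end is inclusive

	req, err := http.NewRequest("GET", packURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", authToken)
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", start, end))

	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusPartialContent {
		return nil, fmt.Errorf("expected 206 Partial Content, got %s", resp.Status)
	}

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}

	// One network round trip, many chunks: cut the downloaded span apart locally.
	out := make([][]byte, len(chunks))
	for i, c := range chunks {
		rel := c.offset - start
		out[i] = body[rel : rel+c.length]
	}
	return out, nil
}
```

Adjacent chunks in the same pack then cost a single Class B transaction instead of one each.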
```sh
export B2_APPLICATION_KEY_ID="your-key-id"
export B2_APPLICATION_KEY="your-application-key"
fh init --repo b2://my-bucket/frostholm-home
```
One thing to configure on the B2 side: set a lifecycle rule that deletes hidden file versions after 1 day. When fh prune runs, it removes old pack files by uploading a "hide" marker rather than a real delete (B2 charges for delete API calls), and the lifecycle rule cleans the hidden versions up automatically. Without it, hidden versions accumulate and inflate your storage bill.
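For reference, this is what such a rule looks like in the JSON form B2's lifecycle settings take; daysFromHidingToDeleting: 1 is the part that matters here, and an empty fileNamePrefix applies the rule to the whole bucket:

```json
[
  {
    "fileNamePrefix": "",
    "daysFromUploadingToHiding": null,
    "daysFromHidingToDeleting": 1
  }
]
```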
FastCDC-inspired chunker
The original Frostholm chunker used Rabin fingerprinting — the same algorithm restic uses. It works, but Rabin has a known weakness: the rolling hash is computed over every byte, which creates a throughput ceiling. On modern hardware with fast NVMe storage, the chunker was the bottleneck during backup, not I/O.
The new chunker is based on the FastCDC algorithm from the 2016 USENIX ATC paper (Xia et al.). The key insight is using a gear table (a lookup table of random 64-bit values indexed by byte value) instead of a polynomial hash. The rolling hash can be updated with a shift, an XOR, and a table lookup per byte, which is significantly cheaper than Rabin's multiply-and-mod.
FastCDC also normalizes the chunk-size distribution with two thresholds around the target size: a stricter cut-point mask before it and a looser one after it. This reduces the number of very small chunks without sacrificing deduplication quality.
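A stripped-down sketch of that cut-point logic, to make the gear-table idea concrete. The size constants and mask values below are illustrative rather than Frostholm's actual tuning, and a real chunker ships a fixed gear table; the shape of the inner loop is the point:

```go
// Sketch of a FastCDC-style cut-point finder. Constants are illustrative,
// not Frostholm's actual parameters.
package chunker

import "math/rand"

const (
	minSize    = 512 << 10 // never cut before 512 KiB
	targetSize = 2 << 20   // aim for roughly 2 MiB chunks
	maxSize    = 8 << 20   // force a cut at 8 MiB

	maskStrict uint64 = 0x0000d93003530000 // more bits set: cuts are rarer before targetSize
	maskLoose  uint64 = 0x0000d90003530000 // fewer bits set: cuts come sooner after targetSize
)

// gear is a table of random 64-bit values indexed by byte value. A fixed seed
// keeps the table (and therefore chunk boundaries) stable across runs.
var gear = func() [256]uint64 {
	rng := rand.New(rand.NewSource(1))
	var t [256]uint64
	for i := range t {
		t[i] = rng.Uint64()
	}
	return t
}()

// cutPoint returns the length of the next chunk at the start of data.
func cutPoint(data []byte) int {
	n := len(data)
	if n <= minSize {
		return n
	}
	if n > maxSize {
		n = maxSize
	}
	var h uint64
	// Skip the first minSize bytes entirely: no cut can land there anyway.
	for i := minSize; i < n; i++ {
		h = (h << 1) ^ gear[data[i]] // one shift, one XOR, one table lookup per byte
		mask := maskLoose
		if i < targetSize {
			mask = maskStrict
		}
		if h&mask == 0 {
			return i + 1
		}
	}
	return n
}
```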
Benchmark results on a 180 GB photo library (M2 MacBook Air, reading from internal SSD):
| Chunker | Throughput | Avg chunk size | Dedup ratio |
|---|---|---|---|
| Rabin (v0.3) | 820 MB/s | 1.8 MB | 0.96× |
| FastCDC (v0.4) | 1,140 MB/s | 2.1 MB | 0.97× |
The dedup ratio is nearly identical because the workload (RAW photos) doesn't deduplicate well regardless of chunker. On a code repository the improvement is more visible because FastCDC's larger average chunk size means fewer index entries and less index overhead.
Breaking change: existing repositories can still be read correctly, since the pack file format is unchanged. But new chunks written by v0.4 use the FastCDC layout, so the repository will contain a mix of chunkings until you run fh migrate v4. Migration rewrites all existing pack files with FastCDC chunks. It's optional, and mixed repositories work fine; the main cost of staying mixed is that data chunked under the old scheme won't deduplicate against data chunked under the new one. Migration takes roughly the same time as a full restore plus re-backup.
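Assuming migrate takes the same --repo flag as the other subcommands, migrating the B2 repository from earlier would be:

```sh
fh migrate v4 --repo b2://my-bucket/frostholm-home
```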
Parallel restore
Before v0.4, restore was single-threaded: fetch chunk → decrypt → write → repeat. For local repositories this was fine (memory bandwidth limited). For remote backends it was terrible — each chunk fetch blocked on network latency before the next one started.
v0.4 restore uses a pipeline with configurable parallelism:
```sh
# Default parallelism: 4 concurrent chunk fetches
fh restore --repo b2://my-bucket/frostholm-home --snapshot latest --target /tmp/out

# Increase for high-bandwidth connections
fh restore --repo b2://my-bucket/frostholm-home --snapshot latest --target /tmp/out \
  --parallelism 16
```
On a 1 Gbps connection restoring from B2 Europe, parallelism=16 restored 50 GB in 7 minutes. With parallelism=1 (v0.3 behavior) it took 31 minutes. The difference is almost entirely latency hiding.
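For the curious, the latency hiding is nothing exotic: keep several fetches in flight while writes stay in order. A minimal sketch of that shape follows; fetchChunk and decrypt are hypothetical stand-ins for Frostholm's internals, and the real pipeline streams continuously rather than working in fixed windows, but the effect of having N requests waiting on the network at once instead of one is the same.

```go
// Sketch: restore chunks with up to `parallelism` fetches in flight at once,
// writing results back in their original order. fetchChunk and decrypt are
// hypothetical stand-ins, not Frostholm's real API.
package restoresketch

import (
	"io"

	"golang.org/x/sync/errgroup"
)

func restoreChunks(
	ids []string,
	parallelism int,
	out io.Writer,
	fetchChunk func(id string) ([]byte, error),
	decrypt func(ciphertext []byte) ([]byte, error),
) error {
	if parallelism < 1 {
		parallelism = 1
	}
	for start := 0; start < len(ids); start += parallelism {
		end := start + parallelism
		if end > len(ids) {
			end = len(ids)
		}
		window := ids[start:end]
		plain := make([][]byte, len(window))

		// Fetch and decrypt the whole window concurrently. This is where the
		// time goes on a remote backend: N requests wait on the network at once.
		var g errgroup.Group
		for i, id := range window {
			i, id := i, id
			g.Go(func() error {
				ct, err := fetchChunk(id)
				if err != nil {
					return err
				}
				pt, err := decrypt(ct)
				if err != nil {
					return err
				}
				plain[i] = pt
				return nil
			})
		}
		if err := g.Wait(); err != nil {
			return err
		}

		// Writes stay strictly in order, so the restored data is byte-identical.
		for _, pt := range plain {
			if _, err := out.Write(pt); err != nil {
				return err
			}
		}
	}
	return nil
}
```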
fh diff
A frequently requested feature. Given two snapshot IDs, fh diff shows what changed between them:
```sh
fh diff --repo /mnt/backup/home abc12345 def67890
# M Documents/taxes/2026.pdf (42 KB → 118 KB)
# A Pictures/iceland-trip/DSC_0847.ARW
# A Pictures/iceland-trip/DSC_0848.ARW
# D Downloads/old-stuff.zip
# ...
```
M = modified, A = added, D = deleted. The size column shows old → new for modified files. Pass --json for machine-readable output.
Upgrade
```sh
go install github.com/e-var/frostholm/cmd/fh@latest
# or
brew upgrade frostholm
```
If you're using the B2 backend for the first time, see the backend docs for bucket configuration tips. Full release notes are in the changelog.