Mu TSDB PDD: Gorilla-inspired Features (Design Intake)
Date: 2026-02-25
Context
Mu is a lean TSDB built for three hard goals:
- Blazing writes
- Blazing queries
- Never drop a metric
This document captures implementation ideas worth "ripping" from Meta/Facebook's Gorilla TSDB design and paper, adapted to Mu's goals and constraints.
Primary reference: "Gorilla: A Fast, Scalable, In-Memory Time Series Database" (Teller et al.). https://www.vldb.org/pvldb/vol8/p1816-teller.pdf
Non-goals (for this PDD)
- Prometheus/Influx/OTEL wire compatibility
- Full distributed control-plane design (Paxos shard assignment, etc.) in the near term
- Exact reproduction of Gorilla's architecture (Mu can diverge where it helps correctness or simplicity)
Key decision: "never drop a metric" vs Gorilla availability tradeoffs
Gorilla explicitly tolerates dropping a small amount of recent data under some failure/overload situations to preserve availability. Mu's "never drop a metric" goal implies different defaults:
- The default write contract should be "accepted only after durable enough to survive process crash."
- Under overload, Mu should prefer backpressure (block) over dropping or rejecting with 5xx.
- If we ever add a "best-effort" write mode, it must be explicit and opt-in.
Design ideas to adopt (with Mu adaptation and phase)
1) Delta-of-delta timestamp bitpacking (per-series)
What Gorilla did:
- Encode timestamps as delta-of-delta with variable-length bitpacking inside a compressed block.
Why we want it:
- Large reduction in bytes per point, which cuts write amplification and memory footprint.
- Faster query scans due to less memory bandwidth.
Mu adaptation:
- Implement a per-series encoder for `ts_ms` using Gorilla-style delta-of-delta.
- Keep the decoder streaming-friendly so aggregates can consume without materializing points.
Phase:
- v1 storage format.
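To make the encoding concrete, here is a minimal Rust sketch of the delta-of-delta bucketing decision (function names like `dod_bits` are illustrative, not Mu's actual API; full bitstream writing is omitted, and the paper's special handling of the block's first timestamp is folded into an initial delta of 0):

```rust
/// Given a sorted millisecond timestamp stream, yield
/// (delta_of_delta, bits_needed) pairs. Gorilla stores the block's header
/// timestamp and first delta separately; this sketch simplifies that away.
fn delta_of_deltas(ts_ms: &[i64]) -> Vec<(i64, u32)> {
    let mut out = Vec::new();
    let mut prev_delta = 0i64;
    for w in ts_ms.windows(2) {
        let delta = w[1] - w[0];
        let dod = delta - prev_delta;
        out.push((dod, dod_bits(dod)));
        prev_delta = delta;
    }
    out
}

/// Bits used for one delta-of-delta, following Gorilla's variable-length
/// buckets: a single '0' control bit for 0, then 7/9/12/32-bit value ranges
/// behind 2/3/4/4 control bits.
fn dod_bits(dod: i64) -> u32 {
    match dod {
        0 => 1,
        -63..=64 => 2 + 7,
        -255..=256 => 3 + 9,
        -2047..=2048 => 4 + 12,
        _ => 4 + 32,
    }
}
```

The payoff shows in the common case: a series sampled on a fixed interval has a delta-of-delta of 0 for almost every point, so each timestamp costs a single bit.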
2) XOR float value compression (per-series)
What Gorilla did:
- Encode values as XOR vs previous value, with leading/trailing zero compression and a "reuse previous window" fast path.
Why we want it:
- This is the core "2 bytes/point-ish" win for float64-ish telemetry series.
Mu adaptation:
- Store values as f64 (or f32 if explicitly configured) and apply XOR encoding in block format.
- Add support for integer value types later if Mu adds typed metrics.
Phase:
- v1 storage format.
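A sketch of the per-point cost decision in Gorilla's XOR scheme, in Rust (the signature is hypothetical; actual bit emission is omitted, and `window` carries the previous leading/trailing-zero counts):

```rust
/// Returns (bits_used, new_window) for encoding `cur` against `prev`.
/// `window` is the previous (leading, trailing) zero counts, None at
/// block start.
fn xor_bits(prev: f64, cur: f64, window: Option<(u32, u32)>) -> (u32, (u32, u32)) {
    let x = prev.to_bits() ^ cur.to_bits();
    if x == 0 {
        // Identical value: a single '0' control bit, window unchanged.
        return (1, window.unwrap_or((0, 0)));
    }
    let lead = x.leading_zeros().min(31); // leading count is stored in 5 bits
    let trail = x.trailing_zeros();
    if let Some((pl, pt)) = window {
        if lead >= pl && trail >= pt {
            // Fast path: meaningful bits fit the previous window ('10' + payload).
            return (2 + (64 - pl - pt), (pl, pt));
        }
    }
    // New window: '11' + 5 bits leading count + 6 bits length + payload.
    (2 + 5 + 6 + (64 - lead - trail), (lead, trail))
}
```

Slowly varying float telemetry hits the 1-bit and fast-path cases most of the time, which is where the ~2 bytes/point average comes from.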
3) Time-partitioned compressed blocks (2h as a starting point)
What Gorilla did:
- Use fixed-ish time blocks (they mention 2 hours) as a good compression vs decode tradeoff.
Why we want it:
- Bounded memory overhead per series.
- Efficient eviction/retention boundaries.
Mu adaptation:
- Use time-partitioned blocks for each series.
- The block duration should be configurable (start with 2h) and can be shorter for high-cardinality series.
Phase:
- v1 storage format.
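The block-boundary math is trivial but worth pinning down; a sketch assuming `block_len_ms` is the configurable duration (2h default):

```rust
/// Map a timestamp to the start of its block. `rem_euclid` keeps the
/// boundary correct for pre-epoch (negative) timestamps as well.
fn block_start(ts_ms: i64, block_len_ms: i64) -> i64 {
    ts_ms - ts_ms.rem_euclid(block_len_ms)
}
```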
4) Open/closed block lifecycle (append-only open, immutable closed)
What Gorilla did:
- Keep one open block per series for appends.
- When full, close it and treat closed blocks as immutable.
Why we want it:
- Writes are fast and localized.
- Closed blocks become cache-friendly and can be shared across threads safely.
Mu adaptation:
- Model `OpenBlock` with an encoder and small mutable buffers.
- Model `ClosedBlock` as immutable byte slices (or slab-allocated arenas).
Phase:
- v1.
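A minimal Rust sketch of the open/closed split (type and field layout are illustrative, not Mu's actual structures):

```rust
/// Append target: one per series, holds mutable encoder state.
struct OpenBlock {
    start_ms: i64,
    buf: Vec<u8>, // encoder output so far (last ts, delta, XOR window omitted)
    count: usize,
}

/// Immutable once closed; safe to share across threads without locking.
struct ClosedBlock {
    start_ms: i64,
    bytes: Box<[u8]>,
    count: usize,
}

impl OpenBlock {
    /// Consume the open block, freezing its bytes.
    fn close(self) -> ClosedBlock {
        ClosedBlock {
            start_ms: self.start_ms,
            bytes: self.buf.into_boxed_slice(),
            count: self.count,
        }
    }
}
```

Taking `self` by value in `close` makes the lifecycle one-way at the type level: once closed, there is no path back to mutation.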
5) Serve queries by copying compressed blocks (avoid server-side decompression)
What Gorilla did:
- Return compressed blocks for the query range; decompression happens in a downstream tier.
Why we want it:
- Dramatically reduces CPU and p99 under large fanout queries.
- Shifts work toward a caller or a dedicated query tier.
Mu adaptation:
- Introduce a query mode that returns compressed blocks or block summaries.
- Keep current JSON results for MVP; add "binary blocks" response behind a new endpoint or content-type.
Phase:
- v1 for block format and transport.
- v2 for a proper "query tier" split if desired.
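One way to model the dual response modes, as a Rust sketch (variant and function names are hypothetical; the JSON path stays the MVP default):

```rust
/// A query response that carries either decoded points or raw copied
/// ClosedBlock bytes, selected by the caller.
enum QueryResult {
    /// Decoded points, as the current JSON endpoint returns.
    Points(Vec<(i64, f64)>),
    /// Compressed block bytes for the range; the caller decompresses.
    CompressedBlocks(Vec<Vec<u8>>),
}

/// Rough server-side payload size: copying blocks is memcpy-bound,
/// decoding points is not.
fn response_bytes(r: &QueryResult) -> usize {
    match r {
        QueryResult::Points(p) => p.len() * 16, // 8B ts + 8B value per point
        QueryResult::CompressedBlocks(bs) => bs.iter().map(|b| b.len()).sum(),
    }
}
```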
6) Slab allocation to control fragmentation
What Gorilla did:
- Copy closed blocks into large slab allocations to avoid heap fragmentation.
Why we want it:
- High-cardinality series plus lots of small allocations will crush allocator performance and RSS stability.
Mu adaptation:
- Use arena/slab allocation for `ClosedBlock` bytes.
- Keep `OpenBlock` on the regular heap (or small arenas) and copy-on-close.
Phase:
- v1.
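A trivial bump-arena sketch of the copy-on-close idea (a production slab allocator would also recycle whole slabs at retention eviction; names are illustrative):

```rust
/// Bump arena: closed-block bytes are copied in and addressed by
/// (offset, len) handles instead of individual heap allocations.
struct Slab {
    buf: Vec<u8>,
}

impl Slab {
    fn with_capacity(cap: usize) -> Self {
        Slab { buf: Vec::with_capacity(cap) }
    }

    /// Copy `bytes` into the slab and return its (offset, len) handle.
    fn push(&mut self, bytes: &[u8]) -> (usize, usize) {
        let off = self.buf.len();
        self.buf.extend_from_slice(bytes);
        (off, bytes.len())
    }

    fn get(&self, off: usize, len: usize) -> &[u8] {
        &self.buf[off..off + len]
    }
}
```

Because closed blocks are immutable and evicted in whole time-ordered slabs, the arena never needs per-block free/compaction, which is what keeps RSS stable.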
7) Hybrid in-memory index (paged scan + O(1) lookup) with tombstones/free-pool
What Gorilla did:
- Maintain an array/vector for scan-friendly iteration over series pointers.
- Maintain a hash map for name->series pointer lookup.
- Use tombstones and a free-pool to reuse slots on delete.
Why we want it:
- Query planning and "scan lots of series" workloads need linear, cache-friendly iteration.
- Direct ingest lookups need constant time.
Mu adaptation:
- Split "series catalog" from "series data."
- Maintain:
  - `Vec<SeriesRef>` for scans
  - `HashMap<SeriesKey, SeriesId>` for lookup
  - A free-list for recycling `SeriesId`s
Phase:
- MVP-next for the catalog split.
- v1 for tombstones/free-pool and tighter memory behavior.
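A Rust sketch of the catalog with tombstones and id recycling (keys simplified to `String`; `SeriesRef` reduced to the stored key for brevity):

```rust
use std::collections::HashMap;

type SeriesId = u32;

/// Series catalog: Vec for cache-friendly scans, HashMap for O(1) ingest
/// lookup, free-list for slot reuse after deletes.
struct Catalog {
    slots: Vec<Option<String>>, // scan path; None = tombstone
    by_key: HashMap<String, SeriesId>,
    free: Vec<SeriesId>,
}

impl Catalog {
    fn new() -> Self {
        Catalog { slots: Vec::new(), by_key: HashMap::new(), free: Vec::new() }
    }

    fn get_or_create(&mut self, key: &str) -> SeriesId {
        if let Some(&id) = self.by_key.get(key) {
            return id;
        }
        let id = match self.free.pop() {
            // Reuse a tombstoned slot if one is available.
            Some(id) => { self.slots[id as usize] = Some(key.to_string()); id }
            None => {
                self.slots.push(Some(key.to_string()));
                (self.slots.len() - 1) as SeriesId
            }
        };
        self.by_key.insert(key.to_string(), id);
        id
    }

    fn delete(&mut self, key: &str) {
        if let Some(id) = self.by_key.remove(key) {
            self.slots[id as usize] = None; // tombstone; slot stays scannable
            self.free.push(id);
        }
    }
}
```

Scans iterate `slots` linearly and skip `None`; ids stay dense, so `SeriesId` can double as a direct index into per-series data arrays.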
8) Locking strategy optimized for write-heavy workloads
What Gorilla did:
- Use short-lived RW spinlocks for the index structures.
- Use tiny per-series locks so writes don't contend globally.
- Copy pointers / page lists out of critical sections.
Why we want it:
- Mu currently risks query/write contention if a query holds a large read lock while aggregating.
Mu adaptation:
- Introduce sharded locks (N-way) for the series catalog and block stores.
- Store per-series block lists behind a small lock, or use append-only + atomic swap techniques.
- Ensure the query path can snapshot references quickly and release locks before aggregating.
Phase:
- MVP-next (this is the highest leverage for "never drop" under query load).
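A sketch of the sharded-lock plus snapshot-then-release pattern in Rust (shard count, type names, and the `Arc<[u8]>` block representation are assumptions; std `RwLock` stands in for whatever lock Mu settles on):

```rust
use std::sync::{Arc, RwLock};

const SHARDS: usize = 16;

/// N-way sharded locks: writers on different series rarely contend.
struct Sharded<T> {
    shards: Vec<RwLock<T>>,
}

impl<T: Default> Sharded<T> {
    fn new() -> Self {
        Sharded { shards: (0..SHARDS).map(|_| RwLock::new(T::default())).collect() }
    }

    fn shard_for(&self, hash: u64) -> &RwLock<T> {
        &self.shards[(hash as usize) % SHARDS]
    }
}

/// Per-series data: closed blocks behind Arc so queries snapshot cheaply.
#[derive(Default)]
struct SeriesData {
    blocks: Vec<Arc<[u8]>>,
}

/// Clone the Arc pointers under the read lock, then release it before any
/// decoding/aggregation happens.
fn snapshot(lock: &RwLock<SeriesData>) -> Vec<Arc<[u8]>> {
    lock.read().unwrap().blocks.clone()
}
```

The key property: the critical section is a handful of pointer clones, so a heavy aggregation can never hold up writers, which is exactly the "never drop under query load" lever.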
9) Per-shard persistence format (key dictionary + id-tagged append log + checkpoints)
What Gorilla did:
- Per-shard directory.
- A key list mapping string->int id.
- An append-only log interleaving series updates tagged by 32-bit id.
- Periodic complete-block files and checkpoint files for crash recovery validation.
Why we want it:
- Better replay time than scanning a monolithic WAL of JSON lines.
- Smaller on-disk footprint and faster warm start.
Mu adaptation:
- Evolve from JSONL WAL to a binary format:
  - `keys.bin` or `keys.json` mapping series key -> id
  - `wal.bin` id-tagged record stream
  - `checkpoint.bin` with last-good offsets and CRCs
  - Optional `blocks/` files for closed blocks and compaction outputs
- Keep "idempotent request_id" semantics independent of the storage encoding.
Phase:
- v1 for binary WAL + checkpointing.
- v2 for compaction and block files.
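To illustrate the id-tagged record stream, a Rust sketch of one fixed-width `wal.bin` record (field order and widths are an assumption for illustration, not Mu's final format; real records would also carry a type tag and batch framing):

```rust
// Record layout: [series_id: u32 LE][ts_ms: i64 LE][value: f64 LE] = 20 bytes.

fn encode_record(id: u32, ts_ms: i64, value: f64) -> [u8; 20] {
    let mut buf = [0u8; 20];
    buf[0..4].copy_from_slice(&id.to_le_bytes());
    buf[4..12].copy_from_slice(&ts_ms.to_le_bytes());
    buf[12..20].copy_from_slice(&value.to_le_bytes());
    buf
}

fn decode_record(buf: &[u8; 20]) -> (u32, i64, f64) {
    (
        u32::from_le_bytes(buf[0..4].try_into().unwrap()),
        i64::from_le_bytes(buf[4..12].try_into().unwrap()),
        f64::from_le_bytes(buf[12..20].try_into().unwrap()),
    )
}
```

Replay becomes a fixed-stride scan dispatching on a 4-byte id instead of JSON parsing and string key lookups per line.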
10) Ops model: dual region redundancy, shard assignment, client buffering, partial results
What Gorilla did:
- Two independent regional instances and write replication to both.
- Paxos-based shard assignment.
- Clients buffer during shard moves.
- Partial results semantics in queries.
Why we want it:
- This is the path to "never drop a metric" at infrastructure scale.
Mu adaptation:
- Define the contracts now even if we do not implement the full system:
- Replication protocol between peers.
- Shard ownership and reassignment model.
- Query response includes completeness metadata.
Phase:
- v2+ (design first, implement later).
Additional Gorilla-derived features to consider
11) Hot-window cache with long-term source of truth
What Gorilla did:
- Keep the most recent ~26 hours in memory.
- Store long-term data elsewhere (HBase in their case).