TL;DR: OpenZL is a modern, lossless compression framework that learns the shape of your data and composes a fast pipeline of transforms + entropy coding to beat generic compressors on structured payloads, without shipping a bespoke decoder per format. When structure is weak, it falls back to a solid general-purpose path. 🚀


Why this matters (now)

Most production data isn't random bytes. It's tables, logs, metrics, events, feature vectors: highly structured. Classic tools (gzip, zstd, xz) treat everything the same and leave wins on the table. Teams hack around this with custom format-specific codecs, then drown in operational overhead (distribution, patching, compatibility, audits).

OpenZL aims to hedge both sides: format-aware gains with single-binary operability. One decoder. Many shapes. Fewer headaches. 🧩


The mental model: compression as a graph

OpenZL formalizes compression as a directed acyclic graph (DAG) of modular transforms and codecs. You describe (or OpenZL infers) how bytes map to fields; the framework splits the input into homogeneous streams and applies type-appropriate transforms before entropy coding.
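To make the model concrete, here is a minimal sketch of a plan as a graph: nodes are named transforms, edges say which stream feeds which, and running the plan is just a topological walk. The node names, transforms, and execution scheme are assumptions for illustration, not OpenZL's actual API.

# plan_sketch.py: "compression as a graph", illustrative only
from graphlib import TopologicalSorter

def delta(xs):            # decorrelate: store differences between neighbors
    return [xs[0]] + [b - a for a, b in zip(xs, xs[1:])]

def tokenize(xs):         # map repeated strings to small integer ids
    table = {}
    return [table.setdefault(x, len(table)) for x in xs]

# node name -> (transform, upstream node names)
plan = {
    "ts.delta":   (delta,    ["input.ts"]),
    "dev.tokens": (tokenize, ["input.dev"]),
}

def run_plan(plan, inputs):
    """Execute transforms over named streams in dependency order."""
    streams = dict(inputs)
    order = TopologicalSorter({k: set(v[1]) for k, v in plan.items()}).static_order()
    for node in order:
        if node in plan:
            fn, deps = plan[node]
            streams[node] = fn(*(streams[d] for d in deps))
    return streams

out = run_plan(plan, {"input.ts": [100, 101, 103], "input.dev": ["A13", "A13", "A14"]})
print(out["ts.delta"], out["dev.tokens"])   # [100, 1, 2] [0, 0, 1]

The point of the graph shape is that each output stream stays homogeneous, so the entropy coder that follows sees long runs of similar values rather than interleaved bytes.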

The core idea: keep the graph small but expressive. A few carefully chosen transforms, composed in the right order, often yield large wins at predictable CPU cost.


What "format-aware" looks like in practice

Imagine a telemetry table:

ts:   2025-10-13T08:00:00Z, 2025-10-13T08:00:01Z, ...
dev:  A13, A13, A13, A14, ...
temp: 24.1, 24.1, 24.2, 24.2, ...

A reasonable OpenZL plan would:

1) Columnize the rows → streams: ts[], dev[], temp[]
2) On ts[]: delta → varint → bitpack
3) On dev[]: tokenize (dictionary) → rle (if clustered) → bitpack
4) On temp[]: frame-of-reference → zigzag (for small deltas) → bitpack
5) Containerize the 3 streams in one frame with a compact header + recipe

End result: better compression and faster decode due to simpler, more cache-friendly data paths. 🏎️
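To see why those steps help, here is a minimal sketch in plain Python of the delta, zigzag, and varint ideas named in the plan above; the function names and exact encodings are assumptions for this sketch, not OpenZL internals. Delta turns slowly changing values into tiny ones, zigzag folds any negative deltas into small non-negative integers, and a varint then stores most of them in a single byte. The inverse pass shows the round trip is exact.

# Illustrative delta -> zigzag -> varint pipeline for the ts[] column above.
# Names and encodings are assumptions for this sketch, not OpenZL internals.

def delta_encode(xs):
    return [xs[0]] + [b - a for a, b in zip(xs, xs[1:])]

def delta_decode(ds):
    out = [ds[0]]
    for d in ds[1:]:
        out.append(out[-1] + d)
    return out

def zigzag(n):            # 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ...
    return (n << 1) ^ (n >> 63)

def unzigzag(z):
    return (z >> 1) ^ -(z & 1)

def varint_encode(n):     # LEB128-style: 7 payload bits per byte
    out = bytearray()
    while True:
        b, n = n & 0x7F, n >> 7
        out.append(b | 0x80 if n else b)
        if not n:
            return bytes(out)

def varint_decode(buf, pos):
    result = shift = 0
    while True:
        b = buf[pos]; pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7

# Timestamps from the telemetry example as epoch seconds, one per second.
ts = [1760342400, 1760342401, 1760342402, 1760342403]
payload = b"".join(varint_encode(zigzag(d)) for d in delta_encode(ts))
print(len(payload), "bytes instead of", 8 * len(ts), "for raw u64")   # 8 vs 32

# Exact inverse: varint -> unzigzag -> un-delta reproduces the input.
pos, deltas = 0, []
while pos < len(payload):
    z, pos = varint_decode(payload, pos)
    deltas.append(unzigzag(z))
assert delta_decode(deltas) == ts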


Frame layout (at a glance)

+-------------------+
| Magic + Version   |  -> future-proofing
+-------------------+
| Recipe (Plan DAG) |  -> operators, parameters, stream map
+-------------------+
| Stream Directory  |  -> offsets, checksums, types
+-------------------+
| Encoded Streams   |  -> one blob per field/column
+-------------------+
| Footer (CRC)      |  -> integrity
+-------------------+

The recipe is small (think a few hundred bytes for most datasets) and lets the universal decoder execute exactly the pipeline used at compress-time.
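As an illustration of how a decoder would walk such a container, here is a hypothetical reader for a frame shaped like the diagram above. The magic value, field widths, and directory layout are assumptions for this sketch; they are not OpenZL's actual on-disk format.

# Hypothetical reader for a frame laid out like the diagram above.
# Magic, field widths, and directory layout are invented for this sketch.
import struct, zlib

def read_frame(buf: bytes):
    magic, version = struct.unpack_from("<4sH", buf, 0)
    assert magic == b"OZL0", "not a frame we understand"
    pos = 6

    # Recipe: length-prefixed blob describing the plan DAG.
    (recipe_len,) = struct.unpack_from("<I", buf, pos); pos += 4
    recipe = buf[pos:pos + recipe_len]; pos += recipe_len

    # Stream directory: count, then (offset, length, crc32) per stream.
    (n_streams,) = struct.unpack_from("<I", buf, pos); pos += 4
    directory = []
    for _ in range(n_streams):
        off, length, crc = struct.unpack_from("<QQI", buf, pos); pos += 20
        directory.append((off, length, crc))

    # Encoded streams, verified against their directory checksums.
    streams = []
    for off, length, crc in directory:
        blob = buf[off:off + length]
        assert zlib.crc32(blob) == crc, "corrupt stream"
        streams.append(blob)

    # Footer: CRC over everything before it.
    (footer_crc,) = struct.unpack_from("<I", buf, len(buf) - 4)
    assert zlib.crc32(buf[:-4]) == footer_crc, "corrupt frame"
    return version, recipe, streams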


Performance model: what to expect

Reality check: you won't win on already-compressed payloads (JPEG, MP4, ZIP). Don't waste CPU. ❌
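One cheap way to act on that advice is to probe the payload before committing CPU. The check below is a common heuristic, not an OpenZL feature: estimate byte-level entropy on a small sample, and route near-random inputs (already-compressed data sits close to 8 bits/byte) to a store-as-is or generic path.

# Heuristic pre-check (not an OpenZL feature): skip heavy compression when a
# sample of the payload already looks close to random (~8 bits per byte).
import math
from collections import Counter

def sample_entropy_bits_per_byte(data: bytes, sample: int = 1 << 16) -> float:
    chunk = data[:sample]
    if not chunk:
        return 0.0
    n = len(chunk)
    return -sum(c / n * math.log2(c / n) for c in Counter(chunk).values())

def choose_path(data: bytes) -> str:
    if sample_entropy_bits_per_byte(data) > 7.9:   # JPEG/MP4/ZIP land here
        return "store"                             # don't burn CPU on it
    return "openzl"

print(choose_path(b"A13,24.1\n" * 10_000))         # low entropy -> "openzl"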


When to use OpenZL (and when not)

Great fit ✅: structured payloads (tables, logs, metrics, events, feature vectors) and anything else where a schema or a thin parser is feasible.

Probably skip ⚠️: already-compressed payloads (JPEG, MP4, ZIP) and messy, low-structure text, where you'd only get the generic fallback anyway.


Quick start (CLI)

The commands below illustrate the developer workflow. Adapt names/paths to your setup.

1) Compress, generic path

# Single file → .ozl frame using the default (safe) plan
openzl compress input.bin -o input.bin.ozl

2) Teach OpenZL your data shape (simple schema)

# schema.yaml (skeletal)
type: record
fields:
  - name: ts
    type: u64
    transforms: [delta, varint, bitpack]
  - name: dev
    type: string
    transforms: [tokenize, rle, bitpack]
  - name: temp
    type: f32
    transforms: [for, zigzag, bitpack]

openzl compress telemetry.csv --schema schema.yaml -o telemetry.ozl

3) Train a plan (find the best speed/ratio frontier)

# Explore candidate graphs and emit a pinned plan
openzl train --schema schema.yaml --input shard/*.csv --out plan.json

# Use that plan for deterministic builds
openzl compress telemetry.csv --plan plan.json -o telemetry.ozl

4) Decompress anywhere (universal decoder)

openzl decompress telemetry.ozl -o telemetry.csv

If your consumer only needs a subset of columns, project at decode time:

openzl decompress telemetry.ozl --columns ts,dev -o two-cols.csv

Nice side-effect: column projection reduces I/O and speeds up downstream jobs. ⚡️


Integration playbook (pragmatic and boring, on purpose)

1) Pick one high-volume lane (e.g., analytics events).
2) Model the schema or write a thin parser. Keep it brutally simple first.
3) Run train on a representative shard; pin the resulting plan.json.
4) Benchmark on your real SLOs (throughput, CPU, size); a rough harness is sketched after this list.
5) Ship: adopt .ozl frames in storage + streaming paths; standardize the decoder everywhere.
6) Iterate plans as data drifts; no decoder redeploys needed. 🔁
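For step 4, a rough harness could look like the sketch below. It reuses the CLI invocations from the quick start; the shard layout, the pinned plan.json, and the zstd baseline flags are assumptions to adapt to your environment.

# Rough benchmark harness for step 4 of the playbook (illustrative only).
import filecmp, subprocess, time
from pathlib import Path

def timed(cmd):
    t0 = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - t0

for shard in sorted(Path("shard").glob("*.csv")):
    raw = shard.stat().st_size

    # OpenZL with the pinned plan (see "train" in the quick start).
    t_ozl = timed(["openzl", "compress", str(shard), "--plan", "plan.json",
                   "-o", f"{shard}.ozl"])
    ozl = Path(f"{shard}.ozl").stat().st_size

    # Generic baseline for comparison.
    t_zstd = timed(["zstd", "-q", "-f", "-19", str(shard), "-o", f"{shard}.zst"])
    zst = Path(f"{shard}.zst").stat().st_size

    # Sanity: the universal decoder must reproduce the input byte-for-byte.
    subprocess.run(["openzl", "decompress", f"{shard}.ozl",
                    "-o", f"{shard}.roundtrip"], check=True)
    assert filecmp.cmp(shard, f"{shard}.roundtrip", shallow=False)

    print(f"{shard.name}: raw {raw} B | ozl {ozl} B ({t_ozl:.2f}s) "
          f"| zstd {zst} B ({t_zstd:.2f}s)")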


Safety, compatibility, and ops

Pro tip: treat your plan like a binary: review it, version it, and roll it out behind feature flags. 🛡️
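One way to operationalize that tip, with file names and config shape invented for this sketch (none of this is an OpenZL feature): pin the reviewed plan.json by digest, and only enable it for lanes that have the flag turned on.

# Sketch of "treat your plan like a binary": pin a reviewed plan.json by digest
# and gate it behind a rollout flag. Names and config shape are assumptions.
import hashlib
from pathlib import Path

def digest(path: str) -> str:
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# Checked-in rollout config, reviewed alongside the plan itself.
rollout = {
    "plan": "plan.json",
    "sha256": "<digest recorded at review time>",
    "enabled_for": {"telemetry-shadow"},          # feature-flagged lanes
}

def plan_for(lane: str):
    """Return the pinned plan path for a lane, or None for the generic path."""
    if lane not in rollout["enabled_for"]:
        return None
    if digest(rollout["plan"]) != rollout["sha256"]:
        raise RuntimeError("plan.json does not match the reviewed digest")
    return rollout["plan"]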


FAQ

Is OpenZL lossless?
Yes. It's byte-for-byte reversible. (Transforms are decorrelation only; entropy coding is exact.)

Do I need a schema?
No, but a schema (or even a partial one) unlocks the big wins. Otherwise you get the safe generic path.

Will I beat zstd on everything?
No. You'll shine on structured data; you'll break even (or fall back) on messy text or already-compressed inputs.

What's the migration risk?
Low, if you pin plans and run shadow traffic first. The universal decoder keeps the ops surface small. 🙂


Closing thoughts

Compression is moving from "one level fits all" to data-aware pipelines. OpenZL leans into that reality: take a small amount of structure, compose a simple graph, and harvest predictable, repeatable wins. The decoder cost stays flat; the plan does the heavy lifting. If your payloads are structured (and most are), OpenZL is worth a serious trial. 💡
