Storage Format

This document describes the storage layout and file formats used by OSO Kafka Backup.

Directory Structure

Backups are organized in a hierarchical directory structure:

<storage-root>/
├── <backup-id>/
│   ├── manifest.json                      # Backup metadata
│   ├── topics/
│   │   ├── <topic-name>/
│   │   │   ├── <partition>/
│   │   │   │   ├── segment-00000000.kbs   # Segment files
│   │   │   │   ├── segment-00000001.kbs
│   │   │   │   ├── segment-00000002.kbs
│   │   │   │   └── ...
│   │   │   └── ...
│   │   └── ...
│   └── checkpoints/
│       └── checkpoint.json                # Progress checkpoint
└── <another-backup-id>/
    └── ...
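
As a minimal sketch of how these paths compose (the function names and the storage_root parameter are illustrative, not part of the tool's API):

from pathlib import Path

def segment_path(storage_root: str, backup_id: str, topic: str,
                 partition: int, index: int) -> Path:
    # Segment files use a zero-padded 8-digit index, e.g. segment-00000042.kbs
    return (Path(storage_root) / backup_id / "topics" / topic
            / str(partition) / f"segment-{index:08d}.kbs")

def manifest_path(storage_root: str, backup_id: str) -> Path:
    return Path(storage_root) / backup_id / "manifest.json"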

Manifest File

The manifest.json file contains backup metadata:

{
  "version": "1.0",
  "backup_id": "production-backup-001",
  "created_at": "2024-12-03T10:00:00Z",
  "completed_at": "2024-12-03T10:15:32Z",
  "source_cluster": {
    "cluster_id": "prod-cluster-east",
    "bootstrap_servers": ["broker-1:9092", "broker-2:9092"]
  },
  "compression": {
    "algorithm": "zstd",
    "level": 3
  },
  "statistics": {
    "topics": 5,
    "partitions": 24,
    "segments": 156,
    "records": 2456789,
    "uncompressed_bytes": 1288490188,
    "compressed_bytes": 262832640
  },
  "time_range": {
    "earliest_timestamp": 1701388800000,
    "latest_timestamp": 1701475199000
  },
  "topics": [
    {
      "name": "orders",
      "partitions": [
        {
          "partition": 0,
          "first_offset": 0,
          "last_offset": 150233,
          "records": 150234,
          "segments": 12
        },
        {
          "partition": 1,
          "first_offset": 0,
          "last_offset": 148891,
          "records": 148892,
          "segments": 11
        }
      ]
    }
  ]
}
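
The manifest is plain JSON, so it can be inspected with standard tooling. A short Python sketch that reads the example above and derives the overall compression ratio (roughly 4.9x for these statistics):

import json

with open("manifest.json") as f:
    manifest = json.load(f)

stats = manifest["statistics"]
ratio = stats["uncompressed_bytes"] / stats["compressed_bytes"]
print(f"{stats['records']} records in {stats['segments']} segments, "
      f"compression ratio {ratio:.1f}x")

# Per-partition offset ranges from the "topics" array
for topic in manifest["topics"]:
    for p in topic["partitions"]:
        print(topic["name"], p["partition"], p["first_offset"], p["last_offset"])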

Manifest Fields

Field                              Type    Description
version                            string  Manifest format version
backup_id                          string  Unique backup identifier
created_at                         string  ISO 8601 timestamp when backup started
completed_at                       string  ISO 8601 timestamp when backup completed
source_cluster.cluster_id          string  Source cluster identifier
source_cluster.bootstrap_servers   array   Source broker addresses
compression.algorithm              string  Compression algorithm used
compression.level                  int     Compression level
statistics.topics                  int     Number of topics backed up
statistics.partitions              int     Number of partitions
statistics.segments                int     Number of segment files
statistics.records                 int     Total records
statistics.uncompressed_bytes      int     Original data size
statistics.compressed_bytes        int     Compressed data size
time_range.earliest_timestamp      int     Earliest message timestamp (Unix ms)
time_range.latest_timestamp        int     Latest message timestamp (Unix ms)

Segment Files

Segment files (.kbs - Kafka Backup Segment) contain the actual message data.

Segment File Format

┌───────────────────────────┐
│ Segment Header (32 bytes) │
├───────────────────────────┤
│ Record Batch 1            │
├───────────────────────────┤
│ Record Batch 2            │
├───────────────────────────┤
│ ...                       │
├───────────────────────────┤
│ Record Batch N            │
├───────────────────────────┤
│ Segment Footer (16 bytes) │
└───────────────────────────┘

Segment Header

Offset  Size  Field         Description
0       4     Magic         Magic bytes: KBS1
4       1     Version       Format version
5       1     Compression   Compression algorithm (0=none, 1=lz4, 2=zstd)
6       2     Flags         Reserved flags
8       8     First Offset  First record offset in segment
16      8     Record Count  Number of records in segment
24      8     Checksum      CRC64 checksum of segment data
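
The header can be decoded with a fixed-size struct. The sketch below assumes big-endian (network) byte order, which the table above does not state; adjust if the tool uses little-endian:

import struct

SEGMENT_HEADER = struct.Struct(">4sBBHQQQ")  # 4+1+1+2+8+8+8 = 32 bytes

def read_segment_header(path: str) -> dict:
    with open(path, "rb") as f:
        raw = f.read(SEGMENT_HEADER.size)
    magic, version, compression, flags, first_offset, records, checksum = \
        SEGMENT_HEADER.unpack(raw)
    if magic != b"KBS1":
        raise ValueError(f"not a .kbs segment: magic={magic!r}")
    return {
        "version": version,
        "compression": compression,  # 0=none, 1=lz4, 2=zstd
        "first_offset": first_offset,
        "record_count": records,
        "checksum": checksum,        # CRC64 of segment data
    }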

Record Batch

Each record batch contains:

┌───────────────────────────┐
│ Batch Header (24 bytes)   │
├───────────────────────────┤
│ Compressed Records        │
└───────────────────────────┘

Batch Header

Offset  Size  Field              Description
0       8     Base Offset        First offset in batch
8       4     Record Count       Number of records
12      4     Compressed Size    Size of compressed data
16      4     Uncompressed Size  Original data size
20      4     CRC32              Checksum of compressed data
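
A matching sketch for reading one batch and checking its CRC32 before decompressing (again assuming big-endian encoding; the caller is responsible for stopping before the 16-byte segment footer):

import struct
import zlib

BATCH_HEADER = struct.Struct(">QIIII")  # 8+4+4+4+4 = 24 bytes

def read_batch(f):
    raw = f.read(BATCH_HEADER.size)
    base_offset, record_count, comp_size, uncomp_size, crc = \
        BATCH_HEADER.unpack(raw)
    payload = f.read(comp_size)
    # The CRC32 covers the compressed payload, so corruption is caught
    # before any decompression work is done.
    if zlib.crc32(payload) & 0xFFFFFFFF != crc:
        raise ValueError(f"batch with base offset {base_offset}: CRC32 mismatch")
    return base_offset, record_count, uncomp_size, payload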

Record Format

Records are stored in a compact binary format compatible with Kafka's record format:

┌───────────────────────────┐
│ Offset Delta (varint)     │
├───────────────────────────┤
│ Timestamp Delta (varint)  │
├───────────────────────────┤
│ Key Length (varint)       │
├───────────────────────────┤
│ Key Data (bytes)          │
├───────────────────────────┤
│ Value Length (varint)     │
├───────────────────────────┤
│ Value Data (bytes)        │
├───────────────────────────┤
│ Headers Count (varint)    │
├───────────────────────────┤
│ Headers (key-value pairs) │
└───────────────────────────┘
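
A decoding sketch under two stated assumptions: lengths and deltas are plain unsigned LEB128 varints. Kafka's own record format zigzag-encodes signed varints and uses -1 for null keys, so a faithful reader may need those extra steps:

def read_varint(buf: bytes, pos: int):
    # Unsigned LEB128: 7 bits per byte, high bit = continuation
    shift = value = 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if not b & 0x80:
            return value, pos
        shift += 7

def read_record(buf: bytes, pos: int):
    offset_delta, pos = read_varint(buf, pos)
    timestamp_delta, pos = read_varint(buf, pos)
    key_len, pos = read_varint(buf, pos)
    key, pos = buf[pos:pos + key_len], pos + key_len
    value_len, pos = read_varint(buf, pos)
    value, pos = buf[pos:pos + value_len], pos + value_len
    header_count, pos = read_varint(buf, pos)
    headers = []
    for _ in range(header_count):
        klen, pos = read_varint(buf, pos)
        hkey, pos = buf[pos:pos + klen], pos + klen
        vlen, pos = read_varint(buf, pos)
        hval, pos = buf[pos:pos + vlen], pos + vlen
        headers.append((hkey, hval))
    return (offset_delta, timestamp_delta, key, value, headers), pos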

Offset Headers

When include_offset_headers: true is set, each record includes the following headers:

Header Key                Value   Description
x-kafka-backup-offset     string  Original Kafka offset
x-kafka-backup-timestamp  string  Original timestamp (Unix ms)
x-kafka-backup-partition  string  Original partition number

Checkpoint File

The checkpoint.json file tracks backup progress for resumable operations:

{
  "backup_id": "production-backup-001",
  "timestamp": "2024-12-03T10:05:00Z",
  "topics": {
    "orders": {
      "0": {
        "offset": 75234,
        "timestamp": 1701432000000,
        "segment": "segment-00000005.kbs",
        "segment_offset": 12456
      },
      "1": {
        "offset": 74123,
        "timestamp": 1701432000000,
        "segment": "segment-00000005.kbs",
        "segment_offset": 11234
      }
    }
  }
}
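
Resuming means starting each partition at the offset after the last one recorded. A sketch (the resume_offsets helper is illustrative, not part of the tool):

import json

def resume_offsets(checkpoint_path: str) -> dict:
    with open(checkpoint_path) as f:
        cp = json.load(f)
    resume = {}
    for topic, partitions in cp["topics"].items():
        for partition, state in partitions.items():
            # "offset" is the last offset already backed up, so the
            # backup resumes from the next one.
            resume[(topic, int(partition))] = state["offset"] + 1
    return resume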

Checkpoint Fields

Field                                       Description
backup_id                                   Backup being checkpointed
timestamp                                   When checkpoint was created
topics.<topic>.<partition>.offset           Last backed up offset
topics.<topic>.<partition>.timestamp        Last message timestamp
topics.<topic>.<partition>.segment          Current segment file
topics.<topic>.<partition>.segment_offset   Position in segment

S3 Object Layout

When using S3 storage, the structure maps to object keys:

s3://bucket/prefix/
├── backup-001/
│   ├── manifest.json
│   ├── topics/orders/0/segment-00000000.kbs
│   ├── topics/orders/0/segment-00000001.kbs
│   ├── topics/orders/1/segment-00000000.kbs
│   └── ...
└── backup-002/
    └── ...
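
Because of this layout, a segment's local path maps one-to-one onto its object key. A sketch of that mapping with a hypothetical bucket and prefix (the tool manages uploads itself; this only illustrates the key structure and the StorageClass option discussed below):

import boto3

def s3_key(prefix: str, backup_id: str, topic: str,
           partition: int, segment: str) -> str:
    return f"{prefix}/{backup_id}/topics/{topic}/{partition}/{segment}"

s3 = boto3.client("s3")
s3.upload_file(
    "/data/backup-001/topics/orders/0/segment-00000000.kbs",
    "my-backup-bucket",  # hypothetical bucket
    s3_key("kafka-backups", "backup-001", "orders", 0, "segment-00000000.kbs"),
    ExtraArgs={"StorageClass": "STANDARD_IA"},
)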

S3 Storage Classes

Backups can be stored in different S3 storage classes for cost optimization:

Storage Class  Use Case
STANDARD       Frequently accessed backups
STANDARD_IA    Infrequent access (30+ days)
GLACIER_IR     Archive with instant retrieval
GLACIER        Long-term archive

Compression

Zstandard (zstd)

The default compression algorithm. Zstandard provides an excellent compression ratio with fast decompression.

Level  Ratio    Speed      Use Case
1-3    Good     Very Fast  Default, balanced
4-9    Better   Fast       Higher compression
10-19  Best     Moderate   Maximum compression
20-22  Optimal  Slow       Archive
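
For reference, round-tripping data at the default level with the third-party zstandard package looks like this (the payload is made up):

import zstandard as zstd

compressor = zstd.ZstdCompressor(level=3)  # the tool's default level
payload = b'{"order_id": 12345, "status": "shipped"}' * 1000

compressed = compressor.compress(payload)
assert zstd.ZstdDecompressor().decompress(compressed) == payload
print(f"{len(payload)} -> {len(compressed)} bytes")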

LZ4

Extremely fast compression with moderate ratio.

Mode     Ratio     Speed    Use Case
Default  Moderate  Fastest  High-throughput backups

No Compression

Use compression: none when:

  • Data is already compressed
  • Maximum restore speed is critical
  • Storage cost is not a concern

Integrity Verification

Checksums

Every segment includes:

  • Segment checksum: CRC64 of entire segment
  • Batch checksum: CRC32 of each record batch
  • Record checksum: Optional per-record verification
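
As an illustration of batch-level checking, the sketch below walks a segment's batch region (header and footer already stripped by the caller) and verifies each CRC32 with the standard library; the segment-level CRC64 would need a third-party CRC64 implementation and is omitted here:

import struct
import zlib

BATCH_HEADER = struct.Struct(">QIIII")  # big-endian assumed, as earlier

def verify_batches(region: bytes) -> int:
    pos = checked = 0
    while pos < len(region):
        base, count, comp_size, uncomp_size, crc = \
            BATCH_HEADER.unpack_from(region, pos)
        start = pos + BATCH_HEADER.size
        if zlib.crc32(region[start:start + comp_size]) & 0xFFFFFFFF != crc:
            raise ValueError(f"corrupt batch at base offset {base}")
        pos = start + comp_size
        checked += 1
    return checked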

Validation

# Quick validation (check metadata and structure)
kafka-backup validate --path /data --backup-id backup-001

# Deep validation (read and verify all data)
kafka-backup validate --path /data --backup-id backup-001 --deep

Compatibility

Forward Compatibility

New versions can read backups created by older versions.

Backward Compatibility

Older versions may not read backups created by newer versions if the manifest version is higher.

Version Matrix

Backup Version  Reader Version  Compatible
1.0             1.0+            Yes
1.1             1.0             Limited*
1.1             1.1+            Yes

*Limited: Core data readable, new features unavailable
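
The rule behind the matrix reduces to a comparison of major.minor versions; a hypothetical check:

def fully_compatible(backup_version: str, reader_version: str) -> bool:
    # A reader fully supports any backup whose manifest version is not
    # newer than the reader; newer manifests are at best partially readable.
    def parse(v: str):
        return tuple(int(part) for part in v.split("."))
    return parse(backup_version) <= parse(reader_version)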