# Storage Format

This document describes the storage layout and file formats used by OSO Kafka Backup.

## Directory Structure

Backups are organized in a hierarchical directory structure:
```
<storage-root>/
├── <backup-id>/
│   ├── manifest.json                      # Backup metadata
│   ├── topics/
│   │   ├── <topic-name>/
│   │   │   ├── <partition>/
│   │   │   │   ├── segment-00000000.kbs   # Segment files
│   │   │   │   ├── segment-00000001.kbs
│   │   │   │   ├── segment-00000002.kbs
│   │   │   │   └── ...
│   │   │   └── ...
│   │   └── ...
│   └── checkpoints/
│       └── checkpoint.json                # Progress checkpoint
└── <another-backup-id>/
    └── ...
```
## Manifest File

The `manifest.json` file contains backup metadata:
```json
{
  "version": "1.0",
  "backup_id": "production-backup-001",
  "created_at": "2024-12-03T10:00:00Z",
  "completed_at": "2024-12-03T10:15:32Z",
  "source_cluster": {
    "cluster_id": "prod-cluster-east",
    "bootstrap_servers": ["broker-1:9092", "broker-2:9092"]
  },
  "compression": {
    "algorithm": "zstd",
    "level": 3
  },
  "statistics": {
    "topics": 5,
    "partitions": 24,
    "segments": 156,
    "records": 2456789,
    "uncompressed_bytes": 1288490188,
    "compressed_bytes": 262832640
  },
  "time_range": {
    "earliest_timestamp": 1701388800000,
    "latest_timestamp": 1701475199000
  },
  "topics": [
    {
      "name": "orders",
      "partitions": [
        {
          "partition": 0,
          "first_offset": 0,
          "last_offset": 150233,
          "records": 150234,
          "segments": 12
        },
        {
          "partition": 1,
          "first_offset": 0,
          "last_offset": 148891,
          "records": 148892,
          "segments": 11
        }
      ]
    }
  ]
}
```
### Manifest Fields

| Field | Type | Description |
|---|---|---|
| `version` | string | Manifest format version |
| `backup_id` | string | Unique backup identifier |
| `created_at` | string | ISO 8601 timestamp when the backup started |
| `completed_at` | string | ISO 8601 timestamp when the backup completed |
| `source_cluster.cluster_id` | string | Source cluster identifier |
| `source_cluster.bootstrap_servers` | array | Source broker addresses |
| `compression.algorithm` | string | Compression algorithm used |
| `compression.level` | int | Compression level |
| `statistics.topics` | int | Number of topics backed up |
| `statistics.partitions` | int | Number of partitions |
| `statistics.segments` | int | Number of segment files |
| `statistics.records` | int | Total records |
| `statistics.uncompressed_bytes` | int | Original data size |
| `statistics.compressed_bytes` | int | Compressed data size |
| `time_range.earliest_timestamp` | int | Earliest message timestamp (Unix ms) |
| `time_range.latest_timestamp` | int | Latest message timestamp (Unix ms) |
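
Since the manifest is plain JSON, it can be inspected with any JSON tooling. A minimal sketch in Python (the file path is hypothetical):

```python
import json

# Load a backup manifest and print a one-line summary
# (the path below is hypothetical).
with open("/data/production-backup-001/manifest.json") as f:
    manifest = json.load(f)

stats = manifest["statistics"]
ratio = stats["uncompressed_bytes"] / stats["compressed_bytes"]
print(f"{manifest['backup_id']}: {stats['records']:,} records, "
      f"{stats['partitions']} partitions, "
      f"{ratio:.1f}x {manifest['compression']['algorithm']} compression")
```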
## Segment Files

Segment files (`.kbs`, Kafka Backup Segment) contain the actual message data.

### Segment File Format
```
┌─────────────────────────────────────────┐
│ Segment Header (32 bytes)               │
├─────────────────────────────────────────┤
│ Record Batch 1                          │
├─────────────────────────────────────────┤
│ Record Batch 2                          │
├─────────────────────────────────────────┤
│ ...                                     │
├─────────────────────────────────────────┤
│ Record Batch N                          │
├─────────────────────────────────────────┤
│ Segment Footer (16 bytes)               │
└─────────────────────────────────────────┘
```
### Segment Header
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 4 | Magic | Magic bytes: KBS1 |
| 4 | 1 | Version | Format version |
| 5 | 1 | Compression | Compression algorithm (0=none, 1=lz4, 2=zstd) |
| 6 | 2 | Flags | Reserved flags |
| 8 | 8 | First Offset | First record offset in segment |
| 16 | 8 | Record Count | Number of records in segment |
| 24 | 8 | Checksum | CRC64 checksum of segment data |
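
The header maps onto a fixed 32-byte struct. A minimal parsing sketch in Python; big-endian field encoding is an assumption, since the byte order is not specified in the table above:

```python
import struct
from collections import namedtuple

# 4s magic + B version + B compression + H flags
# + Q first_offset + Q record_count + Q checksum = 32 bytes.
# Big-endian is assumed; the table above does not state a byte order.
SEGMENT_HEADER = struct.Struct(">4sBBHQQQ")

SegmentHeader = namedtuple(
    "SegmentHeader",
    "magic version compression flags first_offset record_count checksum",
)

def read_segment_header(f) -> SegmentHeader:
    hdr = SegmentHeader(*SEGMENT_HEADER.unpack(f.read(SEGMENT_HEADER.size)))
    if hdr.magic != b"KBS1":
        raise ValueError(f"not a .kbs segment (magic={hdr.magic!r})")
    return hdr
```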
### Record Batch

Each record batch contains:
```
┌─────────────────────────────────────────┐
│ Batch Header (24 bytes)                 │
├─────────────────────────────────────────┤
│ Compressed Records                      │
└─────────────────────────────────────────┘
```
### Batch Header
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | Base Offset | First offset in batch |
| 8 | 4 | Record Count | Number of records |
| 12 | 4 | Compressed Size | Size of compressed data |
| 16 | 4 | Uncompressed Size | Original data size |
| 20 | 4 | CRC32 | Checksum of compressed data |
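
Because each batch header carries the compressed size, batches can be walked sequentially until the 16-byte footer. A sketch, again assuming big-endian encoding and an open file positioned just past the segment header:

```python
import os
import struct
from collections import namedtuple

# q base_offset + i record_count + i compressed_size
# + i uncompressed_size + I crc32 = 24 bytes (big-endian assumed).
BATCH_HEADER = struct.Struct(">qiiiI")

BatchHeader = namedtuple(
    "BatchHeader",
    "base_offset record_count compressed_size uncompressed_size crc32",
)

def iter_batches(f):
    """Yield (header, compressed_payload) for each batch in a segment,
    stopping before the 16-byte segment footer."""
    data_end = os.fstat(f.fileno()).st_size - 16
    while f.tell() < data_end:
        header = BatchHeader(*BATCH_HEADER.unpack(f.read(BATCH_HEADER.size)))
        yield header, f.read(header.compressed_size)
```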
### Record Format

Records are stored in a compact binary format compatible with Kafka's record format:
```
┌─────────────────────────────────────────┐
│ Offset Delta (varint)                   │
├─────────────────────────────────────────┤
│ Timestamp Delta (varint)                │
├─────────────────────────────────────────┤
│ Key Length (varint)                     │
├─────────────────────────────────────────┤
│ Key Data (bytes)                        │
├─────────────────────────────────────────┤
│ Value Length (varint)                   │
├─────────────────────────────────────────┤
│ Value Data (bytes)                      │
├─────────────────────────────────────────┤
│ Headers Count (varint)                  │
├─────────────────────────────────────────┤
│ Headers (key-value pairs)               │
└─────────────────────────────────────────┘
```
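
Since the format is Kafka-compatible, a reasonable assumption is that the varints are zigzag-encoded and that a length of -1 denotes a null key or value, as in Kafka's own record format; the exact encoding is not spelled out above. A decoding sketch under those assumptions:

```python
def read_varint(buf: bytes, pos: int):
    """Decode a zigzag-encoded varint (Kafka-style); return (value, next_pos)."""
    shift = result = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1), pos  # zigzag decode

def read_bytes(buf: bytes, pos: int):
    """Read a varint length followed by that many bytes; -1 means null."""
    length, pos = read_varint(buf, pos)
    if length < 0:
        return None, pos
    return buf[pos:pos + length], pos + length

def read_record(buf: bytes, pos: int):
    offset_delta, pos = read_varint(buf, pos)
    timestamp_delta, pos = read_varint(buf, pos)
    key, pos = read_bytes(buf, pos)
    value, pos = read_bytes(buf, pos)
    header_count, pos = read_varint(buf, pos)
    headers = []
    for _ in range(header_count):
        hkey, pos = read_bytes(buf, pos)   # header keys assumed UTF-8
        hval, pos = read_bytes(buf, pos)
        headers.append((hkey.decode(), hval))
    return (offset_delta, timestamp_delta, key, value, headers), pos
```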
### Offset Header

When `include_offset_headers: true` is set, each record carries the following headers:

| Header Key | Value | Description |
|---|---|---|
| `x-kafka-backup-offset` | string | Original Kafka offset |
| `x-kafka-backup-timestamp` | string | Original timestamp (Unix ms) |
| `x-kafka-backup-partition` | string | Original partition number |
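
On restore, a consumer can recover a record's original position from these headers. A hypothetical helper:

```python
def original_position(headers):
    """Map backup headers back to (partition, offset, timestamp).

    `headers` is a list of (key, value) pairs with bytes values,
    as in the record format above.
    """
    h = {key: value.decode() for key, value in headers}
    return (int(h["x-kafka-backup-partition"]),
            int(h["x-kafka-backup-offset"]),
            int(h["x-kafka-backup-timestamp"]))
```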
## Checkpoint File

The `checkpoint.json` file tracks backup progress for resumable operations:
```json
{
  "backup_id": "production-backup-001",
  "timestamp": "2024-12-03T10:05:00Z",
  "topics": {
    "orders": {
      "0": {
        "offset": 75234,
        "timestamp": 1701432000000,
        "segment": "segment-00000005.kbs",
        "segment_offset": 12456
      },
      "1": {
        "offset": 74123,
        "timestamp": 1701432000000,
        "segment": "segment-00000005.kbs",
        "segment_offset": 11234
      }
    }
  }
}
```
### Checkpoint Fields

| Field | Description |
|---|---|
| `backup_id` | Backup being checkpointed |
| `timestamp` | When the checkpoint was created |
| `topics.<topic>.<partition>.offset` | Last backed-up offset |
| `topics.<topic>.<partition>.timestamp` | Last message timestamp |
| `topics.<topic>.<partition>.segment` | Current segment file |
| `topics.<topic>.<partition>.segment_offset` | Position in the segment |
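
A resumed backup only needs the next offset per partition. A sketch, assuming `offset` records the last offset already written, so resumption starts at `offset + 1`:

```python
import json

def resume_offsets(checkpoint_path: str):
    """Return {(topic, partition): next_offset} for resuming a backup."""
    with open(checkpoint_path) as f:
        checkpoint = json.load(f)
    return {
        (topic, int(partition)): state["offset"] + 1  # resume after last offset
        for topic, partitions in checkpoint["topics"].items()
        for partition, state in partitions.items()
    }
```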
## S3 Object Layout
When using S3 storage, the structure maps to object keys:
```
s3://bucket/prefix/
├── backup-001/
│   ├── manifest.json
│   ├── topics/orders/0/segment-00000000.kbs
│   ├── topics/orders/0/segment-00000001.kbs
│   ├── topics/orders/1/segment-00000000.kbs
│   └── ...
└── backup-002/
    └── ...
```
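
Because segments share a predictable key prefix, one backup's objects can be enumerated with a single prefixed listing. A sketch using boto3 (bucket and prefix names are placeholders):

```python
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Enumerate every segment file for one backup (names are placeholders).
for page in paginator.paginate(Bucket="bucket",
                               Prefix="prefix/backup-001/topics/"):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith(".kbs"):
            print(obj["Key"], obj["Size"])
```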
### S3 Storage Classes

Backups can be stored in different S3 storage classes for cost optimization:

| Storage Class | Use Case |
|---|---|
| `STANDARD` | Frequently accessed backups |
| `STANDARD_IA` | Infrequent access (30+ days) |
| `GLACIER_IR` | Archive with instant retrieval |
| `GLACIER` | Long-term archive |
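
The storage class is set per object at upload time. A hedged boto3 sketch (whether the tool sets this itself or defers to bucket lifecycle rules is not specified here):

```python
import boto3

s3 = boto3.client("s3")

# Upload one segment into an infrequent-access class (names are placeholders).
with open("segment-00000000.kbs", "rb") as f:
    s3.put_object(
        Bucket="bucket",
        Key="prefix/backup-001/topics/orders/0/segment-00000000.kbs",
        Body=f,
        StorageClass="STANDARD_IA",
    )
```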
## Compression

### Zstandard (zstd)

Default compression algorithm. Provides an excellent compression ratio with fast decompression.
| Level | Ratio | Speed | Use Case |
|---|---|---|---|
| 1-3 | Good | Very fast | Default, balanced |
| 4-9 | Better | Fast | Higher compression |
| 10-19 | Best | Moderate | Maximum compression |
| 20-22 | Maximum | Slow | Archive |
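
The manifest example above uses zstd level 3. Round-tripping a batch payload with the `zstandard` Python package looks roughly like this:

```python
import zstandard

raw_batch = b"example record batch payload" * 100  # stand-in for serialized records

# Compress at level 3, as in the manifest example.
compressor = zstandard.ZstdCompressor(level=3)
compressed = compressor.compress(raw_batch)

# Restore and confirm the round trip.
decompressor = zstandard.ZstdDecompressor()
restored = decompressor.decompress(compressed)
assert restored == raw_batch
```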
### LZ4

Extremely fast compression with a moderate ratio.
| Mode | Ratio | Speed | Use Case |
|---|---|---|---|
| Default | Moderate | Fastest | High-throughput backups |
### No Compression

Use `compression: none` when:
- Data is already compressed
- Maximum restore speed is critical
- Storage cost is not a concern
## Integrity Verification

### Checksums

Every segment includes:
- Segment checksum: CRC64 of the entire segment
- Batch checksum: CRC32 of each record batch
- Record checksum: optional per-record verification
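
Batch checksums use the standard CRC32 and can be verified directly; the CRC64 polynomial used for segment checksums is not specified in this document, so it is omitted from this sketch. Assuming the `BatchHeader` fields shown earlier:

```python
import zlib

def verify_batch(header, compressed_payload: bytes) -> None:
    """Compare a batch's stored CRC32 against its compressed payload."""
    actual = zlib.crc32(compressed_payload) & 0xFFFFFFFF
    if actual != header.crc32:
        raise ValueError(
            f"batch at offset {header.base_offset}: CRC mismatch "
            f"(stored {header.crc32:#010x}, computed {actual:#010x})")
```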
### Validation

```bash
# Quick validation (check metadata and structure)
kafka-backup validate --path /data --backup-id backup-001

# Deep validation (read and verify all data)
kafka-backup validate --path /data --backup-id backup-001 --deep
```
## Compatibility

### Backward Compatibility

New versions can read backups created by older versions.

### Forward Compatibility

Older versions may not be able to read backups created by newer versions if the manifest version is higher than the reader supports.
### Version Matrix
| Backup Version | Reader Version | Compatible |
|---|---|---|
| 1.0 | 1.0+ | Yes |
| 1.1 | 1.0 | Limited* |
| 1.1 | 1.1+ | Yes |
\*Limited: core data remains readable, but newer features are unavailable.