Disaster Recovery

Implement robust disaster recovery for your Kafka infrastructure with OSO Kafka Backup.

The Problem

Kafka replication (MirrorMaker, Confluent Replicator) provides high availability but has limitations:

| Challenge | Impact |
|---|---|
| Active-active complexity | Dual writes, conflict resolution |
| Data corruption propagates | Replication copies bad data too |
| No point-in-time recovery | Can't go back to "before the incident" |
| Topic deletion is permanent | Accidentally deleted topics can't be recovered |
| High cross-region costs | Continuous replication is expensive |

The Solution

OSO Kafka Backup provides true disaster recovery:

  • Point-in-time recovery: Restore to any point your backups cover
  • Isolated backups: Corruption can't propagate into them
  • Cost-effective: Pay for backup storage instead of always-on replication
  • Flexible recovery: Full-cluster or per-topic restore

DR Architecture

Production Region                        DR Region
┌─────────────────┐                 ┌─────────────────┐
│                 │                 │                 │
│  Kafka Cluster  │────Backup──────▶│    S3 Bucket    │
│                 │                 │                 │
│  ┌───────────┐  │                 │  ┌───────────┐  │
│  │  Topic A  │  │                 │  │  Backups  │  │
│  │  Topic B  │  │                 │  └───────────┘  │
│  │  Topic C  │  │                 │                 │
│  └───────────┘  │                 └────────┬────────┘
│                 │                          │
└─────────────────┘                          │ Restore
                                             ▼
                                    ┌─────────────────┐
                                    │    DR Kafka     │
                                    │     Cluster     │
                                    │  ┌───────────┐  │
                                    │  │  Topic A  │  │
                                    │  │  Topic B  │  │
                                    │  │  Topic C  │  │
                                    │  └───────────┘  │
                                    └─────────────────┘

DR Scenarios

Scenario 1: Regional Failure

Situation: Entire production region becomes unavailable.

Recovery:

  1. Activate DR Kafka cluster
  2. Restore from latest backup
  3. Reset consumer offsets
  4. Redirect applications to DR cluster

RTO: 30 minutes to 2 hours (depending on data volume)
RPO: Last backup (typically 15 minutes to 1 hour)

Scenario 2: Data Corruption

Situation: Bad data published to topics, affecting consumers.

Recovery:

  1. Identify the corruption timestamp
  2. Run a point-in-time restore (PITR) to just before the corruption
  3. Continue from the clean state

restore:
  time_window_end: 1701234500000 # Just before corruption
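
time_window_end takes epoch milliseconds. If you know the corruption time in human terms, a GNU date one-liner converts it; the timestamp below corresponds to the example value above:

# Convert a human-readable timestamp to epoch milliseconds (GNU coreutils date)
date -d '2023-11-29 05:08:20 UTC' +%s%3N
# → 1701234500000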

Scenario 3: Accidental Topic Deletion

Situation: Critical topic accidentally deleted.

Recovery:

  1. Identify which backup contains the topic
  2. Restore the specific topic
  3. Resume operations

restore:
  topics:
    - accidentally-deleted-topic
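
For step 1, if each backup_id maps to a key prefix in the bucket (as the storage layout in this guide suggests, which is an assumption to verify against your setup), simply listing the prefix is often enough to find a backup from before the deletion:

# Browse available backups to find one that predates the deletion
aws s3 ls s3://kafka-dr-backups/production/hourly/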

Scenario 4: Ransomware Attack

Situation: Kafka cluster compromised, data encrypted.

Recovery:

  1. Provision new cluster (isolated)
  2. Restore from clean backup
  3. Validate data integrity
  4. Cut over to new cluster
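
For steps 2 and 3, the validate command used in the runbook below can deep-check candidate backups to confirm you are restoring from a point before the compromise. The backup ID here is illustrative:

# Deep-validate a candidate backup taken before the incident
kafka-backup validate \
  --path s3://kafka-dr-backups/production/hourly \
  --backup-id "production-20231128T0000" \
  --deep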

Implementation

Step 1: Configure Automated Backups

# dr-backup.yaml
mode: backup
backup_id: "production-${TIMESTAMP}"

source:
  bootstrap_servers:
    - kafka-prod-1:9092
    - kafka-prod-2:9092
    - kafka-prod-3:9092
  security:
    security_protocol: SASL_SSL
    sasl_mechanism: SCRAM-SHA-256
    sasl_username: backup-service
    sasl_password: ${KAFKA_PASSWORD}
  topics:
    include:
      - "*"
    exclude:
      - "__consumer_offsets"
      - "_schemas"

storage:
  backend: s3
  bucket: kafka-dr-backups
  region: us-east-1 # DR region
  prefix: production/hourly

backup:
  compression: zstd
  compression_level: 3
  checkpoint_interval_secs: 30
  include_offset_headers: true
  source_cluster_id: "prod-us-west-2"
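
The config references ${TIMESTAMP} and ${KAFKA_PASSWORD}. Assuming these are substituted from the environment, export them before each run, for example:

# Set the placeholders the config expects (secret source is illustrative)
export TIMESTAMP=$(date -u +%Y%m%dT%H%M%SZ)
export KAFKA_PASSWORD=$(cat /run/secrets/kafka-backup-password)  # or your secret store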

Step 2: Schedule Backups

Kubernetes Operator:

apiVersion: kafka.oso.sh/v1alpha1
kind: KafkaBackup
metadata:
  name: dr-backup
spec:
  schedule: "0 * * * *" # Hourly
  kafkaCluster:
    bootstrapServers:
      - kafka-prod-1:9092
  topics:
    - "*"
  storage:
    storageType: s3
    s3:
      bucket: kafka-dr-backups
      region: us-east-1
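
Apply and inspect it like any other custom resource. The resource name used with kubectl get below is an assumption; check the installed CRD for the exact singular/plural names:

kubectl apply -f kafkabackup-dr.yaml           # the manifest above, saved to a file
kubectl get kafkabackup dr-backup -o yaml      # inspect the scheduled backup's status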

Step 3: Create DR Runbook

#!/bin/bash
# dr-restore.sh - Disaster Recovery Runbook

set -euo pipefail  # abort immediately if any step fails

BACKUP_ID="${1:-latest}"
DR_CLUSTER="kafka-dr-1:9092,kafka-dr-2:9092,kafka-dr-3:9092"

echo "=== Kafka Disaster Recovery ==="
echo "Backup ID: $BACKUP_ID"
echo "Target: $DR_CLUSTER"

# 1. Validate backup integrity before touching the DR cluster
echo "Step 1: Validating backup..."
kafka-backup validate \
  --path s3://kafka-dr-backups/production/hourly \
  --backup-id "$BACKUP_ID" \
  --deep

# 2. Validate restore configuration
echo "Step 2: Validating restore config..."
kafka-backup validate-restore --config dr-restore.yaml

# 3. Execute restore
echo "Step 3: Executing restore..."
kafka-backup three-phase-restore --config dr-restore.yaml

# 4. Verify topics exist on the DR cluster
echo "Step 4: Verifying restore..."
kafka-topics --bootstrap-server "$DR_CLUSTER" --list

echo "=== DR Complete ==="

Step 4: Test DR Regularly

Create a DR test schedule:

| Test Type | Frequency | Description |
|---|---|---|
| Backup validation | Daily | Verify backup integrity |
| Restore test | Weekly | Restore to a test cluster |
| Full DR drill | Quarterly | Complete failover simulation |
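
The weekly restore test is straightforward to automate once the runbook exists. A cron sketch, assuming dr-restore.sh has been adapted to point at a dedicated test cluster rather than the live DR cluster:

# /etc/cron.d/kafka-restore-test - weekly restore drill (Sundays, 03:00)
0 3 * * 0 kafka /opt/kafka-backup/dr-restore.sh latest >> /var/log/kafka-restore-test.log 2>&1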

RTO/RPO Planning

Recovery Time Objective (RTO)

Time to restore service:

| Data Volume | RTO Estimate |
|---|---|
| < 10 GB | 15-30 minutes |
| 10-100 GB | 30-60 minutes |
| 100 GB - 1 TB | 1-2 hours |
| > 1 TB | 2-4 hours |

Factors affecting RTO:

  • Network bandwidth to DR region
  • Target cluster capacity
  • Number of topics/partitions
  • Consumer offset reset time
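
A rough lower bound on RTO is the raw transfer time, gated by whichever of the first two factors is the bottleneck:

transfer_time ≈ data_size / min(network_bandwidth, target_ingest_rate)

Example: 100 GB over a 1 Gbps link ≈ 800 Gb / 1 Gbps ≈ 800 s ≈ 13 minutes,
before topic creation and consumer offset resets add overhead on top.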

Recovery Point Objective (RPO)

Maximum acceptable data loss:

| Backup Frequency | RPO |
|---|---|
| Continuous | ~1 minute |
| Every 15 minutes | 15 minutes |
| Hourly | 1 hour |
| Daily | 24 hours |

Choose based on:

  • Data criticality
  • Backup costs
  • Compliance requirements

Cost Analysis

Replication vs. Backup

| Approach | Monthly Cost (100 GB/day) |
|---|---|
| Active-active (MirrorMaker) | $$$$ (2x infrastructure + transfer) |
| Cross-region replication | $$$ (transfer costs) |
| Hourly backup to S3 | $ (storage + occasional transfer) |
| Daily backup to S3 | $ (storage + rare transfer) |

Backup Storage Costs

Monthly storage ≈ daily_backup_size × retention_days × storage_cost_per_GB

Example:
10 GB/day × 30 days retention × $0.023/GB = $6.90/month (S3 Standard)
With 4x compression: ~$1.73/month
With a Glacier transition after 30 days: even less
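
The Glacier transition can be enforced with an S3 lifecycle rule. For example, archiving backups after 30 days, using the bucket and prefix from the config above:

# Transition backup objects to Glacier after 30 days
aws s3api put-bucket-lifecycle-configuration \
  --bucket kafka-dr-backups \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "archive-kafka-backups",
      "Status": "Enabled",
      "Filter": {"Prefix": "production/"},
      "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}]
    }]
  }'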

Best Practices

Backup Strategy

  1. Multiple backup frequencies

    • Hourly for critical data
    • Daily for full backups
    • Weekly for long-term retention
  2. Cross-region storage

    • Store backups in DR region
    • Consider multi-region buckets
  3. Encryption

    • Enable storage encryption
    • Use customer-managed keys
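
Points 2 and 3 are bucket-level settings. A sketch using the AWS CLI, with the KMS key alias as a placeholder:

# Create the DR bucket, then enable versioning and default encryption
# with a customer-managed KMS key (key alias is a placeholder)
aws s3api create-bucket --bucket kafka-dr-backups --region us-east-1
aws s3api put-bucket-versioning \
  --bucket kafka-dr-backups \
  --versioning-configuration Status=Enabled
aws s3api put-bucket-encryption \
  --bucket kafka-dr-backups \
  --server-side-encryption-configuration '{
    "Rules": [{
      "ApplyServerSideEncryptionByDefault": {
        "SSEAlgorithm": "aws:kms",
        "KMSMasterKeyID": "alias/kafka-dr-backups"
      }
    }]
  }'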

DR Readiness

  1. Pre-provision DR cluster

    • Keep cluster running (minimal)
    • Or use auto-scaling on demand
  2. Automate everything

    • Scripted restore process
    • Infrastructure as code
  3. Regular testing

    • Monthly restore tests
    • Quarterly full DR drills

Consumer Recovery

After a restore, consumer group offsets no longer match the restored cluster, so consumers need attention. There are three common approaches:

Option 1: Reprocess All

kafka-consumer-groups \
  --bootstrap-server kafka-dr:9092 \
  --group my-consumer \
  --reset-offsets \
  --all-topics \
  --to-earliest \
  --execute

Option 2: Resume from Position

Use three-phase restore with offset reset:

restore:
  consumer_group_strategy: header-based
  reset_consumer_offsets: true
  consumer_groups:
    - order-processor
    - payment-service

Option 3: Timestamp-Based Resume

kafka-consumer-groups \
  --bootstrap-server kafka-dr:9092 \
  --group my-consumer \
  --reset-offsets \
  --all-topics \
  --to-datetime 2024-12-01T10:00:00.000 \
  --execute

Next Steps