
Performance Issues

This guide helps diagnose and resolve slow backup and restore operations.

Diagnosing Performance Problems

Check Current Throughput

# Monitor backup progress
kafka-backup status --config backup.yaml --watch

# Output includes:
# Records/sec: 50,000
# Bytes/sec: 25 MB/s
# Estimated completion: 2h 15m

Expected Performance

Operation | Per Partition | With 10 Partitions
--------- | ------------- | ------------------
Backup    | 50-100 MB/s   | 500 MB/s - 1 GB/s
Restore   | 75-150 MB/s   | 750 MB/s - 1.5 GB/s

If you're seeing significantly lower numbers, continue troubleshooting.
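
As a back-of-envelope check, you can estimate how long a backup should take from the topic's on-disk size and the per-partition rates above. This is a sketch assuming an orders topic and a mid-range 75 MB/s; the grep skips the informational lines kafka-log-dirs prints before its JSON:

# Sum partition sizes across brokers (this counts every replica; divide by
# the replication factor for leader-only bytes)
BYTES=$(kafka-log-dirs --bootstrap-server kafka:9092 --describe --topic-list orders \
  | grep '^{' | jq '[.brokers[].logDirs[].partitions[].size] | add')

# Estimated duration at 75 MB/s
echo "~$((BYTES / 75000000 / 60)) minutes"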

Backup Performance

Problem: Slow Kafka Consumption

Symptoms:

  • Low records/sec
  • High CPU idle
  • Storage I/O is fine

Diagnosis:

# Check consumer lag
kafka-consumer-groups --bootstrap-server kafka:9092 \
  --describe --group kafka-backup-$BACKUP_ID
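
Lag that holds steady or grows means the consumer cannot keep up with the brokers. To watch the trend rather than a single snapshot (assumes watch is installed and $BACKUP_ID is set):

# Re-run the lag check every 5 seconds
watch -n 5 "kafka-consumer-groups --bootstrap-server kafka:9092 --describe --group kafka-backup-$BACKUP_ID"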

Solutions:

# Increase fetch sizes
source:
  kafka_config:
    fetch.max.bytes: 104857600           # 100 MB
    max.partition.fetch.bytes: 10485760  # 10 MB per partition
    fetch.min.bytes: 1048576             # 1 MB minimum
    fetch.max.wait.ms: 500               # Wait for batches

Problem: Network Bottleneck

Symptoms:

  • Throughput limited regardless of settings
  • Network utilization at 100%

Diagnosis:

# Check network utilization
iftop -i eth0

# Test network bandwidth
iperf3 -c kafka-broker-0 -p 5201

Solutions:

  1. Compress before transfer:

     backup:
       compression: zstd
       compression_level: 1  # Fast compression

  2. Use a closer storage region:

     storage:
       backend: s3
       region: us-west-2  # Same region as Kafka

  3. Enable VPC endpoints:

     storage:
       backend: s3
       endpoint: https://s3.us-west-2.amazonaws.com
       use_vpc_endpoint: true

Problem: Slow Compression

Symptoms:

  • High CPU utilization
  • Compression taking most of the time

Diagnosis:

# Check CPU during backup
top -p $(pgrep kafka-backup)

# Check metrics
curl localhost:9090/metrics | grep compression

Solutions:

# Use faster compression
backup:
  compression: lz4  # Faster than zstd

# Or keep zstd but reduce the level
backup:
  compression: zstd
  compression_level: 1  # Fastest zstd level
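
Before changing the config, you can compare codecs on a representative sample of your data using the benchmark modes built into the zstd and lz4 CLIs (sample-segment.bin is a placeholder for any uncompressed, segment-sized file):

# Benchmark zstd levels 1 through 9
zstd -b1 -e9 sample-segment.bin

# Benchmark lz4 on the same file
lz4 -b1 sample-segment.bin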

Problem: Slow Storage Writes

Symptoms:

  • Low storage throughput
  • High storage latency

Diagnosis:

# Test S3 write speed
dd if=/dev/zero bs=1M count=100 | aws s3 cp - s3://bucket/test-file

# Check CloudWatch metrics for S3

Solutions:

# Optimize multipart uploads
storage:
  backend: s3
  multipart_threshold: 52428800   # 50 MB
  multipart_part_size: 10485760   # 10 MB
  max_concurrent_uploads: 10

Problem: Single Partition Bottleneck

Symptoms:

  • One partition much slower than others
  • Uneven partition sizes

Diagnosis:

# Check partition sizes
kafka-log-dirs --bootstrap-server kafka:9092 --describe
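
To make the skew visible, rank partitions by size. A sketch assuming the orders topic; the grep skips the informational lines kafka-log-dirs prints before its JSON:

# Top 5 largest partitions across all brokers
kafka-log-dirs --bootstrap-server kafka:9092 --describe --topic-list orders \
  | grep '^{' \
  | jq -r '[.brokers[].logDirs[].partitions[]] | sort_by(-.size) | .[:5][] | "\(.partition)\t\(.size) bytes"'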

Solutions:

  1. Long term: choose partition keys that distribute records more evenly.
  2. Short term: accept the longer backup time for skewed partitions.

Restore Performance

Problem: Slow Kafka Production

Symptoms:

  • Low write throughput
  • Producer backpressure

Diagnosis:

# Check producer metrics
kafka-backup status --config restore.yaml --watch

Solutions:

# Optimize producer settings
target:
  kafka_config:
    batch.size: 1048576      # 1 MB batches
    linger.ms: 100           # Wait for batches
    buffer.memory: 67108864  # 64 MB buffer
    acks: 1                  # Trade durability for speed
    compression.type: lz4    # Producer-side compression
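
To rule out the target cluster itself, run a baseline producer test with settings matching the config above. restore-perf-test is a placeholder; use a scratch topic, since the test writes synthetic records:

kafka-producer-perf-test \
  --topic restore-perf-test \
  --num-records 1000000 \
  --record-size 1024 \
  --throughput -1 \
  --producer-props bootstrap.servers=kafka:9092 acks=1 compression.type=lz4

If this test is also slow, the bottleneck is the cluster or network, not the restore tool.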

Problem: Slow Decompression

Symptoms:

  • Reading from storage is fine
  • CPU bound during restore

Solutions:

# Use faster decompression (for future backups)
backup:
  compression: lz4  # Decompresses faster than zstd

Problem: Slow PITR Filtering

Symptoms:

  • PITR restore much slower than full restore
  • Many segments being read but few records restored

Diagnosis:

# Check how many records match time window
kafka-backup describe \
  --path s3://bucket/backups \
  --backup-id my-backup \
  --format json | jq '.segments[] | select(.start_timestamp < 1701388800000)'
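
A quick ratio of in-window to total segments shows how much of the backup the time filter actually touches. This reuses the describe JSON shape from the diagnosis step; the timestamp is the same example cutoff:

kafka-backup describe \
  --path s3://bucket/backups \
  --backup-id my-backup \
  --format json \
  | jq '{total: (.segments | length),
         in_window: ([.segments[] | select(.start_timestamp < 1701388800000)] | length)}'

If in_window is small relative to total yet the restore is slow, most of the time is going to scanning segments that straddle the window boundary.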

Solutions:

# Ensure time window aligns with segment boundaries
# Backups with shorter checkpoint intervals = better PITR performance
backup:
  checkpoint_interval_secs: 30  # More granular segments

Resource Optimization

Memory Tuning

# Reduce memory for constrained environments
backup:
  batch_size: 10000           # Fewer records per batch
  max_batch_bytes: 52428800   # 50 MB max

# Increase memory for better throughput
backup:
  batch_size: 100000          # More records per batch
  max_batch_bytes: 209715200  # 200 MB max
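
To see which profile a host actually needs, check the process's resident memory during a run (assumes a single kafka-backup process is running):

# Resident set size of the backup process, in MB
ps -o rss= -p "$(pgrep kafka-backup)" | awk '{printf "%.1f MB\n", $1/1024}'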

CPU Tuning

# Reduce CPU usage
backup:
  compression: lz4  # Lower CPU
  compression_level: 1

# Maximize throughput (more CPU)
backup:
  compression: zstd
  compression_level: 3
  # Parallel compression across partitions
# Parallel compression across partitions

Parallelism

# More parallelism (higher resource usage)
backup:
  parallel_partitions: 20  # Process more partitions concurrently

# Less parallelism (lower resource usage)
backup:
  parallel_partitions: 4

Storage-Specific Optimization

S3 Performance

storage:
  backend: s3
  bucket: my-bucket
  region: us-west-2

  # Transfer acceleration
  use_accelerate_endpoint: true

  # Multipart optimization
  multipart_threshold: 52428800
  multipart_part_size: 10485760
  max_concurrent_uploads: 20

  # Connection pooling
  max_connections: 50

Azure Blob Performance

storage:
  backend: azure
  container: kafka-backups

  # Block blob settings
  block_size: 10485760  # 10 MB blocks
  max_concurrency: 10

GCS Performance

storage:
  backend: gcs
  bucket: kafka-backups

  # Parallel composite uploads
  parallel_composite_upload: true
  chunk_size: 10485760  # 10 MB chunks

Local/PVC Performance

storage:
  backend: local
  path: /backups

# Use SSD if available
# Ensure sufficient IOPS
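
To verify the volume can sustain the required write rate, run a sequential-write benchmark with fio (assumes fio is installed; it writes a 1 GB scratch file under /backups that you can delete afterwards):

fio --name=seqwrite --directory=/backups --rw=write --bs=1M --size=1G \
  --direct=1 --numjobs=1 --group_reporting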

Monitoring Performance

Prometheus Metrics

# Backup throughput
rate(kafka_backup_bytes_total[5m])

# Records per second
rate(kafka_backup_records_total[5m])

# Compression ratio
kafka_backup_compression_ratio

# Batch processing time (p99)
histogram_quantile(0.99, rate(kafka_backup_batch_duration_seconds_bucket[5m]))
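
The same metrics can drive an alert when throughput degrades mid-backup. A minimal Prometheus rule sketch using the metric name above; the 10 MB/s threshold is illustrative, so pick one based on your measured baseline:

groups:
  - name: kafka-backup
    rules:
      - alert: KafkaBackupThroughputLow
        expr: rate(kafka_backup_bytes_total[5m]) < 10 * 1024 * 1024
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Backup throughput below 10 MB/s for 10 minutes"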

Grafana Dashboard

Key panels to monitor:

  • Throughput (MB/s) over time
  • Records/sec per partition
  • Compression ratio
  • Checkpoint latency
  • Storage write latency

Real-time Monitoring

# Watch backup progress
kafka-backup backup --config backup.yaml --progress

# Output:
# Progress: 45% (450,000/1,000,000 records)
# Speed: 85 MB/s
# ETA: 10 minutes

Performance Checklist

Before Backup

  • Kafka cluster has capacity for backup consumer
  • Storage bucket is in same region as Kafka
  • Network allows sufficient bandwidth
  • Compression level matches needs (speed vs size)

During Backup

  • Monitor throughput metrics
  • Check for consumer lag
  • Verify checkpoint writes are succeeding
  • Watch for rate limiting errors

After Backup

  • Verify backup size is reasonable
  • Check compression ratio
  • Note total duration for planning
  • Run validation

Benchmarking

Baseline Test

# Test maximum throughput to storage
dd if=/dev/zero bs=1M count=1000 | \
  aws s3 cp - s3://bucket/benchmark/test-1gb
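
# Test maximum read throughput from storage (the restore path);
# reuses the file written above
aws s3 cp s3://bucket/benchmark/test-1gb /dev/null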

# Test maximum throughput from Kafka
kafka-consumer-perf-test \
  --bootstrap-server kafka:9092 \
  --topic orders \
  --messages 1000000 \
  --threads 4

Compare Configurations

# Test with different compression
for level in 1 3 6 9; do
  kafka-backup backup \
    --config backup.yaml \
    --compression-level $level \
    --dry-run \
    --benchmark
done

Next Steps