Metrics Reference
OSO Kafka Backup exposes Prometheus metrics for monitoring backup and restore operations.
Metrics Endpoint
When running as a service or operator, metrics are exposed at:
http://localhost:8080/metrics
CLI Metrics
The CLI reports metrics to stdout at the end of operations:
Backup completed successfully
────────────────────────────────
Records: 2,456,789
Duration: 5m 32s
Throughput: 7,423 records/sec
Compressed: 245 MB (4.9x ratio)
Backup Metrics
kafka_backup_records_total
Type: Counter
Total number of records backed up.
Labels:
topic- Topic namepartition- Partition numberbackup_id- Backup identifier
Example:
kafka_backup_records_total{topic="orders",partition="0",backup_id="daily-001"} 150234
kafka_backup_bytes_total
Type: Counter
Total bytes backed up (uncompressed).
Labels:
topic- Topic namepartition- Partition numberbackup_id- Backup identifier
Example:
kafka_backup_bytes_total{topic="orders",partition="0",backup_id="daily-001"} 52428800
kafka_backup_compressed_bytes_total
Type: Counter
Total bytes written to storage (compressed).
Labels:
topic- Topic namebackup_id- Backup identifier
Example:
kafka_backup_compressed_bytes_total{topic="orders",backup_id="daily-001"} 10485760
kafka_backup_segments_total
Type: Counter
Total number of segment files created.
Labels:
topic- Topic namepartition- Partition numberbackup_id- Backup identifier
kafka_backup_duration_seconds
Type: Histogram
Backup operation duration in seconds.
Labels:
backup_id- Backup identifierstatus-successorfailure
Buckets: 1, 5, 15, 30, 60, 120, 300, 600, 1800, 3600
Example:
kafka_backup_duration_seconds_bucket{backup_id="daily-001",status="success",le="300"} 1
kafka_backup_duration_seconds_sum{backup_id="daily-001",status="success"} 332.5
kafka_backup_duration_seconds_count{backup_id="daily-001",status="success"} 1
kafka_backup_throughput_records_per_second
Type: Gauge
Current backup throughput (records per second).
Labels:
backup_id- Backup identifier
kafka_backup_throughput_bytes_per_second
Type: Gauge
Current backup throughput (bytes per second).
Labels:
backup_id- Backup identifier
kafka_backup_lag_records
Type: Gauge
Number of records behind the high watermark.
Labels:
topic- Topic namepartition- Partition numberbackup_id- Backup identifier
kafka_backup_compression_ratio
Type: Gauge
Compression ratio (uncompressed / compressed).
Labels:
backup_id- Backup identifieralgorithm- Compression algorithm (zstd, lz4, none)
Example:
kafka_backup_compression_ratio{backup_id="daily-001",algorithm="zstd"} 4.9
Restore Metrics
kafka_restore_records_total
Type: Counter
Total number of records restored.
Labels:
topic- Target topic namepartition- Target partitionbackup_id- Source backup identifier
kafka_restore_bytes_total
Type: Counter
Total bytes restored.
Labels:
topic- Target topic namebackup_id- Source backup identifier
kafka_restore_duration_seconds
Type: Histogram
Restore operation duration.
Labels:
backup_id- Source backup identifierstatus-successorfailure
Buckets: 1, 5, 15, 30, 60, 120, 300, 600, 1800, 3600
kafka_restore_progress_percent
Type: Gauge
Restore progress percentage (0-100).
Labels:
backup_id- Source backup identifier
kafka_restore_throughput_records_per_second
Type: Gauge
Current restore throughput.
Labels:
backup_id- Source backup identifier
kafka_restore_eta_seconds
Type: Gauge
Estimated time remaining for restore.
Labels:
backup_id- Source backup identifier
Offset Reset Metrics
kafka_offset_reset_partitions_total
Type: Counter
Total partitions where offsets were reset.
Labels:
consumer_group- Consumer group namestatus-successorfailure
kafka_offset_reset_duration_seconds
Type: Histogram
Offset reset operation duration.
Labels:
consumer_group- Consumer group namestatus-successorfailure
Buckets: 0.5, 1, 2.5, 5, 10, 30, 60, 120
kafka_offset_reset_latency_seconds
Type: Histogram
Per-partition offset reset latency.
Labels:
consumer_group- Consumer group namequantile- p50, p90, p99
Error Metrics
kafka_backup_errors_total
Type: Counter
Total number of errors encountered.
Labels:
operation-backup,restore,validate,offset_reseterror_type- Error category (see Error Codes)
Example:
kafka_backup_errors_total{operation="backup",error_type="KAFKA_CONNECTION_FAILED"} 2
kafka_backup_retries_total
Type: Counter
Total number of retry attempts.
Labels:
operation- Operation typereason- Retry reason
Storage Metrics
kafka_backup_storage_write_bytes_total
Type: Counter
Total bytes written to storage.
Labels:
backend- Storage backend (filesystem, s3, azure, gcs)backup_id- Backup identifier
kafka_backup_storage_write_latency_seconds
Type: Histogram
Storage write latency.
Labels:
backend- Storage backendoperation-segment,manifest,checkpoint
Buckets: 0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10
kafka_backup_storage_read_bytes_total
Type: Counter
Total bytes read from storage.
Labels:
backend- Storage backendbackup_id- Backup identifier
Operator Metrics
When running as a Kubernetes operator, additional metrics are exposed:
kafka_backup_operator_reconciliations_total
Type: Counter
Total reconciliation loops.
Labels:
kind- Resource kind (KafkaBackup, KafkaRestore, etc.)result-success,failure,requeue
kafka_backup_operator_reconciliation_errors_total
Type: Counter
Total reconciliation errors.
Labels:
kind- Resource kind
kafka_backup_operator_reconcile_duration_seconds
Type: Histogram
Reconciliation loop duration.
Labels:
kind- Resource kind
Buckets: 0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10
kafka_backup_operator_managed_resources
Type: Gauge
Number of managed resources by type.
Labels:
kind- Resource kind
kafka_backup_operator_health
Type: Gauge
Operator health status (1 = healthy, 0 = unhealthy).
Grafana Dashboard
Recommended Panels
Backup Overview
# Backup throughput
rate(kafka_backup_records_total[5m])
# Compression ratio
kafka_backup_compression_ratio
# Active backups
count(kafka_backup_throughput_records_per_second > 0)
Restore Progress
# Progress percentage
kafka_restore_progress_percent
# ETA
kafka_restore_eta_seconds
# Throughput
kafka_restore_throughput_records_per_second
Error Rates
# Error rate
rate(kafka_backup_errors_total[5m])
# Retry rate
rate(kafka_backup_retries_total[5m])
Alert Rules
groups:
- name: kafka-backup
rules:
- alert: BackupFailed
expr: increase(kafka_backup_errors_total{operation="backup"}[1h]) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Kafka backup failed"
description: "Backup {{ $labels.backup_id }} has failed"
- alert: BackupLagging
expr: kafka_backup_lag_records > 100000
for: 10m
labels:
severity: warning
annotations:
summary: "Backup is lagging"
description: "Backup {{ $labels.backup_id }} is {{ $value }} records behind"
- alert: LowCompressionRatio
expr: kafka_backup_compression_ratio < 2
for: 30m
labels:
severity: warning
annotations:
summary: "Low compression ratio"
description: "Compression ratio is only {{ $value }}x"
Prometheus Scrape Config
scrape_configs:
- job_name: 'kafka-backup'
static_configs:
- targets: ['kafka-backup:8080']
metrics_path: /metrics
scrape_interval: 15s
- job_name: 'kafka-backup-operator'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app]
regex: kafka-backup-operator
action: keep
ServiceMonitor (Prometheus Operator)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: kafka-backup-operator
labels:
release: prometheus
spec:
selector:
matchLabels:
app: kafka-backup-operator
endpoints:
- port: metrics
interval: 15s
path: /metrics