Deploy ClickHouse on Kubernetes for real-time incident analytics
2026-01-28
10 min read

Step-by-step Kubernetes deployment and tuning guide for ClickHouse to power sub-second outage dashboards and reliable alerting.

Ship real-time outage dashboards: Deploy ClickHouse on Kubernetes for incident analytics

If your incident dashboards lag by minutes, your on-call engineers are fighting a delayed truth, and that inflates both MTTD and MTTR. In 2026, teams are moving high-cardinality, real-time observability pipelines off legacy TSDBs and into ClickHouse to power instant outage analytics. This guide gives a practical, step-by-step Kubernetes deployment and tuning recipe so you can run a reliable ClickHouse analytics cluster that ingests streams (Kafka, Vector), powers sub-second dashboards, and drives alerting.

Why ClickHouse on Kubernetes for incident analytics (2026 context)

ClickHouse’s OLAP engine continues to gain traction — reinforced by strong funding and growing cloud offerings — because it scales for high-cardinality, high-throughput analytics at low cost compared to traditional cloud OLAP. In late 2025 ClickHouse raised substantial capital, accelerating cloud-native features and operational tooling. For observability and outage detection in 2026, ClickHouse is popular because:

  • Fast analytical queries: Vectorized execution and MergeTree families deliver sub-second aggregations across millions of events.
  • Streaming ingestion: Native Kafka engine + materialized views let you convert event streams into analytic tables with minimal glue — a pattern that benefits from edge and low-latency workflows where producers buffer locally before forwarding.
  • Cost-efficiency: Columnar compression and tiered storage reduce cloud spend for long retention windows; pair this with cost-aware tiering strategies to optimize S3 warm/cold policies.
  • Cloud-native operators: Mature Kubernetes operators (CRDs) make stateful deployments repeatable and maintainable — a must if you want automated lifecycle and reproducible upgrades like those described in modern serverless and infra patterns.

Deployment strategy overview

We’ll present two supported patterns and when to use each:

  1. ClickHouse Operator (recommended): CRD-based cluster orchestration for production: replicas, shards, autoscaling hooks, and ClickHouse Keeper management.
  2. Hand-built StatefulSet: Lightweight, explicit control for PoCs, training, or constrained Kubernetes clusters — a useful approach when you’re prototyping on small hardware or unconventional hosts (see field notes on running services on constrained fleets like Raspberry Pi clusters).

Both approaches assume: Kubernetes 1.25+ (ensure CSI support), a storage class that offers low-latency persistent disks (local SSDs or NVMe-backed PVs for hot data), and a Kafka cluster for ingestion (or Vector agents forwarding events).
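If you plan to use pre-provisioned local NVMe volumes, a StorageClass along these lines is what the manifests below reference as fast-local. This is a sketch: the name is illustrative, and a cloud NVMe CSI driver would declare its own provisioner instead of the no-provisioner placeholder.

# storageclass-fast-local.yaml (sketch)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-local
provisioner: kubernetes.io/no-provisioner   # local PVs are created ahead of time
volumeBindingMode: WaitForFirstConsumer     # bind only once a pod is scheduled
reclaimPolicy: Retain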

Architecture for real-time incident analytics

Design principles:

  • Shard for ingestion, replicate for read availability: Use logical shards for partitioning high-volume producers, and at least 2 replicas per shard for query reliability.
  • Hot-warm storage: Keep recent minutes/hours on fast PVs (local NVMe). Move older aggregated data to cheaper object storage via ClickHouse native S3 storage policies.
  • Materialized views: Ingest raw events into a Kafka engine and use materialized views to write to ReplicatedMergeTree tables optimized for query access patterns.
  • Observability & alerting: Expose ClickHouse metrics to Prometheus, build Grafana dashboards, and create alerting rules for query latency, ingestion lag, and unprocessed partitions — pair this with operator-aware runbooks and the practices in modern observability playbooks.

Step 1 — Choose Operator or StatefulSet

Option A: ClickHouse Operator (production)

The operator manages ClickHouse configuration, replication, sharding, and ClickHouse Keeper instances. Use it for predictable upgrades and CRD-driven automation. Example CRD snippet (abridged):

# clickhouse-install.yaml (abridged)
apiVersion: clickhouse.altinity.com/v1
kind: ClickHouseInstallation
metadata:
  name: chc-analytics
spec:
  configuration:
    zookeeper:            # ClickHouse Keeper endpoints also go in this section
      nodes:
        - host: clickhouse-keeper-0
          port: 2181
    clusters:
      - name: analytics
        layout:
          shardsCount: 2
          replicasCount: 2
        templates:
          podTemplate: clickhouse-pod

Benefits: automated backups, seamless scaling of replicas/shards, built-in secrets handling, and multi-zone anti-affinity templates.
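The CRD above references a clickhouse-pod template without defining it. A minimal sketch, assuming the Altinity operator's templates section and its standard clickhouse.altinity.com/chi pod label, could slot under spec like this:

  templates:
    podTemplates:
      - name: clickhouse-pod
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchLabels:
                      clickhouse.altinity.com/chi: chc-analytics
                  topologyKey: topology.kubernetes.io/zone   # keep replicas in separate zones
          containers:
            - name: clickhouse
              image: clickhouse/clickhouse-server:latest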

Option B: StatefulSet (PoC / debug)

For a minimal three-node cluster, create a single three-replica StatefulSet fronted by a headless Service (both shown below). You'll need to run ClickHouse Keeper (or external ZooKeeper) and manage ReplicatedMergeTree configuration manually.

# clickhouse-statefulset.yaml (simplified)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ch-node
spec:
  serviceName: ch-headless
  replicas: 3
  selector:
    matchLabels:
      app: clickhouse
  template:
    metadata:
      labels:
        app: clickhouse
    spec:
      containers:
      - name: clickhouse
        image: clickhouse/clickhouse-server:latest
        ports:
        - containerPort: 9000
        volumeMounts:
        - name: clickhouse-data
          mountPath: /var/lib/clickhouse
  volumeClaimTemplates:
  - metadata:
      name: clickhouse-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: fast-local
      resources:
        requests:
          storage: 500Gi
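The StatefulSet relies on a headless Service for stable per-pod DNS (ch-node-0.ch-headless and so on), which you also reference in remote_servers configuration. A minimal manifest:

# ch-headless.yaml (sketch)
apiVersion: v1
kind: Service
metadata:
  name: ch-headless
spec:
  clusterIP: None          # headless: DNS resolves to individual pod IPs
  selector:
    app: clickhouse
  ports:
    - name: native
      port: 9000
    - name: http
      port: 8123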

Step 2 — Storage choices and performance

Hot path for incident analytics needs low-latency disk. Recommendations:

  • Local NVMe PVs: Best for write-heavy MergeTree merges and fast query performance. Use local persistent volumes with node affinity — these patterns are similar to field guides on running distributed services on edge hardware and dealing with local storage constraints (see notes on small-cluster deployments).
  • Filesystem: XFS or ext4 both work well; for large merges and scans, ClickHouse can use direct I/O (see the min_bytes_to_use_direct_io setting). Avoid keeping the data directory on overlayfs, which can hurt fsync performance.
  • IOPS vs throughput: For analytics, throughput is often the limiter during merges. Prioritize bandwidth and low latency.
  • S3 for cold storage: Configure storage_policy to tier older parts to S3 to control PV usage and cost; combine this with autonomous tiering thinking when forecasting egress and long-term retention.
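For the last bullet, a storage policy sketch lives in a config.d file; the endpoint, bucket, and credentials below are placeholders. Tables opt in with SETTINGS storage_policy = 'hot_to_s3', and a TTL ... TO VOLUME 'cold' clause moves older parts automatically.

<!-- config.d/storage.xml (sketch) -->
<clickhouse>
  <storage_configuration>
    <disks>
      <s3_cold>
        <type>s3</type>
        <endpoint>https://s3.us-east-1.amazonaws.com/my-ch-bucket/cold/</endpoint>
        <access_key_id>REPLACE_ME</access_key_id>
        <secret_access_key>REPLACE_ME</secret_access_key>
      </s3_cold>
    </disks>
    <policies>
      <hot_to_s3>
        <volumes>
          <hot>
            <disk>default</disk>        <!-- fast local PV -->
          </hot>
          <cold>
            <disk>s3_cold</disk>        <!-- object storage tier -->
          </cold>
        </volumes>
        <move_factor>0.2</move_factor>  <!-- start moving parts when free space drops below 20% -->
      </hot_to_s3>
    </policies>
  </storage_configuration>
</clickhouse>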

Step 3 — Ingesting events in real time

For outage detection you’ll ingest event streams like user error logs, health pings, and synthetic checks. We recommend using Kafka as the central ingestion bus and Vector (or Fluent Bit) at service edges.
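At the service edge, a minimal Vector sketch (source paths and names are placeholders) tails application logs and ships JSON events to the Kafka topic, batching records to cut per-message overhead:

# vector.yaml (sketch)
sources:
  app_logs:
    type: file
    include:
      - /var/log/app/*.log

sinks:
  kafka_events:
    type: kafka
    inputs: [app_logs]
    bootstrap_servers: "kafka:9092"
    topic: events
    encoding:
      codec: json
    batch:
      max_events: 500      # group small records before producing
      timeout_secs: 1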

ClickHouse Kafka engine + materialized view

Canonical pattern:

-- raw kafka table
CREATE TABLE kafka_events
(
  ts DateTime64(3),
  svc String,
  level String,
  message String,
  trace_id String
) ENGINE = Kafka
SETTINGS
  kafka_broker_list = 'kafka:9092',
  kafka_topic_list = 'events',
  kafka_group_name = 'ch-ingest',
  kafka_format = 'JSONEachRow';

-- target analytic table
CREATE TABLE events_mv
(
  ts DateTime64(3),
  svc String,
  level String,
  message String,
  trace_id String
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_mv', '{replica}')
PARTITION BY toYYYYMMDD(ts)
ORDER BY (svc, ts)
SETTINGS index_granularity = 8192;

-- materialized view to move rows
CREATE MATERIALIZED VIEW mv_events TO events_mv AS
SELECT
  ts, svc, level, message, trace_id
FROM kafka_events;

Notes:

  • Use JSONEachRow or Protobuf depending on producers.
  • Materialized views consume from the Kafka engine automatically; the path is low-latency and scales with your ClickHouse cluster. If you need to quantify end-to-end lag and SLOs, combine these ingestion patterns with a latency budgeting process.
  • Batch producers (Vector) can reduce small message overhead by grouping records.

Step 4 — Schema and MergeTree tuning for real-time queries

Schema design impacts query latency. For incident analytics you typically need per-minute aggregates, grouping by service and region. Use the following patterns:

  • ORDER BY for query patterns: Order by columns used in WHERE and GROUP BY (e.g., (svc, toStartOfMinute(ts))).
  • PARTITION BY short intervals: Partition by day for quick partition drops but keep partition size reasonable.
  • index_granularity: Higher granularity reduces index memory but increases scanned rows. For real-time, 8192–16384 is a good starting point.
  • Compression: Use LZ4 for hot data; consider ZSTD with tuning for warm/cold tiers.

Example optimized table for outage metrics

CREATE TABLE incidents
(
  minute DateTime64(3),
  svc String,
  region String,
  status_code UInt16,
  errors UInt64,
  total UInt64
) ENGINE = ReplicatedSummingMergeTree('/clickhouse/tables/{shard}/incidents', '{replica}')
PARTITION BY toYYYYMMDD(minute)
ORDER BY (svc, region, status_code, minute)
SETTINGS index_granularity = 8192;
-- status_code sits in the sorting key so SummingMergeTree only sums errors and total.
-- LZ4 is the default codec for hot data; override per column with CODEC(ZSTD(...)) for warm/cold tiers.
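To populate incidents continuously, a second materialized view can roll up the raw stream per minute. This is a sketch: it assumes your raw events also carry region and status_code fields (extend kafka_events and events_mv accordingly), and it relies on cascading materialized views, which fire on each block the Kafka pipeline inserts into events_mv.

CREATE MATERIALIZED VIEW mv_incidents TO incidents AS
SELECT
  toStartOfMinute(ts) AS minute,
  svc,
  region,
  status_code,
  countIf(level = 'error') AS errors,  -- assumes an 'error' level label in your events
  count() AS total
FROM events_mv
GROUP BY minute, svc, region, status_code;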

Step 5 — Resource limits, concurrency, and memory

ClickHouse is memory- and CPU-sensitive. Here’s a practical set of settings to protect cluster stability:

  • Per-pod resource requests/limits: Request CPU and memory close to expected sustained load; set limits to prevent noisy neighbors. Example: requests 4 CPU / 16GB, limits 8 CPU / 32GB for analytics nodes (expressed as Kubernetes YAML after the settings sample below).
  • clickhouse-server settings: Set max_memory_usage and max_bytes_before_external_group_by to force external processing instead of OOM.
  • max_concurrent_queries: Limit concurrency to avoid degraded latency under spikes.
  • mark_cache_size / uncompressed_cache_size: Tune mark cache proportionally to dataset size to accelerate lookups.
# Sample limits (max_memory_usage and max_threads belong in a users.xml profile;
# max_concurrent_queries and mark_cache_size are server-level settings in config.xml)
max_memory_usage = 20000000000       # ~20 GB per query
max_threads = 8
max_concurrent_queries = 8
mark_cache_size = 5368709120         # 5 GiB
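In Kubernetes terms, the per-pod numbers from the first bullet translate to a resources stanza like the following. Treat it as a starting point to drop into the StatefulSet container spec or the operator pod template, not a one-size-fits-all setting.

resources:
  requests:
    cpu: "4"
    memory: 16Gi
  limits:
    cpu: "8"
    memory: 32Gi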

Step 6 — Observability, metrics, and alerting

Instrument ClickHouse and your pipeline so alerts are meaningful:

  • Prometheus metrics: Enable ClickHouse's built-in Prometheus endpoint (or run clickhouse-exporter) and scrape query latency, insert lag, pending parts, and merge queue length. Integrate these metrics into an operator-aware dashboard and follow maturity patterns from modern observability playbooks.
  • Dashboard patterns: Real-time ingest rate, per-shard lag, slow queries, and hot partitions. Visualize 1m/5m/1h aggregates for incident context.
  • Alert rules:
    • Ingest lag: any kafka consumer lag > X seconds for 1 minute
    • Query tail latency: 99th percentile > SLO threshold
    • Merge queue depth: sustained growth indicates IO bottleneck
    • Disk usage: PV nearly full on hot nodes

Example Prometheus alert for ingestion lag (pseudo-rule):

alert: ClickHouseKafkaConsumerLagHigh
expr: clickhouse_kafka_consumer_lag_seconds > 30
for: 2m
labels:
  severity: critical
annotations:
  summary: "Kafka consumer lag > 30s for ClickHouse ingestion"
  description: "Partition lag on topic 'events' indicates stalled ingestion"

Step 7 — Security and multi-tenant considerations

Security is essential for incident data that may include PII or business-sensitive telemetry:

  • Network policy: Restrict access to ClickHouse ports (TCP 8123, 9000, keeper ports) to only ingestion agents, dashboards, and trusted networks; a sample NetworkPolicy is sketched after this list. For regionally distributed deployments consider operational resilience guidance similar to energy resiliency plans like the 90-day resilience playbooks — small operational constraints can have outsized effects on failover.
  • TLS and auth: Use mTLS for ClickHouse client-server connections and enable user authentication via hashed passwords or LDAP/SSO where available.
  • RBAC: Partition access at the query/table level when multiple teams share a cluster. Use role-based users in ClickHouse config files.
  • Secret management: Store JDBC/HTTP credentials and S3 keys in Kubernetes Secrets and mount them as files or env vars.
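A NetworkPolicy sketch for the first bullet, assuming ingestion agents and dashboards carry the labels shown (the label names are placeholders):

# clickhouse-networkpolicy.yaml (sketch)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: clickhouse-ingress
spec:
  podSelector:
    matchLabels:
      app: clickhouse
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: ingestion-agent
        - podSelector:
            matchLabels:
              role: dashboard
      ports:
        - protocol: TCP
          port: 8123        # HTTP interface
        - protocol: TCP
          port: 9000        # native protocol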

Step 8 — Testing and load validation

Before production cutover:

  • Load test writes: Use clickhouse-benchmark or custom producers (Vector) to simulate peak event rates (a sample invocation follows this list).
  • Query latency tests: Run representative dashboard queries at concurrent rates to validate p99 latency. Combine these tests with an operational tool-stack audit to surface blind spots in monitoring and alerting.
  • Chaos scenarios: Simulate pod restarts, node failure, and PV loss to validate replication and recovery, especially for shards and ClickHouse Keeper. Field-testing guidance from real-world diagnostic toolkits can help structure these exercises (see diagnostic toolkit reviews for similar checklists applied to infra).
  • Cost simulation: Project storage growth and network egress for S3 tiering to forecast 30/90/365-day cost.
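A minimal clickhouse-benchmark run against the test cluster might look like this; the hostname follows the StatefulSet example above, and concurrency/iterations should be tuned toward your expected peak:

echo "SELECT svc, count() FROM events_mv WHERE ts > now() - INTERVAL 5 MINUTE GROUP BY svc" \
  | clickhouse-benchmark --host ch-node-0.ch-headless --port 9000 --concurrency 8 --iterations 200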

Operational recipes and troubleshooting

Common performance hotspots and fixes

  • High merge queue: Increase background_pool_size and consider larger disks or faster IOPS. Re-balance partitions if data skews.
  • OOM during complex queries: Enable external aggregations and tune max_memory_usage settings.
  • Kafka consumer stalls: Ensure materialized view errors are not silently dropping messages. Check the server logs for Kafka engine errors, inspect system.kafka_consumers (available on recent releases), and verify consumer-group offsets on the Kafka side.
  • Slow backup/restore: Use S3 snapshot policies and avoid full-volume snapshots during heavy merges.

Quick debugging checklist

  1. Check ClickHouse logs (/var/log/clickhouse-server/) for ingestion errors.
  2. Inspect system.parts and system.merges for backlogs (example queries follow this checklist).
  3. Validate Kafka topic partitions and consumer groups for lag.
  4. Run simple SELECT count() over a small time window to verify data flow.
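Example queries for steps 2 and 4 of the checklist:

-- Active part counts per table (steadily growing counts point at merge pressure)
SELECT database, table, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY active_parts DESC
LIMIT 10;

-- Merges currently in flight
SELECT table, elapsed, progress, num_parts
FROM system.merges;

-- Sanity-check recent data flow
SELECT count() FROM events_mv WHERE ts > now() - INTERVAL 5 MINUTE;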

Real-world example: Outage spike detection pipeline

Scenario: Your synthetic checks and error logs feed events that must trigger a PagerDuty incident within 30s when error rate spikes above baseline.

  1. Ingest checks and logs into Kafka topic 'checks'.
  2. ClickHouse materialized view writes to a minute-granularity SummingMergeTree that stores counts by service and region.
  3. Prometheus scrapes ClickHouse metrics and Grafana runs alert queries of the form:
-- compute per-minute error rate
SELECT
  minute,
  svc,
  sum(errors) AS errors,
  sum(total) AS total,
  errors / total AS error_rate
FROM incidents
WHERE minute >= now() - INTERVAL 10 MINUTE
GROUP BY minute, svc
ORDER BY minute DESC
LIMIT 100;

Alert rule: If error_rate for any service sustained > 5% for two consecutive minutes and traffic > baseline, fire an alert. Using ClickHouse for the aggregation reduces alert evaluation cost and returns numeric context to PagerDuty messages.
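Expressed as a ClickHouse query, that condition looks roughly like the sketch below; the 5% threshold and the 1000-requests-per-minute floor are illustrative stand-ins for your own SLO and traffic baseline:

SELECT svc
FROM
(
    SELECT
        svc,
        minute,
        sum(errors) / sum(total) AS error_rate,
        sum(total) AS traffic
    FROM incidents
    WHERE minute >= toStartOfMinute(now()) - INTERVAL 2 MINUTE
      AND minute <  toStartOfMinute(now())
    GROUP BY svc, minute
)
GROUP BY svc
HAVING countIf(error_rate > 0.05 AND traffic > 1000) = 2;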

Looking ahead through 2026, here's what to expect and prepare for:

  • Wider adoption for observability: ClickHouse will increasingly replace purpose-built TSDBs for high-cardinality metrics and logs because of query flexibility and price-performance.
  • Operator improvements: Expect richer autoscaling primitives and built-in backup-to-cloud policies in operators through 2026. Watch how operator ecosystems borrow ideas from serverless infra and monorepo observability patterns such as those outlined in serverless monorepo best practices.
  • Better cloud integrations: Managed ClickHouse services and S3-native tiering will make hot/warm/cold separation easier to operate.
  • OpenTelemetry sync: Better connectors between OpenTelemetry pipelines and ClickHouse will reduce ingestion plumbing and latency.

Decision guidance: Operator vs StatefulSet checklist

  • Choose Operator if: you run production analytics, need automated recovery, multi-shard replication, and team wants infrastructure-as-code for DB lifecycle.
  • Choose StatefulSet if: you need a small testbed, want full control over cluster config, or Kubernetes environment prevents CRD usage.

Wrap-up: Actionable next steps

To get started this week:

  1. Provision a 3-node test cluster using the ClickHouse Operator with 2 shards x 2 replicas.
  2. Wire a Kafka topic with a sample producer and create a Kafka-engine + materialized view pipeline.
  3. Tune index_granularity to 8192, enable LZ4, and measure p99 dashboard latency under load.
  4. Expose metrics to Prometheus and create alert rules for ingestion lag and p99 query latency.
"Real-time incident analytics is not just about ingesting more data — it's about making the right data immediately queryable. ClickHouse on Kubernetes gives you that capability with cost efficiency and operator automation."

Further reading, tools, and resources

  • ClickHouse documentation and CRD operator docs (official)
  • Vector.dev for edge-level, high-throughput ingestion
  • Prometheus and Grafana for metrics and dashboards
  • ClickHouse benchmark tools and clickhouse-benchmark utility

Call to action

If you want a reproducible starting point, grab our tested Kubernetes manifests and operator recipes on GitHub and run the included ingestion & load tests. Try the reference deploy, run the load script, and open a PR with any cluster-sized adjustments you need — we’ll help tune it for your production SLOs. If you’re operating in constrained environments or need portable power for on-prem testbeds, see notes on portable power reviews such as the Jackery vs EcoFlow field comparisons to plan run-time availability.
