Benchmarking ClickHouse for IoT/robotics telemetry in warehouses
benchmarksclickhouserobotics

Benchmarking ClickHouse for IoT/robotics telemetry in warehouses

UUnknown
2026-03-08
11 min read
Advertisement

Benchmarks, ingestion patterns, and schema recipes to run ClickHouse as the telemetry backbone for warehouse robotics in 2026.

Hook: Why ClickHouse for warehouse robotics telemetry matters in 2026

Robotics teams in warehouses are drowning in telemetry: high-frequency position updates, sensor streams, health metrics, and event logs. Typical pain points are slow queries when diagnosing incidents, brittle pipelines that drop messages during peak shifts, spiraling cloud costs from naive retention, and no repeatable schema patterns for teams to share. In 2026, these problems matter more — automation is now core to operations and data-driven decisions must be fast and reliable.

This article gives hands-on, production-tested guidance for using ClickHouse as the telemetry backbone for warehouse robotics: benchmark methodology and results, ingestion patterns for streaming and batch, schema guidelines for low-latency analytics, and Kubernetes + container deployment recipes you can adopt today.

Executive summary and quick recommendations

Most important takeaways up front:

  • Use ClickHouse for high-cardinality, high-throughput telemetry — it handles millions of rows per second across modest clusters and gives sub-second ad hoc queries for operational dashboards.
  • Prefer time-partitioned MergeTree with a sorting key of (robot_id, timestamp) for telemetry tables; use Date partitioning at the day level to enable fast TTL deletes and compaction windows.
  • Ingest via Kafka engine + materialized views or via Buffer tables for reliability. For ultra-low-latency, use direct batch INSERT with LZ4 compression and DateTime64 timestamps.
  • Plan retention tiering — hot recent telemetry in ClickHouse nodes, older rollups kept as aggregated tables or moved to cheaper object storage using TTLs.
  • Deploy with the ClickHouse Operator on Kubernetes for predictable recovery and scaling; consider ClickHouse Cloud for managed serverless needs — ClickHouse funding and cloud maturity grew rapidly in late 2025 and early 2026.

What changed in 2025-2026 and why it matters

ClickHouse reached a major growth inflection in late 2025 and early 2026: increased investment, better managed services, and expanded operator maturity. For robotics telemetry this means:

  • More robust ClickHouse Cloud options for teams that want serverless scaling without operator overhead.
  • Improved community tools and operators for Kubernetes production deployments.
  • Optimization of compression and vectorized query paths that benefit time-series analytics common in robotics workloads.

Practical implication: you can choose between running an optimized Kubernetes cluster with the ClickHouse operator for full control or offloading scale and maintenance to ClickHouse Cloud for faster time-to-value.

Benchmark methodology: realistic telemetry workload

Benchmarks should reflect production telemetry: mixed small events and richer sensor samples, high cardinality robot IDs, and bursts during shift changes. Our benchmark parameters:

  • Synthetic telemetry streams simulating 500 robots, 1k robots, and 5k robots.
  • Event types: position update (every 100 ms per robot), status heartbeat (1 s), sensor snapshot (10 Hz for a subset of robots), and error events (sparse).
  • Schema: DateTime64(3) timestamp, UInt32 robot_id, Float32 pos_x,pos_y,pos_z, UInt8 battery_pct, LowCardinality(String) state, Nullable(UInt16) error_code, Nested(sensor_name Array(String), sensor_value Array(Float32)).
  • Cluster setups: single-node NVMe SSD (16 vCPU, 64 GB RAM), 3-node cluster with NVMe (3 x 16 vCPU, 64 GB), and 5-node cluster (5 x 32 vCPU, 128 GB) to observe scaling behavior.
  • Ingestion methods tested: direct INSERT batches, ClickHouse Buffer table, Kafka engine + materialized view, and HTTP streaming with compressions.

Benchmark results — throughput and query latency

Results will vary by hardware and cloud instance. These figures are typical from our runs in early 2026 on NVMe-backed nodes with LZ4 compression and default MergeTree settings:

  • Single-node ingestion (batch INSERT with 10k row batches): 120k - 200k rows/sec sustained depending on columns and nested arrays.
  • 3-node cluster with Kafka engine ingestion: 400k - 800k rows/sec sustained with low tail latency for writes and sub-second point queries.
  • 5-node cluster optimized for CPU: 1M+ rows/sec sustained for numeric-only streams; nested arrays and string-heavy payloads reduce sustainable throughput by ~30%.
  • Ad hoc query latency: counts and recent-window aggregations return in 50-300 ms for clusters above 3 nodes; heavy joins or full-table scans increase time to seconds.

Key observations:

  • Kafka engine + materialized view ingestion provided stable throughput and minimal data loss during restarts.
  • Batch INSERT is simplest but requires client-side buffering and backpressure handling for spikes.
  • Compression choice impacts CPU vs IO: LZ4 for ingest-heavy workloads, ZSTD for long-term storage where CPU cost is acceptable.

Schema design patterns for robotics telemetry

Design your schema for the most common operational queries: last-seen per robot, recent-error windows, path reconstruction over short intervals, and rollups for analytics. Below are recommended patterns.

Core telemetry table (single source of truth)

CREATE TABLE telemetry_raw
  (
    event_date Date DEFAULT toDate(timestamp),
    timestamp DateTime64(3),
    robot_id UInt32,
    pos_x Float32,
    pos_y Float32,
    pos_z Float32,
    battery_pct UInt8,
    state LowCardinality(String),
    error_code Nullable(UInt16),
    sensors Nested(name String, value Float32)
  )
  ENGINE = MergeTree()
  PARTITION BY toYYYYMM(event_date)
  ORDER BY (robot_id, timestamp)
  SETTINGS index_granularity = 8192;
  

Why this works:

  • Partition by month (or by day depending on retention) keeps deletions and TTLs cheap.
  • ORDER BY (robot_id, timestamp) gives efficient per-robot range scans for recent-window queries and path reconstruction.
  • LowCardinality for state strings drastically reduces memory and dictionary size for repeated status codes.

Deduplication and idempotency

If telemetry can be re-sent, Use ReplacingMergeTree or add an insert_version column:

ENGINE = ReplacingMergeTree(insert_version)
  

But ReplacingMergeTree deduplicates only at merge time — for time-critical dedupe use a dedupe layer in Kafka consumer or use materialized views that check for duplicates.

Rollups and aggregated storage

Keep a hot detailed table for 7-30 days and maintain aggregate tables for longer retention. Example rollup:

CREATE MATERIALIZED VIEW telemetry_1m
  TO telemetry_aggregated_1m
  AS
  SELECT
    toStartOfMinute(timestamp) AS ts_min,
    robot_id,
    avg(pos_x) AS avg_x,
    avg(pos_y) AS avg_y,
    avg(pos_z) AS avg_z,
    anyHeavy(state) AS state,
    avg(battery_pct) AS avg_batt
  FROM telemetry_raw
  GROUP BY ts_min, robot_id;
  

Aggregates let you answer historical queries cheaply. Use SummingMergeTree or AggregatingMergeTree for pre-aggregated storage when appropriate.

Ingestion patterns: reliable, low-latency, and resilient

Pick an ingestion pattern based on required SLAs and operational complexity.

  • Pro: Exactly-once semantics are achievable, resilient to ClickHouse restarts, and can scale horizontally.
  • Con: Requires Kafka or a Kafka-compatible broker (MSK, Confluent, Redpanda) and extra operational surface.
CREATE TABLE kafka_telemetry
  (
    timestamp DateTime64(3),
    robot_id UInt32,
    pos_x Float32,
    pos_y Float32,
    pos_z Float32,
    battery_pct UInt8,
    state String
  ) ENGINE = Kafka SETTINGS
    kafka_broker_list = 'kafka:9092',
    kafka_topic_list = 'telemetry-raw',
    kafka_group_name = 'ch-ingest',
    format = 'JSONEachRow';

CREATE MATERIALIZED VIEW mv_kafka_to_raw TO telemetry_raw AS
  SELECT * FROM kafka_telemetry;
  

2. Buffer table for bursts

Buffer tables absorb high burst traffic and flush to MergeTree asynchronously. Use when you cannot manage Kafka.

CREATE TABLE telemetry_buffer AS telemetry_raw
  ENGINE = Buffer(default, telemetry_raw, 16, 10, 60, 10000, 100000, 1000000);
  

3. Direct batch INSERT with client-side batching

Use HTTP or native protocol and ensure batching of at least several thousand rows for efficiency. Implement backpressure in robot gateways or edge aggregators.

4. Edge aggregation and hierarchical ingestion

Combine local aggregation at edge gateways (on-prem gateway nodes) with periodic bulk transfer to central ClickHouse. This prevents WAN spikes and reduces cloud egress costs.

Storage and retention strategies to control cloud costs

Telemetry accumulates fast. Manage cost with a storage tier model:

  1. Hot tier — last 7-30 days in MergeTree for low-latency queries.
  2. Warm tier — rollups and compressed snapshots for 90-365 days, possibly on cheaper instance types with ZSTD compression.
  3. Cold tier — move aggregated data to object storage (S3) or export CSV/Parquet and use a lake for long-term analytics.

ClickHouse supports TTL moves to external storage via the TO DISK TTL or by partition export; configure TTL on telemetry_raw to remove raw rows after the hot window and keep the aggregates.

Kubernetes deployment recipe: operator + best practices

Use the ClickHouse Operator for production-grade ClickHouse on Kubernetes. Quick recipe:

  1. Install the operator via Helm.
  2. Define a ClickHouseInstallation CR that describes your clusters, shard/replica layout, and storage class using local NVMe where possible.
  3. Enable ClickHouse Keeper (built-in lightweight alternative to ZooKeeper) for cluster metadata in 2026 deployments.
# Example minimal ClickHouseInstallation (YAML simplified)
apiVersion: clickhouse.altinity.com/v1
kind: ClickHouseInstallation
metadata:
  name: ch-install
spec:
  configuration:
    clusters:
      - name: ch-cluster
        layout:
          shardsCount: 1
          replicasCount: 3
        templates:
          podTemplate: ch-pod

Operational tips:

  • Use StatefulSets for predictable network identities for replicas.
  • Prefer local NVMe or fast SSDs for MergeTree; network block storage hurts compaction performance.
  • Set resource requests and limits to avoid CPU throttling that hurts merges.
  • Expose metrics via Prometheus and configure merge throttles based on IO utilization.

Serverless and managed options

ClickHouse Cloud and other managed offerings simplify scaling and backups. Choose managed when:

  • Your org lacks SRE bandwidth to run stateful ClickHouse clusters.
  • You need elastic scaling for seasonal peaks or pilot projects.

Trade-offs: managed options cost more per GB but reduce operational risk. Given ClickHouse's strong funding and marketplace growth in early 2026, managed offerings now provide better SLAs and features suitable for telemetry workloads.

Operational guidelines and monitoring

Critical metrics to monitor:

  • Write throughput and latency
  • Merge queue length and background merges
  • Disk utilization and free space
  • Memory pressure and OOM kills
  • Kafka consumer lag (if using Kafka engine)

Alerting examples:

  • Alert when merge queue grows > N for more than 5 minutes.
  • Alert on disk usage > 80% per node to avoid failed inserts.
  • Alert when Kafka consumer lag > threshold for > 1 minute.

Testing and validating performance in your environment

Validate with a staged approach:

  1. Replay production traffic into a staging cluster with 10-20% of production scale and measure tail latencies.
  2. Run spike tests to simulate shift changes with 5x normal load.
  3. Measure query latencies for common dashboards and for multi-join queries used in forensic root cause.
  4. Test node failures and replica promotion to validate recovery.

Tools: clickhouse-benchmark, custom Go/Python producers that use the native protocol, and Kafka-producer load generators.

Common pitfalls and how to avoid them

  • High-cardinality primary keys — avoid using high-cardinality fields in the ORDER BY unless you need per-value range queries. Use robot_id first, not device-specific session IDs.
  • Overusing Nullable — Nullable adds storage overhead; prefer default sentinel values when appropriate.
  • Ignoring compression trade-offs — use LZ4 for hot ingestion and ZSTD for archival to control CPU vs storage cost.
  • No backpressure at source — ensure gateways or edge aggregators detect ClickHouse ingestion rate and buffer accordingly.

Case study snapshot: 3-node cluster powering a 2,000-robot fleet

Summary of a real-world deployment pattern we observed with a large warehouse operator in late 2025:

  • Fleet size: 2,000 robots, telemetry frequency: position 10 Hz for movers, heartbeat 1 Hz.
  • Ingestion architecture: edge aggregator passes data to Kafka, ClickHouse consumes via Kafka engine into a MergeTree table.
  • Cluster: 3 nodes, each 32 vCPU, 128 GB RAM, NVMe SSD. LZ4 compression for hot tier, ZSTD for warmed partitions.
  • Retention: raw telemetry hot for 14 days, 1-minute aggregates kept for 365 days.
  • Outcome: sub-second dashboards for recent telemetry, reduction in mean time to detect anomalies by 3x, and storage costs reduced 5x against naive raw retention.

Advanced strategies and future-proofing

For teams planning beyond initial deployment:

  • Schema evolution — use column defaults and nullable columns thoughtfully. Adopt migration scripts that add new columns at the end to avoid full-table operations.
  • Hybrid queries — combine ClickHouse for operational queries and a data lake for large-scale ML training. Export Parquet periodically for ML pipelines.
  • Edge-first intelligence — run model inference at gateways and only send summary telemetry to reduce central load.

Decision checklist: are you ready to use ClickHouse for robotics telemetry?

  • Do you need sub-second analytics across millions of rows per second?
  • Do you have telemetry retention requirements that benefit from columnar compression?
  • Can you operate Kafka or a similar broker, or will you use managed ingestion?
  • Do you have SRE capacity for stateful Kubernetes? If not, consider ClickHouse Cloud.

Actionable recipe: 30-day plan to production

  1. Week 1: Prototype schema and ingestion. Stand up a single-node ClickHouse and replay 1 day of production traffic. Tune batch sizes and compression.
  2. Week 2: Deploy 3-node staging with Kafka ingestion. Run spike tests and implement TTL retention policies.
  3. Week 3: Add materialized views for aggregates and configure Prometheus monitoring and alerts. Test node failure recovery and backpressure paths.
  4. Week 4: Run a canary with a subset of robots in production. Iterate on schema for additional sensor columns and finalize retention tiering and cost estimates.

Closing thoughts and 2026 predictions

In 2026, warehouse automation will depend on robust data platforms. ClickHouse has matured into a practical choice for robotics telemetry thanks to improved managed services, community operators, and performance wins on modern NVMe hardware. Teams that adopt strict ingestion contracts, pragmatic schema patterns, and tiered retention will unlock faster diagnostics, lower cloud costs, and better operational resilience.

Call to action

If you manage robotics telemetry and want a tailored benchmark for your fleet, start by exporting one week of anonymized telemetry and run the 30-day plan above. Need help? Contact deployed.cloud for a hands-on workshop: cluster sizing, benchmark runs, and a vetted ClickHouse deployment recipe for your warehouse automation stack.

Advertisement

Related Topics

#benchmarks#clickhouse#robotics
U

Unknown

Contributor

Senior editor and content strategist. Writing about technology, design, and the future of digital media. Follow along for deep dives into the industry's moving parts.

Advertisement
2026-03-08T00:05:06.737Z