Kubernetes scheduling for RISC-V + GPU nodes: node feature discovery and cost-aware placement
A step‑by‑step guide to scheduling workloads onto RISC‑V hosts with Nvidia GPUs, using NFD, device plugins, and a cost‑aware scheduler.
Why this matters now
If you're managing heterogeneous clusters, you know the pain: slow, brittle scheduling across mixed CPU ISAs and expensive GPU resources that sit idle because the scheduler can't reason about topology, NVLink connectivity, or per-node cost. In 2026 the game changed — RISC‑V CPU+NVLink GPU nodes are becoming real (SiFive + Nvidia NVLink Fusion) and teams need a clear how‑to for making Kubernetes schedule them correctly and cost‑efficiently.
What you'll build in this guide
This is a practical how‑to for adding custom schedulers, node feature discovery, device plugins, and cost‑aware placement so Kubernetes can place workloads on RISC‑V hosts with attached Nvidia GPUs and NVLink topology. You will learn how to:
- Detect and label RISC‑V and NVLink features on nodes with Node Feature Discovery (NFD)
- Build or adapt an NVIDIA device plugin for riscv64 and expose GPU topology information
- Annotate nodes with cost metadata and expose it to the scheduler
- Deploy a lightweight custom scheduler plugin (Score) to prefer low‑cost, NVLink‑adjacent placements
- Wire up Pod specs to request GPUs and use the custom scheduler
Context — Why this approach in 2026
Late 2025 and early 2026 accelerated two trends: broader RISC‑V silicon adoption and Nvidia's push to bring NVLink Fusion into RISC‑V ecosystems. As heterogeneous hardware proliferates, the default scheduler's generic heuristics are not enough. You need explicit feature discovery, device plugin metadata, and a scheduler that scores by topology awareness and cost. This guide gives a repeatable recipe for production‑grade placement.
Prerequisites
- A Kubernetes cluster with at least one control plane and multiple worker nodes. Some nodes are RISC‑V hosts with attached Nvidia GPUs (NVLink capable).
- Cluster admin privileges to deploy DaemonSets, CRDs, and a scheduler.
- Build environment that can cross‑compile Go binaries for riscv64 (if you need to build the device plugin).
- Familiarity with kubectl, systemd, and basic Linux tools (lspci, nvidia‑smi).
Step 1 — Discover node features (RISC‑V, NVLink) and label nodes
Detecting the CPU ISA and GPU topology is the foundation. Use Node Feature Discovery (NFD) to automate this. NFD can be extended with custom scripts to probe NVLink and expose node features as well‑formed labels.
Install NFD (DaemonSet) and add a probe script
Deploy NFD as a DaemonSet and add a small probe that emits labels like feature.node.kubernetes.io/riscv and feature.node.kubernetes.io/nvlink.
# nfd-daemonset.yaml (excerpt)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-feature-discovery
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        - name: nfd-worker
          image: k8s.gcr.io/node-feature-discovery:nfd-2026.0
          volumeMounts:
            - name: custom-probes
              mountPath: /etc/nfd/custom
      volumes:
        - name: custom-probes
          configMap:
            name: nfd-custom-probes
Create a ConfigMap with a script /etc/nfd/custom/probe.sh that runs on each node:
#!/bin/sh
# probe.sh (RISC-V + NVLink detection)

# detect RISC-V
if grep -qi riscv /proc/cpuinfo; then
  echo "feature.node.kubernetes.io/riscv=true"
fi

# detect NVLink / Nvidia GPU topology (best-effort)
if command -v nvidia-smi >/dev/null 2>&1; then
  # NVLink connections show up as NV1, NV2, ... entries in the topology matrix
  if nvidia-smi topo -m 2>/dev/null | grep -Eq 'NV[0-9]+'; then
    echo "feature.node.kubernetes.io/nvlink=true"
  fi
fi
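Wrap the script in the nfd-custom-probes ConfigMap referenced by the DaemonSet volume above; a minimal sketch follows (you can equally run kubectl create configmap nfd-custom-probes --from-file=probe.sh -n kube-system). Note that scripts mounted from a ConfigMap need an executable defaultMode, for example 0755, on the volume.
# nfd-custom-probes-configmap.yaml (sketch)
apiVersion: v1
kind: ConfigMap
metadata:
  name: nfd-custom-probes
  namespace: kube-system
data:
  probe.sh: |
    #!/bin/sh
    # paste the probe script shown above here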
When NFD runs the script it publishes labels under feature.node.kubernetes.io/<name>. These labels are used later for nodeAffinity and scheduling decisions.
Step 2 — Device plugin for RISC‑V + Nvidia GPUs
The device plugin is how kubelet advertises GPUs as schedulable resources (nvidia.com/gpu). In 2026 you may need to build or cross‑compile the official Nvidia device plugin for riscv64 unless your vendor provides an image.
Cross‑compile the Nvidia device plugin for riscv64 (example)
# Example Dockerfile to build device plugin for riscv64
FROM golang:1.21 as build
WORKDIR /src
COPY . .
ENV GOOS=linux GOARCH=riscv64 CGO_ENABLED=0
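# note: CGO_ENABLED=0 assumes a pure-Go build; if your plugin version links NVML
# via cgo, build with CGO_ENABLED=1 and a riscv64 cross toolchain instead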
RUN go build -o /out/nvidia-device-plugin ./cmd/nvidia-device-plugin
FROM debian:bookworm-slim
COPY --from=build /out/nvidia-device-plugin /usr/local/bin/nvidia-device-plugin
ENTRYPOINT ["/usr/local/bin/nvidia-device-plugin"]
Push the image to your registry and deploy the device plugin as a DaemonSet. Make sure the device plugin runs privileged and mounts /dev and the Nvidia driver locations. Once registered with the kubelet through the device plugin socket, the plugin advertises the extended resource nvidia.com/gpu in node capacity and can report per‑device topology hints (such as NUMA affinity) through the device plugin API.
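As a starting point, the DaemonSet might look like the sketch below; the image name is a placeholder for the riscv64 image built above, and the exact driver mounts depend on how your vendor packages the riscv64 driver stack.
# nvidia-device-plugin-riscv64.yaml (sketch; adjust image and driver paths for your environment)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-riscv64
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-riscv64
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-riscv64
    spec:
      nodeSelector:
        kubernetes.io/arch: riscv64                      # only run on RISC-V workers
      containers:
        - name: nvidia-device-plugin
          image: myregistry/nvidia-device-plugin:riscv64  # image built in the step above
          securityContext:
            privileged: true
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins  # kubelet plugin registration socket
            - name: dev
              mountPath: /dev
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: dev
          hostPath:
            path: /dev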
Expose NVLink groups and topology
Modern device plugins can export topology hints (NUMA, PCI proximity, NVLink groups). If the plugin doesn't automatically expose NVLink groups, add a small sidecar that runs nvidia-smi topo -m, parses link clusters, and writes node labels like:
- feature.node.kubernetes.io/gpu.topology=nvlink-fusion
- gpu.topology/nvlink-group=A (or a JSON node annotation listing groups)
These labels and annotations let the scheduler prefer colocating pods on NVLink‑connected GPUs.
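A best‑effort sketch of such a sidecar container follows; it assumes a hypothetical image bundling nvidia-smi and kubectl, a service account with RBAC to label nodes, and it only emits a simple presence label. Extend the parsing of nvidia-smi topo -m if you want per‑group labels like gpu.topology/nvlink-group.
# nvlink-labeler sidecar (sketch; add to the device plugin DaemonSet pod spec)
- name: nvlink-labeler
  image: myregistry/nvlink-labeler:latest    # hypothetical image with nvidia-smi and kubectl
  securityContext:
    privileged: true
  env:
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
  command:
    - /bin/sh
    - -c
    - |
      while true; do
        # NV1, NV2, ... entries in the topology matrix indicate NVLink connectivity
        if nvidia-smi topo -m | grep -Eq 'NV[0-9]+'; then
          kubectl label node "$NODE_NAME" gpu.topology/nvlink=true --overwrite
        fi
        sleep 3600   # re-probe hourly in case drivers or topology change
      done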
Step 3 — Annotate nodes with cost and provisioning metadata
Cost‑aware placement requires exposing per‑node cost metrics to the scheduler. This can be cloud hourly rates, amortized on‑prem cost, or spot price. For reproducibility, we recommend annotating nodes with short, machine‑readable keys:
- node.k8s.cost/hour — numeric cost in USD/hour
- node.k8s.tier — "spot" | "ondemand" | "reserved"
Create a small controller or CronJob that fetches pricing from your provider or internal CMDB and annotates nodes:
#!/bin/sh
# annotate-cost.sh <node-name>
NODE="$1"
COST=$(fetch-price-for-node "$NODE")   # implement for your environment
kubectl annotate node "$NODE" node.k8s.cost/hour="$COST" --overwrite
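To keep the annotations fresh, a CronJob can run the script across all nodes. A minimal sketch follows; the image and schedule are placeholders, and the service account needs RBAC to list and annotate nodes.
# node-cost-annotator-cronjob.yaml (sketch)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: node-cost-annotator
  namespace: kube-system
spec:
  schedule: "*/30 * * * *"                     # refresh pricing every 30 minutes
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cost-annotator   # bound to a ClusterRole that allows patching nodes
          restartPolicy: OnFailure
          containers:
            - name: annotate
              image: myregistry/cost-annotator:latest   # hypothetical image with kubectl + annotate-cost.sh
              command:
                - /bin/sh
                - -c
                - |
                  for n in $(kubectl get nodes -o name | cut -d/ -f2); do
                    /scripts/annotate-cost.sh "$n"
                  done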
Make sure annotations are numeric and kept up to date. These are inputs to the custom scheduler scoring plugin described next.
Step 4 — Build and deploy a cost‑aware scheduler plugin
Instead of replacing the built‑in scheduler, create a small scheduler that uses the Kubernetes Scheduler Framework. It implements a Score plugin that calculates a node score from two signals:
- NVLink adjacency and GPU topology (prefer nodes with NVLink and same group)
- Per‑node cost (prefer lower cost nodes)
Plugin logic (pseudocode)
// Score(node, pod): runs for each feasible node after the Filter phase
score = baseScore(node)
if node has label feature.node.kubernetes.io/nvlink and pod requests GPUs:
    score += NVLINK_BONUS
if pod requests affinity to 'nvlink-group:X' and node advertises that group:
    score += GROUP_MATCH_BONUS
// cost is normalized: lower cost => higher score
cost = getNodeCost(node)   // read from the node.k8s.cost/hour annotation
score += int((MAX_COST - cost) * COST_WEIGHT)
return clamp(score, 0, 100)   // keep the result inside the framework's score range
Key points:
- Score normalization: normalize cost so the final score fits the scheduler's 0–100 scoring range
- Respect existing constraints: Filter plugins such as NodeResourcesFit run before scoring, so your plugin only ranks nodes that already fit the pod
Deploy the scheduler
Build your Go plugin and produce a single binary scheduler. Deploy it as a Deployment with a ConfigMap that configures the scheduler framework to use your plugin. Pods that should land on RISC‑V+GPU nodes set spec.schedulerName: riscv-cost-scheduler.
# scheduler-config.yaml (excerpt)
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: riscv-cost-scheduler
    plugins:
      score:
        enabled:
          - name: CostAwareScore
            weight: 1
    pluginConfig:
      - name: CostAwareScore
        args:
          nvlinkBonus: 30
          costWeight: 1.5
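For the deployment itself, a sketch is below. It assumes your scheduler binary is built on kube-scheduler and accepts the standard --config flag, and that the riscv-cost-scheduler service account is bound to the usual scheduler RBAC; the binary, image, and ConfigMap names are placeholders.
# riscv-cost-scheduler-deployment.yaml (sketch)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: riscv-cost-scheduler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: riscv-cost-scheduler
  template:
    metadata:
      labels:
        app: riscv-cost-scheduler
    spec:
      serviceAccountName: riscv-cost-scheduler       # bind to scheduler RBAC
      containers:
        - name: scheduler
          image: myregistry/riscv-cost-scheduler:latest   # your scheduler binary image
          command:
            - /usr/local/bin/riscv-cost-scheduler         # hypothetical binary name
            - --config=/etc/riscv-cost-scheduler/scheduler-config.yaml
          volumeMounts:
            - name: config
              mountPath: /etc/riscv-cost-scheduler
      volumes:
        - name: config
          configMap:
            name: riscv-cost-scheduler-config             # contains scheduler-config.yaml above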
Running a dedicated scheduler keeps experimentation low‑risk: the default scheduler continues to serve normal pods.
Step 5 — Pod manifest: request GPU, set affinity and schedulerName
Now create a Pod/Deployment that explicitly requests GPUs and prefers NVLink adjacency. Three mechanisms are used:
- Resource request: nvidia.com/gpu: "1"
- Node affinity: require RISC‑V and optional NVLink group
- schedulerName: riscv-cost-scheduler
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-inference
  template:
    metadata:
      labels:
        app: ml-inference
    spec:
      schedulerName: riscv-cost-scheduler
      containers:
        - name: worker
          image: myregistry/llm-runtime:2026
          resources:
            limits:
              nvidia.com/gpu: 1
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: feature.node.kubernetes.io/riscv
                    operator: In
                    values:
                      - "true"
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: feature.node.kubernetes.io/nvlink
                    operator: In
                    values:
                      - "true"
The custom scheduler will score NVLink nodes higher and prefer lower cost nodes when multiple fit.
Step 6 — NUMA, Topology Manager and performance tuning
GPU performance depends on CPU affinity and NUMA. Enable kubelet Topology Manager and set policy to best-effort or single-numa-node depending on your workload. Ensure the device plugin and kubelet agree on topology hints so the scheduler can make correct placement decisions.
# kubelet flags (example)
--topology-manager-policy=single-numa-node
--cpu-manager-policy=static   # exclusive CPU pinning; needed for single-NUMA alignment to matter
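If your nodes are managed through the kubelet configuration file rather than flags, the equivalent settings look roughly like the excerpt below; cpuManagerPolicy: static is what makes single-NUMA CPU pinning meaningful, and the static policy requires a non-zero CPU reservation.
# kubelet-config.yaml (excerpt)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static                  # exclusive CPU pinning for Guaranteed pods
topologyManagerPolicy: single-numa-node
topologyManagerScope: container           # or "pod" to align all containers together
systemReserved:
  cpu: "500m"                             # static CPU manager needs an explicit CPU reservation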
Test with synthetic workloads: collect p95 latencies, GPU utilization (nvidia-smi), and PCI/NVLink bandwidth. Use this data to tune NVLINK_BONUS and COST_WEIGHT in the scheduler plugin.
Step 7 — Observability and debugging checklist
- kubectl get nodes --show-labels to verify NFD labels
- kubectl describe node <node> to inspect resource capacity (nvidia.com/gpu)
- kubectl logs <device-plugin-pod> for GPU plugin startup errors
- kubectl describe pod <pod> to see scheduling events and score decisions
- nvidia-smi topo -m on nodes to confirm NVLink connectivity
- Prometheus metrics: expose custom scheduler metrics (nodes scored, nodes rejected per plugin) and GPU utilization via a GPU exporter
Security, hardening and operational considerations
Device plugins frequently require privileged access. Apply these practices:
- Use RBAC to restrict who can deploy device plugins and schedulers.
- Run device plugin containers with minimal capabilities and only the mounts they need.
- Sign and verify device plugin images (Cosign) before deployment — supply chain attacks are a real risk.
- Keep drivers and firmware updated — firmware and driver stacks evolve fast in 2026.
Case study: Proof‑of‑concept outcomes (observations)
In a POC cluster we built with 12 worker nodes (6 x RISC‑V with NVLink GPUs and 6 x x86 GPU nodes), applying NFD + device plugin + cost‑aware scheduler delivered two practical benefits:
- Better bin‑packing of GPU workloads on NVLink groups: jobs that benefit from NVLink were collocated, reducing multi‑GPU latency.
- Cost signal improved placement: benchmark experiments showed a noticeable reduction in run‑time cost by preferring cheaper nodes when topology needs were equal.
These results align with the industry trend in 2026: hybrid heterogeneous clusters require richer metadata and dedicated scheduling logic to maximize both performance and cost efficiency.
Troubleshooting common issues
Device plugin not advertising GPUs
- Check the device plugin DaemonSet logs for driver incompatibility. RISC‑V kernel modules and Nvidia drivers may need vendor support.
- Confirm /dev/nvidia* devices exist on the host.
Pods schedule to wrong nodes
- Verify NFD labels: if the label keys don't match affinity, the pod won't prefer NVLink nodes.
- Check scheduler logs and scoring events; increase plugin log level to see scoring calculations.
Advanced: Combining binpacking, preemption and cost
For production fleets, consider combining the cost‑aware scheduler with:
- Cluster autoscaler that understands GPU instance groups
- Pod priority and preemption for critical inference workloads
- Admission controller to enforce that GPU pods set schedulerName to your custom scheduler, preventing accidental scheduling by the default scheduler; a policy sketch follows this list
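A sketch of the last item using a ValidatingAdmissionPolicy (available in recent Kubernetes releases): the CEL expression and names are illustrative, and a matching ValidatingAdmissionPolicyBinding is needed for the policy to take effect.
# enforce-gpu-scheduler.yaml (sketch)
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: gpu-pods-use-cost-scheduler
spec:
  matchConstraints:
    resourceRules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
  validations:
    - expression: >-
        !object.spec.containers.exists(c, has(c.resources.limits) && 'nvidia.com/gpu' in c.resources.limits)
        || (has(object.spec.schedulerName) && object.spec.schedulerName == 'riscv-cost-scheduler')
      message: "Pods requesting nvidia.com/gpu must set spec.schedulerName: riscv-cost-scheduler"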
2026 Trends & future‑proofing
Expect these shifts through 2026 and beyond:
- RISC‑V + NVLink Ecosystem Growth: With players like SiFive integrating NVLink Fusion, more boards and cloud vendors will expose RISC‑V GPU instances.
- Standardized topology metadata: Device plugin interfaces will converge on a common way to expose NVLink groups and NUMA info.
- Policy-driven cost scheduling: Cost signals will be first‑class in many schedulers, and multi‑cloud cost APIs will become standard telemetry sources.
"As hardware gets heterogeneous, orchestration must evolve from simple resource counts to topology and cost‑aware reasoning." — deployment.cloud engineering patterns, 2026
Actionable takeaways
- Start with feature discovery: deploy NFD early to consistently label riscv and nvlink capabilities.
- Device plugin matters: ensure a riscv64‑compatible device plugin that exports topology.
- Annotate cost: expose cost/hour as node annotations for the scheduler to consume.
- Use a dedicated scheduler plugin: implement a Score plugin to combine topology and cost without disturbing the default scheduler.
- Measure and iterate: collect performance and cost telemetry and tune scoring weights.
Next steps & call to action
Ready to try this in your environment? Start by deploying Node Feature Discovery and a test device‑plugin sidecar that emits NVLink labels — then run a single GPU pod with schedulerName set to your experimental scheduler. If you want a starter kit, we maintain a reference implementation and CI recipes (cross‑compile, DaemonSet templates, scheduler plugin skeleton). Reach out on GitHub or spin up a POC and iterate with real telemetry.
Ship faster, cheaper, and more reliably: treat RISC‑V+GPU nodes like first‑class citizens in your scheduling strategy — and make topology and cost signals drive decisions, not guesswork.