Kubernetes scheduling for RISC-V + GPU nodes: node feature discovery and cost-aware placement
A step‑by‑step guide to scheduling workloads onto RISC‑V hosts with Nvidia GPUs, using NFD, device plugins, and a cost‑aware scheduler.
Why this matters now
If you're managing heterogeneous clusters, you know the pain: slow, brittle scheduling across mixed CPU ISAs and expensive GPU resources that sit idle because the scheduler can't reason about topology, NVLink connectivity, or per-node cost. In 2026 the game changed — RISC‑V CPU+NVLink GPU nodes are becoming real (SiFive + Nvidia NVLink Fusion) and teams need a clear how‑to for making Kubernetes schedule them correctly and cost‑efficiently.
What you'll build in this guide
This is a practical how‑to for adding custom schedulers, node feature discovery, device plugins, and cost‑aware placement so Kubernetes can place workloads on RISC‑V hosts with attached Nvidia GPUs and NVLink topology. You will learn how to:
- Detect and label RISC‑V and NVLink features on nodes with Node Feature Discovery (NFD)
- Build or adapt an NVIDIA device plugin for riscv64 and expose GPU topology information
- Annotate nodes with cost metadata and expose it to the scheduler
- Deploy a lightweight custom scheduler plugin (Score) to prefer low‑cost, NVLink‑adjacent placements
- Wire up Pod specs to request GPUs and use the custom scheduler
Context — Why this approach in 2026
Late 2025 and early 2026 accelerated two trends: broader RISC‑V silicon adoption and Nvidia's push to bring NVLink Fusion into RISC‑V ecosystems. As heterogeneous hardware proliferates, the default scheduler's generic heuristics are not enough. You need explicit feature discovery, device plugin metadata, and a scheduler that scores by topology awareness and cost. This guide gives a repeatable recipe for production‑grade placement.
Prerequisites
- A Kubernetes cluster with at least one control plane and multiple worker nodes. Some nodes are RISC‑V hosts with attached Nvidia GPUs (NVLink capable).
- Cluster admin privileges to deploy DaemonSets, CRDs, and a scheduler.
- Build environment that can cross‑compile Go binaries for riscv64 (if you need to build the device plugin).
- Familiarity with kubectl, systemd, and basic Linux tools (lspci, nvidia‑smi).
Step 1 — Discover node features (RISC‑V, NVLink) and label nodes
Detecting the CPU ISA and GPU topology is the foundation. Use Node Feature Discovery (NFD) to automate this. NFD can be extended with custom scripts to probe NVLink and expose node features as well‑formed labels.
Install NFD (DaemonSet) and add a probe script
Deploy NFD as a DaemonSet and add a small probe that emits labels like feature.node.kubernetes.io/riscv and feature.node.kubernetes.io/nvlink.
# nfd-daemonset.yaml (excerpt)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-feature-discovery
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        - name: nfd-worker
          image: k8s.gcr.io/node-feature-discovery:nfd-2026.0
          volumeMounts:
            - name: custom-probes
              mountPath: /etc/nfd/custom
      volumes:
        - name: custom-probes
          configMap:
            name: nfd-custom-probes
Create a ConfigMap with a script /etc/nfd/custom/probe.sh that runs on each node:
#!/bin/sh
# probe.sh (RISC-V + NVLink detection)

# detect RISC-V
if grep -qi riscv /proc/cpuinfo; then
  echo "feature.node.kubernetes.io/riscv=true"
fi

# detect NVLink / Nvidia GPU topology (best-effort)
if command -v nvidia-smi >/dev/null 2>&1; then
  # NVLink connections show up as NV1, NV2, ... entries in the topology matrix
  if nvidia-smi topo -m 2>/dev/null | grep -Eq 'NV[0-9]+'; then
    echo "feature.node.kubernetes.io/nvlink=true"
  fi
fi
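Wrap the script in the nfd-custom-probes ConfigMap referenced by the DaemonSet volume above; a minimal sketch follows (you can equally run kubectl create configmap nfd-custom-probes --from-file=probe.sh -n kube-system). Note that scripts mounted from a ConfigMap need an executable defaultMode, for example 0755, on the volume.
# nfd-custom-probes-configmap.yaml (sketch)
apiVersion: v1
kind: ConfigMap
metadata:
  name: nfd-custom-probes
  namespace: kube-system
data:
  probe.sh: |
    #!/bin/sh
    # paste the probe script shown above here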
When NFD runs the script it publishes labels under feature.node.kubernetes.io/<name>. These labels are used later for nodeAffinity and scheduling decisions.
Step 2 — Device plugin for RISC‑V + Nvidia GPUs
The device plugin is how kubelet advertises GPUs as schedulable resources (nvidia.com/gpu). In 2026 you may need to build or cross‑compile the official Nvidia device plugin for riscv64 unless your vendor provides an image.
Cross‑compile the Nvidia device plugin for riscv64 (example)
# Example Dockerfile to build device plugin for riscv64
FROM golang:1.21 as build
WORKDIR /src
COPY . .
ENV GOOS=linux GOARCH=riscv64 CGO_ENABLED=0
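# note: CGO_ENABLED=0 assumes a pure-Go build; if your plugin version links NVML
# via cgo, build with CGO_ENABLED=1 and a riscv64 cross toolchain instead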
RUN go build -o /out/nvidia-device-plugin ./cmd/nvidia-device-plugin
FROM debian:bookworm-slim
COPY --from=build /out/nvidia-device-plugin /usr/local/bin/nvidia-device-plugin
ENTRYPOINT ["/usr/local/bin/nvidia-device-plugin"]
Push the image to your registry and deploy the device plugin as a DaemonSet. Make sure the device plugin runs privileged and mounts /dev and the Nvidia driver locations. Once registered with the kubelet through the device plugin socket, the plugin advertises the extended resource nvidia.com/gpu in node capacity and can report per‑device topology hints (such as NUMA affinity) through the device plugin API.
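As a starting point, the DaemonSet might look like the sketch below; the image name is a placeholder for the riscv64 image built above, and the exact driver mounts depend on how your vendor packages the riscv64 driver stack.
# nvidia-device-plugin-riscv64.yaml (sketch; adjust image and driver paths for your environment)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-riscv64
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-riscv64
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-riscv64
    spec:
      nodeSelector:
        kubernetes.io/arch: riscv64                      # only run on RISC-V workers
      containers:
        - name: nvidia-device-plugin
          image: myregistry/nvidia-device-plugin:riscv64  # image built in the step above
          securityContext:
            privileged: true
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins  # kubelet plugin registration socket
            - name: dev
              mountPath: /dev
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: dev
          hostPath:
            path: /dev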
Expose NVLink groups and topology
Modern device plugins can export topology hints (NUMA, PCI proximity, NVLink groups). If the plugin doesn't automatically expose NVLink groups, add a small sidecar that runs nvidia-smi topo -m, parses link clusters, and writes node labels like:
- feature.node.kubernetes.io/gpu.topology=nvlink-fusion
- gpu.topology/nvlink-group=A (or a JSON node annotation listing groups)
These labels and annotations let the scheduler prefer colocating pods on NVLink‑connected GPUs.
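A best‑effort sketch of such a sidecar container follows; it assumes a hypothetical image bundling nvidia-smi and kubectl, a service account with RBAC to label nodes, and it only emits a simple presence label. Extend the parsing of nvidia-smi topo -m if you want per‑group labels like gpu.topology/nvlink-group.
# nvlink-labeler sidecar (sketch; add to the device plugin DaemonSet pod spec)
- name: nvlink-labeler
  image: myregistry/nvlink-labeler:latest    # hypothetical image with nvidia-smi and kubectl
  securityContext:
    privileged: true
  env:
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
  command:
    - /bin/sh
    - -c
    - |
      while true; do
        # NV1, NV2, ... entries in the topology matrix indicate NVLink connectivity
        if nvidia-smi topo -m | grep -Eq 'NV[0-9]+'; then
          kubectl label node "$NODE_NAME" gpu.topology/nvlink=true --overwrite
        fi
        sleep 3600   # re-probe hourly in case drivers or topology change
      done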
Step 3 — Annotate nodes with cost and provisioning metadata
Cost‑aware placement requires exposing per‑node cost metrics to the scheduler. This can be cloud hourly rates, amortized on‑prem cost, or spot price. For reproducibility, we recommend annotating nodes with short, machine‑readable keys:
- node.k8s.cost/hour — numeric cost in USD/hour
- node.k8s.tier — "spot" | "ondemand" | "reserved"
Create a small controller or CronJob that fetches pricing from your provider or internal CMDB and annotates nodes:
#!/bin/sh
# annotate-cost.sh <node-name>
NODE="$1"
COST=$(fetch-price-for-node "$NODE")   # implement for your environment
kubectl annotate node "$NODE" node.k8s.cost/hour="$COST" --overwrite
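To keep the annotations fresh, a CronJob can run the script across all nodes. A minimal sketch follows; the image and schedule are placeholders, and the service account needs RBAC to list and annotate nodes.
# node-cost-annotator-cronjob.yaml (sketch)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: node-cost-annotator
  namespace: kube-system
spec:
  schedule: "*/30 * * * *"                     # refresh pricing every 30 minutes
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cost-annotator   # bound to a ClusterRole that allows patching nodes
          restartPolicy: OnFailure
          containers:
            - name: annotate
              image: myregistry/cost-annotator:latest   # hypothetical image with kubectl + annotate-cost.sh
              command:
                - /bin/sh
                - -c
                - |
                  for n in $(kubectl get nodes -o name | cut -d/ -f2); do
                    /scripts/annotate-cost.sh "$n"
                  done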
Make sure annotations are numeric and kept up to date. These are inputs to the custom scheduler scoring plugin described next.
Step 4 — Build and deploy a cost‑aware scheduler plugin
Instead of replacing the built‑in scheduler, create a small scheduler that uses the Kubernetes Scheduler Framework. It implements a Score plugin that calculates a node score from two signals:
- NVLink adjacency and GPU topology (prefer nodes with NVLink and same group)
- Per‑node cost (prefer lower cost nodes)
Plugin logic (pseudocode)
// Score(node, pod): runs for each feasible node after the Filter phase
score = baseScore(node)
if node has label feature.node.kubernetes.io/nvlink and pod requests GPUs:
    score += NVLINK_BONUS
if pod requests affinity to 'nvlink-group:X' and node advertises that group:
    score += GROUP_MATCH_BONUS
// cost is normalized: lower cost => higher score
cost = getNodeCost(node)   // read from the node.k8s.cost/hour annotation
score += int((MAX_COST - cost) * COST_WEIGHT)
return clamp(score, 0, 100)   // keep the result inside the framework's score range
Key points:
- Score normalization: normalize cost so the final score fits the scheduler's 0–100 scoring range
- Respect existing constraints: Filter plugins such as NodeResourcesFit run before scoring, so your plugin only ranks nodes that already fit the pod
Deploy the scheduler
Build your Go plugin and produce a single binary scheduler. Deploy it as a Deployment with a ConfigMap that configures the scheduler framework to use your plugin. Pods that should land on RISC‑V+GPU nodes set spec.schedulerName: riscv-cost-scheduler.
# scheduler-config.yaml (excerpt)
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: riscv-cost-scheduler
    plugins:
      score:
        enabled:
          - name: CostAwareScore
            weight: 1
    pluginConfig:
      - name: CostAwareScore
        args:
          nvlinkBonus: 30
          costWeight: 1.5
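For the deployment itself, a sketch is below. It assumes your scheduler binary is built on kube-scheduler and accepts the standard --config flag, and that the riscv-cost-scheduler service account is bound to the usual scheduler RBAC; the binary, image, and ConfigMap names are placeholders.
# riscv-cost-scheduler-deployment.yaml (sketch)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: riscv-cost-scheduler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: riscv-cost-scheduler
  template:
    metadata:
      labels:
        app: riscv-cost-scheduler
    spec:
      serviceAccountName: riscv-cost-scheduler       # bind to scheduler RBAC
      containers:
        - name: scheduler
          image: myregistry/riscv-cost-scheduler:latest   # your scheduler binary image
          command:
            - /usr/local/bin/riscv-cost-scheduler         # hypothetical binary name
            - --config=/etc/riscv-cost-scheduler/scheduler-config.yaml
          volumeMounts:
            - name: config
              mountPath: /etc/riscv-cost-scheduler
      volumes:
        - name: config
          configMap:
            name: riscv-cost-scheduler-config             # contains scheduler-config.yaml above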
Running a dedicated scheduler keeps experimentation low‑risk: the default scheduler continues to serve normal pods.
Step 5 — Pod manifest: request GPU, set affinity and schedulerName
Now create a Pod/Deployment that explicitly requests GPUs and prefers NVLink adjacency. Three mechanisms are used:
- Resource request: nvidia.com/gpu: "1"
- Node affinity: require RISC‑V and optional NVLink group
- schedulerName: riscv-cost-scheduler
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-inference
  template:
    metadata:
      labels:
        app: ml-inference
    spec:
      schedulerName: riscv-cost-scheduler
      containers:
        - name: worker
          image: myregistry/llm-runtime:2026
          resources:
            limits:
              nvidia.com/gpu: 1
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: feature.node.kubernetes.io/riscv
                    operator: In
                    values:
                      - "true"
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: feature.node.kubernetes.io/nvlink
                    operator: In
                    values:
                      - "true"
The custom scheduler will score NVLink nodes higher and prefer lower cost nodes when multiple fit.
Step 6 — NUMA, Topology Manager and performance tuning
GPU performance depends on CPU affinity and NUMA. Enable kubelet Topology Manager and set policy to best-effort or single-numa-node depending on your workload. Ensure the device plugin and kubelet agree on topology hints so the scheduler can make correct placement decisions.
# kubelet flags (example)
--topology-manager-policy=single-numa-node
--cpu-manager-policy=static   # exclusive CPU pinning; needed for single-NUMA alignment to matter
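If your nodes are managed through the kubelet configuration file rather than flags, the equivalent settings look roughly like the excerpt below; cpuManagerPolicy: static is what makes single-NUMA CPU pinning meaningful, and the static policy requires a non-zero CPU reservation.
# kubelet-config.yaml (excerpt)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static                  # exclusive CPU pinning for Guaranteed pods
topologyManagerPolicy: single-numa-node
topologyManagerScope: container           # or "pod" to align all containers together
systemReserved:
  cpu: "500m"                             # static CPU manager needs an explicit CPU reservation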
Test with synthetic workloads: collect p95 latencies, GPU utilization (nvidia-smi), and PCI/NVLink bandwidth. Use this data to tune NVLINK_BONUS and COST_WEIGHT in the scheduler plugin.
Step 7 — Observability and debugging checklist
- kubectl get nodes --show-labels to verify NFD labels
- kubectl describe node <node> to inspect resource capacity (nvidia.com/gpu)
- kubectl logs <device-plugin-pod> for GPU plugin startup errors
- kubectl describe pod <pod> to see scheduling events and score decisions
- nvidia-smi topo -m on nodes to confirm NVLink connectivity
- Prometheus metrics: expose custom scheduler metrics (nodes scored, nodes rejected per plugin) and GPU utilization via a GPU exporter
Security, hardening and operational considerations
Device plugins frequently require privileged access. Apply these practices:
- Use RBAC to restrict who can deploy device plugins and schedulers.
- Run device plugin containers with minimal capabilities and only the mounts they need.
- Sign and verify device plugin images (Cosign) before deployment — supply chain attacks are a real risk.
- Keep drivers and firmware updated — firmware and driver stacks evolve fast in 2026.
Case study: Proof‑of‑concept outcomes (observations)
In a POC cluster we built with 12 worker nodes (6 x RISC‑V with NVLink GPUs and 6 x x86 GPU nodes), applying NFD + device plugin + cost‑aware scheduler delivered two practical benefits:
- Better bin‑packing of GPU workloads on NVLink groups: jobs that benefit from NVLink were collocated, reducing multi‑GPU latency.
- Cost signal improved placement: benchmark experiments showed a noticeable reduction in run‑time cost by preferring cheaper nodes when topology needs were equal.
These results align with the industry trend in 2026: hybrid heterogeneous clusters require richer metadata and dedicated scheduling logic to maximize both performance and cost efficiency.
Troubleshooting common issues
Device plugin not advertising GPUs
- Check the device plugin DaemonSet logs for driver incompatibility. RISC‑V kernel modules and Nvidia drivers may need vendor support.
- Confirm /dev/nvidia* devices exist on the host.
Pods schedule to wrong nodes
- Verify NFD labels: if the label keys don't match affinity, the pod won't prefer NVLink nodes.
- Check scheduler logs and scoring events; increase plugin log level to see scoring calculations.
Advanced: Combining binpacking, preemption and cost
For production fleets, consider combining the cost‑aware scheduler with:
- Cluster autoscaler that understands GPU instance groups
- Pod priority and preemption for critical inference workloads
- Admission controller to enforce that GPU pods set schedulerName to your custom scheduler, preventing accidental scheduling by the default scheduler; a policy sketch follows this list
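A sketch of the last item using a ValidatingAdmissionPolicy (available in recent Kubernetes releases): the CEL expression and names are illustrative, and a matching ValidatingAdmissionPolicyBinding is needed for the policy to take effect.
# enforce-gpu-scheduler.yaml (sketch)
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: gpu-pods-use-cost-scheduler
spec:
  matchConstraints:
    resourceRules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
  validations:
    - expression: >-
        !object.spec.containers.exists(c, has(c.resources.limits) && 'nvidia.com/gpu' in c.resources.limits)
        || (has(object.spec.schedulerName) && object.spec.schedulerName == 'riscv-cost-scheduler')
      message: "Pods requesting nvidia.com/gpu must set spec.schedulerName: riscv-cost-scheduler"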
2026 Trends & future‑proofing
Expect these shifts through 2026 and beyond:
- RISC‑V + NVLink Ecosystem Growth: With players like SiFive integrating NVLink Fusion, more boards and cloud vendors will expose RISC‑V GPU instances.
- Standardized topology metadata: Device plugin interfaces will converge on a common way to expose NVLink groups and NUMA info.
- Policy-driven cost scheduling: Cost signals will be first‑class in many schedulers, and multi‑cloud cost APIs will become standard telemetry sources.
"As hardware gets heterogeneous, orchestration must evolve from simple resource counts to topology and cost‑aware reasoning." — deployment.cloud engineering patterns, 2026
Actionable takeaways
- Start with feature discovery: deploy NFD early to consistently label riscv and nvlink capabilities.
- Device plugin matters: ensure a riscv64‑compatible device plugin that exports topology.
- Annotate cost: expose cost/hour as node annotations for the scheduler to consume.
- Use a dedicated scheduler plugin: implement a Score plugin to combine topology and cost without disturbing the default scheduler.
- Measure and iterate: collect performance and cost telemetry and tune scoring weights.
Next steps & call to action
Ready to try this in your environment? Start by deploying Node Feature Discovery and a test device‑plugin sidecar that emits NVLink labels — then run a single GPU pod with schedulerName set to your experimental scheduler. If you want a starter kit, we maintain a reference implementation and CI recipes (cross‑compile, DaemonSet templates, scheduler plugin skeleton). Reach out on GitHub or spin up a POC and iterate with real telemetry.
Ship faster, cheaper, and more reliably: treat RISC‑V+GPU nodes like first‑class citizens in your scheduling strategy — and make topology and cost signals drive decisions, not guesswork.