
Virtualizing NVIDIA GB200 NVL4: Zero Performance Overhead with Full Isolation

The NVIDIA GB200 NVL4 platform introduces a new class of accelerated infrastructure, combining NVIDIA Grace CPUs and NVIDIA Blackwell GPUs with high-bandwidth interconnects designed for large-scale AI training and inference. These systems are built for performance at the silicon level. The question for cloud and AI Factory operators is how to virtualize them without compromising that performance.

OpenNebula enables GB200 NVL4 virtualization using PCI passthrough, delivering full GPU access to virtual machines while preserving isolation and operational flexibility.

Managing Grace Blackwell with PCI Passthrough

OpenNebula integrates with KVM-based hypervisors to manage GPU resources through direct PCI passthrough. In this model, NVIDIA Grace Blackwell GPUs can be assigned exclusively to a virtual machine. The hypervisor does not emulate the GPU or introduce a software abstraction layer. Instead, the device is mapped directly into the guest environment.

From the VM’s perspective, the GPU appears as native hardware. Drivers, NVIDIA CUDA-X libraries, and AI frameworks operate exactly as they would on bare metal. The NVIDIA NVLink interconnect remains available within the host boundary, ensuring that GPU-to-GPU bandwidth is preserved for multi-GPU training workloads.

OpenNebula controls allocation at the orchestration level. Administrators define GPU pools, apply quotas, and assign devices to tenants through RBAC policies. Once allocated, the GPU is fully dedicated to the VM, ensuring strict workload separation.
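As a sketch of what such an assignment can look like, an OpenNebula VM template requests a passthrough device by PCI attributes; the values below are illustrative (`10de` is the NVIDIA vendor ID, `0302` the PCI class for 3D controllers), and the exact filters should match the devices reported by your hosts:

```
PCI = [
  VENDOR = "10de",   # NVIDIA
  CLASS  = "0302"    # 3D controller
]
```

At deployment time, the scheduler places the VM on a host with a matching free device and maps it into the guest.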

When MIG (Multi-Instance GPU) is enabled, a single physical GPU can be partitioned into multiple hardware-isolated GPU instances. Each MIG slice has dedicated compute cores, memory partitions, and cache resources. OpenNebula can allocate individual MIG instances to separate VMs, allowing multiple tenants to share the same physical GPU with strict hardware-enforced isolation.

From the VM’s perspective, the assigned MIG instance behaves as a dedicated GPU. CUDA, NVIDIA NCCL, and AI frameworks operate normally, without awareness of the underlying partitioning. This enables granular GPU allocation within multi-tenant AI Factories.

Demonstrating Zero Virtualization Overhead

A frequent concern when virtualizing high-end accelerators is the potential performance impact. With PCI passthrough, however, the GPU is not virtualized in the data path. The virtual machine interacts directly with the physical device, so the software stack sees the GPU exactly as it would on bare metal.

In practice, benchmark comparisons between bare-metal and passthrough configurations show negligible differences in performance. CUDA kernels run natively, GPU memory access is not intercepted, and NCCL communication within the host behaves the same as in a non-virtualized setup. The only additional overhead comes from normal hypervisor CPU scheduling, which has no meaningful effect on workloads dominated by GPU computation.

The result is production-grade performance combined with the operational flexibility of virtualization.

cuBLAS Benchmark

To evaluate raw compute performance, we used an NVIDIA cuBLAS GEMM benchmark, which measures floating-point throughput for large matrix multiplications. The benchmark repeatedly executes the operation:

C = αAB + βC

where A, B, and C are square matrices stored in column-major format using FP32 precision. Internally, the test calls the cublasGemmEx API with explicit compute and algorithm selection.

The matrix size used in the experiments was N = 16384, repeated over 10 iterations, ensuring that the workload is compute-bound and allows the GPU to approach its peak floating-point throughput.
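As a sanity check on the scale of this workload: the dominant cost of one FP32 GEMM is the matrix product, at 2N³ floating-point operations (the βC update adds only O(N²) and is negligible), so each iteration at N = 16384 executes roughly 8.8 TFLOP:

```python
# Nominal FLOP count for C = alpha*A*B + beta*C with square N x N matrices:
# the A*B product dominates at 2*N^3 operations.
def gemm_flops(n: int) -> int:
    return 2 * n ** 3

N = 16384
print(f"{gemm_flops(N) / 1e12:.2f} TFLOP per GEMM")  # 8.80 TFLOP per GEMM
```

At the measured throughput of ~72 TFLOP/s, each GEMM therefore completes in roughly an eighth of a second, long enough to keep the GPU fully saturated.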

The benchmark was executed on identical GPU hardware in both bare-metal servers and passthrough virtual machines. Measurements were collected for a full NVIDIA GB200 GPU as well as several MIG profiles with reduced numbers of streaming multiprocessors.

The results show almost identical performance across both environments. For example, the full GPU (152 SMs) delivers 71,891 GFLOP/s on bare metal and 71,738 GFLOP/s in passthrough, a difference of only 0.21%. Even smaller MIG profiles show differences of just a few percent, confirming that passthrough preserves native compute performance.
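The reported overhead follows directly from the two throughput figures:

```python
# Relative slowdown of passthrough vs. bare metal, from the full-GPU cuBLAS results.
bare_metal = 71_891    # GFLOP/s, bare metal (152 SMs)
passthrough = 71_738   # GFLOP/s, PCI passthrough VM
overhead_pct = (bare_metal - passthrough) / bare_metal * 100
print(f"{overhead_pct:.2f}% overhead")  # 0.21% overhead
```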

vLLM Inference Benchmark

To evaluate inference workloads, we used vLLM as the model serving engine and GuideLLM to generate concurrent request workloads. vLLM exposes models through an OpenAI-compatible API, while GuideLLM controls concurrency levels and records performance metrics.

Three models were selected to represent typical inference workloads of increasing size:

  • OPT-2.7B (small)
  • Qwen2.5-7B-Instruct (medium)
  • Llama-2-13B (large)

Each run deployed a single-GPU vLLM instance configured for BF16 inference. The benchmark used prompts of 512 tokens and generated 256 tokens, approximating typical chat-style workloads. Runs lasted 60 seconds, including warm-up and cool-down periods, while concurrency levels were varied to identify the throughput saturation point.
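Because vLLM serves an OpenAI-compatible API, the benchmark traffic reduces to standard completion requests. A minimal client sketch is shown below; the endpoint URL and model name are placeholders to adjust for your deployment, and only the `max_tokens` value mirrors the benchmark configuration above:

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint for a single-GPU vLLM instance; adjust host/port as needed.
VLLM_URL = "http://localhost:8000/v1/completions"

def build_request(prompt: str) -> Request:
    # Generate up to 256 tokens, matching the benchmark configuration.
    payload = {
        "model": "meta-llama/Llama-2-13b-hf",  # placeholder model name
        "prompt": prompt,
        "max_tokens": 256,
    }
    return Request(VLLM_URL, data=json.dumps(payload).encode(),
                   headers={"Content-Type": "application/json"})

# response = urlopen(build_request("Hello"))  # requires a running vLLM server
```

GuideLLM issues many such requests concurrently and records per-request latency and aggregate throughput.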

The results again show very similar performance between bare-metal and passthrough environments.

  • Throughput: Differences remain within ±4%, and performance scales proportionally with the GPU resources assigned to each MIG profile.
  • Latency: The 95th percentile latency (p95) at peak throughput differs by less than ±3%.
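For reference, p95 is the latency below which 95% of requests complete. One common definition (nearest-rank, which may differ slightly from the interpolation GuideLLM uses internally) can be sketched as:

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    # Nearest-rank percentile: the smallest sample such that at least
    # pct% of all samples are less than or equal to it.
    ranked = sorted(samples)
    rank = math.ceil(pct / 100 * len(ranked))
    return ranked[rank - 1]

latencies_ms = [120, 135, 128, 410, 142, 131, 125, 390, 138, 129]
print(percentile(latencies_ms, 95))  # 410
```

Tail percentiles like p95 are more sensitive than averages to scheduling jitter, which makes the sub-3% gap a strong indicator that passthrough adds no systematic latency.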

Overall, the experiments show that GPU passthrough delivers near bare-metal performance for both compute and LLM inference workloads, while retaining the operational advantages of virtualization such as isolation, resource partitioning, and flexible infrastructure management.

Full Isolation Without Sacrificing Flexibility

One of the advantages of this model is flexibility at the platform layer. Each VM can run its own software stack independently.

For example:

  • Kubernetes can run inside the VM to manage AI workloads.
  • GPU operators and CUDA toolkits can be versioned per tenant.
  • Inference services, training pipelines, or SLURM schedulers can operate independently.

Because the GPU instance is directly assigned, Kubernetes inside the VM sees a native accelerator device. There is no dependency on host-level orchestration. This simplifies governance and reduces cross-tenant interference.

MIG also allows operators to tailor GPU profiles to workload types. Smaller partitions can be assigned for inference workloads, while larger partitions can be dedicated to fine-tuning or model development tasks.

A Practical Model for AI Factories

Virtualizing GB200 NVL4 with PCI passthrough provides a balanced architecture for AI Factories. It preserves silicon-level performance while introducing operational structure and multi-tenant isolation.

Instead of forcing a choice between bare metal and flexibility, OpenNebula enables both. GPU resources can be allocated dynamically, governed centrally, and delivered as isolated environments without performance compromise.

For operators building sovereign AI infrastructure, neocloud platforms, or enterprise AI clusters, this approach offers a practical path forward: full hardware performance with cloud-level control.

Meet us in person! We’ll be exhibiting at NVIDIA GTC in San Jose. Come visit our team, see live demos, and discuss how OpenNebula can power your AI Factories and neocloud platforms.



Funded by the Spanish Ministry for Digital Transformation and Civil Service through the ONEnextgen Project  (UNICO IPCEI-2023-003), and co-funded by the European Union’s NextGenerationEU through the RRF.


Ruben S. Montero

Chief Technology Advisor at OpenNebula Systems

Mar 18, 2026
