Multi-Tenant AI Factory on NVIDIA GB200 NVL4 with InfiniBand

Operating AI infrastructure at scale requires more than raw GPU performance. It demands efficient resource sharing, strict tenant isolation, and predictable performance across workloads. OpenNebula enables this by combining GPU passthrough, NUMA-aware scheduling, and segmented networking to build multi-tenant AI environments on NVIDIA GB200 NVL4 systems with InfiniBand.

Architecture Overview of a Multi-Tenant AI Factory

In this setup, OpenNebula is deployed on a front-end node that manages orchestration, scheduling, and system services. The architecture separates infrastructure components into dedicated networks, ensuring operational control and scalability.

A storage network provides access to virtual machine images and disks, while virtual machine networks leverage VXLAN to enable multi-tenancy and network isolation. An appliance or public network allows external connectivity when required, and a management network is used by OpenNebula to coordinate hosts and resources.

Keeping storage, management, and tenant traffic on separate networks prevents them from contending with one another and allows each to be scaled independently.

High-Performance Networking and Tenant Isolation

Multi-tenancy is enforced through network segmentation. Each tenant is assigned its own virtual network domain, implemented via dedicated virtual bridges mapped to VLAN interfaces over InfiniBand.
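
As a concrete illustration, such a tenant network can be defined with a short OpenNebula virtual network template. The sketch below uses hypothetical names, addressing, and physical device; VN_MAD, PHYDEV, AUTOMATIC_VLAN_ID, and AR are standard OpenNebula attributes, and the VXLAN driver assigns a unique network ID per tenant network:

    # tenant-a.net -- hypothetical virtual network template for one tenant
    NAME    = "tenant-a-private"
    VN_MAD  = "vxlan"            # VXLAN driver for per-tenant segmentation
    PHYDEV  = "ib0"              # underlying interface carrying the overlay (assumed name)
    AUTOMATIC_VLAN_ID = "YES"    # let OpenNebula assign a unique network ID
    AR = [ TYPE = "IP4", IP = "10.10.1.1", SIZE = "254" ]

    $ onevnet create tenant-a.net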

This design ensures low-latency, high-throughput connectivity while maintaining strict isolation between tenants, preventing interference between workloads.

At the same time, tenants can optionally connect to a shared public network when external access is required.

GPU Virtualization and NUMA Alignment

Achieving near-native performance requires alignment between compute and accelerator resources. OpenNebula enables PCI passthrough of NVIDIA GB200 GPUs directly into virtual machines, preserving full hardware capabilities.

To optimize performance, virtual CPUs are pinned to the same physical NUMA node as the GPU. PCIe devices are also aligned with CPU locality, avoiding cross-node latency and ensuring efficient memory access.
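
A minimal template fragment expressing this looks roughly as follows. The PCI address and sizing are placeholders; PCI, SHORT_ADDRESS, TOPOLOGY, and PIN_POLICY are standard OpenNebula attributes, and the scheduler performs the actual NUMA-aware placement:

    # VM template fragment: GPU passthrough plus a pinned virtual topology
    VCPU   = 16
    MEMORY = 65536                            # in MB
    PCI = [ SHORT_ADDRESS = "0000:17:00.0" ]  # host PCI address of the GPU (placeholder)
    TOPOLOGY = [
      PIN_POLICY = "CORE",                    # pin each vCPU to a dedicated physical core
      SOCKETS    = 1 ]                        # present a single-socket topology to the guest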

Inside the virtual machine, this topology is fully exposed. The guest operating system sees a NUMA layout that mirrors the host, including dedicated CPU and GPU-related nodes. NUMA distance metrics guide memory allocation, maintaining performance characteristics comparable to bare metal.
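
This can be verified from inside the guest with standard Linux tooling; for example:

    # Show NUMA nodes, their CPUs and memory, and the distance matrix
    $ numactl --hardware

    # Check which NUMA node the passed-through GPU is attached to
    # (the PCI bus address is a placeholder)
    $ cat /sys/bus/pci/devices/0000:00:05.0/numa_node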

Flexible GPU Sharing with Multi-Instance GPU

With full GPU passthrough, users can leverage Multi-Instance GPU (MIG) technology directly within the virtual machine, enabling a single GPU to be partitioned into smaller instances for inference workloads.
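
For instance, with the GPU passed through, MIG can be driven from inside the guest using standard nvidia-smi commands. Available profiles depend on the GPU model and driver, so they should be listed first; <profile-id> below is a placeholder:

    # Enable MIG mode on GPU 0 (may require a GPU or system reset)
    $ sudo nvidia-smi -i 0 -mig 1

    # List the GPU instance profiles supported by this GPU and driver
    $ sudo nvidia-smi mig -lgip

    # Create two GPU instances from a chosen profile, plus matching compute instances
    $ sudo nvidia-smi mig -cgi <profile-id>,<profile-id> -C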

Alternatively, GPU partitioning can be configured at the host level, with individual GPU slices assigned to different virtual machines. OpenNebula supports both approaches, ensuring correct device assignment and NUMA alignment.

This flexibility enables efficient utilization of GPU resources across different workload types.

Deploying AI Workloads with OpenNebula

OpenNebula simplifies the deployment of AI workloads through reusable virtual machine templates. In this example, a template is used to deploy a preconfigured appliance running a vLLM service.

The template defines compute resources, includes GPU passthrough configuration, and exposes parameters for selecting and configuring the language model. Networking and sizing can be adjusted at deployment time.
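
A sketch of how a template can expose such parameters is shown below. USER_INPUTS and CONTEXT are standard OpenNebula template mechanisms; the input name and context variable are hypothetical and would need to match what the appliance's contextualization scripts expect:

    # Template fragment: prompt for the model at deployment time
    USER_INPUTS = [
      LLM_MODEL = "M|text|Model for the vLLM service" ]  # hypothetical input name

    CONTEXT = [
      NETWORK   = "YES",
      LLM_MODEL = "$LLM_MODEL" ]  # forwarded into the guest for the service to consume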

Once instantiated, the virtual machine is accessible through a service endpoint, allowing users to validate and interact with the deployed workload.
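
Since vLLM exposes an OpenAI-compatible HTTP API (on port 8000 by default), a quick validation might look like this, with the service address and model name as placeholders:

    # List the models the endpoint is serving
    $ curl http://<service-ip>:8000/v1/models

    # Send a small completion request
    $ curl http://<service-ip>:8000/v1/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "<model-name>", "prompt": "Hello", "max_tokens": 32}'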

Infrastructure Visibility and Tenant Autonomy

Operators can monitor GPU usage, NUMA topology, and resource allocation directly within the OpenNebula interface. Host-level views provide insight into GPU connectivity and distance relationships, while virtual machine-level views expose the allocated subset of resources.
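
The same information is reachable from the CLI. In recent OpenNebula versions the host view includes NUMA and PCI device inventories; for example:

    # List hosts, then inspect one in detail (NUMA nodes, hugepages, PCI devices)
    $ onehost list
    $ onehost show <host-id>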

From the tenant perspective, users can manage their own virtual machines, networks, and configurations within their assigned scope. They only see their private networks and any shared public networks, ensuring isolation while maintaining self-service resource control.
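
On the CLI this scoping works the same way: a tenant's commands only return the resources their user and group can see. A minimal session might be:

    # Tenant view: only VMs and networks within the user's scope are listed
    $ onevm list
    $ onevnet list

    # Self-service deployment from a shared template (name is a placeholder)
    $ onetemplate instantiate "<template-name>" --name my-vllm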

End-to-End Workflow Demonstration

The accompanying screencast walks through the workflow from infrastructure inspection to deploying a GPU-enabled virtual machine and validating the workload, showing how networking, GPU passthrough, and NUMA alignment are applied in practice.


Building Multi-Tenant AI Infrastructure at Scale

This approach enables organizations to run multiple AI workloads on shared GPU infrastructure without compromising performance or isolation. By combining hardware acceleration with virtualization and orchestration, OpenNebula provides a practical foundation for scalable AI infrastructure.

The result is an environment where tenants can deploy and manage workloads independently, while operators retain control over resource allocation, performance, and infrastructure operations.

From GPU acceleration to multi-tenant orchestration, the future of AI infrastructure is already taking shape. Join us at OneNext to hear from NVIDIA and other industry leaders on AI factories, high-performance networking, and scalable deployment models. Reserve your spot at OneNext.io.



Funded by the Spanish Ministry for Digital Transformation and Civil Service through the ONEnextgen Project (UNICO IPCEI-2023-003), and co-funded by the European Union’s NextGenerationEU through the RRF.


Neal Hansen

Senior Cloud Solutions Architect at OpenNebula Systems

Apr 14, 2026
