Solving GPU Passthrough Memory Addressing in OpenStack
Delivering Accelerator-Enabled Developer Cloud Functionality on Rackspace OpenStack Flex.
When AMD launched the AMD Developer Cloud, we took notice. Here was a streamlined platform giving developers instant access to high-performance MI300X GPUs, complete with pre-configured containers, Jupyter environments, and pay-as-you-go pricing. The offering resonated with the AI/ML community because it eliminated friction: spin up a GPU instance, start training, destroy it when done.
We wanted to deliver similar functionality for our customers using NVIDIA H100 NVL GPUs on Rackspace OpenStack Flex. Not to become a NeoCloud, that's not our business, but to ensure customers leveraging our GPU infrastructure could access the same seamless, high-memory multi-GPU experience that specialized providers offer.
Then we hit a wall. Instances with four H100 NVL GPUs would boot successfully, but only three GPUs were visible to the NVIDIA driver when the instance was configured with 256GB of RAM, two GPUs with 512GB, and one GPU with 768GB. The problem was confounding: the more RAM we configured, the fewer GPUs we could use, and the missing devices consistently failed with nvidia: probe of 0000:0X:00.0 failed with error -1.
This is the story of how we diagnosed the problem, discovered it wasn't NUMA (our initial suspect), and implemented a two-line fix that unlocked full multi-GPU functionality for our customers.
The target: matching developer cloud configurations
AMD's Developer Cloud configurations set the benchmark we were measuring against:
AMD Developer Cloud (MI300X)
| Configuration | GPU Memory | vCPUs | System Memory | Storage |
|---|---|---|---|---|
| Small (1× GPU) | 192GB | 20 | 240GB | 720GB boot + 5TB scratch NVMe |
| Large (8× GPU) | 1,536GB | 160 | 1,920GB | 1.9TB boot + 40TB scratch NVMe |
Rackspace OpenStack Flex (H100 NVL)
| Configuration | GPU Memory | vCPUs | System Memory | Storage |
|---|---|---|---|---|
| Small (1× GPU) | 96GB | 48 | 128GB | 2TB+ HA-NetApp volume |
| Large (4× GPU) | 384GB | 96 | 512GB | 2TB+ HA-NetApp volume |
Different hardware, different scale, but the same goal: give developers frictionless access to GPU compute for AI/ML workloads. The H100 NVL's 96GB of HBM3 memory per GPU, combined with 900 GB/s NVLink interconnect bandwidth, makes it exceptionally capable for distributed training, provided all GPUs are actually addressable.
The symptom: one GPU always missing
Our four-GPU configuration exhibited a consistent failure pattern. The instance would boot normally, lspci would show all four NVIDIA devices, but nvidia-smi reported only a subset of the GPUs. The kernel ring buffer told the story:
[ 2.847291] pci 0000:09:00.0: BAR 1: no space for [mem size 0x4000000000 64bit pref]
[ 2.847294] pci 0000:09:00.0: BAR 1: failed to assign [mem size 0x4000000000 64bit pref]
[ 3.521847] nvidia: probe of 0000:09:00.0 failed with error -1
Note
That hex value, 0x4000000000, is 256GB. The guest firmware couldn't allocate address space for the fourth GPU's BAR1 memory region.
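A quick sanity check on that conversion, using nothing more than shell arithmetic:
# 0x4000000000 bytes expressed in GiB
echo $(( 0x4000000000 / 1024 / 1024 / 1024 ))
# 256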
The initial misdiagnosis: NUMA topology
Our first instinct pointed to NUMA. Multi-GPU systems are notoriously sensitive to NUMA topology, and cross-socket memory access can cause exactly the kind of asymmetric behavior we observed. We spent considerable time examining:
- numactl --hardware output inside guests
- GPU-to-NUMA-node mappings via nvidia-smi topo -m
- Various hw:numa_nodes and hw:pci_numa_affinity_policy flavor configurations
- CPU pinning strategies to ensure GPU locality
- Kernel boot parameters related to NUMA and PCI
- HugePages configurations
- IOMMU settings and VFIO parameters
- and more...
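For reference, the guest-side NUMA checks from the list above boil down to a few commands. This is a minimal sketch; it assumes numactl, pciutils, and the NVIDIA driver are installed in the guest:
# Guest NUMA layout: node count, CPU ranges, and memory per node
numactl --hardware

# GPU interconnect matrix plus each GPU's CPU/NUMA affinity
nvidia-smi topo -m

# NUMA node reported by the PCI subsystem for each NVIDIA device
for addr in $(lspci -D -d 10de: | awk '{print $1}'); do
  echo "${addr} -> $(cat /sys/bus/pci/devices/${addr}/numa_node)"
done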
None of it helped. In a four-GPU configuration, the fourth GPU failed regardless of NUMA settings. The BAR 1: no space message kept appearing, suggesting something more fundamental than memory locality.
The real problem: QEMU's 32GB IO window
Gerd Hoffmann's blog post on physical address space in QEMU provided the breakthrough. The issue traces back to x86_64's early days.
When QEMU first supported 64-bit guests, it adopted AMD Opteron's 40-bit physical address space (1TB) as the default. This conservative choice persists today for backward compatibility. But the real constraint isn't QEMU; it's OVMF, the UEFI firmware.
OVMF couldn't reliably determine the actual physical address space available to a VM. Early Intel 64-bit CPUs supported only 36 bits, while AMD offered 40. Rather than risk crashes from address space misconfiguration, OVMF adopted an ultra-conservative approach: a fixed 32GB MMIO aperture for PCI devices.
For typical VMs, this worked fine. An 8GB VM with a basic GPU fits easily. But modern AI workloads changed the equation dramatically:
| Component | Address Space Required |
|---|---|
| Guest RAM | 512GB |
| H100 NVL BAR1 (×4) | 512GB (128GB each, power-of-two alignment) |
| Control BARs, platform devices | ~1GB |
| Total | ~1,025GB |
The legacy 32GB MMIO aperture can't accommodate 512GB of GPU BARs. OVMF maps the first three GPUs successfully, exhausts available address space, and leaves the fourth GPU with nowhere to live. The kernel sees the PCI device but can't initialize it, hence probe failed with error -1.
Debugging checklist for operators
When multi-GPU instances show partial GPU visibility:
Inside the guest:
# Check physical address bits visible to guest
lscpu | grep "Address sizes"
# Look for BAR allocation failures
dmesg | grep -E "(BAR|no space|failed to assign)"
# Check NVIDIA probe errors
dmesg | grep nvidia
# Verify GPU count
nvidia-smi -L
On the compute host:
# Verify host capabilities
virsh capabilities | grep maxphysaddr
# Check instance configuration
virsh dumpxml <instance-id> | grep -A5 '<cpu'
# Verify flavor properties applied
openstack flavor show <flavor-name> -f json | jq '.properties'
The solution: expanding physical address bits
Modern datacenter CPUs support far more than 40 bits of physical addressing:
| Platform | Physical Address Bits | Addressable Space |
|---|---|---|
| QEMU default | 40 bits | 1TB |
| Intel Xeon Scalable (Sapphire Rapids) | 46-52 bits | 64TB - 4PB |
| AMD EPYC (Genoa) | 48-52 bits | 256TB - 4PB |
With sufficient physical address bits, OVMF scales the IO window dynamically and places GPU BARs in the expanded high-memory region. The fix required discovering our hypervisors' capabilities and configuring Nova flavors to expose them.
Step 1: Discover hypervisor capabilities
Our compute nodes, running Intel Xeon Scalable processors, support 46-bit physical addressing, providing 64TB of addressable space.
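How we confirmed that, roughly (run on a compute node; the example outputs below are illustrative of what our hosts report, and the virsh output depends on the libvirt version):
# Physical address width reported by the host CPU
lscpu | grep "Address sizes"
#   Address sizes:  46 bits physical, 48 bits virtual

# Libvirt's view of the same capability
virsh capabilities | grep -i maxphysaddr
#   <maxphysaddr mode='passthrough' bits='46'/>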
Step 2: Configure Nova flavor properties
OpenStack Nova 2024.1 (Caracal) introduced hw:maxphysaddr_bits and hw:maxphysaddr_mode flavor extra specs. The fix required just two properties:
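Applied to an existing flavor, that looks like the following minimal sketch, using the four-GPU flavor name from Step 3:
openstack flavor set ao.3.96.512_H100-4 \
  --property hw:maxphysaddr_mode='emulate' \
  --property hw:maxphysaddr_bits='46'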
We used maxphysaddr_mode=emulate because our hypervisors run with cpu_mode=host-model, which allows live migration between hosts with different CPU microarchitectures. If we had configured cpu_mode=host-passthrough, we could use maxphysaddr_mode=passthrough to directly expose host CPU capabilities.
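Once an instance is launched from the updated flavor, the setting is visible in the libvirt domain XML on the compute host, roughly as shown below (the exact element layout varies by libvirt version):
virsh dumpxml <instance-id> | grep maxphysaddr
#   <maxphysaddr mode='emulate' bits='46'/>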
Notice
The maxphysaddr_bits=46 value isn't arbitrary; it matches our hypervisors' specific capabilities. Operators must query their own hosts and configure flavors accordingly. Nova's ComputeCapabilitiesFilter automatically ensures instances land only on hosts supporting the requested physical address width.
Step 3: Complete flavor configuration
To ensure optimal performance and reliability for GPU workloads, we combined the physical address space settings with other best-practice flavor properties, resulting in the following complete flavor definitions for both four-GPU and single-GPU instances.
Four-GPU (Training/Rendering) Flavor:
openstack flavor create --vcpu 96 \
--ram 524288 \
--disk 0 \
--ephemeral 0 \
--property :price=10.022336 \
--property :architecture='x86_architecture' \
--property :category='compute_optimized' \
--property hw:cpu_max_sockets='2' \
--property hw:cpu_max_threads='1' \
--property hw:cpu_policy='dedicated' \
--property hw:cpu_thread_policy='require' \
--property hw:hide_hypervisor_id='true' \
--property hw:maxphysaddr_bits='46' \
--property hw:maxphysaddr_mode='emulate' \
--property hw:numa_nodes='2' \
--property hw:numa_cpus.0='0-47' \
--property hw:numa_cpus.1='48-95' \
--property hw:numa_mem.0='262144' \
--property hw:numa_mem.1='262144' \
--property hw:mem_page_size='any' \
--property hw:pci_numa_affinity_policy='required' \
--property hw:watchdog_action='disabled' \
--property pci_passthrough:alias='h100:4' \
--property trait:CUSTOM_HW_GPU_H100='required' \
--private \
ao.3.96.512_H100-4
Single GPU (Development/Inference) Flavor:
openstack flavor create --vcpu 48 \
--ram 131072 \
--disk 0 \
--ephemeral 0 \
--property :architecture='x86_architecture' \
--property :category='compute_optimized' \
--property hw:cpu_max_sockets='2' \
--property hw:cpu_max_threads='1' \
--property hw:cpu_policy='dedicated' \
--property hw:cpu_thread_policy='require' \
--property hw:hide_hypervisor_id='true' \
--property hw:mem_page_size='any' \
--property pci_passthrough:alias='h100:1' \
--property trait:CUSTOM_HW_GPU_H100='required' \
ao.3.96.512_H100
Flavor property explanations
| Property | Value | Description |
|---|---|---|
| hw:cpu_max_sockets | 2 | Maximum CPU sockets exposed to the guest. Distributes vCPUs across two virtual sockets, mirroring dual-socket physical hosts and aligning with the hw:numa_nodes=2 configuration. |
| hw:cpu_max_threads | 1 | Maximum threads per core in the guest topology. Setting to 1 disables SMT presentation inside the guest, ensuring each vCPU maps to a distinct physical core for consistent compute performance. |
| hw:cpu_policy | dedicated | CPU allocation policy. dedicated pins each vCPU to a specific physical CPU, preventing core sharing with other instances. Eliminates noisy-neighbor effects, critical for GPU workloads. |
| hw:cpu_thread_policy | require | Controls vCPU placement on host SMT threads. require mandates SMT-enabled hosts and places vCPUs on thread siblings. Ensures predictable resource allocation. Options: require, prefer, isolate. |
| hw:hide_hypervisor_id | true | Hides the hypervisor CPUID signature from the guest. Essential for NVIDIA GPU passthrough; prevents driver detection of a virtualized environment. Also bypasses software with anti-VM licensing checks. |
| hw:maxphysaddr_bits | 46 | Maximum physical address bits for the guest. 46 bits provides 64TB of addressable space, enabling OVMF firmware to allocate a sufficient MMIO aperture for four H100 GPUs (~512GB BAR space required). |
| hw:maxphysaddr_mode | emulate | Physical address bit determination method. emulate explicitly sets the value via QEMU; required with cpu_mode=host-model. The alternative, passthrough, requires cpu_mode=host-passthrough. |
| hw:mem_page_size | any | Guest memory backing page size. any allows Nova to select available sizes (4KB, 2MB, or 1GB). Consider 1GB for reduced TLB pressure with large memory footprints if huge pages are pre-allocated on hosts. |
| hw:numa_cpus.0 | 0-47 | Explicit vCPU-to-NUMA mapping for node 0. Pins guest vCPUs 0-47 to virtual NUMA node 0, ensuring predictable topology rather than relying on Nova's automatic distribution. |
| hw:numa_cpus.1 | 48-95 | Explicit vCPU-to-NUMA mapping for node 1. Pins guest vCPUs 48-95 to virtual NUMA node 1, completing the symmetric dual-node topology. |
| hw:numa_mem.0 | 262144 | Memory allocation for NUMA node 0 in MB. Pins 256GB to node 0, preventing cross-socket memory allocation and ensuring GPU0-1 access local memory. |
| hw:numa_mem.1 | 262144 | Memory allocation for NUMA node 1 in MB. Pins 256GB to node 1, ensuring GPU2-3 access local memory. Total: 512GB evenly split across nodes. |
| hw:numa_nodes | 2 | Number of virtual NUMA nodes exposed to the guest. Matches the dual-socket physical topology for optimal memory locality and GPU affinity. |
| hw:pci_numa_affinity_policy | required | PCI device NUMA affinity enforcement. required strictly enforces GPU-to-NUMA alignment; instance scheduling fails if affinity cannot be satisfied rather than silently degrading performance. |
| hw:watchdog_action | disabled | Guest watchdog behavior. disabled eliminates watchdog timer interrupts, reducing interrupt overhead for latency-sensitive GPU workloads. |
| pci_passthrough:alias | h100:4 | PCI passthrough device request. h100 references the alias in nova.conf mapping to NVIDIA H100 vendor/product IDs; :4 requests four GPUs for this instance. |
| trait:CUSTOM_HW_GPU_H100 | required | Placement API scheduling constraint. Instances only land on compute nodes reporting this custom trait, ensuring H100-equipped host selection beyond PCI alias matching. |
Property Prefix Reference:
| Prefix | Scope | Function |
|---|---|---|
| hw: | Nova libvirt driver | Configures guest hardware: CPU topology, memory, NUMA, physical addressing |
| hw:numa_* | Nova libvirt driver | Explicit NUMA pinning for CPUs and memory per virtual node |
| pci_passthrough: | Nova PCI manager | PCI device assignment by alias or direct address |
| trait: | Placement API | Scheduling constraints based on host resource provider traits |
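The pci_passthrough:alias and trait:CUSTOM_HW_GPU_H100 properties both depend on host-side plumbing that isn't visible in the flavor itself. The sketch below illustrates what that typically looks like; the product ID and resource-provider UUID are placeholders you would take from lspci -nn and openstack resource provider list on your own H100 hosts:
# /etc/nova/nova.conf on the H100 compute nodes (the alias is also needed where nova-api/scheduler run)
# [pci]
# device_spec = { "vendor_id": "10de", "product_id": "<h100-product-id>" }
# alias = { "vendor_id": "10de", "product_id": "<h100-product-id>", "device_type": "type-PCI", "name": "h100" }

# Create the custom trait and attach it to each H100 host's resource provider.
# Note: 'resource provider trait set' replaces the provider's trait list, so include any existing traits.
openstack --os-placement-api-version 1.6 trait create CUSTOM_HW_GPU_H100
openstack resource provider trait set \
  --trait CUSTOM_HW_GPU_H100 <resource-provider-uuid>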
The result: full multi-GPU functionality
After applying the fix along with our tuning, all four H100 NVL GPUs initialize correctly:
$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08 Driver Version: 580.105.08 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 NVL On | 00000000:06:00.0 Off | 0 |
| N/A 32C P0 63W / 400W | 0MiB / 95830MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 NVL On | 00000000:07:00.0 Off | 0 |
| N/A 31C P0 62W / 400W | 0MiB / 95830MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H100 NVL On | 00000000:08:00.0 Off | 0 |
| N/A 35C P0 64W / 400W | 0MiB / 95830MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H100 NVL On | 00000000:09:00.0 Off | 0 |
| N/A 31C P0 61W / 400W | 0MiB / 95830MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
NVLink topology is preserved, enabling efficient multi-GPU communication:
$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3
GPU0 X NV12 PHB PHB
GPU1 NV12 X PHB PHB
GPU2 PHB PHB X NV12
GPU3 PHB PHB NV12 X
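Beyond nvidia-smi, the guest-side checks from the debugging checklist now come back clean. Illustrative output, assuming the flavor's 46-bit setting is in effect:
# Guest now sees the wider physical address space requested by the flavor
lscpu | grep "Address sizes"
#   Address sizes:  46 bits physical, 48 bits virtual

# No more BAR allocation failures in the kernel log
dmesg | grep -E "no space|failed to assign"
# (no output)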
NUMA Topology Visualization:
flowchart TB
subgraph Guest["Guest Instance"]
subgraph NUMA0["NUMA Node 0"]
direction TB
CPU0["vCPUs 0-47"]
MEM0["256GB RAM"]
subgraph NVPair0["NVLink Pair A"]
G0["GPU0<br/>96GB HBM3"]
G1["GPU1<br/>96GB HBM3"]
G0 <-->|"NV12<br/>600 GB/s"| G1
end
end
subgraph NUMA1["NUMA Node 1"]
direction TB
CPU1["vCPUs 48-95"]
MEM1["256GB RAM"]
subgraph NVPair1["NVLink Pair B"]
G2["GPU2<br/>96GB HBM3"]
G3["GPU3<br/>96GB HBM3"]
G2 <-->|"NV12<br/>600 GB/s"| G3
end
end
end
NVPair0 <-->|"PHB (PCIe)<br/>~64 GB/s"| NVPair1
style Guest fill:#0d1117,stroke:#30363d,color:#c9d1d9
style NUMA0 fill:#161b22,stroke:#21262d,color:#c9d1d9
style NUMA1 fill:#161b22,stroke:#21262d,color:#c9d1d9
style NVPair0 fill:#238636,stroke:#2ea043,color:#fff
style NVPair1 fill:#238636,stroke:#2ea043,color:#fff
style G0 fill:#8957e5,stroke:#a371f7,color:#fff
style G1 fill:#8957e5,stroke:#a371f7,color:#fff
style G2 fill:#8957e5,stroke:#a371f7,color:#fff
style G3 fill:#8957e5,stroke:#a371f7,color:#fff
style CPU0 fill:#1f6feb,stroke:#388bfd,color:#fff
style CPU1 fill:#1f6feb,stroke:#388bfd,color:#fff
style MEM0 fill:#da3633,stroke:#f85149,color:#fff
style MEM1 fill:#da3633,stroke:#f85149,color:#fff
Key takeaways
Multi-GPU passthrough failures aren't always NUMA issues. The 32GB MMIO aperture limitation affects any high-memory GPU configuration, regardless of NUMA topology.
The fix is simple but requires discovery. Two flavor properties unlock full functionality, but the maxphysaddr_bits value must match your hypervisors' capabilities; query virsh capabilities on each compute node.
Nova 2024.1+ is required. The hw:maxphysaddr_mode and hw:maxphysaddr_bits extra specs were introduced in the Caracal release.
CPU mode affects passthrough options. Use emulate mode with host-model for migration flexibility, or passthrough mode with host-passthrough for maximum performance on homogeneous clusters.
With these configurations in place, Rackspace OpenStack Flex delivers the multi-GPU compute functionality our customers expect, matching the seamless experience of specialized developer clouds while maintaining the operational benefits of OpenStack infrastructure: API compatibility, hybrid cloud integration, and complete data sovereignty.
For more information about GPU-enabled compute on Rackspace OpenStack Flex, see the Genestack documentation or contact your Rackspace account team. Sign up for Rackspace OpenStack today.