
Bringing the AMD Radeon AI PRO R9700 Online in OpenStack Flex

I'll be honest. When the AMD Radeon AI PRO R9700 first showed up on my radar, I wasn't sure what to make of it. It's not a traditional datacenter card and it's not a gaming card either. The R9700 is a 32 GB professional GPU that won't break the bank, and sits in a product category that didn't really exist eighteen months ago.

This week our team brought a pair of R9700 GPUs online in Rackspace OpenStack Flex. Like any good story there was a bit of drama along the way: servers, placement, shipping times, cable oddities, a chassis crisis, and more; we had the makings of a full-length K-drama with all the twists and turns. Once we got past the drama and the parts were installed and powered on, the entire deployment took about ten minutes, which is a testament to the power of Genestack's Kubernetes-native architecture and OpenStack's hardware-agnostic design.

This is what we did and why I think it matters more than the price tag suggests.

The gap nobody talks about

Here's the thing about GPU cloud instances in 2026: the market is weirdly bimodal. You've got the heavy hitters (H100s, B200s, MI300Xs, L40S cards) that cost tens of thousands of dollars each and serve the "training a foundation model" crowd. Then you've got consumer cards with 16 GB of VRAM that can technically run inference if you squint hard enough and don't mind watching your 13B model offload half its weights to system RAM while latency climbs into "are you sure this isn't just a very slow human typing?" territory.

What's missing is the middle. The tenant who needs to run Llama 3 8B or Mistral 7B quantized, with the entire model sitting in VRAM, at a price that doesn't require a finance team to approve. The developer prototyping a RAG pipeline who needs 32 GB to hold a model and an embedding engine side by side. The research team whose fine-tuning job fits comfortably in 32 GB but absolutely, categorically does not fit in 16 GB. The developer who wants to test their application on AMD hardware and doesn't want to pay a premium for a card that's designed for datacenters but isn't actually necessary for their workload.

That's the gap. And the R9700 lands right in the middle of it.

The card itself

Quick specs, because they matter for what comes next.

Architecture: RDNA 4 (Navi 48 die, TSMC 4nm)
Compute Units: 64
Stream Processors: 4,096
AI Accelerators (2nd Gen): 128
VRAM: 32 GB GDDR6 ECC
Memory Bandwidth: 640 GB/s
FP32 Vector: 47.8 TFLOPS
FP16 Matrix: 191 TFLOPS (383 sparse)
FP8 Matrix: 383 TFLOPS (766 sparse)
INT8 Matrix: 383 TOPS (766 sparse)
INT4 Matrix: 766 TOPS (1,531 sparse)
Total Board Power: 300W

The price for all of that is around $1,299 (USD). I'll let that number sit for a second.

NVIDIA's RTX PRO 4500 Blackwell gets you the same 32 GB for a few thousand dollars, and an L40S with 48 GB costs even more. Two R9700 cards, 64 GB total, cost less than a single RTX 6000 Ada. Phoronix benchmarked a dual-R9700 setup and found it slightly outperformed a single RTX 6000 Ada on vLLM inference while drawing less power per card (190W average vs. 223W). That's the kind of math that changes what you can offer tenants.

Ten minutes from bare metal to schedulable

Our environment runs Genestack, Rackspace's open-source Kubernetes-native OpenStack deployment platform. If you're not familiar, Genestack orchestrates all the OpenStack services via Helm and Kustomize on top of Kubernetes. The practical upside for something like a GPU deployment is that adding new hardware capabilities to the cloud doesn't involve SSHing into boxes and hand-editing config files like it's 2014. You label a node, define overrides, run the chart, and go make coffee. Except you won't need the coffee, because it'll be done before the water boils.

The full sequence (bootstrap the host, join it to the Kubernetes cluster, roll out the Nova compute configuration, verify the devices) took about ten minutes of actual hands-on-keyboard time. The rest was waiting for our enterprise telenovela to play out, but that is a story for another day.

Preparing the host for VFIO

Before OpenStack can hand a GPU to a VM, you have to convince the host kernel to not use the GPU itself. This means binding the device to the vfio-pci driver at boot and blacklisting every graphics driver that might try to claim it.

The kernel command line:

GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=on iommu=pt vfio-pci.ids=1002:7551,148c:2444 vfio_iommu_type1.allow_unsafe_interrupts=1 modprobe.blacklist=amdgpu rs.driver.blacklist=amdgpu kvm.ignore_msrs=1"

There's a lot crammed in there, so let me pick apart the pieces that actually matter.

amd_iommu=on iommu=pt turns on the IOMMU in passthrough mode, giving the guest direct DMA access to the GPU without bouncing through a software translation layer. vfio-pci.ids=1002:7551,148c:2444 grabs the R9700's video function and its companion audio function at boot, before anything else can touch them. kvm.ignore_msrs=1 is there because AMD GPUs occasionally poke at model-specific registers that KVM doesn't know about; the accesses are harmless, but KVM will fault on them without this flag. And vfio_iommu_type1.allow_unsafe_interrupts=1 handles IOMMU groups that aren't perfectly isolated, which is common on workstation-class boards where the R9700 tends to live.

We reinforce this with modprobe configuration. In /etc/modprobe.d/vfio-pci.conf:

options vfio-pci ids=1002:7551,148c:2444
softdep radeon pre: vfio-pci
softdep amdgpu pre: vfio-pci
softdep nouveau pre: vfio-pci
softdep nvidia pre: vfio-pci
softdep efifb pre: vfio-pci
softdep drm pre: vfio-pci

Those softdep lines are the belt-and-suspenders part. They tell the kernel that if anything (amdgpu, radeon, nouveau, nvidia, efifb, even the base drm subsystem) tries to load, it has to load vfio-pci first. Since vfio-pci already claimed the device by PCI ID, those drivers show up, find an empty parking lot, and leave. It's not elegant. It's thorough. In production, thorough wins.

And in /etc/modules-load.d/vfio-pci.conf, just to make sure the VFIO module is present at boot:

vfio-pci
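
With those files in place, we regenerate the boot configuration and confirm the binding after a reboot. The following is a quick sketch assuming an Ubuntu-style host; swap in your distro's equivalents, and note that the e3:00.0 address is our first card, so yours will differ.

# Regenerate boot config and initramfs so the VFIO settings take effect, then reboot
sudo update-grub
sudo update-initramfs -u
sudo reboot

# After the reboot, vfio-pci should own the card before any graphics driver gets a chance
lspci -nnk -s e3:00.0 | grep "Kernel driver in use"
# Kernel driver in use: vfio-pci

# And the device should sit in a sane IOMMU group
find /sys/kernel/iommu_groups/ -type l | grep "e3:00"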

Labeling the node

With the host ready, one Kubernetes label tells Genestack what kind of hardware is attached:

kubectl label node <node-name> openstack-compute-gpu=r9700

This label is the anchor for everything that follows. It's the same pattern we use for H100 nodes, for MI300X nodes, for whatever ships next quarter. One label, one override block.
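
A quick check confirms the label landed where you expect:

# Every host carrying the R9700 label shows up here; the freshly labeled node should be in the list
kubectl get nodes -l openstack-compute-gpu=r9700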

Nova PCI passthrough configuration

Here's the Genestack override that configures Nova to manage the R9700 devices:

conf:
  nova:
    pci:
      alias:
        type: multistring
        values:
          - '{"vendor_id": "1002", "product_id": "7551", "device_type": "type-PCI", "name": "r9700-video", "numa_policy": "preferred"}'
          - '{"vendor_id": "1002", "product_id": "ab40", "device_type": "type-PCI", "name": "r9700-audio", "numa_policy": "preferred"}'
  overrides:
    nova_compute:
      labels:
        - label:
            key: "openstack-compute-gpu"
            values:
              - "r9700"
          conf:
            nova:
              pci:
                alias:
                  type: multistring
                  values:
                    - '{"vendor_id": "1002", "product_id": "7551", "device_type": "type-PCI", "name": "r9700-video", "numa_policy": "preferred"}'
                    - '{"vendor_id": "1002", "product_id": "ab40", "device_type": "type-PCI", "name": "r9700-audio", "numa_policy": "preferred"}'
                device_spec: >-
                  [{"address": "0000:e3:00.0"},
                   {"address": "0000:e3:00.1"},
                   {"address": "0000:23:00.0"},
                   {"address": "0000:23:00.1"}]

Two things are happening here. The alias section defines named PCI device groups that Nova's scheduler and API understand, r9700-video for the GPU compute function and r9700-audio for the HDMI audio controller riding in the same IOMMU group. Both use numa_policy: preferred, which means "try to place this on the same NUMA node as the VM's memory, but don't fail if you can't." Strict NUMA enforcement is great when you have it; crashing a deployment because of topology constraints you can't control is not.

The overrides section scopes the device_spec, the specific PCI bus addresses, to only the nodes labeled openstack-compute-gpu=r9700. Addresses 0000:e3:00.0 and 0000:e3:00.1 are the first card (video and audio functions), 0000:23:00.0 and 0000:23:00.1 are the second. Two R9700 cards per host, which gives us single-GPU and dual-GPU flavors from the same hardware.

You'll notice the alias definitions appear in both the global section and the per-node override. That's not a mistake. The global aliases tell Nova's API and scheduler what the names mean. The per-node config tells the compute agent which physical devices to expose. You need both sides of that conversation or the scheduler will happily accept a GPU flavor request and then have no idea where to put it.
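
If it helps to picture where those values land, this is roughly the [pci] section the compute agent ends up with on an R9700 node once the chart renders the override. Treat it as an illustration trimmed to the relevant keys, not a verbatim dump:

[pci]
# Named device groups the API and scheduler understand (one alias line per device type)
alias = {"vendor_id": "1002", "product_id": "7551", "device_type": "type-PCI", "name": "r9700-video", "numa_policy": "preferred"}
alias = {"vendor_id": "1002", "product_id": "ab40", "device_type": "type-PCI", "name": "r9700-audio", "numa_policy": "preferred"}
# The physical functions this particular host is allowed to expose
device_spec = [{"address": "0000:e3:00.0"}, {"address": "0000:e3:00.1"}, {"address": "0000:23:00.0"}, {"address": "0000:23:00.1"}]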

Once the nodes are labeled and the overrides are in place, we roll the Nova compute chart to apply the changes with /opt/genestack/bin/install-nova.sh. The compute agents on the R9700 hosts pick up the new config, bind the devices to VFIO, and report them to the Placement API. At this point, the hardware is ready and Nova knows about it, but we still need to connect it to flavors so tenants can actually request it.
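
A couple of sanity checks are worth running before wiring up scheduling. These are standard client commands and assume admin credentials:

# The compute agents on the R9700 hosts should be up and reporting in
openstack compute service list --service nova-compute

# Each host should also appear as a hypervisor and as a resource provider in Placement
openstack hypervisor list
openstack resource provider list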

Host aggregates: the scheduling glue

Kubernetes labels get the right Nova config onto the right hosts. But OpenStack also needs its own way to organize compute nodes into logical groups; that's what host aggregates do. They're the bridge between "this host has R9700 hardware" and "this flavor should only land on hosts with R9700 hardware."

We create two aggregates for our R9700 hosts: one for the GPU, one for the CPU. But before we can reference a custom trait on the aggregate, the trait itself has to exist in the Placement API. Custom traits (anything prefixed with CUSTOM_) aren't built-in; you have to explicitly register them:

openstack --os-placement-api-version 1.6 trait create CUSTOM_HW_GPU_R9700

You can verify it landed:

openstack --os-placement-api-version 1.6 trait show CUSTOM_HW_GPU_R9700 -f yaml
name: CUSTOM_HW_GPU_R9700

This is one of those steps that's easy to skip and painful to debug when you do. If the trait doesn't exist in Placement, the aggregate property referencing it is just a string that nobody acts on. Your flavor will request trait:CUSTOM_HW_GPU_R9700: required, the scheduler will find zero matching resource providers, and every instance create will fail with a No valid host error. You'll stare at the aggregate config, confirm it looks correct, and slowly lose your mind. Ask me how I know.

With the trait registered, now we wire it up:

openstack aggregate create R9700
openstack aggregate set --property trait:CUSTOM_HW_GPU_R9700=required R9700
openstack aggregate add host R9700 $NODE_NAME_PROD

The R9700 aggregate carries the trait:CUSTOM_HW_GPU_R9700=required property. This is the custom trait that the Placement API uses to match flavor requests: when a flavor says trait:CUSTOM_HW_GPU_R9700: required, the scheduler looks for resource providers reporting that trait, and the aggregate is what makes the host report it. Without this, the Placement API has no way to know that a given compute node has R9700 cards. PCI passthrough aliases alone aren't enough; the scheduler needs the trait to do its filtering before it ever gets to the PCI level. A single host can belong to multiple aggregates, which is how you express "this machine has EPYC 9354 CPUs AND R9700 GPUs" without creating a combinatorial explosion of aggregate names for every possible hardware configuration.

Here's what the full picture looks like in production:

# openstack aggregate list -f yaml
- Availability Zone: null
  ID: 1
  Name: R9700
# openstack aggregate show R9700 -f yaml
name: R9700
hosts:
- $NODE_NAME_PROD
properties:
  trait:CUSTOM_HW_GPU_R9700: required

When we rack the next pair of R9700 cards on a different host, it's one aggregate add host command and we're done. When AMD ships a new GPU next quarter, it's one new aggregate with one new trait. The pattern scales because it's boring, and boring infrastructure is reliable infrastructure.
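
To make "boring" concrete, bringing that next host into service is the label you saw earlier plus the aggregate membership, with a placeholder node name:

# Same label, same aggregate, nothing new to invent
kubectl label node <new-node-name> openstack-compute-gpu=r9700
openstack aggregate add host R9700 <new-node-name>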

The flavors

Here's what tenants actually consume.

Single GPU, ao.9.24.96_R9700:

name: ao.9.24.96_R9700
vcpus: 24
ram: 98304  # 96 GB
disk: 0
properties:
  :architecture: x86_architecture
  :category: compute_optimized
  hw:cpu_max_sockets: '2'
  hw:cpu_max_threads: '1'
  hw:hide_hypervisor_id: 'true'
  hw:mem_page_size: any
  hw:pci_numa_affinity_policy: required
  hw:watchdog_action: disabled
  pci_passthrough:alias: r9700-video:1,r9700-audio:1
  trait:CUSTOM_HW_GPU_R9700: required

Dual GPU, ao.9.48.128_R9700-2:

name: ao.9.48.128_R9700-2
vcpus: 48
ram: 131072  # 128 GB
disk: 0
properties:
  :architecture: x86_architecture
  :category: compute_optimized
  hw:cpu_max_sockets: '2'
  hw:cpu_max_threads: '1'
  hw:cpu_policy: dedicated
  hw:cpu_thread_policy: require
  hw:hide_hypervisor_id: 'true'
  hw:maxphysaddr_bits: '52'
  hw:maxphysaddr_mode: emulate
  hw:mem_page_size: any
  hw:watchdog_action: disabled
  pci_passthrough:alias: r9700-video:2,r9700-audio:2
  trait:CUSTOM_HW_GPU_R9700: required

Naming convention: ao for accelerator-optimized, 9 for generation, vCPU count, RAM in GB, then the accelerator suffix. Boring, predictable, and easy to parse in a billing system. Exactly what a naming convention should be.

  • hw:hide_hypervisor_id: 'true' masks the KVM signature from the guest OS. Some GPU drivers, NVIDIA's especially, detect virtualization and either refuse to initialize or disable features. AMD's ROCm stack is more forgiving about this, but we set it anyway. There's no penalty and it eliminates an entire category of "works on bare metal, fails in VM" support tickets.
  • hw:pci_numa_affinity_policy: required on the single-GPU flavor is worth noting. The PCI aliases use numa_policy: preferred, which tells the compute agent to try for NUMA locality but not insist. The flavor-level required overrides that for scheduling: the instance won't launch unless the scheduler can place the VM's memory and the GPU on the same NUMA node. For a single-GPU inference workload where every microsecond of memory access latency matters, that's the right tradeoff. The dual-GPU flavor omits this because spanning NUMA nodes is sometimes unavoidable with two cards.
  • hw:cpu_policy: dedicated and hw:cpu_thread_policy: require on the dual-GPU flavor pin vCPUs to physical cores. When you're feeding two GPUs from 128 GB of RAM, the last thing you want is jitter from a noisy neighbor's cron job landing on your shared core at the wrong moment.
  • trait:CUSTOM_HW_GPU_R9700: required is the Placement API filter ensuring these flavors only schedule on nodes that actually have R9700 hardware. Without it, a GPU flavor could land on a CPU-only node, and instead of a working instance you'd get a very expensive lesson in reading error logs.
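
If you're recreating these flavors in your own cloud, the single-GPU spec above maps to a plain flavor create call. Here's a sketch using the values from the spec; the :architecture and :category keys are left out, so add them back if your catalog tooling expects them:

openstack flavor create ao.9.24.96_R9700 \
  --vcpus 24 --ram 98304 --disk 0 \
  --property hw:cpu_max_sockets=2 \
  --property hw:cpu_max_threads=1 \
  --property hw:hide_hypervisor_id=true \
  --property hw:mem_page_size=any \
  --property hw:pci_numa_affinity_policy=required \
  --property hw:watchdog_action=disabled \
  --property "pci_passthrough:alias"="r9700-video:1,r9700-audio:1" \
  --property trait:CUSTOM_HW_GPU_R9700=required

Booting against it is then an ordinary openstack server create --flavor ao.9.24.96_R9700 with whatever image and network the tenant already uses.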

Why maxphysaddr_bits shows up again

See hw:maxphysaddr_bits: '52' and hw:maxphysaddr_mode: emulate on the dual-GPU flavor? There's a story behind those two lines, and we told the whole thing in Solving GPU Passthrough Memory Addressing in OpenStack back in December.

The condensed version: QEMU defaults to giving guest VMs a 40-bit physical address space. That's 1 TB. Sounds generous until you realize the UEFI firmware (OVMF) uses that space to carve out a fixed 32 GB window for PCI device memory mappings. We originally hit this wall with H100 NVL cards, four GPUs, each needing 128 GB of BAR space, plus 512 GB of guest RAM. The fourth GPU would show up in lspci but the driver couldn't initialize it. The dmesg output was a wonderfully unhelpful BAR 1: no space for [mem size ...] that we initially blamed on NUMA topology. (It wasn't NUMA. It was address bits. We spent hours on that misdiagnosis.)

The R9700's 32 GB VRAM is more modest than the H100's, but two cards plus 128 GB of system RAM in the dual-GPU flavor still pushes the boundaries of a 40-bit address space. Setting maxphysaddr_bits: '52' gives the guest 4 PB of addressable space. That's preposterously more than two R9700s will ever use, but it completely eliminates the problem and means we don't have to revisit this flavor when AMD ships something with even more VRAM next year.

We use emulate mode because our compute hosts run cpu_mode=host-model for live migration compatibility. If your environment runs host-passthrough on homogeneous hardware, passthrough mode works too and carries slightly less overhead. Either way, the guest gets enough address bits. That's what matters.
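
If you're retrofitting an existing flavor rather than creating one from scratch, both properties bolt on after the fact:

# Widen the guest physical address space; emulate keeps host-model live migration happy
openstack flavor set ao.9.48.128_R9700-2 \
  --property hw:maxphysaddr_mode=emulate \
  --property hw:maxphysaddr_bits=52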

The software side: ROCm and what actually works

I should be straightforward about where the R9700's software ecosystem stands, because pretending AMD's ROCm stack is equivalent to CUDA would be doing you a disservice.

ROCm 7.2.1 officially supports the R9700 under the gfx1201 target. PyTorch works, training and inference, mixed precision, torch.compile. vLLM works, and as of January 2026 treats ROCm as a first-class platform, but there's a catch: the Flash Attention CK (Composable Kernel) backend doesn't support RDNA architectures yet. You have to use the Triton backend instead. It's functional, but it's not as optimized as CK is on AMD's datacenter CDNA hardware. AMD has indicated CK support for RDNA is targeted for end of 2026. llama.cpp runs via the HIP backend with GGUF models. TensorFlow works through tensorflow-rocm.

  • Is it CUDA? No
  • Is it good? Yes
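
For the skeptics, the smoke test inside a fresh guest is short. Take this as a sketch: it assumes ROCm 7.x and a ROCm build of PyTorch are already installed in the instance, and the model name is only an example.

# The card should enumerate with its gfx1201 target and 32 GB of VRAM
rocm-smi
rocminfo | grep -i gfx

# PyTorch's ROCm builds reuse the torch.cuda API surface, so the usual check applies
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"

# vLLM on RDNA currently wants the Triton flash-attention backend (the CK caveat above)
VLLM_USE_TRITON_FLASH_ATTN=1 vllm serve mistralai/Mistral-7B-Instruct-v0.3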

The standout here is the innovation in hardware

Not every framework supports every feature of every GPU architecture on day one. That's just the reality of the software ecosystem, but the diversity of frameworks that support the R9700, the fact that it's on ROCm's supported hardware list, and the performance benchmarks that have been published all suggest this card is a very solid choice for GPU-accelerated inference workloads in 2026. If your application can run on ROCm, the R9700 is a fantastic value proposition.

ROCm is great and getting better at a pace that would have seemed unlikely two years ago. For the use cases this card targets, running 7B to 32B parameter models for inference, doing RAG, prototyping AI applications, the ROCm ecosystem is incredibly functional, and this hardware is fantastic, with cost savings that more than justify an occasional rough edge.

The R9700 also doesn't support SR-IOV, which means no hardware-level GPU sharing between VMs. Each instance gets exclusive access to the whole card. In a multi-tenant cloud, that's a constraint. But for dedicated inference instances, it's arguably a feature: your tenant gets deterministic, repeatable GPU performance with zero contention. That's exactly what inference SLAs want.

Why this matters beyond one GPU

I keep coming back to the same thought: the interesting part isn't the R9700 specifically. It's what deploying it proves about the platform.

The same Genestack override pattern, the same PCI passthrough pipeline, the same Nova Placement API, the same ten-minute operational process: it works for H100s, it works for MI300Xs, it works for an R9700. OpenStack doesn't care about the brand on the silicon. You define the PCI IDs, you label the nodes, you create the flavors, and the scheduler handles the rest.

That hardware agnosticism is hard to replicate outside of OpenStack. Proprietary hypervisors gate GPU support behind certification lists and driver partnerships. Hyperscalers offer exactly the GPU SKUs they've negotiated volume pricing on and nothing else. OpenStack lets you rack whatever PCIe device makes sense for your workload economics and expose it to tenants through a consistent API.

For the R9700 specifically, this means we can offer GPU-accelerated inference instances at price points that previously bought you a CPU-only flavor with an optimistic attitude.

Our single-GPU R9700 flavor runs at a price that will make Lovelace blush, and the dual-GPU flavor gives the Hopper crowd a run for their money.

For the broader strategy, it means Rackspace OpenStack Flex can absorb whatever the silicon vendors ship next, AMD, NVIDIA, Intel, whoever, without a six-month integration cycle or a vendor certification program. The next GPU that makes economic sense for our customers is ten minutes away from being schedulable. That's the real story here.

Getting started

We're just getting started with the R9700 GPU-enabled flavors on Rackspace OpenStack Flex, and while we're not ready to roll out full support to the public just yet, if you're still reading this post and you're interested in trying out what's possible with an R9700 on an open-source hyperscaler cloud, let me know; I'd love to hear from you.

If you're running your own Genestack deployment, everything above is what you need to roll out the R9700 into your cloud.

  1. The Genestack documentation covers the full deployment workflow
  2. The PCI passthrough guide details the Nova configuration
  3. The maxphysaddr post has the full war story on memory addressing if you're deploying multi-GPU flavors.

Are you ready to begin clouding? Sign up for Rackspace OpenStack Flex today.