In OpenStack environments using the Cinder LVM backend with tgt, a volume deletion can fail even after the instance side of the workflow appears complete. One common cause is that the iSCSI session on the block storage node is still active, preventing tgt from removing the target.
When a pod fails to join the network in a Kube-OVN-backed cluster, the first symptom often looks like a generic CNI problem. In one Genestack-operated IAD sandbox case, the actual cause was a duplicate IP allocation: a new pod was assigned an address that was still recorded against an older, non-running pod in the same subnet.
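The duplicate-allocation check described above boils down to grouping pod records by IP and flagging any address claimed more than once. A minimal sketch of that logic is below; the function name and the `(ip, pod, phase)` record shape are illustrative assumptions (in practice you would feed it data scraped from `kubectl get pods -A -o wide` or Kube-OVN's IP records), not a Kube-OVN API.

```python
from collections import defaultdict

def find_duplicate_ips(allocations):
    """allocations: iterable of (ip, pod_name, phase) tuples,
    e.g. built from `kubectl get pods -A -o wide` output."""
    by_ip = defaultdict(list)
    for ip, pod, phase in allocations:
        by_ip[ip].append((pod, phase))
    # An IP is suspect when more than one pod claims it, especially
    # if one claimant is no longer Running.
    return {ip: pods for ip, pods in by_ip.items() if len(pods) > 1}
```

In the case described, the stale record belonged to a pod in a terminal phase, which is exactly the pattern this kind of pass surfaces.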
In containerized OpenStack compute environments, a hard reboot or instance start can fail even when the hypervisor node itself looks healthy. One failure mode is a permissions mismatch on /dev/kvm, where the device inside the libvirt pod is mapped with ownership or permissions that do not line up with the host device.
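The mismatch amounts to a simple question: given the mode and group of the `/dev/kvm` node the pod actually sees, can the qemu process open it read-write? The sketch below models that check on stat values; the function name, the expected permission bits, and the group-based logic are assumptions for illustration, not libvirt's actual code path.

```python
import stat

def kvm_device_usable(st_mode, st_gid, container_kvm_gid):
    """Return True if a /dev/kvm node with this mode/gid would be
    openable read-write by a process in group `container_kvm_gid`."""
    if not stat.S_ISCHR(st_mode):          # must be a character device
        return False
    if (st_mode & stat.S_IROTH) and (st_mode & stat.S_IWOTH):
        return True                        # world rw (0666): always fine
    # Otherwise group rw only helps if the gid inside the pod lines up
    # with the gid on the host device node.
    return (st_gid == container_kvm_gid
            and bool(st_mode & stat.S_IRGRP)
            and bool(st_mode & stat.S_IWGRP))
```

The failure mode in question is the third branch: the device is group-rw, but the host's `kvm` gid and the gid known inside the libvirt pod disagree, so the open fails even though `ls -l` on the host looks normal.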
When Neutron and OVN drift out of sync, one of the standard recovery tools is neutron-ovn-db-sync-util. In some environments, though, the sync itself can fail before it repairs anything, especially if Neutron still contains stale objects that reference routers that no longer exist.
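One way to find those stale objects before running the sync is to look for Neutron ports whose `device_id` points at a router that is no longer in the router list. The sketch below is illustrative: the field names (`device_owner`, `device_id`) follow Neutron's API conventions, but the function and its filtering are an assumption about a useful pre-check, not the sync util's actual logic.

```python
def find_orphaned_router_ports(ports, routers):
    """Ports that claim to belong to a router that no longer exists.

    ports/routers: lists of dicts as returned by the Neutron API
    (e.g. via `openstack port list -f json`)."""
    live = {r["id"] for r in routers}
    return [
        p for p in ports
        if p.get("device_owner", "").startswith("network:router")
        and p.get("device_id") not in live
    ]
```

Anything this turns up is a candidate for manual cleanup in Neutron before retrying `neutron-ovn-db-sync-util`.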
In a large-scale OpenStack environment, especially one leveraging Genestack, networking is the lifeblood of the platform. One of the more frustrating issues operators can face is intermittent floating IP connectivity: it works for a while, then drops unexpectedly, or only succeeds from certain source networks.
I'll be honest. When the AMD Radeon AI PRO R9700 first showed up on my radar, I wasn't sure what to make of it. It's not a traditional datacenter card, and it's not a gaming card either. The R9700 is a 32 GB professional GPU that won't break the bank, sitting in a product category that didn't really exist eighteen months ago.
This week our team brought a pair of R9700 GPUs online in Rackspace OpenStack Flex. Like any good story, there was a bit of drama: servers, placement, shipping times, cable oddities, a chassis crisis, and more. We had the makings of a full-length K-drama with all the twists and turns. Once we got past the drama and the parts were installed, powering on and completing the deployment took about ten minutes, which is a testament to the power of Genestack's Kubernetes-native architecture and OpenStack's hardware-agnostic design.
Your instance is up, your AMD GPU is attached, and you're staring at a terminal with no nvidia-smi to lean on. Welcome to the other side.
If you've read our NVIDIA getting started guide, you know the drill: provision an instance, install drivers, verify the hardware, start computing. The AMD path follows the same logic but with different tooling. Instead of CUDA, you're working with ROCm. Instead of nvidia-smi, you've got rocm-smi. Instead of a driver ecosystem that's had two decades of cloud deployment polish, you've got one that's been moving fast and getting dramatically better, but still has some rough edges worth knowing about.
Let's be honest—NVIDIA's naming conventions are designed to confuse procurement teams. H100 SXM5, H100 NVL, H200 SXM, B200... it sounds like someone spilled alphabet soup on a product roadmap.
I've spent way too many hours explaining these differences to engineering teams, so here's everything you actually need to know before signing that hardware purchase order.
TL;DR: Small models (≤30B) and large models (100B+) require fundamentally different infrastructure skills. Small models are an inference optimization problem—make one GPU go fast. Large models are a distributed systems problem—coordinate a cluster, manage memory as the primary constraint, and plan for multi-minute failure recovery. The threshold is around 70B parameters. Most ML engineers are trained for the first problem, not the second.
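The 70B threshold in the TL;DR falls out of simple arithmetic: weights alone at 2 bytes per parameter (fp16/bf16) already exceed a single 80 GB accelerator. Here's a back-of-envelope sketch; the function names, the 80 GiB per-GPU figure, and the 1.2× overhead fudge factor for KV cache and runtime buffers are assumptions, not a sizing tool.

```python
import math

def weight_memory_gib(params_billions, bytes_per_param=2):
    """GiB needed just for the weights, assuming fp16/bf16."""
    return params_billions * 1e9 * bytes_per_param / 2**30

def gpus_needed(params_billions, gpu_gib=80, overhead=1.2):
    """Minimum GPUs to hold the weights plus a rough allowance
    for KV cache and framework buffers."""
    return math.ceil(weight_memory_gib(params_billions) * overhead / gpu_gib)
```

Under these assumptions a 30B model fits on one 80 GB card, while 70B needs about 130 GiB of weights and therefore at least two, and that's the point where the problem stops being "make one GPU go fast" and becomes distributed resource management.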
Here's something companies learn after burning through six figures in cloud credits: the skills for small models and large models are completely different. And most of your existing infra people can't do both.
Once you cross ~70B parameters, your job description flips. You're not doing inference optimization anymore. You're doing distributed resource management. Also known as: the nightmare.
Your finance team doesn't care about tokens per second. They care about predictable costs, compliance risk, and vendor lock-in. Here's how CPU inference stacks up.