The Business Case for CPU-Based AI Inference
Your finance team doesn't care about tokens per second. They care about predictable costs, compliance risk, and vendor lock-in. Here's how CPU inference stacks up.
The other week I published a technical deep-dive on running LLM inference with AMD EPYC processors and ZenDNN. The benchmarks showed that a $0.79/hour VM can push 40-125 tokens per second depending on model size: genuinely usable performance for a surprising range of workloads.
But benchmarks don't answer the question that actually matters: Should you do this?
That depends on your volume, your compliance requirements, and your appetite for vendor dependency. Let's work through the math and the strategy.
The Economics: What Does CPU Inference Actually Cost?
The $0.79/hour headline is useful, but CIOs and finance teams think in cost-per-unit. For AI inference, that unit is tokens. Let's translate.
Cost-Per-Token Calculation
Using results for Qwen3-4B, a capable 4-billion-parameter model suitable for production workloads:
| Metric | Value |
|---|---|
| Output throughput | 60.00 tokens/sec |
| Tokens per hour | 216,000 |
| Instance cost | $0.79/hour |
| Cost per million output tokens | $3.66 |
For other models in the benchmark suite:
| Model | Parameters | Tokens/sec | Tokens/Hour | Cost per Million Tokens |
|---|---|---|---|---|
| Qwen3-0.6B | 0.6B | 124.81 | 449,316 | $1.76 |
| Llama-3.2-1B | 1B | 123.95 | 446,220 | $1.77 |
| Gemma-3-1b-it | 1B | 97.35 | 350,460 | $2.25 |
| Qwen3-1.7B | 1.7B | 95.55 | 343,980 | $2.30 |
| Llama-3.2-3B | 3B | 71.87 | 258,732 | $3.05 |
| Phi-4-mini-instruct | 4B | 67.32 | 242,352 | $3.26 |
| Qwen3-4B | 4B | 60.00 | 216,000 | $3.66 |
| Gemma-3-4b-it | 4B | 53.83 | 193,788 | $4.08 |
| Qwen3-8B | 8B | 39.75 | 143,100 | $5.52 |
| Gemma-3-12b-it | 12B | 25.36 | 91,296 | $8.65 |
| Phi-4 | 15B | 23.83 | 85,788 | $9.21 |
These numbers assume continuous operation at observed throughput rates. Real-world utilization varies, but the math scales linearly.
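The figures in these tables reduce to one line of arithmetic: hourly instance cost divided by hourly token output. A minimal sketch that reproduces them from the measured throughput numbers:

```python
# Reproduce the cost-per-million-output-token figures from the tables above.
# Inputs: hourly instance price and measured output throughput (tokens/sec).

INSTANCE_COST_PER_HOUR = 0.79  # $0.79/hour AMD EPYC VM from the benchmark post

def cost_per_million_tokens(tokens_per_sec: float,
                            hourly_cost: float = INSTANCE_COST_PER_HOUR) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_cost / tokens_per_hour * 1_000_000

benchmarks = {
    "Qwen3-0.6B": 124.81,
    "Qwen3-4B": 60.00,
    "Phi-4": 23.83,
}

for model, tps in benchmarks.items():
    print(f"{model}: ${cost_per_million_tokens(tps):.2f} per million output tokens")
# Qwen3-0.6B: $1.76, Qwen3-4B: $3.66, Phi-4: $9.21
```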
How Does This Compare to Commercial APIs?
Here's where honesty matters more than advocacy.
| Provider | Model | Output $/Million | Capability Class |
|---|---|---|---|
| OpenAI | GPT-4o-mini | $0.60 | Small, fast |
| Self-hosted | Qwen3-0.6B | $1.76 | Small, fast |
| Self-hosted | Llama-3.2-1B | $1.77 | Small, fast |
| Self-hosted | Qwen3-4B | $3.66 | Medium, capable |
| Anthropic | Claude Haiku 3.5 | $4.00 | Small, capable |
| OpenAI | GPT-4o | $10.00 | Large, sophisticated |
| Anthropic | Claude Sonnet 4.5 | $15.00 | Large, sophisticated |
The honest take: On pure cost-per-token for the smallest models, GPT-4o-mini wins. At $0.60 per million output tokens, OpenAI's pricing reflects massive scale economies that self-hosted infrastructure can't match at low volumes.
But that comparison is incomplete.
The Crossover: Where Self-Hosted Wins
Commercial API pricing assumes you pay per token. Self-hosted infrastructure costs the same whether you process one token or one billion.
Scenario: Customer support automation
Processing 100,000 tickets monthly, averaging 500 tokens input and 300 tokens output each.
Monthly volume: 50 million input tokens + 30 million output tokens
| Approach | Model | Monthly Cost |
|---|---|---|
| OpenAI API | GPT-4o-mini | $25.50 |
| Anthropic API | Claude Sonnet 4.5 | $600.00 |
| Self-hosted | Qwen3-4B (continuous) | $575.00 |
At 80 million tokens monthly, self-hosted CPU inference roughly matches Claude Sonnet pricing while providing a model you fully control.
Scale it further: 500,000 tickets monthly (400 million tokens):
| Approach | Model | Monthly Cost |
|---|---|---|
| OpenAI API | GPT-4o-mini | $127.50 |
| Anthropic API | Claude Sonnet 4.5 | $3,000.00 |
| Self-hosted | Qwen3-4B (continuous) | $575.00 |
Now self-hosted beats everything except the cheapest API tier, and that cheapest tier comes with constraints that matter.
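A rough sketch of that crossover math, parameterized so you can plug in your own volumes. The API rates are the ones implied by the scenario figures above (GPT-4o-mini at $0.15/$0.60 and Claude Sonnet 4.5 at $3/$15 per million input/output tokens); verify current list prices before relying on them:

```python
# Monthly cost model for the support-ticket scenario: fixed-cost self-hosting
# vs. pay-per-token APIs. Rates mirror the tables above.

HOURS_PER_MONTH = 728          # matches the ~$575/month figure at $0.79/hour
SELF_HOSTED_HOURLY = 0.79

API_RATES = {                  # (input $/M tokens, output $/M tokens)
    "GPT-4o-mini": (0.15, 0.60),
    "Claude Sonnet 4.5": (3.00, 15.00),
}

def api_monthly_cost(tickets: int, rates: tuple[float, float],
                     in_tokens: int = 500, out_tokens: int = 300) -> float:
    in_rate, out_rate = rates
    return tickets * (in_tokens / 1e6 * in_rate + out_tokens / 1e6 * out_rate)

def self_hosted_monthly_cost() -> float:
    return SELF_HOSTED_HOURLY * HOURS_PER_MONTH  # flat, regardless of volume

for tickets in (100_000, 500_000):
    print(f"{tickets:,} tickets/month:")
    for name, rates in API_RATES.items():
        print(f"  {name}: ${api_monthly_cost(tickets, rates):,.2f}")
    print(f"  Self-hosted Qwen3-4B: ${self_hosted_monthly_cost():,.2f}")
```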
The Hidden Costs APIs Don't Show
Commercial API pricing excludes costs that don't appear on invoices but absolutely appear in your operations:
| Factor | API Reality | Self-Hosted Reality |
|---|---|---|
| Rate limits | Throttling during demand spikes | Your capacity, your limits |
| Outages | Service degradation outside your control | Hardware fails predictably, locally |
| Price changes | OpenAI has changed pricing multiple times | Hardware costs track commodity markets |
| Model deprecation | GPT-3 is gone; integrations break | Downloaded weights are permanent |
| Compliance burden | Proving data handling to auditors | Data never leaves your infrastructure |
Self-hosted infrastructure has its own hidden costs: operations, maintenance, expertise. But those costs are visible and controllable. API costs are opaque until the invoice arrives or the service degrades.
The Break-Even Framework
When APIs make sense:
- Under 20 million tokens monthly
- No compliance constraints on data handling
- Variable, unpredictable workload patterns
- Time-to-value priority over long-term cost optimization
When self-hosted makes sense:
- Over 50 million tokens monthly
- Regulated data requiring sovereignty
- Predictable, sustained workload patterns
- Strategic priority on vendor independence
The hybrid approach: Self-hosted for baseline capacity, API overflow for burst demand. Fixed costs stay controlled; spikes get absorbed without over-provisioning.
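To make the hybrid pattern concrete, here is an illustrative overflow router. The concurrency ceiling and the two send_to_* functions are hypothetical stand-ins for your actual self-hosted endpoint and API client, not part of any benchmarked setup:

```python
# Illustrative overflow router: self-hosted baseline, API for burst demand.
# The backends here are stubs; real deployments would wrap vLLM's
# OpenAI-compatible endpoint and a commercial API client.

import threading

SELF_HOSTED_MAX_INFLIGHT = 8   # hypothetical concurrency ceiling for one CPU instance

_inflight = 0
_lock = threading.Lock()

def send_to_self_hosted(prompt: str) -> str:   # placeholder backend
    return f"[self-hosted] {prompt[:40]}"

def send_to_api(prompt: str) -> str:           # placeholder backend
    return f"[api overflow] {prompt[:40]}"

def route(prompt: str) -> str:
    """Prefer the fixed-cost backend; overflow to the API when it is saturated."""
    global _inflight
    with _lock:
        use_local = _inflight < SELF_HOSTED_MAX_INFLIGHT
        if use_local:
            _inflight += 1
    if not use_local:
        return send_to_api(prompt)
    try:
        return send_to_self_hosted(prompt)
    finally:
        with _lock:
            _inflight -= 1
```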
Beyond Cost: Control, Compliance, and Continuity
For many organizations, cost-per-token isn't the deciding factor. Three strategic considerations often matter more.
Data Sovereignty: The Compliance Imperative
Every token sent to a commercial API leaves your infrastructure. For organizations handling protected health information, personally identifiable information, financial data, or classified information, this creates immediate complexity.
The HIPAA reality: When patient data goes to OpenAI's API, you're trusting their Business Associate Agreement, their security practices, their employee access controls, and their data retention policies. You're also trusting that no prompt injection or model behavior will cause data leakage across tenant boundaries.
Self-hosted inference eliminates this trust chain. PHI never leaves your network. Audit trails exist on systems you control. Data residency is whatever you configure.
| Sector | API Challenge | Self-Hosted Advantage |
|---|---|---|
| Healthcare | BAA complexity, PHI exposure | Complete data isolation |
| Financial Services | SOX audit requirements | Full audit trail ownership |
| Legal | Attorney-client privilege | Air-gapped deployment option |
| Government | FedRAMP, ITAR, classification | On-premise or sovereign cloud |
This isn't theoretical: multiple healthcare organizations in the US and several European medical centers run production AI workloads on OpenStack specifically because data sovereignty requirements preclude commercial API usage.
Vendor Independence: The Lock-In Calculation
Commercial AI APIs create dependencies that compound over time.
Model availability risk: OpenAI has deprecated models multiple times. GPT-3 is gone. GPT-4 variants come and go. If your application depends on specific model behavior, you're one deprecation notice away from an emergency rewrite.
Pricing power asymmetry: Once your application is built on a specific API, switching costs are substantial. GPT-4 launched at $60/$30 per million output/input tokens; GPT-4o now runs $10/$2.50. Good for customers, but it demonstrates that pricing is a strategic decision by the vendor, not a stable input to your financial planning.
The open alternative: Self-hosted inference on open-weight models (Llama, Qwen, Gemma, Phi) provides permanent model availability, commodity-market pricing for hardware, and portability across any deployment target. If your cloud provider disappeared tomorrow, vLLM containers run identically elsewhere. The same cannot be said for applications built on proprietary API features.
Operational Continuity: Failure Modes Matter
Commercial AI services experience outages. When the API is down, your AI-powered features are down.
Self-hosted infrastructure fails differently. Hardware fails predictably and locally. You build redundancy. You control failover. You own the blast radius.
| Failure Mode | API Impact | Self-Hosted Impact |
|---|---|---|
| Provider outage | Complete service loss | N/A |
| Rate limiting | Degraded service, queuing | N/A |
| Network partition | Service loss | Local operations continue |
| Hardware failure | N/A | Failover to redundant capacity |
| Demand spike | Throttling, latency | Scale horizontally |
For applications where AI is core functionality, availability on your terms matters.
The Strategic Framework
Evaluate AI infrastructure across three dimensions:
Data Sensitivity
- Low sensitivity, non-regulated → APIs are fine
- Moderate sensitivity, compliance-conscious → Evaluate carefully
- High sensitivity, heavily regulated → Self-hosted likely required
Volume and Predictability
- Low volume, variable → APIs offer flexibility
- High volume, predictable → Self-hosted offers cost stability
- Burst capacity needs → Hybrid approach
Strategic Importance
- Experimental/exploratory → APIs minimize commitment
- Core business capability → Ownership reduces risk
- Competitive differentiation → Control enables customization
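If it helps to see the framework as something executable, the sketch below encodes the three dimensions as a rough triage function. The thresholds simply mirror this post's break-even numbers and are judgment calls, not industry standards:

```python
# Rough triage of the three dimensions discussed above; labels and thresholds
# follow this post's framework and should be tuned to your own situation.

def recommend(monthly_tokens: int, regulated_data: bool, core_capability: bool) -> str:
    if regulated_data:
        return "self-hosted (data sovereignty likely requires it)"
    if monthly_tokens < 20_000_000 and not core_capability:
        return "commercial API (low volume, low commitment)"
    if monthly_tokens > 50_000_000:
        return "self-hosted, with optional API overflow for bursts"
    return "hybrid: self-hosted baseline plus API burst capacity"

print(recommend(monthly_tokens=80_000_000, regulated_data=False, core_capability=True))
# -> self-hosted, with optional API overflow for bursts
```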
Is CPU Inference Production-Ready?
Yes, with appropriate expectations.
The technical testing demonstrated zero failed requests across all benchmark runs, stable memory utilization, and consistent throughput. The benchmarks include both ZenDNN-optimized and baseline configurations, showing measurable performance improvements from AMD's optimization library.
Version Compatibility
The vLLM and ZenTorch ecosystems are evolving rapidly. The technical post includes guidance on version compatibility; currently v0.11.0 is the recommended vLLM version for ZenTorch integration. Running from HEAD works but may produce version warnings. Pin your versions for production deployments.
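One lightweight way to enforce that pin is a startup check that refuses to run against unvalidated versions. A minimal sketch, assuming the package is installed from PyPI under the name vllm; a zentorch pin can be added the same way once you know which version you validated:

```python
# Fail fast if the deployed container drifts from the validated versions.
from importlib.metadata import version, PackageNotFoundError

EXPECTED = {"vllm": "0.11.0"}   # the vLLM version recommended above for ZenTorch integration

for package, expected in EXPECTED.items():
    try:
        installed = version(package)
    except PackageNotFoundError:
        raise SystemExit(f"{package} is not installed")
    if installed != expected:
        raise SystemExit(f"{package} {installed} found, expected {expected}; "
                         "pin versions for production deployments")
print("version check passed")
```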
What requires attention:
- Version compatibility between vLLM and ZenTorch needs monitoring
- Larger models (12B+) push memory limits; size your instances appropriately
- Operational expertise for container orchestration and monitoring is required
This is mature enough for production use cases where operational requirements align with organizational capabilities. It's not "deploy and forget", but neither is any production AI infrastructure.
The Decision
| If you need... | Consider... |
|---|---|
| Fastest time-to-value | Commercial APIs |
| Lowest cost at small scale | Commercial APIs (GPT-4o-mini, Claude Haiku) |
| Predictable costs at scale | Self-hosted |
| Data sovereignty | Self-hosted |
| Regulatory compliance | Self-hosted |
| Vendor independence | Self-hosted |
| Development/testing environment | Self-hosted CPU |
The answer isn't universal. Commercial APIs are excellent for many use cases: fast, capable, constantly improving.
But the assumption that APIs are always the right choice deserves scrutiny. For organizations where data control, cost predictability, and vendor independence are strategic priorities, CPU inference on your own infrastructure is a legitimate option that the industry has under-discussed.
The 24-core AMD EPYC VM processing 60 tokens per second on a 4B parameter model isn't competing with H100 clusters. It's competing with the default assumption that AI inference requires external dependencies. For a substantial category of workloads, it doesn't.
Getting Started
The companion technical post covers the complete setup: Docker builds, vLLM configuration, ZenTorch integration, and benchmark methodology. Start there if you want to replicate the testing or deploy CPU inference in your own environment.
For Rackspace OpenStack Flex customers, the gp.5 flavor family provides AMD EPYC instances ready for this workload. Adjust your KV cache and CPU binding settings based on instance size, and you're running inference within the hour.
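As a rough illustration of that tuning, the sketch below sets the CPU-backend knobs through environment variables before loading a model with vLLM's offline API. The variable names VLLM_CPU_KVCACHE_SPACE and VLLM_CPU_OMP_THREADS_BIND come from vLLM's CPU backend documentation; the values are placeholders for a 24-core instance and should be sized to your flavor rather than copied verbatim:

```python
# Illustrative CPU tuning for vLLM's offline engine; values are placeholders.
import os

# Reserve KV-cache memory (GiB) and bind worker threads to specific cores.
os.environ.setdefault("VLLM_CPU_KVCACHE_SPACE", "8")        # size against instance RAM
os.environ.setdefault("VLLM_CPU_OMP_THREADS_BIND", "0-23")  # 24-core gp.5-style flavor

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-4B")  # the 4B model benchmarked above
params = SamplingParams(temperature=0.2, max_tokens=128)
out = llm.generate(["Summarize this support ticket in two sentences: ..."], params)
print(out[0].outputs[0].text)
```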
API pricing reflects published rates as of December 2025. Pricing changes frequently; verify current rates before making infrastructure decisions.