The Business Case for CPU-Based AI Inference
Your finance team doesn't care about tokens per second. They care about predictable costs, compliance risk, and vendor lock-in. Here's how CPU inference stacks up.
The other week I published a technical deep-dive on running LLM inference with AMD EPYC processors and ZenDNN. The benchmarks showed that a $0.79/hour VM can push 40-125 tokens per second depending on model size: genuinely usable performance for a surprising range of workloads.
But benchmarks don't answer the question that actually matters: Should you do this?
That depends on your volume, your compliance requirements, and your appetite for vendor dependency. Let's work through the math and the strategy.
The Economics: What Does CPU Inference Actually Cost?
The $0.79/hour headline is useful, but CIOs and finance teams think in cost-per-unit. For AI inference, that unit is tokens. Let's translate.
Cost-Per-Token Calculation
Using results for Qwen3-4B, a capable 4-billion-parameter model suitable for production workloads:
| Metric | Value |
|---|---|
| Output throughput | 60.00 tokens/sec |
| Tokens per hour | 216,000 |
| Instance cost | $0.79/hour |
| Cost per million output tokens | $3.66 |
For other models in the benchmark suite:
| Model | Parameters | Tokens/sec | Tokens/Hour | Cost per Million Tokens |
|---|---|---|---|---|
| Qwen3-0.6B | 0.6B | 124.81 | 449,316 | $1.76 |
| Llama-3.2-1B | 1B | 123.95 | 446,220 | $1.77 |
| Gemma-3-1b-it | 1B | 97.35 | 350,460 | $2.25 |
| Qwen3-1.7B | 1.7B | 95.55 | 343,980 | $2.30 |
| Llama-3.2-3B | 3B | 71.87 | 258,732 | $3.05 |
| Phi-4-mini-instruct | 4B | 67.32 | 242,352 | $3.26 |
| Qwen3-4B | 4B | 60.00 | 216,000 | $3.66 |
| Gemma-3-4b-it | 4B | 53.83 | 193,788 | $4.08 |
| Qwen3-8B | 8B | 39.75 | 143,100 | $5.52 |
| Gemma-3-12b-it | 12B | 25.36 | 91,296 | $8.65 |
| Phi-4 | 15B | 23.83 | 85,788 | $9.21 |
These numbers assume continuous operation at observed throughput rates. Real-world utilization varies, but the math scales linearly.
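The figures in these tables reduce to one line of arithmetic: hourly instance cost divided by hourly token output. A minimal sketch that reproduces them from the measured throughput numbers:

```python
# Reproduce the cost-per-million-output-token figures from the tables above.
# Inputs: hourly instance price and measured output throughput (tokens/sec).

INSTANCE_COST_PER_HOUR = 0.79  # $0.79/hour AMD EPYC VM from the benchmark post

def cost_per_million_tokens(tokens_per_sec: float,
                            hourly_cost: float = INSTANCE_COST_PER_HOUR) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_cost / tokens_per_hour * 1_000_000

benchmarks = {
    "Qwen3-0.6B": 124.81,
    "Qwen3-4B": 60.00,
    "Phi-4": 23.83,
}

for model, tps in benchmarks.items():
    print(f"{model}: ${cost_per_million_tokens(tps):.2f} per million output tokens")
# Qwen3-0.6B: $1.76, Qwen3-4B: $3.66, Phi-4: $9.21
```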
How Does This Compare to Commercial APIs?
Here's where honesty matters more than advocacy.
| Provider | Model | Output $/Million | Capability Class |
|---|---|---|---|
| OpenAI | GPT-4o-mini | $0.60 | Small, fast |
| Self-hosted | Qwen3-0.6B | $1.76 | Small, fast |
| Self-hosted | Llama-3.2-1B | $1.77 | Small, fast |
| Self-hosted | Qwen3-4B | $3.66 | Medium, capable |
| Anthropic | Claude Haiku 3.5 | $4.00 | Small, capable |
| OpenAI | GPT-4o | $10.00 | Large, sophisticated |
| Anthropic | Claude Sonnet 4.5 | $15.00 | Large, sophisticated |
The honest take: On pure cost-per-token for the smallest models, GPT-4o-mini wins. At $0.60 per million output tokens, OpenAI's pricing reflects massive scale economies that self-hosted infrastructure can't match at low volumes.
But that comparison is incomplete.
The Crossover: Where Self-Hosted Wins
Commercial API pricing assumes you pay per token. Self-hosted infrastructure costs the same whether you process one token or one billion.
Scenario: Customer support automation
Processing 100,000 tickets monthly, averaging 500 tokens input and 300 tokens output each.
Monthly volume: 50 million input tokens + 30 million output tokens
| Approach | Model | Monthly Cost |
|---|---|---|
| OpenAI API | GPT-4o-mini | $25.50 |
| Anthropic API | Claude Sonnet 4.5 | $600.00 |
| Self-hosted | Qwen3-4B (continuous) | $575.00 |
At 80 million tokens monthly, self-hosted CPU inference roughly matches Claude Sonnet pricing while providing a model you fully control.
Scale it further: 500,000 tickets monthly (400 million tokens):
| Approach | Model | Monthly Cost |
|---|---|---|
| OpenAI API | GPT-4o-mini | $127.50 |
| Anthropic API | Claude Sonnet 4.5 | $3,000.00 |
| Self-hosted | Qwen3-4B (continuous) | $575.00 |
Now self-hosted beats everything except the cheapest API tier, and that cheapest tier comes with constraints that matter.
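A rough sketch of that crossover math, parameterized so you can plug in your own volumes. The API rates are the ones implied by the scenario figures above (GPT-4o-mini at $0.15/$0.60 and Claude Sonnet 4.5 at $3/$15 per million input/output tokens); verify current list prices before relying on them:

```python
# Monthly cost model for the support-ticket scenario: fixed-cost self-hosting
# vs. pay-per-token APIs. Rates mirror the tables above.

HOURS_PER_MONTH = 728          # matches the ~$575/month figure at $0.79/hour
SELF_HOSTED_HOURLY = 0.79

API_RATES = {                  # (input $/M tokens, output $/M tokens)
    "GPT-4o-mini": (0.15, 0.60),
    "Claude Sonnet 4.5": (3.00, 15.00),
}

def api_monthly_cost(tickets: int, rates: tuple[float, float],
                     in_tokens: int = 500, out_tokens: int = 300) -> float:
    in_rate, out_rate = rates
    return tickets * (in_tokens / 1e6 * in_rate + out_tokens / 1e6 * out_rate)

def self_hosted_monthly_cost() -> float:
    return SELF_HOSTED_HOURLY * HOURS_PER_MONTH  # flat, regardless of volume

for tickets in (100_000, 500_000):
    print(f"{tickets:,} tickets/month:")
    for name, rates in API_RATES.items():
        print(f"  {name}: ${api_monthly_cost(tickets, rates):,.2f}")
    print(f"  Self-hosted Qwen3-4B: ${self_hosted_monthly_cost():,.2f}")
```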
The Hidden Costs APIs Don't Show
Commercial API pricing excludes costs that don't appear on invoices but absolutely appear in your operations:
| Factor | API Reality | Self-Hosted Reality |
|---|---|---|
| Rate limits | Throttling during demand spikes | Your capacity, your limits |
| Outages | Service degradation outside your control | Hardware fails predictably, locally |
| Price changes | OpenAI has changed pricing multiple times | Hardware costs track commodity markets |
| Model deprecation | GPT-3 is gone; integrations break | Downloaded weights are permanent |
| Compliance burden | Proving data handling to auditors | Data never leaves your infrastructure |
Self-hosted infrastructure has its own hidden costs: operations, maintenance, expertise. But those costs are visible and controllable. API costs are opaque until the invoice arrives or the service degrades.
The Break-Even Framework
When APIs make sense:
- Under 20 million tokens monthly
- No compliance constraints on data handling
- Variable, unpredictable workload patterns
- Time-to-value priority over long-term cost optimization
When self-hosted makes sense:
- Over 50 million tokens monthly
- Regulated data requiring sovereignty
- Predictable, sustained workload patterns
- Strategic priority on vendor independence
The hybrid approach: Self-hosted for baseline capacity, API overflow for burst demand. Fixed costs stay controlled; spikes get absorbed without over-provisioning.
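To make the hybrid pattern concrete, here is an illustrative overflow router. The concurrency ceiling and the two send_to_* functions are hypothetical stand-ins for your actual self-hosted endpoint and API client, not part of any benchmarked setup:

```python
# Illustrative overflow router: self-hosted baseline, API for burst demand.
# The backends here are stubs; real deployments would wrap vLLM's
# OpenAI-compatible endpoint and a commercial API client.

import threading

SELF_HOSTED_MAX_INFLIGHT = 8   # hypothetical concurrency ceiling for one CPU instance

_inflight = 0
_lock = threading.Lock()

def send_to_self_hosted(prompt: str) -> str:   # placeholder backend
    return f"[self-hosted] {prompt[:40]}"

def send_to_api(prompt: str) -> str:           # placeholder backend
    return f"[api overflow] {prompt[:40]}"

def route(prompt: str) -> str:
    """Prefer the fixed-cost backend; overflow to the API when it is saturated."""
    global _inflight
    with _lock:
        use_local = _inflight < SELF_HOSTED_MAX_INFLIGHT
        if use_local:
            _inflight += 1
    if not use_local:
        return send_to_api(prompt)
    try:
        return send_to_self_hosted(prompt)
    finally:
        with _lock:
            _inflight -= 1
```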
Beyond Cost: Control, Compliance, and Continuity
For many organizations, cost-per-token isn't the deciding factor. Three strategic considerations often matter more.
Data Sovereignty: The Compliance Imperative
Every token sent to a commercial API leaves your infrastructure. For organizations handling protected health information, personally identifiable information, financial data, or classified information, this creates immediate complexity.
The HIPAA reality: When patient data goes to OpenAI's API, you're trusting their Business Associate Agreement, their security practices, their employee access controls, and their data retention policies. You're also trusting that no prompt injection or model behavior will cause data leakage across tenant boundaries.
Self-hosted inference eliminates this trust chain. PHI never leaves your network. Audit trails exist on systems you control. Data residency is whatever you configure.
| Sector | API Challenge | Self-Hosted Advantage |
|---|---|---|
| Healthcare | BAA complexity, PHI exposure | Complete data isolation |
| Financial Services | SOX audit requirements | Full audit trail ownership |
| Legal | Attorney-client privilege | Air-gapped deployment option |
| Government | FedRAMP, ITAR, classification | On-premise or sovereign cloud |
This isn't theoretical: multiple healthcare organizations in the US and several European medical centers run production AI workloads on OpenStack specifically because data sovereignty requirements preclude commercial API usage.
Vendor Independence: The Lock-In Calculation
Commercial AI APIs create dependencies that compound over time.
Model availability risk: OpenAI has deprecated models multiple times. GPT-3 is gone. GPT-4 variants come and go. If your application depends on specific model behavior, you're one deprecation notice away from an emergency rewrite.
Pricing power asymmetry: Once your application is built on a specific API, switching costs are substantial. GPT-4 launched at $60/$30 per million output/input tokens; GPT-4o now runs $10/$2.50. Good for customers, but it demonstrates that pricing is a strategic decision by the vendor, not a stable input to your financial planning.
The open alternative: Self-hosted inference on open-weight models (Llama, Qwen, Gemma, Phi) provides permanent model availability, commodity-market pricing for hardware, and portability across any deployment target. If your cloud provider disappeared tomorrow, vLLM containers run identically elsewhere. The same cannot be said for applications built on proprietary API features.
Operational Continuity: Failure Modes Matter
Commercial AI services experience outages. When the API is down, your AI-powered features are down.
Self-hosted infrastructure fails differently. Hardware fails predictably and locally. You build redundancy. You control failover. You own the blast radius.
| Failure Mode | API Impact | Self-Hosted Impact |
|---|---|---|
| Provider outage | Complete service loss | N/A |
| Rate limiting | Degraded service, queuing | N/A |
| Network partition | Service loss | Local operations continue |
| Hardware failure | N/A | Failover to redundant capacity |
| Demand spike | Throttling, latency | Scale horizontally |
For applications where AI is core functionality, availability on your terms matters.
The Strategic Framework
Evaluate AI infrastructure across three dimensions:
Data Sensitivity
- Low sensitivity, non-regulated → APIs are fine
- Moderate sensitivity, compliance-conscious → Evaluate carefully
- High sensitivity, heavily regulated → Self-hosted likely required
Volume and Predictability
- Low volume, variable → APIs offer flexibility
- High volume, predictable → Self-hosted offers cost stability
- Burst capacity needs → Hybrid approach
Strategic Importance
- Experimental/exploratory → APIs minimize commitment
- Core business capability → Ownership reduces risk
- Competitive differentiation → Control enables customization
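If it helps to see the framework as something executable, the sketch below encodes the three dimensions as a rough triage function. The thresholds simply mirror this post's break-even numbers and are judgment calls, not industry standards:

```python
# Rough triage of the three dimensions discussed above; labels and thresholds
# follow this post's framework and should be tuned to your own situation.

def recommend(monthly_tokens: int, regulated_data: bool, core_capability: bool) -> str:
    if regulated_data:
        return "self-hosted (data sovereignty likely requires it)"
    if monthly_tokens < 20_000_000 and not core_capability:
        return "commercial API (low volume, low commitment)"
    if monthly_tokens > 50_000_000:
        return "self-hosted, with optional API overflow for bursts"
    return "hybrid: self-hosted baseline plus API burst capacity"

print(recommend(monthly_tokens=80_000_000, regulated_data=False, core_capability=True))
# -> self-hosted, with optional API overflow for bursts
```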
Is CPU Inference Production-Ready?
Yes, with appropriate expectations.
The technical testing demonstrated zero failed requests across all benchmark runs, stable memory utilization, and consistent throughput. The benchmarks include both ZenDNN-optimized and baseline configurations, showing measurable performance improvements from AMD's optimization library.
Version Compatibility
The vLLM and ZenTorch ecosystems are evolving rapidly. The technical post includes guidance on version compatibility; currently v0.11.0 is the recommended vLLM version for ZenTorch integration. Running from HEAD works but may produce version warnings. Pin your versions for production deployments.
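One lightweight way to enforce that pin is a startup check that refuses to run against unvalidated versions. A minimal sketch, assuming the package is installed from PyPI under the name vllm; a zentorch pin can be added the same way once you know which version you validated:

```python
# Fail fast if the deployed container drifts from the validated versions.
from importlib.metadata import version, PackageNotFoundError

EXPECTED = {"vllm": "0.11.0"}   # the vLLM version recommended above for ZenTorch integration

for package, expected in EXPECTED.items():
    try:
        installed = version(package)
    except PackageNotFoundError:
        raise SystemExit(f"{package} is not installed")
    if installed != expected:
        raise SystemExit(f"{package} {installed} found, expected {expected}; "
                         "pin versions for production deployments")
print("version check passed")
```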
What requires attention:
- Version compatibility between vLLM and ZenTorch needs monitoring
- Larger models (12B+) push memory limits; size your instances appropriately
- Operational expertise for container orchestration and monitoring is required
This is mature enough for production use cases where operational requirements align with organizational capabilities. It's not "deploy and forget", but neither is any production AI infrastructure.
The Decision
| If you need... | Consider... |
|---|---|
| Fastest time-to-value | Commercial APIs |
| Lowest cost at small scale | Commercial APIs (GPT-4o-mini, Claude Haiku) |
| Predictable costs at scale | Self-hosted |
| Data sovereignty | Self-hosted |
| Regulatory compliance | Self-hosted |
| Vendor independence | Self-hosted |
| Development/testing environment | Self-hosted CPU |
The answer isn't universal. Commercial APIs are excellent for many use cases: fast, capable, constantly improving.
But the assumption that APIs are always the right choice deserves scrutiny. For organizations where data control, cost predictability, and vendor independence are strategic priorities, CPU inference on your own infrastructure is a legitimate option that the industry has under-discussed.
The 24-core AMD EPYC VM processing 60 tokens per second on a 4B parameter model isn't competing with H100 clusters. It's competing with the default assumption that AI inference requires external dependencies. For a substantial category of workloads, it doesn't.
Getting Started
The companion technical post covers the complete setup: Docker builds, vLLM configuration, ZenTorch integration, and benchmark methodology. Start there if you want to replicate the testing or deploy CPU inference in your own environment.
For Rackspace OpenStack Flex customers, the gp.5 flavor family provides AMD EPYC instances ready for this workload. Adjust your KV cache and CPU binding settings based on instance size, and you're running inference within the hour.
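As a rough illustration of that tuning, the sketch below sets the CPU-backend knobs through environment variables before loading a model with vLLM's offline API. The variable names VLLM_CPU_KVCACHE_SPACE and VLLM_CPU_OMP_THREADS_BIND come from vLLM's CPU backend documentation; the values are placeholders for a 24-core instance and should be sized to your flavor rather than copied verbatim:

```python
# Illustrative CPU tuning for vLLM's offline engine; values are placeholders.
import os

# Reserve KV-cache memory (GiB) and bind worker threads to specific cores.
os.environ.setdefault("VLLM_CPU_KVCACHE_SPACE", "8")        # size against instance RAM
os.environ.setdefault("VLLM_CPU_OMP_THREADS_BIND", "0-23")  # 24-core gp.5-style flavor

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-4B")  # the 4B model benchmarked above
params = SamplingParams(temperature=0.2, max_tokens=128)
out = llm.generate(["Summarize this support ticket in two sentences: ..."], params)
print(out[0].outputs[0].text)
```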
API pricing reflects published rates as of December 2025. Pricing changes frequently; verify current rates before making infrastructure decisions.