
Running AI Inference on AMD EPYC Without a GPU in Sight

Spoiler: You don't need a $40,000 GPU to run LLM inference. Sometimes 24 CPU cores and the right software stack will do just fine.

The AI infrastructure conversation has become almost synonymous with GPU procurement battles, NVIDIA allocation queues, and eye-watering hardware costs. But here's a reality that doesn't get enough attention: for many inference workloads, especially during development, testing, and moderate-scale production, modern CPUs with optimized software can deliver surprisingly capable performance at a fraction of the cost.

I recently spent some time exploring AMD's ZenDNN optimization library paired with vLLM on Rackspace OpenStack Flex, and the results challenge the assumption that CPU inference is merely a curiosity. Let me walk through what I found.

The Setup: AMD EPYC 9454 on OpenStack Flex

For this testing, I spun up a general-purpose VM in Rackspace OpenStack Flex's DFW3 environment using the gp.5.24.96 flavor.

Resource      Specification
vCPUs         24
RAM           96 GB
Root Disk     240 GB
Ephemeral     128 GB
Processor     AMD EPYC 9454 (Genoa)
Hourly Cost   $0.79

The AMD EPYC 9454 is a 4th-generation Zen 4 processor with AVX-512 support, including the BF16 and VNNI extensions that matter for inference workloads. These aren't just marketing checkboxes; they translate directly into optimized matrix operations that LLMs depend on.
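Before building anything, it can be worth confirming that the VM actually exposes these extensions to the guest. The sketch below assumes a Linux guest, where the kernel lists CPU features in /proc/cpuinfo under the names avx512f, avx512_bf16, and avx512_vnni; check_cpu_flags is a hypothetical helper name, not part of any tool used in this post.

```shell
# Report whether each AVX-512 feature relevant to the build is advertised by the CPU.
# $1 (optional): space-separated flag list; defaults to the first "flags" line
# of /proc/cpuinfo (assumes a Linux guest).
check_cpu_flags() {
    local flags="${1:-$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null | cut -d: -f2)}"
    local feature
    for feature in avx512f avx512_bf16 avx512_vnni; do
        # Pad with spaces so only whole flag names match.
        case " ${flags} " in
            *" ${feature} "*) echo "${feature}: yes" ;;
            *)                echo "${feature}: no" ;;
        esac
    done
}

check_cpu_flags
```

If any of the three lines reports "no", the matching build argument in the Docker build is unlikely to help on that instance.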

Containerization with Docker

This post doesn't cover installing Docker; make sure it is installed before getting started.

Getting vLLM

vLLM is an open-source library designed for efficient large language model inference. It supports CPU and GPU backends and features a pluggable architecture that allows integration with optimization libraries like ZenDNN. To get started, clone the vLLM repository.

git clone https://github.com/vllm-project/vllm

Tip

Review the version compatibility matrix in the vLLM and ZenTorch documentation to ensure you're using compatible versions of Python, PyTorch, vLLM, and ZenTorch (ZenDNN).

At the time of writing, the recommended vLLM version to use with ZenTorch is v0.11.0. To check out this specific version, run:

cd vllm
git checkout v0.11.0

Warning

Make sure to use the correct version of vLLM that is compatible with the ZenTorch plugin to avoid any runtime issues. If the version of ZenDNN-pytorch-plugin is incompatible with the installed vLLM version, you may encounter errors such as:

[WARNING zentorch.vllm - register:72] [zentorch] Unsupported vLLM version: X.YY.Z. Plugin supports versions: A.BB.C

This indicates that the plugin does not support the installed vLLM version; it will likely fail to load or run without its optimizations.

Building vLLM with ZenTorch

AMD's ZenDNN library provides optimized deep learning primitives specifically tuned for Zen architecture processors. The ZenTorch plugin integrates these optimizations into PyTorch, and by extension, into vLLM's inference pipeline.

Build the initial Docker image for vLLM with CPU optimizations and the AVX-512 extensions enabled.

docker build -f docker/Dockerfile.cpu \
             --build-arg VLLM_CPU_AVX512BF16=1 \
             --build-arg VLLM_CPU_AVX512VNNI=1 \
             --build-arg VLLM_CPU_DISABLE_AVX512=0 \
             --tag vllm-cpu:local \
             --target vllm-openai \
             .

With the base container built, we now add the layers to make sure we can leverage ZenDNN optimizations. The build process involves creating a custom Docker image that layers ZenDNN-pytorch-plugin on top of vLLM's CPU-optimized base image.

Create the Dockerfile for vLLM with ZenTorch at docker/Dockerfile.cpu-amd.

Dockerfile.cpu-amd
FROM vllm-cpu:local
RUN apt-get update -y \
    && apt-get install -y --no-install-recommends make cmake ccache git curl wget ca-certificates \
                                                gcc-12 g++-12 libtcmalloc-minimal4 libnuma-dev ffmpeg \
                                                libsm6 libxext6 libgl1 jq lsof libjemalloc2 gfortran \
    && update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12

RUN git clone https://github.com/amd/ZenDNN-pytorch-plugin.git && \
    cd ZenDNN-pytorch-plugin && \
    uv pip install -r requirements.txt && \
    CC=gcc CXX=g++ python3 setup.py bdist_wheel && \
    uv pip install dist/*.whl

ENTRYPOINT ["vllm", "serve"]

Now build the final Docker image with ZenTorch enabled.

docker build -f docker/Dockerfile.cpu-amd \
             --build-arg VLLM_CPU_AVX512BF16=1 \
             --build-arg VLLM_CPU_AVX512VNNI=1 \
             --build-arg VLLM_CPU_DISABLE_AVX512=0 \
             --tag vllm-cpu-zentorch:local \
             .

Runtime configuration binds vLLM to available CPU cores and allocates substantial memory for the KV cache to maximize throughput. If you plan to use smaller instances, adjust these values accordingly.

For the test environment I set the shared memory size to 95G to accommodate larger models.

computing SHM_SIZE
export SHM_SIZE="$(($(free -m | awk '/Mem/ {print $2}') - 1024))"

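For the test environment I set the CPU core binding for vLLM processing. Note that the expression below binds the full core range, 0 through nproc minus 1 (0-23 on this instance); the run command later also sets VLLM_CPU_NUM_OF_RESERVED_CPU=1.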

computing CORES
export CORES="0-$(($(nproc) - 1))"
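The expression above yields the range 0 through nproc minus 1, which covers every core. If the intent is to leave some cores out of the binding, a small helper along these lines makes the arithmetic explicit. This is a sketch only: core_range is a hypothetical name, and it assumes cores are numbered contiguously from 0.

```shell
# Print a core range "0-K" that leaves the last RESERVED cores unbound.
# $1: total core count, $2 (optional): number of cores to leave out (default 0).
core_range() {
    local total="$1" reserved="${2:-0}"
    if [ "$reserved" -ge "$total" ]; then
        echo "reserved ($reserved) must be smaller than total cores ($total)" >&2
        return 1
    fi
    echo "0-$((total - reserved - 1))"
}

core_range 24 0   # prints 0-23, the same range as the expression above on this instance
core_range 24 1   # prints 0-22, leaving the last core unbound
```

On a live system the first argument can come from nproc, for example export CORES="$(core_range "$(nproc)" 1)".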

For the vLLM server, we set VLLM_CPU_KVCACHE_SPACE to 50GB to allow room for larger models. Adjust this value based on your instance's memory capacity and the models you plan to run. As a general starting point, the snippet below uses 75% of total memory; we've found it should be reduced as needed to leave room for the model weights, the OS, and other processes.

setting KVCACHE_SPACE
export KVCACHE_SPACE="$(($(free -g | awk '/Mem/ {print $2}') * 75 / 100))"

Warning

Setting this value too high may lead to out-of-memory errors: the vLLM server fails with a RuntimeError if it cannot allocate the requested KV cache space. Monitor your system's memory usage during initial runs.
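To sanity-check a KV cache setting before starting the server, the memory budget can be roughed out in the shell. The sketch below assumes bfloat16 weights at 2 bytes per parameter (roughly 2 GB per billion parameters) and ignores activation and runtime overhead; memory_headroom_gib is a hypothetical helper, not part of vLLM, and it only accepts whole numbers.

```shell
# Estimate the memory left over after the KV cache and bfloat16 model weights.
# $1: total RAM in GB, $2: KV cache size in GB, $3: model size in billions of parameters.
memory_headroom_gib() {
    local total_gib="$1" kv_gib="$2" params_billion="$3"
    # Assumption: bfloat16 uses 2 bytes per parameter, so about 2 GB per billion parameters.
    local weights_gib=$((params_billion * 2))
    echo $((total_gib - kv_gib - weights_gib))
}

# 96 GB instance, 50 GB KV cache, 8B-parameter model
memory_headroom_gib 96 50 8   # prints 30
```

A result near zero or below suggests the KV cache value should be lowered for that model.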

Finally, set the model you want to run inference against. In the following example we're using the Qwen3-4B model from HuggingFace.

setting MODEL
export MODEL="Qwen/Qwen3-4B"

Now run the vLLM container with ZenTorch enabled.

If you intend to use a model with access restrictions, set the HF_TOKEN environment variable to a valid HuggingFace token with access to that model. Models like Llama 3.2 require accepting their terms as well as authenticating with a read-only token.

docker run --net=host \
           --ipc=host \
           --shm-size=${SHM_SIZE}m \
           --privileged=true \
           --detach \
           --volume /var/lib/huggingface:/root/.cache/huggingface \
           --env HUGGING_FACE_HUB_TOKEN="${HF_TOKEN}" \
           --env VLLM_PLUGINS="zentorch" \
           --env VLLM_CPU_KVCACHE_SPACE=${KVCACHE_SPACE} \
           --env VLLM_CPU_OMP_THREADS_BIND=${CORES} \
           --env VLLM_CPU_NUM_OF_RESERVED_CPU=1 \
           --name vllm-server \
           --rm \
           vllm-cpu-zentorch:local --dtype=bfloat16 \
                                    --max-num-seqs=5 \
                                    --model=${MODEL}

Validating the server is running

You can check that the server output is healthy by following the logs from docker.

docker logs -f vllm-server

If the plugin is loading correctly, you should see output similar to the following in the logs.

INFO 12-18 01:09:29 [__init__.py:43] Available plugins for group vllm.platform_plugins:
INFO 12-18 01:09:29 [__init__.py:45] - zentorch -> zentorch.vllm:register
INFO 12-18 01:09:29 [__init__.py:57] Loading plugin zentorch
...
ZenDNN Info: Execution has entered the ZenDNN library. Optimized deep learning kernels are now active for high-performance inference on AMD CPUs.
...
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.
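Loading a model can take a while, so rather than watching the logs it can be handy to poll the server until it answers. The sketch below assumes the server listens on localhost port 8000 (the address the benchmark command in this post uses) and that the /health endpoint of vLLM's OpenAI-compatible server is reachable; wait_for_server is a hypothetical helper name.

```shell
# Poll a URL until it returns a successful HTTP status or the attempts run out.
# $1: URL to probe, $2: maximum attempts, $3: seconds to sleep between attempts.
wait_for_server() {
    local url="${1:-http://localhost:8000/health}" attempts="${2:-60}" delay="${3:-5}"
    local i
    for i in $(seq 1 "$attempts"); do
        # --fail makes curl exit non-zero on HTTP errors as well as connection failures.
        if curl --silent --fail --output /dev/null "$url"; then
            echo "server ready after ${i} attempt(s)"
            return 0
        fi
        sleep "$delay"
    done
    echo "server not ready after ${attempts} attempts" >&2
    return 1
}

# Usage: wait_for_server http://localhost:8000/health 60 5
```

The non-zero return on timeout lets the function gate a later step, such as starting a benchmark only once the server is up.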

Benchmark Results: What Can CPU Inference Actually Do?

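I ran vLLM's built-in benchmark suite across several model families with 128-token input/output sequences and 4 concurrent requests. Here's what the numbers look like. In the tables, TTFT is the time to first token; the last column is the median time per output token (TPOT), excluding the first token.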

Benchmark setup and command

# Install
apt install python3.12-venv
python3 -m venv ~/.venvs/vllm
~/.venvs/vllm/bin/pip install vllm ijson

# Run benchmark
HUGGING_FACE_HUB_TOKEN=${HF_TOKEN:-"None"} ~/.venvs/vllm/bin/python3 \
    -m vllm.entrypoints.cli.main bench serve --backend vllm \
                                             --base-url http://localhost:8000 \
                                             --model ${MODEL} \
                                             --tokenizer ${MODEL} \
                                             --random-input-len 128 \
                                             --random-output-len 128 \
                                             --num-prompts 20 \
                                             --max-concurrency 4 \
                                             --temperature 0.7

While this command is being run against the local vLLM server, you can change the --base-url parameter to point to any vLLM server instance, allowing you to benchmark remote deployments as well.

Sliding Window

Some models use sliding window attention, which the CPU backend may not fully support. If you see warnings in the server logs similar to the following:

(APIServer pid=1) WARNING 12-20 16:33:05 [_logger.py:72] sliding window (CPU backend) is not supported by the V1 Engine. Falling back to V0.

This indicates that the sliding window feature is not supported for that model on this backend. Consider disabling it by adding the --disable-sliding-window flag to the vLLM server arguments; it is an engine option rather than a benchmark option.

Qwen3 Family (Alibaba)

Model        Parameters   Output Tokens/sec   Median TTFT (ms)   Median TPOT (ms)
Qwen3-0.6B   0.6B         124.81              403.64             27.57
Qwen3-1.7B   1.7B         95.55               482.31             36.75
Qwen3-4B     4B           60.00               1024.18            57.91
Qwen3-8B     8B           39.75               1724.28            86.42

Llama 3.2 Family (Meta)

Model          Parameters   Output Tokens/sec   Median TTFT (ms)   Median TPOT (ms)
Llama-3.2-1B   1B           123.95              389.88             27.99
Llama-3.2-3B   3B           71.87               934.78             47.79

Gemma 3 Family (Google)

Model            Parameters   Output Tokens/sec   Median TTFT (ms)   Median TPOT (ms)
Gemma-3-1b-it    1B           97.35               374.32             37.05
Gemma-3-4b-it    4B           53.83               1178.90            64.73
Gemma-3-12b-it   12B          25.36               3508.26            129.60

Phi-4 Family (Microsoft)

Model                 Parameters   Output Tokens/sec   Median TTFT (ms)   Median TPOT (ms)
Phi-4-mini-instruct   4B           67.32               1519.74            46.95
Phi-4                 15B          23.83               5383.14            129.80
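One way to relate these throughput figures to the $0.79 hourly price is a rough cost per million output tokens. The sketch below assumes the instance sustains the benchmarked output rate for a full hour, which real traffic rarely does, so treat the result as a lower bound; cost_per_million_tokens is a hypothetical helper name.

```shell
# Estimate the cost in USD of generating one million output tokens.
# $1: hourly instance cost in USD, $2: sustained output tokens per second.
cost_per_million_tokens() {
    awk -v cost="$1" -v tps="$2" \
        'BEGIN { printf "%.2f\n", cost / (tps * 3600) * 1000000 }'
}

# Qwen3-4B at 60.00 output tokens/sec on the $0.79/hour flavor
cost_per_million_tokens 0.79 60.00   # prints 3.66
```

The same call with a different throughput value from the tables gives the figure for any other model.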

Full Benchmark Results

The full benchmark results, including detailed metrics, are reproduced below. The two files contain the complete output from the benchmark runs for the optimized (with ZenDNN) and unoptimized (without ZenDNN) configurations, which shows the performance improvement from the ZenDNN optimizations.
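To compare a metric between the two configurations without reading every block by eye, the benchmark output can be parsed in the shell. This is a sketch: extract_metric and speedup are hypothetical helper names, extract_metric assumes the "label: value" layout shown in the result blocks and reads a single model's block on stdin, and the sample values are the Qwen3-8B output throughput figures from the two result files.

```shell
# Print the value of one metric from a single benchmark result block on stdin.
# $1: exact metric label without the trailing colon, e.g. "Median TPOT (ms)".
extract_metric() {
    awk -F': +' -v label="$1" '$1 == label { print $2; exit }'
}

# Print the ratio of an optimized value to a baseline value, e.g. "1.50x".
speedup() {
    awk -v opt="$1" -v base="$2" 'BEGIN { printf "%.2fx\n", opt / base }'
}

optimized="$(extract_metric "Output token throughput (tok/s)" <<'EOF'
Output token throughput (tok/s):         39.75
Peak output token throughput (tok/s):    52.00
EOF
)"

# Qwen3-8B: 39.75 tok/s with ZenDNN versus 20.65 tok/s without
speedup "$optimized" 20.65   # prints 1.92x
```

Because the label match is exact, "Output token throughput (tok/s)" does not also pick up the "Peak output token throughput" line.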

ZenDNN optimized vLLM Benchmark Results
results.optimized.txt
Results Qwen/Qwen3-8B

============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             4
Benchmark duration (s):                  64.40
Total input tokens:                      2560
Total generated tokens:                  2560
Request throughput (req/s):              0.31
Output token throughput (tok/s):         39.75
Peak output token throughput (tok/s):    52.00
Peak concurrent requests:                8.00
Total token throughput (tok/s):          79.50
---------------Time to First Token----------------
Mean TTFT (ms):                          1616.70
Median TTFT (ms):                        1724.28
P99 TTFT (ms):                           2761.99
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          88.65
Median TPOT (ms):                        86.42
P99 TPOT (ms):                           102.80
---------------Inter-token Latency----------------
Mean ITL (ms):                           88.65
Median ITL (ms):                         87.08
P99 ITL (ms):                            101.28
==================================================

Results Qwen/Qwen3-4B

============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             4
Benchmark duration (s):                  42.67
Total input tokens:                      2560
Total generated tokens:                  2560
Request throughput (req/s):              0.47
Output token throughput (tok/s):         60.00
Peak output token throughput (tok/s):    76.00
Peak concurrent requests:                8.00
Total token throughput (tok/s):          120.00
---------------Time to First Token----------------
Mean TTFT (ms):                          1037.59
Median TTFT (ms):                        1024.18
P99 TTFT (ms):                           2069.61
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          59.00
Median TPOT (ms):                        57.91
P99 TPOT (ms):                           70.71
---------------Inter-token Latency----------------
Mean ITL (ms):                           59.00
Median ITL (ms):                         57.16
P99 ITL (ms):                            66.71
==================================================

Results Qwen/Qwen3-1.7B

============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             4
Benchmark duration (s):                  26.79
Total input tokens:                      2560
Total generated tokens:                  2560
Request throughput (req/s):              0.75
Output token throughput (tok/s):         95.55
Peak output token throughput (tok/s):    112.00
Peak concurrent requests:                8.00
Total token throughput (tok/s):          191.10
---------------Time to First Token----------------
Mean TTFT (ms):                          560.05
Median TTFT (ms):                        482.31
P99 TTFT (ms):                           1492.30
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          37.77
Median TPOT (ms):                        36.75
P99 TPOT (ms):                           45.51
---------------Inter-token Latency----------------
Mean ITL (ms):                           37.77
Median ITL (ms):                         36.54
P99 ITL (ms):                            45.62
==================================================

Results Qwen/Qwen3-0.6B

============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             4
Benchmark duration (s):                  20.51
Total input tokens:                      2560
Total generated tokens:                  2560
Request throughput (req/s):              0.98
Output token throughput (tok/s):         124.81
Peak output token throughput (tok/s):    148.00
Peak concurrent requests:                8.00
Total token throughput (tok/s):          249.62
---------------Time to First Token----------------
Mean TTFT (ms):                          495.14
Median TTFT (ms):                        403.64
P99 TTFT (ms):                           1452.09
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          28.39
Median TPOT (ms):                        27.57
P99 TPOT (ms):                           36.38
---------------Inter-token Latency----------------
Mean ITL (ms):                           28.39
Median ITL (ms):                         27.24
P99 ITL (ms):                            31.03
==================================================

Results meta-llama/Llama-3.2-1B

============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             4
Benchmark duration (s):                  20.65
Total input tokens:                      2560
Total generated tokens:                  2560
Request throughput (req/s):              0.97
Output token throughput (tok/s):         123.95
Peak output token throughput (tok/s):    149.00
Peak concurrent requests:                8.00
Total token throughput (tok/s):          247.91
---------------Time to First Token----------------
Mean TTFT (ms):                          480.86
Median TTFT (ms):                        389.88
P99 TTFT (ms):                           1313.63
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          28.72
Median TPOT (ms):                        27.99
P99 TPOT (ms):                           35.85
---------------Inter-token Latency----------------
Mean ITL (ms):                           28.72
Median ITL (ms):                         27.64
P99 ITL (ms):                            30.66
==================================================

Results meta-llama/Llama-3.2-3B

============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             4
Benchmark duration (s):                  35.62
Total input tokens:                      2560
Total generated tokens:                  2560
Request throughput (req/s):              0.56
Output token throughput (tok/s):         71.87
Peak output token throughput (tok/s):    92.00
Peak concurrent requests:                8.00
Total token throughput (tok/s):          143.74
---------------Time to First Token----------------
Mean TTFT (ms):                          979.22
Median TTFT (ms):                        934.78
P99 TTFT (ms):                           1795.37
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          48.35
Median TPOT (ms):                        47.79
P99 TPOT (ms):                           53.81
---------------Inter-token Latency----------------
Mean ITL (ms):                           48.35
Median ITL (ms):                         46.53
P99 ITL (ms):                            52.83
==================================================

Results google/gemma-3-1b-it

============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             4
Benchmark duration (s):                  26.30
Total input tokens:                      2560
Total generated tokens:                  2560
Request throughput (req/s):              0.76
Output token throughput (tok/s):         97.35
Peak output token throughput (tok/s):    112.00
Peak concurrent requests:                8.00
Total token throughput (tok/s):          194.70
---------------Time to First Token----------------
Mean TTFT (ms):                          483.12
Median TTFT (ms):                        374.32
P99 TTFT (ms):                           1394.82
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          37.59
Median TPOT (ms):                        37.05
P99 TPOT (ms):                           44.19
---------------Inter-token Latency----------------
Mean ITL (ms):                           37.59
Median ITL (ms):                         36.55
P99 ITL (ms):                            40.29
==================================================

Results google/gemma-3-4b-it

============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             4
Benchmark duration (s):                  47.56
Total input tokens:                      2560
Total generated tokens:                  2560
Request throughput (req/s):              0.42
Output token throughput (tok/s):         53.83
Peak output token throughput (tok/s):    68.00
Peak concurrent requests:                8.00
Total token throughput (tok/s):          107.65
---------------Time to First Token----------------
Mean TTFT (ms):                          1194.13
Median TTFT (ms):                        1178.90
P99 TTFT (ms):                           1899.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          65.44
Median TPOT (ms):                        64.73
P99 TPOT (ms):                           69.39
---------------Inter-token Latency----------------
Mean ITL (ms):                           65.44
Median ITL (ms):                         63.80
P99 ITL (ms):                            73.35
==================================================

Results google/gemma-3-12b-it

============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             4
Benchmark duration (s):                  100.96
Total input tokens:                      2560
Total generated tokens:                  2560
Request throughput (req/s):              0.20
Output token throughput (tok/s):         25.36
Peak output token throughput (tok/s):    32.00
Peak concurrent requests:                8.00
Total token throughput (tok/s):          50.71
---------------Time to First Token----------------
Mean TTFT (ms):                          3094.45
Median TTFT (ms):                        3508.26
P99 TTFT (ms):                           4408.72
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          134.57
Median TPOT (ms):                        129.60
P99 TPOT (ms):                           156.76
---------------Inter-token Latency----------------
Mean ITL (ms):                           134.57
Median ITL (ms):                         126.41
P99 ITL (ms):                            145.44
==================================================

Results microsoft/phi-4

============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             4
Benchmark duration (s):                  107.44
Total input tokens:                      2560
Total generated tokens:                  2560
Request throughput (req/s):              0.19
Output token throughput (tok/s):         23.83
Peak output token throughput (tok/s):    36.00
Peak concurrent requests:                8.00
Total token throughput (tok/s):          47.65
---------------Time to First Token----------------
Mean TTFT (ms):                          4263.18
Median TTFT (ms):                        5383.14
P99 TTFT (ms):                           6157.70
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          135.58
Median TPOT (ms):                        129.80
P99 TPOT (ms):                           167.44
---------------Inter-token Latency----------------
Mean ITL (ms):                           135.58
Median ITL (ms):                         124.98
P99 ITL (ms):                            150.06
==================================================

Results microsoft/Phi-4-mini-instruct

============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             4
Benchmark duration (s):                  38.03
Total input tokens:                      2560
Total generated tokens:                  2560
Request throughput (req/s):              0.53
Output token throughput (tok/s):         67.32
Peak output token throughput (tok/s):    92.00
Peak concurrent requests:                8.00
Total token throughput (tok/s):          134.64
---------------Time to First Token----------------
Mean TTFT (ms):                          1380.97
Median TTFT (ms):                        1519.74
P99 TTFT (ms):                           2179.02
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          48.99
Median TPOT (ms):                        46.95
P99 TPOT (ms):                           60.58
---------------Inter-token Latency----------------
Mean ITL (ms):                           48.99
Median ITL (ms):                         45.89
P99 ITL (ms):                            51.02
==================================================

ZenDNN un-optimized vLLM Benchmark Results
results.unoptimized.txt
Results Qwen/Qwen3-8B

============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             4
Benchmark duration (s):                  123.99
Total input tokens:                      2560
Total generated tokens:                  2560
Request throughput (req/s):              0.16
Output token throughput (tok/s):         20.65
Peak output token throughput (tok/s):    24.00
Peak concurrent requests:                8.00
Total Token throughput (tok/s):          41.29
---------------Time to First Token----------------
Mean TTFT (ms):                          2473.31
Median TTFT (ms):                        2156.38
P99 TTFT (ms):                           3685.90
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          175.59
Median TPOT (ms):                        176.40
P99 TPOT (ms):                           178.31
---------------Inter-token Latency----------------
Mean ITL (ms):                           175.59
Median ITL (ms):                         171.24
P99 ITL (ms):                            220.61
==================================================

Results Qwen/Qwen3-4B

============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             4
Benchmark duration (s):                  71.57
Total input tokens:                      2560
Total generated tokens:                  2560
Request throughput (req/s):              0.28
Output token throughput (tok/s):         35.77
Peak output token throughput (tok/s):    44.00
Peak concurrent requests:                8.00
Total Token throughput (tok/s):          71.54
---------------Time to First Token----------------
Mean TTFT (ms):                          1351.49
Median TTFT (ms):                        1366.27
P99 TTFT (ms):                           2523.72
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          102.02
Median TPOT (ms):                        99.59
P99 TPOT (ms):                           116.01
---------------Inter-token Latency----------------
Mean ITL (ms):                           102.02
Median ITL (ms):                         98.58
P99 ITL (ms):                            145.66
==================================================

Results Qwen/Qwen3-1.7B

============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             4
Benchmark duration (s):                  37.10
Total input tokens:                      2560
Total generated tokens:                  2560
Request throughput (req/s):              0.54
Output token throughput (tok/s):         69.00
Peak output token throughput (tok/s):    80.00
Peak concurrent requests:                8.00
Total Token throughput (tok/s):          137.99
---------------Time to First Token----------------
Mean TTFT (ms):                          630.01
Median TTFT (ms):                        542.26
P99 TTFT (ms):                           1645.25
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          53.45
Median TPOT (ms):                        52.55
P99 TPOT (ms):                           62.05
---------------Inter-token Latency----------------
Mean ITL (ms):                           53.45
Median ITL (ms):                         52.17
P99 ITL (ms):                            57.15
==================================================

Results Qwen/Qwen3-0.6B

============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             4
Benchmark duration (s):                  21.13
Total input tokens:                      2560
Total generated tokens:                  2560
Request throughput (req/s):              0.95
Output token throughput (tok/s):         121.17
Peak output token throughput (tok/s):    140.00
Peak concurrent requests:                8.00
Total Token throughput (tok/s):          242.33
---------------Time to First Token----------------
Mean TTFT (ms):                          436.76
Median TTFT (ms):                        246.71
P99 TTFT (ms):                           1347.68
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          29.80
Median TPOT (ms):                        29.74
P99 TPOT (ms):                           30.15
---------------Inter-token Latency----------------
Mean ITL (ms):                           29.80
Median ITL (ms):                         29.24
P99 ITL (ms):                            34.54
==================================================

Results meta-llama/Llama-3.2-1B

============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             4
Benchmark duration (s):                  27.27
Total input tokens:                      2540
Total generated tokens:                  2560
Request throughput (req/s):              0.73
Output token throughput (tok/s):         93.89
Peak output token throughput (tok/s):    108.00
Peak concurrent requests:                8.00
Total Token throughput (tok/s):          187.04
---------------Time to First Token----------------
Mean TTFT (ms):                          491.38
Median TTFT (ms):                        385.28
P99 TTFT (ms):                           1484.12
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          39.06
Median TPOT (ms):                        38.46
P99 TPOT (ms):                           47.34
---------------Inter-token Latency----------------
Mean ITL (ms):                           39.06
Median ITL (ms):                         37.85
P99 ITL (ms):                            48.89
==================================================

Results meta-llama/Llama-3.2-3B

============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             4
Benchmark duration (s):                  58.71
Total input tokens:                      2540
Total generated tokens:                  2560
Request throughput (req/s):              0.34
Output token throughput (tok/s):         43.61
Peak output token throughput (tok/s):    52.00
Peak concurrent requests:                8.00
Total Token throughput (tok/s):          86.87
---------------Time to First Token----------------
Mean TTFT (ms):                          1137.78
Median TTFT (ms):                        934.36
P99 TTFT (ms):                           2103.96
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          83.40
Median TPOT (ms):                        83.52
P99 TPOT (ms):                           84.41
---------------Inter-token Latency----------------
Mean ITL (ms):                           83.40
Median ITL (ms):                         81.68
P99 ITL (ms):                            102.00
==================================================

Results google/gemma-3-1b-it

============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             4
Benchmark duration (s):                  30.54
Total input tokens:                      2540
Total generated tokens:                  2560
Request throughput (req/s):              0.65
Output token throughput (tok/s):         83.81
Peak output token throughput (tok/s):    96.00
Peak concurrent requests:                8.00
Total Token throughput (tok/s):          166.97
---------------Time to First Token----------------
Mean TTFT (ms):                          452.64
Median TTFT (ms):                        336.75
P99 TTFT (ms):                           1422.21
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          44.52
Median TPOT (ms):                        43.66
P99 TPOT (ms):                           53.19
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.52
Median ITL (ms):                         43.00
P99 ITL (ms):                            55.78
==================================================

Results google/gemma-3-4b-it

============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             4
Benchmark duration (s):                  70.37
Total input tokens:                      2540
Total generated tokens:                  2560
Request throughput (req/s):              0.28
Output token throughput (tok/s):         36.38
Peak output token throughput (tok/s):    40.00
Peak concurrent requests:                8.00
Total Token throughput (tok/s):          72.47
---------------Time to First Token----------------
Mean TTFT (ms):                          1044.36
Median TTFT (ms):                        1050.06
P99 TTFT (ms):                           2162.43
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          102.52
Median TPOT (ms):                        102.40
P99 TPOT (ms):                           104.82
---------------Inter-token Latency----------------
Mean ITL (ms):                           102.52
Median ITL (ms):                         100.12
P99 ITL (ms):                            122.59
==================================================

Results google/gemma-3-12b-it

============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             4
Benchmark duration (s):                  183.76
Total input tokens:                      2540
Total generated tokens:                  2560
Request throughput (req/s):              0.11
Output token throughput (tok/s):         13.93
Peak output token throughput (tok/s):    20.00
Peak concurrent requests:                8.00
Total Token throughput (tok/s):          27.75
---------------Time to First Token----------------
Mean TTFT (ms):                          3657.92
Median TTFT (ms):                        3872.83
P99 TTFT (ms):                           5594.60
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          260.39
Median TPOT (ms):                        260.42
P99 TPOT (ms):                           268.68
---------------Inter-token Latency----------------
Mean ITL (ms):                           260.39
Median ITL (ms):                         251.44
P99 ITL (ms):                            278.57
==================================================

Results microsoft/phi-4

============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             4
Benchmark duration (s):                  187.20
Total input tokens:                      2560
Total generated tokens:                  2560
Request throughput (req/s):              0.11
Output token throughput (tok/s):         13.68
Peak output token throughput (tok/s):    16.00
Peak concurrent requests:                8.00
Total Token throughput (tok/s):          27.35
---------------Time to First Token----------------
Mean TTFT (ms):                          3416.20
Median TTFT (ms):                        3838.29
P99 TTFT (ms):                           4923.56
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          267.80
Median TPOT (ms):                        263.83
P99 TPOT (ms):                           287.36
---------------Inter-token Latency----------------
Mean ITL (ms):                           267.80
Median ITL (ms):                         260.23
P99 ITL (ms):                            344.91
==================================================

Results microsoft/Phi-4-mini-instruct

============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             4
Benchmark duration (s):                  59.19
Total input tokens:                      2560
Total generated tokens:                  2560
Request throughput (req/s):              0.34
Output token throughput (tok/s):         43.25
Peak output token throughput (tok/s):    52.00
Peak concurrent requests:                8.00
Total Token throughput (tok/s):          86.50
---------------Time to First Token----------------
Mean TTFT (ms):                          1016.35
Median TTFT (ms):                        999.32
P99 TTFT (ms):                           2139.12
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          85.18
Median TPOT (ms):                        83.85
P99 TPOT (ms):                           94.76
---------------Inter-token Latency----------------
Mean ITL (ms):                           85.18
Median ITL (ms):                         82.53
P99 ITL (ms):                            108.94
==================================================
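If you want to compare runs like the ones above without copying numbers by hand, the report format is regular enough to parse. The following is a minimal sketch, not part of the vLLM tooling: `parse_benchmark` is a hypothetical helper, and the sample text is an excerpt of the Llama-3.2-3B report. It also recomputes throughput from the token counts and duration as a sanity check.

```python
import re


def parse_benchmark(text: str) -> dict[str, float]:
    """Collect 'Label:   number' lines from a serving benchmark report."""
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        # Section rules and the banner line carry no 'label: number' pair and are skipped.
        match = re.match(r"^(.+?):\s+([0-9.]+)\s*$", line)
        if match:
            metrics[match.group(1).strip()] = float(match.group(2))
    return metrics


# Excerpt of the Llama-3.2-3B report shown above.
report = """\
============ Serving Benchmark Result ============
Successful requests:                     20
Benchmark duration (s):                  58.71
Total input tokens:                      2540
Total generated tokens:                  2560
Output token throughput (tok/s):         43.61
Total Token throughput (tok/s):          86.87
---------------Time to First Token----------------
Mean TTFT (ms):                          1137.78
"""

metrics = parse_benchmark(report)
duration = metrics["Benchmark duration (s)"]

# Output throughput counts generated tokens only; total throughput adds the prompt tokens.
derived_output = metrics["Total generated tokens"] / duration
derived_total = (metrics["Total input tokens"] + metrics["Total generated tokens"]) / duration

print(f"derived output tok/s: {derived_output:.2f}")
print(f"derived total tok/s:  {derived_total:.2f}")
```

The derived values should land within rounding distance of the reported ones, which is a quick way to confirm whether a quoted tokens-per-second figure refers to output tokens or to input plus output.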

Resource Utilization: What the System Actually Does

Beyond throughput numbers, understanding resource consumption patterns matters for capacity planning. Here's what the system looked like under load during these benchmarks.

Figure: New Relic dashboard showing CPU, memory, network, and load patterns during vLLM inference testing. It captures:

  • CPU load patterns (1-minute load spiking to 5-6 during inference)
  • Memory utilization bands (50-70% during active runs)
  • Network traffic spikes during HuggingFace model downloads (16 MB/s peak)
  • Process table data showing VLLM::EngineCore threads (50-2000% CPU, 106-151 threads)

CPU Behavior

The load average tells the real story. During active inference, the 1-minute load spiked to 5-6 on this 24-vCPU system: significant, but not saturated. The CPU usage percentage chart shows bursty patterns, idle between requests and then concentrated utilization during token generation.

The process table captures vLLM's multi-threaded architecture in action. Multiple VLLM::EngineCore processes consumed 50-2000% CPU (remember, 100% = one core, so 2000% means 20 cores active). Thread counts ranged from 106 to 151 per engine process, reflecting the parallelized inference pipeline.

Memory Patterns

Memory utilization climbed to 50-70% during model loading and sustained inference, consuming roughly 48-67 GB of the 96 GB available. This tracks with model size plus KV cache allocation (configured at 50 GB via VLLM_CPU_KVCACHE_SPACE).

Container-level metrics show memory consumption scaling with model complexity.

Model Size Class    Memory Consumption
Sub-1B models       ~27-57 GB
3-4B models         ~56-60 GB
8B+ models          ~69-74 GB

The larger memory footprint relative to model parameter count reflects vLLM's continuous batching and KV cache management overhead: memory is traded for throughput.
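To see why the KV cache dominates the footprint for small models, it helps to estimate how many tokens a given cache allocation can hold. The sketch below is a rough calculation under stated assumptions: the layer count, KV head count, and head dimension are assumed values for a Llama-3.2-3B-shaped model (check the model's own config before relying on them), the cache is assumed to be stored in BF16, and the 50 GB setting is treated as 50 GiB. Both function names are hypothetical.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache needed for one token of context."""
    # Each layer stores one key vector and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_token_capacity(cache_gib: float, bytes_per_token: int) -> int:
    """Number of whole tokens that fit in a cache of cache_gib GiB."""
    return int(cache_gib * 1024**3 // bytes_per_token)


# Assumed model shape (hypothetical inputs, not taken from the benchmark output).
per_token = kv_bytes_per_token(num_layers=28, num_kv_heads=8, head_dim=128)
capacity = kv_cache_token_capacity(cache_gib=50, bytes_per_token=per_token)

print(f"{per_token / 1024:.0f} KiB per token, about {capacity:,} tokens of cache")
```

Under these assumptions the cache allocation is far larger than a concurrency-4 benchmark needs, which suggests the container memory figures reflect the reserved cache size more than the model weights.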

Network and Storage I/O

Network traffic spiked dramatically during model downloads from Hugging Face Hub, reaching 16 MB/s receive rates. Once models were cached locally in /var/lib/huggingface, subsequent runs showed minimal network activity.

Disk I/O patterns were write-heavy during model caching (21 GB+ written across test runs) with modest read activity. The root disk sat at 17% utilization; model weights and container layers fit comfortably within the 240 GB allocation.

Container Resource Summary

Across all benchmark runs, the vLLM containers exhibited these aggregate characteristics.

Metric          Range             Notes
CPU %           44-873%           Multi-core utilization during inference
Memory          682 MB - 74 GB    Scales with model size
Thread Count    73-253            Parallel inference workers
Network Rx      46-97 GB          Model downloads from Hugging Face

The key insight: CPU inference is memory-bandwidth bound more than compute-bound. The EPYC 9454's 12-channel DDR5 memory architecture matters as much as its core count for this workload class.
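A back-of-the-envelope roofline makes the bandwidth argument concrete. During decode, each step has to stream the model weights from memory roughly once, so memory bandwidth divided by weight size gives a ceiling on decode steps per second. The sketch below assumes DDR5-4800 on all 12 channels and BF16 weights, and uses a hypothetical 3B-parameter model. A 24-vCPU VM sees only a share of the socket's bandwidth, and KV cache reads are ignored, so treat the result as an upper bound and not a prediction.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers_per_sec: int,
                       bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (64-bit channel = 8 bytes per transfer)."""
    # MT/s * bytes per transfer gives MB/s; divide by 1000 for GB/s.
    return channels * mega_transfers_per_sec * bus_bytes / 1000


def decode_steps_ceiling(bandwidth_gbs: float, params_billion: float,
                         bytes_per_param: int = 2) -> float:
    """Upper bound on decode steps per second if every step reads all weights once."""
    weight_gb = params_billion * bytes_per_param
    return bandwidth_gbs / weight_gb


# Assumed memory configuration: 12 channels of DDR5-4800 (full socket).
socket_bw = peak_bandwidth_gbs(channels=12, mega_transfers_per_sec=4800)

# Hypothetical 3B-parameter model stored in BF16 (2 bytes per parameter).
ceiling = decode_steps_ceiling(socket_bw, params_billion=3.0)

print(f"peak bandwidth: {socket_bw:.1f} GB/s")
print(f"decode ceiling: {ceiling:.1f} steps/s")
```

For comparison, the measured mean TPOT of 83.40 ms for Llama-3.2-3B corresponds to about 12 decode steps per second, comfortably under this ceiling, as expected for a VM that shares the socket with other tenants.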

Reading the Results

Let's be direct about what these numbers mean for practical use cases.

Sub-2B models are genuinely usable. The Qwen3-0.6B and 1.7B models deliver ~95-125 tokens per second with sub-second time-to-first-token. That's responsive enough for interactive applications such as chatbots, code completion, and document summarization. You're not waiting around.

4B models hit a sweet spot for quality vs. speed. At ~53-71 tokens per second, models like Phi-4-mini-instruct, Qwen3-4B, and Llama-3.2-3B provide meaningfully better outputs than their smaller siblings while remaining practical for batch processing and near-real-time applications. A ~1-1.5 second TTFT is noticeable but not painful.

8B+ models work but require patience. Qwen3-8B at ~40 tokens/sec, and Gemma-3-12b and Phi-4 at ~25 tokens/sec, are slower but absolutely functional for use cases where quality trumps latency: document analysis, async processing, and development and testing workflows.

The Economics: GPU-Free Doesn't Mean Value-Free

Here's where this gets interesting from an infrastructure planning perspective.

That gp.5.24.96 flavor runs at $0.79/hour, roughly $575/month for continuous operation. Compare that to GPU instance pricing where you're looking at $1-4/hour for entry-level accelerator access, assuming availability.
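The hourly rate becomes easier to reason about once it is expressed per token. Here is a small sketch of that arithmetic. It assumes a 730-hour average month and that the instance sustains the measured output throughput continuously, which real traffic rarely does; the function names are illustrative only.

```python
HOURS_PER_MONTH = 24 * 365 / 12  # average month, 730 hours


def monthly_cost(hourly_usd: float) -> float:
    """Cost of running one instance continuously for an average month."""
    return hourly_usd * HOURS_PER_MONTH


def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float) -> float:
    """USD per one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


# $0.79/hour flavor; 43.61 output tok/s is the Llama-3.2-3B figure reported above.
month = monthly_cost(0.79)
per_million = cost_per_million_tokens(0.79, 43.61)

print(f"monthly: ${month:.2f}")
print(f"per million output tokens: ${per_million:.2f}")
```

Swapping in the throughput of another model from the results above gives a quick cost comparison between model sizes on the same flavor.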

Instance Sizing

The gp.5.24.96 flavor was chosen for its balance of CPU cores and memory capacity. Depending on your workload, smaller or larger instances may be more appropriate. The key is ensuring sufficient memory for the model weights and KV cache while providing enough CPU cores to handle your expected concurrency. Any of the gp.5 family flavors will work; adjust your VLLM_CPU_KVCACHE_SPACE and VLLM_CPU_OMP_THREADS_BIND settings to match the flavor's memory and vCPU count.
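As a starting point for other flavors, the two settings can be derived from the vCPU and RAM figures. This is a heuristic sketch, not the configuration used for the benchmarks above: the `cpu_env` helper, the one reserved core, and the 8 GiB headroom are all assumptions to tune for your own workload.

```python
def cpu_env(vcpus: int, ram_gib: int, weights_gib: float,
            reserve_cores: int = 1, headroom_gib: float = 8.0) -> dict[str, str]:
    """Suggest vLLM CPU backend environment variables for an instance size."""
    if vcpus <= reserve_cores:
        raise ValueError("not enough vCPUs left for inference threads")

    # KV cache gets whatever remains after model weights and OS/runtime headroom.
    kv_cache_gib = int(ram_gib - weights_gib - headroom_gib)
    if kv_cache_gib < 1:
        raise ValueError("not enough RAM for weights plus KV cache")

    # Bind OpenMP threads to cores 0..N, leaving the last core(s) for the API server.
    last_core = vcpus - reserve_cores - 1
    return {
        "VLLM_CPU_KVCACHE_SPACE": str(kv_cache_gib),
        "VLLM_CPU_OMP_THREADS_BIND": f"0-{last_core}",
    }


# Hypothetical example: a 24 vCPU / 96 GB flavor serving a model with ~8 GiB of weights.
env = cpu_env(vcpus=24, ram_gib=96, weights_gib=8.0)
for name, value in env.items():
    print(f"{name}={value}")
```

A more conservative cache size, such as the 50 GB used in this post, leaves extra room for model loading spikes and other processes on the VM.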

For development teams iterating on prompts, testing model behavior, or running moderate inference loads, CPU-based instances provide a dramatically lower barrier to entry. You can spin up the infrastructure in minutes without joining a GPU allocation queue.

This isn't about replacing GPU infrastructure for training or high-throughput production inference. It's about recognizing that not every AI workload requires the same hardware profile, and that forcing GPU dependency on all AI workloads is both expensive and often unnecessary.

Practical Applications

Where does CPU inference with ZenDNN actually make sense?

Development and testing environments. Every AI application needs a place to iterate that doesn't burn through GPU budget. CPU inference lets teams test model behavior, refine prompts, and validate integrations without competing for accelerator resources.

Batch processing at moderate scale. Processing thousands of documents overnight? Analyzing logs for anomalies? Generating embeddings for search indexing? These workloads often care more about cost-per-token than tokens-per-second.

Edge and hybrid deployments. Not every deployment location has GPU infrastructure. Branch offices, on-premise installations, and resource-constrained environments can still run inference workloads.

Burst capacity. When your GPU fleet is fully loaded, CPU instances can absorb overflow traffic rather than dropping requests or queuing indefinitely.

Running This Yourself

The complete setup on Rackspace OpenStack Flex involves five steps:

  1. Launch an AMD EPYC instance (gp.5 flavor family)
  2. Install Docker and clone the vLLM repository
  3. Build the CPU-optimized image with ZenTorch
  4. Configure CPU binding and memory allocation
  5. Deploy and test

The vLLM server exposes an OpenAI-compatible API, so existing tooling and integrations work without modification:

curl http://localhost:8000/v1/models | jq

From there, your application code doesn't need to know whether inference is happening on a GPU or a CPU; the API contract remains identical.
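For application code, a standard-library client is enough to exercise the endpoint. The sketch below assumes the server is listening on localhost:8000 as in the curl example and uses the OpenAI-style /v1/completions route. The helper names are hypothetical, and the model name must match one returned by /v1/models.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the text of the first completion choice."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Building the request needs no running server; sending it does.
request = build_completion_request(
    "http://localhost:8000", "google/gemma-3-1b-it", "CPU inference is"
)
print(request.get_method(), request.full_url)

# With the vLLM container running, uncomment to send the request:
# print(complete(request))
```

The generous timeout is deliberate: with time-to-first-token in the seconds range for larger models, a default short client timeout can cut off otherwise healthy requests.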

The Bigger Picture

The AI infrastructure narrative has over-indexed on GPU scarcity and the assumption that meaningful work requires accelerators. That's true for training and high-throughput production inference, but it misses a substantial category of workloads where CPU-based solutions deliver genuine value.

AMD's investment in ZenDNN, combined with vLLM's architecture that supports pluggable backends, creates a practical path for organizations to deploy AI capabilities without GPU dependency. Running this on OpenStack Flex demonstrates that cloud infrastructure doesn't need to be hyperscaler-specific to support modern AI workloads.

The 24-vCPU EPYC VM running inference at 125 tokens per second for a ~1B model, or 60 tokens per second for a ~4B model, isn't a compromise. It's the right tool for a substantial portion of the AI workload landscape.

Sometimes the most expensive hardware isn't the most appropriate hardware. And sometimes, 24 flexible CPU cores on the Rackspace cloud is exactly what you need.