
Why Your "Senior ML Engineer" Can't Deploy a 70B Model

TL;DR: Small models (≤30B) and large models (100B+) require fundamentally different infrastructure skills. Small models are an inference optimization problem—make one GPU go fast. Large models are a distributed systems problem—coordinate a cluster, manage memory as the primary constraint, and plan for multi-minute failure recovery. The threshold is around 70B parameters. Most ML engineers are trained for the first problem, not the second.

Here's something companies learn only after burning through six figures in cloud credits: the skills needed for small models and the skills needed for large models are completely different. And most of your existing infra people can't do both.

Once you cross ~70B parameters, your job description flips. You're not doing inference optimization anymore. You're doing distributed resource management. Also known as: the nightmare.
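To see why memory becomes the primary constraint, run the napkin math: a 70B-parameter model at FP16 needs roughly 140 GB for weights alone, before you add a KV cache, and no single 80 GB GPU can hold that. Here's a minimal sketch; the layer counts, batch size, and sequence length are illustrative assumptions on my part, not benchmarks:

```python
# Napkin math: why a 70B model forces you into distributed territory.
# All constants below are illustrative assumptions, not measured values.

def weights_gb(params_b: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights alone (FP16/BF16 = 2 bytes per param)."""
    return params_b * bytes_per_param

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """KV cache: 2 (K and V) * layers * kv_heads * head_dim * seq * batch."""
    return (2 * layers * kv_heads * head_dim
            * seq_len * batch * bytes_per_elem) / 1e9

# Llama-70B-ish shape (grouped-query attention: 8 KV heads).
w = weights_gb(70)                                    # ~140 GB
kv = kv_cache_gb(layers=80, kv_heads=8, head_dim=128,
                 seq_len=8192, batch=16)              # ~43 GB
print(f"weights: {w:.0f} GB, KV cache: {kv:.0f} GB")  # far beyond one 80 GB GPU
```

Once the weights alone exceed a single device, every downstream decision changes: you're sharding across GPUs, which means coordinating a cluster, which means distributed resource management.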

The Business Case for CPU-Based AI Inference

Your finance team doesn't care about tokens per second. They care about predictable costs, compliance risk, and vendor lock-in. Here's how CPU inference stacks up.

The other week I published a technical deep-dive on running LLM inference with AMD EPYC processors and ZenDNN. The benchmarks showed that a $0.79/hour VM can push 40-125 tokens per second depending on model size: genuinely usable performance for a surprising range of workloads.
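To turn that throughput into the number your finance team actually reads, convert the hourly price into cost per million tokens. A quick sketch, assuming 100% utilization (my simplifying assumption; real duty cycles will raise the effective cost):

```python
# Cost per million output tokens at full utilization.
# $0.79/hr and 40-125 tok/s come from the benchmarks above;
# 100% utilization is a simplifying assumption.

HOURLY_RATE = 0.79  # USD per hour for the EPYC VM

def cost_per_million_tokens(tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return HOURLY_RATE / tokens_per_hour * 1_000_000

for tps in (40, 125):
    print(f"{tps:>3} tok/s -> ${cost_per_million_tokens(tps):.2f} per 1M tokens")
# 40 tok/s -> $5.49 per 1M tokens
# 125 tok/s -> $1.76 per 1M tokens
```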

But benchmarks don't answer the question that actually matters: Should you do this?

Agentic AI in Action: Deploying Secure, Task-Driven Agents in Rackspace Cloud

AI agents have emerged as a transformative tool, letting organizations build AI systems capable of secure, real-world interactions. By pairing a language model's natural-language abilities with function calling, an agentic system can interact with external systems, retrieve data, and perform complex tasks autonomously. Enterprises can harness this by designing agent workflows that work with business data while retaining control and privacy.

In this post, we explore building an agentic AI workflow with Meta's LLaMA 3.1, crafted specifically for interacting with private data in a database. We'll dive into the technical foundation of agentic systems, examine how function calls operate, and show how to deploy all of this securely within a private cloud.
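For a concrete feel of the loop, here is a minimal sketch: the model emits a structured tool call instead of prose, your code runs it against the private database, and the result goes back to the model for a final answer. The endpoint URL, model name, and `lookup_order` function are hypothetical; this assumes a self-hosted Llama 3.1 behind an OpenAI-compatible API, as servers like vLLM or Ollama provide.

```python
# Minimal function-calling loop against a self-hosted Llama 3.1.
# Assumptions: an OpenAI-compatible endpoint at base_url, and a
# hypothetical lookup_order() standing in for your private database.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
MODEL = "meta-llama/Llama-3.1-70B-Instruct"  # whatever name your server registers

TOOLS = [{
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": "Fetch an order record from the private database.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

def lookup_order(order_id: str) -> dict:
    # Stand-in for a parameterized query against your own database.
    return {"order_id": order_id, "status": "shipped"}

messages = [{"role": "user", "content": "Where is order 42?"}]
resp = client.chat.completions.create(model=MODEL, messages=messages, tools=TOOLS)
msg = resp.choices[0].message

if msg.tool_calls:  # the model chose to call our function
    call = msg.tool_calls[0]
    result = lookup_order(**json.loads(call.function.arguments))
    messages += [msg, {"role": "tool", "tool_call_id": call.id,
                       "content": json.dumps(result)}]
    final = client.chat.completions.create(model=MODEL, messages=messages, tools=TOOLS)
    print(final.choices[0].message.content)  # natural-language answer grounded in DB data
```

Note that the data never leaves your infrastructure: the model only sees what your function chooses to return, which is what makes this pattern workable for private data.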

Running Llama on Rackspace Cloud

In one of my favourite movie series, The Avengers, Tony Stark (Iron Man) creates an artificial intelligence (AI) named JARVIS, which makes much of his other work possible. The portrayal sparks curiosity: are such sophisticated AIs possible in real life? Until a few years ago, capabilities like JARVIS were confined to the realm of science fiction. However, advancements in AI have since bridged the gap between fantasy and reality, making powerful, customizable AI models accessible to enthusiasts and professionals alike.