Running AI in production isn’t like running another microservice. Generative AI models are heavy, resource-hungry, and in very high demand. When your users expect real-time responses, you can’t gamble on whether GPUs will be free at that moment. That’s why every major cloud provider now offers some form of provisioned capacity -- a way to reserve throughput so you get guaranteed performance and predictable spend.
Of course, as with everything cloud, each provider reinvented the wheel with its own terminology, quirks, and pricing structures. Let’s break down how AWS, Azure, Google Cloud, and Oracle approach provisioned AI capacity, what makes each unique, and what it means for FinOps teams trying to balance reliability and cost.
AWS: Amazon Bedrock

Service & Models
Amazon Bedrock is AWS's managed platform for foundation models, giving you API access to models from multiple providers as well as Amazon's own Titan family. Bedrock is also where AWS enforces its strictest rule: if you fine-tune a base model, you must purchase provisioned throughput to serve it in production. For base models, provisioned capacity is optional, but at scale it's practically a necessity.
Terminology
The reserved unit here is called a Model Unit (MU). Each MU buys you a defined throughput slice -- essentially tokens per minute for a given model. Bigger models consume more compute per token, so the throughput per MU shrinks as you move up the model ladder.
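To make MU sizing concrete, here's a minimal sketch. The per-MU throughput figures are hypothetical placeholders (AWS publishes the real numbers per model); only the arithmetic carries over.

```python
import math

# Hypothetical throughput per Model Unit, in tokens per minute.
# Real figures vary by model and are published by AWS per model ID.
TOKENS_PER_MINUTE_PER_MU = {
    "small-model": 10_000,  # placeholder
    "large-model": 2_000,   # placeholder: bigger models yield fewer tokens per MU
}

def model_units_needed(model: str, peak_tokens_per_minute: int) -> int:
    """Round up to the number of MUs that covers peak throughput."""
    per_mu = TOKENS_PER_MINUTE_PER_MU[model]
    return math.ceil(peak_tokens_per_minute / per_mu)

# Example: a workload peaking at 45,000 tokens/minute on the large model
print(model_units_needed("large-model", 45_000))  # -> 23
```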
Purchasing & Commitment
AWS gives some flexibility:

- No commitment: pay per MU-hour and stop whenever you like.
- 1-month or 6-month commitments at discounted hourly rates.
Longer commitments reduce the hourly MU price, just like EC2 Reserved Instances. But once you commit, you’re locked in until the term ends. You pay hourly while it’s active, whether you use it or not.
Scaling & Use
Provisioned throughput creates a dedicated model endpoint. Your application routes requests directly there. Need more capacity? Add more MUs, assuming AWS has GPU stock in that region. AWS warns explicitly that capacity is finite and should be reserved ahead of time.
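In practice it's a two-step affair: create the provisioned throughput, then route requests at the ARN it returns. A minimal boto3 sketch follows; the model ID, endpoint name, and MU count are illustrative. Omitting commitmentDuration requests no-commitment hourly capacity, and billing starts the moment the reservation is created.

```python
import json
import boto3

bedrock = boto3.client("bedrock")
runtime = boto3.client("bedrock-runtime")

# Reserve capacity. Model ID, name, and MU count are illustrative;
# omitting commitmentDuration requests no-commitment, hourly capacity.
response = bedrock.create_provisioned_model_throughput(
    provisionedModelName="prod-chat-endpoint",
    modelId="amazon.titan-text-express-v1",  # example base model
    modelUnits=2,
)
provisioned_arn = response["provisionedModelArn"]

# Requests now target the dedicated endpoint via its ARN.
result = runtime.invoke_model(
    modelId=provisioned_arn,
    body=json.dumps({"inputText": "Hello"}),
    contentType="application/json",
    accept="application/json",
)
print(json.loads(result["body"].read()))
```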
On-Demand Alternative
Bedrock’s default mode is pure pay-per-request. That’s fine for testing or low-volume workloads, but it comes with throttling risks and no guarantees.
Unique Points

- Provisioned throughput is mandatory for serving fine-tuned models, not just a performance upgrade.
- The discount structure mirrors EC2 Reserved Instances: longer terms, lower hourly MU rates.

Azure: Azure OpenAI Service

Service & Models
The Azure OpenAI Service provides enterprise access to models like GPT-3.5, GPT-4, Codex, and DALL-E. As demand spiked, Azure rolled out provisioned capacity to give enterprises guaranteed throughput and higher rate limits.
Terminology
Azure sells Provisioned Throughput Units (PTUs). These are model-agnostic quota units you can allocate to deployments. Buy 100 PTUs in a region, and you can carve them up however you want -- e.g., 60 PTUs for GPT-4, 40 for GPT-3.5 -- and rebalance later.
Purchasing & Commitment
Azure supports two models:

- Hourly: pay per PTU-hour with no long-term commitment.
- Azure Reservations: term commitments purchased through Azure's standard reservation system, at a significant discount.
The reservation approach is integrated into Azure’s broader reservation system, so enterprises used to reserving VMs or databases will feel right at home.
Performance
Throughput per PTU is model-dependent, and Azure makes one point clear: output tokens consume more capacity than input tokens. For some GPT-4 deployments, one output token counts as the equivalent of four input tokens. This matters when sizing deployments, and Microsoft provides capacity calculators to help.
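A rough sizing sketch under that assumption: weight output tokens 4x, then divide by per-PTU capacity. Both the weight and the per-PTU throughput below are placeholders; Microsoft's calculator has the real numbers per model.

```python
import math

OUTPUT_TOKEN_WEIGHT = 4          # assumption: 1 output token ~ 4 input tokens
TOKENS_PER_MIN_PER_PTU = 2_500   # placeholder; varies by model

def ptus_needed(input_tpm: int, output_tpm: int) -> int:
    """Estimate PTUs from peak input/output tokens per minute."""
    weighted = input_tpm + OUTPUT_TOKEN_WEIGHT * output_tpm
    return math.ceil(weighted / TOKENS_PER_MIN_PER_PTU)

# Example: 100k input and 20k output tokens per minute at peak
print(ptus_needed(100_000, 20_000))  # -> 72
```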
Scaling & Use
Provisioned deployments are fixed capacity. You can add PTUs, but Azure warns that instant scaling isn’t guaranteed -- unallocated PTUs don’t always mean instant extra capacity. The recommendation: provision for peak load rather than trying to “chase” usage in real time.
On-Demand Alternative
Without PTUs, you’re on shared, multi-tenant infrastructure with strict rate limits. That’s fine for dev or low-volume traffic, but it won’t cut it for production-grade usage.
Unique Points

- PTUs are a model-agnostic pool you can split and reassign across deployments within a region.
- Output-heavy workloads burn capacity faster than input-heavy ones, so sizing must account for token mix.

Google Cloud: Vertex AI

Service & Models
Google’s generative AI lives inside Vertex AI, covering models like PaLM and Gemini. To meet production needs, Google added provisioned throughput -- dedicated slices of infrastructure that guarantee consistent performance.
Terminology
Google keeps it simple: Provisioned Throughput. You reserve a defined QPS (queries per second) or tokens per second for a specific model, usually tied to a region.
Purchasing & Commitment
Google stands out for offering flexible subscription terms:

- Commitments run from as short as one week up to one year.
- Each subscription is a flat fee for the term, independent of actual usage.
Cancel early, and you still pay for the term. Need more? Buy an additional subscription or upgrade your plan.
Performance
With a provisioned subscription, your requests hit a dedicated pool with deterministic throughput and predictable costs. You pay the flat fee regardless of usage, which eliminates surprise overages.
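That flat fee turns the FinOps question into a break-even calculation: at what sustained volume does the subscription undercut pay-as-you-go? The prices below are made-up placeholders; substitute your contracted rates.

```python
# All prices are hypothetical placeholders, not Google's actual rates.
SUBSCRIPTION_FEE_PER_WEEK = 5_000.00  # flat fee for one provisioned unit
PAYG_PRICE_PER_1K_TOKENS = 0.002      # on-demand rate

def breakeven_tokens_per_week() -> float:
    """Weekly token volume above which the subscription is cheaper."""
    return SUBSCRIPTION_FEE_PER_WEEK / PAYG_PRICE_PER_1K_TOKENS * 1_000

print(f"{breakeven_tokens_per_week():,.0f} tokens/week")  # -> 2,500,000,000
```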
Scaling & Use
Scaling means layering more subscriptions. Google still allows dynamic quota (pay-as-you-go), so you can mix the two in the same project.
On-Demand Alternative
Pay-as-you-go is still available for testing and smaller workloads, but heavy production requires provisioned capacity for reliability.
Unique Points

- The one-week term is the shortest commitment among the four providers, handy for launches and seasonal spikes.
- Flat subscription billing makes spend fully predictable, with no overage surprises.

Oracle: OCI Generative AI

Service & Models
Oracle’s OCI Generative AI Service focuses on partner models like Cohere and Meta’s Llama. True to Oracle’s DNA, the pitch leans heavily on raw infrastructure performance and GPU scale.
Terminology
Capacity comes as Dedicated AI Clusters, defined by AI Units. Each AI Unit is essentially a fraction of GPU power assigned to a deployment. Oracle offers separate cluster types for inference and for fine-tuning.
Purchasing & Commitment
The minimum is 744 hours (31 days) per cluster. That’s one month, billed hourly per AI Unit. Spin it up, and you’re paying for at least a month, even if you stop early.
Larger contracts are typically negotiated directly with Oracle.
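The 744-hour floor translates directly into a minimum spend per cluster. A quick sketch, with a placeholder hourly rate per AI Unit:

```python
MIN_COMMIT_HOURS = 744            # OCI's 31-day minimum per cluster
HOURLY_RATE_PER_AI_UNIT = 10.50   # placeholder; see your OCI rate card

def minimum_cluster_spend(ai_units: int) -> float:
    """Lowest possible bill for a dedicated cluster, even if stopped early."""
    return MIN_COMMIT_HOURS * HOURLY_RATE_PER_AI_UNIT * ai_units

print(f"${minimum_cluster_spend(2):,.2f}")  # -> $15,624.00
```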
Performance
Dedicated clusters mean isolated infrastructure. Scaling happens by adding AI Units, and Oracle emphasizes “zero-downtime” scaling plus very large GPU availability (tens of thousands of H100s, if needed).
On-Demand Alternative
Shared pay-per-use is available, billed per 10,000 characters. Fine for light traffic, but limited in throughput and consistency.
Unique Points

- Dedicated clusters are true single-tenant infrastructure, with separate cluster types for inference and fine-tuning.
- The pitch is raw GPU scale: zero-downtime scaling on a fleet that can reach tens of thousands of H100s.

Side-by-Side Comparison
| Provider | Offering | Unit of Capacity | Commitment Options | Scaling | Standout |
| --- | --- | --- | --- | --- | --- |
| AWS | Bedrock Provisioned Throughput | Model Units (MUs) | Hourly (no commitment); 1- or 6-month terms | Add MUs | Required for custom models |
| Azure | Azure OpenAI PTUs | PTUs (flexible pool) | Hourly, or monthly/yearly reservations | Add PTUs | Reassign across models |
| Google Cloud | Vertex AI Provisioned Throughput | Subscription capacity | 1 week to 1 year | Add subscriptions | Weekly commitments & predictable billing |
| Oracle | OCI Dedicated AI Clusters | AI Units | 1-month minimum | Add AI Units | Heavy GPU infra focus |
Provisioned AI capacity is the new reserved instances. Without it, you’re gambling with availability and unpredictable costs. With it, you buy performance guarantees and financial predictability. The FinOps challenge is the same as always: commit wisely, monitor utilization, and blend on-demand with reserved capacity to maximize efficiency.
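One common blending pattern is a provisioned-first router that spills over to on-demand when the dedicated endpoint throttles. Sketched below against Bedrock's runtime API with placeholder identifiers; the same shape works on any of the four providers.

```python
import json
import boto3
from botocore.exceptions import ClientError

runtime = boto3.client("bedrock-runtime")

PROVISIONED_ARN = "arn:aws:bedrock:...:provisioned-model/example"  # placeholder
ON_DEMAND_MODEL_ID = "amazon.titan-text-express-v1"                # placeholder

def invoke_with_fallback(payload: dict) -> dict:
    """Prefer reserved capacity; spill to pay-per-request when saturated."""
    body = json.dumps(payload)
    try:
        resp = runtime.invoke_model(modelId=PROVISIONED_ARN, body=body)
    except ClientError as err:
        # Only fall back on throttling; surface every other failure.
        if err.response["Error"]["Code"] != "ThrottlingException":
            raise
        resp = runtime.invoke_model(modelId=ON_DEMAND_MODEL_ID, body=body)
    return json.loads(resp["body"].read())
```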
Generative AI is compute-hungry. Cloud providers know it, which is why they’ve all built a pay-to-play lane for serious workloads. If you’re running production AI, provisioned capacity isn’t optional -- it’s table stakes.