Running AI in production isn’t like running another microservice. Generative AI models are heavy, resource-hungry, and in very high demand. When your users expect real-time responses, you can’t gamble on whether GPUs will be free at that moment. That’s why every major cloud provider now offers some form of provisioned capacity -- a way to reserve throughput so you get guaranteed performance and predictable spend.
Of course, as with everything cloud, each provider reinvented the wheel with its own terminology, quirks, and pricing structures. Let’s break down how AWS, Azure, Google Cloud, and Oracle approach provisioned AI capacity, what makes each unique, and what it means for FinOps teams trying to balance reliability and cost.
AWS: Amazon Bedrock

Service & Models
Amazon Bedrock is AWS's managed platform for foundation models, giving you API access to models from multiple providers as well as Amazon's own Titan family. Bedrock is also where AWS enforces its strictest rule: if you fine-tune a base model, you must purchase provisioned throughput to serve it in production. For base models, provisioned capacity is optional, but at scale it's practically a necessity.
Terminology
The reserved unit here is called a Model Unit (MU). Each MU buys you a defined throughput slice -- essentially tokens per minute for a given model. Bigger models consume more compute per token, so the throughput per MU shrinks as you move up the model ladder.
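To make MU sizing concrete, here's a minimal sketch. The per-MU throughput figures are hypothetical placeholders (AWS publishes the real numbers per model); only the arithmetic carries over.

```python
import math

# Hypothetical throughput per Model Unit, in tokens per minute.
# Real figures vary by model and are published by AWS per model ID.
TOKENS_PER_MINUTE_PER_MU = {
    "small-model": 10_000,  # placeholder
    "large-model": 2_000,   # placeholder: bigger models yield fewer tokens per MU
}

def model_units_needed(model: str, peak_tokens_per_minute: int) -> int:
    """Round up to the number of MUs that covers peak throughput."""
    per_mu = TOKENS_PER_MINUTE_PER_MU[model]
    return math.ceil(peak_tokens_per_minute / per_mu)

# Example: a workload peaking at 45,000 tokens/minute on the large model
print(model_units_needed("large-model", 45_000))  # -> 23
```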
Purchasing & Commitment
AWS gives some flexibility:

- No commitment: pay per MU-hour and stop whenever you like.
- 1-month or 6-month commitments at discounted hourly rates.
Longer commitments reduce the hourly MU price, just like EC2 Reserved Instances. But once you commit, you’re locked in until the term ends. You pay hourly while it’s active, whether you use it or not.
Scaling & Use
Provisioned throughput creates a dedicated model endpoint. Your application routes requests directly there. Need more capacity? Add more MUs, assuming AWS has GPU stock in that region. AWS warns explicitly that capacity is finite and should be reserved ahead of time.
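In practice it's a two-step affair: create the provisioned throughput, then route requests at the ARN it returns. A minimal boto3 sketch follows; the model ID, endpoint name, and MU count are illustrative. Omitting commitmentDuration requests no-commitment hourly capacity, and billing starts the moment the reservation is created.

```python
import json
import boto3

bedrock = boto3.client("bedrock")
runtime = boto3.client("bedrock-runtime")

# Reserve capacity. Model ID, name, and MU count are illustrative;
# omitting commitmentDuration requests no-commitment, hourly capacity.
response = bedrock.create_provisioned_model_throughput(
    provisionedModelName="prod-chat-endpoint",
    modelId="amazon.titan-text-express-v1",  # example base model
    modelUnits=2,
)
provisioned_arn = response["provisionedModelArn"]

# Requests now target the dedicated endpoint via its ARN.
result = runtime.invoke_model(
    modelId=provisioned_arn,
    body=json.dumps({"inputText": "Hello"}),
    contentType="application/json",
    accept="application/json",
)
print(json.loads(result["body"].read()))
```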
On-Demand Alternative
Bedrock’s default mode is pure pay-per-request. That’s fine for testing or low-volume workloads, but it comes with throttling risks and no guarantees.
Unique Points

- Provisioned throughput is mandatory for serving fine-tuned models, not just a performance upgrade.
- The discount structure mirrors EC2 Reserved Instances: longer terms, lower hourly MU rates.

Azure: Azure OpenAI Service

Service & Models
The Azure OpenAI Service provides enterprise access to models like GPT-3.5, GPT-4, Codex, and DALL-E. As demand spiked, Azure rolled out provisioned capacity to give enterprises guaranteed throughput and higher rate limits.
Terminology
Azure sells Provisioned Throughput Units (PTUs). These are model-agnostic quota units you can allocate to deployments. Buy 100 PTUs in a region, and you can carve them up however you want -- e.g., 60 PTUs for GPT-4, 40 for GPT-3.5 -- and rebalance later.
Purchasing & Commitment
Azure supports two models:

- Hourly: pay per PTU-hour with no long-term commitment.
- Azure Reservations: term commitments purchased through Azure's standard reservation system, at a significant discount.
The reservation approach is integrated into Azure’s broader reservation system, so enterprises used to reserving VMs or databases will feel right at home.
Performance
Throughput per PTU is model-dependent, and Azure makes one point clear: output tokens consume more capacity than input tokens. For some GPT-4 deployments, one output token counts as the equivalent of four input tokens. This matters when sizing deployments, and Microsoft provides capacity calculators to help.
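A rough sizing sketch under that assumption: weight output tokens 4x, then divide by per-PTU capacity. Both the weight and the per-PTU throughput below are placeholders; Microsoft's calculator has the real numbers per model.

```python
import math

OUTPUT_TOKEN_WEIGHT = 4          # assumption: 1 output token ~ 4 input tokens
TOKENS_PER_MIN_PER_PTU = 2_500   # placeholder; varies by model

def ptus_needed(input_tpm: int, output_tpm: int) -> int:
    """Estimate PTUs from peak input/output tokens per minute."""
    weighted = input_tpm + OUTPUT_TOKEN_WEIGHT * output_tpm
    return math.ceil(weighted / TOKENS_PER_MIN_PER_PTU)

# Example: 100k input and 20k output tokens per minute at peak
print(ptus_needed(100_000, 20_000))  # -> 72
```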
Scaling & Use
Provisioned deployments are fixed capacity. You can add PTUs, but Azure warns that instant scaling isn’t guaranteed -- unallocated PTUs don’t always mean instant extra capacity. The recommendation: provision for peak load rather than trying to “chase” usage in real time.
On-Demand Alternative
Without PTUs, you’re on shared, multi-tenant infrastructure with strict rate limits. That’s fine for dev or low-volume traffic, but it won’t cut it for production-grade usage.
Unique Points

- PTUs are a model-agnostic pool you can split and reassign across deployments within a region.
- Output-heavy workloads burn capacity faster than input-heavy ones, so sizing must account for token mix.

Google Cloud: Vertex AI

Service & Models
Google’s generative AI lives inside Vertex AI, covering models like PaLM and Gemini. To meet production needs, Google added provisioned throughput -- dedicated slices of infrastructure that guarantee consistent performance.
Terminology
Google keeps it simple: Provisioned Throughput. You reserve a defined QPS (queries per second) or tokens per second for a specific model, usually tied to a region.
Purchasing & Commitment
Google stands out for offering flexible subscription terms:

- Commitments run from as short as one week up to one year.
- Each subscription is a flat fee for the term, independent of actual usage.
Cancel early, and you still pay for the term. Need more? Buy an additional subscription or upgrade your plan.
Performance
With a provisioned subscription, your requests hit a dedicated pool with deterministic throughput and predictable costs. You pay the flat fee regardless of usage, which eliminates surprise overages.
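That flat fee turns the FinOps question into a break-even calculation: at what sustained volume does the subscription undercut pay-as-you-go? The prices below are made-up placeholders; substitute your contracted rates.

```python
# All prices are hypothetical placeholders, not Google's actual rates.
SUBSCRIPTION_FEE_PER_WEEK = 5_000.00  # flat fee for one provisioned unit
PAYG_PRICE_PER_1K_TOKENS = 0.002      # on-demand rate

def breakeven_tokens_per_week() -> float:
    """Weekly token volume above which the subscription is cheaper."""
    return SUBSCRIPTION_FEE_PER_WEEK / PAYG_PRICE_PER_1K_TOKENS * 1_000

print(f"{breakeven_tokens_per_week():,.0f} tokens/week")  # -> 2,500,000,000
```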
Scaling & Use
Scaling means layering more subscriptions. Google still allows dynamic quota (pay-as-you-go), so you can mix the two in the same project.
On-Demand Alternative
Pay-as-you-go is still available for testing and smaller workloads, but heavy production requires provisioned capacity for reliability.
Unique Points

- The one-week term is the shortest commitment among the four providers, handy for launches and seasonal spikes.
- Flat subscription billing makes spend fully predictable, with no overage surprises.

Oracle: OCI Generative AI

Service & Models
Oracle’s OCI Generative AI Service focuses on partner models like Cohere and Meta’s Llama. True to Oracle’s DNA, the pitch leans heavily on raw infrastructure performance and GPU scale.
Terminology
Capacity comes as Dedicated AI Clusters, defined by AI Units. Each AI Unit is essentially a fraction of GPU power assigned to a deployment. Oracle offers separate cluster types for inference and for fine-tuning.
Purchasing & Commitment
The minimum is 744 hours (31 days) per cluster. That’s one month, billed hourly per AI Unit. Spin it up, and you’re paying for at least a month, even if you stop early.
Larger contracts are typically negotiated directly with Oracle.
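The 744-hour floor translates directly into a minimum spend per cluster. A quick sketch, with a placeholder hourly rate per AI Unit:

```python
MIN_COMMIT_HOURS = 744            # OCI's 31-day minimum per cluster
HOURLY_RATE_PER_AI_UNIT = 10.50   # placeholder; see your OCI rate card

def minimum_cluster_spend(ai_units: int) -> float:
    """Lowest possible bill for a dedicated cluster, even if stopped early."""
    return MIN_COMMIT_HOURS * HOURLY_RATE_PER_AI_UNIT * ai_units

print(f"${minimum_cluster_spend(2):,.2f}")  # -> $15,624.00
```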
Performance
Dedicated clusters mean isolated infrastructure. Scaling happens by adding AI Units, and Oracle emphasizes “zero-downtime” scaling plus very large GPU availability (tens of thousands of H100s, if needed).
On-Demand Alternative
Shared pay-per-use is available, billed per 10,000 characters. Fine for light traffic, but limited in throughput and consistency.
Unique Points

- Dedicated clusters are true single-tenant infrastructure, with separate cluster types for inference and fine-tuning.
- The pitch is raw GPU scale: zero-downtime scaling on a fleet that can reach tens of thousands of H100s.

Side-by-Side Comparison
| Provider | Offering | Unit of Capacity | Commitment Options | Scaling | Standout |
| --- | --- | --- | --- | --- | --- |
| AWS | Bedrock Provisioned Throughput | Model Units (MUs) | Hourly (no commitment); 1- or 6-month terms | Add MUs | Required for custom models |
| Azure | Azure OpenAI PTUs | PTUs (flexible pool) | Hourly, or monthly/yearly reservations | Add PTUs | Reassign across models |
| Google Cloud | Vertex AI Provisioned Throughput | Subscription capacity | 1 week to 1 year | Add subscriptions | Weekly commitments & predictable billing |
| Oracle | OCI Dedicated AI Clusters | AI Units | 1-month minimum | Add AI Units | Heavy GPU infra focus |
Provisioned AI capacity is the new reserved instances. Without it, you’re gambling with availability and unpredictable costs. With it, you buy performance guarantees and financial predictability. The FinOps challenge is the same as always: commit wisely, monitor utilization, and blend on-demand with reserved capacity to maximize efficiency.
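One common blending pattern is a provisioned-first router that spills over to on-demand when the dedicated endpoint throttles. Sketched below against Bedrock's runtime API with placeholder identifiers; the same shape works on any of the four providers.

```python
import json
import boto3
from botocore.exceptions import ClientError

runtime = boto3.client("bedrock-runtime")

PROVISIONED_ARN = "arn:aws:bedrock:...:provisioned-model/example"  # placeholder
ON_DEMAND_MODEL_ID = "amazon.titan-text-express-v1"                # placeholder

def invoke_with_fallback(payload: dict) -> dict:
    """Prefer reserved capacity; spill to pay-per-request when saturated."""
    body = json.dumps(payload)
    try:
        resp = runtime.invoke_model(modelId=PROVISIONED_ARN, body=body)
    except ClientError as err:
        # Only fall back on throttling; surface every other failure.
        if err.response["Error"]["Code"] != "ThrottlingException":
            raise
        resp = runtime.invoke_model(modelId=ON_DEMAND_MODEL_ID, body=body)
    return json.loads(resp["body"].read())
```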
Generative AI is compute-hungry. Cloud providers know it, which is why they’ve all built a pay-to-play lane for serious workloads. If you’re running production AI, provisioned capacity isn’t optional -- it’s table stakes.