FinOps for AI Tokens: Why the Rules Changed — and What to Do About It

Written by Finout Writing Team | May 26, 2026 8:21:17 AM

Cloud cost management gave enterprises a decade to establish their practices. Token economics are moving considerably faster.

That was one of the clearest takeaways from a recent conversation between Finout's CEO Roi Ravhon and Google Cloud's FinOps lead Pathik Sharma at Google Cloud Next in Las Vegas — a conversation The New Stack published this month. The response confirmed what we're hearing consistently from practitioners: FinOps teams recognize that something has shifted, but most are still working with frameworks built for a different cost surface.

This post is the longer version of that conversation — a working guide for FinOps practitioners who need to govern AI spend today, not wait for best practices to calcify over the next three years.

The problem isn't what you think it is

The intuitive assumption is that AI costs are falling. It's true that token prices have dropped significantly over the past two years. You can get meaningful LLM capability today for a fraction of what it cost in 2023.

So why are enterprise AI bills going up?

Two reasons. First, the new generation of reasoning models, the ones that produce better answers, is "thinking 3x as much," in Roi's words, which means each task consumes more tokens even as the per-token price drops. When token consumption grows faster than prices fall, the net bill goes up. Second, and more insidiously, the volume of AI usage across an enterprise is exploding. When you put AI coding tools in the hands of 200 engineers, spin up customer-facing AI features, add internal AI agents, and layer in model-powered analytics, the number of requests compounds in ways that few FinOps teams anticipated in last year's budget.
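To make that arithmetic concrete, here is a minimal sketch. Every number in it is an illustrative assumption (the price drop, the request growth, and applying the 3x multiplier to every task), not a figure from the conversation.

```python
def monthly_bill(price_per_million_tokens: float, tokens_per_task: int, tasks_per_month: int) -> float:
    """Monthly spend in dollars for one workload."""
    return price_per_million_tokens * tokens_per_task * tasks_per_month / 1_000_000


# Hypothetical baseline: $10 per million tokens, 2,000 tokens per task, 1M tasks.
before = monthly_bill(10.0, 2_000, 1_000_000)

# Hypothetical follow-up year: the price halves, reasoning triples the tokens
# per task, and wider adoption quadruples the number of tasks.
after = monthly_bill(5.0, 6_000, 4_000_000)

print(f"before: ${before:,.0f}  after: ${after:,.0f}  change: {after / before:.1f}x")
```

Under these assumptions the per-token price is cut in half and the bill still grows sixfold, because tokens per task and task volume multiply together.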

The cost of the same prompt isn't even fixed. "You ask the same question twice, and you get different token usage for everything," as Roi noted in the New Stack conversation. That's not a bug — it's how reasoning models work. But it makes budgeting for AI spend genuinely hard in a way that cloud compute never was. A VM has predictable hourly costs. An agent completing a task does not.

CFOs noticed. A lot of them gave engineering teams effectively unlimited AI budgets in 2024 because being innovative was seen as table stakes. That period is over. The conversation has moved back to ROI, and FinOps teams are now in the middle of it.

What makes AI cost management structurally different from cloud FinOps

Cloud FinOps, at its core, is about ownership and allocation. Which team owns which resource? How do you share costs fairly across shared infrastructure? How do you set budgets and detect anomalies? These problems were hard enough to take a decade to institutionalize, but they were ultimately tractable: a VM has a tag, or it doesn't. A database belongs to a team, or it's orphaned.

Token costs introduce three new structural challenges that break the old model.

Non-determinism. As described above, the same workload doesn't have the same cost twice. This makes traditional anomaly detection — comparing this week's spend to last week's for the same workload — unreliable on its own. You need to track cost-per-output, not just total spend.

Model proliferation. A single product may call GPT-5 for complex reasoning, Claude for document summarization, Gemini Flash for classification, and an open-source model running on-premise for sensitive data. Each has different pricing, different latency characteristics, and different failure modes. The FinOps discipline hasn't historically had to care about which specific compute type answered a request — it just saw the bill. With AI, the model choice is the cost driver.

Attribution gaps. Most AI API calls don't carry team, product, or user metadata by default. If an engineering team makes 10 million calls to an LLM API this month, the invoice tells you a dollar amount. It doesn't tell you which features those calls powered, which team owns them, or whether the usage was intentional or runaway. This is the same tagging problem that plagued early cloud FinOps — except the surface area is larger and the usage patterns are less predictable.
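The non-determinism point can be sketched in a few lines: compare cost per output week over week instead of total spend. The `WeeklyUsage` record and the sample figures below are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeeklyUsage:
    spend_usd: float  # total token spend for the workload that week
    outputs: int      # business outputs produced, e.g. resolved tickets


def cost_per_output(week: WeeklyUsage) -> float:
    if week.outputs <= 0:
        raise ValueError("need at least one output to compute a unit cost")
    return week.spend_usd / week.outputs


def relative_change(previous: float, current: float) -> float:
    if previous == 0:
        raise ValueError("previous value must be non-zero")
    return (current - previous) / previous


# Hypothetical data: spend jumps 50%, but so does the number of outputs.
last_week = WeeklyUsage(spend_usd=4_000.0, outputs=2_000)
this_week = WeeklyUsage(spend_usd=6_000.0, outputs=3_000)

spend_change = relative_change(last_week.spend_usd, this_week.spend_usd)
unit_change = relative_change(cost_per_output(last_week), cost_per_output(this_week))
print(f"total spend: {spend_change:+.0%}, cost per output: {unit_change:+.0%}")
```

A spend-only alert would fire on the 50% jump. The unit cost is flat, which suggests growth in usage rather than a runaway workload.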

The model routing insight: don't reach for Thor's hammer

Google Cloud's Pathik Sharma put the key FinOps intervention plainly in our conversation: "Don't reach for Thor's hammer when you don't need it."

His example: a customer was defaulting to Gemini Pro for everything — summarizing emails, drafting responses, answering internal FAQs. But most of those tasks are handled perfectly well by Gemini Flash, which is significantly cheaper. The problem wasn't that the engineers were reckless. It was that no one had built a layer that routes each request to the cheapest model capable of reliably answering it.

Model routing is the highest-leverage FinOps intervention available for AI spend right now. The implementation varies — some teams build a routing layer in code, others use a managed orchestration service — but the principle is consistent: classify the request by complexity, sensitivity, and latency requirements, then match it to the appropriate model tier.
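As a rough illustration of that principle, a routing layer built in code might look like the sketch below. The tier names, the 1 to 5 complexity scale, and both thresholds are assumptions made for the example. A real router would also need a way to score complexity, which is out of scope here.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    ON_PREM = "on-prem open model"
    SMALL = "small hosted model"
    FRONTIER = "frontier model"


@dataclass(frozen=True)
class Request:
    complexity: int      # hypothetical scale: 1 (classification) to 5 (multi-step reasoning)
    sensitive: bool      # contains data that must stay on-premise
    max_latency_ms: int  # latency budget set by the caller


def route(req: Request) -> Tier:
    # Sensitivity is a hard constraint, so it is checked first.
    if req.sensitive:
        return Tier.ON_PREM
    # Assumed threshold: tight latency budgets rule out slower reasoning models.
    if req.max_latency_ms < 500:
        return Tier.SMALL
    # Assumed threshold: only complex tasks justify the expensive tier.
    if req.complexity >= 4:
        return Tier.FRONTIER
    return Tier.SMALL


email_summary = Request(complexity=2, sensitive=False, max_latency_ms=3_000)
contract_review = Request(complexity=5, sensitive=False, max_latency_ms=30_000)
payroll_question = Request(complexity=5, sensitive=True, max_latency_ms=30_000)

for req in (email_summary, contract_review, payroll_question):
    print(req, "->", route(req).value)
```

The ordering of the checks is itself a design choice: constraints that cannot be traded off (sensitivity, latency) come before the cost decision, so the cheapest capable tier is only chosen among tiers that are allowed at all.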

This isn't a FinOps team's job to build, but it is a FinOps team's job to insist on. If your engineering organization doesn't have a model routing layer, the FinOps case for it is straightforward: you're paying frontier-model prices for commodity-model workloads, and there's a quantifiable cost reduction available without degrading output quality.

The same logic applies to where inference runs. Sharma installed Gemma, Google's small open model, on his phone — under 4 GB, capable of summarization, OCR, and translation entirely on-device. Not every AI request needs to touch a cloud API. Edge inference, where it's appropriate, is one of the most aggressive cost reduction levers available in 2026.

Why AI agents alone cannot govern FinOps

There is a tempting shortcut gaining traction in FinOps circles: give an AI agent access to your cloud data and let it optimize spend autonomously. The appeal is understandable. The results, in practice, are not reliable — at least not with the architectures most organizations are currently deploying.

FinOps is "a partially deterministic problem," as Roi put it in the New Stack conversation. Right-sizing recommendations have hard thresholds. Anomaly detection has math behind it. Commitment discount optimization runs on your actual utilization data. These are not problems where you want an LLM interpolating from general knowledge about best practices. LLMs can "convince themselves that they're right when they want to be right," Roi warned. The confidence of an AI-generated recommendation says little about its accuracy when the model lacks the specific organizational context (your tagging coverage, your migration roadmap, your utilization patterns) that makes the recommendation actually valid.

The right architecture for agentic FinOps, in our framing, looks something like Lego bricks. The deterministic parts stay deterministic. Anomaly detection, right-sizing thresholds, and commitment analysis run on structured algorithms with known inputs. The agentic layer does what it's actually good at: enrichment, synthesis, and context analysis. An agent is excellent at summarizing why a cost spike happened across multiple systems, routing a recommendation to the right team owner, or drafting an explanation for a CFO. It should not be autonomously terminating compute.

And before anything destructive happens — before a server is shut down, a resource is deleted, a commitment is purchased — a deterministic check or a human approval step must come first, without exception.
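One way to picture that rule is a small gate that every agent-proposed action must pass before execution. The action names and the `ProposedAction` fields are hypothetical. The sketch only encodes the rule stated above: a destructive action needs a passed deterministic check or a named human approver.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical action names for the destructive cases listed above.
DESTRUCTIVE_ACTIONS = {"shutdown_server", "delete_resource", "purchase_commitment"}


@dataclass(frozen=True)
class ProposedAction:
    kind: str
    target: str
    passed_deterministic_check: bool = False
    approved_by: Optional[str] = None  # name of the human approver, if any


def may_execute(action: ProposedAction) -> bool:
    """Non-destructive actions pass; destructive ones need a check or an approval."""
    if action.kind not in DESTRUCTIVE_ACTIONS:
        return True
    return action.passed_deterministic_check or action.approved_by is not None


draft = ProposedAction(kind="draft_cfo_summary", target="q2-spike")
unapproved = ProposedAction(kind="delete_resource", target="vm-123")
approved = ProposedAction(kind="delete_resource", target="vm-123", approved_by="app-owner")

print(may_execute(draft), may_execute(unapproved), may_execute(approved))
```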

Sharma's framing from Google Cloud's side is useful here too: think about onboarding an AI agent the way you'd onboard a new SRE. You give that SRE standards, scoped permissions, and a playbook. You don't give them root access on day one and ask them to figure it out. For Kubernetes right-sizing, that means giving the agent the same signals a senior SRE would use — golden signals, requests versus limits, p99 vCPU utilization over 30 days — and having it produce a recommendation as a pull request that an application owner approves. The agent doesn't act; it recommends. "Now you instantly build that trust," as Sharma puts it.
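A minimal sketch of the recommend-only pattern follows, assuming the agent is handed 30 days of vCPU utilization samples. The nearest-rank percentile, the 1.2 headroom multiplier, and the `Recommendation` shape are illustrative choices, not anything Sharma specified. The function returns a recommendation that could be rendered into a pull request and changes nothing itself.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


def percentile(samples: List[float], p: float) -> float:
    """Nearest-rank percentile of the samples."""
    if not samples:
        raise ValueError("no utilization samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


@dataclass(frozen=True)
class Recommendation:
    workload: str
    current_request_vcpu: float
    proposed_request_vcpu: float
    rationale: str


def recommend_cpu_request(
    workload: str,
    samples_vcpu: List[float],
    current_request_vcpu: float,
    headroom: float = 1.2,  # assumed safety multiplier on top of p99
) -> Optional[Recommendation]:
    """Return a right-sizing recommendation, or None if no reduction is justified."""
    p99 = percentile(samples_vcpu, 99)
    proposed = round(p99 * headroom, 2)
    if proposed >= current_request_vcpu:
        return None
    rationale = f"p99 usage was {p99:.2f} vCPU against a request of {current_request_vcpu:.2f} vCPU"
    return Recommendation(workload, current_request_vcpu, proposed, rationale)


# Hypothetical samples: steady 0.5 vCPU with a single 1.0 vCPU spike.
samples = [0.5] * 99 + [1.0]
rec = recommend_cpu_request("checkout-api", samples, current_request_vcpu=2.0)
print(rec)
```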

What FinOps teams need to build for AI spend: a practical checklist

Based on where we're seeing mature FinOps programs succeed, here's what the foundational AI cost governance layer looks like in 2026.

1. Attribution at the call level. Every LLM API call needs to carry metadata that identifies the feature, team, and ideally the business process it serves. This requires work at the application layer — FinOps teams can't do it retroactively. The conversation between FinOps and engineering here is about instrumentation standards, not cost optimization. Get this wrong and everything downstream is guesswork.

2. Cost-per-output metrics, not just total spend. Raw token spend tells you how much you're spending. Cost-per-output tells you whether the spend is justified. For a customer support AI, this might be cost per resolved ticket. For a coding assistant, cost per accepted code suggestion. For a document summarizer, cost per summarized page. The unit economics of AI are the same discipline as the unit economics of cloud — you need to know what you're buying per business outcome, not just the aggregate bill.

3. Model tier visibility. Your FinOps platform needs to show spending broken down by model, not just by API provider. If your company is spending $200K/month on LLM APIs, the FinOps question isn't "how do we spend less on LLMs?" — it's "what percentage of that spend is going to frontier models, and what percentage of those frontier-model calls actually require frontier-model reasoning?" Without model-tier visibility, you can't answer that question.

4. Budget guardrails at the feature level, not the team level. Cloud budgets typically sit at the team or cost center level. AI budgets need to go a level deeper: to the individual product feature or AI agent. A single misbehaving agent can exhaust a team's entire AI budget in hours. Feature-level budgets with hard limits and alerting — before costs spike, not after — are the guardrail that prevents the class of incidents we documented in our recent post on AI cost disasters.

5. Shared AI infrastructure cost allocation. If your organization runs a shared LLM proxy, a shared vector store, or a shared embedding service, the cost of that shared infrastructure needs to be allocated fairly across the teams that use it. This is the same shared cost problem that made Kubernetes cost allocation hard — and the same principles apply. Usage-based allocation, not equal-split, is the defensible approach.
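The checklist can be sketched as a small in-memory ledger. Everything here is a simplified assumption for illustration: the `LLMCall` fields, the tier labels, the 80% alert threshold, and the sample figures. A production setup would read this data from billing exports or a proxy log instead.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LLMCall:
    team: str
    feature: str
    model_tier: str  # hypothetical labels such as "frontier" or "small"
    cost_usd: float

    def __post_init__(self) -> None:
        # Item 1: reject calls that arrive without attribution metadata.
        if not self.team or not self.feature:
            raise ValueError("LLM call is missing team or feature attribution")


def spend_by(calls: List[LLMCall], field: str) -> Dict[str, float]:
    """Items 3 and 4: total spend grouped by 'model_tier', 'feature', or 'team'."""
    totals: Dict[str, float] = defaultdict(float)
    for call in calls:
        totals[getattr(call, field)] += call.cost_usd
    return dict(totals)


def cost_per_output(spend_usd: float, outputs: int) -> float:
    """Item 2: unit cost, e.g. dollars per resolved ticket."""
    if outputs <= 0:
        raise ValueError("need at least one output to compute a unit cost")
    return spend_usd / outputs


def budget_status(spent_usd: float, limit_usd: float, alert_ratio: float = 0.8) -> str:
    """Item 4: 'blocked' at the hard limit, 'alert' past the assumed 80% threshold."""
    if spent_usd >= limit_usd:
        return "blocked"
    if spent_usd >= alert_ratio * limit_usd:
        return "alert"
    return "ok"


def allocate_shared_cost(shared_cost_usd: float, usage_by_team: Dict[str, float]) -> Dict[str, float]:
    """Item 5: split a shared bill in proportion to each team's usage."""
    total_usage = sum(usage_by_team.values())
    if total_usage <= 0:
        raise ValueError("no usage recorded for the shared service")
    return {team: shared_cost_usd * used / total_usage for team, used in usage_by_team.items()}


calls = [
    LLMCall("support", "ticket-summary", "small", 40.0),
    LLMCall("support", "ticket-summary", "frontier", 160.0),
    LLMCall("growth", "email-draft", "frontier", 50.0),
]
by_tier = spend_by(calls, "model_tier")
by_feature = spend_by(calls, "feature")
frontier_share = by_tier["frontier"] / sum(by_tier.values())
ticket_status = budget_status(by_feature["ticket-summary"], limit_usd=150.0)
proxy_split = allocate_shared_cost(1_000.0, {"support": 750, "growth": 250})
print(by_tier, f"{frontier_share:.0%}", ticket_status, proxy_split)
```

The point of keeping attribution on the call record itself is that the other four items become simple aggregations over the same data.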

How Finout handles AI cost management today

Finout's MegaBill already ingests AI spend — AWS Bedrock, Azure OpenAI, Google Vertex AI, Anthropic, and OpenAI direct — alongside cloud, Kubernetes, and SaaS spend in a single view. You can see your full technology bill in one place without stitching together five different reports.

For AI-specific cost allocation, Virtual Tags let teams define ownership logic that adapts as fast as their product does. If a feature ships, changes ownership, or gets deprecated, the allocation model updates without waiting on an engineering ticket or a tagging pipeline. This matters specifically for AI costs because AI features move faster than the tag hygiene that traditional FinOps depends on.

Unit economics — cost per API call, cost per completion, cost per model tier — are first-class metrics in Finout, not something you have to calculate in a spreadsheet after the fact. When a CFO asks "what is our AI spend per customer interaction?", that answer should come from your FinOps platform, not from an analyst spending two hours in SQL.

The broader point is that AI costs aren't a separate problem from cloud costs. They're the same problem at a new point of complexity. The discipline — ownership, allocation, unit economics, anomaly detection — is the same. What changes is the instrumentation, the non-determinism, and the speed at which the spend surface grows.

Start with the organization, not the tool

The instinct when costs get out of control is to reach for a new tool. That instinct is wrong, and both Roi Ravhon and Pathik Sharma said as much, unprompted, in the New Stack conversation.

"Sign up to the FinOps Foundation," Roi says, is still the right first advice. "FinOps is first and foremost an organizational problem that we're trying to solve. Just buying a FinOps tool is not going to solve the problem."

The culture change has to come first: cross-team accountability, engineering teams that treat cost as a first-class concern alongside latency and reliability, a relationship with AI spend that treats it as an investment with an expected return rather than an operating tax.

The model routing layer, the feature-level budgets, the attribution instrumentation — none of those things get built unless someone with sufficient authority has made AI cost governance an organizational priority rather than a secondary concern.

Tools come second. "Only when you understand that you need a tool to continue scaling, this is the time you need to talk to Finout or an equivalent tool," as Roi put it.

Sharma puts it this way: "No matter who you are, if you are working with cloud, you hold keys to the kingdom. With great power comes great responsibility." If everyone running AI infrastructure starts looking at it from a value perspective — what is this spend producing, and is it worth it — accountability, efficiency, and governance follow automatically.

The FinOps discipline grew into its current form precisely because someone, somewhere, decided that cloud spend deserved the same rigor as any other capital investment. AI spend deserves the same. Cloud had a decade to get there. AI doesn't have that long.

Key takeaways

What's different about AI cost management:

Token costs are non-deterministic; the same workload has variable cost
Reasoning model usage is growing faster than per-token prices are falling
Attribution gaps (missing metadata) are larger and more damaging than in cloud FinOps
Model proliferation means model choice is the primary cost lever

The highest-ROI interventions for FinOps teams right now:

Mandate attribution metadata on all LLM API calls
Build or require a model routing layer that matches task complexity to model tier
Set budget guardrails at the feature level, with automated alerting
Track cost-per-output unit economics, not just aggregate spend
Allocate shared AI infrastructure costs by usage, not by equal split

What agentic FinOps looks like when it works:

Deterministic core (anomaly detection, right-sizing thresholds, commitment math)
Agentic layer for enrichment, synthesis, and routing recommendations
Human approval required before any destructive action

Finout is a unified FinOps platform for enterprises managing cloud, AI, Kubernetes, and shared cost complexity. To see how Finout handles AI cost allocation, unit economics, and MegaBill visibility in practice, talk to our team.

This post expands on the conversation between Finout CEO Roi Ravhon and Google Cloud FinOps lead Pathik Sharma, originally published by The New Stack on May 12, 2026.
