AI Inference Economy: Who Profits from AI at Scale

Everyone keeps talking about how much it costs to train GPT-5 or Claude 4. The number that gets far less attention is what happens after training ends. For every $1 spent training a model, companies spend $15-20 running inference on it over its production lifetime. That ratio is why the inference economy is where the actual money is.

The inference flip

Until recently, most AI spending went to training. Roughly 80% training, 20% inference. Those numbers have reversed. Deloitte confirmed that inference surpassed training in total data center revenue in late 2025. By early 2026, inference crossed 55% of AI cloud infrastructure spending, hitting $37.5 billion.

The global AI inference market sits at roughly $106 billion in 2025 and is projected to reach $255 billion by 2030. That 19.2% CAGR makes inference the fastest-growing segment of the entire AI stack.
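As a quick sanity check on that growth rate, here is a minimal sketch of the compound annual growth rate formula applied to the two market-size figures above. It assumes a five-year window (2025 to 2030); the function name is just illustrative.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values over a number of years."""
    return (end_value / start_value) ** (1 / years) - 1


# Market-size figures quoted above, in billions of USD.
rate = cagr(start_value=106, end_value=255, years=5)
print(f"Implied CAGR: {rate:.1%}")
```

The result lands within rounding distance of the 19.2% quoted above, so the two market figures and the growth rate are consistent with each other.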

One tweet from @vijayshekhar put the stakes plainly:

“OpenAI lost roughly $5B on $3.7B in revenue. The bottleneck isn’t model quality. It’s the cost of serving every single token to every single user. Inference is bleeding these companies dry.” — @vijayshekhar, March 2026

Every AI query costs between $0.0016 and $0.10 in GPU compute. At a billion queries per day, that translates to $1.6 million to $100 million in daily infrastructure spend. The numbers get uncomfortable fast. And they explain why the AI inference stack has turned into its own economy, with distinct winners and losers at every layer.
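The arithmetic behind that range is simple multiplication. The sketch below only reproduces it, using the per-query costs and the billion-queries-per-day volume quoted above; it is not a cost model for any particular provider.

```python
def daily_inference_spend(cost_per_query_usd: float, queries_per_day: int) -> float:
    """Daily GPU spend in USD for a given per-query cost and daily query volume."""
    return cost_per_query_usd * queries_per_day


QUERIES_PER_DAY = 1_000_000_000  # one billion queries per day, as in the text

low = daily_inference_spend(0.0016, QUERIES_PER_DAY)
high = daily_inference_spend(0.10, QUERIES_PER_DAY)
print(f"${low:,.0f} to ${high:,.0f} per day")
```

Swap in your own per-query cost and traffic to see where your product sits on that range.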

So who actually profits? Let me walk through the stack from bottom to top.

The picks-and-shovels layer: Nvidia and TSMC

Image 1: Nvidia Vera Rubin platform blog post

Nvidia is the clearest winner in the inference economy. Their Q3 FY2026 numbers tell the story: $57 billion in total revenue, $51.2 billion from data centers alone, up 66% year-over-year. Gross margins sit at 75%. Net profit margins at 54.3%. Those are software-company margins on hardware sales.

Jensen Huang has said inference already accounts for more than 40% of data center revenue and is “about to go up by a billion times.” That 40% number is climbing because every deployed AI model generates ongoing inference demand that never stops. Chatbots, code assistants, recommendation engines, search features. All of them run inference 24/7. Training is a one-time cost. Inference is forever.

Nvidia’s position gets even stronger when you look at their moves in the inference optimization space. They invested in Nscale’s $2B Series C at a $14.6 billion valuation. They struck a licensing deal with Groq worth $20 billion for inference-specific technology. And the upcoming Vera Rubin platform promises 10x lower inference cost per token compared to Blackwell. Every efficiency improvement drives more adoption, which drives more GPU sales. Classic virtuous cycle.

Then there’s TSMC, the company that fabricates every significant AI chip. Nvidia’s GPUs, Apple’s M-series, AMD’s Instinct accelerators, Groq’s LPUs, Cerebras’ wafer-scale chips. All TSMC. Their 2025 revenue hit $122.9 billion, up 31.6% year-over-year, with gross margins at 59.9%. Q1 2026 guidance pushed that to 63-65%. HPC (mostly AI and data center) now accounts for 57% of their revenue. CoWoS advanced packaging capacity is sold out through 2026.

Byrne Hobart captured the absurdity of the semiconductor supply chain on Twitter: “ASML can’t figure out how to make money from EUV machines, so they sell them to TSMC. But TSMC can’t figure out how to make money from chips, so they sell them to Apple.” The joke works because TSMC’s margins are actually excellent and they have zero demand risk. Everyone needs their fabs.

The picks-and-shovels layer is the only part of the inference economy generating serious, durable profits. That should tell you something about the layers above it.

The neocloud gamble

Image 2: CoreWeave integration with Nvidia Vera Rubin platform

Right, I should back up and explain what “neoclouds” are. These are the GPU-focused cloud providers that popped up to serve AI workloads specifically: CoreWeave, Lambda Labs, Nscale, RunPod. They buy Nvidia GPUs in bulk, rack them in data centers, and rent them out. Think of them as the landlords of AI compute.

CoreWeave is the poster child, and the cautionary tale. Revenue went from $16 million in 2022 to $1.9 billion in 2024, including 737% year-over-year growth in 2024 alone. That growth is real. The IPO priced at $40/share in March 2025, below the expected $47-55 range. Since then the stock has bounced between $33 and $187. As of March 2026, market cap sits around $42 billion. The problem is everything else in the S-1 filing.

Microsoft accounts for 62% of CoreWeave’s revenue. Two customers account for 77%. The company carries $7.9 billion in debt against $1.4 billion in cash. Net losses hit $863 million in 2024, which is 45% of revenue. CoreWeave filed for its IPO at a $35 billion target valuation, and the market was skeptical. One r/CRWV commenter noted that CoreWeave’s debt is “customer-guaranteed, which is not normal” and called the financing model “bullish for survival but bearish for fundamentals.”

The bull case for CoreWeave is the backlog: $66 billion in remaining performance obligations, a new $8.5 billion loan for a Meta data center, and OpenAI contracts totaling $22.4 billion across three deals signed between March and September 2025. OpenAI even invested $350 million in CoreWeave stock as part of the initial arrangement. The bear case is that Microsoft could pull its compute in-house at any time, and the debt structure only works as long as GPU demand stays white-hot.

Lambda Labs takes a more measured approach. They’re at roughly $500 million revenue run rate, raised a $1.5 billion Series E in late 2025, and offer H100 SXM access at $2.99/GPU-hour. They’re building a 24MW AI factory in Kansas City with over 10,000 GPUs.

RunPod is the scrappier player. They crossed $120 million ARR in January 2026 with 90% year-over-year growth on just a $20 million seed round. Half a million developers on the platform. H100s as low as $2.39/hour. Gross margins in the mid-60s to high-70s percent, which is better than most neoclouds manage.

Nscale is the newest entrant. A $2B Series C at $14.6 billion valuation, with Nvidia, Dell, Lenovo, Nokia, and Citadel among the investors. Founded in 2024 and already valued higher than most public SaaS companies. That either signals genuine demand or a valuation bubble. Possibly both.

The neocloud business model has a structural problem: you’re essentially arbitraging the gap between what Nvidia charges for GPUs and what customers will pay to rent them. If GPU supply loosens (Nvidia is shipping more Blackwell racks every quarter) or if the hyperscalers build enough capacity, the neocloud margin compresses fast. And the hyperscalers are building. Amazon committed $200 billion in 2026 capex, Google $175-185 billion, Microsoft $120 billion, Meta $115-135 billion. Those four alone add up to $610-640 billion, and the Big Five combined come to nearly $700 billion. Most of it is going to AI infrastructure.

The optimization layer: where speed meets money

Here’s where it gets interesting. If inference is the dominant cost, then companies that make inference cheaper or faster sit in a strong position. A few different approaches are competing.

Groq built custom silicon from scratch. Their Language Processing Unit (LPU) delivers inference at 877 tokens per second on Llama 3 8B and over 500 tokens per second on Qwen3 32B. That throughput is roughly 2x the fastest alternatives. The market noticed: $750 million raised at a $6.9 billion valuation in September 2025, up from $2.8 billion just a year earlier. Then Nvidia struck a $20 billion licensing deal for Groq’s inference technology in December 2025, roughly 2.9x Groq’s most recent valuation. Groq’s CEO and senior leaders joined Nvidia, though Groq continues operating independently. When the GPU monopolist pays $20 billion for your inference IP, that validates the market.

Cerebras took a different bet: wafer-scale processors. Instead of individual chips, they fabricate entire wafers as single processors. The WSE-3 packs 4 trillion transistors and 900,000 AI-optimized cores onto a single wafer. They raised $1.1 billion at an $8.1 billion valuation, then reportedly another $1 billion at $23 billion. Targeting a Q2 2026 IPO. But Cerebras has its own concentration problem: G42, a UAE-based company, accounted for over 80% of their revenue, and G42 had to fully divest its stake to satisfy U.S. national security reviews. Sound familiar? Customer concentration keeps showing up at every layer.

Open-source inference engines like vLLM and TGI represent the software optimization layer. vLLM now runs on over 400,000 GPUs worldwide. Its PagedAttention algorithm cuts the KV-cache memory wasted on fragmentation and over-reservation, and the vLLM authors reported 2-4x higher serving throughput as a result, which lets you serve more concurrent users on the same hardware. The commercial entity behind vLLM, Inferact, raised $150 million at an $800 million valuation in January 2026. Production deployments include Roblox (serving 4 billion tokens per week), Amazon Rufus (250 million customers), and LinkedIn. We compared vLLM and TGI head-to-head in our benchmark analysis.
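To see why paging the KV cache helps, here is a toy model of the idea. It is not vLLM's implementation. It only compares how many cache slots get reserved when every request pre-allocates a maximum-length buffer versus when each request is handed small fixed-size blocks as it grows. The sequence lengths, the 2,048-token maximum and the 16-token block size are all assumptions picked for illustration.

```python
import math


def contiguous_slots(seq_lens: list[int], max_len: int) -> int:
    """Slots reserved when every request pre-allocates max_len KV-cache slots."""
    return len(seq_lens) * max_len


def paged_slots(seq_lens: list[int], block_size: int) -> int:
    """Slots reserved when each request gets just enough fixed-size blocks."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


seq_lens = [100, 500, 2000]  # hypothetical tokens currently held per request
used = sum(seq_lens)

contiguous = contiguous_slots(seq_lens, max_len=2048)
paged = paged_slots(seq_lens, block_size=16)

print(f"contiguous: {used / contiguous:.1%} of reserved slots in use")
print(f"paged:      {used / paged:.1%} of reserved slots in use")
```

In this toy setup the paged layout wastes only the unused tail of each request's last block, so the same GPU memory can hold more concurrent requests.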

And model quantization keeps making inference cheaper. A 70B-parameter model needs about 140GB of VRAM for its weights at 16-bit precision (FP16 or BF16). Quantize it to INT4, and that drops to about 35GB. You can serve the same model on roughly a quarter of the GPU memory. The quality tradeoff is real but shrinking as quantization methods improve.
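Those memory figures fall out of bytes per parameter. The sketch below counts weights only, ignoring the KV cache, activations and quantization overhead such as scales and zero points, so treat the outputs as lower bounds. Note that 140GB for a 70B model corresponds to 2 bytes per parameter (16-bit weights), and 35GB to half a byte per parameter.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for precision in ("fp32", "fp16", "int8", "int4"):
    print(f"70B @ {precision}: {weight_memory_gb(70e9, precision):.0f} GB")
```

Going from 16-bit to INT4 shrinks the weights by 4x, which is where the hardware savings come from.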

Elon Musk put it bluntly: “The future is overwhelmingly inference.” The optimization layer is where the most interesting technical competition is happening, but it’s also the hardest layer to build a business on. Speed improvements get commoditized. Open-source eats margins. Custom silicon is a billion-dollar bet that takes years to validate.

The API providers: the margin squeeze nobody talks about

Image 3: Helicone inference cost tracking dashboard

Backing up a step. The companies most people associate with AI (OpenAI, Anthropic, Google) are actually in the worst economic position in the inference stack. They sell tokens at prices that often don’t cover the cost of generating them.

Token costs have collapsed at a rate that’s hard to wrap your head around. GPT-4 launched at $30 per million input tokens in March 2023. Today, GPT-4o charges $2.50. Claude Sonnet runs at $3 per million input. Budget models like GPT-4o Mini and Claude Haiku serve tokens for under $0.25 per million. Epoch AI analyzed the trend and found that the cost of inference at a fixed performance level has been halving roughly every two months, which compounds to about 64x per year. Since January 2024 the decline has been even steeper, at a median rate of 200x per year. That pricing pressure is relentless, driven by competition and efficiency gains compounding together.
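It helps to put the halving time and the per-year multiple on the same scale. The sketch below converts between the two, assuming a steady exponential decline, and also computes the raw list-price drop from GPT-4 to GPT-4o using the prices quoted above. The helper names are illustrative.

```python
import math


def annual_factor_from_halving(halving_months: float) -> float:
    """Price-decline multiple per year implied by a given halving time."""
    return 2 ** (12 / halving_months)


def halving_months_from_annual_factor(annual_factor: float) -> float:
    """Halving time in months implied by a per-year decline multiple."""
    return 12 * math.log(2) / math.log(annual_factor)


print(f"Halving every 2 months: {annual_factor_from_halving(2):.0f}x per year")
print(f"200x per year: halving every {halving_months_from_annual_factor(200):.2f} months")

# List-price drop between the two flagship models named above.
print(f"GPT-4 to GPT-4o: {30 / 2.50:.0f}x cheaper per input token")
```

So a two-month halving time works out to 64x per year, and a 200x-per-year decline means prices halve about every month and a half.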

OpenAI lost roughly $5 billion on $3.7 billion in revenue. The losses come almost entirely from inference. Every ChatGPT query, every API call, every Codex suggestion costs GPU time that OpenAI pays for (largely through CoreWeave, ironically). The more users they acquire, the more money they lose. That’s a business model that works only if you believe future margins will improve faster than usage grows.

Anthropic is in a similar position. Google subsidizes some of its inference costs through TPUs and Gemini’s integration into Search and Android. But even Google’s Cloud AI division is still chasing profitability.

For what it’s worth, this is the layer that matters most to AI SaaS founders. If you’re building on top of these APIs, your cost structure is directly tied to their inference pricing. Token prices dropping 100x in two years sounds great until you realize your competitors’ costs dropped the same amount. The competitive advantage isn’t cheap tokens. It’s what you do with them.

The real value map

Image 4: Server infrastructure in a data center

Let me lay out who captures what in the inference economy, because the picture is counterintuitive.

Layer	Who	Gross Margin	Moat
Chip design	Nvidia	~75%	CUDA ecosystem, 10-year head start
Chip fabrication	TSMC	~60% (guided to 63-65%)	Leading-edge process and CoWoS packaging capacity
GPU cloud (neoclouds)	CoreWeave, Lambda, Nscale	15-30% (estimated)	Contract backlog, supply agreements
Hyperscaler cloud	AWS, Azure, GCP	30-60%	Existing enterprise relationships
Inference optimization	Groq, Cerebras	Negative (pre-profit scaling)	Custom silicon IP
API providers	OpenAI, Anthropic, Google	Negative to low single digits	Model quality, developer adoption
AI SaaS (end product)	You	60-80% (if you get it right)	Product differentiation, user relationships

Look at that table for a second. Margins are strongest at the two ends of the stack: chips at the bottom and applications at the top. A SaaS product that uses AI as a feature (not the product itself) can charge based on the value delivered, not the inference cost. An AI-powered contract review tool that saves a lawyer two hours of work can charge $50 for $0.15 worth of tokens. The token cost is irrelevant. The value capture is in the application layer.
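To make the contract-review example concrete, here is a minimal sketch of the margin math. It counts token cost as the only cost of goods sold, which is an assumption: real products also pay for hosting, support and payment fees. The 10x stress case is hypothetical.

```python
def gross_margin(price_usd: float, cost_usd: float) -> float:
    """Gross margin as a fraction of price."""
    return (price_usd - cost_usd) / price_usd


# Figures from the contract-review example: $50 charged, $0.15 of tokens.
base = gross_margin(price_usd=50.0, cost_usd=0.15)

# Hypothetical stress case: token cost is 10x higher than expected.
stressed = gross_margin(price_usd=50.0, cost_usd=1.50)

print(f"Token-only gross margin: {base:.1%}")
print(f"With 10x token cost:     {stressed:.1%}")
```

Even if token costs were ten times higher, the margin barely moves, which is the sense in which the token cost stops mattering once you price on value.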

That’s why the “who profits” question has a surprising answer. Nvidia wins at the bottom. Your SaaS product can win at the top. Everything in the middle is fighting over margin.

One pattern to watch: Meta’s custom MTIA chips signal that the biggest consumers of inference are moving to build their own silicon. If Meta, Google (TPUs), Amazon (Trainium/Inferentia), and Microsoft all bring inference hardware in-house, the pricing power of the neoclouds and even Nvidia could erode. The hyperscalers don’t want to be Nvidia’s largest customers forever.

And the compute crisis adds another wrinkle. Data centers consumed 415 TWh of electricity in 2024. The IEA projects nearly 945 TWh by 2030. Power availability, not GPU supply, may become the actual bottleneck. Whoever controls the power controls the inference.

What this means if you’re building an AI product

Image 5: GPU chips and circuit boards representing the AI gold rush

The practical takeaway for founders: don’t try to compete in the inference stack. Compete on top of it.

Track your inference costs per user and per task. Use model routing to send simple queries to cheap models. Take advantage of the 100x token price collapse by building features that would have been prohibitively expensive two years ago.
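Here is one way the tracking and routing advice could look in code. Treat it as a sketch under stated assumptions, not a production design: the tier names, the routing rule and the four-characters-per-token estimate are all hypothetical, and the prices are the per-million input figures quoted earlier in this article. A real system would use the provider's tokenizer and billing data, and would count output tokens too.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Input prices in USD per million tokens, from the figures quoted in this
# article. The tier names are hypothetical.
PRICE_PER_MILLION_INPUT = {"budget": 0.25, "frontier": 2.50}


def estimate_tokens(text: str) -> int:
    """Crude assumption: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def route(prompt: str, needs_reasoning: bool = False) -> str:
    """Hypothetical rule: short, simple prompts go to the budget tier."""
    if needs_reasoning or estimate_tokens(prompt) > 2_000:
        return "frontier"
    return "budget"


@dataclass
class CostTracker:
    """Accumulates estimated input-token spend per user and per task."""
    by_user: dict = field(default_factory=lambda: defaultdict(float))
    by_task: dict = field(default_factory=lambda: defaultdict(float))

    def record(self, user_id: str, task: str, prompt: str,
               needs_reasoning: bool = False) -> float:
        tier = route(prompt, needs_reasoning)
        cost = estimate_tokens(prompt) * PRICE_PER_MILLION_INPUT[tier] / 1_000_000
        self.by_user[user_id] += cost
        self.by_task[task] += cost
        return cost


tracker = CostTracker()
tracker.record("alice", "autocomplete", "Suggest a subject line for this email.")
tracker.record("alice", "contract_review", "Review this clause. " * 500,
               needs_reasoning=True)
print(dict(tracker.by_task))
```

The useful part is the shape: every call is tagged with a user and a task, so you can see which features drive spend before deciding what to route to a cheaper model.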

The inference economy will keep shifting. Token prices will keep falling. GPU supply will loosen as Vera Rubin and Blackwell Ultra ship. Custom silicon from Meta, Amazon, and Google will add capacity. All of that benefits application builders.

Nvidia and TSMC will profit regardless. The neoclouds are a bet on GPU scarcity lasting longer than the market expects. The optimization companies are a bet on physics and architecture beating brute-force GPU scaling. The API providers are a bet on model quality mattering more than inference costs.

And founders building AI SaaS products? You’re the only layer where margin can expand as inference gets cheaper. That’s the real opportunity in the inference economy.
