Finding the formula: The new economics of SaaS pricing
Designing adaptive revenue models
AI compute costs can complicate seat-based pricing. This session covers how SaaS companies are moving to hybrid models—when to charge per seat, based on usage, or both—and the operational trade-offs that come with each. Plus, hear lessons from OpenRouter’s own AI pricing story.
Speakers
Alex Atallah, CEO, OpenRouter
ALEX ATALLAH: Okay. Hello, everyone. I’ll wait a little bit for people to settle down. I’m Alex Atallah, CEO and cofounder of OpenRouter—the largest language model marketplace, router, and source of public data about what people are doing with language models and why. So I want to start, I’ll talk a little bit about what OpenRouter is first. Beginning of 2023, there was only one real model that people were using, in January and February, and that was ChatGPT. It was GPT-3.5 and 3.5 Turbo. And the open source scene hadn’t really developed yet. Llama came out in January. It was not a chat completions model. It was a completions model. So you couldn’t really interact with it or chat with it. But as the ecosystem matured, it became clear that there were going to be tons of these. And moreover, that people were going to find really powerful use cases for open source models that they could optimize for their company and improve not just cost, but also quality.
And so, we took a bet and decided to build a marketplace that allowed us to see all models in one place, showing everyone, across all apps and all use cases, what people were using AI for and why. And this is from our rankings page, which is public, and you can visit it and track it and see all AI usage grouped by model, grouped by app, grouped by use case, by the type of prompt, grouped by modality, like images. We just launched audio rankings today. So you can see the top audio input models and which ones are best at doing speech to text. And then we’ve launched video and image and audio output models recently as well. So all of this data is now public and available not just to humans, but soon also to agents. So you’ll be able to, in Claude Code, just ask what the most popular models are for your use case and ask about benchmarks as well, like which ones are ranking highest for math or highest for science or highest for coding, and just iteratively explore the public data in a way that no one has ever done before.
And that’s our vision and that’s what we’re trying to do with our public data. So in this presentation, I’m going to be talking about some insights that we’ve pulled from our public data that can help you with pricing.
Nearly everyone’s pricing is being challenged. We see thousands of applications and millions of people using inference, and everyone is scrambling to figure out how to price their inference-driven features. You see usage-based pricing, you see a subscription model, you see hybrids of both, where you subscribe and overage costs double. You see single-time payments. People are trying everything and struggling because power users often just use an enormous amount of inference, way more than the average user does. So it’s easy to feel like AI has wrecked all of our pricing models and that we need something brand new. It looks like inference is getting too expensive. If you consider pay as you go, it’s really, really hard to forecast this. If you’re trying to price in a seat-based way, agents, like I said, are power users. They just use 1,000 times what a human uses. And if you try to price based on outcomes, which is something that a lot of companies are trying now, it’s very difficult. You assume a lot of risk, and it’s hard to know how your margins are going to scale.
So I want to take a step back and make you challenge an assumption going in here, which is that you may be thinking of your inference in an unscalable way. It might not be your pricing that you need to think about at all. It’s the way that you’re using AI models under the hood and how much they’re costing you. And if you can fix that problem, a lot of pricing problems, a lot of optionality opens up for you. So here’s some data that we’re seeing across the market, across all use cases. The root problem I’m seeing over and over again is that the cost of inference is exploding, driven by increasing the amount of context we put into agents and the number of loops and iterations we do around it. So here you can see the average input tokens going up over time, and the average output tokens too, across all API requests on OpenRouter.
And if you just look at this on a session basis, it gets even wilder. These are just like per request average tokens. So what we want to do is get us back to value-based pricing. Rather than starting with your pricing models, consider the way inference fits into the existing models you have. And here’s a framework that I suggest you try. Classify the actual tasks that you are building around. So rather than simply throwing a ton of data at GPT-5.5 or the latest or whatever frontier model you want and asking for multiple outputs all in the same request, break the problem down into components. Then map those components, map those subtasks to optimal models. And I’ll show a couple examples of how people are doing this successfully in the industry very soon. So we mapped out all the inference. We wrote a report called <em>State of AI</em> at the end of last year in collaboration with Andreessen Horowitz, and we compiled a bunch of data about the different use cases across the market, how much it costs on average to serve those use cases, and the amount of tokens that we typically see associated with it.
So we mapped out all inference by the category of usage, and you can see the highest token volumes are on top, and it gets more expensive over to the right. What we noticed is that tasks are breaking down on a type of determinism with a heavy skew towards more variable workloads. So tasks over in this area are deterministic. Some people call them saturated tasks. They’re ones like looking for a certain number, like is this PR HIPAA compliant, or can you classify this input as needing moderation or not?
Or can you just translate this input into another language? Can you check to see if there’s some kind of syntax error here? These really bespoke deterministic tasks, if you break it down far enough, you can optimize really aggressively. There are also variable tasks where you have no idea what the output is going to be. Like your users might be doing wild auto research pipelines. They might be like exploring their next novel. You have no idea what it’s going to be. Then you end up spending more, because you just can’t optimize that level of unknown. But if you can break your problem down, you can do quite a bit. And then there’s an area in the middle, which we call semivariable, where we kind of know what we want, but the outputs could vary pretty dramatically. And this would be like a general pull request review, for example, or trivia, or completing the chapter of a novel, even doing like kind of open-ended legal requests.
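The classify-then-map framework described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three tiers come from the talk, but the `Subtask` type, the `plan` function, and the model slugs are hypothetical placeholders rather than anything OpenRouter publishes or recommends.

```python
from dataclasses import dataclass
from enum import Enum


class Determinism(Enum):
    DETERMINISTIC = "deterministic"  # e.g. moderation, syntax checks
    SEMIVARIABLE = "semivariable"    # e.g. general pull request review
    VARIABLE = "variable"            # e.g. open-ended research pipelines


# Hypothetical slugs: placeholders for whichever models your own evals select.
MODEL_FOR_TIER = {
    Determinism.DETERMINISTIC: "example/small-model",
    Determinism.SEMIVARIABLE: "example/mid-model",
    Determinism.VARIABLE: "example/frontier-model",
}


@dataclass(frozen=True)
class Subtask:
    name: str
    determinism: Determinism


def plan(subtasks):
    """Return a {subtask name: model slug} routing plan."""
    return {task.name: MODEL_FOR_TIER[task.determinism] for task in subtasks}


subtasks = [
    Subtask("moderation", Determinism.DETERMINISTIC),
    Subtask("pr_review", Determinism.SEMIVARIABLE),
    Subtask("research", Determinism.VARIABLE),
]
routing = plan(subtasks)
```

In practice the mapping would come from testing each subtask against several candidate models, not from a fixed table.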
So you might be skeptical about this. Why not just use the best intelligence all the time and find ways to charge for it? Isn’t that what the best companies are doing? Like if you’re serious, shouldn’t you just use frontier models?
But this is not what we actually observe in the ecosystem. The most serious teams are actually heavily optimizing their model usage. And that’s because they realize they can do a lot more by cutting costs, and the costs are not insubstantial at all. So from a volume perspective, for months, this hasn’t been true. When you look at popular models, we see that the lowest cost models overtake the highest cost models in token volume. And we exclude free data here. So this is cheapest versus most expensive popular models by token volume. And you can see that the lower cost models are actually growing faster. Starting in Q4, with the inflection in open source models and lightweight models, including those from Gemini, these models have been surging in usage because people are figuring out how to offload tasks. People are breaking down their larger tasks into subtasks.
And what about programming, which is pretty open-ended? This is a really hard one to make deterministic. Surely this is only going to land on top-tier models, right? Even programming, I would challenge you to decompose it. Like I mentioned before, you can put code review as a semivariable task, and we see a lot of teams using nonfrontier models, but with very clear guidelines about what to check each pull request for when they make their own PR review bots. For code scanning, for just looking for simple accessibility violations in HTML, these can be done in a really cheap bespoke way. And we’ve seen this shift happen now across the whole year. So you can see the top five most expensive models, and then you can see everything else taking up token share. And the top five most expensive models are still dominating by dollar volume, but tokens give you a sense of the tasks and time that people are spending doing inference on different models.
This is kind of another way of visualizing the landscape here. You can see open source in green on the left and closed source on the right, where closed source in general is a higher cost per million tokens. And then you can see kind of frontier models down below where usage gets… It is actually lower than the open source models today. So look into your tasks, look at which models they’re using, and look at how much you’re paying. Let me give a concrete example of a company that did this really well. Shopify had a single-purpose agent whose purpose was to crawl and analyze shops for different kinds of outcomes. And it started out as a one-shot agent using a GPT series model.
And what they did was, they would like throw tons of context at GPT-5, and then ask for a couple different outcomes like, is there like a fraud situation here? Can agents crawl this site? They were asking a couple different questions about the site all in a single prompt. This is a perfect example of a good task to break down and do in multiple requests instead of one. So what they did was they broke up their usage into three separate subtasks, and they were able to lower the spend from, I think it was like $5.5 million to about $75,000 a year. And it improved results like the F1 score, which is kind of a harmonic mean between precision and recall, improved for their compound agent relative to the one-shot GPT-5 agent. So these are like the three separate subagents they ended up developing, and they just would just run the subagents in parallel.
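As a rough sketch of the pattern described here (single-question subagents run in parallel and scored with F1), the following Python uses only the standard library. The two subagent functions are hypothetical stand-ins for model calls and use simple string checks so the example runs offline; the talk names the fraud and crawlability questions but does not describe the actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


def looks_fraudulent(page_text):
    # Placeholder for a small model that is asked only the fraud question.
    return "wire transfer only" in page_text.lower()


def agents_can_crawl(page_text):
    # Placeholder for a small model that is asked only the crawlability question.
    return "disallow: /" not in page_text.lower()


def run_subagents(page_text, subagents):
    """Run every subagent on the same input in parallel and collect the answers."""
    with ThreadPoolExecutor(max_workers=max(1, len(subagents))) as pool:
        futures = {name: pool.submit(fn, page_text) for name, fn in subagents.items()}
        return {name: future.result() for name, future in futures.items()}


results = run_subagents(
    "Welcome! Wire transfer only.",
    {"fraud": looks_fraudulent, "crawlable": agents_can_crawl},
)
```

Taking the quoted figures at face value, going from roughly $5.5 million to about $75,000 a year is a reduction of about 73x.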
You don’t need to be at their scale though to see this level of savings. We see a surprisingly persistent premium for the best intelligence versus acceptable intelligence. This is one of those crazy charts with a crazy y-axis, but like here you get the priciest 10 models, here you get the cheapest 10 models on OpenRouter, and here you get a lot of time. And what we see is that the delta between the priciest 10 models and the cheapest 10 models, this is a pretty interesting chart, is reliably 25x. I don’t know if anybody figured that out from the y-axis, but yeah, that’s like a 25x delta. So that’s basically what the market saves when moving to lower cost models.
The dropping cost of intelligence also means you can do a lot more with these lower cost models, and the pace is picking up. So here’s a chart from Artificial Analysis, where you can see this applying across all intelligence levels. Here, this blue line is kind of a lower intelligence level, and you can see that we started at this price point and after about four months, it dropped, it dropped, it dropped again. The next intelligence level, same pattern happened. It just takes a couple months after each intelligence level is achieved, which is basically each leap forward in LLMs for cost drop. And so you can kind of like reliably expect declining costs and to get down to this 25x savings level. We did some research on this last year, like I mentioned, and we noticed a striking effect in how models are adopted as a result.
People basically try models really quickly, and there’s always this power user cohort that finds an amazing use case for a new model, and they’re like, “Whoa, this solves all my issues with all the previous models. I don’t have to triple prompt anymore, or I finally don’t have to create these guardrails that I used to have in my code,” and they stick to it. And you’d expect that if this is true, then we should be able to find a cohort, like a first-week cohort that has much higher retention than all other cohorts. These are the power users who discover these latent capabilities in new model launches. So we call this the “glass slipper effect,” and you can illustrate it here. So this is a chart of Claude 4 Sonnet adoption, and you can see that the first cohort, which is in orange here, this is the retention of people who adopt it the month that it launches, and it’s significantly higher than the retention of cohorts who adopt it in subsequent months.
So this is the glass slipper effect in action. It’s the fact that there are these tons of enthusiasts and tons of companies who are actively searching for models that work for new use cases where they’re struggling with the existing frontier.
So I believe that you can try on these glass slippers in a sense, and that your path to getting inference spend under control will allow you to return to value-based pricing, the value-based pricing that you intend, when you launch your pricing plan. And if you can get your costs down quite a bit, then you can just charge the way you want, which is just the way your customers want, the way your customers actually see your service. This is the thought experiment I’ve been giving everyone: whether you’re planning your personal agent or the spend for your company, you should analyze your tasks by how deterministic they are and start testing them against more models.
Another way to look at this is that you can make the token disappear from your calculations and turn it into an infrastructure decision, so that you can reclaim your value-based pricing. And this is an incredible way to get started. One way, the way that Shopify did, is by using a framework like DSPy, which works really well with OpenRouter. You can compile your code against a prompt that it’ll automatically optimize. And the moment you want to try out a new model like Qwen or a new open source model that launches, you just change the model slug, recompile, and find the new prompt that’s going to work for your use case. Okay. So the question is, “Do you see availability and capacity being phased out for older models? We find our glass slipper; can we rely on them staying available?” So this is a good question. We frequently see models get deprecated across the ecosystem.
It is amazing how many people are still using Gemini 2.5 Flash. There are a lot of open source models that get dropped from a lot of the open source inference providers. And this goes for all inference providers, hyperscalers, neoclouds, they’re all dropping models when they find that demand on their platform just does not justify the cost to keep the model running. So what do you do about deprecations? First, OpenRouter was partly made to help with this. We aggregate the whole market for you, so that when a model gets deprecated on one provider, your app stays up. Moreover, you can provide fallback models or just use the Auto Router, so that if the whole model were to disappear—which means all providers on the market disappear—your app stays up. And we’re seeing a shift towards auto routers, partly for this reason, but also a shift towards aggregating inference.
We started back in, I think, June of 2023, we put all providers on a model page, so that we could increase everyone’s uptime. And that way, when you use OpenRouter, you get the combination of uptime from all providers in the ecosystem, and you also get flexibility on which features you want. If you want speed, if you want certain samplers, we can aggregate the features of the ecosystem for you, and that’s a big reason people use us, especially for inferencing open source models. So when old providers drop out, you can continue using them on OpenRouter. And then if you’re worried about the model disappearing, I just urge kind of moving to routers in general. We allow you to define your own routing policies. And then we also build routers that help you target models just above a certain intelligence level, including a new one that we’re about to announce soon that is available for people today, called the Pareto Router.
The second question is, “How close to frontier do you think the new DeepSeek, Kimi, and Qwen models actually are? Can we be more aggressive in moving these highly variable workloads yet?” So there are a couple nuances to this. First, I think, like I mentioned before, when you optimize your inference, you can get better than frontier quality out of all of these models. That Shopify example that I mentioned moved to Qwen from GPT-5, but it requires optimization, requires you to decompose your task and to be careful about what your prompt is. But the cost savings are quite worthwhile, and you can improve performance on your agent just by putting the effort into optimizing it.
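The fallback idea in this answer can be illustrated with a small client-side loop. This is a generic sketch and not OpenRouter’s API or implementation: `complete_with_fallback`, `ModelUnavailable`, the `fake_call_model` stub, and the model slugs are all hypothetical names introduced for the example.

```python
class ModelUnavailable(Exception):
    """Raised when a model or provider cannot serve a request."""


def complete_with_fallback(prompt, models, call_model):
    """Try each model slug in order; return (slug, output) from the first that works."""
    errors = {}
    for slug in models:
        try:
            return slug, call_model(slug, prompt)
        except ModelUnavailable as exc:
            errors[slug] = str(exc)  # remember why this model was skipped
    raise ModelUnavailable(f"all models failed: {errors}")


def fake_call_model(slug, prompt):
    # Stand-in for a real inference call; pretends one model was deprecated.
    if slug == "example/deprecated-model":
        raise ModelUnavailable("model was deprecated")
    return f"[{slug}] reply to: {prompt}"


used_slug, output = complete_with_fallback(
    "Summarize this ticket.",
    ["example/deprecated-model", "example/backup-model"],
    fake_call_model,
)
```

A hosted router moves this loop out of application code, which is the benefit described above: the app stays up when a model or provider disappears.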
The nuance here is that there are just a lot of nondeterministic use cases in major apps today. And if you’re building a general purpose coding agent, there’s just not that much task decomposition out there. You can label the chats, you can have a CI agent that runs separately from the code generation agent. You can do live feedback to the model as it’s outputting code. So there are some ways of breaking things down, but a lot of the aspects of AI today are about, how do we deal with this emerging market where we don’t even know what to build? And that’s where we see frontier models still be really, really strong.
“What do you mean ‘make the token disappear’?” So what I mean by that is, if you are really worried about tokens, it generally means that you have not optimized your app. You haven’t broken the task down into subtasks. And so when I say, “Make the token disappear,” I just mean like think about the actual tasks that you’re accomplishing and how much they cost on average, rather than the total number of tokens you’re consuming in the background.
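One way to read “make the token disappear” is to report average cost per task instead of total tokens. The sketch below assumes a flat price per million input and output tokens; the prices, task names, and request log are made up for illustration.

```python
from collections import defaultdict


def average_cost_per_task(requests, price_in_per_million, price_out_per_million):
    """Average dollar cost per task name.

    requests: iterable of (task_name, input_tokens, output_tokens) tuples.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for task_name, input_tokens, output_tokens in requests:
        cost = (
            input_tokens * price_in_per_million
            + output_tokens * price_out_per_million
        ) / 1_000_000
        totals[task_name] += cost
        counts[task_name] += 1
    return {name: totals[name] / counts[name] for name in totals}


# Hypothetical request log and prices (dollars per million tokens).
log = [
    ("classify", 1_000, 10),
    ("classify", 3_000, 30),
    ("pr_review", 20_000, 1_500),
]
costs = average_cost_per_task(log, price_in_per_million=1.0, price_out_per_million=2.0)
```

With per-task numbers like these, the pricing conversation becomes “what does a review cost us on average” rather than “how many tokens did we burn.”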
“Will OpenRouter accept stablecoin payment natively?” So we’ve been accepting stablecoin payments since May of 2023. I think we were one of the earliest platforms to do it, and we also built a programmatic way for agents to do it prior to x402. Since then, protocols like x402 and MPP have emerged that do this in a better way, so expect to see some more work from us soon to make stablecoin payments even easier for agents.
“Do you have any insight between the time from frontier model intelligence or capabilities and open source models?” We do work with the model labs, but I wouldn’t say we have any insights that we can share that are not public. The frontier model intelligence generally leads open source model intelligence by about six months. I’d say that when a new open source model launches, there’s typically a ton of hype on Twitter, and be careful how swayed you get by the hype the week the model launches. It’s really important to be scientific about whether it’s actually working. What I see a lot of people do, for example, is a new model will launch or a new harness will launch, and then they’ll give it like three prompts manually and they’ll be like, “I don’t think it’s better. I think it’s worse.” Or they’ll give it one or two prompts in Claude Code and be like, “Oh my God, I think it’s better. It’s better.” And then they’ll post about it. And we’re just in this phase of the market where a sample size of one is okay, which is crazy. So increase your sample size and do actual studies because these things are nondeterministic black boxes, and you have to learn from real data in the market in addition to running this on evals that you write yourself.
I think that’s the last one. Oh, “A year ago, I asked you if users wanted to pay per use instead of sign up and buy credits. You said ‘No.’ Has that changed in the last year?” There are still pay-per-use use cases. It has not changed that it’s a huge need. I mean, I would say that the pay per use is really, really powerful when you have agents being built that have no idea how they’re going to be used, because they have no idea what services they’re going to need as a result. And OpenClaw actually was a really big moment for us for this case. When OpenClaw started, it had these two actions, a heartbeat and everything else. And a lot of the setup tasks from OpenClaw just don’t work unless you’re using a really good model, but then those really good models would charge you a request every 5, 10 minutes for the heartbeat.
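To see why a handful of manual prompts says so little, a confidence interval on the pass rate helps. The sketch below uses the standard Wilson score interval; the pass counts are invented, and in a real study they would come from your own evals.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% confidence)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


# Invented counts: 3 of 3 manual prompts passed, versus 300 of 300 in an eval run.
small_low, small_high = wilson_interval(3, 3)
large_low, large_high = wilson_interval(300, 300)
```

Three passes out of three still leaves a lower bound of roughly 0.44 on the true pass rate, while 300 out of 300 puts it above 0.98. That gap is what a larger sample buys.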
So people really wanted to move to the Auto Router, and that’s how our Auto Router started getting a lot of growth. So these pay-per-use use cases tend to emerge when some sort of general purpose app like OpenClaw needs some service that the original developer never knew would be needed. And I think we’ll see more in the coming 6 to 12 months that does that.
Okay. Thanks for having me.