
I can't be the only one who thinks this version is no better than the previous one, that LLMs have basically reached a plateau, and that the "features" in all the new releases are more or less just gimmicks.





I think they are just getting better at the edges: MCP/tool calls, structured output. This definitely isn't increased intelligence, but it is an increase in the value add. Not sure the value added equates to the training costs or company valuations, though.

In all reality, I have zero clue how any of these companies remain sustainable. I've tried to host some inference on cloud GPUs, and it seems like it would be extremely cost-prohibitive with any sort of free plan.


> how any of these companies remain sustainable

They don't. They have a big bag of money they are burning through, and they are working to raise more. Anthropic is in a better position because they don't have the majority of the public using their free tier. But, AFAICT, none of the big players are profitable; some might get there, but likely through verticals rather than just model access.


If your house is on fire, the fact that the villagers are throwing firewood through the windows doesn't really mean the house will stay standing longer.

Doesn't this mean that, realistically, even if "the bubble never pops", at some point the money will run dry?

Or are these people just betting on a post-money world of AI?


The money won’t run dry. They’ll just stop providing a free plan when the marginal benefits of having one don’t outweigh the costs any more.

In two years' time you'll need to add a 10% Environmental Tax, 25% Displaced Workers Tax, and 50% tip to your OpenAI bills.

Or at that point, maybe stop using it and just let them go broke?

It's more likely that the free tier model will be a distilled lower parameter count model that will be cheap enough to run.

They will likely just charge a lot more money for these services. E.g., the $200+ per month plans could, I think, become more of the entry level in 3-5 years. That said, smaller models are getting very good, so there could be low-margin direct model services and expensive verticals, IMO.

At that price it would start to be worth it to set up your own hardware and run local open-source models.


This man (in the article) clearly hates AI. I also think he does not understand business and is not really able to predict the future.

He did make good points, though. AI was perceived as more dangerous when only a select few megacorps (usually backing each other) were pushing its capabilities.

But now, every $50B+ company seems to have their own model. Chinese companies have an edge in local models, and big tech seems to be fighting each other like cats and dogs over a tech which has failed to generate any profit, while the masses drain cash out of these companies with free usage and Ghiblis.

What is the concrete business model here? Someone at Google said "we have no moat", and I guess he was right; this is becoming more and more like a commodity.


Oil is a commodity, and yet the oil industry is massive and has multiple major players.

Which oil companies have a free tier product available to the general public?

It also was kind of a shit investment unless you figured out which handful of companies were gonna win.

If you read any work from Ed Zitron [1], they likely cannot remain sustainable. With OpenAI failing to convert into a for-profit, Microsoft being more interested in being a multi-model provider and competing openly with OpenAI (e.g., open-sourcing Copilot vs. Windsurf, GitHub Agent with Claude as the standard vs. Codex), Google having their own SOTA models and not relying on their stake in Anthropic, tariffs complicating Stargate, the explosion in capital expenditure and compute, etc., I would not be surprised to see OpenAI and Anthropic go under in the next few years.

1: https://www.wheresyoured.at/oai-business/


I see this sentiment everywhere on Hacker News. I think it's generally the result of consuming the laziest journalism out there. But I could be wrong! Are you interested in making a long bet backing your prediction? I'm interested in taking the positive side on this.

While some critical journalism may be simplistic, I would not qualify it as lazy. Much of it is deeply nuanced and detail-oriented. To me, lazy would be publications regurgitating the statements of CEOs and company PR people who have a vested interest in making their product seem appealing. Since most of the hype is based on perceived futures, benchmarks, or the automation of the easier half of code development, I consider the simplistic voices asking "Where is the money?" to be important because most people seem to neglect the fundamental business aspects of this sector.

I am someone who works professionally in ML (though not LLM development itself) and deploys multiple RAG- and MCP-powered LLM apps in side businesses. I code with Copilot, Gemini, and Claude and read and listen to most AI-industry outputs, be they company events, papers, articles, MSM reports, the Dwarkesh podcast, MLST, etc. While I acknowledge some value, having closely followed the field and extensively used LLMs, I find these companies' projections and visions deeply unconvincing and cannot identify the trillion-dollar value.

While I never bet for money and don't think everything has to be transactional or competitive, I would bet on defining terms and recognizing if I'm wrong. What do you mean by taking the positive side? Do you think OpenAI's revenue projections are realistic and will be achieved or surpassed by competing in the open market (i.e., excluding purely political capture)?

Betting on the survival of the legal entity would likely not be the right endpoint, because OpenAI could likely be profitable with a small team if it restricted itself to serving only GPT-4.1 mini and did not develop anything new. They could also be acquired by companies with deeper pockets that have alternative revenue streams.

But I am highly convinced that OpenAI will not have revenue of >$100 billion by 2029 while being profitable [1], and I am willing to take my chances.

1: https://www.reuters.com/technology/artificial-intelligence/o...


OK, I like this so far, and I accept the critique on lazy. I might restate it as 'knee-jerk, uninformed, and poorly reasoned journalism'.

Do I think OpenAI’s revenue projections are realistic? I’m aware of leaks that say $12.5bn in 2025 and $100+bn in 2029. Order of magnitude, yes, I think they’re realistic. Well, let me caveat that. I believe they will be selling $100+bn at today’s prices in 2029.

Is this based only/largely on political capture? I don't know or care, really; I'm just tired of (formerly called lazy) journalism that gets people confidently wrong about the future by telling everyone OpenAI is doomed.

On the Reuters story — to be clear, OpenAI’s current plans mean that being cash flow positive in 2029 would be a bad thing for the company. It would mean they haven’t gotten investment to the level they think is needed for a long term winning play, and will have been forced to rely on operating cash flow for growth. In this market, which they postulated was winner take all, and now propose is “winner take most” or “brand matters and TAM is monstrous”, they need to raise money to compete with the monstrous cash flow engines arrayed against them: Meta, Google, and likely some day Apple and possibly Oracle. On the flip side, they offer a pretty interesting investment: If you believe most of the future value of GOOG or META will come from AI (and I don’t necessarily believe this, but a certain class of investors may), then you could buy that same value rise for cheap investing in OpenAI. Unusually for such a pitch they have a pretty fantastic track record so far.

For reference, there are roughly 20mm office jobs in the USA alone. Chat is currently 65% or so of all chatbot usage. The US is < 1/6 of OpenAI's customer base. 10mm people currently pay for chat. OpenAI projects chat to be about 1/3 of income, with no innovations beyond agentic tool calling.

To wit: in 2029 will we be somewhere in the following band of scenarios:

Low Growth in Customers but increased model value: will 10mm people pay $3.6k a year for chat ($300/month) worldwide in 2029, with API and agent use each covering a similar amount of usage?

High Growth in Customers with moderately increased model value: will 100mm people pay $360 a year for o5, which is basically o4-high but super fast and tool-connected to everything?
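A rough back-of-envelope check on those two endpoints (a sketch in Python; the subscriber counts and prices are just the assumptions above, and chat is taken as roughly 1/3 of revenue per the projection mentioned earlier):

    # Back-of-envelope check of the two scenario endpoints above.
    # All inputs are the assumptions stated in this comment, not reported figures.

    def total_revenue(subscribers: float, price_per_year: float, chat_share: float = 1 / 3) -> float:
        """Total revenue if chat subscriptions make up `chat_share` of income
        and API/agent usage covers the rest in similar proportion."""
        return subscribers * price_per_year / chat_share

    # Low customer growth, higher per-seat value: 10mm seats at $300/month.
    low_growth = total_revenue(10e6, 300 * 12)

    # High customer growth, moderate per-seat value: 100mm seats at $30/month.
    high_growth = total_revenue(100e6, 30 * 12)

    print(f"low-growth scenario:  ${low_growth / 1e9:.0f}bn/yr")   # ~$108bn
    print(f"high-growth scenario: ${high_growth / 1e9:.0f}bn/yr")  # ~$108bn

Both endpoints come out just above $100bn a year.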

Ending somewhere in this band seems likely to me, not crazy. The reasons to fall out of this band are: they get beaten hard and lose their research edge thoroughly to Google and Anthropic, so badly that they cannot deliver a product that can be backed by their brand and large customer base; or an open-weights model achieves true AGI ahead of / concurrent with OpenAI and they decide not to become an inference-providing company; or the world decides it doesn't want to use these tools (hah); or the world's investors stop paying for frontier model training and everyone has to move to cash-flow-positive behavior.

Upshot: I’d say OpenAI will be cashflow positive or $100bn+ in CF in 2029.


There's still the question of whether they will try to change the architecture before they die. Using RWKV (or something similar) would drop the costs quite a bit, but it would require risky investment. On the other hand, some are experimenting with text diffusion already, so it's slowly happening.

> and that LLMs have basically reached a plateau

This is the new stochastic parrots meme. Just a few hours ago there was a story on the front page where an LLM-based "agent" was given 3 tools to search e-mails and the simple task "find my brother's kid's name", and it was able to systematically work the problem, search, refine the search, and infer the correct name from an e-mail that mentioned nothing other than "X's favourite foods" with a link to a YouTube video. Come on!

That's not to mention things like AlphaEvolve, Microsoft's agentic test demo with Copilot running a browser, exploring functionality, and writing Playwright tests, and all the advances in coding.
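For anyone who hasn't looked at how these agent setups work, here's a minimal sketch of that kind of tool-calling loop in Python. It's not the code from the linked story; call_llm and search_emails below are hypothetical placeholders for whatever chat API (with tool-call support) and mail index you'd actually wire in.

    # Minimal sketch of an agentic tool-calling loop like the e-mail example above.
    # Both stubs are hypothetical placeholders, not any vendor's real API.
    import json

    def search_emails(query: str) -> list[dict]:
        """Placeholder tool: return e-mails matching `query` from some mail index."""
        raise NotImplementedError("wire this to your mail provider or IMAP index")

    def call_llm(messages: list[dict]) -> dict:
        """Placeholder model call. Returns either
        {"tool": "search_emails", "arguments": {...}} or {"answer": "..."}."""
        raise NotImplementedError("wire this to a chat API that supports tool calls")

    TOOLS = {"search_emails": search_emails}

    def run_agent(task: str, max_steps: int = 10) -> str:
        messages = [{"role": "user", "content": task}]
        for _ in range(max_steps):
            reply = call_llm(messages)
            if "answer" in reply:
                return reply["answer"]           # the model decided it knows enough
            result = TOOLS[reply["tool"]](**reply["arguments"])
            # Feed results back so the model can refine the query or infer the answer.
            messages.append({"role": "tool", "content": json.dumps(result)})
        return "no answer within the step budget"

    # run_agent("find my brother's kid's name")

The impressive part of the story is exactly that loop: the model chose the queries, refined them based on what came back, and decided when it had enough to answer.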


And we also have a showcase from a day ago [1] of these magical autonomous AI agents failing miserably in the PRs unleashed on the dotnet codebase, where the agent kept reiterating that it had fixed the failing tests it wrote, without actually fixing them. Oh, and multiple blatant failures that happened live on stage [2], with the speaker trying to sweep the failures under the rug on some of the simplest code imaginable.

But sure, it managed to find a name buried in some emails after being told to... Search through emails. Wow. Such magic

[1] https://news.ycombinator.com/item?id=44050152
[2] https://news.ycombinator.com/item?id=44056530


Is this something that the models from 4 months ago were not able to do?

For a fair definition of able, yes. Those models had no ability to engage in a search of emails.

What’s special about it is that it required no handholding; that is new.


Is this because the models improved, or because the tooling around the models improved (both visible and not visible to the end user)?

My impression is that the base models have not improved dramatically in the last 6 months and that incremental improvements in those models are becoming extremely expensive.


Resist getting your news from Brooklyn journalists. :)

Tooling has improved, and the models have. The combo is pretty powerful.

https://aider.chat/docs/leaderboards/ will give you a flavor of the last six months of improvements. François Chollet (ARC-AGI: https://arcprize.org/leaderboard) has gone from "No current architecture will ever beat ARC" to "o3 has beaten ARC and now we have designed ARC 2".

At the same time, we have the first really useful 1mm token context model available with reasonably good skills across the context window (Gemini Pro 2.5), and that opens up a different category of work altogether. Reasoning models got launched to the world in the last six months, another significant dimension of improvement.

TLDR: Massive, massive increase in quality for coding models. And o3 is, to my mind, over the line people had in mind for "generally intelligent" in, say, 2018; o3 alone is a huge improvement launched in the last six months. You can now tell o3 something like: "research the X library and architect a custom extension to that library that interfaces with my weird garage door opener; after writing the architecture, implement the extension in (node/python/go)", and it will come back in 20 minutes with something that almost certainly compiles and likely largely interfaces properly, leaving touch-up work to be done.


I use LLMs every day, so I get my news from myself (I couldn't even name a Brooklyn journalist?). My experience so far is they are good for greenfield development (i.e. getting a project started), and operating within a well defined scope (e.g. please define this function in this specific place with these constraints).

What I haven't seen is any LLM model consistently being able to fully implement new features or make refactors in a large existing code base (100k+ LOC, which are the code bases that most businesses have). These code bases typically require making changes across multiple layers (front end, API, service/business logic layer, data access layer, and the associated tests, even infrastructure changes). LLMs seem to ignore the conventions of the existing code and try to do their own thing, resulting in a mess.


Def a pain point. Anecdotally, Claude Code and aider can both be of some help. My go-to method is: dump everything in the codebase into Gemini and ask for an architecture spec, then ask for an implementation from aider or Claude Code. This 90% works 80% of the time. Well, maybe 90% of the time. Notably, it can deal with cross-codebase interfaces and data structures, in general, with good prompting.

Dumping it on Claude 3.7 with no instructions will 100% get you random rewriting; very annoying.


The LLMs have reached a plateau. Successive generations will be marginally better.

We're watching innovation move into the use and application of LLMs.


Innovation and better application of a relatively fixed amount of intelligence got us from wood spears to the moon.

So even if the plateau is real (which I doubt, given the pace of new releases and things like AlphaEvolve) and we'd only expect small fundamental improvements, some "better applications" could still mean a lot of untapped potential.


I have used Claude Code a ton and I agree; I haven't noticed a single difference since updating. Its summaries are, I guess, a little cleaner, but it has not surprised me at all in ability. I find I am correcting it and re-prompting it as much as I did with 3.7 on a TypeScript codebase. In fact, I was kind of shocked how badly it did in a situation where it was editing the wrong file, and it never thought to check that more specifically until I forced it to delete all the code and show that nothing changed with regard to what we were looking at.

I'd go so far as to say Sonnet 3.5 was better than 3.7

At least I personally liked it better.


I also liked it better, but the aider leaderboards are clear that 3.7 was better. I found it extremely overeager as a coding agent, but my guess is that it needed different prompting than 3.6 did.

This is my feeling too, across the board. Nowadays, benchmark wins seem to come from tuning, which then causes losses in other areas. o3 and o4-mini also hallucinate more than o1 on SimpleQA and PersonQA. Synthetic data seems to cause higher hallucination rates. Reasoning models are at even higher risk, since hallucinations can throw the model off track at each reasoning step.

LLMs in a generic-use sense have been done since earlier this year. OpenAI discovered this when they had to cancel GPT-5 and later released the "too costly for gains" GPT-4.5, which will be sunset soon.

I’m not sure the stock market has factored all this in yet. There needs to be a breakthrough to get us past this place.


The benchmarks in many ways seem very similar to Claude 3.7's for most cases.

That's nowhere near enough reason to think we've hit a plateau - the pace has been super fast, give it a few more months to call that...!

I think the opposite about the features: they aren't gimmicks at all, but indeed they aren't part of the core AI. Rather, it's important "tooling" adjacent to the AI that we need in order to actually leverage it. The LLM field in popular usage is still in its infancy. If the models don't improve (but I expect they will), we have a TON of room with these features and how we interact with them, feed them information, make tool calls, etc., to greatly improve usability and capability.


It's not that it isn't better; it's actually worse. It seems like the big guys are stuck in a race to overfit for benchmarks, and this is becoming very noticeable.

Well, to be fair, it's only a 0.3 difference.

It seems MUCH better at tool usage. Just had an example where I asked Sonnet 4 to split a PR I had after we had to revert an upstream commit.

I didn't want to lose the work I had done, and I knew it would be a pain to do it manually with git. The model did a fantastic job of iterating through the git commits and deciding what to put into each branch. It got everything right except for a single test that I was able to easily move to the correct branch myself.


How much have you used Claude 4?

I asked it a few questions and it responded exactly like all the other models do. Some of the questions were difficult / very specific, and it failed in the same way all the other models failed.

Great example of this general class of reasoning failure.

“AI does badly on my test therefore it’s bad”.

The correct question to ask is, of course, what is it good at? (For bonus points, think in terms of $/task rather than simply being dominant over humans.)


“AI does badly on my test, much like other AIs did before it, therefore I don't immediately see much improvement” is a fair assumption.

No, it’s really not.

“I used an 8088 CPU to whisk egg whites, then an Intel Core 9i-12000-vk4*, and they made equally mediocre meringues; therefore the latest Intel processor isn't a significant improvement over one from 50 years ago.”

* Bear with me, no idea their current naming


You’re holding them wrong. An 8088 package should be able to emulate a whisk about a million times better than an i9.

“Human can’t fly, much like other humans. Therefore it’s bad”

Spot the problem now?

AI capabilities are highly jagged, they are clearly superhuman in many dimensions, and laughably bad compared to humans in others.


Yes.

They just need to put out a simple changelog for these model updates; there's no need to make a big announcement every time to make it look like it's a whole new thing. And the version numbers are even worse.


I think you are.

I feel like the model making a memory file to store context is more than a gimmick, no?

The increases are not as fast, but they're still there. The models are already exceptionally strong; I'm not sure that basic questions can capture the differences very well.

Hence, "plateau"

"plateau" in the sense that your tests are not capturing the improvements. If your usage isn't using its new capabilities then for you then effectively nothing changed, yes.

"I don't have anything to ask the model, so the model hasn't improved"

Brilliant!

I am pretty much ready to be done talking to human idiots on the internet. It is just so boring after talking to these models.


:p

plateau means stopped

It could mean improving more and more slowly all the time, approaching an asymptote.


