What I expect from a human is to ask "in which language?", because it makes a difference. If no language was supplied, I expect a brief summary of null coalescing and shorthand ternary options with useful examples in the most popular languages.
--
The JavaScript example should have mentioned the use of `||` (the OR operator) to achieve the same effect as a shorthand ternary. It's common knowledge.
In PHP specifically, `??` allows you to null coalesce array keys and other types of complex objects. You don't need to write `isset($arr[1]) ? $arr[1] : "ipsum"`, you can just `$arr[1] ?? "ipsum"`. TypeScript has it too and I would expect anyone answering about JavaScript to mention that, since it's highly relevant for the ecosystem.
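To make that concrete, here's a minimal runnable sketch of the comparison (a hypothetical `$arr` with only index 0 set):

```php
<?php
$arr = ["lorem"]; // index 1 is never set

// Pre-PHP 7 style: explicit isset() guard before touching the missing key
$value = isset($arr[1]) ? $arr[1] : "ipsum"; // "ipsum"

// PHP 7+ null coalescing operator: same result, no isset() boilerplate,
// and no "Undefined array key" diagnostic for the missing index
$value = $arr[1] ?? "ipsum"; // "ipsum"
```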
Also in PHP, there is `?:`, which is similar to what `||` does in JavaScript in an assignment context; due to type juggling, it can act as a null coalescing operator too (although, unlike `??`, not safely for array keys or other complex accesses).
The PHP example they present, therefore, is plain wrong and would lead to a warning for trying to access an unset array key, something that the `??` operator (not mentioned in the response) would solve.
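A short sketch of both points, the type juggling of `?:` versus the null-only check of `??`, and the missing-key warning that `??` suppresses (output comments assume PHP 8, where the missing-key diagnostic is a warning rather than a notice):

```php
<?php
// ?: falls back on *any* falsy value (type juggling), much like || in JavaScript;
// ?? only falls back on null.
$name = "";
var_dump($name ?: "anonymous"); // string(9) "anonymous"  ("" is falsy)
var_dump($name ?? "anonymous"); // string(0) ""            ("" is not null)

// On a missing array key, ?: still reads the key and emits an
// "Undefined array key" warning before falling back; ?? stays silent.
$arr = ["lorem"];
var_dump($arr[1] ?: "ipsum"); // warning, then string(5) "ipsum"
var_dump($arr[1] ?? "ipsum"); // string(5) "ipsum", no warning
```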
I would go as far as explaining null conditional accessors as well, like `$foo?->bar` or `foo?.bar`. Those are often colloquially called Elvis operators and fall within the same overall problem-solving category.
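For completeness, a sketch of the nullsafe accessor (PHP 8.0+) with hypothetical `User`/`Profile` classes; combined with `??` it covers the whole chain:

```php
<?php
class Profile {
    public string $bio = "hello";
}

class User {
    public ?Profile $profile = null;
}

$user = new User();

// Without ?->, reading ->bio on the null ->profile would warn
// ("Attempt to read property ... on null"); ?-> short-circuits to null instead.
$bio = $user->profile?->bio;             // null
$bio = $user->profile?->bio ?? "(none)"; // "(none)"
```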
The LLM answer is a dangerous mix of incomplete and wrong. It could lead a beginner to adopt an old bad practice, or leave a beginner without a more thorough explanation. Worst of all, the LLM makes those mistakes with confidence.
--
What I think is going on is that null handling is such a basic task that programmers learn it in the first few years of their careers and almost never write about it. There's no need to. I'm sure a code-completion LLM can code using those operators effectively, but LLMs cannot talk about them consistently. They'll only get better at it if we get better at it, and we often don't need to write about it.
In this particular Elvis operator thing, there has been no significant improvement in the correctness of the answer in __more than 2 whole years__. Samples from ChatGPT in 2023 (note my image date): https://imgur.com/UztTTYQ and https://imgur.com/nsqY2rH.
So, _for some things_, contrary to what you suggested before, LLMs are not getting that much better.
Having read the reply in 2.5 Pro, I have to agree with you there. I'm surprised it whiffed on those details. They are fairly basic and rather important. It could have provided a better answer (I fed your reply back to it at https://g.co/gemini/share/7f87b5e9d699 ), but it did a crappy job deciding what to include in its initial response.
I don't agree that you can pick one cherry example and use it to illustrate anything general about the progress of the models, though. There are far too many counterexamples to enumerate.
(Actually I suspect what will happen is that we'll change the way we write documentation to make it easy for LLMs to assimilate. I know I'm already doing that myself.)
> I don't agree that you can pick one cherry example
Benchmarks and evaluations are made of cherry-picked examples. What makes my example invalid, and benchmark prompts valid? (It's a rhetorical question; you don't need to answer.)
> write documentation to make it easy for LLMs to assimilate.
If we ever do that, it means LLMs failed at their job. They are supposed to help and understand us, not the other way around.
> If we ever do that, it means LLMs failed at their job. They are supposed to help and understand us, not the other way around.
If you buy into the whole AGI thing, I guess so, but I don't. We don't have a good definition of intelligence, so it's a meaningless question.
We do know how to make and use tools, though. And we know that all tools, especially the most powerful and/or hazardous ones, reward the work and care that we put into using them. Further, we know that tool use is a skill, and that some people are much better at it than others.
> What makes my example invalid, and benchmark prompts valid?
Your example is a valid case of something that doesn't work perfectly. We didn't exactly need to invent AI to come up with something that didn't work perfectly. I have examples of using LLMs to generate working, useful code in advanced, specialized disciplines, code that I frankly don't understand myself and couldn't have written without months of study, but that I can validate.
Just one of those examples is worth a thousand examples like yours, in my book. I can now do things that were simply impossible for me before. It would take some nerve to demand godlike perfection on top of that, or to demand useful results with little or no effort on my part.
It's the same principle. A tool is supposed to assist us, not the other way around.
An LLM, "AGI magic" or not, is supposed to write for me. It's a tool that writes for me. If I am writing for the tool, there's something wrong with it.
> I have examples [...] Just one of those examples is worth a thousand examples like yours
Please, share them. I shared my example. It can be a very small "bug report", but it's real and reproducible. Other people can build on it, either to improve their "tool skills" or to improve LLMs themselves.
An example that is shared is worth much more than an anecdote.
It's hard to get too specific without running afoul of NDAs and such, since most of my work is for one customer or another, but the case that really blew me away was when I needed to find a way to correct an oscillator that had inherent stability problems due to a combination of a very good crystal and very poor thermal engineering on the OEM's part. The customer uses a lot of these oscillators, and they are a massive pain point in production test because they often perform so much worse than they should.
I started out brainstorming with o1-pro, trying to come up with ways to anticipate drift on multiple timescales, from multiple influences with differing lag times, and correct it using temperature trends measured a couple of inches away on a different component. It basically said, "Here, train this LSTM model to predict your drift observations from your observed temperature," and spewed out a bunch of cryptic-looking PyTorch code. It would have been familiar enough to an ML engineer, I'm sure, but it was pretty much Perl to me.
I was like, Okaaaaayyy....? but I tried it anyway, suggested hyperparameters and all, and it was a real road-to-Damascus moment. Again, I can't share the plots and they wouldn't make sense anyway without a lot of explanation, but the outcome of my initial tests was freakishly good.
Another model proved to be able to translate the Python to straight C for use by the onboard controller, which was no mean feat in itself (and also allowed me to review it myself), and now that problem is just gone. Basically for free. It was a ridiculous, silly thing to try, and it worked.
When this tech gets another 10x better, the customer won't need me anymore... and that is fucking awesome.
I too have all sorts of secret stuff that I wouldn't share. I'm not asking for that. Isolating and reproducing example behavior is different from sharing your whole work.
> It would have been familiar enough to an ML engineer, I'm sure, but it was pretty much Perl to me.
How can you be sure that the solution doesn't have obvious mistakes that an ML engineer would spot right away?
> When this tech gets another 10x better
A chainsaw is way better than a regular saw, but it's also more dangerous. Learning to use it can be fun. Learning not to cut your toes is also important.
I am looking for ways in which LLMs could potentially cut people's toes.
I know you don't want to hear that your favorite tool can backfire, and you're still skeptical despite having experienced the example I gave you firsthand. However, I was still hopeful that you could understand my point.
See if this looks any better (I don't know PHP): https://g.co/gemini/share/7849517fdb89
If it doesn't, what specifically is incorrect?