An LLM has never saved me time. It has always produced something that doesn't quite work, has the rough shape of what I want, but somehow always gets all the details wrong.
I can type up what I want much faster and be sure it's at least solving the right problem, even if it may have bugs.
There are also tools to generate boilerplate that work much much better than LLMs. And they're deterministic.
If you do not plan out the architecture soundly, no amount of prompting will fix it. I know this because my "handmade" project, built around backward compatibility on a horrible architecture, keeps getting bad fixes from LLMs, while the projects that rely on planning the features and architecture up front end up working right.
I think that's true, but something even more subtle is going on. The quality of the LLM output depends on how it was prompted in a way more profound than I think most people realize. If you prompt the LLM using jargon and lingo that indicate you are already well experienced with the domain space, the LLM will role-play an experienced developer. If you prompt it like you're a clueless PHB who's never coded, the LLM will output shitty code to match the style of your prompt. This extends to architecture: if your prompts are written with a mature understanding of the architecture that should be used, the LLM will follow suit, but if not, the LLM will just slap together something that looks like it might work but isn't well thought out.
I don't care if the machine has a soul, I only care what the machine can produce. With good prompting, the machine produces more "thoughtful" results. As an engineer, that's all I care about.
It is magical thinking to claim that LLMs are definitely physically incapable of thinking. You don't know that. No one knows that, since such large neural networks are opaque black boxes that resist interpretation and we don't really know how they function internally.
You are just repeating that because you read that before somewhere else. Like a stochastic parrot. Quite ironic. ;)
They really aren't that mysterious. We can confidently say that they function at the lexical level, using Monte Carlo principles to carve out a likely path in lexical space.
The output depends on the distribution of n-grams in the training set, and the composition of the text in its context window.
This process cannot produce reasoning.
1) an LLM cannot represent the truth value of statements, only their likelihood of being found in its training data.
2) because it uses lexical data, an LLM will answer differently based on the names / terms used in a prompt.
Both of these facts contradict the idea that the LLM is reasoning, or "thinking".
This isn't really a very hot take either; I don't think I've talked to a single researcher who thinks that LLMs are thinking.
You're just strawmanning now. I've prompted extremely well-specced, contained features, and the LLM has failed nonetheless.
In fact, the more details I give it about a specific problem, the more it seems to hallucinate. Presumably because it is more outside the training set.
Because you need to consider the context window, and therefore separate the prompts by task. Separating by task and planning things out is still your own work; no AI can replace that. Assuming you do that properly, AI-generating the code may save you up to 15% of your full work time. Please reread my comment: "If you do not plan out the architecture soundly". Planning includes breaking the task down and making multiple prompts.
Our job is to break problems down into simpler ones until they are easily solvable, and if a machine simplifies the last steps, it is fine.
You're going to get a lot of "skill issue" comments but your experience basically matches mine. I've only found LLMs to be useful for quick demos where I explicitly didn't care about the quality of implementation. For my core responsibility it has never met my quality bar and after getting it there has not saved me time. What I'm learning is different people and domains have very different standards for that.
> An LLM has never saved me time. It has always produced something that doesn't quite work, has the rough shape of what I want, but somehow always gets all the details wrong.
This reads like a skill issue on your end, at least in part on the prompting side.
It does take time to reach a point where you can prompt an LLM sufficiently well to get a correct answer in one shot, developing an intuitive understanding of what absolutely needs to be written out and what can be inferred by the model.
I’m curious about how you landed “git gud; prompt better” and not “maybe the domain I work in is a better fit for LLM code”. Or, to be a bit less generous, consider the possibility that the code you’re generating is boilerplate, marshaling, and/or API calls. A facade of perceived complexity over something that’s as complex as a filter-map or two.
In the past 2 months I've been using all the SOTA models to help me design a new DSL for narrative scripting (such as game storytelling) and a C# runtime implementation of the script player engine.
The language spec and design is about 95% authored by me up to this point; I have the LLMs work on the 2nd layer (the implementation specs/guidelines) and the 3rd layer (the concrete C# implementation).
Since it's a new language, I consider these somewhat novel tasks for LLMs (at least, not boilerplate stuff like an HTTP API or a CRUD service). I'd say these LLMs have been very helpful - you can tell they sometimes get confused and have trouble to comply to the foreign language spec and design - but they are mostly smart enough to carry out the objectives, and they get better and better once the project is on track and has plenty of files/resources to read and reference.
And I'd also say "prompt better" is an important factor, just a much more nuanced/complicated one. I started with 0 experience with LLM agents, have learned a lot about how to tame them, and developed a protocol to collaborate with agents. All of this comes from countless rounds of trial and error, but in the end it boils down to "prompt better".
I wonder if my intuition here is correct; I would posit that “PL implementation” is a far more popular and well-explored field than it seems. How many toy/small/labor-of-love langs make it to Show HN? How many more simply don’t?
I’ve never personally caught the language implementation bug. I appreciate your perspective here.
I totally agree, and I was fully aware of how commonly people make languages for fun when I replied.
But I feel like the rationale still stands: considering the nature of LLMs, common boilerplate tasks are easy because they can kind of just "decompress" from training data. But for a new language design, unless the language is almost identical to some other one captured by the model, "decompression" would just fail.
As someone who has implemented a fair few DSLs, lexical and syntactic analysis is pretty much the same anywhere, and the structure of the lexer/parser does not really depend on the grammar of the language.
And even semantic analysis is at least very similar in most PLs. Even DSLs. Assuming you're using concepts like variables and functions.
When it comes to codegen / interpreter runtimes, things start to diverge. But this also depends on the use case. More often than not a DSL is a one-to-one map to an existing language, with syntactic sugar on top.
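To make the "same lexer structure, different grammar" point concrete, here is a minimal sketch of a table-driven lexer. The two rule tables (`NARRATIVE_RULES` and `ARITH_RULES`) are made up for illustration and are not the syntax of any language discussed in this thread; the only thing that changes between the two languages is the data passed to `make_lexer`.

```python
import re


def make_lexer(rules):
    """Build a lexer from (token_name, regex) pairs; the loop below never changes."""
    pattern = re.compile("|".join(f"(?P<{name}>{regex})" for name, regex in rules))

    def lex(text):
        tokens = []
        pos = 0
        while pos < len(text):
            match = pattern.match(text, pos)
            if match is None:
                raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
            # Whitespace is matched but not emitted.
            if match.lastgroup != "SKIP":
                tokens.append((match.lastgroup, match.group()))
            pos = match.end()
        return tokens

    return lex


# Hypothetical token rules for a small narrative-scripting DSL.
NARRATIVE_RULES = [
    ("SKIP", r"\s+"),
    ("SPEAKER", r"@\w+"),
    ("STRING", r'"[^"]*"'),
    ("ARROW", r"->"),
    ("IDENT", r"\w+"),
]

# Hypothetical token rules for a small arithmetic language.
ARITH_RULES = [
    ("SKIP", r"\s+"),
    ("NUMBER", r"\d+"),
    ("OP", r"[+\-*/()]"),
    ("IDENT", r"[A-Za-z_]\w*"),
]

lex_story = make_lexer(NARRATIVE_RULES)
lex_arith = make_lexer(ARITH_RULES)

print(lex_story('@alice "hi" -> next'))
print(lex_arith("x + 42"))
```

The parser on top can follow the same pattern (a fixed driver plus grammar-specific rules), which is why this layer tends to look familiar across DSLs.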
The points you brought up all are valid. Lexer, parser and general concepts are not language-specific, yes, and I wasn't talking about how the implementation is different.
When I said "you can tell they sometimes get confused and have trouble to comply to the foreign language spec and design", I was thinking about the many times they just fail to write in my language even when provided with the full language specs. LLMs don't "think", and boilerplate is easy for them because highly similar syntax structures, even identical code, exist in their training data; they are kind of just copying stuff. But that doesn't work as well when they are tasked with writing in an original language that is... too creative.
I am prompting better. It doesn't help the LLM be more productive than me on a regular Tuesday.
Sure, I can get the task done by delegating everything to an agentic workflow, but it just adds a bunch of useless overhead to my work.
I still need to know what the code does at the end of the day, so I can document it and reason about it. If I write the code myself, it's easy. If an LLM does it, it's a chore.
And even without those concerns, the LLM is still slower than me. Unless it's trivial boilerplate, in which case other tools serve me better and cheaper.
I'll note that a compiler is one of the most well understood and implemented software projects, much of it open source, which means the LLM has a lot of prior art that it can copy.
When web search first arrived, the same thing happened. That is, some people didn't like using the tool because it wasn't finding what they wanted. This is still true for a lot of folks today, actually.
It's less "git gud; prompt better", and more, "be able to explain (well) what you want as the output". If someone messages the IT guy and says "hey my computer is broken" - what sort of helpful information can the IT guy offer beyond "turn it on and off again"?
So how do you reconcile your anecdotal experience with the claims made by public figures in the industry who we can all agree are at least pretty good engineers? I think that's important because if we want to stay ~anonymous, neither you nor I can verify the reputation of one another (and therefore, one another's relative mastery of the "Craft").
Here are some well known names who are now saying they regularly use LLM's for development. For many of these folks, that wasn't true 1-2 years ago:
My point being: some random guy on the internet says LLMs have never been useful for them and only output garbage, while some of the best engineers in the field use the same tools and say the exact opposite.
>Here are some well known names who are now saying they regularly use LLM's for development. For many of these folks, that wasn't true 1-2 years ago:
This is a huge overstatement that isn't supported by your own links.
- Donald Knuth: the link is him acknowledging someone else solved one of his open problems with Claude. Quote: "It seems that I’ll have to revise my opinions about “generative AI” one of these days."
- Linus Torvalds: used it to write a tool in Python because "I know more about analog filters—and that’s not saying much—than I do about python" and he doesn't care to learn. He's using it as a copy-paste replacement, not to write the kernel.
- John Carmack: he's literally just opining on what he thinks will happen in the future.
You are overstating those sources. That alone makes me doubt that you're engaging in this discussion in good faith.
I read them all, and in none of them do any of the three say that they "regularly use LLMs for development".
Carmack is speculating about how the technology will develop. And Carmack has a vested interest in AI, so I would not put any value on this as an "engineer's opinion".
Torvalds has vibe coded one visualizer for a hobby project. That's within what I might use to test out LLM output: simple, inconsequential, contained. There's no indication in that article that Linus is using LLMs for any serious development work.
Knuth is reporting about somebody else using LLMs for mathematical proofs. The domain of mathematical proofs is much more suitable for LLM work, because the LLM can be guided by checking the correctness of proofs.
And Knuth himself only used the partial proof sent in by someone else as inspiration for a handcrafted proof.
I don't mind arguing this case with you, but please don't fabricate facts. That's dishonest.
> I’m curious about how you landed “git gud; prompt better” and not “maybe the domain I work in is a better fit for LLM code”.
1. Personal experience. Lazy prompting vs careful prompting.
2. They're coincidentally good at things I'm good at, and shit at things I don't understand.
3. Following from 2, when used by somebody who does understand a problem space which I do not, they easily succeed. That dog vibe coding games succeeded in getting Claude to write games because his master knew a thing or two about it. I, on the other hand, have no game dev experience, and almost no hobby experience with games specifically, so I struggle to get any game code that even remotely works.
Irrespective of the domain you specifically listed in 3 (game dev is, believe it or not, one of the “more complex” domains), you have completely missed the point.
> 2. They're coincidentally good at things I'm good at, and shit at things I don't understand.
This may well be! In a perfect world this would be balanced with the knowledge that maybe “the things you’re good at” are objectively* easier than “things you don’t understand”. Speaking for myself, I’m proficient in many more easy things than hard things.
I have definitely considered the possibility that I'm simply good at easy things and the LLM is good at easy things, and that hard things are hard for both of us. And there certainly must be some element of that going on, but I keep noticing that different people get different quality results for the same kind of problems, and it seems to line up with how good they themselves would be at that task. If you know the problem space well, you can describe the problem (and approaches to it) with a precision that people unfamiliar with the problem space will struggle with.
I think you can observe this in action by making a vague request, seeing how it does, then rolling back that work, making a more precise request using relevant jargon, and comparing the results. For example, I asked Claude to make a system that recommends files with similar tags. It gave me a recommender that just orders files by how many tags they had in common with the query file. This is the kind of solution that somebody may think up quickly, but it doesn't actually work great in practice. Then I reverted all of that and instead specified that it should use a vector space model with cosine similarity. It did pretty well, but there was something subtly off. That is, however, about the limit of my expertise in this direction, so I tabbed over to a session with ChatGPT and discussed the problem on a high level for about 20 minutes, then asked ChatGPT to write up a single terse, technically precise paragraph describing the problem. I told ChatGPT to use no bullet points and write no pseudocode, telling it the coding agent was already an expert in the codebase so let it worry about the coding. I gave that paragraph to Claude and suddenly it clicked; it banged out a working solution without any drama. So I conclude the quality of the prompting determined the quality of the results.
The parent is specifically talking about producing boilerplate code (a domain in which LLMs excel) and not having had any success at that. It's therefore not a leap of logic to assume they haven't put (enough) effort into getting better at prompting first, which is perfectly fine per se but leans towards a skill issue and not an immutable property of gen AI.
The uncomfortable fact remains that one cannot really expect to get much better results from an LLM without putting in some work themselves. They aren't magical oracles.
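For anyone wondering what the difference between those two requests looks like in code, here is a rough sketch. It is not the code either model produced; the file names and tags are invented, and the vectors are plain binary tag vectors with no weighting, which is only one of several ways to set up a vector space model.

```python
import math


def shared_tag_count(query_tags, other_tags):
    """Naive score: how many tags the two files have in common."""
    return len(set(query_tags) & set(other_tags))


def cosine_similarity(query_tags, other_tags):
    """Cosine similarity of binary tag vectors: |A & B| / sqrt(|A| * |B|)."""
    a, b = set(query_tags), set(other_tags)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def recommend(query, files, score):
    """Rank every other file by score against the query file, best first."""
    others = [name for name in files if name != query]
    return sorted(others, key=lambda name: score(files[query], files[name]), reverse=True)


# Invented example data: "a" carries many tags unrelated to the query.
files = {
    "q": ["python", "tutorial"],
    "a": ["python", "tutorial", "video", "2019", "draft", "misc"],
    "b": ["python", "tutorial"],
    "c": ["python"],
}

print(recommend("q", files, shared_tag_count))   # "a" ties with "b" on raw overlap
print(recommend("q", files, cosine_similarity))  # "a" drops because of its extra tags
```

The raw overlap count cannot tell the heavily tagged file apart from the exact match, while the cosine version penalizes tags the query does not share. Weighting rare tags more heavily (for example with IDF) would be a natural next step, but that is an assumption beyond what I described above.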