That seems way too good to be true.

What's the catch?


It is not very good at hard tasks; its ranking is much worse there.


Sorry, any examples of hard tasks?


I used to defend LMSys/Chatbot Arena a lot but threw in the towel after events of the past three months.

I can give more details if you (or anyone else!) are interested.

TL;DR: it is only scoring for "How authoritative did the answer look? How much flattery and how many emojis?"


Is this not what Style Control (which IIRC they're making default soon) aims to mitigate?


I'm not 100% sure what their rationale is for it. The launch version of style control was a statistical model that penalized a few (4?) markdown shibboleths (lists, headers, ?).

Not sure if they've shared more since.

IMVHO it won't help, at all, even if they trained a perfect model that could accurately penalize it*

The main problem is that it's one-off responses, A/B tested. There's no way to connect it to all the stuff we're using to do work these days (i.e. tools / MCP servers), so at this point it's sort of skipping the hard problems we'd want to see graded.

(This situation is an example: what's more likely, that style control is a small idea applied to an intractable problem, or that Google has now released multiple free models better than Sonnet, including the latest at only 4B params?

To my frustration, I have to go and bench these things myself because I build an AI-agnostic app, but I can confirm it is not the case that Gemma 3-not-n is better than Sonnet. 12B can half-consistently make file edits, which is a major step forward for local, tbh.)

* I'm not sure how; "correctness" is a confounding metric here: we're probably much more likely to describe a formatted answer in negative terms if the answer is incorrect.

In this case I'm also setting aside how that could be done; I'm just saying it as an illustration that, no matter what, it's the wrong platform for a "how intelligent is this model?" signal at this point, post-Eliza, post-Turing, a couple of years out from ChatGPT 1.0.


The catch is that "does as well as X" is pretty much never representative of real-world performance when it comes to LLMs.

In general, all those scores are mostly useful for filtering out the models that are blatantly and obviously bad. But to determine whether a model is actually good at the specific thing you need, you'll have to evaluate it yourself to find out.
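
That kind of personal eval doesn't have to be fancy. A minimal sketch, assuming a locally pulled Ollama model; the test cases and model name are placeholders for your own tasks:

    # Sketch of a tiny personal eval: run your own prompts against a model and
    # check for must-have facts in the output. Assumes `pip install ollama`,
    # a running Ollama server, and a pulled model; cases are placeholders.
    import ollama

    CASES = [
        ("Convert '2024-03-01' (UTC midnight) to a Unix timestamp. Answer with only the number.", "1709251200"),
        ("Which HTTP status code means 'Too Many Requests'? Answer with only the number.", "429"),
    ]

    def ask(prompt: str, model: str = "llama3") -> str:
        resp = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
        return resp["message"]["content"]

    passed = sum(expected in ask(prompt) for prompt, expected in CASES)
    print(f"{passed}/{len(CASES)} cases passed")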


Is Zed managing the containerized dev environments, or creating multiple worktrees or anything like that? Or are they all sharing the same work tree?


As far as I know, they are sharing a single work tree. So I suppose that could get messy by default.

That said, it might be possible to tell each agent to create a branch and do work there? I haven't tried that.

I haven't seen anything about Zed using containers, but again you might be able to tell each agent to use some container tooling you have in place since it can run commands if you give it permission.


> rather than looking at simulations

You mean like automated test suites?


automated visual fuzzy-testing with some self-reinforcement loops

There are already libraries for QA testing, and VLMs can give critique on a series of screenshots automated by a Playwright script per branch.
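
For example, a rough sketch of the screenshot half of that loop using Python Playwright; routes and the dev-server URL are placeholders, and the VLM call is stubbed out (see the local-model sketch further down):

    # Sketch: per-branch screenshot pass whose output a VLM can critique.
    # Assumes `pip install playwright` plus `playwright install chromium`;
    # critique_screenshot() is a stand-in for whatever VLM you wire up.
    import os
    from playwright.sync_api import sync_playwright

    PAGES = ["/", "/settings"]  # routes to spot-check; adjust per app

    def critique_screenshot(path: str) -> str:
        # Placeholder: swap in a real VLM call.
        return f"TODO: send {path} to a VLM for critique"

    def capture(base_url: str, out_dir: str = "shots") -> list[str]:
        os.makedirs(out_dir, exist_ok=True)
        paths = []
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page(viewport={"width": 1280, "height": 800})
            for route in PAGES:
                page.goto(base_url + route)
                path = f"{out_dir}/{route.strip('/') or 'home'}.png"
                page.screenshot(path=path, full_page=True)
                paths.append(path)
            browser.close()
        return paths

    for shot in capture("http://localhost:3000"):
        print(critique_screenshot(shot))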


Cool. Putting vision in the loop is a great idea.

Ambitious idea, but I like it.


I used Cline to build a tiny testing helper app and this is exactly what it did!

It made changes in TS/Next.js given just the boilerplate from create-next-app, ran `yarn dev`, then opened its mini LLM browser and navigated to localhost to verify everything looked correct.

It found one mistake, fixed the issue, then ran `yarn dev` again, opened a new browser, navigated to localhost (pointing at the original server it brought up, not the new one on another port), and confirmed the change was correct.

I was very impressed but still laughed at how it somehow backed its way into a flow that worked, but only because Next has hot-reloading.


SmolVLM, Gemma, LLaVA, in case you wanna play with some of the ones I've tried.

https://huggingface.co/blog/smolvlm

Recently both llama.cpp and Ollama got better support for them too, which makes this kind of integration with local/self-hosted models more attainable/less expensive.
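
A hedged sketch of that kind of local integration, using the Ollama Python client; the model name, prompt, and file path are all placeholders:

    # Sketch: asking a locally served VLM (via Ollama) to critique a screenshot.
    # Assumes `pip install ollama`, a running Ollama server, and a pulled
    # vision-capable model (e.g. `ollama pull llava`).
    import ollama

    def critique_screenshot(path: str) -> str:
        response = ollama.chat(
            model="llava",  # or another vision model your Ollama build supports
            messages=[{
                "role": "user",
                "content": "Does this page render correctly? List any visual defects.",
                "images": [path],  # local file path; the client handles encoding
            }],
        )
        return response["message"]["content"]

    print(critique_screenshot("shots/home.png"))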


Also this for the visual regression testing part, and you can add some AI into the mix ;) https://github.com/lost-pixel/lost-pixel


Yes, the above reply is more what I meant! Vision / visualization, not just more automated testing.

Definitely ambitious!


I'm the dev of a semi-popular FOSS coding agent -- I agree that "vibe coding" as a term is fairly cringe.

The thing is, I see AI coding tools, especially agents, as a major force multiplier for individual devs and small teams. The output of AI ranges from total AI slop to actually good, well-scoped PRs, or even the occasional fix for bugs that were otherwise impossible to fix within a given time-box. I don't see how using tools like this is bad at all. A lot of it really comes down to the operator -- when I use coding agents, I end up leaning even harder into my background/knowledge of higher-level abstractions, architecture, design decisions, etc.


> I don't see how using tools like this is bad at all.

It's a bit like putting sawdust in a car instead of oil. She'll run smooth as silk...for a few miles.

Right now, we have a solid core of developers with enough knowledge (built mainly by actually coding themselves) to be able to call bullshit on LLMs when they write poor code. For those people, yes, it's a solid tool.

The problem is when you consider coding in 10 years. The draw to use vibe coding both in schools and among nascent hackers will be incredibly strong. Those folks will get positive reinforcement from having code that just magically works without having to spend a ton of time designing and actually coding. If a whole generation of would-be programmers comes up using that technology exclusively, the core of hackers who have the chops to recognize bad code and prevent maintenance headaches down the road will slowly dwindle over time.

At that point, LLMs will be consuming LLM code that hasn't been properly reviewed and sanitized. Garbage in, garbage out (unless there is some magical AGI breakthrough that makes this all moot).

Going from a manual saw to a buzz saw can really cut down the time needed to get the job done, but without the know-how to use a saw effectively, eventually it's fingers that get cut instead of time.


I like the buzz saw analogy better than the sawdust for oil analogy :)


I think most of the thread is talking about SSR with partial HTML replacement.


There are AI models that can generate 3D models, e.g. Hunyuan3D. Not quite CAD models, but maybe this could eventually be adapted to that use case.

Then there's the possibility of an agent automating an actual CAD program. This has already been done with game dev, e.g. Unity MCP.
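
For illustration only, a hypothetical sketch of the MCP route using the MCP Python SDK with CadQuery standing in as the CAD backend; the tool and its parameters are made up, not an existing integration:

    # Hypothetical sketch: an MCP server exposing one parametric-CAD tool that
    # an agent could call. Assumes `pip install mcp cadquery`.
    import cadquery as cq
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("cad-tools")

    @mcp.tool()
    def make_box(length: float, width: float, height: float,
                 out_path: str = "box.step") -> str:
        """Create a simple box solid and export it as a STEP file."""
        solid = cq.Workplane("XY").box(length, width, height)
        cq.exporters.export(solid, out_path)
        return f"Wrote {out_path}"

    if __name__ == "__main__":
        mcp.run()  # serve the tool over stdio to an MCP-capable agent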


Things like Hunyuan 3D are nice for game assets and the like, but they aren't able to really do CAD well. That would be like using Stable Diffusion to code.


This looks awesome -- do you cache or store the logs, or is that left up to k8s?


Thanks! Kubetail doesn't cache or store logs itself. By default, it uses the Kubernetes API to fetch logs from your cluster and send them directly to your client (browser or terminal). If the "Kubetail Cluster API" is installed then it uses Kubetail's custom agent to do the same.
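
For reference, a minimal sketch of that same Kubernetes API surface using the official Python client (this is not Kubetail's code; the pod name is a placeholder):

    # Sketch: tailing a pod's logs straight from the Kubernetes API.
    # Assumes `pip install kubernetes` and a working kubeconfig.
    from kubernetes import client, config, watch

    config.load_kube_config()
    v1 = client.CoreV1Api()

    w = watch.Watch()
    for line in w.stream(v1.read_namespaced_pod_log,
                         name="my-pod",          # placeholder pod name
                         namespace="default"):
        print(line)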


Presumably I can install this (the web frontend) into k8s itself. Is there a helm chart or kustomize?

Would be really cool to install it into k8s and just hit a hosted web endpoint with all the logs and grep/exploration capabilities kubetail has.


Yep! You can use Kubetail on your desktop (using the CLI tool) or you can install it directly in your cluster using helm:

    helm repo add kubetail https://kubetail-org.github.io/helm-charts/
    helm install kubetail kubetail/kubetail --namespace kubetail-system --create-namespace
Then you can access it using `kubectl proxy` or `kubectl port-forward`:

    kubectl port-forward -n kubetail-system svc/kubetail-dashboard 8080:8080
You can also configure an ingress using the values.yaml file (https://github.com/kubetail-org/helm-charts/blob/main/charts...)


Very cool. We're using helm currently for our startup. It works, but I'm not super excited about it. We've also played with deploying k8s resources with ansible. It works better than expected, but is a bit clunky.

We've also written operators and controllers for some things, but that's a lot of work and not always worth it.

Your project looks like a very interesting alternative, will check it out!


Thanks! Very appreciated.

Also, it has a kro-like operator for deploying packages defined as code. If the operators you are writing are just managing resources and not doing anything special with the outside world, then you can use the AirTrafficController to build your CRDs and all you need to provide is the program to transform a CR into the underlying resources. Just in case that part interests you.

Thank you for the feedback. It's very appreciated.


> So I rewrote the whole thing from scratch using Python

So this isn't really codex then?


Why Kotlin?


JVM compatible, native UI.

