As others have said, I find it very useful for smaller and simpler cases. Focused, small functions. Both Copilot and ChatGPT (and also Llama 3 via Ollama) are often good at writing tests for edge cases that I might have forgotten.
But anything more complex and it is very hit or miss. I'm now trying to use GPT-4 Turbo to write some integration tests for Go code that talks to a database, and it is mostly a disaster.
It will constantly mock things that I want tested, and write useless tests that do basically nothing because either everything is mocked or the setup is not complete.
I'm settling on using it for tests of those small, pure functions, and otherwise using it as a guide to find possible bugs and edge cases in more complex code, then writing the tests myself and asking it in a separate prompt whether they would cover those cases.
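For reference, this is roughly the class of function where I trust the generated tests. The Truncate function and its test below are a made-up sketch, not code from my project:

    package text

    import "testing"

    // Truncate is a hypothetical small, pure function: it shortens s to at
    // most max runes, appending "..." when it has to cut.
    func Truncate(s string, max int) string {
        r := []rune(s)
        if len(r) <= max {
            return s
        }
        if max <= 3 {
            return string(r[:max])
        }
        return string(r[:max-3]) + "..."
    }

    // The kind of test an LLM reliably gets right for a function like this,
    // including edge cases (empty input, exact length, multi-byte runes).
    // In a real project this would live in truncate_test.go.
    func TestTruncate(t *testing.T) {
        if got := Truncate("", 5); got != "" {
            t.Errorf("empty input: got %q", got)
        }
        if got := Truncate("hello", 5); got != "hello" {
            t.Errorf("exact length: got %q", got)
        }
        if got := Truncate("héllo wörld", 8); got != "héllo..." {
            t.Errorf("multi-byte input: got %q", got)
        }
    }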
Like most people who actually use AI heavily these days, I think the usefulness of AI for coding increases a lot if you already have a pretty good grasp of the subject and the problem space you're working in. If you already know roughly what you want and how to ask for it, these tools can be a huge time saver on the smaller and simpler things.
The most value I have ever gotten out of AI for coding was when I refactored about 20,000 lines of gomega assertions into the more robust complex object matcher pattern. It did a good chunk of the grunt work quickly and was probably 85% accurate.
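For anyone who hasn't used it, the pattern in question is gomega's gstruct composite matchers. A simplified before/after sketch (hypothetical User struct, not the actual code I refactored):

    package user_test

    import (
        "testing"

        . "github.com/onsi/gomega"
        "github.com/onsi/gomega/gstruct"
    )

    // Hypothetical struct standing in for the real domain objects.
    type User struct {
        Name  string
        Age   int
        Email string
    }

    func TestUser(t *testing.T) {
        g := NewWithT(t)
        user := User{Name: "alice", Age: 30, Email: "alice@example.com"}

        // Before: one assertion per field, repeated thousands of times.
        g.Expect(user.Name).To(Equal("alice"))
        g.Expect(user.Age).To(BeNumerically(">=", 18))
        g.Expect(user.Email).To(ContainSubstring("@"))

        // After: a single composite matcher via gomega/gstruct, which
        // reports every mismatched field in one failure message.
        g.Expect(user).To(gstruct.MatchFields(gstruct.IgnoreExtras, gstruct.Fields{
            "Name":  Equal("alice"),
            "Age":   BeNumerically(">=", 18),
            "Email": ContainSubstring("@"),
        }))
    }

Converting thousands of the former into the latter is exactly the kind of mechanical-but-fiddly transformation an LLM handles well.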
It can work for more complex tests, but you have to give it an initial test that already sets everything up and utilizes mocks correctly. From there it will generate mostly correct tests.
This is where prompt engineering becomes more important. Next time consider prepending some kind of plain-English set of expectations before pasting your code. Something like: “I want you to write tests for this code. Here are the expected behaviors <expected behaviors list>, and here are unexpected behaviors <unexpected behaviors list>. Tests should pass if they adhere to the expected behaviors and fail if they have unexpected behaviors. Here is the code: <code>”.
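As a sketch of what that template looks like if you wrap it in code (plain string building; nothing here is tied to any particular tool or API):

    package prompt

    import (
        "fmt"
        "strings"
    )

    // BuildTestPrompt assembles the "expectations first, code last" prompt
    // described above. Purely illustrative.
    func BuildTestPrompt(expected, unexpected []string, code string) string {
        return fmt.Sprintf(
            "I want you to write tests for this code.\n"+
                "Here are the expected behaviors:\n- %s\n"+
                "Here are unexpected behaviors:\n- %s\n"+
                "Tests should pass if they adhere to the expected behaviors "+
                "and fail if they have unexpected behaviors.\n"+
                "Here is the code:\n%s\n",
            strings.Join(expected, "\n- "),
            strings.Join(unexpected, "\n- "),
            code,
        )
    }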
Like most LLM generation, though, it's not deterministic, and as you mentioned originally it takes some verification of the output. I still think that, even with the extra steps, it saves time when applied to the right scenarios. The longer the input, the higher the hallucination count in my experience, so I always keep the code I provide to the smallest chunk that still has enough context.
I can see them being useful for reducing the tedious cases. Especially the failure paths. Otherwise, my view is that if a unit test can be easily auto generated and the code is not already covered by a feature/integration test, then maybe it's a waste of time to look at it. I'd be more interested in integration tests generated from a description... which doesn't work that great yet.
Adding the simple unit tests means the current design of the internals suddenly becomes more "final" than before, and maybe more final than intended.
The idea behind the Facebook paper was to generate extra tests, in the style of the ones that already exist, that cover extra cases.
Generating tests from a description will likely suck because the LLM will just generate the obvious tests and possibly make a logic mistake, when what you really want is the tricky cases.
Many people see unit tests as a chore, rather than the foundation of a codebase. If every project could get 60-80% coverage basically for free (except some super novel or intricate path, which should already have tons of unit tests), I see that as a net win for everyone.
For many of the CRUD platforms, you could even get close to 90% with a bit of hand holding.
So I see nothing wrong.
As they say: don't let perfect be the enemy of good.
For low-risk boilerplate it helps me a lot: I know what I want, and it doesn't take any mental toll to validate that the generated code is correct.
For anything with more complex logic I don't get much gain. Having to read and understand the nuances of the generated logic is more taxing on my mind than just writing it myself, and I got tired of having to correct hallucinations most of the time.
For very simple test cases of simple methods/functions it's OK. Once the code starts to get complicated, you'll lose much more time trying to steer the LLM into writing what you need than simply doing it yourself. Setting up and creating mocks in particular seems to be where the most difficulty lies.
Once you've created the test, you can use the LLM as a code-completion tool to test for more things when you have a complex result, or you can ask it to extrapolate from the test data to create more test cases.
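To make "extrapolate from test data" concrete: hand it a table-driven test like the sketch below (ParseMajor is a hypothetical function, included only so the example compiles) and it's usually good at proposing additional rows:

    package version

    import (
        "fmt"
        "strconv"
        "strings"
        "testing"
    )

    // ParseMajor is a hypothetical function under test: it extracts the major
    // version from a "vX.Y[.Z]" string. The interesting part is the table.
    func ParseMajor(s string) (int, error) {
        if !strings.HasPrefix(s, "v") {
            return 0, fmt.Errorf("missing 'v' prefix: %q", s)
        }
        major, err := strconv.Atoi(strings.Split(s[1:], ".")[0])
        if err != nil {
            return 0, fmt.Errorf("invalid major version in %q: %w", s, err)
        }
        return major, nil
    }

    func TestParseMajor(t *testing.T) {
        cases := []struct {
            name    string
            input   string
            want    int
            wantErr bool
        }{
            {"full version", "v1.2.3", 1, false},
            {"no patch", "v2.0", 2, false},
            {"missing prefix", "1.2.3", 0, true},
            // Given the rows above, an LLM will usually suggest more along
            // the lines of {"empty string", "", 0, true} or
            // {"non-numeric major", "vx.2.3", 0, true}.
        }
        for _, tc := range cases {
            t.Run(tc.name, func(t *testing.T) {
                got, err := ParseMajor(tc.input)
                if (err != nil) != tc.wantErr {
                    t.Fatalf("ParseMajor(%q) error = %v, wantErr %v", tc.input, err, tc.wantErr)
                }
                if !tc.wantErr && got != tc.want {
                    t.Errorf("ParseMajor(%q) = %d, want %d", tc.input, got, tc.want)
                }
            })
        }
    }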
Hey, co-creator here, I agree with the sentiment that code coverage may be a proxy and even sometimes a vanity metric but at the same time, IMO unit regression tests are necessary for a maintainable production codebase. I personally don’t feel confident making changes to production code that isn’t tested.
Specifically for generating unit regression tests the Cover-Agent tool already works quite well in the wild for some projects, especially isolated projects (as opposed to complex enterprise-level code). You can see in the few (somewhat cherry-picked) examples we posted [0] that it generates working tests that increase coverage (they were cherry-picked in the sense that these are examples we like to work with often internally at CodiumAI).
I believe that it’s possible to generate additional meaningful tests including end-to-end tests by creating a more sophisticated flow that uses prompting techniques like reflection on the code and existing tests, and generates the tests iteratively, feeding errors and failures back to the LLM to let it fix them. Just as an example. This is somewhat similar to the approach we used with AlphaCodium [1] which hit 54% on the CodeContests benchmark (DeepMind’s AlphaCode 2 hit 43% [2] with the equivalent amount of LLM calls).
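To illustrate the shape of that loop, here is a bare-bones sketch in Go. It is not Cover-Agent's or AlphaCodium's actual implementation; generateOrFixTest stands in for whatever LLM call you use:

    package coverloop

    import (
        "context"
        "fmt"
        "os"
        "os/exec"
    )

    // generateOrFixTest stands in for the LLM call: given the source under
    // test and the failure output of the previous attempt, it returns a new
    // candidate test file. Hypothetical; not a real API.
    func generateOrFixTest(ctx context.Context, source, previousFailure string) (string, error) {
        return "", fmt.Errorf("prompt the model here")
    }

    // iterate generates a test, runs `go test`, and feeds build/test failures
    // back to the model until the test passes or we give up. A real pipeline
    // would also check that coverage actually increased before keeping it.
    func iterate(ctx context.Context, source string, maxAttempts int) error {
        failure := ""
        for i := 0; i < maxAttempts; i++ {
            candidate, err := generateOrFixTest(ctx, source, failure)
            if err != nil {
                return err
            }
            if err := os.WriteFile("generated_test.go", []byte(candidate), 0o644); err != nil {
                return err
            }
            out, err := exec.CommandContext(ctx, "go", "test", "./...").CombinedOutput()
            if err == nil {
                return nil // builds, runs, and passes: keep the test
            }
            failure = string(out) // feed compiler/test errors into the next prompt
        }
        return fmt.Errorf("no passing test after %d attempts", maxAttempts)
    }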
If like me you think tests are important but hate writing them, please consider contributing to the open-source project to help make it work better for more use cases.
https://github.com/Codium-ai/cover-agent
LLM generated tests are the ultimate response to “Our CTO wants high test coverage but we all know it’s bullshit and provides little to no value”. They exist to tick a checkbox.
Maybe there’s a little value in using these as regression tests. Beyonce rule and all that. Kinda like double entry accounting. “Oh the code makes the same mistake in 2 places, that must mean it’s on purpose”
Yes. The problem with automatically deriving tests from your implementation is that you’re no longer testing your intent, just your implementation. Your tests and your code will have the same bugs dutifully recorded in 2 places.
Your Beyonce rule no longer means “I meant to do that”; it now means “I wrote what I wrote”.
Per the cited real world figures, that's about 1 in 40 tests that pass human review, or a success rate of about 2.5%.
It's hard to see value in spending resources this way right now - most notably, engineer time to review the generated tests. Improve the hit rate by an order of magnitude, and I suspect I'd feel differently.
Based on my understanding, only 1:20 passed the automated acceptance criteria (build, run, pass, increase coverage). Of those that made it through to the human review, “over 50% of the diffs submitted were accepted by developers” according to the paper.
> In highly controlled cases, the ratio of generated tests to those that pass all of the steps is 1:4, and in real-world scenarios, Meta’s authors report a 1:20 ratio.
> Following the automated process, Meta had a human reviewer accept or reject tests. The authors reported an average acceptance ratio of 1:2, with a 73% acceptance rate in their best reported cases.
1:20 of real-world generated tests reach human review, or 5%. Of those, on average 1:2 are approved, or about 50%. 50% of 5% is 2.5%, or 1 in 40. Where do you see the error?
edit: Okay, I think I see it: for engineer time, the relevant figure is the ~50% acceptance rate for tests that actually reach human review (and thus require review effort), not the ~2.5% overall hit rate. Fair callout, thanks!
Hey, one of the creators here.
As mentioned in the post, TestGen-LLM (by Meta) focused on Kotlin, and the prompts were very Kotlin-oriented.
In Cover-Agent (by CodiumAI) we tried to reimplement Meta's work and stay mostly true to the original implementation, although we did improve the prompts a bit. But it isn't generic enough yet.
We believe we know how to improve generality, as we did with our PR-Agent, and here is a rough plan:
https://github.com/Codium-ai/cover-agent/issues/13
Using ChatGPT to generate unit tests works great almost out of the box, but I guess this system solves the remaining 5% to make it fully automated end to end. I believe this will work and help us write better software, given that I have seen numerous cases where the generated tests (even with inferior models) caught not-so-obvious bugs.
Potentially, but if you wrote a test for your incorrect code wouldn't you do the same?
In the end I think the important part is not giving tests more trust than they deserve. If you fix the incorrect code and break a test you should give your change a closer look, but also give the test a closer look before you decide which needs to be fixed.
Depends on how you write tests. If you're writing tests by looking at the code you wrote and just asserting that it behaves as written, then yeah. If you're writing tests that assert that your code conforms to a predefined specification, then the tests could be the thing that helps you uncover the bugs.
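A tiny made-up example of the difference, with a deliberately buggy discount function where the spec says "round to the nearest cent":

    package pricing

    import "testing"

    // applyDiscount is a hypothetical buggy implementation: the spec says to
    // round the discounted price to the nearest cent, but integer division
    // truncates instead (995 * 90 / 100 = 895, not 896).
    func applyDiscount(priceCents, percentOff int) int {
        return priceCents * (100 - percentOff) / 100
    }

    // A test derived from the implementation just records the bug.
    func TestApplyDiscount_MirrorsImplementation(t *testing.T) {
        if got := applyDiscount(995, 10); got != 895 {
            t.Errorf("got %d, want 895", got) // passes; the bug is now "intentional"
        }
    }

    // A test derived from the spec catches it.
    func TestApplyDiscount_FollowsSpec(t *testing.T) {
        want := (995*(100-10) + 50) / 100 // 896: round half up, per the spec
        if got := applyDiscount(995, 10); got != want {
            t.Errorf("got %d, want %d", got, want) // fails until the code is fixed
        }
    }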
I tried creating some on a personal project just using ChatGPT and it saved me a lot of toil on tests I probably wouldn’t have written. I did find I had low trust in refactoring my code, but higher than if I’d had no tests.
It seemed like a net positive for low risk cases.