As others have said, I find it very useful for smaller and simpler cases. Focused, small functions. Both Copilot and ChatGPT (and also Llama 3 via Ollama) are often good at writing tests for edge cases that I might have forgotten.
But anything more complex and it is very hit or miss. I'm now trying to use GPT-4 Turbo to write some integration tests for Go code that talks to a database, and it is mostly a disaster.
It will constantly mock things that I want tested, and write useless tests that do basically nothing because either everything is mocked or the setup is not complete.
I'm settling on using it for tests of those small, pure functions, and otherwise using it as a guide to find possible bugs and edge cases in more complex code, then writing the tests myself and asking it in a separate prompt whether they would cover those cases.
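For reference, this is roughly the class of function where I trust the generated tests. The Truncate function and its test below are a made-up sketch, not code from my project:

    package text

    import "testing"

    // Truncate is a hypothetical small, pure function: it shortens s to at
    // most max runes, appending "..." when it has to cut.
    func Truncate(s string, max int) string {
        r := []rune(s)
        if len(r) <= max {
            return s
        }
        if max <= 3 {
            return string(r[:max])
        }
        return string(r[:max-3]) + "..."
    }

    // The kind of test an LLM reliably gets right for a function like this,
    // including edge cases (empty input, exact length, multi-byte runes).
    // In a real project this would live in truncate_test.go.
    func TestTruncate(t *testing.T) {
        if got := Truncate("", 5); got != "" {
            t.Errorf("empty input: got %q", got)
        }
        if got := Truncate("hello", 5); got != "hello" {
            t.Errorf("exact length: got %q", got)
        }
        if got := Truncate("héllo wörld", 8); got != "héllo..." {
            t.Errorf("multi-byte input: got %q", got)
        }
    }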
Like most people who actually use AI heavily these days, I think the usefulness of AI for coding increases a lot if you already have a pretty good grasp of the subject and the problem space you're working in. If you already know roughly what you want and how to ask for it, these tools can be a huge time saver on the smaller and simpler things.
The most value I have ever gotten out of AI for coding was when I refactored about 20,000 lines of gomega assertions into the more robust complex object matcher pattern. It did a good chunk of the grunt work quickly and was probably 85% accurate.
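For anyone who hasn't used it, the pattern in question is gomega's gstruct composite matchers. A simplified before/after sketch (hypothetical User struct, not the actual code I refactored):

    package user_test

    import (
        "testing"

        . "github.com/onsi/gomega"
        "github.com/onsi/gomega/gstruct"
    )

    // Hypothetical struct standing in for the real domain objects.
    type User struct {
        Name  string
        Age   int
        Email string
    }

    func TestUser(t *testing.T) {
        g := NewWithT(t)
        user := User{Name: "alice", Age: 30, Email: "alice@example.com"}

        // Before: one assertion per field, repeated thousands of times.
        g.Expect(user.Name).To(Equal("alice"))
        g.Expect(user.Age).To(BeNumerically(">=", 18))
        g.Expect(user.Email).To(ContainSubstring("@"))

        // After: a single composite matcher via gomega/gstruct, which
        // reports every mismatched field in one failure message.
        g.Expect(user).To(gstruct.MatchFields(gstruct.IgnoreExtras, gstruct.Fields{
            "Name":  Equal("alice"),
            "Age":   BeNumerically(">=", 18),
            "Email": ContainSubstring("@"),
        }))
    }

Converting thousands of the former into the latter is exactly the kind of mechanical-but-fiddly transformation an LLM handles well.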
It can work for more complex tests, but you have to give it an initial test that already sets everything up and utilizes mocks correctly. From there it will generate mostly correct tests.
This is where prompt engineering becomes more important. Next time consider prepending some kind of plain-English set of expectations before pasting your code. Something like: “I want you to write tests for this code. Here are the expected behaviors <expected behaviors list>, and here are unexpected behaviors <unexpected behaviors list>. Tests should pass if they adhere to the expected behaviors and fail if they have unexpected behaviors. Here is the code: <code>”.
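As a sketch of what that template looks like if you wrap it in code (plain string building; nothing here is tied to any particular tool or API):

    package prompt

    import (
        "fmt"
        "strings"
    )

    // BuildTestPrompt assembles the "expectations first, code last" prompt
    // described above. Purely illustrative.
    func BuildTestPrompt(expected, unexpected []string, code string) string {
        return fmt.Sprintf(
            "I want you to write tests for this code.\n"+
                "Here are the expected behaviors:\n- %s\n"+
                "Here are unexpected behaviors:\n- %s\n"+
                "Tests should pass if they adhere to the expected behaviors "+
                "and fail if they have unexpected behaviors.\n"+
                "Here is the code:\n%s\n",
            strings.Join(expected, "\n- "),
            strings.Join(unexpected, "\n- "),
            code,
        )
    }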
Like most LLM generation, though, it's not deterministic, and as you mentioned originally it takes some verification of the output. I still think that, even with the extra steps, it saves time when applied to the right scenarios. The longer the input, the higher the hallucination count in my experience, so I always keep the code I provide to the smallest chunk that still has enough context.
I can see them being useful for reducing the tedious cases. Especially the failure paths. Otherwise, my view is that if a unit test can be easily auto generated and the code is not already covered by a feature/integration test, then maybe it's a waste of time to look at it. I'd be more interested in integration tests generated from a description... which doesn't work that great yet.
Adding the simple unit tests means the current design of the internals suddenly becomes more "final" than before, and maybe more final than intended.
The idea behind the Facebook paper was to generate extra tests, in the style of the ones that already exist, that cover extra cases.
Generating tests from a description will likely suck because the LLM will just generate the obvious tests and possibly make a logic mistake, when what you really want is the tricky cases.
Many people see unit tests as a chore, rather than the foundation of a codebase. If every project could get 60-80% coverage basically for free (except some super novel or intricate path, which should already have tons of unit tests), I see that as a net win for everyone.
For many of the CRUD platforms, you could even get close to 90% with a bit of hand holding.
So I see nothing wrong.
As they say: don't let perfect be the enemy of good.
For low-risk boilerplate it helps me a lot: I know what I want, and it doesn't take any mental toll to validate that the generated code is correct.
For anything with more complex logic I don't get much gain. Having to read and understand the nuances of the generated logic is more taxing on my mind than just writing it myself, and I got tired of having to correct hallucinations most of the time.
For very simple test cases of simple methods/functions it's OK. Once the code starts to get complicated, you'll lose much more time trying to steer the LLM into writing what you need than simply doing it yourself. Setting up and creating mocks in particular seems to be where the most difficulty lies.
Once you've created the test, you can use the LLM as a code-completion tool to test for more things when you have a complex result, or you can ask it to extrapolate from the test data to create more test cases.
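To make "extrapolate from test data" concrete: hand it a table-driven test like the sketch below (ParseMajor is a hypothetical function, included only so the example compiles) and it's usually good at proposing additional rows:

    package version

    import (
        "fmt"
        "strconv"
        "strings"
        "testing"
    )

    // ParseMajor is a hypothetical function under test: it extracts the major
    // version from a "vX.Y[.Z]" string. The interesting part is the table.
    func ParseMajor(s string) (int, error) {
        if !strings.HasPrefix(s, "v") {
            return 0, fmt.Errorf("missing 'v' prefix: %q", s)
        }
        major, err := strconv.Atoi(strings.Split(s[1:], ".")[0])
        if err != nil {
            return 0, fmt.Errorf("invalid major version in %q: %w", s, err)
        }
        return major, nil
    }

    func TestParseMajor(t *testing.T) {
        cases := []struct {
            name    string
            input   string
            want    int
            wantErr bool
        }{
            {"full version", "v1.2.3", 1, false},
            {"no patch", "v2.0", 2, false},
            {"missing prefix", "1.2.3", 0, true},
            // Given the rows above, an LLM will usually suggest more along
            // the lines of {"empty string", "", 0, true} or
            // {"non-numeric major", "vx.2.3", 0, true}.
        }
        for _, tc := range cases {
            t.Run(tc.name, func(t *testing.T) {
                got, err := ParseMajor(tc.input)
                if (err != nil) != tc.wantErr {
                    t.Fatalf("ParseMajor(%q) error = %v, wantErr %v", tc.input, err, tc.wantErr)
                }
                if !tc.wantErr && got != tc.want {
                    t.Errorf("ParseMajor(%q) = %d, want %d", tc.input, got, tc.want)
                }
            })
        }
    }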
Hey, co-creator here, I agree with the sentiment that code coverage may be a proxy and even sometimes a vanity metric but at the same time, IMO unit regression tests are necessary for a maintainable production codebase. I personally don’t feel confident making changes to production code that isn’t tested.
Specifically for generating unit regression tests the Cover-Agent tool already works quite well in the wild for some projects, especially isolated projects (as opposed to complex enterprise-level code). You can see in the few (somewhat cherry-picked) examples we posted [0] that it generates working tests that increase coverage (they were cherry-picked in the sense that these are examples we like to work with often internally at CodiumAI).
I believe that it’s possible to generate additional meaningful tests including end-to-end tests by creating a more sophisticated flow that uses prompting techniques like reflection on the code and existing tests, and generates the tests iteratively, feeding errors and failures back to the LLM to let it fix them. Just as an example. This is somewhat similar to the approach we used with AlphaCodium [1] which hit 54% on the CodeContests benchmark (DeepMind’s AlphaCode 2 hit 43% [2] with the equivalent amount of LLM calls).
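To illustrate the shape of that loop, here is a bare-bones sketch in Go. It is not Cover-Agent's or AlphaCodium's actual implementation; generateOrFixTest stands in for whatever LLM call you use:

    package coverloop

    import (
        "context"
        "fmt"
        "os"
        "os/exec"
    )

    // generateOrFixTest stands in for the LLM call: given the source under
    // test and the failure output of the previous attempt, it returns a new
    // candidate test file. Hypothetical; not a real API.
    func generateOrFixTest(ctx context.Context, source, previousFailure string) (string, error) {
        return "", fmt.Errorf("prompt the model here")
    }

    // iterate generates a test, runs `go test`, and feeds build/test failures
    // back to the model until the test passes or we give up. A real pipeline
    // would also check that coverage actually increased before keeping it.
    func iterate(ctx context.Context, source string, maxAttempts int) error {
        failure := ""
        for i := 0; i < maxAttempts; i++ {
            candidate, err := generateOrFixTest(ctx, source, failure)
            if err != nil {
                return err
            }
            if err := os.WriteFile("generated_test.go", []byte(candidate), 0o644); err != nil {
                return err
            }
            out, err := exec.CommandContext(ctx, "go", "test", "./...").CombinedOutput()
            if err == nil {
                return nil // builds, runs, and passes: keep the test
            }
            failure = string(out) // feed compiler/test errors into the next prompt
        }
        return fmt.Errorf("no passing test after %d attempts", maxAttempts)
    }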
If like me you think tests are important but hate writing them, please consider contributing to the open-source project to help make it work better for more use cases.
https://github.com/Codium-ai/cover-agent
LLM generated tests are the ultimate response to “Our CTO wants high test coverage but we all know it’s bullshit and provides little to no value”. They exist to tick a checkbox.
Maybe there’s a little value in using these as regression tests. Beyonce rule and all that. Kinda like double entry accounting. “Oh the code makes the same mistake in 2 places, that must mean it’s on purpose”
Yes. The problem with automatically deriving tests from your implementation is that you’re no longer testing your intent, just your implementation. Your tests and your code will have the same bugs dutifully recorded in 2 places.
Your Beyonce rule no longer means “I meant to do that”; it now means “I wrote what I wrote”.
Per the cited real world figures, that's about 1 in 40 tests that pass human review, or a success rate of about 2.5%.
It's hard to see value in spending resources this way right now - most notably, engineer time to review the generated tests. Improve the hit rate by an order of magnitude, and I suspect I'd feel differently.
Based on my understanding, only 1:20 passed the automated acceptance criteria (build, run, pass, increase coverage). Of those that made it through to the human review, “over 50% of the diffs submitted were accepted by developers” according to the paper.
> In highly controlled cases, the ratio of generated tests to those that pass all of the steps is 1:4, and in real-world scenarios, Meta’s authors report a 1:20 ratio.
> Following the automated process, Meta had a human reviewer accept or reject tests. The authors reported an average acceptance ratio of 1:2, with a 73% acceptance rate in their best reported cases.
1:20 of real-world generated tests reach human review, or 5%. Of those, on average 1:2 are approved, or about 50%. 50% of 5% is 2.5%, or 1 in 40. Where do you see the error?
edit: Okay, I think I see it: for engineer time, the relevant figure is the ~50% acceptance rate for tests that actually reach human review (and thus require review effort), not the ~2.5% overall hit rate. Fair callout, thanks!
Hey, one of the creators here.
As mentioned in the post, TestGen-LLM (by Meta) focused on Kotlin, and the prompts were very Kotlin-oriented.
In Cover-Agent (by CodiumAI) we tried to reimplement Meta's work and stay mostly true to the original implementation, although we did improve the prompts a bit. But it isn't generic enough yet.
We believe we know how to improve generality, as we did with our PR-Agent, and here is a rough plan:
https://github.com/Codium-ai/cover-agent/issues/13
Using ChatGPT to generate unit tests works great almost out of the box, but I guess this system solves the remaining 5% to make it fully automated end to end. I believe this will work and help us write better software, given that I have seen numerous cases where the generated tests (even with inferior models) caught not-so-obvious bugs.
Potentially, but if you wrote a test for your incorrect code wouldn't you do the same?
In the end I think the important part is not giving tests more trust than they deserve. If you fix the incorrect code and break a test you should give your change a closer look, but also give the test a closer look before you decide which needs to be fixed.
Depends on how you write tests. If you're writing tests by looking at the code you wrote and just asserting that it behaves as written, then yeah. If you're writing tests that assert that your code conforms to a predefined specification, then the tests could be the thing that helps you uncover the bugs.
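A tiny made-up example of the difference, with a deliberately buggy discount function where the spec says "round to the nearest cent":

    package pricing

    import "testing"

    // applyDiscount is a hypothetical buggy implementation: the spec says to
    // round the discounted price to the nearest cent, but integer division
    // truncates instead (995 * 90 / 100 = 895, not 896).
    func applyDiscount(priceCents, percentOff int) int {
        return priceCents * (100 - percentOff) / 100
    }

    // A test derived from the implementation just records the bug.
    func TestApplyDiscount_MirrorsImplementation(t *testing.T) {
        if got := applyDiscount(995, 10); got != 895 {
            t.Errorf("got %d, want 895", got) // passes; the bug is now "intentional"
        }
    }

    // A test derived from the spec catches it.
    func TestApplyDiscount_FollowsSpec(t *testing.T) {
        want := (995*(100-10) + 50) / 100 // 896: round half up, per the spec
        if got := applyDiscount(995, 10); got != want {
            t.Errorf("got %d, want %d", got, want) // fails until the code is fixed
        }
    }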
I tried creating some on a personal project just using ChatGPT and it saved me a lot of toil on tests I probably wouldn’t have written. I did find I had low trust in refactoring my code, but higher than if I’d had no tests.
It seemed like a net positive for low risk cases.