
It's my personal experience for now. I have some experiments and further study planned.

It's difficult to set up evals, especially with production code situations. Any tips?




Yeah, it is hard.

The principle you need to work to is this: create evidence that other people will find compelling, then show that you have interrogated your results - that it's really working better than chance and isn't the result of some fluke or other. Finally, you need to find a way to explain what's happening - an actual mechanism.

1. Find or make a data set - I've been using code_search_net to try and study the ability of LLMs to document code, specifically the impact of optimising n-shot learning on them*. That may not be close enough to your application, but you need many examples to draw conclusions. It's likely that you will have to do some statistics to demonstrate the effects of your innovations, so you probably need around 100 examples (there's a rough sketch of this kind of harness after the list).

2. Results from one model may not be informative enough; it might be useful or even necessary to compare several different models to see if the effect you are finding is consistent, or whether some special feature of one model is what is required. For example, does the effect only appear with the largest and most sophisticated modern models, or can it be seen to a greater or lesser extent across a variety of models?

3. You need to ablate - what is it in the setup that is most impactful? What happens if we change a word or add a word to the prompt? Does this work on long code snippets? Does it work on code with many functions? Does it work on code from particular languages?

4. You need a quantitative measure of performance. I am a liberal sort, but I will not be convinced by an assertion that "it worked better than before" or "this review reads like a senior's, I think". There needs to be a number that someone like me can't argue with - or at least, can argue with but can't dismiss.
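
To make that concrete, here's a minimal sketch of the kind of harness I mean. It assumes the Hugging Face datasets package, call_model() is a stub standing in for whichever API client you actually use, and the config/column names are from my memory of the code_search_net schema, so check them. The point is the shape - fixed sample, several models, one number per model - not the specifics:

    import random
    from datasets import load_dataset

    def call_model(model_name, prompt):
        # Stub so the harness runs end to end; replace with your real API client.
        return "TODO: model output goes here"

    def overlap_score(generated, reference):
        # Crude placeholder metric (word overlap with the reference docstring);
        # swap in whatever number you can actually defend.
        gen, ref = set(generated.split()), set(reference.split())
        return len(gen & ref) / max(len(ref), 1)

    ds = load_dataset("code_search_net", "python", split="test")
    random.seed(0)
    sample = ds.select(random.sample(range(len(ds)), 100))  # ~100 examples for basic stats

    for model_name in ["model-a", "model-b"]:  # compare more than one model
        scores = []
        for row in sample:
            prompt = "Write a docstring for this function:\n\n" + row["func_code_string"]
            scores.append(overlap_score(call_model(model_name, prompt),
                                        row["func_documentation_string"]))
        print(model_name, sum(scores) / len(scores), "over", len(scores), "examples")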

*I couldn't make it work, I think because the search space for finding good prompting shots (sample functions) is vast and the output space (possible documents) is also vast. Many Bothans died to bring you this very, very, very (in hindsight, with about $200 of OpenAI spending) obvious result. Having said that, I'm not confident that it couldn't be made to work, so I haven't written it up and won't make any definitive claim. Mainly I wonder whether there is a heuristic I could use to choose examples a priori instead of trying them at random. I did try shorter examples, and I did try more typical (in size) examples. The other issue is that I am using sentence similarity as a measure of quality, and that isn't something I'm confident of.
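
(For what it's worth, the similarity measure itself is only a few lines with sentence-transformers - roughly the sketch below, assuming the all-MiniLM-L6-v2 checkpoint. The part I'm unsure about isn't the code, it's whether cosine similarity of embeddings actually tracks documentation quality.)

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def doc_similarity(generated, reference):
        # Cosine similarity between the embeddings of the generated doc and the reference.
        emb = model.encode([generated, reference], convert_to_tensor=True)
        return util.cos_sim(emb[0], emb[1]).item()

    print(doc_similarity("Return the sum of two integers.",
                         "Add two numbers and return the result."))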



