This tool doesn't benchmark based on how a model actually responds to the generated prompts. Instead, it trusts GPT-4 to rank prompts simply by how well it imagines they will perform head-to-head. So there's no way to tell whether the chosen 'best prompt' actually is the best, because there's no ground truth from actual responses.
Why is this so popular, then (more popular than promptfoo, which I think is a much better tool in the same vein)? AI devs seem enamored with the idea of LLMs evaluating LLMs —everything is ‘auto-‘ this and that. They’re in for a rude awakening. The truth is, there are no shortcuts to evaluating performance in real world applications.
Because this is the upslope, if not the peak, of the hype bubble. All you have to do is use GPT for a task; nobody cares whether it actually works, and you still get whatever VC funding is left after the interest rate hikes.
Grifters. I won't say that the person working on this is a grifter. Rather, it's so popular right now because of grifters: the same type of NFT and crypto grifters who are mostly silent now. They've moved on.
How ethical would it be to sell things to these grifters, to sell them the shovels they will use? I'm always hung up on the idea that they will turn those shovels on others and exploit them.
Thanks for mentioning promptfoo. For anyone else who might prefer deterministic, programmatic evaluation of LLM outputs, I've been building this for evaluating prompts and models: https://github.com/typpo/promptfoo
Example asserts include basic string checks, regex, is-json, cosine similarity, etc. (and LLM self-eval is an option if you'd like).
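For anyone wondering what "deterministic, programmatic evaluation" looks like in practice, here's a rough standalone Python sketch of those kinds of asserts. This is only an illustration of the idea, not promptfoo's actual config format:

    import json
    import re

    def assert_contains(output: str, substring: str) -> bool:
        return substring in output

    def assert_regex(output: str, pattern: str) -> bool:
        return re.search(pattern, output) is not None

    def assert_is_json(output: str) -> bool:
        try:
            json.loads(output)
            return True
        except json.JSONDecodeError:
            return False

    # e.g. run each assert against a model's output and fail the test case on any False
    checks = [
        assert_contains("The answer is 42.", "42"),
        assert_regex("Order #12345 confirmed", r"#\d+"),
        assert_is_json('{"sentiment": "positive"}'),
    ]
    print(all(checks))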
No problem! I guess I will make a plug myself --we've been working on a similar 'prompt engineering' tool, ChainForge (https://github.com/ianarawjo/ChainForge). It's targeted towards slightly different users and use cases than promptfoo --geared more towards early-stage, 'quick-and-dirty' explorations of differences between prompts and models for less experienced programmers, versus the kind of continuous benchmarking and verification testing power that promptfoo offers.
I particularly like promptfoo's support for CI, which I haven't seen anywhere else, and is very important for developers pushing prompts into production (esp since OpenAI keeps updating their models every few months...).
There is a paper on arXiv saying that GPT-4's correlation with human evaluators on a variety of tasks is strongly positive. I am also uncomfortable with it, but using GPT-4 as a grader is not as bad as you think.
You’re missing the point here. It’s not even getting the LLM’s opinion on evaluating the responses to the prompts (which itself is fraught for some tasks, and benchmarks are known to be limited —even OpenAI admits this, it’s why they made evals). It’s one level abstracted from that. It’s evaluating what the LLM thinks of how well the prompt will do, in purely hypothetical terms. That’s hogwash —different LLMs perform very differently even for the same prompts. Try any tool that lets you compare model responses side-by-side. Unless I see actual use cases, this is yet another iteration of overtrusting AI.
> Here is what HN was talking about, nearly three months ago - the exact same type of 'auto-prompt-gen' tool.
I was reminded of the same thing. What a lot of it boils down to is that LLMs have no innate ability to self-reflect. They can pretend to do it, but no more effectively than an untrained human would.
Of course there is strong correlation. That is literally what it was designed to do.
The problem is that it will simultaneously say that "cow eggs are bigger than chicken eggs", with the same confidence (and in a way that correlates well with human evaluators).
I just asked and it told me cows are mammals and do not lay eggs. That Reddit post is not even GPT-4 and is 5 months old, which may as well be the 19th century on AI tech timescales.
You are concentrating on the details and avoiding the point.
The point is that the tool fails, and it is known to fail, so much so that we even have a name for the times when it fails: hallucinations. I have been calling them cow eggs because that's a nice mental image and I didn't want to have to remember the proper English term. I will continue calling them cow eggs.
Easiest way to get to Radio Shack on a bicycle is to ride that bike down to Doc Brown’s house, charge the Delorean up to 1.21 gigawatts, and go back in time.
Humans benefit from good communication too. For example, annual U.S. deaths from medical errors are in the hundreds of thousands, and much of that is due to miscommunication. Is this akin to poor human-to-human prompt engineering? Of course, humans will rush and not attempt better communication, whereas you can take all the time you wish with an AI. And AI will continue to incorporate better prompt engineering that you won't have to write out. But there will always be a continuum from good to bad, both for communication and for communication outcomes.
You're forgetting that what you may consider factual, self-evident, and a priori is your opinion.
You may be under the impression that "annual U.S. deaths from medical errors are in the hundreds of thousands" miscommunicates, but that is truly your opinion. You are merely jumping to conclusions in places another person might not.
And going on to rely on the LLM to validate your perspective is a lossy process. It may not lose your perspective but it loses someone else's and you don't even seem to notice or care.
The post you replied to was saying that the deaths were caused by miscommunication, but you interpreted it to mean that stating the number of such deaths is somehow a miscommunication itself!
If we could use GPT-4 to grade prompts, we wouldn't need to be talking about grading prompts to use for GPT-4, since this solution requires that the problem doesn't exist. The question then becomes: how do you grade the prompt grading, objectively? At the bottom, there has to be a ground truth.
You can't use the thing you're testing to evaluate its own performance. This applies to rulers, speedometers, and AI. It's the difference between a "subjective" and an "objective" metric. If you want an objective metric, you need to base it on something external, on reality. Otherwise, you have metrics and ideas that have to hold themselves up.
Source: My day job is test and measurement. These concepts go back centuries. You never trust your measurement system, you verify it against a standard.
> A positive correlation just means better than chance. "Strongly" is vague, and might not be much better than chance.
No, adverbs like "strongly" modify adjectives (or verbs, but that's not relevant here), not nouns; "strongly" is an intensifier that modifies "positive", it's not a separate adjective that modifies the noun "correlation".
Well, you could keep everything else in the project and put yourself or another human in as the "does this result feel better than the other one" decision maker.
It seems to me, also, that this is very much some sort of snake oil for the LLM era. Prompt generation varies from LLM to LLM, and I doubt GPT-4 can do a reasonable evaluation, given that it doesn't know anything about other models.
That's if you're trying to become a scientist. You can do science even if your experiment doesn't contain any of those elements besides trying multiple things and observing the outcomes.
Yes, because practicing Medicine is about a whole apparatus of agreed-upon practices, not picking and choosing and slapping a label on it. Much like the case with doing Science, which is what I was getting at.
Peer review has been a thing, albeit in a much less rigorous form, since at least the 17th century. The so-called 'republic of letters' was a world-wide (well, Europe-wide) society of philosophers, mathematicians, and early experimenters sharing results and data in writing. Mathematicians have been publishing proofs and challenging each other in Europe since at least the Italian Risorgimento...
This is more engineering than science. For a given task, try a bunch of approaches and keep the best one.
Science would be replicable, ie demonstrate that this particular approach to prompt writing yields better prompts than some baseline prompt writing approach across an array of different problems.
The difference is that in science the dialectic is between you, the observer (or a group of observers), and a "reality" that hopefully(!) remains relatively stable, because the human organism is the way that it is, just as physical reality is.
But these models are changing literally every day, so there's no fixed thing to reveal.
So no one will be able to reproduce anything at all.
This makes all this "engineering" pretty ridiculous in my eyes; it's literally for one model's bizarre emergent properties.
Gradient descent type optimization is far from "trying random things until one of them works without really knowing why". You can calculate all partial derivatives and understand the impact.
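As a toy illustration of that point, here's plain gradient descent on a two-parameter loss with a known analytic gradient (nothing LLM-specific, just the mechanics):

    import numpy as np

    # Toy loss: f(w) = (w0 - 3)^2 + (w1 + 1)^2. Its partial derivatives are known exactly,
    # so the effect of every update on the loss is explainable rather than random.
    def grad(w):
        return np.array([2 * (w[0] - 3), 2 * (w[1] + 1)])

    w, lr = np.array([0.0, 0.0]), 0.1
    for _ in range(100):
        w -= lr * grad(w)  # nudge each parameter against its own partial derivative

    print(w)  # converges toward [3, -1], the minimum of the loss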
I think it is a freaking miracle it even works. I understand how it works for, say, a 100-parameter linear regression, but that it would work for billions of parameters (a billion-dimensional space) by nudging each parameter a little (based purely on its impact on the loss, assuming everything else stays the same) is not obvious to me. It is a kind of magic.
Regarding randomness: the initialization of the weights is random, and if they use dropout that is random too, plus the order in which the text is processed might be random.
To be honest, there's always been a tension in machine learning between the "we won't do it unless the theory is complete and sound" crowd and the "we don't understand this, but it consistently works much better, so we do it" crowd.
In the 90s and even 00s, the theory first crowd was mainstream and the empirical first crowd was considered fringe. Very fringe.
Personally, I appreciated it when LeCun said something like: "you can't find the solution if you only search where the lamplight is shining." Or when other early deep learning practitioners noted that ML theory is usually so far disconnected from practice, in terms of the tightness of its bounds, that you might as well ignore pure theory completely.
Anyway, it wasn't until deep learning methods really smashed benchmarks across the board that people gave in to the black-magic / alchemy-driven approach of empiricism, based on intuition and bias built up through long experience.
I think the real etymology of it was "social engineering", which also means clever hacks, not real engineering; it deliberately subverted the meaning.
Isn’t engineering an exact science while prompt engineering is completely not?
Although, even granting that software engineering is an exact science, it is a funny one: most of us don't get certified the way, say, mechanical engineers do. Would they say we are engineers?
So perhaps the “engineer” term got overloaded in recent years?
Prompt "engineering" is just writing prayers to forest faeries.
Whilst BASIC/JavaScript/etc. are all magic incantations to a child, a child will soon figure out there's underlying logic, and learn to reason about what code does and what certain changes will do.
With prompts, it's all faerie logic. There is nothing to learn, there are only magic incantations that change drastically if the model is updated.
Worse yet, the incantations cannot be composed. E.g. take the SQL statement "SELECT column FROM table WHERE column = [%s]". For any given string you insert here, the output is predictable. You can even know which characters would trigger an injection attack.
With prompts you cannot predict results. Any word, phrase, or sequence of characters may upset the faeries and cause the model to misbehave in who knows what way. No processing of user-input will stop injection attacks.
Whilst it's dubious to call current software development practices "engineering", it's utterly ridiculous to do so for prompt-writing.
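To make the SQL contrast above concrete, here's a minimal sketch using Python's sqlite3 as a stand-in; with a parameterized query the behaviour is predictable for any input, which is exactly the property prompts lack:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (col TEXT)")
    conn.executemany("INSERT INTO t VALUES (?)", [("hello",), ("Robert'); DROP TABLE t;--",)])

    # With a parameterized query the user string is always treated as data, never as SQL,
    # so the behaviour is predictable for any input, including injection attempts.
    user_input = "Robert'); DROP TABLE t;--"
    rows = conn.execute("SELECT col FROM t WHERE col = ?", (user_input,)).fetchall()
    print(rows)  # the malicious string only matches as a literal value; the table is intact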
I don't get where this sentiment comes from. I build software specifically on the concept of predictable, composable results from LLMs.
Sure, the results are not deterministic, in the sense that the exact same prompt returns the exact same result 100% of the time, but you can tune your prompts so that 100% of the time they give you a valid result in the result category you were seeking, with a specific probability distribution over the available choices.
Prompts are functions that take concrete input and produce a probabilistic output that can be automated upon, especially if you only need to output one token, e.g. a number, boolean, word, or object reference. And for obvious reasons, the further out you forecast in a sequence, the less accurate you will be.
As long as you don't change the underlying model, in a massive model with billions of parameters, there are definitely mechanisms and behaviors to discover that you can reason about.
> you can tune your prompts so that 100% of the time they give you a valid result in the result category you were seeking
You can't though, that's the issue. Illustrative here are tokens like "SolidGoldMagikarp", but this does happen to "normal" sequences of tokens as well.
There is no filter you can build to keep out such mistakes, any set of otherwise normal tokens could trigger the model to produce wrong output.
Because of how large these models and most prompts are, even slight changes in things like attention can cascade into extremely different results.
> there are definitely mechanisms and behaviors to discover that you can reason about.
It's faerie logic. The behaviours are mere trends and observations, not underlying truth.
The faeries reward you for offering them fruit. But offer them an apple that fell from the tree exactly 74 hours ago, down to the second, and they'll kill you. There is no way to know ahead of time which things will upset them.
The risk here is that you're fooled into believing these systems are understandable, that you know how they work, and that you'll mistakenly use them for something where the wrong results have consequences. You'll stop double-checking the output, all humans are lazy like that, and then you'll have disaster on your hands.
You can reasonably expect an LLM to respond appropriately, often. How often depends on the details, but it's not much more magic than expecting the bridge you built to hold up.
You could do a sort of validation of the output by prompting the LLM repeatedly with the same prompt and then comparing the responses to eliminate outliers. I do feel like this stuff is magic, though; just wanted to provide a counterpoint.
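One simple way to do that kind of repeated-sampling check is a plain majority vote. A rough sketch, where `ask_model` is a stand-in for whatever function wraps your LLM call:

    from collections import Counter

    def majority_answer(ask_model, prompt: str, n: int = 5):
        """Sample the model n times on the same prompt and keep the most common answer."""
        answers = [ask_model(prompt) for _ in range(n)]
        best, count = Counter(answers).most_common(1)[0]
        return best, count / n  # the winning answer plus a crude agreement score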
In "The Information," James Gleick discusses a concept related to our current discourse. In the days when computers were merely an array of switching circuits, luminaries such as Claude Shannon believed that "thinking" could be captured in a structured format of logical representation.
However, even with formally composable languages like JavaScript, a semblance of unpredictability — akin to the "faerie logic" metaphor — still persists. Languages evolve over time; Python, for instance, with its various imports that constantly disrupt my code, serves as a good example. This is perhaps the reason behind the emergence of containers to ensure code consistency.
While some elements may be more "composable" than others, it appears increasingly unrealistic in today's world to encapsulate thought processes or interactions with systems within a rigid logical framework. Large Language Models (LLMs) will keep evolving and improving, making continual interaction with them unavoidable. The notion that we can pass a set of code or words through them once and expect a flawless result is simply illogical.
I firmly believe that any effective system should incorporate a robust user interaction component, regardless of the specific task or problem at hand.
It's not so much about formal logic, but general predictability.
> even with formally composable languages like JavaScript, a semblance of unpredictability — akin to the "faerie logic" metaphor — still persists
And they're ridiculed for it, and as you state, we design around them or replace such systems entirely.
> making continual interaction with them unavoidable
Technology is never unavoidable or "inevitable". We can choose not to use it, or when to use it.
> The notion that we can pass a set of code or words through them once and expect a flawless result is simply illogical.
Yet that is what we expect when we put these systems into production use, especially when many proposed use cases are user-facing and subject to injection attacks.
Whether it be the writing of ad copy, the processing of loan applications, or the generation of code, mistakes in these tasks have very real consequences.
I don't disagree we can choose to use it or not, but my point was more meant to indicate that, if we want a good experience with LLMs, we have to continue to interact with them to achieve good results.
We need to move away from prompt-engineering - it's AI-Management. You pretend you're speaking to another (albeit confusing/confused) person when extracting work from a model. You're coaxing things out of it based on hearsay and mysticism that work most of the time. Sounds a lot like AGILE and free pizza to get a junior to stay late and deliver on time.
It's so refreshing to see someone actually write this about prompt writing. It makes a welcome change from Twitter AI influencers posting their ridiculous prose as some marvel of harnessing LLMs.
In my opinion, coding is a craft. As software has only existed for like 70 years, we are more like the guilds building cathedrals in the middle ages than like modern civil engineering.
One day I think there will be true software engineering. When that happens you won't be able to start software projects without certifications, and most people (or programs!) who actually do the coding will be following careful plans and instructions from the engineers who designed the project.
I for one am very happy software isn't fully professionalized yet!
> One day I think there will be true software engineering. When that happens you won't be able to start software projects without certifications, and most people (or programs!) who actually do the coding will be following careful plans and instructions from the engineers who designed the project.
Sounds really bad.
Software can't fall on your head and kill you, not all of it at least.
Different software should require different professionals building it.
And it's usually not about the software but about the management telling the engineers to take shortcuts or whatever (Boeing comes to mind)
You might enjoy this article on the Therac-25 [1]. It's kind of the standard example of how errors in software can wind up harming people. I have written medical device software for about 30 years. In my experience, delivering high quality software for Class B and Class C devices is both challenging and expensive.
Every software developer should know this story; it is a humbling and important lesson. Yes, luckily most of us can't ship code that accidentally kills people, but we can absolutely empathize with the conditions which led to it happening.
Thank you so much for the link to the postmortem. I will be sharing it and discussing it with my colleagues. We are currently working on the embedded software for an AED.
Don't want to argue against you, since the term is definitely misused for signaling a lot of things, but in a stricter sense, software engineering is not about the puzzle, it's about the whole pipeline surrounding it.
I’m guessing you aren’t from the US because in the US there is not much prestige in the title of engineer, but it seems to be a title people in Europe get weirdly hung up on. If anything, software engineering positions are more prestigious than most traditional engineering positions here.
I was a real engineer for a decade before switching to programming.
I cannot use the term "software engineer", since it's nothing like real engineering.
Real engineering was def harder, more math intense, and the stakes were sooo much higher. While many software problems can cause you to lose money, engineering problems can cause you to lose time. Yes it sucks your CAD designer had everything on a 0.05 degree angle and it costs 1M to redo the tool, but it also costs 16 weeks to redo the tool. We'd even offer to pay absurd money, prevent future business, etc... to get the tool done in 8 weeks, but its impossible to get it done faster. Now everything in the company is 8 weeks behind schedule.
Anyway, real engineering was harder, but programming pays soooo much more money. Its a demand thing, not a difficulty thing.
> While many software problems can cause you to lose money, engineering problems can cause you to lose time.
I'm guessing you were trying to say something else here. I literally cannot think of a single software engineering problem I've ever encountered that didn't cost time. By your definition, then, software engineering is engineering. Your claim and your definitions are at odds with one another.
Also, you don't directly claim it, but you seem to imply, that software engineering can't have real-world consequences... or something? As another reply points out, sometimes software is in the critical path for things like rockets and airplanes, where mistakes cost lives.
And some people making software for less life-altering systems take their craft just as seriously. Some people think that losing $10M every single second while their software is failing is a big deal.
Are you claiming people who write HFT code, ad arbitrage code, code that powers the front page of Apple, Amazon, Microsoft, and Google are just cowboying it through the day, doing nothing special?
Overall I just find this comment very confused. Maybe you could put some thought into what you're trying to say, and say it better?
So if you have your PE license and are writing software that controls a commercial airplane system or a spacex rocket, you are not an engineer doing software engineering?
From Wikipedia:
"A software engineer is a person who applies the engineering design process to design, develop, maintain, test, and evaluate computer software."
"The engineering design process, also known as the engineering method, is a common series of steps that engineers use in creating functional products and processes. The process is highly iterative - parts of the process often need to be repeated many times before another can be entered - though the part(s) that get iterated and the number of such cycles in any given project may vary.
It is a decision making process (often iterative) in which the basic sciences, mathematics, and engineering sciences are applied to convert resources optimally to meet a stated objective. Among the fundamental elements of the design process are the establishment of objectives and criteria, synthesis, analysis, construction, testing and evaluation.[1]"
It is not dependent on the problem domain, rather on how the work is performed.
I agree that far too many people call themselves software engineers when they really aren't.
But I just mean to say that software engineering itself is definitely still a real thing and there are many people out there that can and should call themselves software engineers.
"Real engineer" is such an insufferable way to make your point. I guess only civil engineering is "real." If you don't like the term software engineer, you'd be real mad if you found out about sound stage engineers. Being an engineer isn't about engineering solutions to business problems; nope, it's having to deal with the bureaucratic hurdles that slow you down that makes one an engineer.
In essence, programming is an ephemeral art. At its core, our work involves utilizing the electromagnetic field to perform tasks akin to 'thinking.' Without continuous computational processes, our products might as well be inert objects like rocks that simply exist unchanged for millennia.
In my country there used to be 5-year (realistically 6 for most students) Computer Engineering programs that were basically electronic engineering with selected computer science classes tacked on top.
And since Electronic Engineering programs used to have a lot of Civil Engineering classes tacked onto them, said Computer Engineers were legally licensed to build small 3-floor buildings.
People who have not studied engineering and been given that distinction going on to call themselves engineers is the reason the term is overloaded.
I've always assumed there's a near-100% overlap between people using the term wrongly to describe any programming activity and people complaining that it has no meaning or is self-aggrandisement.
How does certification work? I’d imagine it would be in a less abstracted language like C or C++. The problem for me is most of my schooling was based on web technologies and 2 classes of Java.
I would hate having to study for a C-based test when the area I work in is all web tech. The same could be said of Java: I learned it in school and haven't used it in 10 years, except for it being the backend on my first front-end dev project.
Anyway, the best engineer for ChatGPT prompts can be ChatGPT itself.
People seem to think that "automation will create new jobs", but in the age of AI, those job opportunities will be very temporary, as the companies making the AI automate that thin layer.
Similarly, people think that humans will control AIs. That’s a bit quaint, a bit like humans controlling a corporation. The thin layer of “control” can be easily swapped out and present an improvement in the market, so that the number of totally autonomous (no human in the loop) workflows will grow.
That can include predictive policing with Palantir (thanks Peter Thiel!), autonomous killbots in war etc. Seeing how reckless companies have been in releasing the current AI in an arms race, I don’t see how they would be restrained in a literal arms race of slaughterbot swarms and panopticon camera meshes.
PS: I remember this exact phase when computers like Deep Blue beat Garry Kasparov. For a while, he and others advocated "centaurs": humans collaborating with computers. But over the last decade, hardly anyone will claim that a system with humans in the loop can beat a system that's fully automated: https://en.m.wikipedia.org/wiki/Advanced_chess
“Engineer” originally just meant someone who builds engines (in a broad sense of that word). The formal titles requiring certification, etc. are the more recent development.
You make a valid point, and no, we are not engineers. We are people with printed labels at best, where the label says architect or engineer. But most of the people with these labels don't even have a degree, which is the prerequisite for these designations. Architects also typically need to register with a local guild.
We, the IT crowd, are long overdue for this formalization of the profession.
I feel like many places have tried this and didn’t like it or molded it into a Frankenstein.
The idea of an engineer who researches, designs, tests, and measures, and a programmer who implements, seemed to cost too much (not just monetarily) for the industry that employs them, and there isn't sufficient need to regulate all the sub-groups of the industries that employ software programmers / engineers.
BTW, GPT-Engineer is openly collecting all of your data: user prompts and other metadata. And they were even defending it until they received some strong responses from the community: https://github.com/AntonOsika/gpt-engineer/issues/415 They now explicitly ask for consent regarding user data, but can we really trust their motives?
Is this prompt generation for the purposes of prompt engineering? Is this then a kind of meta engineering? Engineering for the purposes of engineering which then hopefully will generate working code for the computer that generated the prompt and the response to the prompt.
In a sense, yes: previously we were engaged in "google engineering", then we went to "stackoverflow engineering", and now it's "prompt engineering". With every step, the magic and mysticism increase.
Usage query:
It looks like this could get expensive quite quickly. The approach is great, but with GPT-4 especially it could be very difficult.
Is it worth using GPT-3.5 as a first pass, then switching the prompts over to GPT-4 once you've got the best one?
I am not sure this approach is doable, since GPT-4 is capable of solving assignments that GPT-3.5 gets wrong. For example, GPT-3.5 fails to solve this prompt (with the dvdrental sample database schema added [1]):
> find customers who didn't rent a movie in the last 12 months but rented a movie in the 12 months before that
GPT-4 solves this without a problem [2]. Or combining logic like this (without any additional database schema added):
> find all users who lives in Paris using lat/lng and who visited the south of France within the last month
GPT-3.5 can't understand this at all, GPT-4 solves it [3].
Creating a computer to find "the ultimate answer to life, the universe and everything", getting a cryptic answer and then creating an even bigger and more complicated computer to find the question is a pretty good satire of generative ai based chatbots of exponentially increasing model size.
Don't remember if it made it into the book, but from the radio series my favorite is the scene where they end up in a nightclub filled with dancing mannequins sprayed with sweat, in order to convince people it was popular so they'd come in.
Well, I am intrigued by the obvious analogies between Deep Thought and the prompt engineers (and their upcoming tools and respective complexities). On the other hand, I already see the coming war between simple, dumb business interests on both sides: one side will try to make the fascinating and useful concept (which it is, after all) consumable, while the other will be ready to kill it if they don't get paid for it using/referencing/quoting their pieces of work in its answers or "thinking"/modelling. This will probably result in a crippled tool if politics won't visibly value educational gain over commercial interest.
Pain and joke too often come as closely bundled as they do in Adams's pointed work.
Isn’t that what this codebase is doing? I haven’t grokked it 100% yet.
Recently I’ve been trying to engineer a prompt that I intend to run 1k times.
Noticing GPT-4 bug out on several responses, I've talked it through the problem more and asked it to rewrite the prompt. So an automated approach to help build better prompts based on held-out gold data is useful to me.
It seems like a `ranking_system_prompt` is used to rank the output of other prompts, which is pretty cool!
> Your job is to rank the quality of two outputs generated by different prompts. The prompts are used to generate a response for a given task. You will be provided with the task description, the test prompt, and two generations - one for each system prompt. Rank the generations in order of quality. If Generation A is better, respond with 'A'. If Generation B is better, respond with 'B'. Remember, to be considered 'better', a generation must not just be good, it must be noticeably superior to the other. Also, keep in mind that you are a very harsh critic. Only rank a generation as better if it truly impresses you more than the other. Respond with your ranking, and nothing else. Be fair and unbiased in your judgement.
Wait, so we are asking the innkeeper if the wine is good?
If the model thinks, say, that the current year is 2021, will it rate higher a response that says the same?
There should, at a minimum, be a way to print which comparisons were made, so that you could double-check whether you agree with them.
Also, we could modify the prompt to have it explain why each decision was made.
They don’t. They simply assume the model’s most likely output is meaningfully correlated with true rankings even though it was never trained on this task and certainly has not been trained to output the most likely prompt given a prompt in some meaningful order. It’s hogwash.
I'm pretty surprised more people don't use logit biases when calling OpenAI. Checking whether something is either A or B means the tokens for those letters can be given a weight of 100, which means they will be chosen no matter what and no other character is allowed.
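Roughly what that looks like with the 2023-era OpenAI Python client; the ranking prompt here is just a placeholder, and the token IDs are looked up with tiktoken rather than hardcoded:

    import openai
    import tiktoken

    ranking_prompt = "..."  # your pairwise comparison prompt goes here

    enc = tiktoken.encoding_for_model("gpt-4")
    # Bias the single tokens "A" and "B" as high as allowed so nothing else gets picked.
    bias = {str(enc.encode(letter)[0]): 100 for letter in ("A", "B")}

    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": ranking_prompt}],
        max_tokens=1,      # one token is all we need for "A" or "B"
        logit_bias=bias,
    )
    print(resp["choices"][0]["message"]["content"])  # "A" or "B"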
If you're using OAI function calling, you can define a json schema with boolean values. If you have a handful of values, you can use an enum with a list of possible values.
Unless your prompt seriously conflicts with the schema, it's pretty consistent.
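A hedged sketch of what that can look like; the function and field names here are made up, but the function-calling mechanics are the standard 2023 OpenAI chat API:

    import json
    import openai

    functions = [{
        "name": "record_label",  # hypothetical function name, purely illustrative
        "description": "Record the classification result",
        "parameters": {
            "type": "object",
            "properties": {
                "label": {"type": "string", "enum": ["positive", "negative", "neutral"]},
            },
            "required": ["label"],
        },
    }]

    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        messages=[{"role": "user", "content": "Classify the sentiment: 'I had a great day!'"}],
        functions=functions,
        function_call={"name": "record_label"},  # force the model to answer via the schema
    )
    args = json.loads(resp["choices"][0]["message"]["function_call"]["arguments"])
    print(args["label"])  # constrained to one of the enum values (barring rare misfires)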
A bit like an AutoGPT: I did not immediately see any kind of token limit, but I did not look carefully. On a complex problem, or one that accesses a lot of data, the cost might ramp up.
I'm currently working on something similar for myself, and this doesn't seem to fit my needs (I'm benchmarking generations too, rather than just classification). I only have a crude cosine-similarity metric for accuracy for now. Also, I'm using function calling rather than the normal completions.
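Concretely, that crude cosine-similarity metric is just something like this (the embedding model and the placeholder strings are only illustrative):

    import numpy as np
    import openai

    def embed(text: str) -> np.ndarray:
        # text-embedding-ada-002 was the usual choice at the time; swap in whatever you use
        resp = openai.Embedding.create(model="text-embedding-ada-002", input=[text])
        return np.array(resp["data"][0]["embedding"])

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    score = cosine_similarity(embed("output produced by the model"),
                              embed("reference answer I consider correct"))
    print(score)  # closer to 1.0 means the generation is semantically closer to the reference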
I was hoping this would do something more interesting with multiple messages (if using a chat model) rather than just dumping the entire prompt in one message. The assistant lets you do stuff with examples.
It would be cool if, given a handful of test cases, you could send those off to the LLM to generate even more test cases.
My first thought when looking over this tool was "Why do I have to do all the work?" The ideal scenario is that I give the high-level description and the LLM does the hard work of creating the best prompt.
I haven't read the article at all, but what you said reminded me that I've been using ChatGPT to create Dungeons & Dragons campaigns / adventures / worlds (it's crazy good at that), and one trick I've started using at the end of a fruitful conversation is to ask it to summarize the discussion so far _in a format to be used as a prompt_. It works, more or less.
Off topic, but Jupyter cells on GitHub can't display horizontally long content, and it frustrates me a lot. This small piece of code for the browser console helps me see more content, but it only works on large displays.
Try editing the pre style so it has `white-space: pre-wrap`. That works for me. -- This uses `pre` whitespace formatting but wraps to the next line if it is too long, unlike the default `pre` element behaviour.
You need to learn how to use this to generate a good prompt, but why not just learn how to generate good prompts? This code is basically asking for something, with examples, and then asking a few real questions to test it.
This could work really well if it replaced GPT-X-judged performance ranking with human-in-the-loop ranking of prompts, but that’s not as exciting, I guess.
I think human-in-the-loop evaluation of prompts is a red herring, since the LLMs were trained with RLHF, which optimized them to be convincing. Humans aren't reliable for evaluating these things because the models were literally trained to make us think they're doing well.
Uh...am I missing something, or is this whole thing setting the user up for humiliating failure by doing its testing the same way that bit that lawyer in the ass?
> Your job is to rank the quality of two outputs generated by different prompts. The prompts are used to generate a response for a given task.
> You will be provided with the task description, the test prompt, and two generations - one for each system prompt.
> Rank the generations in order of quality. If Generation A is better, respond with 'A'. If Generation B is better, respond with 'B'.
> Remember, to be considered 'better', a generation must not just be good, it must be noticeably superior to the other.
> Also, keep in mind that you are a very harsh critic. Only rank a generation as better if it truly impresses you more than the other.
> Respond with your ranking, and nothing else. Be fair and unbiased in your judgement.
So what factors make the "quality" of one prompt "better" than another?
How "impressive" it is to an LLM? What even impresses an LLM? I thought as an AI language model, it lacks human emotional reactions or whatever.
Quality is subjective. Even accuracy is subjective. What needs testing is alignment, with your interests. The thing is hardcoded to rate based on what aligns with the model host's interests, not yours.
Only the "classification version" looks capable of making any kind of assertion:
> 'prompt': 'I had a great day!', 'output': 'true' [sentiment analysis I assume?]
The rest of the test prompts aren't even complete sentences, they're half-thoughts you'd expect to hear Peter Gregory mutter to himself:
> 'prompt': 'Launching a new line of eco-friendly clothing' [ok, and?]
The one for 'Why a vegan diet is beneficial for your health' makes some sense at least, but it's really ambiguous.
I'm just some idiot, but if I were creating this, I'd expect it to ask for a number of expected keywords or something, to measure how close each model comes to what the user actually wants. Like, for me, 'what are operating systems' "must" mention all of the keywords Linux, Windows, and iOS, and "should" mention any of Unix, Symbian, PalmOS, etc.
All tests should tank the score if it detects fourth-wall-breaking "As an AI language model/I don't feel comfortable" crap anywhere in the response. National Geographic got outed on that one the other day.
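A rough sketch of that kind of scoring; the keyword lists, weights, and disclaimer phrases are arbitrary choices, just one way to express the "must"/"should" idea plus the fourth-wall penalty:

    AI_DISCLAIMERS = ("as an ai language model", "i don't feel comfortable")

    def keyword_score(response: str, must: list[str], should: list[str]) -> float:
        text = response.lower()
        if any(phrase in text for phrase in AI_DISCLAIMERS):
            return 0.0  # fourth-wall-breaking boilerplate tanks the score outright
        if not all(kw.lower() in text for kw in must):
            return 0.0  # every "must" keyword is required
        hits = sum(kw.lower() in text for kw in should)
        return 0.5 + 0.5 * hits / max(len(should), 1)

    # e.g. keyword_score(answer, must=["Linux", "Windows", "iOS"], should=["Unix", "Symbian", "PalmOS"])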
"Prompt engineering is kind of like alchemy. There's no clear way to predict what will work best. It's all about experimenting until you find the right prompt."
It's a response to the notion that the quote above comparing writing prompts to alchemy is so wrong that it's funny in some way. I thought it was a pretty good analogy.