This tool doesn't benchmark based on how a model actually responds to the generated prompts. Instead, it trusts GPT-4 to rank prompts simply by how well it imagines they will perform head-to-head. So there's no way to tell whether the chosen 'best prompt' actually is the best, because there's no ground truth from actual responses.
Why is this so popular, then (more popular than promptfoo, which I think is a much better tool in the same vein)? AI devs seem enamored with the idea of LLMs evaluating LLMs —everything is ‘auto-‘ this and that. They’re in for a rude awakening. The truth is, there are no shortcuts to evaluating performance in real world applications.
Because this is the upslope, if not the peak, of the hype bubble. All you have to do is use GPT for a task; nobody cares whether it actually works, and you still get whatever VC funding is left after the interest rate hikes.
Grifters. I won't say that the person working on this is a grifter. Rather, it's so popular right now because of grifters: the same type of NFT and crypto grifters who are mostly silent now. They've moved on.
How ethical would it be to sell things to these grifters, to sell them the shovels they will use? I'm always hung up on the idea that they will turn those shovels on others and exploit them.
Thanks for mentioning promptfoo. For anyone else who might prefer deterministic, programmatic evaluation of LLM outputs, I've been building this for evaluating prompts and models: https://github.com/typpo/promptfoo
Example asserts include basic string checks, regex, is-json, cosine similarity, etc. (and LLM self-eval is an option if you'd like).
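For anyone wondering what "deterministic, programmatic evaluation" looks like in practice, here's a rough standalone Python sketch of those kinds of asserts. This is only an illustration of the idea, not promptfoo's actual config format:

    import json
    import re

    def assert_contains(output: str, substring: str) -> bool:
        return substring in output

    def assert_regex(output: str, pattern: str) -> bool:
        return re.search(pattern, output) is not None

    def assert_is_json(output: str) -> bool:
        try:
            json.loads(output)
            return True
        except json.JSONDecodeError:
            return False

    # e.g. run each assert against a model's output and fail the test case on any False
    checks = [
        assert_contains("The answer is 42.", "42"),
        assert_regex("Order #12345 confirmed", r"#\d+"),
        assert_is_json('{"sentiment": "positive"}'),
    ]
    print(all(checks))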
No problem! I guess I will make a plug myself --we've been working on a similar 'prompt engineering' tool, ChainForge (https://github.com/ianarawjo/ChainForge). It's targeted towards slightly different users and use cases than promptfoo --geared more towards early-stage, 'quick-and-dirty' explorations of differences between prompts and models for less experienced programmers, versus the kind of continuous benchmarking and verification testing power that promptfoo offers.
I particularly like promptfoo's support for CI, which I haven't seen anywhere else, and is very important for developers pushing prompts into production (esp since OpenAI keeps updating their models every few months...).
There is a paper on arXiv saying that GPT-4's correlation with human evaluators on a variety of tasks is strongly positive. I am also uncomfortable with it, but using GPT-4 as a grader is not as bad as you think.
You’re missing the point here. It’s not even getting the LLM’s opinion on evaluating the responses to the prompts (which itself is fraught for some tasks, and benchmarks are known to be limited —even OpenAI admits this, it’s why they made evals). It’s one level abstracted from that. It’s evaluating what the LLM thinks of how well the prompt will do, in purely hypothetical terms. That’s hogwash —different LLMs perform very differently even for the same prompts. Try any tool that lets you compare model responses side-by-side. Unless I see actual use cases, this is yet another iteration of overtrusting AI.
> Here is what HN was talking about, nearly three months ago - the exact same type of 'auto-prompt-gen' tool.
I was reminded of the same thing. What a lot of it boils down to is that LLMs have no innate ability to self-reflect. They can pretend to do it, but no more effectively than an untrained human would.
Of course there is strong correlation. That is literally what it was designed to do.
The problem is that it will simultaneously say that "cow eggs are bigger than chicken eggs", with the same confidence (and in a way that correlates well with human evaluators).
I just asked and it told me cows are mammals and do not lay eggs. That Reddit post is not even GPT-4 and is 5 months old, which may as well be the 19th century on AI tech timescales.
You are concentrating on the details and avoiding the point.
The point is that the tool fails, and it is known to fail, so much so that we even have a name for the times when it fails: hallucinations. I have been calling them cow eggs because that's a nice mental image and I didn't want to have to remember the proper English term. I will continue calling them cow eggs.
Easiest way to get to Radio Shack on a bicycle is to ride that bike down to Doc Brown’s house, charge the Delorean up to 1.21 gigawatts, and go back in time.
Humans benefit from good communication too. For example, annual U.S. deaths from medical errors are in the hundreds of thousands, and much of that is due to miscommunication. Is this akin to poor human-to-human prompt engineering? Of course, humans will rush and not attempt better communication, whereas you can take all the time you wish with an AI. And AI will continue to incorporate better prompt engineering that you won't have to write out. But there will always be a continuum from good to bad, both for communication and for communication outcomes.
You're forgetting that what you may consider factual, self-evident, and a priori is your opinion.
You may be under the impression that "annual U.S. deaths from medical errors are in the hundreds of thousands" miscommunicates, but that is truly your opinion. You are merely jumping to conclusions in places another person might not.
And going on to rely on the LLM to validate your perspective is a lossy process. It may not lose your perspective but it loses someone else's and you don't even seem to notice or care.
The post you replied to was saying that the deaths were caused by miscommunication, but you interpreted it to mean that stating the number of such deaths is somehow a miscommunication itself!
If we could use GPT-4 to grade prompts, we wouldn't need to be talking about grading prompts to use for GPT-4, since this solution requires that the problem doesn't exist. The question then becomes: how do you grade the prompt grading, objectively? At the bottom, there has to be a ground truth.
You can't use the thing you're testing to evaluate its own performance. This applies to rulers, speedometers, and AI. It's the difference between a "subjective" and an "objective" metric. If you want an objective metric, you need to base it on something external, on reality. Otherwise, you have metrics and ideas that have to hold themselves up.
Source: My day job is test and measurement. These concepts go back centuries. You never trust your measurement system, you verify it against a standard.
> A positive correlation just means better than chance. "Strongly" is vague, and might not be much better than chance.
No, adverbs like "strongly" modify adjectives (or verbs, but that's not relevant here), not nouns; "strongly" is an intensifier that modifies "positive", it's not a separate adjective that modifies the noun "correlation".
Well, you could keep everything else in the project and put yourself or another human in as the "does this result feel better than the other one" decision maker.
It seems to me, also, that this is very much some sort of snake oil for the LLM era. Prompt generation varies from LLM to LLM, and I doubt GPT-4 can do a reasonable evaluation, given that it doesn't know anything about other models.
That's if you're trying to become a scientist. You can do science even if your experiment doesn't contain any of those elements besides trying multiple things and observing the outcomes.
Yes, because practicing Medicine is about a whole apparatus of agreed-upon practices, not picking and choosing and slapping a label on it. Much like the case with doing Science, which is what I was getting at.
Peer review has been a thing, albeit in a much less rigorous form, since at least the 17th century. The so-called 'republic of letters' was a world-wide (well, Europe-wide) society of philosophers, mathematicians, and early experimenters sharing results and data in writing. Mathematicians have been publishing proofs and challenging each other in Europe since at least the Italian Risorgimento...
This is more engineering than science. For a given task, try a bunch of approaches and keep the best one.
Science would be replicable, ie demonstrate that this particular approach to prompt writing yields better prompts than some baseline prompt writing approach across an array of different problems.
The difference is that in science the dialectic is between you, the observer (or a group of observers), and a "reality" that hopefully(!) remains relatively stable, because the human organism is the way that it is, just as physical reality is.
But these models are changing literally every day, so there's no fixed thing to reveal.
So no one will be able to reproduce anything at all.
This makes all this "engineering" pretty ridiculous in my eyes; it's literally for one model's bizarre emergent properties.
Gradient descent type optimization is far from "trying random things until one of them works without really knowing why". You can calculate all partial derivatives and understand the impact.
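As a toy illustration of that point, here's plain gradient descent on a two-parameter loss with a known analytic gradient (nothing LLM-specific, just the mechanics):

    import numpy as np

    # Toy loss: f(w) = (w0 - 3)^2 + (w1 + 1)^2. Its partial derivatives are known exactly,
    # so the effect of every update on the loss is explainable rather than random.
    def grad(w):
        return np.array([2 * (w[0] - 3), 2 * (w[1] + 1)])

    w, lr = np.array([0.0, 0.0]), 0.1
    for _ in range(100):
        w -= lr * grad(w)  # nudge each parameter against its own partial derivative

    print(w)  # converges toward [3, -1], the minimum of the loss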
I think it is a freaking miracle it even works. I understand how it works for, say, a 100-parameter linear regression, but that it would work for billions of parameters (a billion-dimensional space) by nudging each parameter a little (based purely on its impact on the loss, assuming everything else stays the same) is not obvious to me. It is a kind of magic.
Regarding randomness: the initialization of the weights is random, and if they use dropout that is random too, plus the order in which the text is processed might be random.
To be honest, there's always been a tension in machine learning between the "we won't do it unless the theory is complete and sound" crowd and the "we don't understand this, but it consistently works much better, so we do it" crowd.
In the 90s and even 00s, the theory first crowd was mainstream and the empirical first crowd was considered fringe. Very fringe.
Personally, I appreciated it when LeCun said something like: "you can't find the solution if you only search where the lamplight is shining." Or when other early deep learning practitioners noted that ML theory is usually so far disconnected from practice, in terms of the tightness of its bounds, that you might as well ignore pure theory completely.
Anyway, it wasn't until deep learning methods really smashed benchmarks across the board that people gave in to the black-magic / alchemy-driven approach of empiricism, based on intuition and bias built up through long experience.
I think the real etymology of it was "social engineering", which also means clever hacks, not real engineering; it deliberately subverted the meaning.
Isn’t engineering an exact science while prompt engineering is completely not?
Although, even granting that software engineering is an exact science, it is a funny one: most of us don't get certified the way, say, mechanical engineers do. Would they say we are engineers?
So perhaps the “engineer” term got overloaded in recent years?
Prompt "engineering" is just writing prayers to forest faeries.
Whilst BASIC/JavaScript/etc. are all magic incantations to a child, a child will soon figure out there's underlying logic, and learn to reason about what code does and what certain changes will do.
With prompts, it's all faerie logic. There is nothing to learn, there are only magic incantations that change drastically if the model is updated.
Worse yet, the incantations cannot be composed. E.g. take the SQL statement "SELECT column FROM table WHERE column = [%s]". For any given string you insert here, the output is predictable. You can even know which characters would trigger an injection attack.
With prompts you cannot predict results. Any word, phrase, or sequence of characters may upset the faeries and cause the model to misbehave in who knows what way. No processing of user-input will stop injection attacks.
Whilst it's dubious to call current software development practices "engineering", it's utterly ridiculous to do so for prompt-writing.
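To make the SQL contrast above concrete, here's a minimal sketch using Python's sqlite3 as a stand-in; with a parameterized query the behaviour is predictable for any input, which is exactly the property prompts lack:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (col TEXT)")
    conn.executemany("INSERT INTO t VALUES (?)", [("hello",), ("Robert'); DROP TABLE t;--",)])

    # With a parameterized query the user string is always treated as data, never as SQL,
    # so the behaviour is predictable for any input, including injection attempts.
    user_input = "Robert'); DROP TABLE t;--"
    rows = conn.execute("SELECT col FROM t WHERE col = ?", (user_input,)).fetchall()
    print(rows)  # the malicious string only matches as a literal value; the table is intact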
I don't get where this sentiment comes from. I build software specifically on the concept of predictable, composable results from LLMs.
Sure, the results are not deterministic, in the sense that the exact same prompt returns the exact same result 100% of the time, but you can tune your prompts so that 100% of the time they give you a valid result in the result category you were seeking, with a specific probability distribution over the available choices.
Prompts are functions that take concrete input and produce a probabilistic output that can be automated upon, especially if you only need to output one token, e.g. a number, boolean, word, or object reference. And for obvious reasons, the further out you forecast in a sequence, the less accurate you will be.
As long as you don't change the underlying model, in a massive model with billions of parameters, there are definitely mechanisms and behaviors to discover that you can reason about.
> you can tune your prompts so that 100% of the time they give you a valid result in the result category you were seeking
You can't though, that's the issue. Illustrative here are tokens like "SolidGoldMagikarp", but this does happen to "normal" sequences of tokens as well.
There is no filter you can build to keep out such mistakes, any set of otherwise normal tokens could trigger the model to produce wrong output.
Because of how large these models and most prompts are, even slight changes in things like attention can cascade into extremely different results.
> there are definitely mechanisms and behaviors to discover that you can reason about.
It's faerie logic. The behaviours are mere trends and observations, not underlying truth.
The faeries reward you for offering them fruit. But offer them an apple that fell from the tree exactly 74 hours ago, down to the second, and they'll kill you. There is no way to know ahead of time which things will upset them.
The risk here is that you're fooled into believing these systems are understandable, that you know how they work, and that you'll mistakenly use them for something where the wrong results have consequences. You'll stop double-checking the output, all humans are lazy like that, and then you'll have disaster on your hands.
You can reasonably expect an LLM to respond appropriately, often. How often depends on the details, but it's not much more magic than expecting the bridge you built to hold up.
You could do a sort of validation of the output by prompting the LLM repeatedly with the same prompt and then comparing the responses to eliminate outliers. I do feel like this stuff is magic, though; just wanted to provide a counterpoint.
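One simple way to do that kind of repeated-sampling check is a plain majority vote. A rough sketch, where `ask_model` is a stand-in for whatever function wraps your LLM call:

    from collections import Counter

    def majority_answer(ask_model, prompt: str, n: int = 5):
        """Sample the model n times on the same prompt and keep the most common answer."""
        answers = [ask_model(prompt) for _ in range(n)]
        best, count = Counter(answers).most_common(1)[0]
        return best, count / n  # the winning answer plus a crude agreement score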
In "The Information," James Gleick discusses a concept related to our current discourse. In the days when computers were merely an array of switching circuits, luminaries such as Claude Shannon believed that "thinking" could be captured in a structured format of logical representation.
However, even with formally composable languages like JavaScript, a semblance of unpredictability — akin to the "faerie logic" metaphor — still persists. Languages evolve over time; Python, for instance, with its various imports that constantly disrupt my code, serves as a good example. This is perhaps the reason behind the emergence of containers to ensure code consistency.
While some elements may be more "composable" than others, it appears increasingly unrealistic in today's world to encapsulate thought processes or interactions with systems within a rigid logical framework. Large Language Models (LLMs) will keep evolving and improving, making continual interaction with them unavoidable. The notion that we can pass a set of code or words through them once and expect a flawless result is simply illogical.
I firmly believe that any effective system should incorporate a robust user interaction component, regardless of the specific task or problem at hand.
It's not so much about formal logic, but general predictability.
> even with formally composable languages like JavaScript, a semblance of unpredictability — akin to the "faerie logic" metaphor — still persists
And they're ridiculed for it, and as you state, we design around them or replace such systems entirely.
> making continual interaction with them unavoidable
Technology is never unavoidable or "inevitable". We can choose not to use it, or when to use it.
> The notion that we can pass a set of code or words through them once and expect a flawless result is simply illogical.
Yet that is what we expect when we put these systems into production use, especially when many proposed use cases are user-facing and subject to injection attacks.
Whether it be the writing of ad copy, the processing of loan applications, or the generation of code, mistakes in these tasks have very real consequences.
I don't disagree we can choose to use it or not, but my point was more meant to indicate that, if we want a good experience with LLMs, we have to continue to interact with them to achieve good results.
We need to move away from prompt-engineering - it's AI-Management. You pretend you're speaking to another (albeit confusing/confused) person when extracting work from a model. You're coaxing things out of it based on hearsay and mysticism that work most of the time. Sounds a lot like AGILE and free pizza to get a junior to stay late and deliver on time.
It's so refreshing to see someone actually write this about prompt writing. It makes a welcome change from Twitter AI influencers posting their ridiculous prose as some marvel of harnessing LLMs.
In my opinion, coding is a craft. As software has only existed for like 70 years, we are more like the guilds building cathedrals in the middle ages than like modern civil engineering.
One day I think there will be true software engineering. When that happens you won't be able to start software projects without certifications, and most people (or programs!) who actually do the coding will be following careful plans and instructions from the engineers who designed the project.
I for one am very happy software isn't fully professionalized yet!
> One day I think there will be true software engineering. When that happens you won't be able to start software projects without certifications, and most people (or programs!) who actually do the coding will be following careful plans and instructions from the engineers who designed the project.
Sounds really bad.
Software can't fall on your head and kill you, not all of it at least.
Different software should require different professionals building it.
And it's usually not about the software but about the management telling the engineers to take shortcuts or whatever (Boeing comes to mind)
You might enjoy this article on the Therac-25 [1]. It's kind of the standard example of how errors in software can wind up harming people. I have written medical device software for about 30 years. In my experience, delivering high quality software for Class B and Class C devices is both challenging and expensive.
Every software developer should know this story; it is a humbling and important lesson. Yes, luckily most of us can't ship code that accidentally kills people, but we can absolutely empathize with the conditions which led to it happening.
Thank you so much for the link to the postmortem. I will be sharing it and discussing it with my colleagues. We are currently working on the embedded software for an AED.
Don't want to argue against you, since the term is definitely misused for signaling a lot of things, but in a stricter sense, software engineering is not about the puzzle, it's about the whole pipeline surrounding it.
I’m guessing you aren’t from the US because in the US there is not much prestige in the title of engineer, but it seems to be a title people in Europe get weirdly hung up on. If anything, software engineering positions are more prestigious than most traditional engineering positions here.
I was a real engineer for a decade before switching to programming.
I cannot use the term "software engineer", since it's nothing like real engineering.
Real engineering was def harder, more math intense, and the stakes were sooo much higher. While many software problems can cause you to lose money, engineering problems can cause you to lose time. Yes it sucks your CAD designer had everything on a 0.05 degree angle and it costs 1M to redo the tool, but it also costs 16 weeks to redo the tool. We'd even offer to pay absurd money, prevent future business, etc... to get the tool done in 8 weeks, but its impossible to get it done faster. Now everything in the company is 8 weeks behind schedule.
Anyway, real engineering was harder, but programming pays soooo much more money. Its a demand thing, not a difficulty thing.
> While many software problems can cause you to lose money, engineering problems can cause you to lose time.
I'm guessing you were trying to say something else here. I literally cannot think of a single software engineering problem I've ever encountered that didn't cost time. By your definition, then, software engineering is engineering. Your claim and your definitions are at odds with one another.
Also, you don't directly claim it, but you seem to imply, that software engineering can't have real-world consequences... or something? As another reply points out, sometimes software is in the critical path for things like rockets and airplanes, where mistakes cost lives.
And some people making software for less life-altering systems take their craft just as seriously. Some people think that losing $10M every single second while their software is failing is a big deal.
Are you claiming people who write HFT code, ad arbitrage code, code that powers the front page of Apple, Amazon, Microsoft, and Google are just cowboying it through the day, doing nothing special?
Overall I just find this comment very confused. Maybe you could put some thought into what you're trying to say, and say it better?
So if you have your PE license and are writing software that controls a commercial airplane system or a spacex rocket, you are not an engineer doing software engineering?
From Wikipedia:
"A software engineer is a person who applies the engineering design process to design, develop, maintain, test, and evaluate computer software."
"The engineering design process, also known as the engineering method, is a common series of steps that engineers use in creating functional products and processes. The process is highly iterative - parts of the process often need to be repeated many times before another can be entered - though the part(s) that get iterated and the number of such cycles in any given project may vary.
It is a decision making process (often iterative) in which the basic sciences, mathematics, and engineering sciences are applied to convert resources optimally to meet a stated objective. Among the fundamental elements of the design process are the establishment of objectives and criteria, synthesis, analysis, construction, testing and evaluation.[1]"
It is not dependent on the problem domain, rather on how the work is performed.
I agree that far too many people call themselves software engineers when they really aren't.
But I just mean to say that software engineering itself is definitely still a real thing and there are many people out there that can and should call themselves software engineers.
"Real engineer" is such an insufferable way to make your point. I guess only civil engineering is "real." If you don't like the term software engineer, you'd be real mad if you found out about sound stage engineers. Being an engineer isn't about engineering solutions to business problems; nope, it's having to deal with the bureaucratic hurdles that slow you down that makes one an engineer.
In essence, programming is an ephemeral art. At its core, our work involves utilizing the electromagnetic field to perform tasks akin to 'thinking.' Without continuous computational processes, our products might as well be inert objects like rocks that simply exist unchanged for millennia.
In my country there used to be 5-year (realistically 6 for most students) Computer Engineering programs that were basically electronic engineering with selected computer science classes tacked on top.
And since Electronic Engineering programs used to have a lot of Civil Engineering classes tacked onto them, said Computer Engineers were legally licensed to build small 3-floor buildings.
People who have not studied engineering and been given that distinction going on to call themselves engineers is the reason the term is overloaded.
I've always assumed there's a near-100% overlap between people using the term wrongly to describe any programming activity and people complaining that it has no meaning or is self-aggrandisement.
How does certification work? I’d imagine it would be in a less abstracted language like C or C++. The problem for me is most of my schooling was based on web technologies and 2 classes of Java.
I would hate having to study for a C-based test when the area I work in is all web tech. The same could be said of Java: I learned it in school and haven't used it in 10 years, except for it being the backend on my first front-end dev project.
Anyway, the best engineer for ChatGPT prompts can be ChatGPT itself.
People seem to think that "automation will create new jobs", but in the age of AI, those job opportunities will be very temporary, as the companies making the AI automate that thin layer.
Similarly, people think that humans will control AIs. That’s a bit quaint, a bit like humans controlling a corporation. The thin layer of “control” can be easily swapped out and present an improvement in the market, so that the number of totally autonomous (no human in the loop) workflows will grow.
That can include predictive policing with Palantir (thanks Peter Thiel!), autonomous killbots in war etc. Seeing how reckless companies have been in releasing the current AI in an arms race, I don’t see how they would be restrained in a literal arms race of slaughterbot swarms and panopticon camera meshes.
PS: I remember this exact phase when computers like Deep Blue beat Garry Kasparov. For a while, he and others advocated "centaurs": humans collaborating with computers. But over the last decade, hardly anyone will claim that a system with humans in the loop can beat a system that's fully automated: https://en.m.wikipedia.org/wiki/Advanced_chess
“Engineer” originally just meant someone who builds engines (in a broad sense of that word). The formal titles requiring certification, etc. are the more recent development.
You make a valid point, and no, we are not engineers. We are people with printed labels at best, where the label says architect or engineer. But most of the people with these labels don't even have a degree, which is the prerequisite for these designations. Architects also typically need to register with a local guild.
We, the IT crowd, are long overdue for this formalization of the profession.
I feel like many places have tried this and didn’t like it or molded it into a Frankenstein.
The idea of an engineer who researches, designs, tests, and measures, and a programmer who implements, seemed to cost too much (not just monetarily) for the industry that employs them, and there isn't sufficient need to regulate all the sub-groups of the industries that employ software programmers / engineers.
BTW, GPT-Engineer is openly collecting all of your data: user prompts and other metadata. And they were even defending it until they received some strong responses from the community: https://github.com/AntonOsika/gpt-engineer/issues/415 They now explicitly ask for consent regarding user data, but can we really trust their motives?
Is this prompt generation for the purposes of prompt engineering? Is this then a kind of meta engineering? Engineering for the purposes of engineering which then hopefully will generate working code for the computer that generated the prompt and the response to the prompt.
In a sense, yes: previously we were engaged in "google engineering", then we went to "stackoverflow engineering", and now it's "prompt engineering". With every step, the magic and mysticism increase.
Usage query:
It looks like this could get expensive quite quickly. The approach is great, but with GPT-4 especially it could be very difficult.
Is it worth using GPT-3.5 as a first pass, then switching the prompts over to GPT-4 once you've got the best one?
I am not sure this approach is doable, since GPT-4 is capable of solving assignments that GPT-3.5 gets wrong. For example, GPT-3.5 fails to solve this prompt (with the dvdrental sample database schema added [1]):
> find customers who didn't rent a movie in the last 12 months but rented a movie in the 12 months before that
GPT-4 solves this without a problem [2]. Or combining logic like this (without any additional database schema added):
> find all users who lives in Paris using lat/lng and who visited the south of France within the last month
GPT-3.5 can't understand this at all, GPT-4 solves it [3].
Creating a computer to find "the ultimate answer to life, the universe and everything", getting a cryptic answer and then creating an even bigger and more complicated computer to find the question is a pretty good satire of generative ai based chatbots of exponentially increasing model size.
Don't remember if it made it into the book, but from the radio series my favorite is the scene where they end up in a nightclub filled with dancing mannequins sprayed with sweat, in order to convince people it was popular so they'd come in.
Well, I am intrigued by the obvious analogies between Deep Thought and the prompt engineers (and their upcoming tools and respective complexities). On the other hand, I already see the coming war between simple, dumb business interests on both sides: one side will try to make the fascinating and useful concept (which it is, after all) consumable, while the other will be ready to kill it if they don't get paid for it using/referencing/quoting their pieces of work in its answers or "thinking"/modelling. This will probably result in a crippled tool if politics won't visibly value educational gain over commercial interest.
Pain and joke too often come as closely bundled as they do in Adams's pointed work.
Isn’t that what this codebase is doing? I haven’t grokked it 100% yet.
Recently I’ve been trying to engineer a prompt that I intend to run 1k times.
Noticing GPT-4 bug out on several responses, I've talked it through the problem more and asked it to rewrite the prompt. So an automated approach to help build better prompts based on held-out gold data is useful to me.
It seems like a `ranking_system_prompt` is used to rank the output of other prompts, which is pretty cool!
> Your job is to rank the quality of two outputs generated by different prompts. The prompts are used to generate a response for a given task. You will be provided with the task description, the test prompt, and two generations - one for each system prompt. Rank the generations in order of quality. If Generation A is better, respond with 'A'. If Generation B is better, respond with 'B'. Remember, to be considered 'better', a generation must not just be good, it must be noticeably superior to the other. Also, keep in mind that you are a very harsh critic. Only rank a generation as better if it truly impresses you more than the other. Respond with your ranking, and nothing else. Be fair and unbiased in your judgement.
Wait, so we are asking the innkeeper if the wine is good?
If the model thinks, say, that the current year is 2021, will it rate higher a response that says the same?
There should, at a minimum, be a way to print which comparisons were made, so that you could double-check whether you agree with them.
Also, we could modify the prompt to have it explain why each decision was made.
They don’t. They simply assume the model’s most likely output is meaningfully correlated with true rankings even though it was never trained on this task and certainly has not been trained to output the most likely prompt given a prompt in some meaningful order. It’s hogwash.
I'm pretty surprised more people don't use logit biases when calling OpenAI. Checking whether something is either A or B means the tokens for those letters can be given a weight of 100, which means they will be chosen no matter what and no other character is allowed.
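Roughly what that looks like with the 2023-era OpenAI Python client; the ranking prompt here is just a placeholder, and the token IDs are looked up with tiktoken rather than hardcoded:

    import openai
    import tiktoken

    ranking_prompt = "..."  # your pairwise comparison prompt goes here

    enc = tiktoken.encoding_for_model("gpt-4")
    # Bias the single tokens "A" and "B" as high as allowed so nothing else gets picked.
    bias = {str(enc.encode(letter)[0]): 100 for letter in ("A", "B")}

    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": ranking_prompt}],
        max_tokens=1,      # one token is all we need for "A" or "B"
        logit_bias=bias,
    )
    print(resp["choices"][0]["message"]["content"])  # "A" or "B"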
If you're using OAI function calling, you can define a json schema with boolean values. If you have a handful of values, you can use an enum with a list of possible values.
Unless your prompt seriously conflicts with the schema, it's pretty consistent.
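A hedged sketch of what that can look like; the function and field names here are made up, but the function-calling mechanics are the standard 2023 OpenAI chat API:

    import json
    import openai

    functions = [{
        "name": "record_label",  # hypothetical function name, purely illustrative
        "description": "Record the classification result",
        "parameters": {
            "type": "object",
            "properties": {
                "label": {"type": "string", "enum": ["positive", "negative", "neutral"]},
            },
            "required": ["label"],
        },
    }]

    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        messages=[{"role": "user", "content": "Classify the sentiment: 'I had a great day!'"}],
        functions=functions,
        function_call={"name": "record_label"},  # force the model to answer via the schema
    )
    args = json.loads(resp["choices"][0]["message"]["function_call"]["arguments"])
    print(args["label"])  # constrained to one of the enum values (barring rare misfires)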
A bit like an AutoGPT: I did not immediately see any kind of token limit, but I did not look carefully. On a complex problem, or one that accesses a lot of data, the cost might ramp up.
I'm currently working on something similar for myself, and this doesn't seem to fit my needs (I'm benchmarking generations too, rather than just classification). I only have a crude cosine-similarity metric for accuracy for now. Also, I'm using function calling rather than the normal completions.
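Concretely, that crude cosine-similarity metric is just something like this (the embedding model and the placeholder strings are only illustrative):

    import numpy as np
    import openai

    def embed(text: str) -> np.ndarray:
        # text-embedding-ada-002 was the usual choice at the time; swap in whatever you use
        resp = openai.Embedding.create(model="text-embedding-ada-002", input=[text])
        return np.array(resp["data"][0]["embedding"])

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    score = cosine_similarity(embed("output produced by the model"),
                              embed("reference answer I consider correct"))
    print(score)  # closer to 1.0 means the generation is semantically closer to the reference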
I was hoping this would do something more interesting with multiple messages (if using a chat model) rather than just dumping the entire prompt in one message. The assistant lets you do stuff with examples.
It would be cool if, given a handful of test cases, you could send those off to the LLM to generate even more test cases.
My first thought when looking over this tool was "Why do I have to do all the work?" The ideal scenario is that I give the high-level description and the LLM does the hard work of creating the best prompt.
I haven't read the article at all, but what you said reminded me that I've been using ChatGPT to create Dungeons & Dragons campaigns / adventures / worlds (it's crazy good at that), and one trick I've started using at the end of a fruitful conversation is to ask it to summarize the discussion so far _in a format to be used as a prompt_. It works, more or less.
Off topic, but Jupyter cells on GitHub can't display horizontally long content, and it frustrates me a lot. This small piece of code for the browser console helps me see more content, but it only works on large displays.
Try editing the pre style so it has `white-space: pre-wrap`. That works for me. -- This uses `pre` whitespace formatting but wraps to the next line if it is too long, unlike the default `pre` element behaviour.
You need to learn how to use this to generate a good prompt, but why not just learn how to generate good prompts? This code is basically asking for something, with examples, and then asking a few real questions to test it.
This could work really well if it replaced GPT-X-judged performance ranking with human-in-the-loop ranking of prompts, but that’s not as exciting, I guess.
I think human-in-the-loop evaluation of prompts is a red herring, since the LLMs were trained with RLHF, which optimized them to be convincing. Humans aren't reliable for evaluating these things because the models were literally trained to make us think they're doing well.
Uh...am I missing something, or is this whole thing setting the user up for humiliating failure by doing its testing the same way that bit that lawyer in the ass?
> Your job is to rank the quality of two outputs generated by different prompts. The prompts are used to generate a response for a given task.
> You will be provided with the task description, the test prompt, and two generations - one for each system prompt.
> Rank the generations in order of quality. If Generation A is better, respond with 'A'. If Generation B is better, respond with 'B'.
> Remember, to be considered 'better', a generation must not just be good, it must be noticeably superior to the other.
> Also, keep in mind that you are a very harsh critic. Only rank a generation as better if it truly impresses you more than the other.
> Respond with your ranking, and nothing else. Be fair and unbiased in your judgement.
So what factors make the "quality" of one prompt "better" than another?
How "impressive" it is to an LLM? What even impresses an LLM? I thought as an AI language model, it lacks human emotional reactions or whatever.
Quality is subjective. Even accuracy is subjective. What needs testing is alignment, with your interests. The thing is hardcoded to rate based on what aligns with the model host's interests, not yours.
Only the "classification version" looks capable of making any kind of assertion:
> 'prompt': 'I had a great day!', 'output': 'true' [sentiment analysis I assume?]
The rest of the test prompts aren't even complete sentences, they're half-thoughts you'd expect to hear Peter Gregory mutter to himself:
> 'prompt': 'Launching a new line of eco-friendly clothing' [ok, and?]
The one for 'Why a vegan diet is beneficial for your health' makes some sense at least, but it's really ambiguous.
I'm just some idiot, but if I were creating this, I'd expect it to ask for a number of expected keywords or something, to measure how close each model comes to what the user actually wants. Like, for me, 'what are operating systems' "must" mention all of the keywords Linux, Windows, and iOS, and "should" mention any of Unix, Symbian, PalmOS, etc.
All tests should tank the score if it detects fourth-wall-breaking "As an AI language model/I don't feel comfortable" crap anywhere in the response. National Geographic got outed on that one the other day.
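A rough sketch of that kind of scoring; the keyword lists, weights, and disclaimer phrases are arbitrary choices, just one way to express the "must"/"should" idea plus the fourth-wall penalty:

    AI_DISCLAIMERS = ("as an ai language model", "i don't feel comfortable")

    def keyword_score(response: str, must: list[str], should: list[str]) -> float:
        text = response.lower()
        if any(phrase in text for phrase in AI_DISCLAIMERS):
            return 0.0  # fourth-wall-breaking boilerplate tanks the score outright
        if not all(kw.lower() in text for kw in must):
            return 0.0  # every "must" keyword is required
        hits = sum(kw.lower() in text for kw in should)
        return 0.5 + 0.5 * hits / max(len(should), 1)

    # e.g. keyword_score(answer, must=["Linux", "Windows", "iOS"], should=["Unix", "Symbian", "PalmOS"])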
"Prompt engineering is kind of like alchemy. There's no clear way to predict what will work best. It's all about experimenting until you find the right prompt."
It's a response to the notion that the quote above comparing writing prompts to alchemy is so wrong that it's funny in some way. I thought it was a pretty good analogy.