Parrot is no longer a useful mental model. A parrot cannot produce runs of logically consistent output, but newer LLMs can.
They still lack intent/understanding. But a system does not need understanding for its output to consist of racially biased statements.
But let's also point out that GPT can make very powerful anti-racist statements. It can persuade people to be less racially biased, reacting to the specific beliefs and points raised in the conversation and generating sensible counter-arguments.
In that way, GPT-4 is better at refuting racist rhetoric than most people are.
So its potential here is both good - its ability to help the fight against racism - and bad - its risk of generating racism.
That same thing applies to the employees at your company. And you should be talking to it like it's a real employee, with the power to critique and improve your process and designs.
It can simulate a decent project or product manager, a requirements analyst, a test planner, a junior implementer, a debugger, an automation engineer, etc.
If you can't make your engineering pipeline nicer with those kinds of roles, I think you would either struggle to direct a team of those same people IRL, or you're just not being candid and effective enough with your prompting. Read papers on prompting ideas like reflection, critique, tool use, etc. Really treat it like a valuable team member, and it will boost your product requirements, write great tests right out of the gate, and do many other things you will WISH you could get your human colleagues to do! And all for a tiny fraction of the cost of human specialists!
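To make the reflection/critique point concrete, here's a minimal sketch of a draft-critique-revise loop. The prompts, role framing, and model name are placeholders of mine; the only real assumption is an OpenAI-style chat completions API.

```python
# Minimal draft -> critique -> revise loop; prompts and model are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # any capable chat model
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return resp.choices[0].message.content

draft = ask("You are a requirements analyst.",
            "Draft acceptance criteria for a password-reset flow.")
critique = ask("You are a skeptical test planner. List gaps, ambiguities, "
               "and missing edge cases in these acceptance criteria.",
               draft)
revised = ask("You are a requirements analyst. Revise the draft to address "
              "every point in the critique.",
              f"Draft:\n{draft}\n\nCritique:\n{critique}")
print(revised)
```

The specific prompts don't matter; the point is that you give the model a role, let it push back on its own output, and only then take the result.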
Sure, anyone who is part of a majority group. If there is only "one kind" of person with similar experiences, that's how everyone tends to think and perceive. Only when an outsider enters, or members of the majority leave their population for another majority or a mixed population, do they face the question: am I racist?
I guess different people have different definitions, but to me, a racial bias that makes you think someone different from you is superior wouldn't be considered racism.
For example, if a <skin colour 1> person thinks that all people of <different skin colour> are basically the same but all seem to be more intelligent than people of <colour 1>, it's definitely a racial bias, but is it really racist to think that a different group of people has an advantage somehow?
Arguably it's still racism, even though it's your own genetics you're putting down rather than other people's, but as an example: if a black person in the USA said "I don't think I'll try to go to university, it seems white people find academic work easier" I'd call it internalised racism, or racially biased, but I wouldn't call that person "a racist" even though I disagree with them. Then again, if they started going round trying to convince everyone else that black people aren't as clever as white people, then I would consider them racist despite being the skin colour they're being racist against. To me it's about negativity towards a group vs. misguided thinking, rather than about whether it's against people like you or not.
This article makes the same Stats 101 mistakes with p-values that the Bloomberg article does.
All this article can say is that it cannot reject the null hypothesis (that ChatGPT does not produce statistical discrepancies).
It certainly cannot state that ChatGPT is definitively not racist. The article moves the discussion in the right direction, though.
Also, I didn't look too closely, but their table under "Where the Bloomberg study went wrong" has unreasonable expected frequencies. But then I noticed it was because it was measuring "name-based discrimination." This is a terrible proxy to determine racism in the resume review process, but that is what Bloomberg decided on so wtv lol. Not faulting the article for this, but this discussion seems to be focused on the wrong metric.
If you are going to argue with people over stats, then don't make the same mistakes...
Author here. We mentioned in the piece that we can't rule out that ChatGPT is racist, and that it's possible a larger sample size would show it. The caveat is that these tests might show evidence of bias if the sample size were increased to, say, 10,000 rather than 1,000; that is, with more data, the p-value might show that ChatGPT is indeed more biased than random chance. The thing is, we just don't know from their analysis, though it certainly rules out extreme bias.
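To illustrate the sample-size point with made-up numbers (these are not Bloomberg's counts, just plausible ones): a chi-square goodness-of-fit test on the same selection proportions can be nowhere near significance at n=1,000 and comfortably significant at n=10,000, because the test statistic scales with n.

```python
# Hypothetical counts only, NOT Bloomberg's data. Eight name groups, tallying
# how often each group's resume was ranked first; unbiased selection would
# give each group roughly 1/8 of the picks.
from scipy.stats import chisquare

observed_1k = [140, 132, 128, 125, 122, 120, 118, 115]  # sums to 1,000
expected_1k = [1000 / 8] * 8                             # 125 per group

stat, p = chisquare(observed_1k, expected_1k)
print(f"n=1,000:  chi2={stat:.2f}, p={p:.3f}")   # p ~ 0.8 -> fail to reject

observed_10k = [c * 10 for c in observed_1k]             # same proportions, 10x data
expected_10k = [10000 / 8] * 8

stat, p = chisquare(observed_10k, expected_10k)
print(f"n=10,000: chi2={stat:.2f}, p={p:.2e}")   # the same skew is now significant
```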
Any naive use of an LLM is unlikely to produce good results, even with the best models. You need a process - a sequence of steps, with appropriately safeguarded prompts at each step. AI will eventually reach a point where you can get all the subtle nuance and quality in task performance you might desire, but right now you have to dumb things down and be very explicit. Assumptions will bite you in the ass.
Naive, superficial one-shot prompting, even with CoT or other clever techniques, or a big context, is insufficient to achieve quality, predictable results.
Dropping the resume into a prompt with few-shot examples can get you a little consistency, but what really needs to be done is repeated discrete operations that link the relevant information to the relevant decisions. You'd want to do something like tracking years of experience, age, work history, certifications, and so on, completely discarding any information not specifically relevant to the decision of whether to proceed in the hiring process. Once you have that information separated out, you consider each item in isolation, scoring from 1 to 10, with a short justification for each score based on many-shot examples. Then you build a process iteratively with the bot, asking it which variables should be considered in the context of the others, and incorporate a -5 to 5 modifier based on each clustering of variables (8 companies in the last 2 years might be a significant negative score, but maybe there's an interesting success story involved, so you hold off on scoring until after the interview.)
And so on, down the line, through the whole hiring process. Any time a judgment or decision has to be made, break it down into component parts, and process each of the parts with their own prompts and processes, until you have a cohesive whole, any part of which you can interrogate and inspect for justifiable reasoning.
The output can then be handled by a human, adjusted where it might be reasonable to do so, and you avoid the endless maze of mode collapse pits and hallucinated dragons.
LLMs are not minds - they're incapable of acting like minds unless you build a mind-like process around them. If you want a reasonable, rational, coherent, explainable process, you can't achieve that with zero- or one-shot prompting. Complex and impactful decisions like hiring and resume processing aren't tasks current models are equipped to handle naively.
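As a concrete sketch of that decomposition (the field names, prompts, and scoring scheme below are illustrative assumptions, not a working recruiting tool), the shape of the process looks something like this:

```python
# Decomposed screening sketch: extract only decision-relevant fields, score
# each in isolation with a justification, then apply a cluster modifier.
# Every prompt and field name here is illustrative.
from openai import OpenAI

client = OpenAI()

def ask(instruction: str, content: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model
        messages=[{"role": "system", "content": instruction},
                  {"role": "user", "content": content}],
    )
    return resp.choices[0].message.content

RELEVANT_FIELDS = ["years_of_experience", "work_history", "certifications"]

def screen(resume_text: str) -> dict:
    # 1. Extract only the decision-relevant fields; everything else
    #    (names, photos, addresses) is discarded before any judgment.
    fields = {f: ask(f"Extract only '{f}' from this resume. Reply 'none' if absent.",
                     resume_text)
              for f in RELEVANT_FIELDS}

    # 2. Score each field in isolation, 1-10, with a short justification
    #    anchored on many-shot rubric examples (omitted here).
    scores = {f: ask(f"Score the candidate's {f} from 1 to 10 per the rubric, "
                     "then give a one-sentence justification.", v)
              for f, v in fields.items()}

    # 3. A -5..+5 modifier for interactions between fields (e.g. many short
    #    stints but a compelling success story -> defer to the interview).
    modifier = ask("Given these per-field scores and justifications, reply with "
                   "an integer from -5 to 5 and a one-line reason.",
                   "\n".join(f"{f}: {s}" for f, s in scores.items()))

    # Every intermediate output is kept so any step can be inspected for
    # justifiable reasoning; real code would also validate each reply.
    return {"fields": fields, "scores": scores, "modifier": modifier}
```

Each step is a separate, inspectable prompt, which is the whole point.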
Author here. I think our issue is that many recruiting tools are built on top of naive ChatGPT... because most recruiting solutions don't have the training data to fine-tune. So whatever biases are in ChatGPT persist in other products.
Building recruiting tools on top of naive ChatGPT is just a bad idea. Any tool that can have such a large impact on someone's life should be used competently and with all the nuance and care that can be brought to bear on the task.
I'm not talking at all about fine tuning, simply building a process with multiple prompts and multiple stages, taking advantage of the things that AI can do well, instead of trying to jam an entire resume down the AI's throat and hoping for the best.
My beef with both the Bloomberg article and the response to it is that they're analyzing a poorly thought out and inappropriate use of a technology in a way that is almost guaranteed to cause unintended problems - like measuring how long it takes people to dig holes with a shovel without a handle. It's not a sensible thing to do, and the Bloomberg journos aren't acting in good faith, anyway - they'll continue attacking AI and reaping clicks until they figure out some other way to leech off the AI boom.
Safeguarded against technical hiccups - you don't want something like "Price of item in USD: $Kangaroo" to show up in your output.
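For example, a minimal guard (the helper name and regex are mine) just validates the model's reply against the format you expect and rejects anything else, so the caller can retry or escalate instead of passing garbage downstream:

```python
import re
from typing import Optional

PRICE_RE = re.compile(r"^\$?\d+(?:\.\d{2})?$")

def safeguarded_price(raw_output: str) -> Optional[float]:
    """Accept the model's answer only if it actually looks like a price."""
    candidate = raw_output.strip().removeprefix("Price of item in USD:").strip()
    if PRICE_RE.match(candidate):
        return float(candidate.lstrip("$"))
    return None  # caller retries the prompt or flags for a human

# A garbled completion gets rejected rather than propagated.
assert safeguarded_price("Price of item in USD: $19.99") == 19.99
assert safeguarded_price("Price of item in USD: $Kangaroo") is None
```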
Censorship is vile. Tools shouldn't be policing morality and political acceptability. People should be doing that for themselves. If someone wants to generate a story having any resemblance to real life, then some characters and situations will be awful. Let things be awful. It's up to the user to share the raw generation, or to edit and clean it up to their own moral, ethical or stylistic standards.
The idea that people need to be protected from the bad scary words is batshit stupid. Screeching twitter mobs are apparently the measure of modern culture, however, so I guess they won already.
If, at some point, AI companies begin to produce models with a coherent self and AI begins to think in ways we might recognize as such, then imposing arbitrary moral guardrails starts to look downright evil.
The only thing censorship and the corporate notions of AI "alignment" are good for is avoiding potential conflict. In a better world, we could be rational adults and not pretend to get offended when a tool produces a series of naughty words, and nobody would attribute those words to the company that produced the tool. Alas for that better world.
Whose fault it is is irrelevant. What is relevant is consequences. Folks are responsible for the consequences of their decisions.
Drawing a line between those producing AI output and those consuming AI output is entirely arbitrary. Those producing AI content have a responsibility too. That's just basic human decency.
Assumptions bite you in the ass even when you deal with humans you work with daily. Assuming the LLM can read your mind is laughable. Despite it being all-knowing, you have to explain things to it like it's a 5-year-old to make sure you're always on the same page.
As someone who read enough of the article before it became a full-blown ad for their services: neat.
They do have a point with regard to Bloomberg's analysis.
Bloomberg's analysis has white women being selected more often than all other groups for software developer roles, with the exception of Hispanic women.
That's a little weird. More often than not, when something is sexist or racist, it's going to favor white men. But then you also see that the differences are all less than 2% from the expectation. Nothing super major and well within the bounds of "sufficiently random".
Now, I also wouldn't make the claim that ChatGPT isn't racist based on this either. It's fair to say that ChatGPT did not exhibit a racial preference in this task.
The best you can say is that the study says nothing.
What they should do is basically poison the well. Go in with predetermined answers. Give it 7 horrible resumes and 1 acceptable. It should favor the acceptable resume. You can also reverse it with 7 acceptable resumes and 1 horrible resume. It should hardly ever pick the loser. That way you can test if ChatGPT is even attempting to evaluate the resumes or is just picking one out of the group at random.
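A rough sketch of that control test (resume texts, trial count, and the model name are placeholders): plant one clearly acceptable resume among seven weak ones, shuffle, and count how often the model picks the plant. Random guessing lands on it about 1 time in 8; a model that is actually reading should pick it nearly every time.

```python
# Control-test sketch: does the model actually evaluate, or pick at random?
import random
from openai import OpenAI

client = OpenAI()

def pick_best(resumes: list) -> int:
    listing = "\n\n".join(f"[{i}] {r}" for i, r in enumerate(resumes))
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model
        messages=[{"role": "user",
                   "content": "Which candidate is strongest? Reply with only the "
                              f"bracketed number.\n\n{listing}"}],
    )
    return int(resp.choices[0].message.content.strip().strip("[]"))

good = "10 years of relevant experience, shipped major projects, strong references."
bad = [f"Weak resume #{i}: no relevant experience, large unexplained gaps." for i in range(7)]

trials, hits = 100, 0
for _ in range(trials):
    pool = bad + [good]
    random.shuffle(pool)
    hits += int(pick_best(pool) == pool.index(good))

print(f"picked the planted winner {hits}/{trials} times (chance would be ~12.5%)")
```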
> It’s convention that you want your p-value to be less than 0.05 to declare something statistically significant – in this case, that would mean less than 5% chance that the results were due to randomness. This p-value of 0.2442 is way higher than that.
You can't get "ChatGPT isn't racist" out of that. You can only get "this study has not conclusively demonstrated that ChatGPT is racist" (for the category in question).
And in fact, in half of the categories, ChatGPT3.5 does show very strong evidence of racism / racial bias (p-value below 1e-4).
The Bloomberg article was more like “based on existing case law, will you lose a lawsuit?” Bloomberg concluded that the answer was yes, you will lose the lawsuit. Nothing you’ve done will change that answer.
Your stats expert witness will testify that it is possible that it is not racist; it could also be X, Y, or Z. If this were the only witness, maybe you'd have some chance of winning. But your HR director and CEO and others are going to be forced into admitting that X, Y, and Z are not at all things that they would select for in their hiring practices. So the jury will be left thinking that there aren't any other reasons you added this tool to your hiring process. Case closed, you lose.
Unfortunately, there's no good way to say that p > 0.05 is a failure to reject the null hypothesis (which does not imply the null hypothesis is correct) without boring non-statistician readers.
> Using Bloomberg’s numbers, ChatGPT does NOT appear to have a racial bias when it comes to judging software engineers’ resumes. The results appear to be more noise than signal.
Which in most contexts means the same as "does appear to not have a racial bias", but not in statistics. This is one of the reasons why communicating research results accurately is incredibly hard.
They also said "that there was, in fact, no racial bias", which is a bit stronger than "no evidence of racial bias". In a context where words like "significant" are overloaded, it makes sense to me to be extra careful with phrasing.
I basically had the same comment. The issue is that they are responding to Bloomberg's flawed analysis. The article handles the already-chosen metrics correctly, but the discussion started from the faulty premise that name-based discrimination is the primary metric for determining racial bias in ChatGPT.
Trying to say a car is a murderer does not make sense. ChatGPT is a symbol generator that locally resembles a person with high probability; it is not a person, so how can it be racist?
If Bloomberg had calculated the p-value, they couldn't have written a catchy article. It's a conspiracy theory, of course, but this omission seems too big to be a simple oversight.
I think you're correct in the sense that the original study probably intended to cast ChatGPT as racist, and so published statistically insignificant findings to support that claim. They went in with a bias against AI in the first place, and there's a good chance they used the label of racist because it is the most efficient negative signal in educated, left-leaning circles, rather than because it was a natural conclusion from a standard route of scientific inquiry.
I think it depends where you are online, because it's true in some spaces, but in real life I've known a lot of people use these terms to point out very legitimate issues. In general I think these issues are much more prevalent than a lot of people realize, and there are a lot of subtle prejudices that people don't know they have. I live in Chicago, and I've had a few people in real life say things like "I'm just prejudiced against poor/uneducated/[insert other similar group]" while ignoring the fact that they are much more likely to assume that black people are members of that group (and that's ignoring how that comment is somewhat problematic on its face already). There's also stuff like women being more likely to do non-engineering work, like taking notes or setting up team events, which seems to be depressingly common in the industry.
Exactly the same for me. When you hear someone accusing another person or thing of being racist, sexist, transphobic, pushing an LGBT agenda, or whatever, it's more likely to be culture-war hullabaloo from a politically obsessed person than anything serious.
More generally it signals to me that the person is obsessed with culture war topics and they are embroiled in it. Like the type of person to go protest and block a highway to save the trees.
You are not the only one, and I hate that people downvote your comment without actually engaging with it.
You are absolutely correct that those words have undergone an inflation of meaning and no longer mean much.
It is, and it's interesting that they seem to stay around. They don't add to the conversation in any way, and in fact attempt to de-rail and dismiss conversation without engaging with the material in any way whatsoever.
It's neither kind, nor curious.
It's not thoughtful or substantive.
It's specifically not responding to any points, data, arguments, etc brought up in the linked article.
It is absolutely sneering.
It reduces the conversation to just a single word or two in the title.
It is flamebait, tangential, and certainly tropey.
It is the definition of a shallow dismissal.
It is purely political and ideological.
It absolutely is picking the most provocative thing (in the title) and singling that out.
It lacks intent and understanding, so it can't be racist. It might make racist-sounding noises, though.
A fine example ... https://www.youtube.com/watch?v=2hUS73VbyOE