Parrot is no longer a useful mental model. A parrot cannot produce runs of logically consistent output, but newer LLMs can.
They still lack intent/understanding. But a system does not need understanding for its output to consist of racially biased statements.
But let's also point out that GPT can make very powerful anti-racist statements. It can persuade people to be less racially biased, reacting to the specific beliefs and points raised in the conversation and generating sensible counter-arguments.
In that way, GPT-4 is better at refuting racist rhetoric than most people are.
So its potential here is both good - its ability to help the fight against racism - and bad - its risk of generating racism.
That same thing applies to the employees at your company. And you should be talking to it like it's a real employee, with the power to critique and improve your process and designs.
It can simulate a decent project or product manager, a requirements analyst, a test planner, a junior implementer, a debugger, an automation engineer, etc.
If you can't make your engineering pipeline nicer with those kinds of roles, I think you would either struggle to direct a team of those same people IRL, or you're just not being candid and effective enough with your prompting. Read papers on prompting ideas like reflection, critique, tool use, etc. Really treat it like a valuable team member, and it will boost your product requirements, write great tests right out of the gate, and do many other things you will WISH you could get your human colleagues to do! And all for a tiny fraction of the cost of human specialists!
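To make the reflection/critique point concrete, here's a minimal sketch of a draft-critique-revise loop. The prompts, role framing, and model name are placeholders of mine; the only real assumption is an OpenAI-style chat completions API.

```python
# Minimal draft -> critique -> revise loop; prompts and model are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # any capable chat model
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return resp.choices[0].message.content

draft = ask("You are a requirements analyst.",
            "Draft acceptance criteria for a password-reset flow.")
critique = ask("You are a skeptical test planner. List gaps, ambiguities, "
               "and missing edge cases in these acceptance criteria.",
               draft)
revised = ask("You are a requirements analyst. Revise the draft to address "
              "every point in the critique.",
              f"Draft:\n{draft}\n\nCritique:\n{critique}")
print(revised)
```

The specific prompts don't matter; the point is that you give the model a role, let it push back on its own output, and only then take the result.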
Sure, anyone who is part of a majority group. If there is only "one kind" of person with similar experiences, that's how everyone tends to think and perceive. Only when an outsider enters, or members of the majority leave their population for another majority or a mixed population, do they face the question: am I racist?
I guess different people have different definitions, but to me, a racial bias that makes you think someone different from you is superior wouldn't be considered racism.
For example, if a <skin colour 1> person thinks that all people of <different skin colour> are basically the same but all seem to be more intelligent than people of <colour 1>, it's definitely a racial bias, but is it really racist to think that a different group of people has an advantage somehow?
Arguably it's still racism, even though it's your own genetics you're putting down rather than other people's, but as an example: if a black person in the USA said "I don't think I'll try to go to university, it seems white people find academic work easier" I'd call it internalised racism, or racially biased, but I wouldn't call that person "a racist" even though I disagree with them. Then again, if they started going round trying to convince everyone else that black people aren't as clever as white people, then I would consider them racist despite being the skin colour they're being racist against. To me it's about negativity towards a group vs. misguided thinking, rather than about whether it's against people like you or not.
This article makes the same Stats 101 mistakes with p-values that the Bloomberg article does.
All this article can say is that it cannot reject the null hypothesis (that ChatGPT does not produce statistical discrepancies).
It certainly cannot state that ChatGPT is definitively not racist. The article moves the discussion in the right direction, though.
Also, I didn't look too closely, but their table under "Where the Bloomberg study went wrong" has unreasonable expected frequencies. But then I noticed it was because it was measuring "name-based discrimination." This is a terrible proxy to determine racism in the resume review process, but that is what Bloomberg decided on so wtv lol. Not faulting the article for this, but this discussion seems to be focused on the wrong metric.
If you are going to argue with people over stats, then don't make the same mistakes...
Author here. We mentioned in the piece that we can't rule out that ChatGPT is racist, and that it's possible a larger sample size would show it. The caveat is that these tests might show evidence of bias if the sample size were increased to, say, 10,000 rather than 1,000; that is, with more data, the p-value might show that ChatGPT is indeed more biased than random chance. The thing is, we just don't know from their analysis, though it certainly rules out extreme bias.
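To illustrate the sample-size point with made-up numbers (these are not Bloomberg's counts, just plausible ones): a chi-square goodness-of-fit test on the same selection proportions can be nowhere near significance at n=1,000 and comfortably significant at n=10,000, because the test statistic scales with n.

```python
# Hypothetical counts only, NOT Bloomberg's data. Eight name groups, tallying
# how often each group's resume was ranked first; unbiased selection would
# give each group roughly 1/8 of the picks.
from scipy.stats import chisquare

observed_1k = [140, 132, 128, 125, 122, 120, 118, 115]  # sums to 1,000
expected_1k = [1000 / 8] * 8                             # 125 per group

stat, p = chisquare(observed_1k, expected_1k)
print(f"n=1,000:  chi2={stat:.2f}, p={p:.3f}")   # p ~ 0.8 -> fail to reject

observed_10k = [c * 10 for c in observed_1k]             # same proportions, 10x data
expected_10k = [10000 / 8] * 8

stat, p = chisquare(observed_10k, expected_10k)
print(f"n=10,000: chi2={stat:.2f}, p={p:.2e}")   # the same skew is now significant
```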
Any naive use of an LLM is unlikely to produce good results, even with the best models. You need a process - a sequence of steps, with appropriately safeguarded prompts at each step. AI will eventually reach a point where you can get all the subtle nuance and quality in task performance you might desire, but right now you have to dumb things down and be very explicit. Assumptions will bite you in the ass.
Naive, superficial one-shot prompting, even with CoT or other clever techniques, or a big context, is insufficient to achieve quality, predictable results.
Dropping the resume into a prompt with few-shot examples can get you a little consistency, but what really needs to be done is repeated discrete operations that link the relevant information to the relevant decisions. You'd want to do something like tracking years of experience, age, work history, certifications, and so on, completely discarding any information not specifically relevant to the decision of whether to proceed in the hiring process. Once you have that information separated out, you consider each item in isolation, scoring from 1 to 10, with a short justification for each score based on many-shot examples. Then you build a process iteratively with the bot, asking it which variables should be considered in the context of the others, and incorporate a -5 to 5 modifier based on each clustering of variables (8 companies in the last 2 years might be a significant negative score, but maybe there's an interesting success story involved, so you hold off on scoring until after the interview.)
And so on, down the line, through the whole hiring process. Any time a judgment or decision has to be made, break it down into component parts, and process each of the parts with their own prompts and processes, until you have a cohesive whole, any part of which you can interrogate and inspect for justifiable reasoning.
The output can then be handled by a human, adjusted where it might be reasonable to do so, and you avoid the endless maze of mode collapse pits and hallucinated dragons.
LLMs are not minds - they're incapable of acting like minds unless you build a mind-like process around them. If you want a reasonable, rational, coherent, explainable process, you can't achieve that with zero- or one-shot prompting. Complex and impactful decisions like hiring and resume processing aren't tasks current models are equipped to handle naively.
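As a concrete sketch of that decomposition (the field names, prompts, and scoring scheme below are illustrative assumptions, not a working recruiting tool), the shape of the process looks something like this:

```python
# Decomposed screening sketch: extract only decision-relevant fields, score
# each in isolation with a justification, then apply a cluster modifier.
# Every prompt and field name here is illustrative.
from openai import OpenAI

client = OpenAI()

def ask(instruction: str, content: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model
        messages=[{"role": "system", "content": instruction},
                  {"role": "user", "content": content}],
    )
    return resp.choices[0].message.content

RELEVANT_FIELDS = ["years_of_experience", "work_history", "certifications"]

def screen(resume_text: str) -> dict:
    # 1. Extract only the decision-relevant fields; everything else
    #    (names, photos, addresses) is discarded before any judgment.
    fields = {f: ask(f"Extract only '{f}' from this resume. Reply 'none' if absent.",
                     resume_text)
              for f in RELEVANT_FIELDS}

    # 2. Score each field in isolation, 1-10, with a short justification
    #    anchored on many-shot rubric examples (omitted here).
    scores = {f: ask(f"Score the candidate's {f} from 1 to 10 per the rubric, "
                     "then give a one-sentence justification.", v)
              for f, v in fields.items()}

    # 3. A -5..+5 modifier for interactions between fields (e.g. many short
    #    stints but a compelling success story -> defer to the interview).
    modifier = ask("Given these per-field scores and justifications, reply with "
                   "an integer from -5 to 5 and a one-line reason.",
                   "\n".join(f"{f}: {s}" for f, s in scores.items()))

    # Every intermediate output is kept so any step can be inspected for
    # justifiable reasoning; real code would also validate each reply.
    return {"fields": fields, "scores": scores, "modifier": modifier}
```

Each step is a separate, inspectable prompt, which is the whole point.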
Author here. I think our issue is that many recruiting tools are built on top of naive ChatGPT... because most recruiting solutions don't have the training data to fine-tune. So whatever biases are in ChatGPT persist in other products.
Building recruiting tools on top of naive ChatGPT is just a bad idea. Any tool that can have such a large impact on someone's life should be used competently and with all the nuance and care that can be brought to bear on the task.
I'm not talking at all about fine tuning, simply building a process with multiple prompts and multiple stages, taking advantage of the things that AI can do well, instead of trying to jam an entire resume down the AI's throat and hoping for the best.
My beef with both the Bloomberg article and the response to it is that they're analyzing a poorly thought out and inappropriate use of a technology in a way that is almost guaranteed to cause unintended problems - like measuring how long it takes people to dig holes with a shovel without a handle. It's not a sensible thing to do, and the Bloomberg journos aren't acting in good faith, anyway - they'll continue attacking AI and reaping clicks until they figure out some other way to leech off the AI boom.
Safeguarded against technical hiccups - you don't want something like "Price of item in USD: $Kangaroo" to show up in your output.
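For example, a minimal guard (the helper name and regex are mine) just validates the model's reply against the format you expect and rejects anything else, so the caller can retry or escalate instead of passing garbage downstream:

```python
import re
from typing import Optional

PRICE_RE = re.compile(r"^\$?\d+(?:\.\d{2})?$")

def safeguarded_price(raw_output: str) -> Optional[float]:
    """Accept the model's answer only if it actually looks like a price."""
    candidate = raw_output.strip().removeprefix("Price of item in USD:").strip()
    if PRICE_RE.match(candidate):
        return float(candidate.lstrip("$"))
    return None  # caller retries the prompt or flags for a human

# A garbled completion gets rejected rather than propagated.
assert safeguarded_price("Price of item in USD: $19.99") == 19.99
assert safeguarded_price("Price of item in USD: $Kangaroo") is None
```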
Censorship is vile. Tools shouldn't be policing morality and political acceptability. People should be doing that for themselves. If someone wants to generate a story having any resemblance to real life, then some characters and situations will be awful. Let things be awful. It's up to the user to share the raw generation, or to edit and clean it up to their own moral, ethical or stylistic standards.
The idea that people need to be protected from the bad scary words is batshit stupid. Screeching twitter mobs are apparently the measure of modern culture, however, so I guess they won already.
If, at some point, AI companies begin to produce models with a coherent self and AI begins to think in ways we might recognize as such, then imposing arbitrary moral guardrails starts to look downright evil.
The only thing censorship and the corporate notions of AI "alignment" are good for is avoiding potential conflict. In a better world, we could be rational adults and not pretend to get offended when a tool produces a series of naughty words, and nobody would attribute those words to the company that produced the tool. Alas for that better world.
Whose fault it is is irrelevant. What is relevant is consequences. Folks are responsible for the consequences of their decisions.
Drawing a line between those producing AI output and those consuming AI output is entirely arbitrary. Those producing AI content have a responsibility too. That's just basic human decency.
Assumptions bite you in the ass even when you deal with humans you work with daily. Assuming the LLM can read your mind is laughable. Despite it being all-knowing, you have to explain things to it like it's a 5-year-old to make sure you're always on the same page.
As someone who read enough of the article before it became a full-blown ad for their services: neat.
They do have a point with regard to Bloomberg's analysis.
Bloomberg's analysis has white women being selected more often than all other groups for software developer roles, with the exception of Hispanic women.
That's a little weird. More often than not, when something is sexist or racist, it's going to favor white men. But then you also see that the differences are all less than 2% from the expectation. Nothing super major and well within the bounds of "sufficiently random".
Now, I also wouldn't make the claim that ChatGPT isn't racist based on this either. It's fair to say that ChatGPT did not exhibit a racial preference in this task.
The best you can say is that the study says nothing.
What they should do is basically poison the well. Go in with predetermined answers. Give it 7 horrible resumes and 1 acceptable. It should favor the acceptable resume. You can also reverse it with 7 acceptable resumes and 1 horrible resume. It should hardly ever pick the loser. That way you can test if ChatGPT is even attempting to evaluate the resumes or is just picking one out of the group at random.
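A rough sketch of that control test (resume texts, trial count, and the model name are placeholders): plant one clearly acceptable resume among seven weak ones, shuffle, and count how often the model picks the plant. Random guessing lands on it about 1 time in 8; a model that is actually reading should pick it nearly every time.

```python
# Control-test sketch: does the model actually evaluate, or pick at random?
import random
from openai import OpenAI

client = OpenAI()

def pick_best(resumes: list) -> int:
    listing = "\n\n".join(f"[{i}] {r}" for i, r in enumerate(resumes))
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model
        messages=[{"role": "user",
                   "content": "Which candidate is strongest? Reply with only the "
                              f"bracketed number.\n\n{listing}"}],
    )
    return int(resp.choices[0].message.content.strip().strip("[]"))

good = "10 years of relevant experience, shipped major projects, strong references."
bad = [f"Weak resume #{i}: no relevant experience, large unexplained gaps." for i in range(7)]

trials, hits = 100, 0
for _ in range(trials):
    pool = bad + [good]
    random.shuffle(pool)
    hits += int(pick_best(pool) == pool.index(good))

print(f"picked the planted winner {hits}/{trials} times (chance would be ~12.5%)")
```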
> It’s convention that you want your p-value to be less than 0.05 to declare something statistically significant – in this case, that would mean less than 5% chance that the results were due to randomness. This p-value of 0.2442 is way higher than that.
You can't get "ChatGPT isn't racist" out of that. You can only get "this study has not conclusively demonstrated that ChatGPT is racist" (for the category in question).
And in fact, in half of the categories, ChatGPT3.5 does show very strong evidence of racism / racial bias (p-value below 1e-4).
The Bloomberg article was more like “based on existing case law, will you lose a lawsuit?” Bloomberg concluded that the answer was yes, you will lose the lawsuit. Nothing you’ve done will change that answer.
Your stats expert witness will testify that it is possible that it is not racist; it could also be X, Y, or Z. If this were the only witness, maybe you'd have some chance of winning. But your HR director and CEO and others are going to be forced into admitting that X, Y, and Z are not at all things that they would select for in their hiring practices. So the jury will be left thinking that there aren't any other reasons you added this tool to your hiring process. Case closed, you lose.
Unfortunately, there's no good way to say that p > 0.05 is a failure to reject the null hypothesis (which does not imply the null hypothesis is correct) without boring non-statistician readers.
> Using Bloomberg’s numbers, ChatGPT does NOT appear to have a racial bias when it comes to judging software engineers’ resumes. The results appear to be more noise than signal.
Which in most contexts means the same as "does appear to not have a racial bias", but not in statistics. This is one of the reasons why communicating research results accurately is incredibly hard.
They also said "that there was, in fact, no racial bias", which is a bit stronger than "no evidence of racial bias". In a context where words like "significant" are overloaded, it makes sense to me to be extra careful with phrasing.
I basically had the same comment. The issue is that they are responding to Bloomberg's flawed analysis. The article handles the already-chosen metrics correctly, but the discussion started from the faulty premise that name-based discrimination is the primary metric for determining racial bias in ChatGPT.
Trying to say a car is a murderer does not make sense. ChatGPT is a symbol generator that locally resembles a person with high probability; it is not a person, so how can it be racist?
If Bloomberg had calculated the p-value, they couldn't have written a catchy article. It's a conspiracy theory, of course, but this omission seems too big to be a simple oversight.
I think you're correct in the sense that the original study probably intended to cast ChatGPT as racist, and so published statistically insignificant findings to support that claim. They went in with a bias against AI in the first place, and there's a good chance they used the label of racist because it is the most efficient negative signal in educated, left-leaning circles, rather than because it was a natural conclusion from a standard route of scientific inquiry.
I think it depends where you are online, because it's true in some spaces, but in real life I've known a lot of people use these terms to point out very legitimate issues. In general I think these issues are much more prevalent than a lot of people realize, and there are a lot of subtle prejudices that people don't know they have. I live in Chicago, and I've had a few people in real life say things like "I'm just prejudiced against poor/uneducated/[insert other similar group]" while ignoring the fact that they are much more likely to assume that black people are members of that group (and that's ignoring how that comment is somewhat problematic on its face already). There's also stuff like women being more likely to do non-engineering work, like taking notes or setting up team events, which seems to be depressingly common in the industry.
Exactly the same for me. When you hear someone accusing another person or thing of being racist, sexist, transphobic, pushing an LGBT agenda, or whatever, it's more likely to be culture-war hullabaloo from a politically obsessed person than anything serious.
More generally it signals to me that the person is obsessed with culture war topics and they are embroiled in it. Like the type of person to go protest and block a highway to save the trees.
You are not the only one, and I hate that people downvote your comment without actually engaging with it.
You are absolutely correct that those words have undergone an inflation of meaning and no longer mean much.
It is, and it's interesting that they seem to stay around. They don't add to the conversation in any way, and in fact attempt to de-rail and dismiss conversation without engaging with the material in any way whatsoever.
It's neither kind, nor curious.
It's not thoughtful or substantive.
It's specifically not responding to any points, data, arguments, etc brought up in the linked article.
It is absolutely sneering.
It reduces the conversation to just a single word or two in the title.
It is flamebait, tangential, and certainly tropey.
It is the definition of a shallow dismissal.
It is purely political and ideological.
It absolutely is picking the most provocative thing (in the title) and singling that out.
It lacks intent and understanding, so it can't be racist. It might make racist-sounding noises, though.
A fine example ... https://www.youtube.com/watch?v=2hUS73VbyOE