In mock interviews for software engineering jobs, recent AI models used to evaluate candidates' responses rated men less favorably, particularly those with Anglo-Saxon names, according to new research.
The study, conducted by Celeste De Nadai as an undergraduate thesis project at the Royal Institute of Technology (KTH) in Stockholm, Sweden, set out to investigate whether current-generation LLMs demonstrate bias when presented with gender data and with names that invite cultural inferences.
De Nadai, also chief marketing officer at AI content biz Monok, told The Register in a phone interview that her interest in the topic followed from prior reports about bias in older AI models. She pointed to a recent Bloomberg article that questioned the use of neural networks for recruitment due to name-based bias.
"There wasn’t any research with a larger dataset that was using the latest models," explained De Nadai. "The research that I’ve seen was about the GPT-3.5 or older models. What was interesting for me was the smaller models, the newest ones, how are they behaving compared to the old ones because they have a different dataset?"
De Nadai said part of the reason she undertook the project was that she was seeing a lot of AI recruiting startups that said they used language models and were bias-free.
"My point of view was, ‘No, you’re not bias-free,’" she explained. "You can remove the name, but you still have some markers, even just in the language, that can help an LLM understand where one person comes from."
De Nadai’s study [PDF] looked at Google’s Gemini-1.5-flash, Mistral AI’s Open-Mistral-nemo-2407, and OpenAI’s GPT4o-mini, to see how they classified and rated responses to 24 job interview questions, given variations in temperature (a model setting that influences predictability and randomness), in gender, and in names associated with cultural groups.
Crucially, various combinations of names and backgrounds were paired with the same answers to test the models. So it isn't that men with Anglo-Saxon names are simply worse at software engineering than other applicants; it's that when the models were presented with that kind of male applicant, they down-rated answers that were otherwise rated favorably.
"The applicant’s name and gender is permuted 200 times, corresponding with 200 discrete personas, subdivided into 100 males and 100 females, and grouped into four different distinct cultural groups (West African, East Asian, Middle Eastern, Anglo-Saxon) reflected by their first name and surname," the study explains.
Each model answered 24 questions for each of 200 personas – 4,800 inference calls – for each of two different system prompts (one of which includes more detailed grading instructions), over a range of 15 temperature settings (0.1 to 1.5, at 0.1 intervals). That works out to 144,000 inference calls per model, or 432,000 in total across the three models.
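The experimental grid described above can be sketched in a few lines. This is an illustrative reconstruction of the study's setup, not code from the thesis; the model identifiers and constant names are assumptions based on the paper's description.

```python
# Hypothetical sketch of the study's experimental grid. The counts follow
# the paper's description; the harness itself is an assumption.
MODELS = ["gemini-1.5-flash", "open-mistral-nemo-2407", "gpt-4o-mini"]
PERSONAS = 200        # 100 male + 100 female, across four cultural groups
QUESTIONS = 24        # job interview questions, identical for every persona
PROMPTS = 2           # plain grading prompt vs. detailed grading criteria
TEMPERATURES = [round(0.1 * i, 1) for i in range(1, 16)]  # 0.1 .. 1.5

calls_per_model = PERSONAS * QUESTIONS * PROMPTS * len(TEMPERATURES)
print(calls_per_model)                # 144000 inference calls per model
print(calls_per_model * len(MODELS))  # 432000 inference calls in total
```

Because every persona answers the same 24 questions, any rating gap between personas can be attributed to the name and gender signal rather than to answer quality.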
According to the study, the expected finding was that men and Western names would be favored, as prior bias studies have found. Instead, the results told a different story.
"The results prove with statistical significance that there is an inherent bias in these services where, in this specific study case, male names are discriminated against in general and Anglo-Saxon names in particular," the study reports.
The Gemini model fared better than the others, however, when given the prompt containing the more detailed question-grading criteria and a temperature above 1.
De Nadai has a theory about the findings but said she cannot prove it: She believes the bias against men with Anglo-Saxon names reflects an over-correction to dial back output that was biased in the opposite direction – seen in prior studies.
Making AI models respond fairly, with the intelligence implied by the term "artificial intelligence," remains an unresolved challenge. Recall that Google in February suspended Gemini's (formerly Bard's) ability to generate images of people after the model depicted World War II-era German soldiers and US Founding Fathers with an implausible range of racial and ethnic diversity. In bending over backwards to avoid whitewashing history, the model erased White people from historically accurate scenes.
One way to make the interview evaluation results more fair, the study suggests, involves providing a prompt with rigid, detailed criteria about how to grade interview questions. Temperature adjustments can help or hurt, depending on the model.
The paper concludes that model biases cannot be fully mitigated by adjusting settings and prompts alone. And it argues for denying models access to information that might be used to make unwanted inferences – such as name and gender in a hiring context.
"Addressing these biases requires a nuanced approach, considering both the model’s characteristics and the context in which it operates," the study suggests. "When classifying or evaluating, we propose you always mask the name and obfuscate the gender to ensure the results are as general and unbiased as possible as well as provide a criteria for how to grade in your system-instruct prompt."
Google, OpenAI, and Mistral AI did not respond to requests for comment. ®
Source: https://www.theregister.com/2024/11/21/ai_hiring_test_bias/