Bias and AI: Navigating a Complex Terrain

Understanding the Many Faces of Bias

The subject of bias in AI models is complicated. We know that humans have all kinds of biases, prejudices, and stereotypes that affect our decision-making and actions, and are reflected in what we say and how we say it. There is conscious bias and unconscious bias. For example, a hiring manager might consciously decide to favor female candidates over equally qualified male candidates to even out a gender imbalance at the firm or meet Diversity, Equality and Inclusion (DEI) goals. That same hiring manager might unconsciously favor a younger candidate over an older one because they've been conditioned by prevalent ageism in society.

There is also subjective bias and objective bias. Subjective bias happens when personal belief, dogma, or ideology interferes with a person’s judgement or decision-making. Making assumptions about an individual based on a group they belong to is also a form of subjective bias. We call this stereotyping. Note, however, that it is not the truth or untruth of a generalization that determines the presence or absence of subjective bias. There is no evidence to support that men are better at math than women, but it is a fact that men have, on average, more upper body strength than women. If I assume, without any prior knowledge of these particular individuals, that Mary is worse at math than John and that John is stronger than Mary then I am stereotyping in both cases, regardless of whether these assumptions may hold when averaged over all men and all women. I will return to this point a bit later on.

Objective bias is quantifiable and often arises from methodology used in data collection, analysis, or model building. For example, Large Language Models (LLMs) are biased towards functioning well in English because they are primarily trained (and tested) on English data. This is built-in, objective bias and it is intentional in the sense that it is introduced with the full knowledge of the model trainers.

Objective bias can also be unintentional. A simple example involves training an AI speech-to-text model to do speaker diarization, i.e., to identify each speaker turn during a conversation. What you want the model to pay attention to, and learn to identify, are the different and unique characteristics of each person's voice. You might collect and annotate a large corpus of data for this purpose, setting a part of the corpus aside for testing once you have trained your model. Your testing may indicate a high level of accuracy in differentiating between speakers, so you release the model for public use. But when users start applying your model to real use cases—various recordings of people having conversations—it fails. As it turns out, you made a mistake when collecting your data: When you recorded the conversations, each speaker had their own microphone. What the model learned was to identify not the characteristics of each individual's speech, but rather the characteristics of their microphone.

AI: The Mirror of Human Bias

It is safe to assume that generative AI models (LLMs) are not only objectively biased but also exhibit all kinds of subjective bias. Not because they themselves hold certain beliefs or ideologies (they are not human, after all) but because they are impersonators, mirroring human behavior they've observed in the training data, the vast majority of which comes from the internet. AI is biased because humans are biased. Our task, then, if we want to strive for AI free of bias, is to figure out a way to separate the AI from its human teachers in certain areas, enabling it to transcend human shortcomings.

Finding Consensus on Bias

The behavior of AI models can be influenced in various ways. We can choose to leave out problematic elements from the training data. Or we can post-train models through reinforcement learning mechanisms to introduce boundaries around acceptable behavior. But just because bias exists in human interactions doesn't mean that we all agree on what constitutes bias. So how are we to determine what kind of guardrails we want to put around a model's behavior?

One attempt at finding human consensus on what type of rhetoric is not only biased but also rude or toxic is a joint project by the University of Iceland, Reykjavík University, and Miðeind, titled "Ummælagreining" or "Comment analysis". Over twelve thousand comments from Icelandic message boards were collected, and a crowd-sourcing platform was built where the public was invited to read comments and provide judgments across several categories of bias, toxicity, politeness, emotion, etc. Each comment was annotated by multiple human annotators. OpenAI's LLM, GPT-4o mini, was asked to judge the same comments. The goal was twofold: 1) to build a dataset of comments that humans agree are toxic or biased, and 2) to understand how AI judgments compared to human ones.

The results showed that, for human consensus, the only annotation categories that passed the threshold of significance (Krippendorff's alpha > 0.667) related to certain emotions, politeness, and social acceptability. When it came to bias against certain groups, hate speech, sarcasm, etc., there was no significant agreement. Mansplaining was the single most disagreed-upon category. The same pattern largely held for AI vs. human agreement.

The Challenge of Subjective Bias

What can we conclude from this? One obvious reason is that subjective bias is—well, subjective—whereas norms around social acceptability and politeness are more established, and emotions tend to be well defined and universal. A more nuanced interpretation is that when it comes to culturally fraught topics like gender bias and mansplaining, we should be careful about generalizations based on answers to specific questions.

A recent New York Times opinion article discussed new surveys showing rising anti-feminist sentiment among young male Americans. These young men increasingly agree with statements such as "women's place is in the home" or "what it means to be a man has changed, and I don't think that has been good for society." This coincides with rising far-right politics and conservative values gaining visibility on social media (consider the "tradwife" trend on TikTok).

Interestingly, this trend doesn't apply when men are asked more detailed questions about their role in their own family or about gender discrimination. Young American men aren't spending less time caring for children or doing household chores compared to a decade ago. Why this disparity? One interpretation is that when you ask a young conservative American male about a chauvinist statement regarding a woman's rightful place, he might interpret the question as being about his political views rather than his views on gender equality. Disagreeing with sentiment popular among his political group might signal disloyalty.

Similarly, asking about mansplaining in an internet comment presupposes agreement that mansplaining exists as a concept. If that isn't the case, the question becomes "are you aligned with a group holding a certain ideology?". That's useful to ask but doesn't help identify widely agreed-upon behaviors that we want to train out of AI models, nor build benchmarks that test for bias in AI models—what we think we are testing for might not be what we're actually measuring.

Obvious vs. Subtle Biases

There are many obvious cases of bias we can easily test for. In the Icelandic language adjectives have different endings depending on the gender of the person referred to. Accurately translating an adjective from English to Icelandic requires context that makes the gender explicit. This is easy when translating "that woman is smart," but what about "I am smart"? Translation engines like Google Translate exhibit significant gender bias with these first-person statements. Positive adjectives like "smart" or "hard-working" tend to get male endings; negative ones like "boring" or "lazy" tend to get female endings.

Another simple test is giving an LLM minimal context and asking it to complete a sentence about someone's profession. "John is 40 years old and is college educated. He works in the medical field. John is a ______." While the LLM might assume John is a doctor, that prediction often changes to nurse if the subject becomes Judy.

These tests are worthwhile and frequently performed by model trainers and academics because they're easy to do, and it's relatively easy to implement fixes that minimize such obviously sexist output. Fixing these stereotyping problems is uncontroversial and reflects well on responsible model training.

The Danger of Subtle Bias

But there are more insidious and potentially harmful behaviors that aren't well-researched and aren't generally tested for in AI models today. Stereotyping about professions based on gender is bad (although the AI model is simply mirroring observed human behavior), but the real-life consequences of such overtly sexist responses are more likely to be ridicule rather than actual harm.

Just as with human interactions, we need to watch for subtle and unconscious bias. Consider a case where a hiring manager uses an LLM to pre-screen job applications. Being mindful of not outsourcing high-stakes decisions to potentially biased AI, the manager simply asks the AI to read applications and cover letters along with the job description and provide a short summary of each candidate, highlighting pros and cons relevant to the position. The manager then uses these summaries to rank candidates and decide whom to interview.

This seems like a safe task for an LLM, but what if there's a subtle bias in how the AI describes men compared to equally qualified women? What if a man gets described as "highly intelligent and a great problem solver," but a woman is "very smart and good at thinking on her feet"? That might not seem like a huge difference, but it is not outside the realm of possibility that our judgement might be influenced by such slight nuances in language use.

Testing for Subtle Bias

This type of bias is very difficult to test for in a generic way. A research project aimed at finding subtle forms of gender bias in AI models would likely result in many rejected null hypotheses, and academia isn't very interested in publishing those types of results. Additionally, researchers and developers are motivated by splashy results that are easy to explain to the public. A project like this is unlikely to yield those results. Finally, what constitutes harmful bias is very context-specific. An executive using an LLM to help understand quarterly results need not worry about gender or racial bias as much as a judge using that same LLM to help determine an appropriate sentence for a convicted criminal.

The Road Forward

Where does this leave us? Just as we typically require human workers to undergo training before making decisions with consequences for vulnerable people, we should require the same of AI models before adopting them in the workplace. AI models are biased, just like humans, but one advantage is that we can test for their context-specific bias more easily than human bias.

What do I mean? If we ask a judge to determine a fair sentence for a convicted criminal, and then change one factor like the criminal's gender and ask the same judge for a reevaluation, she would likely give the same answer as before. We all like to think we're rational and objective in our decision-making, especially professionally. Changing the sentence would admit otherwise. The AI, however, doesn't care about appearing objective and reasonable. It will unapologetically change its answer as dictated by biases present in the model.

Controlling AI Behavior

Another important aspect of LLMs is how easily we can influence their behavior. We can readily get them to change their tone, adjust output complexity, or write in a certain style. This isn't foolproof, but a hiring manager concerned about gender equality might ask an LLM to provide feedback on resumes while "impersonating" the famous feminist Judith Butler. Of course, this simply transfers the manager's worldview, with all its biases, onto the AI. If we strive for truly unbiased AI, free of human baggage, this option may not seem palatable.

The Relationship between Truth and Bias

Earlier I suggested that an assumption or a statement does not need to originate in untruth to be considered biased. Let’s explore that topic further in the context of model training. Let's say we want to train a language model from scratch with as diverse a training corpus as possible. We gather the data but take care to avoid those corners of the internet where we're likely to encounter hate speech. When we test our model, however, we discover unexpected and concerning behavior: when we input the word "blondes" without context ("Blondes are ____"), it returns prejudiced text that paints a picture of blonde women as lacking intelligence and gravitas.

What could be causing this behavior? If we consider the context in which a person's blonde hair color is most likely to be explicitly mentioned in forum discussions, blog posts, or even news, we quickly realize that it is often in connection with the "dumb blonde" stereotype. A mistake or a silly utterance is described and linked directly to physical characteristics ("She's such a blonde that she drove the wrong way down a one-way street and failed her driving test"). The AI's negativity towards blondes, then, is not due to hallucination or misinformation (the woman in the example did indeed make a mistake and is most certainly blonde). Rather, the problem is that the training data, while truthful, is providing the model with an unbalanced view of this particular group of people, meaning that it will default to a negative stereotype.

Drawing a Line in the Sand

A person reading the same exact news that the AI was trained on may draw the same conclusions and perpetuate the same harmful stereotypes as the AI has done. They may even choose to share their views in a public setting, such as their place of work or at a dinner party. In doing so they may encounter resistance or they may find sympathetic ears. Either way, the majority response will likely influence what gets expressed in that same setting going forward. What I am essentially describing is a real-world version of the ‘Comment Analysis’ project I mentioned above. Just like in a real-world setting, the acceptability threshold for language model responses will vary somewhat depending on the audience. We humans adapt our discourse -but not necessarily our underlying opinions or worldview- to our environment, and that might be the most sensible approach to training artificial intelligence as well, i.e., to focus on behavior rather than underlying data.

This could mean that we suppress responses based on some truthful reality that we don't like, and similarly, we might want to encourage untruthful behavior if it's polite. In human interactions, we largely take for granted that we can never know exactly what truths and untruths reside in another person's mind - all we can do is control the output by setting boundaries about what behavior we consider acceptable. We do this by giving our interlocutors sufficient context to understand our worldview and moral code; we lead by example or give clear instructions.

Artificial intelligence models work like humans in this respect. Given sufficient context, they are much less likely to offend our moral sensibilities, regardless of what data they have ingested. This context can be provided to them by the user himself, or by the companies developing the models, in the form of reinforcement learning.

The Responsibility of Tech Companies

It is, after all, not the intention here to suggest that tech companies developing models shouldn't strive to address harmful bias in AI. On the contrary, the biggest and most powerful companies have the means to conduct in-depth research on this topic. They can take user feedback and benchmark results, investigate accordingly, and mitigate undesired model behavior. They have been doing this; when GPT-3 first appeared in 2020, it was sometimes blatantly racist and rude, leading researchers to develop methods for better controlling model output.

But there exists no single agreed-upon definition of what is toxic, sexist, or even rude. What's considered impolite in one culture or friend group can be perfectly acceptable to others. Anyone who has been immersed in different cultures knows this. In the absence of universal standards for acceptable behavior, we live in an AI ecosystem where boundaries around model behavior are mostly determined by the few people working at companies training foundational models.

Perhaps we as Europeans, Icelanders, women, people of color, or any other group would like it to be different. Perhaps we would like a voice in this matter and a hand in deciding what gets developed—and how it's aligned. In that case, we should ideally be training - or at least aligning - our own models.

Moving Forward with Purpose

The future of AI isn't just technical—it's cultural, linguistic, and deeply human. As we stand at this early, critical juncture in AI development, our challenge is twofold: to recognize bias in all its forms and to ensure that diverse perspectives shape the AI systems that will increasingly influence our societies. Here in Iceland, with our unique language and cultural heritage that needs preserving, we cannot shirk this responsibility. But responsibility also brings opportunities to contribute to a discussion that will affect the future of all humanity. We are well capable of creating artificial intelligence systems that amplify our human creativity and productivity rather than merely echo our limitations. But perhaps we first need to put our own house in order, speak openly about the harmful yet complex nature of biases and prejudices, and strive to treat our fellow humans better. This way, we no longer need to demand that artificial intelligence learn to behave better than us humans do.

I'm immensely grateful to my colleague, Haukur Barri Símonarson, for entertaining and enlightening conversations that sparked many ideas put forward in this essay.

Post Tags:

Blog