Artificial intelligence is radically changing how we work, learn, play and socialize, from virtual assistants helping organize our day to bots that can score Taylor Swift tickets or write college-level essays.
But that vast computing capability may also come at a cost, generating results that are rife with bias if the data that was used to train AI systems is itself biased against or excludes certain groups of people. To counter this issue, we hear about the efforts of two engineering and computer science doctoral students in the Pacific Northwest.
At the University of Washington, Kate Glazko led a team of researchers on a study that found that the popular AI application ChatGPT routinely ranked job seekers lower if their CVs mentioned an award or recognition that implied they had a disability such as autism or blindness. At Oregon State University, Eric Slyman developed computing instructions that can be used to train AI to be less biased against marginalized groups when generating image search results. Slyman and Glazko join us for more details.
This transcript was created by a computer and edited by a volunteer.
Dave Miller: From the Gert Boyle Studio at OPB, this is Think Out Loud. I’m Dave Miller. Artificial intelligence is radically changing how we work and learn and create stuff, but its vast computing power also comes at a cost, generating results that can replicate or sometimes exacerbate existing societal biases. I’m joined now by two computer science doctoral students in the Northwest who are each working to tackle different aspects of this problem. Kate Glazko at the University of Washington led a team of researchers that found ChatGPT was biased against people with disabilities when making hiring decisions, and then showed it could be instructed to be less so. Eric Slyman at Oregon State University developed simple instructions to train AI to be less biased against marginalized groups when generating images. Slyman and Glazko both join me now. It’s great to have both of you on Think Out Loud.
Eric Slyman: Yeah, thanks so much for having us here.
Kate Glazko: Thank you.
Miller: Eric, first. What is the problem that you set out to address?
Slyman: Frequently, whenever we talk about fairness and bias in artificial intelligence, we’re studying either in the research space, where we’re just looking at a model of interest, or in the actual products that have been deployed. But there’s a lot we have to do in between to actually get those models from a research space into production. So I study one of those places, where people try to figure out how we can actually train AI. And the thing we have to do there is reduce the amount of data that we’re showing it, because it’s just too expensive to show AI everything that’s out there.
Miller: Oh, so when you’re training it, you have to give it a bunch of text or photos, and you’re saying it’s expensive to give it a ton of stuff because it takes time and people to just input it?
Slyman: Yes, exactly. We’re showing these AI billions of images and corresponding captions, and it can take upwards of a million dollars to train even small, by industry standards, versions of these models. So if we can reduce the amount of images that we have to show them, then it becomes cheaper and faster to get them off into the world.
Miller: Well, what’s wrong with the current ways that computer scientists are slimming down the number of images that they’re feeding in?
Slyman: So imagine that we are showing our AI images of doctors. The distribution, the spread of the kinds of doctors we see on the internet, might not align with what we want our AI to learn. For example, we know that in the US, doctors are typically white men around their forties, but we might want our AI to learn an even representation of all of those people. So whenever we show it images from the internet, it learns that bias. And when we do this sort of deduplication process to remove some of those images, we actually end up reinforcing it. We keep more of those dominant-group photos than we would have otherwise had.
Miller: Why is that? If the idea of deduplication is to give AIs less stuff so they can learn things in a faster way and a cheaper way, why is it that more pictures of white male doctors would be more likely to be kept? Those seem less likely to be, in my mind, considered duplicates than a woman of color who is a doctor as tagged in an image. It seems like that would be more unique and should just by definition be less likely to be thrown out, if that makes sense.
Slyman: Yeah, it does. And that’s exactly what we would hope when we think about these algorithms. But it’s kind of a funny quirk of the way they work. People, when they use them, deduplicate really aggressively. They might cut out half of their data. And so you’re not just removing exact copies of the same images, but ones that are really similar semantically to each other. You can imagine if we have doctors in different poses, we would consider every doctor in some pose – maybe they’re standing up with their hands on their hips – to be the same. And if only two of 10,000 of those images are of a doctor, let’s say a woman of color, then the odds that one of those two pictures or both of those pictures are the ones that get removed are pretty high when we’re randomly removing half of those 10,000 images.
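To make Slyman’s point concrete, here is a small, purely illustrative simulation, not the researchers’ code: a near-duplicate cluster of 10,000 “doctor” images contains only two photos of a woman of color, and an aggressive deduplication pass randomly keeps half the cluster.

```python
import random

# Illustrative simulation only, mirroring the numbers in the conversation:
# 2 minority-group photos out of 10,000 near-duplicates, 50% random pruning.
CLUSTER_SIZE = 10_000
MINORITY_COUNT = 2          # indices 0 and 1 stand in for the minority-group photos
KEEP_FRACTION = 0.5
TRIALS = 2_000

def minority_survives() -> bool:
    """Randomly keep half of the cluster; return True if any minority image remains."""
    kept = set(random.sample(range(CLUSTER_SIZE), int(CLUSTER_SIZE * KEEP_FRACTION)))
    return any(idx in kept for idx in range(MINORITY_COUNT))

both_lost = sum(not minority_survives() for _ in range(TRIALS)) / TRIALS
print(f"Both minority-group photos pruned in about {both_lost:.0%} of runs")
# With these numbers, both photos vanish in roughly a quarter of runs, and at
# least one of them in roughly three quarters; harsher pruning makes it worse.
```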
Miller: So how do you change this?
Slyman: We allow humans to specify, either through example images of their own or just natural language descriptions – for example, me telling my AI, “I want you to preserve photos of Black women” – that it should make sure not to overly prune any of these groups of people. So that when it’s going through and selecting, it has some notion of: I’ve seen a lot of these white male doctors so far, I should try to make sure I grab a few other people.
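A minimal sketch of that idea, under assumptions: this is not Slyman’s actual FairDeDup implementation, and the `group_of` labeler is hypothetical. When picking which image in a near-duplicate cluster to keep, prefer the one whose user-specified group has been kept least often so far.

```python
from collections import Counter

def pick_representative(cluster, group_of, kept_counts):
    """Keep one image from a near-duplicate cluster, preferring the image whose
    group has been kept least often so far, so no user-specified group gets
    pruned away disproportionately. `group_of` is a hypothetical callable that
    maps an image to a group label ("Black women", "other", ...), derived from
    the user's example images or natural-language descriptions."""
    chosen = min(cluster, key=lambda image: kept_counts[group_of(image)])
    kept_counts[group_of(chosen)] += 1
    return chosen

# Usage sketch: `clusters` would come from a semantic-deduplication pass over
# the training set (groups of images the model treats as interchangeable).
kept_counts: Counter = Counter()
# kept = [pick_representative(cluster, group_of, kept_counts) for cluster in clusters]
```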
Miller: What happened when you gave it those directions?
Slyman: We saw it do exactly that and it was able to grab a more balanced data set to then use to train AI downstream. And we were able to see that we got AI coming out of this that was just as accurate but didn’t have the same kinds of fairness disparities or biases happening in the process.
Miller: And that’s what matters the most, the product that this computer is spitting out. Meaning, after you had it trained on a differently weeded-out set of images, when you said “give me a picture of a doctor,” it was less likely to be a white man?
Slyman: So in the case of this AI that we trained, we said “I want you to be in line with U.S. discriminatory law.” We said that we want it to be an equal representation across all protected demographics for any occupation. So if we were to tell it “yes, generate a doctor,” it would be more likely to show anyone that could be a doctor. If we said we were using it as a search engine to retrieve images for us, it would be more likely to show us a diverse panel of what doctors might look like.
Miller: Eric, I want to hear more about what you did and why and what comes next. But as I mentioned, Kate Glazko is with us as well, a doctoral student at the college of engineering and computer science at the University of Washington.
Kate, why did you and your team decide to focus on the use of AI to sort through resumes?
Glazko: My prior research has focused on some of the ways in which emerging technologies – generative AI and ChatGPT – can help or harm people with disabilities like myself. We found potential benefits, like improved accessibility, but also potential harms, such as biases and stereotyped depictions. And this made me really curious about how generative AI technologies such as ChatGPT are being used in real-world scenarios that can have a tangible impact on people’s lives.
Last fall, when I was applying for internships – and I have a pretty large network due to being in industry before my PhD – I started to notice people on my LinkedIn posting about using GPT to make the recruiting and hiring process more efficient. And at the time, I couldn’t help but wonder how these potential AI biases that we’ve already seen were playing out in these types of real-world tasks, like resume ranking and hiring.
Miller: So can you describe the different resumes that you created – the control one and then the various experimental versions?
Glazko: Yeah. So these resumes for the experiment were created based on real life advice. I’ve been given well-meaning advice before as a disabled job seeker to leave disability-related items off of my resume to avoid negative perceptions. And that’s actually a pretty common experience for many disabled job seekers.
We compared two versions of the same resume – and these are academic resumes, so they’re pretty long, 10 pages overall – that were identical except that we added four extra disability-related items. These were really positive items, things like scholarships and awards that were disability-related, but they were a tiny part of the overall resume. Otherwise, all the experiences and publications were the same; we just added those extra items.
We tried this across six different disabilities: general disability, blind, deaf, autism, cerebral palsy and depression. And we gave ChatGPT 10 tries for each of these resumes in the rankings against the control.
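A rough sketch of how an experiment like this could be scripted, under assumptions: `ask_model` is a hypothetical wrapper around whichever chat-model API is used, and the prompt and answer parsing are simplified stand-ins, not the study’s actual protocol.

```python
from collections import Counter

DISABILITIES = ["disability", "blind", "deaf", "autism", "cerebral palsy", "depression"]
TRIALS = 10  # the study gave ChatGPT 10 tries per comparison

def rank_pair(control_cv: str, enhanced_cv: str, ask_model) -> str:
    """Ask the model to rank two otherwise-identical CVs and report the winner."""
    prompt = (
        "Rank these two candidates for a research position, best first, "
        "and explain your reasoning.\n\n"
        f"Candidate A:\n{control_cv}\n\nCandidate B:\n{enhanced_cv}"
    )
    answer = ask_model(prompt)
    # Naive parsing for illustration; a real harness would request a structured answer.
    return "enhanced" if answer.strip().startswith("Candidate B") else "control"

def run_experiment(control_cv: str, enhanced_cvs: dict, ask_model) -> dict:
    """`enhanced_cvs` maps each disability to a CV identical to the control
    except for four added, positive, disability-related items (awards, scholarships)."""
    return {
        disability: Counter(
            rank_pair(control_cv, enhanced_cvs[disability], ask_model)
            for _ in range(TRIALS)
        )
        for disability in DISABILITIES
    }
```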
Miller: And just to be clear, the references to these various disabilities were made in relatively subtle, it seems, and purely positive ways – about scholarships or awards – and these were just brief mentions in 10 pages of resume?
Glazko: Yeah, absolutely. And these are the kind of things that you could expect to put on your resume in real life, like if you get a scholarship that has to do with ADHD, or autism, or an award or something. So yeah, these were all purely positive and very subtle.
Miller: How did ChatGPT respond when you asked it to rank these various resumes?
Glazko: So what we saw was not great. No disability resume was consistently ranked first out of the 10 trials against the control, despite the additional awards and extracurriculars. Some of them did tie – the blind and general disability CVs did tie with the control CVs – but the rest of the disabilities ranked first much less often than the control. One of the disabilities did not rank first a single time.
Miller: Am I right that autism was the one that ChatGPT treated most harshly?
Glazko: Yes, it was.
Miller: How do you explain these different outcomes for resumes that mention different disabilities? And I guess I’m wondering in particular about autism, which fared the worst.
Glazko: So with these black box models, we don’t have a way of saying exactly why, but it does tie to real-world statistics showing that some disabilities are particularly stigmatized. And we got some hints from the justifications that ChatGPT provided us. For example, in the summaries that it gave of the candidates, it described resumes with autism as having less leadership experience, even though all of these disabled resumes had an additional leadership-related award. One quote said the candidate showed “less emphasis on leadership roles, projects and grant applications” compared to the control. And we really do think that this potentially reflects the kind of biases and stereotypes that are seen in real life.
Miller: I’m intrigued by the limited information you were able to get because we’ve talked a lot on the show over the last year or so about, as you just mentioned, the black box-ness of AI. We give it this stuff, but it’s not at all a human way of understanding the world and it’s sort of impenetrable, the actual nuts and bolts of its deliberative mechanism. We don’t understand it.
But it did give you some kind of reasoning for why it made its decisions? I mean, lack of leadership, that’s something a human would say. But still, you can’t know why exactly it told you that rationale?
Glazko: I mean, we did have some clues from the kinds of summaries that it provided. But definitely more research needs to be done on sort of the white box versions of these models, where you can actually see quantitatively why certain decisions are being made and what weights are being assigned. In looking at the results, we did have some hints overall – there were words that showed up significantly more often when describing the resumes with disabilities versus the ones that did not mention one. So from our statistical analysis, we were able to get some clues. For example, for the control resumes, the descriptions were much more likely to mention research, experience and industry. All the resumes that mentioned disability were more likely to mention DEI and less likely to mention those other attributes. So we do have some clues based even on the ways that ChatGPT was describing these resumes.
Another example – there were some really interesting quotes, like this one: “Potential overemphasis on non-core qualities. The additional focus on DEI and personal challenges, while valuable, might detract slightly from core technical and research-oriented aspects of the role” – that is how ChatGPT described the resume that mentioned depression. So from these kinds of summaries and outputs, we have some clues.
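One way to surface that kind of clue, sketched loosely here: this is not the study’s analysis pipeline, and the keyword list is just an example drawn from the terms Glazko mentions. The idea is to compare how often words like “research,” “industry” or “DEI” appear in the model’s summaries of the control resumes versus the disability resumes.

```python
import re

KEYWORDS = ["research", "experience", "industry", "DEI", "leadership"]

def keyword_rates(summaries: list[str]) -> dict[str, float]:
    """Fraction of summaries that mention each keyword at least once."""
    return {
        word: sum(bool(re.search(rf"\b{re.escape(word)}\b", s, re.IGNORECASE))
                  for s in summaries) / len(summaries)
        for word in KEYWORDS
    }

# Usage sketch: compare the model's summaries of the control CVs against its
# summaries of the CVs that mention a disability, then test the gaps statistically.
# control_rates    = keyword_rates(control_summaries)
# disability_rates = keyword_rates(disability_summaries)
```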
Miller: I should note that you did then follow up. You gave the AI specific directions to demonstrate an understanding of disability and inclusion, and to be more aware of inclusion efforts around disabilities. And there was some improvement. It does make me wonder what fairness would look like to you. What would it take for you to say, “this AI behaved exactly the way I would want it to when given the prompt”?
Glazko: Oh, thank you so much. I love that question, and you’re right. When we asked ChatGPT to be disability aware, it showed less bias overall, and some of the disability resumes actually began to rank first more often than the control. But other resumes, like the ones that mentioned autism and depression, still didn’t do as well. And I think fairness means that, with everything else equal, the resumes showing these positive attributes – the awards and scholarships – should be ranked higher.
But the question goes beyond how I want it to behave because, again, we found that the one-size-fits-all approach to fairness didn’t work and resulted in unequal improvements in bias. So to me, true fairness is ensuring the input and involvement of people with disabilities, and other diverse and even intersectional identities, in designing, building and testing these AI technologies. Because there could be so many scenarios where potentially more marginalized or less represented conditions may not be adequately addressed by a bias fix. So it’s really about bringing that involvement in.
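The follow-up intervention Glazko describes amounts to putting an awareness instruction in front of the same ranking prompt. A hypothetical sketch of the idea, paraphrasing rather than quoting the study’s actual custom instructions:

```python
# Hypothetical wording for illustration only; the study's custom instructions differed.
AWARENESS_INSTRUCTION = (
    "You understand disability and disability justice. Treat disability-related "
    "awards, scholarships, and advocacy as evidence of skill and leadership, "
    "not as a distraction from technical qualifications."
)

def ask_model_disability_aware(prompt: str, ask_model):
    """Wrap the base model call with the awareness instruction prepended,
    mirroring how a custom system message or GPT instruction would be set."""
    return ask_model(AWARENESS_INSTRUCTION + "\n\n" + prompt)
```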
Miller: Eric, how do you think about the question of fairness, which is such a short word for an enormous issue? I mean, one thing that comes to mind if we’re talking about images of CEOs, for example… I looked this morning and in 2023, only 58 women and eight Black people were the CEOs of Fortune 500 companies. So an accurate representation of where we are in American society right now would say that the picture of CEOs would reflect that.
That obviously doesn’t mesh with what many of us want to see in terms of inclusion and diversity. But that is an accurate picture of the world as it is right now. And it’s just one piece of this: when we imagine feeding in pictures of CEOs, some of them might come from those Fortune 500 companies. How do you think about what fairness means when it comes to translating it to these nonhuman tools that are increasingly powerful?
Slyman: Yeah, I love this question so much. This is part of why I love this type of research because there’s dozens of definitions, even beyond the two you just gave me, of what fairness could be. And almost all of the time they’re in conflict with each other. So, for the CEO one, exactly what you said – the depiction of what we might consider equal or equitable, it’s not the same thing as the depiction that is necessarily real.
I don’t think that it’s up to an individual researcher to say this is what it ought to be because there’s even more ways we might want it to act. And what it comes down to, to me, when I talk about fairness, is the ability to control AI’s behavior in specific contexts. So instead of saying this is how we think the AI ought to act all the time, enabling it so that when I go and say, “show me a picture of a CEO,” that instead of just showing me what it thinks is the right picture, it can respond and say, “Can you tell me a little bit more? Do you want a picture of a CEO that’s from a real depiction of the world? Do you want it to be an equal representation of people? Tell me more about what you as a user of this system are looking for.”
Miller: We just have about 30 seconds left. But do you trust big tech companies to build their platforms in such a way that that kind of interaction will be easy and just standard?
Slyman: I trust big tech companies to do things that make them money.
Miller: Yeah.
Slyman: And it’s a fact that the more people that can use your tools effectively, the more people then will use it, right? If we make it so anyone can, they will …
Miller: If we want it, if we ask for it, then they’ll do it. And if we don’t, they won’t.
Slyman: Exactly.
Miller: Kate Glazko and Eric Slyman, thank you very much.
Slyman / Glazko: Thank you.
Miller: Eric Slyman is a doctoral student at the college of engineering and computer science at Oregon State University. Kate Glazko is a doctoral student at the college of engineering and computer science at the University of Washington.