Fair and Responsible AI for Dermatology
Video Transcription
All right, so we've talked a lot today, and I'm really excited about how everything has built up to this point. I'm Roxana Daneshjou, and I'm going to be talking about fair and responsible AI for dermatology. My disclosures are not relevant to this talk; I won't be speaking about any particular company.

At this point it's really been drilled in that AI models depend on the data used to build and test them. If you put biased data in, you will get a model that produces biased results. There's no magic here where the model somehow unlearns biases or extrapolates to data it has never seen before. Interestingly, humans are much better at this kind of thing: we might see only one example of something and still be able to extrapolate from it. How many of us saw a single case of an incredibly rare disease in residency and then, many years later in clinic, recognized the same disease from that one example? Raise your hand if that's ever happened to you. AI models cannot really do that. If they see one example, they're not going to learn it well, and so those biases get baked in.

That being said, humans of course have biases as well. Dr. Jenna Lester gave an excellent TED Talk about how skin disease is often missed in darker skin tones, and Dr. Adewole Adamson wrote one of the first perspective pieces in JAMA Dermatology raising the alarm: because AI is driven by data, and we know there are biases in dermatology textbooks, in our educational system, and in our datasets, he was concerned that these biases could get baked into the algorithms. And in fact, Dr. Rotemberg, I, and other colleagues did a study reviewing 70 AI algorithm papers. We found, first, that unsurprisingly most of them don't share their data, and second, that they don't really describe what skin tones or ethnicities were used in developing the algorithms. Of the roughly 10% of papers that actually described what skin tones went into the algorithm, the majority either excluded or severely underrepresented Fitzpatrick skin tones 5 and 6.

I'm an AI researcher; that's what I spend about 90% of my time doing. The other 10% is clinical practice in dermatology. So this is a big problem. One of the things we did was build a diverse dataset of images with balanced representation: matched disease between Fitzpatrick 1 and 2 and Fitzpatrick 5 and 6, matched age, matched sex, and matched time the photo was taken, because camera technology changes and that can affect algorithms. We then tested three previous state-of-the-art AI algorithms, the kinds of algorithms published in high-impact journals making claims like "these algorithms beat dermatologists." We've seen that claim over and over again; at this point I just roll my eyes every time I see it. The results are shown as receiver operating characteristic (ROC) curves, summarized by the area under the curve. The red line represents chance performance, so the closer a curve comes to the red line, the closer the model is to simply guessing.
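To make that kind of subgroup evaluation concrete, here is a minimal sketch of computing the area under the ROC curve separately for each Fitzpatrick group. The file name and column names are hypothetical placeholders for illustration, not the actual data or code from the study.

```python
# Minimal sketch of subgroup AUROC evaluation (hypothetical data layout,
# not the study's actual code or data).
import pandas as pd
from sklearn.metrics import roc_auc_score

# Assume one row per image:
#   y_true      - 1 if the lesion is malignant, 0 if benign
#   y_score     - the model's predicted probability of malignancy
#   fitzpatrick - skin tone group label, e.g. "1-2" or "5-6"
df = pd.read_csv("predictions.csv")  # hypothetical file

for group, rows in df.groupby("fitzpatrick"):
    auc = roc_auc_score(rows["y_true"], rows["y_score"])
    print(f"Fitzpatrick {group}: AUROC = {auc:.3f} (n = {len(rows)})")
```

A large gap between the per-group AUROCs, or between per-group sensitivities at a fixed operating point, is exactly the kind of disparity described next.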
The first thing we found is that all the algorithms did much worse than their reported performance on our dataset, because our dataset was new data they had never seen before: there might be differences in lighting, in disease distribution, and certainly in skin tone. The other thing we found is that when we compared performance on Fitzpatrick 1 and 2 versus 5 and 6, all the algorithms did significantly worse on Fitzpatrick 5 and 6, with statistically significantly worse sensitivity for detecting malignancies. The thing is, most AI algorithm papers never report how their models perform across skin tones, and even the FDA-approved models have not reported any data on differences in performance across skin tones. So this is a huge issue.

One other thing we showed in this study is that, even though camera technology behaves differently across skin tones and there are biases there, this is an issue that can be fixed. We took one of the models we had access to (usually, unfortunately, you don't have access to the AI models) and gave it examples of diverse data to see if we could train it to do better. What we found is that if you give it only images of light skin, Fitzpatrick 1 and 2, it performs a little better on our dataset, because it's picking up on differences in, say, lighting or the camera that was used, but the gap in performance between Fitzpatrick 1 and 2 and 5 and 6 persists. The only way to close that gap is to expose the model to diverse images and train on them (a short code sketch of this kind of fine-tuning appears below). Basically, we wanted to show that this is a fixable, surmountable problem; people just have to be aware that they need to use diverse skin tones when training and testing their models.

So I encourage all of you: if you get approached by companies making claims about AI models they want you to use in clinic, whether for triage, diagnosis, or anything else, ask the questions: What was the diversity of skin tones used to train your model? Do you have any data showing that your model performs fairly across skin tones? Companies will only respond to this if they get pushed on it. Unfortunately, as the data has shown, they're not going to do it out of the goodness of their hearts.

I also wanted to bring up synthetic data. The field of generative AI has exploded, and many companies have said, "Okay, we don't have diverse data, so we're going to generate synthetic images of disease across diverse skin tones and use those to train our model to do better." Our reaction was: wait a second, that sounds very interesting. Certainly, when you're training AI models, there are all sorts of transformations you can apply to the data to make a model more generalizable. But what actually happens if you use synthetic data? In our experiment, as you move to the right, the proportion of real images increases, and every dot within a box is an increase in synthetic images. The takeaway is that the box on the right, the one with the most real images, leads to better performance than any amount of synthetic data. Synthetic data can boost performance, but it turns out that real data is king.
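As a brief aside on the fine-tuning experiment described above: here is a minimal sketch of what exposing a pretrained classifier to diverse images can look like, assuming PyTorch and torchvision. The folder layout, model choice, and hyperparameters are illustrative assumptions, not the study's actual pipeline.

```python
# Minimal transfer-learning sketch: fine-tune a pretrained classifier on a
# folder of diverse, labeled images. Illustrative assumptions throughout.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Hypothetical layout: diverse_images/train/{benign,malignant}/...
train_ds = datasets.ImageFolder("diverse_images/train", transform=tfm)
loader = DataLoader(train_ds, batch_size=32, shuffle=True)

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)  # benign vs. malignant head

opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # small lr for fine-tuning
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(3):
    for x, y in loader:
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
```

The key design point is the data rather than the architecture: per the finding above, running the same loop on light-skin-only images leaves the subgroup gap in place, which is why the sketch assumes a training folder balanced across skin tones.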
What the synthetic-data result means is that if you're a group designing an algorithm, and you have thousands of real images for your Fitzpatrick 1 and 2 diseases but far fewer real images for Fitzpatrick 5 and 6, and you try to fill that gap with synthetic images, you're still going to end up with bias, because synthetic images do not work as well as real images. Again, this brings home the point that if we're going to build fair algorithms, we need to actually do the work of collecting diverse image data across all skin tones.

I also wanted to home in on one more point, from a paper just hot off the press. What matters most at the end of the day is understanding how these models work in partnership with the humans who will use them, because we know at this point that autonomous AI in dermatology is not going to happen; the models right now are just not good enough. There are models that could help with triage, with decision support, or with helping a non-specialist decide whether to make a referral. So we were curious, in terms of fairness, what happens when a model gives advice. There have been plenty of studies on models giving advice to doctors in dermatology and other fields, but this was one of the first studies looking at what happens with bias in how that advice is taken in. This was a large-scale experiment rather than a real-world trial: dermatologists and primary care physicians were shown images of the same skin diseases across skin tones.

One of the first things we showed is that humans have biased performance at baseline. To some people this may not be surprising; others contend it can't be true. But the fact is that we did the experiment, and unfortunately there is biased performance in identifying the same skin diseases across different skin tones. We then gave the physicians decision support from models that were fair, meaning the models performed similarly across Fitzpatrick 1 and 2 and Fitzpatrick 5 and 6. We found that basically everybody's performance got boosted by having decision support, which is congruent with what other studies have shown: decision support can help. But interestingly, even though performance was boosted, the clinicians believed the model more on Fitzpatrick 1 and 2 images than on 5 and 6, perhaps because they were aware of prior results saying models were biased, even though we told them these were fair models. For whatever reason, even though the model boosted performance on both Fitzpatrick 1 and 2 and 5 and 6, the disparity gap actually increased, which was a surprising result to us. It goes back to this: you need to know how the model works once the human decision-maker comes into the loop.
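One simple way to quantify that trust asymmetry is to compare, within each skin tone group, how often clinicians' final answers follow the model's suggestion. Here is a minimal sketch; the record layout is a hypothetical assumption for illustration, not the study's actual data.

```python
# Sketch of measuring how often clinicians follow AI advice, split by skin
# tone group (hypothetical record layout, not the study's actual data).
from collections import defaultdict

# One record per clinician decision on one case where AI advice was shown.
records = [
    {"fitzpatrick": "1-2", "ai_advice": "malignant", "final_answer": "malignant"},
    {"fitzpatrick": "5-6", "ai_advice": "malignant", "final_answer": "benign"},
    # ... many more trials ...
]

followed = defaultdict(int)
total = defaultdict(int)
for r in records:
    total[r["fitzpatrick"]] += 1
    followed[r["fitzpatrick"]] += r["final_answer"] == r["ai_advice"]

for group in sorted(total):
    rate = followed[group] / total[group]
    print(f"Fitzpatrick {group}: followed AI advice {rate:.0%} of the time")
```

In the experiment described above, this rate was effectively higher for Fitzpatrick 1 and 2 than for 5 and 6, even though the decision support was equally accurate on both.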
So my key takeaways are these. Technology, even though I'm basing my entire career on developing and building it, is not always the savior. I'm definitely not against technology; in fact, I'm one of its biggest promoters. But it's not the savior, and we also have to address the biases that exist at baseline in our clinical decision-making. To build unbiased models, we need diverse and representative data, and synthetic data doesn't really make up for a lack of real data. And it's really important to test models the way they are intended to be used in the clinical realm, so that we can understand the biases that may come into decision-making when the model works in partnership with a human.

The last thing I'll say before I step down: we've talked again and again about data today, and I want people to know that the Journal of Investigative Dermatology now has a track where you can submit datasets as a research paper. The reason this exists is that we wanted to acknowledge that dataset work is real research. One of the issues is that people haven't felt inclined to share de-identified data because there hasn't been credit for all the work it takes to cleanly label those datasets. Now there's a way: you don't have to do any analysis at all; you can simply describe the dataset, what's in it, how it was created, and how it was labeled, and publish that as an academic publication. I think that's going to help us move toward more sharing of de-identified data, and hopefully increase the diversity of the datasets developed for AI use. Thank you.
Video Summary
In this video, Roxana Daneshjou discusses the importance of fair and responsible AI in dermatology. She explains how biases in AI models result from biased input data, emphasizing the need for diverse and representative datasets to mitigate these biases. Daneshjou presents findings from a study testing AI algorithms across skin tones, showing significant performance disparities, and underscores the importance of testing AI models with diverse data and of studying how clinicians use model advice so that biases can be addressed effectively. She also notes the significance of sharing de-identified datasets for advancing AI research in dermatology.
Keywords
fair AI
diverse datasets
skin tones
human decision-making
de-identified datasets
Legal notice
Copyright © 2024 American Academy of Dermatology. All rights reserved.
Reproduction or republication strictly prohibited without prior written permission.