The Emerging Use of Artificial Intelligence in Dermatology
Datasets, clinical studies, and external validation
Video Transcription
Just out of curiosity, how many folks out there have an account with OpenAI? Can you raise your hands? Wow, okay, that's a good number, maybe a third? And how many folks have actually used it for any purpose? Okay, so some of you, maybe 15% or so? That's exciting.

Hey everyone, thanks for your patience as we got the talk loaded. I'm Albert Hsu, one of the dermatologists at Stanford. I work with Rob, and thank you, Rob and Veronica, for such a great broad overview that gives us a good look at the whole landscape. I'm going to double-click on stages two and three of what Veronica just presented and talk a little bit about datasets and the role they play in evaluating algorithms. I wear a couple of hats at Stanford: I'm a clinical trialist, and I also work in research informatics for our healthcare system, where it's all AI all the time, so it's an exciting time.

When we think about the developments in AI and dermatology over the past few years, a lot of the focus has been on the algorithms and their performance on retrospective, highly curated datasets. On these datasets, we assess the algorithms' ability to classify various diseases, to prognosticate, and in some circumstances to provide treatment recommendations. In dermatology, at least with the computer vision-based approaches, these algorithms can perform at the level of board-certified dermatologists for a variety of diagnostic tasks, at least under those experimental settings. But it's a very different proposition to say that these algorithms will actually be useful in clinical practice, and that's the phase we're really moving into now: evaluating how they perform in the real world. As we shift to this focus, to channel the words of our chief data scientist, Dr. Shah, in some ways the algorithm is starting to become the sideshow. If we're laser-focused on real-world performance, what we want are datasets that capture the heterogeneity, the edge cases, and the challenges of real clinical practice, so we can see how these algorithms perform against them. Once we see how they do in those scenarios, we have a better understanding of where they do well, and also where they come up short and potentially introduce harm.

So what makes a dataset useful? Not all datasets are equal. They contain different patient populations, different diseases, and, for computer vision-based algorithms, differences in how the images were taken. For each of these dimensions, you want to think about how the algorithm will react, because that tells you whether the algorithm could plausibly generalize to a practice setting like yours if those dimensions resemble your own practice. As with the validation of any diagnostic test, prospective data, or prospectively curated datasets, tend to be the most helpful, because they mitigate many of the biases present in retrospective datasets. With retrospective data, edge cases, atypical presentations, and images with quality issues are often excluded.
But we know that in the real world, anything can walk through the door. You can see diseases that are not well represented in training, or, even if you only see cases the algorithm was trained for, those distributions can shift, as Veronica mentioned, and that can significantly affect algorithm performance. And there are all sorts of curveballs in real-world clinical practice: secondary morphology changes, artifacts, a different distribution of skin tones in your patient population. All of these can negatively impact algorithm performance. In fact, there are so many such factors in real-world practice that it would be shocking if an algorithm did not show performance degradation when brought into the real world, or when challenged with a dataset more reflective of clinical reality. So the question, again, is: what is the degree of performance degradation, and on which particular characteristics? Because then you can make wise decisions about pre-deployment countermeasures, or ask hard questions about whether a particular algorithm is actually relevant to your practice.

Here's a nice illustration of this from our colleague, Dr. Shandell. Imagine you're evaluating a melanoma classifier that's heavily trained on nevi and melanomas. If you feed it a melanocytic lesion, it will probably do a pretty good job of making a prediction. However, once it's out in the real world and ends up in the hands of generalist clinicians, they may see a BCC, not immediately recognize that it isn't a melanocytic lesion, and feed the algorithm that BCC. If BCCs are not well represented in training, there's a good chance it will call the lesion a nevus and give false reassurance. If you do good pre-deployment testing and you see this type of mistake being made repeatedly, you can make a wise decision: should this really be used upstream in the care pathway, in the hands of generalist clinicians, or is it much safer for this particular algorithm to be in the hands of dermatologists who can do appropriate case selection?

I think we're starting to see this play out now. As we get better datasets that are more reflective of real-world practice, and more prospective data, we're developing a more mature understanding of where these algorithms can actually help patients in their diagnostic journeys. Here is one case study I want to highlight: ModelDerm, a state-of-the-art general lesion classifier developed by one of our great collaborators in South Korea. In its original testing on retrospective data, it performed at a level comparable to board-certified dermatologists on a task related to diagnosing keratinocyte cancers of the face. And with the exuberance of the time, pulling directly from the paper, the authors speculated about its unprecedented potential as a mass screening tool, implying almost population-level impact.
However, when they started to feed it datasets more reflective of that actual use case, patient-submitted photos that introduce real-world issues like image quality variation and changes in disease distribution, all they could say was that it performed better than laypeople using Google and was perhaps comparable to general physicians. And when they brought it under the bright lights of a randomized clinical trial, it was clear that the algorithm was actually inferior to board-certified dermatologists, in this case Korean board-certified dermatologists. So even though the results got less and less impressive over this journey, I would argue they got more and more useful, because what they found was that in a subgroup of non-dermatology trainees, the algorithm significantly improved their ability to discriminate, almost in a triage situation, what could potentially be cancerous. If the ultimate yardstick is what helps our patients reach an accurate and timely diagnosis, that is a great insight. It probably doesn't get published in Nature, however.

Let me take you to another good case study, one where an algorithm actually holds up under prospective validation. Highlighting the great work of our Memorial Sloan Kettering colleagues led by Dr. Rotemberg: they took the algorithm that won one of the recent ISIC Grand Challenges, a melanoma classifier. You take a dermoscopic image, and they built a really nice user interface that shows you a melanoma risk. They then designed a tightly focused prospective observational study of what happens when dermatologists use this in one specific situation: ruling out melanoma. Under this very tightly defined task, the algorithm performs very well. Its specificity is quite robust, close to 40 percent, while maintaining a pre-specified sensitivity above 95 percent. And when they married the task closely to how the algorithm was trained, they found that dermatologists exposed to its output significantly improved their ability to discriminate between melanomas and nevi. So this is an example where, by knowing the task definition really well and focusing the use case in the real world, an algorithm can actually hold up under prospective validation.

I'll also note that there's a lot of great work going into making it possible for the general community to do these types of validation, and much of it revolves around developing publicly available, often prospective, datasets. So I want to share a little of our own work on this and the lessons we've learned. This is a Melanoma Research Alliance-funded study under the great leadership of Dr. Novoa, in which we set out to build a prospective multimodal dataset evaluating the task of triage. The bones of it are: across Cleveland Clinic and Stanford over the past three years, when patients came in with a single lesion of concern, we offered prospective enrollment with consent to share their images. We took clinical photos as well as dermoscopic photos in the manner that algorithms expect as input, wrangled the clinical annotation, and then linked it to histopathology to generate a histopathology-proven dataset.
The result is that we recruited over a thousand lesions across 800 patients, and what we have, hopefully, is a broader distribution of diseases that is more reflective of what we might encounter in general practice. So this might be more relevant for, say, triage tasks for clinicians who sit upstream of an expert pigmented lesion clinic or some other specialty clinic. The interesting part is what happens when you take these vaunted state-of-the-art algorithms, previously validated on retrospective data, and run them through prospectively obtained data.

This is a callback to a very well-known algorithm, DeepDerm, published in Nature by many of our great colleagues here. It was one of the first algorithms to show dermatologist-level performance on retrospective data at discriminating skin cancers, particularly melanomas, from benign lesions. What we learned is that under prospective validation, we see performance degradation across the board. In a triage scenario, which I would argue is the use this algorithm was probably designed for, the key performance metric is sensitivity for skin cancers, and we see a fairly dismal sensitivity of 28%. Again, this is an algorithm developed at our institution, tested on many patients from our own institution.

It's very important to understand how an algorithm makes its mistakes, and this is one way to do that. This is a confusion matrix, essentially a three-by-three contingency table, where the vertical axis is the true diagnosis of the lesion and the horizontal axis is what the algorithm called it. One of the most common mistakes this algorithm makes is taking true malignancies and calling them inflammatory. And when we go to a diagnosis-level analysis, a very common mistake is SCCs and BCCs being called inflammatory lesions, offering false reassurance; we suspect this is because they have a lot of erythema. I would argue that this should make you seriously question the clinical readiness of this type of algorithm for deployment, and you probably wouldn't have reached these insights just by reading its performance on retrospective data.

We can also test other things, such as what happens if you tweak the task definition of a well-performing algorithm. Calling back to the ProveAI-validated algorithm, which held up under prospective validation under the very tight task definition of ruling out melanoma: let's mimic a situation where someone says, this is a great melanoma classifier, let me use it to triage everything coming into my clinic. When we feed it this broader distribution of more challenging lesions, including non-melanoma skin cancers, we see a pretty significant drop in its specificity. The reason is that as the algorithm tries to maintain its pre-specified sensitivity above 95%, it generates a massive number of false positives, which makes it not particularly useful in this use case. I think this highlights one important consideration: you should really know your algorithm's task definition.
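[Editor's note: to make the error analysis described above concrete, here is a minimal sketch of a three-by-three confusion matrix with true diagnoses as rows, the algorithm's calls as columns, and sensitivity for skin cancers as the triage-relevant metric. The class names and counts are made-up placeholders, chosen only so the headline figure lands near the 28% quoted in the talk; this is not the study's code or data, and scikit-learn is assumed purely for illustration.]

```python
# Illustrative sketch only: 3x3 confusion matrix (rows = true diagnosis,
# columns = algorithm's call) and sensitivity for skin cancers.
from sklearn.metrics import confusion_matrix

labels = ["malignant", "benign", "inflammatory"]

# Hypothetical placeholder labels; in practice these would come from the
# histopathology-proven prospective dataset and the model's predictions.
y_true = ["malignant"] * 50 + ["benign"] * 120 + ["inflammatory"] * 30
y_pred = (
    ["malignant"] * 14 + ["benign"] * 10 + ["inflammatory"] * 26   # many true cancers called inflammatory
    + ["benign"] * 100 + ["malignant"] * 15 + ["inflammatory"] * 5  # true benign lesions
    + ["inflammatory"] * 25 + ["benign"] * 5                        # true inflammatory lesions
)

cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Sensitivity for skin cancer = correctly flagged malignancies / all true malignancies.
malignant_row = cm[labels.index("malignant")]
sensitivity = malignant_row[labels.index("malignant")] / malignant_row.sum()
print(f"Sensitivity for skin cancers: {sensitivity:.0%}")  # ~28% with these placeholder counts
```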
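[Editor's note: the "pre-specified sensitivity above 95 percent" operating point mentioned for both the ruling-out-melanoma study and the broader triage stress test can also be illustrated with a short sketch: choose the decision threshold that keeps sensitivity at or above the target, then report the specificity achieved there. Holding that same sensitivity on a broader, harder case mix is exactly what drives specificity down and floods the workflow with false positives. This is a generic illustration with simulated scores, not code or data from either study.]

```python
# Illustrative sketch only: pick the threshold that meets a pre-specified
# sensitivity target, then report the specificity at that operating point.
import numpy as np
from sklearn.metrics import roc_curve

def operating_point(y_true, y_score, target_sensitivity=0.95):
    # ROC curve enumerates candidate thresholds with their TPR (sensitivity)
    # and FPR (1 - specificity).
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    # Keep only operating points that meet the sensitivity target...
    eligible = np.where(tpr >= target_sensitivity)[0]
    # ...and among those, take the one with the best specificity (lowest FPR).
    best = eligible[np.argmin(fpr[eligible])]
    return thresholds[best], tpr[best], 1.0 - fpr[best]

# Toy example with simulated scores (hypothetical, not real lesion data):
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)                      # 1 = melanoma, 0 = benign
y_score = np.clip(0.35 * y_true + rng.normal(0.3, 0.2, size=1000), 0.0, 1.0)
threshold, sens, spec = operating_point(y_true, y_score)
print(f"threshold={threshold:.2f}  sensitivity={sens:.1%}  specificity={spec:.1%}")
```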
With this current generation of very task-specifically trained algorithms, if you try to vary that task definition, even to something closely related, there's a good chance the algorithm will not adapt, and you'll get very unpredictable changes in its performance characteristics.

I'll make one final point, which is that the paradigm I've presented, developing gold-standard benchmarks to validate algorithms, is not the only approach you can take. There are many limitations to this approach; you can make entire tables of the limitations of external datasets, as these authors have done. It boils down to this: if, at that moment in time, an external dataset doesn't look much like the patient population you plan to deploy into, you're probably not going to learn anything useful from it. Local validation and the other strategies we've hinted at are an excellent approach and have many advantages over these gold-standard external validations. The reality, however, is that the capacity of different institutions and practices to do their own local validation will vary significantly, so I think the role for high-quality external datasets will remain strong going forward.

To conclude: prospective datasets, and retrospective datasets that are more reflective of clinical practice, are very useful tools for understanding the real-world performance of algorithms. When you challenge an algorithm with real-world scenarios or a real-world dataset, you should expect performance degradation, but seeing how the algorithm fails is incredibly important for understanding where it can actually be clinically useful. My final point is that for the current generation of traditional diagnostic algorithms, understanding the original task definition is critical, because you will not get the same performance if you tweak it. Thank you.
Video Summary
The speaker, Albert Hsu, discussed the importance of evaluating AI algorithms in real-world clinical practice, specifically in dermatology. He emphasized the need for diverse and prospective datasets to assess algorithm performance accurately, and highlighted examples where algorithms performed well on retrospective data but degraded under prospective validation. He stressed the importance of understanding an algorithm's task definition, presenting case studies in which algorithms designed for specific tasks failed to generalize when faced with broader, more challenging datasets. He also discussed the limitations of external datasets and noted that local validation, where feasible, has advantages, though high-quality external datasets will remain important. Overall, he emphasized that understanding real-world performance, including how an algorithm fails, is what determines its clinical usefulness.
Keywords
AI algorithms
clinical practice
dermatology
prospective data sets
algorithm performance
Legal notice
Copyright © 2025 American Academy of Dermatology. All rights reserved.
Reproduction or republication strictly prohibited without prior written permission.