Recent studies have revealed glaring gaps in the clinical readiness of image-based diagnostic AI models. The researchers argue that the findings point to a growing need to supplement conventionally reported metrics with computational stress tests when assessing clinical readiness. Poor discrimination and poor calibration remain major concerns for the real-world use of CNNs, as readiness for clinical practice has not been demonstrated for published models.
Research
Discrimination was tested by training on controlled, curated dermatology datasets and then evaluating on test data drawn from new and potentially more diverse, non-curated datasets. The study found that CNNs perform on par with dermatologists on curated benchmark datasets but fall short when applied to non-curated datasets.
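As a rough illustration of this kind of check, the sketch below compares a model's discrimination (AUROC) on a curated benchmark against an external, non-curated test set. The names `model`, `curated_loader`, and `external_loader` are hypothetical placeholders, not artifacts from the study.

```python
# Sketch: comparing discrimination (AUROC) on a curated benchmark vs. an
# external, non-curated test set. `model` and the loaders are placeholders.
import torch
from sklearn.metrics import roc_auc_score

def evaluate_auroc(model, loader, device="cpu"):
    """Collect predicted probabilities and labels, then compute AUROC."""
    model.eval()
    probs, labels = [], []
    with torch.no_grad():
        for images, targets in loader:
            logits = model(images.to(device))
            # Probability of the positive (e.g., malignant) class
            probs.extend(torch.softmax(logits, dim=1)[:, 1].cpu().tolist())
            labels.extend(targets.tolist())
    return roc_auc_score(labels, probs)

# auroc_curated = evaluate_auroc(model, curated_loader)
# auroc_external = evaluate_auroc(model, external_loader)
# A large drop from auroc_curated to auroc_external signals poor
# generalisation beyond the curated benchmark.
```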
Calibration quantifies how well a model's confidence reflects its actual accuracy: if a model reports 90% confidence in its predictions, it should be correct about 90% of the time. Calibration also helps in deciding when a prediction should be deferred to a human.
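One common way to quantify this is the expected calibration error (ECE), a minimal sketch of which is shown below; the example values are purely illustrative and not drawn from the study.

```python
# Sketch: a simple expected calibration error (ECE) estimate from predicted
# confidences and correctness indicators. Inputs here are illustrative only.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Average |accuracy - confidence| over equal-width confidence bins,
    weighted by the fraction of predictions falling in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# A well-calibrated model that reports ~0.9 confidence should be correct
# about 90% of the time, so its ECE stays close to zero.
# print(expected_calibration_error([0.9, 0.9, 0.8, 0.6], [1, 1, 1, 0]))
```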
Currently, CNN model development does not take calibration into account. It is also important that decision-making be independent of factors such as ink markings, hair, zoom, and lighting. The experiment measured the robustness of skin-cancer classification algorithms in dermatology by changing the magnification or angle of the test images. Almost 30 percent of the image classifications differed markedly from the original predictions, and image transformations exposed further inconsistencies in model predictions across the collected datasets.
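A minimal version of such a robustness probe is sketched below, assuming a PyTorch classifier: the same image is re-classified after small rotation and zoom perturbations, and a change in the predicted class flags sensitivity to image capture rather than to the lesion itself. `model` and `image` (a PIL image) are hypothetical placeholders, and the specific perturbations are not necessarily those used in the study.

```python
# Sketch: checking prediction stability under simple capture variations
# (rotation, zoom). `model` and `image` are hypothetical placeholders.
import torch
from torchvision import transforms

perturbations = {
    "original": transforms.Compose(
        [transforms.Resize((224, 224)), transforms.ToTensor()]),
    "rotated_15deg": transforms.Compose(
        [transforms.Resize((224, 224)),
         transforms.RandomRotation(degrees=(15, 15)),
         transforms.ToTensor()]),
    "zoomed_in": transforms.Compose(
        [transforms.Resize((320, 320)),
         transforms.CenterCrop(224),
         transforms.ToTensor()]),
}

def predicted_class(model, image, transform):
    """Return the class index the model assigns to a transformed image."""
    model.eval()
    with torch.no_grad():
        logits = model(transform(image).unsqueeze(0))
    return logits.argmax(dim=1).item()

# predictions = {name: predicted_class(model, image, t)
#                for name, t in perturbations.items()}
# If the predicted class changes across perturbations, the output depends
# on how the image was captured rather than on the lesion itself.
```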
Possible Improvements
The study points to the need for algorithms that discriminate well for the target population, express uncertainty when they are likely to be wrong, and produce results that are robust to variations in image capture. While the study found lower discrimination on non-curated datasets, the differences were not significant. Models trained on dermoscopic images were observed to perform comparatively better, even when classifying non-dermoscopic images; hence, the study also notes the potential value of dermoscopic images for training.
In terms of robustness, the study strongly recommends further diversifying training datasets by capturing training images with different methods, or using specialized computational techniques such as modifications to the CNN architecture, adversarial training, or leveraging unlabeled examples.
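One common way to instantiate the adversarial-training idea is the fast gradient sign method (FGSM); the sketch below shows a single FGSM training step, assuming a PyTorch model and a clean batch. This is a generic illustration, not the specific technique evaluated in the study, and `model`, `optimizer`, `images`, and `labels` are placeholders.

```python
# Sketch: one adversarial-training step using the fast gradient sign method
# (FGSM). `model`, `optimizer`, `images`, and `labels` are placeholders.
import torch
import torch.nn.functional as F

def fgsm_training_step(model, optimizer, images, labels, epsilon=0.01):
    """Train on adversarially perturbed copies of a clean batch."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()

    # Perturb each pixel in the direction that increases the loss most.
    adv_images = (images + epsilon * images.grad.sign()).detach().clamp(0, 1)

    optimizer.zero_grad()
    adv_loss = F.cross_entropy(model(adv_images), labels)
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```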
Experts caution that handing decision-making to AI might leave healthcare professionals over-reliant on algorithms and could therefore lead to misdiagnosis. When introduced into a clinical setting, AI models should accurately report their confidence; otherwise, they will do more harm than good in real-world use.