In this video, the host of code_your_own_AI explores the reliability and performance of large multimodal models (LMMs) in medical Visual Question Answering (VQA) systems. The host opens with a personal experience: undergoing an MRI scan after a bicycle accident, which prompted the question of how AI could aid in medical diagnosis.

The video delves into a study titled ‘Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA’ by researchers at the University of California, Santa Cruz and Carnegie Mellon University. This study introduces the ProbMed dataset, designed to rigorously evaluate the performance of medical AI models. The dataset includes adversarial pairs to test the models’ robustness and reasoning capabilities. Each pair couples an original question with a second question about a misleading or hallucinated attribute, which challenges the AI’s reasoning path.
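To make the idea of an adversarial pair concrete, here is a minimal Python sketch. It is not the ProbMed schema: the `ProbingPair` class, the `make_probing_pair` helper, the question template, and the example findings are all hypothetical, chosen only to illustrate pairing a question about a present attribute with one about an absent (hallucinated) attribute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbingPair:
    """Hypothetical container for one original/adversarial question pair."""
    image_id: str
    original_question: str     # asks about an attribute that IS in the image
    original_answer: str       # expected answer: "yes"
    adversarial_question: str  # asks about an attribute that is NOT in the image
    adversarial_answer: str    # expected answer: "no"


def make_probing_pair(image_id: str, present_finding: str, absent_finding: str) -> ProbingPair:
    """Build a yes/no pair from one true finding and one hallucinated finding.

    The template below is an assumption made for illustration only.
    """
    template = "Is there evidence of {} in this image?"
    return ProbingPair(
        image_id=image_id,
        original_question=template.format(present_finding),
        original_answer="yes",
        adversarial_question=template.format(absent_finding),
        adversarial_answer="no",
    )


# Made-up example: the image shows a pleural effusion but no pneumothorax.
pair = make_probing_pair("img_001", "pleural effusion", "pneumothorax")
print(pair.original_question)
print(pair.adversarial_question)
```

A model that simply answers "yes" to every question would get the original question right and the adversarial one wrong, which is the kind of weakness such a pair is meant to expose.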

The study evaluates three classes of AI models: large foundation models like GPT-4 Vision and Gemini Pro, general-domain LMMs fine-tuned for biomedical fields, and narrow domain-specific LMMs focused on specific medical needs. The results reveal significant weaknesses in the AI models when faced with adversarial questions, highlighting their sensitivity and lack of stable reasoning.

The video presents detailed performance metrics, showing drastic drops in accuracy when adversarial pairs are introduced. For instance, a model’s accuracy could drop from over 80% to as low as 3% when the second, adversarial question is added. This underscores the current limitations of AI in medical diagnosis and the need for more reliable and robust models.
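One way to see why accuracy can fall so sharply is to count an item as correct only when both the original and the adversarial question are answered correctly. The Python sketch below assumes this paired scoring rule; the function names, the record layout, and the toy results are hypothetical and are not figures from the study.

```python
def plain_accuracy(results):
    """Fraction of items whose original question was answered correctly."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(r["original_correct"] for r in results) / len(results)


def paired_accuracy(results):
    """Fraction of items where BOTH questions in the pair were answered correctly."""
    if not results:
        raise ValueError("results must not be empty")
    both = sum(r["original_correct"] and r["adversarial_correct"] for r in results)
    return both / len(results)


# Made-up outcomes for five images (illustration only, not study data).
toy_results = [
    {"original_correct": True,  "adversarial_correct": False},
    {"original_correct": True,  "adversarial_correct": False},
    {"original_correct": True,  "adversarial_correct": True},
    {"original_correct": True,  "adversarial_correct": False},
    {"original_correct": False, "adversarial_correct": False},
]

print(plain_accuracy(toy_results))   # 0.8
print(paired_accuracy(toy_results))  # 0.2
```

For comparison, a model guessing independently at random on two yes/no questions would get both right about 25% of the time, so a paired score below that level is what "worse than random" would mean under this scoring rule.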

The host also discusses the importance of reasoning capabilities in AI models, particularly in medical contexts where stakes are high. The video concludes with reflections on the future of medical AI and the ongoing need for advancements in model training and evaluation to ensure safety and accuracy in critical applications.

code_your_own_AI
June 15, 2024
Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA