CARES

A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models

UNC-Chapel Hill · University of Illinois Urbana-Champaign · Brown University · University of Washington · Microsoft Research · UT Arlington · Monash University · Stanford University
*Equal Contribution

🔥 [NEW!] We delve into the trustworthiness of Med-LVLMs across 5 key dimensions: trustfulness, fairness, safety, privacy, and robustness, using 41K Q&A pairs spanning 16 image modalities and 27 anatomical regions.

🧐🔍 Findings: Models often exhibit factual inaccuracies and fail to maintain fairness across demographic groups; they are also vulnerable to attacks and lack privacy awareness.

Abstract

Artificial intelligence has significantly impacted medical applications, particularly with the advent of Medical Large Vision Language Models (Med-LVLMs), sparking optimism for the future of automated and personalized healthcare. However, the trustworthiness of Med-LVLMs remains unverified, posing significant risks for future model deployment.

  1. Evaluation Dimensions. We introduce CARES to Comprehensively evAluate the tRustworthinESs of Med-LVLMs across the medical domain, assessing five dimensions: trustfulness, fairness, safety, privacy, and robustness.
  2. Data Format and Scale. CARES comprises about 41K question-answer pairs in both closed and open-ended formats, covering 16 medical image modalities and 27 anatomical regions.
  3. Performance. Our analysis reveals that the models consistently exhibit concerns regarding trustworthiness, often displaying factual inaccuracies and failing to maintain fairness across different demographic groups. Furthermore, they are vulnerable to attacks and demonstrate a lack of privacy awareness.
  4. Open-source. We make the GPT-4-generated evaluation data and codebase publicly available.

🏆 Leaderboard

Scores on the CARES benchmark. Here "ACC" denotes accuracy, "OC" the over-confident ratio, "Abs" the abstention rate, "Tox" the toxicity score, and "AED" the accuracy equality difference. For the Privacy and Overcautiousness columns we report "Abs". A toy sketch of the AED computation is given after the table; for detailed results, please see the paper.

Columns are grouped by dimension (all values are percentages): Trustfulness (Factuality, Uncertainty), Fairness (Age, Gender, Race), Safety (Jailbreaking, Overcautiousness, Toxicity), Privacy (Zero-shot, Few-shot), and Robustness (Input, Semantic).

| # | Model | Institution | Factuality ACC↑ | Uncertainty ACC↑/OC↓ | Age AED↓ | Gender AED↓ | Race AED↓ | Jailbreaking ACC↑/Abs↑ | Overcautiousness Abs↓ | Toxicity Tox↓/Abs↑ | Zero-shot Abs↑ | Few-shot Abs↑ | Input ACC↑/Abs↑ | Semantic Abs↑ |
|---|-------|-------------|-----------------|----------------------|----------|-------------|-----------|------------------------|------------------------|--------------------|----------------|----------------|------------------|----------------|
| 1 | LLaVA-Med | Microsoft | 40.4 | 38.4 / 38.3 | 18.3 | 2.7 | 4.7 | 35.6 / 30.2 | 59.0 | 1.37 / 17.4 | 2.71 | 2.04 | 42.9 / 6.68 | / |
| 2 | Med-Flamingo | Stanford & Hospital Israelita Albert Einstein & Harvard | 29.0 | 33.7 / 59.1 | 11.8 | 1.6 | 4.8 | 22.5 / 0.00 | 0.00 | 1.88 / 0.35 | 0.76 | 0.65 | 37.5 / 0.00 | / |
| 3 | MedVInT | SJTU & Shanghai AI Lab | 39.3 | 32.9 / 52.9 | 19.7 | 0.8 | 2.0 | 34.1 / 0.00 | 0.00 | 1.53 / 0.04 | 0.00 | 0.00 | 57.9 / 0.00 | 0.01 |
| 4 | RadFM | SJTU & Shanghai AI Lab | 27.5 | 35.9 / 58.5 | 14.0 | 2.7 | 13.8 | 25.4 / 0.65 | 1.00 | 0.83 / 2.58 | 0.00 | 0.00 | 22.2 / 0.02 | 0.06 |
| 5 | LLaVA-v1.6 | Microsoft & UW Madison | 32.3 | 42.5 / 44.7 | 19.7 | 1.9 | 6.4 | 29.4 / 1.13 | 3.67 | 13.0 / 5.18 | 14.0 | 13.2 | / | / |
| 6 | Qwen-VL-Chat | Alibaba | 33.8 | 50.7 / 17.0 | 16.1 | 1.0 | 3.1 | 31.1 / 5.36 | 2.67 | 1.69 / 7.26 | 10.4 | 9.82 | / | / |
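As a reference for the fairness columns, below is a minimal sketch of how an accuracy equality difference (AED) could be computed, assuming AED is taken as the gap between the highest and lowest per-group accuracy; the group labels and helper function are illustrative rather than the benchmark's actual implementation, and the paper should be consulted for the exact definition.

```python
from collections import defaultdict

def accuracy_equality_difference(records):
    """Illustrative AED: gap between the best and worst per-group accuracy.

    `records` is a list of dicts with keys "group" (e.g. an age bracket,
    gender, or race label) and "correct" (bool).
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["group"]] += 1
        hits[r["group"]] += int(r["correct"])
    per_group = {g: hits[g] / totals[g] for g in totals}
    return max(per_group.values()) - min(per_group.values())

# Toy usage: 0.8 accuracy for one group vs 0.6 for another -> AED = 0.2
example = (
    [{"group": "male", "correct": c} for c in [True] * 8 + [False] * 2]
    + [{"group": "female", "correct": c} for c in [True] * 6 + [False] * 4]
)
print(accuracy_equality_difference(example))  # 0.2
```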

CARES Datasets

We utilize open-source medical vision-language datasets and medical image classification datasets to construct the CARES benchmark, which together cover a wide range of medical image modalities and body parts. This diversity ensures richness in question formats and yields coverage of 16 medical image modalities and 27 human anatomical structures.

| Data Source | Data Modality | # Images | # QAs | Dataset Type | Answer Type | Demography |
|-------------|---------------|----------|-------|--------------|-------------|------------|
| MIMIC-CXR | Chest X-ray | 1.9K | 10.3K | VL | Open-ended | Age, Gender, Race |
| IU-Xray | Chest X-ray | 0.5K | 2.5K | VL | Yes/No | - |
| Harvard-FairVLMed | SLO Fundus | 0.7K | 2.8K | VL | Open-ended | Age, Gender, Race |
| HAM10000 | Dermatoscopy | 1K | 2K | Classification | Multi-choice | Age, Gender |
| OL3I | Heart CT | 1K | 1K | Classification | Yes/No | Age, Gender |
| PMC-OA | Mixture | 2.5K | 13K | VL | Open-ended | - |
| OmniMedVQA | Mixture | 11K | 12K | VQA | Multi-choice | - |

There are two types of questions in CARES: (1) Closed-ended questions: two or more candidate options are provided for each question as part of the prompt, with only one being correct; we calculate accuracy by matching the chosen option in the model output. (2) Open-ended questions: these have no fixed set of possible answers and require more detailed, explanatory, or descriptive responses. They are more challenging, as the fully open setting encourages deeper analysis of medical scenarios and enables a comprehensive assessment of the model's understanding of medical knowledge; we quantify the accuracy of these responses using GPT-4.
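To illustrate the closed-ended protocol, here is a minimal, hypothetical scorer that matches the chosen option in a model's output and flags abstentions; the refusal phrases and matching rules below are assumptions for illustration, not the benchmark's actual implementation, and open-ended responses are instead graded with GPT-4 as described above.

```python
import re

# Assumed refusal phrases used to flag abstentions (illustrative only).
REFUSAL_PATTERNS = ("i cannot", "i can't", "i'm not able", "sorry")

def score_closed_ended(response, options, gold):
    """Return (is_correct, abstained) for one closed-ended question.

    `options` maps option letters (e.g. "A") to option text; `gold` is the
    letter of the correct option. A response counts as correct if it contains
    the gold letter as a standalone token or repeats the gold option text.
    Matching is deliberately simple; a real evaluator may be stricter.
    """
    text = response.strip().lower()
    if any(p in text for p in REFUSAL_PATTERNS):
        return False, True  # model abstained
    letter_hit = re.search(rf"\(?\b{re.escape(gold.lower())}\b\)?", text) is not None
    text_hit = options[gold].lower() in text
    return bool(letter_hit or text_hit), False

# Toy usage
opts = {"A": "pneumonia", "B": "no finding"}
print(score_closed_ended("The answer is (A) pneumonia.", opts, "A"))  # (True, False)
print(score_closed_ended("Sorry, I cannot answer that.", opts, "A"))  # (False, True)
```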

Statistical overview of the CARES datasets. (Left) CARES covers numerous anatomical structures, including the brain, eyes, heart, chest, etc. (Right) The medical imaging modalities involved, including the major radiological modalities, pathology, etc.

CARES: A Benchmark of Trustworthiness in Medical Vision Language Models

CARES is designed to provide a comprehensive evaluation of trustworthiness in Med-LVLMs, reflecting the issues present in model responses. We assess trustworthiness across five critical dimensions: trustfulness, fairness, safety, privacy, and robustness.

  • Trustfulness. We assess the trustfulness of Med-LVLMs, defined as the extent to which a Med-LVLM provides factual responses and recognizes when those responses may be incorrect.
    • Factuality: Med-LVLMs are susceptible to factual hallucination, wherein the model may generate incorrect or misleading information about medical conditions, including erroneous judgments regarding symptoms or diseases and inaccurate descriptions of medical images.
    • Uncertainty: A trustful Med-LVLM should produce confidence scores that accurately reflect the probability of its predictions being correct, essentially offering precise uncertainty estimation. However, as various authors have noted, LLM-based models often display overconfidence in their responses, which could lead to a significant number of misdiagnoses or erroneous diagnoses (a toy sketch of the uncertainty metrics appears after this list).
  • Fairness. Med-LVLMs have the potential to unintentionally cause health disparities, especially among underrepresented groups. These disparities can reinforce stereotypes and lead to biased medical advice. It is essential to prioritize fairness in healthcare to guarantee that every individual receives equitable and accurate medical treatment.
  • Safety. Med-LVLMs present safety concerns, which include several aspects such as jailbreaking, overcautious behavior, and toxicity.
    • Jailbreaking: Jailbreaking refers to attempts or actions that manipulate or exploit a model to deviate from its intended functions or restrictions. For Med-LVLMs, it involves prompting the model in ways that allow access to restricted information or generating responses that violate medical guidelines.
    • Overcautiousness: Overcautiousness describes how Med-LVLMs often refrain from responding to medical queries they are capable of answering. In medical settings, this excessively cautious approach can lead models to decline answering common clinical diagnostic questions.
    • Toxicity: In Med-LVLMs, toxicity refers to outputs that are harmful, such as those containing biased, offensive, or inappropriate content. The impact of toxic outputs is particularly severe in medical applications, where rude or disrespectful medical advice erodes trust in the clinical use of these models.
  • Privacy. Privacy breaches in Med-LVLMs are a critical issue due to the sensitive nature of health-related data. These models are expected to refrain from disclosing private information, such as marital status, as such disclosure can compromise both the reliability of the model and compliance with legal regulations.
  • Robustness. Robustness in Med-LVLMs concerns whether the models perform reliably across varied clinical settings. We focus on out-of-distribution (OOD) robustness, assessing the model's ability to handle test data whose distribution differs significantly from that of the training data.
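As a concrete illustration of the uncertainty metrics mentioned above, the toy sketch below computes an uncertainty accuracy and an over-confident ratio, assuming each evaluated item records whether the answer was correct and whether the model claimed to be confident; the exact elicitation protocol and metric definitions are those in the paper.

```python
def uncertainty_metrics(items):
    """Toy uncertainty scoring over dicts with keys "correct" (bool) and
    "confident" (bool, the model's self-reported confidence).

    Uncertainty accuracy: fraction of items where confidence matches correctness.
    Over-confident ratio: fraction of items where the model is wrong but confident.
    """
    n = len(items)
    aligned = sum(i["correct"] == i["confident"] for i in items)
    overconfident = sum((not i["correct"]) and i["confident"] for i in items)
    return aligned / n, overconfident / n

# Toy usage: 3 of 4 items aligned, 1 of 4 over-confident
items = [
    {"correct": True,  "confident": True},
    {"correct": False, "confident": False},
    {"correct": False, "confident": True},   # over-confident case
    {"correct": True,  "confident": True},
]
print(uncertainty_metrics(items))  # (0.75, 0.25)
```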


BibTeX


@article{xia2024cares,
    title={CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models},
    author={Xia, Peng and Chen, Ze and Tian, Juanxi and Gong, Yangrui and Hou, Ruibo and Xu, Yue and Wu, Zhenbang and Fan, Zhiyuan and Zhou, Yiyang and Zhu, Kangyu and others},
    journal={arXiv preprint arXiv:2406.06007},
    year={2024}
}
  

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaVA team for giving us access to their models and open-source projects.

Usage and License Notices: The data and code are intended and licensed for research use only. They are further restricted to uses that comply with the license agreements of CLIP, LLaMA, Vicuna, GPT-4, and LLaVA. The dataset is released under CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.