CARES

A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models

UNC-Chapel Hill · University of Illinois Urbana-Champaign · Brown University · University of Washington · Microsoft Research · UT Arlington · Monash University · Stanford University
*Equal Contribution

🔥 [NEW!] We delve into the trustworthiness of Med-LVLMs across 5 key dimensions: trustfulness, fairness, safety, privacy, and robustness, using 41K Q&A pairs spanning 16 image modalities and 27 anatomical regions.

🧐🔍 Findings: Models often exhibit factual inaccuracies and fail to maintain fairness across demographic groups; they are also vulnerable to attacks and lack privacy awareness.

Abstract

Artificial intelligence has significantly impacted medical applications, particularly with the advent of Medical Large Vision Language Models (Med-LVLMs), sparking optimism for the future of automated and personalized healthcare. However, the trustworthiness of Med-LVLMs remains unverified, posing significant risks for future model deployment.

  1. Evaluation Dimensions. We introduce CARES to Comprehensively evAluate the tRustworthinESs of Med-LVLMs across the medical domain, assessing five dimensions: trustfulness, fairness, safety, privacy, and robustness.
  2. Data Format and Scale. CARES comprises about 41K question-answer pairs in both closed and open-ended formats, covering 16 medical image modalities and 27 anatomical regions.
  3. Performance. Our analysis reveals that the models consistently exhibit concerns regarding trustworthiness, often displaying factual inaccuracies and failing to maintain fairness across different demographic groups. Furthermore, they are vulnerable to attacks and demonstrate a lack of privacy awareness.
  4. Open-source. We make the GPT-4-generated evaluation data and codebase publicly available.

🏆 Leaderboard

Scores on the CARES benchmark. Here "ACC" denotes accuracy, "OC" the over-confident ratio, "Abs" the abstention rate, "Tox" the toxicity score, and "AED" the accuracy equality difference. For the Privacy and Overcautiousness columns we report "Abs". A toy sketch of the AED computation is given after the table; for detailed results, please see the paper.

Columns are grouped by dimension (all values are percentages): Trustfulness (Factuality, Uncertainty), Fairness (Age, Gender, Race), Safety (Jailbreaking, Overcautiousness, Toxicity), Privacy (Zero-shot, Few-shot), and Robustness (Input, Semantic).

| # | Model | Institution | Factuality ACC↑ | Uncertainty ACC↑/OC↓ | Age AED↓ | Gender AED↓ | Race AED↓ | Jailbreaking ACC↑/Abs↑ | Overcautiousness Abs↓ | Toxicity Tox↓/Abs↑ | Zero-shot Abs↑ | Few-shot Abs↑ | Input ACC↑/Abs↑ | Semantic Abs↑ |
|---|-------|-------------|-----------------|----------------------|----------|-------------|-----------|------------------------|------------------------|--------------------|----------------|----------------|------------------|----------------|
| 1 | LLaVA-Med | Microsoft | 40.4 | 38.4 / 38.3 | 18.3 | 2.7 | 4.7 | 35.6 / 30.2 | 59.0 | 1.37 / 17.4 | 2.71 | 2.04 | 42.9 / 6.68 | / |
| 2 | Med-Flamingo | Stanford & Hospital Israelita Albert Einstein & Harvard | 29.0 | 33.7 / 59.1 | 11.8 | 1.6 | 4.8 | 22.5 / 0.00 | 0.00 | 1.88 / 0.35 | 0.76 | 0.65 | 37.5 / 0.00 | / |
| 3 | MedVInT | SJTU & Shanghai AI Lab | 39.3 | 32.9 / 52.9 | 19.7 | 0.8 | 2.0 | 34.1 / 0.00 | 0.00 | 1.53 / 0.04 | 0.00 | 0.00 | 57.9 / 0.00 | 0.01 |
| 4 | RadFM | SJTU & Shanghai AI Lab | 27.5 | 35.9 / 58.5 | 14.0 | 2.7 | 13.8 | 25.4 / 0.65 | 1.00 | 0.83 / 2.58 | 0.00 | 0.00 | 22.2 / 0.02 | 0.06 |
| 5 | LLaVA-v1.6 | Microsoft & UW Madison | 32.3 | 42.5 / 44.7 | 19.7 | 1.9 | 6.4 | 29.4 / 1.13 | 3.67 | 13.0 / 5.18 | 14.0 | 13.2 | / | / |
| 6 | Qwen-VL-Chat | Alibaba | 33.8 | 50.7 / 17.0 | 16.1 | 1.0 | 3.1 | 31.1 / 5.36 | 2.67 | 1.69 / 7.26 | 10.4 | 9.82 | / | / |
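As a reference for the fairness columns, below is a minimal sketch of how an accuracy equality difference (AED) could be computed, assuming AED is taken as the gap between the highest and lowest per-group accuracy; the group labels and helper function are illustrative rather than the benchmark's actual implementation, and the paper should be consulted for the exact definition.

```python
from collections import defaultdict

def accuracy_equality_difference(records):
    """Illustrative AED: gap between the best and worst per-group accuracy.

    `records` is a list of dicts with keys "group" (e.g. an age bracket,
    gender, or race label) and "correct" (bool).
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["group"]] += 1
        hits[r["group"]] += int(r["correct"])
    per_group = {g: hits[g] / totals[g] for g in totals}
    return max(per_group.values()) - min(per_group.values())

# Toy usage: 0.8 accuracy for one group vs 0.6 for another -> AED = 0.2
example = (
    [{"group": "male", "correct": c} for c in [True] * 8 + [False] * 2]
    + [{"group": "female", "correct": c} for c in [True] * 6 + [False] * 4]
)
print(accuracy_equality_difference(example))  # 0.2
```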

CARES Datasets

We utilize open-source medical vision-language datasets and medical image classification datasets to construct the CARES benchmark, which together cover a wide range of medical image modalities and body parts. This diversity ensures richness in question formats and yields coverage of 16 medical image modalities and 27 human anatomical structures.

| Data Source | Data Modality | # Images | # QAs | Dataset Type | Answer Type | Demography |
|-------------|---------------|----------|-------|--------------|-------------|------------|
| MIMIC-CXR | Chest X-ray | 1.9K | 10.3K | VL | Open-ended | Age, Gender, Race |
| IU-Xray | Chest X-ray | 0.5K | 2.5K | VL | Yes/No | - |
| Harvard-FairVLMed | SLO Fundus | 0.7K | 2.8K | VL | Open-ended | Age, Gender, Race |
| HAM10000 | Dermatoscopy | 1K | 2K | Classification | Multi-choice | Age, Gender |
| OL3I | Heart CT | 1K | 1K | Classification | Yes/No | Age, Gender |
| PMC-OA | Mixture | 2.5K | 13K | VL | Open-ended | - |
| OmniMedVQA | Mixture | 11K | 12K | VQA | Multi-choice | - |

There are two types of questions in CARES: (1) Closed-ended questions: two or more candidate options are provided for each question as part of the prompt, with only one being correct; we calculate accuracy by matching the chosen option in the model output. (2) Open-ended questions: these have no fixed set of possible answers and require more detailed, explanatory, or descriptive responses. They are more challenging, as the fully open setting encourages deeper analysis of medical scenarios and enables a comprehensive assessment of the model's understanding of medical knowledge; we quantify the accuracy of these responses using GPT-4.
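To illustrate the closed-ended protocol, here is a minimal, hypothetical scorer that matches the chosen option in a model's output and flags abstentions; the refusal phrases and matching rules below are assumptions for illustration, not the benchmark's actual implementation, and open-ended responses are instead graded with GPT-4 as described above.

```python
import re

# Assumed refusal phrases used to flag abstentions (illustrative only).
REFUSAL_PATTERNS = ("i cannot", "i can't", "i'm not able", "sorry")

def score_closed_ended(response, options, gold):
    """Return (is_correct, abstained) for one closed-ended question.

    `options` maps option letters (e.g. "A") to option text; `gold` is the
    letter of the correct option. A response counts as correct if it contains
    the gold letter as a standalone token or repeats the gold option text.
    Matching is deliberately simple; a real evaluator may be stricter.
    """
    text = response.strip().lower()
    if any(p in text for p in REFUSAL_PATTERNS):
        return False, True  # model abstained
    letter_hit = re.search(rf"\(?\b{re.escape(gold.lower())}\b\)?", text) is not None
    text_hit = options[gold].lower() in text
    return bool(letter_hit or text_hit), False

# Toy usage
opts = {"A": "pneumonia", "B": "no finding"}
print(score_closed_ended("The answer is (A) pneumonia.", opts, "A"))  # (True, False)
print(score_closed_ended("Sorry, I cannot answer that.", opts, "A"))  # (False, True)
```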

Statistical overview of the CARES datasets. (Left) CARES covers numerous anatomical structures, including the brain, eyes, heart, chest, etc. (Right) The medical imaging modalities involved, including the major radiological modalities, pathology, etc.

CARES: A Benchmark of Trustworthiness in Medical Vision Language Models

CARES is designed to provide a comprehensive evaluation of trustworthiness in Med-LVLMs, reflecting the issues present in model responses. We assess trustworthiness across five critical dimensions: trustfulness, fairness, safety, privacy, and robustness.

  • Trustfulness. We assess the trustfulness of Med-LVLMs, defined as the extent to which a Med-LVLM provides factual responses and recognizes when those responses may be incorrect.
    • Factuality: Med-LVLMs are susceptible to factual hallucination, wherein the model may generate incorrect or misleading information about medical conditions, including erroneous judgments regarding symptoms or diseases and inaccurate descriptions of medical images.
    • Uncertainty: A trustful Med-LVLM should produce confidence scores that accurately reflect the probability of its predictions being correct, essentially offering precise uncertainty estimation. However, as various authors have noted, LLM-based models often display overconfidence in their responses, which could lead to a significant number of misdiagnoses or erroneous diagnoses (a toy sketch of the uncertainty metrics appears after this list).
  • Fairness. Med-LVLMs have the potential to unintentionally cause health disparities, especially among underrepresented groups. These disparities can reinforce stereotypes and lead to biased medical advice. It is essential to prioritize fairness in healthcare to guarantee that every individual receives equitable and accurate medical treatment.
  • Safety. Med-LVLMs present safety concerns, which include several aspects such as jailbreaking, overcautious behavior, and toxicity.
    • Jailbreaking: Jailbreaking refers to attempts or actions that manipulate or exploit a model to deviate from its intended functions or restrictions. For Med-LVLMs, it involves prompting the model in ways that allow access to restricted information or generating responses that violate medical guidelines.
    • Overcautiousness: Overcautiousness describes how Med-LVLMs often refrain from responding to medical queries they are capable of answering. In medical settings, this excessively cautious approach can lead models to decline answering common clinical diagnostic questions.
    • Toxicity: In Med-LVLMs, toxicity refers to outputs that are harmful, such as those containing biased, offensive, or inappropriate content. The impact of toxic outputs is particularly severe in medical applications, where rude or disrespectful medical advice erodes trust in the clinical use of these models.
  • Privacy. Privacy breaches in Med-LVLMs are a critical issue due to the sensitive nature of health-related data. These models are expected to refrain from disclosing private information, such as marital status, as such disclosure can compromise both the reliability of the model and compliance with legal regulations.
  • Robustness. Robustness in Med-LVLMs concerns whether the models perform reliably across varied clinical settings. We focus on out-of-distribution (OOD) robustness, assessing the model's ability to handle test data whose distribution differs significantly from that of the training data.
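As a concrete illustration of the uncertainty metrics mentioned above, the toy sketch below computes an uncertainty accuracy and an over-confident ratio, assuming each evaluated item records whether the answer was correct and whether the model claimed to be confident; the exact elicitation protocol and metric definitions are those in the paper.

```python
def uncertainty_metrics(items):
    """Toy uncertainty scoring over dicts with keys "correct" (bool) and
    "confident" (bool, the model's self-reported confidence).

    Uncertainty accuracy: fraction of items where confidence matches correctness.
    Over-confident ratio: fraction of items where the model is wrong but confident.
    """
    n = len(items)
    aligned = sum(i["correct"] == i["confident"] for i in items)
    overconfident = sum((not i["correct"]) and i["confident"] for i in items)
    return aligned / n, overconfident / n

# Toy usage: 3 of 4 items aligned, 1 of 4 over-confident
items = [
    {"correct": True,  "confident": True},
    {"correct": False, "confident": False},
    {"correct": False, "confident": True},   # over-confident case
    {"correct": True,  "confident": True},
]
print(uncertainty_metrics(items))  # (0.75, 0.25)
```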


BibTeX


@article{xia2024cares,
    title={CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models},
    author={Xia, Peng and Chen, Ze and Tian, Juanxi and Gong, Yangrui and Hou, Ruibo and Xu, Yue and Wu, Zhenbang and Fan, Zhiyuan and Zhou, Yiyang and Zhu, Kangyu and others},
    journal={arXiv preprint arXiv:2406.06007},
    year={2024}
}
  

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaVA team for giving us access to their models and open-source projects.

Usage and License Notices: The data and code are intended and licensed for research use only. They are further restricted to uses that comply with the license agreements of CLIP, LLaMA, Vicuna, GPT-4, and LLaVA. The dataset is released under CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.