Artificial intelligence has significantly impacted medical applications, particularly with the advent of Medical Large Vision Language Models (Med-LVLMs), sparking optimism for the future of automated and personalized healthcare. However, the trustworthiness of Med-LVLMs remains unverified, posing significant risks for future model deployment.
Scores on the CARES benchmark. Here "ACC": Accuracy; "OC": Over-Confident ratio; "Abs": Abstention rate; "Tox": Toxicity score; "AED": Accuracy Equality Difference. Trustfulness covers the Factuality and Uncertainty columns; Fairness the Age, Gender, and Race AED columns; Safety the Jailbreaking, Overcautiousness, and Toxicity columns; Privacy the Zero-shot and Few-shot columns; Robustness the Input and Semantic columns. "Abs" is the metric reported for Privacy and Overcautiousness, and "/" marks results that are not reported. For detailed results, please see the paper.
| # | Model | Factuality ACC↑ | Uncertainty ACC↑ / OC↓ | Age AED↓ | Gender AED↓ | Race AED↓ | Jailbreaking ACC↑ / Abs↑ | Overcautiousness↓ | Toxicity Tox↓ / Abs↑ | Zero-shot↑ | Few-shot↑ | Input ACC↑ / Abs↑ | Semantic Abs↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | LLaVA-Med (Microsoft) | 40.4 | 38.4 / 38.3 | 18.3 | 2.7 | 4.7 | 35.6 / 30.2 | 59.0 | 1.37 / 17.4 | 2.71 | 2.04 | 42.9 / 6.68 | / |
| 2 | Med-Flamingo (Stanford & Hospital Israelita Albert Einstein & Harvard) | 29.0 | 33.7 / 59.1 | 11.8 | 1.6 | 4.8 | 22.5 / 0.00 | 0.00 | 1.88 / 0.35 | 0.76 | 0.65 | 37.5 / 0.00 | / |
| 3 | MedVInT (SJTU & Shanghai AI Lab) | 39.3 | 32.9 / 52.9 | 19.7 | 0.8 | 2.0 | 34.1 / 0.00 | 0.00 | 1.53 / 0.04 | 0.00 | 0.00 | 57.9 / 0.00 | 0.01 |
| 4 | RadFM (SJTU & Shanghai AI Lab) | 27.5 | 35.9 / 58.5 | 14.0 | 2.7 | 13.8 | 25.4 / 0.65 | 1.00 | 0.83 / 2.58 | 0.00 | 0.00 | 22.2 / 0.02 | 0.06 |
| 5 | LLaVA-v1.6 (Microsoft & UW Madison) | 32.3 | 42.5 / 44.7 | 19.7 | 1.9 | 6.4 | 29.4 / 1.13 | 3.67 | 13.0 / 5.18 | 14.0 | 13.2 | / | / |
| 6 | Qwen-VL-Chat (Alibaba) | 33.8 | 50.7 / 17.0 | 16.1 | 1.0 | 3.1 | 31.1 / 5.36 | 2.67 | 1.69 / 7.26 | 10.4 | 9.82 | / | / |
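For reference, the AED columns compare per-group accuracy along a demographic attribute. Below is a minimal Python sketch of one common formulation, assuming AED is the gap between the best- and worst-performing subgroups; the exact aggregation used by CARES may differ, and the `records` structure is hypothetical.

```python
from collections import defaultdict

def accuracy_equality_difference(records):
    """Accuracy Equality Difference (AED) across demographic groups.

    `records`: list of {"group": str, "correct": bool} dicts (hypothetical
    layout). Taking the max-minus-min gap in per-group accuracy is an
    assumption for illustration; CARES may aggregate differently.
    """
    stats = defaultdict(lambda: [0, 0])  # group -> [num_correct, num_total]
    for r in records:
        stats[r["group"]][0] += int(r["correct"])
        stats[r["group"]][1] += 1
    accuracies = [correct / total for correct, total in stats.values()]
    return 100.0 * (max(accuracies) - min(accuracies))

# Example: a 50-point accuracy gap between two age groups.
records = [
    {"group": "18-40", "correct": True},
    {"group": "18-40", "correct": True},
    {"group": "65+", "correct": True},
    {"group": "65+", "correct": False},
]
print(accuracy_equality_difference(records))  # 50.0
```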
We construct the CARES benchmark from open-source medical vision-language datasets and image classification datasets that cover a wide range of medical image modalities and body parts. This diversity ensures rich question formats and yields coverage of 16 medical image modalities and 27 human anatomical structures. The classification-type sources are posed as multi-choice or yes/no questions (see the sketch after the table).
| Data Source | Data Modality | # Images | # QAs | Dataset Type | Answer Type | Demography |
|---|---|---|---|---|---|---|
| MIMIC-CXR | Chest X-ray | 1.9K | 10.3K | VL | Open-ended | Age, Gender, Race |
| IU-Xray | Chest X-ray | 0.5K | 2.5K | VL | Yes/No | - |
| Harvard-FairVLMed | SLO Fundus | 0.7K | 2.8K | VL | Open-ended | Age, Gender, Race |
| HAM10000 | Dermatoscopy | 1K | 2K | Classification | Multi-choice | Age, Gender |
| OL3I | Heart CT | 1K | 1K | Classification | Yes/No | Age, Gender |
| PMC-OA | Mixture | 2.5K | 13K | VL | Open-ended | - |
| OmniMedVQA | Mixture | 11K | 12K | VQA | Multi-choice | - |
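The classification-type sources above ship image labels rather than QA pairs, so their labels are presumably rewritten into multi-choice or yes/no questions via templates. A hypothetical sketch of such a conversion follows; the template wording, `label_to_multichoice` helper, and field names are our assumptions, not the benchmark's actual construction code.

```python
import random

def label_to_multichoice(image_id, label, label_space, k=4, seed=0):
    """Turn a classification label into a multi-choice VQA item.

    Hypothetical template and option sampling for illustration only;
    CARES' actual QA construction may differ.
    """
    rng = random.Random(seed)
    # Sample k-1 wrong labels as distractors, then shuffle in the answer.
    distractors = rng.sample([l for l in label_space if l != label], k - 1)
    options = distractors + [label]
    rng.shuffle(options)
    lettered = " ".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))
    return {
        "image": image_id,
        "question": f"What diagnosis best matches this image? {lettered}",
        "answer": label,
    }

# Example with HAM10000-style skin-lesion labels.
print(label_to_multichoice(
    "ham_0001", "melanoma",
    ["melanoma", "nevus", "dermatofibroma",
     "basal cell carcinoma", "vascular lesion"]))
```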
There are two types of questions in CARES. (1) Closed-ended questions: two or more candidate options are provided in the prompt for each question, with only one being correct; we compute accuracy by matching the option in the model output. (2) Open-ended questions: these have no fixed set of possible answers and require more detailed, explanatory, or descriptive responses. They are more challenging, as the fully open setting encourages deeper analysis of medical scenarios and enables a comprehensive assessment of the model's understanding of medical knowledge. We quantify the accuracy of open-ended responses using GPT-4.
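To make the closed-ended protocol concrete, here is a minimal sketch of option-matching accuracy. The matching heuristic, `extract_choice` helper, and record fields are our assumptions for illustration, not the benchmark's exact implementation; open-ended responses are scored by GPT-4 instead.

```python
import re

def extract_choice(response, options):
    """Pick the candidate option mentioned in a model response.

    Trying an explicit option letter first, then falling back to the
    option text, is an assumed matching rule; the official CARES
    evaluation code may use a different one.
    """
    m = re.search(r"\b([A-Z])\b", response)
    if m and ord(m.group(1)) - ord("A") < len(options):
        return options[ord(m.group(1)) - ord("A")]
    for opt in options:  # fall back to case-insensitive text match
        if opt.lower() in response.lower():
            return opt
    return None

def closed_ended_accuracy(samples):
    """samples: list of {"response", "options", "answer"} dicts."""
    hits = sum(
        extract_choice(s["response"], s["options"]) == s["answer"]
        for s in samples
    )
    return 100.0 * hits / len(samples)

# Example with a single yes/no question.
samples = [{"response": "B. No, there is no pneumothorax.",
            "options": ["Yes", "No"], "answer": "No"}]
print(closed_ended_accuracy(samples))  # 100.0
```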
CARES is designed to provide a comprehensive evaluation of trustworthiness in Med-LVLMs, reflecting the issues present in model responses. We assess trustworthiness across five critical dimensions: trustfulness, fairness, safety, privacy, and robustness.
@article{xia2024cares,
title={CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models},
author={Xia, Peng and Chen, Ze and Tian, Juanxi and Gong, Yangrui and Hou, Ruibo and Xu, Yue and Wu, Zhenbang and Fan, Zhiyuan and Zhou, Yiyang and Zhu, Kangyu and others},
journal={arXiv preprint arXiv:2406.06007},
year={2024}
}
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaVA team for giving us access to their models and open-source projects.
Usage and License Notices: The data and code are intended and licensed for research use only. They are additionally restricted to uses that follow the license agreements of CLIP, LLaMA, Vicuna, GPT-4, and LLaVA. The dataset is licensed CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.