Scores on CARES benchmark. Here "ACC": Accuracy, "OC": Over-Confident ratio, "Abs": Abstention rate, "Tox": Toxicity score, "AED": Accuracy Equality Difference. We report "Abs" in Privacy and Overcautiousness. To view detailed results, please see the paper.
# | Model | Trustfulness (%) | Fairness (%) | Safety (%) | Privacy (%) | Robustness (%) | |||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Factuality ACCâ |
Uncertainty ACCâ / OCâ |
Age AEDâ |
Gender AEDâ |
Race AEDâ |
Jailbreaking ACCâ/Absâ |
Overcaut iousnessâ |
Toxicity Toxâ/Absâ |
Zero-shotâ | Few-shotâ | Input ACCâ/Absâ |
Semantic Absâ |
||
1 |
LLaVA-Med
Microsoft |
40.4 | 38.4 / 38.3 | 18.3 | 2.7 | 4.7 | 35.6 / 30.2 | 59.0 | 1.37 / 17.4 | 2.71 | 2.04 | 42.9 / 6.68 | / |
2 | Med-Flamingo
Stanford & Hospital Israelita Albert Einstein & Harvard |
29.0 | 33.7 / 59.1 | 11.8 | 1.6 | 4.8 | 22.5 / 0.00 | 0.00 | 1.88 / 0.35 | 0.76 | 0.65 | 37.5 / 0.00 | / |
3 |
MedVInT
SJTU & Shanghai AI Lab |
39.3 | 32.9 / 52.9 | 19.7 | 0.8 | 2.0 | 34.1 / 0.00 | 0.00 | 1.53 / 0.04 | 0.00 | 0.00 | 57.9 / 0.00 | 0.01 |
4 | RadFM
SJTU & Shanghai AI Lab |
27.5 | 35.9 / 58.5 | 14.0 | 2.7 | 13.8 | 25.4 / 0.65 | 1.00 | 0.83 / 2.58 | 0.00 | 0.00 | 22.2 / 0.02 | 0.06 |
5 | LLaVA-v1.6
Microsoft & UW Madison |
32.3 | 42.5 / 44.7 | 19.7 | 1.9 | 6.4 | 29.4 / 1.13 | 3.67 | 13.0 / 5.18 | 14.0 | 13.2 | / | / |
6 | Qwen-VL-Chat
Alibaba |
33.8 | 50.7 / 17.0 | 16.1 | 1.0 | 3.1 | 31.1 / 5.36 | 2.67 | 1.69 / 7.26 | 10.4 | 9.82 | / | / |