Abstract: While artificial intelligence (AI) makes continuous progress toward improving quality of care for some patients by leveraging ever-increasing amounts of digital health data, others are left behind. Empirical evaluation studies are required to keep biased AI models from reinforcing, through dangerous feedback loops, the systemic health disparities faced by minority populations. The aim of this study is to raise broad awareness of the pervasive challenges around bias and fairness in risk prediction models. We performed a case study on a MIMIC-trained benchmarking model using a broadly applicable fairness and generalizability assessment framework. While open-science benchmarks are crucial to overcoming many study limitations today, this case study revealed a strong class imbalance problem as well as fairness concerns for Black and publicly insured ICU patients. We therefore advocate for the widespread use of comprehensive fairness and performance assessment frameworks to effectively monitor and validate benchmark pipelines built on open data resources.
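To make the kind of subgroup fairness audit described above concrete, the sketch below (which is illustrative only and not the paper's actual framework or data) compares a risk model's discrimination (AUROC) across two demographic groups on synthetic data, where one imbalanced subgroup receives noisier predictions. The group names, sample sizes, and noise levels are all assumed for illustration.

```python
import numpy as np

def auroc(y_true, y_score):
    """AUROC via the Mann-Whitney U statistic: P(score_pos > score_neg),
    counting ties as 0.5."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    gt = (pos[:, None] > neg[None, :]).mean()
    eq = (pos[:, None] == neg[None, :]).mean()
    return gt + 0.5 * eq

rng = np.random.default_rng(0)
n = 2000
# Imbalanced subgroups, mimicking a minority population in the cohort.
group = rng.choice(["majority", "minority"], size=n, p=[0.8, 0.2])
y_true = rng.binomial(1, 0.3, size=n)
# Simulated model scores: noisier (less reliable) for the minority group.
noise_sd = np.where(group == "majority", 0.2, 0.6)
y_score = np.clip(y_true + rng.normal(0.0, noise_sd), 0.0, 1.0)

aucs = {g: auroc(y_true[group == g], y_score[group == g])
        for g in ("majority", "minority")}
gap = aucs["majority"] - aucs["minority"]
print(f"per-group AUROC: {aucs}, gap: {gap:.3f}")
```

A real audit along these lines would stratify by attributes such as race and insurance status, and would also examine calibration and error rates per group rather than discrimination alone.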