
Basic Article Information

  • Title: Implicit data crimes: Machine learning bias arising from misuse of public data
  • Authors: Efrat Shimron; Jonathan I. Tamir; Ke Wang
  • Journal: Proceedings of the National Academy of Sciences
  • Print ISSN: 0027-8424
  • Electronic ISSN: 1091-6490
  • Year: 2022
  • Volume: 119
  • Issue: 13
  • DOI:10.1073/pnas.2117203119
  • Language: English
  • Publisher: The National Academy of Sciences of the United States of America
  • Significance: Public databases are an important resource for machine learning research, but their growing availability sometimes leads to “off-label” usage, where data published for one task are used for another. This work reveals that such off-label usage can lead to biased, overly optimistic results for machine-learning algorithms. The underlying cause is that public data are processed with hidden pipelines that alter the data features. Here we study three well-known algorithms developed for image reconstruction from magnetic resonance imaging measurements and show that they can produce biased results, with up to 48% artificial improvement, when applied to public databases. We refer to the publication of such results as implicit “data crimes,” to raise community awareness of this growing big data problem.
  • Abstract: Although open databases are an important resource in the current deep learning (DL) era, they are sometimes used “off label”: data published for one task are used to train algorithms for a different one. This work aims to highlight that this common practice may lead to biased, overly optimistic results. We demonstrate this phenomenon for inverse problem solvers and show how their biased performance stems from hidden data-processing pipelines. We describe two processing pipelines typical of open-access databases and study their effects on three well-established algorithms developed for MRI reconstruction: compressed sensing, dictionary learning, and DL. Our results demonstrate that all these algorithms yield systematically biased results when they are naively trained on seemingly appropriate data: the normalized root-mean-square error (NRMSE) improves consistently with the extent of data processing, showing an artificial improvement of 25 to 48% in some cases. Because this phenomenon is not widely known, biased results are sometimes published as state of the art; we refer to this as an implicit “data crime.” This work hence aims to raise awareness regarding naive off-label usage of big data and to reveal the vulnerability of modern inverse problem solvers to the resulting bias.
  • Keywords: data crimes; inverse problem; big data; MRI; bias
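The evaluation bias described in the abstract can be illustrated with a toy sketch. This is not the paper's actual pipelines or reconstruction algorithms: it uses a naive zero-filled inverse FFT as a stand-in solver, a random texture as a stand-in for measured MRI data, and a low-pass filter as a stand-in for a hidden processing step; all names and parameters are illustrative assumptions. The point it demonstrates is the one the abstract makes: the same solver scores a much better (lower) NRMSE when the "ground truth" has already been smoothed by hidden processing.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64

def nrmse(ref, rec):
    # normalized root-mean-square error, the metric cited in the abstract
    return np.linalg.norm(rec - ref) / np.linalg.norm(ref)

def zero_filled_recon(img, mask):
    # toy inverse-problem "solver": undersample k-space, zero-fill, inverse FFT
    return np.abs(np.fft.ifft2(np.fft.fft2(img) * mask))

# variable-density sampling mask: fully keep the k-space centre, 10% elsewhere
f = np.fft.fftfreq(N)
R = np.sqrt(f[:, None] ** 2 + f[None, :] ** 2)
mask = (R < 0.1) | (rng.random((N, N)) < 0.1)

# "raw" data: a random texture standing in for measured MRI data (hypothetical)
raw = rng.random((N, N))

# "processed" data: low-pass filtered, mimicking a hidden pipeline step
processed = np.abs(np.fft.ifft2(np.fft.fft2(raw) * (R < 0.15)))

nrmse_raw = nrmse(raw, zero_filled_recon(raw, mask))
nrmse_processed = nrmse(processed, zero_filled_recon(processed, mask))

# the processed data yields an artificially better (lower) NRMSE
print(f"raw: {nrmse_raw:.3f}  processed: {nrmse_processed:.3f}")
```

Because the hidden low-pass step removes exactly the high frequencies that the undersampled reconstruction struggles to recover, the metric improves without any improvement in the solver itself, which is the mechanism behind the 25 to 48% artificial gains reported in the paper.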
© National Center for Philosophy and Social Sciences Documentation. All rights reserved.