Abstract: Background: Breast Cancer (BC) is a known global crisis. The World Health Organization reported 2.09 million cases and 627,000 deaths worldwide relating to BC in 2018. The traditional BC screening method in developed countries is mammography, whilst developing countries employ breast self-examination and clinical breast examination. The gold standard for BC detection is triple assessment: i) clinical examination; ii) mammography and/or ultrasonography; and iii) Fine Needle Aspirate Cytology. However, the introduction of cheaper, efficient and non-invasive methods of BC screening and detection would be beneficial. Design and methods: We propose to create models for BC prediction from blood test results, using the BC Coimbra Dataset (BCCD) from the University of California Irvine online database and eight machine learning algorithms: i) Logistic Regression; ii) Support Vector Machine; iii) K-Nearest Neighbors; iv) Decision Tree; v) Random Forest; vi) Adaptive Boosting; vii) Gradient Boosting; and viii) eXtreme Gradient Boosting. To ensure the models' robustness, we will employ: i) Stratified k-fold Cross-Validation; ii) Correlation-based Feature Selection (CFS); and iii) parameter tuning. The models will be validated on the validation and test sets of BCCD using both the full and the reduced feature sets, since feature reduction can affect algorithm performance. Seven metrics, including accuracy, will be used for model evaluation. Expected impact of the study for public health: The CFS, together with the highest performing model(s), can serve to identify specific blood tests that point towards BC and may serve as important BC biomarkers. The highest performing model(s) may eventually be used to create an Artificial Intelligence tool to assist clinicians in BC screening and detection.
Keywords: Breast cancer; cancer screening; biomarkers; machine learning; blood tests
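The Stratified k-fold Cross-Validation step named in the abstract can be illustrated with a minimal sketch. It is not the study's implementation: the function name `stratified_k_fold`, the seed, and the toy labels are hypothetical, and the round-robin dealing is only one simple way to keep class ratios similar across folds. In practice a library routine such as scikit-learn's `StratifiedKFold` would normally be used instead.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs whose class ratios mirror the full set.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle within each class, then deal indices round-robin into k folds
    # so that every fold receives a near-equal share of each class.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    # Each fold serves once as the held-out set; the rest form the training set.
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for f, fold in enumerate(folds) if f != i for idx in fold
        )
        yield train_idx, test_idx


# Hypothetical toy labels (1 = patient, 0 = control); not taken from BCCD.
labels = [1] * 12 + [0] * 8
splits = list(stratified_k_fold(labels, k=4))

for train_idx, test_idx in splits:
    positives = sum(labels[i] for i in test_idx)
    print(f"test size={len(test_idx)}, positives={positives}")
```

With these toy labels every held-out fold keeps the 12:8 class ratio of the full set, which is the property stratification is meant to preserve when a dataset is small or imbalanced.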