摘要:Diabetes mellitus is a hyperglycemia-like chronic condition that is a troublesome disease. It is estimated that, according to the growing morbidity, by 2040, the world will cross 642 million diabetic patients. This means that each one of the ten adults will be diabetes-affected. Diabetes can also lead to other illnesses such as heart attacks, kidney damage, and even blindness. The prediction of diabetes in advance motivates us to develop a machine learning-based model. A dataset was obtained from the online repository for this work. The obtained dataset was imbalanced. An imbalanced dataset presents a challenge that is needed to be balanced for prediction using multiple machine learning like Tomek and SMOTE. These techniques remove necessary outliers that are incomplete in the provided dataset. These outliers are also managed using the IQR method. Additionally, this research employed a two-stage model selection methodology. In the first stage, logistic regression, Support Vector Machine, k-nearest neighbors, gradient boost, Naive Bayes, and Random Forests were applied to determine the efficiency of prediction based on patients’ preconditioning. At this stage, Random Forest was found to be the best with an accuracy of 80.7% after applying SMOTE oversampling technique to balance the dataset. In the second stage, three better-performing models were used by utilizing a voting algorithm. The results were encouraging, and the model obtained 82.0% accuracy with the default dataset and 81.7% accuracy with the balanced dataset. Naive Bayes Theorem, Gradient Boosting Classifier, and Random Forest were used as inputs to the voting algorithm.