Simple Classification Model for Diabetes Prediction Using R

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

For more information on this case study project, please consult Dr. Brownlee’s blog post at

Dataset Used: Pima Indians Diabetes Database

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference:

For more information on performance benchmarks, please consult:

INTRODUCTION: The Pima Indians Diabetes Dataset is used to predict, from medical details, whether a Pima Indian patient will develop diabetes within five years. It is a binary (2-class) classification problem with 768 observations, 8 input variables, and 1 output variable. Missing values are believed to be encoded as zeros.
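Because zeros are believed to stand in for missing values, one common first step is to recode them as NA before modeling. The following is a minimal sketch in base R, not the procedure used in this project. The rows are made up for illustration, the column names are assumed (they follow the naming used by the mlbench package), and the choice of which columns treat zero as missing is also an assumption.

```r
# Toy rows for illustration only; these are NOT records from the real dataset.
# Column names are assumed (they follow the mlbench::PimaIndiansDiabetes naming).
pima <- data.frame(
  pregnant = c(6, 1, 8, 0),
  glucose  = c(148, 85, 0, 137),
  pressure = c(72, 66, 64, 0),
  mass     = c(33.6, 0, 23.3, 43.1),
  diabetes = factor(c("pos", "neg", "pos", "neg"))
)

# Assumption: a zero is physiologically implausible in these columns,
# so it is treated as a missing value. Zero pregnancies is valid and is kept.
zero_as_missing <- c("glucose", "pressure", "mass")

pima_clean <- pima
for (col in zero_as_missing) {
  pima_clean[[col]][pima_clean[[col]] == 0] <- NA
}

# Count missing values per column after recoding.
missing_counts <- colSums(is.na(pima_clean))
print(missing_counts)
```

Recoding to NA makes the gaps explicit, so they can then be imputed or dropped deliberately rather than being silently treated as real measurements.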

CONCLUSION: The baseline performance in predicting the class variable was an average accuracy of 75.85%. After a series of tuning trials, the top accuracy was 77.73%, achieved with Logistic Regression. In this case, the ensemble algorithms did not improve on the non-ensemble algorithms enough to justify the additional processing required.
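For readers who want to see how an average accuracy figure of this kind can be produced, here is a hedged sketch of k-fold cross-validated accuracy for logistic regression in base R. It runs on simulated stand-in data, not the Pima dataset, so it will not reproduce the percentages above. The resampling scheme of this project is not stated here; 10 folds, the 0.5 decision threshold, the seed, and all variable names are assumptions made for illustration.

```r
set.seed(7)  # arbitrary seed so the fold assignment is reproducible

# Simulated stand-in data: two numeric predictors and a binary outcome.
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- factor(ifelse(x1 + 0.5 * x2 + rnorm(n) > 0, "pos", "neg"))
dat <- data.frame(x1 = x1, x2 = x2, y = y)

# Assign each row to one of k folds at random (k = 10 is an assumption).
k <- 10
fold <- sample(rep(seq_len(k), length.out = n))

fold_accuracy <- numeric(k)
for (i in seq_len(k)) {
  train <- dat[fold != i, ]
  test  <- dat[fold == i, ]

  # Logistic regression; with a factor response, glm models the
  # probability of the second level ("pos").
  fit <- glm(y ~ x1 + x2, data = train, family = binomial)

  # Predicted probabilities on the held-out fold, thresholded at 0.5.
  prob <- predict(fit, newdata = test, type = "response")
  pred <- ifelse(prob > 0.5, "pos", "neg")
  fold_accuracy[i] <- mean(pred == test$y)
}

# Average the per-fold accuracies into a single estimate.
mean_accuracy <- mean(fold_accuracy)
cat(sprintf("Mean %d-fold accuracy: %.2f%%\n", k, 100 * mean_accuracy))
```

Averaging over held-out folds gives an estimate that is less sensitive to any single train/test split, which is why comparisons between algorithms are usually made on this kind of averaged figure.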

The HTML formatted report can be found here on GitHub.