Binary Classification Model for Census Income Using R Take 1

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Census Income dataset is a classic binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: This data was extracted from the 1994 Census Bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over 50K a year.

This dataset has many cells with missing values, so we will examine the models by deleting the rows with missing cells. This iteration of the project will produce a set of baseline results that we can use to compare with other data cleaning methods.

CONCLUSION: The baseline performance of the ten algorithms achieved an average accuracy of 83.79%. Four ensemble algorithms (Bagged CART, Random Forest, and Stochastic Gradient Boosting) achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved an average accuracy of 86.27%. Using the optimized tuning parameter available, the Stochastic Gradient Boosting algorithm further processed the validation dataset with an accuracy of 86.29%, which was on par with the accuracy of the training data.

For this project, the Stochastic Gradient Boosting ensemble algorithm yielded consistently top-notch training and validation results, which warrant the additional processing required by the algorithm.

Dataset Used: Census Income Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference:

One potential source of performance benchmark:

The HTML formatted report can be found here on GitHub.