Binary Classification Model for Census Income Using R Take 3

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Census Income dataset is a classic binary classification situation where we are trying to predict one of two possible outcomes.

INTRODUCTION: This data was extracted from the 1994 Census Bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over 50K a year.

This dataset has a categorical attribute, native-country, that contains over 40 distinct values. We will remove the native-country attribute and examine what impact its removal has on the modeling. This iteration of the project will produce a set of results that we will compare with the baseline models from Take 1 and Take 2.
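The attribute-removal step described above can be sketched in a few lines of R. This is a minimal illustration on a toy data frame; the column name "native_country" is an assumption about how the attribute was named after loading, so adjust it to match the actual dataset.

```r
# Minimal sketch: dropping the native-country attribute before modeling.
# The data frame below is a toy stand-in for the loaded Census Income data;
# the column name "native_country" is an assumed naming convention.
adult <- data.frame(
  age            = c(39, 50, 38),
  workclass      = c("State-gov", "Self-emp-not-inc", "Private"),
  native_country = c("United-States", "United-States", "Mexico"),
  income         = c("<=50K", "<=50K", ">50K")
)

# Remove the native_country column in place
adult$native_country <- NULL

names(adult)  # "age" "workclass" "income"
```

With fewer levels to one-hot encode or split on, tree-based learners generally have less work to do per split, which is consistent with the training-time improvement reported below.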

CONCLUSION: From iteration Take 1, the baseline performance of the ten algorithms achieved an average accuracy of 83.79%. Three ensemble algorithms (Bagged CART, Random Forest, and Stochastic Gradient Boosting) achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data, with an average accuracy of 86.27%. Using the optimized tuning parameters, the Stochastic Gradient Boosting algorithm processed the validation dataset with an accuracy of 86.29%, which was on par with the accuracy from the training data.

From the previous iteration (Take 2), the baseline performance of the ten algorithms achieved an average accuracy of 84.19%. The same three ensemble algorithms (Bagged CART, Random Forest, and Stochastic Gradient Boosting) achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting again turned in the top result using the training data, with an average accuracy of 86.60%. Using the optimized tuning parameters, the Stochastic Gradient Boosting algorithm processed the validation dataset with an accuracy of 86.93%, which was slightly better than the accuracy from the training data.

From this iteration (Take 3), the baseline performance of the ten algorithms achieved an average accuracy of 84.37%. Once again, the three ensemble algorithms (Bagged CART, Random Forest, and Stochastic Gradient Boosting) achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data, with an average accuracy of 86.60%. Using the optimized tuning parameters, the Stochastic Gradient Boosting algorithm processed the validation dataset with an accuracy of 87.00%, which was slightly better than the accuracy from the training data. More importantly, the script run-time decreased from Take 2's 24 hours and 33 minutes to Take 3's 14 hours and 2 minutes, a reduction of 42%.
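The tuning trials described above might look something like the following caret sketch. The tuning grid values, resampling scheme, seed, and dataset are illustrative assumptions, not the project's actual configuration; a small built-in dataset stands in for the prepared Census Income training set.

```r
# Hedged sketch: tuning Stochastic Gradient Boosting (method = "gbm")
# with caret. Grid values and resampling settings are assumptions.
library(caret)

# Illustrative two-class stand-in for the prepared training data
data(iris)
training <- iris[iris$Species != "setosa", ]
training$Species <- factor(training$Species)

# 10-fold cross-validation, repeated 3 times
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# Candidate gbm hyperparameters to try
grid <- expand.grid(
  n.trees           = c(100, 200),
  interaction.depth = c(1, 2),
  shrinkage         = 0.1,
  n.minobsinnode    = 10
)

set.seed(888)
fit_gbm <- train(Species ~ ., data = training, method = "gbm",
                 trControl = control, tuneGrid = grid, verbose = FALSE)

# Best parameter combination found by cross-validated accuracy
print(fit_gbm$bestTune)
```

Validation would then follow the usual caret pattern of calling `predict()` on the held-out set and summarizing the result with `confusionMatrix()`.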

For this project, dropping the native-country attribute had no impact on the overall accuracy of the training model but contributed to a noticeable improvement in the model training time. The Stochastic Gradient Boosting ensemble algorithm continued to yield consistently top-notch training and validation results, which warrant the additional processing required by the algorithm.

Dataset Used: Census Income Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Census+Income

One potential source of performance benchmark: https://www.kaggle.com/uciml/adult-census-income

The HTML-formatted report can be found here on GitHub.