Binary Classification Model for Census Income Using Python Take 3

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Census Income dataset is a classic binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: This data was extracted from the 1994 Census Bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over 50K a year.
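The extraction conditions above amount to a simple row filter. Here is a minimal sketch in Python, assuming each record is a dictionary keyed by the original Census field names (AAGE, AGI, AFNLWGT, HRSWK). The function name and the sample records are hypothetical, made up for illustration, and not taken from the dataset.

```python
def is_clean_record(record):
    """Return True if the record satisfies all four extraction conditions."""
    return (
        record["AAGE"] > 16
        and record["AGI"] > 100
        and record["AFNLWGT"] > 1
        and record["HRSWK"] > 0
    )


# Hypothetical records for illustration only (not real Census rows).
sample_records = [
    {"AAGE": 39, "AGI": 42000, "AFNLWGT": 77516, "HRSWK": 40},
    {"AAGE": 15, "AGI": 5000, "AFNLWGT": 1200, "HRSWK": 10},   # fails AAGE > 16
    {"AAGE": 52, "AGI": 30000, "AFNLWGT": 20000, "HRSWK": 0},  # fails HRSWK > 0
]

clean_records = [r for r in sample_records if is_clean_record(r)]
print(len(clean_records))
```

Only records that pass every condition are kept, which mirrors the `&&` chain in the original filter expression.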

This dataset has a categorical attribute, native-country, that contains over 40 different values. In this iteration, we remove the native-country attribute and examine how its removal affects the models. We will compare the results with the baseline models from Take 1 and Take 2.
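Removing an attribute before modeling is a small preprocessing step. The sketch below uses only the Python standard library and a few made-up rows. The helper name `drop_column` and the sample values are assumptions for illustration, not the project's actual code or data.

```python
import csv
import io

# Illustrative rows only; a real run would read the full Census Income file.
raw_csv = """age,education,hours-per-week,native-country,income
39,Bachelors,40,United-States,<=50K
50,Masters,45,Canada,>50K
"""


def drop_column(rows, column):
    """Return copies of the rows without the given column."""
    return [{k: v for k, v in row.items() if k != column} for row in rows]


rows = list(csv.DictReader(io.StringIO(raw_csv)))
reduced_rows = drop_column(rows, "native-country")
print(list(reduced_rows[0].keys()))
```

With pandas, the equivalent step would typically be a call to `DataFrame.drop(columns=["native-country"])` on the loaded data frame.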

CONCLUSION: From iteration Take 1, the baseline performance of the ten algorithms achieved an average accuracy of 81.37%. Four ensemble algorithms (Bagged CART, Random Forest, AdaBoost, and Stochastic Gradient Boosting) achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved an average accuracy of 86.99%. Using the optimized tuning parameter available, the Stochastic Gradient Boosting algorithm further processed the validation dataset with an accuracy of 87.23%, which was slightly better than the accuracy of the training data.

From iteration Take 2, the baseline performance of the ten algorithms achieved an average accuracy of 81.93%. Four ensemble algorithms (Bagged CART, Random Forest, AdaBoost, and Stochastic Gradient Boosting) achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved an average accuracy of 87.31%. Using the optimized tuning parameter available, the Stochastic Gradient Boosting algorithm further processed the validation dataset with an accuracy of 87.57%, which was slightly better than the accuracy of the training data.

From this iteration (Take 3), the baseline performance of the ten algorithms achieved an average accuracy of 81.95%. Four ensemble algorithms (Bagged CART, Random Forest, AdaBoost, and Stochastic Gradient Boosting) achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved an average accuracy of 87.29%. Using the optimized tuning parameter available, the Stochastic Gradient Boosting algorithm further processed the validation dataset with an accuracy of 87.50%, which was slightly better than the accuracy of the training data. More importantly, the length of the script run-time decreased from Take 2’s 5 hours and 12 minutes down to Take 3’s 3 hours and 34 minutes. That is a reduction of 1 hour and 38 minutes, or roughly 31% of the Take 2 run-time.
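The run-time comparison can be checked with a few lines of Python. This sketch works from the hours-and-minutes figures quoted above, so a figure derived from exact timings could differ slightly. The variable names are illustrative.

```python
from datetime import timedelta

# Run times as quoted in the text, rounded to the minute.
take2_runtime = timedelta(hours=5, minutes=12)
take3_runtime = timedelta(hours=3, minutes=34)

saved = take2_runtime - take3_runtime
# Dividing one timedelta by another yields a plain float ratio.
relative_reduction = saved / take2_runtime

print(f"Time saved: {saved}")
print(f"Relative reduction: {relative_reduction:.1%}")
```

The reduction is expressed relative to the Take 2 run-time, which is the usual convention for a "time improvement" percentage.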

For this project, dropping the native-country attribute had no meaningful impact on the overall accuracy of the training model but noticeably reduced the model training time. The Stochastic Gradient Boosting ensemble algorithm continued to yield consistently top-notch training and validation results, which warrant the additional processing required by the algorithm.

Dataset Used: Census Income Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Census+Income

One potential source of performance benchmark: https://www.kaggle.com/uciml/adult-census-income

The HTML-formatted report can be found here on GitHub.