Binary Classification Model for Census Income Using Python, Take 4

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Census Income dataset is a classic binary classification problem where we try to predict one of two possible outcomes.

INTRODUCTION: This data was extracted from the 1994 Census Bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over 50K a year.
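The extraction conditions use internal Census variable names: AAGE maps to age, AFNLWGT to fnlwgt, and HRSWK to hours-per-week, while AGI (adjusted gross income) is not among the columns in the released file. A minimal pandas sketch of the analogous filter on the public adult.data file, assuming the usual column layout, might look like this:

```python
import pandas as pd

# Assumed column names for the UCI adult.data file (the raw file has no header row)
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'sex',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
           'income']
df = pd.read_csv('adult.data', header=None, names=columns,
                 skipinitialspace=True, na_values='?')

# Analogue of (AAGE>16) && (AFNLWGT>1) && (HRSWK>0); AGI is not in the public file
clean = df[(df['age'] > 16) & (df['fnlwgt'] > 1) & (df['hours-per-week'] > 0)]
print(clean.shape)
```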

This dataset includes a continuous numeric attribute, fnlwgt (final weight). According to the dataset documentation, the term estimate refers to population totals derived from the CPS by creating “weighted tallies” of any specified socio-economic characteristics of the population. For this iteration, we will examine the models after removing the fnlwgt attribute and see how much impact the removed attribute has on the modeling. This iteration of the project will produce a set of results that we can compare with the results from the first three iterations.
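Continuing from the loading sketch above, the attribute removal for this take is a one-liner. Assuming native-country was already removed in the previous iteration, as the conclusion below suggests, both columns can be dropped together:

```python
# Remove fnlwgt (this take) and native-country (dropped in the previous take);
# errors='ignore' tolerates columns that are already absent.
df_take4 = df.drop(columns=['fnlwgt', 'native-country'], errors='ignore')
print(df_take4.columns.tolist())
```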

CONCLUSION: From the first iteration (Take 1), the baseline performance of the ten algorithms achieved an average accuracy of 81.37%. Four ensemble algorithms (Bagged CART, Random Forest, AdaBoost, and Stochastic Gradient Boosting) achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result on the training data, achieving an average accuracy of 86.99%. Using the optimized tuning parameters, the Stochastic Gradient Boosting algorithm processed the validation dataset with an accuracy of 87.23%, which was slightly better than its accuracy on the training data.
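All four takes share the same evaluation harness: spot-check a set of algorithms with 10-fold cross-validation, then tune the best performer. This summary does not list the ten algorithms by name, so the line-up below is an assumption based on the template's usual mix of linear, nonlinear, and ensemble learners; X_train and y_train stand for a preprocessed feature matrix and label vector. A minimal scikit-learn sketch:

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              AdaBoostClassifier, GradientBoostingClassifier)

# Hypothetical line-up of the ten spot-checked algorithms
models = {
    'LR': LogisticRegression(max_iter=1000),
    'LDA': LinearDiscriminantAnalysis(),
    'KNN': KNeighborsClassifier(),
    'CART': DecisionTreeClassifier(),
    'NB': GaussianNB(),
    'SVM': SVC(),
    'BaggedCART': BaggingClassifier(),
    'RF': RandomForestClassifier(),
    'AdaBoost': AdaBoostClassifier(),
    'SGB': GradientBoostingClassifier(),
}

# 10-fold cross-validation, reporting mean accuracy per algorithm
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')
    print(f'{name}: {scores.mean():.4f} ({scores.std():.4f})')
```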

From iteration Take 2, the baseline performance of the ten algorithms achieved an average accuracy of 81.93%. Four ensemble algorithms (Bagged CART, Random Forest, AdaBoost, and Stochastic Gradient Boosting) achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result on the training data, achieving an average accuracy of 87.31%. Using the optimized tuning parameters, the Stochastic Gradient Boosting algorithm processed the validation dataset with an accuracy of 87.57%, which was slightly better than its accuracy on the training data.

From iteration Take 3, the baseline performance of the ten algorithms achieved an average accuracy of 81.95%. Four ensemble algorithms (Bagged CART, Random Forest, AdaBoost, and Stochastic Gradient Boosting) achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result on the training data, achieving an average accuracy of 87.29%. Using the optimized tuning parameters, the Stochastic Gradient Boosting algorithm processed the validation dataset with an accuracy of 87.50%, which was slightly better than its accuracy on the training data. More importantly, the script run-time decreased from Take 2’s 5 hours and 12 minutes down to Take 3’s 3 hours and 34 minutes, a time improvement of approximately 31%.

For this iteration (Take 4), the baseline performance of the ten algorithms achieved an average accuracy of 84.19%. Three algorithms (Support Vector Machine, AdaBoost, and Stochastic Gradient Boosting) achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result on the training data, achieving an average accuracy of 87.38%. Using the optimized tuning parameters, the Stochastic Gradient Boosting algorithm processed the validation dataset with an accuracy of 87.47%, which was slightly better than its accuracy on the training data. Moreover, the script run-time decreased from Take 3’s 3 hours and 34 minutes down to Take 4’s 3 hours and 25 minutes, a modest improvement of roughly 4%.
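The tuning grids used in these trials are not reproduced in this summary. A minimal GridSearchCV sketch for the Stochastic Gradient Boosting step, with hypothetical parameter ranges and assuming the X_train/y_train split from the harness above plus a held-out X_validation/y_validation set:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.metrics import accuracy_score

# Hypothetical grid; the actual tuning trials are not listed in this summary
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.05, 0.1],
    'max_depth': [3, 5],
}
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
grid = GridSearchCV(GradientBoostingClassifier(random_state=7),
                    param_grid, scoring='accuracy', cv=kfold)
grid.fit(X_train, y_train)
print('Best CV accuracy:', grid.best_score_, grid.best_params_)

# Score the tuned model on the hold-out validation set
predictions = grid.best_estimator_.predict(X_validation)
print('Validation accuracy:', accuracy_score(y_validation, predictions))
```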

For this project, dropping the native-country and fnlwgt attributes improved both the training time and the accuracy of the models. The Stochastic Gradient Boosting ensemble algorithm continued to yield consistently top-notch training and validation results, which justifies the additional processing the algorithm requires.

Dataset Used: Census Income Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Census+Income

One potential source of performance benchmarks: https://www.kaggle.com/uciml/adult-census-income

The HTML-formatted report can be found on GitHub.