Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Census Income dataset is a classic binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: This data was extracted from the 1994 Census Bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over 50K a year.

This dataset has many cells with missing values, so we will examine the models by imputing the missing cells with a default value. This iteration of the project will produce a set of results that we will use to compare with the baseline models from Take 1.

CONCLUSION: From the previous iteration (Take 1), the baseline performance of the ten algorithms achieved an average accuracy of 83.79%. Three ensemble algorithms (Bagged CART, Random Forest, and Stochastic Gradient Boosting) achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved an average accuracy of 86.27%. Using the optimized tuning parameter available, the Stochastic Gradient Boosting algorithm further processed the validation dataset with an accuracy of 86.29%, which was on par with the accuracy of the training data.

From this iteration (Take 2), the baseline performance of the ten algorithms achieved an average accuracy of 84.19%. Three ensemble algorithms (Bagged CART, Random Forest, and Stochastic Gradient Boosting) achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved an average accuracy of 86.60%. Using the optimized tuning parameter available, the Stochastic Gradient Boosting algorithm further processed the validation dataset with an accuracy of 86.93%, which was slightly better than the accuracy of the training data.

For this project, imputing the missing values appeared to have contributed to a slight improvement of the overall accuracy of the training model. The Stochastic Gradient Boosting ensemble algorithm continued to yield consistently top-notch training and validation results, which warrant the additional processing required by the algorithm.

Dataset Used: Census Income Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Census+Income

One potential source of performance benchmark: https://www.kaggle.com/uciml/adult-census-income

The HTML formatted report can be found here on GitHub.