Binary Classification Model for Bank Marketing Using R, Take 3

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

Dataset Used: Bank Marketing Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: http://archive.ics.uci.edu/ml/datasets/bank+marketing

One source of potential performance benchmarks: https://www.kaggle.com/rouseguy/bankbalanced

INTRODUCTION: The Bank Marketing dataset involves predicting whether bank clients will subscribe (yes/no) to a term deposit (target variable). It is a binary (2-class) classification problem. There are over 41,000 observations with 19 input variables and 1 output variable. There are no missing values within the dataset. This dataset is based on the “Bank Marketing” UCI dataset and is enriched by the addition of five new social and economic features/attributes. Apart from those five attributes, it is almost identical to the original dataset.

CONCLUSION: The take No.3 version of this banking dataset aims to test the effect of adding five socio-economic attributes to the dataset. You can see the results from the take No.2 here on GitHub.

The baseline performance of the seven algorithms achieved an average accuracy of 89.70% (vs. 89.22% from the take No.2 version). Three algorithms (Logistic Regression, Bagged CART, and Stochastic Gradient Boosting) achieved the top accuracy and Kappa scores during the initial modeling round. After a series of tuning trials with these three algorithms, Stochastic Gradient Boosting achieved the top accuracy/Kappa result on the training data, with an average accuracy of 90.11% (vs. 89.46% from the take No.2 version).
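
For readers less familiar with the two metrics mentioned above, here is a minimal base R sketch of how accuracy and Cohen's Kappa can be computed from a 2x2 confusion matrix. The counts and the helper name accuracy_and_kappa are hypothetical and chosen only for illustration; they are not results from this project, and the project itself may have obtained these metrics through a modeling package rather than by hand.

```r
# Hypothetical confusion matrix counts (illustrative only, not project results).
# Rows are predicted classes, columns are actual classes.
conf_mat <- matrix(c(850, 50,
                      60, 40),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(predicted = c("no", "yes"),
                                   actual    = c("no", "yes")))

# Hypothetical helper: returns accuracy and Cohen's Kappa for a square confusion matrix.
accuracy_and_kappa <- function(cm) {
  n <- sum(cm)
  # Observed agreement: share of cases on the diagonal (this is the accuracy).
  observed <- sum(diag(cm)) / n
  # Agreement expected by chance, based on the row and column totals.
  expected <- sum(rowSums(cm) * colSums(cm)) / n^2
  # Kappa rescales accuracy so that 0 means chance-level agreement.
  kappa_stat <- (observed - expected) / (1 - expected)
  c(accuracy = observed, kappa = kappa_stat)
}

scores <- accuracy_and_kappa(conf_mat)
print(round(scores, 4))
```

With these made-up counts, the accuracy is 0.89 while the Kappa is only about 0.36, which illustrates why Kappa is a useful companion metric when one class (here "no") dominates the data.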

Stochastic Gradient Boosting also achieved an accuracy of 90.00% on the validation dataset, which was sufficiently close to the training result. For this project, the Stochastic Gradient Boosting ensemble algorithm yielded consistently top-notch training and validation results, which warrant the additional processing required by the algorithm. The addition of the socio-economic attributes did not seem to have a substantial effect on the overall accuracy of the prediction models.

The HTML formatted report can be found here on GitHub.