Binary-Class Classification Model for German Credit Risks Using Python Take 2

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The German Credit Risks Dataset is a binary-class classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: This dataset contains 1,000 entries with 20 categorial/symbolic attributes prepared by Prof. Hofmann. In this dataset, each entry represents a person who takes on credit risk by a German bank. Each person is classified as good or bad credit risks according to the set of attributes.

Because the case study also stipulated that it is worse to classify a customer as good when they are bad (weight of 5), than it is to classify a customer as bad when they are good (weight of 1). For this iteration, the script focuses on tuning various machine learning algorithms and identify the algorithm that can produce the best cost-and-accuracy tradeoffs.

CONCLUSION: From the previous iteration Take 1, the baseline performance of the eight algorithms achieved an average accuracy of 71.80%. Three algorithms (Logistic Regression, Extra Trees, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved an average accuracy of 76.14%. Using the optimized tuning parameter available, the Stochastic Gradient Boosting algorithm processed the validation dataset with an accuracy of 77.66%, which was slightly better than the accuracy from the training data.

From the cost vs accuracy comparison, both the Logistic Regression and Stochastic Gradient Boosting achieved high accuracy while keeping the costs of incorrect predictions low. Either algorithm should be considered for further modeling or production use.

Dataset Used: German Credit Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29

One potential source of performance benchmarks: https://www.kaggle.com/uciml/german-credit/home

The HTML formatted report can be found here on GitHub.