Multi-Class Classification Model for Wine Quality Using R

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Wine Quality dataset presents a multi-class classification problem in which we try to predict one of three possible outcomes (cheap, average, and good).

INTRODUCTION: The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.). The goal is to model wine quality based on physicochemical tests.

In the previous iteration, we approached the dataset as a regression problem and tried to predict the wine quality (a continuous numeric variable) with the lowest possible mean squared error. While regression is one approach for assessing wine quality, a quality rating expressed in pure numbers and fractions is difficult for people to grasp fully.

For this iteration of the project, we will approach this dataset as a multi-class problem and attempt to classify the wine quality into one of the three rating categories: 1-Good (quality 7 or above), 2-Average (quality of 5-6), and 3-Cheap (quality 4 or below).
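The binning described above can be sketched in base R as follows. This is a minimal illustration rather than the project's actual code: the function name `to_rating` and the exact label strings are assumptions, and the sample scores are made up only to show the mapping.

```r
# Map a numeric quality score to one of the three rating categories.
# The function name and label strings are illustrative assumptions.
to_rating <- function(quality) {
  cut(quality,
      breaks = c(-Inf, 4, 6, Inf),   # (-Inf, 4], (4, 6], (6, Inf)
      labels = c("3-Cheap", "2-Average", "1-Good"),
      right = TRUE)                  # each interval includes its upper bound
}

# Made-up scores, used only to demonstrate the mapping.
scores <- c(3, 4, 5, 6, 7, 9)
ratings <- to_rating(scores)
print(data.frame(quality = scores, rating = ratings))
```

Because the quality scores are whole numbers, the right-closed intervals place 4 and below in Cheap, 5 and 6 in Average, and 7 and above in Good.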

CONCLUSION: In the baseline run, the seven algorithms achieved an average accuracy of 80.08%. Three ensemble algorithms (Bagged Decision Trees, Random Forest, and Stochastic Gradient Boosting) achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Random Forest turned in the top result using the training data, with an average accuracy of 85.03%. Using the optimized tuning parameter, the Random Forest algorithm processed the validation dataset with an accuracy of 83.98%, which was slightly lower than the accuracy on the training data.
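One way the tune-then-validate step could look in R is sketched below, assuming the caret and randomForest packages are installed. The helper name `tune_random_forest`, the target column name `rating`, the 10-fold cross-validation, and the `mtry` grid are all assumptions made for illustration; the resampling scheme and tuning grid actually used in the report are not stated here.

```r
# Hypothetical helper: tune a Random Forest with cross-validation, then score
# it on a held-out validation set. Needs the caret and randomForest packages.
# Assumptions (not from the report): the "rating" column name, 10-fold CV,
# and the mtry grid.
tune_random_forest <- function(train_df, valid_df, target = "rating",
                               mtry_grid = c(2, 3, 4, 5)) {
  predictors <- setdiff(names(train_df), target)

  # Resampling setup and the candidate values for the mtry tuning parameter.
  control <- caret::trainControl(method = "cv", number = 10)
  grid <- expand.grid(mtry = mtry_grid)

  fit <- caret::train(
    x = train_df[, predictors, drop = FALSE],
    y = train_df[[target]],          # must be a factor for classification
    method = "rf",
    metric = "Accuracy",
    trControl = control,
    tuneGrid = grid
  )

  # Predict on the validation set with the best model found during tuning.
  predictions <- predict(fit, newdata = valid_df[, predictors, drop = FALSE])

  list(
    best_mtry = fit$bestTune$mtry,
    cv_accuracy = max(fit$results$Accuracy),
    validation_accuracy = mean(as.character(predictions) ==
                                 as.character(valid_df[[target]]))
  )
}
```

Calling `tune_random_forest(train_df, valid_df)` on a training and a validation data frame would return the selected `mtry` together with the cross-validated and validation accuracies, which is the kind of comparison reported above.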

For this project, predicting whether a bottle of wine would be good, average, or cheap appears to be more intuitive than simply predicting a numerical quality score. The Random Forest ensemble algorithm yielded consistently strong training and validation results, which warrant the additional processing the algorithm requires.

Dataset Used: Wine Quality Data Set

Dataset ML Model: Multi-class classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/wine+quality

One potential source of performance benchmarks: https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009

The HTML-formatted report can be found on GitHub.