Binary Classification Model for Truck APS Failure Detection Using R, Take 3

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The APS Failure at Scania Trucks dataset is a binary classification problem where we are trying to predict one of two possible outcomes.

INTRODUCTION: The dataset consists of data collected from heavy Scania trucks in everyday usage. The system in focus is the Air Pressure system (APS), which generates pressurized air that is used in various truck functions, such as braking and gear changes. The dataset's positive class consists of component failures for a specific component of the APS. The negative class consists of trucks with failures in components not related to the APS. The data consists of a subset of all available data, selected by experts.

This dataset has many cells with missing values, so it is not practical to simply delete the rows that contain them. This iteration of the project will impute the missing cells with the value zero.
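Below is a minimal sketch of that imputation step in R. The file name aps_failure_training_set.csv, the "na" missing-value marker, the class column name, and the object names are assumptions for illustration, not the project's actual code.

    # Load the raw data; the file name and the "na" missing-value marker are assumptions
    entireDataset <- read.csv("aps_failure_training_set.csv", na.strings = "na",
                              stringsAsFactors = FALSE)

    # Treat the class label as a factor and identify the predictor columns
    entireDataset$class <- factor(entireDataset$class)
    predictorCols <- setdiff(names(entireDataset), "class")

    # Impute every remaining missing cell with the value zero
    entireDataset[predictorCols][is.na(entireDataset[predictorCols])] <- 0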

In iteration Take1, the script focused on evaluating various machine learning algorithms and identifying the algorithm that produced the best accuracy. That iteration established the baseline performance for accuracy and processing time.

In iteration Take2, we examined the feasibility of using a dimensionality reduction technique to reduce the processing time while still maintaining an adequate level of prediction accuracy. The technique was to eliminate collinear attributes based on a correlation threshold of 75%.
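A minimal sketch of that collinearity filter using the caret package is shown below. It reuses the entireDataset and predictorCols objects assumed in the earlier sketch and treats the 75% threshold as a correlation cutoff of 0.75.

    library(caret)

    # Compute pairwise correlations among the numeric predictors
    correlationMatrix <- cor(entireDataset[predictorCols])

    # Flag attributes whose pairwise correlation exceeds the 75% threshold
    highlyCorrelated <- findCorrelation(correlationMatrix, cutoff = 0.75)
    dropCols <- predictorCols[highlyCorrelated]

    # Keep the class label plus the attributes that survived the filter
    reducedDataset <- entireDataset[, !(names(entireDataset) %in% dropCols)]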

For this iteration, we will explore the Recursive Feature Elimination (RFE) technique by recursively removing attributes and building a model on those attributes that remain. To keep the training time manageable, we will limit the number of attributes to 50.
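A minimal sketch of the RFE step with caret follows, assuming a random-forest-based selector, 10-fold cross-validation, and the object names from the earlier sketches; the control settings and seed are assumptions.

    library(caret)
    library(randomForest)

    set.seed(888)

    # Use the random-forest helper functions with 10-fold cross-validation
    rfeCtrl <- rfeControl(functions = rfFuncs, method = "cv", number = 10)

    # Ask RFE to evaluate a candidate subset of 50 attributes
    # (the full attribute set is also scored for comparison)
    rfeResult <- rfe(x = entireDataset[predictorCols],
                     y = entireDataset$class,
                     sizes = c(50),
                     rfeControl = rfeCtrl)

    # Keep only the attributes RFE selected, plus the class label
    selectedCols <- predictors(rfeResult)
    rfeDataset <- entireDataset[, c(selectedCols, "class")]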

ANALYSIS: From iteration Take1, the baseline performance of the ten algorithms achieved an average accuracy of 99.02%. Three ensemble algorithms (Bagged CART, Random Forest, and Stochastic Gradient Boosting) achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Random Forest turned in the top result, achieving an average accuracy of 99.39% on the training data. With its optimized tuning parameters, the Random Forest algorithm processed the validation dataset with an accuracy of 99.22%, slightly below the accuracy on the training data.

From iteration Take2, the baseline performance of the ten algorithms achieved an average accuracy of 98.99%. Three ensemble algorithms (Bagged CART, Random Forest, and Stochastic Gradient Boosting) achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Random Forest turned in the top result, achieving an average accuracy of 99.40% on the training data. With its optimized tuning parameters, the Random Forest algorithm processed the validation dataset with an accuracy of 99.19%, slightly below the accuracy on the training data.

In the current iteration, the baseline performance of the ten algorithms achieved an average accuracy of 98.94%. Three ensemble algorithms (Bagged CART, Random Forest, and Stochastic Gradient Boosting) achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Random Forest turned in the top result, achieving an average accuracy of 99.31% on the training data. With its optimized tuning parameters, the Random Forest algorithm processed the validation dataset with an accuracy of 99.09%, slightly below the accuracy on the training data.
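For reference, below is a minimal sketch of how the Random Forest tuning and validation scoring could look with caret. The mtry grid, seed, and validationDataset object are assumptions, and rfeDataset refers to the RFE-reduced training data from the earlier sketch.

    library(caret)

    set.seed(888)

    # 10-fold cross-validation over a small grid of mtry values
    trainCtrl <- trainControl(method = "cv", number = 10)
    rfGrid <- expand.grid(mtry = c(3, 5, 7, 9))

    rfModel <- train(class ~ ., data = rfeDataset,
                     method = "rf",
                     metric = "Accuracy",
                     tuneGrid = rfGrid,
                     trControl = trainCtrl)

    # Score the hold-out validation split with the tuned model; the split is
    # assumed to have been prepared the same way as the training data
    predictions <- predict(rfModel, newdata = validationDataset)
    confusionMatrix(predictions, validationDataset$class)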

From the model-building activities, the number of attributes went from 170 down to 25 after eliminating 145 attributes. The processing time went from 63 hours 22 minutes in iteration Take1 down to 9 hours 33 minutes in iteration Take3, a reduction of 84% from Take1. That was also a noticeable improvement over Take2, which had reduced the processing time to 39 hours 52 minutes.

CONCLUSION: The feature selection technique helped by cutting down the number of attributes, which in turn shortened the training time considerably while still retaining an acceptable level of accuracy. For this dataset, the Random Forest algorithm and the Recursive Feature Elimination (RFE) technique should be considered for further modeling or production use.

Dataset Used: APS Failure at Scania Trucks Data Set

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/APS+Failure+at+Scania+Trucks

The HTML-formatted report can be found here on GitHub.