Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.
SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Human Activities with Smartphone Dataset is a multi-class classification situation where we are trying to predict one of the six possible outcomes.
INTRODUCTION: Researchers collected the datasets from experiments that consist of a group of 30 volunteers with each person performed six activities wearing a smartphone on the waist. With its embedded accelerometer and gyroscope, the research captured measurement for the activities of WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING. The dataset has been randomly partitioned into two sets, where 70% of the volunteers were selected for generating the training data and 30% of the test data.
In iteration Take1, the script focuses on evaluating various machine learning algorithms and identify the algorithm that produces the best accuracy metric. Iteration Take1 established a baseline performance regarding accuracy and processing time. For this iteration, we will examine the feasibility of using dimensionality reduction techniques to reduce the processing time while still maintaining an adequate level of prediction accuracy. The first technique we will explore is to eliminate collinear attributes based on a threshold of 85%.
CONCLUSION: From the previous iteration Take 1, the baseline performance of the ten algorithms achieved an average accuracy of 91.67%. Three algorithms (Linear Discriminant Analysis, Support Vector Machine, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved an average accuracy of 98.84%. Stochastic Gradient Boosting also processed the validation dataset with an accuracy of 95.49%, which was slightly below the accuracy from the training data.
From the current iteration, the baseline performance of the ten algorithms achieved an average accuracy of 90.83%. Three algorithms (Random Forest, Support Vector Machine, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved an average accuracy of 98.07%. Using the optimized tuning parameter available, the Stochastic Gradient Boosting algorithm processed the validation dataset with an accuracy of 93.96%, which was slightly worse than the accuracy from the training data and possibly due to over-fitting.
From the model-building activities, the number of attributes went from 561 down to 192 after eliminating 369 variables that are at least 85% collinear. The processing time went from 21 hours 43 minutes in iteration Take1 down to 7 hours and 15 minutes in iteration Take2. That was a reduction in model training and processing time of 66%.
In conclusion, the reduction in the number of attributes used still achieved an acceptable level of accuracy. by reducing the collinearity, the modeling took a much shorter time to process yet still retained decent accuracy. For this dataset, the Stochastic Gradient Boosting algorithm should be considered for further modeling or production use.
Dataset Used: Human Activity Recognition Using Smartphone Data Set
Dataset ML Model: Multi-class classification with numerical attributes
One potential source of performance benchmarks: https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones
The HTML formatted report can be found here on GitHub.