Multi-Class Classification Model for Human Activity Recognition with Smartphone Using R Take 3

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Human Activity Recognition with Smartphones dataset presents a multi-class classification problem in which we try to predict one of six possible outcomes.

INTRODUCTION: Researchers collected the datasets from experiments in which a group of 30 volunteers each performed six activities while wearing a smartphone on the waist. Using the phone's embedded accelerometer and gyroscope, the researchers captured measurements for the activities of WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, and LAYING. The dataset was randomly partitioned into two sets, with 70% of the volunteers selected for generating the training data and 30% for the test data.

In iteration Take1, the script focused on evaluating various machine learning algorithms and identifying the one that produced the best accuracy metric. Take1 established a baseline for both accuracy and processing time.

For iteration Take2, we examined the feasibility of using dimensionality reduction techniques to reduce the processing time while still maintaining an adequate level of prediction accuracy. The first technique we explored was eliminating collinear attributes based on a correlation threshold of 0.85.
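One common way to implement such a filter in R is caret's findCorrelation. The sketch below uses a small synthetic data frame (the column names and values are illustrative, not the HAR features) to show the mechanics of dropping attributes whose pairwise correlation exceeds the 0.85 cutoff.

```r
# Hedged sketch of a collinearity filter with caret::findCorrelation.
# The data frame here is synthetic; in the project, the inputs would be
# the 561 HAR training attributes.
library(caret)

set.seed(42)
a <- rnorm(100)
x <- data.frame(f1 = a,
                f2 = a + rnorm(100, sd = 0.01),  # nearly identical to f1
                f3 = rnorm(100))                 # independent

corr_matrix <- cor(x)
high_corr <- findCorrelation(corr_matrix, cutoff = 0.85)  # column indices to drop
x_reduced <- x[, -high_corr, drop = FALSE]
ncol(x_reduced)  # one of the collinear pair is removed, leaving 2 columns
```

findCorrelation removes only one attribute from each highly correlated pair, which is why the reduced set keeps most of the information while shrinking the feature count.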

For this iteration (Take3), we explore the dimensionality reduction technique of ranking attribute importance with a gradient boosting tree method. We then eliminate the features that do not contribute to a cumulative importance of 0.99 (99%).
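In practice, the importance scores would come from a fitted gradient-boosted model (for example, via caret's varImp on a gbm fit). The sketch below uses made-up relative-influence values purely to illustrate the cumulative 0.99 cutoff logic.

```r
# Illustrative relative-importance scores (NOT the actual HAR values);
# a real run would obtain these from varImp() on a fitted gbm model.
importance <- c(f1 = 45, f2 = 30, f3 = 15, f4 = 8, f5 = 1.5, f6 = 0.5)

imp_sorted <- sort(importance, decreasing = TRUE)
cum_frac <- cumsum(imp_sorted) / sum(imp_sorted)

# Keep the top-ranked features up to and including the 99% cutoff
n_keep <- which(cum_frac >= 0.99)[1]
keep <- names(imp_sorted)[seq_len(n_keep)]
keep  # f6 falls below the 0.99 cumulative threshold and is dropped
```

Applied to the HAR features, this kind of cutoff is what reduced the attribute count from 561 to 79 in the model-building step described below.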

CONCLUSION: From the previous iteration Take1, the baseline performance of the ten algorithms achieved an average accuracy of 91.67%. Three algorithms (Linear Discriminant Analysis, Support Vector Machine, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data, achieving an average accuracy of 98.84%. Stochastic Gradient Boosting also processed the validation dataset with an accuracy of 95.49%, slightly below the accuracy achieved on the training data.

From the previous iteration Take2, the baseline performance of the ten algorithms achieved an average accuracy of 90.83%. Three algorithms (Random Forest, Support Vector Machine, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data, achieving an average accuracy of 98.07%. Using the optimized tuning parameters, the Stochastic Gradient Boosting algorithm processed the validation dataset with an accuracy of 93.96%. The accuracy on the validation dataset was slightly worse than on the training data, possibly due to over-fitting.

From the current iteration, the baseline performance of the ten algorithms achieved an average accuracy of 91.59%. The Random Forest and Stochastic Gradient Boosting algorithms achieved the top two accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data, achieving an average accuracy of 98.74%. Using the optimized tuning parameters, the Stochastic Gradient Boosting algorithm processed the validation dataset with an accuracy of 93.42%. The accuracy on the validation dataset was slightly worse than on the training data, possibly due to over-fitting.
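The tuning trials described above follow caret's standard grid-search pattern. This is a hedged sketch using the built-in iris data and a deliberately tiny, illustrative grid; the project's actual resampling scheme and parameter values may differ.

```r
# Hedged sketch of caret grid tuning for Stochastic Gradient Boosting
# (method = "gbm"); data and grid values are illustrative, not the project's.
library(caret)

set.seed(42)
ctrl <- trainControl(method = "cv", number = 3)
grid <- expand.grid(n.trees = c(50, 100),
                    interaction.depth = c(1, 2),
                    shrinkage = 0.1,
                    n.minobsinnode = 10)

fit <- train(Species ~ ., data = iris, method = "gbm",
             trControl = ctrl, tuneGrid = grid, verbose = FALSE)
fit$bestTune  # the parameter combination with the highest CV accuracy
```

caret's train evaluates every row of the grid under cross-validation and retains the best combination, which is what "a series of tuning trials" amounts to in this workflow.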

From the model-building activities, the number of attributes went from 561 down to 79 after eliminating the 482 variables that fell below the required cumulative importance. The processing time went from 8 hours 16 minutes in iteration Take1 down to 5 hours 22 minutes in iteration Take3. That was also a noticeable reduction compared to Take2, which had brought the processing time down to 7 hours 15 minutes.

In conclusion, we expected the importance-ranking technique to benefit the tree-based methods the most, and it did. Furthermore, by reducing the number of attributes, the modeling took much less time to process while still retaining decent accuracy. For this dataset, the Stochastic Gradient Boosting algorithm should be considered for further modeling or production use.

Dataset Used: Human Activity Recognition Using Smartphone Data Set

Dataset ML Model: Multi-class classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

One potential source of performance benchmarks: https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones

The HTML formatted report can be found here on GitHub.