Multi-Class Classification Model for Human Activity Recognition with Smartphone Using R Take 4

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Human Activity Recognition Using Smartphones dataset is a multi-class classification situation where we are trying to predict one of six possible outcomes.

INTRODUCTION: The researchers collected the dataset from experiments that consisted of a group of 30 volunteers, with each person performing six activities while wearing a smartphone on the waist. Using the phone's embedded accelerometer and gyroscope, the researchers captured measurements for the activities WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, and LAYING. The dataset has been randomly partitioned into two sets, where 70% of the volunteers were selected for generating the training data and 30% for the test data.

In iteration Take1, the script focused on evaluating various machine learning algorithms and identifying the one that produced the best accuracy metric. Iteration Take1 established a baseline performance regarding accuracy and processing time.
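
A minimal sketch of that spot-checking step, assuming the caret package with the training features already loaded into a data frame x_train and the labels into a factor y_train; the three algorithms, fold settings, and seed shown here are illustrative, not the full list of ten:

```r
library(caret)

# Shared resampling scheme so all models are compared on the same folds
control <- trainControl(method = "repeatedcv", number = 10, repeats = 1)

set.seed(888)
fit_lda <- train(x = x_train, y = y_train, method = "lda",
                 metric = "Accuracy", trControl = control)
set.seed(888)
fit_svm <- train(x = x_train, y = y_train, method = "svmRadial",
                 metric = "Accuracy", trControl = control)
set.seed(888)
fit_gbm <- train(x = x_train, y = y_train, method = "gbm",
                 metric = "Accuracy", trControl = control, verbose = FALSE)

# Compare the resampled accuracy distributions side by side
results <- resamples(list(LDA = fit_lda, SVM = fit_svm, GBM = fit_gbm))
summary(results)
```

Resetting the seed before each train() call keeps the resampling folds identical across models, so the resamples() comparison is apples-to-apples.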

In iteration Take2, we examined the feasibility of using dimensionality reduction techniques to reduce processing time while still maintaining an adequate level of prediction accuracy. The first technique we explored was eliminating collinear attributes based on a correlation threshold of 85%.
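
A minimal sketch of that collinearity filter, assuming caret's findCorrelation() over the training features in a hypothetical data frame x_train; the 0.85 cutoff mirrors the 85% threshold described above:

```r
library(caret)

# Pairwise correlation matrix over the numeric features
cor_matrix <- cor(x_train)

# Indices of attributes whose pairwise correlation exceeds the 0.85 cutoff
high_cor <- findCorrelation(cor_matrix, cutoff = 0.85)

# Drop the flagged columns
x_train_reduced <- x_train[, -high_cor]
```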

In iteration Take3, we explored the dimensionality reduction technique of ranking the importance of the attributes with a gradient boosting tree method. Afterward, we eliminated the features that did not contribute to a cumulative importance of 0.99.
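
A minimal sketch of that ranking-and-filtering step, again assuming caret with hypothetical x_train and y_train objects; the exact boosting settings used in Take3 may differ:

```r
library(caret)

# Fit a gradient boosting model to obtain attribute importance scores
set.seed(888)
fit_gbm <- train(x = x_train, y = y_train, method = "gbm",
                 trControl = trainControl(method = "cv", number = 10),
                 verbose = FALSE)

# Rank attributes by importance, then keep the smallest set
# that reaches 99% cumulative importance
imp <- varImp(fit_gbm)$importance
imp_sorted <- imp[order(-imp$Overall), , drop = FALSE]
cum_imp <- cumsum(imp_sorted$Overall) / sum(imp_sorted$Overall)
n_keep <- which(cum_imp >= 0.99)[1]          # first index reaching 99%
keep_cols <- rownames(imp_sorted)[seq_len(n_keep)]
x_train_reduced <- x_train[, keep_cols]
```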

For this iteration, we will explore the Recursive Feature Elimination (RFE) technique, which recursively removes attributes and builds a model on those that remain. To keep the training time manageable, we will limit the number of attributes under consideration to 50.
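
A minimal sketch of the RFE step, assuming caret's rfe() with a random forest ranker (rfFuncs) and hypothetical x_train and y_train objects; the candidate subset sizes cap the search at 50 attributes as described above:

```r
library(caret)

set.seed(888)
rfe_control <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
rfe_result <- rfe(x = x_train, y = y_train,
                  sizes = c(10, 20, 30, 40, 50),  # candidate subset sizes
                  rfeControl = rfe_control)

print(rfe_result)       # accuracy profile across the subset sizes
predictors(rfe_result)  # names of the attributes in the best subset
```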

CONCLUSION: From the previous iteration Take1, the baseline performance of the ten algorithms achieved an average accuracy of 91.67%. Three algorithms (Linear Discriminant Analysis, Support Vector Machine, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved an average accuracy of 98.84%. Stochastic Gradient Boosting also processed the validation dataset with an accuracy of 95.49%, which was slightly below the accuracy from the training data.

From the previous iteration Take2, the baseline performance of the ten algorithms achieved an average accuracy of 90.83%. Three algorithms (Random Forest, Support Vector Machine, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved an average accuracy of 98.07%. Using the optimized tuning parameters, the Stochastic Gradient Boosting algorithm processed the validation dataset with an accuracy of 93.96%, which was slightly worse than the accuracy from the training data, possibly due to over-fitting.

From the previous iteration Take3, the baseline performance of the ten algorithms achieved an average accuracy of 91.59%. The Random Forest and Stochastic Gradient Boosting algorithms achieved the top two accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved an average accuracy of 98.74%. Using the optimized tuning parameters, the Stochastic Gradient Boosting algorithm processed the validation dataset with an accuracy of 93.42%. The accuracy on the validation dataset was slightly worse than that on the training data, possibly due to over-fitting.

From the current iteration, the baseline performance of the ten algorithms achieved an average accuracy of 90.62%. Three algorithms (Bagged CART, Random Forest, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Random Forest turned in the top result using the training data. It achieved an average accuracy of 97.75%. Using the optimized tuning parameters, the Random Forest algorithm processed the validation dataset with an accuracy of 87.21%. The accuracy on the validation dataset was noticeably worse than that on the training data, possibly due to over-fitting.

From the model-building activities, the number of attributes went from 561 down to 41, eliminating 520 variables. The processing time went from 8 hours 16 minutes in iteration Take1 down to 2 hours 25 minutes in iteration Take4. That was a noticeable reduction in comparison to both Take2, which brought the processing time down to 7 hours 15 minutes, and Take3, which brought it down to 5 hours 22 minutes.

In conclusion, the Recursive Feature Elimination technique helped by cutting down the number of attributes and reducing the training time. Furthermore, the modeling took a much shorter time to process yet still retained decent accuracy. For this dataset, the Random Forest algorithm with Recursive Feature Elimination should be considered for further modeling or production use.

Dataset Used: Human Activity Recognition Using Smartphone Data Set

Dataset ML Model: Multi-class classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

One potential source of performance benchmarks: https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones

The HTML formatted report can be found here on GitHub.