Binary-Class Classification Model for Seismic Bumps Take 2 Using Python

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

INTRODUCTION: Mining activity has always been associated with dangers commonly called mining hazards. A special case of such a threat is the seismic hazard, which occurs frequently in many underground mines. Seismic hazard is the hardest of the natural hazards to detect and predict, and in this respect it is comparable to an earthquake. The complexity of seismic processes and the large disproportion between the number of low-energy seismic events and the number of high-energy phenomena make statistical techniques insufficient for predicting seismic hazard. Therefore, it is essential to search for new opportunities for better hazard prediction, including machine learning methods.

In iteration Take1, three algorithms produced high accuracy results but dismal precision and recall scores. In this iteration, we examine whether ROC scores are a viable way to rank and choose the models.
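As a rough illustration of ranking models by ROC score, the sketch below cross-validates three classifiers and sorts them by mean score. It assumes the "ROC score" means the area under the ROC curve, which is what scikit-learn's `roc_auc` scoring string computes. The data is a synthetic stand-in generated with `make_classification` (an assumption, not the seismic-bumps data), and the model list, fold count, and seed are illustrative choices rather than the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (assumption): roughly 7% positive cases.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.93, 0.07], random_state=7
)

# Illustrative subset of candidate models, not the full set of eight.
models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "GBM": GradientBoostingClassifier(random_state=7),
}

# Stratified folds keep the class ratio similar in every fold.
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=kfold, scoring="roc_auc")
    results[name] = scores.mean()

# Rank the models from highest to lowest mean ROC AUC.
ranking = sorted(results.items(), key=lambda kv: kv[1], reverse=True)
for name, auc in ranking:
    print(f"{name}: mean ROC AUC = {auc:.4f}")
```

Switching the `scoring` argument between `"accuracy"` and `"roc_auc"` is enough to see how the ranking can change with the metric.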

CONCLUSION: In iteration Take1, the baseline performance of the eight algorithms achieved an average accuracy of 91.94%. Three algorithms (Logistic Regression, Support Vector Machine, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, the Support Vector Machine algorithm turned in the best accuracy result of 93.36%, but with very low precision and recall scores for the positive cases when processing the validation dataset. Because the dataset on hand is imbalanced, we needed to look for another metric or another approach to evaluate the models.
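The gap between high accuracy and poor precision and recall can be reproduced with a toy example. The sketch below uses made-up labels (93 negatives and 7 positives, a hypothetical ratio chosen only for illustration) and a "model" that always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical labels: 93 negative cases and 7 positive cases.
y_true = np.array([0] * 93 + [1] * 7)

# A trivial classifier that always predicts the majority (negative) class.
y_pred = np.zeros_like(y_true)
# It also assigns the same score to every case, so it cannot rank them.
y_score = np.zeros(len(y_true), dtype=float)

acc = accuracy_score(y_true, y_pred)
# zero_division=0 returns 0 instead of warning when nothing is predicted positive.
prec = precision_score(y_true, y_pred, zero_division=0)
rec = recall_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)

print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} roc_auc={auc:.2f}")
# prints: accuracy=0.93 precision=0.00 recall=0.00 roc_auc=0.50
```

The accuracy looks strong only because negatives dominate. The ROC AUC of 0.50 shows that the classifier has no ability to separate the two classes.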

In the current iteration, the baseline performance of the eight algorithms achieved an average ROC score of 66.16%. Three algorithms (Logistic Regression, AdaBoost, and Stochastic Gradient Boosting) achieved the top three ROC scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the best ROC result of 77.98%, but with dismal precision and recall scores.
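One way to run tuning trials against the ROC metric is a grid search scored on `roc_auc`, followed by a check of precision and recall on held-out data. The sketch below is a minimal example under stated assumptions: the data is synthetic, and the parameter grid, split size, and seed are hypothetical values, not the ones used to obtain the results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic, imbalanced stand-in data (assumption): roughly 7% positive cases.
X, y = make_classification(
    n_samples=800, n_features=10, weights=[0.93, 0.07], random_state=7
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=7
)

# Hypothetical, deliberately small grid to keep the example fast.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}

grid = GridSearchCV(
    GradientBoostingClassifier(random_state=7),
    param_grid,
    scoring="roc_auc",  # select hyperparameters by area under the ROC curve
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=7),
)
grid.fit(X_train, y_train)

# Score the refitted best model on the held-out validation split.
val_auc = roc_auc_score(y_val, grid.predict_proba(X_val)[:, 1])
report = classification_report(y_val, grid.predict(X_val), zero_division=0)

print("Best parameters:", grid.best_params_)
print(f"Validation ROC AUC: {val_auc:.4f}")
print(report)
```

Printing the classification report next to the ROC AUC makes it easy to spot a model that ranks cases reasonably well but still misses most positives at the default 0.5 decision threshold.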

The ROC metric has given us a more viable way to evaluate the models than accuracy alone. However, because the dataset on hand is imbalanced, we still need to look for another approach to further validate our modeling effort.

Dataset Used: Seismic Bumps Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/seismic-bumps

The HTML formatted report can be found here on GitHub.