Binary Classification Model for Online News Popularity Using R Take 2

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Online News Popularity dataset presents a binary classification problem, where we try to predict one of two possible outcomes.

INTRODUCTION: This dataset summarizes a heterogeneous set of features about articles published by Mashable over a period of two years. The goal is to predict each article's popularity level in social networks. The dataset does not contain the original content, only some statistics associated with it. The original content can be publicly accessed and retrieved using the provided URLs.

Many thanks to K. Fernandes, P. Vinagre, and P. Cortez for making the dataset and benchmarking information available. Reference: K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 – Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal.

In iteration Take1, the script focused on evaluating various machine learning algorithms and identifying the algorithm that produces the best accuracy. Iteration Take1 established a baseline for both accuracy and processing time.

For this iteration, we will examine the feasibility of reducing dimensionality by ranking attribute importance with a gradient boosting tree method. Afterward, we will eliminate the lowest-ranked features, those that do not contribute to the cumulative importance of 0.99 (or 99%).
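As a rough illustration of the cumulative importance cut-off, the base R sketch below ranks features by importance and keeps the smallest top-ranked set whose share of total importance reaches the threshold. This is only one reading of the rule, not necessarily the exact procedure used in the script. The function name select_by_cumulative_importance, the feature names, and the importance scores are all hypothetical; in practice the scores would come from a fitted gradient boosting model.

```r
# Hypothetical importance scores (made-up values for illustration only).
# In a real run these would be taken from a fitted gradient boosting model.
importance <- c(feature_a = 40, feature_b = 25, feature_c = 20,
                feature_d = 14.5, feature_e = 0.5)

# Assumed helper: keep the top-ranked features whose cumulative share of
# total importance first reaches the threshold (0.99 by default).
select_by_cumulative_importance <- function(importance, threshold = 0.99) {
  # Rank features from most to least important
  ranked <- sort(importance, decreasing = TRUE)
  # Convert raw scores to shares of the total, then accumulate
  cumulative <- cumsum(ranked / sum(ranked))
  # Index of the first feature at which the threshold is reached
  # (small tolerance guards against floating-point rounding)
  n_keep <- which(cumulative >= threshold - 1e-12)[1]
  names(ranked)[seq_len(n_keep)]
}

kept <- select_by_cumulative_importance(importance)
dropped <- setdiff(names(importance), kept)

print(kept)
print(dropped)
```

With these made-up scores, the first four features account for 99.5% of the total importance, so feature_e is the only one dropped. The retained column names could then be used to subset the training and validation data before re-running the modeling steps.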

ANALYSIS: In the previous iteration, Take1, the baseline performance of the machine learning algorithms achieved an average accuracy of 64.53%. Three algorithms (Random Forest, AdaBoost, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data, with an average accuracy of 67.48%. Using the optimized tuning parameters, the Stochastic Gradient Boosting algorithm processed the validation dataset with an accuracy of 66.71%, which was just slightly below the accuracy on the training data.

In the current iteration, the baseline performance of the machine learning algorithms achieved an average accuracy of 64.29%. Two ensemble algorithms (Random Forest and Stochastic Gradient Boosting) achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data, with an average accuracy of 67.51%. Using the optimized tuning parameters, the Stochastic Gradient Boosting algorithm processed the validation dataset with an accuracy of 66.53%, which was just slightly below the accuracy on the training data.

From the model-building activities, the number of attributes went from 58 down to 42 after eliminating 16 attributes. The processing time went from 6 hours 31 minutes (391 minutes) in iteration Take1 down to 3 hours 18 minutes (198 minutes) in iteration Take2, a reduction of about 49% from Take1.

CONCLUSION: The feature selection technique helped by cutting down the number of attributes and reducing the training time. The modeling took roughly half the time to process yet retained a comparable level of accuracy. For this dataset, the Stochastic Gradient Boosting algorithm and the attribute importance ranking technique should be considered for further modeling or production use.

Dataset Used: Online News Popularity Dataset

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity

The HTML-formatted report can be found on GitHub.