Regression Model for Online News Popularity Using Python Take 3

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Online News Popularity dataset is a regression situation where we are trying to predict the value of a continuous variable.

INTRODUCTION: This dataset summarizes a heterogeneous set of features about articles published by Mashable in a period of two years. The goal is to predict the article’s popularity level in social networks. The dataset does not contain the original content, but some statistics associated with it. The original content can be publicly accessed and retrieved using the provided URLs.

Many thanks to K. Fernandes, P. Vinagre, and P. Cortez for making the dataset and benchmarking information available. Reference: A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 – Portuguese Conference on Artificial Intelligence, September 2015, Coimbra, Portugal.

In iteration Take1, the script focused on evaluating various machine learning algorithms and identifying the algorithm that produces the best accuracy result. Iteration Take1 established a baseline performance regarding accuracy and processing time.

In iteration Take2, we examined the feasibility of using a dimensionality reduction technique that ranks attribute importance with a gradient boosting tree method. Afterward, we eliminated the features that fell outside the cumulative importance threshold of 0.99 (or 99%).
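The Take2 selection step can be sketched as follows. This is a minimal illustration, not the project's actual script: it uses a synthetic stand-in for the dataset (the real script loads the 58 numerical attributes from the UCI CSV), and the model parameters are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for the Online News Popularity features (assumption:
# the real script loads 58 numerical attributes from the UCI dataset).
X, y = make_regression(n_samples=500, n_features=58, noise=0.1, random_state=7)

# Rank the attributes by gradient boosting feature importance.
model = GradientBoostingRegressor(random_state=7)
model.fit(X, y)
importances = model.feature_importances_

# Keep the smallest set of features whose cumulative importance reaches 0.99,
# eliminating the attributes that contribute only the remaining 1%.
order = np.argsort(importances)[::-1]
cumulative = np.cumsum(importances[order])
keep = order[: np.searchsorted(cumulative, 0.99) + 1]
X_reduced = X[:, keep]
print(X_reduced.shape)
```

The key idea is that the features are sorted by importance first, so the cumulative sum identifies how many of the top-ranked attributes are needed to cover 99% of the total importance.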

For this iteration, we will explore the Recursive Feature Elimination (or RFE) technique by recursively removing attributes and building a model on those attributes that remain. To keep the training time manageable, we will limit the number of attributes to 40.
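With scikit-learn, the RFE step described above can be sketched like this. It is a simplified example using synthetic data in place of the real dataset; the choice of gradient boosting as the underlying estimator mirrors the previous iteration and is an assumption.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFE

# Synthetic stand-in for the 58 numerical attributes (assumption).
X, y = make_regression(n_samples=500, n_features=58, noise=0.1, random_state=7)

# Recursively eliminate attributes, refitting the model after each removal,
# until only 40 attributes remain (the cap chosen to keep training time manageable).
selector = RFE(
    estimator=GradientBoostingRegressor(random_state=7),
    n_features_to_select=40,
)
selector.fit(X, y)
X_selected = selector.transform(X)
print(X_selected.shape)
```

RFE fits the estimator once per elimination round, so capping the selection at 40 attributes (18 rounds from 58 features) bounds the total training time.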

ANALYSIS: From the previous iteration Take1, the baseline performance of the machine learning algorithms achieved an average RMSE of 13020. Two algorithms (Linear Regression and ElasticNet) achieved the top RMSE scores after the first round of modeling. After a series of tuning trials, ElasticNet turned in the top result using the training data, achieving the best RMSE of 11273. Using the optimized tuning parameters available, the ElasticNet algorithm processed the validation dataset with an RMSE of 12089, which was slightly worse than its result on the training data.

From the previous iteration Take2, the baseline performance of the machine learning algorithms achieved an average RMSE of 13128. Two algorithms (Linear Regression and ElasticNet) achieved the top RMSE scores after the first round of modeling. After a series of tuning trials, ElasticNet turned in the top result using the training data, achieving the best RMSE of 11358. Using the optimized tuning parameters available, the ElasticNet algorithm processed the validation dataset with an RMSE of 12146, which was slightly worse than its result on the training data.

In the current iteration, the baseline performance of the machine learning algorithms achieved an average RMSE of 14468. Two algorithms (ElasticNet and Support Vector Machine) achieved the top RMSE scores after the first round of modeling. After a series of tuning trials, ElasticNet turned in the top result using the training data, achieving the best RMSE of 11294. Using the optimized tuning parameters available, the ElasticNet algorithm processed the validation dataset with an RMSE of 12094, which was slightly worse than its result on the training data.
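The tune-then-validate workflow used for ElasticNet can be sketched as below. This is a hedged illustration on synthetic data, not the project's script: the hyperparameter grid, split sizes, and data are assumptions, and only the overall shape of the workflow (cross-validated grid search scored by RMSE, followed by a hold-out evaluation) reflects the text.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the RFE-reduced attribute set (assumption).
X, y = make_regression(n_samples=500, n_features=40, noise=0.1, random_state=7)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=7
)

# Grid-search ElasticNet over regularization strength and L1/L2 mix,
# scoring by (negative) RMSE as in the baseline comparisons.
grid = GridSearchCV(
    ElasticNet(max_iter=10000),
    param_grid={"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.2, 0.5, 0.8]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
grid.fit(X_train, y_train)

# Evaluate the tuned model on the held-out validation split.
val_pred = grid.predict(X_val)
val_rmse = np.sqrt(np.mean((y_val - val_pred) ** 2))
print(grid.best_params_, round(float(val_rmse), 2))
```

Scoring with `neg_root_mean_squared_error` keeps the tuning criterion consistent with the RMSE metric reported throughout the write-up.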

From the model-building activities, the number of attributes went from 58 down to 30 after eliminating 28 attributes. The processing time went from 15 minutes 1 second in iteration Take1 up to 58 minutes 16 seconds in the current iteration, due to the additional time required for tuning the Support Vector Machine algorithm. This was also a significant increase in comparison to Take2, which had a processing time of 17 minutes 37 seconds.

CONCLUSION: The two feature selection techniques yielded different attribute selection sets and outcomes. For this dataset, the ElasticNet algorithm and the attribute importance ranking technique from iteration Take2 should be considered for further modeling or production use.

Dataset Used: Online News Popularity Dataset

Dataset ML Model: Regression with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity

The HTML formatted report can be found here on GitHub.