Regression Model for Bike Sharing Using R – Take 3

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery

Dataset Used: Bike Sharing Dataset

Dataset ML Model: Regression with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset

For performance benchmarks, please consult: https://www.kaggle.com/contactprad/bike-share-daily-data

INTRODUCTION: Using the data generated by a bike sharing system, this project attempts to predict the daily demand for bike sharing. For this iteration of the project, we use the available data to identify a suitable machine learning algorithm for future predictions. We kept the data transformation activities to a minimum and dropped several attributes that either do not make sense to keep or will not help in training the model.

For Take No. 3 of the project, we leverage the hourly data instead of the daily data used in Take No. 1. We examine the algorithm performance and compare how models trained on the hourly dataset perform against those trained on the daily data.
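As a rough illustration of the attribute-dropping step, here is a minimal base R sketch. The column names follow the layout of hour.csv in the UCI dataset, but the choice of which columns to drop is an assumption, not something stated in this write-up: instant and dteday are treated as identifiers, and casual and registered are dropped because they sum to the target cnt. The helper name drop_unused and the toy rows are hypothetical.

```r
# Hypothetical helper: remove columns that should not be used as predictors.
# The default list is an assumption, not the author's documented choice.
drop_unused <- function(df, unused = c("instant", "dteday", "casual", "registered")) {
  df[, setdiff(names(df), unused), drop = FALSE]
}

# With the real file this would be: bikes <- drop_unused(read.csv("hour.csv"))
# Toy rows, only to show the shape of the result.
toy <- data.frame(
  instant    = 1:3,
  dteday     = c("2011-01-01", "2011-01-01", "2011-01-01"),
  hr         = 0:2,
  temp       = c(0.24, 0.22, 0.22),
  casual     = c(3, 8, 5),
  registered = c(13, 32, 27),
  cnt        = c(16, 40, 32)
)

bikes <- drop_unused(toy)
print(names(bikes))
```

Keeping casual and registered would let a model reconstruct cnt exactly, so any error metric computed with them in place would say little about real predictive skill.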

CONCLUSION: The baseline performance of predicting the target variable achieved an average RMSE value of 105 (vs. RMSE of 1322 from the daily dataset). Three algorithms (k-Nearest Neighbors, Random Forest, and Stochastic Gradient Boosting) achieved the lowest RMSE and highest R-square values during the initial modeling round. After a series of tuning trials with these three algorithms, Random Forest produced the lowest RMSE value of 67 (vs. 1213 using the daily data) and the highest R-square value at 0.8648 (vs. 0.6093 using the daily data).

Random Forest also processed the validation dataset with an RMSE value of 64 (vs. 1177 using the daily data) and an R-square value of 0.8778 (vs. 0.6329 using the daily data), which was better than the average training result. For this project, the Random Forest ensemble algorithm yielded top-notch training and validation results, which warrant the additional processing required by the algorithm.

Furthermore, the use of hourly data (vs. daily data) generally yielded significantly higher R-square values for all algorithms. We recommend building the predictive models on the hourly data whenever possible.
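When reading the hourly and daily figures side by side, it helps to keep in mind that RMSE is expressed in the units of the target (bike counts per hour or per day), while R-square is unit-free. The base R sketch below uses made-up numbers to show this: the same relative errors at two different scales give very different RMSE values but the same R-square. The function names rmse and r_square are illustrative, and the write-up does not say which R-square definition was used; this sketch assumes 1 - SSE/SST.

```r
# RMSE: typical error size, in the same units as the target.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# R-square as 1 - SSE/SST (an assumed definition): share of variance explained.
r_square <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Made-up values: identical relative errors at an hourly-like and a daily-like scale.
hourly_actual <- c(100, 200, 300, 400)
hourly_pred   <- c(110, 190, 320, 380)
daily_actual  <- hourly_actual * 24
daily_pred    <- hourly_pred * 24

print(rmse(hourly_actual, hourly_pred))      # about 15.81
print(rmse(daily_actual, daily_pred))        # about 379.47 (24 times larger)
print(r_square(hourly_actual, hourly_pred))  # 0.98
print(r_square(daily_actual, daily_pred))    # 0.98
```

For this reason, R-square is the more direct basis for comparing the hourly and daily models; the raw RMSE values mainly reflect the different magnitudes of hourly and daily counts.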

The HTML formatted report can be found here on GitHub.