Regression Model for Bike Sharing Using Python – Take 3

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

Dataset Used: Bike Sharing Dataset

Dataset ML Model: Regression with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset

For performance benchmarks, please consult: https://www.kaggle.com/contactprad/bike-share-daily-data

INTRODUCTION: Using the data generated by a bike sharing system, this project attempts to predict the daily demand for bike sharing. For this iteration of the project, we use the available data to identify a suitable machine learning algorithm for future predictions. We kept the data transformation activities to a minimum and dropped several attributes that either do not make sense to keep or will not help in training the model.

For Take No. 3 of the project, we leverage the hourly data instead of the daily data used in Take No. 1. We examine how the modeling algorithms perform on the hourly dataset compared with the daily dataset.
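As a rough illustration of the minimal data preparation described above, the sketch below reads hourly records and drops a few attributes before modeling. The column names follow the hour.csv file in the UCI dataset, but the drop list is an assumption, since this summary does not state which attributes were removed: instant is a row index, dteday is a raw date string, and casual and registered add up to the target cnt. The helper name load_rows and the two sample rows are hypothetical and cover only a subset of the columns.

```python
import csv
import io

# Assumed drop list; the attributes actually removed in the project are not stated here.
DROP_COLUMNS = {"instant", "dteday", "casual", "registered"}


def load_rows(handle, drop=DROP_COLUMNS):
    """Read CSV rows, skip the dropped columns, and convert the rest to floats."""
    reader = csv.DictReader(handle)
    return [
        {name: float(value) for name, value in row.items() if name not in drop}
        for row in reader
    ]


# Illustrative sample with a subset of the hour.csv columns, for demonstration only.
sample = io.StringIO(
    "instant,dteday,season,hr,temp,casual,registered,cnt\n"
    "1,2011-01-01,1,0,0.24,3,13,16\n"
    "2,2011-01-01,1,1,0.22,8,32,40\n"
)
rows = load_rows(sample)
```

With the real file, the same helper could be called on a handle opened with open("hour.csv", newline=""), assuming the file sits in the working directory.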

CONCLUSION: The baseline performance of predicting the target variable achieved an average RMSE value of 118 (vs. an RMSE of 1483 from the daily dataset). Three algorithms (k-Nearest Neighbors, Random Forest, and Extra Trees) achieved the lowest RMSE values during the initial modeling round. After a series of tuning trials with these three algorithms, Extra Trees produced the best RMSE value of 69 (vs. 1233 using the daily data).
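For readers unfamiliar with the metric, RMSE (root mean squared error) is the square root of the average squared difference between predicted and actual values, so lower is better. The sketch below is a minimal pure-Python version for illustration, not the code used in the project; the function name rmse and the sample counts are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("inputs must not be empty")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical rental counts, not taken from the dataset.
actual_counts = [16, 40, 32, 13, 1]
predicted_counts = [20, 36, 30, 10, 4]
score = rmse(actual_counts, predicted_counts)  # square root of 10.8, about 3.29
```

Because RMSE is expressed in the same units as the target variable, a value of 69 here means predictions are off by roughly 69 rentals in a root-mean-square sense.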

Extra Trees also processed the validation dataset with an RMSE value of 68 (vs. 1293 using the daily data), which was slightly better than the average training result. For this project, the Extra Trees ensemble algorithm yielded the best training and validation results, which justify the additional processing the algorithm requires.

Furthermore, the hourly data generally yielded significantly better RMSE values than the daily data across the algorithms. We therefore recommend training the predictive models on the hourly data whenever possible.

The HTML-formatted report can be found here on GitHub.