Regression Model for Bike Sharing Using R – Take 2

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

Dataset Used: Bike Sharing Dataset

Dataset ML Model: Regression with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset

For available performance benchmarks, please consult: https://www.kaggle.com/contactprad/bike-share-daily-data

INTRODUCTION: Using the data generated by a bike sharing system, this project attempts to predict the daily demand for bike sharing. For this iteration (Take No.2) of the project, we use the available data, transform it as necessary, and apply the Stochastic Gradient Boosting algorithm to examine its modeling effectiveness. Again, the goal of this iteration is to examine various data transformation options and find a sufficiently accurate (low error) combination for future prediction tasks.

This iteration of the project will test the following three modeling scenarios:

Scenario No.1: Remove the attribute “atemp” since it was highly correlated with the attribute “temp.”

Scenario No.2: Perform one-hot-encoding on the variable “season.”

Scenario No.3: Perform one-hot-encoding on the variable “mnth.”

For scenarios No.2 and No.3, the steps from sections No.3 and No.4 will be repeated for each scenario.
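The three transformations above can be sketched in base R. This is only an illustration under stated assumptions, not the project's actual script: the data frame is a small made-up stand-in for the real dataset, the target column name "cnt" is assumed from the UCI data dictionary, and the one_hot() helper is a hypothetical name introduced here.

```r
# Toy stand-in for the daily bike sharing data (all values are made up).
set.seed(42)
n <- 24
bikes <- data.frame(
  season = rep(1:4, each = 6),
  mnth   = rep(1:12, each = 2),
  temp   = round(runif(n, 0.1, 0.9), 2)
)
# "atemp" is built to track "temp" closely, mimicking the high correlation
# that motivates scenario No.1.
bikes$atemp <- bikes$temp * 0.9 + 0.05 + rnorm(n, sd = 0.01)
bikes$cnt   <- round(1000 + 5000 * bikes$temp)  # assumed target column name

# Scenario No.1: check the correlation, then drop "atemp".
temp_atemp_cor <- cor(bikes$temp, bikes$atemp)
scenario1 <- bikes[, setdiff(names(bikes), "atemp")]

# Hypothetical helper: one-hot encode a single column with model.matrix().
one_hot <- function(df, column) {
  f <- factor(df[[column]])
  dummies <- model.matrix(~ f - 1)              # one indicator column per level
  colnames(dummies) <- paste0(column, "_", levels(f))
  cbind(df[, setdiff(names(df), column), drop = FALSE], as.data.frame(dummies))
}

# Scenario No.2 and No.3: one-hot encode "season" and "mnth" respectively.
scenario2 <- one_hot(bikes, "season")
scenario3 <- one_hot(bikes, "mnth")

str(scenario2)
```

Each scenario produces its own data frame, so the same modeling steps can be rerun on scenario1, scenario2, and scenario3 and their error metrics compared against the baseline.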

CONCLUSION: The baseline performance of the Stochastic Gradient Boosting algorithm stands at an RMSE value of 1240 and an R-square value of 0.5943 using the training data. Scenario No.1 did slightly better, with an RMSE value of 1233 and an R-square value of 0.5991. As a result, we will use scenario No.1 to train the final model and observe how it performs on the validation dataset.

The final Stochastic Gradient Boosting model processed the validation dataset with an RMSE value of 1180 and an R-square value of 0.6320, which was slightly worse than the Take No.1 result of 1177 for RMSE and 0.6329 for R-square. For this iteration of the project, the data transformations did not noticeably improve the model performance.
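For readers unfamiliar with the two metrics quoted above, the following base R sketch shows one common way to compute them. The R-square here is taken as the squared Pearson correlation between observed and predicted values, which is an assumption about how the reported figures were produced. The function names and the sample vectors are made up for illustration.

```r
# Root mean squared error: typical size of a prediction error, in the
# same units as the target (daily rental counts).
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# R-square as the squared correlation between observed and predicted values
# (assumed definition; other definitions exist).
r_squared <- function(observed, predicted) {
  cor(observed, predicted)^2
}

# Made-up daily counts and predictions, for illustration only.
observed  <- c(1200, 3400, 4100, 5200, 2800)
predicted <- c(1500, 3000, 4500, 5000, 2600)

cat("RMSE:", rmse(observed, predicted), "\n")
cat("R-square:", r_squared(observed, predicted), "\n")
```

A lower RMSE and a higher R-square both indicate a better fit, which is why a change from 1177 to 1180 in RMSE reads as a marginal decline rather than a meaningful difference.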

The HTML formatted report can be found here on GitHub.