-
Notifications
You must be signed in to change notification settings - Fork 0
/
predicting-bikes-rental-demand.Rmd
166 lines (101 loc) · 5.19 KB
/
predicting-bikes-rental-demand.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
---
title: "Predicting Bike Shares Demand with Ridge and Lasso"
date: "`r Sys.Date()`"
output: html_document
author: "Jehangeer!"
message: false
warning: false
---
# About Data Analysis Report
This RMarkdown file contains the report of the data analysis done for the project on forecasting daily bike rental demand using time series models in R. It contains analysis such as data exploration, summary statistics and building the time series models. The final report was completed on `r date()`.
**Data Description:**
This dataset contains the daily count of rental bike transactions between years 2011 and 2012 in Capital bikeshare system with the corresponding weather and seasonal information.
**Data Source:** https://www.kaggle.com/datasets/marklvl/bike-sharing-dataset
**Relevant Paper:**
Fanaee-T, Hadi, and Gama, Joao, 'Event labeling combining ensemble detectors and background knowledge', Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg
# Load and explore the data
## Load data and install packages
```{r}
## Importing required packages
# install.packages("tidyverse")
# install.packages("tidymodels")
library(tidyverse)
library(tidymodels)
bikes_hours <- read_csv("bike-sharing-dataset/hour.csv")
glimpse(bikes_hours)
# theme set
theme_set(theme_minimal(base_size = 12, base_family = "Open Sans"))
```
## Which factors contribute to bike rental demand?
Here are the objectives of doing the advanced analysis on bike rental demand:
Attain superior accuracy in predicting bike rental demand.
Investigate whether a noteworthy correlation exists between the season and the bike rental demand.
Gain insights into how bike rental demand varies across different time segments within a day.
Determine the relationship between bike rental demand and prevailing weather conditions.
## Describe and explore the data
```{r}
# first six observations of the datasets
head(bikes_hours)
# checking missing values
sum(is.na(bikes_hours))
# summary statistics
summary(bikes_hours)
```
Here, a distinct dataframe is established to maintain data integrity, enabling easy adjustments without reloading the original dataset. This new dataframe excludes 'instant' and 'dteday' variables, as they hold no predictive value.
```{r}
# Constructing a 'bike_data' dataframe with organized data attributes
bike_data <- data.frame(count = bikes_hours$cnt,
season = as.factor(bikes_hours$season),
year = as.factor(bikes_hours$yr),
month = as.factor(bikes_hours$mnth),
hour = as.factor(bikes_hours$hr),
holi = as.factor(bikes_hours$holiday),
week = as.factor(bikes_hours$weekday),
work = as.factor(bikes_hours$workingday),
weather = as.factor(bikes_hours$weathersit),
env.temp = bikes_hours$temp,
feel.temp = bikes_hours$atemp,
humidity = bikes_hours$hum,
windspeed = bikes_hours$windspeed,
registered = bikes_hours$registered,
casual = bikes_hours$casual)
# Column names
names(bike_data)
# Glimpse of new data
glimpse(bike_data)
# Distribution of response variable.
bike_data |>
ggplot(aes(x = count)) +
geom_histogram() +
labs(title = "Distriution of Count")
```
We are considering the use of a Poisson regression model because our response variable consists of discrete numeric values, although they are no any non-negative integers, and response variable follows poisson distribution.
```{r}
# Fit a Poisson regression model
poisson_model <- glm(count ~ .,
data = bike_data,
family = "poisson")
summary(poisson_model)
# Checking over dispersion
# install.packages("AER")
library(AER)
dispersion_test <- dispersiontest(poisson_model)
print(dispersion_test)
```
The p-value is very close to zero (p-value < 2.2e-16), indicating that the true dispersion is significantly greater than 1.
Based on these results, we should consider using an alternative modeling approach that can account for over dispersion.
Negative Binomial Regression: This model is suitable for count data with overdispersion and can be a good choice in this situation.
```{r}
# install.packages("MASS")
library(MASS)
model_nb <- glm.nb(count ~ ., data = bike_data)
summary(model_nb)
```
This Negative Binomial regression model provides insights into the factors that influence bike rentals. Predictor variables as season, year, month, hour, weather conditions, and environmental factors all play a significant role in determining the number of bike rentals. Variables, like hour and temperature, have a particularly strong impact, while others, like holidays, have a less pronounced effect.
```{r}
# Calculating the AIC
aic_value <- AIC(model_nb)
# AIC value
aic_value
```
The model's AIC value of approximately 179,577.1 indicates a reasonable trade-off between goodness of fit and model complexity, suggesting its suitability for predicting bike rentals. However, further model refinement and evaluation against alternative models may be necessary to optimize performance.