From our previous data exploration and analysis, we know that the number of tickets issued differs by borough, hour, and weekday. Therefore, to predict the likely number of tickets issued around a cafe (in our case, within 100 m of the cafe), we use these variables as the predictors of interest in our model.
```r
cafe =
  read_csv("data/map_data.csv") %>%
  select(business_name, adress:n_risk) %>%
  # expand each cafe-weekday combination to all 24 hours of the day
  full_join(
    read_csv("data/map_data.csv") %>%
      distinct(business_name, weekday) %>%
      mutate(hour = list(1:24))
  ) %>%
  unnest(hour) %>%
  # join the hourly ticket counts back in
  left_join(read_csv("data/map_data.csv")) %>%
  drop_na() %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, 0))) %>%
  ungroup() %>%
  # attach borough and street information for each cafe
  left_join(
    read_csv(
      here::here("data/Sidewalk_Caf__Licenses_and_Applications_clean.csv")
    ) %>%
      select(business_name, borough, street_name)
  ) %>%
  # collapse weekdays into a workday / weekend indicator
  mutate(weekday =
           if_else(weekday %in% c("Saturday", "Sunday"),
                   "Weekend",
                   "Workday")) %>%
  select(business_name, borough:hour_day_risk)
```
Our dependent variable of interest is hour_day_risk: the number of tickets issued in this area during a given hour and weekday. Since this is a count outcome, we assume it follows a Poisson distribution, so we consider Poisson regression models for this setting.
```r
cafe %>%
  ggplot(aes(x = hour_day_risk)) +
  geom_histogram()
```
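As a quick, informal check of the Poisson assumption, we can compare the mean and variance of hour_day_risk; for a Poisson variable they should be roughly equal, while a much larger variance would point to overdispersion. A minimal sketch:

```r
# Compare mean and variance of the count outcome; a dispersion ratio far
# above 1 would suggest overdispersion (and, e.g., a negative binomial model).
cafe %>%
  summarise(
    mean_risk  = mean(hour_day_risk),
    var_risk   = var(hour_day_risk),
    dispersion = var_risk / mean_risk
  )
```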
The predictors in our model are the baseline number of tickets issued around the cafe (ticket), borough, hour, and weekday. Because the dependent variable follows a Poisson distribution, we propose three models for prediction:
```r
set.seed(1)

cafe_md =
  cafe %>%
  sample_n(size = 20000) %>%
  # Monte Carlo cross-validation: 30 splits, 40% held out as test data
  modelr::crossv_mc(n = 30, test = 0.4) %>%
  mutate(train = map(train, as_tibble),
         test = map(test, as_tibble)) %>%
  mutate(
    # Poisson GLM with main effects only
    md_glm =
      map(
        .x = train,
        ~ glm(
          hour_day_risk ~ ticket + borough + hour + weekday,
          family = poisson(),
          data = .x
        )
      ),
    # Poisson GLM with a borough-by-hour interaction
    md_glm_int =
      map(
        .x = train,
        ~ glm(
          hour_day_risk ~ ticket + borough * hour + weekday,
          family = poisson(),
          data = .x
        )
      ),
    # Poisson GAM with smooth terms for hour and ticket
    md_gam =
      map(
        .x = train,
        ~ mgcv::gam(
          hour_day_risk ~ s(hour, k = 3) + s(ticket, k = 5) + borough + weekday,
          family = poisson(),
          data = .x
        )
      )
  ) %>%
  select(-train)
```
```r
md = cafe_md %>%
  select(starts_with("md")) %>%
  pivot_longer(
    starts_with("md"),
    names_to = "model",
    values_to = "regression",
    names_prefix = "md_"
  ) %>%
  # keep one fitted model of each type for display
  distinct(model, .keep_all = TRUE) %>%
  mutate(regression = map(regression, broom::tidy))
```
```r
md %>%
  filter(model == "glm") %>%
  unnest(regression) %>%
  knitr::kable(caption = "glm")
```
model | term | estimate | std.error | statistic | p.value |
---|---|---|---|---|---|
glm | (Intercept) | 1.589 | 0.026 | 61.083 | 0.00 |
glm | ticket | 0.000 | 0.000 | 169.589 | 0.00 |
glm | boroughBrooklyn | -0.190 | 0.026 | -7.421 | 0.00 |
glm | boroughManhattan | -0.008 | 0.024 | -0.319 | 0.75 |
glm | boroughQueens | 0.152 | 0.025 | 6.006 | 0.00 |
glm | hour | -0.025 | 0.001 | -44.725 | 0.00 |
glm | weekdayWorkday | 0.762 | 0.008 | 91.984 | 0.00 |
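Because the Poisson GLM uses a log link, the coefficients are easier to read after exponentiating them into rate ratios. The sketch below reuses the tidied output stored in md; for example, exp(0.762) ≈ 2.14 for weekdayWorkday means roughly twice as many tickets per hour on workdays as on weekends, holding the other predictors fixed.

```r
# Exponentiate the Poisson GLM coefficients to obtain incidence rate ratios
md %>%
  filter(model == "glm") %>%
  unnest(regression) %>%
  mutate(rate_ratio = exp(estimate)) %>%
  select(term, estimate, rate_ratio)
```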
```r
md %>%
  filter(model == "glm_int") %>%
  unnest(regression) %>%
  knitr::kable(caption = "glm_int")
```
model | term | estimate | std.error | statistic | p.value |
---|---|---|---|---|---|
glm_int | (Intercept) | 2.026 | 0.074 | 27.55 | 0.000 |
glm_int | ticket | 0.000 | 0.000 | 169.75 | 0.000 |
glm_int | boroughBrooklyn | -0.714 | 0.078 | -9.22 | 0.000 |
glm_int | boroughManhattan | -0.459 | 0.074 | -6.24 | 0.000 |
glm_int | boroughQueens | -0.084 | 0.078 | -1.08 | 0.279 |
glm_int | hour | -0.062 | 0.006 | -10.34 | 0.000 |
glm_int | weekdayWorkday | 0.762 | 0.008 | 91.87 | 0.000 |
glm_int | boroughBrooklyn:hour | 0.044 | 0.006 | 7.02 | 0.000 |
glm_int | boroughManhattan:hour | 0.038 | 0.006 | 6.33 | 0.000 |
glm_int | boroughQueens:hour | 0.020 | 0.006 | 3.22 | 0.001 |
```r
md %>%
  filter(model == "gam") %>%
  unnest(regression) %>%
  knitr::kable(caption = "gam")
```
model | term | edf | ref.df | statistic | p.value |
---|---|---|---|---|---|
gam | s(hour) | 2.00 | 2 | 24444 | 0 |
gam | s(ticket) | 3.99 | 4 | 35335 | 0 |
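To see how much nonlinearity the smooth terms actually capture, we can pull one fitted GAM out of the cross-validation object and inspect it directly; a minimal sketch using the first split only:

```r
# Inspect one fitted GAM (first cross-validation split): summary() reports the
# effective degrees of freedom, and plot() shows the partial effects of the smooths.
gam_fit = cafe_md$md_gam[[1]]
summary(gam_fit)
plot(gam_fit, pages = 1)
```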
Looking at the three models, nearly all of the predictor terms we chose are statistically significant, so we keep this set of predictors. Comparing the models on held-out data (below), the root mean squared errors are similar across all three. First, this suggests the interaction terms do not add much predictive ability; second, the first-order generalized linear model predicts about as well as the generalized additive model with smooth terms. Given these results, and taking computation and optimization into account, we think the generalized linear model is the most useful choice for predicting the number of tickets in an area at a given time and day for our app.
```r
# Compute test-set RMSE for each model on every cross-validation split
cafe_md %>%
  mutate(
    rmse_glm_int =
      map2_dbl(.y = test,
               .x = md_glm_int,
               ~ modelr::rmse(.x, .y)),
    rmse_glm =
      map2_dbl(.y = test,
               .x = md_glm,
               ~ modelr::rmse(.x, .y)),
    rmse_gam =
      map2_dbl(.y = test,
               .x = md_gam,
               ~ modelr::rmse(.x, .y))
  ) %>%
  select(starts_with("rmse")) %>%
  pivot_longer(
    starts_with("rmse"),
    names_to = "model",
    values_to = "rmse",
    names_prefix = "rmse_"
  ) %>%
  ggplot(aes(x = model, y = rmse, fill = model)) +
  geom_violin(alpha = 0.6)
```
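Besides the violin plot, a small numeric summary makes the comparison concrete. This is the same RMSE computation as above, just ending in a summary table rather than a plot (a sketch):

```r
# Mean and spread of the cross-validated RMSE for each model
cafe_md %>%
  mutate(
    rmse_glm     = map2_dbl(md_glm,     test, modelr::rmse),
    rmse_glm_int = map2_dbl(md_glm_int, test, modelr::rmse),
    rmse_gam     = map2_dbl(md_gam,     test, modelr::rmse)
  ) %>%
  select(starts_with("rmse")) %>%
  pivot_longer(
    everything(),
    names_to = "model",
    values_to = "rmse",
    names_prefix = "rmse_"
  ) %>%
  group_by(model) %>%
  summarise(mean_rmse = mean(rmse), sd_rmse = sd(rmse))
```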
Hour, weekday, borough, and the baseline total number of tickets issued nearby are all useful predictors of the expected number of tickets issued in a certain hour of a day.
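For the app, the chosen GLM can be refit on the full cafe data and used to return an expected ticket count for a given borough, hour, and weekday. A minimal sketch, where the input values (a Manhattan cafe at 2 pm on a workday with a baseline of 500 nearby tickets) are purely illustrative:

```r
# Refit the selected Poisson GLM on the full data set
final_glm =
  glm(
    hour_day_risk ~ ticket + borough + hour + weekday,
    family = poisson(),
    data = cafe
  )

# Hypothetical scenario (values chosen only for illustration)
new_cafe = tibble(
  ticket  = 500,
  borough = "Manhattan",
  hour    = 14,
  weekday = "Workday"
)

# Expected number of tickets for that hour and weekday
predict(final_glm, newdata = new_cafe, type = "response")
```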