From our previous data exploration and analysis, we know that the number of tickets issued differs by borough, hour, and weekday. Therefore, to predict the likely number of tickets issued around a cafe (in our case, within 100 m of the cafe), we use these variables as the predictors of interest in our model.
```r
cafe =
  read_csv("data/map_data.csv") %>%
  select(business_name, adress:n_risk) %>%
  # expand each cafe-weekday combination to all 24 hours of the day
  full_join(
    read_csv("data/map_data.csv") %>%
      distinct(business_name, weekday) %>%
      mutate(hour = list(1:24))
  ) %>%
  unnest(hour) %>%
  # join the hourly ticket counts back in
  left_join(read_csv("data/map_data.csv")) %>%
  drop_na() %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, 0))) %>%
  ungroup() %>%
  # attach borough and street information for each cafe
  left_join(
    read_csv(
      here::here("data/Sidewalk_Caf__Licenses_and_Applications_clean.csv")
    ) %>%
      select(business_name, borough, street_name)
  ) %>%
  # collapse weekdays into a workday / weekend indicator
  mutate(weekday =
           if_else(weekday %in% c("Saturday", "Sunday"),
                   "Weekend",
                   "Workday")) %>%
  select(business_name, borough:hour_day_risk)
```
Our dependent variable of interest is hour_day_risk: the number of tickets issued in this area during a given hour and weekday. Since this is a count outcome, we assume it follows a Poisson distribution, so we consider Poisson regression models for this setting.
```r
cafe %>%
  ggplot(aes(x = hour_day_risk)) +
  geom_histogram()
```
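As a quick, informal check of the Poisson assumption, we can compare the mean and variance of hour_day_risk; for a Poisson variable they should be roughly equal, while a much larger variance would point to overdispersion. A minimal sketch:

```r
# Compare mean and variance of the count outcome; a dispersion ratio far
# above 1 would suggest overdispersion (and, e.g., a negative binomial model).
cafe %>%
  summarise(
    mean_risk  = mean(hour_day_risk),
    var_risk   = var(hour_day_risk),
    dispersion = var_risk / mean_risk
  )
```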
The predictors in our model are the baseline number of tickets issued around the cafe (ticket), borough, hour, and weekday. Because the dependent variable follows a Poisson distribution, we propose three models for prediction:
```r
set.seed(1)

cafe_md =
  cafe %>%
  sample_n(size = 20000) %>%
  # Monte Carlo cross-validation: 30 splits, 40% held out as test data
  modelr::crossv_mc(n = 30, test = 0.4) %>%
  mutate(train = map(train, as_tibble),
         test = map(test, as_tibble)) %>%
  mutate(
    # Poisson GLM with main effects only
    md_glm =
      map(
        .x = train,
        ~ glm(
          hour_day_risk ~ ticket + borough + hour + weekday,
          family = poisson(),
          data = .x
        )
      ),
    # Poisson GLM with a borough-by-hour interaction
    md_glm_int =
      map(
        .x = train,
        ~ glm(
          hour_day_risk ~ ticket + borough * hour + weekday,
          family = poisson(),
          data = .x
        )
      ),
    # Poisson GAM with smooth terms for hour and ticket
    md_gam =
      map(
        .x = train,
        ~ mgcv::gam(
          hour_day_risk ~ s(hour, k = 3) + s(ticket, k = 5) + borough + weekday,
          family = poisson(),
          data = .x
        )
      )
  ) %>%
  select(-train)
```
```r
md = cafe_md %>%
  select(starts_with("md")) %>%
  pivot_longer(
    starts_with("md"),
    names_to = "model",
    values_to = "regression",
    names_prefix = "md_"
  ) %>%
  # keep one fitted model of each type for display
  distinct(model, .keep_all = TRUE) %>%
  mutate(regression = map(regression, broom::tidy))
```
```r
md %>%
  filter(model == "glm") %>%
  unnest(regression) %>%
  knitr::kable(caption = "glm")
```
model | term | estimate | std.error | statistic | p.value |
---|---|---|---|---|---|
glm | (Intercept) | 1.589 | 0.026 | 61.083 | 0.00 |
glm | ticket | 0.000 | 0.000 | 169.589 | 0.00 |
glm | boroughBrooklyn | -0.190 | 0.026 | -7.421 | 0.00 |
glm | boroughManhattan | -0.008 | 0.024 | -0.319 | 0.75 |
glm | boroughQueens | 0.152 | 0.025 | 6.006 | 0.00 |
glm | hour | -0.025 | 0.001 | -44.725 | 0.00 |
glm | weekdayWorkday | 0.762 | 0.008 | 91.984 | 0.00 |
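Because the Poisson GLM uses a log link, the coefficients are easier to read after exponentiating them into rate ratios. The sketch below reuses the tidied output stored in md; for example, exp(0.762) ≈ 2.14 for weekdayWorkday means roughly twice as many tickets per hour on workdays as on weekends, holding the other predictors fixed.

```r
# Exponentiate the Poisson GLM coefficients to obtain incidence rate ratios
md %>%
  filter(model == "glm") %>%
  unnest(regression) %>%
  mutate(rate_ratio = exp(estimate)) %>%
  select(term, estimate, rate_ratio)
```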
```r
md %>%
  filter(model == "glm_int") %>%
  unnest(regression) %>%
  knitr::kable(caption = "glm_int")
```
model | term | estimate | std.error | statistic | p.value |
---|---|---|---|---|---|
glm_int | (Intercept) | 2.026 | 0.074 | 27.55 | 0.000 |
glm_int | ticket | 0.000 | 0.000 | 169.75 | 0.000 |
glm_int | boroughBrooklyn | -0.714 | 0.078 | -9.22 | 0.000 |
glm_int | boroughManhattan | -0.459 | 0.074 | -6.24 | 0.000 |
glm_int | boroughQueens | -0.084 | 0.078 | -1.08 | 0.279 |
glm_int | hour | -0.062 | 0.006 | -10.34 | 0.000 |
glm_int | weekdayWorkday | 0.762 | 0.008 | 91.87 | 0.000 |
glm_int | boroughBrooklyn:hour | 0.044 | 0.006 | 7.02 | 0.000 |
glm_int | boroughManhattan:hour | 0.038 | 0.006 | 6.33 | 0.000 |
glm_int | boroughQueens:hour | 0.020 | 0.006 | 3.22 | 0.001 |
```r
md %>%
  filter(model == "gam") %>%
  unnest(regression) %>%
  knitr::kable(caption = "gam")
```
model | term | edf | ref.df | statistic | p.value |
---|---|---|---|---|---|
gam | s(hour) | 2.00 | 2 | 24444 | 0 |
gam | s(ticket) | 3.99 | 4 | 35335 | 0 |
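To see how much nonlinearity the smooth terms actually capture, we can pull one fitted GAM out of the cross-validation object and inspect it directly; a minimal sketch using the first split only:

```r
# Inspect one fitted GAM (first cross-validation split): summary() reports the
# effective degrees of freedom, and plot() shows the partial effects of the smooths.
gam_fit = cafe_md$md_gam[[1]]
summary(gam_fit)
plot(gam_fit, pages = 1)
```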
Looking at the three models, nearly all of the predictor terms we chose are statistically significant, so we keep this set of predictors. Comparing the models on held-out data (below), the root mean squared errors are similar across all three. First, this suggests the interaction terms do not add much predictive ability; second, the first-order generalized linear model predicts about as well as the generalized additive model with smooth terms. Given these results, and taking computation and optimization into account, we think the generalized linear model is the most useful choice for predicting the number of tickets in an area at a given time and day for our app.
```r
# Compute test-set RMSE for each model on every cross-validation split
cafe_md %>%
  mutate(
    rmse_glm_int =
      map2_dbl(.y = test,
               .x = md_glm_int,
               ~ modelr::rmse(.x, .y)),
    rmse_glm =
      map2_dbl(.y = test,
               .x = md_glm,
               ~ modelr::rmse(.x, .y)),
    rmse_gam =
      map2_dbl(.y = test,
               .x = md_gam,
               ~ modelr::rmse(.x, .y))
  ) %>%
  select(starts_with("rmse")) %>%
  pivot_longer(
    starts_with("rmse"),
    names_to = "model",
    values_to = "rmse",
    names_prefix = "rmse_"
  ) %>%
  ggplot(aes(x = model, y = rmse, fill = model)) +
  geom_violin(alpha = 0.6)
```
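Besides the violin plot, a small numeric summary makes the comparison concrete. This is the same RMSE computation as above, just ending in a summary table rather than a plot (a sketch):

```r
# Mean and spread of the cross-validated RMSE for each model
cafe_md %>%
  mutate(
    rmse_glm     = map2_dbl(md_glm,     test, modelr::rmse),
    rmse_glm_int = map2_dbl(md_glm_int, test, modelr::rmse),
    rmse_gam     = map2_dbl(md_gam,     test, modelr::rmse)
  ) %>%
  select(starts_with("rmse")) %>%
  pivot_longer(
    everything(),
    names_to = "model",
    values_to = "rmse",
    names_prefix = "rmse_"
  ) %>%
  group_by(model) %>%
  summarise(mean_rmse = mean(rmse), sd_rmse = sd(rmse))
```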
Hour, weekday, borough, and the baseline total number of tickets issued nearby are all useful predictors of the expected number of tickets issued in a certain hour of a day.
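For the app, the chosen GLM can be refit on the full cafe data and used to return an expected ticket count for a given borough, hour, and weekday. A minimal sketch, where the input values (a Manhattan cafe at 2 pm on a workday with a baseline of 500 nearby tickets) are purely illustrative:

```r
# Refit the selected Poisson GLM on the full data set
final_glm =
  glm(
    hour_day_risk ~ ticket + borough + hour + weekday,
    family = poisson(),
    data = cafe
  )

# Hypothetical scenario (values chosen only for illustration)
new_cafe = tibble(
  ticket  = 500,
  borough = "Manhattan",
  hour    = 14,
  weekday = "Workday"
)

# Expected number of tickets for that hour and weekday
predict(final_glm, newdata = new_cafe, type = "response")
```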