Predicting Model For Tickets Around The Cafe

Data

From previously data exploration and data analysis, we know that the tickets issue is different in borough, hour and weekdays.

Therefore, for us to predicting the possible amount of ticket around certain area of a cafe( in our case, 100m around the cafe), we use these variables as our interested predictors for our model.

cafe =
  read_csv("data/map_data.csv") %>%
  select(business_name, adress:n_risk) %>%
  full_join(
    read_csv("data/map_data.csv") %>%
      distinct(business_name, weekday) %>%
      mutate(hour = list(1:24))
  ) %>%
  unnest(hour) %>%
  left_join(read_csv("data/map_data.csv")) %>%
  drop_na()%>% 
  mutate(across(where(is.numeric), replace_na, replace = 0)) %>%
  ungroup() %>%
  left_join(read_csv(
    here::here("data/Sidewalk_Caf__Licenses_and_Applications_clean.csv")
  ) %>%
    select(business_name, borough, street_name)) %>%
  mutate(weekday =
           if_else(weekday %in% c("Saturday", "Sunday"),
                   "Weekend",
                   "Workday"))%>%
  select(business_name,borough:hour_day_risk)

Our interested dependent is hour_day_risk: the amount of tickets issue in this area of certain hour and weekday. We can interpret that this dependent is following poison distribution, so we need to consider regression model for this settings.

cafe %>% 
  ggplot(aes(x = hour_day_risk))+geom_histogram()

And here’s the predictors in our model:

  • hour : integer reflecting the 24 hour format of hour of the day;
  • weekday : factors reflecting either it is a workday or weekend;
  • borough : factors reflecting which borough this area is in;
  • tickets : float reflecting the baseline total issue tickets of the area;

Fitting Model

Because of the poison distribution of the dependent, we propose 3 models for prediction:

  1. General Model of hour_day_risk ~ ticket + borough + hour + weekday with poison() family
  2. General Model of _hour_day_risk ~ ticket + borough*hour + weekday_ with poison() family, assuming that there are interaction
  3. General Addiction Model of hour_day_risk ~ s(ticket) + s(hour) + borough + weekday with poison() family and addressing that most tickets is issued during office hour.
set.seed(1)
cafe_md =
  cafe %>% 
  sample_n(size = 20000) %>%
  modelr::crossv_mc(n = 30, test = 0.4) %>%
  mutate(train = map(train, as.tibble), 
         test = map(test, as.tibble)) %>% 
  mutate(
    md_glm =
      map(
        .x = train,
        ~ glm(
          (hour_day_risk) ~
            ticket + borough + hour + weekday,
          family = poisson(),
          data = .x
        )
      ),
    md_glm_int =
      map(
        .x = train,
        ~ glm(
          (hour_day_risk) ~
            ticket + borough * hour + weekday,
          family = poisson(),
          data = .x
        )
      ),
    md_gam =
      map(
        .x = train,
        ~ mgcv::gam(
          (hour_day_risk) ~
            s(hour,k=3) + s(ticket, k = 5) + borough +  weekday,
          family = poisson(),
          data = .x
        )
      )
  ) %>%
  select(-train)
md = cafe_md %>% 
  select(starts_with("md")) %>% 
  pivot_longer(
    starts_with("md"),
    names_to = "model",
    values_to = "regression",
    names_prefix = "md_"
  ) %>% 
  distinct(model,.keep_all = T) %>% 
  mutate(regression = map(regression,broom::tidy))

md %>%
  filter(model == "glm") %>%
  unnest(regression) %>%
  knitr::kable(caption = "glm")
glm
model term estimate std.error statistic p.value
glm (Intercept) 1.589 0.026 61.083 0.00
glm ticket 0.000 0.000 169.589 0.00
glm boroughBrooklyn -0.190 0.026 -7.421 0.00
glm boroughManhattan -0.008 0.024 -0.319 0.75
glm boroughQueens 0.152 0.025 6.006 0.00
glm hour -0.025 0.001 -44.725 0.00
glm weekdayWorkday 0.762 0.008 91.984 0.00
md %>%
  filter(model == "glm_int") %>%
  unnest(regression) %>%
  knitr::kable(caption = "glm_in")
glm_in
model term estimate std.error statistic p.value
glm_int (Intercept) 2.026 0.074 27.55 0.000
glm_int ticket 0.000 0.000 169.75 0.000
glm_int boroughBrooklyn -0.714 0.078 -9.22 0.000
glm_int boroughManhattan -0.459 0.074 -6.24 0.000
glm_int boroughQueens -0.084 0.078 -1.08 0.279
glm_int hour -0.062 0.006 -10.34 0.000
glm_int weekdayWorkday 0.762 0.008 91.87 0.000
glm_int boroughBrooklyn:hour 0.044 0.006 7.02 0.000
glm_int boroughManhattan:hour 0.038 0.006 6.33 0.000
glm_int boroughQueens:hour 0.020 0.006 3.22 0.001
md %>%
  filter(model == "gam") %>%
  unnest(regression) %>%
  knitr::kable(caption = "gam")
gam
model term edf ref.df statistic p.value
gam s(hour) 2.00 2 24444 0
gam s(ticket) 3.99 4 35335 0

Cross-validation for model

 Looking at all model, all predictors we chose are statistically significant.so we stick to these model predictor for our model. and looking at all three model, the residual mean sum of square are similar across 3 model. So firstly, interaction terms are not contributing to model predictability. And secondly,first order general linear model provide similar predictability to 5 order general addiction model. Given above consideration, and taking computing and optimization into consideration, we think the general linear model would be useful for predicting ticket of a area in certain time and day for our app.

cafe_md %>% 
  mutate(
    rmse_glm_int = 
      map2_dbl(.y = test,
          .x = md_glm_int,
          ~modelr::rmse(.x,.y)),
     rmse_glm = 
      map2_dbl(.y = test,
          .x = md_glm,
          ~modelr::rmse(.x,.y)),
     rmse_gam = 
      map2_dbl(.y = test,
          .x = md_gam,
          ~modelr::rmse(.x,.y))
  ) %>% 
  select(starts_with("rmse")) %>% 
  pivot_longer(
    starts_with("rmse"),
    names_to = "model",
    values_to = "rmse",
    names_prefix = "rmse_"
  ) %>% 
  ggplot(aes(x = model,y = rmse,fill = model))+
  geom_violin(alpha = 0.6)

Model Conclusion

 hour, weekdays, borough, and the baseline total amount of tickets issue are all good predictor for expecting amount of ticket issue in a certain hour of a day.