CovReg.Rmd

---
title: "STAT 512 Final Project Team 11"
author: "Nick Gunady"
date: "12/4/2021"
output: html_document
knit: (function(input_file, encoding) {
  out_dir <- 'docs';
  rmarkdown::render(input_file,
 encoding=encoding,
 output_file=file.path(dirname(input_file), out_dir, 'index.html'))})
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
rm(list = ls())
filedir <- getwd()
```

```{r packages, echo=FALSE}
library(magrittr)
library(tinytex)
library(ALSM)
library(stringr)
library(stats)
library(caret)

#plotly Regression packages
library(reshape2) # to load tips data
library(tidymodels) # for the fit() function
library(plotly)
library(kernlab)
library(pracma) #for meshgrid()

library(olsrr) #cooks residuals
library(lmtest) #for bp test
```

CA demographics dataset can be downloaded from: https://www.countyhealthrankings.org/app/california/2021/downloads

```{r data wrangling}
#main CA demographics dataset
demo <- read.csv(file="demographics.csv")

 #main working df
df.demo <- demo[2:58,c(3,4,7,8,9,10,11,14,17,18,19,20,21,22,23)]
df.demo[,c(2,4,6,8,12)] <- df.demo[,c(2,4,6,8,12)] / 100 #change % columns to decimal

#main covid case data from jhu.edu
covid <- read.csv(file="CA-2020-cov-data.csv") %>% dplyr::rename(Cases = Case.Count.2020)

df.main <- left_join(df.demo, covid, by="County") %>% dplyr::select(Cases,County,-X,everything())
df.main <- df.main[,1:16]
```

```{r initial plotting}
plot(df.main)
```
Experiment 3: Full MLR
h0 = b1 = b2 = b3 = b4 = b5 = b6 = 0
ha = at least one point estimate is not 0
``` {r full MLR}
Y <- df.main$Cases -> Confirmed.COVID19.Cases
X1 <- df.main$NUM.FOOD.INSECURE -> Food.Insecurity
X2 <- df.main$PERCENT.ADULT.DIABETES -> Percent.Adult.Diabetes
X3 <- df.main$PERCENT.OBESITY -> Percent.Adult.Obesity
X4 <- df.main$PERCENT.LIMITED.ACCESS -> Access.to.Healthy.Food
X5 <- df.main$MEDIAN.HOUSEHOLD.INCOME -> Median.Household.Income
X6 <- df.main$AVG.DAILY.PM2.5 -> Air.Pollution

df.full <- cbind(Y,X1,X2,X3,X4,X5,X6)
df.full1 <- cbind(Confirmed.COVID19.Cases,Food.Insecurity,Percent.Adult.Diabetes,Percent.Adult.Obesity,Access.to.Healthy.Food,Median.Household.Income,Air.Pollution) %>% as.data.frame

mod.full <- lm(Confirmed.COVID19.Cases~Food.Insecurity+Percent.Adult.Diabetes+Percent.Adult.Obesity+Access.to.Healthy.Food+Median.Household.Income+Air.Pollution, df.full1)

summary(mod.full)
anova(mod.full)

#Plot all MLR Variables
plot(df.full1)

```

From the ANOVA results of the full MLR, it is clear further model refinement is required. Therefore, we plot the MLR model criteria with plotmodel.s function, part of the ALSM package. Additionally, the Best Subsets Algorithm (also from ALSM) is utilized to determine the best model for this experiment.
```{r full MLR model selection}
#Plot Criterias for Model Selection
plotmodel.s(df.full1[,2:7], df.full1$Confirmed.COVID19.Cases)
```
Using the best subsets algorithm, we determine which models to analyze with the k fold analysis.
```{r MLR BestSub}

#Best Subsets Algorithm for Model Selection
bs.full <- BestSub(df.full1[2:7],df.full1$Confirmed.COVID19.Cases,num=1) %>% as.data.frame
bs.full

write.csv(bs.full, "BestSubsets.csv")
```
As we can see from the output of the best subsets algorithm, three models emerge as the top contenders. Step Model 1 is defined from row 2, with the lowest Cp, AICp, SBCp and PRESSp.
Step Model 2 is defined from row 4 with the highest r2.adj.
Step Model 3 is defined from row 6 with the highest r2.

The following code uses the K Fold Cross Validation technique to analyze the three model choices.
``` {r MLR Cross Validation}
#K Fold Cross Validation
set.seed(123) #set seed for reproducibility
train.control = trainControl(method='cv', number=10) #10-fold cross validation

#Step-model 1 (p = 3, row 2)
step.model1 = train(Y~X1+X5, data = df.full, method='leapBackward',
                    tuneGrid=data.frame(nvmax=2),
                    trControl=train.control)

#Step-model 2 (p = 5, row 4)
step.model2 = train(Y~X1+X4+X5+X6, data = df.full, method='leapBackward',
                    tuneGrid=data.frame(nvmax=4),
                    trControl=train.control)

#Step-model 3 (p = 7, row 6)
step.model3 = train(Y~X1+X2+X3+X4+X5+X6, data = df.full, method='leapBackward',
                    tuneGrid=data.frame(nvmax=6),
                    trControl=train.control)

step.models <- rbind(step.model1$results,step.model2$results,step.model3$results)
step.models$Model.Number <- c(1,2,3)
step.models <- step.models %>% select(Model.Number,everything())
step.models
```
We observe that Step Model 1 has the lowest RMSE, thus this becomes our model where Confirmed COVID-19 cases are described by explanatory variables X1(food insecurity) and X5 (median household income).

Now, we perform model diagnostics on the selected model.

```{r MLR diagnostics}
#Initialize Step Model 3 as lm
Y <- df.main$Cases -> Confirmed.COVID19.Cases
X1 <- df.main$NUM.FOOD.INSECURE -> Food.Insecurity
X5 <- df.main$MEDIAN.HOUSEHOLD.INCOME -> Median.Household.Income
County <- df.main$County
df.mlr <- cbind(Confirmed.COVID19.Cases,Food.Insecurity,Median.Household.Income) %>% as.data.frame
mod.mlr <- lm(Confirmed.COVID19.Cases~Food.Insecurity+Median.Household.Income,df.mlr)

summary(mod.mlr)
anova(mod.mlr)

#Plot mod.mlr
plot(mod.mlr)
#Residual Plots on Step Model 3
residualPlot(mod.mlr)
#Plot Partial Regression Plots (AV Plots) for each predictor in Step Model 3
avPlots(mod.mlr)
```
From the plots generated, we observe possible outliers graphically in our dataset. We use outlier methods to confirm this through analytical methods.

```{r MLR outliers}
#BoxPlot of MLR data for step model 3
fig.mlr.bp <- plot_ly(y=~df.mlr$Confirmed.COVID19.Cases,type='box',name='MLR data')
fig.mlr.bp
# boxplot(df.mlr$Confirmed.COVID19.Cases,
#         ylab = "Confirmed COVID-19 Cases")

#boxplot statistics for outlier detection
mlr.outliers <- boxplot.stats(df.mlr$Confirmed.COVID19.Cases)$out
mlr.outlier.index <- which(mlr.outliers %in% c(mlr.outliers))

#studentized residuals for outlier analysis
resid.student <- rstudent(mod.mlr)

#Bonferroni T Value
alpha = 0.1
n = nrow(df.mlr)
p = bs.full[2,"p"]
t.out <- qt(1-alpha/2,n-1-p) #t value for outlier detection

df.main[which(resid.student > t.out),]
# c("County","Cases","Food.Insecurity","Percent.Adult.Diabetes","Percent.Adult.Obesity","Access.to.Healthy.Food","Median.Household.Income","Air.Pollution")
#outliers identified with 6 explanatory variables
df.full1[which(resid.student > t.out),]
df.out <- cbind(df.main[which(resid.student > t.out),"County"],df.full1[which(resid.student > t.out),])
colnames(df.out)[1] <- "County"

df.out[,c("County","Confirmed.COVID19.Cases","Food.Insecurity","Median.Household.Income")]
```
Since boxplot.stats did not yield a better result by removing outliers identified through that method, we use quantile method for alpha of 0.025 to remove outliers.
```{r MLR outlier cleansing}
#define lower and upper bounds for quantile method
mlr.lb <- quantile(df.mlr$Confirmed.COVID19.Cases,0.05)
mlr.ub <- quantile(df.mlr$Confirmed.COVID19.Cases,0.95)

#define index of values outside of quantile bounds
mlr.quant.index <- which(df.mlr$Confirmed.COVID19.Cases < mlr.lb | df.mlr$Confirmed.COVID19.Cases > mlr.ub)

#print outliers identified
df.main[mlr.quant.index,]

#second cleansed df without outliers, quantile method
df.mlr.c <- df.mlr[-c(mlr.quant.index),]

#boxplot to verify cleansed df
fig.mlr.bp2 <- plot_ly(y=~df.mlr.c$Confirmed.COVID19.Cases,type='box',name='MLR data')
fig.mlr.bp2

#outliers from boxplot statistics
bp.out <- boxplot.stats(df.mlr.c$Confirmed.COVID19.Cases)$out
bp.out.index <- which(df.mlr.c$Confirmed.COVID19.Cases %in% c(bp.out))

#remove 3 remaining y outliers
df.mlr.c <- df.mlr.c[-bp.out.index,]
```
With this quantile method, we remove outliers of several major counties. We attempted to use alpha = 0.025, 0.05, and 0.1. When using 0.025, too many outliers still existed, but when we used 0.1, too many large CA counties were excluded for our liking. Thus we use alpha = 0.05 for this quantile method.

```{r MLR diagnostics2}
#Initialize Step Model 3 as lm
Y <- df.mlr.c$Confirmed.COVID19.Cases -> Confirmed.COVID19.Cases
X1 <- df.mlr.c$Food.Insecurity -> Food.Insecurity
X5 <- df.mlr.c$Median.Household.Income -> Median.Household.Income

mod.mlr.c <- lm(Confirmed.COVID19.Cases~Food.Insecurity+Median.Household.Income,df.mlr.c)

summary(mod.mlr.c)
anova(mod.mlr.c)

#Plot mod.mlr
plot(mod.mlr.c)
#Residual Plots on Step Model 3
residualPlot(mod.mlr.c)
#Plot Partial Regression Plots (AV Plots) for each predictor in Step Model 3
avPlots(mod.mlr.c)
```
Significant curvature exists in the residual plot. Hat leverage metric for x outliers
```{r MLR x outliers}
#hat leverage
h = p/n
h
hat.vals <- lm.influence(mod.mlr.c)$hat
hat.index <- which(hat.vals > 2*h)

#identify x outliers
df.main[hat.index,]

#remove x outliers
df.mlr.c <- df.mlr.c[-hat.index,]
df.mlr[hat.index,]
df.out2 <- cbind(df.main[hat.index,"County"], df.mlr[hat.index,])

colnames(df.out2)[1] <- "County"
df.out2
```

```{r MLR diagnostics3}
#Initialize Step Model 3 as lm
Y <- df.mlr.c$Confirmed.COVID19.Cases -> Confirmed.COVID19.Cases
X1 <- df.mlr.c$Food.Insecurity -> Food.Insecurity
X5 <- df.mlr.c$Median.Household.Income -> Median.Household.Income

mod.mlr.c <- lm(Confirmed.COVID19.Cases~Food.Insecurity+Median.Household.Income,df.mlr.c)

summary(mod.mlr.c)
anova(mod.mlr.c)

#Plot mod.mlr
plot(mod.mlr.c)
#Residual Plots on Step Model 3
residualPlot(mod.mlr.c)
#Plot Partial Regression Plots (AV Plots) for each predictor in Step Model 3
avPlots(mod.mlr.c)
```
Now that we have confirmed that our new dataset and model are acceptable from the standpoint of outliers, we examine our model for multicollinearity with the vif function from the cars package.
```{r MLR multicollinearity}
#plot dataset to visually analyze for multicollinearity
plot(df.mlr.c)
#Test for multicollinearity with vif
car::vif(mod.mlr.c)
```
```{r MLR Tests}
#Shapiro Test for Non-normality violation mitigation
shapiro.test(df.mlr$Confirmed.COVID19.Cases)

shapiro.test(df.mlr.c$Confirmed.COVID19.Cases)

#BP tests
bptest(mod.mlr)

bptest(mod.mlr.c)
```
Now that we have acceptable model for the MLR, we plot in 3d space.

```{r MLR visualization}
# mesh_size <- .02
# margin <- 0
# X <- df.mlr.c %>% select(Food.Insecurity,Median.Household.Income) /1000
# y <- df.mlr.c %>% select(Confirmed.COVID19.Cases) /1000
#
# model <- svm_rbf(cost = 1.0) %>%
#   set_engine("kernlab") %>%
#   set_mode("regression") %>%
#   fit(Confirmed.COVID19.Cases~Food.Insecurity+Median.Household.Income,data=df.mlr.c)
#
# x_min <- min(X$Food.Insecurity) - margin
# x_max <- max(X$Food.Insecurity) - margin
# y_min <- min(X$Median.Household.Income) - margin
# y_max <- max(X$Median.Household.Income) - margin
# xrange <- seq(x_min, x_max, mesh_size)
# yrange <- seq(y_min, y_max, mesh_size)
# xy <- pracma::meshgrid(x = xrange, y = yrange)
# xx <- xy$X
# yy <- xy$Y
# dim_val <- dim(xx)
# xx1 <- matrix(xx, length(xx), 1)
# yy1 <- matrix(yy, length(yy), 1)
# final <- cbind(xx1, yy1)
# pred <- model %>%
#   predict(final)
#
# pred <- pred$.pred
# pred <- matrix(pred, dim_val[1], dim_val[2])
#
# #axes bounds
# axx <- list(
#   nticks = 4,
#   range = c(0,150)
# )
#
# axy <- list(
#   nticks = 4,
#   range = c(0,150)
# )
#
# axz <- list(
#   nticks = 4,
#   range = c(0,4100)
# )
#
# fig <- plot_ly(df.mlr.c, x = ~Food.Insecurity/1000, y = ~Median.Household.Income/1000, z = ~Confirmed.COVID19.Cases/1000 ) %>%
#   add_markers(size = 5) %>%
#   add_surface(x=xrange, y=yrange, z=pred, alpha = 0.65, type = 'mesh3d', name = 'pred_surface')
# fig <- fig %>% layout(scene = list(xaxis=axx,yaxis=axy,zaxis=axz))
# fig

plot(df.mlr.c)

fig.mlr1 <- plot_ly(df.mlr.c, x = ~Food.Insecurity, y = ~Confirmed.COVID19.Cases,
               type='scatter',
               alpha=0.65,
               mode='markers',
               name='COVID-19 Cases') %>%
  add_lines(x=~Median.Household.Income,y=fitted(lm(Confirmed.COVID19.Cases~Median.Household.Income,df.mlr.c)),name='Regression Line') %>%
  layout(title='COVID-19 Cases vs. Food Insecurity + Median Household Income',
         xaxis = list(title='Food Insecurity + Median Household Income'),
         yaxis = list(title='Confirmed COVID-19 Cases'))
fig.mlr1
```


Experiment 2: Food insecurity as a factor for COVID-19 cases
H0: b1 = 0
Ha: b1 ~= 0
```{r exp2 food insecurity}
Y <- df.mlr.c$Confirmed.COVID19.Cases -> Confirmed.COVID19.Cases
X <- df.mlr.c$Food.Insecurity -> Food.Insecurity

df.fi <- cbind(Food.Insecurity,Confirmed.COVID19.Cases) %>% as.data.frame
rm(Confirmed.COVID19.Cases,Food.Insecurity)

mod.fi <- lm(Confirmed.COVID19.Cases~Food.Insecurity, df.fi)

mod.fi <- lm(Confirmed.COVID19.Cases~Food.Insecurity,df.mlr.c)
summary(mod.fi)
anova(mod.fi)

residualPlot(mod.fi)
```

```{r exp2 outlier diagnostics}
#define lower and upper bounds for quantile method
fi.lb <- quantile(df.fi$Confirmed.COVID19.Cases,0.1)
fi.ub <- quantile(df.fi$Confirmed.COVID19.Cases,0.9)

#define index of values outside of quantile bounds
fi.quant.index <- which(df.fi$Confirmed.COVID19.Cases < fi.lb | df.fi$Confirmed.COVID19.Cases > fi.ub)

#print outliers identified
# df.main[fi.quant.index,]
df.fi[fi.quant.index,]
#second cleansed df without outliers, quantile method
df.fi.c <- df.fi[-c(fi.quant.index),]

#boxplot to verify cleansed df
fig.fi.bp2 <- plot_ly(y=~df.fi.c$Confirmed.COVID19.Cases,type='box',name='Food Insecurity data')
fig.fi.bp2

#outliers from boxplot statistics
fi.bp.out <- boxplot.stats(df.fi.c$Confirmed.COVID19.Cases)$out
fi.bp.out.index <- which(df.fi.c$Confirmed.COVID19.Cases %in% c(bp.out))

#remove 3 remaining y outliers
df.fi.c <- df.fi.c[-bp.out.index,]

mod.fi.c <- lm(Confirmed.COVID19.Cases~Food.Insecurity, df.fi.c)

#cooks distance, dfbetas, residuals diagnostics plots
ols_plot_cooksd_bar(mod.fi.c)

ols_plot_dfbetas(mod.fi.c)

ols_plot_resid_stud(mod.fi.c)

outlierTest(mod.fi.c)

#remove data that violates cook distance criteria
df.fi.cooks <- cooks.distance(mod.fi.c)
mean.fi.cooks <- mean(cooks.distance(mod.fi.c))
out.fi.cooks <- which(df.fi.cooks > 4*mean.fi.cooks)
df.fi.cl <- df.fi.c[-out.fi.cooks,]

#update model without outliers
mod.fi.cl <- lm(Confirmed.COVID19.Cases~Food.Insecurity, df.fi.cl)

#outlier diagnostics of new model
residualPlot(mod.fi.cl)

#remove observation 20 as outlier
df.fi.cl <- df.fi.cl[-c(20),]
```
```{r exp2 final model}
mod.fi.cl <- lm(Confirmed.COVID19.Cases~Food.Insecurity, df.fi.cl)
summary(mod.fi.cl)
anova(mod.fi.cl)
```


```{r exp2 plots}
#plot linear model data
plot(mod.fi.cl)
residualPlot(mod.fi.cl)

#plot data with regression line
plot(X,Y,pch = 16, cex = 1.3, col = "blue", main = "Confirmed COVID-19 Cases vs. Food Insecurity", xlab = "Food Insecurity", ylab = "Confirmed COVID-19 Cases")
abline(mod.fi.cl)

#Plotly
fig.fi <- plot_ly(df.fi.cl, x = ~df.fi.cl$Food.Insecurity, y = ~df.fi.cl$Confirmed.COVID19.Cases,
               type = 'scatter',
               alpha=0.65,
               mode='markers',
               name='COVID-19 Cases') %>%
  add_lines(x=~df.fi.cl$Food.Insecurity,y=fitted(mod.fi.cl),name='Regression Line') %>%
  layout(title='COVID-19 Cases vs. Food Insecurity',
         xaxis = list(title='Number of Food Insecure Adults'),
         yaxis = list(title='Confirmed COVID-19 Cases'))
fig.fi
```

```{r exp2 kfold cross validation}
#K Fold Cross Validation
set.seed(123) #set seed for reproducibility
train.control = trainControl(method='repeatedcv', number=10,repeats=3) #10-fold cross validation, 3 repeats

mod.fi.train = train(Confirmed.COVID19.Cases~Food.Insecurity,
                      data = df.fi.cl,
                      method='lm',
                      trControl=train.control)
mod.fi.train

# summary(mod.fi.train)
```

Experiment 3: Comparing Food Insecurity vs. Health factors such as obesity and diabetes.
H0: b2 = b3
Ha: not H0

``` {r exp3}
Cases <- df.main$Cases -> Confirmed.COVID19.Cases
Diabetes <- df.main$PERCENT.ADULT.DIABETES*100
Obesity <- df.main$PERCENT.OBESITY*100

df.health <- cbind(Cases,Diabetes,Obesity) %>% as.data.frame

mod.health <- lm(Cases~Diabetes+Obesity, df.health)

summary(mod.health)
anova(mod.health)
```
```{r exp3 diagnostics}
plot(mod.health)
residualPlot(mod.health)
#cooks distance, dfbetas, residuals diagnostics plots
ols_plot_cooksd_bar(mod.health)

ols_plot_dfbetas(mod.health)

ols_plot_resid_stud(mod.health)

outlierTest(mod.health)
```

```{r exp3 remedial}

#Remove Outliers
#define lower and upper bounds for quantile method
lb <- quantile(df.health$Cases,0.1)
ub <- quantile(df.health$Cases,0.9)

#define index of values outside of quantile bounds
health.out.index <- which(df.health$Cases < lb | df.health$Cases > ub)

#print outliers identified
df.main[health.out.index,]
#second cleansed df without outliers, quantile method
df.health.c <- df.health[-c(health.out.index),]

mod.health.c <- lm(log(Cases)~Diabetes+Obesity,df.health.c)

plot(mod.health.c)
residualPlot(mod.health.c)
#cooks distance, dfbetas, residuals diagnostics plots
ols_plot_cooksd_bar(mod.health.c)

ols_plot_dfbetas(mod.health.c)

ols_plot_resid_stud(mod.health.c)

outlierTest(mod.health.c)
```

```{r exp3 Remedy Tests}
#Shapiro Test for Non-normality violation mitigation
shapiro.test(df.health$Cases)

shapiro.test(df.health.c$Cases)

#BP tests
bptest(mod.health)

bptest(mod.health.c)
```
From this test, we see that BP test p value is greater than 0.05, thus heteroskedaskity is an issue with our model.

```{r exp3 heteroskedaskity remedy}
#OLS Model
mod.ols <- lm(Cases ~ Diabetes+Obesity, df.health.c) # Fit our model to get our residuals.
df.health.c$resi <- mod.ols$residuals
varfunc.ols <- lm(log(resi^2) ~ log(Diabetes) + log(Obesity), data = df.health.c)
print("summary(varfunc.ols)")
summary(varfunc.ols)

#Weighted Least Squares Model
df.health.c$varfunc <- exp(varfunc.ols$fitted.values)
mod.gls <- lm(Cases ~ Diabetes+Obesity, weights = 1/sqrt(varfunc), data = df.health.c)
print("summary(mod.gls)")
summary(mod.gls)

trace1 <- predict(mod.health.c)
trace2 <- predict(mod.gls)

fig.Diabetes <- plot_ly(data = df.health.c, x = ~Diabetes, y = ~Cases,name='Data Points',type='scatter',mode='markers') %>%
  add_trace(y = ~trace1, name = 'OLS',mode = 'lines') %>%
  add_trace(y = ~trace2, name = 'WLS',mode='lines') %>%
  layout(title = 'Model Comparison of WLS vs. OLS models for Cases~Diabetes')
fig.Diabetes

plot_ly(data = df.health.c, x = ~Obesity, y = ~Cases,name='Data Points',type='scatter',mode='markers')
```

```{r exp3 post remedy bp test}
#BP tests
bptest(mod.health.c)

bptest(mod.gls)
```


```{r Experiment 3 Plots}
plot(mod.fih)

#Plotly
fig.fih <- plot_ly(df.fih, x = ~X1+X2+X3, y = ~Y,
               type='scatter',
               alpha=0.65,
               mode='markers',
               name='COVID-19 Cases') %>%
  add_lines(x=~X1+X2+X3,y=fitted(mod.fih),name='Regression Line') %>%
  layout(title='COVID-19 Cases vs. Food Insecurity+Obesity+Diabetes',
         xaxis = list(title='Food Insecurity + Obesity + Diabetes'),
         yaxis = list(title='Confirmed COVID-19 Cases'))
fig.fih
```
	---
	title: "STAT 512 Final Project Team 11"
	author: "Nick Gunady"
	date: "12/4/2021"
	output: html_document
	knit: (function(input_file, encoding) {
	out_dir <- 'docs';
	rmarkdown::render(input_file,
	encoding=encoding,
	output_file=file.path(dirname(input_file), out_dir, 'index.html'))})
	---

	```{r setup, include=FALSE}
	knitr::opts_chunk$set(echo = TRUE)
	rm(list = ls())
	filedir <- getwd()
	```

	```{r packages, echo=FALSE}
	library(magrittr)
	library(tinytex)
	library(ALSM)
	library(stringr)
	library(stats)
	library(caret)

	#plotly Regression packages
	library(reshape2) # to load tips data
	library(tidymodels) # for the fit() function
	library(plotly)
	library(kernlab)
	library(pracma) #for meshgrid()

	library(olsrr) #cooks residuals
	library(lmtest) #for bp test
	```

	CA demographics dataset can be downloaded from: https://www.countyhealthrankings.org/app/california/2021/downloads

	```{r data wrangling}
	#main CA demographics dataset
	demo <- read.csv(file="demographics.csv")

	#main working df
	df.demo <- demo[2:58,c(3,4,7,8,9,10,11,14,17,18,19,20,21,22,23)]
	df.demo[,c(2,4,6,8,12)] <- df.demo[,c(2,4,6,8,12)] / 100 #change % columns to decimal

	#main covid case data from jhu.edu
	covid <- read.csv(file="CA-2020-cov-data.csv") %>% dplyr::rename(Cases = Case.Count.2020)

	df.main <- left_join(df.demo, covid, by="County") %>% dplyr::select(Cases,County,-X,everything())
	df.main <- df.main[,1:16]
	```

	```{r initial plotting}
	plot(df.main)
	```
	Experiment 3: Full MLR
	h0 = b1 = b2 = b3 = b4 = b5 = b6 = 0
	ha = at least one point estimate is not 0
	``` {r full MLR}
	Y <- df.main$Cases -> Confirmed.COVID19.Cases
	X1 <- df.main$NUM.FOOD.INSECURE -> Food.Insecurity
	X2 <- df.main$PERCENT.ADULT.DIABETES -> Percent.Adult.Diabetes
	X3 <- df.main$PERCENT.OBESITY -> Percent.Adult.Obesity
	X4 <- df.main$PERCENT.LIMITED.ACCESS -> Access.to.Healthy.Food
	X5 <- df.main$MEDIAN.HOUSEHOLD.INCOME -> Median.Household.Income
	X6 <- df.main$AVG.DAILY.PM2.5 -> Air.Pollution

	df.full <- cbind(Y,X1,X2,X3,X4,X5,X6)
	df.full1 <- cbind(Confirmed.COVID19.Cases,Food.Insecurity,Percent.Adult.Diabetes,Percent.Adult.Obesity,Access.to.Healthy.Food,Median.Household.Income,Air.Pollution) %>% as.data.frame

	mod.full <- lm(Confirmed.COVID19.Cases~Food.Insecurity+Percent.Adult.Diabetes+Percent.Adult.Obesity+Access.to.Healthy.Food+Median.Household.Income+Air.Pollution, df.full1)

	summary(mod.full)
	anova(mod.full)

	#Plot all MLR Variables
	plot(df.full1)

	```

	From the ANOVA results of the full MLR, it is clear further model refinement is required. Therefore, we plot the MLR model criteria with plotmodel.s function, part of the ALSM package. Additionally, the Best Subsets Algorithm (also from ALSM) is utilized to determine the best model for this experiment.
	```{r full MLR model selection}
	#Plot Criterias for Model Selection
	plotmodel.s(df.full1[,2:7], df.full1$Confirmed.COVID19.Cases)
	```
	Using the best subsets algorithm, we determine which models to analyze with the k fold analysis.
	```{r MLR BestSub}

	#Best Subsets Algorithm for Model Selection
	bs.full <- BestSub(df.full1[2:7],df.full1$Confirmed.COVID19.Cases,num=1) %>% as.data.frame
	bs.full

	write.csv(bs.full, "BestSubsets.csv")
	```
	As we can see from the output of the best subsets algorithm, three models emerge as the top contenders. Step Model 1 is defined from row 2, with the lowest Cp, AICp, SBCp and PRESSp.
	Step Model 2 is defined from row 4 with the highest r2.adj.
	Step Model 3 is defined from row 6 with the highest r2.

	The following code uses the K Fold Cross Validation technique to analyze the three model choices.
	``` {r MLR Cross Validation}
	#K Fold Cross Validation
	set.seed(123) #set seed for reproducibility
	train.control = trainControl(method='cv', number=10) #10-fold cross validation

	#Step-model 1 (p = 3, row 2)
	step.model1 = train(Y~X1+X5, data = df.full, method='leapBackward',
	tuneGrid=data.frame(nvmax=2),
	trControl=train.control)

	#Step-model 2 (p = 5, row 4)
	step.model2 = train(Y~X1+X4+X5+X6, data = df.full, method='leapBackward',
	tuneGrid=data.frame(nvmax=4),
	trControl=train.control)

	#Step-model 3 (p = 7, row 6)
	step.model3 = train(Y~X1+X2+X3+X4+X5+X6, data = df.full, method='leapBackward',
	tuneGrid=data.frame(nvmax=6),
	trControl=train.control)

	step.models <- rbind(step.model1$results,step.model2$results,step.model3$results)
	step.models$Model.Number <- c(1,2,3)
	step.models <- step.models %>% select(Model.Number,everything())
	step.models
	```
	We observe that Step Model 1 has the lowest RMSE, thus this becomes our model where Confirmed COVID-19 cases are described by explanatory variables X1(food insecurity) and X5 (median household income).

	Now, we perform model diagnostics on the selected model.

	```{r MLR diagnostics}
	#Initialize Step Model 3 as lm
	Y <- df.main$Cases -> Confirmed.COVID19.Cases
	X1 <- df.main$NUM.FOOD.INSECURE -> Food.Insecurity
	X5 <- df.main$MEDIAN.HOUSEHOLD.INCOME -> Median.Household.Income
	County <- df.main$County
	df.mlr <- cbind(Confirmed.COVID19.Cases,Food.Insecurity,Median.Household.Income) %>% as.data.frame
	mod.mlr <- lm(Confirmed.COVID19.Cases~Food.Insecurity+Median.Household.Income,df.mlr)

	summary(mod.mlr)
	anova(mod.mlr)

	#Plot mod.mlr
	plot(mod.mlr)
	#Residual Plots on Step Model 3
	residualPlot(mod.mlr)
	#Plot Partial Regression Plots (AV Plots) for each predictor in Step Model 3
	avPlots(mod.mlr)
	```
	From the plots generated, we observe possible outliers graphically in our dataset. We use outlier methods to confirm this through analytical methods.

	```{r MLR outliers}
	#BoxPlot of MLR data for step model 3
	fig.mlr.bp <- plot_ly(y=~df.mlr$Confirmed.COVID19.Cases,type='box',name='MLR data')
	fig.mlr.bp
	# boxplot(df.mlr$Confirmed.COVID19.Cases,
	# ylab = "Confirmed COVID-19 Cases")

	#boxplot statistics for outlier detection
	mlr.outliers <- boxplot.stats(df.mlr$Confirmed.COVID19.Cases)$out
	mlr.outlier.index <- which(mlr.outliers %in% c(mlr.outliers))

	#studentized residuals for outlier analysis
	resid.student <- rstudent(mod.mlr)

	#Bonferroni T Value
	alpha = 0.1
	n = nrow(df.mlr)
	p = bs.full[2,"p"]
	t.out <- qt(1-alpha/2,n-1-p) #t value for outlier detection

	df.main[which(resid.student > t.out),]
	# c("County","Cases","Food.Insecurity","Percent.Adult.Diabetes","Percent.Adult.Obesity","Access.to.Healthy.Food","Median.Household.Income","Air.Pollution")
	#outliers identified with 6 explanatory variables
	df.full1[which(resid.student > t.out),]
	df.out <- cbind(df.main[which(resid.student > t.out),"County"],df.full1[which(resid.student > t.out),])
	colnames(df.out)[1] <- "County"

	df.out[,c("County","Confirmed.COVID19.Cases","Food.Insecurity","Median.Household.Income")]
	```
	Since boxplot.stats did not yield a better result by removing outliers identified through that method, we use quantile method for alpha of 0.025 to remove outliers.
	```{r MLR outlier cleansing}
	#define lower and upper bounds for quantile method
	mlr.lb <- quantile(df.mlr$Confirmed.COVID19.Cases,0.05)
	mlr.ub <- quantile(df.mlr$Confirmed.COVID19.Cases,0.95)

	#define index of values outside of quantile bounds
	mlr.quant.index <- which(df.mlr$Confirmed.COVID19.Cases < mlr.lb \| df.mlr$Confirmed.COVID19.Cases > mlr.ub)

	#print outliers identified
	df.main[mlr.quant.index,]

	#second cleansed df without outliers, quantile method
	df.mlr.c <- df.mlr[-c(mlr.quant.index),]

	#boxplot to verify cleansed df
	fig.mlr.bp2 <- plot_ly(y=~df.mlr.c$Confirmed.COVID19.Cases,type='box',name='MLR data')
	fig.mlr.bp2

	#outliers from boxplot statistics
	bp.out <- boxplot.stats(df.mlr.c$Confirmed.COVID19.Cases)$out
	bp.out.index <- which(df.mlr.c$Confirmed.COVID19.Cases %in% c(bp.out))

	#remove 3 remaining y outliers
	df.mlr.c <- df.mlr.c[-bp.out.index,]
	```
	With this quantile method, we remove outliers of several major counties. We attempted to use alpha = 0.025, 0.05, and 0.1. When using 0.025, too many outliers still existed, but when we used 0.1, too many large CA counties were excluded for our liking. Thus we use alpha = 0.05 for this quantile method.

	```{r MLR diagnostics2}
	#Initialize Step Model 3 as lm
	Y <- df.mlr.c$Confirmed.COVID19.Cases -> Confirmed.COVID19.Cases
	X1 <- df.mlr.c$Food.Insecurity -> Food.Insecurity
	X5 <- df.mlr.c$Median.Household.Income -> Median.Household.Income

	mod.mlr.c <- lm(Confirmed.COVID19.Cases~Food.Insecurity+Median.Household.Income,df.mlr.c)

	summary(mod.mlr.c)
	anova(mod.mlr.c)

	#Plot mod.mlr
	plot(mod.mlr.c)
	#Residual Plots on Step Model 3
	residualPlot(mod.mlr.c)
	#Plot Partial Regression Plots (AV Plots) for each predictor in Step Model 3
	avPlots(mod.mlr.c)
	```
	Significant curvature exists in the residual plot. Hat leverage metric for x outliers
	```{r MLR x outliers}
	#hat leverage
	h = p/n
	h
	hat.vals <- lm.influence(mod.mlr.c)$hat
	hat.index <- which(hat.vals > 2*h)

	#identify x outliers
	df.main[hat.index,]

	#remove x outliers
	df.mlr.c <- df.mlr.c[-hat.index,]
	df.mlr[hat.index,]
	df.out2 <- cbind(df.main[hat.index,"County"], df.mlr[hat.index,])

	colnames(df.out2)[1] <- "County"
	df.out2
	```

	```{r MLR diagnostics3}
	#Initialize Step Model 3 as lm
	Y <- df.mlr.c$Confirmed.COVID19.Cases -> Confirmed.COVID19.Cases
	X1 <- df.mlr.c$Food.Insecurity -> Food.Insecurity
	X5 <- df.mlr.c$Median.Household.Income -> Median.Household.Income

	mod.mlr.c <- lm(Confirmed.COVID19.Cases~Food.Insecurity+Median.Household.Income,df.mlr.c)

	summary(mod.mlr.c)
	anova(mod.mlr.c)

	#Plot mod.mlr
	plot(mod.mlr.c)
	#Residual Plots on Step Model 3
	residualPlot(mod.mlr.c)
	#Plot Partial Regression Plots (AV Plots) for each predictor in Step Model 3
	avPlots(mod.mlr.c)
	```
	Now that we have confirmed that our new dataset and model are acceptable from the standpoint of outliers, we examine our model for multicollinearity with the vif function from the cars package.
	```{r MLR multicollinearity}
	#plot dataset to visually analyze for multicollinearity
	plot(df.mlr.c)
	#Test for multicollinearity with vif
	car::vif(mod.mlr.c)
	```
	```{r MLR Tests}
	#Shapiro Test for Non-normality violation mitigation
	shapiro.test(df.mlr$Confirmed.COVID19.Cases)

	shapiro.test(df.mlr.c$Confirmed.COVID19.Cases)

	#BP tests
	bptest(mod.mlr)

	bptest(mod.mlr.c)
	```
	Now that we have acceptable model for the MLR, we plot in 3d space.

	```{r MLR visualization}
	# mesh_size <- .02
	# margin <- 0
	# X <- df.mlr.c %>% select(Food.Insecurity,Median.Household.Income) /1000
	# y <- df.mlr.c %>% select(Confirmed.COVID19.Cases) /1000
	#
	# model <- svm_rbf(cost = 1.0) %>%
	# set_engine("kernlab") %>%
	# set_mode("regression") %>%
	# fit(Confirmed.COVID19.Cases~Food.Insecurity+Median.Household.Income,data=df.mlr.c)
	#
	# x_min <- min(X$Food.Insecurity) - margin
	# x_max <- max(X$Food.Insecurity) - margin
	# y_min <- min(X$Median.Household.Income) - margin
	# y_max <- max(X$Median.Household.Income) - margin
	# xrange <- seq(x_min, x_max, mesh_size)
	# yrange <- seq(y_min, y_max, mesh_size)
	# xy <- pracma::meshgrid(x = xrange, y = yrange)
	# xx <- xy$X
	# yy <- xy$Y
	# dim_val <- dim(xx)
	# xx1 <- matrix(xx, length(xx), 1)
	# yy1 <- matrix(yy, length(yy), 1)
	# final <- cbind(xx1, yy1)
	# pred <- model %>%
	# predict(final)
	#
	# pred <- pred$.pred
	# pred <- matrix(pred, dim_val[1], dim_val[2])
	#
	# #axes bounds
	# axx <- list(
	# nticks = 4,
	# range = c(0,150)
	# )
	#
	# axy <- list(
	# nticks = 4,
	# range = c(0,150)
	# )
	#
	# axz <- list(
	# nticks = 4,
	# range = c(0,4100)
	# )
	#
	# fig <- plot_ly(df.mlr.c, x = ~Food.Insecurity/1000, y = ~Median.Household.Income/1000, z = ~Confirmed.COVID19.Cases/1000 ) %>%
	# add_markers(size = 5) %>%
	# add_surface(x=xrange, y=yrange, z=pred, alpha = 0.65, type = 'mesh3d', name = 'pred_surface')
	# fig <- fig %>% layout(scene = list(xaxis=axx,yaxis=axy,zaxis=axz))
	# fig

	plot(df.mlr.c)

	fig.mlr1 <- plot_ly(df.mlr.c, x = ~Food.Insecurity, y = ~Confirmed.COVID19.Cases,
	type='scatter',
	alpha=0.65,
	mode='markers',
	name='COVID-19 Cases') %>%
	add_lines(x=~Median.Household.Income,y=fitted(lm(Confirmed.COVID19.Cases~Median.Household.Income,df.mlr.c)),name='Regression Line') %>%
	layout(title='COVID-19 Cases vs. Food Insecurity + Median Household Income',
	xaxis = list(title='Food Insecurity + Median Household Income'),
	yaxis = list(title='Confirmed COVID-19 Cases'))
	fig.mlr1
	```


	Experiment 2: Food insecurity as a factor for COVID-19 cases
	H0: b1 = 0
	Ha: b1 ~= 0
	```{r exp2 food insecurity}
	Y <- df.mlr.c$Confirmed.COVID19.Cases -> Confirmed.COVID19.Cases
	X <- df.mlr.c$Food.Insecurity -> Food.Insecurity

	df.fi <- cbind(Food.Insecurity,Confirmed.COVID19.Cases) %>% as.data.frame
	rm(Confirmed.COVID19.Cases,Food.Insecurity)

	mod.fi <- lm(Confirmed.COVID19.Cases~Food.Insecurity, df.fi)

	mod.fi <- lm(Confirmed.COVID19.Cases~Food.Insecurity,df.mlr.c)
	summary(mod.fi)
	anova(mod.fi)

	residualPlot(mod.fi)
	```

	```{r exp2 outlier diagnostics}
	#define lower and upper bounds for quantile method
	fi.lb <- quantile(df.fi$Confirmed.COVID19.Cases,0.1)
	fi.ub <- quantile(df.fi$Confirmed.COVID19.Cases,0.9)

	#define index of values outside of quantile bounds
	fi.quant.index <- which(df.fi$Confirmed.COVID19.Cases < fi.lb \| df.fi$Confirmed.COVID19.Cases > fi.ub)

	#print outliers identified
	# df.main[fi.quant.index,]
	df.fi[fi.quant.index,]
	#second cleansed df without outliers, quantile method
	df.fi.c <- df.fi[-c(fi.quant.index),]

	#boxplot to verify cleansed df
	fig.fi.bp2 <- plot_ly(y=~df.fi.c$Confirmed.COVID19.Cases,type='box',name='Food Insecurity data')
	fig.fi.bp2

	#outliers from boxplot statistics
	fi.bp.out <- boxplot.stats(df.fi.c$Confirmed.COVID19.Cases)$out
	fi.bp.out.index <- which(df.fi.c$Confirmed.COVID19.Cases %in% c(bp.out))

	#remove 3 remaining y outliers
	df.fi.c <- df.fi.c[-bp.out.index,]

	mod.fi.c <- lm(Confirmed.COVID19.Cases~Food.Insecurity, df.fi.c)

	#cooks distance, dfbetas, residuals diagnostics plots
	ols_plot_cooksd_bar(mod.fi.c)

	ols_plot_dfbetas(mod.fi.c)

	ols_plot_resid_stud(mod.fi.c)

	outlierTest(mod.fi.c)

	#remove data that violates cook distance criteria
	df.fi.cooks <- cooks.distance(mod.fi.c)
	mean.fi.cooks <- mean(cooks.distance(mod.fi.c))
	out.fi.cooks <- which(df.fi.cooks > 4*mean.fi.cooks)
	df.fi.cl <- df.fi.c[-out.fi.cooks,]

	#update model without outliers
	mod.fi.cl <- lm(Confirmed.COVID19.Cases~Food.Insecurity, df.fi.cl)

	#outlier diagnostics of new model
	residualPlot(mod.fi.cl)

	#remove observation 20 as outlier
	df.fi.cl <- df.fi.cl[-c(20),]
	```
	```{r exp2 final model}
	mod.fi.cl <- lm(Confirmed.COVID19.Cases~Food.Insecurity, df.fi.cl)
	summary(mod.fi.cl)
	anova(mod.fi.cl)
	```


	```{r exp2 plots}
	#plot linear model data
	plot(mod.fi.cl)
	residualPlot(mod.fi.cl)

	#plot data with regression line
	plot(X,Y,pch = 16, cex = 1.3, col = "blue", main = "Confirmed COVID-19 Cases vs. Food Insecurity", xlab = "Food Insecurity", ylab = "Confirmed COVID-19 Cases")
	abline(mod.fi.cl)

	#Plotly
	fig.fi <- plot_ly(df.fi.cl, x = ~df.fi.cl$Food.Insecurity, y = ~df.fi.cl$Confirmed.COVID19.Cases,
	type = 'scatter',
	alpha=0.65,
	mode='markers',
	name='COVID-19 Cases') %>%
	add_lines(x=~df.fi.cl$Food.Insecurity,y=fitted(mod.fi.cl),name='Regression Line') %>%
	layout(title='COVID-19 Cases vs. Food Insecurity',
	xaxis = list(title='Number of Food Insecure Adults'),
	yaxis = list(title='Confirmed COVID-19 Cases'))
	fig.fi
	```

	```{r exp2 kfold cross validation}
	#K Fold Cross Validation
	set.seed(123) #set seed for reproducibility
	train.control = trainControl(method='repeatedcv', number=10,repeats=3) #10-fold cross validation, 3 repeats

	mod.fi.train = train(Confirmed.COVID19.Cases~Food.Insecurity,
	data = df.fi.cl,
	method='lm',
	trControl=train.control)
	mod.fi.train

	# summary(mod.fi.train)
	```

	Experiment 3: Comparing Food Insecurity vs. Health factors such as obesity and diabetes.
	H0: b2 = b3
	Ha: not H0

	``` {r exp3}
	Cases <- df.main$Cases -> Confirmed.COVID19.Cases
	Diabetes <- df.main$PERCENT.ADULT.DIABETES*100
	Obesity <- df.main$PERCENT.OBESITY*100

	df.health <- cbind(Cases,Diabetes,Obesity) %>% as.data.frame

	mod.health <- lm(Cases~Diabetes+Obesity, df.health)

	summary(mod.health)
	anova(mod.health)
	```
	```{r exp3 diagnostics}
	plot(mod.health)
	residualPlot(mod.health)
	#cooks distance, dfbetas, residuals diagnostics plots
	ols_plot_cooksd_bar(mod.health)

	ols_plot_dfbetas(mod.health)

	ols_plot_resid_stud(mod.health)

	outlierTest(mod.health)
	```

	```{r exp3 remedial}

	#Remove Outliers
	#define lower and upper bounds for quantile method
	lb <- quantile(df.health$Cases,0.1)
	ub <- quantile(df.health$Cases,0.9)

	#define index of values outside of quantile bounds
	health.out.index <- which(df.health$Cases < lb \| df.health$Cases > ub)

	#print outliers identified
	df.main[health.out.index,]
	#second cleansed df without outliers, quantile method
	df.health.c <- df.health[-c(health.out.index),]

	mod.health.c <- lm(log(Cases)~Diabetes+Obesity,df.health.c)

	plot(mod.health.c)
	residualPlot(mod.health.c)
	#cooks distance, dfbetas, residuals diagnostics plots
	ols_plot_cooksd_bar(mod.health.c)

	ols_plot_dfbetas(mod.health.c)

	ols_plot_resid_stud(mod.health.c)

	outlierTest(mod.health.c)
	```

	```{r exp3 Remedy Tests}
	#Shapiro Test for Non-normality violation mitigation
	shapiro.test(df.health$Cases)

	shapiro.test(df.health.c$Cases)

	#BP tests
	bptest(mod.health)

	bptest(mod.health.c)
	```
	From this test, we see that BP test p value is greater than 0.05, thus heteroskedaskity is an issue with our model.

	```{r exp3 heteroskedaskity remedy}
	#OLS Model
	mod.ols <- lm(Cases ~ Diabetes+Obesity, df.health.c) # Fit our model to get our residuals.
	df.health.c$resi <- mod.ols$residuals
	varfunc.ols <- lm(log(resi^2) ~ log(Diabetes) + log(Obesity), data = df.health.c)
	print("summary(varfunc.ols)")
	summary(varfunc.ols)

	#Weighted Least Squares Model
	df.health.c$varfunc <- exp(varfunc.ols$fitted.values)
	mod.gls <- lm(Cases ~ Diabetes+Obesity, weights = 1/sqrt(varfunc), data = df.health.c)
	print("summary(mod.gls)")
	summary(mod.gls)

	trace1 <- predict(mod.health.c)
	trace2 <- predict(mod.gls)

	fig.Diabetes <- plot_ly(data = df.health.c, x = ~Diabetes, y = ~Cases,name='Data Points',type='scatter',mode='markers') %>%
	add_trace(y = ~trace1, name = 'OLS',mode = 'lines') %>%
	add_trace(y = ~trace2, name = 'WLS',mode='lines') %>%
	layout(title = 'Model Comparison of WLS vs. OLS models for Cases~Diabetes')
	fig.Diabetes

	plot_ly(data = df.health.c, x = ~Obesity, y = ~Cases,name='Data Points',type='scatter',mode='markers')
	```

	```{r exp3 post remedy bp test}
	#BP tests
	bptest(mod.health.c)

	bptest(mod.gls)
	```


	```{r Experiment 3 Plots}
	plot(mod.fih)

	#Plotly
	fig.fih <- plot_ly(df.fih, x = ~X1+X2+X3, y = ~Y,
	type='scatter',
	alpha=0.65,
	mode='markers',
	name='COVID-19 Cases') %>%
	add_lines(x=~X1+X2+X3,y=fitted(mod.fih),name='Regression Line') %>%
	layout(title='COVID-19 Cases vs. Food Insecurity+Obesity+Diabetes',
	xaxis = list(title='Food Insecurity + Obesity + Diabetes'),
	yaxis = list(title='Confirmed COVID-19 Cases'))
	fig.fih
	```