Skip to content
Permalink
master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Go to file
 
 
Cannot retrieve contributors at this time
---
title: "STAT 512 Final Project Team 11"
author: "Nick Gunady"
date: "12/4/2021"
output: html_document
knit: (function(input_file, encoding) {
out_dir <- 'docs';
rmarkdown::render(input_file,
encoding=encoding,
output_file=file.path(dirname(input_file), out_dir, 'index.html'))})
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
rm(list = ls())
filedir <- getwd()
```
```{r packages, echo=FALSE}
library(magrittr)
library(tinytex)
library(ALSM)
library(stringr)
library(stats)
library(caret)
#plotly Regression packages
library(reshape2) # to load tips data
library(tidymodels) # for the fit() function
library(plotly)
library(kernlab)
library(pracma) #for meshgrid()
library(olsrr) #cooks residuals
library(lmtest) #for bp test
```
CA demographics dataset can be downloaded from: https://www.countyhealthrankings.org/app/california/2021/downloads
```{r data wrangling}
#main CA demographics dataset
demo <- read.csv(file="demographics.csv")
#main working df
df.demo <- demo[2:58,c(3,4,7,8,9,10,11,14,17,18,19,20,21,22,23)]
df.demo[,c(2,4,6,8,12)] <- df.demo[,c(2,4,6,8,12)] / 100 #change % columns to decimal
#main covid case data from jhu.edu
covid <- read.csv(file="CA-2020-cov-data.csv") %>% dplyr::rename(Cases = Case.Count.2020)
df.main <- left_join(df.demo, covid, by="County") %>% dplyr::select(Cases,County,-X,everything())
df.main <- df.main[,1:16]
```
```{r initial plotting}
plot(df.main)
```
Experiment 3: Full MLR
h0 = b1 = b2 = b3 = b4 = b5 = b6 = 0
ha = at least one point estimate is not 0
``` {r full MLR}
Y <- df.main$Cases -> Confirmed.COVID19.Cases
X1 <- df.main$NUM.FOOD.INSECURE -> Food.Insecurity
X2 <- df.main$PERCENT.ADULT.DIABETES -> Percent.Adult.Diabetes
X3 <- df.main$PERCENT.OBESITY -> Percent.Adult.Obesity
X4 <- df.main$PERCENT.LIMITED.ACCESS -> Access.to.Healthy.Food
X5 <- df.main$MEDIAN.HOUSEHOLD.INCOME -> Median.Household.Income
X6 <- df.main$AVG.DAILY.PM2.5 -> Air.Pollution
df.full <- cbind(Y,X1,X2,X3,X4,X5,X6)
df.full1 <- cbind(Confirmed.COVID19.Cases,Food.Insecurity,Percent.Adult.Diabetes,Percent.Adult.Obesity,Access.to.Healthy.Food,Median.Household.Income,Air.Pollution) %>% as.data.frame
mod.full <- lm(Confirmed.COVID19.Cases~Food.Insecurity+Percent.Adult.Diabetes+Percent.Adult.Obesity+Access.to.Healthy.Food+Median.Household.Income+Air.Pollution, df.full1)
summary(mod.full)
anova(mod.full)
#Plot all MLR Variables
plot(df.full1)
```
From the ANOVA results of the full MLR, it is clear further model refinement is required. Therefore, we plot the MLR model criteria with plotmodel.s function, part of the ALSM package. Additionally, the Best Subsets Algorithm (also from ALSM) is utilized to determine the best model for this experiment.
```{r full MLR model selection}
#Plot Criterias for Model Selection
plotmodel.s(df.full1[,2:7], df.full1$Confirmed.COVID19.Cases)
```
Using the best subsets algorithm, we determine which models to analyze with the k fold analysis.
```{r MLR BestSub}
#Best Subsets Algorithm for Model Selection
bs.full <- BestSub(df.full1[2:7],df.full1$Confirmed.COVID19.Cases,num=1) %>% as.data.frame
bs.full
write.csv(bs.full, "BestSubsets.csv")
```
As we can see from the output of the best subsets algorithm, three models emerge as the top contenders. Step Model 1 is defined from row 2, with the lowest Cp, AICp, SBCp and PRESSp.
Step Model 2 is defined from row 4 with the highest r2.adj.
Step Model 3 is defined from row 6 with the highest r2.
The following code uses the K Fold Cross Validation technique to analyze the three model choices.
``` {r MLR Cross Validation}
#K Fold Cross Validation
set.seed(123) #set seed for reproducibility
train.control = trainControl(method='cv', number=10) #10-fold cross validation
#Step-model 1 (p = 3, row 2)
step.model1 = train(Y~X1+X5, data = df.full, method='leapBackward',
tuneGrid=data.frame(nvmax=2),
trControl=train.control)
#Step-model 2 (p = 5, row 4)
step.model2 = train(Y~X1+X4+X5+X6, data = df.full, method='leapBackward',
tuneGrid=data.frame(nvmax=4),
trControl=train.control)
#Step-model 3 (p = 7, row 6)
step.model3 = train(Y~X1+X2+X3+X4+X5+X6, data = df.full, method='leapBackward',
tuneGrid=data.frame(nvmax=6),
trControl=train.control)
step.models <- rbind(step.model1$results,step.model2$results,step.model3$results)
step.models$Model.Number <- c(1,2,3)
step.models <- step.models %>% select(Model.Number,everything())
step.models
```
We observe that Step Model 1 has the lowest RMSE, thus this becomes our model where Confirmed COVID-19 cases are described by explanatory variables X1(food insecurity) and X5 (median household income).
Now, we perform model diagnostics on the selected model.
```{r MLR diagnostics}
#Initialize Step Model 3 as lm
Y <- df.main$Cases -> Confirmed.COVID19.Cases
X1 <- df.main$NUM.FOOD.INSECURE -> Food.Insecurity
X5 <- df.main$MEDIAN.HOUSEHOLD.INCOME -> Median.Household.Income
County <- df.main$County
df.mlr <- cbind(Confirmed.COVID19.Cases,Food.Insecurity,Median.Household.Income) %>% as.data.frame
mod.mlr <- lm(Confirmed.COVID19.Cases~Food.Insecurity+Median.Household.Income,df.mlr)
summary(mod.mlr)
anova(mod.mlr)
#Plot mod.mlr
plot(mod.mlr)
#Residual Plots on Step Model 3
residualPlot(mod.mlr)
#Plot Partial Regression Plots (AV Plots) for each predictor in Step Model 3
avPlots(mod.mlr)
```
From the plots generated, we observe possible outliers graphically in our dataset. We use outlier methods to confirm this through analytical methods.
```{r MLR outliers}
#BoxPlot of MLR data for step model 3
fig.mlr.bp <- plot_ly(y=~df.mlr$Confirmed.COVID19.Cases,type='box',name='MLR data')
fig.mlr.bp
# boxplot(df.mlr$Confirmed.COVID19.Cases,
# ylab = "Confirmed COVID-19 Cases")
#boxplot statistics for outlier detection
mlr.outliers <- boxplot.stats(df.mlr$Confirmed.COVID19.Cases)$out
mlr.outlier.index <- which(mlr.outliers %in% c(mlr.outliers))
#studentized residuals for outlier analysis
resid.student <- rstudent(mod.mlr)
#Bonferroni T Value
alpha = 0.1
n = nrow(df.mlr)
p = bs.full[2,"p"]
t.out <- qt(1-alpha/2,n-1-p) #t value for outlier detection
df.main[which(resid.student > t.out),]
# c("County","Cases","Food.Insecurity","Percent.Adult.Diabetes","Percent.Adult.Obesity","Access.to.Healthy.Food","Median.Household.Income","Air.Pollution")
#outliers identified with 6 explanatory variables
df.full1[which(resid.student > t.out),]
df.out <- cbind(df.main[which(resid.student > t.out),"County"],df.full1[which(resid.student > t.out),])
colnames(df.out)[1] <- "County"
df.out[,c("County","Confirmed.COVID19.Cases","Food.Insecurity","Median.Household.Income")]
```
Since boxplot.stats did not yield a better result by removing outliers identified through that method, we use quantile method for alpha of 0.025 to remove outliers.
```{r MLR outlier cleansing}
#define lower and upper bounds for quantile method
mlr.lb <- quantile(df.mlr$Confirmed.COVID19.Cases,0.05)
mlr.ub <- quantile(df.mlr$Confirmed.COVID19.Cases,0.95)
#define index of values outside of quantile bounds
mlr.quant.index <- which(df.mlr$Confirmed.COVID19.Cases < mlr.lb | df.mlr$Confirmed.COVID19.Cases > mlr.ub)
#print outliers identified
df.main[mlr.quant.index,]
#second cleansed df without outliers, quantile method
df.mlr.c <- df.mlr[-c(mlr.quant.index),]
#boxplot to verify cleansed df
fig.mlr.bp2 <- plot_ly(y=~df.mlr.c$Confirmed.COVID19.Cases,type='box',name='MLR data')
fig.mlr.bp2
#outliers from boxplot statistics
bp.out <- boxplot.stats(df.mlr.c$Confirmed.COVID19.Cases)$out
bp.out.index <- which(df.mlr.c$Confirmed.COVID19.Cases %in% c(bp.out))
#remove 3 remaining y outliers
df.mlr.c <- df.mlr.c[-bp.out.index,]
```
With this quantile method, we remove outliers of several major counties. We attempted to use alpha = 0.025, 0.05, and 0.1. When using 0.025, too many outliers still existed, but when we used 0.1, too many large CA counties were excluded for our liking. Thus we use alpha = 0.05 for this quantile method.
```{r MLR diagnostics2}
#Initialize Step Model 3 as lm
Y <- df.mlr.c$Confirmed.COVID19.Cases -> Confirmed.COVID19.Cases
X1 <- df.mlr.c$Food.Insecurity -> Food.Insecurity
X5 <- df.mlr.c$Median.Household.Income -> Median.Household.Income
mod.mlr.c <- lm(Confirmed.COVID19.Cases~Food.Insecurity+Median.Household.Income,df.mlr.c)
summary(mod.mlr.c)
anova(mod.mlr.c)
#Plot mod.mlr
plot(mod.mlr.c)
#Residual Plots on Step Model 3
residualPlot(mod.mlr.c)
#Plot Partial Regression Plots (AV Plots) for each predictor in Step Model 3
avPlots(mod.mlr.c)
```
Significant curvature exists in the residual plot. Hat leverage metric for x outliers
```{r MLR x outliers}
#hat leverage
h = p/n
h
hat.vals <- lm.influence(mod.mlr.c)$hat
hat.index <- which(hat.vals > 2*h)
#identify x outliers
df.main[hat.index,]
#remove x outliers
df.mlr.c <- df.mlr.c[-hat.index,]
df.mlr[hat.index,]
df.out2 <- cbind(df.main[hat.index,"County"], df.mlr[hat.index,])
colnames(df.out2)[1] <- "County"
df.out2
```
```{r MLR diagnostics3}
#Initialize Step Model 3 as lm
Y <- df.mlr.c$Confirmed.COVID19.Cases -> Confirmed.COVID19.Cases
X1 <- df.mlr.c$Food.Insecurity -> Food.Insecurity
X5 <- df.mlr.c$Median.Household.Income -> Median.Household.Income
mod.mlr.c <- lm(Confirmed.COVID19.Cases~Food.Insecurity+Median.Household.Income,df.mlr.c)
summary(mod.mlr.c)
anova(mod.mlr.c)
#Plot mod.mlr
plot(mod.mlr.c)
#Residual Plots on Step Model 3
residualPlot(mod.mlr.c)
#Plot Partial Regression Plots (AV Plots) for each predictor in Step Model 3
avPlots(mod.mlr.c)
```
Now that we have confirmed that our new dataset and model are acceptable from the standpoint of outliers, we examine our model for multicollinearity with the vif function from the cars package.
```{r MLR multicollinearity}
#plot dataset to visually analyze for multicollinearity
plot(df.mlr.c)
#Test for multicollinearity with vif
car::vif(mod.mlr.c)
```
```{r MLR Tests}
#Shapiro Test for Non-normality violation mitigation
shapiro.test(df.mlr$Confirmed.COVID19.Cases)
shapiro.test(df.mlr.c$Confirmed.COVID19.Cases)
#BP tests
bptest(mod.mlr)
bptest(mod.mlr.c)
```
Now that we have acceptable model for the MLR, we plot in 3d space.
```{r MLR visualization}
# mesh_size <- .02
# margin <- 0
# X <- df.mlr.c %>% select(Food.Insecurity,Median.Household.Income) /1000
# y <- df.mlr.c %>% select(Confirmed.COVID19.Cases) /1000
#
# model <- svm_rbf(cost = 1.0) %>%
# set_engine("kernlab") %>%
# set_mode("regression") %>%
# fit(Confirmed.COVID19.Cases~Food.Insecurity+Median.Household.Income,data=df.mlr.c)
#
# x_min <- min(X$Food.Insecurity) - margin
# x_max <- max(X$Food.Insecurity) - margin
# y_min <- min(X$Median.Household.Income) - margin
# y_max <- max(X$Median.Household.Income) - margin
# xrange <- seq(x_min, x_max, mesh_size)
# yrange <- seq(y_min, y_max, mesh_size)
# xy <- pracma::meshgrid(x = xrange, y = yrange)
# xx <- xy$X
# yy <- xy$Y
# dim_val <- dim(xx)
# xx1 <- matrix(xx, length(xx), 1)
# yy1 <- matrix(yy, length(yy), 1)
# final <- cbind(xx1, yy1)
# pred <- model %>%
# predict(final)
#
# pred <- pred$.pred
# pred <- matrix(pred, dim_val[1], dim_val[2])
#
# #axes bounds
# axx <- list(
# nticks = 4,
# range = c(0,150)
# )
#
# axy <- list(
# nticks = 4,
# range = c(0,150)
# )
#
# axz <- list(
# nticks = 4,
# range = c(0,4100)
# )
#
# fig <- plot_ly(df.mlr.c, x = ~Food.Insecurity/1000, y = ~Median.Household.Income/1000, z = ~Confirmed.COVID19.Cases/1000 ) %>%
# add_markers(size = 5) %>%
# add_surface(x=xrange, y=yrange, z=pred, alpha = 0.65, type = 'mesh3d', name = 'pred_surface')
# fig <- fig %>% layout(scene = list(xaxis=axx,yaxis=axy,zaxis=axz))
# fig
plot(df.mlr.c)
fig.mlr1 <- plot_ly(df.mlr.c, x = ~Food.Insecurity, y = ~Confirmed.COVID19.Cases,
type='scatter',
alpha=0.65,
mode='markers',
name='COVID-19 Cases') %>%
add_lines(x=~Median.Household.Income,y=fitted(lm(Confirmed.COVID19.Cases~Median.Household.Income,df.mlr.c)),name='Regression Line') %>%
layout(title='COVID-19 Cases vs. Food Insecurity + Median Household Income',
xaxis = list(title='Food Insecurity + Median Household Income'),
yaxis = list(title='Confirmed COVID-19 Cases'))
fig.mlr1
```
Experiment 2: Food insecurity as a factor for COVID-19 cases
H0: b1 = 0
Ha: b1 ~= 0
```{r exp2 food insecurity}
Y <- df.mlr.c$Confirmed.COVID19.Cases -> Confirmed.COVID19.Cases
X <- df.mlr.c$Food.Insecurity -> Food.Insecurity
df.fi <- cbind(Food.Insecurity,Confirmed.COVID19.Cases) %>% as.data.frame
rm(Confirmed.COVID19.Cases,Food.Insecurity)
mod.fi <- lm(Confirmed.COVID19.Cases~Food.Insecurity, df.fi)
mod.fi <- lm(Confirmed.COVID19.Cases~Food.Insecurity,df.mlr.c)
summary(mod.fi)
anova(mod.fi)
residualPlot(mod.fi)
```
```{r exp2 outlier diagnostics}
#define lower and upper bounds for quantile method
fi.lb <- quantile(df.fi$Confirmed.COVID19.Cases,0.1)
fi.ub <- quantile(df.fi$Confirmed.COVID19.Cases,0.9)
#define index of values outside of quantile bounds
fi.quant.index <- which(df.fi$Confirmed.COVID19.Cases < fi.lb | df.fi$Confirmed.COVID19.Cases > fi.ub)
#print outliers identified
# df.main[fi.quant.index,]
df.fi[fi.quant.index,]
#second cleansed df without outliers, quantile method
df.fi.c <- df.fi[-c(fi.quant.index),]
#boxplot to verify cleansed df
fig.fi.bp2 <- plot_ly(y=~df.fi.c$Confirmed.COVID19.Cases,type='box',name='Food Insecurity data')
fig.fi.bp2
#outliers from boxplot statistics
fi.bp.out <- boxplot.stats(df.fi.c$Confirmed.COVID19.Cases)$out
fi.bp.out.index <- which(df.fi.c$Confirmed.COVID19.Cases %in% c(bp.out))
#remove 3 remaining y outliers
df.fi.c <- df.fi.c[-bp.out.index,]
mod.fi.c <- lm(Confirmed.COVID19.Cases~Food.Insecurity, df.fi.c)
#cooks distance, dfbetas, residuals diagnostics plots
ols_plot_cooksd_bar(mod.fi.c)
ols_plot_dfbetas(mod.fi.c)
ols_plot_resid_stud(mod.fi.c)
outlierTest(mod.fi.c)
#remove data that violates cook distance criteria
df.fi.cooks <- cooks.distance(mod.fi.c)
mean.fi.cooks <- mean(cooks.distance(mod.fi.c))
out.fi.cooks <- which(df.fi.cooks > 4*mean.fi.cooks)
df.fi.cl <- df.fi.c[-out.fi.cooks,]
#update model without outliers
mod.fi.cl <- lm(Confirmed.COVID19.Cases~Food.Insecurity, df.fi.cl)
#outlier diagnostics of new model
residualPlot(mod.fi.cl)
#remove observation 20 as outlier
df.fi.cl <- df.fi.cl[-c(20),]
```
```{r exp2 final model}
mod.fi.cl <- lm(Confirmed.COVID19.Cases~Food.Insecurity, df.fi.cl)
summary(mod.fi.cl)
anova(mod.fi.cl)
```
```{r exp2 plots}
#plot linear model data
plot(mod.fi.cl)
residualPlot(mod.fi.cl)
#plot data with regression line
plot(X,Y,pch = 16, cex = 1.3, col = "blue", main = "Confirmed COVID-19 Cases vs. Food Insecurity", xlab = "Food Insecurity", ylab = "Confirmed COVID-19 Cases")
abline(mod.fi.cl)
#Plotly
fig.fi <- plot_ly(df.fi.cl, x = ~df.fi.cl$Food.Insecurity, y = ~df.fi.cl$Confirmed.COVID19.Cases,
type = 'scatter',
alpha=0.65,
mode='markers',
name='COVID-19 Cases') %>%
add_lines(x=~df.fi.cl$Food.Insecurity,y=fitted(mod.fi.cl),name='Regression Line') %>%
layout(title='COVID-19 Cases vs. Food Insecurity',
xaxis = list(title='Number of Food Insecure Adults'),
yaxis = list(title='Confirmed COVID-19 Cases'))
fig.fi
```
```{r exp2 kfold cross validation}
#K Fold Cross Validation
set.seed(123) #set seed for reproducibility
train.control = trainControl(method='repeatedcv', number=10,repeats=3) #10-fold cross validation, 3 repeats
mod.fi.train = train(Confirmed.COVID19.Cases~Food.Insecurity,
data = df.fi.cl,
method='lm',
trControl=train.control)
mod.fi.train
# summary(mod.fi.train)
```
Experiment 3: Comparing Food Insecurity vs. Health factors such as obesity and diabetes.
H0: b2 = b3
Ha: not H0
``` {r exp3}
Cases <- df.main$Cases -> Confirmed.COVID19.Cases
Diabetes <- df.main$PERCENT.ADULT.DIABETES*100
Obesity <- df.main$PERCENT.OBESITY*100
df.health <- cbind(Cases,Diabetes,Obesity) %>% as.data.frame
mod.health <- lm(Cases~Diabetes+Obesity, df.health)
summary(mod.health)
anova(mod.health)
```
```{r exp3 diagnostics}
plot(mod.health)
residualPlot(mod.health)
#cooks distance, dfbetas, residuals diagnostics plots
ols_plot_cooksd_bar(mod.health)
ols_plot_dfbetas(mod.health)
ols_plot_resid_stud(mod.health)
outlierTest(mod.health)
```
```{r exp3 remedial}
#Remove Outliers
#define lower and upper bounds for quantile method
lb <- quantile(df.health$Cases,0.1)
ub <- quantile(df.health$Cases,0.9)
#define index of values outside of quantile bounds
health.out.index <- which(df.health$Cases < lb | df.health$Cases > ub)
#print outliers identified
df.main[health.out.index,]
#second cleansed df without outliers, quantile method
df.health.c <- df.health[-c(health.out.index),]
mod.health.c <- lm(log(Cases)~Diabetes+Obesity,df.health.c)
plot(mod.health.c)
residualPlot(mod.health.c)
#cooks distance, dfbetas, residuals diagnostics plots
ols_plot_cooksd_bar(mod.health.c)
ols_plot_dfbetas(mod.health.c)
ols_plot_resid_stud(mod.health.c)
outlierTest(mod.health.c)
```
```{r exp3 Remedy Tests}
#Shapiro Test for Non-normality violation mitigation
shapiro.test(df.health$Cases)
shapiro.test(df.health.c$Cases)
#BP tests
bptest(mod.health)
bptest(mod.health.c)
```
From this test, we see that BP test p value is greater than 0.05, thus heteroskedaskity is an issue with our model.
```{r exp3 heteroskedaskity remedy}
#OLS Model
mod.ols <- lm(Cases ~ Diabetes+Obesity, df.health.c) # Fit our model to get our residuals.
df.health.c$resi <- mod.ols$residuals
varfunc.ols <- lm(log(resi^2) ~ log(Diabetes) + log(Obesity), data = df.health.c)
print("summary(varfunc.ols)")
summary(varfunc.ols)
#Weighted Least Squares Model
df.health.c$varfunc <- exp(varfunc.ols$fitted.values)
mod.gls <- lm(Cases ~ Diabetes+Obesity, weights = 1/sqrt(varfunc), data = df.health.c)
print("summary(mod.gls)")
summary(mod.gls)
trace1 <- predict(mod.health.c)
trace2 <- predict(mod.gls)
fig.Diabetes <- plot_ly(data = df.health.c, x = ~Diabetes, y = ~Cases,name='Data Points',type='scatter',mode='markers') %>%
add_trace(y = ~trace1, name = 'OLS',mode = 'lines') %>%
add_trace(y = ~trace2, name = 'WLS',mode='lines') %>%
layout(title = 'Model Comparison of WLS vs. OLS models for Cases~Diabetes')
fig.Diabetes
plot_ly(data = df.health.c, x = ~Obesity, y = ~Cases,name='Data Points',type='scatter',mode='markers')
```
```{r exp3 post remedy bp test}
#BP tests
bptest(mod.health.c)
bptest(mod.gls)
```
```{r Experiment 3 Plots}
plot(mod.fih)
#Plotly
fig.fih <- plot_ly(df.fih, x = ~X1+X2+X3, y = ~Y,
type='scatter',
alpha=0.65,
mode='markers',
name='COVID-19 Cases') %>%
add_lines(x=~X1+X2+X3,y=fitted(mod.fih),name='Regression Line') %>%
layout(title='COVID-19 Cases vs. Food Insecurity+Obesity+Diabetes',
xaxis = list(title='Food Insecurity + Obesity + Diabetes'),
yaxis = list(title='Confirmed COVID-19 Cases'))
fig.fih
```