Permalink
Cannot retrieve contributors at this time
Name already in use
A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
covid-regression/CovReg.Rmd
Go to fileThis commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
584 lines (469 sloc)
18.7 KB
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: "STAT 512 Final Project Team 11" | |
author: "Nick Gunady" | |
date: "12/4/2021" | |
output: html_document | |
knit: (function(input_file, encoding) { | |
out_dir <- 'docs'; | |
rmarkdown::render(input_file, | |
encoding=encoding, | |
output_file=file.path(dirname(input_file), out_dir, 'index.html'))}) | |
--- | |
```{r setup, include=FALSE} | |
knitr::opts_chunk$set(echo = TRUE) | |
rm(list = ls()) | |
filedir <- getwd() | |
``` | |
```{r packages, echo=FALSE} | |
library(magrittr) | |
library(tinytex) | |
library(ALSM) | |
library(stringr) | |
library(stats) | |
library(caret) | |
#plotly Regression packages | |
library(reshape2) # to load tips data | |
library(tidymodels) # for the fit() function | |
library(plotly) | |
library(kernlab) | |
library(pracma) #for meshgrid() | |
library(olsrr) #cooks residuals | |
library(lmtest) #for bp test | |
``` | |
CA demographics dataset can be downloaded from: https://www.countyhealthrankings.org/app/california/2021/downloads | |
```{r data wrangling} | |
#main CA demographics dataset | |
demo <- read.csv(file="demographics.csv") | |
#main working df | |
df.demo <- demo[2:58,c(3,4,7,8,9,10,11,14,17,18,19,20,21,22,23)] | |
df.demo[,c(2,4,6,8,12)] <- df.demo[,c(2,4,6,8,12)] / 100 #change % columns to decimal | |
#main covid case data from jhu.edu | |
covid <- read.csv(file="CA-2020-cov-data.csv") %>% dplyr::rename(Cases = Case.Count.2020) | |
df.main <- left_join(df.demo, covid, by="County") %>% dplyr::select(Cases,County,-X,everything()) | |
df.main <- df.main[,1:16] | |
``` | |
```{r initial plotting} | |
plot(df.main) | |
``` | |
Experiment 3: Full MLR | |
h0 = b1 = b2 = b3 = b4 = b5 = b6 = 0 | |
ha = at least one point estimate is not 0 | |
``` {r full MLR} | |
Y <- df.main$Cases -> Confirmed.COVID19.Cases | |
X1 <- df.main$NUM.FOOD.INSECURE -> Food.Insecurity | |
X2 <- df.main$PERCENT.ADULT.DIABETES -> Percent.Adult.Diabetes | |
X3 <- df.main$PERCENT.OBESITY -> Percent.Adult.Obesity | |
X4 <- df.main$PERCENT.LIMITED.ACCESS -> Access.to.Healthy.Food | |
X5 <- df.main$MEDIAN.HOUSEHOLD.INCOME -> Median.Household.Income | |
X6 <- df.main$AVG.DAILY.PM2.5 -> Air.Pollution | |
df.full <- cbind(Y,X1,X2,X3,X4,X5,X6) | |
df.full1 <- cbind(Confirmed.COVID19.Cases,Food.Insecurity,Percent.Adult.Diabetes,Percent.Adult.Obesity,Access.to.Healthy.Food,Median.Household.Income,Air.Pollution) %>% as.data.frame | |
mod.full <- lm(Confirmed.COVID19.Cases~Food.Insecurity+Percent.Adult.Diabetes+Percent.Adult.Obesity+Access.to.Healthy.Food+Median.Household.Income+Air.Pollution, df.full1) | |
summary(mod.full) | |
anova(mod.full) | |
#Plot all MLR Variables | |
plot(df.full1) | |
``` | |
From the ANOVA results of the full MLR, it is clear further model refinement is required. Therefore, we plot the MLR model criteria with plotmodel.s function, part of the ALSM package. Additionally, the Best Subsets Algorithm (also from ALSM) is utilized to determine the best model for this experiment. | |
```{r full MLR model selection} | |
#Plot Criterias for Model Selection | |
plotmodel.s(df.full1[,2:7], df.full1$Confirmed.COVID19.Cases) | |
``` | |
Using the best subsets algorithm, we determine which models to analyze with the k fold analysis. | |
```{r MLR BestSub} | |
#Best Subsets Algorithm for Model Selection | |
bs.full <- BestSub(df.full1[2:7],df.full1$Confirmed.COVID19.Cases,num=1) %>% as.data.frame | |
bs.full | |
write.csv(bs.full, "BestSubsets.csv") | |
``` | |
As we can see from the output of the best subsets algorithm, three models emerge as the top contenders. Step Model 1 is defined from row 2, with the lowest Cp, AICp, SBCp and PRESSp. | |
Step Model 2 is defined from row 4 with the highest r2.adj. | |
Step Model 3 is defined from row 6 with the highest r2. | |
The following code uses the K Fold Cross Validation technique to analyze the three model choices. | |
``` {r MLR Cross Validation} | |
#K Fold Cross Validation | |
set.seed(123) #set seed for reproducibility | |
train.control = trainControl(method='cv', number=10) #10-fold cross validation | |
#Step-model 1 (p = 3, row 2) | |
step.model1 = train(Y~X1+X5, data = df.full, method='leapBackward', | |
tuneGrid=data.frame(nvmax=2), | |
trControl=train.control) | |
#Step-model 2 (p = 5, row 4) | |
step.model2 = train(Y~X1+X4+X5+X6, data = df.full, method='leapBackward', | |
tuneGrid=data.frame(nvmax=4), | |
trControl=train.control) | |
#Step-model 3 (p = 7, row 6) | |
step.model3 = train(Y~X1+X2+X3+X4+X5+X6, data = df.full, method='leapBackward', | |
tuneGrid=data.frame(nvmax=6), | |
trControl=train.control) | |
step.models <- rbind(step.model1$results,step.model2$results,step.model3$results) | |
step.models$Model.Number <- c(1,2,3) | |
step.models <- step.models %>% select(Model.Number,everything()) | |
step.models | |
``` | |
We observe that Step Model 1 has the lowest RMSE, thus this becomes our model where Confirmed COVID-19 cases are described by explanatory variables X1(food insecurity) and X5 (median household income). | |
Now, we perform model diagnostics on the selected model. | |
```{r MLR diagnostics} | |
#Initialize Step Model 3 as lm | |
Y <- df.main$Cases -> Confirmed.COVID19.Cases | |
X1 <- df.main$NUM.FOOD.INSECURE -> Food.Insecurity | |
X5 <- df.main$MEDIAN.HOUSEHOLD.INCOME -> Median.Household.Income | |
County <- df.main$County | |
df.mlr <- cbind(Confirmed.COVID19.Cases,Food.Insecurity,Median.Household.Income) %>% as.data.frame | |
mod.mlr <- lm(Confirmed.COVID19.Cases~Food.Insecurity+Median.Household.Income,df.mlr) | |
summary(mod.mlr) | |
anova(mod.mlr) | |
#Plot mod.mlr | |
plot(mod.mlr) | |
#Residual Plots on Step Model 3 | |
residualPlot(mod.mlr) | |
#Plot Partial Regression Plots (AV Plots) for each predictor in Step Model 3 | |
avPlots(mod.mlr) | |
``` | |
From the plots generated, we observe possible outliers graphically in our dataset. We use outlier methods to confirm this through analytical methods. | |
```{r MLR outliers} | |
#BoxPlot of MLR data for step model 3 | |
fig.mlr.bp <- plot_ly(y=~df.mlr$Confirmed.COVID19.Cases,type='box',name='MLR data') | |
fig.mlr.bp | |
# boxplot(df.mlr$Confirmed.COVID19.Cases, | |
# ylab = "Confirmed COVID-19 Cases") | |
#boxplot statistics for outlier detection | |
mlr.outliers <- boxplot.stats(df.mlr$Confirmed.COVID19.Cases)$out | |
mlr.outlier.index <- which(mlr.outliers %in% c(mlr.outliers)) | |
#studentized residuals for outlier analysis | |
resid.student <- rstudent(mod.mlr) | |
#Bonferroni T Value | |
alpha = 0.1 | |
n = nrow(df.mlr) | |
p = bs.full[2,"p"] | |
t.out <- qt(1-alpha/2,n-1-p) #t value for outlier detection | |
df.main[which(resid.student > t.out),] | |
# c("County","Cases","Food.Insecurity","Percent.Adult.Diabetes","Percent.Adult.Obesity","Access.to.Healthy.Food","Median.Household.Income","Air.Pollution") | |
#outliers identified with 6 explanatory variables | |
df.full1[which(resid.student > t.out),] | |
df.out <- cbind(df.main[which(resid.student > t.out),"County"],df.full1[which(resid.student > t.out),]) | |
colnames(df.out)[1] <- "County" | |
df.out[,c("County","Confirmed.COVID19.Cases","Food.Insecurity","Median.Household.Income")] | |
``` | |
Since boxplot.stats did not yield a better result by removing outliers identified through that method, we use quantile method for alpha of 0.025 to remove outliers. | |
```{r MLR outlier cleansing} | |
#define lower and upper bounds for quantile method | |
mlr.lb <- quantile(df.mlr$Confirmed.COVID19.Cases,0.05) | |
mlr.ub <- quantile(df.mlr$Confirmed.COVID19.Cases,0.95) | |
#define index of values outside of quantile bounds | |
mlr.quant.index <- which(df.mlr$Confirmed.COVID19.Cases < mlr.lb | df.mlr$Confirmed.COVID19.Cases > mlr.ub) | |
#print outliers identified | |
df.main[mlr.quant.index,] | |
#second cleansed df without outliers, quantile method | |
df.mlr.c <- df.mlr[-c(mlr.quant.index),] | |
#boxplot to verify cleansed df | |
fig.mlr.bp2 <- plot_ly(y=~df.mlr.c$Confirmed.COVID19.Cases,type='box',name='MLR data') | |
fig.mlr.bp2 | |
#outliers from boxplot statistics | |
bp.out <- boxplot.stats(df.mlr.c$Confirmed.COVID19.Cases)$out | |
bp.out.index <- which(df.mlr.c$Confirmed.COVID19.Cases %in% c(bp.out)) | |
#remove 3 remaining y outliers | |
df.mlr.c <- df.mlr.c[-bp.out.index,] | |
``` | |
With this quantile method, we remove outliers of several major counties. We attempted to use alpha = 0.025, 0.05, and 0.1. When using 0.025, too many outliers still existed, but when we used 0.1, too many large CA counties were excluded for our liking. Thus we use alpha = 0.05 for this quantile method. | |
```{r MLR diagnostics2} | |
#Initialize Step Model 3 as lm | |
Y <- df.mlr.c$Confirmed.COVID19.Cases -> Confirmed.COVID19.Cases | |
X1 <- df.mlr.c$Food.Insecurity -> Food.Insecurity | |
X5 <- df.mlr.c$Median.Household.Income -> Median.Household.Income | |
mod.mlr.c <- lm(Confirmed.COVID19.Cases~Food.Insecurity+Median.Household.Income,df.mlr.c) | |
summary(mod.mlr.c) | |
anova(mod.mlr.c) | |
#Plot mod.mlr | |
plot(mod.mlr.c) | |
#Residual Plots on Step Model 3 | |
residualPlot(mod.mlr.c) | |
#Plot Partial Regression Plots (AV Plots) for each predictor in Step Model 3 | |
avPlots(mod.mlr.c) | |
``` | |
Significant curvature exists in the residual plot. Hat leverage metric for x outliers | |
```{r MLR x outliers} | |
#hat leverage | |
h = p/n | |
h | |
hat.vals <- lm.influence(mod.mlr.c)$hat | |
hat.index <- which(hat.vals > 2*h) | |
#identify x outliers | |
df.main[hat.index,] | |
#remove x outliers | |
df.mlr.c <- df.mlr.c[-hat.index,] | |
df.mlr[hat.index,] | |
df.out2 <- cbind(df.main[hat.index,"County"], df.mlr[hat.index,]) | |
colnames(df.out2)[1] <- "County" | |
df.out2 | |
``` | |
```{r MLR diagnostics3} | |
#Initialize Step Model 3 as lm | |
Y <- df.mlr.c$Confirmed.COVID19.Cases -> Confirmed.COVID19.Cases | |
X1 <- df.mlr.c$Food.Insecurity -> Food.Insecurity | |
X5 <- df.mlr.c$Median.Household.Income -> Median.Household.Income | |
mod.mlr.c <- lm(Confirmed.COVID19.Cases~Food.Insecurity+Median.Household.Income,df.mlr.c) | |
summary(mod.mlr.c) | |
anova(mod.mlr.c) | |
#Plot mod.mlr | |
plot(mod.mlr.c) | |
#Residual Plots on Step Model 3 | |
residualPlot(mod.mlr.c) | |
#Plot Partial Regression Plots (AV Plots) for each predictor in Step Model 3 | |
avPlots(mod.mlr.c) | |
``` | |
Now that we have confirmed that our new dataset and model are acceptable from the standpoint of outliers, we examine our model for multicollinearity with the vif function from the cars package. | |
```{r MLR multicollinearity} | |
#plot dataset to visually analyze for multicollinearity | |
plot(df.mlr.c) | |
#Test for multicollinearity with vif | |
car::vif(mod.mlr.c) | |
``` | |
```{r MLR Tests} | |
#Shapiro Test for Non-normality violation mitigation | |
shapiro.test(df.mlr$Confirmed.COVID19.Cases) | |
shapiro.test(df.mlr.c$Confirmed.COVID19.Cases) | |
#BP tests | |
bptest(mod.mlr) | |
bptest(mod.mlr.c) | |
``` | |
Now that we have acceptable model for the MLR, we plot in 3d space. | |
```{r MLR visualization} | |
# mesh_size <- .02 | |
# margin <- 0 | |
# X <- df.mlr.c %>% select(Food.Insecurity,Median.Household.Income) /1000 | |
# y <- df.mlr.c %>% select(Confirmed.COVID19.Cases) /1000 | |
# | |
# model <- svm_rbf(cost = 1.0) %>% | |
# set_engine("kernlab") %>% | |
# set_mode("regression") %>% | |
# fit(Confirmed.COVID19.Cases~Food.Insecurity+Median.Household.Income,data=df.mlr.c) | |
# | |
# x_min <- min(X$Food.Insecurity) - margin | |
# x_max <- max(X$Food.Insecurity) - margin | |
# y_min <- min(X$Median.Household.Income) - margin | |
# y_max <- max(X$Median.Household.Income) - margin | |
# xrange <- seq(x_min, x_max, mesh_size) | |
# yrange <- seq(y_min, y_max, mesh_size) | |
# xy <- pracma::meshgrid(x = xrange, y = yrange) | |
# xx <- xy$X | |
# yy <- xy$Y | |
# dim_val <- dim(xx) | |
# xx1 <- matrix(xx, length(xx), 1) | |
# yy1 <- matrix(yy, length(yy), 1) | |
# final <- cbind(xx1, yy1) | |
# pred <- model %>% | |
# predict(final) | |
# | |
# pred <- pred$.pred | |
# pred <- matrix(pred, dim_val[1], dim_val[2]) | |
# | |
# #axes bounds | |
# axx <- list( | |
# nticks = 4, | |
# range = c(0,150) | |
# ) | |
# | |
# axy <- list( | |
# nticks = 4, | |
# range = c(0,150) | |
# ) | |
# | |
# axz <- list( | |
# nticks = 4, | |
# range = c(0,4100) | |
# ) | |
# | |
# fig <- plot_ly(df.mlr.c, x = ~Food.Insecurity/1000, y = ~Median.Household.Income/1000, z = ~Confirmed.COVID19.Cases/1000 ) %>% | |
# add_markers(size = 5) %>% | |
# add_surface(x=xrange, y=yrange, z=pred, alpha = 0.65, type = 'mesh3d', name = 'pred_surface') | |
# fig <- fig %>% layout(scene = list(xaxis=axx,yaxis=axy,zaxis=axz)) | |
# fig | |
plot(df.mlr.c) | |
fig.mlr1 <- plot_ly(df.mlr.c, x = ~Food.Insecurity, y = ~Confirmed.COVID19.Cases, | |
type='scatter', | |
alpha=0.65, | |
mode='markers', | |
name='COVID-19 Cases') %>% | |
add_lines(x=~Median.Household.Income,y=fitted(lm(Confirmed.COVID19.Cases~Median.Household.Income,df.mlr.c)),name='Regression Line') %>% | |
layout(title='COVID-19 Cases vs. Food Insecurity + Median Household Income', | |
xaxis = list(title='Food Insecurity + Median Household Income'), | |
yaxis = list(title='Confirmed COVID-19 Cases')) | |
fig.mlr1 | |
``` | |
Experiment 2: Food insecurity as a factor for COVID-19 cases | |
H0: b1 = 0 | |
Ha: b1 ~= 0 | |
```{r exp2 food insecurity} | |
Y <- df.mlr.c$Confirmed.COVID19.Cases -> Confirmed.COVID19.Cases | |
X <- df.mlr.c$Food.Insecurity -> Food.Insecurity | |
df.fi <- cbind(Food.Insecurity,Confirmed.COVID19.Cases) %>% as.data.frame | |
rm(Confirmed.COVID19.Cases,Food.Insecurity) | |
mod.fi <- lm(Confirmed.COVID19.Cases~Food.Insecurity, df.fi) | |
mod.fi <- lm(Confirmed.COVID19.Cases~Food.Insecurity,df.mlr.c) | |
summary(mod.fi) | |
anova(mod.fi) | |
residualPlot(mod.fi) | |
``` | |
```{r exp2 outlier diagnostics} | |
#define lower and upper bounds for quantile method | |
fi.lb <- quantile(df.fi$Confirmed.COVID19.Cases,0.1) | |
fi.ub <- quantile(df.fi$Confirmed.COVID19.Cases,0.9) | |
#define index of values outside of quantile bounds | |
fi.quant.index <- which(df.fi$Confirmed.COVID19.Cases < fi.lb | df.fi$Confirmed.COVID19.Cases > fi.ub) | |
#print outliers identified | |
# df.main[fi.quant.index,] | |
df.fi[fi.quant.index,] | |
#second cleansed df without outliers, quantile method | |
df.fi.c <- df.fi[-c(fi.quant.index),] | |
#boxplot to verify cleansed df | |
fig.fi.bp2 <- plot_ly(y=~df.fi.c$Confirmed.COVID19.Cases,type='box',name='Food Insecurity data') | |
fig.fi.bp2 | |
#outliers from boxplot statistics | |
fi.bp.out <- boxplot.stats(df.fi.c$Confirmed.COVID19.Cases)$out | |
fi.bp.out.index <- which(df.fi.c$Confirmed.COVID19.Cases %in% c(bp.out)) | |
#remove 3 remaining y outliers | |
df.fi.c <- df.fi.c[-bp.out.index,] | |
mod.fi.c <- lm(Confirmed.COVID19.Cases~Food.Insecurity, df.fi.c) | |
#cooks distance, dfbetas, residuals diagnostics plots | |
ols_plot_cooksd_bar(mod.fi.c) | |
ols_plot_dfbetas(mod.fi.c) | |
ols_plot_resid_stud(mod.fi.c) | |
outlierTest(mod.fi.c) | |
#remove data that violates cook distance criteria | |
df.fi.cooks <- cooks.distance(mod.fi.c) | |
mean.fi.cooks <- mean(cooks.distance(mod.fi.c)) | |
out.fi.cooks <- which(df.fi.cooks > 4*mean.fi.cooks) | |
df.fi.cl <- df.fi.c[-out.fi.cooks,] | |
#update model without outliers | |
mod.fi.cl <- lm(Confirmed.COVID19.Cases~Food.Insecurity, df.fi.cl) | |
#outlier diagnostics of new model | |
residualPlot(mod.fi.cl) | |
#remove observation 20 as outlier | |
df.fi.cl <- df.fi.cl[-c(20),] | |
``` | |
```{r exp2 final model} | |
mod.fi.cl <- lm(Confirmed.COVID19.Cases~Food.Insecurity, df.fi.cl) | |
summary(mod.fi.cl) | |
anova(mod.fi.cl) | |
``` | |
```{r exp2 plots} | |
#plot linear model data | |
plot(mod.fi.cl) | |
residualPlot(mod.fi.cl) | |
#plot data with regression line | |
plot(X,Y,pch = 16, cex = 1.3, col = "blue", main = "Confirmed COVID-19 Cases vs. Food Insecurity", xlab = "Food Insecurity", ylab = "Confirmed COVID-19 Cases") | |
abline(mod.fi.cl) | |
#Plotly | |
fig.fi <- plot_ly(df.fi.cl, x = ~df.fi.cl$Food.Insecurity, y = ~df.fi.cl$Confirmed.COVID19.Cases, | |
type = 'scatter', | |
alpha=0.65, | |
mode='markers', | |
name='COVID-19 Cases') %>% | |
add_lines(x=~df.fi.cl$Food.Insecurity,y=fitted(mod.fi.cl),name='Regression Line') %>% | |
layout(title='COVID-19 Cases vs. Food Insecurity', | |
xaxis = list(title='Number of Food Insecure Adults'), | |
yaxis = list(title='Confirmed COVID-19 Cases')) | |
fig.fi | |
``` | |
```{r exp2 kfold cross validation} | |
#K Fold Cross Validation | |
set.seed(123) #set seed for reproducibility | |
train.control = trainControl(method='repeatedcv', number=10,repeats=3) #10-fold cross validation, 3 repeats | |
mod.fi.train = train(Confirmed.COVID19.Cases~Food.Insecurity, | |
data = df.fi.cl, | |
method='lm', | |
trControl=train.control) | |
mod.fi.train | |
# summary(mod.fi.train) | |
``` | |
Experiment 3: Comparing Food Insecurity vs. Health factors such as obesity and diabetes. | |
H0: b2 = b3 | |
Ha: not H0 | |
``` {r exp3} | |
Cases <- df.main$Cases -> Confirmed.COVID19.Cases | |
Diabetes <- df.main$PERCENT.ADULT.DIABETES*100 | |
Obesity <- df.main$PERCENT.OBESITY*100 | |
df.health <- cbind(Cases,Diabetes,Obesity) %>% as.data.frame | |
mod.health <- lm(Cases~Diabetes+Obesity, df.health) | |
summary(mod.health) | |
anova(mod.health) | |
``` | |
```{r exp3 diagnostics} | |
plot(mod.health) | |
residualPlot(mod.health) | |
#cooks distance, dfbetas, residuals diagnostics plots | |
ols_plot_cooksd_bar(mod.health) | |
ols_plot_dfbetas(mod.health) | |
ols_plot_resid_stud(mod.health) | |
outlierTest(mod.health) | |
``` | |
```{r exp3 remedial} | |
#Remove Outliers | |
#define lower and upper bounds for quantile method | |
lb <- quantile(df.health$Cases,0.1) | |
ub <- quantile(df.health$Cases,0.9) | |
#define index of values outside of quantile bounds | |
health.out.index <- which(df.health$Cases < lb | df.health$Cases > ub) | |
#print outliers identified | |
df.main[health.out.index,] | |
#second cleansed df without outliers, quantile method | |
df.health.c <- df.health[-c(health.out.index),] | |
mod.health.c <- lm(log(Cases)~Diabetes+Obesity,df.health.c) | |
plot(mod.health.c) | |
residualPlot(mod.health.c) | |
#cooks distance, dfbetas, residuals diagnostics plots | |
ols_plot_cooksd_bar(mod.health.c) | |
ols_plot_dfbetas(mod.health.c) | |
ols_plot_resid_stud(mod.health.c) | |
outlierTest(mod.health.c) | |
``` | |
```{r exp3 Remedy Tests} | |
#Shapiro Test for Non-normality violation mitigation | |
shapiro.test(df.health$Cases) | |
shapiro.test(df.health.c$Cases) | |
#BP tests | |
bptest(mod.health) | |
bptest(mod.health.c) | |
``` | |
From this test, we see that BP test p value is greater than 0.05, thus heteroskedaskity is an issue with our model. | |
```{r exp3 heteroskedaskity remedy} | |
#OLS Model | |
mod.ols <- lm(Cases ~ Diabetes+Obesity, df.health.c) # Fit our model to get our residuals. | |
df.health.c$resi <- mod.ols$residuals | |
varfunc.ols <- lm(log(resi^2) ~ log(Diabetes) + log(Obesity), data = df.health.c) | |
print("summary(varfunc.ols)") | |
summary(varfunc.ols) | |
#Weighted Least Squares Model | |
df.health.c$varfunc <- exp(varfunc.ols$fitted.values) | |
mod.gls <- lm(Cases ~ Diabetes+Obesity, weights = 1/sqrt(varfunc), data = df.health.c) | |
print("summary(mod.gls)") | |
summary(mod.gls) | |
trace1 <- predict(mod.health.c) | |
trace2 <- predict(mod.gls) | |
fig.Diabetes <- plot_ly(data = df.health.c, x = ~Diabetes, y = ~Cases,name='Data Points',type='scatter',mode='markers') %>% | |
add_trace(y = ~trace1, name = 'OLS',mode = 'lines') %>% | |
add_trace(y = ~trace2, name = 'WLS',mode='lines') %>% | |
layout(title = 'Model Comparison of WLS vs. OLS models for Cases~Diabetes') | |
fig.Diabetes | |
plot_ly(data = df.health.c, x = ~Obesity, y = ~Cases,name='Data Points',type='scatter',mode='markers') | |
``` | |
```{r exp3 post remedy bp test} | |
#BP tests | |
bptest(mod.health.c) | |
bptest(mod.gls) | |
``` | |
```{r Experiment 3 Plots} | |
plot(mod.fih) | |
#Plotly | |
fig.fih <- plot_ly(df.fih, x = ~X1+X2+X3, y = ~Y, | |
type='scatter', | |
alpha=0.65, | |
mode='markers', | |
name='COVID-19 Cases') %>% | |
add_lines(x=~X1+X2+X3,y=fitted(mod.fih),name='Regression Line') %>% | |
layout(title='COVID-19 Cases vs. Food Insecurity+Obesity+Diabetes', | |
xaxis = list(title='Food Insecurity + Obesity + Diabetes'), | |
yaxis = list(title='Confirmed COVID-19 Cases')) | |
fig.fih | |
``` |