Dataset Details

The dataset for this project was obtained from Kaggle.com. The data was collected via the Vaccine Adverse Event Reporting System (VAERS) which was created by the Food and Drug Administration (FDA) and the Center for Disease Control and Prevention (CDC) to receive reports about adverse events that may be associated with Covid-19 vaccines. The dataset consists of more than 5,000 reports as of February 11, 2021 for different individuals in all 50 states across the United States.

Objective

The objective of this project is to gain knowledge about adverse effects of Covid-19 vaccines on populations of different age groups and gender and to study if age and gender plays any role in having any adverse effect of a Covid-19 vaccine.

library(plotly)
library(tidyverse)

# Load COVID-19 World Vaccine Adverse Reactions dataset
VAERSDATA <- read.csv(
  "/Users/munibasiddiqi/Desktop/BUCS544/CS544\ Project/CS544Final_Shaikh/2021VAERSDATA.csv")

covidVAE.data <-select(VAERSDATA, STATE, AGE_YRS, SEX, NUMDAYS, L_THREAT )

# covidVAE.data <- VAERSDATA[c("STATE", "AGE_YRS", "SEX","NUMDAYS","L_THREAT" )]

# Removing rows with empty/missing values

covidVAE.data <- subset(covidVAE.data, covidVAE.data$STATE != "" & covidVAE.data$AGE_YRS != "" &  covidVAE.data$SEX!= "" & covidVAE.data$NUMDAYS != "" )

# Renaming columns
colnames(covidVAE.data) <- c("State", "Age", "Gender", "NumDays_Before_Onset", "Life_Threatening")

# Convert the data type age column to integer
covidVAE.data$Age <- as.integer(covidVAE.data$Age)

# Convert the data type numdays column to integer
covidVAE.data$NumDays_Before_Onset <- as.integer(covidVAE.data$NumDays_Before_Onset)

# Convert the data frame into tibble format
covidVAE.data <- as_tibble(covidVAE.data)

# glimpse(covidVAE.data)

Analysis of Age Distribution of the Population

# Bar plot of population age

plot_ly(covidVAE.data, x = ~Age) %>%
  layout(xaxis = list(title = 'Age'), 
         yaxis = list(title = 'Frequency')) -> p
p
x <- hist(covidVAE.data$Age, col = hcl(0), xlab = "Age", ylab = "Frequency", main = "")

for (i in 1:(length(x$breaks)-1)){
  cat(x$counts[i]," people in age group range (",x$breaks[i],",",x$breaks[i+1],"] got adverse reaction to Covid-19 vaccine.\n", sep = "")
}
## 5 people in age group range (0,10] got adverse reaction to Covid-19 vaccine.
## 25 people in age group range (10,20] got adverse reaction to Covid-19 vaccine.
## 350 people in age group range (20,30] got adverse reaction to Covid-19 vaccine.
## 626 people in age group range (30,40] got adverse reaction to Covid-19 vaccine.
## 570 people in age group range (40,50] got adverse reaction to Covid-19 vaccine.
## 549 people in age group range (50,60] got adverse reaction to Covid-19 vaccine.
## 525 people in age group range (60,70] got adverse reaction to Covid-19 vaccine.
## 487 people in age group range (70,80] got adverse reaction to Covid-19 vaccine.
## 431 people in age group range (80,90] got adverse reaction to Covid-19 vaccine.
## 169 people in age group range (90,100] got adverse reaction to Covid-19 vaccine.
## 5 people in age group range (100,110] got adverse reaction to Covid-19 vaccine.
options(digits = 4)
cat("Population Mean of Age = ", mean(covidVAE.data$Age)," SD = ", sd(covidVAE.data$Age), " median = ", median(covidVAE.data$Age))
## Population Mean of Age =  56.83  SD =  20.38  median =  56

The overall population is normally distributed with respect to age as there is not much difference between mean and median.

The histogram shows that the population having adverse reaction to Covid-19 vaccines is in the age group ranging from 1 year old to 105 years old.

Based on the current data set, very few people under 20 years of age had a reaction to a Covid-19 vaccine. Age group under 20 does not have a good representation in the population data set as people under 20 are not on the priority list for Covid-19 vaccination. The current data set covers health care workers, senior citizens and people with underlying health conditions.

Analysis of Gender Distribution of the Population

plot_ly(covidVAE.data, labels = ~Gender, type = "pie") 
x <- length(covidVAE.data$Gender)
table(covidVAE.data$Gender)
## 
##    F    M    U 
## 2536 1197    9
cat("Total Population = ", x)
## Total Population =  3742

The pie chart shows that 67.8% of the population that had an adverse reaction to the Covid-19 vaccine are females, 32% of the population are males, and 0.241% of the population are people with undefined gender. However, we cannot infer from this data set that females are more susceptible to having an adverse reaction to the Covid-19 vaccine than males and undefined gender as it does not contain the total number of males, females, and undefined gender that have been vaccinated so far.

plot_ly(covidVAE.data, x = ~Age, name = ~Gender, color = ~Gender) %>%
  layout(xaxis = list(title = 'Age'), yaxis = list(title = 'Frequency'),barmode = 'stack') -> p
p

The above histogram shows that overall population age is normally distributed. However, the shape of the female population distribution is right skewed and the shape of the male population distribution is left skewed.

Analysis of Age and Gender of the Population

subplot(
  plot_ly(covidVAE.data, y = ~Age, type = "box",name = "Population"),
  plot_ly(covidVAE.data, y = ~Age, color = ~Gender, type = "box") ,
  nrows = 1,
  shareY = TRUE) %>% 
  layout( yaxis = list(title = 'Age')
  ) -> p
p

The boxplot shows that on average females that had an adverse reaction to a Covid-19 vaccine are younger than males.

The females that had an adverse reaction to the Covid-19 vaccine are in the age group between 37 and 68 years with a median age of 51 years.

The male population that had an adverse reaction to a Covid-19 vaccine are in the age group between 50 and 79 years old with a median age of 67 years.

covidVAE.data %>%
  group_by(Gender, Age) %>%
  summarise(count = n()) -> gender
#gender

plot_ly(gender, x = ~Gender,y = ~count, color = ~Age) %>%
  layout(xaxis = list(title = 'Gender'), barmode = 'group') -> p

p 

The bar plot shows that the number of females that had an adverse reaction to a Covid-19 vaccine is almost double the number of males in the data set. Very few in the data set with undefined gender had a reaction to the Covid-19 vaccine. The female bar consists of mostly blue and green color which means that most females are in the age group between 30 to 70 years of age.

Central Limit Theorem

The Central Limit theorem states that the distribution of sample means, taken from independent random sample sizes, follows a normal distribution even if the original population is not normally distributed.

As shown in histograms below, the overall age distribution of population is normal. However, the age distribution of female population is right skewed and the age distribution of male population is left skewed. For this project, the female population is used as an example to show the application of the central limit theorem.

plot_ly(covidVAE.data, x = ~Age, name = ~Gender, color = ~Gender) %>%
  layout(xaxis = list(title = 'Age'),
         yaxis = list(title = 'Frequency'),
         barmode = 'stack') -> p1
#p1

female.data <- subset(covidVAE.data, Gender == "F")
#female.data

plot_ly(female.data, x = ~Age, name = "female") %>%
  layout(xaxis = list(title = 'Age'), 
         yaxis = list(title = 'Frequency')) -> p2
#p2

subplot(p1,p2,
        nrows = 2,
        shareX = TRUE)
cat("Below are histograms showing the sample means of 1000 random samples of sample size 10, 20, 30, and 40 follow a normal distribution.")
## Below are histograms showing the sample means of 1000 random samples of sample size 10, 20, 30, and 40 follow a normal distribution.
samples <- 1000

sample.size <- c(10, 20,30,40)

xbar <- numeric(samples)
xbar2 <- numeric(samples)
xbar3 <- numeric(samples)
xbar4 <- numeric(samples)

set.seed(0264)

for (i in 1: samples) {
  xbar[i] <- mean(sample(female.data$Age, sample.size[1], replace = FALSE))
  xbar2[i] <- mean(sample(female.data$Age, sample.size[2], replace =FALSE))
  xbar3[i] <- mean(sample(female.data$Age, sample.size[3], replace =FALSE))
  xbar4[i] <- mean(sample(female.data$Age, sample.size[4], replace =FALSE))
}

# Plots of sample means of sample of sizes 10, 20, 30, and 40

subplot(
  
  plot_ly( x = xbar, name = sample.size[1]),
  plot_ly( x = xbar2, name = sample.size[2]),
  plot_ly( x = xbar3, name = sample.size[3]),
  plot_ly( x = xbar4, name = sample.size[4]),
  
  nrows = 2,
  shareY = FALSE, 
  shareX = TRUE
)
sample.mean <- numeric(length(sample.size))
sample.sd <- numeric(length(sample.size))
count <-  1

sample.mean[count] <- mean(xbar)
sample.sd[count] <- sd(xbar)
count <- count + 1

sample.mean[count] <- mean(xbar2)
sample.sd[count] <- sd(xbar2)
count <- count + 1

sample.mean[count] <- mean(xbar3)
sample.sd[count] <- sd(xbar3)
count <- count + 1

sample.mean[count] <- mean(xbar4)
sample.sd[count] <- sd(xbar4)

cat("Population Mean = ", mean(female.data$Age),
    " SD = ", sd(female.data$Age), "\n")
## Population Mean =  53.59  SD =  20.04
for(i in 1:length(sample.size)){
  cat("Sample Size = ", sample.size[i], " Mean = ", sample.mean[i],
      " SD = ", sample.sd[i], "\n")
}
## Sample Size =  10  Mean =  53.73  SD =  6.423 
## Sample Size =  20  Mean =  53.63  SD =  4.357 
## Sample Size =  30  Mean =  53.42  SD =  3.63 
## Sample Size =  40  Mean =  53.32  SD =  3.237

Sample means are same as the mean of the female population.

Standard deviation varies with the sample size. The larger the sample size, the smaller the sample standard deviation.

The distributions of sample means are normal distributions. Hence, the Central Limit Theorem holds even if the female population distribution is right skewed, the sampling distributions of sample means are normal.

As the sample size increases, the shape of the sampling distribution of sample mean gets closer to normal distribution.

Sampling of Population Age via Simple Random Sample Without Replacement, Systematic Sampling, Systematic Sampling With Unequal Probablities, and Stratified Sampling

Sampling is a technique to select a representative portion of the population to perform an analysis on and draw inferences from it for the whole population. Sampling is very useful and efficient when analyzing huge amounts of data because it requires less computational power. Inferences can be drawn for the whole population based on a sample taken from that population without needing to evaluate the whole population. .

For this project the age of the population has been analyzed via Simple random sampling without replacement, systematic sampling, systematic sampling With unequal probabilities, and stratified sampling methods.

Simple random sampling is a basic sampling technique where individual subjects are selected at random from a larger group. In this case, every sample has the same chance of getting picked. Systematic sampling is a method where samples are selected via a fixed periodic interval. The interval is calculated by dividing the whole population sample by the desired sample size. The first sample is decided randomly within the first interval. Lastly, stratified sampling takes into the account that there is heterogeneity in a population. The population is subdivided into sub populations and the same percentage of individuals is selected from each sub population to make up the sample set. When looking at a normal distribution, the sample mean can be used as an estimate for the population mean. .

library(sampling)

sample.size <- 100

## 

# a)  Sample drawn using simple random sampling without replacement.

set.seed(0264)
s <- srswor(sample.size, nrow(covidVAE.data))

sample1 <- covidVAE.data[s != 0, ]
# head(sample.1,n=2)
# nrow(sample1)

sample1 %>%
  group_by(Age) %>%
  summarise(count = n()) -> sample1.age
# sample1.age

##

# b) Sample drawn using systematic sampling. 

N <- nrow(covidVAE.data)

n <- sample.size

k <- ceiling(N / n)

set.seed(0264)
r <- sample(k, 1)

# select every kth item
s <- seq(r, by = k, length = n)
sample2 <- covidVAE.data[s, ]
# head(sample2, n=2)
# nrow(sample2)

sample2 %>%
  group_by(Age) %>%
  summarise(count = n()) -> sample2.age
# sample2.age

##

# c) Sample drawn using systematic sampling with unequal probabilities by 
# calculating inclusion probabilities based on Age variable.

pik <- inclusionprobabilities(covidVAE.data$Age, sample.size)
# length(pik)
# sum(pik)

set.seed(0264)
s <- UPsystematic(pik)
# length(s)

sample3 <- covidVAE.data[s != 0, ]
# head(sample3, n=3)
# nrow(sample3)

sample3 %>%
  group_by(Age) %>%
  summarise(count = n()) -> sample3.age
# sample3.age

##

# d) Sample drawn using stratified sampling using proportional sizes based on
# Gender

data <- covidVAE.data[order(covidVAE.data$Gender),]

data <- subset(data, !data$Gender == "U")
# nrow(data)

freq <- table(data$Gender)
# freq

sizes <- round(sample.size * freq / sum(freq))
# sizes
# sum(sizes)

set.seed(0264)
st <- strata(data, stratanames = c("Gender"),
             size = sizes, method = "srswor")
sample4 <- getdata(data,st)
# head(sample4, n=2)
# nrow(sample4)

sample4 %>%
  group_by(Age) %>%
  summarise(count = n()) -> sample4.age
# sample4.age

# e) Compare the means of Age variable for these four samples against
# the mean for the data.

# Plots

subplot(
  plot_ly(covidVAE.data, x = ~Age, name = "Population Age"),
  plot_ly(sample1.age, x = ~Age,y = ~count,type = "bar", name = "SRSWOR"),
  plot_ly(sample2.age, x = ~Age,y = ~count, type = "bar", name = "Systematic Sampling"),
  plot_ly(sample3, x = ~Age,y = ~count, type = "bar", name = "UP Systematic"),
  plot_ly(sample4.age, x = ~Age,y = ~count, type = "bar", name = "Stratified Sampling\nby Gender"),
  
  nrows = 5,
  shareY = FALSE, 
  shareX = TRUE
)
# Comparison of Means

options(digits = 4)

sampling.mean <- c(mean(sample1$Age), mean(sample2$Age,na.rm = TRUE), mean(sample3$Age), mean(sample4$Age))

sampling.sd <- c(sd(sample1$Age), sd(sample2$Age,na.rm = TRUE), sd(sample3$Age), sd(sample4$Age))

cat("Population Mean = ", mean(covidVAE.data$Age),
    " SD = ", sd(covidVAE.data$Age), "\n")
## Population Mean =  56.83  SD =  20.38
for(i in 1:4){
  cat("Sample ",i," Mean = ", sampling.mean[i],sep = "",
      " SD = ", sampling.sd[i], "\n")
}
## Sample 1 Mean = 58.51 SD = 21.33
## Sample 2 Mean = 57.27 SD = 20.66
## Sample 3 Mean = 66.68 SD = 17.94
## Sample 4 Mean = 55.8 SD = 21.18

The mean of the sample drawn using systematic sampling is closest to the population mean for age.

The mean of the sample drawn using stratified sampling using proportional sizes based on the gender is also close to the population mean of age for the dataset.

The mean of the sample drawn using systematic sampling with unequal probabilities is farthest from the population mean of age for the dataset.

A Sample drawn using systematic sampling would be a good representation of the whole dataset as there is not much difference between the means and standard deviations of the sample and entire population.

The data set used for this project can be seen as a sample of the whole population based on age ranging from 20 years old to 100 years old. Age group under 20 does not have a good representation in the population data set as people under 20 are not on the priority list for vaccination. The current data set covers health care workers, senior citizens and people with underlying health conditions.

Analysis of the number of days between receiving a Covid-19 vaccine and onset of adverse reaction for different age groups

covidVAE.data %>%
  group_by( NumDays_Before_Onset, Age) %>%
  summarise(count = n()) -> num_days_age

plot_ly(num_days_age, x = ~NumDays_Before_Onset,
        y = ~count, color = ~Age, type = "bar") %>%
  layout(xaxis = list(title = 'Number of days btw onset of adverse reaction and vaccination'),
         barmode = 'stack') -> p

p
cat("Average number of days between receiving a Covid-19 vaccine and onset of adverse reaction for the population = ",mean(covidVAE.data$NumDays_Before_Onset))
## Average number of days between receiving a Covid-19 vaccine and onset of adverse reaction for the population =  3.501

Analysis of the number of days between receiving a Covid-19 vaccine and onset of adverse reaction based on gender

covidVAE.data %>%
  group_by( NumDays_Before_Onset, Gender) %>%
  summarise(count = n()) -> num_days
# num_days

plot_ly(num_days, x = ~NumDays_Before_Onset,y = ~count, color = ~Gender, type = "bar") %>%
  layout(xaxis = list(title = 'Number of Days Between Vaccination and Onset of Adverse Reaction'),barmode = 'stack') -> p

p 

Most of the population reported an adverse reaction to the vaccine on the same day or the next day. The number of adverse reactions reporting shows exponentially downward trend as the number of days increases. The barplot also shows that 1017 females reported on same day compare to only 312 males and females continue to out number males for later days.

Some people reported a reaction to the vaccine between after 20 days to 60 days.

plot_ly(covidVAE.data, x = ~NumDays_Before_Onset, 
        color = ~Gender, type = "box") %>%
  layout(xaxis = list(title = 'Number of Days Between Vaccination and Onset of Adverse Reaction'),barmode = 'stack') -> p

p 

The box plot shows that most females have reported reaction mostly between 0 to 10 days range where as males have reported reaction between 0 to 12 days. The median value is 1 day for both males and females.

It also shows a few outliers in the dataset.

Analysis of Number of adverse reaction reports of the vaccine in different states

covidVAE.data %>%
  group_by(State,Gender) %>%
  summarise(count = n()) -> state
#state

plot_ly(state, x = ~State,y = ~count, color = ~Gender) %>%
  layout(xaxis = list(title = 'State'),barmode = 'stack') -> p

p 

As of Feburary 11th ,the highest number of adverse effects of the vaccine are reported in California.

Summary

The analysis shows that on average a female has a higher chance of getting an adverse reaction to the Covid-19 vaccine if she is between 37 and 68 years old with a median age of 51 years.

On average a male has a higher chance of getting an adverse reaction to a Covid-19 vaccine if he is between 50 and 79 years old with a median age of 67 years.

Most of the population have had an adverse reaction to the vaccine either on the same day or the next day.

The females have reported an adverse effect of the vaccine upto 36 days after getting it with a few outliers of upto 74 days.

The males have reported an adverse effect of the vaccine upto 31 days after getting it with a few outliers of upto 60 days.