Skip to content
Permalink
main
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Go to file
 
 
Cannot retrieve contributors at this time
---
title: "BIST5092-Modelling (Shreya)"
author: "Shreya Mittal"
date: '2022-08-07'
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
##Analyses the following
## 1. MCAR 20% missing data
## 2. MCAR 30% missing data
## 3. MAR data
## 4. Complete/Full Dataset
```{r}
#load all packages
library(dplyr)
library(car)
library(lme4)
library(mice)
#load all raw data
adsl20=read.csv(file="/Users/shreya/Downloads/MCAR20%n.csv", sep = ",", na.strings=c('',NA))
# remove sPGA rows
adsl20=adsl20 %>% filter(PARAMCD!='sPGA') #%>% filter(AVISITN!=2)
adsl20$PCHG=as.numeric(adsl20$PCHG)
adsl20$PCHGCAT1 = ifelse((adsl20$PCHG<75.0) , '<75%', ifelse(adsl20$PCHG>=75.0 , '>=75%', NA))
#if aval is na then pchgcat1 is na
adsl20$PCHGCAT1 = ifelse(is.na(adsl20$AVAL), NA, adsl20$PCHGCAT1)
adsl30=read.csv(file="/Users/shreya/Downloads/MCAR30%.csv", sep = ",", na.strings=c('',NA))
# remove sPGA rows
adsl30=adsl30 %>% filter(PARAMCD!='sPGA')
adsl30$PCHGCAT1 = ifelse((adsl30$PCHG>=0.0 & adsl30$PCHG<75.0 ), '<75%', ifelse(adsl30$PCHG>=75.0 , '>=75%', NA))
#if aval is na then pchgcat1 is na
adsl30$PCHGCAT1 = ifelse(is.na(adsl30$AVAL), NA, adsl30$PCHGCAT1)
#MCAR full data
adsl_full=read.csv(file="/Users/shreya/Downloads/ADPA + SEX.csv", sep = ",", na.strings=c('',NA))
# remove sPGA rows
adsl_full=adsl_full %>% filter(PARAMCD!='sPGA')
#MAR DATA
mar_adsl_full=read.csv(file="/Users/shreya/Downloads/MAR_Data_V1.csv", sep = ",", na.strings=c('',NA))
# remove sPGA rows
mar_adsl_full=mar_adsl_full %>% filter(PARAMCD!='sPGA')
mar_adsl_full$PCHGCAT1 = ifelse((mar_adsl_full$PCHG<75.0) , '<75%', ifelse(mar_adsl_full$PCHG>=75.0 , '>=75%', NA))
raw_dat=adsl20
```
##LOGISTIC REGRESSION-COMPLETE DATA
We see a very high variance and standard deviation per visit. In other words the log odds of being a responder vs non responder varies on average by 20.1 around as we move from visit to visit or each visit has its own unique deviation in terms of log odds.
the 20.1 corresponds to the deviation of the specific log odds of being a responder in a given visit from the overall log odss of being a responder across all visits. We see a high random intercept variance, therefore larger the variation of the log odds of being a responder from visit to visit, indicating that subjects have more chances of being a responder in some visits than others.
cite:https://www.rips-irsp.com/articles/10.5334/irsp.90/
```{r}
#check if a patient can have 2 PASI 75 labels
multi_label_check=raw_dat %>% group_by(SUBJID) %>% mutate(distinct_PASI=n_distinct(PCHGCAT1, na.rm = TRUE)) %>% filter(distinct_PASI>1)
#change to 0 or 1, 0 is non responder, 1 is responder
data=raw_dat
data$SEXN=as.factor(data$SEXN) #factor variable
data$TRTP=as.factor(data$TRTP) #factor variable
data$PCHGCAT1=dplyr::recode(data$PCHGCAT1,'<75%'='0', '>=75%'='1')
data$PCHGCAT1=as.factor(data$PCHGCAT1)
#find outlier for model
car::outlierTest(mod_nonmm)
#remove outlier
data=data %>% filter(!row_number() %in% c(469,210) )
#model
data <- within(data, TRTP <- relevel(TRTP, ref = 'Placebo')) # change reference level to Placebo
#control = glmerControl(optimizer ="Nelder_Mead")
mod=glmer(PCHGCAT1~SEXN+TRTP+(1|AVISITN),data=data,family=binomial)
summary(mod)
#confint(mod)
# Computing profile confidence intervals ...
# 2.5 % 97.5 %
# .sig01 2.2702930 14.66421685
# (Intercept) -17.3509358 -1.73045710
# SEXN2 -0.3916813 -0.03178773
# TRTPActive control 2.6194154 3.44087712
# TRTPTest drug 140mg 2.6229402 3.39699952
# TRTPTest drug 210mg 3.4962785 4.28475351
mod_nonmm=glm(PCHGCAT1~SEXN+TRTP,data=data,family=binomial )
summary(mod_nonmm)
```
Dependent variable: PCHGCAT1
a. Include only completers. Patients who drop out and have missing data will be excluded for analysis. Analyze by logistic regression with treatment and sex (from ADSL data).
MAR_DATA
-Random Intercept on subjectid leads to a singular fit because variance of random intercept is 0
-Random Intercept on visitid leads to failed converge, but random intercept variance is very high
ADSL20
-Random Intercept on visitid runs fine, shows a large variation but confint fails saying "profiling detected new, lower deviance"
-Random Intercept on subjectid gives singular fit because random intercept variance is 0
ADSL30
-Random Intercept on visitid runs fine, shows a VERY large variation and CIs
-Random Intercept on subjectid gives singular fit because random intercept variance is 0
```{r}
#model
completers=raw_dat %>% group_by(SUBJID) %>% summarise(n=sum(is.na(AVAL))) %>% filter(n==0) %>% select(SUBJID) #list of subjects that have no missing data
data=inner_join(raw_dat, completers, by="SUBJID")
multi_label_check=data %>% group_by(SUBJID) %>% mutate(distinct_PASI=n_distinct(PCHGCAT1, na.rm = TRUE)) %>% filter(distinct_PASI>1)
data$SEXN=as.factor(data$SEXN) #factor variable
data$TRTP=as.factor(data$TRTP) #factor variable
#change to 0 or 1, 0 is non responder, 1 is responder
data$PCHGCAT1=dplyr::recode(data$PCHGCAT1,'<75%'=0, '>=75%'=1)
data$PCHGCAT1=as.factor(data$PCHGCAT1)
# mixed model
data <- within(data, TRTP <- relevel(TRTP, ref = 'Placebo')) # change reference level to Placebo
mod=glmer(PCHGCAT1~SEXN+TRTP+(1|AVISITN),data=data,family=binomial)
summary(mod)
# confint(mod)
#ADSL30 CI
# Computing profile confidence intervals ...
# 2.5 % 97.5 %
# .sig01 3.9523848 63.504002487
# (Intercept) -41.3380638 -3.986073608
# SEXN2 -0.4285189 0.006992372
# TRTPPlacebo -3.4841414 -2.493074140
# TRTPTest drug 140mg -0.1593532 0.419247419
# TRTPTest drug 210mg 0.7042414 1.287490789
#non mixed model
mod=glm(PCHGCAT1~SEXN+TRTP, data=data,family=binomial)
summary(mod)
#create mode function to get one value per subject
calc_mode <- function(x){
# List the distinct / unique values
distinct_values <- unique(x)
# Count the occurrence of each distinct value
distinct_tabulate <- tabulate(match(x, distinct_values))
# Return the value with the highest occurrence
distinct_values[which.max(distinct_tabulate)]
}
data.non.mm = data %>% group_by(SUBJID) %>% summarize(PCHGCAT1.mode = calc_mode(PCHGCAT1))
data.non.mm=inner_join(data, data.non.mm, by="SUBJID")
unique(data.non.mm$PCHGCAT1.mode)
data.non.mm$PCHGCAT1.mode=dplyr::recode(data.non.mm$PCHGCAT1.mode,'<75%'=0, '>=75%'=1)
mod.MM=glm(PCHGCAT1.mode~SEXN+TRTP,data=data.non.mm,family=binomial)
summary(mod.MM)
```
b. Impute the patients who have missing data at visit 6 to be non-responders (i.e. patients who have missing data or who have visit 6 data but didn’t reach PASI75 will be both treated as non-responder). Analyze by logistic regression with treatment and sex.
ADSL30:
-All vars significant. Outlier 206.
ADSL20:
-All vars significant. Outlier 206.
MAR:
-All vars significant. Outlier 206.
```{r}
#Two conditions applied:
#1-If the patient has data at visit 6 but PASI 75 is less than 75%
#2-If patient has no data at visit 6
#one patients gets only one label: responder vs non-responder
data=raw_dat %>% group_by(SUBJID) %>% mutate(PCHGCAT1_IMP = min(ifelse(((AVISIT=='VISIT 6' & is.na(PCHGCAT1)) |(AVISIT=='VISIT 6' & PCHGCAT1=='<75%')) , 0, 1))) %>% ungroup
data$PCHGCAT1_IMP=as.factor(data$PCHGCAT1_IMP)
data$TRTP=as.factor(data$TRTP)
data$SEXN=as.factor(data$SEXN)
#find outlier for model
#car::outlierTest(mod)
#remove outlier
data=data %>% filter(!row_number() %in% c(206) )
#model
data <- within(data, TRTP <- relevel(TRTP, ref = 'Placebo')) # change reference level to Placebo
mod=glm(PCHGCAT1_IMP~SEXN+TRTP,data=data,family=binomial)
summary(mod)
```
c.Compare the results with the one from complete data set. Analyze by logistic regression with treatment and sex.
Complete Data:
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: PCHGCAT1 ~ SEXN + TRTP + (1 | AVISITN)
Data: data
AIC BIC logLik deviance df.resid
3501.2 3542.6 -1744.6 3489.2 7316
Scaled residuals:
Min 1Q Median 3Q Max
-2.3708 -0.3018 -0.0332 -0.0073 19.3650
Random effects:
Groups Name Variance Std.Dev.
AVISITN (Intercept) 20.1 4.483
Number of obs: 7322, groups: AVISITN, 4
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.55249 2.26261 -3.338 0.000844 ***
SEXN2 -0.21659 0.09174 -2.361 0.018225 *
TRTPActive control 3.04722 0.21134 14.419 < 2e-16 ***
TRTPTest drug 140mg 3.02553 0.19936 15.176 < 2e-16 ***
TRTPTest drug 210mg 3.90602 0.20300 19.241 < 2e-16 ***
##ADSL20:
Random intercept variance is ~30 compared to 20.1 for full dataset. Coefficients for TRTP and sex remain same, but intercept is more negative (~ -9)
##ADSL30
AIC is lower but the random intercept variance is 166, compared to 20.1 for the full dataset model. All coefficients are the same but the intercept is more negative (~ -17)
##MAR:
Model didnt converge, but very similar coefficients to the full dataset.
##Comparison to imputed models
##MAR
Intercept compared to the full model is much more postive (~ -2.5) and coefficient of sex is much lower (~ -0.11). Effects of TRTP is similar but lower.
#ADSL30
Very similar compared to the results for MAR imputed model, but intercept is slightly higher (~2.75)
#ADSL20
Effects of TRTP is similar but lower compared to full data. Very similar intercept to adsl30 (~2.6).