Logistic_Regression_BIST5092.Rmd

---
title: "BIST5092-Modelling (Shreya)"
author: "Shreya Mittal"
date: '2022-08-07'
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

##Analyses the following
## 1. MCAR 20% missing data
## 2. MCAR 30% missing data
## 3. MAR data
## 4. Complete/Full Dataset
```{r}
#load all packages
library(dplyr)
library(car)
library(lme4)
library(mice)
#load all raw data
adsl20=read.csv(file="/Users/shreya/Downloads/MCAR20%n.csv", sep = ",", na.strings=c('',NA))
# remove sPGA rows
adsl20=adsl20 %>% filter(PARAMCD!='sPGA') #%>% filter(AVISITN!=2)
adsl20$PCHG=as.numeric(adsl20$PCHG)
adsl20$PCHGCAT1 = ifelse((adsl20$PCHG<75.0) , '<75%', ifelse(adsl20$PCHG>=75.0 , '>=75%', NA))
#if aval is na then pchgcat1 is na
adsl20$PCHGCAT1 = ifelse(is.na(adsl20$AVAL), NA, adsl20$PCHGCAT1)


adsl30=read.csv(file="/Users/shreya/Downloads/MCAR30%.csv", sep = ",", na.strings=c('',NA))
# remove sPGA rows
adsl30=adsl30 %>% filter(PARAMCD!='sPGA')
adsl30$PCHGCAT1 = ifelse((adsl30$PCHG>=0.0 & adsl30$PCHG<75.0 ), '<75%', ifelse(adsl30$PCHG>=75.0 , '>=75%', NA))
#if aval is na then pchgcat1 is na
adsl30$PCHGCAT1 = ifelse(is.na(adsl30$AVAL), NA, adsl30$PCHGCAT1)

#MCAR full data
adsl_full=read.csv(file="/Users/shreya/Downloads/ADPA + SEX.csv", sep = ",", na.strings=c('',NA))
# remove sPGA rows
adsl_full=adsl_full %>% filter(PARAMCD!='sPGA')

#MAR DATA
mar_adsl_full=read.csv(file="/Users/shreya/Downloads/MAR_Data_V1.csv", sep = ",", na.strings=c('',NA))
# remove sPGA rows
mar_adsl_full=mar_adsl_full %>% filter(PARAMCD!='sPGA')
mar_adsl_full$PCHGCAT1 = ifelse((mar_adsl_full$PCHG<75.0) , '<75%', ifelse(mar_adsl_full$PCHG>=75.0 , '>=75%', NA))

raw_dat=adsl20
```

##LOGISTIC REGRESSION-COMPLETE DATA

We see a very high variance and standard deviation per visit. In other words the log odds of being a responder vs non responder varies on average by 20.1 around as we move from visit to visit or each visit has its own unique deviation in terms of log odds.

the 20.1 corresponds to the deviation of the specific log odds of being a responder in a given visit from the overall log odss of being a responder across all visits. We see a high random intercept variance, therefore larger the variation of the log odds of being a responder from visit to visit, indicating that subjects have more chances of being a responder in some visits than others.
cite:https://www.rips-irsp.com/articles/10.5334/irsp.90/
```{r}
#check if a patient can have 2 PASI 75 labels
multi_label_check=raw_dat %>% group_by(SUBJID) %>% mutate(distinct_PASI=n_distinct(PCHGCAT1, na.rm = TRUE)) %>% filter(distinct_PASI>1)

#change to 0 or 1, 0 is non responder, 1 is responder
data=raw_dat
data$SEXN=as.factor(data$SEXN) #factor variable
data$TRTP=as.factor(data$TRTP) #factor variable
data$PCHGCAT1=dplyr::recode(data$PCHGCAT1,'<75%'='0', '>=75%'='1')
data$PCHGCAT1=as.factor(data$PCHGCAT1)

#find outlier for model
car::outlierTest(mod_nonmm) 
#remove outlier
data=data %>% filter(!row_number() %in% c(469,210) )
#model
data <- within(data, TRTP <- relevel(TRTP, ref = 'Placebo')) # change reference level to Placebo

#control = glmerControl(optimizer ="Nelder_Mead")
mod=glmer(PCHGCAT1~SEXN+TRTP+(1|AVISITN),data=data,family=binomial)
summary(mod)
#confint(mod)
# Computing profile confidence intervals ...
#                           2.5 %      97.5 %
# .sig01                2.2702930 14.66421685
# (Intercept)         -17.3509358 -1.73045710
# SEXN2                -0.3916813 -0.03178773
# TRTPActive control    2.6194154  3.44087712
# TRTPTest drug 140mg   2.6229402  3.39699952
# TRTPTest drug 210mg   3.4962785  4.28475351

mod_nonmm=glm(PCHGCAT1~SEXN+TRTP,data=data,family=binomial )
summary(mod_nonmm)

```


Dependent variable: PCHGCAT1

a.	Include only completers. Patients who drop out and have missing data will be excluded for analysis. Analyze by logistic regression with treatment and sex (from ADSL data).

MAR_DATA
  -Random Intercept on subjectid leads to a singular fit because variance of random intercept is 0
  -Random Intercept on visitid leads to failed converge, but random intercept variance is very high
  
ADSL20
  -Random Intercept on visitid runs fine, shows a large variation but confint fails saying "profiling detected new, lower deviance"
  -Random Intercept on subjectid gives singular fit because random intercept variance is 0
  
ADSL30
  -Random Intercept on visitid runs fine, shows a VERY large variation and CIs
  -Random Intercept on subjectid gives singular fit because random intercept variance is 0
```{r}
#model 
completers=raw_dat %>% group_by(SUBJID) %>% summarise(n=sum(is.na(AVAL))) %>% filter(n==0) %>% select(SUBJID) #list of subjects that have no missing data
data=inner_join(raw_dat, completers, by="SUBJID")
multi_label_check=data %>% group_by(SUBJID) %>% mutate(distinct_PASI=n_distinct(PCHGCAT1, na.rm = TRUE)) %>% filter(distinct_PASI>1)

data$SEXN=as.factor(data$SEXN) #factor variable
data$TRTP=as.factor(data$TRTP) #factor variable

#change to 0 or 1, 0 is non responder, 1 is responder
data$PCHGCAT1=dplyr::recode(data$PCHGCAT1,'<75%'=0, '>=75%'=1)
data$PCHGCAT1=as.factor(data$PCHGCAT1)

# mixed model
data <- within(data, TRTP <- relevel(TRTP, ref = 'Placebo')) # change reference level to Placebo

mod=glmer(PCHGCAT1~SEXN+TRTP+(1|AVISITN),data=data,family=binomial)
summary(mod)
# confint(mod)
#ADSL30 CI
# Computing profile confidence intervals ...
#                           2.5 %       97.5 %
# .sig01                3.9523848 63.504002487
# (Intercept)         -41.3380638 -3.986073608
# SEXN2                -0.4285189  0.006992372
# TRTPPlacebo          -3.4841414 -2.493074140
# TRTPTest drug 140mg  -0.1593532  0.419247419
# TRTPTest drug 210mg   0.7042414  1.287490789

#non mixed model
mod=glm(PCHGCAT1~SEXN+TRTP, data=data,family=binomial)
summary(mod)

#create mode function to get one value per subject
calc_mode <- function(x){
  # List the distinct / unique values
  distinct_values <- unique(x)
  # Count the occurrence of each distinct value
  distinct_tabulate <- tabulate(match(x, distinct_values))
  # Return the value with the highest occurrence
  distinct_values[which.max(distinct_tabulate)]
}

data.non.mm = data %>% group_by(SUBJID) %>% summarize(PCHGCAT1.mode = calc_mode(PCHGCAT1))
data.non.mm=inner_join(data, data.non.mm, by="SUBJID")
unique(data.non.mm$PCHGCAT1.mode)
data.non.mm$PCHGCAT1.mode=dplyr::recode(data.non.mm$PCHGCAT1.mode,'<75%'=0, '>=75%'=1)

mod.MM=glm(PCHGCAT1.mode~SEXN+TRTP,data=data.non.mm,family=binomial)
summary(mod.MM)
```


b.	Impute the patients who have missing data at visit 6 to be non-responders (i.e. patients who have missing data or who have visit 6 data but didn’t reach PASI75 will be both treated as non-responder). Analyze by logistic regression with treatment and sex.

ADSL30: 
 -All vars significant. Outlier 206.
 
ADSL20:
 -All vars significant. Outlier 206.
 
MAR:
  -All vars significant. Outlier 206.

```{r}
#Two  conditions applied:
#1-If the patient has data at visit 6 but PASI 75 is less than 75%
#2-If patient has no data at visit 6

#one patients gets only one label: responder vs non-responder
data=raw_dat %>% group_by(SUBJID) %>% mutate(PCHGCAT1_IMP = min(ifelse(((AVISIT=='VISIT 6' & is.na(PCHGCAT1)) |(AVISIT=='VISIT 6' & PCHGCAT1=='<75%')) , 0, 1))) %>% ungroup

data$PCHGCAT1_IMP=as.factor(data$PCHGCAT1_IMP)
data$TRTP=as.factor(data$TRTP)
data$SEXN=as.factor(data$SEXN)

#find outlier for model
#car::outlierTest(mod) 
#remove outlier
data=data %>% filter(!row_number() %in% c(206) )

#model
data <- within(data, TRTP <- relevel(TRTP, ref = 'Placebo')) # change reference level to Placebo

mod=glm(PCHGCAT1_IMP~SEXN+TRTP,data=data,family=binomial)
summary(mod)
```

c.Compare the results with the one from complete data set. Analyze by logistic regression with treatment and sex.

Complete Data:

Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
 Family: binomial  ( logit )
Formula: PCHGCAT1 ~ SEXN + TRTP + (1 | AVISITN)
   Data: data

     AIC      BIC   logLik deviance df.resid 
  3501.2   3542.6  -1744.6   3489.2     7316 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.3708 -0.3018 -0.0332 -0.0073 19.3650 

Random effects:
 Groups  Name        Variance Std.Dev.
 AVISITN (Intercept) 20.1     4.483   
Number of obs: 7322, groups:  AVISITN, 4

Fixed effects:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -7.55249    2.26261  -3.338 0.000844 ***
SEXN2               -0.21659    0.09174  -2.361 0.018225 *  
TRTPActive control   3.04722    0.21134  14.419  < 2e-16 ***
TRTPTest drug 140mg  3.02553    0.19936  15.176  < 2e-16 ***
TRTPTest drug 210mg  3.90602    0.20300  19.241  < 2e-16 ***

##ADSL20:
Random intercept variance is ~30 compared to 20.1 for full dataset. Coefficients for TRTP and sex remain same, but intercept is more negative (~ -9) 

##ADSL30
AIC is lower but the random intercept variance is 166, compared to 20.1 for the full dataset model. All coefficients are the same but the intercept is more negative (~ -17)

##MAR:
Model didnt converge, but very similar coefficients to the full dataset.

##Comparison to imputed models
##MAR
Intercept compared to the full model is much more postive (~ -2.5) and coefficient of sex is much lower (~ -0.11). Effects of TRTP is similar but lower. 

#ADSL30
Very similar compared to the results for MAR imputed model, but intercept is slightly higher (~2.75)

#ADSL20
Effects of TRTP is similar but lower compared to full data. Very similar intercept to adsl30 (~2.6).