Permalink
Cannot retrieve contributors at this time
Name already in use
A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Biostats-Practicum/Logistic_Regression_BIST5092.Rmd
Go to fileThis commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
249 lines (193 sloc)
9.37 KB
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: "BIST5092-Modelling (Shreya)" | |
author: "Shreya Mittal" | |
date: '2022-08-07' | |
output: html_document | |
--- | |
```{r setup, include=FALSE} | |
knitr::opts_chunk$set(echo = TRUE) | |
``` | |
##Analyses the following | |
## 1. MCAR 20% missing data | |
## 2. MCAR 30% missing data | |
## 3. MAR data | |
## 4. Complete/Full Dataset | |
```{r} | |
#load all packages | |
library(dplyr) | |
library(car) | |
library(lme4) | |
library(mice) | |
#load all raw data | |
adsl20=read.csv(file="/Users/shreya/Downloads/MCAR20%n.csv", sep = ",", na.strings=c('',NA)) | |
# remove sPGA rows | |
adsl20=adsl20 %>% filter(PARAMCD!='sPGA') #%>% filter(AVISITN!=2) | |
adsl20$PCHG=as.numeric(adsl20$PCHG) | |
adsl20$PCHGCAT1 = ifelse((adsl20$PCHG<75.0) , '<75%', ifelse(adsl20$PCHG>=75.0 , '>=75%', NA)) | |
#if aval is na then pchgcat1 is na | |
adsl20$PCHGCAT1 = ifelse(is.na(adsl20$AVAL), NA, adsl20$PCHGCAT1) | |
adsl30=read.csv(file="/Users/shreya/Downloads/MCAR30%.csv", sep = ",", na.strings=c('',NA)) | |
# remove sPGA rows | |
adsl30=adsl30 %>% filter(PARAMCD!='sPGA') | |
adsl30$PCHGCAT1 = ifelse((adsl30$PCHG>=0.0 & adsl30$PCHG<75.0 ), '<75%', ifelse(adsl30$PCHG>=75.0 , '>=75%', NA)) | |
#if aval is na then pchgcat1 is na | |
adsl30$PCHGCAT1 = ifelse(is.na(adsl30$AVAL), NA, adsl30$PCHGCAT1) | |
#MCAR full data | |
adsl_full=read.csv(file="/Users/shreya/Downloads/ADPA + SEX.csv", sep = ",", na.strings=c('',NA)) | |
# remove sPGA rows | |
adsl_full=adsl_full %>% filter(PARAMCD!='sPGA') | |
#MAR DATA | |
mar_adsl_full=read.csv(file="/Users/shreya/Downloads/MAR_Data_V1.csv", sep = ",", na.strings=c('',NA)) | |
# remove sPGA rows | |
mar_adsl_full=mar_adsl_full %>% filter(PARAMCD!='sPGA') | |
mar_adsl_full$PCHGCAT1 = ifelse((mar_adsl_full$PCHG<75.0) , '<75%', ifelse(mar_adsl_full$PCHG>=75.0 , '>=75%', NA)) | |
raw_dat=adsl20 | |
``` | |
##LOGISTIC REGRESSION-COMPLETE DATA | |
We see a very high variance and standard deviation per visit. In other words the log odds of being a responder vs non responder varies on average by 20.1 around as we move from visit to visit or each visit has its own unique deviation in terms of log odds. | |
the 20.1 corresponds to the deviation of the specific log odds of being a responder in a given visit from the overall log odss of being a responder across all visits. We see a high random intercept variance, therefore larger the variation of the log odds of being a responder from visit to visit, indicating that subjects have more chances of being a responder in some visits than others. | |
cite:https://www.rips-irsp.com/articles/10.5334/irsp.90/ | |
```{r} | |
#check if a patient can have 2 PASI 75 labels | |
multi_label_check=raw_dat %>% group_by(SUBJID) %>% mutate(distinct_PASI=n_distinct(PCHGCAT1, na.rm = TRUE)) %>% filter(distinct_PASI>1) | |
#change to 0 or 1, 0 is non responder, 1 is responder | |
data=raw_dat | |
data$SEXN=as.factor(data$SEXN) #factor variable | |
data$TRTP=as.factor(data$TRTP) #factor variable | |
data$PCHGCAT1=dplyr::recode(data$PCHGCAT1,'<75%'='0', '>=75%'='1') | |
data$PCHGCAT1=as.factor(data$PCHGCAT1) | |
#find outlier for model | |
car::outlierTest(mod_nonmm) | |
#remove outlier | |
data=data %>% filter(!row_number() %in% c(469,210) ) | |
#model | |
data <- within(data, TRTP <- relevel(TRTP, ref = 'Placebo')) # change reference level to Placebo | |
#control = glmerControl(optimizer ="Nelder_Mead") | |
mod=glmer(PCHGCAT1~SEXN+TRTP+(1|AVISITN),data=data,family=binomial) | |
summary(mod) | |
#confint(mod) | |
# Computing profile confidence intervals ... | |
# 2.5 % 97.5 % | |
# .sig01 2.2702930 14.66421685 | |
# (Intercept) -17.3509358 -1.73045710 | |
# SEXN2 -0.3916813 -0.03178773 | |
# TRTPActive control 2.6194154 3.44087712 | |
# TRTPTest drug 140mg 2.6229402 3.39699952 | |
# TRTPTest drug 210mg 3.4962785 4.28475351 | |
mod_nonmm=glm(PCHGCAT1~SEXN+TRTP,data=data,family=binomial ) | |
summary(mod_nonmm) | |
``` | |
Dependent variable: PCHGCAT1 | |
a. Include only completers. Patients who drop out and have missing data will be excluded for analysis. Analyze by logistic regression with treatment and sex (from ADSL data). | |
MAR_DATA | |
-Random Intercept on subjectid leads to a singular fit because variance of random intercept is 0 | |
-Random Intercept on visitid leads to failed converge, but random intercept variance is very high | |
ADSL20 | |
-Random Intercept on visitid runs fine, shows a large variation but confint fails saying "profiling detected new, lower deviance" | |
-Random Intercept on subjectid gives singular fit because random intercept variance is 0 | |
ADSL30 | |
-Random Intercept on visitid runs fine, shows a VERY large variation and CIs | |
-Random Intercept on subjectid gives singular fit because random intercept variance is 0 | |
```{r} | |
#model | |
completers=raw_dat %>% group_by(SUBJID) %>% summarise(n=sum(is.na(AVAL))) %>% filter(n==0) %>% select(SUBJID) #list of subjects that have no missing data | |
data=inner_join(raw_dat, completers, by="SUBJID") | |
multi_label_check=data %>% group_by(SUBJID) %>% mutate(distinct_PASI=n_distinct(PCHGCAT1, na.rm = TRUE)) %>% filter(distinct_PASI>1) | |
data$SEXN=as.factor(data$SEXN) #factor variable | |
data$TRTP=as.factor(data$TRTP) #factor variable | |
#change to 0 or 1, 0 is non responder, 1 is responder | |
data$PCHGCAT1=dplyr::recode(data$PCHGCAT1,'<75%'=0, '>=75%'=1) | |
data$PCHGCAT1=as.factor(data$PCHGCAT1) | |
# mixed model | |
data <- within(data, TRTP <- relevel(TRTP, ref = 'Placebo')) # change reference level to Placebo | |
mod=glmer(PCHGCAT1~SEXN+TRTP+(1|AVISITN),data=data,family=binomial) | |
summary(mod) | |
# confint(mod) | |
#ADSL30 CI | |
# Computing profile confidence intervals ... | |
# 2.5 % 97.5 % | |
# .sig01 3.9523848 63.504002487 | |
# (Intercept) -41.3380638 -3.986073608 | |
# SEXN2 -0.4285189 0.006992372 | |
# TRTPPlacebo -3.4841414 -2.493074140 | |
# TRTPTest drug 140mg -0.1593532 0.419247419 | |
# TRTPTest drug 210mg 0.7042414 1.287490789 | |
#non mixed model | |
mod=glm(PCHGCAT1~SEXN+TRTP, data=data,family=binomial) | |
summary(mod) | |
#create mode function to get one value per subject | |
calc_mode <- function(x){ | |
# List the distinct / unique values | |
distinct_values <- unique(x) | |
# Count the occurrence of each distinct value | |
distinct_tabulate <- tabulate(match(x, distinct_values)) | |
# Return the value with the highest occurrence | |
distinct_values[which.max(distinct_tabulate)] | |
} | |
data.non.mm = data %>% group_by(SUBJID) %>% summarize(PCHGCAT1.mode = calc_mode(PCHGCAT1)) | |
data.non.mm=inner_join(data, data.non.mm, by="SUBJID") | |
unique(data.non.mm$PCHGCAT1.mode) | |
data.non.mm$PCHGCAT1.mode=dplyr::recode(data.non.mm$PCHGCAT1.mode,'<75%'=0, '>=75%'=1) | |
mod.MM=glm(PCHGCAT1.mode~SEXN+TRTP,data=data.non.mm,family=binomial) | |
summary(mod.MM) | |
``` | |
b. Impute the patients who have missing data at visit 6 to be non-responders (i.e. patients who have missing data or who have visit 6 data but didn’t reach PASI75 will be both treated as non-responder). Analyze by logistic regression with treatment and sex. | |
ADSL30: | |
-All vars significant. Outlier 206. | |
ADSL20: | |
-All vars significant. Outlier 206. | |
MAR: | |
-All vars significant. Outlier 206. | |
```{r} | |
#Two conditions applied: | |
#1-If the patient has data at visit 6 but PASI 75 is less than 75% | |
#2-If patient has no data at visit 6 | |
#one patients gets only one label: responder vs non-responder | |
data=raw_dat %>% group_by(SUBJID) %>% mutate(PCHGCAT1_IMP = min(ifelse(((AVISIT=='VISIT 6' & is.na(PCHGCAT1)) |(AVISIT=='VISIT 6' & PCHGCAT1=='<75%')) , 0, 1))) %>% ungroup | |
data$PCHGCAT1_IMP=as.factor(data$PCHGCAT1_IMP) | |
data$TRTP=as.factor(data$TRTP) | |
data$SEXN=as.factor(data$SEXN) | |
#find outlier for model | |
#car::outlierTest(mod) | |
#remove outlier | |
data=data %>% filter(!row_number() %in% c(206) ) | |
#model | |
data <- within(data, TRTP <- relevel(TRTP, ref = 'Placebo')) # change reference level to Placebo | |
mod=glm(PCHGCAT1_IMP~SEXN+TRTP,data=data,family=binomial) | |
summary(mod) | |
``` | |
c.Compare the results with the one from complete data set. Analyze by logistic regression with treatment and sex. | |
Complete Data: | |
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod'] | |
Family: binomial ( logit ) | |
Formula: PCHGCAT1 ~ SEXN + TRTP + (1 | AVISITN) | |
Data: data | |
AIC BIC logLik deviance df.resid | |
3501.2 3542.6 -1744.6 3489.2 7316 | |
Scaled residuals: | |
Min 1Q Median 3Q Max | |
-2.3708 -0.3018 -0.0332 -0.0073 19.3650 | |
Random effects: | |
Groups Name Variance Std.Dev. | |
AVISITN (Intercept) 20.1 4.483 | |
Number of obs: 7322, groups: AVISITN, 4 | |
Fixed effects: | |
Estimate Std. Error z value Pr(>|z|) | |
(Intercept) -7.55249 2.26261 -3.338 0.000844 *** | |
SEXN2 -0.21659 0.09174 -2.361 0.018225 * | |
TRTPActive control 3.04722 0.21134 14.419 < 2e-16 *** | |
TRTPTest drug 140mg 3.02553 0.19936 15.176 < 2e-16 *** | |
TRTPTest drug 210mg 3.90602 0.20300 19.241 < 2e-16 *** | |
##ADSL20: | |
Random intercept variance is ~30 compared to 20.1 for full dataset. Coefficients for TRTP and sex remain same, but intercept is more negative (~ -9) | |
##ADSL30 | |
AIC is lower but the random intercept variance is 166, compared to 20.1 for the full dataset model. All coefficients are the same but the intercept is more negative (~ -17) | |
##MAR: | |
Model didnt converge, but very similar coefficients to the full dataset. | |
##Comparison to imputed models | |
##MAR | |
Intercept compared to the full model is much more postive (~ -2.5) and coefficient of sex is much lower (~ -0.11). Effects of TRTP is similar but lower. | |
#ADSL30 | |
Very similar compared to the results for MAR imputed model, but intercept is slightly higher (~2.75) | |
#ADSL20 | |
Effects of TRTP is similar but lower compared to full data. Very similar intercept to adsl30 (~2.6). | |