This repository contains two branches: MicroVI and MicroVI-retraining. The MicroVI branch contains the base code, which runs 5-fold cross-validation of MicroVI 100 times. The MicroVI-retraining branch contains the code for MicroVI with data augmentation: after the model is trained in each run, a portion of the learned latent space is sampled and used to fine-tune the trained model.
The run.py file within each project is the main file to run. A sample shell script is provided in myjob.sh. The data for each project is in the NEW_DATASETS folder, which has two subfolders: one with microbiome data from mice on different diets (m_diet, the dataset used for linear regression) and one with microbiome data measured from mice at different ages (m_age, the dataset used for classification/logistic regression).
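For readers unfamiliar with the protocol, the following is a minimal sketch of what "5-fold cross-validation 100 times" can look like, assuming it means 100 independently shuffled repeats of a 5-fold split. The function name repeated_kfold and its parameters are hypothetical and are not taken from run.py; the repository's own splitting code may differ.

```python
import random


def repeated_kfold(n_samples, n_splits=5, n_repeats=100, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV.

    Hypothetical helper for illustration only; not the code used in run.py.
    """
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # fresh shuffle for every repeat
        for fold in range(n_splits):
            # Every n_splits-th shuffled sample goes into this fold's test set.
            test_idx = indices[fold::n_splits]
            test_set = set(test_idx)
            train_idx = [i for i in indices if i not in test_set]
            yield repeat, fold, train_idx, test_idx
```

With the defaults this yields 500 train/test splits (100 repeats of 5 folds), and within each repeat every sample appears in exactly one test fold.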
To choose which setting to run (i.e., linear regression or classification), modify the first ten lines of the main function within run.py:
## Linear Regression - uncomment below:
# dataset = 'm_diet'
## Logistic Regression (classification) - uncomment below:
dataset = 'm_age'
pct_supervised = 100 # Choose pct supervision from: 0, 5, 10, 25, 50, or 100
alpha_list = [1.0] # Choose the weight of the supervision term in the loss function from: 0.1, 0.25, 0.5, 1.0
covariate_ablation = True # Choose whether to include or exclude covariates: if True, exclude covariates; if False, include covariates
That is, set the desired dataset, percent supervision, alpha value, and whether to include covariates.
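Because these settings are edited by hand, a small sanity check can catch typos before a long job starts. The sketch below is an assumption-laden illustration: check_settings and the ALLOWED_* constants are hypothetical names that do not exist in run.py, and the allowed values are simply the ones listed in the comments above.

```python
# Allowed values, copied from the comments in the settings block.
# All names below are hypothetical and are not part of run.py.
ALLOWED_DATASETS = {
    "m_diet": "linear regression",
    "m_age": "classification (logistic regression)",
}
ALLOWED_PCT_SUPERVISED = (0, 5, 10, 25, 50, 100)
ALLOWED_ALPHAS = (0.1, 0.25, 0.5, 1.0)


def check_settings(dataset, pct_supervised, alpha_list, covariate_ablation):
    """Raise ValueError on an unsupported setting; return the task name."""
    if dataset not in ALLOWED_DATASETS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    if pct_supervised not in ALLOWED_PCT_SUPERVISED:
        raise ValueError(f"unsupported pct_supervised: {pct_supervised!r}")
    for alpha in alpha_list:
        if alpha not in ALLOWED_ALPHAS:
            raise ValueError(f"unsupported alpha: {alpha!r}")
    if not isinstance(covariate_ablation, bool):
        raise ValueError("covariate_ablation must be True or False")
    return ALLOWED_DATASETS[dataset]
```

For example, check_settings('m_age', 100, [1.0], True) returns the task name for the classification setting, while a misspelled dataset name raises a ValueError.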
run.py creates a nested set of Master_Results and dataset folders according to the settings above; the results CSV for the 100 CV runs is stored there.
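As a rough illustration of how such a nested output location can be built from the settings, here is a minimal sketch using pathlib. Only the Master_Results and dataset folder names come from the description above; the function results_dir and the remaining sub-folder names (pct_..., alpha_..., with/without covariates) are assumptions, so the layout run.py actually produces may differ.

```python
from pathlib import Path


def results_dir(root, dataset, pct_supervised, alpha, covariate_ablation):
    """Create and return a nested results folder for one combination of settings.

    Hypothetical layout for illustration; not the exact structure used by run.py.
    """
    covariates = "no_covariates" if covariate_ablation else "with_covariates"
    path = (
        Path(root)
        / "Master_Results"
        / dataset
        / f"pct_{pct_supervised}"
        / f"alpha_{alpha}"
        / covariates
    )
    # parents=True builds every missing level; exist_ok=True makes reruns safe.
    path.mkdir(parents=True, exist_ok=True)
    return path
```

Calling results_dir('.', 'm_age', 100, 1.0, True) would create Master_Results/m_age/pct_100/alpha_1.0/no_covariates under the current directory and return that path.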