Heritable Component Analysis Pipeline

This program includes three tools for analyzing genetic data:

GRM generation from SNP data.
Heritability analysis for a given phenotype.
Heritable component analysis for a given set of phenotypic features.

Program I/O

The file formats used in this program are explained below. See the example_data folder for example files.

SNP data is accepted in PLINK binary format (BED/BIM/FAM files); use PLINK to convert from other formats into the efficient binary format.
Pregiven kinship matrices (GRMs) are accepted in the GCTA GRM format. Likewise, GRMs generated from SNP data are saved in this format.
Phenotypic and covariate data are accepted in the following CSV format: family_id,individual_id,feature_1,feature_2,feature_n. Do not include a header.

All program output is saved to a directory with the following name: hca-run-{epoch_timestamp}. If the output_file_prefix option is specified, it must be a directory. Then, program output will be saved to the following directory: {output_file_prefix}/hca-run-{epoch_timestamp}.

Output is generally in CSV format. The only exception to this rule is kinship matrices; these are outputted in the GCTA GRM format. All program outputs are below:

File Path (relative to output directory)	File Format	Tool	File Purpose
`/score/scores.csv`	CSV	`score`	The generated scores will be saved in this file.
`/hca/cv_{n}_result.csv`	CSV	`hca`	This file will be created for every CV repeat; `n` will represent the CV repeat iteration. This file will include the `h^2` scores for each lambda, for each testing fold.
`/hca/weights_final.csv`	CSV	`hca`	This file will include the final weights generated by HCA.
`/h2r/result.csv`	CSV	`h2r`, `hca`	If running `h2r`, this file will represent the heritability analysis results. If running `hca`, this file will represent the heritability analysis results of the FINAL weights, with the full dataset.
`/kinship/grm.kinship`	GCTA GRM	`h2r`, `hca`, `kinship`	This file will include the generated kinship matrix (GRM).
`/kinship/grm.kinship_id`	GCTA GRM ID	`h2r`, `hca`, `kinship`	This file will include individual metadata for the GRM.

Usage

To run the program, you first need to choose a program command (specified via the program_command option). The available commands are:

hca (heritable component analysis)
h2r (heritability analysis)
kinship (GRM generation)
score (score a phenotype with a HCA-generated trait)

If you see a warning related to matrix inversion, note the following. If matrix inversion fails, small values will be added to the matrix diagonals. This will generally resolve the invertibility error, but may adversely affect the results. As such, if this warning appears, note that your results may be unstable.

HCA Documentation

HCA Cross-Validation

If cross-validation is not used (hca_cv_folds == 1), the following process will be used for lambda tuning:

HCA will be run with each lambda value.
For each Lambda value, Heritability analysis will be run with the generated trait.
The Lambda value that generates the most heritable trait trait will be saved as a result of the analysis.

If cross-validation is used (hca_cv_folds != 1), the following process will be used for lambda tuning:

The data will be split into hca_cv_folds folds.
HCA will be run for each lambda value hca_cv_folds times. Each time, one fold will be chosen for testing data, and the rest of the folds will be used for training data. HCA will be run with the training data, and the result will be scored via heritability analysis (with the testing data). The resulting h^2 will be saved.
Steps 1-2 will be repeated hca_cv_repeats times, if hca_cv_repeats is greater than 1.
Each lambda value will have been used hca_cv_folds * hca_cv_repeats times. The average h^2 value from all of these iterations will be taken. The lambda value with the best average h^2 will be chosen as the best.
HCA/heritability analysis will be run again to get the final result. All data will be used for both the training and testing stages (there will be no separate test set for this stage).

When cross-validation is used, the data gleaned from each CV repeat will be saved in a CSV file. This file will include the h^2 scores for each run in that repeat. You may find this information helpful when troubleshooting.

Note that some datasets may be particularly sensitive to removing subjects from training data. Cross-validation will remove subjects from training data for the sake of creating a test set. If you find that results are unstable with cross-validation, consider running the program without this functionality.

Example Commands

Examples for each program_command are included below:

hca: ./hca-dev-new --program_command hca --kinship_src pregiven --given_kinship_file ../example_data/kinship/data.grm --given_kinship_id_file ../example_data/kinship/id.grm --phen_file ../example_data/phenotypic_data/data.csv --c_cov_file ../example_data/categorial_covariates/data.csv --hca_lambda_values_file lambda.txt

h2r: ./hca-dev-new --program_command h2r --kinship_src pregiven --given_kinship_file ../example_data/kinship/data.grm --given_kinship_id_file ../example_data/kinship/id.grm --phen_file ../example_data/phenotypic_data/data.csv --c_cov_file ../example_data/categorial_covariates/data.csv

kinship: ./hca-dev-new --program_command kinship --geno_fam_file ../example_data/snp_data/fam.fam --geno_bim_file ../example_data/snp_data/bim.bim --geno_bed_file ../example_data/snp_data/bed.bed --snps_in_memory 10000 --num_threads 40

score: ./hca-dev-new --program_command score --trait_file ../example_data/weights_final.csv --phen_file ../example_data/phenotypic_data/data.csv

Compiling

To compile on most Linux distributions, follow these steps:

Install g++-5: apt-get install g++-5
Install the Armadillo matrix library (download here and run cmake . && make && sudo make install)
Install the NLOpt optimization library (download here and run ./configure && make && sudo make install)
Install the OpenBLAS library from source. Make sure to specify DYNAMIC_ARCH=1 when running make and make install, if you plan on using the binary across multiple architectures.
Install Boost: sudo apt-get install libboost-all-dev.
Install lapack: sudo apt-get install liblapack-dev.
Run make --file Makefile_linux in the src directory.

Options

Name	Relevant Tools	Required?	Allowed Values	Default	Description
`program_command`	`hca`, `h2r`, `kinship`, `score`	Yes	`hca`, `h2r`, `kinship`, `score`		Program command.
`num_threads`	`hca`, `h2r`, `kinship`, `score`	No	Integer	1	Number of threads to use.
`output_file_prefix`	`hca`, `h2r`, `kinship`, `score`	No	String (directory)		Prefix for data output. Will be interpreted as directory. Default: current directory.
`log_level`	`hca`, `h2r`, `kinship`, `score`	No	`info`, `warning`, `debug`, `error`	`info`	Log level.
`snp_missing_handler`	`hca`, `h2r`, `kinship`	No	`deleteMissing`, `fillMean`	`deleteMissing`	How to handle missing data when creating a GRM from SNP data.
`maf_cutoff`	`hca`, `h2r`,`kinship`	No	Double		MAF cutoff. Include if you want to exclude SNPs based on MAF.
`hwe_cutoff`	`hca`, `h2r`,`kinship`	No	Double		HWE cutoff. Include if you want to exclude SNPs based on MAF.
`snps_in_memory`	`hca`, `h2r`, `kinship`	No	Integer	1000	Number of SNPs to keep in memory at a time.
`all_snps_in_memory`	`hca`, `h2r`, `kinship`	No	Boolean (`true`, `false`)	`false`	Whether or not to perform GRM generation with all data in memory. This is significantly faster than the iterative approach, but may not be feasible for large datasets.
`kinship_src`	`hca`, `h2r`	Yes (for `hca`, `h2r`)	`pregiven`, `snp`		Where to get kinship from. If `pregiven`, must provide kinship files. If `snp`, must provide geno data.
`indv_whitelist_file`	`hca`, `h2r`, `kinship`	No	String (file)		Whitelist individuals for analysis. NOTE: this option has not yet been implemented and currently has no effect. If you need to remove any individuals from analysis, simply remove them from at least one file in the input dataset.
`indv_blacklist_file`	`hca`, `h2r`, `kinship`	No	String (file)		Blacklist individuals for analysis. NOTE: this option has not yet been implemented and currently has no effect.,If you need to remove any individuals from analysis, simply remove them from at least one file in the input dataset.
`c_cov_file`	`hca`, `h2r`	No	String (file)		Categorical covariate data.
`c_cov_spec_file`	`hca`, `h2r`	Yes, if a whitelist or blacklist is specified for categorical covariates.	String (file)		Categorical covariate spec, one feature name for file.
`c_cov_whitelist_file`	`hca`, `h2r`	No	String (file)		Whitelist covariate features.
`c_cov_blacklist_file`	`hca`, `h2r`	No	String (file)		Blacklist covariate features.
`q_cov_file`	`hca`, `h2r`	No	String (file)		Quantitative covariate data.
`q_cov_spec_file`	`hca`, `h2r`	Yes, if a whitelist or blacklist is specified for quantitative covariates.	String (file)		Quantitative covariate spec, one feature name for file.
`q_cov_whitelist_file`	`hca`, `h2r`	No	String (file)		Whitelist covariate features.
`q_cov_blacklist_file`	`hca`, `h2r`	No	String (file)		Blacklist covariate features.
`phen_file`	`hca`, `h2r`	Yes (for `hca`, `h2r`)	String (file)		Phenotypic data. Required.
`phen_spec_file`	`hca`, `h2r`	Yes, if a whitelist or blacklist is specified for phenotypic data.	String (file)		Phenotypic data spec, one feature name for file.
`phen_whitelist_file`	`hca`, `h2r`	No	String (file)		Whitelist phenotypic features.
`phen_blacklist_file`	`hca`, `h2r`	No	String (file)		Blacklist phenotypic features.
`given_kinship_file`	`hca`, `h2r`	Required if using pregiven kinship data (for `hca`, `h2r`). You must use either pregiven or SNP-generated kinship data.	String (file)		Given kinship data path (data file).
`given_kinship_id_file`	`hca`, `h2r`	Required if using pregiven kinship data (for `hca`, `h2r`). You must use either pregiven or SNP-generated kinship data.	String (file)		Given kinship data path (ID file).
`geno_fam_file`	`hca`, `h2r`, `kinship`	Required if using SNP-generated kinship data (for `hca`, `h2r`). You must use either pregiven or SNP-generated kinship data.	String (file)		Genotypic data file (FAM file).
`geno_bim_file`	`hca`, `h2r`, `kinship`	Required if using SNP-generated kinship data (for `hca`, `h2r`). You must use either pregiven or SNP-generated kinship data.	String (file)		Genotypic data file (BIM file).
`geno_bed_file`	`hca`, `h2r`, `kinship`	Required if using SNP-generated kinship data (for `hca`, `h2r`). You must use either pregiven or SNP-generated kinship data.	String (file)		Genotypic data file (BED file).
`geno_whitelist_file`	`hca`, `h2r`, `kinship`	No	String (file)		Whitelist SNPs.
`geno_blacklist_file`	`hca`, `h2r`, `kinship`	No	String (file)		Blacklist SNPs.
`trait_file`	`score`	Yes (for `score`).	String (file)		Trait to score phenotypic data with.
`coef_file`	`hca`	No.	String (file)		Coefficient constraint file for HCA. Include if you want to constrain the weight coefficients.
`hca_lambda_values_file`	`hca`	Required if populating HCA lambda values from a file (for `hca`). Exactly one lambda strategy (file, auto, interval) must be used.	String (file)		File with lambda values for HCA, one per line. Each value must be a double.
`hca_max_iterations`	`hca`	No.	Integer	1000	Max iterations for HCA.
`h2r_max_iterations`	`h2r`	No.	Integer	1000	Max iterations for H2R.
`hca_cv_folds`	`hca`	No.	Integer	1	Number of folds to use for HCA cross-validation. If 1, no CV will be performed.
`hca_cv_repeats`	`hca`	No.	Integer	1	Number of repeats to use for HCA cross-validation.
`auto_lambda`	`hca`	Required if populating HCA lambda values automatically (for `hca`). Exactly one lambda strategy (file, auto, interval) must be used.	`true`, `false`	`false`	Whether to use auto lambda discovery.
`hca_lambda_start`	`hca`	Required if populating HCA lambda values from an interval (for `hca`). Exactly one lambda strategy (file, auto, interval) must be used.	Double		Start value for lambda search.
`hca_lambda_end`	`hca`	Required if populating HCA lambda values from an interval (for `hca`). Exactly one lambda strategy (file, auto, interval) must be used.	Double		End value for lambda search.
`hca_lambda_incr`	`hca`	Required if populating HCA lambda values from an interval (for `hca`). Exactly one lambda strategy (file, auto, interval) must be used.	Double		Interval value for lambda search.
`hca_gr_cutoff`	`hca`	No.	Double		GR cutoff for HCA. NOTE: this option has not yet been implemented and currently has no effect. If we decide to implement it, it should have similar behavior to the `grm-cutoff` option in the GCTA software.

References

The following references were used while preparing this program:

Sun J, Kranzler HR, Bi J. Refining multivariate disease phenotypes for high chip heritability. BMC Medical Genomics. 2015;8(Suppl 3):S3. doi:10.1186/1755-8794-8-S3-S3.

Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: A Tool for Genome-wide Complex Trait Analysis. American Journal of Human Genetics. 2011;88(1):76-82. doi:10.1016/j.ajhg.2010.11.011.

Yang J, Benyamin B, McEvoy BP, et al. Common SNPs explain a large proportion of heritability for human height. Nature genetics. 2010;42(7):565-569. doi:10.1038/ng.608.

HealthInfoLab/hca

About

Resources

Stars

Watchers

Forks

Releases

Languages