Skip to content

HealthInfoLab/hca

master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
src
 
 
 
 
 
 

Heritable Component Analysis Pipeline

This program includes three tools for analyzing genetic data:

  1. GRM generation from SNP data.

  2. Heritability analysis for a given phenotype.

  3. Heritable component analysis for a given set of phenotypic features.

Program I/O

The file formats used in this program are explained below. See the example_data folder for example files.

  1. SNP data is accepted in PLINK binary format (BED/BIM/FAM files); use PLINK to convert from other formats into the efficient binary format.

  2. Pregiven kinship matrices (GRMs) are accepted in the GCTA GRM format. Likewise, GRMs generated from SNP data are saved in this format.

  3. Phenotypic and covariate data are accepted in the following CSV format: family_id,individual_id,feature_1,feature_2,feature_n. Do not include a header.

All program output is saved to a directory with the following name: hca-run-{epoch_timestamp}. If the output_file_prefix option is specified, it must be a directory. Then, program output will be saved to the following directory: {output_file_prefix}/hca-run-{epoch_timestamp}.

Output is generally in CSV format. The only exception to this rule is kinship matrices; these are outputted in the GCTA GRM format. All program outputs are below:

File Path (relative to output directory) File Format Tool File Purpose
/score/scores.csv CSV score The generated scores will be saved in this file.
/hca/cv_{n}_result.csv CSV hca This file will be created for every CV repeat; n will represent the CV repeat iteration. This file will include the h^2 scores for each lambda, for each testing fold.
/hca/weights_final.csv CSV hca This file will include the final weights generated by HCA.
/h2r/result.csv CSV h2r, hca If running h2r, this file will represent the heritability analysis results. If running hca, this file will represent the heritability analysis results of the FINAL weights, with the full dataset.
/kinship/grm.kinship GCTA GRM h2r, hca, kinship This file will include the generated kinship matrix (GRM).
/kinship/grm.kinship_id GCTA GRM ID h2r, hca, kinship This file will include individual metadata for the GRM.

Usage

To run the program, you first need to choose a program command (specified via the program_command option). The available commands are:

  1. hca (heritable component analysis)

  2. h2r (heritability analysis)

  3. kinship (GRM generation)

  4. score (score a phenotype with a HCA-generated trait)

If you see a warning related to matrix inversion, note the following. If matrix inversion fails, small values will be added to the matrix diagonals. This will generally resolve the invertibility error, but may adversely affect the results. As such, if this warning appears, note that your results may be unstable.

HCA Documentation

HCA Cross-Validation

If cross-validation is not used (hca_cv_folds == 1), the following process will be used for lambda tuning:

  1. HCA will be run with each lambda value.

  2. For each Lambda value, Heritability analysis will be run with the generated trait.

  3. The Lambda value that generates the most heritable trait trait will be saved as a result of the analysis.

If cross-validation is used (hca_cv_folds != 1), the following process will be used for lambda tuning:

  1. The data will be split into hca_cv_folds folds.

  2. HCA will be run for each lambda value hca_cv_folds times. Each time, one fold will be chosen for testing data, and the rest of the folds will be used for training data. HCA will be run with the training data, and the result will be scored via heritability analysis (with the testing data). The resulting h^2 will be saved.

  3. Steps 1-2 will be repeated hca_cv_repeats times, if hca_cv_repeats is greater than 1.

  4. Each lambda value will have been used hca_cv_folds * hca_cv_repeats times. The average h^2 value from all of these iterations will be taken. The lambda value with the best average h^2 will be chosen as the best.

  5. HCA/heritability analysis will be run again to get the final result. All data will be used for both the training and testing stages (there will be no separate test set for this stage).

When cross-validation is used, the data gleaned from each CV repeat will be saved in a CSV file. This file will include the h^2 scores for each run in that repeat. You may find this information helpful when troubleshooting.

Note that some datasets may be particularly sensitive to removing subjects from training data. Cross-validation will remove subjects from training data for the sake of creating a test set. If you find that results are unstable with cross-validation, consider running the program without this functionality.

Example Commands

Examples for each program_command are included below:

hca: ./hca-dev-new --program_command hca --kinship_src pregiven --given_kinship_file ../example_data/kinship/data.grm --given_kinship_id_file ../example_data/kinship/id.grm --phen_file ../example_data/phenotypic_data/data.csv --c_cov_file ../example_data/categorial_covariates/data.csv --hca_lambda_values_file lambda.txt

h2r: ./hca-dev-new --program_command h2r --kinship_src pregiven --given_kinship_file ../example_data/kinship/data.grm --given_kinship_id_file ../example_data/kinship/id.grm --phen_file ../example_data/phenotypic_data/data.csv --c_cov_file ../example_data/categorial_covariates/data.csv

kinship: ./hca-dev-new --program_command kinship --geno_fam_file ../example_data/snp_data/fam.fam --geno_bim_file ../example_data/snp_data/bim.bim --geno_bed_file ../example_data/snp_data/bed.bed --snps_in_memory 10000 --num_threads 40

score: ./hca-dev-new --program_command score --trait_file ../example_data/weights_final.csv --phen_file ../example_data/phenotypic_data/data.csv

Compiling

To compile on most Linux distributions, follow these steps:

  1. Install g++-5: apt-get install g++-5

  2. Install the Armadillo matrix library (download here and run cmake . && make && sudo make install)

  3. Install the NLOpt optimization library (download here and run ./configure && make && sudo make install)

  4. Install the OpenBLAS library from source. Make sure to specify DYNAMIC_ARCH=1 when running make and make install, if you plan on using the binary across multiple architectures.

  5. Install Boost: sudo apt-get install libboost-all-dev.

  6. Install lapack: sudo apt-get install liblapack-dev.

  7. Run make --file Makefile_linux in the src directory.

Options

Name Relevant Tools Required? Allowed Values Default Description
program_command hca, h2r, kinship, score Yes hca, h2r, kinship, score Program command.
num_threads hca, h2r, kinship, score No Integer 1 Number of threads to use.
output_file_prefix hca, h2r, kinship, score No String (directory) Prefix for data output. Will be interpreted as directory. Default: current directory.
log_level hca, h2r, kinship, score No info, warning, debug, error info Log level.
snp_missing_handler hca, h2r, kinship No deleteMissing, fillMean deleteMissing How to handle missing data when creating a GRM from SNP data.
maf_cutoff hca, h2r,kinship No Double MAF cutoff. Include if you want to exclude SNPs based on MAF.
hwe_cutoff hca, h2r,kinship No Double HWE cutoff. Include if you want to exclude SNPs based on MAF.
snps_in_memory hca, h2r, kinship No Integer 1000 Number of SNPs to keep in memory at a time.
all_snps_in_memory hca, h2r, kinship No Boolean (true, false) false Whether or not to perform GRM generation with all data in memory. This is significantly faster than the iterative approach, but may not be feasible for large datasets.
kinship_src hca, h2r Yes (for hca, h2r) pregiven, snp Where to get kinship from. If pregiven, must provide kinship files. If snp, must provide geno data.
indv_whitelist_file hca, h2r, kinship No String (file) Whitelist individuals for analysis. NOTE: this option has not yet been implemented and currently has no effect. If you need to remove any individuals from analysis, simply remove them from at least one file in the input dataset.
indv_blacklist_file hca, h2r, kinship No String (file) Blacklist individuals for analysis. NOTE: this option has not yet been implemented and currently has no effect.,If you need to remove any individuals from analysis, simply remove them from at least one file in the input dataset.
c_cov_file hca, h2r No String (file) Categorical covariate data.
c_cov_spec_file hca, h2r Yes, if a whitelist or blacklist is specified for categorical covariates. String (file) Categorical covariate spec, one feature name for file.
c_cov_whitelist_file hca, h2r No String (file) Whitelist covariate features.
c_cov_blacklist_file hca, h2r No String (file) Blacklist covariate features.
q_cov_file hca, h2r No String (file) Quantitative covariate data.
q_cov_spec_file hca, h2r Yes, if a whitelist or blacklist is specified for quantitative covariates. String (file) Quantitative covariate spec, one feature name for file.
q_cov_whitelist_file hca, h2r No String (file) Whitelist covariate features.
q_cov_blacklist_file hca, h2r No String (file) Blacklist covariate features.
phen_file hca, h2r Yes (for hca, h2r) String (file) Phenotypic data. Required.
phen_spec_file hca, h2r Yes, if a whitelist or blacklist is specified for phenotypic data. String (file) Phenotypic data spec, one feature name for file.
phen_whitelist_file hca, h2r No String (file) Whitelist phenotypic features.
phen_blacklist_file hca, h2r No String (file) Blacklist phenotypic features.
given_kinship_file hca, h2r Required if using pregiven kinship data (for hca, h2r). You must use either pregiven or SNP-generated kinship data. String (file) Given kinship data path (data file).
given_kinship_id_file hca, h2r Required if using pregiven kinship data (for hca, h2r). You must use either pregiven or SNP-generated kinship data. String (file) Given kinship data path (ID file).
geno_fam_file hca, h2r, kinship Required if using SNP-generated kinship data (for hca, h2r). You must use either pregiven or SNP-generated kinship data. String (file) Genotypic data file (FAM file).
geno_bim_file hca, h2r, kinship Required if using SNP-generated kinship data (for hca, h2r). You must use either pregiven or SNP-generated kinship data. String (file) Genotypic data file (BIM file).
geno_bed_file hca, h2r, kinship Required if using SNP-generated kinship data (for hca, h2r). You must use either pregiven or SNP-generated kinship data. String (file) Genotypic data file (BED file).
geno_whitelist_file hca, h2r, kinship No String (file) Whitelist SNPs.
geno_blacklist_file hca, h2r, kinship No String (file) Blacklist SNPs.
trait_file score Yes (for score). String (file) Trait to score phenotypic data with.
coef_file hca No. String (file) Coefficient constraint file for HCA. Include if you want to constrain the weight coefficients.
hca_lambda_values_file hca Required if populating HCA lambda values from a file (for hca). Exactly one lambda strategy (file, auto, interval) must be used. String (file) File with lambda values for HCA, one per line. Each value must be a double.
hca_max_iterations hca No. Integer 1000 Max iterations for HCA.
h2r_max_iterations h2r No. Integer 1000 Max iterations for H2R.
hca_cv_folds hca No. Integer 1 Number of folds to use for HCA cross-validation. If 1, no CV will be performed.
hca_cv_repeats hca No. Integer 1 Number of repeats to use for HCA cross-validation.
auto_lambda hca Required if populating HCA lambda values automatically (for hca). Exactly one lambda strategy (file, auto, interval) must be used. true, false false Whether to use auto lambda discovery.
hca_lambda_start hca Required if populating HCA lambda values from an interval (for hca). Exactly one lambda strategy (file, auto, interval) must be used. Double Start value for lambda search.
hca_lambda_end hca Required if populating HCA lambda values from an interval (for hca). Exactly one lambda strategy (file, auto, interval) must be used. Double End value for lambda search.
hca_lambda_incr hca Required if populating HCA lambda values from an interval (for hca). Exactly one lambda strategy (file, auto, interval) must be used. Double Interval value for lambda search.
hca_gr_cutoff hca No. Double GR cutoff for HCA. NOTE: this option has not yet been implemented and currently has no effect. If we decide to implement it, it should have similar behavior to the grm-cutoff option in the GCTA software.

References

The following references were used while preparing this program:

Sun J, Kranzler HR, Bi J. Refining multivariate disease phenotypes for high chip heritability. BMC Medical Genomics. 2015;8(Suppl 3):S3. doi:10.1186/1755-8794-8-S3-S3.

Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: A Tool for Genome-wide Complex Trait Analysis. American Journal of Human Genetics. 2011;88(1):76-82. doi:10.1016/j.ajhg.2010.11.011.

Yang J, Benyamin B, McEvoy BP, et al. Common SNPs explain a large proportion of heritability for human height. Nature genetics. 2010;42(7):565-569. doi:10.1038/ng.608. 

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Languages