This program includes three tools for analyzing genetic data:
-
GRM generation from SNP data.
-
Heritability analysis for a given phenotype.
-
Heritable component analysis for a given set of phenotypic features.
The file formats used in this program are explained below. See the example_data
folder for example files.
-
SNP data is accepted in PLINK binary format (BED/BIM/FAM files); use PLINK to convert from other formats into the efficient binary format.
-
Pregiven kinship matrices (GRMs) are accepted in the GCTA GRM format. Likewise, GRMs generated from SNP data are saved in this format.
-
Phenotypic and covariate data are accepted in the following CSV format:
family_id,individual_id,feature_1,feature_2,feature_n
. Do not include a header.
All program output is saved to a directory with the following name: hca-run-{epoch_timestamp}
. If the output_file_prefix
option is specified, it must be a directory. Then, program output will be saved to the following directory: {output_file_prefix}/hca-run-{epoch_timestamp}
.
Output is generally in CSV format. The only exception to this rule is kinship matrices; these are outputted in the GCTA GRM format. All program outputs are below:
File Path (relative to output directory) | File Format | Tool | File Purpose |
---|---|---|---|
/score/scores.csv |
CSV | score |
The generated scores will be saved in this file. |
/hca/cv_{n}_result.csv |
CSV | hca |
This file will be created for every CV repeat; n will represent the CV repeat iteration. This file will include the h^2 scores for each lambda, for each testing fold. |
/hca/weights_final.csv |
CSV | hca |
This file will include the final weights generated by HCA. |
/h2r/result.csv |
CSV | h2r , hca |
If running h2r , this file will represent the heritability analysis results. If running hca , this file will represent the heritability analysis results of the FINAL weights, with the full dataset. |
/kinship/grm.kinship |
GCTA GRM | h2r , hca , kinship |
This file will include the generated kinship matrix (GRM). |
/kinship/grm.kinship_id |
GCTA GRM ID | h2r , hca , kinship |
This file will include individual metadata for the GRM. |
To run the program, you first need to choose a program command (specified via the program_command
option). The available commands are:
-
hca
(heritable component analysis) -
h2r
(heritability analysis) -
kinship
(GRM generation) -
score
(score a phenotype with a HCA-generated trait)
If you see a warning related to matrix inversion, note the following. If matrix inversion fails, small values will be added to the matrix diagonals. This will generally resolve the invertibility error, but may adversely affect the results. As such, if this warning appears, note that your results may be unstable.
If cross-validation is not used (hca_cv_folds
== 1), the following process will be used for lambda tuning:
-
HCA will be run with each lambda value.
-
For each Lambda value, Heritability analysis will be run with the generated trait.
-
The Lambda value that generates the most heritable trait trait will be saved as a result of the analysis.
If cross-validation is used (hca_cv_folds
!= 1), the following process will be used for lambda tuning:
-
The data will be split into
hca_cv_folds
folds. -
HCA will be run for each lambda value
hca_cv_folds
times. Each time, one fold will be chosen for testing data, and the rest of the folds will be used for training data. HCA will be run with the training data, and the result will be scored via heritability analysis (with the testing data). The resultingh^2
will be saved. -
Steps 1-2 will be repeated
hca_cv_repeats
times, ifhca_cv_repeats
is greater than 1. -
Each lambda value will have been used
hca_cv_folds * hca_cv_repeats
times. The averageh^2
value from all of these iterations will be taken. The lambda value with the best averageh^2
will be chosen as the best. -
HCA/heritability analysis will be run again to get the final result. All data will be used for both the training and testing stages (there will be no separate test set for this stage).
When cross-validation is used, the data gleaned from each CV repeat will be saved in a CSV file. This file will include the h^2
scores for each run in that repeat. You may find this information helpful when troubleshooting.
Note that some datasets may be particularly sensitive to removing subjects from training data. Cross-validation will remove subjects from training data for the sake of creating a test set. If you find that results are unstable with cross-validation, consider running the program without this functionality.
Examples for each program_command
are included below:
hca
: ./hca-dev-new --program_command hca --kinship_src pregiven --given_kinship_file ../example_data/kinship/data.grm --given_kinship_id_file ../example_data/kinship/id.grm --phen_file ../example_data/phenotypic_data/data.csv --c_cov_file ../example_data/categorial_covariates/data.csv --hca_lambda_values_file lambda.txt
h2r:
./hca-dev-new --program_command h2r --kinship_src pregiven --given_kinship_file ../example_data/kinship/data.grm --given_kinship_id_file ../example_data/kinship/id.grm --phen_file ../example_data/phenotypic_data/data.csv --c_cov_file ../example_data/categorial_covariates/data.csv
kinship
: ./hca-dev-new --program_command kinship --geno_fam_file ../example_data/snp_data/fam.fam --geno_bim_file ../example_data/snp_data/bim.bim --geno_bed_file ../example_data/snp_data/bed.bed --snps_in_memory 10000 --num_threads 40
score
: ./hca-dev-new --program_command score --trait_file ../example_data/weights_final.csv --phen_file ../example_data/phenotypic_data/data.csv
To compile on most Linux distributions, follow these steps:
-
Install g++-5:
apt-get install g++-5
-
Install the Armadillo matrix library (download here and run
cmake . && make && sudo make install
) -
Install the NLOpt optimization library (download here and run
./configure && make && sudo make install
) -
Install the OpenBLAS library from source. Make sure to specify
DYNAMIC_ARCH=1
when running make and make install, if you plan on using the binary across multiple architectures. -
Install Boost:
sudo apt-get install libboost-all-dev
. -
Install lapack:
sudo apt-get install liblapack-dev
. -
Run
make --file Makefile_linux
in thesrc
directory.
Name | Relevant Tools | Required? | Allowed Values | Default | Description |
---|---|---|---|---|---|
program_command |
hca , h2r , kinship , score |
Yes | hca , h2r , kinship , score |
Program command. | |
num_threads |
hca , h2r , kinship , score |
No | Integer | 1 | Number of threads to use. |
output_file_prefix |
hca , h2r , kinship , score |
No | String (directory) | Prefix for data output. Will be interpreted as directory. Default: current directory. | |
log_level |
hca , h2r , kinship , score |
No | info , warning , debug , error |
info |
Log level. |
snp_missing_handler |
hca , h2r , kinship |
No | deleteMissing , fillMean |
deleteMissing |
How to handle missing data when creating a GRM from SNP data. |
maf_cutoff |
hca , h2r ,kinship |
No | Double | MAF cutoff. Include if you want to exclude SNPs based on MAF. | |
hwe_cutoff |
hca , h2r ,kinship |
No | Double | HWE cutoff. Include if you want to exclude SNPs based on MAF. | |
snps_in_memory |
hca , h2r , kinship |
No | Integer | 1000 | Number of SNPs to keep in memory at a time. |
all_snps_in_memory |
hca , h2r , kinship |
No | Boolean (true , false ) |
false |
Whether or not to perform GRM generation with all data in memory. This is significantly faster than the iterative approach, but may not be feasible for large datasets. |
kinship_src |
hca , h2r |
Yes (for hca , h2r ) |
pregiven , snp |
Where to get kinship from. If pregiven , must provide kinship files. If snp , must provide geno data. |
|
indv_whitelist_file |
hca , h2r , kinship |
No | String (file) | Whitelist individuals for analysis. NOTE: this option has not yet been implemented and currently has no effect. If you need to remove any individuals from analysis, simply remove them from at least one file in the input dataset. | |
indv_blacklist_file |
hca , h2r , kinship |
No | String (file) | Blacklist individuals for analysis. NOTE: this option has not yet been implemented and currently has no effect.,If you need to remove any individuals from analysis, simply remove them from at least one file in the input dataset. | |
c_cov_file |
hca , h2r |
No | String (file) | Categorical covariate data. | |
c_cov_spec_file |
hca , h2r |
Yes, if a whitelist or blacklist is specified for categorical covariates. | String (file) | Categorical covariate spec, one feature name for file. | |
c_cov_whitelist_file |
hca , h2r |
No | String (file) | Whitelist covariate features. | |
c_cov_blacklist_file |
hca , h2r |
No | String (file) | Blacklist covariate features. | |
q_cov_file |
hca , h2r |
No | String (file) | Quantitative covariate data. | |
q_cov_spec_file |
hca , h2r |
Yes, if a whitelist or blacklist is specified for quantitative covariates. | String (file) | Quantitative covariate spec, one feature name for file. | |
q_cov_whitelist_file |
hca , h2r |
No | String (file) | Whitelist covariate features. | |
q_cov_blacklist_file |
hca , h2r |
No | String (file) | Blacklist covariate features. | |
phen_file |
hca , h2r |
Yes (for hca , h2r ) |
String (file) | Phenotypic data. Required. | |
phen_spec_file |
hca , h2r |
Yes, if a whitelist or blacklist is specified for phenotypic data. | String (file) | Phenotypic data spec, one feature name for file. | |
phen_whitelist_file |
hca , h2r |
No | String (file) | Whitelist phenotypic features. | |
phen_blacklist_file |
hca , h2r |
No | String (file) | Blacklist phenotypic features. | |
given_kinship_file |
hca , h2r |
Required if using pregiven kinship data (for hca , h2r ). You must use either pregiven or SNP-generated kinship data. |
String (file) | Given kinship data path (data file). | |
given_kinship_id_file |
hca , h2r |
Required if using pregiven kinship data (for hca , h2r ). You must use either pregiven or SNP-generated kinship data. |
String (file) | Given kinship data path (ID file). | |
geno_fam_file |
hca , h2r , kinship |
Required if using SNP-generated kinship data (for hca , h2r ). You must use either pregiven or SNP-generated kinship data. |
String (file) | Genotypic data file (FAM file). | |
geno_bim_file |
hca , h2r , kinship |
Required if using SNP-generated kinship data (for hca , h2r ). You must use either pregiven or SNP-generated kinship data. |
String (file) | Genotypic data file (BIM file). | |
geno_bed_file |
hca , h2r , kinship |
Required if using SNP-generated kinship data (for hca , h2r ). You must use either pregiven or SNP-generated kinship data. |
String (file) | Genotypic data file (BED file). | |
geno_whitelist_file |
hca , h2r , kinship |
No | String (file) | Whitelist SNPs. | |
geno_blacklist_file |
hca , h2r , kinship |
No | String (file) | Blacklist SNPs. | |
trait_file |
score |
Yes (for score ). |
String (file) | Trait to score phenotypic data with. | |
coef_file |
hca |
No. | String (file) | Coefficient constraint file for HCA. Include if you want to constrain the weight coefficients. | |
hca_lambda_values_file |
hca |
Required if populating HCA lambda values from a file (for hca ). Exactly one lambda strategy (file, auto, interval) must be used. |
String (file) | File with lambda values for HCA, one per line. Each value must be a double. | |
hca_max_iterations |
hca |
No. | Integer | 1000 | Max iterations for HCA. |
h2r_max_iterations |
h2r |
No. | Integer | 1000 | Max iterations for H2R. |
hca_cv_folds |
hca |
No. | Integer | 1 | Number of folds to use for HCA cross-validation. If 1, no CV will be performed. |
hca_cv_repeats |
hca |
No. | Integer | 1 | Number of repeats to use for HCA cross-validation. |
auto_lambda |
hca |
Required if populating HCA lambda values automatically (for hca ). Exactly one lambda strategy (file, auto, interval) must be used. |
true , false |
false |
Whether to use auto lambda discovery. |
hca_lambda_start |
hca |
Required if populating HCA lambda values from an interval (for hca ). Exactly one lambda strategy (file, auto, interval) must be used. |
Double | Start value for lambda search. | |
hca_lambda_end |
hca |
Required if populating HCA lambda values from an interval (for hca ). Exactly one lambda strategy (file, auto, interval) must be used. |
Double | End value for lambda search. | |
hca_lambda_incr |
hca |
Required if populating HCA lambda values from an interval (for hca ). Exactly one lambda strategy (file, auto, interval) must be used. |
Double | Interval value for lambda search. | |
hca_gr_cutoff |
hca |
No. | Double | GR cutoff for HCA. NOTE: this option has not yet been implemented and currently has no effect. If we decide to implement it, it should have similar behavior to the grm-cutoff option in the GCTA software. |
The following references were used while preparing this program:
Sun J, Kranzler HR, Bi J. Refining multivariate disease phenotypes for high chip heritability. BMC Medical Genomics. 2015;8(Suppl 3):S3. doi:10.1186/1755-8794-8-S3-S3.
Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: A Tool for Genome-wide Complex Trait Analysis. American Journal of Human Genetics. 2011;88(1):76-82. doi:10.1016/j.ajhg.2010.11.011.
Yang J, Benyamin B, McEvoy BP, et al. Common SNPs explain a large proportion of heritability for human height. Nature genetics. 2010;42(7):565-569. doi:10.1038/ng.608.