Skip to content
Permalink
master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Go to file
 
 
Cannot retrieve contributors at this time
# Heritable Component Analysis Pipeline
This program includes three tools for analyzing genetic data:
1. GRM generation from SNP data.
2. Heritability analysis for a given phenotype.
3. Heritable component analysis for a given set of phenotypic features.
# Program I/O
The file formats used in this program are explained below. See the `example_data` folder for example files.
1. SNP data is accepted in PLINK binary format (BED/BIM/FAM files); use PLINK to convert from other formats into the efficient binary format.
2. Pregiven kinship matrices (GRMs) are accepted in the GCTA GRM format. Likewise, GRMs generated from SNP data are saved in this format.
3. Phenotypic and covariate data are accepted in the following CSV format: `family_id,individual_id,feature_1,feature_2,feature_n`. Do not include a header.
All program output is saved to a directory with the following name: `hca-run-{epoch_timestamp}`. If the `output_file_prefix` option is specified, it must be a directory. Then, program output will be saved to the following directory: `{output_file_prefix}/hca-run-{epoch_timestamp}`.
Output is generally in CSV format. The only exception to this rule is kinship matrices; these are outputted in the GCTA GRM format. All program outputs are below:
| File Path (relative to output directory) | File Format | Tool | File Purpose |
|------------------------------------------|-------------|-------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `/score/scores.csv` | CSV | `score` | The generated scores will be saved in this file. |
| `/hca/cv_{n}_result.csv` | CSV | `hca` | This file will be created for every CV repeat; `n` will represent the CV repeat iteration. This file will include the `h^2` scores for each lambda, for each testing fold. |
| `/hca/weights_final.csv` | CSV | `hca` | This file will include the final weights generated by HCA. |
| `/h2r/result.csv` | CSV | `h2r`, `hca` | If running `h2r`, this file will represent the heritability analysis results. If running `hca`, this file will represent the heritability analysis results of the FINAL weights, with the full dataset. |
| `/kinship/grm.kinship` | GCTA GRM | `h2r`, `hca`, `kinship` | This file will include the generated kinship matrix (GRM). |
| `/kinship/grm.kinship_id` | GCTA GRM ID | `h2r`, `hca`, `kinship` | This file will include individual metadata for the GRM. |
# Usage
To run the program, you first need to choose a program command (specified via the `program_command` option). The available commands are:
1. `hca` (heritable component analysis)
2. `h2r` (heritability analysis)
3. `kinship` (GRM generation)
4. `score` (score a phenotype with a HCA-generated trait)
If you see a warning related to matrix inversion, note the following. If matrix inversion fails, small values will be added to the matrix diagonals. This will generally resolve the invertibility error, but may adversely affect the results. As such, if this warning appears, note that your results may be unstable.
# HCA Documentation
## HCA Cross-Validation
If cross-validation is not used (`hca_cv_folds` == 1), the following process will be used for lambda tuning:
1. HCA will be run with each lambda value.
2. For each Lambda value, Heritability analysis will be run with the generated trait.
3. The Lambda value that generates the most heritable trait trait will be saved as a result of the analysis.
If cross-validation is used (`hca_cv_folds` != 1), the following process will be used for lambda tuning:
1. The data will be split into `hca_cv_folds` folds.
2. HCA will be run for each lambda value `hca_cv_folds` times. Each time, one fold will be chosen for testing data, and the rest of the folds will be used for training data. HCA will be run with the training data, and the result will be scored via heritability analysis (with the testing data). The resulting `h^2` will be saved.
3. Steps 1-2 will be repeated `hca_cv_repeats` times, if `hca_cv_repeats` is greater than 1.
4. Each lambda value will have been used `hca_cv_folds * hca_cv_repeats` times. The average `h^2` value from all of these iterations will be taken. The lambda value with the best average `h^2` will be chosen as the best.
5. HCA/heritability analysis will be run again to get the final result. All data will be used for both the training and testing stages (there will be no separate test set for this stage).
When cross-validation is used, the data gleaned from each CV repeat will be saved in a CSV file. This file will include the `h^2` scores for each run in that repeat. You may find this information helpful when troubleshooting.
Note that some datasets may be particularly sensitive to removing subjects from training data. Cross-validation will remove subjects from training data for the sake of creating a test set. If you find that results are unstable with cross-validation, consider running the program without this functionality.
# Example Commands
Examples for each `program_command` are included below:
`hca`: `./hca-dev-new --program_command hca --kinship_src pregiven --given_kinship_file ../example_data/kinship/data.grm --given_kinship_id_file ../example_data/kinship/id.grm --phen_file ../example_data/phenotypic_data/data.csv --c_cov_file ../example_data/categorial_covariates/data.csv --hca_lambda_values_file lambda.txt`
`h2r:` `./hca-dev-new --program_command h2r --kinship_src pregiven --given_kinship_file ../example_data/kinship/data.grm --given_kinship_id_file ../example_data/kinship/id.grm --phen_file ../example_data/phenotypic_data/data.csv --c_cov_file ../example_data/categorial_covariates/data.csv`
`kinship`: `./hca-dev-new --program_command kinship --geno_fam_file ../example_data/snp_data/fam.fam --geno_bim_file ../example_data/snp_data/bim.bim --geno_bed_file ../example_data/snp_data/bed.bed --snps_in_memory 10000 --num_threads 40`
`score`: `./hca-dev-new --program_command score --trait_file ../example_data/weights_final.csv --phen_file ../example_data/phenotypic_data/data.csv`
# Compiling
To compile on most Linux distributions, follow these steps:
1. Install g++-5: `apt-get install g++-5`
2. Install the Armadillo matrix library (download [here](http://arma.sourceforge.net/download.html) and run `cmake . && make && sudo make install`)
3. Install the NLOpt optimization library (download [here](http://ab-initio.mit.edu/nlopt/) and run `./configure && make && sudo make install`)
4. Install the OpenBLAS library [from source](https://github.com/xianyi/OpenBLAS/wiki/Installation-Guide). Make sure to specify `DYNAMIC_ARCH=1` when running make and make install, if you plan on using the binary across multiple architectures.
5. Install Boost: `sudo apt-get install libboost-all-dev`.
6. Install lapack: `sudo apt-get install liblapack-dev`.
7. Run `make --file Makefile_linux` in the `src` directory.
# Options
| Name | Relevant Tools | Required? | Allowed Values | Default | Description |
|-------------------------- |---------------------------------- |------------------------------------------------------------------------------------------------------------------------------------------ |------------------------------------- |----------------- |--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `program_command` | `hca`, `h2r`, `kinship`, `score` | Yes | `hca`, `h2r`, `kinship`, `score` | | Program command. |
| `num_threads` | `hca`, `h2r`, `kinship`, `score` | No | Integer | 1 | Number of threads to use. |
| `output_file_prefix` | `hca`, `h2r`, `kinship`, `score` | No | String (directory) | | Prefix for data output. Will be interpreted as directory. Default: current directory. |
| `log_level` | `hca`, `h2r`, `kinship`, `score` | No | `info`, `warning`, `debug`, `error` | `info` | Log level. |
| `snp_missing_handler` | `hca`, `h2r`, `kinship` | No | `deleteMissing`, `fillMean` | `deleteMissing` | How to handle missing data when creating a GRM from SNP data. |
| `maf_cutoff` | `hca`, `h2r`,`kinship` | No | Double | | MAF cutoff. Include if you want to exclude SNPs based on MAF. |
| `hwe_cutoff` | `hca`, `h2r`,`kinship` | No | Double | | HWE cutoff. Include if you want to exclude SNPs based on MAF. |
| `snps_in_memory` | `hca`, `h2r`, `kinship` | No | Integer | 1000 | Number of SNPs to keep in memory at a time. |
| `all_snps_in_memory` | `hca`, `h2r`, `kinship` | No | Boolean (`true`, `false`) | `false` | Whether or not to perform GRM generation with all data in memory. This is significantly faster than the iterative approach, but may not be feasible for large datasets. |
| `kinship_src` | `hca`, `h2r` | Yes (for `hca`, `h2r`) | `pregiven`, `snp` | | Where to get kinship from. If `pregiven`, must provide kinship files. If `snp`, must provide geno data. |
| `indv_whitelist_file` | `hca`, `h2r`, `kinship` | No | String (file) | | Whitelist individuals for analysis. NOTE: this option has not yet been implemented and currently has no effect. If you need to remove any individuals from analysis, simply remove them from at least one file in the input dataset. |
| `indv_blacklist_file` | `hca`, `h2r`, `kinship` | No | String (file) | | Blacklist individuals for analysis. NOTE: this option has not yet been implemented and currently has no effect.,If you need to remove any individuals from analysis, simply remove them from at least one file in the input dataset. |
| `c_cov_file` | `hca`, `h2r` | No | String (file) | | Categorical covariate data. |
| `c_cov_spec_file` | `hca`, `h2r` | Yes, if a whitelist or blacklist is specified for categorical covariates. | String (file) | | Categorical covariate spec, one feature name for file. |
| `c_cov_whitelist_file` | `hca`, `h2r` | No | String (file) | | Whitelist covariate features. |
| `c_cov_blacklist_file` | `hca`, `h2r` | No | String (file) | | Blacklist covariate features. |
| `q_cov_file` | `hca`, `h2r` | No | String (file) | | Quantitative covariate data. |
| `q_cov_spec_file` | `hca`, `h2r` | Yes, if a whitelist or blacklist is specified for quantitative covariates. | String (file) | | Quantitative covariate spec, one feature name for file. |
| `q_cov_whitelist_file` | `hca`, `h2r` | No | String (file) | | Whitelist covariate features. |
| `q_cov_blacklist_file` | `hca`, `h2r` | No | String (file) | | Blacklist covariate features. |
| `phen_file` | `hca`, `h2r` | Yes (for `hca`, `h2r`) | String (file) | | Phenotypic data. Required. |
| `phen_spec_file` | `hca`, `h2r` | Yes, if a whitelist or blacklist is specified for phenotypic data. | String (file) | | Phenotypic data spec, one feature name for file. |
| `phen_whitelist_file` | `hca`, `h2r` | No | String (file) | | Whitelist phenotypic features. |
| `phen_blacklist_file` | `hca`, `h2r` | No | String (file) | | Blacklist phenotypic features. |
| `given_kinship_file` | `hca`, `h2r` | Required if using pregiven kinship data (for `hca`, `h2r`). You must use either pregiven or SNP-generated kinship data. | String (file) | | Given kinship data path (data file). |
| `given_kinship_id_file` | `hca`, `h2r` | Required if using pregiven kinship data (for `hca`, `h2r`). You must use either pregiven or SNP-generated kinship data. | String (file) | | Given kinship data path (ID file). |
| `geno_fam_file` | `hca`, `h2r`, `kinship` | Required if using SNP-generated kinship data (for `hca`, `h2r`). You must use either pregiven or SNP-generated kinship data. | String (file) | | Genotypic data file (FAM file). |
| `geno_bim_file` | `hca`, `h2r`, `kinship ` | Required if using SNP-generated kinship data (for `hca`, `h2r`). You must use either pregiven or SNP-generated kinship data. | String (file) | | Genotypic data file (BIM file). |
| `geno_bed_file` | `hca`, `h2r`, `kinship ` | Required if using SNP-generated kinship data (for `hca`, `h2r`). You must use either pregiven or SNP-generated kinship data. | String (file) | | Genotypic data file (BED file). |
| `geno_whitelist_file` | `hca`, `h2r`, `kinship ` | No | String (file) | | Whitelist SNPs. |
| `geno_blacklist_file` | `hca`, `h2r`, `kinship ` | No | String (file) | | Blacklist SNPs. |
| `trait_file` | `score` | Yes (for `score`). | String (file) | | Trait to score phenotypic data with. |
| `coef_file` | `hca` | No. | String (file) | | Coefficient constraint file for HCA. Include if you want to constrain the weight coefficients. |
| `hca_lambda_values_file` | `hca` | Required if populating HCA lambda values from a file (for `hca`). Exactly one lambda strategy (file, auto, interval) must be used. | String (file) | | File with lambda values for HCA, one per line. Each value must be a double. |
| `hca_max_iterations` | `hca` | No. | Integer | 1000 | Max iterations for HCA. |
| `h2r_max_iterations` | `h2r` | No. | Integer | 1000 | Max iterations for H2R. |
| `hca_cv_folds` | `hca` | No. | Integer | 1 | Number of folds to use for HCA cross-validation. If 1, no CV will be performed. |
| `hca_cv_repeats` | `hca` | No. | Integer | 1 | Number of repeats to use for HCA cross-validation. |
| `auto_lambda` | `hca` | Required if populating HCA lambda values automatically (for `hca`). Exactly one lambda strategy (file, auto, interval) must be used. | `true`, `false` | `false` | Whether to use auto lambda discovery. |
| `hca_lambda_start` | `hca` | Required if populating HCA lambda values from an interval (for `hca`). Exactly one lambda strategy (file, auto, interval) must be used. | Double | | Start value for lambda search. |
| `hca_lambda_end` | `hca` | Required if populating HCA lambda values from an interval (for `hca`). Exactly one lambda strategy (file, auto, interval) must be used. | Double | | End value for lambda search. |
| `hca_lambda_incr` | `hca` | Required if populating HCA lambda values from an interval (for `hca`). Exactly one lambda strategy (file, auto, interval) must be used. | Double | | Interval value for lambda search. |
| `hca_gr_cutoff` | `hca` | No. | Double | | GR cutoff for HCA. NOTE: this option has not yet been implemented and currently has no effect. If we decide to implement it, it should have similar behavior to the `grm-cutoff` option in the GCTA software. |
# References
The following references were used while preparing this program:
```
Sun J, Kranzler HR, Bi J. Refining multivariate disease phenotypes for high chip heritability. BMC Medical Genomics. 2015;8(Suppl 3):S3. doi:10.1186/1755-8794-8-S3-S3.
Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: A Tool for Genome-wide Complex Trait Analysis. American Journal of Human Genetics. 2011;88(1):76-82. doi:10.1016/j.ajhg.2010.11.011.
Yang J, Benyamin B, McEvoy BP, et al. Common SNPs explain a large proportion of heritability for human height. Nature genetics. 2010;42(7):565-569. doi:10.1038/ng.608.
```