diff --git a/README.md b/README.md
index 5d8ac7b..1e6e5ba 100644
--- a/README.md
+++ b/README.md
@@ -1,100 +1,154 @@
 # Heritable Component Analysis Pipeline
-This repository represents a pipeline that performs three primary functions:
+This program includes three tools for analyzing genetic data:
-1. Heritable Component Analysis
+1. GRM generation from SNP data.
-2. Heritability Estimation
+2. Heritability analysis for a given phenotype.
-3. Kinship Matrix Generation
+3. Heritable component analysis for a given set of phenotypic features.
-The pipeline accepts genotypic and phenotypic data, as well as covariates, and generates a highly-heritable trait. It can then estimate the heritability of that trait via the second function above. If the user already has a kinship matrix available (i.e. from GCTA), the program can accept this matrix. Alternatively, it can use genotypic data to generate the kinship matrix.
+# Program I/O
-# Usage
+The file formats used in this program are explained below. See the `example_data` folder for example files.
-Running the program without any options will trigger the help function, which will show all options that are available. The program takes a command (`kinship`, `h2r`, `hca`, or `score`), as well as a set of options for each command.
+1. SNP data is accepted in PLINK binary format (BED/BIM/FAM files); use PLINK to convert from other formats into the efficient binary format.
-Generally speaking, one can follow the below guidelines when using this program:
+2. Pregiven kinship matrices (GRMs) are accepted in the GCTA GRM format. Likewise, GRMs generated from SNP data are saved in this format.
-1. Obtain the following data: phenotypic data, quantitative and discrete covariates (optional), kinship file (optional), genotypic data (required if kinship data is not present). Ensure that individual IDs are present in all of these files, and are common across each file. If an individual ID is missing from one file but present in another, it will not be included in analysis.
+3. Phenotypic and covariate data are accepted in the following CSV format: `family_id,individual_id,feature_1,feature_2,feature_n`. Do not include a header.
-2. Determine the parameters for your analysis. These are specified in the “heritable component analysis” help section. There are a few options that you must consider: “numSplits” and “lambdaVecFile”. “numSplits” controls the cross-validation functionality; if this is set to 1 (default), cross-validation will not be performed. If this is set to a value of 2 or more, cross-validation will be performed (see the cross-validation section). “lambdaVecFile” must point to a file with lambda values to use during the HCA process; each line must represent one lambda value.
+All program output is saved to a directory with the following name: `hca-run-{epoch_timestamp}`. If the `output_file_prefix` option is specified, it must be a directory. Then, program output will be saved to the following directory: `{output_file_prefix}/hca-run-{epoch_timestamp}`.
-3. Determine if you would like to save output data to disk; if so, specify the “outDir” parameter (set this to “.” to save to the current directory). In addition, specify the “numThreads” option (default 2) to enable multi-thread functionality.
+Output is generally in CSV format. The only exception to this rule is kinship matrices; these are outputted in the GCTA GRM format. All program outputs are below:
-4. Run HCA with the parameters chosen, and observe the output. Analysis may take a long time depending on the size of your dataset.
+| File Path (relative to output directory) | File Format | Tool | File Purpose |
+|---|---|---|---|
+| `/score/scores.csv` | CSV | `score` | The generated scores will be saved in this file. |
+| `/hca/cv_{n}_result.csv` | CSV | `hca` | This file will be created for every CV repeat; `n` will represent the CV repeat iteration. This file will include the `h^2` scores for each lambda, for each testing fold. |
+| `/hca/weights_final.csv` | CSV | `hca` | This file will include the final weights generated by HCA. |
+| `/h2r/result.csv` | CSV | `h2r`, `hca` | If running `h2r`, this file will represent the heritability analysis results. If running `hca`, this file will represent the heritability analysis results of the FINAL weights, with the full dataset. |
+| `/kinship/grm.kinship` | GCTA GRM | `h2r`, `hca`, `kinship` | This file will include the generated kinship matrix (GRM). |
+| `/kinship/grm.kinship_id` | GCTA GRM ID | `h2r`, `hca`, `kinship` | This file will include individual metadata for the GRM. |
+# Usage
-Additional options are present, but are not required. Review the documentation and help output for more details.
+To run the program, you first need to choose a program command (specified via the `program_command` option). The available commands are:
-# HCA Cross-Validation and Lambda Tuning
+1. `hca` (heritable component analysis)
-If “numSplits” is equal to one or is not set, the following process will be used for lambda tuning:
+2. `h2r` (heritability analysis)
-1. HCA will be run with each lambda value.
+3. `kinship` (GRM generation)
-2. For each lambda value, heritability analysis will be run with the generated trait.
+4. `score` (score a phenotype with an HCA-generated trait)
-3. The lambda value that generates the most heritable trait will be saved as a result of the analysis.
+If you see a warning related to matrix inversion, note the following. If matrix inversion fails, small values will be added to the matrix diagonals. This will generally resolve the invertibility error, but may adversely affect the results. As such, if this warning appears, note that your results may be unstable.
-If the “numSplits” option is greater than one, the dataset will be split randomly into “numSplits” splits. The following process will then occur:
+# HCA Documentation
-1. The code will iterate through each lambda value.
+## HCA Cross-Validation
-2. For each lambda value, the code will iterate through each split. On each iteration, the chosen split will be used as cross-validation data; all other data will be marked as training data. HCA will be run with the training data and the current lambda value. Once HCA has been run with all splits, the average heritability score for the current lambda value will be calculated.
+If cross-validation is not used (`hca_cv_folds` == 1), the following process will be used for lambda tuning:
-3. The lambda with the highest average heritability score will be considered the best. HCA and heritability estimation will be re-performed with this lambda value, on the full data set. This will be considered the final result set.
-
-**Important Note on Cross-Validation Functionality**
-
-Note that some datasets may be particularly sensitive to removing certain subjects. As specifying a numSplits value causes subjects to be removed during the training process, this may cause instability in the generated weights. If you notice unstable results with data splitting enabled, consider running the program without this functionality.
-
-# Outputs
+1. HCA will be run with each lambda value.
-In addition to outputting data to the CLI, if an `outDir` parameter is specified, some data will also be saved. For all analysis, if a GRM was generated (non-pregiven), that GRM will be saved to "kinship.csv". The following analysis-specific data will also be saved:
+2. For each lambda value, heritability analysis will be run with the generated trait.
-## HCA
+3. The lambda value that generates the most heritable trait will be saved as a result of the analysis.
-When HCA is run, the final weights will be saved to "trait_hca.csv". Individuals will be scored with these weights, and the output from the "Scoring" section will be saved. In addition, the output from the "Heritability Analysis" section will be saved for the final weights.
+If cross-validation is used (`hca_cv_folds` != 1), the following process will be used for lambda tuning:
-## Heritability Analysis
+1. The data will be split into `hca_cv_folds` folds.
-When heritability analysis is run, statistics regarding the analysis will be saved to "h2r_est.txt".
+2. HCA will be run for each lambda value `hca_cv_folds` times. Each time, one fold will be chosen for testing data, and the rest of the folds will be used for training data. HCA will be run with the training data, and the result will be scored via heritability analysis (with the testing data). The resulting `h^2` will be saved.
-## Scoring
+3. Steps 1-2 will be repeated `hca_cv_repeats` times, if `hca_cv_repeats` is greater than 1.
-When scoring is run, the calculated scores will be saved to "scores.txt".
+4. Each lambda value will have been used `hca_cv_folds * hca_cv_repeats` times. The average `h^2` value from all of these iterations will be taken. The lambda value with the best average `h^2` will be chosen as the best.
-## Kinship Generation
+5. HCA/heritability analysis will be run again to get the final result. All data will be used for both the training and testing stages (there will be no separate test set for this stage).
-When kinship generation is run, the kinship file will be saved to "kinship.txt".
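The cross-validated lambda tuning described in the added `HCA Cross-Validation` text (split into `hca_cv_folds` folds, score each lambda on the held-out fold, repeat `hca_cv_repeats` times, then average `h^2`) can be sketched in Python as follows. This is an illustrative sketch only, not the program's implementation: `fit_hca` and `estimate_h2` are hypothetical callables standing in for the HCA and heritability steps, and the shuffling scheme is an assumption.

```python
import random
from statistics import mean


def select_lambda(subjects, lambdas, fit_hca, estimate_h2,
                  folds=5, repeats=1, seed=0):
    """Return (best_lambda, mean_h2_per_lambda) using repeated k-fold CV.

    fit_hca(train, lam) -> weights and estimate_h2(weights, test) -> float
    are hypothetical stand-ins for the HCA and heritability analysis steps.
    """
    if folds < 2:
        raise ValueError("cross-validation needs at least 2 folds")
    rng = random.Random(seed)
    scores = {lam: [] for lam in lambdas}
    for _ in range(repeats):
        shuffled = list(subjects)
        rng.shuffle(shuffled)
        # Deal subjects into `folds` roughly equal parts.
        parts = [shuffled[i::folds] for i in range(folds)]
        for lam in lambdas:
            for k in range(folds):
                test = parts[k]
                train = [s for j, part in enumerate(parts) if j != k
                         for s in part]
                weights = fit_hca(train, lam)
                scores[lam].append(estimate_h2(weights, test))
    # Each lambda now has folds * repeats scores; average them.
    averages = {lam: mean(vals) for lam, vals in scores.items()}
    best = max(averages, key=averages.get)
    return best, averages
```

Under these assumptions, the final step in the text would correspond to one more `fit_hca(subjects, best)` call on the full dataset, scored on that same full dataset.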
+When cross-validation is used, the data gleaned from each CV repeat will be saved in a CSV file. This file will include the `h^2` scores for each run in that repeat. You may find this information helpful when troubleshooting.
-# Special Note on Heritability Estimation
+Note that some datasets may be particularly sensitive to removing subjects from training data. Cross-validation will remove subjects from training data for the sake of creating a test set. If you find that results are unstable with cross-validation, consider running the program without this functionality.
-In the event that the variance-covariance matrix is non-invertible during heritability estimation, small values will be added to the matrix diagonals. This will generally resolve the invertibility error, but may adversely affect the results. A warning will be outputted in the event that the add-to-diagonals approach is used.
+# Example Commands
-# Documentation
+Examples for each `program_command` are included below:
-Further documentation is available in the `docs` folder.
+`hca`: `./hca-dev-new --program_command hca --kinship_src pregiven --given_kinship_file ../example_data/kinship/data.grm --given_kinship_id_file ../example_data/kinship/id.grm --phen_file ../example_data/phenotypic_data/data.csv --c_cov_file ../example_data/categorial_covariates/data.csv --hca_lambda_values_file lambda.txt`
-# Dependencies
+`h2r`: `./hca-dev-new --program_command h2r --kinship_src pregiven --given_kinship_file ../example_data/kinship/data.grm --given_kinship_id_file ../example_data/kinship/id.grm --phen_file ../example_data/phenotypic_data/data.csv --c_cov_file ../example_data/categorial_covariates/data.csv`
-The Linux binary should work automatically on most Linux distributions. If not, compile it for your architecture.
+`kinship`: `./hca-dev-new --program_command kinship --geno_fam_file ../example_data/snp_data/fam.fam --geno_bim_file ../example_data/snp_data/bim.bim --geno_bed_file ../example_data/snp_data/bed.bed --snps_in_memory 10000 --num_threads 40`
-The OSX binary requires GCC version 6. Install it by running `brew install gcc6 --without-multilib` on your machine.
+`score`: `./hca-dev-new --program_command score --trait_file ../example_data/weights_final.csv --phen_file ../example_data/phenotypic_data/data.csv`
 # Compiling
-To compile on most Linux distributions and OSX, follow these steps:
-
-1. Install the Armadillo matrix library (download [here](http://arma.sourceforge.net/download.html) and run `cmake . && make && sudo make install`)
-
-2. Install the NLOpt optimization library (download [here](http://ab-initio.mit.edu/nlopt/) and run `./configure && make && sudo make install`)
-
-3. Install the OpenBLAS library [from source](https://github.com/xianyi/OpenBLAS/wiki/Installation-Guide). Make sure to specify `DYNAMIC_ARCH=1` when running `make` and `make install`, if you plan on using the binary across multiple architectures.
-
-4. If on Linux, run (i.e. `sudo apt-get install liblapack-dev`). If on Mac, run `brew install gcc6 --without-multilib` and `brew install lapack`.
-
-6. Run `make --file Makefile_osx` or `make --file Makefile_linux`, depending on your platform.
+To compile on most Linux distributions, follow these steps:
+
+1. Install g++-5: `apt-get install g++-5`
+
+2. Install the Armadillo matrix library (download [here](http://arma.sourceforge.net/download.html) and run `cmake . && make && sudo make install`)
+
+3. Install the NLOpt optimization library (download [here](http://ab-initio.mit.edu/nlopt/) and run `./configure && make && sudo make install`)
+
+4. Install the OpenBLAS library [from source](https://github.com/xianyi/OpenBLAS/wiki/Installation-Guide). Make sure to specify `DYNAMIC_ARCH=1` when running `make` and `make install`, if you plan on using the binary across multiple architectures.
+
+5. Install Boost: `sudo apt-get install libboost-all-dev`.
+
+6. Install lapack: `sudo apt-get install liblapack-dev`.
+
+7. Run `make --file Makefile_linux` in the `src` directory.
+
+# Options
+
+| Name | Relevant Tools | Required? | Allowed Values | Default | Description |
+|---|---|---|---|---|---|
+| `program_command` | `hca`, `h2r`, `kinship`, `score` | Yes | `hca`, `h2r`, `kinship`, `score` | | Program command. |
+| `num_threads` | `hca`, `h2r`, `kinship`, `score` | No | Integer | 1 | Number of threads to use. |
+| `output_file_prefix` | `hca`, `h2r`, `kinship`, `score` | No | String (directory) | | Prefix for data output. Will be interpreted as directory. Default: current directory. |
+| `log_level` | `hca`, `h2r`, `kinship`, `score` | No | `info`, `warning`, `debug`, `error` | `info` | Log level. |
+| `snp_missing_handler` | `hca`, `h2r`, `kinship` | No | `deleteMissing`, `fillMean` | `deleteMissing` | How to handle missing data when creating a GRM from SNP data. |
+| `maf_cutoff` | `hca`, `h2r`, `kinship` | No | Double | | MAF cutoff. Include if you want to exclude SNPs based on MAF. |
+| `hwe_cutoff` | `hca`, `h2r`, `kinship` | No | Double | | HWE cutoff. Include if you want to exclude SNPs based on HWE. |
+| `snps_in_memory` | `hca`, `h2r`, `kinship` | No | Integer | 1000 | Number of SNPs to keep in memory at a time. |
+| `all_snps_in_memory` | `hca`, `h2r`, `kinship` | No | Boolean (`true`, `false`) | `false` | Whether or not to perform GRM generation with all data in memory. This is significantly faster than the iterative approach, but may not be feasible for large datasets. |
+| `kinship_src` | `hca`, `h2r` | Yes (for `hca`, `h2r`) | `pregiven`, `snp` | | Where to get kinship from. If `pregiven`, must provide kinship files. If `snp`, must provide geno data. |
+| `indv_whitelist_file` | `hca`, `h2r`, `kinship` | No | String (file) | | Whitelist individuals for analysis. NOTE: this option has not yet been implemented and currently has no effect. If you need to remove any individuals from analysis, simply remove them from at least one file in the input dataset. |
+| `indv_blacklist_file` | `hca`, `h2r`, `kinship` | No | String (file) | | Blacklist individuals for analysis. NOTE: this option has not yet been implemented and currently has no effect. If you need to remove any individuals from analysis, simply remove them from at least one file in the input dataset. |
+| `c_cov_file` | `hca`, `h2r` | No | String (file) | | Categorical covariate data. |
+| `c_cov_spec_file` | `hca`, `h2r` | Yes, if a whitelist or blacklist is specified for categorical covariates. | String (file) | | Categorical covariate spec, one feature name for file. |
+| `c_cov_whitelist_file` | `hca`, `h2r` | No | String (file) | | Whitelist covariate features. |
+| `c_cov_blacklist_file` | `hca`, `h2r` | No | String (file) | | Blacklist covariate features. |
+| `q_cov_file` | `hca`, `h2r` | No | String (file) | | Quantitative covariate data. |
+| `q_cov_spec_file` | `hca`, `h2r` | Yes, if a whitelist or blacklist is specified for quantitative covariates. | String (file) | | Quantitative covariate spec, one feature name for file. |
+| `q_cov_whitelist_file` | `hca`, `h2r` | No | String (file) | | Whitelist covariate features. |
+| `q_cov_blacklist_file` | `hca`, `h2r` | No | String (file) | | Blacklist covariate features. |
+| `phen_file` | `hca`, `h2r` | Yes (for `hca`, `h2r`) | String (file) | | Phenotypic data. Required. |
+| `phen_spec_file` | `hca`, `h2r` | Yes, if a whitelist or blacklist is specified for phenotypic data. | String (file) | | Phenotypic data spec, one feature name for file. |
+| `phen_whitelist_file` | `hca`, `h2r` | No | String (file) | | Whitelist phenotypic features. |
+| `phen_blacklist_file` | `hca`, `h2r` | No | String (file) | | Blacklist phenotypic features. |
+| `given_kinship_file` | `hca`, `h2r` | Required if using pregiven kinship data (for `hca`, `h2r`). You must use either pregiven or SNP-generated kinship data. | String (file) | | Given kinship data path (data file). |
+| `given_kinship_id_file` | `hca`, `h2r` | Required if using pregiven kinship data (for `hca`, `h2r`). You must use either pregiven or SNP-generated kinship data. | String (file) | | Given kinship data path (ID file). |
+| `geno_fam_file` | `hca`, `h2r`, `kinship` | Required if using SNP-generated kinship data (for `hca`, `h2r`). You must use either pregiven or SNP-generated kinship data. | String (file) | | Genotypic data file (FAM file). |
+| `geno_bim_file` | `hca`, `h2r`, `kinship` | Required if using SNP-generated kinship data (for `hca`, `h2r`). You must use either pregiven or SNP-generated kinship data. | String (file) | | Genotypic data file (BIM file). |
+| `geno_bed_file` | `hca`, `h2r`, `kinship` | Required if using SNP-generated kinship data (for `hca`, `h2r`). You must use either pregiven or SNP-generated kinship data. | String (file) | | Genotypic data file (BED file). |
+| `geno_whitelist_file` | `hca`, `h2r`, `kinship` | No | String (file) | | Whitelist SNPs. |
+| `geno_blacklist_file` | `hca`, `h2r`, `kinship` | No | String (file) | | Blacklist SNPs. |
+| `trait_file` | `score` | Yes (for `score`). | String (file) | | Trait to score phenotypic data with. |
+| `coef_file` | `hca` | No. | String (file) | | Coefficient constraint file for HCA. Include if you want to constrain the weight coefficients. |
+| `hca_lambda_values_file` | `hca` | Required if populating HCA lambda values from a file (for `hca`). Exactly one lambda strategy (file, auto, interval) must be used. | String (file) | | File with lambda values for HCA, one per line. Each value must be a double. |
+| `hca_max_iterations` | `hca` | No. | Integer | 1000 | Max iterations for HCA. |
+| `h2r_max_iterations` | `h2r` | No. | Integer | 1000 | Max iterations for H2R. |
+| `hca_cv_folds` | `hca` | No. | Integer | 1 | Number of folds to use for HCA cross-validation. If 1, no CV will be performed. |
+| `hca_cv_repeats` | `hca` | No. | Integer | 1 | Number of repeats to use for HCA cross-validation. |
+| `auto_lambda` | `hca` | Required if populating HCA lambda values automatically (for `hca`). Exactly one lambda strategy (file, auto, interval) must be used. | `true`, `false` | `false` | Whether to use auto lambda discovery. |
+| `hca_lambda_start` | `hca` | Required if populating HCA lambda values from an interval (for `hca`). Exactly one lambda strategy (file, auto, interval) must be used. | Double | | Start value for lambda search. |
+| `hca_lambda_end` | `hca` | Required if populating HCA lambda values from an interval (for `hca`). Exactly one lambda strategy (file, auto, interval) must be used. | Double | | End value for lambda search. |
+| `hca_lambda_incr` | `hca` | Required if populating HCA lambda values from an interval (for `hca`). Exactly one lambda strategy (file, auto, interval) must be used. | Double | | Interval value for lambda search. |
+| `hca_gr_cutoff` | `hca` | No. | Double | | GR cutoff for HCA. NOTE: this option has not yet been implemented and currently has no effect. If we decide to implement it, it should have similar behavior to the `grm-cutoff` option in the GCTA software. |
 # References
@@ -107,3 +161,4 @@ Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: A Tool for Genome-wide Complex Tr
 Yang J, Benyamin B, McEvoy BP, et al. Common SNPs explain a large proportion of heritability for human height. Nature genetics. 2010;42(7):565-569. doi:10.1038/ng.608.
 ```
+
diff --git a/bin/hca-linux b/bin/hca-linux
deleted file mode 100755
index a809387..0000000
Binary files a/bin/hca-linux and /dev/null differ
diff --git a/bin/hca-osx b/bin/hca-osx
deleted file mode 100755
index 0387985..0000000
Binary files a/bin/hca-osx and /dev/null differ
diff --git a/docs/html/_data_8cpp.html b/docs/html/_data_8cpp.html
deleted file mode 100644
index d769b3a..0000000
--- a/docs/html/_data_8cpp.html
+++ /dev/null
@@ -1,120 +0,0 @@
-
-
-
- - - - -
- HCA
-
- |
-
#include "Data.h"
#include <unistd.h>
#include <sys/stat.h>
#include <bitset>
#include "SnpKinship.h"
#include "GivenKinship.h"
#include <iostream>
#include <armadillo>
#include <map>
#include <set>
#include <iterator>
#include <bitset>
#include "IndividualDataSet.h"
#include "Option.h"
#include "Scorer.h"
Go to the source code of this file.
--Classes | |
class | Data |
A class to load all relevant data. Responsible solely for loading data - not for reconciling missing individuals. More... | |
#include "GivenKinship.h"
Go to the source code of this file.
--Classes | |
class | GivenKinship |
A class to load a pregiven kinship matrix. More... | |
#include "IndividualData.h"
#include <iostream>
#include <armadillo>
#include <map>
Go to the source code of this file.
--Classes | |
class | IndividualData |
A class to represent data associated with a given individual. More... | |
#include "IndividualDataSet.h"
#include "GivenKinship.h"
#include <armadillo>
#include <list>
#include "IndividualData.h"
#include "Option.h"
#include "SnpKinship.h"
Go to the source code of this file.
--Classes | |
class | IndividualDataSet |
A class to keep track of and reconcile IndividualData objects. More... | |
#include <stdint.h>
#include <iostream>
#include <armadillo>
#include <map>
#include <set>
Go to the source code of this file.
--Classes | |
class | Kinship |
A class used to load or generate kinship data. This class must not be used directly; only subclasses should be used. More... | |
#include <iostream>
#include <stdlib.h>
#include <stdint.h>
Go to the source code of this file.
--Classes | |
class | Option |
A class to load all user options. More... | |
#include "RemlH2rEst.h"
Go to the source code of this file.
--Classes | |
class | RemlH2rEst |
A class to perform heritability analysis on a given dataset. More... | |
#include "RemlHca.h"
#include "Option.h"
#include "Data.h"
#include "RemlH2rEst.h"
#include <nlopt.hpp>
#include <armadillo>
Go to the source code of this file.
--Classes | |
class | RemlHca |
A class to perform heritable component analysis on a given dataset. More... | |
#include "Scorer.h"
#include "Option.h"
#include <list>
#include <armadillo>
#include "IndividualData.h"
#include "IndividualDataSet.h"
Go to the source code of this file.
--Classes | |
class | Scorer |
A class to score users based on their phenotypes and a generated weight. More... | |
#include "SnpKinship.h"
Go to the source code of this file.
--Classes | |
class | SnpKinship |
A class to generate a kinship matrix from SNP data. More... | |
CData | A class to load all relevant data. Responsible solely for loading data - not for reconciling missing individuals |
CGivenKinship | A class to load a pregiven kinship matrix |
CH2rEst | A class used to perform heritability analysis |
CHca | A class used to perform heritable component analysis |
CIndividualData | A class to represent data associated with a given individual |
CIndividualDataSet | A class to keep track of and reconcile IndividualData objects |
CKinship | A class used to load or generate kinship data. This class must not be used directly; only subclasses should be used |
COption | A class to load all user options |
CRemlH2rEst | A class to perform heritability analysis on a given dataset |
CRemlHca | A class to perform heritable component analysis on a given dataset |
CScorer | A class to score users based on their phenotypes and a generated weight |
CSnpKinship | A class to generate a kinship matrix from SNP data |
CUtil |
This is the complete list of members for Data, including all inherited members.
-Data(Option &option) | Data | |
individualDataSet (defined in Data) | Data | |
lambdaVec (defined in Data) | Data | |
load(Option &option) | Data | |
scorers (defined in Data) | Data | |
trait (defined in Data) | Data | |
writeKinship(std::string outDir) | Data |
A class to load all relevant data. Responsible solely for loading data - not for reconciling missing individuals. - More...
- -#include <Data.h>
-Public Member Functions | |
- | Data (Option &option) |
A constructor. | |
void | load (Option &option) |
Loads data based on the provided option argument. More... | |
-void | writeKinship (std::string outDir) |
Writes the Kinship file to disk. | |
-Public Attributes | |
-IndividualDataSet | individualDataSet |
-arma::mat | trait |
-std::vector< double > | lambdaVec |
-std::vector< Scorer > | scorers |
A class to load all relevant data. Responsible solely for loading data - not for reconciling missing individuals.
-Loads data into the userDataSet, trait, lambdaVec, and scorers fields.
-void Data::load | -( | -Option & | -option | ) | -- |
Loads data based on the provided option argument.
-Loads the following individual-specific data:
Loads the following non-individual-specific data:
This is the complete list of members for GivenKinship, including all inherited members.
-construct(std::string kinshipfile, std::string kinshipidfile) | GivenKinship | |
getGrm() | Kinship | |
getIdVec() | Kinship | |
idVec (defined in Kinship) | Kinship | protected |
Kinship() | Kinship | |
Kinship(arma::mat grm, std::map< std::string, int > ind, int nIndv, std::vector< std::string > idVec) | Kinship | |
m_grm (defined in Kinship) | Kinship | protected |
m_nIndv (defined in Kinship) | Kinship | protected |
A class to load a pregiven kinship matrix. - More...
- -#include <GivenKinship.h>
-Public Member Functions | |
-void | construct (std::string kinshipfile, std::string kinshipidfile) |
Constructor. Loads and parses the kinship file into a matrix. | |
Public Member Functions inherited from Kinship | |
- | Kinship () |
Constructor. | |
- | Kinship (arma::mat grm, std::map< std::string, int > ind, int nIndv, std::vector< std::string > idVec) |
Constructor. | |
-arma::mat & | getGrm () |
Returns the generated or parsed Genetic Relationship Matrix (GRM). | |
std::vector< std::string > & | getIdVec () |
Returns the generated or parsed individual id vector. More... | |
-Additional Inherited Members | |
Protected Attributes inherited from Kinship | |
-arma::mat | m_grm |
-uint32_t | m_nIndv |
-std::vector< std::string > | idVec |
A class to load a pregiven kinship matrix.
-Used if ksSrc is pregiven.
A class used to perform heritability analysis. - More...
- -#include <H2rEst.h>
A class used to perform heritability analysis.
A class used to perform heritable component analysis. - More...
- -#include <Hca.h>
A class used to perform heritable component analysis.
This is the complete list of members for IndividualData, including all inherited members.
-addCCovData(std::vector< double > &cCovNew) | IndividualData | |
addPedData(std::vector< double > &pedNew) | IndividualData | |
addPhenData(std::vector< double > &phenNew) | IndividualData | |
addQCovData(std::vector< double > &qCovNew) | IndividualData | |
getCCov() | IndividualData | |
getNewGrmId() const | IndividualData | |
getNumMissingGenoValues() | IndividualData | |
getPed() | IndividualData | |
getPedId() const | IndividualData | |
getPhen() | IndividualData | |
getPregivenGrmId() const | IndividualData | |
getQCov() | IndividualData | |
getStrId() const | IndividualData | |
resetPhenData() | IndividualData | |
setNewGrmId(int idNew) | IndividualData | |
setNumMissingGenoValues(int numMissingGenoValuesNew) | IndividualData | |
setPedId(int idNew) | IndividualData | |
setPregivenGrmId(int idNew) | IndividualData | |
setStrId(std::string strIdNew) | IndividualData |
A class to represent data associated with a given individual. - More...
- -#include <IndividualData.h>
-Public Member Functions | |
-std::vector< double > & | getPed () |
Returns PED data for this individual. | |
-std::vector< double > & | getPhen () |
Returns phenotypic data for this individual. | |
-std::vector< double > & | getQCov () |
Returns quantitative covariate data for this individual. | |
-std::vector< double > & | getCCov () |
Returns categorical covariate data for this individual. | |
-void | setStrId (std::string strIdNew) |
Sets the string ID for this individual (i.e. the ID provided in input files). | |
-std::string | getStrId () const |
Returns the string ID for this individual. | |
-void | setPregivenGrmId (int idNew) |
Sets this individual's position in the pregiven GRM. | |
-int | getPregivenGrmId () const |
Returns this individual's position in the pregiven GRM. | |
-void | setNewGrmId (int idNew) |
Sets this individual's position in the final GRM. | |
-int | getNewGrmId () const |
Returns this individual's position in the final GRM. | |
-void | setPedId (int idNew) |
Sets this individual's position in the genotypic data matrix. | |
-int | getPedId () const |
Returns this individual's position in the genotypic data matrix. | |
-void | setNumMissingGenoValues (int numMissingGenoValuesNew) |
Sets the number of missing genotypic values for this individual. | |
-int | getNumMissingGenoValues () |
Returns the number of missing genotypic values for this individual. | |
-void | addPedData (std::vector< double > &pedNew) |
Adds data from the PED file to this individual. | |
-void | resetPhenData () |
Deletes all phenotypic data on this individual. | |
-void | addPhenData (std::vector< double > &phenNew) |
Adds phenotypic data to this individual. | |
-void | addQCovData (std::vector< double > &qCovNew) |
Adds quantitative covariate data to this individual. | |
-void | addCCovData (std::vector< double > &cCovNew) |
Adds categorical covariate data to this individual. | |
A class to represent data associated with a given individual.
-IndividualData objects are generally kept track of with an IndividualDataSet.
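As a rough illustration of the per-individual record this class describes, the following Python sketch reads the headerless CSV layout given in the README (`family_id,individual_id,feature_1,...`) into one feature vector per individual. It is a sketch under stated assumptions rather than the program's loader: the function name `load_feature_csv` is hypothetical, keying records by the `(family_id, individual_id)` pair is an assumption, and every feature is assumed to be numeric with no missing-value markers.

```python
import csv
import io


def load_feature_csv(handle):
    """Map (family_id, individual_id) -> list of float features.

    Assumes headerless rows of family_id,individual_id,feature_1..feature_n
    and purely numeric feature columns.
    """
    records = {}
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        family_id, individual_id, *features = row
        records[(family_id, individual_id)] = [float(x) for x in features]
    return records


# Usage with an in-memory file; a real call would pass an open file object.
example = io.StringIO("F1,I1,1.5,2\nF1,I2,0,3.25\n")
phen = load_feature_csv(example)
```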
This is the complete list of members for IndividualDataSet, including all inherited members.
-
- HCA
-
- |
-
A class to keep track of and reconcile IndividualData objects. - More...
- -#include <IndividualDataSet.h>
-Public Member Functions | |
-void | addingPedData () |
Indicate that PED data will be added to this IndividualDataSet. | |
-void | addPedData (std::string id, int pedId, std::vector< double > &pedData) |
Add PED data to this IndividualDataSet, for the provided individual ID. | |
void | addingPhenData () |
Indicate that phenotypic data will be added to this IndividualDataSet. More... | |
-void | addPhenData (std::string id, std::vector< double > &phenData) |
Add phenotypic data to this IndividualDataSet, for the provided individual ID. | |
-void | addingQCovData () |
Indicate that quantitative covariate data will be added to this IndividualDataSet. | |
-void | addQCovData (std::string id, std::vector< double > &qCovData) |
Add quantitative covariate data to this IndividualDataSet, for the provided individual ID. | |
-void | addingCCovData () |
Indicate that categorical covariate data will be added to this IndividualDataSet. | |
-void | addCCovData (std::string id, std::vector< double > &cCovData) |
Add categorical covariate data to this IndividualDataSet, for the provided individual ID. | |
-void | addingChrData () |
Indicate that chromosome data will be added to this IndividualDataSet. | |
-void | setChrData (std::vector< int > &chrNew) |
Add chromosome data to this IndividualDataSet. | |
-void | settingGenoData () |
Indicate that genotypic data will be added to this IndividualDataSet. | |
-void | setGenoData (arma::mat &genoNew, std::vector< int > numMissingValues) |
Add genotypic data to this IndividualDataSet. | |
-void | setOption (Option &optionNew) |
Sets the Option object on this IndividualDataSet. | |
void | reconcile () |
Reconciles data by removing individuals with missing data. More... | |
-int | getNumSubjects () |
Returns the number of individuals currently being kept track of. | |
-bool | getCovAdded () |
Returns true if covariate data has been added. | |
-arma::mat & | getGeno () |
Returns the generated genotypic data matrix. | |
-std::vector< int > & | getChr () |
Returns the provided chromosome data. | |
-arma::mat & | getAN () |
Returns the AN matrix (only available if the matrix was generated from SNP data). | |
-arma::mat & | getGrm (int idx) |
Returns the generated genetic relationship matrix. | |
-arma::mat & | getCov (int idx) |
Returns the covariate data associated with the given index (based on split number). | |
-arma::mat & | getPhen (int idx) |
Returns the phenotypic data associated with the given index (based on split number). | |
-arma::mat & | getGrmWithoutSplit (int idx) |
Returns the GRM data with all individuals EXCEPT for those in the provided split number. | |
-arma::mat & | getCovWithoutSplit (int idx) |
Returns the covariates with all individuals EXCEPT for those in the provided split number. | |
-arma::mat & | getPhenWithoutSplit (int idx) |
Returns the phenotypic data with all individuals EXCEPT for those in the provided split number. | |
-arma::mat & | getPed () |
Returns the PED data. | |
-std::vector< std::vector< std::string > > & | getSplitPartIds () |
Returns the IDs associated with each "split". | |
-std::list< std::reference_wrapper< IndividualData > > & | getIndividualList () |
Returns the IndividualData objects in a list data structure. | |
-std::map< std::string, IndividualData > & | getIndividualMap () |
Returns the IndividualData objects in a map data structure. | |
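The split-based getters above (`getPhen(idx)` versus `getPhenWithoutSplit(idx)`) suggest a held-out/training partition for cross-validation. The following is a minimal sketch of that idea, not the library's implementation: the `Rows` alias, the `splitOf` assignment vector, and both function names are hypothetical, and plain `std::vector` stands in for `arma::mat`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for arma::mat: one inner vector per individual.
using Rows = std::vector<std::vector<double>>;

// Returns only the rows whose split assignment equals idx (the held-out part).
Rows rowsInSplit(const Rows& rows, const std::vector<int>& splitOf, int idx) {
    assert(rows.size() == splitOf.size());  // one split label per individual
    Rows out;
    for (std::size_t i = 0; i < rows.size(); ++i)
        if (splitOf[i] == idx) out.push_back(rows[i]);
    return out;
}

// Returns every row EXCEPT those assigned to split idx (the training part).
Rows rowsWithoutSplit(const Rows& rows, const std::vector<int>& splitOf, int idx) {
    assert(rows.size() == splitOf.size());
    Rows out;
    for (std::size_t i = 0; i < rows.size(); ++i)
        if (splitOf[i] != idx) out.push_back(rows[i]);
    return out;
}
```

For a GRM, which is indexed by individual on both axes, the same selection would presumably have to be applied to rows and columns alike; the sketch only covers the row-wise case of phenotype and covariate matrices.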
A class to keep track of and reconcile IndividualData objects.
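As a rough illustration of what reconciling could look like, the sketch below drops every individual that lacks a required data source. It is an assumption-laden simplification: `Record`, its flags, and `reconcileRecords` are hypothetical names, and the real class also builds the final matrices, which is not shown here.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for IndividualData.
struct Record {
    std::vector<double> phen;  // phenotypic values
    std::vector<double> qCov;  // quantitative covariates
    bool hasPhen = false;      // set when a phenotype row was loaded for this ID
    bool hasQCov = false;      // set when a covariate row was loaded for this ID
};

// Erases every individual missing a required data source.
// Returns the number of individuals removed.
std::size_t reconcileRecords(std::map<std::string, Record>& records,
                             bool covRequired) {
    std::size_t removed = 0;
    for (auto it = records.begin(); it != records.end();) {
        const Record& r = it->second;
        const bool complete = r.hasPhen && (!covRequired || r.hasQCov);
        if (complete) {
            ++it;
        } else {
            it = records.erase(it);  // erase returns the next valid iterator
            ++removed;
        }
    }
    return removed;
}
```

Keying the records by string ID means an individual present in one input file but absent from another simply ends up with an unset flag and is filtered out in a single pass.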
-void IndividualDataSet::addingPhenData | -( | -) | -- |
Indicate that phenotypic data will be added to this IndividualDataSet.
-Wipes any existing phenotypic data.
- -void IndividualDataSet::reconcile | -( | -) | -- |
Reconciles data by removing individuals with missing data.
-Once data is reconciled, also generates final matrices with all individuals that are still included.
-Generates several matrix formats, including:
- HCA
-
- |
-
This is the complete list of members for Kinship, including all inherited members.
-getGrm() | Kinship | |
getIdVec() | Kinship | |
idVec (defined in Kinship) | Kinship | protected |
Kinship() | Kinship | |
Kinship(arma::mat grm, std::map< std::string, int > ind, int nIndv, std::vector< std::string > idVec) | Kinship | |
m_grm (defined in Kinship) | Kinship | protected |
m_nIndv (defined in Kinship) | Kinship | protected |
- HCA
-
- |
-
A class used to load or generate kinship data. This class must not be used directly; only subclasses should be used. - More...
- -#include <Kinship.h>
-Public Member Functions | |
- | Kinship () |
Constructor. | |
- | Kinship (arma::mat grm, std::map< std::string, int > ind, int nIndv, std::vector< std::string > idVec) |
Constructor. | |
-arma::mat & | getGrm () |
Returns the generated or parsed Genetic Relationship Matrix (GRM). | |
std::vector< std::string > & | getIdVec () |
Returns the generated or parsed individual id vector. More... | |
-Protected Attributes | |
-arma::mat | m_grm |
-uint32_t | m_nIndv |
-std::vector< std::string > | idVec |
A class used to load or generate kinship data. This class must not be used directly; only subclasses should be used.
-std::vector< std::string > & Kinship::getIdVec | -( | -) | -- |
Returns the generated or parsed individual id vector.
-Maintains insertion order. The first individual in idVec represents the first individual in the GRM.
- -
- HCA
-
- |
-
This is the complete list of members for Option, including all inherited members.
-getBFilePrefix() const | Option | |
getCCovFile() const | Option | |
getCmmd() const | Option | |
getKinshipFile() const | Option | |
getKinshipIDFile() const | Option | |
getKinshipSrc() const | Option | |
getLambdaVecFile() const | Option | |
getMafCutoff() const | Option | |
getMaxIterH2r() const | Option | |
getMaxIterHca() const | Option | |
getMissingGenoCutoff() const | Option | |
getNumSplits() const | Option | |
getNumThreads() const | Option | |
getOutDir() const | Option | |
getPhenFile() const | Option | |
getQCovFile() const | Option | |
getTraitFile() const | Option | |
parse(int argc, char **argv) | Option |
- HCA
-
- |
-
A class to load all user options. - More...
- -#include <Option.h>
-Public Member Functions | |
-void | parse (int argc, char **argv) |
Parses all CLI options. | |
-uint16_t | getCmmd () const |
Returns the command the user specified; i.e., the analysis to run. | |
-std::string | getBFilePrefix () const |
Returns the prefix for the binary data filenames. | |
-std::string | getQCovFile () const |
Returns the quantitative covariate filename. | |
-std::string | getCCovFile () const |
Returns the categorical covariate filename. | |
-std::string | getKinshipFile () const |
Returns the kinship filename. | |
-std::string | getKinshipIDFile () const |
Returns the kinship ID filename. | |
-std::string | getPhenFile () const |
Returns the phenotype filename. | |
-std::string | getTraitFile () const |
Returns the trait filename. | |
-int | getKinshipSrc () const |
Returns the kinship source to use. | |
-double | getMafCutoff () const |
Returns the Minor Allele Frequency cutoff for GRM generation. | |
-double | getMissingGenoCutoff () const |
Returns the missing genotypic data cutoff for GRM generation. | |
-std::string | getOutDir () const |
Returns the directory for output files. | |
-int | getNumThreads () const |
Returns the number of threads to use. | |
-std::string | getLambdaVecFile () const |
Returns the lambda vector filename. | |
-int | getNumSplits () const |
Returns the number of splits to use during the cross-validation and lambda-tuning process. | |
-int | getMaxIterHca () const |
Returns the maximum number of iterations for HCA analysis (default: 200). | |
-int | getMaxIterH2r () const |
Returns the maximum number of iterations for heritability analysis (default: 200). | |
A class to load all user options.
-Note that this class does not load data: only options. The Data class is responsible for actually loading data.
-
- HCA
-
- |
-
This is the complete list of members for RemlH2rEst, including all inherited members.
-calcH2r() | RemlH2rEst | |
getFinalStats() | RemlH2rEst | |
RemlH2rEst(Option &newOption, arma::mat &newPhen, arma::mat &newGrm, arma::mat &newCov) | RemlH2rEst | |
saveOutput() | RemlH2rEst |
- HCA
-
- |
-
A class to perform heritability analysis on a given dataset. - More...
- -#include <RemlH2rEst.h>
-Public Member Functions | |
RemlH2rEst (Option &newOption, arma::mat &newPhen, arma::mat &newGrm, arma::mat &newCov) | |
Constructor. More... | |
-void | calcH2r () |
Runs heritability analysis on the data. | |
-void | saveOutput () |
Saves output to a file. | |
-std::vector< std::vector< double > > & | getFinalStats () |
Returns the final stats from the REML analysis. | |
A class to perform heritability analysis on a given dataset.
-Uses the REML algorithm.
-RemlH2rEst::RemlH2rEst | -( | -Option & | -newOption, | -
- | - | arma::mat & | -newPhen, | -
- | - | arma::mat & | -newGrm, | -
- | - | arma::mat & | -newCov | -
- | ) | -- |
Constructor.
-Raises an error if the provided phenotypic data has more than one column.
- -
- HCA
-
- |
-
This is the complete list of members for RemlHca, including all inherited members.
-constraintFunction(const std::vector< double > &x, std::vector< double > &grad, void *data) | RemlHca | static |
getBestLambdaVal() | RemlHca | |
getBestTrainedW() | RemlHca | |
objectiveFunction(const std::vector< double > &x, std::vector< double > &grad, void *my_func_data) | RemlHca | static |
RemlHca(Option &newOption, Data &newData) | RemlHca | |
saveOutput() | RemlHca | |
train() | RemlHca |
- HCA
-
- |
-
A class to perform heritable component analysis on a given dataset. - More...
- -#include <RemlHca.h>
-Public Member Functions | |
- | RemlHca (Option &newOption, Data &newData) |
Constructor. | |
-void | train () |
Runs the REML algorithm to obtain highly-heritable traits. | |
-void | saveOutput () |
Saves output to a file. | |
-double | getBestLambdaVal () |
Return the best lambda value, found by the train function. | |
-arma::mat & | getBestTrainedW () |
Returns the weights generated by the train function. | |
-Static Public Member Functions | |
-static double | objectiveFunction (const std::vector< double > &x, std::vector< double > &grad, void *my_func_data) |
Objective function for the HCA minimization process. | |
static double | constraintFunction (const std::vector< double > &x, std::vector< double > &grad, void *data) |
Constraint function for the HCA minimization process. More... | |
A class to perform heritable component analysis on a given dataset.
-Uses the REML algorithm.
-
-
|
- -static | -
Constraint function for the HCA minimization process.
-Constrained to be equal to zero.
- -
- HCA
-
- |
-
A class to score users based on their phenotypes and a generated weight. - More...
- -#include <Scorer.h>
-Public Member Functions | |
Scorer (Option &newOption, arma::mat &phen, arma::mat &trait) | |
Constructor. More... | |
-arma::mat & | getScore () |
Returns the generated scores. | |
-void | saveOutput (IndividualDataSet &individualDataSet) |
Saves output to a file. | |
A class to score users based on their phenotypes and a generated weight.
-Weights are typically generated via the HCA process.
-Scorer::Scorer | -( | -Option & | -newOption, | -
- | - | arma::mat & | -phen, | -
- | - | arma::mat & | -trait | -
- | ) | -- |
Constructor.
-Also generates scores inline.
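The documentation does not spell out the scoring formula. Assuming a score is the weighted sum of an individual's phenotypic features (i.e. the phenotype matrix multiplied by the weight vector), a minimal sketch could look as follows; `scoreIndividuals` is a hypothetical name and plain vectors stand in for `arma::mat`.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Assumption: score_i = sum_j phen[i][j] * weights[j].
// phen holds one row per individual; weights holds one entry per feature.
std::vector<double> scoreIndividuals(
        const std::vector<std::vector<double>>& phen,
        const std::vector<double>& weights) {
    std::vector<double> scores;
    scores.reserve(phen.size());
    for (const auto& row : phen) {
        if (row.size() != weights.size())
            throw std::invalid_argument("feature count does not match weight count");
        double s = 0.0;
        for (std::size_t j = 0; j < row.size(); ++j) s += row[j] * weights[j];
        scores.push_back(s);
    }
    return scores;
}
```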
- -
- HCA
-
- |
-
This is the complete list of members for SnpKinship, including all inherited members.
-construct(Option &option, arma::mat &ped, std::vector< int > &chr, arma::mat &newGeno) | SnpKinship | |
getAN() | SnpKinship | |
getGrm() | Kinship | |
getIdVec() | Kinship | |
idVec (defined in Kinship) | Kinship | protected |
Kinship() | Kinship | |
Kinship(arma::mat grm, std::map< std::string, int > ind, int nIndv, std::vector< std::string > idVec) | Kinship | |
m_grm (defined in Kinship) | Kinship | protected |
m_nIndv (defined in Kinship) | Kinship | protected |
- HCA
-
- |
-
A class to generate a kinship matrix from SNP data. - More...
- -#include <SnpKinship.h>
-Public Member Functions | |
void | construct (Option &option, arma::mat &ped, std::vector< int > &chr, arma::mat &newGeno) |
Constructor. More... | |
-arma::mat & | getAN () |
Returns AN matrix, which represents the number of SNPs that were used to calculate the GRM (on a per-individual basis). | |
Public Member Functions inherited from Kinship | |
- | Kinship () |
Constructor. | |
- | Kinship (arma::mat grm, std::map< std::string, int > ind, int nIndv, std::vector< std::string > idVec) |
Constructor. | |
-arma::mat & | getGrm () |
Returns the generated or parsed Genetic Relationship Matrix (GRM). | |
std::vector< std::string > & | getIdVec () |
Returns the generated or parsed individual id vector. More... | |
-Additional Inherited Members | |
Protected Attributes inherited from Kinship | |
-arma::mat | m_grm |
-uint32_t | m_nIndv |
-std::vector< std::string > | idVec |
A class to generate a kinship matrix from SNP data.
-void SnpKinship::construct | -( | -Option & | -option, | -
- | - | arma::mat & | -ped, | -
- | - | std::vector< int > & | -chr, | -
- | - | arma::mat & | -newGeno | -
- | ) | -- |
Constructor.
-Also generates GRM inline.
- -
- HCA
-
- |
-
This is the complete list of members for Util, including all inherited members.
-invertMatrix(arma::mat &m, std::string name, bool alreadyAdded=false) | Util | static |
invertMatrixSympd(arma::mat &m, std::string name, bool alreadyAdded=false) | Util | static |
parseToDouble(std::string data) | Util | static |
parseToInt(std::string data) | Util | static |
splitByDelimeter(std::string data, std::string delim) | Util | static |
- HCA
-
- |
-
-Static Public Member Functions | |
-static std::vector< std::string > | splitByDelimeter (std::string data, std::string delim) |
Splits a string by delimiter, and returns a vector with the result. | |
-static double | parseToDouble (std::string data) |
Parses a string to a double, raising an error if the data is invalid. | |
-static int | parseToInt (std::string data) |
Parses a string to an integer, raising an error if the data is invalid. | |
-static arma::mat | invertMatrix (arma::mat &m, std::string name, bool alreadyAdded=false) |
Inverts the matrix via arma::inv, adding values to diagonals should inversion fail. If values are added to diagonals, outputs a warning with the "name" variable. | |
-static arma::mat | invertMatrixSympd (arma::mat &m, std::string name, bool alreadyAdded=false) |
Inverts the matrix via arma::inv_sympd, adding values to diagonals should inversion fail. If values are added to diagonals, outputs a warning with the "name" variable. | |
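The string helpers listed above can be approximated with the standard library alone. This is a sketch under assumptions, not the actual `Util` code: `splitByDelim` and `parseDoubleStrict` are hypothetical names, and rejecting trailing characters is one possible reading of "raising an error if the data is invalid".

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Splits data on every occurrence of delim; empty fields are preserved.
std::vector<std::string> splitByDelim(const std::string& data,
                                      const std::string& delim) {
    if (delim.empty()) throw std::invalid_argument("delimiter must not be empty");
    std::vector<std::string> parts;
    std::size_t start = 0;
    while (true) {
        const std::size_t pos = data.find(delim, start);
        if (pos == std::string::npos) {
            parts.push_back(data.substr(start));  // last (or only) field
            break;
        }
        parts.push_back(data.substr(start, pos - start));
        start = pos + delim.size();
    }
    return parts;
}

// Parses the whole string as a double; throws if any part of it is not numeric.
double parseDoubleStrict(const std::string& data) {
    std::size_t used = 0;
    double value = 0.0;
    try {
        value = std::stod(data, &used);
    } catch (const std::exception&) {
        throw std::invalid_argument("not a number: " + data);
    }
    if (used != data.size())
        throw std::invalid_argument("trailing characters in: " + data);
    return value;
}
```

Splitting one line of a comma-separated input file and parsing its numeric fields then becomes a loop over `splitByDelim(line, ",")`, calling `parseDoubleStrict` on each feature column.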
- HCA
-
- |
-
- HCA
-
- |
-
CData | A class to load all relevant data. Responsible solely for loading data - not for reconciling missing individuals |
▼CH2rEst | A class used to perform heritability analysis |
CRemlH2rEst | A class to perform heritability analysis on a given dataset |
▼CHca | A class used to perform heritable component analysis |
CRemlHca | A class to perform heritable component analysis on a given dataset |
CIndividualData | A class to represent data associated with a given individual |
CIndividualDataSet | A class to keep track of and reconcile IndividualData objects |
▼CKinship | A class used to load or generate kinship data. This class must not be used directly; only subclasses should be used |
CGivenKinship | A class to load a pregiven kinship matrix |
CSnpKinship | A class to generate a kinship matrix from SNP data |
COption | A class to load all user options |
CScorer | A class to score users based on their phenotypes and a generated weight |
CUtil |
- HCA
-
- |
-
This repository represents a pipeline that performs three primary functions:
-The pipeline accepts genotypic and phenotypic data, as well as covariates, and generates a highly heritable trait. It can then estimate the heritability of that trait via the second function above. If the user already has a kinship matrix available (e.g. from GCTA), the program can accept this matrix. Alternatively, it can use genotypic data to generate the kinship matrix.
-Running the program without any options will trigger the help function, which will show all options that are available. The program takes a command (`kinship`, `h2r`, `hca`, or `score`), as well as a set of options for each command.
Generally speaking, one can follow the below guidelines when using this program:
-Additional options are present, but are not required. Review the documentation and help output for more details.
-If “numSplits” is equal to one, the following process will be used for lambda tuning:
-If the “numSplits” option is greater than one, the dataset will be split randomly into “numSplits” splits. The following process will then occur:
-In addition to outputting data to the CLI, if an `outDir` parameter is specified, some data will also be saved. For all analyses, if a GRM was generated (non-pregiven), that GRM will be saved to "kinship.csv". The following analysis-specific data will also be saved:
When HCA is run, the final weights will be saved to "trait_hca.csv". Individuals will be scored with these weights, and the output from the "Scoring" section will be saved. In addition, the output from the "Heritability Analysis" section will be saved for the final weights.
-When heritability analysis is run, statistics regarding the analysis will be saved to "h2r_est.txt".
-When scoring is run, the calculated scores will be saved to "scores.txt".
-When kinship generation is run, the kinship file will be saved to "kinship.txt".
-In the event that the variance-covariance matrix is non-invertible during heritability estimation, small values will be added to the matrix diagonals. This will generally resolve the invertibility error, but may adversely affect the results. A warning will be outputted in the event that the add-to-diagonals approach is used.
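The add-to-diagonals fallback described above can be sketched without Armadillo. Everything below is illustrative and assumed rather than taken from the source: the `Matrix` alias, `tryInvert`, `invertWithRidge`, the pivot tolerance, and the choice to retry exactly once are all hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // square, row-major

// Gauss-Jordan inversion with partial pivoting. Returns false when a pivot is
// numerically zero, i.e. the matrix is treated as singular.
bool tryInvert(Matrix a, Matrix& inv) {
    const std::size_t n = a.size();
    for (const auto& row : a) assert(row.size() == n);  // must be square
    inv.assign(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) inv[i][i] = 1.0;
    for (std::size_t col = 0; col < n; ++col) {
        std::size_t pivot = col;
        for (std::size_t r = col + 1; r < n; ++r)
            if (std::fabs(a[r][col]) > std::fabs(a[pivot][col])) pivot = r;
        if (std::fabs(a[pivot][col]) < 1e-12) return false;
        std::swap(a[col], a[pivot]);
        std::swap(inv[col], inv[pivot]);
        const double p = a[col][col];
        for (std::size_t j = 0; j < n; ++j) { a[col][j] /= p; inv[col][j] /= p; }
        for (std::size_t r = 0; r < n; ++r) {
            if (r == col) continue;
            const double f = a[r][col];
            for (std::size_t j = 0; j < n; ++j) {
                a[r][j] -= f * a[col][j];
                inv[r][j] -= f * inv[col][j];
            }
        }
    }
    return true;
}

// Inverts m; on failure, adds ridge to each diagonal entry once and retries.
// usedRidge tells the caller whether a warning should be emitted.
Matrix invertWithRidge(const Matrix& m, double ridge, bool& usedRidge) {
    Matrix inv;
    usedRidge = false;
    if (tryInvert(m, inv)) return inv;
    Matrix adjusted = m;
    for (std::size_t i = 0; i < adjusted.size(); ++i) adjusted[i][i] += ridge;
    usedRidge = true;
    if (!tryInvert(adjusted, inv))
        throw std::runtime_error("matrix still singular after diagonal adjustment");
    return inv;
}
```

The returned inverse belongs to the adjusted matrix rather than the original one, which is why the adjustment can bias downstream estimates and why surfacing `usedRidge` as a warning matters.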
-Further documentation is available in the `docs` folder.
The Linux binary should work automatically on most Linux distributions. If not, compile it for your architecture.
-The OSX binary requires GCC version 6. Install it by running `brew install gcc6 --without-multilib` on your machine.
To compile on most Linux distributions and OSX, follow these steps:
-(`cmake . && make && sudo make install`)
(`./configure && make && sudo make install`)
`DYNAMIC_ARCH=1` when running `make` and `make install`, if you plan on using the binary across multiple architectures.
(`sudo apt-get install liblapack-dev`). If on Mac, run `brew install gcc6 --without-multilib` and `brew install lapack`.
`make --file Makefile_osx` or `make --file Makefile_linux`, depending on your platform.
The following references were used while preparing this program:
-
- HCA
-
- |
-