Genome Assembly Tutorial
This repository is a publicly available genome assembly tutorial. All steps are provided for the UConn CBC Xanadu cluster, with appropriate headers for the Slurm scheduler that need only simple modification to run. Commands should never be executed on the submit nodes of any HPC machine. If working on the Xanadu cluster, you should use sbatch scriptname after modifying the script for each stage. Basic editing of all scripts can be performed on the server with tools such as nano, vim, or emacs. If you are new to Linux, please use this handy guide for the operating system commands. In this guide, you will be working with common genome assemblers, such as SOAPdenovo, SPAdes, MaSuRCA, and Platanus, and with the quality assessment tool QUAST. If you do not have a Xanadu account and are an affiliate of UConn/UCHC, please apply for one here.
Data Acquisition
In this tutorial we will assemble sequences from a prokaryote sample and the Asian swallowtail butterfly (Papilio xuthus). The bacterial sample used in this tutorial is paired-end, meaning that there are forward and reverse reads, which we will designate as Sample_R1.fastq and Sample_R2.fastq, respectively.
The butterfly sample is from the DDBJ center. It contains 7 libraries in total (2 paired-end and 5 mate-pair), which are assigned the SRA accession numbers DRR021673, DRR021674, DRR021675, DRR021676, DRR021677, DRR021678, and DRR021679.
Note that all libraries take around 50GB of storage in total. Prepare your storage accordingly before proceeding to the next step.
Library type | Insert size | Read 1 | Read 2 |
Paired-end | 300 bp | DRR021673_1.fastq | DRR021673_2.fastq |
Paired-end | 500 bp | DRR021674_1.fastq | DRR021674_2.fastq |
Mate-pair | 3 kb | DRR021675_1.fastq | DRR021675_2.fastq |
Mate-pair | 3 kb | DRR021676_1.fastq | DRR021676_2.fastq |
Mate-pair | 5 kb | DRR021677_1.fastq | DRR021677_2.fastq |
Mate-pair | 5 kb | DRR021678_1.fastq | DRR021678_2.fastq |
Mate-pair | 8 kb | DRR021679_1.fastq | DRR021679_2.fastq |
If you are performing this tutorial on Xanadu, make sure you are in your home directory by entering
cd
before proceeding. Your home directory contains 10TB of storage, so your work will not consume the capacity of other users on the cluster.
The workflow may be cloned into the appropriate directory using the terminal commands:
git clone https://github.uconn.edu/mux13001/GenomeAssembly.git
cd GenomeAssembly
ls
The bacterial data is located in dataset/Bacteria/. Uncompress the two sequence files with:
tar -xJf Sample_R1.tar.xz && tar -xJf Sample_R2.tar.xz
We can obtain the butterfly data either through the Sequence Read Archive (SRA) or through a direct download link. Here we use the link. The command below downloads and uncompresses part of the butterfly data using wget and bzip2.
wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/DRA002/DRA002407/DRX019820/DRR021674_1.fastq.bz2 -P dataset/ && bzip2 -d dataset/DRR021674_1.fastq.bz2
The script that downloads all butterfly libraries is dataset/Butterfly/download.sh.
All data files are in fastq format. For more information about the fastq format, see the File Formats Tutorial.
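As a quick check after downloading, the number of reads in a fastq file can be counted from its line count. The sketch below assumes the common layout of exactly four lines per read (header, sequence, '+', qualities); the function name count_reads is only illustrative and is not part of the tutorial scripts.

```shell
# Count the reads in one or more FASTQ files.
# Assumption: every record occupies exactly four lines.
count_reads() {
    awk 'END { print NR / 4 }' "$@"
}

# Example (path taken from the tutorial layout):
#   count_reads dataset/Bacteria/Sample_R1.fastq
```

For a paired-end library, the forward and reverse files should report the same number of reads.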
Quality Control with Sickle
The first step is to perform quality control on the reads using sickle. Sickle trims low-quality bases from the ends of the raw reads and discards reads that end up shorter than a length threshold. Use the command below to run the program on the bacterial sample:
sickle pe -f dataset/Bacteria/Sample_R1.fastq -r dataset/Bacteria/Sample_R2.fastq -t sanger -o Sample_1.fastq -p Sample_2.fastq -s Sample_s.fastq -q 30 -l 45
The command processes a file containing forward reads and a file containing reverse reads. In addition to the trimmed pairs, it outputs trimmed singles, which are reads that passed the filter while their mates did not.
-f: the input file containing the forward reads
-r: the input file containing the reverse reads
-o: the output file containing the trimmed forward reads
-p: the output file containing the trimmed reverse reads
-s: the output file containing trimmed singles
-q: the minimum quality threshold
-l: the minimum read length
-t: the type of quality values (here sanger)
Do not run this command directly in a Xanadu terminal. The Slurm script that executes this command is sickle/quality_control.sh.
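For readers who have not written a Slurm script before, the sketch below prints a minimal batch script that wraps the sickle command. It is not the content of sickle/quality_control.sh: the partition name, memory request, and unversioned module name are assumptions to adapt to your cluster, and make_sickle_job is a hypothetical helper name.

```shell
# Print a minimal Slurm batch script for the sickle step.
# Assumptions: partition "general", 10G of memory, and a module called
# "sickle" are placeholders; check what your cluster actually provides.
make_sickle_job() {
    cat <<'EOF'
#!/bin/bash
#SBATCH --job-name=sickle_qc
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=10G
#SBATCH --partition=general
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

module load sickle

sickle pe \
    -f dataset/Bacteria/Sample_R1.fastq -r dataset/Bacteria/Sample_R2.fastq \
    -t sanger -o Sample_1.fastq -p Sample_2.fastq -s Sample_s.fastq \
    -q 30 -l 45
EOF
}

# Example:
#   make_sickle_job > my_quality_control.sh
#   sbatch my_quality_control.sh
```

Writing the script to a file and submitting it with sbatch keeps the work off the submit node.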
Since the adaptor sequences have already been trimmed from the butterfly data, we can proceed without further processing.
SOAPdenovo: de novo sequence assembler
SOAPdenovo is a novel short-read assembly method that can build a de novo draft assembly for human-sized genomes. The program is specially designed to assemble Illumina GA short reads.
SOAPdenovo uses a config file to pass information about the sequences into the program. Notable fields include average insert size and read length, which differ depending on the sequencing technology. Each library section starts with the line [LIB].
Enter command below to load SOAPdenovo on Xanadu:
module load SOAP-denovo/2.04
Bacteria
#maximal read length
max_rd_len=250
[LIB]
#average insert size
avg_ins=550
#if sequence needs to be reversed
reverse_seq=0
#in which part(s) the reads are used
asm_flags=3
#use only first 250 bps of each read
rd_len_cutoff=250
#in which order the reads are used while scaffolding
rank=1
# cutoff of pair number for a reliable connection (at least 3 for short insert size)
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
map_len=32
# paths to the read files
q1=../../dataset/Sample_1.fastq
q2=../../dataset/Sample_2.fastq
q=../../Sample_s.fastq
The sample config file for the bacterial library is shown above, where q1, q2, and q designate the paths to the forward, reverse, and singles trimmed reads, respectively. The file can be found in SOAPdenovo/Bacteria/config.
To run the assembler we will use the SOAPdenovo-63mer command with the all option (to perform kmer graph construction, contig error correction, mapping of reads to contigs, and scaffolding).
SOAPdenovo-63mer all -s /common/Assembly_Tutorial/Assembly/Sample.config -K 31 -R -o graph_Sample_31 1>ass31.log 2>ass31.err
-s: path to the config file
-K: size of kmer
-R: resolve repeats using reads
-o: the output prefix
We repeat this command with kmer sizes 35 and 41 for later analysis. The Slurm script is at SOAPdenovo/Bacteria/assemble.sh.
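Repeating the assembly for several kmer sizes is easy to express as a loop. The sketch below only prints the commands (a dry run) so they can be reviewed or pasted into a Slurm script; soap_commands is a hypothetical helper, and the output prefix and log names follow the pattern of the command above.

```shell
# Print one SOAPdenovo command per kmer size, without running anything.
# $1 = config file; remaining arguments = kmer sizes.
soap_commands() {
    local config=$1
    shift
    local k
    for k in "$@"; do
        printf 'SOAPdenovo-63mer all -s %s -K %s -R -o graph_Sample_%s 1>ass%s.log 2>ass%s.err\n' \
            "$config" "$k" "$k" "$k" "$k"
    done
}

# Example:
#   soap_commands config 31 35 41
```

Each kmer size gets its own output prefix, so the runs do not overwrite each other.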
Swallowtail
For the butterfly sample, we will run multiple libraries at once. Therefore, we set a rank for each library. SOAPdenovo uses the libraries in rank order, from smaller to larger insert size, to construct scaffolds. Libraries with the same rank are used at the same time. Ideally, the pairs in each rank provide adequate physical coverage of the genome.
#maximal read length
max_rd_len=250
[LIB]
#average insert size
avg_ins=300
#if sequence needs to be reversed
reverse_seq=0
#in which part(s) the reads are used
asm_flags=3
#use only first 250 bps of each read
rd_len_cutoff=250
#in which order the reads are used while scaffolding
rank=1
# cutoff of pair number for a reliable connection (at least 3 for short insert size)
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
map_len=32
# paths to the read files
q1=../../dataset/Butterfly/DRR021673_1.fastq
q2=../../dataset/Butterfly/DRR021673_2.fastq
[LIB]
#average insert size
avg_ins=500
#if sequence needs to be reversed
reverse_seq=0
#in which part(s) the reads are used
asm_flags=3
#use only first 250 bps of each read
rd_len_cutoff=250
#in which order the reads are used while scaffolding
rank=2
# cutoff of pair number for a reliable connection (at least 3 for short insert size)
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
map_len=32
# paths to the read files
q1=../../dataset/Butterfly/DRR021674_1.fastq
q2=../../dataset/Butterfly/DRR021674_2.fastq
[LIB]
#average insert size
avg_ins=3000
#if sequence needs to be reversed
reverse_seq=0
#in which part(s) the reads are used
asm_flags=3
#use only first 250 bps of each read
rd_len_cutoff=250
#in which order the reads are used while scaffolding
rank=3
# cutoff of pair number for a reliable connection (at least 3 for short insert size)
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
map_len=32
# paths to the read files
q1=../../dataset/Butterfly/DRR021675_1.fastq
q2=../../dataset/Butterfly/DRR021675_2.fastq
[LIB]
#average insert size
avg_ins=5000
#if sequence needs to be reversed
reverse_seq=0
#in which part(s) the reads are used
asm_flags=3
#use only first 250 bps of each read
rd_len_cutoff=250
#in which order the reads are used while scaffolding
rank=4
# cutoff of pair number for a reliable connection (at least 3 for short insert size)
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
map_len=32
# paths to the read files
q1=../../dataset/Butterfly/DRR021677_1.fastq
q2=../../dataset/Butterfly/DRR021677_2.fastq
As shown above, the config file describes four libraries with different insert sizes. The file is at SOAPdenovo/Butterfly/config. Run the command below to start the assembly:
SOAPdenovo-63mer all -s config -K 31 -R -o graph_xuthus_31 1>ass31.log 2>ass31.err
The Slurm script containing the command is at SOAPdenovo/Butterfly/assemble.sh.
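Because every [LIB] section of a multi-library config repeats the same fields, it can be less error-prone to generate the sections than to copy and edit them by hand. The following is a minimal sketch, assuming the field values used in this tutorial's config; soap_lib is a hypothetical helper and not part of SOAPdenovo.

```shell
# Print one [LIB] section of a SOAPdenovo config.
# $1 = average insert size (bp), $2 = rank, $3 = forward reads, $4 = reverse reads.
# The remaining values mirror the tutorial's config and are assumptions for other data.
soap_lib() {
    cat <<EOF
[LIB]
avg_ins=$1
reverse_seq=0
asm_flags=3
rd_len_cutoff=250
rank=$2
pair_num_cutoff=3
map_len=32
q1=$3
q2=$4
EOF
}

# Example: build a two-library config in a file called my_config.
#   { echo "max_rd_len=250"
#     soap_lib 300 1 ../../dataset/Butterfly/DRR021673_1.fastq ../../dataset/Butterfly/DRR021673_2.fastq
#     soap_lib 500 2 ../../dataset/Butterfly/DRR021674_1.fastq ../../dataset/Butterfly/DRR021674_2.fastq
#   } > my_config
```

Passing the insert size and rank as arguments makes it harder to leave a stale value behind when a section is duplicated.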
SPAdes: de Bruijn graph based assembler
SPAdes differs from the other assemblers in that it generates a final assembly from multiple kmer sizes. SPAdes automatically selects a list of kmer sizes using the maximum read length of the input data, and each kmer size contributes to the final assembly. To run SPAdes we will use the spades.py command with the --careful option to minimize the number of mismatches in the contigs.
Enter command below to load SPAdes on Xanadu:
module load SPAdes/3.11.1
Bacteria
spades.py --careful -o SPAdes_out -1 ../../dataset/Bacteria/Sample_1.fastq -2 ../../dataset/Bacteria/Sample_2.fastq -s ../../dataset/Bacteria/Sample_s.fastq
-o: path to output directory
-1: path to the forward reads
-2: path to the reverse reads
-s: path to the singles reads
-k: override automatic kmer selection (optional, not used above)
The script is at SPAdes/Bacteria/assemble.sh.
Swallowtail
spades.py --careful -o xuthus_out/ -t 16 -m 250 --pe1-1 ../../dataset/Butterfly/DRR021673_1.fastq --pe1-2 ../../dataset/Butterfly/DRR021673_2.fastq --pe2-1 ../../dataset/Butterfly/DRR021674_1.fastq --pe2-2 ../../dataset/Butterfly/DRR021674_2.fastq --mp1-1 ../../dataset/Butterfly/DRR021675_1.fastq --mp1-2 ../../dataset/Butterfly/DRR021675_2.fastq
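--peN-1, --peN-2: paths to the forward and reverse reads of paired-end library number N
--mpN-1, --mpN-2: paths to the forward and reverse reads of mate-pair library number N
-t: number of threads
-m: memory limit (GB)
The script is at SPAdes/Butterfly/assemble.sh.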
MaSuRCA assembler
MaSuRCA is whole-genome assembly software. It combines the efficiency of the de Bruijn graph and Overlap-Layout-Consensus (OLC) approaches. It requires a configuration file to generate a run script. In the configuration file, users are required to specify the input files, the number of cores to use, the jellyfish hash size, and the name of the SGE queue to use.
The configuration file consists of 2 main sections, DATA and PARAMETERS. Under DATA, each input library is specified with 5 fields: a two-letter prefix, the average insert length, its standard deviation, the path to the forward reads, and the path to the reverse reads (optional). Besides taking the average insert length and standard deviation from the data source, we can also use awk to calculate the two values from the data files. Below is an example of the operation on the bacterial sample.
awk 'BEGIN { t=0.0; sq=0.0; n=0 } NR%4==2 { n++; L=length($0); t+=L; sq+=L*L } END { m=t/n; printf("total %d avg=%f stddev=%f\n", n, m, sqrt(sq/n-m*m)) }' Sample_[12].fastq > Sample_stats.txt
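The command outputs a text file containing the mean and standard deviation of the read lengths. Note that read length is not the same as the library insert size, so prefer the values reported by the data source when they are available. The Slurm scripts for the bacterial and butterfly samples are stored in MaSuRCA/Bacteria/sample_seq_stats.sh and MaSuRCA/Butterfly/xut_seq_stats.sh, respectively.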
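To see what the statistics represent, it can help to run the same calculation on a toy file where the answer is known. The sketch below wraps the calculation in a function (read_length_stats is a hypothetical name) and takes the square root of the variance so that the reported value is a standard deviation; it assumes four-line fastq records.

```shell
# Mean and standard deviation of read lengths across FASTQ files.
# Assumption: four lines per record, so the sequence is line 2 of each record.
read_length_stats() {
    awk 'NR % 4 == 2 { n++; L = length($0); t += L; sq += L * L }
         END {
             if (n == 0) { print "total 0"; exit }
             m = t / n
             v = sq / n - m * m          # population variance
             if (v < 0) v = 0            # guard against rounding below zero
             printf("total %d avg=%f stddev=%f\n", n, m, sqrt(v))
         }' "$@"
}

# Toy check: two reads of length 4 and 6 have mean 5 and standard deviation 1.
#   printf '@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n' > toy.fastq
#   read_length_stats toy.fastq
```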
Load MaSuRCA module on Xanadu:
module load MaSuRCA/3.2.4
Bacteria
DATA
PE= pe 232 1442 /home/CAM/mxu/tutorial/m4/Sample_1.fastq /home/CAM/mxu/tutorial/m4/Sample_2.fastq
PE= se 176 4194 /home/CAM/mxu/tutorial/Bacteria/Sample_s.fastq
#JUMP= sh 3600 200 /FULL_PATH/short_1.fastq /FULL_PATH/short_2.fastq
#pacbio reads must be in a single fasta file! make sure you provide absolute path
#PACBIO=/FULL_PATH/pacbio.fa
#OTHER=/FULL_PATH/file.frg
END
PARAMETERS
#set this to 1 if your Illumina jumping library reads are shorter than 100bp
#EXTEND_JUMP_READS=0
#this is k-mer size for deBruijn graph values between 25 and 127 are supported, auto will compute the optimal size based on the read data and GC content
GRAPH_KMER_SIZE = auto
#set this to 1 for all Illumina-only assemblies
#set this to 1 if you have less than 20x long reads (454, Sanger, Pacbio) and less than 50x CLONE coverage by Illumina, Sanger or 454 mate pairs
#otherwise keep at 0
USE_LINKING_MATES = 0
#specifies whether to run mega-reads correction on the grid
USE_GRID=0
#specifies queue to use when running on the grid MANDATORY
GRID_QUEUE=all.q
#batch size in the amount of long read sequence for each batch on the grid
GRID_BATCH_SIZE=300000000
#coverage by the longest Long reads to use
##LHE_COVERAGE=30
#this parameter is useful if you have too many Illumina jumping library mates. Typically set it to 60 for bacteria and 300 for the other organisms
LIMIT_JUMP_COVERAGE = 60
#these are the additional parameters to Celera Assembler. do not worry about performance, number or processors or batch sizes -- these are computed automatically.
#set cgwErrorRate=0.25 for bacteria and 0.1<=cgwErrorRate<=0.15 for other organisms.
CA_PARAMETERS = cgwErrorRate=0.15
#minimum count k-mers used in error correction 1 means all k-mers are used. one can increase to 2 if Illumina coverage >100
KMER_COUNT_THRESHOLD = 1
#whether to attempt to close gaps in scaffolds with Illumina data
CLOSE_GAPS=1
#auto-detected number of cpus to use
NUM_THREADS = 16
#this is mandatory jellyfish hash size -- a safe value is estimated_genome_size*estimated_coverage
JF_SIZE = 200000000
#set this to 1 to use SOAPdenovo contigging/scaffolding module. Assembly will be worse but will run faster. Useful for very large (>5Gbp) genomes from Illumina-only data
SOAP_ASSEMBLY=0
END
The configuration file is shown above. It is also located in MaSuRCA/Bacteria/config. Here, under DATA, the PE flag designates both the paired-end and the single-end reads.
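The config comments suggest estimated_genome_size*estimated_coverage as a safe JF_SIZE. A small sketch of that arithmetic follows; jf_size is a hypothetical helper, and the genome size and coverage in the example are made-up round numbers rather than measurements from this dataset.

```shell
# Suggest a jellyfish hash size as genome size (bp) times coverage.
# $1 = estimated genome size in bp, $2 = estimated coverage.
jf_size() {
    echo $(( $1 * $2 ))
}

# Example with illustrative values (a 3 Mbp genome at 60x coverage):
#   jf_size 3000000 60
```

The result can be written into the JF_SIZE line of the configuration file.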
Next, run MaSuRCA on the config file.
masurca config
The command generates a bash script named assemble.sh. Run assemble.sh to start the assembly (do not run the script on a submit node).
bash assemble.sh
The results will be stored in the directory CA/ after completion. The Slurm script containing the commands above is MaSuRCA/Bacteria/ma_assemble.sh.
Swallowtail
DATA
##PE= pe 525 60 avg_read_length std_dev /FULL_PATH/paired_read1.fastq /FULL_PATH/paired_read2.fastq
PE= p1 147 160 /home/CAM/mxu/tutorial/p3/dataset/DRR021673_1.fastq /home/CAM/mxu/tutorial/p3/dataset/DRR021673_2.fastq
PE= p2 145 251 /home/CAM/mxu/tutorial/p3/dataset/DRR021674_1.fastq /home/CAM/mxu/tutorial/p3/dataset/DRR021674_2.fastq
JUMP= m1 124 1539 /home/CAM/mxu/tutorial/p3/dataset/DRR021675_1.fastq /home/CAM/mxu/tutorial/p3/dataset/DRR021675_2.fastq
JUMP= m2 125 1493 /home/CAM/mxu/tutorial/p3/dataset/DRR021677_1.fastq /home/CAM/mxu/tutorial/p3/dataset/DRR021677_2.fastq
#pacbio reads must be in a single fasta file! make sure you provide absolute path
#PACBIO=/FULL_PATH/pacbio.fa
#OTHER=/FULL_PATH/file.frg
END
PARAMETERS
#set this to 1 if your Illumina jumping library reads are shorter than 100bp
#EXTEND_JUMP_READS=0
#this is k-mer size for deBruijn graph values between 25 and 127 are supported, auto will compute the optimal size based on the read data and GC content
GRAPH_KMER_SIZE = auto
#set this to 1 for all Illumina-only assemblies
#set this to 1 if you have less than 20x long reads (454, Sanger, Pacbio) and less than 50x CLONE coverage by Illumina, Sanger or 454 mate pairs
#otherwise keep at 0
USE_LINKING_MATES = 0
#specifies whether to run mega-reads correction on the grid
USE_GRID=0
#specifies queue to use when running on the grid MANDATORY
GRID_QUEUE=all.q
#batch size in the amount of long read sequence for each batch on the grid
GRID_BATCH_SIZE=300000000
#coverage by the longest Long reads to use
##LHE_COVERAGE=30
#this parameter is useful if you have too many Illumina jumping library mates. Typically set it to 60 for bacteria and 300 for the other organisms
LIMIT_JUMP_COVERAGE = 300
#these are the additional parameters to Celera Assembler. do not worry about performance, number or processors or batch sizes -- these are computed automatically.
#set cgwErrorRate=0.25 for bacteria and 0.1<=cgwErrorRate<=0.15 for other organisms.
CA_PARAMETERS = cgwErrorRate=0.15
#minimum count k-mers used in error correction 1 means all k-mers are used. one can increase to 2 if Illumina coverage >100
KMER_COUNT_THRESHOLD = 1
#whether to attempt to close gaps in scaffolds with Illumina data
CLOSE_GAPS=1
#auto-detected number of cpus to use
NUM_THREADS = 16
#this is mandatory jellyfish hash size -- a safe value is estimated_genome_size*estimated_coverage
JF_SIZE = 200000000
#set this to 1 to use SOAPdenovo contigging/scaffolding module. Assembly will be worse but will run faster. Useful for very large (>5Gbp) genomes from Illumina-only data
SOAP_ASSEMBLY=0
END
For the butterfly, we use 4 libraries as input. The 2 paired-end libraries start with the PE flag and the 2 mate-pair libraries start with the JUMP flag. The configuration file is located at MaSuRCA/Butterfly/config.
The script to submit the assembly job is at MaSuRCA/Butterfly/ma_assemble.sh.
Platanus: PLATform for Assembling NUcleotide Sequences
Platanus is a novel de novo sequence assembler that can reconstruct genomic sequences of highly heterozygous diploids from massively parallel shotgun sequencing data. It consists of 3 separate commands: assemble, scaffold, and gap_close. We will go through these commands in this section. Enter the command below to load the Platanus module on Xanadu.
module load platanus/1.2.4
Bacteria
First we need to assemble contigs from the trimmed reads.
platanus assemble -o sample -f dataset/Bacteria/Sample_[12s].fastq -t 16 -m 128
-o: output prefix
-f: path to input reads
-t: number of cpus to use
-m: amount of memory to use (GB)
The command above takes 3 input files containing the forward reads, reverse reads, and singles reads, respectively. Assembled contigs will be saved in sample_contig.fa when it is completed.
platanus scaffold -o sample -c sample_contig.fa -b sample_contigBubble.fa -IP1 ../../dataset/Bacteria/Sample_1.fastq ../../dataset/Bacteria/Sample_2.fastq -t 16
-o: output prefix
-c: path to input assembled contigs
-b: path to contig bubbles
-IP1: paths to input paired-end reads (forward and reverse files)
-t: number of cpus to use
The scaffold command generates a scaffolds file, sample_scaffold.fa. With scaffolds, we can proceed to gap_close.
platanus gap_close -o sample -c sample_scaffold.fa -IP1 ../../dataset/Bacteria/Sample_1.fastq ../../dataset/Bacteria/Sample_2.fastq -t 16
-o: output prefix
-c: path to input scaffolds
-IP1: paths to input paired-end reads (forward and reverse files)
-t: number of cpus to use
It outputs sample_gapClose.fa, which contains the gap-closed sequences.
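Since each Platanus step reads the file written by the previous one, a simple check that the output exists and contains sequences can catch a failed step early. A minimal sketch follows; count_fasta_seqs is a hypothetical helper that counts FASTA header lines.

```shell
# Count the sequences in a FASTA file by counting header lines.
# Prints 0 for a file without headers; "|| true" keeps the exit status
# zero in that case so the function is safe under "set -e".
count_fasta_seqs() {
    grep -c '^>' "$1" || true
}

# Example, using the file names produced by the commands above:
#   count_fasta_seqs sample_contig.fa
#   count_fasta_seqs sample_scaffold.fa
#   count_fasta_seqs sample_gapClose.fa
```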
The Slurm script with all three steps is at Platanus/Bacteria/platanus.sh.
Swallowtail
We choose libraries DRR021673 and DRR021674 as input for the assembly.
platanus assemble -o Pxut -f /home/CAM/mxu/tutorial/p3/dataset/DRR02167[34]_[12].fastq -t 16 -m 128
-o: output prefix
-f: path to input reads
-t: number of cpus to use
-m: amount of memory to use (GB)
Assembled contigs will be saved in Pxut_contig.fa when it is completed.
platanus scaffold -o Pxut -c Pxut_contig.fa -b Pxut_contigBubble.fa -IP1 ../../dataset/Butterfly/DRR021673_1.fastq ../../dataset/Butterfly/DRR021673_2.fastq -IP2 ../../dataset/Butterfly/DRR021674_1.fastq ../../dataset/Butterfly/DRR021674_2.fastq -t 16
-o: output prefix
-c: path to input assembled contigs
-b: path to contig bubbles
-IP1/-IP2: paths to input paired-end reads (forward and reverse files of each library)
-t: number of cpus to use
The scaffold command generates a scaffolds file, Pxut_scaffold.fa. With scaffolds, we can proceed to gap_close.
platanus gap_close -o Pxut -c Pxut_scaffold.fa -IP1 ../../dataset/Butterfly/DRR021673_1.fastq ../../dataset/Butterfly/DRR021673_2.fastq -IP2 ../../dataset/Butterfly/DRR021674_1.fastq ../../dataset/Butterfly/DRR021674_2.fastq -t 16
-o: output prefix
-c: path to input scaffolds
-IP1/-IP2: paths to input paired-end reads (forward and reverse files of each library)
-t: number of cpus to use
It outputs Pxut_gapClose.fa, which contains the gap-closed sequences.
The Slurm script with all three steps is at Platanus/Butterfly/platanus.sh.
QUAST: Quality Assessment Tool for Genome Assemblies
Now that we have several assemblies, it's time to analyze the quality of each assembly. SOAPdenovo has its own statistics output, but for consistency, we will be using the program QUAST. The statistics we are most interested in are the number of contigs, the total length, and N50 (the contig length such that contigs of that length or longer contain at least half of the assembled bases). A good assembly would have a low number of contigs, a total length that makes sense for the species, and a high N50 value. To run QUAST on all of our final assembly files we will run the following commands, with the only parameters used being the name of the scaffold file(s) and the output directory.
To load QUAST on Xanadu:
module load quast/4.6
Sample command that processes the output of SOAPdenovo with QUAST:
python quast.py -t 8 ../../SOAPdenovo/Bacteria/graph_Sample_31.scafSeq -o SOAP
-o: path to output directory
-t: number of CPUs to use
QUAST's output consists of a directory containing results in multiple formats. For statistics such as the number of contigs, total length, and N50, we can check report.txt using the less command, or we can optionally download the output from the cluster and open the interactive HTML report in a web browser.
scp -r your-username@xanadu-submit-ext.cam.uchc.edu:/path/to/QUAST/output .
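To make the N50 statistic concrete, the sketch below computes it directly from sequence lengths: sort the lengths from longest to shortest and report the length at which the running total first reaches half of the overall total. The helper names fasta_lengths and n50 are hypothetical. QUAST applies a minimum contig length by default, so its reported N50 may differ from this unfiltered calculation.

```shell
# Print the length of every sequence in a FASTA file, one per line.
fasta_lengths() {
    awk '/^>/ { if (seen) print n; seen = 1; n = 0; next }
         { n += length($0) }
         END { if (seen) print n }' "$1"
}

# Read one length per line on stdin and print the N50.
n50() {
    sort -nr | awk '{ len[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                run += len[i]
                if (run * 2 >= total) { print len[i]; exit }
            }
        }'
}

# Example, using a scaffold file from the SOAPdenovo section:
#   fasta_lengths ../../SOAPdenovo/Bacteria/graph_Sample_31.scafSeq | n50
```

For lengths 6, 5, 4, 3, and 2 (total 20), the running total reaches 10 at the second sequence, so the N50 is 5.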
Bacteria
Assembly | # contigs | Largest contig | Total length | GC (%) | N50 |
SOAPdenovo | 276 | 103125 | 3574101 | 32.44 | 26176 |
SPAdes | 59 | 255551 | 2880184 | 32.65 | 147660 |
MaSuRCA | 110 | 148785 | 2891062 | 32.60 | 45141 |
Platanus | 631 | 66346 | 2804614 | 32.80 | 14143 |
The script to run QUAST on the bacterial results is located at Quast/Bacteria/quast.sh.
Butterfly
Assembly | # contigs | Largest contig | Total length | GC (%) | N50 |
SOAPdenovo | 31897 | 6312 | 321848681 | 33.56 | 831 |
SPAdes | |||||
MaSuRCA | 33583 | 521760 | 426261801 | 33.93 | 20460 |
Platanus | 28736 | 632864 | 236681996 | 33.81 | 62393 |
The script to run QUAST on the swallowtail results is located at Quast/Butterfly/quast.sh.
Citations
Bankevich, A., Nurk, S., Antipov, D., Gurevich, A. A., Dvorkin, M., Kulikov, A. S., … Pevzner, P. A. (2012). SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology, 19(5), 455–477. http://doi.org/10.1089/cmb.2012.0021
Gurevich, A., Saveliev, V., Vyahhi, N., & Tesler, G. (2013). QUAST: quality assessment tool for genome assemblies. Bioinformatics, 29(8), 1072–1075. http://doi.org/10.1093/bioinformatics/btt086
Kajitani, R., Toshimoto, K., Noguchi, H., Toyoda, A., Ogura, Y., Okuno, M., … Itoh, T. (2014). Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads. Genome Research, 24(8), 1384–1395. http://doi.org/10.1101/gr.170720.113
Luo, R., Liu, B., Xie, Y., Li, Z., Huang, W., Yuan, J., … Wang, J. (2012). SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience, 1, 18. http://doi.org/10.1186/2047-217X-1-18
Zimin, A., et al. (2013). The MaSuRCA genome assembler. Bioinformatics. http://doi.org/10.1093/bioinformatics/btt476