Skip to content
Permalink
master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Go to file
 
 
Cannot retrieve contributors at this time
# Genome Assembly Tutorial
This repository is a usable, publicly available Genome Assembly tutorial.
All steps have been provided for the UConn CBC Xanadu cluster here with appropriate headers for
the Slurm scheduler that can be modified simply to run.  Commands should never be executed on the submit
nodes of any HPC machine.  If working on the Xanadu cluster, you should use sbatch scriptname after modifying the
script for each stage.  Basic editing of all scripts can be performed on the server with tools,
such as nano, vim, or emacs.  If you are new to Linux, please use
<a href="https://bioinformatics.uconn.edu/unix-basics/">this</a> handy guide for the operating system commands. 
In this guide, you will be working with common genome assemblers, such as
<a href="http://soap.genomics.org.cn/soapdenovo.html">SOAPdenovo</a>,
<a href="http://spades.bioinf.spbau.ru/">SPAdes</a>,
<a href="http://www.genome.umd.edu/masurca.html">MaSuRCA</a>,
<a href="http://platanus.bio.titech.ac.jp/">Platanus</a>,
and quality assessment tool <a href="http://bioinf.spbau.ru/quast">Quast</a>
If you do not have a Xanadu account and are an affiliate of UConn/UCHC, please apply for one <a href="https://bioinformatics.uconn.edu/contact-us/">here</a>.
<div id="toc_container">
<p class="toc_title">Contents</p>
<ul class="toc_list">
<li><a href="#First_Point_Header">1 Data Acquisition</a></li>
<li><a href="#Second_Point_Header">2 Quality Control with Sickle</a></li>
<li><a href="#Third_Point_Header">3 SOAPdenovo: de novo sequence assembler</a></li>
<li><a href="#Fourth_Point_Header">4 SPAdes: de Bruijn graph based assembler</a></li>
<li><a href="#Fifth_Point_Header">5 MaSuRCA assembler</a></li>
<li><a href="#Sixth_Point_Header">6 Platanus: PLATform for Assembling NUcleotide Sequences</a></li>
<li><a href="#EnTAP">7 Quast: Quality Assessment Tool for Genome Assemblies</a></li>
<li><a href="#Citation">Citations</a></li>
</ul>
</div>
<h2 id="First_Point_Header">Data Acquisition</h2>
In this tutorial we will assemble sequences from a Prokaryote sample and Asian swallowtail (Papilio xuthus).
The bacterial sample used in this tutorial is paired-end, meaning that there are forward and reverse reads. which we
will designate as <code>Sample_R1.fastq</code> and <code>Sample_R2.fastq</code>, respectively.
The butterfly sample is from DDBJ center. It contains 7 libraries in total, 2 pair-end, 5 mate-pair, which are assigned SRA
accession number <code>DRR021673</code>, <code>DRR021674</code>, <code>DRR021675</code>, <code>DRR021676</code>, <code>DRR021677</code>, <code>DRR021678</code>, <code>DRR021679</code>.
Note that all libraires take around 50GB storage in total. Prepare you storage accordingly before proceeding to next
step.
<table dir="ltr" border="1" cellspacing="0" cellpadding="0"><colgroup> <col width="100" /> <col width="100" /> <col width="100" /> <col width="100" /> <col width="100" /> <col width="100" /></colgroup>
<tbody>
<tr>
<td>LibraryType</td>
<td>Insert size</td>
<td>Read1</td>
<td>Read2</td>
</tr>
<tr>
<td>Pair-end</td>
<td>300</td>
<td>DRR021673_1.fastq</td>
<td>DRR021673_2.fastq</td>
</tr>
<tr>
<td>Pair-end</td>
<td>500</td>
<td>DRR021674_1.fastq</td>
<td>DRR021674_2.fastq</td>
</tr>
<tr>
<td>Mate-pair</td>
<td>3kb</td>
<td>DRR021675_1.fastq</td>
<td>DRR021675_2.fastq</td>
</tr>
<tr>
<td></td>
<td></td>
<td>DRR021676_1.fastq</td>
<td>DRR021676_2.fastq</td>
</tr>
<tr>
<td>Mate-pair</td>
<td>5kb</td>
<td>DRR021677_1.fastq</td>
<td>DRR021677_2.fastq</td>
</tr>
<tr>
<td></td>
<td></td>
<td>DRR021678_1.fastq</td>
<td>DRR021678_2.fastq</td>
</tr>
<tr>
<td>Mate-pair</td>
<td>8kb</td>
<td>DRR021679_1.fastq</td>
<td>DRR021679_2.fastq</td>
</tr>
</tbody>
</table>
If you are performing this tutorial on Xanadu. Make sure you are under home directory.
<pre style="color: silver; background: black;">cd</pre>
before proceeding. Your home directory contains 10TB of storage and will not pollute the capacities of other users on the cluster.
The workflow may be cloned into the appropriate directory using the terminal command:
<pre style="color: silver; background: black;">$git clone https://github.uconn.edu/mux13001/GenomeAssembly.git
$cd GenomeAssemblyTutorial
$ls </pre>
The bacteria data is located in <code>dataset/Bacteria/</code>. Uncompress the two sequence files with:
<pre>tar -xJf Sample_R1.tar.xz && tar -xJf Sample_R2.tar.xz</pre>
We can obtain butterfly data either through sequence-read-archives (SRA) or by a downloadable link. Here we use the link to download data. Command below downloads
and uncompress part of butterfly data using <code>wget</code> and <code>bzip2</code>.
<pre>wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/DRA002/DRA002407/DRX019820/DRR021674_1.fastq.bz2 -P dataset/ && bzip2 -d dataset/DRR021674_1.fastq.bz2</pre>
The script which downloads all butterfly libraries is <code>dataset/Butterfly/download.sh</code>.
All data files are in fastq format. For more information about fastq format, see <a href="http://bioinformatics.uconn.edu/resources-and-events/tutorials/file-formats-tutorial/">File Formats Tutorial</a>
<h2 id="Second_Point_Header">Quality Control with Sickle</h2>
The first is to perform quality control on the reads using sickle. Sicle trims low quality reads below a certain threshold
from raw sequencing data. Use command below run the program on bacteria sample:
<pre>sickle pe -f dataset/bacteria/Sample_R1.fastq -r dataset/bacteria/Sample_R2.fastq -t sanger -o Sample_1.fastq -p Sample_2.fastq -s Sample_s.fastq -q 30 -l 45</pre>
The command processes a file containing forward reads and a file containing reverse reads , in addition, it outputs trimmed singles.
<ul>
<li><code>-f</code>: designate the input file containing the forward reads</li>
<li><code>-r</code>: the input file containing the reverse reads</li>
<li><code>-o</code>: the output file containing the trimmed forward reads</li>
<li><code>-p</code>: the output file containing the trimmed reverse reads</li>
<li><code>-s</code>: the output file containing trimmed singles</li>
<li><code>-q</code>: designate the minimum quality</li>
<li><code>-l</code>: the minimum read length</li>
<li><code>-t</code>: designate the type of read</li>
</ul>
<b>Don't run this command alone on Xanadu terminal</b>. The slurm script executing this command is <code>sickle/quality_control.sh</code>.
Since the bufferfly data are already trimmed adaptor sequences, we can proceed without treating them.
<h2 id="Third_Point_Header">SOAPdenovo: de novo sequence assembler</h2>
SOAPdenovo is a novel short-read assembly method that can build a de novo draft assembly for the human-sized genomes. The program is specially designed to assemble Illumina GA short reads.
SOAPdenovo uses a config file to pass information about the sequences into the
program. Notable fields include average insert size and read length,
which differ depending on the sequencing technology, Each library starts with line <code>[LIB]</code>
Enter command below to load SOAPdenovo on Xanadu:
<pre>module load SOAP-denovo/2.04</pre>
<b>Bacteria</b>
<pre>#maximal read length
max_rd_len=250
[LIB]
#average insert size
avg_ins=550
#if sequence needs to be reversed
reverse_seq=0
#in which part(s) the reads are used
asm_flags=3
#use only first 250 bps of each read
rd_len_cutoff=250
#in which order the reads are used while scaffolding
rank=1
# cutoff of pair number for a reliable connection (at least 3 for short insert size)
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
map_len=32
# path to genes
q1=../../dataset/Sample_1.fastq
q2=../../dataset/Sample_2.fastq
q=../../Sample_s.fastq</pre>
Above shows sample config file for the bacterial library, where <code>q1</code>, <code>q2</code>, <code>q</code> designate
the paths to the forward, reverse and singles trimmed reads respectively. The file can be found in <code>SOAPdenovo/Bacteria/config</code>
To run the assembler we will use the SOAPdenovo-63mer command with the <code>all</code> option ((to perform kmer graph construction, contig error correction, mapping of reads to contigs, and scaffolding).
<pre>SOAPdenovo-63mer all -s /common/Assembly_Tutorial/Assembly/Sample.config -K 31 -R -o graph_Sample_31 1>ass31.log 2>ass31.err</pre>
<ul>
<li><code>-s</code>: path to the contig</li>
<li><code>-k</code>: size of kmer</li>
<li><code>-o</code>: the output prefix</li>
</ul>
We repeat this command for kmer size = 35, 41 for later analysis. the slurm script is at <code>SOAPdenovo/Bacteria/assemble.sh</code>
<b>swallowtail</b>
For butterfly sample, we will run multiple libraries at once. Therefore, <code>rank</code> for each library. SOAPdenovo will use the pair-end libraries with insert size from smaller to larger to construct scaffolds.
Libraries with the same rank would be used at the same time. It is desired that the pairs in each rank provide adequate physical coverage of the genome.
<pre>#maximal read length
max_rd_len=250
[LIB]
#average insert size
avg_ins=300
#if sequence needs to be reversed
reverse_seq=0
#in which part(s) the reads are used
asm_flags=3
#use only first 250 bps of each read
rd_len_cutoff=250
#in which order the reads are used while scaffolding
rank=1
# cutoff of pair number for a reliable connection (at least 3 for short insert size)
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
map_len=32
# path to genes
q1=../../dataset/Butterfly/DRR021673_1.fastq
q2=../../dataset/Butterfly/DRR021673_2.fastq
[LIB]
#average insert size
avg_ins=500
#if sequence needs to be reversed
reverse_seq=0
#in which part(s) the reads are used
asm_flags=3
#use only first 250 bps of each read
rd_len_cutoff=250
#in which order the reads are used while scaffolding
rank=2
# cutoff of pair number for a reliable connection (at least 3 for short insert size)
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
map_len=32
# path to genes
q1=../../dataset/Butterfly/DRR021674_1.fastq
q2=../../dataset/Butterfly/DRR021674_2.fastq
[LIB]
#average insert size
avg_ins=3000
#if sequence needs to be reversed
reverse_seq=0
#in which part(s) the reads are used
asm_flags=3
#use only first 250 bps of each read
rd_len_cutoff=250
#in which order the reads are used while scaffolding
rank=3
# cutoff of pair number for a reliable connection (at least 3 for short insert size)
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
map_len=32
# path to genes
q1=../../dataset/Butterfly/DRR021675_1.fastq
q2=../../dataset/Butterfly/DRR021675_2.fastq
[LIB]
#average insert size
avg_ins=300
#if sequence needs to be reversed
reverse_seq=0
#in which part(s) the reads are used
asm_flags=3
#use only first 250 bps of each read
rd_len_cutoff=250
#in which order the reads are used while scaffolding
rank=4
# cutoff of pair number for a reliable connection (at least 3 for short insert size)
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
map_len=32
# path to genes
q1=../../dataset/Butterfly/DRR021677_1.fastq
q2=../../dataset/Butterfly/DRR021677_2.fastq</pre>
As shown above, four libraries with different insert sizes. The file is at <code>SOAPdenovo/Butterfly/config</code>. Run
command below to start assemble
<pre>SOAPdenovo-63mer all -s config -K 31-R -o graph_xuthus_31 1>ass31.log 2>ass31.err</pre>
The slurm script containing the command is at <code>SOAPdenovo/Butterfly/assemble.sh</code>
<h2 id="Fourth_Point_Header">SPAdes: de Bruijn graph based assembler</h2>
SPAdes is different from the other assemblers in that it generates a final assembly from multiple kmers. A list of kmers is automatically selected by SPAdes using the maximum read length of the input data, and each individual kmer contributes to the final assembly. To run SPAdes we will use the spades.py command with the <code>--careful</code> option to minimize the number of mismatches in the contigs
Enter command below to load SPAdes on Xanadu:
<pre>module load SPAdes/3.11.1</pre>
<b>Bacteria</b>
<pre>spades.py --careful -o SPAdes_out -1 ../../dataset/BacteriaSample_1.fastq -2 ../../dataset/Bacteria/Sample_2.fastq -s d../../ataset/Bacteria/Sample_s.fastq</pre>
<ul>
<li><code>-o</code>: path to output directory</li>
<li><code>-1</code>: path to the forward reads</li>
<li><code>-2</code>: path to the reverse reads</li>
<li><code>-s</code>: path to the singles reads</li>
<li><code>-k</code>: override automatic kmer selection</li>
</ul>
The script is at <code>SPAdes/Bacteria/assemble.sh</code>.
<b>swallowtail</b>
<pre>spades.py --careful -o xuthus_out/ -t 16 -m 250 --pe1-1 ../../dataset/butterfly/DRR021673_1.fastq --pe1-2 ../../dataset/butterfly/DRR021673_2.fastq --pe2-1 ../../dataset/butterfly/DRR021674_1.fastq --pe2-2 ../../dataset/butterfly/DRR021674_2.fastq --mp1-1 ../../dataset/butterfly/DRR021675_1.fastq --mp1-2 ../../dataset/Butterfly/DRR021675_2.fastq</pre>
<ul>
<li><code>-pe</code>: path to pair-end reads</li>
<li><code>-mp</code>: path to mate mate-pair reads</li>
</ul>
The script is at <code>SPAdes/Butterfly/assemble.sh</code>.
<h2 id="Fifth_Point_Header">MaSuRCA assembler</h2>
MaSuRCA is whole genome assembly software. It combines the efficiency of the de Bruijn graph and Overlap-Layout-Consensus (OLC) approaches.
It requires a configuration file to generate a run script. In the configuration file, user are required to specify input files, number
of cores to use, jellyfish hash size and name of SGE queue to use.
The configuration file consists of 2 main sections, <code>DATA</code> and <code>PARAMETERS</code>. Under <code>DATA</code>,
input library is specified with 5 fields: two-letter prefix, average insert length and standard deviation, path to forward
reads, path to reverse reads (optional). In addition to get average insert length and standard deviation from data source.
we can also use <code>awk</code> to calculate two values from data files. Below is an example of the operation on Bacteria
Sample.
<pre>awk 'BEGIN { t=0.0;sq=0.0; n=0;} ;NR%4==2 {n++;L=length($0);t+=L;sq+=L*L;}END{m=t/n;printf("total %d avg=%f stddev=%f\n",n,m,sq/n-m*m);}' Sample_[12].fastq > Sample_stats.txt</pre>
The command outputs a text file containing mean and standard deviation of reads. The slurm script for bacteria and butterfly are
stored in <code>MaSuRCA/Bacteria/sample_seq_stats.sh</code> and <code>MaSuRCA/Butterfly/xut_seq_stats.sh</code> respectively.
Load MaSuRCA module on Xanadu:
<pre>module load MaSuRCA/3.2.4</pre>
<b>Bacteria</b>
<pre>DATA
PE= pe 232 1442 /home/CAM/mxu/tutorial/m4/Sample_1.fastq /home/CAM/mxu/tutorial/m4/Sample_2.fastq
PE= se 176 4194 /home/CAM/mxu/tutorial/Bacteria/Sample_s.fastq
#JUMP= sh 3600 200 /FULL_PATH/short_1.fastq /FULL_PATH/short_2.fastq
#pacbio reads must be in a single fasta file! make sure you provide absolute path
#PACBIO=/FULL_PATH/pacbio.fa
#OTHER=/FULL_PATH/file.frg
END
PARAMETERS
#set this to 1 if your Illumina jumping library reads are shorter than 100bp
#EXTEND_JUMP_READS=0
#this is k-mer size for deBruijn graph values between 25 and 127 are supported, auto will compute the optimal size based on the read data and GC content
GRAPH_KMER_SIZE = auto
#set this to 1 for all Illumina-only assemblies
#set this to 1 if you have less than 20x long reads (454, Sanger, Pacbio) and less than 50x CLONE coverage by Illumina, Sanger or 454 mate pairs
#otherwise keep at 0
USE_LINKING_MATES = 0
#specifies whether to run mega-reads correction on the grid
USE_GRID=0
#specifies queue to use when running on the grid MANDATORY
GRID_QUEUE=all.q
#batch size in the amount of long read sequence for each batch on the grid
GRID_BATCH_SIZE=300000000
#coverage by the longest Long reads to use
##LHE_COVERAGE=30
#this parameter is useful if you have too many Illumina jumping library mates. Typically set it to 60 for bacteria and 300 for the other organisms
LIMIT_JUMP_COVERAGE = 60
#these are the additional parameters to Celera Assembler. do not worry about performance, number or processors or batch sizes -- these are computed automatically.
#set cgwErrorRate=0.25 for bacteria and 0.1<=cgwErrorRate<=0.15 for other organisms.
CA_PARAMETERS = cgwErrorRate=0.15
#minimum count k-mers used in error correction 1 means all k-mers are used. one can increase to 2 if Illumina coverage >100
KMER_COUNT_THRESHOLD = 1
#whether to attempt to close gaps in scaffolds with Illumina data
CLOSE_GAPS=1
#auto-detected number of cpus to use
NUM_THREADS = 16
#this is mandatory jellyfish hash size -- a safe value is estimated_genome_size*estimated_coverage
JF_SIZE = 200000000
#set this to 1 to use SOAPdenovo contigging/scaffolding module. Assembly will be worse but will run faster. Useful for very large (>5Gbp) genomes from Illumina-only data
SOAP_ASSEMBLY=0
END</pre>
The configuration file is shown above. It is also located in <code>MaSuRCA/Bacteria/config</code>. Here under <code>DATA</code>,
flag <code>PE</code> designate both pair-end and single-end reads.
Next, run MaSuRCA on the config file.
<pre>masurca config</pre>
The command generates a bash script named <code>assemble.sh</code>. Run <code>assemble.sh</code> to start assembly (<b>
Don't run the script on submit node</b>).
<pre>bash assemble.sh</pre>
The results will be stored in directory <code>CA/</code> after completion. The slurm script containing commands above is
<code>MaSuRCA/Bacteria/ma_assemble.sh</code>
<b>swallowtail</b>
<pre>DATA
##PE= pe 525 60 avg_read_length std_dev /FULL_PATH/paired_read1.fastq /FULL_PATH/paired_read2.fastq
PE= p1 147 160 /home/CAM/mxu/tutorial/p3/dataset/DRR021673_1.fastq /home/CAM/mxu/tutorial/p3/dataset/DRR021673_2.fastq
PE= p2 145 251 /home/CAM/mxu/tutorial/p3/dataset/DRR021674_1.fastq /home/CAM/mxu/tutorial/p3/dataset/DRR021674_2.fastq
JUMP= m1 124 1539 /home/CAM/mxu/tutorial/p3/dataset/DRR021675_1.fastq /home/CAM/mxu/tutorial/p3/dataset/DRR021675_2.fastq
JUMP= m2 125 1493 /home/CAM/mxu/tutorial/p3/dataset/DRR021677_1.fastq /home/CAM/mxu/tutorial/p3/dataset/DRR021677_2.fastq
#pacbio reads must be in a single fasta file! make sure you provide absolute path
#PACBIO=/FULL_PATH/pacbio.fa
#OTHER=/FULL_PATH/file.frg
END
PARAMETERS
#set this to 1 if your Illumina jumping library reads are shorter than 100bp
#EXTEND_JUMP_READS=0
#this is k-mer size for deBruijn graph values between 25 and 127 are supported, auto will compute the optimal size based on the read data and GC content
GRAPH_KMER_SIZE = auto
#set this to 1 for all Illumina-only assemblies
#set this to 1 if you have less than 20x long reads (454, Sanger, Pacbio) and less than 50x CLONE coverage by Illumina, Sanger or 454 mate pairs
#otherwise keep at 0
USE_LINKING_MATES = 0
#specifies whether to run mega-reads correction on the grid
USE_GRID=0
#specifies queue to use when running on the grid MANDATORY
GRID_QUEUE=all.q
#batch size in the amount of long read sequence for each batch on the grid
GRID_BATCH_SIZE=300000000
#coverage by the longest Long reads to use
##LHE_COVERAGE=30
#this parameter is useful if you have too many Illumina jumping library mates. Typically set it to 60 for bacteria and 300 for the other organisms
LIMIT_JUMP_COVERAGE = 300
#these are the additional parameters to Celera Assembler. do not worry about performance, number or processors or batch sizes -- these are computed automatically.
#set cgwErrorRate=0.25 for bacteria and 0.1<=cgwErrorRate<=0.15 for other organisms.
CA_PARAMETERS = cgwErrorRate=0.15
#minimum count k-mers used in error correction 1 means all k-mers are used. one can increase to 2 if Illumina coverage >100
KMER_COUNT_THRESHOLD = 1
#whether to attempt to close gaps in scaffolds with Illumina data
CLOSE_GAPS=1
#auto-detected number of cpus to use
NUM_THREADS = 16
#this is mandatory jellyfish hash size -- a safe value is estimated_genome_size*estimated_coverage
JF_SIZE = 200000000
#set this to 1 to use SOAPdenovo contigging/scaffolding module. Assembly will be worse but will run faster. Useful for very large (>5Gbp) genomes from Illumina-only data
SOAP_ASSEMBLY=0
END</pre>
For butterfly, we use 4 libraries as input. 2 pair-end libraries start with flag <code>PE</code> and 2 mate-pair
libraries start with flag <code>JUMP</code>. The configuration script is located at <code>MaSuRCA/Butterfly/config</code>
The script to submit assembly job is at <code>MaSuRCA/Butterfly/ma_assemble.sh</code>
<h2 id="Sixth_Point_Header">Platanus: PLATform for Assembling NUcleotide Sequences</h2>
Platanus is a novel de novo sequence assembler that can reconstruct genomic sequences of
highly heterozygous diploids from massively parallel shotgun sequencing data. It consists of 3 separate commands: <code>assemble</code>,
<code>scaffold</code>, and <code>gapclose</code>. We will go through these commands in this section. Enter command below
to load Platanus module on Xanadu.
<pre>module load platanus/1.2.4</pre>
<b>Bacteria</b>
First we need assemble contigs from trimmed reads.
<pre>platanus assemble -o sample -f dataset/Bacteria/Sample_[12s].fastq -t 16 -m 128 2</pre>
<ul>
<li><code>-o</code>: output prefix</li>
<li><code>-f</code>: path to input reads</li>
<li><code>-t</code>: number of cpus to use </li>
<li><code>-m</code>: Amount of memory to use (GB)</li>
</ul>
The command above takes 3 input files with forward reads, reverse reads, and singles reads respectively.
Assembled contigs will be saved in <code>sample_contig.fa</code> when it is completed.
<pre>platanus scaffold -o sample -c sample_contig.fa -b sample_contigBubble.fa -IP1 ../../dataset/Bacteria/Sample_1.fastq ../../dataset/Bacteria/Sample_2.fastq -t 16</pre>
<ul>
<li><code>-o</code>: output prefix</li>
<li><code>-c</code>: path to input assembled configs</li>
<li><code>-b</code>: path to contig bubbles</li>
<li><code>-IP1</code>: path to input paired-end reads</li>
<li><code>-t</code>: number of cpus to use </li>
<li><code>-m</code>: Amount of memory to use (GB)</li>
</ul>
The <code>scaffold</code> command generates a scaffolds file <code>sample_scaffold.fa</code>. With scaffolds,
we can proceed to <code>gapclose</code>.
<pre>platanus gap_close -o sample -c sample_scaffold.fa -IP1 ../../dataset/Bacteria/Sample_1.fastq ../../dataset/Bacteria/Sample_2.fastq -t 16</pre>
<ul>
<li><code>-o</code>: output prefix</li>
<li><code>-c</code>: path to input scaffolds</li>
<li><code>-IP1</code>: path to input paired-end reads</li>
<li><code>-t</code>: number of cpus to use </li>
<li><code>-m</code>: Amount of memory to use (GB)</li>
</ul>
It outputs <code>sample_gapClose.fa</code> which contains gap closed sequences.
The slurm script with all three step is at <code>Platanus/Bacteria/platanus.sh/</code>.
<b>swallowtail</b>
We choose library <code>DRR021673</code> and <code>DRR021674</code> as input of assembly.
<pre>platanus assemble -o Pxut -f /home/CAM/mxu/tutorial/p3/dataset/DRR02167[34]_[12].fastq -t 16 -m 128</pre>
<ul>
<li><code>-o</code>: output prefix</li>
<li><code>-f</code>: path to input reads</li>
<li><code>-t</code>: number of cpus to use </li>
<li><code>-m</code>: Amount of memory to use (GB)</li>
</ul>
Assembled contigs will be saved in <code>Pxut_contig.fa</code> when it is completed.
<pre>platanus scaffold -o Pxut -c Pxut_contig.fa -b Pxut_contigBubble.fa -IP1 ../../dataset/Butterfly/DRR021673_1.fastq ../../dataset/Butterfly/DRR021673_2.fastq -IP2 ../../dataset/Butterfly/DRR021674_1.fastq ../../dataset/Butterfly/DRR021674_2.fastq -t 16</pre>
<ul>
<li><code>-o</code>: output prefix</li>
<li><code>-c</code>: path to input assembled configs</li>
<li><code>-b</code>: path to contig bubbles</li>
<li><code>-IP1/-IP2</code>: path to input paired-end reads</li>
<li><code>-t</code>: number of cpus to use </li>
<li><code>-m</code>: Amount of memory to use (GB)</li>
</ul>
The <code>scaffold</code> command generates a scaffolds file <code>Pxut_scaffold.fa</code>. With scaffolds,
we can proceed to <code>gapclose</code>.
<pre>platanus gap_close -o Pxut -c Pxut_scaffold.fa -IP1 ../../dataset/Butterfly/DRR021673_1.fastq ../../dataset/Butterfly/DRR021673_2.fastq -IP2 ../../dataset/Butterfly/DRR021674_1.fastq ../../dataset/Butterfly/DRR021674_2.fastq -t 16</pre>
<ul>
<li><code>-o</code>: output prefix</li>
<li><code>-c</code>: path to input scaffolds</li>
<li><code>-IP1/-IP2</code>: path to input paired-end reads</li>
<li><code>-t</code>: number of cpus to use </li>
<li><code>-m</code>: Amount of memory to use (GB)</li>
</ul>
It outputs <code>Pxut_gapClose.fa</code> which contains gap closed sequences.
The slurm script with all three step is at <code>Platanus/Butterfly/platanus.sh/</code>.
<h2 id = "EnTAP">Quast: Quality Assessment Tool for Genome Assemblies</h2>
Now that we have several assemblies, it’s time to analyze the quality of each assembly. SOAPdenovo has
its own statistics output, but for consistency, we will be using the program QUAST. The statistics we are most
interested in are number of contigs, total length, and N50. A good assembly would have a low number of contigs,
a total length that makes sense for the species, and a high N50 value. To run quast on all of our final assembly
files we will run the following commands, with the only parameters used being the name of the scaffold file(s) and
output directory.
To load QUAST on Xanadu
<pre>module load quast/4.6</pre>
Sample command that processes output of SOAPdenovo with QUAST.
<pre>python quast.py -t 8 ../../SOAPdenovo/Bacteria/graph_Sample_31.scafSeq -o SOAP</pre>
<ul>
<li><code>-o</code>: path to output directory</li>
<li><code>-t</code>: number of CPU to use</li>
</ul>
QUAST’s output consists of a directory containing results in multiple formats. For statistics such as
contigs, total length, and N50, we can check <code>report.txt</code> by using <code>less</code> command, or we can download
the output from cluster and open the interactive report in HTML format with a web browser (Optional).
<pre>scp -r your-username@xanadu-submit-ext.cam.uchc.edu:/path/to/QUAST/output .</pre>
<b>Bacteria</b>
<table dir="ltr" border="1" cellspacing="0" cellpadding="0"><colgroup> <col width="100" /> <col width="100" /> <col width="100" /> <col width="100" /> <col width="100" /> <col width="100" /></colgroup>
<tbody>
<tr>
<td>Assembly</td>
<td># contigs</td>
<td>Largest contig</td>
<td>Total length</td>
<td>GC (%)</td>
<td>N50</td>
</tr>
<tr>
<td>SOAPdenovo</td>
<td>276</td>
<td>103125</td>
<td>3574101</td>
<td>32.44</td>
<td>26176</td>
</tr>
<tr>
<td>SPAdes</td>
<td>59</td>
<td>255551</td>
<td>2880184</td>
<td>32.65</td>
<td>147660</td>
</tr>
<tr>
<td>MaSuRCA</td>
<td>110</td>
<td>148785</td>
<td>2891062</td>
<td>32.60</td>
<td>45141</td>
</tr>
<tr>
<td>Platanus</td>
<td>631</td>
<td>66346</td>
<td>2804614</td>
<td>32.80</td>
<td>14143</td>
</tr>
</tbody>
</table>
The script to run QUAST on bacterial results is located at <code>Quast/Bacteria/quast.sh/</code>.
<b>Butterfly</b>
<table dir="ltr" border="1" cellspacing="0" cellpadding="0"><colgroup> <col width="100" /> <col width="100" /> <col width="100" /> <col width="100" /> <col width="100" /> <col width="100" /></colgroup>
<tbody>
<tr>
<td>Assembly</td>
<td># contigs</td>
<td>Largest contig</td>
<td>Total length</td>
<td>GC (%)</td>
<td>N50</td>
</tr>
<tr>
<td>SOAPdenovo</td>
<td>31897</td>
<td>6312</td>
<td>321848681</td>
<td>33.56</td>
<td>831</td>
</tr>
<tr>
<td>SPAdes</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>MaSuRCA</td>
<td>33583</td>
<td>521760</td>
<td>426261801</td>
<td>33.93</td>
<td>20460</td>
</tr>
<tr>
<td>Platanus</td>
<td>28736</td>
<td>632864</td>
<td>236681996</td>
<td>33.81</td>
<td>62393</td>
</tr>
</tbody>
</table>
The script to run QUAST on swallowtail's results is located at <code>Quast/Butterfly/quast.sh/</code>.
<h2 id="Citation">Citations</h2>
Bankevich, A., Nurk, S., Antipov, D., Gurevich, A. A., Dvorkin, M., Kulikov, A. S., … Pevzner, P. A. (2012). SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology, 19(5), 455–477. http://doi.org/10.1089/cmb.2012.0021
Gurevich, A., Saveliev, V., Vyahhi, N., & Tesler, G. (2013). QUAST: quality assessment tool for genome assemblies. Bioinformatics, 29(8), 1072–1075. http://doi.org/10.1093/bioinformatics/btt086
Kajitani R, Toshimoto K, Noguchi H, Toyoda A, Ogura Y, Okuno M, Yabana M, Harada M, Nagayasu E, Maruyama H, Kohara Y, Fujiyama A, Hayashi T, Itoh T, “Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads”. Genome Res. 2014 Aug;24(8):1384-95. doi: 10.1101/gr.170720.113.
Luo, R., Liu, B., Xie, Y., Li, Z., Huang, W., Yuan, J., … Wang, J. (2012). SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience, 1, 18. http://doi.org/10.1186/2047-217X-1-18
Zimin, A. et al. The MaSuRCA genome Assembler. Bioinformatics (2013). doi:10.1093/bioinformatics/btt476