Canu Quick Start
Published: 2019-06-13

  • (This covers an older version of Canu.)

Canu specializes in assembling PacBio or Oxford Nanopore sequences. Canu will correct the reads, then trim suspicious regions (such as remaining SMRTbell adapter), then assemble the corrected and cleaned reads into unitigs. In short, Canu is built for assembling third-generation (long) reads in three steps: correct, trim, assemble.

Brief Introduction

Canu has been designed to auto-detect your hardware resources and scale itself to fit. Two parameters let you restrict the resources used.

maxMemory=XX
maxThreads=XX

Memory is specified in gigabytes. On a single machine, these parameters restrict Canu to at most this limit; on the grid (a compute cluster), no single job will try to use more than the specified resources.
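As a rough illustration of the two limits, the sketch below computes values that leave some headroom on a shared machine and prints them in the form Canu expects. The helper name canu_limits and the default headroom (2 GB, 1 core) are assumptions made for this example, not Canu defaults.

```shell
#!/bin/sh
# Hypothetical helper: print maxMemory/maxThreads options that keep part
# of the machine free. Arguments: total memory in GB, total cores, and
# optionally the GB and cores to leave unused.
canu_limits() (
    total_gb=$1
    total_cores=$2
    spare_gb=${3:-2}
    spare_cores=${4:-1}
    mem=$((total_gb - spare_gb))
    thr=$((total_cores - spare_cores))
    # Never go below 1 GB or 1 thread.
    if [ "$mem" -lt 1 ]; then mem=1; fi
    if [ "$thr" -lt 1 ]; then thr=1; fi
    echo "maxMemory=$mem maxThreads=$thr"
)

# Example: a 32 GB, 8-core workstation.
canu_limits 32 8
# The printed options can be appended to a run, for example:
#   canu -p ecoli -d ecoli-auto genomeSize=4.8m $(canu_limits 32 8) -pacbio-raw p6.25x.fastq
```

Without these options Canu sizes itself to whatever it detects, so setting them mainly matters when the machine is shared with other work.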

The input sequences can be in FASTA or FASTQ format (FASTQ also carries per-base quality values), either uncompressed or compressed with gz, bz2 or xz.
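Before starting a long run it can help to confirm that an input file is readable in one of these formats. The sketch below counts reads in a FASTA or FASTQ file and picks a decompressor from the file extension. The helper name count_reads is hypothetical, and it assumes 4-line FASTQ records.

```shell
#!/bin/sh
# Hypothetical helper: count reads in a FASTA/FASTQ file, handling
# gz, bz2 and xz compression based on the file extension.
count_reads() (
    file=$1
    case $file in
        *.gz)  reader="gzip -cd" ;;
        *.bz2) reader="bzip2 -cd" ;;
        *.xz)  reader="xz -cd" ;;
        *)     reader="cat" ;;
    esac
    # The first character decides the format: '@' means FASTQ,
    # otherwise '>' header lines are counted as FASTA records.
    $reader "$file" | awk '
        NR == 1 { fastq = (substr($0, 1, 1) == "@") }
        fastq   { if (NR % 4 == 1) n++; next }
        /^>/    { n++ }
        END     { print n + 0 }'
)

# Usage (file name taken from the tutorial):
#   count_reads p6.25x.fastq
```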

Running on the grid

Canu is designed to run on grid (cluster) environments (LSF/PBS/Torque/Slurm/SGE are supported). Currently, Canu will submit itself to the default queue with default time options. (Blogger's note: for this reason, unless you configure the submission options yourself, you have to turn grid mode off with useGrid=false, otherwise the job cannot be submitted to the cluster.) You can override this behavior by providing any specific parameters you want to be used for submission as an option. Users should also specify a job name to use on the grid:

gridOptionsJobName=myassembly
"gridOptions=--partition quick --time 2:00"
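The quotes around the gridOptions setting matter because its value contains spaces. The short sketch below, which uses a throwaway helper named count_args (not part of Canu), shows how the shell splits the same text with and without quotes.

```shell
#!/bin/sh
# Print how many separate arguments a command would receive.
count_args() { echo "$#"; }

# Quoted: the whole gridOptions setting arrives as one argument.
count_args "gridOptions=--partition quick --time 2:00"    # prints 1

# Unquoted: the shell splits it at the spaces into four arguments.
count_args gridOptions=--partition quick --time 2:00      # prints 4
```

In the unquoted form only the first word would still belong to gridOptions, which is why the option is written inside double quotes.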

 

Assembling PacBio data

Pacific Biosciences released P6-C4 chemistry reads. You can download them (7 GB) or from the . You must have the PacBio SMRTpipe software installed to extract the reads as FASTQ.

We made a 25X subset FASTQ available (test data)

or use the following curl command:

curl -L -o p6.25x.fastq http://gembox.cbcb.umd.edu/mhap/raw/ecoli_p6_25x.filtered.fastq

Correct, Trim and Assemble

By default, canu will correct the reads, then trim the reads, then assemble the reads to unitigs.

In other words, the default is a one-stop run: all three stages are done automatically.

canu \
 -p ecoli -d ecoli-auto \
 genomeSize=4.8m \
 -pacbio-raw p6.25x.fastq

The same run as a PBS job script (the paths and the input file name are specific to the blogger's cluster):

#PBS -N R498_CANU
#PBS -j oe
#PBS -l nodes=1:ppn=4
#PBS -l mem=30gb
#PBS -q low
cd $PBS_O_WORKDIR
date
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/public/software/gcc-4.9.2/lib64
/public/software/canu-master/Linux-amd64/bin/canu \
        -p ecoli -d ecoli-auto \
        genomeSize=4.8m \
        -pacbio-raw ecoli_p6_25x.filtered.fastq

This will use the prefix ‘ecoli’ to name files, compute the correction task in directory ‘ecoli-auto/correction’, the trimming task in directory ‘ecoli-auto/trimming’, and the unitig construction stage in ‘ecoli-auto’ itself. Output files are described in the next section.

Find the Output

The canu progress chatter records statistics such as an input read histogram, corrected read histogram, and overlap types. Outputs from the assembly tasks are in:

ecoli*/ecoli.correctedReads.fasta.gz (corrected reads; can be viewed directly with less)
The sequences after correction, trimmed and split based on consensus evidence. Identity is typically >99% for PacBio and >98% for Nanopore, but it can vary based on your input sequencing quality.
ecoli*/ecoli.trimmedReads.fasta.gz (trimmed reads)
The sequences after correction and final trimming. The corrected sequences above are overlapped again to identify any missed hairpin adapters or bad sequence that could not be detected in the raw sequences.
ecoli*/ecoli.layout (file from the layout stage)
The layout provides information on where each read ended up in the final assembly, including contig and positions. It also includes the consensus sequence for each contig.
ecoli*/ecoli.gfa (graph file)
The GFA is the assembly graph generated by Canu. Currently this includes the contigs, associated bubbles, and any overlaps which were not used by the assembly.

The fasta output is split into three types:

1. ecoli*/asm.contigs.fasta (the final contig file; each header line carries several tags)

Everything which could be assembled and is part of the primary assembly, including both unique and repetitive elements. Each contig has several flags included on the fasta def line:

>tig######## len= reads= covStat= gappedBases= class= suggestRepeat= suggestCircular=

For example:

>tig00000000 len=110432 reads=231 covStat=181.52 gappedBases=no class=contig suggestRepeat=no suggestCircular=no
TGAAAACACCAGTCGGTGGCAGACAAGGCGTCGGTCGGTGGAAGTGTAGACGCCCAACAACGGCAGCATAATAGGTCAGCCGTGCAGGCGGAGACACCAG
The tags above are explained in detail below:
len
Length of the sequence, in bp.
reads
Number of reads used to form the contig.
covStat (blogger's note: roughly log(unique / two-copy)?)
The log of the ratio of the contig being unique versus being two-copy, based on the read arrival rate. Positive values indicate more likely to be unique, while negative values indicate more likely to be repetitive. See in .
gappedBases
If yes, the sequence includes all gaps in the multialignment.
class
Type of sequence. Unassembled sequences are primarily low-coverage sequences spanned by a single read.
suggestRepeat
If yes, sequence was detected as a repeat based on graph topology or read overlaps to other sequences.
suggestCircular
If yes, sequence is likely circular. Not implemented.
2. ecoli*/asm.bubbles.fasta (bubble information)

Alternate paths in the graph which could not be merged into the primary assembly.

3. ecoli*/asm.unassembled.fasta (neither assembled nor part of a bubble)

Reads which could not be incorporated into the primary or bubble assemblies.
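Because the tags sit on the FASTA def line as key=value pairs, they are easy to pull out with standard tools. The sketch below prints one tag for every contig; the helper name contig_tag is hypothetical, and it assumes the space-separated header layout shown in the example above.

```shell
#!/bin/sh
# Hypothetical helper: read FASTA on stdin and print "name<TAB>value"
# for the requested tag on every header line.
contig_tag() {
    awk -v tag="$1" '
        /^>/ {
            name = substr($1, 2)            # drop the leading ">"
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")          # kv[1] = key, kv[2] = value
                if (kv[1] == tag) print name "\t" kv[2]
            }
        }'
}

# Example with the header shown above (sequence shortened):
printf '%s\n%s\n' \
    '>tig00000000 len=110432 reads=231 covStat=181.52 gappedBases=no class=contig suggestRepeat=no suggestCircular=no' \
    'TGAAAACACCAGTCGG' | contig_tag covStat

# Usage on a real assembly, e.g. to list contigs flagged as repeats:
#   contig_tag suggestRepeat < ecoli-auto/asm.contigs.fasta | awk '$2 == "yes"'
```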

Correct, Trim and Assemble, Manually

Sometimes, however, it makes sense to do the three top-level tasks by hand. This would allow trying multiple unitig construction parameters on the same set of corrected and trimmed reads.

First, correct the raw reads:

canu -correct \
  -p ecoli -d ecoli \
  genomeSize=4.8m \
  -pacbio-raw  p6.25x.fastq

Then, trim the output of the correction:

canu -trim \
  -p ecoli -d ecoli \
  genomeSize=4.8m \
  -pacbio-corrected ecoli/correction/ecoli.correctedReads.fasta.gz

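And finally, assemble the output of trimming twice, once for each errorRate value being compared (this is the point of running the stages by hand: several unitig construction settings can be tried on the same corrected and trimmed reads):

canu -assemble \
  -p ecoli -d ecoli-erate-0.013 \
  genomeSize=4.8m \
  errorRate=0.013 \
  -pacbio-corrected ecoli/trimming/ecoli.trimmedReads.fasta.gz

canu -assemble \
  -p ecoli -d ecoli-erate-0.025 \
  genomeSize=4.8m \
  errorRate=0.025 \
  -pacbio-corrected ecoli/trimming/ecoli.trimmedReads.fasta.gz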

The directory layout for correction and trimming is exactly the same as when we ran all tasks in the same command. Each unitig construction task needs its own private work space, and in there the ‘correction’ and ‘trimming’ directories are empty. The error rate always specifies the error in the corrected reads which is typically <1% for PacBio data and <2% for Nanopore data (<1% on newest chemistries).
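When more than two settings are compared, the assemble step can be generated in a loop so that each run gets its own work directory. The sketch below only prints the commands (a dry run); the function name print_assemble_cmds is hypothetical, and the options are the ones used in the commands above.

```shell
#!/bin/sh
# Hypothetical helper: print one "canu -assemble" command per error rate
# given as an argument. Nothing is executed; pipe the output to sh to
# actually run the commands.
print_assemble_cmds() {
    for erate in "$@"; do
        echo canu -assemble \
            -p ecoli -d "ecoli-erate-$erate" \
            genomeSize=4.8m \
            errorRate="$erate" \
            -pacbio-corrected ecoli/trimming/ecoli.trimmedReads.fasta.gz
    done
}

# The two error rates used in this tutorial:
print_assemble_cmds 0.013 0.025
```

Keeping the error rate in the directory name, as the tutorial does, makes it easy to tell the resulting assemblies apart afterwards.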

Assembling Oxford Nanopore data

A set of E. coli runs was released by the Loman lab. You can download one or any of them from the . (test data)

or use the following curl command:

curl -L -o oxford.fasta http://nanopore.s3.climb.ac.uk/MAP006-PCR-1_2D_pass.fasta

Canu assembles any of the four available datasets into a single contig but we picked one dataset to use in this tutorial. Then, assemble the data as before:

canu \
 -p ecoli -d ecoli-oxford \
 genomeSize=4.8m \
 -nanopore-raw oxford.fasta

The assembled identity is >99% before polishing.

Assembling With Multiple Technologies/Files

Canu takes an arbitrary number of input files/formats. We made a mixed dataset of about 10X of a PacBio P6 and 10X of an Oxford Nanopore run available (test data)

or use the following curl command:

curl -L -o mix.tar.gz http://gembox.cbcb.umd.edu/mhap/raw/ecoliP6Oxford.tar.gz
tar xvzf mix.tar.gz

Now you can assemble all the data:

canu \
 -p ecoli -d ecoli-mix \
 genomeSize=4.8m \
 -pacbio-raw pacbio*fastq.gz \
 -nanopore-raw oxford.fasta.gz

Assembling Low Coverage Datasets

When you have 30X or less coverage, it helps to adjust the Canu assembly parameters. Typically, assembling 20X of single-molecule data outperforms hybrid methods with higher coverage. You can download a 20X subset of yeast data (test data)

or use the following curl command:

curl -L -o yeast.20x.fastq.gz http://gembox.cbcb.umd.edu/mhap/raw/yeast_filtered.20x.fastq.gz

and run the assembler adding sensitive parameters (errorRate=0.035):

canu \
 -p asm -d yeast \
 genomeSize=12.1m \
 errorRate=0.035 \
 -pacbio-raw yeast.20x.fastq.gz
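To judge whether a dataset falls at or under the 30X mark, coverage can be estimated as the total number of read bases divided by the genome size. The sketch below does this for a FASTQ file; the helper name fastq_coverage is hypothetical, and it assumes 4-line FASTQ records.

```shell
#!/bin/sh
# Hypothetical helper: estimate coverage as
#   (total bases in reads) / (genome size in bp).
# $1 is a FASTQ file ("-" for stdin), $2 the genome size in bp.
fastq_coverage() {
    awk -v g="$2" '
        NR % 4 == 2 { bases += length($0) }   # sequence lines only
        END         { printf "%.1f\n", bases / g }' "$1"
}

# Usage with the yeast subset and genomeSize=12.1m from this section:
#   gzip -cd yeast.20x.fastq.gz | fastq_coverage - 12100000
```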

After the run completes, we can check the assembly statistics:

tgStoreDump -sizes -s 12100000 -T yeast/unitigging/asm.tigStore 2 -G yeast/unitigging/asm.gkpStore
lenSuggestRepeat sum     160297 (genomeSize 12100000)
lenSuggestRepeat num         12
lenSuggestRepeat ave      13358
lenUnassembled ng10       13491 bp   lg10      77   sum    1214310 bp
lenUnassembled ng20       11230 bp   lg20     176   sum    2424556 bp
lenUnassembled ng30        9960 bp   lg30     290   sum    3632411 bp
lenUnassembled ng40        8986 bp   lg40     418   sum    4841978 bp
lenUnassembled ng50        8018 bp   lg50     561   sum    6054460 bp
lenUnassembled ng60        7040 bp   lg60     723   sum    7266816 bp
lenUnassembled ng70        6169 bp   lg70     906   sum    8474192 bp
lenUnassembled ng80        5479 bp   lg80    1114   sum    9684981 bp
lenUnassembled ng90        4787 bp   lg90    1348   sum   10890099 bp
lenUnassembled ng100       4043 bp   lg100   1624   sum   12103239 bp
lenUnassembled ng110       3323 bp   lg110   1952   sum   13310167 bp
lenUnassembled ng120       2499 bp   lg120   2370   sum   14520362 bp
lenUnassembled ng130       1435 bp   lg130   2997   sum   15731198 bp
lenUnassembled sum   16139888 (genomeSize 12100000)
lenUnassembled num       3332
lenUnassembled ave       4843
lenContig ng10      770772 bp   lg10       2   sum    1566457 bp
lenContig ng20      710140 bp   lg20       4   sum    3000257 bp
lenContig ng30      669248 bp   lg30       5   sum    3669505 bp
lenContig ng40      604859 bp   lg40       7   sum    4884914 bp
lenContig ng50      552911 bp   lg50      10   sum    6571204 bp
lenContig ng60      390415 bp   lg60      12   sum    7407061 bp
lenContig ng70      236725 bp   lg70      16   sum    8521520 bp
lenContig ng80      142854 bp   lg80      23   sum    9768299 bp
lenContig ng90       94308 bp   lg90      33   sum   10927790 bp
lenContig sum   12059140 (genomeSize 12100000)
lenContig num         56
lenContig ave     215341
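The ng and lg columns can be read as follows: with sequences sorted from longest to shortest, ngX is the length of the sequence at which the running total first reaches X% of the genome size, and lgX is how many sequences were needed to get there. The sketch below recomputes both from a plain list of lengths. The helper name ngx is hypothetical and this is an independent illustration of the statistic, not the code tgStoreDump uses.

```shell
#!/bin/sh
# Hypothetical helper: compute NGx and LGx from sequence lengths given
# one per line on stdin. $1 is the genome size in bp, $2 the percentage.
ngx() {
    sort -rn | awk -v g="$1" -v x="$2" '
        {
            sum += $1
            # First sequence whose running total reaches x% of the genome.
            if (sum >= g * x / 100) {
                print "ng" x, $1, "lg" x, NR
                found = 1
                exit
            }
        }
        END { if (!found) print "ng" x, "NA" }'
}

# Toy example: five contigs, genome size 15 bp.
printf '1\n5\n3\n2\n4\n' | ngx 15 50    # prints: ng50 4 lg50 2
```

This also explains why the contig rows stop at ng90 (the contigs sum to slightly less than the genome size) while the unassembled rows run past ng100 (their total exceeds it).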

Consensus Accuracy

While Canu corrects sequences and has 99% identity or greater with PacBio or Nanopore sequences, for the best accuracy we recommend polishing with a sequence-specific tool. We recommend for PacBio and for Oxford Nanopore data.

If you have Illumina sequences available, can also be used to polish either PacBio or Oxford Nanopore assemblies.

Further Reading

See the page for commonly-asked questions and the . notes page for information on what’s changed and known issues.

Reposted from: https://www.cnblogs.com/leezx/p/5707595.html
