构建Consensus sequence
简介
Consensus sequence
又称为一致性序列、保守序列、共有序列等。维基百科的解释:
In molecular biology and bioinformatics, the consensus sequence (or canonical sequence) is the calculated order of most frequent residues, either nucleotide or amino acid, found at each position in a sequence alignment.
我个人理解就是通过序列比对在每个位置出现频率最高的碱基或者氨基酸。
所以如果你的序列比较少,就直接用Bioedit
进行比对,可以导出consensus sequence
。序列比较多的话使用软件批量运行比较好。请教一位老师之后推荐就用RepeatScout
这个软件比较好,简单实用,最主要的是准确性也很高。
安装
wget http://bix.ucsd.edu/repeatscout/RepeatScout-1.0.5.tar.gz
tar -xvfz RepeatScout-1.0.5.tar.gz
cd RepeatScout-1.0.5
make
用法
1. Create an l-mer (-1 option) frequency table for a given genome sequence
build_lmer_table -sequence input_genome_sequence.fas -freq output_lmer.frequency
l值可以使用默认的,也可以自行设置,具体选择如下:
The l value used in the build_lmer_table program is set to ceil (log_4 (L) +1) by default. Here, the ceil function is a function to round up a real number having a decimal point to an integer, log_4 is a logarithm of base 4, L is a length (bp) of a genomic sequence.
You can change the l-mer value using the -l parameter. At this time, when using the RepeatScout program, the same l-mer must be applied. RepeatScout also takes a value with the -l parameter.
2. Find repetitive elements from frequency table and genome sequence information and store them in fasta format (including simple repeats, tandem repeats).
这一步RepeatScout
构建Consensus sequence
RepeatScout -sequence input_genome_sequence.fas -output output_repeats.fas -freq output_lmer.frequency
最后生成的output_repeats.fas里面就有我们需要的Consensus sequence
,但是很多时候会生成多条Consensus sequence
,这个时候可以利用cd-hit-est
去除冗余序列,一般是可以获取到唯一的一条Consensus sequence
,如果经去冗余之后还是多条序列的话,可以考虑手动比对,看是否存在部分重叠等情况。当然如果提供的序列本来就不存在唯一的Consensus sequence
,那么最后肯定无法生成唯一的Consensus sequence
。