CD-HIT:序列聚类去冗余
简介
CD-HIT
通过序列聚类以降低序列冗余性,提高后续分析效率。其
文档写得非常详细,非常容易上手。
安装
wget https://github.com/weizhongli/cdhit/releases/download/V4.8.1/cd-hit-v4.8.1-2019-0228.tar.gz ## 下载cd-hit
tar -zxvf cd-hit-v4.8.1-2019-0228.tar.gz ## 解压
cd cd-hit-v4.8.1-2019-0228
make ## 编译
cd cd-hit-auxtools
make ## 编译
#Conda安装
conda install -c bioconda cd-hit
使用
CD-HIT包含以下程序:
* cd-hit Cluster peptide sequences
* cd-hit-est Cluster nucleotide sequences
* cd-hit-2d Compare 2 peptide databases
* cd-hit-est-2d Compare 2 nucleotide databases
* psi-cd-hit Cluster proteins at <40% cutoff
* cd-hit-lap Identify overlapping reads
* cd-hit-dup Identify duplicates from single or paired Illumina reads
* cd-hit-454 Identify duplicates from 454 reads
* cd-hit-otu Cluster rRNA tags
* cd-hit Web server Cluster user-uploaded data
* cd-hit-para Cluster sequences in parallel on a computer cluster
* scripts Parse results and so on
* h-cd-hit Hierarchical clustering
cd-hit
基本用法
cd-hit
用于蛋白序列聚类。
Basic command:
cd-hit -i nr -o nr100 -c 1.00 -n 5 -M 16000 –d 0 -T 8
cd-hit -i db -o db90 -c 0.9 -n 5 -M 16000 –d 0 -T 8
where
db is the filename of input,
db90 is output,
-c 1.0, means 100% identity, is the clustering threshold
-c 0.9, means 90% identity, is the clustering threshold
-n 5 is the word size
-d 0 use sequence name in fasta header till the first white space
-M 16000, to use 16GB RAM
-T 8, to use 8 threads
其中:
- -i:输入文件,蛋白序列,fasta格式
- -o:输出文件,有两个,一个代表性序列文件,一个聚类文件
- -c:聚类阈值,1.0代表100%一致性,0.9代表90%一致性,以此类推
- -n:两两序列进行序列比对时选择的 word size,具体选值参考下面
- -d:0表示使用 fasta 标题中第一个空格前的字段作为序列名字
- -M:设置内存,16000,16G
- -t:设置线程数
还有很多参数,具体如下:
-i input filename in fasta format, required
-o output filename, required
-c sequence identity threshold, default 0.9
this is the default cd-hit's "global sequence identity" calculated as:
number of identical amino acids in alignment
divided by the full length of the shorter sequence
-G use global sequence identity, default 1
if set to 0, then use local sequence identity, calculated as :
number of identical amino acids in alignment
divided by the length of the alignment
NOTE!!! don't use -G 0 unless you use alignment coverage controls
see options -aL, -AL, -aS, -AS
-b band_width of alignment, default 20
-M memory limit (in MB) for the program, default 800; 0 for unlimitted;
-T number of threads, default 1; with 0, all CPUs will be used
-n word_length, default 5, see user's guide for choosing it
-l length of throw_away_sequences, default 10
-t tolerance for redundance, default 2
-d length of description in .clstr file, default 20
if set to 0, it takes the fasta defline and stops at first space
-s length difference cutoff, default 0.0
if set to 0.9, the shorter sequences need to be
at least 90% length of the representative of the cluster
-S length difference cutoff in amino acid, default 999999
if set to 60, the length difference between the shorter sequences
and the representative of the cluster can not be bigger than 60
-aL alignment coverage for the longer sequence, default 0.0
if set to 0.9, the alignment must covers 90% of the sequence
-AL alignment coverage control for the longer sequence, default 99999999
if set to 60, and the length of the sequence is 400,
then the alignment must be >= 340 (400-60) residues
-aS alignment coverage for the shorter sequence, default 0.0
if set to 0.9, the alignment must covers 90% of the sequence
-AS alignment coverage control for the shorter sequence, default 99999999
if set to 60, and the length of the sequence is 400,
then the alignment must be >= 340 (400-60) residues
-A minimal alignment coverage control for the both sequences, default 0
alignment must cover >= this value for both sequences
-uL maximum unmatched percentage for the longer sequence, default 1.0
if set to 0.1, the unmatched region (excluding leading and tailing gaps)
must not be more than 10% of the sequence
-uS maximum unmatched percentage for the shorter sequence, default 1.0
if set to 0.1, the unmatched region (excluding leading and tailing gaps)
must not be more than 10% of the sequence
-U maximum unmatched length, default 99999999
if set to 10, the unmatched region (excluding leading and tailing gaps)
must not be more than 10 bases
-B 1 or 0, default 0, by default, sequences are stored in RAM
if set to 1, sequence are stored on hard drive
!! No longer supported !!
-p 1 or 0, default 0
if set to 1, print alignment overlap in .clstr file
-g 1 or 0, default 0
by cd-hit's default algorithm, a sequence is clustered to the first
cluster that meet the threshold (fast cluster). If set to 1, the program
will cluster it into the most similar cluster that meet the threshold
(accurate but slow mode)
but either 1 or 0 won't change the representatives of final clusters
-sc sort clusters by size (number of sequences), default 0, output clusters by decreasing length
if set to 1, output clusters by decreasing size
-sf sort fasta/fastq by cluster size (number of sequences), default 0, no sorting
if set to 1, output sequences by decreasing cluster size
-bak write backup cluster file (1 or 0, default 0)
-h print this help
Choose of word size:
-n 5 for thresholds 0.7 ~ 1.0
-n 4 for thresholds 0.6 ~ 0.7
-n 3 for thresholds 0.5 ~ 0.6
-n 2 for thresholds 0.4 ~ 0.5
从上面可以看出,
cd-hit
只能完成40%以上的序列相似性的聚类。
cd-hit-2d
用法
cd-hit-2d
用于比较两个蛋白数据库。
Basic command:
cd-hit-2d -i db1 -i2 db2 -o db2novel -c 0.9 -n 5 -d 0 -M 16000 -T 8
where
db1 & db2 are inputs
db2novel is output
0.9 means 90% identity, is the comparing threshold
5 is the size of word
其他参数:
-i input filename for db1 in fasta format, required
-i2 input filename for db2 in fasta format, required
-o output filename, required
-c sequence identity threshold, default 0.9
this is the default cd-hit's "global sequence identity" calculated as:
number of identical amino acids in alignment
divided by the full length of the shorter sequence
-G use global sequence identity, default 1
if set to 0, then use local sequence identity, calculated as :
number of identical amino acids in alignment
divided by the length of the alignment
NOTE!!! don't use -G 0 unless you use alignment coverage controls
see options -aL, -AL, -aS, -AS
-b band_width of alignment, default 20
-M memory limit (in MB) for the program, default 800; 0 for unlimitted;
-T number of threads, default 1; with 0, all CPUs will be used
-n word_length, default 5, see user's guide for choosing it
-l length of throw_away_sequences, default 10
-t tolerance for redundance, default 2
-d length of description in .clstr file, default 20
if set to 0, it takes the fasta defline and stops at first space
-s length difference cutoff, default 0.0
if set to 0.9, the shorter sequences need to be
at least 90% length of the representative of the cluster
-S length difference cutoff in amino acid, default 999999
if set to 60, the length difference between the shorter sequences
and the representative of the cluster can not be bigger than 60
-s2 length difference cutoff for db1, default 1.0
by default, seqs in db1 >= seqs in db2 in a same cluster
if set to 0.9, seqs in db1 may just >= 90% seqs in db2
-S2 length difference cutoff, default 0
by default, seqs in db1 >= seqs in db2 in a same cluster
if set to 60, seqs in db2 may 60aa longer than seqs in db1
-aL alignment coverage for the longer sequence, default 0.0
if set to 0.9, the alignment must covers 90% of the sequence
-AL alignment coverage control for the longer sequence, default 99999999
if set to 60, and the length of the sequence is 400,
then the alignment must be >= 340 (400-60) residues
-aS alignment coverage for the shorter sequence, default 0.0
if set to 0.9, the alignment must covers 90% of the sequence
-AS alignment coverage control for the shorter sequence, default 99999999
if set to 60, and the length of the sequence is 400,
then the alignment must be >= 340 (400-60) residues
-A minimal alignment coverage control for the both sequences, default 0
alignment must cover >= this value for both sequences
-uL maximum unmatched percentage for the longer sequence, default 1.0
if set to 0.1, the unmatched region (excluding leading and tailing gaps)
must not be more than 10% of the sequence
-uS maximum unmatched percentage for the shorter sequence, default 1.0
if set to 0.1, the unmatched region (excluding leading and tailing gaps)
must not be more than 10% of the sequence
-U maximum unmatched length, default 99999999
if set to 10, the unmatched region (excluding leading and tailing gaps)
must not be more than 10 bases
-B 1 or 0, default 0, by default, sequences are stored in RAM
if set to 1, sequence are stored on hard drive
!! No longer supported !!
-p 1 or 0, default 0
if set to 1, print alignment overlap in .clstr file
-g 1 or 0, default 0
by cd-hit's default algorithm, a sequence is clustered to the first
cluster that meet the threshold (fast cluster). If set to 1, the program
will cluster it into the most similar cluster that meet the threshold
(accurate but slow mode)
but either 1 or 0 won't change the representatives of final clusters
-bak write backup cluster file (1 or 0, default 0)
-h print this help
cd-hit-est
与cd-hit-est-2d
则是用于核酸序列,用法大同小异,只是n
的取值有所差异
Choose of word size:
-n 10,11 for thresholds 0.95 ~ 1.0
-n 8,9 for thresholds 0.90 ~ 0.95
-n 7 for thresholds 0.88 ~ 0.9
-n 6 for thresholds 0.85 ~ 0.88
-n 5 for thresholds 0.80 ~ 0.85
-n 4 for thresholds 0.75 ~ 0.8
cd-hit-est
无法对长序列(比如基因组级别的序列)进行聚类。另外CD-HIT
运行速度非常快,因为是基于贪婪增量聚类算法。