CD-HIT:序列聚类去冗余

Mar 18, 2020 8 min read Bioinformatics

简介

CD-HIT通过序列聚类以降低序列冗余性，提高后续分析效率。其文档写得非常详细，非常容易上手。

安装

wget https://github.com/weizhongli/cdhit/releases/download/V4.8.1/cd-hit-v4.8.1-2019-0228.tar.gz    ## 下载cd-hit
tar -zxvf cd-hit-v4.8.1-2019-0228.tar.gz  ## 解压
cd cd-hit-v4.8.1-2019-0228
make  ## 编译
cd cd-hit-auxtools
make ## 编译

#Conda安装
conda install -c bioconda cd-hit

使用

CD-HIT包含以下程序：

 * cd-hit	        Cluster peptide sequences	
  * cd-hit-est	        Cluster nucleotide sequences
  * cd-hit-2d	        Compare 2 peptide databases	
  * cd-hit-est-2d	Compare 2 nucleotide databases
  * psi-cd-hit	        Cluster proteins at <40% cutoff	
  * cd-hit-lap	        Identify overlapping reads
  * cd-hit-dup          Identify duplicates from single or paired Illumina reads	
  * cd-hit-454          Identify duplicates from 454 reads 
  * cd-hit-otu	        Cluster rRNA tags	
  * cd-hit Web server	Cluster user-uploaded data 
  * cd-hit-para         Cluster sequences in parallel on a computer cluster	
  * scripts             Parse results and so on
  * h-cd-hit            Hierarchical clustering

`cd-hit`基本用法

cd-hit用于蛋白序列聚类。

Basic command:

  cd-hit -i nr -o nr100 -c 1.00 -n 5 -M 16000 –d 0 -T 8
  cd-hit -i db -o db90 -c 0.9 -n 5 -M 16000 –d 0 -T 8
  where
  db is the filename of input,
  db90 is output, 
  -c 1.0, means 100% identity, is the clustering threshold
  -c 0.9, means 90% identity, is the clustering threshold
  -n 5 is the word size
  -d 0 use sequence name in fasta header till the first white space
  -M 16000, to use 16GB RAM
  -T 8, to use 8 threads

其中：

-i：输入文件，蛋白序列，fasta格式
-o：输出文件，有两个，一个代表性序列文件，一个聚类文件
-c：聚类阈值，1.0代表100%一致性，0.9代表90%一致性，以此类推
-n：两两序列进行序列比对时选择的 word size，具体选值参考下面
-d：0表示使用 fasta 标题中第一个空格前的字段作为序列名字
-M：设置内存，16000，16G
-t：设置线程数

还有很多参数，具体如下：

-i	input filename in fasta format, required
   -o	output filename, required
   -c	sequence identity threshold, default 0.9
 	this is the default cd-hit's "global sequence identity" calculated as:
 	number of identical amino acids in alignment
 	divided by the full length of the shorter sequence
   -G	use global sequence identity, default 1
 	if set to 0, then use local sequence identity, calculated as :
 	number of identical amino acids in alignment
 	divided by the length of the alignment
 	NOTE!!! don't use -G 0 unless you use alignment coverage controls
 	see options -aL, -AL, -aS, -AS
   -b	band_width of alignment, default 20
   -M	memory limit (in MB) for the program, default 800; 0 for unlimitted;
   -T	number of threads, default 1; with 0, all CPUs will be used
   -n	word_length, default 5, see user's guide for choosing it
   -l	length of throw_away_sequences, default 10
   -t	tolerance for redundance, default 2
   -d	length of description in .clstr file, default 20
 	if set to 0, it takes the fasta defline and stops at first space
   -s	length difference cutoff, default 0.0
 	if set to 0.9, the shorter sequences need to be
 	at least 90% length of the representative of the cluster
   -S	length difference cutoff in amino acid, default 999999
 	if set to 60, the length difference between the shorter sequences
 	and the representative of the cluster can not be bigger than 60
   -aL	alignment coverage for the longer sequence, default 0.0
 	if set to 0.9, the alignment must covers 90% of the sequence
   -AL	alignment coverage control for the longer sequence, default 99999999
 	if set to 60, and the length of the sequence is 400,
 	then the alignment must be >= 340 (400-60) residues
   -aS	alignment coverage for the shorter sequence, default 0.0
 	if set to 0.9, the alignment must covers 90% of the sequence
   -AS	alignment coverage control for the shorter sequence, default 99999999
 	if set to 60, and the length of the sequence is 400,
 	then the alignment must be >= 340 (400-60) residues
   -A	minimal alignment coverage control for the both sequences, default 0
 	alignment must cover >= this value for both sequences 
   -uL	maximum unmatched percentage for the longer sequence, default 1.0
 	if set to 0.1, the unmatched region (excluding leading and tailing gaps)
 	must not be more than 10% of the sequence
   -uS	maximum unmatched percentage for the shorter sequence, default 1.0
 	if set to 0.1, the unmatched region (excluding leading and tailing gaps)
 	must not be more than 10% of the sequence
   -U	maximum unmatched length, default 99999999
 	if set to 10, the unmatched region (excluding leading and tailing gaps)
 	must not be more than 10 bases
   -B	1 or 0, default 0, by default, sequences are stored in RAM
 	if set to 1, sequence are stored on hard drive
 	!! No longer supported !!
   -p	1 or 0, default 0
 	if set to 1, print alignment overlap in .clstr file
   -g	1 or 0, default 0
 	by cd-hit's default algorithm, a sequence is clustered to the first 
 	cluster that meet the threshold (fast cluster). If set to 1, the program
 	will cluster it into the most similar cluster that meet the threshold
 	(accurate but slow mode)
 	but either 1 or 0 won't change the representatives of final clusters
   -sc	sort clusters by size (number of sequences), default 0, output clusters by decreasing length
 	if set to 1, output clusters by decreasing size
   -sf	sort fasta/fastq by cluster size (number of sequences), default 0, no sorting
 	if set to 1, output sequences by decreasing cluster size
   -bak	write backup cluster file (1 or 0, default 0)
   -h	print this help

Choose of word size:
-n 5 for thresholds 0.7 ~ 1.0
-n 4 for thresholds 0.6 ~ 0.7
-n 3 for thresholds 0.5 ~ 0.6
-n 2 for thresholds 0.4 ~ 0.5

从上面可以看出，cd-hit只能完成40%以上的序列相似性的聚类。

`cd-hit-2d`用法

cd-hit-2d用于比较两个蛋白数据库。

Basic command:

  cd-hit-2d -i db1 -i2 db2 -o db2novel -c 0.9 -n 5 -d 0 -M 16000 -T 8
  where 
  db1 & db2 are inputs
  db2novel is output
  0.9 means 90% identity, is the comparing threshold
  5 is the size of word

其他参数：

-i	input filename for db1 in fasta format, required
   -i2	input filename for db2 in fasta format, required
   -o	output filename, required
   -c	sequence identity threshold, default 0.9
 	this is the default cd-hit's "global sequence identity" calculated as:
 	number of identical amino acids in alignment
 	divided by the full length of the shorter sequence
   -G	use global sequence identity, default 1
 	if set to 0, then use local sequence identity, calculated as :
 	number of identical amino acids in alignment
 	divided by the length of the alignment
 	NOTE!!! don't use -G 0 unless you use alignment coverage controls
 	see options -aL, -AL, -aS, -AS
   -b	band_width of alignment, default 20
   -M	memory limit (in MB) for the program, default 800; 0 for unlimitted;
   -T	number of threads, default 1; with 0, all CPUs will be used
   -n	word_length, default 5, see user's guide for choosing it
   -l	length of throw_away_sequences, default 10
   -t	tolerance for redundance, default 2
   -d	length of description in .clstr file, default 20
 	if set to 0, it takes the fasta defline and stops at first space
   -s	length difference cutoff, default 0.0
 	if set to 0.9, the shorter sequences need to be
 	at least 90% length of the representative of the cluster
   -S	length difference cutoff in amino acid, default 999999
 	if set to 60, the length difference between the shorter sequences
 	and the representative of the cluster can not be bigger than 60
   -s2	length difference cutoff for db1, default 1.0
 	by default, seqs in db1 >= seqs in db2 in a same cluster
 	if set to 0.9, seqs in db1 may just >= 90% seqs in db2
   -S2	length difference cutoff, default 0
 	by default, seqs in db1 >= seqs in db2 in a same cluster
 	if set to 60, seqs in db2 may 60aa longer than seqs in db1
   -aL	alignment coverage for the longer sequence, default 0.0
 	if set to 0.9, the alignment must covers 90% of the sequence
   -AL	alignment coverage control for the longer sequence, default 99999999
 	if set to 60, and the length of the sequence is 400,
 	then the alignment must be >= 340 (400-60) residues
   -aS	alignment coverage for the shorter sequence, default 0.0
 	if set to 0.9, the alignment must covers 90% of the sequence
   -AS	alignment coverage control for the shorter sequence, default 99999999
 	if set to 60, and the length of the sequence is 400,
 	then the alignment must be >= 340 (400-60) residues
   -A	minimal alignment coverage control for the both sequences, default 0
 	alignment must cover >= this value for both sequences 
   -uL	maximum unmatched percentage for the longer sequence, default 1.0
 	if set to 0.1, the unmatched region (excluding leading and tailing gaps)
 	must not be more than 10% of the sequence
   -uS	maximum unmatched percentage for the shorter sequence, default 1.0
 	if set to 0.1, the unmatched region (excluding leading and tailing gaps)
 	must not be more than 10% of the sequence
   -U	maximum unmatched length, default 99999999
 	if set to 10, the unmatched region (excluding leading and tailing gaps)
 	must not be more than 10 bases
   -B	1 or 0, default 0, by default, sequences are stored in RAM
 	if set to 1, sequence are stored on hard drive
 	!! No longer supported !!
   -p	1 or 0, default 0
 	if set to 1, print alignment overlap in .clstr file
   -g	1 or 0, default 0
 	by cd-hit's default algorithm, a sequence is clustered to the first 
 	cluster that meet the threshold (fast cluster). If set to 1, the program
 	will cluster it into the most similar cluster that meet the threshold
 	(accurate but slow mode)
 	but either 1 or 0 won't change the representatives of final clusters
   -bak	write backup cluster file (1 or 0, default 0)
   -h	print this help

cd-hit-est与cd-hit-est-2d则是用于核酸序列，用法大同小异，只是n的取值有所差异

Choose of word size:

-n 10,11  for thresholds 0.95 ~ 1.0
-n 8,9    for thresholds 0.90 ~ 0.95
-n 7      for thresholds 0.88 ~ 0.9
-n 6      for thresholds 0.85 ~ 0.88
-n 5      for thresholds 0.80 ~ 0.85
-n 4      for thresholds 0.75 ~ 0.8

cd-hit-est无法对长序列（比如基因组级别的序列）进行聚类。另外CD-HIT运行速度非常快，因为是基于贪婪增量聚类算法。

reduce sequence redundancy cluster

Researcher

I am a PhD student of Crop Genetics and Breeding at the Zhejiang University Crop Science Lab. My research interests covers a range of issues:Population Genetics Evolution and Ecotype Divergence Analysis of Oilseed Rape, Genome-wide Association Study (GWAS) of Agronomic Traits.

CD-HIT:序列聚类去冗余

简介

安装

使用

CD-HIT包含以下程序：

cd-hit基本用法

cd-hit-2d用法

Researcher

`cd-hit`基本用法

`cd-hit-2d`用法