TEsorter:转座子分类

简介

TEsorter是基于保守蛋白结构域以及 REXdb数据库进行分类。TEsorter不仅仅可以对LTR-RTs进行分类,还可以对其它Class Ⅰ以及Class Ⅱ等类型的TE进行分类,只要 REXdb数据库覆盖了的TE都可以分类,具体如下:

      --mobile_element                             
          ¦--Class_I                                
          ¦   ¦--SINE                               
          ¦   ¦--LTR                                
          ¦   ¦   ¦--Ty1                            
          ¦   ¦   ¦   °--copia                      
          ¦   ¦   ¦       ¦--Ale                    
          ¦   ¦   ¦       ¦--Alesia                 
          ¦   ¦   ¦       ¦--Angela                 
          ¦   ¦   ¦       ¦--Bianca                 
          ¦   ¦   ¦       ¦--Bryco                  
          ¦   ¦   ¦       ¦--Lyco                   
          ¦   ¦   ¦       ¦--Gymco-III              
          ¦   ¦   ¦       ¦--Gymco-I                
          ¦   ¦   ¦       ¦--Gymco-II               
          ¦   ¦   ¦       ¦--Ikeros                 
          ¦   ¦   ¦       ¦--Ivana                  
          ¦   ¦   ¦       ¦--Gymco-IV               
          ¦   ¦   ¦       ¦--Osser                  
          ¦   ¦   ¦       ¦--SIRE                   
          ¦   ¦   ¦       ¦--TAR                    
          ¦   ¦   ¦       ¦--Tork                   
          ¦   ¦   ¦       °--Ty1-outgroup           
          ¦   ¦   °--Ty3                            
          ¦   ¦       °--gypsy                      
          ¦   ¦           ¦--non-chromovirus        
          ¦   ¦           ¦   ¦--non-chromo-outgroup
          ¦   ¦           ¦   ¦--Phygy              
          ¦   ¦           ¦   ¦--Selgy              
          ¦   ¦           ¦   °--OTA                
          ¦   ¦           ¦       ¦--Athila         
          ¦   ¦           ¦       °--Tat            
          ¦   ¦           ¦           ¦--TatI       
          ¦   ¦           ¦           ¦--TatII      
          ¦   ¦           ¦           ¦--TatIII     
          ¦   ¦           ¦           ¦--Ogre       
          ¦   ¦           ¦           °--Retand     
          ¦   ¦           °--chromovirus            
          ¦   ¦               ¦--Chlamyvir          
          ¦   ¦               ¦--Tcn1               
          ¦   ¦               ¦--chromo-outgroup    
          ¦   ¦               ¦--CRM                
          ¦   ¦               ¦--Galadriel          
          ¦   ¦               ¦--Tekay              
          ¦   ¦               ¦--Reina              
          ¦   ¦               °--chromo-unclass     
          ¦   ¦--pararetrovirus                     
          ¦   ¦--DIRS                               
          ¦   ¦--Penelope                           
          ¦   °--LINE                               
          °--Class_II                               
              ¦--Subclass_1                         
              ¦   °--TIR                            
              ¦       ¦--MITE                       
              ¦       ¦--EnSpm                      
              ¦       ¦   °--CACTA                  
              ¦       ¦--hAT                        
              ¦       ¦--Kolobok                    
              ¦       ¦--Merlin                     
              ¦       ¦--MuDR                       
              ¦       ¦   °--Mutator                
              ¦       ¦--Novosib                    
              ¦       ¦--P                          
              ¦       ¦--PIF                        
              ¦       ¦   °--Harbinger              
              ¦       ¦--PiggyBac                   
              ¦       ¦--Sola1                      
              ¦       ¦--Sola2                      
              ¦       °--Tc1                        
              ¦           °--Mariner                
              °--Subclass_2                         
                  °--Helitron   

安装

TEsorter基于python开发的,目前有两个版本, Python2版本以及 Python3版本, Python3版本安装起来更友好。直接conda一键安装(推荐)即可:

conda install -c bioconda tesorter

没有conda的话也可以按下面方法安装:

  • 依赖于Python3环境以及biopython
  • hmmscan 3.1x or 3.2xblast+,加入环境变量即可

最后安装TEsorter:

git clone https://github.com/NBISweden/TEsorter
cd TEsorter
python setup.py install

使用

使用很简单:

$ TEsorter  -h
usage: TEsorter [-h] [-v] [-db {rexdb,rexdb-plant,rexdb-metazoa,gydb}]
                   [-st {nucl,prot}] [-pre PREFIX] [-fw] [-p PROCESSORS]
                   [-tmp TMP_DIR] [-cov MIN_COVERAGE] [-eval MAX_EVALUE]
                   [-dp2] [-rule PASS2_RULE] [-nolib] [-norc] [-nocln]
                   sequence

positional arguments:
  sequence              input TE sequences in fasta format [required]

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -db {rexdb,rexdb-plant,rexdb-metazoa,gydb}, --hmm-database {rexdb,rexdb-plant,rexdb-metazoa,gydb}
                        the database used [default=rexdb]
  -st {nucl,prot}, --seq-type {nucl,prot}
                        'nucl' for DNA or 'prot' for protein [default=nucl]
  -pre PREFIX, --prefix PREFIX
                        output prefix [default='{-s}.{-db}']
  -fw, --force-write-hmmscan
                        if False, will use the existed hmmscan outfile and
                        skip hmmscan [default=False]
  -p PROCESSORS, --processors PROCESSORS
                        processors to use [default=4]
  -tmp TMP_DIR, --tmp-dir TMP_DIR
                        directory for temporary files [default=./tmp]
  -cov MIN_COVERAGE, --min-coverage MIN_COVERAGE
                        mininum coverage for protein domains in HMMScan output
                        [default=20]
  -eval MAX_EVALUE, --max-evalue MAX_EVALUE
                        maxinum E-value for protein domains in HMMScan output
                        [default=0.001]
  -dp2, --disable-pass2
                        do not further classify the unclassified sequences
                        [default=False for `nucl`, True for `prot`]
  -rule PASS2_RULE, --pass2-rule PASS2_RULE
                        classifying rule [identity-coverage-length] in pass-2
                        based on simliarity [default=80-80-80]
  -nolib, --no-library  do not generate a library file for RepeatMasker
                        [default=False]
  -norc, --no-reverse   do not reverse complement sequences if they are
                        detected in minus strand [default=False]
  -nocln, --no-cleanup  do not clean up the temporary directory
                        [default=False]

建议植物的话就直接使用rexdb-plant

结果

rice6.9.5.liban.rexdb.domtbl        HMMScan raw output
rice6.9.5.liban.rexdb.dom.faa       protein sequences of domain, which can be used for phylogenetic analysis.
rice6.9.5.liban.rexdb.dom.tsv       inner domains of TEs/LTR-RTs, which might be used to filter domains based on their scores and coverages.
rice6.9.5.liban.rexdb.dom.gff3      domain annotations in `gff3` format
rice6.9.5.liban.rexdb.cls.tsv       TEs/LTR-RTs classifications
    Column 1: raw id
    Column 2: Order, e.g. LTR
    Column 3: Superfamily, e.g. Copia
    Column 4: Clade, e.g. SIRE
    Column 5: Complete, "yes" means one LTR Copia/Gypsy element with full GAG-POL domains.
    Column 6: Strand, + or - or ?
    Column 7: Domains, e.g. GAG|SIRE PROT|SIRE INT|SIRE RT|SIRE RH|SIRE; `none` for pass-2 classifications
rice6.9.5.liban.rexdb.cls.lib       fasta library for RepeatMasker
rice6.9.5.liban.rexdb.cls.pep       the same sequences as `rice6.9.5.liban.rexdb.dom.faa`, but id is changed with classifications.

其中以.cls.tsv结尾的文件里面就有我们需要的分类信息。

限制

  • 针对每一个保守结构域,只会产生最佳比对的那一个
  • 很多TE无法被分类,可能是数据库不完善、TE中突变太多,丢失了保守结构域
  • 很多TE没有保守结构域,比如非自主性转座子(Non-autonomous TEs)以及一些不活跃的自主性转座子(un-active autonomous TEs)

其他

该软件还提供了一些进化分析、序列提取的脚本,有兴趣的可以去 主页浏览。

Researcher

I am a PhD student of Crop Genetics and Breeding at the Zhejiang University Crop Science Lab. My research interests covers a range of issues:Population Genetics Evolution and Ecotype Divergence Analysis of Oilseed Rape, Genome-wide Association Study (GWAS) of Agronomic Traits.

comments powered by Disqus