EDTA:转座子注释

Mar 2, 2020 3 min read Transposable Elements

简介

转座子注释目前有非常多的软件，而EDTA(The Extensive de novo TE Annotator)整合了大部分目前常用的转座子注释软件，值得注意的是EDTA是全基因组从头注释。EDTA最大的好处就是简单实用，而且经过大量的改进，目前在安装、运行以及结果解释等方面十分完善。

流程

具体可以参考EDTA的 GitHub主页以及发表在Genome Biology上的文章

安装

EDTA提供了四种安装方式：

Quick installation using conda (Linux64)

conda install -c bioconda -c conda-forge edta

Quick installation using Singularity (good for HPC users)

Installation:

singularity build --sandbox EDTA.sif docker://kapeel/edta
Usage:

singularity exec {path}/EDTA.sif /EDTA/EDTA.pl --genome genome.fa [other parameters]

{path} is the path you build the EDTA singularity image

Quick installation using Docker (good for root/Mac users)

Installation:

docker pull kapeel/edta
Usage:

docker run -v $PWD:/in -w /in kapeel/edta --genome genome.fa [other parameters]

Step by step installation using conda

conda create -n EDTA
conda activate EDTA
conda config --env --add channels anaconda --add channels conda-forge --add channels bioconda
conda install -n EDTA -y cd-hit repeatmodeler muscle mdust blast openjdk perl perl-text-soundex multiprocess regex tensorflow=1.14.0 keras=2.2.4 scikit-learn=0.19.0 biopython pandas glob2 python=3.6 tesorter genericrepeatfinder genometools-genometools ltr_retriever ltr_finder numpy=1.16.4
git clone https://github.com/oushujun/EDTA
./EDTA/EDTA.pl

文件

只需要提供基因组文件就行了，但是需要注意的是序列名称不能超过15个字符，所以简单点好不易出错。如果你研究的物种还有以下文件的话也可以提供，可以提高注释的准确性：

所研究物种或者亲缘关系比较近的物种的CDS文件（需要剔除掉内含子以及UTRs）
基因坐标信息
可信度非常高的TE库，无需全基因组完整的，部分零散的也行，但是需要注意的是可信度必须是比较高的，不然就不要提供了，免得适得其反

用法

用法非常简单，毕竟整合了大部分主流软件且封装到一起

一步到位From head to toe

You got a genome and you want to get a high-quality TE annotation:

perl EDTA.pl [options]
  --genome	[File]	The genome FASTA
  --species [Rice|Maize|others]	Specify the species for identification of TIR candidates. Default: others
  --step	[all|filter|final|anno] Specify which steps you want to run EDTA.
			all: run the entire pipeline (default)
			filter: start from raw TEs to the end.
			final: start from filtered TEs to finalizing the run.
			anno: perform whole-genome annotation/analysis after TE library construction.
  --overwrite	[0|1]	If previous results are found, decide to overwrite (1, rerun) or not (0, default).
  --cds	[File]	Provide a FASTA file containing the coding sequence (no introns, UTRs, nor TEs) of this genome or its close relative.
  --curatedlib	[file]	Provided a curated library to keep consistant naming and classification for known TEs.
			All TEs in this file will be trusted 100%, so please ONLY provide MANUALLY CURATED ones here.
			This option is not mandatory. It's totally OK if no file is provided (default).
  --sensitive	[0|1]	Use RepeatModeler to identify remaining TEs (1) or not (0, default).
			This step is very slow and MAY help to recover some TEs.
  --anno	[0|1]	Perform (1) or not perform (0, default) whole-genome TE annotation after TE library construction.
  --rmout	[File]	Provide your own homology-based TE annotation instead of using the EDTA library for masking. File is in RepeatMasker .out format. This file will be merged with the structural-based TE annotation. (-anno 1 required). Default: use the EDTA library for annotation.
  --evaluate	[0|1]	Evaluate (1) classification consistency of the TE annotation. (-anno 1 required). Default: 0.
			This step is slow and does not affect the annotation result.
  --exclude	[File]	Exclude bed format regions from TE annotation. Default: undef. (-anno 1 required).
  --threads|-t	[int]	Number of theads to run this script (default: 4)
  --help|-h	Display this help info

参数说明非常详细，用起来十分清爽

分类注释

也可以指定需要注释的转座子类型：

perl EDTA_raw.pl [options]
 --genome	[File]	The genome FASTA
 --species [Rice|Maize|others]	Specify the species for identification of TIR candidates. Default: others
 --type	[ltr|tir|helitron|all]	Specify which type of raw TE candidates you want to get. Default: all
 --overwrite	[0|1]	If previous results are found, decide to overwrite (1, rerun) or not (0, default).
 --threads|-t	[int]	Number of theads to run this script
 --help|-h	Display this help info

另外还支持断点运行，中间软件运行意外终止的话，可以从终止那一步继续运行

Finish the rest of the EDTA analysis (specify -overwrite 0 and it will automatically pick up existing results in the work folder)

perl EDTA.pl --overwrite 0 [options]

结果

返回的结果非常多，并且分门别类帮你整理好，具体我就不讲了，毕竟文档讲得太详细了。

Researcher

I am a PhD student of Crop Genetics and Breeding at the Zhejiang University Crop Science Lab. My research interests covers a range of issues:Population Genetics Evolution and Ecotype Divergence Analysis of Oilseed Rape, Genome-wide Association Study (GWAS) of Agronomic Traits.