利用tximport进行RNA-seq分析

简介


RNA-seq后续分析可以利用R包edgeR、DESeq2以及limma-voom等,而tximport包则可以将RNA-seq上游定量分析软件产生的结果导入到R语言中,进而方便后续的分析。具体的情况可以参考这篇文献:

Charlotte Soneson, Michael I. Love, Mark D. Robinson (2015): Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Research http://sci-hub.cc/10.12688/f1000research.7563.1

tximport具有以下优点:

  • 对于样本之间因基因长度不同导致的差异具有纠正功能
  • 一些上游分析软件(Salmon, Sailfish, kallisto)等的结果可以完美对接到R语言中,这些软件运行速度更快更稳定,且占用更少的内存
  • 更高的灵敏度,因为它可以避免那些比对到多基因的片段的丢失

导入转录组数据


下面我们将使用包tximportData中的数据进行演示,tximport可以处理多种类型的上游定量结果,只需要在参数type中设定就行。 先安装包

source("https://bioconductor.org/biocLite.R")
options(BioC_mirror="http://mirrors.ustc.edu.cn/bioc/")#设置中科大镜像
biocLite("tximportData")
library(tximportData)
dir <- system.file("extdata", package = "tximportData")#set directory
list.files(dir)
##  [1] "cufflinks"            "kallisto"             "kallisto_boot"       
##  [4] "rsem"                 "sailfish"             "salmon"              
##  [7] "salmon_gibbs"         "samples.txt"          "samples_extended.txt"
## [10] "tx2gene.csv"

可以看到dir中存在这些文件,接下来我们其中的samples.txt读进来

samples <- read.table(file.path(dir, "samples.txt"), header = TRUE)
samples
##   pop center                assay    sample experiment       run
## 1 TSI  UNIGE NA20503.1.M_111124_5 ERS185497  ERX163094 ERR188297
## 2 TSI  UNIGE NA20504.1.M_111124_7 ERS185242  ERX162972 ERR188088
## 3 TSI  UNIGE NA20505.1.M_111124_6 ERS185048  ERX163009 ERR188329
## 4 TSI  UNIGE NA20507.1.M_111124_7 ERS185412  ERX163158 ERR188288
## 5 TSI  UNIGE NA20508.1.M_111124_2 ERS185362  ERX163159 ERR188021
## 6 TSI  UNIGE NA20514.1.M_111124_4 ERS185217  ERX163062 ERR188356

再将salmon与samples以及quant.sf一起创建每个样本的文件地址

files <- file.path(dir, "salmon", samples$run, "quant.sf")
names(files) <- paste0("sample", 1:6)
all(file.exists(files))
## [1] TRUE

结果显示TRUE说明运行成功,可以进行下一步,要进行基因表达差异分析,首先要将转录本与基因名关联起来,这就需要我们创建一个两列的名为tx2gene的数据框,最简单的就是通过TxDb创建,再通过AnnotationDbi包里的函数select()进行选取

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
k <- keys(txdb, keytype = "GENEID")
df <- select(txdb, keys = k, keytype = "GENEID", columns = "TXNAME")
tx2gene <- df[, 2:1]#tx ID在前,gene ID在后
#这里我们演示就使用现有的tx2gene
tx2gene <- read.csv(file.path(dir, "tx2gene.csv"))
head(tx2gene)
##         TXNAME   GENEID
## 1    NM_130786     A1BG
## 2    NR_015380 A1BG-AS1
## 3 NM_001198818     A1CF
## 4 NM_001198819     A1CF
## 5 NM_001198820     A1CF
## 6    NM_014576     A1CF

最后就是将转录组水平的数据导入就行了

library(tximport)
library(readr)
txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
names(txi)
## [1] "abundance"           "counts"              "length"             
## [4] "countsFromAbundance"
head(txi$counts)
##             sample1   sample2    sample3   sample4   sample5   sample6
## A1BG     109.232000 316.22400 110.638000 116.00000  86.38430  76.91630
## A1BG-AS1  83.969700 138.44900 119.274000 151.08300 123.98500 103.25100
## A1CF       9.030691  10.01847   5.019242  13.01820  25.21914  25.07356
## A2M       24.000000   2.00000  21.000000   6.00000  38.00000   8.00000
## A2M-AS1    1.000000   1.00000   1.000000   1.00000   0.00000   0.00000
## A2ML1      3.047950   1.02987   4.076160   1.04945   3.07761   5.12409

差异分析


下游分析主要使用Bioconductor包进行

library(edgeR)
cts <- txi$counts
normMat <- txi$length
normMat <- normMat/exp(rowMeans(log(normMat)))
o <- log(calcNormFactors(cts/normMat))+log(colSums(cts/normMat))
y <- DGEList(cts)
y$offset <- t(t(log(normMat))+o)

接下来的y就可以用来后续的分析

SessionInfo


sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 15063)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_China.936 
## [2] LC_CTYPE=Chinese (Simplified)_China.936   
## [3] LC_MONETARY=Chinese (Simplified)_China.936
## [4] LC_NUMERIC=C                              
## [5] LC_TIME=Chinese (Simplified)_China.936    
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] edgeR_3.18.1                           
##  [2] limma_3.32.5                           
##  [3] readr_1.1.1                            
##  [4] tximport_1.4.0                         
##  [5] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
##  [6] GenomicFeatures_1.28.4                 
##  [7] AnnotationDbi_1.38.2                   
##  [8] Biobase_2.36.2                         
##  [9] GenomicRanges_1.28.4                   
## [10] GenomeInfoDb_1.12.2                    
## [11] IRanges_2.10.3                         
## [12] S4Vectors_0.14.3                       
## [13] BiocGenerics_0.22.0                    
## [14] tximportData_1.4.0                     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.12               compiler_3.4.1            
##  [3] XVector_0.16.0             bitops_1.0-6              
##  [5] tools_3.4.1                zlibbioc_1.22.0           
##  [7] biomaRt_2.32.1             digest_0.6.12             
##  [9] bit_1.1-12                 lattice_0.20-35           
## [11] evaluate_0.10.1            RSQLite_2.0               
## [13] memoise_1.1.0              tibble_1.3.4              
## [15] pkgconfig_2.0.1            rlang_0.1.2               
## [17] Matrix_1.2-11              DelayedArray_0.3.12       
## [19] DBI_0.7                    yaml_2.1.14               
## [21] GenomeInfoDbData_0.99.0    rtracklayer_1.36.4        
## [23] stringr_1.2.0              knitr_1.17                
## [25] hms_0.3                    Biostrings_2.44.2         
## [27] locfit_1.5-9.1             grid_3.4.1                
## [29] rprojroot_1.2              bit64_0.9-7               
## [31] R6_2.2.2                   XML_3.98-1.9              
## [33] BiocParallel_1.10.1        rmarkdown_1.6             
## [35] blob_1.1.0                 magrittr_1.5              
## [37] matrixStats_0.52.2         GenomicAlignments_1.12.2  
## [39] backports_1.1.0            Rsamtools_1.28.0          
## [41] htmltools_0.3.6            SummarizedExperiment_1.6.3
## [43] stringi_1.1.5              RCurl_1.95-4.8            
## [45] rjson_0.2.15
Researcher

I am a PhD student of Crop Genetics and Breeding at the Zhejiang University Crop Science Lab. My research interests covers a range of issues:Population Genetics Evolution and Ecotype Divergence Analysis of Oilseed Rape, Genome-wide Association Study (GWAS) of Agronomic Traits.

comments powered by Disqus