利用tximport进行RNA-seq分析
简介
RNA-seq后续分析可以利用R包edgeR、DESeq2以及limma-voom等,而tximport包则可以将RNA-seq上游定量分析软件产生的结果导入到R语言中,进而方便后续的分析。具体的情况可以参考这篇文献:
Charlotte Soneson, Michael I. Love, Mark D. Robinson (2015): Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Research http://sci-hub.cc/10.12688/f1000research.7563.1
tximport具有以下优点:
- 对于样本之间因基因长度不同导致的差异具有纠正功能
- 一些上游分析软件(Salmon, Sailfish, kallisto)等的结果可以完美对接到R语言中,这些软件运行速度更快更稳定,且占用更少的内存
- 更高的灵敏度,因为它可以避免那些比对到多基因的片段的丢失
导入转录组数据
下面我们将使用包tximportData中的数据进行演示,tximport可以处理多种类型的上游定量结果,只需要在参数type中设定就行。 先安装包
source("https://bioconductor.org/biocLite.R")
options(BioC_mirror="http://mirrors.ustc.edu.cn/bioc/")#设置中科大镜像
biocLite("tximportData")
library(tximportData)
dir <- system.file("extdata", package = "tximportData")#set directory
list.files(dir)
## [1] "cufflinks" "kallisto" "kallisto_boot"
## [4] "rsem" "sailfish" "salmon"
## [7] "salmon_gibbs" "samples.txt" "samples_extended.txt"
## [10] "tx2gene.csv"
可以看到dir中存在这些文件,接下来我们其中的samples.txt读进来
samples <- read.table(file.path(dir, "samples.txt"), header = TRUE)
samples
## pop center assay sample experiment run
## 1 TSI UNIGE NA20503.1.M_111124_5 ERS185497 ERX163094 ERR188297
## 2 TSI UNIGE NA20504.1.M_111124_7 ERS185242 ERX162972 ERR188088
## 3 TSI UNIGE NA20505.1.M_111124_6 ERS185048 ERX163009 ERR188329
## 4 TSI UNIGE NA20507.1.M_111124_7 ERS185412 ERX163158 ERR188288
## 5 TSI UNIGE NA20508.1.M_111124_2 ERS185362 ERX163159 ERR188021
## 6 TSI UNIGE NA20514.1.M_111124_4 ERS185217 ERX163062 ERR188356
再将salmon与samples以及quant.sf一起创建每个样本的文件地址
files <- file.path(dir, "salmon", samples$run, "quant.sf")
names(files) <- paste0("sample", 1:6)
all(file.exists(files))
## [1] TRUE
结果显示TRUE说明运行成功,可以进行下一步,要进行基因表达差异分析,首先要将转录本与基因名关联起来,这就需要我们创建一个两列的名为tx2gene的数据框,最简单的就是通过TxDb创建,再通过AnnotationDbi包里的函数select()进行选取
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
k <- keys(txdb, keytype = "GENEID")
df <- select(txdb, keys = k, keytype = "GENEID", columns = "TXNAME")
tx2gene <- df[, 2:1]#tx ID在前,gene ID在后
#这里我们演示就使用现有的tx2gene
tx2gene <- read.csv(file.path(dir, "tx2gene.csv"))
head(tx2gene)
## TXNAME GENEID
## 1 NM_130786 A1BG
## 2 NR_015380 A1BG-AS1
## 3 NM_001198818 A1CF
## 4 NM_001198819 A1CF
## 5 NM_001198820 A1CF
## 6 NM_014576 A1CF
最后就是将转录组水平的数据导入就行了
library(tximport)
library(readr)
txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
names(txi)
## [1] "abundance" "counts" "length"
## [4] "countsFromAbundance"
head(txi$counts)
## sample1 sample2 sample3 sample4 sample5 sample6
## A1BG 109.232000 316.22400 110.638000 116.00000 86.38430 76.91630
## A1BG-AS1 83.969700 138.44900 119.274000 151.08300 123.98500 103.25100
## A1CF 9.030691 10.01847 5.019242 13.01820 25.21914 25.07356
## A2M 24.000000 2.00000 21.000000 6.00000 38.00000 8.00000
## A2M-AS1 1.000000 1.00000 1.000000 1.00000 0.00000 0.00000
## A2ML1 3.047950 1.02987 4.076160 1.04945 3.07761 5.12409
差异分析
下游分析主要使用Bioconductor包进行
library(edgeR)
cts <- txi$counts
normMat <- txi$length
normMat <- normMat/exp(rowMeans(log(normMat)))
o <- log(calcNormFactors(cts/normMat))+log(colSums(cts/normMat))
y <- DGEList(cts)
y$offset <- t(t(log(normMat))+o)
接下来的y就可以用来后续的分析
SessionInfo
sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 15063)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_China.936
## [2] LC_CTYPE=Chinese (Simplified)_China.936
## [3] LC_MONETARY=Chinese (Simplified)_China.936
## [4] LC_NUMERIC=C
## [5] LC_TIME=Chinese (Simplified)_China.936
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] edgeR_3.18.1
## [2] limma_3.32.5
## [3] readr_1.1.1
## [4] tximport_1.4.0
## [5] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [6] GenomicFeatures_1.28.4
## [7] AnnotationDbi_1.38.2
## [8] Biobase_2.36.2
## [9] GenomicRanges_1.28.4
## [10] GenomeInfoDb_1.12.2
## [11] IRanges_2.10.3
## [12] S4Vectors_0.14.3
## [13] BiocGenerics_0.22.0
## [14] tximportData_1.4.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.12 compiler_3.4.1
## [3] XVector_0.16.0 bitops_1.0-6
## [5] tools_3.4.1 zlibbioc_1.22.0
## [7] biomaRt_2.32.1 digest_0.6.12
## [9] bit_1.1-12 lattice_0.20-35
## [11] evaluate_0.10.1 RSQLite_2.0
## [13] memoise_1.1.0 tibble_1.3.4
## [15] pkgconfig_2.0.1 rlang_0.1.2
## [17] Matrix_1.2-11 DelayedArray_0.3.12
## [19] DBI_0.7 yaml_2.1.14
## [21] GenomeInfoDbData_0.99.0 rtracklayer_1.36.4
## [23] stringr_1.2.0 knitr_1.17
## [25] hms_0.3 Biostrings_2.44.2
## [27] locfit_1.5-9.1 grid_3.4.1
## [29] rprojroot_1.2 bit64_0.9-7
## [31] R6_2.2.2 XML_3.98-1.9
## [33] BiocParallel_1.10.1 rmarkdown_1.6
## [35] blob_1.1.0 magrittr_1.5
## [37] matrixStats_0.52.2 GenomicAlignments_1.12.2
## [39] backports_1.1.0 Rsamtools_1.28.0
## [41] htmltools_0.3.6 SummarizedExperiment_1.6.3
## [43] stringi_1.1.5 RCurl_1.95-4.8
## [45] rjson_0.2.15