改善R语言代码的5个小技巧

mark

简介

这篇文章来源于 @drsimonj博客

从1开始排序

当使用冒号(:)创建序列时,尝试用seq().

# sequence a vector
x <- runif(10)
seq(x)
##  [1]  1  2  3  4  5  6  7  8  9 10

#sequence an integer

seq(nrow(mtcars))
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32

冒号有时会产生意想不到的结果,它会在我们不注意的情况下产生各种问题,比如当我们对空向量的长度进行排序的时候:

# empty vector
x = c()
1:length(x)
## [1] 1 0

seq(x)
## integer(0)

利用seq()可以自动创建从1到对象长度的序列,这样就可以避免使用length()

vcetor()而非c()

创建空向量的时候,尝试使用vector("type", length)

# a numeric vector with 5 elements
vector("numeric", 10)
##  [1] 0 0 0 0 0 0 0 0 0 0

#a character vector with 5 elements
vector("character", 5)
## [1] "" "" "" "" ""

使用vector()可以提高内存使用率并提高运行速度

n <- 1e05
x_empty <- c()
system.time(
  for (i in seq(n)){
    x_empty <- c(x_empty,i)
  }
)
##    user  system elapsed 
##    9.69    0.03    9.97
n <- 1e05
x_zeros <- vector("integer", n)
system.time(
  for (i in seq(n)){
    x_zeros[i] <- i
  }
)
##    user  system elapsed 
##       0       0       0

放弃which()

使用R语言的时候,我们常常使用which()从某个布尔条件中获取索引,然后根据索引提取数据,其实没有必要使用which()

#obtain elements greater than 5
x <- 3:7
#use which (not necessary)
x[which(x>5)]
## [1] 6 7
#no which
x[x>5]
## [1] 6 7

或者计算大于5的元素个数

#use which
length(which(x>5))
## [1] 2
# no which
sum(x>5)
## [1] 2

实际上我们需要的是布尔值

condition <- x>5
condition
## [1] FALSE FALSE FALSE  TRUE  TRUE
x[condition]
## [1] 6 7

当与sum()或者mean()结合使用时,可以用布尔值来获取满足条件的值的计数或比例

sum(condition)
## [1] 2
mean(condition)
## [1] 0.4

但是which()可以告诉我们TRUE值的索引号

which(condition)
## [1] 4 5

测试任何或者所有的值是否为真,可以利用any()all()

set.seed(23)
x <- runif(10)
if (length(which(x>0.5))>0)
  print("At least one value is greater than 0.5")
## [1] "At least one value is greater than 0.5"

if (any(x>0.5))
  print("At least one value is greater than 0.5")
## [1] "At least one value is greater than 0.5"

#use which and length to test if all values are less than 1
if (length(which(x<1))==length(x))
  print("All values are less than 1")
## [1] "All values are less than 1"

if (all(x<1))
  print("All values are less than 1")
## [1] "All values are less than 1"

另外一点是可以节省时间

x <- runif(1e8)

system.time(x[which(x > .5)])
##    user  system elapsed 
##    1.24    0.09    1.33
system.time(x[x > .5])
##    user  system elapsed 
##    0.97    0.14    1.13

factor你的变量

当移除一个元素之后,被移除的元素仍然占据着一个位置

set.seed(23)
x <-factor(sample(letters,5,replace = FALSE))
x
## [1] o f h q s
## Levels: f h o q s
plot(x)

mark

移除s之后

x <- x[x!="s"]
x
## [1] o f h q
## Levels: f h o q s
plot(x)

mark

一种解决办法是再次factor()

x <- factor(x)
x
## [1] o f h q
## Levels: f h o q
plot(x)

mark

另外一种方法是利用droplevels()

set.seed(23)
x <-factor(sample(letters,5,replace = FALSE))
x <- x[x!="s"]
x <- droplevels(x)
x
## [1] o f h q
## Levels: f h o q
plot(x)

mark

优先使用$

从data.frame中提取数据时,在行$之前指定列[

#row first,column second - not ideal
mtcars[mtcars$cyl==4, ]$hp
##  [1]  93  62  95  66  52  65  97  66  91 113 109
#column first, row second - much better
mtcars$hp[mtcars$cyl==4]
##  [1]  93  62  95  66  52  65  97  66  91 113 109

原因如下:

  • 尽量避免使用烦人的逗号
  • 提高运行速度
# Simulate a data frame...
n <- 1e7
d <- data.frame(
  a = seq(n),
  b = runif(n)
)

# rows first, column second - not ideal
system.time(d[d$b > .5, ]$a)
##    user  system elapsed 
##    0.47    0.00    0.47

# column first, rows second - much better
system.time(d$a[d$b > .5])
##    user  system elapsed 
##    0.11    0.00    0.11

SessionInfo

sessionInfo()

## R version 3.5.1 (2018-07-02)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_China.936 
## [2] LC_CTYPE=Chinese (Simplified)_China.936   
## [3] LC_MONETARY=Chinese (Simplified)_China.936
## [4] LC_NUMERIC=C                              
## [5] LC_TIME=Chinese (Simplified)_China.936    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.5.1  backports_1.1.2 magrittr_1.5    rprojroot_1.3-2
##  [5] tools_3.5.1     htmltools_0.3.6 yaml_2.1.19     Rcpp_0.12.17   
##  [9] stringi_1.1.7   rmarkdown_1.10  knitr_1.20      stringr_1.3.1  
## [13] digest_0.6.15   evaluate_0.10.1
Researcher

I am a PhD student of Crop Genetics and Breeding at the Zhejiang University Crop Science Lab. My research interests covers a range of issues:Population Genetics Evolution and Ecotype Divergence Analysis of Oilseed Rape, Genome-wide Association Study (GWAS) of Agronomic Traits.

comments powered by Disqus