5 Tips for Improving Your R Code
Introduction
Sequence from 1
Instead of using the colon operator (:) to create a sequence, try seq().
# sequence a vector
x <- runif(10)
seq(x)
## [1] 1 2 3 4 5 6 7 8 9 10
# sequence from 1 to an integer
seq(nrow(mtcars))
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32
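The colon operator sometimes produces unexpected results and can cause problems without our noticing, for example when sequencing along the length of an empty vector: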
# empty vector
x = c()
1:length(x)
## [1] 1 0
seq(x)
## integer(0)
seq() automatically creates a sequence from 1 to the length of the object, so there is no need to call length().
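As a related sketch, base R also provides seq_along() and seq_len(), which state the intent explicitly. One caveat to keep in mind: seq(x) treats a length-one numeric x as an upper bound rather than as a vector to index, so seq_along() is likely the safer choice when looping over a vector of unknown length. The variable names below are illustrative only.

```r
# An empty vector: all three approaches agree on "no indices".
empty <- c()
seq(empty)                # integer(0)
seq_along(empty)          # integer(0)
seq_len(length(empty))    # integer(0)

# A length-one numeric vector: seq() counts up to its value.
single <- 10
seq(single)               # 1 2 3 4 5 6 7 8 9 10
seq_along(single)         # 1

# Looping with seq_along() never runs the body for an empty vector.
iterations <- 0
for (i in seq_along(empty)) {
  iterations <- iterations + 1
}
iterations                # 0
```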
vector() instead of c()
When creating an empty vector, try vector("type", length).
# a numeric vector with 10 elements
vector("numeric", 10)
## [1] 0 0 0 0 0 0 0 0 0 0
#a character vector with 5 elements
vector("character", 5)
## [1] "" "" "" "" ""
Preallocating with vector() avoids copying the vector every time it grows, which improves both memory usage and speed:
n <- 1e05
x_empty <- c()
system.time(
for (i in seq(n)){
x_empty <- c(x_empty,i)
}
)
## user system elapsed
## 9.69 0.03 9.97
n <- 1e05
x_zeros <- vector("integer", n)
system.time(
for (i in seq(n)){
x_zeros[i] <- i
}
)
## user system elapsed
## 0 0 0
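A small check, assuming a much smaller n is enough to illustrate the point: both loops build the same vector, so preallocation changes only the cost, not the result. integer(n) is base R shorthand for vector("integer", n); the names grown and preallocated are illustrative.

```r
n <- 1000

# Grow the vector one element at a time (c() builds a new vector each step).
grown <- c()
for (i in seq_len(n)) {
  grown <- c(grown, i)
}

# Preallocate once, then fill the existing slots.
preallocated <- vector("integer", n)
for (i in seq_len(n)) {
  preallocated[i] <- i
}

# Same contents and type either way.
same_result <- identical(grown, preallocated)
same_result

# vector("integer", n) and integer(n) create the same object.
same_constructor <- identical(vector("integer", n), integer(n))
same_constructor
```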
Ditch which()
In R we often use which() to get indices from a logical condition and then use those indices to extract data. In most cases which() is unnecessary.
#obtain elements greater than 5
x <- 3:7
#use which (not necessary)
x[which(x>5)]
## [1] 6 7
#no which
x[x>5]
## [1] 6 7
Or to count the elements greater than 5:
#use which
length(which(x>5))
## [1] 2
# no which
sum(x>5)
## [1] 2
What we actually need is the logical vector itself:
condition <- x>5
condition
## [1] FALSE FALSE FALSE TRUE TRUE
x[condition]
## [1] 6 7
Combined with sum() or mean(), a logical vector gives the count or the proportion of values that satisfy the condition:
sum(condition)
## [1] 2
mean(condition)
## [1] 0.4
which(), on the other hand, tells us the positions of the TRUE values:
which(condition)
## [1] 4 5
To test whether any or all of the values are TRUE, use any() and all():
set.seed(23)
x <- runif(10)
if (length(which(x>0.5))>0)
print("At least one value is greater than 0.5")
## [1] "At least one value is greater than 0.5"
if (any(x>0.5))
print("At least one value is greater than 0.5")
## [1] "At least one value is greater than 0.5"
#use which and length to test if all values are less than 1
if (length(which(x<1))==length(x))
print("All values are less than 1")
## [1] "All values are less than 1"
if (all(x<1))
print("All values are less than 1")
## [1] "All values are less than 1"
Skipping which() also saves time:
x <- runif(1e8)
system.time(x[which(x > .5)])
## user system elapsed
## 1.24 0.09 1.33
system.time(x[x > .5])
## user system elapsed
## 0.97 0.14 1.13
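One difference is worth keeping in mind before dropping which() everywhere: when the data contain NA, logical indexing keeps an NA for each missing comparison, while which() skips them. A minimal sketch with a made-up vector (x_na is not part of the original examples):

```r
x_na <- c(3, NA, 6, 7)

# Logical indexing propagates the NA produced by the comparison.
with_na <- x_na[x_na > 5]
with_na                      # NA 6 7

# which() returns only the positions where the condition is TRUE.
without_na <- x_na[which(x_na > 5)]
without_na                   # 6 7

# Counting with sum() needs na.rm = TRUE to ignore the missing value.
count_above <- sum(x_na > 5, na.rm = TRUE)
count_above                  # 2
```

If the NA values should be dropped, which() (or an explicit !is.na() test) is still a reasonable choice.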
factor() your factors
After elements are removed from a factor, the level of the removed value still occupies a slot:
set.seed(23)
x <-factor(sample(letters,5,replace = FALSE))
x
## [1] o f h q s
## Levels: f h o q s
plot(x)
After removing s:
x <- x[x!="s"]
x
## [1] o f h q
## Levels: f h o q s
plot(x)
One fix is to call factor() again:
x <- factor(x)
x
## [1] o f h q
## Levels: f h o q
plot(x)
Another is to use droplevels():
set.seed(23)
x <-factor(sample(letters,5,replace = FALSE))
x <- x[x!="s"]
x <- droplevels(x)
x
## [1] o f h q
## Levels: f h o q
plot(x)
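The leftover level matters beyond plot(): table() also reports it with a count of zero. The sketch below uses a made-up factor (sizes) so the result does not depend on the random seed, and shows that droplevels() can also be applied to a whole data frame.

```r
sizes <- factor(c("small", "medium", "large", "small"))

# Subsetting keeps "large" as a level even though no value uses it.
kept <- sizes[sizes != "large"]
levels(kept)               # "large" "medium" "small"
table(kept)                # "large" appears with a count of 0

# droplevels() removes the unused level.
cleaned <- droplevels(kept)
levels(cleaned)            # "medium" "small"
nlevels(cleaned)           # 2

# Applied to a data frame, droplevels() cleans every factor column.
df <- data.frame(size = sizes, value = 1:4)
df_small <- droplevels(df[df$size == "small", ])
levels(df_small$size)      # "small"
```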
Use $ first
When extracting data from a data.frame, select the column with $ before selecting the rows with [.
#row first,column second - not ideal
mtcars[mtcars$cyl==4, ]$hp
## [1] 93 62 95 66 52 65 97 66 91 113 109
#column first, row second - much better
mtcars$hp[mtcars$cyl==4]
## [1] 93 62 95 66 52 65 97 66 91 113 109
Here is why:
- It avoids the pesky comma
- It runs faster
# Simulate a data frame...
n <- 1e7
d <- data.frame(
a = seq(n),
b = runif(n)
)
# rows first, column second - not ideal
system.time(d[d$b > .5, ]$a)
## user system elapsed
## 0.47 0.00 0.47
# column first, rows second - much better
system.time(d$a[d$b > .5])
## user system elapsed
## 0.11 0.00 0.11
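The speed difference is plausibly explained by the intermediate object: the rows-first form builds a filtered copy of the entire data frame before picking one column, while the column-first form subsets a single vector. A short check on the built-in mtcars data that both forms return the same values; column_name is an illustrative variable, used to show that [[ ]] can stand in for $ when the column name is stored in a variable.

```r
# Both orders return the same values.
rows_first   <- mtcars[mtcars$cyl == 4, ]$hp
column_first <- mtcars$hp[mtcars$cyl == 4]
same_values <- identical(rows_first, column_first)
same_values

# With the column name held in a variable, [[ ]] plays the role of $.
column_name <- "hp"
by_name <- mtcars[[column_name]][mtcars$cyl == 4]
by_name
```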
SessionInfo
sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_China.936
## [2] LC_CTYPE=Chinese (Simplified)_China.936
## [3] LC_MONETARY=Chinese (Simplified)_China.936
## [4] LC_NUMERIC=C
## [5] LC_TIME=Chinese (Simplified)_China.936
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.5.1 backports_1.1.2 magrittr_1.5 rprojroot_1.3-2
## [5] tools_3.5.1 htmltools_0.3.6 yaml_2.1.19 Rcpp_0.12.17
## [9] stringi_1.1.7 rmarkdown_1.10 knitr_1.20 stringr_1.3.1
## [13] digest_0.6.15 evaluate_0.10.1