读取含有多个表头的数据
我们有时读取数据的时候,往往不是我们希望的表头只有一行,有时会出现多个表头数据用来描述列明,比如下面这样的数据:
用EXCEL打开如下:
此时我们读取该数据的话,显示如下:
library(readr)
link <- "https://dataverse.harvard.edu/api/access/datafile/2712105"
stickers <- read_tsv(link)
stickers
## # A tibble: 402 x 18
## SubjectNumber Condition NumberStickers NumberEnvelopes Gender Agemonths
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 [Included Sa~ 1=12:1; ~ 1=12; 2=30 1=1 recipient;~ 1=fem~ NA
## 2 1 1 1 1 1 36
## 3 2 1 1 1 2 36
## 4 3 1 1 1 2 36
## 5 4 1 1 1 1 36
## 6 5 1 1 1 2 36
## 7 6 1 1 1 2 36
## 8 7 2 1 2 1 36
## 9 8 2 1 2 2 36
## 10 9 3 2 1 2 36
## # ... with 392 more rows, and 12 more variables: Ageyears <dbl>,
## # Agegroups <chr>, `Subject'sEnvelope` <chr>, LeftEnvelope <chr>,
## # RightEnvelope <chr>,
## # `absolutenumberofstickersgiven(Conditions1or3:Outof12;Conditions2or4:Outof30)` <chr>,
## # `PercentGiven(Outof100percent)` <chr>, Giveornot <chr>,
## # LargerEnvelopeabs <chr>, LargeEnvelopepercent <chr>,
## # SmallerEnvelopeabs <chr>, SmallEnvelopepercent <chr>
可以看出读取进来的数据有如下问题:
- 第一行不是数据信息,而是作为列名的补充数据
针对这种问题, Hadley给出了答案,很简单,分成两步来读取数据,首先读取列名,然后跳过列名读取数据进行组合
library(dplyr)
library(janitor)
#读取数据列名
stickers_name <- link%>%
read_tsv(n_max = 0)%>%
rename(stickersgiven = "absolutenumberofstickersgiven(Conditions1or3:Outof12;Conditions2or4:Outof30)")%>%
names()
#读取数据并设置列名
stickers <- link%>%
read_tsv(skip = 2, col_names = stickers_name)%>%
clean_names()
stickers
## # A tibble: 401 x 18
## subject_number condition number_stickers number_envelopes gender
## <int> <int> <int> <int> <int>
## 1 1 1 1 1 1
## 2 2 1 1 1 2
## 3 3 1 1 1 2
## 4 4 1 1 1 1
## 5 5 1 1 1 2
## 6 6 1 1 1 2
## 7 7 2 1 2 1
## 8 8 2 1 2 2
## 9 9 3 2 1 2
## 10 10 3 2 1 2
## # ... with 391 more rows, and 13 more variables: agemonths <dbl>,
## # ageyears <dbl>, agegroups <int>, subjects_envelope <int>,
## # left_envelope <int>, right_envelope <int>, stickersgiven <int>,
## # percent_given_outof100percent <dbl>, giveornot <int>,
## # larger_envelopeabs <int>, large_envelopepercent <dbl>,
## # smaller_envelopeabs <int>, small_envelopepercent <dbl>
完美解决!!!
鉴于有两行作为列名,我们可以制作一个字典来存储列名信息:
library(tidyr)
stickers_dict <- link%>%
read_tsv(n_max = 1)%>%
rename(stickersgiven = 'absolutenumberofstickersgiven(Conditions1or3:Outof12;Conditions2or4:Outof30)') %>%
clean_names()%>%
gather(variable_name, variable_description)
stickers_dict
## # A tibble: 18 x 2
## variable_name variable_description
## <chr> <chr>
## 1 subject_number [Included Sample Only]
## 2 condition 1=12:1; 2=12:2, 3=30:1, 4=30:2
## 3 number_stickers 1=12; 2=30
## 4 number_envelopes 1=1 recipient; 2=2 recipients
## 5 gender 1=female; 2=male
## 6 agemonths <NA>
## 7 ageyears <NA>
## 8 agegroups 1=3-4yrs; 2=5-6yrs; 3=7-8yrs; 4=9-11yrs
## 9 subjects_envelope How many stickers did the child keep for~
## 10 left_envelope 1 recipient conditions: How many sticker~
## 11 right_envelope 1 recipient conditions: N/A; 2 recipient~
## 12 stickersgiven Regardless of condition, the number of s~
## 13 percent_given_outof100percent Regardless of condition, the proportion ~
## 14 giveornot 1=Donated 1 or more stickers to the reci~
## 15 larger_envelopeabs Raw number of stickers (out of 30: Condi~
## 16 large_envelopepercent Proportion of stickers (out of 100%; Con~
## 17 smaller_envelopeabs Raw number of stickers (out of 30: Condi~
## 18 small_envelopepercent Proportion of stickers (out of 100%; Con~
sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_China.936
## [2] LC_CTYPE=Chinese (Simplified)_China.936
## [3] LC_MONETARY=Chinese (Simplified)_China.936
## [4] LC_NUMERIC=C
## [5] LC_TIME=Chinese (Simplified)_China.936
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tidyr_0.8.1 janitor_1.0.0 dplyr_0.7.5 readr_1.1.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.17 bindr_0.1.1 knitr_1.20 magrittr_1.5
## [5] hms_0.4.2 tidyselect_0.2.4 R6_2.2.2 rlang_0.2.1
## [9] stringr_1.3.1 tools_3.5.1 utf8_1.1.4 cli_1.0.0
## [13] htmltools_0.3.6 yaml_2.1.19 rprojroot_1.3-2 digest_0.6.15
## [17] assertthat_0.2.0 tibble_1.4.2 crayon_1.3.4 bindrcpp_0.2.2
## [21] purrr_0.2.5 curl_3.2 snakecase_0.9.1 glue_1.2.0
## [25] evaluate_0.10.1 rmarkdown_1.10 stringi_1.1.7 compiler_3.5.1
## [29] pillar_1.2.3 backports_1.1.2 pkgconfig_2.0.1