读取含有多个表头的数据

Jul 10, 2018 3 min read R

mark

我们有时读取数据的时候，往往不是我们希望的表头只有一行，有时会出现多个表头数据用来描述列明，比如下面这样的数据：

mark

用EXCEL打开如下：

mark

此时我们读取该数据的话，显示如下：

library(readr)
link <- "https://dataverse.harvard.edu/api/access/datafile/2712105"
stickers <- read_tsv(link)
stickers

## # A tibble: 402 x 18
##    SubjectNumber Condition NumberStickers NumberEnvelopes Gender Agemonths
##    <chr>         <chr>     <chr>          <chr>           <chr>      <dbl>
##  1 [Included Sa~ 1=12:1; ~ 1=12; 2=30     1=1 recipient;~ 1=fem~        NA
##  2 1             1         1              1               1             36
##  3 2             1         1              1               2             36
##  4 3             1         1              1               2             36
##  5 4             1         1              1               1             36
##  6 5             1         1              1               2             36
##  7 6             1         1              1               2             36
##  8 7             2         1              2               1             36
##  9 8             2         1              2               2             36
## 10 9             3         2              1               2             36
## # ... with 392 more rows, and 12 more variables: Ageyears <dbl>,
## #   Agegroups <chr>, `Subject'sEnvelope` <chr>, LeftEnvelope <chr>,
## #   RightEnvelope <chr>,
## #   `absolutenumberofstickersgiven(Conditions1or3:Outof12;Conditions2or4:Outof30)` <chr>,
## #   `PercentGiven(Outof100percent)` <chr>, Giveornot <chr>,
## #   LargerEnvelopeabs <chr>, LargeEnvelopepercent <chr>,
## #   SmallerEnvelopeabs <chr>, SmallEnvelopepercent <chr>

可以看出读取进来的数据有如下问题：

第一行不是数据信息，而是作为列名的补充数据

针对这种问题， Hadley给出了答案，很简单，分成两步来读取数据，首先读取列名，然后跳过列名读取数据进行组合

library(dplyr)
library(janitor)
#读取数据列名
stickers_name <- link%>%
  read_tsv(n_max = 0)%>%
  rename(stickersgiven = "absolutenumberofstickersgiven(Conditions1or3:Outof12;Conditions2or4:Outof30)")%>% 
  names()

#读取数据并设置列名
stickers <- link%>%
  read_tsv(skip = 2, col_names = stickers_name)%>%
  clean_names()
stickers

## # A tibble: 401 x 18
##    subject_number condition number_stickers number_envelopes gender
##             <int>     <int>           <int>            <int>  <int>
##  1              1         1               1                1      1
##  2              2         1               1                1      2
##  3              3         1               1                1      2
##  4              4         1               1                1      1
##  5              5         1               1                1      2
##  6              6         1               1                1      2
##  7              7         2               1                2      1
##  8              8         2               1                2      2
##  9              9         3               2                1      2
## 10             10         3               2                1      2
## # ... with 391 more rows, and 13 more variables: agemonths <dbl>,
## #   ageyears <dbl>, agegroups <int>, subjects_envelope <int>,
## #   left_envelope <int>, right_envelope <int>, stickersgiven <int>,
## #   percent_given_outof100percent <dbl>, giveornot <int>,
## #   larger_envelopeabs <int>, large_envelopepercent <dbl>,
## #   smaller_envelopeabs <int>, small_envelopepercent <dbl>

完美解决！！！

鉴于有两行作为列名，我们可以制作一个字典来存储列名信息：

library(tidyr)
stickers_dict <- link%>%
  read_tsv(n_max = 1)%>%
  rename(stickersgiven = 'absolutenumberofstickersgiven(Conditions1or3:Outof12;Conditions2or4:Outof30)') %>% 
  clean_names()%>%
  gather(variable_name, variable_description)
stickers_dict

## # A tibble: 18 x 2
##    variable_name                 variable_description                     
##    <chr>                         <chr>                                    
##  1 subject_number                [Included Sample Only]                   
##  2 condition                     1=12:1; 2=12:2, 3=30:1, 4=30:2           
##  3 number_stickers               1=12; 2=30                               
##  4 number_envelopes              1=1 recipient; 2=2 recipients            
##  5 gender                        1=female; 2=male                         
##  6 agemonths                     <NA>                                     
##  7 ageyears                      <NA>                                     
##  8 agegroups                     1=3-4yrs; 2=5-6yrs; 3=7-8yrs; 4=9-11yrs  
##  9 subjects_envelope             How many stickers did the child keep for~
## 10 left_envelope                 1 recipient conditions: How many sticker~
## 11 right_envelope                1 recipient conditions: N/A; 2 recipient~
## 12 stickersgiven                 Regardless of condition, the number of s~
## 13 percent_given_outof100percent Regardless of condition, the proportion ~
## 14 giveornot                     1=Donated 1 or more stickers to the reci~
## 15 larger_envelopeabs            Raw number of stickers (out of 30: Condi~
## 16 large_envelopepercent         Proportion of stickers (out of 100%; Con~
## 17 smaller_envelopeabs           Raw number of stickers (out of 30: Condi~
## 18 small_envelopepercent         Proportion of stickers (out of 100%; Con~

sessionInfo()

## R version 3.5.1 (2018-07-02)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_China.936 
## [2] LC_CTYPE=Chinese (Simplified)_China.936   
## [3] LC_MONETARY=Chinese (Simplified)_China.936
## [4] LC_NUMERIC=C                              
## [5] LC_TIME=Chinese (Simplified)_China.936    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] tidyr_0.8.1   janitor_1.0.0 dplyr_0.7.5   readr_1.1.1  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.17     bindr_0.1.1      knitr_1.20       magrittr_1.5    
##  [5] hms_0.4.2        tidyselect_0.2.4 R6_2.2.2         rlang_0.2.1     
##  [9] stringr_1.3.1    tools_3.5.1      utf8_1.1.4       cli_1.0.0       
## [13] htmltools_0.3.6  yaml_2.1.19      rprojroot_1.3-2  digest_0.6.15   
## [17] assertthat_0.2.0 tibble_1.4.2     crayon_1.3.4     bindrcpp_0.2.2  
## [21] purrr_0.2.5      curl_3.2         snakecase_0.9.1  glue_1.2.0      
## [25] evaluate_0.10.1  rmarkdown_1.10   stringi_1.1.7    compiler_3.5.1  
## [29] pillar_1.2.3     backports_1.1.2  pkgconfig_2.0.1

数据读取

Researcher

I am a PhD student of Crop Genetics and Breeding at the Zhejiang University Crop Science Lab. My research interests covers a range of issues:Population Genetics Evolution and Ecotype Divergence Analysis of Oilseed Rape, Genome-wide Association Study (GWAS) of Agronomic Traits.