Linux Study Notes (3)

Nov 12, 2017


“write code for humans, write data for computers”

cut

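cut extracts selected columns from a dataset. The -f option specifies which columns to keep; for example, to view the first and third columns, or only the second column:

$ cut -f 1,3 Mus_musculus.GRCm38.75_chr1.bed | head -n 3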
1       3054733
1       3054733
1       3054733
$ cut -f 2 Mus_musculus.GRCm38.75_chr1.bed | head -n 3
3054233
3054233
3054233
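
Besides a comma-separated list, -f also accepts ranges. The following is a minimal sketch on a made-up tab-delimited file; the name sample.bed and its contents are assumptions for illustration, not one of the files used above:

```shell
# Build a tiny tab-delimited file (hypothetical sample data).
printf '1\t100\t200\tgeneA\n1\t300\t400\tgeneB\n2\t500\t600\tgeneC\n' > sample.bed

# Columns 1 through 3 (a closed range).
cut -f 1-3 sample.bed

# Column 2 through the last column (an open-ended range).
cut -f 2- sample.bed

# cut emits fields in file order, so -f 3,1 selects the same output as -f 1,3.
cut -f 3,1 sample.bed
```

Because cut cannot reorder columns, a tool such as awk is the usual choice when the column order has to change.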

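Combined with grep, cut makes it easy to pull selected information out of a GTF file. Here grep -v '^#' drops the header lines that begin with #; this relies on a regular expression, a topic I will cover in a later note. The following command extracts the chromosome, start position, and end position, which resembles the layout of a BED file (the layout only: GTF coordinates are 1-based, while BED start coordinates are 0-based):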

$ grep -v '^#' Mus_musculus.GRCm38.75_chr1.gtf | cut -f 1,4,5 | head -n 3
1       3054233 3054733
1       3054233 3054733
1       3054233 3054733
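
To see why the grep step matters, here is a small sketch on a miniature GTF-like file. The file name sample.gtf, the header text, and the gene_id value are all made up for illustration:

```shell
# Hypothetical miniature GTF: two header lines followed by two tab-delimited records.
printf '#!genome-build GRCm38\n#!genome-version GRCm38\n' > sample.gtf
printf '1\tpseudogene\tgene\t3054233\t3054733\t.\t+\t.\tgene_id "g1"\n' >> sample.gtf
printf '1\tpseudogene\texon\t3054233\t3054733\t.\t+\t.\tgene_id "g1"\n' >> sample.gtf

# Without the filter: the header lines contain no tab, so cut prints them unchanged
# and they end up mixed in with the extracted columns.
cut -f 1,4,5 sample.gtf

# With grep -v '^#': only the data records reach cut.
grep -v '^#' sample.gtf | cut -f 1,4,5
```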

We can also use > to redirect the extracted columns into a file for later use. Here we save them as test.txt:

$ grep -v "^#" Mus_musculus.GRCm38.75_chr1.gtf | cut -f 1,4,5 > test.txt

cut uses the tab character as its default delimiter, so to process a CSV file we need to set the delimiter to a comma with the -d option:

$ cut -d , -f 2,3 Mus_musculus.GRCm38.75_chr1_bed.csv | head -n 3
3054233,3054733
3054233,3054733
3054233,3054733
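
The effect of -d can be checked on a small made-up file (sample.csv is a hypothetical name). The sketch also shows a limitation worth keeping in mind: cut splits on every comma and does not understand quoted CSV fields.

```shell
# Hypothetical comma-separated sample.
printf '1,3054233,3054733\n1,3054233,3054733\n' > sample.csv

# Default (tab) delimiter: no tab is found, so each line is printed whole.
cut -f 2,3 sample.csv

# Comma delimiter: fields 2 and 3 are extracted.
cut -d, -f 2,3 sample.csv

# A comma inside quotes is still treated as a delimiter, so the field is split.
printf '"a,b",c\n' | cut -d, -f 1
```

For CSV files with quoted fields, a CSV-aware parser is a safer choice than cut.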

column

When working with tab-delimited files, the columns often fail to line up, which makes the data hard to read, as shown below:

$ grep -v "^#" Mus_musculus.GRCm38.75_chr1.gtf | cut -f1-8 | head -n 3
1       pseudogene      gene    3054233 3054733 .       +       .
1       unprocessed_pseudogene  transcript      3054233 3054733 .       +      .
1       unprocessed_pseudogene  exon    3054233 3054733 .       +       .

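column produces neatly aligned output. The -t option tells column to treat the input as a table; the result below is clearly easier to read than the output above.

$ grep -v "^#" Mus_musculus.GRCm38.75_chr1.gtf | cut -f1-8 | head -n 3 | column -t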
1  pseudogene              gene        3054233  3054733  .  +  .
1  unprocessed_pseudogene  transcript  3054233  3054733  .  +  .
1  unprocessed_pseudogene  exon        3054233  3054733  .  +  .

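Note that column -t is meant only for viewing data in the terminal. It should not be used to rewrite a dataset into a file, because the aligned output replaces the tab delimiters with runs of spaces.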
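
One way to see why the aligned output is for viewing only is to count the tab characters before and after column -t. This is a minimal sketch on a made-up file (sample.tsv is a hypothetical name), and it assumes the column utility is installed:

```shell
# Hypothetical two-line tab-delimited sample.
printf '1\tpseudogene\tgene\n1\tunprocessed_pseudogene\ttranscript\n' > sample.tsv

# Number of tab characters in the original file.
tr -cd '\t' < sample.tsv | wc -c

# Number of tab characters left after column -t: the tabs have become spaces.
column -t sample.tsv | tr -cd '\t' | wc -c
```

A file written from the second form would no longer be parsed correctly by tools that expect tab-delimited input, such as cut with its default delimiter.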

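column -t splits columns on whitespace by default, so for data with another delimiter we need to specify it with the -s option, for example when working with CSV data:

$ column -s, -t Mus_musculus.GRCm38.75_chr1_bed.csv | head -n 3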
1  3054233    3054733
1  3054233    3054733
1  3054233    3054733
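
column reads all of its input before choosing column widths, so where head sits in the pipeline changes the padding. The sketch below uses a made-up file (sample2.csv is a hypothetical name) and assumes the column utility is installed:

```shell
# Hypothetical CSV in which only the last line has a long first field.
printf 'a,b\nc,d\nlonger_value,e\n' > sample2.csv

# head first: column sees only two short lines and uses narrow padding.
head -n 2 sample2.csv | column -s, -t

# column first: widths are computed over the whole file, so the same
# two lines are padded to fit the long value on line 3.
column -s, -t sample2.csv | head -n 2
```

On large files, putting head first also avoids making column read the entire file just to display a few lines.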


I am a PhD student in Crop Genetics and Breeding at the Zhejiang University Crop Science Lab. My research interests cover a range of issues: population genetics, evolution and ecotype divergence analysis of oilseed rape, and genome-wide association studies (GWAS) of agronomic traits.