[Original] tidyfst: the fastest R package for big-data processing, and more efficient than dplyr

微生信生物 2021-01-16

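Preface

The package was developed by Huang Tianyuan (黃天元).

I started by working through tidyfst from end to end, since it does not have many functions: about 38 by a generous count.

Now that my amplicon work has settled down, I am gradually moving into metagenomics, which involves more software and a wider range of platforms, and R has begun to show a number of weaknesses:

Data processing is not fast enough, and large files cannot be read in or written out quickly.

Many tools have appeared in recent years to address this. To stay close to my existing habits, I want to tackle it with data.table and tidyfst. I hope the road will be smooth. Will the results turn out as I expect?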

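The tidyfst package (fstpackage/fst)

Its advantages:

1. Fast reading and writing of data frames.

2. File compression. A data frame can be compressed when it is saved, which cuts the time needed to move large data around (copying from a hard drive to a computer, or uploading to a server). The compression ratio is impressive. One parameter controls the compression level, and I usually set it to the maximum. I asked the original author about this, and he explained that there are 100 compression levels; reading and writing are fastest with no compression, but even with heavy compression they remain very fast. My own tests confirm this, so I always use the maximum level, and I wrapped his function to change the default compression from 50 to 100.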
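
To make the compression setting concrete, here is a minimal sketch that calls fst's own `write_fst()` directly; its `compress` argument runs from 0 to 100 with a default of 50. The toy data and the temporary file names are assumptions made for illustration, and real sizes and timings depend on the data.

```r
library(fst)

# Toy data (illustrative only): a repetitive column compresses well
df <- data.frame(
  id    = 1:100000,
  group = sample(c("a", "b", "c"), 100000, replace = TRUE)
)

# Write the same data frame at three compression levels
levels <- c(0, 50, 100)
paths  <- file.path(tempdir(), paste0("df_", levels, ".fst"))
for (i in seq_along(levels)) {
  write_fst(df, paths[i], compress = levels[i])
}

# Compare file sizes in bytes
sizes <- setNames(file.size(paths), paste0("compress_", levels))
print(sizes)

# Reading back returns the full data frame whatever level was used
back <- read_fst(paths[3])
unlink(paths)
```

Wrapping `system.time()` around each `write_fst()` call is one way to check the speed claim on your own data.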

Testing operations on the fst format

Why test this? Because fst is faster.

Build a very large data frame; the code is adapted from hopeR.

library(tidyfst)

# Build a data frame with 100 million rows and 4 columns
nr_of_rows <- 1e8

df <- data.table(
  Logical = sample(c(TRUE, FALSE, NA), prob = c(0.85, 0.1, 0.05), nr_of_rows, replace = TRUE),
  Integer = sample(1L:100L, nr_of_rows, replace = TRUE),
  Real = sample(sample(1:10000, 20) / 100, nr_of_rows, replace = TRUE),
  Factor = as.factor(sample(labels(UScitiesD), nr_of_rows, replace = TRUE))
)

Print the size of the object in memory

head(df)

object.size(df) %>% print(unit = "auto")

Now test saving and check how long it takes. sys_time_print is a timing function that the author wrapped into tidyfst.

# ?export_fst

sys_time_print({
  export_fst(df,"./df.fst")
})

# Remove the df data frame once the file has been written
rm(df)

Read in the fst object

parse_fst("./df.fst") -> ft

##-- printing ft directly raised an error
# ft
head(ft)

colnames(ft)

Fast frequency counts

Functions that work on fst data carry the suffix _fst. Here select_fst is used to select columns.

 sys_time_print({
   ft %>% 
     select_fst(Logical) %>% 
     count_dt(Logical) -> res
 })

res

slice_fst selects rows. The selected rows are then grouped and the mean of each group is computed.

sys_time_print({
   ft %>% 
     slice_fst(1:1000) %>% 
     group_dt(
       by = Factor,
       summarise_dt(avg_int = mean(Integer)) 
     )-> res
 })

res

filter_fst filters rows by a condition. count_dt counts frequencies.

sys_time_print({
   ft %>% 
     filter_fst(Real >= 50) %>% 
     count_dt(Factor)-> res
 })

res

Delete the local file

unlink("./df.fst")

A systematic walk through tidyfst

The functions in this package are fast, so I plan to use it to explore metagenomic data.

1 arrange_dt: sorting

#-- data used

data(iris)

#--- sort by the values of a column (ascending)
iris %>% arrange_dt(Sepal.Length)

iris

# sort in descending order
iris %>% arrange_dt(-Sepal.Length)
# two-level sort: first by the first column, then by the second column within ties
iris %>% arrange_dt(Sepal.Length,Petal.Length)

2 as_fst: convert a data frame into an fst object

iris %>%
as_fst() -> iris_fst

head(iris_fst)

3 complete_dt

Expand a data frame to every combination of the specified columns and return the result

Complete a data frame with missing combinations of data

df <- data.table(
group = c(1:2, 1),
item_id = c(1:2, 2),
item_name = c("a", "b", "b"),
value1 = 1:3,
value2 = 4:6
)

df

df %>% complete_dt(item_id,item_name)
df %>% complete_dt(item_id,item_name,fill = 0)
df %>% complete_dt("item")
df %>% complete_dt(item_id=1:3)
df %>% complete_dt(item_id=1:3,group=1:2)
df %>% complete_dt(item_id=1:3,group=1:3,item_name=c("a","b","c"))
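
The idea behind completing combinations can be sketched in plain data.table: build the full grid with `CJ()` and left-join the original rows onto it. This is only an illustration of the concept under that assumption, not tidyfst's actual implementation, and the reduced example table is made up for brevity.

```r
library(data.table)

df <- data.table(
  item_id   = c(1, 2, 2),
  item_name = c("a", "b", "b"),
  value1    = 1:3
)

# Every combination of the unique values of the two columns
combos <- CJ(item_id = unique(df$item_id), item_name = unique(df$item_name))

# Left-join the original data onto the full grid; missing pairs get NA
completed <- merge(combos, df, by = c("item_id", "item_name"), all.x = TRUE)
print(completed)
```

Pairs that never occurred in the data, here (1, "b") and (2, "a"), show up with `NA` in `value1`.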

4 count_dt: frequency counts

iris %>% count_dt(Sepal.Width)

#- name the count column
iris %>% count_dt(Species,.name = "count")
# count and append the result to the original data as a new column
iris %>% add_count_dt(Species)
# name the appended column
iris %>% add_count_dt(Species,.name = "N")
# count by two grouping columns
mtcars %>% count_dt(cyl,vs)
# rename the count column; results are sorted by default, here sorting is switched off
mtcars %>% count_dt(cyl,vs,.name = "N",sort = FALSE)
# append to the original data
mtcars %>% add_count_dt(cyl,vs)

5 cummean: cumulative mean

cummean(1:10)

6 distinct_dt: remove duplicates

iris %>% distinct_dt()
iris %>% distinct_dt(Species)
iris %>% distinct_dt(Species,.keep_all = TRUE)
mtcars %>% distinct_dt(cyl,vs)
mtcars %>% distinct_dt(cyl,vs,.keep_all = TRUE)

7 drop_na_dt: drop rows containing NA

df <- data.table(x = c(1, 2, NA), y = c("a", NA, "b"))

df
# drop every row that contains an NA
df %>% drop_na_dt()
# drop rows with NA in column x
df %>% drop_na_dt(x)
# drop rows with NA in column y
df %>% drop_na_dt(y)
# drop rows with NA in column x or y
df %>% drop_na_dt(x,y)

# replace NA with 0
df %>% replace_na_dt(to = 0)
df %>% replace_na_dt(x,to = 0)
df %>% replace_na_dt(y,to = 0)
df %>% replace_na_dt(x,y,to = 0)

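# filling missing values
# fill column x only
df %>% fill_na_dt(x)
# fill all columns
df %>% fill_na_dt() # not specified, fill all columns
# fill from the value in the next row
df %>% fill_na_dt(y,direction = "up")

# the NA in x sits in the last row, so there is nothing below it to fill from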
df %>% fill_na_dt(x,direction = "up")

x = data.frame(x = c(1, 2, NA, 3), y = c(NA, NA, 4, 5),z = rep(NA,4))
x
#-- delete columns that are entirely NA
x %>% delete_na_cols()
#- delete columns whose proportion of NA is at least 0.75
x %>% delete_na_cols(prop = 0.75)
x %>% delete_na_cols(prop = 0.5)
x %>% delete_na_cols(prop = 0.24)
# delete columns with at least 2 NAs
x %>% delete_na_cols(n = 2)
# delete rows whose proportion of NA is at least 0.6
x %>% delete_na_rows(prop = 0.6)
# delete rows with at least 2 NAs
x %>% delete_na_rows(n = 2)

# shift_fill
y = c("a",NA,"b",NA,"c")
y
# fill missing values from neighbouring entries
# the default call is equivalent to the "down" call that follows
shift_fill(y)
shift_fill(y,"down")

shift_fill(y,"up")
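
The filling logic can be sketched in base R to show what "down" and "up" mean. `locf()` is a hypothetical helper written for this illustration; it is not part of tidyfst and says nothing about how `shift_fill` is implemented.

```r
# Hypothetical helper: carry the last non-missing value forward ("down")
locf <- function(x) {
  # Index of the most recent non-NA entry at each position (0 if none yet)
  idx <- cummax(ifelse(is.na(x), 0L, seq_along(x)))
  out <- x
  out[idx > 0] <- x[idx[idx > 0]]
  out
}

y <- c("a", NA, "b", NA, "c")

down <- locf(y)            # each NA takes the value before it
up   <- rev(locf(rev(y)))  # each NA takes the value after it

print(down)
print(up)
```

A leading NA has no earlier value to copy from, so `locf()` leaves it as NA.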

8 dummy_dt: dummy (one-hot) encoding, which spreads a column's values into wide columns

iris %>% dummy_dt(Species)
# use the original values as column names, without the source column name as a prefix
iris %>% dummy_dt(Species,longname = FALSE)
## encode two columns at once
mtcars %>% head() %>% dummy_dt(vs,am)

mtcars %>% head() %>% dummy_dt("cyl|gear")

9 export_fst: save data in fst format

export_fst(iris,"iris_fst_test.fst")
iris_dt = import_fst("iris_fst_test.fst")
iris_dt
unlink("iris_fst_test.fst")

10 filter_dt: filter rows

iris %>% filter_dt(Sepal.Length > 7)
iris %>% filter_dt(Sepal.Length > 7,Sepal.Width > 3)
iris %>% filter_dt(Sepal.Length > 7 & Sepal.Width > 3)
iris %>% filter_dt(Sepal.Length == max(Sepal.Length))

11 slice_fst: select rows; select_fst: select columns; filter_fst: filter rows by condition

These functions work on the fst file itself, which shortens the run time further. They are essential for big data.

## Not run:
fst::write_fst(iris,"iris_test.fst")

# parse the file but not reading it
parse_fst("iris_test.fst") -> ft
# ft
class(ft)
lapply(ft,class)
names(ft)
dim(ft)
# select the first three rows
ft %>% slice_fst(1:3)
# select rows 1 and 3
ft %>% slice_fst(c(1,3))

ft %>% select_fst(Sepal.Length)
ft %>% select_fst(Sepal.Length,Sepal.Width)
ft %>% select_fst("Sepal.Length")
ft %>% select_fst(1:3)
ft %>% select_fst(1,3)
ft %>% select_fst("Se")
ft %>% select_fst("nothing")
ft %>% select_fst("Se|Sp")
ft %>% select_fst(cols = names(iris)[2:3])
ft %>% filter_fst(Sepal.Width > 3)
ft %>% filter_fst(Sepal.Length > 6 , Species == "virginica")
ft %>% filter_fst(Sepal.Length > 6 & Species == "virginica" & Sepal.Width < 3)
unlink("iris_test.fst")
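
The time saving comes from reading only the rows and columns that are asked for. A minimal sketch with fst's own API shows the idea; the temporary file name is an assumption made for illustration.

```r
library(fst)

path <- file.path(tempdir(), "iris_sketch.fst")
write_fst(iris, path)

# Read only two columns and the first three rows from disk
part <- read_fst(path, columns = c("Sepal.Length", "Species"), from = 1, to = 3)
print(part)

# fst() returns a lightweight reference; subsetting it triggers the actual read
ft  <- fst(path)
sub <- ft[1:3, c("Sepal.Length", "Species")]
print(sub)

unlink(path)
```

With a file of 100 million rows, the same calls would still touch only the requested slice rather than loading the whole table.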

12 group_by_dt: grouping

Combined with head, this lets you compute on the first few rows of each group. Combined with sorting, it can be used to summarise the entries with the highest or lowest abundance.

# aggregation after grouping using group_exe_dt
as.data.table(iris) -> a

# ?group_exe_dt
#--- set the grouping; head() is then applied within each group (rarely needed on its own)
a %>%
  group_by_dt(Species) %>%
  group_exe_dt(head(3))
a
#---- set the grouping, then compute on the first four rows of each group
a %>%
  group_by_dt(Species) %>%
  group_exe_dt(
    head(4) %>%
      summarise_dt(sum = mean(Sepal.Length))
  )
#-- group by two columns
mtcars %>%
  group_by_dt("cyl|am") %>%
  group_exe_dt(
    summarise_dt(mpg_sum = sum(mpg))
  )
# same as the previous call
mtcars %>%
  group_by_dt(cols = c("cyl","am")) %>%
  group_exe_dt(
    summarise_dt(mpg_sum = sum(mpg))
  )
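
For comparison, the same "first rows per group" pattern can be written in plain data.table. This is a sketch of the equivalent data.table idiom, not the code tidyfst runs internally.

```r
library(data.table)

dt <- as.data.table(iris)

# First three rows of every species
first3 <- dt[, head(.SD, 3), by = Species]
print(first3)

# Mean sepal length over the first four rows of every species
avg4 <- dt[, .(avg = mean(head(Sepal.Length, 4))), by = Species]
print(avg4)
```

Sorting first, for example `dt[order(-Sepal.Length)]`, would turn this into "top rows per group".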

13 group_dt: grouped computation

#-- take the first three rows of each group
iris %>% group_dt(by = Species,slice_dt(1:3))

#-- find the row with the maximum in each group, keeping the other columns
iris %>% group_dt(Species,filter_dt(Sepal.Length == max(Sepal.Length)))

#-- compute the maximum per group; only the summary column is returned
iris %>% group_dt(Species,summarise_dt(new = max(Sepal.Length)))

# add a column, then sum that column within each group
iris %>% group_dt(Species,
                  mutate_dt(max= max(Sepal.Length)) %>%
                    summarise_dt(sum=sum(max)))

# the .SD symbol can be used directly
# take the first and last row of each group
iris %>%group_dt(
  by = Species,
  rbind(.SD[1],.SD[.N])
)
# summarise_dt has a built-in by argument, so grouping can be done inside the call
mtcars %>%
  summarise_dt(
    disp = mean(disp),
    hp = mean(hp),
    by = cyl
  )
# or group with group_dt
mtcars %>%
  group_dt(by =.(vs,am),
           summarise_dt(avg = mean(mpg)))

# as in data.table, .() here is equivalent to list()
mtcars %>%
  group_dt(by =list(vs,am),
           summarise_dt(avg = mean(mpg)))

# mutate_dt adds a column; mean() returns a single value for the two rows, so it is recycled
df <- data.table(x = 1:2, y = 3:4, z = 4:5)
df
df %>% mutate_dt(m = mean(c(x, y, z)))
#- row-wise version: the mean is computed within each row, so the result differs from the call above
df %>% rowwise_dt(
  mutate_dt(m = mean(c(x, y, z)))
)
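
The difference between the two calls can be checked in plain data.table. This is a sketch that assumes grouping by row number is a fair stand-in for the row-wise behaviour; it is not tidyfst's own code.

```r
library(data.table)

df <- data.table(x = 1:2, y = 3:4, z = 4:5)

# No grouping: one mean over all six values, recycled to both rows
overall <- copy(df)[, m := mean(c(x, y, z))]

# Grouping by row number: one mean per row
by_row <- copy(df)[, m := mean(c(x, y, z)), by = seq_len(nrow(df))]

print(overall$m)
print(by_row$m)
```

The first result repeats 19/6 in both rows, while the second gives 8/3 and 11/3.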

14 in_dt: a general-purpose function

Sort, then pull rows out of the sorted data by group. This is very useful for microbiome data.

iris %>% as_dt()
#-- sort, then take the first row of each group
iris %>% in_dt(order(-Sepal.Length),.SD[1],by=Species)
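
The three arguments in that call line up with data.table's `[i, j, by]` form. The sketch below writes the same query directly in data.table, for readers who want to see the correspondence.

```r
library(data.table)

dt <- as.data.table(iris)

# i = order(...) sorts the rows, j = .SD[1] keeps the first row, by = grouping
top1 <- dt[order(-Sepal.Length), .SD[1], by = Species]
print(top1)
```

Each species is represented by its row with the largest `Sepal.Length`.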

15 lead_dt and lag_dt: shift a vector forwards or backwards

lead_dt(1:5)
lag_dt(1:5)
lead_dt(1:5,2)
lead_dt(1:5,n = 2,fill = 0)

16 _join_dt: the most important group of functions, joining data frames

#-- build a data.table object

workers = fread("
name company
Nick Acme
John Ajax
Daniela Ajax
")
#- build a second data.table object
positions = fread("
name position
John designer
Daniela engineer
Cathie manager
")

# ?inner_join
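#-- joining data frames
#-- inner join: keep rows present in both tables
workers %>% inner_join_dt(positions)
#- left join: keep all rows of the left table
workers %>% left_join_dt(positions)
# right join: keep all rows of the right table
workers %>% right_join_dt(positions)
#- full join: keep all rows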
workers %>% full_join_dt(positions)

# rows found only in the left table
workers %>% anti_join_dt(positions)
#- rows of the left table that have a match in the right table
workers %>% semi_join_dt(positions)

# name the join column with the by argument
workers %>% left_join_dt(positions, by = "name")
# rename
positions2 = setNames(positions, c("worker", "position")) # rename first column in 'positions'
#-- if the key columns have different names in the two tables, pair them with an equals sign
workers %>% inner_join_dt(positions2, by = c("name" = "worker"))
# equivalent
workers %>% ijoin(positions2,by = "name==worker")

#- the two ways of merging give the same result
x= data.table(a=1:5,a1 = 2:6,b=11:15)
y= data.table(a=c(1:4,6), a1 = c(1,2,4,5,1),c=c(101:104,106))
# merge on the shared columns by default
merge(x,y,all = TRUE) -> a
#-- join on two columns
fjoin(x,y,by = c("a","a1")) -> b
data.table::setcolorder(a,names(b))
fsetequal(a,b)
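
For readers more familiar with data.table itself, the join types above correspond roughly to the following calls. This is a sketch of the data.table equivalents, rebuilt on the same two small tables, and not a description of tidyfst internals.

```r
library(data.table)

workers <- data.table(
  name    = c("Nick", "John", "Daniela"),
  company = c("Acme", "Ajax", "Ajax")
)
positions <- data.table(
  name     = c("John", "Daniela", "Cathie"),
  position = c("designer", "engineer", "manager")
)

inner <- merge(workers, positions, by = "name")                # rows in both
left  <- merge(workers, positions, by = "name", all.x = TRUE)  # all workers
full  <- merge(workers, positions, by = "name", all = TRUE)    # all rows
anti  <- workers[!positions, on = "name"]                      # workers without a match
semi  <- workers[name %in% positions$name]                     # workers with a match

print(full)
```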

16 longer_dt: reshape data from wide to long

## build the data
stocks = data.frame(
  time = as.Date('2009-01-01') + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)

stocks
# reshape from wide to long

stocks %>%
  longer_dt(time)

#-- a partial match on the column name is enough
stocks %>%
  longer_dt("ti")
#- I could not find the "billboard" dataset, so this part was not run
# library(tidyr)
# # install.packages("billboard")
# library("billboard")
# data(billboard)
# 
# 
# billboard %>%
#   longer_dt(
#     -"wk",
#     name = "week",
#     value = "rank",
#     na.rm = TRUE
#   )
# 
# billboard
# # or use:
# billboard %>%
#   longer_dt(
#     artist,track,date.entered,
#     name = "week",
#     value = "rank",
#     na.rm = TRUE
#   )
# # or use:
# billboard %>%
#   longer_dt(
#     1:3,
#     name = "week",
#     value = "rank",
#     na.rm = TRUE
#   )

17 df_mat: fast conversion between a matrix and a long-format table

This is very useful for network analysis and correlation analysis.

mm = matrix(c(1:8,NA),ncol = 3,dimnames = list(letters[1:3],LETTERS[1:3]))
mm

#-- matrix to long table
tdf = mat_df(mm)
tdf

#-- long table back to matrix
mat = df_mat(tdf,row,col,value)
mat

setequal(mm,mat)

tdf %>%
  setNames(c("A","B","C")) %>%
  df_mat(A,B,C)
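
As an example of the network and correlation use case, a correlation matrix can be turned into an edge list and back. The sketch below uses base R only to show the idea; the choice of `mtcars` columns is arbitrary, and this is not how `mat_df` or `df_mat` are implemented.

```r
# Correlation matrix of a few mtcars columns
cm <- cor(mtcars[, c("mpg", "hp", "wt")])

# Matrix -> long table: one row per (row, col) pair
long <- as.data.frame(as.table(cm), stringsAsFactors = FALSE)
names(long) <- c("row", "col", "value")

# Edge list: keep each pair once and drop the diagonal
edges <- long[long$row < long$col, ]
print(edges)

# Long table -> matrix again, restored to the original ordering
back <- tapply(long$value, list(long$row, long$col), sum)
back <- back[rownames(cm), colnames(cm)]
```

An edge list in this shape can be handed to network packages as a list of weighted edges.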

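18 mutate_dt: add new columns

#-- add a new column after the existing ones (Sepal.Length is updated in place)
iris %>% mutate_dt(one = 1,Sepal.Length = Sepal.Length + 1)
#--- keep only the columns created in the call and drop the rest
iris %>% transmute_dt(one = 1,Sepal.Length = Sepal.Length + 1)

#  `.GRP` adds a group index; pay close attention to special symbols like this one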
iris %>% mutate_dt(id = 1:.N,grp = .GRP,by = Species)

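18 mutate_when and mutate_vars: conditional and multi-column updates

Add or change columns for rows that meet a condition, or apply an operation to several columns chosen by a condition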

iris[3:8,]
#- change values only where the condition holds
iris[3:8,] %>%
  mutate_when(Petal.Width == .2,
              one = 1,Sepal.Length=2)

#-- standardise the columns whose names match the pattern
iris %>% mutate_vars("Pe",scale)
#-- standardise every numeric column
iris %>% mutate_vars(is.numeric,scale)
#-- standardise every column that is not a factor
iris %>% mutate_vars(-is.factor,scale)
# standardise the first two columns
iris %>% mutate_vars(1:2,scale)
#-- convert every column to character
iris %>% mutate_vars(.func = as.character)
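
The conditional update can also be written directly in data.table, which makes it clear that only matching rows are touched. This is a sketch of the equivalent data.table idiom on the same six rows, not tidyfst's own implementation.

```r
library(data.table)

dt <- as.data.table(iris[3:8, ])

# Update Sepal.Length only where Petal.Width equals 0.2; other rows are untouched
dt[Petal.Width == 0.2, Sepal.Length := 2]
print(dt)
```

Rows with `Petal.Width` of 0.4 and 0.3 keep their original `Sepal.Length`.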

Part two

19 nest_dt: converting between a data frame and list columns

library(tidyfst)

#- split the data frame by group into a nested column
a = mtcars %>% nest_dt(cyl)
# inspect the structure
# str(a)
#- view the list column
# a[[2]]

mtcars %>% nest_dt("cyl")
mtcars %>% nest_dt(cyl,vs)
mtcars %>% nest_dt(vs:am)
mtcars %>% nest_dt("cyl|vs")
mtcars %>% nest_dt(c("cyl","vs"))
# nest two sets of columns, giving two list columns
a = iris %>% nest_dt(mcols = list(petal="^Pe",sepal="^Se"))
# #- view the second list column
# a[[3]]
#-- unnest; ndt is the nested column to expand
mtcars %>% nest_dt("cyl|vs") %>%
  unnest_dt(ndt)
mtcars %>% nest_dt("cyl|vs") %>%
  unnest_dt("ndt")

#--- list columns and ordinary columns can be built in the same table
df <- data.table(
  a = list(c("a", "b"), "c"),
  b = list(c(TRUE,TRUE),FALSE),
  c = list(3,c(1,2)),
  d = c(11, 22)
)

# str(df)

20 nth: extract a value from a vector

Extract the value at a given position. A negative number counts backwards from the end.

x = 1:10
nth(x, 1)
nth(x, 5)
nth(x, -2)
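
The negative-index rule can be written out in a few lines of base R. `nth_sketch()` is a hypothetical helper for illustration only and is not tidyfst's implementation.

```r
# Hypothetical helper: positive n counts from the start, negative n from the end
nth_sketch <- function(x, n) {
  if (n > 0) x[n] else x[length(x) + n + 1]
}

x <- 1:10
first       <- nth_sketch(x, 1)   # first element
second_last <- nth_sketch(x, -2)  # second element from the end

print(first)
print(second_last)
```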

22 pull_dt: extract a single variable from a data frame (as a vector)

Can you extract two columns at once? No, you cannot.

#- these three calls return the same result
mtcars %>% pull_dt(2)
mtcars %>% pull_dt(cyl)
mtcars %>% pull_dt("cyl")

#- check the column names
colnames(mtcars)

24 relocate_dt: change the position of columns

This tool is very powerful and will be very useful in microbiome work as well.

df <- data.table(a = 1, b = 1, c = 1, d = "a", e = "a", f = "a")
df
#- move column f to the front
df %>% relocate_dt(f)
# move column a to the end
df %>% relocate_dt(a,how = "last")
# move the character columns to the front
df %>% relocate_dt(is.character)
# move the numeric columns to the end
df %>% relocate_dt(is.numeric, how = "last")
#-- move the columns whose names match the regular expression to the front
df %>% relocate_dt("[aeiou]")
#- put a after f
df %>% relocate_dt(a, how = "after",where = f)
#- put f before a
df %>% relocate_dt(f, how = "before",where = a)
# put f before c
df %>% relocate_dt(f, how = "before",where = c)
df %>% relocate_dt(f, how = "after",where = c)

df2 <- data.table(a = 1, b = "a", c = 1, d = "a")
#- move the numeric columns after the character columns
df2 %>% relocate_dt(is.numeric,
                    how = "after",
                    where = is.character)
df2 %>% relocate_dt(is.numeric,
                    how="before",
                    where = is.character)

25 rename_dt: rename columns

#- rename with an equals sign, written as new_name = old_name
iris %>%
  rename_dt(sl = Sepal.Length,sw = Sepal.Width) %>%
  head()

26 replace_dt: replace values in a column (by condition)

iris %>% mutate_vars(is.factor,as.character) -> new_iris
#- name the column and replace string values
new_iris %>%
  replace_dt(Species, from = "setosa",to = "SS")
new_iris %>%
  replace_dt(Species,from = c("setosa","virginica"),to = "sv")
#- replace numeric values
new_iris %>%
  replace_dt(Petal.Width, from = .2,to = 2)
new_iris %>%
  replace_dt(from = .2,to = NA)
#- use a function as the matching condition
new_iris %>%
  replace_dt(is.numeric, from = function(x) x > 3, to = 9999 )

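27 rn_col: move row names into a column, and back

#-- move the row names into the first column
mtcars %>% rn_col()
# move the row names into the first column and name it rn
mtcars %>% rn_col("rn")
#- assign the result to a new data frame
mtcars %>% rn_col() -> new_mtcars
#-- reverse it: turn the first column back into row names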
new_mtcars %>% col_rn() -> old_mtcars
old_mtcars
setequal(mtcars,old_mtcars)

28 sample_n_dt: random sampling of rows

#-- sample rows
sample_n_dt(mtcars, 10)
#-- sample rows with replacement
sample_n_dt(mtcars, 50, replace = TRUE)
#- sample a fraction of the rows
sample_frac_dt(mtcars, 0.1)
# with replacement, more rows can be drawn than the data contains
sample_frac_dt(mtcars, 1.5, replace = TRUE)
#-- another way to write it
sample_dt(mtcars,n=10)
sample_dt(mtcars,prop = 0.1)

29 select_dt: the column-selection toolbox

#--- select_dt covers a lot of ground, and many of its features are very handy
#-- pick one column
iris %>% select_dt(Species)
#- pick two columns
iris %>% select_dt(Sepal.Length,Sepal.Width)
#- pick every column from the first to the second
iris %>% select_dt(Sepal.Length:Petal.Length)
# drop one column
iris %>% select_dt(-Sepal.Length)
#-- drop two columns
iris %>% select_dt(-Sepal.Length,-Petal.Length)
# drop every column in the range between these two
iris %>% select_dt(-(Sepal.Length:Petal.Length))
#-- column names given as strings work the same way
iris %>% select_dt(c("Sepal.Length","Sepal.Width"))
iris %>% select_dt(-c("Sepal.Length","Sepal.Width"))

#-- column numbers work the same way
iris %>% select_dt(1)
iris %>% select_dt(-1)
iris %>% select_dt(1:3)
iris %>% select_dt(-(1:3))
iris %>% select_dt(1,3)
#-- partial name matching and logical operators are supported
iris %>% select_dt("Pe")
iris %>% select_dt(-"Se")
iris %>% select_dt(!"Se")
?select_dt
iris %>% select_dt("Pe",negate = TRUE)
iris %>% select_dt("Pe|Sp")
iris %>% select_dt(cols = 2:3)
#-- with negate = TRUE, the columns that do not match are returned
iris %>% select_dt(cols = 2:3,negate = TRUE)
iris %>% select_dt(cols = c("Sepal.Length","Sepal.Width"))
iris %>% select_dt(cols = names(iris)[2:3])
iris %>% select_dt(is.factor)
iris %>% select_dt(-is.factor)
iris %>% select_dt(!is.factor)
# select_mix is very flexible: several kinds of selector can be combined in one call
select_mix(iris, Species,"Sepal.Length")
select_mix(iris,1:2,is.factor)
select_mix(iris,Sepal.Length,is.numeric)
# rm.dup: whether to remove duplicated columns
select_mix(iris,Sepal.Length,is.numeric,rm.dup = FALSE)

30 separate_dt: split strings

Very useful for taxonomic annotation data

#-- split a string column
df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))
df
df %>% separate_dt(x, c("A", "B"))
# equals to
df %>% separate_dt("x", c("A", "B"))
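
For taxonomy strings, the same kind of split can be sketched with data.table's `tstrsplit()`. The OTU names, the taxonomy strings and the `;` separator below are made-up assumptions for illustration; real annotation files may use a different layout.

```r
library(data.table)

# Made-up annotation table (illustrative only)
tax <- data.table(
  otu      = c("OTU1", "OTU2"),
  taxonomy = c("k__Bacteria;p__Firmicutes;c__Bacilli",
               "k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria")
)

# Split the taxonomy string on ";" into one column per rank
tax[, c("Kingdom", "Phylum", "Class") := tstrsplit(taxonomy, ";", fixed = TRUE)]
print(tax)
```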

31 slice_dt: select rows by position

iris %>% slice_dt(1:3)
iris %>% slice_dt(1,3)
iris %>% slice_dt(c(1,3))

31 summarise_dt: summary statistics for a data frame

#-- mean of one column
iris %>% summarise_dt(avg = mean(Sepal.Length))

# the by argument: mean per group
iris %>% summarise_dt(avg = mean(Sepal.Length),by = Species)
#- mean by several grouping columns
mtcars %>% summarise_dt(avg = mean(hp),by = .(cyl,vs))
# count rows per group
mtcars %>% summarise_dt(cyl_n = .N, by = .(cyl, vs)) # `.()` is short for list()
#-- minimum of every numeric column
iris %>% summarise_vars(is.numeric,min)
# same as above
iris %>% summarise_vars(-is.factor,min)
# minimum of the first four columns
iris %>% summarise_vars(1:4,min)
#- convert every column to character
iris %>% summarise_vars(.func = as.character)
#- minimum of the numeric columns within each group
iris %>% summarise_vars(is.numeric,min,by ="Species")

#- to group by two columns, give them as one quoted, comma-separated string
mtcars %>% summarise_vars(is.numeric,mean,by = "vs,am")

32 sys_time_print: measure run time

sys_time_print(Sys.sleep(1))
a = iris

#-- tidyfst is mostly used on large data, where run time matters, so it provides a function for timing code
sys_time_print({
  res = iris %>%
    mutate_dt(one = 1)
})
res

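33 top_n_dt: extract the top rows (by a ranking column)

#-- the 10 rows with the largest Sepal.Length
iris %>% top_n_dt(10,Sepal.Length)
#- a negative n gives the 10 rows with the smallest Sepal.Length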
iris %>% top_n_dt(-10,Sepal.Length)

iris %>% top_frac_dt(.1,Sepal.Length)

iris %>% top_frac_dt(-.1,Sepal.Length)

# For `top_dt`, you can use both modes above
iris %>% top_dt(Sepal.Length,n = 10)
iris %>% top_dt(Sepal.Length,prop = .1)

34 t_dt: transpose a data frame

?t_dt

t_dt(iris)
t_dt(mtcars)

35 uncount_dt: expand counts into individual rows

df <- data.table(x = c("a", "b"), n = c(1, 2))

df
#- expand the counts into one row per observation
uncount_dt(df, n)
#- FALSE keeps the count column n in the expanded output
uncount_dt(df,n,FALSE)

36 unite_dt: merge several columns into one

This helps a lot when handling taxonomic annotation data in metagenomics

df <- expand.grid(x = c("a", NA), y = c("b", NA))
df
# Treat missing value as character "NA"
df %>% unite_dt("z", x:y, remove = FALSE)

# with na.rm = TRUE, missing values are dropped before the columns are pasted together
df %>% unite_dt("z", x:y, na.rm = TRUE, remove = FALSE)

# by default, missing values are kept
df %>%
  unite_dt("xy", x:y)

# merge all of the columns into one
iris %>% unite_dt("merged_name","")
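
For annotation data, uniting rank columns back into one taxonomy string can be sketched in base R with `paste()`. The rank table and the `;` separator are made-up assumptions for illustration.

```r
# Made-up rank table (illustrative only); the last Class value is missing
ranks <- data.frame(
  Kingdom = c("k__Bacteria", "k__Bacteria"),
  Phylum  = c("p__Firmicutes", "p__Proteobacteria"),
  Class   = c("c__Bacilli", NA),
  stringsAsFactors = FALSE
)

# Paste the rank columns together row by row
ranks$taxonomy <- do.call(paste, c(ranks[c("Kingdom", "Phylum", "Class")], sep = ";"))
print(ranks$taxonomy)
```

As with the default behaviour described above, `paste()` turns a missing value into the text "NA", so missing ranks need separate handling if that is not wanted.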

37 utf8_encoding: encode a data frame as UTF-8

This is helpful for Chinese text

utf8_encoding(iris)

38 wider_dt: reshape data from long to wide

#- build the data and convert it to long format
stocks = data.frame(
  time = as.Date('2009-01-01') + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
) %>%
  longer_dt(time) -> longer_stocks

longer_stocks
#- reshape the long data to wide
longer_stocks %>%
  wider_dt("time",
           name = "name",
           value = "value")

# add a constant column and use it as the value when widening
longer_stocks %>%
  mutate_dt(one = 1) %>%
  wider_dt("time",
           name = "name",
           value = "one")

## using "fun" parameter for aggregation
DT <- data.table(v1 = rep(1:2, each = 6),
                 v2 = rep(rep(1:3, 2), each = 2),
                 v3 = rep(1:2, 6),
                 v4 = rnorm(6))

DT
## use two columns as the grouping labels, then compute the sum
DT %>%
  wider_dt(v1,v2,
           value = "v4",
           name = ".",
           fun = sum)
#-- compute the minimum
DT %>%
  wider_dt(v1,v2,
           value = "v4",
           name = ".",
           fun = min)
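
Aggregating while reshaping can also be seen in data.table's `dcast()`, where `fun.aggregate` plays the role of `fun`. The sketch below uses fixed values instead of `rnorm()` so the result is predictable, and it spreads `v2` into columns, which is a different layout from the calls above.

```r
library(data.table)

DT <- data.table(
  v1 = rep(1:2, each = 6),
  v2 = rep(rep(1:3, 2), each = 2),
  v4 = 1:12
)

# One row per v1, one column per v2, cells hold the sum of v4
wide <- dcast(DT, v1 ~ v2, value.var = "v4", fun.aggregate = sum)
print(wide)
```

Each cell combines the two `v4` values that share the same `v1` and `v2`.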

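Afterword

That completes my study of tidyfst's data-processing functions. I have added annotations throughout, so it should be quite easy to follow. About 5% of the code is still unclear to me; for that I will need to read the source code or keep reading the author's documentation.

Once I had finished, it struck me that when I started learning R, dplyr was not yet popular and nobody taught me this kind of tool, so the way I handle data frames carries traces of plyr, apply, and even Perl, with many operations written as for loops. To handle big data, I now have to strip all of that out and change my habits to the readable style of dplyr and tidyr.

All of this was done on example data. It still needs to be tested on real data, which I will do in the next post. I hope it does not disappoint me.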