
      tidyfst: the fastest R package for big-data processing, and more efficient than dplyr

       微生信生物 2021-01-16

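      Preface

      The package was developed by Tian-Yuan Huang (黃天元).

      I first worked through tidyfst from start to finish, which was feasible because the package does not contain many functions: 38 at most by my count.

      As my amplicon work has stabilized, I have been moving into metagenomics, which involves more software and a wider range of platforms, and R has started to show a number of drawbacks:

      • Data processing is not fast enough, and data cannot be read in or written out quickly;

      Many tools have appeared in recent years to address this. To stay close to my existing habits, I want to solve the problem with data.table and tidyfst. I hope it goes smoothly. Will the results match my expectations?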

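      The tidyfst package (fstpackage/fst)

      Its advantages:

      1. Fast reading and writing of data frames

      2. File compression. Data frames can be compressed when they are saved, which cuts the time needed to move large data (from a hard disk to a computer, or when uploading to a server). The compression ratio is impressive, and one parameter controls the compression level; I usually set it to the maximum. I asked the original author, who explained that there are 100 compression levels. Reading and writing are fastest with no compression, but even with heavy compression they remain very fast. My own tests confirm this, so I always use the maximum compression level, and I wrapped his function to change the default compression level to 100 (the default is 50).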

      Testing operations on the fst format

      Why test this? Because fst is faster.

      Build a huge data frame; the code is adapted from hopeR.

      library(tidyfst)

      # Build a data frame with 100 million rows and 4 columns
      nr_of_rows <- 1e8

      df <- data.table(
      Logical = sample(c(TRUE, FALSE, NA), prob = c(0.85, 0.1, 0.05), nr_of_rows, replace = TRUE),
      Integer = sample(1L:100L, nr_of_rows, replace = TRUE),
      Real = sample(sample(1:10000, 20) / 100, nr_of_rows, replace = TRUE),
      Factor = as.factor(sample(labels(UScitiesD), nr_of_rows, replace = TRUE))
      )

      Print the size of the object in memory

      head(df)

      object.size(df) %>% print(unit = "auto")

      Now test saving and check how long it takes. sys_time_print is a helper function that the author provides in tidyfst.

      # ?export_fst

      sys_time_print({
      export_fst(df,"./df.fst")
      })

      # Remove the df data frame when done
      rm(df)

      Read in the fst object

      parse_fst("./df.fst") -> ft

      ##-- Printing ft directly produced an error
      # ft
      head(ft)

      colnames(ft)

      Fast frequency counts

      Functions that operate on fst data carry the suffix _fst; here select_fst selects columns.

      sys_time_print({
      ft %>%
      select_fst(Logical) %>%
      count_dt(Logical) -> res
      })

      res

      slice_fst selects rows. The mean is then computed for each group.

      sys_time_print({
      ft %>%
      slice_fst(1:1000) %>%
      group_dt(
      by = Factor,
      summarise_dt(avg_int = mean(Integer))
      )-> res
      })

      res

      filter_fst filters rows by a condition on a column. count_dt counts frequencies.

      sys_time_print({
      ft %>%
      filter_fst(Real >= 50) %>%
      count_dt(Factor)-> res
      })

      res

      Delete the local file

      unlink("./df.fst")

      Learning tidyfst properly

      The functions in this package are fast, so I plan to use it to explore metagenomic data.

      1 arrange_dt: sorting

      #-- Data used

      data(iris)

      #--- Sort by value
      iris %>% arrange_dt(Sepal.Length)

      iris

      # Sort in descending order
      iris %>% arrange_dt(-Sepal.Length)
      # Two-level sort: sort by the first column, then break ties by the second column
      iris %>% arrange_dt(Sepal.Length,Petal.Length)

      2 as_fst: convert a data frame to an fst object

      iris %>%
      as_fst() -> iris_fst

      head(iris_fst)

      3 complete_dt

      Expand a data frame to every combination of the specified columns and return the result.

      Complete a data frame with missing combinations of data

      df <- data.table(
      group = c(1:2, 1),
      item_id = c(1:2, 2),
      item_name = c("a", "b", "b"),
      value1 = 1:3,
      value2 = 4:6
      )

      df

      df %>% complete_dt(item_id,item_name)
      df %>% complete_dt(item_id,item_name,fill = 0)
      df %>% complete_dt("item")
      df %>% complete_dt(item_id=1:3)
      df %>% complete_dt(item_id=1:3,group=1:2)
      df %>% complete_dt(item_id=1:3,group=1:3,item_name=c("a","b","c"))
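The idea of "completing" a table can be sketched in base R: build every combination of the key columns with expand.grid, then left-join the observed rows onto that grid so that missing combinations appear as NA. This is only an illustration of the concept, assuming the keys are item_id and item_name. It is not tidyfst's implementation, and the names `grid` and `completed` are made up for the example.

```r
# Base-R sketch of completing a table (illustrative, not tidyfst's code)
df <- data.frame(
  group = c(1, 2, 1),
  item_id = c(1, 2, 2),
  item_name = c("a", "b", "b"),
  value1 = 1:3,
  value2 = 4:6,
  stringsAsFactors = FALSE
)

# Every combination of the two key columns
grid <- expand.grid(
  item_id = unique(df$item_id),
  item_name = unique(df$item_name),
  stringsAsFactors = FALSE
)

# Left join: combinations that never occur in df get NA in the other columns
completed <- merge(grid, df, by = c("item_id", "item_name"), all.x = TRUE)
print(completed)
```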

      4 count_dt: frequency counts

      iris %>% count_dt(Sepal.Width)

      #- Name the count column
      iris %>% count_dt(Species,.name = "count")
      # Count and append the counts to the original data as a new column
      iris %>% add_count_dt(Species)
      # Name the appended column
      iris %>% add_count_dt(Species,.name = "N")
      # Count by two grouping variables
      mtcars %>% count_dt(cyl,vs)
      # Rename the count column; results are sorted by default, so sorting is turned off here
      mtcars %>% count_dt(cyl,vs,.name = "N",sort = FALSE)
      # Append the counts to the original data
      mtcars %>% add_count_dt(cyl,vs)

      5 cummean: cumulative mean

      cummean(1:10)
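For reference, a cumulative mean is the running sum divided by the running count. The base-R line below is a sketch of that definition, not the package's source code; `running_mean` is an illustrative name.

```r
# Cumulative mean in base R: running sum divided by running count
x <- 1:10
running_mean <- cumsum(x) / seq_along(x)
print(running_mean)
```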

      6 distinct_dt: remove duplicates

      iris %>% distinct_dt()
      iris %>% distinct_dt(Species)
      iris %>% distinct_dt(Species,.keep_all = TRUE)
      mtcars %>% distinct_dt(cyl,vs)
      mtcars %>% distinct_dt(cyl,vs,.keep_all = TRUE)

      7 drop_na_dt: drop rows containing NA

      df <- data.table(x = c(1, 2, NA), y = c("a", NA, "b"))

      df
      # Drop every row that contains an NA
      df %>% drop_na_dt()
      # Drop rows with NA in column x
      df %>% drop_na_dt(x)
      # Drop rows with NA in column y
      df %>% drop_na_dt(y)
      # Drop rows with NA in column x or y
      df %>% drop_na_dt(x,y)

      # Replace NA with 0
      df %>% replace_na_dt(to = 0)
      df %>% replace_na_dt(x,to = 0)
      df %>% replace_na_dt(y,to = 0)
      df %>% replace_na_dt(x,y,to = 0)

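      # Fill missing values
      # Fill column x only
      df %>% fill_na_dt(x)
      # Fill all columns
      df %>% fill_na_dt() # not specified, fill all columns
      # Fill upward, using the value from the next row
      df %>% fill_na_dt(y,direction = "up")

      # The NA in x is in the last row, so it cannot be filled upward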
      df %>% fill_na_dt(x,direction = "up")

      x = data.frame(x = c(1, 2, NA, 3), y = c(NA, NA, 4, 5),z = rep(NA,4))
      x
      #-- Delete columns that are entirely NA
      x %>% delete_na_cols()
      #- Delete columns whose proportion of NA is at least 0.75
      x %>% delete_na_cols(prop = 0.75)
      x %>% delete_na_cols(prop = 0.5)
      x %>% delete_na_cols(prop = 0.24)
      # Delete columns with at least 2 NAs
      x %>% delete_na_cols(n = 2)
      # Delete rows whose proportion of NA is at least 0.6
      x %>% delete_na_rows(prop = 0.6)
      # Delete rows with at least 2 NAs
      x %>% delete_na_rows(n = 2)
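To see why the three prop values above behave differently, it helps to compute the NA proportion of each column. The base-R sketch below does that and applies a threshold. It assumes a column is dropped when its NA proportion is greater than or equal to prop, which is my reading of the behaviour and not a statement of how tidyfst is written; `drop_na_cols` is a hypothetical helper.

```r
x <- data.frame(x = c(1, 2, NA, 3), y = c(NA, NA, 4, 5), z = rep(NA, 4))

# Proportion of missing values in each column
na_prop <- colMeans(is.na(x))
print(na_prop)

# Hypothetical helper: keep only columns whose NA proportion is below prop
# (assumption: the cut-off is inclusive, i.e. proportion >= prop is dropped)
drop_na_cols <- function(data, prop) {
  data[, colMeans(is.na(data)) < prop, drop = FALSE]
}

print(names(drop_na_cols(x, 0.75)))
print(names(drop_na_cols(x, 0.5)))
print(names(drop_na_cols(x, 0.24)))
```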

      # shift_fill
      y = c("a",NA,"b",NA,"c")
      y
      # Fill
      shift_fill(y) # equals to
      #
      shift_fill(y,"down")

      shift_fill(y,"up")
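Directional filling can be sketched in base R as "carry the last non-missing value forward" (down) or the same operation on the reversed vector (up). The helpers `fill_down` and `fill_up` are hypothetical names written for this illustration; they are not the package's functions.

```r
# Hypothetical helpers illustrating directional filling in base R
fill_down <- function(x) {
  idx <- cumsum(!is.na(x))  # position of the most recent non-missing value
  idx[idx == 0] <- NA       # leading NAs have nothing to copy from
  x[!is.na(x)][idx]
}
fill_up <- function(x) rev(fill_down(rev(x)))

y <- c("a", NA, "b", NA, "c")
print(fill_down(y))
print(fill_up(y))
```

A trailing NA cannot be filled upward and a leading NA cannot be filled downward, which matches the remark about column x earlier in this section.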

      8 dummy_dt: spread a column into dummy (indicator) columns, long to wide

      iris %>% dummy_dt(Species)
      # Use the original level names as column names
      iris %>% dummy_dt(Species,longname = FALSE)
      ## Widen by two columns
      mtcars %>% head() %>% dummy_dt(vs,am)

      mtcars %>% head() %>% dummy_dt("cyl|gear")
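What a dummy (indicator) encoding looks like can be shown in a few lines of base R: one 0/1 column per level, with exactly one 1 in every row. This is a conceptual sketch only; the column naming and return type of dummy_dt may differ.

```r
# One indicator column per level of Species (base-R sketch)
species <- iris$Species
dummies <- sapply(levels(species), function(lv) as.integer(species == lv))
print(head(dummies))
```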

      9 export_fst: save data in fst format

      export_fst(iris,"iris_fst_test.fst")
      iris_dt = import_fst("iris_fst_test.fst")
      iris_dt
      unlink("iris_fst_test.fst")

      10 filter_dt: filter rows

      iris %>% filter_dt(Sepal.Length > 7)
      iris %>% filter_dt(Sepal.Length > 7,Sepal.Width > 3)
      iris %>% filter_dt(Sepal.Length > 7 & Sepal.Width > 3)
      iris %>% filter_dt(Sepal.Length == max(Sepal.Length))

      11 slice_fst: select rows; select_fst: select columns; filter_fst: filter rows

      These functions work directly on the fst format, which shortens processing time further. They are essential for big data.

      ## Not run:
      fst::write_fst(iris,"iris_test.fst")

      # parse the file but not reading it
      parse_fst("iris_test.fst") -> ft
      # ft
      class(ft)
      lapply(ft,class)
      names(ft)
      dim(ft)
      # Select the first three rows
      ft %>% slice_fst(1:3)
      # Select rows 1 and 3
      ft %>% slice_fst(c(1,3))

      ft %>% select_fst(Sepal.Length)
      ft %>% select_fst(Sepal.Length,Sepal.Width)
      ft %>% select_fst("Sepal.Length")
      ft %>% select_fst(1:3)
      ft %>% select_fst(1,3)
      ft %>% select_fst("Se")
      ft %>% select_fst("nothing")
      ft %>% select_fst("Se|Sp")
      ft %>% select_fst(cols = names(iris)[2:3])
      ft %>% filter_fst(Sepal.Width > 3)
      ft %>% filter_fst(Sepal.Length > 6 , Species == "virginica")
      ft %>% filter_fst(Sepal.Length > 6 & Species == "virginica" & Sepal.Width < 3)
      unlink("iris_test.fst")

      12 group_by_dt: grouping

      Combined with head, this lets you compute on the first few rows of each group. Combined with sorting, it lets you summarize the entries with the highest or lowest abundance.

      # aggregation after grouping using group_exe_dt
      as.data.table(iris) -> a

      # ?group_exe_dt
      #--- Set the grouping; head is then applied within each group (this is not used very often)
      a %>%
      group_by_dt(Species) %>%
      group_exe_dt(head(3))
      a
      #---- Set the grouping, then compute on the first four rows of each group
      a %>%
      group_by_dt(Species) %>%
      group_exe_dt(
      head(4) %>%
      summarise_dt(sum = mean(Sepal.Length))
      )
      #-- Compute with two grouping variables
      mtcars %>%
      group_by_dt("cyl|am") %>%
      group_exe_dt(
      summarise_dt(mpg_sum = sum(mpg))
      )
      # Same as the previous call
      mtcars %>%
      group_by_dt(cols = c("cyl","am")) %>%
      group_exe_dt(
      summarise_dt(mpg_sum = sum(mpg))
      )

      13 group_dt: grouped computation

      #-- Extract the first three rows of each group
      iris %>% group_dt(by = Species,slice_dt(1:3))

      #-- Keep the row(s) holding the maximum value in each group, retaining the other columns
      iris %>% group_dt(Species,filter_dt(Sepal.Length == max(Sepal.Length)))

      #-- Compute the maximum per group; only the summary column is returned
      iris %>% group_dt(Species,summarise_dt(new = max(Sepal.Length)))

      # Add a column, then sum that column within each group
      iris %>% group_dt(Species,
      mutate_dt(max= max(Sepal.Length)) %>%
      summarise_dt(sum=sum(max)))

      # The special symbol .SD can be used directly
      # Extract the first and last row of each group
      iris %>%group_dt(
      by = Species,
      rbind(.SD[1],.SD[.N])
      )
      # summarise_dt has a built-in by argument, so grouping can be done inside the call
      mtcars %>%
      summarise_dt(
      disp = mean(disp),
      hp = mean(hp),
      by = cyl
      )
      # Or group with group_dt
      mtcars %>%
      group_dt(by =.(vs,am),
      summarise_dt(avg = mean(mpg)))

      # As in data.table, .() here is equivalent to list()
      mtcars %>%
      group_dt(by =list(vs,am),
      summarise_dt(avg = mean(mpg)))

      # mutate_dt adds a column; mean() returns a single value over all of x, y and z, which is recycled to fill both rows.
      df <- data.table(x = 1:2, y = 3:4, z = 4:5)
      df
      df %>% mutate_dt(m = mean(c(x, y, z)))
      #- Row-wise version: the mean is computed within each row
      df %>% rowwise_dt(
      mutate_dt(m = mean(c(x, y, z)))
      )
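The difference between one mean over all values and one mean per row is easy to check in base R. The sketch below uses the same three columns; `overall` and `per_row` are illustrative names, and rowMeans stands in for the row-wise computation only as an analogy.

```r
df <- data.frame(x = 1:2, y = 3:4, z = 4:5)

# A single mean over all six values; as a new column it would be recycled to every row
overall <- mean(c(df$x, df$y, df$z))

# One mean per row
per_row <- rowMeans(df)

print(overall)
print(per_row)
```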

      14 in_dt: general-purpose function

      Sort, then extract rows from the sorted data by group. This is very useful for microbiome data.

      iris %>% as_dt()
      #-- Sort, then take the first row of each group
      iris %>% in_dt(order(-Sepal.Length),.SD[1],by=Species)
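The same "sort, then take the first row of each group" pattern can be written in base R, which may help when checking the result. This is a sketch of the pattern, not of in_dt itself; `sorted` and `top_per_species` are illustrative names.

```r
# Sort by Sepal.Length (descending), then keep the first row seen for each species
sorted <- iris[order(-iris$Sepal.Length), ]
top_per_species <- sorted[!duplicated(sorted$Species), ]
print(top_per_species)
```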

      15 lead_dt: quickly create lead/lag (shifted) vectors

      lead_dt(1:5)
      lag_dt(1:5)
      lead_dt(1:5,2)
      lead_dt(1:5,n = 2,fill = 0)
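Lead and lag simply shift a vector and pad the gap. The helpers below are hypothetical base-R equivalents written for illustration (they assume n is smaller than the vector length); they are not the package's own code.

```r
# Hypothetical base-R helpers: shift a vector forward or backward by n positions
lead_vec <- function(x, n = 1, fill = NA) c(x[-seq_len(n)], rep(fill, n))
lag_vec <- function(x, n = 1, fill = NA) c(rep(fill, n), x[seq_len(length(x) - n)])

print(lead_vec(1:5))
print(lag_vec(1:5))
print(lead_vec(1:5, n = 2, fill = 0))
```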

      16 _join_dt: the most important group of functions, for merging data frames

      #-- Build a data.table object

      workers = fread("
      name company
      Nick Acme
      John Ajax
      Daniela Ajax
      ")
      #- Build another data.table object
      positions = fread("
      name position
      John designer
      Daniela engineer
      Cathie manager
      ")

      # ?inner_join
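      #-- Merge the data frames
      #-- Inner join: keep only rows present in both tables
      workers %>% inner_join_dt(positions)
      #- Left join: keep all rows of the left table
      workers %>% left_join_dt(positions)
      # Right join: keep all rows of the right table
      workers %>% right_join_dt(positions)
      #- Full join: keep all rows
      workers %>% full_join_dt(positions)

      # Anti join: return rows found only in the left table
      workers %>% anti_join_dt(positions)
      #- Semi join: return rows of the left table that have a match in the right table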
      workers %>% semi_join_dt(positions)

      # Specify the key column with the by argument
      workers %>% left_join_dt(positions, by = "name")
      # Rename
      positions2 = setNames(positions, c("worker", "position")) # rename first column in 'positions'
      #-- If the key columns have different names in the two tables, match them with an equals sign
      workers %>% inner_join_dt(positions2, by = c("name" = "worker"))
      # Equivalent
      workers %>% ijoin(positions2,by = "name==worker")

      #- The two merge approaches give the same result
      x= data.table(a=1:5,a1 = 2:6,b=11:15)
      y= data.table(a=c(1:4,6), a1 = c(1,2,4,5,1),c=c(101:104,106))
      # By default, merge() joins on the column names the two tables share
      merge(x,y,all = TRUE) -> a
      #-- Join on two columns
      fjoin(x,y,by = c("a","a1")) -> b
      data.table::setcolorder(a,names(b))
      fsetequal(a,b)
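For readers who want to cross-check the six join types, each has a direct base-R counterpart built on merge() and %in%. The sketch below rebuilds the workers and positions tables as plain data frames; it illustrates the join semantics only and says nothing about how the _join_dt functions are implemented.

```r
workers <- data.frame(
  name = c("Nick", "John", "Daniela"),
  company = c("Acme", "Ajax", "Ajax"),
  stringsAsFactors = FALSE
)
positions <- data.frame(
  name = c("John", "Daniela", "Cathie"),
  position = c("designer", "engineer", "manager"),
  stringsAsFactors = FALSE
)

inner <- merge(workers, positions, by = "name")                # rows in both
left  <- merge(workers, positions, by = "name", all.x = TRUE)  # all left rows
right <- merge(workers, positions, by = "name", all.y = TRUE)  # all right rows
full  <- merge(workers, positions, by = "name", all = TRUE)    # all rows
anti  <- workers[!workers$name %in% positions$name, ]          # left rows without a match
semi  <- workers[workers$name %in% positions$name, ]           # left rows with a match

print(full)
```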

      16 longer_dt: wide to long

      ## Build the data
      stocks = data.frame(
      time = as.Date('2009-01-01') + 0:9,
      X = rnorm(10, 0, 1),
      Y = rnorm(10, 0, 2),
      Z = rnorm(10, 0, 4)
      )

      stocks
      # Wide to long

      stocks %>%
      longer_dt(time)

      #-- A partial name match is enough
      stocks %>%
      longer_dt("ti")
      #- The "billboard" dataset could not be found, so this part was not run
      # library(tidyr)
      # # install.packages("billboard")
      # library("billboard")
      # data(billboard)
      #
      #
      # billboard %>%
      # longer_dt(
      # -"wk",
      # name = "week",
      # value = "rank",
      # na.rm = TRUE
      # )
      #
      # billboard
      # # or use:
      # billboard %>%
      # longer_dt(
      # artist,track,date.entered,
      # name = "week",
      # value = "rank",
      # na.rm = TRUE
      # )
      # # or use:
      # billboard %>%
      # longer_dt(
      # 1:3,
      # name = "week",
      # value = "rank",
      # na.rm = TRUE
      # )

      17 df_mat: fast conversion between a matrix and a long-format table

      This is very useful for network analysis and correlation analysis.

      mm = matrix(c(1:8,NA),ncol = 3,dimnames = list(letters[1:3],LETTERS[1:3]))
      mm

      #-- Matrix to long table
      tdf = mat_df(mm)
      tdf

      #-- Long table back to matrix
      mat = df_mat(tdf,row,col,value)
      mat

      setequal(mm,mat)

      tdf %>%
      setNames(c("A","B","C")) %>%
      df_mat(A,B,C)
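The round trip between a matrix and a long (row, col, value) table can be sketched in base R with row(), col() and matrix indexing by name. This is an illustration of the data layout under my own naming (`long`, `back`); mat_df and df_mat may order rows or name columns differently.

```r
mm <- matrix(c(1:8, NA), ncol = 3, dimnames = list(letters[1:3], LETTERS[1:3]))

# Matrix to long table: one row per cell
long <- data.frame(
  row = rownames(mm)[row(mm)],
  col = colnames(mm)[col(mm)],
  value = as.vector(mm),
  stringsAsFactors = FALSE
)
print(long)

# Long table back to matrix: allocate, then fill cells by (row name, column name)
back <- matrix(
  NA_integer_,
  nrow = length(unique(long$row)),
  ncol = length(unique(long$col)),
  dimnames = list(unique(long$row), unique(long$col))
)
back[cbind(long$row, long$col)] <- long$value
print(back)
```

A long table like this is the usual edge-list input for network tools, which is why the conversion matters for correlation and network analysis.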

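      18 mutate_dt: add new columns

      #-- Add new columns after the existing ones (existing columns can also be updated)
      iris %>% mutate_dt(one = 1,Sepal.Length = Sepal.Length + 1)
      #--- Keep only the newly computed columns
      iris %>% transmute_dt(one = 1,Sepal.Length = Sepal.Length + 1)

      # `.GRP` adds a group index; pay close attention to these special symbols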
      iris %>% mutate_dt(id = 1:.N,grp = .GRP,by = Species)

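      18 mutate_when, mutate_vars: conditional and multi-column updates

      Add or update columns in the rows that meet a condition, or apply a function to several columns selected by a condition.

      iris[3:8,]
      #- Conditional update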
      iris[3:8,] %>%
      mutate_when(Petal.Width == .2,
      one = 1,Sepal.Length=2)

      #-- Standardize the columns whose names match
      iris %>% mutate_vars("Pe",scale)
      #-- Standardize all numeric columns
      iris %>% mutate_vars(is.numeric,scale)
      #-- Standardize the non-factor columns
      iris %>% mutate_vars(-is.factor,scale)
      # Standardize the first two columns
      iris %>% mutate_vars(1:2,scale)
      #-- Convert every column to character
      iris %>% mutate_vars(.func = as.character)

      Part two

      19 nest_dt: converting between data frames and list columns

      library(tidyfst)

      #- Split the data frame by group
      a = mtcars %>% nest_dt(cyl)
      # Inspect the structure
      # str(a)
      #- Inspect the list column
      # a[[2]]

      mtcars %>% nest_dt("cyl")
      mtcars %>% nest_dt(cyl,vs)
      mtcars %>% nest_dt(vs:am)
      mtcars %>% nest_dt("cyl|vs")
      mtcars %>% nest_dt(c("cyl","vs"))
      # Nest two sets of columns, producing two list columns
      a = iris %>% nest_dt(mcols = list(petal="^Pe",sepal="^Se"))
      # #- Inspect the second list column
      # a[[3]]
      #-- Restore the original layout; ndt is the list column to unnest
      mtcars %>% nest_dt("cyl|vs") %>%
      unnest_dt(ndt)
      mtcars %>% nest_dt("cyl|vs") %>%
      unnest_dt("ndt")

      #--- List columns can be built together with ordinary columns
      df <- data.table(
      a = list(c("a", "b"), "c"),
      b = list(c(TRUE,TRUE),FALSE),
      c = list(3,c(1,2)),
      d = c(11, 22)
      )

      # str(df)

      20 nth: extract a value from a vector

      Extract a value by position; a negative number counts from the end of the vector.

      x = 1:10
      nth(x, 1)
      nth(x, 5)
      nth(x, -2)
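The negative-index convention can be made explicit with a small base-R helper. `nth_base` is a hypothetical function written for this illustration, assuming a non-zero n within the bounds of the vector.

```r
# Hypothetical helper: positive n counts from the start, negative n from the end
nth_base <- function(x, n) {
  if (n > 0) x[n] else x[length(x) + n + 1]
}

x <- 1:10
print(nth_base(x, 1))
print(nth_base(x, 5))
print(nth_base(x, -2))
```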

      21 pull_dt: extract one column of a data frame by position or name

      mtcars %>% pull_dt(2)
      mtcars %>% pull_dt(cyl)
      mtcars %>% pull_dt("cyl")

      22 pull_dt: extract a single variable from a data frame (as a vector)

      Can you pull two columns at once? Of course not!

      #- These three forms return the same result
      mtcars %>% pull_dt(2)
      mtcars %>% pull_dt(cyl)
      mtcars %>% pull_dt("cyl")

      #- Check the column names
      colnames(mtcars)

      23 relocate_dt: reorder columns

      df <- data.table(a = 1, b = 1, c = 1, d = "a", e = "a", f = "a")
      df
      df %>% relocate_dt(f)
      df %>% relocate_dt(a,how = "last")
      df %>% relocate_dt(is.character)
      df %>% relocate_dt(is.numeric, how = "last")
      df %>% relocate_dt("[aeiou]")
      df %>% relocate_dt(a, how = "after",where = f)
      df %>% relocate_dt(f, how = "before",where = a)
      df %>% relocate_dt(f, how = "before",where = c)
      df %>% relocate_dt(f, how = "after",where = c)
      df2 <- data.table(a = 1, b = "a", c = 1, d = "a")
      df2 %>% relocate_dt(is.numeric,
      how = "after",
      where = is.character)
      df2 %>% relocate_dt(is.numeric,
      how="before",
      where = is.character)

      24 relocate_dt: move columns to new positions

      This tool is very powerful and should be very useful in microbiology as well.

      df <- data.table(a = 1, b = 1, c = 1, d = "a", e = "a", f = "a")
      df
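      #- Move column f to the front
      df %>% relocate_dt(f)
      # Move column a to the end
      df %>% relocate_dt(a,how = "last")
      # Move the character columns to the front
      df %>% relocate_dt(is.character)
      # Move the numeric columns to the end
      df %>% relocate_dt(is.numeric, how = "last")
      #-- Move the columns whose names match the regular expression [aeiou] to the front
      df %>% relocate_dt("[aeiou]")
      #- Place a after f
      df %>% relocate_dt(a, how = "after",where = f)
      #- Place f before a
      df %>% relocate_dt(f, how = "before",where = a)
      # Place f before c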
      df %>% relocate_dt(f, how = "before",where = c)
      df %>% relocate_dt(f, how = "after",where = c)

      df2 <- data.table(a = 1, b = "a", c = 1, d = "a")
      #- Place the numeric columns after the character columns
      df2 %>% relocate_dt(is.numeric,
      how = "after",
      where = is.character)
      df2 %>% relocate_dt(is.numeric,
      how="before",
      where = is.character)

      25 rename_dt: rename columns

      #- Rename using new_name = old_name
      iris %>%
      rename_dt(sl = Sepal.Length,sw = Sepal.Width) %>%
      head()

      26 replace_dt: replace values in a column (conditionally)

      iris %>% mutate_vars(is.factor,as.character) -> new_iris
      #- Pick a column and replace values; string replacement
      new_iris %>%
      replace_dt(Species, from = "setosa",to = "SS")
      new_iris %>%
      replace_dt(Species,from = c("setosa","virginica"),to = "sv")
      #- Numeric replacement
      new_iris %>%
      replace_dt(Petal.Width, from = .2,to = 2)
      new_iris %>%
      replace_dt(from = .2,to = NA)
      #- Use a function as the condition
      new_iris %>%
      replace_dt(is.numeric, from = function(x) x > 3, to = 9999 )

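      27 rn_col: swap between row names and the first column

      #-- Move the row names into the first column
      mtcars %>% rn_col()
      # Move the row names into the first column and name it rn
      mtcars %>% rn_col("rn")
      #- Assign to a new data frame
      mtcars %>% rn_col() -> new_mtcars
      #-- Reverse it: move the first column back into the row names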
      new_mtcars %>% col_rn() -> old_mtcars
      old_mtcars
      setequal(mtcars,old_mtcars)

      28 sample_n_dt: random sampling of rows

      #-- Sample rows
      sample_n_dt(mtcars, 10)
      #-- Sample rows with replacement
      sample_n_dt(mtcars, 50, replace = TRUE)
      #- Sample a fraction of the rows
      sample_frac_dt(mtcars, 0.1)
      # With replacement, you can draw more rows than the original data contains.
      sample_frac_dt(mtcars, 1.5, replace = TRUE)
      #-- Alternative syntax
      sample_dt(mtcars,n=10)
      sample_dt(mtcars,prop = 0.1)

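      29 select_dt: a toolbox for selecting columns

      #--- select_dt is a large function with many practical features
      #-- Select one column
      iris %>% select_dt(Species)
      #- Select two columns
      iris %>% select_dt(Sepal.Length,Sepal.Width)
      #- Select every column between these two
      iris %>% select_dt(Sepal.Length:Petal.Length)
      # Drop one column
      iris %>% select_dt(-Sepal.Length)
      #-- Drop two columns
      iris %>% select_dt(-Sepal.Length,-Petal.Length)
      # Drop every column between these two
      iris %>% select_dt(-(Sepal.Length:Petal.Length))
      #-- Columns can also be given as strings, with the same effect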
      iris %>% select_dt(c("Sepal.Length","Sepal.Width"))
      iris %>% select_dt(-c("Sepal.Length","Sepal.Width"))

      #-- Columns can also be given by index, with the same effect
      iris %>% select_dt(1)
      iris %>% select_dt(-1)
      iris %>% select_dt(1:3)
      iris %>% select_dt(-(1:3))
      iris %>% select_dt(1,3)
      #-- Partial (pattern) matching and negation operators are supported
      iris %>% select_dt("Pe")
      iris %>% select_dt(-"Se")
      iris %>% select_dt(!"Se")
      ?select_dt
      iris %>% select_dt("Pe",negate = TRUE)
      iris %>% select_dt("Pe|Sp")
      iris %>% select_dt(cols = 2:3)
      #-- With negate = TRUE, the columns that do not match are returned
      iris %>% select_dt(cols = 2:3,negate = TRUE)
      iris %>% select_dt(cols = c("Sepal.Length","Sepal.Width"))
      iris %>% select_dt(cols = names(iris)[2:3])
      iris %>% select_dt(is.factor)
      iris %>% select_dt(-is.factor)
      iris %>% select_dt(!is.factor)
      # select_mix is very flexible: several kinds of selectors can be mixed in one call
      select_mix(iris, Species,"Sepal.Length")
      select_mix(iris,1:2,is.factor)
      select_mix(iris,Sepal.Length,is.numeric)
      # rm.dup: whether to remove duplicated columns
      select_mix(iris,Sepal.Length,is.numeric,rm.dup = FALSE)

      30 separate_dt: split strings

      Very useful for taxonomic annotation data

      #-- Split strings
      df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))
      df
      df %>% separate_dt(x, c("A", "B"))
      # equals to
      df %>% separate_dt("x", c("A", "B"))
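As a point of comparison, the same split can be done in base R with strsplit. The sketch assumes a literal "." as the separator, which is my choice for this example and not necessarily the default used by separate_dt; `parts` and `separated` are illustrative names.

```r
x <- c(NA, "a.b", "a.d", "b.c")

# Split on a literal dot; NA input stays NA
parts <- strsplit(x, ".", fixed = TRUE)

separated <- data.frame(
  A = vapply(parts, function(p) p[1], character(1)),
  B = vapply(parts, function(p) p[2], character(1)),
  stringsAsFactors = FALSE
)
print(separated)
```

For taxonomy strings the separator would typically be ";" or "|" instead, with one output column per rank.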

      31 slice_dt: select rows by position

      iris %>% slice_dt(1:3)
      iris %>% slice_dt(1,3)
      iris %>% slice_dt(c(1,3))

      31 summarise_dt: summarize a data frame

      #-- Mean of one column
      iris %>% summarise_dt(avg = mean(Sepal.Length))

      # by argument: mean per group
      iris %>% summarise_dt(avg = mean(Sepal.Length),by = Species)
      #- Mean with several grouping variables
      mtcars %>% summarise_dt(avg = mean(hp),by = .(cyl,vs))
      # Count rows
      mtcars %>% summarise_dt(cyl_n = .N, by = .(cyl, vs)) # `.` is short for list
      #-- Minimum of each numeric column
      iris %>% summarise_vars(is.numeric,min)
      # Same as above
      iris %>% summarise_vars(-is.factor,min)
      # Minimum of the first four columns
      iris %>% summarise_vars(1:4,min)
      #- Convert all columns to character
      iris %>% summarise_vars(.func = as.character)
      #- Minimum of each numeric column by group
      iris %>% summarise_vars(is.numeric,min,by ="Species")

      #- To group by two columns, give them as one comma-separated string in quotes.
      mtcars %>% summarise_vars(is.numeric,mean,by = "vs,am")

      32 sys_time_print: measure run time

      sys_time_print(Sys.sleep(1))
      a = iris

      #-- Because tidyfst is aimed at large data, run time matters a great deal, so the package provides a function for timing code
      sys_time_print({
      res = iris %>%
      mutate_dt(one = 1)
      })
      res

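      33 top_n_dt: extract the top rows (ranked by a variable).

      #-- Extract the 10 rows with the largest Sepal.Length
      iris %>% top_n_dt(10,Sepal.Length)
      #- Extract the 10 rows with the smallest Sepal.Length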
      iris %>% top_n_dt(-10,Sepal.Length)

      iris %>% top_frac_dt(.1,Sepal.Length)

      iris %>% top_frac_dt(-.1,Sepal.Length)

      # For `top_dt`, you can use both modes above
      iris %>% top_dt(Sepal.Length,n = 10)
      iris %>% top_dt(Sepal.Length,prop = .1)

      34 t_dt: transpose a data frame

      ?t_dt

      t_dt(iris)
      t_dt(mtcars)

      35 uncount_dt: expand frequency counts into individual rows

      df <- data.table(x = c("a", "b"), n = c(1, 2))

      df
      #- Expand the counts into one row per observation
      uncount_dt(df, n)
      #- With FALSE, the count column is kept alongside the expanded rows
      uncount_dt(df,n,FALSE)
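Expanding counts into rows is a one-liner in base R: repeat each row index as many times as its count. This sketch shows the idea only; `expanded` is an illustrative name.

```r
df <- data.frame(x = c("a", "b"), n = c(1, 2), stringsAsFactors = FALSE)

# Repeat row i exactly n[i] times, keeping only column x
expanded <- df[rep(seq_len(nrow(df)), times = df$n), "x", drop = FALSE]
rownames(expanded) <- NULL
print(expanded)
```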

      36 unite_dt: merge several columns into one

      This is very helpful for handling taxonomic annotation data in metagenomics

      df <- expand.grid(x = c("a", NA), y = c("b", NA))
      df
      # Treat missing value as character "NA"
      df %>% unite_dt("z", x:y, remove = FALSE)

      # With na.rm = TRUE, missing values are dropped before the values are united
      df %>% unite_dt("z", x:y, na.rm = TRUE, remove = FALSE)

      # By default, missing values are kept
      df %>%
      unite_dt("xy", x:y)

      # Merge all columns into one
      iris %>% unite_dt("merged_name","")
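The two ways of handling missing values when pasting columns together can be compared in base R. The sketch below uses "_" as the separator and my own variable names; it illustrates the two behaviours and makes no claim about the exact strings unite_dt returns.

```r
df <- expand.grid(x = c("a", NA), y = c("b", NA), stringsAsFactors = FALSE)

# Keep missing values: NA is pasted as the text "NA"
united_keep <- paste(df$x, df$y, sep = "_")

# Drop missing values before pasting each row
united_drop <- unname(apply(df, 1, function(r) paste(r[!is.na(r)], collapse = "_")))

print(united_keep)
print(united_drop)
```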

      37 utf8_encoding: encode a data frame as UTF-8

      This is very helpful for Chinese text

      utf8_encoding(iris)

      38 wider_dt: long to wide

      #- Build the data and convert it to long format
      stocks = data.frame(
      time = as.Date('2009-01-01') + 0:9,
      X = rnorm(10, 0, 1),
      Y = rnorm(10, 0, 2),
      Z = rnorm(10, 0, 4)
      ) %>%
      longer_dt(time) -> longer_stocks

      longer_stocks
      #- Long to wide
      longer_stocks %>%
      wider_dt("time",
      name = "name",
      value = "value")

      # Add a constant filler column and widen using it as the value
      longer_stocks %>%
      mutate_dt(one = 1) %>%
      wider_dt("time",
      name = "name",
      value = "one")

      ## using "fun" parameter for aggregation
      DT <- data.table(v1 = rep(1:2, each = 6),
      v2 = rep(rep(1:3, 2), each = 2),
      v3 = rep(1:2, 6),
      v4 = rnorm(6))

      DT
      ## Use two columns as identifiers and aggregate with sum
      DT %>%
      wider_dt(v1,v2,
      value = "v4",
      name = ".",
      fun = sum)
      #-- Aggregate with min
      DT %>%
      wider_dt(v1,v2,
      value = "v4",
      name = ".",
      fun = min)
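Widening with an aggregation function amounts to a cross-tabulation in which each cell holds a summary of the values that fall into it. The base-R sketch below uses tapply with fixed values (1:12 in place of random numbers, so the result is reproducible); it shows the idea and is not equivalent in output format to wider_dt.

```r
long <- data.frame(
  v1 = rep(1:2, each = 6),
  v2 = rep(rep(1:3, 2), each = 2),
  v4 = 1:12
)

# Rows are v1, columns are v2, each cell aggregates the matching v4 values
wide_sum <- tapply(long$v4, list(v1 = long$v1, v2 = long$v2), sum)
wide_min <- tapply(long$v4, list(v1 = long$v1, v2 = long$v2), min)

print(wide_sum)
print(wide_min)
```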

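      Afterword

      That completes my pass through tidyfst's data-processing functions. I have added annotations throughout, which should make the code easy to follow. About 5% of the code is still unclear to me; for that I will need to read the source or go back to the author's documentation.

      Once I finished, it struck me that when I started learning R, dplyr was not yet popular and nobody taught me that kind of tool, so my way of handling data frames carries traces of plyr, apply, and even Perl, and many of my operations use for loops. To work with big data, I now have to strip all of that out and switch my habits to the readable style of dplyr and tidyr.

      All of this used example data. It still needs to be tested on real data, which I will do in the next post. I hope it does not disappoint.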
