Here is what happened. This morning I was happily writing code that used data.table inside a loop to read files and assign values in batches. After running it I found the results were wrong, and the warnings pointed to a type mismatch. I am writing the problem down here in the hope that it helps whoever hits it next.

The warnings:

Warning messages:
1: In set(x, j = name, value = value) :
  Coercing 'character' RHS to 'integer' to match the type of the target column (column 1 named 'Number').
2: In set(x, j = name, value = value) :
  NAs introduced by coercion
I looked it up in the data.table documentation:

"Unlike <- for data.frame, the (potentially large) LHS [Left Hand Side] is not coerced to match the type of the (often small) RHS [Right Hand Side]. Instead the RHS is coerced to match the type of the LHS, if necessary. Where this involves double precision values being coerced to an integer column, a warning is given (whether or not fractional data is truncated). The motivation for this is efficiency. It is best to get the column types correct up front and stick to them. Changing a column type is possible but deliberately harder: provide a whole column as the RHS. This RHS is then plonked into that column slot and we call this plonk syntax, or replace column syntax if you prefer. By needing to construct a full length vector of a new type, you as the user are more aware of what is happening, and it's clearer to readers of your code that you really do intend to change the column type."

In short: when the value being assigned has a different type from the column, data.table coerces the value to the column's type instead of changing the column, and it tells you with a warning rather than an error. There are two ways around this:

(1) Make the types agree first. For example, if you want to assign a character value to a numeric column, convert the column to character first and then assign.
(2) Make the value being assigned as long as the column, one element per row. The whole column is then replaced and nothing is coerced.

1. Generate the data

Create a data.table:

# DT
library(data.table)
df = data.table(x = 1:10,y = rnorm(10),z = paste0("ttt",1:10))
df
str(df)
> df
     x           y     z
 1:  1  0.55319365  ttt1
 2:  2 -0.08265915  ttt2
 3:  3 -1.50851585  ttt3
 4:  4 -0.19653575  ttt4
 5:  5 -1.55555254  ttt5
 6:  6  0.03887365  ttt6
 7:  7  0.36618923  ttt7
 8:  8 -0.93304230  ttt8
 9:  9 -0.24562587  ttt9
10: 10  1.52407895 ttt10
> str(df)
Classes 'data.table' and 'data.frame': 10 obs. of 3 variables:
 $ x: int 1 2 3 4 5 6 7 8 9 10
 $ y: num 0.5532 -0.0827 -1.5085 -0.1965 -1.5556 ...
 $ z: chr "ttt1" "ttt2" "ttt3" "ttt4" ...
 - attr(*, ".internal.selfref")=<externalptr>
Here x is an integer column, y is a numeric (double) column, and z is a character column.

2. Reproduce the error: set column x to a1

> df$x = "a1"
Warning messages:
1: In set(x, j = name, value = value) :
  Coercing 'character' RHS to 'integer' to match the type of the target column (column 1 named 'x').
2: In set(x, j = name, value = value) :
  NAs introduced by coercion
The message says that the right-hand side is character while the target column is integer, so the value is coerced to integer to match. Note that although this is only a warning, the result is wrong. Look at the data after the assignment: that is not playing fair at all, the whole column has turned into NA.

> df
     x           y     z
 1: NA  0.55319365  ttt1
 2: NA -0.08265915  ttt2
 3: NA -1.50851585  ttt3
 4: NA -0.19653575  ttt4
 5: NA -1.55555254  ttt5
 6: NA  0.03887365  ttt6
 7: NA  0.36618923  ttt7
 8: NA -0.93304230  ttt8
 9: NA -0.24562587  ttt9
10: NA  1.52407895 ttt10
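Because the problem only shows up as a warning, it is easy to miss inside a loop. A minimal sketch, assuming you would rather stop than carry on with NA values: wrap the assignment in tryCatch() so that any warning becomes an error. The table dt and the helper name assign_strict are made up for illustration, and the exact message text may differ between data.table versions.

```r
library(data.table)

# Hypothetical helper: assign `value` to column `col`, but stop as soon as
# data.table signals a warning (for example the coercion warning above).
assign_strict <- function(dt, col, value) {
  tryCatch(
    set(dt, j = col, value = value),
    warning = function(w) {
      stop("assignment refused: ", conditionMessage(w), call. = FALSE)
    }
  )
  invisible(dt)
}

dt <- data.table(x = 1:3)

# A value of the matching type goes through silently.
assign_strict(dt, "x", 5L)

# A character value for an integer column now stops with an error
# instead of quietly continuing after a warning.
outcome <- tryCatch(
  {
    assign_strict(dt, "x", "a1")
    "assigned"
  },
  error = function(e) "stopped"
)
print(outcome)
```

In a batch job, an error like this halts the loop at the offending file, which is usually easier to debug than a column of NA discovered later.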
With a data.frame this does not happen, because there the column is converted to match the value being assigned:

df = data.frame(x = 1:10,y = rnorm(10),z = paste0("ttt",1:10))
df
str(df)
df$x = "a1"
df
As you can see, the conversion happens in a flash. People say data.table and data.frame are more or less the same, but this small difference is exactly where they part ways. When you have not mastered the tool, the pits are full of bugs!

> df = data.frame(x = 1:10,y = rnorm(10),z = paste0("ttt",1:10))
> df
    x          y     z
1   1 -0.5037848  ttt1
2   2 -1.4766567  ttt2
3   3 -0.1606073  ttt3
4   4 -0.6011270  ttt4
5   5  1.6626815  ttt5
6   6  0.2565216  ttt6
7   7  0.2683151  ttt7
8   8 -2.3469332  ttt8
9   9 -1.6655096  ttt9
10 10  0.3784420 ttt10
> str(df)
'data.frame': 10 obs. of 3 variables:
 $ x: int 1 2 3 4 5 6 7 8 9 10
 $ y: num -0.504 -1.477 -0.161 -0.601 1.663 ...
 $ z: Factor w/ 10 levels "ttt1","ttt10",..: 1 3 4 5 6 7 8 9 10 2
> df$x = "a1"
> df
    x          y     z
1  a1 -0.5037848  ttt1
2  a1 -1.4766567  ttt2
3  a1 -0.1606073  ttt3
4  a1 -0.6011270  ttt4
5  a1  1.6626815  ttt5
6  a1  0.2565216  ttt6
7  a1  0.2683151  ttt7
8  a1 -2.3469332  ttt8
9  a1 -1.6655096  ttt9
10 a1  0.3784420 ttt10
3. Solution 1: convert column x to character first, then assign

First convert the column to character with df$x = as.character(df$x), then assign:

df = data.table(x = 1:10,y = rnorm(10),z = paste0("ttt",1:10))
df
str(df)
df$x = as.character(df$x)
df$x = "a1"
df
As you can see, that does it:

> df$x = as.character(df$x)
> df$x = "a1"
> df
     x          y     z
 1: a1 -0.8852575  ttt1
 2: a1 -0.1708877  ttt2
 3: a1  0.3803468  ttt3
 4: a1  0.4192728  ttt4
 5: a1  1.4413745  ttt5
 6: a1 -0.6828477  ttt6
 7: a1  0.4294502  ttt7
 8: a1 -0.1611874  ttt8
 9: a1 -2.3305019  ttt9
10: a1 -0.1424764 ttt10
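The same fix can also be written with data.table's := operator, which updates the table by reference instead of going through $<-. A minimal sketch, using a small made-up table dt:

```r
library(data.table)

dt <- data.table(x = 1:3, z = c("a", "b", "c"))

# Step 1: replace x with a full-length character version of itself.
# A full-length right-hand side replaces the column, so its type changes.
dt[, x := as.character(x)]

# Step 2: the column is character now, so a single string is recycled
# into it without any type coercion.
dt[, x := "a1"]

str(dt)
```

Doing the conversion explicitly also makes it obvious to a reader that the column type is meant to change.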
4. Solution 2: make the assigned value as long as the column

Build the value so that it has one element per row, with df$x = rep("a1",dim(df)[1]):

df = data.table(x = 1:10,y = rnorm(10),z = paste0("ttt",1:10))
str(df)
df$x = rep("a1",dim(df)[1])
df
As you can see, this works too:

> df = data.table(x = 1:10,y = rnorm(10),z = paste0("ttt",1:10))
> str(df)
Classes 'data.table' and 'data.frame': 10 obs. of 3 variables:
 $ x: int 1 2 3 4 5 6 7 8 9 10
 $ y: num 1.425 0.0537 0.219 1.8867 -0.1562 ...
 $ z: chr "ttt1" "ttt2" "ttt3" "ttt4" ...
 - attr(*, ".internal.selfref")=<externalptr>
> df$x = rep("a1",dim(df)[1])
> df
     x           y     z
 1: a1  1.42502710  ttt1
 2: a1  0.05370049  ttt2
 3: a1  0.21899323  ttt3
 4: a1  1.88674618  ttt4
 5: a1 -0.15622174  ttt5
 6: a1  0.43704146  ttt6
 7: a1  1.31103082  ttt7
 8: a1 -0.09496113  ttt8
 9: a1  0.33710145  ttt9
10: a1 -0.05053140 ttt10
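This is the "replace column" route described in the documentation quoted earlier: a full-length vector of the new type replaces the column wholesale. A sketch of the same idea with :=, where .N is data.table's built-in symbol for the number of rows and dt is a made-up table:

```r
library(data.table)

dt <- data.table(x = 1:3)

# rep() builds one value per row, so the whole column is replaced
# and its type changes from integer to character.
dt[, x := rep("a1", .N)]

print(class(dt$x))
```

Using .N (or nrow(dt)) keeps the length in step with the table, so the code does not break when the number of rows changes between files.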
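5. Assigning a character to a numeric column goes wrong, but assigning a number to a character column works

Isn't that blatant discrimination? Assigning a number to a character column with df$z = 123 runs without complaint. This fits the rule quoted from the documentation: the value is coerced to the column's type, and every number has a character form, so no NA is produced.

df = data.table(x = 1:10,y = rnorm(10),z = paste0("ttt",1:10))
str(df)
df$z = 123
df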
The result:

> df = data.table(x = 1:10,y = rnorm(10),z = paste0("ttt",1:10))
> str(df)
Classes 'data.table' and 'data.frame': 10 obs. of 3 variables:
 $ x: int 1 2 3 4 5 6 7 8 9 10
 $ y: num 0.148 -0.795 1.16 0.375 0.765 ...
 $ z: chr "ttt1" "ttt2" "ttt3" "ttt4" ...
 - attr(*, ".internal.selfref")=<externalptr>
> df$z = 123
> df
     x          y   z
 1:  1  0.1484868 123
 2:  2 -0.7951205 123
 3:  3  1.1601522 123
 4:  4  0.3751982 123
 5:  5  0.7651195 123
 6:  6  0.7172938 123
 7:  7  1.6518403 123
 8:  8  0.3031258 123
 9:  9 -1.3506003 123
10: 10  1.4655129 123
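The printout shows 123 the same way whether it is stored as a number or as a string, so it is worth inspecting the column type after the assignment. Going by the documentation quoted earlier, the number should be coerced to the column's type rather than the other way round, which means z is expected to stay character. The sketch below checks this instead of assuming it; dt is a made-up table.

```r
library(data.table)

dt <- data.table(z = paste0("ttt", 1:3))

# Assign a single number to a character column.
dt$z <- 123

# Inspect the resulting type and values rather than trusting the printout.
print(class(dt$z))
print(dt$z)
```

If the column is still character, later numeric operations on z need an explicit conversion such as as.numeric(dt$z).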
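6. data.table does not play fair and picks on honest people

But I am still going to use it, because it really is that good! My skills are lacking and the pits are full of bugs, so I will keep filling them in.

Two earlier data.table study notes of mine:

data.table study notes 1
data.table study notes 2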