1 java.io.IOException: java.io.IOException: java.lang.IllegalArgumentException: offset (0) + length (8) exceed the capacity of the array: 4
This appears when doing a simple incr operation. The cause is that the value was originally written by a put as an int (vlen=4), which increment cannot work with; the value has to be stored as an 8-byte long (vlen=8).
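A minimal sketch of the fix with the plain Java client API; the table, row and column names are made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CounterPutSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "counters");   // hypothetical table
            byte[] row = Bytes.toBytes("user:1001");       // hypothetical row key
            byte[] cf  = Bytes.toBytes("c");
            byte[] col = Bytes.toBytes("hits");

            Put put = new Put(row);
            // Bytes.toBytes(1) would store a 4-byte int (vlen=4) that incr rejects;
            // Bytes.toBytes(1L) stores an 8-byte long (vlen=8) that incr accepts.
            put.add(cf, col, Bytes.toBytes(1L));
            table.put(put);

            long newValue = table.incrementColumnValue(row, cf, col, 1);
            System.out.println("counter is now " + newValue);
            table.close();
        }
    }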
2 Writing data to a column fails with org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1 action: NotServingRegionException: 1 time, servers with issues: 10.xx.xx.37:60020, or with org.apache.hadoop.hbase.NotServingRegionException: Region is not online.
When either error occurs, master-status shows Regions in Transition lasting ten-plus minutes, stuck in the PENDING_OPEN state, and requests block. We took 10.xx.xx.37 offline and the cluster ran stably overnight with no more blocking caused by splits, so we suspect that machine itself. The HMaster log shows regions on that server being opened and closed non-stop, without any split or flush.
RIT is short for "region in transition". Every open or close the HBase master performs on a region inserts a record into the master's RIT, because the master's operations on a region must be atomic and a region's open and close are carried out jointly by the HMaster and a region server. To coordinate these operations and keep them consistent and able to roll back, the HMaster uses the RIT mechanism together with node state in ZooKeeper. The region states are:
OFFLINE       // region is in an offline state
PENDING_OPEN  // sent rpc to server to open but has not begun
OPENING       // server has begun to open but not yet done
OPEN          // server opened region and updated meta
PENDING_CLOSE // sent rpc to server to close but has not begun
CLOSING       // server has begun to close but not yet done
CLOSED        // server closed region and updated meta
SPLITTING     // server started split of a region
SPLIT         // server completed split of a region
Digging further, this turned out to be a load-balance problem: regions on that server were repeatedly being opened and closed (see http:///hbase/book.html#regions.arch.assignment). Restarting the region server brought things back to normal.
Later runs of the code hit "region not on line" again, thrown as NotServingRegionException, which is "Thrown by a region server if it is sent a request for a region it is not serving." Why would the client keep requesting an offline region? The errors were concentrated on 3 of the 150 regions. Tracing the server-side log, the region was closed by CloseRegionHandler and only reopened some 20 minutes later, yet during that whole window the client kept requesting that closed region.
3 The switch to stop writing to HBase did not take effect. The code had just gone live and we added a switch so that writes could be turned off if HBase ran into trouble. When trouble did come, the program hung anyway; the current explanation is the ever-lengthening retry mechanism: a 60-second timeout plus 10 retries backing off from 1 to 32 seconds, so once something goes wrong, flipping the switch no longer helps. The RPC timeout and retry count need to be configured to fix this (see the sketch below).
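A hedged sketch of the client-side settings involved; the keys are standard HBase client properties, while the concrete values and the table name are only illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;

    public class FailFastClientSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            conf.setInt("hbase.rpc.timeout", 10000);        // per-RPC timeout, default 60000 ms
            conf.setInt("hbase.client.retries.number", 3);  // default is 10 retries with growing back-off
            conf.setLong("hbase.client.pause", 200);        // base back-off between retries, default 1000 ms
            HTable table = new HTable(conf, "some_table");  // hypothetical table
            // Writes through this HTable now give up within seconds rather than minutes,
            // so an application-level "stop writing to HBase" switch can actually kick in.
            table.close();
        }
    }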
4 flush, split and compact cause stop-the-world pauses. Long flush and split operations left the HBase servers unable to respond to requests. The region size needs to be tuned, and the resulting flush frequency measured.
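One way to adjust the region size is per table at creation time; a rough sketch assuming the 0.9x admin API, with a hypothetical table name, column family and split keys:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BigRegionTableSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            HTableDescriptor desc = new HTableDescriptor("wide_table");  // hypothetical table
            desc.addFamily(new HColumnDescriptor("d"));                  // hypothetical family
            desc.setMaxFileSize(4L * 1024 * 1024 * 1024);  // regions split at ~4GB instead of the small default
            // Pre-split so early writes spread across several regions instead of one hot region.
            byte[][] splits = { Bytes.toBytes("3"), Bytes.toBytes("6"), Bytes.toBytes("9") };
            admin.createTable(desc, splits);
        }
    }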
5 HBase parameter settings.
hbase.regionserver.handler.count: set to 50, given the I/O capability of the SAS disks.
hbase.hregion.memstore.block.multiplier: when a memstore grows to this multiple of hbase.hregion.memstore.flush.size, reads and writes are blocked while it flushes; the default is 2.
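Both properties live in hbase-site.xml on the region servers. A small sketch of the write-blocking threshold they imply, with a 64MB flush size assumed for illustration:

    public class MemstoreBlockMathSketch {
        public static void main(String[] args) {
            // hbase-site.xml: hbase.regionserver.handler.count = 50   (RPC handler threads per region server)
            // hbase-site.xml: hbase.hregion.memstore.flush.size       (64MB assumed here)
            // hbase-site.xml: hbase.hregion.memstore.block.multiplier (default 2)
            long flushSize = 64L * 1024 * 1024;
            int blockMultiplier = 2;
            long blockAt = flushSize * blockMultiplier;   // 128MB with these values
            // Once a region's memstore reaches blockAt, updates to that region are
            // blocked until flushing brings it back under the limit.
            System.out.println("updates block at " + blockAt + " bytes per region");
        }
    }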
6 Region server crash. The region server crashed because a long GC pause made its connection to ZooKeeper time out. ZooKeeper's internal timeouts are:
minSessionTimeout, in milliseconds, defaulting to 2 × tickTime.
maxSessionTimeout, in milliseconds, defaulting to 20 × tickTime.
(tickTime is itself a configuration item: the smallest unit of time the server uses for its internal time-keeping.)
If the sessionTimeout a client asks for falls outside the min-max range, the server silently clamps it to min or max and then creates a Session object for that client. With the default tickTime of 2s, the largest timeout a client can obtain is 40s, so even setting the region server's zookeeper.session.timeout to 60s does not help.
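A small sketch of the clamping described above, using the default tickTime of 2s; the numbers simply restate the reasoning in code:

    public class ZkSessionTimeoutSketch {
        public static void main(String[] args) {
            long tickTime = 2000;                   // zoo.cfg tickTime, in ms
            long minSessionTimeout = 2 * tickTime;  // ZooKeeper lower bound: 4s
            long maxSessionTimeout = 20 * tickTime; // ZooKeeper upper bound: 40s
            long requested = 60000;                 // region server asks for zookeeper.session.timeout = 60s
            long effective = Math.min(Math.max(requested, minSessionTimeout), maxSessionTimeout);
            System.out.println("effective session timeout: " + effective + " ms");  // prints 40000
            // To actually get a 60s session, maxSessionTimeout (or tickTime) must be raised on the ZooKeeper side.
        }
    }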
7 A code problem caused a deadlock. The master's slow-query log showed one query that took 2 hours, which eventually made the server respond slowly and unable to cope with heavy writes. The root cause was a getColumns call that pulled well over a hundred thousand entries in one go, with no pagination. The program was changed to page through roughly 500 at a time, and the problem has not recurred (a paging sketch follows).
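A rough sketch of that paging change, assuming ColumnPaginationFilter over a wide row; the table, row and family names are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.filter.ColumnPaginationFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PagedColumnsSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "wide_table");  // hypothetical table
            byte[] row = Bytes.toBytes("wide-row");         // hypothetical wide row
            int pageSize = 500;                             // ~500 columns per round trip
            int offset = 0;
            while (true) {
                Get get = new Get(row);
                get.addFamily(Bytes.toBytes("d"));          // hypothetical family
                get.setFilter(new ColumnPaginationFilter(pageSize, offset));
                Result page = table.get(get);
                if (page.isEmpty()) {
                    break;                                  // no more columns
                }
                // ... process page.raw() here ...
                offset += pageSize;
            }
            table.close();
        }
    }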
8 operation too slow
2012-07-26 05:30:39,141 WARN org.apache.hadoop.ipc.HBaseServer: (operationTooSlow): {"processingtimems":69315,"ts":9223372036854775807,"client":"10.75.0.109:34780","starttimems":1343251769825,"queuetimems":0,"class":"HRegionServer","responsesize":0,"method":"delete","totalColumns":1,"table":"trackurl_status_list","families":{"sl":[{"timestamp":1343251769825,"qualifier":"zzzn1VlyG","vlen":0}]},"row":""}
With a non-empty row-key, deleting an arbitrary column takes about 3 ms, yet this delete against an empty row-key took 69 seconds. It is not clear whether this is a bug, and we still do not know how an empty row-key got passed in; for now the strategy is to guard against operating on empty row-keys in the client code (see the sketch below).
2012-07-31 17:52:06,619 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":1156438,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@3dbb29e5), rpc version=1, client version=29, methodsFingerPrint=-1508511443","client":"10.75.0.109:35245","starttimems":1343727170177,"queuetimems":0,"class":"HRegionServer","responsesize":0,"method":"multi"}
Quoting the HBase documentation: "The output is tagged with operation e.g."
2012-07-31 17:52:06,812 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call get([B@61574be4, {"timeRange":[0,9223372036854775807],"totalColumns":1,"cacheBlocks":true,"families":{"c":["ALL"]},"maxVersions":1,"row":"zOuu6TK"}), rpc version=1, client version=29, methodsFingerPrint=-1508511443 from 10.75.0.151:52745: output error
Logs like this show up fairly often. We also keep seeing log entries from the memstore rollback path, whose method is documented as: "Remove all the keys listed in the map from the memstore. This method is called when a Put has updated memstore but subsequently fails to update the wal. This method is then invoked to rollback the memstore." Strangely, the start and end indexes are both 0, and since the method loops for (int i = start; i < end; i++), this is an empty rollback over no data. It needs further investigation.
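A minimal sketch of the empty row-key guard mentioned above; the helper and its names are made up for illustration:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.HTable;

    public class RowKeyGuardSketch {
        // Refuse to build a Delete for a null or empty row key instead of sending it to the server.
        static void safeDeleteColumn(HTable table, byte[] row, byte[] family, byte[] qualifier)
                throws IOException {
            if (row == null || row.length == 0) {
                throw new IllegalArgumentException("empty row key");  // or log and skip
            }
            Delete d = new Delete(row);
            d.deleteColumns(family, qualifier);  // delete all versions of this column
            table.delete(d);
        }
    }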
12 Bringing a new region server online caused "region not on line": requests for a region were being sent to the wrong region server.
13 Requests for a region that no longer exists; rebuilding the table pool does not help. The request carried timestamp 1342510667 while the latest region's rowkey-related timestamp was 1344558957. The region-location cache turns out to be maintained in HConnectionManager.
get (Get), delete (Delete) and incr (Increment) go through ServerCallable's withRetries:
Case 1: on an error (SocketTimeoutException, ConnectException, RetriesExhaustedException) the cached regionServer location is cleared.
Case 2: if numRetries is set to 1, the loop runs only once, so connect(tries != 0) becomes connect(false), i.e. reload=false, and the location is never refreshed; only with numRetries > 1 is it fetched again.
get with a list of Gets, put with a Put or a list of Puts, and delete with a list of Deletes go through HConnectionManager's processBatch, which refreshes the regionServer location whenever the batch get/put/delete results look wrong.
Setting numRetries to more than 1 (3 in my case) solved the problem (see the sketch after item 14).
14 zookeeper.RecoverableZooKeeper(195): Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
This happened while testing on a single machine. Whether HBase was started from the IDE or from bin, the shell could connect normally but the test program could not. The ZooKeeper port was 2181, and the client port should have nothing to do with ZooKeeper, yet changing the configured port from 21818 to 2181 made it run fine. This change should only be necessary in a standalone environment.
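A hedged sketch combining the two client-side fixes from items 13 and 14; the quorum host and table name are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;

    public class ClientConnectionSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            conf.setInt("hbase.client.retries.number", 3);            // keep > 1 so a stale region location gets re-fetched on retry
            conf.set("hbase.zookeeper.quorum", "localhost");          // standalone test setup
            conf.set("hbase.zookeeper.property.clientPort", "2181");  // point the client at ZooKeeper's real port (item 14)
            HTable table = new HTable(conf, "some_table");            // hypothetical table
            table.close();
        }
    }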