本文講的是aglient芯片原始數(shù)據(jù)的處理,參考資料是limma 的userguide文檔。GEO數(shù)據(jù)庫下載的表達矩陣不符合預期,比如是空的,或者是有負值的,那我們就處理一下它的原始數(shù)據(jù)。aglient的芯片應用也很廣泛,舉個OSCC的栗子:GSE23558,跟著學習學習。 1.下載和讀取數(shù)據(jù)1.1獲取臨床信息數(shù)據(jù)從前,提到GEO數(shù)據(jù)下載,我們只有GEOquery,神功蓋世,但是死于網(wǎng)速。后來就有了中國人寄幾的GEO鏡像,AnnoProbe包。還沒有正式發(fā)表,就已經初露鋒芒了,因簡單易學,下載迅速,在我們的粉絲圈子里很受歡迎。 rm(list = ls()) library(stringr) library(AnnoProbe) library(GEOquery) library(limma) gse = "GSE23558" geoChina(gse)
## you can also use getGEO from GEOquery, by ## getGEO("GSE23558", destdir=".", AnnotGPL = F, getGPL = F)
## $GSE23558_series_matrix.txt.gz ## ExpressionSet (storageMode: lockedEnvironment) ## assayData: 41000 features, 32 samples ## element names: exprs ## protocolData: none ## phenoData ## sampleNames: GSM577914 GSM577915 ... GSM577945 (32 total) ## varLabels: title geo_accession ... tissue:ch1 (42 total) ## varMetadata: labelDescription ## featureData: none ## experimentData: use 'experimentData(object)' ## pubMedIds: 22072328 ## 28433800 ## Annotation: GPL6480
提供一個GSE編號就可以下載啦。因為表達矩陣是處理過的,我們不要,所以只提取臨床信息表格,從中獲得分組信息。 load("GSE23558_eSet.Rdata") pd <- pData(gset[[1]])
調整pd的行名與文件讀取的順序一致,并定義分組信息。 raw_dir = "rawdata/GSE23558_RAW/" raw_datas = paste0(raw_dir,"/",dir(raw_dir))
#調整pd與rawdata的順序一致 raw_order = str_extract(raw_datas,"GSM\\d*") pd = pd[match(raw_order,rownames(pd)),]
group_list <- ifelse(stringr::str_detect(pd$title,"HealthyControl"),"normal","tumor") group_list <- factor(group_list, levels=c("normal","tumor"))
1.2 讀取原始數(shù)據(jù)x <- read.maimages(raw_datas, source="agilent", green.only=TRUE, other.columns = "gIsWellAboveBG")
## Read rawdata/GSE23558_RAW//GSM577914.txt ## Read rawdata/GSE23558_RAW//GSM577915.txt ## Read rawdata/GSE23558_RAW//GSM577916.txt ## Read rawdata/GSE23558_RAW//GSM577917.txt ## Read rawdata/GSE23558_RAW//GSM577918.txt ## Read rawdata/GSE23558_RAW//GSM577919.txt ## Read rawdata/GSE23558_RAW//GSM577920.txt ## Read rawdata/GSE23558_RAW//GSM577921.txt ## Read rawdata/GSE23558_RAW//GSM577922.txt ## Read rawdata/GSE23558_RAW//GSM577923.txt ## Read rawdata/GSE23558_RAW//GSM577924.txt ## Read rawdata/GSE23558_RAW//GSM577925.txt ## Read rawdata/GSE23558_RAW//GSM577926.txt ## Read rawdata/GSE23558_RAW//GSM577927.txt ## Read rawdata/GSE23558_RAW//GSM577928.txt ## Read rawdata/GSE23558_RAW//GSM577929.txt ## Read rawdata/GSE23558_RAW//GSM577930.txt ## Read rawdata/GSE23558_RAW//GSM577931.txt ## Read rawdata/GSE23558_RAW//GSM577932.txt ## Read rawdata/GSE23558_RAW//GSM577933.txt ## Read rawdata/GSE23558_RAW//GSM577934.txt ## Read rawdata/GSE23558_RAW//GSM577935.txt ## Read rawdata/GSE23558_RAW//GSM577936.txt ## Read rawdata/GSE23558_RAW//GSM577937.txt ## Read rawdata/GSE23558_RAW//GSM577938.txt ## Read rawdata/GSE23558_RAW//GSM577939.txt ## Read rawdata/GSE23558_RAW//GSM577940.txt ## Read rawdata/GSE23558_RAW//GSM577941.txt ## Read rawdata/GSE23558_RAW//GSM577942.txt ## Read rawdata/GSE23558_RAW//GSM577943.txt ## Read rawdata/GSE23558_RAW//GSM577944.txt ## Read rawdata/GSE23558_RAW//GSM577945.txt
dim(x)
## [1] 45015 32
2.背景校正和標準化y <- backgroundCorrect(x, method="normexp")
## Array 1 corrected ## Array 2 corrected ## Array 3 corrected ## Array 4 corrected ## Array 5 corrected ## Array 6 corrected ## Array 7 corrected ## Array 8 corrected ## Array 9 corrected ## Array 10 corrected ## Array 11 corrected ## Array 12 corrected ## Array 13 corrected ## Array 14 corrected ## Array 15 corrected ## Array 16 corrected ## Array 17 corrected ## Array 18 corrected ## Array 19 corrected ## Array 20 corrected ## Array 21 corrected ## Array 22 corrected ## Array 23 corrected ## Array 24 corrected ## Array 25 corrected ## Array 26 corrected ## Array 27 corrected ## Array 28 corrected ## Array 29 corrected ## Array 30 corrected ## Array 31 corrected ## Array 32 corrected
y <- normalizeBetweenArrays(y, method="quantile") class(y)
## [1] "EList" ## attr(,"package") ## [1] "limma"
3. 基因過濾Control <- y$genes$ControlType==1L;table(Control)
## Control ## FALSE TRUE ## 43529 1486
NoSymbol <- is.na(y$genes$GeneName);table(NoSymbol)
## NoSymbol ## FALSE ## 45015
IsExpr <- rowSums(y$other$gIsWellAboveBG > 0) >= 16;table(IsExpr)
## IsExpr ## FALSE TRUE ## 13088 31927
Isdup <- duplicated(y$genes$GeneName);table(Isdup)
## Isdup ## FALSE TRUE ## 30328 14687
yfilt <- y[!Control & !NoSymbol & IsExpr & !Isdup, ] dim(yfilt)
## [1] 20650 32
可以看到,過濾后剩下了2萬多個探針。 4.得到表達矩陣exp = yfilt@.Data[[1]] boxplot(exp)
exp[1:2,1:2]
## rawdata/GSE23558_RAW//GSM577914 rawdata/GSE23558_RAW//GSM577915 ## [1,] 9.284154 11.473334 ## [2,] 7.341236 7.474406
得到的表達矩陣沒問題,但行名和列名均有問題。行名應該是探針名,列名是樣本名,調整一下。 4.1獲得樣本名colnames(exp) = str_extract(colnames(exp),"GSM\\d*") exp[1:2,1:2]
## GSM577914 GSM577915 ## [1,] 9.284154 11.473334 ## [2,] 7.341236 7.474406
4.2 獲得基因名limma文檔里寫的是用了注釋R包,在本例的原文件是里有探針注釋的,這里直接使用。 可以直接將exp的行名轉為基因名。行名不能重復,所以先去重 anno = yfilt$genes nrow(anno);nrow(exp)
## [1] 20650
## [1] 20650
rownames(exp)=rownames(anno) ids = unique(anno$GeneName) exp = exp[!duplicated(anno$GeneName),] rownames(exp) = ids exp[1:4,1:4]
## GSM577914 GSM577915 GSM577916 GSM577917 ## APOBEC3B 9.284154 11.473334 10.439071 11.661000 ## A_32_P77178 7.341236 7.474406 7.310818 7.397149 ## ATP11B 9.963452 8.915621 10.193873 9.321954 ## DNAJA1 13.469790 13.201078 12.827357 13.389431
至此,得到了標準的表達矩陣。后面要做什么就看你啦,這就相當于修復了一下數(shù)據(jù)庫里那個被標準化過的表達矩陣。 5.差異分析design <- model.matrix(~group_list) fit <- lmFit(exp,design) fit <- eBayes(fit,trend=TRUE,robust=TRUE) summary(decideTests(fit))
## (Intercept) group_listtumor ## Down 0 2102 ## NotSig 0 16928 ## Up 20650 1620
deg = topTable(fit,coef=2,n=dim(y)[1]) boxplot(exp[rownames(deg)[1],]~group_list)
這里直接走limma的簡易流程,可以畫差異最顯著的那個基因表達量看看,可以看到差異是超級明顯了!
save(exp,group_list,deg,gse,file = paste0(gse,"deg.Rdata"))
后面的步驟就是我們GEO數(shù)據(jù)挖掘課程的標配啦,如果大家對這一系列“騷操作”感興趣,歡迎報名我們的GEO數(shù)據(jù)挖掘課程哈,全年滾動開班,直播互動教學以及答疑,下一期是7月6號開課,可以花時間了解一下:
|