ID conversion using the KEGG database

yjt2004us 2018-04-14

clusterProfiler can convert biological IDs using an OrgDb object via the bitr function. I have now implemented another function, bitr_kegg, for converting IDs through the KEGG API.

library(clusterProfiler)
data(gcSample)
hg <- gcSample[[1]]
head(hg)

## [1] '4597'  '7111'  '5266'  '2175'  '755'   '23046'
eg2np <- bitr_kegg(hg, fromType='kegg', toType='ncbi-proteinid', organism='hsa')

## Warning in bitr_kegg(hg, fromType = 'kegg', toType = 'ncbi-proteinid',
## organism = 'hsa'): 3.7% of input gene IDs are fail to map...
head(eg2np)

##     kegg ncbi-proteinid
## 1   8326      NP_003499
## 2  58487   NP_001034707
## 3 139081      NP_619647
## 4  59272      NP_068576
## 5    993      NP_001780
## 6   2676      NP_001487
np2up <- bitr_kegg(eg2np[,2], fromType='ncbi-proteinid', toType='uniprot', organism='hsa')
head(np2up)

##   ncbi-proteinid uniprot
## 1      NP_005457  O75586
## 2      NP_005792  P41567
## 3      NP_005792  Q6IAV3
## 4      NP_037536  Q13421
## 5      NP_006054  O60662
## 6   NP_001092002  O95398
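
The head(np2up) output above lists NP_005792 twice, once per UniProt accession, so a conversion can be one-to-many, and the earlier warning shows that some inputs may not map at all. The following is a minimal base-R sketch of how both cases could be checked. It uses only the six rows printed above; the helper name unmapped_fraction and the ID "NP_PLACEHOLDER" are illustrative assumptions and are not part of clusterProfiler.

```r
# Rows copied from the head(np2up) output above.
np2up_head <- data.frame(
  from = c("NP_005457", "NP_005792", "NP_005792",
           "NP_037536", "NP_006054", "NP_001092002"),
  to   = c("O75586", "P41567", "Q6IAV3", "Q13421", "O60662", "O95398"),
  stringsAsFactors = FALSE
)
names(np2up_head) <- c("ncbi-proteinid", "uniprot")

# Input IDs that occur more than once map to several UniProt accessions.
hits_per_id <- table(np2up_head[["ncbi-proteinid"]])
one_to_many <- names(hits_per_id)[hits_per_id > 1]
print(one_to_many)

# Hypothetical helper: share of (unique) input IDs missing from the mapping.
unmapped_fraction <- function(input_ids, mapped_ids) {
  input_ids <- unique(input_ids)
  mean(!input_ids %in% mapped_ids)
}

# "NP_PLACEHOLDER" is a made-up ID standing in for an input that failed to map.
query <- c(unique(np2up_head[["ncbi-proteinid"]]), "NP_PLACEHOLDER")
frac <- unmapped_fraction(query, np2up_head[["ncbi-proteinid"]])
print(frac)
```

Whether to keep every accession or only the first hit per input ID depends on the downstream analysis; the sketch only makes the duplication visible.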

The ID type (both fromType and toType) should be one of 'kegg', 'ncbi-geneid', 'ncbi-proteinid' or 'uniprot'. 'kegg' is the primary ID used in the KEGG database, whose gene data come from NCBI. As a rule of thumb, the 'kegg' ID is the Entrez Gene ID for eukaryotic species and the locus ID for prokaryotes.
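
Because only four ID types are accepted, a typo in fromType or toType can be caught before any request is sent to KEGG. The sketch below assumes a small wrapper of our own; check_kegg_id_type is a hypothetical helper, not a clusterProfiler function, and its strict case-sensitive matching is a design choice of this example.

```r
# The four ID types named in the text.
kegg_id_types <- c("kegg", "ncbi-geneid", "ncbi-proteinid", "uniprot")

# Hypothetical helper (not part of clusterProfiler): stop early on a typo.
check_kegg_id_type <- function(type) {
  if (!is.character(type) || length(type) != 1 || !type %in% kegg_id_types) {
    stop("ID type must be one of: ", paste(kegg_id_types, collapse = ", "))
  }
  invisible(type)
}

# A valid type passes silently and is returned invisibly.
ok <- check_kegg_id_type("uniprot")

# An upper-case spelling is rejected by this strict check.
res <- try(check_kegg_id_type("UNIPROT"), silent = TRUE)
print(inherits(res, "try-error"))
```

Note that the OrgDb key type passed to setReadable later in this post is spelled 'UNIPROT'; that is a separate vocabulary from the lower-case KEGG types listed here.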

Many prokaryotic species have no Entrez Gene IDs available. For example, we can check the gene information for ece:Z5100 at http://www./dbget-bin/www_bget?ece:Z5100, which has NCBI-ProteinID and UniProt links in the Other DBs entry, but no NCBI-GeneID.

If we try to convert Z5100 to ncbi-geneid, bitr_kegg throws an error saying that ncbi-geneid is not supported.

bitr_kegg('Z5100', fromType='kegg', toType='ncbi-geneid', organism='ece')
## Error in KEGG_convert(fromType, toType, organism) :
## ncbi-geneid is not supported for ece ...

We can of course convert it to ncbi-proteinid and uniprot:

bitr_kegg('Z5100', fromType='kegg', toType='ncbi-proteinid', organism='ece')
##    kegg ncbi-proteinid
## 1 Z5100       AAG58814
bitr_kegg('Z5100', fromType='kegg', toType='uniprot', organism='ece')
##    kegg uniprot
## 1 Z5100  Q7DB85
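
Since an unsupported target type raises an error rather than returning an empty table, a script that handles many organisms may want to fall back to the next ID type. The following sketch shows one way to do that with tryCatch. The helper convert_first_supported and the offline stub_converter are assumptions made for illustration; the stub only mimics the ece behaviour shown above and does not contact KEGG.

```r
# Hypothetical helper: try target ID types in order and return the first
# conversion that succeeds; `converter` is any function(id, to_type).
convert_first_supported <- function(id, to_types, converter) {
  for (to in to_types) {
    result <- tryCatch(converter(id, to), error = function(e) NULL)
    if (!is.null(result)) return(result)
  }
  stop("none of the requested ID types could be used for ", id)
}

# Offline stand-in that mimics the ece results shown above.
# It is NOT the real bitr_kegg.
stub_converter <- function(id, to_type) {
  known <- list("ncbi-proteinid" = "AAG58814", "uniprot" = "Q7DB85")
  if (is.null(known[[to_type]])) {
    stop(to_type, " is not supported for ece ...")
  }
  out <- data.frame(id, known[[to_type]], stringsAsFactors = FALSE)
  names(out) <- c("kegg", to_type)
  out
}

# ncbi-geneid fails first, so the helper falls back to ncbi-proteinid.
res <- convert_first_supported("Z5100",
                               c("ncbi-geneid", "ncbi-proteinid"),
                               stub_converter)
print(res)
```

With clusterProfiler installed and network access available, the stub could be swapped for function(id, to) bitr_kegg(id, fromType='kegg', toType=to, organism='ece').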

search_kegg_organism

clusterProfiler supports the more than 4,000 species listed at http://www./kegg/catalog/org_list.html for the hypergeometric test (enrichKEGG & enrichMKEGG) and GSEA (gseKEGG & gseMKEGG), and bitr_kegg can convert IDs for all of them. To make it easier to find the abbreviation of a scientific name used in the organism parameter of these functions, I implemented the search_kegg_organism function. We can search by kegg_code, scientific_name or common_name (which is not available for prokaryotes).

search_kegg_organism('ece', by='kegg_code')

##     kegg_code                        scientific_name common_name
## 334       ece Escherichia coli O157:H7 EDL933 (EHEC)        
ecoli <- search_kegg_organism('Escherichia coli', by='scientific_name')
dim(ecoli)

## [1] 64  3
head(ecoli)

##     kegg_code                        scientific_name common_name
## 329       eco           Escherichia coli K-12 MG1655        
## 330       ecj            Escherichia coli K-12 W3110        
## 331       ecd            Escherichia coli K-12 DH10B        
## 332       ebw                Escherichia coli BW2952        
## 333      ecok            Escherichia coli K-12 MDS42        
## 334       ece Escherichia coli O157:H7 EDL933 (EHEC)
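
The search result is an ordinary data frame, so the kegg_code needed for the organism parameter can be picked out with standard subsetting. The sketch below rebuilds the six rows printed above as a stand-in for the real search result, so it runs without clusterProfiler; the variable names are illustrative.

```r
# Rows copied from the head(ecoli) output above.
ecoli_head <- data.frame(
  kegg_code = c("eco", "ecj", "ecd", "ebw", "ecok", "ece"),
  scientific_name = c("Escherichia coli K-12 MG1655",
                      "Escherichia coli K-12 W3110",
                      "Escherichia coli K-12 DH10B",
                      "Escherichia coli BW2952",
                      "Escherichia coli K-12 MDS42",
                      "Escherichia coli O157:H7 EDL933 (EHEC)"),
  stringsAsFactors = FALSE
)

# Narrow the hits to K-12 strains by literal substring match.
k12 <- ecoli_head[grepl("K-12", ecoli_head$scientific_name, fixed = TRUE), ]
print(k12$kegg_code)

# Pick the code of one specific strain for the organism parameter.
organism <- ecoli_head$kegg_code[
  ecoli_head$scientific_name == "Escherichia coli K-12 MG1655"
]
print(organism)
```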

keyType parameter

With the ID conversion utilities built into clusterProfiler, I added a keyType parameter to enrichKEGG, enrichMKEGG, gseKEGG and gseMKEGG. We can now use an ID type that is not the primary ID in the KEGG database.

x <- enrichKEGG(np2up[,2], organism='hsa', keyType='uniprot')
head(summary(x))

##                ID                            Description GeneRatio
## hsa04072 hsa04072      Phospholipase D signaling pathway    11/133
## hsa04060 hsa04060 Cytokine-cytokine receptor interaction    14/133
## hsa04390 hsa04390                Hippo signaling pathway    10/133
## hsa04975 hsa04975           Fat digestion and absorption     5/133
## hsa05221 hsa05221                 Acute myeloid leukemia     6/133
##           BgRatio       pvalue   p.adjust     qvalue
## hsa04072 216/9275 0.0002654190 0.03901659 0.03240905
## hsa04060 354/9275 0.0005349245 0.03931695 0.03265855
## hsa04390 213/9275 0.0009536247 0.04199404 0.03488227
## hsa04975  58/9275 0.0014014886 0.04199404 0.03488227
## hsa05221  86/9275 0.0014283687 0.04199404 0.03488227
##                                                                                                         geneID
## hsa04072                      O95398/Q99777/P49619/Q6FGP0/Q8WVM9/O14807/P41594/A8K5P7/P10145/A0A024RDA5/P16234
## hsa04060 A0N0N3/O00574/P19876/P01589/P10145/A0A024RDA5/B4DGA4/Q99665/P16234/P78556/Q6I9S7/P42830/P27930/Q9UBN6
## hsa04390                             Q8WW10/A8K141/Q9UI47/P35240/A0A024R1J8/Q659G9/Q9UJU2/P22003/M9VUD0/O00144
## hsa04975                                                            Q9UNK4/A0A087WZT4/A0A0C4DFX6/Q9UHC9/P04054
## hsa05221                                                         Q659G9/Q9UJU2/Q03181/A0A024RCW6/Q06455/B2R6I9
##          Count
## hsa04072    11
## hsa04060    14
## hsa04390    10
## hsa04975     5
## hsa05221     6
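
To read the GeneRatio and BgRatio columns, it may help to recompute the test for one row by hand. The sketch below assumes the standard over-representation formulation of the hypergeometric test (the post says enrichKEGG uses a hypergeometric test, but the exact internals are not shown here), and plugs in the numbers from the hsa04072 row. If that assumption holds, the result should be close to the pvalue column. The helper parse_ratio is hypothetical.

```r
# Values taken from the hsa04072 row above.
gene_ratio <- "11/133"    # pathway genes in the input / annotated input genes
bg_ratio   <- "216/9275"  # pathway genes in background / background genes

# Hypothetical helper: split "a/b" into the numeric pair c(a, b).
parse_ratio <- function(x) as.numeric(strsplit(x, "/", fixed = TRUE)[[1]])

gr <- parse_ratio(gene_ratio)
bg <- parse_ratio(bg_ratio)
k <- gr[1]  # hits in the gene list
n <- gr[2]  # size of the gene list
M <- bg[1]  # pathway size in the background
N <- bg[2]  # background size

# Upper-tail hypergeometric probability P(X >= k).
p <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)
print(p)

# Fold enrichment: observed ratio over background ratio.
fold <- (k / n) / (M / N)
print(fold)
```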

setReadable

For GO analysis, there is a readable parameter that controls whether the IDs are translated to human-readable gene names. This parameter is not available for KEGG analysis, but we can still translate the input gene IDs to gene names with the setReadable function, provided that the corresponding OrgDb object is available.

y <- setReadable(x, 'org.Hs.eg.db', keyType='UNIPROT')
head(summary(y))

##                ID                            Description GeneRatio
## hsa04072 hsa04072      Phospholipase D signaling pathway    11/133
## hsa04060 hsa04060 Cytokine-cytokine receptor interaction    14/133
## hsa04390 hsa04390                Hippo signaling pathway    10/133
## hsa04975 hsa04975           Fat digestion and absorption     5/133
## hsa05221 hsa05221                 Acute myeloid leukemia     6/133
##           BgRatio       pvalue   p.adjust     qvalue
## hsa04072 216/9275 0.0002654190 0.03901659 0.03240905
## hsa04060 354/9275 0.0005349245 0.03931695 0.03265855
## hsa04390 213/9275 0.0009536247 0.04199404 0.03488227
## hsa04975  58/9275 0.0014014886 0.04199404 0.03488227
## hsa05221  86/9275 0.0014283687 0.04199404 0.03488227
##                                                                                                geneID
## hsa04072                             RAPGEF3/RAPGEF3/DGKG/MRAS/MRAS/MRAS/GRM5/GRM5/CXCL8/CXCL8/PDGFRA
## hsa04060 CXCR6/CXCR6/CXCL3/IL2RA/CXCL8/CXCL8/IL12RB2/IL12RB2/PDGFRA/CCL20/CXCL5/CXCL5/IL1R2/TNFRSF10D
## hsa04390                                        CTNNA3/CTNNA3/CTNNA3/NF2/NF2/LEF1/LEF1/BMP5/BMP5/FZD9
## hsa04975                                                        PLA2G2D/PLA2G2D/NPC1L1/NPC1L1/PLA2G1B
## hsa05221                                                        LEF1/LEF1/PPARD/PPARD/RUNX1T1/RUNX1T1
##          Count
## hsa04072    11
## hsa04060    14
## hsa04390    10
## hsa04975     5
## hsa05221     6
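
In the translated geneID column above, the same symbol can appear several times (for example RAPGEF3 and MRAS in hsa04072), presumably because several UniProt accessions in the input belong to one gene. A minimal base-R sketch for collapsing such a string to distinct symbols, using the hsa04072 entry copied from the output:

```r
# geneID string of hsa04072, copied from the setReadable output above.
gene_id <- "RAPGEF3/RAPGEF3/DGKG/MRAS/MRAS/MRAS/GRM5/GRM5/CXCL8/CXCL8/PDGFRA"

# Split on the literal "/" separator.
symbols <- strsplit(gene_id, "/", fixed = TRUE)[[1]]
print(length(symbols))         # one entry per input UniProt accession

# Keep each gene symbol once, in order of first appearance.
unique_symbols <- unique(symbols)
print(unique_symbols)
print(length(unique_symbols))  # number of distinct genes
```

The Count column still reflects the number of input IDs, so the distinct-gene count from this sketch can be smaller than Count.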

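People often ask me why there is no readable parameter when they run an analysis with enricher or GSEA. Keep in mind that these two functions are general-purpose enrichment tools: they make no assumptions about what you are doing (the knowledge base, the species or the ID type). So how could I convert the IDs for you automatically? The answer is that I cannot. You, however, should know what you are working with, so I provide the setReadable function, which can solve part of the ID conversion problem, though certainly not all of it.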

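In addition, the post "ko database ID conversion" also demonstrates ID conversion with KEGG and extends what is covered here: besides converting between gene IDs, genes can be mapped to pathways and pathways back to genes, all of which clusterProfiler supports.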