Friendly note: this post covers quite a few concepts, so the content in the blue-highlighted areas deserves the closest reading.

TF

Transcription factor (TF): a protein that regulates gene transcription by binding specifically to DNA sequences in regulatory regions. A single transcription factor can regulate several genes at the same time: In molecular biology, a transcription factor (TF) (or sequence-specific DNA-binding factor) is a protein that controls the rate of transcription of genetic information from DNA to messenger RNA, by binding to a specific DNA sequence. TFs are key regulators of biological processes that function by binding to transcriptional regulatory regions (e.g., promoters, enhancers) to control the expression of their target genes.
The human genome encodes 2000+ TFs.

Transcription factor binding site (TFBS): the DNA sequence that a transcription factor binds, usually 5 to 20 bp long. The binding sites of one transcription factor on different genes are conserved to some degree but are not identical: Transcription factor binding motifs (TFBMs) are genomic sequences that specifically bind to transcription factors. The consensus sequence of a TFBM is variable, and there are a number of possible bases at certain positions in the motif, whereas other positions have a fixed base.
Transcription factor binding motif (TFBM): the terms "binding site" and "binding motif" are often used interchangeably. For the difference between them, see one paper, which describes it as follows: A single TF can recognize dozens to hundreds of DNA binding site sequences over a range of binding affinities. Hence, the TF binding specificity (i.e., preferential binding of specific sequences) cannot be adequately represented using any one DNA sequence. Instead, TF binding specificities are often represented as binding site motifs, which summarize the collection of preferentially bound sequences. These motifs can be used to scan sequences of interest (e.g., genomic regions) to predict TF binding sites.
In other words, a motif summarizes all the possible binding sites (TFBSs) of a TF and is used to describe the specificity of those binding sites.

motif

Motifs are a more practical representation of consensus elements in biological sequences, allowing for a more detailed description of the variability at each site. Common types of motifs that are responsible for binding to DNA can be found in different transcription factors. Each TF typically recognizes a collection of similar DNA sequences, which can be represented as binding site motifs using models such as position weight matrices (PWMs).
A motif can be represented with several methods and models. As an example, take the following binding site sequences of some transcription factor:
The most basic representation is the consensus sequence: A collection of DNA binding sites, typically referred to as a DNA binding motif, can be represented by a consensus sequence. Given a set of sequences, a consensus sequence (also called canonical sequence) is the sequence obtained by taking the most frequent residues of nucleic acids / amino acids at each position.
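That is, from a given set of sequences, take the most frequent base at each position and join them into a single sequence. In this example the result is AAGAAA.
https://www.commonlounge.com/discussion/912b207972304bf3a337e5473eca32ac

This is simple, but it clearly trades away accuracy and overgeneralizes. The final sequence says nothing about the other bases that can occur at a position. You can use IUPAC codes to represent two or more possible bases (for example, the second position can be A or T, which IUPAC writes as W), but that still cannot express information such as how likely each base is.
http://www.bioinformatics.org/sms2/iupac.html

So a more accurate model is needed to represent a motif well.

1. Position Frequency Matrix (PFM), also called a Position Count Matrix (PCM). Each entry is the number of times a given base occurs at a given position across all the sequences. The number of columns equals the sequence length, and each column sums to 6 (there are six sequences in total). For example, the first base of every sequence is A, so in the first column A is 6 and every other base is 0.

2. Position Probability Matrix (PPM). Each entry is the frequency of a base at that position (count of the base / column total).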
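As a rough sketch of the consensus sequence, PFM, and PPM described above, the snippet below builds all three in plain Python. The six sequences are hypothetical stand-ins, chosen only so that their counts agree with the numbers quoted in this post (the original sequence figure is not available here), and the helper names are made up for illustration.

```python
# Hypothetical example sequences: chosen to match the counts quoted in the
# text (all six start with A; position 2 is A four times and T twice).
SEQS = ["AAGAAT", "ATCATA", "AAGTAA", "AACAAA", "ATTAAA", "AAGAAT"]
BASES = "ACGT"


def position_frequency_matrix(seqs):
    """Return {base: [count at each position]} for equal-length sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    return {b: [sum(s[i] == b for s in seqs) for i in range(length)] for b in BASES}


def position_probability_matrix(pfm):
    """Divide each count by its column total so every column sums to 1."""
    length = len(pfm["A"])
    totals = [sum(pfm[b][i] for b in BASES) for i in range(length)]
    return {b: [pfm[b][i] / totals[i] for i in range(length)] for b in BASES}


def consensus(pfm):
    """Most frequent base per position (ties go to the first base in BASES)."""
    length = len(pfm["A"])
    return "".join(max(BASES, key=lambda b: pfm[b][i]) for i in range(length))


pfm = position_frequency_matrix(SEQS)
ppm = position_probability_matrix(pfm)
print(consensus(pfm))
for b in BASES:
    print(b, pfm[b], [round(p, 2) for p in ppm[b]])
```

Storing the matrices as one list per base keeps the example dependency-free; a NumPy array with one row per base would work just as well.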
Each column sums to 1, and the columns are treated as independent of one another. From the probability of each base at each position you can infer the probability of a whole sequence. For example, the probability of AAGAAA is about 15% (= 1 × 0.67 × 0.5 × 0.83 × 0.83 × 0.67). If the number of starting sequences is small, the PPM will contain many zeros, which can be corrected by adding a pseudocount.

3. Position Weight Matrix (PWM), also called a position-specific weight matrix (PSWM), position-specific scoring matrix (PSSM), or log-odds scoring matrix (LSM). A PWM is made up of score values: Each column provides a score per nucleotide representing the relative preference for the given base at that position in the binding site.
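The most common way to compute the score is to normalize the observed base frequency by the background (random-occurrence) base frequency and then take the logarithm, that is, Score = log2(ProbN / BgN), where ProbN is the PPM probability of base N at that position and BgN is the background probability of that base.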
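From this formula, the score is positive when a particular base is more likely than background and negative otherwise. Assuming a background probability of 0.25 for every base, the PWM for this example follows directly from the PPM.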
Take the score of T at the second position as an example: Score = log2(2/6/0.25) ≈ 0.415. The score of a particular sequence can be computed in the same way by adding up the scores at each position: In order to score a sequence, add up the score for the letters at the specific positions.
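For the sequence AAGAAA: Score = 2 + 1.415 + 1 + 1.737 + 1.737 + 1.415 = 9.304. As with the PPM, the matrix clearly contains many negative-infinity values (-Inf), so some sequences (such as AAAAAA) end up with a final score of -Inf. That rules those sequences out entirely and may discard important information, so a pseudocount correction can likewise be applied to ProbN.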
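The correction formula itself is not shown here. One widely used form, given purely as an assumption and not necessarily the one the original post used, adds a pseudocount p distributed according to the background frequencies:

```latex
\mathrm{Prob}_N = \frac{c_N + p \cdot \mathrm{Bg}_N}{n + p},
\qquad
\mathrm{Score}_N = \log_2 \frac{\mathrm{Prob}_N}{\mathrm{Bg}_N}
```

Here c_N is the count of base N in the column, n is the number of sequences, Bg_N is the background probability of N, and p > 0 is a free parameter (all of these symbols are introduced for illustration). With p > 0 no probability is exactly zero, so no score is -Inf.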
As this shows, the matrix models above can easily be converted into one another.
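To make the conversions concrete, here is a minimal sketch that starts from a count matrix, derives the PPM and the PWM, and then scores sequences. The counts are hypothetical (chosen to agree with the values quoted in this post), the function names are invented for illustration, and the optional pseudocount uses one common scheme rather than a formula taken from the original post.

```python
import math

BASES = "ACGT"
# Hypothetical PFM consistent with the numbers quoted in the text (6 sequences).
PFM = {
    "A": [6, 4, 0, 5, 5, 4],
    "C": [0, 0, 2, 0, 0, 0],
    "G": [0, 0, 3, 0, 0, 0],
    "T": [0, 2, 1, 1, 1, 2],
}
BACKGROUND = {b: 0.25 for b in BASES}  # uniform background, as assumed in the text


def pfm_to_ppm(pfm, pseudocount=0.0, background=BACKGROUND):
    """PFM -> PPM. With pseudocount p: (count + p * bg) / (total + p)."""
    length = len(pfm["A"])
    ppm = {b: [] for b in BASES}
    for i in range(length):
        total = sum(pfm[b][i] for b in BASES)
        for b in BASES:
            ppm[b].append((pfm[b][i] + pseudocount * background[b]) / (total + pseudocount))
    return ppm


def ppm_to_pwm(ppm, background=BACKGROUND):
    """PPM -> PWM of log2(prob / background); a zero probability gives -inf."""
    return {
        b: [math.log2(p / background[b]) if p > 0 else -math.inf for p in ppm[b]]
        for b in BASES
    }


def sequence_probability(ppm, seq):
    """Product of per-position probabilities (positions treated as independent)."""
    return math.prod(ppm[base][i] for i, base in enumerate(seq))


def sequence_score(pwm, seq):
    """Sum of per-position log-odds scores."""
    return sum(pwm[base][i] for i, base in enumerate(seq))


ppm = pfm_to_ppm(PFM)
pwm = ppm_to_pwm(ppm)
print(round(sequence_probability(ppm, "AAGAAA"), 3))  # about 0.154
print(round(sequence_score(pwm, "AAGAAA"), 3))        # about 9.304
print(sequence_score(pwm, "AAAAAA"))                  # -inf: A never occurs at position 3

smoothed = ppm_to_pwm(pfm_to_ppm(PFM, pseudocount=1.0))
print(round(sequence_score(smoothed, "AAAAAA"), 3))   # finite once a pseudocount is added
```

Because the PWM score is the log of the sequence probability divided by the background probability, the two numbers for AAGAAA are two views of the same quantity.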
TFs regulating genes

After determining a TF's motif and representing it as a PWM, one usually also wants to identify the genes regulated by that TF. Potential target genes can be found by checking whether a gene's promoter region contains the motif bound by the TF: In addition to determine the sequence specificities of a TF and represent this specificities as a PWM, one usually wants to identify genes being regulated by this TF. Putative targets of a TF can be determined by finding genes whose promoter region contains the motif bound by that TF.
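A minimal sketch of this idea: slide the PWM along a promoter sequence, score each window, and report windows above a cutoff as putative binding sites. The count matrix, the promoter string, the pseudocount, and the cutoff of 8.0 are all hypothetical placeholders, and the function names are invented for illustration. Real analyses usually also scan the reverse strand and choose thresholds more carefully.

```python
import math

BASES = "ACGT"
# Hypothetical count matrix for a 6-bp motif (rows: bases, columns: positions).
PFM = {
    "A": [6, 4, 0, 5, 5, 4],
    "C": [0, 0, 2, 0, 0, 0],
    "G": [0, 0, 3, 0, 0, 0],
    "T": [0, 2, 1, 1, 1, 2],
}


def build_pwm(pfm, pseudocount=1.0, bg=0.25):
    """Log-odds PWM with a simple pseudocount so that no score is -inf."""
    length = len(pfm["A"])
    pwm = {b: [] for b in BASES}
    for i in range(length):
        total = sum(pfm[b][i] for b in BASES)
        for b in BASES:
            prob = (pfm[b][i] + pseudocount * bg) / (total + pseudocount)
            pwm[b].append(math.log2(prob / bg))
    return pwm


def scan(sequence, pwm, threshold):
    """Return (start, window, score) for every window scoring >= threshold."""
    width = len(pwm["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in pwm for base in window):
            continue  # skip windows with ambiguous bases such as N
        score = sum(pwm[base][i] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


# Hypothetical promoter fragment; the cutoff of 8.0 is an arbitrary choice.
promoter = "CGCGTAAGAAACGGCCNATCATAGC"
for start, window, score in scan(promoter, build_pwm(PFM), threshold=8.0):
    print(start, window, round(score, 2))
```

Start positions are 0-based offsets into the promoter string. Mapping them back to genomic coordinates depends on how the promoter was extracted.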

[Figure: schematic of the promoter region]
In genetics, a promoter is a region of DNA that initiates transcription of a particular gene. Promoters are located near the transcription start sites (TSS) of genes, on the same strand and upstream on the DNA.
Promoter positions are given relative to the transcription start site (TSS), and the promoter is commonly defined as the 2 kb upstream of it: As promoters are typically immediately adjacent to the gene in question, positions in the promoter are designated relative to the transcriptional start site, where transcription of DNA begins for a particular gene (i.e., positions upstream are negative numbers counting back from -1, for example -100 is a position 100 base pairs upstream). Promoters can be about 100–1000 base pairs long.
https://en.wikipedia.org/wiki/Promoter_(genetics)
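The coordinate convention above can be sketched as two small helpers: one returns the promoter interval for a gene given its TSS and strand, the other converts a genomic position into a TSS-relative one. The 2 kb upstream window follows the definition used in this post. The function names, the 1-based inclusive coordinates, the clipping at the chromosome start, and numbering the TSS itself as +1 are assumptions made for illustration.

```python
def promoter_interval(tss, strand, upstream=2000):
    """Return (start, end), 1-based inclusive, of the region upstream of the TSS.

    On the '+' strand upstream means smaller coordinates; on the '-' strand
    it means larger coordinates. The TSS itself is not included, and the
    chromosome end is not checked (its length is unknown here).
    """
    if strand == "+":
        start, end = tss - upstream, tss - 1
    elif strand == "-":
        start, end = tss + 1, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(start, 1), end


def relative_position(pos, tss, strand):
    """Position relative to the TSS, with the TSS as +1 and no position 0 (assumed)."""
    offset = pos - tss if strand == "+" else tss - pos
    return offset + 1 if offset >= 0 else offset


print(promoter_interval(10_000, "+"))          # (8000, 9999)
print(promoter_interval(10_000, "-"))          # (10001, 12000)
print(relative_position(9_900, 10_000, "+"))   # -100
```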