
      Human reference genome: key points (updating~)

       菌心說 2021-04-16

      1. How big is the human genome?

          chr    size        size2
      1   chr1   248956422   249M
      2   chr2   242193529   242M
      3   chr3   198295559   198M
      4   chr4   190214555   190M
      5   chr5   181538259   182M
      6   chr6   170805979   171M
      7   chr7   159345973   159M
      8   chrX   156040895   156M
      9   chr8   145138636   145M
      10  chr9   138394717   138M
      11  chr11  135086622   135M
      12  chr10  133797422   134M
      13  chr12  133275309   133M
      14  chr13  114364328   114M
      15  chr14  107043718   107M
      16  chr15  101991189   102M
      17  chr16  90338345    90M
      18  chr17  83257441    83M
      19  chr18  80373285    80M
      20  chr20  64444167    64M
      21  chr19  58617616    59M
      22  chrY   57227415    57M
      23  chr22  50818468    51M
      24  chr21  46709983    47M
      25  SUM    3088269832  3088M
      # chrM (mitochondrial DNA) is not included; it is short, 16,569 bp (about 16 kbp).
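      • As the table shows, lower-numbered chromosomes are generally longer (the order is not strict: chr11 is longer than chr10, for example), with lengths ranging from roughly 50 Mbp to 250 Mbp;
      • Because humans are diploid, a cell's nuclear genome holds about 6 billion bases;
      • A reference genome is usually stored as plain text, i.e. directly as ASCII characters such as "A", "T", "C", and "G".
      • One ASCII character takes 1 B, so storing the 3 billion letters of one haploid copy (single strand) as plain text takes 3 billion letters = 3,000,000,000 B = 3 GB.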
      from NCBI
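
The arithmetic in the list above can be checked with a few lines of shell. This is only a minimal sketch: the lengths are copied from the table in this section (chrM excluded), and the variable names are made up for the illustration.

```shell
# Chromosome lengths in bp, copied from the hg38 table above (chrM left out).
sizes="248956422 242193529 198295559 190214555 181538259 170805979
159345973 156040895 145138636 138394717 135086622 133797422
133275309 114364328 107043718 101991189 90338345 83257441
80373285 64444167 58617616 57227415 50818468 46709983"

# Add up all the lengths.
total=0
for s in $sizes; do
  total=$((total + s))
done

echo "total length: $total bp"
# One ASCII character per base means one byte per base.
echo "plain-text size: about $((total / 1000000)) MB"
```

The real hg38.fa file is somewhat larger than this estimate, because it also contains header lines, line breaks, and the extra sequences described in the next section.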

      2. Odd chromosome names (chrUn, random, alt)

      • Again using the hg38 release from UCSC as the example
      wget http://hgdownload.soe./goldenPath/hg38/bigZips/hg38.fa.gz
      gunzip hg38.fa.gz
      # extract the chromosome IDs
      grep '^>' hg38.fa > chr.id
      wc -l chr.id
      #455 chr.id
      head chr.id
      ####
      >chr1
      >chr10
      >chr11
      >chr11_KI270721v1_random
      >chr12
      >chr13
      >chr14
      >chr14_GL000009v2_random
      >chr14_GL000225v1_random
      >chr14_KI270722v1_random
      
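      • As shown above, the file does not contain just 25 sequences (22 + X + Y + M); there are 455 in total. The other, special sequences fall into three categories.
      • Before going into them, it helps to know that assembling raw sequencing reads into chromosome sequences goes through two stages, contigs and scaffolds, as shown in the figure below. Contigs are sequences stitched together from overlaps between reads (a few kbp long) and contain no N bases; scaffolds join contigs further, mainly using read-pair relationships, and do contain N bases (a few hundred kbp); finally, scaffolds are assembled into chromosome sequences.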


        read→contigs→scaffolds

        read→chromosomes
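
The relationship between scaffolds and contigs described above can be illustrated with a toy sequence. This is only a sketch on a made-up string, not how an assembler works: it splits a scaffold at its runs of N, and each resulting piece is N-free, as a contig would be.

```shell
# A made-up scaffold: three N-free stretches joined by runs of N (gaps).
scaffold="ACGTACGTNNNNNNGGCCTTAANNNTTGACA"

# Turn every N into a newline and squeeze repeated newlines into one,
# so each output line is one N-free piece.
contigs=$(printf '%s\n' "$scaffold" | tr -s 'N' '\n')

printf '%s\n' "$contigs"
n_contigs=$(printf '%s\n' "$contigs" | wc -l | tr -d ' ')
echo "pieces without N: $n_contigs"
```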

      2.1 Unlocalized scaffolds(*****random)

      • a sequence found in an assembly that is associated with a specific chromosome but cannot be ordered or oriented on that chromosome.
      • Put simply: we know which chromosome the scaffold belongs to, but not its exact position or orientation on that chromosome
      • format: chr{chromosome number or name}_{sequence_accession}v{sequence_version}_random
      grep 'random' chr.id > chr.random
      wc -l chr.random
      #42 chr.random
      head chr.random
      ###
      >chr11_KI270721v1_random
      >chr14_GL000009v2_random
      >chr14_GL000225v1_random
      >chr14_KI270722v1_random
      >chr14_GL000194v1_random
      >chr14_KI270723v1_random
      >chr14_KI270724v1_random
      >chr14_KI270725v1_random
      >chr14_KI270726v1_random

      2.2 Unplaced scaffolds(chrUn******)

      • a sequence found in an assembly that is not associated with any chromosome.
      • Put simply: we do not know which chromosome the scaffold belongs to
      • format: chrUn_{sequence_accession}v{sequence_version}
      grep 'chrUn' chr.id > chr.chrUn
      wc -l chr.chrUn
      #127 chr.chrUn
      head chr.chrUn
      ###
      >chrUn_KI270302v1
      >chrUn_KI270304v1
      >chrUn_KI270303v1
      >chrUn_KI270305v1
      >chrUn_KI270322v1
      >chrUn_KI270320v1
      >chrUn_KI270310v1
      >chrUn_KI270316v1
      >chrUn_KI270315v1
      >chrUn_KI270312v1
      

      2.3 Alternate loci scaffolds(*****alt)

      • a scaffold that provides an alternate representation of a locus found in the primary assembly. These sequences do not represent a complete chromosome sequence although there is no hard limit on the size of the alternate locus; currently these are less than 1 Mb. These could either be NOVEL patch sequences, added through patch releases, or present in the initial assembly release.
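      • Put simply: a reference genome is meaningful mainly because 99.9% of the sequence is identical across humans. Some regions, however, differ between populations. For example, 49% of people may carry sequence A at a given locus while another 49% carry sequence B, and both are normal. Choosing either one as the reference would be unsatisfactory, so such regions are flagged as alternate loci scaffolds.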
      • format: chr{chromosome number or name}_{sequence_accession}v{sequence_version}_alt
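      • Alternate loci scaffolds in this _alt form are new in hg38; hg19 carried only a handful of alternate haplotype sequences, named with a _hap suffix (e.g. chr6_cox_hap2).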
      grep 'alt' chr.id > chr.alt
      wc -l chr.alt
      #261 chr.alt
      head chr.alt
      ###
      >chr1_KI270762v1_alt
      >chr1_KI270766v1_alt
      >chr1_KI270760v1_alt
      >chr1_KI270765v1_alt
      >chr1_GL383518v1_alt
      >chr1_GL383519v1_alt
      >chr1_GL383520v2_alt
      >chr1_KI270764v1_alt
      >chr1_KI270763v1_alt
      >chr1_KI270759v1_alt

      Note: the chromosome names above are those of UCSC's hg release and differ slightly from GRCh38's naming, but the sequence types are essentially the same.
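
The three naming patterns in this section can be turned into a small classifier. The sketch below is an illustration only: the function name classify and its category labels are made up here, and it assumes UCSC-style names like the ones listed above.

```shell
# Classify a UCSC-style hg38 sequence name by the patterns described above.
classify() {
  case "$1" in
    *_random) echo "unlocalized" ;;   # chromosome known, position unknown
    chrUn_*)  echo "unplaced" ;;      # chromosome unknown
    *_alt)    echo "alternate" ;;     # alternate representation of a locus
    *)        echo "primary" ;;       # chr1..chr22, chrX, chrY, chrM
  esac
}

for name in chr1 chrM chr14_GL000009v2_random chrUn_KI270302v1 chr1_KI270762v1_alt; do
  printf '%s\t%s\n' "$name" "$(classify "$name")"
done

# The per-category counts reported above add up to the total in chr.id.
echo "25 + 42 + 127 + 261 = $((25 + 42 + 127 + 261))"
```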

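      3. What fraction of the genome is coding?

      • Of the roughly 3 billion bases in the genome, protein-coding genes account for only about 5% of the total length, and the exons of their transcripts for only about 1.5%;
      • Human chromosomes encode roughly 20,000 to 30,000 protein-coding genes, spread across the chromosomes. The average gene is around 10 kbp long, but the actual length distribution is very wide (from a few hundred bases to more than 2 million bases)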
      wget https://hgdownload.soe./goldenPath/hg38/bigZips/genes/hg38.refGene.gtf.gz
      gunzip hg38.refGene.gtf.gz
      # keep unique (chromosome, gene_id) pairs, dropping alt, random and fix sequences
      awk '{print $1, $10}' hg38.refGene.gtf | sort -u | grep -v -e alt -e random -e fix | sort -k 1 > chr.gene
      cut -d' ' -f 1 chr.gene | uniq -c
      ###
         1113 chr10
         1676 chr11
         1392 chr12
          632 chr13
          946 chr14
         1010 chr15
         1146 chr16
         1574 chr17
          434 chr18
         1791 chr19
         2832 chr1
          780 chr20
          414 chr21
          644 chr22
         1817 chr2
         1563 chr3
         1088 chr4
         1313 chr5
         1453 chr6
         1341 chr7
         1029 chr8
         1114 chr9
            1 chrM
         1157 chrX
          143 chrY
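
To see what the pipeline above is counting, here is a minimal sketch on four made-up GTF records. The gene and transcript names (GENE_A, TX_1, and so on) are hypothetical, and the records are space-separated for readability; real GTF files use tabs, which awk splits on in the same way.

```shell
# Four made-up GTF records; two exons of GENE_A should count as one gene.
gtf=$(cat <<'EOF'
chr1 refGene exon 100 200 . + . gene_id "GENE_A"; transcript_id "TX_1";
chr1 refGene exon 300 400 . + . gene_id "GENE_A"; transcript_id "TX_1";
chr1 refGene exon 900 950 . - . gene_id "GENE_C"; transcript_id "TX_3";
chr2 refGene exon 50 90 . - . gene_id "GENE_B"; transcript_id "TX_2";
EOF
)

# Field 1 is the chromosome and field 10 is the gene_id value.
# sort -u removes repeated (chromosome, gene) pairs before counting.
counts=$(printf '%s\n' "$gtf" | awk '{print $1, $10}' | LC_ALL=C sort -u \
  | cut -d' ' -f1 | uniq -c)
echo "$counts"
```

Because each gene appears once per exon and per transcript in the file, the deduplication step is what makes the final numbers gene counts rather than record counts.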
      

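      4. Downloading the reference genome

      • The genome versions in common use today are GRCh38/37 and hg38/19. The former can be downloaded from NCBI or Ensembl, the latter from the UCSC website. As the figure below shows, GRCh38 can be treated as equivalent to hg38, and GRCh37 as equivalent to hg19.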


        human genome version
      • Taking GRCh38/hg38 as the example, the downloads look like this

      4.1 NCBI

      wget -c ftp://ftp.ncbi.nlm./refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.fna.gz
      NCBI

      4.2 ensembl

      wget -c http://ftp./pub/release-103/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz
      
      ensembl

      4.3 UCSC

      wget -c http://hgdownload.soe./goldenPath/hg38/bigZips/hg38.fa.gz
      UCSC
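
After a download, the sequence names can be inspected without unpacking the whole file. The sketch below builds a tiny gzipped FASTA as a stand-in for the real download (the file name toy.fa.gz and its contents are made up), then counts and lists the header lines by streaming the decompressed data.

```shell
# Build a tiny gzipped FASTA in a temporary directory as a stand-in.
tmp=$(mktemp -d)
printf '>chr1\nACGT\n>chr2\nGGCC\n>chrM\nTTAA\n' | gzip -c > "$tmp/toy.fa.gz"

# Stream the decompressed data; nothing is written back to disk.
n_seq=$(gzip -dc "$tmp/toy.fa.gz" | grep -c '^>')
names=$(gzip -dc "$tmp/toy.fa.gz" | grep '^>' | sed 's/^>//' | tr '\n' ' ')

echo "$n_seq sequences: $names"
rm -rf "$tmp"
```

The same two pipelines should work on any of the .fa.gz or .fna.gz files above, which makes it easy to compare how each source names its sequences.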

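      5. More updates to come~

      • Corrections are welcome if you spot any mistakes;
      • If you have other common questions about the human reference genome in bioinformatics work, leave them in the comments and we can work through them together~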
