【原】筆記 | GWAS 操作流程2-1：缺失質(zhì)控

育種數(shù)據(jù)分析 2021-11-18

展開全文

「主要介紹」

?
--geno篩選個體；--mind篩選SNP
?

GWAS分析時，拿到基因型數(shù)據(jù)，拿到表型數(shù)據(jù)，要首先做以下幾點：

1，查看自己的表型數(shù)據(jù)，是否有問題
2，查看自己的基因型數(shù)據(jù)，是否有問題

然后再進行建模，得到顯著性SNP以及可視化結(jié)果。

清洗數(shù)據(jù)的時間占80%的時間，有句話這樣講：“Garbage in, Garbage out（垃圾進，垃圾出）”，所以清洗數(shù)據(jù)非常重要，今天學習一下基因組數(shù)據(jù)如何清洗。

「為何對缺失數(shù)據(jù)進行篩選？」

?
無論是測序還是芯片，得到的基因型數(shù)據(jù)要進行質(zhì)控，而對缺失數(shù)據(jù)進行篩選，可以去掉低質(zhì)量的數(shù)據(jù)。如果一個個體，共有50萬SNP數(shù)據(jù)，發(fā)現(xiàn)20%的SNP數(shù)據(jù)（10萬）都缺失，那這個個體我們認為質(zhì)量不合格，如果加入分析中可能會對結(jié)果產(chǎn)生負面的影響，所以我們可以把它刪除。同樣的道理，如果某個SNP，在500個樣本中，缺失率為20%（即該SNP在100個個體中都沒有分型結(jié)果），我們也可以認為該SNP質(zhì)量較差，將去刪除。當然，這里的20%是過濾標準，可以改變質(zhì)控標準。下文中的質(zhì)控標準是2%。
?

1. plink數(shù)據(jù)格式轉(zhuǎn)化

數(shù)據(jù)使用上一篇的數(shù)據(jù)，因為數(shù)據(jù)是plink的bfile格式，二進制不方便查看，我們將其轉(zhuǎn)化為文本map和ped的格式。

plink --bfile HapMap_3_r3_1 --recode --out test

結(jié)果生成：test.map test.ped

2. 查看基因型個體和SNP數(shù)量

wc -l test.map test.ped

可以看出，共有165個基因型個體，共有1447897個SNP數(shù)據(jù)。

「預覽一下ped文件：」「預覽一下map文件：」

3. 查看一下個體缺失的位點數(shù)，每個SNP缺失的個體數(shù)

看一下描述：

--missing: Sample missing data report written to plink.imiss, and variant-based
missing data report written to plink.lmiss.

結(jié)果生成兩個文件，分別是一個個體ID上SNP缺失的信息，另一個是每個SNP在個體ID中缺失的信息。

個體缺失位點的統(tǒng)計在plink.imiss中
單個SNP缺失的個體數(shù)在plink.lmiss.中

「個體缺失位點統(tǒng)計預覽：」第一列為家系ID，第二列為個體ID，第三列是否表型缺失，第四列缺失的SNP個數(shù)，第五列總SNP個數(shù)，第六列缺失率。「SNP缺失的個體數(shù)文件預覽：」第一列為染色體，第二列為SNP名稱，第三列為缺失個數(shù)，第四列為總個數(shù)，第五列為缺失率

「R語言做直方圖」

代碼的意思是讀取這兩個文件，然后用頻率的那一列作圖，將圖保存為pdf輸出。

indmiss<-read.table(file="plink.imiss", header=TRUE)
snpmiss<-read.table(file="plink.lmiss", header=TRUE)
# read data into R

pdf("histimiss.pdf") #indicates pdf format and gives title to file
hist(indmiss[,6],main="Histogram individual missingness") #selects column 6, names header of file

pdf("histlmiss.pdf")
hist(snpmiss[,5],main="Histogram SNP missingness")
dev.off() # shuts down the current device

因為，當前目錄已經(jīng)有hist_miss.R腳本，所以可以直接在目錄下運行下面代碼，會生成兩個pdf文件。

Rscript hist_miss.R

「單個位點缺失率統(tǒng)計：」

「基因型個體缺失率統(tǒng)計：」

4. 對個體及SNP缺失率進行篩選

1，如果一個SNP在個體中2%都是缺失的，那么就刪掉該SNP，參數(shù)為：--mind 0.02
2，如果一個個體，有2%的SNP都是缺失的，那么就刪掉該個體，參數(shù)為：--geno 0.02

4. 1 對個體缺失率進行篩選

「先過濾個體缺失率高于2%的SNP」

 plink --bfile HapMap_3_r3_1 --geno 0.02 --make-bed --out HapMap_3_r3_2

「轉(zhuǎn)化為map和ped的形式：」

plink --bfile HapMap_3_r3_2 --recode --out test

查看一下過濾后的行數(shù)，

「之前的為：」

  1457897 test.map
      165 test.ped

「現(xiàn)在的為：」

  1430443 test.map
      165 test.ped

可以看出，過濾了2萬多個位點。

從當時的log日志里也可以看出這一點：

PLINK v1.90b6.5 64-bit (13 Sep 2018)           www.cog-genomics.org/plink/1.9/
(C) 2005-2018 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to HapMap_3_r3_2.log.
Options in effect:
  --bfile HapMap_3_r3_1
  --geno 0.02
  --make-bed
  --out HapMap_3_r3_2

515185 MB RAM detected; reserving 257592 MB for main workspace.
1457897 variants loaded from .bim file.
165 people (80 males, 85 females) loaded from .fam.
112 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 112 founders and 53 nonfounders present.
Calculating allele frequencies... done.
Warning: 225 het. haploid genotypes present (see HapMap_3_r3_2.hh ); many
commands treat these as missing.
Total genotyping rate is 0.997378.
27454 variants removed due to missing genotype data (--geno).
1430443 variants and 165 people pass filters and QC.
Among remaining phenotypes, 56 are cases and 56 are controls.  (53 phenotypes
are missing.)
--make-bed to HapMap_3_r3_2.bed + HapMap_3_r3_2.bim + HapMap_3_r3_2.fam ...
done.

可以看到--geno，過濾了27454個位點。

4. 2 對SNP缺失率進行篩選

「過濾SNP缺失率高于2%的個體」

plink --bfile HapMap_3_r3_2 --mind 0.02 --make-bed --out HapMap_3_r3_3

「查看日志：」

Options in effect:
  --bfile HapMap_3_r3_2
  --make-bed
  --mind 0.02
  --out HapMap_3_r3_3

515185 MB RAM detected; reserving 257592 MB for main workspace.
1430443 variants loaded from .bim file.
165 people (80 males, 85 females) loaded from .fam.
112 phenotype values loaded from .fam.
0 people removed due to missing genotype data (--mind).
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 112 founders and 53 nonfounders present.
Calculating allele frequencies... done.
Warning: 179 het. haploid genotypes present (see HapMap_3_r3_3.hh ); many
commands treat these as missing.
Total genotyping rate is 0.997899.
1430443 variants and 165 people pass filters and QC.
Among remaining phenotypes, 56 are cases and 56 are controls.  (53 phenotypes
are missing.)
--make-bed to HapMap_3_r3_3.bed + HapMap_3_r3_3.bim + HapMap_3_r3_3.fam ...
done.

「沒有過濾掉個體，剩余：」個體：165 SNP：1430443

5 同時對個體和SNP的缺失率進行篩選

「兩步合在一起，即過濾位點，又過濾個體：」

plink --bfile HapMap_3_r3_1 --geno 0.02 --mind 0.02 --make-bed --out HapMap_3_r3_5

plink日志：

Options in effect:
  --bfile HapMap_3_r3_1
  --geno 0.02
  --make-bed
  --mind 0.02
  --out HapMap_3_r3_5

515185 MB RAM detected; reserving 257592 MB for main workspace.
1457897 variants loaded from .bim file.
165 people (80 males, 85 females) loaded from .fam.
112 phenotype values loaded from .fam.
1 person removed due to missing genotype data (--mind).
ID written to HapMap_3_r3_5.irem .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 112 founders and 52 nonfounders present.
Calculating allele frequencies... done.
Warning: 225 het. haploid genotypes present (see HapMap_3_r3_5.hh ); many
commands treat these as missing.
Total genotyping rate in remaining samples is 0.997486.
26686 variants removed due to missing genotype data (--geno).
1431211 variants and 164 people pass filters and QC.
Among remaining phenotypes, 56 are cases and 56 are controls.  (52 phenotypes
are missing.)
--make-bed to HapMap_3_r3_5.bed + HapMap_3_r3_5.bim + HapMap_3_r3_5.fam ...
done.