Hemberg-lab單細(xì)胞轉(zhuǎn)錄組數(shù)據(jù)分析（七）-導(dǎo)入10X和SmartSeq2數(shù)據(jù)Tabula Muris

迷途中小小書童 2019-04-03

展開全文

Tabula Muris

Tabula Muris是測序小鼠20個器官和組織的單細(xì)胞轉(zhuǎn)錄組圖譜的國際合作項目 (Transcriptomic characterization of 20 organs and tissues from mouse at single cell resolution creates a Tabula Muris)。

簡介

我們使用 Tabula Muris最開始釋放的數(shù)據(jù)做為測試數(shù)據(jù)來完成完整的單細(xì)胞數(shù)據(jù)分析。The Tabula Muris是一個國際合作組織，目的是采用標(biāo)準(zhǔn)方法生成小鼠每個細(xì)胞的圖譜。建庫測序方法包括通量高覆蓋率低的10X數(shù)據(jù)和通量低覆蓋率高的FACS篩選+Smartseq2建庫技術(shù)。

起始數(shù)據(jù)于2017年12月20日釋放，包含20個組織/器官的100,000細(xì)胞的轉(zhuǎn)錄組圖譜。

下載數(shù)據(jù)

與其它sc-RNASeq數(shù)據(jù)上傳到GEO或ArrayExpress不同，Tabula Muris通過figshare平臺釋放數(shù)據(jù)。可以通過搜索文章的doi號10.6084/m9.figshare.5715040下載FACS/Smartseq2數(shù)據(jù)和10.6084/m9.figshare.5715025下載10X數(shù)據(jù)。數(shù)據(jù)可以直接點擊鏈接下載，或使用下面的wget命令：

終端下載FACS數(shù)據(jù):

wget https://ndownloader./files/10038307
unzip 10038307
wget https://ndownloader./files/10038310
mv 10038310 FACS_metadata.csv
wget https://ndownloader./files/10039267
mv 10039267 FACS_annotations.csv

終端下載10X數(shù)據(jù):

wget https://ndownloader./files/10038325
unzip 10038325
wget https://ndownloader./files/10038328
mv 10038328 droplet_metadata.csv
wget https://ndownloader./files/10039264
mv 10039264 droplet_annotation.csv

如果數(shù)據(jù)是手動下載的，也需要像上面一樣解壓和重命名。

現(xiàn)在應(yīng)該有兩個文件夾: FACS和droplet，每個對應(yīng)一個annotation和metadata文件。使用head命令查看前10行：

head -n 10 droplet_metadata.csv

使用wc -l查看文件的行數(shù):

wc -l droplet_annotation.csv

練習(xí)：FACS和10X數(shù)據(jù)中各有多少細(xì)胞有注釋信息？

答案: FACS : 54,838 cells; Droplet : 42,193 cells

讀入數(shù)據(jù) (Smartseq2)

讀入逗號分隔的count matrix，存儲為數(shù)據(jù)框:

dat = read.delim("FACS/Kidney-counts.csv", sep=",", header=TRUE)
dat[1:5,1:5]

數(shù)據(jù)庫第一列是基因名字，把它移除作為列名字：

dim(dat)
rownames(dat) <- dat[,1]
dat <- dat[,-1]

這是Smartseq2數(shù)據(jù)集，可能含有spike-ins:

rownames(dat)[grep("^ERCC-", rownames(dat))]

從列名字中提取metadata信息:

cellIDs <- colnames(dat)
cell_info <- strsplit(cellIDs, "\\.")
Well <- lapply(cell_info, function(x){x[1]})
Well <- unlist(Well)
Plate <- unlist(lapply(cell_info, function(x){x[2]}))
Mouse <- unlist(lapply(cell_info, function(x){x[3]}))

檢測每種metadata類型的數(shù)據(jù)分布：

summary(factor(Mouse))

查看有沒有技術(shù)因子是cofounded，實驗批次與供體小鼠批次一致:

table(Mouse, Plate)

最后讀入計算預(yù)測的細(xì)胞類型注釋，并與表達(dá)矩陣中的細(xì)胞注釋做比較:

ann <- read.table("FACS_annotations.csv", sep=",", header=TRUE)
ann <- ann[match(cellIDs, ann[,1]),]
celltype <- ann[,3]
table(celltype)

構(gòu)建scater對象

為了構(gòu)建SingleCellExperiment對象，先把所有的細(xì)胞注釋放到一個數(shù)據(jù)框中。因為實驗批次(pcr plate)和供體小鼠完全重合，所以只保留一個信息。

suppressMessages(require("SingleCellExperiment"))
suppressMessages(require("scater"))
cell_anns <- data.frame(mouse = Mouse, well=Well, type=celltype)
rownames(cell_anns) <- colnames(dat)
sceset <- SingleCellExperiment(assays = list(counts = as.matrix(dat)), colData=cell_anns)

str(sceset)

如果數(shù)據(jù)集包含spike-ins，我們在SingleCellExperiment對象中定義一個變量記錄它們。

isSpike(sceset, "ERCC") <- grepl("ERCC-", rownames(sceset))
str(sceset)

讀入10X的數(shù)據(jù)

因為10X技術(shù)細(xì)胞通量高但測序覆蓋度低，所以其count matrix是一個大的稀疏矩陣(矩陣中高達(dá)90%的數(shù)據(jù)的數(shù)值為0)。CellRanger默認(rèn)的輸出格式是.mtx文件用于存儲這個稀疏矩陣，第一列是基因的坐標(biāo)（0-based），第二列是細(xì)胞的坐標(biāo)（0-based），第三列是大于0的表達(dá)值 (長表格形式)。打開.mtx文件會看到兩行標(biāo)題行后面是包含總行數(shù) (基因數(shù))、列數(shù) (樣本數(shù))和稀疏矩陣總行數(shù) (生信寶典注：所有細(xì)胞中表達(dá)不為0的基因的總和)的一行數(shù)據(jù)。

%%MatrixMarket matrix coordinate integer general
%
23433 610 1392643
5 1 1
28 1 1
40 1 2

鑒于.mtx文件中只存儲了基因和樣品名字的坐標(biāo)，而實際的基因和樣品的名字必須單獨存儲到文件genes.tsv和barcodes.tsv。

下面使用Matrix包讀入稀疏矩陣：

suppressMessages(require("Matrix"))
cellbarcodes <- read.table("droplet/Kidney-10X_P4_5/barcodes.tsv")
genenames <- read.table("droplet/Kidney-10X_P4_5/genes.tsv")
molecules <- Matrix::readMM("droplet/Kidney-10X_P4_5/matrix.mtx")

下一步增加合適的行或列的名字。首先查看read的cellbarcode信息會發(fā)現(xiàn)這個文件只有barcode序列?？紤]到10X數(shù)據(jù)每一批的cellbarcode是有重疊的，所以在合并數(shù)據(jù)前，需要把批次信息與barcode信息合并一起。

head(cellbarcodes)

rownames(molecules) <- genenames[,1]
colnames(molecules) <- paste("10X_P4_5", cellbarcodes[,1], sep="_")

讀入計算注釋的細(xì)胞類型信息:

meta <- read.delim("droplet_metadata.csv", sep=",", header=TRUE)
head(meta)

我們需要用10X_P4_5獲得這批數(shù)據(jù)對應(yīng)的metadata信息。這時需要注意metadata表格中mouse ID與前面plate-based (FACS SmartSeq2)數(shù)據(jù)集的mouse ID不同，這里用-而非_作為分隔符，并且性別在中間。通過查閱文獻(xiàn)中的描述得知droplet (10X)和plate-based (FACS SmartSeq2)的技術(shù)用了同樣的8只老鼠。所以對數(shù)據(jù)做下修正，使得10X與FACS的數(shù)據(jù)一致。

meta[meta$channel == "10X_P4_5",]
mouseID <- "3_8_M"

注意：有些組織的10X數(shù)據(jù)可能來源于多個小鼠的樣品，如mouse id = 3-M-5/6。也需要格式化這些信息，但可能這些與FACS數(shù)據(jù)的mouse id會不一致，進(jìn)而影響下游分析。如果小鼠不是純系，可能需要通過exonic-SNP把細(xì)胞和對應(yīng)的小鼠聯(lián)系起來 (本課程不會涉及)。

ann <- read.delim("droplet_annotation.csv", sep=",", header=TRUE)
head(ann)

注釋中的cellID和cellbarcodes也存在細(xì)微差別，少了最后的-1，在匹配前需要做下校正。(生信寶典注：這種數(shù)據(jù)不一致是經(jīng)常要處理的問題，每一步檢查結(jié)果。如果與預(yù)期不符，考慮有沒有未考慮到的數(shù)據(jù)不一致的地方。)

ann[,1] <- paste(ann[,1], "-1", sep="")
ann_subset <- ann[match(colnames(molecules), ann[,1]),]
celltype <- ann_subset[,3]

構(gòu)建cell-metadata數(shù)據(jù)框:

cell_anns <- data.frame(mouse = rep(mouseID, times=ncol(molecules)), type=celltype)
rownames(cell_anns) <- colnames(molecules)
head(cell_anns)

練習(xí) 用組織的其它批次重復(fù)上面的處理。

答案

molecules1 <- molecules
cell_anns1 <- cell_anns

cellbarcodes <- read.table("droplet/Kidney-10X_P4_6/barcodes.tsv")
genenames <- read.table("droplet/Kidney-10X_P4_6/genes.tsv")
molecules <- Matrix::readMM("droplet/Kidney-10X_P4_6/matrix.mtx")
rownames(molecules) <- genenames[,1]
colnames(molecules) <- paste("10X_P4_6", cellbarcodes[,1], sep="_")
mouseID <- "3_9_M"
ann_subset <- ann[match(colnames(molecules), ann[,1]),]
celltype <- ann_subset[,3]
cell_anns <- data.frame(mouse = rep(mouseID, times=ncol(molecules)), type=celltype)
rownames(cell_anns) <- colnames(molecules)

molecules2 <- molecules
cell_anns2 <- cell_anns

cellbarcodes <- read.table("droplet/Kidney-10X_P7_5/barcodes.tsv")
genenames <- read.table("droplet/Kidney-10X_P7_5/genes.tsv")
molecules <- Matrix::readMM("droplet/Kidney-10X_P7_5/matrix.mtx")
rownames(molecules) <- genenames[,1]
colnames(molecules) <- paste("10X_P7_5", cellbarcodes[,1], sep="_")
mouseID <- "3_57_F"
ann_subset <- ann[match(colnames(molecules), ann[,1]),]
celltype <- ann_subset[,3]
cell_anns <- data.frame(mouse = rep(mouseID, times=ncol(molecules)), type=celltype)
rownames(cell_anns) <- colnames(molecules)

molecules3 <- molecules
cell_anns3 <- cell_anns

創(chuàng)建`scater`對象

現(xiàn)在讀入了多個批次的10X數(shù)據(jù)，把它們組合成一個SingleCellExperiment object對象。首先檢查不同批次數(shù)據(jù)的基因名字是否一致:

identical(rownames(molecules1), rownames(molecules2))
identical(rownames(molecules1), rownames(molecules3))

確認(rèn)沒有重復(fù)的細(xì)胞ID:

sum(colnames(molecules1) %in% colnames(molecules2))
sum(colnames(molecules1) %in% colnames(molecules3))
sum(colnames(molecules2) %in% colnames(molecules3))

檢查無誤，把它們組合起來:

# 獲得大的表達(dá)矩陣
all_molecules <- cbind(molecules1, molecules2, molecules3)
# 獲得大的數(shù)據(jù)矩陣
all_cell_anns <- as.data.frame(rbind(cell_anns1, cell_anns2, cell_anns3))
# 增加批次信息
all_cell_anns$batch <- rep(c("10X_P4_5", "10X_P4_6","10X_P7_5"), times = c(nrow(cell_anns1), nrow(cell_anns2), nrow(cell_anns3)))

練習(xí): 全部數(shù)據(jù)集有多少細(xì)胞？

答案：

dim(all_molecules)[2]

現(xiàn)在創(chuàng)建SingleCellExperiment對象。SingleCellExperiment對象的優(yōu)勢是可以正常矩陣、稀疏矩陣格式存儲數(shù)據(jù)，還可以以HDF5格式在磁盤存儲和訪問大的非稀疏矩陣而不用全部加載到內(nèi)存中。

require("SingleCellExperiment")
require("scater")
all_molecules <- as.matrix(all_molecules)
sceset <- SingleCellExperiment(assays = list(counts = all_molecules), colData=all_cell_anns)

這是10X的數(shù)據(jù)，不包含spike-ins直接存儲數(shù)據(jù):