This repository is structured to enable easy installation of the required dependencies. This is enabled by the prtools
R package, which serves two major purposes: it provides the helper functions used throughout the analysis, and installing it will also ensure that all other required packages are installed. We also provide a Docker container which has everything pre-installed and can be used to run the analysis on a server in the cloud.
The repository also includes <step>.R
files for all steps of the analysis, which you can use to run parts of the analysis in an automated manner. Additional data files are also contained in the repository. To obtain a local copy you can clone the repository with git:
git clone https://github.com/cdiener/proliferation
The recommended mode of usage is via the docker images since they guarantee correct versions for dependencies and include all intermediate data.
Note that this will require a machine with 16+ GB of RAM.
If you have docker installed, you can follow the steps provided in the section “With a cloud provider” below.
For a local installation you will need R (http://r-project.org), git (http://git-scm.org) and Python 3 installed (http://python.org).
On Ubuntu or Debian Stretch these can be installed via
sudo apt-get install r-base r-base-dev python3 python3-pip git
Clone the repository and enter the folder:
git clone https://github.com/cdiener/proliferation && cd proliferation
For Debian Jessie we recommend updating pip on a per-user basis in order to get a pip version greater than 8.1.
sudo apt-get install r-base r-base-dev python3 python3-pip git
pip3 install -U pip
The python dependencies can be installed with
pip3 install lxml python-libsbml numpy scipy cobra
R dependencies can be installed from within R (type “R” in Terminal) with
install.packages("devtools")
source("https://bioconductor.org/biocLite.R")
biocLite()
After that you can install prtools
simply via
devtools::install_github("cdiener/proliferation/prtools")
Loading the prtools package will then also load all additional packages we will use:
library(prtools)
## Loading required package: RcppProgress
## Also loading: biomaRt 2.30.0
## Also loading: data.table 1.9.6
## Also loading: doMC 1.3.4
## Also loading: foreach 1.4.3
## Also loading: ggplot2 2.2.0
## Also loading: glmnet 2.0.5
## Also loading: oligo 1.38.0
## Also loading: pheatmap 1.0.8
## Also loading: survival 2.40.1
## Also loading: tcgar 0.3.0
If you want to download the raw analysis data you will also need the GDC data transfer tool. Installation instructions can be found on the GDC website. Please make sure that the gdc-client
executable is in your path. The docker image already includes the tool.
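You can quickly verify this from within R; Sys.which returns the full path of the gdc-client executable, or an empty string if it cannot be found:
Sys.which("gdc-client")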
With a cloud provider such as Google Cloud or Amazon AWS you can run the analysis directly from the docker image:
docker pull cdiener/proliferation
docker run -d -p 8000:8787 cdiener/proliferation
This will present you with an RStudio interface, served on port 8000 of your server, where all additional dependencies and intermediate data are available.
To get started, click the “…” symbol in the file panel on the lower right and enter “/data/proliferation”. Now click on “protocol.rmd” to view or run the protocol (by clicking “knit HTML”), or use any of the *.R files in the same directory.
This section describes how to obtain the raw data. Alternatively, it shows how to download intermediate data sets that cut down the analysis time significantly.
We will start by getting all the required data. You will need about 40 GB of disk space for all the downloads. This is the longest step of the analysis.
First, we will download the HuEx 1.0 ST exon expression data for the NCI-60 cell lines using the GEOquery
package (about 9 GB).
GEOquery::getGEOSuppFiles("GSE29682")
untar("GSE29682/GSE29682_RAW.tar", exdir="GSE29682")
We follow this by downloading all relevant TCGA data from the Genomic Data Commons (GDC) as well (19 GB). For that you will need a file manifest generated from the GDC data portal. Here, we use the tcgar package which we wrote specifically for this analysis. We will save the data into the GDC
subfolder with another subfolder for each data type. File manifests for the data used in this analysis are already provided in the GDC
folder.
get_data("GDC/manigest_tcga_rnaseq", "GDC/rnaseq")
get_data("GDC/manigest_tcga_huex", "GDC/huex")
get_data("GDC/manigest_tcga_clinical", "GDC/clinical")
Note that this will take a long time.
Finally, we will also get annotations that relate all the genes from the NCI-60 and TCGA data sets. Here, we will employ biomaRt to obtain mappings between the probes on the HuEx microarrays and several gene ID systems. This step is optional since the generated probemap.rds
file is contained in the repository as well.
ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
attrs <- c("ensembl_gene_id", "description", "external_gene_name", "ucsc",
"entrezgene", "affy_huex_1_0_st_v2")
probemap <- getBM(mart=ensembl, attributes=attrs)
probemap <- data.table(probemap)
names(probemap) <- c("ensgene", "description", "symbol", "ucsc", "entrez", "huex")
saveRDS(probemap, "probemap.rds")
The following intermediate data files are available online:
Note: This entire section is optional if you downloaded the NCI-60 gene expression file (#1). Running it will require the prior download of the NCI-60 data as described in “Obtain raw data”.
We will start by preparing the HuEx exon expression data for the NCI-60 data set. Here, we will read all of the raw files, calculate the log expression values and normalize the arrays with RMA. This will require about 16 GB of RAM and take a while. If you have several cores available on your machine we recommend using them here. To use 6 cores during the computations you may use the following:
registerDoMC(6)
We will also save the raw expression set in a serialized format, so we do not have to repeat this step every time. This will take about half an hour.
if (file.exists("eset_raw.rds")) eset <- readRDS("eset_raw.rds") else {
    # Read all raw CEL files and normalize them with RMA on the probe set level
    celfiles <- list.celfiles(recursive=T, listGzipped=T)
    raw_data <- read.celfiles(celfiles)
    eset <- rma(raw_data, target="probeset")
    rm(raw_data)
    saveRDS(eset, file="eset_raw.rds")
}
The expression values we obtain this way are on the level of “probe sets”, sets of spots on the microarray. However, in order to obtain expression values per gene we have to summarize those probe set expression values. For this we can use the probemap.rds
mapping that we obtained earlier. First, we will prepare the probe map to give us unique mappings between genes and probe sets. We will only consider genes that map to a probe set that is present on our arrays and we will also only use probe sets that map to a gene with a known ENSEMBL ID.
probemap <- readRDS("probemap.rds")
probemap <- unique(probemap, by=c("ensgene", "huex"))
probemap[, huex := as.character(huex)]
setkey(probemap, huex)
eset <- eset[rownames(eset) %in% probemap$huex, ]
probemap <- probemap[huex %in% rownames(eset) & !is.na(ensgene)]
We now have a mapping of probe sets to genes and vice versa. We will use that to summarize the log gene expression values into the geometric mean for each gene. For that we can use the eset_reduce
function from the prtools
package which does that rapidly (because it is implemented in C++).
eset <- eset_reduce(eset, probemap$huex, probemap$ensgene)
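Conceptually, eset_reduce groups the probe sets by gene and averages their log expression values (the geometric mean on the linear scale). A plain-R sketch of the same idea, which operates on the expression matrix directly and returns a matrix rather than an ExpressionSet (the C++ implementation in prtools may differ in details), would be:
reduce_sketch <- function(exprs_mat, probes, genes) {
    m <- exprs_mat[probes, , drop=FALSE]   # log expression of all mapped probe sets
    sums <- rowsum(m, group=genes)         # per-gene sums (rows sorted by gene ID)
    counts <- as.vector(table(genes))      # probe sets per gene (same sorted order)
    sums / counts                          # mean log expression = log geometric mean
}
# e.g. gene_exprs <- reduce_sketch(exprs(eset), probemap$huex, probemap$ensgene)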
We now have the per-gene log expression values for each of the 178 microarrays in our data set. In order to map those arrays to their respective cell line we can use the annotations contained in samples.csv
:
samples <- fread("samples.csv")
head(samples)
## geo_id tumor cell_line
## 1: GSM736050 Breast BT-549
## 2: GSM736051 Breast HS578T
## 3: GSM736052 Breast MCF7
## 4: GSM736053 Breast MDA-MB-231
## 5: GSM736054 Breast T47D
## 6: GSM736055 CNS SF-268
So we see that there are 2-3 repetitions for each cell line. The actual growth rates are contained in growth_rates.csv
:
gcs <- fread("growth_rates.csv")
head(gcs)
## cell_line panel_name doubling_time inoculation_density
## 1: CCRF-CEM Leukemia 26.7 40000
## 2: HL-60 Leukemia 28.6 40000
## 3: K-562 Leukemia 19.6 5000
## 4: MOLT-4 Leukemia 27.9 30000
## 5: RPMI-8226 Leukemia 33.5 20000
## 6: SR Leukemia 28.7 20000
Now, we will reduce the individual repetitions for each cell line to its mean log expression values and relate the cell lines in the microarray samples to the ones in the growth rate data set.
setkey(gcs, cell_line)
cell_lines <- intersect(gcs$cell_line, samples$cell_line)
eset_summ <- sapply(cell_lines, function(cl) rowMeans(exprs(eset)[,samples$cell_line==cl]))
colnames(eset_summ) <- cell_lines
Proliferation is given as the doubling time (in hours), so we will convert those to proliferation rates and also run a quick check whether the proliferation rates are ordered the same way as the microarray samples.
rates <- gcs[cell_lines, log(2)/doubling_time]
names(rates) <- cell_lines
if (all(colnames(eset_summ) == names(rates))) cat("Ordering is the same :)")
Finally, we will attach the proliferation rates as an additional column to the expression values. Each row now denotes a cell line and each column a gene, with the last column containing the proliferation rates. This is a common format for regression and classification problems and we will save it in CSV format, so that it can be used easily in other software.
export <- data.table(t(eset_summ))
export[, "rates" := rates]
write.csv(export, "regprob.csv", row.names=F)
If you downloaded the intermediate data (#2) you can skip the following steps and read it directly using:
tcga <- readRDS("tcga.rds")
After downloading the raw TCGA data from GDC, we will start by reading the TCGA data into RAM. The tcgar
package does this efficiently. We will store the results in a list with one sublist for each technology (this will take around 20 minutes depending on your server).
library(tcgar)
tcga <- list(
rnaseq = read_rnaseq("GDC/manifest_tcga_rnaseq.tsv", "GDC/rnaseq", progress=F),
huex = read_huex("GDC/manifest_tcga_huex.tsv", "GDC/huex", progress=F),
clinical = read_clinical("GDC/manifest_tcga_clinical.tsv", "GDC/clinical", progress=F)
)
Again, we will save the output in a serialized file.
saveRDS(tcga, "tcga.rds")
In order to compare gene expression between the TCGA RNA-Seq data and NCI-60 microarrays we will need to create a data set that contains the mean expression values for all genes shared between the TCGA and NCI-60 data. For that, we calculate the mean log expression values across all samples within each data set (RNA-Seq abundances are measured in counts, so we have to add a pseudo count in order to obtain the log). We begin with the means for the RNA-Seq and HuEx data contained in the TCGA data set.
all_rnaseq <- log(rowMeans(tcga$rnaseq$counts)+1, 2)
all_rnaseq <- data.table(ensgene=tcga$rnaseq$features$ensgene,
tcga_rnaseq=all_rnaseq)
all_huex <- data.table(symbol=tcga$huex$features$symbol,
tcga_huex=rowMeans(tcga$huex$assay))
We also do the same for the NCI-60 HuEx data.
nci60 <- fread("regprob.csv", header=T)
rates <- nci60$rates
nci60[, rates := NULL]
all_nci60 <- data.table(ensgene=colnames(nci60), nci60_huex=colMeans(nci60))
We will now join those data sets by mapping their respective IDs. tcgar
already comes with a table mapping the genes in TCGA between different IDs, genemap
.
join <- merge(as.data.table(tcgar::genemap), all_nci60, by="ensgene")
join <- merge(join, all_huex, by="symbol")
join <- merge(join, all_rnaseq, by="ensgene")
head(join)
## ensgene symbol
## 1: ENSG00000000003 TSPAN6
## 2: ENSG00000000005 TNMD
## 3: ENSG00000000419 DPM1
## 4: ENSG00000000457 SCYL3
## 5: ENSG00000000460 C1orf112
## 6: ENSG00000000938 FGR
## description
## 1: tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]
## 2: tenomodulin [Source:HGNC Symbol;Acc:HGNC:17757]
## 3: dolichyl-phosphate mannosyltransferase polypeptide 1, catalytic subunit [Source:HGNC Symbol;Acc:HGNC:3005]
## 4: SCY1-like, kinase-like 3 [Source:HGNC Symbol;Acc:HGNC:19285]
## 5: chromosome 1 open reading frame 112 [Source:HGNC Symbol;Acc:HGNC:25565]
## 6: FGR proto-oncogene, Src family tyrosine kinase [Source:HGNC Symbol;Acc:HGNC:3697]
## entrez nci60_huex tcga_huex tcga_rnaseq
## 1: 7105 6.102991 9.232960 11.795649
## 2: 64102 4.281391 4.533475 6.336445
## 3: 8813 7.650315 8.091734 10.909042
## 4: 57147 5.985178 6.499165 9.908830
## 5: 55732 6.206742 5.850941 9.004820
## 6: 2268 6.113765 6.895676 9.491145
Sometimes those mappings between IDs are not unique, which makes it impossible to assign unique gene expression values. We will remove those cases.
dupes <- join$ensgene[duplicated(join$ensgene)]
print(sum(join$ensgene %in% dupes))
## [1] 0
join <- join[!(join$ensgene %in% dupes), ]
saveRDS(join, "join.rds")
That gives us our joined gene expression data set.
In order to use gene expression data from the NCI-60 data set for predictions on the TCGA data set, we first have to find a set of genes with very similar expression values between cell lines and human samples. As described in the manuscript, we consider gene expression to be conserved if the expression values are correlated, meaning that the gene expression values of the two data sets follow a linear relationship. We could estimate the slope and intercept directly from the TCGA RNA-Seq and NCI-60 HuEx data; however, a more stable approach is to estimate both parameters independently from the additional data we have.
The intercept describes a constant difference in log expression values between the two data sets. We can estimate it simply by using data obtained with the same technology in both data sets. Since we have HuEx expression data from NCI-60 as well as from the TCGA data set, the intercept can be calculated as the difference in means between the two HuEx data sets.
norm_int <- mean(join$nci60_huex) - mean(join$tcga_huex)
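As a quick sanity check, applying this intercept should make the mean log expression values of the two HuEx data sets coincide (up to floating point error):
mean(join$tcga_huex + norm_int) - mean(join$nci60_huex)  # should be ~0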
As we can see, using this correction both HuEx data sets relate well with each other.
ggplot(join, aes(x=tcga_huex, y=nci60_huex)) + geom_point(alpha=0.1, stroke=0) +
geom_abline(intercept=norm_int, size=3/4, col="dodgerblue") +
geom_abline(intercept=sqrt(2)+norm_int, size=3/4, col="dodgerblue", linetype="dashed") +
geom_abline(intercept=-sqrt(2)+norm_int, size=3/4, col="dodgerblue", linetype="dashed") +
xlab("TCGA HuEx") + ylab("NCI60 HuEx") + theme_bw()
Here, the dashed blue lines denote the area in which genes have a maximum distance of 1 from the 1:1 relation denoted by the solid blue line.
The slope of the linear relationship denotes a difference between technologies. It can be obtained by comparing the two technologies, HuEx and RNA-Seq, within the same data set.
norm <- coef(glm(tcga_huex ~ tcga_rnaseq + 0, data=join))
Again, we see that the majority of genes follows this relationship. However, there are also some genes whose expression values are not comparable, even within the TCGA data set.
ggplot(join, aes(x=tcga_rnaseq, y=tcga_huex)) + geom_point(alpha=0.1, stroke=0) +
geom_abline(slope=norm, size=3/4, col="dodgerblue") +
xlab("TCGA RNA-seq") + ylab("TCGA HuEx") + theme_bw()
Finally, we can validate the two parameters by using the TCGA RNA-Seq data and the NCI-60 HuEx data.
norm <- c(norm, norm_int)
saveRDS(norm, "norm_factor.rds")
ggplot(join, aes(x=tcga_rnaseq, y=nci60_huex)) + geom_point(alpha=0.1, stroke=0) +
geom_abline(slope=norm[1], intercept=norm_int, size=3/4, col="dodgerblue") +
geom_abline(slope=norm[1], intercept=sqrt(2)+norm_int, size=3/4,
col="dodgerblue", linetype="dashed") +
geom_abline(slope=norm[1], intercept=-sqrt(2)+norm_int, size=3/4,
col="dodgerblue", linetype="dashed") +
xlab("TCGA RNA-seq") + ylab("NCI60 HuEx") + theme_bw()
Using both parameters we see good agreement between the data sets; however, there are also genes that do not follow the linear relationship. Since we are only interested in genes with conserved expression across the two data sets, the further analysis will only consider genes that do not differ by a log fold change of more than 1 between the two normalized data sets (the area enclosed by the dashed blue lines).
cutoff <- 1
good <- join[(tcga_rnaseq*norm[1] + norm[2] - nci60_huex)^2 < cutoff &
             (tcga_huex + norm[2] - nci60_huex)^2 < cutoff, ensgene]
cor.test(join[good, tcga_rnaseq], join[good, nci60_huex])
##
## Pearson's product-moment correlation
##
## data: join[good, tcga_rnaseq] and join[good, nci60_huex]
## t = 197.78, df = 7797, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9093612 0.9167416
## sample estimates:
## cor
## 0.9131262
So we see that this cutoff leads to a good general agreement.
We now have a way of normalizing the gene expression data between TCGA and NCI-60 and a criterion to select only the conserved genes. Thus, we can now turn to the regression problem. We start by reading the regression data again.
rdata <- fread("regprob.csv", header=T)
This time we will only select the genes that are conserved (difference in normalized log expression smaller than one).
rates <- rdata$rates
rdata[, "rates" := NULL]
rdata <- as.matrix(rdata)[, good]
The resulting matrix has the following dimensions:
dim(rdata)
## [1] 57 7799
We will now test a total of 4 different generalized linear models.
We will use LASSO generalized linear models from the glmnet
package. We will also initialize two data frames that will contain all predictions and goodness-of-fit metrics. Those will be calculated for each model on the training set and under leave-one-out cross-validation.
library(glmnet)
pred <- data.frame()
m <- data.frame()
# For leave-one-out cross validation
folds <- length(rates)
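For reference, the goodness-of-fit metrics reported in the table further below (mse, rmse, mae, mre and rsq) can be computed from pairs of measured and predicted rates roughly as follows. This is only a sketch of what a measures()-style helper returns and may differ in detail from the prtools implementation:
measures_sketch <- function(truth, pred) {
    err <- truth - pred
    c(mse = mean(err^2),                                    # mean squared error
      rmse = sqrt(mean(err^2)),                             # root mean squared error
      mae = mean(abs(err)),                                 # mean absolute error
      mre = mean(abs(err) / abs(truth)),                    # mean relative error
      rsq = 1 - sum(err^2) / sum((truth - mean(truth))^2))  # coefficient of determination
}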
In the first model gene expression values enter as linear variables into the model. We will extract the predictions on the training set as well as predictions from leave-one-out cross validation.
mod1 <- cv.glmnet(rdata, rates, nfolds=folds, keep=T, parallel=T,
grouped=FALSE, standardize=FALSE)
pred_train <- predict(mod1, rdata, s="lambda.min")[,1]
pred_test <- mod1$fit.preval[, which.min(mod1$cvm)]
We now append the predicted rates and performance measures to our data frames.
pred <- rbind(pred, data.frame(truth=rates, pred=pred_train, set="train", order="1st"))
pred <- rbind(pred, data.frame(truth=rates, pred=pred_test, set="validation", order="1st"))
m <- rbind(m, data.frame(t(measures(rates, pred_train)), set="train", order="1st"))
m <- rbind(m, data.frame(t(measures(rates, pred_test)), set="validation", order="1st"))
In the second order model we will only consider interactions between gene expression values, meaning products of two gene expression values. Because of the large number of candidate genes it would not be feasible to test all combinations. Instead we will only consider interactions between those genes that have non-zero coefficients in the first order model.
nonzero <- abs(coef(mod1, s="lambda.min")[-1]) > 0
data2 <- inter(rdata[,nonzero])
mod2 <- cv.glmnet(data2, rates, nfolds=folds, keep=T, parallel=T,
grouped=FALSE, standardize=FALSE)
pred_train <- predict(mod2, data2, s="lambda.min")[,1]
pred_test <- mod2$fit.preval[, which.min(mod2$cvm)]
pred <- rbind(pred, data.frame(truth=rates, pred=pred_train, set="train", order="2nd"))
pred <- rbind(pred, data.frame(truth=rates, pred=pred_test, set="validation", order="2nd"))
m <- rbind(m, data.frame(t(measures(rates, pred_train)), set="train", order="2nd"))
m <- rbind(m, data.frame(t(measures(rates, pred_test)), set="validation", order="2nd"))
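The inter() call above constructs the second order terms as element-wise products of pairs of gene expression columns. A minimal sketch of such a helper, assuming the combined column names are joined with an "x" (the separator used when the interaction names are split again later) and ignoring possible extras such as squared terms, could look like this:
inter_sketch <- function(x) {
    idx <- combn(ncol(x), 2)                # all pairs of column indices
    prods <- x[, idx[1, ]] * x[, idx[2, ]]  # element-wise products of the paired columns
    colnames(prods) <- paste0(colnames(x)[idx[1, ]], "x", colnames(x)[idx[2, ]])
    prods
}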
In the third model we will use the same interaction terms as before but also add the original (first order) expression values of the selected genes in order to see whether the model can be improved.
data12 <- cbind(rdata[, nonzero], data2)
mod3 <- cv.glmnet(data12, rates, nfolds=folds, keep=T, parallel=T,
grouped=FALSE, standardize=FALSE)
pred_train <- predict(mod3, data12, s="lambda.min")[,1]
pred_test <- mod3$fit.preval[, which.min(mod3$cvm)]
pred <- rbind(pred, data.frame(truth=rates, pred=pred_train, set="train", order="1st and 2nd"))
pred <- rbind(pred, data.frame(truth=rates, pred=pred_test, set="validation", order="1st and 2nd"))
m <- rbind(m, data.frame(t(measures(rates, pred_train)), set="train", order="1st and 2nd"))
m <- rbind(m, data.frame(t(measures(rates, pred_test)), set="validation", order="1st and 2nd"))
As we can see, adding the first order terms does not improve the second order model.
Finally, we will try to improve the generalization of the model by removing those interactions that have only a very small influence on the model and may thus be related to overfitting.
cf <- as.numeric(coef(mod2, s="lambda.min"))[-1]
names(cf) <- rownames(coef(mod2))[-1]
nonzero <- abs(cf) > quantile(abs(cf[abs(cf) > 0]), 0.25)
data_red <- data2[, nonzero]
mod <- cv.glmnet(data_red, rates, nfolds=folds, keep=T,
grouped=FALSE, standardize=FALSE)
pred_train <- predict(mod, data_red, s="lambda.min")[,1]
pred_test <- mod$fit.preval[, which.min(mod$cvm)]
pred <- rbind(pred, data.frame(truth=rates, pred=pred_train, set="train", order="2nd + cutoff"))
pred <- rbind(pred, data.frame(truth=rates, pred=pred_test, set="validation", order="2nd + cutoff"))
m <- rbind(m, data.frame(t(measures(rates, pred_train)), set="train", order="2nd + cutoff"))
m <- rbind(m, data.frame(t(measures(rates, pred_test)), set="validation", order="2nd + cutoff"))
print(sum(nonzero))
## [1] 38
If we compare the goodness-of-fit measures we can see that the cutoff model shows good performance with an error of about 4% in predicting proliferation rates.
mse | rmse | mae | mre | rsq | set | order |
---|---|---|---|---|---|---|
2.00e-07 | 0.0004688 | 0.0003391 | 0.0163479 | 0.9960625 | train | 1st |
4.49e-05 | 0.0067003 | 0.0052876 | 0.2611128 | 0.1958045 | validation | 1st |
2.00e-07 | 0.0004420 | 0.0003570 | 0.0169304 | 0.9965005 | train | 2nd |
8.10e-06 | 0.0028480 | 0.0022301 | 0.1047906 | 0.8547042 | validation | 2nd |
2.00e-07 | 0.0004420 | 0.0003570 | 0.0169304 | 0.9965005 | train | 1st and 2nd |
8.10e-06 | 0.0028480 | 0.0022301 | 0.1047906 | 0.8547042 | validation | 1st and 2nd |
1.00e-07 | 0.0003543 | 0.0002643 | 0.0132028 | 0.9977519 | train | 2nd + cutoff |
1.00e-06 | 0.0010053 | 0.0008111 | 0.0398894 | 0.9818972 | validation | 2nd + cutoff |
This can also be observed visually by plotting the real proliferation rates against the predicted ones (a perfect match would create a 1:1 line).
ggplot(pred, aes(x=truth, y=pred, col=order)) + geom_abline() +
geom_point() + facet_grid(set ~ order) + theme_bw() +
xlab("measured proliferation rate [1/h]") +
ylab("predicted proliferation rate [1/h]") +
theme(legend.position="none")
We will also save a list of the gene interactions and coefficient values for the final model.
genes <- do.call(rbind, strsplit(colnames(data_red), "x"))
colnames(genes) <- c("gene1", "gene2")
write.csv(data.frame(genes, coef=cf[nonzero]), "best_interactions.csv", row.names=F)
We now have our trained model and will continue by predicting proliferation rates for the TCGA samples.
The major problem we have is that in the HuEx data genes are identified by symbols whereas our interaction terms are identified by ENSEMBL gene IDs. So we will begin by creating a mapping between the two using the gene map from before and getting the symbols for the genes in our interaction terms.
map <- genemap
setkey(map, ensgene)
symbs <- cbind(map[genes[,1], symbol], map[genes[,2], symbol])
With that we can get the expression values for those genes and normalize them with the normalization factor we obtained earlier.
huex_ex <- tcga$huex$assay[unique(as.vector(symbs)), ]
huex_ex <- huex_ex + norm[2]
We can now calculate the corresponding interaction terms and predict the proliferation rates for all samples. Here, we will also remove internal controls that do not map to any TCGA samples.
huex_red <- t(huex_ex[symbs[,1], ] * huex_ex[symbs[,2], ])
colnames(huex_red) <- paste0(symbs[,1], "x", symbs[,2])
rates_huex <- predict(mod, huex_red, s="lambda.min")[,1]
controls <- is.na(tcga$huex$samples$tumor)
rates_huex <- rates_huex[!controls]
Genes in the RNA-Seq data are identified by ENSEMBL IDs already, so we can proceed straight to calculating the log expression values and applying the normalization.
rna_ex <- tcga$rnaseq$counts[unique(as.vector(genes)), ]
rna_ex <- log(rna_ex+1, 2)
rna_ex <- rna_ex * norm[1] + norm[2]
With that we can calculate the interaction terms and predict the proliferation rates.
rna_red <- t(rna_ex[genes[,1], ] * rna_ex[genes[,2], ])
colnames(rna_red) <- paste0(genes[,1], "x", genes[,2])
rates_rna <- predict(mod, rna_red, s="lambda.min")[,1]
Finally, we will create a prediction data set that combines all proliferation rate predictions together with the sample annotations.
pred <- data.table(
patient_barcode=c(tcga$rnaseq$samples$patient_barcode,
tcga$huex$samples$patient_barcode[!controls]),
panel=c(tcga$rnaseq$samples$panel, tcga$huex$samples$panel[!controls]),
rates=c(rates_rna, rates_huex),
tumor=c(tcga$rnaseq$samples$tumor, tcga$huex$samples$tumor[!controls])
)
This way we can easily check what percentage of proliferation rates was predicted as negative…
sum(pred$rates < 0)/nrow(pred)
## [1] 0.0176215
We can merge this data set with the clinical information for each patient now.
comb <- merge(pred, tcga$clinical, by=c("patient_barcode", "panel"))
print(comb)
## patient_barcode panel rates tumor
## 1: TCGA-02-0001 GBM 0.02420641 TRUE
## 2: TCGA-02-0003 GBM 0.03418276 TRUE
## 3: TCGA-02-0004 GBM 0.01989495 TRUE
## 4: TCGA-02-0006 GBM 0.02384004 TRUE
## 5: TCGA-02-0007 GBM 0.03056712 TRUE
## ---
## 12107: TCGA-ZS-A9CG LIHC 0.01406726 TRUE
## 12108: TCGA-ZT-A8OM THYM 0.03907078 TRUE
## 12109: TCGA-ZU-A8S4 CHOL 0.02606463 TRUE
## 12110: TCGA-ZU-A8S4 CHOL 0.02548018 FALSE
## 12111: TCGA-ZX-AA5X CESC 0.02485634 TRUE
## patient_uuid gender vital days_to_contact
## 1: 30a1fe5e-5b12-472c-aa86-c2db8167ab23 female Dead 279
## 2: df3c1d61-79c1-43e9-971a-8029497ffeab male Dead 144
## 3: 3ac8f759-2df0-4ef5-913f-ce05b8dd8937 male Dead 345
## 4: 98714d95-b62e-4f34-9cd1-91542da463eb female Dead 558
## 5: fb75eb9d-05be-4e6b-a515-f39efb15e66c female Dead 705
## ---
## 12107: 3f6a6891-c6d5-447b-ad32-325fd46ec32b male Alive 341
## 12108: d9816402-1073-4b19-ba16-1c7e7c00fbb3 female Alive 1398
## 12109: e76cf9a0-061e-4a7d-ba28-5a40b8da54b8 male Dead NA
## 12110: e76cf9a0-061e-4a7d-ba28-5a40b8da54b8 male Dead NA
## 12111: 4756acc0-4e96-44d4-b359-04d64dc7eb84 female Alive 119
## days_to_death days_to_birth histology
## 1: 358 -16179 Untreated primary (de novo) GBM
## 2: 144 -18341 Untreated primary (de novo) GBM
## 3: 345 -21617 Untreated primary (de novo) GBM
## 4: 558 -20516 Untreated primary (de novo) GBM
## 5: 705 -14806 Treated primary GBM
## ---
## 12107: NA -20100 Hepatocellular Carcinoma
## 12108: NA -26923 Thymoma; Type A
## 12109: 98 -19264 Cholangiocarcinoma; intrahepatic
## 12110: 98 -19264 Cholangiocarcinoma; intrahepatic
## 12111: NA -23440 Cervical Squamous Cell Carcinoma
## tissue_site stage T N M residual_tumor
## 1: Brain NA NA NA NA NA
## 2: Brain NA NA NA NA NA
## 3: Brain NA NA NA NA NA
## 4: Brain NA NA NA NA NA
## 5: Brain NA NA NA NA NA
## ---
## 12107: Liver Stage II T2 NX MX R0
## 12108: Anterior Mediastinum NA NA NA NA NA
## 12109: Bile duct Stage I T1 NX MX R0
## 12110: Bile duct Stage I T1 NX MX R0
## 12111: Cervical NA TX NX MX NA
## new_tumor_events follow_ups
## 1: 2 1
## 2: 2 1
## 3: 2 1
## 4: 2 1
## 5: 2 1
## ---
## 12107: 2 1
## 12108: 2 1
## 12109: 1 0
## 12110: 1 0
## 12111: 2 1
Finally we will save those data sets for later usage.
saveRDS(comb, "combined.rds")
write.csv(pred, "pred_rates.csv", row.names=F)
With the combined data set comb
at hand, most of the following analyses are straightforward. First we can visualize the overall distribution of proliferation rates across the TCGA cancer panels.
We will start by ordering the panels by their median proliferation rate,
means <- pred[, list(val=median(rates)), by=panel]
setkey(pred, panel)
pred <- pred[means[order(val), panel]]
pred[, panel := factor(panel, levels=unique(panel))]
followed by plotting them with ggplot2
.
ggplot(pred, aes(x=panel, y=rates, color=tumor, shape=tumor)) +
geom_jitter(alpha=0.2, height=0, size=1) + theme_bw() + theme(axis.text.x =
element_text(angle = 45, vjust = 1, hjust=1), legend.position="none") +
xlab("") + ylab("proliferation rate [1/h]") +
scale_colour_manual(values=c("royalblue", "red3"))
# Return panels to its original state
pred[, panel := as.character(panel)]
Here red triangles denote cancer samples and blue circles samples from healthy tissues.
We can also check how the proliferation rates from tumor and normal tissue samples behave.
ggplot(pred, aes(x=c("normal", "tumor")[tumor+1], y=rates, fill=tumor)) +
geom_boxplot() + xlab("") + ylab("proliferation rate [1/h]") + theme_bw() +
theme(legend.position="none")
Here, the fold-change between tumor and normal tissue is given by
pred[(tumor), mean(rates)] / pred[!(tumor), mean(rates)]
## [1] 1.748735
This difference can be tested with a Wilcoxon rank-sum test.
wilcox.test(pred[!(tumor), rates], pred[(tumor), rates], conf.int=TRUE)
##
## Wilcoxon rank sum test with continuity correction
##
## data: pred[!(tumor), rates] and pred[(tumor), rates]
## W = 2259800, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## -0.008912066 -0.007526171
## sample estimates:
## difference in location
## -0.008210097
First we will set up the times and censoring. The comb
data set includes the required information in the days_to_death
, days_to_contact
and vital
columns.
days_per_year <- 365.25
comb <- comb[!is.na(vital) & tumor]
delta <- ifelse(comb$vital == "Alive", comb$days_to_contact,
                comb$days_to_death) / days_per_year
status <- comb$vital == "Dead"
For visualization we will stratify the survival data into two groups. One including the top 25% of proliferation rates and the other containing the bottom 25%.
prolif <- vector(length=nrow(comb))
prolif[comb$rates > quantile(comb$rates, .75)] <- "high"
prolif[comb$rates < quantile(comb$rates, .25)] <- "low"
prolif <- factor(prolif, levels=c("low", "high"))
We can now generate the survival fit and create the corresponding Kaplan-Meier plot.
surv <- Surv(delta, status)
fit <- survfit(surv ~ prolif)
plot(fit, col=c("blue", "red"), xlab="time [years]", ylab="survival", lwd=2)
However, the proliferation rate is a continuous variable and we would like to evaluate its association to survival without creating artificial strata. For this we can use a Cox proportional hazards model.
coxm <- coxph(surv ~ comb$rates)
print(coxm)
## Call:
## coxph(formula = surv ~ comb$rates)
##
## coef exp(coef) se(coef) z p
## comb$rates 1.67e+01 1.75e+07 1.87e+00 8.93 <2e-16
##
## Likelihood ratio test=80.3 on 1 df, p=0
## n= 11337, number of events= 3836
## (63 observations deleted due to missingness)
We can use the parameter β to calculate the increase in risk when increasing the proliferation rate by a fixed value, for instance 0.01:
exp(coef(coxm) * 0.01)
## comb$rates
## 1.181475
So this would mean an increased risk of about 18%.
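Written out with the rounded coefficient from the model output above, this is
\[\exp(\beta \cdot \Delta r) = \exp(16.7 \times 0.01) \approx 1.18.\]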
We can also study how the tumor stage relates to the predicted proliferation rates. We will use the TNM staging system here. The pathological tumor stages are already contained in comb
, but they may include several subtypes. We will therefore clean them up a bit so that only the major tumor stages such as “T1”, “N2”, or “Stage III” remain.
comb[, T := gsub("[a-e][0-9]*$", "", T)]
comb[, N := gsub("[^0-4NX]", "", N)]
comb[, M := gsub("[^0-4MX]", "", M)]
comb[stage %in% c("I/II NOS", "IS"), stage := NA]
comb[, stage := gsub("[A-C]*$", "", stage)]
This enables us to visualize the proliferation rates across the TNM staging system.
x <- melt(comb[, .(rates, panel, T, N, M, stage)], id.vars=c("rates", "panel"))
ggplot(x, aes(x=value, y=rates, col=variable)) +
geom_violin(aes(fill=variable), scale="width", alpha=0.3, linetype=0) +
geom_boxplot(outlier.colour=NA, width=0.5) +
facet_wrap(~ variable, scales="free_x", nrow=1) + theme_bw() +
theme(axis.text.x = element_text(angle = 45, vjust = 1,
hjust=1), legend.position="none", strip.text = element_blank()) +
ylab("proliferation rate [1/h]") + xlab("")
As we can see, there is again a lot of heterogeneity. We can test for associations with the non-parametric Kruskal-Wallis test.
kw_tests <- x[, kruskal.test(rates, factor(value)), by=variable]
print(kw_tests)
## variable statistic parameter p.value method
## 1: T 194.685979 6 2.564358e-39 Kruskal-Wallis rank sum test
## 2: N 147.075919 4 8.614847e-31 Kruskal-Wallis rank sum test
## 3: M 1.720631 2 4.230285e-01 Kruskal-Wallis rank sum test
## 4: stage 184.017226 5 7.418813e-38 Kruskal-Wallis rank sum test
## data.name
## 1: rates and factor(value)
## 2: rates and factor(value)
## 3: rates and factor(value)
## 4: rates and factor(value)
In order to predict fluxes based on the predicted proliferation rate and cancer panel we need a cancer-specific metabolic model that includes an objective for proliferation (growth). There are only a few reconstructions that fulfill that requirement. Here, we use reconstructions from the Metabolic Atlas for 9 of the TCGA cancer panels, which have been validated before (http://dx.doi.org/10.1073/pnas.1319196111). The mapping between panels and cancer models is described in tissues.csv
:
tissues <- fread("tissues.csv")
print(tissues)
## name panel tissue
## 1: Bladder Urothelial Carcinoma BLCA urothelial
## 2: Breast invasive carcinoma BRCA breast
## 3: Cholangiocarcinoma CHOL liver
## 4: Kidney Chromophobe KICH kidney
## 5: Kidney renal clear cell carcinoma KIRC kidney
## 6: Kidney renal papillary cell carcinoma KIRP kidney
## 7: Liver hepatocellular carcinoma LIHC liver
## 8: Lung adenocarcinoma LUAD lung
## 9: Lung squamous cell carcinoma LUSC lung
## url
## 1: http://www.metabolicatlas.org/assets/initcan/r3/urothelial_cancer_-__3.xml-b391ad03203294d3a420001c843d2fce.zip
## 2: http://www.metabolicatlas.org/assets/initcan/r3/breast_cancer_-__3.xml-4297bbc8636e6b22818d180b44898384.zip
## 3: http://www.metabolicatlas.org/assets/initcan/r3/liver_cancer_-__3.xml-07bff95988866a0e32afc4a0f072ebc9.zip
## 4: http://www.metabolicatlas.org/assets/initcan/r3/renal_cancer_-__3.xml-bb999a8ccff9ebd687264b98d821622b.zip
## 5: http://www.metabolicatlas.org/assets/initcan/r3/renal_cancer_-__3.xml-bb999a8ccff9ebd687264b98d821622b.zip
## 6: http://www.metabolicatlas.org/assets/initcan/r3/renal_cancer_-__3.xml-bb999a8ccff9ebd687264b98d821622b.zip
## 7: http://www.metabolicatlas.org/assets/initcan/r3/liver_cancer_-__3.xml-07bff95988866a0e32afc4a0f072ebc9.zip
## 8: http://www.metabolicatlas.org/assets/initcan/r3/lung_cancer_-__3.xml-bf4255bf7e786330b1e66238ae41b966.zip
## 9: http://www.metabolicatlas.org/assets/initcan/r3/lung_cancer_-__3.xml-bf4255bf7e786330b1e66238ae41b966.zip
For each cancer panel we will now use the corresponding model and solve the linear programming problem posed by parsimonious FBA for every cancer sample in the panel with a non-zero proliferation rate. Fluxes are denoted by \(v_i\), the stoichiometric matrix of the model by \(\mathbf{S}\), the proliferation objective in the model by \(v_p\) and the predicted proliferation rate by \(r_p\). Models are converted to the irreversible formulation beforehand.
\[\begin{aligned} &\min_{\mathbf{v}} \sum_i v_i\\ \text{s.t.}\quad& \mathbf{Sv} = 0\\ & v_p = r_p\\ & v_i \geq 0 \end{aligned}\]
For that we use cobrapy and a warm-start strategy where the previous solution basis is recycled to yield faster convergence of the solver. All of this is implemented in the fluxes.py
script which will use the predicted rates saved earlier.
system2("python3", "fluxes.py")
Note that you may have to substitute python3
with python
depending on your Python installation.
The resulting fluxes are now saved in fluxes.csv
, which for convenience only includes fluxes that were non-zero in at least one sample.
First we will create a mapping between patient barcodes and cancer panels.
panels <- pred$panel
names(panels) <- pred$patient_barcode
panels <- sort(panels)
Now we can read the fluxes and convert them to a matrix.
fluxes <- fread("fluxes.csv")
barcodes <- fluxes$V1
fluxes <- as.matrix(fluxes[, V1 := NULL])
rownames(fluxes) <- barcodes
For later analysis we would also like to group fluxes by their corresponding metabolic pathways. The fluxes.py
script also extracts that information from the model and saves it as “subsystem” in the flux_info.csv
file.
info <- fread("flux_info.csv")
names(info) <- c("reaction", "subsystem")
head(info)
## reaction subsystem
## 1: CancerBiomass_Ex Artificial reactions
## 2: CancerBiomass_OF Artificial reactions
## 3: HMR_0002 Acylglycerides metabolism
## 4: HMR_0003 Acylglycerides metabolism
## 5: HMR_0004 Transport, extracellular
## 6: HMR_0005 Acylglycerides metabolism
We can visualize the resulting fluxes with a heatmap.
cols <- viridis::viridis(256)
panels <- panels[!duplicated(names(panels))]
fluxes <- fluxes[order(panels[rownames(fluxes)], decreasing=TRUE), ]
in_fluxes <- names(panels) %in% rownames(fluxes)
annrow <- data.frame(panel=panels[in_fluxes], row.names=names(panels)[in_fluxes])
anncolors <- scales::hue_pal()(9)
names(anncolors) <- levels(annrow$panel)
anncolors <- list(panel=anncolors)
s <- seq(-16, log(max(fluxes)+1e-16,2), length.out = 256)
pheatmap(fluxes, breaks=c(-1e-6,2^s), col=cols, show_rownames=F,
show_colnames=F, annotation_row=annrow, annotation_colors=anncolors,
cluster_rows=FALSE, border_color=NA)
In order to see major metabolic properties of the studied panels we will also visualize the fluxes across the three major pathways of central carbon metabolism. We will start by transforming our flux matrix into a data.table that can be passed to ggplot.
dt <- as.data.table(fluxes)
dt$panel <- panels[rownames(fluxes)]
dt <- melt(dt, id.vars = "panel")
names(dt) <- c("panel", "reaction", "flux")
dt <- merge(dt, info, by = "reaction")
This is followed by filtering the subsystems for glycolysis, oxidative phosphorylation, and the TCA cycle and plotting the results.
pathways <- c("Glycolysis / Gluconeogenesis", "Oxidative phosphorylation",
"Tricarboxylic acid cycle and glyoxylate/dicarboxylate metabolism")
ggplot(dt[subsystem %in% pathways], aes(x = panel, y = flux, color = panel)) +
geom_jitter(height = 0, alpha = 0.05, size = 0.5) + facet_wrap(~ subsystem) +
theme_bw() + theme(axis.text.x = element_text(angle = 45, vjust = 1,
hjust = 1), legend.position = "none") + xlab("")
Based on the predicted fluxes we can now calculate a specificity score \(s^i_p\) for each flux \(v_i\) and panel p as the log-fold change of the flux within the panel versus all other panels. Here \(\mu^i_p\) denotes the mean value of the flux \(v_i\) within the panel p and \(\mu^i_o\) outside the panel p.
\[s^i_p = \log_2 \mu^i_p - \log_2 \mu^i_o\]
A value of zero denotes homogeneous fluxes across all panels, a value larger than zero denotes fluxes that are up-regulated in a particular panel, and a value smaller than zero fluxes that are down-regulated in that panel. The calculation is implemented in the prtools
package.
lfcs <- panel_lfc(fluxes, panels, info[, list(subsystem)])
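For clarity, the score defined above can also be computed directly. The sketch below illustrates the idea; it adds a small pseudo-count (here 1e-6, an assumption) to avoid taking the log of zero fluxes and omits the subsystem annotation that panel_lfc attaches:
panel_lfc_sketch <- function(fluxes, panels, pseudo=1e-6) {
    pnl <- panels[rownames(fluxes)]                      # panel of each sample (row)
    rbindlist(lapply(unique(pnl), function(p) {
        mu_p <- colMeans(fluxes[pnl == p, , drop=FALSE]) # mean flux inside the panel
        mu_o <- colMeans(fluxes[pnl != p, , drop=FALSE]) # mean flux outside the panel
        data.table(rid=colnames(fluxes), panel=p,
                   lfc=log(mu_p + pseudo, 2) - log(mu_o + pseudo, 2))
    }))
}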
We will order those by their absolute value from largest to smallest.
lfcs <- lfcs[order(-abs(lfc))]
lfcs
## rid panel lfc
## 1: HMR_4855 BLCA 1.544089e+01
## 2: HMR_4354 KIRC -1.445816e+01
## 3: HMR_4663 KIRC -1.443521e+01
## 4: newExRxn_39_reverse KIRC -1.443521e+01
## 5: newExRxn_41 KIRC -1.443521e+01
## ---
## 9239: HMR_2697_reverse CHOL -4.838653e-03
## 9240: HMR_3414 CHOL -4.838653e-03
## 9241: HMR_4874 CHOL -3.940913e-03
## 9242: HMR_3984 LIHC 3.425966e-05
## 9243: HMR_4900 LIHC 3.425966e-05
## subsystem
## 1: Transport, mitochondrial
## 2: Pentose phosphate pathway
## 3: Purine metabolism
## 4: CORE_ExchangeReactions
## 5: CORE_ExchangeReactions
## ---
## 9239: Carnitine shuttle (mitochondrial)
## 9240: Beta oxidation of even-chain fatty acids (mitochondrial)
## 9241: Transport, mitochondrial
## 9242: Miscellaneous
## 9243: Transport, peroxisomal
We can now visualize those changes across all metabolic pathways.
lfcs[, panel := factor(panel)]
lab <- function(x) lapply(x, shorten, n=20)
lfc_plot <- ggplot(lfcs, aes(x=panel, y=lfc, col=panel)) +
geom_hline(yintercept=0, linetype="dashed") +
geom_boxplot(outlier.colour=NA) + geom_jitter(alpha=0.5) +
theme_bw() + theme(axis.text.x=element_text(angle = 45, vjust = 1, hjust=1),
legend.position="none") + facet_wrap(~subsystem, labeller=lab, nrow=12, ncol=6) +
ylab("specificity score")
print(lfc_plot)
Those are a lot of plots, and we would like a more quantitative evaluation of which pathways are the most specific or most homogeneous ones. This can be obtained from an enrichment analysis. Here we use a method similar to GSEA that calculates normalized enrichment scores (NES) and p-values for each pathway.
Here, a value larger than one denotes association with specific fluxes and a value smaller than one association with homogeneous fluxes.
pws <- lfcs$subsystem
enr <- sapply(unique(pws), NES, w=lfcs$lfc, pws=pws)
enr <- data.table(subsystem=colnames(enr), nes=enr[1,], p=enr[2,])
enr <- enr[order(nes)]
enr[, subsystem := factor(subsystem, levels=subsystem)]
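The NES function used above is provided by prtools and takes the same arguments as in the sapply call (the pathway name, the scores w and the pathway assignments pws), returning the NES and a p-value. Its exact implementation is not reproduced here, but a permutation-based score of this general kind can be sketched as the mean absolute specificity score within a pathway relative to its expectation over random reaction sets of the same size; this sketch is an assumption meant only to illustrate the idea:
nes_sketch <- function(pathway, w, pws, nperm=1000) {
    inp <- pws == pathway
    obs <- mean(abs(w[inp]))                                 # mean |score| within the pathway
    perm <- replicate(nperm, mean(abs(sample(w, sum(inp))))) # random sets of the same size
    c(nes = obs / mean(perm),                                # ratio to the permutation mean
      p = (sum(perm >= obs) + 1) / (nperm + 1))              # permutation p-value
}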
For panel-specific pathways we now obtain the following:
enr_plot <- ggplot(enr, aes(x=nes, y=subsystem, col=p)) +
geom_vline(xintercept=1, linetype="dashed") +
geom_point() + scale_colour_continuous(low="red", high="black") +
theme_bw() + scale_y_discrete(labels=shorten) +
theme(legend.position=c(0.2,0.8)) + xlab("enrichment score") + ylab("")
print(enr_plot %+% enr[nes > 1])
Here we see that there is a group of three pathways that set themselves apart at the top. We can visualize only those again:
best <- rev(enr$subsystem)[1:3]
print(lfc_plot %+% lfcs[subsystem %in% best])
Finally we can also visualize homogeneous pathways.
print(enr_plot %+% enr[nes <= 1])
sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 8 (jessie)
##
## locale:
## [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C
## [3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8
## [5] LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8
## [7] LC_PAPER=en_US.utf8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel methods stats graphics grDevices utils
## [8] datasets base
##
## other attached packages:
## [1] tcgar_0.3.0 survival_2.40-1 pheatmap_1.0.8
## [4] oligo_1.38.0 Biostrings_2.42.0 XVector_0.14.0
## [7] IRanges_2.8.1 S4Vectors_0.12.0 Biobase_2.34.0
## [10] oligoClasses_1.36.0 BiocGenerics_0.20.0 glmnet_2.0-5
## [13] Matrix_1.2-7.1 ggplot2_2.2.0 doMC_1.3.4
## [16] iterators_1.0.8 foreach_1.4.3 data.table_1.9.6
## [19] biomaRt_2.30.0 prtools_0.3.2 RcppProgress_0.2.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.7 lattice_0.20-34
## [3] assertthat_0.1 digest_0.6.10
## [5] GenomeInfoDb_1.10.1 plyr_1.8.4
## [7] chron_2.3-47 RSQLite_1.0.0
## [9] evaluate_0.10 highr_0.6
## [11] BiocInstaller_1.24.0 zlibbioc_1.20.0
## [13] lazyeval_0.2.0 curl_2.2
## [15] R.utils_2.5.0 R.oo_1.21.0
## [17] preprocessCore_1.36.0 rmarkdown_1.1
## [19] labeling_0.3 splines_3.3.2
## [21] stringr_1.1.0 RCurl_1.95-4.8
## [23] bit_1.1-12 munsell_0.4.3
## [25] compiler_3.3.2 htmltools_0.3.5
## [27] SummarizedExperiment_1.4.0 gridExtra_2.2.1
## [29] tibble_1.2 ff_2.2-13
## [31] codetools_0.2-15 XML_3.98-1.5
## [33] bitops_1.0-6 R.methodsS3_1.7.1
## [35] grid_3.3.2 jsonlite_1.1
## [37] gtable_0.2.0 DBI_0.5-1
## [39] magrittr_1.5 scales_0.4.1
## [41] stringi_1.1.2 pbapply_1.3-1
## [43] reshape2_1.4.2 viridis_0.3.4
## [45] affyio_1.44.0 xml2_1.0.0
## [47] RColorBrewer_1.1-2 tools_3.3.2
## [49] yaml_2.1.14 AnnotationDbi_1.36.0
## [51] colorspace_1.3-0 GenomicRanges_1.26.1
## [53] affxparser_1.46.0 knitr_1.15
python3 -V
## Python 3.4.5 :: Anaconda custom (64-bit)
The only Python packages used directly are cobra and pandas:
python3 -c "import cobra; print('cobra', cobra.__version__)"
python3 -c "import pandas; print('pandas', pandas.__version__)"
## cobra 0.5.4
## pandas 0.19.1