Proteogenomics is a multi-omics research approach that sheds new light on cancer development. Integration of genomic, transcriptomic, and proteomic information can lead to a deeper understanding of the complex processes related to cancer, as well as to the development of novel biomarkers and therapeutics.
Since 2006, DREAM Challenges have been posing fundamental questions in systems biology and translational medicine [1]. As opposed to popular machine learning competitions on Kaggle [2], the DREAM Challenges do not end with establishing the final leaderboard of participants. Instead, there is usually a follow-up community phase, in which the top teams work together on combining and improving their models and use them to gain scientific insights into the addressed biological problem. The final results are then published in prestigious scientific journals [3].
The NCI-CPTAC DREAM Proteogenomics Challenge [4] was aimed at predicting protein concentrations in cancer tissues. Characterization and analysis of alterations in the proteome can shed more light on cancer biology and may improve the development of both biomarkers and therapeutics. As measuring proteomic data is both difficult and expensive, the challenge objective was to develop machine learning models for predicting such information from genomic and transcriptomic data.
The challenge consisted of three sub-challenges; Ardigen's team took part in the second and third, finishing in fifth and second place, respectively. The objective was to predict protein (sub-challenge 2) and phosphoprotein (sub-challenge 3) abundances based on genomic (copy number alterations, CNA) and transcriptomic (mRNA expression) data, and for sub-challenge 3 also on proteomic data. Here, we describe our solution with an emphasis on sub-challenge 2, where the approach is directly applicable.
The challenge data was provided by the Clinical Proteomic Tumor Analysis Consortium (CPTAC) [5] and was measured for ovarian and breast cancer. The proteomic data was measured by two laboratories: Johns Hopkins University (JHU) and Pacific Northwest National Laboratory (PNNL) [6]. The median Pearson correlation of paired protein measurements from the same sample, measured in the two laboratories, was 0.69, which already shows how difficult it is to measure the proteome. The challenge dataset was high-dimensional, with around 100 input observations (samples) and typically 10 000 – 35 000 features (CNAs, gene expressions, protein abundances); see [4] for details. In both sub-challenges 2 and 3 the task was to predict around 1 000 – 15 000 target variables, and the metric used to measure performance was the average Pearson correlation between predicted and measured (phospho)protein abundances.
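For concreteness, the following is a minimal sketch of how such a per-protein metric can be computed; the NaN masking of missing measurements is our assumption about how incomplete columns are handled.

```python
import numpy as np
from scipy.stats import pearsonr

def mean_pearson_score(y_true, y_pred):
    """Average Pearson correlation across target columns (proteins).

    y_true, y_pred: arrays of shape (n_samples, n_proteins); columns with
    missing measurements are handled here by NaN masking (an assumption).
    """
    scores = []
    for j in range(y_true.shape[1]):
        mask = ~np.isnan(y_true[:, j]) & ~np.isnan(y_pred[:, j])
        if mask.sum() > 2:
            r, _ = pearsonr(y_true[mask, j], y_pred[mask, j])
            scores.append(r)
    return float(np.mean(scores))
```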
Ardigen's team consisted of four people, who repeatedly discussed the strategy for model development. Specifically: Jan Kaczmarczyk proposed the general architecture of the models and implemented them, Łukasz Maziarka suggested important improvements to the models, Piotr Stępniak provided bioinformatics support, and Michał Warchoł guided model design.
At the beginning of the challenge we tested standard approaches to high-dimensional data, including the lasso and relaxed lasso methods [7], as well as a random forest model with feature elimination (e.g. keeping the 100 most important features after the first fit). The cross-validation scores of these methods were slightly below the 0.38 mean Spearman's correlation between mRNA and protein abundances observed for ovarian cancer [6]. However, when submitted to the challenge and evaluated on a prospective dataset, the scores of these models were much lower. After submitting a baseline model built on a single variable (mRNA or, in its absence, the variable with the highest correlation to the target protein) and observing that it scored 0.38 (in sub-challenge 2), we decided to develop models using forward feature selection, restricting ourselves to a small number of predictors.
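A minimal sketch of such a single-variable baseline is shown below; the fallback rule based on absolute correlation and the function interface are illustrative, not the exact challenge code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def baseline_prediction(X_train, y_train, X_test, mrna_col=None):
    """Single-variable baseline: use the protein's own mRNA column if given,
    otherwise fall back to the feature most correlated with the target."""
    if mrna_col is None:
        # pick the feature with the highest absolute correlation to the target
        corrs = [abs(np.corrcoef(X_train[:, j], y_train)[0, 1])
                 for j in range(X_train.shape[1])]
        mrna_col = int(np.nanargmax(corrs))
    model = LinearRegression().fit(X_train[:, [mrna_col]], y_train)
    return model.predict(X_test[:, [mrna_col]])
```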
In the second step we tested models with engineered features: aggregated variables relating to gene sets provided in the ovarian cancer study [6] and those from the MSigDB Collections [8]. These models performed well locally (on a holdout set); however, they gave significantly lower scores (0.41 in sub-challenge 2) when submitted to the challenge.
We concluded that generalization is an issue in this competition and that batch effects play a role. Moreover, the generalization properties of the developed models depend significantly on the type of variables on which the models are built. We addressed this issue explicitly in our final solution by using an auxiliary model to predict how well selected models generalize: we trained models (on one or two variables) using PNNL data and tested them on JHU data of the ovarian cancer dataset in sub-challenge 2. These models enabled us to predict how well the starting model in our forward feature selection algorithm would generalize (i.e. how robust it is to batch effects) and to select the best one among a few such starting models. Afterwards, we added variables to the chosen starting model in a standard manner (via forward feature selection). Our final models scored 0.46 (for ovarian cancer) and 0.42 (for breast cancer) in sub-challenge 2, and 0.29 (for ovarian cancer) and 0.36 (for breast cancer) in sub-challenge 3. These scores gave us the fifth place in sub-challenge 2 and the second place in sub-challenge 3.
We use a two-stage hierarchical forward feature selection algorithm, which in the first stage gives preference to variables showing good generalization properties as determined by an auxiliary model. In stage 1 we choose the single, ‘most promising’ model among models built using linear regression on the following sets of 1-2 variables: mRNA, CNA, (mRNA, CNA), hc: TFs, hc: all. Here, mRNA and CNA denote the features relating to the considered protein (i.e. with the same GeneID), the only set with two variables is (mRNA, CNA), and by hc we mean the variable with the highest correlation to the target variable within a given set of variables. In particular, hc: TFs denotes the variable chosen among the mRNA and CNA of human transcription factors (TFs) [9], whereas hc: all denotes the variable chosen among all variables. From the hc: TFs and hc: all candidate feature sets we remove the mRNA and CNA relating to the considered protein. We use TFs as a separate set of variables because limiting ourselves to a relevant subset of the original features should decrease noise. A sketch of how these candidate sets can be constructed is shown after Fig. 1.
Fig. 1. Model architecture
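As an illustration of the candidate sets used in stage 1, the sketch below constructs them for one target protein; the argument names and the use of absolute correlation to pick the hc variables are our assumptions.

```python
import numpy as np

def stage1_candidate_sets(X, y, col_gene_ids, col_types, target_gene_id, tf_gene_ids):
    """Sketch of the stage-1 candidate feature sets for one target protein.

    X            : (n_samples, n_features) matrix of mRNA and CNA features
    col_gene_ids : GeneID of each column
    col_types    : 'mRNA' or 'CNA' for each column
    target_gene_id, tf_gene_ids : GeneID of the target protein and of the
        transcription factors (argument names are illustrative).
    """
    col_gene_ids = np.asarray(col_gene_ids)
    col_types = np.asarray(col_types)

    own = col_gene_ids == target_gene_id
    mrna = np.where(own & (col_types == 'mRNA'))[0].tolist()
    cna = np.where(own & (col_types == 'CNA'))[0].tolist()

    def highest_corr(mask):
        # index of the feature most correlated (in absolute value) with the
        # target, excluding the protein's own mRNA/CNA columns
        corrs = np.full(X.shape[1], np.nan)
        for j in np.where(mask & ~own)[0]:
            corrs[j] = abs(np.corrcoef(X[:, j], y)[0, 1])
        return [int(np.nanargmax(corrs))]

    return {
        'mRNA': mrna,
        'CNA': cna,
        '(mRNA, CNA)': mrna + cna,
        'hc: TFs': highest_corr(np.isin(col_gene_ids, tf_gene_ids)),
        'hc: all': highest_corr(np.ones(X.shape[1], dtype=bool)),
    }
```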
In stage 2 we fit a separate model to the residuals of the model obtained in stage 1. To this end, we use a forward feature selection procedure. For the early-stopping criterion we consider the change in the cross-validation (CV) score when adding a variable. We use a customized cross-validation procedure with 5 folds and a holdout set. The holdout set (with ~20% of the data) is used to obtain a more realistic (not overfit) estimate of model performance. Additionally, we impose a barrier, b, on enlarging the model by the variable(s) being added. If the considered variable(s) does not improve the CV score by at least b = 0.06 (for ovarian cancer) or b = 0.07 (for breast cancer), it is rejected and the procedure stops. The values b = 0.06 and 0.07 were determined by optimizing the holdout set score. Note that the feature with the highest correlation to the residuals of the current model is chosen per train-test split in the cross-validation procedure, and it can even be a different feature for each split.
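The following is a simplified sketch of the stage-2 procedure. For readability it selects the candidate feature on the full training data rather than per train-test split, and the max_features cap is our addition, not part of the original description.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def stage2_forward_selection(X, y, base_cols, barrier=0.06, cv=5, max_features=5):
    """Greedily add the feature most correlated with the residuals of the
    current model; keep it only if the mean CV Pearson correlation improves
    by at least `barrier`, otherwise stop."""
    def cv_pearson(cols):
        scores = []
        for tr, te in KFold(n_splits=cv, shuffle=True, random_state=0).split(X):
            model = LinearRegression().fit(X[tr][:, cols], y[tr])
            pred = model.predict(X[te][:, cols])
            scores.append(np.corrcoef(y[te], pred)[0, 1])
        return float(np.mean(scores))

    cols = list(base_cols)
    current_score = cv_pearson(cols)
    while len(cols) < max_features:
        residuals = y - LinearRegression().fit(X[:, cols], y).predict(X[:, cols])
        corrs = np.abs([np.corrcoef(X[:, j], residuals)[0, 1]
                        for j in range(X.shape[1])])
        candidate = int(np.nanargmax(corrs))
        if candidate in cols:
            break
        new_score = cv_pearson(cols + [candidate])
        if new_score - current_score < barrier:
            break  # the barrier b rejects the variable and stops the procedure
        cols.append(candidate)
        current_score = new_score
    return cols, current_score
```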
In stage 1 of our algorithm we choose the best model among several candidate models. Typically the cross-validation score (or holdout set score) can be used for such a task. However, as generalization was an issue in this challenge, we used an auxiliary model for this purpose. Namely, for all proteins in the ovarian dataset of sub-challenge 2 we train the models considered in stage 1 on the PNNL data and test them on the JHU data (cf. Fig. 2).
Fig. 2. Validation score on JHU data versus cross-validation score on PNNL data for stage 1 models trained on mRNA.
The relation between the CV score on PNNL data and the validation score on JHU data is a measure of how well selected models generalize. We then fit a cubic spline with 4 knots (extrapolated linearly above the fourth knot due to oscillatory behavior in this regime) to the (CV score, validation score) data and use it for predicting the validation score of selected models in stage 1. In Figure 3 we show cubic splines obtained in this manner for the models considered in stage 1.
Fig. 3. Cubic spline fits to the (CV score, validation score) data for models considered in stage 1.
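A possible implementation of the auxiliary model is sketched below; the placement of the knots at quantiles of the CV scores and the exact form of the linear extrapolation are our assumptions.

```python
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

def fit_generalization_curve(cv_scores, val_scores, n_knots=4):
    """Fit a cubic spline to (CV score on PNNL, validation score on JHU)
    pairs and extrapolate linearly above the last knot to avoid the
    oscillatory behavior of the spline in that regime."""
    order = np.argsort(cv_scores)
    x, y = np.asarray(cv_scores)[order], np.asarray(val_scores)[order]
    knots = np.quantile(x, np.linspace(0.2, 0.8, n_knots))  # assumed placement
    spline = LSQUnivariateSpline(x, y, t=knots, k=3)
    x_max = knots[-1]
    slope = float(spline.derivative()(x_max))

    def predict(cv_score):
        if cv_score <= x_max:
            return float(spline(cv_score))
        # linear extrapolation above the last knot
        return float(spline(x_max)) + slope * (cv_score - x_max)
    return predict
```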
We observe that models built on mRNA or (mRNA, CNA) show the best generalization properties, followed by models built on CNA, and then by the hc: all and hc: TFs models. Explicitly, in stage 1 we take the cross-validation score of a candidate model as the predictor in the auxiliary model and thereby obtain a predicted validation score, which is then used to select the best model.
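Combining the two ingredients, stage 1 could look roughly as follows; the generalization_curves interface (one fitted curve per candidate type, as in Fig. 3) is illustrative.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

# Pearson correlation as a scikit-learn scorer
pearson_scorer = make_scorer(lambda y_true, y_pred: pearsonr(y_true, y_pred)[0])

def choose_stage1_model(X, y, candidate_sets, generalization_curves, cv=5):
    """For each 1-2 variable candidate, compute the cross-validated Pearson
    correlation, translate it into a predicted validation score via the
    candidate's generalization curve, and keep the best candidate.

    candidate_sets:        dict name -> list of column indices,
                           e.g. {'mRNA': [i], 'CNA': [j], '(mRNA, CNA)': [i, j], ...}
    generalization_curves: dict name -> callable mapping CV score to predicted
                           validation score (hypothetical interface).
    """
    best = None
    for name, cols in candidate_sets.items():
        cv_score = cross_val_score(LinearRegression(), X[:, cols], y,
                                   scoring=pearson_scorer, cv=cv).mean()
        predicted_validation = generalization_curves[name](cv_score)
        if best is None or predicted_validation > best[1]:
            best = (name, predicted_validation, cols)
    return best
```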
For all our models in sub-challenge 2 we use (unregularized) linear regression, whereas in sub-challenge 3 we use least angle regression (LARS), as implemented in the LarsCV class of the scikit-learn package in Python.
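For reference, a minimal usage sketch of LarsCV on illustrative data (the number of folds matches our 5-fold setup; everything else is a placeholder):

```python
import numpy as np
from sklearn.linear_model import LarsCV

# Illustrative data standing in for the selected predictor columns of one phosphoprotein
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(80, 3)), rng.normal(size=(20, 3))
y_train = X_train @ np.array([1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=80)

# Cross-validated least angle regression
lars = LarsCV(cv=5).fit(X_train, y_train)
y_pred = lars.predict(X_test)
```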
For imputation we use mean imputation. We also tested several transformations of the predictors, of which X → log(X + B) and (X, Y) → (X, 2Y) are included in our final ensemble of models. As the final solution we use an ensemble of models over different fold splits and (in sub-challenge 2) over the above variable transformations.
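A short sketch of these preprocessing steps; the value of the offset B is not specified in the text, so the constant below is a placeholder, and the data is illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative feature matrix with ~10% missing values
rng = np.random.default_rng(0)
X = np.where(rng.random((100, 5)) < 0.1, np.nan, rng.random((100, 5)))

# Mean imputation of missing feature values
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)

# Transformations included in some ensemble members
B = 1.0                                   # placeholder offset (assumption)
X_log = np.log(X_imputed + B)             # X -> log(X + B)

# For two-variable models: (X, Y) -> (X, 2Y), rescaling the second predictor
x1, x2 = X_imputed[:, 0], X_imputed[:, 1]
X_pair = np.column_stack([x1, 2 * x2])
```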
In our final submission we use an ensemble of 15-65 models per (phospho)protein for both the ovarian and breast cancer datasets. The models within an ensemble differ by the train-test splits of the cross-validation procedure (we keep the 5 models trained on the individual train-test splits instead of retraining the model on the entire data), and in sub-challenge 2 also by the variable transformations used. The gain from ensembling is around 0.01 in sub-challenge 2 and 0.04 in sub-challenge 3, as determined on a holdout set.
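The fold-split part of the ensembling can be sketched as follows; the number of repeated splits is illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def ensemble_predict(X_train, y_train, X_new, cols, n_splits=5, n_repeats=3):
    """Keep the models trained on each CV training fold (over several random
    fold splits) instead of refitting on all data, and average their
    predictions; `n_repeats` is illustrative."""
    preds = []
    for seed in range(n_repeats):
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
        for tr, _ in kf.split(X_train):
            model = LinearRegression().fit(X_train[tr][:, cols], y_train[tr])
            preds.append(model.predict(X_new[:, cols]))
    return np.mean(preds, axis=0)
```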
In an early stage of the challenge we observed that the submitted models differed significantly in how well they generalized to the online-evaluation data. To address the issue of model generalization we used high-bias models (linear regression and least angle regression) augmented with auxiliary models that predict the generalization properties of selected starting models. Construction of such auxiliary models was possible because sub-challenge 2 contained data measured by two laboratories, which enabled us to quantify the generalization of selected models. In sub-challenge 2 we also used transformations of variables. We used ensembling in both sub-challenges; however, the improvement it provides is much larger in sub-challenge 3.
As our solution took second place in sub-challenge 3, we were invited to the community phase, in which we worked on a joint, improved solution with the winning team: Hongyang Li and Yuanfang Guan from the University of Michigan. As a result, the performance of the models was significantly improved (from a mean Pearson correlation of 0.33 to 0.37 for ovarian cancer, and from 0.42 to 0.48 for breast cancer). At the time of writing, the organizers of the challenge and the top teams of sub-challenges 2 and 3 are finalizing a manuscript on the results of the sub-challenges, which is to be submitted to Nature Methods. More details on the collaboration can be found in [10].
[2] www.kaggle.com
[3] dreamchallenges.org/publications
[4] NCI-CPTAC DREAM Proteogenomics Challenge, www.synapse.org/#!Synapse:syn8228304/wiki/413428
[5] Clinical Proteomic Tumor Analysis Consortium (CPTAC), proteomics.cancer.gov/programs/cptac
[6] Zhang H, Liu T, Zhang Z, Payne SH, Zhang B, McDermott JE, et al., Integrated Proteogenomic Characterization of Human High-Grade Serous Ovarian Cancer, Cell 2016;166: 755–765. doi:10.1016/j.cell.2016.05.069
[7] Hastie T, Tibshirani R, Tibshirani RJ, Extended Comparisons of Best Subset Selection, Forward Stepwise Selection, and the Lasso, arxiv.org/abs/1707.08692
[8] MSigDB Collections, software.broadinstitute.org/gsea/msigdb/collections.jsp
[9] A resource for Human, Mouse and Rat Transcription Factors, www.tfcheckpoint.org/index.php/introduction
[10] Best Performers Announced for the NCI-CPTAC DREAM Proteogenomics Computational Challenge, proteomics.cancer.gov/news_and_announcements/best-performers-announced-nci-cptac-dream-proteogenomics-computational