Breast Cancer Challenge: Team “PittTransMed” places second for Metabric phase of the Challenge

Please join all of us at Sage Bionetworks and DREAM in congratulating Chunhui Cai and the entire PittTransMed team for being the second highest scoring team for the Metabric phase of our Breast Cancer Challenge! Below you can read about Chunhui’s model (Syn ID#1443133). For this top performance, Chunhui and his team have been invited to speak at the upcoming DREAM conference taking place in San Francisco (Nov 12-16).

Identify Informative Modular Features for Predicting Cancer Clinical Outcomes

 Songjian Lu, Chunhui Cai, Hatice Ulku Osmanbeyoglu, Lujia Chen, Roger Day, Gregory Cooper, and Xinghua Lu

PittTransMed Team

Department of Biomedical Informatics

University of Pittsburgh

An important task of translational cancer genomics is to identify molecular mechanisms underlying the heterogeneity of patient prognosis and responses to treatments. The large number of breast cancer samples molecularly characterized by TCGA provides a unique opportunity to study perturbed signal transduction pathways that are determinative of clinical outcome and drug responses. Our approach to revealing perturbed signaling pathways is based on the following assumption: if a module of genes participates in coherently related biological processes and is co-expressed in a subset of tumors, the module is likely regulated by a specific cellular signal that is perturbed in that subset of tumors. We have developed a novel bi-clustering approach that unifies knowledge mining and data mining to identify gene modules and corresponding tumor subsets.

From each TCGA breast cancer tumor, we first identified all genes that were differentially expressed, i.e., those showing a 3-fold increase or decrease relative to the median of expression values from normal samples. Since a tumor always results from perturbations of multiple signaling pathways [1], the complete list of differentially expressed genes from a tumor inevitably reflects a mixture of genes responding to distinct signals. In order to deconvolve the signals, we developed an ontology-guided, semantic-driven approach [2] to group differentially expressed genes into non-disjoint subsets. Each subset contains genes that participate in coherently related biological processes [3] that can be summarized by an informative GO term. Having conceptualized the gene expression data from each tumor in this way, we then pooled the genes summarized by a common GO term across all tumor samples to construct a seed gene set.
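The per-tumor differential expression call described above can be sketched as follows. This is a minimal illustration, not the team's code: it assumes expression values are on a linear (non-log) scale and strictly positive, and the function name `differentially_expressed` and the toy data are hypothetical.

```python
from statistics import median

def differentially_expressed(tumor, normals, fold=3.0):
    """Return {gene: +1 or -1} for genes changed at least `fold`-fold vs. the normal median.

    tumor:   {gene: expression} for one tumor (assumed linear scale, positive values)
    normals: {gene: [expression, ...]} across normal samples
    """
    calls = {}
    for gene, value in tumor.items():
        ref = median(normals[gene])
        if ref <= 0 or value <= 0:
            continue  # ratio is undefined here; this sketch simply skips the gene
        ratio = value / ref
        if ratio >= fold:
            calls[gene] = 1       # up-regulated
        elif ratio <= 1.0 / fold:
            calls[gene] = -1      # down-regulated
    return calls

# Toy data (hypothetical): gene A goes up, gene B goes down, gene C is unchanged.
normals = {"A": [10.0, 12.0, 11.0], "B": [30.0, 33.0, 27.0], "C": [5.0, 5.0, 6.0]}
tumor = {"A": 40.0, "B": 9.0, "C": 6.0}
calls = differentially_expressed(tumor, normals)
```

If the data were log2-transformed instead, the same rule would become a difference threshold of log2(3) rather than a ratio threshold.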

Given a seed gene set annotated with a distinct GO term, we sought to find a subset of tumors in which a functionally coherent gene module is co-regulated. To this end, we constructed a bipartite graph for each seed gene set, in which the vertices on one side are genes and those on the other side are tumors. We then applied a novel graph algorithm to identify a densely connected subgraph containing a module of genes (a subset of the input genes) and a subset of tumors. We required that each subgraph satisfy the following conditions: 1) each gene in the subgraph was differentially expressed in at least 75% of the tumors in the subgraph; 2) in each tumor, at least 75% of the genes in the module were differentially expressed; and 3) the subgraph included more than 25 tumors.
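The three subgraph conditions can be written down as a simple check. The sketch below also includes a naive greedy pruning loop to show how such a check might drive a search; that loop is only an illustrative stand-in and is not the novel graph algorithm referred to above, whose details are not given here. All function names are hypothetical.

```python
def satisfies_conditions(edges, genes, tumors, frac=0.75, min_tumors=25):
    """edges: set of (gene, tumor) pairs meaning 'gene is differentially expressed in tumor'."""
    if not genes or len(tumors) <= min_tumors:   # condition 3: more than 25 tumors
        return False
    # Condition 1: every gene is differentially expressed in >= 75% of the tumors.
    genes_ok = all(sum((g, t) in edges for t in tumors) >= frac * len(tumors) for g in genes)
    # Condition 2: every tumor has >= 75% of the module's genes differentially expressed.
    tumors_ok = all(sum((g, t) in edges for g in genes) >= frac * len(genes) for t in tumors)
    return genes_ok and tumors_ok

def greedy_prune(edges, genes, tumors, frac=0.75, min_tumors=25):
    """Naive heuristic (an assumption, not the authors' method): repeatedly drop the
    least-covered vertex until the conditions hold or too few tumors remain."""
    genes, tumors = set(genes), set(tumors)
    while genes and len(tumors) > min_tumors:
        if satisfies_conditions(edges, genes, tumors, frac, min_tumors):
            return genes, tumors
        gene_cov = {g: sum((g, t) in edges for t in tumors) / len(tumors) for g in genes}
        tumor_cov = {t: sum((g, t) in edges for g in genes) / len(genes) for t in tumors}
        worst_gene = min(sorted(genes), key=gene_cov.get)
        worst_tumor = min(sorted(tumors), key=tumor_cov.get)
        if gene_cov[worst_gene] <= tumor_cov[worst_tumor]:
            genes.discard(worst_gene)
        else:
            tumors.discard(worst_tumor)
    return None

# Toy graph (hypothetical): g0..g2 are hit in all 30 tumors, g3 in only 5 of them.
genes = ["g0", "g1", "g2", "g3"]
tumors = [f"t{i}" for i in range(30)]
edges = {(g, t) for g in genes[:3] for t in tumors}
edges |= {("g3", t) for t in tumors[:5]}
module = greedy_prune(edges, genes, tumors)
```

In this toy case the poorly covered gene g3 is removed and the remaining three genes form a valid module across all 30 tumors.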

Applying the analysis to the TCGA breast cancer tumors, we identified 159 subgraphs. We hypothesized that each subgraph represented a module of genes whose differential expression was in response to a common cellular signal that was perturbed in the subset of tumors. We then set out to test whether this approach would enable us to find the signaling pathways that underlie differences in prognosis among the breast cancer patients in the Sage Bionetworks-DREAM Breast Cancer Prognosis Challenge.

Based on the TCGA microarray data and our modular analysis results, we trained 159 Bayesian logistic regression models [4], one for each of the aforementioned subgraphs. This approach enabled us to determine whether a module was perturbed (“hit”) in a breast cancer sample from the DREAM 7 challenge. Moreover, the results allowed us to represent a tumor sample in a modular space of reduced dimension. Conditioning on the status of each gene module, we dichotomized samples into two groups (“hit” vs “not hit”) and performed Kaplan-Meier survival analysis to determine whether the module is predictive of patients’ clinical outcome. We identified 20 modules that led to significantly different survival outcomes (p-value < 0.05).
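To make the “hit” vs “not hit” comparison concrete, here is a small pure-Python sketch of a Kaplan-Meier estimate and a two-group comparison. The text does not say which test produced the p-values; the log-rank test used here is an assumption, and the function names and toy survival times are hypothetical.

```python
import math

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time; events[i] is 1 for death, 0 for censored."""
    curve, surv = [], 1.0
    for t in sorted({u for u, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1 - deaths / at_risk
        curve.append((t, surv))
    return curve

def logrank_p(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; p-value from a chi-square distribution with 1 degree of freedom."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    obs_minus_exp, var = 0.0, 0.0
    for t in sorted({u for u, e, _ in data if e}):
        n = sum(1 for u, _, _ in data if u >= t)                # at risk, both groups
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)   # at risk, group A
        d = sum(1 for u, e, _ in data if u == t and e)          # deaths, both groups
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            var += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if var == 0:
        return 1.0
    chi2 = obs_minus_exp ** 2 / var
    return math.erfc(math.sqrt(chi2 / 2))  # survival function of chi-square with 1 df

# Toy example (hypothetical): "hit" samples die early, "not hit" samples die late.
hit_times, hit_events = [1, 2, 3, 4, 5, 6, 7, 8], [1] * 8
miss_times, miss_events = [11, 12, 13, 14, 15, 16, 17, 18], [1] * 8
p_value = logrank_p(hit_times, hit_events, miss_times, miss_events)
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

A module would be flagged as informative under this sketch when `p_value` falls below 0.05. In practice an established survival library would be preferable to hand-rolled code.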

Using the states of all modules or of the 20 informative modules as input features, we explored different survival models, including the Cox model, the generalized linear model (the “glmnet” package from CRAN), and the generalized boosted model (the “gbm” package from CRAN). When the above models were evaluated individually, they led to prediction concordance indices of ~0.67 and ~0.66 on the training and testing datasets, respectively. We also explored using a machine learning model, RankSVM [5], to predict the rank order of patients based on the modular status. In general, the performance of the individual RankSVM and survival models was inferior to that of the leading ensemble models submitted by other groups.
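For readers unfamiliar with the concordance index, the sketch below shows a simplified Harrell-style version: among pairs of patients whose ordering is known, it counts how often the patient who died earlier received the higher predicted risk. The Challenge's exact scoring implementation may differ in how it handles ties and censoring; the function name and toy data are hypothetical.

```python
def concordance_index(times, events, risk):
    """Simplified Harrell-style concordance index.

    times:  observed survival or censoring times
    events: 1 if death was observed, 0 if censored
    risk:   predicted risk scores (higher means shorter expected survival)
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i died strictly before patient j's time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    return concordant / comparable

# Toy data (hypothetical): risk scores perfectly ordered against survival time.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
perfect = concordance_index(times, events, [0.9, 0.7, 0.4, 0.1])
reversed_ = concordance_index(times, events, [0.1, 0.4, 0.7, 0.9])
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which puts the reported values of about 0.66 to 0.67 in context.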

After carefully studying the leading models, we noticed that the ensemble model developed by the Attractome group was particularly well suited to incorporating the modular features identified by our approach. In essence, the Attractome group also aimed to identify gene modules and to project tumor samples into a modular space for modeling, although their approach was based on different assumptions and therefore derived different modules. Moreover, the ensemble learning approach adopted by the Attractome team also addressed the overfitting problems confronting single-model approaches. Therefore, we adopted an Attractome model (syn1421196) and evaluated whether our modular features could enhance it. We performed feature selection using a greedy forward-search approach in a series of cross-validation experiments. We identified 6 features that were capable of enhancing the performance of the Attractome model and integrated them into a hybrid model, referred to as the PittAttractomeHyb.2 model. When tested on the held-out METABRIC data, the model performed well, indicating that these modular features are indeed informative of clinical outcomes.
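The greedy forward search mentioned above follows a common pattern, sketched generically here. The callable `cv_score` stands in for whatever cross-validated metric is used (for example, a mean concordance index across folds); it, the stopping threshold `min_gain`, and the toy gains are all hypothetical and are not taken from the team's submission.

```python
def greedy_forward_select(candidates, base_features, cv_score, min_gain=1e-4):
    """Add one candidate feature at a time, keeping it only if the CV score improves.

    cv_score: callable taking a list of feature names and returning a score
              (higher is better); assumed to wrap model training plus cross-validation.
    """
    selected = list(base_features)
    best = cv_score(selected)
    remaining = list(candidates)
    while remaining:
        # Score every remaining candidate when added to the current feature set.
        top_score, top_feature = max((cv_score(selected + [f]), f) for f in remaining)
        if top_score - best < min_gain:
            break  # no candidate gives a meaningful improvement
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best

# Toy scoring function (hypothetical): each module adds a fixed gain to a baseline score.
gains = {"m1": 0.01, "m2": 0.0, "m3": 0.005}

def toy_cv_score(features):
    return 0.70 + sum(gains.get(f, 0.0) for f in features)

selected, best = greedy_forward_select(["m1", "m2", "m3"], ["base"], toy_cv_score)
```

Here "base" stands for the features of the existing ensemble model, and the search keeps only the modules that raise the score. Because the same cross-validation folds are reused at every step, scores from such a search tend to be optimistic, which is why evaluation on held-out data matters.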

In summary, the major contributions of this study include: 1) A novel integrative approach that is capable of identifying gene modules that likely represent units of cellular response to perturbed signaling. 2) Certain modules identified from the TCGA data are predictive of clinical outcome when applied to the METABRIC data; therefore, the information encoded by these modules is generalizable. 3) Integrating informative modular features with ensemble predictive models enhances the capability of predicting clinical outcomes of breast cancer patients. Ongoing efforts concentrate on reverse engineering the signaling pathways that underlie these predictive modules.


[1] Hanahan, D. & Weinberg, R. A. Hallmarks of cancer: the next generation. Cell 144, 646-674, doi:10.1016/j.cell.2011.02.013 (2011).

[2] Jin, B. & Lu, X. Identifying informative subsets of the Gene Ontology with information bottleneck methods. Bioinformatics 26, 2445-2451, doi:10.1093/bioinformatics/btq449 (2010).

[3] Richards, A. J. et al. Assessing the functional coherence of gene sets with metrics based on the Gene Ontology graph. Bioinformatics 26, i79-87, doi:10.1093/bioinformatics/btq203 (2010).

[4] Friedman, J., Hastie, T. & Tibshirani, R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J Stat Softw 33, 1-22 (2010).

[5] Joachims, T. in ACM Conference on Knowledge Discovery and Data Mining (KDD).

