Did Team Hive’s online game generate a top-scoring Challenge model?

Here is a guest post from Benjamin Good: he is a member of “Team Hive” and participating in our Challenge.  His post is about the fun online game (called “The Cure”) that he and his team launched in September 2012 to crowdsource ideas that Team Hive can then use to build great models for our Breast Cancer Challenge.  Please read on to find out how Team Hive’s models built with ideas from the crowd performed!

Reblogged from i9606

Monday, October 29, 2012

Results from the Cancer Biology game: The Cure

Building intelligent systems for biology

Our research group has been exploring the concept of serious games for several months now.  Aside from providing nerdy entertainment, our games collect (and distribute) biological knowledge from broad audiences of players.  The hypothesis underlying this work is that, by capturing knowledge in forms suitable for computation, these games make it possible to build more intelligent programs.

As one step in testing this general hypothesis, on Sept. 7, 2012, we released a game called ‘The Cure’.  The objective of this game is to build a better (more intelligent) predictor of breast cancer survival time based on gene expression and copy number variation information from tumor samples.  We selected this particular objective to align with the SAGE Breast Cancer Prognosis challenge.

In this game, available at http://genegames.org/cure/, the player competes with a computer opponent to select the highest scoring set of five genes from a board containing 25 different genes.  The boards are assembled in advance to include genes judged statistically ‘interesting’ using the METABRIC dataset provided for the SAGE Challenge.

Below is a game in progress.  I’m on the bottom and my opponent, Barney, is on the top.  We alternate turns selecting a card (a gene) from the board and adding it to our hand.  When we each complete a 5 card hand, the round finishes and whoever has the most points wins. Scores are determined by using training data to automatically infer and test decision tree classifiers that predict survival time.  The trees can use both RNA expression and CNV data for the selected genes to infer predictive rules.   The better the gene set performs in generating predictive decision trees, the higher the score.  When the player defeats their opponent, they move on to play another board.  (Multiple players play each board.)

A game of The Cure.  Barney (the bad guy) is winning; I am looking at the CPB1 gene and, using the search feature, I have highlighted in pink all genes that have the word cancer in any of their metadata.

As you can see to the right of the board, information from the Gene Ontology, RefSeq, and PubMed is provided through the game interface to aid players in selecting their genes.  Players are also encouraged to make use of external knowledge sources (in addition to their own brains).

Promotion, players and play

The Cure was promoted on launch day via a presentation by Andrew Su at Genome Informatics 2012, via Twitter and in several blog posts.   As we first described in a post published on the Sage community site, more than 120 players registered and collectively played more than 2000 games in the first week that the game was alive – with much of this activity happening within the first few days.  Nearly half of the players self-reported having PhDs and half claimed knowledge of cancer biology.  Following the initial buzz, game-playing activity slowed down to what is now a slow but persistent trickle.
Games played at The Cure since launch

As of last Friday, Oct. 26, 2012 we have had 214 people register and have recorded 3,954 total games (including training games).  The player demographics have remained stable with about 40% PhDs, nearly 50% declaring knowledge of cancer biology, and about 50% stating that they are biologists.

Predicting breast cancer prognosis

Aside from entertainment, the point of this particular game is to assemble a predictor for breast cancer prognosis.  The main hypothesis is that biological knowledge, accessible from players, can be used to help select good sets of genes to use to train predictive models using machine learning algorithms.  The premise is that injecting distributed biological knowledge (which can not entirely be learned from any one training set) will help reduce overfitting by identifying the gene sets with biologically consistent associations with disease progression.

The data collected from game play includes information about the players (education, knowledge of cancer, profession) and the complete history of the genes that each player selects for each board that they play.  While we are still considering methods for making use of this data (such as the Human Guided Forest), we used the following protocol to build a predictor to submit to the SAGE challenge.

  1. Filter out games from players that indicated no knowledge of cancer biology.
  2. Rank each gene according to the ratio of the number of times that it was selected by different players to the number of times that it appeared in any played game.
  3. Select the top 20 genes according to this ranking.
  4. Insert this 20 gene ‘signature’ into the ‘Attractor Metagene’ algorithm that has dominated the SAGE challenge.  To do this, we kept all of the code related to the use of clinical variables unchanged, but replaced the genes selected by the Attractor team with the genes selected by our game players.  
CCL3L3 CXCL9 IL1B BCL2 DUSP1 ERBB2 EGR1 JUN PITX1 MAP3K1 IGFBP2 STAT1 BCAR3 HOXB2 BCL11B MAPK15 WNT5A APOA2 HLA-DRB4 CD163
Game-selected genes
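The first three steps of the protocol above can be sketched in a few lines of Python. This is only an illustration, not the code we ran: the record layout (`player`, `knows_cancer_biology`, `board`, `picked`) and the function name `rank_genes` are hypothetical, and the sketch assumes that "selected by different players" means counting each player at most once per gene.

```python
from collections import defaultdict


def rank_genes(games, top_n=20):
    """Rank genes by (distinct players who picked it) / (played games that showed it)."""
    appearances = defaultdict(int)  # gene -> number of played games whose board held it
    pickers = defaultdict(set)      # gene -> players who selected it at least once

    for game in games:
        # Step 1: drop games from players reporting no knowledge of cancer biology.
        if not game["knows_cancer_biology"]:
            continue
        for gene in game["board"]:
            appearances[gene] += 1
        for gene in game["picked"]:
            pickers[gene].add(game["player"])

    # Step 2: selection ratio per gene.
    ratio = {gene: len(pickers[gene]) / shown for gene, shown in appearances.items()}

    # Step 3: keep the top genes; ties are broken alphabetically so the result is stable.
    ranked = sorted(ratio, key=lambda gene: (-ratio[gene], gene))
    return ranked[:top_n]


if __name__ == "__main__":
    toy_games = [
        {"player": "a", "knows_cancer_biology": True,
         "board": ["BCL2", "JUN", "EGR1"], "picked": ["BCL2"]},
        {"player": "b", "knows_cancer_biology": True,
         "board": ["BCL2", "JUN", "STAT1"], "picked": ["BCL2", "STAT1"]},
        {"player": "c", "knows_cancer_biology": False,
         "board": ["JUN", "EGR1"], "picked": ["JUN"]},
    ]
    print(rank_genes(toy_games, top_n=2))
```

In the toy data, player "c" is filtered out, so JUN is never counted as selected even though it was picked once.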

The predictor generated with this protocol achieved a survival concordance index of 0.69 on the Sage challenge test dataset, just 0.03 behind the best submitted predictor and significantly above the median of hundreds of submitted models. (You can see the ranked results on the challenge leaderboard – search for team HIVE – and, with a free registration, you can inspect the model directly within the Synapse system operated by SAGE.)
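For readers unfamiliar with the concordance index, here is a minimal sketch of the usual pairwise (Harrell-style) definition. It is not the challenge's scoring code, which is not described here; the function name and argument layout are assumptions. A value of 1.0 means every comparable pair of patients is ordered correctly, and 0.5 is what a constant or random predictor would get.

```python
def concordance_index(times, events, risk_scores):
    """Fraction of comparable patient pairs that the risk scores order correctly.

    times       -- observed follow-up time per patient
    events      -- 1/True if death was observed, 0/False if censored
    risk_scores -- predicted risk per patient (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i died strictly before patient j's time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    times = [2, 4, 6, 8]
    events = [1, 1, 0, 1]
    print(concordance_index(times, events, [0.9, 0.7, 0.1, 0.3]))  # perfect ordering
    print(concordance_index(times, events, [1.0, 1.0, 1.0, 1.0]))  # uninformative
```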

In experiments conducted within the training dataset, we were able to consistently generate decision tree predictors of 10-year survival with an accuracy of 65% in 10-fold cross-validation using only genomic data (no clinical information).  This was substantially better than classifiers produced using randomly selected genes (55%).  Using an exhaustive search through the top 10 genes, we found 10 different unique gene combinations that, when aggregated, produced statistically significant (FDR < 0.05) indicators of survival within: (1) the training dataset used in the game, (2) a validation cohort from the same study, and (3) an independent validation set from a completely different study.

Final Results from METABRIC round of BCC challenge

!! Update: the model submitted using The Cure data (Team HIVE) scored 0.70 on the official test dataset for the METABRIC round of this competition, putting it at #43 of 171 submitted models !!

Conclusions

These early results from The Cure show clearly that biologists with knowledge that is relevant to cancer biology will play scientific games, and that combined with even basic analytical techniques, meaningful knowledge for inferring predictors of disease progression can be captured from their play.  We suggest that this might open the door to a new form of ‘crowdsourcing’ that operates with much smaller, more specific crowds than are typically considered.
Data
The data collected from the game so far is available as an SQL dump in our repository. This is the entire database used to drive and track the game with the exception of personal information such as email and IP addresses.
Implementation
The code that operates The Cure is freely available on our BitBucket account.  It consists of a Java server application (running in Tomcat) that handles database interaction, board generation, and integration with the WEKA machine learning library.  WEKA is used to dynamically train and test decision trees (though we could easily use other models) while the game is running.  The interface is almost entirely CSS and JavaScript that communicates with the server via JSON requests.  We would be thrilled if someone wanted to use this code to build another classification game!

Trees
One aspect of the code-base that may be useful in a variety of different projects is the code that translates the Java objects that represent decision trees in WEKA into the Web-ready visualizations presented to the players.  This is accomplished via server-side translation into a JSON structure that is rendered in the browser using code that builds on the D3 javascript visualization library.

Credits
Thanks to Max Nanis, Salvatore Loguercio, Chunlei Wu, Ian Macleod and Andrew Su for all of your help making The Cure. Thanks in particular to Max who authored 99% of everything you see when you play the game.

Barney
The opponent in The Cure came from a Wikipedia Commons image from the game “You have to Burn the Rope”. Thanks for sharing!

Breast Cancer Challenge: Team “PittTransMed” places second for Metabric phase of the Challenge

Please join all of us at Sage Bionetworks and DREAM in congratulating Chunhui Cai and the entire PittTransMed team for being the second-highest-scoring team for the Metabric phase of our Breast Cancer Challenge!  Below you can read about Chunhui’s winning model (Syn ID#1443133).  For this top performance, Chunhui and his team have been invited to speak at the upcoming DREAM conference taking place in San Francisco (Nov 12-16).

Identify Informative Modular Features for Predicting Cancer Clinical Outcomes

 Songjian Lu, Chunhui Cai, Hatice Ulku Osmanbeyoglu, Lujia Chen, Roger Day, Gregory Cooper, and Xinghua Lu

PittTransMed Team

Department of Biomedical Informatics

University of Pittsburgh

An important task of translational cancer genomics is to identify molecular mechanisms underlying the heterogeneity of patient prognosis and responses to treatments. The large number of breast cancer samples molecularly characterized by the TCGA provides a unique opportunity to study perturbed signal transduction pathways that are determinative of clinical outcome and drug responses.  Our approach to revealing perturbed signaling pathways is based on the following assumption: if a module of genes participates in coherently related biological processes and is co-expressed in a subset of tumors, the module is likely regulated by a specific cellular signal that is perturbed in the subset of tumors.  We have developed a novel bi-clustering approach that unifies knowledge mining and data mining to identify gene modules and corresponding tumor subsets.

From each TCGA breast cancer tumor, we first identified all genes that were differentially expressed, i.e., 3-fold increase or decrease when compared to the median of expression values from normal samples.  Since a cancer tumor always results from perturbations of multiple signaling pathways 1, the complete list of differentially expressed genes from a tumor inevitably reflects a mixture of genes responding to distinct signals.  In order to de-convolute the signals, we developed an ontology-guided, semantic-driven approach 2 to group differentially expressed genes into non-disjoint subsets.  Each subset contains genes that participate in coherently related biological processes 3 that can be summarized by an informative GO term.   With gene expression data from each tumor conceptualized, we further pooled the genes summarized by a common GO term from all tumor samples to construct a seed gene set.
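The per-tumor differential-expression call described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline we ran: it assumes positive, linear-scale expression values, reads "3-fold increase or decrease" as a tumor-to-normal-median ratio of at least 3 or at most 1/3, and the function and argument names are hypothetical.

```python
from statistics import median


def differentially_expressed(tumor_expr, normal_expr, fold=3.0):
    """Return the genes whose tumor expression is >= fold or <= 1/fold times
    the median expression of the same gene across normal samples.

    tumor_expr  -- dict: gene -> expression value in one tumor
    normal_expr -- dict: gene -> list of expression values in normal samples
    """
    hits = set()
    for gene, value in tumor_expr.items():
        baseline = median(normal_expr[gene])
        if baseline <= 0 or value <= 0:
            continue  # fold change is undefined for non-positive values
        ratio = value / baseline
        if ratio >= fold or ratio <= 1.0 / fold:
            hits.add(gene)
    return hits


if __name__ == "__main__":
    tumor = {"A": 12.0, "B": 0.9, "C": 2.0}
    normals = {"A": [2.0, 3.0, 4.0], "B": [3.0, 3.0, 6.0], "C": [1.5, 2.0, 2.5]}
    print(sorted(differentially_expressed(tumor, normals)))  # A is up, B is down
```

Running this once per tumor yields the gene lists that are then grouped by GO term.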

Given a seed gene set annotated with a distinct GO term, we sought to search for a subset of tumors in which a functionally coherent gene module is co-regulated.  To this end, we constructed a bipartite graph for each seed gene set, in which vertices on one side are the genes and those on the other side are tumors.  We then applied a novel graph algorithm to identify a densely connected subgraph containing a module of genes (subset of input genes) and a subset of tumors. We required that the subgraph satisfy the following conditions: 1) each gene in the subgraph was differentially expressed in at least 75% of the tumors in the subgraph; 2) in each tumor, at least 75% of the genes in the module were differentially expressed; and 3) the subgraph included more than 25 tumors.
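The three conditions can be written down as a simple check on a candidate subgraph. The sketch below only verifies a candidate; it does not implement the graph search that finds one, and the names (`is_valid_module`, `de`) are hypothetical. An edge between a gene and a tumor is represented here by the gene being in that tumor's differentially expressed set.

```python
def is_valid_module(genes, tumors, de, min_fraction=0.75, min_tumors=25):
    """Check the three density conditions for a candidate gene/tumor subgraph.

    genes  -- candidate gene module
    tumors -- candidate tumor subset
    de     -- dict: tumor id -> set of genes differentially expressed in that tumor
    """
    genes, tumors = set(genes), set(tumors)

    # Condition 3: the subgraph must include more than `min_tumors` tumors.
    if not genes or len(tumors) <= min_tumors:
        return False

    # Condition 1: every gene is differentially expressed in >= 75% of the tumors.
    for gene in genes:
        covered = sum(1 for tumor in tumors if gene in de[tumor])
        if covered < min_fraction * len(tumors):
            return False

    # Condition 2: every tumor has >= 75% of the module's genes differentially expressed.
    for tumor in tumors:
        if len(genes & de[tumor]) < min_fraction * len(genes):
            return False

    return True


if __name__ == "__main__":
    module = {"g1", "g2", "g3", "g4"}
    tumor_ids = [f"t{i}" for i in range(26)]
    de_calls = {t: set(module) for t in tumor_ids}
    print(is_valid_module(module, tumor_ids, de_calls))       # True
    print(is_valid_module(module, tumor_ids[:25], de_calls))  # False: only 25 tumors
```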

Applying the analysis to the TCGA breast cancer tumors, we have identified 159 subgraphs.  We hypothesized that each subgraph represented a module of genes whose differential expression was in response to a common cellular signal that was perturbed in the subset of tumors.  We then set out to test if this approach would enable us to find the signaling pathways that underlie different prognosis of the breast cancer patients in the Sage Bionetworks-DREAM Breast Cancer Prognosis Challenge.

Based on the TCGA microarray data and our modular analysis results, we trained 159 Bayesian logistic regression models 4, one for each of the aforementioned subgraphs.  This approach enabled us to determine if a module was perturbed (“hit”) in a breast cancer sample from the DREAM 7 challenge; moreover, the results allowed us to represent a tumor sample in a modular space with a reduced dimension.  Conditioning on the status of each gene module, we dichotomized samples into two groups (“hit” vs “not hit”) and performed Kaplan-Meier survival analysis to determine if the module is predictive of clinical outcome of the patients.  We identified 20 modules that led to significantly different survival outcomes (p-value < 0.05).
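As an illustration of the dichotomize-then-compare step, the sketch below splits patients by a module's "hit" status and computes a Kaplan-Meier curve for each group. It is a minimal product-limit estimator with hypothetical names, not the analysis code we used, and it leaves out the significance test that produces the p-value.

```python
def kaplan_meier(times, events):
    """Product-limit estimate: list of (time, survival probability) at each event time."""
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censored at time t
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


def curves_by_module_status(times, events, hit):
    """Dichotomize patients by module status and estimate one curve per group."""
    groups = {}
    for status in (True, False):
        idx = [i for i, h in enumerate(hit) if bool(h) == status]
        groups[status] = kaplan_meier([times[i] for i in idx],
                                      [events[i] for i in idx])
    return groups


if __name__ == "__main__":
    times = [1, 2, 3, 4, 5, 6, 7, 8]
    events = [1, 0, 1, 1, 0, 1, 0, 1]
    hit = [1, 1, 1, 1, 0, 0, 0, 0]
    for status, curve in curves_by_module_status(times, events, hit).items():
        print("hit" if status else "not hit", curve)
```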

Using the states of all modules or 20 informative modules as input features, we explored different survival models, including the Cox model, the generalized linear model (the “glmnet” package from the CRAN), and the generalized boosted model (the “gbm” package from the CRAN).  When the above models were evaluated individually, they led to prediction concordance indices of ~0.67 and ~0.66 on the training and testing datasets, respectively.  We also explored using a machine learning model, the RankSVM 5, to predict the rank order of patients based on the modular status. In general, the performance of the individual RankSVM and survival models was inferior to that of the leading ensemble models submitted by other groups.

After carefully studying the leading models, we noticed that the ensemble model developed by the Attractome group was particularly suitable for incorporating the modular features identified by our approach.  In essence, the Attractome group also aimed to identify gene modules and to project tumor samples into a modular space for modeling, although their approach was based on different assumptions and therefore derived different modules.  Moreover, the ensemble learning approach adopted by the Attractome team also addressed the overfitting problems confronting single-model approaches.  Therefore, we adopted an Attractome model (syn1421196) and evaluated whether our modular features could enhance the model.  We performed a feature selection using a greedy forward-search approach in a series of cross-validation experiments.  We identified 6 features that were capable of enhancing the performance of the Attractome model and integrated them into a hybrid model, referred to as the PittAttractomeHyb.2 model.  When tested on the hold-out METABRIC data, the model performed well, indicating that these modular features are indeed informative of clinical outcomes.
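A greedy forward search of the kind mentioned above can be sketched generically. The `score` callable is a stand-in (an assumption) for whatever cross-validated metric is used, such as the concordance index of the base model plus the candidate features; the function name and the toy scores are hypothetical.

```python
def greedy_forward_selection(candidates, score, base=(), max_features=None):
    """Add one feature at a time, keeping the addition that improves `score` most.

    candidates   -- features that may be added
    score        -- callable taking a list of features, returning a number (higher is better)
    base         -- features that are always included (e.g. the existing model's inputs)
    max_features -- optional cap on the number of added features
    """
    selected = list(base)
    best = score(selected)
    remaining = [c for c in candidates if c not in selected]
    added = 0

    while remaining and (max_features is None or added < max_features):
        # Evaluate every remaining feature when added to the current set.
        trials = [(score(selected + [c]), c) for c in remaining]
        top_score, top_feature = max(trials, key=lambda pair: pair[0])
        if top_score <= best:
            break  # no remaining feature improves the score
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
        added += 1

    return selected, best


if __name__ == "__main__":
    # Toy additive scores standing in for cross-validated performance.
    gain = {"m1": 2, "m2": -1, "m3": 1}
    toy_score = lambda feats: 70 + sum(gain.get(f, 0) for f in feats)
    print(greedy_forward_selection(["m1", "m2", "m3"], toy_score))
```

With the toy scores, the search adds "m1", then "m3", and stops because "m2" would lower the score.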

In summary, the major contributions of this study include: 1) A novel integrative approach that is capable of identifying gene modules that likely represent units of cellular responses to perturbed signaling. 2) Certain modules identified from the TCGA data are predictive of clinical outcome when applied to the METABRIC data; therefore, the information encoded by these modules is generalizable.  3) Integrating informative modular features with ensemble predictive models enhances the capability of predicting clinical outcomes of breast cancer patients.   Ongoing efforts concentrate on reverse engineering the signaling pathways that underlie these predictive modules.

REFERENCES

1          Hanahan, D. & Weinberg, R. A. Hallmarks of cancer: the next generation. Cell 144, 646-674, doi:10.1016/j.cell.2011.02.013 (2011).

2          Jin, B. & Lu, X. Identifying informative subsets of the Gene Ontology with information bottleneck methods. Bioinformatics 26, 2445-2451, doi:10.1093/bioinformatics/btq449 (2010).

3          Richards, A. J. et al. Assessing the functional coherence of gene sets with metrics based on the Gene Ontology graph. Bioinformatics 26, i79-87, doi:10.1093/bioinformatics/btq203 (2010).

4          Friedman, J., Hastie, T. & Tibshirani, R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J Stat Softw 33, 1-22 (2010).

5          Joachims, T. in ACM Conference on Knowledge Discovery and Data Mining (KDD).

Breast Cancer Challenge: Team “Attractor Metagenes” nabs top overall Metabric score!

Please join all of us at Sage Bionetworks and DREAM in congratulating Wei-Yi Cheng and the entire Attractor Metagenes team for their winning model (Syn ID#1444444): this training model received the top overall score for the Metabric phase of this Challenge.  It will be so interesting to see how this model (and all the others) perform against the final validation data set that is currently being produced!  Wei-Yi and his team have been invited to speak at the upcoming DREAM conference taking place in San Francisco (Nov 12-16).   Please read on to hear from Wei-Yi Cheng, who submitted the winning model on behalf of his team.

Our models use the information provided by the “attractor metagenes” to evaluate their prognostic value in breast cancer. We have previously applied an iterative attractor finding algorithm on rich expression datasets from multiple cancer types identifying three universal (pan-cancer) attractors, which are the mitotic chromosomal instability attractor (CIN), the mesenchymal transition attractor (MES), and the lymphocyte-specific attractor (LYM) [1]. We like to think of these three attractors as “bioinformatic hallmarks of cancer.” In our top submission (syn1444444) we used precisely these same three metagenes as found from our previous unsupervised multi-cancer analysis.

Our experience in the Challenge was particularly rewarding as we were successively confirming that each of these attractors would indeed be helpful towards improving the breast cancer prognostic model, and we were reporting our observations to the other participants using the Synapse forum. We first found that the CIN attractor is highly prognostic, as evidenced by the fact that it was essentially recreated after ranking the individual genes in terms of their corresponding concordance index [2]. The other two main attractors were also found to be highly prognostic after being properly conditioned. For example, the MES attractor is most prognostic in early stage breast cancer (no positive lymph nodes and tumor size less than 30 mm) [3]. On the other hand, the LYM attractor is protective when ER and HER2 expressions are low, while it has the reverse effect on prognosis when there are multiple positive lymph nodes [4]. We also identified a few additional prognostic metagenes, such as the SUSD3-FGD3 metagene that is composed of two genomically adjacent genes, the ZMYND10-LRRC48-CASC1 metagene, the PGR-RAI2 metagene, the HER2 amplicon attractor metagene at chr17q11.2 – q21 and a chr17p12 meta-CNV. We compiled an attractor metagene space using these metagenes, along with TP53 and VEGFA, known to be associated with cancer, and then used them as a molecular feature space for feature selection.

In our top submission (syn1444444), we applied several subclassifiers to maximize the information used and to build a robust, generalizable model, including Cox regression, generalized boosted model (GBM), and K‑nearest neighbor (KNN). We also used the Akaike information criterion (AIC) on the features passed to Cox regression to avoid overfitting. We applied the AIC-based Cox regression and the GBM on the metagene space and the clinical features, respectively, and the KNN model on the combined metagene and clinical feature space. We combined the predictions of each subclassifier by directly summing up the linear predictors generated by the subclassifiers. For the KNN model, because the prediction is the survival time, we used the reciprocal of the prediction to be summed up. We also included two subclassifiers using mixed molecular and clinical features. In particular, one of them used all three of the universal metagenes (CIN, LYM and MES properly conditioned), the SUSD3-FGD3 metagene, and the clinical features age, radiation therapy, and chemotherapy. We found that such a simple model provides accurate prognosis, and can be treated independently of the other subclassifiers.
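The way the subclassifier outputs are combined can be illustrated with a short sketch. It is not the submitted code (which is available through Synapse); the function name and inputs are hypothetical, and the numbers are toy values. Risk-type predictions are summed directly, while the KNN survival-time prediction is inverted first so that a larger value always means worse prognosis.

```python
def combine_predictions(risk_predictions, knn_survival_times):
    """Sum per-patient linear predictors and add the reciprocal of KNN survival times.

    risk_predictions   -- list of lists, one per subclassifier (higher = higher risk)
    knn_survival_times -- predicted survival time per patient (higher = lower risk)
    """
    n = len(knn_survival_times)
    combined = [0.0] * n
    for predictions in risk_predictions:
        if len(predictions) != n:
            raise ValueError("every subclassifier must score every patient")
        for i, value in enumerate(predictions):
            combined[i] += value
    for i, survival_time in enumerate(knn_survival_times):
        # A long predicted survival time contributes a small risk term.
        combined[i] += 1.0 / survival_time
    return combined


if __name__ == "__main__":
    cox = [0.5, -0.5]  # toy Cox linear predictors for two patients
    gbm = [1.0, 0.0]   # toy GBM linear predictors
    knn = [2.0, 4.0]   # toy KNN predicted survival times
    print(combine_predictions([cox, gbm], knn))  # [2.0, -0.25]
```

Only the ordering of the combined scores matters for the concordance index, which is why a plain sum can serve as the ensemble output.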

In our submissions we did not make use of any code from other Challenge participants. We submitted the winning model prior to the October 15 model submission deadline, and the full code of this model was accessible to other participants as soon as it was posted to the leaderboard and thereafter.

References

  1. W-Y Cheng, D Anastassiou, Biomolecular events in cancer revealed by attractor metagenes. arXiv:1204.6538v1 [q-bio.QM].
  2. http://support.sagebase.org/sagebase/topics/mitotic_chromosomal_instability_attractor_metagene
  3. http://support.sagebase.org/sagebase/topics/mesenchymal_transition_attractor_metagene-znl1g
  4. http://support.sagebase.org/sagebase/topics/lymphocyte_specific_attractor_metagene