And The Winner Is…

We are very happy to announce that the Attractor Metagenes Team (Mr. Wei-Yi Cheng, Mr. Tai-Hsien Ou Yang and Professor Dimitris Anastassiou of Columbia University) is the winner of the Sage Bionetworks/DREAM Breast Cancer Prognosis Challenge. Please join all of us at Sage Bionetworks and DREAM in congratulating the team for their winning Challenge model (Syn ID#1417992), which achieved a concordance index of 0.7562. This means that, given any two breast cancer patients, the probability that team Attractor Metagenes will correctly predict which of the two will survive longer is about 76%, a highly statistically significant performance. The performance was also robust: this team was the best performer in most of the 100 instances in which the test set was perturbed by random removal of 20% of the patients. As the winner of the Challenge, the Attractor Metagenes team has been awarded the opportunity to publish an article about the winning Challenge model in Science Translational Medicine and will be invited to the 4th Annual Sage Congress, taking place in April 2013. Please read on to hear from Wei-Yi Cheng, who submitted the winning model on behalf of his team.
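For readers who would like to see what a concordance index measures, here is a minimal sketch of the standard pairwise definition (Harrell's C) for right-censored survival data. It is an illustration only and not the Challenge's scoring code; the function name, the argument layout and the choice to skip pairs with tied survival times are assumptions made for this example.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Pairwise concordance index for right-censored survival data.

    times       -- observed follow-up time for each patient
    events      -- 1 if death was observed, 0 if the patient was censored
    risk_scores -- model output; a higher score means shorter predicted survival
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Let `a` be the patient with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the earlier time is an observed death.
        # Pairs with tied times are skipped (a simplification for this sketch).
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ordering of all comparable pairs.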

Sage / DREAM7 Breast Cancer Prognostic Challenge


Validation Phase Writeup

Wei-Yi Cheng

During the first stages of the Breast Cancer Prognostic Challenge, we showed that breast cancer survival can be well predicted by the “Attractor Metagenes.” The Attractor Metagenes are sets of strongly co-expressed genes found using an iterative algorithm. We have previously identified several such Attractor Metagenes in almost identical forms across various cancer types, namely the mitotic chromosomal instability attractor (CIN), the mesenchymal transition attractor (MES), and the lymphocyte-specific attractor (LYM) [1]. As we have mentioned before, we like to think of these three main attractor metagenes as representing three key “bioinformatic hallmarks of cancer,” reflecting the ability of cancer cells to divide uncontrollably, their ability to invade surrounding tissues, and the ability of the organism to recruit a particular type of immune response to fight the disease. In the METABRIC dataset, we confirmed that, under certain conditions, each of these attractors has strong prognostic power. For instance, the expression of the CIN attractor suggests high grade; the expression of the MES attractor during early stage (no positive lymph node, and tumor size less than 30 mm) indicates the invasiveness of cancer cells and thus suggests bad prognosis; and the expression of the LYM attractor indicates good prognosis in ER-negative breast cancer, whereas it indicates bad prognosis when there are already positive lymph nodes. The attractor approach also identifies some breast-cancer-specific attractors, such as the ER attractor and the HER2 amplicon. We have used these Attractor Metagenes in all our models and reported their association with survival in previous stages of the challenge [2][3][4].

We think that the success of our model is due to the lack of overfitting to the training (METABRIC) dataset. Indeed our features, the attractor metagenes, were not derived from the training set. Instead, they were derived from other cancer datasets from multiple cancer types [1]. We hypothesized that the attractor metagenes represent universal biomolecular events in cancer, which would therefore be useful for the particular type of breast cancer. So, we used the training set only to find the best ways to combine these features in breast cancer. And we were so happy to see that our score for overall survival in a totally new validation dataset was actually higher than the corresponding score that we had achieved in the previous phases of the Challenge in which the METABRIC dataset itself was split and used for both training as well as validation.

In order to select from our existing models trained on the METABRIC data for the totally new Oslo validation set, we performed several hold-out tests on all of our submitted models, such as 10-fold cross-validation. In addition, based on what Dr. Huang revealed in “Contours of the Oslo Validation set” [5], we thought it was also important to evaluate the performance of our models using several re-sampled test sets containing only patients who received chemotherapy. Indeed, the top-performing model (syn1417992) has the highest chemotherapy-only test score among all of our models.

The top-performing syn1417992 model contains several subclassifiers that utilize orthogonal information. Based on the universal Attractor Metagenes we found in multiple cancer types [1], and several breast-cancer-specific Attractor Metagenes, we created an “Attractor Metagene Space” of around 15 attractor metagenes to replace the 50,000-gene molecular space. We used Cox regression, a generalized boosted model (GBM), and K-nearest neighbor (KNN) to create prognosis models on the Attractor Metagene Space and clinical features respectively. For feature selection, we used the Akaike information criterion (AIC) when performing Cox regression. The model also includes a subclassifier that uses mixed clinical and molecular features, which include all three of the universal metagenes (CIN, LYM restricted on ER and HER2 low, and MES restricted on lymph-negative and tumor size less than 30 mm), and the SUSD3 metagene, which we found to be highly associated with good prognosis when over-expressed. In our submissions we did not make use of any code from other Challenge participants.

Finally, we would like to thank everyone who made this wonderful challenge possible. We believe that this success validates not only the prognostic power of our model in breast cancer, but also the “pan-cancer” property of the attractor metagenes, since they were defined from other datasets of various cancer types. We hope that we will have the opportunity to collaborate with pharmaceutical companies towards the development of related diagnostic, prognostic and predictive products, and particularly to scrutinize the underlying biological mechanisms with the aim of identifying potential therapeutic interventions that could be applicable to all types of cancer.


  1. W.Y. Cheng, T.H. Ou Yang and D. Anastassiou, “Biomolecular events in cancer revealed by attractor metagenes,” Preprint available from arXiv:1204.6538v1, April 30, 2012, PLoS Computational Biology, in Press.

Did Team Hive’s online game generate a top-scoring Challenge model?

Here is a guest post from Benjamin Good: he is a member of “Team Hive” and participating in our Challenge.  His post is about the fun online game (called “The Cure”) that he and his team launched in September 2012 to crowdsource ideas that Team Hive can then use to build great models for our Breast Cancer Challenge.  Please read on to find out how Team Hive’s models built with ideas from the crowd performed!

Reblogged from i9606

Monday, October 29, 2012

Results from the Cancer Biology game: The Cure

Building intelligent systems for biology

Our research group has been exploring the concept of serious games for several months now.  Aside from providing nerdy entertainment, our games collect (and distribute) biological knowledge from broad audiences of players.  The hypothesis underlying this work is that, by capturing knowledge in forms suitable for computation, these games make it possible to build more intelligent programs.

As one step in testing this general hypothesis, on Sept. 7, 2012, we released a game called ‘The Cure’.  The objective of this game is to build a better (more intelligent) predictor of breast cancer survival time based on gene expression and copy number variation information from tumor samples.  We selected this particular objective to align with the SAGE Breast Cancer Prognosis challenge.

In this game (available online), the player competes with a computer opponent to select the highest scoring set of five genes from a board containing 25 different genes.  The boards are assembled in advance to include genes judged statistically ‘interesting’ using the METABRIC dataset provided for the SAGE Challenge.

Below is a game in progress.  I’m on the bottom and my opponent, Barney, is on the top.  We alternate turns selecting a card (a gene) from the board and adding it to our hand.  When we each complete a 5 card hand, the round finishes and whoever has the most points wins. Scores are determined by using training data to automatically infer and test decision tree classifiers that predict survival time.  The trees can use both RNA expression and CNV data for the selected genes to infer predictive rules.   The better the gene set performs in generating predictive decision trees, the higher the score.  When the player defeats their opponent, they move on to play another board.  (Multiple players play each board.)

A game of The Cure.  Barney (the bad guy) is winning, I am looking at the CPB1 gene and, using the search feature, I have highlighted in pink all genes that have the word cancer in any of their metadata.

As you can see to the right of the board, information from the Gene Ontology, RefSeq, and PubMed is provided through the game interface to aid players in selecting their genes.  Players are also encouraged to make use of external knowledge sources (in addition to their own brains).

Promotion, players and play

The Cure was promoted on launch day via a presentation by Andrew Su at Genome Informatics 2012, via Twitter and in several blog posts.   As we first described in a post published on the Sage community site, more than 120 players registered and collectively played more than 2000 games in the first week that the game was alive – with much of this activity happening within the first few days.  Nearly half of the players self-reported having PhDs and half claimed knowledge of cancer biology.  Following the initial buzz, game-playing activity slowed down to what is now a slow but persistent trickle.
Games played at The Cure since launch

As of last Friday, Oct. 26, 2012 we have had 214 people register and have recorded 3,954 total games (including training games).  The player demographics have remained stable with about 40% PhDs, nearly 50% declaring knowledge of cancer biology, and about 50% stating that they are biologists.

Predicting breast cancer prognosis

Aside from entertainment, the point of this particular game is to assemble a predictor for breast cancer prognosis.  The main hypothesis is that biological knowledge, accessible from players, can be used to help select good sets of genes to use to train predictive models using machine learning algorithms.  The premise is that injecting distributed biological knowledge (which can not entirely be learned from any one training set) will help reduce overfitting by identifying the gene sets with biologically consistent associations with disease progression.

The data collected from game play includes information about the players (education, knowledge of cancer, profession) and the complete history of the genes that each player selects for each board that they play.  While we are still considering methods for making use of this data (such as the Human Guided Forest), we used the following protocol to build a predictor to submit to the SAGE challenge.

  1. Filter out games from players that indicated no knowledge of cancer biology.
  2. Rank each gene according to the ratio of the number of times that it was selected by different players to the number of times that it appeared in any played game.
  3. Select the top 20 genes according to this ranking.
  4. Insert this 20 gene ‘signature’ into the ‘Attractor Metagene’ algorithm that has dominated the SAGE challenge.  To do this, we kept all of the code related to the use of clinical variables unchanged, but replaced the genes selected by the Attractor team with the genes selected by our game players.  
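Steps 1 to 3 of the protocol above can be sketched in a few lines of Python. This is only an illustration and not the code we ran: the record layout (`player`, `knows_cancer`, `board_genes`, `selected`) is hypothetical, "selected by different players" is read as the number of distinct players who picked the gene, and appearances are counted only in the games that survive the filter. Both readings are assumptions.

```python
from collections import Counter


def rank_genes(games, top_n=20):
    """Rank genes by (distinct players who selected it) / (games it appeared in).

    games -- list of dicts with hypothetical keys:
             'player', 'knows_cancer' (bool), 'board_genes', 'selected'
    """
    selected_by = {}          # gene -> set of players who picked it
    appearances = Counter()   # gene -> number of kept games showing the gene
    for game in games:
        # Step 1: drop games from players reporting no cancer biology knowledge.
        if not game["knows_cancer"]:
            continue
        for gene in game["board_genes"]:
            appearances[gene] += 1
        for gene in game["selected"]:
            selected_by.setdefault(gene, set()).add(game["player"])
    # Step 2: selection ratio per gene.
    ratios = {
        gene: len(selected_by.get(gene, ())) / count
        for gene, count in appearances.items()
    }
    # Step 3: keep the top-ranked genes (ties broken alphabetically).
    return sorted(ratios, key=lambda gene: (-ratios[gene], gene))[:top_n]
```

The resulting list plays the role of the 20-gene ‘signature’ in step 4; swapping it into another team's model is not shown here.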
Game-selected genes

The predictor generated with this protocol scored 69% on the survival concordance index on the Sage challenge test dataset, just 3% behind the best submitted predictor and significantly above the median of hundreds of submitted models. (You can see the ranked results on the challenge leaderboard by searching for team HIVE and, with a free registration, you can inspect the model directly within the Synapse system operated by SAGE.)

In experiments conducted within the training dataset, we were able to consistently generate decision tree predictors of 10-year survival with an accuracy of 65% in 10-fold cross-validation using only genomic data (no clinical information).  This was substantially better than classifiers produced using randomly selected genes (55%).  Using an exhaustive search through the top 10 genes, we found 10 different unique gene combinations that, when aggregated, produced statistically significant (FDR < 0.05) indicators of survival within: (1) the training dataset used in the game, (2) a validation cohort from the same study, and (3) an independent validation set from a completely different study.

Final Results from METABRIC round of BCC challenge

!! Update: the model submitted using The Cure data (Team HIVE) scored 0.70 on the official test dataset for the METABRIC round of this competition, putting it at #43 of 171 submitted models !!


These early results from The Cure show clearly that biologists with knowledge that is relevant to cancer biology will play scientific games, and that combined with even basic analytical techniques, meaningful knowledge for inferring predictors of disease progression can be captured from their play.  We suggest that this might open the door to a new form of ‘crowdsourcing’ that operates with much smaller, more specific crowds than are typically considered.
The data collected from the game so far is available as an SQL dump in our repository. This is the entire database used to drive and track the game with the exception of personal information such as email and IP addresses.
The code that operates The Cure is freely available on our BitBucket account.  It consists of a Java server application (running in Tomcat) that handles database interaction, board generation, and integration with the WEKA machine learning library.  WEKA is used to dynamically train and test decision trees (though we could easily use other models) while the game is running.  The interface is almost entirely CSS and JavaScript that communicates with the server via JSON requests.  We would be thrilled if someone wanted to use this code to build another classification game!

One aspect of the code-base that may be useful in a variety of different projects is the code that translates the Java objects that represent decision trees in WEKA into the Web-ready visualizations presented to the players.  This is accomplished via server-side translation into a JSON structure that is rendered in the browser using code that builds on the D3 javascript visualization library.

Thanks to Max Nanis, Salvatore Loguercio, Chunlei Wu, Ian Macleod and Andrew Su for all of your help making The Cure. Thanks in particular to Max who authored 99% of everything you see when you play the game.

The opponent in The Cure came from a Wikimedia Commons image from the game “You Have to Burn the Rope”. Thanks for sharing!

Breast Cancer Challenge: Team “PittTransMed” places second for Metabric phase of the Challenge

Please join all of us at Sage Bionetworks and DREAM in congratulating Chunhui Cai and the entire PittTransMed team for being the second highest scoring team for the Metabric phase of our Breast Cancer Challenge! Below you can read about Chunhui’s winning model (Syn ID#1443133). For this top performance, Chunhui and his team have been invited to speak at the upcoming DREAM conference taking place in San Francisco (Nov 12-16).

Identify Informative Modular Features for Predicting Cancer Clinical Outcomes

 Songjian Lu, Chunhui Cai, Hatice Ulku Osmanbeyoglu, Lujia Chen, Roger Day, Gregory Cooper, and Xinghua Lu

PittTransMed Team

Department of Biomedical Informatics

University of Pittsburgh

An important task of translational cancer genomics is to identify molecular mechanisms underlying the heterogeneity of patient prognosis and responses to treatments. The large number of molecularly characterized breast cancer samples by the TCGA provides a unique opportunity to study perturbed signal transduction pathways that are determinative of clinical outcome and drug-responses.  Our approach to reveal perturbed signaling pathways is based on the following assumption: if a module of genes participates in coherently related biological processes and is co-expressed in a subset of tumors, the module is likely regulated by a specific cellular signal that is perturbed in the subset of tumors.  We have developed a novel bi-clustering approach that unifies knowledge mining and data mining to identify gene modules and corresponding tumor subsets.

From each TCGA breast cancer tumor, we first identified all genes that were differentially expressed, i.e., 3-fold increase or decrease when compared to the median of expression values from normal samples.  Since a cancer tumor always results from perturbations of multiple signaling pathways 1, the complete list of differentially expressed genes from a tumor inevitably reflects a mixture of genes responding to distinct signals.  In order to de-convolute the signals, we developed an ontology-guided, semantic-driven approach 2 to group differentially expressed genes into non-disjoint subsets.  Each subset contains genes that participate in coherently related biological processes 3 that can be summarized by an informative GO term.   With gene expression data from each tumor conceptualized, we further pooled the genes summarized by a common GO term from all tumor samples to construct a seed gene set.
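The differential-expression call described above can be illustrated with a short sketch. It assumes expression values on a linear (not log) scale, strictly positive normal medians, and that every gene measured in the tumor is also present in each normal sample; the function name and the +1/-1 encoding are hypothetical choices for this example rather than part of our pipeline.

```python
from statistics import median


def differentially_expressed(tumor, normals, fold=3.0):
    """Flag genes with at least a `fold` increase or decrease in one tumor.

    tumor   -- dict mapping gene -> expression value (linear scale)
    normals -- list of dicts with the same genes, one per normal sample
    Returns a dict gene -> +1 (increase) or -1 (decrease).
    """
    calls = {}
    for gene, value in tumor.items():
        # Baseline is the median expression across the normal samples.
        baseline = median(sample[gene] for sample in normals)
        if value >= fold * baseline:
            calls[gene] = 1
        elif value <= baseline / fold:
            calls[gene] = -1
    return calls
```

Running this once per tumor yields the per-tumor gene lists that are then grouped by GO term.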

Given a seed gene set annotated with a distinct GO term, we sought to search for a subset of tumors in which a functionally coherent gene module is co-regulated.  To this end, we constructed a bipartite graph for each seed gene set, in which vertices on one side are the genes and those on the other side are tumors.  We then applied a novel graph algorithm to identify a densely connected subgraph containing a module of genes (subset of input genes) and a subset of tumors. We required that the subgraph satisfy the following conditions: 1) each gene in the subgraph was differentially expressed in at least 75% of the tumors in the subgraph; 2) in each tumor, at least 75% of the genes in the module were differentially expressed; and 3) the subgraph included more than 25 tumors.
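The three conditions on a candidate subgraph can be checked directly, as in the sketch below. It only verifies a given (genes, tumors) pair and does not implement the graph search algorithm itself; the edge representation as a set of `(gene, tumor)` pairs and the function name are assumptions made for illustration.

```python
def satisfies_conditions(edges, genes, tumors, min_fraction=0.75, min_tumors=25):
    """Check the three density conditions for a candidate gene/tumor subgraph.

    edges  -- set of (gene, tumor) pairs: gene is differentially expressed in tumor
    genes  -- candidate gene module
    tumors -- candidate tumor subset
    """
    # Condition 3: the subgraph must include more than `min_tumors` tumors.
    if len(tumors) <= min_tumors or not genes:
        return False
    # Condition 1: each gene is differentially expressed in >= 75% of the tumors.
    for gene in genes:
        hits = sum((gene, tumor) in edges for tumor in tumors)
        if hits < min_fraction * len(tumors):
            return False
    # Condition 2: in each tumor, >= 75% of the module genes are differentially expressed.
    for tumor in tumors:
        hits = sum((gene, tumor) in edges for gene in genes)
        if hits < min_fraction * len(genes):
            return False
    return True
```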

Applying the analysis to the TCGA breast cancer tumors, we have identified 159 subgraphs.  We hypothesized that each subgraph represented a module of genes whose differential expression was in response to a common cellular signal that was perturbed in the subset of tumors.  We then set out to test if this approach would enable us to find the signaling pathways that underlie different prognosis of the breast cancer patients in the Sage Bionetworks-DREAM Breast Cancer Prognosis Challenge.

Based on the TCGA microarray data and our modular analysis results, we trained 159 Bayesian logistic regression models 4, one for each of the aforementioned subgraphs.  This approach enabled us to determine whether a module was perturbed (“hit”) in a breast cancer sample from the DREAM 7 challenge. Moreover, the results allowed us to represent a tumor sample in a modular space of reduced dimension.  Conditioning on the status of each gene module, we dichotomized samples into two groups (“hit” vs “not hit”) and performed Kaplan-Meier survival analysis to determine whether the module was predictive of the patients’ clinical outcome.  We identified 20 modules that led to significantly different survival outcomes (p-value < 0.05).
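As an illustration of the dichotomized survival analysis, the sketch below computes Kaplan-Meier curves separately for the “hit” and “not hit” groups of one module. It covers only the product-limit estimator; the significance test that produces the p-value is not implemented here, and the function names and input layout are assumptions for this example.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (time, survival probability).

    times  -- follow-up time per patient
    events -- 1 if death was observed, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same time point.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored patients leave the risk set
    return curve


def curves_by_module_status(times, events, hit):
    """Split patients by module status and estimate one curve per group."""
    curves = {}
    for status in (True, False):
        idx = [k for k, h in enumerate(hit) if bool(h) == status]
        curves[status] = kaplan_meier(
            [times[k] for k in idx], [events[k] for k in idx]
        )
    return curves
```

A module would be called informative when the two curves differ significantly under a suitable two-group survival test.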

Using the states of all modules or of the 20 informative modules as input features, we explored different survival models, including the Cox model, the generalized linear model (the “glmnet” package from CRAN), and the generalized boosted model (the “gbm” package from CRAN).  When the above models were evaluated individually, they led to prediction concordance indices of ~0.67 and ~0.66 on the training and testing datasets, respectively.  We also explored using a machine learning model, the RankSVM 5, to predict the rank order of patients based on the modular status. In general, the performances of individual RankSVM and survival models were inferior to the leading ensemble models submitted by other groups.

After carefully studying the leading models, we noticed that the ensemble model developed by the Attractome group was particularly suitable for incorporating the modular features identified by our approach.  In essence, the Attractome group also aimed to identify gene modules and to project tumor samples into a modular space for modeling, although their approach was based on different assumptions and therefore derived different modules.  Moreover, the ensemble learning approach adopted by the Attractome team also addressed the overfitting problems confronting single-model approaches.  Therefore, we adopted an Attractome model (syn1421196) and evaluated whether our modular features could enhance the model.  We performed feature selection using a greedy forward-search approach in a series of cross-validation experiments.  We identified 6 features that were capable of enhancing the performance of the Attractome model and integrated them into a hybrid model, referred to as the PittAttractomeHyb.2 model.  When tested on the hold-out METABRIC data, the model performed well, indicating that these modular features are indeed informative of clinical outcomes.
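A greedy forward search of the kind mentioned above can be sketched generically as follows. The `score` callback stands in for a cross-validated performance estimate (for example, a concordance index) of the base model plus the chosen features; that callback, the stopping rule and all names are hypothetical and not taken from our submission.

```python
def greedy_forward_selection(candidates, base_features, score, min_gain=0.0):
    """Add one candidate feature at a time while the score keeps improving.

    candidates    -- features that may be added (e.g. module states)
    base_features -- features already used by the base model
    score         -- callable taking a feature list and returning a number
                     (assumed to be a cross-validated performance estimate)
    """
    selected = list(base_features)
    remaining = list(candidates)
    best = score(selected)
    while remaining:
        # Evaluate every remaining feature added to the current set.
        trials = [(score(selected + [feature]), feature) for feature in remaining]
        top_score, top_feature = max(trials, key=lambda pair: pair[0])
        if top_score - best <= min_gain:
            break  # no candidate improves the score any further
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best
```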

In summary, the major contributions of this study include: 1) A novel integrative approach that is capable of identifying gene modules that likely represent units of cellular responses to perturbed signaling. 2) Certain modules identified from the TCGA data are predictive of clinical outcome when applied to the METABRIC data; therefore, the information encoded by these modules is generalizable.  3) Integrating informative modular features with ensemble predictive models enhances the capability of predicting clinical outcomes of breast cancer patients.   Ongoing efforts concentrate on reverse engineering the signaling pathways that underlie these predictive modules.


1          Hanahan, D. & Weinberg, R. A. Hallmarks of cancer: the next generation. Cell 144, 646-674, doi:10.1016/j.cell.2011.02.013 (2011).

2          Jin, B. & Lu, X. Identifying informative subsets of the Gene Ontology with information bottleneck methods. Bioinformatics 26, 2445-2451, doi:10.1093/bioinformatics/btq449 (2010).

3          Richards, A. J. et al. Assessing the functional coherence of gene sets with metrics based on the Gene Ontology graph. Bioinformatics 26, i79-87, doi:10.1093/bioinformatics/btq203 (2010).

4          Friedman, J., Hastie, T. & Tibshirani, R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J Stat Softw 33, 1-22 (2010).

5          Joachims, T. in ACM Conference on Knowledge Discovery and Data Mining (KDD).

Breast Cancer Challenge: Team “Attractor Metagenes” nabs top overall Metabric score!

Please join all of us at Sage Bionetworks and DREAM in congratulating  Wei-yi Cheng and the entire Attractor Metagenes team for their winning model (Syn ID#1444444): this training model received the top overall score for the Metabric phase of this Challenge.  It will be so interesting to see how this model (and all the others) perform against the final validation data set that is currently being produced!  Wei-yi and his team have been invited to speak at the upcoming DREAM conference taking place in San Francisco (Nov 12-16).   Please read on to hear from Wei-Yi Cheng who submitted the winning model on behalf of his team.

Our models use the information provided by the “attractor metagenes” to evaluate their prognostic value in breast cancer. We have previously applied an iterative attractor finding algorithm on rich expression datasets from multiple cancer types identifying three universal (pan-cancer) attractors, which are the mitotic chromosomal instability attractor (CIN), the mesenchymal transition attractor (MES), and the lymphocyte-specific attractor (LYM) [1]. We like to think of these three attractors as “bioinformatic hallmarks of cancer.” In our top submission (syn1444444) we used precisely these same three metagenes as found from our previous unsupervised multi-cancer analysis.

Our experience in the Challenge was particularly rewarding as we were successively confirming that each of these attractors would indeed be helpful towards improving the breast cancer prognostic model, and we were reporting our observations to the other participants using the Synapse forum. We first found that the CIN attractor is highly prognostic, as evidenced by the fact that it was essentially recreated after ranking the individual genes in terms of their corresponding concordance index [2]. The other two main attractors were also found to be highly prognostic after being properly conditioned. For example, the MES attractor is most prognostic in early stage breast cancer (no positive lymph nodes and tumor size less than 30 mm) [3]. On the other hand, the LYM attractor is protective when ER and HER2 expressions are low, while it has the reverse effect on prognosis when there are multiple positive lymph nodes [4]. We also identified a few additional prognostic metagenes, such as the SUSD3-FGD3 metagene that is composed of two genomically adjacent genes, the ZMYND10-LRRC48-CASC1 metagene, the PGR-RAI2 metagene, the HER2 amplicon attractor metagene at chr17q11.2 – q21 and a chr17p12 meta-CNV. We compiled an attractor metagene space using these metagenes, along with TP53 and VEGFA, known to be associated with cancer, and then used them as a molecular feature space for feature selection.

In our top submission (syn1444444), we applied several subclassifiers to maximize the information used and to build a robust, generalizable model, including Cox regression, a generalized boosted model (GBM), and K‑nearest neighbor (KNN). We also used the Akaike information criterion (AIC) on the features passed to Cox regression to avoid overfitting. We applied the AIC-based Cox regression and the GBM on the metagene space and the clinical features, respectively, and the KNN model on the combined metagene and clinical feature space. We combined the predictions of each subclassifier by directly summing up the linear predictors generated by the subclassifiers. For the KNN model, because the prediction is the survival time, we summed the reciprocal of the prediction instead. We also included two subclassifiers using mixed molecular and clinical features. In particular, one of them used all three of the universal metagenes (CIN, LYM and MES properly conditioned), the SUSD3-FGD3 metagene, and the clinical features age, radiation therapy, and chemotherapy. We found that such a simple model provides accurate prognosis, and can be treated independently of the other subclassifiers.
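The way the subclassifier outputs are combined can be shown with a small sketch. It assumes each risk-type subclassifier (for example Cox or GBM) returns one linear predictor per patient, where higher means higher risk, and that the KNN subclassifier returns a positive predicted survival time. No rescaling between subclassifiers is applied in this illustration, and the function name is hypothetical.

```python
def combine_predictions(linear_predictors, knn_survival_times):
    """Sum risk-type predictions and add the reciprocal of KNN survival times.

    linear_predictors  -- list of per-subclassifier lists (higher = higher risk)
    knn_survival_times -- predicted survival times (higher = lower risk)
    """
    n = len(knn_survival_times)
    combined = [0.0] * n
    for predictions in linear_predictors:
        if len(predictions) != n:
            raise ValueError("all subclassifiers must score the same patients")
        for k, value in enumerate(predictions):
            combined[k] += value
    for k, survival_time in enumerate(knn_survival_times):
        # Longer predicted survival means lower risk, hence the reciprocal.
        combined[k] += 1.0 / survival_time
    return combined
```

Since the concordance index depends only on how patients are ranked, the summed score can be evaluated directly without converting it to a survival time.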

In our submissions we did not make use of any code from other Challenge participants. We submitted the winning model prior to the October 15 model submission deadline, and the full code of this model was accessible to other participants as soon as it was posted to the leaderboard and thereafter.


  1. W-Y Cheng, D Anastassiou, Biomolecular events in cancer revealed by attractor metagenes. arXiv:1204.6538v1 [q-bio.QM].

Science is hard

Check out a new clearScience blog post expounding on an article by David Shaywitz in today’s Forbes. Enjoy!

The Challenge’s October 1 Leaderboard Winner: A “Repeat Performance” from Attractor Metagenes Team!

Please join all of us at Sage Bionetworks and DREAM in congratulating Wei-yi Cheng and the entire Attractor Metagenes team for their October 1 Leaderboard Winner Achievement. Attractor Metagenes was also the September 1 leaderboard winner. This “repeat performance” is especially impressive given that it was achieved working with two different versions of the Metabric data! Please read on to hear from Wei-Yi Cheng, who submitted the winning model on behalf of his team.

Dear fellow BCC challenge participants and others here on Synapse,

I would like to thank once more the organizers for the opportunity that they again give to the Attractor Metagenes Team to share some of our methods and findings. It has been another exciting month, in which many new ideas have been shared among us. It is inspiring that through these discussions on the Challenge forum, we all are gaining a better understanding of different perspectives on the data and the disease itself.

The main attribute responsible for our continuing high score is that we are making use of the three strongest attractor metagenes representing universal (multi-cancer) biomolecular events: the mitotic chromosomal instability attractor metagene, the mesenchymal transition attractor metagene, and the lymphocyte-specific attractor metagene. We are particularly excited because these metagenes are present in all solid cancers, and therefore can be used as “pancancer” biomarkers, which will be more robust than individual oncogenes. We have now posted descriptions of each of these three main attractors as items in the Synapse forum. So far we have not incorporated any code from other submissions, but we will certainly do so if we deem it appropriate, giving them credit prominently. Similarly, we welcome others to make use of our code, which is always freely and readily available. The functions and metagene lists used in all our submissions are incorporated in an R package downloadable through the link given in the source code we uploaded on the leaderboard. We have also uploaded an R package for finding attractor metagenes, available under Synapse ID syn1123167 for anyone interested to use, not only in breast cancer, but in all types of cancer.

We understand that the main objective in Phase 2 of the Challenge is to build a generalizable model that will work well when evaluated against the Oslo Validation data set. Based on our experience in both Phases, we believe that achieving a generalizable model requires making use of survival data that have been “purified” by excluding causes of death unrelated to the disease itself. We understand, however, that this is difficult to achieve in general and even more so in the case of the Oslo-Val data set. And because Phase 2 uses the same overall survival data as in the Oslo-Val, we modified our models to include lots of clinical features that we do not think would be otherwise required for the development of a sharp and generalizable prognostic model.

To elaborate on this last point: We are excited about building sharp, insightful and powerful “minimalist” disease models that could be used for biomarker products making use of a very small number of features. For example, we believe that we have identified such a model in breast cancer that makes use of nothing other than our three attractor metagenes mentioned above, tumor size, number of lymph nodes affected, and one more protective feature that we discovered as a result of our participation in the Challenge: The metagene defined as the average of the genes SUSD3 and FGD3, which, as we observed, are genomically adjacent at chr9q22.31. We know that simultaneous silencing of these two genes is strongly associated with bad prognosis, but we are not certain about the underlying biological mechanism (it may not be the result of a CNV). We suspect that this simultaneous silencing is one of several triggers required for ER-negativity, perhaps the most important one. This is an interesting research question and we hope that other Challenge participants with expertise in biology and medicine will join us in the effort to decipher this important mechanism in breast cancer!
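For anyone who wants to try the SUSD3-FGD3 feature, the metagene as defined above is simply the average of the two genes' expression values in a sample. The sketch below assumes one dictionary of normalized expression values per sample; the function name and the data layout are hypothetical and not taken from our R package.

```python
def metagene_score(sample_expression, genes=("SUSD3", "FGD3")):
    """Average expression of the metagene's member genes in one sample.

    sample_expression -- dict mapping gene symbol -> normalized expression value
    genes             -- member genes of the metagene
    """
    missing = [gene for gene in genes if gene not in sample_expression]
    if missing:
        raise KeyError("genes not measured in this sample: %s" % ", ".join(missing))
    return sum(sample_expression[gene] for gene in genes) / len(genes)
```

The same helper works for any other metagene defined as a plain average, given its list of member genes.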

Wei-Yi Cheng

Graduate Research Assistant, Ph.D. Candidate in Electrical Engineering

Genomic Information Systems Laboratory,

Columbia University

Can the crowd provide ‘The Cure’?

The participants in Sage Bionetworks’ Breast Cancer Prognosis Challenge keep surprising and delighting us! Here is a guest post from Benjamin Good: he is a member of “Team Hive” and is participating in our Challenge. His post is about the online game he and his team launched just a few weeks ago to crowdsource ideas that Team Hive can then use to build great models for the Challenge. Please enjoy Ben’s post and try out his cool game!!

Serious Games

Serious games are games that have an underlying purpose. When you play a game like Foldit or Phylo, you are finding entertainment like any other game but your actions are also translating into a useful end product. In Foldit, you contribute to protein structure determination, in Phylo to multiple sequence alignment. Reconstituting difficult or time consuming tasks into components of games opens up a new way to find and motivate volunteer contributors at potentially massive scale. Like the SAGE competitions themselves, serious games provide the opportunity to focus widespread community attention on particular challenging problems.

The Cure

The purpose of the game The Cure is to identify sets of genes that can be used to build predictors of breast cancer prognosis that will stand up to validation. The hypothesis is that we can outperform purely data-driven approaches by infusing our gene selection algorithms with the biological knowledge and reasoning abilities of hundreds or even thousands of players. This biological insight is captured through a simple, fun two-player card game where each card corresponds to a gene.

In its current form, the game consists of a series of 100 boards, each containing 25 distinct genes from a precomputed list of interesting genes. In the game, you compete with the computer opponent Barney to find the set of 5 genes from each board that form the best predictors of 10-year survival. In each turn, the player takes a gene card off of the board and puts it in their hand. To make these decisions, extensive annotation information from resources such as the Gene Ontology, Entrez Gene and PubMed is provided through the game interface and players are free to conduct their own research. As each card is added to a player’s hand, a decision tree is constructed automatically using the genes in the hand and the training dataset from the Sage Bionetworks / DREAM challenge. The tree is shown to the player and the hand is scored based on the performance of the decision tree algorithm, coupled with those genes, in a 10-fold cross-validation test. If the player produces a better gene set than Barney they score points based on the cross-validation score.

Play begins in a very short training stage that teaches the mechanics of the game as players select features to use to build an animal classifier. Once this stage is passed, players are free to choose which of the gene boards to play. Each board is shown with an indication about how many other players have already defeated it. Once a certain number of players have finished a board, we declare it complete and close it off to encourage the player population to explore the entire board space. The first collection of boards is nearly complete and the next level should be available soon.
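The scoring rule described above, in which a hand of genes is scored by cross-validating a classifier built on those genes, can be sketched in Python. This is only an illustration under stated assumptions and not the game's actual code: a one-split decision stump stands in for the full decision tree, the fold assignment is a simple stride, and the gene names, expression values and survival labels are invented.

```python
def train_stump(rows, labels):
    """Pick the single (gene, threshold) split with the best training accuracy."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            for high_label in (0, 1):
                preds = [high_label if r[f] > t else 1 - high_label for r in rows]
                hits = sum(p == y for p, y in zip(preds, labels))
                if best is None or hits > best[0]:
                    best = (hits, f, t, high_label)
    _, f, t, high_label = best
    return lambda r: high_label if r[f] > t else 1 - high_label


def score_hand(expression, labels, hand, k=10):
    """k-fold cross-validated accuracy of a stump trained on the genes in `hand`."""
    n = len(labels)
    rows = [[expression[g][i] for g in hand] for i in range(n)]
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th sample is held out
        train_rows = [rows[i] for i in range(n) if i not in test_idx]
        train_labels = [labels[i] for i in range(n) if i not in test_idx]
        model = train_stump(train_rows, train_labels)
        correct += sum(model(rows[i]) == labels[i] for i in test_idx)
    return correct / n


# Invented data: 20 patients, label 1 = survived 10 years (hypothetical).
labels = [0] * 10 + [1] * 10
expression = {
    "GENE_X": [float(i) for i in range(20)],     # tracks survival closely
    "GENE_Y": [float(i % 2) for i in range(20)], # unrelated to survival
}

player_score = score_hand(expression, labels, ["GENE_X"])
barney_score = score_hand(expression, labels, ["GENE_Y"])
player_wins = player_score > barney_score
print(player_score, barney_score, player_wins)
```

Because each hand is scored on held-out samples, a gene that merely fits the training folds by chance earns little, which is the point of scoring hands by cross-validation rather than by training accuracy.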

Results so far

In the first week that the game was live (Sept. 7-14, 2012), more than 120 players registered and collectively played more than 2000 hands. 60% of the players came from the U.S., 30% from China and the rest arrived from all over the world. Nearly half of the players have PhDs. While it is too early to tell whether this approach will be a contest winner, we have already used it to identify several small gene sets that have significant predictive power (far better than random). We also know that some of the players are having a great time. One of the top players wrote this about the game: “This is a wonderful game, which can give me happiness and knowledge at the same time.” Whether or not this game can manage to move the bar forward in cancer prognosis, it seems that it was already worth the effort to create it. Play now at !

About the creators of ‘The Cure’

The Scripps Research Institute team HIVE members include: Benjamin Good, Max Nanis, Salvatore Loguercio, Chunlei Wu, Ian MacLeod and Andrew Su. Their research explores applications of crowdsourcing in biology such as the Gene Wiki, BioGPS, and an emerging collection of serious games. ‘The Cure’ is specifically focused on accumulating knowledge that can be translated into good performance on the BCC challenge.

Model From “Attractor Metagenes” Team Tops The September 1 Leaderboard!

Please join all of us at DREAM and Sage Bionetworks in congratulating the “Attractor Metagenes” team for their September 1 Leaderboard Winner Achievement. Please read on to hear from Wei-Yi Cheng, who submitted the winning model on behalf of his team. The BCC Support Team

Dear fellow BCC challenge participants and organizers,

This is Wei-Yi Cheng, along with my teammates Tai-Hsien Ou Yang and Professor Dimitris Anastassiou at Columbia University. It is our great honor to be highlighted as the top team on September 1st in the competition. Tai-Hsien and I are currently Ph.D. students in Prof. Anastassiou’s Genomic Information Systems Laboratory (GISL) and the three of us have recently been working extensively to develop prognostic models in this challenge. I would like to thank the organizers for giving me the opportunity to present ourselves, and the ideas that we have been using.

The main topic of my thesis will be the discovery of biomolecular mechanisms in cancers using an iterative computational process that converges to what we call “attractor metagenes” or just “attractors.” Contrary to other methods of finding modules of co-expressed genes, the attractor methodology is totally unconstrained so it can point to the core genes of the biomolecular event that it represents. Remarkably, we found that some of these attractors are present in nearly identical form in all cancer types that we tried, suggesting that they represent universal mechanisms. We like to think of two of these attractors as “bioinformatic hallmarks of cancer.” We call them the “mesenchymal transition attractor” and the “mitotic chromosomal instability (CIN) attractor.” We believe that they reflect universal biological mechanisms empowering cancer cells to invade surrounding tissues and to divide uncontrollably, respectively. They are also strongly associated with tumor stage and grade, respectively, as well as other phenotypes. We also found many other attractors, including amplicons, particularly one prominent universal amplicon at chr8q24.3, and some attractors that are cancer-type specific, such as the estrogen receptor attractor.  For information about the underlying algorithm and additional results please see our preprint in
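The idea of an iterative process that converges to an attractor can be illustrated with a minimal Python sketch. It is not the algorithm from the R package or the preprint: it assumes Pearson correlation as the association measure, an arbitrary exponent of 5 to sharpen the weights, and made-up expression data, and the names `pearson` and `find_attractor` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def find_attractor(expression, seed, exponent=5.0, tol=1e-6, max_iter=100):
    """Iterate from a seed gene until the gene weights stop changing."""
    genes = sorted(expression)
    n_samples = len(expression[seed])
    metagene = list(expression[seed])  # start from the seed gene itself
    weights = None
    for _ in range(max_iter):
        # Weight each gene by its non-negative association with the metagene.
        new_weights = {
            g: max(pearson(expression[g], metagene), 0.0) ** exponent
            for g in genes
        }
        total = sum(new_weights.values())
        if total == 0:
            weights = new_weights
            break
        # The next metagene is the weighted average of all genes.
        metagene = [
            sum(new_weights[g] * expression[g][i] for g in genes) / total
            for i in range(n_samples)
        ]
        converged = weights is not None and all(
            abs(new_weights[g] - weights[g]) < tol for g in genes
        )
        weights = new_weights
        if converged:
            break
    return weights, metagene


# Made-up data: three co-expressed genes and one unrelated gene, six samples.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "GENE_B": [1.1, 2.0, 2.9, 4.2, 5.1, 5.9],
    "GENE_C": [0.9, 2.2, 3.1, 3.8, 5.0, 6.1],
    "GENE_D": [6.0, 1.0, 5.0, 2.0, 4.0, 3.0],
}

weights, metagene = find_attractor(expression, seed="GENE_A")
for gene in sorted(weights):
    print(gene, round(weights[gene], 3))
```

In this toy run the co-expressed genes end up with weights close to 1 and the unrelated gene with weight 0, so the top-weighted genes play the role of the "core genes" of the event. No gene set is supplied in advance; the only input besides the data is the seed.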

In our models, we use the attractor metagenes for survival prediction. The mitotic CIN metagene is the most prognostic, but the other ones provide significant additional help. We think that using such metagenes representing biomolecular events is preferable compared to using individual genes or classification into subtypes. For example, one of the features of our top model as of September 1st (#118304) is the replacement of the PAM50 molecular subtype classifier by three of our attractor metagenes: the mitotic CIN attractor, the estrogen receptor attractor, and a chr7p11.2 amplicon involving EGFR. We do not claim that these three metagenes contain all the information in PAM50. But we believe that the effort to discover mutually exclusive “subtypes” of cancer (not just in breast cancer but in all types of cancer) may have done the community a disservice. Instead, we think that simply focusing on precise biomolecular events will lead to better understanding of the underlying mechanisms. For example, although similar subtypes have been identified across cancer types, this similarity has not been strong enough to infer that it reflects the same biological event. In contrast, the attractor metagenes are found to be nearly identical across cancer types.

In our submission we used the Akaike information criterion (AIC) on the Cox regression model to select other clinical features, and included a GBM model fed with relevant clinical features and several other attractor metagenes. The R package for finding attractor metagenes is available under Synapse ID syn1123167.

I would like to express my appreciation and admiration to Adam, Erhan, Thea, and all the other challenge organizers, as well as everyone who contributed funding, data, or infrastructure to make this challenge possible. The design and implementation of the challenge provided an open and transparent environment for us to know how we are doing, and to learn from others at the same time. We believe that such an open-source environment can really help push innovations further for better applications of bioinformatic tools. It will be rewarding if this wonderful and worthy collective effort leads to an improvement in the prognosis of this devastating disease. May the best model win!

Wei-Yi Cheng

Graduate Research Assistant, Ph.D. Candidate in Electrical Engineering

Genomic Information Systems Laboratory,

Columbia University

Launch of clearScience

We have always envisioned that a key use of the Synapse platform would be to make it easier for researchers to record and document their scientific work in a manner that is easily understandable by others.  This vision just got a boost today with the launch of the clearScience initiative led by Brian Bot and Erich Huang of Sage Bionetworks.  Backed by a generous grant from the Sloan Foundation, the initiative will engage scientific publishers for their ideas on how Synapse can best support scientific communication.  Over the next several months Bot and Huang will be speaking to editors of prominent scientific journals, engaging researchers in a variety of fields, and presenting their evolving ideas at conferences including the upcoming Strata Rx.  And to support the clearScience initiative, Sage’s software team will be accelerating Synapse development for provenance capture and visualization as well as creating scientific narratives that link directly to content managed by Synapse.

If I had a billion dollars…

Guest post from Michael Kellen on incentives and competitions

Science, Reengineered

In an apparently recurring theme, my thoughts again are running to the incentives that drive human behavior, this time inspired by the recent news that the Russian billionaire Yuri Milner has established a new $3 Million Fundamental Physics Prize.  He’s actually awarded 9 of these prizes for a cool $27M promoting the efforts of theoretical physics.  Certainly that kind of money and publicity could drive a lot of attention to the field, and I love the fact that we now almost have a basketball team’s worth of physicists who almost make a basketball player’s salary.

However, is this the best way to spend $27M to shake up and rally support for science? Of course, Mr. Milner is free to spend his money any way he wishes, but I see some potential problems with his approach. Quoting from the NY Times article referenced above: “Mr. Milner personally selected the inaugural…
