Can the crowd provide ‘The Cure’?

The participants in Sage Bionetworks’ Breast Cancer Prognosis Challenge keep surprising and delighting us! Here is a guest post from Benjamin Good, a member of “Team Hive” and a participant in our Challenge. His post is about the online game that he and his team launched a few weeks ago to crowdsource ideas that Team Hive can then use to build great models for the Challenge. Please enjoy Ben’s post and try out his cool game!

Serious Games

Serious games are games that have an underlying purpose. When you play a game like Foldit or Phylo, you are entertained as with any other game, but your actions also translate into a useful end product. In Foldit, you contribute to protein structure determination; in Phylo, to multiple sequence alignment. Reconstituting difficult or time-consuming tasks into components of games opens up a new way to find and motivate volunteer contributors at potentially massive scale. Like the Sage competitions themselves, serious games provide the opportunity to focus widespread community attention on particular challenging problems.

The Cure

The purpose of the game The Cure is to identify sets of genes that can be used to build predictors of breast cancer prognosis that will stand up to validation. The hypothesis is that we can outperform purely data-driven approaches by infusing our gene selection algorithms with the biological knowledge and reasoning abilities of hundreds or even thousands of players. This biological insight is captured through a simple, fun two-player card game where each card corresponds to a gene.

In its current form, the game consists of a series of 100 boards, each containing 25 distinct genes from a precomputed list of interesting genes. In the game, you compete with the computer opponent Barney to find the set of 5 genes from each board that form the best predictors of 10-year survival. In each turn, the player takes a gene card off the board and puts it in their hand. To inform these decisions, extensive annotation information from resources such as the Gene Ontology, Entrez Gene and PubMed is provided through the game interface, and players are free to conduct their own research.

As each card is added to a player’s hand, a decision tree is constructed automatically using the genes in the hand and the training dataset from the Sage Bionetworks / DREAM challenge. The tree is shown to the player and the hand is scored based on the performance of the decision tree algorithm, coupled with those genes, in a 10-fold cross-validation test. If the player produces a better gene set than Barney, they score points based on the cross-validation score.

Play begins with a very short training stage that teaches the mechanics of the game as players select features to use to build an animal classifier. Once this stage is passed, players are free to choose which of the gene boards to play. Each board is shown with an indication of how many other players have already defeated it. Once a certain number of players have finished a board, we declare it complete and close it off to encourage the player population to explore the entire board space. The first collection of boards is nearly complete and the next level should be available soon.
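For readers who want to see what scoring a hand by cross-validation looks like, here is a minimal sketch in Python. It is not the game's actual code. To keep it dependency-free, it swaps the decision tree for a single-split "stump" (one gene, one threshold), and it runs on made-up toy data where gene 0 tracks the outcome and gene 1 is noise. The names `train_stump` and `score_hand` are hypothetical.

```python
import random

def train_stump(rows, labels, hand):
    """Pick the (gene, threshold) split with the best training accuracy."""
    best = (-1.0, hand[0], 0.0)  # (accuracy, gene index, threshold)
    for gene in hand:
        values = sorted(set(row[gene] for row in rows))
        # Candidate thresholds are midpoints between neighbouring values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            hits = sum((row[gene] > threshold) == bool(y)
                       for row, y in zip(rows, labels))
            accuracy = hits / len(rows)
            if accuracy > best[0]:
                best = (accuracy, gene, threshold)
    return best[1], best[2]

def score_hand(rows, labels, hand, k=10, seed=0):
    """Mean held-out accuracy of the stump over k cross-validation folds."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in order if i not in held]
        gene, threshold = train_stump([rows[i] for i in train],
                                      [labels[i] for i in train], hand)
        hits = sum((rows[i][gene] > threshold) == bool(labels[i])
                   for i in held_out)
        accuracies.append(hits / len(held_out))
    return sum(accuracies) / k

# Toy data (an assumption for illustration, not challenge data):
# label 1 stands in for "survived 10 years".
rng = random.Random(42)
labels = [i % 2 for i in range(100)]
rows = [[y * 0.8 + rng.uniform(0.0, 0.2), rng.random()] for y in labels]

informative = score_hand(rows, labels, hand=[0])  # gene 0 tracks the label
noise = score_hand(rows, labels, hand=[1])        # gene 1 is random
print(informative, noise)
```

In this toy setup a hand holding the informative gene scores far higher than a hand holding only the noise gene, which is the kind of comparison that decides whether a player beats Barney.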

Results so far

In the first week that the game was live (Sept. 7-14, 2012), more than 120 players registered and collectively played more than 2000 hands. 60% of the players came from the U.S., 30% from China and the rest arrived from all over the world. Nearly half of the players have PhDs. While it is too early to tell whether this approach will be a contest winner, we have already used it to identify several small gene sets that have significant predictive power (far better than random). We also know that some of the players are having a great time. One of the top players wrote this about the game: “This is a wonderful game, which can give me happiness and knowledge at the same time.” Whether or not this game can manage to move the bar forward in cancer prognosis, it seems that it was already worth the effort to create it. Play now at !

About the creators of ‘The Cure’

The Scripps Research Institute team HIVE members include: Benjamin Good, Max Nanis, Salvatore Loguercio, Chunlei Wu, Ian MacLeod and Andrew Su. Their research explores applications of crowdsourcing in biology such as the Gene Wiki, BioGPS, and an emerging collection of serious games. ‘The Cure’ is specifically focused on accumulating knowledge that can be translated into good performance on the BCC challenge.


Model From “Attractor Metagenes” Team Tops The September 1 Leaderboard!

Please join all of us at DREAM and Sage Bionetworks in congratulating the “Attractor Metagenes” team for their September 1 Leaderboard Winner Achievement. Please read on to hear from Wei-Yi Cheng, who submitted the winning model on behalf of his team.

The BCC Support Team

Dear fellow BCC challenge participants and organizers,

This is Wei-Yi Cheng, along with my teammates Tai-Hsien Ou Yang and Professor Dimitris Anastassiou at Columbia University. It is our great honor to be highlighted as the top team on September 1st in the competition. Tai-Hsien and I are currently Ph.D. students in Prof. Anastassiou’s Genomic Information Systems Laboratory (GISL) and the three of us have recently been working extensively to develop prognostic models in this challenge. I would like to thank the organizers for giving me the opportunity to present ourselves, and the ideas that we have been using.

The main topic of my thesis will be the discovery of biomolecular mechanisms in cancers using an iterative computational process that converges to what we call “attractor metagenes” or just “attractors.” Contrary to other methods of finding modules of co-expressed genes, the attractor methodology is totally unconstrained so it can point to the core genes of the biomolecular event that it represents. Remarkably, we found that some of these attractors are present in nearly identical form in all cancer types that we tried, suggesting that they represent universal mechanisms. We like to think of two of these attractors as “bioinformatic hallmarks of cancer.” We call them the “mesenchymal transition attractor” and the “mitotic chromosomal instability (CIN) attractor.” We believe that they reflect universal biological mechanisms empowering cancer cells to invade surrounding tissues and to divide uncontrollably, respectively. They are also strongly associated with tumor stage and grade, respectively, as well as other phenotypes. We also found many other attractors, including amplicons, particularly one prominent universal amplicon at chr8q24.3, and some attractors that are cancer-type specific, such as the estrogen receptor attractor.  For information about the underlying algorithm and additional results please see our preprint in

In our models, we use the attractor metagenes for survival prediction. The mitotic CIN metagene is the most prognostic, but the other ones provide significant additional help. We think that using such metagenes representing biomolecular events is preferable compared to using individual genes or classification into subtypes. For example, one of the features of our top model as of September 1st (#118304) is the replacement of the PAM50 molecular subtype classifier by three of our attractor metagenes: the mitotic CIN attractor, the estrogen receptor attractor, and a chr7p11.2 amplicon involving EGFR. We do not claim that these three metagenes contain all the information in PAM50. But we believe that the effort to discover mutually exclusive “subtypes” of cancer (not just in breast cancer but in all types of cancer) may have done the community a disservice. Instead, we think that simply focusing on precise biomolecular events will lead to better understanding of the underlying mechanisms. For example, although similar subtypes have been identified across cancer types, this similarity has not been strong enough to infer that it reflects the same biological event. In contrast, the attractor metagenes are found to be nearly identical across cancer types.

In our submission we used AIC on the Cox regression model to select other clinical features, and included a GBM model fed with relevant clinical features and several other attractor metagenes. The R package for finding attractor metagenes is available under Synapse ID syn1123167.
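As a rough illustration of AIC-based feature selection (not the team's actual code), the sketch below compares candidate models using AIC = 2k - 2 ln(L), where k is the number of fitted coefficients and L is the maximized likelihood (the partial likelihood, for a Cox model). The feature names and log-likelihood values are hypothetical toy numbers, and fitting the Cox models themselves is assumed to have happened elsewhere.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2*ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical partial log-likelihoods of already-fitted Cox models,
# keyed by the clinical features each model uses (toy numbers only).
candidates = {
    ("age",): -512.4,
    ("age", "tumor_size"): -505.1,
    ("age", "tumor_size", "lymph_nodes"): -498.9,
    ("age", "tumor_size", "lymph_nodes", "site"): -498.6,
}

scores = {feats: aic(ll, len(feats)) for feats, ll in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores[best])
```

With these toy numbers the fourth feature improves the fit too little to pay for its extra parameter, so the three-feature model has the lowest AIC.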

I would like to express my appreciation and admiration to Adam, Erhan, Thea, and all the other challenge organizers, as well as everyone who contributed funding, data, or infrastructure to make this challenge possible. The design and implementation of the challenge provided an open and transparent environment in which we could see how we were doing and learn from the others at the same time. We believe that such an open-source environment can really help push innovations further for better applications of bioinformatic tools. It will be rewarding if this wonderful and worthy collective effort leads to an improvement in the prognosis of this devastating disease. May the best model win!

Wei-Yi Cheng

Graduate Research Assistant, Ph.D. Candidate in Electrical Engineering

Genomic Information Systems Laboratory,

Columbia University

A guest post from Mike Kellen on use of AWS by Synapse

Science, Reengineered

Recently, I sat down with Jeff Barr on the AWS report to discuss how we’ve used various Amazon services throughout our architecture while developing Synapse. In the interview, I discussed how Synapse uses RDS (MySQL) as our back-end database, Elastic Beanstalk to host our service and web hosting tiers, CloudSearch to provide search across all Synapse content, and Simple Workflow to manage distributed scientific workflows (see also our AWS case study). The decision to rely heavily on Amazon as an infrastructure provider for our project was based on the belief that hosted infrastructure was the way of the future, and that it was best to build technology with that future in mind, assuming that services that were still early stage would mature along with our own work. Despite a few of the hiccups associated with adopting early-stage technology, I’m still pretty pleased with the decision…
