DREAM’s First Ever Hackathon! AML Challenge Organizers Run Hackathon to Foster Collaboration

Although the DREAM Challenges do an excellent job of bringing together researchers from several different areas of science and several different institutions to work on the same problem, the competitive setting provides little incentive for these great minds to work together. While the cornerstone of crowdsourcing is in fact the application of several different approaches to the same issue, the organizers of the Acute Myeloid Leukemia (AML) Outcome Prediction DREAM Challenge would still like to see participants working on similar or complementary approaches collaborating and possibly even joining teams. To encourage these partnerships we decided to hold the first ever DREAM Hackathon! The Hackathon took place on July 26-27 at Rice University and was simultaneously broadcast over the web. The Hackathon was designed to catalyze conversations about the AML Challenge in two ways:

First, we wanted to use the Hackathon to encourage Challenge participants to share a little about their general approach to model building. While we wanted participants to present their ideas, we were also mindful that the AML Outcome Prediction DREAM Challenge is a competition, and we didn’t want participants to present their approach at a level of detail that could make their methodology available to others. Considering this fine line, we encouraged participants to present only as much as they were comfortable with: no details about the approach or efficacy of their methods had to be presented.

For this part of the Hackathon, two Challenge teams signed up and presented an overview of their model-building methodologies via live webinar to other Challenge participants around the world. Both teams had excellent presentations and received some valuable feedback from Hackathon-Challenge participants! While both presenters had very good approaches, it was clear that Hackathon spectators with different expertise and “fresh eyes” had numerous good ideas on how these two teams could improve even further. In the end, while the turnout for this section of the Hackathon was low, the exercise was very constructive and the main goal was achieved for those who participated.

The second goal we had for the Hackathon was to invite a few “experts in the field” (in both model-building methodology and approaches to predicting AML outcome) to present during the Hackathon in order to get participants discussing new ideas that could help their model-building efforts. These talks, along with their Q&A sessions, were very constructive, particularly the one by Dr. Kenneth Hess from MD Anderson Cancer Center, who presented on general statistical analysis of survival time data tailored to the DREAM 9 dataset. Given the diverse nature of these talks, we believe participants of the AML Outcome Prediction DREAM Challenge were able to gain insight into different approaches that could be incorporated into their methods. These talks could also give them different perspectives on the Challenge, opening new horizons not considered before. These presentations were recorded and are available on the Synapse website (https://www.synapse.org/#!Synapse:syn2455683/wiki/64687).

Overall I believe this was a really successful first DREAM Hackathon! We will continue to follow up with the participants about what they liked or didn’t like about this event, and we are open to ideas on how we can improve. With your help we hope to make this event more and more successful in subsequent editions of the DREAM Challenges.

Thank you,
André Schultz: member of the AML DREAM Challenge Organizing Team

 

Teams WarwickDataScience and UT_CCB top the real-time leaderboard for NIEHS-NCATS-UNC DREAM Toxicogenetics sub-Challenge 1

Below are blogposts from Teams WarwickDataScience and UT-CCB, the two top-scoring teams for sub-Challenge 1 of the NIEHS-NCATS-UNC DREAM Toxicogenetics Challenge.  Sub-Challenge 1 asked participants to model cytotoxicity across cell lines based on genomic information.  We were so excited to see 49 teams submit nearly 1000 models for scoring to the real-time leaderboard that was open for 6 weeks (you can see the final leaderboard for this sub-Challenge here: https://www.synapse.org/#!Synapse:syn1761567/WIKI/56154)!

We awarded $250 to the top two teams with the highest mean rank as determined by RMSE and by Pearson Correlation (four total prizes).  Below are blogposts from the top-scoring teams: Rich Savage from Team WarwickDataScience and Yonghui Wu from UT_CCB both share a little about their respective teams and their winning models.

Team WarwickDataScience Blogpost

WHO IS YOUR TEAM?

Our team (WarwickDataScience) consists of Rich Savage and Jay Moore.  We’re based at the University of Warwick’s Systems Biology Centre (in the UK), where Rich is an associate professor and Jay leads the bioinformatics engineering group.  Rich also holds a joint appointment with the Warwick Medical School.

WHY DID YOU JOIN THE CHALLENGE?

We joined the DREAM Tox challenge because we thought it was an interesting scientific problem, and also because we have both developed interests in data science challenges of this nature and were keen to start a data science team here at Warwick.

WHAT WAS YOUR WINNING MODEL?

Over the course of the Challenge we identified several considerations that were demonstrably important to building high-scoring models.  Our basic approach was to build a regression model for each compound, using a Random Forest model, then to couple these together to do multi-task learning.  We experimented with various other regression models, but we got good performance from Random Forests and it was very fast and easy to use, leaving us more time to focus on other aspects of the modelling.

Data-wise, we used some of the covariate features (population, gender) along with a small number of informative SNPs for each of the 106 compounds.  Finding the SNPs was a significant computational challenge, which we ended up solving using a fast, if somewhat quick-and-dirty, approach to GWAS analysis.  Our initial work used only the X-chromosome SNPs, but we obtained similar results using any single chromosome, or combination of chromosomes.  We speculate that this may be because we’re finding some widespread genomic signature, but we don’t (yet) have good evidence to confirm this.  We also experimented with the RNAseq data, and with a microarray data set sourced externally to the challenge, but ran out of time with this. From cross-validation we think these data may also have helped.

It was clear to us from early in the Challenge that some of the compounds had highly similar toxicity results.  We therefore coupled our individual regression models together by using their predictions as input to a second set of Random Forest classifiers.  This gave us a fast, effective way of sharing information between the regression models for different compounds.

Finally, discussions on the forum and in the webinar made it clear that there was a difference in distribution between the training set and the leaderboard test set.  We decided to try a simple scheme to correct for this bias by identifying training items from the tails of the target distributions and adding duplicate examples to the training set.  While in some ways this is a bit of a hack, we felt it was a reasonable thing to do, given the way Random Forest works.  It resulted in a significant boost to our final scores.
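The tail-duplication idea can be sketched in a few lines of Python. This is only an illustration: the function name, the 10% tail fraction, the number of copies and the toy data are all assumptions, since the post does not say which thresholds the team used.

```python
def duplicate_tails(features, targets, tail_fraction=0.1, copies=1):
    """Return training data with extra copies of the items in both tails
    of the target distribution.

    tail_fraction and copies are hypothetical tuning knobs, not values
    taken from the post.
    """
    n = len(targets)
    k = max(1, int(n * tail_fraction))               # items per tail
    order = sorted(range(n), key=lambda i: targets[i])
    tail_idx = order[:k] + order[-k:]                # lowest-k and highest-k targets
    new_x = list(features)
    new_y = list(targets)
    for _ in range(copies):
        for i in tail_idx:
            new_x.append(features[i])
            new_y.append(targets[i])
    return new_x, new_y


# Toy data: ten cell lines, one feature each, and a cytotoxicity-like target.
X = [[float(i)] for i in range(10)]
y = [0.1, 0.5, 0.4, 0.45, 0.55, 0.5, 0.6, 0.48, 0.52, 0.95]
X_aug, y_aug = duplicate_tails(X, y, tail_fraction=0.1)
```

Because each tree in a Random Forest is trained on a bootstrap sample, duplicated items are drawn more often, which effectively up-weights the tails.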

WHAT HAVE YOU LIKED ABOUT THE DREAM CHALLENGE/HOW WOULD YOU LIKE TO SEE IT EVOLVE IN THE NEXT SEASON?

We like that it was a challenging problem, particularly trying to use the SNPs.  We also liked that it was focused on a real-world problem.

Next year, it would be nice if the leaderboard and final test predictions were part of a combined submission, as is the case in many other challenges (e.g., Kaggle, Netflix).  For example, the test set can be randomly subdivided by the organizers, with half the items being used to compute leaderboard performance, and the other half being used for the final scoring.  The user simply submits predictions for all test items, without knowing the random partition.  We’d prefer this to the retraining stage, which can be a bit fiddly and error-prone (for example, in the Breast Cancer challenge last year, many final models did not evaluate correctly after the final retraining).  It’s not clear to us that we gain much more scientific understanding from the retraining process.
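For illustration, the proposed partition could look roughly like this on the organizers' side. The function names, the seed and the item identifiers are hypothetical, and RMSE is used only because it is one of the metrics mentioned for this sub-Challenge.

```python
import random


def split_test_set(item_ids, seed):
    """Randomly assign half of the test items to the leaderboard and the
    other half to final scoring. The seed would be kept private."""
    rng = random.Random(seed)
    shuffled = list(item_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return set(shuffled[:half]), set(shuffled[half:])


def rmse(predictions, truth, ids):
    """Root-mean-square error restricted to the given subset of item ids."""
    ids = list(ids)
    return (sum((predictions[i] - truth[i]) ** 2 for i in ids) / len(ids)) ** 0.5


# Hypothetical test items; participants submit predictions for all of them.
item_ids = [f"cell_line_{i}" for i in range(20)]
leaderboard_ids, final_ids = split_test_set(item_ids, seed=2013)
```

A single submission is then scored twice, once on `leaderboard_ids` during the Challenge and once on `final_ids` at the end, so no retraining step is needed.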

Team UT_CCB Blogpost

Dear fellow NIEHS-NCATS-UNC-Dream Challenge participants and organizers,

This is Yonghui Wu, representing the participating team from the Center for Computational Biomedicine (CCB) at The University of Texas School of Biomedical Informatics at Houston (UT-SBMI). Our team consists of two postdoctoral fellows (myself and Yunguo Gong), a research assistant (Liang-Chin Huang), and five CCB faculty members: Drs. Jingchun Sun, Jeffrey Chang, Trevor Cohen, W. Jim Zheng, and Hua Xu, who directs the CCB. It is our great honor to be highlighted as a top team on September 4th in the 2013 NIEHS-NCATS-UNC DREAM Toxicogenetics Challenge. Two of our submissions, by myself and my teammate Liang-Chin Huang, were ranked second in mean rank and RMSE, respectively.

At CCB, we are working on various projects to develop advanced informatics methods and tools to support biomedical research and clinical practice. One of the focused areas is translational bioinformatics. We are actively working on data extraction and analysis approaches to bridge clinical practice data with biomedical experimental data, thus to facilitate drug discovery and to promote personalized medicine. We are interested in not only identifying genetic variations that are associated with drug responses (pharmacogenomics), but also building models that combine both genomic and clinical variables to predict drug responses, which is exactly what this challenge does.

In this Challenge, we examined different groups of features, including the covariate data, the compound chemical features, the mRNA expression data, and the SNP data. The critical step in this Challenge is to extract, from a large amount of data, the features that really affect the compound responses. We used different statistical tools to determine the features most strongly associated with the compound responses. In our top submissions (Syn ID#2024493 and Syn ID#2024471), we investigated different regression models, including the traditional linear regression model, the logistic regression model, the lasso penalized regression model, and the Elastic-net model. We also tried different strategies to combine these models to improve the prediction.

This is the first time that we could test our models on a real-time leaderboard, and we learned a lot from others during the Challenge. I would like to thank the Challenge organizers, as well as everyone who contributed with funding, data, or infrastructure, to make this Challenge possible. This Challenge gave us a great chance to test our computational methods on real genetic data sets.

Yonghui Wu

Postdoc Research Fellow

Center for Computational Biomedicine

The University of Texas School of Biomedical Informatics at Houston

HPN Challenge teams HDSystems, IPNet and Tongii write about their winning models for Sub-Challenges 1A and 1B

Below are blogposts from three winning teams (HDSystems, IPNet and Tongii) from sub-challenges 1A and 1B of the Heritage Provider Network DREAM Breast Cancer Network Inference Challenge.  Sub-Challenge-1A asked participants to work with experimental time-course breast cancer proteomic data to infer a causal network. Sub-Challenge-1B asked participants to work with in silico time-course data generated from a state-of-the-art dynamical model of signaling to also construct a causal network. We awarded cash prizes to the first three teams within each sub-Challenge with a model that scored 2 standard deviations above the null model.  Below, three of the winning teams (HDSystems for sub-Challenge-1A and both IPNet and Tongii for sub-Challenge-1B) share a little about themselves and the rationale behind their winning models.

You can check out the current leaderboards for these sub-Challenges here:

Sub-Challenge-1A: https://www.synapse.org/#!Synapse:syn1720047/WIKI/56830

Sub-Challenge-1B: https://www.synapse.org/#!Synapse:syn1720047/WIKI/56850

Team HD Systems (syn ID2109051)

Dear fellow HPN challenge participants and organizers,

This is Ruth Großeholz, along with Oliver Hahn and Michael Zengerling at Ruprecht-Karls University in Heidelberg, Germany. We are happy to be one of the winning teams of the sub-Challenge 1-A leaderboard incentive prize for the experimental network and would like to thank the organizers for the opportunity to introduce ourselves and our ideas.

The three of us are Master’s students in Professor Ursula Kummer’s group for Modeling of Biological Processes at Bioquant Heidelberg, participating in this challenge to gain working experience with network inference. For us, the Challenge offers a chance to work outside of the controlled conditions of a practical course and expand our methodological knowledge. Before this Challenge, network inference was uncharted territory for us, as it is not covered in our Master’s program. So far, it has been a great experience to work with such a rich data set.

Since we come from a biological background and know from a number of practical courses that a model is only as good as the information backing it, our idea was to build one model including all the edges required for this challenge, using extensive literature research on the roles and interactions of all the given proteins in cellular signaling. Even though the cell lines differ quite drastically from one another, we felt that having one basic model that describes signaling in a healthy cell would be a good starting point. Only after we had placed all proteins within our universal network did we start to tailor the models to their respective cell lines and growth factors. In our primary network all edges had a score of 1, which we later adjusted according to the dynamics of both source and target.

We thank Thea, Laura and all the other challenge organizers, as well as everyone who contributed to making this challenge possible. The leaderboard not only provided a way to get feedback during the challenge but also gave the challenge a more competitive character.

Team HD Systems

Oliver Hahn, Michael Zengerling & Ruth Großeholz

Master Students, Major Systems Biology

Modelling of Biological Processes

Ruprecht-Karls University, Heidelberg

 

Team IPNet (syn ID2023386)

It is a great honor for us to be highlighted among the top-scoring models in the HPN-DREAM Subchallenge 1B on the August 7th leaderboard. We are very thankful that the organizers have given us the opportunity to present ourselves and to introduce our model.

We are a team that started at the Institute for Medical Informatics and Biometry at the TU Dresden in Germany. The team is composed of myself (Marta Matos), Dr. Bettina Knapp, and Prof. Dr. Lars Kaderali.

The model we used in the HPN-DREAM Challenge was developed during my master’s thesis [1], under the supervision of Dr. Bettina Knapp in the group of Prof. Dr. Lars Kaderali. The model is an extension of the approach previously developed by Knapp and Kaderali [2] which is available as a bioconductor software package [3]. It is based on linear programming and it infers signaling networks using perturbation data. In particular, it was designed to take advantage of RNA interference experiments in combination with steady-state expression measurements of the proteins of interest. In my master’s thesis we expanded this model to take advantage of perturbation time-series data to improve the prediction of causal relations between proteins. Therefore, the HPN-DREAM Subchallenge 1B is an excellent opportunity to evaluate the performance of the extended model on time-series data after different perturbations.

In our approach the signal is modeled as an information flow which starts at the source nodes and propagates downstream along the network until it reaches the sink nodes. A node which is not active at a given time step interrupts the flow, and we assume that the signal cannot propagate to its child nodes. To distinguish between active and inactive nodes we use a thresholding approach. The choice of the threshold has a big influence on the model performance. For the HPN-DREAM Subchallenge 1B, we got our best results when using the value of each node at the first time point to discretize the data. The underlying assumption is that the network nodes are in an inactive state at t=0, since they have not yet been stimulated.
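The discretization and the flow assumption can be illustrated with a short Python sketch. It is not the linear programming model itself: the node names, the toy values and the "strictly above the t=0 value" rule are assumptions made only to show the idea.

```python
def discretize(time_series):
    """Mark a node active at time t if its value exceeds its own value
    at the first time point (the t = 0 baseline)."""
    active = {}
    for node, values in time_series.items():
        threshold = values[0]
        active[node] = [v > threshold for v in values]
    return active


def reachable(edges, sources, active_at_t):
    """Nodes the signal reaches at one time step. An inactive node blocks
    the flow, so the signal never passes through it to its children."""
    reached = set()
    frontier = [s for s in sources if active_at_t[s]]
    while frontier:
        node = frontier.pop()
        if node in reached:
            continue
        reached.add(node)
        for child in edges.get(node, []):
            if active_at_t[child]:
                frontier.append(child)
    return reached


# Hypothetical three-node chain: receptor -> kinase -> target.
data = {
    "receptor": [1.0, 2.0, 2.5],
    "kinase": [0.5, 0.4, 0.9],
    "target": [0.2, 0.6, 0.7],
}
edges = {"receptor": ["kinase"], "kinase": ["target"]}
active = discretize(data)
step1 = {node: flags[1] for node, flags in active.items()}
step2 = {node: flags[2] for node, flags in active.items()}
reached_step1 = reachable(edges, ["receptor"], step1)
reached_step2 = reachable(edges, ["receptor"], step2)
```

At the second time point the kinase is still below its baseline, so the signal stops at the receptor even though the target is above its own baseline; at the third time point all three nodes are reached.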

What we like most about the DREAM Challenge is that it allows the comparison of different models in exactly the same setting and that it is possible to evaluate the performance of the models on real, yet unpublished, data. Furthermore, the Challenge makes it easier to learn from other researchers working in the same field and allows for the exchange of knowledge and expertise. This helps to improve the developed models and to answer complex biological questions in more detail. We thank the challenge organizers and all who contributed to making this competition possible.

[1]  Marta Matos. Network Inference: extension of a linear programming model for time-series data. Master’s thesis, University of Minho, 2013

[2] Bettina Knapp and Lars Kaderali. Reconstruction of cellular signal transduction networks using perturbation assays and linear programming. PLoS ONE, 8(7):e69220, 07 2013.

[3] Bettina Knapp, Johanna Mazur and Lars Kaderali (2013). lpNet: Linear Programming Model for Network Inference.  R package version 1.0.0.

 

Team Tongii (synID2024139)

This is Su Wang, along with my teammates Xiaoqi Zheng, Chengyang Wang, Yingxiang Li, Haojie Ren and Hanfei Sun at Tongji University, Shanghai, China. It is our honor that our model was highlighted as one of the top-scoring models for the HPN-DREAM Challenge. Xiaoqi is an associate professor at Shanghai Normal University; Yingxiang and I are PhD candidates; Chengyang and Hanfei are master’s students; Haojie just graduated from Nankai University. The diversity of our backgrounds gave us the courage to participate in this Challenge. We thank the organizers for giving us the chance to introduce our team and our model.

In our previous study, we focused on developing software to detect the target genes of transcription factors and chromatin regulators. We used a monotonically decreasing function of the distance between the binding site and the transcription start site to measure the contribution of a binding event to a gene, and combined it with differential expression data to obtain each factor’s direct target genes. We built a transcription network from these direct target relationships and tried to find the relationships among all the factors and the co-regulating pairs. We collected available pathways from KEGG and integrated them with our predicted results to corroborate the predicted relationships. Although the mechanism of protein phosphorylation is different from regulation between transcription factors, some models and ideas can still be applied to reconstructing the network.

In our model, we applied a Dynamic Bayesian Network to the training data. Combining its output with the mutual information between two genes, we used a simple rank average to obtain the causal relationships. This model works well because, firstly, time-series data can easily be used with a Dynamic Bayesian Network; secondly, the relationship between genes is not linear, so a correlation measure such as Pearson’s correlation is not suitable for capturing the information shared between two genes; and last but not least, the information inequality was applied to delete implausible edges, which gave our model better sensitivity and stability.

There is much room for improvement in our model. We thank the Challenge organizers for giving us the chance to present our model here.  The Challenge is very good for helping us to build a model to deal with a specific question, and the leaderboard is a very good platform for testing how good our model is and for helping us understand where we should focus to improve it. We believe that everyone learned a lot by doing this Challenge, both about the subject of the Challenge and in terms of their own skills. We hope the best-performing teams can share their models and learn from each other, so that we can all get the best model to solve more questions.
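A simple rank average of two edge scorings can be sketched as follows. The edge scores are made up, and the tie handling (tied scores share their average rank) is an assumption; the post does not describe how ties were treated or how the two scores were scaled.

```python
def ranks(scores):
    """Rank edges by score, 1 = strongest; tied scores share their average rank."""
    ordered = sorted(scores, key=lambda edge: scores[edge], reverse=True)
    result = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of edges tied with ordered[i].
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for edge in ordered[i:j + 1]:
            result[edge] = average_rank
        i = j + 1
    return result


def rank_average(scores_a, scores_b):
    """Average the per-method ranks of every edge (lower = more confident)."""
    ranks_a, ranks_b = ranks(scores_a), ranks(scores_b)
    return {edge: (ranks_a[edge] + ranks_b[edge]) / 2 for edge in scores_a}


# Hypothetical scores for three candidate edges from the two methods.
dbn_scores = {("A", "B"): 0.9, ("B", "C"): 0.2, ("A", "C"): 0.5}
mi_scores = {("A", "B"): 0.3, ("B", "C"): 0.1, ("A", "C"): 0.8}
combined = rank_average(dbn_scores, mi_scores)
```

Working with ranks rather than raw values avoids having to put the Dynamic Bayesian Network scores and the mutual information on a common scale.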

DREAM8 Whole Cell Parameter Estimation Challenge: Winners for Most Creative Method Share Their Ideas

We asked the Challenge teams that received the “Most Creative Method” prize in the Whole-cell Parameter Estimation DREAM8 Challenge to submit a short write-up explaining how they are working to solve the Challenge. Below are the descriptions from the top 3 teams: winner Team Whole-Sale Modelers, followed by Team Crux,  and Team newDREAM.

Team Whole-Sale Modelers

1. Summary of Overall Approach

The Challenge boils down to a complex, high-dimensional regression problem. We are asked to infer 15 perturbations that were delivered to a subset of 30 identified model parameters. These parameters (which can be thought of as dependent variables) are estimated based on large amounts of “high-throughput data.” In this document I describe the statistical techniques I have used to solve this problem. Importantly, the techniques I outline below are complementary to an analysis of the “sub-models”. That is, the general search strategy I describe can be constrained and thus improved if one were to gain insight from the sub-models.

I have written code to estimate the parameters of the whole model, given the high-throughput data that is generated for each simulation. This model can be stated very simply:

p = f(x)

where p is the estimated “perturbation vector” (which encodes the perturbation delivered to each of the 30 parameters of interest), f is a non-linear function, and x is a vector that contains all of the high-throughput data for the mutant model (provided freely to contest participants). The vector p has 30 elements, each of which represents the proportional change in the corresponding parameter (e.g. if the third parameter in the list, the kcat of Tmk, is halved, then the third element of p is equal to 0.5). In practice, I found that the above problem is intractable because the large number of variables in the high-throughput data leads to a very large vector x. Thus I performed a principal components analysis of the high-throughput data before fitting the model. This reduces the dimensionality of x to the order of 50 components.

The nonlinear function f is fitted by a collection of regression trees using the Random Forests technique. This is a popular technique in the field of machine learning. Its popularity stems from the fact that we do not need to have an initial guess for the form of the non-linear function f. Additionally, the algorithm cleverly avoids the over-fitting problem by probabilistically sampling from the training data. The random forest was fitted  based on the high-throughput data of 1128 whole cell simulations.

2. Improvements – Compressed Sensing

A substantial improvement, which I have not had the time to implement yet, would be to incorporate the constraint that the perturbation vector p is sparse. This piece of information is critical, and is widely studied in the context of  “compressed sensing”. Essentially, one should be able to improve the fit by penalizing the L1 norm of the estimated perturbation vector p (in theory, at least).  Additionally, I am in the process of running more simulations which I am now doing in triplicate. Averaging the high-throughput data across these replicates has the potential to improve the fit, since the stochasticity of the model can be quite influential (especially in terms of the data stored in  the variable “rxnFluxes”). In fact, I found some models that seemed to fit the data quite well, but failed when submitted to Bitmill. This was apparently due to trial-to-trial variability in the rxnFlux data, which was revealed by averaging over 8 trials.
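The two ideas in this section, an L1 penalty and replicate averaging, can be sketched in Python as below. This is not part of the submitted code. It assumes the perturbation is expressed as a deviation from the default value (zero meaning "unperturbed"), and the penalty weight and toy numbers are invented. Soft-thresholding is the standard closed-form solution of a one-dimensional least-squares problem with an L1 penalty.

```python
def soft_threshold(value, lam):
    """Minimizer of 0.5 * (z - value)**2 + lam * abs(z): shrink towards zero
    by lam and set anything smaller than lam exactly to zero."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def average_replicates(replicates):
    """Element-wise mean over replicate simulations of the same mutant."""
    n = len(replicates)
    return [sum(column) / n for column in zip(*replicates)]


# Hypothetical raw estimates of the deviation from default for four parameters.
raw = [-0.5, 0.04, -0.02, 0.3]
shrunk = [soft_threshold(v, 0.05) for v in raw]   # small entries become exactly 0.0

# Hypothetical triplicate measurements of two high-throughput features.
mean_features = average_replicates([[1.0, 2.0], [3.0, 4.0], [2.0, 3.0]])
```

Small, noisy deviations are set exactly to zero while large ones are only shrunk slightly, which matches the prior knowledge that only some of the 30 parameters were perturbed.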

3. Explanation of Code

Running the script  “s007 simplified random forest script” should generate some estimates of the perturbations to the cell. This script can be found in the “analysis” folder. I have added the prefix “alex ” to almost all of my written functions to distinguish them from Jonathan Karr’s original code. Due to severe time constraints, I have not been able to comment and clean up all of my code.  Full code can be found at: https://github.com/ahwillia/WholeCell

4. About the Team: Whole-Sale Modelers

  • Alex Williams: Alex Williams is a research technician in Eve Marder’s lab at Brandeis University. His research interests are in computational neuroscience. Alex’s work examines how neurons maintain stable activity patterns over long time periods in spite of comparatively rapid protein turnover.
  • Jeremy Zucker: Jeremy Zucker has over 10 years of experience in the representation, integration, modeling and simulation of biological pathways to elucidate the complex relationship between genotype and phenotype.

Team crux

1. Methodology

The goal of the DREAM8 Whole-cell parameter estimation challenge is to estimate 30 unknown parameters P of a mathematical model of a cell. The whole cell model has 1972 parameters in total and default values for all parameters are known. As prior information, it is also known that 15 out of the given set of 30 parameters are identical to default values. For the remaining 15 modified parameters, the following prior knowledge is available:

1. 4 promoter affinities were modified

2. 7 kcats were modified

3. 4 RNA half lives were modified

4. 5 genes have one changed parameter

5. 5 genes have two changed parameters

6. 13 of the 15 modified parameters were decreased

7. 2 of the 15 modified parameters were increased

8. The decreases range from 2.8-93.4 %.

9. The increases range from 11.7-90.6%.

In our analyses, we restricted the parameter space to satisfy these constraints.
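As an illustration, part of this prior knowledge can be written as a simple feasibility check in Python. The sketch covers only the counts, directions and ranges (items 6 to 9 plus the total of 15). The per-type and per-gene counts (items 1 to 5) would need parameter annotations and are left out. Expressing each candidate as a factor relative to its default value is an assumption of this sketch.

```python
def satisfies_prior(factors, tol=1e-12):
    """Check a candidate against the published counts and ranges.

    factors[i] = candidate value / default value for each of the 30 parameters,
    so 1.0 means "unchanged", 0.5 means "decreased by 50%".
    """
    changed = [f for f in factors if abs(f - 1.0) > tol]
    decreased = [f for f in changed if f < 1.0]
    increased = [f for f in changed if f > 1.0]
    return (
        len(factors) == 30
        and len(changed) == 15
        and len(decreased) == 13
        and len(increased) == 2
        and all(0.028 <= 1.0 - f <= 0.934 for f in decreased)   # 2.8-93.4 % decrease
        and all(0.117 <= f - 1.0 <= 0.906 for f in increased)   # 11.7-90.6 % increase
    )


# Hypothetical candidates: 13 halved, 2 increased by 50 %, 15 untouched ...
feasible = [0.5] * 13 + [1.5] * 2 + [1.0] * 15
# ... versus one that decreases all 15 modified parameters.
infeasible = [0.5] * 15 + [1.0] * 15
```

A check like this can be used to discard candidate parameter sets before spending a costly simulation on them.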

To account for strictly positive parameter values, all the analyses were performed in a logarithmic parameter space. Moreover, this accounts for the fact that changes of parameter values usually contribute multiplicatively rather than additively, i.e. changing a parameter by a factor a or 1/a usually has a similar impact, whereas adding a constant a mostly has a qualitatively different effect than the corresponding subtraction.  Since promoter affinities are normalized, all perturbations of these parameters have to satisfy the normalization condition

[Equation: normalization condition for the promoter affinities]

For all perturbations we performed, we normalized the promoter affinities by adapting a single, fixed parameter which is not in the set of modified parameters P.
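A minimal Python sketch of this renormalization step follows. It assumes that "normalized" means the affinities keep a fixed total, and the gene names and values are hypothetical. One "slack" affinity outside the perturbed set absorbs the difference.

```python
def perturb_and_renormalize(affinities, changes, slack):
    """Apply multiplicative changes to some promoter affinities, then adjust
    one fixed 'slack' affinity so that the total stays unchanged."""
    if slack in changes:
        raise ValueError("the slack parameter must not be perturbed itself")
    total = sum(affinities.values())
    updated = dict(affinities)
    for name, factor in changes.items():
        updated[name] = affinities[name] * factor
    others = sum(value for name, value in updated.items() if name != slack)
    updated[slack] = total - others
    if updated[slack] <= 0:
        raise ValueError("perturbation leaves no positive value for the slack parameter")
    return updated


# Hypothetical affinities that sum to one; geneC plays the role of the fixed parameter.
affinities = {"geneA": 0.2, "geneB": 0.3, "geneC": 0.5}
result = perturb_and_renormalize(affinities, {"geneA": 0.5}, slack="geneC")
```

Halving the affinity of geneA frees 0.1 of the total, which is assigned to geneC, so the sum is the same before and after the perturbation.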

The parameters were estimated using the maximum likelihood methodology. We initially perturbed each parameter θ ∈ P individually to challenge the response of the model and to estimate the gradient of the likelihood. For testing purposes, we also modified other parameters θ not from P. At the second stage, we also altered sets of more than a single parameter. This allows the computation of higher-order derivatives. An iterative procedure then allows us to advance towards a better model fit. Finally, analysis of the response of the model for default and estimated parameters allows us to assess which of the 30 candidate parameters were modified.
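The first stage, perturbing one parameter at a time to estimate a gradient and then iterating, can be sketched in Python as follows. The objective here is a toy stand-in with a known optimum, not the whole-cell likelihood, and the step size, learning rate and iteration count are assumptions. The higher-order derivatives mentioned above are not covered.

```python
import math


def log_space_gradient(neg_log_lik, theta, rel_step=0.05):
    """Forward-difference gradient of neg_log_lik with respect to log(theta),
    obtained by scaling one parameter at a time by (1 + rel_step)."""
    base = neg_log_lik(theta)
    h = math.log(1.0 + rel_step)
    gradient = []
    for i in range(len(theta)):
        perturbed = list(theta)
        perturbed[i] = theta[i] * (1.0 + rel_step)
        gradient.append((neg_log_lik(perturbed) - base) / h)
    return gradient


def fit(neg_log_lik, theta0, steps=50, learning_rate=0.25):
    """Plain gradient descent on log(theta); parameters stay strictly positive."""
    theta = list(theta0)
    for _ in range(steps):
        gradient = log_space_gradient(neg_log_lik, theta)
        theta = [t * math.exp(-learning_rate * g) for t, g in zip(theta, gradient)]
    return theta


def toy_neg_log_lik(theta, target=(2.0, 0.5)):
    """Stand-in objective: squared distance to a known target in log space."""
    return sum((math.log(t) - math.log(g)) ** 2 for t, g in zip(theta, target))


estimate = fit(toy_neg_log_lik, [1.0, 1.0])
```

Each gradient costs one model evaluation per parameter plus one for the baseline, which is why the cost of a single simulation limits how many iterations are feasible.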

2. Implementation of the numerical methods

A major bottleneck in this Challenge is the computational effort that a single evaluation of the model needs. It critically limits the number of possible iterations during maximum likelihood estimation of the parameters. In addition to simulations on local computers, the simulations were performed on the Bitmill server using Matlab code.

3. About the Team: Team crux 

Team crux consists of a group of researchers at the Institute of Physics at the University of Freiburg. Team crux has previously won two DREAM competitions, the DREAM6 Parameter Estimation Challenge and the DREAM7 Network Inference Challenge.

  • Dr. Clemens Kreutz: Clemens is a postdoctoral scholar at the Institute of Physics at the University of Freiburg. Clemens’ research focuses on mathematical modelling of cellular signal transduction, experimental design, and statistics.
  • Dr. Andreas Raue: Andreas is also a postdoctoral scholar at the Institute of Physics at the University of Freiburg. Andreas’ research focuses on parameter estimation, experimental design, and uncertainty analysis.
  • Bernhard Steiert: Bernhard is a PhD candidate at the Institute of Physics at the University of Freiburg. His research focuses on modeling erythropoietic signaling pathways in cancer, including EGF/HGF crosstalk.
  • Prof. Jens Timmer: Jens is a professor of mathematics and physics at the University of Freiburg. His research focuses on the development and interdisciplinary application of mathematical methods to analyse and model dynamical processes in biology and medicine. His group develops and applies mathematical methods to analyse and model these processes based on measured data. Their final aim is to help to turn the life sciences from a qualitative descriptive into a quantitative predictive science.

Team newDREAM

1. Background

Participants are challenged to estimate the values of 15 unknown parameters from a set of 30 parameters (10 promoter affinities, 10 RNA half-lives, and 10 metabolic reaction kcats) of a recently published whole-cell model of M. genitalium (Karr et al., 2012) given the model’s structure and simulated data.

2. Strategy

The solution needs to address two tasks: 1. Identify parameter candidates. 2. Minimize parameter distance given the candidate parameters. Since the model is too ‘big’ and simulations take a very long time, we could not thoroughly search the entire parameter space to get near-perfect solutions within the limited time. Therefore, our strategy includes:

1) Use wild type/gold standard/ downloaded perturbation datasets to identify the (potentially modified) parameters most sensitive to cell growth.

2) Observe the high-throughput data. Divide the potentially modified parameters into three groups (A, too time-consuming to get optimized; B, hard to get optimized; C, easy to get optimized) based on the observation.

3) Make educated guesses for the parameters in group A first, then try to optimize the parameters in group B while keeping those in group A fixed. Finally, optimize the parameters in group C.
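The staged strategy above can be sketched in Python as a sequence of group-wise searches. Everything concrete here is hypothetical: the parameter names, the candidate grid and the toy objective (which stands in for a full whole-cell simulation) only illustrate the order of the steps.

```python
def optimize_group(objective, params, group, candidates):
    """Coordinate-wise grid search over one group of parameters;
    every parameter outside the group stays fixed."""
    best = dict(params)
    for name in group:
        best_value = objective(best)
        for candidate in candidates:
            trial = dict(best)
            trial[name] = candidate
            value = objective(trial)
            if value < best_value:
                best, best_value = trial, value
    return best


# Hypothetical target, used only to make the toy objective computable.
TARGET = {"a1": 0.5, "b1": 0.8, "c1": 1.2}


def toy_objective(params):
    """Stand-in for the parameter distance; a real call would run a simulation."""
    return sum((params[name] - TARGET[name]) ** 2 for name in TARGET)


params = {"a1": 1.0, "b1": 1.0, "c1": 1.0}     # start from default (unperturbed) factors
params["a1"] = 0.5                             # group A: educated guess, no search
grid = [0.6, 0.8, 1.0, 1.2, 1.4]
params = optimize_group(toy_objective, params, ["b1"], grid)   # group B, A held fixed
params = optimize_group(toy_objective, params, ["c1"], grid)   # group C last
```

Fixing the expensive parameters first keeps the number of objective evaluations, and therefore simulations, small for the remaining groups.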

3. About the Team: newDream 

newDream includes a team of researchers from the University of Texas Southwestern. Last year team newDream won the DREAM Drug Sensitivity Prediction Challenge.

Dr. Jichen Yang: Jichen is a postdoctoral scholar at the University of Texas Southwestern.

Yajuan Li

Dr. Hao Tang: Hao is a postdoctoral fellow at the QBCR at the University of Texas Southwestern.

Tao Wang: Tao is a graduate student at the QBCR at the University of Texas Southwestern.

Dr. Yueming Liu: Yueming is a mathematician at the University of Texas at Arlington.

Prof. Yang Xie: Yang is a professor in the Department of Clinical Science at the University of Texas Southwestern.

Prof. Guanghua Xiao: Guanghua is a professor in the Department of Clinical Science at the University of Texas Southwestern.

And The Winner Is…

We are very happy to announce that the Attractor Metagenes Team (consisting of Mr. Wei-Yi Cheng, Mr. Tai-Hsien Ou Yang and Professor Dimitris Anastassiou of Columbia University) is the winner of the Sage Bionetworks/DREAM Breast Cancer Prognosis Challenge.  Please join all of us at Sage Bionetworks and DREAM in congratulating the team for their winning Challenge model (Syn ID#1417992) that achieved a concordance index of 0.7562. This means that given any two breast cancer patients, the probability that team Attractor Metagenes will correctly predict which of the two patients will survive longer is 76%, an extremely statistically significant performance. The performance was robust, in that this team was also the best performer in most of the 100 instances in which the test set was perturbed by random removal of 20% of the patients.  As the winner of the Challenge, the Attractor Metagenes team has been awarded the opportunity to publish an article about the winning Challenge model in Science Translational Medicine and will be invited to the 4th Annual Sage Congress, taking place in April 2013.  Please read on to hear from Wei-Yi Cheng, who submitted the winning model on behalf of his team.
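For readers unfamiliar with the concordance index, here is a minimal Python sketch of the usual pairwise (Harrell-style) definition. Whether the Challenge scoring handled ties and censored patients in exactly this way is an assumption, and the four patients below are invented.

```python
def concordance_index(survival_times, observed, risk_scores):
    """Fraction of comparable patient pairs whose predicted risk ordering
    matches the observed survival ordering (ties in risk count as half)."""
    concordant = 0.0
    comparable = 0
    n = len(survival_times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # before patient j's (event or censoring) time.
            if observed[i] and survival_times[i] < survival_times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable


# Hypothetical cohort: survival time, whether death was observed, predicted risk.
times = [2, 4, 6, 8]
events = [True, True, False, True]
risks = [0.9, 0.3, 0.5, 0.1]
c_index = concordance_index(times, events, risks)
```

In this toy cohort five pairs are comparable and four of them are ordered correctly, giving 0.8. A value of 0.5 corresponds to random guessing and 1.0 to a perfect ranking.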

Sage / DREAM7 Breast Cancer Prognostic Challenge

 

Validation Phase Writeup

Wei-Yi Cheng

wc2302@columbia.edu

During the first stages of the Breast Cancer Prognostic Challenge, we showed that breast cancer survival can be well predicted by the “Attractor Metagenes.” The Attractor Metagenes are sets of strongly co-expressed genes found using an iterative algorithm. We have previously identified several such Attractor Metagenes in almost identical forms across various cancer types, namely the mitotic chromosomal instability attractor (CIN), the mesenchymal transition attractor (MES), and the lymphocyte-specific attractor (LYM) [1]. As we have mentioned before, we like to think of these three main attractor metagenes as representing three key “bioinformatic hallmarks of cancer,” reflecting the ability of cancer cells to divide uncontrollably, their ability to invade surrounding tissues, and the ability of the organism to recruit a particular type of immune response to fight the disease. In the METABRIC dataset, we confirmed that, under certain conditions, each of these attractors has strong prognostic power. For instance, the expression of the CIN attractor suggests high grade; the expression of the MES attractor during early stage (no positive lymph nodes, and tumor size less than 30 mm) indicates the invasiveness of cancer cells and thus suggests bad prognosis; and the expression of the LYM attractor is an indication of good prognosis in ER-negative breast cancer, while it indicates the opposite, a bad prognosis, when there are already positive lymph nodes. The attractor approach also identifies some breast-cancer-specific attractors such as the ER attractor and the HER2 amplicon. We have used these Attractor Metagenes in all our models and reported their association with survival in previous stages of the challenge [2][3][4].

We think that the success of our model is due to the lack of overfitting to the training (METABRIC) dataset. Indeed our features, the attractor metagenes, were not derived from the training set. Instead, they were derived from other cancer datasets from multiple cancer types [1]. We hypothesized that the attractor metagenes represent universal biomolecular events in cancer, which would therefore be useful for the particular type of breast cancer. So, we used the training set only to find the best ways to combine these features in breast cancer. And we were so happy to see that our score for overall survival in a totally new validation dataset was actually higher than the corresponding score that we had achieved in the previous phases of the Challenge in which the METABRIC dataset itself was split and used for both training as well as validation.

In order to select from our existing models trained on the METABRIC data for the totally new Oslo validation set, we performed several hold-out tests, such as 10-fold cross-validation, on all of our submitted models. In addition, based on what Dr. Huang revealed in “Contours of the Oslo Validation set” [5], we thought it was also important to evaluate the performance of our models using several re-sampled test sets containing only patients who received chemotherapy. Indeed, the top-performing model (syn1417992) has the highest chemotherapy-only test score among all our models.

The top-performing syn1417992 model contains several subclassifiers that utilize orthogonal information. Based on the universal Attractor Metagenes we found in multiple cancer types [1], and several breast-cancer-specific Attractor Metagenes, we created an “Attractor Metagene Space” of around 15 attractor metagenes to replace the 50,000-gene molecular space. We used Cox regression, the generalized boosted model (GBM), and K-nearest neighbor (KNN) to create prognosis models on the Attractor Metagene Space and on the clinical features, respectively. For feature selection, we used the Akaike information criterion (AIC) when performing Cox regression. The model also includes a subclassifier that uses mixed clinical and molecular features, which include all three of the universal metagenes (CIN, LYM restricted to ER and HER2 low, and MES restricted to lymph-node-negative tumors smaller than 30 mm), and the SUSD3 metagene, which we found to be highly associated with good prognosis when over-expressed. In our submissions we did not make use of any code from other Challenge participants.

Finally, we would like to thank everyone who made this wonderful challenge possible. We believe that this success validates not only the prognostic power of our model in breast cancer, but also the “pan-cancer” property of the attractor metagenes, since they were defined from other datasets of various cancer types. We hope that we will have the opportunity to collaborate with pharmaceutical companies towards the development of related diagnostic, prognostic and predictive products; and particularly to scrutinize the underlying biological mechanisms trying to think of potential therapeutic interventions that could be applicable to all types of cancer.

References

  1. W.Y. Cheng, T.H. Ou Yang and D. Anastassiou, “Biomolecular events in cancer revealed by attractor metagenes,” Preprint available from arXiv:1204.6538v1, April 30, 2012, PLoS Computational Biology, in Press.
  2. http://support.sagebase.org/sagebase/topics/mitotic_chromosomal_instability_attractor_metagene
  3. http://support.sagebase.org/sagebase/topics/mesenchymal_transition_attractor_metagene-znl1g
  4. http://support.sagebase.org/sagebase/topics/lymphocyte_specific_attractor_metagene
  5. http://support.sagebase.org/sagebase/topics/contours_of_the_oslo_validation_set

Did Team Hive’s online game generate a top-scoring Challenge model?

Here is a guest post from Benjamin Good: he is a member of “Team Hive” and participating in our Challenge.  His post is about the fun online game (called “The Cure”) that he and his team launched in September 2012 to crowdsource ideas that Team Hive can then use to build great models for our Breast Cancer Challenge.  Please read on to find out how Team Hive’s models built with ideas from the crowd performed!

Reblogged from i9606

Monday, October 29, 2012

Results from the Cancer Biology game: The Cure

Building intelligent systems for biology

Our research group has been exploring the concept of serious games for several months now.  Aside from providing nerdy entertainment, our games collect (and distribute) biological knowledge from broad audiences of players.  The hypothesis underlying this work is that, by capturing knowledge in forms suitable for computation, these games make it possible to build more intelligent programs.

As one step in testing this general hypothesis, on Sept. 7, 2012, we released a game called ‘The Cure’.  The objective of this game is to build a better (more intelligent) predictor of breast cancer survival time based on gene expression and copy number variation information from tumor samples.  We selected this particular objective to align with the SAGE Breast Cancer Prognosis challenge.

In this game, available at http://genegames.org/cure/, the player competes with a computer opponent to select the highest scoring set of five genes from a board containing 25 different genes.  The boards are assembled in advance to include genes judged statistically ‘interesting’ using the METABRIC dataset provided for the SAGE Challenge.

Below is a game in progress.  I’m on the bottom and my opponent, Barney, is on the top.  We alternate turns selecting a card (a gene) from the board and adding it to our hand.  When we each complete a 5 card hand, the round finishes and whoever has the most points wins. Scores are determined by using training data to automatically infer and test decision tree classifiers that predict survival time.  The trees can use both RNA expression and CNV data for the selected genes to infer predictive rules.   The better the gene set performs in generating predictive decision trees, the higher the score.  When the player defeats their opponent, they move on to play another board.  (Multiple players play each board.)

A game of The Cure. Barney (the bad guy) is winning. I am looking at the CPB1 gene and, using the search feature, I have highlighted in pink all genes that have the word cancer in any of their metadata.

As you can see to the right of the board, information from the Gene Ontology, RefSeq, and PubMed is provided through the game interface to aid players in selecting their genes.  Players are also encouraged to make use of external knowledge sources (in addition to their own brains).

Promotion, players and play

The Cure was promoted on launch day via a presentation by Andrew Su at Genome Informatics 2012, via Twitter and in several blog posts. As we first described in a post published on the Sage community site, more than 120 players registered and collectively played more than 2000 games in the first week that the game was live, with much of this activity happening within the first few days. Nearly half of the players self-reported having PhDs and half claimed knowledge of cancer biology. Following the initial buzz, game-playing activity slowed down to what is now a slow but persistent trickle.
Games played at The Cure since launch

As of last Friday, Oct. 26, 2012 we have had 214 people register and have recorded 3,954 total games (including training games).  The player demographics have remained stable with about 40% PhDs, nearly 50% declaring knowledge of cancer biology, and about 50% stating that they are biologists.

Predicting breast cancer prognosis

Aside from entertainment, the point of this particular game is to assemble a predictor for breast cancer prognosis. The main hypothesis is that biological knowledge, accessible from players, can be used to help select good sets of genes with which to train predictive models using machine learning algorithms. The premise is that injecting distributed biological knowledge (which cannot be learned entirely from any one training set) will help reduce overfitting by identifying the gene sets with biologically consistent associations with disease progression.

The data collected from game play includes information about the players (education, knowledge of cancer, profession) and the complete history of the genes that each player selects for each board that they play.  While we are still considering methods for making use of this data (such as the Human Guided Forest), we used the following protocol to build a predictor to submit to the SAGE challenge.

  1. Filter out games from players that indicated no knowledge of cancer biology.
  2. Rank each gene according to the ratio of the number of times that it was selected by different players to the number of times that it appeared in any played game.
  3. Select the top 20 genes according to this ranking.
  4. Insert this 20 gene ‘signature’ into the ‘Attractor Metagene’ algorithm that has dominated the SAGE challenge.  To do this, we kept all of the code related to the use of clinical variables unchanged, but replaced the genes selected by the Attractor team with the genes selected by our game players.  
CCL3L3 CXCL9 IL1B BCL2 DUSP1 ERBB2 EGR1 JUN PITX1 MAP3K1 IGFBP2 STAT1 BCAR3 HOXB2 BCL11B MAPK15 WNT5A APOA2 HLA-DRB4 CD163
Game-selected genes
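Steps 1 to 3 of the protocol above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the submission: the record layout (`player`, `knows_cancer`, `board`, `picked`) is hypothetical, and the sketch assumes that appearances are counted only in games that survive the step 1 filter.

```python
from collections import defaultdict


def rank_genes(games, top_n=20):
    """Rank genes by (distinct players who picked it) / (games it appeared in).

    games: list of dicts with the hypothetical keys
        'player'       - player identifier
        'knows_cancer' - True if the player declared cancer biology knowledge
        'board'        - genes shown on the board in that game
        'picked'       - genes the player selected in that game
    """
    pickers = defaultdict(set)      # gene -> distinct players who selected it
    appearances = defaultdict(int)  # gene -> number of games it appeared in
    for game in games:
        if not game["knows_cancer"]:
            continue  # step 1: drop games from players without cancer knowledge
        for gene in game["board"]:
            appearances[gene] += 1
        for gene in game["picked"]:
            pickers[gene].add(game["player"])
    # Step 2: selection ratio for every gene that appeared at least once.
    ratio = {gene: len(pickers[gene]) / count
             for gene, count in appearances.items()}
    # Step 3: best ratio first; ties are broken alphabetically for determinism.
    return sorted(ratio, key=lambda gene: (-ratio[gene], gene))[:top_n]
```

Counting distinct players rather than raw selections keeps a single enthusiastic player from dominating the ranking, which is one reading of "selected by different players" in step 2.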

The predictor generated with this protocol scored 69% on the survival concordance index for the Sage challenge test dataset, just 3% behind the best submitted predictor and significantly above the median of hundreds of submitted models. (You can see the ranked results on the challenge leaderboard by searching for team HIVE and, with a free registration, you can inspect the model directly within the Synapse system operated by SAGE.)

In experiments conducted within the training dataset, we were able to consistently generate decision tree predictors of 10-year survival with an accuracy of 65% in 10-fold cross-validation using only genomic data (no clinical information). This was substantially better than classifiers produced using randomly selected genes (55%). Using an exhaustive search through the top 10 genes, we found 10 unique gene combinations that, when aggregated, produced statistically significant (FDR < 0.05) indicators of survival within: (1) the training dataset used in the game, (2) a validation cohort from the same study, and (3) an independent validation set from a completely different study.

Final Results from METABRIC round of BCC challenge

!! Update: the model submitted using The Cure data (Team HIVE) scored 0.70 on the official test dataset for the METABRIC round of this competition, putting it at #43 of 171 submitted models !!

Conclusions

These early results from The Cure show clearly that biologists with knowledge that is relevant to cancer biology will play scientific games, and that combined with even basic analytical techniques, meaningful knowledge for inferring predictors of disease progression can be captured from their play.  We suggest that this might open the door to a new form of ‘crowdsourcing’ that operates with much smaller, more specific crowds than are typically considered.
Data
The data collected from the game so far is available as an SQL dump in our repository. This is the entire database used to drive and track the game with the exception of personal information such as email and IP addresses.
Implementation
The code that operates The Cure is freely available on our BitBucket account. It consists of a Java server application (running in Tomcat) that handles database interaction, board generation, and integration with the WEKA machine learning library. WEKA is used to dynamically train and test decision trees (though we could easily use other models) while the game is running. The interface is almost entirely CSS and JavaScript that communicates with the server via JSON requests. We would be thrilled if someone wanted to use this code to build another classification game!

Trees
One aspect of the code base that may be useful in a variety of different projects is the code that translates the Java objects that represent decision trees in WEKA into the Web-ready visualizations presented to the players. This is accomplished via server-side translation into a JSON structure that is rendered in the browser using code that builds on the D3 JavaScript visualization library.
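To give a feel for this kind of translation, here is a small Python sketch that turns a nested decision tree into a name/children JSON structure of the sort that D3 hierarchy layouts typically consume. It is not the game's Java code: the input dictionary format (`gene`, `threshold`, `low`, `high`, `label`) is a hypothetical stand-in for the WEKA tree objects.

```python
import json


def tree_to_d3(node):
    """Recursively convert a decision-tree dict into a name/children dict.

    A leaf is assumed to look like {"label": "..."}; an internal node like
    {"gene": "...", "threshold": x, "low": subtree, "high": subtree}.
    """
    if "label" in node:
        return {"name": node["label"]}
    return {
        "name": f"{node['gene']} <= {node['threshold']}",
        "children": [tree_to_d3(node["low"]), tree_to_d3(node["high"])],
    }


example_tree = {
    "gene": "ERBB2",
    "threshold": 1.5,
    "low": {"label": "survived"},
    "high": {"label": "died"},
}
# The JSON string is what a server would send to the browser for rendering.
payload = json.dumps(tree_to_d3(example_tree))
```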

Credits
Thanks to Max Nanis, Salvatore Loguercio, Chunlei Wu, Ian Macleod and Andrew Su for all of your help making The Cure. Thanks in particular to Max who authored 99% of everything you see when you play the game.

Barney
The opponent in The Cure came from a Wikimedia Commons image from the game “You Have to Burn the Rope”. Thanks for sharing!

Breast Cancer Challenge: Team “PittTransMed” places second for Metabric phase of the Challenge

Please join all of us at Sage Bionetworks and DREAM in congratulating Chunhui Cai and the entire PittTransMed team for being the second highest scoring team for the Metabric phase of our Breast Cancer Challenge! Below you can read about Chunhui’s second-place model (Syn ID#1443133). For this top performance, Chunhui and his team have been invited to speak at the upcoming DREAM conference taking place in San Francisco (Nov 12-16).

Identify Informative Modular Features for Predicting Cancer Clinical Outcomes

 Songjian Lu, Chunhui Cai, Hatice Ulku Osmanbeyoglu, Lujia Chen, Roger Day, Gregory Cooper, and Xinghua Lu

PittTransMed Team

Department of Biomedical Informatics

University of Pittsburgh

An important task of translational cancer genomics is to identify molecular mechanisms underlying the heterogeneity of patient prognosis and responses to treatments. The large number of molecularly characterized breast cancer samples by the TCGA provides a unique opportunity to study perturbed signal transduction pathways that are determinative of clinical outcome and drug-responses.  Our approach to reveal perturbed signaling pathways is based on the following assumption: if a module of genes participates in coherently related biological processes and is co-expressed in a subset of tumors, the module is likely regulated by a specific cellular signal that is perturbed in the subset of tumors.  We have developed a novel bi-clustering approach that unifies knowledge mining and data mining to identify gene modules and corresponding tumor subsets.

From each TCGA breast cancer tumor, we first identified all genes that were differentially expressed, i.e., that showed a 3-fold increase or decrease when compared to the median of expression values from normal samples. Since a tumor always results from perturbations of multiple signaling pathways [1], the complete list of differentially expressed genes from a tumor inevitably reflects a mixture of genes responding to distinct signals. In order to deconvolute the signals, we developed an ontology-guided, semantic-driven approach [2] to group differentially expressed genes into non-disjoint subsets. Each subset contains genes that participate in coherently related biological processes [3] that can be summarized by an informative GO term. With gene expression data from each tumor conceptualized, we further pooled the genes summarized by a common GO term from all tumor samples to construct a seed gene set.
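The 3-fold criterion described above can be illustrated with a short Python sketch. The function name and data layout are hypothetical, and the sketch assumes expression values on a linear (not log-transformed) scale; it is not the team's code.

```python
from statistics import median


def differentially_expressed(tumor, normals, fold=3.0):
    """Call genes that are up or down by at least `fold` in one tumor.

    tumor:   dict mapping gene -> expression value (linear scale, assumed)
    normals: list of dicts with the same genes, one dict per normal sample
    Returns a dict mapping gene -> "up" or "down" for the genes that pass.
    """
    calls = {}
    for gene, value in tumor.items():
        baseline = median(sample[gene] for sample in normals)
        if baseline <= 0:
            continue  # a ratio is undefined without a positive baseline
        ratio = value / baseline
        if ratio >= fold:
            calls[gene] = "up"
        elif ratio <= 1.0 / fold:
            calls[gene] = "down"
    return calls
```

On log2-transformed data the same test would instead compare the difference from the normal median against log2(3).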

Given a seed gene set annotated with a distinct GO term, we sought a subset of tumors in which a functionally coherent gene module is co-regulated. To this end, we constructed a bipartite graph for each seed gene set, in which vertices on one side are the genes and those on the other side are tumors. We then applied a novel graph algorithm to identify a densely connected subgraph containing a module of genes (a subset of the input genes) and a subset of tumors. We required that the subgraph satisfy the following conditions: 1) each gene in the subgraph was differentially expressed in at least 75% of the tumors in the subgraph; 2) in each tumor, at least 75% of the genes in the module were differentially expressed; and 3) the subgraph includes more than 25 tumors.
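The three conditions can be expressed as a simple check on a candidate subgraph. The Python sketch below only verifies a given candidate; it does not implement the search algorithm itself, which is not described here. The function name and the edge-set representation are assumptions for illustration.

```python
def satisfies_conditions(genes, tumors, de_edges, min_frac=0.75, min_tumors=25):
    """Check the three density conditions for a candidate gene/tumor subgraph.

    genes, tumors: lists of identifiers in the candidate subgraph
    de_edges:      set of (gene, tumor) pairs, one for each gene that is
                   differentially expressed in that tumor
    """
    # Condition 3: the subgraph must include MORE than 25 tumors.
    if len(tumors) <= min_tumors or not genes:
        return False
    # Condition 1: each gene is differentially expressed in >= 75% of tumors.
    for gene in genes:
        hits = sum((gene, tumor) in de_edges for tumor in tumors)
        if hits < min_frac * len(tumors):
            return False
    # Condition 2: in each tumor, >= 75% of the module genes are hit.
    for tumor in tumors:
        hits = sum((gene, tumor) in de_edges for gene in genes)
        if hits < min_frac * len(genes):
            return False
    return True
```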

Applying the analysis to the TCGA breast cancer tumors, we have identified 159 subgraphs.  We hypothesized that each subgraph represented a module of genes whose differential expression was in response to a common cellular signal that was perturbed in the subset of tumors.  We then set out to test if this approach would enable us to find the signaling pathways that underlie different prognosis of the breast cancer patients in the Sage Bionetworks-DREAM Breast Cancer Prognosis Challenge.

Based on the TCGA microarray data and our modular analysis results, we trained 159 Bayesian logistic regression models [4], one for each of the aforementioned subgraphs. This approach enabled us to determine whether a module was perturbed (“hit”) in a breast cancer sample from the DREAM 7 challenge. Moreover, the results allowed us to represent a tumor sample in a modular space of reduced dimension. Conditioning on the status of each gene module, we dichotomized samples into two groups (“hit” vs. “not hit”) and performed Kaplan-Meier survival analysis to determine whether the module is predictive of the patients’ clinical outcome. We identified 20 modules that led to significantly different survival outcomes (p-value < 0.05).
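As a reminder of what the Kaplan-Meier step computes, here is a minimal Python sketch of the estimator for one group of patients (for example the “hit” group of a module). The function name and inputs are hypothetical, and the significance test used to compare the two groups is omitted.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve for one group of patients.

    times:  follow-up time for each patient
    events: 1 if death was observed at that time, 0 if censored
    Returns a list of (time, survival_probability) at each observed event time.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censored
        if deaths:
            # Multiply by the fraction of at-risk patients surviving time t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve
```

Running this once on the “hit” samples and once on the “not hit” samples of a module gives the two curves whose separation is then tested for significance.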

Using the states of all modules or of the 20 informative modules as input features, we explored different survival models, including the Cox model, the generalized linear model (the “glmnet” package from CRAN), and the generalized boosted model (the “gbm” package from CRAN). When the above models were evaluated individually, they led to prediction concordance indices of ~0.67 and ~0.66 on the training and testing datasets, respectively. We also explored using a machine learning model, RankSVM [5], to predict the rank order of patients based on the modular status. In general, the performances of the individual RankSVM and survival models were inferior to those of the leading ensemble models submitted by other groups.

After carefully studying the leading models, we noticed that the ensemble model developed by the Attractome group was particularly suitable for incorporating the modular features identified by our approach. In essence, the Attractome group also aimed to identify gene modules and to project tumor samples into a modular space for modeling, although their approach was based on different assumptions and therefore derived different modules. Moreover, the ensemble learning approach adopted by the Attractome team also addressed the overfitting problems confronting single-model approaches. Therefore, we adopted an Attractome model (syn1421196) and evaluated whether our modular features could enhance the model. We performed feature selection using a greedy forward-search approach in a series of cross-validation experiments. We identified 6 features that were capable of enhancing the performance of the Attractome model and integrated them into a hybrid model, referred to as the PittAttractomeHyb.2 model. When tested on the hold-out METABRIC data, the model performed well, indicating that these modular features are indeed informative of clinical outcomes.
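A greedy forward search of this kind can be sketched in Python as follows. This is a generic illustration rather than the team's implementation: `score` is a hypothetical callable that would return, for instance, the cross-validated concordance index of the base model extended with the given features.

```python
def greedy_forward_selection(candidates, score, base=()):
    """Add one feature at a time while the score keeps improving.

    candidates: features that may be added (e.g. module states)
    score:      hypothetical callable taking a list of features and
                returning a number where higher is better
    base:       features that are always included
    Returns (selected_features, best_score).
    """
    selected = list(base)
    best = score(selected)
    remaining = list(candidates)
    while remaining:
        # Try each remaining feature on top of the current selection.
        trials = [(score(selected + [feature]), feature) for feature in remaining]
        top_score, top_feature = max(trials, key=lambda trial: trial[0])
        if top_score <= best:
            break  # no single addition helps any more
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best
```

Because each step commits to the single best addition, the search evaluates far fewer feature sets than an exhaustive search, at the cost of possibly missing combinations that only help jointly.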

In summary, the major contributions of this study include: 1) A novel integrative approach that is capable of identifying gene modules that likely represent units of cellular responses to perturbed signaling. 2) Certain modules identified from the TCGA data are predictive of clinical outcome when applied to the METABRIC data; therefore, the information encoded by these modules is generalizable. 3) Integrating informative modular features with ensemble predictive models enhances the capability of predicting clinical outcomes of breast cancer patients. Ongoing efforts concentrate on reverse engineering the signaling pathways that underlie these predictive modules.

REFERENCES

1. Hanahan, D. & Weinberg, R. A. Hallmarks of cancer: the next generation. Cell 144, 646-674, doi:10.1016/j.cell.2011.02.013 (2011).

2. Jin, B. & Lu, X. Identifying informative subsets of the Gene Ontology with information bottleneck methods. Bioinformatics 26, 2445-2451, doi:10.1093/bioinformatics/btq449 (2010).

3. Richards, A. J. et al. Assessing the functional coherence of gene sets with metrics based on the Gene Ontology graph. Bioinformatics 26, i79-87, doi:10.1093/bioinformatics/btq203 (2010).

4. Friedman, J., Hastie, T. & Tibshirani, R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J Stat Softw 33, 1-22 (2010).

5. Joachims, T. in ACM Conference on Knowledge Discovery and Data Mining (KDD).

Breast Cancer Challenge: Team “Attractor Metagenes” nabs top overall Metabric score!

Please join all of us at Sage Bionetworks and DREAM in congratulating Wei-Yi Cheng and the entire Attractor Metagenes team for their winning model (Syn ID#1444444): this training model received the top overall score for the Metabric phase of this Challenge. It will be so interesting to see how this model (and all the others) performs against the final validation data set that is currently being produced! Wei-Yi and his team have been invited to speak at the upcoming DREAM conference taking place in San Francisco (Nov 12-16). Please read on to hear from Wei-Yi Cheng, who submitted the winning model on behalf of his team.

Our models use the information provided by the “attractor metagenes” to evaluate their prognostic value in breast cancer. We have previously applied an iterative attractor finding algorithm on rich expression datasets from multiple cancer types identifying three universal (pan-cancer) attractors, which are the mitotic chromosomal instability attractor (CIN), the mesenchymal transition attractor (MES), and the lymphocyte-specific attractor (LYM) [1]. We like to think of these three attractors as “bioinformatic hallmarks of cancer.” In our top submission (syn1444444) we used precisely these same three metagenes as found from our previous unsupervised multi-cancer analysis.

Our experience in the Challenge was particularly rewarding as we were successively confirming that each of these attractors would indeed be helpful towards improving the breast cancer prognostic model, and we were reporting our observations to the other participants using the Synapse forum. We first found that the CIN attractor is highly prognostic, as evidenced by the fact that it was essentially recreated after ranking the individual genes in terms of their corresponding concordance index [2]. The other two main attractors were also found to be highly prognostic after being properly conditioned. For example, the MES attractor is most prognostic in early stage breast cancer (no positive lymph nodes and tumor size less than 30 mm) [3]. On the other hand, the LYM attractor is protective when ER and HER2 expressions are low, while it has the reverse effect on prognosis when there are multiple positive lymph nodes [4]. We also identified a few additional prognostic metagenes, such as the SUSD3-FGD3 metagene that is composed of two genomically adjacent genes, the ZMYND10-LRRC48-CASC1 metagene, the PGR-RAI2 metagene, the HER2 amplicon attractor metagene at chr17q11.2 – q21 and a chr17p12 meta-CNV. We compiled an attractor metagene space using these metagenes, along with TP53 and VEGFA, known to be associated with cancer, and then used them as a molecular feature space for feature selection.

In our top submission (syn1444444), we applied several subclassifiers to maximize the information used and to build a robust, generalizable model, including Cox regression, the generalized boosted model (GBM), and K‑nearest neighbor (KNN). We also applied the Akaike information criterion (AIC) to the features passed to Cox regression to avoid overfitting. We applied the AIC-based Cox regression and the GBM to the metagene space and to the clinical features, respectively, and the KNN model to the combined metagene and clinical feature space. We combined the predictions of the subclassifiers by directly summing the linear predictors that they generated. For the KNN model, because the prediction is the survival time, we summed the reciprocal of the prediction instead. We also included two subclassifiers using mixed molecular and clinical features. In particular, one of them used all three of the universal metagenes (CIN, LYM and MES, properly conditioned), the SUSD3-FGD3 metagene, and the clinical features age, radiation therapy, and chemotherapy. We found that such a simple model provides accurate prognosis and can be treated independently of the other subclassifiers.
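The combination rule described above (sum the linear predictors, and add the reciprocal of the KNN survival-time prediction) can be sketched in Python. The function name and argument layout are hypothetical, and the sketch assumes that every subclassifier's output is already on a scale where higher means higher risk; the team's actual code is in R.

```python
def ensemble_risk(linear_predictors, knn_survival_times):
    """Combine subclassifier outputs into one risk score per patient.

    linear_predictors:  list of per-subclassifier lists (e.g. Cox, GBM),
                        one value per patient, higher meaning higher risk
    knn_survival_times: KNN-predicted survival time per patient (> 0)
    """
    n_patients = len(knn_survival_times)
    combined = [0.0] * n_patients
    for predictions in linear_predictors:
        for i, value in enumerate(predictions):
            combined[i] += value
    for i, survival_time in enumerate(knn_survival_times):
        # A long predicted survival means low risk, so add the reciprocal.
        combined[i] += 1.0 / survival_time
    return combined
```

Taking the reciprocal puts the KNN output on the same "higher is worse" orientation as the other subclassifiers before everything is added together.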

In our submissions we did not make use of any code from other Challenge participants. We submitted the winning model prior to the October 15 model submission deadline, and the full code of this model was accessible to other participants as soon as it was posted to the leaderboard and thereafter.

References

  1. W-Y Cheng, D Anastassiou, Biomolecular events in cancer revealed by attractor metagenes. arXiv:1204.6538v1 [q-bio.QM].
  2. http://support.sagebase.org/sagebase/topics/mitotic_chromosomal_instability_attractor_metagene
  3. http://support.sagebase.org/sagebase/topics/mesenchymal_transition_attractor_metagene-znl1g
  4. http://support.sagebase.org/sagebase/topics/lymphocyte_specific_attractor_metagene

The Challenge’s October 1 Leaderboard Winner: A “Repeat Performance” from Attractor Metagenes Team!

Please join all of us at Sage Bionetworks and DREAM in congratulating Wei-Yi Cheng and the entire Attractor Metagenes team on winning the October 1 leaderboard. Attractor Metagenes was also the September 1 leaderboard winner. This “repeat performance” is especially impressive given that it was achieved working with two different versions of the Metabric data! Please read on to hear from Wei-Yi Cheng, who submitted the winning model on behalf of his team.

Dear fellow BCC challenge participants and others here on Synapse,

I would like to thank once more the organizers for the opportunity that they again give to the Attractor Metagenes Team to share some of our methods and findings. It has been another exciting month, in which many new ideas have been shared among us. It is inspiring that through these discussions on the Challenge forum, we all are gaining a better understanding of different perspectives on the data and the disease itself.

The main reason for our continuing high score is that we are making use of the three strongest attractor metagenes representing universal (multi-cancer) biomolecular events: the mitotic chromosomal instability attractor metagene, the mesenchymal transition attractor metagene, and the lymphocyte-specific attractor metagene. We are particularly excited because these metagenes are present in all solid cancers, and therefore can be used as “pancancer” biomarkers, which will be more robust than individual oncogenes. We have now posted descriptions of each of these three main attractors as items in the Synapse forum. So far we have not incorporated any code from other submissions, but we will certainly do so if we deem it appropriate, giving the authors prominent credit. Similarly, we welcome others to make use of our code, which is always freely and readily available. The functions and metagene lists used in all our submissions are incorporated in an R package downloadable through the link given in the source code we uploaded on the leaderboard. We have also uploaded an R package for finding attractor metagenes, available under Synapse ID syn1123167 for anyone interested to use, not only in breast cancer, but in all types of cancer.

We understand that the main objective in Phase 2 of the Challenge is to build a generalizable model that will work well when evaluated against the Oslo Validation data set. Based on our experience in both Phases, we believe that achieving a generalizable model requires making use of survival data that have been “purified” by excluding causes of death unrelated to the disease itself. We understand, however, that this is difficult to achieve in general and even more so in the case of the Oslo-Val data set. And because Phase 2 uses the same overall survival data as in the Oslo-Val, we modified our models to include lots of clinical features that we do not think would be otherwise required for the development of a sharp and generalizable prognostic model.

To elaborate on this last point: We are excited about building sharp, insightful and powerful “minimalist” disease models that could be used for biomarker products making use of a very small number of features. For example, we believe that we have identified such a model in breast cancer that makes use of nothing other than our three attractor metagenes mentioned above, tumor size, number of lymph nodes affected, and one more protective feature that we discovered as a result of our participation in the Challenge: The metagene defined as the average of the genes SUSD3 and FGD3, which, as we observed, are genomically adjacent at chr9q22.31. We know that simultaneous silencing of these two genes is strongly associated with bad prognosis, but we are not certain about the underlying biological mechanism (it may not be the result of a CNV). We suspect that this simultaneous silencing is one of several triggers required for ER-negativity, perhaps the most important one. This is an interesting research question and we hope that other Challenge participants with expertise in biology and medicine will join us in the effort to decipher this important mechanism in breast cancer!

Wei-Yi Cheng

Graduate Research Assistant, Ph.D. Candidate in Electrical Engineering

Genomic Information Systems Laboratory,

Columbia University

Can the crowd provide ‘The Cure’?

The participants on Sage Bionetworks’ Breast Cancer Prognosis Challenge keep surprising and delighting us!  Here is a guest post from Benjamin Good: he is a member of “Team Hive” and participating in our Challenge.  His post is about the online game he and his team just launched a few weeks ago to crowdsource ideas that Team Hive can then use to build great models for the Challenge.  Please enjoy Ben’s post and try out his cool game!!

Serious Games

Serious games are games that have an underlying purpose. When you play a game like Foldit or Phylo, you are being entertained as with any other game, but your actions also translate into a useful end product. In Foldit, you contribute to protein structure determination; in Phylo, to multiple sequence alignment. Recasting difficult or time-consuming tasks as components of games opens up a new way to find and motivate volunteer contributors at potentially massive scale. Like the Sage Bionetworks competitions themselves, serious games provide the opportunity to focus widespread community attention on particular challenging problems.

The Cure

The purpose of the game The Cure is to identify sets of genes that can be used to build predictors of breast cancer prognosis that will stand up to validation. The hypothesis is that we can outperform purely data-driven approaches by infusing our gene selection algorithms with the biological knowledge and reasoning abilities of hundreds or even thousands of players. This biological insight is captured through a simple, fun two-player card game where each card corresponds to a gene.

In its current form, the game consists of a series of 100 boards, each containing 25 distinct genes from a precomputed list of interesting genes. In the game, you compete with the computer opponent Barney to find the set of 5 genes from each board that forms the best predictor of 10-year survival. In each turn, the player takes a gene card off the board and puts it in their hand. To inform these decisions, extensive annotation information from resources such as the Gene Ontology, Entrez Gene and PubMed is provided through the game interface, and players are free to conduct their own research. As each card is added to a player’s hand, a decision tree is constructed automatically using the genes in the hand and the training dataset from the Sage Bionetworks / DREAM challenge. The tree is shown to the player and the hand is scored based on the performance of the decision tree algorithm, coupled with those genes, in a 10-fold cross-validation test. If the player produces a better gene set than Barney, they score points based on the cross-validation score.

Play begins with a very short training stage that teaches the mechanics of the game as players select features to build an animal classifier. Once this stage is passed, players are free to choose which of the gene boards to play. Each board is shown with an indication of how many other players have already defeated it. Once a certain number of players have finished a board, we declare it complete and close it off to encourage the player population to explore the entire board space. The first collection of boards is nearly complete and the next level should be available soon.
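To make the scoring step concrete, here is a rough sketch of how a hand of genes might be scored with a decision tree under 10-fold cross-validation. It is not the game’s actual code: the function names, the depth limit, the use of plain accuracy as the score, and the synthetic two-gene dataset (`GENE_A`, `GENE_B`) are all assumptions made for illustration.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, depth=0, max_depth=3):
    """Grow a small binary tree on threshold splits; a leaf is the majority label."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority
    best = None
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        for a, b in zip(values, values[1:]):
            t = (a + b) / 2  # candidate threshold: midpoint between neighbours
            left = [i for i, r in enumerate(rows) if r[f] <= t]
            right = [i for i, r in enumerate(rows) if r[f] > t]
            if not left or not right:
                continue
            impurity = (len(left) * gini([labels[i] for i in left])
                        + len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or impurity < best[0]:
                best = (impurity, f, t, left, right)
    if best is None:
        return majority
    _, f, t, left, right = best
    return (f, t,
            build_tree([rows[i] for i in left], [labels[i] for i in left],
                       depth + 1, max_depth),
            build_tree([rows[i] for i in right], [labels[i] for i in right],
                       depth + 1, max_depth))

def predict(tree, row):
    """Follow the splits until a leaf (a plain label) is reached."""
    while isinstance(tree, tuple):
        f, t, low, high = tree
        tree = low if row[f] <= t else high
    return tree

def score_hand(expression, survived, hand, folds=10, seed=0):
    """Mean k-fold cross-validated accuracy of a tree limited to the genes in `hand`."""
    rows = [[sample[g] for g in hand] for sample in expression]
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    accuracies = []
    for k in range(folds):
        test = order[k::folds]  # every folds-th sample forms one held-out fold
        held_out = set(test)
        train = [i for i in order if i not in held_out]
        tree = build_tree([rows[i] for i in train], [survived[i] for i in train])
        correct = sum(predict(tree, rows[i]) == survived[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / folds

# Synthetic stand-in for the training data: GENE_A tracks survival, GENE_B is noise.
rng = random.Random(1)
samples, outcome = [], []
for _ in range(60):
    alive = rng.random() < 0.5
    samples.append({"GENE_A": rng.gauss(2.0 if alive else -2.0, 0.5),
                    "GENE_B": rng.gauss(0.0, 1.0)})
    outcome.append(int(alive))

print("informative hand:", score_hand(samples, outcome, ["GENE_A"]))
print("noise hand:", score_hand(samples, outcome, ["GENE_B"]))
```

In this toy setting a hand containing the informative gene should score well above a hand containing only noise, which is the kind of signal a player’s hand is compared against Barney’s on.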

Results so far

In the first week that the game was live (Sept. 7-14, 2012), more than 120 players registered and collectively played more than 2000 hands. 60% of the players came from the U.S., 30% from China and the rest arrived from all over the world. Nearly half of the players have PhDs. While it is too early to tell whether this approach will be a contest winner, we have already used it to identify several small gene sets that have significant predictive power (far better than random). We also know that some of the players are having a great time. One of the top players wrote this about the game: “This is a wonderful game, which can give me happiness and knowledge at the same time.” Whether or not this game can manage to move the bar forward in cancer prognosis, it seems that it was already worth the effort to create it. Play now at http://genegames.org/cure !

About the creators of ‘The Cure’

The Scripps Research Institute team HIVE members include: Benjamin Good, Max Nanis, Salvatore Loguercio, Chunlei Wu, Ian MacLeod and Andrew Su. Their research explores applications of crowdsourcing in biology such as the Gene Wiki, BioGPS, and the emerging collection of serious games at genegames.org. ‘The Cure’ is specifically focused on accumulating knowledge that can be translated into good performance on the BCC challenge.