April 21, 2013
Thanks to everyone who joined Sage at the 2013 Congress. It was great to get so much energy and feedback from everyone in attendance. Our Congress Demo showing use of the newest features is now posted if you weren’t able to attend.
News about Sage Bionetworks' Synapse project
April 16, 2013
Synapse software engineers have been hard at work churning out features leading up to the 2013 Sage Bionetworks Commons Congress. A new version of Synapse services will be released on Tuesday April 16th.
We allow users to create a DOI to reference Synapse content from external systems or scientific publications. DOIs are a scientific publishing standard used to create robust links across many different websites to content, and provide users a short and easy way to share and reference their work (e.g. doi:10.7303/syn1234.5). DOIs for Synapse Files, Folders, and Projects can be created on the Synapse webpage for each of these objects, and can reference either the current version, or a specific version of a file. Also, any DOIs included in markdown will be converted into links automatically.
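As a rough illustration of the automatic DOI-to-link conversion described above, here is a minimal Python sketch. The regular expression, the `link_dois` name, and the use of the public `https://doi.org/` resolver as the link target are assumptions made for this example; the post does not describe how Synapse implements the conversion.

```python
import re

# Assumed pattern: the Synapse DOI prefix shown in the post (10.7303),
# a Synapse ID, and an optional ".version" suffix.
DOI_PATTERN = re.compile(r"\bdoi:(10\.7303/syn\d+(?:\.\d+)?)")


def link_dois(markdown_text):
    """Wrap each Synapse-style DOI in a markdown link to the DOI resolver."""
    def to_link(match):
        doi = match.group(1)
        return f"[doi:{doi}](https://doi.org/{doi})"

    return DOI_PATTERN.sub(to_link, markdown_text)


print(link_dois("See doi:10.7303/syn1234.5 for data."))
```

A DOI without a version suffix (e.g. doi:10.7303/syn1234) is matched by the same pattern.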
Other new features include:
April 4, 2013
We’ve just released a new version of Synapse with the following major updates (upgrading the R and Python clients is recommended):
Wiki pages for Projects and Folders: Synapse now allows users much more flexibility to create custom content through dedicated areas for wiki content on Project and Folder pages in the Synapse web site. Additionally, users may nest an arbitrary number of additional wiki pages below the top-level page of a project or folder. Previously we allowed wiki markdown in a Description field, which had a much more constrained area of the page for display. Content previously present in the Description has now been migrated to the wiki section of pages. The Description field remains as a short plain-text summary of a project or folder. Descriptions will be displayed throughout the site, e.g. on the search results page or in tool-tips on lists of projects, so it is useful to create both a short description for your Synapse projects and richer content using the wiki capabilities.
New Entity List widget replaces Summary Entities: Previous versions of Synapse allowed users to create a Summary, which was a page dedicated to a tabular view of other objects in the system. We’ve now incorporated the same functionality into a component that can be embedded within Synapse wiki pages, giving users more flexibility to mix this visualization in with other content on the same page. You can no longer create the old stand-alone Summary pages, but existing Summary pages will continue to be supported for at least 6 months. To use the new Entity List visualization, edit a wiki page and choose Insert > Entity List from the main menu.
Project-centric search: The default search has been optimized to search for projects over other types of objects in the system. We believe this change will more often help new users find the content they are looking for in Synapse. We are now using the text of both the project Description and Wiki pages in our search index, so our search now has more content to work with. This should also improve search results, especially as increasing numbers of projects are built out. All the old functionality to search for other types of objects in the system is still available by removing the filter to narrow search results to only show projects.
January 21, 2013
We are very happy to announce that the Attractor Metagenes Team (consisting of Mr. Wei-Yi Cheng, Mr. Tai-Hsien Ou Yang and Professor Dimitris Anastassiou of Columbia University) is the winner of the Sage Bionetworks/DREAM Breast Cancer Prognosis Challenge. Please join all of us at Sage Bionetworks and DREAM in congratulating the team for their winning Challenge model (Syn ID#1417992), which achieved a concordance index of 0.7562. This means that, given any two breast cancer patients, the probability that team Attractor Metagenes will correctly predict which of the two will survive longer is about 76%, a highly statistically significant performance. The performance was robust, in that this team was also the best performer in most of the 100 instances in which the test set was perturbed by random removal of 20% of the patients. As the winner of the Challenge, the Attractor Metagenes team has been awarded the opportunity to publish an article about the winning Challenge model in Science Translational Medicine and will be invited to the 4th Annual Sage Congress, taking place in April 2013. Please read on to hear from Wei-Yi Cheng, who submitted the winning model on behalf of his team.
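For readers unfamiliar with the concordance index, the pairwise interpretation given above can be sketched in a few lines of Python. This is a simplified illustration only: it ignores censored patients, which a real survival analysis has to handle, and the function name and inputs are assumptions made for this example rather than the Challenge's scoring code.

```python
from itertools import combinations


def concordance_index(survival_times, risk_scores):
    """Fraction of comparable patient pairs that the risk scores order correctly.

    A pair is concordant when the patient with the higher predicted risk
    has the shorter survival time. Ties in predicted risk count as half.
    Censoring is ignored, which is a simplification.
    """
    concordant = 0.0
    comparable = 0
    patients = list(zip(survival_times, risk_scores))
    for (t_a, r_a), (t_b, r_b) in combinations(patients, 2):
        if t_a == t_b:
            continue  # equal survival times carry no ordering information
        comparable += 1
        if r_a == r_b:
            concordant += 0.5
        elif (t_a < t_b) == (r_a > r_b):
            concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Three patients; the model orders two of the three pairs correctly.
print(concordance_index([1, 2, 3], [3, 1, 2]))
```

A perfect ranking gives 1.0, a fully reversed ranking gives 0.0, and random guessing lands near 0.5.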
Sage / DREAM7 Breast Cancer Prognostic Challenge
Validation Phase Writeup
During the first stages of the Breast Cancer Prognostic Challenge, we have shown that breast cancer survival can be well predicted by the “Attractor Metagenes.” The Attractor Metagenes are sets of strongly co-expressed genes found using an iterative algorithm. We have previously identified several such Attractor Metagenes in almost identical forms across various cancer types, namely the mitotic chromosomal instability attractor (CIN), the mesenchymal transition attractor (MES), and the lymphocyte-specific attractor (LYM). As we had mentioned before, we like to think of these three main attractor metagenes as representing three key “bioinformatic hallmarks of cancer,” reflecting the ability of cancer cells to divide uncontrollably, to invade surrounding tissues, and the ability of the organism to recruit a particular type of immune response to fight the disease. In the METABRIC dataset, we confirmed that, under certain conditions, each of these attractors has strong prognostic power. For instance, the expression of the CIN attractor suggests high grade; the expression of the MES attractor during early stage (no positive lymph node, and tumor size less than 30 mm) indicates the invasiveness of cancer cells and thus suggests bad prognosis; and the expression of the LYM attractor is an indication of good prognosis in ER negative breast cancer, while, conversely, it indicates poor prognosis when there are already positive lymph nodes. The attractor approach also identifies some breast-cancer-specific attractors such as the ER attractor and the HER2 amplicon. We have used these Attractor Metagenes in all our models and reported their association with survival in previous stages of the challenge.
We think that the success of our model is due to the lack of overfitting to the training (METABRIC) dataset. Indeed our features, the attractor metagenes, were not derived from the training set. Instead, they were derived from other cancer datasets from multiple cancer types. We hypothesized that the attractor metagenes represent universal biomolecular events in cancer, which would therefore be useful for the particular case of breast cancer. So, we used the training set only to find the best ways to combine these features in breast cancer. And we were so happy to see that our score for overall survival in a totally new validation dataset was actually higher than the corresponding score that we had achieved in the previous phases of the Challenge, in which the METABRIC dataset itself was split and used for both training and validation.
In order to select from our existing models trained on the METABRIC data for the totally new Oslo validation set, we performed several hold-out tests on all of our submitted models, such as 10-fold cross-validation. In addition, based on what Dr. Huang revealed in “Contours of the Oslo Validation set,” we thought it was also important to evaluate the performance of our models using several re-sampled test sets containing only patients who received chemotherapy. Indeed, the top-performing model (syn1417992) has the highest chemotherapy-only test score among our models.
The top-performing syn1417992 model contains several subclassifiers that utilize orthogonal information. Based on the universal Attractor Metagenes we found in multiple cancer types, and several breast cancer specific Attractor Metagenes, we created an “Attractor Metagene Space” of around 15 attractor metagenes to replace the 50,000-gene molecular space. We used Cox regression, the generalized boosted model (GBM), and K-nearest neighbor (KNN) to create prognosis models on the Attractor Metagene Space and clinical features respectively. For feature selection, we used the Akaike information criterion (AIC) when performing Cox regression. The model also includes a subclassifier that used mixed clinical and molecular features, which include all three of the universal metagenes (CIN, LYM restricted on ER and HER2 low, and MES restricted on lymph-negative and tumor size less than 30 mm), and the SUSD3 metagene, which we found is highly associated with good prognosis when over-expressed. In our submissions we did not make use of any code from other Challenge participants.
Finally, we would like to thank everyone who made this wonderful challenge possible. We believe that this success validates not only the prognostic power of our model in breast cancer, but also the “pan-cancer” property of the attractor metagenes, since they were defined from other datasets of various cancer types. We hope that we will have the opportunity to collaborate with pharmaceutical companies towards the development of related diagnostic, prognostic and predictive products; and particularly to scrutinize the underlying biological mechanisms trying to think of potential therapeutic interventions that could be applicable to all types of cancer.
November 6, 2012
Here is a guest post from Benjamin Good: he is a member of “Team Hive” and participating in our Challenge. His post is about the fun online game (called “The Cure”) that he and his team launched in September 2012 to crowdsource ideas that Team Hive can then use to build great models for our Breast Cancer Challenge. Please read on to find out how Team Hive’s models built with ideas from the crowd performed!
Reblogged from i9606
|Building intelligent systems for biology|
Our research group has been exploring the concept of serious games for several months now. Aside from providing nerdy entertainment, our games collect (and distribute) biological knowledge from broad audiences of players. The hypothesis underlying this work is that, by capturing knowledge in forms suitable for computation, these games make it possible to build more intelligent programs.
As one step in testing this general hypothesis, on Sept. 7, 2012, we released a game called ‘The Cure’. The objective of this game is to build a better (more intelligent) predictor of breast cancer survival time based on gene expression and copy number variation information from tumor samples. We selected this particular objective to align with the SAGE Breast Cancer Prognosis challenge.
In this game, available at http://genegames.org/cure/, the player competes with a computer opponent to select the highest scoring set of five genes from a board containing 25 different genes. The boards are assembled in advance to include genes judged statistically ‘interesting’ using the METABRIC dataset provided for the SAGE Challenge.
Below is a game in progress. I’m on the bottom and my opponent, Barney, is on the top. We alternate turns selecting a card (a gene) from the board and adding it to our hand. When we each complete a 5 card hand, the round finishes and whoever has the most points wins. Scores are determined by using training data to automatically infer and test decision tree classifiers that predict survival time. The trees can use both RNA expression and CNV data for the selected genes to infer predictive rules. The better the gene set performs in generating predictive decision trees, the higher the score. When the player defeats their opponent, they move on to play another board. (Multiple players play each board.)
|A game of The Cure. Barney (the bad guy) is winning, I am looking at the CPB1 gene and, using the search feature, I have highlighted in pink all genes that have the word cancer in any of their metadata.|
As you can see to the right of the board, information from the Gene Ontology, RefSeq, and PubMed is provided through the game interface to aid players in selecting their genes. Players are also encouraged to make use of external knowledge sources (in addition to their own brains).
|Games played at The Cure since launch|
As of last Friday, Oct. 26, 2012 we have had 214 people register and have recorded 3,954 total games (including training games). The player demographics have remained stable with about 40% PhDs, nearly 50% declaring knowledge of cancer biology, and about 50% stating that they are biologists.
The data collected from game play includes information about the players (education, knowledge of cancer, profession) and the complete history of the genes that each player selects for each board that they play. While we are still considering methods for making use of this data (such as the Human Guided Forest), we used the following protocol to build a predictor to submit to the SAGE challenge.
The predictor generated with this protocol scored 69% correct on survival concordance index on the Sage challenge test dataset, just 3% behind the best submitted predictor and significantly above the median of hundreds of submitted models. (You can see the ranked results on the challenge leaderboard – search for team HIVE – and, with a free registration, you can inspect the model directly within the Synapse system operated by SAGE.)
In experiments conducted within the training dataset, we were able to consistently generate decision tree predictors of 10-year survival with an accuracy of 65% in 10-fold cross-validation using only genomic data (no clinical information). This was substantially better than classifiers produced using randomly selected genes (55%). Using an exhaustive search through the top 10 genes, we found 10 different unique gene combinations that, when aggregated, produced statistically significant (FDR < 0.05) indicators of survival within: (1) the training dataset used in the game, (2) a validation cohort from the same study, and (3) an independent validation set from a completely different study.
|Final Results from METABRIC round of BCC challenge|
!! Update: the model submitted using The Cure data (Team HIVE) scored 0.70 on the official test dataset for the METABRIC round of this competition, putting it at #43 of 171 submitted models !!
These early results from The Cure show clearly that biologists with knowledge that is relevant to cancer biology will play scientific games, and that combined with even basic analytical techniques, meaningful knowledge for inferring predictors of disease progression can be captured from their play. We suggest that this might open the door to a new form of ‘crowdsourcing’ that operates with much smaller, more specific crowds than are typically considered.
The data collected from the game so far is available as an SQL dump in our repository. This is the entire database used to drive and track the game with the exception of personal information such as email and IP addresses.
Thanks to Max Nanis, Salvatore Loguercio, Chunlei Wu, Ian Macleod and Andrew Su for all of your help making The Cure. Thanks in particular to Max who authored 99% of everything you see when you play the game.
November 1, 2012
Please join all of us at Sage Bionetworks and DREAM in congratulating Chunhui Cai and the entire PittTransMed team for being the second highest scoring team for the Metabric phase of our Breast Cancer Challenge! Below you can read about Chunhui’s winning model (Syn ID#1443133). For this top performance, Chunhui and his team have been invited to speak at the upcoming DREAM conference taking place in San Francisco (Nov 12-16).
Identify Informative Modular Features for Predicting Cancer Clinical Outcomes
Songjian Lu, Chunhui Cai, Hatice Ulku Osmanbeyoglu, Lujia Chen, Roger Day, Gregory Cooper, and Xinghua Lu
Department of Biomedical Informatics
University of Pittsburgh
An important task of translational cancer genomics is to identify molecular mechanisms underlying the heterogeneity of patient prognosis and responses to treatments. The large number of molecularly characterized breast cancer samples by the TCGA provides a unique opportunity to study perturbed signal transduction pathways that are determinative of clinical outcome and drug-responses. Our approach to reveal perturbed signaling pathways is based on the following assumption: if a module of genes participates in coherently related biological processes and is co-expressed in a subset of tumors, the module is likely regulated by a specific cellular signal that is perturbed in the subset of tumors. We have developed a novel bi-clustering approach that unifies knowledge mining and data mining to identify gene modules and corresponding tumor subsets.
From each TCGA breast cancer tumor, we first identified all genes that were differentially expressed, i.e., 3-fold increase or decrease when compared to the median of expression values from normal samples. Since a cancer tumor always results from perturbations of multiple signaling pathways 1, the complete list of differentially expressed genes from a tumor inevitably reflects a mixture of genes responding to distinct signals. In order to de-convolute the signals, we developed an ontology-guided, semantic-driven approach 2 to group differentially expressed genes into non-disjoint subsets. Each subset contains genes that participate in coherently related biological processes 3 that can be summarized by an informative GO term. With gene expression data from each tumor conceptualized, we further pooled the genes summarized by a common GO term from all tumor samples to construct a seed gene set.
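The first step above, calling a gene differentially expressed when it shows a 3-fold increase or decrease relative to the median of the normal samples, can be sketched as follows. This is an illustrative sketch, not the authors' pipeline: the function name and the data layout are assumptions, and it assumes positive, linear-scale expression values (for log2 data the same test would be an absolute difference of at least log2(3)).

```python
from statistics import median


def differentially_expressed(tumor_expr, normal_expr, fold=3.0):
    """Return the genes with at least a `fold` increase or decrease in one tumor.

    tumor_expr:  dict mapping gene -> expression value in the tumor
    normal_expr: dict mapping gene -> list of values across normal samples
    """
    selected = set()
    for gene, value in tumor_expr.items():
        baseline = median(normal_expr[gene])
        if baseline <= 0:
            continue  # a ratio is undefined for a non-positive baseline
        ratio = value / baseline
        if ratio >= fold or ratio <= 1.0 / fold:
            selected.add(gene)
    return selected


# Hypothetical gene names and values, for illustration only.
normals = {"A": [9, 10, 11], "B": [9, 10, 11], "C": [9, 10, 11]}
tumor = {"A": 30, "B": 10, "C": 3}
print(sorted(differentially_expressed(tumor, normals)))
```

Running this per tumor yields one gene set per sample, which is the input to the ontology-guided grouping described in the paragraph.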
Given a seed gene set annotated with a distinct GO term, we sought to search for a subset of tumors in which a functionally coherent gene module is co-regulated. To this end, we constructed a bipartite graph for each seed gene set, in which vertices on one side are the genes and those on the other side are tumors. We then applied a novel graph algorithm to identify a densely connected subgraph containing a module of genes (subset of input genes) and a subset of tumors. We required that the subgraph satisfy the following conditions: 1) a gene in such a subgraph was differentially expressed in at least 75% of the tumors in the subgraph; 2) in each tumor, at least 75% of genes in the module were differentially expressed; and 3) a subgraph should include more than 25 tumors.
Applying the analysis to the TCGA breast cancer tumors, we have identified 159 subgraphs. We hypothesized that each subgraph represented a module of genes whose differential expression was in response to a common cellular signal that was perturbed in the subset of tumors. We then set out to test if this approach would enable us to find the signaling pathways that underlie different prognosis of the breast cancer patients in the Sage Bionetworks-DREAM Breast Cancer Prognosis Challenge.
Based on the TCGA microarray data and our modular analysis results, we trained 159 Bayesian logistic regression models 4, one for each of the aforementioned subgraphs. This approach enabled us to determine if a module was perturbed (“hit”) in a breast cancer sample from the DREAM 7 challenge. Moreover, the results allowed us to represent a tumor sample in a modular space with a reduced dimension. Conditioning on the status of each gene module, we dichotomized samples into two groups (“hit” vs “not hit”) and performed Kaplan-Meier survival analysis to determine if the module is predictive of clinical outcome of the patients. We identified 20 modules that led to significantly different survival outcomes (p-value < 0.05).
Using the states of all modules or 20 informative modules as input features, we explored different survival models, including the Cox model, the generalized linear model (the “glmnet” package from the CRAN), and the generalized boosted model (the “gbm” package from the CRAN). When the above models were evaluated individually, they led to prediction concordance indices of ~0.67 and ~0.66 on training and testing datasets respectively. We also explored using a machine learning model, RankSVM 5, to predict the rank order of patients based on the modular status. In general, the performances of individual RankSVM and survival models were inferior to the leading ensemble models submitted by other groups.
After carefully studying the leading models, we noticed that the ensemble model developed by the Attractome group was particularly suitable to incorporate the modular features identified by our approach. In essence, the Attractome group also aimed to identify gene modules and to project tumor samples into a modular space for modeling, although their approach was based on different assumptions and therefore derived different modules. Moreover, the ensemble learning approach adopted by the Attractome team also addressed the overfitting problems confronting single-model approaches. Therefore, we adopted an Attractome model (syn1421196) and evaluated if our modular features could enhance the model. We performed a feature selection using a greedy forward-search approach in a series of cross validation experiments. We identified 6 features that were capable of enhancing the performance of the Attractome model and integrated them into a hybrid model, referred to as the PittAttractomeHyb.2 model. When tested on the hold out METABRIC data, the model performed well, indicating that these modular features are indeed informative of clinical outcomes.
In summary, the major contributions of this study include: 1) A novel integrative approach that is capable of identifying gene modules that likely represent units of cellular responses to perturbed signaling. 2) Certain modules identified from the TCGA data are predictive of clinical outcome when applied to the METABRIC data; therefore the information encoded by these modules is generalizable. 3) Integrating informative modular features with ensemble predictive models enhances the capability of predicting clinical outcomes of breast cancer patients. Ongoing efforts concentrate on reverse engineering the signaling pathways that underlie these predictive modules.
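The three acceptance conditions listed above are easy to state as a check on a candidate (gene module, tumor subset) pair. The sketch below only verifies the conditions; it does not attempt to reproduce the graph algorithm that searches for such subgraphs, which the write-up does not describe. The function name and the data layout (a dict from tumor ID to its set of differentially expressed genes) are assumptions for illustration.

```python
def satisfies_module_conditions(diff_expr, genes, tumors,
                                min_fraction=0.75, min_tumors=25):
    """Check the three subgraph conditions for a candidate module.

    diff_expr: dict mapping tumor id -> set of differentially expressed genes
    genes:     candidate gene module
    tumors:    candidate tumor subset
    """
    # Condition 3: the subgraph includes more than 25 tumors.
    if len(tumors) <= min_tumors or not genes:
        return False
    # Condition 1: each gene is differentially expressed in >= 75% of tumors.
    for gene in genes:
        hits = sum(1 for t in tumors if gene in diff_expr.get(t, set()))
        if hits < min_fraction * len(tumors):
            return False
    # Condition 2: in each tumor, >= 75% of module genes are differentially expressed.
    for tumor in tumors:
        hits = sum(1 for g in genes if g in diff_expr.get(tumor, set()))
        if hits < min_fraction * len(genes):
            return False
    return True


# Toy example: 28 tumors, 4 genes, each tumor missing exactly one gene in rotation.
genes = ["g0", "g1", "g2", "g3"]
tumors = [f"t{i}" for i in range(28)]
diff_expr = {t: set(genes) - {genes[i % 4]} for i, t in enumerate(tumors)}
print(satisfies_module_conditions(diff_expr, genes, tumors))
```

In the toy example every gene is present in 21 of 28 tumors and every tumor carries 3 of 4 genes, so both 75% thresholds are met exactly.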
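To make the dichotomize-then-compare step concrete, here is a minimal Kaplan-Meier estimator in plain Python, applied separately to the “hit” and “not hit” groups. The function names and toy inputs are illustrative assumptions, not the team's code. The write-up reports a p-value but does not name the significance test used to compare the two curves, so that step is left out here.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    events[i] is True if the death was observed, False if the patient
    was censored at times[i].
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients who share this time point.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


def curves_by_module_status(times, events, hit):
    """Split patients by module status and estimate one curve per group."""
    groups = {True: ([], []), False: ([], [])}
    for t, e, h in zip(times, events, hit):
        groups[bool(h)][0].append(t)
        groups[bool(h)][1].append(e)
    return {status: kaplan_meier(ts, es) for status, (ts, es) in groups.items()}


curves = curves_by_module_status(
    times=[1, 2, 3, 4],
    events=[True, True, True, True],
    hit=[True, True, False, False],
)
print(curves[True], curves[False])
```

A module is a candidate prognostic feature when the two curves separate clearly; deciding whether the separation is significant requires a statistical test on top of this sketch.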
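The greedy forward search mentioned above can be sketched generically. In the sketch below, `score_fn` stands in for whatever cross-validated performance estimate is used (for example, a concordance index averaged over folds); it, the function name, and the toy scoring function are assumptions made for illustration, not the PittTransMed code.

```python
def greedy_forward_selection(candidates, score_fn, base_features=()):
    """Add one feature at a time, keeping each addition only if the score improves.

    candidates:    features that may be added
    score_fn:      callable taking a list of features and returning a score
                   (higher is better), e.g. a cross-validated concordance index
    base_features: features already in the model being extended
    """
    selected = list(base_features)
    best = score_fn(selected)
    remaining = list(candidates)
    while remaining:
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(scored, key=lambda pair: pair[0])
        if top_score <= best:
            break  # no remaining feature improves the model
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best


# Toy scoring function: each hypothetical module shifts the score by a fixed amount.
gains = {"module_a": 0.03, "module_b": 0.02, "module_c": -0.01}
toy_score = lambda feats: 0.70 + sum(gains[f] for f in feats)
print(greedy_forward_selection(list(gains), toy_score))
```

Because each step re-evaluates every remaining candidate, the cost grows with the square of the number of candidates, which is manageable for a pool of 159 modules.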
1 Hanahan, D. & Weinberg, R. A. Hallmarks of cancer: the next generation. Cell 144, 646-674, doi:10.1016/j.cell.2011.02.013 (2011).
2 Jin, B. & Lu, X. Identifying informative subsets of the Gene Ontology with information bottleneck methods. Bioinformatics 26, 2445-2451, doi:10.1093/bioinformatics/btq449 (2010).
3 Richards, A. J. et al. Assessing the functional coherence of gene sets with metrics based on the Gene Ontology graph. Bioinformatics 26, i79-87, doi:10.1093/bioinformatics/btq203 (2010).
4 Friedman, J., Hastie, T. & Tibshirani, R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J Stat Softw 33, 1-22 (2010).
5 Joachims, T. in ACM Conference on Knowledge Discovery and Data Mining (KDD).
November 1, 2012
Please join all of us at Sage Bionetworks and DREAM in congratulating Wei-yi Cheng and the entire Attractor Metagenes team for their winning model (Syn ID#1444444): this training model received the top overall score for the Metabric phase of this Challenge. It will be so interesting to see how this model (and all the others) perform against the final validation data set that is currently being produced! Wei-yi and his team have been invited to speak at the upcoming DREAM conference taking place in San Francisco (Nov 12-16). Please read on to hear from Wei-Yi Cheng who submitted the winning model on behalf of his team.
Our models use the information provided by the “attractor metagenes” to evaluate their prognostic value in breast cancer. We have previously applied an iterative attractor finding algorithm on rich expression datasets from multiple cancer types, identifying three universal (pan-cancer) attractors, which are the mitotic chromosomal instability attractor (CIN), the mesenchymal transition attractor (MES), and the lymphocyte-specific attractor (LYM). We like to think of these three attractors as “bioinformatic hallmarks of cancer.” In our top submission (syn1444444) we used precisely these same three metagenes as found from our previous unsupervised multi-cancer analysis.
Our experience in the Challenge was particularly rewarding as we were successively confirming that each of these attractors would indeed be helpful towards improving the breast cancer prognostic model, and we were reporting our observations to the other participants using the Synapse forum. We first found that the CIN attractor is highly prognostic, as evidenced by the fact that it was essentially recreated after ranking the individual genes in terms of their corresponding concordance index. The other two main attractors were also found to be highly prognostic after being properly conditioned. For example, the MES attractor is most prognostic in early stage breast cancer (no positive lymph nodes and tumor size less than 30 mm). On the other hand, the LYM attractor is protective when ER and HER2 expressions are low, while it has the reverse effect on prognosis when there are multiple positive lymph nodes. We also identified a few additional prognostic metagenes, such as the SUSD3-FGD3 metagene that is composed of two genomically adjacent genes, the ZMYND10-LRRC48-CASC1 metagene, the PGR-RAI2 metagene, the HER2 amplicon attractor metagene at chr17q11.2 – q21 and a chr17p12 meta-CNV. We compiled an attractor metagene space using these metagenes, along with TP53 and VEGFA, known to be associated with cancer, and then used them as a molecular feature space for feature selection.
In our top submission (syn1444444), we applied several subclassifiers to maximize the information used and to build a robust, generalizable model, including Cox regression, the generalized boosted model (GBM), and K‑nearest neighbor (KNN). We also used the Akaike information criterion (AIC) on the features passed to Cox regression to avoid overfitting. We applied the AIC-based Cox regression and the GBM on the metagene space and the clinical features, respectively, and the KNN model on the combined metagene and clinical feature space. We combined the predictions of each subclassifier by directly summing up the linear predictors generated by the subclassifiers. For the KNN model, because the prediction is the survival time, we summed the reciprocal of the prediction instead. We also included two subclassifiers using mixed molecular and clinical features. In particular, one of them used all three of the universal metagenes (CIN, LYM and MES properly conditioned), the SUSD3-FGD3 metagene, and the clinical features age, radiation therapy, and chemotherapy. We found that such a simple model provides accurate prognosis, and can be treated independently of the other subclassifiers.
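The combination rule described above (sum the linear predictors, and use the reciprocal of the KNN survival-time prediction so that it points in the same "higher means riskier" direction) can be illustrated with a short Python sketch. The function name and the toy numbers are assumptions for illustration; the actual submission was an R model whose subclassifier outputs are not reproduced here.

```python
def combine_predictions(linear_predictors, knn_survival_times):
    """Sum per-patient risk scores across subclassifiers.

    linear_predictors:  list of lists, one per subclassifier (e.g. Cox, GBM),
                        each holding one risk score per patient
    knn_survival_times: predicted survival time per patient (must be positive);
                        its reciprocal is added so longer predicted survival
                        lowers the combined risk
    """
    combined = []
    for i, predicted_time in enumerate(knn_survival_times):
        risk = sum(scores[i] for scores in linear_predictors)
        risk += 1.0 / predicted_time
        combined.append(risk)
    return combined


# Two patients, two linear subclassifiers, one KNN survival-time prediction each.
print(combine_predictions([[0.5, 1.0], [0.25, 0.5]], [2.0, 4.0]))
```

Since the concordance index depends only on the ordering of patients, the absolute scale of the summed score does not matter for ranking, although the relative scales of the subclassifiers do affect how much each one contributes.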
In our submissions we did not make use of any code from other Challenge participants. We submitted the winning model prior to the October 15 model submission deadline, and the full code of this model was accessible to other participants as soon as it was posted to the leaderboard and thereafter.
October 26, 2012
Rich Savage, a participant in the Sage Bionetworks / DREAM Breast Cancer Challenge, has written a great summary of his experiences and recommendations for future challenges on his blog, 21st Century Scientist. Key in his thinking are the issues in balancing competition and collaboration in making an interesting challenge experience.
October 5, 2012
Please join all of us at Sage Bionetworks and DREAM in congratulating Wei-yi Cheng and the entire Attractor Metagenes team for their October 1 Leaderboard Winner Achievement. Attractor Metagenes was also the September 1 leaderboard winner. This “repeat performance” is especially impressive given that it was achieved working with two different versions of the Metabric data! Please read on to hear from Wei-Yi Cheng, who submitted the winning model on behalf of his team.
Dear fellow BCC challenge participants and others here on Synapse,
I would like to thank once more the organizers for the opportunity that they again give to the Attractor Metagenes Team to share some of our methods and findings. It has been another exciting month, in which many new ideas have been shared among us. It is inspiring that through these discussions on the Challenge forum, we all are gaining a better understanding of different perspectives on the data and the disease itself.
The main attribute responsible for our continuing high score is that we are making use of the three strongest attractor metagenes representing universal (multi-cancer) biomolecular events: the mitotic chromosomal instability attractor metagene; the mesenchymal transition attractor metagene; and the lymphocyte-specific attractor metagene. We are particularly excited because these metagenes are present in all solid cancers, and therefore can be used as “pancancer” biomarkers, which will be more robust than individual oncogenes. We have now posted descriptions of each of these three main attractors as items in the Synapse forum. So far we have not incorporated any code from other submissions, but we will certainly do so if we deem it appropriate, giving them credit prominently. Similarly, we welcome others to make use of our code, which is always freely and readily available. The functions and metagene lists used in all our submissions are incorporated in an R package downloadable through the link given in the source code we uploaded on the leaderboard. We have also uploaded an R package for finding attractor metagenes, available under Synapse ID syn1123167 for anyone interested to use, not only in breast cancer, but in all types of cancer.
We understand that the main objective in Phase 2 of the Challenge is to build a generalizable model that will work well when evaluated against the Oslo Validation data set. Based on our experience in both Phases, we believe that achieving a generalizable model requires making use of survival data that have been “purified” by excluding causes of death unrelated to the disease itself. We understand, however, that this is difficult to achieve in general and even more so in the case of the Oslo-Val data set. And because Phase 2 uses the same overall survival data as in the Oslo-Val, we modified our models to include lots of clinical features that we do not think would be otherwise required for the development of a sharp and generalizable prognostic model.
To elaborate on this last point: We are excited about building sharp, insightful and powerful “minimalist” disease models that could be used for biomarker products making use of a very small number of features. For example, we believe that we have identified such a model in breast cancer that makes use of nothing other than our three attractor metagenes mentioned above, tumor size, number of lymph nodes affected, and one more protective feature that we discovered as a result of our participation in the Challenge: The metagene defined as the average of the genes SUSD3 and FGD3, which, as we observed, are genomically adjacent at chr9q22.31. We know that simultaneous silencing of these two genes is strongly associated with bad prognosis, but we are not certain about the underlying biological mechanism (it may not be the result of a CNV). We suspect that this simultaneous silencing is one of several triggers required for ER-negativity, perhaps the most important one. This is an interesting research question and we hope that other Challenge participants with expertise in biology and medicine will join us in the effort to decipher this important mechanism in breast cancer!
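As a small illustration of the metagene definition mentioned above (the average of the genes SUSD3 and FGD3), here is a Python sketch. The function name, the data layout, and the expression values are made up for the example; only the averaging rule comes from the text.

```python
def metagene_score(expression, genes=("SUSD3", "FGD3")):
    """Average the expression values of the genes that define a metagene.

    expression: dict mapping gene symbol -> expression value for one sample
    """
    return sum(expression[g] for g in genes) / len(genes)


# Hypothetical expression values for one tumor sample.
sample = {"SUSD3": 6.0, "FGD3": 8.0, "TP53": 5.5}
print(metagene_score(sample))
```

Computed per sample, this single number can then be used as one feature alongside the three attractor metagenes, tumor size, and the number of affected lymph nodes.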
Graduate Research Assistant, Ph.D. Candidate in Electrical Engineering
Genomic Information Systems Laboratory,