FasterCures Webinar On Crowdsourcing and DREAM Challenges

“The way biomedical research is carried out is changing fundamentally,” Sage Bionetworks President Stephen Friend declared at the beginning of a webinar about the crowdsourced computational challenges Sage is facilitating in partnership with the DREAM (Dialogue on Reverse Engineering Assessment and Methods) project that originated at IBM. Friend laid out five opportunities he believes are giving rise to new ways to generate and analyze data and to support new research models:

- It’s now possible to generate massive amounts of human “omics” data.
- Network modeling approaches for diseases are emerging.
- Information technology infrastructure and cloud computing capacity allow an open approach to biomedical problem solving.
- There’s an emerging movement for patients to control their own sensitive information, enabling them to share it.
- Open social media allows citizens and experts to use gaming to solve problems.

“The usual rule of anointed experts being the only ones who can solve problems has really been shattered,” said Friend.

For several years, Sage has been grappling with how to bring about a better understanding of the complexity of biology, given these trends. One initiative central to its efforts has been the creation of a technology platform for data sharing and analysis called Synapse, built on the model of GitHub from the open-source software world, which allows distributed projects to get done and provides the foundation for running the DREAM Challenges.

Friend noted that computational biology has been driven by crowdsourcing for a long time, and challenges like those DREAM has run for many years have been integral to its successes. There are increasingly large and powerful sets of data in the public domain, and putting them out for many people (some from outside the field of biology) to examine, make predictions from, and evaluate without bias is critical to solving today’s complex problems in biology. Data is getting so complex that it’s impossible for any single researcher or institution to analyze it effectively. As John Wilbanks, Sage’s Chief Commons Officer and a FasterCures Senior Fellow, noted, “One of the hardest things to do in the emerging Big Data world is to get your data analyzed.”

An important aim of these challenges is to foster a new culture in research. As Friend argues, “We have a serious need not just to solve specific problems, but … to build communities so that people begin to think of each other as colleagues and collaborators.” DREAM Challenges are carefully constructed to provide opportunities for publications in journals and for other forms of recognition that are important to researchers, often more important than the promise of a monetary prize.

The first of the four challenges run to date by Sage and DREAM (along with partners from academia, industry, government, and patient groups) was the Breast Cancer Prognosis Challenge, created to produce a computational model that accurately predicts breast cancer survival. The winning team came from the academic lab that invented the MP3 format for digital audio, bringing its expertise in data compression to the task. Hundreds of teams comprising thousands of individuals have participated, and a number of publications have resulted, along with other opportunities for professional advancement for “solvers.”

Challenges currently open include:

- The Somatic Mutation Calling Challenge, to predict cancer-associated mutations from whole-genomic sequencing data;
- The Rheumatoid Arthritis Responder Challenge (in partnership with the Arthritis Foundation, among others), to predict which patients will not respond to anti-TNF therapy – a validation clinical trial could follow if a powerful classifier emerges from the Challenge; and
- The Alzheimer’s Disease Big Data Challenge, which seeks to predict early AD-related cognitive decline and the mismatch between high amyloid levels and cognitive decline. Massive amounts of data in the public domain have been aggregated, collated, massaged, and curated for the task.

Two more are set to open this summer, in partnership with the Broad Institute and MD Anderson Cancer Center, and several more are being considered for launch by the end of 2014. All stakeholders – including and perhaps especially patient groups – are invited to participate by proposing ideas for challenges, contributing data, and recruiting teams to participate. The Sage-DREAM Challenges are looking for partners who want not only to find the answers to tough questions in their fields, but also to help create the conditions for the real collaboration necessary to bring about “the next generation of biomedical research.”

For more information on how to get involved with an open DREAM Challenge, click here.

View webinar slides and recording

(Cross posted from http://fastercures.tumblr.com/post/81603549119/crowdsourcing-data-challenges-to-speed-the-search-for)

ICGC-TCGA SMC DREAM Challenge highlighted in Nature Genetics

A great correspondence regarding the ICGC-TCGA DREAM Somatic Mutation Calling (SMC) Challenge was published in Nature Genetics today¹. The organizers highlight the unique nature of this challenge, including its potential impact on the broad research community, the ability of challenge infrastructure to assist in the peer-review process, and the resulting ‘living benchmark’ for the bioinformatics community.

To get more information or to sign up for the Somatic Mutation Calling Challenge, visit the SMC Challenge Project in Synapse or watch the kickoff webinar.


¹Boutros P, Ewing A, Ellrott K, Norman T, Dang K, Hu Y, Kellen M, Suver C, Bare C, Stein L, Spellman P, Stolovitzky G, Friend S, Margolin A, Stuart J. Global optimization of somatic variant identification in cancer genomes with a global community challenge. Nature Genetics 46, 318–319 (2014). doi:10.1038/ng.2932

ICGC-TCGA Mutation Calling Challenge Webinar

The ICGC-TCGA DREAM Genomic Mutation Calling Challenge (open for participation Nov 2013 — Summer 2014) is an international effort to improve standard methods for identifying cancer-associated mutations and rearrangements in whole-genome sequencing (WGS) data. The goal of this somatic mutation calling (SMC) Challenge is to identify the most accurate mutation-detection algorithms and to establish the state of the art. Algorithms in this Challenge must take WGS data from tumour and normal samples as input and output mutation calls associated with cancer.
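To make that input/output contract concrete, here is a toy Python sketch of the core statistical comparison a somatic SNV caller performs at a single genomic position. The allele counts, the Fisher exact test, and the significance threshold are illustrative assumptions, not part of the Challenge specification or any entrant’s pipeline.

```python
# Toy sketch of a somatic SNV call at one position; counts, test,
# and threshold are illustrative assumptions only.
from scipy.stats import fisher_exact

def call_somatic_snv(normal_ref, normal_alt, tumour_ref, tumour_alt,
                     alpha=1e-4):
    """Flag a candidate somatic SNV when the alternate allele is
    significantly over-represented in tumour reads relative to the
    matched normal."""
    _, p_value = fisher_exact([[normal_ref, normal_alt],
                               [tumour_ref, tumour_alt]])
    tumour_vaf = tumour_alt / float(tumour_ref + tumour_alt)
    normal_vaf = normal_alt / float(normal_ref + normal_alt)
    return tumour_vaf > normal_vaf and p_value < alpha

# 60 ref / 0 alt reads in the normal vs. 45 ref / 15 alt in the tumour.
print(call_somatic_snv(60, 0, 45, 15))  # True: candidate somatic SNV
```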

In this January 29, 2014 webinar, Challenge participants were invited to hear presentations and participate in a live Q&A session about the Challenge. The webinar video consists of the following three sections:

  1. Background and motivation for the Challenge (Paul Boutros: SMC Challenge Leader)
  2. Demo of Challenge web services to show you how to participate (Chris Bare: Sage Bionetworks)
  3. Answering your questions in real-time


For more information about all DREAM Challenges, please visit the DREAM web presence on Synapse.

New Wiki History feature

Happy New Year, Synapse users!

The Synapse Team is proud to announce Wiki Histories. Synapse Wikis will now capture each saved update and allow users to restore previous versions of a Wiki page. This enables a more fine-grained archive of a Synapse Project’s history and more collaborative editing of Wiki pages. See below the fold for screenshots of a new Wiki History after edits to a Wiki page.
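For users who script against Synapse, here is a minimal sketch of how one might list a page’s saved versions. It assumes the V2 wiki history REST endpoint; the path, the response fields, and the project ID are assumptions to verify against the Synapse REST API documentation before use.

```python
# Minimal sketch, assuming the V2 wiki history REST endpoint; verify
# the path and response fields against the Synapse REST API docs.
import synapseclient

syn = synapseclient.Synapse()
syn.login()  # uses credentials from your Synapse config

project_id = "syn123456"        # hypothetical project that owns the wiki
wiki = syn.getWiki(project_id)  # the project's root wiki page

# Assumed endpoint for the page's version history.
history = syn.restGET("/entity/%s/wiki2/%s/wikihistory"
                      % (project_id, wiki.id))
for version in history.get("results", []):
    print(version)
```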

Happy researching!

-The Synapse Team


Screenshot before making a Wiki edit


Screenshot after making a few edits (new entries in Wiki History)


Introducing Synapse Teams

The Synapse development team is happy to announce Synapse Teams.

Synapse Teams allow users to define lists of collaborators with whom they can easily share Synapse Projects, Files, and Wikis. You can create a new Team under the new “My Teams” section of your home page and can search for Teams by name when sharing Synapse content with others. We are also releasing the ability for users both to invite others to join a Team and to request permission to join a Team themselves.

Synapse Teams will also allow participants in DREAM Challenges to more easily build content with Team members and to associate their Challenge submissions with a defined set of Team members. The Synapse development team is excited to start building more collaborative features into the system and believes that Teams will provide a basis for many of these future additions. For example, we’re looking to add the ability to send messages to a Team, and to use Team membership lists to give people credit for work tracked by Synapse in places like Challenge leaderboards and provenance graphs.
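As a rough sketch of how this might look from the Synapse Python client (the client calls below are assumptions to verify against the current synapseclient documentation, and the project ID is hypothetical):

```python
# Sketch only: create a Team and share a project with it read-only.
import synapseclient
from synapseclient import Team

syn = synapseclient.Synapse()
syn.login()

team = syn.store(Team(name="Example Analysis Team",
                      description="Collaborators on our Challenge entry"))

# Grant the Team read access to a hypothetical project.
syn.setPermissions("syn123456", principalId=team.id, accessType=["READ"])
```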

Start creating your Synapse Teams today and please let us know how the feature is working or could be improved by email at synapseInfo@sagebase.org or by posting on the support forum.

Synapse moves out of Beta

It is our pleasure to announce that Synapse has moved from a Beta offering to a production piece of software. To celebrate, the Synapse website has been given a major facelift – now donning a sleeker Bootstrap-style interface and taking advantage of the full browser width. This will allow Synapse to build out more mobile-friendly content in the future.

Also new is that Synapse Projects are now organized using tabs to ease navigation through user-generated content. The two most prominent tabs are Wiki and Files – allowing users to craft their scientific narratives in a space immediately adjacent to their scientific assets, such as data, code, and the provenance linking those assets together. This design change will also allow other tabs to be introduced as further Synapse functionality is added.

We would like to thank all of our users who helped us kick the tires during our Beta phase. We will continue to add new features and roll out new versions of the service - and Synapse will continue to be a free service to the research community. As always – we welcome feedback about Synapse – and encourage suggestions for future directions.

- The Synapse Development Team

Focus on Pan-Cancer Analysis

The TCGA Pan-Cancer working group has published a set of papers in Nature Publishing Group (NPG) journals exploring the DNA, chromatin, and RNA alterations across a diverse set of cancers. Much of the working group’s analysis leveraged Synapse services to share and evolve data, results, and methodologies while performing integrative analysis, as described in this commentary. A nice summary of these studies, with links to the full papers, is now available on the Nature website.

Kudos to the TCGA Pan-Cancer working group for their great research. The Synapse Team is excited to see researchers continue to benefit from their use of the platform. We hope the TCGA Pan-Cancer group can be used as a template for future large-scale collaborative research efforts.

Upgrade to Synapse markdown engine

Hello Synapse users!

The Synapse development team is pleased to announce the release of its own markdown processor. All previous syntax will continue to be supported, but users should see improved page-rendering performance as well as access to the following features:

- Superscripts, subscripts, and strikethrough text in WikiPages
- Support for LaTeX-style equations, both inline and as blocks, via MathJax integration (see the example after this list)
- Additional options for table styling (centering, borders, etc.)
- User-specified languages for fenced code blocks to enable syntax highlighting
- References and footnotes in Synapse WikiPages (via the reference widget)
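As a small illustration of the new equation support, a block-level expression such as the lasso objective below (an arbitrary example of MathJax-renderable LaTeX, not Synapse-specific syntax) can now appear in a WikiPage:

$$
\hat{\beta} = \arg\min_{\beta} \left( \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 \right)
$$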

More details are available in the formatting guide, which is accessible while directly editing WikiPages.

-The Synapse Team

Teams WarwickDataScience and UT_CCB top the real-time leaderboard for NIEHS-NCATS-UNC DREAM Toxicogenetics sub-Challenge 1

Below are blogposts from Teams WarwickDataScience and UT_CCB, the two top-scoring teams for sub-Challenge 1 of the NIEHS-NCATS-UNC DREAM Toxicogenetics Challenge. Sub-Challenge 1 asked participants to model cytotoxicity across cell lines based on genomic information. We were excited to see 49 teams submit nearly 1000 models for scoring to the real-time leaderboard, which was open for 6 weeks (you can see the final leaderboard for this sub-Challenge here: https://www.synapse.org/#!Synapse:syn1761567/WIKI/56154)!

We awarded $250 to each of the top two teams by mean rank as determined by RMSE and to each of the top two by mean rank as determined by Pearson correlation (four prizes in total). Below are blogposts from the top-scoring teams: Rich Savage from Team WarwickDataScience and Yonghui Wu from UT_CCB each share a little about their respective teams and their winning models.
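To make the scoring criterion concrete, here is a small sketch of how a mean rank over the two metrics could be computed (our own illustration, not the organizers’ actual scoring code):

```python
# Rank submissions by RMSE and by Pearson correlation, then average
# the two ranks; a lower mean rank is better.
import numpy as np
from scipy.stats import pearsonr, rankdata

def mean_ranks(y_true, predictions):
    """predictions: dict mapping team name -> predicted vector."""
    teams = list(predictions)
    rmse = np.array([np.sqrt(np.mean((predictions[t] - y_true) ** 2))
                     for t in teams])
    corr = np.array([pearsonr(predictions[t], y_true)[0] for t in teams])
    ranks = (rankdata(rmse) + rankdata(-corr)) / 2.0  # low RMSE, high r win
    return dict(zip(teams, ranks))
```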

Team WarwickDataScience Blogpost

WHO IS YOUR TEAM?

Our team (WarwickDataScience) consists of Rich Savage and Jay Moore. We’re based at the University of Warwick’s Systems Biology Centre (in the UK), where Rich is an associate professor and Jay leads the bioinformatics engineering group. Rich also holds a joint appointment with the Warwick Medical School.

WHY DID YOU JOIN THE CHALLENGE?

We joined the DREAM Tox challenge because we thought it was an interesting scientific problem, and also because we have both developed interests in data science challenges of this nature and were keen to start a data science team here at Warwick.

WHAT WAS YOUR WINNING MODEL?

Over the course of the Challenge we identified several considerations that were demonstrably important to building high-scoring models. Our basic approach was to build a regression model for each compound using a Random Forest, then to couple these together to do multi-task learning. We experimented with various other regression models, but Random Forests gave good performance and were very fast and easy to use, leaving us more time to focus on other aspects of the modelling.

Data-wise, we used some of the covariate features (population, gender) along with a small number of informative SNPs for each of the 106 compounds. Finding the SNPs was a significant computational challenge, which we ended up solving using a fast, if somewhat quick-and-dirty, approach to GWAS analysis. Our initial work used only the X-chromosome SNPs, but we found we obtained similar results using any single chromosome, or combination of chromosomes. We speculate that this may be because we’re finding some widespread genomic signature, but we don’t (yet) have good evidence to confirm this. We also experimented with the RNA-seq data, and with a microarray data set sourced externally to the challenge, but ran out of time; from cross-validation we think these data may also have helped.

It was clear to us from early in the Challenge that some of the compounds had highly similar toxicity results.  We therefore coupled our individual regression models together by using their predictions as input to a second set of Random Forest classifiers.  This gave us a fast, effective way of sharing information between the regression models for different compounds.
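A rough sketch of this two-stage idea (our reading of the post, with hypothetical shapes and tree counts; we use second-stage regressors even though the post says classifiers, since the targets are continuous):

```python
# Stage 1: one Random Forest per compound. Stage 2: forests that take
# all stage-1 predictions as input, sharing signal across compounds.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_two_stage(X, Y, n_trees=200):
    """X: (n_cell_lines, n_features); Y: (n_cell_lines, n_compounds)."""
    stage1 = [RandomForestRegressor(n_estimators=n_trees).fit(X, Y[:, j])
              for j in range(Y.shape[1])]
    # In-sample stage-1 predictions for brevity; out-of-fold predictions
    # would reduce overfitting in practice.
    P = np.column_stack([m.predict(X) for m in stage1])
    stage2 = [RandomForestRegressor(n_estimators=n_trees).fit(P, Y[:, j])
              for j in range(Y.shape[1])]
    return stage1, stage2

def predict_two_stage(stage1, stage2, X_new):
    P = np.column_stack([m.predict(X_new) for m in stage1])
    return np.column_stack([m.predict(P) for m in stage2])
```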

Finally, discussions on the forum and in the webinar made it clear that there was a difference in distribution between the training set and the leaderboard test set.  We decided to try a simple scheme to correct for this bias by identifying training items from the tails of the target distributions and adding duplicate examples to the training set.  While in some ways this is a bit of a hack, we felt it was a reasonable thing to do, given the way Random Forest works.  It resulted in a significant boost to our final scores.
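A minimal sketch of that duplication scheme (the percentile cutoffs and copy count are our assumptions, not the team’s actual values):

```python
# Duplicate training items from the tails of the target distribution so
# the Random Forest sees extreme toxicity values more often.
import numpy as np

def duplicate_tails(X, y, lower_q=5, upper_q=95, copies=1):
    lo, hi = np.percentile(y, [lower_q, upper_q])
    tails = (y < lo) | (y > hi)
    X_aug = np.vstack([X] + [X[tails]] * copies)
    y_aug = np.concatenate([y] + [y[tails]] * copies)
    return X_aug, y_aug
```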

WHAT HAVE YOU LIKED ABOUT THE DREAM CHALLENGE/HOW WOULD YOU LIKE TO SEE IT EVOLVE IN THE NEXT SEASON?

We liked that it was a challenging problem, particularly trying to use the SNPs. We also liked that it was focused on a real-world problem.

Next year, it would be nice if the leaderboard and final test predictions were part of a combined submission, as is the case in many other challenges (e.g., Kaggle, Netflix). For example, the test set can be randomly subdivided by the organizers, with half the items used to compute leaderboard performance and the other half used for the final scoring. The user simply submits predictions for all test items, without knowing the random partition. We’d prefer this to the retraining stage, which can be a bit fiddly and error-prone (for example, in the Breast Cancer challenge last year, many final models did not evaluate correctly after the final retraining). It’s not clear to us that we gain much more scientific understanding from the retraining process.
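The suggested scheme is easy to sketch: participants submit predictions for every test item, while the organizers keep a private random partition of those items into a leaderboard half and a final-scoring half (sizes and seed below are hypothetical):

```python
# Hypothetical sizes; the seed and partition stay secret with organizers.
import numpy as np

rng = np.random.RandomState(42)
n_test = 1000
perm = rng.permutation(n_test)
leaderboard_idx = perm[: n_test // 2]  # scored throughout the Challenge
final_idx = perm[n_test // 2:]         # scored once, after the deadline
```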

Team UT_CCB Blogpost

Dear fellow NIEHS-NCATS-UNC-Dream Challenge participants and organizers,

This is Yonghui Wu, representing the participating team from the Center for Computational Biomedicine (CCB) at The University of Texas School of Biomedical Informatics at Houston (UT-SBMI). Our team consists of two postdoctoral fellows (myself and Yunguo Gong), a research assistant (Liang-Chin Huang), and five CCB faculty members: Drs. Jingchun Sun, Jeffrey Chang, Trevor Cohen, W. Jim Zheng, and Hua Xu, who directs the CCB. It is our great honor to be highlighted as a top team on September 4th in the 2013 NIEHS-NCATS-UNC DREAM Toxicogenetics Challenge. Two of our submissions, one by myself and one by my teammate Liang-Chin Huang, ranked second by mean rank and by RMSE, respectively.

At CCB, we work on various projects to develop advanced informatics methods and tools to support biomedical research and clinical practice. One of our focus areas is translational bioinformatics. We are actively working on data extraction and analysis approaches that bridge clinical practice data with biomedical experimental data, to facilitate drug discovery and promote personalized medicine. We are interested not only in identifying genetic variations that are associated with drug responses (pharmacogenomics), but also in building models that combine both genomic and clinical variables to predict drug responses, which is exactly what this Challenge asks for.

In this Challenge, we examined different groups of features, including the covariate data, compound chemical features, mRNA expression data, and SNP data. The critical step in this Challenge is to extract, from a large amount of data, the features that really affect the compound responses. We used different statistical tools to analyze the data and determine the strongest features associated with the compound responses. In our top submissions (syn2024493 and syn2024471), we investigated different regression models, including traditional linear regression, logistic regression, lasso-penalized regression, and the elastic net. We also tried different strategies for combining these models to improve the predictions.
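For readers who want a feel for these model families, here is a small sklearn-based sketch (stand-ins with made-up hyperparameters, not the team’s pipeline; logistic regression is omitted since the targets here are continuous):

```python
# Fit linear, lasso-penalized, and elastic-net regressions, then
# combine their predictions by simple averaging.
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, ElasticNet

def fit_and_average(X_train, y_train, X_test):
    models = [LinearRegression(),
              Lasso(alpha=0.1),
              ElasticNet(alpha=0.1, l1_ratio=0.5)]
    preds = [m.fit(X_train, y_train).predict(X_test) for m in models]
    return np.mean(preds, axis=0)
```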

This was the first time we could test our models on a real-time leaderboard, and we learned a lot from others during the Challenge. I would like to thank the Challenge organizers, as well as everyone who contributed funding, data, or infrastructure to make this Challenge possible. The Challenge gave us a great chance to test our computational methods on real genetic data sets.

Yonghui Wu

Postdoc Research Fellow

Center for Computational Biomedicine

The University of Texas School of Biomedical Informatics at Houston

HPN Challenge teams HDSystems, IPNet and Tongii write about their winning models for Sub-Challenges 1A and 1B

Below are blogposts from three winning teams (HDSystems, IPNet, and Tongii) from sub-Challenges 1A and 1B of the Heritage Provider Network (HPN) DREAM Breast Cancer Network Inference Challenge. Sub-Challenge 1A asked participants to infer a causal network from experimental time-course breast cancer proteomic data. Sub-Challenge 1B asked participants to construct a causal network from in silico time-course data generated by a state-of-the-art dynamical model of signaling. We awarded cash prizes to the first three teams within each sub-Challenge with a model that scored two standard deviations above the null model. Below, three of the winning teams (HDSystems for sub-Challenge 1A, and IPNet and Tongii for sub-Challenge 1B) share a little about themselves and the rationale behind their winning models.

You can check out the current leaderboards for these sub-Challenges here:

Sub-Challenge-1A: https://www.synapse.org/#!Synapse:syn1720047/WIKI/56830

Sub-Challenge-1B: https://www.synapse.org/#!Synapse:syn1720047/WIKI/56850

Team HD Systems (syn2109051)

Dear fellow HPN challenge participants and organizers,

This is Ruth Großeholz, along with Oliver Hahn and Michael Zengerling, at Ruprecht-Karls University in Heidelberg, Germany. We are happy to be one of the winning teams of the sub-Challenge 1A leaderboard incentive prize for the experimental network, and we would like to thank the organizers for the opportunity to introduce ourselves and our ideas.

The three of us are Master’s students in Professor Ursula Kummer’s group for Modeling of Biological Processes at BioQuant Heidelberg, participating in this Challenge to gain working experience with network inference. For us, the Challenge offers a chance to work outside the controlled conditions of a practical course and to expand our methodological knowledge. Before this Challenge, network inference was uncharted territory, as it is not covered in our Master’s program. So far, it has been a great experience to work with such a rich data set.

Since we come from a biological background and know from a number of practical courses that a model is only as good as the information backing it, our idea was to build one model, including all the edges required for this Challenge, using extensive literature research on the roles and interactions of all the given proteins in cellular signaling. Even though the cell lines differ quite drastically from one another, we felt that having one basic model describing signaling in a healthy cell would be a good starting point. Only after we had placed all proteins within our universal network did we start to tailor the models to their respective cell lines and growth factors. In our primary network all edges had a score of 1, which we later adjusted according to the dynamics of both source and target.

We thank Thea, Laura, and all the other Challenge organizers, as well as everyone who contributed to making this Challenge possible. The leaderboard not only provided a way to get feedback during the Challenge but also gave it a more competitive character.

Team HD Systems

Oliver Hahn, Michael Zengerling & Ruth Großeholz

Master Students, Major Systems Biology

Modelling of Biological Processes

Ruprecht-Karls University, Heidelberg


Team IPNet (syn2023386)

It is a great honor for us to be highlighted among the top-scoring models in the HPN-DREAM Subchallenge 1B on the August 7th leaderboard. We are very thankful that the organizers have given us the opportunity to present ourselves and to introduce our model.

We are a team that started at the Institute for Medical Informatics and Biometry at TU Dresden in Germany. The team is composed of myself (Marta Matos), Dr. Bettina Knapp, and Prof. Dr. Lars Kaderali.

The model we used in the HPN-DREAM Challenge was developed during my master’s thesis [1], under the supervision of Dr. Bettina Knapp in the group of Prof. Dr. Lars Kaderali. The model is an extension of an approach previously developed by Knapp and Kaderali [2], which is available as a Bioconductor software package [3]. It is based on linear programming and infers signaling networks from perturbation data. In particular, it was designed to take advantage of RNA interference experiments in combination with steady-state expression measurements of the proteins of interest. In my master’s thesis we extended this model to take advantage of perturbation time-series data to improve the prediction of causal relations between proteins. The HPN-DREAM Subchallenge 1B was therefore an excellent opportunity to evaluate the performance of the extended model on time-series data after different perturbations.

In our approach the signal is modeled as an information flow that starts at the source nodes and propagates downstream along the network until it reaches the sink nodes. A node that is not active at a given time step interrupts the flow, and we assume that the signal cannot propagate to its child nodes. To distinguish between active and inactive nodes we use a thresholding approach, sketched below. The choice of threshold has a big influence on the model’s performance. For the HPN-DREAM Subchallenge 1A, we got our best results when using the values of each node at the first time point to discretize the data. The underlying assumption is that the network nodes are in an inactive state at t=0, since they have not yet been stimulated.
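A minimal sketch of that discretization (the scaling factor is our added assumption; the team’s actual threshold choice may differ):

```python
# Mark a node active at time t when it exceeds its own t=0 baseline.
import numpy as np

def discretize(series, factor=1.0):
    """series: (n_timepoints, n_nodes) array of protein measurements."""
    baseline = series[0]                     # values at t = 0
    return (series > factor * baseline).astype(int)
```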

What we like most about the DREAM Challenge is that it allows different models to be compared in exactly the same setting, and that it is possible to evaluate the models’ performance on real, yet unpublished, data. Furthermore, the Challenge makes it easier to learn from other researchers working in the same field and allows for the exchange of knowledge and expertise. This helps to improve the models we develop and to answer complex biological questions in more detail. We thank the Challenge organizers and all who contributed to making this competition possible.

[1] Marta Matos. Network Inference: extension of a linear programming model for time-series data. Master’s thesis, University of Minho, 2013.

[2] Bettina Knapp and Lars Kaderali. Reconstruction of cellular signal transduction networks using perturbation assays and linear programming. PLoS ONE, 8(7):e69220, July 2013.

[3] Bettina Knapp, Johanna Mazur, and Lars Kaderali (2013). lpNet: Linear Programming Model for Network Inference. R package version 1.0.0.


Team Tongii (syn2024139)

This is Su Wang, along with my teammates Xiaoqi Zheng, Chengyang Wang, Yingxiang Li, Haojie Ren, and Hanfei Sun at Tongji University, Shanghai, China. It is our honor that our model was highlighted as one of the top-scoring models for the HPN-DREAM Challenge. Xiaoqi is an associate professor at Shanghai Normal University; Yingxiang and I are PhD candidates; Chengyang and Hanfei are Master’s students; and Haojie just graduated from Nankai University. The diversity of our backgrounds gave us the courage to participate in this Challenge. We thank the organizers for the chance to introduce our team and our model.

In our previous work, we focused on developing software to detect the target genes of transcription factors and chromatin regulators. We used a monotonically decreasing function of the distance between a binding site and the transcription start site (TSS) to measure the contribution of that binding to a gene, and combined it with differential expression data to identify a factor’s direct target genes. We then created a transcription network based on these direct-target relationships and looked for relationships among all the factors and for co-regulating pairs. We also collected available pathways from KEGG and integrated them with our predicted results to corroborate the predicted relationships. Although the mechanism of protein phosphorylation differs from regulation between transcription factors, some of these models and ideas can still be applied to reconstructing the network.
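As an illustration of such a distance-decay weighting (the exponential form and the 10 kb scale are our assumptions, not necessarily the team’s function):

```python
# Each binding site contributes less the farther it lies from the
# transcription start site (TSS); sum contributions per gene.
import numpy as np

def regulatory_potential(site_distances_bp, d0=10000.0):
    d = np.asarray(site_distances_bp, dtype=float)
    return float(np.exp(-d / d0).sum())

# Example: sites at 1 kb, 5 kb, and 50 kb from the TSS.
print(regulatory_potential([1000, 5000, 50000]))  # ~1.52
```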

In our model, we trained a Dynamic Bayesian Network on the data. Combining this with the mutual information between pairs of genes, we used a simple rank average to obtain the causal relationships. This model works well for several reasons: first, time-series data can be used directly by a Dynamic Bayesian Network; second, the relationships between genes are not linear, so a measure like Pearson’s correlation is not appropriate for capturing the information shared between two genes; and last but not least, we applied the information inequality to delete implausible edges, which gave our model better sensitivity and stability.
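A small sketch of the rank-averaging step (the DBN edge scores are assumed given; the histogram-based mutual information estimator and its bin count are our assumptions):

```python
# Combine DBN edge scores with a plug-in mutual information estimate
# by averaging the ranks of the two score matrices.
import numpy as np
from scipy.stats import rankdata

def mutual_information(x, y, bins=8):
    """Simple plug-in MI estimate from a 2-D histogram."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])))

def combine_scores(dbn_scores, mi_scores):
    """Average the ranks of two flattened edge-score matrices."""
    r1 = rankdata(dbn_scores.ravel())
    r2 = rankdata(mi_scores.ravel())
    return ((r1 + r2) / 2.0).reshape(dbn_scores.shape)
```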

There is much room for improvement in our model. We thank the Challenge organizers for the chance to present our model here. The Challenge was very good at helping us build a model for a specific question, and the leaderboard was a very good platform for testing how good our model is and for understanding where we should focus to improve it. We believe everyone learned a lot from this Challenge, both about the science and about the skills involved. We hope the best-performing teams will share their models and learn from each other, so that we can arrive at the best model and solve more questions together.
