Teams WarwickDataScience and UT_CCB top the real-time leaderboard for NIEHS-NCATS-UNC DREAM Toxicogenetics sub-Challenge 1

Below are blogposts from Teams WarwickDataScience and UT_CCB, the two top-scoring teams for sub-Challenge 1 of the NIEHS-NCATS-UNC DREAM Toxicogenetics Challenge.  Sub-Challenge 1 asked participants to model cytotoxicity across cell lines based on genomic information.  We were excited to see 49 teams submit nearly 1000 models for scoring to the real-time leaderboard, which was open for 6 weeks (you can see the final leaderboard for this sub-Challenge here: !Synapse:syn1761567/WIKI/56154!).

We awarded $250 to the top two teams with the highest mean rank, as determined both by RMSE and by Pearson correlation (four prizes in total).  Below are blogposts from the top-scoring teams: Rich Savage from Team WarwickDataScience and Yonghui Wu from Team UT_CCB each share a little about their respective teams and their winning models.

Team WarwickDataScience Blogpost


Our team (WarwickDataScience) consists of Rich Savage and Jay Moore.  We’re based at the University of Warwick’s Systems Biology Centre (in the UK), where Rich is an associate professor and Jay leads the bioinformatics engineering group.  Rich also holds a joint appointment with the Warwick Medical School.


We joined the DREAM Tox challenge because we thought it was an interesting scientific problem, and also because we have both developed interests in data science challenges of this nature and were keen to start a data science team here at Warwick.


Over the course of the Challenge we identified several considerations that were demonstrably important to building high-scoring models.  Our basic approach was to build a regression model for each compound, using a Random Forest model, then to couple these together to do multi-task learning.  We experimented with various other regression models, but we got good performance from Random Forests, and they were very fast and easy to use, leaving us more time to focus on other aspects of the modelling.

Data-wise, we used some of the covariate features (population, gender) along with a small number of informative SNPs for each of the 106 compounds.  Finding the SNPs was a significant computational challenge, which we ended up solving using a fast, if somewhat quick-and-dirty, approach to GWAS analysis.  Our initial work used only the X-chromosome SNPs, but we obtained similar results using any single chromosome, or combination of chromosomes.  We speculate that this may be because we’re finding some widespread genomic signature, but we don’t (yet) have good evidence to confirm this.  We also experimented with the RNAseq data, and a microarray data set sourced externally to the challenge, but ran out of time with these; from cross-validation we think these data may also have helped.

It was clear to us from early in the Challenge that some of the compounds had highly similar toxicity results.  We therefore coupled our individual regression models together by using their predictions as input to a second set of Random Forest classifiers.  This gave us a fast, effective way of sharing information between the regression models for different compounds.
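In outline, this two-stage coupling can be sketched with scikit-learn. The arrays below are random placeholders (the real inputs were SNP and covariate features for 106 compounds), and in a careful implementation the stage-one predictions would be made out-of-fold to avoid overfitting:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_lines, n_feats, n_compounds = 100, 20, 5

X = rng.normal(size=(n_lines, n_feats))       # per-cell-line features (SNPs, covariates)
Y = rng.normal(size=(n_lines, n_compounds))   # cytotoxicity for each compound

# Stage 1: an independent Random Forest regressor per compound
stage1 = [RandomForestRegressor(n_estimators=50, random_state=0).fit(X, Y[:, j])
          for j in range(n_compounds)]
P = np.column_stack([m.predict(X) for m in stage1])

# Stage 2: couple the compounds by feeding every compound's stage-1
# prediction into a second forest per compound (simple multi-task sharing)
stage2 = [RandomForestRegressor(n_estimators=50, random_state=0).fit(P, Y[:, j])
          for j in range(n_compounds)]
Y_hat = np.column_stack([m.predict(P) for m in stage2])
```

Because the second stage sees the predictions for all compounds at once, information about similar compounds can flow between the otherwise independent regression models.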

Finally, discussions on the forum and in the webinar made it clear that there was a difference in distribution between the training set and the leaderboard test set.  We decided to try a simple scheme to correct for this bias by identifying training items from the tails of the target distributions and adding duplicate examples to the training set.  While in some ways this is a bit of a hack, we felt it was a reasonable thing to do, given the way Random Forest works.  It resulted in a significant boost to our final scores.
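A minimal sketch of this kind of tail duplication follows; the quantile cut-offs and the number of copies are hypothetical, since the post does not state which thresholds the team used:

```python
import numpy as np

def oversample_tails(X, y, lower_q=0.1, upper_q=0.9, copies=1):
    """Duplicate training items whose targets fall in the tails of the
    target distribution -- a crude correction for a test set that is
    enriched for extreme responses."""
    lo, hi = np.quantile(y, [lower_q, upper_q])
    tail = (y < lo) | (y > hi)
    X_aug = np.vstack([X, np.repeat(X[tail], copies, axis=0)])
    y_aug = np.concatenate([y, np.repeat(y[tail], copies)])
    return X_aug, y_aug

X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)
X_aug, y_aug = oversample_tails(X, y)   # the two tail items are duplicated
```

Duplicating examples changes the sampling weights seen by each bootstrapped tree, which is why this works naturally with Random Forests.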


We like that it was a challenging problem, particularly trying to use the SNPs.  We also liked that it was focused on a real-world problem.

Next year, it would be nice if the leaderboard and final test predictions were part of a combined submission, as is the case in many other challenges (e.g. Kaggle, Netflix).  For example, the test set can be randomly subdivided by the organizers, with half the items being used to compute leaderboard performance and the other half being used for the final scoring.  The user simply submits predictions for all test items, without knowing the random partition.  We’d prefer this to the retraining stage, which can be a bit fiddly and error-prone (for example, in the Breast Cancer challenge last year, many final models did not evaluate correctly after the final retraining).  It’s not clear to us that we gain much more scientific understanding from the retraining process.

Team UT_CCB Blogpost

Dear fellow NIEHS-NCATS-UNC-Dream Challenge participants and organizers,

This is Yonghui Wu, representing the participating team from the Center for Computational Biomedicine (CCB) at The University of Texas School of Biomedical Informatics at Houston (UT-SBMI). Our team consists of two postdoctoral fellows (myself and Yunguo Gong), a research assistant (Liang-Chin Huang), and five CCB faculty members: Drs. Jingchun Sun, Jeffrey Chang, Trevor Cohen, W. Jim Zheng, and Hua Xu, who directs the CCB. It is our great honor to be highlighted as a top team on September 4th in the 2013 NIEHS-NCATS-UNC DREAM Toxicogenetics Challenge. Two of our submissions, by myself and my teammate Liang-Chin Huang, were ranked second by mean rank and by RMSE, respectively.

At CCB, we are working on various projects to develop advanced informatics methods and tools to support biomedical research and clinical practice. One of our focus areas is translational bioinformatics. We are actively working on data extraction and analysis approaches to bridge clinical practice data with biomedical experimental data, to facilitate drug discovery and promote personalized medicine. We are interested not only in identifying genetic variations that are associated with drug responses (pharmacogenomics), but also in building models that combine both genomic and clinical variables to predict drug responses, which is exactly what this Challenge asks for.

In this Challenge, we examined different groups of features, including the covariate data, the compound chemical features, the mRNA expression data, and the SNP data. The critical step in this Challenge was to extract the features that really affect the compound responses from a large amount of data. We utilized different statistical tools to analyze the data and determine the features most strongly associated with the compound responses. In our top submissions (Syn ID #2024493 and Syn ID #2024471), we investigated different regression models, including the traditional linear regression model and the logistic regression model, as well as the lasso-penalized regression model and the elastic-net model. We also tried different strategies to combine these models to improve the prediction.
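The regression families mentioned above can be compared in a few lines of scikit-learn. The data here are synthetic (the team's actual features were the covariate, chemical, expression and SNP data, and their penalty strengths are not given in the post):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 30))
beta = np.zeros(30)
beta[:3] = [2.0, -1.5, 1.0]                 # only 3 of 30 features are informative
y = X @ beta + 0.1 * rng.normal(size=80)

models = {
    "linear": LinearRegression().fit(X, y),
    "lasso": Lasso(alpha=0.1).fit(X, y),
    "elastic-net": ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y),
}

# The L1 penalty in the lasso drives uninformative coefficients to exactly zero,
# giving automatic feature selection on top of the fit
n_zero = int(np.sum(models["lasso"].coef_ == 0))
```

The elastic net interpolates between the lasso's sparsity and the ridge penalty's stability when features are correlated, which is why both are natural candidates for high-dimensional genomic data.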

This was the first time that we could test our models on a real-time leaderboard, and we learned a lot from others during the Challenge. I would like to thank the Challenge organizers, as well as everyone who contributed funding, data, or infrastructure, for making this Challenge possible. This Challenge provided us a great chance to test our computational methods on real genetic data sets.

Yonghui Wu

Postdoc Research Fellow

Center for Computational Biomedicine

The University of Texas School of Biomedical Informatics at Houston


HPN Challenge teams HDSystems, IPNet and Tongii write about their winning models for Sub-Challenges 1A and 1B

Below are blogposts from three winning teams (HDSystems, IPNet and Tongii) from sub-Challenges 1A and 1B of the Heritage Provider Network DREAM Breast Cancer Network Inference Challenge.  Sub-Challenge 1A asked participants to work with experimental time-course breast cancer proteomic data to infer a causal network. Sub-Challenge 1B asked participants to work with in silico time-course data generated from a state-of-the-art dynamical model of signaling, also to construct a causal network. We awarded cash prizes to the first three teams within each sub-Challenge with a model that scored 2 standard deviations above the null model.  Below, three of the winning teams (HDSystems for sub-Challenge 1A, and both IPNet and Tongii for sub-Challenge 1B) share a little about themselves and the rationale behind their winning models.

You can check out the current leaderboards for these sub-Challenges here:



Team HD Systems (syn ID2109051)

Dear fellow HPN challenge participants and organizers,

This is Ruth Großeholz, along with Oliver Hahn and Michael Zengerling at Ruprecht-Karls University in Heidelberg, Germany. We are happy to be one of the winning teams of the sub-Challenge 1-A leaderboard incentive prize for the experimental network and would like to thank the organizers for the opportunity to introduce ourselves and our ideas.

The three of us are Master’s students in Professor Ursula Kummer’s group for Modeling of Biological Processes at Bioquant Heidelberg, participating in this challenge to gather working experience with network inference. For us, the Challenge offers a chance to work outside the controlled conditions of a practical course and expand our methodological knowledge. Before this Challenge, network inference was uncharted territory for us, as it is not covered in our Master’s program. So far, it has been a great experience to work with such a rich data set.

Since we come from a biological background and know from a number of practical courses that a model is only as good as the information backing it, our idea was to build one model including all the edges required for this challenge, using extensive literature research on the roles and interactions of all the given proteins in cellular signaling. Even though the cell lines differ quite drastically from one another, we felt that having one basic model describing signaling in a healthy cell would be a good starting point. Only after we had placed all proteins within our universal network did we start to tailor the models to their respective cell lines and growth factors. In our primary network all edges had a score of 1, which we later adjusted according to the dynamics of both source and target.

We thank Thea, Laura and all the other challenge organizers, as well as everyone who contributed to making this challenge possible. The implementation of the leaderboard not only provided a way to get feedback during the challenge but also gave the challenge a more competitive character.

Team HD Systems

Oliver Hahn, Michael Zengerling & Ruth Großeholz

Master Students, Major Systems Biology

Modelling of Biological Processes

Ruprecht-Karls University, Heidelberg


Team IPNet (syn ID2023386)

It is a great honor for us to be highlighted among the top-scoring models in the HPN-DREAM Subchallenge 1B on the August 7th leaderboard. We are very thankful that the organizers have given us the opportunity to present ourselves and to introduce our model.

We are a team that started at the Institute for Medical Informatics and Biometry at the TU Dresden in Germany. The team is composed of myself (Marta Matos), Dr. Bettina Knapp, and Prof. Dr. Lars Kaderali.

The model we used in the HPN-DREAM Challenge was developed during my master’s thesis [1], under the supervision of Dr. Bettina Knapp in the group of Prof. Dr. Lars Kaderali. The model is an extension of the approach previously developed by Knapp and Kaderali [2] which is available as a bioconductor software package [3]. It is based on linear programming and it infers signaling networks using perturbation data. In particular, it was designed to take advantage of RNA interference experiments in combination with steady-state expression measurements of the proteins of interest. In my master’s thesis we expanded this model to take advantage of perturbation time-series data to improve the prediction of causal relations between proteins. Therefore, the HPN-DREAM Subchallenge 1B is an excellent opportunity to evaluate the performance of the extended model on time-series data after different perturbations.

In our approach the signal is modeled as an information flow which starts at the source nodes and propagates downstream along the network until it reaches the sink nodes. A node which is not active at a given time step interrupts the flow, and we assume that the signal cannot propagate to its child nodes. To distinguish between active and inactive nodes we use a thresholding approach. The choice of the threshold has a big influence on the model performance. For the HPN-DREAM Subchallenge 1A, we got our best results when using the values of each node at the first time point to discretize the data. The underlying assumption is that the network nodes are in an inactive state at t=0, since they have not yet been stimulated.
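Such a first-time-point threshold can be sketched as follows (hypothetical array layout, with nodes in rows and time points in columns; the optional margin `eps` is an assumption, not something the post specifies):

```python
import numpy as np

def discretize(ts, eps=0.0):
    """Mark a node as active at time t when its value exceeds its own
    t=0 baseline (plus an optional margin eps); inactive nodes are
    assumed to block signal flow to their children."""
    baseline = ts[:, :1]          # value at the first time point, per node
    return ts > baseline + eps

ts = np.array([[1.0, 1.5, 2.0],    # node that switches on right after t=0
               [2.0, 1.9, 2.1]])   # node that only exceeds baseline at the last step
active = discretize(ts)
```

Each node is compared against its own baseline rather than a global cut-off, which matches the stated assumption that every node starts unstimulated.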

What we like most about the DREAM Challenge is that it allows the comparison of different models in exactly the same setting, and that it is possible to evaluate the performance of the models on real, yet unpublished, data. Furthermore, the Challenge makes it possible to learn from other researchers working in the same field and allows for the exchange of knowledge and expertise. This helps to improve the developed models and to answer complex biological questions in more detail. We thank the challenge organizers and all who contributed in making this competition possible.

[1]  Marta Matos. Network Inference: extension of a linear programming model for time-series data. Master’s thesis, University of Minho, 2013

[2] Bettina Knapp and Lars Kaderali. Reconstruction of cellular signal transduction networks using perturbation assays and linear programming. PLoS ONE, 8(7):e69220, July 2013.

[3] Bettina Knapp, Johanna Mazur and Lars Kaderali (2013). lpNet: Linear Programming Model for Network Inference.  R package version 1.0.0.


Team Tongii (synID2024139)

This is Su Wang, along with my teammates Xiaoqi Zheng, Chengyang Wang, Yingxiang Li, Haojie Ren and Hanfei Sun at Tongji University, Shanghai, China. It is our honor that our model was highlighted as one of the top-scoring models for the HPN-DREAM Challenge. Xiaoqi is an associate professor at Shanghai Normal University; Yingxiang and I are PhD candidates; Chengyang and Hanfei are master’s students; Haojie just graduated from Nankai University. The diversity of our backgrounds gave us the courage to participate in this Challenge. We thank the organizers for the chance to introduce our team and our model.

In our previous work, we focused on software development to detect the target genes of transcription factors and chromatin regulators. We used a monotonically decreasing function of the distance between the binding site and the transcription start site to measure the contribution of a binding event to a gene, and combined it with differential expression data to identify the factors’ direct target genes. We created the transcription network based on these direct target relationships and tried to find the relationships among all the factors and the co-regulated pairs. We collected available pathways from KEGG and integrated them with our predicted results to support the predicted relationships. Although the mechanism of protein phosphorylation is different from regulation between transcription factors, some of the same models and ideas can still be applied to reconstructing the network.

In our model, we applied a Dynamic Bayesian Network to the training data. Combined with the mutual information between pairs of genes, we used a simple rank average to infer the causal relationships. This model works well because, firstly, time-series data can easily be used with a Dynamic Bayesian Network; secondly, the relationships between genes are not linear, so correlations such as Pearson’s correlation are not appropriate for capturing the information shared between two genes; and last but not least, we applied the information inequality to delete implausible edges, which gave our model better sensitivity and stability.
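The rank-averaging step can be sketched in a few lines (the edge scores below are toy values; the team's actual DBN and mutual-information scores are not given in the post):

```python
import numpy as np

def ranks(scores):
    """Rank scores from 1 (lowest) to n (highest)."""
    return np.argsort(np.argsort(scores)) + 1

# Hypothetical confidence scores for four candidate edges from two sources
dbn_scores = np.array([0.9, 0.2, 0.5, 0.7])   # dynamic Bayesian network scores
mi_scores  = np.array([0.8, 0.3, 0.6, 0.1])   # pairwise mutual information

# Simple rank average: a higher combined rank means a more confident edge
combined = (ranks(dbn_scores) + ranks(mi_scores)) / 2.0
```

Averaging ranks rather than raw scores sidesteps the problem that the two methods produce scores on incomparable scales.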

There is much room for improvement in our model. We thank the Challenge organizers for giving us the chance to present our model here.  The Challenge is very good for helping us build a model for a specific question, and the leaderboard is a very good platform to test how good our model is and to help us understand where we should pay more attention to improve. We believe that everyone learned a lot by doing this Challenge, both in terms of the science and the skills themselves. We hope the best-performing teams can share their models and learn from each other, so that we can all arrive at better models for more questions.

DREAM8 Whole Cell Parameter Estimation Challenge: Winners for Most Creative Method Share Their Ideas

We asked the Challenge teams that received the “Most Creative Method” prize in the Whole-cell Parameter Estimation DREAM8 Challenge to submit a short write-up explaining how they are working to solve the Challenge. Below are the descriptions from the top 3 teams: winner Team Whole-Sale Modelers, followed by Team Crux,  and Team newDREAM.

Team Whole-Sale Modelers

1. Summary of Overall Approach

The Challenge boils down to a complex, high-dimensional regression problem. We are asked to infer 15 perturbations that were delivered to a subset of 30 identified model parameters. These parameters (which can be thought of as dependent variables) are estimated based on large amounts of “high-throughput data.” In this document I describe the statistical techniques I have used to solve this problem. Importantly, the techniques I outline below are complementary to an analysis of the “sub-models”. That is, the general search strategy I describe can be constrained, and thus improved, if one were to gain insight from the sub-models.

I have written code to estimate the parameters of the whole model, given the high-throughput data that is generated for each simulation. This model can be stated very simply:

p = f(x)

where p is the estimated “perturbation vector” (which encodes the perturbation delivered to each of the 30 parameters of interest), f is a non-linear function, and x is a vector that contains all of the high-throughput data for the mutant model (provided freely to contest participants). The vector p has 30 elements, each of which represents the proportional change in each parameter (e.g. if the third parameter in the list – the kcat of Tmk – is halved, then the third element of p is equal to 0.5). In practice, I found that the above problem is intractable because of the large number of variables in the high-throughput data, leading to a very large vector x. Thus I performed a principal components analysis of the high-throughput data before fitting the model. This reduces the dimensionality of x to be on the order of 50 components.

The nonlinear function f is fitted by a collection of regression trees using the Random Forests technique. This is a popular technique in the field of machine learning. Its popularity stems from the fact that we do not need to have an initial guess for the form of the non-linear function f. Additionally, the algorithm cleverly avoids the over-fitting problem by probabilistically sampling from the training data. The random forest was fitted  based on the high-throughput data of 1128 whole cell simulations.
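Assembled as a pipeline, the approach described above might look like the sketch below. The arrays are random stand-ins (the real inputs were the high-throughput readouts and known perturbation vectors of the 1128 training simulations), and the sizes are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_sims, n_readouts, n_params = 200, 500, 30

X = rng.normal(size=(n_sims, n_readouts))            # high-throughput data per simulation
P = rng.uniform(0.1, 2.0, size=(n_sims, n_params))   # known perturbation vectors

# Reduce the readouts to ~50 principal components before regression
pca = PCA(n_components=50).fit(X)
Z = pca.transform(X)

# A multi-output Random Forest maps reduced readouts -> perturbation vector
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(Z, P)
p_hat = model.predict(pca.transform(X[:1]))          # estimate for one new simulation
```

The forest's bootstrap resampling gives the over-fitting protection mentioned above, and the PCA step keeps the regression problem tractable.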

2. Improvements – Compressed Sensing

A substantial improvement, which I have not had the time to implement yet, would be to incorporate the constraint that the perturbation vector p is sparse. This piece of information is critical, and is widely studied in the context of  “compressed sensing”. Essentially, one should be able to improve the fit by penalizing the L1 norm of the estimated perturbation vector p (in theory, at least).  Additionally, I am in the process of running more simulations which I am now doing in triplicate. Averaging the high-throughput data across these replicates has the potential to improve the fit, since the stochasticity of the model can be quite influential (especially in terms of the data stored in  the variable “rxnFluxes”). In fact, I found some models that seemed to fit the data quite well, but failed when submitted to Bitmill. This was apparently due to trial-to-trial variability in the rxnFlux data, which was revealed by averaging over 8 trials.
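As a sketch of the proposed improvement: under a (hypothetical) linearized view of how the readouts respond to perturbations, penalizing the L1 norm of the estimate recovers a sparse perturbation vector. The sensitivity matrix and true perturbations below are synthetic, since the actual model response is nonlinear:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n_obs, n_params = 100, 30

# Ground truth: only a few of the 30 parameters are actually perturbed
p_true = np.zeros(n_params)
p_true[[2, 7, 11, 19, 25]] = [1.0, -0.8, 0.5, -1.2, 0.7]

A = rng.normal(size=(n_obs, n_params))     # linearized readout sensitivities
y = A @ p_true + 0.01 * rng.normal(size=n_obs)

# Penalizing the L1 norm of the estimated perturbation vector encourages sparsity,
# as in the compressed-sensing literature
p_hat = Lasso(alpha=0.05).fit(A, y).coef_
```

The L1 penalty sets most of the unperturbed entries exactly to zero, which is precisely the prior knowledge (15 perturbed parameters out of 30) that the plain Random Forest fit cannot exploit.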

3. Explanation of Code

Running the script  “s007 simplified random forest script” should generate some estimates of the perturbations to the cell. This script can be found in the “analysis” folder. I have added the prefix “alex ” to almost all of my written functions to distinguish them from Jonathan Karr’s original code. Due to severe time constraints, I have not been able to comment and clean up all of my code.  Full code can be found at:

4. About the Team: Whole-Sale Modelers

  • Alex Williams: Alex Williams is a research technician in Eve Marder’s lab at Brandeis University. His research interests are in computational neuroscience. Alex’s work examines how neurons maintain stable activity patterns over long time periods in spite of comparatively rapid protein turnover.
  • Jeremy Zucker: Jeremy Zucker has over 10 years of experience in the representation, integration, modeling and simulation of biological pathways to elucidate the complex relationship between genotype and phenotype.

Team crux

1. Methodology

The goal of the DREAM8 Whole-cell Parameter Estimation Challenge is to estimate 30 unknown parameters P of a mathematical model of a cell. The whole-cell model has 1972 parameters in total, and default values for all parameters are known. As prior information, it is known in addition that 15 out of the given set of 30 parameters are identical to their default values. For the remaining 15 modified parameters, the following prior knowledge is available:

1. 4 promoter affinities were modified

2. 7 kcats were modified

3. 4 RNA half lives were modified

4. 5 genes have one changed parameter

5. 5 genes have two changed parameters

6. 13 of the 15 modified parameters were decreased

7. 2 of the 15 modified parameters were increased

8. The decreases range from 2.8-93.4%.

9. The increases range from 11.7-90.6%.

In our analyses, we restricted the parameter space to satisfy these constraints.

To account for strictly positive parameter values, all the analyses have been performed in a logarithmic parameter space. Moreover, this accounts for the fact that changes of parameter values usually contribute multiplicatively rather than additively, i.e. changing a parameter by a factor a or 1/a usually has a similar impact, but adding a constant a mostly has a qualitatively different effect than the corresponding subtraction.  Since promoter affinities are normalized, all perturbations of these parameters have to satisfy the normalization condition

Σᵢ aᵢ = const,

where the aᵢ are the promoter affinities.
For all perturbations we performed, we normalized the promoter affinities by adapting a single, fixed parameter that is not in the set of modified parameters P.
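A toy sketch of that renormalization step (the affinity values are hypothetical, and the compensating index stands in for the fixed parameter outside P):

```python
import numpy as np

def perturb_affinity(a, idx, factor, comp_idx):
    """Scale one promoter affinity by a multiplicative factor, then
    restore the sum-to-constant normalization by adjusting a single
    compensating affinity that is not among the modified parameters."""
    a = a.copy()
    total = a.sum()
    a[idx] *= factor
    a[comp_idx] -= a.sum() - total   # re-impose the normalization condition
    return a

a0 = np.array([0.2, 0.3, 0.5])       # normalized promoter affinities
a1 = perturb_affinity(a0, idx=0, factor=0.5, comp_idx=2)
```

Using a multiplicative factor matches the log-space parameterization described above, while the single compensating affinity keeps the normalization constraint satisfied without disturbing the other parameters under study.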

The parameters were estimated using maximum likelihood. We initially perturbed each parameter θ ∈ P individually to probe the response of the model and to estimate the gradient of the likelihood. For testing purposes, we also modified other parameters θ not in P. At the second stage, we also altered sets of more than a single parameter; this allows the computation of higher-order derivatives. An iterative procedure then allows us to advance towards a better model fit. Finally, analysis of the response of the model for the default and estimated parameters allows us to assess which of the 30 candidate parameters were modified.

2. Implementation of the numerical methods

A major bottleneck in this Challenge is the computational effort that a single evaluation of the model requires. It critically limits the number of possible iterations during maximum likelihood estimation of the parameters. In addition to simulations on local computers, simulations were performed on the Bitmill server using Matlab code.

3. About the Team: Team crux 

Team crux consists of a group of researchers at the Institute of Physics at the University of Freiburg. Team crux has previously won two DREAM competitions, the DREAM6 Parameter Estimation Challenge and the DREAM7 Network Inference Challenge.

  • Dr. Clemens Kreutz: Clemens is a postdoctoral scholar at the Institute of Physics at the University of Freiburg. Clemens’ research focuses on mathematical modelling of cellular signal transduction, experimental design, and statistics.
  • Dr. Andreas Raue: Andreas is also a postdoctoral scholar at the Institute of Physics at the University of Freiburg. Andreas’ research focuses on parameter estimation, experimental design, and uncertainty analysis.
  • Bernhard Steiert: Bernhard is a PhD candidate at the Institute of Physics at the University of Freiburg. His research focuses on modeling erythropoietic signaling pathways in cancer, including EGF/HGF crosstalk.
  • Prof. Jens Timmer: Jens is a professor of mathematics and physics at the University of Freiburg. His research focuses on the development and interdisciplinary application of mathematical methods to analyse and model dynamical processes in biology and medicine. His group develops and applies mathematical methods to analyse and model these processes based on measured data. Their final aim is to help turn the life sciences from a qualitative, descriptive science into a quantitative, predictive one.

Team newDREAM


Participants are challenged to estimate the values of 15 unknown parameters from a set of 30 parameters – 10 promoter affinities, 10 RNA half-lives, and 10 metabolic reaction kcats – of a recently published whole-cell model of M. genitalium (Karr et al., 2012), given the model’s structure and simulated data.


The solution needs to answer two questions: 1. identify candidate parameters; 2. minimize the parameter distance given the candidate parameters. Since the model is very large and simulations take a very long time, we could not thoroughly search the entire parameter space to reach near-perfect solutions within the limited time. Therefore, our strategy was:

1) Use the wild-type, gold-standard, and downloaded perturbation datasets to identify the (potentially modified) parameters most sensitive to cell growth.

2) Observe the high-throughput data. Divide the potentially modified parameters into three groups (A, too time-consuming to optimize; B, hard to optimize; C, easy to optimize) based on these observations.

3) Make educated guesses for parameters in group A first, then try to optimize parameters in group B while fixing parameters in group A. Finally, optimize parameters in group C.

About the Team: newDream

newDream includes a team of researchers from the University of Texas Southwestern. Last year team newDream won the DREAM Drug Sensitivity Prediction Challenge.

Dr. Jichen Yang: Jichen is a postdoctoral scholar at the University of Texas Southwestern.

Yajuan Li

Dr. Hao Tang: Hao is a postdoctoral fellow at the QBCR at the University of Texas Southwestern.

Tao Wang: Tao is a graduate student at the QBCR at the University of Texas Southwestern.

Dr. Yueming Liu: Yueming is a mathematician at the University of Texas at Arlington.

Prof. Yang Xie: Yang is a professor in the Department of Clinical Science at the University of Texas Southwestern.

Prof. Guanghua Xiao: Guanghua is a professor in the Department of Clinical Science at the University of Texas Southwestern.