Teams WarwickDataScience and UT_CCB top the real-time leaderboard for NIEHS-NCATS-UNC DREAM Toxicogenetics sub-Challenge 1

Below are blog posts from Teams WarwickDataScience and UT_CCB, the two top-scoring teams for sub-Challenge 1 of the NIEHS-NCATS-UNC DREAM Toxicogenetics Challenge. Sub-Challenge 1 asked participants to model cytotoxicity across cell lines based on genomic information. We were so excited to see 49 teams submit nearly 1000 models for scoring to the real-time leaderboard, which was open for 6 weeks (you can see the final leaderboard for this sub-Challenge here: https://www.synapse.org/#!Synapse:syn1761567/WIKI/56154).

We awarded $250 to the top two teams with the highest mean rank as determined by RMSE and by Pearson correlation (four prizes in total). Below are blog posts from the top-scoring teams: Rich Savage from Team WarwickDataScience and Yonghui Wu from UT_CCB both share a little about their respective teams and their winning models.

Team WarwickDataScience Blogpost

WHO IS YOUR TEAM?

Our team (WarwickDataScience) consists of Rich Savage and Jay Moore. We’re based at the University of Warwick’s Systems Biology Centre (in the UK), where Rich is an associate professor and Jay leads the bioinformatics engineering group. Rich also holds a joint appointment with the Warwick Medical School.

WHY DID YOU JOIN THE CHALLENGE?

We joined the DREAM Tox challenge because we thought it was an interesting scientific problem, and also because we have both developed interests in data science challenges of this nature and were keen to start a data science team here at Warwick.

WHAT WAS YOUR WINNING MODEL?

Over the course of the Challenge we identified several considerations that were demonstrably important to building high-scoring models. Our basic approach was to build a regression model for each compound, using a Random Forest model, then to couple these together to do multi-task learning. We experimented with various other regression models, but we got good performance from Random Forests, which were very fast and easy to use, leaving us more time to focus on other aspects of the modelling.
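As a rough illustration of the "one regression model per compound" idea, here is a minimal sketch using scikit-learn's RandomForestRegressor. The data are synthetic stand-ins, and the function names, feature counts and hyperparameters are assumptions made for the example, not the team's actual settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def fit_per_compound_forests(X, Y, n_estimators=100, seed=0):
    """Fit one Random Forest regressor per column of Y (one column per compound)."""
    models = []
    for j in range(Y.shape[1]):
        model = RandomForestRegressor(n_estimators=n_estimators, random_state=seed + j)
        model.fit(X, Y[:, j])
        models.append(model)
    return models


def predict_per_compound(models, X):
    """Collect each compound's predictions into an (n_samples, n_compounds) array."""
    return np.column_stack([m.predict(X) for m in models])


# Synthetic stand-in data (hypothetical): 60 cell lines, 8 features, 3 compounds.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))
Y = np.column_stack([
    X[:, 0] + 0.1 * rng.normal(size=60),
    X[:, 1] - X[:, 2] + 0.1 * rng.normal(size=60),
    0.5 * X[:, 3] + 0.1 * rng.normal(size=60),
])

models = fit_per_compound_forests(X[:45], Y[:45], n_estimators=50)
preds = predict_per_compound(models, X[45:])  # predictions for 15 held-out cell lines
```

Each compound gets its own independent model here; nothing is shared between compounds at this stage.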

Data-wise, we used some of the covariate features (population, gender) along with a small number of informative SNPs for each of the 106 compounds. Finding the SNPs was a significant computational challenge, which we ended up solving using a fast, if somewhat quick-and-dirty, approach to GWAS analysis. Our initial work used only the X-chromosome SNPs, but we obtained similar results by using any single chromosome, or combination of chromosomes. We speculate that this may be because we’re finding some widespread genomic signature, but we don’t (yet) have good evidence to confirm this. We also experimented with the RNAseq data, and a microarray data set sourced externally to the challenge, but ran out of time with this. From cross-validation we think these data may also have helped.
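The post does not say exactly how the SNP screening was done, so the following is only one plausible "quick-and-dirty" variant: rank every SNP by the absolute Pearson correlation between its genotype (coded as 0/1/2 minor-allele counts) and the compound's response, and keep the top few. The coding, the function name and the cut-off k are all assumptions for illustration.

```python
import numpy as np


def top_snps_by_correlation(genotypes, response, k=10):
    """Return the indices (and correlations) of the k SNPs with the largest |Pearson r|.

    genotypes: (n_samples, n_snps) array of allele counts (assumed 0/1/2 coding).
    response:  (n_samples,) array of toxicity values for one compound.
    """
    G = genotypes - genotypes.mean(axis=0)   # centre each SNP column
    y = response - response.mean()           # centre the response
    denom = np.sqrt((G ** 2).sum(axis=0) * (y ** 2).sum())
    # SNPs with no variation get r = 0 instead of a division by zero.
    r = np.divide(G.T @ y, denom, out=np.zeros(G.shape[1]), where=denom > 0)
    order = np.argsort(-np.abs(r))[:k]
    return order, r[order]


# Synthetic stand-in data (hypothetical): 200 cell lines, 500 SNPs, one causal SNP.
rng = np.random.default_rng(1)
genotypes = rng.integers(0, 3, size=(200, 500))
response = 1.0 * genotypes[:, 7] + 0.5 * rng.normal(size=200)

idx, corr = top_snps_by_correlation(genotypes, response, k=5)
```

Because the whole screen is a single matrix product, it stays fast even with a large number of SNPs, at the cost of ignoring population structure and any multiple-testing correction.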

It was clear to us from early in the Challenge that some of the compounds had highly similar toxicity results.  We therefore coupled our individual regression models together by using their predictions as input to a second set of Random Forest classifiers.  This gave us a fast, effective way of sharing information between the regression models for different compounds.
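A minimal sketch of this kind of coupling is shown below. It makes two assumptions that go beyond the description above: the second-stage models are Random Forest regressors (the targets in this toy example are continuous), and the first-stage predictions fed to them are out-of-fold predictions from cross-validation, so the second stage does not simply learn from over-fitted training predictions. All names and settings are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict


def fit_two_stage(X, Y, n_estimators=30, cv=3, seed=0):
    """Stage 1: one forest per compound. Stage 2: one forest per compound,
    trained on the stage-1 predictions for ALL compounds."""
    n_compounds = Y.shape[1]
    stage1 = []
    oof = np.zeros_like(Y, dtype=float)  # out-of-fold stage-1 predictions
    for j in range(n_compounds):
        rf = RandomForestRegressor(n_estimators=n_estimators, random_state=seed)
        oof[:, j] = cross_val_predict(rf, X, Y[:, j], cv=cv)
        stage1.append(rf.fit(X, Y[:, j]))  # refit on all training data
    stage2 = []
    for j in range(n_compounds):
        rf2 = RandomForestRegressor(n_estimators=n_estimators, random_state=seed)
        stage2.append(rf2.fit(oof, Y[:, j]))
    return stage1, stage2


def predict_two_stage(stage1, stage2, X):
    first = np.column_stack([m.predict(X) for m in stage1])
    return np.column_stack([m.predict(first) for m in stage2])


# Synthetic stand-in data (hypothetical): three compounds with correlated responses.
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 6))
shared = X[:, 0] + X[:, 1]
Y = np.column_stack([shared + 0.2 * rng.normal(size=60) for _ in range(3)])

stage1, stage2 = fit_two_stage(X[:45], Y[:45])
coupled_preds = predict_two_stage(stage1, stage2, X[45:])
```

When several compounds behave alike, the second stage can borrow strength from the other compounds' first-stage predictions.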

Finally, discussions on the forum and in the webinar made it clear that there was a difference in distribution between the training set and the leaderboard test set.  We decided to try a simple scheme to correct for this bias by identifying training items from the tails of the target distributions and adding duplicate examples to the training set.  While in some ways this is a bit of a hack, we felt it was a reasonable thing to do, given the way Random Forest works.  It resulted in a significant boost to our final scores.
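One simple way to implement this kind of tail duplication is sketched below. The quantile thresholds and the number of extra copies are hypothetical knobs chosen for the example; the post does not state which values were used.

```python
import numpy as np


def duplicate_tails(X, y, lower_q=0.1, upper_q=0.9, copies=1):
    """Append extra copies of training items whose target lies in either tail."""
    lo, hi = np.quantile(y, [lower_q, upper_q])
    tail = np.flatnonzero((y <= lo) | (y >= hi))  # indices of tail items
    extra = np.tile(tail, copies)                 # repeat them `copies` times
    idx = np.concatenate([np.arange(len(y)), extra])
    return X[idx], y[idx]


# Toy example: 20 items with targets 0..19; the two lowest and two highest are duplicated.
y = np.arange(20.0)
X = y[:, None]
X_aug, y_aug = duplicate_tails(X, y)
```

Since each tree in a Random Forest is grown on a bootstrap sample, duplicated items are drawn more often, which acts much like up-weighting the tails of the target distribution.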

WHAT HAVE YOU LIKED ABOUT THE DREAM CHALLENGE/HOW WOULD YOU LIKE TO SEE IT EVOLVE IN THE NEXT SEASON?

We like that it was a challenging problem, particularly trying to use the SNPs.  We also liked that it was focused on a real-world problem.

Next year, it would be nice if the leaderboard and final test predictions were part of a combined submission, as is the case in many other challenges (e.g., Kaggle, Netflix). For example, the test set can be randomly subdivided by the organizers, with half the items being used to compute leaderboard performance, and the other half being used for the final scoring. The user simply submits predictions for all test items, without knowing the random partition. We’d prefer this to the retraining stage, which can be a bit fiddly and error-prone (for example, in the Breast Cancer challenge last year, many final models did not evaluate correctly after the final retraining). It’s not clear to us that we gain much more scientific understanding from the retraining process.
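The scheme suggested above can be sketched in a few lines. Everything here (the item identifiers, the seed, the use of RMSE as the score) is a hypothetical example of how organizers might do it, not a description of any existing DREAM infrastructure.

```python
import random


def split_test_set(item_ids, seed):
    """Randomly split the test items into a leaderboard half and a final-scoring half."""
    ids = sorted(item_ids)             # sort first so the split depends only on the seed
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return set(ids[:half]), set(ids[half:])


def rmse(truth, preds, ids):
    """Root-mean-square error over the given subset of item ids."""
    ids = list(ids)
    return (sum((truth[i] - preds[i]) ** 2 for i in ids) / len(ids)) ** 0.5


# Toy data: ten test items; the submission is off by 0.5 everywhere.
truth = {f"cell_{i}": float(i) for i in range(10)}
submission = {k: v + 0.5 for k, v in truth.items()}

# The organizers keep the seed (and hence the partition) private.
leaderboard_ids, final_ids = split_test_set(truth, seed=2013)
leaderboard_score = rmse(truth, submission, leaderboard_ids)  # shown during the Challenge
final_score = rmse(truth, submission, final_ids)              # revealed at the end
```

Participants submit one prediction file covering every test item, so no retraining or resubmission is needed for the final scoring.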

Team UT_CCB Blogpost

Dear fellow NIEHS-NCATS-UNC-Dream Challenge participants and organizers,

This is Yonghui Wu, representing the participating team from the Center for Computational Biomedicine (CCB) at The University of Texas School of Biomedical Informatics at Houston (UT-SBMI). Our team consists of two postdoctoral fellows (myself and Yunguo Gong), a research assistant (Liang-Chin Huang), and five CCB faculty members including Drs. Jingchun Sun, Jeffrey Chang, Trevor Cohen, W. Jim Zheng, and Hua Xu, who directs the CCB. It is our great honor to be highlighted as a top team on September 4th in the 2013 NIEHS-NCATS-UNC DREAM Toxicogenetics Challenge. Two of our submissions, by myself and my teammate Liang-Chin Huang, were both ranked second, in the mean rank and RMSE, respectively.

At CCB, we are working on various projects to develop advanced informatics methods and tools to support biomedical research and clinical practice. One of our focus areas is translational bioinformatics. We are actively working on data extraction and analysis approaches to bridge clinical practice data with biomedical experimental data, and thereby to facilitate drug discovery and promote personalized medicine. We are interested not only in identifying genetic variations that are associated with drug responses (pharmacogenomics), but also in building models that combine both genomic and clinical variables to predict drug responses, which is exactly what this challenge does.

In this Challenge, we examined different groups of features, including the covariate data, the compound chemical features, the mRNA expression data as well as SNP data. The critical step in this Challenge is to extract the features that really affect the compound responses from a large amount of data. We utilized different statistical tools to analyze the data to determine the strong features associated with the compound responses. In our top submissions (Syn ID#2024493 and Syn ID#2024471), we investigated different regression models, including the traditional linear regression model, the logistic regression model as well as the lasso penalized regression model and the Elastic-net model. We also tried different strategies to combine these models to improve the prediction.
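To give a flavour of this kind of model comparison, here is a minimal sketch that scores three of the named model families (ordinary linear regression, lasso and Elastic-net) by cross-validated RMSE using scikit-learn, and then combines them by simple averaging. The synthetic data, the penalty strengths and the averaging rule are assumptions for illustration; they are not taken from the submissions mentioned above.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression
from sklearn.model_selection import cross_val_predict


def cv_rmse(model, X, y, cv=5):
    """Cross-validated RMSE: every sample is predicted by a model that never saw it."""
    preds = cross_val_predict(model, X, y, cv=cv)
    return float(np.sqrt(np.mean((y - preds) ** 2)))


# Synthetic stand-in data (hypothetical): 80 cell lines, 40 features, 2 informative.
rng = np.random.default_rng(3)
X = rng.normal(size=(80, 40))
y = 2.0 * X[:, 0] - X[:, 1] + 0.3 * rng.normal(size=80)

candidates = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=0.1),
    "elastic_net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}
scores = {name: cv_rmse(model, X, y) for name, model in candidates.items()}

# One simple combination strategy: average the cross-validated predictions.
blend = np.mean(
    [cross_val_predict(model, X, y, cv=5) for model in candidates.values()], axis=0
)
blend_rmse = float(np.sqrt(np.mean((y - blend) ** 2)))
```

The penalized models shrink or drop uninformative features, which matters when the number of candidate features is large relative to the number of cell lines.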

This is the first time that we have been able to test our models on a real-time leaderboard, and we learned a lot from others during the Challenge. I would like to thank the Challenge organizers, as well as everyone who contributed funding, data, or infrastructure, for making this Challenge possible. This Challenge gave us a great chance to test our computational methods on real genetic data sets.

Yonghui Wu

Postdoc Research Fellow

Center for Computational Biomedicine

The University of Texas School of Biomedical Informatics at Houston
