Scalæ By the Bay
Saturday, November 12 • 3:00pm - 3:40pm
Processing 100's of TB of Genomic Data With ADAM And Toil

Modern genome sequencing projects capture hundreds of gigabytes of data per individual. In this talk, we discuss recent work where we used the Spark-based ADAM tool to recompute genomic variants from 70TB of reads from the Simons Genome Diversity dataset. ADAM presents a drop-in, Spark-based replacement for conventional genomics pipelines like the GATK. We ran this computation across hundreds of nodes on Amazon EC2 using Toil, a novel cluster orchestration tool. We used Toil to automatically scale the number of nodes and to seamlessly run large single-node jobs and Spark clusters in a single workflow. By combining ADAM and Toil, we are able to improve end-to-end pipeline runtime while taking advantage of the EC2 Spot Instances market. Additionally, Toil is designed for scientific reproducibility, and our entire workflow was run using Docker containers to ensure that there is a static set of binaries that could be used to reproduce the pipeline at a later date. ADAM and Toil are both freely available, Apache 2-licensed tools.


Frank Austin Nothaft

Research Assistant, UC Berkeley AMPLab
Frank is a PhD student at UC Berkeley, working in the AMP and ASPIRE labs with David Patterson and Anthony Joseph. Frank's research is focused on scalable systems for processing genomic data. He works on the ADAM/Big Data Genomics project, which seeks to build open source tools for...

Saturday November 12, 2016 3:00pm - 3:40pm PST
Off by One