Genomics in the Cloud

Book description

Data in the genomics field is booming. In just a few years, organizations such as the National Institutes of Health (NIH) will host 50+ petabytes, or over 50 million gigabytes, of genomic data, and they're turning to cloud infrastructure to make that data available to the research community. How do you adapt analysis tools and protocols to access and analyze that volume of data in the cloud?

With this practical book, researchers will learn how to work with genomics algorithms using open source tools including the Genome Analysis Toolkit (GATK), Docker, WDL, and Terra. Geraldine Van der Auwera, longtime custodian of the GATK user community, and Brian O'Connor of the UC Santa Cruz Genomics Institute, guide you through the process. You'll learn by working with real data and genomics algorithms from the field.

This book covers:

  • Essential genomics and computing technology background
  • Basic cloud computing operations
  • Getting started with GATK, plus three major GATK Best Practices pipelines
  • Automating analysis with scripted workflows using WDL and Cromwell
  • Scaling up workflow execution in the cloud, including parallelization and cost optimization
  • Interactive analysis in the cloud using Jupyter notebooks
  • Secure collaboration and computational reproducibility using Terra

Table of contents

  1. Foreword
  2. Preface
    1. Purpose, Scope, and Intended Audience of This Book
      1. What You Will Learn from This Book
      2. What Computational Experience Is Needed for the Exercises?
    2. Conventions Used in This Book
    3. Using Code Examples
    4. O’Reilly Online Learning
    5. How to Contact Us
    6. Acknowledgments
  3. 1. Introduction
    1. The Promises and Challenges of Big Data in Biology and Life Sciences
    2. Infrastructure Challenges
    3. Toward a Cloud-Based Ecosystem for Data Sharing and Analysis
      1. Cloud-Hosted Data and Compute
      2. Platforms for Research in the Life Sciences
      3. Standardization and Reuse of Infrastructure
    4. Being FAIR
    5. Wrap-Up and Next Steps
  4. 2. Genomics in a Nutshell: A Primer for Newcomers to the Field
    1. Introduction to Genomics
      1. The Gene as a Discrete Unit of Inheritance (Sort Of)
      2. The Central Dogma of Biology: DNA to RNA to Protein
      3. The Origins and Consequences of DNA Mutations
      4. Genomics as an Inventory of Variation in and Among Genomes
      5. The Challenge of Genomic Scale, by the Numbers
    2. Genomic Variation
      1. The Reference Genome as Common Framework
      2. Physical Classification of Variants
      3. Germline Variants Versus Somatic Alterations
    3. High-Throughput Sequencing Data Generation
      1. From Biological Sample to Huge Pile of Read Data
      2. Types of DNA Libraries: Choosing the Right Experimental Design
    4. Data Processing and Analysis
      1. Mapping Reads to the Reference Genome
      2. Variant Calling
      3. Data Quality and Sources of Error
      4. Functional Equivalence Pipeline Specification
    5. Wrap-Up and Next Steps
  5. 3. Computing Technology Basics for Life Scientists
    1. Basic Infrastructure Components and Performance Bottlenecks
      1. Types of Processor Hardware: CPU, GPU, TPU, FPGA, OMG
      2. Levels of Compute Organization: Core, Node, Cluster, and Cloud
      3. Addressing Performance Bottlenecks
    2. Parallel Computing
      1. Parallelizing a Simple Analysis
      2. From Cores to Clusters and Clouds: Many Levels of Parallelism
      3. Trade-Offs of Parallelism: Speed, Efficiency, and Cost
    3. Pipelining for Parallelization and Automation
      1. Workflow Languages
      2. Popular Pipelining Languages for Genomics
      3. Workflow Management Systems
    4. Virtualization and the Cloud
      1. VMs and Containers
      2. Introducing the Cloud
      3. Categories of Research Use Cases for Cloud Services
    5. Wrap-Up and Next Steps
  6. 4. First Steps in the Cloud
    1. Setting Up Your Google Cloud Account and First Project
      1. Creating a Project
      2. Checking Your Billing Account and Activating Free Credits
    2. Running Basic Commands in Google Cloud Shell
      1. Logging in to the Cloud Shell VM
      2. Using gsutil to Access and Manage Files
      3. Pulling a Docker Image and Spinning Up the Container
      4. Mounting a Volume to Access the Filesystem from Within the Container
    3. Setting Up Your Own Custom VM
      1. Creating and Configuring Your VM Instance
      2. Logging into Your VM by Using SSH
      3. Checking Your Authentication
      4. Copying the Book Materials to Your VM
      5. Installing Docker on Your VM
      6. Setting Up the GATK Container Image
      7. Stopping Your VM…to Stop It from Costing You Money
    4. Configuring IGV to Read Data from GCS Buckets
    5. Wrap-Up and Next Steps
  7. 5. First Steps with GATK
    1. Getting Started with GATK
      1. Operating Requirements
      2. Command-Line Syntax
      3. Multithreading with Spark
      4. Running GATK in Practice
    2. Getting Started with Variant Discovery
      1. Calling Germline SNPs and Indels with HaplotypeCaller
      2. Filtering Based on Variant Context Annotations
    3. Introducing the GATK Best Practices
      1. Best Practices Workflows Covered in This Book
        1. Other Major Use Cases
    4. Wrap-Up and Next Steps
  8. 6. GATK Best Practices for Germline Short Variant Discovery
    1. Data Preprocessing
      1. Mapping Reads to the Genome Reference
      2. Marking Duplicates
      3. Recalibrating Base Quality Scores
    2. Joint Discovery Analysis
      1. Overview of the Joint Calling Workflow
      2. Calling Variants per Sample to Generate GVCFs
      3. Consolidating GVCFs
      4. Applying Joint Genotyping to Multiple Samples
      5. Filtering the Joint Callset with Variant Quality Score Recalibration
      6. Refining Genotype Assignments and Adjusting Genotype Confidence
      7. Next Steps and Further Reading
    3. Single-Sample Calling with CNN Filtering
      1. Overview of the CNN Single-Sample Workflow
      2. Applying 1D CNN to Filter a Single-Sample WGS Callset
      3. Applying 2D CNN to Include Read Data in the Modeling
    4. Wrap-Up and Next Steps
  9. 7. GATK Best Practices for Somatic Variant Discovery
    1. Challenges in Cancer Genomics
    2. Somatic Short Variants (SNVs and Indels)
      1. Overview of the Tumor-Normal Pair Analysis Workflow
      2. Creating a Mutect2 PoN
      3. Running Mutect2 on the Tumor-Normal Pair
      4. Estimating Cross-Sample Contamination
      5. Filtering Mutect2 Calls
      6. Annotating Predicted Functional Effects with Funcotator
    3. Somatic Copy-Number Alterations
      1. Overview of the Tumor-Only Analysis Workflow
      2. Creating a Somatic CNA PoN
      3. Applying Denoising
      4. Performing Segmentation and Call CNAs
      5. Additional Analysis Options
    4. Wrap-Up and Next Steps
  10. 8. Automating Analysis Execution with Workflows
    1. Introducing WDL and Cromwell
    2. Installing and Setting Up Cromwell
    3. Your First WDL: Hello World
      1. Learning Basic WDL Syntax Through a Minimalist Example
      2. Running a Simple WDL with Cromwell on Your Google VM
      3. Interpreting the Important Parts of Cromwell’s Logging Output
      4. Adding a Variable and Providing Inputs via JSON
      5. Adding Another Task to Make It a Proper Workflow
    4. Your First GATK Workflow: Hello HaplotypeCaller
      1. Exploring the WDL
      2. Generating the Inputs JSON
      3. Running the Workflow
      4. Breaking the Workflow to Test Syntax Validation and Error Messaging
    5. Introducing Scatter-Gather Parallelism
      1. Exploring the WDL
      2. Generating a Graph Diagram for Visualization
    6. Wrap-Up and Next Steps
  11. 9. Deciphering Real Genomics Workflows
    1. Mystery Workflow #1: Flexibility Through Conditionals
      1. Mapping Out the Workflow
      2. Reverse Engineering the Conditional Switch
    2. Mystery Workflow #2: Modularity and Code Reuse
      1. Mapping Out the Workflow
      2. Unpacking the Nesting Dolls
    3. Wrap-Up and Next Steps
  12. 10. Running Single Workflows at Scale with Pipelines API
    1. Introducing the GCP Genomics Pipelines API Service
      1. Enabling Genomics API and Related APIs in Your Google Cloud Project
    2. Directly Dispatching Cromwell Jobs to PAPI
      1. Configuring Cromwell to Communicate with PAPI
      2. Running Scattered HaplotypeCaller via PAPI
      3. Monitoring Workflow Execution on Google Compute Engine
    3. Understanding and Optimizing Workflow Efficiency
      1. Granularity of Operations
      2. Balance of Time Versus Money
      3. Suggested Cost-Saving Optimizations
      4. Platform-Specific Optimization Versus Portability
    4. Wrapping Cromwell and PAPI Execution with WDL Runner
      1. Setting Up WDL Runner
      2. Running the Scattered HaplotypeCaller Workflow with WDL Runner
      3. Monitoring WDL Runner Execution
    5. Wrap-Up and Next Steps
  13. 11. Running Many Workflows Conveniently in Terra
    1. Getting Started with Terra
      1. Creating an Account
      2. Creating a Billing Project
      3. Cloning the Preconfigured Workspace
    2. Running Workflows with the Cromwell Server in Terra
      1. Running a Workflow on a Single Sample
      2. Running a Workflow on Multiple Samples in a Data Table
      3. Monitoring Workflow Execution
      4. Locating Workflow Outputs in the Data Table
      5. Running the Same Workflow Again to Demonstrate Call Caching
    3. Running a Real GATK Best Practices Pipeline at Full Scale
      1. Finding and Cloning the GATK Best Practices Workspace for Germline Short Variant Discovery
      2. Examining the Preloaded Data
      3. Selecting Data and Configuring the Full-Scale Workflow
      4. Launching the Full-Scale Workflow and Monitoring Execution
      5. Options for Downloading Output Data—or Not
    4. Wrap-Up and Next Steps
  14. 12. Interactive Analysis in Jupyter Notebook
    1. Introduction to Jupyter in Terra
      1. Jupyter Notebooks in General
      2. How Jupyter Notebooks Work in Terra
    2. Getting Started with Jupyter in Terra
      1. Inspecting and Customizing the Notebook Runtime Configuration
      2. Opening Notebook in Edit Mode and Checking the Kernel
      3. Running the Hello World Cells
      4. Using gsutil to Interact with Google Cloud Storage Buckets
      5. Setting Up a Variable Pointing to the Germline Data in the Book Bucket
      6. Setting Up a Sandbox and Saving Output Files to the Workspace Bucket
    3. Visualizing Genomic Data in an Embedded IGV Window
      1. Setting Up the Embedded IGV Browser
      2. Adding Data to the IGV Browser
      3. Setting Up an Access Token to View Private Data
    4. Running GATK Commands to Learn, Test, or Troubleshoot
      1. Running a Basic GATK Command: HaplotypeCaller
      2. Loading the Data (BAM and VCF) into IGV
      3. Troubleshooting a Questionable Variant Call in the Embedded IGV Browser
    5. Visualizing Variant Context Annotation Data
      1. Exporting Annotations of Interest with VariantsToTable
      2. Loading R Script to Make Plotting Functions Available
      3. Making Density Plots for QUAL by Using makeDensityPlot
      4. Making a Scatter Plot of QUAL Versus DP
      5. Making a Scatter Plot Flanked by Marginal Density Plots
    6. Wrap-Up and Next Steps
  15. 13. Assembling Your Own Workspace in Terra
    1. Managing Data Inside and Outside of Workspaces
      1. The Workspace Bucket as Data Repository
      2. Accessing Private Data That You Manage Outside of Terra
      3. Accessing Data in the Terra Data Library
    2. Re-Creating the Tutorial Workspace from Base Components
      1. Creating a New Workspace
      2. Adding the Workflow to the Methods Repository and Importing It into the Workspace
      3. Creating a Configuration Quickly with a JSON File
      4. Adding the Data Table
      5. Filling in the Workspace Resource Data Table
      6. Creating a Workflow Configuration That Uses the Data Tables
      7. Adding the Notebook and Checking the Runtime Environment
      8. Documenting Your Workspace and Sharing It
    3. Starting from a GATK Best Practices Workspace
      1. Cloning a GATK Best Practices Workspace
      2. Examining GATK Workspace Data Tables to Understand How the Data Is Structured
      3. Getting to Know the 1000 Genomes High Coverage Dataset
      4. Copying Data Tables from the 1000 Genomes Workspace
      5. Using TSV Load Files to Import Data from the 1000 Genomes Workspace
      6. Running a Joint-Calling Analysis on the Federated Dataset
    4. Building a Workspace Around a Dataset
      1. Cloning the 1000 Genomes Data Workspace
      2. Importing a Workflow from Dockstore
      3. Configuring the Workflow to Use the Data Tables
    5. Wrap-Up and Next Steps
  16. 14. Making a Fully Reproducible Paper
    1. Overview of the Case Study
      1. Computational Reproducibility and the FAIR Framework
      2. Original Research Study and History of the Case Study
      3. Assessing the Available Information and Key Challenges
      4. Designing a Reproducible Implementation
    2. Generating a Synthetic Dataset as a Stand-In for the Private Data
      1. Overall Methodology
      2. Retrieving the Variant Data from 1000 Genomes Participants
      3. Creating Fake Exomes Based on Real People
      4. Mutating the Fake Exomes
      5. Generating the Definitive Dataset
    3. Re-Creating the Data Processing and Analysis Methodology
      1. Mapping and Variant Discovery
      2. Variant Effect Prediction, Prioritization, and Variant Load Analysis
      3. Analytical Performance of the New Implementation
    4. The Long, Winding Road to FAIRness
    5. Final Conclusions
  17. Glossary
  18. Index

Product information

  • Title: Genomics in the Cloud
  • Author(s): Geraldine A. Van der Auwera, Brian D. O'Connor
  • Release date: April 2020
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781491975190