2 Whole Exome Sequencing

2.1 Introduction

The AIMS DNA-Seq analysis pipeline identifies somatic variants within whole genome sequencing (WGS) and whole exome sequencing (WES) data.

WGS involves sequencing the entire genome, including coding and non-coding regions. WGS offers a comprehensive view of an individual’s genomic landscape, capturing variations in protein coding, regulatory, intronic, and intergenic regions. WGS is particularly useful for understanding complex genetic traits, population genetics, and uncovering structural variants.

WES specifically targets the exome, which represents only the protein-coding regions of the genome. While the exome constitutes only a small fraction of the entire genome, it contains the majority of known disease-causing mutations. WES is valuable for investigating genetic characteristics of diseases in exonic regions with much lower cost compared to WGS.

The AIMS DNA-Seq pipeline detects variants in WGS or WES data. It compares allele frequencies in normal and tumor sample alignments, annotates each mutation, and aggregates mutations from multiple callers.
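The comparison of allele frequencies between normal and tumor alignments can be illustrated with a minimal sketch. The function names and the two thresholds below are hypothetical and chosen only for illustration; the actual callers use statistical models rather than fixed cutoffs.

```python
def allele_frequency(ref_count, alt_count):
    """Fraction of reads supporting the alternate allele (0.0 without coverage)."""
    depth = ref_count + alt_count
    return alt_count / depth if depth else 0.0


def looks_somatic(tumor_ref, tumor_alt, normal_ref, normal_alt,
                  min_tumor_vaf=0.05, max_normal_vaf=0.02):
    """Hypothetical rule: allele is present in the tumor and near-absent in the normal."""
    tumor_vaf = allele_frequency(tumor_ref, tumor_alt)
    normal_vaf = allele_frequency(normal_ref, normal_alt)
    return tumor_vaf >= min_tumor_vaf and normal_vaf <= max_normal_vaf
```

A site with 20 of 100 tumor reads and 0 of 100 normal reads supporting the alternate allele passes this rule, while a site at 50% in both samples does not, which is the pattern expected for a germline variant.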

2.2 Pipeline Overview

Workflow

The AIMS WES pipeline consists of the following main steps:

  • Preprocessing

  • Genome Alignment

  • Quality Score Recalibration (improves the accuracy of variant calls)

  • Variant Calling

  • Variant Filtration and Annotation

  • Mutation Aggregation

  • Downstream Statistics Calculation

Input data

The AIMS WES pipeline accepts two types of input data: raw sequencing data in FASTQ format, or aligned reads in BAM format. If a BAM file is provided, the pipeline automatically bypasses the preprocessing and genome alignment steps.
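The input-dependent behavior can be sketched as follows. This is an illustration only: the function name, the suffix list, and the detection of the input type from the file name are assumptions, not the pipeline's actual logic.

```python
from typing import List

ALL_STEPS = [
    "Preprocessing",
    "Genome Alignment",
    "Quality Score Recalibration",
    "Variant Calling",
    "Variant Filtration and Annotation",
    "Mutation Aggregation",
    "Downstream Statistics Calculation",
]
# Steps that are bypassed when aligned reads are supplied directly.
SKIPPED_FOR_BAM = {"Preprocessing", "Genome Alignment"}
FASTQ_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")  # assumed naming convention


def plan_steps(input_path: str) -> List[str]:
    """Return the steps to run for a FASTQ or BAM input (hypothetical helper)."""
    name = input_path.lower()
    if name.endswith(".bam"):
        return [step for step in ALL_STEPS if step not in SKIPPED_FOR_BAM]
    if name.endswith(FASTQ_SUFFIXES):
        return list(ALL_STEPS)
    raise ValueError("unsupported input type: " + input_path)
```

For example, `plan_steps("sample.bam")` starts at Quality Score Recalibration, whereas `plan_steps("sample_1.fq.gz")` returns all seven steps.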

Tools and Software

The pipeline relies on the following bioinformatics tools and internal scripts for the entire analysis process:

  • BWA (version, link) for reads mapping

  • GATK (version, link) for variant preprocessing and variant calling

  • HaplotypeCaller (GATK) for germline variant calling

  • Mutect2 (GATK) for somatic variant calling

  • Strelka2 (Illumina, not part of GATK) for germline and somatic variant calling

  • VEP (version, link) for variant annotation

  • Perl scripts for variant aggregation and somatic mutation masking

  • Python scripts for Tumor Mutation Burden (TMB) calculation

  • MSIsensor2 for Microsatellite Instability (MSI) calculation

  • MANTIS for MSI calculation

Configuration

The AIMS WES pipeline provides the following configuration options for controlling the variant calling process:

  • Version

  • Data type

  • Downsample

  • Caller
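As a rough illustration of how these four options might be represented and checked, here is a minimal sketch. The field names, default values, and allowed values are assumptions derived from the tools and data types named in this document, not the pipeline's actual configuration schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SUPPORTED_DATA_TYPES = {"WES", "WGS"}                            # assumed values
SUPPORTED_CALLERS = {"HaplotypeCaller", "Mutect2", "Strelka2"}   # callers named in this document


@dataclass
class PipelineConfig:
    """Hypothetical container for the Version, Data type, Downsample and Caller options."""
    version: str
    data_type: str = "WES"
    downsample: Optional[float] = None  # target coverage; None means no subsampling
    callers: List[str] = field(default_factory=lambda: ["Mutect2"])

    def validate(self) -> None:
        if self.data_type not in SUPPORTED_DATA_TYPES:
            raise ValueError("unknown data type: " + self.data_type)
        if self.downsample is not None and self.downsample <= 0:
            raise ValueError("downsample coverage must be positive")
        if not self.callers:
            raise ValueError("at least one caller must be selected")
        unknown = set(self.callers) - SUPPORTED_CALLERS
        if unknown:
            raise ValueError("unknown caller(s): " + ", ".join(sorted(unknown)))
```

Validating the options before any step runs keeps a typo in a caller name from surfacing only after alignment has finished.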

Output

The outputs of the AIMS WES pipeline comprise sample-level and dataset-level results.

Sample level:

  1. Quality control assessment result

  2. Variant call results (VCF) for each selected variant caller

  3. Aggregated masked somatic mutation annotation file (MAF)

Dataset level:

  4. Sample-by-gene mutation feature matrix
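To make the sample-by-gene mutation feature matrix concrete, the sketch below builds a binary matrix from MAF-style records. It assumes each record exposes the standard MAF columns `Tumor_Sample_Barcode` and `Hugo_Symbol`; the function name and the binary (mutated / not mutated) encoding are illustrative assumptions.

```python
from collections import defaultdict


def sample_by_gene_matrix(maf_rows):
    """Return (samples, genes, matrix) where matrix[i][j] is 1 if sample i has a mutation in gene j."""
    mutated = defaultdict(set)  # sample -> set of mutated genes
    genes = set()
    for row in maf_rows:
        sample = row["Tumor_Sample_Barcode"]
        gene = row["Hugo_Symbol"]
        mutated[sample].add(gene)
        genes.add(gene)
    sample_order = sorted(mutated)
    gene_order = sorted(genes)
    matrix = [[1 if gene in mutated[sample] else 0 for gene in gene_order]
              for sample in sample_order]
    return sample_order, gene_order, matrix
```

Rows correspond to samples and columns to genes, both in sorted order, so matrices built from different runs over the same dataset are directly comparable.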

2.3 Pipeline workflow

Pre-alignment processing

Raw sequencing reads undergo quality control using FastQC to assess read quality, identify adapter contamination, and evaluate other sequencing metrics.

  • Command Line

    fastqc <fastq_1.fq.gz> <fastq_2.fq.gz>
    

Alignment

Reads are aligned to the reference genome with BWA-MEM. Supported reference genomes include Human (hg38), Mouse (mm39), Fruit fly (dm6), and SARS-CoV-2.

  • Command Line

    bwa mem <reference> <fastq_1.fq.gz> <fastq_2.fq.gz>
    

Post-alignment processing

Subsampling

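Subsample the input reads to a user-defined coverage. Note that samtools view -s takes a SEED.FRACTION value (the integer part is the random seed, the fractional part is the proportion of reads to keep), so the requested coverage must first be converted to a fraction of the current coverage.

  • Command Line

    samtools view -b -s <seed.fraction> -o <subsampled_bam> <bam>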
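Because samtools view -s expects a fraction rather than a coverage, a user-defined target coverage has to be converted first. The helper below is a hypothetical sketch of that conversion; it assumes the current mean coverage is already known (for example from the quality control step) and that mean coverage scales linearly with the fraction of reads kept.

```python
def subsample_argument(current_coverage, target_coverage, seed=42):
    """Build the SEED.FRACTION string for `samtools view -s` (hypothetical helper)."""
    if current_coverage <= 0 or target_coverage <= 0:
        raise ValueError("coverages must be positive")
    # Keeping a fraction f of the reads scales the mean coverage by roughly f.
    fraction = round(target_coverage / current_coverage, 4)
    if not 0.0 < fraction < 1.0:
        raise ValueError("target coverage must be below the current coverage")
    # The integer part is the random seed, the fractional part is the fraction to keep.
    digits = "{:.4f}".format(fraction).split(".")[1]
    return "{}.{}".format(seed, digits)
```

For a BAM at 120x and a target of 30x, the helper returns `42.2500`, i.e. seed 42 and a quarter of the reads.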
    

MarkDuplicates

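Mark duplicate reads, i.e. reads that originate from the same DNA fragment (for example PCR or optical duplicates), so that downstream variant callers do not count them as independent evidence.

  • Command Line

    gatk MarkDuplicates -I <bam> -O <marked_bam> -M <metrics_file>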
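The idea behind duplicate marking can be sketched in a few lines. This is a deliberately simplified, single-end illustration and not the algorithm GATK implements: reads are represented as hypothetical dictionaries, reads sharing chromosome, start position, and strand are treated as one duplicate set, and the read with the highest summed base quality is kept as the representative.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates (simplified illustration)."""
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["start"], read["strand"])
        groups[key].append(read)
    duplicates = set()
    for group in groups.values():
        # Keep the read with the highest base-quality sum; flag the rest.
        best = max(group, key=lambda r: r["qual_sum"])
        duplicates.update(r["name"] for r in group if r is not best)
    return duplicates
```

Duplicates are flagged rather than removed, so the original read set stays intact and tools can decide whether to ignore the flagged reads.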
    

Base (Quality Score) Recalibration

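  • GATK BaseRecalibrator

    gatk BaseRecalibrator -I <bam> -R <reference> --known-sites <known_sites_vcf> -O <recal_table>
    
  • GATK ApplyBQSR

    gatk ApplyBQSR -I <bam> -R <reference> --bqsr-recal-file <recal_table> -O <recalibrated_bam>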
    

Quality control

  • samtools stats

    samtools stats <bam>
    
  • Qualimap bamqc

    qualimap bamqc -bam <bam>
    

Pipeline run summary

  • MultiQC

Variant calling

Many variant calling tools exist, each with its own strengths and weaknesses, and their performance varies with factors such as sequencing technology, read depth, and the type of variants being analyzed. There is currently no scientific consensus on the best variant calling pipeline. AIMS includes three variant calling tools (HaplotypeCaller, Mutect2, and Strelka2), so users can choose the caller(s) most appropriate for their data.

Germline variant calling

Germline variant calling identifies genetic variations present in an individual’s germline cells, which are inherited from the parents and are present in every cell of the body. AIMS calls germline variants using HaplotypeCaller and Strelka2.

  • HaplotypeCaller

    gatk HaplotypeCaller -R <reference> -I <bam> -O <output_vcf>
    
  • Strelka2

    configureStrelkaGermlineWorkflow.py --bam <bam> --referenceFasta <reference> --runDir <run_dir>
    <run_dir>/runWorkflow.py -m local
    

Somatic variant calling

A somatic mutation is a genetic alteration that occurs in the non-reproductive cells (somatic cells) of an organism. Unlike germline mutations, which are inherited and present in every cell of an individual’s body, somatic mutations arise during an individual’s lifetime and are specific to certain cells or tissues. Somatic variant calling detects alterations in tumor samples compared to matched normal tissue samples. AIMS calls somatic mutations using Mutect2 and Strelka2.

  • Mutect2

    gatk Mutect2 -R <reference> -I <tumor_bam> -I <normal_bam> -normal <normal_sample_name> -O <output_vcf>
    
  • Strelka2

    configureStrelkaSomaticWorkflow.py --normalBam <normal_bam> --tumorBam <tumor_bam> --referenceFasta <reference> --runDir <run_dir>
    <run_dir>/runWorkflow.py -m local
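Since more than one caller can be run on the same tumor/normal pair, their calls are later combined in the Mutation Aggregation step. The sketch below shows one plausible way to do this, keyed on chromosome, position, reference allele, and alternate allele. The function name and the `min_callers` threshold are hypothetical; the pipeline's Perl aggregation scripts may apply different rules.

```python
from collections import defaultdict


def aggregate_calls(calls_by_caller, min_callers=1):
    """Map each variant (chrom, pos, ref, alt) to the sorted list of callers that reported it."""
    support = defaultdict(set)
    for caller, variants in calls_by_caller.items():
        for variant in variants:
            support[variant].add(caller)
    return {
        variant: sorted(support[variant])
        for variant in sorted(support)
        if len(support[variant]) >= min_callers
    }
```

With `min_callers=1` the result is the union of all calls annotated with their supporting callers; raising it to 2 keeps only variants on which Mutect2 and Strelka2 agree.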
    

Variant filtration and annotation

Tumor mutation burden estimation

Tumor Mutation Burden (TMB) is a measure of the total number of somatic mutations present in the genomic DNA of a tumor. It quantifies the extent of genetic alterations within a tumor and is often expressed as the number of mutations per megabase (Mb) of sequenced DNA, or as a total count per sample. AIMS calls an internal Python script to calculate the TMB of a sample.
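A common way to make TMB comparable across assays is to normalize the mutation count by the size of the sequenced region in megabases. The function below is a minimal sketch of that convention, not the internal AIMS script; the function name and the choice of which mutations to count are assumptions.

```python
def tumor_mutation_burden(mutation_count, covered_bases):
    """Mutations per megabase of covered sequence (illustrative normalization)."""
    if covered_bases <= 0:
        raise ValueError("covered_bases must be positive")
    return mutation_count / (covered_bases / 1_000_000)
```

For instance, 300 somatic mutations over a 30 Mb exome target correspond to a TMB of 10 mutations/Mb.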

Microsatellite instability detection

Microsatellite instability (MSI) refers to the condition where the number of short tandem repeats, known as microsatellites, in the DNA of a cell differs from what is considered normal. AIMS calculates MSI using MSIsensor2 and MANTIS.
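The definition can be illustrated with a toy comparison of repeat counts at a single microsatellite locus. This sketch only demonstrates the concept of a changed repeat number; it is not how MSI detection tools score instability, since they work on read-level repeat-length distributions across many loci. Both function names are hypothetical.

```python
def longest_repeat_run(sequence, motif):
    """Largest number of back-to-back copies of `motif` found in `sequence`."""
    if not motif:
        raise ValueError("motif must not be empty")
    best = 0
    for start in range(len(sequence)):
        count = 0
        pos = start
        while sequence.startswith(motif, pos):
            count += 1
            pos += len(motif)
        best = max(best, count)
    return best


def is_unstable(normal_seq, tumor_seq, motif):
    """Toy rule: the locus is unstable if the repeat count differs between normal and tumor."""
    return longest_repeat_run(normal_seq, motif) != longest_repeat_run(tumor_seq, motif)
```

A normal sequence carrying four CA repeats and a tumor sequence carrying five at the same locus would be reported as unstable by this rule.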

2.4 Validation

Germline variant calling validation

Report given

Somatic variant calling validation

Report given

TMB calculation validation

Report given

MSI calculation validation

Report given

2.5 Version update history