GATK4: calling somatic SNVs and indels

This is an HPC workflow for calling short somatic mutations, including single nucleotide alterations (SNAs) and insertions and deletions (indels), from 50 matched tumor-normal whole genome sequencing (WGS) BAMs from 25 childhood acute lymphoblastic leukemia cases. The WGS BAMs were preprocessed with this workflow.

This workflow follows the official GATK WDLs:

  • https://github.com/gatk-workflows/gatk4-somatic-snvs-indels

  • https://github.com/broadinstitute/gatk/tree/master/scripts/mutect2_wdl

See the GATK documentation for more details on GATK4 Mutect2:

  • Detailed explanation of Mutect2

  • Details on the deprecated pipeline

  • New pipeline

  • Mutect2 PDF document

  • New features and improvements

  • What traditional somatic calling entails

  • FAQ for Mutect2

    • How is the Mutect2 filter different in tumor-only mode, versus in matched-normal mode? What does it do differently in each case?

    In tumor-normal mode, Mutect2 detects germline variants using (1) the population allele frequency from the germline resource as a prior, (2) the normal reads, and (3) the allele fraction in the tumor (allele fractions near ½ suggest germline hets). Tumor-only mode uses the same model without the evidence from normal reads, and the powerful normal artifact filter is not available.

    Mutect2 uses the matched normal to additionally exclude rare germline variation not captured by the germline resource, as well as individual-specific artifacts. It uses a germline population resource as evidence that an allele is germline. The simplified sites-only gnomAD resource retaining allele-specific frequencies is available at ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/Mutect2.

    A panel of normals (PoN) fills the gap between the matched normal and the population resource. Mutect2 uses the PoN to catch additional sites of noise in sequencing data, such as mapping artifacts or other somewhat random but systematic artifacts of sequencing and data processing.

    An interesting Biostars post: How Do Heterozygotes And Somatic Mutations Manifest In Sequencing Projects

  • Create PoN

    • Create a panel of normals (PoN) containing germline and artifactual sites for use with Mutect2.

    • The tool takes multiple normal sample callsets produced by Mutect2's tumor-only mode and collates sites present in multiple samples (two by default, set by the --min-sample-count argument) into a sites-only VCF. The PoN captures common artifacts. Mutect2 then uses the PoN to filter variants at the site level. The --max-germline-probability argument sets the threshold for possible germline variants to be included in the PoN. By default this is set to 0.5, so that likely germline events are excluded. This is usually the correct behavior, as germline variants are best handled by probabilistic modeling via Mutect2's --germline-resource argument. A germline resource, such as gnomAD in the case of humans, is a much more refined tool for germline filtering than any PoN could be.

  • Calculate contamination: the final table gives the likely rate of cross-sample contamination in the tumor callset. If the contamination of the tumor is estimated at 1.15%, with an error of 0.19%, this means that about one read out of every hundred is likely to come from someone else. This percentage is effectively the floor for detecting low-frequency somatic events; for a potential variant called at or below that AF, we would have zero power to judge whether it was real or created through contamination. The table is fed into the filtering tool so that it can take the contamination estimate properly into account. (*Explanation from GITC.)
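Two of the ideas in the notes above lend themselves to a small numerical illustration: collating PoN sites seen in at least a minimum number of normals, and treating the contamination estimate as a floor on allele fraction. The sketch below is only a toy model, not how GATK implements either step. The function names, the site tuples, and the allele fractions are hypothetical, and the PoN part ignores the germline-probability threshold entirely.

```python
from collections import Counter


def collate_pon_sites(callsets, min_sample_count=2):
    """Return sorted sites seen in at least `min_sample_count` normal callsets."""
    counts = Counter()
    for sites in callsets:
        counts.update(set(sites))  # count each site at most once per sample
    return sorted(site for site, n in counts.items() if n >= min_sample_count)


def at_or_below_floor(allele_fraction, contamination):
    """True when a call's allele fraction is at or below the contamination estimate."""
    return allele_fraction <= contamination


# Hypothetical tumor-only calls from three normals: (contig, pos, ref, alt).
normals = [
    [("chr1", 1000, "A", "T"), ("chr2", 500, "G", "C")],
    [("chr1", 1000, "A", "T")],
    [("chr2", 500, "G", "C"), ("chr3", 42, "T", "G")],
]
pon_sites = collate_pon_sites(normals)

contamination = 0.0115  # 1.15%, the example estimate quoted above
flags = {af: at_or_below_floor(af, contamination) for af in (0.005, 0.0115, 0.05)}
```

With these made-up inputs, the chr3 site appears in only one normal and is left out of the panel, and the calls at allele fractions 0.005 and 0.0115 fall at or below the floor while the call at 0.05 does not.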

Best practices for somatic short variant discovery. A figure from GITC.

Step 1: Scatter intervals for Mutect2 parallel calling
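GATK provides the SplitIntervals tool for this step. The logic behind scattering can be illustrated with a minimal sketch that cuts a set of contigs into shards of roughly equal size, each of which could then be passed to a separate Mutect2 job. The function names and the contig lengths here are hypothetical, and the real tool has more options (for example, avoiding splits inside an interval).

```python
def scatter_intervals(contig_lengths, scatter_count):
    """Split contigs into at most `scatter_count` shards of near-equal size.

    Each shard is a list of (contig, start, end) with 1-based inclusive
    coordinates. Contigs may be cut in the middle to balance the shards.
    """
    total = sum(contig_lengths.values())
    target = -(-total // scatter_count)  # ceiling division: bases per shard
    shards, current, room = [], [], target
    for contig, length in contig_lengths.items():
        start = 1
        while start <= length:
            end = min(length, start + room - 1)
            current.append((contig, start, end))
            room -= end - start + 1
            start = end + 1
            if room == 0:  # shard is full, start a new one
                shards.append(current)
                current, room = [], target
    if current:
        shards.append(current)
    return shards


def to_interval_string(interval):
    """Format an interval as contig:start-end."""
    contig, start, end = interval
    return f"{contig}:{start}-{end}"


# Hypothetical toy genome; real contig lengths come from the reference .dict/.fai.
shards = scatter_intervals({"chrA": 100, "chrB": 50}, scatter_count=4)
shard_strings = [[to_interval_string(i) for i in shard] for shard in shards]
```

Each inner list of `shard_strings` holds the intervals for one parallel job; the calls from all shards are merged again in the gather step.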

Step 2: Gather results from scattered intervals

Step 3: Calculate contamination and learn orientation bias

Step 4: Filter alignment artifacts

Step 5: Filter variants
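As a rough sketch of how the outputs of the earlier steps feed into filtering, the snippet below assembles a FilterMutectCalls command line as an argument list. It assumes the filtering step uses FilterMutectCalls, as in the official WDLs; the helper name and all file names are hypothetical, and the flags should be checked against the GATK version in use. The command is only built and printed, not executed.

```python
import shlex


def filter_mutect_calls_cmd(reference, unfiltered_vcf, output_vcf,
                            contamination_table=None,
                            segmentation_table=None,
                            orientation_priors=None):
    """Build the argument list for a `gatk FilterMutectCalls` invocation."""
    cmd = ["gatk", "FilterMutectCalls", "-R", reference, "-V", unfiltered_vcf]
    if contamination_table:  # table from CalculateContamination
        cmd += ["--contamination-table", contamination_table]
    if segmentation_table:   # segments from CalculateContamination
        cmd += ["--tumor-segmentation", segmentation_table]
    if orientation_priors:   # model from LearnReadOrientationModel
        cmd += ["--ob-priors", orientation_priors]
    cmd += ["-O", output_vcf]
    return cmd


# Hypothetical file names for one tumor-normal pair.
cmd = filter_mutect_calls_cmd(
    reference="ref.fasta",
    unfiltered_vcf="case01.unfiltered.vcf.gz",
    output_vcf="case01.filtered.vcf.gz",
    contamination_table="case01.contamination.table",
    segmentation_table="case01.segments.table",
    orientation_priors="case01.read-orientation-model.tar.gz",
)
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it can be passed directly to `subprocess.run` or written into a job script for the HPC scheduler.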