Contents
- Sample collection and nucleic acid isolation
- Discovery sequencing
- Sequence alignment
- Variant calling
- Design of a Custom NimbleGen Capture Reagent
- Validation sequencing and analysis (Illumina capture, IDT, RMG)
- Clonality analysis
- Microarray genotyping and analysis
- Whole Genome and Exome Library Construction
- RNA-seq library construction
- Illumina whole genome and exome sequencing
- Ion Torrent Library Construction
- Exome Captures and Illumina Sequencing
- Nimblegen Custom Capture and Illumina Sequencing
- IDT custom capture (145 targets) and Illumina sequencing
- IDT AML recurrently mutated gene panel (RMG) custom capture and Illumina sequencing
- NimbleGen Exome with custom spike-in capture and Illumina sequencing
- Ion Torrent library quantification, template preparation, and enrichment
- Ion Torrent sequencing and validation of variant called in Illumina data
- Digital Droplet PCR sequencing of sub-clonal and relapse specific variants
- Amplicon (Ampliseq) Ion Torrent sequencing of AML driver variants
- RNA sequencing (Illumina)
- Analysis of library size distributions
- Analysis of multiple library sequencing and effect on library sequence complexity
- Sequence alignment
- Analysis of alignments from multiple algorithms
- Single nucleotide variant (SNV) calling
- Small insertion and deletion (indel) variant calling
- Copy number alteration (CNA) calling
- Loss of heterozygosity (LOH) analysis
- Structural variant calling (SV)
- Manual review of somatic variant predictions
- Estimating tumor sample purity and normal sample contamination by tumor cells
- Estimating an upper bound of VAF signal attributable to sequencing errors
- Identifying library specific errors
- Identifying relapse specific variants
- Integrated analysis of SNV expression using DNA-seq and RNA-seq data
- Splice site mutation analysis by integration of DNA-seq and RNA-seq data
- Defining a platinum variant list
- De novo variant calling with validation data and comparison to WGS variant calls
- Determining the effect of depth on subclone inference by downsampling analysis
Sample collection and nucleic acid isolation
Bone marrow biopsy specimens and normal skin samples (Figure 1 and Supplemental Table 1 for timepoints) were obtained from a single subject who provided written informed consent on a form that contained specific language authorizing whole-genome sequencing and data sharing. This consent procedure was approved by the Washington University School of Medicine Human Research Protection Office (HRPO) on 10/23/06 and renewed annually thereafter. Genomic DNA was prepared by column purification (Qiagen DNeasy).
Discovery sequencing
The primary AML tumor, relapse tumor and a matched normal skin sample were sequenced by whole genome sequencing on the Illumina platform using paired 2x100bp reads. Each sample was also subjected to exome sequencing using the Roche NimbleGen SeqCap EZ Human Exome Library v3.0. The resulting captured DNA was also sequenced by Illumina 2x100bp sequencing. Each sample was also captured using a custom set of capture probes (manufactured by IDT) that were designed to tile across 264 genes found to be recurrently mutated in AML [23634996]. Refer to Table 1 and Supplemental Figure 2 for sequence coverage levels achieved and Figure 1 for an overview of the experimental design. Refer to the Supplemental Methods for additional details on library construction and sequencing.
Sequence alignment
Whole genome, exome and custom capture reads were aligned to a modified version of the human reference genome (NCBI build 37, ‘GRCh37’) using BWA (v0.5.9) [19451168]. The reference genome sequence consists of chromosomes 1-22, X, Y, MT, and all unplaced “GL” contigs but excludes the alternate haplotype contigs. To assess a series of alternative short read aligners all reads obtained by custom capture of ~200k putative somatic variant sites were re-aligned to the same reference genome sequence using each of the following aligners: Bowtie2 (v2.2.0) [22388286], BWA (v0.7.7) [19451168], BWA-MEM (v0.7.7) [arXiv:1303.3997], Isaac (v01.14.03.12) [23736529], SMALT v(0.7.6) (http://www.sanger.ac.uk/resources/software/smalt/), SNAP v(0.15.4) [arXiv:1111.5572v1], Stampy v(1.0.23) [20980556] using default/recommended settings. In each alignment, duplicates reads were marked with Picard (v1.92) (http://broadinstitute.github.io/picard). Refer to the Supplemental Methods for additional details on how each of the aligners was run.
Variant calling
Prediction of somatic variants of various types was performed primarily using the ultradeep WGS data from the primary and relapse tumors (bone marrows) compared to a skin normal. Somatic single nucleotide variant (SNV) predictions consisted of the union of calls from seven SNV callers: SNV variant callers: bassovac (unpublished), MuTect v1.1.4 [23396013], Seurat v1.5-32-g2761da9 [23642077], Shimmer v0.1.1 [23620360], SomaticSniper v1.0.2 [22155872], strelka 0.4.6.2 [22581179], varscan 2.2.6 [22300766]. Somatic small insertion and deletion (indel) predictions were obtained by taking the union of seven indel callers: GATK v1.0.5336 [20644199], Pindel v0.2.2 [19561018], Seurat v1.5-32-g2761da9 [23642077], Shimmer v0.1.1 [23620360], Strelka v0.4.6.2 [22581179] and Varscan 2.2.6 [22300766]. Large scale somatic copy number variants (CNVs) were predicted by CopyCat v1.6.9 (unpublished). Structural variants (SVs) were predicted by BreakDancer [19668202] and SquareDancer. Regions of loss of heterozygosity (LOH) were predicted by use of a customized pipeline based on VarScan2 [22300766]. For complete details on all somatic variant calling refer to the Supplemental Materials.
Design of a Custom NimbleGen Capture Reagent
A total of 371,976 variants were called from the primary tumor or relapse WGS data by at least one of the seven variant callers as described above. Variants were removed from consideration for design of a validation if (a) the minimum whole genome coverage in the primary tumor was < 20X or (b) normal coverage was > 50X and the normal VAF was > 20%. This left 359,619 (96.7%) candidate somatic variants. All variants from this list were selected except for those called by seurat where a random subset (~⅓) of the remaining seurat called variants were selected (Supplemental Table 3). A final list of 198,814 unique sites were sent to NimbleGen for design of a validation capture reagent. Nimblegen was able to design targeted probes for 191,988 (96.6%) of these sites. Capture followed by sequencing produced 135,731 (68.3%) sites with coverage > 100X in the primary tumor.
Validation sequencing and analysis (Illumina capture, IDT, RMG)
Validation of predicted somatic variants from the primary and relapse tumors was conducted across multiple independent DNA samples (Table 1), using multiple capture reagents to enrich for regions of interest (Figure 1), using multiple orthogonal sequencing platforms, and multiple analysis strategies (Figure 1). Sequencing libraries were created in replicate from multiple aliquots of DNA as well as distinct DNA isolation events. Regions harboring putative somatic variants were enriched for deep validation by use of commercial exome reagents, design of a large custom NimbleGen capture panel (~200k sites), more focused capture reagents made by Integrated DNA Technologies, amplicon sequencing, and digital droplet PCR. Sequencing was performed primarily on the Illumina HiSeq and Ion Torrent platforms. Analysis involved use of comprehensive downsampling experiments, integration of multiple sequence libraries and platforms, integration and DNA and RNA data, integration of data spanning across seven disease timepoints, and extensive manual review of raw data. Refer to the Supplemental Materials for more detailed methods.
Clonality analysis
A ‘platinum’ list (Supplemental Methods) of high quality somatic variants were used for analysis of the clonal architecture of the primary tumor and its evolution from primary to relapse. To improve the accuracy of VAF estimation for each of these variants, VAFs were calculated by combining read depth from all data sets. Subclone identity (clones 1-6) was predicted by use of SciClone [25102416]. Estimated cell proportion values for the variants depicted in Figure 2 were scaled up by assuming that a founding clone VAF (highest VAF) for a heterozygous variant equal to 50% would be equivalent to a tumor purity of 100%. The interpretation of these assignments in the context of tumor evolution was first performed by use a ‘tumor cell map’ visualization tool developed in R and used to create Figure 2c. CloneEvol (http://github.com/genome/cloneevol) was then used to enumerate all possible subclonal architectures for tumor and relapse, and find the tumor models and relapse models that are biologically consistent (Supplemental Figure 15). These results along with analysis of critical driver variants across seven time points (Figure 4) and previously reported single cell sequencing results [24613412] were used to iteratively develop the more complex models of clonal evolution depicted in Figure 5.
Microarray genotyping and analysis
The normal, primary tumor, and relapse tumor genomic DNA were genotyped at ~715,000 positions using an Illumina HumanOmniExpress BeadChip kit (Cat. No. WG-315-1101). These positions were selected by Illumina from a HapMap data set as having >5% MAF. We performed hybridization of each sample to these arrays and scanned them with an Illumina iScan instrument (Cat. No. SY-101-1001) following the manufacturer’s instructions.
Whole Genome and Exome Library Construction
The yield and integrity of native genomic DNA was verified by a PicoGreen assay for quantitation (Invitrogen). Samples were also run on an Illumina iScan Instrument utilizing the Human OmniExpress genotyping array according to the manufacturer's recommendations (Illumina Inc, San Diego, CA). Small insert Illumina libraries were constructed with one of three kits/approaches according to the manufacturer's recommendations with a few exceptions described below. The three library kits used consisted of: (1) the Kapa LTP (KK8232) kit (Kapa Biosystems, Woburn, MA), (2) an in house kit that included a DNATerminator End Repair Kit from Lucigen coupled with Klenow, Quick Ligase and Phusion polymerase purchased from NEB, and (3) an in house kit that included KAPA HiFi DNATerminator End Repair Kit from Lucigen coupled with NEB reagents (Klenow, Quick Ligase) and KAPA HiFi Polymerase. For each library, 250-500 ng of gDNA was fragmented using a Covaris E210 DNA Sonicator (Covaris, Woburn, MA) with one of two custom programs as follows. Program 1 consisted of duty cycle: 20%, intensity: 5, cycles per burst: 500, and a treatment time of 120 seconds. Program 2 consisted of duty cycle: 5%, intensity: 4, cycles per burst: 200, and a treatment time of 90 seconds. To prevent excessive over-amplification during PCR, cycle optimization was performed. 2 µL of the library was cycled as follows: initial denaturation at 98 °C for 30 seconds followed by multiple cycles of 98 °C for 10 seconds, 65 °C for 30 seconds, and 72 °C for 30 seconds. After cycles 6, 8, 10, and 12 the program was halted and a 5µL aliquot was collected. Each cycle amplification product was evaluated on a 2.2% agarose Flash Gel (Lonza) and the proper cycle number determined. Eight PCR reactions were amplified at the determined cycle number to enrich for proper adaptor ligated fragments. The library was purified post amplification with a MinElute column (Qiagen, Valencia, CA). The libraries were fractionated by one of two methods. First using PAGE gels we would load 1 ug of the amplified library over three lanes of a gel and cut the following fractions: 300-350, 350-400, 400-450, and 450-500. Second, using a LabChipXT we would load 500 ng of amplified library using the DNA 750 chip (Perkin Elmer, Hopkinton, MA) and collect three unique fractions: 375 bp, 475 bp, and 675 bp with a +/- 5% covariance. Each fraction was assessed for concentration and size to determine molarity using the Qubit Fluorometer Quant-iT dsDNA HS assay (Life Technologies, Grand Island NY) and the Agilent BioAnalyzer High Sensitivity DNA Assay (Agilent Technologies, Santa Clara, CA), respectively. The concentration of each library fraction was verified through qPCR according to the manufacturer's protocol (Kapa Biosystems, Woburn, MA) to produce cluster counts appropriate for the Illumina Sequencing platform. Sample identity was confirmed for each sample by comparing sequence data with the Illumina Human OmniExpress genotype array data. Several other libraries were created as follows: Illumina multiplexed libraries were constructed with 500 ng of native genomic DNA according to the manufacturer’s protocol (Illumina Inc, San Diego, CA) with the following modifications: (1) DNA was fragmented using a Covaris E210 (Covaris, Inc. Woburn, MA) generating fragments within 100-400 bp . (2) Illumina adapter-ligated library fragments were amplified in four 50 µL PCR reactions for eighteen cycles. (3) Solid Phase Reversible Immobilization (SPRI) bead cleanup was used for enzymatic purification throughout the library process, as well as final library size selection targeting 300-500 bp fragments.
RNA-seq library construction
RNA-seq libraries were produced in multiple ways as part of a detailed RNA-seq technology development exercise conducted in parallel with that described for DNA sequencing strategies described in this work. RNA-seq libraries were constructed using a range of input fragment size fractions. PolyA enrichment of input total RNA was contrasted with ribo-reduction strategies. Both stranded and non-stranded RNA-seq libraries were generated. Multiple library construction kits involving additional differences were tested as was the use of cDNA capture with a NimbleGen exome reagent for library normalization [24814956]. For complete details refer to the related publication in this issue describing the RNA-seq data production and analysis.
Illumina whole genome and exome sequencing
KAPA qPCR was used as described above to quantify the libraries and determine the appropriate concentration to produce optimal recommended cluster density on a GAIIx (2x100 bp), HiSeq 2500 v1 Rapid Run Mode (2x100 bp), HiSeq2000 v3 (2x100 bp), or MiSeq v2 (2x150 bp) sequencing runs. All sequencing runs were completed according the manufacturer's recommendations (Illumina Inc, San Diego, CA).
Ion Torrent Library Construction
For each of the three samples (skin normal, tumor, and relapse), 100 ng of gDNA was sheared using the S2 Focused-ultrasonicator™ (Covaris, Woburn, MA) with settings as follows; duty cycle 20.0, intensity 5.0, and 500 cycles/burst for 120 seconds. Using the sheared DNA, we generated a library using the Ion Torrent Fragment Library kit (Life Technologies, Carlsbad, CA). End repair was carried out using 130 ml of fragmented DNA, 40 ml of 5X Ion Torrent End Repair Buffer, 2 ml of Ion Torrent End Repair Enzyme, and 28 ml of nuclease-free water. The end repair reaction was incubated at room temperature for 20 minutes and purified using Ampure XP beads (Beckman Coulter, Indianapolis, IN) at a bead to sample ratio of 1.5:1.0. The Ion Xpress Barcode Adapter 1-16 Kit was used to incorporate indexed adapters at the adapter ligation and nick translation step. For ligation of adapters, 25 ml of purified DNA was combined with 10 ml of 10X Ion Torrent Ligase Buffer, 2 ml of P1 Adapter, 2 ml of Ion Xpress Barcode Adapter, 49 ml of nuclease-free water, 2 ml of Ion Torrent DNA Ligase, and 8 ml of Ion Torrent Nick Repair Polymerase. Reactions were incubated for 15 minutes at 25 °C, 5 minutes at 72 °C, and a hold at 4 °C. Another round of Ampure XP bead cleanups was performed at a bead to sample ratio of 1.2:1.0 with the purified DNA eluted in 20 ml of Ion Torrent Low TE. The library for each sample was amplified across four PCRs using 4 ml of purified DNA, 25 ml of 2X KAPA HiFi PCR Master Mix (KAPA Biosystems, Boston, MA), 2 ml of Ion Torrent Library Amplification Primer Mix, and 19 ml nuclease-free water for a total volume of 50 ml. All reactions were cycled at 95 °C for 2 minutes followed by 8 cycles of 98 °C for 15 seconds, 58 °C for 30 seconds, and 72 °C for 30 seconds. The four PCRs per sample were pooled and purified once more using Ampure XP beads at a bead to sample ratio of 1.0:1.0. Recovered DNA was quantified using the Qubit dsDNA HS Kit (Life Technologies, Carlsbad, CA).
Exome Captures and Illumina Sequencing
Each sample was subjected to exome sequencing using: (A) the Roche NimbleGen SeqCap EZ Human Exome Library v3.0, (B) the Agilent Technologies SureSelect Clinical Research Exome, (C) the Agilent Technologies SureSelectXT Human All Exon V5, and (D) the Roche NimbleGen SeqCap EZ HGSC VCRome. Hybridizations were carried out according to the Nimblegen SeqCap EZ Exome protocol (Roche NimbleGen, Madison, WI.) and Agilent Technologies SureSelect Exome protocols respectively. The resulting captured DNA was sequenced by Illumina 2x100 bp sequencing on a HiSeq 2000 or HiSeq 2500. The VCRome was supplemented with a custom NimbleGen reagent that we designed to target regulatory regions of the genome (i.e. the “regulome”) (publication in preparation). The NimbleGen v3 exome hybridization was performed by itself but also in combination with a custom capture reagent covering genes found to be recurrently mutated in AML (see below for further details).
Nimblegen Custom Capture and Illumina Sequencing
Hybridizations were carried out according to the Nimblegen SeqCap EZ Exome protocol with the following modifications (Roche NimbleGen, Madison, WI.). We used 1 nmol each of Ion Torrent specific oligos as the hybridizing enhancing oligos (listed below):
PGM_A_Blocker: 5’-AACCATCTCATCCCTGCGTGTCTCCGACTCAG PGM_P1_Blocker: 5’-AACCACTACGCCTCCGCTTTCCTCTCTATGGGCAGTCGGTGAT
Pre-capture pooling was accomplished by combining 650 ng of amplified library per sample. The sample pool was hybridized with 100 ng of custom Nimblegen probes (design of the custom probe set is described separately in the Methods section). Following the standard Nimblegen SeqCap EZ Exome protocol (72 hour hybridization at 47°C and streptavidin bead binding/wash steps), the captured DNA for each sample was amplified across two PCRs using half of the bead bound DNA, 25 ml of 2X KAPA HiFi PCR Master Mix (KAPA Biosystems, Boston, MA), 2 ml of Ion Torrent Library Amplification Primer Mix, and 23 ml nuclease-free water for a total volume of 50 ml. All reactions were cycles for 95 °C for 2 minutes followed by 9 cycles of 98 °C for 15 seconds, 58 °C for 30 seconds, and 72 °C for 30 seconds. The two PCRs per sample were pooled and purified once more using Ampure XP beads at a bead to sample ratio of 1.0:1.0.
IDT custom capture (145 targets) and Illumina sequencing
We selected 145 putative somatic SNV sites for targeted capture and sequencing to much greater depth than could be achieved by whole genome or exome sequencing. The 145 sites consisted of 118 from a previous study of single cell sequencing of this case [24613412] as well as 27 additional sites predicted by whole genome and exome sequencing for this study. While 145 target sites were targeted for probe design, a total of 138 probes, each 120 bp in length were successfully designed and received from Integrated DNA technologies (IDT). These probes cover 134 (92.4%) of the target sites and consist of a total unique capture space of 16,560 bases. For each sample (normal, primary tumor and relapse tumor), four independent sequence libraries involving distinct ligation events were generated and single indexed (one hexamer) and pooled prior to capture. The methods for capturing with an IDT reagent were exactly the same as capturing with Nimblegen probes as described above. The same amount of probe, 100ng was used for IDT probes. We simply swapped NimbleGen probes for IDT probes and used the same Nimblegen kits and hybridization conditions. The captured libraries were sequenced by Illumina paired-end 2x100 sequencing as described above.
IDT AML recurrently mutated gene panel (RMG) custom capture and Illumina sequencing
We used a pool of probes designed to target 264 recurrently mutated genes (RMG) in AML [23634996] to obtain additional coverage over genes potentially most relevant to AML genome biology. The RMG reagent targets 6301 exons and a few regulatory regions for these 264 genes. These exon regions cover 1,201,754 (~1.2 Mb) unique bases in the genome. For these sites, 11,770 IDT probes were successfully designed. All probes are 120 bases in length and combined, cover 1,409,581 (~1.4 Mb) unique bases in the genome. These probes cover 1,175,202 (~1.2 Mb) bases of the original target exon bases (97.8%) without accounting for any ‘wingspan’ coverage that is routinely observed within at least 50-100 bases on either sides of each capture probe. For the normal and primary tumor samples, several sequence libraries were generated. Some of these were dual indexed (two octamers) and others were single indexed (one hexamer). The libraries for each sample were pooled and captured. The methods for capturing with this IDT reagent were exactly the same as capturing with Nimblegen probes as described above. The same amount of probe, 100ng was used for IDT probes, we simply swapped Nimblegen probes for IDT probes and used the same Nimblegen kits and hybridization conditions. The captured libraries were sequenced by Illumina paired-end 2x100 sequencing as described above.
NimbleGen Exome with custom spike-in capture and Illumina sequencing
To assess the possible effect of spiking in a custom capture IDT reagent along with an Exome capture reagent we combined the AML RMG reagent described above with the Roche NimbleGen SeqCap EZ Human Exome Library v3.0. The NimbleGen v3.0 exome covers ~63.6 Mb of unique bases in the genome. The AML RMG reagent covers exons that are for the most part already covered by the NimbleGen exome reagent. Specifically, 96.8% of the bases covered by the AML RMG probes are already covered by the NimbleGen v3.0 Exome. However, the spike-in serves to boost coverage over these regions. We typically see an approximately 5-fold increase in coverage over the RMG regions in exome experiments utilizing this spike-in strategy. We also experimented with a spike-in reagent that extended the existing NimbleGen regions into regulatory regions of the genome (the “regulome”) (in preparation). When spiking either of the IDT reagents into an existing Exome reagent such at the NimbleGen v3.0 exome, we attempted to achieve an approximately 1:1 stoichiometry. This was accomplished by determining the ratio of the genome bases targeted by the IDT reagent compared to the ~63.6 Mb targeted by the NimbleGen v3.0 exome reagent. For the IDT RMG reagent this works out to 1.4 / 63.6 Mb or a ratio of 0.022. Since this standard input for a NimbleGen Exome hybridization is 100 ng of probe, we would add 2.2 ng of IDT RMG probe. When an exome with IDT spike-in experiment was performed, hybridization and sequencing were conducted as described above for the NimbleGen exome alone.
Ion Torrent library quantification, template preparation, and enrichment
The library was quantified through use of the KAPA Library Quantification Kit for Ion Torrent and diluted to 26 pM with Ion Torrent Low TE. Template preparation was carried out on the Ion OneTouch 2 instrument in conjunction with the Ion PGM Template OT2 200 Kit according to revision 3.0 of the corresponding Ion PGM Template OT2 200 protocol (publication part number MAN0007221). The amplification reaction consisted of 25 ml of nuclease-free water, 500 ml of Ion PGM Template OT2 200 Reagent Mix, 300 ml of 25 ml of Ion PGM Template OT2 200 PCR Reagent B, 50 ml of Ion PGM Template OT2 200 Enzyme Mix, 25 ml of the library diluted to 26 pM, and 100 ml of Ion PGM Template OT2 200 Ion Sphere™ Particles (ISPs). Following automated template prep on the OneTouch 2, enrichment of template-positive ISPs was done on the Ion OneTouch ES instrument.
Ion Torrent sequencing and validation of variant called in Illumina data
The Ion PGM Sequencing 200 Kit v2 was used for sequencing on the Ion Torrent PGM. Each template prep and enrichment reaction was sufficient for running a single 318 chip. A total of 105,613,091 reads (18.8 Gbp) were generated across twenty-one 318 chips. For preliminary data analysis, Torrent Suite version 4.0.2 was used to combine all sequence data, mark duplicates, and provide preliminary coverage analysis against the custom target regions. Reads vary in length based on polymerase performance, sequence content (the run is limited only by the number of flows of each nucleotide which is usually 500 for a 200 bp sequencing run), and library fragment length. We obtained an average 2.1 nt incorporations per 4 cycle nucleotide flow. The mode insert size for most of the runs was around 260 bp with a fairly wide distribution.
To assess validation rates for the de novo variants called by application of somatic variant callers to Illumina WGS data we applied the following procedure. We first filtered out sites which had a minimum read depth of 100 and Variant Allele Frequency (VAF) > 0.35% in the normal sample across any of the platforms. Of the sites that remained we marked a site as validated if they had at least 4 variant supporting reads in either the Ion Torrent or Illumina Capture data. We then plotted the VAFs for the sites between platforms. For each pairwise plot, a site had to have a minimum coverage of 40 reads in both the platforms to adequately assess VAF. Supplemental Table 13 summarizes the number of sites in each pairwise comparison and gives the R2 correlation (Pearson) of the VAFs between platforms. Supplemental Figure 13 shows the raw VAF data and comparisons between the WGS Illumina data, capture Illumina data, and capture Ion Torrent data.Digital Droplet PCR sequencing of sub-clonal and relapse specific variants
The purpose of this experiment was to evaluate the application of droplet digital PCR (ddPCR) as an orthogonal validation approach for low VAF somatic variants. Based on deep read counts from primary tumor capture sequencing, VAFs were based on read coverage levels nearing a depth of 2000x. 15 random, single nucleotide variants (SNVs) were picked with VAFs ranging between 1% and 50%. We employed ddPCR and assayed each SNV position using a 5' nuclease allelic discrimination assay with TaqMan probes designed to the reference and variant alleles at each genomic position [12218656]. The selected SNVs are listed below with genomic coordinates, the reference and variant nucleotide, and VAF (Supplemental Table 9). For each reference and variant allele, BioRad designed PCR primer pairs and TaqMan probes. Each probe is a short nucleotide sequence (16-29 bp) and possesses a 3’ quencher molecule (Iowa Black FQ) and a 5’ reporter that is either 4, 7, 2', 4', 5', 7'-Hexachloro-6-carboxyfluorescein (Hex) or 6-carboxyfluorescein (6-Fam) flourophore (Figure 1). The reference allele probe is designed with Hex, and the probe for the variant allele is designed with 6-Fam. For the ddPCR validation assay, each probe pair (reference and variant) was combined with ~20,000 copies of target, primary tumor gDNA and encapsulated into droplets. 50 ng of template DNA were used per assay (well). 1 to 40 wells were assayed per target variant (i.e. up to 2 µg total DNA assayed). The droplets were thermal cycled to endpoint (40 cycles). For droplets to generate a fluorescent signature, a droplet must contain the ‘target’ DNA, the primer pair and probe(s). The primers must anneal and amplify the target DNA displacing the probe and releasing the fluorophore. Post amplification, all droplets are scanned using a two-color detector (BioRad QX100 Droplet Reader). Each droplet is classified by fluorescent emission and categorized into quadrants using the Quanta Software package. Applying normalization thresholds, droplets are classified into one of the following four quadrants: (A) 6-Fam negative/Hex negative (F-/H-), (B) 6-Fam negative/Hex positive (F-/H+), (C) 6-Fam positive/Hex negative (F+/H-), and (D) 6-Fam positive/Hex positive (F+/H+). Quadrant B (H+/F-) represents the counts for the reference allele. Quadrant C (H-/F+) represents the counts for the variant allele. To establish a baseline for sensitivity we evaluated synthetic g-blocks with mutant alleles titrated at 1%, 0.1% and 0.01%. gBlocks™ gene fragments are custom, double-stranded, sequence-verified fragments of DNA up to 500 bp (ours were ~200bp). gBlocks gene fragments are synthesized using the same high-fidelity synthesis chemistries developed by IDT for their Ultramer™ oligonucleotides. For all ddPCR assays we used two types of negative controls: (A) no template controls (water) for each column in the plate (1 per droplet cartridge) for each primary tumor and relapse tumor run and (B) a Coriell (HAPMAP) sample as a negative control for each of the 10 putative somatic SNV loci assayed with the primary tumor DNA. We encountered two assay failures during the ddPCR evaluation for this work (see Supplemental Table 9). First, we only observed H+/F- droplets specific to the reference allele probe, ‘11_G:127107590’. This result demonstrates the PCR performed as expected, generating robust reference allele signal. The lack of Fam signal as observed by the absence of H-/F+ and H+/F+ droplets suggested a design flaw with the variant allele probe, ‘11_A:127107590’. Performing a pairwise alignment of both probe sequences, we identified probe ‘11_A:127107590’ probe was synthesized with a Thymine instead of an Adenine at the variant position. Second, the probe designs for ‘1:75188531’ failed to generate fluorescent signal suggesting the PCR failed completely. The probe for the variant allele is 29 nt in length with a % G+C content of 17.2% (5’ TTAATAAAAATATGTAAGGAAAGAAAACA). The rich A/T content in this probe may require temperature optimization to increase annealing specificity or a redesign to increase probe length and Tm. With respect to background noise, we evaluated chr17: 53783869 synthetic g-blocks with the mutant allele titrated at 1%, 0.1% and 0.01%. Based on these experiments we could not be confident below the 0.1% frequency with this assay.
Amplicon (Ampliseq) Ion Torrent sequencing of AML driver variants
The purpose of this experiment was to assay 10 putative cancer driver variants discovered in the primary or relapse tumor across seven distinct time points of disease progression and treatment. Each of these driver variants was targeted by a single PCR amplicon. These amplicons were pooled and sequenced on the Ion Torrent PGM platform. Samples and associated time point (number of days from date of diagnosis) are provided in Supplemental Table 1. Primer sequences are provided in Supplemental Table 7. Using the Ion AmpliSeq™ Designer version 4.0 (www.ampliseq.com), a BED file of 10 targets was submitted for primer design under the FFPE DNA workflow. Primers were delivered pre-pooled at 2X concentration. In conjunction with the AmpliSeq™ Library Kit 2.0, 10 ng of input DNA in 6 μl was combined with 10 μl of the 2X custom AmpliSeq™ primer pool and 4 μl of the 5X Ion AmpliSeq™ HiFi Mix. All reactions were cycled: 99 °C for 2 minutes followed by 24 cycles of 99 °C for 15 seconds and 60 °C for 4 minutes. After amplification, primer sequences were digested by adding 2 μl of FuPa Reagent and cycling at 50 °C for 10 minutes, 55 °C for 10 minutes, and 60 °C for 20 minutes. Adapters were ligated onto the samples through the addition of 4 μl of Switch Solution, 2 μl of Ion Xpress Barcode adapter mix, and 2 μl of DNA Ligase. Samples were purified using 45 μl of Ampure XP beads (Agencourt/Beckman Coulter). The libraries were quantified through use of the KAPA Library Quantification Kit for Ion Torrent and diluted to 26 pM with Ion Torrent Low TE. Dilutions of all 10 libraries were then pooled together. Template preparation was carried out on the Ion OneTouch 2 instrument in conjunction with the Ion Personal Genome Machine® (PGM™ ) Template OT2 200 Kit according to revision 3.0 of the corresponding Ion PGM™ Template OT2 200 protocol (publication part number MAN0007221). The amplification reaction consisted of 25 μl of nuclease-free water, 500 μl of Ion PGM™ Template OT2 200 Reagent Mix, 300 μl of 25 μl of Ion PGM™ Template OT2 200 PCR Reagent B, 50 μl of Ion PGM™ Template OT2 200 Enzyme Mix, 25 μl of the library diluted to 26 pM, and 100 μl of Ion PGM™ Template OT2 200 Ion Sphere™ Particles (ISPs). Following automated template prep on the OneTouch™ 2, enrichment of template-positive ISPs was done on the Ion OneTouch™ ES instrument. The Ion PGM Sequencing 200 Kit v2 was used for sequencing on the Ion Torrent PGM™. A total of 2,135,638 reads (221 Mbp) were generated on a single 318 chip. Torrent Suite version 4.0.2 was used to provide preliminary coverage analysis against the custom target regions.
RNA sequencing (Illumina)
For the integrated DNA/RNA analysis described in this work, a merged alignment BAM containing RNA-seq reads from all RNA-seq approaches (except those using small fragment input) was generated from 21 lanes of Illumina HiSeq 2000 and HiSeq 2500 data. For complete details refer to the related publication in this issue describing in detail the RNA-seq data production and analysis.
Analysis of library size distributions
Whole genome sequencing of each sample described in this work involved multiple independent libraries that were deliberately designed to consist of DNA fragments of different sizes. For example, the primary tumor WGS data consists of sequence from 12 distinct library replicates. Library size distributions were determined for each whole genome sequence library using the Picard (http://broadinstitute.github.io/picard/) tool ‘CollectInsertSizeMetrics’. Visualizations of the shape of each fragment size distribution were generated in the statistical programming language R with the ‘ggplot2’ library.
Analysis of multiple library sequencing and effect on library sequence complexity
To explore the relationship between library construction strategy and library complexity we considered different combinations of the distinct WGS libraries for the primary tumor sample. We considered each library on its own, pools of three libraries with similar sizes, pools of three libraries with different sizes, and finally a pool of all libraries combined. For each of these pools of WGS libraries we started with per library BAM files (with no duplicates marked) and downsampled each BAM evenly to produce a target overall library depth. Library sizes for these calculations were determined by use of ‘samtools flagstat’ [19505943]. BAM manipulations were performed by use of various Picard tools (http://broadinstitute.github.io/picard/) (v1.118) with the following parameters supplied to JAVA: ‘java -Xmx15g -XX:MaxPermSize=256m’. Downsampling was performed using the Picard tool ‘DownsampleSam’. The resulting pool of BAMs were position sorted, merged, and indexed using Picard tools: ‘SortSam’, ‘MergeSamFiles’, and ‘BuildBamIndex’ respectively. Duplicates in this BAM were marked using Picard ‘MarkDuplicates’ with the option ‘REMOVE_DUPLICATES=true’ to ensure that downstream tools could not fail to ignore the duplicates. A coverage report for the resulting BAM was produced using ‘samtools depth’ with the parameters ‘-q 20 -Q 20’. To improve the run time of this step, the coverage report was limited to the coding exon regions of chromosome 22. Median depth of the BAM file was determined from this coverage report. The overall duplication rate of the BAM was determined using a samtools flagstat report. The entire downsampling process described above was repeated ~1,400 times using different combinations of WGS libraries being sampled to varying degrees, from 500,000 to 50,000,000 total reads. Median coverage, duplicate rates and other metrics from these simulations were analyzed in the statistical programming language R and visualized using the ‘ggplot2’ library. We defined the exhaustion point of a library as the depth at which 50% of reads would be duplicates. We predicted the exhaustion point of each library by applying a linear model to down-sampled data for each library (i.e. the duplication level observed at 5 million, 10 million, etc. reads) and extrapolating.
Sequence alignment
Reads for WGS, exome, and capture data for the normal, primary tumor and relapse tumor were aligned to the human reference genome sequence (build37/hg19) via BWA 0.5.9 with 4 threads (-t) and the read trimming parameter (-q) equal to 5. Duplicate reads were marked using picard 1.46. For comparison of multiple alignment algorithms, capture bam files were converted to FASTQ format via Picard 1.92 with parameters ‘INCLUDE_NON_PF_READS=T, INCLUDE_NON_PRIMARY_ALIGNMENTS=T, VALIDATION_STRINGENCY=LENIENT’ and re-aligned with 7 additional aligners. For all re-alignments, default parameters were used with the following exceptions. For Bowtie2, the ‘--phred33’ flag was set to accept ‘Phred+33’ encoding and use of 8 threads was specified by the parameter ‘-p 8’. For BWA 0.7.7, ‘-t 8’ option was used to allow the use of 8 threads. For BWA-MEM 0.7.7, the ‘-t 8’ option was used to allow the use of 8 threads and the ‘-M’ flag was set to allow for Picard compatibility. For Issac 01.14.03.12, the reference genome was sorted with parameter (-j 24) allowing for parallel operations, fastqs were split on lane to be of compatible input, and reads were aligned with the following optional parameters: ‘--cleanup-intermediary’ to remove intermediary files, and ‘-m 80’ to limit to 80 Gb of memory. For SMALT 0.7.6, an index was built with recommended parameters and reads were aligned with the flag ‘-l pe’ specifying paired end reads and ‘-n 8’ specifying use of 8 threads. For SNAP 0.15.4, an index was built with ‘-hg19’ flag set and reads were aligned. For Stampy 1.0.23, a genome index and hash were created and reads aligned from BWA 0.7.7 were aligned with parameters ‘-t 8’ specifying 8 threads and ‘--bamkeepgoodreads’ flag set to true. Duplicate reads for the 7 aligners were re-marked via Picard 1.92.
Analysis of alignments from multiple algorithms
Read counts corresponding to the 198,814 selected for capture validation sequencing were obtained via bam-readcount 0.5 for Normal/Tumor/Relapse capture data from 8 aligners and merged into a table. Using R 3.1.0 Annotations were added for which caller(s) made calls from the WGS data and the primary tumor caller count, data was then subset to remove 48,578 “relapse specific” calls. The data displayed highly similar coverage densities for all aligners. We applied a minimum filter requiring 600x coverage for a variant, this removed an additional 150,236 variants. Variant positions were then annotated with respect to which tier, if the variant was “platinum” and variance was calculated. Pairwise plots were generated via GGally 0.4.8 comparing variant allele frequencies between 8 aligners, correlation values were calculated using the Spearman method (Supplemental Figure 16a). Primary tumor variance as well as variance between “Platinum” and “Non-Platinum” and variance between the number of callers that made a call was plotted via the R package ggplot2 1.0.0 (Supplemental Figure 16b). Density plots were also plotted via ggplot2 corresponding to VAF distribution for 8 aligners (Supplemental Figure 16c) and VAF variation for 7 somatic callers (Supplemental Figure 16d).
Single nucleotide variant (SNV) calling
Using the BWA (v0.5.9) alignments described above, somatic single nucleotide variants (SNVs) were called from both whole genome and exome alignments using the following seven SNV variant callers: bassovac (unpublished), MuTect v1.1.4 [23396013], Seurat v1.5-32-g2761da9 [23642077], shimmer v0.1.1 (unpublished), SomaticSniper v1.0.2 [22155872], strelka 0.4.6.2 [22581179], varscan 2.2.6 [22300766]. The unique union of all variant callers was obtained (i.e. all SNVs called by one or more callers). Each of these variant callers was run using default parameters or according to those recommended by the authors where available. In some cases, if a detailed method was available, filtering was performed after variant calling. If no filtering strategy was provided for the tool the unfiltered ‘passing’ variants in the VCF file generated by each variant caller were used for subsequent analysis. Specific details for each variant caller follow. Seurat was obtained and run as part of the GATK v1.5-32 in parallel by dividing the genome into 50 approximately equal intervals with default parameters except for ‘-Q 15 --indels'. Shimmer (commit 8c1980be2d103cdec26d44bbb748b949837f3782) was obtained by cloning the git repository: git://github.com/nhansen/Shimmer.git. Variant calling was performed on each tumor/normal BAM pair without parallelization (to allow accurate multiple testing correction), and using all default parameters. SomaticSniper calls were produced by intersecting calls from samtools mpileup and bam-somaticsniper v1.0.2. Samtools mpileup was run with the parameters ‘-A -B’ and the result was filtered by the Genome Modeling System filtering tool: ‘false-positive v1’ with default parameters. SomaticSniper was run with the parameters ‘-F vcf -q 1 -Q 15’ and the results were filtered by the GMS filter ‘false-positive v1’ with the parameters ‘--bam-readcount-version 0.4 --bam-readcount-min-base-quality 15’ and the GMS filter ‘somatic-score-mapping-quality v1’ with the parameters ‘--min-mapping-quality 40 --min-somatic-score 40’. The ‘false-positive v1’ filter applies the following criteria: (a) at least 1% of variant allele support must come from reads sequenced on each strand, (b) the variant allele frequency must be at least 5%, (c) there must be at least 4 variant supporting reads, (d) the average relative distance of the variant from the start/end of reads must be greater than 0.1, (e) the difference in mismatch quality sum between variant and reference reads must be less than 50, (f) the difference in mapping quality between variant and reference reads must be less than 30, (g) the difference in average supporting read length between variant and reference reads must be less than 25, (h) the average relative distance to the effective 3’ end of variant supporting reads must be at least 0.2, and (i) the variant must not be adjacent to 5 or more bases of the same nucleotide identity (e.g. a homopolymer run of the same base). The ‘somatic-score-mapping-quality v1’ filter removed variants if the average mapping quality of variant supporting reads was less than 40 and if the somatic score reported by SomaticSniper was less than 40. Strelka v0.4.6.2 was run as described by the authors with default parameters except when exome data was being processed in which case the following the parameter ‘isSkipDepthFilters = 0’ was used. Varscan v2.2.6 was run with default parameters and the results were filtered by the GMS filter ‘varscan-high-confidence v1’ and the GMS filter ‘false-positive v1’ with the parameters ‘--bam-readcount-version 0.4 --bam-readcount-min-base-quality 15’. The ‘varscan-high-confidence v1’ filter removed variants if (a) the p-value reported by Varscan was greater than 0.07, (b) the normal variant allele frequency was greater than 5%, (c) the tumor variant allele frequency was less than 10% or (d) less than 2 reads support the variant. MuTect v1.1.4 was run in parallel by dividing the genome into 50 approximately equal sections with default parameters and used the p‘--cosmic-vcf’ to supply a VCF file of known mutations from COSMIC v54 and ‘--dbsnp-vcf’ to supply a VCF file of known human polymorphisms from dbSNP v137 as supplied with the MuTect download. Bassovac is an unpublished somatic variant caller currently under development at our center. A development version was run with prior assumptions that normal purity was 99.9% and tumor purity was 95%. In addition to the seven callers mentioned we also attempted use of three additional callers, Virmid [23987214], EBCall [23471004], and LoFreq [23066108], each of which was abandoned due to installation or run time problems on this data.
Small insertion and deletion (indel) variant calling
Using the BWA (v0.5.9) alignments described above, somatic small insertions and deletions (indels) were called from both whole genome and exome alignments using the following six indel variant callers: GATK v1.0.5336 [20644199], Pindel v0.2.2 [19561018], Seurat v1.5-32-g2761da9 [23642077], Shimmer v0.1.1 (unpublished), Strelka v0.4.6.2 [22581179] and Varscan 2.2.6 [22300766]. The unique union of all variant callers was obtained (i.e. all indels called by one or more callers). GATK v1.0.5336 was run with the parameters ‘-T IndelGenotyperV2 --somatic --window_size 300 -et NO_ET’, retaining only those which were called as somatic, and then filtering the using the GMS filter ‘false-indel v1’. Indels predicted by GATK were retained GATK indel sites based on the following requirements: (a) maximum length of an adjacent homopolymer of 4 bases (only affects 1-2 bp indels) (b) percent of reads supporting the indel on the plus strand ≥ 1% and ≤ 99% (indels failing these criteria are filtered only if the reads supporting the reference do not show a similar bias) (c) minimum number of indel supporting reads of 2 (d) minimum indel supporting read frequency at the site of 5%, (e) indel falls within the middle 90% of the aligned portion of the read, (f) maximum difference between the quality sum of mismatching bases in reads supporting the indel and reads supporting the reference of 100, (g) maximum mapping quality difference between reads supporting the indel and reads supporting the reference of 30, (h) maximum difference in aligned read length between reads supporting the variant base and reads supporting the reference base of 15, and (i) minimum average distance to the effective 3’ end of the read for indel supporting reads of 20% of the sequenced read length. Pindel v0.2.2 was run with the parameter ‘-w 10’ with a config file generated to pass both tumor and normal BAM files set to an insert size of 400. Pindel results were filtered using the GMS filters ‘pindel-somatic-calls v1’, ‘pindel-vaf-filter v1’ with parameters ‘--variant-freq-cutoff=0.2’ and ‘pindel-read-support v1’. Pindel calls were retained if they had: (a) no support in the normal data, (b) had more reads reported by Pindel than reported by Samtools at the indel position or if the number of supporting reads from Pindel was ≥ 8% of the total depth at the position reported by Samtools, (c) Samtools reported a depth less than 10 at the region and Pindel reported more indel supporting reads than reads mapped with gaps at the site of the call, and (d) a Fisher's exact test p-value ≤ 0.15 was returned when comparing the number of reads with gapped alignments versus reads without in the normal to the tumor. Varscan v2.2.6 was run with default parameters and the results were filtered using the GMS filters ‘false-indel v1’ with parameters ‘--bam-readcount-version 0.4 --bam-readcount-min-base-quality 15’ and ‘varscan-high-confidence-indel v1’. VarScan indel calls were retained if they met the following criteria: (a) VarScan reported a somatic p-value ≤ 0.07, (b) VarScan reported a normal frequency ≤ 5%, (c) VarScan reported a tumor frequency ≥ 10%, (d) Varscan reported ≥ 2 reads supporting the variant. VarScan variants passing these criteria were then filtered for likely false positives using bam-readcount v0.4 and identical criteria as described above for GATK. Strelka v0.4.6.2 was run with default parameters except for setting ‘isSkipDepthFilters = 0’. Seurat and Shimmer calls were obtained from the same runs described above in the SNV variant calling descriptions. Since the authors do not specify prescribed filtering practices or provide associated tools to do so, all indels marked as passing internal criteria of the tools were extracted from the VCF file and used for downstream analysis. After obtaining the union of all indel callers, the list was filtered as follows. Transcripts were obtained from Ensembl in GTF format and the coordinates of ‘exon’ records were obtained after removing those that correspond to pseudogenes. The resulting exon coordinates were converted to BED format and overlapping exons were merged using bedtools merge [20110278]. The candidate indel list was then intersected with this merged exon coordinate list and only introns overlapping an exon were retained. Introns corresponding to the mitochondrial genome or ‘GL’ contigs of the human assembly were removed. The remaining list of 821 indels were manually reviewed in IGV to identify likely false positives.
Copy number alteration (CNA) calling
Copy number alterations were detected using CopyCat v1.6.9 with 1000 bp windows and parameters: perLibrary=1, perReadLength=1, minWidth=3, minMapability=0.6. (https://github.com/chrisamiller/copycat). All predicted sites were manually reviewed.
Loss of heterozygosity (LOH) analysis
VarScan (version 2.3.6) was used to detect possible Loss of Heterozygosity (LOH) segments in this sample. First VarScan was used to call somatic mutations in the WGS tumor sample, this step also identifies germline mutations that are present in both the tumor and normal sample and LOH SNPs that are heterozygous in the normal sample but homozygous for one allele in the tumor sample. SNPs that were in close proximity with small insertions or deletions, or tightly clustered next to each other were filtered out as likely false-positives. The germline and LOH SNPs were merged together into a single file and LOH segments were identified by looking for stretches of LOH SNPs. Segments with a minimum of 10 SNPs and which had at-least 95% SNPs with LOH were retained. VAF plots were also generated and examined across the genome to manually confirm the results and identify false-negatives.
Structural variant calling (SV)
Structural Variants(SVs) were detected using a combination of Breakdancer (version 1.4.2) [25152801, 19668202] and Squaredancer (version 0.1) [21666668]. SV region re-alignment with NovoAlign and local assembly by TIGRA-SV (https://github.com/genome/tigra-sv) were used to filter out likely false-positive calls. Three parallel approaches were used to generate SV calls and these were merged together in the final step. First, inter-chromosomal translocations were called using Breakdancer (with parameters -g -h -a -t -q 10 -d) and then filtered by re-alignment with NovoAlign and local assembly by Tigra-SV. Second, intra-chromosomal SVs were called using Breakdancer (parameters -g -h -a -q 10 -o) and filtered by Tigra-SV. Third, Squaredancer was used to call SVs that were then filtered by Tigra-SV. In the last step the union of all the calls were merged to produce a final set of SV calls.
Manual review of somatic variant predictions
Manual review of tumor somatic single nucleotide variants, insertions and deletions was conducted to allow for a visual inspection of sequence data with IGV to determine if the variant were true somatic mutations or likely erroneous calls by the variant detector. Each variant site was labeled with a unique letter code to determine if it should be removed from consideration or kept in the list for further analysis. Manual review codes and criteria used to label variants that failed manual review were as follows. ‘O’ (‘other’) included sequence anomalies, artifacts, germline paralog amplifications, variants supported only by short DNA fragments (short inserts), variants supported by reads that were are all oriented in the same direction and cases where normal coverage was insufficient. ‘G’ was used to denote germline mutations that appear to have been incorrectly called somatic. ‘LOH’ was used for variants that were biallelic in normal samples with only one of the alleles in the tumor sample. ‘A’ was used for ambiguous or discrepant traces, areas that lacked sufficient evidence to confirm or fail the variant. Variants with these codes were eliminated from further analysis. Variants that passed manual review were coded with an ‘S’. The results of manual review were added to a tab delimited file containing the following entries for each variant: chromosome, start position, end position, reference allele, variant allele, manual review code, and manual review comments. All variants in the ‘platinum’ variants list passed manual review according to the above criteria.
Estimating tumor sample purity and normal sample contamination by tumor cells
To estimate tumor cell purity for the primary and relapse tumor samples we used SciClone [25102416] to cluster a ‘platinum’ set of heterozygous somatic variants and selected those variants that were identified as likely belonging to the ‘founding’ clone (clone ‘1’ according to SciClone). VAFs were calculated for these sites using read counts obtained by combining all WGS and custom capture (200k variants) alignments. Heterozygous somatic variants present in the initiating cell of the tumor are expected to be present in all tumor cells and therefore have a VAF of 50% (contaminating normal cells in the tumor sample will reduce the VAF). Since this patient is male, hemizygous variants on the X chromosome were excluded from this analysis. The tumor purity estimate was calculated as two times the median VAF of the variants in the ‘founding clone of each tumor sample. In order to estimate tumor contamination in the normal sample we identified a high quality list of somatic variants without using an arbitrary normal VAF filter to decide they are somatic (beyond what the individual callers may do of course). This reduces but does not eliminate the inherent circular reasoning aspect of this exercise. Starting with 198,814 putative somatic variants called from ultradeep WGS data (~330X) that were then captured and sequenced to ~1,500X we obtained read counts for all WGS and capture alignments combined. Reads supporting the variant or reference base were counted if the base quality at the variant position was >10 and the mapping quality of the read was >10. We then examined the tumor data identified a set of high quality heterozygous somatic variants predicted to be in the founding clone by removing all variants: (A) with 1000 genomes GMAF > 0.1% (B) documented in dbSNP (C) where the coverage in the tumor is an outlier (i.e. does not fall within 500-4,000X) (D) on the GL contigs, MT, X, or Y chromosomes (E) that have a tumor VAF that is not consistent with a heterozygous variant in the founding clone (F) that are found in a pool of unrelated normal samples (from solid tumor cases) (G) with insufficient coverage in the normal so we are considering sites with the best estimates of normal VAF (normal coverage >= 1,000X). This left 591 high quality somatic variant sites that were selected without use of a normal VAF cutoff. The VAF of these sites were examined in the normal sample to estimate the potential contamination of tumor DNA in that sample. The VAF distribution was plotted using the ‘ggplot2’ library of the statistical programming language R and the 95th percentile was taken as an upper estimate of tumor contamination.
Estimating an upper bound of VAF signal attributable to sequencing errors
Since this analysis attempted to detect somatic tumor variants with unprecedented whole genome sequence we required an understanding of error rates and how these signal might contribute to false positives. For example, we wanted to convince ourselves that certain very low tumor VAFs would not be likely to have been observed by sequencing error alone. As a starting point, we selected genotype calls from the microarray genotype arrays for comparison to the Illumina and Ion Torrent sequencing data. These sites represented positions where we expect only a single base in our sequence data and the identity of that base has been determined by an orthogonal platform unlikely to be subject to the same systematic biases that sequence data might suffer from. We started with a genotype file of 730,520 predicted genotypes from the microarray result. To identify high quality genotype calls from the genotype array platform we required a stringent ‘gc_score’ value >= 90 (http://res.illumina.com/documents/services/technote_infinium_genotyping_data_analysis.pdf). We also limited the analysis to those sites called homozygous by the genotype array. For all of these sites obtain read counts, coverage levels, and VAFs from the primary tumor Illumina WGS, Illumina Exome, Illumina Capture, and Torrent capture BAMs. Finally, we also required a sequence coverage of >= 30 reads for each site. This last requirement resulted in varying numbers of sites for each sequence data set, specifically, 185,848 sites for the WGS data, 13,141 sites for the exome data, 7,915 sites for the Illumina custom capture data, and 1,652 for the Ion Torrent custom capture data. At each of these sites we expected the sequence data to yield a 100% VAF value for the base called by the genotype microarray and 0% the other three possible bases at that position. Deviations from this 0% expectation were considered VAF error estimates for each sequence data set. VAFs were calculated, the 95th percentile determined and the distributions plotted using custom scripts in the statistical programming language R. An estimate of 0.35% normal VAF contamination was determined by this method and used as a normal VAF cutoff throughout this analysis.
Identifying library specific errors
To identify putative somatic variants predicted in the WGS data that could in reality be enzyme introduced errors we examined the contribution of variant supporting reads from all individual libraries. Since the de novo variant calling and identification of the putative somatic sites considered in most downstream analyses was conducted with the WGS data, introduction of library specific errors in other data sets was less of a concern. Only data from the primary tumor, where the most independent libraries were created was used for this analysis. For 198,814 putative somatic variants, read counts from the WGS data were summarized on a per-library basis using bam-readcount (https://github.com/genome/bam-readcount) with a minimum base quality cutoff of 20 and a minimum mapping quality cutoff of 30. Using a combination of all sequence data sets for the normal sample we aggressively removed any variant with good coverage >100x and VAF >0.35%. This left 90,566 variants with an increased chance of actually being somatic. Some remaining variants were predicted only in the relapse tumor sample, as this analysis is focused on library specific errors in the primary WGS data, we eliminated these leaving 83,356 candidate variants. For each of four distinct libraries (independent ligation events) in the primary tumor WGS data we searched these for putative high quality library specific somatic variants by performing a Fisher’s exact test to determine cases where variant/reference read count was dependent on source library. P values were correct for multiple testing using the Benjamini & Yekutieli (BY) method. Variants with a corrected p-value <0.05 were considered significant. We also calculated the ratio of variant supporting reads derived from a single library to the supporting reads for all libraries and required a minimum library specific ratio of 0.9. A minimum coverage of 10 across all libraries was required. Finally, if the variant had support in either the relapse or normal sample (>5 variant supporting reads) it was eliminated. This left 25 possible library specific candidates. These were for the most part observed at low levels of coverage and variant supporting read counts. Furthermore, all 25 of these were derived from the same library. There is only one outlier ‘4:9624877-9624878(C/A)’ that seems convincingly library specific, with 144 variant supporting reads of 188 in a single library compared to 2 of 39, 0 of 35, and 3 of 36. This example seems to be a mapping artifact that is more prevalent in one library compared to the others. This library had shorter overall fragment sizes. The reads contributing to this putative library specific variant align with many mismatches, all variant supporting reads map to a low complexity region (polyA tract within a L1PA4 LINE repetitive element), and they are mostly mapped discordantly with one read of each pair mapping to this region and the other mapping to another chromosome entirely. The remaining 24 candidates had much less bias towards a single library but also seemed to correspond to alignment artifacts, within unexpectedly high coverage peaks, often with low complexity regions, and often within LINE elements. We believe that many of the 25 can be explained by mapping artifacts that disproportionately affected a single library rather than enzyme introduced errors. This reveals an important potential source of false positive somatic variants that could occur systematically if the fragment size distribution of the normal and tumor libraries was significantly different. This could easily happen in the common scenario where a reference blood sample yields high quality DNA and the tumor DNA is partially degraded (or heavily degraded in the case of FFPE materials). None of the 25 putative library specific candidates were included in the platinum variant list because support from multiple libraries was a requirement for entry to the platinum list. Several of these apparent artifacts could have been eliminated simply by requiring reads to be properly mapped as pairs. An estimate of 0.23% sequence error VAF was determined by this method and used as a tumor VAF cutoff throughout this analysis.
Identifying relapse specific variants
To investigate the possible existence of variants acquired during treatment and tumor evolution that were not present (even at low levels) in the primary tumor we combined sequence data from WGS, exome and custom capture sequencing for the normal, primary tumor and relapse samples. For 198,814 putative somatic variants we obtained reference and variant supporting read counts using ‘bam-readcount’ (https://github.com/genome/bam-readcount) with a minimum base quality cutoff of 20 and mapping quality cutoff of 30. To identify candidates that might be specific to the relapse sample we first started with variants that were called in the relapse tumor WGS data but not the primary tumor. Variants were further filtered to remove those with (A) <100x coverage in normal, primary and relapse samples, (B) normal VAF >= 0.35%, (C) primary tumor VAF >= 0.23%, (D) relapse tumor VAF <= 5%, (E) the difference between primary and relapse tumor VAF was <= 5%.
Integrated analysis of SNV expression using DNA-seq and RNA-seq data
RNA-seq data for the primary and relapse tumor was analyzed in the context of discoveries made by whole genome or capture sequencing. Specifically, we compared DNA and RNA sequence measurements with respect to gene expression estimates for genes harboring somatic mutations and examined the expression evidence for each individual somatic mutation. We obtained gene expression estimates for RNA-seq data derived from the same samples as follows. RNA-seq library quality was assessed using FastQC v0.10.0 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). A subset of RNA-seq libraries (those using Nugen Ovation kits) involved use of an adapter sequence for linear amplification. These adapters were trimmed prior to alignment using flexbar v229 [24832523] with the following parameters: ‘--adapter CTTTGTGTTTGA --adapter-trim-end LEFT --nono-length-dist --threads 4 --adapter-min-overlap 7 --max-uncalled 150 --min-readlength 25’. Alignments were performed using TopHat v2.0.8 [23618408] and bowtie v2.1.0 [19261174] using default parameters. Gene and isoform expression estimates were generated by Cufflinks v2.1.1 using the parameters: ‘--num-threads 4 --max-bundle-length 10000000 --max-bundle-frags 10000000’. We also masked ribosomal and mitochondrial sequences during expression estimation using the ‘--mask-file’ parameter. To compare variant allele support in RNA-seq data to support in DNA sequence data we obtained read counts and VAF estimates from RNA-seq TopHat BAM files using bam-readcount (https://github.com/genome/bam-readcount) to interrogate all ~200,000 putative somatic mutations. We used the ‘--per-library’ option to get these counts separately for each of 8 alternative RNA-seq library strategies that were tested on the AML tumors. A grand read count was obtained by merging these counts across all libraries. Variants were considered expressed in the RNA-seq data if the variant was present in the platinum variant list, if the grand RNA-seq read count was greater than 100, if the variant occurred within an annotated protein coding (including UTRs) or RNA gene, if the FPKM value for that gene was greater than 1, and if the RNA-seq VAF was greater than 0.
Splice site mutation analysis by integration of DNA-seq and RNA-seq data
Integrated DNA-seq and RNA-seq analysis was used to identify the subset of platinum somatic variants called in the WGS tumor data with an observable effect on splicing patterns in the RNA-seq data. First all all platinum somatic variants were annotated against all transcripts of all Ensembl genes (v74) using an internal transcript variant annotation tool. Six variants were identified as occurring within either a splice site or splice region. DNA-seq and RNA-seq data were manually reviewed in IGV and splicing visualization were created using the Sashimi plugin of IGV. Two of the six candidate splicing variants were confirmed as having an observable and expected splicing effect in the RNA-seq data. In both cases the effect at the RNA level tracked with a shift in DNA VAF between the primary and relapse tumors. In each case, the splicing pattern was compared to unrelated blood tumor data that did not harbor the putative splicing mutation.
Defining a platinum variant list
The platinum SNV list was compiled in the following manner: Starting from all sites assayed on the validation array, sites were ‘quality filtered’ if they (1) had a depth of less than 20x or greater than 5000x (2) had no reads supporting the variant call, (3) had a normal VAF greater than both the tumor and relapse VAF, (4) had a VAF variance of greater than 5 in either the tumor or relapse, as calculated by comparing the read counts obtained from each of the different aligners (see methods above). P-values were then added for both the tumor and relapse by comparing the read counts to those found in the normal with a fisher's exact test, then multiple testing correction was applied, using the Benjamini & Yekutieli (BY) method. To further penalize positions at error-prone positions, the q-value was raised to 1 divided by the number of normal sample libraries in which the variant was seen. Sites with an q-value of less than 0.10 in the tumor or less than 0.05 in the relapse were retained. The deep exome data was independently tested in a similar manner, using fisher's exact test and by multiple testing correction to identify sites at which the VAF was significantly greater in tumor or relapse than in the normal sample (q-value < 0.10) and had less than 2% VAF in the normal. The two lists were merged, then a final filter removed a handful of artifactual sites inconsistent with the biology and clonal structure of the tumor - those with a tumor VAF between 5 and 35% and a relapse VAF greater than 3. Each site was then manually reviewed to retain only the highest confidence sites, which comprised 1,342 SNVs. An additional ‘gold’ set of SNVs with a lower confidence threshold was compiled by adding calls to the platinum list if they passed the above ‘quality filters’ and met one of the following criteria: 1) were completely absent in the normal, and were supported by reads from at least two libraries in either the tumor or relapse. 2) had a q-value of less than 0.15 in either the relapse or tumor validation data. This added an additional 2,615 sites that were potentially valid, but did not meet the threshold of evidence that we required for the platinum list. In most cases, deeper sequencing would be needed to verify these SNVs.
De novo variant calling with validation data and comparison to WGS variant calls
Variant calls were obtained for WGS and custom capture datasets from seven and six callers, respectively, and compiled into tables specific for primary and relapse tumors (Supplemental Table 3). Variants from all seven callers were available in seven different formats, some were VCF compliant, others were not. All variants were converted to a common format and the union for all callers was obtained. These variants were annotated with respect to which caller(s) made the call and if the variant was in the platinum list, the 200k WGS called sites, and/or tiled by our custom NimbleGen capture design. The platinum list used for this analysis was defined separately for primary and relapse tables as the variants in the platinum list that were also called by at least one caller in primary or relapse respectively. These two versions of the platinum variant list were used as true positives in (Supplemental Table 11a-b) for primary and relapse respectively for calculating PPV and sensitivity. Read counts were added via bam-readcount 0.6 for both tables from tumor and normal bams for both WGS and capture. Using R 3.1.0 a function was created to obtain required input for the R package VennDiagram 1.6.9 and Venn Diagrams were respective to primary/relapse created for each caller with and without a filter requiring a variant to have a least 50x coverage for the primary tumor and 20x coverage for the relapse in WGS and capture. The same requirement was applied to normal coverage levels. Furthermore we required the variant to be targeted by the custom nimblegen design (i.e. within the coordinates of a probe designed to capture one of the ~200k sites detected in WGS and captured for deep sequencing) (Supplemental Figure 21).
Determining the effect of depth on subclone inference by downsampling analysis
The effect of sequence depth on the ability to identify tumor subclones and correctly assign each variant to these by the SciClone was investigated by a series of in silico down sampling experiments. We first compiled merged BAM files that combined data from whole genome, exome, and custom capture experiments for both the primary and relapse tumors. To reduce file size, these BAMs were further limited to the regions surrounding the platinum variants with 1000 bp of flank on either side. All aligned reads were allowed (properly paired and otherwise) and duplicate marking was removed. The primary and relapse BAMs were then downsampled 300 times from 0.3% to 100% of the data. Each downsampling iteration involved (a) sampling the BAM file with ‘sambamba’ v0.4.7 (https://github.com/lomereiter/sambamba) using a random integer seed (b) sorting the downsampled BAM by position using Picard ‘SortSam’ (http://broadinstitute.github.io/picard/), (c) indexing the BAM using Picard ‘BuildBamIndex’, (d) marking and removing duplicates using Picard ‘MarkDuplicates’ with the ‘REMOVE_DUPLICATES=true’ option, (e) producing a depth report using ‘samtools depth’ with the ‘-q 20 -Q 20 options’ [19505943], (f) producing a duplicate report using ‘samtools flagstat’, (g) obtaining read counts and VAFs for all platinum variant positions using bam-readcount v0.7, (h) and finally running SciClone [25102416] in ‘2D mode’ to predict subclone clusters using these counts from the primary and relapse tumor VAFs. SciClone was run with these parameters ‘--do-clustering --maximum-clusters=10 --minimum-depth=1 --overlay-clusters’. To estimate the overall precision of clustering and inference of subclone architecture by SciClone, the cluster identity of the platinum variants obtained by using all the data was taken as a ‘truth’ result. A subset of variants from each of the six clusters predicted by SciClone were randomly selected such that each cluster had the same number of variants. In each downsampling iteration, SciClone clustering was attempted on this subset and the proportion of variants being assigned to the correct clone was used as a measure of precision.