MAGE-TAB Version	1.1
Investigation Title	TARGET: Acute Lymphoblastic Leukemia (ALL) Phase I/II WGS
Experimental Design	disease state design
Experimental Design Term Source REF	EFO
Experimental Factor Name
Experimental Factor Type
Experimental Factor Term Source REF
Person Last Name	NCI Office of Cancer Genomics (OCG)	NCI Center for Biomedical Informatics and Information Technology (CBIIT)	Hunger	Mullighan	Loh	Ma	Zhang	Ma	Novik
Person First Name			Stephen	Charles	Mignon	Xiaotu	Jinghui	Yussanne	Karen
Person Mid Initials			P					P	L
Person Email	ocg@mail.nih.gov	ncicbiit@mail.nih.gov	hungers@chop.edu	charles.mullighan@stjude.org	lohm@peds.ucsf.edu	xiaotu.ma@stjude.org	jinghui.zhang@stjude.org	yma@bcgsc.ca	knovik@bcgsc.ca
Person Phone	+1 301 451 8027	+1 888 478 4423		+1 901 595 3387	+1 415 476 3831	+1 901 595 3774	+1 901 595 6829	+1 604 707 5800 Ext 6082	+1 604 707 8000 Ext 7983
Person Fax	+1 301 480 4368			+1 901 595 5947		+1 901 595 7100	+1 901 595 7100	+1 604 876 3561	+1 604 675 8178
Person Address	31 Center Dr, Rm 10A07, Bethesda, MD 20892	9609 Medical Center Dr, Rockville, MD 20850	3401 Civic Center Blvd Philadelphia, PA 19104	262 Danny Thomas Place, Mail Stop 342, Memphis TN 38105	Box 0106, UCSF	262 Danny Thomas Place, Memphis, TN 38105	262 Danny Thomas Place, Memphis, TN 38105	Suite 100-570 West 7th Ave, Vancouver, BC Canada V5Z 4S6	675 West 10th Ave Vancouver, BC Canada V5Z 1L3
Person Affiliation	National Cancer Institute	National Cancer Institute	Children's Hospital of Philadelphia	St Jude Children's Research Hospital	UCSF Benioff Children's Hospital	St Jude Children's Research Hospital	St Jude Children's Research Hospital	BC Cancer Agency Canada's Michael Smith Genome Sciences Centre	BC Cancer Agency Canada's Michael Smith Genome Sciences Centre
Person Roles	funder;investigator	data coder;curator	investigator	investigator	investigator	investigator;data analyst;submitter	investigator;data analyst	investigator;data analyst;submitter	investigator
Person Roles Term Source REF	EFO;EFO	EFO;EFO	EFO	EFO	EFO	EFO;EFO;EFO	EFO;EFO	EFO;EFO;EFO	EFO
Quality Control Type
Quality Control Term Source REF
Replicate Type
Replicate Term Source REF
Normalization Type
Normalization Term Source REF
Date of Experiment
Public Release Date
PubMed ID
Publication DOI
Publication Author List
Publication Title
Publication Status
Publication Status Term Source REF
Experiment Description	"There are 189 fully characterized patient cases that make up the pilot phase (Phase I) of the TARGET ALL dataset, each with gene expression, tumor and paired normal copy number analyses, and at least one type of sequencing (Sanger and/or next-generation) data available. There are 230 cases with partial molecular characterization and/or sequencing data available, to include whole genome sequencing, mRNA-seq and/or kinome sequencing; all of which can be sorted via the Case Matrix on the TARGET Data Matrix. Please visit the TARGET website listed above for additional information on this and other TARGET genomics projects. Please see the TARGET Publication Guidelines at the OCG website for updated details on sharing of any TARGET substudy data. There are 175 fully characterized patient cases with relapsed precursor B-cell ALL (all tumor/normal pairs, 85 with relapse sample as well) that will make up Phase II of the TARGET ALL dataset, each with gene expression, tumor and paired normal copy number analyses, and comprehensive next-generation sequencing to include whole genome sequencing, mRNA-seq and miRNA-seq. Subsets of these cases will also have methylation and/or whole exome sequencing data available as well. There are additionally a large number of cases with partial molecular characterization making this a large and informative genomic dataset. All cases can be sorted according to data type via the Case Matrix on the TARGET Data Matrix. Please visit the TARGET website listed above for additional information on this and other TARGET genomics projects. Please see the TARGET Publication Guidelines at the OCG website for updated details on sharing of any TARGET substudy data."
Protocol Name	nationwidechildrens.org:Protocol:DNA-Extraction-Qiagen-QIAamp:01	bcgsc.ca:Protocol:WGS-LibraryPrep-Illumina:01	completegenomics.com:Protocol:WGS-LibraryPrep-CGI:01	bcgsc.ca:Protocol:WGS-Sequence-Illumina-GAII:01	bcgsc.ca:Protocol:WGS-Sequence-Illumina-GAIIx:01	bcgsc.ca:Protocol:WGS-Sequence-Illumina-HiSeq2000:01	bcgsc.ca:Protocol:WGS-Sequence-Illumina-HiSeq2500:01	completegenomics.com:Protocol:WGS-Sequence-CGI-CGI:01	bcgsc.ca:Protocol:WGS-BaseCall-Illumina:01	completegenomics.com:Protocol:WGS-BaseCall-CGI:01	bcgsc.ca:Protocol:WGS-ReadAlign-BWA-Picard:01	completegenomics.com:Protocol:WGS-ReadAlign-CGI:01	bcgsc.ca:Protocol:WGS-VariantCall-Strelka:01	bcgsc.ca:Protocol:WGS-VariantCall:01	bcgsc.ca:Protocol:WGS-Mpileup-Vcf2Tab:01	bcgsc.ca:Protocol:WGS-StructVariant-GenomeValidator:01	bcgsc.ca:Protocol:WGS-CombineSomaticSnvs:01	bcgsc.ca:Protocol:WGS-Strelka-Vcf2Tab-Snv:01	bcgsc.ca:Protocol:WGS-VariantCall-Mpileup-MutationSeq:01	bcgsc.ca:Protocol:WGS-Strelka-Vcf2Tab-Indel:01	bcgsc.ca:Protocol:WGS-StructVariant-DELLY:01	bcgsc.ca:Protocol:WGS-StructVariant-ABySS:01	bcgsc.ca:Protocol:WGS-VariantCall-Mpileup:01	completegenomics.com:Protocol:WGS-HigherLevelSummary-CGI:01	completegenomics.com:Protocol:WGS-CnvSegment-CGI:01	completegenomics.com:Protocol:WGS-Circos-CGI:01	completegenomics.com:Protocol:WGS-Junction-CGI:01	completegenomics.com:Protocol:WGS-VariantCall-CGI:01	completegenomics.com:Protocol:WGS-Vcf2Maf-CGI:01	completegenomics.com:Protocol:WGS-FilterSomatic-CGI:01	stjude.org:Protocol:WGS-StructVariant-CGI:01	stjude.org:Protocol:WGS-CnvSegment-CONCERTING-CGI:01	stjude.org:Protocol:WGS-VariantCall-CGI:01
Protocol Type	nucleic acid extraction protocol	nucleic acid library construction protocol	nucleic acid library construction protocol	nucleic acid sequencing protocol	nucleic acid sequencing protocol	nucleic acid sequencing protocol	nucleic acid sequencing protocol	nucleic acid sequencing protocol	data transformation protocol	data transformation protocol	data transformation protocol	data transformation protocol	data transformation protocol	data transformation protocol	data transformation protocol	data transformation protocol	data transformation protocol	data transformation protocol	data transformation protocol	data transformation protocol	data transformation protocol	data transformation protocol	data transformation protocol	data transformation protocol	data transformation protocol	data transformation protocol	data transformation protocol	data transformation protocol	data transformation protocol	data transformation protocol	data transformation protocol	data transformation protocol	data transformation protocol
Protocol Term Source REF	EFO	EFO	EFO	EFO	EFO	EFO	EFO	EFO	EFO	EFO	EFO	EFO	EFO	EFO	EFO	EFO	EFO	EFO	EFO	EFO	EFO	EFO	EFO	EFO	EFO	EFO	EFO	EFO	EFO	EFO	EFO	EFO	EFO
Protocol Description	"Genomic DNA was prepared using the Qiagen QIAamp DNA Mini Kit (Qiagen, Valencia, CA, USA). Please see https://ocg.cancer.gov/programs/target/target-methods for full extraction protocol details."	"Genomic DNA for construction of whole genome shotgun sequencing (WGSS) libraries was prepared from the same biopsy material using the Qiagen AllPrep DNA/RNA Mini Kit (Qiagen, Valencia, CA, USA). DNA quality was assessed by spectrophotometry (260/280 and 260/230) and gel electrophoresis before library construction. Depending on the availability of DNA, between 2 and 10ug was used in WGSS library construction. Briefly, DNA was sheared for 10 min using a Sonic Dismembrator 550 with a power setting of "7" in pulses of 30 seconds interspersed with 30 seconds of cooling (Cup Horn, Fisher Scientific, Ottawa, Ontario, Canada), and analyzed on 8% PAGE gels. The 200-300bp DNA fraction was excised and eluted from the gel slice overnight at 4 degrees Celsius in 300 ul of elution buffer (5:1, LoTE buffer (3 mM Tris-HCl, pH 7.5, 0.2 mM EDTA)-7.5 M ammonium acetate), and was purified using a Spin-X Filter Tube (Fisher Scientific), and by ethanol precipitation. WGSS libraries were prepared using a modified paired-end protocol supplied by Illumina Inc. (Illumina, Hayward, USA). This involved DNA end-repair and formation of 3' A overhangs using Klenow fragment (3' to 5' exo minus) and ligation to Illumina PE adapters (with 5' overhangs). Adapter-ligated products were purified on Qiaquick spin columns (Qiagen, Valencia, CA, USA) and PCR-amplified using Phusion DNA polymerase (NEB, Ipswich, MA, USA) and 10 cycles with the PE primer 1.0 and 2.0 (Illumina). PCR products of the desired size range were purified from adapter ligation artifacts using 8% PAGE gels. 
DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay (Agilent, Santa Clara CA, USA) and Nanodrop 7500 spectrophotometer (Nanodrop, Wilmington, DE, USA) and DNA was subsequently diluted to 10nM. The final concentration was confirmed using a Quant-iT dsDNA HS assay kit and Qubit fluorometer (Invitrogen, Carlsbad, CA, USA)"	"Complete Genomics Inc. standard protocol, please see CGI READMEs for details."					"Complete Genomics Inc. standard protocol, please see CGI READMEs for details."		"Complete Genomics Inc. standard protocol, please see CGI READMEs for details."	"Illumina paired-end whole genome sequencing reads were aligned to the hg19 reference using BWA version 0.5.7. This reference contains chromosomes 1-22, X, Y, MT, 20 unlocalized scaffolds and 39 unplaced scaffolds. Multiple lanes of sequences were merged and duplicated reads were marked with Picard Tools."	"Complete Genomics Inc. standard protocol, please see CGI READMEs for details."	"To analyze compartment specific SNVs, samples were analyzed pair wise with the default settings of Strelka v0.4.7 (Saunders et al., 2012). Primary tumor samples and relapse/met were compared against the germline sample. In the absence of a germline sample, the relapse/met samples were compared against the primary tumor sample."			"An in-house tool, Genome Validator was used to determine compartment specific events. The structural variant calls for each patient from matched genome and RNA-seq samples were concatenated together and screened for each patient against matching tumour and germline alignments. This resulted in compartment specific structural variant events and putative somatic calls. The events were further filtered against a compendium of recurrent structural variants to remove recurrent false positives."			"SNVs were analyzed pair wise with SAMtools mpileup v.0.1.17 (Li et al., 2009). Each chromosome was analyzed separately using the -C50-DSBuf parameters. 
Before merging the resulting vcf files, they were filtered to remove all indels and low quality SNVs by using samtools varFilter (with default parameters) as well as to remove SNVs with a QUAL score of less than 20 (vcf column 6). The SNVs in the resulting vcf files were further filtered and scored using mutationSeq v1.0.2 and annotated with gene annotations from ensembl v66 using snpEff (Cingolani et al., 2012b) and the dbSNP v137 and cosmic 64 db membership using snpSift (Cingolani et al., 2012a)."		"DELLY: structural variant discovery by integrated paired-end and split-read analysis. PMID:22962449"	"Structural variant detection was performed using ABySS (v1.3.2). Genome (WGS) libraries were assembled in single end mode using k-mer values of k24 and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the Trans-ABySS analysis pipeline (Robertson et al., 2010). Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions. Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments. Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs. The events were then screened against dbSNP and other variation databases to identify putative novel events. 
To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives."	"SNVs were analyzed with SAMtools mpileup v.0.1.17 (Li et al., 2009) either on single or paired libraries. Each chromosome was analyzed separately using the -C50-DSBuf parameters. The resulting vcf files were merged and filtered to remove low quality SNVs by using samtools varFilter (with default parameters) as well as to remove SNVs with a QUAL score of less than 20 (vcf column 6). Finally, SNVs were annotated with gene annotations from ensembl v66 using snpEff (Cingolani et al., 2012b) and the dbSNP v137 db membership assigned using snpSift (Cingolani et al., 2012a)."	"Complete Genomics Inc. standard protocol, please see CGI READMEs for details."	"Complete Genomics Inc. standard protocol, please see CGI READMEs for details."	"Complete Genomics Inc. standard protocol, please see CGI READMEs for details."	"Complete Genomics Inc. standard protocol, please see CGI READMEs for details."	"Complete Genomics Inc. standard protocol, please see CGI READMEs for details."	"Complete Genomics Inc. standard protocol, please see CGI READMEs for details."	"Complete Genomics Inc. standard protocol, please see CGI READMEs for details."	"MAF files containing structural variants identified by CGI were downloaded from the TARGET Data Matrix and filtered to remove germline rearrangements and low confidence somatic calls. 
Germline variant databases used for filtering included the Database of Genomic Variants (DGV), dbSNP, PCGP, and also recurrent germline rearrangements from the downloaded MAF files. Rearrangements where both breakpoints fall into gap regions in the human genome (hg19) were also excluded. To filter out low confidence rearrangements, a BLAT search was performed on the assembled sequence for each rearrangement, and those that could be fully mapped (>90% similarity to the reference genome) were excluded. We further required each variant to have an assembled contig length of at least 10 bp on each breakpoint. Since copy number alterations were highly coupled with rearrangement events, to avoid over-filtering we also integrated the copy number alterations into the SV analysis. Briefly, breakpoints from CNV analysis were matched to those detected in SVs, using a window size of 5kb. Those rearrangements with possible CNV support were rescued after manual curation. Of the 1,011,810 putative CGI SVs, 3,265 passed these filters. Experimental verification using 14 CGI diagnosis-remission-relapse trio samples from a previous publication6 showed a validation rate of 78% as 79 out of the 101 SVs were experimentally verified by targeted capture sequencing."	"We adapted the CONSERTING algorithm to detect copy number alterations from CGI whole genome sequencing data. Briefly, the germline single nucleotide polymorphisms (SNPs) reported by CGI in the MAF files were extracted, with recurrent paralogous variants (identified from the 625 germline whole genome sequencing data generated by the St Jude Pediatric Cancer Genome Project) removed. The read counts of the SNPs were then used to construct a coverage file by taking the mean of all SNPs within a sliding window of 100bp. The coverage difference between tumor and normal samples was then used as the input for CONSERTING. 
To detect loss-of-heterozygosity (LOH), we used SNPs that have variant allele fraction (VAF) in normal within an interval of (0.4, 0.6) and have >15X coverage in both tumor and normal samples. For these SNPs, the allelic imbalance (AI), defined as |Tumor_VAF-0.5|, was used as the input for CONSERTING to detect LOH. Regions with concomitant copy number changes (log ratio>0.2 or log ratio<-0.2) and/or LOH (AI>0.1) were subjected to manual review. Finally, regions with length <2Mb that passed manual review were considered to be focal changes and included in the GRIN analysis to determine the significance of the somatic alterations. We compared the MYCN amplification status derived from CONSERTING with that of the original CGI analysis to evaluate the accuracy of the recalled CNVs. A subset of 32 NBL tumors carried a clinically-validated high-amplitude amplification of MYCN, which is a known oncogenic driver in pediatric neuroblastoma. While CGI's HMM CNA model reported MYCN amplifications in only 15 of these 32 tumors, CONSERTING successfully identified high-amplitude amplifications in 31 tumors. For the NBL (PASJZC) with a negative finding of MYCN amplification by CONSERTING, a follow-up review of the initial diagnosis data indicated that this discrepancy could be explained by tumor heterogeneity and tumor material sampling bias. Moreover, two additional subclonal MYCN amplification events were predicted in the remaining tumor samples (PARACM, PATHVK). These results demonstrate that CONSERTING achieved higher sensitivity over the original CGI analysis. For osteosarcoma, CNA analysis was limited to the TP53 locus but not the other regions due to the excessive number of rearrangements caused by chromothripsis in this cancer histotype."	"Putative somatic SNVs and indels were extracted from MAF files downloaded from the TARGET Data Matrix and run through a 3-step filter to remove germline, low-confidence and paralog variations. 
In the first step, the following data sets were used for filtering germline variants: 1) NHLBI Exome Sequencing Project (http://evs.gs.washington.edu/EVS/); 2) dbSNP build 132 (https://www.ncbi.nlm.nih.gov/projects/SNP/); 3) St. Jude/Washington University Pediatric Cancer Genome Project (PCGP), and 4) germline variants present in >= 5 cases in TARGET CGI WGS data. In the second step, a variant will be considered low-confidence unless it meets the following criteria: 1) at least 3 more reads support the mutant allele in the tumor sample than in the normal sample; 2) the mutant read count in tumor is significantly higher than in the matched normal (P<0.01 by Fisher's Exact test); and 3) mutant allele fraction in normal is below 0.05. In the third step, we ran a BLAT search3 using a template sequence consisting of the mutant allele and its 20-bp flanking region to determine whether or not the mutation was uniquely mapped. Because pathogenic germline variants may overlap with oncogenic somatic mutations, we implemented a "rescue" pipeline to avoid over-filtering. All putative somatic variants were first re-annotated using a customized AnnoVar pipeline (Edmonson et al, unpublished) and then classified using Medal_Ceremony. Variants assigned "Gold" by Medal_Ceremony are those matching known mutation hotspots, or truncation mutations in tumor suppressor genes. These were "rescued" and merged with the filtered variants for each gene and the results further curated using our visualization program ProteinPaint (https://pecan.stjude.org/proteinpaint/study/pan-target). The filtering process reduced the original 51 million SNVs and 38 million indels from the CGI MAF files to a set of ~700,000 SNVs and 58,000 indels. Of these, 9,397 SNVs and 1,000 indels are in protein coding regions. We tested the filter on 14 diagnosis-remission-relapse trio samples that were analyzed by both CGI and WES. 
Of the 661 CGI SNVs passing the filter, 580 (88%) were verified by WES while the indel verification rate is 67% (48/72). Notably, all 53 variants (45 SNVs and 8 indels) on the driver genes identified in this study were cross-validated by WES."
Protocol Parameters									Software Versions																								
Protocol Hardware				Illumina Genome Analyzer II	Illumina Genome Analyzer IIx	Illumina HiSeq 2000	Illumina HiSeq 2500	Complete Genomics																									
Protocol Software									Bustard;Illumina RTA																								
Protocol Contact
SDRF File	TARGET_ALL_WGS_Phase1+2_20191114.sdrf.txt
Term Source Name	NCBITaxon	NCIt	MO	EFO	OBI
Term Source File	http://www.ncbi.nlm.nih.gov/taxonomy	http://ncit.nci.nih.gov/	http://mged.sourceforge.net/ontologies/MGEDontology.php	http://www.ebi.ac.uk/efo	http://purl.obolibrary.org/obo/obi
Term Source Version
Comment[SRA_STUDY]	SRP011998; SRP011999
Comment[BioProject]	PRJNA89519; PRJNA89529
Comment[dbGaP Study]	phs000463; phs000464
