MAGE-TAB Version	1.1
Investigation Title	TARGET: Kidney - Wilms Tumor (WT) mRNA-seq
Experimental Design	disease state design	transcript identification design	is expressed design
Experimental Design Term Source REF	EFO	EFO	EFO
Experimental Factor Name
Experimental Factor Type
Experimental Factor Term Source REF
Person Last Name	NCI Office of Cancer Genomics (OCG)	NCI Center for Biomedical Informatics and Information Technology (CBIIT)	Gadd	Perlman	Ma	Novik
Person First Name			Samantha	Elizabeth	Yussanne	Karen
Person Mid Initials			L	J	P	L
Person Email	ocg@mail.nih.gov	ncicbiit@mail.nih.gov	sgadd@luriechildrens.org	eperlman@luriechildrens.org	yma@bcgsc.ca	knovik@bcgsc.ca
Person Phone	+1 301 451 8027	+1 888 478 4423	+1 773 755 6392	+1 312 227 3967	+1 604 707 5800 Ext 6082	+1 604 707 8000 Ext 7983
Person Fax	+1 301 480 4368				+1 604 876 3561	+1 604 675 8178
Person Address	31 Center Dr, Rm 10A07, Bethesda, MD 20892	9609 Medical Center Dr, Rockville, MD 20850	2430 N Halsted St, Room C366 Chicago, IL 60614	225 E Chicago Ave, Chicago, IL 60611	Suite 100-570 West 7th Ave, Vancouver, BC Canada V5Z 4S6	675 West 10th Ave Vancouver, BC Canada V5Z 1L3
Person Affiliation	National Cancer Institute	National Cancer Institute	Lurie Children's Hospital of Chicago Research Center	Ann & Robert H. Lurie Children's Hospital of Chicago	BC Cancer Agency Canada's Michael Smith Genome Sciences Centre	BC Cancer Agency Canada's Michael Smith Genome Sciences Centre
Person Roles	funder;investigator	data coder;curator	investigator;data analyst	investigator	investigator;data analyst;submitter	investigator
Person Roles Term Source REF	EFO;EFO	EFO;EFO	EFO;EFO	EFO	EFO;EFO;EFO	EFO
Quality Control Type
Quality Control Term Source REF
Replicate Type
Replicate Term Source REF
Normalization Type
Normalization Term Source REF
Date of Experiment
Public Release Date
PubMed ID
Publication DOI
Publication Author List
Publication Title
Publication Status
Publication Status Term Source REF
Experiment Description	"There are 130 fully characterized patient cases with high risk Wilms tumor (all tumor/normal pairs; 8 with additional samples for analysis = 3 with tumor adjacent normal, 5 with relapse sample) that will make up the TARGET WT dataset. Each case will have gene expression, tumor and paired normal copy number analyses, methylation and whole genome sequencing; a subset of WT cases has mRNA-seq, miRNA-seq, and whole exome sequencing data available as well. All cases can be sorted according to data type via the Case Matrix on the TARGET Data Matrix. Please visit the TARGET website for additional information on this and other TARGET genomics projects. Please see the TARGET Publication Guidelines at the OCG website for updated details on sharing of any TARGET substudy data."
Protocol Name	nationwidechildrens.org:Protocol:RNA-Extraction-Qiagen-AllPrep:01	bcgsc.ca:Protocol:mRNAseq-LibraryPrep-Illumina-StrandSpecific:01	bcgsc.ca:Protocol:mRNAseq-Sequence-Illumina-HiSeq2000:01	bcgsc.ca:Protocol:mRNAseq-BaseCall-Illumina:01	bcgsc.ca:Protocol:mRNAseq-ReadAlign-BWA-Picard:01	bcgsc.ca:Protocol:mRNAseq-StructVariant-TransABySS:01	bcgsc.ca:Protocol:mRNAseq-Expression:01	bcgsc.ca:Protocol:mRNAseq-SNVMix2-Vcf2Maf:01	bcgsc.ca:Protocol:mRNAseq-VariantCall-SNVMix2:01
Protocol Type	nucleic acid extraction protocol	nucleic acid library construction protocol	nucleic acid sequencing protocol	data transformation protocol	data transformation protocol	data transformation protocol	normalization data transformation protocol	data transformation protocol	data transformation protocol
Protocol Term Source REF	EFO	EFO	EFO	EFO	EFO	EFO	EFO	EFO	EFO
Protocol Description	"RNA was prepared using the Qiagen AllPrep DNA/RNA Mini Kit (Qiagen, Valencia, CA, USA). Please see https://ocg.cancer.gov/programs/target/target-methods for full extraction protocol details."	"Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10uL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA). First-strand cDNA was synthesized from the purified polyA+ RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5uM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity. The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a "Duty cycle" of 20% and "Intensity" of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). 
Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subjected to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3' A-tailing by Klenow fragment (3' to 5' exo minus). After purification using Ampure XP SPRI beads, PicoGreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37 degC for 30 min followed by deactivation at 95 degC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina's PE primer set, with cycle condition 98 degC 30sec followed by 10-13 cycles of 98 degC 10 sec, 65 degC 30 sec and 72 degC 30 sec, and then 72 degC 5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing."			"Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. 
This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which were downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA was run using default parameters, except that the option (-s) was included to disable Smith-Waterman alignment. Finally, reads failing the Illumina chastity filter were flagged with a custom script, and duplicated reads were flagged with Picard Tools."	"Structural variant detection was performed using Trans-ABySS (v1.4.6). For RNA-seq assembly, alternate k-mers from k50-k96 were performed using positive strand and ambiguous strand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The meta-assemblies were then used as input to the Trans-ABySS analysis pipeline (Robertson et al., 2010). Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions. Evidence for the alignments was provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments. Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs. 
The events were then screened against dbSNP and other variation databases to identify putative novel events. To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives."	"Gene Coverage Analysis Protocol Name: bcgsc.ca:RNA_Sequencing:IlluminaGA_RNASeq:01 Link:www.bcgsc.ca/downloads/genomes/Homo_sapiens/hg19/1000genomes/bwa_ind/genome/ Data Level: 3 Data File: *.gene.quantification.txt The gene coverage analysis was performed with our internal analysis pipeline version 1.1 using "composite" gene annotations from the hg19 (GRCh37-lite) version of the TCGA GAF v3.0. These composite gene models were created in June 2011 by UNC (with assistance from UCSC) based on the annotations in the "UCSC genes" database. Each composite gene annotation was generated by collapsing all transcripts of that gene into a single model such that exonic bases in a composite gene model were the union of exonic bases from all known transcripts of the gene. Thus, the locations of the exonic boundaries used for the gene coverage analysis were not based on a single canonical transcript for each gene. Consequently, the exonic boundaries in a composite gene model may not correspond to the actual boundaries of the expressed transcripts. For simplicity, throughout this document and in the gene coverage results files, a composite gene model is simply referred to as a gene, and it is associated with the id of the gene whose transcripts contributed to that composite model. 
To generate the raw read counts, we first counted the number of bases of each read that were inside exonic regions in a gene, and then divided this total base count by the read length. Thus our values for the raw number of reads were not whole numbers (i.e. if the entire 50bp read mapped to an exon, we would add 50 to the total base count, which would ultimately contribute 1 to the raw read count. However, if only 25 bases of the read's alignment fell within an exon's boundaries, the total base count would be incremented by 25, which would ultimately contribute 0.5 to the raw read count). In order to comply with the file format specification enforced by the DCC validator, our raw read counts are rounded to the closest whole number. A gene's raw read count is the sum of raw read counts for exons belonging to the gene. Gene coverage is its raw read count divided by the sum of its exon lengths. RPKM is calculated using the formula: (number of reads mapped to all exons in a gene x 1,000,000,000)/(NORM_TOTAL x sum of the lengths of all exons in the gene ) [Note: NORM_TOTAL = the total number of reads that are mapped to all exons from the composite gene models. (i.e. sum of the fractional read count for all exons)] If a read alignment contained a deletion or a large gap, the read did not contribute coverage inside the region spanned by the deletion/gap. Each of the paired end reads was counted separately. We excluded reads from pairs that failed Illumina's Chastity filter, as well as reads with mapping quality < 10. .gene.quantification.txt A tab-delimited text file containing the following fields: - gene = Gene ID from GAF (version 3.0). The ID follows the nomenclature '<HUGO gene symbol>|<Entrez ID>'. If the combination of the HUGO symbol and the Entrez ID is not unique, an additional 'NofM' descriptor is added. An ID with '?' indicates that the HUGO gene symbol or Entrez ID is not available. e.g. 
U80769|?; TRNA_Pseudo|?|8of100 - raw_counts = Sum of fraction of reads (rounded off to nearest integer - restricted by the RNA-seq validator) that mapped to collapsed transcripts representing a specific gene. Reads from pairs that did not pass Illumina’s Chastity filter or with mapping quality less than 10, i.e. reads that did not map uniquely, were excluded from calculation. - median_length_normalized = Average coverage over all exons in the collapsed transcripts i.e. sum of the coverage depth at each base in all exons divided by the sum of the exon lengths - RPKM = Reads per kilobase of exon per million. Calculation described in detail above. Exon Coverage Analysis Protocol Name: bcgsc.ca:RNA_Sequencing:IlluminaGA_RNASeq:01 Link:www.bcgsc.ca/downloads/genomes/Homo_sapiens/hg19/1000genomes/bwa_ind/genome/ Data Level: 3 Data File: *.exon.quantification.txt The exon coverage analysis was performed with our internal analysis pipeline version 1.1 using "composite" gene annotations from the hg19 (GRCh37-lite) version of the TCGA GAF v3.0. These composite gene models were created in June 2011 by UNC (with assistance from UCSC) based on the annotations in the "UCSC genes" database. Similarly to the gene coverage analysis, all transcripts of a given gene were collapsed into a single model such that exonic bases in a composite gene model were the union of exonic bases from all known transcripts of the gene. For simplicity, throughout this document and in the exon coverage results files, the collapsed exons are simply referred to as an exon. To generate the raw read counts, we first counted the number of bases of each read that were inside an exonic region, and then divided this total base count by the read length. Thus our values for the raw number of reads were not whole numbers (i.e. if the entire 50bp read mapped to an exon, we would add 50 to the total base count, which would ultimately contribute 1 to the raw read count. 
However, if only 25 bases of the read's alignment fell within an exon's boundaries, the total base count would be incremented by 25, which would ultimately contribute 0.5 to the raw read count). In order to comply with the file format specification enforced by the DCC validator, our raw read counts are rounded to the closest whole number. Exon coverage is the raw read count of an exon divided by its length. RPKM is calculated using the formula (number of reads (fractional) mapped to an exon x 1,000,000,000)/(NORM_TOTAL x length of an exon) [Note: NORM_TOTAL = the total number of reads (fractional) that mapped to exons, excluding those in the mitochondrial chromosome] If a read alignment contained a deletion or a large gap, the read did not contribute coverage inside the region spanned by the deletion/gap. Each of the paired end reads was counted separately. We excluded reads from pairs that failed Illumina's Chastity filter, as well as reads with mapping quality < 10. .exon.quantification.txt A tab-delimited text file containing the following fields: - exon = Exon coordinates according to GAF (version 3.0) with the nomenclature, chr<chromosome number>:<start coordinate>-<end coordinate>:<strand>. '.' in the <strand> indicates that there was no strand information available. e.g. chr10:120810487-120810613:. - raw_counts = Sum of fraction of reads (rounded off to nearest integer - restricted by the RNA-seq validator) that mapped to an exon. Reads from pairs that did not pass Illumina’s Chastity filter or with mapping quality less than 10 were excluded from calculation. - median_length_normalized = Average coverage over the exon i.e. the sum of the coverage depth at each base in an exon divided by the length of the exon. - RPKM = Reads per kilobase of exon per million"		"After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. 
Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30. The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improperly paired; 8) fewer than 2 properly paired supporting reads. SNVs located in exons equal to or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that were only supported by reads spanning an exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64)."
Protocol Parameters				Software Versions					
Protocol Hardware			Illumina HiSeq 2000						
Protocol Software				Illumina RTA					
Protocol Contact
SDRF File	TARGET_WT_mRNA-seq_20170609.sdrf.txt
Term Source Name	NCBITaxon	NCIt	MO	EFO	OBI
Term Source File	http://www.ncbi.nlm.nih.gov/taxonomy	http://ncit.nci.nih.gov/	http://mged.sourceforge.net/ontologies/MGEDontology.php	http://www.ebi.ac.uk/efo	http://purl.obolibrary.org/obo/obi
Term Source Version
Comment[SRA_STUDY]	SRP012006
Comment[BioProject]	PRJNA89521
Comment[dbGaP Study]	phs000471
