﻿#########################

V1_BWAtoTranscriptome, V1_RNASeqQuantification: UNC V1 RNA-Seq Workflow - BWA Alignment to Transcriptome

Date: 20101108

Authors:

* Sara Grimm <sacheek@med.unc.edu>
* Brian O'Connor <brianoc@email.unc.edu>

Versions:

This analysis was carried out using the SeqWare Pipeline project, version
0.7.0. The workflow was "RNASeqAlignmentBWA" version 0.7.x. UNC
provides all our analysis software through this open source project. Users can
download this software to run the identical RNA-Seq analysis described in the
steps below. See the project website at http://seqware.sf.net for more
information. The UNCIDs provided in file names are identifiers unique to UNC
and can be used to provide data/analysis provenance tracking.

Annotations:

The Generic Annotation File (GAF) that provides all of our annotations for
genes, exons, etc can be found at
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/other/GAF/GAF_bundle/outputs/TCGA.Sept2010.09202010.gaf

Conventions:

Please note our spljxn.quantification.txt, exon.quantification.txt, and the
GAF file above use the convention of chr:smaller_int-larger_int:+ for plus
strand features and chr:larger_int-smaller_int:- for negative strand features.
Carefully examine future versions of these annotation and quantification files
since this convension is subject to change.

Analysis Process:

Step 1:  Construct database of reference transcript sequences, composite gene
models, composite exons, and splice junctions.

The reference transcript set is based on the hg19 UCSC Gene standard track
(December 2009 version), which is a publicly-available sequence set that can
be downloaded from the UCSC Genome Browser (http://genome.ucsc.edu/).  From
this set, only sequences mapped to canonical human chromosomes (chr1-22, X, Y,
and M) were retained.  For each selected transcript, the nucleotide sequence,
association to Entrez/LocusLink genes (if known), and CDS range (if known)
were extracted from the UCSC database tables.  Additionally, a pairwise
alignment of each transcript against the hg19 genome was provided by Mark
Diehkans from UCSC.  The reference transcript set contains 73,671 human
transcript sequences representing 20,532 human genes.  67,434 transcript
sequences are given gene assignments according to the UCSC database tables;
the remaining 6,237 sequences in the reference transcript set are not
currently associated with an Entrez/LocusLink gene.  No transcript is
associated with more than one Entrez/LocusLink gene.  For each gene
represented by one or more transcripts in this set, a composite gene model was
generated by merging all overlapping exons (as defined by the genomic mapping)
from each associated reference transcript.  Thus, each composite gene model is
essentially the union of all associated reference transcripts.  Each gene
model is linked to a specific Entrez/LocusLink gene identifier.  A library of
composite exons was then established by collecting each contiguous genomic
segment from each gene model.  Note that because the gene model construction
process involves merging overlapping exons from different transcripts, the
resulting composite exons may or may not be observed in reference transcript
sequences.  This library contains 239,886 unique composite exons, each of
which is defined by its genomic range.  A library of all known splice
junctions was established by cataloging all junctions observed in the
reference transcript sequences.  Inclusion of a splice junction in this
library indicates that it has been observed in (at least) one reference
transcript.  However, not all splice junctions in this library will be
observed in the composite gene models.  This library contains 249,775 unique
splice junctions, where each junction is defined by the last position of exon
N and the first position of exon N+1.

Step 2:  Prepare raw sequencing reads for alignment.

The Illumina Casava 1.8.1 sofatware is used to process raw images and output 
fastq files for each sample.  CASAVA also performs de-multiplexing of pooled 
samples when appropriate.  In some cases, the insert size of the sequenced 
fragment is shorter than the read length, resulting in part or all of the adapter 
(primer) sequence being incorporated into the output read. Reads are further 
processed to trim any adapter segments using the fastq-mcf software 
(http://code.google.com/p/ea-utils/). The following arguments were used for 
fastq-mcf: -l 25 -f -P 33 -m 10 -k 0 -q 0.  This instructs the trimmer to 
require at least 10 bases prior to matching, and remove read pairs where either
member of the pair is less 25 bases after trimming.  Quality based trimming
is disabled.  In addition, only adaptors that are found in at least 250 of 100,000 
randomly sampled reads will be considered during trimming.

Step 3:  Align sequencing reads against reference transcript database.

A pre-processed lane of sequencing reads is then submitted to the BWA
algorithm (http://bio-bwa.sourceforge.net/) for alignment against the
reference transcript database using default parameters.  The resulting SAM
file is converted to BAM format using Picard tools
(http://picard.sourceforge.net/).

Step 4:  Calculate quantification at the transcript level.

The reads aligned to the reference transcript sequences are used to determine
transcript level quantification.  Three quantification values are reported:
raw read counts, coverage, and RPKM.  Raw read counts is simply the number of
reads aligned to a given reference transcript.  Raw read counts are normalized
by transcript length to give coverage.  For a given TranscriptX, coverage is
calculated by:  total bases aligned to TranscriptX / length of TranscriptX.
The “total bases aligned” will typically be equal to aligned reads × read
length.  However, in cases where reads have been trimmed due to adapter
contamination, this value may be slightly lower.  Raw read counts are
normalized by both transcript length and overall lane yield to give RPKM.  For
a given TranscriptX, RPKM is calculated by:  (raw read counts × 10^9) / (total
reads × length of TranscriptX).  For this calculation, “total reads” is the
lane yield after removing poor quality reads.

Step 5:  Calculate quantification at the gene level.

Quantification at the gene level is determined by collapsing the transcript
level quantifications onto their associated gene loci.  Raw read counts,
coverage, and RPKM are reported for each gene.  Raw read counts for a given
GeneX is the sum of reads aligned to all transcripts associated with that
gene.  Coverage for a given GeneX is determined by:  sum of bases aligned to
all transcripts associated with GeneX / length of GeneX.  RPKM for a given
GeneX is calculated by:  (raw read counts × 10^9) / (total reads × length of
GeneX).  “Total reads” is the lane yield after removing poor quality reads.
In both the coverage and RPKM calculations, the length of GeneX is defined as
the median length of all transcripts associated with GeneX.

Step 6:  Calculate quantification at the exon level.

In order to carry quantification for sequence features defined by their
genomic locus (i.e. exons and splice junctions), the aligned reads must first
be translated from transcript coordinates to genomic coordinates.  The
pre-established pairwise mapping between each reference transcript and the
hg19 genome allows for a straightforward conversion between these two
coordinate systems.  These converted aligned reads are then used to determine
exon level quantification.  Raw base counts, coverage, and RPKM are reported
for each entry in the library of composite exons.  The raw base counts for a
given ExonX is the total number of bases aligned to that genomic segment.  Raw
base counts are used instead of raw read counts because in many cases only a
portion of a read will align to a given exon.  For a given ExonX, coverage is
calculated by:  raw base counts / exon length.  RPKM for a given ExonX is
determined by:  ( (raw base counts / median read length) × 10^9) / (total
reads × exon length).  Except in cases where the raw reads have undergone
extensive adapter trimming, the median read length will be equal to the
experimental read length for that lane.  “Total reads” is the lane yield after
removing poor quality reads.

Step 7:  Calculate quantification at the splice junction level.

Quantification at the splice junction level is also calculated based on the
aligned reads converted to genomic coordinates.  The only reported value for
this level is raw read counts, which is determined by the number of reads that
cross a particular junction.  Because a splice junction has an effective
length of zero, the coverage and RPKM calculations do not apply.

Step 8:  Generate visualization of coverage.
 
To facilitate visualization of read coverage across genomic segments of
interest, the quantification data is converted to the UCSC bigWig format to
enable viewing in the UCSC Genome Browser with down to single-base resolution.

Step 9:  Evaluate QC metrics.

Each lane is evaluated according to a variety of both pre- and post-alignment
QC measures in order to determine whether the data set in question falls
within expected quality thresholds.  The following metrics are assessed on a
per-lane basis:

* Total read counts
* Reads passing filter
* Percent of reads that are unique
* Base quality per cycle
* Nucleotide distribution per cycle
* Percent of reads requiring adapter trimming
* Mean effective read length after adapter trimming
* Percent of reads aligning to human reference transcripts
* Percent of reads aligning to human rRNA
* Percent of reads aligning to human miRNA
* Percent of reads aligning to human mtDNA
* Percent of reads aligning to the human genome
* Percent of reads aligning to other genomes (mouse, rat, fruit fly,
* Arabidopsis, yeast)
* Percent of reads aligning to viral genomes
* Average coverage per human gene
* Average coverage across the length of reference transcripts

Column Headers:

These are just brief descriptions of the column headers you will find in the
various level 3 files.

File: *.trimmed.annotated.gene.quantification.txt

* gene: This is the Entrez/LocusLink gene symbol followed by the
  Entrez/LocusLink gene ID.
* raw_counts: The number of reads mapping to this gene.
* median_length_normalized: This is the total aligned bases to all transcript
  models associated with this gene divided by the mean transcript length.
* RPKM: See the DESCRIPTION.txt file in the mage-tab bunlde for
  information on how this is calculated.

File: *.trimmed.annotated.exon.quantification.txt

* exon: This is the location of the exon in hg19 (GRCh37) based on the UCSC
  Gene standard track (December 2009 version).
* raw_counts: The number of reads mapping to this exon.
* median_length_normalized: This is the total aligned bases to this exon
  divided by the exon length.
* RPKM: See the DESCRIPTION.txt file in the mage-tab bunlde for
  information on how this is calculated.

File: *.trimmed.annotated.spljxn.quantification.txt

This file does not include normalized counts since splice junctions are a
fixed size.

* junction: This is the location of the splice junction in hg19 (GRCh37)
  based on the UCSC Gene standard track (December 2009 version).
* raw_counts: The number of reads mapping to this splice junction.

File: *.wig

This is a WIG file format that represents coverage, see
http://genome.ucsc.edu/FAQ/FAQformat.html#format6 for
more information.



#########################

V2_MapSpliceRSEM: UNC V2 RNA-Seq Workflow - MapSplice genome alignment and RSEM estimation of GAF 2.1

Date: 05-10-2012

Contacts:

* Lisle Mose <lmose@email.unc.edu>
* Joel Parker <parkerjs@email.unc.edu>

Versions:

This analysis was carried out using the SeqWare Pipeline project, version
0.7.0. The workflow was "MapspliceRSEM" version 0.7.x. UNC
provides all our analysis software through this open source project. 

Annotations:

The Generic Annotation File (GAF) that provides all of our annotations for
genes, exons, etc can be found at
https://tcga-data.nci.nih.gov/docs/GAF/GAF.hg19.June2011.bundle/outputs/TCGA.hg19.June2011.gaf

Conventions:

Please note our junction_quantification.txt, exon_quantification.txt, and the
GAF file above use the convention of chr:smaller_int-larger_int:+ for plus
strand features and chr:larger_int-smaller_int:- for negative strand features.
Carefully examine future versions of these annotation and quantification files
since this convension is subject to change.

Analysis Process:

Step 1:  Prepare raw sequencing reads for alignment.

The Illumina Casava 1.8.1 sofatware is used to process raw images and output 
fastq files for each sample.  CASAVA also performs de-multiplexing of pooled 
samples when appropriate.  

Step 3:  Align sequencing reads against reference transcript database.

A pre-processed lane of sequencing reads is then submitted to the Mapsplice
algorithm (http://www.netlab.uky.edu/p/bioinfo/MapSplice2UserGuide) for alignment against the
hg19 reference genome database using default parameters.  

Step 4:  Calculate quantification at the gene and transcript level.

The reads aligned to the reference genome sequences are translated to transcriptome coordinates 
prior to transcript level quantification.  This step is performed using the UNC Bioinformatics Utilities (UBU)
software (https://github.com/mozack/ubu). RSEM is used to estimate gene and transcript abundances
(http://deweylab.biostat.wisc.edu/rsem/rsem-calculate-expression.html).  These values are normalized
to a fixed upper quaritile value of 1000 for gene and 300 for transcript level estimates and the normalized
values are placed in a separate file.

Step 5:  Calculate quantification at the exon level.

Raw base counts, coverage, and RPKM are reported
for each entry in the library of composite exons.  The raw base counts for a
given ExonX is the total number of bases aligned to that genomic segment.  Raw
base counts are used instead of raw read counts because in many cases only a
portion of a read will align to a given exon.  For a given ExonX, coverage is
calculated by:  bases covered / exon length.  RPKM for a given ExonX is
determined by:  ( (raw base counts / read length) × 10^9) / (total
reads × exon length).

Step 6:  Calculate quantification at the splice junction level.

The only reported value for
this level is raw read counts, which is determined by the number of reads that
cross a particular junction.  Because a splice junction has an effective
length of zero, the coverage and RPKM calculations do not apply.


Step 7:  Evaluate QC metrics.

Each lane is evaluated according to a variety of both pre- and post-alignment
QC measures in order to determine whether the data set in question falls
within expected quality thresholds.  The following metrics are assessed on a
per-lane basis:

* Total read counts
* Reads passing filter
* Percent of reads that are unique
* Base quality per cycle
* Nucleotide distribution per cycle
* Percent of reads requiring adapter trimming
* Mean effective read length after adapter trimming
* Percent of reads aligning to human reference transcripts
* Percent of reads aligning to human rRNA
* Percent of reads aligning to human miRNA
* Percent of reads aligning to human mtDNA
* Percent of reads aligning to the human genome
* Percent of reads aligning to other genomes (mouse, rat, fruit fly,
* Arabidopsis, yeast)
* Percent of reads aligning to viral genomes
* Average coverage per human gene
* Average coverage across the length of reference transcripts
