Note:  This file needs to be updated 2011-08-17

UNC RNA-Seq Workflow - BWA Alignment to Transcriptome

Date: 20101108

Authors:

* Sara Grimm <sacheek@med.unc.edu>
* Brian O'Connor <brianoc@email.unc.edu>

Versions:

This analysis was carried out using the SeqWare Pipeline project, version
0.7.0. The workflow was "RNASeqAlignmentBWA" version 0.7.4 (UNCID:23324). UNC
provides all our analysis software through this open source project. Users can
download this software to run the identical RNA-Seq analysis described in the
steps below. See the project website at http://seqware.sf.net for more
information. The UNCIDs provided in file names are identifiers unique to UNC
and can be used to provide data/analysis provenance tracking.

Annotations:

The Generic Annotation File (GAF) that provides all of our annotations for
genes, exons, etc can be found at
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/other/GAF/GAF_bundle/outputs/TCGA.Sept2010.09202010.gaf

Conventions:

Please note our spljxn.quantification.txt, exon.quantification.txt, and the
GAF file above use the convention of chr:smaller_int-larger_int:+ for plus
strand features and chr:larger_int-smaller_int:- for negative strand features.
Carefully examine future versions of these annotation and quantification files
since this convension is subject to change.

Analysis Process:

Step 1:  Construct database of reference transcript sequences, composite gene
models, composite exons, and splice junctions.

The reference transcript set is based on the hg19 UCSC Gene standard track
(December 2009 version), which is a publicly-available sequence set that can
be downloaded from the UCSC Genome Browser (http://genome.ucsc.edu/).  From
this set, only sequences mapped to canonical human chromosomes (chr1-22, X, Y,
and M) were retained.  For each selected transcript, the nucleotide sequence,
association to Entrez/LocusLink genes (if known), and CDS range (if known)
were extracted from the UCSC database tables.  Additionally, a pairwise
alignment of each transcript against the hg19 genome was provided by Mark
Diehkans from UCSC.  The reference transcript set contains 73,671 human
transcript sequences representing 20,532 human genes.  67,434 transcript
sequences are given gene assignments according to the UCSC database tables;
the remaining 6,237 sequences in the reference transcript set are not
currently associated with an Entrez/LocusLink gene.  No transcript is
associated with more than one Entrez/LocusLink gene.  For each gene
represented by one or more transcripts in this set, a composite gene model was
generated by merging all overlapping exons (as defined by the genomic mapping)
from each associated reference transcript.  Thus, each composite gene model is
essentially the union of all associated reference transcripts.  Each gene
model is linked to a specific Entrez/LocusLink gene identifier.  A library of
composite exons was then established by collecting each contiguous genomic
segment from each gene model.  Note that because the gene model construction
process involves merging overlapping exons from different transcripts, the
resulting composite exons may or may not be observed in reference transcript
sequences.  This library contains 239,886 unique composite exons, each of
which is defined by its genomic range.  A library of all known splice
junctions was established by cataloging all junctions observed in the
reference transcript sequences.  Inclusion of a splice junction in this
library indicates that it has been observed in (at least) one reference
transcript.  However, not all splice junctions in this library will be
observed in the composite gene models.  This library contains 249,775 unique
splice junctions, where each junction is defined by the last position of exon
N and the first position of exon N+1.

Step 2:  Prepare raw sequencing reads for alignment.

The “illumina2srf” tool from the DNA Sequence Read Toolkit
(http://sourceforge.net/projects/sequenceread/) is used to convert the raw
sequencing data from the vendor-specific format to standard SRF (sequence read
format), which is a preferred compressed format for storing sequencing reads.
Before data processing begins, the “srf2fastq” tool from the Staden Package
(http://staden.sourceforge.net/) is utilized to convert the data from SRF to
the FASTQ format, using the “-C” parameter to simultaneously filter poor
quality reads.  In some cases, the insert size of the sequenced fragment is
shorter than the read length, resulting in part or all of the adapter (primer)
sequence being incorporated into the output read.  Since any read containing
adapter sequence is highly unlikely to align to a biological reference
database, all reads are further processed to trim any adapter segments from
the sequences.  The criteria for identifying adapter sequence within a read
are as follows:  (a) the read contains an exact match with the adapter
sequence, (b) the sequence match begins at the first base of the adapter, (c)
the sequence match continues until either the end of the adapter or the end of
the read, and (d) the sequence match is at least 5 bases in length.  If all
criteria are met, the read is trimmed starting at the first base of the
adapter sequence match.

Step 3:  Align sequencing reads against reference transcript database.

A pre-processed lane of sequencing reads is then submitted to the BWA
algorithm (http://bio-bwa.sourceforge.net/) for alignment against the
reference transcript database using default parameters.  The resulting SAM
file is converted to BAM format using Picard tools
(http://picard.sourceforge.net/).

Step 4:  Calculate quantification at the transcript level.

The reads aligned to the reference transcript sequences are used to determine
transcript level quantification.  Three quantification values are reported:
raw read counts, coverage, and RPKM.  Raw read counts is simply the number of
reads aligned to a given reference transcript.  Raw read counts are normalized
by transcript length to give coverage.  For a given TranscriptX, coverage is
calculated by:  total bases aligned to TranscriptX / length of TranscriptX.
The “total bases aligned” will typically be equal to aligned reads × read
length.  However, in cases where reads have been trimmed due to adapter
contamination, this value may be slightly lower.  Raw read counts are
normalized by both transcript length and overall lane yield to give RPKM.  For
a given TranscriptX, RPKM is calculated by:  (raw read counts × 10^9) / (total
reads × length of TranscriptX).  For this calculation, “total reads” is the
lane yield after removing poor quality reads.

Step 5:  Calculate quantification at the gene level.

Quantification at the gene level is determined by collapsing the transcript
level quantifications onto their associated gene loci.  Raw read counts,
coverage, and RPKM are reported for each gene.  Raw read counts for a given
GeneX is the sum of reads aligned to all transcripts associated with that
gene.  Coverage for a given GeneX is determined by:  sum of bases aligned to
all transcripts associated with GeneX / length of GeneX.  RPKM for a given
GeneX is calculated by:  (raw read counts × 10^9) / (total reads × length of
GeneX).  “Total reads” is the lane yield after removing poor quality reads.
In both the coverage and RPKM calculations, the length of GeneX is defined as
the median length of all transcripts associated with GeneX.

Step 6:  Calculate quantification at the exon level.

In order to carry quantification for sequence features defined by their
genomic locus (i.e. exons and splice junctions), the aligned reads must first
be translated from transcript coordinates to genomic coordinates.  The
pre-established pairwise mapping between each reference transcript and the
hg19 genome allows for a straightforward conversion between these two
coordinate systems.  These converted aligned reads are then used to determine
exon level quantification.  Raw base counts, coverage, and RPKM are reported
for each entry in the library of composite exons.  The raw base counts for a
given ExonX is the total number of bases aligned to that genomic segment.  Raw
base counts are used instead of raw read counts because in many cases only a
portion of a read will align to a given exon.  For a given ExonX, coverage is
calculated by:  raw base counts / exon length.  RPKM for a given ExonX is
determined by:  ( (raw base counts / median read length) × 10^9) / (total
reads × exon length).  Except in cases where the raw reads have undergone
extensive adapter trimming, the median read length will be equal to the
experimental read length for that lane.  “Total reads” is the lane yield after
removing poor quality reads.

Step 7:  Calculate quantification at the splice junction level.

Quantification at the splice junction level is also calculated based on the
aligned reads converted to genomic coordinates.  The only reported value for
this level is raw read counts, which is determined by the number of reads that
cross a particular junction.  Because a splice junction has an effective
length of zero, the coverage and RPKM calculations do not apply.

Step 8:  Generate visualization of coverage.
 
To facilitate visualization of read coverage across genomic segments of
interest, the quantification data is converted to the UCSC bigWig format to
enable viewing in the UCSC Genome Browser with down to single-base resolution.

Step 9:  Evaluate QC metrics.

Each lane is evaluated according to a variety of both pre- and post-alignment
QC measures in order to determine whether the data set in question falls
within expected quality thresholds.  The following metrics are assessed on a
per-lane basis:

* Total read counts
* Reads passing filter
* Percent of reads that are unique
* Base quality per cycle
* Nucleotide distribution per cycle
* Percent of reads requiring adapter trimming
* Mean effective read length after adapter trimming
* Percent of reads aligning to human reference transcripts
* Percent of reads aligning to human rRNA
* Percent of reads aligning to human miRNA
* Percent of reads aligning to human mtDNA
* Percent of reads aligning to the human genome
* Percent of reads aligning to other genomes (mouse, rat, fruit fly,
* Arabidopsis, yeast)
* Percent of reads aligning to viral genomes
* Average coverage per human gene
* Average coverage across the length of reference transcripts

Column Headers:

These are just brief descriptions of the column headers you will find in the
various level 3 files.

File: *.trimmed.annotated.gene.quantification.txt

* gene: This is the Entrez/LocusLink gene symbol followed by the
  Entrez/LocusLink gene ID.
* raw_counts: The number of reads mapping to this gene.
* median_length_normalized: This is the total aligned bases to all transcript
  models associated with this gene divided by the mean transcript length.
* RPKM: See the DESCRIPTION.txt file in the mage-tab bunlde for
  information on how this is calculated.

File: *.trimmed.annotated.exon.quantification.txt

* exon: This is the location of the exon in hg19 (GRCh37) based on the UCSC
  Gene standard track (December 2009 version).
* raw_counts: The number of reads mapping to this exon.
* median_length_normalized: This is the total aligned bases to this exon
  divided by the exon length.
* RPKM: See the DESCRIPTION.txt file in the mage-tab bunlde for
  information on how this is calculated.

File: *.trimmed.annotated.spljxn.quantification.txt

This file does not include normalized counts since splice junctions are a
fixed size.

* junction: This is the location of the splice junction in hg19 (GRCh37)
  based on the UCSC Gene standard track (December 2009 version).
* raw_counts: The number of reads mapping to this splice junction.

File: *.wig

This is a WIG file format that represents coverage, see
http://genome.ucsc.edu/FAQ/FAQformat.html#format6 for
more information.
