1. The TCGA miRNAseq data generation process, including strand-specific library construction,
sequencing, and computational processing, is described in:
Chu A, Robertson G, Brooks D, Mungall AJ, Birol I, Coope R, Ma Y, Jones S, Marra MA. Large-scale
profiling of microRNAs for The Cancer Genome Atlas. Nucleic Acids Res. 2015 Aug 13.


2. The analysis pipeline software is available from GitHub: https://github.com/bcgsc/mirna
The documentation on the project’s main GitHub web page describes the miRNA profiling process and
how to run the pipeline software. Answers to FAQs and a description of how to transform a set of
level 3 files into a level 4 expression data matrix are included.

The software used for adapter trimming is available from:
http://www.bcgsc.ca/platform/bioinfo/software/adapter-trimming-for-small-rna-sequencing


3. Brief description of computational analysis of miRNA-seq data
The analysis pipeline runs on Linux or Unix-like systems. It takes as input a .sam file
containing sequence read alignments for a sample, and generates a profile of small RNA abundance
for that sample. Size selection during library construction, and the 30-nt sequence read length,
result in sequence data that consist largely of RNAs that are 22±3 nt. So, while the sequence
data can contain different types of small RNAs, most reads represent miRNA mature strands.

A. Preprocessing and aligning sequence reads
B. Annotating read alignments with genomic features
C. Profiling abundance
D. Reads that multimap and crossmap
E. Output data files 
F. Data file formats

A.  Preprocessing and aligning sequence reads
The 3' end of a 30-nt sequence read typically includes part of the 3' adapter sequence. To ensure
that such adapter sequence does not interfere with alignment to the reference genome, we detect
and remove (trim) it before aligning the read. The publication describes the trimming algorithm.

Given that the shortest mature miRNA in miRBase is 15 nt, by default we discard any trimmed read
that is shorter than 15 nt. We then align the remaining reads to the reference genome, using
BWA v0.5.7.
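
The length filter described above can be sketched in a few lines of Python. This is an
illustration only, not the pipeline's own code: the function name filter_trimmed_reads and the
min_length parameter are hypothetical, and the reads are assumed to have been adapter-trimmed
already.

```python
from typing import Iterable, List

MIN_READ_LENGTH = 15  # default cutoff stated in the text (shortest mature miRNA in miRBase)


def filter_trimmed_reads(reads: Iterable[str], min_length: int = MIN_READ_LENGTH) -> List[str]:
    """Keep only adapter-trimmed reads that are at least min_length bases long."""
    return [read for read in reads if len(read) >= min_length]


# Example reads: 22 nt, 10 nt, and 15 nt after trimming.
trimmed = ["TGAGGTAGTAGGTTGTATAGTT", "ACGTACGTAC", "TAGCTTATCAGACTG"]
kept = filter_trimmed_reads(trimmed)  # the 10-nt read is discarded
```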

B.  Annotating read alignments with genomic features
Given a .sam file generated by the read aligner, the pipeline first compares the coordinates of
each alignment of each read against miRBase and several UCSC Genome Browser annotations.
By default we require a 3-bp overlap between an aligned read and a genomic feature (e.g. a
miRNA mature strand). When a read’s alignment (or multiple alignments) overlaps features in
different databases (e.g. miRNA, gene, and repeat), we resolve the multiple annotations using the
priority list below:

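Priority | Annotation type | Database
 1 | mature strand | miRBase
 2 | star strand | miRBase
 3 | precursor miRNA | miRBase
 4 | stemloop, from 1 to 6 bases outside the mature strand, between the mature and star strands | miRBase
 5 | "unannotated", any region other than the mature strand in miRNAs where no star strand is annotated | miRBase
 6 | snoRNA | UCSC small RNAs and RepeatMasker
 7 | tRNA | UCSC small RNAs and RepeatMasker
 8 | RNA | UCSC small RNAs and RepeatMasker
 9 | snRNA | UCSC small RNAs and RepeatMasker
10 | scRNA | UCSC small RNAs and RepeatMasker
11 | srpRNA | UCSC small RNAs and RepeatMasker
12 | Other RNA repeats | UCSC small RNAs and RepeatMasker
13 | coding exon with no annotated CDS region | UCSC Genes
14 | 3' UTR | UCSC Genes
15 | 5' UTR | UCSC Genes
16 | coding exon | UCSC Genes
17 | intron | UCSC Genes
18 | LINE | UCSC RepeatMasker
19 | SINE | UCSC RepeatMasker
20 | LTR | UCSC RepeatMasker
21 | Satellite | UCSC RepeatMasker
22 | RepeatMasker DNA | UCSC RepeatMasker
23 | RepeatMasker Low complexity | UCSC RepeatMasker
24 | RepeatMasker Simple Repeat | UCSC RepeatMasker
25 | RepeatMasker Other | UCSC RepeatMasker
26 | RepeatMasker Unknown | UCSC RepeatMasker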
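
The overlap test and the priority list can be combined as in the following Python sketch. It is
not taken from the pipeline: the short labels in PRIORITY, the function names, and the use of
closed 1-based intervals on a single chromosome are assumptions made for illustration, and the
3-bp requirement is interpreted here as "at least 3 bases".

```python
from typing import List, Optional, Tuple

# Annotation types in priority order (position 1 = highest priority).
# The short labels are hypothetical abbreviations of the table rows.
PRIORITY = [
    "mature", "star", "precursor", "stemloop", "unannotated",
    "snoRNA", "tRNA", "RNA", "snRNA", "scRNA", "srpRNA", "other_RNA_repeat",
    "exon_no_CDS", "3p_UTR", "5p_UTR", "coding_exon", "intron",
    "LINE", "SINE", "LTR", "Satellite", "rmsk_DNA", "rmsk_Low_complexity",
    "rmsk_Simple_repeat", "rmsk_Other", "rmsk_Unknown",
]
RANK = {name: rank for rank, name in enumerate(PRIORITY, start=1)}

MIN_OVERLAP = 3  # default overlap, in bases, stated in the text


def overlap_length(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Number of bases shared by two closed intervals (0 if they do not overlap)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def resolve_annotation(
    read: Tuple[int, int], features: List[Tuple[str, int, int]]
) -> Optional[str]:
    """Return the highest-priority feature type that overlaps the read sufficiently."""
    candidates = [
        name
        for name, start, end in features
        if overlap_length(read[0], read[1], start, end) >= MIN_OVERLAP
    ]
    return min(candidates, key=RANK.__getitem__) if candidates else None


# A read overlapping an intron (22 bases), a mature strand (20 bases) and a LINE (2 bases).
features = [("intron", 1, 5000), ("mature", 98, 119), ("LINE", 120, 400)]
best = resolve_annotation((100, 121), features)  # "mature" outranks "intron"
```

The LINE feature is ignored in this example because it shares only 2 bases with the read.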

C.  Profiling abundance
After aligned reads have been annotated, i.e. resolved as associated with specific genome
features, the read counts for miRBase miRNAs are summed for stem-loops and for distinct read
sequences to produce two abundance (i.e. expression) reports. See section E.

For TCGA, we used only exact-match read alignments in determining abundance. If you wish to work
with reads that have alignment mismatches, or with unaligned reads, the BAM files available from
CGHub (https://cghub.ucsc.edu) include all sequence reads for a library.
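
For readers working from the alignments directly, one way to select exact-match alignments from
SAM records is sketched below in Python. The criteria used here (mapped, a CIGAR string of a
single M operation spanning the whole read, and an NM edit-distance tag of 0) are an assumption
about what "exact match" means, not a description of the pipeline's own test. The helper
make_record and its coordinates are invented for the example.

```python
def is_exact_match(sam_line: str) -> bool:
    """Return True if a SAM record is mapped, ungapped, unclipped, and has NM:i:0."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, cigar, seq = int(fields[1]), fields[5], fields[9]
    if flag & 0x4:                 # 0x4 = read is unmapped
        return False
    if cigar != f"{len(seq)}M":    # reject indels and clipping
        return False
    for tag in fields[11:]:        # optional TAG:TYPE:VALUE fields
        name, _type, value = tag.split(":", 2)
        if name == "NM":           # NM = edit distance to the reference
            return int(value) == 0
    return False                   # no NM tag: cannot confirm an exact match


def make_record(cigar: str, nm: int, flag: int = 0) -> str:
    """Build a toy SAM record (hypothetical read name and position)."""
    seq = "TGAGGTAGTAGGTTGTATAGTT"
    return "\t".join(
        ["read1", str(flag), "chr9", "96938244", "37", cigar,
         "*", "0", "0", seq, "I" * len(seq), f"NM:i:{nm}"]
    )


perfect = make_record("22M", 0)    # passes the filter
mismatch = make_record("22M", 1)   # one mismatch: rejected
```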

D.  Reads that multimap and crossmap
The trimmed sequence reads largely represent isomiRs, so they are short (22±3 nt). Their 5' and
3' ends can differ from miRBase reference mature strand coordinates, particularly at the 3' end.
In addition, some miRNAs occur as families of closely related sequences that have identical or nearly
identical mature strand sequences (and MIMAT accession IDs). These factors result in reads
multimapping and crossmapping, which we address as follows (see the publication for more
details).

A read can multimap to identical mature strands from a miRNA family whose members are in
different locations in the genome (e.g. miR-181a-5p=MIMAT0000256 is present in hsa-mir-181a-1
at 1q32.1 and in hsa-mir-181a-2 at 9q33.3). When we annotate a read as miR-181a-5p, we
increment the read count for this MIMAT, and we increment the read count of one of the genomic
locations, chosen randomly, for the family’s stem-loops.
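
The multimapping rule above can be illustrated with a short Python sketch. The counters and the
function count_multimapped_read are hypothetical names, and the seeded random generator is used
only to make the example repeatable; the pipeline's own random choice may be implemented
differently.

```python
import random
from collections import Counter
from typing import Sequence

mimat_counts: Counter = Counter()     # read counts per mature strand (MIMAT)
stemloop_counts: Counter = Counter()  # read counts per stem-loop locus


def count_multimapped_read(mimat_id: str, stemloops: Sequence[str], rng: random.Random) -> None:
    """Count one read for the MIMAT and for one randomly chosen stem-loop."""
    mimat_counts[mimat_id] += 1
    stemloop_counts[rng.choice(stemloops)] += 1


rng = random.Random(0)
loci = ["hsa-mir-181a-1", "hsa-mir-181a-2"]
for _ in range(1000):
    count_multimapped_read("MIMAT0000256", loci, rng)
```

After the loop, the MIMAT count equals the number of reads, while the same reads are divided
between the two stem-loops.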

A short isomiR read may map exactly to mature strands whose sequences are similar but not
identical, when the read sequence does not capture the bases that distinguish these miRNAs (e.g.
hsa-mir-30a at 6q13 and hsa-mir-30e at 1p34.2, which differ at position 18). We report such a
read as cross-mapped, and we increment the read count for each MIMAT that it mapped to.
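
A minimal Python sketch of the cross-mapping rule follows. The identifiers MIMAT_A and MIMAT_B
are placeholders rather than real accessions, and the function and variable names are
hypothetical.

```python
from collections import Counter
from typing import Iterable, Set

mimat_counts: Counter = Counter()  # read counts per mature strand
cross_mapped: Set[str] = set()     # MIMATs that received at least one cross-mapped read


def count_read(mimat_ids: Iterable[str]) -> None:
    """Count a read once for every distinct MIMAT it maps to exactly."""
    ids = set(mimat_ids)
    for mimat_id in ids:
        mimat_counts[mimat_id] += 1
    if len(ids) > 1:               # mapped to more than one mature strand
        cross_mapped.update(ids)


count_read(["MIMAT_A", "MIMAT_B"])  # short isomiR that cannot distinguish A from B
count_read(["MIMAT_A"])             # longer read that covers the distinguishing base
```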


E.  Output data files
Two level 3 output files are available for each sample: one with pre-miRNA (stem-loop) abundance
(mirna.quantification.txt) and one with isomiR abundance (isoform.quantification.txt). Level 3
and level 4 expression files for mature strands can be calculated from the isomiR file (see the
NAR publication and the FAQ at https://github.com/bcgsc/mirna).

The mirna.quantification.txt data file gives the expression of a miRNA as the summed count for
all reads that align to any region of a precursor or stem-loop, not just its mature strand
regions.

The isoform.quantification.txt data file gives the expression for each distinct read sequence
observed, and associates each of these sequences with a miRBase stem-loop or mature strand. From
this file, the expression level of individual mature strands can be retrieved by summing the
read counts for specific MIMAT accession IDs.
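
Summing isomiR read counts by MIMAT accession can be sketched in Python as follows. Several
details are assumptions that should be checked against the actual files: that the file is
tab-delimited with a header row, that the columns are named read_count and miRNA_region, and
that mature-strand rows carry a region value of the form "mature,<MIMAT ID>". The example rows,
including their coordinates and counts, are invented.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, TextIO


def sum_mature_counts(handle: TextIO) -> Dict[str, int]:
    """Sum raw read counts per MIMAT accession over all isomiR rows."""
    totals: Dict[str, int] = defaultdict(int)
    for row in csv.DictReader(handle, delimiter="\t"):
        region = row["miRNA_region"]
        if region.startswith("mature,"):          # skip precursor, stemloop, etc.
            mimat_id = region.split(",", 1)[1]
            totals[mimat_id] += int(row["read_count"])
    return dict(totals)


header = ["miRNA_ID", "isoform_coords", "read_count",
          "reads_per_million_miRNA_mapped", "cross-mapped", "miRNA_region"]
rows = [
    ["hsa-let-7a-1", "hg19:9:96938244-96938265:+", "100", "10.0", "N", "mature,MIMAT0000062"],
    ["hsa-let-7a-1", "hg19:9:96938244-96938266:+", "50", "5.0", "N", "mature,MIMAT0000062"],
    ["hsa-let-7a-1", "hg19:9:96938280-96938290:+", "3", "0.3", "N", "precursor"],
]
example = io.StringIO("\n".join("\t".join(r) for r in [header] + rows) + "\n")
mature_totals = sum_mature_counts(example)
```

For a real file, pass an open file handle in place of the in-memory example.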


F.  Data file formats
The mirna.quantification.txt data file that describes the abundance of each miRNA stem-loop has
the following format:
   miRNA name
   raw read count
   reads per million miRNA reads
   cross-mapped (Y or N)
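
The "reads per million miRNA reads" value can be understood through the following Python
sketch, which assumes that the denominator is the total number of reads counted for miRNAs in
the sample. That assumption, the function name, and the example counts are illustrative and are
not taken from the pipeline.

```python
from typing import Dict


def reads_per_million(raw_counts: Dict[str, int]) -> Dict[str, float]:
    """Scale raw counts so that they sum to one million across all miRNAs."""
    total = sum(raw_counts.values())
    if total == 0:                  # avoid division by zero for an empty sample
        return {name: 0.0 for name in raw_counts}
    return {name: count * 1_000_000 / total for name, count in raw_counts.items()}


counts = {"hsa-let-7a-1": 750, "hsa-mir-21": 250}  # invented counts
rpm = reads_per_million(counts)
```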

The isoform.quantification.txt data file that describes the abundance of every distinct miRNA
sequence observed has the following format:
   miRNA name
   alignment coordinates as <assembly>:<Chromosome>:<Start position>-<End position>:<Strand>
      (coordinates are 1-based and half-open: [start,end))
   raw read count
   reads per million miRNA reads
   cross-mapped (Y or N)
   region within miRNA
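
The alignment coordinate string can be parsed as in the Python sketch below. The class and
function names are hypothetical, and the example coordinates are invented. Because the interval
is described as [start,end), the length is taken as end minus start.

```python
from typing import NamedTuple


class IsoformCoords(NamedTuple):
    assembly: str
    chromosome: str
    start: int   # 1-based, inclusive
    end: int     # exclusive
    strand: str

    @property
    def length(self) -> int:
        """Number of bases covered by the half-open interval [start, end)."""
        return self.end - self.start


def parse_isoform_coords(text: str) -> IsoformCoords:
    """Parse '<assembly>:<Chromosome>:<Start>-<End>:<Strand>' into its parts."""
    assembly, chromosome, span, strand = text.strip().split(":")
    start, end = span.split("-")
    return IsoformCoords(assembly, chromosome, int(start), int(end), strand)


coords = parse_isoform_coords("hg19:9:96938244-96938266:+")
isomir_length = coords.length  # 22
```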
