Overview of miRNA analysis at the BCCA-GSC

	The analysis pipeline takes as input a .sam file containing BWA alignments for a sample, and generates a profile of the content of the sample. This includes metrics such as the expression of various RNAs with particular focus on the expression of miRNA species.

Preprocessing

	Before any alignment can occur, some preprocessing on the sequences is necessary. MiRNAs tend to be shorter than the length of reads generated by sequencers, which means that the read will sequence past biological RNA into the 3' adapter of the sequencing protocol. This non-biological sequence at the end of the read will make it difficult for the aligner to align the read, so it is trimmed off first.

	The adapter trimming works by looking for as much of the specified adapter sequence in the read as possible, allowing mismatches based on the length of adapter found. Only the first 15 bases of the adapter sequence are used for the string matching, regardless of the length of the adapter sequence supplied. If a read matches 15bp of non-biological adapter, it is safe to assume the sequence truly is adapter sequence.

	For each read, the algorithm first checks if it is an adapter dimer by checking if the adapter sequence is at the start of the read. Since it has been seen in indexed reads that the first few bases of the adapter can be lost, not only is the full adapter sequence matched, the adapter sequence with the first 1, 2, and 3 bases trimmed off the beginning is also matched. If the read is not an adapter dimer, then the algorithm tries to match the 15 bases of adapter anywhere along the read, starting from the end. When matching the full 15bp of adapter, any match with up to 2 mismatches is accepted. If the full 15bp is not found, decreasing lengths of adapter, down to the first 8 bases, is checked with 1 mismatch allowed. If a match is still not found, 7 down to 1 base of adapter is checked with an exact match required. The algorithm will trim 1 base off the end of a read if it happens to match the first base of the adapter. This is because it is better to get a perfect alignment than an alignment with potentially 1 mismatch. In addition, if only 1 base of adapter was found, the read is likely not a miRNA and would not be a concern anyway.

	After each read has been processed, a report is generated containing the number of reads at each read length. Any read that is under 15bp after trimming is discarded, the rest is sent to alignment. 15bp is chosen because it is the shortest mature miRNA in miRBase.
	
Annotation

	Given a .sam file, the first step of the pipeline is to annotate each of the aligned coordinates for each read against reference databases.	A 3 base pair overlap is required, and in the event that a coordinate range overlaps features in multiple databases, each feature type is given a priority. The feature types, in the order that they are applied, are in Table 1.

Table 1. Annotation priorities to resolve multiple matches.
	From miRBase.
		mature strand
		star strand
		precursor miRNA
		stemloop, from 1 to 6 bases outside the mature strand, between the mature and star strands
		"unannotated", any region other than the mature strand in miRNAs where there is no star strand annotated
	From UCSC RepeatMasker, other small RNAs.
		snoRNA
		tRNA
		rRNA
		snRNA
		scRNA
		srpRNA
		other RepeatMasker RNAs
	From UCSC knownGene.
		coding exons where the annotated CDS region is 0 length
		3' UTR
		5' UTR
		coding exon
		intron
	From UCSC RepeatMasker, other types of repeats.
		LINE
		SINE
		LTR
		Satellite
		RepeatMasker DNA
		RepeatMasker Low complexity
		RepeatMasker Simple Repeat
		RepeatMasker Other
		RepeatMasker Unknown

Profiling

	After annotation is done, the results are aggregated to produce a report. For each read, a series of filters are used to check whether its alignment will be used. If the read has more than 3 alignments, it is discarded. Currently only perfect alignments with no mismatches are used. Based on expression profiles of test libraries, reads which fail chastity are kept, while reads which are soft-clipped by BWA are not. After filtering, each alignment annotation of the read is looked at to determine the annotation for the read itself. If a read has more than 1 alignment, and the annotations are different, the priorities from Table 1 are used as long as only 1 alignment is to a miRNA. If there are multiple alignments to different miRNAs (or even different regions of the same miRNA), the read is flagged as cross-mapped and every miRNA annotation is preserved. Using these rules, the reads are summed by annotation and coverage reports are generated.

	To generate a list of isoforms for all miRNAs, all reads annotated as miRNAs are grouped by annotation and sequence, then summed to produce the expression of each individual sequence.

Data files

	The .mirna.quantification.txt data file gives summed expression for all reads aligning to a miRNA.
	The .isoform.quantification.txt data file gives expression for each individual sequence isoform observed. 

Data file format
	
	File formats for data files are also available online at https://wiki.nci.nih.gov/display/TCGAAnalysWG/BC+miRNA+Formats.

	The .mirna.quantification.txt data file describing summed expression for each miRNA is as follows:

miRNA name
raw read count
reads per million miRNA reads
cross-mapped to other miRNA forms (Y or N)

	The .isoform.quantification.txt data file describing every individual sequence isoform observed is as follows:

miRNA name
alignment coordinates as <version>:<Chromosome>:<Start position>-<End position>:<Strand>
raw read count
reads per million miRNA reads
cross-mapped to other miRNA forms (Y or N)
region within miRNA
