In this post, I'm going to show you how to read the Log.final.out file produced by the RNA-seq aligner STAR.
Most beginner bioinformaticians and students, including myself back then, tend to apply read mappers blindly and head straight into the downstream analysis. Basically, they run with the default parameters and care little about how well the reads actually mapped.
But when the downstream profiling goes south, going back only to discover that something was wrong at the very beginning (read mapping) is a waste of time and energy 😵.
I was once in exactly that situation: a clustering analysis had gone quite far before the results looked completely wrong. I had to go back, and found that the reads contained UMIs (Unique Molecular Identifiers).
In this post, I will decipher STAR's Log.final.out line by line, to show how we can diagnose the library and the mapping and identify problems as early as possible.
> less Log.final.out
Started job on | May 11 15:53:47
Started mapping on | May 11 15:54:07
Finished on | May 11 15:56:26
Mapping speed, Million of reads per hour | 32.58
Number of input reads | 1257924
Average input read length | 51
UNIQUE READS:
Uniquely mapped reads number | 317851
Uniquely mapped reads % | 25.27%
Average mapped length | 44.03
Number of splices: Total | 11046
Number of splices: Annotated (sjdb) | 8743
Number of splices: GT/AG | 9717
Number of splices: GC/AG | 422
Number of splices: AT/AC | 2
Number of splices: Non-canonical | 905
Mismatch rate per base, % | 0.95%
Deletion rate per base | 0.00%
Deletion average length | 1.11
Insertion rate per base | 0.01%
Insertion average length | 1.13
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 120440
% of reads mapped to multiple loci | 9.57%
Number of reads mapped to too many loci | 7043
% of reads mapped to too many loci | 0.56%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 63.99%
% of reads unmapped: other | 0.60%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
Above is the Log.final.out of a single-end STAR mapping run. It contains metadata such as timestamps and mapping speed, and statistics such as read count and average read length. Let's examine some important metrics.
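Before going through the metrics by eye, it can help to read the file programmatically. Below is a minimal Python sketch, assuming the "name | value" layout shown above; the function names and the `READ_CATEGORIES` list are my own for illustration and are not part of STAR. It also checks that the read categories add up to roughly 100%, so every input read is accounted for.

```python
# Metric names copied from the log above; together they cover every input read.
READ_CATEGORIES = [
    "Uniquely mapped reads %",
    "% of reads mapped to multiple loci",
    "% of reads mapped to too many loci",
    "% of reads unmapped: too many mismatches",
    "% of reads unmapped: too short",
    "% of reads unmapped: other",
]


def parse_star_log(text):
    """Parse the text of a Log.final.out into {metric name: value string}."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as "UNIQUE READS:" have no "|"
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def as_percent(value):
    """Convert a string like '25.27%' to the float 25.27."""
    return float(value.rstrip("%"))


# A few lines copied from the log above, used as sample input.
sample = """\
Number of input reads | 1257924
UNIQUE READS:
Uniquely mapped reads % | 25.27%
% of reads mapped to multiple loci | 9.57%
% of reads mapped to too many loci | 0.56%
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 63.99%
% of reads unmapped: other | 0.60%
"""

stats = parse_star_log(sample)
total = sum(as_percent(stats[name]) for name in READ_CATEGORIES)
print(f"input reads: {int(stats['Number of input reads'])}")
print(f"sum of read categories: {total:.2f}%")
```

For a real run, you would pass the contents of your own Log.final.out (for example `open(path).read()`) instead of `sample`.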
Uniquely mapped reads % is only 25.27%. This is quite low, so it prompted me to double-check the possible reasons.
Here comes the bad one: % of reads unmapped: too short is 63.99%.
This is certainly unusual, and you may be confused: the average mapped length (44.03) is close to the read length (51), so why are so many reads “too short”?
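Actually, “too short” is poorly labeled: it stands for “alignment too short”. Note also that the average mapped length is listed under UNIQUE READS, so it describes only the reads that did map uniquely, not the ones that failed. A too-short alignment can happen either for normal-length reads that do not map well or for reads that are literally too short (e.g., over-trimmed reads). You can increase the number of mapped reads by relaxing the requirements on mapped length.
For example, you can use the following options (the default is 0.66, which means about 2/3 of the read's bases must be mapped; see the STAR manual):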
--outFilterScoreMinOverLread 0.3
--outFilterMatchNminOverLread 0.3
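To get a feel for what these fractions mean for the 51 bp reads in this log, here is a small arithmetic sketch in Python. The helper `approx_min_aligned_bases` is hypothetical, and it simply multiplies the fraction by the read length; STAR's exact internal rounding is not reproduced here, so treat the numbers as approximate.

```python
def approx_min_aligned_bases(read_length, min_fraction):
    """Approximate number of bases that must align for a read to pass the filter.

    Simplification: fraction times read length. This is an assumption for
    illustration, not STAR's exact internal computation.
    """
    return min_fraction * read_length


read_length = 51  # "Average input read length" from the log above

for fraction in (0.66, 0.3):
    needed = approx_min_aligned_bases(read_length, fraction)
    print(f"fraction {fraction}: roughly {needed:.1f} of {read_length} bases must align")
```

With the default, roughly 34 of the 51 bases have to align; with 0.3, about 15 are enough. Keep in mind that such short alignments are also more likely to be spurious, so relaxing the filter is a diagnostic step more than a fix.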
One of the major issues for bioinformatics discoveries is reproducibility. One should be consistent and accurate when performing an analysis. So check and make sure the read mapper (for example STAR) is performing reasonably before you head into subsequent analysis.
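One way to make this check a habit is to automate it. The following Python sketch flags the two metrics discussed in this post. The function name and both cutoffs (70% uniquely mapped, 20% too short) are hypothetical values I picked for illustration, not recommendations from the STAR manual; sensible limits depend on your organism and library type.

```python
def flag_mapping_problems(unique_pct, too_short_pct,
                          min_unique_pct=70.0, max_too_short_pct=20.0):
    """Return a list of warnings for suspicious mapping statistics.

    Both cutoffs are hypothetical defaults chosen for illustration.
    """
    warnings = []
    if unique_pct < min_unique_pct:
        warnings.append(
            f"Uniquely mapped reads: {unique_pct:.2f}% (below {min_unique_pct:.0f}%)"
        )
    if too_short_pct > max_too_short_pct:
        warnings.append(
            f"Unmapped, too short: {too_short_pct:.2f}% (above {max_too_short_pct:.0f}%)"
        )
    return warnings


# Values taken from the log in this post.
for message in flag_mapping_problems(unique_pct=25.27, too_short_pct=63.99):
    print("WARNING:", message)
```

Run on the log in this post, both checks fire, which is exactly the signal to stop and inspect the library before moving on.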