Bo Sun | Facebook AI | How to Read STAR (RNA-seq aligner) output

How to Read STAR (RNA-seq aligner) output

Jun 7 2019 · 4 min read

In this post, I’m gonna show you how to read the Log.final.out file produced by the RNA-seq aligner STAR.

Background

Most beginner bioinformaticians and students, including myself back then, tend to apply read mappers blindly and head straight into downstream analysis. They basically run with the default parameters and pay little attention to:

the quality of the library
the log files generated at the end of the mapping

But when the downstream profiling goes south, going back only to discover that something went wrong at the very beginning (read mapping) is a waste of time and energy 😵.
I was once in exactly this situation: a clustering analysis had gone quite far before the results looked completely wrong. I had to go back, and found that the reads contained UMIs (Unique Molecular Identifiers).

Goal of this post

In this post, I will decipher STAR’s Log.final.out line by line to show how we can diagnose the library and the mapping, and identify problems as early as possible.

Log.final.out

    > less Log.final.out
                                 Started job on |       May 11 15:53:47
                             Started mapping on |       May 11 15:54:07
                                    Finished on |       May 11 15:56:26
       Mapping speed, Million of reads per hour |       32.58

                          Number of input reads |       1257924
                      Average input read length |       51
                                    UNIQUE READS:
                   Uniquely mapped reads number |       317851
                        Uniquely mapped reads % |       25.27%
                          Average mapped length |       44.03
                       Number of splices: Total |       11046
            Number of splices: Annotated (sjdb) |       8743
                       Number of splices: GT/AG |       9717
                       Number of splices: GC/AG |       422
                       Number of splices: AT/AC |       2
               Number of splices: Non-canonical |       905
                      Mismatch rate per base, % |       0.95%
                         Deletion rate per base |       0.00%
                        Deletion average length |       1.11
                        Insertion rate per base |       0.01%
                       Insertion average length |       1.13
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       120440
             % of reads mapped to multiple loci |       9.57%
        Number of reads mapped to too many loci |       7043
             % of reads mapped to too many loci |       0.56%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.00%
                 % of reads unmapped: too short |       63.99%
                     % of reads unmapped: other |       0.60%
                                  CHIMERIC READS:
                       Number of chimeric reads |       0
                            % of chimeric reads |       0.00%

Break it down

Above is the Log.final.out of a single-end STAR mapping. It contains metadata such as timestamps and mapping speed, and statistics such as read count and average read length. Let’s examine some important metrics.
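
If you have many samples, reading these files by eye gets tedious. Below is a minimal sketch of how the log could be turned into a dictionary, assuming every metric line has the `name | value` layout shown above. The function names `parse_star_log` and `as_number` are my own helpers for illustration, not part of STAR.

```python
def parse_star_log(text):
    """Parse the text of a STAR Log.final.out into {metric name: value string}."""
    metrics = {}
    for line in text.splitlines():
        if "|" not in line:
            # Section headers such as "UNIQUE READS:" and blank lines have no "|"
            continue
        key, value = line.split("|", 1)
        metrics[key.strip()] = value.strip()
    return metrics


def as_number(value):
    """Convert a numeric value string such as '25.27%' or '1257924' to float."""
    return float(value.rstrip("%"))


# A few lines copied from the log above, used here as a small example input.
example = """\
                          Number of input reads |       1257924
                                    UNIQUE READS:
                        Uniquely mapped reads % |       25.27%
                 % of reads unmapped: too short |       63.99%
"""

m = parse_star_log(example)
print(as_number(m["Uniquely mapped reads %"]))
print(as_number(m["% of reads unmapped: too short"]))
```

For a real file you would pass `open("Log.final.out").read()` instead of `example`. Note that `as_number` is only meant for the numeric rows; the timestamp rows at the top stay as plain strings.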

Uniquely mapped reads -> 25.27%

This is quite low, which prompted me to double-check the possible reasons.

Average mapped length -> 44.03. Good, since it is close to the average input read length (51).
Number of splices -> Good, since the splices are dominated by annotated (8743 of 11046) and canonical (GT/AG, GC/AG, AT/AC) ones.
Mismatch rate per base -> 0.95%. Acceptable, though a tiny bit on the high side; I expect 0.5-0.8% for good libraries.
Insertions and deletions -> Good, indel rates are low (0.01% and 0.00% per base).
Multi-mapping reads -> 9.57%. Whether this matters depends on your analysis. You can choose to use only uniquely mapped reads for simplicity, but there are tools that may re-estimate expression by incorporating multi-mapped reads.
Number of reads mapped to too many loci -> 0.56%. Good. By default, “too many” means more than 10 loci (--outFilterMultimapNmax), so not much is lost at 0.56%.
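
The splice and read-fate checks above are simple arithmetic, so here is a small sketch that redoes them with the numbers copied from this log. The variable names are mine, and the idea that the read-fate percentages should add up to roughly 100% is a sanity check I am assuming, not something STAR reports itself.

```python
# Values copied from the Log.final.out shown above.
splices_total = 11046
splices_annotated = 8743
splices_canonical = 9717 + 422 + 2  # GT/AG + GC/AG + AT/AC
splices_noncanonical = 905

# The motif classes should add up to the total number of splices.
assert splices_canonical + splices_noncanonical == splices_total

annotated_pct = 100 * splices_annotated / splices_total
canonical_pct = 100 * splices_canonical / splices_total
print(f"annotated: {annotated_pct:.1f}%  canonical: {canonical_pct:.1f}%")

# Percentage of input reads in each category, as reported by STAR.
read_fates = {
    "unique": 25.27,
    "multi-mapped": 9.57,
    "too many loci": 0.56,
    "unmapped: too many mismatches": 0.00,
    "unmapped: too short": 63.99,
    "unmapped: other": 0.60,
}
total_pct = sum(read_fates.values())
print(f"accounted for: {total_pct:.2f}% of input reads")
```

For this library, roughly 79% of the splices are annotated and roughly 92% use a canonical motif, and the six read categories account for essentially all input reads (the small remainder is rounding).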

Here comes the bad one:

reads unmapped: too short -> 63.99%

This is certainly unusual, and you may be confused: if the average mapped length is close to the read length, why are so many reads “too short”? Note that the average mapped length is reported under UNIQUE READS, so it only describes the reads that did map.
Actually, “too short” is poorly labeled; it stands for “alignment too short”. This can happen either for normal-length reads that do not map well or for reads that are literally too short (e.g., over-trimmed reads). You can increase the number of mapped reads by relaxing the requirements on mapped length,
for example with the following options. The default for both is 0.66, meaning that about 2/3 of the read’s bases must be aligned (see the STAR manual):

    --outFilterScoreMinOverLread 0.3
    --outFilterMatchNminOverLread 0.3
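
To get a feel for what these fractions mean in bases, here is a rough back-of-the-envelope sketch. It simply multiplies the fraction by the read length; STAR’s exact internal rounding may differ slightly, and for paired-end data the manual normalizes to the sum of the mates’ lengths rather than a single read. The helper name `approx_min_aligned_bases` is hypothetical.

```python
def approx_min_aligned_bases(read_length, min_over_lread=0.66):
    """Approximate number of matched bases required by an *OverLread filter.

    This is only fraction * read length; it ignores STAR's internal rounding.
    """
    return min_over_lread * read_length


read_length = 51  # "Average input read length" from the log above

for fraction in (0.66, 0.3):
    needed = approx_min_aligned_bases(read_length, fraction)
    print(f"fraction {fraction}: roughly {needed:.1f} of {read_length} bases")
```

So with the default, a 51 bp read needs roughly 34 aligned bases, while 0.3 lowers that to roughly 15.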

Wrap up

One of the big issues in bioinformatics discoveries is reproducibility. One should be consistent and accurate when performing an analysis. So check and make sure the read mapper (for example STAR) is performing reasonably before you head into subsequent analysis.