ESPRESSO – robust discovery and quantification of transcript isoforms from error-prone long-read RNA-seq data

Long-read RNA sequencing (RNA-seq) holds great potential for characterizing transcriptome variation and full-length transcript isoforms, but the relatively high error rate of current long-read sequencing platforms poses a major challenge. Researchers from The Children’s Hospital of Philadelphia present ESPRESSO, a computational tool for robust discovery and quantification of transcript isoforms from error-prone long reads. ESPRESSO jointly considers all long reads aligned to a gene and uses the error profiles of individual reads to improve the identification of splice junctions and the discovery of their corresponding transcript isoforms. On both a synthetic spike-in RNA sample and human RNA samples, ESPRESSO outperforms multiple contemporary tools in both transcript isoform discovery and transcript isoform quantification. In total, the researchers generated and analyzed ~1.1 billion nanopore RNA-seq reads covering 30 human tissue samples and three human cell lines. ESPRESSO and its companion dataset provide a useful resource for studying the RNA repertoire of eukaryotic transcriptomes.

Overview of ESPRESSO

(A and B) Proportion of incorrect splice junctions (SJs) among (A) imperfectly aligned or (B) perfectly aligned putative SJs found in raw long-read-to-genome alignments of ONT 1D cDNA reads (n = 3) and direct RNA reads (n = 3) for Spike-In RNA Variants (SIRVs). Perfectly aligned putative SJs do not have any mismatches or indels within 10 nt of splice sites.

(C) High-confidence SJs are identified from raw long-read-to-genome alignments based on whether they are present in the existing transcript catalog, or if they have canonical splice site dinucleotide motifs (GT/AG, GC/AG, or AT/AC) and are supported by at least two (by default) perfectly aligned reads. The resulting set of high-confidence SJs is used to correct, recover, and evaluate SJs found in individual long reads based on each read’s alignment and error profile.

(D) First, reads are classified into the following categories on the basis of the annotation statuses of their corresponding SJs in the existing transcript catalog: full splice match (FSM), incomplete splice match (ISM), novel in catalog (NIC), novel not in catalog (NNC), or not completely determined (NCD). Second, FSM and full-length NIC/NNC reads are used to discover annotated and novel transcript isoforms, respectively. Third, all long reads (full-length and non-full-length) are matched to compatible transcript isoforms. Last, abundances of discovered isoforms are quantified using an expectation-maximization (EM) algorithm. Thickness of arrows drawn between reads and compatible transcript isoforms (bottom right) indicates probability of assigning reads to specific transcript isoforms.
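To make panels (C) and (D) more concrete, here is a minimal Python sketch of two of the ideas described above: the rule for calling a splice junction high-confidence, and a generic EM loop that distributes reads among compatible isoforms. This is an illustration only, not ESPRESSO's actual code. The function names, the junction representation, and the simplifying assumption that a read is equally likely to arise from any isoform it is compatible with are all hypothetical; the real tool also uses each read's alignment and error profile, which this sketch omits.

```python
# Illustrative sketch only; names and data layout are assumptions, not ESPRESSO's API.

# Canonical donor/acceptor dinucleotide pairs named in the figure legend.
CANONICAL_MOTIFS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def is_high_confidence_sj(sj, annotated_sjs, motif, perfect_read_count,
                          min_perfect_reads=2):
    """Return True if a junction is annotated, or is canonical and has
    enough perfectly aligned supporting reads.

    sj: any hashable junction key, e.g. (chrom, donor, acceptor, strand).
    motif: (donor_dinucleotide, acceptor_dinucleotide) tuple.
    """
    if sj in annotated_sjs:
        return True
    return motif in CANONICAL_MOTIFS and perfect_read_count >= min_perfect_reads


def em_quantify(read_compat, n_iter=100, tol=1e-9):
    """Estimate per-isoform read counts with a basic EM algorithm.

    read_compat: one list per read, holding the isoform IDs the read is
    compatible with. Reads with no compatible isoform are ignored.
    """
    reads = [compat for compat in read_compat if compat]
    isoforms = sorted({iso for compat in reads for iso in compat})
    if not isoforms:
        return {}
    # Start from uniform relative abundances.
    theta = {iso: 1.0 / len(isoforms) for iso in isoforms}
    for _ in range(n_iter):
        # E-step: split each read among its compatible isoforms in
        # proportion to the current abundance estimates.
        counts = dict.fromkeys(isoforms, 0.0)
        for compat in reads:
            total = sum(theta[iso] for iso in compat)
            for iso in compat:
                counts[iso] += theta[iso] / total
        # M-step: re-estimate abundances from the fractional counts.
        new_theta = {iso: counts[iso] / len(reads) for iso in isoforms}
        delta = max(abs(new_theta[iso] - theta[iso]) for iso in isoforms)
        theta = new_theta
        if delta < tol:
            break
    return {iso: theta[iso] * len(reads) for iso in isoforms}
```

For example, with six reads compatible only with isoform A, two compatible only with isoform B, and four compatible with both, the loop assigns most of the ambiguous reads to A, because the uniquely assigned reads indicate that A is the more abundant isoform. The per-read fractions computed in the E-step play the role of the arrow thicknesses in panel (D).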

Availability – The ESPRESSO software, together with other scripts used in this study, is available at GitHub (https://github.com/Xinglab/espresso) and archived at Zenodo (version 1.2.2, https://doi.org/10.5281/zenodo.6977552).

Gao Y, Wang F, Wang R, Kutschera E, Xu Y, Xie S, Wang Y, Kadash-Edmondson KE, Lin L, Xing Y. (2023) ESPRESSO: Robust discovery and quantification of transcript isoforms from error-prone long-read RNA-seq data. Sci Adv 9(3):eabq5072.