Single-cell gene set enrichment analysis and transfer learning for functional annotation of scRNA-seq data


Although it is an essential step, functional annotation of cells often proves particularly challenging from single-cell transcriptional data. Several methods have been developed to accomplish this task. However, in most cases these rely on techniques initially developed for bulk RNA sequencing, or simply make use of marker genes identified from cell clustering followed by supervised annotation. To overcome these limitations and automate the process, researchers from the Telethon Institute of Genetics and Medicine have developed two novel methods: single-cell gene set enrichment analysis (scGSEA) and the single-cell mapper (scMAP). scGSEA combines latent data representations and gene set enrichment scores to detect coordinated gene activity at single-cell resolution. scMAP uses transfer learning techniques to re-purpose and contextualize new cells into a reference cell atlas. Using both simulated and real datasets, the authors show that scGSEA effectively recapitulates recurrent patterns of pathway activity shared by cells from different experimental conditions. They also show that scMAP can reliably map and contextualize new single-cell profiles on a breast cancer atlas they recently released. Both tools are provided in a straightforward workflow that offers a framework to determine cell function and significantly improve the annotation and interpretation of scRNA-seq data.
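To make the idea of combining latent representations with gene set enrichment scores more concrete, here is a minimal Python sketch. It is not the gficf implementation. It assumes that a latent factorization is already available (toy values below), scores each gene set in each factor with a simplified unweighted running-sum statistic, and then projects those per-factor scores onto cells through their factor weights. All names (`enrichment_score`, `cell_pathway_scores`, the genes, pathways and cells) are hypothetical.

```python
def enrichment_score(loadings, gene_set):
    """Unweighted running-sum (KS-like) enrichment of gene_set among genes
    ranked by decreasing loading. Simplified stand-in for a full GSEA."""
    ranked = sorted(loadings, key=loadings.get, reverse=True)
    hits = [gene in gene_set for gene in ranked]
    n_hit = sum(hits)
    n_miss = len(ranked) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running  # keep the largest deviation from zero
    return best


def cell_pathway_scores(cell_weights, factor_loadings, gene_sets):
    """Combine per-factor enrichment scores with per-cell factor weights."""
    per_factor = {
        name: [enrichment_score(loadings, genes) for loadings in factor_loadings]
        for name, genes in gene_sets.items()
    }
    return {
        cell: {
            name: sum(w * es for w, es in zip(weights, per_factor[name]))
            for name in gene_sets
        }
        for cell, weights in cell_weights.items()
    }


# Toy latent factors (assumed input): one gene -> loading dict per factor.
factor_loadings = [
    {"g1": 0.9, "g2": 0.8, "g3": 0.7, "g4": 0.1, "g5": 0.05, "g6": 0.0},
    {"g1": 0.0, "g2": 0.1, "g3": 0.05, "g4": 0.9, "g5": 0.8, "g6": 0.7},
]
# Toy per-cell weights on the two factors (assumed input).
cell_weights = {
    "cell_1": [1.0, 0.0],
    "cell_2": [0.0, 1.0],
    "cell_3": [0.5, 0.5],
}
# Hypothetical gene sets.
gene_sets = {
    "pathway_A": {"g1", "g2", "g3"},
    "pathway_B": {"g4", "g5", "g6"},
}

scores = cell_pathway_scores(cell_weights, factor_loadings, gene_sets)
for cell, pathway_scores in scores.items():
    print(cell, pathway_scores)
```

In this toy setting, a cell dominated by the first factor scores high for `pathway_A` and low for `pathway_B`, while a cell with equal weight on both factors scores near zero for both. A real analysis would derive the factors from normalized expression data and use a proper enrichment test.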

Single-cell gene set enrichment analysis: overview and performance

(A) GFICF package overview. (B) Single-cell gene set enrichment analysis pipeline. (C) UMAP plot of 5000 simulated cells grouped in four distinct groups. (D) Reconstructed activity of 24 simulated pathways across the 5000 cells in (C). In the heatmap, pathways are along rows and simulated cells along columns. Cells are ordered according to their group of origin. (E) Comparison between scGSEA pathway scores and signature scores originally computed by Schiebinger et al. on 251,203 single-cell profiles collected during differentiation stages. The first row shows the original gene set scores computed by Schiebinger et al. using the wot Python package. The second row shows gene set scores computed with the scGSEA tool in the gficf R package. Each column represents a different gene set. Scores were plotted on the original FLE (force-directed layout) coordinates published by Schiebinger et al. (F) Spearman Correlation Coefficient (SCC) between scGSEA scores and wot package signature scores across the 251,203 single-cell transcriptional profiles in (E). (G) UMAP representation of 1044 cells subjected to eleven days of consecutive erlotinib treatment. Cells are color-coded according to the day of sequencing (i.e. 0, 1, 2, 4, 9 and 11 days). Single-cell transcriptional profiles were normalized with the gficf package. (H) EMT activity scores against inferred cell pseudo-time using the activity scores of 50 hallmark gene sets downloaded from MSigDB. Cells are color-coded as in (G). (I–K) Same as (H) but for the Wnt, cholesterol and fatty acid pathways, respectively.
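The comparison in panel (F) relies on the Spearman correlation coefficient, which is the Pearson correlation computed on ranks. The following is a small, self-contained Python sketch of that calculation. The two score vectors are made-up toy values, not data from the paper, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy per-cell scores for one gene set from two hypothetical tools.
scores_tool_a = [0.1, 0.4, 0.35, 0.8, 0.9]
scores_tool_b = [1.0, 2.5, 2.0, 7.0, 30.0]

scc = spearman(scores_tool_a, scores_tool_b)
print(f"SCC = {scc:.3f}")
```

Because the statistic depends only on ranks, two scoring methods on very different scales still reach a high SCC as long as they order the cells similarly.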


Availability – The R package implementing the gf-icf pipeline, along with usage examples, is available at https://github.com/gambalab/gficf.


Franchini M, Pellecchia S, Viscido G, Gambardella G. (2023) Single-cell gene set enrichment analysis and transfer learning for functional annotation of scRNA-seq data. NAR Genomics and Bioinformatics 5(1); lqad024. [article]