RNA-MuTect-WMN – Estimating tumor mutational burden from RNA-sequencing without a matched-normal sample

Detecting somatic mutations in patient sequencing data has many clinical applications, including the identification of cancer driver genes, the detection of mutational signatures, and the estimation of tumor mutational burden (TMB). Researchers at the University of Maryland and the Technion in Israel previously developed a tool, RNA-MuTect, that detects somatic mutations from tumor RNA and matched-normal DNA. Here, they extend it to detect somatic mutations from RNA-sequencing data without a matched-normal sample, using a machine-learning approach that classifies each mutation as somatic or germline based on various features of the variant. Applied to RNA-sequencing data from more than 450 melanoma samples, the method achieves high precision and recall and correctly identifies both mutational signatures and driver genes. Finally, the researchers show that RNA-based TMB is significantly associated with patient survival, at a significance level similar to or higher than that of DNA-based TMB. The pipeline can be applied to new and existing datasets for which only RNA is available.
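For orientation, TMB is commonly reported as the number of somatic mutations per megabase of sequence that was adequately covered. The snippet below is a minimal sketch of that arithmetic only. The function name, the normalization by covered bases, and the input numbers are illustrative assumptions, not the definition or data used in the study.

```python
def tmb_per_megabase(n_somatic_mutations: int, covered_bases: int) -> float:
    """Somatic mutations per megabase of covered sequence (hypothetical helper)."""
    if covered_bases <= 0:
        raise ValueError("covered_bases must be positive")
    return n_somatic_mutations / (covered_bases / 1_000_000)


# Illustrative inputs only (not data from the study):
# 300 predicted somatic mutations over 30 Mb of covered sequence.
example_tmb = tmb_per_megabase(n_somatic_mutations=300, covered_bases=30_000_000)
print(example_tmb)
```

In an RNA-only setting, the numerator would be the count of variants that the classifier labels as somatic after filtering.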

Summary of pipeline predictions

Fig. 1

a Overview of the RNA-MuTect-WMN pipeline. In the training set (n = 100, green arrows), RNA-MuTect is applied to tumor RNA and matched-normal DNA to obtain a list of variants labeled as somatic or germline. A random forest classifier is then trained on the features collected for each variant, using 5-fold cross-validation. In the test set (orange arrows), three steps are performed: (1) MuTect is applied to tumor RNA without a matched-normal sample, yielding a mixed list of somatic and germline variants; (2) the five trained models are applied to these variants, and each variant is classified as somatic or germline by majority vote; (3) the predicted set of variants is further filtered by the RNA-MuTect filtering steps.

b Distribution of precision and recall values, computed for each sample, on the validation (left) and test (right) sets.

c Precision as a function of the number of true somatic mutations per sample.

d Correlation between the number of predicted somatic mutations and the number of somatic mutations determined from DNA with a matched-normal DNA sample.

e Correlation between the number of predicted somatic mutations and the number of somatic mutations determined from RNA with a matched-normal DNA sample.

f Distribution of precision and recall values, computed for each sample in the lung dataset, on the validation (left) and test (right) sets.

g Distribution of precision and recall values, computed for each sample in the colon dataset, on the validation (left) and test (right) sets.

In panels b, f, and g, box plots show the median and the 25th and 75th percentiles; the whiskers extend to the most extreme data points not considered outliers, and outliers are shown as dots. Source data are provided as a Source Data file.
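The majority-vote step in panel a and the per-sample precision and recall in panel b can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the five "models" are hypothetical threshold rules on a made-up allele-fraction feature (`vaf`), standing in for the five random forest models, and the variants and truth labels are invented.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label chosen by most models (an odd model count avoids ties)."""
    return Counter(labels).most_common(1)[0][0]


def classify_variants(models, variants):
    """Apply every fold model to every variant and keep the majority label."""
    return [majority_vote([model(v) for model in models]) for v in variants]


def precision_recall(predicted, truth, positive="somatic"):
    """Precision and recall for the positive class within one sample."""
    pairs = list(zip(predicted, truth))
    tp = sum(p == positive and t == positive for p, t in pairs)
    fp = sum(p == positive and t != positive for p, t in pairs)
    fn = sum(p != positive and t == positive for p, t in pairs)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall


def make_threshold_model(threshold):
    """Hypothetical stand-in for one trained fold model."""
    return lambda variant: "somatic" if variant["vaf"] < threshold else "germline"


# Five stand-in models, one per cross-validation fold (thresholds are made up).
models = [make_threshold_model(t) for t in (0.30, 0.35, 0.40, 0.45, 0.50)]

# Invented variants from one sample, with invented reference labels.
variants = [{"vaf": 0.20}, {"vaf": 0.42}, {"vaf": 0.55}, {"vaf": 0.33}]
truth = ["somatic", "somatic", "germline", "somatic"]

predicted = classify_variants(models, variants)
precision, recall = precision_recall(predicted, truth)
print(predicted, precision, recall)
```

Here the second variant receives only two of five "somatic" votes and is called germline, so the sample has no false positives but one missed somatic mutation, which lowers recall while leaving precision unchanged.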


Katzir R, Rudberg N, Yizhak K. (2022) Estimating tumor mutational burden from RNA-sequencing without a matched-normal sample. Nat Commun 13(1):3092. [article]