Making sense of silence in gene regulatory networks with RNA-Seq




The study of gene regulation has the potential to illuminate fundamental biological mechanisms, such as cell differentiation and the development of organisms. It can also provide insights into the sources of disease and into how scientists can develop medicines to treat it.

Gaining actionable insights from the study of genes, however, is a complex process. One reason is that studying genes directly gives scientists information that is valuable but limited in scope.

A common analogy compares genes to blueprints for a house: they illustrate what is possible, but don’t tell you what rooms were actually built or the condition of the home at a certain point in time. For that reason, scientists look to the products of genes — RNA and proteins — to gather what is known as gene expression data that provides insights into which genes are active and which are silent.

Through an approach known as gene regulatory network inference, scientists can analyze RNA and use statistical methods to infer how genes influence one another.

“Gene regulatory networks are by their nature causal relationships,” said Gongxu Luo, a Ph.D. student at Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), referring to the way genes interact with one another and influence gene expression. “Scientists want to try to understand how one gene affects another and how that can lead to disease.”

Luo is co-author of a recent study that proposes a new statistical method for analyzing data that is used for gene regulatory network inference. It is a collaboration between researchers at MBZUAI, Carnegie Mellon University and the Broad Institute of MIT and Harvard.

The research will be given as an oral presentation at the Twelfth International Conference on Learning Representations (ICLR 2024), and it is one of 39 studies by MBZUAI scientists that will be shared at the conference. Kun Zhang, acting chair of machine learning, professor of machine learning and director of the center for integrative artificial intelligence (CIAI) at MBZUAI, also contributed to the research.

Gene regulatory network inference is based on analysis of data produced by a technique called single-cell RNA sequencing, or RNA-seq. With this approach, the RNA contained in a single cell is analyzed, providing a snapshot of which genes are active in the cell at that time. When scientists detect RNA produced by a particular gene, the measured amount of which is known as the expression value, they can tell that the gene is active.

Expression values can also be zero, which typically means that a gene isn’t expressed. Unfortunately, analysis is complicated by the fact that zeros can arise for two reasons. Some occur because the gene really isn’t expressed. Others stem from technical issues: the expression value could be too low to register, or the machine doing the sequencing simply missed it. These technical zeros are called dropouts.

“Single-cell RNA-seq is a quite new technology and there is always a high percentage of zeros,” Luo said. “Yet we need to learn something from the data. Our approach can help address this dropout problem.”

Traditionally, an approach called imputation was used to make sense of RNA-seq data that had many zeros. Imputation methods fill in missing values with estimated or predicted values based on available data.

“But once you fill that hole, the distribution will change,” Luo said. “You can never assess the real distribution of the gene.”
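Luo's point can be illustrated with a small sketch. It uses plain mean imputation, a deliberately simple stand-in chosen here for illustration and not one of the imputation methods actually used on RNA-seq data. The expression values and the helper name `mean_impute_zeros` are hypothetical.

```python
import numpy as np

def mean_impute_zeros(values):
    """Replace every zero with the mean of the non-zero entries."""
    values = np.asarray(values, dtype=float)
    observed = values != 0
    filled = values.copy()
    filled[~observed] = values[observed].mean()
    return filled

# Hypothetical true expression values for one gene across six cells.
true_expr = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
# The same gene as recorded: two cells read zero because of dropout.
observed_expr = np.array([0.0, 2.0, 3.0, 4.0, 5.0, 0.0])

imputed_expr = mean_impute_zeros(observed_expr)
true_var = true_expr.var()        # spread of the true values
imputed_var = imputed_expr.var()  # spread after filling the holes
print(true_var, imputed_var)
```

In this toy case the variance falls from about 2.92 to about 0.83: the filled-in values sit at the center of the data, so the imputed distribution is narrower than the true one.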

Luo and his colleagues found that with their approach, which they call the causal dropout model, they don’t need to know whether a zero results from a gene not being expressed or from a technical error.

With their model, they found that the conditional independence relations in the non-missing data are consistent with those in data following the true distribution. This means they can recover the true causal relationships with the help of statistical methods such as test-wise deletion, which drops the affected samples separately for each statistical test rather than from the whole dataset. Through this approach, even gene expression data with a large number of dropouts is usable.

“If we are able to better understand the causal relationships between genes, scientists may be able to produce medicines that influence the genes that cause the disease,” Luo said. “But if there is no real causal relationship between genes, you could intervene and it wouldn’t work. That’s why we need a tool that focuses on the causal relationship.”

Source: Mohamed bin Zayed University of Artificial Intelligence


Dai H, Ng I, Luo G, Spirtes P, Stojanov P, Zhang K. (2024) Gene Regulatory Network Inference in the Presence of Dropouts: a Causal View. arXiv [online preprint].