Understanding gene expression and genetic variation with MAGE



Genetic variation contributes to differences in how genes are expressed and how their transcripts are spliced. These differences, in turn, contribute to the diversity of physical traits and disease susceptibility among people. However, most studies of these genetic effects have focused on people of European ancestry. This narrow focus limits how well the findings generalize across human populations and constrains our understanding of human evolution.

To address this gap, researchers at Johns Hopkins University have developed MAGE (Multi-ancestry Analysis of Gene Expression), an open-access RNA sequencing (RNA-seq) dataset. It contains data from lymphoblastoid cell lines (LCLs) derived from 731 individuals in the 1000 Genomes Project (1KGP), representing five continental groups and 26 distinct populations. By analyzing this diverse dataset, researchers can gain a more comprehensive understanding of genetic influences on gene expression and splicing.

A globally diverse transcriptomics dataset

Fig. 1

a, RNA-seq data were generated from LCLs from 731 individuals from the 1KGP, roughly evenly distributed across 26 populations and 5 continental groups. Populations included in MAGE are indicated in pink, whereas the Maasai population is in blue as it is present in the AFGR dataset (based on sequencing of HapMap cell lines) but not in the 1KGP or MAGE. Full population descriptors can be found at https://catalog.coriell.org/1/NHGRI/About/Guidelines-for-Referring-to-Populations. b, Genotype principal component 1 (PC1) and PC2 comparing MAGE to other large studies with paired RNA and whole-genome sequencing data. Samples from the specified study (that is, MAGE, Geuvadis, GTEx and AFGR) are depicted with coloured points, whereas samples from other studies are depicted with grey points in each respective panel. c, Proportion of variance explained by the first ten PCs. d, ADMIXTURE results displaying proportions of individual genomes (columns) attributed to inferred ancestry components. For MAGE, Geuvadis and AFGR, samples are stratified according to population and continental group labels from the respective source projects, whereas GTEx does not include population labels. A subset of 1KGP samples are present across multiple RNA-seq studies and therefore appear in multiple panels, but these samples were not duplicated within the input to ADMIXTURE. Ancestry components are modelling constructs that do not directly correspond to true ancestral populations, and the results of ADMIXTURE analysis strongly depend on sampling characteristics of the input data. Although k = 7 minimizes the cross-validation error within this combined dataset, alternative choices of k reflect structure at different scales. Map in a adapted from the US CIA World Factbook, 2005.

Key Findings from the MAGE Dataset

  1. Variation Within and Between Populations:
    • Most variation in gene expression (92%) and splicing (95%) is found within populations rather than between them. This mirrors the distribution of genetic variation at the DNA level and indicates that population labels account for only a small share of the differences among individuals.
  2. Genetic Associations with Gene Expression and Splicing:
    • The study identified associations between specific genetic variants and the expression and splicing of nearby genes, known as cis-expression quantitative trait loci (eQTLs) and cis-splicing QTLs (sQTLs), respectively.
    • The researchers identified more than 15,000 putatively causal eQTLs and more than 16,000 putatively causal sQTLs. These variants are enriched for relevant epigenomic signatures, which supports a role in gene regulation.
  3. Population-Specific Variants:
    • Among the identified eQTLs and sQTLs, 1,310 eQTLs and 1,657 sQTLs are largely unique to underrepresented populations. This highlights the importance of including diverse populations in genetic research to uncover variants that might be missed in more homogeneous studies.
  4. Consistency Across Populations:
    • The effects of causal eQTLs are highly consistent in both magnitude and direction across populations. This suggests that the genetic mechanisms regulating gene expression are largely shared globally, even though some variants are found only in particular populations.
    • Apparent ‘population-specific’ effects reported in previous studies were largely driven by low mapping resolution or by additional, undetected independent eQTLs for the same genes.
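
To make the first finding concrete, the share of variance "within" versus "between" populations can be illustrated with a simple sum-of-squares decomposition for a single gene. This is a minimal sketch on simulated data, not the authors' analysis: the function name `within_between_variance`, the group sizes, and the simulated effect sizes are all assumptions chosen for illustration.

```python
import numpy as np

def within_between_variance(values, labels):
    """Split the total sum of squares into within- and between-group shares.

    Hypothetical helper for illustration; assumes the values are not all equal.
    """
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    grand_mean = values.mean()
    ss_total = ((values - grand_mean) ** 2).sum()

    # Between-group part: squared distance of each group mean from the
    # grand mean, weighted by group size.
    ss_between = 0.0
    for group in np.unique(labels):
        group_values = values[labels == group]
        ss_between += group_values.size * (group_values.mean() - grand_mean) ** 2

    ss_within = ss_total - ss_between
    return ss_within / ss_total, ss_between / ss_total

# Simulated toy data (assumption): 26 groups of 28 samples each, with small
# shifts in group means and much larger individual-level noise.
rng = np.random.default_rng(0)
n_groups, n_per_group = 26, 28
group_means = rng.normal(0.0, 0.3, size=n_groups)
labels = np.repeat(np.arange(n_groups), n_per_group)
expression = group_means[labels] + rng.normal(0.0, 1.0, size=labels.size)

within_share, between_share = within_between_variance(expression, labels)
print(f"within: {within_share:.1%}, between: {between_share:.1%}")
```

When individual-level noise dominates the shifts in group means, as in this simulation, most of the variance falls in the within-group share, which is the pattern the study reports for expression and splicing.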

Implications and Future Research

The MAGE dataset significantly broadens our understanding of human genetic diversity in gene expression and splicing. By including a diverse range of populations, this resource helps ensure that genetic research findings are more generalizable and applicable to a broader range of people.

The consistency of eQTL effects across populations suggests that findings from genetic studies in one population can be relevant to others, although unique variants in underrepresented groups highlight the need for inclusive research. The MAGE dataset serves as a valuable resource for studying human genome evolution and function, paving the way for more comprehensive and equitable genetic research.
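
One way to picture a consistent eQTL effect is to estimate the same variant's effect separately in two groups that carry the variant at different frequencies. The sketch below is a simplified illustration on simulated data, assuming a plain least-squares slope of expression on genotype dosage; real eQTL mapping also involves expression normalization, covariates, and fine-mapping. The names `eqtl_slope`, `group_A`, and `group_B`, and all numeric settings, are hypothetical.

```python
import numpy as np

def eqtl_slope(dosage, expression):
    """Least-squares slope of expression on genotype dosage (0, 1 or 2 copies).

    Hypothetical helper for illustration; assumes dosage is not constant.
    """
    dosage = np.asarray(dosage, dtype=float)
    expression = np.asarray(expression, dtype=float)
    centered = dosage - dosage.mean()
    return float((centered * (expression - expression.mean())).sum()
                 / (centered ** 2).sum())

# Simulated toy data (assumption): one variant with the same true effect in
# both groups, but a different allele frequency in each.
rng = np.random.default_rng(1)
true_effect = 0.5
allele_freqs = {"group_A": 0.1, "group_B": 0.4}

slopes = {}
for name, freq in allele_freqs.items():
    dosage = rng.binomial(2, freq, size=5000)
    expression = true_effect * dosage + rng.normal(0.0, 1.0, size=dosage.size)
    slopes[name] = eqtl_slope(dosage, expression)

for name, slope in slopes.items():
    print(f"{name}: estimated effect = {slope:.2f}")
```

In this simulation the estimated slopes are similar in both groups because the underlying effect is shared. The allele frequency changes how precisely the effect can be estimated and how much variance it explains, not the effect itself.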

The development and analysis of the MAGE dataset represent a significant step forward in genetic research. By studying diverse populations, scientists can gain deeper insight into the genetic basis of human traits and diseases, which may ultimately support more personalized and effective healthcare.


Taylor DJ, Chhetri SB, Tassia MG, Biddanda A, Yan SM, Wojcik GL, Battle A, McCoy RC. (2024) Sources of gene expression variation in a globally diverse human cohort. Nature [Epub ahead of print].