---
title: "How to Build a Virtual Cell"
---

![A drug fails — and the cell didn't behave the way we predicted](../illustrations/virtual-cell/01-scene-opening-v2.png)

A drug fails in clinical trials. The mechanism looked right. The animal model worked. The target was validated. But the cell didn't behave the way we thought.

This happens constantly, not because we're bad at biology but because we're trying to reason about systems we can't fully observe. A human cell carries roughly 20,000 protein-coding genes interacting in a network so complex that even the best biologists can reason about only small pieces at a time.

A virtual cell changes that. Not as a simulation that tracks every molecule — we don't have that — but as a predictive model that's learned the underlying language of cellular behavior from data. Something you can query: "What happens to this cell when you knock out this gene?" "How does this cancer cell respond to this drug?" "What side effects would this perturbation cause in healthy tissue?"

This is where computational biology is heading. And building it turns out to be one of the harder problems in AI.

Here's how it works — and what makes it hard.

## A Brief History of Trying

The idea of computationally modeling a cell is not new. What's new is that it might actually work.

The first serious attempt was E-Cell, launched in 1997 by Masaru Tomita's lab at Keio University. The goal was to simulate a minimal living cell as executable software, with every metabolic reaction and every enzyme coded up as differential equations. They modeled a hypothetical minimal organism with 127 genes. It worked, in a narrow sense. But it required manually encoding every biochemical reaction, and biology has far too many of them. The approach didn't scale.

The same bottleneck killed the next generation of mechanistic models. Throughout the 2000s, labs built genome-scale metabolic models — constraint-based mathematical representations of all the metabolic reactions a cell can run. Bernhard Palsson's group at UCSD was particularly productive here, generating models for E. coli, yeast, and eventually human cell lines. These models were scientifically useful for studying metabolism and predicting growth phenotypes, but they captured only one slice of cell biology and required enormous expert curation to build.

The landmark of the mechanistic era came in 2012: Jonathan Karr and colleagues published a whole-cell model of Mycoplasma genitalium, the smallest free-living organism, in Cell. They modeled all 525 genes and 28 biological processes — DNA replication, transcription, translation, metabolism, cell division — as an integrated simulation. It took years to build and couldn't predict responses to novel perturbations. But it proved the concept: a computational cell model could reproduce cell behavior from first principles.

The problem was the same one that killed E-Cell. You can't hand-curate a model for a human cell. The complexity is thousands of times greater, and our knowledge of the regulatory wiring is far from complete. Mechanistic modeling hit a wall.

![Virtual Cell 1.0 vs 2.0: from hand-coded equations to learned models](../illustrations/virtual-cell/02-timeline-vc1-vs-vc2.png)

What changed everything was data — and a new generation of models built to learn from it. Call it Virtual Cell 2.0.

Starting around 2014, high-throughput single-cell sequencing became practical. Drop-seq, inDrop, and eventually 10x Genomics brought the cost of profiling a single cell down from thousands of dollars to pennies. By 2016, when the Human Cell Atlas launched with the goal of cataloguing every cell type in the human body, the field was generating data at a scale nobody had anticipated. The question shifted from "can we manually encode the biology?" to "can we learn it from data?"

An early answer came in 2019. scGen, from Fabian Theis's group, used a variational autoencoder to predict how cells respond to perturbations such as stimulation or infection, not by simulating biochemistry but by learning latent structure from perturbation data. It outperformed the methods it was benchmarked against and pointed the field in a new direction: instead of building models by hand, let the data do the teaching.
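
To make "learning latent structure" concrete: scGen, as published, pairs its autoencoder with vector arithmetic in latent space. The sketch below shows only that arithmetic step, on made-up three-dimensional latent vectors. The encoder and decoder are omitted, and every name and number here is a hypothetical stand-in rather than scGen's code.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def perturbation_delta(control_latents, treated_latents):
    """Average shift in latent space from control cells to treated cells."""
    mean_control = mean_vector(control_latents)
    mean_treated = mean_vector(treated_latents)
    return [t - c for c, t in zip(mean_control, mean_treated)]


def predict_treated(latent, delta):
    """Apply a learned shift to a cell that was never observed under treatment."""
    return [x + d for x, d in zip(latent, delta)]


# Toy latent vectors for cell type A, observed with and without the perturbation.
control_a = [[0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
treated_a = [[1.0, 1.0, 0.5], [1.5, 0.5, 0.5]]
delta = perturbation_delta(control_a, treated_a)

# Cell type B was only ever observed in the control condition.
control_b = [0.5, 0.0, 2.0]
predicted_b = predict_treated(control_b, delta)
print(delta, predicted_b)
```

The design assumption doing all the work is that the perturbation moves different cell types in roughly the same latent direction. When that assumption fails, so does the prediction.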

Then larger neural models arrived, many of them transformers. Geneformer (Theodoris et al., Nature 2023) pre-trained on about 30 million cells to learn gene network relationships. GEARS (Roohani et al., Nature Biotechnology 2023) combined gene interaction graphs with deep learning to predict combinatorial perturbation outcomes. scGPT (Cui et al., Nature Methods 2024) scaled to 33 million cells with generative pre-training across multiple tasks (annotation, perturbation prediction, batch correction, regulatory network inference), all from a single model.

By 2024, "virtual cell" had moved from academic curiosity to strategic priority. The Chan Zuckerberg Initiative announced a dedicated program. Xaira Therapeutics launched with a core thesis around building the perturbation datasets needed to make virtual cells work in drug discovery. Recursion, Isomorphic Labs, and others made similar bets.

More than a quarter century after E-Cell, the field has the data, the models, and the compute. What it's still working out is whether the models are actually learning the right things.

## The Foundation: Single-Cell Data

You can't build a virtual cell without data that captures cellular behavior at single-cell resolution. The key technology is single-cell RNA sequencing (scRNA-seq), which measures gene expression in individual cells rather than in bulk tissue.

Before scRNA-seq, if you took a tissue sample and measured its transcriptome, you'd get an average over millions of cells. A tumor sample would give you a blended signal: cancer cells, immune cells, fibroblasts, endothelium — all mixed together. You'd miss everything interesting about their individual behaviors and how they interact.
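
A toy calculation shows what the average hides. The counts below are invented for a single gene in two hypothetical samples of 100 cells each; the point is only that identical bulk signals can come from very different populations.

```python
from statistics import mean

# Hypothetical counts for one gene across 100 cells in two samples.
sample_uniform = [5] * 100           # every cell expresses the gene moderately
sample_rare = [0] * 90 + [50] * 10   # a rare population expresses it strongly

# A bulk measurement only sees the average over all cells.
bulk_uniform = mean(sample_uniform)
bulk_rare = mean(sample_rare)


def fraction_expressing(counts, threshold=1):
    """Share of cells with at least `threshold` counts of the gene."""
    return sum(c >= threshold for c in counts) / len(counts)


print(bulk_uniform, bulk_rare)  # identical bulk signal
print(fraction_expressing(sample_uniform), fraction_expressing(sample_rare))
```

Both samples have the same bulk mean, but in one of them only a tenth of the cells express the gene at all. Only a per-cell measurement can tell the two apart.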

Single-cell sequencing changed that. Now you can profile hundreds of thousands of individual cells in a single experiment, get a snapshot (sparse and noisy, but per-cell) of each one's transcriptional state, and study cell-type heterogeneity, rare populations, and dynamic transitions.

The datasets are large. CZ CELLxGENE, one of the largest public repositories of single-cell data, hosts on the order of 100 million cells spanning many tissues and diseases, mostly from human and mouse. scGPT was pre-trained on 33 million of them. This is the raw material.

## What the Model Actually Learns

The insight that drove the foundation model approach is simple: cells speak a language. That language is gene expression — which genes are active, how active, in what combinations. Different cell types speak different dialects. Cells transitioning between states change their expression in patterned ways. Perturbations — drugs, genetic knockouts, disease states — push cells in structured directions through this space.

A language model learns grammar from text. A single-cell foundation model learns grammar from gene expression matrices.

In scGPT, each cell is tokenized as a set of gene tokens, each paired with its expression value discretized into bins. (Geneformer takes a different route and orders a cell's genes by expression rank.) A transformer architecture then learns contextual relationships between genes: which genes tend to be active together, how their co-expression patterns define cellular identity, and how that identity shifts under different conditions.
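
Published single-cell foundation models turn an expression vector into tokens in a couple of ways: ordering genes by expression rank, or pairing each gene with a discretized (binned) expression value. The sketch below gives toy versions of both. It is an illustration under simplifying assumptions, not any model's actual preprocessing. The function names, the three-bin default, and the example counts are all hypothetical.

```python
import math


def rank_tokens(expression, max_len=None):
    """Order expressed genes from highest to lowest expression."""
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    # Sort by descending value; break ties alphabetically for determinism.
    expressed.sort(key=lambda gv: (-gv[1], gv[0]))
    genes = [gene for gene, _ in expressed]
    return genes[:max_len] if max_len else genes


def binned_tokens(expression, n_bins=3):
    """Pair each expressed gene with a per-cell expression bin in 1..n_bins."""
    expressed = {gene: value for gene, value in expression.items() if value > 0}
    values = sorted(expressed.values())
    tokens = []
    for gene, value in sorted(expressed.items()):
        # Position of this value among the cell's expressed values,
        # mapped onto n_bins roughly equal-sized bins.
        position = sum(v <= value for v in values)
        bin_id = math.ceil(position * n_bins / len(values))
        tokens.append((gene, bin_id))
    return tokens


# Hypothetical raw counts for one cell.
cell = {"CD3E": 12, "GAPDH": 40, "MS4A1": 0, "ACTB": 25}
print(rank_tokens(cell))
print(binned_tokens(cell))
```

Both schemes discard absolute counts in favor of within-cell relative levels, which makes the representation less sensitive to sequencing depth and batch effects.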

The key insight is that gene expression is not random. There are deep structural patterns — regulatory programs, pathway activations, lineage constraints — that manifest across millions of cells and conditions. A large enough model, trained on diverse enough data, can learn these patterns implicitly without being told what they are.

![How scGPT works: gene expression tokens flow through a transformer to produce cell type embeddings](../illustrations/virtual-cell/03-framework-model-learns.png)

The result is a model that produces dense, information-rich representations of cellular states. When training works, these representations capture biological meaning rather than technical artifacts: two cells with similar functional states end up close in representation space, even if they come from different donors, tissues, or experimental platforms.
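
"Close in representation space" usually means something like cosine similarity between embedding vectors. Here is a minimal sketch with invented three-dimensional embeddings (real ones have hundreds of dimensions, and the cell labels are hypothetical):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: two T cells from different donors, one hepatocyte.
t_cell_donor1 = [0.9, 0.1, 0.0]
t_cell_donor2 = [0.8, 0.2, 0.1]
hepatocyte = [0.0, 0.2, 0.9]

same_type = cosine_similarity(t_cell_donor1, t_cell_donor2)
different_type = cosine_similarity(t_cell_donor1, hepatocyte)
print(f"T cell vs T cell: {same_type:.2f}, T cell vs hepatocyte: {different_type:.2f}")
```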

## Perturbation: The Hard Part

Understanding cell identity is one thing. Predicting how a cell changes in response to an intervention is harder.

This is the core challenge for drug discovery. You have a candidate compound. You want to know: which cell types respond? Does it hit the intended target? Does it dysregulate anything else? What does the transcriptional response look like in a disease cell versus a healthy one?

The naive approach — train a model to predict "cell state after perturbation A" from "cell state before perturbation A" — runs into an immediate problem. Perturbations interact. Cells in different states respond differently to the same drug. Tissue context matters. The model needs to generalize to combinations and contexts it's never seen.

This is where most current models fall short, and why the field keeps producing papers that look impressive on benchmarks and then underperform on held-out conditions. The perturbation problem isn't a retrieval problem. It's a causal reasoning problem.
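
The gap between benchmark scores and held-out performance often comes down to how the data is split. A minimal sketch, assuming each record carries a perturbation label (the records and field names below are hypothetical): hold out whole perturbations, so that nothing in the test set was seen during training.

```python
def split_by_perturbation(records, held_out):
    """Split so that held-out perturbations never appear in training."""
    held_out = set(held_out)
    train = [r for r in records if r["perturbation"] not in held_out]
    test = [r for r in records if r["perturbation"] in held_out]
    return train, test


# Hypothetical perturbation records: one entry per profiled cell.
records = [
    {"cell_type": "T cell", "perturbation": "drug_A"},
    {"cell_type": "T cell", "perturbation": "drug_B"},
    {"cell_type": "B cell", "perturbation": "drug_A"},
    {"cell_type": "B cell", "perturbation": "drug_C"},
    {"cell_type": "T cell", "perturbation": "drug_C"},
]

train, test = split_by_perturbation(records, held_out=["drug_C"])
train_perturbations = {r["perturbation"] for r in train}
test_perturbations = {r["perturbation"] for r in test}
print(sorted(train_perturbations), sorted(test_perturbations))
```

A random per-cell split would place cells treated with the same drug on both sides of the split, and a model could then score well by memorizing. The same idea extends to holding out cell types or drug combinations.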

A virtual cell doesn't just need to memorize "this drug causes this response." It needs to model the underlying causal structure of the cell — the gene regulatory network, the signaling cascades, the feedback loops — well enough to reason about novel interventions from first principles.

![Perturbation prediction: a drug shifts a cell toward disease or health — the model must predict which](../illustrations/virtual-cell/04-infographic-perturbation.png)

That's a much harder target, and we're not there yet.
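
A deliberately tiny example shows why memorized associations are not enough. In this hypothetical toy cell, one regulator drives genes X and Y independently. The coefficients are invented and nothing here is fitted to data.

```python
def cell(regulator_activity, knock_down_x=False):
    """Toy structural model: one regulator drives genes X and Y independently."""
    x = 0.0 if knock_down_x else 2.0 * regulator_activity
    y = 3.0 * regulator_activity
    return {"X": x, "Y": y}


# Observational data: X and Y rise and fall together across cells.
observed = [cell(r) for r in (0.0, 1.0, 2.0, 3.0)]


def correlational_prediction(x):
    """What a purely associative model learns from `observed`: Y = 1.5 * X."""
    return 1.5 * x


# Intervention: knock down X in a cell whose regulator is still active.
intervened = cell(2.0, knock_down_x=True)
naive_y = correlational_prediction(intervened["X"])
actual_y = intervened["Y"]
print(f"associative model predicts Y={naive_y}, the cell produces Y={actual_y}")
```

The associative rule fits every observed cell perfectly and is still wrong about the knockdown, because X never caused Y. Telling those two situations apart takes interventional data.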

## Data Is the Bottleneck, Not Architecture

One of the most common mistakes in building virtual cells is assuming that architectural improvements will carry you to generalization. Make the transformer bigger. Add more layers. Change the attention mechanism. The numbers improve on standard benchmarks, and you declare progress.

But the fundamental constraint isn't the model. It's the data.

Biological data is deeply non-uniform. Most public single-cell datasets come from a handful of commonly studied tissues — blood, brain, lung, liver — from a handful of commonly studied organisms, collected with a handful of commonly used protocols. The distribution is far narrower than it appears.

A model trained on this distribution learns the patterns that are common to this distribution. Ask it to predict a response in a rare cell type, an underrepresented tissue, or a disease context that's poorly represented in training data, and performance degrades dramatically.

The path to a genuinely generalizable virtual cell runs through data diversity first. Not more cells from the same tissues — different tissues, different diseases, different organisms, different perturbation types, different measurement modalities. Perturbation screens at scale. Multi-omics data that combines transcriptomics with chromatin accessibility, protein levels, and spatial information.

At Xaira, this is the core bet: that the way to build virtual cells that work in drug discovery is to build the datasets that capture biology in its actual diversity, not its convenient proxy.

## Multi-Omics: Seeing More of the Cell

The transcriptome — gene expression — is one window into cell state. It's an important one, but it's incomplete.

Chromatin accessibility (measured by ATAC-seq) tells you which parts of the genome are available for transcription — the regulatory layer above gene expression. Protein levels tell you what the cell is actually doing, not just what it's planning to do. Spatial transcriptomics tells you where in a tissue the cell sits and who its neighbors are.

A virtual cell that only sees RNA is like trying to understand a city by only counting the lights on at night. You get real information, but you miss the street layout, the zoning laws, the supply chains.

![The four data layers of a virtual cell: genome, transcriptome, proteome, and spatial context](../illustrations/virtual-cell/05-framework-multiomics-v2.png)

Multi-modal foundation models — models that jointly reason over multiple data types — are technically harder to train but biologically richer. The challenge is data: multi-modal experiments are expensive and technically demanding, so the available data is sparse. Scaling multi-modal models requires solving data collection problems that no single lab can solve alone.
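
The sparsity problem is easy to see once you write down what a multi-modal record looks like. The structure below is a hypothetical sketch, not a standard format: each modality is optional, and the cells with every layer measured are a small subset.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class CellRecord:
    """One cell, with whichever modalities were actually measured."""
    cell_id: str
    rna: Optional[Dict[str, float]] = None          # gene -> expression
    atac: Optional[Dict[str, float]] = None         # peak -> accessibility
    protein: Optional[Dict[str, float]] = None      # protein -> abundance
    position: Optional[Tuple[float, float]] = None  # (x, y) within the tissue

    def modalities(self):
        names = ("rna", "atac", "protein", "position")
        return [name for name in names if getattr(self, name) is not None]


def fully_profiled(records):
    """Cells for which all four modalities are available."""
    return [r for r in records if len(r.modalities()) == 4]


# Hypothetical records: most cells have RNA only.
records = [
    CellRecord("c1", rna={"GAPDH": 40.0}),
    CellRecord("c2", rna={"GAPDH": 35.0}, atac={"chr1:1000-1500": 1.0}),
    CellRecord(
        "c3",
        rna={"GAPDH": 38.0},
        atac={"chr1:1000-1500": 0.0},
        protein={"CD3": 2.5},
        position=(10.0, 4.0),
    ),
]
print([r.cell_id for r in fully_profiled(records)])
```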

## What You Can Do With It

Once you have a well-trained virtual cell, the applications are concrete:

**Perturbation prediction**: Given a cell type in a disease state, predict the transcriptional response to a drug candidate before synthesizing it. Screen thousands of candidates computationally. Prioritize the ones most likely to push cells toward a healthy state.

**Combination screening**: Predict which combinations of perturbations have synergistic or antagonistic effects. Combinatorial space is too large to screen experimentally; virtual cells compress it.

**Cell type annotation**: Map new single-cell datasets against a reference atlas automatically. scGPT and its successors can annotate cell types with accuracy that rivals expert manual curation.

**Gene regulatory inference**: Use the attention patterns learned by the model to propose candidate regulatory relationships between genes, which then need experimental confirmation. Which transcription factors control which programs? Which regulatory elements are active in which contexts?

**Disease modeling**: Simulate how a patient's cells would behave given their genetic background and disease state. Understand why two patients with the same diagnosis respond differently to the same treatment.

None of these are solved problems. All of them are closer to being solved than they were five years ago.
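
As an illustration of the perturbation-prediction use case above, here is a minimal ranking sketch. It assumes some model has already produced a predicted post-treatment state for each candidate. The compound names, the three-dimensional states, and the choice of Euclidean distance are all hypothetical.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two cell-state vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_candidates(predicted_states, healthy_state):
    """Order candidates by how close their predicted state is to healthy."""
    return sorted(
        predicted_states,
        key=lambda name: euclidean(predicted_states[name], healthy_state),
    )


# Hypothetical reference state and model outputs for three compounds.
healthy_state = [1.0, 0.0, 2.0]
predicted_states = {
    "compound_A": [1.1, 0.1, 1.9],
    "compound_B": [3.0, 1.0, 0.0],
    "compound_C": [1.0, 0.5, 2.0],
}

ranking = rank_candidates(predicted_states, healthy_state)
print(ranking)
```

The ranking is only as good as the predicted states behind it, which is why the generalization problems described earlier matter so much in practice.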

## What's Still Hard

The honest accounting:

**Causality**: Current models learn correlations, not causes. A model can learn that drug A tends to downregulate gene X in this cell type. It cannot reliably tell you whether downregulating gene X would be therapeutic, neutral, or toxic in a new context where the upstream regulators are different. Causal reasoning over biological networks remains an open problem.

**Tissue context**: Most models treat cells as isolated objects. Real biology happens in tissues, where cells communicate, compete, and cooperate. A hepatocyte in an inflamed liver behaves differently from a hepatocyte in isolation. Modeling cell-cell interactions at the level of the transcriptome is technically tractable but data-hungry.

**Out-of-distribution generalization**: Models trained on existing data generalize to similar data. Novel cell types, novel diseases, novel organisms — these are genuinely out-of-distribution problems. How much data diversity do you need before the model generalizes instead of memorizing? We don't know yet.

**Validation**: How do you know a virtual cell is right? You run the wet lab experiment and compare. But wet lab experiments are expensive, slow, and noisy. The feedback loop between virtual prediction and experimental validation is long enough that it's hard to iterate quickly.
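
One common way to score the comparison is to correlate predicted expression changes with measured ones. A minimal sketch with invented numbers for four genes follows. Pearson correlation is just one reasonable choice of metric, and the values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical change in expression (treated minus control) for four genes.
predicted_delta = [1.0, -0.5, 0.0, 2.0]
observed_delta = [0.8, -0.4, 0.1, 1.5]

score = pearson(predicted_delta, observed_delta)
print(f"prediction vs. experiment: r = {score:.3f}")
```

Scoring the change rather than the raw expression matters. Most genes barely move under most perturbations, so a model that predicts "nothing happens" can look accurate on raw values while being useless.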

## The Path Forward

A virtual cell will not replace wet lab science. It will compress the search space.

Drug discovery today is an enormous, expensive, largely empirical process. You generate hypotheses, test them in cells, test them in animals, test them in people, and most of the time they fail. A good virtual cell doesn't eliminate those steps. It makes the early-stage search more rational — letting you ask "which of these 10,000 candidate compounds is worth synthesizing and testing?" rather than picking based on intuition and incomplete prior knowledge.

That compression matters. If the hit rate at the screening stage improved 10x, the gain would propagate forward into fewer failed trials, lower development costs, and faster paths to medicines that work.
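
The arithmetic is simple but worth seeing. The numbers below are purely illustrative assumptions (one hit per 1,000 compounds at baseline, a target of 10 hits), not measured rates.

```python
def compounds_to_screen(target_hits, compounds_per_hit):
    """Expected number of compounds to test to reach a target number of hits."""
    return target_hits * compounds_per_hit


# Hypothetical baseline: one hit per 1,000 compounds screened.
baseline = compounds_to_screen(target_hits=10, compounds_per_hit=1000)

# Hypothetical 10x better hit rate: one hit per 100 compounds.
improved = compounds_to_screen(target_hits=10, compounds_per_hit=100)

print(baseline, improved, baseline - improved)
```

Under these assumptions the same number of hits takes 1,000 wet-lab tests instead of 10,000.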

The technical pieces — foundation models, single-cell data, perturbation screens, multi-omics — exist today. The challenge is assembling them into a system that generalizes reliably, validating it against real biology, and iterating quickly enough to learn from what the experiments show.

That's what building a virtual cell actually looks like. Not a breakthrough moment. A long, careful engineering effort, grounded in the best biology we have.

---

*Bo Wang is a co-creator of scGPT and a researcher at Xaira Therapeutics, where the team is building causally rich perturbation datasets and virtual cell models for drug discovery.*