# Virtual Cell

**One-line definition:** A computational model that simulates cellular behavior — gene expression, state transitions, and responses to perturbations — without wet-lab experiments.

## What it is

A Virtual Cell is an in silico representation of a cell that can predict how it would respond to interventions (genetic knockouts, drug treatments, cytokine stimulation, etc.). The term covers a spectrum from narrow perturbation predictors to ambitious whole-cell simulators.

The key distinction Bo has emphasized: **a Virtual Cell is not just a perturbation predictor**. A perturbation predictor interpolates within a training distribution. A true Virtual Cell would generalize causally — predicting responses to novel combinations, unseen cell types, and out-of-distribution conditions.

## Why it matters

Drug discovery currently depends on extensive wet-lab screening. A reliable Virtual Cell would:
- Prioritize which perturbations to actually run
- Predict combination effects (two drugs, two knockouts)
- Transfer predictions across cell types and species
- Potentially shorten the discovery timeline from years to months

## Key approaches

### Foundation models (static representations)
- scGPT, Geneformer, TranscriptFormer, Cell-JEPA — learn embeddings from large corpora of scRNA-seq data
- Primary limitation: trained on observational data; no causal perturbation signal baked in
- Good for: cell type annotation, batch correction, representation transfer

### Perturbation-conditioned models
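- CPA (Lotfollahi et al., bioRxiv 2021; Molecular Systems Biology 2023) — disentangles the basal cell state from perturbation and covariate effects in latent space, then recombines them additively
- GEARS (Roohani et al., Nature Biotechnology 2023) — graph neural network over a gene co-expression graph and a Gene Ontology-derived perturbation graph; reported to generalize to two-gene combinations that include genes never perturbed in training
- SAMS-VAE (Sparse Additive Mechanism Shift VAE) — models each perturbation as a sparse additive shift in latent space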
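
The additive latent structure these models share can be illustrated with a toy sketch. This is not the CPA or SAMS-VAE implementation (both learn their encoders, decoders, and embeddings with neural networks). The function `compose_latent`, the embedding values, and the perturbation names `drug_x` and `ko_y` are hypothetical placeholders chosen for illustration.

```python
from typing import Dict, List, Sequence


def compose_latent(
    z_basal: Sequence[float],
    pert_embeddings: Dict[str, Sequence[float]],
    applied: Sequence[str],
    doses: Dict[str, float],
) -> List[float]:
    """Add dose-scaled perturbation embeddings to a basal latent state."""
    z = list(z_basal)
    for name in applied:
        dose = doses.get(name, 1.0)  # assumption: a missing dose means 1.0
        z = [zi + dose * ei for zi, ei in zip(z, pert_embeddings[name])]
    return z


# Toy, made-up values: a 2-D latent space and two hypothetical perturbations.
z_basal = [0.0, 1.0]
pert_embeddings = {"drug_x": [1.0, 0.0], "ko_y": [0.0, -0.5]}

# Compose a combination from its parts, with drug_x applied at dose 2.0.
z_combo = compose_latent(z_basal, pert_embeddings, ["drug_x", "ko_y"], {"drug_x": 2.0})
print(z_combo)
```

The appeal of the additive form is that a combination never seen in training can still be composed from its parts. The same property limits it when the real response is non-additive.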

### Generative cellular world models
- Lingshu-Cell (2026) — masked discrete diffusion; models transcriptomic state distributions, not just embeddings; can simulate conditional cell states
- Goal: treat the cell like a generative world model, sample from it, condition on arbitrary interventions
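
As a rough illustration of the "masked discrete diffusion" idea, the sketch below shows only a generic forward (corruption) step over discretized expression tokens: each token is independently replaced by a mask symbol with probability `t`. It is not taken from Lingshu-Cell. The token values, the `MASK` sentinel, and the function name `corrupt` are assumptions for illustration, and the learned denoiser that would reverse the corruption is omitted.

```python
import random
from typing import List

MASK = -1  # hypothetical sentinel for a masked expression token


def corrupt(tokens: List[int], t: float, rng: random.Random) -> List[int]:
    """Mask each token independently with probability t (0 = clean, 1 = all masked)."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    return [MASK if rng.random() < t else tok for tok in tokens]


# Toy cell: expression already binned into integer levels (made-up values).
cell_tokens = [0, 3, 1, 0, 2, 5]
rng = random.Random(0)

clean = corrupt(cell_tokens, 0.0, rng)          # nothing is masked
partly_masked = corrupt(cell_tokens, 0.5, rng)  # roughly half masked on average
fully_masked = corrupt(cell_tokens, 1.0, rng)   # every token is masked
```

A generative model of this kind would be trained to recover the original tokens from the partly masked ones, optionally conditioned on an intervention label.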

### Causal approaches
- Norman et al. (Science 2019) — large-scale Perturb-seq (CRISPRa) genetic interaction screen in K562 cells; characterized additive vs. epistatic effects of paired perturbations
- Key insight: most double-gene perturbations are approximately additive; epistatic interactions are rare but biologically critical
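
The additive-versus-epistatic distinction can be made concrete with a small worked example. Under an additive assumption, the expected double-perturbation profile is the control plus the two single-perturbation shifts, and whatever remains is the interaction. The sketch below uses made-up expression values for three genes; the function names are illustrative, and this is not the scoring used by Norman et al.

```python
from typing import List


def delta(perturbed: List[float], control: List[float]) -> List[float]:
    """Per-gene expression change relative to control."""
    return [p - c for p, c in zip(perturbed, control)]


def additive_expectation(
    control: List[float], single_a: List[float], single_b: List[float]
) -> List[float]:
    """Expected double-perturbation profile if the two effects simply add."""
    da = delta(single_a, control)
    db = delta(single_b, control)
    return [c + x + y for c, x, y in zip(control, da, db)]


def epistasis_residual(
    control: List[float],
    single_a: List[float],
    single_b: List[float],
    double_ab: List[float],
) -> List[float]:
    """Observed double perturbation minus the additive expectation."""
    expected = additive_expectation(control, single_a, single_b)
    return [obs - exp for obs, exp in zip(double_ab, expected)]


# Toy, made-up expression values for three genes.
control = [1.0, 2.0, 3.0]
single_a = [1.5, 2.0, 3.0]   # perturbation A raises gene 0
single_b = [1.0, 1.0, 3.0]   # perturbation B lowers gene 1
double_ab = [1.5, 1.0, 5.0]  # A+B additionally raises gene 2

residual = epistasis_residual(control, single_a, single_b, double_ab)
print(residual)  # [0.0, 0.0, 2.0]
```

Genes 0 and 1 behave additively (residual 0), while gene 2 shows an interaction that neither single perturbation predicts.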

## Current limitations (Bo's perspective)

1. **Data diversity bottleneck** — most models are trained mainly on the CellxGene corpus, which is largely observational and lacks causally rich, contextually diverse perturbation data
2. **Extrapolation vs. interpolation** — models predict well within the training distribution but fail on truly novel combinations
3. **Scaling the wrong thing** — bigger models on the same data don't fix the data problem; better perturbation atlases are needed
4. **Cell context** — the same perturbation gives different outcomes in different cell types and states; current models capture this poorly
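
One way to make the interpolation-versus-extrapolation point measurable is to label each held-out combination by how many of its genes were ever perturbed in training, in the spirit of the seen-gene categories used in combinatorial perturbation benchmarks. The sketch below is a minimal version of that bookkeeping; the gene names `A` through `E` and the function names are hypothetical.

```python
from typing import Iterable, Set, Tuple

Perturbation = Tuple[str, ...]


def genes_seen(train: Iterable[Perturbation]) -> Set[str]:
    """All genes that appear in at least one training perturbation."""
    seen: Set[str] = set()
    for pert in train:
        seen.update(pert)
    return seen


def novelty(test_pert: Perturbation, train: Iterable[Perturbation]) -> str:
    """Label a test perturbation by how much of it was seen in training."""
    train = list(train)
    # Gene order within a combination is ignored.
    if frozenset(test_pert) in {frozenset(p) for p in train}:
        return "in training set"
    seen = genes_seen(train)
    n_seen = sum(gene in seen for gene in test_pert)
    return f"{n_seen}/{len(test_pert)} genes seen"


# Toy training set: three single perturbations and one pair.
train = [("A",), ("B",), ("C",), ("A", "B")]
test = [("B", "A"), ("A", "C"), ("A", "D"), ("D", "E")]

labels = {pert: novelty(pert, train) for pert in test}
for pert, label in labels.items():
    print(pert, "->", label)
```

Reporting accuracy separately for each label shows whether a model's performance comes from recombining seen genes or from genuine generalization to unseen ones.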

## Connections

- [../papers/lingshu-cell.md](../papers/lingshu-cell.md) — recent generative world model for virtual cells
- [../papers/scgpt.md](../papers/scgpt.md) — Bo's model; foundation model trained on 33M cells
- [../papers/cell-jepa.md](../papers/cell-jepa.md) — JEPA-style alternative to masked reconstruction
- [../concepts/perturbation-biology.md](../concepts/perturbation-biology.md) — the data side
- [../concepts/single-cell-foundation-models.md](../concepts/single-cell-foundation-models.md) — the model side
- [../entities/xaira-therapeutics.md](../entities/xaira-therapeutics.md) — Bo's org; building perturbation data infrastructure
- [../entities/cellxgene.md](../entities/cellxgene.md) — primary training data source

## Open questions

- When does a model cross from "perturbation predictor" to "Virtual Cell"? What's the minimal test?
- Can causal representation learning (e.g., do-calculus, invariant risk minimization) be applied at scale to scRNA-seq?
- How much does cell state (not just type) matter for perturbation response?
- Is the Virtual Cell Challenge benchmark actually measuring the right thing?
