How to Build a Virtual Cell
From mechanistic models to foundation models
Two decades of trying
- E-Cell (1997): simulate a minimal organism — required manually encoding every reaction
- Whole-cell model of M. genitalium (Karr et al. 2012): 525 genes, took years, couldn't generalize
- Single-cell sequencing changed everything: 100M+ cells now in CZ CELLxGENE
scGPT trained on 33M cells → cell type annotation accuracy rivals expert curation
💡 Timeline from 1997 to 2026 showing Virtual Cell (VC) 1.0 vs VC 2.0
The Perturbation Problem
- Current models predict within-distribution perturbations reasonably well
- Novel combinations and unseen cell types: performance drops sharply
- Root cause: we're learning correlations, not causal mechanisms
GEARS: 0.81 R² on seen gene pairs → 0.47 on unseen combinations
💡 Two scatter plots: seen perturbations (tight) vs unseen (diffuse)
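Speaker note: the seen vs unseen gap above can be measured with a simple held-out split. The sketch below is a minimal illustration, not GEARS's actual evaluation code. The gene names, the `results` layout, and the helper `split_r2` are hypothetical, and the numbers are toy values chosen to make the arithmetic easy to check. It assumes each gene pair comes with an observed and a predicted expression-change vector, and that a pair counts as "seen" only if that exact combination appeared in training.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Pair = Tuple[str, str]
Vectors = Tuple[List[float], List[float]]  # (observed, predicted)


def r_squared(observed: List[float], predicted: List[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def split_r2(
    results: Dict[Pair, Vectors],
    train_pairs: Set[FrozenSet[str]],
) -> Dict[str, float]:
    """Pool values by whether the gene pair was in training, then score each pool."""
    pooled: Dict[str, Vectors] = {"seen": ([], []), "unseen": ([], [])}
    for pair, (obs, pred) in results.items():
        # frozenset makes the lookup order-insensitive: (A, B) == (B, A)
        key = "seen" if frozenset(pair) in train_pairs else "unseen"
        pooled[key][0].extend(obs)
        pooled[key][1].extend(pred)
    return {k: r_squared(o, p) for k, (o, p) in pooled.items() if o}


# Toy data (hypothetical genes and values, not real measurements).
train_pairs = {frozenset(("G1", "G2"))}
results = {
    ("G2", "G1"): ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),  # seen pair, exact prediction
    ("G3", "G4"): ([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]),  # unseen pair, predicts the mean
}

scores = split_r2(results, train_pairs)
print(scores)
```

In this toy case the seen pair scores an R² of 1 and the unseen pair scores 0, because predicting the mean explains none of the variance. A real evaluation would hold out whole combinations (and ideally whole cell types) before training, so that "unseen" means the model never saw that context.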
Data Diversity Is the Bottleneck
What Xaira Is Building
- Causally-rich perturbation datasets at scale — not just more cells, but more diverse contexts
- Multi-omics: RNA + chromatin + protein + spatial in the same model
- Validation against real wet lab outcomes, not benchmark proxies
The goal: a 10x improvement in hit rate at the drug discovery screening stage
💡 Data diversity diagram: tissues × organisms × perturbation types