How to Build a Virtual Cell

Two Decades of Trying

  • E-Cell (1997): simulated a minimal organism, but every reaction had to be encoded by hand
  • Whole-cell model of M. genitalium (Karr et al. 2012): 525 genes, took years to build, and did not generalize beyond that one organism
  • Single-cell sequencing changed everything — 100M+ cells now in CellxGene
scGPT trained on 33M cells → cell type annotation accuracy rivals expert curation
💡 Timeline from 1997 to 2026 showing virtual cell (VC) 1.0 vs VC 2.0

The Perturbation Problem

  • Current models predict within-distribution perturbations reasonably well
  • Novel combinations and unseen cell types: performance drops sharply
  • Root cause: we're learning correlations, not causal mechanisms
GEARS: 0.81 R² on seen gene pairs → 0.47 on unseen combinations
💡 Two scatter plots: seen perturbations (tight) vs unseen (diffuse)
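
Speaker note: the seen vs. unseen gap can be illustrated with a small evaluation sketch. This is not the GEARS evaluation code. It assumes R² means the coefficient of determination, that each perturbation is summarized by a single scalar effect, and that a gene pair counts as "seen" if the same unordered pair appeared in training. All names and numbers below are hypothetical toy values.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def evaluate_by_split(pairs, y_true, y_pred, train_pairs):
    """Score predictions separately for seen and unseen gene pairs."""
    # Unordered pairs: (A, B) and (B, A) are the same combination.
    train = {frozenset(p) for p in train_pairs}
    buckets = {"seen": ([], []), "unseen": ([], [])}
    for pair, t, p in zip(pairs, y_true, y_pred):
        key = "seen" if frozenset(pair) in train else "unseen"
        buckets[key][0].append(t)
        buckets[key][1].append(p)
    return {k: r_squared(t, p) for k, (t, p) in buckets.items()}


# Toy data (hypothetical): perfect on seen pairs, mean-only on unseen pairs.
train_pairs = [("A", "B"), ("C", "D")]
pairs = [("A", "B"), ("B", "A"), ("C", "D"), ("A", "C"), ("B", "D"), ("A", "D")]
y_true = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 3.0, 2.0, 2.0, 2.0]

scores = evaluate_by_split(pairs, y_true, y_pred, train_pairs)
print(scores)
```

In this toy case the model scores 1.0 on seen pairs and 0.0 on unseen ones, because on unseen pairs it only predicts the mean. The point of the split is that a single pooled R² would hide this difference.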

Data Diversity Is the Bottleneck

What Xaira Is Building

  • Causally-rich perturbation datasets at scale — not just more cells, but more diverse contexts
  • Multi-omics: RNA + chromatin + protein + spatial in the same model
  • Validation against real wet lab outcomes, not benchmark proxies
The goal: a 10x improvement in hit rate at the drug discovery screening stage
💡 Data diversity diagram: tissues × organisms × perturbation types
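
Speaker note: one possible way to put a number on "more diverse contexts" is the fraction of the tissues × organisms × perturbation types grid that a dataset actually covers. The sketch below is only an illustration under that assumption. The field names (`tissue`, `organism`, `perturbation`) and the toy records are hypothetical and do not describe any real dataset.

```python
from itertools import product


def context_coverage(records, tissues, organisms, perturbation_types):
    """Return (fraction of grid covered, sorted list of missing contexts)."""
    grid = set(product(tissues, organisms, perturbation_types))
    observed = {
        (r["tissue"], r["organism"], r["perturbation"]) for r in records
    } & grid  # ignore records that fall outside the grid of interest
    return len(observed) / len(grid), sorted(grid - observed)


# Toy example (hypothetical): many cells, few distinct contexts.
tissues = ["liver", "lung"]
organisms = ["human", "mouse"]
perturbation_types = ["CRISPRi", "drug"]

records = [
    {"tissue": "liver", "organism": "human", "perturbation": "CRISPRi"},
    {"tissue": "liver", "organism": "human", "perturbation": "CRISPRi"},
    {"tissue": "liver", "organism": "human", "perturbation": "drug"},
    {"tissue": "lung", "organism": "mouse", "perturbation": "drug"},
]

coverage, missing = context_coverage(
    records, tissues, organisms, perturbation_types
)
print(f"coverage = {coverage:.3f}, missing contexts = {len(missing)}")
```

Adding more records from an already-covered context leaves the coverage unchanged, which is the distinction the slide draws between more cells and more diverse contexts.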