# Overview: Bo's Knowledge Base — Virtual Cell & Perturbation Biology

*Last updated: 2026-04-04*

---

## The Central Question

Can we build a computational model of a cell that predicts how it would respond to any intervention — a drug, a gene knockout, a combination — without running the experiment?

This is the **Virtual Cell** problem, and it is the organizing question of Bo's research at Xaira Therapeutics.

---

## Current State of the Field

### What we have

The single-cell genomics field has produced large foundation models (scGPT, Geneformer, TranscriptFormer) trained on tens of millions of cells from CellxGene. These models achieve strong performance on standard tasks: cell type annotation, batch correction, cross-species transfer.

A new class of generative models is emerging that goes further: **Lingshu-Cell** (2026) treats the cell as a world model — modeling the full distribution of cellular states via discrete diffusion, rather than just learning embeddings. It achieves SOTA on the Virtual Cell Challenge H1 benchmark.

### What we don't have

The fundamental bottleneck is **data**:

- Current training corpora are observational (no perturbation signal in pretraining)
- Dominated by cancer cell lines; lacks tissue diversity, disease context, primary cells
- Existing perturbation datasets (Replogle Perturb-seq, Norman CRISPRa) cover mostly single-gene perturbations (CRISPRi knockdown, CRISPRa activation) in a handful of cell lines
- Combinatorial perturbation coverage is sparse
- No large-scale dataset linking genetic perturbation → chemical perturbation → in vivo response

Two recent papers (Souza & Mehta 2026; Kendiukhov 2026) independently confirm that **model scaling on CellxGene observational data has diminishing returns**. Simple linear baselines match foundation models on current benchmarks. Scaling laws exist but flatten in data-limited regimes.
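
To make "simple linear baseline" concrete, here is a minimal sketch of one common form: an additive model that predicts a combination's mean expression as the control mean plus the sum of each single perturbation's shift. This is an illustrative assumption, not necessarily the baseline used in the papers cited above. The function name `additive_baseline`, the perturbation labels `A` and `B`, and the three-gene toy vectors are all hypothetical.

```python
import numpy as np

def additive_baseline(control_mean, single_means, combo):
    """Predict mean expression for a combination of perturbations.

    Assumes effects add linearly: control + sum of (single - control) shifts.
    """
    pred = control_mean.copy()
    for name in combo:
        # Shift contributed by this single perturbation relative to control.
        pred = pred + (single_means[name] - control_mean)
    return pred

# Toy, made-up mean expression vectors over three genes (not real data).
control = np.array([1.0, 2.0, 3.0])
singles = {
    "A": np.array([1.5, 2.0, 2.0]),  # hypothetical perturbation A
    "B": np.array([1.0, 3.0, 3.0]),  # hypothetical perturbation B
}

pred_ab = additive_baseline(control, singles, ("A", "B"))
print(pred_ab)
```

A baseline like this has no learned parameters beyond per-perturbation means, so a foundation model that fails to beat it on a benchmark has not shown that it captures non-additive (interaction) effects.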

---

## Bo's Core Thesis

> **Virtual Cell ≠ perturbation predictor.**

A perturbation predictor interpolates within its training distribution. A true Virtual Cell generalizes causally: it predicts unseen combinations, transfers across cell types, and captures biology rather than batch effects.

Getting there requires:
1. **Causally rich data** — systematic perturbation atlases with diverse genetic × chemical × environmental interventions
2. **Contextual diversity** — primary cells, disease states, multiple species, multiple timepoints
3. **Better benchmarks** — current benchmarks are too easy; need out-of-distribution causal evaluation
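
One way to picture the out-of-distribution evaluation in point 3 is a split where every test combination is absent from training. The sketch below is a minimal illustration under assumptions: perturbations are represented as tuples of gene names, and `split_unseen_combinations`, `singles_seen`, and the gene labels `G1` to `G4` are hypothetical names, not part of any existing benchmark.

```python
import random

def split_unseen_combinations(perturbations, test_fraction=0.5, seed=0):
    """Hold out a fraction of multi-gene perturbations as the test set.

    Single-gene perturbations always stay in training, so the test set
    only contains combinations the model has never seen.
    """
    # Canonicalize so ("G2", "G1") and ("G1", "G2") count as one perturbation.
    canonical = sorted({tuple(sorted(p)) for p in perturbations})
    singles = [p for p in canonical if len(p) == 1]
    combos = [p for p in canonical if len(p) > 1]

    rng = random.Random(seed)
    rng.shuffle(combos)
    n_test = int(round(len(combos) * test_fraction))
    test = combos[:n_test]
    train = singles + combos[n_test:]
    return train, test

def singles_seen(combo, train):
    """True if every gene in the combination appears as a single in train."""
    seen = {p[0] for p in train if len(p) == 1}
    return all(gene in seen for gene in combo)

# Toy perturbation list with placeholder gene names (not real data).
perts = [
    ("G1",), ("G2",), ("G3",),
    ("G1", "G2"), ("G2", "G1"), ("G1", "G3"), ("G2", "G3"), ("G3", "G4"),
]
train, test = split_unseen_combinations(perts)
for combo in test:
    print(combo, "all singles seen in training:", singles_seen(combo, train))
```

Separating test combinations by whether their constituent singles were seen gives two difficulty tiers: recombining known effects versus extrapolating to a gene the model has never observed perturbed (here, any combination containing `G4`).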

Xaira's strategy: build that data infrastructure. The models will follow.

---

## Key Tensions in the Field

| Tension | Side A | Side B |
|---------|--------|--------|
| Model vs. data | Bigger models needed | Data diversity is the bottleneck |
| FM vs. simple methods | Foundation models generalize | Linear methods match on current benchmarks |
| Representation vs. causality | Learn better embeddings | Embeddings don't solve perturbation prediction |
| Perturbation predictor vs. Virtual Cell | Narrow task is tractable | Narrow task doesn't generalize |

---

## Emerging Directions (worth watching)

- **Generative world models for cells** — Lingshu-Cell as prototype; expect more in 2026
- **Causal representation learning** applied to scRNA-seq — IRM, do-calculus at scale
- **Multi-modal perturbation** — linking transcriptomics + imaging + proteomics in one model
- **Synthetic data generation** — using Virtual Cell to generate training data for downstream tasks
- **Large-scale combinatorial screens** — Tahoe-100M class datasets

---

## How to use this wiki

Ask Moon anything about these topics. Responses will be grounded in the wiki and filed back in for future sessions.

To add a new paper: share the arXiv link or paste the abstract. Moon will create the paper page, update the index, and touch related concept/entity pages.

To add an article or note: drop it in `raw/articles/` as a `.md` file and tell Moon to ingest it.
