# Perturbation Biology

**One-line definition:** The systematic study of how cells change their molecular state in response to controlled interventions — genetic, chemical, or environmental.

## What it is

Perturbation biology uses high-throughput experiments to map how interventions (gene knockouts, drug treatments, cytokine stimulation, environmental stress) alter cellular programs, typically measured via single-cell transcriptomics (scRNA-seq) or multi-omic readouts.

The goal: build a rich causal atlas of gene-gene, gene-drug, and context-drug interactions at cellular resolution.
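
A simple way to make "how an intervention alters a cellular program" concrete is a per-perturbation mean shift relative to control cells. The sketch below runs on synthetic data. The function name `perturbation_effects`, the `"control"` label, and the hypothetical `GENE_A` target are illustrative assumptions, not part of any published pipeline. Real analyses would also normalize counts and account for batch effects and guide efficiency.

```python
import numpy as np

def perturbation_effects(expr, labels, control_label="control"):
    """Return {perturbation: per-gene mean expression shift vs. control}.

    expr   : (n_cells, n_genes) array, assumed already normalized.
    labels : length-n_cells sequence naming each cell's perturbation.
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    control_mask = labels == control_label
    if not control_mask.any():
        raise ValueError("no control cells found")
    control_mean = expr[control_mask].mean(axis=0)

    effects = {}
    for pert in np.unique(labels):
        if pert == control_label:
            continue
        # Mean profile of perturbed cells minus mean profile of controls
        effects[str(pert)] = expr[labels == pert].mean(axis=0) - control_mean
    return effects

# Toy data: 4 control cells and 3 cells with a hypothetical GENE_A knockdown, 2 genes
expr = np.array([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0], [1.0, 2.0],
                 [0.0, 2.5], [0.0, 2.5], [0.0, 2.5]])
labels = ["control"] * 4 + ["GENE_A"] * 3
effects = perturbation_effects(expr, labels)
print(effects["GENE_A"])
```

In this toy case the first gene drops by 1.0 and the second rises by 0.5 relative to control.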

## Key experimental platforms

### Genetic perturbations
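- **Perturb-seq** (Dixit et al, Cell 2016; Replogle et al, Cell 2022): pooled CRISPR perturbation + scRNA-seq readout. Dixit et al used CRISPR knockout; Replogle et al used CRISPRi knockdown, profiling ~2.5M single cells across ~10,000 gene targets (genome-wide in K562, plus essential-gene screens)
- **Norman et al (Science 2019)**: combinatorial CRISPRa screen in K562; overexpression of single genes and 131 gene pairs; landmark dataset for studying genetic interactions
- **CRISPR activation (CRISPRa)** vs **interference (CRISPRi)**: opposite perturbation directions (overexpression vs knockdown), with different response profiles
- **CROP-seq** (Datlinger et al, Nature Methods 2017): a related pooled CRISPR + scRNA-seq method, developed in parallel with Perturb-seq, whose vector design makes the guide RNA readable directly from the transcriptome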

### Chemical perturbations
- **LINCS L1000**: bulk expression of 978 landmark genes across thousands of compounds; an early atlas, but bulk rather than single-cell
- **sci-Plex** (Srivatsan et al, Science 2020): single-cell drug response via nuclear hashing and combinatorial indexing; 188 compounds at multiple doses across 3 cancer cell lines
- **RxRx3** (Recursion): high-content cell imaging across CRISPR knockouts and compounds; an imaging (phenomics) readout rather than gene expression

### Cytokine / signaling perturbations
- IFN-β stimulation in PBMCs — classic benchmark; Lingshu-Cell tested on this
- Systematic cytokine screens emerging as key benchmarks

## Key datasets

| Dataset | Scale | Type | Notes |
|---------|-------|------|-------|
| Replogle 2022 | ~2.5M cells, ~10k knockdowns | Perturb-seq (CRISPRi) | Genome-scale single-gene knockdown dataset |
| Norman 2019 | ~100k cells, 131 gene pairs | CRISPRa | Combinatorial; go-to for interaction modeling |
| sci-Plex | ~650k cells, 188 drugs | Chemical | 3 cancer cell lines |
| LINCS L1000 | ~1.3M profiles | Chemical bulk | Older; 978 landmark genes |
| Tahoe-100M | ~100M cells, ~1,100 drug conditions | Chemical | 50 cancer cell lines; released by Tahoe Therapeutics (formerly Vevo) |

## Why diversity matters

Bo's core thesis: scaling needs **data diversity**, not just data volume.

A model trained on K562 CRISPR knockouts will not generalize to:
- Primary cells (immune, neuronal)
- Disease states
- Different species
- Drug combinations
- Genetic background variants

Current datasets are (1) dominated by cancer cell lines, (2) mostly limited to one perturbation per cell, (3) lacking in tissue and disease context, and (4) limited in combinatorial coverage.

The Virtual Cell bottleneck is not model capacity — it's causally rich, contextually diverse perturbation data.
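
One way to test the generalization claim is to hold out an entire cellular context during evaluation, so that a model is never scored on a context it trained on. The sketch below is a minimal leave-one-context-out splitter. The `cell_context` key, the record layout, and the example context names are hypothetical choices made for illustration.

```python
from collections import defaultdict

def leave_one_context_out(records, context_key="cell_context"):
    """Yield (held_out_context, train_indices, test_indices) per context."""
    by_context = defaultdict(list)
    for i, rec in enumerate(records):
        by_context[rec[context_key]].append(i)

    for ctx in sorted(by_context):
        test_idx = by_context[ctx]
        # Train on every record whose context differs from the held-out one
        train_idx = sorted(
            i for other, idxs in by_context.items() if other != ctx for i in idxs
        )
        yield ctx, train_idx, test_idx

# Hypothetical metadata records, one per perturbation condition
records = [
    {"cell_context": "K562", "perturbation": "GENE_A"},
    {"cell_context": "K562", "perturbation": "GENE_B"},
    {"cell_context": "primary_T_cell", "perturbation": "GENE_A"},
    {"cell_context": "iPSC_neuron", "perturbation": "GENE_A"},
]
splits = list(leave_one_context_out(records))
for ctx, train_idx, test_idx in splits:
    print(ctx, train_idx, test_idx)
```

A dataset with only one context yields a single split with an empty training set, which is a quick signal that context generalization cannot be measured on it at all.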

## Computational frameworks

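- **GEARS**: graph neural network for predicting responses to unseen single and combinatorial perturbations; uses gene co-expression and GO-derived gene graphs
- **CPA** (Compositional Perturbation Autoencoder): latent disentanglement of cell state and perturbation effect; handles multiple perturbations and doses
- **scGen** (Lotfollahi et al, Nature Methods 2019): VAE-based perturbation transfer across cell types via latent-space vector arithmetic
- **SAMS-VAE** (Sparse Additive Mechanism Shift VAE): models perturbation effects as sparse additive shifts in latent space; cleaner causal assumptions than CPA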
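
The transfer idea behind scGen can be illustrated with plain vector arithmetic: estimate a perturbation shift in a source cell type and add it to control cells of a target cell type. The sketch below applies this directly in expression space rather than in a learned VAE latent space, which is a deliberate simplification and not scGen's actual implementation. The function name `transfer_shift` and the toy arrays are assumptions.

```python
import numpy as np

def transfer_shift(ctrl_source, pert_source, ctrl_target):
    """Predict perturbed target cells by adding the source-derived mean shift.

    All inputs are (n_cells, n_features) arrays in a shared feature space.
    In scGen this space would be a VAE latent space; here it is raw features.
    """
    delta = pert_source.mean(axis=0) - ctrl_source.mean(axis=0)
    return ctrl_target + delta

ctrl_source = np.array([[1.0, 1.0], [3.0, 1.0]])  # source controls, mean [2, 1]
pert_source = np.array([[4.0, 0.0], [6.0, 0.0]])  # source perturbed, mean [5, 0]
ctrl_target = np.array([[0.5, 2.0]])              # target controls

predicted = transfer_shift(ctrl_source, pert_source, ctrl_target)
print(predicted)
```

This baseline assumes the perturbation shift is identical across cell types, which is exactly the assumption that context-diverse data would let one test.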

## Connections

- [../concepts/virtual-cell.md](virtual-cell.md) — the goal that perturbation data enables
- [../concepts/single-cell-foundation-models.md](single-cell-foundation-models.md) — the models trained on this data
- [../papers/scgpt.md](../papers/scgpt.md) — includes perturbation prediction task
- [../papers/lingshu-cell.md](../papers/lingshu-cell.md) — tests on Virtual Cell Challenge perturbation benchmark
- [../entities/xaira-therapeutics.md](../entities/xaira-therapeutics.md) — building causal perturbation atlas

## Open questions

- What is the minimal set of perturbations + contexts that yields a generalizable model?
- How do you measure "causal richness" of a dataset?
- Can transfer learning from genetic to chemical perturbations work?
- What fraction of perturbation responses are truly non-additive?
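
The non-additivity question can be framed against an additive null: if two perturbations act independently, the double-perturbation shift should roughly equal the sum of the single shifts. The sketch below computes the residual from that null and the fraction of genes exceeding a cutoff. The helper names, toy values, and the threshold of 0.25 are arbitrary illustrative assumptions; a real analysis would calibrate the cutoff against replicate noise.

```python
import numpy as np

def interaction_residual(delta_a, delta_b, delta_ab):
    """Per-gene deviation of the double perturbation from the additive null."""
    delta_a = np.asarray(delta_a, dtype=float)
    delta_b = np.asarray(delta_b, dtype=float)
    delta_ab = np.asarray(delta_ab, dtype=float)
    return delta_ab - (delta_a + delta_b)

def fraction_non_additive(residual, threshold):
    """Fraction of genes whose absolute residual exceeds the threshold."""
    return float(np.mean(np.abs(residual) > threshold))

# Toy per-gene shifts vs. control for perturbations A, B, and A+B (4 genes)
delta_a = [1.0, 0.0, -0.5, 0.2]
delta_b = [0.5, 1.0, 0.0, 0.1]
delta_ab = [1.5, 2.0, -0.5, 0.3]

residual = interaction_residual(delta_a, delta_b, delta_ab)
frac = fraction_non_additive(residual, threshold=0.25)
print(residual, frac)
```

Here only the second gene departs from additivity, so one gene in four is flagged.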