# Single-Cell Foundation Models

**One-line definition:** Large transformer-based models pretrained on millions of single-cell RNA-seq profiles, designed to learn generalizable representations of cell state.

## What they are

Analogous to language models (trained on text) or vision models (trained on images), single-cell foundation models are trained on large corpora of scRNA-seq data — treating each cell's gene expression profile as a "sentence" and each gene as a "token."

Primary pretraining objectives:
- **Masked gene modeling** (scGPT, Geneformer) — predict masked gene expression values (scGPT) or masked gene identities in a rank-ordered gene list (Geneformer)
- **Autoregressive generation** (scGPT generation mode) — generate expression profiles
- **Masked discrete diffusion** (Lingshu-Cell) — diffusion over tokenized expression space
- **Joint embedding prediction** (Cell-JEPA) — predict latent representation of one cell view from another
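
As a rough illustration of the masked gene modeling objective listed above, the sketch below hides a random fraction of a cell's binned expression tokens and scores predictions only at the hidden positions. It is a toy under stated assumptions, not any model's actual training code: `MASK_ID`, `mask_tokens`, `masked_cross_entropy`, the 15% mask rate, and the 5-bin vocabulary are all hypothetical choices made for the example.

```python
import numpy as np

MASK_ID = 0  # hypothetical token id reserved for masked positions in this toy


def mask_tokens(tokens, mask_frac=0.15, seed=0):
    """Hide a random fraction of positions; return (inputs, targets, mask)."""
    rng = np.random.default_rng(seed)
    tokens = np.asarray(tokens)
    n_mask = max(1, int(round(mask_frac * tokens.size)))
    idx = rng.choice(tokens.size, size=n_mask, replace=False)
    mask = np.zeros(tokens.size, dtype=bool)
    mask[idx] = True
    inputs = np.where(mask, MASK_ID, tokens)  # masked positions get MASK_ID
    return inputs, tokens, mask


def masked_cross_entropy(logits, targets, mask):
    """Mean cross-entropy (in nats) over masked positions only."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(targets.size), targets]
    return float(-picked[mask].mean())


# Toy cell: 20 gene positions, expression binned into classes 1..5.
cell = np.random.default_rng(1).integers(1, 6, size=20)
inputs, targets, mask = mask_tokens(cell)

# Stand-in for an untrained model: uniform logits over 6 classes (0..5).
uniform_logits = np.zeros((cell.size, 6))
loss = masked_cross_entropy(uniform_logits, targets, mask)
```

With uniform logits the loss equals ln(6), the value a model with no information would score; a trained model should fall below it.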

## Key models

| Model | Year | Training data | Key innovation |
|-------|------|--------------|----------------|
| scGPT | 2024 | 33M cells (CellxGene) | First large-scale GPT for single-cell; multi-task fine-tuning |
| Geneformer | 2023 | 29.9M cells | Ranks genes by expression; transfer to cardiac disease |
| TranscriptFormer | 2025 | ~100M cells, cross-species | Large-scale; cross-species transfer |
| Cell-JEPA | 2026 | CellxGene | JEPA architecture; 36% better clustering vs scGPT |
| Lingshu-Cell | 2026 | ~18k genes, diverse tissues | Generative world model; discrete diffusion; Virtual Cell Challenge H1 SOTA |
| scFoundation | 2024 | 50M cells | Read-depth-aware; 19k+ genes |

## Downstream tasks

1. **Cell type annotation** — zero-shot or few-shot classification
2. **Batch correction** — harmonize data across labs/protocols
3. **Perturbation prediction** — predict expression after unseen perturbation
4. **GRN inference** — infer gene regulatory networks from attention weights
5. **Drug response prediction** — cell line response to compounds
6. **Cross-species transfer** — predict mouse responses with models trained on human data
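
For the first task in the list, few-shot annotation is often framed as a lookup in embedding space. The sketch below assumes cell embeddings have already been produced by some pretrained model (the 2-D vectors here are made up) and labels each query cell by its nearest class centroid under cosine similarity. `nearest_centroid_annotate` is a hypothetical helper, not part of any model's API.

```python
import numpy as np


def nearest_centroid_annotate(ref_emb, ref_labels, query_emb):
    """Assign each query cell the label of the closest class centroid (cosine)."""
    ref_emb = np.asarray(ref_emb, dtype=float)
    query_emb = np.asarray(query_emb, dtype=float)
    labels = sorted(set(ref_labels))
    ref_labels = np.asarray(ref_labels)
    # One centroid per cell type, averaged over its labeled reference cells.
    centroids = np.stack([ref_emb[ref_labels == lab].mean(axis=0) for lab in labels])

    def unit(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    sims = unit(query_emb) @ unit(centroids).T  # cosine similarity matrix
    return [labels[i] for i in sims.argmax(axis=1)]


# Made-up embeddings: two labeled reference cells per type, two query cells.
ref = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.8]]
ref_labels = ["T cell", "T cell", "B cell", "B cell"]
query = [[0.8, 0.2], [0.2, 0.9]]
predicted = nearest_centroid_annotate(ref, ref_labels, query)
# predicted == ["T cell", "B cell"]
```

The same function works on PCA coordinates, which makes it a convenient way to compare a foundation model's embeddings against a linear baseline on equal terms.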

## Critical debate: do foundation models actually help?

A significant challenge: **Souza & Mehta (2026)** showed that simple parameter-free pipelines (careful normalization + PCA/linear methods) match or beat foundation models on most standard benchmarks, including OOD tasks with novel cell types.
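
To make "normalization + PCA" concrete, here is a minimal sketch of such a baseline. It is not the pipeline from the cited paper: library-size scaling to 10,000 counts followed by `log1p` is assumed here as one common normalization, and `log_normalize` and `pca_embed` are hypothetical helper names. The data are synthetic Poisson counts.

```python
import numpy as np


def log_normalize(counts, target_sum=1e4):
    """Scale each cell to a common total count, then apply log1p."""
    counts = np.asarray(counts, dtype=float)
    depth = counts.sum(axis=1, keepdims=True)
    depth[depth == 0] = 1.0  # avoid division by zero for empty cells
    return np.log1p(counts / depth * target_sum)


def pca_embed(x, n_components=2):
    """Project cells onto the top principal components via SVD."""
    centered = x - x.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


rng = np.random.default_rng(0)
# Synthetic data: two groups of 50 cells with different mean expression over 30 genes.
means_a = rng.uniform(1, 5, size=30)
means_b = rng.uniform(1, 5, size=30)
counts = np.vstack([
    rng.poisson(means_a, size=(50, 30)),
    rng.poisson(means_b, size=(50, 30)),
])
embedding = pca_embed(log_normalize(counts), n_components=2)  # shape (100, 2)
```

Nothing here is learned from a pretraining corpus, which is what makes this kind of pipeline a useful control when evaluating a foundation model's embeddings.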

This raises the question: are current benchmarks too easy? Are models learning batch effects and cell line identity rather than generalizable biology?

Bo's view is consistent with this: scaling models on the same observational CellxGene data hits diminishing returns. The next gains require better data (perturbation, causal, diverse context), not bigger models.

## Scaling laws

**Kendiukhov (2026)** showed that masked-reconstruction transformers on scRNA-seq do follow power-law scaling — but only when data is sufficient. In data-limited regimes, model capacity is not the binding constraint. Data-to-parameter ratio matters more than absolute scale.

Estimate: ~2.30 bits of entropy per masked gene position.
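
A power law of the form loss ≈ a · N^(−b) is a straight line in log-log space, so the exponent can be read off with a linear fit. The sketch below shows that fit on synthetic, noise-free numbers. The constants (50.0 and 0.2) are invented for the example and are not results from the paper, and the form omits any irreducible-loss offset a real analysis might include.

```python
import numpy as np


def fit_power_law(sizes, losses):
    """Fit loss ≈ a * size**(-b) by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
    return float(np.exp(intercept)), float(-slope)


# Synthetic, noise-free losses generated from a known power law.
sizes = np.array([1e6, 1e7, 1e8, 1e9])
losses = 50.0 * sizes ** -0.2
a, b = fit_power_law(sizes, losses)  # recovers a ≈ 50.0, b ≈ 0.2
```

In a data-limited regime the points for larger models would flatten out and stop following the fitted line, which is one way the behavior described above shows up in practice.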

## Architecture choices

- **Gene-level tokenization** — each gene is a token; expression value is embedded
- **Binned expression** — discretize continuous expression into bins, turning regression on continuous values into classification over bin labels
- **HVG selection** — most models filter to top 2k-8k highly variable genes; Lingshu-Cell does not (operates on all ~18k genes)
- **Positional encoding** — genes have no natural sequence order, so models rely on learned gene-identity embeddings rather than sequence-position encodings
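
Two of the choices above, binned expression and HVG selection, can be sketched in a few lines. The details are assumptions for illustration: per-cell quantile bins over nonzero values with bin 0 reserved for zero expression, and raw variance as a crude stand-in for a proper highly-variable-gene method. `bin_expression` and `top_variable_genes` are hypothetical names, not functions from any of the models listed.

```python
import numpy as np


def bin_expression(values, n_bins=5):
    """Map one cell's expression to integer bins.

    Zeros map to bin 0; nonzero values map to 1..n_bins by quantile.
    """
    values = np.asarray(values, dtype=float)
    tokens = np.zeros(values.shape, dtype=int)
    nonzero = values > 0
    if nonzero.any():
        # Interior quantile edges computed from this cell's nonzero values.
        quantiles = np.linspace(0, 1, n_bins + 1)[1:-1]
        edges = np.quantile(values[nonzero], quantiles)
        tokens[nonzero] = np.digitize(values[nonzero], edges) + 1
    return tokens


def top_variable_genes(matrix, k):
    """Indices of the k genes with highest variance across cells (crude HVG proxy)."""
    variances = np.asarray(matrix, dtype=float).var(axis=0)
    return np.sort(np.argsort(variances)[::-1][:k])


cell = [0.0, 1.0, 2.0, 3.0, 4.0, 0.0, 10.0]
tokens = bin_expression(cell, n_bins=5)
# tokens -> [0, 1, 2, 3, 4, 0, 5]

cells_by_genes = [[1, 0, 5], [1, 0, 1], [1, 3, 9]]
hvg_idx = top_variable_genes(cells_by_genes, k=2)
# hvg_idx -> [1, 2]; gene 0 is constant across cells and is dropped
```

Binning per cell makes the tokens relative to that cell's own expression range, so a value of 10 in a shallowly sequenced cell and a deeply sequenced one can land in different bins.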

## Connections

- [../concepts/virtual-cell.md](virtual-cell.md) — the goal; foundation models are one path
- [../concepts/perturbation-biology.md](perturbation-biology.md) — data that trains and evaluates these models
- [../papers/scgpt.md](../papers/scgpt.md) — Bo's foundational model
- [../papers/cell-jepa.md](../papers/cell-jepa.md) — alternative pretraining paradigm
- [../papers/lingshu-cell.md](../papers/lingshu-cell.md) — generative world model approach
- [../papers/scaling-laws-scrna.md](../papers/scaling-laws-scrna.md) — empirical scaling analysis
- [../papers/parameter-free-representations.md](../papers/parameter-free-representations.md) — challenge to foundation model claims
- [../entities/cellxgene.md](../entities/cellxgene.md) — primary training data source

## Open questions

- What pretraining objective best captures causal biology (vs. correlational structure)?
- How much of foundation model performance is generalization vs. memorization of cell line identity?
- Can a single model handle RNA + ATAC + protein jointly at scale?
- Is there a "scaling cliff" where data diversity becomes the bottleneck before parameters?