# TranscriptFormer: A Cross-Species Generative Cell Atlas Across 1.5 Billion Years of Evolution

**Authors:** Ethan Pearce, Joseph Simmonds, et al. (Chan Zuckerberg Initiative)
**Year:** 2025
**Venue:** bioRxiv preprint
**bioRxiv:** 2025.04.25.650731
**Model page:** https://virtualcellmodels.cziscience.com/model/01966441-339f-77f7-aa06-f67636f865dc

## One-sentence summary

TranscriptFormer is CZI's cross-species foundation model, trained on 112 million cells from 12 species spanning 1.53 billion years of evolution; it reports state-of-the-art (SOTA) results on cross-species cell type annotation and zero-shot disease state identification.

## Key contribution

TranscriptFormer was the **largest single-cell foundation model** at the time of its release, measured by training corpus. It goes well beyond human data to span 12 species, demonstrating that evolutionary conservation can serve as a biological supervision signal. Models trained across species learn representations that generalize to unseen organisms and cell types.

## Methods

- **Training data:** 112M cells from CZ CellxGene, Tabula Sapiens, ZebraHub + other cross-species corpora
- **Species:** 12 species spanning 1.53 billion years of evolution (human, mouse, zebrafish, fly, worm, etc.)
- **Architecture:** Generative transformer; jointly models gene identities + expression levels
- **Scale:** Largest training corpus in the field at time of release
- **Released by:** Chan Zuckerberg Initiative (CZI)
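
One way to read "jointly models gene identities + expression levels" is as a per-cell likelihood with two terms per gene: one for which gene appears, one for how strongly it is expressed. The sketch below is illustrative only. It assumes an autoregressive factorization and a Poisson count distribution, neither of which is stated in this note, and the function names and toy numbers are hypothetical.

```python
import math


def poisson_log_pmf(count, rate):
    """log P(count | rate) under a Poisson distribution (rate must be > 0)."""
    return count * math.log(rate) - rate - math.lgamma(count + 1)


def joint_log_likelihood(gene_probs, rates, counts):
    """Joint log-likelihood of one cell under the ASSUMED factorization:

        sum_i [ log p(gene_i | earlier genes) + log p(count_i | gene_i, earlier genes) ]

    gene_probs[i]: probability a model assigned to the observed i-th gene
    rates[i]:      Poisson rate a model predicted for that gene's count
    counts[i]:     observed count for that gene
    """
    total = 0.0
    for p_gene, rate, count in zip(gene_probs, rates, counts):
        total += math.log(p_gene)               # gene identity term
        total += poisson_log_pmf(count, rate)   # expression level term
    return total


# Toy numbers standing in for model outputs (hypothetical).
gene_probs = [0.5, 0.25, 0.1]
rates = [3.0, 1.0, 7.5]
counts = [2, 1, 9]
ll = joint_log_likelihood(gene_probs, rates, counts)
print(f"joint log-likelihood: {ll:.3f}")
```

Training a generative model of this kind would maximize such a quantity over all cells; the identity term and the count term can be inspected separately to see which part of the prediction is weaker.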

## Key findings

- SOTA cell type classification, including for species **not seen during training**
- Zero-shot disease state identification in human cells
- Accurately transfers cell state annotations across species boundaries
- Representations naturally reveal developmental trajectories and phylogenetic relationships without explicit supervision
- Predicts cell type-specific transcription factors and gene–gene interactions
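
Cross-species annotation transfer from embeddings can be illustrated with a nearest-neighbor vote: embed cells from a labeled reference species and an unlabeled query species in the same space, then give each query cell the majority label of its closest reference cells. This is a minimal sketch under that assumption, not the paper's evaluation protocol; the embeddings, labels, and function names are made up for illustration.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def transfer_labels(ref_embeddings, ref_labels, query_embeddings, k=3):
    """Label each query cell by majority vote among its k most similar reference cells."""
    predictions = []
    for query in query_embeddings:
        # Rank reference cells from most to least similar to this query cell.
        ranked = sorted(
            range(len(ref_embeddings)),
            key=lambda i: cosine(query, ref_embeddings[i]),
            reverse=True,
        )
        votes = Counter(ref_labels[i] for i in ranked[:k])
        predictions.append(votes.most_common(1)[0][0])
    return predictions


# Toy 2-D embeddings (hypothetical): reference species with labels, query species without.
ref_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.0],
                  [0.1, 1.0], [0.2, 0.9], [0.0, 0.8]]
ref_labels = ["T cell", "T cell", "T cell",
              "hepatocyte", "hepatocyte", "hepatocyte"]
query_embeddings = [[0.95, 0.05], [0.05, 0.7]]

predicted = transfer_labels(ref_embeddings, ref_labels, query_embeddings, k=3)
print(predicted)
```

The approach only works if the embedding places homologous cell types from different species close together, which is the property the findings above attribute to the model.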

## Emergent properties

TranscriptFormer learns **biologically meaningful structure without being taught it**: clustering the embedding space recovers phylogenetic trees of cell types, and developmental trajectories appear in UMAP projections of the embeddings. This is the single-cell analog of GPT learning syntax without being taught grammar.
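
To make "trees emerge from clustering the embedding space" concrete, the sketch below averages cell embeddings per group and then merges the closest groups step by step (average-linkage agglomerative clustering). The merge order is the tree. The choice of linkage, the distance metric, and the toy data are assumptions for illustration; the note does not say which clustering procedure the paper used.

```python
import math
from collections import defaultdict


def centroids(embeddings, labels):
    """Mean embedding per label."""
    sums, counts = {}, defaultdict(int)
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_linkage_merges(points):
    """Return the ordered list of (cluster_a, cluster_b) merges.

    points maps a name to a vector; clusters are tuples of names.
    """
    clusters = [(name,) for name in sorted(points)]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise distance between members of the two clusters.
                dists = [euclidean(points[a], points[b])
                         for a in clusters[i] for b in clusters[j]]
                d = sum(dists) / len(dists)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        merged = tuple(sorted(clusters[i] + clusters[j]))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges


# Toy embeddings (hypothetical): two cells per species-specific cell type.
embeddings = [[1.0, 0.0], [1.0, 0.2], [0.9, 0.1], [0.9, 0.3],
              [0.0, 1.0], [0.2, 1.0], [0.2, 0.7], [0.4, 0.9]]
labels = ["human T cell", "human T cell", "mouse T cell", "mouse T cell",
          "human neuron", "human neuron", "mouse neuron", "mouse neuron"]

merges = average_linkage_merges(centroids(embeddings, labels))
for left, right in merges:
    print(left, "+", right)
```

In this toy data the two T cell groups merge first and the two neuron groups second, so the tree groups by cell type before species. Whether real embeddings group by cell type, by species, or by a mix of both is the empirical question the paper's analysis addresses.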

## The challenge from Souza & Mehta (2026)

The parameter-free representations paper (2602.16696) benchmarks directly *against* TranscriptFormer and finds that simple normalized PCA matches or beats it on standard downstream tasks. TranscriptFormer's advantages appear primarily in cross-species and evolutionary transfer tasks, which are exactly the tasks that simple linear methods can't handle.
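
For a sense of how little machinery such a baseline needs, here is a minimal sketch of a normalized-PCA embedding. The specific normalization (library-size scaling to a fixed total followed by `log1p`) and the parameter names are assumptions; the note does not record the exact recipe Souza & Mehta used, and the count matrix is random toy data.

```python
import numpy as np


def normalized_pca(counts, n_components=2, target_sum=1e4):
    """Embed cells with library-size normalization, log1p, centering, and PCA via SVD.

    counts: (n_cells, n_genes) array of raw counts.
    Returns an (n_cells, n_components) array of principal component scores.
    """
    counts = np.asarray(counts, dtype=float)
    library = counts.sum(axis=1, keepdims=True)
    library[library == 0] = 1.0                  # avoid division by zero for empty cells
    x = np.log1p(counts / library * target_sum)  # ASSUMED normalization recipe
    x = x - x.mean(axis=0, keepdims=True)        # center each gene
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


# Toy count matrix (random, hypothetical): 50 cells x 20 genes.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=2.0, size=(50, 20))
emb = normalized_pca(counts, n_components=5)
print(emb.shape)
```

The embedding has no learned parameters beyond the projection computed from the data itself, but it is tied to one gene vocabulary, which is why it has no direct way to place cells from a species with a different gene set into the same space.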

## Positioning

TranscriptFormer is CZI's bet on the "more species = better biology" hypothesis. The evolutionary constraint provides a natural supervision signal that single-species models lack. This is conceptually different from scGPT (more cells, same species) or Lingshu-Cell (better generative model, same data type).

## Limitations

- No perturbation prediction evaluation — primarily a representation model
- Cross-species gene orthology mapping introduces noise
- Observational data only (no causal perturbation signal)
- The 112M training cells still come primarily from standard conditions
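
One source of the orthology-mapping noise listed above can be shown with a toy example: when genes are projected into a shared reference gene space, only one-to-one orthologs map cleanly, and genes with no ortholog or with ambiguous (one-to-many or many-to-one) relationships must be dropped or merged. The sketch below uses the strictest policy, keeping one-to-one pairs only. The gene identifiers, the ortholog table, and the function name are all hypothetical, and this is not a description of how TranscriptFormer itself aligns genes.

```python
from collections import defaultdict


def map_to_reference(expression, ortholog_pairs):
    """Project one cell's expression onto reference genes, keeping one-to-one orthologs only.

    expression:     dict of species gene -> count
    ortholog_pairs: list of (species_gene, reference_gene) pairs
    Returns (mapped, dropped): mapped is a dict of reference gene -> count,
    dropped is a sorted list of species genes that could not be mapped cleanly.
    """
    targets = defaultdict(set)  # species gene -> reference genes it maps to
    sources = defaultdict(set)  # reference gene -> species genes mapping to it
    for species_gene, reference_gene in ortholog_pairs:
        targets[species_gene].add(reference_gene)
        sources[reference_gene].add(species_gene)

    mapped, dropped = {}, []
    for gene, count in expression.items():
        refs = targets.get(gene, set())
        if len(refs) == 1:
            ref = next(iter(refs))
            if len(sources[ref]) == 1:  # also unique in the reverse direction
                mapped[ref] = count
                continue
        dropped.append(gene)
    return mapped, sorted(dropped)


# Hypothetical ortholog table and cell.
ortholog_pairs = [
    ("fish_a", "REF_A"),                          # one-to-one
    ("fish_b1", "REF_B"), ("fish_b2", "REF_B"),   # many-to-one (duplicated gene)
    ("fish_c", "REF_C1"), ("fish_c", "REF_C2"),   # one-to-many
]
expression = {"fish_a": 5, "fish_b1": 3, "fish_b2": 4, "fish_c": 2, "fish_d": 7}

mapped, dropped = map_to_reference(expression, ortholog_pairs)
lost_fraction = 1 - sum(mapped.values()) / sum(expression.values())
print(mapped, dropped, round(lost_fraction, 2))
```

Looser policies (for example, summing counts across duplicated genes) keep more signal but mix genes whose functions may have diverged, so either choice distorts the comparison to some degree.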

## Connections

- [../concepts/single-cell-foundation-models.md](../concepts/single-cell-foundation-models.md) — largest model in this landscape
- [../concepts/virtual-cell.md](../concepts/virtual-cell.md) — CZI frames this as a step toward Virtual Cell
- [../papers/parameter-free-representations.md](../papers/parameter-free-representations.md) — challenged on standard benchmarks
- [../entities/cellxgene.md](../entities/cellxgene.md) — primary data source; CZI built both

## Bo's notes

TranscriptFormer is an impressive engineering and data effort from CZI. The cross-species angle is genuinely new and the evolutionary supervision signal is clever. But it's still fundamentally observational. The question remains: does cross-species scale translate to better perturbation prediction? This is the key empirical question to watch in 2026.
