# CellxGene Census

**Type:** dataset / platform
**Maintained by:** Chan Zuckerberg Initiative (CZI)
**URL:** https://cellxgene.cziscience.com/

## Description

One of the largest public single-cell RNA-seq data repositories, aggregating datasets from many independent studies across human and mouse tissues. A major source of training data for single-cell foundation models, including scGPT (33M cells), TranscriptFormer (~100M cells), and Geneformer.

## Key stats (as of early 2026)

- More than 50M human single-cell profiles
- Diverse tissues, disease states, donors
- Standardized metadata and ontologies (cell type, tissue, disease)
- Available via Census API (Python/R)
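
As a rough illustration of the Python access path, the sketch below composes a metadata filter and passes it to the `cellxgene_census` package. The helper names `build_obs_filter` and `fetch_cells` are hypothetical and exist only for this example; `tissue_general`, `disease`, and `is_primary_data` are assumed to be available as cell metadata columns in the Census release being queried. The download step needs network access and can be large, so treat it as a starting point rather than a recommended pipeline.

```python
from typing import Optional


def build_obs_filter(tissue_general: Optional[str] = None,
                     disease: Optional[str] = None,
                     primary_only: bool = True) -> str:
    """Compose a Census-style cell metadata filter string (hypothetical helper)."""
    clauses = []
    if tissue_general is not None:
        clauses.append(f"tissue_general == '{tissue_general}'")
    if disease is not None:
        clauses.append(f"disease == '{disease}'")
    if primary_only:
        # Keep only the canonical copy of each cell, so cells that appear
        # in more than one dataset are not counted twice.
        clauses.append("is_primary_data == True")
    return " and ".join(clauses)


def fetch_cells(obs_filter: str, organism: str = "Homo sapiens"):
    """Fetch matching cells as an AnnData object (requires network access)."""
    # Imported lazily so the filter helper can be used without the package.
    import cellxgene_census

    with cellxgene_census.open_soma() as census:
        return cellxgene_census.get_anndata(
            census,
            organism=organism,
            # An empty filter string is passed as None, meaning "no filter".
            obs_value_filter=obs_filter or None,
        )
```

For example, `fetch_cells(build_obs_filter(tissue_general="lung", disease="normal"))` would request primary, non-diseased lung cells. Narrow filters are advisable, since an unfiltered query would try to pull the whole corpus into memory.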

## Limitations as a training source

- **Largely observational:** little or no systematic perturbation data; cells are mostly measured in their unperturbed state
- **Cell lines and cultured cells:** some profiles come from immortalized cancer cell lines (K562, A549, etc.) and other cultured samples rather than primary tissue
- **Limited combinatorial coverage:** no drug × gene interaction data
- **Batch effects:** data come from different labs and protocols, so technical variation is a major confound

These limitations help explain why models trained only on CellxGene data appear to hit a ceiling: they tend to learn correlational structure and batch patterns rather than causal biology.
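
One way to gauge the batch-effect concern before training is to check how cells are distributed across source datasets and assays. The sketch below is a minimal, standard-library illustration: `composition` is a hypothetical helper, the records are toy values invented for the example, and the keys `dataset_id` and `assay` are assumed to mirror the corresponding Census cell metadata columns.

```python
from collections import Counter


def composition(records, keys=("dataset_id", "assay")):
    """Return the fraction of cells per combination of metadata keys."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    total = sum(counts.values())
    # most_common() orders combinations from largest to smallest share.
    return {combo: n / total for combo, n in counts.most_common()}


# Toy metadata records, invented for illustration only.
toy_obs = [
    {"dataset_id": "ds_A", "assay": "10x 3' v3"},
    {"dataset_id": "ds_A", "assay": "10x 3' v3"},
    {"dataset_id": "ds_A", "assay": "10x 3' v3"},
    {"dataset_id": "ds_B", "assay": "Smart-seq2"},
]

shares = composition(toy_obs)
for combo, share in shares.items():
    print(combo, f"{share:.0%}")
```

A heavily skewed distribution would suggest that a few datasets or protocols dominate the corpus, in which case stratified sampling or explicit batch covariates may be worth considering.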

## Connections

- [../concepts/single-cell-foundation-models.md](../concepts/single-cell-foundation-models.md) — primary training data
- [../papers/scgpt.md](../papers/scgpt.md) — trained on 33M CellxGene cells
- [../papers/scaling-laws-scrna.md](../papers/scaling-laws-scrna.md) — scaling study used CellxGene data
- [../concepts/perturbation-biology.md](../concepts/perturbation-biology.md) — what CellxGene lacks