# scGPT: Toward Building a Foundation Model for Single-Cell Multi-omics Using Generative AI

**Authors:** Haotian Cui, Chloe Wang, Hassaan Maan, Kuan Pang, Fengning Luo, Nan Duan, **Bo Wang** (senior/corresponding)
**Year:** 2024
**Venue:** Nature Methods
**DOI:** 10.1038/s41592-024-02201-0
**arXiv:** 2308.10519

## One-sentence summary

scGPT is a generative pretrained transformer trained on over 33 million human single cells from CellxGene; after fine-tuning, it matches or outperforms specialized methods on cell type annotation, perturbation prediction, batch correction, multi-omic integration, and gene regulatory network inference.

## Key contribution

One of the first large-scale GPT-style foundation models for single-cell multi-omics. Demonstrated that a single pretrained model can be fine-tuned across diverse downstream biology tasks, analogous to BERT/GPT in NLP.

## Methods

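- **Architecture:** Stacked transformer blocks with a specialized attention mask for generative pretraining (not a separate encoder and decoder); each gene is a token, and its expression value is embedded separately and added to the gene embedding
- **Pretraining data:** 33M human single cells from CellxGene Census (diverse tissues, donors, protocols)
- **Pretraining objective:** Masked gene value modeling: hide the expression values of a random subset of genes and predict them from the rest of the cell's profile
- **Tokenization:** Each gene maps to a vocabulary token; a subset of genes is selected per cell, and expression values are binned per cell into discrete relative levels so that a bin means the same thing across sequencing depths and batches
- **Fine-tuning paradigm:** Pretrained weights are fine-tuned with task-specific heads and objectives; supervised on labeled data for each task where labels exist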
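
The tokenization and pretraining objective above can be illustrated with a small sketch. This is not the scGPT implementation: it is a simplified, standard-library-only illustration that assumes quantile-based bins computed from each cell's own nonzero values, a fixed mask ratio, and a mean-squared-error loss on the hidden positions. The names `bin_expression`, `mask_values`, `masked_mse`, and the `MASK` sentinel are hypothetical.

```python
import bisect
import random

MASK = -1  # hypothetical sentinel marking a hidden expression value


def bin_expression(values, n_bins=5):
    """Map one cell's expression values to integer bins in [0, n_bins].

    Zeros stay in bin 0. Nonzero values go to bins 1..n_bins using quantile
    edges computed from this cell's own nonzero values (an assumption of
    this sketch), so bins encode relative rather than absolute expression.
    """
    nonzero = sorted(v for v in values if v > 0)
    if not nonzero:
        return [0] * len(values)
    n = len(nonzero)
    # n_bins - 1 interior cut points taken at evenly spaced ranks.
    edges = [nonzero[(n * k) // n_bins] for k in range(1, n_bins)]
    return [0 if v <= 0 else 1 + bisect.bisect_right(edges, v) for v in values]


def mask_values(binned, mask_ratio=0.25, rng=None):
    """Hide a random subset of values; return (model inputs, hidden indices)."""
    rng = rng or random.Random()
    n_mask = max(1, round(mask_ratio * len(binned)))
    hidden = sorted(rng.sample(range(len(binned)), n_mask))
    hidden_set = set(hidden)
    inputs = [MASK if i in hidden_set else b for i, b in enumerate(binned)]
    return inputs, hidden


def masked_mse(predicted, target, hidden):
    """Mean squared error computed only over the hidden positions."""
    return sum((predicted[i] - target[i]) ** 2 for i in hidden) / len(hidden)


cell = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]        # toy counts for 11 genes
binned = bin_expression(cell)
scaled = bin_expression([10 * v for v in cell])   # same cell, 10x deeper sequencing
inputs, hidden = mask_values(binned, rng=random.Random(0))
```

In this toy example `binned` and `scaled` are identical, which is the point of per-cell binning: a uniform change in sequencing depth does not change the tokens the model sees. A real model would produce `predicted` from `inputs` and the gene tokens; here the loss function only shows which positions contribute to training.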

## Key findings

- SOTA or competitive on all 5 core tasks vs. specialized models
- Cell type annotation: strong zero-shot and few-shot performance
- Perturbation prediction: improves over the GEARS baseline; handles unseen perturbations via fine-tuning
- Batch correction: competitive with scVI/Harmony
- Multi-omic integration: RNA + ATAC joint embedding
- GRN inference: gene-gene attention patterns recover known regulatory relationships
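
To make the GRN finding concrete, the sketch below shows one simple way gene-gene attention scores could be turned into candidate regulatory edges. It is an illustration only, not the paper's procedure: it assumes an attention matrix already averaged over heads and cells, treats row `i` as the attention that gene `i` pays to every other gene, and keeps the top `k` partners per gene. The function name `top_attention_edges`, the gene labels, and the scores are all hypothetical.

```python
def top_attention_edges(attn, genes, k=2):
    """Return the k highest-scoring (query_gene, key_gene, score) pairs per gene.

    attn[i][j] is assumed to be the averaged attention from gene i to gene j.
    Self-attention (i == j) is ignored.
    """
    edges = []
    for i, row in enumerate(attn):
        scored = [(score, j) for j, score in enumerate(row) if j != i]
        scored.sort(reverse=True)  # strongest attention first
        edges.extend((genes[i], genes[j], score) for score, j in scored[:k])
    return edges


genes = ["GENE_A", "GENE_B", "GENE_C"]  # placeholder gene names
attn = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.1, 0.7],
]
edges = top_attention_edges(attn, genes, k=1)
```

Candidate edges ranked this way would still need validation against a curated regulatory database before being read as regulation; high attention alone is a correlational signal.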

## Limitations

- Trained on observational data (no causal perturbation signal in pretraining)
- HVG filtering (top ~3000 genes) loses rare gene signals
- Perturbation prediction still interpolates within training distribution
- Cell-JEPA (2026) reports 36% better zero-shot clustering, suggesting scGPT's masked reconstruction objective has room for improvement

## Connections

- [../concepts/virtual-cell.md](../concepts/virtual-cell.md) — scGPT as foundation for Virtual Cell approaches
- [../concepts/single-cell-foundation-models.md](../concepts/single-cell-foundation-models.md) — seminal paper in this field
- [../concepts/perturbation-biology.md](../concepts/perturbation-biology.md) — perturbation prediction task
- [../papers/cell-jepa.md](cell-jepa.md) — 2026 model that improves over scGPT on clustering
- [../papers/lingshu-cell.md](lingshu-cell.md) — generative successor concept
- [../entities/bo-wang.md](../entities/bo-wang.md) — corresponding author
- [../entities/cellxgene.md](../entities/cellxgene.md) — training data source

## Bo's notes

scGPT is Bo's flagship model. Key intellectual tension: scGPT proves the foundation model paradigm works for single-cell data, but Bo's current position at Xaira is that the next frontier requires better *data* (causally rich perturbation atlases), not just better models. scGPT was the "proof of concept"; now the field needs to move from scaling observational data to scaling causal data.
