# Building Foundation Models That Reason
### BioReason & BioReason-Pro: Multimodal Biological Reasoning from Genomics to Proteomics
Bo Wang · Xaira Therapeutics · University of Toronto · Vector Institute
NIH, 2026

---

## The Gap: From Representation to Reasoning
- DNA and protein foundation models have achieved strong sequence representations
- Yet they operate as black boxes: no step-by-step mechanistic logic
- Complex biology requires multi-step causal inference — variant → pathway → phenotype
- Expert biologists don't just label; they reason across sequence, structure, and interaction context

---

## BioReason: Architecture — Deep DNA-LLM Integration
- Frozen DNA encoder (Evo2-1B or Nucleotide Transformer 500M) → per-token embeddings E_DNA ∈ R^{L × d_dna}
- Learned linear projection: Proj: R^{d_dna} → R^{d_llm} maps genomic space to LLM token space
- Multimodal input: X_LLM = (<dna_start>, E'_DNA, <dna_end>, E_Q_text) with RoPE positional encoding
- LLM backbone: Qwen3 (1.7B or 4B), fine-tuned via LoRA; DNA encoder weights frozen throughout
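
The input assembly above can be sketched in a few lines of NumPy. This is a minimal illustration, not the released implementation: the sizes, the names `W` and `b`, and the random arrays standing in for the frozen encoder output and the LLM's token embeddings are all assumptions, and RoPE and LoRA are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
L_dna, d_dna, d_llm, n_text = 6, 8, 16, 4

# Stand-in for the frozen DNA encoder output E_DNA, shape (L, d_dna).
E_dna = rng.normal(size=(L_dna, d_dna))

# Learned linear projection Proj: R^{d_dna} -> R^{d_llm}.
# W and b are hypothetical names for its weight and bias.
W = rng.normal(scale=0.02, size=(d_dna, d_llm))
b = np.zeros(d_llm)
E_dna_proj = E_dna @ W + b  # E'_DNA, shape (L, d_llm)

# Stand-ins for the embeddings of <dna_start>, <dna_end>, and the question text.
e_start = rng.normal(size=(1, d_llm))
e_end = rng.normal(size=(1, d_llm))
E_q_text = rng.normal(size=(n_text, d_llm))

# X_LLM = (<dna_start>, E'_DNA, <dna_end>, E_Q_text), stacked along the sequence axis.
X_llm = np.concatenate([e_start, E_dna_proj, e_end, E_q_text], axis=0)
print(X_llm.shape)
```

After projection, every row lives in the LLM's embedding space, so the backbone can treat DNA positions and text tokens as one sequence.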

---

## BioReason: Training — Supervised Fine-Tuning + GRPO
- Stage 1 (SFT): LoRA fine-tuning (rank 32, α=64) on curated chain-of-thought reasoning traces
- Stage 2 (GRPO, Dr. GRPO variant): samples G=8 outputs per query and scores each against its group's mean reward
- Composite reward: correctness (+2.0) + answer conciseness (+0.5) + reasoning format (+0.5) → max 3.0
- 4B backbone: stable convergence by step 400; lower variance than 1.7B; generalizes across DNA encoders
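
The group-relative step can be illustrated as follows. This is a sketch under stated assumptions: the eight reward values are invented for the example, and the advantage is taken as reward minus the group mean, which is how Dr. GRPO is usually described (plain GRPO additionally divides by the group standard deviation). The function name is hypothetical.

```python
from statistics import mean


def group_relative_advantages(rewards):
    """Center each sampled output's reward on the mean of its group."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# Hypothetical rewards for G=8 sampled outputs to one query.
rewards = [2.5, 0.5, 2.0, 0.0, 2.5, 1.0, 0.5, 2.0]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f}  advantage={a:+.3f}")
```

Outputs that beat their own group's average get a positive advantage and are reinforced; no separate value network is needed to supply the baseline.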

---

## BioReason: Datasets — Three Evaluation Tasks
- KEGG Reasoning (1,449 variants, 37 diseases): variant + pathway network → multi-step mechanism → disease prediction
- VEP-Coding (50,083 variants, chr8 test): pathogenic/benign classification + disease from coding SNVs
- VEP-Non-SNV (36,088 variants): indels ≤64 bp; stratified train/test by disease
- All datasets pair reference + variant DNA sequences (~4,000 bp avg length, 1–3 nt differences)
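
One way to picture a reference/variant pair is as a small record with a helper that lists the substituted positions. The class name, field names, and the short toy sequences are assumptions for illustration and do not reflect the released dataset schema.

```python
from dataclasses import dataclass


@dataclass
class VariantExample:
    reference: str  # reference DNA window
    variant: str    # same window carrying the variant
    label: str      # e.g. a disease name or pathogenic/benign

    def substitutions(self):
        """Return (position, ref_base, alt_base) for each mismatch.

        Only valid for equal-length pairs (substitutions); indels
        would need an alignment step that is not shown here.
        """
        if len(self.reference) != len(self.variant):
            raise ValueError("lengths differ; indels need an alignment step")
        return [
            (i, r, v)
            for i, (r, v) in enumerate(zip(self.reference, self.variant))
            if r != v
        ]


example = VariantExample("ACGTACGT", "ACGTAGGT", "pathogenic")
print(example.substitutions())  # [(5, 'C', 'G')]
```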

---

## BioReason: Results — KEGG Pathway Reasoning
- Best model: Evo2+Qwen3-4B+GRPO — 98.28% accuracy, 93.05% F1 (290 test points)
- vs. DNA-only Evo2-1B: 88.28% accuracy, 72.43% F1 (+10 pts accuracy, +21 pts F1)
- vs. LLM-only Qwen3-4B: 90.00% accuracy, 79.66% F1 (+8 pts accuracy, +13 pts F1)
- GRPO adds +3.1 pts accuracy and +6.9 pts F1 on top of SFT alone (Evo2+Qwen3-4B)
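
The point gains quoted above follow directly from the reported numbers; the short check below recomputes them. Only the figures on this slide are used, and the dictionary layout is an arbitrary choice.

```python
# Reported KEGG results for the best model (Evo2+Qwen3-4B+GRPO).
best = {"accuracy": 98.28, "f1": 93.05}

# Reported single-modality baselines.
baselines = {
    "DNA-only Evo2-1B": {"accuracy": 88.28, "f1": 72.43},
    "LLM-only Qwen3-4B": {"accuracy": 90.00, "f1": 79.66},
}

# Gain in percentage points for each metric.
deltas = {
    name: {metric: round(best[metric] - scores[metric], 2) for metric in best}
    for name, scores in baselines.items()
}

for name, d in deltas.items():
    print(f"{name}: +{d['accuracy']} pts accuracy, +{d['f1']} pts F1")
```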

---

## BioReason: Results — VEP Benchmarks + Case Study
- VEP-Coding: Evo2+Qwen3-4B → 80.21% accuracy, 80.00% F1 (vs. DNA-only 70.07%/49.19%)
- VEP-Non-SNV: Evo2+Qwen3-1B → 88.20% accuracy, 89.91% F1 (vs. DNA-only 76.17%/66.51%)
- Case study (PFN1, chr17): model generated 10-step mechanistic chain: C>G substitution → profilin-1 dysfunction → impaired actin dynamics → disrupted axonal transport → ALS
- Model reasons over unseen biological entities — generalizes beyond training diseases

---

# BioReason established that reasoning works for DNA. BioReason-Pro scales this to protein function — the largest open problem in molecular biology.

---

## The Protein Annotation Crisis
- UniProt contains ~250 million protein sequences; <1% have experimental functional annotations
- Gene Ontology (GO): 45,000+ terms spanning Molecular Function, Biological Process, Cellular Component
- GO term dependencies are hierarchical (parent-child) and cross-aspect — ignored by standard classifiers
- Current methods: sequence similarity (misses functional convergence) or independent per-term classifiers

**~249M proteins lack reliable annotation — the foundational bottleneck for drug target discovery**

---

## GO-GPT: Autoregressive Modeling of Gene Ontology
- Autoregressive transformer trained to generate GO term annotations in hierarchical dependency order
- Models parent-child and cross-aspect (MF/BP/CC) dependencies jointly — not independent classifiers
- Generates GO annotations as structured token sequences, enabling coherent multi-label output
- GO-GPT predictions serve as the structured ontology input backbone for BioReason-Pro
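
"Hierarchical dependency order" can be read as: a term is emitted only after its parents. The sketch below orders a set of terms that way with the standard-library `graphlib`. The tiny hierarchy is a simplified toy excerpt of the Molecular Function aspect, and the function name is hypothetical; GO-GPT's actual tokenization is not specified in this deck.

```python
from graphlib import TopologicalSorter

# Toy, simplified excerpt of a GO-like hierarchy: term -> set of parents.
parents = {
    "molecular_function": set(),
    "binding": {"molecular_function"},
    "protein binding": {"binding"},
    "catalytic activity": {"molecular_function"},
}


def hierarchical_order(terms, parents):
    """Order terms so that every parent precedes its children."""
    # Keep only parent links that point inside the annotated set.
    subgraph = {t: parents[t] & terms for t in terms}
    return list(TopologicalSorter(subgraph).static_order())


annotated = {"protein binding", "catalytic activity", "binding", "molecular_function"}
sequence = hierarchical_order(annotated, parents)
print(sequence)
```

An autoregressive model trained on sequences in this order always sees a term's ancestors in its context before it has to predict the term itself.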

---

## BioReason-Pro: Multimodal Protein Reasoning Architecture
- Inputs: protein sequence embeddings (ESM-style encoder) + GO-GPT structured predictions + domain/structural features
- LLM backbone generates structured reasoning traces: integrates sequence context, GO predictions, and biological knowledge
- Output: GO term predictions + interpretable functional summary narrative with per-step justification
- First multimodal reasoning LLM for protein function — combines precise ontology modeling with free-form biological explanation

---

## Training Stage 1: Synthetic Reasoning Traces at Scale
- Challenge: no large-scale dataset of expert protein function reasoning chains exists
- Solution: GPT-5 generated synthetic reasoning traces for 130,000+ proteins from SwissProt/UniProt
- Each trace: sequence analysis → domain identification → GO term justification → functional summary
- Quality control: traces validated against experimental UniProt annotations; inconsistent traces filtered

---

## Training Stage 2: Reinforcement Learning for Functional Coherence
- RL reward designed for protein function: GO term accuracy (Fmax-based) + ontological consistency + reasoning quality
- Penalizes violations of GO hierarchy (predicting child without parent)
- Rewards biologically specific annotations over generic high-level terms
- RL optimization on top of SFT closes the gap between correct-but-vague and correct-and-precise
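
A reward with an accuracy term and a hierarchy penalty might look like the sketch below. The assumptions are significant: per-protein F1 stands in for the Fmax-based accuracy term, the penalty weight of 0.5 is invented, the reasoning-quality and specificity terms are omitted, and the toy hierarchy and function names are hypothetical.

```python
# Toy hierarchy: term -> set of parents (simplified, for illustration only).
parents = {
    "molecular_function": set(),
    "binding": {"molecular_function"},
    "protein binding": {"binding"},
}


def hierarchy_violations(predicted, parents):
    """Terms predicted without all of their parents also being predicted."""
    return sorted(
        t for t in predicted
        if any(p not in predicted for p in parents.get(t, ()))
    )


def f1_score(predicted, truth):
    """Set-based F1 between predicted and true GO terms."""
    tp = len(predicted & truth)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)


def reward(predicted, truth, parents, penalty=0.5):
    """Accuracy term minus a fixed (hypothetical) penalty per violation."""
    return f1_score(predicted, truth) - penalty * len(
        hierarchy_violations(predicted, parents)
    )


truth = {"molecular_function", "binding", "protein binding"}
predicted = {"molecular_function", "protein binding"}  # skips "binding"

print(hierarchy_violations(predicted, parents))
print(round(reward(predicted, truth, parents), 3))
```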

---

## BioReason-Pro: GO Term Prediction Results
- Fmax: 73.6% — state-of-the-art across all three GO aspects (MF, BP, CC)
- Outperforms ESM-2 + per-term classifiers, ProtTrans, DeepFRI, and the other baselines evaluated
- Consistent gains in the most information-rich, rarely-annotated term categories
- Scales to novel protein families with no training examples through learned reasoning transfer

**73.6% Fmax on GO term prediction — new state-of-the-art**
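
For readers unfamiliar with the metric, Fmax is the best F1 obtained over all score thresholds. The sketch below follows the protein-centric definition used in CAFA-style evaluations: precision is averaged over proteins with at least one prediction at the threshold, recall over all proteins. The scores and labels are toy values, every protein is assumed to have at least one true term, and ancestor propagation is not shown.

```python
def fmax(scores, truth, thresholds=None):
    """Protein-centric Fmax: maximum F1 over decision thresholds."""
    thresholds = thresholds or [i / 100 for i in range(1, 100)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for protein, true_terms in truth.items():
            predicted = {g for g, s in scores.get(protein, {}).items() if s >= t}
            tp = len(predicted & true_terms)
            if predicted:  # precision only counts proteins with predictions
                precisions.append(tp / len(predicted))
            recalls.append(tp / len(true_terms))
        if not precisions:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best


# Toy predictions (term -> confidence) and ground-truth term sets.
scores = {
    "P1": {"a": 0.9, "b": 0.4},
    "P2": {"a": 0.5, "c": 0.8},
}
truth = {"P1": {"a", "b"}, "P2": {"c"}}

print(round(fmax(scores, truth), 3))
```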

---

## BioReason-Pro: Functional Summary Evaluation
- Task: generate a human-readable functional paragraph for a protein given its sequence
- Evaluation: LLM judge (GPT-4 acting as an expert evaluator) scores coherence, accuracy, specificity
- BioReason-Pro: 8.0 / 10 average judge score across 500 evaluated proteins
- Ablation: removing GO-GPT input drops score to 6.4; removing RL drops to 7.1

---

## Human Expert Evaluation: Preferred Over Ground Truth
- Blind evaluation: expert protein biologists compared BioReason-Pro vs. UniProt (gold standard) annotations
- Both annotations presented without identifying which was human-curated vs. AI-generated
- BioReason-Pro preferred in 79% of head-to-head comparisons (n = 200 proteins)
- Expert rationale: richer mechanistic context, better integration of interaction evidence, clearer functional logic

**Experts preferred BioReason-Pro annotations over curated UniProt ground truth in 79% of blind comparisons**

---

## De Novo Binding Partner Prediction
- Task: given only protein sequence, predict which other proteins it physically interacts with
- No interaction database entries used; prediction from first principles via reasoning
- BioReason-Pro predicted binding partners for 12 test proteins with no prior interaction data
- Predictions experimentally confirmed via independent cryo-EM studies of the same proteins

---

## Structural Validation: Per-Residue Attention Maps Localize to Contact Residues
- BioReason-Pro's per-residue attention weights analyzed for predicted binding interfaces
- Attention peaks at specific residue positions — mapped onto cryo-EM structure of the complex
- Attention-highlighted residues correspond precisely to experimentally resolved contact residues
- This was not trained explicitly — emergent from reasoning about sequence and domain features
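
One simple way to quantify how well attention peaks line up with an interface is to take the top-k attended residues and measure their overlap with the contact set. The attention values, contact indices, and function name below are invented for illustration; they are not data from the study.

```python
def top_k_residues(attention, k):
    """Indices of the k residues with the highest attention weight."""
    ranked = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    return set(ranked[:k])


# Hypothetical per-residue attention weights for a 10-residue stretch.
attention = [0.01, 0.02, 0.30, 0.25, 0.03, 0.02, 0.20, 0.02, 0.10, 0.05]

# Hypothetical contact residues taken from a resolved structure.
contacts = {2, 3, 6}

top = top_k_residues(attention, k=3)
overlap = top & contacts
precision = len(overlap) / len(top)
recall = len(overlap) / len(contacts)

print(sorted(top), precision, recall)
```

Reporting the same overlap for randomly chosen residues would give a baseline against which the attention-derived score can be judged.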

---

## Reasoning Trace: What BioReason-Pro Outputs
- Step 1: Sequence analysis — identifies conserved domains, signal peptides, transmembrane regions
- Step 2: Homology reasoning — relates to characterized proteins via evolutionary conservation argument
- Step 3: GO-GPT anchoring — grounds predictions in hierarchically consistent ontology terms
- Step 4: Functional synthesis — generates integrated narrative connecting structure to biology

---

## The BioReason Framework: A Unified Recipe
- Core recipe: domain encoder (DNA or protein) → projection → LLM backbone → SFT on reasoning traces → RL refinement
- Generalizes across modalities: DNA (BioReason) → protein (BioReason-Pro) → RNA, epigenomics, metabolomics (future)
- Reasoning is the interface between computational prediction and human-interpretable biology
- All models, code, and checkpoints open-source: github.com/bowang-lab/BioReason | bioreason.net

---

## Summary
- BioReason (NeurIPS 2025): DNA-LLM integration with SFT+GRPO achieves 98.28% on KEGG pathway reasoning, +15% avg on VEP; generates interpretable multi-step mechanistic traces
- BioReason-Pro (2026): GO-GPT + multimodal LLM achieves 73.6% Fmax on GO prediction; preferred by experts over UniProt 79% of the time; de novo binding partner prediction validated by cryo-EM
- Central thesis: adding reasoning to biological foundation models improves accuracy, interpretability, and scientific utility simultaneously
- Acknowledgments: Adibvafa Fallahpour*, Hani Goodarzi, Patrick Hsu, Chris Maddison, and all collaborators

**Code + models: github.com/bowang-lab/BioReason · bioreason.net**

---
