---
title: World Models for Medicine — Why Healthcare Is Where This Research Agenda Has Its Sharpest Edge
---

# World Models for Medicine — Why Healthcare Is Where This Research Agenda Has Its Sharpest Edge

Yann LeCun's AMI just raised $1B around a central bet: that the path to truly intelligent AI runs through world models — systems that build internal representations of how the world works, rather than pattern-matching on surface statistics.

I think this bet is right. And I think medicine is the domain where it will be proven most decisively.

Here's the argument.

## What Yann Got Right (and Why It Took This Long to Show Up in Healthcare)

The core intuition behind JEPA — Joint Embedding Predictive Architecture — is deceptively simple: stop reconstructing inputs. Predict latent structure instead.

Reconstruction-based objectives (masked autoencoders, diffusion models, VAEs) force the model to account for every pixel, every voxel, every token. In natural images, that's often fine — the signal is dense, noise is low, and the structure you care about is legible at the pixel level.

But there's a class of data where reconstruction is not just suboptimal — it's actively harmful. And that class is medical imaging.

## Two Properties That Break Pixel Reconstruction in Medicine

**1. Measurements are noisy by design.**

Medical imaging is a physical process, not a photograph. Ultrasound uses acoustic waves that scatter, attenuate, and reflect based on tissue density, operator angle, probe frequency, and patient anatomy. The result is speckle — a granular noise pattern that is coherent enough to look like signal but carries no clinical information.

CT has beam hardening. MRI has susceptibility artifacts, motion ghosting, field inhomogeneity. Pathology slides have staining variation and sectioning artifacts. X-ray has scatter and positioning variation.

When you ask a model to reconstruct these images, you are asking it to spend capacity explaining variance that is physically uninformative. The model learns the scanner. Not the patient.
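To make that concrete, here is a toy sketch, not anything from the paper. It assumes a simplified multiplicative speckle model with Rayleigh-distributed amplitude, a hypothetical bright-disk "anatomy", and a crude block average standing in for a more abstract feature. Two simulated acquisitions of the same anatomy differ a lot pixel by pixel, yet barely differ once the speckle is averaged out.

```python
import numpy as np

def acquire(tissue, rng):
    """One simulated acquisition: tissue brightness times independent Rayleigh speckle."""
    return tissue * rng.rayleigh(scale=1.0, size=tissue.shape)

def block_mean(image, block=8):
    """Average over non-overlapping block x block tiles (a crude stand-in for an abstract feature)."""
    h, w = image.shape
    return image.reshape(h // block, block, w // block, block).mean(axis=(1, 3))

rng = np.random.default_rng(0)
yy, xx = np.mgrid[:64, :64]
# Hypothetical anatomy: a bright disk on a darker background.
tissue = np.where((yy - 32) ** 2 + (xx - 32) ** 2 < 20 ** 2, 1.0, 0.3)

scan_a = acquire(tissue, rng)
scan_b = acquire(tissue, rng)  # same anatomy, fresh speckle

# Pixel-level disagreement is driven entirely by speckle, since the anatomy is identical.
pixel_mse = float(np.mean((scan_a - scan_b) ** 2))
# After averaging, most of that disagreement disappears.
pooled_mse = float(np.mean((block_mean(scan_a) - block_mean(scan_b)) ** 2))
print(f"pixel MSE: {pixel_mse:.4f}  pooled MSE: {pooled_mse:.4f}")
```

A pixel reconstruction loss has to account for the first number. An objective defined on more abstract features is free to ignore it.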

**2. Physiology is a process, not a snapshot.**

The heart does not just look like something. It does something. Left ventricular ejection fraction is a ratio of volumes across time. Wall motion abnormalities are temporal asymmetries. Valve regurgitation is a flow pattern. Diastolic dysfunction is a pressure-volume relationship that unfolds across the cardiac cycle.
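Ejection fraction is the simplest case. The standard definition is (EDV - ESV) / EDV, where EDV and ESV are the end-diastolic and end-systolic volumes. A minimal sketch follows; the function name and the example volumes are hypothetical, and it assumes the volumes have already been estimated for each frame of one cardiac cycle.

```python
def ejection_fraction(volumes_ml):
    """LVEF (%) from left-ventricular volumes sampled across one cardiac cycle."""
    if len(volumes_ml) < 2:
        raise ValueError("need volumes from more than one frame")
    edv = max(volumes_ml)  # end-diastolic volume: ventricle at its fullest
    esv = min(volumes_ml)  # end-systolic volume: ventricle at its emptiest
    if edv <= 0:
        raise ValueError("volumes must be positive")
    return 100.0 * (edv - esv) / edv

cycle = [120.0, 100.0, 70.0, 50.0, 80.0, 110.0]  # hypothetical volumes in mL
lvef = ejection_fraction(cycle)
print(f"LVEF: {lvef:.1f}%")
```

The function refuses a single frame on purpose: with one volume there is no ratio to compute.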

These are not properties you can read from a single frame. They require understanding how the system evolves: its dynamics. A model that learns from static or independently masked frames will learn anatomy. A model that learns to predict forward in latent space will learn physiology.

This is the precise gap that world models close.

## EchoJEPA: The Experiment

Together with @alifmunim, and with guidance from the Meta FAIR team (@garridoq_, @koustuvsinha), we built EchoJEPA, a JEPA-based world model trained on 18M echocardiograms from 300K patients.

The design follows Yann's prescription directly: the encoder learns to predict the latent representation of spatiotemporal regions from context, without reconstructing pixels. The model is forced to organize its representations around what is predictable across time — heart structure, motion patterns, chamber geometry — and discard what is not: speckle, acquisition noise, probe positioning.
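For readers who want the shape of the objective, here is a deliberately tiny NumPy sketch. It is not the EchoJEPA code. The linear "encoders", the mean-pooled context, the L1 distance, and every name in it are assumptions made for illustration, and the slowly updated target encoder follows published JEPA variants rather than anything stated here. What it shows is that the loss compares latents with latents and never touches pixels.

```python
import numpy as np

def encode(weights, patches):
    """Toy encoder: a single linear layer with tanh, applied to each patch."""
    return np.tanh(patches @ weights)

def latent_prediction_loss(ctx_w, tgt_w, pred_w, pos, patches, target_mask):
    """Mean L1 distance between predicted and target latents at the hidden patches."""
    # The context encoder only sees the visible patches.
    context = encode(ctx_w, patches[~target_mask]).mean(axis=0)
    # The predictor guesses the latent at each hidden position from context plus position.
    predicted = (context + pos[target_mask]) @ pred_w
    # The target encoder sees the hidden patches; during training its output
    # would be treated as a constant (no gradient flows through it).
    targets = encode(tgt_w, patches[target_mask])
    return float(np.mean(np.abs(predicted - targets)))

def ema_update(tgt_w, ctx_w, momentum=0.99):
    """Target weights slowly track the context encoder instead of being trained directly."""
    return momentum * tgt_w + (1.0 - momentum) * ctx_w

rng = np.random.default_rng(0)
n_patches, patch_dim, latent_dim = 16, 32, 8
patches = rng.normal(size=(n_patches, patch_dim))  # flattened spatiotemporal patches
target_mask = np.zeros(n_patches, dtype=bool)
target_mask[10:] = True                            # hide the last 6 patches
ctx_w = rng.normal(scale=0.1, size=(patch_dim, latent_dim))
tgt_w = ctx_w.copy()
pred_w = rng.normal(scale=0.1, size=(latent_dim, latent_dim))
pos = rng.normal(scale=0.1, size=(n_patches, latent_dim))

loss = latent_prediction_loss(ctx_w, tgt_w, pred_w, pos, patches, target_mask)
tgt_w = ema_update(tgt_w, ctx_w)
print(f"latent prediction loss: {loss:.4f}")
```

In a real system the encoders are video transformers and the loss is minimized by gradient descent on the context encoder and predictor. The sketch only evaluates the loss once.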

The results across frozen encoder evaluations (no fine-tuning):

- **20% reduction in LVEF prediction error** vs reconstruction baselines
- **79% accuracy with 1% of labels** (vs 42% for baselines using 100% of labels)
- **2% degradation under acoustic artifacts** (vs 17% for baselines)
- **Zero-shot transfer to pediatric echo** that beats all fine-tuned comparisons

These margins are not incremental. A 37-point accuracy gap, achieved with 1% of the labels against baselines that used all of them, is not about hyperparameters. It reflects a structural difference in what the representations have learned to encode.
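The frozen-encoder protocol behind numbers like these can be sketched as follows. This is a hedged illustration on synthetic data, not our evaluation code. The embeddings are random vectors standing in for encoder outputs, the probe is a closed-form ridge regression, and all names are hypothetical. The encoder is never updated; only a small linear head is fit on whatever fraction of labels is available.

```python
import numpy as np

def fit_linear_probe(features, targets, l2=1e-3):
    """Closed-form ridge regression on frozen features (encoder weights are untouched)."""
    x = np.hstack([features, np.ones((len(features), 1))])  # append a bias column
    gram = x.T @ x + l2 * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ targets)

def predict(weights, features):
    x = np.hstack([features, np.ones((len(features), 1))])
    return x @ weights

def probe_mae(train_x, train_y, test_x, test_y, fraction, seed=0):
    """Fit the probe on a random fraction of the labels and report test mean absolute error."""
    rng = np.random.default_rng(seed)
    n_labeled = max(1, int(round(fraction * len(train_x))))
    idx = rng.choice(len(train_x), size=n_labeled, replace=False)
    weights = fit_linear_probe(train_x[idx], train_y[idx])
    return float(np.mean(np.abs(predict(weights, test_x) - test_y)))

# Synthetic stand-ins for frozen embeddings and a scalar target.
rng = np.random.default_rng(0)
true_w = rng.normal(size=16)

def synth(n):
    feats = rng.normal(size=(n, 16))
    return feats, feats @ true_w + 55.0 + rng.normal(scale=0.5, size=n)

train_x, train_y = synth(2000)
test_x, test_y = synth(500)
mae_full = probe_mae(train_x, train_y, test_x, test_y, fraction=1.0)
mae_1pct = probe_mae(train_x, train_y, test_x, test_y, fraction=0.01)
print(f"MAE with 100% labels: {mae_full:.2f}  with 1% labels: {mae_1pct:.2f}")
```

How far the 1% number falls from the 100% number depends almost entirely on how linearly the frozen features expose the target, which is why this protocol is a test of the representation and not of the head.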

When we project the learned embeddings:

- Prior reconstruction methods → diffuse, entangled clusters
- EchoJEPA → clean anatomical and functional organization, with structure separated from acquisition noise

The model has, without supervision, learned to organize the cardiac cycle.

## What This Means Going Forward

Echocardiography is one modality. The same argument applies anywhere physiology moves, anywhere measurements are corrupted by physics, anywhere the signal you care about is latent — not pixel-level.

Surgical video. Endoscopy. Fetal monitoring. Continuous glucose and wearable biosignals. Multi-modal ICU streams.

All of these have the same two properties: noisy acquisition and meaningful temporal dynamics. World models that predict latent structure across time are, by construction, the right inductive bias for this data.

The clinical translation path also becomes clearer. Frozen world model encoders that generalize well under label scarcity matter enormously in medicine — where labeled data is expensive, rare disease populations are small, and annotator expertise is the bottleneck. EchoJEPA's 79%-at-1%-labels result is not an ablation point. It is a deployment argument.

## The AMI Bet Is Also a Healthcare Bet

AMI is building general-purpose world models. But the places where world models prove their worth will not be the easy ones. They will be the domains where surface statistics fail, where noise is structured but uninformative, where understanding requires modeling dynamics rather than matching patterns.

That description fits healthcare better than almost any other domain.

The $1B is a bet that the JEPA architecture — learn what is predictable in latent space, discard the rest — is the right foundation for intelligent systems. EchoJEPA is one piece of evidence that this bet is correct.

📄 Paper: https://arxiv.org/abs/2602.02603
💻 Code: https://github.com/bowang-lab/EchoJEPA
🐦 Original thread: https://x.com/BoWang87/status/2019864109517611440
