Captain Safari: A World Engine

Yu-Cheng Chou¹, Xingrui Wang¹, Yitong Li², Jiahao Wang¹, Hanting Liu¹,
Cihang Xie³, Alan Yuille¹, Junfei Xiao¹
¹Johns Hopkins University   ²Tsinghua University   ³UC Santa Cruz

Overview

World engines aim to synthesize long, 3D-consistent videos that support interactive exploration of a scene under user-controlled camera motion. However, existing systems struggle under aggressive 6-DoF trajectories and complex outdoor layouts: they lose long-range geometric coherence, deviate from the target path, or collapse into overly conservative motion.

To this end, we introduce Captain Safari, a pose-conditioned world engine that generates videos by retrieving from a persistent world memory. Given a camera path, our method maintains a dynamic local memory and uses a retriever to fetch pose-aligned world tokens, which then condition video generation along the trajectory. This design enables the model to maintain stable 3D structure while accurately executing challenging camera maneuvers.

To evaluate this setting, we curate OpenSafari, a new in-the-wild FPV dataset containing high-dynamic drone videos with verified camera trajectories. Across video quality, 3D consistency, and trajectory following, Captain Safari substantially outperforms state-of-the-art camera-controlled video generators.

How It Works

Captain Safari is a pose-aware world engine that generates long-horizon, 3D-consistent FPV videos from any user-specified camera trajectory. By retrieving pose-aligned world memory, it keeps geometry stable across large viewpoint changes and reconstructs crisp, well-formed structures while faithfully tracking aggressive 6-DoF motion.

Memory-Guided Generation

Captain Safari Pipeline. Given a local memory, we represent each time step by a pose token and its associated memory tokens. Our retriever is designed to (i) jointly encode pose-memory pairs into a coherent world representation, and (ii) extract, for any query pose, a compact set of pose-aligned tokens that summarize the most relevant parts of this local world.
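As a rough illustration of the retrieval idea, the sketch below scores each stored time step by how close its pose is to the query pose and returns a softmax-weighted blend of the stored memory tokens. It is not the paper's retriever, which is learned and jointly encodes pose-memory pairs. The distance-based scoring, the `retrieve_tokens` name, the `temperature` parameter, and the toy three-number poses are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve_tokens(query_pose, memory, temperature=1.0):
    """Blend stored memory tokens according to pose similarity.

    memory is a list of (pose, tokens) pairs, one per time step.
    ASSUMPTION: poses are plain float vectors, every step stores the
    same number of tokens, and similarity is negative squared distance.
    A learned retriever would replace this hand-written scoring.
    """
    scores = []
    for pose, _ in memory:
        dist2 = sum((q - p) ** 2 for q, p in zip(query_pose, pose))
        scores.append(-dist2 / temperature)
    weights = softmax(scores)

    n_tokens = len(memory[0][1])
    dim = len(memory[0][1][0])
    blended = [[0.0] * dim for _ in range(n_tokens)]
    for w, (_, step_tokens) in zip(weights, memory):
        for i, tok in enumerate(step_tokens):
            for j, value in enumerate(tok):
                blended[i][j] += w * value
    return weights, blended


# Toy local memory: two time steps, two 2-D tokens each (hypothetical values).
memory = [
    ([0.0, 0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]),
    ([5.0, 0.0, 0.0], [[2.0, 2.0], [3.0, 3.0]]),
]
weights, tokens = retrieve_tokens([0.1, 0.0, 0.0], memory)
print(weights)
print(tokens)
```

Because the query pose sits next to the first stored pose, nearly all of the weight falls on that step and the returned tokens are close to its stored tokens. The real system would then feed such pose-aligned tokens to the video generator as conditioning.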

OpenSafari

OpenSafari Data Pipeline. OpenSafari is a new in-the-wild FPV dataset with rigorously verified camera trajectories, designed to stress-test geometry-consistent, camera-controllable video generation. We curate clips through a compact, multi-stage pipeline that filters, reconstructs, and verifies trajectories, yielding clean, motion-rich videos with reliable camera paths.

Experimental Analysis

Benchmark on camera-controlled video generation (↓ lower is better, ↑ higher is better). Captain Safari ranks first, or ties for first, on every 3D consistency and trajectory following metric, with competitive video quality. Compared to the ablated variant without memory, Captain Safari improves 3D consistency and trajectory following, with only a slight trade-off in video quality.

                          Video Quality       3D Consistency      Trajectory Following
Model                     FVD ↓     LPIPS ↓   MEt3R ↓   Recon. ↑  AUC@30 ↑  AUC@15 ↑  CosSim ↑
Geometry Forcing          2662.75   0.667     0.4834    0.877     0.168     0.056     0.429
Real-CamI2V               1585.61   0.513     0.3703    0.923     0.174     0.051     0.296
Wan2.2-5B-Control-Camera  1387.75   0.545     0.3932    0.767     0.181     0.054     0.420
Captain Safari w/o Mem.   998.47    0.504     0.3720    0.912     0.193     0.068     0.508
Captain Safari            1023.46   0.512     0.3690    0.968     0.200     0.068     0.563
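To make the memory ablation concrete, the short script below computes the percent change of the full model relative to the variant without memory, using only the numbers in the benchmark table. The variable names and the better/worse labelling are illustrative choices, not part of the original evaluation code.

```python
# Values copied from the benchmark table (last two rows).
ablation = {"FVD": 998.47, "LPIPS": 0.504, "MEt3R": 0.3720, "Recon.": 0.912,
            "AUC@30": 0.193, "AUC@15": 0.068, "CosSim": 0.508}
full = {"FVD": 1023.46, "LPIPS": 0.512, "MEt3R": 0.3690, "Recon.": 0.968,
        "AUC@30": 0.200, "AUC@15": 0.068, "CosSim": 0.563}

# Metrics marked with a down arrow in the table.
LOWER_IS_BETTER = {"FVD", "LPIPS", "MEt3R"}


def relative_change(metric):
    """Percent change of the full model relative to the ablation."""
    return 100.0 * (full[metric] - ablation[metric]) / ablation[metric]


changes = {metric: relative_change(metric) for metric in full}

for metric, pct in changes.items():
    if pct == 0:
        verdict = "tie"
    elif (pct < 0) == (metric in LOWER_IS_BETTER):
        verdict = "better with memory"
    else:
        verdict = "worse with memory"
    print(f"{metric:8s} {pct:+6.2f}%  {verdict}")
```

On these numbers, adding memory raises Recon. by about 6.1% and CosSim by about 10.8%, while FVD rises by about 2.5%, which is the slight video-quality trade-off noted in the caption.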

Human preference. Users overwhelmingly prefer Captain Safari across all criteria, where it captures 67.33% of votes on average. The memory-removed variant ranks a distant second, while the baseline competitors each receive single-digit preference.

Model                     Video Quality   3D Consistency  Trajectory Following  Average
Geometry Forcing          0.20%           0.00%           0.20%                 0.13%
Real-CamI2V               4.20%           6.40%           4.40%                 5.00%
Wan2.2-5B-Control-Camera  3.20%           3.80%           6.40%                 4.47%
Captain Safari w/o Mem.   25.00%          24.20%          20.00%                23.07%
Captain Safari            67.40%          65.60%          69.00%                67.33%
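As a quick sanity check on the preference table, the snippet below recomputes the Average column as the unweighted mean of the three criteria and confirms that each criterion's votes sum to 100%. The dictionary layout and names are illustrative; the numbers are taken directly from the table.

```python
# Per-model vote shares (%) for: video quality, 3D consistency, trajectory following.
votes = {
    "Geometry Forcing": (0.20, 0.00, 0.20),
    "Real-CamI2V": (4.20, 6.40, 4.40),
    "Wan2.2-5B-Control-Camera": (3.20, 3.80, 6.40),
    "Captain Safari w/o Mem.": (25.00, 24.20, 20.00),
    "Captain Safari": (67.40, 65.60, 69.00),
}

# Unweighted mean over the three criteria, rounded as in the table.
averages = {model: round(sum(shares) / 3, 2) for model, shares in votes.items()}

# Total share per criterion across all five models.
column_totals = [sum(shares[i] for shares in votes.values()) for i in range(3)]

for model, avg in averages.items():
    print(f"{model:26s} {avg:6.2f}%")
print("column totals:", [round(total, 2) for total in column_totals])
```

The recomputed means match the reported Average column, so the 67.33% figure for Captain Safari is the mean of its three per-criterion shares.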

Qualitative Comparison

Prompt

A swift journey through a verdant forest, capturing the essence of nature in motion.

Prompt

Soaring over rugged terrain towards a distant valley under a bright sky.

Prompt

Aerial view of a soccer field surrounded by lush green hills and misty cliffs.

Prompt

A thrilling descent through rugged terrain under a cloudy sky.

Qualitative Results

Camera Trajectory
Prompt

A serene beach with waves crashing and palm trees lining the shore under a bright blue sky.

Camera Trajectory
Prompt

A serene coastal view unfolds, with the sun casting a bright, sparkling path across the turquoise waters as the camera glides over the rocky shoreline towards a lush green hillside.

Camera Trajectory
Prompt

Aerial view capturing the stunning contrast between rugged cliffs and turquoise waters.

Camera Trajectory
Prompt

A thrilling snowmobile ride through a sunlit forest at sunset.

Camera Trajectory
Prompt

Aerial view soaring over snow-covered mountain peaks under a clear sky.

Camera Trajectory
Prompt

Exploring an abandoned, graffiti-covered structure with crumbling walls and overgrown vegetation.

Camera Trajectory
Prompt

Aerial view of ancient ruins by the sea.

Camera Trajectory
Prompt

Exploring ancient ruins under a bright sky.

Camera Trajectory
Prompt

Exploring an abandoned building with vibrant graffiti and broken windows.

Camera Trajectory
Prompt

A scenic drive through a mountainous valley, featuring lush green fields and snow-capped peaks under a bright blue sky.

Camera Trajectory
Prompt

A steam train speeds through a rugged, mountainous landscape, leaving a trail of smoke behind.

Camera Trajectory
Prompt

A breathtaking aerial view of snow-covered mountain peaks under a clear sky.

BibTeX

@inproceedings{chou2026captain,
  title={Captain Safari: A World Engine},
  author={Chou, Yu-Cheng and Wang, Xingrui and Li, Yitong and Wang, Jiahao and Liu, Hanting and Xie, Cihang and Yuille, Alan and Xiao, Junfei},
  booktitle={Preprint},
  year={2026}
}