Distill3R: Democratizing 3D Foundation Models on Commodity Hardware
The paper "Distill3R: A Pipeline for Democratizing 3D Foundation Models on Commodity Hardware" by Brandon Leblanc and Charalambos Poullis was presented at the Canadian Conference on AI, Robots & Vision (AI/CRV) 2026.
TL;DR: State-of-the-art multi-view 3D reconstruction has moved to foundation models that take hundreds of GPUs and the better part of a week to train, putting them out of reach for most academic labs. Distill3R distills that geometric reasoning into a compact 72M-parameter student that trains in under three days on a single workstation, runs roughly 5× faster than its teacher, and still recovers globally consistent 3D structure.
The pipeline rests on two ideas. First, an offline teacher-caching stage decouples the expensive foundation model from the training loop: a Fast3R teacher (650M parameters, originally trained on 128 A100s for six days) runs once over the data to produce point maps and confidence maps, which are compressed with float16 and run-length encoding and written to disk. The student then trains directly against this cache, so no teacher inference ever happens inside the training loop. Second, a confidence-aware distillation loss uses the teacher's own uncertainty to weight the geometric supervision, down-weighting unreliable predictions, preventing degenerate solutions, and keeping training stable on commodity GPUs. The student itself is deliberately lightweight: a shared DUNE ViT-Small encoder, a compact six-layer fusion transformer with cross-view attention, and two DPT heads that predict global and local 3D coordinates along with confidence. Trained on six standard datasets (CO3D-v2, ScanNet++, Habitat, MegaDepth, BlendedMVS, and ARKitScenes) using two RTX 6000 Ada GPUs, it achieves a 9× parameter reduction and a 5× inference speedup at 128 views relative to the teacher, while preserving the structural consistency required for functional 3D awareness. The aim is not to beat the largest foundation models, but to give labs without cluster-scale compute a reproducible single-workstation recipe — and a practical way to specialize 3D models on their own domain-specific data at minimal cost.
Research paper: https://arxiv.org/abs/2602.00865