Conformal Prediction · Vision-Language Navigation · Uncertainty Quantification · VLN-DUET · VLN-HAMT · Score Functions · ICRA 2027

Conformal Prediction for Vision-and-Language Navigation (DUET + HAMT)

Thesis Research -- VLN-DUET Uncertainty Propagation (2026) · Lead Researcher

Vision-and-Language Navigation (VLN) agents operating in unseen environments require calibrated uncertainty estimates to know when to defer to human oversight. A navigation agent that is 70% confident may be catastrophically wrong -- high confidence does not guarantee correctness. Conformal Prediction (CP) offers distribution-free coverage guarantees but requires systematic evaluation across multiple VLN architectures and uncertainty propagation formulations to determine which methods actually improve set efficiency without sacrificing coverage.

Evaluating conformal prediction for navigation requires navigating three compounding challenges: (1) Multiple navigation architectures with fundamentally different action representations — one operating on a topological map of the environment, the other on local candidate steps — make fair cross-system comparison non-trivial; (2) Navigation trajectories are short (mean 6-8 steps), making them structurally incompatible with temporal uncertainty propagation methods designed for long-horizon tasks where running-average methods need many steps to converge; and (3) Twenty-one propagation formulations must be benchmarked systematically to distinguish genuine improvements from configuration artifacts. Building a unified experimental framework bridging two architectures with incompatible software ecosystems added significant engineering overhead before scientific evaluation could begin.

Built a unified experimental framework supporting both VLN-DUET and VLN-HAMT navigation architectures under a consistent evaluation protocol, enabling direct comparison across systems that would otherwise be benchmarked in isolation. Evaluated three classes of uncertainty scoring methods — threshold-based, adaptive prediction sets, and regularized adaptive sets — across both architectures. Both a passive observation mode (does uncertainty improve over time?) and an active intervention mode (does restricting the agent to high-confidence actions improve success rates?) were evaluated across all configurations.
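The evaluation protocol above follows the standard split-conformal recipe: score held-out calibration episodes, take a conformal quantile, then form per-step prediction sets over candidate actions. A minimal sketch of two of the tested score classes (threshold-based THR and adaptive APS, with APS randomization omitted for brevity); function names and the exact scoring details here are illustrative, not the thesis code:

```python
import numpy as np

def thr_score(probs, label):
    """THR nonconformity: one minus the probability of the true action."""
    return 1.0 - probs[label]

def aps_score(probs, label):
    """APS nonconformity: cumulative probability mass of actions ranked
    at or above the true action (tie-breaking randomization omitted)."""
    order = np.argsort(-probs)
    cum = np.cumsum(probs[order])
    rank = int(np.where(order == label)[0][0])
    return cum[rank]

def calibrate(scores, alpha):
    """Split-conformal quantile giving >= 1 - alpha marginal coverage."""
    n = len(scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, min(q, 1.0), method="higher")

def prediction_set(probs, qhat, score_fn):
    """All candidate actions whose nonconformity score is within qhat."""
    return [a for a in range(len(probs)) if score_fn(probs, a) <= qhat]
```

The prediction-set size under a fixed `qhat` is exactly the efficiency metric the benchmark tracks: smaller sets at the same coverage level mean a more informative uncertainty estimate.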

Designed and executed a systematic benchmark of 21 uncertainty propagation formulations across 3 VLN architectures and 4 alpha levels. Key finding: 16 of 21 temporal propagation formulations catastrophically inflate prediction set sizes (mean 17-38 actions, saturating the full action space) on finite-horizon VLN trajectories. Root cause: VLN's short trajectories prevent running-average methods (EMA, ACI) from converging and cause control-based methods (PID) to overshoot.
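The convergence failure can be seen with arithmetic alone. For an exponential moving average with decay `beta`, a fraction `beta**t` of the propagated value after `t` steps is still the initialization rather than observed evidence. A small illustration (not the thesis code; `beta = 0.9` is an assumed, typical smoothing factor):

```python
# Why EMA-style temporal propagation cannot converge on short VLN
# trajectories: after t steps, a fraction beta**t of the propagated
# score is still the initial value, not trajectory evidence.

def ema_init_weight(beta: float, steps: int) -> float:
    """Fraction of an EMA still attributable to its initialization."""
    return beta ** steps

def ema(values, beta: float, init: float = 1.0) -> float:
    """Standard exponential moving average over a score sequence."""
    s = init
    for x in values:
        s = beta * s + (1.0 - beta) * x
    return s

# On a 7-step episode with beta = 0.9, roughly 48% of the propagated
# score is still the (typically pessimistic) initialization, inflating
# the calibrated threshold and hence the prediction set. At 50 steps
# -- the long-horizon regime these methods were designed for -- the
# leftover weight is under 1%.
print(ema_init_weight(0.9, 7))
print(ema_init_weight(0.9, 50))
```

Control-based propagation (PID) fails for the symmetric reason: its error-correction terms are tuned over horizons long enough to damp oscillation, so on a 7-step episode they overshoot before any feedback can act.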

Discovered that augmenting the base uncertainty score with a logarithmic confidence penalty — additionally down-weighting actions where the model is both uncertain and assigns low probability — improves prediction set efficiency by 18-47% across all tested architectures while maintaining valid coverage guarantees. The improvement holds across architectures because the augmentation operates on the model's raw output probabilities, which all tested systems produce in the same form. This is a counterintuitive finding: a simple one-term modification outperforms complex temporal propagation methods that attempt to accumulate uncertainty across navigation steps.
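One plausible form of the augmentation: add a `-log p` term, weighted by a coefficient, to the base nonconformity score. The log term grows rapidly for low-probability actions and is near zero for confident ones, which is exactly the "uncertain and low-probability" down-weighting described above. The weight `lam` and the THR base score below are assumptions for illustration; the thesis' exact formulation may differ:

```python
import numpy as np

def base_score(probs, label):
    """Baseline THR-style nonconformity score."""
    return 1.0 - probs[label]

def logpen_score(probs, label, lam=0.1, eps=1e-8):
    """Base score plus a logarithmic confidence penalty.
    -log p explodes for low-probability actions, pushing them past the
    calibrated threshold, while barely moving confident candidates.
    lam is a hypothetical penalty weight; eps guards against log(0)."""
    return base_score(probs, label) + lam * (-np.log(probs[label] + eps))

# For a peaked action distribution, the penalty widens the score gap
# between the confident action and the low-probability tail, so a fixed
# conformal threshold admits fewer tail actions -- smaller sets at the
# same coverage level.
probs = np.array([0.70, 0.20, 0.07, 0.03])
```

Because the penalty reads only the model's output probabilities, it applies unchanged to any architecture that emits a distribution over candidate actions, which is why the efficiency gain transfers across DUET and HAMT.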

Built a unified research codebase supporting both navigation architectures with consistent interfaces, enabling controlled comparison between systems previously evaluated only in isolation. Includes automated pipeline orchestration, the full suite of 21 uncertainty propagation formulations, and reproducible experiment configuration — a research infrastructure contribution that enables future VLN uncertainty research without rebuilding from scratch. Full calibration benchmarks run approximately 24 hours per configuration on available hardware.

Provides the first systematic conformal prediction evaluation across multiple VLN architectures, establishing which uncertainty quantification methods work and why most temporal propagation approaches fail on short-horizon navigation trajectories. The logarithmic penalty finding — that a simple score augmentation outperforms complex temporal propagation — is a counterintuitive and publication-worthy result targeting ICRA 2027. The unified experimental framework enables reproducible benchmarking for future VLN uncertainty research without the setup overhead of the initial investigation.

3 architectures tested · 21 propagation formulations · 4 alpha levels · 18-47% efficiency improvement · 1.9 mean set size (baseline) · 1.38 mean set size (log-prob penalty) · 7 steps mean trajectory length · 24 h calibration runtime
PyTorch · DeepSpeed ZeRO-2 · VLN-DUET · VLN-HAMT · THR · APS · RAPS · Conformal Prediction · 21 formulations (EMA, CumulMax, ACI, PID, entropy-family, etc.) · R2R · R4R · REVERIE · SOON · Coverage guarantees · Set size · Intervention rate · A40 · V100