Uncertainty Estimation
DROID-W estimates accurate frame-wise uncertainty maps by leveraging multi-view feature consistency.
We present a robust, real-time RGB SLAM system that handles dynamic environments through differentiable, uncertainty-aware bundle adjustment. Traditional SLAM methods typically assume static scenes, leading to tracking failures in the presence of motion. Recent dynamic SLAM approaches attempt to address this challenge using predefined dynamic priors or uncertainty-aware mapping, but they remain limited when confronted with unknown dynamic objects or highly cluttered scenes where geometric mapping becomes unreliable. In contrast, our method estimates per-pixel uncertainty by exploiting multi-view visual feature inconsistency, enabling robust tracking and reconstruction even in real-world environments. The proposed system achieves state-of-the-art accuracy in camera pose and scene geometry estimation in cluttered dynamic scenarios while running in real time at around 8 FPS.
Our proposed DROID-W takes a sequence of RGB images as input and simultaneously estimates camera poses while recovering scene geometry. It alternates between pose-depth refinement and uncertainty optimization in an iterative manner. The proposed uncertainty-aware dense bundle adjustment weights reprojection residuals with per-pixel uncertainty \(u\) to mitigate the influence of dynamic distractors. In addition, we use predicted monocular depth \(D\) as a regularizer in bundle adjustment, improving its robustness in highly dynamic environments. For the uncertainty optimization module, we first extract DINOv2 features from the input images and then iteratively update the dynamic uncertainty map by leveraging multi-view feature consistency. Specifically, feature consistency is measured by the cosine similarity between features of image \(I_i\) and the corresponding features in image \(I_j\), where the rigid-motion correspondences \(p_{ij}\) are derived from the current pose and depth estimates.
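The two coupled steps above can be sketched in a few lines. This is a minimal illustration, not the actual DROID-W implementation: the function names, the linear mapping from cosine similarity to uncertainty, the confidence weight \(1-u\), and the depth-regularization weight `lam` are all our assumptions for exposition.

```python
import numpy as np

def feature_consistency_uncertainty(feat_i, feat_j_warped, eps=1e-8):
    """Per-pixel dynamic uncertainty from multi-view feature inconsistency.

    feat_i:        (H, W, C) dense features of image I_i (e.g. DINOv2)
    feat_j_warped: (H, W, C) features of I_j sampled at the rigid-motion
                   correspondences p_ij (from the current pose/depth estimates)
    Returns u in [0, 1]: high where features disagree across views,
    i.e. where the rigid-scene assumption likely fails (dynamic pixels).
    The linear map (1 - cos) / 2 is an illustrative choice.
    """
    dot = (feat_i * feat_j_warped).sum(axis=-1)
    norm = np.linalg.norm(feat_i, axis=-1) * np.linalg.norm(feat_j_warped, axis=-1)
    cos_sim = dot / (norm + eps)        # cosine similarity in [-1, 1]
    return 0.5 * (1.0 - cos_sim)        # map to uncertainty in [0, 1]

def weighted_ba_terms(p_ij_pred, p_ij_obs, u, depth, depth_mono, lam=0.1):
    """Residual terms of the uncertainty-aware dense bundle adjustment.

    Down-weights reprojection residuals by confidence (1 - u) so dynamic
    distractors contribute little, and regularizes the optimized depth
    toward the monocular prediction D (assumed already scale-aligned).
    """
    w = 1.0 - u                                      # per-pixel confidence
    reproj = w[..., None] * (p_ij_pred - p_ij_obs)   # (H, W, 2) weighted residual
    depth_reg = lam * (depth - depth_mono)           # (H, W) mono-depth prior
    return reproj, depth_reg
```

In this sketch, alternating optimization would re-warp features with the refined pose and depth, recompute `u`, and re-solve the weighted bundle adjustment until convergence.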
To rigorously evaluate our approach in truly in-the-wild settings, we introduce a dataset of 7 casually captured outdoor sequences. The DROID-W dataset features various dynamic objects and cluttered environments, making it a challenging real-world benchmark for dynamic SLAM methods.
DROID-W reconstructs high-quality point clouds under challenging real-world environments.