PRISM-SLAM: Probabilistic Ray-Grounded Inference for Scale-aware Metric SLAM


Abstract

Monocular SLAM has long suffered from scale ambiguity and from tracking failure in dynamic environments. While recent vision foundation models (VFMs) provide strong zero-shot depth priors, naively integrating these deterministic predictions ignores predictive uncertainty and frame-to-frame scale inconsistencies. We propose PRISM-SLAM, a real-time framework that integrates VFM priors into a structured Bayesian factor graph to achieve scale-aware, metric-consistent localization and mapping. Specifically, we introduce a Plücker Ray-Distance Factor that anchors monocular observations in a globally consistent metric coordinate system, resolving scale drift by making the metric scale Fisher-identifiable. To handle environmental dynamics, we derive an epistemic uncertainty proxy from temporal depth consistency and formulate a Dynamic Scene Uncertainty Gating (DSUG) mechanism. This soft-gating approach probabilistically down-weights dynamic distractors without the computational overhead of semantic segmentation masks. By running VFM inference and geometric tracking asynchronously in a multi-process architecture, PRISM-SLAM provides verified metric output at 30 FPS from RGB input alone, bridging the gap between foundation models and real-world robotic applications. Evaluated on the TUM RGB-D and 7-Scenes benchmarks, PRISM-SLAM achieves a metric SE(3) Absolute Trajectory Error (ATE) nearly identical to its oracle-aligned Sim(3) error, showing that the system produces deployment-ready metric trajectories without any post-hoc scale correction.

Trajectory

Verified metric trajectories from PRISM-SLAM (monocular RGB): indoor (TUM) and outdoor (KITTI seq. 03).

Figure: PRISM-SLAM trajectory on KITTI Seq. 03
Figure: PRISM-SLAM trajectory on TUM RGB-D fr1/xyz

System Architecture

PRISM-SLAM Pipeline

PRISM-SLAM uses a four-process architecture: (1) a C++ ORB tracker on CPU at ∼30 FPS, (2) a DA3-Large GPU worker that processes keyframes asynchronously, (3) a Python metric optimizer with log-domain WLS scale estimation and Kalman filtering, and (4) optional DSUG-gated dense reconstruction. Online, the tracker streams keyframes to DA3 and fuses metric depth and ray constraints back into the frontend with DSUG-gated optimization. For final maps, optimized poses are batched into DA3’s multi-view module (paper Sec. 3.7) before DSUG-filtered back-projection.
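
The scale-estimation step in process (3) can be illustrated with a minimal sketch. It assumes that each keyframe provides up-to-scale SLAM depths and metric DA3 depths at the same points, that the confidence weights act as inverse variances, and that the log-scale follows a random walk between keyframes. The names `wls_log_scale` and `LogScaleKalman`, and all parameter values, are hypothetical and are not taken from the paper.

```python
import numpy as np

def wls_log_scale(slam_depth, metric_depth, weights):
    """Weighted least-squares estimate of s = log(metric / slam).

    For a constant model, the WLS solution is the weighted mean of the
    per-point log-ratios. The returned variance 1 / sum(w) assumes each
    weight is an inverse variance (an assumption of this sketch).
    """
    slam_depth = np.asarray(slam_depth, dtype=float)
    metric_depth = np.asarray(metric_depth, dtype=float)
    w = np.asarray(weights, dtype=float)
    log_ratio = np.log(metric_depth) - np.log(slam_depth)
    s_hat = float(np.sum(w * log_ratio) / np.sum(w))
    return s_hat, float(1.0 / np.sum(w))

class LogScaleKalman:
    """Scalar Kalman filter on the log-scale with a random-walk model."""

    def __init__(self, log_scale=0.0, variance=1.0, process_noise=1e-4):
        self.log_scale = log_scale
        self.variance = variance
        self.process_noise = process_noise

    def update(self, measurement, measurement_var):
        # Predict: the log-scale may drift slightly between keyframes.
        self.variance += self.process_noise
        # Correct: blend prediction and measurement by their variances.
        gain = self.variance / (self.variance + measurement_var)
        self.log_scale += gain * (measurement - self.log_scale)
        self.variance *= 1.0 - gain
        return float(np.exp(self.log_scale))  # metric scale factor

# Synthetic keyframes: the true scale is 2.5 with 5% multiplicative noise.
rng = np.random.default_rng(0)
kf = LogScaleKalman()
for _ in range(5):
    slam = rng.uniform(0.5, 4.0, size=200)
    metric = 2.5 * slam * np.exp(rng.normal(0.0, 0.05, size=200))
    s_hat, var = wls_log_scale(slam, metric, np.full(200, 1.0 / 0.05**2))
    scale = kf.update(s_hat, var)
print(scale)
```

Working in the log domain turns the multiplicative scale into an additive quantity, so a linear filter applies directly and the recovered scale stays positive.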



DSUG: Temporal Depth Consistency as Uncertainty Proxy

DSUG Bayesian Information Matrix
TUM RGB-D fr3/walking_static. (a) Input RGB. (b) Ground-truth depth (reference only; not used at runtime). (c) Pose-compensated DA3 depth residual, used as the DSUG epistemic uncertainty proxy u(p); bright regions mark temporally unstable geometry at moving subjects. (d) DSUG maps this uncertainty into the optimization information matrix, so dynamic regions are down-weighted without hard semantic masks.
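
A minimal sketch of such soft gating is given here, assuming the previous keyframe's depth has already been warped into the current view using the estimated pose. The relative-residual proxy, the weight function sigma0^2 / (sigma0^2 + kappa * u^2), and the parameters `sigma0` and `kappa` are illustrative assumptions; the paper's exact formulation is not reproduced on this page.

```python
import numpy as np

def dsug_weights(depth_cur, depth_warped, sigma0=0.05, kappa=1.0):
    """Soft per-pixel weights from a temporal depth-consistency residual.

    depth_cur    : depth predicted for the current frame (H x W, metres)
    depth_warped : earlier depth warped into the current view (H x W)
    Returns (u, w): the uncertainty proxy u(p) and a weight in [0, 1]
    that can scale the information of residuals at each pixel.
    """
    depth_cur = np.asarray(depth_cur, dtype=float)
    depth_warped = np.asarray(depth_warped, dtype=float)
    valid = (depth_cur > 0) & (depth_warped > 0)

    # Relative depth disagreement; zero where both frames agree.
    u = np.zeros_like(depth_cur)
    u[valid] = np.abs(depth_cur[valid] - depth_warped[valid]) / depth_cur[valid]

    # Smooth down-weighting: w = 1 for u = 0 and w -> 0 as u grows.
    w = sigma0**2 / (sigma0**2 + kappa * u**2)
    w[~valid] = 0.0  # pixels without a valid depth pair carry no information
    return u, w

cur = np.array([[2.0, 2.0], [2.0, 0.0]])
warped = np.array([[2.0, 3.0], [2.0, 2.0]])  # top-right pixel moved
u, w = dsug_weights(cur, warped)
print(w)
```

Because the weight varies continuously with u(p), a pixel is never discarded by a hard threshold; it only contributes less as its geometry becomes less stable over time.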

Metric Scale in Monocular SLAM

Recent methods incorporate metric depth priors but still rely on post-hoc Sim(3) trajectory alignment. In evaluation, Sim(3) alignment rescales the estimated trajectory to the ground truth (an oracle scale correction), whereas SE(3) alignment applies only a rigid transform, so any scale error remains in the reported ATE. PRISM-SLAM is the first to achieve true metric tracking evaluated under strict SE(3) alignment.

Method         Depth Prior   Eval     Metric?   FPS
ORB-SLAM3      -             Sim(3)   ×         30
DROID-SLAM     -             Sim(3)   ×         5
DPV-SLAM++     -             Sim(3)   ×         50
GO-SLAM        -             Sim(3)   ×         3
MonoGS         -             Sim(3)   ×         3
Splat-SLAM     Omnidata      Sim(3)   ×         1.2
MASt3R-SLAM    MASt3R        Sim(3)   ×         15
VGGT-SLAM      VGGT          Sim(3)   ×         20
EC3R-SLAM      VGGT          Sim(3)   ×         36
WildGS-SLAM    Metric3D v2   Sim(3)   ×         0.5
GigaSLAM       UniDepth      n.s.     ×         -
PRISM (ours)   DA3           SE(3)    ✓         30
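
The difference between the two evaluation protocols can be made concrete with a short sketch. It uses the standard Umeyama least-squares alignment on trajectory positions and switches the scale term on (Sim(3)) or off (SE(3)). The function names and the synthetic trajectory are illustrative, not the evaluation code used for the tables on this page.

```python
import numpy as np

def umeyama(est, gt, with_scale):
    """Least-squares alignment of est onto gt (both N x 3 positions).

    with_scale=True fits a Sim(3) transform; with_scale=False fixes the
    scale to 1, giving a rigid SE(3) alignment.
    """
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    e, g = est - mu_e, gt - mu_g
    cov = g.T @ e / len(est)              # 3 x 3 cross-covariance
    u, d, vt = np.linalg.svd(cov)
    sign = np.eye(3)
    if np.linalg.det(u) * np.linalg.det(vt) < 0:
        sign[2, 2] = -1.0                 # keep a proper rotation
    rot = u @ sign @ vt
    if with_scale:
        scale = np.trace(np.diag(d) @ sign) / e.var(axis=0).sum()
    else:
        scale = 1.0
    trans = mu_g - scale * (rot @ mu_e)
    return scale, rot, trans

def ate_rmse(est, gt, with_scale):
    """ATE RMSE after aligning est to gt."""
    scale, rot, trans = umeyama(est, gt, with_scale)
    aligned = scale * (est @ rot.T) + trans
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))

# Synthetic example: an estimate whose shape is perfect but 10% too small.
rng = np.random.default_rng(0)
gt = np.cumsum(rng.normal(size=(200, 3)), axis=0)
est = gt / 1.1
print("Sim(3) ATE:", ate_rmse(est, gt, with_scale=True))
print("SE(3)  ATE:", ate_rmse(est, gt, with_scale=False))
```

In this example the Sim(3) error is essentially zero because the oracle scale absorbs the 10% error, while the SE(3) error is not. This is why a small gap between the two numbers indicates a reliable runtime metric scale.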

Experimental Results

All runs use monocular RGB on an NVIDIA RTX 4500 Ada unless noted. We benchmark TUM RGB-D, 7-Scenes, and BONN Dynamic, reporting ATE RMSE in centimeters under standard Sim(3) alignment and, for PRISM-SLAM, additional SE(3) ATE using the system’s own metric scale (no oracle scale correction). Unless noted, numbers are the median of three independent runs.

TUM RGB-D — Static Sequences (fr1)

Method         FPS   Metric   fr1/xyz         fr1/rpy
                              Sim3    SE3     Sim3    SE3
ORB-SLAM3      30    ×        0.9     -       -       -
DeepV2D        2     ×        6.4     -       10.5    -
DeepFactors    30    ×        3.5     -       4.3     -
DPV-SLAM       15    ×        1.0     -       3.0     -
DPV-SLAM++     50    ×        1.0     -       3.2     -
GO-SLAM        3     ×        1.0     -       1.9     -
DROID-SLAM     5     ×        1.2     -       2.6     -
MASt3R-SLAM    15    ×        0.9     -       2.7     -
VGGT-SLAM      20    ×        1.4     -       3.0     -
PRISM (ours)   30    ✓        2.86    3.04    4.10    4.94

ATE RMSE (cm). SE(3) uses runtime metric scale (no oracle Sim(3) correction). On fr1/xyz, the SE(3) ATE closely matches the Sim(3) ATE with only 3.3% scale error.

TUM RGB-D — Static & Dynamic Sequences (fr3)

Method         Metric   sit-static      walk-static     sit-xyz         walk-xyz
                        Sim3    SE3     Sim3    SE3     Sim3    SE3     Sim3    SE3
ORB-SLAM3      ×        0.7     -       0.9     -       25.3    -       48.8    -
DROID-SLAM     ×        0.4     -       0.3     -       17.6    -       21.4    -
MonoGS         ×        1.1     -       3.6     -       16.2    -       18.9    -
WildGS-SLAM    ×        0.5     -       0.6     -       4.1     -       5.2     -
PRISM (ours)   ✓        1.6     1.8     1.9     2.7     12.7    20.0    23.2    26.8

ATE RMSE (cm). Baselines evaluated under Sim(3) (Umeyama); PRISM additionally reports SE(3) under its own runtime metric scale. The same strict metric fidelity extends to fr3 (paper Sec. 4.2): e.g. sit-static Sim(3) 1.6 cm vs. SE(3) 1.8 cm. WildGS-SLAM uses Metric3D v2 but reports only Sim(3) ATE (∼0.5 FPS, GPU-only).

7-Scenes Indoor Localisation

Method         Metric   Chess      Fire        Heads      Office      Pumpkin    Redkit.    Stairs      Mean
                        Sim3/SE3   Sim3/SE3    Sim3/SE3   Sim3/SE3    Sim3/SE3   Sim3/SE3   Sim3/SE3    Sim3/SE3
ORB-SLAM3      ×        2.1/-      2.4/-       1.2/-      3.5/-       4.8/-      5.1/-      8.9/-       4.0/-
DROID-SLAM     ×        1.8/-      2.1/-       1.3/-      2.7/-       3.4/-      4.2/-      12.5/-      4.0/-
MonoGS         ×        2.5/-      2.8/-       1.4/-      4.1/-       4.5/-      4.8/-      35.7/-      8.5/-
VGGT-SLAM      ×        4.1/-      3.9/-       2.8/-      5.5/-       7.2/-      8.4/-      15.2/-      6.7/-
PRISM (ours)   ✓        7.1/7.1    10.8/17.7   8.8/68.4   11.7/15.5   7.9/7.9    3.6/12.1   11.5/11.9   8.8/12.0

ATE RMSE (cm). KeyNet frontend with Nf = 4096 features (paper Table 7). Heads is dominated by extreme rotation and weak metric observability, hence its large SE(3) gap (8.8 cm vs. 68.4 cm). For PRISM, the Mean column averages Sim(3)/SE(3) over the six non-degenerate scenes (all but Heads), matching the project table convention; the Sim(3) mean is 8.8 cm whether or not Heads is included.
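
The averaging convention for the PRISM row can be checked directly from the per-scene values in the table. The short script below is only an arithmetic check; the helper name `mean_ate` is ours.

```python
scenes = ["Chess", "Fire", "Heads", "Office", "Pumpkin", "Redkit.", "Stairs"]
sim3 = [7.1, 10.8, 8.8, 11.7, 7.9, 3.6, 11.5]    # PRISM Sim(3) ATE, cm
se3 = [7.1, 17.7, 68.4, 15.5, 7.9, 12.1, 11.9]   # PRISM SE(3) ATE, cm

def mean_ate(values, exclude=()):
    """Mean ATE over all scenes except those named in `exclude`."""
    kept = [v for name, v in zip(scenes, values) if name not in exclude]
    return round(sum(kept) / len(kept), 1)

print(mean_ate(sim3))                     # 8.8  (all seven scenes)
print(mean_ate(sim3, exclude={"Heads"}))  # 8.8  (six scenes)
print(mean_ate(se3, exclude={"Heads"}))   # 12.0 (six scenes, as in the table)
print(mean_ate(se3))                      # 20.1 (all seven scenes)
```

Including Heads would raise the SE(3) mean from 12.0 cm to 20.1 cm, so the choice of convention matters for SE(3) but not for Sim(3).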

BONN Dynamic — Monocular RGB vs. RGB-D (Sim(3) / SE(3))

Method         Sensor     balloon           balloon2          pers trk          balloon trk
                          Sim(3)   SE(3)    Sim(3)   SE(3)    Sim(3)   SE(3)    Sim(3)   SE(3)
ORB-SLAM3      RGB-D      5.8      -        17.7     -        70.7     -        -        -
DynaSLAM       RGB-D      3.0      -        2.9      -        6.1      -        -        -
ReFusion       RGB-D      17.5     -        25.4     -        28.9     -        -        -
RoDyn-SLAM     RGB-D      7.9      -        11.5     -        14.5     -        -        -
PRISM (ours)   Mono RGB   9.8      18.1     14.0     17.7     36.7     39.5     7.8      9.1

ATE RMSE (cm). RGB-D baselines use active hardware depth (standard Sim(3) reporting); dashes mark sequences not evaluated in the paper table. PRISM reports both alignments: on balloon2 and pers trk, SE(3) (17.7 / 39.5 cm) closely tracks Sim(3) (14.0 / 36.7 cm), evidencing consistent runtime metric scale under heavy dynamics. See the paper (Table 6) for DSUG-only ablations on these splits.

BONN — DSUG Ablation (SE(3) ATE, cm)

Configuration        balloon   balloon2   pers trk   balloon trk   Mean (Δ)
Full system (Ours)   18.1      17.7       39.5       9.1           21.1
w/o DSUG             21.3      24.5       51.1       15.7          28.2 (+7.1)

Strict SE(3) alignment. Disabling DSUG raises the mean error by 7.1 cm (from 21.1 cm to 28.2 cm); on pers trk, ATE rises from 39.5 cm to 51.1 cm (paper Table 6).

KITTI Odometry (outdoor, first 500 frames)

Outdoor stress test with large depth range and high ego-speed; PRISM runs monocular RGB with metric SE(3) ATE. ORB-SLAM2 uses stereo to fix scale (reference only).

Method         Input      Metric SE(3) ATE [m]   t_rel [%]
ORB-SLAM2      Stereo     0.91                   0.71
PRISM (ours)   Mono RGB   4.30                   2.29

Sequence 03. Stereo row follows the paper’s Table 9 (†): two-camera input to bypass monocular scale ambiguity.


ViT-Driven Loop Closure (DA3 [CLS])

Loop Closure Results
Impact of ViT-driven loop closure on TUM fr1/xyz. Without loop closure, drift grows to 4.8 cm ATE RMSE; enabling it (2048-D [CLS] descriptors from DA3’s ViT (Yang et al., 2024), with geometric verification) reduces ATE to 2.9 cm (∼40% reduction) across 30 verified loop matches with high RANSAC inlier confidence. On a 600-frame fr3/sit-static extension, the system performs 31 closures at 100% empirical precision (0 false positives), lowering ATE from 2.61 cm to 1.67 cm and increasing valid tracked frames from 401 to 480.
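
The retrieval stage that precedes geometric verification can be sketched as a cosine-similarity search over stored keyframe descriptors. This is a minimal illustration, assuming one global descriptor per keyframe; the function `loop_candidates` and the parameters `min_gap`, `threshold`, and `top_k` are hypothetical, and the RANSAC verification step is not shown.

```python
import numpy as np

def loop_candidates(query, database, query_idx,
                    min_gap=30, threshold=0.9, top_k=3):
    """Return up to top_k (keyframe index, similarity) loop candidates.

    query     : global descriptor of the current keyframe, shape (D,)
    database  : descriptors of all keyframes so far, shape (N, D)
    query_idx : index of the current keyframe
    Keyframes closer than min_gap in time are skipped, so trivially
    similar neighbours are not reported as loops.
    """
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = db @ q                                     # cosine similarities
    eligible = np.arange(len(db)) <= query_idx - min_gap
    sims = np.where(eligible, sims, -np.inf)
    order = np.argsort(-sims)[:top_k]                 # best matches first
    return [(int(i), float(sims[i])) for i in order if sims[i] >= threshold]

# Toy database: keyframe 99 revisits the place seen at keyframe 10.
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(100, 2048))
query = descriptors[10] + 0.01 * rng.normal(size=2048)
print(loop_candidates(query, descriptors, query_idx=99))
```

Each returned candidate would then be passed to geometric verification, which is what keeps false positives out of the pose graph.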

DSUG-Gated Dense Metric Reconstruction

Interactive 3D viewer: dense map of fr1/desk2
Interactive 3D viewer: dense map of fr1/room

Offline pipeline (paper Sec. 3.7, Table 10): batches of globally optimized poses feed DA3 multi-view inference (N = 5, GT extrinsics in the evaluation), and the DSUG-filtered depth is then fused into a TSDF (DA3-MV uses 2 cm voxels vs. 1 cm for the filtered single-view baseline). On fr1/desk2, DA3-MV reduces Chamfer distance from 23.5 cm to 17.0 cm and raises F@5 cm from 20.7% to 31.7% relative to the filtered baseline.
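
For reference, the two reconstruction metrics can be computed as in the following sketch. It assumes one common convention: Chamfer distance as the average of the two directed mean nearest-neighbour distances, and F@τ as the harmonic mean of precision and recall at threshold τ. The paper's exact definitions may differ (some works sum the two directions instead of averaging them). The brute-force nearest-neighbour search is only suitable for small point sets.

```python
import numpy as np

def nearest_distances(a, b):
    """Distance from each point in a (N x 3) to its nearest point in b (M x 3)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.linalg.norm(diff, axis=2).min(axis=1)

def chamfer_and_fscore(pred, gt, tau=0.05):
    """Symmetric Chamfer distance and F-score at threshold tau (same units)."""
    d_pred_to_gt = nearest_distances(pred, gt)   # accuracy
    d_gt_to_pred = nearest_distances(gt, pred)   # completeness
    chamfer = 0.5 * (d_pred_to_gt.mean() + d_gt_to_pred.mean())
    precision = float((d_pred_to_gt < tau).mean())
    recall = float((d_gt_to_pred < tau).mean())
    if precision + recall == 0.0:
        return float(chamfer), 0.0
    return float(chamfer), 2.0 * precision * recall / (precision + recall)

# A 3 x 3 planar grid with 1 m spacing, and a copy shifted by 10 cm.
xs, ys = np.meshgrid(np.arange(3.0), np.arange(3.0))
gt_points = np.stack([xs.ravel(), ys.ravel(), np.zeros(9)], axis=1)
pred_points = gt_points + np.array([0.1, 0.0, 0.0])
print(chamfer_and_fscore(pred_points, gt_points, tau=0.05))
```

For dense maps with many points, a spatial index such as a KD-tree would replace the brute-force distance matrix.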

Ablation Study

Configuration            sit-st   walk-st   fr1/xyz   Mean (Δ)
Full system (Ours)       1.60     1.90      2.86      2.12
w/o Plücker Ray Factor   2.45     2.75      3.83      3.01 (+0.89)
w/o DSUG                 1.80     2.10      3.12      2.34 (+0.22)
w/o Log-domain Kalman    1.75     2.05      3.04      2.28 (+0.16)
w/o WLS                  1.70     2.00      2.99      2.23 (+0.11)

Paper Table 5: mean ATE (cm) as SE(3) over fr3/sit-static, fr3/walk-static, and fr1/xyz; full system 2.12 cm. Removing the Plücker ray-distance factor (+0.89 cm) is the dominant failure mode, consistent with cross-view ray constraints resolving the scale null-space. Disabling DSUG adds +0.22 cm on these mostly static splits (its impact is much larger on BONN dynamics; Table 6). Log-domain Kalman (+0.16 cm) and confidence-weighted WLS vs. a simple unweighted variant (+0.11 cm) further stabilize asynchronous scale fusion.
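
The geometric primitive behind a ray-distance constraint can be sketched as follows. A ray with unit direction d through origin o has Plücker coordinates (d, m) with moment m = o × d, and the distance from a point p to the ray's line is ‖p × d − m‖. This is only the textbook point-to-line distance; how the paper builds its factor on top of it (weighting, Jacobians, coupling to DA3 depth) is not shown here, and the function names are ours.

```python
import numpy as np

def plucker_ray(origin, direction):
    """Plücker coordinates (d, m) of the line through origin along direction."""
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)                 # unit direction
    m = np.cross(np.asarray(origin, dtype=float), d)
    return d, m

def ray_distance(point, d, m):
    """Perpendicular distance from point to the line (d, m).

    Uses p x d - m = (p - o) x d, whose norm is the distance for unit d.
    """
    return float(np.linalg.norm(np.cross(np.asarray(point, dtype=float), d) - m))

# A landmark 2 m in front of camera A (at the world origin), also observed
# by camera B, which sits 0.5 m to the side.
origin_b = np.array([0.5, 0.0, 0.0])
landmark = np.array([0.0, 0.0, 2.0])
d, m = plucker_ray(origin_b, landmark - origin_b)

# Re-scaling the landmark's depth along camera A's ray moves it off B's ray.
for s in (0.8, 1.0, 1.2):
    print(s, ray_distance(s * landmark, d, m))
```

In this toy setup with a known metric baseline, the residual vanishes only at s = 1, which illustrates how a cross-view ray constraint can penalize a mis-scaled depth.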


BibTeX

@article{prism-slam2026,
  title={PRISM-SLAM: Probabilistic Ray-Grounded Inference for Scale-aware Metric SLAM},
  author={Anonymous},
  journal={Under review},
  year={2026}
}