PRISM-SLAM: Probabilistic Ray-Grounded Inference for Scale-aware Metric SLAM

Abstract

Monocular SLAM has long suffered from scale ambiguity and from tracking failure in dynamic environments. Recent vision foundation models (VFMs) provide strong zero-shot depth priors, but naively integrating these deterministic predictions ignores predictive uncertainty and frame-to-frame scale inconsistency. We propose PRISM-SLAM, a real-time framework that integrates VFM priors into a structured Bayesian factor graph to achieve scale-aware, metric-consistent localization and mapping. Specifically, we introduce a Plücker Ray-Distance Factor that anchors monocular observations in a globally consistent metric coordinate system, resolving scale drift by making the metric scale Fisher-identifiable. To handle scene dynamics, we derive an epistemic uncertainty proxy from temporal depth consistency and formulate a Dynamic Scene Uncertainty Gating (DSUG) mechanism. This soft gating probabilistically down-weights dynamic distractors without the computational overhead of semantic segmentation masks. A multi-process architecture runs VFM inference and geometric tracking asynchronously, so PRISM-SLAM provides verified metric output at 30 FPS from RGB input alone, bridging the gap between foundation models and real-world robotic applications. Evaluated on the TUM RGB-D and 7-Scenes benchmarks, PRISM-SLAM achieves a metric SE(3) Absolute Trajectory Error (ATE) nearly identical to its oracle-aligned Sim(3) error, showing that the system can produce metric trajectories without any post-hoc scale correction.

Trajectory

Verified metric trajectories from PRISM-SLAM (monocular RGB): indoor (TUM) and outdoor (KITTI seq. 03).

KITTI Seq. 03

TUM RGB-D fr1/xyz

System Architecture

PRISM-SLAM uses a four-process architecture: (1) a C++ ORB tracker on CPU at ∼30 FPS, (2) a DA3-Large GPU worker that processes keyframes asynchronously, (3) a Python metric optimizer with log-domain WLS scale estimation and Kalman filtering, and (4) optional DSUG-gated dense reconstruction. Online, the tracker streams keyframes to DA3 and fuses metric depth and ray constraints back into the frontend with DSUG-gated optimization. For final maps, optimized poses are batched into DA3’s multi-view module (paper Sec. 3.7) before DSUG-filtered back-projection.
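The log-domain WLS scale estimate and Kalman filter in process (3) can be illustrated with a minimal sketch. It assumes each keyframe yields pairs of up-to-scale SLAM depths and metric network depths at the same pixels, plus a confidence weight per pair. The names `wls_log_scale` and `LogScaleKalman`, the random-walk process model, and all noise values are illustrative assumptions, not the system's actual implementation.

```python
import math

def wls_log_scale(slam_depths, metric_depths, weights, var_floor=1e-6):
    """Weighted least-squares estimate of log(s), where metric ≈ s * slam.

    In the log domain the model is log(metric) - log(slam) = log(s) + noise,
    so the WLS solution is the weighted mean of the log-ratios.
    Returns (log_scale, variance of that estimate).
    """
    log_ratios = [math.log(m) - math.log(d)
                  for d, m in zip(slam_depths, metric_depths)]
    w_sum = sum(weights)
    mean = sum(w * r for w, r in zip(weights, log_ratios)) / w_sum
    resid_var = sum(w * (r - mean) ** 2
                    for w, r in zip(weights, log_ratios)) / w_sum
    # Variance of a weighted mean: sigma^2 * sum(w^2) / (sum(w))^2.
    var_of_mean = resid_var * sum(w * w for w in weights) / (w_sum ** 2)
    return mean, max(var_of_mean, var_floor)


class LogScaleKalman:
    """Scalar Kalman filter on log(scale) with a random-walk process model (assumed)."""

    def __init__(self, log_scale=0.0, var=1.0, process_var=1e-4):
        self.log_scale = log_scale
        self.var = var
        self.process_var = process_var

    def update(self, z, r):
        self.var += self.process_var            # predict: scale may drift slowly
        gain = self.var / (self.var + r)        # Kalman gain
        self.log_scale += gain * (z - self.log_scale)
        self.var *= (1.0 - gain)
        return math.exp(self.log_scale)         # back to a positive scale factor


kf = LogScaleKalman()
z, r = wls_log_scale([1.0, 2.0, 4.0], [2.5, 5.0, 10.0], [0.9, 0.8, 0.5])
scale = kf.update(z, r)
```

Filtering the logarithm keeps the estimated scale positive and treats multiplicative errors symmetrically, which is one plausible reason to work in the log domain.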

DSUG: Temporal Depth Consistency as Uncertainty Proxy

TUM RGB-D fr3/walking static. (a) Input RGB. (b) Ground-truth depth (reference only; not used at runtime). (c) Pose-compensated DA3 depth residual as the DSUG epistemic uncertainty proxy u(p); bright regions mark temporally unstable geometry at moving subjects. (d) DSUG maps this variance into the optimization information matrix so dynamic regions are down-weighted without hard semantic masks.
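The idea of turning a temporal depth residual into a soft weight can be sketched as follows. This is only an illustration under stated assumptions: the previous keyframe's depth is taken to be already warped into the current view, the proxy u(p) is taken as a squared relative residual, and the mapping to a weight (with the hypothetical noise floor `sigma0`) is one simple choice, not necessarily the paper's.

```python
def dsug_weights(depth_cur, depth_prev_warped, sigma0=0.05):
    """Per-pixel soft weights in (0, 1] from a temporal depth residual.

    depth_prev_warped: previous keyframe depth, already pose-compensated into
    the current view (assumed to be done upstream).
    sigma0: hypothetical relative noise floor of the depth network.
    """
    weights = []
    for d_cur, d_prev in zip(depth_cur, depth_prev_warped):
        if d_cur <= 0.0 or d_prev <= 0.0:
            weights.append(0.0)                  # invalid depth: no information
            continue
        u = ((d_cur - d_prev) / d_cur) ** 2      # uncertainty proxy u(p)
        # Information scales as 1 / (sigma0^2 + u); normalise so that a
        # perfectly consistent pixel gets weight 1.
        weights.append(sigma0 ** 2 / (sigma0 ** 2 + u))
    return weights


# A static pixel (2.0 m in both frames) and a pixel on a moving subject.
w = dsug_weights([2.0, 2.0], [2.0, 3.0])
```

Each residual in the optimizer would then be multiplied by its weight, so unstable pixels contribute little without ever being hard-masked.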

Metric Scale in Monocular SLAM

Recent methods incorporate metric depth priors but still rely on post-hoc Sim(3) trajectory alignment. In evaluation, Sim(3) alignment fits rotation, translation, and a global scale to the ground truth, so an oracle corrects the scale; SE(3) alignment fits only rotation and translation, so any scale error remains in the reported ATE. To our knowledge, PRISM-SLAM is the first of these systems to report metric tracking under strict SE(3) alignment.
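The difference between the two alignments can be made concrete with a short NumPy sketch of Umeyama alignment with and without the scale term. The function names `align` and `ate_rmse` are illustrative and the trajectory is synthetic, generated only for this example.

```python
import numpy as np

def align(est, gt, with_scale):
    """Least-squares alignment of est onto gt (Umeyama, 1991).

    est, gt: (N, 3) arrays of corresponding positions.
    with_scale=True gives Sim(3) alignment, False gives SE(3).
    """
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    e, g = est - mu_e, gt - mu_g
    U, S, Vt = np.linalg.svd(g.T @ e / len(est))
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0                      # avoid a reflection
    R = U @ D @ Vt
    s = (S * np.diag(D)).sum() / (e ** 2).sum(axis=1).mean() if with_scale else 1.0
    t = mu_g - s * (R @ mu_e)
    return s * (est @ R.T) + t

def ate_rmse(est, gt, with_scale):
    err = align(est, gt, with_scale) - gt
    return float(np.sqrt((err ** 2).sum(axis=1).mean()))


rng = np.random.default_rng(0)
gt = rng.normal(size=(50, 3))
est = 0.9 * gt                              # trajectory with a 10% scale error
sim3_ate = ate_rmse(est, gt, with_scale=True)    # scale error is absorbed
se3_ate = ate_rmse(est, gt, with_scale=False)    # scale error is exposed
```

A system with a wrong scale can therefore look accurate under Sim(3) while its SE(3) error stays large, which is why the two columns are reported side by side.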

Method	Depth Prior	Eval	Metric?	FPS
ORB-SLAM3	—	Sim(3)	×	30
DROID-SLAM	—	Sim(3)	×	5
DPV-SLAM++	—	Sim(3)	×	50
GO-SLAM	—	Sim(3)	×	3
MonoGS	—	Sim(3)	×	3
Splat-SLAM	Omnidata	Sim(3)	×	1.2
MASt3R-SLAM	MASt3R	Sim(3)	×	15
VGGT-SLAM	VGGT	Sim(3)	×	20
EC3R-SLAM	VGGT	Sim(3)	×	36
WildGS-SLAM	Metric3D v2	Sim(3)	×	0.5
GigaSLAM	UniDepth	n.s.	×	—
PRISM (ours)	DA3	SE(3)	✓	30

Experimental Results

All runs use monocular RGB on an NVIDIA RTX 4500 Ada unless noted. We benchmark TUM RGB-D, 7-Scenes, and BONN Dynamic, plus an outdoor KITTI stress test, reporting ATE RMSE in centimeters (meters for KITTI) under standard Sim(3) alignment and, for PRISM-SLAM, additional SE(3) ATE using the system’s own metric scale (no oracle scale correction). Unless noted, numbers are the median of three independent runs.

TUM RGB-D — Static Sequences (fr1)

Method	FPS	Metric	fr1/xyz Sim(3)	fr1/xyz SE(3)	fr1/rpy Sim(3)	fr1/rpy SE(3)
ORB-SLAM3	30	×	0.9	—	—	—
DeepV2D	2	×	6.4	—	10.5	—
DeepFactors	30	×	3.5	—	4.3	—
DPV-SLAM	15	×	1.0	—	3.0	—
DPV-SLAM++	50	×	1.0	—	3.2	—
GO-SLAM	3	×	1.0	—	1.9	—
DROID-SLAM	5	×	1.2	—	2.6	—
MASt3R-SLAM	15	×	0.9	—	2.7	—
VGGT-SLAM	20	×	1.4	—	3.0	—
PRISM (ours)	30	✓	2.86	3.04	4.10	4.94

ATE RMSE (cm). SE(3) uses the runtime metric scale (no oracle Sim(3) correction). On fr1/xyz, the SE(3) ATE (3.04 cm) closely matches the Sim(3) ATE (2.86 cm), with a scale error of only 3.3%.

TUM RGB-D — Static & Dynamic Sequences (fr3)

Method	Metric	sit-static Sim(3)	sit-static SE(3)	walk-static Sim(3)	walk-static SE(3)	sit-xyz Sim(3)	sit-xyz SE(3)	walk-xyz Sim(3)	walk-xyz SE(3)
ORB-SLAM3	×	0.7	—	0.9	—	25.3	—	48.8	—
DROID-SLAM	×	0.4	—	0.3	—	17.6	—	21.4	—
MonoGS	×	1.1	—	3.6	—	16.2	—	18.9	—
WildGS-SLAM	×	0.5	—	0.6	—	4.1	—	5.2	—
PRISM (ours)	✓	1.6	1.8	1.9	2.7	12.7	20.0	23.2	26.8

ATE RMSE (cm). Baselines are evaluated under Sim(3) (Umeyama) alignment; PRISM additionally reports SE(3) under its own runtime metric scale. On the static sequences the SE(3) error stays close to the Sim(3) error (paper Sec. 4.2): sit-static 1.6 cm vs. 1.8 cm, walk-static 1.9 cm vs. 2.7 cm. The gap widens on the xyz sequences (sit-xyz 12.7 cm vs. 20.0 cm, walk-xyz 23.2 cm vs. 26.8 cm). WildGS-SLAM uses Metric3D v2 but reports only Sim(3) ATE (∼0.5 FPS, GPU-only).

7-Scenes Indoor Localisation

Method	Metric	Chess Sim(3)	Chess SE(3)	Fire Sim(3)	Fire SE(3)	Heads Sim(3)	Heads SE(3)	Office Sim(3)	Office SE(3)	Pumpkin Sim(3)	Pumpkin SE(3)	Redkit. Sim(3)	Redkit. SE(3)	Stairs Sim(3)	Stairs SE(3)	Mean Sim(3)	Mean SE(3)
ORB-SLAM3	×	2.1	—	2.4	—	1.2	—	3.5	—	4.8	—	5.1	—	8.9	—	4.0	—
DROID-SLAM	×	1.8	—	2.1	—	1.3	—	2.7	—	3.4	—	4.2	—	12.5	—	4.0	—
MonoGS	×	2.5	—	2.8	—	1.4	—	4.1	—	4.5	—	4.8	—	35.7	—	8.5	—
VGGT-SLAM	×	4.1	—	3.9	—	2.8	—	5.5	—	7.2	—	8.4	—	15.2	—	6.7	—
PRISM (ours)	✓	7.1	7.1	10.8	17.7	8.8	68.4	11.7	15.5	7.9	7.9	3.6	12.1	11.5	11.9	8.8	12.0

ATE RMSE (cm). KeyNet frontend with 4096 features (N_f = 4096; paper Table 7). Mean Sim(3) ATE is 8.8 cm, both across all seven scenes and with Heads excluded. Heads is dominated by extreme rotation and weak metric observability, hence its large SE(3) gap (68.4 cm vs. 8.8 cm under Sim(3)). The Mean column of the PRISM row reports Sim(3)/SE(3) averaged over the six non-degenerate scenes, matching the project table convention.
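The effect of excluding Heads from the Mean column can be checked directly from the PRISM row of the table above. The short script below only re-averages those published numbers; variable names are illustrative.

```python
scenes = ["Chess", "Fire", "Heads", "Office", "Pumpkin", "Redkit.", "Stairs"]
sim3 = [7.1, 10.8, 8.8, 11.7, 7.9, 3.6, 11.5]    # PRISM Sim(3) ATE, cm
se3 = [7.1, 17.7, 68.4, 15.5, 7.9, 12.1, 11.9]   # PRISM SE(3) ATE, cm

def mean(values):
    return sum(values) / len(values)

# Indices of the six scenes that remain once Heads is dropped.
keep = [i for i, name in enumerate(scenes) if name != "Heads"]

mean_sim3_all = mean(sim3)
mean_sim3_six = mean([sim3[i] for i in keep])
mean_se3_all = mean(se3)
mean_se3_six = mean([se3[i] for i in keep])

print(f"Sim(3): all seven {mean_sim3_all:.1f} cm, without Heads {mean_sim3_six:.1f} cm")
print(f"SE(3):  all seven {mean_se3_all:.1f} cm, without Heads {mean_se3_six:.1f} cm")
```

The Sim(3) mean is almost unchanged by the exclusion, while the SE(3) mean moves substantially, so the averaging convention matters mainly for the SE(3) figure.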

BONN Dynamic — Monocular RGB vs. RGB-D (Sim(3) / SE(3))

Method	Sensor	balloon Sim(3)	balloon SE(3)	balloon2 Sim(3)	balloon2 SE(3)	pers trk Sim(3)	pers trk SE(3)	balloon trk Sim(3)	balloon trk SE(3)
ORB-SLAM3	RGB-D	5.8	—	17.7	—	70.7	—	—	—
DynaSLAM	RGB-D	3.0	—	2.9	—	6.1	—	—	—
ReFusion	RGB-D	17.5	—	25.4	—	28.9	—	—	—
RoDyn-SLAM	RGB-D	7.9	—	11.5	—	14.5	—	—	—
PRISM (ours)	Mono RGB	9.8	18.1	14.0	17.7	36.7	39.5	7.8	9.1

ATE RMSE (cm). RGB-D baselines use active hardware depth (standard Sim(3) reporting); dashes mark sequences not evaluated in the paper table. PRISM reports both alignments: on balloon2 and pers trk, SE(3) (17.7 / 39.5 cm) closely tracks Sim(3) (14.0 / 36.7 cm), evidencing consistent runtime metric scale under heavy dynamics. See the paper (Table 6) for DSUG-only ablations on these splits.

BONN — DSUG Ablation (SE(3) ATE, cm)

Configuration	balloon	balloon2	pers trk	balloon trk	Mean (Δ)
Full system (Ours)	18.1	17.7	39.5	9.1	21.1
w/o DSUG	21.3	24.5	51.1	15.7	28.2 (+7.1)

Strict SE(3) alignment. Disabling DSUG raises the mean error by 7.1 cm (from 21.1 cm to 28.2 cm); on pers trk, ATE rises from 39.5 cm to 51.1 cm (paper Table 6).

KITTI Odometry (outdoor, first 500 frames)

Outdoor stress test with large depth range and high ego-speed; PRISM runs monocular RGB with metric SE(3) ATE. ORB-SLAM2 uses stereo to fix scale (reference only).

Method	Input	Metric	SE(3) ATE [m]	t_rel [%]
ORB-SLAM2^†	Stereo	✓	0.91	0.71
PRISM (ours)	Mono RGB	✓	4.30	2.29

Sequence 03. Stereo row follows the paper’s Table 9 (†): two-camera input to bypass monocular scale ambiguity.

ViT-Driven Loop Closure (DA3 [CLS])

Impact of ViT-driven loop closure on TUM fr1/xyz. Without loop closure (LC), drift grows to 4.8 cm ATE RMSE; activating it (2048-D [CLS] descriptor from DA3’s ViT (Yang et al., 2024), followed by geometric verification) reduces ATE to 2.9 cm (a ∼40% reduction) across 30 verified loop matches with high RANSAC inlier confidence. On a 600-frame fr3/sit-static extension, the system performs 31 loop closures at 100% empirical precision (0 false positives), lowering ATE from 2.61 cm to 1.67 cm and increasing valid tracked frames from 401 to 480.
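A descriptor-based loop-closure front end of this kind is commonly organised as retrieval followed by verification, sketched below. Everything here is an assumption for illustration: the cosine-similarity ranking, the thresholds `min_sim` and `min_gap`, and the `verify` callback that stands in for the geometric (RANSAC) check are hypothetical and not taken from the paper.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def loop_candidates(query_idx, query_desc, keyframes, min_sim=0.9, min_gap=30):
    """Rank earlier keyframes by descriptor similarity.

    keyframes: list of (frame_idx, descriptor) pairs.
    min_gap skips temporally adjacent frames, which always look similar.
    """
    hits = []
    for idx, desc in keyframes:
        if query_idx - idx < min_gap:
            continue
        sim = cosine(query_desc, desc)
        if sim >= min_sim:
            hits.append((sim, idx))
    return [idx for sim, idx in sorted(hits, reverse=True)]

def close_loops(query_idx, query_desc, keyframes, verify):
    """Keep only candidates that pass the geometric check verify(query_idx, idx)."""
    return [idx for idx in loop_candidates(query_idx, query_desc, keyframes)
            if verify(query_idx, idx)]


# Toy 3-D descriptors stand in for the high-dimensional [CLS] vectors.
keyframes = [(0, [1.0, 0.0, 0.0]), (10, [0.0, 1.0, 0.0]), (95, [1.0, 0.01, 0.0])]
candidates = loop_candidates(100, [1.0, 0.05, 0.0], keyframes)
```

Appearance similarity alone can match different places that look alike, so the verification step is what protects precision.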

DSUG-Gated Dense Metric Reconstruction

fr1/desk2

fr1/room

Offline pipeline (paper Sec. 3.7, Table 10): batches of globally optimized poses feed DA3 multi-view inference (N = 5, GT extrinsics in the evaluation), then DSUG-filtered depth is fused in a TSDF (DA3-MV uses 2 cm voxels vs. 1 cm for the filtered single-view baseline). On fr1/desk2, DA3-MV cuts Chamfer distance from 23.5 cm to 17.0 cm and raises F@5 cm from 20.7% to 31.7% relative to the filtered baseline.
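For readers unfamiliar with the two reconstruction metrics, the sketch below computes a Chamfer distance and an F-score at a 5 cm threshold on small point sets. Conventions vary between papers (squared vs. unsquared distances, sum vs. mean of the two directions); this version assumes unsquared distances and the mean of both directions, which may differ from the paper's definition. The brute-force search is only meant for tiny examples.

```python
import math

def nearest_distances(src, dst):
    """Distance from each point in src to its nearest neighbour in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]

def chamfer_and_fscore(pred, gt, tau=0.05):
    """pred, gt: lists of (x, y, z) points in metres; tau: F-score threshold."""
    d_pred = nearest_distances(pred, gt)    # accuracy direction
    d_gt = nearest_distances(gt, pred)      # completeness direction
    chamfer = 0.5 * (sum(d_pred) / len(d_pred) + sum(d_gt) / len(d_gt))
    precision = sum(d < tau for d in d_pred) / len(d_pred)
    recall = sum(d < tau for d in d_gt) / len(d_gt)
    if precision + recall == 0:
        return chamfer, 0.0
    return chamfer, 2 * precision * recall / (precision + recall)


pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.1)]
chamfer, f_score = chamfer_and_fscore(pred, gt)
```

Because both metrics are computed in metres against ground truth, they are sensitive to metric scale as well as to surface detail.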

Ablation Study

Configuration	sit-st	walk-st	fr1/xyz	Mean (Δ)
Full system (Ours)	1.60	1.90	2.86	2.12
w/o Plücker Ray Factor	2.45	2.75	3.83	3.01 (+0.89)
w/o DSUG	1.80	2.10	3.12	2.34 (+0.22)
w/o Log-domain Kalman	1.75	2.05	3.04	2.28 (+0.16)
w/o WLS	1.70	2.00	2.99	2.23 (+0.11)

Paper Table 5: mean ATE (cm) as SE(3) over fr3/sit-static, fr3/walk-static, and fr1/xyz; the full system reaches 2.12 cm. Removing the Plücker ray-distance factor causes the largest degradation (+0.89 cm), consistent with cross-view ray constraints resolving the scale null-space. Disabling DSUG adds +0.22 cm on these mostly static splits (its impact is much larger on BONN dynamics; Table 6). Replacing the log-domain Kalman filter (+0.16 cm) or the confidence-weighted WLS with a simple unweighted variant (+0.11 cm) also degrades accuracy, indicating that both help stabilize asynchronous scale fusion.
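To give some intuition for how a ray constraint can pin down scale, the sketch below builds a Plücker line (d, m = c × d) for a camera ray and evaluates two residuals for a 3D landmark: its perpendicular offset from the ray and the difference between its along-ray distance and a metric range prediction. This is a generic geometric illustration under assumptions. The function names, the two-part residual, and the use of a predicted range are hypothetical and are not the paper's exact factor.

```python
import math

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def plucker_from_ray(center, direction):
    """Plücker coordinates (d, m) of the line through center along direction."""
    n = math.sqrt(dot(direction, direction))
    d = tuple(x / n for x in direction)      # unit direction
    m = cross(center, d)                     # moment vector
    return d, m

def ray_distance_residual(point, center, ray, metric_range):
    """Residuals of a landmark against a ray and a metric range (assumed form).

    Returns (perp, range_err): perp is the vector point x d - m, whose norm is
    the point-to-line distance; range_err is the along-ray distance minus the
    predicted metric range.
    """
    d, m = ray
    perp = tuple(a - b for a, b in zip(cross(point, d), m))
    along = dot(tuple(p - c for p, c in zip(point, center)), d)
    return perp, along - metric_range


center = (0.0, 0.0, 0.0)
ray = plucker_from_ray(center, (0.0, 0.0, 2.0))
perp, range_err = ray_distance_residual((0.3, 0.4, 5.0), center, ray, metric_range=4.5)
perp_norm = math.sqrt(dot(perp, perp))
```

The perpendicular term alone is unchanged when the whole scene and all camera centers are rescaled together, whereas the range term is expressed in absolute units, which is one way such a factor could remove the scale null-space.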

BibTeX

@article{prism-slam2026,
  title={PRISM-SLAM: Probabilistic Ray-Grounded Inference for Scale-aware Metric SLAM},
  author={Anonymous},
  journal={Under review},
  year={2026}
}
