Abstract: Neural implicit representations have had a significant impact on simultaneous localization and mapping (SLAM) by enabling robots to build continuous, differentiable, and high-fidelity 3D maps from sensor data. However, as the scale and complexity of the environment increase, neural SLAM approaches face renewed challenges in the back-end optimization process to keep up with runtime requirements and maintain global consistency. We introduce MISO, a hierarchical optimization approach that leverages multiresolution submaps to achieve efficient and scalable neural implicit reconstruction. For local SLAM within each submap, we develop a hierarchical optimization scheme with learned initialization that substantially reduces the time needed to optimize the implicit submap features. To correct estimation drift globally, we develop a hierarchical method to align and fuse the multiresolution submaps, leading to substantial acceleration by avoiding the need to decode the full scene geometry. MISO significantly improves computational efficiency and estimation accuracy of neural signed distance function (SDF) SLAM on large-scale real-world benchmarks.
Given odometry and point-cloud observations from range sensors, a robot aims to estimate its trajectory and build a local submap represented as a multiresolution feature grid (similar to NGLOD and Instant-NGP). Organizing implicit features into a hierarchy of grids effectively disentangles information at different spatial resolutions. At inference time, interpolated features from different hierarchy levels are aggregated and processed by a decoder network to predict the scene geometry, represented as a Signed Distance Function (SDF). Local SLAM is performed by jointly optimizing robot trajectory and the submap feature grids.
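The query path described above (interpolate features at each level, aggregate them, decode to an SDF value) can be sketched in NumPy as follows. This is a minimal illustration, not MISO's implementation: the dense grids over the unit cube, the summation across levels, the tiny random ReLU decoder, and all function names are assumptions made for the example.

```python
import numpy as np

def trilinear_interp(grid, pts):
    """Interpolate vertex features grid (R, R, R, F) at pts (N, 3) in [0, 1]^3."""
    R = grid.shape[0]
    x = np.clip(pts, 0.0, 1.0) * (R - 1)             # continuous vertex coordinates
    i0 = np.minimum(np.floor(x).astype(int), R - 2)  # lower corner of enclosing cell
    t = x - i0                                       # fractional offsets in [0, 1]
    out = np.zeros((pts.shape[0], grid.shape[-1]))
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((t[:, 0] if dx else 1.0 - t[:, 0])
                     * (t[:, 1] if dy else 1.0 - t[:, 1])
                     * (t[:, 2] if dz else 1.0 - t[:, 2]))
                corner = grid[i0[:, 0] + dx, i0[:, 1] + dy, i0[:, 2] + dz]
                out += w[:, None] * corner
    return out

def decode_sdf(feat, params):
    """Hypothetical decoder: one hidden ReLU layer mapping features to a scalar."""
    W1, b1, W2, b2 = params
    hidden = np.maximum(feat @ W1 + b1, 0.0)
    return (hidden @ W2 + b2)[:, 0]

def query_sdf(grids, params, pts):
    """Sum interpolated features over all levels (an assumed aggregation), then decode."""
    feat = sum(trilinear_interp(g, pts) for g in grids)
    return decode_sdf(feat, params)

rng = np.random.default_rng(0)
F, H = 4, 16                                         # feature and hidden sizes (arbitrary)
grids = [rng.normal(size=(R, R, R, F)) for R in (5, 9)]  # coarse and fine level
params = (rng.normal(size=(F, H)), np.zeros(H), rng.normal(size=(H, 1)), np.zeros(1))
pts = rng.uniform(0.0, 1.0, size=(10, 3))
sdf = query_sdf(grids, params, pts)                  # one SDF prediction per query point
```

In local SLAM, the grid entries (and the poses that map sensor points into the submap frame) would be the optimization variables; here they are random placeholders.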
Suppose that we have already optimized the level-1 features and obtained the corresponding SDF and error residuals (measured with respect to the observations). How can we use these results to initialize the level-2 features? In the paper, we first analyze a simpler case in which the decoder network is assumed to be linear. In this case, the optimal features at the next level can be computed via a closed-form mapping from the residuals of the previous levels. Motivated by this observation, we design an encoder network that learns this mapping in the general, nonlinear case. We show that this method effectively initializes the scene geometry and substantially reduces mapping time.
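To make the linear-decoder case concrete, the sketch below works through a 1D toy problem. It assumes scalar features, a decoder of the form sdf = w * feature, and linear interpolation, so the level-2 features that best explain the level-1 residuals follow from an ordinary least-squares solve. The setup and names are illustrative assumptions and are not taken from the paper's derivation.

```python
import numpy as np

def interp_matrix(pts, R):
    """Linear interpolation matrix (N, R) for a 1D grid with R vertices on [0, 1]."""
    x = np.clip(pts, 0.0, 1.0) * (R - 1)
    i0 = np.minimum(np.floor(x).astype(int), R - 2)
    t = x - i0
    A = np.zeros((pts.shape[0], R))
    rows = np.arange(pts.shape[0])
    A[rows, i0] = 1.0 - t       # weight of the left vertex
    A[rows, i0 + 1] = t         # weight of the right vertex
    return A

def solve_level(A, target, w):
    """Closed-form (least-squares) features f minimizing ||w * A f - target||^2."""
    f, *_ = np.linalg.lstsq(w * A, target, rcond=None)
    return f

rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 1.0, 200)
observed = 0.3 * np.sin(2 * np.pi * pts)  # stand-in for observed SDF values
w = 2.0                                   # assumed linear decoder weight

# Level 1: fit a coarse grid, then measure what it fails to explain.
A1 = interp_matrix(pts, 5)
f1 = solve_level(A1, observed, w)
res1 = observed - w * (A1 @ f1)

# Level 2: initialize the finer grid directly from the level-1 residuals.
A2 = interp_matrix(pts, 17)
f2 = solve_level(A2, res1, w)
res2 = res1 - w * (A2 @ f2)

print(np.linalg.norm(res1), np.linalg.norm(res2))
```

With a nonlinear decoder no such closed form is available in general, which is the motivation given above for learning the residual-to-feature mapping with an encoder network.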
In large environments or over long time durations, the robot trajectory estimates will inevitably drift and cause the submaps to be misaligned. To address this challenge, MISO introduces an approach to align and fuse all submaps in the global reference frame. Compared to existing approaches, which rely on decoding the scene geometry into an explicit representation like occupancy, mesh, or distance field, MISO performs alignment and fusion directly using the implicit features in the multiresolution submaps in a hierarchical manner. We show that this results in significantly faster optimization and outperforms other methods under large initial alignment errors.
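The idea of aligning submaps by comparing implicit features directly, coarse level first, can be illustrated with a 1D toy example. The sketch below estimates a translation between two submaps by a grid search over candidate offsets, narrowing the search window at each finer level. It is an assumption-laden illustration only: MISO's actual optimizer, pose parameterization, and cost are not reproduced, and every name here is hypothetical.

```python
import numpy as np

def interp1d(grid, x):
    """Linearly interpolate vertex features grid (R,) defined on [0, 1] at points x."""
    R = grid.shape[0]
    s = np.clip(x, 0.0, 1.0) * (R - 1)
    i0 = np.minimum(np.floor(s).astype(int), R - 2)
    t = s - i0
    return (1.0 - t) * grid[i0] + t * grid[i0 + 1]

def feature_cost(grid_a, grid_b, u, d):
    """Mean squared feature mismatch when submap B is offset by d relative to A."""
    return float(np.mean((interp1d(grid_a, u + d) - interp1d(grid_b, u)) ** 2))

def align_coarse_to_fine(levels_a, levels_b, u, d_init, half_width, n_candidates=21):
    """Search offsets level by level, shrinking the window around the best candidate."""
    d = d_init
    for grid_a, grid_b in zip(levels_a, levels_b):
        candidates = d + np.linspace(-half_width, half_width, n_candidates)
        costs = [feature_cost(grid_a, grid_b, u, c) for c in candidates]
        d = float(candidates[int(np.argmin(costs))])
        half_width /= 10.0
    return d

def scene(x):
    """Toy scalar 'feature field' shared by both submaps."""
    return np.exp(-((x - 0.5) / 0.25) ** 2)

# The true offset is a multiple of both grid spacings, so the two submaps'
# features agree exactly at the correct alignment in this toy setup.
d_true = 0.125
resolutions = (9, 65)                              # coarse, then fine
levels_a = [scene(np.linspace(0.0, 1.0, R)) for R in resolutions]
levels_b = [scene(np.linspace(0.0, 1.0, R) + d_true) for R in resolutions]

u = np.linspace(0.3, 0.6, 50)                      # sample points in B's local frame
d_hat = align_coarse_to_fine(levels_a, levels_b, u, d_init=0.0, half_width=0.25)
```

No geometry is decoded at any point: the cost only compares interpolated features, which mirrors the property highlighted above. A full system would estimate 3D poses with a gradient-based solver rather than a 1D search.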
We gratefully acknowledge support from ARL DCIST CRA W911NF-17-2-0181, ONR N00014-23-1-2353, and NSF CCF-2112665 (TILOS).
@inproceedings{tian2025miso,
title={{MISO}: Multiresolution Submap Optimization for Efficient Globally Consistent Neural Implicit Reconstruction},
author={Tian, Yulun and Cao, Hanwen and Kim, Sunghwan and Atanasov, Nikolay},
booktitle={Robotics: Science and Systems (RSS)},
year={2025}
}