SERF: Spatiotemporal Environment and Robot Feature Map for Long-Horizon Mobile Manipulation

Sunghwan Kim1*, Byeonghyun Pak2*, Kehan Long3, Yulun Tian4, Nikolay Atanasov1

1UC San Diego   2Agency for Defense Development   3SceniX Inc.   4University of Michigan

Spatiotemporal Environment and Robot Feature (SERF) Map

A persistent spatial memory for the robot, objects, and task-relevant scene state.

Our SERF map maintains a 4D scene representation of both the environment and the robot body in a shared latent feature space. We build the map using neural points, which are 3D points with learnable features trained to reconstruct dense DINO embeddings.
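As a rough illustration of what "trained to reconstruct dense DINO embeddings" could mean, the sketch below projects neural points into a camera image and scores their features against a dense feature map with a cosine loss. The pinhole projection, the nearest-pixel lookup, the cosine loss, and all function and parameter names are assumptions made for this example; the text above does not specify how SERF renders or supervises its point features.

```python
import numpy as np

def project_points(points_cam, fx, fy, cx, cy):
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) pixel coordinates.

    Assumes every point has z > 0 (lies in front of the camera).
    """
    x, y, z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    return np.stack([fx * x / z + cx, fy * y / z + cy], axis=1)

def feature_reconstruction_loss(point_feats, pixels, target_map):
    """Mean (1 - cosine similarity) between point features and target features.

    point_feats: (N, D) learnable per-point features.
    pixels:      (N, 2) projected (u, v) pixel coordinates.
    target_map:  (H, W, D) dense feature map, e.g. upsampled DINO embeddings.
    Points that project outside the image are ignored.
    """
    h, w, _ = target_map.shape
    u = np.round(pixels[:, 0]).astype(int)
    v = np.round(pixels[:, 1]).astype(int)
    visible = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    if not np.any(visible):
        return 0.0
    feats = point_feats[visible]
    target = target_map[v[visible], u[visible]]  # nearest-pixel lookup, (M, D)
    dots = np.sum(feats * target, axis=1)
    norms = np.linalg.norm(feats, axis=1) * np.linalg.norm(target, axis=1) + 1e-8
    return float(np.mean(1.0 - dots / norms))
```

In a real training loop this loss would be minimized with respect to `point_feats` in an automatic-differentiation framework; NumPy is used here only to keep the example short.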

To track changes, we construct 3D keypoint correspondences between consecutive observations, estimate an object-level rigid transform, and update the corresponding points. We extend the neural points to the robot body by sampling surface points from a URDF and positioning them via forward kinematics at each step.
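The two geometric steps above can be sketched as follows. The rigid transform is estimated here with the standard SVD-based Kabsch solution, which is one common choice and not necessarily the estimator SERF uses. The robot-body step assumes that per-link world poses are already available from a forward-kinematics solver. All function names and the dictionary layout are hypothetical.

```python
import numpy as np

def estimate_rigid_transform(src, dst):
    """Least-squares rotation R and translation t with dst ~= src @ R.T + t.

    src, dst: (N, 3) matched 3D keypoints, N >= 3 and not all collinear.
    """
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

def apply_rigid_transform(R, t, points):
    """Move (N, 3) points by the rigid transform (R, t)."""
    return points @ R.T + t

def place_robot_points(link_points, link_poses):
    """Place per-link surface points in the world frame.

    link_points: dict of link name -> (N_i, 3) points in the link frame
                 (e.g. sampled once from the URDF meshes).
    link_poses:  dict of link name -> 4x4 world-from-link transform,
                 assumed to come from a forward-kinematics solver.
    """
    placed = [
        apply_rigid_transform(link_poses[name][:3, :3], link_poses[name][:3, 3], pts)
        for name, pts in link_points.items()
    ]
    return np.concatenate(placed, axis=0)
```

Given keypoints matched on one object across two observations, `R, t = estimate_rigid_transform(prev_kp, curr_kp)` followed by `apply_rigid_transform(R, t, object_points)` moves that object's neural points to their new pose while leaving their features untouched.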

BEHAVIOR-1K

Task 21: Collecting Children's Toys

Videos are played at 20x speed.

Video panels: Third-Person View, Egocentric Robot Observations, and SERF Map PCA Visualization.

We construct and maintain the SERF map using only egocentric observations and proprioceptive state.
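A PCA visualization of a feature map is commonly produced by projecting each point's high-dimensional feature onto the top three principal components and rescaling them to RGB. The sketch below shows that generic recipe; it is an assumption about how such a visualization is typically made, not a description of the exact rendering used for the videos, and the function name is hypothetical.

```python
import numpy as np

def pca_rgb(features):
    """Map (N, D) features to (N, 3) colors in [0, 1] via the top three principal components.

    Requires N >= 3 and D >= 3.
    """
    centered = features - features.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:3].T
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    # Rescale each component independently to the [0, 1] color range.
    return (proj - lo) / np.maximum(hi - lo, 1e-8)
```

Points with similar features receive similar colors, so regions that belong to the same object or robot link tend to appear as uniformly colored clusters.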

Map-conditioned VLA Policy

SERF map tokens provide a VLA policy with an explicit spatial representation.

We condition the VLA policy on SERF map tokens in addition to RGB observations, proprioceptive state, and the task embedding. We tokenize the SERF map across multiple reference frames, including global, end-effector, and robot-base frames, to capture both local and global context. These tokens provide the SERF VLA policy with allocentric scene memory and egocentric robot-environment context for long-horizon action decisions.
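The coordinate part of multi-frame tokenization can be illustrated with a minimal sketch: the same map points are re-expressed in the global, robot-base, and end-effector frames, and each position is paired with its point feature. How SERF actually pools, encodes, or orders these tokens is not described above, so the token layout, function names, and pose arguments here are all assumptions.

```python
import numpy as np

def to_frame(world_T_frame, points_world):
    """Express (N, 3) world-frame points in the frame whose 4x4 world pose is world_T_frame."""
    R, t = world_T_frame[:3, :3], world_T_frame[:3, 3]
    # Inverse rigid transform R^T (p - t), written for row-vector points.
    return (points_world - t) @ R

def map_tokens(points_world, feats, world_T_base, world_T_ee):
    """Build one (N, 3 + D) array per reference frame.

    Each row holds a point's position in that frame followed by its feature.
    world_T_base and world_T_ee are 4x4 poses of the robot base and the
    end effector in the world frame (assumed known from odometry and
    forward kinematics).
    """
    frames = {
        "global": np.eye(4),
        "base": world_T_base,
        "end_effector": world_T_ee,
    }
    return {
        name: np.concatenate([to_frame(T, points_world), feats], axis=1)
        for name, T in frames.items()
    }
```

The global-frame coordinates stay fixed as the robot moves, which suits allocentric memory, while the base and end-effector coordinates change with the robot's pose and describe where things are relative to the body and the gripper.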

Map-conditioned VLA policy framework using SERF map tokens.

Evaluation

We evaluate whether the persistent spatiotemporal memory provided by the SERF map improves long-horizon mobile manipulation over an image-only VLA policy. We also evaluate whether SERF supports scene-configuration generalization and failure recovery.

Long-Horizon Mobile Manipulation

BEHAVIOR-1K Task Progress

SERF outperforms PI0.5 (Image-only) across all tasks, follows more direct trajectories, and reaches subgoals faster.

Task 21: Collecting Children's Toys

Videos are played at 20x speed.

Video panels: PI0.5 (Image-only) and SERF (Map-conditioned).

Scene-Configuration Generalization

Out-of-Distribution (OOD) Configuration Shifts

Policies are trained only on the original in-distribution scenes and evaluated under three shifts: a moved goal location, additional target objects, and target objects placed in an unvisited navigation region. SERF achieves higher task progress across all three variations, suggesting that an explicit spatial representation supports robust behavior under out-of-distribution (OOD) configuration shifts.

Out-of-Distribution (OOD) Settings

Scene-Configuration Generalization: task progress (%) under test-time scene shifts.

Failure Recovery

Object-Drop Recovery

To induce the failure, we open the gripper during transport so the held object drops and leaves the camera view, then resume both policies from the same post-drop state. SERF re-localizes and re-grasps the dropped object more reliably.

Video panels: PI0.5 (Image-only) and SERF (Map-conditioned).

Recovery from Object-Drop Failures: success rate over 20 recovery trials and mean recovery time.