R4A: Spatial, Temporal, and Symbolic Reasoning for Agents

There has been significant interest in learning generalist policies that execute diverse locomotion and manipulation tasks in open-world environments. Recent vision-language-action (VLA) models (e.g., π_0.5, OpenVLA, RT-2) excel at learning diverse tasks and handling many real-world edge cases, thanks to internet-scale pre-training. However, state-of-the-art VLA models are still limited to semantically simple, atomic tasks (e.g., "fold laundry") that involve neither long-term spatial or temporal reasoning nor complex subtask composition. We identify three open challenges that hinder existing agent policies from achieving higher-order, human-like behaviors.

Spatial Reasoning: Reliance on 2D images leads to local reasoning limited by field-of-view constraints and a lack of 3D spatial understanding, so agents struggle with viewpoint changes, occlusion, and spatially extended tasks.
Temporal Reasoning: Relying only on short-term video-based attention limits the time scales over which agents can reason, causing failures on tasks with long-term temporal correlations.
Symbolic Reasoning: Existing agent policies employ predefined skills and coarse skill switching schemes, lacking symbolic reasoning to generalize behaviors and achieve flexible compositional task planning.

This workshop explores the next frontier of embodied AI with the goal of designing generalist agents capable of human-level spatial, temporal, and symbolic reasoning. Current approaches can be broadly categorized as model-free or model-based. Model-free methods, such as state-of-the-art VLA models, infer actions directly from sensory inputs using pre-trained foundation models. In contrast, model-based methods rely on constructing and reasoning over intermediate representations such as 3D maps, dynamics models, or symbolic programs. A central theme of this workshop is the comparison of, and interaction between, these paradigms. We will consider fundamental questions such as:

(1) How much can we achieve by further scaling existing model-free approaches?
(2) Can model-based approaches improve the generalization and data efficiency of VLA and reinforcement learning methods, and what are their limitations?

The consideration of model-based approaches also raises new questions for architecture design:

(3) How can we efficiently tokenize the 3D/4D features used by model-based environment representations for policy learning?
(4) Is explicit encoding of long-term, scene-level, and physically realistic dynamics feasible and competitive with implicit model-free reasoning?
(5) How can agents automatically synthesize layered and functional abstractions from continuous states and controls?

Lastly, these questions call for renewed discussion of training and evaluation:

(6) What are the roles of simulators and simulation-based methods (e.g., Sim2Real and Real2Sim) in learning agent policies with extended spatial, temporal, and symbolic reasoning capabilities?
(7) How can the community progress toward unified metrics and evaluation interfaces for consistent benchmarking?

Call for Papers

We invite submissions of short papers of up to 5 pages in NeurIPS format, excluding references and supplementary material. Each submission should outline the results being presented, their novelty, and their relevance to the workshop questions. Example topics include, but are not limited to, the following areas:

Vision-Language-Action (VLA) models
Policy learning using 3D scene representations
3D/4D distillation of foundation models and coupling with task planning
Task and motion planning
Neuro-symbolic reasoning
Differentiable simulation, Sim2Real, and Real2Sim methods
New benchmarks for spatial, temporal, and/or symbolic reasoning

The shorter submission format is preferred to encourage contributions on brand-new ideas and work in progress. Submissions will be reviewed by a Program Committee consisting of experts in related fields. The review process will prioritize new work over already finalized papers. In particular, work that is presented at the main NeurIPS conference will not be accepted by this workshop. Accepted contributions will be made available on the workshop website as non-archival reports, and the authors will be invited to present their work during the poster session. More details and the submission link will be made available.

Call for Talks

We invite junior researchers who are PhD candidates or recent graduates (±2 years from PhD degree) to share their PhD research and research vision on spatial, temporal, and symbolic reasoning for agents in a 20-minute talk at our workshop.

Applicants are invited to submit a talk proposal in the form of an extended abstract of up to 3 pages in NeurIPS format (excluding references) summarizing their PhD research on a topic of interest to the workshop.

The extended abstract should cover, and will be evaluated on, the following aspects:

Motivation behind the research question(s) addressed in the applicant's research
Clarity in defining and scoping the problem
Alignment with the topics of the workshop
Brief review of related work on the aforementioned research question(s)
Description of the techniques contributed by the applicant, including their novelty and potential advantages over existing techniques
Overview of future research directions

Submitted talk proposals will be reviewed by the workshop Program Committee on the same timeline as regular paper contributions. One proposal will be selected for presentation based on quality and relevance to the workshop topic. The selected junior researcher will share the stage with the invited speakers to present their PhD research. Every submitted talk proposal will also be considered for a poster presentation by default.