Optimal Scene Graph Planning with Large Language Model Guidance

Zhirui Dai¹, Arash Asgharivaskasi¹, Thai Duong¹, Shusen Lin¹,

Maria-Elizabeth Tzes², George Pappas², Nikolay Atanasov¹

¹Existential Robotics Lab, University of California San Diego

²GRASP Lab, University of Pennsylvania, Philadelphia

ICRA 2024

Abstract: Recent advances in metric, semantic, and topological mapping have equipped autonomous robots with semantic concept grounding capabilities to interpret natural language tasks. This work aims to leverage these new capabilities with an efficient task planning algorithm for hierarchical metric-semantic models. We consider a scene graph representation of the environment and utilize a large language model (LLM) to convert a natural language task into a linear temporal logic (LTL) automaton. Our main contribution is to enable optimal hierarchical LTL planning with LLM guidance over scene graphs. To achieve efficiency, we construct a hierarchical planning domain that captures the attributes and connectivity of the scene graph and the task automaton, and provide semantic guidance via an LLM heuristic function. To guarantee optimality, we design an LTL heuristic function that is provably consistent and supplements the potentially inadmissible LLM guidance in multi-heuristic planning. We demonstrate efficient planning of complex natural language tasks in scene graphs of virtualized real environments.

Method Overview

Planning a natural language mission, μ : “Reach the oven in the kitchen”, in a scene graph G of the Gibson environment Benevolence with object, room, and floor attributes. The terms “oven” and “kitchen” in μ belong to the object and room attributes of the scene graph, respectively. The scene graph G is described to the LLM using the connectivity of its attributes (attribute hierarchy) and the LLM is used to translate μ to LTL formula φ_μ and associated Automaton M_φ. We construct a hierarchical planning domain from the scene graph, and use multi-resolution multi-heuristic planning to plan the mission execution. In addition to mission translation, the LLM is used to provide heuristic guidance to accelerate the planning, while an LTL heuristic is used to guarantees optimality.

Scene Graph and LTL Automaton

We represent the environment as a scene graph (left) where nodes correspond to locations with associated attributes (e.g., object, room, floor) and edges represent belonging or connectivity relations. The natural language task μ is translated by the LLM to an LTL formula φ_μ that is represented as a finite state automaton M_φ (right).

Scene Graph Hierarchy Description

Due to the complexity of scene graphs, directly describing them to LLMs can exceed their context window size and token rate limits. To address this, we extract the attributes and hierarchy of the scene graph and present it as an attribute hierarchy in YAML format. This hierarchical description allows the LLM to effectively understand and utilize the scene graph information for task planning.

LLM Task Translation

We prompt the LLM with the hierarchical description of the scene graph and a natural language task μ (e.g., “Reach the oven in the kitchen”) to generate an LTL formula φ_μ that captures the task requirements. The LTL formula is then converted to a finite state automaton M_φ for use in planning.

Hierarchical Planning Domain

We construct a hierarchical planning domain from the scene graph and the LTL automaton. Each level in the hierarchical planning domain is a graph where nodes are the product of scene graph nodes and automaton states. Edges in the hierarchical planning domain represent feasible transitions between nodes, which should satisfy three conditions: 1) the edge is a transition from one node in the scene graph to the boundary of another node; 2) the transition should respect the automaton state transitions; 3) the transition should have the minimal cost among all possible transitions between the two nodes in the scene graph. This hierarchical structure allows efficient planning by enabling the planner to operate at multiple levels of abstraction.

LTL Heuristics

To guarantee the optimality of the planning process, we propose a consistent LTL heuristic:

where the cost function c of label transition and the function g of cost to reach the accepting state set are defined as:

LLM Heuristics

We design an LLM heuristic function that leverages the LLM's understanding of the scene graph and task requirements to provide semantic guidance during planning. The LLM heuristic estimates the cost to reach the goal by considering the attributes of the scene graph nodes and their relevance to the task at hand. Given the current state (s, q) in the planning domain with the task context, we prompt the LLM to obtain a possible task solution, which can be used to estimate the cost to reach the goal from the current state. This heuristic aims to accelerate the planning process by guiding the search towards promising regions of the state space based on semantic information. It may make inaccurate estimations, so it is used in conjunction with the consistent LTL heuristic to ensure optimality.

Experimental Results

We evaluate our proposed hierarchical LTL planning approach with LLM guidance in various virtualized environments represented as scene graphs. The results demonstrate that our method is able to efficiently plan complex natural language tasks while ensuring optimality. The combination of the LLM heuristic and the consistent LTL heuristic significantly reduces the number of state expansions and planning time compared to using either heuristic alone or no heuristics. For detailed experimental results and analysis, please refer to the paper.

BibTeX

@inproceedings{dai2024optimal,
        author={Dai, Zhirui and Asgharivaskasi, Arash and Duong, Thai and Lin, Shusen and Tzes, Maria-Elizabeth and Pappas, George and Atanasov, Nikolay},
        booktitle={2024 IEEE International Conference on Robotics and Automation (ICRA)},
        title={{Optimal Scene Graph Planning with Large Language Model Guidance}},
        year={2024},
        pages={14062-14069},
        keywords={Measurement;Grounding;Large language models;Semantics;Automata;Planning;Logic},
        doi={10.1109/ICRA57147.2024.10610599}}
}