
We evaluate our system in a mock indoor environment with eight object categories and four tiers of tasks, ranging from simple single-object missions to complex, context-based instructions. Across 45 scenarios, we execute LTL specifications generated by either LTLCodeGen or NL2LTL, and measure success rate, semantic and syntactic error rates, and LLM runtime. Results show that LTLCodeGen consistently outperforms NL2LTL, producing more correct specifications and handling sequential tasks better, at the cost of longer inference times. An ablation study reveals that removing task explanations significantly degrades performance on complex tasks, removing code comments has only a minor effect, and using a smaller model (GPT-4o-mini) increases difficulty with ambiguity and sequencing constraints.
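To make the metric definitions concrete, the sketch below shows one way per-scenario outcomes could be classified and aggregated: a generated formula that fails to parse counts as a syntactic error, a well-formed formula that is not equivalent to the ground-truth specification counts as a semantic error, and LLM runtime is the wall-clock time of the translation call. This is illustrative only; `llm_translate`, `parse_ltl`, `ltl_equivalent`, and `run_planner` are hypothetical stand-ins for the actual translation, parsing, equivalence-checking, and planning components, which are not specified here.

```python
import time
from dataclasses import dataclass

@dataclass
class ScenarioResult:
    success: bool
    syntactic_error: bool
    semantic_error: bool
    llm_seconds: float

def evaluate_scenario(instruction, ground_truth_ltl, llm_translate,
                      parse_ltl, ltl_equivalent, run_planner):
    """Classify one scenario into success / syntactic error / semantic error.

    All callable arguments are hypothetical placeholders for the system's
    actual components.
    """
    start = time.perf_counter()
    generated = llm_translate(instruction)        # NL -> LTL via the LLM
    llm_seconds = time.perf_counter() - start

    try:
        formula = parse_ltl(generated)            # syntax check
    except SyntaxError:
        return ScenarioResult(False, True, False, llm_seconds)

    if not ltl_equivalent(formula, parse_ltl(ground_truth_ltl)):
        # Parses, but does not capture the intended task: semantic error.
        return ScenarioResult(False, False, True, llm_seconds)

    # Well-formed and semantically correct; success depends on the planner.
    return ScenarioResult(run_planner(formula), False, False, llm_seconds)

def aggregate(results):
    n = len(results)
    return {
        "success_rate": sum(r.success for r in results) / n,
        "syntactic_error_rate": sum(r.syntactic_error for r in results) / n,
        "semantic_error_rate": sum(r.semantic_error for r in results) / n,
        "mean_llm_runtime_s": sum(r.llm_seconds for r in results) / n,
    }
```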

We evaluate our task planning approach on the GameTraversalBenchmark (GTB), a dataset of 150 LLM-generated 2D maps with objects, stories, and task objectives. For each map, we create a 2D semantic occupancy map and an instruction combining all objectives in order. Our system uses the map and instruction to generate a path that completes all tasks, and we measure performance with Accuracy (success rate) and Mean Path Length (path efficiency). Results show our system outperforms the GPT-4o-based planner from GTB.
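The sketch below illustrates this evaluation loop under assumed data structures: each GTB record is taken to provide a character grid with a legend and an ordered list of objectives, and `plan_path` and `completes_all_tasks` are hypothetical helpers for the planner and the success check. The benchmark's actual schema may differ; the sketch only makes the two metrics explicit.

```python
def build_semantic_map(record):
    """Turn a character grid into a 2D list of semantic labels (assumed schema)."""
    legend = record["legend"]   # e.g. {"#": "wall", ".": "floor", "K": "key"}
    return [[legend.get(cell, "unknown") for cell in row]
            for row in record["grid"]]

def build_instruction(record):
    """Concatenate all task objectives into a single ordered instruction."""
    steps = [f"{i + 1}. {obj}" for i, obj in enumerate(record["objectives"])]
    return "Complete the following objectives in order: " + " ".join(steps)

def evaluate_gtb(records, plan_path, completes_all_tasks):
    """Accuracy = fraction of maps solved; Mean Path Length = average path length."""
    successes, path_lengths = 0, []
    for record in records:
        semantic_map = build_semantic_map(record)
        instruction = build_instruction(record)
        path = plan_path(semantic_map, instruction)   # hypothetical planner call
        path_lengths.append(len(path))
        successes += completes_all_tasks(record, path)
    return {
        "accuracy": successes / len(records),
        "mean_path_length": sum(path_lengths) / len(path_lengths),
    }
```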

We evaluate LTLCodeGen against three baselines (NL2LTL, BART-FT-RAW-human, and BART-FT-RAW-synthetic) on the Drone, Cleanup, and Pick datasets, using GPT-4o. For each dataset, prompts are tuned with a small number of examples, and both human-written and synthetic instructions are tested. Results show that LTLCodeGen significantly outperforms all baselines, demonstrating strong generalization and robust LTL generation. NL2LTL performs better than the fine-tuned BART baselines but worse than LTLCodeGen, mainly due to LTL syntax errors and semantic mismatches that it fails to correct.
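For reference, the sketch below shows a generic few-shot prompting setup for natural-language-to-LTL translation with GPT-4o. It is not LTLCodeGen's actual code-generation prompt nor the per-dataset prompts used above; the system message and the example pairs are placeholders, and only the standard OpenAI chat-completions call is real.

```python
from openai import OpenAI

# Placeholder few-shot pairs; in practice these would be drawn from the
# target dataset (Drone, Cleanup, or Pick).
FEW_SHOT_EXAMPLES = [
    ("Go to the red room and then the blue room.", "F (red_room & F blue_room)"),
    ("Always avoid the landmark.", "G !landmark"),
]

def build_messages(instruction):
    """Assemble a few-shot chat prompt for NL -> LTL translation."""
    messages = [{
        "role": "system",
        "content": "Translate the instruction into an LTL formula over the "
                   "given atomic propositions. Output only the formula.",
    }]
    for nl, ltl in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": nl})
        messages.append({"role": "assistant", "content": ltl})
    messages.append({"role": "user", "content": instruction})
    return messages

def translate_to_ltl(instruction, model="gpt-4o"):
    client = OpenAI()                       # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=build_messages(instruction),
        temperature=0,                      # deterministic decoding for evaluation
    )
    return response.choices[0].message.content.strip()
```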