
Mentor's Seminar 02

Date: Thursday, Apr 9, 2026. Recording: Read.ai


Part 1: LLM-Powered Collaborative Task Planning Framework (ICAPS 2025)

Speaker: Petr Shortahila

A paper proposing a system that allows domain experts to participate in collaborative planning by expressing constraints in natural language, without requiring knowledge of planning formalisms.

Problem and motivation

  • Collaborative planning -- multiple experts contribute insights to build structured plans for solving optimization problems. However, participation requires knowledge of formal planning notations, creating a barrier for domain experts unfamiliar with these formalisms.
  • Core idea: build a service that lets domain experts contribute their knowledge through natural language, with the system translating it into machine-executable planning constraints.

Architecture

  • The system receives a problem description and its visualization, then experts provide constraints in natural language.
  • Two-stage decomposition pipeline -- the main contribution of the paper:
    • Stage 1: an LLM (SONNET-4) decomposes natural-language constraints into simpler components, asking the expert for clarification to prevent misunderstandings.
    • Stage 2: the decomposed constraints are translated into PDDL3 (Planning Domain Definition Language, version 3), with a second round of expert feedback for corrections.
  • A verifier checks the syntactic correctness of the generated PDDL3 constraints.
  • Valid constraints are passed to the ENHSP planner, which produces a plan visualized via PDC (Planner Domain Construction tool), so experts can see the effect of their constraints.
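The pipeline above can be sketched as follows. This is a minimal illustration, not the paper's implementation (no code is released): the function names are hypothetical, the two LLM stages are stubbed with naive string rewrites, and the verifier is reduced to a balanced-parentheses check.

```python
def decompose_constraint(nl_constraint: str) -> list:
    """Stage 1: an LLM splits a natural-language constraint into
    simpler atomic components (stubbed here as a split on ' and ')."""
    return [part.strip() for part in nl_constraint.split(" and ")]

def to_pddl3(component: str) -> str:
    """Stage 2: an LLM translates one component into a PDDL3
    trajectory constraint (stubbed here as a naive rewrite)."""
    return "(always ({}))".format(component.replace(" ", "-"))

def is_syntactically_valid(pddl: str) -> bool:
    """Verifier: reduced here to a balanced-parentheses check."""
    depth = 0
    for ch in pddl:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:
            return False
    return depth == 0

parts = decompose_constraint("avoid city3 and prefer plane1")
constraints = [to_pddl3(p) for p in parts if is_syntactically_valid(to_pddl3(p))]
# `constraints` would then be handed to the ENHSP planner for solving
```

In the real system each stage also loops back to the expert for clarification before the constraint moves downstream.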

Example: XenoTravel

  • Cities (rectangles), people (circles), planes (triangles); each person has a goal destination. People travel by plane, each flight consumes fuel. Goal: everyone reaches their destination with minimal fuel.
  • Expert constraint: "planes 2 and 3 have higher fuel consumption" -- the model translates this to "use only plane 1," and the planner adjusts accordingly.
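As a hedged illustration of what "use only plane 1" might compile to: a PDDL3 state-trajectory constraint forbidding anyone from boarding the other planes. The predicate name `in-plane` is invented for this sketch, not taken from the paper.

```python
def exclude_planes(planes):
    """Build a PDDL3 :constraints block forbidding anyone from boarding
    the given planes. Predicate and type names are illustrative only."""
    clauses = "\n    ".join(
        "(always (forall (?p - person) (not (in-plane ?p {}))))".format(plane)
        for plane in planes
    )
    return "(:constraints\n  (and\n    {}))".format(clauses)

print(exclude_planes(["plane2", "plane3"]))
```

Note this stays within the state-based fragment: it constrains which states may occur, not which actions may be taken, matching the PDDL3 limitation discussed below.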

Limitations and critique

  • Only state-based constraints are supported (predicates), no action-based constraints (e.g., "plane 1 should fly only from city 1 to city 2") -- this is a fundamental limitation of PDDL3, not just the system.
  • No fine-tuning of the LLM -- the authors used off-the-shelf models as-is, which likely hurts constraint-translation quality.
  • No comparison with alternative approaches, no error analysis, no measurable evaluation.
  • Reproducibility is poor -- no GitHub repo, no GUI, no code; only a two-page paper and an 8-minute demo video with a robotic voice.
  • PDDL3 is structurally similar to context-free grammars -- actions lead to results described by production rules, analogous to CFG derivations.
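The CFG analogy can be made concrete with a toy derivation: each action rewrites the state the way a production rule rewrites a nonterminal. The domain and rules below are invented purely for this sketch.

```python
# Toy "grammar" of actions: each rule maps a state to its successor,
# mirroring a CFG production (left-hand side -> right-hand side).
RULES = {
    "person-at-city1": "person-in-plane1",  # action: board plane 1
    "person-in-plane1": "person-at-city2",  # action: fly to city 2, disembark
}

def derive(state: str, max_steps: int = 10) -> str:
    """Apply rules until none matches, like a CFG derivation
    terminating in a string of terminals."""
    for _ in range(max_steps):
        if state not in RULES:
            break
        state = RULES[state]
    return state
```

Here `derive("person-at-city1")` yields `"person-at-city2"` after two rewrites, the analogue of a two-step derivation.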

Discussion highlights

  • The seminar discussed whether zero-shot prompting is sufficient or whether fine-tuning (e.g., LoRA) is necessary for reliable natural-language-to-PDDL conversion, drawing parallels with Text-to-SQL systems.
  • Comparison with BPMN (Business Process Model and Notation) as an alternative formalism for expressing plans -- the question of which formalism is most powerful for planning tasks remains open.
  • The paper was compared unfavorably to a SIGIR demo submission that was rejected despite providing a repository (albeit with a reproducibility score of 1/5).
  • The idea of using LLMs to formalize planning tasks was considered promising as a starting point for a course paper or graduate work, especially with contemporary models (e.g., SONNET-5) instead of SONNET-4.

Part 2: TAPAS -- Task-Adaptation and Planning Using Agents (2025)

Speaker: Danila

A multi-agent framework combining LLMs with symbolic planning to solve real-world tasks, demonstrating the ability to adapt to new domains and retain knowledge across tasks.

Problem and motivation

  • LLMs alone are good at text transformation but poor at making structured plans.
  • Symbolic planners alone are good at planning but require precisely defined, passive domains -- they cannot generate domain descriptions themselves.
  • TAPAS bridges the gap: LLMs build the domain model, symbolic planners solve it.

Architecture

  • Domain modeling agent (LLM-powered):
    • Domain generator -- converts the natural-language task description into PDDL domain code (object types, predicates, actions).
    • Initial state generator -- produces initial state values from the text description.
    • Goal state generator -- determines the specific goal (e.g., "I want an apple from my bag" -> which apple, where it is).
    • The generators work independently but with upward feedback -- lower-level generators update the domain generator, iterating until a satisfiable plan model is reached.
  • Memory mechanism: short-term memory (within-task, standard GPT context) and procedural (long-term) memory -- computes similarity with previous tasks and reuses knowledge from analogous tasks to handle new, more complex problems.
  • Self-reflection mechanism: an internal critic assigns a quality score (0--1) to the model; if above a threshold, the process stops; otherwise, it iterates up to a limit.
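The self-reflection mechanism can be sketched as a score-gated loop. Since the paper does not publish the critic's scoring formula, `critic()` below is a toy proxy (rewarding longer models), not the real metric.

```python
def critic(model: str) -> float:
    """Assign a 0-1 quality score. Toy proxy only: the paper describes
    the score as based on executability and completeness, no formula."""
    return min(len(model) / 100.0, 1.0)

def refine(model: str) -> str:
    """Stand-in for one LLM refinement pass over the domain model."""
    return model + " ;refined"

def self_reflect(model: str, threshold: float = 0.8, max_iters: int = 10) -> str:
    """Iterate until the critic's score clears the threshold or the
    iteration limit is hit (10 in the paper's experiments)."""
    for _ in range(max_iters):
        if critic(model) >= threshold:
            break
        model = refine(model)
    return model

model = self_reflect("(define (domain blocksworld) ...)")
```

The threshold/limit structure is from the paper; everything inside the stubs is an assumption.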

  • Symbolic planning agent:
    • Solver -- receives PDDL code from the LLM agent and generates a sequence of plan steps.
    • Plan abstraction -- translates formal plan steps back into natural language for user comprehension.
    • Plan executor -- executes the plan and checks whether the goal is satisfied; if not, feedback is sent back to improve the model.
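Putting the two agents together, the control loop implied by the architecture looks roughly like this. Every component is a stub standing in for the LLM agent, the symbolic solver, and the executor.

```python
def generate_domain(task: str, feedback: str) -> str:
    """Domain modeling agent (stub): emit PDDL, folding in feedback."""
    return "(define (domain {})) ; feedback: {}".format(task, feedback or "none")

def solve(pddl: str) -> list:
    """Solver (stub): would call a symbolic planner on the PDDL code."""
    return ["pick", "move", "place"]

def execute(plan: list) -> bool:
    """Plan executor (stub): runs the plan and checks the goal."""
    return plan[-1] == "place"

def plan_task(task: str, max_iters: int = 10):
    """Model -> solve -> execute, with failures fed back into modeling."""
    feedback = ""
    for _ in range(max_iters):
        plan = solve(generate_domain(task, feedback))
        if execute(plan):
            return plan
        feedback = "goal not satisfied"  # returned to the modeling agent
    return []
```

The key design point is that the executor's verdict, not the LLM's own judgment, decides whether another modeling iteration is needed.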

Experiments and results

  • Experiment 1: 7 planning domains from LLM-PlanBench, a GPT-4-class model, temperature 0, iteration limit of 10. Average accuracy above 88%, with strong performance across most domains.
  • Experiment 2 (ablation): tested temperatures 0, 0.1, and 0.3 -- 0.1 is optimal; accuracy drops at higher temperatures.
  • Experiment 3 (domain adaptation): added new features to existing tasks (colors and sizes in Blocksworld, battery levels). Accuracy reached 70--100%, demonstrating that long-term memory enables successful transfer to upgraded domains.
  • Experiment 4 (VirtualHome simulation): simulated real-world household tasks (take food from fridge, eat, place on table). The agent also remembered a persistent instruction ("always close the fridge door") and applied it to new tasks without being reminded, confirming long-term memory utility.
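The procedural-memory behavior in Experiment 4 can be sketched as retrieval by task similarity. The paper's actual similarity measure is not detailed in these notes, so plain word-overlap (Jaccard) similarity stands in for it; the stored task and threshold are invented for the example.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two task descriptions."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# Long-term memory: past task -> persistent instruction learned with it.
MEMORY = {
    "take food from fridge and place it on the table":
        "always close the fridge door",
}

def recall(task: str, threshold: float = 0.3):
    """Return the instruction of the most similar past task, if any."""
    best = max(MEMORY, key=lambda t: jaccard(task, t), default=None)
    if best is not None and jaccard(task, best) >= threshold:
        return MEMORY[best]
    return None

recall("take a snack from the fridge")  # similar enough to trigger recall
```

A new but related task ("take a snack from the fridge") retrieves the fridge-door instruction without being reminded, while an unrelated one ("repair the car engine") retrieves nothing.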

Limitations and critique

  • Quality score is opaque -- no formula provided, only described as a 0--1 scale based on plan executability and information completeness.
  • Tasks are very simple -- Blocksworld, Tireworld, VirtualHome; the most complex example is changing tires, which is far simpler than a real garage procedure.
  • No discussion of ontological contradictions -- as the knowledge base grows, conflicting facts may accumulate with no resolution mechanism.
  • No low-level planning -- the framework operates only at high-level abstraction (e.g., "open fridge," "take food"); real-world robotics requires fine-grained motor control (step forward, raise hand, grasp handle), which is not addressed.
  • Plan evaluation is text-based -- similarity scoring compares text descriptions, but reasonable-sounding plans may fail during execution (the "Neocon law" problem).
  • Reproducibility concerns -- the speaker did not attempt to run the code, believing it required special OpenAI research resources; the mentor noted that modern cloud GPUs ($10/day for H100) and coding assistants make reproduction feasible.

Discussion highlights

  • High-level vs. low-level planning: should agents learn only abstract plans or also detailed motor sequences? The speaker proposed a hierarchical approach -- TAPAS generates high-level plans, while a separate "brain" inside the robot handles low-level execution (analogous to brain vs. spinal cord).
  • Domain adaptation and memory: skills from low-level designs could be reused across domains, but high-level designs would require significant retraining. The mentor noted this distinction itself could be a research paper.
  • Software development analogy: typical development workflows (design, backend, frontend, testing, deployment) are also planning problems with reusable "bricks" -- measuring plan quality when multiple valid plans exist remains an open question.
  • The mentor emphasized that future speakers should at least attempt to run the code from papers they present, using available cloud tools and coding assistants.