Mentor's Seminar 02
Date: Thursday, Apr 9, 2026
Recording: Read.ai
Part 1: LLM-Powered Collaborative Task Planning Framework (ICAPS 2025)
Speaker: Petr Shortahila
A paper proposing a system that allows domain experts to participate in collaborative planning by expressing constraints in natural language, without requiring knowledge of planning formalisms.
Problem and motivation
- Collaborative planning -- multiple experts contribute insights to build structured plans for optimization problems. Participation requires knowledge of formal planning notations, which creates a barrier for domain experts unfamiliar with them.
- Core idea: build a service that lets domain experts contribute their knowledge through natural language, with the system translating it into machine-executable planning constraints.
Architecture
- The system receives a problem description and its visualization, then experts provide constraints in natural language.
- Two-stage decomposition pipeline -- the main contribution of the paper:
    - LLM (SONNET-4) decomposes natural language constraints into simpler components, asking the expert for clarification to prevent misunderstandings.
    - Decomposed constraints are translated into PDDL3 (Planning Domain Definition Language), with a second round of expert feedback for corrections.
- A verifier checks syntactic correctness of generated PDDL3 constraints.
- Valid constraints are passed to the ENHSP planner, which produces a plan visualized via PDC (Planner Domain Construction tool) so experts can see the effect of their constraints visually.
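
The syntactic verification step can be illustrated with a minimal sketch. The notes do not describe how the paper's verifier works, so everything here is an assumption: the function names `tokenize`, `parse`, and `is_wellformed_constraints` are hypothetical, and the check covers only parenthesis nesting and the presence of a single top-level `:constraints` form, which is far less than a full PDDL3 grammar check.

```python
def tokenize(text):
    """Split PDDL text into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Turn a token list into nested Python lists; raise ValueError on bad nesting."""
    stack = [[]]
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unexpected ')'")
            finished = stack.pop()
            stack[-1].append(finished)
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("missing ')'")
    return stack[0]


def is_wellformed_constraints(text):
    """Hypothetical check: balanced nesting and exactly one (:constraints ...) form."""
    try:
        forms = parse(tokenize(text))
    except ValueError:
        return False
    return (
        len(forms) == 1
        and isinstance(forms[0], list)
        and forms[0][:1] == [":constraints"]
    )


print(is_wellformed_constraints("(:constraints (always (not (in person1 plane2))))"))
print(is_wellformed_constraints("(:constraints (always (not (in person1 plane2)))"))
```

A check like this would only catch malformed output before it reaches the planner; whether a constraint means what the expert intended still depends on the feedback rounds described above.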
Example: ZenoTravel
- Cities (rectangles), people (circles), planes (triangles); each person has a goal destination. People travel by plane, each flight consumes fuel. Goal: everyone reaches their destination with minimal fuel.
- Expert constraint: "planes 2 and 3 have higher fuel consumption" -- the model translates this to "use only plane 1," and the planner adjusts accordingly.
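
One way the "use only plane 1" outcome could be written as a state-based PDDL3 constraint is sketched below. This is not the paper's actual output: the predicate `in`, the type `person`, and the object names `plane2` and `plane3` are assumptions in the style of the IPC ZenoTravel domain, and `forbid_planes` is a hypothetical helper. The `always` modality and `forall` quantifier are standard PDDL3 constraint syntax.

```python
def forbid_planes(planes):
    """Build a PDDL3 :constraints block saying no person is ever inside the given planes.

    Predicate `in` and type `person` are assumed names, not taken from the paper.
    """
    clauses = [
        f"(forall (?p - person) (always (not (in ?p {plane}))))"
        for plane in planes
    ]
    return "(:constraints (and " + " ".join(clauses) + "))"


# The expert's remark about planes 2 and 3, encoded as "nobody ever boards them".
print(forbid_planes(["plane2", "plane3"]))
```

The encoding restricts states (who is inside which plane) rather than actions, which is the only kind of constraint the system is reported to support.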
Limitations and critique
- Only state-based constraints (over predicates) are supported; action-based constraints (e.g., "plane 1 should fly only from city 1 to city 2") are not -- this is a fundamental limitation of PDDL3, not just of the system.
- No fine-tuning of the LLM -- authors used off-the-shelf open-source models, which likely hurts constraint translation quality.
- No comparison with alternative approaches, no error analysis, no measurable evaluation.
- Reproducibility is poor -- no GitHub repo, no GUI, no code; only a two-page paper and an 8-minute demo video with a robotic voice.
- PDDL3 is structurally similar to context-free grammars -- actions lead to results described by production rules, analogous to CFG derivations.
Discussion highlights
- The seminar discussed whether zero-shot prompting is sufficient or whether fine-tuning (e.g., LoRA) is necessary for reliable natural-language-to-PDDL conversion, drawing parallels with Text-to-SQL systems.
- Comparison with BPMN (Business Process Model and Notation) as an alternative formalism for expressing plans -- the question of which formalism is most powerful for planning tasks remains open.
- The paper was compared unfavorably to a SIGIR demo submission that was rejected despite providing a repository (albeit with a reproducibility score of 1/5).
- The idea of using LLMs to formalize planning tasks was considered promising as a starting point for a course paper or graduate work, especially with contemporary models (e.g., SONNET-5) instead of SONNET-4.
Part 2: TAPAS -- Task-Adaptation and Planning Using Agents (2025)
Speaker: Danila
A multi-agent framework combining LLMs with symbolic planning to solve real-world tasks, demonstrating the ability to adapt to new domains and retain knowledge across tasks.
Problem and motivation
- LLMs alone are good at text transformation but poor at making structured plans.
- Symbolic planners alone are good at planning but require precisely defined, passive domains -- they cannot generate domain descriptions themselves.
- TAPAS bridges the gap: LLMs build the domain model, symbolic planners solve it.
Architecture
- Domain modeling agent (LLM-powered):
    - Domain generator -- converts natural-language task description into PDDL domain code (object types, predicates, actions).
    - Initial state generator -- produces initial state values from the text description.
    - Goal state generator -- determines the specific goal (e.g., "I want an apple from my bag" -> which apple, where it is).
    - Generators work independently but with upward feedback -- lower-level generators update the domain generator, iterating until a satisfiable plan model is reached.
- Memory mechanism: short-term memory (within-task, standard GPT context) and procedural (long-term) memory -- computes similarity with previous tasks and reuses knowledge from analogous tasks to handle new, more complex problems.
- Self-reflection mechanism: an internal critic assigns a quality score (0--1) to the model; if above a threshold, the process stops; otherwise, it iterates up to a limit.
- Symbolic planning agent:
    - Solver -- receives PDDL code from the LLM agent and generates a sequence of plan steps.
    - Plan abstraction -- translates formal plan steps back into natural language for user comprehension.
    - Plan executor -- executes the plan and checks whether the goal is satisfied; if not, feedback is sent back to improve the model.
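
The self-reflection loop described above (score the model, stop once the score clears a threshold, otherwise retry up to a limit) can be sketched as follows. The paper gives no formula for the score, so `generate` and `critic` are hypothetical callables supplied by the caller, and the default threshold of 0.8 is an arbitrary placeholder. Only the 10-iteration limit echoes a value mentioned in these notes.

```python
from typing import Callable, Tuple


def refine_until_good(
    generate: Callable[[str, str], str],
    critic: Callable[[str], Tuple[float, str]],
    task: str,
    threshold: float = 0.8,  # assumed value; the paper's threshold is not given here
    max_iters: int = 10,
) -> Tuple[str, float, int]:
    """Regenerate a domain model until the critic's score reaches the threshold.

    generate(task, feedback) returns a candidate model (e.g. PDDL text);
    critic(model) returns (score in [0, 1], textual feedback).
    Returns the last model, its score, and the number of iterations used.
    """
    feedback = ""
    model, score = "", 0.0
    for iteration in range(1, max_iters + 1):
        model = generate(task, feedback)
        score, feedback = critic(model)
        if score >= threshold:
            return model, score, iteration
    return model, score, max_iters
```

In a real system `generate` would wrap an LLM call and `critic` could combine the internal critic with plan-executor feedback. Here they are left abstract so the control flow stays visible.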
Experiments and results
- Experiment 1: 7 planning domains from the LLM-PlanBench, a GPT-4-class model, temperature 0, and a 10-iteration limit. Average accuracy exceeded 88%, with strong performance across most domains.
- Experiment 2 (ablation): tested temperatures 0, 0.1, and 0.3 -- 0.1 was optimal; accuracy dropped at higher temperatures.
- Experiment 3 (domain adaptation): added new features to existing tasks (colors and sizes in Blocksworld, battery levels). Accuracy reached 70--100%, demonstrating that long-term memory enables successful transfer to upgraded domains.
- Experiment 4 (VirtualHome simulation): simulated real-world household tasks (take food from fridge, eat, place on table). The agent also remembered a persistent instruction ("always close the fridge door") and applied it to new tasks without being reminded, confirming long-term memory utility.
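
The long-term memory behavior in Experiments 3 and 4 (reusing knowledge from a similar earlier task) can be illustrated with a toy retrieval sketch. The notes do not say how TAPAS computes task similarity, so bag-of-words cosine similarity is used here purely as a stand-in. The names `cosine` and `recall`, the memory layout, and the `min_sim` cutoff are all assumptions.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two task descriptions."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def recall(memory: dict, task: str, min_sim: float = 0.3):
    """Return the note stored for the most similar past task, or None if nothing is close."""
    best = max(memory, key=lambda past: cosine(past, task), default=None)
    if best is None or cosine(best, task) < min_sim:
        return None
    return memory[best]


# Hypothetical memory: past task description -> knowledge to carry over.
memory = {
    "take food from the fridge": "always close the fridge door",
    "stack red block on blue block": "check block size",
}
print(recall(memory, "take milk from the fridge"))
```

A production system would more likely compare embeddings than raw word counts. The sketch only shows the retrieve-if-similar-enough pattern.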
Limitations and critique
- Quality score is opaque -- no formula provided, only described as a 0--1 scale based on plan executability and information completeness.
- Tasks are very simple -- Blocksworld, Tireworld, VirtualHome; the most complex example is changing tires, which is far simpler than a real garage procedure.
- No discussion of ontological contradictions -- as the knowledge base grows, conflicting facts may accumulate with no resolution mechanism.
- No low-level planning -- the framework operates only at high-level abstraction (e.g., "open fridge," "take food"); real-world robotics requires fine-grained motor control (step forward, raise hand, grasp handle), which is not addressed.
- Plan evaluation is text-based -- similarity scoring compares text descriptions, but reasonable-sounding plans may fail during execution (the "Neocon law" problem).
- Reproducibility concerns -- the speaker did not attempt to run the code, believing it required special OpenAI research resources; the mentor noted that modern cloud GPUs ($10/day for H100) and coding assistants make reproduction feasible.
Discussion highlights
- High-level vs. low-level planning: should agents learn only abstract plans or also detailed motor sequences? The speaker proposed a hierarchical approach -- TAPAS generates high-level plans, while a separate "brain" inside the robot handles low-level execution (analogous to brain vs. spinal cord).
- Domain adaptation and memory: skills from low-level designs could be reused across domains, but high-level designs would require significant retraining. The mentor noted this distinction itself could be a research paper.
- Software development analogy: typical development workflows (design, backend, frontend, testing, deployment) are also planning problems with reusable "bricks" -- measuring plan quality when multiple valid plans exist remains an open question.
- The mentor emphasized that future speakers should at least attempt to run the code from papers they present, using available cloud tools and coding assistants.