
  1. Title: PG3: Policy-Guided Planning for Generalized Policy Generation
  2. Author: Ryan Yang et al.
  3. Publish Date: 21 Apr 2022
  4. Review Date: Wed, May 24, 2023
  5. url: https://arxiv.org/pdf/2204.10420.pdf

Summary of paper


Motivation

  • a longstanding objective in classical planning is to synthesise policies that generalise across multiple problems from the same domain
  • in this work, the authors study generalised policy search (GPS) methods, focusing on the score function used to guide the search over policies

Contribution

  • we study a specific instantiation of policy search where planning problems are PDDL-based and policies are lifted decision lists.

Some key terms

what is generalised planning and generalised policy search (GPS)

  • GPS is a flexible paradigm for generalised planning. In this family of methods, a search is performed through a class of generalised (goal-conditioned) policies, with the search informed by a score function that maps candidate policies to scalar values.
  • there has been relatively little work on the score function, even though it plays an important role: if the scores are uninformative or misleading, the search will languish in less promising regions of policy space.
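To make the role of the score function concrete, here is a minimal sketch of GPS as a greedy best-first search over candidate policies. It assumes that lower scores are expanded first; the function names, the toy rule vocabulary (`RULES`, `TARGET`) and the toy score are hypothetical and only serve the illustration.

```python
import heapq
from itertools import count
from typing import Callable, Hashable, List, Optional, Tuple

Policy = Tuple[str, ...]  # toy stand-in: a policy is an ordered tuple of rule names


def generalised_policy_search(
    initial: Hashable,
    successors: Callable,
    score: Callable,
    is_solution: Callable,
    max_expansions: int = 1000,
) -> Optional[Hashable]:
    """Greedy best-first search over policies; the lowest score is expanded first."""
    tie = count()  # tie-breaker so the heap never has to compare policies
    frontier = [(score(initial), next(tie), initial)]
    seen = {initial}
    for _ in range(max_expansions):
        if not frontier:
            return None
        _, _, policy = heapq.heappop(frontier)
        if is_solution(policy):
            return policy
        for child in successors(policy):
            if child not in seen:
                seen.add(child)
                heapq.heappush(frontier, (score(child), next(tie), child))
    return None


# Hypothetical toy problem: the "right" policy is a specific rule sequence.
RULES = ("pick", "move", "drop")
TARGET: Policy = ("pick", "move", "drop")


def successors(policy: Policy) -> List[Policy]:
    # Search operator (assumed): append one rule to the end of the list.
    if len(policy) >= len(TARGET):
        return []
    return [policy + (rule,) for rule in RULES]


def informative_score(policy: Policy) -> int:
    # Number of positions that do not yet match the target (lower is better).
    return sum(
        1 for i, rule in enumerate(TARGET) if i >= len(policy) or policy[i] != rule
    )


found = generalised_policy_search(
    (), successors, informative_score, lambda p: p == TARGET
)
```

With the informative score the search heads straight for the target; with a constant score the same loop degenerates into breadth-first enumeration of policy space, which is the failure mode the notes describe.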

Problem setting in generalised planning

  • Given:
    • PDDL domain
    • Training problems
    • Planner
  • Goal: learn a goal-conditioned policy that generalises to all test problems in the domain.

How does PG3 work

  • search through the space of candidate policies
  • each candidate policy is represented as a lifted decision list, i.e. an ordered list of rules
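A lifted decision list can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Rule` fields, the brute-force grounding, the omission of negative preconditions, and the delivery-style predicates are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, FrozenSet, Optional, Sequence, Tuple

Atom = Tuple[str, ...]  # e.g. ("at", "pkg", "locA")


@dataclass(frozen=True)
class Rule:
    params: Tuple[str, ...]      # lifted variables, e.g. ("?p", "?l")
    state_pre: Tuple[Atom, ...]  # atoms that must hold in the current state
    goal_pre: Tuple[Atom, ...]   # atoms that must appear in the goal
    action: Atom                 # lifted action returned when the rule fires


def ground(atom: Atom, sub: Dict[str, str]) -> Atom:
    """Replace variables by objects; predicate names and constants are kept."""
    return tuple(sub.get(term, term) for term in atom)


def policy_action(
    rules: Sequence[Rule],
    state: FrozenSet[Atom],
    goal: FrozenSet[Atom],
    objects: FrozenSet[str],
) -> Optional[Atom]:
    """Return the ground action of the first rule that has a valid grounding."""
    for rule in rules:  # order matters: earlier rules take priority
        for values in product(sorted(objects), repeat=len(rule.params)):
            sub = dict(zip(rule.params, values))
            if all(ground(a, sub) in state for a in rule.state_pre) and all(
                ground(a, sub) in goal for a in rule.goal_pre
            ):
                return ground(rule.action, sub)
    return None  # no rule applies


# Hypothetical delivery-style policy: drop at the goal, else move there, else pick.
POLICY = [
    Rule(("?p", "?l"), (("holding", "?p"), ("robot-at", "?l")),
         (("at", "?p", "?l"),), ("drop", "?p", "?l")),
    Rule(("?p", "?l"), (("holding", "?p"),),
         (("at", "?p", "?l"),), ("move", "?l")),
    Rule(("?p", "?l", "?g"), (("at", "?p", "?l"), ("robot-at", "?l")),
         (("at", "?p", "?g"),), ("pick", "?p", "?l")),
]

OBJECTS = frozenset({"pkg", "locA", "locB"})
GOAL = frozenset({("at", "pkg", "locB")})
start = frozenset({("robot-at", "locA"), ("at", "pkg", "locA")})
first_action = policy_action(POLICY, start, GOAL, OBJECTS)
```

Because the rules mention only variables, the same list applies unchanged to problems with different objects and goals, which is what lets the policy generalise across a domain.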

Evaluation process

  • calculate how many training problems a given candidate policy is able to solve

policy evaluation

  • executes the policy on the training tasks and records the number of successes
  • this score is extremely sparse, effectively forcing an exhaustive search until the search reaches a region of policy space with non-zero scores.
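The sparsity of policy evaluation can be seen in a small sketch. The single-package delivery domain, the rule names and the horizon below are hypothetical; a policy is modelled simply as the set of rules it contains, and the score is the number of training tasks solved.

```python
from typing import FrozenSet, Sequence, Tuple

Task = Tuple[str, str, str]  # (robot location, package location, goal location)


def solves(rules: FrozenSet[str], task: Task, horizon: int = 10) -> bool:
    """Execute the policy on one task and report whether the package is delivered."""
    robot, pkg, goal = task
    holding = False
    for _ in range(horizon):
        if not holding and pkg == goal:
            return True
        if holding and robot == goal and "drop" in rules:
            pkg, holding = robot, False
        elif holding and robot != goal and "move-to-goal" in rules:
            robot = goal
        elif not holding and robot == pkg and "pick" in rules:
            holding = True
        elif not holding and robot != pkg and "move-to-package" in rules:
            robot = pkg
        else:
            return False  # no rule fires, so the policy is stuck
    return not holding and pkg == goal


def policy_evaluation_score(rules: FrozenSet[str], tasks: Sequence[Task]) -> int:
    """Number of training tasks solved by executing the policy."""
    return sum(solves(rules, task) for task in tasks)


TASKS = [("a", "c", "b"), ("b", "c", "a"), ("a", "a", "b")]
FULL = frozenset({"pick", "move-to-package", "move-to-goal", "drop"})
NO_DROP = FULL - {"drop"}

scores = {
    "empty": policy_evaluation_score(frozenset(), TASKS),
    "pick only": policy_evaluation_score(frozenset({"pick"}), TASKS),
    "no drop": policy_evaluation_score(NO_DROP, TASKS),
    "full": policy_evaluation_score(FULL, TASKS),
}
```

In this toy setting a policy with three of the four required rules scores exactly the same as the empty policy, so the score gives the search no signal until the final rule is added.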

plan comparison

  • runs the planner on the training tasks and records how well the candidate policy agrees with the plans found.
  • (in practice the score serves as the search priority, so lower is better)
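Plan comparison can be sketched as counting the plan steps on which the candidate policy disagrees with the planner. The state encoding, the rule names and the hard-coded plan (standing in for planner output) are hypothetical, and using a raw disagreement count as the score is an assumption chosen so that lower is better.

```python
from typing import FrozenSet, List, Optional, Sequence, Tuple

# (robot location, package location, holding?, goal location)
State = Tuple[str, str, bool, str]
Plan = List[Tuple[State, str]]  # (state, action the planner took in that state)


def policy_action(rules: FrozenSet[str], state: State) -> Optional[str]:
    """Action chosen by a toy policy, modelled as the set of rules it contains."""
    robot, pkg, holding, goal = state
    if holding and robot == goal and "drop" in rules:
        return "drop"
    if holding and robot != goal and "move-to-goal" in rules:
        return "move-to-goal"
    if not holding and robot == pkg and "pick" in rules:
        return "pick"
    if not holding and robot != pkg and "move-to-package" in rules:
        return "move-to-package"
    return None


def plan_comparison_score(rules: FrozenSet[str], plans: Sequence[Plan]) -> int:
    """Count plan steps where the policy disagrees with the planner (lower is better)."""
    return sum(
        policy_action(rules, state) != action
        for plan in plans
        for state, action in plan
    )


# Assumed planner output for one training task: robot at a, package at c, goal b.
PLAN: Plan = [
    (("a", "c", False, "b"), "move-to-package"),
    (("c", "c", False, "b"), "pick"),
    (("c", "c", True, "b"), "move-to-goal"),
    (("b", "c", True, "b"), "drop"),
]

FULL = frozenset({"pick", "move-to-package", "move-to-goal", "drop"})
scores = {
    "empty": plan_comparison_score(frozenset(), [PLAN]),
    "pick only": plan_comparison_score(frozenset({"pick"}), [PLAN]),
    "no drop": plan_comparison_score(FULL - {"drop"}, [PLAN]),
    "full": plan_comparison_score(FULL, [PLAN]),
}
```

Unlike the success count, this score improves with every correct rule that is added, so partially correct policies are distinguishable from the empty one. Its weakness is that it compares against one particular plan per task, so a policy that solves the task differently from the planner would still be penalised.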
