

Appeared in the Eighteenth Conference on Uncertainty in Artificial Intelligence, Edmonton, Canada, August 2002.Distributed Planning in Hierarchical Factored MDPs

Carlos GuestrinComputer Science Dept Stanford Universityguestrin@cs.stanford.edu

Geoffrey GordonComputer Science Dept Carnegie Mellon Universityggordon@cs.cmu.edu

We present a principled and efficient planning algo-rithm for collaborative multiagent dynamical systems. All computation, during both the planning and the ex-ecution phases, is distributed among the agents; each agent only needs to model and plan for a small partof the system. Each of these local subsystems is small, but once they are combined they can representan exponentially larger problem. The subsystems are connected through a subsystem hierarchy. Coordina-tion and communication between the agents is not imposed, but derived directly from the structure of thishierarchy. A globally consistent plan is achieved by a message passing algorithm, where messages corre-spond to natural local reward functions and are computed by local linear programs; another message pass-ing algorithm allows us to execute the resulting policy. When two portions of the hierarchy share the samestructure, our algorithm can reuse plans and messages to speed up computation.

1 Introduction Many interesting planning problems have a very large num-ber of states and actions, described as the cross product

of a small number of state variables and action variables.Even very fast exact algorithms cannot solve such large planning problems in a reasonable amount of time. For-tunately, in many such cases we can group the variables in these planning problems into subsystems that interact in asimple manner.

As argued by Herbert Simon [20] in "Architecture ofComplexity", many complex systems have a "nearly decomposable, hierarchical structure", with the subsystemsinteracting only weakly between themselves. In this paper, we represent planning problems using a hierarchicaldecomposition into local subsystems. (In multiagent planning problems, each agent will usually implement one ormore local subsystems.) Although each subsystem is small, once these subsystems are combined we can represent anexponentially larger problem.

The advantage of constructing such a grouping is thatwe can hope to plan for each subsystem separately. Coordinating many local planners to form a successful globalplan requires careful attention to communication between planners: if two local plans make contradictory assump-tions, global success is unlikely.

In this paper, we describe an algorithm for coordinat-ing many local Markov decision process (MDP) planners

to form a global plan. The algorithm is relatively sim-ple: at each stage we solve a small number of small linear programs to determine messages that each planner willpass to its neighbors, and based on these messages the local planners revise their plans and send new messages untilthe plans stop changing. The messages have an appealing interpretation: they are rewards or penalties for taking ac-tions that benefit or hurt their neighbors, and statistics about plans that the agents are considering.Our hierarchical decomposition offers another significant feature: reuse. When two subsystems are equivalent(i.e., are instances of the same class), plans devised for one subsystem can be reused in the other. Furthermore, in manyoccasions, larger parts of the system may be equivalent. In these cases, we can not only reuse plans, but also messages.The individual local planners can run any planning algorithm they desire, so long as they can extract a particu-lar set of state visitation frequencies from their plans. Of course, suboptimal planning from local planners will tendto reduce the quality of the global plan.

Our algorithm does not necessarily converge to an op-timal plan. However, it is guaranteed to be the same as the plan produced by a particular centralized planning al-gorithm (approximate linear programming as in [19], with a particular basis). 2 Related work Many researchers have examined the idea of dividing aplanning problem into simpler pieces in order to solve it

faster. There are two common ways to split a problem intosimpler pieces, which we will call serial decomposition and parallel decomposition. Our planning algorithm is signif-icant because it handles both serial and parallel decompositions, and it provides more opportunities for abstractionthan other parallel-decomposition planners. Additionally, it is fully distributed: at no time is there a global combina-tion step requiring knowledge of all subproblems simultaneously. No previous algorithm combines these qualities.In a serial decomposition, one subproblem is active at any given time. The overall state consists of an indicatorof which subproblem is active along with that subproblem's state. Subproblems interact at their borders, whichare the states where we can enter or leave them. For example, imagine a robot navigating in a building with multiplerooms connected by doorways: fixing the value of the doorway states decouples the rooms from each other and lets ussolve each room separately. In this type of decomposition, the combined state space is the union of the subproblemstate spaces, and so the total size of all of the subproblems is approximately equal to the size of the combined problem.Serial decomposition planners in the literature include Kushner and Chen's algorithm [12] and Dean and Lin'salgorithm [6], as well as a variety of hierarchical planning algorithms. Kushner and Chen were the first to ap-ply Dantzig-Wolfe decomposition to Markov decision processes, while Dean and Lin combined decomposition withstate abstraction. Hierarchical planning algorithms include MAXQ [7], hierarchies of abstract machines [16], andplanning with macro-operators [22, 9].

By contrast, in a parallel decomposition, multiple sub-problems can be active at the same time, and the combined state space is the cross product of the subproblem statespaces. The size of the combined problem is therefore exponential rather than linear in the number of subproblems,which means that a parallel decomposition can potentially save us much more computation than a serial one. Foran example of a parallel decomposition, suppose there are multiple robots in our building, interacting through a com-mon resource constraint such as limited fuel or through a common goal such as lifting a box which is too heavy forone robot to lift alone. A subproblem of this task might be to plan a path for one robot using only a compact summaryof the plans for the other robots.

Parallel decomposition planners in the literature includeSingh and Cohn's [21] and Meuleau et al.'s [15] algorithms. Singh and Cohn's planner builds the combinedstate space explicitly, using subproblem solutions to initialize the global search. So, while it may require fewer plan-ning iterations than naive global planning, it is limited by having to enumerate an exponentially large set. Meuleauet al.'s planner is designed for parallel decompositions in which the only coupling is through global resource con-straints. More complicated interactions such as conjunctive goals or shared state variables are beyond its scope.Our planner works by representing a planning problem as a linear program [14], substituting in a compact approx-imate representation of the solution [19], and transforming and decomposing the LP so that it can be solved by a dis-tributed network of planners. One of this paper's contributions is the method for transformation and decomposition.Our transformation is based on the factorized planning algorithm of Guestrin, Koller, and Parr [8]. Their algo-rithm uses a central planner, but allows distributed execution of plans. We extend that result to allow planning to bedistributed as well, while still guaranteeing that we reach the same solution. That means that our algorithm allowsfor truly decentralized multiagent planning and execution: each agent can run its own local planner and compute mes-sages locally, and doesn't need to know the global state of the world or the actions of unrelated agents.

The transformation produces a sparse linear program, towhich we apply a nested Benders decomposition [1]. (Or, dually, a Dantzig-Wolfe decomposition [3].) As mentionedabove, other authors have used this decomposition for planning; but, our method is the first to handle parallel decom-positions of planning problems.

Another contribution of our new hierarchical repre-sentation and planning algorithm over the algorithm of Guestrin et al. [8] is reuse. As we describe in Sec. 9, ourapproach can reuse plans and messages for parts of the hierarchy that share the same structure. 3 Markov Decision Processes The Markov Decision Process (MDP) framework formal-izes the problem where agents are acting in a dynamic

environment, and are jointly trying to maximize their ex-pected long-term return. An MDP is defined as a 4-tuple (X; A; R; P ) where: X is a finite set of N = jXj states; Ais a set of actions;

R is a reward function R : X \Theta  A 7! Rsuch that R(x; a) represents the reward obtained in state x after taking action a; and P is a Markovian transitionmodel where

P (x

0

j x; a) represents the probability of go-ing from state

x to state x

0 with action

a. We will write

Ra to mean the vector of rewards associated with action a (with one entry for each state), and we will write Pa tomean the transition matrix associated with action

a (withone entry for each pair of source and destination states).

We assume that the MDP has an infinite horizon and thatfuture rewards are discounted exponentially with a discount factor fl 2 [0; 1).In general, the state space

X is more compactly definedin terms of state variables, i.e.,

X = fX1; : : : ; Xng. Sim-ilarly, the action can be decomposed into action variables

A = fA1; : : : ; Agg.The optimal value function

V

\Lambda  is defined so that

V

\Lambda 

(x)is the maximal value achievable by any action at state

x.More precisely, V

\Lambda  is the solution to (1) below. A stationary policy is a mapping ss : X 7! A, where ss(x) is theaction to be taken at state

x. For any value function V , wecan define the policy obtained by one-step lookahead on

V :Greedy [V ] = arg maxa[Ra + PaV ], where the arg max istaken componentwise. The greedy policy for the optimal

value function V

\Lambda  is the optimal policy

ss

\Lambda .

There are several algorithms for computing the optimalpolicy (see [17] for a review). One is to solve the Bellman

linear program: write V 2 R

N for the value function, so

that Vi represents the value of state i. Pick any fixed staterelevance weights

ff 2 R

N with

ff ? 0; without loss ofgenerality assume

P

i

ffi = 1. Then the Bellman LP is

Minimize ff \Delta  V V * Ra + flPaV (1)

Throughout this paper, inequalities between vectors holdcomponentwise; e.g.,

a * b means ai * bi for all i. Also,free index variables are by convention universally quantified; e.g., the constraint in (1) holds for all a.

fuel-flow air-flow brake

RPMengine-temp

actual-speedengine-power

road-slope desired-speed O2-sensor

transmission-gear

Figure 1: Smart engine control dynamics; e.g., the state variableO2-sensor depends on the action variables fuel-flow and air-flow.

The dual of the Bellman LP is

Maximize

P

a

Ra \Delta  OEa

P

a

OEa \Gamma  fl

P

a

P

T

a

OEa = ff ; OEa * 0 (2)

The vector OEa, called the flow for action a, can be inter-preted as the expected number of times that action

a willbe executed in each state (discounted so that future visits

count less than present ones). So, the objective of (2) isto maximize total reward for all actions executed, and the constraints say that the number of times we enter each statemust be the same as the number of times we leave it. State relevance weights tell us how often we start in each state. 4 Hierarchical multiagent factored MDPs Most interesting MDPs have a very large number of states.For example, suppose that we want to build a smart engine control, whose state at any time is described by thevariables shown in Fig. 1. The number of distinct states in such an MDP is exponential in the number of state vari-ables. Similarly, many interesting MDPs have a very large number of actions described by a smaller number of ac-tion variables. Even very fast exact algorithms cannot solve such large MDPs in a reasonable amount of time.Fortunately, in many cases we can group the variables in a large MDP into subsystems that interact in a sim-ple manner. The rounded boxes in Fig. 1 show one possible grouping; we might call the three subsystems fuel-injection, engine-control, and speed-control.

These subsystems overlap with each other on some vari-ables; e.g., fuel-injection and engine-control overlap on O2-sensor. We can consider O2-sensor to be an output ofthe fuel-injection system and an input to the engine-control.

The advantage of constructing such a grouping is thatwe can now plan for each subsystem separately. Since there are many fewer variables in each subsystem than there arein the MDP as a whole, we can hope that it will be possible to construct a global plan by stitching together plansfor the various subsystems. Of course, if we plan for each subsystem completely separately there's no guarantee thatthe overall plan will be useful, so we may need to replan for each subsystem several times taking into account thecurrent plans for neighboring subsystems.

In our engine example, we can examine the speed-control subsystem and compute what values of enginepower, brake, and transmission-gear would help us most inkeeping actual-speed near desired-speed. While the speedcontrol system controls transmission-gear and brake di-rectly, it must communicate with the engine-control system to influence the value of engine-power. If the desired valueof engine-power turns out to be too expensive to maintain,

the speed-control system may have to form a new plan thatmakes do with the available engine-power. 4.1 Subsystem tree MDPs To formalize the above intuition, we will define a generalclass of MDPs built from interacting subsystems. In Sec. 5,

we will give an algorithm for planning in such MDPs; thisalgorithm will run efficiently when the number of state variables in the scope of each subsystem is small.We will start by defining a basic subsystem. On its own, a basic subsystem is just an MDP with factored state andaction spaces, but below we will describe how to combine several basic subsystems into a larger MDP by allowingthem to share state and action variables. (In particular, an action variable in one subsystem can be a state variable inanother; a subsystem's actions are just the variables which can be set by forces external to the subsystem.) Definition 4.1 (Basic subsystem) A basic subsystem Mjis an MDP defined by the tuple

(Xj; Aj; Rj; Pj ). Theinternal state variables Xj = Internal[Mj] and the ex-ternal state variables Aj = External[Mj] are disjointsets. The scope of Mj is Scope[Mj] = Xj [ Aj. Thelocal reward function

Rj(xj ; aj) maps assignments for Scope[Mj] to real-valued rewards, and the local transitionfunction

Pj(x

0

j

j xj; aj ) maps assignments for Scope[Mj]to a probability distribution over the assignment

x

0

j to theinternal variables

Xj in the next time step.

We have divided the scope of a subsystem into internalvariables and external variables. Each subsystem knows the

dynamics of its internal variables, and can therefore reasonabout value functions that depend on these variables. External variables are those which the subsystem doesn't knowhow to influence directly; a subsystem may form opinions about which values of the external variables would be help-ful or harmful, but cannot perform Bellman backups for value functions that depend on external variables.We will write

Rj;a

j for the vector of rewards associatedwith a setting

aj for Aj. Rj;a

j has one component for eachsetting of

Xj. Similarly, we will write Pj;a

j for the transi-tion matrix associated with setting

Aj to aj; each columnof

Pj;a

j corresponds to a single setting of

Xj at time t,while each row corresponds to a setting of

Xj at time t + 1.Given several basic subsystems, we can collect them together into a subsystem tree. Right now we are not impos-ing any consistency conditions on the subsystems, but we will do so in a little while. Definition 4.2 (Subsystem tree) A subsystem tree M is aset of local subsystem MDPs

M1; : : : ; Mm together witha tree parent function Mi = Parent[Mj]. Without loss ofgenerality take M1 to be the root of the tree. We will writeParent [M1] = ;, and we will define Children[Mj] in theusual way. The internal variables of the tree,

Internal[M],are those variables which are internal to any

Mj; the ex-ternal variables, External[M], are the variables which arein the scope of some subsystem but not internal to any subsystem. Finally, the common (separator) variables between

a subsystem and its parent are denoted by SepSet[Mj] = Sj = Scope[Mj] " Scope[Parent[Mj]].

There are two consistency conditions we need to en-force on a subsystem tree to make sure that it defines an

MDP. The first says that every subsystem that referencesa variable

Xi must be connected to every other subsystemreferencing

Xi, and the second says that neighboring sub-systems have to agree on transition probabilities.

Definition 4.3 (Running intersection property) A sub-system tree

M satisfies the running intersection propertyif, whenever a variable

Xi is in both Scope[Mj] and Scope[Mk], then Xi 2 Scope[Ml] for every subsystem Ml in the path between Mj and Mk in M. Similarly,internal variables must satisfy the same condition.

Definition 4.4 (Consistent dynamics) A subsystem tree M has consistent dynamics if pairs of subsystems agreeon the dynamics of their joint variables. Specifically, if

zis an assignment to Scope[Mj] [ Scope[Mk], and if x

0 is

an assignment to Internal[Mj] " Internal[Mk] in the nexttime step, then

X [x

0

k

jx

0

]

Pk(x

0

k

j z) =

X

[x

0

j

jx

0

]

Pj(x

0

j

j z)

Here the sum [x

0

k

j x

0

] runs over all assignments x

0

k to

Internal[Mk] consistent with x

0.

A subsystem tree which satisfies the running intersec-tion property and has consistent dynamics is called a consistent subsystem tree. For the rest of this paper, all subsys-tem trees will be assumed to be consistent. Given a consistent subsystem tree, we can construct a plain old MDP,which defines the dynamics associated with this tree:

Lemma 4.5 (Equivalent MDP) From a consistent sub-system tree

M we can construct a well-defined equiva-lent MDP, MDP[M] = (X; A; R; P ). The state vari-able set is X = Internal[M]. The action variable set is A = External[M]. The reward function is the sum of allsubsystem rewards

R =

P

m

j=1

Rj. The transition proba-bilities for a given state and action

z = (x; a) are

P (x j z) =

Q

m

j=1

Pj (xj j zj )

Q

m

k=2

P

[xkjsk]

Pk(xk j zk)

where zj is the restriction of z to the scope of Mj; xj is therestriction of

x to the internal variables of Mj; sk is therestriction of z to variables in SepSet[Mk]; and [xk j sk]denotes the set of assignments to

xk consistent with sk.

In our engine example, the rounded boxes in Fig. 1 arethe scopes of the three subsystems. The O2-sensor variable

is an external variable for the engine-control system, andengine-power is an external variable for the speed-control system; all other variables of each subsystem are internal.(Note: the way we have described this example, each variable is internal to only one subsystem; this is not necessaryso long as we maintain the running intersection property.)

A

2 A

2

A

1 A

1

X

1

R

1 R

1

X

3

X

2

X'

3 X'

3

X'

2

X'

1

h

2 h

2

h

1 h

1

R

2 R

2

R

3

A

ERROR: rangecheckOFFENDING COMMAND: get STACK: 1 [0 ] false 12  (\Delta )6254  5040 -savelevel- %%[ Error: rangecheck; OffendingCommand: get ]%%
