Markov Decision Process¶
Markov Decision Process (MDP)¶
- An MDP is defined by (a minimal representation is sketched after this list):
  - A set of states \(s \in S\)
  - A set of actions \(a \in A\)
  - A transition function \(T(s, a, s')\)
    - Probability that \(a\) from \(s\) leads to \(s'\), i.e., \(P(s' | s, a)\)
    - Also called the model or the dynamics
  - A reward function \(R(s, a, s')\)
    - Sometimes just \(R(s)\) or \(R(s')\)
  - A start state
  - Maybe a terminal state
- MDPs are non-deterministic search problems
  - One way to solve them is with expectimax search
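To make the definition concrete, here is a minimal sketch of one way an MDP could be represented in Python. The tiny two-state chain (states `A`, `B`, terminal `END`), the action names, and all probabilities and rewards are made up for illustration; only the shape of `T` and `R` follows the definitions above.

```python
# A minimal, hypothetical MDP: two states plus a terminal state.
STATES = ["A", "B", "END"]
ACTIONS = ["stay", "go"]
START = "A"
TERMINAL = {"END"}

# T[s][a] is a dict {s': P(s' | s, a)}; probabilities for each (s, a) sum to 1.
T = {
    "A": {"stay": {"A": 0.9, "B": 0.1}, "go": {"B": 0.8, "A": 0.2}},
    "B": {"stay": {"B": 0.9, "A": 0.1}, "go": {"END": 1.0}},
}

# R[s][a] is a dict {s': R(s, a, s')}.
R = {
    "A": {"stay": {"A": 0.0, "B": 1.0}, "go": {"B": 2.0, "A": 0.0}},
    "B": {"stay": {"B": 1.0, "A": 0.0}, "go": {"END": 10.0}},
}

def transition(s, a):
    """Return a list of (s', P(s' | s, a), R(s, a, s')) triples."""
    return [(s2, p, R[s][a][s2]) for s2, p in T[s][a].items()]
```

Any dictionary, table, or function with this shape works; the later snippets in these notes reuse this toy MDP.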
Quantities¶
Policy¶
- In deterministic single-agent search problems, we wanted an optimal plan, or sequence of actions, from the start to a goal
- For MDPs, we want an optimal policy \(\pi^*: S \to A\) (a mapping from states to actions)
  - A policy \(\pi\) gives an action for each state
  - An optimal policy is one that maximizes expected utility if followed
  - An explicit policy defines a reflex agent (a minimal example is sketched below)
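Because a policy is just a mapping from states to actions, it can be written down as a lookup table; the agent then acts reflexively by looking up its current state. The state and action names below are the hypothetical ones from the earlier sketch.

```python
# A policy pi: S -> A, represented as a plain dict (hypothetical states/actions).
pi = {"A": "go", "B": "go"}

def act(policy, state):
    """A reflex agent: the chosen action depends only on the current state."""
    return policy[state]

print(act(pi, "A"))  # -> "go"
```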
Discounting¶
- Rewards in the future are considered less valuable than immediate rewards
- Discount Factor \(\gamma \in [0, 1]\)
  - \(\gamma = 0\) means only immediate rewards matter
  - \(\gamma = 1\) means all rewards matter equally
  - \(\gamma \in (0, 1)\) means future rewards are discounted
  - \(\gamma\) is a hyperparameter
- The utility of a sequence of rewards is the sum of the rewards, each discounted by a factor of \(\gamma\) raised to the power of the number of steps into the future.
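Written out, the utility of a reward sequence \(r_0, r_1, r_2, \dots\) is \(U = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots = \sum_{t} \gamma^t r_t\). A small sketch of this sum, using a made-up reward sequence:

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each discounted by gamma**t for t steps into the future."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: with gamma = 0.5, the sequence [1, 2, 3] is worth 1 + 0.5*2 + 0.25*3 = 2.75.
print(discounted_return([1, 2, 3], gamma=0.5))  # -> 2.75
```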
Solving MDPs¶
Optimal Quantities¶
- The value (utility) of a state \(s\):
  - \(V^*(s)\) = expected utility starting in \(s\) and acting optimally
- The value (utility) of a \(q\)-state \((s, a)\):
  - \(Q^*(s, a)\) = expected utility starting out having taken action \(a\) from state \(s\) and (thereafter) acting optimally
- The optimal policy:
  - \(\pi^*(s)\) = optimal action from state \(s\)
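These three quantities are related in the standard way: the value of a state is the value of its best \(q\)-state, and the optimal policy picks the action that achieves that maximum:

\[
V^*(s) = \max_a Q^*(s, a), \qquad \pi^*(s) = \arg\max_a Q^*(s, a)
\]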
Values of States¶
- Expected utility under optimal action
- Compute using expectimax search
- Recursive definition of value (written out below):
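In the notation above (with the \(R(s, a, s')\) form of the reward), this recursion is the usual one-step expectimax/Bellman form:

\[
V^*(s) = \max_a Q^*(s, a)
\]
\[
Q^*(s, a) = \sum_{s'} T(s, a, s') \bigl[ R(s, a, s') + \gamma V^*(s') \bigr]
\]
\[
V^*(s) = \max_a \sum_{s'} T(s, a, s') \bigl[ R(s, a, s') + \gamma V^*(s') \bigr]
\]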
Problems with Expectimax¶
- Problem 1: States are repeated
  - Idea: Only compute needed quantities once
- Problem 2: The tree goes on forever
  - Idea: Do a depth-limited computation, but with increasing depths until the change is small
  - Note: deep parts of the tree eventually don't matter if \(\gamma < 1\)
Value Iteration¶
Time-Limited Values¶
- Define \(V_k(s)\) to be the optimal value of \(s\) if the game ends in \(k\) more time steps
- Start with \(V_0(s) = 0\): no time steps left means an expected reward sum of zero
- Given the vector of \(V_k(s)\) values, do one ply of expectimax from each state (the update is written out below)
- Repeat until convergence
- Complexity of each iteration: \(O(S^2 A)\)
- Theorem: will converge to unique optimal values
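The one-ply-of-expectimax update referred to above is \(V_{k+1}(s) = \max_a \sum_{s'} T(s, a, s') \bigl[ R(s, a, s') + \gamma V_k(s') \bigr]\). Below is a small value-iteration sketch over the dictionary-style toy MDP from the earlier snippets (the `transition` helper, state names, and `TERMINAL` set are the same made-up ones):

```python
def value_iteration(states, actions, transition, gamma=0.9, tol=1e-6, terminal=()):
    """Iterate V_{k+1}(s) = max_a sum_{s'} T(s,a,s') [R(s,a,s') + gamma V_k(s')]."""
    V = {s: 0.0 for s in states}          # V_0(s) = 0
    while True:
        new_V = {}
        for s in states:
            if s in terminal:
                new_V[s] = 0.0            # no future reward from a terminal state
                continue
            # Assumes every action is available in every non-terminal state.
            new_V[s] = max(
                sum(p * (r + gamma * V[s2]) for s2, p, r in transition(s, a))
                for a in actions
            )
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V                  # values have (numerically) converged
        V = new_V

# Usage with the toy MDP defined earlier:
# V_star = value_iteration(STATES, ACTIONS, transition, gamma=0.9, terminal=TERMINAL)
```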
Policy Iteration¶
- An alternative approach to value iteration
- Step 1: Policy evaluation: calculate utilities for some fixed (not optimal) policy.
- Step 2: Policy improvement: update the policy using one-step look-ahead with the resulting converged (but not optimal!) utilities as future values.
- Repeat the two steps until the policy stops changing.
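A sketch of the two alternating steps, again over the toy dictionary MDP from the earlier snippets. Policy evaluation is done here by simple iteration rather than by solving the linear system, which is one of the standard options.

```python
def policy_evaluation(policy, states, transition, gamma=0.9, tol=1e-6, terminal=()):
    """Step 1: compute V^pi for a fixed policy by iterating the (linear) Bellman update."""
    V = {s: 0.0 for s in states}
    while True:
        new_V = {}
        for s in states:
            if s in terminal:
                new_V[s] = 0.0
                continue
            a = policy[s]
            new_V[s] = sum(p * (r + gamma * V[s2]) for s2, p, r in transition(s, a))
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V
        V = new_V


def policy_iteration(states, actions, transition, gamma=0.9, terminal=()):
    """Alternate policy evaluation and one-step-look-ahead policy improvement."""
    policy = {s: actions[0] for s in states if s not in terminal}  # arbitrary start
    while True:
        V = policy_evaluation(policy, states, transition, gamma, terminal=terminal)
        # Step 2: greedy one-step look-ahead with the evaluated utilities.
        improved = {
            s: max(
                actions,
                key=lambda a: sum(p * (r + gamma * V[s2]) for s2, p, r in transition(s, a)),
            )
            for s in states if s not in terminal
        }
        if improved == policy:            # policy is stable: stop
            return policy, V
        policy = improved

# Usage with the toy MDP defined earlier:
# pi_star, V_star = policy_iteration(STATES, ACTIONS, transition, terminal=TERMINAL)
```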