
Reinforcement Learning Problem Formulation

  • Markov Chain (MC)
    • Define (fully observable) states and memoryless (i.e., Markovian) state transitions
  • Markov Reward Process (MRP)
    • Indicate preferences through a reward function
  • Markov Decision Process (MDP)
    • Include decisions that can be controlled by an agent's policy
  • Partially Observable MDP (POMDP)
    • Deal with situations where the true state information isn't available

Markov Chain (MC)

  • Markov Property: the state transition probability depends only on the current state, regardless of the history (i.e., memoryless).
  • The state transition can be deterministic or stochastic.
  • MC=(S,P)
    • States: s ∈ S
    • State transition function: P(s'|s)
Markov Chain, from Wikipedia.
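
A minimal sketch of sampling a trajectory from a Markov chain. The 3-state weather chain and its row-stochastic transition matrix below are made-up values for illustration:

```python
import numpy as np

# Hypothetical 3-state chain with a row-stochastic transition matrix P[s, s'].
states = ["sunny", "cloudy", "rainy"]
P = np.array([
    [0.8, 0.15, 0.05],
    [0.3, 0.40, 0.30],
    [0.2, 0.30, 0.50],
])

def sample_chain(P, s0=0, steps=10, seed=0):
    """Roll out s_0, s_1, ..., s_T using only P(s'|s): the next state depends on the current state alone."""
    rng = np.random.default_rng(seed)
    traj = [s0]
    for _ in range(steps):
        traj.append(int(rng.choice(len(P), p=P[traj[-1]])))
    return traj

print([states[s] for s in sample_chain(P)])
```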

Markov Reward Process (MRP)

  • Indicate preferences through reward signals.
  • MRP=(S,P,R)
    • States: s ∈ S
    • State transition function: P(s'|s)
    • Reward function: R(r|s,s')
      • rewards can be defined for each state transition (i.e., edges)
      • or can be defined for each state (i.e., vertices)
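
A sketch extending the chain above with an edge-reward matrix R[s, s'] (the reward values are arbitrary); a rollout simply accumulates the rewards collected along the sampled trajectory:

```python
import numpy as np

# Hypothetical edge rewards R[s, s'] on top of the transition matrix P from the MC sketch above.
R = np.array([
    [ 1.0,  0.0, -1.0],
    [ 0.5,  0.0, -0.5],
    [ 0.0,  0.5, -1.0],
])

def rollout_return(P, R, s0=0, steps=10, seed=0):
    """Sample s_0..s_T from P(s'|s) and accumulate the edge rewards R(s, s')."""
    rng = np.random.default_rng(seed)
    s, total = s0, 0.0
    for _ in range(steps):
        s_next = int(rng.choice(len(P), p=P[s]))
        total += R[s, s_next]
        s = s_next
    return total
```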

Markov Decision Process (MDP)

  • Include decisions that are controlled by an agent's policy.
  • Involves an agent with its policy π(a|s).
  • The agent's goal is to update its policy so as to maximize the total reward (return).
  • MDP=(S,A,P,R)
    • States: s ∈ S
    • Actions: a ∈ A, where a ∼ π(a|s)
      • Note that [no action] is also an action.
    • State transition function: P(s'|s,a)
    • Reward function: R(r|s,a,s')
      • rewards can be defined for each state transition (i.e., edges)
      • or can be defined for each state (i.e., vertices)
Markov Decision Process, from Wikipedia.
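
A sketch of a single MDP interaction step, assuming a made-up 2-state, 2-action MDP with a transition tensor P[a, s, s'] and edge rewards R[s, a, s']; the uniform random policy is just a placeholder:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: transitions P[a, s, s'] and rewards R[s, a, s'].
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions under action 0
    [[0.5, 0.5], [0.6, 0.4]],   # transitions under action 1
])
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0                # reward for landing in state 1

def step(s, policy, rng):
    """One step: a ~ pi(a|s), s' ~ P(s'|s,a), r = R(s,a,s')."""
    a = int(rng.choice(2, p=policy[s]))
    s_next = int(rng.choice(2, p=P[a, s]))
    return a, float(R[s, a, s_next]), s_next

rng = np.random.default_rng(0)
policy = np.array([[0.5, 0.5], [0.5, 0.5]])   # uniform random pi(a|s) as a placeholder
print(step(0, policy, rng))
```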

Partially Observable MDP (POMDP)

  • Include an observation function O(o|s)
  • The agent does not observe the full state, but can instead condition its policy on:
    • the current observation, π(a|o_t),
    • the observation history, π(a|o_{0:t}), or
    • the action-observation history, π(a|h_t), where h_t = [o_0, a_0, …, o_t, a_t].
  • The true state may be approximated from observations via frame stacking, RNNs, transformers, etc. (see the frame-stacking sketch below).

SideNote: the relationship of Markov Chains to Markov Decision Processes is analogous to that of Hidden Markov Models (HMMs) to POMDPs. 1

POMDP, from Section 9.5.6 of Artificial Intelligence: Foundations of Computational Agents.
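
As one way to approximate the hidden state, here is a minimal frame-stacking sketch: the policy input is the concatenation of the last K observations (K and the padding rule at episode start are arbitrary choices):

```python
from collections import deque
import numpy as np

K = 4                                   # number of stacked observations (arbitrary choice)
frames = deque(maxlen=K)

def stacked_input(o_t):
    """Return the policy input for pi(a | o_{t-K+1:t}) by stacking the last K observations."""
    if not frames:
        frames.extend([o_t] * K)        # pad with the first observation at episode start
    else:
        frames.append(o_t)
    return np.concatenate(list(frames), axis=-1)
```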

The Agent-Environment Interface

  • Training Phase of (Online) RL
    • The agent collects experience tuples (s_t, a_t, r_{t+1}, s_{t+1}) by exploring the environment and learns a policy from them.
    • The state transitions & rewards can only be observed through sampling.
  • Testing Phase of (Online) RL
    • The agent applies the best learned policy.
The Agent-Environment Interface, from Section 3.1 of Reinforcement Learning: An Introduction.
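
A sketch of the experience-collection loop, assuming the Gymnasium API and the CartPole-v1 environment; the random action is a stand-in for sampling from a learned policy π(a|s):

```python
import gymnasium as gym

# Collect experience tuples (s_t, a_t, r_{t+1}, s_{t+1}) by interacting with the environment.
env = gym.make("CartPole-v1")
experience = []

obs, info = env.reset(seed=0)
for _ in range(200):
    action = env.action_space.sample()                    # placeholder for a ~ pi(a|s)
    next_obs, reward, terminated, truncated, info = env.step(action)
    experience.append((obs, action, reward, next_obs))    # (s_t, a_t, r_{t+1}, s_{t+1})
    obs = next_obs
    if terminated or truncated:                           # start a new episode
        obs, info = env.reset()
env.close()
```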

Examples

Formulating tasks as MDPs for RL. Note that these are only minimal examples; real formulations can be much more complicated.

1. Classification

  • Goal: Maximize accuracy.
  • State: Input Image x.
  • Action: Dog or Cat.
  • Reward: +1 if correct, 0 otherwise.

2. Shortest Path

  • Goal: Minimize total distance.
  • State: Current vertex.
  • Action: Move along an edge.
  • Reward: −1 × [edge weight] for each step.
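
A sketch of this formulation on a small made-up weighted graph: the state is the current vertex, an action moves to a neighbor, and the reward is the negative edge weight:

```python
# Hypothetical 4-vertex weighted graph; vertex 3 is the goal (terminal state).
graph = {
    0: {1: 2.0, 2: 5.0},   # vertex 0 -> {neighbor: edge weight}
    1: {2: 1.0, 3: 7.0},
    2: {3: 2.0},
    3: {},
}
GOAL = 3

def step(s, a):
    """State = current vertex, action = neighbor to move to, reward = -(edge weight)."""
    assert a in graph[s], "the action must follow an existing edge"
    reward = -graph[s][a]
    done = (a == GOAL)
    return a, reward, done
```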

3. Console Game

  • Goal: Maximize total score.
  • State: Current game screen.
  • Action: Game controls (Up, down, left, right, jump, etc.)
  • Reward: Immediate score.

4. Robot Control

  • Goal: Control robot to perform tasks.
  • State: Joint angles, camera images, etc.
  • Action: Target joint angles, etc.
  • Reward: +100 for each completed task, −1 for each timestep.

  1. See the Markov chain article on Wikipedia.
