Photo by Michael Dziedzic on Unsplash

# Advanced Reinforcement Learning Concepts

## Temporal-Difference Learning and On vs Off-Policy Learning

Today, we'll look at two RL algorithms that appear to be identical on the surface but have a crucial difference. Both algorithms learn the action-value function \(Q(s, a)\) in order to find the best policy possible. Once the action-value function \(Q(s, a)\) has been learned and optimized, the policy \(\pi(a | s)\) can be derived by greedily choosing the best action \(a\) leading to the highest value of \(Q(s, a)\) when in state \(s\).

# Temporal-Difference Learning

TD-Learning analyzes the observed experiences and, based on that data, optimizes the action-value function \(Q(s, a)\) or the value function \(V(s)\) for use in future operations. TD-Learning is similar to **Stochastic Gradient Descent (SGD)** as used in neural networks in that it slightly shifts the parameters in each iteration to improve the neural network performance gradually.

Each TD-Learning step considers a sequence \(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}\) from a trajectory \(\tau\) at a timestep \(t\). Further a hyperparameter \(\alpha \in (0, 1)\) is needed and the discount of future rewards \(\gamma\) has to be taken into consideration. TD-Learning shifts the existing function according to the **TD target** given by \(r_{t+1} + \gamma V(s_{t+1})\) or \(r_{t+1} + \gamma Q(s_{t+1}, a_{t+1})\) respectively, which estimates the function in the observed trajectory:

$$ \text{new function} = (1 - \alpha) \cdot \text{current function} + \alpha \cdot \text{observed function} $$ $$ \text{new value} = (1 - \alpha) \cdot \text{old value} + \alpha \cdot \text{TD target} $$

$$ V(s_t) = (1 - \alpha) V(s_t) + \alpha (r_{t+1} + \gamma V(s_{t+1})) $$ $$ V(s_t) = V(s_t) - \alpha V(s_t) + \alpha (r_{t+1} + \gamma V(s_{t+1})) $$ $$ V(s_t) = V(s_t) + \alpha (r_{t+1} + \gamma V(s_{t+1}) - V(s_t)) $$

$$
Q(s_t, a_t) = (1 - \alpha) Q(s_t, a_t) + \alpha (r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}))
$$
$$

Q(s_t, a_t) = Q(s_t, a_t) - \alpha Q(s_t, a_t) + \alpha (r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}))
$$
$$
Q(s_t, a_t) = Q(s_t, a_t) + \alpha (r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))
$$

It's also worth noting that TD-Learning can learn from past experiences and even incomplete trajectories because only a sequence \(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}\) needs to be analyzed. This may be significant when considering RL concerns that have no terminal state.

# \(\epsilon\)-Greedy

RL-Algorithms which learn from generated trajectories face a dilemma. The generation of a trajectory \(\tau\) requires a policy \(\pi(a | s)\). On the one hand, this strategy should aim to maximize the anticipated return in order for the new path to be utilized to fine-tune and profit from the acquired parameters. On the other hand, we need to explore unanticipated options in order for us not to get trapped in a negative local optimum.

As a result, we should take a brief look at the \(\epsilon\)**-Greedy policy**. This policy has a straightforward concept to balance exploration and analysis while developing a policy with the current learned model utilizing a hyperparameter \(\epsilon \in (0, 1) \).

The first option with probability \(\epsilon\) is to choose a random action \(a\) from all available actions. This option contains the exploration aspect and hopefully leads to the discovery of new strategies.

The second choice, with a probability of \(1 - \varepsilon\), is to select the action \(a\) greedily based on the current knowledge. This approach, which includes an exploitation component, improves conventional strategies.

# Code Examples

To compare the different approaches, each method is applied to OpenAI Gym's Taxi environment. A taxi picks up a passenger from four possible locations and drops him off at the target destination. To this end, the taxi can move in each of the four cardinal directions, as well as pick up and drop off the passenger. The methods' efficiency was evaluated using the variable "Timesteps" which indicates the number of actions taken by the taxi driver to reach the goal.

## Random Action Agent

```
from IPython.display import clear_output
from time import sleep
def output(frames, animation=True):
if animation:
for i, frame in enumerate(frames):
clear_output(wait=True)
print(frame['frame'])
print(f"Timestep: {i + 1}")
print(f"State: {frame['state']}")
print(f"Action: {frame['action']}")
print(f"Reward: {frame['reward']}")
sleep(.1)
else:
print(frames[-1]['frame'])
print(f'Total timesteps: {len(frames)}')
print('Training finished.\n')
```

```
import gym
env = gym.make("Taxi-v3").env
env.reset()
# initialize variables
epoch = 0
reward = 0
done = False
frames = []
while not done:
epoch += 1
action = env.action_space.sample()
state, reward, done, info = env.step(action)
# Put each rendered frame into dict for animation
frames.append({
'frame': env.render(mode='ansi'),
'state': state,
'action': action,
'reward': reward
}
)
# save last frame
frame = {
'frame': env.render(mode='ansi'),
'state': state,
'action': action,
'reward': reward
}
# output random action agent
output(frames, animation=False)
```

```
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
(Dropoff)
Total timesteps: 6955
Training finished.
```

## SARSA

**SARSA** is the most basic TD-Learning algorithm by learning the action-value function \(Q(s, a)\). The name reflects, that each learning step is done based on the sequence state-action-reward-state-action \(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}\).

- generate trajectory \(\tau\) using \(\epsilon\)-Greedy policy.
- for each sequence \(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}\) adjust \(Q(s_t, a_t)\) according to: $$ Q(s_t, a_t) = Q(s_t, a_t) + \alpha \left ( r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right ) $$
- If learning is insufficient and time is left go back to step 1. Else terminate the algorithm and output \(Q(s, a)\) and the greedy policy if needed.

```
import gym
import random
import numpy as np
env = gym.make('Taxi-v3').env
# hyperparameters
q_table = np.zeros([env.observation_space.n, env.action_space.n])
alpha = 0.1
gamma = 0.6
epsilon = 0.1
# start the SARSA learning over 10000 episodes
for i in range(10000):
# initialize varibales
epoch = 0
reward = 0
done = False
state = env.reset()
# choose action epsilon-greedy
if random.uniform(0, 1) < epsilon:
action = env.action_space.sample()
else:
action = np.argmax(q_table[state])
while not done:
epoch += 1
# get the next state
next_state, reward, done, info = env.step(action)
# chooose the next action epsilon greedy
if random.uniform(0, 1) < epsilon:
next_action = env.action_space.sample()
else:
next_action = np.argmax(q_table[next_state])
old_value = q_table[state, action]
new_value = q_table[next_state,next_action]
# learn the Q-value
q_table[state, action] = old_value + alpha * (reward + gamma * new_value - old_value)
state = next_state
action = next_action
# evaluate the performance
epoch = 0
done = False
frames = []
state = env.reset()
while not done:
epoch += 1
action = np.argmax(q_table[state])
state, reward, done, info = env.step(action)
frames.append({
'frame': env.render(mode='ansi'),
'state': state,
'action': action,
'reward': reward
})
# output sarsa agent
output(frames, animation=False)
```

```
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
(Dropoff)
Total timesteps: 12
Training finished.
```

## Q-Learning

**Q-Learning** is very similar to SARSA except for the utilization of a different TD-target. The new TD-target is \(r_{t+1} + \gamma \max_{a \in \mathcal{A}} Q(s_{t+1}, a)\). In other words, we no longer calculate the observed return but the maximum achievable reward from state \(s_{t+1}\). Thus we also no longer require \(a_{t+1}\) and the sequence only contains \(s_t, a_t, r_{t+1}, s_{t+1}\).

- generate trajectory \(\tau\) using \(\epsilon\)-Greedy policy.
- for each sequence \(s_t, a_t, r_{t+1}, s_{t+1}\) adjust \(Q(s_t, a_t)\) according to: $$ Q(s_t, a_t) = Q(s_t, a_t) + \alpha \left ( r_{t+1} + \gamma \max_{a \in \mathcal{A}} Q(s_{t+1}, a) - Q(s_t, a_t) \right ) $$
- If learning is insufficient and time is left, then go back to step 1. Else terminate the algorithm and output \(Q(s, a)\) and the greedy policy if needed.

```
import gym
import random
import numpy as np
env = gym.make("Taxi-v3").env
env.reset()
# hyperparameters
q_table = np.zeros([env.observation_space.n, env.action_space.n])
alpha = 0.1
gamma = 0.6
epsilon = 0.1
# start the Q-learning over 10000 episodes
for i in range(1, 10000):
state = env.reset()
epoch = 0
reward = 0
done = False
while not done:
epoch += 1
# choose action epsilon-greedy
if random.uniform(0, 1) < epsilon:
action = env.action_space.sample()
else:
action = np.argmax(q_table[state])
# get next state
next_state, reward, done, info = env.step(action)
old_value = q_table[state, action]
next_max = np.max(q_table[next_state])
# learn the Q-value
new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
q_table[state, action] = new_value
state = next_state
# evaluate the performance
epoch = 0
done = False
frames = []
state=env.reset()
while not done:
epoch += 1
action = np.argmax(q_table[state])
state, reward, done, info = env.step(action)
frames.append({
'frame': env.render(mode='ansi'),
'state': state,
'action': action,
'reward': reward
})
# output
output(frames, animation=False)
```

```
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
(Dropoff)
Total timesteps: 11
Training finished.
```

# Conclusion from examples

By using one of the provided reinforcement algorithms the goal was reached in a much shorter time frame compared to the control method. The two former performed similarly throughout the test period (note that both methods were not evaluated on the same instance of the environment thus optimal solution might differ).

# On- vs Off-policy

The change in the TD-target may seem small yet it is very important. SARSA estimates the Q-value assuming the \(\epsilon\)-Greedy policy used to generate the data \(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}\) is maintained. Thus, SARSA optimizes the \(\epsilon\)-Greedy policy and not the greedy policy. We call this **on-policy** since the policy used for data generation and updates are **the same**.

In contrast, Q-Learning also generates data using the \(\epsilon\)-Greedy policy, yet Q-Learning updates are based on the greedy policy. Through this, Q-Learning always aims to improve the greedy policy. This behavior is called **off-policy** since the policy used for data generation and updates are **not** the same.