Policy Types in Reinforcement Learning

Deterministic and Stochastic Policies Explained

In Reinforcement Learning (RL), a policy is the rule an agent uses to decide which action to take in its current state in pursuit of its goal. In this blog post, we will discuss two types of RL policy: deterministic policies and stochastic policies.

Deterministic Policies: In a deterministic policy, the action taken in a given state is always the same. For a small, discrete state space, this can be implemented using a lookup table or decision tree. A deterministic policy is usually denoted by \( \mu \):

$$ a_t = \mu(s_t) $$
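As a rough illustration of the lookup-table idea, here is a minimal sketch in Python. The state and action names are made up for this example and are not part of any particular environment.

```python
# Hypothetical toy problem: state and action names are invented for illustration.
MU_TABLE = {
    "at_door": "open",
    "in_hallway": "forward",
    "at_wall": "turn_left",
}

def mu(state: str) -> str:
    """Deterministic policy: return the single action stored for this state."""
    return MU_TABLE[state]

# Calling the policy repeatedly in the same state always gives the same action.
actions = [mu("in_hallway") for _ in range(3)]
print(actions)  # ['forward', 'forward', 'forward']
```

A table like this only works when the states can be enumerated; for large or continuous state spaces the mapping would have to be a function instead.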

Stochastic Policies: A stochastic policy, on the other hand, can produce different actions on different visits to the same state, because each action is sampled from a probability distribution conditioned on that state. A stochastic policy is usually denoted by \( \pi \):

$$ a_t \sim \pi( \cdot | s_t) $$
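To make the sampling notation concrete, the sketch below stores a probability table for two made-up states and draws actions from it with Python's standard `random` module. The states, actions, and probabilities are assumptions chosen only for illustration.

```python
import random

# Hypothetical action probabilities pi(. | s) for two invented states.
PI_TABLE = {
    "in_hallway": {"left": 0.8, "right": 0.2},
    "at_door": {"open": 0.5, "wait": 0.5},
}

def sample_action(state: str, rng: random.Random) -> str:
    """Stochastic policy: draw a_t ~ pi(. | s_t) from the stored probabilities."""
    probs = PI_TABLE[state]
    actions = list(probs.keys())
    weights = list(probs.values())
    # choices() draws with replacement according to the given weights.
    return rng.choices(actions, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded only so the sketch is reproducible
draws = [sample_action("in_hallway", rng) for _ in range(10_000)]
left_fraction = draws.count("left") / len(draws)
print(f"fraction of 'left' draws: {left_fraction:.3f}")  # close to 0.8
```

Unlike the deterministic case, two calls in the same state can return different actions, but over many calls the frequencies follow the stored probabilities.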

Because the policy is essentially the agent's brain, it's not unusual to replace "policy" with "agent," such as when someone says, "The agent is attempting to maximize reward."

In deep RL, we deal with parameterized policies: policies whose outputs are computable functions that depend on a set of parameters (for example, the weights and biases of a neural network), which can be adjusted with an optimization algorithm to change the agent's behavior. The parameters are usually written as \( \theta \) and shown as a subscript on the policy symbol, as in \( \pi_\theta \).

Stochastic Policies

Categorical and diagonal Gaussian policies are the two most common types of stochastic policies in deep RL. Categorical policies can be used in discrete action spaces, while diagonal Gaussian policies are used in continuous action spaces.

Categorical Policy: A categorical policy is a stochastic policy over a finite set of discrete actions. In each state it assigns a probability to every action; the probabilities are nonnegative, sum to one, and generally differ from action to action and from state to state.

Diagonal Gaussian Policy: A diagonal Gaussian policy, on the other hand, produces a real-valued action vector drawn from a multivariate Gaussian distribution with a diagonal covariance matrix. This means that in any given state, each component of the action is sampled independently, with its own mean and standard deviation, so actions near the mean are more likely than actions far from it.

The two most important computations for employing and training stochastic policies are:

  • Sampling actions from the policy,
  • Computing log-likelihoods of particular actions, \( \log \pi_\theta(a | s) \).

We'll go through how to do both for categorical and diagonal Gaussian policies in the sections below.

Categorical Policies

A categorical policy is comparable to a classifier in terms of its structure. You construct a neural network for a categorical policy in the same way as you would for a classifier: the input is the observation, followed by one or more layers (possibly convolutional or densely-connected, depending on the type of input), and then there's a final linear layer with one output (a logit) for each possible action, followed by a softmax that converts the logits into probabilities.
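The structure described above can be sketched without any deep learning framework. The example below is a minimal, untrained forward pass written with the standard library only; the layer sizes, the tanh hidden activation, the softmax output, and the random initialization are all assumptions made for this sketch, not a recommended architecture.

```python
import math
import random

def linear(x, weights, biases):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def softmax(logits):
    """Convert logits to probabilities; subtracting the max keeps exp() stable."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (an arbitrary choice for this sketch)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def categorical_policy(obs, params):
    """Observation -> hidden layer (tanh) -> logits -> action probabilities."""
    (w1, b1), (w2, b2) = params
    hidden = [math.tanh(h) for h in linear(obs, w1, b1)]
    logits = linear(hidden, w2, b2)
    return softmax(logits)

rng = random.Random(0)
OBS_DIM, HIDDEN, N_ACTIONS = 4, 8, 3  # hypothetical sizes
params = (init_layer(OBS_DIM, HIDDEN, rng), init_layer(HIDDEN, N_ACTIONS, rng))
probs = categorical_policy([0.1, -0.2, 0.3, 0.0], params)
print(probs)  # three positive numbers that sum to 1
```

In practice the same structure would be built from a framework's layers so that the parameters can be trained by gradient descent.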

Action Sampling and Log-Likelihood

To take actions according to a categorical policy, we sample them from the probability distribution over actions that the network outputs for the current observation. This is simply the policy's output distribution, not a posterior in the Bayesian sense. There are two situations worth distinguishing, depending on whether you also want a Monte Carlo estimate of the policy's action probabilities.

Maintaining Monte Carlo Estimates: If you want a Monte Carlo estimate of the action probabilities, for example to sanity-check a sampler, draw many actions from the policy for the same observation and keep track of how often each action is sampled. The empirical frequencies approach the policy's probabilities as the number of samples grows. Importance sampling is a different tool: it reweights samples drawn from one distribution to estimate expectations under another, and it is not needed just to sample actions.

Not Maintaining Monte Carlo Estimates: In the usual case, you don't keep any such estimate and simply draw one action per step from the categorical distribution. Most deep learning frameworks provide built-in tools for sampling from a categorical distribution.

In either case, training also requires the log-probability of an action given the observation:

$$ \log \pi_\theta(a | s) = \log P_\theta(s)_a $$

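where \( P_\theta(s) \) is the vector of action probabilities produced by the last layer of the neural network, and the subscript \( a \) picks out the entry for action \( a \) (treating actions as indices into the vector). It's computed by passing the observation through the network (which has one output for each possible action in this case) and then applying the softmax activation function.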

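Once we have the log-probabilities, they are mainly used during training, for example in policy gradient objectives. If you want greedy, deterministic behavior instead of sampling (say, at evaluation time), you can compare the probabilities of all the actions and take the action with the highest one.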
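Both computations for a categorical policy can be sketched in a few lines. The example below assumes the network has already produced a probability vector (the values used here are made up), samples by walking the cumulative distribution, and computes the log-probability by indexing into the vector. It is an illustration of the idea rather than how any particular library does it.

```python
import math
import random

def sample_categorical(probs, rng):
    """Draw an action index by walking the cumulative distribution."""
    u = rng.random()
    cumulative = 0.0
    for action, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return action
    return len(probs) - 1  # guard against floating-point round-off

def log_prob(probs, action):
    """log pi(a | s): log of the probability vector indexed at `action`."""
    return math.log(probs[action])

rng = random.Random(0)
probs = [0.7, 0.2, 0.1]  # hypothetical network output P_theta(s)
action = sample_categorical(probs, rng)
print(action, log_prob(probs, action))

# Greedy alternative: always take the most probable action.
greedy = max(range(len(probs)), key=lambda a: probs[a])
print(greedy)  # 0
```

Sampling keeps the agent exploring, while the greedy choice always returns the same action for the same probabilities.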

Diagonal Gaussian Policies

A multivariate Gaussian distribution (also known as a multivariate normal distribution) is characterized by a mean vector, \( \mu \), and a covariance matrix, \( \Sigma \). A diagonal Gaussian distribution is the special case in which the entries of \( \Sigma \) are all zero except on the diagonal, which holds the variances of the individual variables. As a result, the covariance can be represented by a vector.

The advantage of a diagonal Gaussian distribution is that its variables are independent, so it is fully described by one mean and one variance per variable rather than a full covariance matrix.

This makes it easy to sample from a diagonal Gaussian distribution: for each variable, you draw a value independently from a standard normal distribution, multiply it by that variable's standard deviation, and add that variable's mean.

We can use the same approach to sample actions from a diagonal Gaussian policy. A neural network maps the observation to a mean action, \( \mu_\theta(s) \), with one entry per action dimension; we also need a standard deviation for each action dimension, and then we draw noise from a standard normal distribution. There are two different ways that the covariance matrix is typically represented.

  1. There is a single vector of log standard deviations, \( \log \sigma \), which is not a function of the state: the log standard deviations are standalone parameters.
  2. A neural network maps from states to log standard deviations, \( \log \sigma_\theta(s) \). It can be configured to share certain layers with the mean network if desired.

In both cases, we output log standard deviations rather than standard deviations directly. This is because log stds are free to take on any value in \( (-\infty, \infty) \), while stds must be nonnegative. Parameters are much easier to train when you don't have to enforce that kind of constraint. The log standard deviations can be exponentiated to obtain the standard deviations immediately, so we don't lose anything by representing them this way.
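A small sketch may help show why the log parameterization is convenient. It uses the first option from the list above (a standalone vector of log standard deviations); the numeric values and the size of the simulated update are made up for illustration.

```python
import math

# Standalone vector of log standard deviations, one per action dimension.
# These would be trainable parameters; the values here are invented.
log_std = [-0.5, 0.0, -3.0]

# Any real value of log_std maps to a strictly positive standard deviation.
std = [math.exp(v) for v in log_std]
print(std)

# An optimizer step can move log_std anywhere without breaking positivity.
log_std_after_update = [v - 10.0 for v in log_std]
std_after_update = [math.exp(v) for v in log_std_after_update]
print(all(s > 0.0 for s in std_after_update))  # True
```

If the standard deviations were optimized directly, an update of that size would have pushed them below zero and some form of clipping or projection would be needed.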


Given the mean action \( \mu_\theta(s) \) and standard deviation \( \sigma_\theta(s) \), and a vector of noise \( z \sim \mathcal{N}(0,I)\), an action sample can be computed with the following formula:

$$ a = \mu_\theta(s) + \sigma_\theta(s) \cdot z$$

where \( \cdot \) represents the elementwise product of two vectors.
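The sampling formula translates almost directly into code. The sketch below uses Python's standard `random.gauss` for the noise and assumes the mean and log standard deviation vectors are already available; the values used are hypothetical stand-ins for network outputs.

```python
import math
import random

def sample_diag_gaussian(mean, log_std, rng):
    """Return a = mean + std * z elementwise, with z ~ N(0, I)."""
    action = []
    for m, ls in zip(mean, log_std):
        z = rng.gauss(0.0, 1.0)  # one independent noise draw per dimension
        action.append(m + math.exp(ls) * z)
    return action

rng = random.Random(0)
mean = [0.5, -1.0]     # hypothetical mu_theta(s)
log_std = [-1.0, 0.0]  # hypothetical log sigma
a = sample_diag_gaussian(mean, log_std, rng)
print(a)
```

Because each dimension gets its own noise draw, the components of the action are independent, which is exactly what the diagonal covariance expresses.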


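The log-likelihood of a k-dimensional action \( a \), for a diagonal Gaussian with mean \( \mu = \mu_\theta(s) \) and standard deviation \( \sigma = \sigma_\theta(s) \), is given by: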

$$ \log \pi_\theta(a | s) = - \frac{1}{2} \left ( \sum_{i=1}^k \left ( \frac{(a_i - \mu_i )^2}{\sigma_i^2} + 2 \log \sigma_i \right ) + k \log 2 \pi \right ) $$
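The formula can be checked with a short function that follows it term by term. This is a sketch under the assumption that the policy supplies log standard deviations, so \( 2 \log \sigma_i \) is computed as twice the log-std entry; the function name is hypothetical.

```python
import math

def diag_gaussian_log_prob(action, mean, log_std):
    """Log-density of a diagonal Gaussian, term by term as in the formula."""
    k = len(action)
    total = 0.0
    for a_i, mu_i, ls_i in zip(action, mean, log_std):
        sigma_i = math.exp(ls_i)
        # (a_i - mu_i)^2 / sigma_i^2 + 2 log sigma_i
        total += ((a_i - mu_i) / sigma_i) ** 2 + 2.0 * ls_i
    return -0.5 * (total + k * math.log(2.0 * math.pi))

# Sanity check: 1-D standard normal evaluated at 0 gives log(1 / sqrt(2 * pi)).
print(diag_gaussian_log_prob([0.0], [0.0], [0.0]))  # about -0.9189
```

Since the dimensions are independent, the log-likelihood of a vector action equals the sum of the one-dimensional log-likelihoods, which is a handy way to test an implementation.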


Reinforcement Learning works because we can estimate values by taking expectations (i.e., probability-weighted averages) of future rewards under different policies. For example, if our policy gives us an 80% chance of going left and a 20% chance of going right, then on average we can expect to end up at the left side of a door 80% of the time and at the right side 20% of the time.

If you have two policies that both take you to the same doors with equal probability, but one is more likely than the other to lead you to food (i.e., it has a higher expected reward), Reinforcement Learning will prefer the policy with the greater expected reward.
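The idea of comparing policies by expected reward can be illustrated with a one-step simplification in which the reward depends only on the door chosen. The reward values and the second policy below are assumptions made up for this sketch.

```python
# Hypothetical rewards: food is behind the left door.
REWARD = {"left": 1.0, "right": 0.0}

def expected_reward(policy):
    """Probability-weighted average of the reward over the policy's actions."""
    return sum(prob * REWARD[action] for action, prob in policy.items())

policy_a = {"left": 0.8, "right": 0.2}  # the 80% / 20% policy from the text
policy_b = {"left": 0.5, "right": 0.5}  # an invented alternative

values = {"A": expected_reward(policy_a), "B": expected_reward(policy_b)}
best = max(values, key=values.get)
print(values, best)
```

In a real problem the expectation also runs over future states and rewards, and it is usually estimated from sampled trajectories rather than computed exactly.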

In this post, we've looked at a few different ways of representing policies in Reinforcement Learning. We've seen how to represent deterministic and stochastic policies, including categorical and diagonal Gaussian policies. In the next post, we'll look at how to actually implement these policies in code.
