The theory of Proximal Policy Optimisation implementations
Prelude
The aim of this post is to share my understanding of some of the conceptual and theoretical background behind implementations of the Proximal Policy Optimisation (PPO) reinforcement learning (RL) algorithm. PPO is widely used due to its stability and sample efficiency—popular applications include beating the Dota 2 world champions and aligning language models. While the PPO paper provides quite a general and straightforward overview of the algorithm, modern implementations of PPO use several additional techniques to achieve state-of-the-art performance in complex environments [1]. You might discover this if you try to implement the algorithm solely based on the paper. Here, I try to present a coherent narrative around these additional techniques.
I’d recommend reading parts one, two, and three of SpinningUp if you’re new to reinforcement learning. There are a few longer-form educational resources I’d recommend if you’d like a broader understanding of the field [2], though the list isn’t comprehensive. You should be familiar with common concepts and terminology in RL [3]. For clarity, I’ll try to spell out any jargon I use here.
Recap
Policy Gradient Methods
PPO is an on-policy reinforcement learning algorithm. It directly learns a stochastic policy function parameterised by θ representing the likelihood of action a in state s, $\pi_\theta(a \mid s)$. Consider that we have some differentiable function, $J(\theta)$, which is a continuous performance measure of the policy $\pi_\theta$. In the simplest case, we have $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$, the expected return [4] over a trajectory [5], τ. PPO is a kind of policy gradient method [6] which directly optimises the policy parameters θ against $J(\theta)$. The policy gradient theorem shows that:
$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty} \nabla_\theta \ln \pi_\theta(a_t \mid s_t)\, R_t\right]$$
In other words, the gradient of our performance measure $J(\theta)$ with respect to our policy parameters θ points in the direction of maximising the return $R_t$. Crucially, this shows that we can estimate the true gradient using an expectation of the sample gradient—the core idea behind the REINFORCE [7] algorithm. This is great. The expression has a more general form, which substitutes $R_t$ for some lower-variance estimator of the total expected reward, $\Phi$ [8]:
$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty} \nabla_\theta \ln \pi_\theta(a_t \mid s_t)\, \Phi_t\right] \tag{1}$$
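As a concrete sketch of how eqn. 1 is used in practice, the sample-based surrogate below is the quantity whose gradient (taken by an autodiff framework in a real implementation) matches the policy gradient. This is my own minimal numpy illustration with a small categorical policy; the function names are not from any library:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def pg_surrogate(logits, actions, phi):
    # Mean of log pi(a_t | s_t) weighted by Phi_t; differentiating this
    # with respect to the policy parameters recovers eqn. (1).
    probs = softmax(logits)
    logp = np.log(probs[np.arange(len(actions)), actions])
    return (logp * phi).mean()
```

With uniform logits over two actions and $\Phi_t = 1$, the surrogate is simply $\ln 0.5$, since every sampled action has probability one half.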
Modern implementations of PPO make the choice of $\Phi_t = A^\pi(s_t, a_t)$, the advantage function. This function estimates the advantage of a particular action in a given state over the expected value of following the policy, i.e. how much better is taking this action in this state than acting according to the policy on average? Briefly, the advantage function takes the form
$$A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$$
where $V(s)$ is the state-value function, and $Q(s,a)$ is the state-action value function, or Q-function [9]. I’ve found it easier to intuit the nuances of PPO by following the narrative around its motivations and its predecessor. PPO iterates on the Trust Region Policy Optimization (TRPO) method, which constrains the objective function with respect to the size of the policy update. The TRPO objective function is defined as [10][11]:
$$J(\theta) = \mathbb{E}\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, A_t\right] \quad \text{subject to} \quad \mathbb{E}\left[\mathrm{KL}\left(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta\right)\right] \leq \delta$$
where KL is the Kullback-Leibler divergence (a measure of distance between two probability distributions), and the size of the policy update is defined as the ratio between the new policy and the old policy:
$$r(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$
Policy gradient methods optimise policies through (ideally small) iterative gradient updates to parameters θ. The old policy, $\pi_{\theta_{\text{old}}}(a_t \mid s_t)$, is the one used to generate the current trajectory, and the new policy, $\pi_\theta(a_t \mid s_t)$, is the policy currently being optimised [12]. If the advantage is positive, then the new policy becomes greedier relative to the old policy [13], and we have $r(\theta) > 1$; the inverse applies for negative advantage, where we have $r(\theta) < 1$. The core principle of TRPO (and PPO) is to prevent the large parameter updates which occur due to variance in environment and training dynamics, thus helping to stabilise optimisation. Updating using the ratio between old and new policies in this way allows for selective reinforcement or penalisation of actions whilst grounding updates relative to the original, stable policy [14].
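In implementations, this ratio is usually computed from stored log-probabilities rather than raw probabilities, since log densities are what policy networks output and the subtraction avoids underflow. A small sketch (numpy; the function name is my own):

```python
import numpy as np

def policy_ratio(logp_new, logp_old):
    # r(theta) = pi_new / pi_old, computed from stored log-probabilities;
    # working in log space avoids numerical underflow for small probabilities.
    return np.exp(logp_new - logp_old)
```

At the start of a series of update steps the two policies coincide, so the ratio is exactly 1, matching the note above.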
PPO
PPO simplifies the TRPO objective by replacing the explicit KL-divergence constraint with a clipping term inside the objective function itself:
$$J(\theta) = \mathbb{E}\left[\min\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, A_t,\; \mathrm{clip}\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},\, 1-\epsilon,\, 1+\epsilon\right) A_t\right)\right]$$
To break this somewhat dense equation down, we can substitute our earlier expression r(θ) in:
$$J(\theta) = \mathbb{E}\left[\min\left(r(\theta)\, A_t,\; \mathrm{clip}\left(r(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A_t\right)\right]$$
The clip term constrains the policy ratio, r, to lie between 1−ϵ and 1+ϵ [15]. In this manner, the objective function disincentivises greedy policy updates in the direction of improving the objective. When the advantage is positive and the new policy places higher density on an action than the old policy, i.e. is greedier, the contribution is capped at $(1+\epsilon)A_t$. When the advantage is negative and the new policy places lower density on the action, i.e. is more conservative, the contribution is capped at $(1-\epsilon)A_t$. Taking the minimum of the clipped and unclipped terms is why PPO is considered to place a pessimistic lower bound on the unclipped objective: the ratio only contributes up to $1+\epsilon$ or down to $1-\epsilon$, depending on whether the advantage function is positive or negative, respectively.
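A minimal numpy sketch of the clipped surrogate (the function name and signature are mine, not from the paper):

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    # Per-sample surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A),
    # averaged over the batch.
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.minimum(unclipped, clipped).mean()
```

For example, with ϵ=0.2, a ratio of 2.0 and advantage +1 contributes only 1.2, and a ratio of 0.5 with advantage −1 contributes −0.8: in both cases the minimum selects the pessimistic term, and the gradient through the clipped branch is zero.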
So far, we’ve introduced the concept of policy gradients, objective functions in RL, and the core ideas behind PPO. Reinforcement learning algorithms place significant emphasis on reducing variance during optimisation [16]. This becomes apparent in estimating the advantage function, which relies on rewards obtained during a trajectory. Practical implementations of policy gradient algorithms take additional steps to trade variance for bias here by also estimating an on-policy state-value function $V^\pi_\phi(s_t)$, the expected return an agent receives from starting in state $s_t$ and following policy π thereafter. Jointly learning a value function and policy function in this way is known as the Actor-Critic framework [17].
Value Learning and Actor-Critics
Value-function learning involves approximating the future rewards of following a policy from the current state. The value function is learned alongside the policy, and the simplest method is to minimise a mean-squared-error objective against the discounted return [18], $R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t$:
$$\phi^* = \arg\min_\phi \; \mathbb{E}_{s_t, R_t \sim \pi}\left[\left(V_\phi(s_t) - R_t\right)^2\right]$$
We can now use this to learn a lower-variance estimator of the advantage function [19] :
$$\hat{A}^\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}}\left[Q^\pi(s_t, a_t) - V^\pi_\phi(s_t)\right] = \mathbb{E}_{s_{t+1}}\left[r_t + \gamma V^\pi_\phi(s_{t+1}) - V^\pi_\phi(s_t)\right]$$
We end up with an estimator of the advantage function that relies solely on sampled rewards and our learned approximate value function. In fact, our expression shows that the advantage function can be estimated with the temporal-difference [20] residual error, $\delta_t$, of the value function:
$$V_\phi(s_t) \leftarrow V_\phi(s_t) + \alpha\left(r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\right)$$
$$V_\phi(s_t) \leftarrow V_\phi(s_t) + \alpha\,\delta_t \quad \text{where} \quad \delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$
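The TD residual is straightforward to compute from a stored trajectory. A sketch (numpy; my own function name), assuming we store one extra value estimate so the final step can bootstrap:

```python
import numpy as np

def td_residuals(rewards, values, gamma=0.99):
    # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t); `values` holds one extra
    # entry, the value estimate of the state after the last step.
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    return rewards + gamma * values[1:] - values[:-1]
```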
Generalised Advantage Estimation (GAE)
There’s one more thing we can do to reduce variance. The current advantage estimator estimates reward from samples collected along a single trajectory. However, there is a huge amount of variance in the possible trajectories that may evolve from a given state. These trajectories may look similar in the short term (except for policies early in the optimisation process, which are far more random), but longer-term rewards can vary significantly. We could consider a lower-variance estimate of the advantage by only sampling rewards near to the present, and replacing longer-term reward estimates with our lower-variance, biased value function estimates. This is the central idea behind n-step returns [21]. If we choose n correctly, i.e. before trajectories start to diverge significantly, we can obtain better samples of near-term rewards.
Consider the term δt in our estimation of the advantage function. We take the initial reward observed from the environment, rt, then bootstrap future estimated discounted rewards, and subtract a baseline estimated value function for the state [22]. For a single time-step, this can be denoted as:
$$\hat{A}_t^{(1)} = r_t + \gamma V^\pi_\phi(s_{t+1}) - V^\pi_\phi(s_t) = \delta_t^{V_\phi}$$
What if we sample rewards from multiple timesteps, and then estimate the future discounted rewards from then on? Let’s denote ^At(k) as follows:
$$\begin{aligned}
\hat{A}_t^{(1)} &= r_t + \gamma V^\pi_\phi(s_{t+1}) - V^\pi_\phi(s_t) = \delta_t^{V_\phi} \\
\hat{A}_t^{(2)} &= r_t + \gamma r_{t+1} + \gamma^2 V^\pi_\phi(s_{t+2}) - V^\pi_\phi(s_t) = \delta_t^{V_\phi} + \gamma\,\delta_{t+1}^{V_\phi} \\
\hat{A}_t^{(k)} &= r_t + \gamma r_{t+1} + \dots + \gamma^{k-1} r_{t+k-1} + \gamma^k V^\pi_\phi(s_{t+k}) - V^\pi_\phi(s_t) = \sum_{l=0}^{k-1} \gamma^l\,\delta_{t+l}^{V_\phi} \\
\hat{A}_t^{(\infty)} &= r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots - V^\pi_\phi(s_t) = \sum_{l=0}^{\infty} \gamma^l\,\delta_{t+l}^{V_\phi}
\end{aligned}$$
k=1 and k=∞ now become the extreme cases of n-step return estimation. Observe that for k=∞ we recover an unbiased, high-variance expectation of the infinite-horizon discounted return, minus our baseline value-function estimate. GAE [23] introduces a discount factor, λ∈[0,1], to take an exponentially weighted average over every k-th step estimator. Using notation from the paper [24], we can derive a generalised advantage estimator for cases where 0<λ<1:
$$\begin{aligned}
\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} &= (1-\lambda)\left(\hat{A}_t^{(1)} + \lambda \hat{A}_t^{(2)} + \lambda^2 \hat{A}_t^{(3)} + \dots\right) \\
&= (1-\lambda)\left(\delta_t^{V_\phi} + \lambda\left(\delta_t^{V_\phi} + \gamma\,\delta_{t+1}^{V_\phi}\right) + \lambda^2\left(\delta_t^{V_\phi} + \gamma\,\delta_{t+1}^{V_\phi} + \gamma^2\,\delta_{t+2}^{V_\phi}\right) + \dots\right) \\
&= (1-\lambda)\left(\delta_t^{V_\phi}\left(1 + \lambda + \lambda^2 + \dots\right) + \gamma\,\delta_{t+1}^{V_\phi}\left(\lambda + \lambda^2 + \lambda^3 + \dots\right) + \dots\right) \\
&= (1-\lambda)\left(\delta_t^{V_\phi}\,\frac{1}{1-\lambda} + \gamma\,\delta_{t+1}^{V_\phi}\,\frac{\lambda}{1-\lambda} + \gamma^2\,\delta_{t+2}^{V_\phi}\,\frac{\lambda^2}{1-\lambda} + \dots\right) \\
&= \sum_{l=0}^{\infty} (\gamma\lambda)^l\,\delta_{t+l}^{V_\phi}
\end{aligned}$$
Great. As you may have noticed, there are two special cases of this expression: λ=0 and λ=1.
$$\hat{A}_t^{\mathrm{GAE}(\gamma,0)} = \hat{A}_t^{(1)} = r_t + \gamma V^\pi_\phi(s_{t+1}) - V^\pi_\phi(s_t) = \delta_t^{V_\phi}$$
$$\begin{aligned}
\hat{A}_t^{\mathrm{GAE}(\gamma,1)} &= \sum_{l=0}^{\infty} \gamma^l\,\delta_{t+l}^{V_\phi} \\
&= \left(r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\right) + \gamma\left(r_{t+1} + \gamma V_\phi(s_{t+2}) - V_\phi(s_{t+1})\right) + \gamma^2\left(r_{t+2} + \gamma V_\phi(s_{t+3}) - V_\phi(s_{t+2})\right) + \dots \\
&= \sum_{l=0}^{\infty} \gamma^l\,r_{t+l} - V_\phi(s_t)
\end{aligned}$$
We see that λ=0 recovers the original biased, low-variance actor-critic advantage estimator. For λ=1, we obtain a low-bias, high-variance advantage estimator, which is simply the discounted return minus our baseline value-function estimate. For λ∈(0,1), we obtain an advantage estimator which allows control of the bias-variance tradeoff. The authors note that the two parameters, γ and λ, control variance in different ways. Lower values of γ discount future rewards, which will always result in a biased estimate of the return, since there’s an implicit prior that future rewards are less valuable. The authors note that γ controls the strength of this prior regardless of how accurate our value function is. On the other hand, since n-step returns are unbiased estimators of the advantage function when the value function is exact, lower values of λ reduce variance whilst introducing bias only insofar as the value function is inaccurate. In other words, when $V_\phi(s_t) = V^\pi(s_t)$, we obtain an unbiased estimator of the advantage function for any λ∈(0,1).
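In practice the GAE sum is computed with a single backward pass over a finite trajectory, using the recursion $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$, which is equivalent to the truncated sum of discounted TD residuals. A sketch (numpy; my own function name, assuming one bootstrap value is appended to the stored values):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    # Backward recursion A_t = delta_t + gamma * lam * A_{t+1}; `values`
    # has length len(rewards) + 1 so the final step can bootstrap.
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    deltas = rewards + gamma * values[1:] - values[:-1]
    adv = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv
```

Setting lam=0 returns the one-step TD residuals, and setting lam=1 (with gamma=1) returns the undiscounted rewards-to-go minus the baseline, matching the two special cases above.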
Pseudocode
Tying everything together, we can show the general process of updating our policy and value function using PPO for a single trajectory of fixed length N:
Given policy estimator $\pi_\theta$, value function estimator $V_\phi$, parameters γ and λ, and N time-steps:
For t=1,2,...,N: run policy $\pi_\theta$ in the environment and collect rewards, observations, actions, and value function estimates $r_t, s_t, a_t, v_t$, where $v_t = V_\phi(s_t)$.
Compute $\delta_t^{V_\phi} = r_t + \gamma v_{t+1} - v_t$ for all t.
Compute the generalised advantage estimates, $\hat{A}_t = \sum_{l=0}^{N-t} (\gamma\lambda)^l\,\delta_{t+l}^{V_\phi}$, for all t.
Compute $\pi_\theta(a_t \mid s_t)$ log-probabilities from the stored actions and states.
Optimise θ against the clipped surrogate objective $J(\theta)$ using the stored log-probabilities and advantage estimates [25].
Optimise ϕ using $L = (V_\phi(s_t) - R_t)^2$ using the stored $r_t$ and $s_t$ [26].
… repeat!
Feedback and corrections are very welcome and appreciated. I’ll follow this post with implementation details soon. For now, I’d recommend the excellent resource by Shengyi Huang on reproducing PPO implementations from popular RL libraries. If you’re able to implement the policy update from the PPO paper, hopefully there’s enough detail here that you’re able to reproduce other implementations of PPO. Thank you so much for reading.
Procgen, Karl Cobbe et al.
Atari, OpenAI Gymnasium
A (Long) Peek into Reinforcement Learning, Lilian Weng
Artificial Intelligence: A Modern Approach, Stuart Russell and Peter Norvig
Reinforcement Learning: An Introduction, Sutton and Barto
CS285 at UC Berkeley, Deep Reinforcement Learning, Sergey Levine
Lilian Weng’s list of RL notation is very useful here
The return represents the sum of rewards achieved over some time frame. This can be over a fixed timescale, i.e. the finite-horizon return, or over all time, i.e. the infinite-horizon return.
A trajectory, τ, (also known as an episode or rollout) describes a sequence of interactions between the agent and the environment.
Policy Gradient Algorithms, Lilian Weng
Policy Gradients, CS285 UC Berkeley, Lecture 5, Sergey Levine
Reinforcement Learning: An Introduction, 13.3 REINFORCE: Monte Carlo Policy Gradient, Sutton and Barto
Section 2, High Dimensional Continuous Control Using Generalized Advantage Estimation, Schulman et al., 2016
The value function, V(s), describes the expected return from starting in state s. Similarly the state-action value function, Q(s,a), describes the expected return from starting in state s and taking action a. See also Reinforcement Learning: An Introduction, 3.7 Value Functions, Sutton and Barto
I’m omitting ln from lnπ(at|st) for brevity from here on.
See Proximal Policy Optimization Algorithms, Section 2.2, Schulman et al. and Trust Region Policy Optimization, Schulman et al. for further details on the constraint.
Note: at the start of a series of policy update steps, we have πθold(at,st)=πθ(at,st), so r(θ)=1.
The new policy will place higher density on actions relative to the old policy, i.e. πθ(at,st)>πθold(at,st).
Consider optimising our policy using eqn. 1 - without normalising the update w.r.t. the old policy, updates to the policy can lead to catastrophically large updates.
ϵ is usually set to ∼0.1.
Stochasticity in environment dynamics, delayed rewards, and exploration-exploitation tradeoffs all contribute to unstable training.
Reinforcement Learning: An Introduction, 6.6 Actor-Critic Methods, Sutton and Barto
A (Long) Peek into Reinforcement Learning, Value Function, Lilian Weng
The discounted return down-weights rewards obtained in the future by an exponential discount factor γt, i.e. rewards in the distant future aren’t worth as much as near-term rewards.
The on-policy Q-function is defined as the expected reward of taking action $a_t$ in state $s_t$, and following policy π thereafter: $Q^\pi(s_t, a_t) = \mathbb{E}\left[r_t + \gamma V^\pi(s_{t+1})\right]$
Reinforcement Learning: An Introduction, 6. Temporal-Difference Learning, Sutton and Barto
DeepMimic Supplementary A, Peng et al.
Reinforcement Learning: An Introduction, 7.1 n-step TD Prediction, Sutton and Barto
Daniel Takeshi’s post on using baselines to reduce variance of gradient estimates is useful here.
High Dimensional Continuous Control Using Generalized Advantage Estimation, Schulman et al., 2016
Notes on the Generalized Advantage Estimation Paper, Daniel Takeshi
The identity $\frac{1}{1-k} = 1 + k + k^2 + k^3 + \dots$ for $|k| < 1$ is useful to note here.
This is usually done over M minibatch steps for M≤N. πθold is fixed as the initial policy at the start of the trajectory, and πθ is taken as the policy at the current optimisation step.
Similarly to the policy optimisation step, this is also done over M steps. $V_\phi$ is taken as the value function at the current optimisation step, i.e. value estimates are bootstrapped.