# Proximal Policy Optimization (PPO)

# From On-Policy to Off-Policy

• On-policy: the agent interacting with the environment is the same agent we are training; it learns while it interacts.
• Off-policy: the agent interacting with the environment is not the agent we are training; the agent learns by watching another agent act.

$\nabla \overline{R}_{\theta}=E_{\tau \sim p_{\theta}(\tau)}\left[R(\tau) \nabla \log p_{\theta}(\tau)\right]$

• Goal: use samples drawn from $\pi_{\theta'}$ to train $\theta$. Since $\theta'$ is fixed, the sampled data can be reused many times.

# Importance Sampling

$E_{x \sim p}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f\left(x^{i}\right), \quad x^{i} \sim p$

When we can only draw samples from a different distribution $q$, the same expectation can still be estimated by reweighting each sample:

$E_{x \sim p}[f(x)] = \int f(x) p(x) d x=\int f(x) \frac{p(x)}{q(x)} q(x) d x=E_{x\sim q}\left[f(x) \frac{p(x)}{q(x)}\right]$
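The identity above can be checked numerically. A minimal sketch, assuming a hypothetical target $p=\mathcal{N}(0,1)$, proposal $q=\mathcal{N}(0,2)$, and $f(x)=x^2$, so the true expectation is 1:

```python
import math
import random

random.seed(0)

# Target p = N(0, 1) and proposal q = N(0, 2): hypothetical choices for illustration.
def pdf_normal(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def f(x):
    return x * x  # E_{x~N(0,1)}[x^2] = 1

N = 200_000
# Draw from q, then reweight each sample by p(x)/q(x).
samples = [random.gauss(0.0, 2.0) for _ in range(N)]
estimate = sum(f(x) * pdf_normal(x, 0.0, 1.0) / pdf_normal(x, 0.0, 2.0) for x in samples) / N
print(estimate)  # close to 1
```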

$Var[ X ] = E[ X^2 ] - (E[ X ])^2$

$Var_{x \sim p}[f(x)]=E_{x \sim p}\left[f(x)^{2}\right]-\left(E_{x \sim p}[f(x)]\right)^{2}$

$Var_{x \sim q}\left[f(x) \frac{p(x)}{q(x)}\right]=E_{x \sim q}\left[\left(f(x) \frac{p(x)}{q(x)}\right)^{2}\right]-\left(E_{x \sim q}\left[f(x) \frac{p(x)}{q(x)}\right]\right)^{2}$

$=E_{x \sim p}[f(x)^{2} \frac{p(x)}{q(x)}]-\left(E_{x \sim p}[f(x)]\right)^{2}$

The first term can be rewritten as an expectation under $p$:

$\mathbb{E}_{x\sim q}\left[f^2(x)\frac{p^2(x)}{q^2(x)}\right] = \int f^2(x)\frac{p^2(x)}{q^2(x)}q(x)\,dx = \int f^2(x)\frac{p^2(x)}{q(x)}\,dx = \int f^2(x)\frac{p(x)}{q(x)}p(x)\,dx = \mathbb{E}_{x\sim p}\left[f^2(x)\frac{p(x)}{q(x)}\right]$

The extra factor $\frac{p(x)}{q(x)}$ means the variance of the importance-weighted estimator can be much larger than the original whenever $p$ and $q$ differ substantially, so $q$ should stay close to $p$.
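The variance blow-up when $q$ drifts away from $p$ can be seen numerically. A small sketch with hypothetical Gaussians, using $f(x)=x$ and comparing a matched proposal against a shifted one:

```python
import math
import random

random.seed(1)

def pdf_normal(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def weighted_terms(mu_q, n=100_000):
    # Per-sample terms f(x) * p(x)/q(x) with f(x) = x, target p = N(0,1), proposal q = N(mu_q, 1).
    out = []
    for _ in range(n):
        x = random.gauss(mu_q, 1.0)
        out.append(x * pdf_normal(x, 0.0, 1.0) / pdf_normal(x, mu_q, 1.0))
    return out

def variance(ts):
    m = sum(ts) / len(ts)
    return sum((t - m) ** 2 for t in ts) / len(ts)

var_matched = variance(weighted_terms(0.0))  # q == p: variance of f under p
var_shifted = variance(weighted_terms(2.0))  # q far from p: variance explodes
```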

# Turning On-Policy into Off-Policy

$\nabla \overline{R}_{\theta}=E_{\tau \sim p_{\theta}(\tau)}\left[R(\tau) \nabla \log p_{\theta}(\tau)\right]$

$\nabla \overline{R}_{\theta}=E_{\tau \sim p_{\theta^{\prime}}(\tau)}\left[\frac{p_{\theta}(\tau)}{p_{\theta^{\prime}}(\tau)} R(\tau) \nabla \log p_{\theta}(\tau)\right]$

• Sample data using $\theta'$.
• Use the sampled data to update $\theta$ many times.

$\nabla \overline{R}_{\theta} \approx E_{\left(s_{t}, a_{t}\right) \sim \pi_{\theta}}\left[A^{\theta}\left(s_{t}, a_{t}\right) \nabla \log p_{\theta}\left(a_{t} | s_{t}\right)\right]$

$=E_{\left(s_{t}, a_{t}\right) \sim \pi_{\theta^{\prime}}}\left[\frac{p_{\theta}\left(s_{t}, a_{t}\right)}{p_{\theta^{\prime}}\left(s_{t}, a_{t}\right)} A^{\theta^{\prime}}\left(s_{t}, a_{t}\right) \nabla \log p_{\theta}\left(a_{t} | s_{t}\right)\right]$

$=E_{\left(s_{t}, a_{t}\right) \sim \pi_{\theta^{\prime}}}\left[\frac{p_{\theta}\left(a_{t} | s_{t}\right)}{p_{\theta^{\prime}}\left(a_{t} | s_{t}\right)} \frac{p_{\theta}\left(s_{t}\right)}{p_{\theta^{\prime}}\left(s_{t}\right)} A^{\theta^{\prime}}\left(s_{t}, a_{t}\right) \nabla \log p_{\theta}\left(a_{t} | s_{t}\right)\right]$

• The states an agent encounters arguably do not depend much on which policy is acting, so we assume $p_{\theta}(s_t) \approx p_{\theta'}(s_t)$.
• This state-distribution ratio is also hard to compute, so we simply drop it. In contrast, $p_{\theta}(a_t|s_t)$ and $p_{\theta'}(a_t|s_t)$ are easy to obtain: they are just the outputs of the two policy networks $\theta$ and $\theta'$.

Using the identity $\nabla f(x)=f(x) \nabla \log f(x)$, the gradient above can be integrated back into an objective function:

$J^{\theta^{\prime}}(\theta)=E_{\left(s_{t}, a_{t}\right) \sim \pi_{\theta^{\prime}}}\left[\frac{p_{\theta}\left(a_{t} | s_{t}\right)}{p_{\theta^{\prime}}\left(a_{t} | s_{t}\right)} A^{\theta^{\prime}}\left(s_{t}, a_{t}\right)\right]$
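As a sketch of how $J^{\theta'}(\theta)$ is estimated from samples, assuming a toy 3-action softmax policy with made-up logits, actions, and advantages:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Made-up logits for a 3-action policy: theta (being trained) and theta' (the sampler).
pi_theta = softmax([0.2, 0.5, -0.1])
pi_old   = softmax([0.0, 0.4, 0.1])

# Made-up batch of (action, advantage) pairs collected by theta'.
batch = [(1, 0.8), (0, -0.3), (2, 1.2)]

# J^{theta'}(theta): average of importance ratio times advantage over the batch.
surrogate = sum((pi_theta[a] / pi_old[a]) * adv for a, adv in batch) / len(batch)
```

In a real implementation the ratio is computed from log-probabilities of a neural network and differentiated with respect to $\theta$; the scalar arithmetic here only illustrates the estimator itself.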

$J_{P P O}^{\theta^{\prime}}(\theta)=J^{\theta^{\prime}}(\theta)-\beta KL\left(\theta, \theta^{\prime}\right)$

# PPO Algorithm

• Initialize the policy parameters $\theta^{0}$.
• In each iteration $k$:
• Use $\theta^{k}$ to interact with the environment, collect $\{s_t, a_t\}$, and compute the advantage $A^{\theta^{k}}(s_t, a_t)$.
• Find the $\theta$ that improves $J_{PPO}(\theta)$; this step updates $\theta$ many times.
• $J_{P P O}^{\theta^{k}}(\theta)=J^{\theta^{k}}(\theta)-\beta K L\left(\theta, \theta^{k}\right)$

• $J^{\theta^{k}}(\theta) \approx \sum_{\left(s_{t}, a_{t}\right)} \frac{p_{\theta}\left(a_{t} | s_{t}\right)}{p_{\theta^{k}}\left(a_{t} | s_{t}\right)} A^{\theta^{k}}\left(s_{t}, a_{t}\right)$

• $KL(\theta,\theta^k)$ here is the KL divergence between the action distributions $p_{\theta}(\cdot|s_t)$ and $p_{\theta^k}(\cdot|s_t)$, averaged over the visited states; it measures how the policies behave, not the distance between the parameter vectors themselves.

• Dynamically adjust the penalty weight $\beta$, known as the Adaptive KL Penalty:
• After updating $\theta$, if $KL\left(\theta, \theta^{k}\right)>KL_{\max}$, the penalty term is not constraining enough, so increase $\beta$.
• After updating $\theta$, if $KL\left(\theta, \theta^{k}\right)<KL_{\min}$, the penalty term is constraining too much, so decrease $\beta$.
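The adaptive rule above can be sketched as a small update function. The 1.5 tolerance band and the doubling/halving factor are common heuristic choices, not values specified in these notes:

```python
def update_beta(beta, kl, kl_target, factor=2.0):
    # Accept KL values within a band of [kl_target / 1.5, 1.5 * kl_target] (heuristic).
    kl_max = 1.5 * kl_target
    kl_min = kl_target / 1.5
    if kl > kl_max:
        return beta * factor  # constraint too weak: penalize divergence harder
    if kl < kl_min:
        return beta / factor  # constraint too strong: relax the penalty
    return beta
```

For example, with a KL target of 0.1, an observed KL of 0.2 doubles $\beta$, while an observed KL of 0.05 halves it.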

# Clip Function

$\text{clip}(a, b, c)=\left\{\begin{array}{ll}{b,} & {\text { if } a<b} \\ {c,} & {\text { if } a>c} \\ {a,} & {\text { otherwise }}\end{array}\right.$
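A direct implementation of this definition:

```python
def clip(a, b, c):
    # Limit a to the interval [b, c].
    if a < b:
        return b
    if a > c:
        return c
    return a
```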