# Why Is the Reward Obtained by GAIL Completely Useless?

The GAIL algorithm ultimately trains two networks:

  1. $\pi_\theta^{*}$: the policy network

  2. $D^{*}$: the discriminator network
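
For concreteness, here is a minimal PyTorch-style sketch of what these two networks might look like (the MLP architecture, hidden width, and the `state_dim`/`action_dim` names are illustrative assumptions, not anything prescribed by GAIL):

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """pi_theta: maps a state to a distribution over (discrete) actions."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.net(state))

class Discriminator(nn.Module):
    """D: maps a state-action pair x = (s, a) to a probability in (0, 1).
    For discrete actions, a is passed in one-hot encoded."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return torch.sigmoid(self.net(torch.cat([state, action], dim=-1)))
```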

The objective function of GAIL is:

$$L(D,\pi_{\theta}) = \mathbb{E}_{\pi_{E}} \left[ \log D(x) \right] + \mathbb{E}_{\pi_{\theta}}\left[ \log(1-D(x)) \right]$$

When training $D$, we fix $\pi_{\theta}$ and maximize $L(D,\pi_{\theta})$.
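
In loss form, maximizing $L$ over $D$ is ordinary binary cross-entropy with expert pairs labeled 1 and policy pairs labeled 0. A minimal sketch, reusing the assumed `Discriminator` interface from above:

```python
import torch

def discriminator_loss(disc, expert_s, expert_a, policy_s, policy_a):
    """Negative of L(D, pi_theta): minimizing this maximizes
    E_{pi_E}[log D(x)] + E_{pi_theta}[log(1 - D(x))]."""
    eps = 1e-8  # keeps the logs finite when D saturates at 0 or 1
    d_expert = disc(expert_s, expert_a)  # D(x) on expert (s, a) pairs
    d_policy = disc(policy_s, policy_a)  # D(x) on policy (s, a) pairs
    return -(torch.log(d_expert + eps).mean()
             + torch.log(1.0 - d_policy + eps).mean())
```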

The contribution of a single pair $x=(s,a)$ to $L(D,\pi_{\theta})$ is:

$$\pi_{E}(x) \log D(x) + \pi_{\theta}(x)\log(1-D(x))$$

Setting the first derivative of this expression with respect to $D$ to zero gives:

$$D^*(x) = \frac{\pi_{E}(x)}{\pi_{E}(x) + \pi_{\theta}(x)}$$
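
Spelling the calculus out: write $a=\pi_E(x)$ and $b=\pi_\theta(x)$, so the per-$x$ contribution is $f(D)=a\log D + b\log(1-D)$. Then

$$f'(D) = \frac{a}{D} - \frac{b}{1-D} = 0 \;\Longrightarrow\; a(1-D) = bD \;\Longrightarrow\; D = \frac{a}{a+b},$$

and since $f''(D) = -\frac{a}{D^2} - \frac{b}{(1-D)^2} < 0$, this critical point is indeed a maximum.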

Substituting this back into $L(D,\pi_{\theta})$ gives:

$$L(D^*, \pi_{\theta}) = 2\,JSD(\pi_{E} \| \pi_{\theta}) - 2\log 2$$
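
This identity is easy to check numerically on small discrete distributions. A quick sketch (the two example distributions below are arbitrary stand-ins, chosen only for illustration):

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

pi_E = np.array([0.6, 0.3, 0.1])      # stand-in for the expert distribution
pi_theta = np.array([0.2, 0.5, 0.3])  # stand-in for the policy distribution

d_star = pi_E / (pi_E + pi_theta)     # optimal discriminator D*(x)
L = np.sum(pi_E * np.log(d_star)) + np.sum(pi_theta * np.log(1.0 - d_star))

print(L)                                        # L(D*, pi_theta)
print(2 * jsd(pi_E, pi_theta) - 2 * np.log(2))  # identical value
```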

When training $\pi_{\theta}$, we fix $D$ and minimize $L(D,\pi_{\theta})$.
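
Since the samples $x=(s,a)$ come from rolling out $\pi_\theta$, this minimization cannot backpropagate through $D$ directly; the original GAIL paper takes a TRPO step instead. A simpler REINFORCE-style sketch under the assumed networks above, using $\log(1-D(s,a))$ as the per-sample cost:

```python
import torch
import torch.nn.functional as F

def policy_step(policy, disc, states, num_actions, optimizer):
    """With D fixed, take one gradient step that decreases
    E_{pi_theta}[log(1 - D(s, a))] via the score-function estimator."""
    dist = policy(states)                 # Categorical over actions
    actions = dist.sample()
    with torch.no_grad():                 # D is held fixed here
        a_onehot = F.one_hot(actions, num_actions).float()
        cost = torch.log(1.0 - disc(states, a_onehot) + 1e-8).squeeze(-1)
    surrogate = (dist.log_prob(actions) * cost).mean()
    optimizer.zero_grad()
    surrogate.backward()                  # REINFORCE gradient estimate
    optimizer.step()
```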

When $D=D^*$, $L(D^*, \pi_{\theta})$ is a constant, so changing $\theta$ produces no change in $L$. But updating $\theta$ requires taking the gradient of $L$, and the gradient of a constant is 0. So $D^*$ gives $\pi$ no feedback whatsoever, $\theta$ never changes, and the generator can never be trained.

Why is it a constant?

Theorem 2.3 in the paper "Towards Principled Methods for Training Generative Adversarial Networks":

Let $\mathbb{P}_r$ and $\mathbb{P}_g$ be two distributions whose support lies in two manifolds $\mathcal{M}$ and $\mathcal{P}$ that don’t have full dimension and don’t perfectly align. We further assume that $\mathbb{P}_r$ and $\mathbb{P}_g$ are continuous in their respective manifolds. Then,

  1. $JSD(\mathbb{P}_r \| \mathbb{P}_g) = \log 2$
  2. $KL(\mathbb{P}_r \| \mathbb{P}_g) = +\infty$
  3. $KL(\mathbb{P}_g \| \mathbb{P}_r) = +\infty$

Substituting point 1 into the expression above gives $L(D^*, \pi_{\theta}) = 2\log 2 - 2\log 2 = 0$ no matter what $\theta$ is, which is exactly the constant claimed above.
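
A quick numeric check of this degenerate case (the disjoint-support distributions below are assumptions chosen purely for illustration):

```python
import numpy as np

def jsd(p, q):
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0  # 0 * log 0 is taken as 0, the usual KL convention
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Disjoint supports: pi_E lives on the first two states, pi_theta on
# the last two, no matter how theta changes.
pi_E = np.array([0.7, 0.3, 0.0, 0.0])
for pi_theta in (np.array([0.0, 0.0, 0.5, 0.5]),
                 np.array([0.0, 0.0, 0.9, 0.1])):
    print(jsd(pi_E, pi_theta))                      # always log 2
    print(2 * jsd(pi_E, pi_theta) - 2 * np.log(2))  # L(D*, .) = 0
```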

Summary:

  1. When $D=D^*$, $D$ cannot give $\pi$ any useful feedback.