# Maximum Entropy (MaxEnt) Inverse Reinforcement Learning

## Motivation

The MaxEnt IRL algorithm addresses the following problem:

• There are many optimal policies that can explain a set of demonstrations.

## Approach

MaxEnt IRL handles this as follows:

• Maximize the log likelihood of the demonstrations.

• The reward of a trajectory is expressed as a linear combination of feature counts.

### Feature counts

Feature counts refer to the following expression:

$\mathbf{f}_{\tau}=\sum_{s_{j} \in \tau} \mathbf{f}_{s_{j}}$

$\text{reward} \left(\tau\right)=\theta^{\top} \mathbf{f}_{\tau}=\sum_{s_{j} \in \tau} \theta^{\top} \mathbf{f}_{s_{j}}$
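As a concrete illustration of these two formulas (the states, features, and weights below are hypothetical, not from the original), feature counts and the linear reward can be computed as:

```python
import numpy as np

# Hypothetical per-state feature vectors f_s (4 states, 2 features each).
state_features = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [0.5, 0.5],
])

# A trajectory is a sequence of visited state indices.
trajectory = [0, 2, 3]

# Feature counts: f_tau = sum of f_s over the states in the trajectory.
f_tau = state_features[trajectory].sum(axis=0)  # -> [2.5, 1.5]

# Linear reward: reward(tau) = theta^T f_tau, with an assumed weight vector theta.
theta = np.array([1.0, -1.0])
reward = theta @ f_tau  # -> 1.0
```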

## Constraints

1. Feature matching

$\sum_{\tau} p(\tau)\, \mathbf{f}_{\tau}=\bar{\mathbf{f}}$

where $\bar{\mathbf{f}}$ is the empirical feature count averaged over the demonstrations. (The feature counts serve as the basis in which the reward is expressed.)

2. Maximize the entropy of the trajectory distribution

$\max_{p} \; -\sum_{\tau} p(\tau) \log p(\tau)$
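Without the feature-matching constraint, this entropy objective alone is maximized by the uniform distribution over trajectories. A quick numeric check (illustrative only):

```python
import numpy as np

def entropy(p):
    """Shannon entropy -sum p log p (terms with p=0 contribute 0)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

uniform = np.full(4, 0.25)            # uniform over 4 trajectories
skewed = np.array([0.7, 0.1, 0.1, 0.1])

# The uniform distribution has strictly higher entropy than any skewed one.
assert entropy(uniform) > entropy(skewed)
```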

## Solving the optimization problem

• $\sum_{\tau} p(\tau)\,\mathbf{f}_{\tau}-\bar{\mathbf{f}} = 0$, where $\bar{\mathbf{f}} = \frac{1}{m}\sum_{i=1}^{m} \mathbf{f}_{\tilde{\tau}_i}$ is the empirical feature count of the $m$ demonstrations.

• $\sum_{\tau} p(\tau) - 1 = 0$

This is a constrained extremum problem, so we form the Lagrangian (minimizing the negative entropy):

$F(p) = \sum_{\tau} p(\tau) \log p(\tau) - \boldsymbol{\lambda}^{\top}\Big(\sum_{\tau} p(\tau)\,\mathbf{f}_{\tau} - \bar{\mathbf{f}}\Big) - \lambda_0\Big(\sum_{\tau} p(\tau) - 1\Big)$

Setting the partial derivative with respect to each $p(\tau)$ to zero:

$\frac{\partial F}{\partial p(\tau)} = \log p(\tau) + 1 - \boldsymbol{\lambda}^{\top}\mathbf{f}_{\tau} - \lambda_0 = 0$

$\log p(\tau) = \boldsymbol{\lambda}^{\top}\mathbf{f}_{\tau} + \lambda_0 - 1$

$p(\tau) = e^{\lambda_0 - 1}\, e^{\boldsymbol{\lambda}^{\top}\mathbf{f}_{\tau}}$

The normalization constraint fixes $\lambda_0$:

$\sum_{\tau} p(\tau) = e^{\lambda_0 - 1} \sum_{\tau} e^{\boldsymbol{\lambda}^{\top}\mathbf{f}_{\tau}} = 1 \quad\Rightarrow\quad e^{\lambda_0 - 1} = \frac{1}{\sum_{\tau} e^{\boldsymbol{\lambda}^{\top}\mathbf{f}_{\tau}}}$

Identifying the multipliers $\boldsymbol{\lambda}$ with the reward weights $\theta$:

$p(\tau) = \frac{e^{\boldsymbol{\lambda}^{\top}\mathbf{f}_{\tau}}}{\sum_{\tau'} e^{\boldsymbol{\lambda}^{\top}\mathbf{f}_{\tau'}}} = \frac{e^{\theta^{\top}\mathbf{f}_{\tau}}}{\sum_{\tau'} e^{\theta^{\top}\mathbf{f}_{\tau'}}}$
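The resulting distribution is a softmax over trajectory rewards. A minimal sketch, assuming a small enumerable set of trajectories with precomputed feature counts (all names and numbers hypothetical):

```python
import numpy as np

def maxent_trajectory_distribution(theta, feature_counts):
    """p(tau) = exp(theta^T f_tau) / Z over an enumerable set of trajectories."""
    rewards = feature_counts @ theta   # theta^T f_tau for each trajectory
    rewards -= rewards.max()           # subtract max for numerical stability
    expr = np.exp(rewards)
    return expr / expr.sum()

# Hypothetical feature counts for three trajectories (2 features each).
F = np.array([[2.0, 1.0],
              [1.0, 2.0],
              [0.0, 0.0]])
theta = np.array([1.0, 0.0])

p = maxent_trajectory_distribution(theta, F)
# Higher-reward trajectories receive exponentially more probability mass.
```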

This distribution $p(\tau)$ is called the maximum entropy distribution:

Principle of Maximum Entropy (Jaynes, 1957): the probability of a demonstrated trajectory is proportional to the exponential of its reward.

$p(\tau) \propto \exp (r(\tau))$

The objective is to find $\theta$ that maximizes the log likelihood of the demonstrated trajectories:

$\theta^{*}=\operatorname{argmax}_{\theta} L(\theta)=\operatorname{argmax}_{\theta} \frac{1}{m} \sum_{\tau_{d} \in D} \log p\left(\tau_{d} \mid \theta\right)$
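For this objective, the standard MaxEnt IRL gradient is $\nabla_{\theta} L(\theta) = \bar{\mathbf{f}} - \mathbb{E}_{p(\tau \mid \theta)}[\mathbf{f}_{\tau}]$: the empirical feature counts minus the expected feature counts under the current model. A minimal gradient-ascent sketch, assuming a small enumerable trajectory set (all data hypothetical):

```python
import numpy as np

def log_likelihood_grad(theta, F_all, f_bar):
    """Gradient of the average log likelihood: f_bar - E_{p(tau|theta)}[f_tau]."""
    r = F_all @ theta
    p = np.exp(r - r.max())      # unnormalized, stabilized
    p /= p.sum()
    return f_bar - p @ F_all     # empirical minus expected feature counts

# Hypothetical setup: three enumerable trajectories; demos all match index 0.
F_all = np.array([[2.0, 1.0],
                  [1.0, 2.0],
                  [0.0, 0.0]])
f_bar = F_all[0]                 # empirical feature counts of the demonstrations

theta = np.zeros(2)
for _ in range(500):             # plain gradient ascent
    theta += 0.1 * log_likelihood_grad(theta, F_all, f_bar)

r = F_all @ theta
p = np.exp(r - r.max())
p /= p.sum()
# p now concentrates on the demonstrated trajectory.
```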