Theta is normalized between -pi and pi. Therefore, the lowest reward is
-(pi^2 + 0.1*8^2 + 0.001*2^2) = -16.2736044 (reached at |theta| = pi, angular speed 8, and torque magnitude 2), and the highest reward is 0.
In essence, the goal is to remain at zero angle (vertical), with the least rotational velocity and the least effort.
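The bounds above can be checked numerically. The following is a minimal sketch, not the environment's own code: the function name pendulum_reward and the modulo-based angle normalization are assumptions made for illustration, while the coefficients 0.1 and 0.001 and the limits 8 and 2 are taken from the expression above.

```python
import math


def pendulum_reward(theta, theta_dot, torque):
    """Reward for one step, following the expression quoted in the text.

    The name and the normalization formula are illustrative assumptions.
    """
    # Wrap the angle into [-pi, pi) before squaring it.
    theta = ((theta + math.pi) % (2 * math.pi)) - math.pi
    return -(theta ** 2 + 0.1 * theta_dot ** 2 + 0.001 * torque ** 2)


# Worst case: pendulum hanging down, maximum speed, maximum torque.
lowest = pendulum_reward(math.pi, 8.0, 2.0)

# Best case: upright, motionless, no torque applied.
highest = pendulum_reward(0.0, 0.0, 0.0)
```

Because every term is squared and negated, the reward can never exceed 0, and it only reaches 0 when the angle, the velocity, and the torque are all zero.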
# Starting State
The angle is sampled uniformly from -pi to pi, and the angular velocity uniformly from -1 to 1.
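A starting state of this kind could be drawn as in the sketch below. It assumes both quantities are sampled uniformly and independently; the helper name sample_start_state and the use of Python's standard random module are illustrative choices, not the environment's implementation.

```python
import math
import random


def sample_start_state(rng):
    """Return (theta, theta_dot) for a new episode.

    Assumption: uniform, independent sampling over the stated ranges.
    """
    theta = rng.uniform(-math.pi, math.pi)  # starting angle
    theta_dot = rng.uniform(-1.0, 1.0)      # starting angular velocity
    return theta, theta_dot


rng = random.Random(0)  # seeded so that runs are repeatable
theta, theta_dot = sample_start_state(rng)
```

Since the angle can start anywhere, most episodes begin away from the upright position and the pendulum has to be swung up before it can be balanced.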
# Episode Termination
There is no specified termination, so an episode continues until the caller stops it. Adding a maximum number of steps might be a good idea.
NOTE: Your environment object could be wrapped by the TimeLimit wrapper, if created using the "gym.make" method. In that case the episode will end after 200 steps.
You should get an "AverageReturn" of around -100 to -150.
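The idea of capping the number of steps can be sketched as follows. This is a hypothetical stand-in, not Gym's TimeLimit wrapper: the StepLimit and NeverEndingEnv classes are invented for illustration, and the four-value return from step (observation, reward, done, info) is assumed to follow the classic Gym convention.

```python
class StepLimit:
    """Hypothetical wrapper that ends an episode after max_steps steps."""

    def __init__(self, env, max_steps=200):
        self.env = env
        self.max_steps = max_steps
        self.elapsed = 0

    def reset(self):
        self.elapsed = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.elapsed += 1
        # Force the episode to end once the step budget is used up.
        if self.elapsed >= self.max_steps:
            done = True
        return obs, reward, done, info


class NeverEndingEnv:
    """Toy environment with no termination condition of its own."""

    def reset(self):
        return 0.0

    def step(self, action):
        return 0.0, -1.0, False, {}


def run_episode(env):
    """Step the environment with a constant action until it reports done."""
    env.reset()
    steps, total, done = 0, 0.0, False
    while not done:
        _, reward, done, _ = env.step(0.0)
        steps += 1
        total += reward
    return steps, total


steps, total = run_episode(StepLimit(NeverEndingEnv(), max_steps=200))
```

Without such a cap, the loop in run_episode would never exit for an environment that never reports done on its own.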