Cross-entropy optimization applied to simulated humanoid stair descent.


In reinforcement learning, you get what you ask for. The reward function here is the velocity of the hip along the horizontal axis (the same setup as in the walking video, except that rollouts are only 30 steps long, so the vector being optimized is 210-dimensional). As you can see, the best way down the stairs in this example is to just jump down!
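For reference, here is a minimal sketch of what the cross-entropy loop might look like in Python. The `rollout` function is a hypothetical stand-in for whatever plays a flat control vector through the simulator and returns the reward; the 30 × 7 = 210 dimensions match the numbers above, while the population size, elite count, and iteration budget are made-up placeholders.

```python
import numpy as np

HORIZON, N_CTRL = 30, 7          # 30 steps x 7 controls = 210 dims
DIM = HORIZON * N_CTRL
POP, N_ELITE, N_ITERS = 64, 8, 100  # assumed hyperparameters

def cem(rollout, rng=np.random.default_rng(0)):
    """Cross-entropy optimization over an open-loop action sequence."""
    mu = np.zeros(DIM)           # mean of the sampling distribution
    sigma = np.ones(DIM)         # per-dimension standard deviation
    for _ in range(N_ITERS):
        # Sample a population of candidate control sequences.
        pop = rng.normal(mu, sigma, size=(POP, DIM))
        # `rollout` (hypothetical) returns total horizontal hip velocity.
        rewards = np.array([rollout(x) for x in pop])
        # Keep the elite fraction and refit the Gaussian to it.
        elite = pop[np.argsort(rewards)[-N_ELITE:]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mu
```

The small constant added to `sigma` just keeps the distribution from collapsing before the iteration budget runs out.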

I’m not sure whether this is a good outcome or not – it may make sense to try a taller flight of stairs and see what happens. It’s also possible to set things up so that reward is accrued each time a step is touched in sequence, but that feels contrived. A less contrived option might be to penalize changes in velocity along the vertical axis to make the descent smoother, as sketched below.
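Here is a sketch of that last idea, with a made-up smoothing weight `lam` and hypothetical arguments for the hip velocities:

```python
# Hypothetical shaped reward: keep the horizontal hip velocity term but
# penalize step-to-step changes in vertical hip velocity, so jumping down
# the whole flight becomes expensive. `lam` trades speed for smoothness.
def shaped_reward(hip_vx, hip_vz, prev_hip_vz, lam=0.5):
    return hip_vx - lam * abs(hip_vz - prev_hip_vz)
```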

