Autonomous reinforcement learning on raw visual input data in a real world application. Lange, Riedmiller, Voigtlander. Neural Networks (IJCNN) 2012

  1. Learn slot car racing from raw camera input
  2. Deep-fitted Q
  3. The slot car problem is interesting in particular because as being rewarded for speed, the best policies are the closest to failure (losing control of the car)
  4. Time resolution at about 0.25s
  5. Camera input is 52×80 = 4160d (greyscale it seems?)
  6. They do pretraining: “The size of the input layer is 52×80 = 4160 neurons, one for each pixel provided by the digital camera. The input layer is followed by two hidden layers with 7×7 convolutional kernels each. The first convolutional layer has the same size as the input layer, whereas the second reduces each dimension by a factor of two, resulting in 1 fourth of the original size. The convolutional layers are followed by seven fully connected layers, each reducing the number of its predecessor by a factor of 2. In its basic version the coding layer consists of 2 neurons.
    Then the symmetric structure expands the coding layer towards the output layer, which shall reproduce the input and accordingly consists of 4160 neurons.”
  7. “Training: Training of deep networks per se is a challenge: the size of the network implies high computational
    effort; the deep layered architecture causes problems with vanishing gradient information. We use a special two-stage training procedure, layer-wise pretraining [8], [29] followed by a fine-tuning phase of the complete network. As the learning rule for both phases, we use Rprop [34], which has the advantage to be very fast and robust against parameter choice at the same time. This is particularly important since one cannot afford to do a vast search for parameters, since training times of those large networks are pretty long.”
  8. Training is done with a set of 7,000 images, with the car moving a constant speed.
  9. “For the layer-wise pretraining, in each stage 200 epochs of Rprop training were performed.”
  10. “Let us emphasis the fundamental importance of having the feature space already partially unfolded after pretraining. A partially unfolded feature space indicates at least some information getting past this bottle-neck layer, although errors in corresponding reconstructions are still large. Only because the autoencoder is able to distinguish at least a few images in its code layer it is possible to calculate meaningful derivatives in
    the finetuning phase that allow to further “pull” the activations in the right directions to further unfold the feature space.”
  11. “Altogether, training the deep encoder network takes about 12 hours on an 8-core CPU with 16 parallel threads.”
  12. Because still images don’t include all stat information (position but not velocity) they need to use some method to allow the other parts of the state information in.  One option is to make the state the previous two frames, the other is to do a Kohonen Map <I assume this is what they do>
  13. “As the spatial resolution is non uniform in the feature space spanned by the deep encoder, a difference in the feature space is not necessarily a consistent measure for the dynamics of the system. Hence, another transformation, a Kohonen map … is introduced to linearize that space and to capture the a priori known topology, in this case a ring.”
    1. <Need to read this part over again>
  14. Use a paricular form of value FA I haven’t heard of before
  15. It seems this VFA may be an averager, not based strictly on an NN
  16. 4 discrete actions
  17. “Learning was done by first collecting a number of ’baseline’ tuples, which was done by driving 3 rounds with the constant safe action. This was followed by an exploration phase using an epsioln-greedy policy with epsilon= 0.1 for another 50 episodes. Then the exploration rate was set to 0 (pure exploitation). This was done until an overall of 130 episodes was finished. After each episode, the cluster-based Fitted-Q was performed until the values did not change any more. Altogether, the overall interaction time with the real system was a bit less than 30 minutes.”

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: