I’m writing on the assumption that we are basically done with Robocode; I don’t think we’ll be able to get results much better than I found in my last experiment. In my opinion they should be just as happy with a good Robocode solution as a good hillcar solution; both are essentially independent of the framework, so I think it’s more of an engineering project than a science project at this point.
That being said, here are the next two things I thought of potentially working on:
1. Yet another robot navigation scheme
We talked about this briefly before I left. I’ve been reading up on Microsoft’s Robotics Studio (and played with it a bit). I am thinking about developing a system that performs classification based on sensor readings (and can therefore do things like path planning based on sensor readings). For example, learning that going forward when the laser rangefinder pointing straight ahead says an object is nearby is probably a bad idea. I could also see doing something similar to Bethany’s work, where the terrain differs from location to location, which should lead to different plans.
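To make the classification idea concrete, here’s a rough sketch in Python. Everything here is made up for illustration (the readings, the labels, and the 1-nearest-neighbor rule are just placeholders for whatever learner we’d actually use):

```python
# Hypothetical sketch: learn "is it safe to move forward?" from
# labeled laser-rangefinder readings, using a 1-nearest-neighbor
# rule. All data below is invented for illustration.

def nearest_label(reading, examples):
    """Return the label of the training example closest to `reading`."""
    return min(examples, key=lambda ex: abs(ex[0] - reading))[1]

# (front_range_in_meters, outcome) pairs collected while driving.
examples = [(0.2, "collision"), (0.4, "collision"),
            (1.5, "clear"), (3.0, "clear")]

print(nearest_label(0.3, examples))  # collision
print(nearest_label(2.0, examples))  # clear
```

A planner could then query this classifier before committing to a forward action, instead of hard-coding a distance threshold.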
2. Transfer learning in MDPs
On the more information-theoretic side, I’ve been reading Cover and Thomas’s information theory book, which discusses the entropy of Markov chains. This also opens up other information-theoretic measures that can be applied to a Markov chain, such as the Kullback-Leibler (KL) divergence.
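As a quick sanity check on the Markov-chain entropy idea, here’s the entropy rate from Cover and Thomas, computed for a toy two-state chain (the transition matrix is arbitrary, just for illustration):

```python
import math

# Sketch: entropy rate of a stationary Markov chain,
#   H = -sum_i mu_i sum_j P[i][j] * log2(P[i][j]),
# where mu is the stationary distribution (Cover & Thomas).
# The chain below is a made-up two-state example.

P = [[0.9, 0.1],
     [0.5, 0.5]]  # row-stochastic transition matrix

# Stationary distribution mu (mu P = mu), via power iteration.
mu = [0.5, 0.5]
for _ in range(1000):
    mu = [sum(mu[i] * P[i][j] for i in range(2)) for j in range(2)]

H = -sum(mu[i] * P[i][j] * math.log2(P[i][j])
         for i in range(2) for j in range(2) if P[i][j] > 0)
print(round(H, 4))  # bits per step
```

For this chain the stationary distribution is (5/6, 1/6), so the entropy rate is a weighted average of the per-row entropies.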
I was considering something along the line of this:
Given an MDP with reward, learn a policy to maximize reward. Then change that MDP in some way (I suppose the structure should remain the same; that is, the set of possible actions from any given state stays the same), and measure the KL divergence between the policy on the new MDP and the policy on the old MDP. I was considering looking at that KL divergence versus the amount of learning required to reach an optimal policy on the new MDP.
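For concreteness, here’s a minimal sketch of the measure I have in mind: the per-state KL divergence between two stochastic policies, averaged over states. The policies below are hand-made tables, not learned; this just pins down the quantity I’d be plotting against learning time:

```python
import math

# Sketch of the proposed measure: mean per-state KL divergence
# between the policies learned on two MDPs that share the same
# state/action structure. Policies here are invented for
# illustration, not learned.

def policy_kl(pi_new, pi_old):
    """Mean over states of KL(pi_new(.|s) || pi_old(.|s)), in bits."""
    total = 0.0
    for s in pi_new:
        total += sum(p * math.log2(p / pi_old[s][a])
                     for a, p in pi_new[s].items() if p > 0)
    return total / len(pi_new)

pi_old = {"s0": {"left": 0.8, "right": 0.2},
          "s1": {"left": 0.5, "right": 0.5}}
pi_new = {"s0": {"left": 0.8, "right": 0.2},   # unchanged here
          "s1": {"left": 0.9, "right": 0.1}}   # reward shifted here

print(round(policy_kl(pi_new, pi_old), 4))
```

One open design question is the direction of the divergence (KL is asymmetric), and whether to weight states by visitation frequency rather than uniformly.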
To me, the first experiment is perhaps slightly less interesting than the second, simply because I feel there are plenty of existing solutions to that problem (though perhaps it’s the approach, not the problem, that is interesting?). That said, given that Bethany worked on robot navigation for years, I can see it still has potential.
The second idea seems much less practically applicable, and I’m not sure whether it is at all interesting from a theoretical angle. I’m not even sure it’s a well-posed problem.
I’ll open it up to you for feedback: interesting or not? Let me know where you need clarification.
Of course, I could always do something completely different from either of these if you have something in mind. Let me know. Before the Robocode stuff picked up, you mentioned evolving reward functions?
I’m reading Kearns and Vazirani’s An Introduction to Computational Learning Theory, and was thinking about the delta parameter used in the definition of PAC learnability. My understanding is that the delta parameter represents the probability of drawing a very bad sample to train from; as the book puts it:
The confidence parameter delta is necessary since the learning algorithm may occasionally be extremely unlucky and draw a terribly “unrepresentative” sample of the target concept – for instance, a sample consisting only of repeated draws of the same instance despite the fact that the distribution is spread evenly over all instances.
Well, from my understanding, the information theory literature uses the asymptotic equipartition property (AEP) to say, roughly, that among a large number of samples, only a certain subset is at all probable (this is related to the law of large numbers). I was wondering whether it’s possible to apply the AEP to the definition of PAC learning and remove the delta parameter. Of course, the AEP only applies in the limit, so I suppose it would only help in proving PAC-learnability, not efficient PAC-learnability. I’m not sure the idea is either correct or useful, but it’s something I thought of.