Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. Le, Zou, Yeung, Ng. CVPR 2011.

  1. Instead of using hand-coded features like SIFT for processing static images in a video, here unsupervised learning is used to generate feature detectors.
  2. Use extension of Independent Subspace Analysis <I don’t know what that is – ah, they say it’s an extension of independent component analysis>
    1. Used in conjunction with deep learning methods like stacking and convolution to generate hierarchical representations
  3. Method beats previously published results on a number of datasets
  4. Previous results show that ISA can generate receptive fields similar to V1 and MT
  5. As opposed to ICA, ISA “… learns features that are robust to local translation while being selective to frequency, rotation, and velocity.”
  6. On the other hand, ISA/ICA scale poorly to high-dimensional data, so it is modified here to work well in high-D by leveraging ideas from convnets: convolution and stacking (a minimal sketch of the ISA activation appears after this list)
  7. In comparison to previous state of the art, all steps of processing are the same aside from the first level of video processing, replacing hand-designed features with learned ones.
    1. <Since I am exactly interested in the latter parts of processing, I’m going to leave this paper now.>
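
<Since I had to look ISA up: a minimal sketch of what a single ISA layer computes, i.e. square-root pooling over subspaces of linear filter responses. Names and shapes are my own, not the paper’s.>

```python
import numpy as np

def isa_layer(x, W, V):
    """One ISA layer: first-layer linear filters W, then square-root pooling
    within fixed subspaces encoded by the binary pooling matrix V.
    x: (d,) flattened video block, W: (k, d) learned filters, V: (m, k)."""
    linear = W @ x                    # simple-cell-like filter responses
    return np.sqrt(V @ linear ** 2)   # complex-cell-like, translation-tolerant responses
```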

Deep Learning for Activity Recognition (A Brief and Incomplete Survey). Taylor. Workshop on Gesture Recognition (talk) 2011

http://videolectures.net/gesturerecognition2011_taylor_tutorial/

<Only 30 mins so probably pretty surface level>

  1. <Intro on common approach was not understandable to me>
  2. But the basic point is to use nets to develop feature detectors instead of hand-designing
  3. Tend to learn one level at a time in deep nets, unsupervised
  4. Often nets are used for classification, but they don’t have to be
  5. Because representations are distributed, they are more flexible than clustering techniques (k-means is common in this field), which are based more on exemplars
  6. Convnets (biological motivation):
    1. Learn increasingly abstract representations as levels go up
    2. Feature maps <-> simple cells
    3. Pooled maps <->complex cells
  7. For video, can do 3D convnets (2D images + time)
    1. Pooling done across frames (a toy sketch of 3D convolution and frame pooling appears after this list)
  8. Lots of different architectures for doing deep learning for activity recognition. Those discussed are very complex, and have differing levels of learned vs. hard-wired filters
    1. End layer is 128D
  9. Another approach is to try and construct a generative model of video
    1. Discusses “gated restricted Boltzmann machines” for this
    2. Extracts temporal features, can be sensitive to motion, but also extracts static features. Does a sort of segmentation
    3. Space-time deep belief networks
    4. Can be used to fill in missing pixels, or denoising
  10. “Stacked convolutional independent subspace analysis”
  11. Still only of equivalent performance to more traditional methods in this field
  12. Room for progress lies in relying more on larger amounts of unsupervised data
  13. Work from Hinton questions doing invariant learning; instead treat them as latent features
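
<A toy illustration of item 7 (3D convolution over “2D images + time”, then pooling across frames). Purely my own sketch, not any of the architectures from the talk.>

```python
import numpy as np

def conv3d_valid(video, kernel):
    """Naive 'valid' 3D convolution of a (T, H, W) video with a (t, h, w) kernel."""
    T, H, W = video.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(video[i:i + t, j:j + h, k:k + w] * kernel)
    return out

def pool_over_frames(feature_map, stride=2):
    """Max-pool across frames (the temporal pooling mentioned in 7.1)."""
    T = feature_map.shape[0] - feature_map.shape[0] % stride
    return feature_map[:T].reshape(-1, stride, *feature_map.shape[1:]).max(axis=1)
```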

Monte-Carlo Exploration for Deterministic Planning. Nakhost, Muller. IJCAI 2009.

  1. MC methods lead to breakthroughs in Go and general game playing
  2. Introduces planner ARVAND
  3. Local search methods are the most effective techniques for satisficing planning
    1. But they have trouble dealing with local minima and plateaus where the heuristic function is flat in a large neighborhood
  4. Generally, these types of search algorithms are greedy, so a heuristic that isn’t very good can lead search to areas that are ultimately unpromising
  5. Mentions exploration/exploitation tradeoff
  6. Monte-Carlo Random Walk (MRW) is based on two things (a toy sketch of one MRW step appears at the end of this list):
    1. Action generation is orders of magnitude faster than state-of-the-art heuristic evaluation, so it often works well (in terms of wall-clock time) to try and move further and then evaluate the result with the heuristic less frequently
    2. Results from games (Go) show that exploration is also important
  7. Mention randomization in motion planning, and some other stuff
  8. Mention other methods for escaping local minima/plateaus (FF, and something called enforced hill climbing/EHC)
    1. An extension to EHC, Marvin, keeps track of sequences that lead to exits from plateaus and attempts to apply them again later
  9. Arvand may use heuristics generated by FF, so can be like FF with randomized exploration
  10. Variants of MRW may move away from chance exploration to bias action selection to what was effective previously
  11. The number of random walks and how long to make them are parameters
  12. Randomized exploration runs into trouble if the #actions is huge (>1000), or many random trajectories lead to dead ends.  Pruning is important in these cases
    1. They discuss something that has a couple of names, which comes down to softmax action selection
    2. In this paper, ARVAND starts with chance random walks, but falls back on these if that doesn’t make progress
  13. Empirical results discuss performance of this algorithm and competitors thoroughly, but I don’t have the background to make sense of it
  14. “To the best of our knowledge, this is the first successful use of Monte-Carlo search as the main search method of a classical planner. Contrary to greedy local search methods that try to immediately exploit their local knowledge, MRW planning explores the local search space by using random walks.”
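
<A toy sketch of one MRW step as I read item 6: many cheap random walks, with the heuristic evaluated only at walk endpoints. The interfaces (actions, apply_action, heuristic) are placeholders, not Arvand’s actual code.>

```python
import random

def mrw_step(state, actions, apply_action, heuristic,
             n_walks=10, walk_len=7, rng=random):
    """Run several bounded random walks from `state` and jump to the endpoint
    with the best (lowest) heuristic value; evaluation happens once per walk."""
    best_end, best_h = None, float("inf")
    for _ in range(n_walks):
        s = state
        for _ in range(walk_len):
            acts = actions(s)
            if not acts:              # dead end, abandon this walk
                break
            s = apply_action(s, rng.choice(acts))
        h = heuristic(s)              # heuristic evaluated at the endpoint only
        if h < best_h:
            best_end, best_h = s, h
    return best_end if best_end is not None else state
```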

Towards a Second Generation Random Walk Planner: An Experimental Exploration. Nakhost, Muller. IJCAI 2013

  1. Focuses on best-first planners that use randomization to break out of regions of the search space that have a flat heuristic value
  2. Four major insights claimed
    1. Better to evaluate states frequently as opposed to just at the end of search
    2. Better to allow parameters to intelligently adjust on the fly
    3. Better to bias action selection based on the current state, as opposed to the simple heuristic of how useful it was throughout the entire history (a generic softmax-bias sketch appears after this list)
    4. “Even simple forms of random walk planning can compete with GBFS.”
  3. Satisficing planners work almost exclusively based on heuristics. IPC 2011 had 25 of 27 entries in that category.
    1. Because they use heuristics they don’t really explore; when heuristics are bad lack of search driven by other methods means that solution quality is poor
  4. The same authors, in 2009 introduced Monte-Carlo Random Walks, which run finite-length random walks, and then choose one course of action based on leaf evaluation
  5. Mention RRTs, as well as Roamer.
  6. The planner they discuss is called Arvand-2013, built on top of the Fast Downward planner
  7. A fair amount of discussion of parameter values that worked well for particular different domains from IPC
  8. <Too results specific based on IPC domains, skipping>
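
<The “bias action selection” idea in 2.3 (and item 12.1 of the 2009 paper above) comes down to softmax/Gibbs selection. A generic sketch with made-up interfaces, not the planners’ exact rule.>

```python
import math
import random

def softmax_action(actions, score, temperature=1.0, rng=random):
    """Pick an action with probability proportional to exp(score/temperature),
    so walks are biased toward actions that looked useful rather than uniform."""
    weights = [math.exp(score(a) / temperature) for a in actions]
    r = rng.random() * sum(weights)
    acc = 0.0
    for a, w in zip(actions, weights):
        acc += w
        if r <= acc:
            return a
    return actions[-1]                # numerical safety net
```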

The Roamer Planner: Random-Walk Assisted Best-First Search. IPC 2011.

<Achieved 5th of 27 planners in the category it competed in at IPC 2011>

  1. Motivation is that when doing best-first search, algorithm may hit plateaus, where the believed best solution doesn’t change for a long period of time
    1. Is a problem in hardcore search/optimization like SAT/CSP
  2. Goal then, is to find what is called an “exit state,” which is a state that will cause a change in the heuristic value
  3. The algorithm presented here, ROAMER, uses random search to help find an exit state when a plateau is hit
  4. Description contains mention of some search methods I don’t know about <more of use to SAT/CSP community than RL probably>, but seems to be based on an iterative version of A*
  5. Cite a paper that introduces MC random exploration for search
  6. Basically these style planners do some form of heuristic BFS
  7. In the case that the heuristic is perfect, a planner will expand only the path from start to goal, but with even a small error in the heuristic function it may have to expand an exponential number of states to find the goal
  8. In the formulation heuristic values are positive and decrease as search deepens until the goal is found with value 0
    1. In cases considered, the heuristic values of s, s’ (its successor) are the same, so there is no “gradient”
  9. <The premise seems a little ironic here – a standard claim in optimization is that when you reach a plateau, there is no gradient to follow, and then the optimization enters a random walk and you are in trouble.  Here, they say when you reach a plateau, get out of it through a random walk.>
  10. Reasons to do random walk:
    1. Better at jumping out of local minima
    2. Doesn’t require use of heuristic function, so can be cheaper to run
    3. Basically memory free, so does not add to cost of bestFS
  11. In SAT/CSP the heuristic depends on the number of features not satisfied; in most cases changing the statement won’t satisfy an additional feature, so the heuristic values of both states are equivalent
  12. Mention tabu search and WalkSAT <only familiar with the first>
  13. When dealing with problems that have more helpful heuristics the random search is less useful.
  14. Best-first drives the search, but random search is invoked when deemed useful
  15. Basically if a plateau is detected (when to report a plateau is a parameter), does random walks (to a finite horizon) until a change in heuristic value occurs; a toy sketch follows
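
<A toy version of item 15 as I read it: once best-first search reports a plateau, run bounded random walks until a state with a different heuristic value turns up. Names are placeholders.>

```python
import random

def escape_plateau(state, successors, heuristic,
                   walk_len=20, max_walks=100, rng=random):
    """Random-walk escape from a heuristic plateau: return the first state found
    whose heuristic value differs from the plateau value, or None to give up."""
    h0 = heuristic(state)
    for _ in range(max_walks):
        s = state
        for _ in range(walk_len):
            nxt = successors(s)
            if not nxt:
                break
            s = rng.choice(nxt)
            if heuristic(s) != h0:    # found an exit state
                return s
    return None                        # caller falls back to plain best-first search
```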

Rollout Algorithms for Constrained Dynamic Programming. Bertsekas. LIDS (tech report?) 2005.

<This is a tech report?>

<Seems like it is dealing with deterministic case>

  1. Based on the abstract, seems similar to http://www.mit.edu/~jnt/Papers/J066-97-rollout.pdf, and his other work that deals with heuristics and rollouts
  2. “Under suitable assumptions, we show that if the base heuristic produces a feasible solution, the rollout algorithm also produces a feasible solution, whose cost is no worse than the cost corresponding to the base heuristic.”
  3. <Important!> “If the base heuristic is a DP policy, i.e., it generates state/control sequences via feedback functions μk, k=0,1,…,N-1, so that at any state xk, it applies the control μk(xk), then the algorithm can be viewed as a policy improvement step of the policy iteration method.  Based on the generic results for policy iteration, it then follows that the rollout algorithm’s cost <doing cost minimization in this setting> is no worse than the one of the base heuristic.  This cost improvement property has been extended in [BTW97 <Bertsekas, Tsitsiklis, Wu. Rollout algorithms for combinatorial optimization>] to base heuristics that are not DP policies, but rather satisfy an assumption called sequential improvement, which will be discussed shortly within the constrained context of this paper.”
  4. The base heuristic is more like a heuristic policy (as opposed to a value) in that it produces a “partial trajectory”, where the idea seems to be that if you plan to some depth you continue the rollout with the heuristic to give a better estimate of cost
    1. <This fits well with the random policy stuff>
    2. <Fits with Go literature>
  5. There is a given set of trajectories <a priori?>; the planned trajectory followed by the trajectory resulting from the heuristic may not be a member of this set of trajectories.
    1. <I think the paper is set up this way so that it can be used for constraint satisfaction; only trajectories satisfying the constraint are in this set; in this paper they deal with constrained DP as opposed to the more common unconstrained DP.>
  6. “We will show under an appropriate extension of the notion of sequential improvement, that the rollout algorithm produces a feasible solution, whose cost is no worse than the cost corresponding to the base heuristic. We note, however, that it does not appear possible to relate our rollout algorithm to concepts of policy iteration, and in fact we are not aware of an analog of the policy iteration method for constrained DP problems [my emphasis]…”
    1. <So this paper is probably less useful at the moment than his other work because of the focus on constrained DP… may skim the rest of this paper>
  7. For constrained rollouts, when performing action selection, it only considers actions at each time step that will keep the rollout satisfying the constraints
    1. At each decision point, examines all possible actions, and then does heuristic rollouts from each resulting state
    2. Takes the action that had the lowest cost and was still feasible, according to the information generated by the heuristic policy (a rough sketch of this step appears after this list)
  8. <Ah yes, this paper is the constrained version of that discussed in [BTW97 <Bertsekas, Tsitsiklis, Wu. Rollout algorithms for combinatorial optimization>], which I’m already familiar with>
    1. Ok so at this point I’m just going to skim and mine for references, as was recommended to me
  9. In some cases the heuristic will not produce policies that satisfy the constraints, in which case you are in trouble, but they introduce the idea of a heuristic being sequentially consistent, which basically means the policy itself is Markovian (they then say in this case you can just call the heuristic a policy)
  10. A sequentially improving base heuristic has a cost that keeps reducing as the rollout progresses <this actually does not seem trivial to achieve, but they say any sequentially consistent algorithm is also sequentially improving, so I guess it is>, and also that trajectories must be consistent
  11. <I suppose somewhat obviously> that if the base heuristic is sequentially improving, doing deeper rollouts with this heuristic will improve the solution quality
  12. “If the base heuristic is sequentially improving, then the rollout algorithm produces a trajectory … that is feasible and has cost that is no larger than the cost of the trajectory generated by the base heuristic starting from x0.”
  13. Then moves on to what you can do if the trajectory is not feasible
  14. As well as the case where doing trees (multiple partial trajectories) instead of rollouts
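
<A rough sketch of the action-selection step in item 7: try each action, complete the trajectory with the base heuristic, keep the cheapest feasible completion. base_heuristic, cost, and feasible are interfaces I made up for illustration, not Bertsekas’s notation.>

```python
def rollout_step(state, actions, step, base_heuristic, cost, feasible):
    """One rollout decision: for each action, extend by one step, complete the
    trajectory with the base heuristic (which returns a list of states), and
    return the action whose completed trajectory is feasible with the lowest
    cost (None if no action yields a feasible completion)."""
    best_action, best_cost = None, float("inf")
    for a in actions(state):
        nxt = step(state, a)
        traj = [state, nxt] + base_heuristic(nxt)   # heuristic completes the trajectory
        if feasible(traj) and cost(traj) < best_cost:
            best_action, best_cost = a, cost(traj)
    return best_action
```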

Learning Slow Features for Behaviour Analysis. Zafeiriou, Nicolaou, Zafeiriou, Nikitidis, Pantic. ICCV 2013.

  1. Discusses deterministic as well as probabilistic SFA <I don’t know anything about the latter>
  2. Derive a version of deterministic SFA “… that is able to identify linear projections that extract the common slowest varying features of two or more sequences.” (a minimal sketch of plain linear SFA appears after this list)
  3. Also, an EM method for inference in probabilistic SFA and extend it to handle multiple data sequences
  4. This EM-SFA algorithm “… can be combined with dynamic time warping techniques for robust sequence time-alignment.”
  5. Used on face videos
  6. <A number of good references in here – a few I still haven’t read>
  7. Seems like the probabilistic version of SFA is based on similarity to an ML estimate of a linear generative model with “… a Gaussian linear dynamical system with diagonal covariances over the latent space.” <Need to look into that more>
  8. Their version of probabilistic SFA is novel because it is based on EM, as opposed to other ML approaches.  This allows for modeling latent distributions that aren’t restricted by diagonal covariance matrices.
    1. <There is a hack in the math in the standard ML probabilistic SFA that maps variance to 0, which “… essentially reduces the model to a deterministic one…”>
  9. Calls facial expressions “action units” (AUs); this is the first unsupervised method that analyzes them based on temporal data; other unsupervised approaches are based on “…global structures…” <I assume that means global properties of still frames>
  10. <Not taking notes on particulars of math for probabilistic SFA, both ML and EM versions>
  11. When looking at more than one data sequence, SFA for this setting is designed to find the common slowly varying features
  12. <Naturally> assumption is that there is a common latent space between multiple-sequence SFA
    1. Math for both deterministic and probabilistic multi-sequence SFA
  13. <The time warping part / aligning sequences of different length is probably not relevant>
  14. The video data they work from is a public database of 400 annotated videos, each with a constrained structure of neutral, onset, apex, and offset.
  15. Not so clearly described, but it seems they don’t operate on raw video data; they work off of 68 facial points tracked in 2D
  16. Compare performance of EM-SFA, SFA, and traditional linear dynamic systems.
    1. Seems like each is run on a subset of facial data, either mouth, eyes, or brows
  17. Use slowest feature of SFA as a classifier for when the expression is being performed (when its value goes from negative to positive, and back to negative)
  18. EM-SFA outperforms SFA and LDS
  19. <Did a quick read, but I don’t see where they have experimental results of multi-sequence.  If it is indeed missing that is strange because they do have the math and the data already to test it.>
  20. The temporal alignment algorithm is used to match up videos from the same category so that the neutral, apex, etc… frames are synchronized
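
<For reference, a minimal sketch of plain (deterministic, linear) SFA: whiten the signals, then keep the directions in which the time derivative has the least variance. This is the textbook recipe, not the paper’s EM or multi-sequence variants.>

```python
import numpy as np

def linear_sfa(X, n_components=2):
    """X: (T, d) time series.  Returns the n_components slowest features and
    the projection matrix (slowest direction first)."""
    X = X - X.mean(axis=0)
    # Whitening transform from the data covariance
    evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
    whiten = evecs @ np.diag(1.0 / np.sqrt(np.maximum(evals, 1e-12))) @ evecs.T
    Z = X @ whiten
    # Directions minimizing the variance of the temporal derivative are slowest
    d_evals, d_evecs = np.linalg.eigh(np.cov(np.diff(Z, axis=0), rowvar=False))
    W = whiten @ d_evecs[:, :n_components]   # eigh sorts ascending, so slowest first
    return X @ W, W
```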

Slow Feature Analysis for Human Action Recognition. Zhang, Tao. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.

  1. Considers 4 forms of SFA for action recognition (apparently the other 3 are new for this work):
    1. (original) unsupervised SFA
    2. supervised SFA
    3. discriminative SFA
    4. spatial discriminative SFA
  2. Train SVM on slow features of squared input <not exactly, see ASD in #6>
  3. Test on a number of databases with good results
  4. “First, a large number of local cuboids are collected by randomly sampling in motion boundaries. Then, a number of slow feature functions are learned from these local cuboids.” <I don’t know what a cuboid is>
  5. “To the best of our knowledge, this is the first work that uses the slowness principle or SFA to analyze human motions.”
  6. “Instead of using the responses of the learned slow feature functions directly, we propose the Accumulated Squared Derivative (ASD) feature to represent a given action sequence which is a statistical representation of the slow features in an action sequence.” (my guess at a sketch of ASD appears after this list)
  7. 3 common classes of action recognition algorithms:
    1. Holistic features: global properties like body shape, joint angles, motion.  Require good segmentation and tracking; sensitive to noise and tracking errors
    2. Local descriptors: Find points of interest, bag of words, some other stuff
    3. Biologically inspired: Motion sensitivity inspired by visual cortex.  SFA is also somewhat biologically motivated but has a very different approach
  8. “There are four main steps in the SFA-based human action recognition, including Collection of training cuboids, Slow feature function learning, Action feature representation, and Classification. In Slow feature function learning, we extend the original SFA by using weakly supervised information and spatial information of the training cuboids to obtain discriminative slow feature functions for action classification.”
  9. Finding cuboids basically means finding what part of the image is of interest (contains a person in the foreground). They have some presupplied information that comes with the data set (bounding box; the cuboid is more the actual shape of the foreground figure as composed of many overlapping squares/cubes) that makes this easier.
  10. They use quadratic basis, but since that is too big, then use PCA to project down to 50 dimensions, some other preprocessing
  11. In supervised SFA, instead of running all data through the same algorithm, SFA is run independently on the data for each action of interest. “Finally, the statistical feature is computed with all slow feature functions. However, different actions may share many similar local motion patterns, so different labels to these “common” cuboids are misleading.” <not sure what this means exactly>
  12. “D [discriminative]-SFA is inspired by discriminative sparse coding [45], where a number of sets of discriminative dictionaries are learned, and each set of dictionaries is used to reconstruct a specific image class. Accordingly, D-SFA learns a number of sets of functions and each set of functions is used to slow down a specific action class.”
    1.  “Therefore, each learned function makes the intraclass signals x_c(t) vary slowly, but makes the interclass signals x_{c’}(t) that are different from class c vary quickly.”
    2. Results in a similar set of constraints as standard SFA, but x values are now c [class]-indexed
  13. Spatial discriminative SFA adds spatial information. The motivation is that some activities might involve more motion in a certain region (for example, upper could correspond to arm motion, while lower to leg motion)
    1. Here they do upper, mid, and lower
  14. For standard SFA “Each cuboid is considered as one minisequence. The covariance matrix and the time-derivative covariance matrix are calculated by combining all minisequences.”
  15. Do L1 normalization on the cuboids because the number of them may vary but the vectors representing the data must be the same length
  16. Bottom of p 441 has more in-depth description of algorithm run end-to-end
  17. The ASD is the concatenation of multiple slow feature functions.
  18. Classification is done via SVM on ASD
  19. More technical details in experimental results section
  20. <low on energy, not taking many further notes>
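
<My guess at what the ASD feature in item 6 computes: squared temporal derivatives of each slow-feature response, accumulated over the sequence and then concatenated across functions (per item 17). The paper’s exact construction and normalization may differ.>

```python
import numpy as np

def asd_feature(slow_responses):
    """slow_responses: (T, K) responses of K learned slow feature functions over
    a sequence.  Accumulate squared derivatives per function, then concatenate."""
    d = np.diff(slow_responses, axis=0)
    return (d ** 2).sum(axis=0)        # length-K action descriptor (e.g. input to the SVM)
```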

Planning for Markov Decision Processes with Sparse Stochasticity. Likhachev, Gordon, Thrun. NIPS 2004

  1. Algorithms for deterministic planning problems (Dijkstra’s) run faster than those for stochastic problems (primarily because in deterministic problems, optimal plans are acyclic). This work tries to split the difference for domains that are only sparsely stochastic
  2. Algorithm is called MCP (MDP compression planning)
  3. Combines A* with VI for solving stochastic-shortest-path problems
  4. Empirical results involve integration and mapping of two actual robots (one helicopter, and one ground robot) to do mapping and then planning
  5. In their definition of sparse stochasticity, most states have actions with deterministic outcomes, and only a few have stochastic outcomes (this is different from the more common definition of sparse stochasticity where all transitions may be stochastic, but that distribution is only across a small number of states)
    1. The trick with this assumption is that since only a few of the states are stochastic, they are the only ones that can lead to cycles in an optimal policy
  6. The idea then is to build an MDP that consists only of the stochastic states from the original domain.
    1. So all the deterministic states are compressed out
    2. The states included (the stochastic ones) are called distinguished states. In addition, the start and goal states are included in the MDP
  7. Based on this differentiation, a fast algorithm is used for planning through deterministic states (such as A*), and then a different one for dealing with the stochastic states (like VI)
  8. Figuring out how to do this decomposition isn’t straightforward (for example, the transition diagram in the MDP of distinguished states depends on reachability between deterministic states)
    1. So the decomposition is done incrementally
  9. In the problem considered “The ground robot needs to plan a path from its current location to a goal, but has only partial knowledge of the surrounding terrain. The helicopter can aid the ground robot by flying to and sensing places in the map.”
  10. Map is discretized and described in 2d
  11. In areas still uncovered, there is a belief as to the probability that state will be open (and therefore passable).  The ground robot can try and plan through those areas as if they were open, but runs the risk of having to replan if one of the cells turns out to be occupied.  The other option is it can request that the helicopter uncover that cell
  12. The overall planning goal is to minimize the cost of running both robots, with the episode terminating when the ground robot reaches its goal and the helicopter goes back to home
  13. This forms a POMDP, (which are in general intractable) <but this particular class of POMDPs is almost certainly tractable because it is possible to get to ground truth just by observing each corresponding cell>
    1. <Ah not exactly> Sensors are imperfect and may contradict each other.  But it is assumed that most of the path planning is deterministic and only rarely noise in sensing will arise
  14. Planning is done in belief space; they discretize this though and plan in discretized belief space
  15. The compressed MDP is built by running deterministic searches <I suppose a deterministic policy?  Which part in particular is deterministic is a little unclear.  Seems like the policy is deterministic as>
    1. Actions that produce different outcomes are put in the set of distinguished states
    2. The best deterministic paths are memoized and are used to produce better cost estimates (in a manner similar to Bellman backups); a rough sketch of this compression search appears after this list
    3. These deterministic searches can use an admissible heuristic, which means that much of the state space may be (correctly) unexplored, which leads to cheaper planning
  16. Because the MDP of distinguished states is small, planning there is cheap
  17. Planning done over the stochastic states is used as a heuristic for the deterministic planning part
  18. Each action in the MDP actually corresponds to a series of actions in the domain; only the final action in this sequence may produce a stochastic outcome
    1. “Among all compressed actions starting at s and ending at (s’, a) there is (at least) one with lowest expected cost; we will call such an action an optimal compression”
    2. If all compressions are optimal for all states in the MDP, then the MDP is said to be fully compressed
  19. Solving the full compressed MDP will produce the optimal policy, but it’s possible to get away with less work, in particular because the planning problem considered is goal-directed, so many states simply may not have to be considered
  20. State selection on each iteration is based on something in the form of the Bellman residual
  21. <I wonder how this relates to prioritized sweeping>
  22. <Not taking many notes on specifics of the algorithm, but it seems totally reasonable for the current amount of attention I am paying>
  23. The top level of the algorithm repeatedly tries to find trajectories from start to goal
    1. Values produced by the MDP used as a heuristic in this search should be admissible
  24. Need a modified version of A* that can factor in stochasticity in the domain; they propose such a modified A* planner
  25. Has theoretical guarantees, PACy
  26. Compare against RTDP, LAO*, and VI – VI was restricted to problems with perfect sensing because otherwise the size of the state space is prohibitive
  27. 5-10x the speed of competitors.  The slowest part is <surprisingly> the deterministic search component <I suppose the stochastic states are very sparse>.  Mention another search algorithm may help reduce the cost of that particular component
  28. As stochasticity becomes less sparse, its time advantage reduces, but is still better than competitors
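
<A rough sketch of the compression search in items 15 and 18 as I understand it: from a distinguished state, search through purely deterministic states and record the cheapest cost to each other distinguished state reached. The interfaces are invented for illustration, not the paper’s implementation.>

```python
import heapq
import itertools

def compress_from(start, det_succ, is_distinguished):
    """Dijkstra through deterministic states only.  det_succ(s) yields
    (next_state, cost) pairs; search stops expanding at distinguished states,
    so the result maps each reachable distinguished state to the cost of one
    'compressed action' from `start`."""
    counter = itertools.count()                  # tie-breaker for the heap
    dist = {start: 0.0}
    frontier = [(0.0, next(counter), start)]
    compressed = {}
    while frontier:
        d, _, s = heapq.heappop(frontier)
        if d > dist.get(s, float("inf")):
            continue
        if is_distinguished(s) and s != start:
            compressed[s] = min(compressed.get(s, float("inf")), d)
            continue                             # do not expand past a distinguished state
        for nxt, c in det_succ(s):
            nd = d + c
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(frontier, (nd, next(counter), nxt))
    return compressed
```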

Multiscale Markov Decision Problems: Compression, Solution, and Transfer Learning. Bouvrie, Maggioni. arXiv 2012.

  1. Deals with finding hierarchies – not just using them
  2. Covers two main contributions
    1. Compressing /homogenizing MDPs
    2. Identification of possibilities for transfer
  3. “Regardless of how the partitioning is accomplished, the statespace is divided into a multiscale collection of “clusters” connected via a small set of “bottleneck” states. A key function of the partitioning step is to tie computational complexity to problem complexity. It is problem complexity, controlled for instance by the inherent geometry of a problem, and its amenability to being partitioned, that should determine the computational complexity, rather than the choice of statespace representation or sampling. The decomposition we suggest attempts to rectify this difficulty. “
  4. Multiscale compression is local, and recursive
  5. Present a form of asynchronous policy iteration, which has O(n log n) cost per iteration
  6. In terms of transfer, they don’t only consider the common case of sharing low-level options, but also the other side of the coin that the high level dynamics between options may be the same, even if the dynamics in the low-level policies are different
  7. Consider stochastic policies, as well as discount factors that are distinct for state action pairs
  8. 3 Step process:
    1. Partition states into clusters connected by bottleneck states
    2. Compress MDP so each cluster becomes a state
      1. This may be done a number of times
    3. Solve hierarchy top-down
  9. Use diffusion map embeddings to cluster the domain (but mention there are other possibilities); a rough two-cluster spectral sketch appears after this list
  10.  Uses Chung’s method for finding the Laplacian of directed graphs, <but this can only describe a subset of all possible directed graphs>
    1. Does “low conductance cuts”
    2. Briefly mention a number of other possible metrics
  11. Describes two types of bottlenecks
    1. Geometric, which are based on graph structure alone
    2. Problem, which is based on the geometric bottlenecks as well as the goal structure of the MDP
    3. “If the policy is already strongly directed according to the goals defined by the rewards, then the bottlenecks can be interpreted as choke points for a random walker in the presence of a strong potential.”
  12. Have rewards, transition functions, states, and actions for each MDP at different levels of coarseness
    1. “One of the important consequences of these definitions is that the optimal fine scale value function on the bottlenecks is a good solution to the coarse MDP, compressed with respect to the optimal fine scale policy, and vice versa. It is this type of consistency across scales that will allow us to take advantage of the construction above and design efficient multiscale algorithms. “
  13. Mention similarities of the approach to other classes of problems, where the basic approach is to make a small problem of a large one by locally averaging parts of the problem
    1. Coarsening: PDEs  and meshes <?>
    2. Homogenization: PDEs
    3. Lumping: Markov chains
  14. Assume that the policy has at least some degree of chance action selection
  15. Basically solving the MDP at all hierarchies results in an iterative process
  16. <skimming the rest>
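
<A rough two-cluster stand-in for the partitioning step in items 9-11: split states on the sign of the Fiedler vector of a symmetrized Laplacian and flag states whose transitions cross the cut as bottleneck candidates. This uses a plain symmetric Laplacian, not Chung’s directed-graph construction from the paper.>

```python
import numpy as np

def fiedler_partition(P):
    """P: (n, n) state-to-state transition matrix under some policy.
    Returns cluster labels (0/1) and indices of bottleneck candidate states."""
    A = (P + P.T) / 2.0                              # crude symmetrization
    L = np.diag(A.sum(axis=1)) - A                   # unnormalized graph Laplacian
    _, evecs = np.linalg.eigh(L)
    labels = (evecs[:, 1] > 0).astype(int)           # sign of the Fiedler vector
    crossing = (P * (labels[:, None] != labels[None, :])).sum(axis=1)
    bottlenecks = np.flatnonzero(crossing > 0)       # states that can jump clusters
    return labels, bottlenecks
```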