Title: An Incremental Fast Policy Search Using a Single Sample Path
We consider a modified version of the control problem in a reinforcement learning setting with large state and action spaces. The control problem most commonly addressed in the contemporary literature is that of finding a policy which minimizes the long-run gamma-discounted transition cost, where gamma lies in [0, 1). These formulations typically also assume access to a generative model/simulator of the underlying MDP, with the hidden premise that sample paths realizing the system dynamics of the MDP under arbitrary policies can be obtained with ease from the model. We consider a generalized version of this problem, where the cost function is the expectation of a non-convex function of the value function, and where no access to a generative model is assumed. Rather, we assume that a single sample path generated by an a priori chosen behaviour policy is made available. In this information-restricted setting, we solve the generalized control problem by developing an incremental version of the cross-entropy method. The proposed algorithm is shown to converge to a solution that is globally optimal relative to the chosen behaviour policy. We also present a few experimental results to corroborate our claims.
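For concreteness, a plausible formalization of the two objectives described above is as follows (the symbols \nu, c, and \psi are our notational assumptions, not fixed by the abstract):

\[ J(\pi) \;=\; \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t) \,\Big|\, \pi\Big], \qquad \gamma \in [0,1), \]

for the standard discounted control problem, and

\[ \min_{\pi} \; \mathbb{E}_{s \sim \nu}\big[\psi\big(V^{\pi}(s)\big)\big], \]

for the generalized version, where \psi is a possibly non-convex function of the value function V^{\pi} and \nu is a distribution over states.

The sketch below illustrates one way such a method could be realized: candidate policy parameters are drawn from a Gaussian model, each candidate is scored off-policy from the single behaviour-policy trajectory via per-decision importance sampling, and the Gaussian is updated incrementally from the elite samples. All concrete choices here (tabular softmax policies, the smoothing constant ALPHA, the elite fraction, the toy MDP) are illustrative assumptions, not the paper's actual estimator or update rule.

```python
# A minimal sketch of cross-entropy (CE) policy search that evaluates
# candidate policies from a single behaviour-policy sample path via
# per-decision importance sampling.  All names and constants below are
# illustrative assumptions; the paper's algorithm may differ.
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.9
ALPHA = 0.2          # smoothing step for the CE distribution update (assumed)
ELITE_FRAC = 0.2     # fraction of candidates treated as elite (assumed)

def softmax_policy(theta, s):
    """Action probabilities of a tabular softmax policy at state s."""
    logits = theta[s]
    z = np.exp(logits - logits.max())
    return z / z.sum()

def behaviour_prob(s, a):
    """Uniform behaviour policy (assumed known)."""
    return 1.0 / N_ACTIONS

def is_cost(theta, path):
    """Per-decision importance-sampled discounted cost of pi_theta,
    estimated from the single behaviour-policy trajectory `path`."""
    total, w, disc = 0.0, 1.0, 1.0
    for s, a, c in path:
        w *= softmax_policy(theta, s)[a] / behaviour_prob(s, a)
        total += disc * w * c
        disc *= GAMMA
    return total

# --- generate the single sample path under the behaviour policy -------------
path, s = [], 0
for _ in range(200):
    a = int(rng.integers(N_ACTIONS))
    cost = 1.0 if s != N_STATES - 1 else 0.0       # toy cost: reach last state
    path.append((s, a, cost))
    s = (s + 1) % N_STATES if a == 1 else s        # toy deterministic dynamics

# --- CE loop: Gaussian model over policy parameters -------------------------
mu = np.zeros((N_STATES, N_ACTIONS))
sigma = np.ones_like(mu)
for it in range(50):
    thetas = mu + sigma * rng.standard_normal((30,) + mu.shape)
    scores = np.array([is_cost(th, path) for th in thetas])
    n_elite = max(1, int(ELITE_FRAC * len(thetas)))
    elite = thetas[np.argsort(scores)[:n_elite]]   # lowest estimated cost
    # smoothed update of the sampling distribution towards the elites
    mu = (1 - ALPHA) * mu + ALPHA * elite.mean(axis=0)
    sigma = (1 - ALPHA) * sigma + ALPHA * elite.std(axis=0)

print("CE mean parameters:\n", mu)
```

Note that the batch-per-iteration smoothing above is a simplification: a truly incremental, stochastic-approximation variant of CE, as the abstract suggests, would revise the Gaussian parameters after each candidate evaluation rather than after a full batch.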