Statistical algorithms over a dense tree of ordered binary trees

Raazesh Sainudiin

Abstract

We present an efficient, data-driven, multi-dimensional, metric data-structure that is sufficient for non-parametric density estimation of massive data sets. Arithmetic operations are extended to these data structures in order to efficiently obtain the average of two histograms. We obtain a non-parametric point-estimate of the density from the sample mean of a Markov chain whose stationary distribution is the posterior density over a dense space of histograms and whose initial distribution is given by an L1 consistent randomised queue that is prioritised for statistically equivalent blocks. We apply our methods to earthquakes in NZ, weather and aircraft trajectories over a busy US airport and samples simulated from challenging multi-dimensional densities, including Levy and Rosenbrock. Our approach efficiently produces consistent density estimates from uniformly distributed samples in up to one thousand dimensions. We develop an arithmetic that generalises computationally efficient histogram averaging. This approach allows for arithmetic and analysis over mapped statistical regular sub-pavings, a binary tree whose mapped nodes are partially ordered compact subsets of a multidimensional metric space and whose leaves contain data. Obvious applications include computationally efficient enclosures of plug-in estimates and discrete function arithmetic for a large class of functionals of massive data sets. This is joint work with Jennifer Harlow, Dominic Lee and Gloria Teng.