|
We present an efficient, data-driven, multi-dimensional, metric
data-structure that is sufficient for non-parametric density estimation of
massive data sets. Arithmetic operations are extended to these data
structures in order to efficiently obtain the average of two histograms.
We obtain a non-parametric point-estimate of the density from the sample
mean of a Markov chain whose stationary distribution is the posterior
density over a dense space of histograms and whose initial distribution is
given by an L1 consistent randomised queue that is prioritised for
statistically equivalent blocks.
We apply our methods to earthquakes in NZ, weather and aircraft
trajectories over a busy US airport and samples simulated from challenging
multi-dimensional densities, including Levy and Rosenbrock. Our approach
efficiently produces consistent density estimates from uniformly
distributed samples in up to one thousand dimensions.
We develop an arithmetic that generalises computationally efficient
histogram averaging. This approach allows for arithmetic and analysis
over mapped statistical regular sub-pavings, a binary tree whose mapped
nodes are partially ordered compact subsets of a multidimensional metric
space and whose leaves contain data. Obvious applications include
computationally efficient enclosures of plug-in estimates and discrete
function arithmetic for a large class of functionals of massive data sets.
This is joint work with Jennifer Harlow, Dominic Lee and Gloria Teng.
|