I have been designing an interesting, novel (as far as I know) data structure. Its goal is to estimate a very large number of Bayesian models in the smallest possible space and time. Sublinear complexity in space and time, as well as online updates, are key requirements.
The key idea is to combine particle filters and sketches in a single powerful tool. That’s why I call them Particle Sketches.
I devised them as a tool to track the posting behavior of a very large set of Twitter users, in order to decide whom to follow using the Twitter Streaming API and whom to follow using the REST API, based on their expected tweeting frequency and its distribution across the day and week. The data structure is, however, totally generic and suitable for all kinds of large-scale estimation problems: I can certainly see applications in high-frequency trading and website traffic forecasting.
Let’s look at the building blocks first.
Particle Filters, otherwise known as Sequential Monte Carlo methods, are model estimation techniques based on simulation. They are used to estimate Bayesian models in which the latent variables are connected in a Markov chain, typically when the state space of the latent variables is continuous and not sufficiently restricted to make exact inference tractable.
In practice, they work by maintaining a set of differently-weighted samples, called particles, that represent the expected distribution of the latent variable. You can then “filter”, which in this context means determining the distribution at a specific time, by estimating how the particle swarm should have evolved under the observed variable (e.g. time).
On each update, a new set of weighted particles is generated under the influence of the new observations. You can check Wikipedia’s page on particle filters for details, or the multitude of pages about them available on the Internet.
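To make the mechanics concrete, here is a minimal bootstrap particle filter in Python. It is a sketch under simplifying assumptions: a one-dimensional latent state with Gaussian drift and Gaussian observation noise, which are my illustrative choices rather than anything prescribed by the technique.

```python
import math
import random

class ParticleFilter:
    """Minimal bootstrap particle filter for a one-dimensional latent state.

    The Gaussian drift and observation models are illustrative stand-ins,
    not the model discussed in this post.
    """

    def __init__(self, n_particles=100, init_state=0.0, drift_std=1.0, obs_std=1.0):
        self.particles = [random.gauss(init_state, drift_std) for _ in range(n_particles)]
        self.weights = [1.0 / n_particles] * n_particles
        self.drift_std = drift_std
        self.obs_std = obs_std

    def predict(self):
        # Propagate every particle through the (assumed) Markov transition model.
        self.particles = [p + random.gauss(0.0, self.drift_std) for p in self.particles]

    def update(self, observation):
        # Reweight each particle by the likelihood of the new observation.
        def likelihood(p):
            z = (observation - p) / self.obs_std
            return math.exp(-0.5 * z * z)

        self.weights = [w * likelihood(p) for w, p in zip(self.weights, self.particles)]
        total = sum(self.weights)
        if total == 0.0:
            # Degenerate case: fall back to uniform weights.
            self.weights = [1.0 / len(self.particles)] * len(self.particles)
        else:
            self.weights = [w / total for w in self.weights]
        self._resample()

    def _resample(self):
        # Multinomial resampling: draw a fresh swarm proportionally to the weights.
        self.particles = random.choices(self.particles, weights=self.weights,
                                        k=len(self.particles))
        self.weights = [1.0 / len(self.particles)] * len(self.particles)

    def estimate(self):
        # The weighted mean of the swarm approximates the posterior mean.
        return sum(w * p for w, p in zip(self.weights, self.particles))
```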
The key fact here is that particle filters are composed of a set of discrete, importance-weighted particles, and their distribution is effectively the sum of these particles.
We can thus split a particle filter into multiple smaller filters by distributing the particles among them. If we assign the particles at random, each sub-filter should end up representing essentially the same distribution.
If we add noise to a few of these sub-filters, the cumulative distribution of their sum will still be fairly close to the original estimated distribution. This is even more true if we can account for the noise and remove its effect.
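A small sketch of this splitting argument, assuming particles are simply weighted samples (as the ParticleFilter above stores them): randomly partitioning the swarm into sub-filters and pooling them back should recover essentially the same weighted sample set. The function names here are mine, for illustration.

```python
import random

def split_particles(particles, weights, k):
    """Randomly partition a weighted particle swarm into k sub-filters.

    Each sub-filter receives a random, hence approximately representative,
    subset of the original swarm.
    """
    subs = [([], []) for _ in range(k)]
    for p, w in zip(particles, weights):
        i = random.randrange(k)
        subs[i][0].append(p)
        subs[i][1].append(w)
    return subs

def merge_particles(subs):
    """Pool sub-filters back into a single swarm, renormalizing the weights."""
    particles, weights = [], []
    for ps, ws in subs:
        particles.extend(ps)
        weights.extend(ws)
    total = sum(weights) or 1.0
    return particles, [w / total for w in weights]
```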
Count-Min (CM) Sketches are probabilistic, memory-efficient data structures that allow one to estimate frequency-related properties of a data set: estimating the frequencies of particular items, finding the top-K most frequent elements, and performing range frequency queries (finding the sum of the frequencies of the elements within a given range, e.g. to estimate percentiles).
They are conceptually quite simple: they maintain a two-dimensional array [w, d] of integer counters. A set of d random hash functions with codomain between 0 and w-1 is used to choose which counters to update whenever a new sample arrives.
Given an item, we can estimate its frequency by looking at the d counters it hashes to (one per hash function) and choosing the one with the lowest value.
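For reference, here is a compact Count-Min sketch in Python. Salted calls to Python’s built-in hash stand in for the d independent hash functions; note that string hashing is randomized per process, so this toy version is only consistent within a single run.

```python
import random

class CountMinSketch:
    """Count-Min sketch: d rows of w counters, one hash function per row."""

    def __init__(self, w, d, seed=42):
        self.w, self.d = w, d
        rng = random.Random(seed)
        # One random salt per row stands in for d independent hash functions.
        self.salts = [rng.getrandbits(64) for _ in range(d)]
        self.table = [[0] * w for _ in range(d)]

    def _bucket(self, row, item):
        return hash((self.salts[row], item)) % self.w

    def add(self, item, count=1):
        for row in range(self.d):
            self.table[row][self._bucket(row, item)] += count

    def estimate(self, item):
        # Each of the d counters over-counts because of collisions,
        # so the minimum is the tightest upper bound on the true frequency.
        return min(self.table[row][self._bucket(row, item)]
                   for row in range(self.d))
```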
CM sketches are extremely memory efficient: for example, 48 KB are sufficient to calculate the top-100 most frequent elements in a 40 MB dataset with 4% error.
CM sketches tend to perform better with highly-skewed distributions, such as those generated by power laws or Zipf-like distributions.
In the case of low or moderately skewed data, a more careful estimation can be done by accounting for the noise due to hash collisions: for example, by estimating and removing the noise for each hash function (calculated as the average value of all the remaining counters in that row) and choosing the median instead of the minimum as the estimated frequency. This variation is called the Count-Mean-Min Sketch.
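A sketch of that correction, built on the CountMinSketch above. Here `total` is the total number of insertions into the sketch, and the noise term follows the description just given: the mean of the counters the item does not hash to in each row.

```python
import statistics

def count_mean_min_estimate(sketch, item, total):
    """Count-Mean-Min estimate on a CountMinSketch.

    `total` is the total count inserted into the sketch. For each row, the
    expected collision noise is the mean of the counters the item does NOT
    hash to; we subtract it, then take the median across rows.
    """
    corrected = []
    for row in range(sketch.d):
        counter = sketch.table[row][sketch._bucket(row, item)]
        noise = (total - counter) / (sketch.w - 1)
        corrected.append(counter - noise)
    return statistics.median(corrected)
```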
There are many other interesting properties of CM Sketches, such as the various algebraic operations they support, or ways to measure how different two sketches are, e.g. by calculating the cosine distance between them.
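For instance, treating two same-shaped sketches (built with the same hash functions) as flat vectors, a cosine comparison is only a few lines. This is my own minimal rendition, not a standard library routine.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two CountMinSketch instances that share
    shape and hash functions; 1.0 means identical direction."""
    num = norm_a = norm_b = 0.0
    for row_a, row_b in zip(a.table, b.table):
        for x, y in zip(row_a, row_b):
            num += x * y
            norm_a += x * x
            norm_b += y * y
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return num / math.sqrt(norm_a * norm_b)
```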
Particle Sketches replace the counters with particle filters. On each new observation, a set of filters is selected using hash functions and, for each of them, only a random subset of the particles is updated.
To obtain a distribution for a given item, we put together all the particle filters that are associated with that item. For better performance, we can estimate and remove the expected noise due to collisions in a way analogous to Count-Mean-Min sketches.
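Putting the pieces together, here is how I picture the core of the idea in code. It reuses the ParticleFilter and the salted-hash trick from the sketches above. For simplicity this version updates every particle in the selected filters, whereas the actual proposal updates only a random subset, and it skips the Count-Mean-Min-style noise correction; all parameters are illustrative.

```python
import random

class ParticleSketch:
    """A [d, w] grid of small particle filters addressed like a CM sketch.

    Simplified sketch: it updates all particles of each selected filter
    (the post proposes updating only a random subset) and omits the
    noise correction step.
    """

    def __init__(self, w, d, particles_per_filter=20, seed=42):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.salts = [rng.getrandbits(64) for _ in range(d)]
        self.grid = [[ParticleFilter(particles_per_filter) for _ in range(w)]
                     for _ in range(d)]

    def _bucket(self, row, item):
        return hash((self.salts[row], item)) % self.w

    def observe(self, item, observation):
        # Route the observation to one filter per row, as a CM sketch would.
        for row in range(self.d):
            pf = self.grid[row][self._bucket(row, item)]
            pf.predict()
            pf.update(observation)

    def query(self, item):
        # Pool the d associated filters into a single normalized swarm.
        particles, weights = [], []
        for row in range(self.d):
            pf = self.grid[row][self._bucket(row, item)]
            particles.extend(pf.particles)
            weights.extend(pf.weights)
        total = sum(weights) or 1.0
        return particles, [w / total for w in weights]

# Hypothetical usage: feed (user, observation) pairs, then query one user.
ps = ParticleSketch(w=64, d=4)
for user, obs in [("alice", 1.2), ("bob", -0.3), ("alice", 0.9)]:
    ps.observe(user, obs)
particles, weights = ps.query("alice")
```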
The nice thing is that many cool properties of CM Sketches are preserved. For example, by pooling all the particle filters together we obtain an estimate of the cumulative distribution of the whole population. Using techniques similar to those used to estimate the top-K elements in a set, we can compute the cumulative particle filters of a sub-population, and so on.
It’s still a work in progress: I am still exploring the characteristics of these models and the best way to merge multiple particle filters while discarding the peculiar noise generated by the (pardon the repetition) structure of this data structure.
Possible improvements: by stacking together particle sketches at different resolutions, it should be possible to decompose these models in a way analogous to wavelet decomposition. This would allow an even more compact representation, or dually, using the same bits of information, a more faithful model with less noise.