Particle Sketches, a way to estimate a very large number of Bayesian models in sub-linear space and time

I have been designing an interesting, novel (as far as I know) data structure. Its goal is to estimate a very large number Bayesian models in the smallest possible space and time. Sublinear complexity in space and time as well as online updates area key requirement.

The key idea is to combine particle filters and sketches in a single powerful tool. That’s why I call them Particle Sketches.

I devised them as a tool to track the posting behavior of a very large set of Twitter users, in order to decide who to follow using Twitter streaming API and who to follow using the Rest API based on their expected tweeting frequency and distribution across the day and week. The data structure is however totally generic and suitable to all kind of large-scale data estimation problem: I can surely see applications in high-frequency trading and website traffic forecast.

Let’s look at the building blocks first.

Particle Filters —otherwise known as Sequential Monte Carlo method— are model estimation techniques based on simulations. They are used to estimate Bayesian models in which the latent variables are connected in a Markov chain, typically when the state space of the latent variables is continuous and not sufficiently restricted to make exact interference tractable.

In practice, they work by maintaining a set of differently-weighted samples —called particles— that represent the expected distribution of the latent variable. You can then “filter”, which in this context mean determine the distribution at a specific time, by estimating how the particle swarm should have evolved under the observed variable (e.g. time).

On update, a new set of weighted particles is generated influenced by new observations. You can check Wikipedia’s page on particle filters for details or the multitude of pages about them available on the Internet.

The key fact here is that particle filters are composed by a set of discrete, importance-weighted particles and their distribution is effectively the sum of these particles.

We can thus split up a particle filter into multiple smaller filters by splitting up the particles into them. If we pick the particles at random, we should end up with essentially the same distribution among all of them.

If we add noise to a few of these sub-filters, the cumulative distribution of the sum of all of them will still be fairly close to the original estimated distribution. It will be even more true if we can account for the noise and remove its effect.

Count-Min (CM) Sketches are probabilistic, memory efficient data structures that allow one to estimate frequency-related properties of a data set (estimate frequencies of particular items, find top-K frequent elements, performance range frequency queries —i.e. find the sum of the frequencies of elements within a given range— such as estimate percentiles).

They are conceptually quite simple, they maintain a two-dimensional array [w, d] of integer counters. A set of random hash functions with co-domain between 0 and is used to choose which counters to update whenever we a new sample arrives.

Given an item, we can estimate its frequency by looking at all buckets in the two-dimensional array and choosing the one with the lowest value.

CM sketches are extremely memory efficient: for example, 48 KB are sufficient to calculate the top-100 most frequent elements in a 40 MB dataset with 4% error.

CM sketches tend to perform better with highly-skewed distributions, such as those generated by power laws or Zipf-like distributions.

In case of low or moderately skewed data, a more careful estimation can be done by accounting for the noise due to hash collisions: for example, by estimating and removing the noise for each hash function (calculated as the average value of all remaining counters) and choosing the median instead of the minimum as the estimated frequency. This variation is called Count-Mean-Min Sketch.

There are many other interesting properties of CM Sketches, such as various algebraic operations that they support or a way to measure how different two sketches are — calculating the cosine distance between them.

For a good explanation of CM Sketches with sample code, check this nice post on highlyscalable. For deeper knowledge, check this site.

Particle Sketches replace the counters with particle filters. On each new observation, a set of filters is selected using hash functions and, for each of them, only a random subset of the particles is updated.

To obtain a distribution for a given item, we put together all the particle filters that are associated with that item. For better performance, we can estimate and remove the expected noise due to collisions in a way analogous to Count-Mean-Min sketches.

The nice thing is that many cool properties of CM Sketches are preserved. For example, by putting all particle filters together we have an estimation of the cumulative distribution of the whole population. Using techniques similar to those used to estimate the top-K elements in a set, we can calculate the cumulative particle filters of a sub-population and so on.

It’s still a work in progress, I am still exploring the characteristics of these models and the best way to merge together multiple particle filters discarding the peculiar noise generated by (pardon the repetition) structure of this data structure.

Possible improvements: stacking together particle sketches at different resolutions, it should be possible to decompose these models in a way analogous to wavelet decomposition. It should make it possible to use even a more compact representation, thus dually to have —if we use the same bits of information— a more faithful model with less noise.

NEXT: Particle Sketches / 2 — working around the central limit theorem

6 thoughts on “Particle Sketches, a way to estimate a very large number of Bayesian models in sub-linear space and time

  1. “On each new observation, a set of filters is selected using hash functions and, for each of them, only a random subset of the particles is updated.”

    If you only update *some* of the particles in the relevant filters, doesn’t this lose a lot of the properties that make particle filters useful (or for that matter, Bayesian) in the first place? That is, suppose at time t, my particles are all over the place, representing a high amount of uncertainty, and then I observe some datum d which is extremely unlikely under almost all hypotheses, but reasonably likely under a very small minority of hypotheses. Under a typical particle filter, at t+1 we’d have wiped out almost all of the particles in the d-is-unlikely places, and have a whole log of particles in the d-is-reasonable areas. Your method, if it only updates some subset of its particles, seems like it should leave you with a bunch of extra particles in the d-unlikely areas, or said otherwise, you wouldn’t be left with a good approximation of the actual posterior distribution.

    What am I missing here?

    1. The effect I am aiming for is spreading the information among multiple particle filters.

      It is true that a single one will loose the ability to be a good approximation of the actual distribution for a given individual item, but the goal is that the combination of them is still a good one while having acceptable data and computing complexity.

      I am not satisfied with the idea of using just hash functions to choose which filters to update, I wonder if there is a smarter way that compounds similar individuals together in such a way that the information density is maximized.

      1. Thanks for responding. Can you expand on your goals a little? I don’t quite follow. Specifically, I think what I must not be understanding is
        (a) What’s an example of a task where you care about the aggregate behavior of several models, but not about accurate inference wrt any of them individually? And in such a case, would you not be better off conceptually combining your many latent variables into a single inference problem?
        (b) It sounds like, any given particle filter, instead of at time n approximating p(x_i|y_i,1, …, y_i,n), will end up being a weighted mixture of [p(x_i), p(x_i|y_i,1), p(x_i|y_i,2), p(x_i|y_i,1, y_i,2), …], and that all your particles taken in aggregate (across your many filters) will be a mixture of several such mixtures. How are you getting the combination of all your filters to tell you something meaningful?

        The above point might be unclear, so let me use a concrete example for the sake of argument. Suppose I’m tracking the movements of several people in a room, using multiple cameras. I suppose I *could* have a separate particle filter for each person being tracked, where each particle represents a possible locaiton for a single person. My point in (a) is that I’d probably be better off having a single particle filter, where the latent variable is a vector over locations. This allows me to break some strong assumptions about the independence of different peoples locations which are implicitly made by having multiple, free-standing filters, for instance, that two people can’t occupy exactly the same location.

        My point in (b) is that at time t, the filter responsible or tracking Alice’s position will actually show me something like a blur of all the places Alice has been up until time t, which is not what we’d ordinarily want a particle filter to do. And more to the point, adding in the particles from the filters for Bob and Eve just makes matters *worse* — I’d get something like a blur of all the places any of them have been. If Alice, Bob and Eve all start at the left side of the room, and move to the right side arriving at time t, a strategy which only updates a random subset of particles sounds like it would, at t, not be able to say that all parties had arrived at the far right, because some portion of the particles would have survived in the middle of the room, or at the far left.

  2. I will detail further my ideas in a post later. Let me just answer your first question: “What’s an example of a task where you care about the aggregate behavior of several models, but not about accurate inference wrt any of them individually?”

    I care only about the details of the behavior of the top-K most active individuals. The task I care at the moment is following 100K to a few millions of Twitter users with as few resources as possible. In particular, Twitter exposes two APIs: a streaming one which is limited by the number of followers (the basic access limits you to 10K users, or maximum 1% of all tweets; you can get higher quotas if you’re wiling to pay a premium) and a REST-based one which is rate-limited and you only get updates from a single user per query, or a bunch of users per query if you throw in some not so trivial heuristics with APIs calls meant for other purposes (bulk user lookups return the last tweet for each user in full).

    Here we have a precious resource that we want to dynamically allocate to a subset of all individuals based on their forecasted behavior.

    Scheduling users on the streaming API is purely based on Twitter activity, here is where I want to use particle sketches. Scheduling on the Rest API is more complex and generally dominated by the rate-limiting, so I’d rather schedule based on the number of API calls I have available for a Twitter account (i.e. the the aggregated interest by users of my application who want to follow a particular Twitter account).

    Regarding (b), what is the best way to aggregate particle filters is still a open question. In general, given the design of sketches, we’ll need a function that works sort of like a very narrow-band filter in engineering terms, or like an intersection in set theory terms.

    I am looking for a joint probability function which is the sum of all particle filters, assuming as a first approximation that they model independent random variables, so essentially the joint PDF is the convolution of their separate density functions.

Comments are closed.