If you read my previous post about Particle Sketches and are literate in statistics, you may have already dismissed my ideas as flawed, convinced that they will shatter against the hard cliffs of classical statistics and its mightiest peak, the central limit theorem.
It is not necessarily so. Don’t get me wrong, I don’t pretend to be able to circumvent it in any way or outsmart the math giants of the past. The naive way to put together count-min sketches and particle filters is bound to hit this wall, but if we are a little bit smart we can use the central limit theorem itself to guide us around these perilous shores.
What is the problem? The core concept of count-min sketches is to use a set of hash functions to distribute hits on various counters, relying on the fact that good hash functions will have few collisions. The counter with the lowest number of hits will be an upper-bound approximation of the true frequency for a given item.
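To make this concrete, here is a minimal count-min sketch in Python. The class name, width, and depth are mine, purely for illustration:

```python
import hashlib

class CountMinSketch:
    """Minimal count-min sketch: depth rows of width counters, one hash per row."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.counters = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash function per row by salting the item with the row number.
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.counters[row][self._index(item, row)] += count

    def estimate(self, item):
        # The smallest counter is an upper bound on the true frequency:
        # collisions can only inflate a counter, never deflate it.
        return min(self.counters[row][self._index(item, row)]
                   for row in range(self.depth))
```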
Count-Mean-Min Sketches are a bit smarter and incorporate the number of expected collisions in the estimation itself.
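Building on the CountMinSketch above, a count-mean-min style query could look like this sketch: estimate each row's expected collision noise from the row total and subtract it before taking the median. Treat it as one plausible formulation, not the definitive one:

```python
from statistics import median

class CountMeanMinSketch(CountMinSketch):
    """Count-min sketch with a collision-corrected (count-mean-min) estimate."""

    def __init__(self, width=1024, depth=4):
        super().__init__(width, depth)
        self.total = 0  # total number of hits added so far

    def add(self, item, count=1):
        super().add(item, count)
        self.total += count

    def estimate(self, item):
        corrected = []
        for row in range(self.depth):
            counter = self.counters[row][self._index(item, row)]
            # Expected number of colliding hits in this slot, assuming the
            # other items spread roughly uniformly over the remaining slots.
            noise = (self.total - counter) / (self.width - 1)
            corrected.append(counter - noise)
        # The median of the corrected rows is less biased than the plain
        # minimum, though it is no longer a guaranteed upper bound.
        return median(corrected)
```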
When we replace counters with statistical models, like particle filters, we are replacing integers with random variables.
What’s the deal with adding together independent random variables? Under very mild conditions, the sum converges to a normal distribution (if the variance of the independent variables is finite) or to one of the other well-known, well-studied distributions from the literature, the so-called stable distributions (if the variance is infinite, as with power laws).
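A quick simulation makes the finite-variance case tangible: sums of many independent draws from a skewed distribution (exponential here, chosen arbitrarily) already behave like a normal distribution:

```python
import random
import statistics

# Sum many independent exponential draws (skewed, but finite variance) and
# check that the sums behave like a normal distribution, as the central
# limit theorem predicts.
random.seed(42)
n_terms, n_trials = 500, 2000
sums = [sum(random.expovariate(1.0) for _ in range(n_terms))
        for _ in range(n_trials)]

mean = statistics.fmean(sums)    # expected around n_terms * 1.0 = 500
stdev = statistics.stdev(sums)   # expected around sqrt(n_terms) ~ 22.4
within_one_sigma = sum(abs(s - mean) <= stdev for s in sums) / n_trials
print(mean, stdev, within_one_sigma)  # ~0.68 of sums fall within one sigma, as for a normal
```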
The key here is the word independent. That’s certainly the case if we choose a set of random hash functions, just as we would with normal count-min sketches. But we can be smarter than that. If our items are not opaque labels for the objects they stand for, but instead have an internal structure, we can use that structure to drive the selection of which particle filter to update.
In simpler terms, we can use the attributes of our items to cluster them into specific, non-random categories. Let these attributes drive our hash functions: if they are effective properties around which our items cluster, we’ll see non-uniform distributions arise across the buckets. If not, we’ll end up with very similar distributions in every bucket.
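One way to check whether an attribute actually carries structure is to bucket items by it and measure how far the bucket counts stray from uniform, for instance via normalized entropy. The helper and the toy data below are illustrative assumptions, not anything from the original post:

```python
import math
from collections import Counter

def bucket_uniformity(items, bucket_fn):
    """Normalized entropy of the bucket counts: close to 1.0 means the
    attribute spreads items uniformly (uninformative); well below 1.0
    means it clusters items into a few hot buckets."""
    counts = Counter(bucket_fn(item) for item in items)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    max_entropy = math.log2(len(counts)) if len(counts) > 1 else 1.0
    return entropy / max_entropy

# Toy data: a structured attribute (most activity around hour 9) versus an
# opaque label hashed at random.
tweets = [{"hour": 9 if i % 3 else i % 24, "id": i} for i in range(300)]
print(bucket_uniformity(tweets, lambda t: t["hour"]))          # noticeably below 1.0
print(bucket_uniformity(tweets, lambda t: hash(t["id"]) % 8))  # close to 1.0
```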
Let’s see a concrete example. A tweet is not just the text that the user sees; it carries a lot of metadata: the user’s location and timezone, the user’s bio, the number of followers the user has and the number of users they follow, how many times the tweet has been retweeted, and the kind of tweet (reply, retweet, plain text, with a link, with hashtags, with user mentions).
We can maintain a separate set of buckets for each of these dimensions, using hash functions that map closely related items in the domain to the same slots in the co-domain.
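Such hash functions are really just locality-preserving bucketing functions, one per dimension. A few hypothetical examples for the tweet attributes above (the specific choices are my assumptions):

```python
import math

# Per-dimension bucketing functions that keep "close" values in the same
# slot: follower counts by order of magnitude, timestamps by hour of day,
# locations by a coarse string prefix. The specific choices are illustrative.
def follower_bucket(followers):
    return 0 if followers <= 0 else int(math.log10(followers))

def hour_bucket(timestamp_seconds):
    return (timestamp_seconds // 3600) % 24

def location_bucket(location, prefix_len=3):
    return location.strip().lower()[:prefix_len]

# One set of buckets per dimension, where each slot would hold its own
# particle filter instead of a plain integer counter.
buckets = {
    "followers": {},
    "hour": {},
    "location": {},
}
```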
If the processes underlying the phenomenon we are tracking do not change, we can expect these distributions not to change either, in which case we can discard those dimensions as uninteresting.
If the processes do evolve, we can use a key strength of particle filters to keep the computing load manageable: we can dynamically reduce the number of particles used to track the uninteresting dimensions and allocate those resources to the more interesting ones. Over time, what is interesting and uninteresting may change, and we’ll just have to focus (i.e., run more particles) where the real action is at the moment. That’s one reason I like particle filters over other statistical methods. The other is that they can estimate any kind of phenomenon: you don’t have to commit to a specific statistical model up front.
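The reallocation idea might look something like this: periodically score each dimension by how interesting it currently looks and split a fixed particle budget in proportion to those scores. The interest_score() hook and the budgeting rule are my assumptions, just to sketch the mechanism:

```python
def reallocate_particles(dimensions, total_particles, min_particles=50):
    """Split a fixed particle budget across dimensions in proportion to an
    'interest' score, keeping a small floor so no dimension goes fully blind.
    Each dimension object is assumed to expose interest_score(), e.g. how
    much its bucket distribution drifted recently (a hypothetical hook)."""
    scores = {name: max(dim.interest_score(), 0.0)
              for name, dim in dimensions.items()}
    total_score = sum(scores.values()) or 1.0
    spare = max(total_particles - min_particles * len(dimensions), 0)
    return {
        name: min_particles + int(spare * scores[name] / total_score)
        for name in dimensions
    }
```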