Question
Asked 9 February 2012

Is there an algorithm faster than k-d trees for nearest-neighbor search in the domain of bit arrays of length (n)?

Let ''I'' be a set of bit arrays of length ''n''. Given a bit array ''t'', the goal is to find the array ''i'' in ''I'' that is closest to ''t'' in terms of Hamming distance [1] ''h''. That distance is 0 if ''t'' is in ''I'', but most likely ''h > 0''.
Clearly this is an instance of the multidimensional nearest-neighbor problem, but when ''n'' is large (say, ''n = 256''), traditional nearest-neighbor algorithms such as k-d trees become impractical. However, the domain should allow for some optimizations: after all, each dimension of a data point can take only two values (0 and 1).
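For reference, the baseline any faster method would have to beat is a plain linear scan. Below is a minimal sketch of it, assuming each bit array is packed into a Python integer; the names `hamming` and `nearest` are illustrative only and do not come from any library.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    # XOR leaves a 1 exactly where the two bit arrays disagree.
    return bin(a ^ b).count("1")


def nearest(I, t):
    """Return (i, h): the element of I closest to t, and its Hamming distance h.

    Brute force: one XOR and one bit count per element of I.
    Ties are resolved in favor of the first element encountered.
    """
    best = min(I, key=lambda i: hamming(i, t))
    return best, hamming(best, t)


if __name__ == "__main__":
    I = [0b00000000, 0b11111111, 0b11110000]
    print(nearest(I, 0b11110001))
```

Because the two-value domain lets the distance reduce to an XOR followed by a bit count, this scan is already cheap per element; its cost grows linearly with the size of ''I''.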
Does anyone know of a nearest-neighbor algorithm that is optimized for this case?
Helio Perroni Filho
Federal University of Espírito Santo
That's a good question. I plan to run some tests to confirm this, but my feeling is that the distribution of bit arrays over the domain space is not random: there ought to be some clustering, and therefore varying expectations for each bit array index ''i'' (''0 <= i < n'') across the data set.
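One way such a test could look is to measure, for every index, the fraction of arrays in the data set that have a 1 there. This is a rough sketch, assuming the arrays are packed into Python integers with index 0 as the least significant bit; `bit_expectations` is a hypothetical helper name.

```python
def bit_expectations(I, n):
    """Fraction of arrays in I that have a 1 at each index 0 <= i < n.

    Assumes index 0 is the least significant bit of each packed integer.
    """
    total = len(I)
    return [sum((x >> i) & 1 for x in I) / total for i in range(n)]


if __name__ == "__main__":
    I = [0b011, 0b001, 0b101]
    # Values far from 0.5 would indicate indices that are not uniformly random.
    print(bit_expectations(I, 3))
```

If the data were uniformly random, every value would sit near 0.5; values close to 0 or 1 would point to the kind of clustering suspected here.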
Yes, looking first into the branches (indices) with the highest expectation, hoping to quickly lock onto a restricted set of possible best matches, looks like a promising strategy. I've come up with a dynamic programming algorithm around it (though it'll probably be Hell to implement); right now I'm busy finishing a report, but once I'm done I'll start working on it. Stay tuned.