Question
Asked 9 February 2012

Is there an algorithm faster than k-d trees for nearest-neighbor search in the domain of bit arrays of length ''n''?

Let ''I'' be a set of bit arrays of length ''n''. Given a bit array ''t'', it is desired to find the array ''i'' in ''I'' which is closest to ''t'' in terms of their Hamming distance [1] ''h'' – which may be 0 if ''t'' is in ''I'', but most likely ''h > 0''.
Clearly this is an instance of the multidimensional nearest-neighbor problem, but if ''n'' is large (say, ''n = 256''), traditional nearest-neighbor algorithms such as k-d trees become impractical. However, it also seems certain that the domain allows for some optimizations; after all, only two values (0 and 1) are possible in each dimension of the data points.
Does anyone know of a nearest-neighbor algorithm that is optimized for this case?
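For reference, the baseline that any specialized method has to beat is a linear scan over ''I''. The sketch below is only an illustration of the problem statement, assuming each bit array is packed into a Python integer; the names `hamming` and `nearest_linear` are made up for this example.

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two bit arrays packed into integers."""
    # XOR leaves a 1 exactly where the two arrays differ; count those bits.
    return bin(a ^ b).count("1")


def nearest_linear(I, t):
    """Scan every array in I and return (closest array, its distance to t)."""
    best = min(I, key=lambda i: hamming(i, t))
    return best, hamming(best, t)


if __name__ == "__main__":
    I = [0b00000000, 0b11110000, 0b10101010]
    t = 0b11110001
    best, h = nearest_linear(I, t)
    print(format(best, "08b"), h)
```

Each comparison is one XOR plus a bit count, so the scan is linear in the size of ''I'' regardless of ''n''.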

All Answers (3)

Thomas P A Debray
University Medical Center Utrecht
Is there some kind of structure in ''I''? If so, you can adapt your search algorithm and considerably reduce the search space. If ''I'' is totally random, it becomes much more difficult. However, one potential optimization is to explore the branches with the highest expectations first (i.e. branches that already have a good overlap). As long as you don't find an identical match, you'll have to keep searching though...
Therefore, it may be more useful to sort ''I'' first; this will cost less than searching the whole tree. There are plenty of fast sorting algorithms. After ''I'' is sorted, finding the closest match is a piece of cake.
Helio Perroni Filho
Federal University of Espírito Santo
That's a good question. I plan to run some tests to confirm this, but my feeling is that the distribution of bit arrays over the domain space is not random; there ought to be some clustering, and therefore varying expectations for each bit array index ''i'' (''0 <= i < n'') across the data set.
Yes, looking first into the branches (indices) with the highest expectation, hoping to quickly lock onto a restricted set of possible best matches, looks like a promising strategy. I've come up with a dynamic programming algorithm around it (though it'll probably be Hell to implement); right now I'm busy finishing a report, but once I'm done I'll start working on it. Stay tuned.
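To make the "most promising branch first" idea concrete, here is a minimal sketch. It is not the dynamic programming algorithm mentioned above, just one possible reading of the strategy: store ''I'' in a binary trie and expand partial paths in order of mismatches accumulated so far. The names `build_trie` and `nearest_best_first` and the tuple-of-bits representation are assumptions made for this example.

```python
import heapq
from itertools import count


def build_trie(arrays):
    """Store equal-length bit tuples in a trie of nested dicts keyed by bit."""
    root = {}
    for bits in arrays:
        node = root
        for b in bits:
            node = node.setdefault(b, {})
    return root


def nearest_best_first(root, t):
    """Return (distance, array) for a stored array closest to t, or None."""
    tie = count()  # unique tie-breaker so the heap never compares dicts
    heap = [(0, next(tie), 0, root, ())]
    while heap:
        mismatches, _, depth, node, prefix = heapq.heappop(heap)
        if depth == len(t):
            # Mismatch counts never decrease along a path, so the first
            # complete array popped from the heap is a nearest neighbor.
            return mismatches, prefix
        for b, child in node.items():
            cost = mismatches + (b != t[depth])
            heapq.heappush(heap, (cost, next(tie), depth + 1, child, prefix + (b,)))
    return None


if __name__ == "__main__":
    trie = build_trie([(0, 0, 0, 0), (1, 1, 1, 1), (1, 0, 1, 0)])
    print(nearest_best_first(trie, (0, 0, 0, 1)))
```

An exact match is found by following a single path. For a distant ''t'' the search may still visit a large part of the trie, so whether this beats a plain linear scan depends on how clustered the data is.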
Thomas P A Debray
University Medical Center Utrecht
One additional tip: you could use multithreading to search multiple promising branches simultaneously. In this manner, you could keep a relatively broad perspective on the problem without narrowing down too much on a potentially worthless branch, or evaluating too many useless branches.
The tree search approach is mostly useful if there are multiple solutions and you only need one, though. If you want all possible solutions, and especially if you are not sure when the optimal solution has been reached, I think sorting ''I'' is going to be much more effective.
