University Medical Center Utrecht
Question
Asked 9 February 2012
Is there an algorithm faster than k-d trees for nearest-neighbor search in the domain of bit arrays of length (n)?
Let ''I'' be a set of bit arrays of length ''n''. Given a bit array ''t'', it is desired to find the array ''i'' in ''I'' which is closest to ''t'' in terms of their Hamming distance [1] ''h'' – which may be 0 if ''t'' is in ''I'', but most likely ''h > 0''.
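For reference, the problem as stated can be sketched as a linear-scan baseline. This is only an illustration, not a proposed solution: it assumes each bit array is stored as a Python integer, and the names `hamming` and `nearest` are made up for the example.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def nearest(I, t):
    """Return (i, h): the element of I closest to t and its Hamming distance.

    Linear scan over a non-empty collection I; cost grows with len(I).
    """
    best = min(I, key=lambda i: hamming(i, t))
    return best, hamming(best, t)


# Toy data with n = 8 (assumed values, for illustration only).
I = [0b10110010, 0b01101100, 0b11110000]
t = 0b10110110
i, h = nearest(I, t)
```

Any faster algorithm has to return the same ''i'' and ''h'' as this scan, which makes it a convenient correctness check.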
Clearly this is an instance of the multidimensional nearest-neighbor problem, but if ''n'' is too large (say, ''n = 256''), traditional nearest-neighbor algorithms such as k-d trees become impractical. However, it also seems certain that the domain allows for some optimizations – after all, only two values (0 and 1) are possible for each dimension along the data points.
Does anyone know of a nearest-neighbor algorithm that is optimized for this case?
All Answers (3)
Is there some kind of structure in "I"? If so, you can adapt your search algorithm and considerably reduce the search space. If "I" is totally random, it becomes much more difficult. However, one potential optimization is to explore the branches with the highest expectations first (i.e. branches that already have a good overlap). As long as you don't find an identical match, you'll have to keep searching, though...
Therefore, it may be more useful to sort "I" first; this will cost less than searching the whole tree. There are plenty of fast sorting algorithms. After "I" is sorted, finding the closest match is a piece of cake.
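The idea of exploring the most promising branches first, mentioned above, could look roughly like the following. This is a minimal sketch under stated assumptions, not the answerer's actual method: bit arrays are stored as Python integers, the tree is a plain binary trie built from nested dicts, and `build_trie` and `search` are hypothetical names.

```python
def build_trie(I, n):
    """Build a binary trie; each root-to-leaf path of n bits (MSB first) is one array."""
    root = {}
    for x in I:
        node = root
        for pos in range(n - 1, -1, -1):
            node = node.setdefault((x >> pos) & 1, {})
    return root


def search(root, t, n):
    """Exact nearest neighbour of t: try the matching bit first, prune on mismatches."""
    best = [n + 1, None]  # [best distance so far, best array so far]

    def visit(node, pos, mismatches, prefix):
        if mismatches >= best[0]:
            return  # this branch cannot beat the current best
        if pos < 0:
            best[0], best[1] = mismatches, prefix  # reached a stored array
            return
        bit = (t >> pos) & 1
        for b in (bit, 1 - bit):  # branch that agrees with t goes first
            child = node.get(b)
            if child is not None:
                visit(child, pos - 1, mismatches + (b != bit), (prefix << 1) | b)

    visit(root, n - 1, 0, 0)
    return best[1], best[0]
```

The search stops early only when an exact match (distance 0) is found; otherwise pruning depends on how quickly a good candidate turns up, so on random data it may still visit a large part of the trie.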
Federal University of Espírito Santo
That's a good question. I plan to run some tests to confirm this, but my feeling is that the distribution of bit arrays over the domain space is not random – there ought to be some clustering, and therefore varying expectations for each bit array index ''i'' (''0 <= i < n'') across the data set.
Yes, looking first into the branches (indices) with the highest expectation, hoping to quickly lock onto a restricted set of possible best matches, looks like a promising strategy. I've come up with a dynamic programming algorithm around it (though it'll probably be hell to implement); right now I'm busy finishing a report, but once I'm done I'll start working on it. Stay tuned.
University Medical Center Utrecht
One additional tip: you could use multithreading to search multiple promising branches simultaneously. In this manner, you could keep a relatively broad perspective on the problem without narrowing down too much on a potentially worthless branch, or evaluating too many useless branches.
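As a rough illustration of searching several parts of the data simultaneously, the sketch below splits the set into chunks and scans them in a thread pool. It is an assumption-laden example rather than a tuned implementation: bit arrays are Python integers, the set is non-empty, and `nearest_in_chunk` and `parallel_nearest` are invented names. Note that in CPython the global interpreter lock limits the speed-up for pure-Python work like this, so the code mainly shows the structure.

```python
from concurrent.futures import ThreadPoolExecutor


def hamming(a, b):
    """Number of differing bit positions."""
    return bin(a ^ b).count("1")


def nearest_in_chunk(chunk, t):
    """Return (distance, array) for the closest array in one chunk."""
    best = min(chunk, key=lambda i: hamming(i, t))
    return hamming(best, t), best


def parallel_nearest(I, t, workers=4):
    """Scan chunks of a non-empty list I concurrently and keep the overall best."""
    size = max(1, (len(I) + workers - 1) // workers)
    chunks = [I[k:k + size] for k in range(0, len(I), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda c: nearest_in_chunk(c, t), chunks))
    h, best = min(results)  # smallest distance wins
    return best, h
```

Each worker keeps its own best candidate, so no branch is abandoned just because another one looked better early on; the results are merged only at the end.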
The tree search approach is mostly useful if there are multiple solutions and you only need one, though. If you want all possible solutions, and especially if you are not sure when the optimal solution has been reached, I think sorting "I" is going to be much more effective.