Tree Based Index (TBI) System
Getting Started with TBI
Jia Xu1, Zhenjie Zhang2, Anthony K. H. Tung2, Ge Yu1
May 5, 2010
Probabilistic data is arriving as a new deluge along with technical
advances in geographical tracking, multimedia processing, sensor networks
and RFID. While similarity search is an important functionality supporting
the manipulation of probabilistic data, it raises new challenges for
traditional relational databases. The problem stems from the limited
effectiveness of the distance metrics supported by existing database
systems. On the other hand, some more complicated distance operators have
proven their value through better distinguishing ability in the
probabilistic domain. In this paper, we discuss the similarity search
problem with the Earth Mover’s Distance, which is the most successful
distance metric on probabilistic histograms and an expensive operator with
cubic complexity. We present a new Tree-Based Index (TBI) approach to
answer range queries and k-nearest neighbor queries on probabilistic data,
on the basis of the Earth Mover’s Distance. Our solution utilizes the
primal-dual theory in linear programming and deploys B+ tree index
structures for effective candidate pruning. Extensive experiments show that
our proposal dramatically improves the scalability of probabilistic databases.
This work has been submitted to VLDB 2010.
2 Contents of Code Package
The TBIcode.zip file consists of four subfolders. The content of each
subfolder is introduced in order.
The data subfolder (TBIcode\data\) consists of 6 data files corresponding to
the DBLP data set. The DBLP data set is a probabilistic histogram set with 8
dimensions and 50000 records, generated from the DBLP database retrieved in
Oct. 2007. The 8 dimensions of each data record represent 8 different domains
in computer science, including application, database, hardware, software,
system, theory and bio-information. We define the feature of each domain/bin
by its degree of correlation with the following three aspects: computer,
mathematics and architecture. Thus, each dimension has a 3-dimensional
feature vector. For further details of the DBLP data set, please refer
to . The format and content of each data file are given below.
dblp-data-50000.txt: this file includes the histograms of DBLP data. The
first line of this file is given as ‘50000 8’ where ‘50000’ is the cardinality of the
histogram set and ‘8’ is the dimensionality of the histograms. The remaining
lines follow the format of ‘record-id histogram’.
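The layout described above can be read with a few lines of C++. The following is a minimal sketch, not the loader used inside the TBI framework; it assumes that the record id and the histogram values on each line are separated by whitespace, and the names HistogramRecord and readHistograms are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// One record of the histogram file: an id followed by `dim` bin values.
struct HistogramRecord {
    std::string id;            // kept as text; the id type is not specified
    std::vector<double> bins;  // one value per histogram bin
};

// Reads "<cardinality> <dimensionality>" followed by one
// "record-id v1 ... vd" entry per record (whitespace-separated values
// are an assumption of this sketch).
std::vector<HistogramRecord> readHistograms(std::istream& in) {
    std::size_t count = 0, dim = 0;
    if (!(in >> count >> dim)) {
        throw std::runtime_error("missing header line");
    }
    std::vector<HistogramRecord> records;
    records.reserve(count);
    for (std::size_t r = 0; r < count; ++r) {
        HistogramRecord rec;
        rec.bins.resize(dim);
        if (!(in >> rec.id)) {
            throw std::runtime_error("fewer records than the header announces");
        }
        for (std::size_t d = 0; d < dim; ++d) {
            if (!(in >> rec.bins[d])) {
                throw std::runtime_error("truncated histogram");
            }
        }
        records.push_back(rec);
    }
    assert(records.size() == count);
    return records;
}
```

To read a file on disk, open it with std::ifstream and pass the stream to readHistograms.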
dblp-center.txt: this file contains the features of each histogram bin. The
first line ‘8 3’ gives the number of bins used in the histograms and the
size of the vectors describing the bins. In particular, the ground distance
between two bins is calculated as the Euclidean distance between the
corresponding vectors in this file.
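To make the ground distance concrete, the sketch below computes the pairwise Euclidean distances between bin feature vectors. It illustrates the definition only and is not the code inside the TBI library; the names FeatureVector, euclideanDistance and groundDistances are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

typedef std::vector<double> FeatureVector;

// Euclidean distance between two bin feature vectors of equal length.
double euclideanDistance(const FeatureVector& a, const FeatureVector& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double diff = a[i] - b[i];
        sum += diff * diff;
    }
    return std::sqrt(sum);
}

// Pairwise ground distances between all bins; entry [i][j] is the
// distance between the feature vectors of bin i and bin j.
std::vector<std::vector<double> > groundDistances(
        const std::vector<FeatureVector>& centers) {
    const std::size_t n = centers.size();
    std::vector<std::vector<double> > dist(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            dist[i][j] = dist[j][i] = euclideanDistance(centers[i], centers[j]);
        }
    }
    return dist;
}
```

For the DBLP center file, centers would hold 8 vectors of 3 values each, giving an 8 x 8 symmetric matrix with a zero diagonal.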
dblp-reduce8-8.txt: this file is needed only if dimensionality reduction is
run over the histograms. The first line of this file is ‘8 8’, denoting the
original dimensionality and the new dimensionality of the dimensionality
reduction process. The following lines represent the reduction matrix obtained
following the method introduced by Wichterich et al. [1].
dblp-solution.txt: this file is essential in our index framework. It keeps the
feasible solutions for the construction of the index structure. The current
file contains 8 lines representing 4 feasible solutions (two lines compose one
feasible solution) based on the primal-dual theory. To generate new feasible
solutions for your own data and index, you need to run our
generateFeasibleSolution.exe under the directory
TBIcode\generateFeasibleSolution\ to obtain the feasible solutions based on
the histogram data for index construction.
task.txt: in this file, we define the basic operations. The first line gives
the number of operations and the dimensionality of the histograms. The
remaining lines comply with the following rules. An insertion operation has
the format ‘I userId histogram’. A deletion operation has the format
‘D userId histogram’. A kNN query has the format ‘K parameterK histogram’,
and a range query has the format ‘R thresh histogram’.
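An operation file in this format can also be produced programmatically. The sketch below writes the header line followed by one line per operation. It assumes that the histogram values are separated by single spaces, which the guide does not state explicitly; the names Operation and writeTasks are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// One line of the operation file.
struct Operation {
    char kind;                      // 'I', 'D', 'K' or 'R'
    std::string argument;           // userId, parameterK or thresh, as text
    std::vector<double> histogram;  // must have `dim` values
};

// Writes "<number of operations> <dimensionality>" and then one
// "<kind> <argument> <histogram>" line per operation.
// Values are printed with the stream's default precision.
void writeTasks(std::ostream& out, const std::vector<Operation>& ops,
                std::size_t dim) {
    out << ops.size() << ' ' << dim << '\n';
    for (std::size_t i = 0; i < ops.size(); ++i) {
        const Operation& op = ops[i];
        assert(op.kind == 'I' || op.kind == 'D' ||
               op.kind == 'K' || op.kind == 'R');
        assert(op.histogram.size() == dim);
        out << op.kind << ' ' << op.argument;
        for (std::size_t d = 0; d < dim; ++d) {
            out << ' ' << op.histogram[d];
        }
        out << '\n';
    }
}
```

Passing a std::ofstream opened on your operation file stores the result on disk.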
parameter.txt: you need to fill in the parameter file according to the
settings of your program. For the DBLP data set, refer to the parameter.txt
file in this subfolder. In this file, the string following the mark ‘#’
indicates the meaning of the next line. For example, ‘#DataHistogramFile’
indicates that the next line should contain the name of the data histogram
file. If you do not want to use the data reduction method on your histograms,
the line under ‘#ReducedMatrixFile’ can be filled with any string or left
empty. In our code, the index page size is set to 4096, and the cache size is
specified as the number of pages it can contain. We offer both the Euclidean
and the Manhattan distance as candidate ground distances for the Earth Mover’s
Distance (EMD). You can choose your ground distance by setting the parameter
file appropriately. Do not forget to use the Manhattan-based feasible solution
file when you use Manhattan as the ground distance in your program.
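The ‘#’ convention of the parameter file is simple to parse. The sketch below reads each marker line and stores the line that follows it as the value. It is an illustration under the assumption that every marker is followed by exactly one value line (possibly empty); readParameters is a hypothetical helper, and only the two keys named in this guide are used in the test.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Removes a trailing '\r' so that files with Windows line endings work.
static void stripCarriageReturn(std::string& s) {
    if (!s.empty() && s[s.size() - 1] == '\r') {
        s.erase(s.size() - 1);
    }
}

// Maps each "#Key" marker to the content of the line that follows it.
// Lines that are not preceded by a marker are ignored.
std::map<std::string, std::string> readParameters(std::istream& in) {
    std::map<std::string, std::string> params;
    std::string line;
    while (std::getline(in, line)) {
        stripCarriageReturn(line);
        if (line.empty() || line[0] != '#') {
            continue;
        }
        const std::string key = line.substr(1);
        std::string value;        // stays empty if the file ends here
        std::getline(in, value);
        stripCarriageReturn(value);
        params[key] = value;
    }
    return params;
}
```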
The next subfolder contains one ‘.dll’ file and one ‘.lib’ file. The ‘.dll’
file contains the executable of our TBI framework and two external programming
interfaces that your own C++ code can call:
void iniTBISystem(const char* paraFileName)
void executeOperationFile(const char* fileName)
The function iniTBISystem is responsible for initializing the TBI system and
takes the parameter file (e.g., ‘parameter.txt’) as its argument. The function
executeOperationFile takes the operation file (e.g., ‘task.txt’) as its
argument and executes all operations in it.
The example subfolder (TBIcode\example\) includes a C++ project showing how to
use our TBI framework to process your EMD-based queries. This project is
compiled with Microsoft Visual Studio 2005. Two major files (example.h and
example.cpp) are included in this project. Do not forget that the iniTBISystem
function must be called before any call to the executeOperationFile function,
because the TBI system needs to be initialized before any operation is
performed in it.
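One way to guard against calling the two interface functions in the wrong order is a small wrapper that remembers whether initialization has happened. The sketch below is not part of the example project; TbiSession is a hypothetical class, and it takes the two functions as plain function pointers so that it can be compiled and tried without the DLL.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Pointer types matching the two interface signatures given in this guide.
typedef void (*InitFn)(const char* paraFileName);
typedef void (*ExecFn)(const char* fileName);

// Refuses to execute an operation file before the system is initialized.
class TbiSession {
public:
    TbiSession(InitFn init, ExecFn exec)
        : init_(init), exec_(exec), ready_(false) {
        assert(init != 0 && exec != 0);
    }

    // Forwards to the initialization function and records that it ran.
    void initialize(const std::string& parameterFile) {
        init_(parameterFile.c_str());
        ready_ = true;
    }

    // Forwards to the execution function, but only after initialize().
    void run(const std::string& operationFile) {
        if (!ready_) {
            throw std::logic_error("initialize() must be called before run()");
        }
        exec_(operationFile.c_str());
    }

private:
    InitFn init_;
    ExecFn exec_;
    bool ready_;
};
```

With the library linked, a session could be created as TbiSession session(iniTBISystem, executeOperationFile), followed by session.initialize("parameter.txt") and session.run("task.txt"). This assumes the imported functions are declared in your code and use a calling convention compatible with plain function pointers.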
This subfolder contains an ‘.exe’ file for generating your own feasible
solution file, like the ‘dblp-solution.txt’ file mentioned above (see Section 4).
3 Run the Example Application
To run the example application in TBIcode\example\, follow these 3 steps.
1. Copy the .lib file into the directory TBIcode\example\. Then compile
the example project to obtain the executable file example.exe.
2. Copy the example.exe file together with the file TBIDll.dll into the
directory TBIcode\data\.
3. Now everything is ready. Enter the directory TBIcode\data\ and run
the example.exe file.
After successfully completing these 3 steps, you will get a result file
named ‘operationFileName.rlt’, where ‘operationFileName’ is the name of
your operation file. For example, the result file for the operation file
task.txt is ‘task.rlt’. The result file uses the following formats:
• kNN Query: ‘K resultList’
• Range Query: ‘R resultList’
• Insertion: ‘I userId’
• Deletion: ‘D userId’
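If you post-process the results in your own code, the naming rule and the line formats above can be handled as follows. This is a minimal sketch; resultFileName, ResultLine and parseResultLine are hypothetical names, and it assumes that the leading letter is separated from the rest of the line by a single space.

```cpp
#include <cassert>
#include <string>

// Replaces the extension of the operation file name with ".rlt"
// (appends ".rlt" if the file name has no extension).
std::string resultFileName(const std::string& operationFile) {
    const std::string::size_type slash = operationFile.find_last_of("/\\");
    const std::string::size_type dot = operationFile.rfind('.');
    const bool hasExtension = dot != std::string::npos &&
        (slash == std::string::npos || dot > slash);
    const std::string stem =
        hasExtension ? operationFile.substr(0, dot) : operationFile;
    return stem + ".rlt";
}

// One line of the result file, split into its tag and the remainder.
struct ResultLine {
    char kind;            // 'K', 'R', 'I' or 'D'
    std::string payload;  // resultList or userId, unparsed
};

ResultLine parseResultLine(const std::string& line) {
    assert(!line.empty());
    ResultLine result;
    result.kind = line[0];
    result.payload = line.size() > 2 ? line.substr(2) : std::string();
    return result;
}
```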
4 Generate Your Feasible Solution File
The feasible solution file mentioned above can be generated using our
‘genFeaSolu.exe’ file under the directory TBIcode\genFeaSolu\. The ‘.exe’
file takes 4 parameters as input: the histogram file name, the center file
name, the result file name and the ground distance type (0 for Euclidean
distance and 1 for Manhattan distance). For example, to generate a feasible
solution file from the histogram file ‘dblp-data-50000.txt’ with the center
file ‘dblp-center.txt’, the result file ‘dblp-solu.txt’ and the Euclidean
distance as the ground distance, the call should be written as
genFeaSolu.exe dblp-data-50000.txt dblp-center.txt dblp-solu.txt 0
The generated feasible solution file will contain 20 lines (10 feasible solutions).
Note that line 1 and line 2 must be used together, because they jointly
denote one feasible solution in the dual space. Similarly, line 3 and
line 4, and so on, must be used together. Following this rule, you can choose
any 4 pairs from them to make your final feasible solution file (our TBI
framework only asks for 4 feasible solutions).
According to our paper, there are two ways of generating a feasible solution
file: the clustering-based method and the random-sampling-based method. You
can obtain a clustering-based feasible solution file by passing your
clustering results as the histogram file (the first parameter of
‘genFeaSolu.exe’).
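Choosing pairs from the generated file can be automated. The sketch below copies the two lines of each chosen feasible solution together, so that a pair is never split. It is an illustration only; selectSolutionPairs is a hypothetical helper, and pair indices are counted from 0 (pair 0 is lines 1 and 2 of the file).

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Returns the lines of the chosen feasible solutions, two lines per pair.
std::vector<std::string> selectSolutionPairs(
        const std::vector<std::string>& lines,
        const std::vector<std::size_t>& pairIndices) {
    if (lines.size() % 2 != 0) {
        throw std::invalid_argument("expected an even number of lines");
    }
    std::vector<std::string> selected;
    for (std::size_t i = 0; i < pairIndices.size(); ++i) {
        const std::size_t first = 2 * pairIndices[i];
        if (first + 1 >= lines.size()) {
            throw std::out_of_range("pair index out of range");
        }
        selected.push_back(lines[first]);      // first line of the pair
        selected.push_back(lines[first + 1]);  // second line of the pair
    }
    assert(selected.size() == 2 * pairIndices.size());
    return selected;
}
```

For the TBI framework, pairIndices would hold 4 entries, producing the 8 lines of the final feasible solution file.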
[1] M. Wichterich, I. Assent, P. Kranen, and T. Seidl. Efficient EMD-based
similarity search in multimedia databases via flexible dimensionality
reduction. In SIGMOD Conference, pages 199–212, 2008.
[2] J. Xu, Z. Zhang, A. K. H. Tung, and G. Yu. Efficient and effective
similarity search over probabilistic data based on Earth Mover’s Distance.
In VLDB Conference, under review, 2010.
[3] Z. Zhang, B. C. Ooi, S. Parthasarathy, and A. K. H. Tung. Similarity
search on Bregman divergence: Towards non-metric indexing. PVLDB,