Implementing Inference Networks for Information Retrieval System in Indonesian Language.
IMPLEMENTING INFERENCE NETWORK
FOR INFORMATION RETRIEVAL SYSTEM
IN INDONESIAN LANGUAGE
Adi Wahyu Pribadi, Zaenal Arifin Hasibuan
Faculty of Computer Science
University of Indonesia
Depok 16424, Indonesia
adi_wahyu@softhome.net, zhasibua@cs.ui.ac.id
Abstract
Information Retrieval is concerned with selecting documents from a collection that will be of
interest to a user with a stated information need or query. This paper describes a retrieval model
that uses probabilistic inference networks consisting of a document network, which is built once to
represent the document collection, and a query network, which is built each time a new query or
information need is given. The first section of this paper briefly introduces the inference network
model. Next, a simple example illustrates how inference networks work on document collections. In
the following section, the model is implemented using news articles taken from Republika Online and
Tempo Interaktif. Boolean and extended Boolean models are used as baselines for comparison with the
proposed model. The paper concludes that inference networks make it possible to develop techniques
that improve performance over conventional retrieval models. The model also has limitations and
remains open to further study. One of the limitations is the lack of document collections in the
Indonesian language. In the future, additional document collections should be used to obtain more
precise and detailed results for measuring the effectiveness of the inference network model.
1. Introduction
Information retrieval is an inference or evidential reasoning process in which we estimate the
probability that a user’s information need, expressed as one or more queries, is met given a
document as evidence [6]. Therefore network representations can be used to model the technique.
Turtle implemented an information retrieval system based on a Bayesian inference network [7]. Two
years later, Croft and Turtle applied the system to hypertext [3]. Tzeras and Hartman
proposed automatic indexing based on Bayesian inference networks [8]. Callan, Lu, and Croft
also used Bayesian inference networks for distributed document collections and discussed the
implementation in the INQUERY system [2].
In this paper, the model is used to retrieve relevant Indonesian news articles from document
collections. Because the model also ranks the retrieved documents, we want to see whether the most
relevant document is listed first, the second most relevant document is listed second, and so on.
2. Inference Network
A Bayesian inference network is a directed, acyclic dependency graph (DAG) in which nodes
represent propositional variables or constants and edges represent dependence relations between
propositions. If a proposition represented by a node p “causes” or implies the proposition
represented by node q, then p and q are joined by a directed edge (p → q). The node q contains a
link matrix that specifies P(q|p) for all possible values of the two variables. When a node has
multiple parents, the link matrix specifies the dependence of that node on the set of parents (π_q) and
characterizes the dependence relationship between that node and all nodes representing its potential
causes. Given a set of prior probabilities for the roots of the DAG, these networks can be used to
compute the probability or degree of belief associated with all remaining nodes [6].
Different restrictions on the topology of the network and assumptions about the way in which the
connected nodes interact lead to different schemes for combining probabilities. In general, these
schemes have two components which operate independently: a predictive component in which
parent nodes provide support for their children (the degree to which we believe a proposition
depends on the degree to which we believe the propositions that might cause it), and a diagnostic
component in which children provide support for their parents (if our belief in a proposition
increases or decreases, so does our belief in its potential causes). The propagation of probabilities
through the net can be done using information passed between adjacent nodes [6].
Figure 1: Basic document inference network [6]
The inference network can rank documents given many sources of evidence by combining the
confidence that each source provides. It is used to model documents, the contents of documents, and
the query given by a user. As shown in Figure 1, the inference network consists of two
sub-networks: the document network, which is produced during indexing and then remains static
during the retrieval process, and the query network, which is produced from the query text during
the retrieval process [5].
2.1. Document Network
As shown in Figure 1, the document network consists of document nodes (di’s), document content
nodes (ti’s), and keyword or concept representation nodes (rk’s). The document nodes represent the
retrievable units within the network, that is, those items we wish to see in the resultant ranking. A
causal link between a document node and a concept node indicates that the document content is
represented by the concept. Each link contains a conditional probability, or weight, to indicate the
strength of the relationship. A node is evaluated using the values of its parent nodes and
the conditional probabilities [5].
2.2. Query Network
The query network represents the submitted query and consists of a framework that represents the
required concepts (query concept nodes) and the operators (query operator nodes), connected in an
inverted tree structure. The query network is constructed with a final leaf node I that represents the
user's information need [5].
Two further processes are needed to perform retrieval. In the attachment process, the query network
is attached to the document network wherever the concepts in both networks are the same, forming
the complete inference network. In the evaluation process, the complete inference network is
evaluated for each document node to obtain the probability of its relevance to the query.
The evaluation is initialized by setting the output of one document node to 1 and all the other
document nodes to 0. This is done for each document node in turn, and the network is evaluated each
time. The probability of document relevance is taken from the final node I and is used to produce
the ranking [5].
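The attachment and evaluation steps can be sketched in a few lines of Python. This is only an illustrative toy, not the system described in this paper: the document names, concepts, and link weights are invented, the document content and concept layers are collapsed into a single layer, and the two operator functions stand in for query operator nodes.

```python
from math import prod

# Hypothetical toy data: P(concept = true | document = true) for each link.
# A concept receives belief 0 from every document that is set to 0.
DOC_CONCEPT_WEIGHTS = {
    "d1": {"banjir": 0.9, "jakarta": 0.4},
    "d2": {"banjir": 0.2, "jakarta": 0.8},
    "d3": {"ekonomi": 0.7},
}

def bel_and(beliefs):
    """Stand-in AND operator node: product of the parent beliefs."""
    return prod(beliefs)

def bel_or(beliefs):
    """Stand-in OR operator node: 1 minus the product of the complements."""
    return 1.0 - prod(1.0 - b for b in beliefs)

def evaluate(query_concepts, operator, doc_id):
    """Belief in the information need I when only doc_id is set to 1."""
    weights = DOC_CONCEPT_WEIGHTS[doc_id]
    # Attachment: a query concept is connected only where the same concept
    # exists in the document network; otherwise its belief is 0.
    beliefs = [weights.get(concept, 0.0) for concept in query_concepts]
    return operator(beliefs)

def rank(query_concepts, operator):
    """Evaluate the network once per document and sort by belief in I."""
    scores = {d: evaluate(query_concepts, operator, d) for d in DOC_CONCEPT_WEIGHTS}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = rank(["banjir", "jakarta"], bel_and)
```

With these invented weights, the AND query places d1 ahead of d2, and d3 receives a belief of 0 because it shares no concept with the query.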
2.3. Link Matrix Forms
For every non-root node in the inference network, the probability can be estimated from the values
of its parent nodes [7]. If a node a has a set of parents π_a = {p1, …, pn}, we must estimate
P(a|p1, …, pn). The most direct way to encode our estimate is as a link matrix. Since propositions
take only the values true or false, this matrix is of size 2 × 2^n for n parents and specifies the
probability that a takes the value a = true or a = false for all combinations of parent values. The
update procedures for Bayesian networks then use the probabilities provided by the set of parents to
condition over the link matrix values to compute the predictive component of our belief in a, or
P(a = true).
We will describe five canonical link matrix forms. For illustration, we will assume the node Q has
three parents A, B, and C and that
P(A = true) = a, P(B = true) = b, P(C = true) = c
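For OR combinations, Q will be true when any of A, B, or C is true and false only when A, B, and C
are all false. This suggests a link matrix of the form
\[
L_{OR} = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 & 1 & 1 & 1 & 1
\end{pmatrix}
\]
where the first row gives P(Q = false) and the second row gives P(Q = true), and the columns run
over the combinations of values of A, B, and C from all false to all true.
Using a closed form of the update procedures, we have
P(Q = true) = (1 – a)(1 – b)c + (1 – a)b(1 – c) + (1 – a)bc + a(1 – b)(1 – c)
+ a(1 – b)c + ab(1 – c) + abc
= 1 – (1 – a)(1 – b)(1 – c)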
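The closed form for OR can be checked numerically by conditioning over the link matrix directly. The following is a minimal sketch, assuming independent parents and an ordering of parent configurations from all false to all true; the function name and the sample beliefs are chosen here for illustration.

```python
from itertools import product

def belief_from_link_matrix(true_row, parent_beliefs):
    """P(Q = true) obtained by conditioning over a link matrix.

    true_row[k] is P(Q = true | k-th parent configuration). Configurations
    are enumerated from all false to all true, with the first parent
    varying slowest. Parents are assumed to be independent.
    """
    n = len(parent_beliefs)
    if len(true_row) != 2 ** n:
        raise ValueError("the link matrix row needs 2**n entries")
    total = 0.0
    for k, config in enumerate(product([False, True], repeat=n)):
        # Probability of this particular combination of parent values.
        prob_config = 1.0
        for is_true, p in zip(config, parent_beliefs):
            prob_config *= p if is_true else (1.0 - p)
        total += true_row[k] * prob_config
    return total

# "Q = true" rows for three parents: OR is false only when all are false,
# AND is true only when all are true.
L_OR_TRUE_ROW = [0, 1, 1, 1, 1, 1, 1, 1]
L_AND_TRUE_ROW = [0, 0, 0, 0, 0, 0, 0, 1]

a, b, c = 0.3, 0.5, 0.8  # illustrative values for P(A), P(B), P(C)
or_via_matrix = belief_from_link_matrix(L_OR_TRUE_ROW, [a, b, c])
or_closed_form = 1 - (1 - a) * (1 - b) * (1 - c)
and_via_matrix = belief_from_link_matrix(L_AND_TRUE_ROW, [a, b, c])
```

The loop visits all 2^n parent configurations, which is why the closed forms are preferred in practice: they give the same value with work linear in the number of parents.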
Similar matrices can be developed for AND (P(Q = true) = abc) and NOT (P(Q = true) = 1 – a).
To summarize the results (for details of how the link matrices are constructed, see [7]), the
following closed-form expressions can be used to evaluate the canonical matrices for a node Q with
parents P1, P2, …, Pn where P(P1 = true) = p1, …, P(Pn = true) = pn.
\[
\begin{aligned}
bel_{OR}(Q) &= 1 - (1 - p_1)(1 - p_2) \cdots (1 - p_n) \\
bel_{AND}(Q) &= p_1 \, p_2 \cdots p_n \\
bel_{NOT}(Q) &= 1 - p_1 \\
bel_{SUM}(Q) &= \frac{p_1 + p_2 + \cdots + p_n}{n} \\
bel_{WTD}(Q) &= \frac{(w_1 p_1 + w_2 p_2 + \cdots + w_n p_n)\, w_q}{w_1 + w_2 + \cdots + w_n}
\end{aligned}
\]
Here w1, …, wn are the weights of the parents and wq is the weight associated with Q.
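The five closed-form expressions translate directly into code. The sketch below is one possible rendering; the function and parameter names are chosen here for illustration, and the sample beliefs are arbitrary.

```python
from math import prod

def bel_or(p):
    """OR: 1 - (1 - p1)(1 - p2)...(1 - pn)."""
    return 1.0 - prod(1.0 - x for x in p)

def bel_and(p):
    """AND: p1 * p2 * ... * pn."""
    return prod(p)

def bel_not(p1):
    """NOT: 1 - p1, for a node with a single parent."""
    return 1.0 - p1

def bel_sum(p):
    """SUM: unweighted mean of the parent beliefs."""
    return sum(p) / len(p)

def bel_wtd(p, w, w_q):
    """WTD: weighted mean of the parent beliefs, scaled by the weight w_q."""
    return w_q * sum(wi * pi for wi, pi in zip(w, p)) / sum(w)

beliefs = [0.3, 0.5, 0.8]  # illustrative parent beliefs p1, p2, p3
summary = {
    "or": bel_or(beliefs),
    "and": bel_and(beliefs),
    "not": bel_not(beliefs[0]),
    "sum": bel_sum(beliefs),
    "wtd": bel_wtd(beliefs, w=[2.0, 1.0, 1.0], w_q=0.5),
}
```

With equal parent weights and w_q = 1, `bel_wtd` reduces to `bel_sum`.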
2.4. Weighting Scheme with TF-IDF
We assume that belief in a representation concept is proportional to the within-document frequency
(tf) and inversely proportional to the frequency of the concept in the collection. The collection
frequency component is generally expressed as the term’s inverse document frequency (idf), which
is given by
\[
idf = \log\left(\frac{\text{collection size}}{\text{concept frequency}}\right)
\quad \text{or} \quad
idf_i = \log\left(\frac{N}{df_i}\right)
\]
Normalizing both tf and idf to the range [0,1], we have
\[
ntf_{ij} = \frac{\text{frequency of term } i \text{ in document } j}{\text{max frequency of term in document } j}
\quad \text{or} \quad
ntf_{ij} = \frac{tf_{ij}}{max\_tf_j}
\]
\[
idf_i = \frac{\log\left(\dfrac{\text{collection size}}{\text{frequency of term } i \text{ in collection}}\right)}{\log(\text{collection size})}
\quad \text{or} \quad
idf_i = \frac{\log\left(\dfrac{N}{df_i}\right)}{\log(N)}
\]
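As a small worked example, the normalized tf and idf values can be computed for a toy collection. The four tokenized documents below are invented, and combining the two components by multiplication in `weight` is an assumption made only for this illustration; the text above states proportionality but does not fix the combination.

```python
import math
from collections import Counter

# Hypothetical tokenized collection (N = 4 documents).
DOCS = {
    "d1": ["banjir", "jakarta", "banjir"],
    "d2": ["jakarta", "macet"],
    "d3": ["banjir", "ekonomi"],
    "d4": ["ekonomi", "rupiah"],
}

def normalized_tf(tokens):
    """ntf_ij = tf_ij / max_tf_j for every term i in one document j."""
    counts = Counter(tokens)
    max_tf = max(counts.values())
    return {term: tf / max_tf for term, tf in counts.items()}

def normalized_idf(docs):
    """idf_i = log(N / df_i) / log(N); needs N > 1 so that log(N) != 0."""
    n_docs = len(docs)
    # df_i counts the documents that contain term i at least once.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    return {term: math.log(n_docs / d) / math.log(n_docs) for term, d in df.items()}

NTF = {doc: normalized_tf(tokens) for doc, tokens in DOCS.items()}
NIDF = normalized_idf(DOCS)

def weight(term, doc):
    """Assumed combination: product of the two normalized components."""
    return NTF[doc].get(term, 0.0) * NIDF.get(term, 0.0)
```

Here "banjir" occurs in two of the four documents, so its normalized idf is log(2)/log(4) = 0.5, while "macet" occurs in only one document and receives the maximum value of 1.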
In the conventional Boolean model, the weight of an index term is binary (0 or 1), indicating
whether the index term is present in a document or not. As a result, the retrieved documents are
not ranked. If the conventional Boolean model is simulated with the inference network model, we have
P(ti|dj = true) = 1 and P(ti|dj = false) = 0. Fitriyanti [4] solves the ranking problem with
P(ti|dj = true) = tfij and P(ti|dj = false) = 0. This formula ranks the retrieved documents
using the frequency of the term in each document.
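The difference between the two weightings can be seen on a single-term query. In the sketch below, the term counts are invented, and dividing each count by the largest count in the document is an assumption made here so that the frequency-based weight stays in [0, 1]; it is not necessarily the normalization used in [4].

```python
# Hypothetical raw term counts tf_ij for three documents.
TERM_COUNTS = {
    "d1": {"banjir": 3, "jakarta": 3},
    "d2": {"banjir": 1, "jakarta": 4},
    "d3": {"ekonomi": 2},
}

def binary_weight(term, counts):
    """Boolean simulation: weight 1 if the term occurs in the document, else 0."""
    return 1.0 if counts.get(term, 0) > 0 else 0.0

def tf_weight(term, counts):
    """Frequency-based weight, scaled by the largest count in the document
    (an assumed normalization that keeps the value in [0, 1])."""
    return counts.get(term, 0) / max(counts.values())

def score_documents(term, weight_fn):
    """Belief for a single-term query, evaluated once per document."""
    return {doc: weight_fn(term, counts) for doc, counts in TERM_COUNTS.items()}

binary_scores = score_documents("banjir", binary_weight)
tf_scores = score_documents("banjir", tf_weight)
```

Under the binary weights, d1 and d2 tie for the query "banjir", so no order between them can be given. Under the frequency-based weights, d1 is ranked above d2.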