IMPLEMENTING INFERENCE NETWORK

FOR INFORMATION RETRIEVAL SYSTEM

IN INDONESIAN LANGUAGE

Adi Wahyu Pribadi, Zaenal Arifin Hasibuan

Faculty of Computer Science

University of Indonesia

Depok 16424, Indonesia

adi_wahyu@softhome.net, zhasibua@cs.ui.ac.id

Abstract

Information retrieval is concerned with selecting documents from a collection that will be of interest to a user with a stated information need or query. This paper describes a retrieval model that uses probabilistic inference networks, consisting of a document network, which is built once to represent the document collection, and a query network, which is built each time a new query or information need is given. In the first section of this paper, the inference network model is introduced and described briefly. Next, a simple example illustrates how an inference network works on a document collection. In the following section, the model is implemented using news articles taken from Republika Online and Tempo Interaktif, and Boolean and extended Boolean models are used for comparison with the proposed model. The paper concludes that inference networks make it possible to develop techniques that improve performance over conventional retrieval models. The model also has limitations and remains open to further examination and study. One limitation is the scarcity of document collections in the Indonesian language. In the future, additional document collections should be used to obtain more precise and detailed measurements of the effectiveness of the inference network model.

1. Introduction

Information retrieval is an inference or evidential reasoning process in which we estimate the

probability that a user’s information need, expressed as one or more queries, is met given a

document as evidence [6]. Therefore network representations can be used to model the technique.

Turtle implemented an information retrieval system based on a Bayesian inference network [7]. Two years later, Croft and Turtle applied their system to hypertext [3]. Tzeras and Hartman proposed automatic indexing based on a Bayesian inference network [8]. Callan, Z. Lu, and Croft also used Bayesian inference networks for distributed document collections and discussed the implementation in the INQUERY system [2].

The model is used here to retrieve relevant Indonesian news articles from document collections. Since the model also ranks the retrieved documents, we want to see whether the most relevant document is listed first, the second most relevant document second, and so on.

2. Inference Network

A Bayesian inference network is a directed, acyclic dependency graph (DAG) in which nodes

represent propositional variables or constants and edges represent dependence relations between

propositions. If a proposition represented by a node p “causes” or implies the proposition represented by node q, then p and q are joined by a directed edge p → q. The node q contains a link matrix that specifies P(q|p) for all possible values of the two variables. When a node has multiple parents, the link matrix specifies the dependence of that node on its set of parents π_q and characterizes the dependence relationship between that node and all nodes representing its potential causes. Given a set of prior probabilities for the roots of the DAG, these networks can be used to compute the probability or degree of belief associated with all remaining nodes [6].

Different restrictions on the topology of the network and assumptions about the way in which the

connected nodes interact lead to different schemes for combining probabilities. In general, these

schemes have two components which operate independently: a predictive component in which

parent nodes provide support for their children (the degree to which we believe a proposition

depends on the degree to which we believe the propositions that might cause it), and a diagnostic

component in which children provide support for their parents (if our belief in a proposition

increases or decreases, so does our belief in its potential causes). The propagation of probabilities

through the net can be done using information passed between adjacent nodes [6].

Figure 1: Basic document inference network [6]

The inference network has the ability to produce a ranking given many sources of evidence by combining confidences. The inference network is used to model documents, the contents of documents, and the query given by a user. As we see in figure 1, the inference network consists of two sub-networks: the first is the document network, which is produced during indexing and then remains static during the retrieval process; the second is the query network, which is produced from the query text during the retrieval process [5].

2.1. Document Network

As we see in figure 1, the document network consists of document nodes (di's), document content nodes (ti's), and keyword or concept representation nodes (rk's). The document nodes represent the retrievable units within the network, that is, the items we wish to see in the resulting ranking. A causal link between a document node and a concept node indicates that the document's content is represented by that concept. Each link carries a conditional probability, or weight, indicating the strength of the relationship. A node is evaluated using the values of its parent nodes and the conditional probabilities [5].
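For illustration, the document network can be sketched as a weighted mapping from document nodes to concept nodes. The following is a minimal Python sketch; all document identifiers, concepts, and weights are invented:

```python
# Minimal sketch of a document network: document nodes (d_i) linked to
# concept nodes (r_k), with the conditional probability P(r_k | d_i)
# stored as the link weight.  Documents, concepts, and weights invented.

document_network = {
    "d1": {"ekonomi": 0.8, "inflasi": 0.6},
    "d2": {"politik": 0.9, "pemilu": 0.7},
    "d3": {"ekonomi": 0.5, "politik": 0.4},
}

def concept_belief(doc_id, concept):
    """P(concept = true | doc = true): the link weight, or 0 if no link."""
    return document_network.get(doc_id, {}).get(concept, 0.0)
```

In a real system these weights would come from the indexing process rather than being fixed by hand.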

2.2. Query Network

The query network represents the submitted query and consists of a framework that represents the required concepts (query concept nodes) and the operators (query operator nodes), connected in an inverted tree structure. The query network is constructed with a final leaf node I that represents the user's information need [5].

Two further processes are performed to carry out retrieval: the attachment process, in which the query network is attached to the document network to form the complete inference network, joined at the concepts that appear in both networks; and the evaluation process, in which the complete inference network is evaluated for each document node to obtain the probability of relevance to the query. The evaluation is initialized by setting the output of one document node to 1 and all other document nodes to 0. This is done for each document node in turn, and the network is evaluated each time. The probability of document relevance is taken from the final node I and is used to produce the ranking [5].
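The evaluation loop just described can be sketched in Python, assuming a query node I that combines the concept beliefs with the OR operator; the collection, concepts, and weights below are invented for illustration:

```python
# Sketch of the evaluation process: each document node is instantiated
# (set to true) in turn, the beliefs of the query concepts are read off
# the link weights, and the query node I combines them with OR.
# Collection and query are invented for illustration.

document_network = {
    "d1": {"ekonomi": 0.8, "inflasi": 0.6},
    "d2": {"politik": 0.9},
}

def evaluate(query_concepts):
    """Rank documents by P(I = true | d_j = true) under an OR query node."""
    ranking = []
    for doc, concepts in document_network.items():
        none_true = 1.0
        for c in query_concepts:
            none_true *= 1.0 - concepts.get(c, 0.0)
        ranking.append((doc, 1.0 - none_true))
    return sorted(ranking, key=lambda pair: pair[1], reverse=True)
```

Here d1 matches both query concepts and so is ranked above d2, which matches neither.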

2.3. Link Matrix Forms

The probability of every non-root node in the inference network can be estimated from the values of its parent nodes [7]. If a node a has a set of parents π_a = {p1, …, pn}, we must estimate P(a|p1, …, pn). The most direct way to encode this estimate is as a link matrix. Since each proposition takes only the values true or false, this matrix is of size 2 × 2^n for n parents and specifies the probability that a takes the value a = true or a = false for all combinations of parent values. The update procedures for Bayesian networks then use the probabilities provided by the set of parents to condition over the link matrix values and compute the predictive component of our belief in a, that is, P(a = true).

We will describe five canonical link matrix forms. For illustration, we will assume the node Q has

three parents A, B, and C and that

P(A = true) = a, P(B = true) = b, P(C = true) = c

For OR combinations, Q will be true when any of A, B, or C is true and false only when A, B, and C

are all false. This suggests a link matrix of the form

L_OR = | 1  0  0  0  0  0  0  0 |
       | 0  1  1  1  1  1  1  1 |

Using a closed form of the update procedures, we have

P(Q = true) = (1 – a)(1 – b)c + (1 – a)b(1 – c) + (1 – a)bc + a(1 – b)(1 – c)
              + a(1 – b)c + ab(1 – c) + abc
            = 1 – (1 – a)(1 – b)(1 – c)
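The closed form for the OR combination can be checked by brute-force enumeration over all parent configurations. A small Python sketch (the values of a, b, and c are arbitrary):

```python
from itertools import product

# Brute-force check of the OR closed form: sum the probability of every
# parent configuration in which Q is true under the OR link matrix
# (Q is false only when all three parents are false).

def p_or(a, b, c):
    total = 0.0
    for va, vb, vc in product([False, True], repeat=3):
        p_config = ((a if va else 1 - a)
                    * (b if vb else 1 - b)
                    * (c if vc else 1 - c))
        if va or vb or vc:          # OR: Q = true unless all parents false
            total += p_config
    return total
```

The enumerated sum agrees with 1 – (1 – a)(1 – b)(1 – c) for any choice of a, b, c.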

Similar matrices can be developed for AND (P(Q = true) = abc) and NOT (P(Q = true) = 1 – a). To summarize the results (for details on how the link matrices are constructed, see [7]), the following closed-form expressions can be used to evaluate the canonical matrices for a node Q with parents P1, P2, …, Pn, where P(P1 = true) = p1, …, P(Pn = true) = pn.

bel_OR(Q) = 1 – (1 – p1) * (1 – p2) * … * (1 – pn)

bel_AND(Q) = p1 * p2 * … * pn

bel_NOT(Q) = 1 – p1

bel_SUM(Q) = (p1 + p2 + … + pn) / n

bel_WTD(Q) = (w1 p1 + w2 p2 + … + wn pn) wq / (w1 + w2 + … + wn)
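These closed forms translate directly into code. The following is a minimal Python sketch; the function and parameter names are ours, and the node weight wq in the weighted form is an assumption following Turtle's weighted-sum formulation:

```python
# The five canonical closed forms as functions.  p is the list of parent
# beliefs p1..pn; w holds the matching weights.  The node weight wq in
# bel_wtd is an assumption following Turtle's weighted-sum form.

def bel_or(p):
    none_true = 1.0
    for pi in p:
        none_true *= 1.0 - pi
    return 1.0 - none_true

def bel_and(p):
    all_true = 1.0
    for pi in p:
        all_true *= pi
    return all_true

def bel_not(p1):
    return 1.0 - p1

def bel_sum(p):
    return sum(p) / len(p)

def bel_wtd(p, w, wq=1.0):
    return wq * sum(wi * pi for wi, pi in zip(w, p)) / sum(w)
```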

2.4. Weighting Scheme with TF-IDF

We assume that belief in a representation concept is proportional to the within-document frequency

(tf) and inversely proportional to the frequency of the concept in the collection. The collection

frequency component is generally expressed as the term’s inverse document frequency (idf) which

is given by

idf = log( collection size / concept frequency )

Then we normalize both tf and idf to the range [0, 1], giving

ntf_ij = (frequency of term i in document j) / (maximum term frequency in document j) = tf_ij / max_tf_j

idf_i = log( collection size / frequency of term i in the collection ) / log( collection size ) = log( N / df_i ) / log( N )
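The normalized tf and idf can be computed as follows on a toy collection; the documents and terms here are invented for illustration:

```python
import math

# Normalized tf and idf on a toy collection:
#   ntf_ij = tf_ij / max_tf_j          (tf of term i in document j over
#                                       the largest tf in document j)
#   idf_i  = log(N / df_i) / log(N)    (N documents, df_i containing i)
# Documents and terms are invented for illustration.

docs = {
    "d1": ["ekonomi", "ekonomi", "inflasi"],
    "d2": ["politik", "ekonomi"],
}

def ntf(term, doc_id):
    terms = docs[doc_id]
    max_tf = max(terms.count(t) for t in set(terms))
    return terms.count(term) / max_tf

def idf(term):
    n = len(docs)
    df = sum(1 for terms in docs.values() if term in terms)
    return math.log(n / df) / math.log(n)
```

Dividing by log(collection size) keeps idf within [0, 1]: a term occurring in every document gets idf 0, and a term occurring in exactly one document gets idf 1.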

In the conventional Boolean model, the weight of an index term is binary (0 or 1), indicating whether the index term is present in a document or not. As a consequence, the retrieved documents are not ranked. If the conventional Boolean model is simulated with the inference network model, we have P(ti|dj = true) = 1 and P(ti|dj = false) = 0. Fitriyanti [4] solves the ranking problem with P(ti|dj = true) = tfij and P(ti|dj = false) = 0. This formula ranks the retrieved documents by the frequency of the term in each document.

using frequency of term in documents.