IMPLEMENTING INFERENCE NETWORK FOR INFORMATION RETRIEVAL SYSTEM IN INDONESIAN LANGUAGE

Adi Wahyu Pribadi, Zaenal Arifin Hasibuan

Faculty of Computer Science

University of Indonesia

Depok 16424, Indonesia

adi_wahyu@softhome.net, zhasibua@cs.ui.ac.id

Abstract

Information retrieval is concerned with selecting documents from a collection that will be of interest to a user with a stated information need or query. This paper describes a retrieval model that uses probabilistic inference networks, consisting of a document network, which is built once to represent the document collection, and a query network, which is built every time a new query or information need is given. In the first section of this paper, the inference network model is introduced and described briefly. Next, a simple example illustrates how inference networks work on document collections. In the following section, the model is implemented using news articles taken from Republika Online and Tempo Interaktif. Boolean and extended Boolean models are used as baselines for comparison with the proposed model. The paper concludes that inference networks can support techniques that improve performance over conventional retrieval models. The model also has limitations and remains open to further examination and study. One of these limitations is the lack of document collections in the Indonesian language. In the future, additional document collections should be used to gather more precise and detailed results for measuring the effectiveness of the inference network model.


1. Introduction

Information retrieval is an inference or evidential reasoning process in which we estimate the probability that a user's information need, expressed as one or more queries, is met given a document as evidence [6]. Therefore, network representations can be used to model the technique. Turtle implemented an information retrieval system based on a Bayesian inference network [7]. Two years later, Croft and Turtle adapted their system to hypertext [3]. Tzeras and Hartman proposed automatic indexing based on Bayesian inference networks [8]. Callan, Lu, and Croft also used Bayesian inference networks for distributed document collections and discussed the implementation in the INQUERY system [2].

The model is used here to retrieve relevant Indonesian news articles from document collections. Since the model also ranks the retrieved documents, we want to see whether the most relevant document is listed first, the second most relevant document second, and so on.

2. Inference Network

A Bayesian inference network is a directed, acyclic dependency graph (DAG) in which nodes represent propositional variables or constants and edges represent dependence relations between propositions. If a proposition represented by a node p "causes" or implies the proposition represented by node q, then p and q are joined by a directed edge (p → q). The node q contains a link matrix that specifies P(q|p) for all possible values of the two variables. When a node has multiple parents, the link matrix specifies the dependence of that node on its set of parents (π_q) and characterizes the dependence relationship between that node and all nodes representing its potential causes. Given a set of prior probabilities for the roots of the DAG, these networks can be used to compute the probability or degree of belief associated with all remaining nodes [6].

Different restrictions on the topology of the network and assumptions about the way in which the

connected nodes interact lead to different schemes for combining probabilities. In general, these

schemes have two components which operate independently: a predictive component in which

parent nodes provide support for their children (the degree to which we believe a proposition

depends on the degree to which we believe the propositions that might cause it), and a diagnostic

component in which children provide support for their parents (if our belief in a proposition

increases or decreases, so does our belief in its potential causes). The propagation of probabilities

through the net can be done using information passed between adjacent nodes [6].


Figure 1: Basic document inference network [6]

The inference network has the ability to produce a ranking given many sources of evidence by combining confidences. The inference network is used to model documents, the contents of documents, and the query given by a user. As we see in figure 1, the inference network consists of two sub-networks: the first is the document network, which is produced during indexing and then remains static during the retrieval process; the second is the query network, which is produced from the query text during the retrieval process [5].

2.1. Document Network

As we see in figure 1, the document network consists of document nodes (di's), document content nodes (ti's), and keyword or concept representation nodes (rk's). The document nodes represent the retrievable units within the network, that is, the items we wish to see in the resultant ranking. A causal link between a document node and a concept node indicates that the document content is represented by the concept. Each link carries a conditional probability, or weight, indicating the strength of the relationship. A node is evaluated using the values of its parent nodes and the conditional probabilities [5].


2.2. Query Network

The query network represents the submitted query and consists of a framework that represents the required concepts (query concept nodes) and the operators (query operator nodes), connected in an inverted tree structure. The query network is constructed with a final leaf node I that represents the user's information need [5].

Two further processes are needed to perform retrieval: the attachment process, in which the query network is attached to the document network, joining on the concepts that appear in both networks, to form the complete inference network; and the evaluation process, in which the complete inference network is evaluated for each document node to obtain the probability of relevance to the query. The evaluation is initialized by setting the output of one document node to 1 and all other document nodes to 0. This is done for each document node in turn, and the network is evaluated. The probability of document relevance is taken from the final node I and is used to produce the ranking [5].

2.3. Link Matrix Forms

The probability of every non-root node in the inference network can be estimated from the values of its parent nodes [7]. If a node a has a set of parents π_a = {p1, …, pn}, we must estimate P(a | p1, …, pn). The most direct way to encode our estimate is as a link matrix. Since propositions take only the values true or false, this matrix is of size 2 × 2^n for n parents and specifies the probability that a takes the value a = true or a = false for all combinations of parent values. The update procedures for Bayesian networks then use the probabilities provided by the set of parents to condition over the link matrix values and compute the predictive component of our belief in a, that is, P(a = true).

We will describe five canonical link matrix forms. For illustration, we will assume the node Q has three parents A, B, and C and that

P(A = true) = a, P(B = true) = b, P(C = true) = c

For OR combinations, Q will be true when any of A, B, or C is true and false only when A, B, and C are all false. This suggests a link matrix of the form

L_OR = | 1  0  0  0  0  0  0  0 |
       | 0  1  1  1  1  1  1  1 |

where the columns correspond to the eight truth assignments to (A, B, C), the first row gives P(Q = false | ·), and the second row gives P(Q = true | ·). Using a closed form of the update procedures, we have

P(Q = true) = (1 – a)(1 – b)c + (1 – a)b(1 – c) + (1 – a)bc + a(1 – b)(1 – c)
            + a(1 – b)c + ab(1 – c) + abc
            = 1 – (1 – a)(1 – b)(1 – c)


Similar matrices can be developed for AND (P(Q = true) = abc) and NOT (P(Q = true) = 1 – a).

To summarize the results (for details of how the link matrices are constructed, see [7]), the following closed-form expressions can be used to evaluate the canonical matrices for a node Q with parents P1, P2, …, Pn, where P(P1 = true) = p1, …, P(Pn = true) = pn:

bel_OR(Q)  = 1 – (1 – p1)(1 – p2) … (1 – pn)
bel_AND(Q) = p1 p2 … pn
bel_NOT(Q) = 1 – p1
bel_SUM(Q) = (p1 + p2 + … + pn) / n
bel_WTD(Q) = (w1 p1 + w2 p2 + … + wn pn) wq / (w1 + w2 + … + wn)

where wi is the weight attached to parent Pi and wq is the weight of Q itself.
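As a minimal sketch of these closed forms (in Python, which the paper itself does not use; all function names are our own), the five canonical operators can be written directly:

```python
import math

def bel_or(ps):
    # OR: true unless every parent is false
    return 1.0 - math.prod(1.0 - p for p in ps)

def bel_and(ps):
    # AND: true only if every parent is true
    return math.prod(ps)

def bel_not(p):
    return 1.0 - p

def bel_sum(ps):
    return sum(ps) / len(ps)

def bel_wtd(ps, ws, wq=1.0):
    # weighted sum of parent beliefs; wq is the node's own weight
    return wq * sum(w * p for w, p in zip(ws, ps)) / sum(ws)

# With three parents, bel_or matches the expanded closed form above:
a, b, c = 0.2, 0.5, 0.9
assert abs(bel_or([a, b, c]) - (1 - (1 - a) * (1 - b) * (1 - c))) < 1e-12
```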

2.4. Weighting Scheme with TF-IDF

We assume that belief in a representation concept is proportional to the within-document frequency (tf) and inversely proportional to the frequency of the concept in the collection. The collection frequency component is generally expressed as the term's inverse document frequency (idf), which is given by

idf = log(collection size / concept frequency), or idf_i = log(n / df_i)

We then normalize both tf and idf to the range [0, 1]:

ntf_ij = (frequency of term i in document j) / (max frequency of any term in document j), or ntf_ij = tf_ij / max_tf_j

nidf_i = log(collection size / frequency of term i in collection) / log(collection size), or nidf_i = log(N / df_i) / log(N)
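A hedged sketch of these two normalizations in Python (helper names are our own); the nidf value for 'informasi' in the example of section 3 (df = 16, N = 136,600) comes out as expected:

```python
import math

def ntf(tf_ij, max_tf_j):
    # within-document frequency, normalized by the document's maximum
    return tf_ij / max_tf_j

def nidf(df_i, n_docs):
    # log(N / df) normalized by log(N) so the result lies in [0, 1]
    return math.log(n_docs / df_i) / math.log(n_docs)

print(round(nidf(16, 136_600), 2))  # 0.77, matching table 1
print(ntf(3, 5))                    # 0.6
```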

In the conventional Boolean model, the weight of an index term is binary (0 or 1), indicating whether the index term is present in a document or not. As a consequence, the retrieved documents are not ranked. If the conventional Boolean model is simulated with the inference network model, we have P(ti | dj = true) = 1 and P(ti | dj = false) = 0. Fitriyanti [4] solves the ranking problem with P(ti | dj = true) = tf_ij and P(ti | dj = false) = 0. This formula ranks retrieved documents by the frequency of terms in documents.


The P-norm model (if we also simulate it with the inference network model) formulates the index term weighting by combining both ntf_ij and nidf_i [9]. So we have P(ti | dj = true) = ntf_ij * nidf_i and P(ti | dj = false) = 0.

Turtle and Croft [6] improved the P-norm model using parameters α, β, γ, and δ in order to keep the probability in the range 0 to 1 [3]. We have P(ti | dj = true) = α + β * ntf + γ * nidf + δ * ntf * nidf. By performing many experiments they concluded that the best performance is achieved when 0.4 ≤ α ≤ 0.8 and β = γ = 0. The retrieval performance for a typical belief function is given by

P(ti | dj = true) = α + (1 – α) * ntf_ij * nidf_i
P(ti | dj = false) = α, where α ∈ [0, 0.5]
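The typical belief function above is a one-liner; as a sketch (Python, hypothetical names) with the default α = 0.4 used later in this paper:

```python
def belief(ntf, nidf, alpha=0.4):
    # alpha acts as a default belief; the ntf*nidf product lifts it toward 1
    return alpha + (1.0 - alpha) * ntf * nidf

# With ntf = 0.60 and nidf = 0.77 (the 'informasi' values of section 3):
print(round(belief(0.60, 0.77), 3))  # 0.677
```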

3. Inference Network Example

In this section, we will see how the document network is constructed and how queries are evaluated. The following example is taken from [9]. The inference network shown in figure 2 contains two documents and four representation concepts. A single query has been attached containing the keywords 'informasi', 'probabilistik', and 'jaringan', combined with the Boolean operator AND. We will now estimate bel(Q | d1) and bel(Q | d2). For the following example, we use α = 0.4 and β = γ = 0.

Figure 2: Inference network fragment

If node Q in figure 2 is interpreted as the query, then the probability of node 'informasi', which has one parent, can be computed using ntf and nidf as shown in table 1. We assume max_tf_d1 = 5, max_tf_d2 = 4, and a total of 136,600 documents in the collection.

term          | frequency | nidf | tf_d1 | tf_d2 | ntf_d1 | ntf_d2
informasi     | 16        | 0.77 | 3     | 0     | 0.60   | 0.00
probabilistik | 16,461    | 0.18 | 3     | 2     | 0.60   | 0.50
jaringan      | 820       | 0.43 | 5     | 1     | 1.00   | 0.25
satelit       | 2,675     | 0.33 | 0     | 4     | 0.00   | 1.00

Table 1: Frequencies and idf and tf weights

Node 'informasi' with one parent d1:

P(informasi = true | d1 = true) = 0.4 + 0.6 * ntf * nidf = 0.4 + 0.6 * 0.60 * 0.77 = 0.677

L_informasi = | 1  0.323 |
              | 0  0.677 |

Node 'probabilistik' with two parents d1 and d2:

P(probabilistik = true | d1 = true) = 0.4 + 0.6 * 0.60 * 0.18 = 0.465
P(probabilistik = true | d2 = true) = 0.4 + 0.6 * 0.50 * 0.18 = 0.454

L_probabilistik = | 1  0.546  0.535  0.535 |
                  | 0  0.454  0.465  0.465 |

Node 'jaringan' with two parents d1 and d2:

P(jaringan = true | d1 = true) = 0.4 + 0.6 * 1.00 * 0.43 = 0.658
P(jaringan = true | d2 = true) = 0.4 + 0.6 * 0.25 * 0.43 = 0.464

L_jaringan = | 1  0.536  0.342  0.342 |
             | 0  0.464  0.658  0.658 |

Node 'satelit' with one parent d2:

P(satelit = true | d2 = true) = 0.4 + 0.6 * 1.00 * 0.33 = 0.598

L_satelit = | 1  0.402 |
            | 0  0.598 |

Figure 3 shows the resulting 'inverted belief list' for figure 2:

informasi     (d1, 0.677)
probabilistik (d1, 0.465) (d2, 0.454)
jaringan      (d1, 0.658) (d2, 0.464)
satelit       (d2, 0.598)

Figure 3: Inverted belief list
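The whole inverted belief list of figure 3 can be reproduced from table 1 with the same belief function; a sketch assuming α = 0.4 and a data layout of our own choosing:

```python
def belief(ntf, nidf, alpha=0.4):
    return alpha + (1.0 - alpha) * ntf * nidf

# (nidf, {document: ntf}) per term, copied from table 1
terms = {
    "informasi":     (0.77, {"d1": 0.60}),
    "probabilistik": (0.18, {"d1": 0.60, "d2": 0.50}),
    "jaringan":      (0.43, {"d1": 1.00, "d2": 0.25}),
    "satelit":       (0.33, {"d2": 1.00}),
}

# inverted belief list: term -> {document: belief}
inverted = {
    term: {doc: round(belief(ntf, nidf), 3) for doc, ntf in postings.items()}
    for term, (nidf, postings) in terms.items()
}
for term, postings in inverted.items():
    print(term, postings)
```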

4. Implementation

In this paper, the inference network model is implemented using 200 articles taken from two national daily newspapers (Republika Online and Tempo Interaktif). The topics of the articles are the Bali bombing, the J. W. Marriott Hotel bombing, the Annual Representatives Meeting 2003, the war in Aceh, and the war in Iraq.

Not all words are equally significant for representing the semantics of a document. Therefore, it is useful to preprocess the text to determine which terms to use as index terms. Preprocessing consists of eliminating stopwords and stemming (reducing a word to its grammatical root). We must know exactly how a word is constructed in the Indonesian language; therefore stemming must be performed carefully. For example, the word 'menyapu' has 'sapu' as its grammatical root.
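As a toy illustration only (a real Indonesian stemmer needs a full set of morphophonemic rules, which the paper does not give), the 'menyapu' → 'sapu' case involves the meN- prefix family, where meny- before a vowel hides a root-initial s:

```python
def stem_meny(word):
    """Toy rule: 'meny-' + vowel hides a root-initial 's' (menyapu -> sapu).
    This single rule is illustrative only, far from a complete stemmer."""
    if word.startswith("meny") and len(word) > 4 and word[4] in "aiueo":
        return "s" + word[4:]
    return word

print(stem_meny("menyapu"))  # sapu
```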


The indexing processes of the Boolean, extended Boolean, and inference network models are similar. In the Boolean model the weight of a term is binary, and in the extended Boolean model the weight is computed by normalizing the tf-idf scheme. The inference network model adds one more step: converting the ntf-nidf matrix into an inverted belief list, computed via the link matrices. Figure 4 shows the ordered processes for building the document network, which was illustrated as creating the inverted belief list in section 3.

Figure 4: Building Document Network Processes

To retrieve documents from the document collections using the P-norm model without the learning schemes, we use

sim(q_or, d_j) = ((x1^p + x2^p + … + xm^p) / m)^(1/p)

sim(q_and, d_j) = 1 – (((1 – x1)^p + (1 – x2)^p + … + (1 – xm)^p) / m)^(1/p)

where xi stands for the weight w_{i,j} associated with the pair [ki, dj] [1].
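A sketch of the two P-norm similarity formulas in Python (function names are our own), using the paper's p = 4 as the default; for p = 1 both reduce to the plain average, and large p pushes them toward strict Boolean behavior:

```python
def sim_or(xs, p=4):
    # OR: generalized mean of the term weights
    m = len(xs)
    return (sum(x ** p for x in xs) / m) ** (1.0 / p)

def sim_and(xs, p=4):
    # AND: complement of the generalized mean of the complements
    m = len(xs)
    return 1.0 - (sum((1.0 - x) ** p for x in xs) / m) ** (1.0 / p)

print(round(sim_or([1.0, 0.0, 0.0]), 3))
print(round(sim_and([1.0, 1.0, 1.0]), 3))
```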

5. Result

In this section we present results from the conventional Boolean, extended Boolean, and inference network models. As noted above, the Boolean model does not retrieve documents in ordered lists; it only determines whether a document is relevant or not. Wisnani [9] observed a relation between the extended Boolean model using P-norm and the inference network model, obtained by sharpening the Boolean operators. The experiments conducted by Andri [1] showed that the P-norm model works well when p is in the range 1 to 100; the weight of index terms approaches zero when p is higher than 100. We use p = 4 in our experiments.


From the experiments reported by Turtle [9], the best performance is generally achieved when β = γ = 0 and only the tf-idf product term remains. He concluded that a good overall belief estimate is given when α = 0.4. The experiments conducted by Wisnani [9] also showed that the best performance of the inference network model is achieved when α = 0.4; hence we use α = 0.4 and β = γ = 0 in our experiments.

Table 2 shows the precision of the three models using the OR and AND operators. We do not use the SUM and WTD operators because their precisions are no better than those of the Boolean operators (AND, OR, NOT) [9].

Query | Boolean  | P-Norm (p = 4) | Inference Network
Q1    | 0.777778 | 0.666667       | 0.647727
Q2    | 0.930070 | 0.923077       | 0.923077
Q3    | 0.800000 | 0.812500       | 0.822785
Q4    | 0.951807 | 0.927711       | 0.939024
Q5    | 0.927273 | 0.767123       | 0.878049
Q6    | 0.975610 | 0.975610       | 0.975610
Q7    | 0.937500 | 0.926829       | 0.951220
Q8    | 0.950000 | 0.950000       | 0.950000
Q9    | 0.959184 | 0.721429       | 0.726619
Q10   | 0.771429 | 0.771429       | 0.794118
AVG   | 0.898065 | 0.844237       | 0.860823

Table 2: Precisions of Boolean, Extended Boolean, and Inference Network
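A quick check (Python) that the reported averages follow from the per-query precisions in table 2:

```python
boolean = [0.777778, 0.930070, 0.800000, 0.951807, 0.927273,
           0.975610, 0.937500, 0.950000, 0.959184, 0.771429]
p_norm  = [0.666667, 0.923077, 0.812500, 0.927711, 0.767123,
           0.975610, 0.926829, 0.950000, 0.721429, 0.771429]
inf_net = [0.647727, 0.923077, 0.822785, 0.939024, 0.878049,
           0.975610, 0.951220, 0.950000, 0.726619, 0.794118]

# each mean should match the AVG row of table 2 to six decimals
for name, col, avg in [("Boolean", boolean, 0.898065),
                       ("P-Norm",  p_norm,  0.844237),
                       ("InfNet",  inf_net, 0.860823)]:
    assert abs(sum(col) / 10 - avg) < 1e-6
    print(name, round(sum(col) / 10, 6))
```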

As we see in table 2, the precisions of the extended Boolean and inference network models are apparently no better than that of the Boolean model. The Boolean model has an average precision of 89.80% over the 10 queries used to retrieve relevant documents from the document collections. The P-norm model's precision is 5.99% lower (relative to the Boolean model) at 84.42%, while the inference network model's precision is 4.15% lower at 86.08%. However, the inference network model's precision is 1.96% higher than that of the P-norm model.

The results for Q3 and Q8 show that the Boolean model is not always better than the P-norm and inference network models. The known problems of the Boolean model are that the retrieved documents are not ranked and that the result set can be far too large or empty. With the Boolean model some relevant documents are not retrieved from the document collections, because its results generally contain fewer documents than those of the P-norm and inference network techniques. The P-norm and inference network models can therefore be judged better than the Boolean model for two reasons: both rank their results, and both retrieve more relevant documents.

6. Conclusion


The experiment conducted here used only 200 documents. Therefore, further research should use larger document collections and more subject domains in order to evaluate the model more precisely and accurately. Moreover, the system should be tested by real users looking for Indonesian news articles on the Internet or in libraries. Nevertheless, the experiments show that the inference network improves on the precision of the extended Boolean model with P-norm p = 4. From the experiments conducted here we can conclude that the inference network is able to overcome the problem of retrieving too many documents or none at all, and the documents it retrieves are ranked.

7. References

[1] ANDRI, Y., Teknik Learning Scheme Berdasarkan Model P-Norm pada Sistem Temu-Kembali Informasi. Skripsi, Computer Science Faculty, University of Indonesia, 1997.
[2] CALLAN, J., LU, Z., CROFT, W. B., Searching Distributed Collections with Inference Networks. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1995.
[3] CROFT, W. B., TURTLE, H. R., Retrieval Strategies for Hypertext. 1992.
[4] FITRIYANTI, M., Sistem Temu-Kembali Informasi dengan Mengimplementasikan Operasi Boolean, Sistem Peringkat, Perbaikan Query, dan Pemanfaatan Tesaurus. Skripsi, Computer Science Faculty, University of Indonesia, 1997.
[5] GRAVES, A., LALMAS, M., Video Retrieval using an MPEG-7 Based Inference Network. ACM SIGIR 2002.
[6] TURTLE, H. R., CROFT, W. B., Inference Networks for Document Retrieval. 1989.
[7] TURTLE, H. R., Inference Networks for Document Retrieval. PhD Thesis, University of Massachusetts, 1990.
[8] TZERAS, K., HARTMAN, S., Automatic Indexing Based on Bayesian Inference Networks. Technische Hochschule Darmstadt, Fachbereich Informatik, W-6100 Darmstadt, Germany, 1993.
[9] WISNANI, Y., Sistem Temu-Kembali Informasi Berbasis Jaringan Inferensia. Thesis, Computer Science Faculty, University of Indonesia, 1998.
[10] BAEZA-YATES, R., RIBEIRO-NETO, B., Modern Information Retrieval. Addison Wesley, Essex, England, 1999.
