Conference: iiWAS'2003 - The Fifth International Conference on Information Integration and Web-based Applications & Services, 15-17 September 2003, Jakarta, Indonesia

IMPLEMENTING INFERENCE NETWORK
FOR INFORMATION RETRIEVAL SYSTEM
IN INDONESIAN LANGUAGE
Adi Wahyu Pribadi, Zaenal Arifin Hasibuan
Faculty of Computer Science
University of Indonesia
Depok 16424, Indonesia
adi_wahyu@softhome.net, zhasibua@cs.ui.ac.id
Abstract
Information retrieval is concerned with selecting documents from a collection that will be of
interest to a user with a stated information need or query. This paper describes a retrieval model
based on probabilistic inference networks, consisting of a document network, which is built once to
represent the document collection, and a query network, which is built each time a new query or
information need is given. The first section of this paper briefly introduces the inference network
model. Next, a simple example illustrates how an inference network works on a document
collection. In the following section, the model is implemented using news articles taken from
Republika Online and Tempo Interaktif, and the Boolean and extended Boolean models are used for
comparison. The paper concludes that inference networks provide a framework for techniques that
can improve performance over conventional retrieval models. The model also has limitations and
remains open to further study; one limitation is the lack of document collections in the Indonesian
language. In the future, additional document collections should be used to obtain more precise and
detailed results for measuring the effectiveness of the inference network model.
1. Introduction
Information retrieval is an inference or evidential reasoning process in which we estimate the
probability that a user’s information need, expressed as one or more queries, is met given a
document as evidence [6]. Therefore network representations can be used to model the technique.
Turtle implemented an information retrieval system based on a Bayesian inference network [7]. Two
years later, Croft and Turtle adapted the approach to hypertext [3]. Tzeras and Hartman proposed
automatic indexing based on Bayesian inference networks [8]. Callan, Lu, and Croft also used a
Bayesian inference network for distributed document collections and discussed its implementation
in the INQUERY system [2].
In this paper, the model is used to retrieve relevant Indonesian news articles from a document
collection. Because the model ranks the retrieved documents, we also want to see whether the most
relevant document is listed first, the second most relevant document second, and so on.
2. Inference Network
A Bayesian inference network is a directed, acyclic dependency graph (DAG) in which nodes
represent propositional variables or constants and edges represent dependence relations between
propositions. If a proposition represented by a node p "causes" or implies the proposition
represented by node q, then p and q are joined by a directed edge (p → q). The node q contains a link
matrix that specifies P(q|p) for all possible values of the two variables. When a node has multiple
parents, the link matrix specifies the dependence of that node on its set of parents (π_q) and
characterizes the dependence relationship between that node and all nodes representing its potential
causes. Given a set of prior probabilities for the roots of the DAG, these networks can be used to
compute the probability or degree of belief associated with all remaining nodes [6].
Different restrictions on the topology of the network and assumptions about the way in which the
connected nodes interact lead to different schemes for combining probabilities. In general, these
schemes have two components which operate independently: a predictive component in which
parent nodes provide support for their children (the degree to which we believe a proposition
depends on the degree to which we believe the propositions that might cause it), and a diagnostic
component in which children provide support for their parents (if our belief in a proposition
increases or decreases, so does our belief in its potential causes). The propagation of probabilities
through the net can be done using information passed between adjacent nodes [6].
Figure 1: Basic document inference network [6]
The inference network is able to produce a ranking from many sources of evidence by combining
degrees of confidence. It is used to model the documents, the contents of the documents, and the
query given by a user. As we see in figure 1, the inference network consists of two sub-networks:
the document network, which is produced during indexing and then remains static during retrieval;
and the query network, which is produced from the query text during the retrieval process [5].
2.1. Document Network
As we see in figure 1, the document network consists of document nodes (d_i's), document content
nodes (t_i's), and keyword or concept representation nodes (r_k's). The document nodes represent the
retrievable units within the network, that is, the items we wish to see in the resulting ranking. A
causal link between a document node and a concept node indicates that the document content is
represented by that concept. Each link carries a conditional probability, or weight, indicating the
strength of the relationship. A node is evaluated using the values of its parent nodes and the
conditional probabilities [5].
2.2. Query Network
The query network represents the submitted query and consists of a framework that represents the
required concepts (query concept nodes) and the operators (query operator nodes), connected in an
inverted tree structure. The query network ends in a final leaf node I that represents the user's
information need [5].
Two further processes are needed to perform retrieval: the attachment process, in which the query
network is attached to the document network, at the concepts that appear in both networks, to form
the complete inference network; and the evaluation process, in which the complete inference
network is evaluated for each document node to obtain the probability of relevance to the query.
The evaluation is initialized by setting the output of one document node to 1 and all other document
nodes to 0. This is done for each document node in turn, and the network is evaluated each time.
The probability of document relevance is read from the final node I and is used to produce the
ranking [5].
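To make the two processes concrete, the following Python sketch mimics the attach-and-evaluate loop on a toy inverted belief list; the terms, documents, and belief values are hypothetical placeholders of ours, and the AND operator uses the closed form given in section 2.3 below.

```python
# Toy sketch of the attach-and-evaluate loop (illustrative only; the terms,
# documents, and belief values below are hypothetical placeholders).

# Inverted belief list: P(term = true | document = true) for each posting.
BELIEFS = {
    "t1": {"d1": 0.68},
    "t2": {"d1": 0.47, "d2": 0.45},
    "t3": {"d1": 0.66, "d2": 0.46},
}

def bel_and(values):
    """AND query operator: the product of the parent beliefs (see section 2.3)."""
    result = 1.0
    for v in values:
        result *= v
    return result

def evaluate(query_terms, documents):
    """Turn on one document node at a time and read the belief at node I."""
    ranking = []
    for d in documents:
        # A term that does not occur in d contributes belief 0
        # (the link-matrix column where all its parents are false).
        term_beliefs = [BELIEFS.get(t, {}).get(d, 0.0) for t in query_terms]
        ranking.append((d, bel_and(term_beliefs)))
    return sorted(ranking, key=lambda pair: pair[1], reverse=True)

print(evaluate(["t1", "t2", "t3"], ["d1", "d2"]))  # d1 ranked above d2
```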
2.3. Link Matrix Forms
The probability of every non-root node in the inference network can be estimated from the values of
its parent nodes [7]. If a node a has a set of parents π_a = {p_1, …, p_n}, we must estimate
P(a | p_1, …, p_n). The most direct way to encode this estimate is as a link matrix. Since propositions
take only true or false values, this matrix has size 2 × 2^n for n parents and specifies the probability
that a takes the value a = true or a = false for every combination of parent values. The update
procedures for Bayesian networks then use the probabilities provided by the set of parents to
condition over the link matrix values and compute the predictive component of our belief in a, that
is, P(a = true).
We will describe five canonical link matrix forms. For illustration, we will assume the node Q has
three parents A, B, and C and that
P(A = true) = a, P(B = true) = b, P(C = true) = c
For OR combinations, Q will be true when any of A, B, or C is true and false only when A, B, and C
are all false. This suggests a link matrix of the form
$$L_{OR} = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \end{pmatrix}$$

where the columns correspond to the eight truth assignments to (A, B, C), ordered from all false to all true, and the rows give P(Q = false | ·) and P(Q = true | ·).
Using a closed form of the update procedures, we have
P(Q = true) = (1 – a)(1 – b)c + (1 – a)b(1 – c) + (1 – a)bc + a(1 – b)(1 – c)
+ a(1 – b)c + ab(1 – c) + abc
= 1 – (1 – a)(1 – b)(1 – c)
Similar matrices can be developed for AND (P(Q = true) = abc) and NOT (P(Q = true) = 1 – a).
To summarize the results (for details on how the link matrices are constructed, see [7]), the following
closed-form expressions can be used to evaluate the canonical matrices for a node Q with parents
P_1, P_2, …, P_n, where P(P_1 = true) = p_1, …, P(P_n = true) = p_n:

$$bel_{OR}(Q) = 1 - (1 - p_1)(1 - p_2) \cdots (1 - p_n)$$

$$bel_{AND}(Q) = p_1 \, p_2 \cdots p_n$$

$$bel_{NOT}(Q) = 1 - p_1$$

$$bel_{SUM}(Q) = \frac{p_1 + p_2 + \cdots + p_n}{n}$$

$$bel_{WTD}(Q) = \frac{(w_1 p_1 + w_2 p_2 + \cdots + w_n p_n)\, w_q}{w_1 + w_2 + \cdots + w_n}$$
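A compact Python sketch of these closed forms (the function names and the numeric priors in the three-parent OR example are ours, not from the paper):

```python
def bel_or(p):
    """bel_OR(Q) = 1 - (1 - p1)(1 - p2)...(1 - pn)."""
    result = 1.0
    for pi in p:
        result *= (1.0 - pi)
    return 1.0 - result

def bel_and(p):
    """bel_AND(Q) = p1 * p2 * ... * pn."""
    result = 1.0
    for pi in p:
        result *= pi
    return result

def bel_not(p1):
    """bel_NOT(Q) = 1 - p1."""
    return 1.0 - p1

def bel_sum(p):
    """bel_SUM(Q) = (p1 + p2 + ... + pn) / n."""
    return sum(p) / len(p)

def bel_wtd(p, w, wq=1.0):
    """bel_WTD(Q) = (w1*p1 + ... + wn*pn) * wq / (w1 + ... + wn)."""
    return sum(wi * pi for wi, pi in zip(w, p)) * wq / sum(w)

# The OR combination for three parents A, B, C with hypothetical priors:
a, b, c = 0.3, 0.5, 0.7
assert abs(bel_or([a, b, c]) - (1 - (1 - a) * (1 - b) * (1 - c))) < 1e-12
```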
2.4. Weighting Scheme with TF-IDF
We assume that belief in a representation concept is proportional to the within-document frequency
(tf) and inversely proportional to the frequency of the concept in the collection. The collection
frequency component is generally expressed as the term’s inverse document frequency (idf) which
is given by
$$idf = \log\left(\frac{\text{collection size}}{\text{concept frequency}}\right) \quad\text{or}\quad idf = \log\left(\frac{n}{df_i}\right)$$

Then we normalize both tf and idf to the range [0, 1]:

$$ntf_{ij} = \frac{\text{frequency of term } i \text{ in document } j}{\text{maximum term frequency in document } j} \quad\text{or}\quad ntf_{ij} = \frac{tf_{ij}}{\mathrm{max\_tf}_j}$$

$$nidf_i = \frac{\log(\text{collection size} \,/\, \text{frequency of term } i \text{ in the collection})}{\log(\text{collection size})} \quad\text{or}\quad nidf_i = \frac{\log(N / df_i)}{\log N}$$
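A small Python sketch of these two normalizations (the function names are ours; the sample numbers anticipate the worked example in section 3):

```python
import math

def ntf(tf_ij, max_tf_j):
    """Normalized term frequency: tf_ij / max_tf_j (0 if the document is empty)."""
    return tf_ij / max_tf_j if max_tf_j else 0.0

def nidf(df_i, collection_size):
    """Normalized inverse document frequency: log(N / df_i) / log(N)."""
    return math.log(collection_size / df_i) / math.log(collection_size)

# 'informasi' in the section 3 example: df = 16, N = 136,600, tf = 3, max tf = 5
print(round(nidf(16, 136_600), 2))  # 0.77
print(round(ntf(3, 5), 2))          # 0.6
```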
In the conventional Boolean model, the weight of an index term is binary (0 or 1), indicating only
whether the term is present in a document or not; as a consequence, the retrieved documents are not
ranked. If the conventional Boolean model is simulated with the inference network model, we have
P(t_i | d_j = true) = 1 and P(t_i | d_j = false) = 0. Fitriyanti [4] solves the ranking problem with
P(t_i | d_j = true) = tf_ij and P(t_i | d_j = false) = 0. This formula ranks the retrieved documents
using the frequency of the term in each document.
The P-norm model (if we also simulate it with the inference network model) formulates the index
term weight by combining both ntf_ij and nidf_i [9]. So we have P(t_i | d_j = true) = ntf_ij · nidf_i and
P(t_i | d_j = false) = 0.
Turtle and Croft [6] improved the P-norm weighting using coefficients α, β, γ, and δ in order to keep
the probability in the range 0 to 1 [3]. We have

$$P(t_i \mid d_j = true) = \alpha + \beta \cdot ntf + \gamma \cdot nidf + \delta \cdot ntf \cdot nidf$$

From many experiments they concluded that the best performance is achieved when 0.4 ≤ α ≤ 0.8
and β = γ = 0. The retrieval performance for a typical belief function is given by

$$P(t_i \mid d_j = true) = \alpha + (1 - \alpha) \cdot ntf_{ij} \cdot nidf_i$$
$$P(t_i \mid d_j = false) = \alpha, \quad \alpha \in [0, 0.5]$$
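In code, this belief estimate is a one-liner (a sketch with our own function name; the numbers reproduce the 'informasi' entry of the worked example in section 3):

```python
def term_belief(ntf_ij, nidf_i, alpha=0.4):
    """P(t_i = true | d_j = true) = alpha + (1 - alpha) * ntf_ij * nidf_i."""
    return alpha + (1.0 - alpha) * ntf_ij * nidf_i

# 'informasi' in document d1: ntf = 0.60, nidf = 0.77
print(round(term_belief(0.60, 0.77), 3))  # 0.677
```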
3. Inference Network Example
In this section, we will see how the document network is constructed and how queries are evaluated.
The following example is taken from [9]. The inference network shown in figure 2 contains two
documents and four representation concepts. A single query has been attached containing the
keywords 'informasi', 'probabilistik', and 'jaringan', combined with the Boolean operator AND. We
will now estimate bel(Q | d_1) and bel(Q | d_2). For this example, we use α = 0.4 and β = γ = 0.
Figure 2: Inference network fragment
If node Q in figure 2 is interpreted as the query, then the probability of the node 'informasi', which
has one parent, can be computed using the ntf and nidf values shown in table 1. We assume
max_tf_d1 = 5 and max_tf_d2 = 4, and that the total number of documents in the collection is 136,600.
Term            frequency   nidf   tf_d1   tf_d2   ntf_d1   ntf_d2
informasi              16   0.77       3       0     0.60     0.00
probabilistik       16461   0.18       3       2     0.60     0.50
jaringan              820   0.43       5       1     1.00     0.25
satelit              2675   0.33       0       4     0.00     1.00
Table 1: Frequencies and idf and tf weights
Node ‘informasi’ with one parent d
1
P(informasi = true | d
1
= true) = 0.4 + 0.6 * ntf * nidf
Page 6
= 0.4 + 0.6 * 0.6 * 0.77 = 0.677
L
informasi
=
¸
¸
¹
·
¨
¨
©
§
677.00
323.01
Node ‘probabilistik’ with two parents d
1
and d
2
P(probabilistik = true | d
1
= true) = 0.4 + 0.6 * 0.6 * 0.18 = 0.465
P(probabilistik = true | d
2
= true) = 0.4 + 0.6 * 0.5 * 0.18 = 0.454
L
probabilistik
=
¸
¸
¹
·
¨
¨
©
§
465.0465.0454.00
535.0535.0546.01
Node ‘jaringan’ with two parents d
1
and d
2
P(jaringan = true | d
1
= true) = 0.4 + 0.6 * 1.00 * 0.43 = 0.658
P(jaringan = true | d
1
= true) = 0.4 + 0.6 * 0.25 * 0.43 = 0.464
L
jaringan
=
¸
¸
¹
·
¨
¨
©
§
658.0658.0464.00
342.0432.0536.01
Node ‘satelit’ with one parent d
2
P(satelit = true | d
2
= true) = 0.4 + 0.6 * 1.0 * 0.33 = 0.598
L
satelit
=
¸
¸
¹
·
¨
¨
©
§
598.00
402.01
Figure 3 shows the resulting 'inverted belief list' for the network of figure 2:

informasi: (d_1, 0.677)
probabilistik: (d_1, 0.465), (d_2, 0.454)
jaringan: (d_1, 0.658), (d_2, 0.464)
satelit: (d_2, 0.598)
Figure 3: Inverted belief list
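As a cross-check, the following Python sketch rebuilds this inverted belief list from the ntf and nidf values of table 1 using α = 0.4; it is our own illustrative reconstruction and reproduces the beliefs of figure 3 up to rounding in the last digit.

```python
# Rebuild the inverted belief list of figure 3 from table 1 (illustrative).
# Each posting maps a document to (ntf, nidf); alpha = 0.4 as in the text.
TABLE1 = {
    "informasi":     {"d1": (0.60, 0.77)},
    "probabilistik": {"d1": (0.60, 0.18), "d2": (0.50, 0.18)},
    "jaringan":      {"d1": (1.00, 0.43), "d2": (0.25, 0.43)},
    "satelit":       {"d2": (1.00, 0.33)},
}
ALPHA = 0.4

inverted_belief_list = {
    term: {doc: ALPHA + (1 - ALPHA) * ntf * nidf
           for doc, (ntf, nidf) in postings.items()}
    for term, postings in TABLE1.items()
}

for term, postings in inverted_belief_list.items():
    entries = ", ".join(f"({doc}, {belief:.3f})" for doc, belief in postings.items())
    print(f"{term}: {entries}")
```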
4. Implementation
In this paper, the inference network model is implemented using 200 articles taken from two
national daily newspapers (Republika Online and Tempo Interaktif). The topics of the articles are
the Bali bombing, the J. W. Marriott Hotel bombing, the Annual Representatives Meeting 2003, the
war in Aceh, and the war in Iraq.
Not all words are equally significant for representing the semantics of a document. Therefore, it is
useful to preprocess the text to determine which terms will be used as index terms. Preprocessing
consists of eliminating stopwords and stemming (reducing a word to its grammatical root). We must
know exactly how a word is constructed in the Indonesian language, so stemming must be
performed carefully. For example, the word 'menyapu' has 'sapu' as its grammatical root.
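A toy illustration of this preprocessing step is sketched below; the stopword list and affix rules are simplistic placeholders of our own and are not the stemmer used in the paper (real Indonesian stemming needs full morphological rules, e.g. recovering the initial 's' that the prefix 'meny-' absorbs in 'menyapu').

```python
STOPWORDS = {"yang", "dan", "di", "ke", "dari", "untuk", "pada"}  # tiny sample list

def naive_stem(word):
    """Strip a few common Indonesian prefixes; purely illustrative."""
    for prefix, replacement in (("meny", "s"), ("men", ""), ("ber", ""), ("ter", "")):
        if word.startswith(prefix) and len(word) > len(prefix) + 2:
            return replacement + word[len(prefix):]
    return word

def preprocess(text):
    """Lowercase, drop stopwords, and stem the remaining tokens."""
    tokens = [t.lower() for t in text.split()]
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("menyapu jalan di kota"))  # ['sapu', 'jalan', 'kota']
```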
The indexing processes of the Boolean, extended Boolean, and inference network models are similar.
In the Boolean model the weight of a term is binary, while in the extended Boolean model the weight
is computed by normalizing the tf-idf scheme. The inference network model adds one more step,
which converts the ntf-nidf matrix into an inverted belief list; the conversion is computed through the
link matrices. Figure 4 shows the sequence of processes for building the document network, which
was illustrated as creating the inverted belief list in section 3.
Figure 4: Building Document Network Processes
To retrieve documents from the document collection using the P-norm model without the learning
schemes, we use

$$sim(q_{or}, d_j) = \left( \frac{x_1^p + x_2^p + \dots + x_m^p}{m} \right)^{1/p}$$

$$sim(q_{and}, d_j) = 1 - \left( \frac{(1 - x_1)^p + (1 - x_2)^p + \dots + (1 - x_m)^p}{m} \right)^{1/p}$$

where x_i stands for the weight w_{i,j} associated with the pair [k_i, d_j] [1].
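The two operators are straightforward to implement; a minimal sketch with our own function names and hypothetical term weights:

```python
def sim_or(weights, p=4):
    """P-norm OR: ((x1^p + ... + xm^p) / m) ** (1/p)."""
    m = len(weights)
    return (sum(x ** p for x in weights) / m) ** (1.0 / p)

def sim_and(weights, p=4):
    """P-norm AND: 1 - (((1-x1)^p + ... + (1-xm)^p) / m) ** (1/p)."""
    m = len(weights)
    return 1.0 - (sum((1.0 - x) ** p for x in weights) / m) ** (1.0 / p)

# Hypothetical ntf*nidf weights of three query terms in one document:
weights = [0.46, 0.43, 0.0]
print(round(sim_or(weights), 3), round(sim_and(weights), 3))
```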
5. Results
In this section we present results from the conventional Boolean, extended Boolean, and inference
network models. The Boolean model does not retrieve documents in an ordered list; it only decides
whether a document is retrieved or not. Wisnani [9] noted that there is a relation between the
extended Boolean model using P-norm and the inference network model, which arises from
sharpening the Boolean operators. The experiments conducted by Andri [1] showed that the P-norm
model works well when p is in the range 1 to 100; the weight of index terms approaches zero when
p is higher than 100. We use p = 4 in our experiments.
From the experiments conducted by Turtle [9], the best performance is generally achieved when
β = γ = 0 and only the tf·idf product term remains; he concluded that a good overall belief estimate
is given by α = 0.4. The results of the experiments conducted by Wisnani [9] also showed that the
inference network model performs best when α = 0.4, hence we use α = 0.4 and β = γ = 0 in our
experiments.
Table 2 shows the precision from the three models using OR and AND operators. We do not use the
SUM and WTD operators because their precisions are not better than the Boolean operators (AND,
OR, NOT) [9].
Query   Boolean    P-Norm (p = 4)   Inference Network
Q1      0.777778   0.666667         0.647727
Q2      0.930070   0.923077         0.923077
Q3      0.800000   0.812500         0.822785
Q4      0.951807   0.927711         0.939024
Q5      0.927273   0.767123         0.878049
Q6      0.975610   0.975610         0.975610
Q7      0.937500   0.926829         0.951220
Q8      0.950000   0.950000         0.950000
Q9      0.959184   0.721429         0.726619
Q10     0.771429   0.771429         0.794118
AVG     0.898065   0.844237         0.860823
Table 2: Precisions of Boolean, Extended Boolean, and Inference Network
As we see in table 2, the precisions of the extended Boolean and inference network models are,
overall, not better than that of the Boolean model. The Boolean model has an average precision of
89.80% over the 10 queries used to retrieve relevant documents from the collection. Relative to the
Boolean model, the average precision of the P-norm model is 5.99% lower, at 84.42%, and that of
the inference network model is 4.15% lower, at 86.08%. The inference network model is, however,
1.96% more precise than the P-norm model.
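These relative differences follow directly from the averages in table 2, as the following quick check shows:

```python
boolean, pnorm, infnet = 0.898065, 0.844237, 0.860823

print(round((boolean - pnorm) / boolean * 100, 2))   # 5.99  (P-Norm vs. Boolean)
print(round((boolean - infnet) / boolean * 100, 2))  # 4.15  (Inference Network vs. Boolean)
print(round((infnet - pnorm) / pnorm * 100, 2))      # 1.96  (Inference Network vs. P-Norm)
```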
The results for Q3 and Q8 show that the Boolean model is not always better than the P-norm and
inference network models. Known problems of the Boolean model are that the retrieved documents
are not ranked and that a query may retrieve far too many documents or none at all. With the
Boolean model, some relevant documents are not retrieved from the collection, because its result
sets are generally smaller than those of the P-norm and inference network techniques. The P-norm
and inference network models can therefore be judged better than the Boolean model for two
reasons: they rank their results, and they retrieve more of the relevant documents.
6. Conclusion
The experiment conducted here used only 200 documents. Further research should therefore use
larger document collections and more subject domains in order to evaluate the model more precisely
and accurately. Moreover, the system should be tested by real users looking for Indonesian news
articles on the Internet or in libraries. Nevertheless, the experiments show that the inference network
model improves on the precision of the extended Boolean model with p = 4. From the experiments
conducted here we can conclude that the inference network model is able to overcome the problem
of retrieving too many documents or none at all, and the documents it retrieves are ranked.
7. References
[1] ANDRI, Y., Teknik Learning Scheme Berdasarkan Model P-Norm pada Sistem Temu-Kembali Informasi.
Skripsi, Computer Science Faculty, University of Indonesia, 1997.
[2] CALLAN, J., LU, Z., CROFT, W. B., Searching Distributed Collections with Inference Networks. 18th Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval, 1995.
[3] CROFT, W. B., TURTLE, H. R., Retrieval Strategies for Hypertext. 1992.
[4] FITRIYANTI, M., Sistem Temu-Kembali Informasi dengan Mengimplementasikan Operasi Boolean, Sistem
Peringkat, Perbaikan Query, dan Pemanfaatan Tesaurus. Skripsi, Computer Science Faculty, University of
Indonesia, 1997.
[5] GRAVES, A., LALMAS, M. Video Retrieval using an MPEG-7 Based Inference Network. ACM SIGIR 2002.
[6] TURTLE, H. R., CROFT, W. B., Inference Networks for Document Retrieval. 1989.
[7] TURTLE, H. R., Inference Network for Document Retrieval. PhD Thesis, University of Massachusetts, 1990.
[8] TZERAS, K., HARTMAN, S., Automatic Indexing Based on Bayesian Inference Networks. Technische
Hochschule Darmstadt, Fachbereich Informatik W-6100 Darmstadt, Germany, 1993.
[9] WISNANI, Y., Sistem Temu-Kembali Informasi Berbasis Jaringan Inferensia. Thesis, Computer Science
Faculty, University of Indonesia, 1998.
[10] BAEZA-YATES, R., RIBEIRO-NETO, B., Modern Information Retrieval. Addison Wesley, Essex, England, 1999.
    • "Relevant studies about text retrieval applications in Indonesian language are identified but none in automatic ontology constructor exclusively. A study of Pribadi and Hasibuan in [7] tried to implement inference network using Indonesian news articles while a study of spoken query-based Indonesian information retrieval was presented in [8]. A number of studies show that ontology can enhance system performance. "
    [Show abstract] [Hide abstract] ABSTRACT: Rich information is scattered under Indonesian Choral Lovers (ICL) mailing list and many of its members prefer posting a query-mail to using the available search engine. A text retrieval system based on ontology is then proposed. However, considering the continual number of emails, developing an automatic ontology constructor (OC) will be the focus of the study while the retrieval system becomes an evaluation tool of the OC developed. Besides using 3,000 emails of ICL as the corpus, this study is challenging for it takes a cognitive approach as the heart of the process and employs Linear Model known in Expert System field as the system life cycle. The effectiveness of OC will be determined based on the automatic-thesaurus effectiveness. Performance measure of information retrieval is going to be computed in technical evaluation while qualitative measures of ontology (consistency, completeness, and conciseness) are going to be used in user's evaluation.
    Full-text · Conference Paper · Aug 2010