
Journal on Relational Methods in Computer Science, Vol. 1, 2004, pp. 217 - 234

Discovering Regularities in Databases Using

Canonical Decomposition of Binary Relations

A. Jaoua1, S. Elloumi2, A. Hasnah1, J. Jaam1, and I. Nafkha2

1University of Qatar, College of Science,

Department of Computer Science.

{jaoua;hasnah;jaam}@qu.edu.qa

2Faculty of Science of Tunis, University of Manar-Tunis

Unité de Recherche en Programmation Algorithmique et Heuristique

{samir.elloumi;ibtissem.nafkha}@fst.rnu.tn

Abstract. Regularities in databases are directly useful for knowledge discovery and data
summarization. As a mathematical background, relational algebra has helped in discovering
the main data structures and the dependencies that exist between the different attributes of a
relational database. Functional, difunctional and other kinds of dependencies in a relational
database describe invariant regular structures that have been used intensively for
database decomposition and for minimizing redundancy. In this paper, we explain why
"concepts" or "maximal rectangles" should be considered as the atomic regular structure
for decomposing a binary relation, which can be useful for different applications. More
specifically, we have noticed experimentally that "optimal concepts" contain pertinent
information about data, which we have exploited for machine learning, dynamic
and incremental database organization, text summarization, data reduction, and even
for modeling human thinking. Operators on concepts need to be developed because of
their general usefulness in data and information engineering. We propose
to work on a canonical decomposition of binary relations based on two operators f and
g, in order to model some important open problems, for example how to express as an
equation the best conceptual coverage of a binary relation. We first develop an algorithm
to find a conceptual coverage of a binary relation. We then exploit Riguet's difunctional
relation to express as an equation all isolated pairs in a binary relation. Using these
isolated pairs iteratively, we find several varieties of efficient solutions for the canonical
decomposition problem.

Keywords: regularities in databases, maximal rectangles, concepts, knowledge extraction, data reduction

Received by the editors April 29, 2004, and, in revised form, October 21, 2004.

Published on December 10, 2004.

© A. Jaoua, S. Elloumi, A. Hasnah, J. Jaam, and I. Nafkha, 2004.

Permission to copy for private and scientiﬁc use granted.


1 Introduction

Regular structures in databases have played a major role in database decomposition
and in making explicit some forms of knowledge embedded in data. In this
paper, we try to give a synthesis, or state of the art, of some important regular
structures inside databases. We also raise several questions that do not yet have specific
answers, because the underlying problems are still unsolved, and we
invite other researchers to think about their solutions. We are more
concerned with applying relational algebra to solve problems
in information engineering than with proving new theorems of relational
algebra. More specifically, we generalize the canonical decomposition of difunctional
relations proposed in [7] to more general relations. For that purpose, we give
an approximate algorithm for the canonical decomposition of a relation, and we also exploit
Riguet's difunctional relation.

Functional [3] and difunctional dependencies [7] are invariant
structures found in databases; they are used to decompose a database
into sub-schemas in order to minimize redundancy. These dependencies
are general and do not depend on any particular database instance.
However, even after such a decomposition, database instances may contain hidden
regularities that we can only discover by looking at the lattice of concepts
[4, 11] embedded in each specific instance. From this lattice [11], it is known
that we can extract association rules. However, these rules are not general:
if we change the database instance, the knowledge extracted from the
database may also change.

As an illustration, we use the main concepts for automatic text summarization.
We first decompose the text into elementary sub-texts (sentences,
parts of sentences or sections). Second, we create a binary relation by indexing
each elementary sub-text by the non-empty words it contains.
One approach to summarization that we have tested is based on the main
concept of the text (i.e. the association of a maximum number of sentences with a
maximum number of shared indexing words). In this paper we explain why the strength of this
association is measured by the gain attached to the notion of "optimal concept". In this way,
in [12], we have developed a system that generates different summaries, each
associated with a different optimal concept. Another important problem concerns
the design of good information filters in search engines, each time we have to
search for web pages sharing specific indexing words. A possible model for this
problem is again a binary relation that associates with each web page a list of
shared indexing words. A conceptual coverage of this binary relation may be used
as a basis for extracting the main web page references from the total space of web
page references. In this way, the user receives only clusters of web pages,
presented at different levels in decreasing order of importance.

In our work, after recalling some known regular structures in a database, we
present the conceptual decomposition technique. Among these regular structures,
we first define functional and difunctional dependencies and explain their
applications. As a second, more general and more diverse kind of regularity, we define
maximal rectangles (or concepts) as the atomic information we may extract from
any binary relation. All these regular structures are now used for data mining, as
discussed in Section 6. In Section 4, we propose an approximate
algorithm for covering a binary relation with optimal concepts.
In Section 5, we use Riguet's difunctional relation to express as an equation the
coverage of some kinds of binary relations by a minimal number of concepts, and we
develop an algorithm to find a canonical decomposition of these classes of binary relations.
We then explain why the solution needs to be generalized to more
complex binary relations.

2 Deﬁnition of Regularities

We use relational algebra to formalize the data space and the regularities that we may
discover embedded in the data. We assume that we are able to map our data into
a binary context (i.e. a subset of the Cartesian product of two sets: E, the set of
objects, and P, the set of properties). This hypothesis is not too restrictive, because
most databases may be considered as a binary relation after some
transformations. We also apply the proposed work to data available from
the internet, to documentary databases and to tabular data of the kind available in most
companies. For documentary databases, we can directly obtain a
binary relation linking document (or web page) references to indexing terms,
so all dependencies extracted between the terms of the documentary database
give additional information to users. In all these cases, we need
to extract regular associations between attributes in order to make the right decision in
similar situations. We may also exploit regularities to filter information
and keep only a minimal data size. This kind of application is very useful for search
engines, which should return only a few pertinent web page references corresponding to the user
query.

2.1 Functional Dependencies and Their Application

A functional dependency (fd) is the dependency most frequently used in practice
since the development of Codd's relational model [3]. Functional dependencies
have been used to minimize redundancy and to normalize the relational database
schema. The universe U of a relational schema is composed of a set of attributes.
Each attribute A has a domain dom(A). An element of dom(A) is denoted by
a, b, etc. We use capital letters such as A, B for single attributes, and X, Y for subsets
of attributes. The union of two subsets X and Y is written XY. We also
distinguish between a single attribute A and the set {A}. A relation s
defined on the set of attributes A1, A2, ..., An is a subset of the Cartesian product
dom(A1) × ... × dom(An). We say that s is an instance of the relational schema
S(U). A tuple is an element of s, also called a vector or a sequence of n values
associated with the n attributes. For example, if t = (4, 2, 6) is a tuple of a relation
s defined on the relational schema S(A, B, C), then t[AC] = (4, 6) while t[A] = 4.
In general, t[X] is the restriction of t to the subset of attributes X.

Definition 1 (Functional dependency). Let X and Y be two subsets of attributes
of the universal set U. We say that there exists a functional dependency (fd)
from X to Y if and only if, for any instance s of the relational schema S(U) and
any two tuples t1 and t2 of s, t1[X] = t2[X] implies t1[Y] = t2[Y]. We generally
use the notation X → Y.

Several properties of these dependencies are well known (the first three are
usually called Armstrong's axioms):

– Reflexivity rule: if Y ⊆ X, then X → Y.

– Augmentation rule: if X → Y, then XZ → YZ.

– Transitivity rule: if X → Y and Y → Z, then X → Z.

– If X → Y, then X is by definition a key in the relational schema S[X ∪ Y].
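Definition 1 quantifies over every instance of the schema, so a program can only test whether a given instance is compatible with X → Y. The following is a minimal sketch of such a test; the function name `holds_in_instance` and the sample tuples are illustrative assumptions, not part of the original paper.

```python
def holds_in_instance(instance, schema, X, Y):
    """Check whether X -> Y is satisfied by one instance (a list of tuples).

    schema lists the attribute names in column order; X and Y are
    sequences of attribute names. A single violating pair of tuples
    refutes the dependency; the absence of violations only shows that
    this particular instance is compatible with it.
    """
    position = {name: i for i, name in enumerate(schema)}
    seen = {}  # maps t[X] to the first t[Y] observed with it
    for t in instance:
        x_part = tuple(t[position[a]] for a in X)
        y_part = tuple(t[position[a]] for a in Y)
        if seen.setdefault(x_part, y_part) != y_part:
            return False
    return True


# Hypothetical instance of S(A, B, C), chosen only for illustration.
schema = ("A", "B", "C")
instance = [(4, 2, 6), (4, 2, 7), (5, 3, 6)]
print(holds_in_instance(instance, schema, ["A"], ["B"]))  # A -> B
print(holds_in_instance(instance, schema, ["A"], ["C"]))  # A -> C
```

On this sample, A → B is satisfied, while A → C is violated by the first two tuples, which agree on A but not on C.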

2.2 Difunctional Dependencies

Difunctional dependencies [7, 13] are a generalization of functional dependencies.
Let R be a binary relation. R is difunctional if and only if $R \circ R^{-1} \circ R = R$,
where "∘" is the operator for relational composition and $R^{-1}$ is the inverse of
R. A difunctional relation is no more and no less than the union of the Cartesian products of
several pairs of subsets whose domains are pairwise disjoint and whose codomains are pairwise disjoint.
This general relational equation has been used intensively in software engineering,
where it proved to be a very frequent data structure specifying the link between inputs
and outputs. It was ignored for a long time in data engineering. Its utility
was shown by Lethan and Jaoua between 1985 and 1992,
under different names (iso-dependencies or regular relations) [7].

Definition 2 (Difunctional dependencies). We say that there is a difunctional
dependency between X and Y, denoted by X ↔ Y, if and only if, for any
instance s of S(U), the binary relation R[X, Y] defined by s[X, Y] is difunctional.


Example 1. Consider U = {A, B, C} and the instance s of S(U) given in Table 1.
We can see that A ↔ B is true for s, because the binary relation R[A, B]
is difunctional; on the contrary, B ↔ C is false.

Table 1. An instance s of S(U)

A  B  C
2  3  5
2  4  5
3  3  5
3  4  8
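The claim of Example 1 can be checked mechanically from the equation R ∘ R⁻¹ ∘ R = R. The sketch below encodes Table 1 as a list of tuples; the helper names (`compose`, `inverse`, `project`, `is_difunctional`) are illustrative choices and not taken from the paper.

```python
def compose(r, s):
    """Relational composition: (x, y) such that x r z and z s y for some z."""
    return {(x, y) for (x, z) in r for (z2, y) in s if z == z2}


def inverse(r):
    """Inverse relation: every pair is reversed."""
    return {(y, x) for (x, y) in r}


def is_difunctional(r):
    """R is difunctional iff R o R^-1 o R = R."""
    return compose(compose(r, inverse(r)), r) == r


def project(instance, schema, a, b):
    """Binary relation R[a, b] obtained by projecting the instance on two attributes."""
    i, j = schema.index(a), schema.index(b)
    return {(t[i], t[j]) for t in instance}


# Table 1 of the text.
schema = ("A", "B", "C")
table1 = [(2, 3, 5), (2, 4, 5), (3, 3, 5), (3, 4, 8)]

r_ab = project(table1, schema, "A", "B")
r_bc = project(table1, schema, "B", "C")
print(is_difunctional(r_ab), is_difunctional(r_bc))
```

Here R[A, B] is the single rectangle {2, 3} × {3, 4}, whereas in R[B, C] the composition adds the pair (3, 8), which is not in the relation.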

Redundancy reduction. Consider a relational schema S(U) and any instance
s of S. Assume that with any s we associate a difunctional binary relation R[X, Y]
(with X ∪ Y = U), which is the union of maximal rectangles whose projections
are disjoint. So:

$$R[X, Y] = (A_1 \times B_1) \cup (A_2 \times B_2) \cup \dots \cup (A_i \times B_i) \cup \dots \cup (A_n \times B_n)$$

with $A_i \cap A_j = B_i \cap B_j = \emptyset$ for all $i \neq j$.

It is easy to see that we can reduce redundancy by decomposing R[X, Y] into
two binary relations R1[X, C] and R2[C, Y], where C is a new attribute identifying the class. In
R1[X, C], we associate the value i with each element of the set $A_i$. In R2[C, Y], we
associate with each value j all the elements of the subset $B_j$.
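As a rough illustration of this decomposition, the sketch below assumes that its input is already difunctional, so that two elements belong to the same class exactly when they have the same set of images. The function names and the sample relation are hypothetical.

```python
def compose(r, s):
    """Relational composition of two sets of pairs."""
    return {(x, y) for (x, z) in r for (z2, y) in s if z == z2}


def canonical_decomposition(r):
    """Split a difunctional relation R[X, Y] into R1[X, C] and R2[C, Y].

    Assumption: r is difunctional, so the elements of one block A_i all
    share the same image set B_i. Class numbers 1, 2, ... play the role
    of the attribute C.
    """
    images = {}
    for x, y in r:
        images.setdefault(x, set()).add(y)
    class_of = {}  # image set B_i -> class number i
    r1, r2 = set(), set()
    for x in sorted(images):
        block = frozenset(images[x])
        if block not in class_of:
            class_of[block] = len(class_of) + 1
            r2 |= {(class_of[block], y) for y in block}
        r1.add((x, class_of[block]))
    return r1, r2


# Hypothetical difunctional relation: {1, 2, 3} x {a, b, c} and {4} x {d}.
r = {(x, y) for x in (1, 2, 3) for y in "abc"} | {(4, "d")}
r1, r2 = canonical_decomposition(r)
print(len(r), len(r1) + len(r2))
```

The original relation is recovered as the composition of R1 and R2, and on this sample the two small relations together hold fewer pairs than R.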

This kind of decomposition is called "canonical decomposition". Experiments
on several databases have shown that such a decomposition can save a significant amount of
memory space. Even when we do not find difunctional dependencies,
we have observed that most instances of a database contain a uniform
part which has a difunctional structure.

Although difunctional dependencies are more general than functional dependencies, and have
been generalized further to fuzzy difunctional dependencies [13], they have not been directly
useful in databases, because in most cases attributes in databases do not have such
a uniform structure. The most important thing exhibited by a difunctional
relation, however, is the notion of concept: a maximal Cartesian product included
in a database, also called a maximal rectangle, which underlies the rectangular
decomposition of binary relations [1]. We discuss this notion in the next section.

2.3 Conceptual Dependencies

Definition 3. Let R be a binary relation defined on a set E. A relation A × B
such that A ⊆ E, B ⊆ E and A × B ⊆ R is called a rectangular relation (or rectangle) of R [6,
14, 15]. A is the domain of this relation and B is its codomain (or its range).


In Figure 1, we can see an example of a rectangular relation:

[Figure 1: bipartite diagram linking every element of A = {x, y, z, t} to every element of B = {u, v, w}]

Fig. 1. Rectangular relation RE with 12 pairs

Definition 4. From a memory storage perspective, the gain associated
with a given rectangular relation RE = A × B is assessed by the following
heuristic function:

$$g(RE) = (\|A\| \times \|B\|) - (\|A\| + \|B\|) \qquad (1)$$

where $\|A\|$ denotes the cardinality of the set A.

Remark 1. This definition was introduced in [2]. The formula is explained by the
fact that a rectangular relation (or rectangle) associates $\|A\|$ values with $\|B\|$
values. So, when we cluster all the values of A on one side and all the values of B
on the other side, we can replace the $\|A\| \times \|B\|$ direct pairs of RE (Figure 1)
by $\|A\| + \|B\|$ indirect pairs, using an intermediary element i, also called an
extra-symbol, which links any element of A to any element of B, as illustrated by
Figure 2.

[Figure 2: every element of A = {x, y, z, t} is linked to the extra-symbol i, which is linked to every element of B = {u, v, w}]

Fig. 2. A summarized rectangular representation of RE
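As a worked instance of Equation (1), take the rectangle of Figure 1, with four elements in A and three in B:

```latex
\begin{align*}
g(RE) &= (\|A\| \times \|B\|) - (\|A\| + \|B\|) \\
      &= (4 \times 3) - (4 + 3) \\
      &= 12 - 7 = 5
\end{align*}
```

The 12 direct pairs are replaced by 7 pairs that pass through the extra-symbol i, so 5 pairs are saved.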

Deﬁnition 5. A rectangle RE =A×Bwhich contains an element (a,b) of a

binary relation R is said to be optimal if it realizes the maximal gain g(RE) among

all rectangles which contain (a,b).


Remark 2. Searching for the optimal rectangle containing (a, b) is an NP-complete

problem [1, 5]. Several heuristics which are based on a branch and bound prin-

ciple have been implemented and applied for database decomposition [2], object

oriented system decomposition [6] and data mining [17].

Remark 3. Optimal rectangles have a particular meaning because they represent
the most important data associations. Several rectangles may be optimal because
they realize the same maximal gain. So, with respect to some equivalence relation,
we can identify the class of all rectangles with the same gain with a single
representative element.

Definition 6. [2] A rectangular relation (or rectangle) RE = A × B is said to be
degenerate if and only if $\|A\| = 1$ (Figure 3a) or $\|B\| = 1$ (Figure 3b).

[Figure 3: (a) A = {y} linked to B = {u, v, w}; (b) A = {x, y, z} linked to B = {u}]

Fig. 3. Examples of degenerate rectangles

A concept is a maximal rectangle (i.e. a rectangle of R that is not included in
any larger rectangle of R). Given a
binary context R, we are always able to extract all the concepts included in R.
Wille proved in [16] that this set of concepts forms a complete lattice. This lattice
structure has been used intensively for knowledge extraction from data (i.e.
dependencies between attributes, or association rules between the terms contained
in a documentary database or in a single document). The importance of the notion
of concept was first recognized by scientific groups working on graph theory.
Starting from 1990, we applied it to extract knowledge from data. Another group
working on relational algebra discovered applications of concepts to software and data
decomposition [9], machine learning [8], text summarization and several other
areas. Because of its simple and uniform structure, we believe more and
more that an atomic piece of information is something like a directed pair of two subsets
(i.e. a complete bipartite sub-graph). So we assume that the data are composed
of a set of concepts.
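For a small binary context, all non-empty concepts can be listed by brute force. The sketch below is not the extraction method used by the authors; it simply closes every group of objects, which takes exponential time and is only meant to make the definition concrete. The sample context is hypothetical.

```python
from itertools import combinations


def concepts(rel):
    """Return all non-empty maximal rectangles (extent, intent) of rel.

    Brute-force sketch: for every non-empty group of objects, take the
    properties they all share, then take every object that has all of
    those properties. The resulting rectangle cannot be extended.
    """
    objects = sorted({x for x, _ in rel})
    rows = {x: {y for (x2, y) in rel if x2 == x} for x in objects}
    found = set()
    for size in range(1, len(objects) + 1):
        for group in combinations(objects, size):
            shared = set.intersection(*(rows[x] for x in group))
            if not shared:
                continue  # no common property: no non-empty rectangle
            extent = frozenset(x for x in objects if shared <= rows[x])
            found.add((extent, frozenset(shared)))
    return found


# Hypothetical context: objects 1, 2, 3 and properties a, b.
rel = {(1, "a"), (1, "b"), (2, "a"), (2, "b"), (3, "b")}
for extent, intent in sorted(concepts(rel), key=lambda c: sorted(c[0])):
    print(sorted(extent), sorted(intent))
```

On this context the two concepts are {1, 2} × {a, b} and {1, 2, 3} × {b}; together they cover every pair of the relation.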


2.4 Gain of a Binary Relation

The gain W(R) of a binary relation R is given by:

$$W(R) = \frac{r}{d \times c} \times (r - (d + c)) \qquad (2)$$

where:

– r is the cardinality of R (i.e. the number of pairs in R)

– d is the cardinality of the domain of R

– c is the cardinality of the range of R

Remark 4. The quantity $\frac{r}{d \times c}$ provides a measure of the density of the relation
R. The quantity $r - (d + c)$ measures how economically the information
is represented. It is a logical extension of the corresponding definition (Definition 4) from a
concept to a general relation: for a rectangle, $r = d \times c$, so W(R) reduces to g(R).
This definition will be used in the heuristic proposed in Section 4.
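Equation (2) translates directly into a few lines of code. The sketch below uses exact fractions so that values such as 7/8 are not rounded; the function name `gain_w` and the two sample relations are illustrative assumptions.

```python
from fractions import Fraction


def gain_w(rel):
    """Gain W(R) of Equation (2) for a non-empty set of pairs."""
    r = len(rel)                     # number of pairs
    d = len({x for x, _ in rel})     # cardinality of the domain
    c = len({y for _, y in rel})     # cardinality of the range
    return Fraction(r, d * c) * (r - (d + c))


# Hypothetical relation: 7 of the 8 pairs of a 2 x 4 grid.
sparse = {(1, 7), (1, 8), (1, 9), (1, 11), (3, 7), (3, 9), (3, 11)}
# A full 2 x 3 rectangle, for which W equals the gain g of Definition 4.
rectangle = {(x, y) for x in (1, 3) for y in (7, 9, 11)}

print(gain_w(sparse), gain_w(rectangle))
```

For the first relation the density is 7/8 and r − (d + c) = 1; for the rectangle the density is 1 and W = 6 − 5.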

2.5 Elementary Relation (noted PR)

If R is a finite binary relation (i.e. a subset of E × F, where E is a set of objects
and F a set of properties) and (a, b) ∈ R, then the union of the rectangles containing
(a, b) is the elementary relation PR (a subset of R) given by:

$$PR = \Phi_R(a, b) = I(b.R^{-1}) \circ R \circ I(a.R) \qquad (3)$$

where:

– I is the identity relation: for A ⊆ E, I(A) = {(a, a) | a ∈ A}.

– $R^{-1}$ is the inverse relation of R (i.e. the set of inverted pairs of R).

– "∘" refers to the relative product operator:

$$R \circ R' = \{(x, y) \mid \exists z : (x, z) \in R \wedge (z, y) \in R'\} \qquad (4)$$

PR is the sub-relation of R pre-restricted by the antecedents of b (i.e. $b.R^{-1}$)
and post-restricted by the set of images of a (i.e. a.R). In the next sections, we
use such elementary relations PR to find a coverage of a relation by some
"minimal" number of optimal concepts. Note that the problem is NP-complete.
For that reason, we only propose an approximate solution in Section 4, based
on a greedy method using the gain function W.
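Equation (3) can be followed literally with sets of pairs. The sketch below builds the two identity relations and composes them with R; the helper names and the small sample relation are hypothetical.

```python
def compose(r, s):
    """Relative product of Equation (4)."""
    return {(x, y) for (x, z) in r for (z2, y) in s if z == z2}


def identity(elements):
    """Identity relation I(A) on a set A."""
    return {(e, e) for e in elements}


def elementary_relation(rel, a, b):
    """PR = I(b.R^-1) o R o I(a.R) for a pair (a, b) of rel."""
    antecedents_of_b = {x for x, y in rel if y == b}   # b.R^-1
    images_of_a = {y for x, y in rel if x == a}        # a.R
    return compose(compose(identity(antecedents_of_b), rel),
                   identity(images_of_a))


# Hypothetical relation between objects 1..3 and properties a..c.
rel = {(1, "a"), (1, "b"), (2, "a"), (3, "b"), (3, "c")}
print(sorted(elementary_relation(rel, 1, "a")))
```

For the pair (1, a), the antecedents of a are {1, 2} and the images of 1 are {a, b}, so PR keeps (1, a), (1, b) and (2, a) and drops the pairs of object 3.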


3 Conceptual Binary Relation Coverage and Canonical Decomposition

We may consider any binary relation R as a union of concepts. The problem is
that, among the different possible combinations of concepts covering R, we have
to select the most economical ones in terms of memory. Finding a minimal
coverage of R is an NP-complete problem [5]. For that reason, in [1, 6], we used
approximate algorithms to decompose huge binary relations, based on the
"gain" function given in Definition 4. The problem of finding the optimal rectangle
with maximum "gain" is also NP-complete [5]. We therefore think that
more research is needed on the formal properties
of such coverages in order to find better approximate methods. An open problem is to find
new efficient algorithms that update an initial conceptual coverage of a binary
relation R when pairs are added to or removed from R. Such research will have
an impact on conceptual data mining systems.

Assume that $\{A_1 \times B_1, A_2 \times B_2, \dots, A_p \times B_p\}$ is some minimal coverage of the
binary relation R (i.e. $R = A_1 \times B_1 \cup A_2 \times B_2 \cup \dots \cup A_p \times B_p$). We define the
two following operators:

$$f(R) = A_1 \times \{c_1\} \cup A_2 \times \{c_2\} \cup \dots \cup A_p \times \{c_p\} \qquad (5)$$

$$g(R) = \{c_1\} \times B_1 \cup \{c_2\} \times B_2 \cup \dots \cup \{c_p\} \times B_p \qquad (6)$$

Generally, the number of pairs in R is much higher than the number of pairs
in $f(R) \cup g(R)$, while $R = f(R) \circ g(R)$. Here $\{c_1, c_2, \dots, c_p\}$ are extra-symbols,
different from any element in the domain or range of R, which are created to
represent the different concepts in R. Another open problem is related to
incremental conceptual binary relation transformation: the question is to find an
efficient method to calculate $f(R \cup \{(a, b)\})$, $g(R \cup \{(a, b)\})$, $f(R - \{(a, b)\})$ and
$g(R - \{(a, b)\})$ using only f(R) and g(R). The objective is to keep updating the
conceptual coverage of R through its minimal representation by the two relations
f(R) and g(R), by removing or adding a minimal number of extra-symbols.
Operators f and g might be defined automatically by some relational operator. It
would also be interesting to offer a choice of specific functions with useful properties
for mapping binary relations to their canonical forms. Using this
kind of decomposition on many experimental databases, we saved a large amount
of memory space. In the following two sections, we propose first an approximate
solution (Section 4) and second the difunctional of Riguet (Section 5) for deriving
a coverage of a binary relation with optimal concepts.
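Equations (5) and (6) can be sketched as follows, assuming a coverage is already known. The coverage below is hypothetical, and the extra-symbols are represented by the strings "c1", "c2", ..., which is an arbitrary choice.

```python
def compose(r, s):
    """Relational composition of two sets of pairs."""
    return {(x, y) for (x, z) in r for (z2, y) in s if z == z2}


def f_and_g(coverage):
    """Build f(R) and g(R) from a list of rectangles (A_i, B_i).

    Each rectangle receives a fresh extra-symbol c_i; f links the
    elements of A_i to c_i and g links c_i to the elements of B_i.
    """
    f, g = set(), set()
    for i, (a_i, b_i) in enumerate(coverage, start=1):
        symbol = f"c{i}"
        f |= {(a, symbol) for a in a_i}
        g |= {(symbol, b) for b in b_i}
    return f, g


# Hypothetical coverage made of two overlapping rectangles.
coverage = [({1, 2, 3}, {"a", "b", "c"}), ({3, 4}, {"c", "d"})]
relation = {(a, b) for a_i, b_i in coverage for a in a_i for b in b_i}

f, g = f_and_g(coverage)
print(len(relation), len(f) + len(g))
```

Composing f(R) with g(R) gives back R, and on this sample the two relations hold 10 pairs in total against 12 pairs in R.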


4 Approximate Algorithm for Canonical Decomposition

In this section we propose an approximate algorithm to find a set of optimal
rectangles that provides a coverage of a given relation R, i.e. an approximate solution
for the canonical decomposition of a binary relation. The algorithm is given in
Figures 9 and 10. Here we explain its steps using the following relation R.
Let R be a finite binary relation between two sets, as illustrated in Figure 4:

[Figure 4: bipartite diagram of the relation R between the elements 1 to 6 and the elements 7 to 12]

Fig. 4. An example of a binary relation R

– Step 1: Divide the relation R into disjoint sub-relations (the elementary relations $PR_i$ of Step 2).
Here we have only one sub-relation.

– Step 2: For each elementary relation $PR_i$, search for the optimal rectangle that
includes an element of $PR_i$.

If $PR_i$ is a rectangle, then it is an optimal rectangle containing (a, b); otherwise, check
whether $PR_i$ contains other elements (X, Y) of the form (a, Y) or (X, b) by trying all
the images of a and all the antecedents of b (see Figure 5).

$$PR_{(1,7)} = \Phi_R(1, 7) = I(7.R^{-1}) \circ R \circ I(1.R)$$

So we search iteratively for the optimal rectangles of $PR_{(1,7)}$ that
successively contain the elements (1,8), (1,9), (1,11), (2,7) and (3,7).

– First iteration: from the five elementary relations of the above-mentioned
elements, select the first one that gives a maximal gain W, as defined in Section 2.4.
The relation with the maximum gain represents the best
compromise between density and information economy.


[Figure 5: the relation R of Figure 4 with the elementary relation PR(1,7) highlighted, between 7.R⁻¹ = {1, 2, 3} and 1.R = {7, 8, 9, 11}]

Fig. 5. Elementary relation $\Phi_R(1,7) = PR_{(1,7)}$

1. $PR'_{1,8} = \Phi_{PR_{1,7}}(1,8)$; $W(PR'_{1,8}) = 0$

2. $PR'_{1,9} = \Phi_{PR_{1,7}}(1,9)$; $W(PR'_{1,9}) = 7/8$ (selected)

3. $PR'_{1,11} = \Phi_{PR_{1,7}}(1,11)$; $W(PR'_{1,11}) = 7/8$

4. $PR'_{2,7} = \Phi_{PR_{1,7}}(2,7)$; $W(PR'_{2,7}) = 0$

5. $PR'_{3,7} = \Phi_{PR_{1,7}}(3,7)$; $W(PR'_{3,7}) = 7/9$

The selected elementary relation $PR'_{1,9}$ is not a rectangle, so the algorithm
continues with the elements already selected, i.e. (1,7) and (1,9), as shown in Figure 6.

[Figure 6: the elementary relation PR'(1,9) within the elements 1, 2, 3 and 7, 8, 9, 11]

Fig. 6. Elementary relation $PR'_{(1,9)}$

– Second iteration: search now for the optimal rectangles of $PR'_{(1,9)}$ that
successively contain the elements (1,8), (1,11) and (3,9). This step provides three
elementary relations:

1. $PR''_{1,8} = \Phi_{PR'_{1,9}}(1,8)$; $W(PR''_{1,8}) = -1$

2. $PR''_{1,11} = \Phi_{PR'_{1,9}}(1,11)$; $W(PR''_{1,11}) = 7/8$

3. $PR''_{3,9} = \Phi_{PR'_{1,9}}(3,9)$; $W(PR''_{3,9}) = 1$ (selected)


$PR''_{3,9}$ is a rectangle, so it is an optimal one that contains the element (1,7) of R.
Figures 7 and 8 illustrate the iterations of the search for the optimal
rectangle.

[Figure 7: the rectangle PR''(3,9) within the elements 1, 2, 3 and 7, 9, 11]

Fig. 7. Elementary relation $PR''_{(3,9)}$

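[Figure 8: search tree, summarized level by level]

Level              Elementary relation   W
First iteration    PR'(1,8)              0
First iteration    PR'(1,9)              7/8   (selected)
First iteration    PR'(1,11)             7/8
First iteration    PR'(2,7)              0
First iteration    PR'(3,7)              7/9
Second iteration   PR''(1,8)             -1
Second iteration   PR''(1,11)            7/8
Second iteration   PR''(3,9)             1     (selected)

Fig. 8. The search tree for optimal concept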

Figure 8 marks the elementary relation selected at each level of the search
tree. Each level is associated with an iteration of the proposed algorithm. The
proposed algorithm is polynomial (Figures 9 and 10). When we find an optimal
rectangle, we continue by searching for the next optimal one, containing another pair
not already selected. Here, if we select the pair (6,12), we find at the first iteration
the concept $PR_{6,12} = \{5, 6\} \times \{11, 12\}$. Then, if we select the pair (4,10), we obtain
the concept $PR_{4,10} = \{3, 4\} \times \{9, 10\}$. Finally, if we select the pair (2,8), we obtain
the concept $PR_{2,8} = \{1, 2\} \times \{7, 8\}$. The selected coverage is composed of
$\{PR''_{3,9}, PR_{6,12}, PR_{4,10}, PR_{2,8}\}$.


(int s, int w) Optimal_Rectangle (Relation R)
Problem: Determine the optimal rectangle of a binary relation R
Inputs: A binary relation R[][]
Outputs: The pair (s, w) and the optimal rectangle of R that contains it
Begin
  Let R[m][n] be the binary relation of m sentences and n keywords.
  Emax = 0; // the maximum gain W found so far, initialized to 0
  For s = 0 to m-1
    For w = 0 to n-1
      If R[s, w] != 0
      Then PR = I(w.R^-1) o R o I(s.R); // elementary relation of (s, w)
        E = economy(PR);
        If E > Emax
        Then Emax = E;
          Highest = PR; // Highest is the elementary relation of maximal gain
        End if
      End if
    End for
  End for
  If Highest is not a rectangle // r != c*d
  Then call Optimal_Rectangle starting from the relation Highest,
       which corresponds to the next level in the search tree
  End if
End.

Fig. 9. Algorithm calculating an optimal rectangle in a binary relation R

Problem: Determine the economy of a binary relation
Inputs: A binary relation R
Outputs: The economy W(R)
Begin
  Let R[m][n] be the binary relation of m sentences and n keywords.
  Let r be the number of pairs in R.
  Let d be the cardinality of the domain of R.
  Let c be the cardinality of the codomain of R.
  Return (r/(d*c))*(r-(d+c))
End.

Fig. 10. Computation of the economy of a binary relation
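The search described in this section can be sketched in runnable form as follows. This is an interpretation rather than the authors' implementation: at each level it tries the pairs (a, Y) and (X, b) of the current elementary relation, ignores those that do not shrink it, and keeps the first candidate of maximal W. The pairs of R used below are an assumption: they are not read from Figure 4, but were chosen so that the gains quoted in the worked example are reproduced.

```python
from fractions import Fraction


def economy(rel):
    """Gain W of Equation (2), kept exact."""
    r = len(rel)
    d = len({x for x, _ in rel})
    c = len({y for _, y in rel})
    return Fraction(r, d * c) * (r - (d + c))


def elementary(rel, a, b):
    """Pairs of rel restricted to the antecedents of b and the images of a."""
    rows = {x for x, y in rel if y == b}
    cols = {y for x, y in rel if x == a}
    return {(x, y) for x, y in rel if x in rows and y in cols}


def is_rectangle(rel):
    return len(rel) == len({x for x, _ in rel}) * len({y for _, y in rel})


def optimal_rectangle(rel, seed):
    """Greedy descent from PR(seed) to a rectangle containing seed."""
    a, b = seed
    current = elementary(rel, a, b)
    while not is_rectangle(current):
        best = None
        for x, y in sorted(current):
            if x != a and y != b:
                continue  # only pairs of the form (a, Y) or (X, b)
            candidate = elementary(current, x, y)
            if candidate == current:
                continue  # no progress with this pair
            if best is None or economy(candidate) > economy(best):
                best = candidate  # strict ">" keeps the first maximum
        current = best
    return current


def greedy_coverage(rel):
    """Cover rel with rectangles, seeding each search with an uncovered pair."""
    uncovered, cover = set(rel), []
    while uncovered:
        rectangle = optimal_rectangle(rel, min(uncovered))
        cover.append(rectangle)
        uncovered -= rectangle
    return cover


# Assumed relation, consistent with the gains reported in the text.
images = {1: (7, 8, 9, 11), 2: (7, 8), 3: (7, 9, 10, 11),
          4: (9, 10), 5: (11, 12), 6: (11, 12)}
R = {(x, y) for x, ys in images.items() for y in ys}

print(sorted(optimal_rectangle(R, (1, 7))))
print(len(greedy_coverage(R)))
```

Under this assumed R, the search from (1, 7) passes through a relation of gain 7/8 and stops on {1, 3} × {7, 9, 11}, and the coverage consists of the same four rectangles as in the text. The seeds are taken in sorted order here, which is only one possible way of choosing "another pair not already selected".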


5 Relational Calculus for Conceptual Coverage Extraction

5.1 Relational Calculus with the Difunctional of Riguet

Is it possible to find a specific coverage of a binary relation R by a minimal
number of concepts using relational methods? The answer is yes, provided that
each concept of the coverage contains at least one isolated pair of R. An isolated
element is, by definition, an element that belongs to only one concept c of R.
In this case, concept c belongs to any conceptual coverage of R. Fortunately,
Khcherif et al. proved in [10] that we can extract all existing isolated elements by
calculating the following difunctional $R_d$ proposed by Riguet in 1995:

$$R_d = \overline{R \circ \overline{R}^{-1} \circ R} \cap R \qquad (7)$$

Here, $\overline{R}$ is the complement of R. From the domain $D_i$ of each rectangle of
$R_d$, we find a concept by using the Galois connection operators $f_1$ and $f_2$:
$S_i = f_1(D_i)$ gives the set of all common images of $D_i$, and $N_i = f_2(S_i)$ gives
all common antecedents of the elements of $S_i$ with respect to relation R.
The concept $N_i \times S_i$ is included in R and belongs to any possible conceptual
coverage of R. The example in Figure 11 shows how to find
the conceptual coverage of R:

Relation R:

     A  B  C  D  E
1    1  0  0  1  0
2    0  1  1  0  0
3    1  0  0  1  0
4    1  1  1  1  0
5    1  0  0  0  1
6    1  0  0  0  0

Relation Rd:

     A  B  C  D  E
1    0  0  0  1  0
2    0  1  1  0  0
3    0  0  0  1  0
4    0  0  0  0  0
5    0  0  0  0  1
6    1  0  0  0  0

Fig. 11. Calculation of Rd


Then from $R_d$, using the Galois connection operators $f_1$ and $f_2$ on R, we find
the following conceptual coverage of R by the concepts $C_1$, $C_2$, $C_3$ and $C_4$, where
$C_1 = \{1, 3, 4\} \times \{A, D\}$, $C_2 = \{2, 4\} \times \{B, C\}$, $C_3 = \{5\} \times \{A, E\}$, and
$C_4 = \{1, 3, 4, 5, 6\} \times \{A\}$.
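The computation of this example can be sketched as follows. Two points are assumptions of the sketch rather than statements of the paper: the complement of R is taken relative to the product of its domain and range, and the pairs of R_d are obtained as the pairs of R lying outside R ∘ R̄⁻¹ ∘ R. The function names are illustrative.

```python
def compose(r, s):
    """Relational composition of two sets of pairs."""
    return {(x, y) for (x, z) in r for (z2, y) in s if z == z2}


def riguet_difunctional(rel):
    """Pairs of rel that are not in R o complement(R)^-1 o R."""
    domain = {x for x, _ in rel}
    rng = {y for _, y in rel}
    complement = {(x, y) for x in domain for y in rng} - rel
    complement_inverse = {(y, x) for x, y in complement}
    return rel - compose(compose(rel, complement_inverse), rel)


def coverage_from_isolated_pairs(rel):
    """Concepts N_i x S_i obtained from the rectangles of R_d."""
    rows = {}
    for x, y in rel:
        rows.setdefault(x, set()).add(y)
    rd_rows = {}
    for x, y in riguet_difunctional(rel):
        rd_rows.setdefault(x, set()).add(y)
    # Objects with the same row in R_d form the domain D_i of one rectangle.
    domains = {}
    for x, ys in rd_rows.items():
        domains.setdefault(frozenset(ys), set()).add(x)
    concepts = set()
    for d_i in domains.values():
        s_i = set.intersection(*(rows[x] for x in d_i))        # f1(D_i)
        n_i = {x for x in rows if s_i <= rows[x]}              # f2(S_i)
        concepts.add((frozenset(n_i), frozenset(s_i)))
    return concepts


# Relation R of Figure 11.
table = {1: "AD", 2: "BC", 3: "AD", 4: "ABCD", 5: "AE", 6: "A"}
R = {(x, y) for x, ys in table.items() for y in ys}

for n_i, s_i in sorted(coverage_from_isolated_pairs(R), key=lambda c: sorted(c[0])):
    print(sorted(n_i), sorted(s_i))
```

On the relation of Figure 11 this yields the four concepts C1 to C4 listed above, and their union is R.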

5.2 An Open Problem

The relational calculus using the difunctional of Riguet gives the optimal coverage
of only some kinds of binary relations. The reason is that not all concepts of a
binary relation contain isolated elements (i.e. some or all elements of R belong
to more than one concept). In that case, we can remove the initial isolated elements
from R, and then calculate $R'_d$ in the remaining relation R'. We repeat this
last step until we find the coverage of R. The problem is that $R_d$ may be empty.
In that case, the problem is to find, by some other relational calculation, the most
economical conceptual coverage of R.

6 Optimal Concepts and Applications

A machine learning system may be considered as a continual reorganization of
concepts. What do we mean by the central ideas held in a computer "mind"? How
do we optimize storage space by continuously creating new symbols that replace
unorganized associations between existing symbols? Here we could define several
ways of associating new objects with the space of symbols. We generally
associate new objects with the central idea that optimizes the total storage space. So
learning may be considered as an optimization task, which looks for maximal
stability by always giving priority to the most economical concepts.
In Sections 3 and 4, we defined two operators f and g that map
a binary relation R into its economical form (f(R), g(R)). Note
that f and g are not defined in the same direction: f associates
each element a of the domain of R with all the symbols representing the
concepts to which a belongs, whereas g associates each symbol c representing a
concept C with all the elements of the range of R belonging to the range of C.

7 Application of Canonical Decomposition for Text

Summarization and Improving Search Engines

The idea is to extract a summary from a document. For that purpose, we first
decide how to decompose the text: into chapters, sections or sentences.
Then we may ask the user whether he or she wants to obtain a summary automatically or to
extract association rules from the text. Assume that the user decides
that a sentence is the atomic structure of the text, which we are not allowed to
change or reduce. The proposed system, already implemented in January 2004,
first creates a binary context R, where objects are sentence numbers (identified by
their position in the text) and properties are words (each word is identified
by a position in a hash table). So, by definition, (i, j) belongs to R if and only if
word number j belongs to sentence number i or is very similar to another word in
sentence number i. Empty words are not considered: a table of empty
words is consulted first. The crucial question of how to recognize that
two words are similar has been resolved in an approximate way. We assume
that two words are similar if their longest common subsequence has a
relative size greater than some value p close to 1. When we decrease the
value of p, more words are considered similar, and of course this has some impact on the
quality of the summary. As a next step, starting from the binary context R, we use
the method proposed in Section 4 to find the optimal concept. Then
we select all the sentences in the domain of this concept to generate a summary. If the
user would like more detail about the document, he or she may ask for the
next optimal concept and obtain complementary information in the same way.
We can repeat this until the entire document is covered. We built an experimental
system and ran it on many documents, and we are generally satisfied with
the selected sentences. We think that our system lends itself to several
improvements. The same method may be used for improving search engines, by
first selecting the documents corresponding to the optimal concept.
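The word-similarity test can be sketched with the classical dynamic-programming computation of the longest common subsequence. The text does not say what the size of the subsequence is measured against; the sketch below assumes the length of the longer word, and the threshold values are arbitrary examples.

```python
def lcs_length(u, v):
    """Length of a longest common subsequence of two strings."""
    previous = [0] * (len(v) + 1)
    for ch_u in u:
        current = [0]
        for j, ch_v in enumerate(v, start=1):
            if ch_u == ch_v:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def similar(u, v, p=0.9):
    """Two words are similar when LCS length / longer length exceeds p.

    Dividing by the longer word is an assumption of this sketch.
    """
    if not u or not v:
        return False
    return lcs_length(u, v) / max(len(u), len(v)) > p


print(similar("summarization", "summarisation"))   # one letter differs
print(similar("concept", "context"))               # only 5 of 7 letters align
print(similar("relation", "relations", p=0.85))    # plural form
```

Lowering p from 0.9 to 0.85 is what lets "relation" and "relations" match (ratio 8/9), which illustrates the trade-off mentioned in the text.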

8 Conclusion

Most of the research on relational methods in data mining should concentrate on studying the different properties of regular structures in binary relations. Graph-theoretic algorithms for incremental conceptual restructuring should also be improved so that they can serve as a model for machine learning and classification. The properties of the operators f and g defined in Section 3 should be studied in depth in the future, to provide a fundamental basis for database organization and for improving the quality of current search engines by structuring information. An important question is how to find the canonical decomposition of f(R) and g(R). This generalization requires different heuristics for economical decomposition, for example associating a weight with extra-elements, very probably equal to the gain of the concept they represent. Canonical decomposition may also be generalized to fuzzy concepts, to deal with imprecision. In the future, we need to discover hidden invariant rules, i.e. rules that hold even if the database instance changes.

Relational studies are needed to find more efficiently the association rules common to different data instances, using incremental approaches. Cooperative information retrieval and knowledge extraction call for further study of regular structures under merging operators such as intersection, union or join [13]. The question is now to study the different kinds of interaction between these concepts (i.e., operations such as union, intersection, or composition). Also, if we want to merge concepts arriving from different sources, how do we reorganize the space of concepts? We should be able to organize it incrementally into a minimal number of merged and transformed new concepts. If we assume that our data is organized as a union of equally overlapped concepts, is there a mathematical relational structure more general than difunctional relations? What are the main categories of a uniform space of concepts? Finally, is it possible to consider that thinking is a continual reorganization of regular structures into other, optimized regular concepts?

Acknowledgment: We would like to thank all the anonymous reviewers for their valuable comments.

References

1. K. Arour, A. Jaoua, H. Ounelli, and N. Belkhiter. Rectangular decomposition of n-ary relations. In Proc. of the 7th SIAM Conference on Discrete Mathematics, Albuquerque, New Mexico, June 1994.
2. N. Belkhiter, C. Bourhfir, M. M. Gammoudi, A. Jaoua, N. Lethan, and N. Reguig. Décomposition rectangulaire optimale d’une relation binaire: application aux bases de données documentaires. INFormation Systems and Operational Research Journal, 32(1):33–54, 1994.
3. C. J. Date. An Introduction to Database Systems Vol I. Addison Wesley, 1987.
4. B. Ganter and R. Wille. Formal Concept Analysis. Springer-Verlag, Heidelberg, 1999.
5. M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.
6. A. Jaoua, J. M. Beaulieu, N. Belkhiter, A. C. Debaque, J. Desharnais, R. Lelouche, T. Moukam, and M. Reguig. Rectangular decomposition of object-oriented software architectures. In Proc. of the 14th Int. Conf. on Soft. Eng. (ICSE 14), Melbourne, Australia, May 1992.
7. A. Jaoua, N. Belkhiter, and T. Moukam. Propriétés des dépendances difonctionnelles dans les bases de données relationnelles. INFormation Systems and Operational Research Journal, 30(1):297–316, 1992.
8. A. Jaoua and S. Elloumi. Galois connection, formal concept and Galois lattice in real binary relation: Applications in a real classifier. Journal of Systems and Software, 60(2):149–163, March 2002.
9. A. Jaoua, H. Ounelli, and N. Belkhiter. Automatic Entity Extraction From an N-ary Relation: Towards a General Law for Information Decomposition. Journal of Systems and Software, pages 216–232, November 1995.
10. R. Khcherif, M. Gammoudi, and A. Jaoua. Using Difunctional Relations in Information Organization. Information Science, 1-4(125):153–166, June 2000.
11. G. W. Mineau and R. Godin. Automatic Structuring of Knowledge Bases by Conceptual Clustering. IEEE Transactions on Knowledge and Data Engineering, 7(5):824–829, 1995.
12. T. Mosaid, F. Hassen, and H. Salah. Conceptual Text Summarization. Senior project, University of Qatar, 2004.


13. H. Ounelli and A. Jaoua. On Fuzzy Difunctional Relations. Journal of Information Sciences, (95):216–232, 1996.
14. J. Riguet. Relations binaires, fermetures et correspondances de Galois. 76:114–145, 1948.
15. G. Schmidt and Ströhlein. Relations and Graphs. Springer Verlag, 1989.
16. R. Wille. Restructuring lattice theory: an approach based on hierarchies of concepts. In Proc. of NATO Advanced Study Institute, Ed. by I. Rival, Reidel Publ. Dordrecht, volume 81, pages 445–470, 1982.
17. S. Ben Yahia, K. Arour, Y. Slimani, and A. Jaoua. Discovery of Compact Rules in Relational Databases. Information Science Journal, 4(3):497–511, October 2000.

Journal on Relational Methods in Computer Science, Vol. 1, 2004, pp. 217 - 234

Received by the editors April 29, 2004, and, in revised form, October 21, 2004.

Published on December 10, 2004.

© A. Jaoua, S. Elloumi, A. Hasnah, J. Jaam, and I. Nafkha, 2004.

Permission to copy for private and scientific use granted.

This article may be accessed via WWW at http://www.jormics.org.