Journal on Relational Methods in Computer Science, Vol. 1, 2004, pp. 217 - 234
Discovering Regularities in Databases Using
Canonical Decomposition of Binary Relations
A. Jaoua1, S. Elloumi2, A. Hasnah1, J. Jaam1, and I. Nafkha2
1University of Qatar, College of Science,
Department of Computer Science.
{jaoua;hasnah;jaam}@qu.edu.qa
2Faculty of Science of Tunis, University of Manar-Tunis
Unité de Recherche en Programmation Algorithmique et Heuristique
{samir.elloumi;ibtissem.nafkha}@fst.rnu.tn
Abstract. Regularities in databases are directly useful for knowledge discovery and data
summarization. As a mathematical background, relational algebra helped for discovering
the main data structures and existing dependencies between the different attributes in a
relational database. Functional, difunctional and other kinds of dependencies in a rela-
tional database describe invariant regular structures that have been used intensively for
database decomposition, and for minimizing redundancy. In this paper, we explain why
“concepts” or “maximal rectangles” should be considered as the atomic regular structure
for decomposing a binary relation which can be useful for different applications. More
specifically, we have noticed experimentally, that “optimal concepts” contain pertinent
information about data that we have exploited positively for machine learning, dynamic
and incremental database organization, text summarization, data reduction, and even
for modeling human thinking. Operators on concepts need to be developed because of
their general usefulness in data and information engineering. In this paper, we propose
to work on a canonical decomposition of binary relations based on two operators f and
g, in order to model some important open problems, for example how to express as an
equation the best conceptual coverage of a binary relation. We first develop an algorithm
to find a conceptual coverage of a binary relation. We then exploit Riguet's difunctional
relation to characterize by an equation all isolated pairs in a binary relation. Using these
isolated pairs iteratively, we find several varieties of efficient solutions for the canonical
decomposition problem.
Keywords: regularities in database, maximal rectangles, concepts, knowledge extraction,
data reduction
Received by the editors April 29, 2004, and, in revised form, October 21, 2004.
Published on December 10, 2004.
© A. Jaoua, S. Elloumi, A. Hasnah, J. Jaam, and I. Nafkha, 2004.
Permission to copy for private and scientific use granted.
1 Introduction
Regular structures in databases have played a major role in database decomposition
and in making explicit some forms of knowledge embedded in data. In this
paper, we attempt a synthesis, or state of the art, of some important regular
structures inside databases. We also raise several questions that do not yet have
definite answers, because the underlying problems are still unsolved, and we
invite other researchers to think about solutions. We are more
concerned with applying relational algebra to problems
in information engineering than with proving new theorems of relational
algebra. More specifically, we generalize the canonical decomposition of difunctional
relations proposed in [7] to more general relations. For that purpose, we give
an approximate algorithm for the canonical decomposition of a relation, and we exploit
Riguet's difunctional relation for the same purpose.
Functional [3] and difunctional dependencies [7] are invariant
structures found in databases; they are used to decompose a database
into sub-schemas in order to minimize redundancy. These dependencies
are general and do not depend on any particular database instance.
However, even after such a decomposition, instances of a database may contain hidden
regularities that we can only discover by looking at the lattice of concepts
[4, 11] embedded in each specific instance. It is known that association rules
can be extracted from this lattice [11]. However, these rules are not general:
if we change the database instance, the knowledge extracted from the
database may also change.
As an illustration, we use the main concepts for automatic text summarization.
We first decompose the text into elementary sub-texts (sentences,
parts of sentences, or sections). Second, we create a binary relation by indexing
each elementary sub-text by the non-empty words it contains (i.e., its words
other than stop words). A possible approach to summarization, which we have tested, is
based on the main concept of the text (i.e., the one associating the maximum number of
sentences with the maximum number of shared indexing words). In this paper we explain
why the strength of this association is measured by the gain of an "optimal concept". In
this way, in [12], we developed a system that generates different summaries, each one
associated with a different optimal concept. Another important problem concerns
the design of good information filters for search engines, whenever we have to
search for web pages sharing specific indexing words. A possible model of this
problem is again a binary relation associating each web page with a list of
indexing words. A conceptual coverage of this binary relation may be used
as a basis for extracting the main web page references from the total space of web
page references. In this way, the user receives only clusters of web pages, at
different levels, in decreasing order of importance.
In our work, after recalling some known regular structures in a database, we
present the conceptual decomposition technique. Among these regular structures,
we first define functional and difunctional dependencies and explain their
applications. As a second, more general and more diverse kind of regularity, we define
maximal rectangles (or concepts) as the atomic information we may extract from
any binary relation. All these regular structures are now used for data mining, as
explained in Section 6. In Section 4, we propose an approximate
algorithm for covering a binary relation with optimal concepts.
In Section 5, we express as an equation the coverage of some kinds of
binary relations by a minimal number of concepts, using Riguet's difunctional
relation, and we derive an algorithm to find a canonical decomposition of some
classes of binary relations. We then explain why the solution needs to be
generalized to more complex binary relations.
2 Definition of Regularities
We use relational algebra to formalize the data space and the regularities we may
discover embedded in these data. We assume that we are able to map our data into
a binary context (i.e., a subset of the Cartesian product of two sets: E, the set of
objects, and P, the set of properties). This hypothesis is not too restrictive, because
we have noticed that most databases may be considered as a binary relation after some
transformations. We also apply the proposed work to data available on
the internet, to documentary databases, and to the tabular data found in most
professional companies. For this kind of database, we can directly obtain a
binary relation linking document (or web page) references to indexing terms.
So all dependencies extracted between the terms of the documentary database
give additional information to users. In all these general cases, we always need
to extract regular associations between attributes in order to make the right decision
in similar situations. We may also exploit regularities to filter information,
keeping only a minimal data size. This kind of application is very useful for search
engines, which should return only a few pertinent web page references for the user
query.
2.1 Functional Dependencies and Their Application
A functional dependency (fd) is the dependency most frequently used in practice,
since the development of Codd’s relational model [3]. Functional dependencies
have been used to minimize redundancy and to normalize the relational database
schema. The universe U of a relational schema is composed of a set of attributes.
Each attribute A has a domain dom(A). An element of dom(A) is denoted by
a, b, etc. We use capital letters such as A, B for single attributes, and X, Y for subsets
of attributes. The union of two subsets X and Y is written as XY. We also
distinguish between a single attribute A and the set {A}. A relation s
defined on the set of attributes $A_1, A_2, \ldots, A_n$ is a subset of the Cartesian product
$dom(A_1) \times \ldots \times dom(A_n)$. We say that s is an instance of the relational schema
S(U). A tuple is an element of s, also called a vector or a sequence of n values
associated with the n attributes. For example, if t = (4, 2, 6) is a tuple of a relation
s defined on the relational schema S(A, B, C), then t[AC] = (4, 6) while t[A] = 4.
In general, t[X] is the restriction of t to the subset of attributes X.
Definition 1 (Functional dependency). Let X and Y be two subsets of attributes
of the universal set U. We say that there exists a functional dependency (fd)
from X to Y if and only if, for any instance s of the relational schema S(U) and
any two tuples $t_1$ and $t_2$ of s, $t_1[X] = t_2[X]$ implies $t_1[Y] = t_2[Y]$. We generally
use the notation $X \to Y$.
Several properties of these dependencies have been defined by Codd, as follows:
Reflexivity rule: If $Y \subseteq X$, then $X \to Y$.
Augmentation rule: If $X \to Y$, then $XZ \to YZ$.
Transitivity rule: If $X \to Y$ and $Y \to Z$, then $X \to Z$.
If $X \to Y$, then X is by definition a key in the relational schema S[XY].
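The condition of Definition 1 can be tested mechanically on a single instance. The following is a minimal sketch, assuming tuples are stored as dictionaries keyed by attribute name; the helper names `project` and `satisfies_fd` and the small instance are illustrative and do not come from the paper. Since a functional dependency must hold for every instance of the schema, such a check can refute a dependency but cannot establish it.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

Row = Dict[str, Hashable]


def project(row: Row, attrs: Sequence[str]) -> Tuple[Hashable, ...]:
    """Return t[X], the restriction of the tuple t to the attributes X."""
    return tuple(row[a] for a in attrs)


def satisfies_fd(instance: List[Row], x: Sequence[str], y: Sequence[str]) -> bool:
    """Check, on ONE instance, that t1[X] = t2[X] implies t1[Y] = t2[Y]."""
    seen: Dict[Tuple[Hashable, ...], Tuple[Hashable, ...]] = {}
    for row in instance:
        key, value = project(row, x), project(row, y)
        # setdefault stores the first Y-value seen for this X-value.
        if seen.setdefault(key, value) != value:
            return False
    return True


# The tuple t = (4, 2, 6) on S(A, B, C) used in the text.
t = {"A": 4, "B": 2, "C": 6}
print(project(t, "AC"))  # (4, 6)

# Hypothetical instance, not taken from the paper.
instance = [
    {"A": 1, "B": 2, "C": 3},
    {"A": 1, "B": 2, "C": 4},
]
print(satisfies_fd(instance, "A", "B"))  # True
print(satisfies_fd(instance, "A", "C"))  # False
```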
2.2 Difunctional Dependencies
Difunctional dependencies [7, 13] are a generalization of functional dependencies.
Let R be a binary relation. R is difunctional if and only if $R \circ R^{-1} \circ R = R$,
where "$\circ$" is the operator for relational composition and $R^{-1}$ is the inverse of
R. A difunctional relation is no more and no less than the union of the Cartesian
products of several pairs of subsets whose domains are pairwise disjoint and whose
codomains are pairwise disjoint.
This general relational equation has been used intensively in software engineering
and has proved to describe a very frequent data structure specifying the link between
inputs and outputs. It was ignored for a long time in data engineering. Its utility
was shown by Nlt Lethan and Jaoua between 1985 and 1992,
under different names (iso-dependencies or regular relations) [7].
Definition 2 (Difunctional dependencies). We say that there is a difunctional
dependency between X and Y if and only if, for any
instance s of S(U), the binary relation R[X, Y] defined by s[X, Y] is difunctional.
Example 1. Consider U = {A, B, C}, and let s be the instance of S(U) given in
Table 1. We can see that the difunctional dependency between A and B holds for s,
because the binary relation R[A, B] is difunctional; on the contrary, the one between
B and C does not hold.
Table 1. An instance of s(U)
A B C
2 3 5
2 4 5
3 3 5
3 4 8
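The claim of Example 1 can be checked directly from the difunctionality equation. This is a minimal sketch, assuming relations are represented as Python sets of pairs; the helper names `compose`, `inverse` and `is_difunctional` are illustrative and not part of the paper.

```python
from typing import Hashable, Set, Tuple

Pair = Tuple[Hashable, Hashable]
Relation = Set[Pair]


def compose(r: Relation, s: Relation) -> Relation:
    """Relational composition: (x, y) such that x r z and z s y for some z."""
    return {(x, y) for (x, z) in r for (z2, y) in s if z == z2}


def inverse(r: Relation) -> Relation:
    """Inverse relation: the set of inverted pairs of r."""
    return {(y, x) for (x, y) in r}


def is_difunctional(r: Relation) -> bool:
    """Test the equation R o R^-1 o R = R."""
    return compose(compose(r, inverse(r)), r) == r


# The instance s of Table 1, as (A, B, C) tuples.
table1 = [(2, 3, 5), (2, 4, 5), (3, 3, 5), (3, 4, 8)]
r_ab = {(a, b) for (a, b, c) in table1}
r_bc = {(b, c) for (a, b, c) in table1}

print(is_difunctional(r_ab))  # True
print(is_difunctional(r_bc))  # False
```

R[A, B] is the single rectangle {2, 3} × {3, 4}, whereas in R[B, C] the values 3 and 4 share the image 5 without having the same set of images.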
Redundancy reduction. Consider a relational schema S(U) and any instance
s of S. Assume that with any s we associate a difunctional binary relation R[X, Y]
(with XY = U), which is the union of maximal rectangles whose projections
are disjoint. So:
$$R[X, Y] = (A_1 \times B_1) \cup (A_2 \times B_2) \cup \ldots \cup (A_i \times B_i) \cup \ldots \cup (A_n \times B_n)$$
with $A_i \cap A_j = B_i \cap B_j = \emptyset$ for $i \neq j$.
It is easy to see that we can reduce redundancy by decomposing R[X, Y] into
two binary relations $R_1[X, C]$ and $R_2[C, Y]$, where C is the attribute class. In
$R_1[X, C]$, with each element of the set $A_i$ we associate the value i. In $R_2[C, Y]$, with
each value j, we associate all elements of the subset $B_j$.
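The decomposition into $R_1[X, C]$ and $R_2[C, Y]$ can be sketched as follows. This is an illustrative sketch only: it assumes the input relation is already difunctional, represents relations as sets of pairs, and numbers the classes 1, 2, ... in sorted order of the domain elements; the function name `canonical_decomposition` and the example relation are hypothetical.

```python
from typing import Dict, FrozenSet, Hashable, Set, Tuple

Pair = Tuple[Hashable, Hashable]


def compose(r: Set[Pair], s: Set[Pair]) -> Set[Pair]:
    """Relational composition of r and s."""
    return {(x, y) for (x, z) in r for (z2, y) in s if z == z2}


def canonical_decomposition(r: Set[Pair]) -> Tuple[Set[Pair], Set[Pair]]:
    """Split a difunctional relation R[X, Y] into R1[X, C] and R2[C, Y]."""
    images: Dict[Hashable, Set[Hashable]] = {}
    for x, y in r:
        images.setdefault(x, set()).add(y)
    class_of: Dict[FrozenSet[Hashable], int] = {}
    r1: Set[Pair] = set()
    r2: Set[Pair] = set()
    for x in sorted(images):  # assumes the domain elements are comparable
        block = frozenset(images[x])  # B_i: in a difunctional relation,
        if block not in class_of:     # elements of A_i share the same images
            class_of[block] = len(class_of) + 1
            r2 |= {(class_of[block], y) for y in block}
        r1.add((x, class_of[block]))
    return r1, r2


# Hypothetical difunctional relation: {a, b, c} x {1, 2, 3} and {d, e} x {4, 5}.
r = {(x, y) for x in "abc" for y in (1, 2, 3)} | {(x, y) for x in "de" for y in (4, 5)}
r1, r2 = canonical_decomposition(r)
print(len(r), len(r1) + len(r2))  # 13 10
print(compose(r1, r2) == r)       # True
```

In this example, 13 pairs are replaced by 10, and composing $R_1$ with $R_2$ gives back the original relation.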
This kind of decomposition is called "canonical decomposition". Experiments
on several databases have shown that such a decomposition can save a significant
amount of memory space. Even when we do not find difunctional dependencies,
we observed that most instances of a database contain a uniform
part which has a difunctional structure.
Although it is more general than functional dependencies and has been generalized
to fuzzy difunctional dependencies [13], this kind of dependency has not been directly
useful in databases, because in most cases attributes in databases do not have such
a uniform structure. However, the most important thing exhibited by a difunctional
relation is the notion of concept, i.e., a maximal Cartesian product included
in a database, also called a maximal rectangle, which underlies rectangular binary
relation decomposition [1]. We discuss this question in the next section.
2.3 Conceptual Dependencies
Definition 3. Let R be a binary relation defined on a set E. The relation $A \times B$,
such that $A \subseteq E$ and $B \subseteq E$, is called a rectangular relation (or rectangle) of R [6,
14, 15]. A is the domain of this relation and B is its codomain (or its range).
In Figure 1, we can see an example of a rectangular relation:
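[Figure 1: bipartite diagram in which every element of A = {x, y, z, t} is linked to every element of B = {u, v, w}.]
Fig. 1. Rectangular relation RE with 12 pairs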
Definition 4. From a memory storage space perspective, the gain which is associated
with a given rectangular relation $RE = A \times B$ is assessed by the following
heuristic function:
$$g(RE) = (\|A\| \times \|B\|) - (\|A\| + \|B\|) \qquad (1)$$
where $\|A\|$ denotes the cardinality of the set A.
Remark 1. This definition is introduced in [2]. The formula is explained by the
fact that a rectangular relation (or rectangle) associates $\|A\|$ values with $\|B\|$
values. So, when we cluster on one side all the values of A, and on the other side
all the values of B, we can replace the $\|A\| \times \|B\|$ direct pairs of RE (Figure 1)
by $\|A\| + \|B\|$ indirect pairs, using an intermediary element i, also called an
extra-symbol, which links any element of A to any element of B, as illustrated by
Figure 2.
[Figure 2: the elements x, y, z, t of A are each linked to the extra-symbol i, which is linked to each of the elements u, v, w of B.]
Fig. 2. A summarized rectangular representation of RE
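As a worked instance of the gain function, take the rectangle of Figure 1, for which $\|A\| = 4$ and $\|B\| = 3$. Reading the gain as the number of direct pairs minus the number of indirect pairs, which is how Remark 1 describes it, we get:

```latex
g(RE) = (\|A\| \times \|B\|) - (\|A\| + \|B\|) = (4 \times 3) - (4 + 3) = 12 - 7 = 5
```

The 12 direct pairs are replaced by 7 indirect pairs through the extra-symbol i, so 5 pairs are saved.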
Definition 5. A rectangle $RE = A \times B$ which contains an element (a, b) of a
binary relation R is said to be optimal if it realizes the maximal gain g(RE) among
all rectangles which contain (a, b).
Remark 2. Searching for the optimal rectangle containing (a, b) is an NP-complete
problem [1, 5]. Several heuristics based on the branch and bound principle
have been implemented and applied to database decomposition [2], object-oriented
system decomposition [6] and data mining [17].
Remark 3. Optimal rectangles have a particular meaning because they represent
the most important data associations. Several rectangles may be optimal, because
they realize the same maximal gain. So, with respect to some equivalence relation,
we can reduce the class of all rectangles with the same gain to a single
representative element.
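Definition 6. [2] A rectangular relation (or rectangle) $RE = A \times B$ is said to be
degenerate if and only if $\|A\| = 1$ (Figure 3a) or $\|B\| = 1$ (Figure 3b).
[Figure 3: (a) a rectangle with A = {y} and B = {u, v, w}; (b) a rectangle with A = {x, y, z} and B = {u}.]
Fig. 3. Examples of degenerate rectangles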
A concept is a maximal rectangle (i.e., a rectangle of R that is not strictly
included in any other rectangle of R). Assume that we have a
binary context R. We are always able to extract all concepts included in R.
Wille proved in [16] that this set of concepts is a complete lattice. This lattice
structure has been used intensively for knowledge extraction from data (i.e.,
dependencies between attributes, or association rules between the terms contained
in a documentary database or in a single document). The importance of the notion
of concept was discovered by scientific groups working on graph theory.
Starting from 1990, we applied it to extract knowledge from data. Another group
working on relational algebra discovered applications of concepts to software and data
decomposition [9], machine learning [8], text summarization and several other
applications. Because of its simple and uniform structure, we increasingly believe
that an atomic piece of information is something like a directed pair of two subsets
(i.e., a complete bipartite sub-graph). So we assume that data are composed
of a set of concepts.
2.4 Gain of a Binary Relation
The gain W(R) of a binary relation R is given by:
$$W(R) = \frac{r}{d \times c} \times (r - (d + c)) \qquad (2)$$
where:
r is the cardinality of R (i.e., the number of pairs in R),
d is the cardinality of the domain of R,
c is the cardinality of the range of R.
Remark 4. The quantity $\frac{r}{d \times c}$ provides a measure of the density of the relation
R. The quantity $r - (d + c)$ is a measure describing how economically the information
is represented. It is a logical extension of the corresponding definition from a
concept to a general relation. This definition is used in the heuristic proposed
in Section 4.
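The gain W is straightforward to compute once a relation is stored as a set of pairs. The sketch below assumes that representation and uses exact fractions to avoid rounding; the function name `gain_w` and the two example relations are illustrative.

```python
from fractions import Fraction
from typing import Hashable, Set, Tuple

Pair = Tuple[Hashable, Hashable]


def gain_w(rel: Set[Pair]) -> Fraction:
    """W(R) = (r / (d * c)) * (r - (d + c)), computed exactly."""
    if not rel:
        raise ValueError("W is undefined for the empty relation")
    r = len(rel)                   # number of pairs
    d = len({x for x, _ in rel})   # cardinality of the domain
    c = len({y for _, y in rel})   # cardinality of the range
    return Fraction(r, d * c) * (r - (d + c))


# The rectangle {x, y, z, t} x {u, v, w} of Figure 1: density 1, so W equals g.
rectangle = {(a, b) for a in "xyzt" for b in "uvw"}
print(gain_w(rectangle))  # 5

# A hypothetical sparse relation: low density and no economy.
sparse = {(1, "u"), (2, "v"), (3, "w")}
print(gain_w(sparse))  # -1
```

For a rectangle the density is 1, so W reduces to the gain of a rectangular relation defined earlier.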
2.5 Elementary Relation (noted PR)
If R is a finite binary relation (i.e., a subset of $E \times F$, where E is a set of objects
and F a set of properties) and $(a, b) \in R$, then the union of the rectangles containing
(a, b) is the elementary relation PR (a subset of R) given by:
$$PR = \Phi_R(a, b) = I(b.R^{-1}) \circ R \circ I(a.R) \qquad (3)$$
where:
I is the identity relation;
$R^{-1}$ is the inverse relation of R (i.e., the set of inverted pairs of R);
"$\circ$" refers to the relative product operator, where:
$$R \circ R' = \{(x, y) \mid \exists z : (x, z) \in R \wedge (z, y) \in R'\} \qquad (4)$$
Let $A \subseteq E$; then $I(A) = \{(a, a) \mid a \in A\}$.
PR is the sub-relation of R pre-restricted by the antecedents of b (i.e., $b.R^{-1}$)
and post-restricted by the set of images of a (i.e., $a.R$). In the next section, we
use such elementary relations PR to find a coverage of a relation by some
"minimal" number of optimal concepts. Note that the problem is NP-complete.
For that reason, we only propose an approximate solution in Section 4, based
on a greedy method using the gain function W.
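The elementary relation can be computed literally as the composition of the identity on the antecedents of b, the relation R, and the identity on the images of a. The following sketch assumes relations are sets of pairs; all function names and the small example relation are illustrative and not taken from the paper.

```python
from typing import Hashable, Set, Tuple

Pair = Tuple[Hashable, Hashable]
Relation = Set[Pair]


def compose(r: Relation, s: Relation) -> Relation:
    """Relative product: (x, y) such that x r z and z s y for some z."""
    return {(x, y) for (x, z) in r for (z2, y) in s if z == z2}


def identity(elements: Set[Hashable]) -> Relation:
    """I(A) = {(a, a) | a in A}."""
    return {(e, e) for e in elements}


def images(rel: Relation, a: Hashable) -> Set[Hashable]:
    """a.R: the set of images of a."""
    return {y for x, y in rel if x == a}


def antecedents(rel: Relation, b: Hashable) -> Set[Hashable]:
    """b.R^-1: the set of antecedents of b."""
    return {x for x, y in rel if y == b}


def elementary_relation(rel: Relation, a: Hashable, b: Hashable) -> Relation:
    """PR = I(b.R^-1) o R o I(a.R) for a pair (a, b) of rel."""
    if (a, b) not in rel:
        raise ValueError("(a, b) must belong to the relation")
    return compose(compose(identity(antecedents(rel, b)), rel), identity(images(rel, a)))


# Hypothetical relation used only for illustration.
R = {(1, "u"), (1, "v"), (2, "u"), (2, "w"), (3, "w")}
print(sorted(elementary_relation(R, 1, "u")))  # [(1, 'u'), (1, 'v'), (2, 'u')]
```

Here the rectangles containing (1, u) are {1} × {u}, {1} × {u, v} and {1, 2} × {u}, and their union is exactly the printed relation.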
3 Conceptual Binary Relation Coverage and Canonical
Decomposition
We may consider any binary relation R as a union of concepts. The problem is
that, among the different possible combinations of concepts covering R, we have
to select the most economical ones in terms of memory. Finding a minimal
coverage of R is an NP-complete problem [5]. For that reason, in [1, 6], we used
approximate algorithms to decompose huge binary relations, based on the
"gain" function given in Definition 4. The problem of finding the optimal rectangle
with a maximum "gain" is also NP-complete [5]. For that reason, we think that
more research is needed on the formal properties
of such coverages, in order to find better approximate methods. An open problem is to
find new efficient algorithms to update an initial conceptual coverage of some binary
relation R when pairs are added to or removed from R. Such research will have
an impact on conceptual data mining systems.
Assume that $\{A_1 \times B_1, A_2 \times B_2, \ldots, A_p \times B_p\}$ is some minimal coverage of the
binary relation R (i.e., $R = A_1 \times B_1 \cup A_2 \times B_2 \cup \ldots \cup A_p \times B_p$). If we define the
two following operators:
$$f(R) = A_1 \times \{c_1\} \cup A_2 \times \{c_2\} \cup \ldots \cup A_p \times \{c_p\} \qquad (5)$$
$$g(R) = \{c_1\} \times B_1 \cup \{c_2\} \times B_2 \cup \ldots \cup \{c_p\} \times B_p \qquad (6)$$
then generally the number of pairs in R is much higher than the number of pairs
in $f(R) \cup g(R)$, while $R = f(R) \circ g(R)$. Here $\{c_1, c_2, \ldots, c_p\}$ are extra-symbols,
different from any element in the domain or range of R, which are created to
represent the different concepts in R. Another open problem is related to
incremental conceptual binary relation transformation: the question is to find an
efficient method to calculate $f(R \cup \{(a, b)\})$, $g(R \cup \{(a, b)\})$, $f(R - \{(a, b)\})$ and
$g(R - \{(a, b)\})$ using only f(R) and g(R). The objective is to keep updating the
conceptual coverage of R through its minimal representation by the two relations
f(R) and g(R), by removing or adding the minimal number of extra-symbols.
Operators f and g might be defined automatically by some relational operator. It
would also be interesting to offer a choice of specific functions with interesting
properties that we could use for mapping binary relations to their canonical forms.
Using this kind of decomposition on many experimental databases, we saved a huge
amount of memory space. In the following two sections, we first propose an approximate
solution (Section 4), and then use the difunctional of Riguet (Section 5), for deriving
a coverage of a binary relation with optimal concepts.
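The two operators can be illustrated on a small coverage. This is a sketch under stated assumptions: the coverage is supplied by the caller as a list of (domain, codomain) pairs, the extra-symbols are the strings "c1", "c2", ..., and the example coverage is hypothetical; none of these choices comes from the paper.

```python
from typing import Hashable, List, Set, Tuple

Pair = Tuple[Hashable, Hashable]
Rectangle = Tuple[Set[Hashable], Set[Hashable]]


def compose(r: Set[Pair], s: Set[Pair]) -> Set[Pair]:
    """Relational composition of r and s."""
    return {(x, y) for (x, z) in r for (z2, y) in s if z == z2}


def f_and_g(coverage: List[Rectangle]) -> Tuple[Set[Pair], Set[Pair]]:
    """Build f(R) and g(R) from a coverage {A_1 x B_1, ..., A_p x B_p}."""
    f: Set[Pair] = set()
    g: Set[Pair] = set()
    for i, (a_set, b_set) in enumerate(coverage, start=1):
        symbol = f"c{i}"  # extra-symbol, assumed distinct from all elements of R
        f |= {(a, symbol) for a in a_set}
        g |= {(symbol, b) for b in b_set}
    return f, g


# Hypothetical coverage with two overlapping rectangles.
coverage = [({"a", "b", "c", "d"}, {1, 2, 3}), ({"c", "d", "e"}, {3, 4, 5})]
R = {(x, y) for a_set, b_set in coverage for x in a_set for y in b_set}
f, g = f_and_g(coverage)

print(len(R), len(f) + len(g))  # 19 13
print(compose(f, g) == R)       # True
```

With small rectangles the representation by f(R) and g(R) can be larger than R itself; the saving comes from rectangles whose gain is positive.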
4 Approximate Algorithm for Canonical Decomposition
In this section we propose an approximate algorithm to find a set of optimal
rectangles that provides a coverage of a given relation R, i.e., an approximate solution
to the canonical decomposition of a binary relation. The algorithm is given in
Figures 9 and 10; here we explain its steps using the following relation R.
Let R be a finite binary relation between two sets, as illustrated below in Figure 4:
[Figure 4: bipartite diagram of the relation R between the elements 1 to 6 and the elements 7 to 12.]
Fig. 4. An example of a binary relation R
Step 1: Divide the relation R into disjoint sub-relations $PR_1, \ldots, PR_n$. Here we have
only one sub-relation (also called elementary relation).
Step 2: For each elementary relation $PR_i$, search for the optimal rectangle which
includes an element (a, b) of $PR_i$.
If $PR_i$ is a rectangle, then it is an optimal rectangle containing (a, b); otherwise, check
whether $PR_i$ contains other elements (X, Y) of the form (a, Y) or (X, b), by trying all
the images of a and all the antecedents of b (see Figure 5).
$$PR_{(1,7)} = \Phi_R(1, 7) = I(7.R^{-1}) \circ R \circ I(1.R)$$
So we search iteratively for the optimal rectangles of $PR_{(1,7)}$ which
successively contain the elements (1,8), (1,9), (1,11), (2,7), and (3,7).
First Iteration: from the five elementary relations of the above-mentioned
elements, select the first that gives a maximal gain W, as defined in Section 2.
Indeed, the relation with the maximum gain represents the best
compromise between density and information economy.
[Figure 5: the relation R with the sub-relation $PR_{(1,7)}$ highlighted; it relates $7.R^{-1}$ = {1, 2, 3} to $1.R$ = {7, 8, 9, 11}.]
Fig. 5. Elementary relation $\Phi_R(1, 7) = PR_{(1,7)}$
1. $PR'_{1,8} = \Phi_{PR_{1,7}}(1, 8)$; $W(PR'_{1,8}) = 0$
2. $PR'_{1,9} = \Phi_{PR_{1,7}}(1, 9)$; $W(PR'_{1,9}) = 7/8$ (selected)
3. $PR'_{1,11} = \Phi_{PR_{1,7}}(1, 11)$; $W(PR'_{1,11}) = 7/8$
4. $PR'_{2,7} = \Phi_{PR_{1,7}}(2, 7)$; $W(PR'_{2,7}) = 0$
5. $PR'_{3,7} = \Phi_{PR_{1,7}}(3, 7)$; $W(PR'_{3,7}) = 7/9$
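The five gains of the first iteration can be recomputed. The sketch below relies on an assumption: since the drawing of R is not reproduced here, R is rebuilt as the union of the four rectangles that this section reports as its final coverage ({1, 3} × {7, 9, 11}, {5, 6} × {11, 12}, {3, 4} × {9, 10} and {1, 2} × {7, 8}). The helper names are illustrative.

```python
from fractions import Fraction
from typing import Hashable, Set, Tuple

Pair = Tuple[Hashable, Hashable]
Relation = Set[Pair]


def elementary(rel: Relation, a: Hashable, b: Hashable) -> Relation:
    """Restriction of rel to the antecedents of b and the images of a."""
    dom = {x for x, y in rel if y == b}
    cod = {y for x, y in rel if x == a}
    return {(x, y) for x, y in rel if x in dom and y in cod}


def gain_w(rel: Relation) -> Fraction:
    """W(R) = (r / (d * c)) * (r - (d + c))."""
    r = len(rel)
    d = len({x for x, _ in rel})
    c = len({y for _, y in rel})
    return Fraction(r, d * c) * (r - (d + c))


# Assumed reconstruction of R from the coverage reported in this section.
concepts = [({1, 3}, {7, 9, 11}), ({5, 6}, {11, 12}), ({3, 4}, {9, 10}), ({1, 2}, {7, 8})]
R = {(x, y) for dom, cod in concepts for x in dom for y in cod}

pr_17 = elementary(R, 1, 7)
candidates = [(1, 8), (1, 9), (1, 11), (2, 7), (3, 7)]
gains = {p: gain_w(elementary(pr_17, *p)) for p in candidates}
for pair, w in gains.items():
    print(pair, w)

# max() returns the first maximal candidate, as in the first iteration.
selected = max(candidates, key=lambda p: gains[p])
print("selected:", selected)  # selected: (1, 9)
```

Under this reconstruction the computed gains are 0, 7/8, 7/8, 0 and 7/9, which agrees with the list above.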
The selected elementary relation $PR'_{1,9}$ is not a rectangle, so the algorithm
continues on the already selected elements, i.e., (1,7) and (1,9), as shown in Figure 6.
[Figure 6: the sub-relation $PR'_{(1,9)}$, drawn over the elements 1, 2, 3 and 7, 8, 9, 11.]
Fig. 6. Elementary relation $PR'_{(1,9)}$
Second Iteration: Search now for the optimal rectangles of $PR'_{(1,9)}$ that
successively contain the elements (1,8), (1,11), and (3,9). This step provides three
elementary relations:
1. $PR''_{1,8} = \Phi_{PR'_{1,9}}(1, 8)$; $W(PR''_{1,8}) = -1$
2. $PR''_{1,11} = \Phi_{PR'_{1,9}}(1, 11)$; $W(PR''_{1,11}) = 7/8$
3. $PR''_{3,9} = \Phi_{PR'_{1,9}}(3, 9)$; $W(PR''_{3,9}) = 1$ (selected)
$PR''_{3,9}$ is a rectangle, so it is an optimal one that contains the element (1,7) of R.
Figures 7 and 8 illustrate the iterations of the search for the optimal
rectangle.
[Figure 7: the rectangle $PR''_{(3,9)}$, drawn over the elements 1, 2, 3 and 7, 9, 11.]
Fig. 7. Elementary relation $PR''_{(3,9)}$
[Figure 8: search tree. First iteration: $PR'_{3,7}$ (7/9), $PR'_{1,8}$ (0), $PR'_{1,11}$ (7/8), $PR'_{2,7}$ (0), $PR'_{1,9}$ (7/8). Second iteration, below $PR'_{1,9}$: $PR''_{1,8}$ (-1), $PR''_{1,11}$ (7/8), $PR''_{3,9}$ (1).]
Fig. 8. The search tree for optimal concept
In bold you can see the selected elementary relation at each level of the search
tree. Each level is associated with an iteration of the proposed algorithm. The
proposed algorithm is polynomial (Figures 9 and 10). When we find an optimal
rectangle, we continue to search for the next optimal one containing another pair
not already selected. Here, if we select the pair (6,12), we find at the first iteration
the concept $PR_{6,12} = \{5, 6\} \times \{11, 12\}$. Then, if we select the pair (4,10), we obtain
the concept $PR_{4,10} = \{3, 4\} \times \{9, 10\}$. Finally, if we select the pair (2,8), we obtain
the concept $PR_{2,8} = \{1, 2\} \times \{7, 8\}$. The selected coverage is composed of
$\{PR''_{3,9}, PR_{6,12}, PR_{4,10}, PR_{2,8}\}$.
(int s, int w) Optimal_Rectangle (Relation R)
Problem: Determine the optimal rectangle of a binary relation R
Inputs: A binary relation R[][], pair (s,w)
Outputs: The optimal rectangle in R containing the pair (s, w).
Begin
  Let R[m][n] be the binary relation of m sentences and n keywords.
  Emax = 0; // The maximum searched gain in R (W(R)), initialized to 0
  For s = 0 to m-1
    For w = 0 to n-1
      If R[s, w] != 0
      Then
        PR = I(w.R^-1) o R o I(s.R); // calculating the elementary relation of (s, w)
        E = economy(PR);
        If E > Emax
        Then
          Emax = E;
          Highest = PR; // Highest is the elementary relation of maximal gain
        End if
      End if
    End for
  End for
  If Highest is not a rectangle // r != c*d
  Then
    // Optimal_Rectangle starting from relation Highest,
    // corresponding to the next level in the search tree
    Optimal_Rectangle(Highest);
  End if
End.
Fig. 9. Algorithm calculating an optimal rectangle in a binary relation R
Problem: Determine the economy of a binary relation
Inputs: A binary relation R
Outputs: The economy W(R)
Begin
  Let R[m][n] be the binary relation of m sentences and n keywords.
  Let r be the number of pairs in R.
  Let d be the cardinality of the domain of R.
  Let c be the cardinality of the codomain of R.
  Return (r/(d*c))*(r-(d+c))
End.
Fig. 10. Computation of the economy of a binary relation
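The greedy search of this section can be sketched in runnable form. This is not the authors' implementation but a simplified variant under three stated assumptions: the search starts from a given seed pair, as in the worked example; at each level the candidates are the pairs sharing a row or a column with the current pair; and candidates whose elementary relation does not shrink the current one are skipped, a rule added here so that the loop always ends on a rectangle. The relation R is rebuilt from the four rectangles of the reported coverage, and all function names are illustrative.

```python
from fractions import Fraction
from typing import List, Set, Tuple

Pair = Tuple[int, int]
Relation = Set[Pair]


def elementary(rel: Relation, a: int, b: int) -> Relation:
    """Restriction of rel to the antecedents of b and the images of a."""
    dom = {x for x, y in rel if y == b}
    cod = {y for x, y in rel if x == a}
    return {(x, y) for x, y in rel if x in dom and y in cod}


def gain_w(rel: Relation) -> Fraction:
    """Economy W(R) = (r / (d * c)) * (r - (d + c))."""
    r, d, c = len(rel), len({x for x, _ in rel}), len({y for _, y in rel})
    return Fraction(r, d * c) * (r - (d + c))


def is_rectangle(rel: Relation) -> bool:
    """A relation is a rectangle when r = d * c."""
    return len(rel) == len({x for x, _ in rel}) * len({y for _, y in rel})


def optimal_rectangle(rel: Relation, seed: Pair) -> Relation:
    """Greedy descent from the elementary relation of the seed pair."""
    a, b = seed
    pr = elementary(rel, a, b)
    while not is_rectangle(pr):
        best, best_rel = None, None
        # Candidates: pairs (a, Y) with Y != b, or (X, b) with X != a.
        for p in sorted(q for q in pr if (q[0] == a) != (q[1] == b)):
            sub = elementary(pr, *p)
            if len(sub) == len(pr):
                continue  # assumption: ignore candidates that do not shrink pr
            if best_rel is None or gain_w(sub) > gain_w(best_rel):
                best, best_rel = p, sub  # strict ">" keeps the first maximum
        (a, b), pr = best, best_rel
    return pr


def greedy_coverage(rel: Relation, seeds: List[Pair]) -> List[Relation]:
    """Cover rel with rectangles, trying the given seeds first."""
    coverage: List[Relation] = []
    covered: Relation = set()
    for seed in seeds + sorted(rel):
        if seed in rel and seed not in covered:
            rect = optimal_rectangle(rel, seed)
            coverage.append(rect)
            covered |= rect
    return coverage


# Assumed reconstruction of the example relation R.
concepts = [({1, 3}, {7, 9, 11}), ({5, 6}, {11, 12}), ({3, 4}, {9, 10}), ({1, 2}, {7, 8})]
R = {(x, y) for dom, cod in concepts for x in dom for y in cod}
coverage = greedy_coverage(R, [(1, 7), (6, 12), (4, 10), (2, 8)])
for rect in coverage:
    print(sorted({x for x, _ in rect}), sorted({y for _, y in rect}))
```

With the seeds used in the text, the sketch returns the four rectangles of the reported coverage, in the same order.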
5 Relational Calculus for Conceptual Coverage
Extraction
5.1 Relational Calculus with the Difunctional of Riguet
Is it possible to find a specific coverage of a binary relation R by a minimal
number of concepts using relational methods? The answer is that this is possible
if each concept of the coverage contains at least one isolated pair of R. An isolated
element is, by definition, an element which belongs to only one concept c in R.
In this case, the concept c belongs to any conceptual coverage of R. Fortunately, in
[10] Khcherif et al. proved that we can extract all existing isolated elements by
calculating the following difunctional $R_d$ proposed by Riguet in 1995:
$$R_d = \overline{R \circ \overline{R}^{-1} \circ R} \cap R \qquad (7)$$
Here, $\overline{R}$ is the complement of R. From the domain $D_i$ of each rectangle of
$R_d$, we find a concept by using the Galois connection operators $f_1$ and $f_2$, where
$S_i = f_1(D_i)$ gives the set of all common images of $D_i$, and $N_i = f_2(S_i)$ gives
all common antecedents of the elements belonging to $S_i$, with respect to the relation R.
The concept $N_i \times S_i$ is included in R and belongs to any possible conceptual
coverage of R. The following example (Figure 11) shows how to find
the conceptual coverage of R:
Relation R:
    A  B  C  D  E
1   1  0  0  1  0
2   0  1  1  0  0
3   1  0  0  1  0
4   1  1  1  1  0
5   1  0  0  0  1
6   1  0  0  0  0
Relation $R_d$:
    A  B  C  D  E
1   0  0  0  1  0
2   0  1  1  0  0
3   0  0  0  1  0
4   0  0  0  0  0
5   0  0  0  0  1
6   1  0  0  0  0
Fig. 11. Calculation of $R_d$
Then from $R_d$, using the Galois connection operators $f_1$ and $f_2$ on R, we find
the following conceptual coverage of R by the concepts $C_1$, $C_2$, $C_3$ and $C_4$, where
$C_1 = \{1, 3, 4\} \times \{A, D\}$, $C_2 = \{2, 4\} \times \{B, C\}$, $C_3 = \{5\} \times \{A, E\}$, and
$C_4 = \{1, 3, 4, 5, 6\} \times \{A\}$.
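The computation of this example can be reproduced with a few set operations. The sketch below rests on an assumption about the formula: the difunctional of Riguet is read as the part of R that lies outside the composition of R, the inverse of the complement of R, and R. The relation R is the one of Figure 11; the function names are illustrative.

```python
from typing import FrozenSet, Hashable, Set, Tuple

Pair = Tuple[Hashable, Hashable]
Relation = Set[Pair]

ROWS = {1: "AD", 2: "BC", 3: "AD", 4: "ABCD", 5: "AE", 6: "A"}
OBJECTS: Set[Hashable] = set(ROWS)
ATTRS: Set[Hashable] = set("ABCDE")
R: Relation = {(x, y) for x, cols in ROWS.items() for y in cols}


def compose(r: Relation, s: Relation) -> Relation:
    """Relational composition of r and s."""
    return {(x, y) for (x, z) in r for (z2, y) in s if z == z2}


def riguet_difunctional(r: Relation) -> Relation:
    """Assumed reading of the formula: pairs of r outside r o comp(r)^-1 o r."""
    complement = {(x, y) for x in OBJECTS for y in ATTRS} - r
    inv_complement = {(y, x) for (x, y) in complement}
    return r - compose(compose(r, inv_complement), r)


def f1(r: Relation, xs: Set[Hashable]) -> Set[Hashable]:
    """Common images of all elements of xs."""
    return {y for y in ATTRS if all((x, y) in r for x in xs)}


def f2(r: Relation, ys: Set[Hashable]) -> Set[Hashable]:
    """Common antecedents of all elements of ys."""
    return {x for x in OBJECTS if all((x, y) in r for y in ys)}


def concepts_from_isolated_pairs(r: Relation) -> Set[Tuple[FrozenSet, FrozenSet]]:
    """One concept N_i x S_i per rectangle of the difunctional R_d."""
    rd = riguet_difunctional(r)
    found = set()
    for x in {x for x, _ in rd}:
        row = {y for (x2, y) in rd if x2 == x}
        # D_i: the objects having the same images as x in R_d.
        d_i = {x2 for x2 in OBJECTS if {y for (x3, y) in rd if x3 == x2} == row}
        s_i = f1(r, d_i)
        n_i = f2(r, s_i)
        found.add((frozenset(n_i), frozenset(s_i)))
    return found


for n_i, s_i in sorted(concepts_from_isolated_pairs(R), key=lambda c: sorted(c[0])):
    print(sorted(n_i), sorted(s_i))
```

Under this reading, the isolated pairs are (1, D), (3, D), (2, B), (2, C), (5, E) and (6, A), and the four concepts obtained are the ones listed above.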
5.2 An Open Problem
The relational calculus using the difunctional of Riguet gives the optimal coverage
of only some kinds of binary relations. The reason is that not all concepts of a
binary relation contain isolated elements (i.e., some or all elements in R belong
to more than one concept). In that case, we can remove the initial isolated elements
from R, and then calculate $R'_d$ in the remaining relation $R'$. We repeat this
last step until we find the coverage of R. The problem is that $R_d$ may be empty.
In that case, the problem is to find by some other relational calculation the most
economical conceptual coverage of R.
6 Optimal Concepts and Applications
A machine learning system may be considered as a continual reorganization of
concepts. What do we mean by the central ideas held in a computer "mind"? How
do we optimize storage space by continuously creating new symbols that replace
unorganized associations between existing symbols? Here we could define several
ways of associating new objects with the space of symbols. We generally
associate new objects with the central idea which optimizes the total storage space. So
learning may be considered as an optimization task, looking for the maximum
stability obtained by always giving priority to the most economical concepts.
In the previous Sections 3 and 4, we have defined two operators f and g, which map
a binary relation R into its economical form (f(R), g(R)). But here we can notice
that f and g are not defined in the same direction: f associates
each element a of the domain of R with all the symbols representing the
concepts to which a belongs, whereas g associates each symbol c representing a
concept C with all the elements of the range of R belonging to the range of C.
7 Application of Canonical Decomposition for Text
Summarization and Improving Search Engines
The idea is to extract a summary from a document. For that purpose, we first
decide how to decompose the text: into chapters, sections, or sentences.
We may then ask the user whether he or she wants to obtain a summary automatically
or to extract association rules from the text. Assume that the user decides
that a sentence is the atomic structure of the text, which we are not allowed to
change or reduce. The proposed system, already implemented in January 2004, will
first create a binary context R, where objects are sentence numbers (identified by
their position in the text) and properties are words (each word is likewise identified
by a position in a hash table). By definition, (i, j) belongs to R if and only if
word number j belongs to sentence number i or is very similar to another word in
sentence number i. Empty words (stop words) are not considered: a table of empty
words is consulted first. The crucial question of how to recognize that
two words are similar has been resolved in an approximate way. We assume
that two words are similar if they contain a longest common subsequence whose
relative size is greater than some value p close to 1. When we decrease the
value of p, more words are considered similar, which of course has an impact on the
quality of the summary. As a next step, starting from the binary context R, we use
the method proposed in the previous section to find the optimal concept. We then
select all sentences in the domain of this concept to generate a summary. If the
user would like more detail about the document, he or she may ask for the
next optimal concept and obtain, in the same way, complementary information.
We can repeat this until the entire document is covered. We built a system
for experimentation on many documents, and we are generally satisfied with
the selected sentences. We think that our system lends itself to several
improvements. The same method may be used for improving search engines, by
first selecting the documents corresponding to the optimal concept.
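The construction of the binary context R described in this section can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the system implemented by the authors: the function names `lcs_length`, `similar` and `build_context`, the tokenization by a regular expression, the default threshold `p=0.8`, and the choice of measuring the longest common subsequence relative to the longer of the two words are all assumptions made for illustration. The search for the optimal concept (previous section) is not reproduced here.

```python
import re

def lcs_length(u, v):
    """Length of the longest common subsequence of the strings u and v."""
    prev = [0] * (len(v) + 1)
    for cu in u:
        cur = [0]
        for j, cv in enumerate(v, start=1):
            cur.append(prev[j - 1] + 1 if cu == cv else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def similar(u, v, p):
    """Assumed similarity test: LCS length relative to the longer word exceeds p."""
    return u == v or lcs_length(u, v) / max(len(u), len(v)) > p

def build_context(sentences, empty_words, p=0.8):
    """Return (words, R): words[j] is word number j, and (i, j) is in R when
    word j occurs in sentence i or is similar to a word of sentence i."""
    # Split each sentence into lower-case words and drop the empty (stop) words.
    tokens = [[w for w in re.findall(r"[a-z]+", s.lower()) if w not in empty_words]
              for s in sentences]
    index = {}  # hash table: word -> word number
    for sentence in tokens:
        for w in sentence:
            index.setdefault(w, len(index))
    words = list(index)
    R = {(i, j)
         for i, sentence in enumerate(tokens)
         for j, w in enumerate(words)
         if any(similar(w, v, p) for v in sentence)}
    return words, R

sentences = ["Concepts cover relations.",
             "A concept is a rectangle.",
             "The weather is nice."]
words, R = build_context(sentences, empty_words={"a", "is", "the"})
```

With these hypothetical sentences, "concepts" and "concept" are treated as similar, so sentence 1 is related to the word "concepts" although it does not contain it literally. Lowering `p` relates more words to each sentence, which is the effect on summary quality mentioned in the text.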
8 Conclusion
Most of the research on relational methods in data mining should concentrate on
studying different properties of regular structures in binary relations. Algorithms
related to graph theory for incremental conceptual restructuring should also
be improved so that they can be used as a model for machine learning and classification.
Properties of the operators f and g defined in Section 3 should be studied in depth in
the future, to provide fundamental bases for database organization and for improving
the quality of current search engines by structuring information. An important question
is to find the canonical decomposition of f(R) and g(R). This generalization
requires different heuristics for economical decomposition, for example
associating with each extra element a weight, very probably equal to the gain of
the concept it represents. Canonical decomposition may be generalized
to fuzzy concepts, to deal with imprecision. In the future, we need to discover
some hidden invariant rules, i.e., rules that hold even if we change the database instance.
Relational studies must be pursued to find more efficiently the common
association rules of different data instances with incremental approaches. Cooperative
information retrieval and knowledge extraction need more and more studies of
regular structures using intersection, union, or join merging operators [13]. The
question is now to study different kinds of interactions between these concepts
(i.e., operations such as union, intersection, or composition). Also, assume that we
want to merge concepts arriving from different sides: how do we reorganize the
space of concepts? We should be able to organize it incrementally into a minimal
number of merged and transformed new concepts. If we assume that our data
is organized as a union of equally overlapped concepts, is there a mathematical
relational structure more general than difunctional relations? What are the main
categories of a uniform space of concepts? Finally, is it possible to consider that
thinking is a continual reorganization of regular structures into other optimized
regular concepts?
Acknowledgment: We would like to thank all the anonymous reviewers for their
valuable comments.
References
1. K. Arour, A. Jaoua, H. Ounelli, and N. Belkhiter. Rectangular decomposition of n-ary relations.
In Proc. of the 7th SIAM Conference on Discrete Mathematics, Albuquerque, New Mexico,
June 1994.
2. N. Belkhiter, C. Bourhfir, M. M. Gammoudi, A. Jaoua, N. Lethan, and N. Reguig. Décomposition
rectangulaire optimale d’une relation binaire: application aux bases de données documentaires.
INFormation Systems and Operational Research Journal, 32(1):33–54, 1994.
3. C. J. Date. An Introduction to Database Systems Vol I. Addison Wesley, 1987.
4. B. Ganter and R. Wille. Formal Concept Analysis. Springer-Verlag, Heidelberg, 1999.
5. M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of
NP-Completeness. W. H. Freeman, 1979.
6. A. Jaoua, J. M. Beaulieu, N. Belkhiter, A. C. Debaque, J. Desharnais, R. Lelouche, T. Moukam,
and M. Reguig. Rectangular decomposition of object-oriented software architectures. In Proc. of
the 14th Int. Conf. on Soft. Eng. (ICSE 14), Melbourne, Australia, May 1992.
7. A. Jaoua, N. Belkhiter, and T. Moukam. Propriétés des dépendances difonctionnelles dans les bases
de données relationnelles. INFormation Systems and Operational Research Journal, 30(1):297–316,
1992.
8. A. Jaoua and S. Elloumi. Galois connection, formal concept and Galois lattice in real binary
relation: Applications in a real classifier. Journal Systems and Software, 60(2):149–163, March
2002.
9. A. Jaoua, H. Ounelli, and N. Belkhiter. Automatic Entity Extraction From an N-ary Relation:
Towards a General Law for Information Decomposition. Journal Systems and Software, pages
216–232, November 1995.
10. R. Khcherif, M. Gammoudi, and A. Jaoua. Using Difunctional Relations in Information
Organization. Information Science, 1-4(125):153–166, June 2000.
11. G. W. Mineau and R. Godin. Automatic Structuring of Knowledge Bases by Conceptual Clustering.
IEEE Transactions On Knowledge and Data Engineering, 7(5):824–829, 1995.
12. T. Mosaid, F. Hassen, and H. Salah. Conceptual Text Summarization. Senior project, University
of Qatar, 2004.
13. H. Ounelli and A. Jaoua. On Fuzzy Difunctional Relations. Journal of Information Sciences,
(95):216–232, 1996.
14. J. Riguet. Relations binaires, fermetures et correspondances de Galois. 76:114–145, 1948.
15. G. Schmidt and Ströhlein. Relations and Graphs. Springer Verlag, 1989.
16. R. Wille. Restructuring lattice theory: an approach based on hierarchies of concepts. In Proc. of
NATO Advanced Study Institute, Ed. by I. Rival, Reidel Publ. Dordrecht, volume 81, pages 445–470,
1982.
17. S. Ben Yahia, K. Arour, Y. Slimani, and A. Jaoua. Discovery of Compact Rules in Relational
Databases. Information Science Journal, 4(3):497–511, October 2000.
Journal on Relational Methods in Computer Science, Vol. 1, 2004, pp. 217 - 234
Received by the editors April 29, 2004, and, in revised form, October 21, 2004.
Published on December 10, 2004.
© A. Jaoua, S. Elloumi, A. Hasnah, J. Jaam, and I. Nafkha, 2004.
Permission to copy for private and scientific use granted.
This article may be accessed via WWW at http://www.jormics.org.