Specific Characteristics of Computer Criminal
Offenses With Regard to the Law Regulations
Jelena D. Matijasevic1 and Zaklina S. Spalevic 2
Abstract – There are different categories of perpetrators of
computer criminal activity. This paper puts focus on the profile
of a hacker – a perpetrator of computer criminal activity who is
not motivated by financial gain. It also deals with current
classifications of hackers and emphasizes their important
characteristics and principles of hacker ethics. The paper also
gives a review of judiciary in The Republic of Serbia in this
area. It is clear that a society can adequately confront a negative
phenomenon only if all of its characteristics and specificities are
recognized.
Keywords – Computer criminal activity, Hackers, Computer
Computers represent one of the most important and the
most revolutionary achievements of development of technical
and technological civilization. There is no single sphere of
life, from production, trade and service provision to the
national defense and security in the widest sense in which
computer does not have practical application. Nowadays we
are all aware of the enormous significance of computer use in
contemporary societies and of the fact that there is not a single
area of human activity in which computers are not being used.
However, it is a rather devastating conclusion that there has
hardly been a single technical and technological accomplishment
that has not been misused in various ways. The phases of
development in which an invention was susceptible to
misuse, the groups of persons who committed such acts and the
different intents behind the misuse all represent its specific characteristics.
Growing use of computer technology causes an increase in
computer criminal activities, as a new form of criminality in
the contemporary society, and a development of its diverse
forms. Computer technology is developing very quickly, at
the same pace with education and training of persons who
intend to misuse it. The press abounds with reports of a person
(or a group) who penetrated an important government computer
system and not only acquired specific data, but also created the
possibility of endangering or activating systems such as the
nuclear arsenals of the great world powers. This phenomenon is not
only a characteristic of the developed western countries – it is
increasingly becoming typical of the Balkans area [1].
There are different categories of perpetrators of computer
criminal acts, with respect to a variety of criminal acts that
they commit and considering motives which impel them to
engage in such activities.
In fact, motive is an important indicator of many classic
forms of criminality, and even computer criminal acts. Motive
as a clue becomes prominent in setting up versions of
suspected persons, regardless of whether it is a case of a
single perpetrator or a case of a group of perpetrators, where
the method of elimination is used so as to remove suspicion
from innocent persons [2].
Obtaining illegal financial gain by committing computer
criminal acts is one of the most common motives found in
perpetrators of these criminal acts. However, this motivation
can be induced by various wishes of the perpetrator, such as
unjustified gain, possibility of repaying a debt, an adequate
status in society, satisfying certain personal vices and the like.
Revenge, inferiority complex, economic competition, the
desire for self-approval and achieving a certain success, as
well as envy, hatred, jealousy, enthusiasm for one’s own
knowledge and skills and even political motives in some cases
can all be possible motives for committing computer criminal
acts [3].
There is a general division of perpetrators of such acts into
malicious ones, who commit crime so as to obtain financial
gain or simply cause damage, and perpetrators who are
motivated neither by obtaining gain nor by causing damaging
consequences, but simply find pleasure in unauthorized
penetration into a well-secured information system.
Malicious perpetrators of computer crimes are mostly
motivated by greed. Data from practice indicate a definite set
of characteristics that form their criminal profile: about 80%
of them are first-time offenders, 70% of them have been
working for more than five years for the company which is the
damaged party; they belong to the age group below 30; they
are mostly male, highly intelligent; they generally have
several years of business experience and are considered as
conscientious workers that don’t cause any problems while
fulfilling their work tasks; their degree of technical
competence surpasses technical qualifications required for
their work position; the perpetrators do not consider
themselves thieves or criminals in general, but merely borrowers.
Computer criminal acts motivated by greed are very
common in banking, financial corporations and insurance
companies. Statistical data on the perpetrators of computer
crime in the area of banking indicate the most common
occupations of the perpetrators: 25% are persons who have
special authorizations and responsibilities for IT systems; 18%
are computer programmers; 18% are employees who have
access to the terminals; 16% are cashiers; 11% are
operators – informaticists; and 12% are persons outside the
affected corporation, including the service users [4].
1Jelena D. Matijaševic, Faculty of Law, Business Academy, Geri Karolja 1, 21000 Novi Sad, The Republic of Serbia, E-mail:
2Zaklina S. Spalevic, Faculty of Law, Business Academy, Geri Karolja 1, 21000 Novi Sad, The Republic of Serbia, E-mail:
The second group of perpetrators of computer criminal acts
find deep pleasure in the very act of breaking into well-secured
IT systems. The higher the security of a system, the greater the
challenge to engage in such activities.
Here we are dealing with so-called hackers, who break into
other people’s computer systems using their computer
knowledge and a modem [5].
Regarding professional affiliation, they are usually computer
programmers, operators or highly qualified informaticists, and
sometimes they are just people who keep computers as a hobby.
Given the fact that this second group of perpetrators of
computer criminal activity attracts a lot of attention, causes
much controversy and mixed reactions, and that even the
computer networks of governments of modern countries have been
targets of these perpetrators, we will further examine the
hacker profile in the following text.
The word ’hacker’ is very often used in a negative context
today, without trying to grasp the essence of activities of these
persons, or to adequately analyze the reasons and
consequences of their activities.
The term ’hacker’ is in its original meaning used to denote a
person who deals with research of computer potential and its
positive application in everyday life. Hackers are highly
intelligent people who explore what is hidden in hardware and
software. In simpler terms, they locate something hidden from
the public or find randomly made mistakes [6].
Hackers still remain a sort of enigma to the worlds of
psychology and sociology. Understanding hackers’ development
and motivation has become one of these disciplines’ areas of interest.
Different authors approach hackers in different ways and
analyze them from various viewpoints. The dominant attitude
nowadays is the one provided in the explanation of the term
‘hacker’ – namely, that hackers look for errors in programs and
then inform the public about them, so that the manufacturer of the
given program can rectify the error and the public can
take the necessary steps to protect themselves in time.
With the purpose of ensuring a more positive approach to
the term ‘hacker’, data has been supplied indicating that more
than 10,000 errors have been found so far and that about five
new errors are found daily.
There are several criteria for hacker classification.
According to the criterion of respect for ethics, there is a
division into the following types of hacker:
- White Hat Hackers – These hackers respect hacker ethics;
they deal with computer system and network protection. They
try to improve protection of the information system, so as to
avoid penetration into it and causing damage. They are
typically hired by companies to break into a system and then
inform the owner how it was done and how to improve the protection.
- Black Hat Hackers – These hackers do not hesitate to steal
and destroy data in networks and systems they penetrate into.
They interpret the hacker ethics in a way they see fit. The
principle that all information should be free grants them an
excuse to enter into other people’s systems. They often
destroy a part of the system. Creating and distributing viruses
and worms which damage computers also belongs to their activities.
- Gray Hat Hackers – are somewhere in between Black and
White Hats. They wish to be distinguished from the security
testers of a company on one hand, and to disassociate
themselves from the negative image of Black Hats on the other. These are
mostly hackers who initially violated hacker ethics and then
used the acquired knowledge according to all the rules of the hacker ethics.
Another more detailed and precise division, where level of
computer skills, sphere of interest and ethical rules are taken
as a criterion, distinguishes between the following types of hackers:
Old school hackers – Persons who have dealt with
computers from their very emergence belong to this
group. These hackers could only rely on themselves
in terms of learning about computers, because they
were the first to engage in research of computers and
their possibilities. Copyright protection was
unknown to old school hackers. Their favorite
activity was to read other people’s programs, to
modify and expand their possibilities.
Phreakers – They are hackers whose narrow
specialty includes theft of dial impulses, conducting
international calls at the expense of another person
and all activities related to telephone traffic.
Crackers – The major preoccupation of a cracker is
the safety of computer systems. Their main activities
include breaking into other people’s computers.
There is a clear difference between crackers and
typical hackers – hackers find loopholes in computer
systems in order to patch them, whereas crackers use
such flaws to cause damage.
Warezdoodz – They specialize in editing programs,
finding serial numbers and their illegal distribution to
the users. They are at the top of the piracy chain. The
activities of Warezdoodz directly violate copyright
laws; they contribute to illegal distribution and
copying of programs.
Hacktivists – Hacktivists use their hacking skills to
promote political ideology, and thus interpret hacker
ethics in their own manner, in the sense that hacking
for political goals is not contrary to the ethics.
Using computers to achieve political goals was an
exception in the past and it did not attract much attention.
However, the development of technology has given computers
unlimited access to the world of politics and
administration, which is why the term ‘cyber war’ has often
been mentioned lately. The latest example of hacktivism is the
so-called cyber war between Serbian and Albanian hackers,
which started in August 2008. Moreover, Serbian and
Croatian hackers fought a fierce hacking battle in 2004,
crashing websites of TV stations, sports clubs and
faculties [7].
Each classification of hackers is conditional and by no
means final. It is difficult to determine the exact boundaries
for some types of hackers; some types are intertwined, while
some cannot function alone, without another type. In any case,
the existing classifications provide us with a better overview
for analysis of specific characteristics of this group of
perpetrators of computer criminal acts.
If you are a good hacker, everyone knows you; if you are
the best, nobody does! Although there are a variety of
prejudices against hackers, it is clear that all hackers share the
following features (based on different analyses of this specific
group of perpetrators of computer crimes): a high IQ,
consuming curiosity and the ease of intellectual abstraction.
They have an increased ability to absorb knowledge and they
pay attention to a variety of details which are irrelevant to the
“ordinary people”. Hackers are not interested in just one area;
on the contrary: they tend to be involved in any subject that
stimulates intellectual effort. On the other hand, hackers are
afraid of control and do not want to deal with anything
binding or authoritative. Similarly, they have no ability of
emotional identification with other people, according to many
authors. They often tend to be arrogant and impatient with
people or things they believe are wasting their time.
Still, there is one thing some of them are exceptionally
good at – social engineering. Social engineering denotes the
ability to obtain confidential information by manipulating
people. It is most often practiced by telephone or over the Internet and it
makes people reveal their confidential information (such as
passwords used to access accounts and credit card numbers)
or do illegal things [7].
Hackers are often completely disorganized and clumsy
when it comes to communication with the people around them.
During 1994 and 1995, ADD (attention-deficit
disorder) was identified in people who deal with hacking.
ADD is characterized by an inability to pay attention,
combined with hyper-focusing on things the person is interested in.
In 1999, AS (Asperger’s syndrome) was identified as well. This
disorder is also known as „high-functioning autism“. It is
manifested in the inability to understand the facial and body
language of other people, as well as in the inability to express
empathy with them. On the other hand, people suffering from
AS have high intelligence, great analytical skills and an
extraordinary ability to solve technical problems [7].
Some authors even advocate the view that perpetrators of
computer crimes do not have a developed moral maturity.
Hackers believe that many of their illegal acts are justified
and ethically correct. The psychologist Lawrence Kohlberg
has developed a three-level theory to explain moral
development in normal people. The first level deals with
avoiding punishment and receiving rewards, the second level
comprises social rules and the third one includes moral
principles. Each of these levels contains two phases. Computer
criminals have only evolved through the lowest three phases
of the Kohlberg model: two phases of the first level and the
first phase of the second level [8].
Hackers have also developed a specific way of
communication, which is another important characteristic of
them. Due to the fact that they are much more successful in
written communication than in face-to-face, interpersonal
communication, they have adopted „leet speak“. Leet speak is
an encrypted form of writing in which letters are represented
by numbers, symbols and other signs that resemble the letters.
The basic function of this form of communication is to
exclude „outsiders“ from the communication, i.e. to make a
clear difference between the language of this group of people
and the language of the majority. Leet is not to be
confused with the so-called AOL language found on the
Internet. The primary function of AOL language is to shorten
written forms of some words, while the purpose of the leet
speak is to make traditional language incomprehensible to
people who do not belong to this group.
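The substitution principle behind leet speak can be sketched in a few lines; the mapping below is one common variant chosen purely for illustration, since real leet usage is far less regular:

```python
# A minimal leet-speak encoder: each letter is replaced by a digit or
# symbol that visually resembles it; unmapped characters pass through.
LEET_MAP = {
    "a": "4", "e": "3", "i": "1", "o": "0",
    "s": "5", "t": "7", "l": "|", "b": "8",
}

def to_leet(text: str) -> str:
    """Replace mapped letters, leaving everything else unchanged."""
    return "".join(LEET_MAP.get(ch.lower(), ch) for ch in text)

print(to_leet("elite hacker"))  # -> 3|173 h4ck3r
```

The result is still readable to an insider but opaque to outsiders, which is precisely the exclusionary function described above.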
There is no definitive and generally accepted definition of
hacker ethics. In a way, every person has their own reasons and
justifications for the things they are doing. Likewise,
hacker ethics does not exist anywhere in the form of a written, official
document, although several authors have presented
its tenets.
According to the Jargon File, hacker ethics is:
- The belief that the dissemination of information is a
powerful, positive good and that it is the ethical duty of
hackers to share their knowledge by creating free programs
and enabling access to information and computer resources
whenever possible;
- The belief that breaking into a computer system for
fun and research is ethically acceptable, as long as the
hacker commits no theft or vandalism and reveals no
confidential information [7].
With the development of technology over time, the
approach to determining the hacker ethics has changed. The
following two approaches particularly stand out: the Original
Hacker Ethics and the Hacker Ethic of 90s Hackers.
Steven Levy, the representative of The Original Hacker
Ethics singled out six key principles of the hacker ethics in his
1984 book Hackers: Heroes of the Computer Revolution.
Those principles are: access to computers and anything which
might teach us something about the way the world works –
needs to be unlimited and total; all information should be free
(public); mistrust toward authority – promotion of
decentralization; hackers should be judged by their hacking,
and not by false criteria such as degree, age, race, sex or
position in the society; computers are used to create art and
beauty; computers can change life for the better.
On the other hand, The Hacker Ethic of 90s Hackers is
essentially contradictory to the Original Hacker Ethics,
because it advocates the opinion that the activity of hackers
should be safe, that it should not damage anything, that it
should not threaten anyone either physically, or mentally or
emotionally, and that it above all should be fun for most
people who practice it. All previously stated principles of
hacker ethics suggest certain duties, type of conduct,
refrainment, attitudes and needs. The extent to which the
ethics is accepted, and the way in which it is interpreted, is
depicted in the classification of hackers into White, Black and
Gray Hats, which is based on adherence to and compliance
with the principles of the hacker ethics.
In the area of various types of misuse of computer
technology, it is necessary to adopt specific legislative
regulations which address criminal acts committed within this
sphere. It is also necessary to understand both the technology
and the individuals who commit computer crimes if
legislation is to be effective.
However, due to poorly developed legal regulations and
control in the area of information technology, many countries
have become a paradise for hackers. Still, in spite of the cyber
crime expansion in the eastern countries, the USA remains by
far the leading target on the list of countries attacked by
hackers [7]. Difficulties that legal systems face in monitoring
new criminal trends of the hackers are a source of
embarrassment for governments all around the world. A
typical example is that of Canada, whose Criminal Law
(law of criminal acts) does not clearly define computer
criminal acts. This has resulted in prosecutors using metaphors to
explain the criminal act, due to a lack of knowledge about the technology involved.
Computer criminal acts in the legislation of the Republic of
Serbia are regulated by regulations of the Law on
Organization and Jurisdiction of Government Authorities in
the Suppression of High Technological Crimes [9] and in the
Criminal Code of the Republic of Serbia [10].
Following the latest amendments, the criminal acts against the
security of computer data are regulated in Chapter
Twenty-Seven (Articles 298-304a) of the Criminal Code of the
Republic of Serbia. The legislator included the following
criminal acts in this special group of the Code: damage of
computer data and programs, computer sabotage, creation and
distribution of computer viruses, computer fraud,
unauthorized access to a protected computer, to a computer
network and to electronic data procession, preventing and
limiting access to a public computer network, unauthorized
use of computer or computer network, the criminal act of
creating, obtaining and providing the other person with means
necessary for execution of criminal acts against the security of
computer data. The latest amendments to the Criminal Code
envisage yet another form of misuse of computers and
computer networks. Due to the fact that the computer network
is often misused so as to commit or to conceal criminal acts
against sexual freedom of minors, Article 185b in the Chapter
Eighteen of the Criminal Code has regulated the criminal act
of using computer network or other technical means of
communication to commit criminal offenses against sexual
freedom of minors. Introduction of legal provisions on
computer criminal acts in the criminal legislation of the
Republic of Serbia has contributed to making great progress
and to creating new opportunities for prevention of illegal
activities and practices in this area. However, given the fact
that situations dealt with in practice can be highly
unpredictable and that the perpetrators of computer crimes are
certainly very inventive, it is necessary to constantly renew
these regulations by adopting new legal provisions and
making amendments to the existing ones.
It is perfectly clear that the society can adequately confront
a certain negative phenomenon only if all of its characteristics
and specificities are recognized. Given the fact that the means
of the misuse of computer technology are becoming
increasingly advanced and more complicated to detect, and
that it is very difficult to stay a step ahead of these criminal
activities, it is necessary to keep raising public awareness about
this phenomenon and to constantly work on finding the most
adequate solution to various criminal activities in this field.
It was our intention to depict the profile of the perpetrator
of computer criminal acts as well as possible in this paper and
thus to shed light on all of his/her specific characteristics,
because effective steps in eliminating negative effects of a
certain phenomenon include not only understanding the
phenomenon, but also understanding perpetrators of the
criminal activities in that field. Transparency and determined
opposition to different forms of criminal activity are two
important elements of the aim to reduce various forms of
crime, including computer crime, to a previously determined
framework which is tolerable for the specific community.
[2] B. Banovic, “Providing Evidence in the Criminal Process of
Economic Crimes”, Police College, Belgrade, 2002.
[3] Main Problems Related to Cybercrime, 10th United Nations
Congress on the Prevention of Crime and the Treatment of Offenders, 2000.
[4] Z. Aleksic, M. Skulic, “Crime Tactics, Techniques, Methods”,
Official Gazette, Belgrade, 2007.
[5] G. Goldman, H. J. Stenger, “Die ganze Welt als Tatort,
Computer Hacking: Modus operandi und
Ermittlungsprobleme”, Kriminalistik, 8-9/89, Kriminalistik
Verlag, Heidelberg.
[8] The socio-psychological profile of the perpetrator of a computer
crime, Faculty of Informatics and Computing,
[9] Law on the Organization and Jurisdiction of Government
Authorities in the Suppression of High Technological Crimes,
„Official Gazette of the Republic of Serbia“, No. 61/2005.
[10] Criminal Code, „Official Gazette of the Republic of Serbia“,
No. 85/2005, 88/2005, 107/2005 and 72/2009.
A Modular System for Solving Optimization Problems
by Testing Them with Genetic Algorithms
Hristo I. Toshev1, Stefan L. Koynov2 and Chavdar D. Korsemov3
Abstract: – The paper introduces a modular system for testing
genetic algorithms (GA-s) in solving optimization problems.
The system consists of several modules built from blocks with
different purposes, interconnected in various ways. The
GA is realized via different set-ups of genetic operators, search
methods and evolutionary models, thus allowing the
practical effectiveness of the algorithm to be estimated.
Keywords – genetic algorithms, optimization, selection,
crossover, mutation.
Genetic algorithms (GA) are a search method based on
the selection of the best individuals in a population, in analogy to
Ch. Darwin’s theory of evolution.
Their origin lies in the model of biological evolution
and the methods of random search. From the bibliographical
sources [1], [2] it is evident that random search appeared
as a realization of the simplest evolutionary model, in which
random mutations are modelled as random steps in the
search for the optimal solution and selection is modelled as
“removal” of the unfeasible versions.
The main goal of GA-s is twofold:
- abstract and formal explanation of the adaptation
processes in evolutionary systems;
- modelling natural evolutionary processes for the efficient
solution of a determined class of optimization and other problems.
Following [3], [4], [5], GA-s differ from other optimization
search procedures in the following respects:
- they operate with a coded set of parameters, not with the
problem parameters;
- they realize the search not by improving a single solution
but by the simultaneous usage of several alternatives for
the given solution set;
- they use the fitness function (FF) itself, not its increments
or derivatives, to estimate the quality of the obtained solution;
- they apply probabilistic rules for the optimization
problem, not deterministic ones.
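The four properties listed above can be illustrated with a minimal toy GA maximizing the number of 1-bits in a coded binary string; all function names and parameter values here are illustrative sketches, not taken from the system described in this paper:

```python
import random

random.seed(1)

def fitness(chromosome):   # FF: count of 1-bits (the "OneMax" toy problem)
    return sum(chromosome)

def make_individual(n):    # coded parameters: a binary string, not raw problem values
    return [random.randint(0, 1) for _ in range(n)]

def crossover(a, b):       # single-point crossover
    point = random.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(ind, p=0.02):   # probabilistic, not deterministic, rules
    return [1 - g if random.random() < p else g for g in ind]

def run_ga(n_bits=32, pop_size=20, generations=50):
    # Several alternatives are kept simultaneously instead of one solution.
    population = [make_individual(n_bits) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # elite selection
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)

best = run_ga()
print(fitness(best))  # close to the maximum of 32 on this toy problem
```

Note how the sketch operates only on the coded representation, keeps a whole population of alternatives, judges candidates solely through the fitness function, and applies its operators probabilistically: exactly the four distinguishing items listed above.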
During the last years a new paradigm has been applied to solve
optimization problems: GA-based methods and modifications of GA.
Evolutionary algorithms (EA) seek a balance between the efficiency and
the quality of solutions at the expense of selecting the strongest
alternative solutions [1], [2].
1Hristo I. Toshev, 2Stefan L. Koynov and 3Chavdar D. Korsemov are with the Institute of Information Technologies, Bulgarian Academy of Sciences, Acad. G. Bonchev str., bl. 29A, 1113 Sofia, Bulgaria, E-mail:
The continuously growing number of publications and also
of the practical implementations during the last years is a
stable proof of the growing expansion of the scientific and
application research in the domain of GA.
In order to give a general idea of the types of applications,
they can be classified in four main directions [1]:
- science – to increase the level of various scientific
research [1], [2], [6], [7], [8], [9], [10], [11], [12];
- engineering – to solve specific engineering problems [5],
[12], [13], [14], [15], [16];
- industry – to increase the quality and the amount of
production for some industrial processes [4], [5], [14],
[17], [18], [19], [20];
- various other directions (miscellaneous applications) [1],
[6], [21], [22], [23].
The present paper introduces a way to improve the qualities
of applied algorithms for solving various classes of
optimization problems. This is done by a GA-based modular
program system for testing. It allows various elaborated
methods for setting up genetic operators to be applied at
execution time, with respect to the search process and the
different evolutionary models included.
The new modular program system for testing genetic
algorithms in solving optimization problems is realized as a
set of program units consisting of various program modules,
each of which includes different blocks (Fig. 1):
• block 1 – Block for Input;
• block 2 – Optimization Problems Editor, used for editing
different optimization tasks;
• block 3 – Set-Up Block including three set-up subblocks:
- 3.1 – Set Up Genetic Operators block;
- 3.2 – Set Up Search Methods block;
- 3.3 – Set Up Evolutionary Models block.
• block 4 – Genetic Algorithm block for the already fulfilled
set-ups of algorithms with genetic operators, search
methods and evolutionary models;
• block 5 – Genetic Operators Data Base;
• block 6 – Search Methods Data Base;
• block 7 – Evolutionary Models Data Base;
• block 8 – Adaptation Block;
• block 9 – Convergence Analysis Block;
• block 10 – Output Block for the Results.
Based on the already presented blocks a synthesis was
performed of several program modules that are combinations
of the introduced blocks.
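As a sketch, the block combinations that define the four modules (detailed in the paragraphs that follow) can be summarized in a small table; the dictionary form and the derived "common backbone" computation below are only an illustration of the structure, not part of the system itself:

```python
# Each MODULE is a combination of the numbered blocks listed above;
# 3.1, 3.2 and 3.3 are the three set-up subblocks of block 3.
MODULES = {
    "MODULE 1": ["1", "2", "3.1", "4", "5", "10"],            # genetic operators
    "MODULE 2": ["1", "2", "3.2", "4", "5", "6", "10"],       # search methods
    "MODULE 3": ["1", "2", "3.3", "4", "5", "6", "7", "10"],  # evolutionary models
    "MODULE 4": [str(b) for b in range(1, 11)],               # full pipeline, blocks 1-10
}

# Blocks shared by every module form the common backbone of the system.
common = set.intersection(*(set(v) for v in MODULES.values()))
print(sorted(common, key=float))  # -> ['1', '2', '4', '5', '10']
```

The computation confirms that the input block, the editor, the GA block, the genetic-operators DB and the output block appear in every module, which matches the reuse described in the text.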
MODULE 1 is used to explore various modifications of
genetic operators: a recombination operator, a crossover
operator, a mutation operator, an inversion operator, a
segregation operator and a translocation operator, their
modifications and also their joint operation.
Fig. 1. General structure of the system: Block for Input, Optimization Problems Editor, Set-Up Block (Set Up Genetic Operators, Set Up Search Methods, Set Up Evolutionary Models), Genetic Algorithm block, Genetic Operators Data Base, Search Methods Data Base, Evolutionary Models Data Base, Adaptation Block, Convergence Analysis Block, Output Block for the Results.
The introduced module is realized on the basis of GA-s and is
used to compare the properties of one and the same GA
under different genetic operators. The result of this comparison
includes the running time of the algorithm, the best solution
from the execution and an estimate of the algorithm’s convergence.
The general structural graph of the program module includes
block 1, block 2, block 3.1, block 4, block 5 and block 10 (Fig. 2).
Block 1 is the block for data input about the problem
(input of variables and parameters of the investigated problem,
combination matrices, various criteria, etc.).
The input of the set-up data for the algorithm is done in
block 3.1 (population size, types of applied operators,
probability for their usage, etc.) and the data base (DB) for
various types of genetic operators is realized in block 5.
Fig. 2
Different set-ups of GA are performed in block 4 and the
results are obtained in block 10.
MODULE 2 is applied to explore the enabled heuristic
search methods. It is realized analogously to MODULE 1 and
it consists of blocks with numbers 1, 2, 3.2, 4, 5, 6 and 10 (Fig. 3).
Blocks with numbers 1 and 2 operate in the already known
way, with the introduced functions.
Block 3.2 inputs the set-up parameters for the algorithm
(the search automation functions included) so that various
methods to improve the search process can be enabled or disabled.
Block 4 is oriented to the GA with its set-up for the search
optimization, producing information on the current value of
the best FF and also the graph of the solution modifications
along all generations of the algorithm.
Fig. 3
Block 6 is related to the data base with the different search methods.
During the process of operation, block 10 produces an
output file with the code of all chromosomes, all iterations of
the current generation and the result for the global optimum.
MODULE 3 concerns the research of the evolutionary models
applied in the algorithm. It includes blocks with
numbers 1, 2, 3.3, 4, 5, 6, 7 and 10 (Fig. 4).
More specific are the blocks listed below:
Block 3.3 inputs the set-up parameters for the algorithm
(enabling the search optimization functions). The block
enables or disables various methods to improve the search
process and also various evolutionary models.
Block 4 is oriented to the GA with its set-up for the search
optimization, producing information on the current value of
the best fitness function (FF) and also the graph of the
solution modifications along all generations of the algorithm.
Fig. 4
Block 7 is the Evolutionary Models Data Base block.
Block 10 is the Output Block for the Results. The output data
are the same as for MODULE 2.
MODULE 4 is used during the research process of the
search methods and evolutionary models applied in the
algorithm, the iterative and statistical improvement, the
adaptation and the analysis of convergence. The realization of
this scheme is analogous to the rest of the MODULE-s and it
contains blocks with numbers from 1 to 10:
- Input of the task data and of the algorithm set-up
parameters (the search optimization included) – block 1;
- Optimization task editing – block 2;
- Set-up of genetic operators, search methods and
evolutionary methods – block 3 (3.1, 3.2, 3.3 included);
- GA with all set-ups – block 4;
- Three types of DB-s (genetic operators DB, search
methods DB and evolutionary models DB) are realized as
blocks with numbers 5, 6, 7;
- The feedbacks and the balance between the procedures
are achieved in block 8 for adaptation;
- The analysis and the convergence are realized in block 9;
- The obtained results are recorded in block 10.
The execution of a given program MODULE of some
subsystem begins with the input of information about the
performed task(s). Then a set-up is performed of the global
and the private parameters of the algorithm, namely:
• the initial distribution of the coefficients and the criteria;
• the initial population size;
• the upper limit of the iteration number for the algorithm (the number of steps);
• the upper limit of the number of generations (the number of algorithm (re)starts);
• the limits (upper and lower) for the FF values (the global optimum, if neither is given);
• the probability for applying the genetic operators.
Private parameters for MODULE 1 are the various types of genetic operators:
- for crossover – single dot, two dots, cyclic, universal, etc.;
- for mutation – single dot, two dots;
- for translocations;
- for segregation;
- for selection – randomly, according to a given criterion, elite, tournament, etc.
The execution of all established genetic operators is followed by a check of the FF. The research process restarts if the FF value is unsatisfactory.
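As an illustration, the loop described above (apply the established operators, check the FF, restart while it is unsatisfactory) can be sketched as follows; the bit-string encoding, the parameter values and the toy "OneMax" FF are assumptions for the sketch, not the authors' implementation:

```python
import random

# Sketch of the operator loop: apply the established genetic operators,
# check the FF, and restart the search while the best FF is unsatisfactory.
def run_ga(fitness, n_bits=16, pop_size=20, generations=50, target=None, restarts=5):
    def crossover(a, b):                     # single-dot crossover
        p = random.randrange(1, n_bits)
        return a[:p] + b[p:]

    def mutate(c, pm=0.05):                  # single-dot mutation
        return [bit ^ 1 if random.random() < pm else bit for bit in c]

    def select(pop, k=3):                    # tournament selection
        return max(random.sample(pop, k), key=fitness)

    best = None
    for _ in range(restarts):
        pop = [[random.randint(0, 1) for _ in range(n_bits)]
               for _ in range(pop_size)]
        for _ in range(generations):
            pop = [mutate(crossover(select(pop), select(pop)))
                   for _ in range(pop_size)]
            cand = max(pop, key=fitness)
            if best is None or fitness(cand) > fitness(best):
                best = cand
        if target is not None and fitness(best) >= target:
            break                            # stop criterion reached
    return best

# Toy FF ("OneMax"): maximize the number of ones in the chromosome.
best = run_ga(sum, target=16)
```

Swapping the inner helper functions corresponds to changing the operator set described in the text.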
The search parameters for the methods DB (MODULE 2, block 6) are flags that enable heuristics based on:
- optimization statistical methods;
- gradient methods;
- dichotomy methods;
- Fibonacci methods;
- golden-section methods;
- fractal sets, etc.
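As a standalone illustration of one of these flagged heuristics, the golden-section method for a unimodal function can be sketched as follows (not the system's own code):

```python
import math

# Golden-section search: one of the heuristics the methods DB can enable.
# Locates the minimum of a unimodal function f on the interval [a, b].
def golden_section(f, a, b, tol=1e-6):
    invphi = (math.sqrt(5) - 1) / 2          # 1/phi, about 0.618
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    while abs(b - a) > tol:
        if f(c) < f(d):                      # minimum lies in [a, d]
            b, d = d, c
            c = b - invphi * (b - a)
        else:                                # minimum lies in [c, b]
            a, c = c, d
            d = a + invphi * (b - a)
    return (a + b) / 2

# Example: the minimum of (x - 2)^2 lies at x = 2.
x = golden_section(lambda x: (x - 2) ** 2, 0.0, 5.0)
```

Each iteration narrows the bracket by the golden ratio, so the method needs only one new function evaluation per step.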
GA is executed after the performance of all established search methods. If the stop criterion is reached, the algorithm finishes; otherwise the research process continues.
The parameters of the MODULE 3 Evolutionary Models Data Base program block 7 are listed below:
- Darwinism – Darwin-evolution model;
- Lamarckism – Lamarck-evolution model;
- saltationism – de Vries model of evolution;
- K.Popper’s model of evolution;
- synthetic theory of evolution.
The GA is realized via the best operator set determined during the test. The methods elaborated above are applied during the execution time of the algorithm. The main idea is to capture the general improvement contributed by the heuristics during the search, including their behavior for various types of problems.
This is followed by the execution of the GA according to the operator set resulting from the execution of all heuristic procedures and all evolutionary models included. The operative set-up influences the Adaptation Block (no. 8), which realizes the balance between the procedures, and the Convergence Analysis Block (no. 9), which monitors the algorithm's convergence. In this way the best features at execution time are achieved and the robustness of the algorithm is raised.
The basic idea of the performed testing was to obtain better features at run time for the set of algorithms and an increase in their stability. The determination of the improvement parameters includes approaches, algorithms, software modules, DBs, etc. The tests are performed as three basic series: testing the genetic operators and the GAs, testing the search methods, and testing the evolutionary models.
The conception of the presented system covers testing of the following aspects:
- algorithms for a specific problem case, by successively enabling all elaborated heuristics on a standard problem and determining the values of the improved parameters. In this way it is possible to estimate the effectiveness of the search-optimization methods applied in the algorithm;
- algorithms on some standard tests, comparing the obtained results with already existing ones for a given benchmark, calculated by other algorithm(s). In this way it is possible to estimate the effectiveness of the algorithm and of the program modules.
Simulation of Selective Repeat Automatic
Retransmission Request Scheme
Valentin Hristov1
Abstract – The Selective Repeat Automatic Repeat reQuest
(SR ARQ) protocol is a general strategy for handling frame
transmission errors when the round-trip time for frame
transmission and reception of the acknowledgment is larger than
frame transmission time. The present paper presents a simulator
to evaluate the impact of burstiness both in the channel errors and in the arrival process on the SR ARQ statistics. The simulations are then used to show and discuss some results which are non-intuitive.
Keywords – Selective Repeat Automatic Retransmission
Request, Simulations.
Automatic Retransmission reQuest (ARQ) schemes are
used to control transmission errors caused by channel noise
[1], e.g. in wireless networks. All ARQ schemes employ some
kind of error coding of the transmitted data so that the receiver
has the ability to detect the presence of errors. When an error
is detected, the receiver requests a retransmission of the faulty
data. ARQ schemes are simple to implement in hardware and
they are especially effective when there is a reliable feedback
channel connecting the receiver to the transmitter such that the
round-trip delay is small. There are three main types of ARQ schemes:
• Stop-and-wait ARQ;
• Go-back-N ARQ;
• Selective-repeat ARQ.
We will discuss and model Selective Repeat ARQ
(SR ARQ) scheme in the following sections.
The selective-repeat protocol is a general strategy for
handling frame transmission errors when the round-trip time
for frame transmission and reception of the acknowledgment
is comparable to or larger than frame transmission time. SR
ARQ is used by the TCP transport protocol. In this protocol,
the transmitter groups the frames into windows so that each
window contains N frames. When the sender sends frames
within a window, the receiver stores the frames of the current
window and checks for errors. After a complete window has
been received, or after the proper timeout period, the receiver
instructs the transmitter to resend only the frames that
contained errors.
If the round-trip delay is larger than the frame transmission time, frames are not always transmitted in numerically increasing order, and this forces the receiver to keep the received frames in a buffer, from where they can be released only when all frames with lower identifiers have been acknowledged. Thus, the delay between the first transmission of a frame and its release from the receiver buffer cannot be computed trivially [1], since it also depends on the outcome of the transmission of all frames with lower id.
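The in-order release rule can be sketched as follows (a minimal illustration; the frame-id bookkeeping shown here is an assumption for the sketch, not the paper's implementation):

```python
# Sketch of the receiver buffer rule: a frame is released only when all
# frames with lower ids have been received; out-of-order frames must wait.
def release_in_order(buffer, received_id, next_expected):
    buffer.add(received_id)
    released = []
    while next_expected in buffer:           # release the contiguous prefix
        buffer.remove(next_expected)
        released.append(next_expected)
        next_expected += 1
    return released, next_expected

buf = set()
first, nxt = release_in_order(buf, 1, 0)     # frame 1 arrives early and waits
second, nxt = release_in_order(buf, 0, nxt)  # frame 0 arrives; both released
```

The waiting time of frame 1 in this buffer is exactly the extra delivery-delay component the paper analyzes.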
The investigation of delay performance and other related issues of different ARQ schemes has been the subject of many papers [1]–[7].
In [1], Badia presents an extended analysis, with two
Markov chains describing arrival and channel error processes.
However, he assumes error-free ACK/NACKs and unlimited transmitter and receiver buffers, and omits the constant propagation delay term.
Seo et al., in [4], derive the delay statistics of Hybrid ARQ
also through Markov chains.
A matrix geometric approach [7] has been used by Le et al.
to evaluate the performance of ARQ schemes in a radio link
with adaptive modulation and coding. To derive the queueing
statistics it is observed that the process is Quasi-Birth and
Death (QBD), which holds also for the system studied in [5].
Finally, in [6], Luo et al. discuss the ARQ delivery delay by
focusing on the impact of the link layer ARQ on the
performance of upper layers, i.e., the service data unit (SDU)
delay. Though their focus is different, they obtain some results by means of simulation, which will also be derived in what follows.
The purpose of the present paper is to present a simulator to evaluate the impact of burstiness both in the channel errors and in the arrival process on the SR ARQ statistics. The simulations are then used to show and discuss some results which emerge in the statistics and which are non-intuitive.
Figure 1 shows the scheme of transmitting frames using the Selective Repeat ARQ protocol. The right arrow at the conditional block indicates ACK signals and the down arrow indicates NAK signals. We can see that the transmitter starts to resend a frame which was faulty as soon as the corresponding NAK is received.
The delay between the first transmission of a frame and its release from the receiver buffer we will call the delivery delay (TD). The total delay (Tt) experienced by a frame also comprises
the time spent in the transmitter’s queue, which we denote as
queueing delay (TQ). An approach to estimate these delay terms is to create a simulation model which precisely describes the process of delivery of frames in the system.
1 Valentin P. Hristov is with Department of Computer Systems
and Technology at South West University, Iv. Mihajlov 66, 2700
Blagoevgrad, Bulgaria, E-mail:
In this investigation the General Purpose Simulation System is used to create the simulation model and to estimate the delay terms. The Q-system of this model is depicted in Fig. 2.
The transmitter transfers frames, after which the receiver answers with a positive or negative acknowledgement (ACK/NACK) according to the correct/erroneous reception of these frames, respectively. After a full round-trip time, i.e. after m slots, ACK/NACKs arrive at the transmitter's side, and either a new
frame or a retransmission is sent over the channel. We assume that the value of the ARQ window size is m, i.e. the round-trip time equals m transmitted frames (slots).
Fig. 1. Frames delivery in a wireless network; b) time delay
Fig. 2. Q-system of the simulation model
We also assume that frames arrive at the transmitter's queuing buffer from an ON-OFF source with two states, referred to in the following as "OFF" = no frame arrival and "ON" = frame arrival. Thus, multiple arrivals are not possible during the same time slot.
The ON-OFF source is characterized by means of two independent parameters, the average arrival rate λ and the average arrival burst length A.
The data sent from the transmitter's queue arrive at the receiver through a noisy channel. This is modeled through a "good" state corresponding to error-free transmission and a "bad" state where the frame is always in error.
Transition probabilities pij from state i to state j are collected in the transition matrix P = (pij), i,j = {0,1}. Thus, the steady-state channel error probability is ε = p01/(p10 + p01) and the average error burst length is B = 1/p10. The case B = 1/(1 − ε) corresponds to the independent identically distributed error case (B = iid).
In general, any channel description through Markov models can be reduced to the introduction of only two channel states, one good (state 0) and one bad (state 1). This two-state Markov channel is similar to the one of [6].
The channel in the proposed model is thus characterized by two parameters, the error probability ε and the average error burst length B.
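Under these definitions, the slot-level model can be sketched in Python as follows. The paper's simulator is written in GPSS; this is only an illustrative re-implementation, and all helper names and default values are assumptions:

```python
import random

# Slot-level sketch of the model: an ON-OFF arrival source and a two-state
# (Gilbert) channel derived from the error probability eps and burst length B.
def simulate(slots=100_000, lam=0.6, A=2.5, eps=0.1, B=3.0, seed=1):
    rng = random.Random(seed)
    # Channel transition probabilities from eps = p01/(p01 + p10), B = 1/p10.
    p10 = 1.0 / B
    p01 = eps * p10 / (1.0 - eps)
    # ON-OFF source: mean ON-burst A gives the ON->OFF probability; the
    # OFF->ON probability is chosen so that the average rate equals lam.
    on_off = 1.0 / A
    off_on = lam * on_off / (1.0 - lam)
    bad, on = False, False
    arrivals = errors = 0
    for _ in range(slots):
        on = (rng.random() >= on_off) if on else (rng.random() < off_on)
        bad = (rng.random() >= p10) if bad else (rng.random() < p01)
        arrivals += on
        errors += bad
    return arrivals / slots, errors / slots

rate, err = simulate()
```

With these transition probabilities the long-run fraction of ON slots approaches λ and the fraction of bad slots approaches ε, while A and B control only the burstiness of the two processes.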
In this section we present some interesting results obtained from the simulation model created with the General Purpose Simulation System (GPSS), working in the manner reported above. For all of the reported results, m and ε are taken to be equal to 10 and 0.1, respectively, even though other values have been tested and the results agree with the ones shown here.
Fig. 3 shows the queueing delay and the delivery delay as functions of A in the case B = 3 (a mildly correlated channel), m = 10, and ε = 0.1, for various values of λ, λ = {0.4, 0.6}. The delivery delay curves show that the value of TD does not significantly change when λ and/or A varies. The queueing delay, instead, is shown to increase with λ, which may be expected, but it also exhibits a linearly increasing behavior in A.
This can be explained by considering that the frames arrive in bursts and are therefore likely to find many other frames ahead in the queue, which results in a higher TQ and a higher total delay.
Fig. 4 shows average values of the delivery delay for m = 10, ε = 0.1, A = 2.5 as a function of the arrival rate λ, for various values of B.
In this figure, a counterintuitive behavior is emphasized: one might expect that the delay increases with λ, since the system is more heavily loaded. This reasoning is correct for the queuing delay, but not for the delivery delay.
Fig. 3. Queueing and delivery delay vs. A, for various values of λ
Fig. 4. Delivery delay vs. λ, for various values of B
Fig. 5. Total delay vs. B, for various values of A = {2.5, 7}
Fig. 6. Queueing, delivery, and total delay vs. B
However, for more realistic cases where the average burst
length is moderate or higher, the delivery delay is almost
independent of the frame arrival rate or may decrease with
increasing λ. This phenomenon can be explained by the appearance of long sequences of slots where the channel is in a good state, so that it is easier to deliver an entire sequence of frames directly. The effect is more acute for large values of the burst error length B.
Fig. 5 shows average values of the total delay for m = 10, ε = 0.1, λ = 0.6 as a function of the burst error length B, for various values of the arrival burst length A = {2.5, 7}.
As can be seen, the total delay decreases at first and then increases linearly, i.e. moderate channel burstiness achieves a lower delay than lower or higher channel burstiness.
Finally, a comparison between the queueing delay, the delivery delay, and the total delay for m = 10, ε = 0.1, λ = 0.6, as a function of B, for A = 2.5 is shown in Fig. 6. This figure explains why moderate channel burstiness achieves a lower total delay than lower or higher channel burstiness: by looking at the figure, we are now able to recognize that the total delay depends on the dominant delay term, being either the delivery or the queueing delay. In fact, while TD decreases when the channel burstiness increases around moderate values, TQ increases linearly and becomes the prominent term for high B.
Similar curves, derived analytically, have been given in [1], and this fact can be regarded as a verification of the simulation model and results proposed in the present paper.
In the present paper, we compare the SR ARQ delays for various intensities of the arrival rate and the arrival burstiness at the transmitter's queue, and investigate the effect of the error burstiness in the channel. We show that the delivery delay may actually decrease for an increasing arrival rate when the channel is moderately bursty, and that in certain cases error burstiness may imply a general decrease of the total delay. These aspects are important for achieving correct delay estimation in real-time multimedia services over wireless channels, e.g. video-streaming applications.
This paper is sponsored by Research Project No. SPR-B4 at South West University – Blagoevgrad, Bulgaria.
[1] L. Badia, On the Impact of Correlated Arrivals and Errors on PROTOCOLS, Proc. of the Conference Computer Science'2006, Istanbul, Turkey, 30 September – 2 October 2005, Part I, pp. 114-119.
[3] J. G. Kim and M. M. Krunz, "Delay analysis of selective repeat ARQ for a Markovian source over a wireless channel," IEEE Trans. Veh. Technol., vol. 49, no. 5, 2000, pp. 1968-1981.
[4] J.-B. Seo, Y.-S. Choi, S.-Q. Lee, N.-H. Park, and H.-W. Lee,
“Performance analysis of a type-II hybrid-ARQ in a TDMA
system with correlated arrival over a non-stationary channel,” in
Proc. ISWCS, Siena, Italy, 2005, pp. 59-63.
[5] L. B. Le, E. Hossain, and A. S. Alfa, “Radio link level
performance evaluation in wireless networks using multi-rate
transmission with ARQ– based error control,” IEEE Trans.
Wireless Commun., vol. 5, no. 10, Oct. 2006, pp. 2647-2653.
[6] W. Luo, K. Balachandran, S. Nanda, and K. Chang, “Delay
analysis of selective-repeat ARQ with applications to link
adaptation in wireless frame data systems,” IEEE Trans.
Wireless Commun., vol. 4, no. 3, May 2005, pp. 1017-1029.
[7] M. F. Neuts, Matrix-Geometric Solutions in Stochastic Models, New York: Dover Publications, Inc.
Ontology-Based Deep Web Search For E-Science1
Tatyana I. Ivanova2
Abstract: This paper makes a brief exploration of Deep Web search technologies and proposes a new semantic ontology-based approach for personalized searching of scientific publications in digital libraries, books in web catalogs of scientific-content books, and other scientific data in web databases. Our main aim is to investigate the main deep web search tools and digital libraries and, on the basis of them, to develop a conceptual model of a personalized searching tool for scientists.
Keywords: Deep Web Search, Personalized Search,
Searching Web databases, Ontology-based Search, Searching
digital libraries
User queries on the Web can be classified into three types according to the user's intention [3]: informational queries (the intent is to acquire some information assumed to be present on one or more web pages), navigational queries (the immediate intent is to reach a particular site) and transactional queries [2] (the intent is to perform some web-mediated activity, such as downloading or purchasing). General search engines usually do not recognize the user's intent and, disregarding the result type, return a mixed result list. Sometimes it is difficult to make a strict classification of user queries according to intent. For example, searching a digital library for scientific information is an informational query, but downloading the chosen paper is a transactional operation.
Web search engines cannot index most of the possible pages that can be returned by dynamic web sites, or the data stored in Web databases (the so-called Deep Web [6]), and it is difficult to find such information if the (location of the) source site is unknown. Google Scholar, for example, is very useful for searching free scientific publications, but it has indexed only a small part of all the scientific papers published on the Web. It is very important for scientists to be able to find easily all the new research papers related to their subject, purchase new issues of valuable books or download needed software.
Another drawback of Web search engines is that during searching or ranking of results they do not take into account personal user preferences or interests. Federated search tools help users to identify the databases that are best suited to the subjects they are searching. They allow users to search across multiple resources: subscription databases, library catalogs, and other types of web databases.
1 The research presented in this publication is funded by the Internal Research Project 102ни013-10 of the Research and Development Sector at the Technical University of Sofia for 2010.
2 Tatyana I. Ivanova is with the Technical University of Sofia, Bulgaria, E-mail:
In this paper a new semantic ontology-based approach for personalized searching of scientific publications in digital libraries and web catalogs of scientific-content books is proposed. Our aim is to develop a conceptual model and, as future research, a tool for personalized searching of scientific publications in digital libraries and scientific books in web catalogs (for purchasing). As such resources are stored in full-text databases or web catalogs and are intended for users with specific research interests, we have to carry out research on, and develop, a specialized personalized Deep Web search tool. It will be used as part of the virtual scientific laboratory to facilitate the search for scientific publications, books, or specific information in Internet databases relating to scientific research.
The paper is organized as follows: Section 2 discusses earlier research in Deep Web searching; Section 3 proposes a new semantic ontology-based approach for searching scientific objects; Section 4 discusses the expected problems, strengths and drawbacks of the proposed approach and its further realization; Section 5 concludes the article.
Most search engines rely on programs known as crawlers (or spiders) that gather information by following the trails of hyperlinks that tie the Web together. Traditional search engines [5] such as Google or Yahoo can search, retrieve and access only sources that have been indexed by the search engine's crawler technology. That approach works well for the pages that make up the surface Web, but for online databases that are set up to respond to a vast number of typed queries it is practically impossible to index all possible responses. The large volumes of documents that compose the Deep Web are not open to traditional Internet search engines because of limitations in crawler technology.
There are two main approaches for Deep Web search: searching previously harvested metadata (in search engine indexes, as in the surface web), and federated search. Deep Web indexing methods are very different from those in the surface web, as the whole indexing process is based on automatically querying and retrieving data behind web database search interfaces.
interfaces. Search engines, indexing deep web content (as
Google, or Yahoo) use specific deep web crawlers. They
detect the index features by issuing probe queries against the
search and build a sample of the queried database by issuing a
set of queries. Next, they select the most frequent words in the
documents in samples to crawl the database, assuming they
also have a high frequency in the actual database/index.
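The probe-query sampling strategy can be sketched as follows (an illustrative toy; the query interface, the term selection and the in-memory "database" are assumptions for the sketch):

```python
from collections import Counter

# Sketch of probe-query sampling of a deep web database: issue seed queries,
# collect the returned documents, then reuse the most frequent words of the
# sample as new probe terms for the next round.
def sample_database(query_fn, seed_terms, rounds=3, probes_per_round=5):
    sample, seen = [], set(seed_terms)
    frontier = list(seed_terms)
    for _ in range(rounds):
        for term in frontier[:probes_per_round]:
            sample.extend(query_fn(term))     # documents matching the probe
        counts = Counter(w for doc in sample for w in doc.lower().split())
        frontier = [w for w, _ in counts.most_common() if w not in seen]
        seen.update(frontier[:probes_per_round])
    return sample

# Toy "database": returns the stored documents containing the probe term.
docs = ["genetic algorithms search", "deep web search tools", "search engines"]
hits = sample_database(lambda t: [d for d in docs if t in d], ["search"])
```

The sample grows as frequent words from earlier rounds uncover documents the seed terms alone would not reach.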
Federated search makes deep web documents in databases
searchable by sending queries directly to native search
interfaces of these databases. Additionally, federated search
provides a singular search interface to numerous underlying
deep web data sources. Federated search is the technology of
simultaneously searching multiple content sources from one
search form and aggregating the results into a single results
page. This reduces the burden on the search patron by not
requiring knowledge of each individual search interface or
even knowledge of the existence of the individual data sources
being searched. The federated search process consists of four
phases: 1. transforming a query and broadcasting it to a group
of disparate databases or other web resources, with the
appropriate syntax; 2. merging the results collected from the
databases; 3. presenting them in a unified format with
minimal duplication; 4. providing a means, performed either
automatically or by the portal user, to sort or cluster the
merged result set. Federated search is a type of metasearch. We can build our own metasearch engines for federated search, using database or other federated search engine interfaces.
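The four phases listed above can be sketched as follows (an illustrative metasearch skeleton; the source interfaces, titles and scores are assumptions, not a real library API):

```python
# Sketch of the four federated-search phases: broadcast the query to each
# source, merge the collected results, deduplicate, and rank the merged list.
def federated_search(query, sources):
    merged = []
    for search in sources:                    # 1. broadcast to each source
        merged.extend(search(query))          # 2. merge collected results
    unique = {title: score for title, score in merged}        # 3. deduplicate
    return sorted(unique.items(), key=lambda kv: -kv[1])      # 4. rank

# Two toy "libraries" standing in for native database search interfaces.
lib_a = lambda q: [("Paper on " + q, 0.9), ("Survey of " + q, 0.5)]
lib_b = lambda q: [("Survey of " + q, 0.7)]
results = federated_search("ontologies", [lib_a, lib_b])
```

A production tool would also translate the query into each source's syntax and normalize the per-source relevance scores before ranking.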
For our purposes, it is important to build our own personalized metasearch tool for searching scientific digital libraries, e-commerce book catalogs and some type-specific scientific databases.
A digital library is a library in which collections are stored in digital formats and are accessible by computers. Many academic and government organizations provide digital libraries,
some of which are actively involved in building institutional
repositories of the institution's books, papers, theses, and other
works which can be digitized. Many of these repositories are
made available to the general public with few restrictions, in
accordance with the goals of open access, in contrast to the
publication of research in commercial journals, where the
publishers often limit access rights.
Digital libraries frequently use the Open Archives
Initiative Protocol for Metadata Harvesting (OAI-PMH) to
expose their metadata to other digital libraries, and search
engines like Google Scholar, Yahoo! and Scirus. The OAI-PMH compliant digital repositories in the world may be found in [15]. Metadata according to this protocol are represented in XML format. The main drawback of this representation is the lack of explicit formal semantics. As digital libraries are Deep Web resources (mainly textual databases), there are three general strategies for searching digital libraries: searching using the library search interface, federated search, and searching previously harvested metadata.
Almost every digital library provides internal search tools. These give full access to all resources stored in the library and make full use of the specific library metadata. The main disadvantage of this approach is the need
for the user to know in which library to search, as there are
many different metadata standards [1]. There are a lot of digital libraries, and choosing the best one for a concrete search is a problem.
For building effective federated search engines, knowledge of the internal architecture and metadata standard of the used libraries is needed. For example, the DSpace architecture has three layers and two APIs: a storage layer to store digital objects and their metadata in databases and file systems; a business logic layer for key operations such as searching and browsing services; and an application layer for users to access the DL system through its user interface. The Networked Digital Library of Theses and Dissertations (NDLTD) is based on a federated architecture, with MARIAN as mediation middleware; CiteSeer uses a Service-Oriented Architecture (SOA); Open Digital Libraries uses a component-based DL architecture. The various architectures and metadata standards are the main source of problems in federated search engine development.
Federated search engines searching in digital libraries perform typical vertical search, as most libraries contain resources related to one or a few domains. For example, the CiteSeer.IST [9] search engine (and digital library) searches information within the scientific literature, Scopus finds academic information among science, technology, medicine, and social science categories, GoPubMed searches for biomedical articles in PubMed, PubFocus searches Medline and PubMed for information on articles, authors, and publishing trends, Scitation searches over one million scientific papers from
journals, conferences, magazines, and other sources, Scirus
moves beyond journal articles and also includes searches
among such resources as scientists’ webpages, courseware,
patents, and more, Sesat is an open-sourced search middleware with federation capabilities and a built-in search portal framework, CompletePlanet uses a query-based engine to index 70,000+ deep Web databases and surface Web sites, and WorldWideScience is composed of more than 40 information sources, several of which are federated search portals themselves. One such portal is which itself federates more than 30 information sources representing most of the Federal government articles. This approach of cascaded federated search enables a large number of information sources to be searched via a single query. For effective searching, users have to have some knowledge about digital library search engines: mainly which libraries they search, which domains the papers stored in the corresponding libraries relate to, and what metadata is important in searching. A list of important academic databases and search engines can be found in [4].
Big search engines such as Google, Yahoo, or Bing index nearly every web site (web developers take care of this by complying with search engine optimization rules), and one may rely on them for finding emerging digital libraries before choosing the best tool to search them. A strength of general-purpose search engines with deep web searching capabilities (for example Google Scholar) is that they can offer many of the scientific publications freely available on the Internet in nearly every domain.
There are three main ways to search web catalogs: direct
usage of the catalogs' built-in search engines, using
general-purpose or deep web (e-commerce) meta search
engines, or building one's own federated search engine to
search many catalogs (directly chosen by the user)
simultaneously.
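The third option above can be sketched as a thin dispatch layer that forwards one query to several user-chosen catalog connectors and merges the tagged results. The two connectors below are hypothetical stand-ins for real catalog search APIs, with hard-coded data purely for illustration:

```python
def search_library_a(query):
    # Stand-in for a digital-library search engine (illustrative data).
    data = {"ontology": ["Paper A1", "Paper A2"], "crawler": ["Paper A3"]}
    return data.get(query, [])

def search_catalog_b(query):
    # Stand-in for an e-commerce book catalog (illustrative data).
    data = {"ontology": ["Book B1"], "testing": ["Book B2"]}
    return data.get(query, [])

def federated_search(query, connectors):
    """Send the same query to every selected connector and merge the
    results, tagging each hit with its source catalog's name."""
    merged = []
    for name, connector in connectors:
        for hit in connector(query):
            merged.append((name, hit))
    return merged

results = federated_search("ontology", [("LibA", search_library_a),
                                        ("CatB", search_catalog_b)])
print(results)
# [('LibA', 'Paper A1'), ('LibA', 'Paper A2'), ('CatB', 'Book B1')]
```

A real federated engine would add per-source query translation, timeouts, and result-list merging policies, but the fan-out/merge skeleton is the same.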
For efficient direct usage of built-in search engines, the
user should be informed in advance about the type and
coverage of the catalog content and the corresponding search
engine's capabilities: accuracy and relevancy of returned
results; misspelling correction; ability to search and sort by
different criteria such as price, brand, and availability; ability
to find related words and common synonyms for terms; and
help in query formulation.
Ecommerce Meta Search Engines (EMSEs) [8] provide
unified access to multiple e-commerce search engines (ESEs),
allowing users to search and compare products from multiple
sites with ease. A number of EMSEs have been created for
particular application areas on the Web; one of them, for
example, searches books from multiple book sites.
Search engine technology has had to grow dramatically
during the last few years [7] in order to keep up with the
rapidly growing amount of information available on the web.
Despite all the continuous improvement, search engines
usually return thousands of results, yet it is difficult to find
those the user actually needs, or they are not among the
returned results at all. We believe that for easy and successful
finding of needed information on the web, the search has to be
personalized, focused, and semantic-based. The user has to be
advised and supported in formulating an unambiguous
full-text query and in choosing the search engines best suited
to his concrete purpose. In the second chapter we propose a
conceptual model of a semantic ontology-based tool for
scientists. It is intended for personalized searching of
scientific publications in digital libraries, books in web
catalogs of scientific-content books, and other scientific data
in web databases.
Recent research in Data Integration, the Semantic Web,
and E-Science witnesses increasing needs for semantically
driven data access, and in particular for so-called Ontology
Based Data Access (OBDA). The aim of OBDA is to use an
ontology, i.e. a formal conceptualization of the application
domain, to mediate access to data. Ontologies provide
semantic access to domain-specific application data, and the
expression of constraints allows overcoming incompleteness
that may be present in the actual data. Our idea is to model the
user profile of the scientist and his scientific domain
semantically, using ontologies, to achieve more flexibility in
choosing the right search engine and to decrease query
ambiguity, and in such a way increase the precision and recall
of searching scientific publications, books, or other types of
scientific data. We may also rerank the results returned (from
one or several similar search engines) according to the
particular user profile. The conceptual schema of the proposed
deep web search tool for scientists is shown in figure 1.
The main differences between our tool and other Deep
Web tools are:
It is specialized for scientific papers, books, and
scientific data;
It is intended to search three main Deep Web resource
types, and the resource type is explicitly specified in the
submitted query.
The tool will offer three types of services: searching
specific digital libraries for reading or purchasing scientific
papers, searching book catalogs for purchasing scientific
books, and searching scientific databases for integrated-circuit
scientific data.
We propose a rich user interface allowing selection of the
query intention (informational, for reading papers, or
transactional, for book purchasing), selection of preferred
libraries or repositories, as well as searching dependent on the
chosen data or metadata. As the tool uses semantic metadata
(domain and profile ontologies), we will experiment with
some query expansion techniques [12], as well as with
automatic library selection and reranking of returned results
according to the particular user profile.
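One simple query-expansion technique of the kind cited above is adding synonyms of the query terms, drawn from a domain vocabulary, before forwarding the query. A minimal sketch, with a hand-made illustrative synonym table standing in for a real domain ontology:

```python
# Illustrative synonym table; a real system would derive this
# from the domain ontology rather than hard-code it.
SYNONYMS = {
    "paper": ["article", "publication"],
    "ontology": ["vocabulary"],
}

def expand_query(query):
    """Return the original query terms plus any known synonyms,
    deduplicated, with the original term order preserved."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + SYNONYMS.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

print(expand_query("ontology paper"))
# ['ontology', 'vocabulary', 'paper', 'article', 'publication']
```

Expansion trades precision for recall, which is why [12] studies when it actually helps in a federated setting.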
Searching specific digital libraries relies on various
metadata: keywords, bibliographic metadata, and other
metadata specific to the concrete library. We will manually
explore widely used technical and scientific digital libraries
(for example [16], [17], [18]) and supply the user with a tool for
automatic library selection according to the query and his
profile information. The user may also manually choose a
library before sending a query.

Fig. 1. Conceptual model of the deep web search tool for scientists
(metadata is extracted from deep web e-commerce catalogs, digital
libraries, and scientific databases, and organized through upper-domain
and specific-domain ontologies beneath the user interface)

As a whole, the paper metadata stored in digital libraries
are not sufficient for efficient searching. We
propose on-the-fly annotation of selected papers before
ranking and recommending them to the users. For such
annotation, domain ontologies, a scientific experiment
ontology (EXPO [14]), and a paper structure ontology may be
used. Annotation will be discussed in another paper. We may
also use a Deep Web crawler for extracting metainformation
from digital libraries, or for automatically finding libraries
that we do not explore manually.
For searching book catalogs to purchase scientific books,
we will experiment with several Ecommerce Meta Search
Engines. Their drawback is that they query many databases
(not only those preferable for scientists), and the syntactic
search approach may push the results best suited to a concrete
user far down the result list, or even out of it. That is why we
plan to use a specialized Deep web electronic catalog semantic
crawler to extract specific metadata from electronic catalogs
or to find emerging catalogs (figure 1).
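The reranking idea mentioned above — moving results that match the user's profile toward the front of a merged list — can be sketched with a simple term-overlap score. The profile interests and result titles below are invented for the example; a real implementation would score against the profile ontology:

```python
def profile_score(title, interests):
    """Count how many profile interest terms occur in the title."""
    words = set(title.lower().split())
    return sum(1 for term in interests if term in words)

def rerank(results, interests):
    """Stable sort: higher profile score first; ties keep the
    original (engine-provided) order."""
    return sorted(results, key=lambda title: -profile_score(title, interests))

results = ["Cheap novels catalog",
           "Ontology driven data access",
           "Deep web crawler design"]
interests = {"ontology", "crawler", "web"}
print(rerank(results, interests))
# ['Deep web crawler design', 'Ontology driven data access', 'Cheap novels catalog']
```

Because the sort is stable, reranking only reorders by profile relevance and never discards results, which keeps the EMSE's own ranking as a tiebreaker.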
For searching Web databases containing scientific data
related to our electronic circuit testing domain, we will first
export the database schema as an ontology representing the
semantics of our data, and then develop and test a semantic
scientific data search tool which realizes Deep Web crawling
for similar databases, metadata extraction, and searching.
Query-based sampling [10] can be used to discover the
content available at a remote database server. Database
translation into an ontology will be discussed in another
paper. We expect that access rights to analogous Web
databases may be a serious problem for deep web crawling
and data extraction.
More than half of Web data are hidden from surface-web
search engines in databases of financial information, shopping
catalogs, and medical and other research in digital libraries. It
is of great importance for scientific research to
have easy and continuous access to the latest developments in
the scientific area (presented in publications, books, and other
scientific resources, usually stored in web databases).
As a result of the rapidly growing number of scientific
publications and books in electronic catalogs, search precision
and accuracy are becoming more and more important. One of
the main trends for improving search quality is increasing the
quality of resource metadata by using collaborative or
semantic web technologies for metadata extraction,
representation, and usage. Another important trend is digital
library and web catalog standardization, exporting resource
metadata in machine-processable format, and the development
of ever more effective deep web search tools.
In this paper, after an analytical survey of deep web tools
and approaches, we propose a conceptual model of a
specialized, personalized Deep Web search tool for scientific
information stored as publications in digital libraries or in
specific databases. It uses an ontology-based semantic search
approach to improve search quality. It relies on a rich
collection of metadata, extracted from repositories or through
direct automatic ontology-based annotation of textual
resources, to offer a flexible, user-friendly search interface
and query disambiguation capabilities. After analyzing the
query and taking into account the user profile, the domain
ontology, and the options explicitly selected by the user, the
tool may reformulate the query and decide which search
engine(s) to forward it to. The tool will be implemented and
tested as part of our research project. We plan to experiment
with dynamic selection of a search strategy among several
variants: directly forwarding the query to one or more
scientific database or e-commerce search engines, managing
user feedback, and storing processed information in the user
profile ontology for future use in the strategy-selection
process.
[1] Links to metadata standards
[2] Y. Liz, R. Krishnamurthy, S. Vaithyanathan, “Getting Work
Done on the Web: Supporting Transactional Queries”,
[3] A. Broder, “A taxonomy of web search”, IBM Research, 2002
[4] Wikipedia’s List of academic databases and search engines,
[5] Wikipedia’s List of search engines,
[6] Deep Web Research Resources and Sites
[7] Search Tools News,
search.html, 2010
[8] Q. Peng, W. Meng, and H. He, “WISE-Cluster: Clustering E-
Commerce Search Engines Automatically”, WIDM’04, November
12-13, Washington, DC, USA, 2004
[10] A. S. Tigelaar, D. Hiemstra, “Query-Based Sampling: Can we do Better
than Random?”, CTIT Technical Report, 2009
[11] K. C. Chang, B. He, C. Li, M. Patel, and Z. Zhang.” Structured
databases on the web: Observations and implications”, SIGMOD
Record, 33(3):61–70, Sept. 2004.
[12] M. Shokouhi, L. Azzopardi, P. Thomas, “Effective Query
Expansion for Federated Search”, SIGIR’09, July 19–23, Boston,
Massachusetts, USA, 2009
[13] A. S. Tigelaar, D. Hiemstra, “Query-Based Sampling: Can we do
Better than Random?”, CTIT Technical Report,
04.pdf, 2004
[14] L. Soldatova, R. D. King, “An Ontology of Scientific
Experiments”, Journal of the Royal Society Interface,
3(11): 795–803, 2006
[15] Web site to OAI-PMH compliant digital repositories in
the world,
RDB to RDF or Ontology Mapping – Approaches,
Tools and Problems1
Tatyana I. Ivanova2
Abstract: Web-based relational databases are secure, reliable,
and widely used, but search engines index only a small part of
their content. To facilitate the search and collaborative usage of
Web-based scientific data, exporting database schemas as
ontologies is needed. This paper explores current approaches and
tools for relational database to ontology translation and mapping,
to find a suitable tool for automatic representation of our
scientific data in machine-processable format.
Keywords: database to ontology mapping, relational
database to ontology transformation tool, OWL, ontology
Most data on the Web are stored in relational databases
and are accessible to humans through Web browsers.
Crawling-based general search engines like Google are not
fully capable of searching them, as their contents, known as
the deep Web or invisible Web, are hidden behind Web search
interfaces and are not effectively crawlable. The form-based
interface to relational Web databases, used by humans, is not
suitable for intelligent agents. If a software agent were
directed to use an e-commerce
system for example, it would need to interpret the instructions
for filling out the form, enter the appropriate parameters on
the form, submit the form, and parse the results. All of the
above may be difficult if not impossible for a software agent
to do, especially if the requirement is to compare results from
multiple e-commerce sites, all with different instructions,
forms, and returned data formats. Semantic Web technologies
and standards have been developed to make all kinds of data
available on a web site or web service accessible and easy to
use for both humans and computers. For that purpose, data
semantics have to be represented by ontologies, and methods
for querying ontologies that return semantic web data formats
have to be used. However, relational databases are the best
known tools for storing, managing, and accessing data, as
they are reliable, secure, and well-working. Therefore, it is
important to establish interoperability between relational
databases and ontologies.
Working on our scientific project, we will store valuable
testing data in a relational database, and we wish to make
them easily accessible from the web for other scientists. For
our investigations, comparing our data to other scientific
research results is of great importance, and we would also like
to simplify finding related data on the web.
1 The research presented in this publication is funded by Internal
Research Project 102ни013-10 of the Research and Development
Sector at the Technical University of Sofia for 2010.
2 Tatyana I. Ivanova is with the Technical University of Sofia, Bulgaria.
Effective ways to achieve interoperability between
databases are finding mappings between relational database
schemas and mediator ontologies to develop semantic
database wrappers, or exporting database schemas as
ontologies and dynamically mapping these ontologies. Many
approaches, methods, and tools [3] for semantic
interoperability of relational databases have been developed
in recent years. Because of some significant differences
between relational and semantic knowledge models, and the
immaturity of semantic web technologies, none of them can
guarantee automatic, error-free, unambiguous database to
ontology export, or high precision and recall in web searching.
We will explore and compare approaches and tools, using
the research papers describing them, their documentation, or
our own tests (where systems are available for download). Our
main aim is to choose the best approaches and methods and to
find suitable open source and free tools that, after some
customization, we may use to expose our database to the
semantic web, making possible both its usage by other
scientists and the finding of similar databases on the Web.
There is a great variety of methods for exposing relational
database data to the semantic web, differing from each other
in the model used (annotation or translation), languages, and
additional database manipulation techniques. Some extract the
schema from the database and convert it to a semantic web
format; others use annotations or wrappers. The formats used
for storing semantic data are usually RDF(S) or OWL.
Extracted semantic data can be stored together with the
database schema in new semantic repositories; alternatively,
the data remain in the relational database while the semantic
metadata extractions are stored in a separate repository, or
semantic markup is added to the existing data.
In order to annotate database data, it is necessary to assign
meaningful labels to them. Existing automatic data annotation
techniques [10] can be divided into three categories:
approaches based on Web interface page information (for
example, Arlotta presents an automatic annotating approach
which basically annotates data units with the closest labels on
result pages), on the interface schema (DeLa uses some
heuristic rules to construct basic annotators; Yiyao Lu utilizes
new features that can be automatically obtained from the
result page, including the content and data types of the data
units, to align data), and on domain ontology.
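The "closest label" heuristic of the first category can be illustrated in a few lines: each data unit in a result row is annotated with the nearest label that precedes it. The row representation and labels below are invented for the example; real systems extract them from result-page HTML:

```python
def annotate(row):
    """row: list of (kind, text) pairs, where kind is 'label' or 'data'.
    Annotate each data unit with the closest preceding label."""
    current_label = None
    annotated = []
    for kind, text in row:
        if kind == "label":
            current_label = text          # remember the latest label seen
        else:
            annotated.append((current_label, text))
    return annotated

row = [("label", "Title"), ("data", "Semantic Web Primer"),
       ("label", "Price"), ("data", "30 EUR")]
print(annotate(row))
# [('Title', 'Semantic Web Primer'), ('Price', '30 EUR')]
```

A data unit with no preceding label stays unannotated (label `None`), which is exactly the failure case the interface-schema and ontology-based categories try to handle.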
The ontology annotation-based approach [2] supposes that
the database owner produces server-side web page markup
(usually XML) describing the database’s information
structures. After locating the database, the searching client
uses his own client-side ontology, describing the semantics of
his domain, and his annotator to produce client-side
annotations that conform to his ontology and the server-side
markup. Then he can send semantic queries to the server-side
database, using his ontology and mapping rules, through the
Web-service API. If there is no server-side XML markup
describing the database, the client may use a deep web
crawler to send random queries to the server and obtain a
sample of documents from the underlying collection. The
sample represents the entire server content. This
representation is called a resource description.
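The sampling step just described can be sketched as a probing loop: issue single-term queries against the otherwise opaque server and accumulate term counts from the returned documents into a resource description. `remote_search` and its tiny document collection below are hypothetical stand-ins for the server's form interface:

```python
import collections

# Hypothetical hidden collection, reachable only via keyword queries.
DOCS = ["deep web crawler", "ontology based data access",
        "web data integration"]

def remote_search(term):
    # Stand-in for querying the remote server's search form.
    return [d for d in DOCS if term in d.split()]

def sample_resource_description(probe_terms):
    """Issue probe queries and build a term-frequency resource
    description from the returned documents (each counted once)."""
    seen = set()
    counts = collections.Counter()
    for term in probe_terms:
        for doc in remote_search(term):
            if doc not in seen:
                seen.add(doc)
                counts.update(doc.split())
    return counts

desc = sample_resource_description(["web", "ontology"])
print(desc.most_common(3))
```

The description is only as good as the probes: [10]/[13] ask precisely whether smarter-than-random probe term selection pays off.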
OntoMat-Annotizer is a user-friendly interactive webpage
annotation tool (it may be used to annotate database Web
interface forms directly). It supports the user in creating and
maintaining ontology-based OWL markup, i.e. creating OWL
instances, attributes, and relationships. It includes an ontology
browser for exploring the ontology and instances, and an
HTML browser that displays the annotated parts of the text. It
is Java-based and provides a plug-in interface for extensions.
It is freely available for download and comes with
documentation and a tutorial.
The annotation-based approach is easy for database
developers, but it never represents the semantics of the
database content unambiguously, which sometimes leads to
extraction of non-relevant data or strictly limited reasoning
capabilities.
Extraction of the database schema and representation of it
as an ontology ensures semantic access to the database data,
but there are some problems related to it. First, there are
significant differences between the RDB, RDF, and OWL
models; we will discuss them below. Second, manual
representation is difficult and time-consuming and assumes
knowledge-management skills, whereas automatic methods
are far from mature. Third, a lot of web databases are
accessible only through HTML forms, which creates
additional problems. Extraction of metadata from the database
schema is a common method, used by OntoKnowledge [5].
The difficulty of using database schema extraction methods
depends on the amount of information available about the
database schema. When we develop our own database, we
have full access to its data and schema, and if we plan to
make it easily accessible from the web, we have to build a
semantic interface to the data by automatic or semiautomatic
generation of a semantic description as an ontology. When we
access a web database that is not semantically described, we
can learn about its structure and data only by direct querying
through its form-based interface.
There are two main difficulties in translating from RDB to
OWL: how to capture and represent all the domain knowledge
implicitly used in the database, and how to deal with the
different logical foundations of RDB and OWL.
Data models, such as database or XML schemas, typically
specify the structure and integrity of data sets, and the
semantics of a data model often constitute an informal
agreement between the developers and the users of the data
model, finding its way only into the application programs that
use it. Ontologies, in contrast, should make explicit all the
semantics of the data model and make it as generic and
task-independent as possible. There are two main sources for
acquiring all this implied semantics during the RDB to OWL
mapping process: domain knowledge represented in
machine-processable format as ontologies (automatic
approach), and a human user or expert (manual approach).
Other mismatches between the RDB and OWL data models
that affect a transformation system are related to inheritance
modeling, property characteristics, underlying logical systems,
and open/closed world assumptions [19].
Deep web crawling and Web information extraction are
the two main areas related to extracting data and metadata
from databases through web interfaces to simplify database
access. The first prototype deep Web crawler presented to
automatically extract and analyze interface elements and
submit queries through these query interfaces was HiWE.
Many independent efforts have proposed keyword query
selection techniques for downloading textual content from
Web repositories.
There has been active research interest in understanding
the semantics of the query interfaces of structured Web
databases [4], [12]. WISE-Integrator [12], for example,
extracts element labels and default values of the elements to
automatically identify matching attributes, while [4] uses
statistical models to find the hidden domain-specific schema
by analyzing the co-appearance of attribute names.
Three main models and markups are used for storing
extracted knowledge: XML, RDF(S), and OWL. We will
discuss them separately, in detail, from the knowledge
representation point of view later.
The disadvantage of generating an ontology based on the
database schema and converting the whole database into a
new semantic web format is that any other applications that
interface with the legacy database will need to change. In the
tangled network of databases in a corporation or other
information organization, this option may be too costly and
disruptive to contemplate in the near term. Moreover, the
response time of the knowledge base strictly depends on its
logical model and richness, and in some cases checking and
querying the base may be very slow (OWL Full, for example,
is undecidable, which means that in some cases the knowledge
base could not respond within finite time).
The schema of a database can be extracted and converted
into a semantic format such as RDF-S. This semantic version
of the schema can be mapped to an ontology, or published via
UDDI or WSDL to make the data available to semantic
applications. The semantic metadata and mappings can then
be stored in a central repository for the purpose of making
queries across multiple data sources.
Semantic markup can be provided at the web page or web
service accessing the data, or on the repository itself. This
approach is used mainly in deep web annotation methods. If a
system uses a high level of semantic encoding, there will be
greater richness and precision in the semantics available to
capture the relationships between concepts that the logical
reasoning of agents requires. The levels of semantic encoding
(from lowest to highest) are: XML; XML Schema;
OWL Full.
There are several tools available for transforming
relational databases to ontologies [21], [22], [23]. Three main
approaches are used in such transformations: data-mining
based, knowledge-based, and rule-based. DataGenie [24] is a
rule-based Protégé plug-in capable of importing data
from a relational database and representing it in an ontology.
This import is simple: each table maps to a class, each column
maps to a data type property and each row maps to an
instance. The drawback of this simplicity is that DataGenie
and similar tools fail to discover inheritance, restrictions,
symmetric and transitive properties, object properties and
restrictions. It also ignores constraints that capture additional
semantics and does not analyze the loss of semantics caused
by the transformation. The RTAXON learning method
combines the most robust rules for exploiting relational
schemas with data mining focused on the specific problem of
concept hierarchy identification. It is implemented in the
RDBToOnto tool, which can be downloaded for free from
[25]. Another similar free Java-based tool, RDB2Onto,
converts selected data from a relational database to an
RDF/OWL ontology document based on a defined template. It
is intended for ontology population.
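The simple mapping that DataGenie-like tools apply (table → class, column → datatype property, row → instance) is easy to sketch. The following is an illustration of that rule, not any of the tools above: it reads a table from an in-memory SQLite database (invented sample data) and emits triple-shaped tuples with a made-up `ex:` namespace:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Paper (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO Paper VALUES (1, 'Deep Web Search')")

def table_to_triples(conn, table, ns="ex:"):
    """Naive rule-based export: the table becomes a class, each column
    a datatype property, each row an instance named by its first column."""
    cols = [c[1] for c in conn.execute(f"PRAGMA table_info({table})")]
    triples = [(ns + table, "rdf:type", "owl:Class")]
    for col in cols:
        triples.append((ns + col, "rdf:type", "owl:DatatypeProperty"))
    for row in conn.execute(f"SELECT * FROM {table}"):
        inst = f"{ns}{table}_{row[0]}"      # instance URI from the key value
        triples.append((inst, "rdf:type", ns + table))
        for col, val in zip(cols, row):
            triples.append((inst, ns + col, repr(val)))
    return triples

for triple in table_to_triples(conn, "Paper"):
    print(triple)
```

The sketch reproduces exactly the weakness the text notes: nothing in it can discover inheritance, restrictions, or object properties, because those are not visible in the bare table structure.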
DB2OWL [14] is another tool for automatic generation of
ontologies from database schemas. OntoWrapper [17] exposes
external semi-structured data to an ontology repository.
METAmorphoses processor [8] is a tool for the data
transformation from a relational database into RDF
documents. It is implemented in Java and is based on the two-
layer data transformation model: the mapping layer and
template layer. In the mapping layer, a given database schema
is mapped into a structure of a given ontology. The template
layer uses this mapping and produces RDF documents in a
way driven by the templates.
The difference between transformation of relational
databases to ontologies and database-to-ontology mapping is
that transformation generates an ontology corresponding to
the database schema, whereas mapping assumes the existence
of both a relational database and an ontology and produces a
set of correspondences between the two. Two main logical
models of semantic data representation are used on the Web:
RDF-based (including RDF and RDF Schema) and
OWL-based (including OWL Lite, OWL DL, OWL Full, and
their variants).
The RDF data model is a directed labeled graph, which
consists of nodes and labeled directed arcs linking pairs of
nodes. RDF is more expressive than the relational data model
and data represented in RDF can be interpreted, processed and
reasoned over by software agents. Two main approaches for
mapping generation between RDB and RDF are used:
Automatic domain-independent Mapping Generation, and
Domain Semantics driven Mapping Generation.
Automatic mapping usually generates mappings between
RDB and RDF with an RDB table as an RDF class node and
the RDB column or relation names as RDF predicates. Even
though these automatically generated mappings often do not
capture the complex domain semantics required by many
applications, they can serve as a useful starting point for
creating more customized, domain-specific mappings, or
enable Semantic Web applications to query RDB sources.
The Domain Semantics driven Mapping Generation
approach generates mappings from RDB to RDF by
incorporating domain semantics that are often implicit or not
captured at all in the RDB schema. The explicit modeling of
domain semantics, often in the form of a domain ontology,
enables software applications to take into account valuable
facts or relations concerning the data that users implicitly
assume when working with the database. There are freely
available ontologies on the internet (such as those of the
National Center for Biomedical Ontologies (NCBO), the Gene
Ontology GO, and so on) in almost all domains, that may be
used, usually after customization.
The mappings between RDB and RDF may be represented
as XPath rules in an XSLT stylesheet, in an XML-based
declarative language such as R2O [5], D2RQ [16], or D2R
MAP [1], or as “quad patterns” defined in Virtuoso’s [6]
metaschema language. The mappings, especially if they are
created by domain experts or reference a domain ontology,
may have wider applicability.
The mapping of RDB to RDF may be either a static Extract
Transform Load (ETL) implementation (called an “RDF
dump”), implemented in almost all such tools, or a
query-driven dynamic implementation. The dynamic approach
(for example in the D2RQ or Virtuoso systems) performs the
mapping dynamically in response to a query.
Tools from the OntoKnowledge project [17] and KAON
project [9] can be used for mapping a database schema to an
existing ontology or generating an ontology based on the
database schema.
Virtuoso RDF View [6] uses the unique identifier of a
record (its primary key) as the RDF subject, the column of a
table as the RDF predicate, and the column value as the RDF
object in the mapping process. Other similar tools are D2RQ [16] and
SquirrelRDF [Seaborne et al., 2007]. The D2RQ platform is
freely available and can be downloaded from
projects/d2rq-map/. SquirrelRDF provides access to relational
databases, by providing a SPARQL interface to a non-RDF
store by extending ARQ, the basic query engine for Jena.
This approach ensures a full SPARQL implementation over
the foreign data source. SquirrelRDF is freely available and
can be downloaded from
/files/. Triplify [18] is an approach to publishing RDF and
Linked Data from relational databases. It transforms the
resulting relations into RDF statements and publishes the data
on the Web in RDF serializations, as Linked Data. Triplify
can be easily integrated and deployed with Web applications.
It is complemented by a library of configurations for common
relational schemata and a REST enabled data source registry.
Creating mappings between a database schema and a Web
ontology is a prerequisite for generating ontological
annotations for dynamic Web page contents extracted from
the database.
In OWL, a class can be mapped to a relational table, and
the properties of a class can be mapped to the attributes of a
relational table. An inheritance (subClassOf) relation between
classes can be realized by a foreign key (acting as a primary
key) between relational tables, and a foreign key disjoint from
the primary key can be mapped to an object property.
Declarative languages such as D2R MAP may be used to
describe mappings between relational database schemata and
OWL ontologies, or mappings may be stored as part of the
initial ontology [15]. Two main approaches may be used to
discover semantic mappings: statistical and knowledge-based.
The mapping process based on the statistical approach [13]
starts with a relational schema and an ontology; constructs
virtual documents for the entities in the relational schema and
the ontology to capture their implicit semantic information;
discovers simple mappings between entities by calculating
confidence measures between the virtual documents via the
TF/IDF model; uses the mappings between relations and
classes to validate the consistency of the mappings between
attributes and properties; and produces a set of simple
discovered mappings.
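The core of that statistical step — comparing a schema entity's "virtual document" with each ontology entity's by TF/IDF-weighted cosine similarity — can be sketched as follows. The entity descriptions are invented for the example, and the weighting formula is one simple variant, not necessarily the one used in [13]:

```python
import math

def tfidf_vectors(docs):
    """docs: {name: text}. Returns {name: {term: tf * idf}} using a
    simple smoothed idf = log(1 + N / df)."""
    n = len(docs)
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    df = {}
    for terms in tokenized.values():
        for t in set(terms):
            df[t] = df.get(t, 0) + 1
    vecs = {}
    for name, terms in tokenized.items():
        tf = {}
        for t in terms:
            tf[t] = tf.get(t, 0) + 1
        vecs[name] = {t: c * math.log(1 + n / df[t]) for t, c in tf.items()}
    return vecs

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Virtual documents for one schema entity and two ontology classes.
schema = {"tbl_author": "author name affiliation"}
onto = {"Author": "author person name", "Journal": "journal issn volume"}
vecs = tfidf_vectors({**schema, **onto})
best = max(onto, key=lambda c: cosine(vecs["tbl_author"], vecs[c]))
print(best)  # 'Author'
```

Shared low-df terms ("author", "name") dominate the score, so `tbl_author` maps to `Author` rather than `Journal`; the consistency-validation pass described above then filters such candidate mappings.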
Knowledge-based approaches use knowledge sources such
as WordNet, or previously developed domain ontologies, to
extract shared concepts between the RDB and the ontology.
These approaches are semiautomatic, or complementary to
rule-based and statistical ones, as the quality of the knowledge
processing is relatively low.
As shown above, we have to expose the schema of our
scientific data using XML-based syntaxes for easy usage from
the Web. We may do this using two different approaches:
expose the whole database schema, or only provide
annotations. Making annotations is easy and they can easily
be used by Web software, but this type of data representation
lacks formal semantics, and natural language ambiguity can
become an obstacle to the proper use of the data. Representing
the database schema as an ontology provides both metadata
related to our data and formal semantics, and makes it
possible for software agents to reason about the semantics of
the data. Moreover, this representation will be used when we
search for data related to ours on the Web. We will map our
ontology concepts to metadata or concepts representing the
considered Web databases. The database schema
representation ontology alone may not be sufficient for
performing this mapping process, so we will additionally
expose our domain ontology (for handling synonymy,
one-to-many domain relationships, etc.) and a local context
ontology (for explicit representation of all the domain
knowledge implicitly implied, but not explicitly represented,
in the database), used in the process of automatically building
the database ontology. Moreover, queries in semantic web
query languages can be imposed on our database through its
connection to the ontology.
RDBToOnto is a free open source tool for automatic
generation of fine-tuned ontologies from relational databases.
We plan to use it (possibly after some customization) for
automatically exporting our database schema to an ontology.
Using Protégé, we were able to map ontology instances
onto relational databases and retrieve results with semantic
web query languages. The key idea is that, instead of storing
instances along with the ontology terminology, we can keep
them stored in a database and maintain a link to the dataset.
VisAVis is an open source Java-based Protégé plug-in for
mapping ontologies to databases; it can be downloaded from
[15]. VisAVis maps the relational database contents to the
TBox of the ontology. We plan to use it (possibly after some
customization) for mapping external database data to our
ontology in the process of searching the Web for data related
to ours.
[1] C. Bizer, “D2R MAP – A Database to RDF Mapping Language”,
In Proceedings 12th International WWW Conference, 2003
[2] S. Handschuh and R. Volz, “Annotation for the Deep Web”,
IEEE Intelligent Systems, Vol. 18, Issue 5, 2003
[3] S.S.Sahoo et al., “A Survey of Current Approaches for Mapping
of Relational Databases to RDF”,W3C RDB2RDF Incubator
Group, 2009
[4] Z. Zhang, B. He, K. C. Chang, “Understanding Web Query Inter-
faces: Best-Effort Parsing with Hidden Syntax”,SIGMOD, 2004.
[5] J. Barrasa, A. Gómez-Pérez, “Upgrading relational legacy data to
the semantic web”, In Proc. of 15th international conference on
World Wide Web Conference (WWW 2006), pages 1069-1070
[6] C. Blakeley, “RDF Views of SQL Data (Declarative SQL
Schema to RDF Mapping)”, OpenLink Software, 2007.
[9] Karlsruhe Ontology Project (KAON). Online. Internet.
2/15/2005. Available at:
[10] C. Xiao-Jun, P. Zhi-Yong, W. Hui, "Multi-source Automatic
Annotation for Deep Web," csse, vol. 4, pp.659-662, 2008
[11] S. Handschuh, R. Volz, S.Staab, “Annotation for the Deep
Web”, IEEE Intelligent Systems, September/October 2003.
[12] H. He, W. Meng, “WISE-Integrator: A System for Extracting
and Integrating Complex Web Search Interfaces of the Deep
Web”, VLDB'03, pp.357-368, Berlin, Germany, 2003
[13] W. Hu and Y. Qu, “ Discovering Simple Mappings Between
Relational Database Schemas and Ontologies”, 2007,
[14] N. Cullot, R. Ghawi, and K.Yétongnon, “DB2OWL: A Tool for
Automatic Database-to-Ontology Mapping”, CiteSeerX, 2008
[15] N. Konstantinou, et al., “VisAVis: An Approach to an
Intermediate Layer between Ontologies and Relational Databases”
[16] C. Bizer, R.Cyganiak, “D2RQ — Lessons Learned”, W3C
Workshop on RDF Access to Relational Databases, 2007.
[17] “The OntoKnowledge Toolset,” Online. Internet., 2004.
[18] S. Auer et al., “Triplify Lightweight Linked Data Publication
from Relational Database”, WWW 2009, Madrid, Spain
[19] S. S. Bhowmick, J. Küng, and R. Wagner, “Translating SQL
Applications to the Semantic Web”, LNCS 5181, pp. 450–464, 2008.
[20] OntoStudio web site
[21] M. Li, X. Du, S. Wang, “Learning Ontology from Relational
Database”, ICMLC, Vol. 6, 2005
[22] G. Shen, et al., “Research on the Rules of Mapping from
Relational Model to OWL”, Workshop on OWL: Experiences
and Directions, Vol. 216, 2006
[23] I. Astrova, A. Kalja, “Towards the Semantic Web: Extracting
OWL Ontologies from SQL Relational Schemata”, IADIS
International Conference WWW/Internet, 2006, pp. 62–66
[24] DataGenie:, 2007
[25] RDBToOnto download page, http://www.tao-
Software Cost Estimation - a Practical Approach
Violeta T. Bozhikova1
Abstract – Software cost estimation is considered as one of the
most challenging tasks in software project management. The
process of software estimation includes estimating the size of the
future software product, estimating the effort required,
estimating the duration of the project and finally – the people
required. This paper gives an overview of the most powerful cost
estimation models, discusses their advantages and weaknesses,
and finally recommends a hybrid cost estimation approach that
combines their strengths.
Keywords – Software Cost Estimation, Software Cost
Estimation Methods, Software Cost Estimation Tools.
Software cost estimation [1-5] is a continuing activity
which starts at stage of the project proposal and continues
through the overall life time of the software project. The goal
of this continual cost estimation is to ensure that the expenses
will not exceed the budget provided.
Considerable research has focused on development and
evaluation of universal software cost estimation models and
tools suitable for all software projects. After 20 years of
research, we can claim that many software cost estimation
methods are available, but no single method is suitable for all
software projects. In fact, their strengths and weaknesses are
often complementary. Understanding these strengths and
weaknesses is very important for software estimators, who are
increasingly convinced that accurate software estimation is
impossible using a single method and that a combination of
methods allows a more accurate and reliable software cost
estimate.
This paper gives an overview of the COCOMO hierarchy and
the Function Points cost estimation models, discusses their
advantages and disadvantages, and finally recommends a
practical cost estimation approach that combines their
strengths as a way to achieve efficient cost estimation.
A. COCOMO Family
Among the most commonly used software cost estimation
methods are the COnstructive COst MOdels (COCOMO
models). These estimation methods are considered algorithmic
because they provide mathematical equations to perform
software estimation. The COCOMO
mathematical equations are based on extensive historical
research and use inputs such as Source Lines of Code
(SLOC), number of functions to perform, and other cost
drivers such as language cost drivers, design methodology,
skill levels, risk assessments, etc. As algorithmic methods, the
COCOMO models have many advantages. The most important
are the objectivity, stability and sensitivity of the results
produced. Using such models, the estimator gets repeatable
results. At the same time, it is easy to modify input data and to
refine and customize formulas. The general disadvantage of
these models is the strong dependence of the estimates on the
inputs. Some inputs cannot be easily quantified. As a result,
poor sizing inputs and/or inaccurate cost driver ratings will
result in inaccurate estimates.
Basic COCOMO [1] is the first of the COCOMO family of
models. It was designed by Barry W. Boehm in 1981 as a
model for estimating the effort, cost and schedule of software
projects. Today a hierarchy of COCOMO models is available:
Basic COCOMO, Intermediate COCOMO, Advanced
COCOMO and COCOMO II.
The Basic COCOMO model computes the software effort
applied, “PM” (development effort, i.e. development cost), in
person-months as a function of program size expressed in
estimated thousands of lines of code (KLOC). A person-month
is the amount of time one person spends working on the
software development project for one month. This number is
exclusive of holidays and vacations but includes weekends.
The Basic COCOMO equations are:
PM = a_b × (KLOC)^(b_b)  [person-months]
TDEV = c_b × (PM)^SchedExp  [months]
Average Staffing = PM / TDEV  [people]
The coefficients a_b, b_b, c_b and SchedExp depend on the type
of the project (organic, semi-detached or embedded) and are
given in the next table:

Software Project   a_b   b_b    c_b   SchedExp
Organic            2.4   1.05   2.5   0.38
Semi-Detached      3.0   1.12   2.5   0.35
Embedded           3.6   1.20   2.5   0.32
1Violeta T. Bozhikova is with the Faculty of Computing and
Automation, Technical University of Varna, 9000 Varna, Bulgaria.
TDEV is the development time in chronological months, and
Average Staffing is the number of people required for the whole
project development. This model is good for quick, early and
rough estimates of software costs, but its accuracy is limited
because it does not account for the influence of a number of
well-known factors, such as hardware constraints or personnel
quality and experience, that have a significant influence on
software costs.
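The Basic COCOMO relations above can be sketched directly in code. The coefficient tuples are those from the table in the text; the 32-KLOC organic project in the example is purely illustrative:

```python
# Basic COCOMO: effort, development time and average staffing from size.
# Coefficient tuples are (a_b, b_b, c_b, SchedExp) from the table above.
COEFFS = {
    "organic":       (2.4, 1.05, 2.5, 0.38),
    "semi-detached": (3.0, 1.12, 2.5, 0.35),
    "embedded":      (3.6, 1.20, 2.5, 0.32),
}

def basic_cocomo(kloc, project_type):
    a_b, b_b, c_b, sched_exp = COEFFS[project_type]
    pm = a_b * kloc ** b_b          # effort in person-months
    tdev = c_b * pm ** sched_exp    # development time in months
    staffing = pm / tdev            # average number of people
    return pm, tdev, staffing

# Illustrative example: a 32-KLOC organic project.
pm, tdev, staff = basic_cocomo(32, "organic")
```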
Intermediate COCOMO is an extension of the Basic
COCOMO. This model computes the software development
effort PM as a function of program size and a set of “cost
drivers” that include subjective assessments of 15 cost driver
attributes, grouped into 4 major categories: “Product
attributes”, “Hardware attributes”, “Personnel attributes” and
“Project attributes”. Each of the 15 attributes is rated on a
6-point scale that ranges from “very low” to “extra high” (in
importance or value). The effort adjustment factor (EAF) for a
given project is calculated as the product of the fifteen effort
ratings (EMi, i = 1…15). Typical values for EAF range from
0.9 to 1.4. The Intermediate COCOMO formula for PM now
takes the form:
PM = EF × (KLOC)^ee × EAF  [person-months]
where the coefficient EF and the exponent ee are given in
the following table:

Software Project   EF    ee
Organic            3.2   1.05
Semi-detached      3.0   1.12
Embedded           2.8   1.20
The development time (TDEV) and the people required
(Average Staffing) are calculated from PM in the same way as
with Basic COCOMO.
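A minimal sketch of the Intermediate COCOMO computation follows. The EF/ee pairs are the standard Boehm coefficients (an assumption here, since the printed table is partly illegible), and the driver ratings in the example are invented for illustration:

```python
from math import prod

# Intermediate COCOMO: PM = EF * KLOC**ee * EAF, where EAF is the product
# of the 15 cost-driver effort ratings (each close to 1.0).
# EF/ee values are the standard Boehm coefficients (assumed).
PARAMS = {
    "organic":       (3.2, 1.05),
    "semi-detached": (3.0, 1.12),
    "embedded":      (2.8, 1.20),
}

def intermediate_cocomo(kloc, ratings, project_type):
    ef, ee = PARAMS[project_type]
    eaf = prod(ratings)            # effort adjustment factor (typ. 0.9-1.4)
    return ef * kloc ** ee * eaf   # effort in person-months

# Illustrative ratings: 14 nominal drivers (1.0), one rated "high" (1.15).
pm = intermediate_cocomo(32, [1.0] * 14 + [1.15], "organic")
```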
Advanced COCOMO can be seen as an extension of the
Intermediate COCOMO version. It calculates PM in the same
way as Intermediate COCOMO, but with an assessment of the
cost driver's impact on each stage (analysis, design, etc.) of
the software engineering process.
The development of the new COCOMO II model by the
Boehm's team is based on a study of about sixty projects at
TRW (a Californian automotive and IT company) in 2002 and
is the latest major extension of the original COCOMO. This
model is oriented towards newer software paradigms (for
example OOP) and modern software life cycles. For comparison,
the previous COCOMO models have been very successful for
projects up to 100000 lines of code, based mostly on the
waterfall model of software development and for
programming languages ranging from assembly to PL/I. In
addition, the previous COCOMO versions were defined in
terms of estimated lines of code LOC (and thousands of LOC,
i.e. KLOC). The COCOMO II model bases the calculation of
required effort PM on the software project's size measured in
SLOC (and thousands of SLOC, i.e. KSLOC). The difference
between LOC and SLOC (a single Source Line of Code) is that
a SLOC may include several physical lines. Each structured
construct, for example an “if-then-else” statement, would be
counted as one SLOC. For comparison, in the Basic COCOMO
model this statement might be counted as several LOC.
The first equation below (PMnom) is the base model for the
Early Design and Post-Architecture cost estimation of the
software project. The inputs are the size of the software
development in KSLOC, a constant A and a scale factor B [3].
The size in KSLOC is derived from estimating the size of the
software modules that will constitute the application program.
It can also be estimated from unadjusted function points
(UFP), converted to SLOC and then divided by one thousand.
The scale (or exponential) factor B is derived from five scale
drivers (Team Cohesiveness, Process Maturity,
Precedentedness, Flexibility and Breakage) and accounts for
the relative economies or diseconomies of scale encountered
for software projects of different sizes [3]. The constant A
depends on the size of the project. The nominal effort PMnom
and the adjusted effort PMadjusted for a given size project,
expressed in person-months, are given by the following
equations:
PMnom = A × (KSLOC)^B  [person-months]
PMadjusted = PMnom × EAF  [person-months]
COCOMO II has 17 cost drivers attributes (Analyst
Capability, Applications Experience, Programmer Capability,
Use of Software Tools, Multisite Development, Required
Development Schedule, Required Software Reliability,
Database size, Product complexity, Personnel Experience,
Language and Tool Experience, Personnel Continuity,
Execution Time Constraint, Main Storage Constraint,
Platform Volatility, Required Reusability, Documentation
Match to Life-Cycle Needs), whose ratings (expressed as
numbers EMi, i = 1…17) the estimator has to determine in
order to calculate the effort adjustment factor EAF.
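The two COCOMO II equations above can be sketched as follows; the constant A, the scale factor B and the 17 effort multipliers used in the example are illustrative placeholders, not calibrated values:

```python
from math import prod

def cocomo2_effort(ksloc, A, B, effort_multipliers):
    """COCOMO II base model: nominal and adjusted effort in person-months.
    B is derived from the five scale drivers; the effort multipliers are
    the ratings of the 17 cost-driver attributes."""
    pm_nom = A * ksloc ** B
    pm_adjusted = pm_nom * prod(effort_multipliers)
    return pm_nom, pm_adjusted

# Illustrative inputs: 50 KSLOC, A = 2.94, B = 1.10, all drivers nominal.
pm_nom, pm_adj = cocomo2_effort(50, 2.94, 1.10, [1.0] * 17)
```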
B. Function Point Analysis
Although counting lines of code is the first and most
common software sizing methodology, this sizing method is
no longer practical due to the great advancements in software
engineering and modern programming languages. Another
commonly used sizing method is the IFPUG method [5],
called Function Point Analysis (FPA). It quantifies size in
terms of the functions that the system delivers to the user.
The function point measurement method was developed by
A. Albrecht at IBM in 1979. The main
advantages of function point analysis based models are:
- function points (FP) can be estimated from requirements
specifications or design specifications, so using FPA it is
possible to estimate development cost in the early phases of
the project;
- function points are independent of the programming
language or the methodologies used for software development;
- since function points are based on the system user's
external view of the system, non-qualified users have a
better understanding of what function points are measuring.
Different variations of Function Points have emerged over
the years, such as Object Oriented Function Points, Use Case
Function Points and so on. Function point estimation approach
is widely used within COCOMO II because COCOMO II is
oriented to the newer software paradigms and to the modern
software life cycles.
Our approach is a combination between almost all
COCOMO models: Basic COCOMO, Intermediate
COCOMO, and COCOMO II with Function Point Estimation
features. The reason for developing such a hybrid approach,
combining all the methods mentioned above, is to give
estimators the opportunity to choose a suitable cost
estimation model, depending on the concrete project type and
on the specific, often incomplete, initial knowledge about the
software product in the early stages of its development.
The two basic steps required to accomplish software
estimation are:
- Estimate product size,
- Estimate effort applied, project's duration and resources.
A. Estimate product size
Our approach bases the calculation of required effort PM on
the software project's size measured in COCOMO II SLOC
(and thousands of SLOC, i.e. KSLOC). The calculation of
SLOC (KSLOC) may be based on an expert's estimation of
the size of the software project (if it is possible to make such
an estimate) or on FP estimation. The usual function point
estimation procedure is based on information that is available
early in the project life cycle. It begins with determining and
classifying (by complexity level) the user functions as Inputs,
Outputs, Files, Interfaces and Queries (figure 1). As a result,
the Unadjusted Function Points (UFP) quantity is calculated
(figure 1). Next, the Unadjusted Function Points (UFP) are
translated into SLOC: the unadjusted function points are
converted into equivalent SLOC depending on the
LangFactor of the language used. For example, the
LangFactor [3] for Assembly language is 320 SLOC/UFP, for
C++ it is 29 SLOC/UFP, for Fortran 77 it is 105 SLOC/UFP,
for Lisp it is 64 SLOC/UFP, for Pascal it is 91 SLOC/UFP,
and so on.
The usual Function Point procedure accounts for the degree of
influence DI (2) of fourteen application characteristics (figure
2), such as distributed functions, performance, reusability, etc.
The ratings of these 14 characteristics (each rated on a scale of
0 to 5) are added together to give DI; multiplied by 0.01 and
added to a base level of 0.65, this produces a general
characteristics adjustment factor that ranges from 0.65 to 1.35.
Our approach follows the usual Function Point procedure
described above to calculate the size of the project. The final
equation used for the size estimate is:

SLOC = UFP × (0.65 + 0.01 × DI) × LangFactor    (3)
Fig. 1. FP calculation
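The sizing step, equation (3), can be sketched as below. The SLOC/UFP language factors are the ones quoted in the text; the UFP and DI inputs in the example are illustrative:

```python
# SLOC = UFP * (0.65 + 0.01 * DI) * LangFactor -- equation (3).
# LangFactor values (SLOC per unadjusted function point) from the text.
LANG_FACTOR = {"assembly": 320, "c++": 29, "fortran77": 105,
               "lisp": 64, "pascal": 91}

def fp_to_sloc(ufp, di, language):
    """Convert unadjusted function points to SLOC. di is the total degree
    of influence of the 14 application characteristics (range 0..70)."""
    adjustment = 0.65 + 0.01 * di   # ranges from 0.65 to 1.35
    return ufp * adjustment * LANG_FACTOR[language]

# Illustrative example: 120 UFP, mid-range influence, C++ target language.
sloc = fp_to_sloc(120, 35, "c++")
```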
B. Estimate effort applied, project’s duration and resources
The general equation that we have used to calculate the
effort needed (PM) for a given size project development,
expressed in person-months, is equation (4) below. If the
effort adjustment factor EAF is 1 (its default value), PM is
interpreted as the nominal effort PMnom needed for a given
size project development, expressed in person-months. The
values of the coefficient EF and the exponent ee in this case
are based on the Intermediate COCOMO model.
The calculation of the effort adjustment factor EAF (5) is
related to the calculation of the adjusted effort PMreal. The
EAF estimation can be based on the fifteen Intermediate
COCOMO cost drivers or on the seventeen COCOMO II cost
drivers plus one. The total of eighteen cost drivers in the latter
case is grouped into 3 major categories: “Personnel
attributes”, “Project attributes” and “Product attributes”. An
additional user-defined cost driver, named USER, is added to
the classic COCOMO II cost drivers. It gives estimators an
opportunity to account for the impact of a chosen project-
specific factor other than those provided in COCOMO II.
EAF for a given project is calculated as the product of the
effort ratings of these attributes.
SLOC = UFP × LangFactor    (1)
PM = EF × EAF × (KSLOC)^ee  [person-months]    (4)
Fig. 2. Application characteristics and DI calculation
The calculation of the duration TDEV of the project is based
on the effort predicted by the effort equation (equation (6)
below): PM is the calculated effort (nominal or real),
SchedExp is the schedule exponent derived from the Basic
COCOMO model and EF is a coefficient derived from the
Intermediate COCOMO model.
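Putting equations (4), (6) and (7) together, the effort, duration and staffing pipeline of the approach can be sketched as below; the coefficient choices in the example (Intermediate COCOMO organic EF and ee, Basic COCOMO organic SchedExp) are assumptions for illustration:

```python
def estimate(ksloc, eaf, ef, ee, sched_exp):
    """Hybrid pipeline: effort (4), duration (6), average staffing (7)."""
    pm = ef * eaf * ksloc ** ee     # effort in person-months
    tdev = ef * pm ** sched_exp     # duration in months
    staffing = pm / tdev            # average people required
    return pm, tdev, staffing

# Illustrative run: 32 KSLOC, nominal EAF, organic-style coefficients.
pm, tdev, staff = estimate(32, 1.0, 3.2, 1.05, 0.38)
```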
The average staffing is calculated as follows:
This paper gives a comparative overview of the COCOMO
and FPA models, discussing their advantages and
disadvantages, and proposes a hybrid cost estimation approach
that combines their strengths. Our observation is that an
approach combining all the methods mentioned above gives
estimators the opportunity to choose the appropriate
estimation method in a situation of often incomplete
specifications and unclear requirements in the early stages of
the project life cycle.
An interactive and flexible tool (figure 3) that implements
the software estimation approach, discussed above, was
developed. Depending on the specific characteristics of the
project, the estimator can choose the appropriate sizing metric
and method of cost estimation. The experiments show that it
is not always reasonable to use SLOC as the sizing metric, but
neither is it reasonable to use function points as the sizing
metric for estimating low-level language projects or legacy
systems. Although the results are encouraging and match
expectations for the tested projects, research must continue in
the direction of evaluating large and complex projects.
Fig. 3. A tool for Software Cost Estimation (the main window)
[1] B.W. Boehm et al, "The COCOMO 2.0 Software Cost
Estimation Model", American Programmer, 1996, pp.2-17.
[2] B. W. Boehm, "Software Engineering Economics",
Prentice-Hall, 1981.
[3] COCOMO II Model Definition Manual,
[4] Karen Lum et al. "Handbook for Software Cost Estimation", Jet
Propulsion Laboratory, Pasadena, California, 2003.
[5] Liming Wu, "The Comparison of the Software Cost Estimating
Methods", ttp://
TDEV = EF × (PM)^SchedExp  [months]    (6)
Average Staffing = PM / TDEV  [people]    (7)
Web-based Interactive System for
Multicriteria Decision Analysis
Mariana V. Vassileva1, Krassimira B. Genova2 and Boris A. Staykov3
Abstract – The multicriteria decision analysis problems are
decision problems, in which a finite number of decision
alternatives are evaluated in terms of multiple conflict criteria. A
web-based software system is presented in the paper, which
implements an interactive optimizationally motivated method,
appropriate for solving problems with a big number of
alternatives. The information, which the decision maker (DM)
has to set, includes the desired or acceptable changes in the
values of some criteria and the desired or acceptable changes in
the direction of other criteria. In this way the DM is able to
manage the solving process of finding the most preferred
solution, which makes him/her feel more confident in the final
results obtained.
Keywords – Multicriteria decision analysis, Interactive
method, Web-based system.
In problems for multicriteria analysis (MA) (also called
multicriteria problems with discrete alternatives), several
conflicting criteria have to be simultaneously optimized in a
feasible set of alternatives [20]. In the general case there does
not exist one alternative which could be optimal for all the
criteria. However, there does exist a whole set of alternatives,
called non-dominated set of alternatives, which possess the
following property: every improvement in the value of one
criterion leads to deterioration in the value of at least one
other criterion. Each alternative from this set is called a non-
dominated alternative and can be the final Pareto optimal
solution of the multicriteria analysis problem.
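The non-dominated set described above can be computed with a straightforward pairwise dominance check. A minimal sketch, assuming all criteria are to be maximized and each alternative is given as a tuple of criterion values:

```python
def dominates(a, b):
    """True if a dominates b: at least as good on every criterion and
    strictly better on at least one (maximization assumed)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def non_dominated(alternatives):
    """Return the non-dominated (Pareto) subset of the alternatives."""
    return [a for a in alternatives
            if not any(dominates(b, a) for b in alternatives)]

# Illustrative data: four alternatives evaluated on three criteria.
alts = [(3, 5, 2), (4, 5, 2), (2, 9, 1), (1, 1, 1)]
pareto = non_dominated(alts)
```

Each surviving alternative has the property stated above: improving its value on one criterion necessarily worsens the value of at least one other criterion relative to some member of the set.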
The multicriteria analysis problems can be formulated in
different areas of resource management (financial, natural,
etc.), communications, production, commerce, services,
education and others. In these problems the set of alternatives
usually consists of a not very large number of variants for
choice, ordering or ranking. However, there also exist
problems that may have hundreds of alternatives [4, 7]. In
some MA problems the evaluations of the alternatives with
respect to the criteria have an uncertain value. The evaluations
may be quantitative, qualitative or rankings.
1Mariana V. Vassileva is with the Institute of Information
Technologies-BAS, “Acad. G. Bonchev”, str. Bl. 29A 1113 Sofia,
Bulgaria, E-mail: .
2Krasimira B. Genova is with the Institute of Information
Technologies-BAS, “Acad. G. Bonchev”, str. Bl. 29A 1113 Sofia,
Bulgaria, E-mail: .
3Boris A. Staykov is with the Institute of Information
Technologies-BAS, “Acad. G. Bonchev”, bl. 29A 1113 Sofia,
Bulgaria, E-mail: .
The methods developed so far for solving multicriteria
analysis problems can be grouped into three groups. Each
one of these methods has its own advantages and