ArticlePDF Available

References and arrow notation instead of join operation in query languages



We study properties of the join operation in query languages and describe some of its major drawbacks. We provide strong arguments against using joins as a main construct for retrieving related data elements in general purpose query languages and argue for using references instead. Since conventional references are quite restrictive when applied to data modeling and query languages, we propose to use generalized references as they are defined in the concept-oriented model (COM). These references are used by two new operations, called projection and de-projection, which are denoted by right and left arrows and therefore this access method is referred to as arrow notation. We demonstrate advantages of the arrow notation in comparison to joins and argue that it makes queries simpler, more natural, easier to understand, and the whole query writing process more productive and less error-prone.
Computer Science Journal of Moldova, vol.20, no.3(60), 2012
References and arrow notation instead of
join operation in query languages
Alexandr Savinov
We study properties of the join operation in query languages
and describe some of its major drawbacks. We provide strong
arguments against using joins as a main construct for retrieving
related data elements in general purpose query languages and ar-
gue for using references instead. Since conventional references are
quite restrictive when applied to data modeling and query lan-
guages, we propose to use generalized references as they are de-
fined in the concept-oriented model (COM). These references are
used by two new operations, called projection and de-projection,
which are denoted by right and left arrows and therefore this ac-
cess method is referred to as arrow notation. We demonstrate
advantages of the arrow notation in comparison to joins and ar-
gue that it makes queries simpler, more natural, easier to under-
stand, and the whole query writing process more productive and
less error-prone.
Keywords: Data modeling, query languages, concept-oriented
model, join, reference, arrow notation, data semantics.
1 Introduction
The main goal of a data model is providing suitable structure for rep-
resenting things and connections between them. Operations for data
access and analysis are performed by means of some kind of query lan-
guage which reflects and relies on these structural principles. For a
general purpose data model and query language, the key problem is in
finding the simplest and most natural structure and operations which
°2012 by A. Savinov
A. Savinov
cover a wide range of patterns of thought and mechanisms being used
in data modeling.
Most data models are very similar in how they represent things but
they are quite different in representing connections. There exist several
major ways for representing connectivity such as relationships, links,
references, keys, joins. A relationship is a thing which may have its
own properties and identity. Relationships can connect many things
but they do not have a direction. A link is a directed binary relation-
ship, that is, a thing that connects two other things with special roles:
an origin and a destination. A reference is also a directed connection
between two things but in contrast to links it is not a thing and has
neither separate identity nor properties. A key is a number of prop-
erties of the thing which are used for identification purposes. Join is
an operation which relies on thing properties in order to establish a
connection between them at the level of queries.
One of the main motivating factors for developing the relational
model [1] was the desire to get rid of (physical) identifiers and to focus
on the data itself rather than on how it is represented and accessed.
However, removing physical identifiers led to removing connectivity
from the model. As a consequence, data was broken into several iso-
lated sets of tuples and the question was how to retrieve related (con-
nected) tuples. The solution was extremely simple: tuples containing
the same values were supposed to be related. For example, if both
an employee record and a department record have an attribute with
the value ’HR’ then this employee was supposed to be related to this
department. The operation which finds and combines such tuples was
called join.
Although join was introduced as one of the main operations of the
relational algebra, now it is used in almost any data model so it can
be characterized as a pillar of data modeling. It is one of the most
frequently used words in the literature on query languages and can be
found in almost any data related context. The main purpose of join
consists in connecting data elements which are modeled as existing sep-
arately in different relations. It can be viewed as a means of activating
implicit relationship at the level of queries. Since joins are not declared
References and arrow notation . . .
at the level of the model, they provide almost arbitrary control over
the data at query time. This property makes it very powerful opera-
tion but at the same time rather difficult to use and even dangerous
for inexperienced users. In this sense, join is analogous to the goto
(jump) operator in programming languages which is also a powerful
low level operator providing high freedom in programming but leading
to unstructured code and difficult to find errors [2].
Another wide-spread mechanism of connectivity is reference. One
of the most important properties of references is that they are not part
of the represented thing. For example, a class in object-oriented mod-
els does not describe references that will be used for representing its
instances. References are not stored as part of the object in any of its
fields but rather are provided separately. Another important property
is that things cannot be accessed without some kind of reference. In-
deed, if a property needs to be accessed then it is not possible to use
another property for this purpose just because it is not accessible yet.
The pattern ”accessing properties using properties” obviously contains
a cycle and therefore it cannot be directly implemented. Therefore, it
is always necessary to have something that exists separately from and
is intended to provide access to object properties. This is precisely
what references are intended for. The question is only whether they
are described explicitly as integral part of the model, provided by the
platform as it is done in object-oriented models, or completely removed
and replaced by some other mechanism like primary keys as it is done
in the relational model. Essentially, the question is whether references
are data and hence the model has to provide adequate means for their
modeling or references are not data and should be excluded from the
References have numerous advantages in comparison to joins. They
are extremely easy to understand because they are widely used in ev-
eryday life where all things have some unique identifiers. They are also
very easy in use. It is enough to know a reference in order to get the
contents of the represented thing. There is no need in specifying what
and how has to be compared and what criteria have to be satisfied to
access the represented thing. For example, given an employee record
A. Savinov
we could retrieve its department by using the reference stored in one
of the employee properties. The use of join operation means that a
database is a set of things with common values. To access data, it
is necessary to specify a criterion which has to be satisfied by all ele-
ments. For example, to get a publisher we need to specify that both
the book and the publisher must have the same value in some property
(publisher id). Although references are very natural and simple to use,
joins are much more powerful when it is necessary to manipulate sets
of elements rather than their individual instances. For this reason, it is
not that easy to replace joins by references and this is why join is still
dominating in the area of data modeling and query languages although
it is quite difficult to use.
This paper is devoted to comparing joins and references. We
demonstrate that join operation has some significant drawbacks which
make it difficult to use and error-prone in comparison to references.
Therefore, we ask the question whether it is possible to eliminate joins
from data modeling (or at least diminish their use) by retaining most of
the possibilities this operation provides. Obviously, it is a highly non-
trivial task and one difficulty is that thinking of data in terms of joins is
so deeply penetrated into our minds that it is considered more a dogma
than one of the alternatives in data modeling and querying. Another
difficulty is that join is a set-oriented operation while references are
instance-oriented and this is why references are not so popular in data
modeling. As a reference-based solution to the problem of joins, we de-
scribe a novel approach to data modeling, called the concept-oriented
model (COM) [8, 9, 10], which generalizes references. In particular, it
allows for modeling domain-specific references which replace primary
keys. What is more important, COM provides two novel operations,
called projection and de-projection, which can be viewed as set-oriented
analogue of the classical dot notation. These two operations are de-
noted by left and right arrows and therefore this approach is referred
to as arrow notion. We demonstrate how typical tasks can be (eas-
ier) implemented using COM references and arrow notation without
using joins. The paper has the following layout. Section 2 describes
the operation of join, references and arrow notation in COM. Section
References and arrow notation . . .
3 describes drawbacks of joins and how these problems can be solved
by means of COM references. Section 4 makes concluding remarks.
2 Joins and references
2.1 Joins and common value approach
In mathematics, a Cartesian product is an operation which allows us to
build a new set out of a number of given sets by producing all possible
combinations of their members. Given two sets Uand V, the Cartesian
product U×Vis defined as the set of all possible 2-tuples: U×V=
{hu, vi|uUvV}. Each element of the Cartesian product connects
two input elements. Including all combinations of the input tuples in
the result set means that all these tuples are considered related, that is,
every element of one set is associated with every element of the other
Normally, not all input tuples are related and therefore a mecha-
nism is needed which would allow us to restrict the Cartesian product
by specifying which tuples from two sets should match. This task is
performed by join operation the basic idea of which is that only those
combinations of tuples are included in the result set which both satisfy
some common criterion. In most practical cases, the selection of related
tuples is performed by using the equality condition (this join is therefore
referred to as equijoin). Tuples in the relational model are composed of
values which are accessed by means of attribute names. In this case, re-
lated (matching) tuples produced by join must contain equal values in
the specified attributes: U p=qV={hu, vi|uUvVu.p =v.q},
where pand qare attributes which have to contain the same values in
both tuples.
In order to be matched, two data elements have to contain the same
value in some attributes and therefore we will refer to this mechanism
as a common value approach. Thus records which store common values
are considered related in the database. For example, records from two
tables Employees and Departments could be defined as related if
they have the same value in the city attribute.
A. Savinov
Note also that the general idea of the common value approach is
also present in formal logic and deductive databases [11]. In predicate
calculus, if two predicates have the same free variable then they have to
match (to be bound to the same value) in order for the resulting propo-
sition to be true. Since relations can be represented as n-place predi-
cates, join can be written in logical form. For example, given two predi-
cates Employees(#e, cname, city) and Departments(#d, ename, city)
representing relations Employees and Departments, respectively,
we can find all combinations of free variables where the matching vari-
able city takes the same value.
The common value approach has the following properties:
The relationship defined by join (via common values) does not
have a direction. We simply say that two records match because
they have the same property. Although some variants of join like
left and right outer join have a direction, it cannot be easily se-
mantically interpreted and should be viewed as variations of one
operation. In particular, we cannot say that one record is refer-
enced or linked to the other. In this sense, the common value ap-
proach is similar to relationships in the entity-relationship model
which also do not have a direction.
It is defined in terms of values and attribute domains, that is, a
connection between two relations is specified via some common
domain. There is no direct way to define join in terms of other
relations. For example, we cannot directly find Employees and
Departments which have the same address attribute which
represents a record from the Addresses table rather than a
domain. The reason is that attributes contain values and cannot
contain tuples.
2.2 References and dot notation
Reference is one of the corner stones of the object-oriented paradigm
where it is assumed that any object has a unique identity which is
References and arrow notation . . .
used to represent and access it. References have the following main
References are values which are passed by-copy. It is enough to
store this value in order to represent the object and then access it.
When a reference is copied, the contents of the object is not copied
but can be accessed later by using this reference. References do
not have their own references.
References are not object properties (not included in the object
contents) and not part of the object. They exist separately from
the objects they represent.
References hide the details of object identity so that different ob-
jects may have different structure of their references which how-
ever are not visible when they are accessed.
References provide transparent access to objects by hiding its
internal mechanics which can be quite complex. They create the
illusion of instantaneous access.
References are used along with a very convenient access pattern,
called dot notation, where the result of access is considered a
reference which can be used for the next access operation.
References make excellent job in the area of programming but they
have a rather limited use in data modeling. So what is the problem
in introducing references in query languages and combining features of
object-oriented and relational approaches? In fact, it is a rather old
idea and almost any new query language tries to use references and dot
notion to make data manipulations easier. But the fact is that they
all fail in eliminating joins which means that not everything can be
done by references in the area of data modeling. The primary reason
(for the failure of references in data modeling) is that references and
dot notation were designed to manipulate instances rather than sets.
In other words, programing is an instance-oriented area while data
modeling is a set-oriented area. Indeed, only individual objects can
A. Savinov
reference each other, not sets. We cannot easily adopt dot notation
for manipulating sets. Another reason is that tuples in the relational
model do not have identities because any tuple is unique and identifies
itself by its own contents. In the next section we describe an approach
to data modeling which does not have these drawbacks.
2.3 References in the concept-oriented model
The concept-oriented data model (COM) is a unified general purpose
model the main goal of which is to radically simplify data modeling
by reducing a large number of existing data modeling methods to a
few novel structural principles. One of its principles is that identities
and entities are supposed to be equally important. This distinguishes
it from most other models which have a strong bias towards modeling
entities while identities (references, addresses, surrogates, OIDs) are
considered secondary elements which are either modeled by means of
entities or provided by the platform.
COM makes identities and entities equally important parts of a
data element both being in the focus of data modeling. An element in
COM is defined as consisting of two parts, identity and entity, which
are also called reference and object, respectively. Identity is passed
by-value while entity is passed by-reference. Both constituents have
arbitrary domain-specific structure which is modeled by means of a
novel construct, called concept (hence the name of the model). Concept
is defined as a pair of two classes: one identity class and one entity class.
For example, if employees are identified by their passport number and
characterized by name then they are described by the following concept:
CONCEPT Employees
CHAR(10) passNo
CHAR(64) name
Note that objects (entities) of this concept will have only one field
and these objects will be represented by a reference (identity) also
consisting of one field. However, identity part is passed by-value and
References and arrow notation . . .
stored in variables while entity part is passed by-reference. A concept
can be thought of as a conventional class with an additional class for
describing the format of references.
COM provides several benefits which are important in the context
of this paper:
COM does not distinguish between sets of values and sets of
objects or, in relational terms, between domains and relations.
There is only one type construct, concept, which is used for defin-
ing both domains and relations. In particular, relation attributes
can be both relation typed and value typed.
Concepts make it possible to describe arbitrary domain-specific
references what is not possible in object-oriented models. In this
sense, references in COM are similar to primary keys in the re-
lational model. However, the difference is that they are treated
and behave like true references while primary keys are treated as
integral part of the entity used for identification purposes (more
about these difference can be found in [10], Section 2).
COM introduces an operation of projection which is analogous to
dot notation but is applied to sets. In the concept-oriented query lan-
guage (COQL) it is denoted by right arrow and returns a set of elements
which are referenced by the elements from the given set. Sets in COQL
are enclosed in parentheses and can also include a condition for con-
straining its elements. For example, all publishers for a set of books
can be obtained by projecting this set of books to the set of publishers:
(Books | year > ’2005’)
-> publisher -> (Publishers)
COM also introduces the opposite operation of de-projection which
can be viewed as a set-oriented reversed dot notation. It is denoted by
left arrow and returns a set of elements referencing the elements from
the given set. For example, given a set of publishers we can get all
their books:
(Publishers | country = ’MD’)
<- publisher <- (Books)
A. Savinov
Projection and de-projection can be applied to the result set re-
turned by the previous operation and such an approach is referred to
as arrow notation. Arrow notation has the following main properties:
Operations are applied to sets rather than instances
It uses domain-specific instances as they are defined in concepts
rather than only primitive references
The structure of references is hidden and is not exposed in the
In the rest of the paper we describe how these two operations are
used for querying instead of joins.
3 References for solving join problems
3.1 Connectivity
Perhaps the main use of joins consists in implementing what references
are intended for. A database is thought of as a set of objects referencing
each other. However, if the database is unaware of references and
manipulates only values then these connections have to be expressed
by means of joins. For example, if each employee record references its
department then a set of departments for all employees in one country
is retrieved by means of the following join-based query:
SELECT FROM Departments D, Employees E
WHERE D.dept = E.dept AND = ’MD’
Here we immediately see one problem: join is a symmetric construct
while references are directed. Indeed, if we look at the above query then
it is difficult to understand whether departments reference employees or
employees reference departments. It is not surprising because joins have
quite different purpose but this fact makes them not very appropriate
for implementing references. The mechanism of foreign and primary
keys can help here but it is optional and is used at the level of schema
rather than in queries.
References and arrow notation . . .
Another problem of joins is that they expose the structure of refer-
ences by explicitly specifying all the details which actually do not be-
long to the domain-specific part of the query. Effectively, the low level
mechanics of references becomes integral and explicit part of each and
every query that involves more than one table. If the structure of con-
nections changes then all queries where it is used have to be updated.
Such program logic or query fragments which are scattered throughout
the whole source code are referred to as cross-cutting concern. This
problem is well known in programming [4] because it makes programs
difficult to maintain and error prone. Such functions as logging, trans-
action management, persistence and security are typical examples of
cross-cutting concerns because they are used in the same form across
the whole program. The main goal here is to separate these functions
or query fragments from the main business logic.
Join operation is a typical example of a cross-cutting concern
because many queries solving different domain specific tasks involve
the same fragments in the form of join conditions. The reason
is that database schemas always follow certain structure of connec-
tions and relationships while joins simply materialize them at query
time. In the previous example, the schema contains two tables
Departments and Employees which are connected via the join con-
dition D.dept=E.dept. Note however that this join is specified along
with the second condition for selecting employees of one country only.
The problem is that the first condition is a cross-cutting concern be-
cause it depends on the schema structure only and will be repeated in
the same form in many queries involving these two tables. The sec-
ond condition reflects business logic and is unique for each query. In
a good query language they should be at least separated and, ideally,
the join condition should be modularized so that it does not appear in
explicit form in each query. This problem can be partially solved by
using a dedicated JOIN clause for connectivity and WHERE clause for
domain-specific conditions. However, this use is optional and the join
condition will still be repeated for each and every query.
The mechanism of foreign and primary keys could help in hiding
the structure of references at the level of schema. Once a foreign key
A. Savinov
has been declared, it is then enough to specify its name instead of
enumerating all the columns it (and the corresponding primary key)
is composed of. However, foreign keys do not solve the problem of
joins at the level of queries because we still have to write them as
some condition within WHERE or JOIN clause along with other domain-
specific conditions. Another possible solution consists in defining user-
defined types (UDT) in the case of complex primary keys and the
corresponding foreign keys. Here again, UDTs allow us to simplify join
conditions but do not eliminate them completely so that all queries
have to specify how two or more tables have to be joined.
In contrast to joins, the logic of conventional references and refer-
encing is completely hidden so that we see only what has to be retrieved
and not how it has to be done. Business logic is effectively separated
from the mechanism of implementing references. For example, given
an employee we can get the department name by using dot notation:
Here we see neither the structure of references nor the conditions used
to match the objects. References can be implemented as 64-bit integers,
character strings or more complex structures. Matching related objects
could be implemented via look up tables or more complex indexes but
these details are also not visible in the access statement. The benefit is
that if the structure of references and connections between departments
and employees changes then this line of code will still work without any
modifications because it does not involve any details of how employees,
departments and other objects are connected.
The question is then why not to use references instead of translating
them into the representation via joins? One problem is that references
need identities to be explicitly declared in referenced elements and ref-
erencing attributes have to be appropriately typed. Only in this case
the reference structure can be hidden. This problem can be solved by
adopting the mechanism of primary keys for identification and foreign
keys for typing referencing attributes. One difficulty with this solu-
tion is that primary keys are not true references (they are identifying
attributes [10]) and also they are optional. A more serious problem
is that references cannot be applied to sets while joins are inherently
References and arrow notation . . .
set-oriented. Indeed, if we apply dot notation to sets then what kind
of result should be returned by such expressions?
The solution is provided by introducing COM concepts. First, they
provide a mechanism for defining domain-specific references which are
used instead of primary keys. Once a concept has been defined, it is
used as a type of attributes in other concepts by replacing the mecha-
nism of foreign keys. Thus COM references combine features of primary
keys (which are not references) and object-oriented (true) references.
For example, the structure of departments and employees can be de-
clared as follows:
CONCEPT Departments
IDENTITY // True reference
INT dept
CHAR(64) name
CONCEPT Employees
IDENTITY // True reference
INT emp
Departments dept
Note that the last line does not expose the structure of connection, that
is, how employees are connected to departments. If the department
identity changes then all other attributes referencing it will not be
Concepts not only allow us to remove the structure of references
from schema but also remove it from set-oriented queries by using arrow
notation. For example, all departments for a set of employees in one
country can be retrieved as follows:
(Employees | country = ’MD’)
-> dept -> (Departments)
This roughly corresponds to the following instance-based query using
dot notation:
References can also be followed in the opposite direction by means
A. Savinov
of de-projection operation. For example, all employees of a set of de-
partments located in one country is found as follows:
(Departments | country = ’MD’)
<- dept <- (Employees)
Operations of projection and de-projection can be applied consec-
utively and many fragments can be omitted because they can be easily
reconstructed from the schema. Thus rather complex queries involv-
ing many tables with numerous joins can be written in a very simple
and natural form [6]. What is more important, these queries are set-
oriented and do not expose the structure of connections.
3.2 Semantics
One problem of joins is that they appear only at the level of queries and
the database is unaware of possible and meaningful joins at the level of
the model. For that reason join can be characterized as an application-
specific operation. Every new application can issue its own query with
arbitrary joins. On one hand, it is an advantage because applications
are not restricted in the use of data and can do whatever they need.
However, if the meaning and consistency of results is important, it
is a drawback because arbitrary joins lead to arbitrary results. The
database is unaware of what operations are meaningful and therefore
cannot restrict applications from producing meaningless results. For
instance, the database is not able to prevent an application or user
from joining integer department ids with the number of product items
which is obviously a meaningless operation. From the performance
point of view, it is also a disadvantage because the database engine
is not able to optimize its operations for executing predefined joins
declared at the level of schema.
From this point of view, joins are somewhat analogous to the goto
operator in programming which also ignores the program structure and
provides the possibility to organize arbitrary control flow. It was clearly
shown that such style of programming without any constraints is harm-
ful [2] because goto not only ignores the semantics behind program
structure but also the compiler is not able to restrict programmers
References and arrow notation . . .
from making errors. The freedom in using joins has the same effect:
the database is not able to restrict users and applications from issu-
ing meaningless queries and cannot restrict them from making errors.
The mechanism of joins essentially assumes that the meaning of data
is described at the level of queries rather than in the model structure.
In particular, by looking at queries we can get more information about
data semantics then by looking at the schema. One way to overcome
this problem is to use foreign keys which can be viewed as a way to
declare what is meaningful in the database. Yet, this mechanism has
significant limitations when used in queries and should be viewed as a
Since join is a low level operation, it can be used to implement many
different patterns which are difficult to reconstruct from the query. For
example, the join condition WHERE (where Aand Bare
two tables) says almost nothing about the real intention of the query.
We do not know whether table Areferences table Bor maybe it is not
about referencing at all. We do not know whether the purpose of this
query is to build a multidimensional space for OLAP analysis or to find
related records connected via some relationships. And if this operation
uses a relationship then is it containment or general-specific? Join is
not an operation which can be easily semantically interpreted. Given a
join we cannot say what kind of semantic relation it represents and how
the joined elements are related. On the other hand, assume that we
want to use existing relationships in the model. How should we join the
tables in order to represent them in the query? The answer is not clear
because the translation procedure is ambiguous and does not cover all
possible situations. This problem has been studied in semantic data
models [3, 5] but these models focus more on conceptual representation
issues and less on query languages. Although many operations can be
expressed at conceptual level, joins cannot be removed completely just
because the lower logical level of the model is supposed to always exist.
COM allows us to remove the gap between low level join and high
level query semantics because it is also a conceptual model with main
constructs having some semantics behind them. In particular, refer-
ences in COM are not simply a means of connectivity but rather a way
A. Savinov
to represent semantics. More specifically, references in COM have the
following semantic interpretations [9, 10]:
General-specific A referenced element is more general than the refer-
encing (more specific) element. For example, if table Products
references table Categories then products are more specific
elements than their categories.
Containment A referenced element is interpreted as a container
where the referencing element exists. For example, if an employee
record references a department then this employee is supposed to
be included in this department as one of its elements.
Relationships An element referencing other elements is interpreted as
a relationship between them. For example, if a marriage record
references two persons then it is interpreted as a relationship
between them.
Multidimensional An element referencing other elements is inter-
preted as a point while the referenced elements are its coordi-
nates. For example, since sales record references a product item
and its price, this sale is considered a point while its characteris-
tics are coordinates along some axes.
According to this interpretations, projection operation applied to
a set means getting more general elements, containing elements, de-
pendent elements (connected via this relationship) and coordinates for
these elements. And de-projection has the opposite meaning by pro-
ducing more specific elements, members of a container, relationships
and points with these coordinates. As a result, references are used not
only for navigating through a graph but rather for semantic naviga-
tion. This makes queries much more semantically rich and much easier
to write and understand. For example, projecting a set of employees
to departments means getting containers for employees because a de-
partment is interpreted as a container for a set of employees. At the
same time, a department can be treated as a coordinate for employees
which are points in a multidimensional space.
References and arrow notation . . .
3.3 Common value approach
There is one pattern which cannot be modeled by references, namely,
the original common value approach directly supported by join oper-
ation. This pattern cannot be ignored because in many cases it is
precisely what needs to be done. The common value pattern has its
own value and the question is how it can be implemented by means of
references without joins. For example, if it is necessary to find depart-
ments and employees having the same location then it is not clear how
it can be done without join operation.
This task can be solved by using product operation which takes
two or more collections as input and returns all combinations of their
elements as a result collection. In COQL, input collections along with
their instance variables are written in parentheses (instance variables
are analogous to table aliases in SQL). For example, all combinations
of departments and employees are built as follows:
(Departments D, Employees E)
If we need to return records having some common value then this con-
dition is specified as an additional constraint:
(Departments D, Employees E | =
Obviously, it is very similar to how join operation works:
SELECT D.*, E.*FROM Departments D, Employees E
So the question is why COM is better. The difference is that product
in COM is used exclusively to produce combinations of elements. In
particular, it is not used for referencing and navigation purposes. Its
typical application is in data analysis where it is necessary to produce a
multidimensional cube. For that reason, queries in COM much easier to
interpret because the purpose of operations is clearer: arrow notation
is used for set-based navigation while product is used to build multidi-
mensional space with combinations of records. In other words, COM
reflects the real purpose of each operation. Also, product in COM is
more general because there is no difference between value domains and
relations (see [10], Section 2, for more information). In particular, it is
A. Savinov
possible to use any common collection rather than only direct domains
of two relations. The following query retrieves all employees who live
in the city where their department is located:
(Employees E | =
Here we do not use product operation at all although its relational
analogue would require joining two tables. The next query finds a set
of departments which have at least one employee living in a different
city than this department location:
(Employees E | !=
-> dept -> (Departments)
Again, here we do not use product operation but still can do what
would require joining in SQL.
Since product operation constrained by some common values is a
quite frequent pattern, it can be simplified and generalized. Instead of
explicitly specifying a condition the combined elements have to satisfy,
it is easier to just specify a common greater collection for the input
collections. The paths from the input collections to this common col-
lection are then reconstructed automatically from the schema. (In the
case of multiple alternative paths, the condition has to be specified
explicitly.) For example, the query
(Departments, Employees | (Cities) )
returns all combinations of departments and employees which have the
same city where Cities is their common greater collection. Note that
Cities need not be a direct greater collection and a longer path can
lead from the input collections to the Cities collection.
An interesting use of product operation restricted by common val-
ues consists in implementing inference which is a procedure where con-
straints can be automatically propagated through the model [7]. For
example, assume that we want to relate departments and employees
by the city they are located in. The final goal is to impose constraints
on departments and then automatically find employees living in these
cities (by ignoring departments people work in). Inference is always
performed via some common lesser collection. In our example it is de-
fined as a product of employees and departments with the condition
References and arrow notation . . .
that they have to belong to the same city. Inference consists of two
steps: first de-project to the common lesser collection and then project
to the target collection:
(Departments | name = ’HR’)
<- (Departments D, Employees E | (Cities) )
-> (Employees)
Note how simple and natural this query is. It specifies only collection
names and has no indication how they have to be joined. Even if it
is necessary to specify connections, they are specified as paths rather
than explicit joins. If the schema changes and the collections will be
connected differently then in many cases this query will still work.
4 Conclusion
In this paper we have provided a critical analysis of join operation and
its use for data querying and retrieving related elements. Although
join is an extremely powerful operation which makes it possible to
dynamically (at the level of query) relate arbitrary tuples and retrieve
quite complex result sets it has several major problems:
Join is not appropriate for implementing references which is one
of its main uses and one of the main data modeling mechanisms.
Join exposes the details of reference implementation and is a
cross-cutting concern of query languages which cannot be eas-
ily modularized.
Join is not appropriate for representing semantics behind the
higher level operation or pattern it implements. From join struc-
ture, it is quite difficult to understand what kind of relationship is
used in this query. Joins do not reflect their purpose and cannot
be unambiguously interpreted from the point of view of business
purpose of the query.
Of course, these are not absolute flaws but rather consequences of
the low level character of this operation which makes it inappropriate
for domain-specific queries in general purpose query languages where
A. Savinov
the criteria of simplicity, closeness to the domain concepts, structural
and semantic consistency are of primary importance. Therefore, joins
not only require high expertise but also can easily result in semantic
bugs which are very difficult to find.
Data access via references and dot notation does not have the prob-
lems of join – it is more intuitive, much easier to use and more reli-
able. Yet, this approach is intended for manipulating instances rather
than sets and therefore its benefits in the context of query languages
are very limited. To overcome these limitations, we proposed to use
generalized references and arrow notation as they are defined in the
concept-oriented model. This new representation and access method
allows us to combine set-orientation of joins with the simplicity and
naturalness of references. The use of generalized references and arrow
notation instead of join will result in simpler queries, more natural and
structured model design, less errors and higher productivity in query
[1] E.Codd. A Relational Model for Large Shared Data Banks. Com-
munications of the ACM, 13(6): 377–387, 1970.
[2] E.W.Dijkstra. Go To Statement Considered Harmful. Communi-
cations of the ACM, 11(3): 147–148, 1968.
[3] R.Hull, R.King. Semantic database modeling: survey, applica-
tions, and research issues. ACM Computing Surveys (CSUR),
19(3): 201–260, 1987.
[4] G.Kiczales, J.Lamping, A.Mendhekar, C.Maeda, C.Lopes, J.-
M.Loingtier, J.Irwin. Aspect-Oriented Programming. ECOOP’97,
LNCS 1241: 220–242, 1997.
[5] J.Peckham, F.Maryanski. Semantic data models. ACM Computing
Surveys (CSUR), 20(3): 153–189, 1988.
References and arrow notation . . .
[6] A.Savinov. Logical Navigation in the Concept-Oriented Data
Model. Journal of Conceptual Modeling, Issue 36, 2005.
[7] A.Savinov. Query by Constraint Propagation in the Concept-
Oriented Data Model. Computer Science Journal of Moldova,
14(2): 219–238, 2006.
[8] A.Savinov. Concept-Oriented Query Language for Data Model-
ing and Analysis. Advanced Database Query Systems: Techniques,
Applications and Technologies, L.Yan, Z.Ma (Eds.), IGI Global,
2010, 85–101.
[9] A.Savinov. Concept-Oriented Model: Extending Objects with
Identity, Hierarchies and Semantics. Computer Science Journal
of Moldova, 19(3): 254–287, 2011.
[10] A.Savinov. Concept-Oriented Model: Classes, Hierarchies and
References Revisited. Journal of Emerging Trends in Computing
and Information Sciences, 3(4): 456–470, 2012.
[11] J.D.Ullman, C.Zaniolo. Deductive databases: achievements and
future directions. ACM SIGMOD Record, 19(4): 75–82. 1990.
Alexandr Savinov, Received June 28, 2012
SAP Research Dresden,
Chemnitzer Str. 48,
01187 Dresden, Germany
Home page: http ://
... The concept-oriented query language (COQL) (Savinov, 2014a;2012a;2011b) is a syntactic embodiment of COM. ConceptMix uses a modified version of this language, called the concept-oriented expression language (COEL), the purpose of which is similar to that of the Microsoft Data Analysis Expressions (DAX) (Russo, Ferrari & Webb, 2012). ...
... Formally, a new set will contain all output values of one function evaluated for the input values of the source set. Project and de-project are two operations of the novel arrow notation (Savinov, 2012a). It is analogous to the conventional dot notation with the difference that there are two operations (project and de-project) and these operations are applied to sets rather than to instances. ...
... Also, an aggregated column could be part of other expressions. An alternative way to define aggregation is using de-projection (Savinov, 2012a). ...
Conference Paper
Full-text available
Data integration as well as other data wrangling tasks account for a great deal of the difficulties in data analysis and frequently constitute the most tedious part of the overall analysis process. We describe a new system, ConceptMix, which radically simplifies analytical data integration for a broad range of non-IT users who do not possess deep knowledge in mathematics or statistics. ConceptMix relies on a novel unified data model, called the concept-oriented model (COM), which provides formal background for its functionality.
... It has numerous advantages but probably the most important one in the context of data management is that join is an operation on sets. It also has quite significant drawbacks (Savinov, 2012a): join is a low level and error-prone operation which requires high expertise (Atzeni, 2013), joins can easily produce meaningless results because it lacks semantics, join exposes the mechanics of connectivity at the level of business logic and is a typical cross-cutting concern, join is not analytics-friendly. ...
... Logical Navigation and Arrow Notation COQL uses arrow notation (Savinov, 2012a) for navigating through the data structure and retrieving sets of elements. Arrow notation can be viewed as a set-oriented analogue of the conventional dot notation. ...
... Inverse arrow is opposite to dot notation and we use it [13,14] because dot symbol does not have an inversion. ...
Full-text available
We describe a new logical data model, called the concept-oriented model (COM). It uses mathematical functions as first-class constructs for data representation and data processing as opposed to using exclusively sets in conventional set-oriented models. Functions and function composition are used as primary semantic units for describing data connectivity instead of relations and relation composition (join), respectively. Grouping and aggregation are also performed by using (accumulate) functions providing an alternative to group-by and reduce operations. This model was implemented in an open source data processing toolkit examples of which are used to illustrate the model and its operations. The main benefit of this model is that typical data processing tasks become simpler and more natural when using functions in comparison to adopting sets and set operations.
... In most cases however, joins are used to retrieve elements connected via references but if the mechanism of referencing is not inherently supported (as it is in RM) then the only way is to implement it manually at the level of queries by means of joins. The use of joins for implementing references has significant drawbacks [23,26]: (i) it is a low level and error-prone operation [2], (ii) it lacks semantics by easily producing wrong results because the intention of the operation is not obvious from its syntax, (iii) it is a cross-cutting concern because the same fragments are explicitly written in many queries, (iv) joins describe the logic of connectivity which appears at the same level as the domain-specific logic of retrieval, (v) joins are not analytics friendly. Therefore, COM does not encourage the use of product operation with filters for that purpose. ...
Full-text available
The plethora of existing data models and specific data modeling techniques is not only confusing but leads to complex, eclectic and inefficient designs of systems for data management and analytics. The main goal of this paper is to describe a unified approach to data modeling, called the concept-oriented model (COM), by using functions as a basis for its formalization. COM tries to answer the question what is data and to rethink basic assumptions underlying this and related notions. Its main goal is to unify major existing views on data (generality), using only a few main notions (simplicity) which are very close to how data is used in real life (naturalness).
... Switching from tables to columns is a highly non-trivial task. In particular, it is necessary to get rid of such operators like join (Savinov, 2012a) and group-by because they are inherently set-oriented operators. To solve these problems, DataCommandr relies on a novel concept-oriented model of data (Savinov, 2014b;Savinov, 2012b) which provides the necessary theoretical basis. ...
Conference Paper
Full-text available
In this paper, we describe a novel approach to data integration, transformation and analysis, called DataCommandr. Its main distinguishing feature is that it is based on operations with columns rather than operations with tables in the relational model or operations with cells in spreadsheet applications. This data processing model is free of such typical set operations like join, group-by or map-reduce which are difficult to comprehend and slow at run time. Due to this ability to easily describe rather complex transformations and high performance on analytical workflows, this approach can be viewed as an alternative to existing technologies in the area of ad-hoc and agile data analysis.
... (So parentheses are like SQL FROM and bar symbol is like SQL WHERE.) Projection operation is denoted by right arrow '->' followed by a dimension name which is followed by the target collection (Savinov, 2012d). In the database schema, projection means moving up to the domain of the specified dimension (Fig. 9). ...
Full-text available
In spite of its fundamental importance, inference has not been an inherent function of multidimensional models and analytical applications. These models are mainly aimed at numeric (quantitative) analysis where the notions of inference and semantics are not well defined. In this paper we argue that inference can be and should be integral part of multidimensional data models and analytical applications. It is demonstrated how inference can be defined using only multidimensional terms like axes and coordinates as opposed to using logic-based approaches. We propose a novel approach to inference in multidimensional space based on the concept-oriented model of data and introduce elementary operations which are then used to define constraint propagation and inference procedures. We describe a query language with inference operator and demonstrate its usefulness in solving complex analytical tasks.
... Formally, nothing prevents COM from using joins. However, joins have several negative properties (Savinov, 2012d):  Join is a low level operation describing how to get a result rather than what result has to be produced. ...
Technical Report
Full-text available
Concept-oriented model of data (COM) has been recently defined syntactically by means of the concept-oriented query language (COQL). In this paper we propose a formal embodiment of this model, called nested partially ordered sets (nested posets), and demonstrate how it is connected with its syntactic counterpart. Nested poset is a novel formal construct that can be viewed either as a nested set with partial order relation established on its elements or as a conventional poset where elements can themselves be posets. An element of a nested poset is defined as a couple consisting of one identity tuple and one entity tuple. We formally define main operations on nested posets and demonstrate their usefulness in solving typical data management and analysis tasks such as logic navigation, constraint propagation, inference and multidimensional analysis.
Full-text available
The paper describes logical navigation in the concept-oriented data model. This model explicitly and formally separates physical structure and logical structure so that each element of the model is simultaneously a collection and a combination of other elements. The physical structure is used to representing and access by elements by means of references. The logical structure is used to reflect the problem domain dependencies. The two-level model considered in the paper consists of a set of concepts and a set of items. Concept structure defines the model syntax while item structure defines its semantics. In the paper it is shown how the properties of the model can be used for logical navigation where we do not need to specify join conditions or other complicated parameters of queries.
Full-text available
We present the concept-oriented model (COM) and demonstrate how its three main structural principles — duality, inclusion and partial order — naturally account for various typical data modeling issues. We argue that elements should be modeled as identity-entity couples and describe how a novel data modeling construct, called concept, can be used to model simultaneously two orthogonal branches: identity modeling and entity modeling. We show that it is enough to have one relation, called inclusion, to model value extension, hierarchical address spaces (via reference extension), inheritance and containment. We also demonstrate how partial order relation represented by references can be used for modeling multidimensional schemas, containment and domain-specific relationships.
Full-text available
Semantic data models have emerged from a requirement for more expressive conceptual data models. Current generation data models lack direct support for relationships, data abstraction, inheritance, constraints, unstructured objects, and the dynamic properties of an application. Although the need for data models with richer semantics is widely recognized, no single approach has won general acceptance. This paper describes the generic properties of semantic data models and presents a representative selection of models that have been proposed since the mid-1970s. In addition to explaining the features of the individual models, guidelines are offered for the comparison of models. The paper concludes with a discussion of future directions in the area of conceptual data modeling.
Full-text available
The paper describes an approach to query processing in the concept-oriented data model. This approach is based on imposing constraints and specifying the result type. The constraints are then automatically propagated over the model and the result contains all related data items. The simplest constraint propagation strategy consists of two steps: propagating down to the most specific level using de-projection and propagating up to the target concept using projection. A more complex strategy described in the paper may consist of many de-projection/projection steps passing through some intermediate concepts. An advantage of the described query mechanism is that it does not need any join conditions because it uses the structure of the model for propagation. Moreover, this mechanism does not require specifying an access path using dimension names. Thus even rather complex queries can be expressed in simple and natural form because they are expressed by specifying what information is available and what related data we want to get.
Full-text available
The concept-oriented data model (COM) is an emerging approach to data modeling which is based on three novel principles: duality, inclusion and order. These three structural principles provide a basis for modeling domain-specific identities, object hierarchies and data semantics. In this paper these core principles of COM are presented from the point of view of object data models (ODM). We describe the main data modeling construct, called concept, as well as two relations in which it participates: inclusion and partial order. Concepts generalize conventional classes by extending them with identity class. Inclusion relation generalizes inheritance by making objects elements of a hierarchy. We discuss what partial order is needed for and how it is used to solve typical data analysis tasks like logical navigation, multidimensional analysis and reasoning about data.
To many people, Dijkstra's letter to the Editor of Communications of the A CM, published in March 1968, marks the true beginning of structured programming. That it influenced the industry is clear, if for no other reason than for the articles it spawned, ranging from "IF-THEN-ELSE Considered Harmful," to "The Else Must Go, Too," to "Programming Considered Harmful." In form and content, Dijkstra's letter is similar to his 1965 paper, which appears first in this collection. Description of the inverse relationship between a programmer's ability and the density of goto statements in his program is repeated, as is the emphasis on the limited ability of the human brain. Much of the discussion is somewhat theoretical in nature, and the typical COBOL programmer will hunger for some coding examples so that he can see why goto statements make program logic harder to understand. Echoing his 1965 paper, the last few paragraphs underscore once again why the subject of structured programming stayed out of the mainstream of the data processing industry for so long. As Dijkstra points out, goto statements were a subject of discussion among academicians as far back as 1959. But even today, people whom Dijkstra acknowledges --- names like Wirth, Hoare, Strachey, and Landin --- are not well known to business-oriented or scientificoriented programmers, so it should be no surprise that their ideas have languished for so many years.
In the paper we describe a novel query language, called the concept-oriented query language (COQL), and demonstrate how it can be used for data modeling and analysis. The query language is based on a novel construct, called concept, and two relations between concepts, inclusion and partial order. Concepts generalize conventional classes and are used for describing domain-specific identities. Inclusion relation generalized inheritance and is used for describing hierarchical address spaces. Partial order among concepts is used to define two main operations: projection and de-projection. We demonstrate how these constructs are used to solve typical tasks in data modeling and analysis such as logical navigation, multidimensional analysis and inference.
Most common database management systems represent information in a simple record-based format. Semantic modeling provides richer data structuring capabilities for database applications. In particular, research in this area has articulated a number of constructs that provide mechanisms for representing structurally complex interrelations among data typically arising in commercial applications. In general terms, semantic modeling complements work on knowledge representation (in artificial intelligence) and on the new generation of database models based on the object-oriented paradigm of programming languages. This paper presents an in-depth discussion of semantic data modeling. It reviews the philosophical motivations of semantic models, including the need for high-level modeling abstractions and the reduction of semantic overloading of data type constructors. It then provides a tutorial introduction to the primary components of semantic models, which are the explicit representation of objects, attributes of and relationships among objects, type constructors for building complex types, ISA relationships, and derived schema components. Next, a survey of the prominent semantic models in the literature is presented. Further, since a broad area of research has developed around semantic modeling, a number of related topics based on these models are discussed, including data languages, graphical interfaces, theoretical investigations, and physical implementation strategies.
For a number of years I have been familiar with the observation that the quality of programmers is a decreasing function of the density of go to statements in the programs they produce. More recently I discovered why the use of the go to statement has such disastrous effects, and I became convinced that the go to statement should be abolished from all “higher level” programming languages (i.e. everything except, perhaps, plain machine code). At that time I did not attach too much importance to this discovery; I now submit my considerations for publication because in very recent discussions in which the subject turned up, I have been urged to do so.
In the recent years, Deductive Databases have been the focus of intense research, which has brought dramatic advances in theory, systems and applications. A salient feature of deductive databases is their capability of supporting a declarative, rule-based style of expressing queries and applications on databases. As such, they find applications in disparate areas, such as knowledge mining from databases, and computer-aided design and manufacturing systems.In this paper, we briefly review the key concepts behind deductive databases and their newly developed enabling technology. Then, we describe current research on extending the functionality and usability of deductive databases and on providing a synthesis of deductive databases with procedural and object-oriented approaches.