Concept-oriented model:
Modeling and processing data using functions
Alexandr Savinov
http://conceptoriented.org
17.11.2019
ABSTRACT
We describe a new logical data model, called the concept-
oriented model (COM). It uses mathematical functions as first-
class constructs for data representation and data processing as
opposed to using exclusively sets in conventional set-oriented
models. Functions and function composition are used as primary
semantic units for describing data connectivity instead of
relations and relation composition (join), respectively. Grouping
and aggregation are also performed by using (accumulate)
functions providing an alternative to group-by and reduce
operations. This model was implemented in an open source data
processing toolkit, examples of which are used to illustrate the
model and its operations. The main benefit of this model is that
typical data processing tasks become simpler and more natural
when using functions in comparison to adopting sets and set
operations.
KEYWORDS
Logical data models; Functional data models; Data processing
1 Introduction
1.1 Who Is to Blame?
Most of the currently existing data models, query
languages and data processing frameworks including SQL
and MapReduce use mathematical sets for data
representation and set operations for data transformations.
They describe a data processing task as a graph of
operations with sets. Deriving new data means producing
new sets from existing sets where sets can be implemented
as relational tables, collections, key-value maps, data
frames or similar structures.
However, many conventional data processing patterns
describe a data processing task as deriving new properties
rather than sets where properties can be implemented as
columns, attributes, fields or similar constructs. If
properties are represented via mathematical functions then
this means that they are main units of data representation
and transformation. Below we describe several typical
tasks and show that solving them by means of set
operations is a problem-solution mismatch, which makes
data modeling and data processing less natural, more
complex and error-prone.
Figure 1: Example data model. The Items table (left) has existing columns ProductId, Quantity and Price, and the Products table (right) has existing columns Id and Price; derived columns Amount (calculate), Product (link), TotalQ and TotalA (aggregate) are added to these tables.
Calculated attributes. Assume that there is a table with
order Items characterized by Quantity and Price
attributes (Fig. 1, left). The task is to compute a new
attribute Amount as their arithmetic product. A solution in
SQL is almost obvious:
SELECT *, Quantity * Price AS Amount (1)
FROM Items
Although this standard solution seems very natural and
almost trivial, it does have one subtle flaw: the task was to
compute a new attribute while this query produces a new
table. Then the question is why not do exactly what has
been requested by producing a new attribute? Why is it
necessary to produce a new table (with a new attribute) if
we actually want to attach a new attribute to the existing
table? A short answer is that such an operation for adding
new (derived) attributes simply does not exist. We simply
have no choice and must adopt what is available – a set
operation.
Link attributes. Another generic data processing pattern
consists in computing links (or references) between tables:
given a record in one table, how can we access attributes
of related records in another table? For example, assume
that Price is an attribute of a second Products table
(Fig. 1, right), and it does not exist as an attribute of the
Items table. We have two tables, Items and Products,
with attributes ProductId and Id, respectively, which
relate their records. If now we want to compute the Amount
for each item then the price needs to be retrieved from the
second Products table. This task can be easily solved by
copying the necessary attributes into a new table using the
relational (left) join:
SELECT i.*, p.Price (2)
FROM Items i
JOIN Products p
ON i.ProductId = p.Id
This new result table has the necessary attributes
Quantity and Price copied from two source tables and
hence it can be used for computing the amount. Yet, let us
again compare this solution with the problem formulation.
Do we really need a new table? No. Our goal was to have
a possibility to access attributes of the second Products
table (while computing a new attribute in the first Items
table). Hence, it again can be viewed as a workaround and
forced solution where a new (unnecessary) table is
produced just because it is the only way to access related
data in this set-oriented model.
Aggregated attributes. The next typical data processing
task is data aggregation. Assume that for each product in
Products, we want to compute the total number of items
ordered (Fig. 1). Group-by operation provides a standard
solution:
SELECT ProductId, SUM(i.Quantity) AS TotalQ
FROM Items i
GROUP BY ProductId (3)
Again, we produce a new table although the real goal was
adding a new (aggregated) attribute to the Products table.
Our intention was to make TotalQ equivalent to all other
attributes in the Products table so that it could be used for
computing other product properties. Apparently, this also
could be done in SQL but then we would have to apply
join to combine the group-by result (3) with the original
Products table to bring all attributes into one table like (2)
followed by yet another set operation like (1) for
calculating new attributes.
In all these examples, the problem formulation does not
mention and does not actually require any new table. Yet,
the applied data processing model provides only set
operations, which means that it is a problem-solution
mismatch. The necessity to adapt set operations for the
task of defining and adding new attributes is not a problem
of only SQL or the relational model (RM) [2]: it exists in
all models and frameworks, which rely on set operations
for data processing. In particular, we can see it in
MapReduce [4] where map and reduce operations always
produce new collections even if the goal is to compute a
new object field or a new aggregated property,
respectively. In this situation, there is no choice: we must
use sets for all kinds of data operations even when they do
not match the problem at hand and no sets need to be
produced at all.
Adopting set operations for deriving new attributes has
quite significant negative consequences at different levels
of data organization. If multiple tables are being processed
then we can easily get a conceptual mess: many different
types of joins (inner, left, right, full), nested joins and
intermediately computed and aggregated attributes in these
tables all packed in one SQL statement.
1.2 What Is to Be Done?
A general solution to this problem consists in introducing
a column-oriented data model providing operations for
directly manipulating columns without changing the tables these columns belong to and hence doing precisely
what is required: adding derived attributes to existing
tables. Below, for illustration purposes, we show how such
a hypothetical data model could be applied to the
examples described before.
Adding a new calculated attribute could be done as
follows (compare it to (1)):
CREATE ATTRIBUTE Amount
FROM Items TO Double // Mapping
AS Quantity * Price // Definition
A solution to the problem of accessing data in related
tables is well known: it is based on introducing link or
reference attributes, which can be then accessed using dot
notation supported by queries. Such attributes contain
values, which provide access to records in other tables. In
our example, we need to define a new attribute in the
Items table, which references records in the Products
table (compare it to (2)):
CREATE ATTRIBUTE Product
FROM Items TO Products // Mapping
AS ProductId == Id // Definition
Now we can easily compute Amount using dot notation even
though Price belongs to the Products table:
CREATE ATTRIBUTE Amount
FROM Items TO Double // Mapping
AS Quantity * Product.Price // Definition
What we have achieved here is simple and natural
semantics of links: everybody understands what a link is
and how to use it via dot notation. We also separated two different concerns, link definition and link usage, which are now in two statements, making it easier to maintain this code: if we later change how Items and Products are related, then the way the amount is computed need not be changed.
It is also possible to add an attribute, which will
compute its values from subsets of records in another
table. Such a query in our pseudo code could look as
follows:
CREATE ATTRIBUTE TotalA
FROM Products TO Double // Mapping
AS SUM(Items.Amount) // How to aggregate
GROUP Items.Product // How to group
In contrast to calculate and link attributes described above,
this attribute computes its values by aggregating data
stored in another table. It relies on the previously defined
link attribute Product for grouping but does not include its
definition. Therefore, the grouping condition can be easily
changed later independently without modifying other
queries that use it.
These queries demonstrate the central idea behind our
approach: we define a new attribute in an existing table
instead of defining new unnecessary tables.
1.3 Contributions and Outline
Of course, there are tools, patterns and best practices,
which can significantly help in writing such queries and
data processing scripts, for instance, by translating them
into SQL, MapReduce or another conventional set-
oriented language. Yet, we argue that the demonstrated
problem-solution mismatch is not a minor drawback, but
rather a major problem caused by the application of wrong
tools and the absence of right methods. Our goal therefore
is not to fix conceptual problems of one layer by
introducing yet another layer of complexity. It consists in
finding a principled solution by developing a new data
model, which can solve such tasks directly without the
need to adapt inappropriate mechanisms.
In this paper, we describe a data model, called the
concept-oriented model (COM), which is intended for
representing and processing data using mathematical
functions as opposed to using only sets and set operations
in existing set-oriented data models and data processing
frameworks. COM is able to manipulate functions as first-
class elements. For example, Amount, Product and TotalQ
in COM are (derived) functions and no new sets will be
produced during inference.
COM radically changes the way we think of data by
significantly strengthening the role of functions. In many
(but not all) cases, it is possible to represent and process
data by using only functions without changing the sets. In
particular, COM has the following two important
properties:
Data can be stored in functions in the same way as it
can be stored in sets. In particular, there can be two
different databases, which have the same sets but
different functions.
Deriving (inferring) data in COM means computing
new (mathematical) functions as opposed to
producing new sets. In particular, a COM query may
well produce a function (by processing data in other
functions) rather than a set.
The idea of using functions for data modeling is not
new and this branch has a long history of research starting
from [6, 16]. COM can be viewed as a further
development of the functional data modeling paradigm.
Our main contribution in this context is that functions are
made first-class elements of the logical model with the
same status as sets. The existing functional models are
either conceptual models (while COM is a logical data
model) or heavily rely on set-oriented operations. They
essentially extend the scope of a set-oriented model rather
than providing a major alternative to set orientation.
Conventional functional models emphasize that functions
are important, should not be ignored and in many cases
make data modeling easier (especially at conceptual level)
but data management (at logical level) is still done mainly
using sets and set operations like join and group-by.
Significantly strengthening the role of functions and
making them first-class elements of the data model allows
us to rethink the role of sets in data modeling. Formally,
the role of sets is weakened because many tasks can be
and should be solved by using functions. However, this
weakening leads to significant simplification of the model
as a whole. It can be viewed as a return to the original
treatment of sets as collections of unique tuples by
removing many complex and sometimes controversial
mechanisms arising from the necessity to use them for
other purposes. In particular, COM does not need the
following features: separation of relations and domains,
the need for primary keys and foreign keys,
adopting set operations for performing calculations,
aggregations and linking. All these mechanisms are now
replaced by one formal construct, function, which makes
the model simpler and more natural.
Such a simplification by reducing data modeling and
processing to only two basic constructs - sets and
functions treated in their original mathematical
sense - would not be possible without rethinking some
fundamental principles. In particular, we describe a
functional alternative to describing data connectivity.
COM assumes that two data elements are connected if
there exists a function, which maps one of them to the
other. It is opposed to the relational principle that data
elements are connected if there exists a tuple in some set,
which includes them as constituents. Obviously, these are
two fundamentally different assumptions. Accordingly,
COM assumes that connectivity is derived using function
composition as opposed to relation composition (join).
The main general benefit of COM is that it does
precisely what is requested: it allows us to define derived
attributes without unnecessarily producing new tables.
In summary, our contributions are as follows:
We argue that having only sets is not enough for data
modeling and data processing and describe a new data
model, which makes functions first-class elements of
the model. Both sets and functions are equally used
for data representation and data processing.
We demonstrate how functions can represent the
semantics of connectivity and how function
composition can be used to derive new connections.
This provides an alternative to the relational principles
where relations are used for connectivity and relation
composition (join) is used for inference.
We describe how operations with functions can be
used to solve some typical data processing tasks like
computing new properties, data linking and data
aggregation.
We describe an open source framework, which is
based on this data model and can be viewed as a
functional alternative to MapReduce and other similar
set-oriented languages and frameworks.
This paper focuses only on the logical level of data
modeling and does not discuss any conceptual or physical
aspects of data management. In particular, the column-
orientation in the paper does not relate to column stores
(even though the implementation uses columnar format for
storing data). We describe the Bistro toolkit (https://github.com/asavinov) only to
illustrate one possible implementation of COM and do not
discuss such (important) aspects as physical data
organization, dependencies and topology of operations,
incremental evaluation, optimization of function
evaluation etc. This approach to data modeling and
processing was also used for self-service data integration
and analysis [11,12].
Note also that the above examples were provided using
an SQL-like pseudo code to make it easier to comprehend
the main motivation behind this research. The open source
framework we describe is implemented differently and is
closer to how MapReduce works where data processing
logic is described programmatically as a graph of
operations. In our code examples, we follow the
convention that lower case identifiers like product denote
(Java) objects while upper case identifiers like Product
refer to (column and table) names.
The paper is organized as follows. In Section 2, we
introduce sets and describe how a purely set-oriented
model can be used for data modeling by emphasizing the
arising problems. In Section 3, we introduce functions and
describe how they can be used for data modeling by
solving the problems arising in a purely set-oriented
approach. In Section 4, we describe operations with
functions and demonstrate how they can be used for data
processing. Section 5 describes how COM modifies set
operations. Section 6 provides a summary, concluding
remarks and outlook for future research.
2 Sets for Data Modeling
2.1 Sets and Values
In the Concept-Oriented Model (COM), the main unit of
data is a value. Values can be only copied and there is no
possibility to represent them indirectly via other values or
share them. Examples of values are numbers like 45.67 or
letters like ‘b’ represented using an appropriate encoding
convention.
Any data value is supposed to have some structure.
Values the structure of which is hidden or ignored are
referred to as primitive values. Values with an explicitly
declared structure are referred to as complex values.
Complex values are made up of the copies of other values
and this composition is formally represented by a tuple,
which is treated in its accepted mathematical sense by
capturing the notion of an ordered list.
A tuple consisting of n member values is called an n-tuple and n is called its arity. Tuple members are enclosed in angle
brackets and their position is referred to as an attribute.
Attribute names are separated from the member values by
a colon. For example, ⟨a:x, b:y⟩ is a complex value
composed of two values x and y having attributes a and b,
respectively. The values x and y might also have some
structure. One value can be part of many different tuples in
the form of multiple copies. It is not permitted to include a
value into itself.
The empty tuple ⟨⟩ without any structure is treated as a special data element denoted as NULL. We assume that adding the empty value to or removing it from a tuple (independent of its position) does not change the tuple: ⟨a:x, b:y, c:⟨⟩⟩ = ⟨a:x, b:y⟩.
A collection of unique values is formally represented as
a mathematical set. Sets capture the very simple notion of
a group or collection of things. Importantly, a set is a
collection of distinct tuples and hence no element can
appear more than once in the same set. For example, {x, y, z} is a set consisting of three values x, y and z, which must be distinct. We will assume that any value is a member of some set and it is possible to determine the set a value belongs to. The notation x ∈ S is used to denote that the value x is a member of the set S.
The empty set, written as {} or ∅, is a special set which does not contain any values. In mathematics, it is also assumed that the empty set is a subset of any set: ∅ ⊆ S.
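As an informal illustration of these definitions (our own sketch, not part of the model, assuming a recent Java version with records), complex values can be modeled as immutable records compared by value, and a set then stores only distinct tuples:

import java.util.HashSet;
import java.util.Set;

public class TupleExample {
    // A complex value <a:x, b:y> sketched as an immutable record;
    // records are compared by the values of their members (inclusion by-value)
    record Item(String productId, int quantity) {}

    public static void main(String[] args) {
        Set<Item> items = new HashSet<>();
        items.add(new Item("p1", 2));
        items.add(new Item("p1", 2)); // a duplicate tuple is not added again
        items.add(new Item("p2", 5));
        System.out.println(items.size()); // prints 2: a set contains distinct tuples only
    }
}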
2.2 Set Membership for Data Modeling
A class of set-oriented data models relies on only sets of tuples for modeling data. Accordingly, a generic set-oriented database is defined as a number of sets, each consisting of some tuples (complex values):
DB = {S1, S2, ..., Sn}, Si = {t1, t2, ..., tm}
Here S1, ..., Sn are sets, and t1, ..., tm are tuples composed of values from other sets.
Structural constraints. In general, tuples within one set
may have any attributes which differ from tuple to tuple. It
is possible to impose structural constraints by specifying a
list of attributes and their types which are allowed for the
set:
Si = {⟨a1:x1, a2:x2, ..., ak:xk⟩}, where xj ∈ Tj
Now tuples may have only a certain structure by including only values from the specified sets T1, ..., Tk, called types.
Below we enumerate some properties of set-oriented
models:
Set nesting. According to this definition, a set-oriented
model does not support nested sets, that is, a set
consists of only tuples and cannot include other sets as
its members. Although such a support could be
theoretically provided, this feature makes the model
much more complicated. At the same time set nesting
can be modeled using flat sets and references.
Two kinds of sets. Many concrete set-oriented models
introduce two kinds of sets. For example, RM
distinguishes between domains and relations, which
are both normal sets but play very different roles in
the model. Other models like the functional data
model (FDM) distinguish between value sets and
entity sets. Although practically having two kinds of
sets can be useful, it is a controversial decision from
the theoretical point of view. Indeed, a set is a set and there has to be a really strong reason to introduce
different kinds of sets.
Tuple nesting. Nested tuples are naturally supported
by this category of models because a tuple member
can be a tuple with its own tuple members up to
primitive values.
Flattening nested structure. In some models like RM,
the nested structure of tuples is flattened by removing
intermediate levels so that any tuple consists of
primitive values. It is a highly controversial feature
from the data modeling point of view because we
essentially discard important information about the
structure.
Inclusion by-value. Both nested and flat tuples support
only inclusion by-value. This means that a tuple
consists of copies of its member values. In particular,
there is no possibility to reference elements in other
sets or include them using other indirect ways.
Probably the most important property of set-oriented
models is that the only basic relationship is set
membership: either a (data) value x is a member of a set,
x ∈ S, or not, x ∉ S. The only way to change a set-
oriented database (at the basic level) is either adding a
tuple to a set or removing a tuple from a set. Importantly,
no other operations are supported. Why this is a significant
limitation is discussed in the following subsections
describing more specific aspects of data modeling.
2.3 Modeling Identifiers. Primary Keys and
Surrogates
Although adding things to and removing them from a
collection is a very general modeling pattern, the question
is what the thing is. In this context, there exists another
important modeling pattern (not only in data modeling):
things we model are typically uniquely identified, which
means that there exists something unique that can be
“detached” from it and then used to access it. In data
modeling, this detachable part is frequently referred to as
an identifier and can be implemented as a pointer,
reference, surrogate, link, primary key or a similar
construct.
What identifiers are, their roles and uses is a big topic, but we would like to emphasize only two of their benefits:
They significantly decrease the amount of (redundant)
data being transferred and stored because only some
(small) part of the represented thing is copied.
The represented thing can be modified without the
need to update all its numerous copies (the identifier
itself is supposed to be immutable). It is essentially a
mechanism of sharing data.
An important observation is that a purely set-oriented
model does not support the mechanism of identifiers. We
can only manipulate a whole thing by adding it to a set or
removing it. The whole thing in this case is supposed to
identify itself and can be stored or transferred only by-
copy. For example, if we want to represent an order item
then we create a relation with attributes characterizing this
order item including quantity ordered, price and date. If
we want to represent this order item in some other set then
we must copy the whole tuple including all its attributes.
There is no other choice if we do not want to modify the
underlying theory and fundamental properties of sets and
tuples.
Obviously, such a model is extremely inconvenient and
there exist several general solutions. One approach is
based on introducing an additional layer on top of the base
set-oriented model, which can be characterized as a subset
of attributes used for identification. In RM, such a subset
is called a primary key (PK). However, the mechanism of
PKs has one fundamental flaw. The tuple as a whole becomes
a mutable data element. For example, we can change the
quantity of an order item, and this change does not
produce a new thing: we still have the same order item
because it has the same PK. Thus, PKs change the
fundamental principle of set-orientation: tuples are
immutable and can be only added or removed. Apparently,
the cause of the problem is that we still assume that a thing
(identifier and properties) is represented by one tuple, that
is, we follow the principle “one tuple – one thing”.
There exist also other problems with PKs like the
controversy with the treatment of inclusion. On one hand,
we want to include only PK in other tuples and hence only
PK is treated as a true tuple. On the other hand, we still
treat all attributes as a tuple because it is how a relation is
defined. The controversy is that we cannot unambiguously
answer the question whether a set consists of PKs only or
a set consists of whole things (PK and non-PK attributes).
Another solution consists in introducing some built-in
identifiers typically implemented as surrogates [5], oids
[7], references or system identifiers. It is somewhat similar
to PKs because we break all attributes into two groups.
The main and important difference from PKs is that
surrogates are managed by the system and hence are not
part of the model (and not part of the tuples). This has
some benefits and drawbacks. An advantage is that
surrogates are immutable (while PK typically can be
changed) and have many other properties of true
identifiers implemented by the system. A significant
drawback is that it is not possible to define their domain-
specific structure (in contrast, PKs may have arbitrary
user-defined structure). Yet, from the fundamental point of
view, we still have the controversy: does a set consist of
surrogates (as its tuples) or does it consist of whole things
(surrogate and properties)? If a set consists of only
surrogates then only surrogates can be processed, which is
useless in most cases (because we want to process data in
properties). If we assume that a set consists of whole
things (surrogates and properties) then we break the
fundamentals of set-orientation because tuples become
mutable and we must copy them into other tuples.
Why do we want to answer these questions and resolve
the controversies? Because we want to use formal set
operations for data processing rather than rely on specific
properties of ad-hoc mechanisms and additional layers.
These controversies can be resolved by introducing
functions and we describe this in Section 3.3.
2.4 Modeling Properties. Foreign Keys and
References
In the previous subsection, we emphasized the importance
of having identifiers and the inability to support them without
sacrificing some major principles of set-orientation. In this
section, we discuss how we can model thing properties,
that is, the other side of identifiers. Assume that we know
an identifier of a thing (e.g., modeled by PK or surrogate).
The main question now is how we can use it to access
properties of the represented thing.
Accessing a property normally means two operations:
getting a value stored in the property and setting
(assigning) a new value to the property (by overwriting the
old one). Here we see a fundamental difference of this data
manipulation pattern from the add-remove pattern. Indeed,
we do not want to add or remove anything – we are
thinking about something existing and want to simply
modify it. A pure set-oriented model does not support such
a pattern but there exist workarounds, which simulate it
using set operations or some other mechanisms and
assumptions.
One widespread approach to implementing the update
operation is based on the mechanism of foreign keys (FK).
Here the idea is that some values stored in the attributes of
this tuple are associated with the values stored in attributes
of another relation. This allows us to find a tuple in
another relation given values stored in this relation.
Normally it is assumed that only PK is stored in other
relations. Formally, FK is a constraint, which allows for
using only values already existing in the target relation.
The idea is that attributes from two related tables are
copied into one table by using the relational join operation,
by matching tuples from the source relations. The main
problem of this approach [8] is that FKs have the
semantics of references and properties while the
operations provided along with the mechanism of FK are
set-oriented (Section 1.1).
Another approach is based on built-in system identifiers
(surrogates, oids, references etc.) so that associations
between values in different sets are maintained by the
system. It could be viewed as an ideal solution because the
system supports dot notation in queries and we do not have
to think how to read and write values of properties.
However, it is far from this status for one reason: this
mechanism of access cannot be customized because it is
not part of the model. Essentially, it is the same problem
as we have with references and surrogates. We solve the
problem for the price of losing control over how things are
identified and how things are accessed.
Thus, the choice is either to have full control over
identification and access by using rather inappropriate and
complex set operations, or first-class support of access
operations without control over its implementation. COM
solves this problem by satisfying both of these
requirements and in Section 3.4 we describe how this
mechanism based on functions works.
2.5 Modeling Objects
Things can be modeled by representing them as objects or
entities [1]. Objects are different from and opposed to
values and it is a fundamental observation. Values are
passed by copying their constituents while objects are
passed by-reference and hence can be used for sharing
data. Since it is a widespread data modeling pattern, the question is how objects can be represented using sets and tuples. For example, how can a product (object) be thought of and formally represented via tuples in sets?
Many generalizations of set-oriented models [17, 3]
make a principled assumption that tuple attributes
represent fields of one object and hence a set stores a
number of objects of the same class (in COM it is not so).
This approach suffers from one controversy:
mathematically, a tuple is a value passed by-copy while an
object is not a value because it is passed by-reference. In
order to resolve it, we need to mark these (entity)
attributes as having a special status by essentially
excluding them from the tuple. However, if object fields
do not belong to the set tuples, then how should they be treated formally? Obviously, it is analogous to the
controversy arising due to the introduction of PKs and
FKs.
2.6 Modeling Connectivity. Joins
One fundamental question is how different tuples are
related and what it means for tuples to be connected.
RM provides a clear answer:
n values x1, ..., xn are (directly) related if there exists a tuple t = ⟨..., x1, ..., xn, ...⟩ ∈ S where they are members (in any order and possibly combined with other values)
This is why the set S is referred to as a relation in RM – its
tuples relate values from the domains. If we want to
connect some existing values then the only way is to create
a new set and add a tuple, which is made up of the related
values. This type of connectivity is symmetric, that is, all
values have the same status. This connectivity relationship
is also n-ary, that is, 2 or more values can be related. One
serious restriction of RM is that only values from domains
can be related; tuples from arbitrary relations cannot be
explicitly related because tuple attributes cannot contain
other tuples (due to flattening). Yet, relations between
arbitrary tuples can be modeled indirectly by including all
their attributes (which is highly unnatural).
This definition allows us to model direct connections
between values. In order to infer indirect connections, we
need another assumption, which defines what is meant by
inference. RM uses relation composition for deriving a
new relation given two input relations. If R ⊆ X × Y and Q ⊆ Y × Z are two (binary) relations then their composition R ∘ Q ⊆ X × Z is a set of ⟨x, z⟩ pairs:
R ∘ Q = {⟨x, z⟩ | ⟨x, y⟩ ∈ R, ⟨y, z⟩ ∈ Q for some y ∈ Y}
The idea is that initially (before inference) two values x
and z are not included into any tuple and hence they are
not directly related. However, (different) tuples they are
included into contain one common value y and hence they
are indirectly related. This idea of inference is based on
the property of including some common value. In other
words, if two values have some common parts then they
are related. Note that this operation is also symmetric and
it allows for indirectly connecting more than two values.
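For illustration only, relation composition can be sketched in Java as follows (the names and types are ours, not part of RM or COM); note that the operation necessarily produces a new set of pairs:

import java.util.AbstractMap.SimpleEntry;
import java.util.HashSet;
import java.util.Map.Entry;
import java.util.Set;

public class RelationComposition {
    // Compose R (pairs <x,y>) with Q (pairs <y,z>): the result contains <x,z>
    // whenever a tuple of R and a tuple of Q share a common value y
    static <X, Y, Z> Set<Entry<X, Z>> compose(Set<Entry<X, Y>> r, Set<Entry<Y, Z>> q) {
        Set<Entry<X, Z>> result = new HashSet<>(); // a new set is derived, as in a join
        for (Entry<X, Y> xy : r)
            for (Entry<Y, Z> yz : q)
                if (xy.getValue().equals(yz.getKey()))
                    result.add(new SimpleEntry<>(xy.getKey(), yz.getValue()));
        return result;
    }

    public static void main(String[] args) {
        Set<Entry<String, String>> itemToProduct = Set.of(
            new SimpleEntry<>("i1", "p1"), new SimpleEntry<>("i2", "p2"));
        Set<Entry<String, Double>> productToPrice = Set.of(
            new SimpleEntry<>("p1", 10.0), new SimpleEntry<>("p2", 20.0));
        System.out.println(compose(itemToProduct, productToPrice)); // e.g. [i1=10.0, i2=20.0]
    }
}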
This semantics of connectivity has an inherently set-
oriented nature and provides a very powerful formalism
for inferring new sets from existing sets by essentially
adopting the principles of the logic of predicates. The
question however is how relevant this semantics of
connectivity and mechanism of inference is for data
modeling. It has proven to be useful for many use cases and
data modeling patterns. However, as we demonstrated in
the introduction, there exist quite general scenarios, which
do not use one tuple as a representation of a relation
between values. In addition, the relational connectivity
semantics does not directly model the concept of objects
where we distinguish between an identifier and properties.
Although relation composition (join) is used for accessing
properties given an identifier in the FK pattern, this
support is not very natural because we apply a set-oriented
pattern (derive a new set) without having such a need (as
described in Section 1.1). Therefore, we will describe new
function-oriented semantics of connectivity in Section 3.6
and show how it is used to model and derive connections.
3 Functions for Data Modeling
3.1 Functions and Value Mappings
Mathematically, a function is a mapping from a set of
input values into a set of output values where exactly one
output is associated with each input: f: D → R. Here D is a set of all input values, called the domain, R is a set of all output values, called the range, and x is an argument that takes its values from D. There are two conventions for representing an output given an input: y = f(x) and y = x.f (dot notation).
A function can be represented as a set of input-output
pairs: f = {⟨x, y⟩ | x ∈ D, y = f(x) ∈ R}. One pair in this set is
referred to as a function element and (like any tuple) it is a
value. This representation is useful for formal reasoning
but since it hides the semantics of functions (as a
mapping), we will not use it. For data modeling, we
assume that
a set is a collection and hence we can add or remove
its member values
a function is a mapping and hence we can get or set its
output values
In the case some input has no output value explicitly
assigned, it is supposed to be NULL (the empty tuple).
Therefore, all inputs have exactly one output assigned.
Yet, for the purposes of this paper, we assume that
functions take only non-NULL values.
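The contrast between the two constructs can be sketched informally as follows (plain Java, our own example, not part of the model): a set supports adding and removing members, while a function supports getting and setting outputs, with an unassigned output playing the role of NULL.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SetsVersusFunctions {
    public static void main(String[] args) {
        // A set is a collection: we can only add or remove its member values
        Set<String> products = new HashSet<>();
        products.add("p1");
        products.remove("p1");

        // A function is a mapping: we can get or set its output values
        Map<String, Double> price = new HashMap<>();
        price.put("p1", 10.0);       // set the output for input p1
        Double p = price.get("p2");  // no output explicitly assigned
        System.out.println(p);       // null plays the role of NULL here
    }
}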
3.2 Value Mappings for Data Modeling
A concept-oriented database is defined as a number of sets
and a number of functions between these sets:
DB = ⟨{S1, S2, ..., Sn}, {f1, f2, ..., fm}⟩
where S1, ..., Sn are sets, f1, ..., fm are functions, fi: Sj → Sk for some j and k. It
belongs to a category of function-oriented models because
function is an explicit element of this model used for data
representation and (as we show later in the paper) data
processing. It is a generic definition and depending on the
constraints imposed on the structure of functions and sets,
we can get more specific types of data models. For
example, we could prohibit cycles of functions or we
could introduce an (unstructured) model with only one set
(the universe of discourse) and functions representing
mappings between its elements. However, studying such
(important) cases is not the purpose of this paper.
In this definition, it is important that functions have the
same status as sets but different semantics and purpose.
Data representation and data processing is not limited by
sets only. In addition to sets, we can represent data using
functions and process data by producing new functions
from other functions. In particular, two databases may
have identical sets but different functions and hence they
are different databases (which is not possible in purely set-
oriented data models).
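A minimal sketch of this property (plain Java maps and sets standing in for COM functions and sets; the names are ours): the two databases below contain exactly the same sets but differ in one function, so they are different databases although no set membership differs.

import java.util.Map;
import java.util.Set;

public class SameSetsDifferentFunctions {
    public static void main(String[] args) {
        // Both databases contain exactly the same two sets
        Set<String> items = Set.of("i1", "i2");
        Set<String> products = Set.of("p1", "p2");

        // Database 1: the function Product maps both items to p1
        Map<String, String> productFn1 = Map.of("i1", "p1", "i2", "p1");
        // Database 2: identical sets, but a different mapping
        Map<String, String> productFn2 = Map.of("i1", "p1", "i2", "p2");

        System.out.println(productFn1.equals(productFn2)); // false: the databases differ
    }
}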
A database schema is a database without set elements
and function elements. To define a schema, it is necessary
to specify its sets (without their members) as well as
functions along with their input and output sets but
without function elements. If a schema has been defined
then it is treated as a constraint, which means that set
members and function members must obey this structure.
Note that in this definition, functions are distinguished
from tuple attributes. A value stored in an attribute is part
of the tuple where the attribute is defined. In contrast,
outputs of functions are not stored in the input tuple. (In
[10] we referred to attributes and functions as identity
functions and entity functions, respectively.)
How are functions used to represent data? In contrast to
sets where the basic operation is adding and removing
tuples, the basic operation with functions is getting a value
and setting a value for a given input tuple. Setting a value
is essentially an assignment operation and it is precisely what
is absent in the set-oriented paradigm. Note that assigning
an output value to some input value does not change any
set, that is, we can manipulate the state of a database
without changing set membership relation. Manipulating
data in a function-oriented database means changing the
mappings between its sets where the sets represent existing
things.
Since the state of the functions (mapping) has to be
stored somewhere we say that functions are viewed as a
data store along with sets. Note that storing a function is
not directly related to the column store technology. In
column stores, we assume that all data is represented as a
table and the question is whether to physically represent it
as a row store or column store. In COM, the task is to
physically represent functions independent of the sets.
Another possible confusion comes from the formal
possibility to represent a function as a set of input-output
pairs. This suggests that there is actually no need to
introduce a dedicated construct – function – we can model
everything using sets. Here it is important to understand
that sets and functions have different semantics, and this
difference is of crucial importance for data modeling. In
other words, function membership can be and should be
used only for formal analysis or for physical representation
but not as its semantics. In data modeling, we treat
functions as mappings and can only get or set their
outputs.
3.3 Modeling Identifiers via Tuples
In a set-oriented model, modeling identifiers is a controversial issue (Section 2.3). By introducing functions, this controversy is
resolved. Now we have an unambiguous answer: any tuple
within a set is an identifier for something. A set is then a
collection of identifiers. Tuples in sets have the semantics
of existence (no properties or characterization). If a tuple
is added to a set then it represents a thing, which is
supposed to really exist, and if it is removed from the set
then this thing is supposed to be non-existing. The main
benefit is that there is no need for such mechanisms
as PKs or surrogates. The semantic load on sets is
significantly reduced and the whole model gets simpler.
More details about semantic differences between identities
and entities can be found in [15].
3.4 Modeling Properties via Functions
Legalizing functions as first-class elements of data models
essentially means that we recognize that mappings
between things are as important as things themselves.
Moreover, things without mappings represent a formally
degenerate and practically quite useless model. This view
contrasts with the purely set-oriented paradigm where the
complete data state is represented and all operations are
performed by using only sets.
In COM, tuples have only one main usage: they
manifest the fact of existence of a thing, which essentially
means that a tuple is an identifier of a thing. How then are such things characterized? This is done by means of
functions. Namely, a function is treated as a property and
its output is treated as a value of this property. Thus, the
primary purpose of functions is characterizing things using
other things.
The usage of functions for characterizing things has the
following important features:
Data is manipulated by getting and setting function
outputs as opposed to adding and deleting tuples in
the case of set operations
Functions differ from attributes because changing a
function output does not change any set while
changing an attribute changes the set. Functions allow
us to characterize things without changing the thing
identifier.
Properties and functions essentially turn values into
references and introduce the mechanism of data
access by-reference. In other words, a reference is a
normal value, which can be used to retrieve other
values using functions
Properties and functions provide a mechanism of
sharing data. If we change some property (by setting a
new function output value) then all other elements
storing this input will see the new value without the
need to update them
Function names become an important part of the data
model because they need to be specified in data
processing scripts
3.5 Modeling Objects
Functions allow us to simply and naturally solve the
problem of representing entities (or objects). In COM, an
object is a number of function output values returned for
the same input value e which is treated as the object
identifier or reference:
obj(e) = (f1(e), f2(e), ..., fn(e))
Here we used round brackets in order to distinguish it from
tuples denoted by angle brackets. An object always has
some identifier, which is an (input) tuple in some set.
Object fields are also values but they are stored in arbitrary
sets. Importantly, an object is not a tuple. In particular, it is
not possible to pass one object as a whole in one operation
because these values are stored separately and are
available only by using the corresponding functions.
For example, a product can be identified by its number
and hence product numbers are values within the set of
products. A product object, however, is defined by its
properties, which are represented by functions, and the
function output values are stored in other sets. More
specifically, if a product object is characterized by its
name and price then these two functions map each product
number into some values in the corresponding (string and
numeric) primitive sets.
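As a rough illustration of this view (our own sketch, not the Bistro API), a product "object" is nothing more than the outputs of several functions for one identifier; the identifier is stored in the set of products, while the characterizing values live in other (primitive) sets:

import java.util.HashMap;
import java.util.Map;

public class ProductObjects {
    public static void main(String[] args) {
        // The set of products contains only identifiers (product numbers);
        // properties are separate functions mapping identifiers to values
        Map<Integer, String> name = new HashMap<>();
        Map<Integer, Double> price = new HashMap<>();

        // Setting function outputs characterizes the thing
        // without changing any set and without forming one tuple
        name.put(10, "Laptop");
        price.put(10, 650.0);

        // The "object" identified by 10 is the collection of these outputs
        System.out.println(name.get(10) + ", " + price.get(10));
    }
}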
3.6 Modeling Connectivity
COM uses semantics of connectivity, which is based on
functions:
two values x and y are related or connected if there
exists a function f which maps value x to value y
In other words, two tuples are related if one of them is
mapped to the other one by means of some function. If we
need to relate two sets then it is done by defining a new
function between them. The main distinguishing feature of
this approach is that the way data values are related is
determined by functions. In particular, we can change the
way elements in the database are connected without
changing its sets. In contrast, the set-oriented approach
assumes that connections between elements are
determined by sets and hence we need to modify some set
in order to change connections between elements.
Functions provide a direct way to connect values. New
connections can be derived using function composition.
This operation combines two or more consecutive
mappings into one mapping by applying the next function
to the output of the previous function. Formally, if f: X → Y and g: Y → Z are two functions, then their composition is a new function h: X → Z, h(x) = g(f(x)). Alternatively, function composition can be written using dot notation, h(x) = x.f.g, or circle notation, h = g ∘ f. This way to derive new data is a functional counterpart of relation composition
(Section 2.6). The main difference is that instead of
producing new sets it produces new functions.
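A minimal sketch of this functional connectivity (plain Java maps standing in for COM functions; the data and names are ours): the price of an item is reached by composing two mappings, which derives a new connection without producing any new set.

import java.util.Map;

public class FunctionComposition {
    public static void main(String[] args) {
        // f: Items -> Products (the link function Product)
        Map<String, String> product = Map.of("i1", "p1", "i2", "p2");
        // g: Products -> Double (the function Price)
        Map<String, Double> price = Map.of("p1", 10.0, "p2", 20.0);

        // Composition g(f(x)), i.e. item.Product.Price in dot notation
        String item = "i1";
        double itemPrice = price.get(product.get(item));
        System.out.println(itemPrice); // 10.0
    }
}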
4 Functions for Data Processing
4.1 Manipulating Data Using Functions
The currently dominating approach to manipulating data is
based on set-oriented principles where deriving new data
means defining a new set with tuples composed of tuples
from already existing sets. In this section, we describe a
function-oriented approach where deriving new data
means defining a new function using already existing
functions.
Such function-oriented data processing is based on two
basic operations:
Getting function output given some input: y = f(x) or y = x.f
Setting function output for some input: f(x) = y or x.f = y
These operations are widely adopted in programming
but they are not suitable for data processing where we
want to manipulate collections of objects rather than
individual objects. What is worse, these operations do not
reflect the semantics of typical data processing patterns: it
is not specified what it means to read or assign a function
output. Therefore, the main question is how such basic
functional operations can be used to solve typical data
processing tasks.
We consider three general tasks, which can be solved
by using functional operations instead of set operations.
They correspond to the motivating examples in Section 1:
Computing new function outputs directly from other
functions in the same set (Section 4.2). Such
definitions are referred to as calculate functions and
they replace SELECT and Map set operations.
Finding new function outputs using outputs of
existing functions as criteria (Section 4.3). Such
definitions are referred to as link functions. They are
intended for linking sets and replace JOIN operation.
Updating new function output values (multiple times
for one input) using functions in another (related) set
(Section 4.4). Such definitions are referred to as
accumulate functions. They are intended for data
aggregation and replace GROUP BY and Reduce.
It is important that we do not want to define functions
by specifying explicitly their output values for all inputs.
The way a function is defined should avoid iterators and
loops over the input values. Therefore, functions will be
defined by providing a mechanism of computing one
output while it is the task of the system to apply this logic
to all necessary inputs. For any definition type, we define a
new function in an existing table using some other
functions, and the difference is only in how its output values are computed: by calculating, finding or updating values.
4.2 Calculating Function Output
Let us assume that new function outputs depend only on the outputs of other functions of the same table and that the new output values can
be directly computed from them. Such a function
definition is referred to as a calculate function and it needs
only one expression, which specifies how one output value
is computed given other function outputs for one input.
Given some value x of the input set X with functions f1, ..., fn, a calculate function output is represented by an expression:
f(x) = calculate(f1(x), ..., fn(x))
Here calculate is an expression returning a value from Y given outputs of the functions f1, ..., fn. Note that this
expression processes individual values – not sets. This
expression has to guarantee that the computed value really
exists in the output set.
For example, let us assume that we have already a set
Items with two functions Price and Quantity, which
map each item to some numbers. Now we want to define a
new function Amount, which computes the product of
Price and Quantity. First, we create a column (function)
object:
Column amount = db.createColumn(
    "Amount",  // Column name
    items,     // Input table
    objects    // Output table
);
and then we provide a definition:
amount.calculate(
    x -> (double)x[0] * (double)x[1],  // Lambda
    price, quantity                    // List of parameters
);
The first argument is a lambda expression, which returns
the product of two parameters passed as an array. The
second and third arguments are column objects this
function depends on. The system evaluates this column by
iterating through all elements of the Items set, retrieving
the outputs of the functions Price and Quantity, calling
the lambda expression by passing these two values as an
array, and storing the expression return value as the
Amount function output for the current input.
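For illustration, the evaluation just described might look roughly like the following loop (a simplification under our own assumptions, not the actual Bistro implementation):

public class CalculateEvaluation {
    public static void main(String[] args) {
        // Sample outputs of the Price and Quantity columns for three items
        double[] price = {10.0, 20.0, 5.0};
        double[] quantity = {1.0, 2.0, 3.0};
        double[] amount = new double[price.length]; // outputs of the new Amount column

        // For each input: collect the parameter outputs, call the lambda, store the result
        for (int i = 0; i < price.length; i++) {
            Object[] x = {price[i], quantity[i]};
            amount[i] = (double) x[0] * (double) x[1]; // the lambda from the definition
        }
        System.out.println(java.util.Arrays.toString(amount)); // [10.0, 40.0, 15.0]
    }
}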
This approach solves the first problem we described in
Section 1.1 by relying on only functions defined using a
value-based expression computing its output directly from
inputs and without any awareness of the sets existing in
the model. It is essentially a functional analogue of the
SELECT and Map operations but without the necessity to
define and generate sets.
4.3 Finding Function Output
There exist data processing patterns where it is not
possible to directly compute outputs of a new function.
However, we can find an output element by using criteria
expressed in terms of its properties. More specifically,
given an input value x ∈ X of a new function f: X → Y, the output value y = f(x) ∈ Y is found by imposing constraints expressed as a predicate p:
p(f1(x), ..., fn(x), g1(y), ..., gm(y)) = true
This predicate connects n properties of input x and m
properties of output y.
The simplest and most useful predicate is equality,
which means that we search for a tuple y ∈ Y with properties equal to some properties of the input x:
g1(y) = f1(x), ..., gn(y) = fn(x)
Although there can be many elements y satisfying the
predicate for one input, we will assume that either there is
only one element or there exist additional criteria for
choosing only one.
Finding an output satisfying certain criteria is formally
based on the operation of inverting a function or de-
projecting a value. An inverse function
f⁻¹: Y → 2^X
returns a subset of inputs, which all map to the same output:
f⁻¹(y) = {x ∈ X | f(x) = y}
We can also use the inverse arrow '←' to denote the same operation:
y ← f = f⁻¹(y)
Inverse arrow is opposite to dot notation and we use it [13,
14] because dot symbol does not have an inversion.
In the case we have many properties specified as a
criterion, the operation of finding an output of a function is
written as follows:
f(x) ∈ g1⁻¹(f1(x)) ∩ g2⁻¹(f2(x)) ∩ ... ∩ gn⁻¹(fn(x))
The function takes a value from the intersection of the de-
projections of input value x properties.
In practice, the way a system performs de-projection
and finds an element satisfying certain criteria depends on
the implementation, and there exist numerous techniques
for optimizing such a search. At logical level, it is
important only that we can define a new function by
saying that its outputs have to be equal to certain input
properties.
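A naive sketch of de-projection for illustration (plain Java, our own helper; a real implementation would use indexes rather than a scan):

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class DeProjection {
    // f^-1(y): the subset of inputs that are mapped to the output y
    static <X, Y> Set<X> deproject(Map<X, Y> f, Y y) {
        Set<X> inputs = new HashSet<>();
        for (Map.Entry<X, Y> e : f.entrySet())
            if (e.getValue().equals(y))
                inputs.add(e.getKey());
        return inputs;
    }

    public static void main(String[] args) {
        // ProductId as a function from items to product ids (sample data)
        Map<String, String> productId = Map.of("i1", "p1", "i2", "p2", "i3", "p1");
        System.out.println(deproject(productId, "p1")); // e.g. [i1, i3]
    }
}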
For example, if we have two isolated sets Items and
Products (for example, loaded from CSV files) then we
might need to define a function, which maps each order
item to the corresponding product. First, we create a new
column object by specifying its name, input table and
output table objects:
Column product = db.createColumn(
    "Product",  // Column name
    items,      // Input table
    products    // Output table
);
Now we can provide a definition for this column:
product.link(
    new Column[] { productsId },     // In Products
    new Column[] { itemsProductId }  // In Items
);
The first argument in this definition is a list of the output
table properties: in this case only one column object
productsId representing column Id in the Products
table. The second argument lists the corresponding input
element properties: in this example, only one column
itemsProductId representing column ProductId in the
Items table. For each input element from Items, the
system will find an element from Products, which has the
same id. It will then store them as outputs of this new
column. After evaluation, this column can be used in other
expressions to access products directly from order items.
This approach solves the second problem we described
in Section 1.1 by relying on only functions defined by
specifying search criteria. It can be treated as a functional
analogue of the JOIN operation but without the necessity
to define and generate sets. Although the way such
functions are defined is very similar to join criteria, they
are semantically completely different because here we
define a function (mapping) rather than a set [8].
4.4 Updating (Accumulating) Function Output
Both calculate and link functions return a single final
value of the function by directly computing it or by finding
it, respectively. Importantly, an output depends on only
one input tuple (and its properties). There exists a very
important data processing task, called aggregation, which
cannot be solved by using these functions because its
result depends on many tuples, which are somehow related
to the input and are referred to as a group. In order to
compute an output value, the function has to process all
tuples in the group. For example, assume that we want to
compute the total sales of all products by defining a
function, which maps each product to some number.
Obviously, it does not depend on the properties of the
product; it depends on line items stored in another table
(but related to this product). This data processing pattern
actually involves two separate tasks: grouping and
aggregation. Both of these tasks can be solved by using
only functions.
Grouping is performed using the following
interpretation of a function. If g: Z → X is a function then the de-projection Z' = g⁻¹(x) is a subset of tuples from Z, which are related to the element x ∈ X. Elements from Z' are frequently referred to as facts, and elements from X are referred to as groups. Function g, called the grouping function, assigns a group x to each fact z and, on the other hand, the inverse function g⁻¹ returns a subset of facts a
group x consists of.
In our example, Items contains facts and Products
contains groups. The Product link column we defined in
the previous section is a grouping function, which assigns
a product to each line item.
Now let us consider how aggregation is performed. The
task is to define a new function f: X → Y, which computes its output y = f(x) ∈ Y by processing a subset Z' = g⁻¹(x) ⊆ Z. We could pass the subset of tuples Z' to an aggregate expression, which will process them in a loop and return one value. However, it is precisely what we want to avoid because it breaks the whole conception by requiring an explicit loop and explicitly processing subsets. The problem can be solved by introducing accumulate functions, which get only one fact z ∈ Z' as well as some output value y ∈ Y:
y' = accumulate(y, h1(z), ..., hn(z))
The task of this function is to modify y by using n
properties of the fact z and return the updated y' result.
This update expression is completely unaware of the loops
and groups; it processes individual values. The idea of
aggregation using such accumulate expressions is that the
system calls it for each fact by passing the previous return
value as an input for the next call:
y0 = C, where C is an initial value
y1 = accumulate(y0, h1(z1), ..., hn(z1)), z1 ∈ Z'
y2 = accumulate(y1, h1(z2), ..., hn(z2)), z2 ∈ Z'
...
yk = accumulate(yk-1, h1(zk), ..., hn(zk)), zk ∈ Z'
The initial value is some constant like 0. Then we update this value using properties of the first fact z1 and get a new output y1, which is then used to call the update expression again but with the properties of the next fact z2, and so on. The last value yk will be the final value of the function f being evaluated for the input x. It is necessary to call the update expression k times for the k facts z1, ..., zk from the group in order to compute the output for one input x.
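The evaluation of one output can be sketched as the following loop (our own illustration of the semantics, not the actual Bistro implementation):

import java.util.List;
import java.util.function.BiFunction;

public class AccumulateEvaluation {
    // Evaluate one output of an accumulate function for one group:
    // start from the initial value and call the update expression once per fact
    static <Z, Y> Y evaluate(List<Z> facts, Y initialValue, BiFunction<Y, Z, Y> accumulate) {
        Y y = initialValue;                 // y0 = C
        for (Z fact : facts)                // one call per fact in the group
            y = accumulate.apply(y, fact);  // yi = accumulate(yi-1, fact)
        return y;                           // yk is the final function output
    }

    public static void main(String[] args) {
        // Amounts of the facts de-projected for one product (sample data)
        List<Double> amounts = List.of(10.0, 20.0, 5.0);
        double total = evaluate(amounts, 0.0, (a, v) -> a + v);
        System.out.println(total); // 35.0
    }
}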
In order to compute total sales for each product in our
example, we create a new column by specifying also a
default value:
Column total = db.createColumn(
    "Total",   // Column name
    products,  // Input table
    objects    // Output table
);
total.setDefaultValue(0.0);
and define it using an accumulate expression:
total.accumulate(
    product,                            // Grouping (link) column
    (a,x) -> (double)a + (double)x[0],  // Lambda
    amount                              // Fact properties to be aggregated
);
The first argument of this definition is a reference to the
grouping column (defined in the previous sub-section as a
link column), which maps items to products. The second
argument is a lambda expression, which adds the amount
of the fact (it is a calculate column) to the current
(intermediate) aggregate value for the product. It will be
called as many times as there are order items for this
product. The third argument is a reference to a property of
the items, which is being aggregated.
Note that this definition uses two derived columns,
product (the grouping function) and amount (the aggregated
property), but whether these columns are stored or derived is not
important for defining new functions. This is one of the benefits of the approach,
because we can define and, more importantly, later modify
various properties independently.
This approach solves the third problem described in the
introduction by providing a function-oriented replacement
for such set operations as GROUP-BY or Reduce. The
main benefit of accumulate functions is that no
unnecessary new sets are produced and that they rely only on normal
value-based expressions, requiring no loops or iterations
[9].
5 Sets for Data Processing
5.1 Manipulating Data Using Sets
In the previous section we described how new data can be
derived by defining new functions, and it was assumed
that the sets are not changed during inference. Although
calculate, link and accumulate functions can replace some
general set-oriented data processing patterns, there still
exist some cases where it is necessary to derive a new set
and not a function. We consider three such tasks:
- Product of several sets
- Filtering a set
- Projecting a set
The main difference of all these operations from their
set-oriented analogues is that new sets are defined in terms
of functions, and they produce new functions as their
result in addition to a set. These new functions connect the
result set to the source set(s) and hence the result set is not
isolated. We can always use these new functions to access
other sets and their functions in other definitions. This is
important because there is no need to copy all the original
data into each new result set – the data can be accessed from
the result using the connections (functions) between sets.
5.2 Combining Tuples: Product of Sets
One important data modeling and data processing pattern
consists in finding all combinations of tuples in two or
more existing sets. This operation is one of the
cornerstones of multidimensional analysis because the source
sets can be treated as axes (with tuples as coordinates) and
the product set treated as a multidimensional space
(combinations of coordinates representing points).
COM supports the product operation, which is defined
as follows:

$T = X_1 \times X_2 \times \ldots \times X_n = \{ \langle x_1, x_2, \ldots, x_n \rangle \mid x_i \in X_i \}$
Although formally it is the conventional Cartesian product,
it has the following distinguishing features:
- In contrast to RM, the result set is not flattened, and each result tuple has n attributes, each being equal to some tuple from a range set.
- The product set is a derived set whose population is automatically inferred from the existing sets. Thus, COM can infer both functions and sets.
- The product set retains its connection with the source sets. We can always access source tuples given an output tuple using attribute names, which is useful when defining other sets or functions.
- The product reflects the semantics of multidimensionality, and the operation is not intended for expressing connectivity via joins as it is in RM. Although formally we can use it for joining (by adding some filtering conditions), semantically it would mean that we are using RM and not COM.
For example, let us assume that we have two tables:
Products with a list of products and Quarters with a list
of quarters (like 2018Q1, 2018Q2, etc.). For
multidimensional analysis, we might need to build a table
(cube) of all their combinations. First, we create a table
object representing a multidimensional space, and add two
columns, which will represent the corresponding axes:
Table pq = schema.createTable("PQ");
Column product =
    db.createColumn("Product", pq, products);
Column quarter =
    db.createColumn("Quarter", pq, quarters);
Second, we define this table as a product of two other
tables:
pq.product();
After such a definition, this table will be automatically
populated by all combinations of product and quarter
records, that is, each tuple in this table is a cell identified
by one product and one quarter.
The next step would be to add new derived functions
characterizing the cells, which can be done as described in
Section 4.4 using accumulate columns that aggregate facts
from the Items table; a possible definition is sketched below.
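As a minimal sketch, assuming a link column itemToCell (a hypothetical name) has already been defined from Items to PQ and maps each line item to its product–quarter cell, such a measure could be defined with the same accumulate pattern as in Section 4.4:

// Sketch only: "itemToCell" is an assumed link column from Items to PQ,
// "amount" is the calculate column on Items defined earlier in the paper.
Column cellTotal = db.createColumn(
    "Total",   // Column name
    pq,        // Input table: the product-quarter cube
    objects    // Output table
);
cellTotal.setDefaultValue(0.0);

cellTotal.accumulate(
    itemToCell,                           // Grouping (link) column from Items to PQ
    (a,x) -> (double)a + (double)x[0],    // Lambda: add the item amount to the cell total
    amount                                // Fact property to be aggregated
);

Each cell of the cube then holds the total amount of the order items that refer to its product and fall into its quarter.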
5.3 Filtering Tuples
Filtering records is one of the most widely used operations
and its purpose is to select records from a table, which
satisfy certain conditions. COM allows for filtering
records using the product operation rather than a dedicated
operation:
    
 
If we now leave only one source dimension in the product,
then it becomes a filter:

$Y = X \mid p = \{ \langle x \rangle \mid x \in X,\ p(x) = \mathrm{true} \}$

Essentially, this means that the result set Y will contain
the records from X which satisfy the specified predicate. The
filtered table will have one attribute, which points to the
selected records in the source table.
For example, we could select all products with low
prices:
Table cheap = schema.createTable("Cheap");
Column product =
    db.createColumn("Product", cheap, products);

cheap.product(
    x -> (double)x[0] < 100.0,        // Lambda
    new ColumnPath(product, price)
);
First, we create a table object, which will store filtered
records. Second, we create a column, which points to a
table with source records. Finally, we provide a definition
with the first argument being a predicate lambda
expression returning true if the parameter is less than
100.0. The second argument specifies a parameter. In our
example, it is a sequence of two column segments
(represented by the ColumnPath class). The first column
segment starts from the new table and leads to the source
table Products, and the second segment retrieves the price
of this product.
5.4 Projecting a Set
Assume that there is only the Items table and no Products
table, but the task is to compute various properties of
products, such as the sales amount. In this case, we simply do not
have a table to attach these properties to. A list of products
could be restored by enumerating all unique product
identifiers occurring in the Items table. This set operation
is called projection along a function. It is applied to a
source set X by specifying one of its functions f and results
in a new set Y, which consists of all unique outputs of this
function:

$Y = X \to f = \{ f(x) \mid x \in X \}$

The arrow here is analogous to the dot in dot notation, with the
difference that it is applied to sets (and it also allows for
inverting this operation).
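As an illustration of this semantics only (not the toolkit's API; the class and method names are hypothetical), the following sketch collects the unique outputs of a function f over a source set X:

import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Function;

// Illustrative sketch of projection along a function (not the toolkit's API):
// the result set Y consists of all unique outputs of f over the source set X.
class ProjectionSketch {
    static <X, Y> Set<Y> project(Iterable<X> source, Function<X, Y> f) {
        Set<Y> result = new LinkedHashSet<>();  // keeps first-seen order, removes duplicates
        for (X x : source) {
            result.add(f.apply(x));             // each distinct output becomes an element of Y
        }
        return result;
    }
}

In the example, projecting Items along its product-identifier attribute restores the set of products: each unique identifier becomes one element of the new Products set.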
In our system, project columns are used for projection,
and they are defined in the same way as link columns with
one difference: link columns do not change the output set
while project columns will automatically populate it.
6 Conclusion
The main motivation for this research is based on the
observation that applying exclusively sets and set
operations is inappropriate for many widespread use cases,
because they are actually aimed at deriving new columns
rather than tables. Since existing models and data
processing frameworks provide mainly set operations, this
leads to the need to define multiple tables without
necessity, which makes data models and data processing
scripts more complicated and more difficult to write, comprehend
and maintain.
As a general solution, we described a new data model,
called the concept-oriented model (COM), which relies on
both sets and functions as two primary data modeling
constructs. In comparison to purely set-oriented models
(like RM), COM significantly reduces the semantic load
on sets by treating them in their original mathematical
sense as collections of tuples and only collections of tuples
without any additional mechanisms and assumptions like
PKs, FKs, domains vs. relations etc.
Functions are arbitrary mappings between sets. In
comparison to existing functional models, their semantic
load increases:
- functions represent properties (instead of FKs)
- functions represent connectivity (instead of joins)
- functions allow us to introduce objects (as combinations of their outputs)
- functions provide a mechanism of access by reference and dot notation
- functions represent a portion of the state of the database, so that two databases with the same sets could differ by their functions
- functions are used for inference by deriving new functions from existing functions
- functions are used for linking
- functions are used for aggregation
We described how COM can be used for data
processing by introducing three functional operations,
calculate, link and accumulate, as well as some set
operations. We also described one possible
implementation of this approach in an open source toolkit
intended for general purpose data processing and designed
as a functional alternative to MapReduce.
The main benefit of introducing functions as first-class
elements is that models as well as data processing scripts
become simpler, more natural, easier to design and
maintain because the data modeling constructs provided
by COM (functions and operations with functions) do
precisely what is necessary in many use cases – directly
defining a new column.
There are several directions for future research:
- semantic and conceptual aspects of COM, including inheritance, polymorphism, semantic relationships, multidimensional models and NULL values
- extending this approach to other data processing use cases like stream processing and big data processing
- architecture and system design aspects, including topology organization (a graph of set and function operations), dependency management, incremental evaluation (propagating small changes through the topology) and performance issues
REFERENCES
[1] M. Atkinson, F. Bancilhon, D. DeWitt, K. Dittrich, D. Maier & S. Zdonik (1990). The Object-Oriented Database System Manifesto. In Proc. 1st Int. Conf. on Deductive and Object-Oriented Databases
[2] E. Codd (1970). A Relational Model of Data for Large Shared Data Banks. Communications of the ACM, 13(6), 377-387
[3] H. Darwen & C.J. Date (1995). The Third Manifesto. ACM SIGMOD Record, 24(1), 39-49
[4] J. Dean & S. Ghemawat (2004). MapReduce: Simplified data processing on large clusters. In OSDI'04, 137-150
[5] P. Hall, J. Owlett & S. Todd (1976). Relations and Entities. In Modelling in Data Base Management Systems, G.M. Nijssen (Ed.), IFIP/North-Holland, 201-220
[6] L. Kerschberg & J.E.S. Pacheco (1976). A Functional Data Base Model. Report No. 2/1976, Departamento de Informática, Pontifícia Universidade Católica do Rio de Janeiro, Brazil
[7] S. Khoshafian & G. Copeland (1986). Object Identity. ACM SIGPLAN Notices, 21(11)
[8] A. Savinov (2016). Joins vs. Links or Relational Join Considered Harmful. International Conference on Internet of Things and Big Data (IoTBD 2016), 362-368
[9] A. Savinov (2017). From Group-by to Accumulation: Data Aggregation Revisited. International Conference on Internet of Things, Big Data and Security (IoTBDS 2017), 370-379
[10] A. Savinov (2016). Concept-oriented model: The functional view. arXiv preprint arXiv:1606.02237 [cs.DB], https://arxiv.org/abs/1606.02237
[11] A. Savinov (2016). DataCommandr: Column-Oriented Data Integration, Transformation and Analysis. International Conference on Internet of Things and Big Data (IoTBD 2016), 339-347
[12] A. Savinov (2014). ConceptMix: Self-Service Analytical Data Integration Based on the Concept-Oriented Model. Proc. 3rd International Conference on Data Technologies and Applications (DATA 2014), 78-84
[13] A. Savinov (2012). References and arrow notation instead of join operation in query languages. Computer Science Journal of Moldova (CSJM), 20(3), 313-333
[14] A. Savinov (2011). Concept-oriented query language for data modeling and analysis. In L. Yan & Z. Ma (Eds.), Advanced Database Query Systems: Techniques, Applications and Technologies, IGI Global, 85-101
[15] A. Savinov (2011). Concept-oriented model: Extending objects with identity, hierarchies and semantics. Computer Science Journal of Moldova (CSJM), 19(3), 254-287
[16] E.H. Sibley & L. Kerschberg (1977). Data architecture and data model considerations. In Proc. AFIPS Joint Computer Conferences, 85-96
[17] M. Stonebraker, L. Rowe, B. Lindsay, J. Gray, M. Carey, M. Brodie, P. Bernstein & D. Beech (1990). Third generation database system manifesto. ACM SIGMOD Record, 19(3)
The Data Base Management System is now a well established part of information systems technology, but the many architectures and their plethora of data models are confusing to both the practitioner and researcher. In the past, attempts have been made to compare and contrast some of these systems, but the greatest difficulty arises in seeking a common basis. This paper attempts to show how a generalized data system (GDS), represented by two different models, could form such a basis; it then proposes that data policy definitions can restrict the GDS to a specialized model, such as a relational or DBTG-like model. Finally, it proposes that this concept forms a better basis for data structure design of specific system applications.