Temporal Dimensional Modeling
Lars Rönnbäck
Olle Regardt
Department of Computer Science, Stockholm University
lars.ronnback@anchormodeling.com
Preprint 2019, rev. 4

Thanks to Up to Change (www.uptochange.com) for sponsoring research on anchor and transitional modeling.
Abstract
Dimensional modeling has for the last decades been one of the prevalent techniques for modeling data
warehouses. As initially defined, a few constructs for recording changes were provided, but due to
complexity and performance concerns many implementations still provide only the latest information.
With increasing requirements to record changes, additional constructs have been suggested to manage
change, mainly in the form of slowly changing dimension types. This paper will show that the currently
existing types may lead to undesired anomalies, either at read or at write time, making them unsuitable
for application in performance critical or near real-time data warehouses. Instead, based on current
research in temporal database modeling, we introduce temporal dimensions that make facts and
dimensions temporally independent, and therefore suffer from none of said anomalies. In our research,
we also discovered the twine, a new concept that may significantly improve performance when loading
dimensions. Code samples, along with query results showing the positive impact of implementing
temporal dimensions compared to slowly changing dimensions are also presented.
dimensional modeling · microbatch · data warehouse · slowly changing dimension · high performance · twine · temporal dimension · real-time analytics
1. Introduction
For the last 25 years, two techniques have been dominating when it comes to data warehouse implementations: (Inmon 1992) with his normalized model and (Kimball 1996) with his dimensional modeling. Initially, few con-
structs or guidelines for managing data that
changes over time were available, but as such
requirements became more common, tempo-
ral features were retrofitted onto these tech-
nologies (Bliujute et al. 1998; Body et al. 2002).
This paper focuses on slowly changing di-
mension types (Kimball 2008; Ross 2013),
which were introduced to manage change in
dimensional modeling, and how well these
fit with current requirements for high perfor-
mance. Along with the traditional types of
dimensions, a simple and new type of dimen-
sion is introduced, called a temporal dimen-
sion. It is based on current research in tem-
porality (Hultgren 2012; Rönnbäck et al. 2010; Golec, Mahnič, and Kovač 2017), and suffers
from none of the refresh anomalies associ-
ated with slowly changing dimensions (R. J.
Santos and Bernardino 2008; Slivinskas et al.
1998). For most types of slowly changing
dimensions the fact table and its dimension
tables become temporally dependent, leading
to update anomalies and performance degra-
dation (Araque 2003). This is not the case
when using temporal dimensions, in which
the tables are temporally independent, such
that existing relationships are preserved re-
gardless of all changes. We recommend that this new type of dimension
should become the de-facto standard for man-
aging change in dimensional modeling and
particularly when considering real-time an-
alytics, a growing requirement within data
warehousing (Russom, Stodder, and Halper
2014; Azvine, Cui, and Nauck 2005). While
much research has been done into strategies
for loading data at near real-time (Jörg and Dessloch 2009; Vassiliadis and Simitsis 2009),
little has been done with respect to finding
the most suitable modeling techniques.
The temporal dimension and many of the
existing types support all three query cate-
gories that may be relevant when retrieving
information that changes over time (Golfarelli
and Rizzi 2009; Ahmed, Zimányi, and Wrembel 2015):
tiy: Today is Yesterday, in which all facts reference the current dimensional value.
toy: Today or Yesterday, in which all facts reference the dimensional value that was in effect when the fact was recorded.
yit: Yesterday is Today, in which all facts reference the dimensional value that was in effect at a certain point in time.
The paper has a running example and is structured as follows. This section has served as an introduction to the topics discussed; Section 2 presents slowly changing dimensions, their models, and issues stemming from how they are implemented; and Section 3 introduces our new concept of a temporal dimension. Following that, Section 4 presents results from reading from and writing to the different types of dimensions, Section 5 contrasts our work with related research, and Section 6 ends the paper with conclusions and suggestions for further research.
2. Slowly Changing Dimensions
A running example will be used to illustrate
the differences found in the models of the
slowly changing dimensions of types 1–7, the
“junk” type, and the temporal dimension. The
example model, containing a simple fact and
dimension table, can be seen in Figure 1. Each
column is contained in a box, which is lined
and shaded when that column is part of the
primary key in the table, and white with
dotted lines if not. Slanted text is used for
columns that have nothing to do with the
keeping of a history of changes and upright
text for the columns necessary to keep such
a history. Relationships between tables are
shown along with their cardinalities.
To keep the model uncluttered and to
the point, it is the simplest possible that
still conveys the construction of the differ-
ent types. The dimension table has a single value, dimProperty, along with a primary key, dim_ID. The fact table references this key from a column with the same name, along with one measure, factMeasure, and the date, factDate, for which this measure was recorded [1]. The primary key in the fact table is a combined key, consisting of all dimension table references (in our case one) and the date that the measures were recorded.

[1] Normally dates are referenced from a fact table to a time dimension, but since that would only introduce additional and, for all illustration and testing purposes, equal complexity, it was simplified to a column directly in the fact table.
The numbering of the types is taken from (Kimball 1996) for types 1 to 3, whereas 4 to 7 and “junk” come from (Ross 2013)
and (Wikipedia 2019), favoring those def-
initions that actually capture a history of
changes. There is, in fact, also a slowly chang-
ing dimension of type 0, but in that type
changes are simply discarded, resulting in in-
formation that is not up to date. As this is less
than desirable, type 0 will not be discussed
further. It is well known that change is preva-
lent and integral to almost all information, so
with no further ado, these are the types of
solutions retrofitted onto dimensional model-
ing, trying to solve the problem of managing
change.
2.1. Type I

Figure 1: Dimensional model, with a slowly changing dimension of type 1, also coinciding with type 0. [Fact (dim_ID, factDate, factMeasure), 1..n to 1, references Dimension_SCD1 (dim_ID, dimProperty).]
In type 1, depicted in Figure 1, destructive
updates of dimProperty are used to bring in-
formation up to date. The history is, of course,
lost, and this type can only support tiy. This
type is commonly used, despite its inability to
keep previous versions, probably because the
other types of slowly changing dimensions re-
sult in added complexity. However, as the re-
quirement is to keep a history of the changes
made, this type is discussed less, but will
act as a point of reference when performing
the experiments that measure performance of
read and write operations.
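As an illustration, the refresh of a type 1 dimension amounts to nothing more than a destructive update; a minimal sketch, assuming a hypothetical staging table dimStaging holding newly arrived values (not part of the published test suite):

update d
set
   d.dimProperty = s.dimProperty -- overwrite in place, the previous value is lost
from
   Dimension_SCD1 d
join
   dimStaging s
on
   s.dim_ID = d.dim_ID
where
   d.dimProperty <> s.dimProperty;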
2.2. Type II
In type 2, depicted in Figure 2, instead of updating dimProperty a new row is added with the new value. This row will have a different dim_ID, so in order to be able to correlate this change with the row containing the original value, some portion of the dimension must remain static over changes. This is represented by the dimStatic column. If all values can change, this may instead be introduced through a generated identifier. Furthermore, to know the order of changes and when they took effect, a dim_ValidFrom date is added.
Figure 2: Dimensional model, with a slowly changing dimension of type 2. [Fact (dim_ID, factDate, factMeasure) references Dimension_SCD2 (dim_ID, dim_ValidFrom, dimProperty, dimStatic).]
Facts with factDate after dim_ValidFrom will reference the new value, and considering a model that is regularly receiving data, this will build a toy connection between facts and dimension values. This can lead to confusion, particularly if aggregating measures over dimProperty, which now yields at least two different aggregates, given the circumstances. Bringing facts into a tiy or yit view is possible through the combination of dimStatic and dim_ValidFrom, but the resulting query is complex and needs additional indexing in order to be performant.
If the use case of tiy is much more com-
mon, it is possible to update fact references
when values change. This will instead take a
heavy hit on performance at write time, since
part of a primary key in the fact table is up-
dated.
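A minimal sketch of how a type 2 change could be captured, assuming a hypothetical staging table dimStaging that only holds rows representing genuine changes, and a hypothetical sequence for new surrogate keys; rather than updating, a new row is inserted next to the existing ones:

insert into Dimension_SCD2 (dim_ID, dimStatic, dimProperty, dim_ValidFrom)
select
   next value for dbo.dimIdSequence, -- new surrogate key for the new version row
   s.dimStatic,                      -- the static part correlates all versions of a member
   s.dimProperty,                    -- the changed value
   s.changedAt                       -- when the change took effect
from
   dimStaging s;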
2.3. Type III

Figure 3: Dimensional model, with a slowly changing dimension of type 3. [Fact (dim_ID, factDate, factMeasure) references Dimension_SCD3 (dim_ID, dimProperty_V1, dimProperty_V2).]
In type 3, depicted in Figure 3, columns are added to manage change. The column dimProperty is split into dimProperty_V1 and dimProperty_V2, if only two versions are of interest. This limits the number of changes that can be kept track of, since regularly adding columns to the table will make it unwieldy. Any existing queries will also have to be continuously maintained in order to cope with the addition of new columns. Furthermore, if only a few of the rows in the dimension change, most rows will never have seen any changes and their columns then contain empty or dummy values for all but the first version, which is a waste of space.
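A minimal sketch of a type 3 change, under the assumption (one of several possible conventions) that dimProperty_V1 holds the current value and dimProperty_V2 the previous one; dimStaging is again a hypothetical staging table:

update d
set
   d.dimProperty_V2 = d.dimProperty_V1, -- push the current value into the "previous" column
   d.dimProperty_V1 = s.dimProperty     -- and overwrite the current value
from
   Dimension_SCD3 d
join
   dimStaging s
on
   s.dim_ID = d.dim_ID
where
   d.dimProperty_V1 <> s.dimProperty;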
2.4. Type IV
In type 4, depicted in Figure 4, a type 1 de-
structive update is used on the actual value
in the dimension table producing a tiy con-
nection. However, this is complemented by a
second “history” table, to which the row with
the value about to be updated first is copied.
The primary key in the history table consists
of the combination of the columns dim_ID and dim_ValidFrom. Historical rows thereby retain
their original dim_ID, making it easier to find
them, but since the key is now composite it
cannot be referenced from the fact table. The
dashed reference line intends to show that a
non-enforced relationship nevertheless exists.
The drawback is that a performance hit
is taken at write time, since not only must
a value be updated, but rows must also be
copied between tables. As no referential in-
tegrity constraint exists between the fact and
the history table, consistency must instead
be guaranteed by other means. Queries for
which yit or toy is of interest become less
complex than in the case of type 2, but ta-
ble elimination is no longer possible (Paulley
2000), due to the lack of a referential integrity
constraint. This may prove a problem in situ-
ations where many dimensions are slowly
changing according to this type, since un-
needed joins may be performed during ex-
ecution of the query.
Figure 4: Dimensional model, with a slowly changing dimension of type 4. [Fact (dim_ID, factDate, factMeasure) references Dimension_Current (dim_ID, dimProperty); the history table Dimension_SCD4 (dim_ID, dim_ValidFrom, dimProperty) is related to the fact table only through a non-enforced, dashed relationship.]
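A minimal sketch of a type 4 refresh, again using a hypothetical dimStaging table; the outgoing row is first copied into the history table, after which the current table is destructively updated (the exact timestamp semantics of dim_ValidFrom in the history table is an implementation choice and an assumption here):

insert into Dimension_SCD4 (dim_ID, dim_ValidFrom, dimProperty)
select
   d.dim_ID,
   s.changedAt,   -- assumed: the time at which this version was superseded
   d.dimProperty  -- the value about to be overwritten
from
   Dimension_Current d
join
   dimStaging s
on
   s.dim_ID = d.dim_ID
where
   d.dimProperty <> s.dimProperty;

update d
set
   d.dimProperty = s.dimProperty -- then a type 1 style destructive update
from
   Dimension_Current d
join
   dimStaging s
on
   s.dim_ID = d.dim_ID
where
   d.dimProperty <> s.dimProperty;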
2.5. Type V
Taking ideas from type 2, type 5, depicted in Figure 5, complements the dimension with the column dim_Current_ID, making it easy to
find the current information. Where a sin-
gle join from the fact to the dimension table
yields a toy connection, the tiy connection
is only a second join away. Performance is
hampered by the fact that searches need to
be done over non-primary keys, which necessitates secondary indexes. Fur-
thermore, yit still needs custom queries, and
whenever a value changes updates are needed
on existing rows, to repoint these towards the
new latest row.
Figure 5: Dimensional model, with a slowly changing dimension of type 5. [Fact (dim_ID, factDate, factMeasure) references Dimension_SCD5 (dim_ID, dim_ValidFrom, dimProperty, dimStatic, dim_Current_ID), where dim_Current_ID is a self-reference to the row currently in effect.]
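A minimal sketch of the repointing needed in type 5 after new version rows have been added (the insert itself resembling the type 2 case above); every existing row for a changed member must have its dim_Current_ID updated to reference the new latest row:

update d
set
   d.dim_Current_ID = latest.dim_ID -- every version row of the member now points at the latest row
from
   Dimension_SCD5 d
join
   Dimension_SCD5 latest
on
   latest.dimStatic = d.dimStatic
where
   latest.dim_ValidFrom = (
      select max(x.dim_ValidFrom)
      from Dimension_SCD5 x
      where x.dimStatic = d.dimStatic
   );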
2.6. Type VI
Trying to remedy some of the disadvantages
of type 2, type 6, depicted in Figure 6, shares almost the same structure, but with a different primary key. The advantage of this type is that no part of the dimension needs to remain static, since the same dim_ID will remain across changes, thanks to dim_ValidFrom now being part of the primary key.
Figure 6: Dimensional model, with a slowly changing dimension of type 6. [Fact (dim_ID, factDate, factMeasure) and Dimension_SCD6 (dim_ID, dim_ValidFrom, dimProperty); no enforced relationship exists between the two tables.]
The sacrifice being made is that referen-
tial integrity no longer can be maintained, as
only half of that key is present in the fact ta-
ble. While this gives a lot of flexibility, with
no preference towards either of tiy, toy, or
yit, problems arise if this model is the source
for a cube, or any tool that requires referential
integrity. A possible circumvention, although
not mentioned in literature but probably ex-
isting in practice, would be to create regular
views corresponding to tiy and toy, which
the fact table can reference using a logical key
in the tool utilizing the model.
2.7. Type VII
Combining ideas from type 2 and type 5,
type 7, depicted in Figure 7, also adds a col-
umn to keep track of the current information.
Rather than referencing this from the dimen-
sion, as in type 5, the fact table is extended
with a column. Compared to type 5 the cur-
rent information is now a single join away
instead of needing a double join.
Figure 7: Dimensional model, with a slowly changing dimension of type 7, which extends the fact table with a foreign key column referencing the current dimension information. [Fact_SCD7 (dim_ID, factDate, factMeasure, dim_Current_ID) references Dimension_SCD2 (dim_ID, dim_ValidFrom, dimProperty, dimStatic) twice: through dim_ID and through dim_Current_ID.]
While type 7 has the same issues as type 5 with additional indexing and requiring a static part, the reduced number of joins should
improve performance at read time. The trade-
off is that more rows need updating at write
time, since now fact rows have to be updated
instead of dimension rows, and they tend to
be much more numerous.
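The corresponding repointing for type 7 touches the fact table instead; a minimal sketch, run after a new version row has been inserted into the dimension:

update f
set
   f.dim_Current_ID = latest.dim_ID -- every affected fact row is repointed to the new latest version
from
   Fact_SCD7 f
join
   Dimension_SCD2 d
on
   d.dim_ID = f.dim_ID
join
   Dimension_SCD2 latest
on
   latest.dimStatic = d.dimStatic
where
   latest.dim_ValidFrom = (
      select max(x.dim_ValidFrom)
      from Dimension_SCD2 x
      where x.dimStatic = d.dimStatic
   );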
2.8. Junk and Mini Dimension

Figure 8: Dimensional model, with a rapidly changing dimension, through mini and junk dimensions. [Fact (dim_ID, factDate, factMeasure) references Dimension (dim_ID); Dimension_Mini (dim_ID, dim_Junk_ID, dim_ValidFrom) references both Dimension and Dimension_Junk (dim_Junk_ID, dimProperty).]
A final approach, depicted in Figure 8, has not been given a type number but is still a construction that aims to manage change. The introduc-
tion of a “junk” and “mini” dimension is rec-
ommended when changes occur more rapidly
than what is suitable for a slowly changing di-
mension. It is not clear exactly what rapidity
necessitates this, but most likely when some
part of the dimension changes considerably
more often than other parts.
The parts that change rapidly, dimProperty in our case, are moved to a junk dimension,
having its own primary key. Parts that are
static or change slowly may remain in the
original dimension, which is of a suitable
slowly changing dimension type. All that
remain in the original dimension in the exam-
ple is
dim ID
. A mini dimension is then intro-
duced that references both the original dimen-
sion and the junk dimension, in which the
primary key is extended with dim_ValidFrom.
This is actually the only model among these types in which referential integrity is maintained while also being agnostic with respect to tiy, toy, and
yit. It suffers from now needing three joins
in order to assemble the information, along
with complex query logic. Furthermore, it
relies upon the modeller making perfect as-
sumptions about which columns will change
rapidly, slowly or not at all. Of course, all
columns could be moved to the junk dimen-
sion, but if they change at very different rates,
a lot of information will be unnecessarily duplicated.
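The three joins needed to assemble the information can be sketched as follows, a minimal example under the model in Figure 8 that fetches the junk values in effect at an arbitrarily chosen point in time:

select
   f.dim_ID,
   f.factDate,
   f.factMeasure,
   j.dimProperty
from
   Fact f
join
   Dimension d
on
   d.dim_ID = f.dim_ID
cross apply (
   select top 1
      m.dim_Junk_ID
   from
      Dimension_Mini m
   where
      m.dim_ID = d.dim_ID
   and
      m.dim_ValidFrom <= '20180101' -- the point in time of interest
   order by
      m.dim_ValidFrom desc          -- latest version before that point
) mini
join
   Dimension_Junk j
on
   j.dim_Junk_ID = mini.dim_Junk_ID;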
3. Temporal Dimension
Surprisingly, there is a model that is simpler
in nature than many of the ones already pre-
sented that maintains referential integrity, is
agnostic with respect to how the informa-
tion should be temporally retrieved, and that
has very respectable read and write perfor-
mance. The key is the realization that tem-
poral information benefits from having its
immutable and mutable parts separated from
each other and that minimal assumptions
should be made concerning the future of the
information being modeled.
Figure 9: Dimensional model, with the temporal dimension introduced in this paper, along with an optional “valid from” column in the fact table. [Fact (dim_ID, factDate, (dim_ValidFrom), factMeasure) references Dimension_Immutable (dim_ID), which is also referenced by Dimension_Mutable (dim_ID, dim_ValidFrom, dimProperty).]
The construction, hereafter dubbed the temporal dimension, can be seen in Figure 9. It has two parts, a mu-
table table similar to a type 6 slowly changing
dimension and an immutable table, similar to
what would remain of the original dimension
in the junk dimension scenario on the assump-
tion that all columns may change. Should
more properties be available in the dimen-
sion and those change at very different rates,
each can reside in its own mutable table, still
referencing the same immutable table.
While it may look strange to introduce an
immutable table with a single column, in be-
tween the fact and dimension table, it ensures
that referential integrity can be maintained
by the database. Perhaps even stranger is the
fact that the query optimizer can use these
constraints to completely eliminate this table
during the execution of retrieval queries, di-
rectly joining the fact and the mutable table.
The immutable table will not be physically
touched, and therefore behaves only like a
logical modeling construct, not adversely af-
fecting performance.
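As a small illustration of this point, consider a query phrased through all three tables of Figure 9 (using the table definitions later shown in Figure 10). Since the foreign key from the fact table to the immutable table is trusted and no other column of the immutable table is used, the optimizer can typically remove that join altogether; the query below simply pairs each fact with all recorded versions of its dimension member, only to show the join shape:

select
   f.factDate,
   f.factMeasure,
   dm.dim_ValidFrom,
   dm.dimProperty
from
   Fact f
join
   Dimension_Immutable di -- this join can be eliminated by the optimizer
on
   di.dim_ID = f.dim_ID
join
   Dimension_Mutable dm
on
   dm.dim_ID = di.dim_ID
where
   f.factDate between '20140101' and '20141231';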
The optional dim_ValidFrom column in the
fact table is used to retain which version is in
effect with respect to the factDate. This will
speed up toy queries to some extent, at the
cost of enlarging the fact table, which then
slightly slows down tiy and yit queries. De-
pending on the requirements this column may
be omitted.
4. Comparative Experiment
Experiments have been made in which read
and write performance have been tested for types 1–7 and junk, compared with the presented temporal dimension. Read performance is further classified into the three query types necessary to retrieve information in tiy, toy, and yit. Write performance is
measured separately for adding and updat-
ing rows in the dimension and fact tables.
4.1. Configuration
The tests were carried out on a server
equipped with an Intel i7 4770 4-core CPU,
32 GB of RAM, an Intel 750 PCIe NVMe SSD
as data disk, and four RAID-striped Samsung
840 Pro SATA SSDs assembled as a temp disk. The server was running a 64-bit Microsoft Windows® operating system and the chosen database engine was Microsoft SQL Server® 2017 Developer Edition.
The initial dimension data consists of 2^20 (1 048 576) unique values for dim_ID, for which one value changes 2^20 times, two values change 2^19 times, four values change 2^18 times, ..., 2^18 values change two times, and 2^19 values change one time, thus representing both slowly and rapidly changing values. This resulted in 21 495 808 dimension rows in
total. The times these changes were made,
captured by dim_ValidFrom, will range from the end of 2017 back to January 2008, with the most rapidly changing every five minutes. The initial fact data consists of 10 rows added every minute over the same time range, but captured by factDate, with each row referencing a randomly selected dim_ID. This random distribution is squarely skewed towards the values in the dimension that change more often. This resulted in 52 427 836 fact rows in total.
These values were chosen in order to
achieve query execution times in the range
of seconds to minutes, which was considered
long enough to not be disturbed by other
factors and short enough to make the exper-
iment feasible. They also represent the con-
cept of microbatches, achieving near real-time
updated data. For testing the write perfor-
mance a larger batch had to be constructed to
achieve consistent measures during the exper-
iment. The method of loading is through reg-
ular SQL statements [2]. For writing to the dimension, all dimProperty values related to the 1 048 576 unique dim_IDs receive an update in the beginning of 2018, and equally many new values are added. One fact row per dim_ID is also added, totalling 2 097 152 new rows. In
order to ensure that the tests are as equal as
possible with respect to the data, the tempo-
ral dimension model was populated first, and
the others populated from it.
Figure 10 begins with a code sample showing the creation of the tables for the temporal dimension. Both the fact table and the mutable dimension table are partitioned, by the year of factDate and the year of dim_ValidFrom respectively, to liken them with actual implementations. Tables for types 1–7 and junk are created in a similar manner, following the models presented earlier. Partitioning can be and is used for all types where dim_ValidFrom is part of the primary key. All code for reproducing the tests is available online (Rönnbäck 2018) or by request to the corresponding author.
create table Dimension_Immutable (
   dim_ID int not null,
   primary key (dim_ID asc)
);

create table Dimension_Mutable (
   dim_ID int not null,
   dim_ValidFrom smalldatetime not null,
   dimProperty char(42) not null,
   foreign key (dim_ID) references
      Dimension_Immutable (dim_ID),
   primary key (dim_ID asc, dim_ValidFrom desc)
) on Yearly(dim_ValidFrom);

create table Fact (
   dim_ID int not null,
   factDate smalldatetime not null,
   factMeasure smallmoney not null,
   foreign key (dim_ID) references
      Dimension_Immutable (dim_ID),
   primary key (dim_ID asc, factDate asc)
) on Yearly(factDate);

create function pitDimension (
   @timepoint smalldatetime
)
returns table as return
select
   dm_in_effect.dim_ID,
   dm_in_effect.dim_ValidFrom,
   dm_in_effect.dimProperty
from (
   select
      *,
      row_number() over (partition by dim_ID order by dim_ValidFrom desc) as ReversedVersion
   from
      Dimension_Mutable
   where
      dim_ValidFrom <= @timepoint
) dm_in_effect
where
   dm_in_effect.ReversedVersion = 1;

Figure 10: Code sample showing how to create the temporal dimension introduced in this paper, followed by the point in time parametrized view over the temporal dimension, through which the rows in effect at the given point in time can be fetched.
[2] This is sometimes referred to as ELT, as in Extract, Load, then Transform. While Microsoft SQL Server comes with its own streaming data ETL tool, Integration Services, it only supports type 2 out of the box.
4.2. Preparation
To further liken the experimental setup with a
real world scenario, parametrized views were
created that encapsulate common operations.
Such views are inlined by the database engine,
“search and replaced” in the query, before it
is passed to the optimizer, and should there-
fore have no performance impact compared
to having typed the entire statement to begin
with.
The parametrized view in the bottom half of Figure 10 is created using a windowed function that helps find the information in effect. This is just one of many conceivable ways to find the information in effect at a given point in time, and several methods were evaluated with respect to performance. For each type of slowly changing dimension, the best performing method of those described by (Hutmacher 2016) was picked for the experiment. For models in which the dimStatic column is present, this instead meant using a cross apply combined with an ordered top 1 operation.
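A minimal sketch of such a cross apply, here written against the type 2 dimension of Figure 2, picking the version in effect at a given point in time for each member:

select
   f.dim_ID,
   f.factDate,
   f.factMeasure,
   v.dimProperty
from
   Fact f
join
   Dimension_SCD2 d
on
   d.dim_ID = f.dim_ID
cross apply (
   select top 1
      x.dimProperty
   from
      Dimension_SCD2 x
   where
      x.dimStatic = d.dimStatic
   and
      x.dim_ValidFrom <= '20180101' -- the point in time of interest
   order by
      x.dim_ValidFrom desc          -- ordered top 1 gives the version in effect
) v;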
While the point in time parametrized
views provide an efficient way to find infor-
mation in effect at a specific point in time,
as for tiy and yit, they proved to be a poor
choice for toy queries. Trying to cross ap-
ply such a view for every factDate yielded a
query execution estimate in which billions of
operations would take place. An attempt to
run such a query was halted after four hours,
during which numerous drawings saw the
dawn of light on a whiteboard. Among these
was one in which two timelines were drawn
close together, from which the idea of a twine
came.
The parametrized view in Figure 11 shows
how two timelines are intertwined in order
to create a lookup table between the fact and
dimension keys for the information in effect.
After twining the timelines, a conditional cu-
mulative max-operation is used with a win-
dow to find the latest value from one of the
timelines for each value in the other timeline
over each dim_ID. Using a twine, the “nev-
erending” query now took seconds to run. As
it is a simple trick, it is bound to have been
discovered by practitioners, but to the best
knowledge of the authors it is neither scientif-
ically documented nor practically discussed
in conjunction with slowly changing dimen-
sions. As such, it should be of significant in-
terest because of its tremendous performance
boost. The authors have already had uses
for twines in several situations outside of the
dimensional modeling domain.
create function twineFact (
   @fromTimepoint smalldatetime,
   @toTimepoint smalldatetime
)
returns table as return
select
   in_effect.dim_ID,
   in_effect.factDate,
   in_effect.dim_ValidFrom
from (
   select
      twine.dim_ID,
      twine.Timepoint as factDate,
      twine.Timeline,
      max(case when Timeline = 'D' then Timepoint end) over (partition by dim_ID order by Timepoint) as dim_ValidFrom
   from (
      select
         dim_ID,
         factDate as Timepoint,
         'F' as Timeline
      from
         dbo.Fact
      where
         factDate between @fromTimepoint and @toTimepoint
      union all
      select
         dim_ID,
         dim_ValidFrom as Timepoint,
         'D' as Timeline
      from
         dbo.Dimension_Mutable
      where
         dim_ValidFrom <= @toTimepoint
   ) twine
) in_effect
where
   in_effect.Timeline = 'F';

Figure 11: Code sample showing how to create the twine parametrized view, intertwining the timelines of the fact and dimension tables between the given time points by using a conditional cumulative max-operation, windowed to find the value in effect at each point of the timeline and for each dim_ID. It yields a significant performance gain for toy type queries, compared to other known methods.
4.3. Approach
With these views in place, the actual test was
set up. For each type of query, reading tiy, yit, toy and writing to the dimension and fact tables, a loop per dimension type was created. This loop was run four times when measuring read times, with the first iteration being a “dry run”, after which all column statistics are updated. This is because the optimizer will automatically create statistics on the first run where it deems it suitable, slightly slowing down the query and thereby disturbing the test results. Furthermore, as this is done during execution, the statistics are only based on a sample in order to not affect the performance too adversely. The explicit update of the statistics is instead based on full table scans. When measuring write times, three runs without a “dry run” were used.
The loop for queries that read data is struc-
tured as follows. The first part clears all
caches possible from within SQL Server, the
second sets a starting time with nanosecond
precision, the actual query is silently run with
its results redirected to a temporary table, af-
ter which an ending time is set. Similarly, the
loop for queries that write data is structured
as follows. The first part clears all caches, a
transaction is opened, a dimension starting
time is set, the query writing data to the di-
mension table is run, an ending time is set,
a fact starting time is set, the query writing
data to the fact table is run, and an ending time
is set, after which the queries are rolled back.
The rollback is necessary in order for each
run to be equal and its time is not taken into
account.
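A minimal sketch of one read-timing iteration; the cache-clearing commands and the TestResult table are assumptions about how such a harness could look, not quotes from the published suite:

checkpoint;                -- flush dirty pages so the buffer pool can be cleared
dbcc dropcleanbuffers;     -- clear the buffer pool
dbcc freeproccache;        -- clear cached query plans

declare @start datetime2(7) = sysdatetime();   -- starting time

select
   in_effect.dimProperty,
   count(*) as numberOfFacts,
   avg(f.factMeasure) as avgMeasure
into
   #timed_result                               -- results silently redirected to a temporary table
from
   Fact f
join
   pitDimension('20180101') in_effect
on
   in_effect.dim_ID = f.dim_ID
where
   f.factDate between '20140101' and '20141231'
group by
   in_effect.dimProperty;

declare @stop datetime2(7) = sysdatetime();    -- ending time

insert into TestResult (dimensionType, queryType, milliseconds) -- hypothetical results table
values ('Temporal', 'tiy', datediff(millisecond, @start, @stop));

drop table #timed_result;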
For both reading and writing, the differ-
ence in milliseconds between the starting and
the ending time is stored for each iteration in
a separate table. From these the median value
is selected. A margin of error was also cal-
culated and monitored. Should it have been
large, indicating that something disturbed the
testing, the tests would have been run again,
but this never happened. Most of the tests
were nevertheless run many times, between
which refinements were made. Values for
dimProperty
and
factMeasure
were chosen in
a way that verifications could be made to
ensure that the results produced from the dif-
ferent queries were identical.
select
   in_effect.dimProperty,
   f.numberOfFacts,
   f.avgMeasure
into
   #result_tiy
from (
   select
      dim_ID,
      count(*) as numberOfFacts,
      avg(factMeasure) as avgMeasure
   from
      Fact
   where
      factDate between '20140101' and '20141231'
   group by
      dim_ID
) f
join
   pitDimension('20180101') in_effect
on
   in_effect.dim_ID = f.dim_ID;

select
   dm.dimProperty,
   count(*) as numberOfFacts,
   avg(f.factMeasure) as avgMeasure
into
   #result_toy
from
   twineFact('20140101', '20141231') in_effect
join
   Fact f
on
   f.dim_ID = in_effect.dim_ID
and
   f.factDate = in_effect.factDate
join
   Dimension_Mutable dm
on
   dm.dim_ID = in_effect.dim_ID
and
   dm.dim_ValidFrom = in_effect.dim_ValidFrom
group by
   dm.dimProperty;

Figure 12: Code sample showing the performance test query for tiy and yit, joining the point in time view for the temporal dimension (but with “20140101” passed to the view for yit), followed by the test query for toy, using a twine to find the version in effect.
An example query, for tiy, is shown in Fig-
ure 12 using the point in time view from the
temporal dimension. As seen, the query calcu-
lates the number of fact rows and the average
of factMeasure per dimProperty. For the par-
ticular query, this resulted in 1 014 754 rows, which is slightly less than the unique number of dim_IDs, as the randomization did not
exhaust all possible references between facts
and dimension values. For some types, in
which fact rows are already joined to the cur-
rent version, tiy can be found through a direct
join of the fact and dimension table. The same
calculation is also used for yit and toy and
for all types of dimensions, which for yit is
as simple as passing a different date to the
point in time view, but for toy may involve a
twine, depending on the type.
4.4. Results
To summarize the results, it clearly comes
across that the different types of dimensions
are geared towards specific use cases. These
preferences fall into four categories: fast re-
trieval of the current version, the previous
version, the historically correct version, or be-
ing temporally agnostic. In the first three,
fact rows are temporally bound to a particu-
lar version in the dimension, whereas for the
agnostic preference, fact rows and dimension
rows are temporally independent. Temporal
independence has faster write performance
at the cost of slower read performance.
4.5. General Performance

          Temporal  Optional  Type 1  Type 2  Type 3  Type 4  Type 5  Type 6  Type 7   Junk
tiy           10.4      10.2     0.8     9.1     0.7     0.7     3.6     9.9     2.8    6.6
yit            6.4       6.7     n/a     9.1     0.7    60.2     9.1     6.3     9.1    4.1
toy           53.5      23.7     n/a     3.2     n/a    53.7     3.0    53.2     2.9   53.8
Dim           14.2      14.0     3.3    23.9     4.2     7.0   271.2     7.2    31.8   30.1
Fact          10.8      10.9    10.8    74.2    10.9    10.9    74.9     7.2   254.2   10.8

Table 1: Median execution times (in seconds) for queries reading (tiy, yit, toy) and writing (Dim, Fact) for the introduced temporal (and optional) dimension and each type of slowly changing dimension. The “skyline” presents the total execution time for all queries and type of query.
As can be seen in Table 1 there is no clear
winner if taking both reading and writing
into account. Favoring the current version are
types 1, 3, 4, 5, and 7, favoring the previous
is only type 3, and favoring the historically
correct are types 2, 5, and 7. Favoring no par-
ticular version are temporal, optional, type 6,
and junk. If only current values are of in-
terest, type 1 is the best choice, and if only
very few versions suffice, type 3 may be an
option. If ad-hoc querying over versions is
necessary, requirements have to be weighed
against performance considerations. Whilst
those types favoring a particular version per-
form particularly well when retrieving that
version, they suffer at write time because of
additional logic to include the required tem-
poral bond.
If, for example, a read intensive environment requires historically correct versions in most use-cases and only rarely the version in effect at an arbitrary point in time, then type 2 is the best choice, given that
write performance can be sacrificed. Loading
the type 2 dimension table is 50% slower than
for temporal and six times slower than the
update done by type 1. Loading the fact table
is seven times slower than for temporal and
type 1, due to a twine being necessary, which cannot be done as efficiently since finding the version in effect relies on dimStatic.
If both historically correct and current ver-
sions are common in use-cases, but still rarely
the version in effect at an arbitrary point in
time, then type 5 or type 7 could be a choice,
but only after carefully considering the addi-
tional sacrifice in write performance. Writing
to the dimension table of type 5 is seventeen
times slower than temporal and writing to
the fact table of type 7 is twenty-four times
slower than temporal, due to the updates that
both find the version in effect and then set
it for each row. Maintaining a type 5 or type 7
dimension will over time become an insur-
mountable task.
If both current and the version in effect
at an arbitrary point in time are common in
use-cases, then our temporal and optional, as
well as type 6 and junk are good candidates.
While junk has very similar read performance
to temporal, the time it takes to load its di-
mension tables is doubled in comparison. For
type 6, showing the best write performance
of all types that retain a complete history of
changes, the difference seen is explained by
it having sacrificed referential integrity. How-
ever, as referential integrity must be guaran-
teed at some point, type 6 makes this “some-
body else’s problem”, where it nevertheless
is bound to hog some resources.
4.6. Suitability for Near Real-Time
To determine the suitability of each type of
dimension in a microbatched near real-time
data warehouse, a hypothetical data ware-
house system is envisioned. This hypothetical
system can either perform reading or writing,
with one blocking the other. Simplistic, but
not entirely unlike a real system. Assum-
ing such a system is approaching real-time,
there are four possible scenarios: write often —
read often, write often — read seldom, write
seldom — read often, and write seldom —
read seldom. Looking at the last of these, it
is easy to be misled into thinking that such a
scenario is never coinciding with the require-
ments of real-time decision support. This is
not the case as the determining characteris-
tic of real-time is latency (Bateni et al. 2011).
Even in a write seldom — read seldom sce-
nario, the rare read may take place arbitrarily
close to the rare write and it may be of utmost
importance that the data returned is as fresh
as possible.
In order to reduce latency in the hypo-
thetical system, the write needs to block as
few reads as possible and likewise for the
other way around. Types 5 and 7 are very
imbalanced in this respect, also having the
two poorest write performances, and cannot
be recommended for near real-time require-
ments. The imbalance is also quite large for
type 2, ranking in the bottom three for write
performance together with types 5 and 7.
Type 4 has poor read performance and types 1
and 3 may be disqualified for not maintain-
ing a complete history of changes, leaving
temporal, optional, type 6, and junk as the
candidates.
We deliberately chose to ignore a poten-
tial issue when setting up the experiment.
Temporally binding rows to versions depends on the assumption that data has been
arriving synchronously over time, such that
all existing fact rows always reference the cor-
rect version. In practice, this does not always
happen, and retrospective modifications have
to be made, adversely affecting the total main-
tenance time for types 1, 2, 3, 4, 5, and 7. As
there is no way to estimate the impact, which depends on how large the error is, this may
range from a quick fix to lengthy outages, un-
acceptable in a near real-time environment.
This again leaves temporal, optional, type 6,
and junk as the candidates.
4.7. Simplicity
Looking at the different models, junk is the
most complex. Type 4 adds complexity in
the form of a separate history table, and types 2, 5,
and 7 rely on some other column being static.
Type 3 could lead to unwieldy tables due to
the large number of columns. Type 6 is the
second best with respect to model simplicity,
following type 1, which naturally has the sim-
plest model. The temporal dimension is very
similar to type 6, adding an immutable table
between the fact and dimension table, still
keeping it on the simpler side. It also makes
no assumptions, allowing for every part of
the dimension to change. Some lineage must
exist if never before seen data is about to be
loaded though, making it possible to deter-
mine if it represents a change or an addition.
Querying becomes slightly more complex
for the temporal dimension than in the tem-
porally dependent types, but is helped by
“hiding” the logic behind the point in time
and twine parametrized views. Every type
needed a point in time view for yit, except
type 3, but under the presumption that keep-
ing a single old version is enough. The
types 4, 6, junk, and temporal all need the
twine for toy. The “valid from” column in
optional adds complexity to the benefit of not
having to twine. This again, though, assumes
that once that time point is set it is correct
and will never have to be changed. Given
that reading times for toy were only halved,
this may not be advantageous enough to war-
rant its introduction.
5. Related Research
A lot of research has been done concerning
performance of non-temporal dimensional
models, many of which use Star Schema
Benchmark, a modification of TPC-H (Rabl
et al. 2013; O’Neil et al. 2009). More recent
research tends to use the newer TPC-DS bench-
mark model (Nambiar and Poess 2006), but
only one of its dimensions has temporal char-
acteristics in the form of type 5. Given that established benchmarks have thus overlooked temporal requirements, the decision was made to produce our own test data.
A brief comparison of a subset of the types
of slowly changing dimensions has been done
by (V. Santos and Belo 2011), where they, using only logical reasoning, come to the conclu-
sion that type 4 is superior to all other types.
Given our findings, that seems like a poor
conclusion. No previous work contrasting all
types of dimensions could be found.
(Johnston 2014) discusses the need for new
thinking when it comes to slowly changing
dimensions and suggests a bitemporal dimen-
sion, albeit not readily implementable in cur-
rent database technologies. His dimension is
type 2-like as it relies on a “durable identifier”
apart from the unique surrogate primary key,
and therefore cannot be said to be completely
temporally independent.
Big Data platforms, like Hadoop, rely on
immutability of existing data. (Akinapelli,
Shetye, and Sangeeta 2017) points out that
slowly changing dimensions are particularly
hard to manage in this respect. Thanks to
the temporal decoupling between facts and
dimensions when using our temporal dimen-
sion, databases can be insert only, which may
open up easier implementations on such
platforms.
6. Conclusions and Further Research
Most, if not all, analytical tools have support
for slowly changing dimensions, but to vary-
ing degrees with respect to different types.
To name a few: Tableau, MicroStrategy,
Alteryx, SAP, Cognos, SAS, Analysis Services,
and Power BI. Many database vendors also
have guides for how to implement them, like
Microsoft, Oracle, Vertica, Hive, SAP HANA,
PostgreSQL, and IBM DB2 to name a few.
ETL tools have special adapters for loading
data into slowly changing dimensions, like In-
formatica, Microsoft Integration Services, Or-
acle Data Integrator, Pentaho, IBM DataStage,
Apache Kafka, and others. This, if nothing
else, should indicate that slowly changing di-
mensions are a hot topic, but as it stands,
with little scientific grounding. We aimed
to put them on better footing by challeng-
ing traditional types with a type of our own.
There may still be situations where the traditional types are preferable, but the
introduced temporal dimension fares so well
that we suggest it be the default choice when
a history of changes is required. This suggestion is further accentuated when a data
warehouse approaches near real-time updates.
To summarize its advantages:

- It is temporally independent from the fact table, such that it may be refreshed asynchronously and favors no particular version in time.
- It has a simple model that extends upon already familiar types and it does not rely upon any part of the dimension being static.
- It is free from assumptions about the future and it may even hold future versions that will come into effect at some later time.
- It is modular and parts that change with different rates can be broken apart while keeping the same immutable connection to the fact.
- It maintains referential integrity in the database and when changes occur only inserts are used to capture these, never updates.
- Its performance is close to or better than all other types with similar capabilities, when both reading and writing are weighed in.
Along the way, we also discovered the
twine: a new strategy for finding the ver-
sion in effect for each fact by the times these
facts were recorded. The twine is essentially
a union of timelines that is sorted with an
additional column containing a conditionally
repeated value. It performs much better than
any approach trying to produce a temporal
join through some inequality condition. Twin-
ing may help many existing implementations
gain better performance, as it can be used
in any scenario where a temporal join is re-
quired.
There are many avenues for further research. It would be interesting to utilize temporal tables, added in the SQL:2011 standard (Kulkarni and Michels 2012), to see
if the results would be the same. We used
SQL to drive the loading of the dimensions,
so an investigation into how streaming data
ETL tools compare is warranted. Running the
same data set and queries in other database
engines could give insights into possible in-
ternal optimizations that may have affected
our experiment. Extending the idea of our
temporal dimension to become bitemporal
should be straightforward as it is already tem-
porally decoupled from the fact table. Find-
ing the threshold for when changes are rapid
enough compared to slower changing parts
that breaking the mutable dimension apart
would yield even better performance is also of
interest. Skewing the data differently in our
experiment may also produce different results.
Combining the idea of temporal dimensions
with slowly changing measures (Goller and
Berger 2013) is also highly relevant.
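As an illustration of the first of these avenues, a minimal sketch of how the mutable part could be declared as a system-versioned temporal table in Microsoft SQL Server; note that this is an assumption about one possible setup and that system versioning records transaction time, not the user-supplied dim_ValidFrom, which is exactly the kind of difference such an investigation would have to address:

create table Dimension_Mutable_Versioned (
   dim_ID int not null,
   dimProperty char(42) not null,
   ValidFrom datetime2(7) generated always as row start not null,
   ValidTo datetime2(7) generated always as row end not null,
   period for system_time (ValidFrom, ValidTo),
   primary key (dim_ID asc)
) with (system_versioning = on (history_table = dbo.Dimension_Mutable_History));

-- the version in effect at a given point in time can then be fetched as
select dim_ID, dimProperty
from Dimension_Mutable_Versioned
for system_time as of '20180101';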
References
Ahmed, Waqas, Esteban Zimányi, and Robert Wrembel (2015). “Temporal Data Warehouses: Logical Models and Querying.” In: EDA, pp. 33–48.
Akinapelli, Sandeep, Ravi Shetye, and T
Sangeeta (2017). “Herding the elephants:
Workload-level optimization strategies for
Hadoop.” In: EDBT, pp. 699–710.
Araque, Francisco (2003). “Real-time Data
Warehousing with Temporal Require-
ments.” In: CAiSE Workshops, pp. 293–304.
Azvine, Behnam, Zheng Cui, and D Di Nauck
(2005). “Towards real-time business intel-
ligence”. In: BT Technology Journal 23.3,
pp. 214–225.
Bateni, MohammadHossein et al. (2011).
“Scheduling to minimize staleness and
stretch in real-time data warehouses”. In:
Theory of Computing Systems 49.4, pp. 757–
780.
Bliujute, Rasa et al. (1998). Systematic change
management in dimensional data warehousing.
Tech. rep. Citeseer.
Body, Mathurin et al. (2002). “A multidi-
mensional and multiversion structure for
OLAP applications”. In: Proceedings of the
5th ACM international workshop on Data
Warehousing and OLAP. ACM, pp. 1–6.
Golec, Darko, Viljan Mahnič, and Tatjana Kovač (2017). “Relational model of temporal data based on 6th normal form”. In: Tehnički vjesnik 24.5, pp. 1479–1489.
Golfarelli, Matteo and Stefano Rizzi (2009).
“A survey on temporal data warehousing”.
In: International Journal of Data Warehousing
and Mining (IJDWM) 5.1, pp. 1–17.
Goller, Mathias and Stefan Berger (2013).
“Slowly changing measures”. In: Proceed-
ings of the sixteenth international workshop on
Data warehousing and OLAP. ACM, pp. 47–
54.
Hultgren, Hans (2012). Modeling the agile data
warehouse with data vault. New Hamilton.
Hutmacher, Daniel (2016). Last row per group.
URL: https://sqlsunday.com/2016/04/11/last-row-per-group/ (visited on 07/11/2018).
Inmon, William H (1992). Building the data warehouse.
Johnston, Tom (2014). Bitemporal data: theory
and practice. Newnes.
Jörg, Thomas and Stefan Dessloch (2009). “Near real-time data warehousing using state-of-the-art ETL tools”. In: International
Workshop on Business Intelligence for the
Real-Time Enterprise. Springer, pp. 100–117.
Kimball, Ralph (1996). The Data Warehouse
Toolkit: Practical Techniques for Building Di-
mensional Data Warehouses. New York, NY,
USA: John Wiley & Sons, Inc. isbn: 0-471-
15337-0.
Kimball, Ralph (2008). “Slowly changing dimensions”. In:
Information Management 18.9, p. 29.
Kulkarni, Krishna and Jan-Eike Michels
(2012). “Temporal features in SQL: 2011”.
In: ACM Sigmod Record 41.3, pp. 34–43.
Nambiar, Raghunath Othayoth and Meikel
Poess (2006). “The making of TPC-DS”. In:
Proceedings of the 32nd international confer-
ence on Very large data bases. VLDB Endow-
ment, pp. 1049–1058.
O’Neil, Patrick et al. (2009). “The star schema
benchmark and augmented fact table
indexing”. In: Technology Conference on
Performance Evaluation and Benchmarking.
Springer, pp. 237–252.
Paulley, Glenn Norman (2000). “Exploiting functional dependence in query optimization”.
Rabl, Tilmann et al. (2013). “Variations of the
star schema benchmark to test the effects
of data skew on query performance”. In:
Proceedings of the 4th ACM/SPEC Interna-
tional Conference on Performance Engineering.
ACM, pp. 361–372.
Rönnbäck, Lars (2018). Temporal Dimensional Modeling Performance Suite. URL: https://github.com/Roenbaeck/tempodim/ (visited on 02/01/2019).
Rönnbäck, Lars et al. (2010). “Anchor modeling—Agile information modeling in evolving data environments”. In: Data & Knowledge Engineering 69.12, pp. 1229–1253.
Ross, Margy (2013). “Design Tip #152 Slowly
Changing Dimension Types 0, 4, 5, 6 and
7”. In: Kimball Group.
Russom, Philip, David Stodder, and Fern
Halper (2014). “Real-time data, BI, and an-
alytics”. In: Accelerating Business to Lever-
age Customer Relations, Competitiveness, and
Insights. TDWI best practices report, fourth
quarter, pp. 5–25.
Santos, Ricardo Jorge and Jorge Bernardino
(2008). “Real-time data warehouse load-
ing methodology”. In: Proceedings of the
2008 international symposium on Database
engineering & applications. ACM, pp. 49–
58.
Santos, Vasco and Orlando Belo (2011). “No
need to type slowly changing dimen-
sions”. In: IADIS International Conference
Information Systems, pp. 11–13.
Slivinskas, G et al. (1998). “Systematic Change Management in Dimensional Data Warehousing”.
Vassiliadis, Panos and Alkis Simitsis (2009).
“Near real time ETL”. In: New Trends
in Data Warehousing and Data Analysis.
Springer, pp. 1–31.
Wikipedia (2019). Slowly Changing Dimension. URL: https://en.wikipedia.org/wiki/Slowly_changing_dimension (visited on 02/01/2019).