Advanced data analysis typically requires some form of pre-processing in order to extract and transform data before processing it with machine learning and statistical analysis techniques. Pre-processing pipelines are naturally expressed in dataflow APIs (e.g., MapReduce, Flink, etc.), while machine learning is expressed in linear algebra with iterations. Programmers therefore perform end-to-end data analysis utilizing multiple programming paradigms and systems. This impedance mismatch not only hinders productivity but also prevents optimization opportunities, such as sharing of physical data layouts (e.g., partitioning) and data structures among different parts of a data analysis program. The goal of this work is twofold. First, it aims to alleviate the impedance mismatch by allowing programmers to author complete end-to-end programs in one engine-independent language that is automatically parallelized. Second, it aims to enable joint optimizations over both relational and linear algebra. To achieve this goal, we present the design of Lara, a deeply embedded language in Scala which enables authoring scalable programs using two abstract data types (DataBag and Matrix) and control flow constructs. Programs written in Lara are compiled to an intermediate representation (IR) which enables optimizations across linear and relational algebra. The IR is finally used to compile code for different execution engines.
Bridging the Gap: Towards Optimization Across Linear and Relational Algebra
Andreas Kunft Alexander Alexandrov Asterios Katsifodimos Volker Markl
TU Berlin
Data analytics requirements have changed over the last decade. Traditionally confined to aggregation queries over relational data, modern analytics is focused on advanced in situ analysis of dirty and unstructured data at scale. Data sources such as log files, clickstreams, etc., are first cleansed using relational operators, and then mined using clustering and classification tasks based on statistical and machine learning (ML) methods. As a result, data cleaning and preprocessing are typically an initial step of advanced data analysis pipelines (e.g., product recommendations, statistical analysis). Moreover, the preprocessing logic and data representation often depend on the ML method that will be applied subsequently.

BeyondMR'16, June 26-July 01 2016, San Francisco, CA, USA
© 2016 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ISBN 978-1-4503-4311-4/16/06...$15.00
While relational domain-specific languages (DSLs) such as Pig [14], Hive [15], or Spark SQL/DataFrame [3] are a good fit for ETL tasks, programming machine learning algorithms in those languages is cumbersome. To this end, R-like DSLs such as SystemML's DML [10] or Apache Mahout's Samsara were proposed. These DSLs offer linear algebra and control flow primitives suitable for expressing ML algorithms, but only provide limited, non-intuitive support for classic ETL tasks. This strict separation of programming paradigms introduces three fundamental problems.
Impedance Mismatch. Large analysis pipelines have to be authored in multiple DSLs, plumbed together by additional code, and possibly executed on different systems.

Inefficient Data Exchange. Pre-processing in a relational algebra (RA) DSL and subsequent learning in a linear algebra (LA) DSL enforces materialization of the intermediate results at the boundary. Moreover, unless the staging format offers meta-information about the physical layout, this information is lost when the boundary is crossed.

Loss of Optimization Potential. Separating RA and LA into distinct DSLs entails that different, albeit similar, intermediate representations (IRs) and compilation pipelines are used for the two languages. For example, compare the compilation phases and DAG-based IRs for Hive [15] and SystemML [4]. As a result, optimizations that could be applied among the IRs (e.g., filter and projection push-down, sharing of physical layout) are currently not possible.
To overcome these problems, we argue for the unification of relational and linear algebra into a common theoretical foundation. To achieve this goal, we first need to explore and reason about optimizations across the two algebras in a suitable intermediate representation (IR). Second, we need to showcase the benefits of unification and the resulting optimizations by defining a common DSL with high-level programming abstractions for both relational and linear algebra. In line with the benefits offered by other UDF-heavy dataflow APIs, the proposed DSL should be embedded in a host language like Scala (e.g., Spark RDDs, Samsara) rather than external (e.g., Pig, DML).
In this paper we propose Lara, an embedded DSL in Scala which offers abstract data types for both relational and linear algebra (i.e., bags and matrices). We build our work on Emma [1, 2], a DSL for scalable collection-based processing, which we extend with linear algebra data types. We exploit the code rewriting facilities of the Scala programming language to lift a user program into a unified intermediate representation for joint optimization of linear and relational algebra. Given this IR, a just-in-time (JIT) compiler generates code for different execution engines (Spark and Flink).

 1  // Read measurements into the DataBags A and B
 2  val A = readCSV(...)
 3  val B = readCSV(...)
 4  // SELECT a1, ..., aN, b1, ..., bM
 5  // FROM A, B
 6  // WHERE =
 7  val X = for {
 8    a <- A
 9    b <- B
10    if ==
11  } yield (a1, ..., aN, b1, ..., bM)
12
13  // Convert DataBag X into Matrix M
14  val M = X.toMatrix()
15  // Calculate the mean of each column c of the matrix
16  val means = for ( c <- M.cols() ) yield (mean(c))
17  // Compute the deviation of each cell of M
18  // to the cell's column mean.
19  val U = M - Matrix.fill(M.numRows, M.numCols)
20            ((i,j) => means(j))
21  // Compute the covariance matrix
22  val C = 1.0 / (U.numRows - 1) * U.t %*% U
23  // Compute singular value decomposition
24  // e.g. rescale M, reduce dimensions, etc.

Listing 1: Code snippet written in Lara
A Motivating Example. Consider a set of machinery sensors found in industrial plants taking part in the production of home mixers. At the end of the production line, a percentage of those mixers is found to be defective. The goal of our analysis is to train a classifier which will predict whether a mixer has a high chance of being defective, based on the given measurements. For this task, we have to gather data from various log files residing in different production plants and join them in order to get all measurements of a mixer throughout its production. Since there are thousands of measurements per mixer and millions of mixers, we first run a Principal Component Analysis (PCA) to prune the number of measurements used for classification.
The above process can be implemented in Lara as shown in Listing 1. Lines 2 and 3 read the data from two different industrial plants before joining them (lines 7-11) to gain a full view over all measurements for each of the mixers. Note that the join is expressed as a native Scala construct, called for-comprehension (see [2] for details). Next, DataBag X, which contains all projected measurements (a1, ..., aN, b1, ..., bM), is converted into Matrix M (line 14). Line 16 computes the mean of each of the measurement columns before we compute matrix U, holding the deviation of each of M's cells from their corresponding column's mean using the fill operator (called "filling function" in [13]). Finally, line 22 computes the covariance matrix C. In the next step we would feed matrix C to the PCA algorithm, which is omitted from the example.
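The steps of Listing 1 can be mimicked on a single machine with plain Scala collections. The following is a minimal sketch under stated assumptions, not Lara code: `Row`, `join`, `columnMeans`, `center`, and `covariance` are hypothetical helpers introduced here for illustration, the join key is assumed to be a field named `id`, and the matrix is an in-memory `Vector[Vector[Double]]`.

```scala
object Listing1Sketch {
  // Hypothetical record type: a product id plus its measurements.
  final case class Row(id: Int, values: Vector[Double])

  type Matrix = Vector[Vector[Double]]

  // Equi-join on id, written as a for-comprehension (cf. lines 7-11).
  def join(a: Seq[Row], b: Seq[Row]): Matrix =
    (for {
      x <- a
      y <- b
      if x.id == y.id
    } yield x.values ++ y.values).toVector

  // Mean of every column (cf. line 16).
  def columnMeans(m: Matrix): Vector[Double] =
    m.transpose.map(col => col.sum / col.size)

  // Subtract the column mean from every cell (cf. lines 19-20).
  def center(m: Matrix): Matrix = {
    val means = columnMeans(m)
    m.map(row => row.zip(means).map { case (v, mu) => v - mu })
  }

  // Sample covariance matrix C = U^T U / (n - 1) (cf. line 22).
  def covariance(m: Matrix): Matrix = {
    val u = center(m)
    val n = u.size
    val cols = u.transpose
    cols.map(ci => cols.map(cj => ci.zip(cj).map { case (x, y) => x * y }.sum / (n - 1)))
  }
}
```

Calling `Listing1Sketch.covariance(Listing1Sketch.join(a, b))` on two small inputs reproduces the dataflow of the listing; unlike Lara, this sketch fixes the physical representation up front and leaves nothing for an optimizer to decide.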
Discussion. Observe that the ETL part of the pipeline is expressed as a declarative program of transformations over DataBags, whereas the ML part is expressed in linear algebra. Moreover, no physical execution strategy has been predetermined by the programmer. Our matrix abstraction is strongly influenced by R's matrix package and includes all common operations on matrices. In addition, there are two explicit operations to convert from a DataBag to a Matrix and vice versa. Finally, note that converting a DataBag into a Matrix does not necessarily mean that an operation is going to take place on a physical Matrix representation. For example, consider a scalar multiplication of a Matrix: the multiplication could be applied directly on a DataBag, since scalar multiplications do not rely on special, linear algebra-specific operators. Thus, we let the optimizer decide on which physical data representation each operation is applied.
Types as First Class Citizens. Our DSL is based around the concept of generic types. We propose a set of elementary generic types that model different structural aspects and represent both user-facing (e.g., matrix, bag) and engine-facing (e.g., partitioning) abstractions. Each type implies (i) a set of structural constraints expressed as axioms, as well as (ii) a set of structure-preserving transformations, which operate element-wise and correspond to the functional notion of map. Moreover, the types can be suitably composed to define new, richer structure. This allows for reasoning about the optimization space in a systematic way.
User Types. The core types included in our model are Bag A, Matrix A, and Vector A. These types are polymorphic (generic) with a type argument A and represent different containers for data (i.e., elements) of type A. As such, their implementations should be independent of A.
Generic types can be modeled as algebraic data types (ADTs) by means of algebraic specifications [8]. An algebraic specification defines a set of functions that produce values of the specified type (constructors), as well as a set of axiomatic equations that relate constructor terms. This approach gives rise to categorical semantics of each specification: a free functor corresponds to the classical (or loose) semantics, and the initial object in the target category to the initial semantics. For instance, [1] advocates conceptually treating bags as types specified by the so-called union representation

data Bag A = empty | sng A | union Bag A × Bag A
Bag values can be constructed by a nullary constructor (empty), a unary constructor (sng), or a binary constructor (union), and the associated axioms state that empty is a unit and union is associative and commutative. LA types with fixed dimensions, such as Vector_n A and Matrix_{n×m} A, as well as the monoids used in [9] naturally fit this framework.
The signature and dependencies between the constructors in an ADT definition impose a certain type structure. Mappings that preserve this structure are called homomorphisms. In the case of generic types, the structure represents an abstract model for the shape of the container, and homomorphisms are characterized by a second-order function called map. It is important to note that, although map is nowadays predominantly associated with collections (as in MapReduce), the concept applies to all generic types.
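The union representation and the role of map as a structure-preserving function can be written down directly as a Scala ADT. The following is an illustrative sketch only: the names `Empty`, `Sng`, `Union`, `fold`, `map`, `sum`, and `size` are chosen here and are not taken from the Lara or Emma code base.

```scala
object BagSketch {
  // Union representation: data Bag A = empty | sng A | union Bag A × Bag A
  sealed trait Bag[+A]
  case object Empty extends Bag[Nothing]
  final case class Sng[A](value: A) extends Bag[A]
  final case class Union[A](left: Bag[A], right: Bag[A]) extends Bag[A]

  // Structural recursion: replace each constructor by a matching operation.
  // The result respects the bag axioms only if (zero, plus) forms a
  // commutative monoid, mirroring the axioms for empty and union.
  def fold[A, B](bag: Bag[A])(zero: B, init: A => B, plus: (B, B) => B): B =
    bag match {
      case Empty       => zero
      case Sng(a)      => init(a)
      case Union(l, r) => plus(fold(l)(zero, init, plus), fold(r)(zero, init, plus))
    }

  // map rebuilds the same constructor tree and only touches the elements.
  def map[A, B](bag: Bag[A])(f: A => B): Bag[B] =
    fold[A, Bag[B]](bag)(Empty, a => Sng(f(a)), (l, r) => Union(l, r))

  def sum(bag: Bag[Int]): Int = fold[Int, Int](bag)(0, identity, _ + _)
  def size[A](bag: Bag[A]): Int = fold[A, Int](bag)(0, _ => 1, _ + _)
}
```

Note that the case classes themselves do not enforce the axioms: `Union(a, b)` and `Union(b, a)` are distinct Scala values. The axioms are reflected only in the requirement that every fold uses a commutative, associative operation with a unit, which is what makes results such as `sum` independent of how the bag was built.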
System Types. Reasoning about optimizations that affect physical aspects such as partitioning or blocking means that those should be included in our model. Crucially, this type of structure can also be represented by generic types. For example, we can use Par T A to represent a partitioned container of type T A (e.g., Bag A, Matrix A), and Block_n A to represent square-shaped blocks with dimension n. Homomorphisms (maps) over those model partition-preserving or block-preserving function applications, respectively (e.g., corresponding to mapValues in Spark's RDD API).
Figure 1: Two execution strategies for toMatrix(A ⋈ B) (colors represent different partitions). Above (a-c): Naïve Approach. Below (d-f): Partition Sharing.

(a) Result of A ⋈ B

a1  b1  b2  b3
1   5   5   5
2   6   6   6
3   7   7   7
4   8   8   8

(d) Original tables

Table A:
ID  a1
0   1
1   2
2   3
3   4

Table B:
ID  b1  b2  b3
0   5   5   5
1   6   6   6
2   7   7   7
3   8   8   8
Type Conversions. An obvious candidate for formalizing generic type conversions in a categorical setting is the notion of natural transformations: polymorphic functions t : T A → U A which change the container type from T to U (e.g., from Matrix to Bag) irrespective of the element type A. Their characteristic property states that application of t commutes with application of map f for all f. This formalism, however, cannot be directly applied in all cases. For example, toMatrix : Bag (N, A) → Matrix A preserves the element value A but relies on an extra index N to determine the element position in the resulting matrix. Extending or adapting the theory of natural transformations to fit our needs is an open research question.
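The commuting property, and the way toMatrix departs from it, can be made concrete with a small sketch. Everything below is an assumption made for illustration: bags and matrices are modeled as plain vectors, the index N is taken to be a row-major position, and `numCols` is a hypothetical extra parameter that the formal signature in the text does not have.

```scala
object ConversionSketch {
  type Matrix[A] = Vector[Vector[A]]
  type Bag[A]    = Vector[A] // element order carries no meaning for a bag

  def mapMatrix[A, B](m: Matrix[A])(f: A => B): Matrix[B] = m.map(_.map(f))
  def mapBag[A, B](b: Bag[A])(f: A => B): Bag[B] = b.map(f)

  // Matrix A -> Bag A: forgets cell positions and never inspects elements,
  // so it commutes with map for every f.
  def toBag[A](m: Matrix[A]): Bag[A] = m.flatten

  // Bag (N, A) -> Matrix A: the element values are untouched, but the
  // index component decides where each value ends up.
  def toMatrix[A](b: Bag[(Int, A)], numCols: Int): Matrix[A] =
    b.sortBy(_._1).map(_._2).grouped(numCols).toVector
}
```

For `toBag`, the equation `toBag(mapMatrix(m)(f)) == mapBag(toBag(m))(f)` holds for any `f`. For `toMatrix`, a similar equation holds only for functions that transform the value and leave the index alone, which is one way to see why the plain notion of a natural transformation does not fit directly.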
Control-Flow. To enable rewrite-based optimizations, our proposed language has to be referentially transparent (i.e., purely functional). Moreover, in order to facilitate efficient and concise implementations of optimizations, the language IR should satisfy the following requirements. (R1) Both elementary and compound expressions should be addressable by name. (R2) Use-def and def-use information should be efficient and easy to compute and maintain. (R3) Control and data flow should be handled in a unified way.
All of the above requirements could be satisfied by an IR in static single assignment (SSA) form. Graph-based SSA IRs (e.g., sea of nodes) are nowadays used in compiler backends like LLVM and Java HotSpot. We plan to use a purely functional IR which conforms to the SSA constraints. It enforces R1 through a restriction on the allowed terms called A-normal form, and R2-R3 by modeling control-flow through function calls in the so-called direct style.
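To illustrate what such an IR looks like, the sketch below shows a small imperative loop and a hand-written equivalent in A-normal form with direct-style control flow. It is written in ordinary Scala and is not output of the Lara compiler; the function names and the toy update rule are assumptions made for this example.

```scala
object AnfSketch {
  // Source program: nested expressions and a mutable loop.
  def imperative(xs: Vector[Double], steps: Int): Double = {
    var w = 0.0
    var i = 0
    while (i < steps) {
      w = w - 0.1 * (w - xs.sum / xs.size)
      i += 1
    }
    w
  }

  // Same program in A-normal form: every intermediate result is bound to
  // a name exactly once (R1, R2), and the loop is a local function whose
  // parameters take the role of SSA phi-nodes (R3).
  def anf(xs: Vector[Double], steps: Int): Double = {
    @scala.annotation.tailrec
    def loop(w: Double, i: Int): Double = {
      val cond = i < steps
      if (cond) {
        val total = xs.sum
        val count = xs.size
        val mean  = total / count
        val diff  = w - mean
        val step  = 0.1 * diff
        val w1    = w - step
        val i1    = i + 1
        loop(w1, i1)
      } else w
    }
    loop(0.0, 0)
  }
}
```

In the second form a rewrite such as hoisting `total`, `count`, and `mean` out of `loop` becomes a purely syntactic move of three bindings, because each value has a single definition and its uses can be found by name.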
A number of holistic optimizations can be derived from the unified formal model and implemented under the assumption of a full view of the algorithm code. Examples are projection push-down (based on the knowledge that certain fields are never accessed) and filter push-down (e.g., filters that are applied on matrices but can be pushed to the DataBags from which the matrices originate). In the sequel we present more sophisticated optimizations that come from a deeper analysis of a program's code.
Matrix Blocking Through Joins. Distributed operations over matrices are commonly done over block-partitioned matrices [10, 5]. This representation differs a lot from the unordered, non-indexed bag representation commonly used in dataflow APIs.
Consider again the example in Listing 1. Lines 7-11 perform a join, producing a bag which is converted to a matrix and processed in lines 16 and 22. Note that the subsequent linear algebra operations (filling as well as computing the covariance matrix) can be executed over a block-partitioned matrix. A naïve execution of this program would have to shuffle the data once in order to execute the join, and then shuffle it once again to block-partition the matrix for the linear algebra operations. In the sequel we use an example to show (i) how the naïve approach would perform the join and subsequently the block partitioning, and (ii) how we can avoid one partitioning step (for the join).
Naïve Approach. Assume we execute the join of A and B as shown in Figure 1d on 4 nodes, using hash-partitioning with h(k) = k mod 4, where k is the product id. To block-partition the matrix for the subsequent linear algebra operations, systems typically introduce a surrogate key rowIdx for each tuple of the join result, to assign subsets of the rows and columns to blocks. Therefore, we assign the following key to each tuple:

$$k = \left( \frac{\mathit{rowIdx}}{\mathit{rowBlkSize}}, \frac{\mathit{colIdx}}{\mathit{colBlkSize}} \right)$$

The result of this key assignment for the join result is depicted in Figure 1b. Note that the blocks are partial. A grouping operation can bring the partial blocks to the respective machines and construct the final blocks as shown in Figure 1f.
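The key assignment and the subsequent grouping can be simulated on a single machine as follows. This is a sketch under stated assumptions: the divisions in the key are read as integer divisions, the surrogate rowIdx is taken to be the position of a tuple in the join result, the shuffle is modeled by `groupBy`, and all names (`blockKey`, `partialBlocks`, `blocks`) are hypothetical.

```scala
object NaiveBlockingSketch {
  type BlockId = (Int, Int)

  // k = (rowIdx / rowBlkSize, colIdx / colBlkSize), using integer division.
  def blockKey(rowIdx: Int, colIdx: Int, rowBlkSize: Int, colBlkSize: Int): BlockId =
    (rowIdx / rowBlkSize, colIdx / colBlkSize)

  // Step 1: give every tuple of the join result a surrogate row index and
  // split it into row fragments, one per block it contributes to.
  def partialBlocks(rows: Vector[Vector[Double]], rowBlkSize: Int, colBlkSize: Int)
      : Vector[(BlockId, (Int, Vector[Double]))] =
    rows.zipWithIndex.flatMap { case (row, rowIdx) =>
      row.zipWithIndex
        .groupBy { case (_, colIdx) => blockKey(rowIdx, colIdx, rowBlkSize, colBlkSize) }
        .toVector
        .map { case (id, cells) => (id, (rowIdx, cells.map(_._1))) }
    }

  // Step 2: group the fragments by block id (the second shuffle in a
  // distributed setting) and order them by row index within each block.
  def blocks(rows: Vector[Vector[Double]], rowBlkSize: Int, colBlkSize: Int)
      : Map[BlockId, Vector[Vector[Double]]] =
    partialBlocks(rows, rowBlkSize, colBlkSize)
      .groupBy(_._1)
      .map { case (id, frags) => id -> frags.map(_._2).sortBy(_._1).map(_._2) }
}
```

With the four-row join result of Figure 1a and 2×2 blocks, `partialBlocks` yields two fragments per row and `blocks` assembles them into four complete blocks.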
Partition Sharing. A full view over the code in Listing 1 allows us to see both the RA part in lines 1-11 and the LA part in lines 16-22, and to holistically reason about the type conversion in line 14. Ideally, the join operation and the linear algebra operations can share the same partitioning. We can achieve that by range-partitioning the input tables A and B separately and then combining them. More specifically, we use a different key to partition the inputs, taking into account both the (unique and sequential¹) product id and the column index:

$$k = \left( \frac{\mathit{ID}}{\mathit{rowBlkSize}}, \frac{\mathit{resultColIdx}}{\mathit{colBlkSize}} \right)$$

where resultColIdx is the index of the column in the (now virtual) bag X and ID is the primary key of the inputs. As the schema of the join result is explicitly provided in Lara (Listing 1, line 11), we can easily obtain the column indexes of the join result (X). The partitioning of the tables is shown in Figure 1e. Observe that the blocks with column index 0 are split across the tables; thus, we also have column-wise partial blocks. To create the final partitioning with complete blocks, we union A and B before we aggregate the blocks sharing the same block id.

¹ Similar optimizations apply to joins over non-unique keys (e.g., normalized data [12]). Moreover, the assumption of sequential primary keys can be relaxed at the expense of an extra pass over the data, which is negligible in complex analysis programs. For lack of space, we omit this discussion.
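A single-machine sketch of this strategy is given below. It relies on the assumptions stated in the text (unique, sequential ids that occur in both inputs) and on further assumptions made here: integer division in the key, `groupBy` standing in for the range partitioning and aggregation, and hypothetical names (`Cell`, `cells`, `blockKey`, `blocks`, `aWidth`).

```scala
object PartitionSharingSketch {
  type BlockId = (Int, Int)
  final case class Cell(id: Int, resultColIdx: Int, value: Double)

  // Explode an input table into cells. colOffset is the position of the
  // table's first column in the schema of the (virtual) join result X.
  def cells(table: Seq[(Int, Vector[Double])], colOffset: Int): Vector[Cell] =
    table.toVector.flatMap { case (id, values) =>
      values.zipWithIndex.map { case (v, j) => Cell(id, colOffset + j, v) }
    }

  // k = (ID / rowBlkSize, resultColIdx / colBlkSize), using integer division.
  def blockKey(c: Cell, rowBlkSize: Int, colBlkSize: Int): BlockId =
    (c.id / rowBlkSize, c.resultColIdx / colBlkSize)

  // Key both inputs independently, union them, then aggregate per block id.
  // No separate join shuffle is needed because matching ids share a key.
  def blocks(a: Seq[(Int, Vector[Double])], b: Seq[(Int, Vector[Double])],
             aWidth: Int, rowBlkSize: Int, colBlkSize: Int)
      : Map[BlockId, Vector[Vector[Double]]] = {
    val all = cells(a, 0) ++ cells(b, aWidth)
    all.groupBy(c => blockKey(c, rowBlkSize, colBlkSize)).map { case (k, cs) =>
      val rows = cs.groupBy(_.id).toVector.sortBy(_._1)
        .map { case (_, rowCells) => rowCells.sortBy(_.resultColIdx).map(_.value) }
      k -> rows
    }
  }
}
```

On the tables of Figure 1d with 2×2 blocks, the blocks produced this way coincide with those obtained by joining first and block-partitioning afterwards, while each input is partitioned only once.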
Row-wise Aggregation Pushdown. In Listing 1, line 16 calculates the mean for each column in the matrix. Now, let us consider calculating the mean for each row, as shown in the following snippet:
// Convert DataBag X into Matrix M
val M = X.toMatrix()
// Calculate the mean of each row r of the matrix
val means = for ( r <- M.rows() ) yield (mean(r))
This would require a full pass over the data; in fact, as the matrix is partitioned block-wise, we have to merge the partial aggregates of the blocks to compute the full aggregate over each row. In the DataBag representation, on the other hand, this could be executed in a single pass, as we have tuple-at-a-time (row-at-a-time) processing. In a typical hash-partitioned join, the aggregate could be calculated as part of the join, using a simple mapper applied to the join output.
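The difference between the two representations can be sketched as follows. The code is illustrative only: blocks are modeled as a `Map` from block id to a small dense block, and the function names are assumptions made for this example.

```scala
object RowMeanSketch {
  type BlockId = (Int, Int)

  // DataBag side: every tuple is a full row, so one map suffices.
  def rowMeansOnBag(rows: Seq[Vector[Double]]): Vector[Double] =
    rows.map(r => r.sum / r.size).toVector

  // Matrix side: each block only sees a fragment of a row, so it emits a
  // partial (sum, count) per row; partials are then merged per row index.
  def rowMeansOnBlocks(blocks: Map[BlockId, Vector[Vector[Double]]],
                       rowBlkSize: Int): Vector[Double] = {
    val partials: Vector[(Int, (Double, Int))] =
      blocks.toVector.flatMap { case ((rowBlk, _), block) =>
        block.zipWithIndex.map { case (fragment, i) =>
          (rowBlk * rowBlkSize + i, (fragment.sum, fragment.size))
        }
      }
    partials.groupBy(_._1).toVector.sortBy(_._1).map { case (_, ps) =>
      val total = ps.map(_._2._1).sum
      val count = ps.map(_._2._2).sum
      total / count
    }
  }
}
```

Both functions return the same means, but the block-wise variant needs an extra grouping step, which corresponds to the merge of partial aggregates described above.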
Caching Throughout Iterations. A holistic view over the program allows us to reason about the structure of the control flow. For instance, caching data which is used repeatedly within an iteration or along multiple control-flow branches can result in great performance benefits. This becomes even more interesting when the data originates from pre-processing which would otherwise be re-computed for each iteration. The decision to cache is not always evident and forces the programmer to consider system specifics. Lara currently employs a greedy strategy which implicitly caches data used repeatedly in iterations.
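The effect of caching pre-processed data across iterations can be demonstrated with a toy program that counts how often the pre-processing runs. This is only a sketch of the benefit, not of Lara's caching strategy; `run`, `preprocess`, and the update rule are hypothetical.

```scala
object CachingSketch {
  // Runs a toy iterative algorithm over pre-processed data and returns the
  // final value together with the number of pre-processing evaluations.
  def run(iterations: Int, cache: Boolean): (Double, Int) = {
    var evaluations = 0
    // Stands in for an expensive ETL pipeline.
    def preprocess(): Vector[Double] = {
      evaluations += 1
      Vector(1.0, 2.0, 3.0, 4.0).map(_ * 2)
    }
    lazy val cached = preprocess() // evaluated at most once
    def data(): Vector[Double] = if (cache) cached else preprocess()

    var w = 0.0
    for (_ <- 0 until iterations) {
      val xs = data()
      w = w + 0.5 * (xs.sum / xs.size - w)
    }
    (w, evaluations)
  }
}
```

Both configurations compute the same result, but without caching the pre-processing is evaluated once per iteration instead of once overall.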
ML Libraries & Languages. SystemML's DML [4] and Mahout's Samsara provide R-like linear algebra abstractions and execute locally or distributed on Hadoop and Spark. While Samsara has fixed implementations for its linear algebra operations, SystemML applies inter-operator optimizations like operator fusion and decides execution strategies based on cost estimates. As there is no support for relational operators, ETL has to be executed in a different system and there is no potential for holistic optimization. The Delite project [6] provides a compiler framework for domain-specific languages targeting parallel execution. Delite's IR is based on higher-order parallelizable functions (e.g., reduce, map, zipWith). However, Delite's IR does not allow for holistic reasoning about linear and relational algebra. In this work, we base our reasoning on types and on the holistic view of the AST. Finally, we believe that monad comprehensions provide a better formal ground for reasoning about and applying algebraic optimizations.
Algebra Unifying Approaches. Kumar et al. [12] introduce learning over linear models on data residing in a relational database. They push parts of the computation of the ML model into joins over normalized data, similar to [7]. These works focus on generalized linear models, whereas we focus on more generic optimizations that can be derived directly from the common intermediate representation of linear and relational algebra. MLBase [11] provides high-level abstractions for ML tasks with basic support for relational operators. Their DSL allows the optimizer to choose different ML algorithm implementations, but it neither takes relational operators into account nor applies any optimization across the algebras.
Acknowledgments. The authors thank Matthias Boehm (IBM Almaden) for his constructive feedback and discussions. This work has been supported by grants from the German Science Foundation (DFG) as FOR1306 Stratosphere, the German Ministry for Education and Research as Berlin Big Data Center BBDC (ref. 01IS14013A), the European Commission through Proteus (ref. 687691) and Streamline (ref. 688191), and by Oracle Labs.
[1] A. Alexandrov, A. Kunft, A. Katsifodimos, F. Schüler, L. Thamsen, O. Kao, T. Herb, and V. Markl. Implicit parallelism through deep language embedding. In ACM SIGMOD, 2015.
[2] A. Alexandrov, A. Salzmann, G. Krastev, A. Katsifodimos, and V. Markl. Emma in action: Declarative dataflows for scalable data analysis. In ACM SIGMOD, 2016.
[3] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, and M. Zaharia. Spark SQL: Relational data processing in Spark. In ACM SIGMOD, 2015.
[4] M. Boehm, D. R. Burdick, A. V. Evfimievski, B. Reinwald, F. R. Reiss, P. Sen, S. Tatikonda, and Y. Tian. SystemML's optimizer: Plan generation for large-scale machine learning programs. IEEE Data Engineering Bulletin, 2014.
[5] P. G. Brown. Overview of SciDB: Large scale array storage, processing and analysis. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, 2010.
[6] H. Chafi, A. K. Sujeeth, K. J. Brown, H. Lee, A. R. Atreya, and K. Olukotun. A domain-specific approach to heterogeneous parallelism. ACM SIGPLAN Notices, 2011.
[7] S. Chaudhuri and K. Shim. Including group-by in query optimization. In VLDB, 1994.
[8] H. Ehrig and B. Mahr. Fundamentals of Algebraic Specification 1: Equations and Initial Semantics, volume 6 of EATCS Monographs on Theoretical Computer Science. Springer, 1985.
[9] L. Fegaras and D. Maier. Optimizing object queries using an effective calculus. ACM TODS, 2000.
[10] A. Ghoting, R. Krishnamurthy, E. Pednault, B. Reinwald, V. Sindhwani, S. Tatikonda, Y. Tian, and S. Vaithyanathan. SystemML: Declarative machine learning on MapReduce. In ICDE, 2011.
[11] T. Kraska, A. Talwalkar, J. Duchi, R. Griffith, M. J. Franklin, and M. Jordan. MLbase: A distributed machine-learning system. In CIDR, 2013.
[12] A. Kumar, J. Naughton, and J. M. Patel. Learning generalized linear models over normalized data. In ACM SIGMOD, 2015.
[13] D. Maier and B. Vance. A call to order. In ACM PODS, 1993.
[14] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In ACM SIGMOD, 2008.
[15] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Anthony, H. Liu, and R. Murthy. Hive: A petabyte scale data warehouse using Hadoop. In ICDE. IEEE, 2010.
... In view of the importance of large-scale statistical and machine learning (ML) algorithms in the overall data analytics workflow, database systems are in the process of being redesigned and extended to allow for a seamless integration of ML algorithms and mathematical and statistical frameworks, such as R, SAS, and MATLAB, with existing data manipulation and data querying functionality [42,19,5,38,10,27,21]. In particular, data scientists often use matrices to represent their data, as opposed to using the relational data model, and create custom data analytics algorithms using linear algebra, instead of writing SQL queries. ...
... In [27], Lara is proposed as a domain-specific programming language written in Scala that provides both linear algebra (LA) and relational algebra (RA) constructs. This approach is taken one step further in [21] where it is shown that the RA operations and a number of LA operations can be defined in terms of three core operations called Ext, Union, and Join. ...
We investigate the expressive power of MATLANG, a formal language for matrix manipulation based on common matrix operations and linear algebra. The language can be extended with the operation inv for inverting a matrix. In MATLANG + inv we can compute the transitive closure of directed graphs, whereas we show that this is not possible without inversion. Indeed we show that the basic language can be simulated in the relational algebra with arithmetic operations, grouping, and summation. We also consider an operation eigen for diagonalizing a matrix. It is defined such that for each eigenvalue a set of orthogonal eigenvectors is returned that span the eigenspace of that eigenvalue. We show that inv can be expressed in MATLANG + eigen. We put forward the open question whether there are boolean queries about matrices, or generic queries about graphs, expressible in MATLANG+eigen but not in MATLANG+inv. Finally, the evaluation problem for MATLANG + eigen is shown to be complete for the complexity class 9R.
... Amalur's optimizer and query rewriting module receive the lifted data science pipeline, federation metadata (virtual tables, data sizes, etc.), and the three types of matrices. It then reasons about the data relationships and provenance, and can perform optimizations across linear and relational algebra operations [32]. ...
Full-text available
The data needed for machine learning (ML) model training and inference, can reside in different separate sites often termed data silos. For data-intensive ML applications, data silos present a major challenge: the integration and transformation of data, demand a lot of manual work and computational resources. Sometimes, data cannot leave the local store, and the model has to be trained in a decentralized manner. In this work, we propose three matrix-based dataset relationship representations, which bridge the classical data integration (DI) techniques with the requirements of modern machine learning. We discuss how those matrices pave the path for utilizing DI formalisms and techniques for our vision of ML optimization and automation over data silos.
... Linear algebra-based algorithms have become a key component in data analytic workflows. As such, there is a growing interest in the database community to integrate linear algebra functionalities into relational database management systems [5,23,[25][26][27]. In particular, from a query language perspective, several proposals have recently been put forward to unify relational algebra and linear algebra. ...
Linear algebra algorithms often require some sort of iteration or recursion as is illustrated by standard algorithms for Gaussian elimination, matrix inversion, and transitive closure. A key characteristic shared by these algorithms is that they allow looping for a number of steps that is bounded by the matrix dimension. In this paper we extend the matrix query language MATLANG with this type of recursion, and show that this suffices to express classical linear algebra algorithms. We study the expressive power of this language and show that it naturally corresponds to arithmetic circuit families, which are often said to capture linear algebra. Furthermore, we analyze several sub-fragments of our language, and show that their expressive power is closely tied to logical formalisms on semiring-annotated relations.
We study the expressive power of the Lara language – a recently proposed unified model for expressing relational and linear algebra operations – both in terms of traditional database query languages and some analytic tasks often performed in machine learning pipelines. Since Lara is parameterized by a set of user-defined functions which allow to transform values in tables, known as extension functions, the exact expressive power of the language depends on how these functions are defined. We start by showing Lara to be expressive complete with respect to a syntactic fragment of relational algebra with aggregation (under the mild assumption that extension functions in Lara can cope with traditional relational algebra operations such as selection and renaming). We then look further into the expressiveness of Lara based on different classes of extension functions, and distinguish two main cases depending on the level of genericity that queries are enforced to satisfy. Under strong genericity assumptions the language cannot express matrix convolution, a very important operation in current machine learning pipelines. This language is also local, and thus cannot express operations such as matrix inverse that exhibit a recursive behavior. For expressing convolution, one can relax the genericity requirement by adding an underlying linear order on the domain. This, however, destroys locality and turns the expressive power of the language much more difficult to understand. In particular, although under complexity assumptions some versions of the resulting language can still not express matrix inverse, a proof of this fact without such assumptions seems challenging to obtain.
Full-text available
There is a long tradition in understanding graphs by investigating their adjacency matrices by means of linear algebra. Similarly, logic-based graph query languages are commonly used to explore graph properties. In this paper, we bridge these two approaches by regarding linear algebra as a graph query language. More specifically, we consider MATLANG, a matrix query language recently introduced, in which some basic linear algebra functionality is supported. We investigate the problem of characterising the equivalence of graphs, represented by their adjacency matrices, for various fragments of MATLANG. That is, we are interested in understanding when two graphs cannot be distinguished by posing queries in MATLANG on their adjacency matrices. Surprisingly, a complete picture can be painted of the impact of each of the linear algebra operations supported in MATLANG on their ability to distinguish graphs. Interestingly, these characterisations can often be phrased in terms of spectral and combinatorial properties of graphs. Furthermore, we also establish links to logical equivalence of graphs. In particular, we 1show that MATLANG-equivalence of graphs corresponds to equivalence by means of sentences in the three-variable fragment of first-order logic with counting. Equivalence with regards to a smaller MATLANG fragment is shown to correspond to equivalence by means of sentences in the two-variable fragment of this logic.
Machine learning (ML) is increasingly used to automate decision making in various domains. In recent years, ML has not only been applied to tasks that use structured input data, but also, tasks that operate on data with less strictly defined structure such as speech, images and videos. Prominent examples are speech recognition for personal assistants or face recognition for boarding airplanes.
Conference Paper
Parallel dataflow APIs based on second-order functions were originally seen as a flexible alternative to SQL. Over time, however, their complexity increased due to the number of physical aspects that had to be exposed by the underlying engines in order to facilitate efficient execution. To retain a sufficient level of abstraction and lower the barrier of entry for data scientists, projects like Spark and Flink currently offer domain-specific APIs on top of their parallel collection abstractions. This demonstration highlights the benefits of an alternative design based on deep language embedding. We showcase Emma - a programming language embedded in Scala. Emma promotes parallel collection processing through native constructs like Scala's for-comprehensions - a declarative syntax akin to SQL. In addition, Emma also advocates quasi-quoting the entire data analysis algorithm rather than its individual dataflow expressions. This allows for decomposing the quoted code into (sequential) control flow and (parallel) dataflow fragments, optimizing the dataflows in context, and transparently offloading them to an engine like Spark or Flink. The proposed design promises increased programmer productivity due to avoiding an impedance mismatch, thereby reducing the lag times and cost of data analysis.
Conference Paper
The appeal of MapReduce has spawned a family of systems that implement or extend it. In order to enable parallel collection processing with User-Defined Functions (UDFs), these systems expose extensions of the MapReduce programming model as library-based dataflow APIs that are tightly coupled to their underlying runtime engine. Expressing data analysis algorithms with complex data and control flow structure using such APIs reveals a number of limitations that impede programmer's productivity. In this paper we show that the design of data analysis languages and APIs from a runtime engine point of view bloats the APIs with low-level primitives and affects programmer's productivity. Instead, we argue that an approach based on deeply embedding the APIs in a host language can address the shortcomings of current data analysis languages. To demonstrate this, we propose a language for complex data analysis embedded in Scala, which (i) allows for declarative specification of dataflows and (ii) hides the notion of data-parallelism and distributed runtime behind a suitable intermediate representation. We describe a compiler pipeline that facilitates efficient data-parallel processing without imposing runtime engine-bound syntactic or semantic restrictions on the structure of the input programs. We present a series of experiments with two state-of-the-art systems that demonstrate the optimization potential of our approach.
SystemML enables declarative, large-scale machine learning (ML) via a high-level language with R-like syntax. Data scientists use this language to express their ML algorithms with full flexibility but without the need to hand-tune distributed runtime execution plans and system configurations. These ML programs are dynamically compiled and optimized based on data and cluster characteristics using rule- and cost-based optimization techniques. The compiler automatically generates hybrid runtime execution plans ranging from in-memory, single node execution to distributed MapReduce (MR) computation and data access. This paper describes the SystemML optimizer, its compilation chain, and selected optimization phases for generating efficient execution plans.
Machine learning (ML) and statistical techniques are key to transforming big data into actionable knowledge. In spite of the modern primacy of data, the complexity of existing ML algorithms is often overwhelming; many users do not understand the trade-offs and challenges of parameterizing and choosing between different learning techniques. Furthermore, existing scalable systems that support machine learning are typically not accessible to ML researchers without a strong background in distributed systems and low-level primitives. In this work, we present our vision for MLbase, a novel system harnessing the power of machine learning for both end-users and ML researchers. MLbase provides (1) a simple declarative way to specify ML tasks, (2) a novel optimizer to select and dynamically adapt the choice of learning algorithm, (3) a set of high-level operators to enable ML researchers to scalably implement a wide range of ML methods without deep systems knowledge, and (4) a new run-time optimized for the data-access patterns of these high-level operators.
Conference Paper
MapReduce is emerging as a generic parallel programming paradigm for large clusters of machines. This trend combined with the growing need to run machine learning (ML) algorithms on massive datasets has led to an increased interest in implementing ML algorithms on MapReduce. However, the cost of implementing a large class of ML algorithms as low-level MapReduce jobs on varying data and machine cluster sizes can be prohibitive. In this paper, we propose SystemML in which ML algorithms are expressed in a higher-level language and are compiled and executed in a MapReduce environment. This higher-level language exposes several constructs including linear algebra primitives that constitute key building blocks for a broad class of supervised and unsupervised ML algorithms. The algorithms expressed in SystemML are compiled and optimized into a set of MapReduce jobs that can run on a cluster of machines. We describe and empirically evaluate a number of optimization strategies for efficiently executing these algorithms on Hadoop, an open-source MapReduce implementation. We report an extensive performance evaluation on three ML algorithms on varying data and cluster sizes.
Conference Paper
Exploiting heterogeneous parallel hardware currently requires mapping application code to multiple disparate programming models. Unfortunately, general-purpose programming models available today can yield high performance but are too low-level to be accessible to the average programmer. We propose leveraging domain-specific languages (DSLs) to map high-level application code to heterogeneous devices. To demonstrate the potential of this approach we present OptiML, a DSL for machine learning. OptiML programs are implicitly parallel and can achieve high performance on heterogeneous hardware with no modification required to the source code. For such a DSL-based approach to be tractable at large scales, better tools are required for DSL authors to simplify language creation and parallelization. To address this concern, we introduce Delite, a system designed specifically for DSLs that is both a framework for creating an implicitly parallel DSL as well as a dynamic runtime providing automated targeting to heterogeneous parallel hardware. We show that OptiML running on Delite achieves single-threaded, parallel, and GPU performance superior to explicitly parallelized MATLAB code in nearly all cases.
Conference Paper
Enterprise data analytics is a booming area in the data management industry. Many companies are racing to develop toolkits that closely integrate statistical and machine learning techniques with data management systems. Almost all such toolkits assume that the input to a learning algorithm is a single table. However, most relational datasets are not stored as single tables due to normalization. Thus, analysts often perform key-foreign key joins before learning on the join output. This strategy of learning after joins introduces redundancy avoided by normalization, which could lead to poorer end-to-end performance and maintenance overheads due to data duplication. In this work, we take a step towards enabling and optimizing learning over joins for a common class of machine learning techniques called generalized linear models that are solved using gradient descent algorithms in an RDBMS setting. We present alternative approaches to learn over a join that are easy to implement over existing RDBMSs. We introduce a new approach named factorized learning that pushes ML computations through joins and avoids redundancy in both I/O and computations. We study the tradeoff space for all our approaches both analytically and empirically. Our results show that factorized learning is often substantially faster than the alternatives, but is not always the fastest, necessitating a cost-based approach. We also discuss extensions of all our approaches to multi-table joins as well as to Hive.
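The idea of pushing computation through a key-foreign key join can be illustrated with a minimal Python sketch. It is not the method from the paper above: the least-squares loss, the in-memory layout (`S` as a list of `(fk, y, x_s)` tuples, `R` as a dict from key to features), and both function names are assumptions made for illustration. The sketch computes one gradient of a linear model twice, once over the materialized join and once by reusing per-`R`-tuple partial results, and the two agree.

```python
from collections import defaultdict


def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def grad_materialized(S, R, w_s, w_r):
    """Least-squares gradient over the materialized join of S and R.

    Each S tuple looks up its R features, so R's features are
    re-read and re-multiplied once per joining S tuple.
    """
    g_s = [0.0] * len(w_s)
    g_r = [0.0] * len(w_r)
    for fk, y, x_s in S:
        x_r = R[fk]  # the join: R's features copied next to the S tuple
        err = dot(w_s, x_s) + dot(w_r, x_r) - y
        for j, x in enumerate(x_s):
            g_s[j] += err * x
        for j, x in enumerate(x_r):
            g_r[j] += err * x
    return g_s, g_r


def grad_factorized(S, R, w_s, w_r):
    """Same gradient, without repeating work on R's features.

    Hypothetical sketch: the partial inner product w_r . x_r is computed
    once per R tuple, and the residuals are summed per foreign key before
    being multiplied with R's features.
    """
    partial = {rid: dot(w_r, x_r) for rid, x_r in R.items()}
    g_s = [0.0] * len(w_s)
    err_sum = defaultdict(float)
    for fk, y, x_s in S:
        err = dot(w_s, x_s) + partial[fk] - y
        for j, x in enumerate(x_s):
            g_s[j] += err * x
        err_sum[fk] += err  # group residuals by the R tuple they join with
    g_r = [0.0] * len(w_r)
    for rid, total in err_sum.items():
        for j, x in enumerate(R[rid]):
            g_r[j] += total * x
    return g_s, g_r
```

In this sketch the saving grows with the number of `S` tuples per `R` tuple and with the width of `R`'s feature vector. The abstract's observation that a cost-based choice is needed is not modelled here.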
Conference Paper
Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g. declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g. machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g. schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.
Since the early seventies concepts of specification have become central in the whole area of computer science. Especially algebraic specification techniques for abstract data types and software systems have gained considerable importance in recent years. They have not only played a central role in the theory of data type specification, but meanwhile have had a remarkable influence on programming language design, system architectures, and software tools and environments. The fundamentals of algebraic specification lay a basis for teaching, research, and development in all those fields of computer science where algebraic techniques are the subject or are used with advantage on a conceptual level. Such a basis, however, we do not regard to be a synopsis of all the different approaches and achievements but rather a consistently developed theory. Such a theory should mainly emphasize elaboration of basic concepts from one point of view and, in a rigorous way, reach the state of the art in the field. We understand fundamentals in this context as: 1. Fundamentals in the sense of a carefully motivated introduction to algebraic specification, which is understandable for computer scientists and mathematicians. 2. Fundamentals in the sense of mathematical theories which are the basis for precise definitions, constructions, results, and correctness proofs. 3. Fundamentals in the sense of concepts from computer science, which are introduced on a conceptual level and formalized in mathematical terms.
Conference Paper
There is a growing need for ad-hoc analysis of extremely large data sets, especially at internet companies where innovation critically depends on being able to analyze terabytes of data collected every day. Parallel database products, e.g., Teradata, offer a solution, but are usually prohibitively expensive at this scale. Besides, many of the people who analyze this data are entrenched procedural programmers, who find the declarative, SQL style to be unnatural. The success of the more procedural map-reduce programming model, and its associated scalable implementations on commodity hardware, is evidence of the above. However, the map-reduce paradigm is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain, and reuse. We describe a new language called Pig Latin that we have designed to fit in a sweet spot between the declarative style of SQL, and the low-level, procedural style of map-reduce. The accompanying system, Pig, is fully implemented, and compiles Pig Latin into physical plans that are executed over Hadoop, an open-source, map-reduce implementation. We give a few examples of how engineers at Yahoo! are using Pig to dramatically reduce the time required for the development and execution of their data analysis tasks, compared to using Hadoop directly. We also report on a novel debugging environment that comes integrated with Pig, that can lead to even higher productivity gains. Pig is an open-source, Apache-incubator project, and available for general use.