About
125 Publications
31,437 Reads
4,070 Citations
Publications (125)
Particularly important and rich sources of open and free big geospatial data are the Earth observation (EO) programs of various countries, such as the Landsat program of the US and the Copernicus programme of the European Union. EO data is a paradigmatic case of big data, and the same is true for the big information and big knowledge extracted from...
A lot of geospatial data has recently become available at no charge in many countries. Geospatial data currently made available by government agencies usually does not follow the linked data paradigm. In the few cases where government agencies do follow the linked data paradigm (e.g., Ordnance Survey in the United Kingdom), specialized script...
Big Earth-observation (EO) data that are made freely available by space agencies come from various archives. Therefore, users trying to develop an application need to search within these archives, discover the needed data, and integrate them into their application. In this article, we argue that if EO data are published using the linked data paradi...
Next-generation sequencing (NGS) technology has led the life sciences into the big data era. Today, sequencing genomes takes little time and cost, but yields terabytes of data to be stored and analyzed. Biologists are often exposed to excessively time-consuming and error-prone data management and analysis hurdles. In this paper, we propose a databa...
Good database system performance relies heavily on index tuning, i.e., creating and utilizing the best indices for the workload. However, the complexity of the index tuning process has increased dramatically in recent years due to ad-hoc workloads and a shortage of time and system resources to invest in tuning.
This paper introduces holist...
When addressing the problem of 'big' data volume, preparation costs are one of the key challenges: the high costs of loading, aggregating and indexing data lead to a long data-to-insight time. In addition to being a nuisance to the end user, this latency prevents real-time analytics on 'big' data. Fortunately, data often comes in semantic chunks...
Adaptive indexing initializes and optimizes indexes incrementally, as a side effect of query processing. The goal is to achieve the benefits of indexes while hiding or minimizing the costs of index creation. However, index-optimizing side effects seem to turn read-only queries into update transactions that might, for example, create lock contention...
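As an illustration of the cracking idea that underlies adaptive indexing, the following sketch (plain Python; the class, data, and method names are illustrative, not MonetDB's implementation) shows how a range select physically partitions a copy of the column around the query bounds as a side effect, so subsequent queries scan progressively smaller pieces:

```python
# Minimal illustration of database cracking: a range select on a column
# physically reorganizes (cracks) a copy of the column around the query
# bounds as a side effect, so that future queries touch fewer values.
# Names and structure are illustrative only.

import bisect

class CrackedColumn:
    def __init__(self, values):
        self.data = list(values)          # the cracker column (a copy of the base column)
        self.index = []                   # sorted list of (value, position) crack boundaries

    def _crack(self, lo, hi, value):
        """Partition data[lo:hi) around `value`; return the split position."""
        segment = self.data[lo:hi]
        left = [v for v in segment if v < value]
        right = [v for v in segment if v >= value]
        self.data[lo:hi] = left + right
        return lo + len(left)

    def _bounds(self, value):
        """Find the piece [lo, hi) that may contain `value`, using known cracks."""
        keys = [k for k, _ in self.index]
        i = bisect.bisect_left(keys, value)
        lo = self.index[i - 1][1] if i > 0 else 0
        hi = self.index[i][1] if i < len(self.index) else len(self.data)
        return lo, hi

    def select_range(self, low, high):
        """Return values in [low, high); cracking happens as a side effect."""
        for v in (low, high):
            lo, hi = self._bounds(v)
            pos = self._crack(lo, hi, v)
            bisect.insort(self.index, (v, pos))
        start = dict(self.index)[low]
        end = dict(self.index)[high]
        return self.data[start:end]

col = CrackedColumn([7, 3, 9, 1, 8, 4, 6, 2, 5])
print(col.select_range(3, 7))   # first query cracks the column around 3 and 7
print(col.data)                 # physically reordered: <3 | 3..6 | >=7
```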
The plethora of Earth Observation data that has recently become available at no charge in Europe and the US reflects the strong push for more open Earth Observation data. Linked data is a paradigm which studies how one can make data available on the Web, and interconnect it with other data with the aim of making the value of the resulting "Web of data...
Both scientific data and business data have analytical needs. Analysis takes place after a scientific data warehouse is eagerly filled with all data from external data sources (repositories). This is similar to the initial loading stage of Extract, Transform, and Load (ETL) processes that drive business intelligence. ETL can also help scientific da...
The multi-core architectures of today’s computer systems make parallelism a necessity for performance critical applications. Writing such applications in a generic, hardware-oblivious manner is a challenging problem: Current database systems thus rely on labor-intensive and error-prone manual tuning to exploit the full potential of modern parallel...
Efficient management and exploration of high-volume scientific file repositories have become pivotal for advancement in science. We propose to demonstrate the Data Vault, an extension of the database system architecture that transparently opens scientific file repositories for efficient in-database processing and exploration.
The Data Vault facilit...
Scientific discoveries increasingly rely on the ability to efficiently grind massive amounts of experimental data using database technologies. To bridge the gap between the needs of the Data-Intensive Research fields and the current DBMS technologies, we have introduced SciQL (pronounced as 'cycle'). SciQL is the first SQL-based declarative query l...
Current data-management systems and analysis tools fail to meet scientists' data-intensive needs. A "data vault" approach lets researchers effectively and efficiently explore and analyze information.
Memory-Resident Database Management Systems (MRDBMS) have to be optimized for two resources: CPU cycles and memory bandwidth. To optimize for bandwidth in mixed OLTP/OLAP scenarios, the hybrid or Partially Decomposed Storage Model (PDSM) has been proposed. However, in current implementations, bandwidth savings achieved by partial decomposition come at...
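A minimal sketch of the partial decomposition idea, assuming attributes are grouped into workload-chosen partitions; the class name, attribute groups, and data are illustrative, not the paper's implementation:

```python
# Sketch of a Partially Decomposed Storage Model (PDSM): instead of full
# row storage (NSM) or one column per attribute (DSM), attributes are
# grouped into partitions chosen to match the workload.

class PartiallyDecomposedTable:
    def __init__(self, partitions):
        # partitions: list of attribute-name tuples, e.g. [("id", "name"), ("price", "qty")]
        self.partitions = partitions
        self.storage = {group: [] for group in partitions}

    def insert(self, row):
        """row: dict of attribute -> value; each partition stores its slice."""
        for group in self.partitions:
            self.storage[group].append(tuple(row[a] for a in group))

    def scan(self, attributes):
        """Read only the partitions that contain the requested attributes."""
        needed = [g for g in self.partitions if any(a in g for a in attributes)]
        n = len(self.storage[self.partitions[0]])
        for i in range(n):
            out = {}
            for group in needed:
                out.update(dict(zip(group, self.storage[group][i])))
            yield {a: out[a] for a in attributes}

t = PartiallyDecomposedTable([("id", "name"), ("price", "qty")])
t.insert({"id": 1, "name": "bolt", "price": 0.10, "qty": 500})
t.insert({"id": 2, "name": "nut", "price": 0.05, "qty": 900})
print(list(t.scan(["price"])))       # OLAP-style scan touches only one partition
print(list(t.scan(["id", "name"])))  # OLTP-style access touches the other partition
```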
We present a real-time wildfire monitoring service that exploits satellite images and linked geospatial data to detect hotspots and monitor the evolution of fire fronts. The service makes heavy use of scientific database technologies (array databases, SciQL, data vaults) and linked data technologies (ontologies, linked geospatial data, stSPARQL) an...
Database benchmarking is most valuable if real-life data and workloads are available. However, real-life data (and workloads) are often not publicly available due to IPR constraints or privacy concerns. And even if available, they are often limited regarding scalability and variability of data characteristics. On the other hand, while easily scalab...
In the dawn of the data intensive research era, scientific discovery deploys data analysis techniques similar to those that drive business intelligence. Similar to classical Extract, Transform and Load (ETL) processes, data is loaded entirely from external data sources (repositories) into a scientific data warehouse before it can be analyzed. This...
We address the need for scalable access to petabytes of Earth Observation data and the discovery of knowledge that is hidden in them.
Advances in remote sensing technologies have enabled public and commercial organizations to send an ever-increasing number of satellites into orbit around Earth. As a result, Earth Observation (EO) data has been constantly increasing in volume in the last few years, and is currently reaching petabytes in many satellite archives. For example, the mult...
TELEIOS is a recent European project that addresses the need for scalable access to petabytes of Earth Observation data and the discovery and exploitation of knowledge that is hidden in them. TELEIOS builds on scientific database technologies (array databases, SciQL, data vaults) and Semantic Web technologies (stRDF and stSPARQL) implemented on top...
In DataCell, we design streaming functionalities in a modern relational database kernel which targets big data analytics. This includes exploitation of both its storage/execution engine and its optimizer infrastructure. We investigate the opportunities and challenges that arise with such a direction and we show that it carries significant advan...
In this short paper we outline the Data Vault, a database-attached external file repository. It provides a true symbiosis between a DBMS and existing file-based repositories. Data is kept in its original format while scalable processing functionality is provided through the DBMS facilities. In particular, it provides transparent access to all data...
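To illustrate the just-in-time loading behaviour described above, here is a hedged sketch using SQLite and CSV files as a stand-in repository; the class name, file layout, and loader are assumptions made for the example, not the actual Data Vault code:

```python
# Sketch of the Data Vault idea: files stay in an external repository in
# their original format; a query against a vault table triggers
# just-in-time loading into the database, and the loaded table is reused.

import csv, os, sqlite3, tempfile

class DataVault:
    def __init__(self, repository_dir):
        self.repo = repository_dir
        self.db = sqlite3.connect(":memory:")
        self.loaded = set()

    def _load(self, name):
        """Ingest <name>.csv from the repository into an SQL table on demand."""
        path = os.path.join(self.repo, name + ".csv")
        with open(path, newline="") as f:
            rows = list(csv.reader(f))
        header, data = rows[0], rows[1:]
        self.db.execute(f"CREATE TABLE {name} ({', '.join(header)})")
        placeholders = ", ".join("?" for _ in header)
        self.db.executemany(f"INSERT INTO {name} VALUES ({placeholders})", data)
        self.loaded.add(name)

    def query(self, table, sql):
        if table not in self.loaded:      # transparent, lazy ingestion
            self._load(table)
        return self.db.execute(sql).fetchall()

repo = tempfile.mkdtemp()
with open(os.path.join(repo, "observations.csv"), "w") as f:
    f.write("station,temp\nA,21\nB,19\n")
vault = DataVault(repo)
print(vault.query("observations", "SELECT station, temp FROM observations"))
```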
The National Observatory of Athens (NOA) has been established in Greece as a research institute offering, among others, operational services for disaster management of forest wildfires. In this paper the main activities of NOA related to fire monitoring and the Burn Scar Mapping damage assessment services are presented. The current capacities in de...
The diversity of hardware components within a single system calls for strategies for efficient cross-device data processing. For example, existing approaches to CPU/GPU co-processing distribute individual relational operators to the "most appropriate" device. While pleasantly simple, this strategy has a number of problems: it may leave the "inappro...
Adaptive indexing initializes and optimizes indexes incrementally, as a side effect of query processing. The goal is to achieve the benefits of indexes while hiding or minimizing the costs of index creation. However, index-optimizing side effects seem to turn read-only queries into update transactions that might, for example, create lock contention...
Physical design represents one of the hardest problems for database management systems. Without proper tuning, systems cannot achieve good performance. Offline indexing creates indexes a priori assuming good workload knowledge and idle time. More recently, online indexing monitors the workload trends and creates or drops indexes online. Adaptive in...
MonetDB is a state-of-the-art open-source column-store database management system targeting applications in need of analytics over large collections of data. MonetDB is actively used nowadays in health care, in telecommunications, as well as in scientific databases and in data management research, accumulating on average more than 10,000 downloads o...
The Database research group at CWI was established in 1985 and is supported by a scientific programmer and a system engineer to keep the machines running. The research group is working on MonetDB, an open-source columnar database system. A thorough daily regression testing infrastructure ensures that changes applied to the code base survive an atta...
SIGMOD has offered, since 2008, to verify the experiments published in the papers accepted at the conference. This year, we have been in charge of reproducing the experiments provided by the authors (repeatability), and exploring changes to experiment parameters (workability). In this paper, we assess the SIGMOD repeatability process in terms of pa...
There is a clear need nowadays for extremely large data processing. This is especially true in the area of scientific data management, where soon we expect data inputs in the order of multiple Petabytes. However, current data management technology is not suitable for such data sizes. In the light of such new database applications, we can rethink so...
Adaptive indexing is characterized by the partial creation and refinement of the index as side effects of query execution. Dynamic or shifting workloads may benefit from preliminary index structures focused on the columns and specific key ranges actually queried --- without incurring the cost of full index construction. The costs and benefits of ad...
Adaptive indexing is characterized by the partial creation and refinement of the index as side effects of query execution. Dynamic or shifting workloads may benefit from preliminary index structures focused on the columns and specific key ranges actually queried --- without incurring the cost of full index construction. The costs and benefits of ad...
Ideally, realizing the best physical design for the current and all subsequent workloads would impact neither performance nor storage usage. In reality, workloads and datasets can change dramatically over time and index creation impacts the performance of concurrent user and system activity. We propose a framework that evaluates the key premise of...
We demonstrate ROX, a run-time optimizer of XQueries, that focuses on finding the best execution order of XPath steps and relational joins in an XQuery. The problem of join ordering has been extensively researched, but the proposed techniques are still unsatisfying. These either rely on a cost model which might result in inaccurate estimations, or...
Traditional optimizers fail to pick good execution plans, when faced with increasingly complex queries and large data sets. This failure is even more acute in the context of XQuery, due to the structured nature of the XML language. To overcome the vulnerabilities of traditional optimizers, we have previously proposed ROX, a Run-time Optimizer for X...
We demonstrate ROX, a run-time optimizer of XQueries, that focuses on finding the best execution order of XPath steps and relational joins in an XQuery. The problem of join ordering has been extensively researched, but the proposed techniques are still unsatisfying. These either rely on a cost model which might result in inaccurate estimations, or...
SIGMOD 2008 was the first database conference that offered to test submitters' programs against their data to verify the repeatability of the experiments published [1]. Given the positive feedback concerning the SIGMOD 2008 repeatability initiative, SIGMOD 2009 modified and expanded the initiative with a workability assessment.
The holy grail for database architecture research is to find a solution that is Scalable & Speedy, to run on anything from small ARM processors up to globally distributed compute clusters, Stable & Secure, to service a broad user community, Small & Simple, to be comprehensible to a small team of programmers, Self-managing, to let it run out-of-the-...
Column-stores gained popularity as a promising physical design alternative. Each attribute of a relation is physically stored as a separate column, allowing queries to load only the required attributes. The overhead incurred is on-the-fly tuple reconstruction for multi-attribute queries. Each tuple reconstruction is a join of two columns based on...
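A small sketch of the on-the-fly tuple reconstruction mentioned above, assuming a toy table stored as per-attribute arrays aligned by row position (names and data are illustrative):

```python
# Sketch of column-store tuple reconstruction: each attribute is a separate
# array aligned by row position; a multi-attribute query first selects
# positions on one column, then fetches the other attributes at those
# positions (a positional join).

columns = {
    "id":    [1, 2, 3, 4],
    "name":  ["bolt", "nut", "screw", "washer"],
    "price": [0.10, 0.05, 0.12, 0.03],
}

def select_positions(column, predicate):
    """Scan one column and return the qualifying row positions."""
    return [pos for pos, value in enumerate(columns[column]) if predicate(value)]

def reconstruct(positions, attributes):
    """Positional join: fetch the requested attributes for each position."""
    return [tuple(columns[a][pos] for a in attributes) for pos in positions]

hits = select_positions("price", lambda p: p >= 0.10)
print(reconstruct(hits, ["id", "name", "price"]))   # [(1, 'bolt', 0.1), (3, 'screw', 0.12)]
```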
Optimization of complex XQueries combining many XPath steps and joins is currently hindered by the absence of good cardinality estimation and cost models for XQuery. Additionally, the state-of-the-art of even relational query optimization still struggles to cope with cost model estimation errors that increase with plan size, as well as with the ef...
The Armada model describes how a distributed database system evolves, using multiple nodes that together form the database. In such a system, posing a query involves continuously locating the right node until sufficient data to answer the query has been found. Locating a node involves making a connection to that node. Since making a connection is exp...
In the past decades, advances in speed of commodity CPUs have far outpaced advances in RAM latency. Main-memory access has therefore become a performance bottleneck for many computer applications; a phenomenon that is widely known as the "memory wall." In this paper, we report how research around the MonetDB database system has led to a redesign of...
This paper reports on the results of an independent evaluation of the techniques presented in the VLDB 2007 paper "Scalable Semantic Web Data Management Using Vertical Partitioning", authored by D. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach (1). We revisit the proposed benchmark and examine both the data and query space coverage. The...
This paper reports on the results of an independent evaluation of the techniques presented in the VLDB 2007 paper "Scalable Semantic Web Data Management Using Vertical Partitioning", authored by D. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach [1]. We revisit the proposed benchmark and examine both the data and query space coverage. The benchma...
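The vertical partitioning scheme under evaluation can be illustrated with a small sketch; the triples, property names, and helper function are made up for the example:

```python
# Sketch of vertically partitioned RDF storage: a single triple table
# (subject, property, object) is split into one two-column table per
# property. A lookup on one property then scans only that table.

from collections import defaultdict

triples = [
    ("paper1", "author", "Abadi"),
    ("paper1", "year", "2007"),
    ("paper2", "author", "Madden"),
    ("paper2", "year", "2007"),
]

# Vertical partitioning: property -> list of (subject, object) pairs.
partitions = defaultdict(list)
for s, p, o in triples:
    partitions[p].append((s, o))

def objects_for(prop, subject):
    """Read only the table for `prop` instead of scanning all triples."""
    return [o for s, o in partitions[prop] if s == subject]

print(objects_for("author", "paper1"))   # ['Abadi']
print(sorted(partitions.keys()))         # ['author', 'year']
```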
This paper presents an extensive and detailed experimental evaluation of XQuery processors. The study consists of running five publicly available XQuery benchmarks—the Michigan benchmark (MBench), XBench, XMach-1, XMark and X007—on six XQuery processors, three stand-alone (file-based) XQuery processors (Galax, Qizx/Open, Saxon-B) and three XML/XQue...
SIGMOD 2008 was the first database conference that offered to test submitters' programs against their data to verify the experiments published. This paper discusses the rationale for this effort, the community's reaction, our experiences, and advice for future similar efforts.
Database indices provide a non-discriminative navigational infrastructure to localize tuples of interest. Their maintenance cost is taken during database updates. In this paper, we study the complementary approach, addressing index maintenance as part of query processing using continuous physical reorganization, i.e., cracking the database...
A cracked database is a datastore continuously reorganized based on operations being executed. For each query, the data of interest is physically reclustered to speed-up future access to the same, overlapping or even disjoint data. This way, a cracking DBMS self-organizes and adapts itself to the workload. So far, cracking has been considered for s...
Performance, performance and performance used to be the three things that really mattered in database research. Most of our published works indeed include an experimental evaluation of the proposed techniques. However, such evaluations are sometimes seen as a "must-have" eating up the valuable space where one could describe new ideas. The experi...
The data on the web, in digital libraries, in scientific repositories, etc. continues to grow at an increasing rate. Distribution is a key solution to overcome this data explosion. However, existing solutions are mostly based on architectures with a single point of failure. In this paper, we present Armada, a model for a database architecture to...
This report summarizes the presentations and discussions that occurred during the Second International Workshop on Data Management on Modern Hardware (DaMoN). DaMoN was held in Chicago on June 25th, 2006, and was collocated with ACM SIGMOD 2006. The aim of this one-day workshop is to bring together researchers interested in optimizing database perf...
In this paper, we present data threaded execution, a new strategy to exploit both pipelining and intra-operator parallelism in shared-everything environments. Data threaded execution is intuitive and straightforward to implement, yet resistant against workload estimation errors and against the discretization error of processor scheduling, t...
Relational XQuery processors aim at leveraging mature relational DBMS query processing technology to provide scalability and efficiency. To achieve this goal, various storage schemes have been proposed to encode the tree structure of XML documents in flat relational tables. Basically, two classes can be identified: (1) encodings using fixed-length...
Relational XQuery systems try to re-use mature relational data management infrastructures to create fast and scalable XML database technology. This paper describes the main features, key contributions, and lessons learned while implementing such a system. Its architecture consists of (i) a range-based encoding of XML documents into relational table...
This paper presents an extensive and detailed experimental evaluation of XQuery processors. The study consists of running five publicly available XQuery benchmarks --- the Michigan benchmark (MBench), XBench, XMach-1, XMark and X007 --- on six XQuery processors, three stand-alone (file-based) XQuery processors (Galax, Qizx/Open, Saxon-B) and three...
lowed: based on the extensible relational database kernel MonetDB [2], Pathfinder provides highly efficient and scalable XQuery technology that scales beyond 10 GB XML input instances on commodity hardware. Pathfinder requires only local extensions to the underlying DBMS's kernel, such as the staircase join operator [7, 9]. A join recognition logic...
Various techniques have been proposed for efficient evaluation of XPath expressions, where the XPath location steps are rooted in a single sequence of context nodes. Among these techniques, the staircase join allows XPath location steps along arbitrary axes to be evaluated in at most one scan over the XML document, exploiting the XPath accelerator enco...
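A rough sketch of the staircase-join idea over a pre/size encoding, assuming a tiny hand-built document; node numbers and function names are illustrative, not the actual operator:

```python
# Sketch of the staircase-join idea over a pre/size document encoding:
# evaluate the descendant axis for a whole sequence of context nodes in a
# single sequential pass over the document, after pruning context nodes
# whose subtrees are already covered by an earlier context node.

# Document nodes in document (pre) order: (pre, size), where size is the
# number of descendants, so node p's subtree occupies pre values (p, p + size].
doc = [(0, 6), (1, 2), (2, 0), (3, 0), (4, 2), (5, 0), (6, 0)]

def descendants(context):
    """One-scan descendant step for a pre-sorted context sequence."""
    # Prune: drop context nodes that lie inside a previous context subtree.
    pruned, cover_end = [], -1
    for pre, size in sorted(context):
        if pre > cover_end:
            pruned.append((pre, size))
            cover_end = pre + size
    # Single sequential scan over the document, sliding along the "staircase".
    result, i = [], 0
    for pre, size in doc:
        while i < len(pruned) and pre > pruned[i][0] + pruned[i][1]:
            i += 1                      # moved past the current context subtree
        if i < len(pruned) and pruned[i][0] < pre <= pruned[i][0] + pruned[i][1]:
            result.append(pre)
    return result

# Context {1, 2, 4}: node 2 is inside node 1's subtree and gets pruned.
print(descendants([(1, 2), (2, 0), (4, 2)]))   # [2, 3, 5, 6]
```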
Pathfinder/MonetDB is a collaborative effort of the University of Konstanz, the University of Twente, and the Centrum voor Wiskunde en Informatica (CWI) in Amsterdam to develop an XQuery compiler that targets an RDBMS back-end. The author of this abstract is a student at the University of Konstanz and spent six months as an intern at the CWI, designi...
We outline an efficient ACID-compliant mechanism for structural inserts and deletes in relational XML document storage that uses a region-based pre/size/level encoding (equivalent to the pre/post encoding). Updates to such node-numbering schemes are considered prohibitive (i.e., physical cost linear in the document size), because structural updates caus...
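A sketch of the region-based pre/size/level encoding and of why a naive structural insert is considered prohibitive (every later node must be renumbered); the tiny document and the renumbering helper are illustrative assumptions, not the paper's mechanism:

```python
# Region-based pre/size/level encoding of an XML tree, and a naive
# structural insert that renumbers all following nodes (O(n) per insert).

# Each node: (pre, size, level). Node a is an ancestor of node d iff
# a.pre < d.pre <= a.pre + a.size.
nodes = [
    (0, 4, 0),   # <root>
    (1, 1, 1),   #   <a>
    (2, 0, 2),   #     <b/>
    (3, 1, 1),   #   <c>
    (4, 0, 2),   #     <d/>
]

def is_ancestor(a, d):
    return a[0] < d[0] <= a[0] + a[1]

print(is_ancestor(nodes[1], nodes[2]))   # True:  <a> contains <b/>
print(is_ancestor(nodes[1], nodes[4]))   # False: <d/> is under <c>

def naive_insert_leaf(nodes, parent_pre):
    """Insert one leaf as last child of `parent_pre` by renumbering."""
    new, insert_at = [], None
    for pre, size, level in nodes:
        if pre == parent_pre:
            insert_at = pre + size + 1          # slot right after the parent's subtree
            new.append((pre, size + 1, level))  # parent's region grows
        elif insert_at is not None and pre >= insert_at:
            new.append((pre + 1, size, level))  # every later node shifts by one
        else:
            # earlier nodes whose region covers the parent also grow
            grows = pre < parent_pre <= pre + size
            new.append((pre, size + 1 if grows else size, level))
    parent_level = {p: l for p, s, l in nodes}[parent_pre]
    new.append((insert_at, 0, parent_level + 1))
    return sorted(new)

print(naive_insert_leaf(nodes, 1))
```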
Query performance strongly depends on finding an execution plan that touches as few superfluous tuples as possible. The access structures deployed for this purpose, however, are non-discriminative. They assume every subset of the domain being indexed is equally important, and their structures cause a high maintenance overhead during updates. This a...
The authors introduce concepts for loading large amounts of XML documents into databases where the documents are stored and maintained. The goal is to make XML databases as unobtrusive in multi-tier systems as possible and at the same time provide as many services defined by the XML standards as possible. The ubiquity of XML has sparked great inter...
As CPUs become more powerful with Moore's law and memory latencies stay constant, the impact of the memory access performance bottleneck continues to grow on relational operators like join, which can exhibit random access on a memory region larger than the hardware caches. While cache-conscious variants for various relational algorithms have been d...
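A hedged sketch of the cache-conscious partitioning idea, in the spirit of radix-clustered hash joins: both inputs are first clustered on the low bits of the join key's hash so that each partition's hash table can stay cache-resident, then the partitions are joined pairwise. The bit count, data, and function names are illustrative, not MonetDB's radix join code:

```python
# Cache-conscious, radix-partitioned hash join sketch: partition both
# inputs on the low bits of the join key's hash, then build and probe a
# small hash table per partition pair.

RADIX_BITS = 2                      # 2^2 = 4 partitions; real systems tune this to cache size
MASK = (1 << RADIX_BITS) - 1

def radix_partition(rows, key):
    parts = [[] for _ in range(1 << RADIX_BITS)]
    for row in rows:
        parts[hash(row[key]) & MASK].append(row)
    return parts

def radix_hash_join(left, right, key):
    result = []
    for lpart, rpart in zip(radix_partition(left, key), radix_partition(right, key)):
        table = {}
        for row in lpart:                       # build a small, cache-sized hash table
            table.setdefault(row[key], []).append(row)
        for row in rpart:                       # probe only the matching partition
            for match in table.get(row[key], []):
                result.append({**match, **row})
    return result

customers = [{"cust": 1, "name": "ACME"}, {"cust": 2, "name": "Widgets Inc"}]
orders = [{"cust": 1, "item": "bolt"}, {"cust": 2, "item": "nut"}, {"cust": 1, "item": "screw"}]
print(radix_hash_join(customers, orders, "cust"))
```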
This paper proposes a way to cleanly integrate relational databases and XML documents. The main idea is to draw a clear line of demarcation between the two concepts by modelling XML documents as a new atomic SQL type. The standardised XML tools like XPath, XQuery and XSLT are then user-defined functions that operate on this type. Well-defined interope...
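A sketch of that integration style, approximated here with SQLite and ElementTree rather than the paper's system: XML documents live in an ordinary column, and an XPath-style extraction is exposed as a user-defined SQL function. The table, data, and the xpath_first helper are assumptions made for the example:

```python
# XML values stored in an ordinary SQL column; XPath-style extraction
# exposed as a user-defined function callable from SQL.

import sqlite3
import xml.etree.ElementTree as ET

def xpath_first(document, path):
    """Return the text of the first node matching `path`, or None."""
    node = ET.fromstring(document).find(path)
    return None if node is None else node.text

db = sqlite3.connect(":memory:")
db.create_function("xpath_first", 2, xpath_first)

db.execute("CREATE TABLE papers (id INTEGER, doc TEXT)")
db.execute("INSERT INTO papers VALUES (1, '<paper><title>MonetDB/XQuery</title></paper>')")
db.execute("INSERT INTO papers VALUES (2, '<paper><title>Armada</title></paper>')")

# The XML value behaves like any other column; the XPath step runs inside SQL.
for row in db.execute("SELECT id, xpath_first(doc, './title') FROM papers"):
    print(row)
```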
Database vendors and researchers have been responding to the establishment of XML [45] as the premier data interchange language for Internet applications with the integration of XML processing capabilities into Database Management Systems. The new features fall into two categories: XML-enabled interfaces allow the DBMS to speak and understand XML fo...
Accurate prediction of operator execution time is a prerequisite for database query optimization. Although extensively studied for conventional disk-based DBMSs, cost modeling in main-memory DBMSs is still an open issue. Recent database research has demonstrated that memory access is more and more becoming a significant—if not the major—cost...