
Eduardo Cunha de Almeida
- PhD
- Professor (Associate) at Federal University of Paraná
About
94 Publications
30,587 Reads
636 Citations
Introduction
My research interest is in database systems, especially distributed query and storage engines. Current projects include: Chameleon, an adaptive distributed query processing engine for Hive; ControVol, a schema evolution adviser for NoSQL data stores; and DoricStore, a column-store for emerging hardware.
Current institution
Federal University of Paraná
Additional affiliations
May 2004 - October 2005
April 2001 - May 2004
October 1998 - April 2001
Education
October 2005 - February 2009
March 2002 - August 2004
February 1995 - December 1999
Publications (94)
Data sampling over data streams is common practice to allow the analysis of data in real-time. However, sampling over data streams becomes complex when the stream does not fit in memory, and worse yet, when the length of the stream is unknown. A well-known technique for sampling data streams is the Reservoir Sampling. It requires a fixed-size reser...
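For readers unfamiliar with the technique, classic Reservoir Sampling (Algorithm R) can be sketched in a few lines of Python; this is a minimal illustration of the general method, not the variant studied in the paper:

    import random

    def reservoir_sample(stream, k):
        # Maintain a uniform random sample of k items from a stream whose
        # length is unknown in advance, using only O(k) memory.
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)    # fill the fixed-size reservoir first
            else:
                j = random.randint(0, i)  # random slot in [0, i], inclusive
                if j < k:
                    reservoir[j] = item   # keep item with probability k/(i+1)
        return reservoir

Every prefix of the stream is sampled uniformly, which is what makes the technique attractive when the stream length is unknown.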
A cracked database is a physically self-organized database based on the predicates of queries being executed. The goal is to create self-adaptive indices as a side-product of query processing. Our goal is to critically review the current cracking algorithms and index structures with respect to OLAP and mixed workloads. In this dissertation, we pres...
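As a minimal illustration of the cracking idea (a generic sketch, not one of the algorithms reviewed in the dissertation), a single "crack-in-two" step physically partitions a column around a query predicate, just like a quicksort partition; a cracker index then maps the pivot to the returned split position:

    def crack_in_two(col, lo, hi, pivot):
        # Reorder col[lo:hi] in place so that values < pivot precede
        # values >= pivot; return the split position for the cracker index.
        i, j = lo, hi - 1
        while i <= j:
            if col[i] < pivot:
                i += 1
            else:
                col[i], col[j] = col[j], col[i]
                j -= 1
        return i

Each query refines the partitions its predicates touch, so the index emerges as a side-product of query processing.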
Building scalable web applications on top of NoSQL data stores is becoming common practice. Many of these data stores can easily be accessed programmatically, and do not enforce a schema. Software engineers can design the data model on the go, a flexibility that is crucial in agile software development. The typical tasks of database schema manageme...
MapReduce query processing systems translate a query statement into a query plan, consisting of a set of MapReduce jobs to be executed in distributed machines. During query translation, these query systems uniformly allocate computing resources to each job by delegating the same tuning to the entire query plan. However, jobs may implement their own...
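To make the uniform-allocation problem concrete, here is a purely illustrative sketch (the parameter name is a standard Hadoop knob, but the job names and values are hypothetical), contrasting one configuration delegated to the whole plan with per-job tuning:

    # Uniform case: every job in the query plan inherits the same setup.
    uniform = {job: {"mapreduce.map.memory.mb": 2048}
               for job in ("job-1-scan", "job-2-join", "job-3-aggregate")}

    # Per-job case: each job gets resources matching its own profile.
    per_job = {
        "job-1-scan":      {"mapreduce.map.memory.mb": 1024},  # light filter
        "job-2-join":      {"mapreduce.map.memory.mb": 4096},  # memory-hungry
        "job-3-aggregate": {"mapreduce.map.memory.mb": 2048},
    }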
Publicly available datasets are subject to new versions, with each version potentially reflecting changes to the data. These changes may involve adding or removing attributes, changing data types, and modifying values or their semantics. Integrating these datasets into a relational database poses a significant challenge: How to keep track of the ev...
Publicly available datasets are subject to new versions, with each new version potentially reflecting changes to the data. These changes may involve adding or removing attributes, changing data types, and modifying values or their semantics. Integrating these datasets into a database poses a significant challenge: how to keep track of the evolving...
This paper introduces an approach for discovering denial constraints (DCs) to identify faults in transmission lines. However, the considerable volume of data in the studied scenario makes traditional DC discovery impractical due to lengthy execution times. We propose an alternative DC discovery approach that uses streaming windows to address this i...
Visualizing violations of data-quality rules is of great use in data cleaning. A widely used operation for this visualization is projecting the combinations of tuples that violate the rules. However, this operation is costly when we consider state-of-the-art formalisms in data cleaning, such as denial c...
An array database is software that uses non-linear data structures to store and process multidimensional data, including images and time series. As multi-dimensional data applications are generally data-intensive, array databases can benefit from multi-processing systems to improve performance. However, when dealing with Non-Uniform Memory Access...
Transmission lines are fundamental components of the electric power system, demanding special attention from the protection system due to the vulnerability of these lines. This paper presents a method for fault location in transmission lines using data from a single terminal, without requiring explicit feature engineering by a domain expert. The faul...
Transmission lines demand special attention from the protection mechanisms of the electric power system, since the occurrence of faults can lead to the unavailability of the electric power supply. In light of this, the present work presents a method based on recurrent neural networks (LSTM and GRU) for the localization of fau...
Distributed database systems store and manipulate data on multiple machines. In these systems, the processing cost of query operations is mainly impacted by the data access latency between machines over the network. With recent technology advances in programmable network devices, the network switches provide new opportunities for dynamically managi...
Parallel processing is a solution to improve the performance of database queries, reducing response time and increasing throughput in query processing. With the evolution of hardware, new technologies for parallelism have emerged. One of them is the use of GPUs (Graphics Processing Units) for general-purpose processing. The...
Array database management systems (Array databases) are specialized software to streamline multi-dimensional data processing. Due to the data-hungry nature of multi-dimensional data applications (e.g., images and time series), array databases must ideally provide linear speedup when using a multi-processing system. However, when dealing with non-un...
The detection of constraint-based errors is a critical task in many data cleaning solutions. Previous works perform the task either using traditional data management systems or using specialized systems that speed up error detection. Unfortunately, both approaches may fail to execute in a reasonable time or even exhaust the available memory in the...
Data dependencies are fundamental in important areas of data management, such as data quality, data integration, and data analysis. This thesis presents relevant contributions to important problems related to such dependencies. The first is related to dependency detection. We study the detection of denial constraints because...
The cost of processing a query in distributed database systems is directly tied to the cost of transferring data over the network. Software-Defined Wide Area Network (SD-WAN) is a technology that allows (re)programming network devices via software. Its programmability provides new possibilities for managing topolog...
This work makes contributions that reach central problems in connection with data dependencies. The first problem regards the discovery of dependencies of high expressive power. We introduce an efficient algorithm for the discovery of denial constraints: a type of dependency that has enough expressive power to generalize other important types of de...
With the growth of the urban population, cities have a great need for methods supported by Artificial Intelligence techniques to improve urban mobility in traffic increasingly congested by the growing number of vehicles in circulation. Studies show that the traffic congestion problem is worsened by up to 30%...
Facial recognition is already part of the lives of many of us. Most current smartphones unlock the device by using the face to identify the owner and grant access to the data. However, it has also been gaining ground for other purposes, especially in corporate solutions such as control of...
Existing approaches for image-based Automatic Meter Reading (AMR) have been evaluated on images captured in well-controlled scenarios. However, real-world meter reading presents unconstrained scenarios that are way more challenging due to dirt, various lighting conditions, scale variations, in-plane and out-of-plane rotations, among other factors....
Software-defined wide-area networking (SD-WAN) is a technology that allows (re)programming network devices via software. In this case, network devices have greater administrative flexibility, enabling network programmability. This paper presents an evaluation of the distributed processing of hash join operators of datab...
SQL-on-Hadoop engines such as Hive provide a declarative interface for processing large-scale data over computing frameworks such as Hadoop. The underlying frameworks contain a large number of configuration parameters that can significantly impact performance, but which are hard to tune. The problem of automatic parameter tuning has become a lively...
Brazilian public universities play a key role as promoters of the country's economic, technological, and social development. Several teaching and research institutions, with their laboratories and groups, have contributed strongly not only to the training of high-quality human resources, but also to the scientific and technol...
Hash Tables play a lead role in modern database systems, finding applications in the execution of joins, grouping, indexing, removal of duplicates, and accelerating ad hoc queries. We focus on Cuckoo Hash to deal with collisions, guaranteeing at most two memory accesses for data retrieval, in the worst case. However, building the Cuckoo Table with...
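A minimal, illustrative cuckoo table in Python (two tables and two hash functions; not the implementation from the paper, and with no duplicate-key handling) shows where the two-access lookup guarantee comes from, and why insertion is the hard part:

    class CuckooHash:
        def __init__(self, capacity=8):
            self.cap = capacity
            self.t = [[None] * capacity, [None] * capacity]

        def _h(self, which, key):
            return hash((which, key)) % self.cap

        def get(self, key):
            for w in (0, 1):              # at most two probes, one per table
                slot = self.t[w][self._h(w, key)]
                if slot and slot[0] == key:
                    return slot[1]
            return None

        def put(self, key, value, max_kicks=32):
            item, w = (key, value), 0
            for _ in range(max_kicks):
                idx = self._h(w, item[0])
                if self.t[w][idx] is None:
                    self.t[w][idx] = item
                    return
                # Evict the occupant and retry it in the other table.
                self.t[w][idx], item = item, self.t[w][idx]
                w ^= 1
            self._grow()                  # too many kicks: likely a cycle
            self.put(*item)

        def _grow(self):
            old = [s for tbl in self.t for s in tbl if s]
            self.cap *= 2
            self.t = [[None] * self.cap, [None] * self.cap]
            for k, v in old:
                self.put(k, v)

The eviction loop is exactly the building cost the abstract alludes to: lookups are constant, but inserts may cascade and occasionally force a rebuild.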
Maintaining data consistency is known to be hard. Recent approaches have relied on integrity constraints to deal with the problem - correct and complete constraints naturally work towards data consistency. State-of-the-art data cleaning frameworks have used the formalism known as denial constraint (DC) to handle a wide range of real-world constrain...
The rapid growth of "big-data" intensified the problem of data movement when processing data analytics: Large amounts of data need to move through the memory up to the CPU before any computation takes place. To tackle this costly problem, Processing-in-Memory (PIM) inverts the traditional data processing by pushing computation to memory with an imp...
Denial constraints (DCs) express rules that identify inconsistencies in a database. Composing them, however, is a burdensome task. We propose a method that discovers DCs based on evidence extracted from the tuples of a dataset. Our method discovers reliable DCs even if the dataset contains errors. Our experi...
In this article, we present our vision of how Database Management Systems (DBMS) can integrate Processing-in-Memory (PIM) into query processing. PIM promises to mitigate the classic "memory-wall" and "energy-wall" problems present in computer architectures, which are amplified by data movement in the memory hie...
Hash Tables play a lead role in modern database systems during the execution of joins, grouping, indexing, removal of duplicates, and accelerating point queries. In this paper, we focus on Cuckoo Hash, a technique to deal with collisions guaranteeing that data is retrieved with at most two memory accesses in the worst case. However, building the Cuc...
SQL-on-Hadoop processing engines have become state-of-the-art in data lake analysis. However, the skills required to tune such systems are rare. This has inspired automated tuning advisors which profile the query workload and produce tuning setups for the low-level MapReduce jobs. Yet with highly dynamic query workloads, repeated re-tuning costs ti...
The recent trend of Processing-in-Memory (PIM) promises to tackle the memory and energy wall problems lurking in the data movement around the memory hierarchy, like in data analysis applications. In this paper, we present our vision on how database systems can embrace PIM in query processing. We share with the community an empirical analysis of the...
A SQL-on-Hadoop query consists of a workflow of MapReduce jobs with a single point of configuration. This means that the developer tunes hundreds of tuning parameters directly in the query source code (or via terminal interface), but the system assumes the same configuration to every running job. The lurking problem is that the system allocates com...
During the parallel execution of queries in Non-Uniform Memory Access (NUMA) systems, the Operating System (OS) maps the threads (or processes) from modern database systems to the available cores among the NUMA nodes using the standard node-local policy. However, such non-smart mapping may result in inefficient memory activity, because shared data...
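As a small Linux-side illustration of what a smarter mapping could do (the core layout is hypothetical, and this is not the mechanism proposed in the paper), a worker can be pinned to the cores of the NUMA node holding its data instead of being placed by the default policy:

    import os

    NODE0_CORES = {0, 1, 2, 3}    # hypothetical cores of NUMA node 0
    NODE1_CORES = {4, 5, 6, 7}    # hypothetical cores of NUMA node 1

    def pin_worker(pid, cores):
        # Restrict a worker process to the given core set (Linux only).
        os.sched_setaffinity(pid, cores)

    # e.g., keep the calling worker next to the partition it scans:
    pin_worker(0, NODE0_CORES)    # pid 0 means "the calling process"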
Integrity constraints (ICs) are meant for many data management tasks. However, some types of ICs can express semantic rules that other ICs cannot, or vice versa. Denial constraints (DCs) are known to be a response to this expressiveness issue because they generalize important types of ICs, such as functional dependencies (FDs), conditional FDs, an...
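For a concrete feel of that expressive power, consider a textbook-style DC (the column names are illustrative, not from the paper) stating that no tuple may have a higher salary yet a lower tax rate than another; checking it amounts to scanning tuple pairs:

    from itertools import combinations

    # DC: there are no tuples t1, t2 with
    #     t1.salary > t2.salary and t1.tax_rate < t2.tax_rate
    def dc_violations(rows):
        return [(a, b) for a, b in combinations(rows, 2)
                if (a["salary"] > b["salary"] and a["tax_rate"] < b["tax_rate"])
                or (b["salary"] > a["salary"] and b["tax_rate"] < a["tax_rate"])]

No FD can state this pairwise inequality rule, which is precisely the kind of constraint DCs add to the picture.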
In this paper, we use the potential of the near-data parallel computing presented in the Hybrid Memory Cube (HMC) to process near-data query filters and mitigate the data movement through the memory hierarchy up to the x86 processor. In particular, we present a set of extensions to the HMC Instruction Set Architecture (ISA) to filter data in-memory...
Semantic query optimization uses dependencies between attributes to formulate query transformations and revise the number of processed rows, with direct impact on performance. Commercial databases present facilities to define dependencies as not enforced constraints. The goal is to help the query optimizer in cases where the database is denormalize...
The massive growth in the volume of data and the demand for big data utilisation has led to an increasing prevalence of Hadoop Distributed File System (HDFS) solutions. However, the performance of Hadoop and indeed HDFS has some limitations and remains an open problem in the research community. The ultimate goal of our research is to develop an ada...
The recent Hybrid Memory Cube (HMC) is a smart memory which includes functional units inside one logic layer of the 3D stacked memory design. In order to execute instructions inside the Hybrid Memory Cube (HMC), the processor needs to send instructions to be executed near data, keeping most of the pipeline complexity inside the processor. Thus, con...
Processor-in-Memory (PIM) architectures, such as the Hybrid Memory Cube (HMC), are emerging nowadays as a solution for processing large amounts of data directly inside the memory. In this area, several researchers are proposing and evaluating new instructions and new PIM architectures. For such evaluations, trace-driven simulators, such as the Simulator...
The Brazilian government is maintaining several digital inclusion projects, providing computers and Internet connections to developing regions around the country. However, these projects can only succeed if they are constantly assessed; namely, the projects' infrastructure deployment must be closely monitored and evaluated. In this paper, we introduc...
A considerable portion of the time spent during database operation processing consists of moving data around the memory hierarchy rather than actually processing it. The emergence of smart memories, such as the new Hybrid Memory Cube (HMC), allows mitigating this memory-wall problem by executing instructions directly inside the memory, reducing data m...
Functional dependencies (FDs) represent integrity constraints widely studied in the context of data characterization. In this work, we explore the automatic discovery of FDs and describe a method for selecting those that are relevant with respect to the semantics of the workload. The experimental evaluation shows that the dependencies...
In the parallel execution of queries in Non-Uniform Memory Access (NUMA), the operating system maps database processes/threads (i.e., workers) to the available cores across the NUMA nodes. However, this mapping results in poor cache activity with many minor page faults and slower query response time when workers and data are allocated in different...
Nowadays, applications that predominantly perform lookups over large databases are becoming more popular with column-stores as the database system architecture of choice. For these applications, Hybrid Memory Cubes (HMCs) can provide bandwidth of up to 320 GB/s and represent the best choice to keep the throughput for these ever-increasing database...
In database cracking, a database is physically self-organized into cracked partitions with cracker indices boosting the access to these partitions. The AVL Tree is the data structure of choice to implement cracker indices. However, it is particularly cache-inefficient for range queries, because the nodes accessed only a few times (i.e., "Cold D...
As a new era of “Big Data” comes, contemporary database management systems (DBMS) introduced new functions to satisfy new requirements for big volume and velocity applications. Although the development agenda goes at full pace, the current testing agenda does not keep up, especially to validate non-functional requirements, such as: performance and...
In this paper, we focus on the aggregate query model implemented over NoSQL document-stores for read-mostly databases. We discuss that the aggregate query model can be a good fit for read-mostly databases if the following design requirements are met: on-line time range queries, aggregates with predefined filters, frequent schema evolution and no ad...
The Federal University of Paraná (UFPR) hosts the cross-disciplinary research group C3SL, which has been investigating open-source solutions for digital inclusion, covering different topics over the last fifteen years. It has been acting as a partner of different Brazilian public institutions and governments, backing them up for strategic choic...
NoSQL data stores are popular backends for managing big data that is evolving over time: Due to their schema-flexibility , a new release of the application does not require a full migration of data already persisted in production. Instead, using object-NoSQL mappers, developers can specify lazy data migrations that are executed on-the-fly, when a l...
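A minimal sketch of such a lazy migration (the field names and version tag are made up for illustration): a legacy document is upgraded only at the moment it is loaded, rather than in a bulk rewrite:

    def migrate_on_read(doc):
        # On-the-fly migration in the style of object-NoSQL mappers:
        # stale documents are upgraded when first touched.
        if doc.get("_schema_version", 1) < 2:
            doc["full_name"] = doc.pop("name", "")   # rename introduced in v2
            doc["_schema_version"] = 2
        return doc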
The invention provides a method and device for centrally allocating a set of computing resources to each of a plurality of distributed processing means, each of which processes a portion of a database query using said set of computing resources, in order to distributively process the entire database query. The method is remarkable in that the resou...
In building software-as-a-service applications, a flexible development environment is key to shipping early and often. Therefore, schema-flexible data stores are becoming more and more popular. They can store data with heterogeneous structure, allowing for new releases to be pushed frequently, without having to migrate legacy data first. However, t...
We consider the task of building Big Data software systems, offered as software-as-a-service. These applications are commonly backed by NoSQL data stores that address the proverbial Vs of Big Data processing: NoSQL data stores can handle large volumes of data and many systems do not enforce a global schema, to account for structural variety in data...
Chameleon is a tuning advisor to support the performance tuning decision-making of MapReduce administrators and users. In MapReduce query processing, a query is translated into a set of jobs, i.e., a query plan. For administrators, Chameleon can be a powerful tool for observing query plan workloads and their impact in large-cluster machine setups in ter...
Over the last decade, large amounts of concurrent transactions have been generated from different sources, such as Internet-based systems, mobile applications, smart homes and cars. High-throughput transaction processing is becoming commonplace; however, there is no testing technique for validating non-functional aspects of DBMS under transaction f...
Context: Large-scale distributed systems are becoming commonplace with the large popularity of peer-to-peer and cloud computing. The increasing importance of these systems contrasts with the lack of integrated solutions to build trustworthy software. A key concern of any large-scale distributed system is the validation of global properties, which ca...
This paper briefly presents a model for monitoring a large, heterogeneous and geographically scattered computer park. The data collection is performed by a software agent. The collected data are sent to the central server over the Internet, and stored by the storage system. An on-line portal makes up the visualization system, featuring charts, repo...
The availability of Distributed Database Management Systems (DDBMS) is related to the probability of being up and running at a given point in time and to the management of failures. One well-known and widely used mechanism to ensure availability is replication, which includes performance impact on maintaining data replicas across the DDBMS's machin...
Transactional database management systems (DBMS) have been successful at supporting traditional transaction processing workloads. However, web-based applications that tend to generate huge numbers of concurrent business operations are pushing DBMS performance over their limits, thus threatening overall system availability. Then, a crucial question...
The design of the NoSQL schema has a direct impact on the scalability of web applications. Especially for developers with little experience in NoSQL stores, the risks inherent in poor schema design can be incalculable. Worse yet, the issues will only manifest once the application has been deployed, and the growing user base causes highly concurrent...
Nowadays, large-scale systems are commonplace in any kind of application. The popularity of the web created a new environment in which applications need to be highly scalable due to the data tsunami generated by a huge load of requests (i.e., connections and business operations). In this context, the main question is to validate how far th...
Teaching web development in Computer Science undergraduate courses is a difficult task. Often, there is a gap between the students' experiences and the reality in the industry. As a consequence, the students are not always well-prepared once they get the degree. This gap is due to several reasons, such as the complexity of the assignments, the work...
Often corporations need tools to improve their decision making in a competitive market. In general, these tools are based on data warehouse platforms to manage and analyze large amounts of data. However, several of these corporations do not have enough resources to buy such platforms because of the high cost. This work is dedicated to a feasibility...
In relational database systems the optimization of select-project-join queries is a combinatorial problem. The use of exhaustive search methods is prohibitive because of the exponential increase of the search space. Randomized searches are used to find near optimal plans in polynomial time. In this paper, we investigate the large join query optimiz...
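The simplest member of the randomized family can be sketched as follows (a generic stand-in, not the strategies evaluated in the paper): sample join orders at random and keep the cheapest under a given cost model, trading optimality for polynomial effort:

    import random

    def random_join_order_search(tables, cost, tries=1000):
        # Sample random permutations instead of enumerating all n! orders.
        best, best_cost = None, float("inf")
        for _ in range(tries):
            order = random.sample(tables, len(tables))
            c = cost(order)
            if c < best_cost:
                best, best_cost = order, c
        return best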
MapReduce (MR) is the most popular solution to build applications for large-scale data processing. These applications are often deployed on large clusters of commodity machines, where failures happen constantly due to bugs, hardware problems, and outages. Testing MR-based systems is hard, since a great deal of test-harness effort is needed to execute...
Typical testing architectures for distributed software rely on a centralized test controller, which decomposes test cases into steps and deploys them across distributed testers. The controller also guarantees the correct execution of test steps through synchronization messages. These architectures are not scalable when testing large-scale distributed...
Testing distributed systems is challenging. Peer-to-peer (P2P) systems are composed of a high number of concurrent nodes distributed across the network. The nodes are also highly volatile (i.e., free to join and leave the system at any time). In this kind of system, a great deal of control should be carried out by the test harness, including: volat...
Peer-to-peer (P2P) offers good solutions for many applications such as large data sharing and collaboration in social networks. Thus, it appears as a powerful paradigm to develop scalable distributed applications, as reflected by the increasing number of emerging projects based on this technology. However, building trustworthy P2P applications is d...
Peer-to-peer (P2P) offers good solutions for many distributed applications, such as sharing large amounts of data and/or supporting collaboration in social networks. It therefore appears as a powerful paradigm for developing scalable distributed applications, as shown by the growing number of new...
Developing peer-to-peer (P2P) systems is hard because they must be deployed on a high number of nodes, which can be autonomous, refusing to answer some requests or even unexpectedly leaving the system. Such volatility of nodes is a common behavior in P2P systems and can be interpreted as faults during tests. In this paper, we propose a framework f...
Peer-to-peer (P2P) is becoming a key technology for software development, but still lacks integrated solutions to build trust in the final software, in terms of correctness and security. Testing such systems is difficult because of the high numbers of nodes which can be volatile. In this paper, we present a framework for testing volatility of P2P s...
Typical distributed testing architectures decompose test cases in actions and dispatch them to different nodes. They use a central test controller to synchronize the action execution sequence. This architecture is not fully adapted to large scale distributed systems, since the central controller does not scale up. This paper presents two approaches...
Testing peer-to-peer (P2P) systems is difficult because of the high numbers of nodes which can be heterogeneous and volatile. A test case may be composed of several ordered actions that may be executed on different nodes. To ensure action ordering and the correct behavior of the test case, a synchronization mechanism is required. In this paper, we...
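A toy version of the required synchronization (local threads stand in for distributed testers, whereas the actual systems coordinate with messages): a barrier guarantees that no tester starts action i+1 before every tester has finished action i:

    import threading

    N_TESTERS = 3
    barrier = threading.Barrier(N_TESTERS)

    def tester(node_id, actions):
        for action in actions:
            barrier.wait()        # global ordering point between test steps
            action(node_id)       # run this node's part of the current step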
Today's DBMSs constantly face large-scale workloads (e.g., the internet) and require a reliable tool to benchmark them under a similar workload. Usually, benchmarking tools simulate a multi-user workload within a single machine. However, this precludes large-scale benchmarking and also introduces deviations in the result. In this paper, we present a sol...