
Abstract

High performance querying and ad-hoc querying are commonly viewed as mutually exclusive goals in massively parallel processing databases. Furthermore, there is a trade-off between ease of extending the data model and ease of analysis. The modern 'Data Lake' approach promises extreme ease of adding new data to a data model; however, it is prone to eventually becoming a Data Swamp: an unstructured, ungoverned, and out-of-control Data Lake in which, due to a lack of process, standards, and governance, data is hard to find, hard to use, and is consumed out of context. This paper introduces a novel technique, highly normalized Big Data using Anchor modeling, that provides a very efficient way to store information and utilize resources, thereby providing ad-hoc querying with high performance for the first time in massively parallel processing databases. The technique is almost as convenient for expanding the data model as a Data Lake, while being internally protected from degenerating into a Data Swamp. A case study of how this approach is used for a Data Warehouse at Avito over a three-year period, with estimates for and results of real data experiments carried out in HP Vertica, an MPP RDBMS, is also presented. This paper extends the theses presented at The 34th International Conference on Conceptual Modeling (ER 2015) (Golov and Rönnbäck 2015) [1]; it is complemented with numerical results on key operating areas of a highly normalized big data warehouse, collected over one to three years of commercial operation. The limitations imposed by using a single MPP database cluster are also described, and a cluster fragmentation approach is proposed.
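To make the modeling idea concrete, here is a minimal, hedged sketch of the kind of schema Anchor modeling produces: one narrow anchor table holding only identities, plus one table per attribute, which is what enables both high normalization and non-destructive extension. The Python-wrapped DDL and all names are illustrative assumptions, not the paper's actual Avito schema.

```python
# Minimal sketch of an Anchor-modeled (6NF-style) schema: the anchor table
# holds only identities, and each attribute lives in its own narrow table.
# All table and column names here are invented for illustration.

ANCHOR_DDL = """
CREATE TABLE AD_Advertisement (            -- anchor: identities only
    AD_ID BIGINT NOT NULL PRIMARY KEY
);
"""

ATTRIBUTE_DDL = """
CREATE TABLE AD_TTL_Advertisement_Title (  -- one attribute, one table
    AD_ID     BIGINT       NOT NULL REFERENCES AD_Advertisement (AD_ID),
    AD_TTL    VARCHAR(200) NOT NULL,
    ChangedAt TIMESTAMP    NOT NULL,       -- historization per attribute
    PRIMARY KEY (AD_ID, ChangedAt)
);
"""

if __name__ == "__main__":
    # In a real deployment these statements would be executed against the
    # MPP database (e.g. via a Vertica client); here we only print them.
    print(ANCHOR_DDL)
    print(ATTRIBUTE_DDL)
```

Narrow tables of this shape compress and distribute well in an MPP column store, which is the property the paper exploits for high-performance ad-hoc querying.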
... Data Lakes are usually chaotic: they are a store into which all of a company's data is accumulated without any structure or rules for using it. For that reason, some people have started to call them Data Swamps [12]. This problem can lead to a decrease in the value of the data. ...
Conference Paper
Big Data is changing the perspective on how to obtain valuable information from data stored by organizations of all kinds. By using these insights, companies can make better decisions and thus achieve their business goals. However, each new technology can create new security problems, and Big Data is no exception. One of the major security issues in a Big Data ecosystem is the level of trust that stakeholders can place in data and data sources: without reliable data, the results of data analysis lose value. In this paper, we propose a security pattern to improve the traceability and veracity of data through the use of Blockchain technologies. In this pattern, Blockchain is used as a distributed ledger in which all operations performed on the data are registered. The veracity of the data thereby increases, as does the confidence in the insights obtained from the analysis. The purpose of this paper is to help Chief Security Officers and Big Data architects incorporate this mechanism to improve the security of their environments.
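A minimal sketch of the ledger idea behind the pattern, assuming nothing beyond the abstract: each operation on the data is appended as a hash-chained record, so tampering with the recorded history is detectable. A real deployment would sit on an actual Blockchain platform; this in-memory chain only illustrates the mechanism.

```python
import hashlib
import json
import time

# Each block's hash covers the previous block's hash, so altering any
# recorded operation breaks every link after it.

def make_block(prev_hash: str, operation: dict) -> dict:
    payload = {"prev": prev_hash, "ts": time.time(), "op": operation}
    digest = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()).hexdigest()
    payload["hash"] = digest
    return payload

ledger = [make_block("0" * 64, {"action": "ingest", "source": "sensor-A"})]
ledger.append(make_block(ledger[-1]["hash"], {"action": "transform", "job": "clean"}))

# Verification: each block must point at the hash of its predecessor.
for prev, block in zip(ledger, ledger[1:]):
    assert block["prev"] == prev["hash"]
print("chain intact,", len(ledger), "operations recorded")
```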
... According to (Kirk, 2010) and (Golov and Rönnbäck, 2017), it is a simple task to achieve a ten-fold speed up when an application makes use of data parallelism. A computation is said to be parallel when a program runs on a multiprocessor machine in which all processors share access to available memory, i.e. the same address on different processors corresponds to the same memory location. ...
... Papers retrieved and accepted per source: Scopus, 108 retrieved and 53 accepted; Springer, 222 and 20; Google Scholar, 197 and 6; Web of Science, 71 and 4. ...
... In other words, conceptual modeling is done [26]. Based on the entities and their relationships, the database is normalized to third normal form (3NF) [27]. Finally, the normalized entities are implemented in an SQL DBMS. ...
Article
Drug repurposing is an attractive field within drug discovery because it reduces time and cost. It is also considered an appropriate method for finding medications for orphan and rare diseases. Hence, many researchers have proposed novel methods based on databases that contain diverse information. A suitable organization of data, one that facilitates repurposing applications and provides a tool or a web service, can therefore be beneficial. In this review, we categorize drug databases and discuss their advantages and disadvantages. Surprisingly, to the best of our knowledge, the importance and potential of databases in drug repurposing are yet to be emphasized. Indeed, the available databases can be divided into several groups based on data content, and the different classes can be applied to find new applications for existing drugs. Furthermore, we propose some suggestions for making databases more effective and popular in this field.
Article
In product design development, one element of market competition is meeting the Kansei needs of users. Compared to features, users pay more attention to whether products match their emotions; these are the Kansei needs. Product developers are eager to capture users' Kansei needs more accurately and conveniently. This paper uses a computer cloud platform as the carrier and builds on a collaborative filtering algorithm: a personalized double matrix recommendation algorithm serves as the core, and an adjective dimensionality reduction method filters the image tags to simplify the users' rating process and improve recommendation efficiency. Finally, we construct a Kansei needs acquisition model to quickly and easily obtain users' Kansei needs. We verify the model using an air purifier as the subject. The results of the case study show that the model can identify users' Kansei needs more quickly, and that with more data the prediction becomes more accurate and timely.
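The abstract does not spell out the personalized double matrix algorithm, so the sketch below shows only the standard collaborative filtering idea it builds on: predicting a missing rating from similar users. The data and names are invented.

```python
import numpy as np

# Generic user-based collaborative filtering on an image-tag rating matrix;
# this is NOT the authors' "personalized double matrix" algorithm (whose
# details the abstract does not give), just the standard idea underneath it.

ratings = np.array([  # rows: users, cols: Kansei image tags (toy data)
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
], dtype=float)

def predict(ratings: np.ndarray, user: int, item: int) -> float:
    """Predict a missing rating as a similarity-weighted average."""
    norms = np.linalg.norm(ratings, axis=1) + 1e-9
    sims = ratings @ ratings[user] / (norms * norms[user])  # cosine similarity
    mask = ratings[:, item] > 0                             # users who rated item
    if not mask.any():
        return 0.0
    return float(sims[mask] @ ratings[mask, item] / (sims[mask].sum() + 1e-9))

print(predict(ratings, user=1, item=1))
```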
Article
In this paper, a new active dynamic lightning protection method is proposed based on the big data characteristics of electric power systems. The method comprises two parts: first, a Neo4j framework model used to analyze power system big data and dynamically regulate the power system, with Python software used to compare the different framework models; second, a comparison between dynamic lightning protection and conventional protection methods. The results show that Neo4j traversal is 87.5% and 89.1% faster than Hadoop and Spark respectively, and its clustering effect is 12.5% and 17.8% better than Hadoop and Spark respectively. The Neo4j framework model is therefore better suited to the characteristics of power system big data. After a lightning accident, the power-off time of the dynamic lightning protection system is reduced by about 53.1%, and the recovery time of the system decreases by about 42.8%. In the dynamic regulation of the power system, power supply output is reduced by 35.1 MW and 15.8 MW of load is cut, which greatly reduces the impact of a lightning strike on the power supply and important loads.
Article
Full-text available
It is difficult to predict dissolved oxygen values because they are disordered and nonlinear. Accurate prediction of dissolved oxygen in shellfish aquaculture plays an important role in improving shellfish production, and a reliable model is needed to predict dissolved oxygen values accurately. Therefore, in this paper, an enhanced naive Bayes (NB) model is proposed. Because the series contains many distinct dissolved oxygen values, using them directly as input samples leaves too few training samples per category, which reduces prediction accuracy. Therefore, the differenced dissolved oxygen series is used as the input data, reducing the number of training set categories and improving training accuracy. To increase the number of samples in the training set, the sliding window concept from network communication protocols is used to partition the differenced series and generate the features and labels of the training set. The values are predicted as categories, and the dissolved oxygen data are predicted by selecting the labels corresponding to the posterior probability maxima over all training samples. Finally, the algorithm is used to predict dissolved oxygen data from February 18, 2016, to January 31, 2020, in Yantai, Shandong Province, China. The dissolved oxygen data of a shellfish farm were trained on and predicted, and the best feature lengths were found by analyzing their effect on the predicted values. The proposed algorithm significantly improves the mean absolute error (MAE), root mean square error (RMSE), and mean absolute percentage error (MAPE) compared to state-of-the-art algorithms. The results of the Diebold-Mariano test and 10-fold cross-validation also show that the proposed algorithm has higher prediction accuracy.
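A hedged sketch of the described pipeline: difference the series to shrink the label space, build training pairs with a sliding window, classify the next difference, and undo the differencing. GaussianNB stands in for the authors' enhanced NB variant, and the data are synthetic.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for a dissolved-oxygen series (mg/L around 8.0).
oxygen = np.cumsum(np.random.default_rng(0).normal(0, 0.1, 500)) + 8.0
diffs = np.round(np.diff(oxygen), 1)   # differenced, discretized series
window = 5                             # sliding-window feature length

# Sliding window: each window of past differences predicts the next one.
X = np.array([diffs[i:i + window] for i in range(len(diffs) - window)])
y = (diffs[window:] * 10).round().astype(int)   # class labels per difference

model = GaussianNB().fit(X[:-50], y[:-50])

# Forecast: classify the most recent window, then undo the differencing.
next_features = diffs[-window:].reshape(1, -1)
pred_diff = model.predict(next_features)[0] / 10.0
print("next value estimate:", oxygen[-1] + pred_diff)
```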
Article
Full-text available
The analysis and processing of big data is one of the most important challenges researchers are working on, seeking approaches that combine high performance, low cost, and high accuracy. In this paper, a novel approach for big data processing and management is proposed that differs from existing ones: the proposed method employs not only memory space to read and handle big data, but also memory-mapped space that extends beyond physical memory. From a methodological viewpoint, the novelty of this paper is the segmentation stage, which splits big data using memory mapping and broadcasts all segments to a number of processors using a parallel message passing interface. From an application viewpoint, the paper presents a high-performance approach based on a homogeneous network that works in parallel to encrypt and decrypt big data using the AES algorithm. The approach was implemented on the Windows operating system using .NET libraries.
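A small sketch of the general idea under stated assumptions: memory-map a file, cut it into segments, and encrypt the segments in parallel with AES-CTR. The paper's implementation uses .NET and a message passing interface on Windows; Python worker processes and the cryptography package stand in here.

```python
import mmap
import os
from concurrent.futures import ProcessPoolExecutor
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def encrypt_segment(args):
    """Encrypt one segment with AES-CTR; key and nonce travel with the data."""
    key, nonce, data = args
    enc = Cipher(algorithms.AES(key), modes.CTR(nonce)).encryptor()
    return enc.update(data) + enc.finalize()

def encrypt_file(path, seg_size=1 << 20):
    key = os.urandom(32)
    # Memory-map the file so segmentation does not load it all at once.
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        segments = [(key, os.urandom(16), bytes(mm[i:i + seg_size]))
                    for i in range(0, len(mm), seg_size)]
    # Broadcast segments to worker processes, one per CPU core.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(encrypt_segment, segments))

if __name__ == "__main__":
    import tempfile
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(3 * (1 << 20)))   # 3 MB of test data
    print(len(encrypt_file(tmp.name)), "segments encrypted")
```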
Chapter
Data is the driving force of any country's economy; it works as fuel for the economy. A prime task for an organization is to store data and use it in the decision-making process. In the past, organizations used data warehouses and data marts to store data for decision-making purposes, but as technology advanced, the data warehouse faced many challenges and failed to fulfill market demands. The biggest challenge for the data warehouse is managing big data: data with velocity, huge volume, variety, veracity, and value. Since the start of the twenty-first century, the world has witnessed many new technologies, such as AI, deep learning, and machine learning, all of which depend on big data. The data warehouse fails to meet the requirements of data engineers who want to use these technologies to make decision-making systems more effective. Data engineers want a new repository for big data: the data warehouse works on the schema-on-write principle, transforming data before storage, but engineers want data in raw format so that they can later transform it according to business needs and extract different values from it. To overcome the challenges faced by the data warehouse, research has produced a new concept known as the data lake, a technologically advanced successor to the data warehouse that works on the schema-on-read principle, as the sketch below illustrates. The objective of this chapter is to examine the idea of the data lake from both a user perspective and a technology perspective.
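The schema-on-read contrast the chapter draws can be shown in a few lines: raw records are stored exactly as they arrive, and a schema is projected onto them only when a particular analysis needs one. Field names are invented.

```python
import json

raw_zone = [  # data lake: heterogeneous records stored exactly as they arrived
    '{"user": 1, "event": "click", "url": "/home"}',
    '{"user": 2, "event": "purchase", "amount": 9.99, "currency": "EUR"}',
]

def read_with_schema(records, fields):
    """Schema-on-read: project raw records onto the fields an analysis needs."""
    for line in records:
        rec = json.loads(line)
        yield {f: rec.get(f) for f in fields}

# Two different analyses apply two different schemas to the same raw data.
print(list(read_with_schema(raw_zone, ["user", "event"])))
print(list(read_with_schema(raw_zone, ["user", "amount", "currency"])))
```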
Chapter
Full-text available
This paper describes and analyses optimization approaches that make possible the exact calculation of millions of hierarchical count-distinct measures over hundreds of billions of data rows. The described approach evolved over several years, in parallel with the growth of tasks at a fast-growing internet company, and was finally implemented as the PEAPM (Pipelined Exact Accumulation for Paralleled Measures) algorithm. The current version of the algorithm outputs exact values (not estimates), runs in a single thread in minutes on general commodity hardware, and requires an amount of RAM equal to twice the size of the required measures.
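PEAPM's internals are not given in the abstract, so the following only illustrates one ingredient it names: with rows sorted by the counted key, exact count-distinct measures for many groups can be accumulated in a single pipelined pass, with memory proportional to the measures rather than the set of keys.

```python
from collections import defaultdict

# Toy rows of (group, user); sort by the counted key so each distinct
# key arrives as one contiguous run.
rows = sorted([("ru", "u1"), ("ru", "u1"), ("ru", "u2"), ("de", "u1")],
              key=lambda r: r[1])

counts = defaultdict(int)              # group -> exact COUNT(DISTINCT user)
prev_user, seen_groups = None, set()
for group, user in rows:
    if user != prev_user:              # a new key starts: reset group marks
        prev_user, seen_groups = user, set()
    if group not in seen_groups:       # first time this key hits this group
        seen_groups.add(group)
        counts[group] += 1

print(dict(counts))                    # {'de': 1, 'ru': 2}
```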
Conference Paper
Full-text available
High performance querying and ad-hoc querying are commonly viewed as mutually exclusive goals in massively parallel processing databases. At one extreme, a database can be set up to provide the results of a single known query, so that the use of available resources is maximized and response time minimized, but at the cost of all other queries being suboptimally executed. At the other extreme, when no query is known in advance, the database must provide the information without such optimization, normally resulting in inefficient execution of all queries. This paper introduces a novel technique, highly normalized Big Data using Anchor modeling, that provides a very efficient way to store information and utilize resources, thereby providing ad-hoc querying with high performance for the first time in massively parallel processing databases. A case study of how this approach is used for a Data Warehouse at Avito over a two-year period, with estimates for and results of real data experiments carried out in HP Vertica, an MPP RDBMS, is also presented.
Conference Paper
Full-text available
Fake identities and Sybil accounts are pervasive in today's online communities. They are responsible for a growing number of threats, including fake product reviews, malware and spam on social networks, and astroturf political campaigns. Unfortunately, studies show that existing tools such as CAPTCHAs and graph-based Sybil detectors have not proven to be effective defenses. In this paper, we describe our work on building a practical system for detecting fake identities using server-side clickstream models. We develop a detection approach that groups "similar" user clickstreams into behavioral clusters, by partitioning a similarity graph that captures distances between clickstream sequences. We validate our clickstream models using ground-truth traces of 16,000 real and Sybil users from Renren, a large Chinese social network with 220M users. We propose a practical detection system based on these models, and show that it provides very high detection accuracy on our clickstream traces. Finally, we worked with collaborators at Renren and LinkedIn to test our prototype on their server-side data. Following positive results, both companies have expressed strong interest in further experimentation and possible internal deployment.
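As a hedged illustration of behavioral clustering (not the authors' production system), the sketch below builds a similarity graph over toy clickstreams and takes its connected components as behavioral clusters.

```python
import numpy as np

streams = {                                   # toy clickstreams per user
    "u1": "photo photo msg photo", "u2": "photo msg photo photo",
    "u3": "invite invite invite invite", "u4": "invite invite msg invite",
}
events = sorted({e for s in streams.values() for e in s.split()})

def profile(s):
    """Event-frequency vector: a crude stand-in for a clickstream model."""
    toks = s.split()
    return np.array([toks.count(e) / len(toks) for e in events])

users = list(streams)
vecs = {u: profile(streams[u]) for u in users}

# Similarity graph: an edge wherever two users' profiles are close enough.
edges = [(a, b) for i, a in enumerate(users) for b in users[i + 1:]
         if np.linalg.norm(vecs[a] - vecs[b]) < 0.4]

# Connected components of the thresholded graph = behavioral clusters.
parent = {u: u for u in users}
def find(u):
    while parent[u] != u:
        u = parent[u]
    return u
for a, b in edges:
    parent[find(a)] = find(b)
print({u: find(u) for u in users})   # u1/u2 cluster apart from u3/u4
```

In a Sybil-detection setting, small or anomalous clusters would then be flagged for review against known-good user behavior.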
Conference Paper
Full-text available
MapReduce has recently gained great popularity as a programming model for processing and analyzing massive data sets and is extensively used by academia and industry. Several implementations of the MapReduce model have emerged, the Apache Hadoop framework being the most widely adopted. Hadoop offers various utilities, such as a distributed file system, job scheduling and resource management capabilities, and a Java API for writing applications. Hadoop's success has spurred research interest and has led to various modifications and extensions to the framework. Implemented optimizations include performance improvements, programming model extensions, tuning automation and usability enhancements. In this paper, we discuss the current state of the Hadoop framework and its identified limitations. We present, compare and classify Hadoop/MapReduce variations, identify trends, open issues and possible future directions.
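For readers new to the model, a minimal in-process word count shows the map, shuffle, and reduce phases the survey refers to; a real Hadoop job would run Mapper and Reducer classes over a distributed file system instead of a Python list.

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    return [(w, 1) for w in line.split()]

def reducer(word, counts):
    """Reduce phase: combine all counts emitted for one key."""
    return word, sum(counts)

lines = ["big data big", "data lake"]

# Shuffle phase: group the intermediate pairs by key.
shuffled = defaultdict(list)
for word, one in chain.from_iterable(map(mapper, lines)):
    shuffled[word].append(one)

print([reducer(w, c) for w, c in shuffled.items()])
# [('big', 2), ('data', 2), ('lake', 1)]
```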
Article
Full-text available
This paper describes the system architecture of the Vertica Analytic Database (Vertica), a commercialization of the design of the C-Store research prototype. Vertica demonstrates a modern commercial RDBMS system that presents a classical relational interface while at the same time achieving the high performance expected from modern "web scale" analytic systems by making appropriate architectural choices. Vertica is also an instructive lesson in how academic systems research can be directly commercialized into a successful product.
Article
Full-text available
Maintaining and evolving data warehouses is a complex, error prone, and time consuming activity. The main reason for this state of affairs is that the environment of a data warehouse is in constant change, while the warehouse itself needs to provide a stable and consistent interface to information spanning extended periods of time. In this article, we propose an agile information modeling technique, called Anchor Modeling, that offers non-destructive extensibility mechanisms, thereby enabling robust and flexible management of changes. A key benefit of Anchor Modeling is that changes in a data warehouse environment only require extensions, not modifications, to the data warehouse. Such changes, therefore, do not require immediate modifications of existing applications, since all previous versions of the database schema are available as subsets of the current schema. Anchor Modeling decouples the evolution and application of a database, which when building a data warehouse enables shrinking of the initial project scope. While data models were previously made to capture every facet of a domain in a single phase of development, in Anchor Modeling fragments can be iteratively modeled and applied. We provide a formal and technology independent definition of anchor models and show how anchor models can be realized as relational databases together with examples of schema evolution. We also investigate performance through a number of lab experiments, which indicate that under certain conditions anchor databases perform substantially better than databases constructed using traditional modeling techniques.
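Continuing the illustrative schema sketched earlier, the non-destructive extension the article describes can be shown as adding a brand-new attribute table while leaving every existing table untouched; the names remain invented for illustration.

```python
# Sketch of non-destructive schema evolution in an anchor model: new
# information becomes a NEW attribute table, so existing tables (and the
# applications reading them) require no modification, and the old schema
# remains a subset of the new one.

EXTENSION_DDL = """
CREATE TABLE AD_PRC_Advertisement_Price (
    AD_ID     BIGINT        NOT NULL REFERENCES AD_Advertisement (AD_ID),
    AD_PRC    DECIMAL(12,2) NOT NULL,
    ChangedAt TIMESTAMP     NOT NULL,
    PRIMARY KEY (AD_ID, ChangedAt)
);
"""

print(EXTENSION_DDL)
```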
Article
Full-text available
Categorizing visitors based on their interactions with a website is a key problem in web usage mining. The clickstreams generated by various users often follow distinct patterns, knowledge of which may help in providing customized content. In this paper, we propose a novel and effective algorithm for clustering web users based on a function of the longest common subsequence of their clickstreams that takes into account both the trajectory taken through a website and the time spent at each page. Results are presented on weblogs of www.sulekha.com to illustrate the techniques. Keywords: web usage mining, clickstream, subsequence, clustering.
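A sketch of the core computation, assuming a simplified weighting: a match between two visits to the same page contributes the smaller of the two dwell times, and dynamic programming maximizes the total. The paper's exact similarity function may differ.

```python
from functools import lru_cache

def lcs_weight(a, b):
    """Weighted LCS of two clickstreams given as (page, seconds) pairs."""
    @lru_cache(maxsize=None)
    def rec(i, j):
        if i == len(a) or j == len(b):
            return 0.0
        best = max(rec(i + 1, j), rec(i, j + 1))   # skip either visit
        if a[i][0] == b[j][0]:                     # same page: credit overlap
            best = max(best, min(a[i][1], b[j][1]) + rec(i + 1, j + 1))
        return best
    return rec(0, 0)

s1 = [("/home", 10), ("/forum", 120), ("/post/7", 45)]
s2 = [("/home", 8), ("/search", 5), ("/forum", 90), ("/post/7", 60)]
print(lcs_weight(s1, s2))   # 8 + 90 + 45 = 143.0
```

Pairwise scores like this one would then feed a standard clustering algorithm over the users.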
Article
In this paper, we review the background and state of the art of big data. We first introduce the general background of big data and review related technologies, such as cloud computing, the Internet of Things, data centers, and Hadoop. We then focus on the four phases of the big data value chain: data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally examine several representative applications of big data, including enterprise management, the Internet of Things, online social networks, medical applications, collective intelligence, and the smart grid. These discussions aim to provide readers with a comprehensive overview and big picture of this exciting area. The survey concludes with a discussion of open problems and future directions.
Conference Paper
Column store databases allow for various tuple reconstruction strategies (also called materialization strategies). Early materialization is easy to implement but generally performs worse than late materialization. Late materialization is more complex to implement, and usually performs much better than early materialization, although there are situations where it is worse. We identify these situations, which essentially revolve around joins where neither input fits in memory (also called spilling joins). Sideways information passing techniques provide a viable solution to get the best of both worlds. We demonstrate how early materialization combined with sideways information passing allows us to get the benefits of late materialization, without the bookkeeping complexity or worse performance for spilling joins. It also provides some other benefits to query processing in Vertica due to positive interaction with compression and sort orders of the data. In this paper, we report our experiences with late and early materialization, highlight their strengths and weaknesses, and present the details of our sideways information passing implementation. We show experimental results of comparing these materialization strategies, which highlight the significant performance improvements provided by our implementation of sideways information passing (up to 72% on some TPC-H queries).
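A toy contrast between the two strategies for a query like SELECT name, price FROM t WHERE price > 100; real column stores such as Vertica operate on compressed column files rather than in-memory arrays.

```python
import numpy as np

price = np.array([30, 250, 990, 15])
name = np.array(["a", "b", "c", "d"])

# Early materialization: stitch full tuples first, then filter them.
tuples = list(zip(name, price))
early = [t for t in tuples if t[1] > 100]

# Late materialization: filter one column, carry only row positions, and
# fetch the other column's values at the last possible moment.
positions = np.nonzero(price > 100)[0]
late = list(zip(name[positions], price[positions]))

assert early == late
print(late)   # [('b', 250), ('c', 990)]
```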
Hans Hultgren, Modeling the Agile Data Warehouse with Data Vault (Volume 1). Brighton Hamilton, 2012.