
Abstract

High-performance querying and ad-hoc querying are commonly viewed as mutually exclusive goals in massively parallel processing databases. Furthermore, there is a contradiction between ease of extending the data model and ease of analysis. The modern 'Data Lake' approach promises extreme ease of adding new data to a data model; however, it is prone to eventually becoming a Data Swamp: an unstructured, ungoverned, and out-of-control Data Lake where, due to a lack of process, standards, and governance, data is hard to find, hard to use, and is consumed out of context. This paper introduces a novel technique, highly normalized Big Data using Anchor modeling, that provides a very efficient way to store information and utilize resources, thereby providing ad-hoc querying with high performance for the first time in massively parallel processing databases. This technique is almost as convenient for expanding the data model as a Data Lake, while being internally protected from turning into a Data Swamp. A case study of how this approach is used for a Data Warehouse at Avito over a three-year period, with estimates for and results of real data experiments carried out in HP Vertica, an MPP RDBMS, is also presented. This paper extends the theses from the 34th International Conference on Conceptual Modeling (ER 2015) (Golov and Rönnbäck 2015) [1]; it is complemented with numerical results about key operating areas of a highly normalized big data warehouse, collected over several (1-3) years of commercial operation. The limitations imposed by using a single MPP database cluster are also described, and a cluster fragmentation approach is proposed.
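To make the core idea concrete, the following is a minimal sketch, assuming a hypothetical "ad" entity, of how an Anchor-modeled (highly normalized, 6NF-style) schema splits each attribute into its own narrow table; the table and column names are illustrative, not the schema used at Avito.

# Minimal sketch of the anchor-modeling idea: each entity ("anchor") is a
# narrow key table, and every attribute lives in its own table keyed by the
# anchor. All names here are hypothetical.

def anchor_ddl(anchor: str) -> str:
    return f"CREATE TABLE {anchor} ({anchor}_id BIGINT PRIMARY KEY);"

def attribute_ddl(anchor: str, attribute: str, sql_type: str) -> str:
    # One table per attribute: adding an attribute later is a pure extension,
    # no existing table has to be altered.
    return (
        f"CREATE TABLE {anchor}_{attribute} ("
        f"{anchor}_id BIGINT NOT NULL, "
        f"{attribute} {sql_type} NOT NULL, "
        f"valid_from TIMESTAMP NOT NULL);"
    )

if __name__ == "__main__":
    print(anchor_ddl("ad"))
    for name, sql_type in [("title", "VARCHAR(256)"), ("price", "NUMERIC(12,2)")]:
        print(attribute_ddl("ad", name, sql_type))

Because each attribute sits in its own narrow table, an ad-hoc query only has to scan the tables it actually touches, which is what lets an MPP column store use its resources efficiently.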
... Mckay et al. put forward new ideas for big data management skills by constructing accurate measurement models to optimize data processing speed and data standardization [5]. Golov and Manco have conducted research on big data management issues, constructed a big data management model, and proposed a highly standardized method [6] [7]. ...
... According to (Kirk, 2010) and (Golov and Rönnbäck, 2017), it is a simple task to achieve a ten-fold speed up when an application makes use of data parallelism. A computation is said to be parallel when a program runs on a multiprocessor machine in which all processors share access to available memory, i.e. the same address on different processors corresponds to the same memory location. ...
Article
Full-text available
This paper aims to find the actuarial tables that best represent the occurrences of mortality and disability in the Brazilian Armed Forces, thus providing a better dimensioning of the costs of military pensions to be paid by the pension system. To achieve this goal, an optimization software was developed that tests 53 actuarial tables for the mortality of valid (non-disabled) military personnel, 21 tables for entry of active personnel into disability, and 21 tables for the mortality of disabled personnel. The software performs 199 distinct adherence tests for each table analyzed, through linear increases and decreases in the probabilities of death and disability. The statistical-mathematical method used was the chi-square adherence test, in which the selected table is the one for which the null hypothesis that the "observed data" equal the "expected data" holds with the highest degree of accuracy. The study is expected to bring a significant contribution to society, as a model of greater accuracy reduces the risk of a large difference between the projected cost and the cost observed in a given year, thus contributing to the maintenance of public governance. Additionally, the unprecedented and dual nature of the methodology presented here stands out. As a practical contribution, we emphasize that the results presented streamline the calculation of actuarial projections, reducing by more than 90% the processing times of calculations referring to actuarial projections of retirees from the armed forces. As a limitation of the study, we emphasize that, although possibly replicable, the database was restricted to the Brazilian Armed Forces.
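As a rough illustration of the adherence test described above, the sketch below computes the chi-square statistic between hypothetical observed deaths and the expected deaths implied by two candidate tables and picks the better-fitting table; the counts are invented, and the paper's linear increases and decreases of the probabilities are omitted.

# Chi-square adherence sketch: choose the actuarial table whose expected
# deaths deviate least from the observed deaths. All counts are made up.

def chi_square_statistic(observed, expected):
    # Sum of (O - E)^2 / E over all age bands.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed_deaths = [12, 18, 25, 40, 61]           # hypothetical counts by age band
candidate_tables = {
    "table_A": [10.5, 17.2, 26.0, 42.3, 58.9],   # expected deaths under table A
    "table_B": [14.0, 20.1, 22.5, 35.0, 70.2],   # expected deaths under table B
}

best = min(candidate_tables,
           key=lambda t: chi_square_statistic(observed_deaths, candidate_tables[t]))
print("best-fitting table:", best)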
Article
The article introduces a system concept for analyzing Almaty's air basin. Air pollution monitoring plays a key role in environmental issues. However, in order to develop an environmental information system, it is necessary to conduct a careful analysis of natural events to determine their direct impact on the state of the environment, in this case the atmosphere. To assess the degree of pollution, emissions must be measured at specific intervals in specific locations across the settlement, depending on the study approach used. To numerically implement the data obtained, a model for determining the amount of pollutant concentration should be used, which leads to a numerical expression representing the atmospheric conditions in a certain industrial region. The purpose of the project is to implement the above operations programmatically, namely to develop software based on an appropriate mathematical model that allows for the calculation of pollutant concentrations and, on that basis, the complex index of atmospheric pollution. To analyze and make specific decisions related to reducing pollution levels, it is necessary to track the dynamics of changes in the atmospheric pollution index over long periods of time, which necessitates the creation of a database to record measurements as well as the ability to view them visually at any time, for example, on a map. The article also presents emission statistics and an analysis of the overall condition of Almaty's air basin. According to reporting data from the Statistics Agency of the Republic of Kazakhstan and the Almaty City Statistics Department, the total annual volume of pollutant emissions into the atmosphere from all stationary sources amounted to 47,016 tonnes in 2020. The main pollutants from combined heat and power plants are sulfur dioxide, nitrogen oxides, and particulate matter. Moreover, burning fuel oil, bituminous coal, and brown coal releases large amounts of sulfur dioxide into the atmosphere, and nitrogen oxide emissions also rise sharply when bituminous coal is used; natural gas is therefore considered the most efficient alternative. An analysis of the main stationary polluters is carried out: plants, factories of various types, thermal power stations, and similar industrial facilities that constantly and continuously emit harmful pollutants into the atmosphere. To determine the pollution level, emissions are measured at specific locations in the settlement over specific time intervals, depending on the chosen research methodology.
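The abstract does not give the formula behind the complex index of atmospheric pollution, so the sketch below uses one common formulation (the sum over pollutants of mean concentration divided by the maximum permissible concentration, raised to a hazard-class exponent) with hypothetical readings; the model actually used by the described system may differ.

# Assumed, commonly used formulation of a complex air pollution index.
# Concentrations and permissible limits below are invented for illustration.

HAZARD_CLASS_EXPONENT = {1: 1.7, 2: 1.3, 3: 1.0, 4: 0.9}

def pollution_index(measurements):
    # measurements: (mean_concentration, permissible_concentration, hazard_class)
    return sum((q / mpc) ** HAZARD_CLASS_EXPONENT[cls] for q, mpc, cls in measurements)

readings = [
    (0.08, 0.05, 3),   # sulfur dioxide, mg/m^3
    (0.06, 0.04, 3),   # nitrogen dioxide, mg/m^3
    (0.20, 0.15, 3),   # particulate matter, mg/m^3
]
print(f"complex pollution index: {pollution_index(readings):.2f}")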
Article
In the development of product design, one of the elements of market competition is whether a product meets users' Kansei needs. Compared to features, users pay more attention to whether products match their emotions, which is what Kansei needs refer to. Product developers are therefore eager to capture users' Kansei needs more accurately and conveniently. This paper takes the computer cloud platform as the carrier and is based on a collaborative filtering algorithm. We use a personalized double-matrix recommendation algorithm as the core, together with an adjective dimensionality-reduction method to filter the image tags, simplifying the users' rating process and improving recommendation efficiency. Finally, we construct a Kansei needs acquisition model to quickly and easily obtain the Kansei needs of users. We verify the model using an air purifier as the subject. The results of the case show that the model can identify the user's Kansei needs more quickly, and that with more data the prediction becomes more accurate and timely.
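The personalized double-matrix algorithm itself is not detailed in the abstract; the sketch below shows only the generic user-based collaborative-filtering mechanism it builds on, with an invented rating matrix over Kansei image tags.

# Generic user-based collaborative filtering: cosine similarity between users,
# prediction as a similarity-weighted average. Ratings are invented.
import math

ratings = {                      # user -> {image_tag: rating}
    "u1": {"calm": 5, "warm": 4, "techy": 1},
    "u2": {"calm": 4, "warm": 5},
    "u3": {"techy": 5, "calm": 1},
}

def cosine(a, b):
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[k] * b[k] for k in common)
    return dot / (math.sqrt(sum(v * v for v in a.values())) *
                  math.sqrt(sum(v * v for v in b.values())))

def predict(user, item):
    # Weighted average of other users' ratings for `item`, weighted by similarity.
    pairs = [(cosine(ratings[user], r), r[item])
             for u, r in ratings.items() if u != user and item in r]
    total = sum(w for w, _ in pairs)
    return sum(w * v for w, v in pairs) / total if total else None

print(predict("u3", "warm"))   # predicted affinity of u3 for the "warm" Kansei tag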
Article
In this paper, a new active dynamic lightning protection method is proposed based on the big data characteristics of electric power systems. The method mainly includes two parts: first, a Neo4j framework model used to analyze power-system big data and to dynamically regulate the power system, with Python software used to compare and analyze different framework models; second, a comparison between dynamic lightning protection and conventional protection methods. The results show that Neo4j traversal speed is 87.5% and 89.1% faster than Hadoop and Spark respectively, and its clustering effect is 12.5% and 17.8% better than Hadoop and Spark respectively. As a result, the Neo4j framework model is better suited to the characteristics of big data in power systems. After a lightning accident, the power-off time of the dynamic lightning protection system is reduced by about 53.1%, and the recovery time of the system is also decreased by about 42.8%. In the dynamic regulation of the power system, the power supply output is reduced by 35.1 MW and 15.8 MW of load is cut out, which greatly reduces the impact of a lightning strike on the power supply and important loads.
Article
Full-text available
It is difficult to predict dissolved oxygen values because they are disordered and nonlinear. Accurate prediction of dissolved oxygen in shellfish aquaculture plays an important role in improving shellfish production, and a reliable model is needed to accurately predict dissolved oxygen values. Therefore, in this paper, an enhanced naive Bayes (NB) model is proposed. Because there are too many distinct dissolved oxygen values, using them directly as input samples would leave too few training samples for each category, which reduces the prediction accuracy. Therefore, the dissolved oxygen differential series dataset is used as the input data to reduce the number of training set categories and improve the training accuracy. To increase the number of samples in the training set, the sliding window concept from network communication protocols is used to partition the differential sequence dataset and generate the features and labels of the training set. The values are predicted as categories, and the dissolved oxygen data are predicted by selecting the label that corresponds to the posterior probability maximum over all training samples. Finally, the algorithm is used to predict the dissolved oxygen data from February 18, 2016, to January 31, 2020, in Yantai, Shandong Province, China. The dissolved oxygen data of a shellfish farm were trained and predicted, and the best feature lengths were determined by analyzing their effects on the predicted dissolved oxygen values. The proposed algorithm significantly improves the mean absolute error (MAE), root mean square error (RMSE), and mean absolute percentage error (MAPE) compared to advanced algorithms. The results of the Diebold-Mariano test and 10-fold cross-validation also show that the proposed algorithm has higher prediction accuracy.
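A minimal sketch of the described pipeline, with an invented dissolved-oxygen series and window length: difference the series, build sliding-window samples, and predict the next difference with a count-based categorical naive Bayes; the published model is more elaborate.

# Differenced series + sliding window + categorical naive Bayes (by counting).
from collections import Counter, defaultdict

series = [7.1, 7.3, 7.2, 7.4, 7.5, 7.3, 7.4, 7.6, 7.5, 7.6]   # hypothetical mg/L readings
diffs = [round(b - a, 1) for a, b in zip(series, series[1:])]  # differenced series
k = 2                                                          # sliding-window (feature) length

samples = [(tuple(diffs[i:i + k]), diffs[i + k]) for i in range(len(diffs) - k)]
label_counts = Counter(label for _, label in samples)
feature_counts = defaultdict(Counter)            # (position, feature value) -> label counts
for features, label in samples:
    for pos, value in enumerate(features):
        feature_counts[(pos, value)][label] += 1
vocab = len(set(diffs))                          # distinct difference values, for smoothing

def predict(features):
    def posterior(label):
        p = label_counts[label] / len(samples)   # prior
        for pos, value in enumerate(features):
            # Laplace-smoothed likelihood so unseen pairs do not zero the product.
            p *= (feature_counts[(pos, value)][label] + 1) / (label_counts[label] + vocab)
        return p
    return max(label_counts, key=posterior)

next_diff = predict(tuple(diffs[-k:]))
print("predicted next reading:", round(series[-1] + next_diff, 1))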
Article
Full-text available
The analysis and processing of big data is one of the most important challenges researchers are working on, seeking approaches that handle it with high performance, low cost, and high accuracy. In this paper, a novel approach for big data processing and management is proposed that differs from existing ones: the proposed method employs not only main memory space to read and handle big data, but also memory-mapped space that extends beyond memory storage. From a methodological viewpoint, the novelty of this paper is the segmentation stage of big data using memory mapping and the broadcasting of all segments to a number of processors using a parallel message passing interface. From an application viewpoint, the paper presents a high-performance approach based on a homogeneous network that works in parallel to encrypt and decrypt big data using the AES algorithm. This approach can be implemented on the Windows operating system using .NET libraries.
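The sketch below illustrates only the segmentation-and-parallel-processing pattern: a file is memory-mapped, sliced into segments, and handed to worker processes. Python's multiprocessing stands in for MPI and SHA-256 hashing stands in for AES, since the paper's .NET implementation is not reproduced; the file name and segment size are hypothetical.

# Memory-map a large file, split it into segments, process segments in parallel.
import hashlib
import mmap
import os
from multiprocessing import Pool

FILE, SEGMENT_SIZE = "big_input.bin", 1 << 20   # hypothetical input, 1 MiB segments

def process_segment(offset):
    with open(FILE, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        chunk = mapped[offset:offset + SEGMENT_SIZE]   # only this slice is touched
        return offset, hashlib.sha256(chunk).hexdigest()   # stand-in for AES work

if __name__ == "__main__":
    offsets = range(0, os.path.getsize(FILE), SEGMENT_SIZE)
    with Pool() as pool:
        for offset, digest in pool.map(process_segment, offsets):
            print(f"segment at {offset}: {digest[:16]}")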
Chapter
Data is the driving force of any country's economy; it works as fuel for the economy. A prime task for an organization is to store data and use it in the decision-making process. In the past, organizations used data warehouses and data marts to store data for decision-making, but with technological advancement the data warehouse faces many challenges and fails to fulfill market demands. The biggest challenge for the data warehouse is to manage big data: data with velocity, huge volume, variety, veracity, and value. Since the start of the twenty-first century, the world has witnessed many new technologies such as AI, deep learning, and machine learning, all of which depend heavily on big data. The data warehouse fails to fulfill data engineers' requirements for using these technologies to make decision-making systems more effective. Data engineers want a new repository for big data: the data warehouse works on the schema-on-write concept, transforming data before storage, whereas engineers want data in raw format so that they can later transform it according to business needs and extract different kinds of value from it. To overcome the challenges faced by the data warehouse, research has produced a new concept known as the data lake, a technologically advanced successor to the data warehouse. The data lake works on the concept of schema-on-read. The objective of this chapter is to examine the idea of the data lake from both a user perspective and a technology perspective.
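A toy contrast between the two concepts, with a hypothetical record layout: schema-on-write transforms records into a fixed shape before storage, while schema-on-read stores the raw records and imposes a schema only when a query needs it.

# Schema-on-write vs schema-on-read, illustrated on invented JSON events.
import json

raw_events = ['{"user": "u1", "price": "99.5", "extra": {"clicks": 3}}',
              '{"user": "u2", "price": "12"}']

# Schema-on-write (warehouse style): transform into a fixed schema before storage.
warehouse_rows = [(json.loads(e)["user"], float(json.loads(e)["price"])) for e in raw_events]

# Schema-on-read (lake style): store the raw text as-is...
data_lake = list(raw_events)

# ...and impose whatever schema the analysis needs at query time, including
# fields the warehouse schema never anticipated.
clicks_per_user = {json.loads(e)["user"]: json.loads(e).get("extra", {}).get("clicks", 0)
                   for e in data_lake}
print(warehouse_rows, clicks_per_user)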
Conference Paper
Full-text available
High performance querying and ad-hoc querying are commonly viewed as mutually exclusive goals in massively parallel processing databases. In one extreme, a database can be set up to provide the results of a single known query so that the use of available resources is maximized and response time minimized, but at the cost of all other queries being suboptimally executed. In the other extreme, when no query is known in advance, the database must provide the information without such optimization, normally resulting in inefficient execution of all queries. This paper introduces a novel technique, highly normalized Big Data using Anchor modeling, that provides a very efficient way to store information and utilize resources, thereby providing ad-hoc querying with high performance for the first time in massively parallel processing databases. A case study of how this approach is used for a Data Warehouse at Avito over two years' time, with estimates for and results of real data experiments carried out in HP Vertica, an MPP RDBMS, is also presented.
Conference Paper
Full-text available
Fake identities and Sybil accounts are pervasive in today's online communities. They are responsible for a growing number of threats, including fake product reviews, malware and spam on social networks, and astroturf political campaigns. Unfortunately, studies show that existing tools such as CAPTCHAs and graph-based Sybil detectors have not proven to be effective defenses. In this paper, we describe our work on building a practical system for detecting fake identities using server-side clickstream models. We develop a detection approach that groups "similar" user clickstreams into behavioral clusters, by partitioning a similarity graph that captures distances between clickstream sequences. We validate our clickstream models using ground-truth traces of 16,000 real and Sybil users from Renren, a large Chinese social network with 220M users. We propose a practical detection system based on these models, and show that it provides very high detection accuracy on our clickstream traces. Finally, we worked with collaborators at Renren and LinkedIn to test our prototype on their server-side data. Following positive results, both companies have expressed strong interest in further experimentation and possible internal deployment.
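As a toy illustration of the overall approach (not the paper's actual distance function or partitioning method), the sketch below scores clickstream similarity with Jaccard overlap, keeps edges between sufficiently similar users, and reads behavioral clusters off the resulting graph components; the clickstreams are invented.

# Similarity graph over clickstreams, clusters as connected components.
from itertools import combinations

clickstreams = {
    "alice":  ["home", "search", "item", "cart"],
    "bob":    ["home", "search", "item", "item"],
    "sybil1": ["profile", "friend", "friend", "friend"],
    "sybil2": ["profile", "friend", "friend", "like"],
}

def similarity(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

edges = [(u, v) for u, v in combinations(clickstreams, 2)
         if similarity(clickstreams[u], clickstreams[v]) >= 0.5]

clusters = {u: {u} for u in clickstreams}       # merge components edge by edge
for u, v in edges:
    merged = clusters[u] | clusters[v]
    for user in merged:
        clusters[user] = merged

print({frozenset(c) for c in clusters.values()})   # behavioral clusters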
Conference Paper
Full-text available
MapReduce has recently gained great popularity as a programming model for processing and analyzing massive data sets and is extensively used by academia and industry. Several implementations of the MapReduce model have emerged, the Apache Hadoop framework being the most widely adopted. Hadoop offers various utilities, such as a distributed file system, job scheduling and resource management capabilities, and a Java API for writing applications. Hadoop's success has sparked research interest and has led to various modifications and extensions to the framework. Implemented optimizations include performance improvements, programming model extensions, tuning automation and usability enhancements. In this paper, we discuss the current state of the Hadoop framework and its identified limitations. We present, compare and classify Hadoop/MapReduce variations, identify trends, open issues and possible future directions.
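A single-process word-count sketch of the MapReduce model the survey covers: a map phase emits (word, 1) pairs, a shuffle groups them by key, and a reduce phase sums each group; Hadoop would distribute these phases across a cluster.

# Word count in the MapReduce style, run locally for illustration.
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

def map_phase(doc):
    return [(word, 1) for word in doc.split()]

def reduce_phase(word, counts):
    return word, sum(counts)

# Shuffle: group intermediate pairs by key.
groups = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        groups[word].append(count)

print(dict(reduce_phase(w, c) for w, c in groups.items()))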
Article
Full-text available
This paper describes the system architecture of the Vertica Analytic Database (Vertica), a commercialization of the design of the C-Store research prototype. Vertica demonstrates a modern commercial RDBMS system that presents a classical relational interface while at the same time achieving the high performance expected from modern "web scale" analytic systems by making appropriate architectural choices. Vertica is also an instructive lesson in how academic systems research can be directly commercialized into a successful product.
Article
Full-text available
Maintaining and evolving data warehouses is a complex, error prone, and time consuming activity. The main reason for this state of affairs is that the environment of a data warehouse is in constant change, while the warehouse itself needs to provide a stable and consistent interface to information spanning extended periods of time. In this article, we propose an agile information modeling technique, called Anchor Modeling, that offers non-destructive extensibility mechanisms, thereby enabling robust and flexible management of changes. A key benefit of Anchor Modeling is that changes in a data warehouse environment only require extensions, not modifications, to the data warehouse. Such changes, therefore, do not require immediate modifications of existing applications, since all previous versions of the database schema are available as subsets of the current schema. Anchor Modeling decouples the evolution and application of a database, which when building a data warehouse enables shrinking of the initial project scope. While data models were previously made to capture every facet of a domain in a single phase of development, in Anchor Modeling fragments can be iteratively modeled and applied. We provide a formal and technology independent definition of anchor models and show how anchor models can be realized as relational databases together with examples of schema evolution. We also investigate performance through a number of lab experiments, which indicate that under certain conditions anchor databases perform substantially better than databases constructed using traditional modeling techniques.
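The extension-only evolution can be pictured as below, with hypothetical table names: a new requirement adds tables, and every earlier schema version remains a subset of the current one, so existing queries keep working.

# Non-destructive schema evolution: versions grow only by adding tables.
schema_v1 = {"AD_anchor", "AD_title", "AD_price"}

# A new requirement (track ad region) is met by adding a table, never by
# altering or dropping existing ones.
schema_v2 = schema_v1 | {"AD_region"}

assert schema_v1 <= schema_v2          # v1 is still a subset of v2
print("old queries keep working; new table:", schema_v2 - schema_v1)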
Article
Full-text available
Categorizing visitors based on their interactions with a website is a key problem in web usage mining. The clickstreams generated by various users often follow distinct patterns, the knowledge of which may help in providing customized content. In this paper, we propose a novel and effective algorithm for clustering web users based on a function of the longest common subsequence of their clickstreams that takes into account both the trajectory taken through a website and the time spent at each page. Results are presented on weblogs of www.sulekha.com to illustrate the techniques. Keywords: web usage mining, clickstream, subsequence, clustering.
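The core quantity is the longest common subsequence (LCS) of two clickstreams; the sketch below computes it with standard dynamic programming and normalizes by the longer stream, omitting the time-spent weighting the paper adds. The page names are invented.

# LCS length via dynamic programming, turned into a [0, 1] similarity.
def lcs_length(a, b):
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def similarity(a, b):
    # Normalize by the longer stream so the score lies in [0, 1].
    return lcs_length(a, b) / max(len(a), len(b))

s1 = ["home", "forum", "post", "reply"]
s2 = ["home", "post", "reply", "logout"]
print(similarity(s1, s2))   # 0.75: three pages visited in the same order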
Article
In this paper, we review the background and state-of-the-art of big data. We first introduce the general background of big data and review related technologies, such as cloud computing, the Internet of Things, data centers, and Hadoop. We then focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally examine several representative applications of big data, including enterprise management, the Internet of Things, online social networks, medical applications, collective intelligence, and the smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area to readers. This survey is concluded with a discussion of open problems and future directions.
Conference Paper
Column store databases allow for various tuple reconstruction strategies (also called materialization strategies). Early materialization is easy to implement but generally performs worse than late materialization. Late materialization is more complex to implement, and usually performs much better than early materialization, although there are situations where it is worse. We identify these situations, which essentially revolve around joins where neither input fits in memory (also called spilling joins). Sideways information passing techniques provide a viable solution to get the best of both worlds. We demonstrate how early materialization combined with sideways information passing allows us to get the benefits of late materialization, without the bookkeeping complexity or worse performance for spilling joins. It also provides some other benefits to query processing in Vertica due to positive interaction with compression and sort orders of the data. In this paper, we report our experiences with late and early materialization, highlight their strengths and weaknesses, and present the details of our sideways information passing implementation. We show experimental results of comparing these materialization strategies, which highlight the significant performance improvements provided by our implementation of sideways information passing (up to 72% on some TPC-H queries).
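A sketch of the sideways-information-passing idea, with invented rows: the set of join keys that survive the dimension-side predicate is handed to the fact scan so non-qualifying rows are dropped before wide rows are materialized; a real column store such as Vertica would push a compact filter (for example a Bloom filter) into the scan rather than a Python set.

# Early materialization plus a sideways-passed key filter from the dimension side.
dimension = {1: "books", 2: "toys", 3: "garden"}          # dim_id -> category
fact_rows = [(1, 9.99, "2023-01-05"), (2, 4.50, "2023-01-06"),
             (3, 15.00, "2023-01-07"), (2, 7.25, "2023-01-08")]

# Dimension-side predicate: only the "toys" category qualifies.
qualifying_keys = {k for k, category in dimension.items() if category == "toys"}

# The key filter is applied during the fact scan, so only matching rows are
# fully materialized and joined.
joined = [(dim_id, amount, day, dimension[dim_id])
          for dim_id, amount, day in fact_rows
          if dim_id in qualifying_keys]
print(joined)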
Modeling the Agile Data Warehouse with Data Vault
  • Hans Hultgren
Hans Hultgren, Modeling the Agile Data Warehouse with Data Vault (Volume 1), Brighton Hamilton, 2012.