
Abstract and Figures

High performance querying and ad-hoc querying are commonly viewed as mutually exclusive goals in massively parallel processing databases. Furthermore, ease of extending the data model and ease of analysis tend to contradict each other. The modern 'Data Lake' approach promises extreme ease of adding new data to a data model; however, it is prone to eventually becoming a Data Swamp: an unstructured, ungoverned, and out-of-control Data Lake in which, due to a lack of process, standards, and governance, data is hard to find, hard to use, and is consumed out of context. This paper introduces a novel technique, highly normalized Big Data using Anchor modeling, that provides a very efficient way to store information and utilize resources, thereby providing ad-hoc querying with high performance for the first time in massively parallel processing databases. The technique is almost as convenient for expanding a data model as a Data Lake, while being internally protected from degenerating into a Data Swamp. A case study of how this approach is used for a Data Warehouse at Avito over a three-year period, with estimates for and results of real data experiments carried out in HP Vertica, an MPP RDBMS, is also presented. This paper is an extension of theses from The 34th International Conference on Conceptual Modeling (ER 2015) (Golov and Rönnbäck 2015) [1]; it is complemented with numerical results on key operating areas of a highly normalized big data warehouse, collected over one to three years of commercial operation. The limitations imposed by using a single MPP database cluster are also described, and a cluster fragmentation approach is proposed.
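The core of the proposal is easier to picture with a concrete, if simplified, schema. The Python sketch below generates 6NF-style anchor and attribute tables of the kind the paper advocates; the table and column names, the valid_from column, and the Vertica-style SEGMENTED BY HASH clause are illustrative assumptions, not the actual Avito schema.

# A minimal sketch (not the Avito schema): emit 6NF-style anchor and
# attribute tables in the spirit of Anchor modeling for an MPP database.
def anchor_ddl(anchor, attributes):
    """Generate one anchor table plus one table per attribute."""
    stmts = [
        f"CREATE TABLE {anchor}_anchor (\n"
        f"    {anchor}_id BIGINT NOT NULL PRIMARY KEY\n"
        f") SEGMENTED BY HASH({anchor}_id) ALL NODES;"   # Vertica-style segmentation (approximate syntax)
    ]
    for name, sql_type in attributes.items():
        stmts.append(
            f"CREATE TABLE {anchor}_{name} (\n"
            f"    {anchor}_id BIGINT NOT NULL,\n"
            f"    {name} {sql_type} NOT NULL,\n"
            f"    valid_from TIMESTAMP NOT NULL\n"
            f") SEGMENTED BY HASH({anchor}_id) ALL NODES;"
        )
    return stmts

if __name__ == "__main__":
    # Adding a new attribute later only appends one more table; existing
    # tables and queries stay untouched, which is what keeps the model
    # extensible without drifting into a Data Swamp.
    for stmt in anchor_ddl("item", {"title": "VARCHAR(256)", "price": "NUMERIC(12,2)"}):
        print(stmt)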
... The main feature of systems built on Big Data technology is the implementation of parallel computing [36]. This allows a query to be distributed across several servers (segments) by the query optimizer (master) [37]. ...
Article
Full-text available
Modern satellite positioning and navigation technologies are not applicable in specific areas such as the exploration of oil and gas deposits by means of directional drilling techniques. Here, we can rely solely on natural geophysical fields, such as the Earth’s magnetic field. The precise underground navigation of borehole drilling instruments requires seamless, near-real-time access to operational geomagnetic data. This paper describes the MAGNUS BD hardware-software system, deployed at the Geophysical Center of the Russian Academy of Sciences, that provides the efficient accumulation, storage, and processing of geomagnetic data. This system, based on Big Data (BD) technology, is a modern successor of the MAGNUS processing software complex developed in 2016. MAGNUS BD represents one of the first cases of the BD technology’s application to geomagnetic data. Its implementation provided a significant increase in the speed of information processing, allowed for the use of high-frequency geomagnetic satellite data, and expanded the overall functionality of the system. During the MAGNUS BD system’s deployment on a physically separate dedicated cluster, the existing classical database (DB) was migrated to the Arenadata database with full preservation of its functionality. This paper gives a brief analysis of the current problems of directional drilling geomagnetic support and outlines the possible solutions using the MAGNUS BD system.
... This characteristic renders it well-suited for various applications, including but not limited to social network analysis, fraud detection, and recommendation systems. Massively Parallel Processing (MPP) databases, such as Teradata [89] and Snowflake, have been purposefully engineered to cater to the demands of high-performance analytics [90,91]. Data and processing duties are distributed among numerous nodes in order to enhance the speed of query execution. ...
Article
Full-text available
A Digital Twin (DT) is a digital copy or virtual representation of an object, process, service, or system in the real world. It was first introduced to the world by the National Aeronautics and Space Administration (NASA) through its Apollo Mission in the '60s. It can successfully design a virtual object from its physical counterpart. However, the main function of a digital twin system is to provide a bidirectional data flow between the physical and the virtual entity so that it can continuously upgrade the physical counterpart. It is a state-of-the-art iterative method for creating an autonomous system. Data is the brain or building block of any digital twin system. The articles available online cover data analysis technology for only one or two fields at a time; no comprehensive cross-field studies are available. The purpose of this study is to provide an overview of the data level of digital twin systems, covering the data involved at their various phases. This paper provides a comparative study among all the fields in which digital twins have been applied in recent years. A digital twin works with a vast amount of data that needs to be organized, stored, linked, and integrated, which is a further motivation for our study. Data is essential for building virtual models, making cyber-physical connections, and running intelligent operations. The current development status and the challenges present in the different phases of digital twin data analysis are discussed. This paper also outlines how DT is used in different fields, like manufacturing, urban planning, agriculture, medicine, robotics, and the military/aviation industry, and shows a data structure for every sector based on recent review papers. Finally, we give a horizontal comparison of data features across the various fields, to extract the commonalities and uniqueness of the data in different sectors, and to shed light on the current challenges as well as the limitations and future of DT from a data standpoint.
... Mckay et al. put forward new ideas for big data management skills by constructing accurate measurement models to optimize data processing speed and data standardization [5]. Golov and Manco have conducted research on big data management issues, constructed a big data management model, and proposed a highly standardized method [6][7]. ...
... According to Kirk (2010) and Golov and Rönnbäck (2017), it is a simple task to achieve a ten-fold speed-up when an application makes use of data parallelism. A computation is said to be parallel when a program runs on a multiprocessor machine in which all processors share access to the available memory, i.e. the same address on different processors corresponds to the same memory location. ...
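As a small illustration of data parallelism, the Python sketch below splits a per-row computation across worker processes with multiprocessing; note that this is process-based parallelism rather than the shared-address-space model described in the snippet, and any speed-up depends on the workload and the number of cores.

from multiprocessing import Pool

def transform(row):
    # Stand-in for a per-row computation; each worker receives a disjoint
    # chunk of the data, which is what makes the task data-parallel.
    return row * row

if __name__ == "__main__":
    data = list(range(1_000_000))
    with Pool() as pool:                         # one worker per CPU core by default
        result = pool.map(transform, data, chunksize=10_000)
    print(len(result), result[:5])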
Article
Full-text available
Many industries and academic groups have devoted a lot of effort and money to creating and/or using good extract-transform-load (ETL) software suitable for their data analysis purposes, since it is considered a key to their success. The valuable research efforts around ETL software are divided according to well-known approaches such as Business Intelligence, Big Data, and/or Semantic. Problems therefore arise in keeping up with changes and handling the significant diversity of features across these approaches, which leads to disorientation when finding, evaluating, and choosing an ETL tool for the needs of industry and academia. These problems inspired us to contribute a study that uses the systematic-literature-review (SLR) method to collect 207 papers from three databases, namely ScienceDirect, Springer, and IEEE, dated from 2010 to 2022, grouped by both ETL approach and the criteria commonly used within it. We then apply an existing method that automatically identifies the adequate multicriteria method for this study, which yields the analytical-hierarchy-process (AHP) method, to rank the research papers against the requirements found in the scientific literature. The result implies the great significance of this study in multiple ways: it provides a global view of research papers about ETL approaches, allows customers to eliminate uncertainty when selecting an ETL tool according to their specific approach needs, preferences, and interests, and enables future researchers and developers of ETL software to decide where to focus and how to make innovative contributions that fill gaps in the literature.
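Since the study relies on the analytical-hierarchy-process (AHP) method, a brief numeric sketch may help. The Python code below derives criterion weights from a pairwise comparison matrix using the row geometric mean, a common AHP prioritization step; the criteria and judgment values are invented for illustration, and numpy is assumed to be available.

import numpy as np

# Pairwise comparison matrix for three hypothetical criteria
# (e.g. performance, extensibility, documentation); values are illustrative.
A = np.array([
    [1.0, 3.0, 5.0],
    [1/3, 1.0, 2.0],
    [1/5, 1/2, 1.0],
])

# Approximate the principal eigenvector with the row geometric mean.
gm = A.prod(axis=1) ** (1.0 / A.shape[0])
weights = gm / gm.sum()
print(weights)   # criterion weights summing to 1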
Article
Full-text available
This paper aims to find the actuarial tables that best represent the occurrences of mortality and disability in the Brazilian Armed Forces, thus providing a better dimensioning of the costs of military pensions to be paid by the pension system. To achieve this goal, an optimization software was developed that tests 53 actuarial tables for the mortality of active (able-bodied) military personnel, 21 tables for active members' entry into disability, and 21 tables for the mortality of disabled members. The software performs 199 distinct adherence tests for each table analyzed, through linear increases and decreases in the probabilities of death and disability. The statistical-mathematical method used was the chi-square adherence test, in which the selected table is the one for which the null hypothesis that the observed data equal the expected data holds with the highest degree of accuracy. The work is expected to bring a significant contribution to society, as a more accurate model reduces the risk of a large difference between the projected cost and the cost actually observed, thus contributing to the maintenance of public governance. Additionally, the unprecedented and dual nature of the methodology presented here stands out. As a practical contribution, we emphasize that the results presented streamline the calculation of actuarial projections, reducing by more than 90% the processing times of calculations referring to actuarial projections of armed forces retirees. As a limitation of the study, we emphasize that, although possibly replicable, the database was restricted to the Brazilian Armed Forces.
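The chi-square adherence test mentioned above can be sketched in a few lines of Python with scipy; the observed and expected death counts below are invented solely to show the mechanics of the test, not taken from the study.

import numpy as np
from scipy.stats import chisquare

# Hypothetical counts of observed deaths per age band versus the counts an
# actuarial table predicts for the same exposure (numbers are made up).
observed = np.array([30, 45, 60, 40, 25])
expected = np.array([28, 50, 57, 42, 23])

stat, p_value = chisquare(f_obs=observed, f_exp=expected)

# A large p-value means we cannot reject the null hypothesis that the
# table reproduces the observed mortality, i.e. the table "adheres".
print(f"chi2 = {stat:.2f}, p = {p_value:.3f}")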
Article
The article introduces a system concept for analyzing Almaty's air basin. Air pollution monitoring plays a key role in environmental issues. However, in order to develop an environmental information system, it is necessary to conduct a careful analysis of natural events in order to determine their direct impact on the state of the environment, in this case the atmosphere. To assess the degree of pollution, emissions must be measured at specific intervals in specific locations across the settlement, depending on the study approach used. To numerically implement the data obtained, a model for determining the amount of pollutant concentration should be used, which will lead to a numerical expression representing the atmospheric conditions in a certain industrial region. The purpose of the project is to programmatically implement the above operations, namely to develop software based on a specific appropriate mathematical model that allows for the calculation of pollutant concentrations and, on that basis, the complex index of atmospheric pollution. To analyze and make specific decisions related to reduction of pollution levels, it is necessary to track the dynamics of changes in the atmospheric pollution index over long periods of time, which necessitates the creation of a database to record measurements as well as the ability to view them visually at any time, for example, on a map. The article also provides emission statistics and an analysis of the overall state of Almaty's air basin. According to the reporting data of the Statistics Agency of the Republic of Kazakhstan and the Statistics Department of Almaty on the state of the air basin, the total annual volume of pollutant emissions into the atmosphere from all stationary sources amounted to 47,016 tons in 2020. The main pollutants from combined heat and power plants are sulfur dioxide, nitrogen oxides, and particulate matter. Moreover, large amounts of sulfur dioxide are released into the atmosphere when fuel oil, bituminous coal, and brown coal are burned, and when bituminous coal is used the emission of nitrogen oxides also rises sharply; natural gas is therefore considered the most efficient alternative. An analysis of the main stationary polluters is carried out: plants, factories of various types, thermal power stations, and similar industrial facilities that constantly and continuously emit harmful pollutants into the atmosphere. To determine the pollution level, emissions are measured at certain time intervals at certain locations in the settlement, depending on the chosen research methodology.
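A composite atmospheric pollution index is often computed by summing, over the measured pollutants, the ratio of observed concentration to its permissible limit raised to a hazard-class exponent. The Python sketch below assumes that general form; the pollutant list, limits, exponents, and concentrations are all illustrative and not values from the article.

# Hypothetical mean concentrations (mg/m3), maximum permissible
# concentrations, and hazard-class exponents; all values are illustrative.
pollutants = {
    # name: (mean_concentration, permissible_concentration, class_exponent)
    "SO2": (0.08, 0.05, 1.0),
    "NO2": (0.06, 0.04, 1.3),
    "PM":  (0.20, 0.15, 1.0),
}

def pollution_index(measurements):
    """Composite index: sum of (C / MPC) ** exponent over all pollutants."""
    return sum((c / mpc) ** k for c, mpc, k in measurements.values())

print(f"composite pollution index = {pollution_index(pollutants):.2f}")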
Article
In the development of product design, one of the elements of market competition is meeting the Kansei needs of users. Compared to features, users pay more attention to whether products can match their emotions, which is what Kansei needs denote. Product developers are eager to obtain the Kansei needs of users more accurately and conveniently. This paper takes the computer cloud platform as the carrier and builds on a collaborative filtering algorithm. We used a personalized double-matrix recommendation algorithm as the core, and an adjective dimensionality-reduction method to filter the image tags, in order to simplify the users' rating process and improve recommendation efficiency. Finally, we construct a Kansei needs acquisition model to quickly and easily obtain the Kansei needs of users. We verify the model using an air purifier as the subject. The results of the case show that the model can identify the user's Kansei needs more quickly; as more data accumulates, the predictions become more accurate and timely.
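The abstract does not spell out the personalized double-matrix algorithm, so the sketch below shows only the underlying family of methods: plain user-based collaborative filtering with cosine similarity over a small made-up rating matrix (numpy assumed).

import numpy as np

# Rows are users, columns are product images; entries are ratings of how well
# an image matches a Kansei adjective (0 = unrated). Values are made up.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return u @ v / denom if denom else 0.0

def predict(R, user, item):
    """Predict a missing rating from similarity-weighted ratings of other users."""
    sims, ratings = [], []
    for other in range(R.shape[0]):
        if other != user and R[other, item] > 0:
            sims.append(cosine(R[user], R[other]))
            ratings.append(R[other, item])
    sims, ratings = np.array(sims), np.array(ratings)
    return ratings.mean() if sims.sum() == 0 else (sims @ ratings) / sims.sum()

print(predict(R, user=0, item=2))   # estimated rating for an unseen image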
Article
In this paper, a new active dynamic lightning protection method is proposed based on the big data characteristics of electric power. The method has two main parts: first, a Neo4j framework model used to analyze the large data volumes of the power system and to regulate the power system dynamically, with Python software used to compare different framework models; second, a comparison between dynamic lightning protection and conventional protection methods. The results show that Neo4j traversal speed is 87.5% and 89.1% faster than Hadoop and Spark, respectively, and its clustering effect is 12.5% and 17.8% higher than Hadoop and Spark, respectively. The Neo4j framework model is therefore better suited to the characteristics of large data volumes in power systems. After a lightning accident, the power-off time of the dynamic lightning protection system is reduced by about 53.1%, and the recovery time of the system decreases by about 42.8%. In the dynamic regulation of the power system, the power supply output is reduced by 35.1 MW and 15.8 MW of load is shed, which greatly reduces the impact of a lightning strike on the power supply and on important loads.
Conference Paper
Full-text available
High performance querying and ad-hoc querying are commonly viewed as mutually exclusive goals in massively parallel processing databases. At one extreme, a database can be set up to provide the results of a single known query so that the use of the available resources is maximized and response time minimized, but at the cost of all other queries being suboptimally executed. At the other extreme, when no query is known in advance, the database must provide the information without such optimization, normally resulting in inefficient execution of all queries. This paper introduces a novel technique, highly normalized Big Data using Anchor modeling, that provides a very efficient way to store information and utilize resources, thereby providing ad-hoc querying with high performance for the first time in massively parallel processing databases. A case study of how this approach is used for a Data Warehouse at Avito over two years' time, with estimates for and results of real data experiments carried out in HP Vertica, an MPP RDBMS, is also presented.
Conference Paper
Full-text available
Fake identities and Sybil accounts are pervasive in today's online communities. They are responsible for a growing number of threats, including fake product reviews, malware and spam on social networks, and astroturf political campaigns. Unfortunately, studies show that existing tools such as CAPTCHAs and graph-based Sybil detectors have not proven to be effective defenses. In this paper, we describe our work on building a practical system for detecting fake identities using server-side clickstream models. We develop a detection approach that groups "similar" user clickstreams into behavioral clusters, by partitioning a similarity graph that captures distances between clickstream sequences. We validate our clickstream models using ground-truth traces of 16,000 real and Sybil users from Renren, a large Chinese social network with 220M users. We propose a practical detection system based on these models, and show that it provides very high detection accuracy on our clickstream traces. Finally, we worked with collaborators at Renren and LinkedIn to test our prototype on their server-side data. Following positive results, both companies have expressed strong interest in further experimentation and possible internal deployment.
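The clustering idea described above can be illustrated with a toy example: compute pairwise similarities between clickstream sequences and keep edges above a threshold to form behavioral clusters. The Python sketch below uses the standard-library difflib sequence matcher as a stand-in similarity measure; the event names, threshold, and similarity function are assumptions, not the Renren production pipeline.

from difflib import SequenceMatcher
from itertools import combinations

# Toy clickstreams: each is the ordered list of event types a user produced.
streams = {
    "u1": ["login", "profile", "friend_request", "friend_request", "friend_request"],
    "u2": ["login", "friend_request", "friend_request", "friend_request", "logout"],
    "u3": ["login", "feed", "photo", "comment", "feed", "logout"],
}

def similarity(a, b):
    """Sequence similarity in [0, 1] based on matching event subsequences."""
    return SequenceMatcher(None, a, b).ratio()

# Build the similarity graph and keep only edges above a threshold; connected
# users end up in the same behavioral cluster.
THRESHOLD = 0.6
edges = [(u, v) for u, v in combinations(streams, 2)
         if similarity(streams[u], streams[v]) >= THRESHOLD]
print(edges)   # [('u1', 'u2')]: the two spam-like streams group together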
Conference Paper
Full-text available
MapReduce has recently gained great popularity as a programming model for processing and analyzing massive data sets and is extensively used by academia and industry. Several implementations of the MapReduce model have emerged, the Apache Hadoop framework being the most widely adopted. Hadoop offers various utilities, such as a distributed file system, job scheduling and resource management capabilities and a Java API for writing applications. Hadoop's success has intrigued research interest and has led to various modifications and extensions to the framework. Implemented optimizations include performance improvements, programming model extensions, tuning automation and usability enhancements. In this paper, we discuss the current state of the Hadoop framework and its identified limitations. We present, compare and classify Hadoop/MapReduce variations, identify trends, open issues and possible future directions.
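As a reminder of the programming model the survey discusses, here is a minimal in-process rendering of the map and reduce phases in plain Python; it is not Hadoop's API, only the conceptual shape of a MapReduce job (word count).

from collections import defaultdict

def map_phase(document):
    """Map: emit (word, 1) pairs for every word in one input split."""
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(pairs):
    """Shuffle and reduce: group pairs by key and sum the counts."""
    grouped = defaultdict(int)
    for word, count in pairs:
        grouped[word] += count
    return dict(grouped)

splits = ["big data big clusters", "data locality matters"]
pairs = [pair for doc in splits for pair in map_phase(doc)]
print(reduce_phase(pairs))   # {'big': 2, 'data': 2, 'clusters': 1, ...}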
Article
Full-text available
This paper describes the system architecture of the Vertica Analytic Database (Vertica), a commercialization of the design of the C-Store research prototype. Vertica demonstrates a modern commercial RDBMS system that presents a classical relational interface while at the same time achieving the high performance expected from modern "web scale" analytic systems by making appropriate architectural choices. Vertica is also an instructive lesson in how academic systems research can be directly commercialized into a successful product.
Article
Full-text available
Maintaining and evolving data warehouses is a complex, error prone, and time consuming activity. The main reason for this state of affairs is that the environment of a data warehouse is in constant change, while the warehouse itself needs to provide a stable and consistent interface to information spanning extended periods of time. In this article, we propose an agile information modeling technique, called Anchor Modeling, that offers non-destructive extensibility mechanisms, thereby enabling robust and flexible management of changes. A key benefit of Anchor Modeling is that changes in a data warehouse environment only require extensions, not modifications, to the data warehouse. Such changes, therefore, do not require immediate modifications of existing applications, since all previous versions of the database schema are available as subsets of the current schema. Anchor Modeling decouples the evolution and application of a database, which when building a data warehouse enables shrinking of the initial project scope. While data models were previously made to capture every facet of a domain in a single phase of development, in Anchor Modeling fragments can be iteratively modeled and applied. We provide a formal and technology independent definition of anchor models and show how anchor models can be realized as relational databases together with examples of schema evolution. We also investigate performance through a number of lab experiments, which indicate that under certain conditions anchor databases perform substantially better than databases constructed using traditional modeling techniques.
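To make the "extensions, not modifications" property concrete, here is a minimal sketch that evolves a toy anchor-style schema by only adding tables, using Python's built-in sqlite3 module; the table names and columns are hypothetical and not tied to any schema from the article.

import sqlite3

conn = sqlite3.connect(":memory:")

# Version 1 of the toy anchor model: one anchor and one attribute table.
conn.executescript("""
CREATE TABLE product_anchor (product_id INTEGER PRIMARY KEY);
CREATE TABLE product_name  (product_id INTEGER, name TEXT, valid_from TEXT);
""")

# Schema evolution, Anchor-modeling style: a new requirement ("price") is met
# by adding a table, never by altering or dropping the existing ones, so the
# old schema remains available as a subset of the new one.
conn.executescript("""
CREATE TABLE product_price (product_id INTEGER, price REAL, valid_from TEXT);
""")

tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)   # ['product_anchor', 'product_name', 'product_price']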
Article
Full-text available
Categorizing visitors based on their interactions with a website is a key problem in web usage mining. The clickstreams generated by various users often follow distinct patterns, the knowledge of which may help in providing customized content. In this paper, we propose a novel and effective algorithm for clustering web users based on a function of the longest common subsequence of their clickstreams that takes into account both the trajectory taken through a website and the time spent at each page. Results are presented on weblogs of www.sulekha.com to illustrate the techniques. Keywords: web usage mining, clickstream, subsequence, clustering.
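The similarity measure above is built on the longest common subsequence (LCS) of two clickstreams. The Python sketch below shows the standard dynamic-programming LCS length and one possible normalization; the time-weighting used in the paper is not reproduced here.

def lcs_length(a, b):
    """Length of the longest common subsequence of two clickstreams."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

# One possible similarity in [0, 1]: normalize by the shorter stream.
s1 = ["home", "forum", "post", "home", "profile"]
s2 = ["home", "post", "profile", "logout"]
print(lcs_length(s1, s2) / min(len(s1), len(s2)))   # 0.75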
Article
In this paper, we review the background and state-of-the-art of big data. We first introduce the general background of big data and review related technologies, such as cloud computing, Internet of Things, data centers, and Hadoop. We then focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally examine several representative applications of big data, including enterprise management, Internet of Things, online social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area to readers. This survey is concluded with a discussion of open problems and future directions.
Conference Paper
Column store databases allow for various tuple reconstruction strategies (also called materialization strategies). Early materialization is easy to implement but generally performs worse than late materialization. Late materialization is more complex to implement, and usually performs much better than early materialization, although there are situations where it is worse. We identify these situations, which essentially revolve around joins where neither input fits in memory (also called spilling joins). Sideways information passing techniques provide a viable solution to get the best of both worlds. We demonstrate how early materialization combined with sideways information passing allows us to get the benefits of late materialization, without the bookkeeping complexity or worse performance for spilling joins. It also provides some other benefits to query processing in Vertica due to positive interaction with compression and sort orders of the data. In this paper, we report our experiences with late and early materialization, highlight their strengths and weaknesses, and present the details of our sideways information passing implementation. We show experimental results of comparing these materialization strategies, which highlight the significant performance improvements provided by our implementation of sideways information passing (up to 72% on some TPC-H queries).
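The idea of sideways information passing can be illustrated outside a database engine: collect the join keys of the small input first and push them into the scan of the large input, so rows that cannot join are dropped before they are materialized. The Python sketch below is only a toy model of that idea, with made-up data, not Vertica's implementation.

def fact_scan(n=1_000_000):
    """Stand-in for a column scan of a large fact table."""
    for row_id in range(n):
        yield row_id, row_id % 10, row_id * 1.5   # (row_id, join_key, measure)

dimension = [(1, "electronics"), (2, "books")]    # small join input

# Sideways information passing: collect the join keys of the small input
# first, then push that set down into the scan of the large input so rows
# that cannot join are filtered before they are materialized.
sip_keys = {key for key, _ in dimension}
filtered = [row for row in fact_scan() if row[1] in sip_keys]

print(len(filtered))   # 200,000 of 1,000,000 rows survive the pushed-down filter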
Hans Hultgren. Modeling the Agile Data Warehouse with Data Vault (Volume 1). Brighton Hamilton, 2012.