International Journal of Database Management Systems (IJDMS) Vol.11, No.2/3, June 2019
DOI: 10.5121/ijdms.2019.11301
BRIDGING DATA SILOS USING BIG DATA INTEGRATION
Jayesh Patel
Senior Member, IEEE
ABSTRACT
With cloud computing, cheap storage, and technology advancements, an enterprise uses multiple applications to operate its business functions. Applications are not limited to transactions, customer service, sales, and finance; they also cover security, application logs, marketing, engineering, operations, HR, and many more. Each business vertical uses multiple applications that generate a huge amount of data. On top of that, social media, IoT sensors, SaaS solutions, and mobile applications record exponential growth in data volume. In almost all enterprises, data silos exist across these applications. These applications can produce structured, semi-structured, or unstructured data at different velocities and volumes. Having all data sources integrated and generating timely insights helps overall decision making. With recent developments in big data integration, data silos can be managed better and can generate tremendous value for enterprises. Big data integration offers flexibility, speed, and scalability for integrating large data sources. It also offers tools to generate analytical insights that help stakeholders make effective decisions. This paper presents an overview of data silos, the challenges they pose, and how big data integration can help overcome them.
KEYWORDS
Data Silo, Big Data, Data Pipelines, Integration, Data Lake, Hadoop
1. INTRODUCTION
A data silo is a segregated group of data stored in one of many enterprise applications. Most applications store raw and processed data in their own way. Each application has its own features and tools that give business users easy access to processed data through dashboards and reports. Most of these dashboards and reports are application specific. As a result, teams in various business units have limited visibility of the data in the enterprise and only see a partial picture. They cannot extract the full value of the data held in the various enterprise applications that form these silos. Data silos restrict information sharing and collaboration among teams, which leads to poor decision making and negatively impacts profitability [16][17].
With evolving APIs in enterprise applications and the emergence of new technologies, frameworks, and platforms, there are many opportunities to integrate these silos. Integrating thousands of disparate data sources is now easier than ever before thanks to big data integration frameworks and tools. This paper focuses on how big data integration can help integrate disparate data sources.
2. RELATED WORK
Data integration has been around for the last few decades and continues to evolve. Enterprises often manage data silos differently based on their priorities and challenges. Data integration is the key to bridging data silos and unlocking the true value of data [15].
Because applications can be heterogeneous, integrating them is challenging. Data integration mitigates the risks of integrating heterogeneous sources and offers flexibility. Paper [21] discusses multiple approaches to integrating heterogeneous data warehouses, illustrated by a practical system.
Paper [14] compares traditional data warehouse toolkits with big data integration. It provides details on methodology, architecture, ETL challenges, processing, and storage for the data warehouse and the big data lake. It summarizes the characteristics of the data warehouse and Big Data, followed by a proposed model to process big data.
A tutorial on big data integration [2] discussed techniques such as record linkage, schema mapping, and data fusion. It also summarized how big data integration addresses the challenges of Big Data.
Paper [11] reviews big data integration by presenting the issues with traditional data integration and relevant work on big data integration. The challenges of Big Data are also discussed, along with techniques to resolve them in a big data environment.
A book on big data analytics [1] discusses strategies and a roadmap for integrating big data analytics in enterprises. It presents how big data tools and techniques can help develop big data applications to manage data silos.
3. DATA SILOS AND CHALLENGES
3.1. Data Silos
In the age of artificial intelligence, data science, machine learning, and predictive analytics, generating insights from the right data at the right time is critical. These days, there is a strong wave of adopting artificial intelligence and machine learning to become data-driven and gain a competitive advantage.
The biggest hurdle to success in this endeavor is data integration and preparation [18]. Due to the increasing number of business applications in the corporate world, data sit in hundreds or thousands of applications or servers, which results in data silos. The situation becomes worse during mergers and acquisitions. On the other side, it is not practical to give everyone in the company access to all applications. Even if it were, integrating the required datasets without a proper strategy would take a long time; manually acquiring and integrating data from data silos takes days, months, or even years. Data silos form for various reasons, including, but not limited to, structural, vendor-contract, and political ones [16].
3.2. Challenges with Data Silos
According to a survey by the American Management Association [8], 83% of executives said that their organizations have silos and 97% think silos have a negative effect. The simplest reason is that they impair overall decision making and affect each business vertical.
It is obvious that data silos restrict visibility across different verticals. Data enrichment is not possible with data silos, which is why they negatively impact informed decision making.
Additionally, data silos can represent the same problem or situation differently. For example, an active subscriber in a SaaS (software as a service) company may mean something different to the finance team than to the marketing team. For the finance team, a subscriber who is paying for a subscription or using a promotion is considered active. For marketing, active status depends on various activities on the platform, such as logins and in-product activity. That leads to inefficiency and additional work to determine which source is accurate.
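To make the inconsistency concrete, the following is a minimal, hypothetical sketch of how two teams might compute "active subscribers" from their own silos. The table layouts, column names, and thresholds are illustrative assumptions, not taken from any specific system.

    import pandas as pd

    # Hypothetical finance silo: billing records per subscriber.
    billing = pd.DataFrame({
        "subscriber_id": [1, 2, 3, 4],
        "paying":        [True, True, False, False],
        "on_promotion":  [False, False, True, False],
    })

    # Hypothetical marketing silo: platform activity per subscriber.
    activity = pd.DataFrame({
        "subscriber_id":   [1, 2, 3, 4],
        "logins_last_30d": [12, 0, 5, 9],
    })

    # Finance definition: paying or on a promotion.
    finance_active = billing.loc[billing["paying"] | billing["on_promotion"], "subscriber_id"]

    # Marketing definition: at least one login in the last 30 days.
    marketing_active = activity.loc[activity["logins_last_30d"] > 0, "subscriber_id"]

    print("Finance active count:  ", len(finance_active))     # 3
    print("Marketing active count:", len(marketing_active))   # 3
    print("Disagreement:", set(finance_active) ^ set(marketing_active))  # {2, 4}

Both teams report a plausible number, yet the underlying populations differ; reconciling such differences is exactly the work that an integrated data platform should eliminate.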
3.3. Bridging Data Silos
As data silos are formed from multiple applications and processes, data reside in various places: the cloud, on-premises systems, application servers, flat files, databases, and so on. The key is to find the value in the data and identify which data silos would create the most value if integrated. One way to overcome data silos is to address their sources by enhancing collaboration and communication among departments and teams. Moving from separate processes and applications to unified ones breaks down data silos. However, it requires major effort and a change in the overall culture of the organization [16][17].
Another way to solve this problem is to integrate these data silos using integration techniques and tools. Integrating data silos is a costly and time-consuming process. Multiple frameworks and tools exist for data integration, but we focus on big data integration because of its long-term benefits.
Let's understand the need for integrating data silos in the corporate world with a simple analogy. In a household kitchen, you will find sugar in one container, coffee in another, and creamer in yet another. We combine all three in different proportions to make a delicious coffee. Similarly, multiple applications in enterprises form data silos for specific operations. When executives and investors look at a company as a whole, a clear and complete view of the overall picture goes a long way. Integrating data silos effectively resolves this critical challenge.
4. BIG DATA INTEGRATION
Enterprise applications generate a variable volume of data in real time, near real time, or batch. Data can be structured, semi-structured, or unstructured [5][14]. Almost 80% of the data from enterprise applications is either unstructured or semi-structured [9]. Volume can be low or high, but there is no clear bar to classify it as one or the other [11]. Applications can generate static or dynamic data structures, and they are heterogeneous in most cases [14]. Most of these sources demonstrate the characteristics of Big Data, often known as the seven V's: Volume, Variety, Velocity, Veracity, Value, Variability, and Viability [4][14].
Traditional data integration techniques work well for structured data [21]. Traditional data integration can handle some of the characteristics of Big Data, but it falls short in handling semi-structured and unstructured data at scale [2][3]. With advancements in big data technologies and infrastructure, it has become much easier to integrate a variety of data sources at scale [11]. Distributed processing and distributed messaging systems have made integration possible at high scale. MapReduce and Hadoop used to be a good option for integrating data silos [14]. With recent developments in Spark, Spark and Hadoop together are an even better option for integrating a variety of data sources [6]. Kafka and other distributed messaging technologies have made real-time data integration possible [12][13]. Depending on the needs, data can be processed in batch or in real time using big data tools.
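As an illustration of this kind of real-time integration, the sketch below uses Spark Structured Streaming to consume events from a Kafka topic and land them in a data lake path. The broker address, topic name, event schema, and paths are hypothetical placeholders, and the job assumes the spark-sql-kafka connector is available; a production pipeline would also add error handling, security, and monitoring.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("silo-integration-sketch").getOrCreate()

    # Assumed event schema for a hypothetical application silo.
    event_schema = StructType([
        StructField("user_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_time", TimestampType()),
    ])

    # Read raw events from Kafka (placeholder broker and topic).
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")
           .option("subscribe", "app_events")
           .load())

    # Parse the JSON payload and keep only the fields of interest.
    events = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(from_json(col("json"), event_schema).alias("e"))
              .select("e.*"))

    # Land the parsed events in the data lake in raw Parquet form.
    query = (events.writeStream
             .format("parquet")
             .option("path", "hdfs:///datalake/raw/app_events")
             .option("checkpointLocation", "hdfs:///datalake/checkpoints/app_events")
             .start())

    query.awaitTermination()

The same pattern works in batch mode by swapping readStream/writeStream for read/write, which is why the choice between batch and real-time processing can be driven purely by business needs.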
To integrate data silos, you can either extract raw data from each application or pull processed data from the sources, depending on the requirements and the nature of the applications. The following general steps, identified from several use cases, are used to integrate data silos (an illustrative batch sketch follows this list):
1) Data Discovery: Identify the data silos to be integrated. This is a data discovery and justification phase. Evaluate and determine how integrating each data silo will benefit the entire corporation. Set clear expectations, benchmarks, and constraints for each data silo. This phase gives you an idea of which data silos should be handled first.
2) Data Extraction: Determine what to extract from the sources. Data can come from relational databases, the cloud, IoT sensors, flat files, application logs, images, large documents, and so on [1][19]. Data extraction should not put extra burden on the data sources and should not impact other production processes. "Schema on read" can be beneficial for semi-structured and unstructured data, and big data integration handles it easily [10].
3) Data Export: Export data to temporary storage or a messaging system. Kafka is a powerful message broker for high-throughput real-time event data and can be very effective in integrating data silos [12][20]. Other alternatives should be evaluated based on needs.
4) Data Loading: Load data to the cloud, the Hadoop Distributed File System (HDFS), or any big data platform in real time or in batches [6]. Data can be loaded as-is in raw format. Based on requirements, data can also be transformed before it is stored.
5) Data Processing and Analytics: Consolidate data sources based on business needs. This is the step where unstructured or semi-structured data is processed into structured data so that it can be used in analytics. It includes aggregating data and generating key performance metrics and analytical data models [7].
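The batch sketch below walks through extraction (step 2), loading (step 4), and processing (step 5) for a single hypothetical billing silo; the earlier streaming example already illustrated a Kafka-based export (step 3). The JDBC URL, table, credentials, column names, and lake paths are assumptions for illustration only, and the JDBC read requires the appropriate database driver on the classpath.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("silo-batch-sketch").getOrCreate()

    # Step 2 (Data Extraction): pull processed data from a hypothetical
    # relational silo over JDBC; URL, table, and credentials are placeholders.
    orders = (spark.read.format("jdbc")
              .option("url", "jdbc:postgresql://billing-db:5432/billing")
              .option("dbtable", "public.orders")
              .option("user", "readonly")
              .option("password", "********")
              .load())

    # Step 4 (Data Loading): land the extracted data as-is in the raw zone
    # of the data lake, partitioned by ingestion date.
    (orders.withColumn("ingest_date", F.current_date())
           .write.mode("append")
           .partitionBy("ingest_date")
           .parquet("hdfs:///datalake/raw/billing/orders"))

    # Step 5 (Data Processing and Analytics): consolidate the raw data into a
    # key performance metric, e.g. daily revenue per region.
    raw_orders = spark.read.parquet("hdfs:///datalake/raw/billing/orders")
    daily_revenue = (raw_orders
                     .groupBy("region", F.to_date("order_ts").alias("order_date"))
                     .agg(F.sum("amount").alias("revenue")))

    # Store the aggregate alongside the raw data so downstream consumers
    # do not need to recompute it (see Section 5 on the data lake).
    daily_revenue.write.mode("overwrite").parquet("hdfs:///datalake/metrics/daily_revenue")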
5. BIG DATA LAKE
The data lake is an excellent choice for data consolidation and for integrating data silos, and it is an integral part of big data integration. It offers several benefits over the traditional data warehouse. The data lake supports a strategy of building a universal culture of scalable, data-driven insights. Data lakes are highly flexible, agile, and reusable. As they are designed for low-cost storage, they are commonly used to store historical data from various sources [15].
When you have all the data you need in one place, you can use it for all purposes. Instead of storing only raw data, the data lake should also store processed data, metrics, and aggregations to take full advantage of it. To avoid extra processing and delay, common aggregations and metrics should be kept ready in the data lake. That way it can serve as an enterprise data platform and a single source of truth.
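As a hedged illustration of keeping aggregations ready, a downstream consumer can read a precomputed metric directly instead of re-deriving it from raw data; the paths and column names simply continue the hypothetical example from Section 4.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("lake-consumer-sketch").getOrCreate()

    # Read the precomputed metric from the data lake instead of rescanning
    # and re-aggregating the much larger raw orders dataset.
    daily_revenue = spark.read.parquet("hdfs:///datalake/metrics/daily_revenue")

    # Each team answers its own question from the same shared numbers,
    # so different verticals no longer disagree on the figures.
    last_quarter = (daily_revenue
                    .filter(F.col("order_date") >= F.lit("2019-01-01"))
                    .groupBy("region")
                    .agg(F.sum("revenue").alias("quarterly_revenue")))

    last_quarter.show()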
As the journey to the big data lake is not short, stick to the goals identified during data discovery. This journey will open doors to new opportunities and will help enterprises overcome their data silos.
6. CONCLUSIONS
Today, data are being generated by hundreds or thousands of enterprise applications at an unprecedented scale. As these applications are used by different business verticals and teams, data sharing is not easy even within a single business vertical. As a result, multiple data silos form in enterprises. Data can generate enormous value if integrated, analyzed, and presented correctly. This paper presented the challenges of data silos and how big data integration can help tackle them. It discussed the use of the big data lake to integrate data silos and how it can serve multiple needs through a single platform. As big data integration is evolving day by day, new frameworks, platforms, and techniques are expected to make integration easier in the future.
Additionally, data governance for the big data lake and data access control over enterprise data have significant scope for future development.
REFERENCES
[1] David Loshin, “Big Data Analytics”, Elsevier, 2013
[2] Xin Luna Dong, Divesh Srivastava, 2013. “Big Data Integration”, ICDE conference 2013
[3] Sachchidanand Singh, Nirmala Singh, 2012. “Big Data Analytics”, International Conference on
Communication, Information & Computing Technology (ICCICT), Oct 19-20, 2012.
[4] P. Bedi, V. Jindal, and A. Gautam, “Beginning with Big Data Simplified,” 2014.
[5] W. H. Inmon and D. Linstedt, Data Architecture: A Primer for the Data Scientist: Big Data, Data Warehouse and Data Vault. Amsterdam, Boston: Elsevier, 2014.
[6] J. G. Shanahan and L. Dai, "Large Scale Distributed Data Science Using Apache Spark," in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 2323-2324.
[7] A White Paper, 2013. "Aggregation and analytics on Big Data using the Hadoop ecosystem"
[8] Comfort LK. Risk, Security, and Disaster Management. Annual Review of Political Science. 2005;8:335-356.
[9] eWEEK. (2019). Managing Massive Unstructured Data Troves: 10 Best Practices. [online] Available
at: http://www.eweek.com/storage/slideshows/managing-massiveunstructured-data-troves-10-best-
practices#sthash.KAbEigHX.dpuf [Accessed 11 May 2019].
[10] Soumya Sen, Ranak Ghosh, Debanjali, Nabendu Chaki, 2012. "Integrating XML Data into Multiple ROLAP Data Warehouse Schemas", International Journal of Software Engineering and Applications (IJSEA), Vol 3, No.1, Jan 2012.
[11] B. Arputhamary and L. Arockiam. "A Review on Big Data Integration", IJCA Proceedings on International Conference on Advanced Computing and Communication Techniques for High Performance Applications ICACCTHPA 2014(5):21-26, February 2015.
[12] J. Kreps, N. Narkhede, and J. Rao, "Kafka: A distributed messaging system for log processing," in Proc. NetDB, 2011, pp. 1-7.
[13] J. Liao, X. Zhuang, R. Fan and X. Peng, "Toward a General Distributed Messaging Framework for
Online Transaction Processing Applications," in IEEE Access, vol. 5, pp. 18166-18178, 2017.
[14] Salinas, Sonia Ordonez and Alba C.N. Lemus. (2017) “Data Warehouse and Big Data integration”
Int. Journal of Comp. Sci. and Inf. Tech. 9(2): 1-17.
[15] Analytics Magazine, 03-Nov-2016. “Data Lakes: The biggest big data challenges,” [Online].
Available at: http://analytics-magazine.org/data-lakes-biggest-big-data-challenges/. [Accessed: 11-
May-2019].
[16] Alienor. "What Is a Data Silo and Why Is It Bad for Your Organization?" Plixer. July 31, 2018.
Accessed May 11, 2019. https://www.plixer.com/blog/network-security/data-silo-what-is-it-why-is-it-
bad/.
[17] "4 Best Ways To Breakdown Data Silos [Problems and Solutions]." Status Guides. February 26,
2019. Accessed May 11, 2019. https://status.net/articles/data-silos-information-silos/.
[18] G. Press, “Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey
Says,” Forbes, 23-Mar-2016. [Online]. Available:
https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-
enjoyable-data-science-task-survey-says/#6d5fd596f637. [Accessed: 11-May-2019].
[19] Amitkumar Manekar and G. Pradeepini (2015, May), "A Review On Cloud Based Data Analysis", International Journal of Computer Networks & Communications (IJCNC), May 2015, Vol.1 No.1
[20] L. Duggan, J. Dowzard, J. Katupitiya, and K. C. Chan, "A Rapid Deployment Big Data Computing Platform for Cloud Robotics," International Journal of Computer Networks & Communications, vol. 9, no. 6, pp. 77-88, 2017.
[21] Torlone, Riccardo. (2008). Two approaches to the integration of heterogeneous data warehouses.
Distributed and Parallel Databases. 23. 69-97. 10.1007/s10619-007-7022-z.
Authors
Jayesh Patel completed his Bachelor of Engineering in Information Technology in 2001 and an MBA in Information Systems in 2007. He currently works for Rockstar Games as a Senior Data Engineer, focusing on developing data-driven decision-making processes on a Big Data platform. He has successfully built machine learning pipelines and architected big data analytics solutions over the past several years. Additionally, he is a Senior Member of the IEEE.