International Journal of Data mining Management Systems (IJDMS), Vol. 3, No. 1, May 2019.
DOI: 10.5121/IJDMS.2019.1301
OVERCOMING DATA SILOS THROUGH BIG DATA INTEGRATION
Jayesh Patel
Senior Member, IEEE
ABSTRACT
With cloud computing, cheap storage, and technology advancements, an enterprise uses multiple applications to operate its business functions. Applications are not limited to transactions, customer service, sales, and finance; they also cover security, application logs, marketing, engineering, operations, HR, and many more. Each business vertical uses multiple applications, which generate a huge amount of data. On top of that, social media, IoT sensors, SaaS solutions, and mobile applications record exponential growth in data volume. In almost all enterprises, data silos exist across these applications. These applications can produce structured, semi-structured, or unstructured data at different velocities and in different volumes. Having all data sources integrated and generating timely insights helps overall decision making. With recent developments in big data integration, data silos can be managed better and can generate tremendous value for enterprises. Big data integration offers flexibility, speed, and scalability for integrating large data sources. It also offers tools to generate analytical insights that help stakeholders make effective decisions. This paper presents an overview of data silos, the challenges they pose, and how big data integration can help overcome them.
KEYWORDS
Data Silo, Big Data, Data Pipelines, Integration, Data Lake, Hadoop
1. INTRODUCTION
A data silo is a segregated group of data stored in one of the many applications across an enterprise. Most applications store raw and processed data in various ways. Each application has its own features and tools that give business users easy access to processed data through dashboards and reports. Most of these dashboards and reports are application specific. As a result, teams in various business units have limited visibility of the data in the enterprise and see only a partial picture. They cannot extract the full value of the data spread across the silos. Data silos restrict information sharing and collaboration among teams, which leads to poor decision making and negatively impacts profitability [16][17].
With evolving APIs in enterprise applications and the emergence of new technologies, frameworks, and platforms, there are many opportunities to integrate these silos. Integrating thousands of disparate data sources is now easier than ever before thanks to big data integration techniques and tools. This paper focuses on how big data integration can help integrate disparate data sources.
2. DATA SILOS AND CHALLENGES
2.1. Data Silos
In the age of artificial intelligence, data science, machine learning, and predictive analytics, generating insights from the right data at the right time is critical. These days, there is a strong push to adopt artificial intelligence and machine learning in order to become data-driven and gain a competitive advantage.
The biggest hurdle to success in this endeavor is data integration and preparation [18]. Due to the increasing number of business applications in the corporate world, data sits in hundreds or thousands of applications or servers, resulting in data silos. The situation gets worse at times of mergers and acquisitions. On the other hand, it is not practical to give everyone in the company access to all applications, and even if it were, integrating the required datasets without a proper strategy takes enormous time: acquiring and integrating data from data silos can take days, months, or even years. Data silos form for various reasons, including structural, vendor-contract, and political ones [16].
2.2. Challenges with Data Silos
According to a survey by the American Management Association [8], 83% of executives said that their organizations have silos, and 97% think silos have a negative effect. The simplest reason is that silos impair overall decision making and affect every business vertical.
Data silos clearly restrict visibility across different verticals. Data enrichment is not possible with data silos, which hurts informed decision making. Additionally, data silos can represent the same problem or situation differently. For example, an active subscriber in a SaaS (software as a service) company may mean something different to the finance team than to the marketing team. For finance, a subscriber who is paying a subscription or using a promotion is considered active. For marketing, active status depends on various activities on the platform, such as logins and in-product actions. That leads to inefficiency and additional work to determine which source is accurate.
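To make the conflict concrete, here is a small sketch in Python; all records and field names are hypothetical, chosen only to illustrate how the two definitions disagree on the same subscriber data:

```python
# Hypothetical subscriber records; all fields are illustrative only.
subscribers = [
    {"id": 1, "paying": True,  "on_promotion": False, "logins_last_30d": 0},
    {"id": 2, "paying": False, "on_promotion": True,  "logins_last_30d": 12},
    {"id": 3, "paying": False, "on_promotion": False, "logins_last_30d": 5},
]

# Finance's definition: active if paying or on a promotion.
def finance_active(s):
    return s["paying"] or s["on_promotion"]

# Marketing's definition: active if there was recent platform activity.
def marketing_active(s):
    return s["logins_last_30d"] > 0

for s in subscribers:
    print(s["id"], finance_active(s), marketing_active(s))
# Subscriber 1 is active for finance but not for marketing, and
# subscriber 3 is the opposite: each team reports a different
# "active subscriber" count from the same underlying data.
```

Until the silos are integrated under shared definitions, both counts are "correct" within their own silo, which is exactly the inefficiency described above.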
2.3. Overcoming Data Silos
As data silos are formed by multiple applications and processes, data resides in various places: the cloud, on-premise systems, application servers, flat files, databases, and so on. The key is to find the value in the data and determine which data silos would create the most value if integrated. One way to overcome data silos is to reorganize their sources to improve collaboration and communication among departments and teams: moving from separate processes and applications to unified ones breaks down the silos. However, this requires major effort and a change in the overall culture of the organization [16][17].
Another way to solve the problem is to integrate the data silos using integration techniques and tools. Integrating data silos is a costly and time-consuming process. Multiple frameworks and tools exist for data integration, but this paper focuses on big data integration because of its long-term benefits.
Let’s understand the need for integrating data silos in the corporate world with a simple analogy. In a household kitchen, you will find sugar in one container, coffee in another, and creamer in a third. We combine all three in different proportions to make a delicious coffee. Similarly, multiple applications in enterprises form data silos for specific operations. When executives and investors look at a company as a whole, a clear, unified view of the overall picture goes a long way. Integrating data silos effectively can resolve this critical challenge.
3. BIG DATA INTEGRATION
Enterprise applications generate variable volumes of data in real time, near real time, or batch. Data can be structured, semi-structured, or unstructured [5][14]. Almost 80% of data from enterprise applications is either unstructured or semi-structured [9]. Volume can be low or high, but there is no clear bar for classifying it as one or the other [11]. Applications can generate static or dynamic data structures, and they are heterogeneous in most cases [14]. Most of these sources demonstrate the characteristics of Big Data, often known as the seven V's: Volume, Variety, Velocity, Veracity, Value, Variability, and Viability [4][14].
Traditional data integration techniques work well for structured data and can handle some of the characteristics of Big Data, but they fall short on handling semi-structured and unstructured data at scale [2][3]. With advancements in big data technologies and infrastructure, it has become much easier to integrate a variety of data sources at scale [11]. Distributed processing and distributed messaging systems have made integration possible at high scale. MapReduce and Hadoop used to be a good option for integrating data silos [14]; after recent developments in Spark, combining Spark with Hadoop is a better fit for integrating a variety of data sources [6]. Kafka and other distributed messaging technologies have made real-time data integration possible [12][13]. Depending on the needs, data can be processed in batch or in real time using big data tools.
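As a minimal sketch of such an integration, the following PySpark snippet reads a CSV export from a transactional system, semi-structured JSON application logs, and Parquet CRM data, then joins them into one unified view. All paths, table names, and column names are assumptions for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("silo-integration").getOrCreate()

# Hypothetical inputs from three formerly separate silos.
orders = spark.read.option("header", "true").csv("/landing/finance/orders.csv")
logs = spark.read.json("/landing/apps/activity_logs/")
crm = spark.read.parquet("/landing/crm/customers/")

# Register the heterogeneous sources as SQL views so a single query
# can join across what used to be three separate applications.
orders.createOrReplaceTempView("orders")
logs.createOrReplaceTempView("activity")
crm.createOrReplaceTempView("customers")

unified = spark.sql("""
    SELECT c.customer_id,
           COUNT(DISTINCT o.order_id) AS order_count,
           COUNT(DISTINCT a.event_id) AS event_count
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.customer_id
    LEFT JOIN activity a ON a.customer_id = c.customer_id
    GROUP BY c.customer_id
""")
unified.show()
```

The same job can run in batch or, with Spark Structured Streaming, over a live feed; the point is that one engine handles all three shapes of data.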
To integrate data silos, you can either extract raw data from each application or pull processed data from sources, based on the requirements and the nature of the applications. The following general steps, identified from several use cases, apply to integrating data silos (a minimal end-to-end sketch follows the list):
1) Data Discovery: Identify the data silos to be integrated. This is a data discovery and justification phase. Evaluate and determine how integrating each data silo will benefit the corporation as a whole. Set clear expectations, benchmarks, and constraints for each data silo. This phase gives you an initial sense of which data silos should be handled first.
2) Data Extraction: Determine what to extract from sources. Data can come from relational databases, the cloud, IoT sensors, flat files, application logs, images, large documents, and so on [1][19]. Extraction should not put extra burden on data sources and should not impact other production processes. "Schema on read" can be beneficial for semi-structured and unstructured data, and big data integration handles it easily [10].
3) Data Export: Export data to temporary storage or a messaging system. Kafka is a powerful message broker for high-throughput real-time event data and can be very effective in integrating data silos [12][20]. Other alternatives should be evaluated based on needs.
4) Data Loading: Load data to the cloud, the Hadoop Distributed File System (HDFS), or any big data platform, in real time or in batches [6]. Data can be loaded as-is in raw format; based on requirements, it can also be transformed before storage.
5) Data Processing and Analytics: Consolidate data sources based on business needs. This step turns unstructured or semi-structured data into structured data so that it can be used in analytics. It includes aggregating data and generating key performance metrics and analytical data models [7].
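As a rough illustration of steps 2-4, the following sketch uses the kafka-python client to extract application log records, export them through a Kafka topic, and load them as-is into a raw zone; the broker address, topic, and file paths are assumptions, and a local file stands in for HDFS or cloud object storage:

```python
import json
import os
from kafka import KafkaProducer, KafkaConsumer

# Steps 2-3 (extract and export): read application log lines and
# publish each record to a Kafka topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
with open("app_events.log") as source:
    for line in source:
        producer.send("silo.app_events", {"raw": line.strip()})
producer.flush()

# Step 4 (load): a consumer drains the topic and appends records,
# unmodified, to raw storage.
consumer = KafkaConsumer(
    "silo.app_events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    consumer_timeout_ms=5000,  # stop when no new messages arrive
)
os.makedirs("raw_zone", exist_ok=True)
with open("raw_zone/app_events.jsonl", "a") as sink:
    for message in consumer:
        sink.write(json.dumps(message.value) + "\n")
```

In production, the producer and consumer would be separate long-running processes, and step 5 would run on top of the raw zone.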
4. BIG DATA LAKE
A data lake is an excellent choice for data consolidation and for integrating data silos, and it is an integral part of big data integration. It offers several benefits over the traditional data warehouse. A data lake supports the strategy of developing a universal culture of scalable, data-driven insights. Data lakes are highly flexible, agile, and reusable. As they are designed for low-cost storage, they are commonly used to store historical data from various sources [15].
When all the data you need is in one place, it can serve every purpose. To take full advantage, the data lake should store not only raw data but also processed data, metrics, and aggregations. Keeping common aggregations and metrics ready in the data lake avoids repeated processing and saves time. That way, the lake can serve as an enterprise data platform and a single source of truth.
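As a sketch of keeping a common aggregation ready in the lake, assuming PySpark and hypothetical paths and column names, a daily active-users metric can be computed once and stored in a partitioned, columnar layout for everyone to reuse:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-aggregates").getOrCreate()

# Read raw events previously landed in the lake (hypothetical path).
events = spark.read.json("/lake/raw/app_events/")

# Compute a common daily metric once so downstream consumers
# do not repeat the work.
daily_active = (
    events
    .withColumn("event_date", F.to_date("event_time"))
    .groupBy("event_date")
    .agg(F.countDistinct("user_id").alias("daily_active_users"))
)

# Store the aggregate alongside the raw data, partitioned by date,
# in a columnar format suited to analytics.
(daily_active.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("/lake/metrics/daily_active_users/"))
```

Downstream dashboards and models then read the precomputed metric instead of rescanning the raw events.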
As the journey to a big data lake is not short, stick to the goals identified during data discovery. This journey will open doors to new opportunities and will help enterprises overcome data silos.
5. CONCLUSIONS
Today, data is generated from hundreds or thousands of enterprise applications at an unprecedented scale. As these applications are used by different business verticals and teams, data sharing is not easy even within a single vertical. As a result, multiple data silos form in enterprises. Data can generate enormous value if integrated, analyzed, and presented correctly. This paper presented the challenges posed by data silos and showed how big data integration can help overcome them. It discussed the use of a big data lake to integrate data silos and to serve multiple needs through a single platform. As big data integration evolves day by day, new frameworks, platforms, and techniques are expected to make integration even easier in the future. Additionally, data governance for the big data lake and access control on enterprise data have huge scope for future development.
REFERENCES
[1] David Loshin, “Big Data Analytics”, Elsevier, 2013.
[2] Xin Luna Dong and Divesh Srivastava, "Big Data Integration", ICDE Conference, 2013.
[3] Sachchidanand Singh and Nirmala Singh, "Big Data Analytics", International Conference on Communication, Information & Computing Technology (ICCICT), Oct. 19-20, 2012.
[4] P. Bedi, V. Jindal, and A. Gautam, “Beginning with Big Data Simplified,” 2014.
[5] W. H. Inmon and D. Linstedt, Data Architecture: A Primer for the Data Scientist: Big Data, Data Warehouse and Data Vault. Amsterdam, Boston: Elsevier, 2014.
[6] J. G. Shanahan and L. Dai, "Large Scale Distributed Data Science Using Apache Spark," in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 2323-2324.
[7] White Paper, "Aggregation and analytics on Big Data using the Hadoop eco-system", 2013.
[8] L. K. Comfort, "Risk, Security, and Disaster Management," Annual Review of Political Science, vol. 8, pp. 335-356, 2005.
[9] eWEEK, "Managing Massive Unstructured Data Troves: 10 Best Practices," 2019. [Online]. Available: http://www.eweek.com/storage/slideshows/managing-massiveunstructured-data-troves-10-best-practices#sthash.KAbEigHX.dpuf. [Accessed: 11-May-2019].
[10] Soumya Sen, Ranak Ghosh, Debanjali, Nabendu Chaki, "Integrating XML Data into Multiple ROLAP Data Warehouse Schemas," International Journal of Software Engineering and Applications (IJSEA), vol. 3, no. 1, Jan. 2012.
[11] B. Arputhamary and L. Arockiam, "A Review on Big Data Integration," IJCA Proceedings on International Conference on Advanced Computing and Communication Techniques for High Performance Applications (ICACCTHPA 2014), (5):21-26, February 2015.
[12] J. Kreps, N. Narkhede, and J. Rao, "Kafka: A distributed messaging system for log processing," in Proc. NetDB, 2011, pp. 1-7.
[13] J. Liao, X. Zhuang, R. Fan and X. Peng, "Toward a General Distributed Messaging Framework
for Online Transaction Processing Applications," in IEEE Access, vol. 5, pp. 18166-18178, 2017.
[14] Salinas, Sonia Ordonez and Alba C.N. Lemus. (2017) “Data Warehouse and Big Data
integration” Int. Journal of Comp. Sci. and Inf. Tech. 9(2): 1-17.
[15] Analytics Magazine, 03-Nov-2016. “Data Lakes: The biggest big data challenges,” [Online].
Available at: http://analytics-magazine.org/data-lakes-biggest-big-data-challenges/. [Accessed:
11-May-2019].
[16] Alienor. "What Is a Data Silo and Why Is It Bad for Your Organization?" Plixer. July 31, 2018.
Accessed May 11, 2019. https://www.plixer.com/blog/network-security/data-silo-what-is-it-why-
is-it-bad/.
[17] "4 Best Ways To Breakdown Data Silos [Problems and Solutions]." Status Guides. February 26,
2019. Accessed May 11, 2019. https://status.net/articles/data-silos-information-silos/.
[18] G. Press, “Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task,
Survey Says,” Forbes, 23-Mar-2016. [Online]. Available:
https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-
enjoyable-data-science-task-survey-says/#6d5fd596f637. [Accessed: 11-May-2019].
[19] Amitkumar Manekar and G. Pradeepini, "A Review On Cloud Based Data Analysis," International Journal on Computer Network And Communications (IJCNC), vol. 1, no. 1, May 2015.
[20] L. Duggan, J. Dowzard, J. Katupitiya, and K. C. Chan, "A Rapid Deployment Big Data Computing Platform for Cloud Robotics," International Journal of Computer Networks & Communications, vol. 9, no. 6, pp. 77-88, 2017.
Authors
Jayesh Patel completed his Bachelor of Engineering in Information Technology in 2001 and an MBA in Information Systems in 2007. He currently works for Rockstar Games as a Senior Data Engineer, focusing on developing data-driven decision-making processes on a big data platform. He has successfully built machine learning pipelines and architected big data analytics solutions over the past several years. Additionally, he is a Senior Member of IEEE.