Health Research Requires Efficient Platforms for Data Collection from Personal Devices
Erlend JOHANNESSEN a,1, André HENRIKSEN a, Eirik ÅRSAND a, Alexander HORSCH a, Jonas JOHANSSON b and Gunnar HARTVIGSEN a
a Dept. of Computer Science, UiT The Arctic University of Norway, Tromsø, Norway
b Dept. of Community Medicine, UiT The Arctic University of Norway, Tromsø, Norway
ORCiD ID: Johannessen 0000-0003-4860-9192, Henriksen 0000-0002-0918-7444,
Årsand 0000-0002-9520-1408, Horsch 0000-0001-7745-0139, Johansson 0000-0001-7912-5786,
Hartvigsen 0000-0001-8771-9867
Abstract. Data from consumer-based devices for collecting personal health-related
data could be useful in diagnostics and treatment. This requires a flexible and
scalable software and system architecture to handle the data. This study examines
the existing mSpider platform, addresses shortcomings in security and
development, and suggests a full risk analysis, a more loosely coupled component-
based system for long-term stability, better scalability, and maintainability. The
goal is to create a human digital twin platform for an operational production
environment.
Keywords. Infrastructure, Scalability, Human Digital Twin
1. Introduction
Physical activity (PA) trackers and smartwatches can be used for health data collection
in research as an addition to existing methods [1], and the data collected could be used
to support patient diagnostics and treatment [2], see overview by Henriksen et al. [3].
For collecting data from many different device suppliers, a flexible and robust
solution is needed that can also receive data from heterogeneous sources. The mSpider
(Motivating continuous Sharing of Physical activity using non-Intrusive Data
Extraction methods Retro- and prospectively) system is an experimental tool designed
for automatic and continuous collection of health-related data recorded by consumer-
based activity trackers [4]. It has been designed to collect data on various PA variables
from activity trackers from a range of different providers. Today’s activity trackers are
smart devices capable of collecting many PA-variable estimates and transferring them
to a smartphone for persistent storage. In their study, Henriksen et al. [4] collected
smartwatch data using the mSpider system.
The current mSpider architecture consists of two servers: an administrative
user-facing system (front-end) and a back-end server for gathering data by using the
manufacturers’ public APIs. In addition, a mobile application has been made for those
manufacturers (notably Apple and Samsung) whose data are only available through
SDKs provided by the manufacturers. Figure 1 shows the original mSpider
architecture.
1 Corresponding Author: Erlend Johannessen, Department of Computer Science, UiT The Arctic
University of Norway, PO Box 6050 Langnes, N-9037 Tromsø, Norway. E-mail: erlend.johannessen@uit.no.
Caring is Sharing – Exploiting the Value in Data for Health and Innovation
M. Hägglund et al. (Eds.)
© 2023 European Federation for Medical Informatics (EFMI) and IOS Press.
This article is published online with Open Access by IOS Press and distributed under the terms
of the Creative Commons Attribution Non-Commercial License 4.0 (CC BY-NC 4.0).
doi:10.3233/SHTI230286
In mSpider, the participants are enrolled in a study, after which they only need to
wear their smart watches to collect and share data. Their activity data are uploaded to
the device manufacturers’ respective clouds and then pulled to the mSpider back-end
server through the manufacturers’ APIs.
Figure 1. Original mSpider data collection architecture.
The programming language used for the mSpider back-end server was Go (go.dev,
open source). The admin front-end used Node.js (nodejs.org, open source) for serving
HTML, CSS, and JavaScript, and Angular (angular.io, open source) for creating the
user interface. All data were stored in a MongoDB (mongodb.com, US) database. Both
servers were run in a Docker (docker.com, US) environment using Docker-Compose
on a single Ubuntu Linux (canonical.com, UK) server.
The goal of this study is to examine the mSpider system, identify problems and
possible improvements, and to discuss an architecture capable of collecting data from a
large population, over an extended period of time.
A digital twin is defined as a digital representation of a physical entity. A human
digital twin may not be fully realisable at the moment, but there is potential for
creating small-scale human digital twins today by joining different types of data from
digital services and sources. The aim is to add more data types and more data
sources to the mSpider system, in order to approach a human digital twin system [5,6].
2. Methods
To analyse the state of the mSpider system and uncover problems, we interviewed
several experts and used the “think aloud” method in a qualitative study
approach [7], which included the following steps:
1. Questioning the researchers using mSpider on how they experienced the
system, and what they disliked or missed.
2. Discussing with the original developers and operational staff members behind
mSpider, outlining the decisions governing the development of the system.
3. Reviewing the source code of the system, to understand the functionalities and
state of the system.
E. Johannessen et al. / Health Research Requires Efficient Platforms
3. Results
Observations from researchers, developers (interviews and code review), and
operational staff are shown in Table 1. Priorities indicate the importance of the
respective improvement. Each issue is assigned to one type of actor, although some issues
concerned or had consequences for several or all actors. Issues given low priority are
not described in detail.
Table 1. Findings from researchers, developers, and operational staff, including code review. The rows are
coloured differently to separate actors’ observations from each other.
Actors | Observations | Priority
Operational staff | Security patches would be difficult to apply because of the way mSpider was developed and maintained | High
Operational staff | No assessment of threats or risks done on the system | High
Operational staff | Running system inside university network could theoretically put university network at risk | High
Operational staff | Every production update means deploying the full system. If something malfunctions the whole system may malfunction | Medium
Developers | Third-party components used got outdated and were no longer maintained and could not easily be replaced | High
Developers | Back-end server was a tightly coupled monolithic architecture | High
Developers | Back-end server was responsible for everything: participant consent dialogue, data collection, management, data extraction, batch runner for historical data | High
Developers | Device data were saved in the same storage, giving a format that did not suit all devices | Medium
Developers | Changing provider behaviour creates a new deployment of the full system | Low
Developers | Non-relational database may not be ideal for storage technology when there was need for combining several collections | Low
Developers | Different metadata were saved in the same storage collection | Low
Developers | Adding a new provider, initiated changes across the whole code base | Low
Researchers | A limited number of variables collected, e.g., daily step count, energy expenditure (kcal), and moderate or vigorous physical activity (PA) | Low / Medium
Researchers | Inefficient and cumbersome user interface | Low
Researchers | Only rudimentary data extraction from the system | Low
Researchers | Limited management functionality for study data | Low
Researchers | Limited management functionality for participant data | Low
Researchers | Data collection complexities with regards to when devices add their data to the manufacturer’s cloud | Low
Several issues were uncovered from code reviews and from talking to developers.
Device data from different manufacturers were saved into the same document storage,
giving a general format that did not suit different data from different manufacturers.
This required comprehensive mapping methods when reading from the different
collections, since the various manufacturers have different data models. This worked
for a proof-of-concept system with limited data collection but would become
too complex when expanding to more data variables. Another issue was the use of
third-party components in the source code. Using community code in your project is
normally not an issue, but problems arise when packages get outdated and are no
longer maintained. The problem may be a lack of security fixes, but also that the
package no longer provides the functionality it was meant to
provide. As an example, the Go package used for MongoDB database access did not
work with newer (and more secure) versions of MongoDB. This package was deeply
integrated with the code and could not be easily replaced.
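One common way to limit this kind of lock-in is to keep the database driver behind a small interface owned by the application, so that replacing the driver touches a single adapter. The following is a minimal sketch in Go under that assumption, not the mSpider code. The names `ActivityStore`, `DailyActivity`, and `memoryStore` are hypothetical, and the in-memory adapter only stands in for a real MongoDB-backed one.

```go
package main

import (
	"errors"
	"fmt"
)

// DailyActivity is a hypothetical, provider-neutral record.
type DailyActivity struct {
	ParticipantID string
	Date          string // ISO 8601 date, e.g. "2020-03-12"
	Steps         int
}

// ErrNotFound is returned when no record matches a lookup.
var ErrNotFound = errors.New("record not found")

// ActivityStore is the only storage API the rest of the code depends on.
// A driver-specific adapter implements it; callers never import the driver.
type ActivityStore interface {
	Save(rec DailyActivity) error
	Find(participantID, date string) (DailyActivity, error)
}

// memoryStore is an illustrative adapter that keeps records in a map.
// A real deployment would provide a database-backed adapter instead.
type memoryStore struct {
	records map[string]DailyActivity
}

func newMemoryStore() *memoryStore {
	return &memoryStore{records: make(map[string]DailyActivity)}
}

// key builds the lookup key for one participant and date.
func key(participantID, date string) string {
	return participantID + "|" + date
}

func (s *memoryStore) Save(rec DailyActivity) error {
	s.records[key(rec.ParticipantID, rec.Date)] = rec
	return nil
}

func (s *memoryStore) Find(participantID, date string) (DailyActivity, error) {
	rec, ok := s.records[key(participantID, date)]
	if !ok {
		return DailyActivity{}, ErrNotFound
	}
	return rec, nil
}

func main() {
	// Callers hold the interface, not the concrete adapter.
	var store ActivityStore = newMemoryStore()
	if err := store.Save(DailyActivity{ParticipantID: "p1", Date: "2020-03-12", Steps: 8421}); err != nil {
		fmt.Println("save failed:", err)
		return
	}
	rec, err := store.Find("p1", "2020-03-12")
	fmt.Println(rec.Steps, err)
}
```

With this arrangement, moving to a newer driver would mean writing one new type that satisfies `ActivityStore`, while the collection and extraction code stays unchanged.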
The back-end server (see Figure 1) was implemented as a monolith, which
normally is not a problem, but a tightly coupled monolithic system tends to end up as a
“big ball of mud” [8]. The server was responsible for everything, including data
collection from APIs, being a receiving API for the mobile mSpider clients (for Apple
and Samsung), being a management and data extraction system for researchers and
admin personnel, and a batch runner for gathering historical data from the device
providers. This is a lot of responsibility for a single server and makes the system difficult to maintain.
Operational staff were mainly concerned with security and deployment of the
mSpider system, and there were some problems with the solution running inside the
university network. Every production update meant deploying the full system.
Theoretically, if something malfunctioned the whole system could malfunction.
Security patches would be difficult to apply because of the way mSpider was
developed and maintained. From reviewing the mSpider project we found that several
security measures were implemented in mSpider, but there was no assessment of
threats or risks. Because of this, data were collected and stored anonymously. Running
the current system on a single Linux server on the university premises was a problem
with regards to security for mSpider but could also pose a problem for the university’s
security.
Based on the expert-interviews and the source code review, we identified several
requirements for a new system version. The new system should:
1) Be ready for production use in a professional health context.
2) Be scalable with regard to new devices.
3) Be scalable with regard to data volume.
4) Collect more diverse data or groups of data.
5) Make data interpretable and easily available to researchers from various
disciplines.
4. Discussion
Risk assessments are used to expose undesirable incidents in systems and evaluate
probability of occurrence. To be production-ready the new system needs a thorough
risk assessment, and the security needs to improve for the system not to risk leaking
collected data. The system also needs to be running continuously, so a stable set of
services is necessary. This is dependent on how the software architecture is
implemented, and some software principles are essential for this, among them
separation of concerns and extensive use of interfaces when using the relevant
programming languages.
Using a secure and capable runtime environment is key, and a suggestion for this is
running the solution in Microsoft Azure or another cloud computing service.
An important feature in the new system would be scalability with regards to new
devices. One way of solving this would be to create a service for each data provider so
that a change to one device provider’s API, access method, or collected data would
affect only a single service. This would also make it easy to add new providers to the system,
in that the new provider could be developed and tested in isolation from the other
services. Several services could also be added for each provider, so that the solution is
scalable with regards to gathering increasing amounts of data from the population.
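The provider-per-service idea can be illustrated with a small Go sketch. It is only an illustration under stated assumptions: `Provider`, `Sample`, `stubProvider`, and `collectAll` are hypothetical names, and the stubs run in-process, whereas the services discussed here would be separately deployed and would call the manufacturers' APIs.

```go
package main

import (
	"errors"
	"fmt"
)

// Sample is a hypothetical normalised measurement.
type Sample struct {
	Provider      string
	ParticipantID string
	Variable      string
	Value         float64
}

// Provider is the contract every provider service fulfils.
// Adding a provider means adding one implementation, nothing else.
type Provider interface {
	Name() string
	Fetch(participantID string) ([]Sample, error)
}

// errUnavailable simulates a provider whose API is down or has changed.
var errUnavailable = errors.New("provider unavailable")

// stubProvider stands in for a real provider service.
type stubProvider struct {
	name    string
	samples []Sample
	err     error
}

func (p stubProvider) Name() string { return p.name }

func (p stubProvider) Fetch(participantID string) ([]Sample, error) {
	if p.err != nil {
		return nil, p.err
	}
	out := make([]Sample, 0, len(p.samples))
	for _, s := range p.samples {
		s.Provider = p.name
		s.ParticipantID = participantID
		out = append(out, s)
	}
	return out, nil
}

// collectAll queries each provider independently. A failing provider is
// recorded in the failures map and does not stop the remaining providers.
func collectAll(providers []Provider, participantID string) ([]Sample, map[string]error) {
	var all []Sample
	failures := make(map[string]error)
	for _, p := range providers {
		samples, err := p.Fetch(participantID)
		if err != nil {
			failures[p.Name()] = err
			continue
		}
		all = append(all, samples...)
	}
	return all, failures
}

func main() {
	providers := []Provider{
		stubProvider{name: "provider-a", samples: []Sample{{Variable: "steps", Value: 8000}}},
		stubProvider{name: "provider-b", err: errUnavailable},
	}
	samples, failures := collectAll(providers, "p1")
	fmt.Println(len(samples), "samples collected,", len(failures), "provider(s) failed")
}
```

In the sketch, a failure in one provider is contained: the others still deliver data, which mirrors the isolation argument made for separate services.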
The storage system will be created such that data from each provider can be
stored in its own database. Allowing a separate database or
storage location for each provider would increase storage scalability, which
in turn opens the way for federated database technology [9].
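A rough Go sketch of per-provider storage with a single query entry point is given below. It is an assumption-laden illustration rather than a design from this study: `Federation`, `providerDB`, and `StepRecord` are hypothetical names, each `providerDB` is an in-memory stand-in for a provider's own database, and a real federated database system [9] would also have to handle schema differences and autonomy between the member databases.

```go
package main

import "fmt"

// StepRecord is a hypothetical record stored for one participant and day.
type StepRecord struct {
	ParticipantID string
	Date          string
	Steps         int
}

// providerDB stands in for the database that belongs to one provider.
type providerDB struct {
	records []StepRecord
}

func (db *providerDB) insert(r StepRecord) {
	db.records = append(db.records, r)
}

func (db *providerDB) byParticipant(id string) []StepRecord {
	var out []StepRecord
	for _, r := range db.records {
		if r.ParticipantID == id {
			out = append(out, r)
		}
	}
	return out
}

// Federation routes writes to the owning provider's database and answers
// reads by querying every provider database and combining the results.
type Federation struct {
	dbs map[string]*providerDB
}

func NewFederation() *Federation {
	return &Federation{dbs: make(map[string]*providerDB)}
}

// Insert stores the record in the database of the given provider,
// creating that database on first use.
func (f *Federation) Insert(provider string, r StepRecord) {
	db, ok := f.dbs[provider]
	if !ok {
		db = &providerDB{}
		f.dbs[provider] = db
	}
	db.insert(r)
}

// TotalSteps sums a participant's steps over all provider databases.
func (f *Federation) TotalSteps(participantID string) int {
	total := 0
	for _, db := range f.dbs {
		for _, r := range db.byParticipant(participantID) {
			total += r.Steps
		}
	}
	return total
}

func main() {
	f := NewFederation()
	f.Insert("provider-a", StepRecord{ParticipantID: "p1", Date: "2020-03-12", Steps: 5000})
	f.Insert("provider-b", StepRecord{ParticipantID: "p1", Date: "2020-03-13", Steps: 1200})
	fmt.Println(f.TotalSteps("p1"))
}
```

Callers see one query interface, while each provider's data stay physically separate and can be scaled or relocated independently.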
Data collected are intended to be heterogeneous and the system should store as
diverse and plentiful data as practically possible. The next version of mSpider should
be expanded so that it covers more data types (e.g., pulse, sleep, temperature, body
composition) and makes a low-resolution human digital twin [5] possible.
5. Conclusion
We have identified requirements for a production-ready, scalable, and flexible data
collection system for personal devices. One of the overarching goals for the new
mSpider system is to enable it for population-based research investigating potential
changes in a population’s lifestyle and health. This would imply continuous data
collection from the population to create data-driven analyses, working towards a
human digital twin [6]. Researchers should be able to create a research project by
initially setting some parameters for what data they want to be included in the study,
such as steps, heart rate, weight, and sleep. These data could be extracted daily into a
warehousing system [10]. This data collection system could give new opportunities for
public health research.
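As an illustration of how study parameters could drive a daily extraction, the Go sketch below filters collected observations by date and by the variables a study has selected. All names (`Study`, `Observation`, `NewStudy`, `ExtractDay`) and the variable labels are hypothetical, and loading the result into a warehousing system is left out.

```go
package main

import "fmt"

// Observation is a hypothetical collected data point.
type Observation struct {
	ParticipantID string
	Date          string
	Variable      string // e.g. "steps", "heart_rate", "weight", "sleep"
	Value         float64
}

// Study holds the variables a research project has chosen to include.
type Study struct {
	Name      string
	Variables map[string]bool
}

// NewStudy creates a study that includes the listed variables.
func NewStudy(name string, variables ...string) Study {
	set := make(map[string]bool, len(variables))
	for _, v := range variables {
		set[v] = true
	}
	return Study{Name: name, Variables: set}
}

// ExtractDay returns the observations recorded on the given date whose
// variable is included in the study, preserving the input order.
func (s Study) ExtractDay(date string, all []Observation) []Observation {
	var out []Observation
	for _, o := range all {
		if o.Date == date && s.Variables[o.Variable] {
			out = append(out, o)
		}
	}
	return out
}

func main() {
	study := NewStudy("example-study", "steps", "sleep")
	collected := []Observation{
		{ParticipantID: "p1", Date: "2023-05-01", Variable: "steps", Value: 9100},
		{ParticipantID: "p1", Date: "2023-05-01", Variable: "heart_rate", Value: 62},
		{ParticipantID: "p1", Date: "2023-05-02", Variable: "steps", Value: 7400},
	}
	for _, o := range study.ExtractDay("2023-05-01", collected) {
		fmt.Println(o.ParticipantID, o.Variable, o.Value)
	}
}
```

A scheduled job could call `ExtractDay` once per day and per study, so that each project receives only the variables it asked for.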
References
[1] Brickwood K-J, Watson G, O’Brien J, Williams AD. Consumer-Based Wearable Activity Trackers
Increase Physical Activity Participation: Systematic Review and Meta-Analysis. JMIR MHealth
UHealth. 2019;7:e11819. doi: 10.2196/11819.
[2] Gwaltney C, Coons SJ, O’Donohoe P, O’Gorman H, Denomey M, Howry C, Ross J, ePRO
Consortium. “Bring Your Own Device” (BYOD): The Future of Field-Based Patient-Reported
Outcome Data Collection in Clinical Trials? Ther Innov Regul Sci. 2015;49:783–791. doi:
10.1177/2168479015609104.
[3] Henriksen A, Mikalsen M, Woldaregay A, Muzny M, Hartvigsen G, Hopstock L, Grimsgaard S.
Using Fitness Trackers and Smartwatches to Measure Physical Activity in Research: Analysis of
Consumer Wrist-Worn Wearables. J Med Internet Res. 2018;20:e110. doi: 10.2196/jmir.9157.
[4] Henriksen A, Johannessen E, Hartvigsen G, Grimsgaard S, Hopstock LA. Consumer-Based Activity
Trackers as a Tool for Physical Activity Monitoring in Epidemiological Studies During the COVID-
19 Pandemic: Development and Usability Study. JMIR Public Health Surveill. 2021;7:e23806. doi:
10.2196/23806.
[5] Johannessen E, Henriksen A, Hartvigsen G, Horsch A, Årsand E, Johansson J. Ubiquitous digital
health-related data: clarification of concepts. Scand Conf Health Inform. 2022;52–58. doi:
10.3384/ecp187009.
[6] Miller ME, Spatz E. A unified view of a human digital twin. Hum-Intell Syst Integr. 2022;4:23–33.
doi: 10.1007/s42454-022-00041-x.
[7] Eccles DW, Arsal G. The think aloud method: what is it and how do I use it? Qual Res Sport Exerc
Health. 2017;9:514–531. doi: 10.1080/2159676X.2017.1331501.
[8] Big Ball of Mud [Internet]. [cited 2022 Nov 14]. Available from: http://www.laputan.org/mud/.
[9] Sheth AP, Larson JA. Federated database systems for managing distributed, heterogeneous, and
autonomous databases. ACM Comput Surv. 1990;22:183–236. doi: 10.1145/96602.96604.
[10] Chaudhuri S, Dayal U. An overview of data warehousing and OLAP technology. ACM SIGMOD
Rec. 1997;26:65–74. doi: 10.1145/248603.248616.