Managing Big Data for Scientific Visualization
Michael Cox and David Ellsworth
MRJ/NASA Ames Research Center
Microcomputer Research Lab, Intel Corporation
mbc@nas.nasa.gov, ellswort@nas.nasa.gov
1-May-97
Introduction
Many areas of endeavor have problems with big data. Some classical business applications have faced big
data for some time (e.g. airline reservation systems), and newer business applications to exploit big data are
under construction (e.g. data warehouses, federations of databases). While engineering and scientific
visualization have also faced the problem for some time, solutions are less well developed, and common
techniques are less well understood. In this section we offer some structure to understand what has been
done to manage big data for engineering and scientific visualization, and to understand and go forward in
areas that may prove fruitful. With this structure as backdrop, we discuss the work that has been done in
management of big data, as well as our own work on demand-paged segments for fluid flow visualization.
Our primary goal is to enable the scientist or engineer to extract information from his or her data. Many
authors begin with the assumption that interactivity is the most important goal of visualization systems (cf.
[1]). While we agree that interactivity is important, it is not always possible on big data sets. We
encounter practicing scientists and engineers whose single most important goal is to understand their data;
they are willing to live with off-line algorithms that give them information that interactive visualization
algorithms may not. Useful algorithms do not always run at interactive rates, especially on big data sets.
Thus, our focus is to maximize first the data set size that can be analyzed (and second maximize
performance), rather than first to maximize interactivity (and second maximize the data set size that can be
analyzed).
The problem of big data can be divided into two distinct problems:
Big data collections.
Big data objects.
We discuss these in greater detail below. Briefly, however, big data collections are aggregates of many
data sets. Typically the data sets are multi-source, are often multi-disciplinary, are generally distributed
among multiple physical sites, and are often multi-database (that is, they are stored in disparate types of
data repositories). At any one site, the size of the data may exceed the capacity of fast storage (disk), and
so data may be partitioned between tape and disk. Any single data object or data set within the collection
may be manageable by itself, but in aggregate, the problem is difficult. To accomplish anything useful, the
scientist must request information from multiple sources, and each such request may require tape access at
the repository. Over many scientists, the access patterns may not be predictable, and so it is not always
obvious what can be archived on faster storage (disk) and what must be off-loaded to tape. In addition,
there are the standard data management problems (but aggravated) of consistency, heterogeneous database
access, and of locating relevant data.
Big data objects are just that -- single data objects (or sets) that are too large to be processed by standard
algorithms and software on the hardware one has available. Clearly, a data object too large for one
installation may be manageable at another, and there are really two distinct cases:
Some data objects are so large that they do not fit on the largest workstation commercially available
(and in some cases do not fit on the largest supercomputer currently installed)
Even smaller data objects may be substantially larger than can be handled by the hardware to which
the “typical” engineer or scientist has access.
Big data objects typically result from computer simulation of physical phenomena -- Fluid Dynamics
(CFD), Structural Analysis, Weather Modeling, etc. The size of a big data object is often the same as the
size of the memory on the largest supercomputer to which a scientist or engineer has access. In an era of
parallel distributed-memory supercomputers, this may be huge. Big data objects today are typically single-
source and single-disciplinary.
These problems of big data collections and big data objects are those that confront scientists and engineers
now. While neither problem has entirely satisfactory solutions even today, we would like to develop
systems to manage big data collections of big objects. For example, there is a project underway at NASA
Ames Research Center to promote the development of an Integrated Design System (IDS). The goals of
the IDS are to integrate the activities of multiple engineering disciplines, allowing greater communication,
iteration, and refinement among the disciplines. Since each discipline generates and must analyze big
objects, the IDS will require management of big collections of big data.
Below, we first discuss the problems of big data collections, and suggest areas of research that may bear on
them. We then introduce the problems of big data objects. Following this, the remainder of this section
focuses on solutions for big data objects.
Big data collections
Today, big data collections typically arise in fields with acquired data, as from remote sensors and satellite
imaging. These fields include the atmospheric sciences, geophysical sciences, and health and medical
sciences. However, as discussed, requirements are moving to big collections of big (and smaller) objects,
as with the IDS.
Big data collections present a number of problems for data management:
Data are generally distributed among multiple sites.
Data generally reside in collections of heterogeneous databases (each the repository for data acquired
or processed at that site).
There are generally incompatible data interfaces and representations; data are generally not self-
describing.
There is often no platform-independent definition of the data types in the underlying data, and the
relationship between them. Such a data model facilitates the construction of higher-level tools to use
the data.
The meta-data that facilitate discovery and use are also often not stored with the data (e.g. where were
the data collected, when, what calibration applied, what are the units, etc). The meta-data may (and
probably should) also include compressed and/or condensed representations of the underlying data;
such compressed representations enable browsing of a large collection.
These problems conspire to make locating the data difficult. Visualization can serve an important role
in data location, in particular by compressing summary information in a format that can be visually
understood quickly (e.g. as TileBars do for document retrieval [2]).
The storage requirements for the data may be quite large, requiring partitioning between disk and tape.
There may be poor locality in the queries for the data (since, for example, requests may be to diverse
variables measured at arbitrary times). Any particular request is more likely to require data from tape
than from disk.
The raw bandwidths required to satisfy requests may be quite large for any actively used collection.
These include bandwidths from tape to disk, from disk to memory, from memory to network, and
across the network.
A usage model that is common for big collections is that of browsing and ordering products. In this
model, a scientist may browse the data collection, often with a tool that visualizes the underlying data. The
compressed visual representation may simply be a thumbnail of the underlying 2D imagery, or it may be a
representation of the results of a query on the data (as, for example, TileBars [2]). After the scientist has
determined that some data sets are of interest, he or she orders the data. This initiates the flow of data from
(potentially) tape to disk, across the network, and to the scientist’s workstation.
The Earth Observing System (EOS) is an instructive example of the problem of big collections. The goal
of EOS is to provide a long-term repository of environmental measurements (e.g. satellite images at various
wavelengths; about 1,000 parameters are currently included in the requirements) for long-term study of
climate and of earth’s ecosystems. The data are intended to be widely available not only to scientists, but
to the general public. It is envisioned that EOS can create a market for value-added products that reduce
the data to something understandable to the general public.
Thus, EOS must acquire, archive, and disseminate large collections of distributed data, as well as provide
an interface to the data. The EOS data interface clearly must be sufficiently well defined (and rich) that a
wide range of products can be written that combine and synthesize the data in arbitrary ways. Estimates for
the data volume that must be acquired, processed, and made available are from 1 to 3 Tbytes/day. These
data arrive in the form of individual data objects that vary from about 10 to 100 Mbytes (with an average of
about 50 Mbytes), and are acquired and processed by about 10 Distributed Active Archive Centers
(DAACs). The average request rate internally at one site today is estimated to be 20 files/minute; however,
it is painfully obvious that when EOS data become available to a broader base of scientists (not to mention
to the general public), the request rate will be arbitrarily high. The EOS is chartered to maintain a 15-year
data archive.
It is clear that to support projects like EOS, we will need not only new software and tool development, but
also fundamental research:
Tools to develop, maintain, and extend data models are essential. These tools must be
applicable both to data and to the meta-data. Prior work and pointers to additional literature on data
modeling for scientific visualization can be found in [3, 4, 5, 6, 7, 8, 9, 10, 19]. Some interesting work
to build visualization systems directly on database management systems (DBMSs) can be found in
[11,12,13].
Proper definition of the meta-data is important. A path into this literature can be found in [14].
Federated databases and data warehousing algorithms (cf. [15, 16]) should be explored for their
application to big collections. Although these have been primarily investigated in the context of
business applications, many of the underlying problems are the same, in particular the problems of
unifying disparate data models, maintaining consistency, and supporting autonomous databases while
enabling common access. A very simplistic view is that data warehouses are federated databases with
caching. From this viewpoint, results from these areas may be useful in evaluating the efficacy of
caching data objects from large collections.
Big data objects
Big data objects typically are the result of large-scale simulations in such areas as Computational Fluid
Dynamics (CFD), Structural Analysis, Weather Modeling, and Astrophysics. These simulations typically
produce multi-dimensional data sets. There is generally some 3-dimensional grid or mesh on which data
values are calculated. Often, data values are calculated at some number of time steps in the simulation.
Sometimes the grid or mesh structure itself also changes with the time steps.
For example, in CFD a common data structure is a set of curvilinear grids, with computed values at the
vertices of the grids. The grids themselves are curved regular lattices, bent to conform to the structures
around which flow is calculated (e.g. a wing). Multiple grids may be required to conform to all of the parts
of the surface under study (e.g. the wing or fuselage). CFD simulations often calculate unsteady flow, in
which case there are solutions at multiple time steps; similarly, in CFD simulations, the grids themselves
may move and/or change shape (e.g. in order to conform to flap movement).
Data objects (in particular the results of simulation) in general present the following problems:
Data modeling -- the multi-dimensional data structures required are not adequately supported by
current database technologies. In addition there are no standardized models of the data structures
required, even within a single discipline (or rather, there are so many standards to choose from). As a
result, visualization codes typically must handle multiple file formats and data representations.
Data model evolution – Data models evolve. For example, in CFD alternative grid structures are an
active area of research.
Big data objects in particular present these additional problems:
Data management -- there is generally not a clean division between the data models (where they exist)
and data management. As a result, most of the data models are simply file formats, with no provision
for special handling of data sets that do not fit in main memory.
Data too big to be memory-resident -- often a single data object does not fit in main memory. For
example, in CFD, even older time-varying data objects are on the order of 10 Gbytes. Newer studies
produce data objects on the order of 100 to 200 Gbytes. Even a single time step today can be 1 Gbyte,
with studies underway that will produce time steps approaching 2 Gbytes. With data this large, it is
not possible to rely on operating system virtual memory to manage the discrepancy between data set
size and physical memory size. Thus, special action must be taken.
Data too big for local disk (perhaps even for remote disk) -- clearly, not only do some of these data
objects not fit in main memory, but they do not even fit on local disk on most workstations. In fact, the
largest CFD study of which we are aware is 650 Gbytes [17], which would not fit on centralized
storage at most installations!
Bandwidth and latency -- with data this large, and the requirement to find alternatives to operating
system virtual memory, the bandwidths and latencies between the data store and main memory must be
managed carefully.
Data models in general are fairly well studied. There are in fact many
proposals and working systems that hide the underlying representation (and potentially even data
management) from the application [3, 4, 5, 6, 7, 8, 9, 10, 11,12,13, 19].
Solutions to the problems specific to big data objects have also been explored. They are covered in greater
detail in subsequent sections, but briefly can be categorized by three general approaches:
Moving the computation to the data rather than moving the data to the computation.
Stenciling the algorithm to the data (for locality of reference). To take advantage of segments or pages
(or of subsets of the data that are streamed through memory), visualization algorithms must generally
be stenciled to the data access patterns so that data are loaded as few times as possible. In general this
is a process by which the algorithm and/or code are recast for locality of reference.
Segmentation and paging. When not all data can fit in main memory (or perhaps even on local disk),
the data may be loaded and utilized in segments, or pages.
The combined problem – big collections of big objects
In many areas of science, there is a trend toward combining simulation with experimentation in order to
validate simulated results and to guide experiments by simulation. An example of this trend can be seen in
the Advanced Development and Test Environment (ADTE) project at NASA Ames. One goal of this
project is to use CFD simulation results to guide the experiments that are performed in NASA’s wind
tunnels. CFD simulations will be performed to explore the parameter space of interest for wind tunnel
tests, and these results will be used to guide (and constrain) the parameters of the experiments. After the
experiments are conducted, multi-source analysis will compare expected and observed results, and will be
used to calibrate both subsequent CFD simulations and the experimental test equipment.
In addition, there is a trend even within disciplines toward exploring large parameter spaces by
simulation. Each simulation may generate a large data object; over a large parameter space many such
objects are generated. As individual disciplines develop the ability to do such parameter space searches
(e.g. as supercomputers and their codes become capable of generating many such objects in shorter periods
of time), we will find more big collections of big objects.
A layered model of data visualization
Treinish has suggested that the visualization/data problem can be modeled as layers of functionality [3], as
is typically done with networking protocols (cf. [19]). Treinish’s model comprises three layers
(visualization algorithms, data model, and data storage). We extend Treinish’s three layers to the four
shown in Figure 1.
[Figure 1 shows four layers, from top to bottom: Visualization Algorithms; Data Model(s); Data Management;
Distributed Data Storage.]
Figure 1. Layered architecture of visualization and data management.
In this architecture, each layer only interacts with an adjacent layer that is responsible for providing
guaranteed services defined by an API. The data model layer is responsible for presenting an API that
supports data types and data values in a platform- and machine-independent fashion. A client (such as a
visualization library) of the data model layer need not be aware of internal representations of those data
values, nor should the client be aware of details of data storage or retrieval. The data model layer is
responsible for translations and reformatting necessary to support the common API representation. The
data management layer is responsible for managing the flow of data into and out of main memory, either
from local disk or possibly from remote disk (or even tape). The distributed data storage layer is
responsible for providing an API (or APIs) that move data among distributed machines, disks, and
potentially tapes. An advantage of thinking of the big data problem in this way is that it allows data
management algorithms to be decoupled from the issues of data model and data representation, and from
the precise implementation of distributed data movement and storage.
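To make the layering concrete, the following is a minimal sketch, in C, of what the interfaces between the
layers might look like. It is our own illustration rather than an API from any of the systems cited below; all
of the names are hypothetical.

/* Hypothetical layer interfaces, top to bottom.  Each layer calls only the layer
 * directly below it. */

/* Data model layer: platform-independent, typed access to field values.
 * A visualization algorithm sees only this interface. */
typedef struct Field Field;              /* opaque handle */
Field *dm_open(const char *dataset, const char *variable);
float  dm_value(Field *f, int i, int j, int k, int t);   /* grid indices, time step */
void   dm_close(Field *f);

/* Data management layer: moves byte ranges between storage and memory
 * (e.g. by demand paging); knows nothing about grids or variables. */
typedef struct Store Store;
Store *dmgr_attach(const char *object_name);
int    dmgr_read(Store *s, long offset, long nbytes, void *buffer);
void   dmgr_detach(Store *s);

/* Distributed data storage layer: locates the object on local disk, a remote
 * server, or tape, and returns raw bytes. */
int    storage_fetch(const char *object_name, long offset, long nbytes, void *buffer);

The point of the exercise is that the visualization algorithm sees only dm_value(), and can remain ignorant
of whether the bytes behind it came from local memory, local disk, or a remote server.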
Many data models have been proposed. Some have literally been file formats [20]. Others have been
somewhat higher-level, but have still been defined as interfaces to file formats [21, 22]. Most of the
remaining models [4, 5, 7, 8, 10] have been based on the dataflow model of programming. In the dataflow
model, the programmer specifies nodes, each of which performs computation. A node has an input data
type and an output data type, and the user may connect nodes with directed arcs to form a directed
computation graph. The system is responsible for scheduling the computation, and for performing the data
movement implied by the arcs of the graph.
Data management has not specifically been identified as a layer of computation in visualization systems
before, and so new work has not had a common framework for discussion. However, a number of
approaches have been explored. In the remainder of these notes, we will focus on the data management
layer, in particular focusing on the problem of managing individual big data objects.
Data management: big data objects
To reiterate, the problems specific to big data objects are:
Data too large for local memory.
Data potentially too large for local disk.
Data potentially too large for the largest remote disk.
Insufficient bandwidth and long latency.
These are problems that are not new to scientific data or scientific visualization. There is a wealth of work
from other fields that can be applied, in particular from the areas of operating systems and computer
architecture (cf. [23, 24]), databases (cf. [25]), and distributed systems (cf. [19]). In particular, the
problem of too much data for too small a memory (or disk) is the problem of memory hierarchy. Not all
memory and storage are created alike. The classic pyramid of memory capacity and speed is shown in
Figure 2. Levels closer to the top of the pyramid have the least capacity but the fastest access; levels at the
bottom have the greatest capacity but the slowest access. In a distributed system, the memory hierarchy
becomes more complicated, with not only local memory, disk, and tape, but network connections (of
various speeds) to remote memories, disks, and tapes of various capacities and speeds.
If the data are too large to fit into memory, then there are really only two possible solutions:
Find another memory into which the data can fit, or
Load and use the data in pieces.
Exactly the same choices exist when the data cannot fit entirely on local disk, and when they cannot fit
entirely on the largest remote disk available.
The first of these is a degenerate application of the principle: move the computation to the data. The
second may be thought of from at least two points of view: stencil the algorithm to the data for locality of
reference, or segment and/or page the data. These three approaches are discussed in the following
subsections.
[Figure 2 shows the hierarchy, from top to bottom: Registers; L1 cache; L2 cache; Local memory; Local disk;
Local tape.]
Figure 2. Classic memory hierarchy on a uniprocessor.
Moving the computation to the data
If the data do not fit in local memory (or if the bandwidth for the transfer is insufficient), move the
computation to the data. There are primarily two approaches in this vein that have been employed:
Find or buy a large enough machine that can hold the data on disk and in core, and move the
computation to that machine; this is the approach taken, for example by [26].
Partition the visualization so that traversal is done on a machine with sufficient disk and memory to
hold the data, calculate synthetic geometry on that machine, and download this geometry over a fast
local network to the local workstation for visualization. This scheme is shown in Figure 3.
Examples of this approach can be found in [1, 27]. This scheme has several disadvantages. It is primarily a
time-sharing approach and suffers the same disadvantages that time-sharing did 20 years ago:
Response is subject to time-sliced availability of the supercomputer. If the supercomputer runs in
dedicated mode, the user may have to wait arbitrarily long for access.
One may not have the requisite access to a supercomputer, or may not want to spend one’s
supercomputer budget for interactive jobs.
It may not work for data objects generated on larger supercomputers, or that are the concatenated
results of several runs.
On the other hand, this approach does provide today the best interactive performance for extremely large
data objects.
[Figure 3 shows a supercomputer holding the big data object in remote memory and on fast disk; its CPU(s)
traverse the data, and the geometry generated by the supercomputer is sent over the network to a workstation
for display.]
Figure 3. Moving the computation to the data (using a supercomputer for data traversal).
Stenciling for locality of reference
As an alternative to buying a bigger machine or using a supercomputer, one may look for opportunities to
stencil the algorithm onto the available resources. This was done by the Unsteady Flow Analysis Toolkit
(UFAT), which was designed to process only adjacent time steps in an unsteady flow (and thus to require only
a few, rather than potentially thousands of, large data objects in memory at a time) [28].
This also is the idea behind replacement of coarse-grained nodes in a dataflow system with fine-grained
nodes [29]. With dataflow approaches to visualization, it has been recognized that the usual coarse-grained
style of programming leads to large memory requirements, since the nodes of the graph may require all the
data to be memory resident before they can execute. To solve this problem, Song and Golin developed
fine-grained nodes for isosurface extraction and one volume rendering algorithm. They showed that these
fine-grained nodes could execute on subsets of the data, and discard them from memory before moving on
to other subsets [29]. This allowed execution with significantly smaller memory requirements. Pang has
confirmed these results with a fine-grained approach to isosurface extraction in his dataflow environment
[10].
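The essence of the fine-grained approach can be sketched as follows. This is not Song and Golin's code; it
is a minimal illustration, with an assumed raw file layout and a trivial "crossing count" standing in for real
isosurface extraction, of how an algorithm can stream a volume two planes at a time and discard data as it
goes.

/* Stream a raw NX x NY x NZ volume of floats one slab (pair of adjacent k-planes)
 * at a time, so only two planes are ever resident instead of the whole volume.
 * File layout, dimensions, and the per-cell test are illustrative assumptions. */
#include <stdio.h>
#include <stdlib.h>

#define NX 128
#define NY 128
#define NZ 128

/* Read one k-plane of NX*NY floats from the raw volume file. */
static int read_plane(FILE *fp, int k, float *plane)
{
    if (fseek(fp, (long)k * NX * NY * sizeof(float), SEEK_SET) != 0) return -1;
    return fread(plane, sizeof(float), NX * NY, fp) == (size_t)(NX * NY) ? 0 : -1;
}

int main(int argc, char **argv)
{
    if (argc != 3) { fprintf(stderr, "usage: %s volume.raw isovalue\n", argv[0]); return 1; }
    FILE *fp = fopen(argv[1], "rb");
    if (!fp) { perror("fopen"); return 1; }
    float isovalue = (float)atof(argv[2]);

    float *lower = malloc(NX * NY * sizeof(float));
    float *upper = malloc(NX * NY * sizeof(float));
    long crossings = 0;

    read_plane(fp, 0, lower);
    for (int k = 1; k < NZ; k++) {
        read_plane(fp, k, upper);              /* only two planes are resident */
        for (int j = 0; j < NY; j++)
            for (int i = 0; i < NX; i++) {
                float a = lower[j * NX + i], b = upper[j * NX + i];
                /* an edge crosses the isosurface if the two values straddle it */
                if ((a < isovalue) != (b < isovalue)) crossings++;
            }
        float *tmp = lower; lower = upper; upper = tmp;   /* discard the old plane */
    }
    printf("%ld edges cross the isovalue\n", crossings);
    free(lower); free(upper); fclose(fp);
    return 0;
}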
Segmentation and paging
There are classical “systems” approaches to memory hierarchy that can productively be brought to bear on
big data objects in scientific visualization. In order to discuss their relevance, we require some
definitions. When more data reside on disk than can be loaded into physical memory, we must employ
virtual memory of some sort. Virtual memory is simply the mapping of a larger memory space onto a
smaller physical space. This mapping is typically done on chunks of contiguous data, and when any
address in a virtual chunk is required, the chunk is loaded from disk into physical memory. If all chunks of
physical memory are in use, then some chunk already in physical memory must be identified, possibly
written to disk, and overwritten by the new chunk that is required.
When the chunks are of variable size, they are referred to as segments (a segment was often defined as a
basic block in the code, a function, or an individual array). When chunks are of fixed size, they are referred
to as pages. Older systems tended to support segments; newer systems ubiquitously support
pages. Paged segment systems have also been explored. If a segment or page is loaded when it is required
(in particular when some portion of it is accessed), the system is said to be demand-driven (e.g. demand
paged). If something is known about the behavior of the application, then it is sometimes possible to
prefetch pages (or segments) before they are needed, so that the application need not wait for the seek time
to the disk and the subsequent data transfer. Since the disk controller itself can be viewed as an additional
processor, prefetching can achieve some degree of parallelism between the application and the transfer of
the page (or segment) from disk to memory. When this parallelism is achieved, it is said that the algorithm
is pipelined with respect to disk reads. Finally, we can define an interval of time relevant for a particular
application (say, the interval between time steps in an unsteady flow visualization), and count the number
of pages (or segments) accessed by the application. The number of such pages (or segments) required is
referred to as the application’s working set during that interval. The term working set is also sometimes
used to refer to the behavior of the (smaller) working set over the lifetime of the application.
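A working set is easy to measure once all data references go through a common accessor. The following is
a minimal sketch of such instrumentation; the page size and the routine names are our own choices, not part
of any of the systems discussed here.

/* Minimal working-set instrumentation: mark each page touched during an interval
 * (e.g. one time step) and report how many distinct pages that was. */
#include <stdio.h>
#include <string.h>

#define PAGE_BYTES  (32 * 1024)
#define MAX_PAGES   (1 << 20)

static unsigned char touched[MAX_PAGES];   /* one flag per page */
static long          pages_in_interval;

/* Call this from the accessor that services every data reference. */
void note_access(long byte_offset)
{
    long page = byte_offset / PAGE_BYTES;
    if (!touched[page]) {
        touched[page] = 1;
        pages_in_interval++;
    }
}

/* Call this at the end of each time step. */
void end_interval(int timestep)
{
    printf("time step %d: working set = %ld pages (%ld Mbytes)\n",
           timestep, pages_in_interval,
           pages_in_interval * PAGE_BYTES / (1024 * 1024));
    memset(touched, 0, sizeof(touched));
    pages_in_interval = 0;
}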
Operating system-controlled demand paging
We have heard the claim that since today’s operating systems (OS) support virtual memory, it is already
possible to visualize big data objects that do not fit in physical memory, simply by leaving the paging to the
OS. We find this belief quite surprising. When an algorithm has not been designed for locality of
reference, and its virtual memory exceeds physical memory by even 1% [30], it generally thrashes. When
a system thrashes, it spends more of its time replacing pages in physical memory with new pages from disk
than it does accomplishing real work. By now, thrashing has been well-known and documented for
perhaps 20 years (cf. [23, 24]). We have observed and documented thrashing in UFAT with particle
tracing in data objects too large for main memory ([31] and discussed below), and Ueng has documented
thrashing in CFD visualization algorithms on unstructured grids too large for physical memory [32]. If
there are visualization systems that serendipitously escape thrashing on data objects significantly larger
than main memory, we are not aware of them.
There are a number of reasons that leaving the paging to the OS cannot result in the best performance, and
at worst can result in applications that are orders of magnitude slower:
Page replacement policies – ideally the page that should be replaced when a new page must be loaded
is the page “that will not be accessed for the longest time.” Since this is clearly not possible without
detailed knowledge of the application, the typical OS chooses the least recently used (LRU) page
among all pages. This may or may not be the correct choice for a given application at a given time.
Valuations of pages – a corollary of naïve replacement policies is that the OS does not know which
pages are “precious” (regardless of the time of their most recent access), and which pages are
“disposable”. For example, recently used scratch space (that will not be used again) will be spared by
LRU replacement, while pages of critical values that will be used over and over again are evicted.
Self-limiting applications – there is extremely poor information available to an application that does
want to stay within the bounds of physical memory. Many programmers know exactly how to limit
their use of physical memory, if only they knew what limit to set on their maximum consumption. This lack
of information is exacerbated by the dynamic nature of free memory – information should be available
not only at start-up time but dynamically during the lifetime of the application.
Implementation choices – some sub-optimal behaviors of today’s OS implementations are not inherent
in OS-controlled virtual memory, but are sometimes the result of choices made by the OS developers.
For example, many implementations mark memory pages read from disk as requiring write-back to
disk if they are replaced (i.e. as dirty). This means that a block read from disk to memory, and
unmodified by the application, is written to swap space if it is replaced by another page. Since the
page could be re-read from the original file instead, this represents an extra disk write.
There have been efforts to export more control over paging to the application. Perhaps the most visible of
these has been the set of memory mapping routines available in some Unix implementations: mmap(),
mpin(), msync(), and munmap(). While these are a start, they do not provide sufficient semantics. In
particular, there is still no way for the application to determine the amount of physical memory that is
available. This makes it inadvisable to use mpin() to lock significant numbers of pages in order to force the
issue. And in any event there is still no application control over the page replacement policy. In addition,
the file semantics of mmap() do not support application control over non-file segments (such as computed
fields).
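For concreteness, the following sketch shows the style of code these routines support: a solution file is
mapped read-only and then simply indexed. mlock() is the closest POSIX relative of the mpin() mentioned
above; the file name and the trivial traversal are only illustrative.

/* Map a data file read-only with the standard Unix calls and traverse it.
 * Nothing here tells the application how much physical memory is free, nor
 * lets it influence which pages will be replaced. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s solution_file\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    /* Map the whole file; pages are faulted in on first touch. */
    float *data = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* mlock() pins pages in physical memory, but without knowing how much
     * memory is actually available it is easy to pin too much and starve the
     * rest of the system -- the problem discussed above. */
    /* mlock(data, st.st_size); */

    double sum = 0.0;
    long n = st.st_size / sizeof(float);
    for (long i = 0; i < n; i++)        /* every touch may fault in a page */
        sum += data[i];
    printf("sum of %ld values = %g\n", n, sum);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}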
The Mach operating system provided some level of application control over the OS’s paging policies [33],
but Mach is not widely available today as a platform for visualization. A very interesting set of semantics
for application control over paging was added to the V system [30], but V is also not widely deployed.
Application-controlled segments
In the absence of control over virtual memory page behavior, many applications manage physical memory
utilization by defining and managing their own segments (although most papers on such applications do not
explicitly identify their disk-to-memory management as segmentation). There is often a natural “segment” in many
visualization algorithms. For example, a single time step in an unsteady CFD visualization might be a
segment (as it is implicitly in UFAT [28]). Ueng has specifically targeted visualization of unstructured grid
CFD data sets by partitioning space into octrees, and loading sub-trees from disk to memory specifically on
demand [32]. In both cases, the authors were able to visualize much larger data objects than could fit in
main memory.
In a somewhat different visualization domain, Funkhouser explicitly used segmentation to explore
architectural databases at interactive rates [34]. He partitioned space using a variant of k-D trees, and
defined segments as collections of objects contained within the sub-trees. Off-line, he calculated from each
node the other nodes and objects that were potentially visible. At run-time, he used this information, plus
dead reckoning based on the observer’s direction of travel, to prefetch the potentially visible segments. He
was able to visualize at interactive rates a database roughly 10x the size of main memory. While these
results, and the techniques they suggest, are of interest, the differences with respect to scientific
visualization should be explicitly noted:
With synthetic imagery, data traversal is driven by direction of travel of a viewer; in scientific
visualization data traversal is driven by the visualization algorithm (and is generally unrelated to the
viewer) and geometry is not generated until after traversal.
With synthetic imagery, data that will not be needed can be explicitly culled by fairly well-known
algorithms; in scientific visualization, it is not yet clear which data can be culled and which data cannot
be culled (and in any event is visualization algorithm-specific).
The sizes of the biggest synthetic data are significantly smaller than those encountered in scientific
visualization. In particular, Funkhouser worked on a “large” database of 1 Gbyte. From the point of
view of scientists who today must cope with 100+ Gbyte data, that’s not big data!
In summary, application-controlled segmentation to manage the use of physical memory has been a
successful technique for visualizing big data objects. The precise mapping of segment to the underlying
data, the replacement policy, and the implementation with respect to a given visualization algorithm, of
course require thought on the part of the developer. However, we believe that segmentation will become a
fundamental approach for visualizing and analyzing extremely large data objects.
Application-controlled demand-paged segments
Segments may themselves be paged. This may be necessary if the application programmer has chosen to
map a segment to an object that may itself be larger than physical memory. UFAT for example implicitly
maps each solution file to a segment, and there are single steady-flow solutions at Ames today that are
about 1 Gbyte in size; clearly segmentation is not a complete solution. Paging the segments may also be
desirable if the visualization algorithm makes sparse use of the underlying data. We say more about this in
the section titled “Parsimonious traversal”.
The Common Data Format (CDF) library [21] implements a simple form of demand-paged segments for
earth sciences data (e.g. time, latitude, longitude, plus variables such as temperature, pressure, etc). In our
terminology, CDF maps a segment to each file, and independently demand pages each of these segments.
In CDF, a cache is associated with every open file. The size of the cache can be configured (dynamically)
by the application. When the application references a variable from the underlying file, CDF first looks in
its cache; if the underlying page is not present, CDF retrieves it from disk. If the cache for that file is full,
CDF chooses a victim for replacement. Since CDF allocates a cache with each file, the working set (i.e.
total memory in use) grows with the number of segments (files opened). CDF does allow the application to
change the size of a file’s cache, but since the application most likely does not keep track of its previous
access patterns, it does not have sufficient information (say) to reduce the cache on the file that has been
least recently used. We are unaware of studies on CDF that explore alternative page sizes, replacement
policies, and data storage and organization, and so cannot address the trade-offs in demand-paged segments
for earth sciences data.
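The per-file cache described above might be sketched as follows. This is our own illustration of the scheme,
not CDF's implementation; the structure names, the page size, and the LRU policy are assumptions made for
the sake of the example (error handling is omitted).

/* A per-file page cache: each open file gets its own fixed number of cache
 * slots; a reference loads the page on a miss and evicts the least recently
 * used slot when the cache is full. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_BYTES 8192

typedef struct {
    long page_no;        /* which page of the file, -1 if the slot is empty */
    long last_used;      /* logical clock for LRU */
    char data[PAGE_BYTES];
} Slot;

typedef struct {
    FILE *fp;
    int   nslots;        /* cache size, configurable per file */
    Slot *slots;
    long  clock;
} CachedFile;

CachedFile *cf_open(const char *path, int nslots)
{
    CachedFile *cf = malloc(sizeof *cf);
    cf->fp = fopen(path, "rb");
    cf->nslots = nslots;
    cf->slots = calloc(nslots, sizeof(Slot));
    for (int i = 0; i < nslots; i++) cf->slots[i].page_no = -1;
    cf->clock = 0;
    return cf;
}

/* Return a pointer to the page containing byte_offset, loading it on a miss. */
char *cf_page(CachedFile *cf, long byte_offset)
{
    long page_no = byte_offset / PAGE_BYTES;
    int victim = 0;
    for (int i = 0; i < cf->nslots; i++) {
        if (cf->slots[i].page_no == page_no) {              /* hit */
            cf->slots[i].last_used = ++cf->clock;
            return cf->slots[i].data;
        }
        if (cf->slots[i].last_used < cf->slots[victim].last_used)
            victim = i;                                     /* remember the LRU slot */
    }
    /* Miss: replace the least recently used slot with the requested page. */
    fseek(cf->fp, page_no * PAGE_BYTES, SEEK_SET);
    memset(cf->slots[victim].data, 0, PAGE_BYTES);
    fread(cf->slots[victim].data, 1, PAGE_BYTES, cf->fp);
    cf->slots[victim].page_no = page_no;
    cf->slots[victim].last_used = ++cf->clock;
    return cf->slots[victim].data;
}

Because each open file carries its own cache, the total memory in use grows with the number of open files,
which is precisely the working-set behavior noted above.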
In the Data Analysis group at NASA Ames Research Center, we have implemented a library for demand-
paged segments and have explored its behavior when used by fluid flow visualization software on very
large data objects [31]. We built upon UFAT’s use of segments, to support visualization of larger data
objects with even less memory. As already mentioned, UFAT defines a segment whenever it encounters a
grid (a mapping of a regular lattice to a curvilinear 3D mesh), or whenever it encounters a solution (the
parameters at a single time step at the nodes of the grid). The original UFAT sequentially opens and
processes the segments required for sequential time steps (generally two at a time) before moving on to
subsequent time steps. It reads each segment it requires into main memory before processing, and if the
total data required for the time step exceed physical memory, UFAT implicitly relies on OS virtual
memory.
We have modified UFAT by changing its access of grids and solutions to be through a demand-paging
library we have developed. When UFAT attempts to open a grid or solution and read the data entirely into
memory, we instead create a segment and initially read no data into memory. When the modified
UFAT attempts to access the data (with a multi-dimensional array reference, typically of the form
data[i][j][k][parameter]), we trap the reference, load the appropriate page if it is not already in physical
memory, and return the desired data. Thus, we have added a data management layer between the data
model (multi-dimensional array access), and data storage.
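The trap itself amounts to replacing the array reference with an accessor of roughly the following form. This
is a sketch only; page_lookup() stands in for the demand-loading machinery of the library and is hypothetical
as written.

/* Route a multi-dimensional array reference through the data management layer:
 * instead of indexing a resident array, compute the element's linear index and
 * ask page_lookup() for the page that holds it (loading it from disk on a miss). */
typedef struct Segment Segment;                    /* one grid or one solution */
extern float *page_lookup(Segment *seg, long element_index, long *offset_in_page);

/* Replaces what was originally data[i][j][k][parameter]. */
float value_at(Segment *seg, int i, int j, int k, int parameter,
               int ni, int nj, int nk, int nparams)
{
    /* Linearize the reference exactly as the original in-memory layout would. */
    long element = (((long)i * nj + j) * nk + k) * nparams + parameter;
    long offset;
    float *page = page_lookup(seg, element, &offset);  /* demand-loads on a miss */
    return page[offset];
}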
We have retained the original UFAT definition of segment (i.e. a segment is either a grid or solution for a
single time step).3 At start-up, we declare two pools of free pages, one for grids, one for solutions. When
UFAT accesses a grid and the underlying page is not in core, we allocate a page from the grid pool, and
load the underlying data from disk. If the grid pool is exhausted, we choose a victim from among all grid
pages (by LRU), and proceed as before. Similarly for the solution pool. When a segment is freed by
UFAT, we return all pages from that segment to the appropriate pool.
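A sketch of the two-pool arrangement follows. It is an illustration of the design just described, not the
library's source; the structure names and list representation are ours, and write-back is omitted because the
pages are read-only.

/* Two fixed-size page pools, one for grid pages and one for solution pages.
 * A miss takes a frame from the free list of the appropriate pool, or evicts
 * the least recently used frame in that pool; freeing a segment returns its
 * frames to the pool.  Pages are read-only, so eviction needs no write-back. */
#include <stddef.h>

typedef enum { POOL_GRID, POOL_SOLUTION } PoolType;

typedef struct Frame {
    struct Frame *next;        /* free-list / LRU-list linkage */
    void  *owner_segment;      /* which segment currently holds this frame */
    long   page_no;
    char   data[1];            /* the page's bytes actually follow here */
} Frame;

typedef struct {
    Frame *free_list;          /* frames not holding any page */
    Frame *lru_head;           /* resident frames, least recently used first */
} Pool;

static Pool pools[2];          /* sized once at start-up to bound memory use */

/* Get a frame for a new page of the given type.  The pools are populated at
 * start-up, so at least one of the two lists is never empty. */
Frame *frame_for_new_page(PoolType type)
{
    Pool *p = &pools[type];
    Frame *f;
    if (p->free_list) {                  /* a free frame is available */
        f = p->free_list;
        p->free_list = f->next;
    } else {                             /* evict within this pool only, so grid  */
        f = p->lru_head;                 /* pages never steal solution frames and */
        p->lru_head = f->next;           /* vice versa                            */
    }
    return f;
}

/* When UFAT frees a grid or solution, all of its frames go back to the pool. */
void release_segment(PoolType type, void *segment)
{
    Pool *p = &pools[type];
    Frame **link = &p->lru_head;
    while (*link) {
        if ((*link)->owner_segment == segment) {
            Frame *f = *link;
            *link = f->next;             /* unlink from the LRU list */
            f->owner_segment = NULL;
            f->next = p->free_list;      /* push onto the free list  */
            p->free_list = f;
        } else {
            link = &(*link)->next;
        }
    }
}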
This scheme has several properties that have enabled out-of-core visualization. First, since we employ
fixed-sized pools (over all segments) we ensure that memory utilization does not grow beyond physical
memory. Second, since we define segment types (grid or solution), we use more than the last access time
to guide our page replacement policy. This is important because simple LRU over all data (as employed by
the operating system) could, for example, steal for a solution exactly the grid page required for the next
solution access. Third, the implementation of a data management layer allows the data model access
(multi-dimensional array reference) to be explicitly decoupled from the underlying storage. In particular,
we have used this feature to explore alternative storage strategies for multi-dimensional data; the results of
these investigations are discussed in greater detail below. Fourth, we enhance the effective use of segments
that UFAT already employed. Finally, many visualization algorithms do not traverse all of the data in a
given segment. Demand paging can reduce the aggregate size of the data that must be transferred from disk
to physical memory. This is discussed in greater detail below.
3 In practice, our current implementation binds a segment to each zone within the grid or solution; however,
our current implementation does not take advantage of this finer-grain mapping and for all practical
purposes these segments behave as a group that is mapped to the grid or solution.
Out-of-core visualization
The highest-level results are shown in Table 1. In these runs, pages were formatted in “cubes” of the
underlying 3D data (as opposed to “flat” slices of the underlying data). Because our implementation
explicitly places a data management layer between the data model and data storage, we were able to
explore the tradeoff between “cubed” pages and “flat” pages. For each data set, we ran both the original
UFAT and our modified (paged) UFAT on machines constrained to less than 1 Gbyte, and 128, 64, 48, and
32 Mbytes of physical memory.4 The total physical memory required per time step for each of the data sets
was: 94.3 Mbytes for the F18, 32.3 Mbytes for the Shuttle, and 4 Mbytes for the Tapered cylinder. Total
data size (all time steps) was: 10,170.6 Mbytes for the F18, 32.3 Mbytes for the Shuttle, and 251.5 Mbytes
for the Tapered cylinder. Thumbnails of the visualization studies of these data sets are shown in Figures 8
through 10 (which appear following the references).
                                          Memory Capacity (Mbytes)
Model              UFAT version     1,024      128       64       48       32
F18                original           552    9,754
                   paged            1,043    1,166
Shuttle            original                     9.3     11.6     18.3     47.1
                   paged                       12.6     12.4     14.4     17.5
Tapered cylinder   original                                               140
                   paged                                                  166
Table 1. Total run time for the three data sets, original UFAT and paged UFAT, on machines
constrained to the shown memory capacities; times are in seconds.
Several conclusions are clear from this table:
It is sub-optimal at best to rely on the operating system for paging data objects too large for physical
memory.
Demand-paged segmentation can be a viable strategy for out-of-core visualization of large data objects
that exceed the capacity of physical memory. Note in particular that paged UFAT adapts gracefully as
the available physical memory is progressively reduced.
The reader may note that our demand-paged segments did incur a performance cost when the data set fits
entirely in physical memory. We have profiled our system and found that by far the largest inefficiency is
in our translation from array references to physical storage on disk. That is, our implementation of data
management is fairly expensive. We believe that with in-lining and common sub-expression elimination,
we can reduce this cost so that demand-paged UFAT is at parity in performance with the original UFAT
even when the data set does fit entirely in physical memory.5
4 The Tapered Cylinder and Shuttle were all run on an R10000 SGI with 128 Mbytes of memory; smaller
machines were simulated by mpin()ing the requisite physical memory and thereby making it unavailable to
any application. The F18 runs were all on an R10000 SGI with 1 Gbyte of main memory; smaller
configurations were similarly simulated with mpin().
5 UFAT relies upon many routines written in FORTRAN. In order to implement the data management
layer, we were forced to replace array arguments to these routines with pointers to functions. When these
routines access a multi-dimensional array, our translation functions are called to map the array reference to
a physical page either in memory or on disk. Because this function cannot be aware of its call environment,
it performs a great deal of recomputation between array references that differ by only one in some
dimension.
Storage strategy and page size
Multi-dimensional scientific data are often stored in row- or column-order (e.g. [20]). That is, they are
stored first linearly along one dimension, and then in planes. However, multi-dimensional scientific data
tend to be accessed more coherently, in particular as the result of traversal through 3-space. In volume
rendering, it is well known that storage in “cubes” results in more efficient access than “flat” storage in
planes (cf. [35]).
We have verified that when CFD data size exceeds the capacity of physical memory, cubed storage
generally results in better performance.6 Our results also show empirically that the best performance is
generally achieved with cubes of size 8x8x8 cells. Selections from these results are shown in Figures 4
through 6.
6 There are of course anomalous cases in which streaklines or streamlines follow along curvilinear planes
because of the nature of the underlying CFD data.
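The difference between the two storage strategies amounts to how a grid index (i, j, k) is mapped to a page
number. The following sketch (our notation, with illustrative dimension arguments) contrasts the two
mappings; a streakline that wanders a short distance through 3-space touches far fewer cubed pages than
flat ones.

/* "Flat" pages hold consecutive pieces of i-j planes in storage order;
 * "cubed" pages hold B x B x B blocks of the volume. */

/* Flat layout: elements in plane-major storage order, cut into fixed-size pages. */
long flat_page(int i, int j, int k, int ni, int nj, long elements_per_page)
{
    long element = ((long)k * nj + j) * ni + i;     /* plane-major order */
    return element / elements_per_page;
}

/* Cubed layout: the volume is partitioned into B x B x B blocks, and each block
 * is one page (8x8x8 was generally best in the experiments above). */
long cubed_page(int i, int j, int k, int ni, int nj, int B)
{
    int bi = i / B, bj = j / B, bk = k / B;
    int nbi = (ni + B - 1) / B, nbj = (nj + B - 1) / B;
    return ((long)bk * nbj + bj) * nbi + bi;
}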
[Figures 4 through 6 plot total run time in seconds for flat versus cubed page storage, at page sizes of 4, 8,
16, and 32.]
Figure 4. F18, with memory constrained to 128 Mbytes. Cubed vs. flat page storage, for page sizes from
4x4x4 to 32x32x32.
Figure 5. Shuttle, with memory constrained to 64 Mbytes. Cubed vs. flat page storage, for page sizes from
4x4x4 to 32x32x32.
Figure 6. Shuttle, with memory constrained to 32 Mbytes. Cubed vs. flat page storage, for page sizes from
4x4x4 to 32x32x32.
Sparse traversal
We should expect from first principles that many of the algorithms of fluid flow analysis (particle tracing,
streaklines, streamlines, cutting planes, etc) need only traverse a subset of the entire data for any particular
visualization. If we assume that traversal of each cell (say) of a solution results in the generation of
geometry, the visualization would be incomprehensible if every cell were traversed – because 3-space
would be too filled with geometry for the viewer to see anything! We have in fact found this in UFAT’s
traversal of data. The percentages of total blocks touched by UFAT are shown in Table 2. An example
working set is shown in Figure 7. In this graph, the working set required by UFAT to particle trace the
flow is shown. As can be seen, UFAT’s working set always stays below 35% of total blocks. The erratic
behavior of the working set is worth some discussion. When a particle in UFAT transitions from one zone
to the next, UFAT performs a relatively inefficient search for the particle’s new location in an adjacent
zone. When it does this, it accesses more blocks. This shows that while UFAT’s working set during any
time step is less than about a third of the total blocks in the data set, there is an opportunity (by more
efficient searching) to reduce the working set even further!
Model              Percentage of total blocks touched
F18                 6.73%
Shuttle            17.4%
Tapered cylinder   23.1%
Table 2. Percentage of blocks touched by UFAT on visualization of cubed 8x8x8 data (calculated as the
average fraction of blocks touched over all time steps).
[Figure 7 plots, for time steps 50 through 300, the fraction of bytes touched (0.0 to 0.3) in the grid and in the
solution.]
Figure 7. Working set for F18 for 8x8x8 cubed data.
Out-of-core visualization from remote disk
Not only may a big data object not fit in local memory; it may not even fit on local disk. We have begun to
explore the viability of a distributed architecture in which a file server provides pages of big data objects to
smaller local workstations. To do this, we have employed demand-paged UFAT on a mid-range
workstation, with the F18 stored on a remote server accessible via the Network File System (NFS). Our
initial results are promising, and are shown in Table 3.
UFAT version   Computation performed                     F18 data stored                      Total execution, mins.
Original       Fast remote server                        Fast disk of same remote server        9.6
Original       Local workstation, 128 Mbytes of memory   Fast remote disk of remote server    360
Paged          Local workstation, 128 Mbytes of memory   Fast remote disk of remote server     46
Table 3. Results of demand paging from remote file server.
While it is clearly faster to visualize the F18 on the fastest workstation available (the server on which these
experiments were conducted is one of the fastest servers in the Data Analysis group at NASA Ames), not
every practicing engineer or scientist has unrestricted access to such a machine. In addition, not all
installations even have such machines. These results suggest that it may be possible to deploy visualization
tools to a much
broader customer base of practicing engineers and scientists by storing large data objects centrally, and by
remote demand-driven paging to local workstations. The experiments reported above were done with NFS
over 10 Mbit Ethernet. Faster networks should make this distributed approach more viable.
Other areas of exploration
There are a number of areas that we believe are worth exploring in application-controlled segmentation and
in application-controlled demand paging of these segments.
Page prefetching. Neither of the two out-of-core visualization efforts of which we are aware ([32] and
ours) takes advantage of prefetching. While Funkhouser’s results with prefetching polygon data are
encouraging [34], scientific data are (as previously discussed) different. We are exploring several
possible algorithms for prefetching scientific data.
Alternative allocation of pool types, and of partitioning available physical memory among them.
Application of demand-paged segments to other visualization domains.
Further exploration of distributed architectures, and of the partitions between computation and data; in
particular, broader exploration of the opportunities of paging, prefetching, and caching from remote
servers.
Parsimonious traversal
As previously discussed, the CFD visualization algorithms we have studied have only sparsely traversed
the data. Where algorithms traverse only a portion of the data, there is opportunity for visualizing larger
data objects. Finding the algorithm with the most parsimonious traversal is an important area of investigation
for visualization of extremely large data objects.
We believe that this sparse traversal is inherently a property of most visualization algorithms, and will be
discovered as their behavior is studied further. We believe that there are few algorithms for which sparse
traversal “is not possible”. Consider the basis for this bold claim. Most visualization algorithms produce
geometry based on local features of the underlying data set. For example, the triangles generated for any
given cell during isosurface extraction depend on only a few cells at most (more than one cell only if
higher-order methods are employed); particle tracing in CFD
visualization depends on only local cells to calculate flow velocity and direction. This means that the
algorithm need not traverse the entire data set to generate geometry for the features of interest (since their
calculation does not require global information). An algorithm that only uses local information but does
traverse the entire data set would generate geometry for every cell – and this would be visually
incomprehensible. Of course, the algorithm must find the features of interest, and there are algorithms that
perform a global search to do so. Marching cubes is an example of this [36]. However, even this canonical
example of an algorithm that must traverse the entire data set to find the features of interest has been further
developed for more parsimonious traversal [37].
Acknowledgements
The authors would like to thank Michael Gerald-Yamasaki, David Kenwright, Sam Uselton, and David
Korsmeyer for useful discussions regarding visualization systems and directions.
References and bibliography
1. B. Hibbard and B. Paul, “Case Study #4: Examining Data Sets in Real Time, VIS-5D and VIS-AD for
Visualizing Earth and Space Science Computations,” ACM SIGGRAPH ’94 Course #27, Visualizing
and Examining Large Scientific Data Sets: A Focus on the Physical and Natural Sciences,
SIGGRAPH ’94, July 1994.
2. M. A. Hearst, “TileBars: Visualization of Term Distribution Information in Full Text Information
Access,” Proceedings of CHI ’95, Denver, CO, May 1995.
3. L. A. Treinish, “Solution Techniques for Data Management, the Visual Display and the Examination
of Large Scientific Data Sets,” Supercomputing ’95 Course, Visualizing and Examining Large
Scientific Data Sets: A Focus on the Physical and Natural Sciences, Supercomputing ’95, December
1995.
4. B. Lucas, G. D. Abram, N. S. Collins, D. A. Epstein, D. L. Gresh, and K. P. McAuliffe, “An
Architecture for a Scientific Visualization System,” Proceedings IEEE Visualization ’92, October
1992, pp. 107-113.
5. W. Hibbard, C. R. Dyer, and B. Paul, “Display of Scientific Data Structures for Algorithm
Visualization,” Proceedings of IEEE Visualization ’92, October 1992, pp. 139-146.
6. R. B. Haber, B. Lucas, and N. Collins, “A Data Model for Scientific Visualization with Provisions for
Regular and Irregular Grids,” Proceedings of Visualization ’91, October 1991, pp. 298-305.
7. D. S. Dyer, “A Dataflow Toolkit for Visualization,” IEEE Computer Graphics and Applications, Vol.
10, No. 4, July 1990, pp.60-69.
8. C. Upson, et al., “The Application Visualization System: A Computational Environment for Scientific
Visualization,” IEEE Computer Graphics and Applications, Vol. 9, No. 4, July 1989, pp. 30-42.
9. J. C. French, A.K. Jones, and J. L. Pfaltz, eds. Scientific Database Management (Panel Reports and
Supporting Material), Report of the Invitational NSF Workshop on Scientific Database Management,
Charlottesville, VA, March 1990, Technical Report 90-22, August 1990, Department of Computer
Science, University of Virginia.
10. A. Pang and N. Alper, “Mix&Match: A Construction Kit for Visualization,” Proceedings of IEEE
Visualization ’94, October 1994, pp. 302-309.
11. J. P. Lee and G. G. Grinstein, “An Architecture for Retaining and Analyzing Visual Explorations of
Databases,” Proceedings of IEEE Visualization ’95, October 1995, pp. 101-108.
12. M. Stonebraker, J. Chen, N. Nathan, C. Paxson, A. Su, and J. Wu, “Tioga: A Database-Oriented
Visualization Toolkit,” Proceedings of IEEE Visualization ’93, October 1993, pp. 86-93.
13. P. Kochevar, Z. Ahmed, J. Shade, C. Sharp, “Bridging the Gap Between Visualization and Data
Management: A Simple Visualization Management System,” Proceedings of IEEE Visualization ’93,
October 1993, pp. 94-101.
14. Proceedings of the First IEEE Metadata Conference, April 1996. The proceedings are available online
at http://www.computer.org:80/conferen/meta96/meta_home.html.
15. A. P. Sheth and J. A. Larson, “Federated Database Systems for Managing Distributed, Heterogeneous,
and Autonomous Databases,” ACM Computing Surveys, Vol. 22, No. 3, 1990, pp. 183-236.
16. J. Widom, “Research Problems in Data Warehousing,” Proceedings of the 4th International Conference
on Information and Knowledge Management (CIKM), November 1995.
17. D. C. Banks and B. A. Singer, “Vortex Tubes in Turbulent Flows: Identification, Representation,
Reconstruction,” Proceedings of IEEE Visualization ’94, October 1994, pp. 132-139.
18. S. T. Bryson, D. Kenwright, and M. J. Gerald-Yamasaki, “FEL: The Field Encapsulation Library,”
Proceedings of IEEE Visualization ’96, October 1996, pp. 241-247.
19. A. S. Tanenbaum, Computer Networks, 3rd edition, Prentice Hall, Englewood Cliffs, NJ, 1996.
20. P. Walatka and P. Buning, PLOT3D User’s Manual, Version 3.6, NASA Technical Memorandum
101067, NASA Ames Research Center, 1989.
21. National Space Science Data Center, CDF User’s Guide, Version 2.4, NASA/Goddard Space Flight
Center, February 1994.
22. National Center for Supercomputing Applications, HDF User’s Manual, V4.1r1, University of Illinois
at Urbana-Champaign, November 1990.
23. A. Silberschatz, J. Peterson, P. Galvin, Operating System Concepts, 3rd edition, Addison-Wesley,
Reading, MA, 1991.
24. K. Hwang and F. A. Briggs, Computer Architecture and Parallel Processing, McGraw Hill, New
York, NY, 1984.
25. H. Korth and A. Silberschatz, Database System Concepts, McGraw Hill, New York, NY, 1986.
26. S. Bryson and C. Levit, “The Virtual Wind Tunnel,” IEEE Computer Graphics and Applications, Vol.
12, No. 4, July 1992, pp. 25-34.
27. Computational Engineering International, Inc., EnSight: Advanced Visual and Quantitative
Postprocessing for Computational Analysis, product literature, CEI, PO Box 14306, Research Triangle
Park, NC 27709, 1996.
28. D. Lane, “UFAT: A Particle Tracer for Time-Dependent Flow Fields,” Proceedings of Visualization
’94, October 1994, pp. 257-264.
29. D. Song and E. Golin, “Fine-Grain Visualization Algorithms in Dataflow Environments,” Proceedings
of Visualization ’93, October 1993, pp. 126-133.
30. K. Harty and D. R. Cheriton, “Application-Controlled Physical Memory using External Page-Cache
Management,” Proceedings of the Fifth International Conference on Architectural Support for
Programming Languages and Operating Systems, September 1992, pp. 187-197.
31. M. Cox and D. Ellsworth, “Application-Controlled Demand Paging for Out-of-Core Visualization,”
NAS Technical Report NAS-97-010, NASA Ames Research Center, April 1997.
32. S. K. Ueng, K. Sikorski, and K. L. Ma, “Out-of-Core Streamline Visualization on Large Unstructured
Meshes,” ICASE Report No. 97-22, Institute for Computer Applications in Science and Engineering,
NASA Langley Research Center, April 1997.
33. D. McNamee and K. Armstrong, “Extending the Mach External Pager Interface to Allow User-Level
Page Replacement Policies,” Technical Report 90-09-05, University of Washington, September 1990.
34. T. A. Funkhouser, Database and Display Algorithms for Interactive Visualization of Architectural
Models, Ph.D. dissertation, University of California at Berkeley, 1993.
35. U. Neumann, “Parallel Volume-Rendering Algorithm Performance on Mesh-Connected
Multicomputers,” Proceedings of the 1993 Parallel Rendering Symposium, San Jose, CA, October
1993, pp. 97-104.
36. W. Lorensen and H. Cline, “Marching Cubes: A High-Resolution 3D Surface Construction
Algorithm,” ACM Computer Graphics (Proceedings of SIGGRAPH ’87), Vol. 21, No. 4, 1987, pp.
163-169.
37. T. Itoh and K. Koyamada, “Isosurface Generation by Using Extrema Graphs,” Proceedings of IEEE
Visualization ’94, October 1994, pp. 77-83.
Figure 8. F18. Concatenated frames from unsteady flow particle trace animation.
Figure 9. Shuttle. Single frame from steady flow streaklines.
Figure 10. Tapered cylinder. Concatenated frames from unsteady flow particle trace animation.