Optimization of request processing times for a
heterogeneous data aggregation platform
Victoria Tokareva
Karlsruhe Institute of Technology, Institute for Astroparticle Physics, 76021 Karlsruhe,
Germany
E-mail: victoria.tokareva@kit.edu
Abstract. A heterogeneous data aggregation system, such as the one developed within the framework of the GRADLC project, allows for flexible expansion by connecting new data storages, and provides researchers with fast, aggregated access to heterogeneous data from independent (astroparticle physics) projects, while reducing the load on the original data storages. However, this flexibility requires balancing user requests in the queue with respect to the varying request processing times of the distributed storages, taking into account the different data processing policies of each particular storage. To address this problem, a mathematical model of the data aggregation system was developed, and approaches to optimizing the request ordering in the processing queue are proposed and investigated in a numerical experiment. Based on these results, a job-shop scheduling algorithm was identified which yields shorter mean request processing times than the well-known first-in, first-out (FIFO) model.
1. Introduction
Over the last decade, open access to data for interdisciplinary research has become a relevant trend in science all over the world [1, 2, 3, 4]. The globalization of science leads to an exchange of experience and ideas between different fields of knowledge, and allows us to expand the horizons of our understanding of processes of different origins by extending the range of methods and techniques for obtaining scientific knowledge. At the same time, researchers of interdisciplinary subjects continue to face problems such as unsystematic approaches to storing and providing data, or the lack of a single interface for data access [2, 5, 6, 7].
Figure 1. Aggregator interaction with storages: a data aggregation server mediates between users and storages 1 through S.
Currently existing approaches for working with distributed data storages [8, 9, 10] imply that the remote storages use a single file system (for example, HDFS or CVMFS) as well as a unified software stack for data processing, which requires significant costs and changes to the streamlined workflow of remote data centers.
In the German-Russian Astroparticle Data Life Cycle (GRADLC) [6] project we are developing an alternative approach: a system that treats remote storage servers as black boxes, while an aggregator carries the load of handling integrated user queries (see the general scheme in Fig. 1). Similar to the aforementioned solutions [8, 9], our system uses metadata to optimize the processing time of requests.
To further improve the quality of the request service, mathematical modeling of the system behavior is employed in order to identify the most productive heuristics for reducing the waiting time in the queue. In this paper, we consider a static model for processing user requests in a system with an aggregator and several storages in the case of a simple user behavior scenario, described in section 2.
2. Description of the system operation and the mathematical model
Let users of the system send requests $R_j(s, p)$ to the aggregator, where $j \in [0, \ldots, J]$ is the number of the request in the system, $s_i$, $i \in [1, \ldots, S]$ is the remote storage identifier, and $p \in \mathbb{R}^n$ are the data selection parameters of the request. Further on, the user application goes through the fundamental stages of data processing shown in Fig. 2.
Figure 2. Processing cycle for a user request $R_j(s, p)$: request registration, UUID retrieval, request to the remote storage, data fetching, and data archiving, after which the result is returned.
We will consider the case when one user is interested in the data of only one storage, i.e. when $s$ is a scalar. For such a case, one can create $S$ parallel queues for execution and process $R_j(s, p)$ requests for different storages in parallel, as shown in detail in Fig. 3.
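As an illustration, the per-storage queueing described above could be sketched as follows. This is a minimal sketch, not part of the GRADLC codebase; the names Request and StorageQueues are hypothetical, and the only assumption is one execution queue per storage identifier.

```python
from dataclasses import dataclass
from collections import deque
from typing import Tuple

@dataclass
class Request:
    """A user request R_j(s, p): request index j, storage id s, selection parameters p."""
    j: int                  # number of the request in the system
    s: int                  # remote storage identifier (scalar in this simple scenario)
    p: Tuple[float, ...]    # data selection parameters, p in R^n

class StorageQueues:
    """S parallel execution queues, one per remote storage."""
    def __init__(self, num_storages: int):
        self.queues = [deque() for _ in range(num_storages)]

    def enqueue(self, request: Request) -> None:
        # Route the request to the queue of the storage it addresses.
        self.queues[request.s].append(request)

    def next_for_storage(self, s: int) -> Request:
        # Requests for different storages are processed in parallel,
        # so each storage worker pops only from its own queue.
        return self.queues[s].popleft()
```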
It should be noted that this case is not representative of the main class of problems solved by the aggregator, since the purpose of creating such systems is to provide users with joint information from different sources. However, developing a model for this simple case allows us to look at the relationships between objects in the system and highlight the essential patterns of the processes that occur, which can later be supplemented and extended to account for more realistic use cases.
We will rely on the fact that the criteria $p$ of the user request $R_j(s, p)$ are encapsulated in it and are not explicitly known to the scheduler. The execution times of the processing steps of the request shown in Fig. 3 are likewise not immediately known to the scheduler. They depend linearly on the number of entries $n_{ij}$ in the remote storage $s$ that correspond to the request parameters $p$, and the scheduler may need to perform time-consuming computations to estimate these numbers.

Figure 3. Principal stages of processing the request $R_j(s, p)$ by the aggregator: the metadata database (MDDB) resolves the UUIDs for $(s_i, p)$; if the request addresses storage $s_i$ it is placed into queue $i$, otherwise it goes to another queue; fetching and post-processing then return the result of request $R_j(s, p)$ in $T_{ij}$ seconds.
As an example, let us consider the individual stages of request processing for some of the remote storages connected to the GRADLC system (Fig. 4).
Figure 4. Dependencies of the execution times on the number of requested records $n_{ij}$ for various request processing stages: a) the time $t^{ij}_s$ of query processing by the metadata database as a function of the number of requested records $n_{ij}$ (UUID fetching times for the KASCADE remote storage), approximated by a linear equation on a logarithmic scale; b) the time $t^{ij}_f$ of retrieving the requested records as a function of their number $n_{ij}$ (KASCADE and Tunka-133 data storages), approximated by a linear equation on a logarithmic scale; c) the time $t^{ij}_a$ of post-processing a query on the aggregator side as a function of the number of requested records $n_{ij}$ (KASCADE and Tunka-133 data storages), approximated by a linear equation on a logarithmic scale.
Thus, we can conclude that the functions (1) can be defined for the considered system of aggregated information collection from distributed storages:

$$t_s(n_{ij}) = \nu \cdot n_{ij}, \quad \nu \in \mathbb{R},$$
$$t_f(n_{ij}) = \mu_i \cdot n_{ij}, \quad \mu = (\mu_1, \ldots, \mu_S) \in \mathbb{R}^S,$$
$$t_a(n_{ij}) = \tau \cdot n_{ij}, \quad \tau \in \mathbb{R}. \qquad (1)$$
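To make the linear time models (1) concrete, the following sketch evaluates them for assumed coefficients; the values of $\nu$, $\mu_i$ and $\tau$ below are illustrative placeholders, not measured GRADLC constants (in practice they would be fitted to measurements such as those in Fig. 4).

```python
# Linear time models from Eq. (1); all coefficients are assumed values.
NU = 1e-5                       # metadata-query time per record, seconds
MU = [8e-5, 1.2e-4]             # per-storage fetch time per record, seconds
TAU = 9e-5                      # aggregator post-processing time per record, seconds

def t_s(n_ij: int) -> float:
    """Metadata database processing time, t_s = nu * n_ij."""
    return NU * n_ij

def t_f(n_ij: int, i: int) -> float:
    """Record retrieval time from remote storage i, t_f = mu_i * n_ij."""
    return MU[i] * n_ij

def t_a(n_ij: int) -> float:
    """Post-processing time on the aggregator side, t_a = tau * n_ij."""
    return TAU * n_ij
```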
Let us introduce the procedure $C$, which maps vectors from the space of parameters $p$ to $n_{ij} \in \mathbb{R}$. This procedure allows performing a preliminary calculation of the execution time of the request $R_j(s, p)$, which is necessary for further optimization of the queue ordering. The execution time $t_c$ of the procedure $C$ is non-uniformly distributed in $[0, T_c]$. When planning the request execution order, the call to the $C(p)$ procedure creates overhead costs that must be compensated for by the reduction in queue waiting time, which may not be achievable when the number of requests to the system is small.
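A minimal sketch of what such an estimation procedure might look like is given below; the metadata-count callable count_matching_records is hypothetical, and the point is only that $C(p)$ itself takes a time $t_c$ that the scheduler must pay before it can rank the request.

```python
import time

def C(p, storage_id: int, count_matching_records) -> tuple:
    """Estimate n_ij for request parameters p against one storage.

    count_matching_records is a hypothetical callable that asks the
    metadata catalog how many records match p; the duration of this
    call is the overhead t_c discussed in the text.
    """
    start = time.monotonic()
    n_ij = count_matching_records(storage_id, p)
    t_c = time.monotonic() - start   # overhead of the estimation itself
    return n_ij, t_c
```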
CSP2020
Journal of Physics: Conference Series 1740 (2021) 012058
IOP Publishing
doi:10.1088/1742-6596/1740/1/012058
4
Based on the dependencies described above and the request processing model shown in Fig. 3, we obtain the following factors for calculating the expected waiting time of a request in the system:
(0) performing the estimation $C(p) = n_{ij}$, $t_c \sim \mathrm{unif}(0, T_c)$;
(1) the waiting time of the request in the queue, $t^{ij}_q$;
(2) initializing a query to the metadata database, $t_{in} \sim \mathrm{unif}(0, \Theta_{in})$;
(3) processing of the request by the metadata database, $t^{ij}_s = \nu \cdot n_{ij}$, $\nu \in \mathbb{R}$;
(4) the time of retrieving the requested records from a remote storage, $t^{ij}_f(n_{ij}) = \mu_i \cdot n_{ij}$, $\mu = (\mu_1, \ldots, \mu_S) \in \mathbb{R}^S$;
(5) the request post-processing time on the aggregator side, $t^{ij}_a(n_{ij}) = \tau \cdot n_{ij}$, $\tau \in \mathbb{R}$.
Here steps 3–5 can be performed in parallel in blocks as part of a single request, while steps 0–3 must be performed sequentially. From Fig. 4 we see that the UUID fetching time for a single event is about an order of magnitude smaller than the fetching and post-processing of one event. The contention that is most expensive in terms of time and system resources occurs when multiple requests to the same remote storage are processed.
Thus, the execution time of the request $R_{ij}$ is:

$$T_{ij} = t_c + t^{ij}_q + t_{in} + n_{ij} \cdot \max(\nu, \mu_i, \tau) \qquad (2)$$
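The expected execution time from Eq. (2) could be computed as in the following sketch; $T_c$ and $\Theta_{in}$ are assumed bounds for the uniformly distributed overheads, and the per-record coefficients are passed in from the (assumed) models of Eq. (1).

```python
import random

def execution_time(n_ij: int, t_q: float, nu: float, mu_i: float, tau: float,
                   T_C: float = 0.05, THETA_IN: float = 0.01) -> float:
    """Expected execution time T_ij of one request, following Eq. (2).

    t_q is the accumulated waiting time in the queue; nu, mu_i, tau are
    the per-record coefficients from Eq. (1).
    """
    t_c = random.uniform(0.0, T_C)        # estimation overhead, step (0)
    t_in = random.uniform(0.0, THETA_IN)  # metadata-query initialization, step (2)
    # Steps 3-5 run in parallel blocks, so the slowest per-record stage
    # dominates the n_ij-dependent part of the time.
    return t_c + t_q + t_in + n_ij * max(nu, mu_i, tau)
```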
From (2) we have:

$$t^{ij}_q = \sum_{d=1}^{j-1} T_{id}, \quad j = \overline{2, J}, \quad T_{i1} \in \mathbb{R} \qquad (3)$$
We are looking for a request processing schedule such that:

$$\sum_{j=1}^{J} t^{ij}_q \to \min, \quad i \in [0, S] \qquad (4)$$
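For a single queue, objective (4) rewards finishing short jobs early: by Eq. (3), each processing time is counted once for every job queued behind it. A short derivation of this fact, in the notation of Eqs. (2)–(4), is given below; it is the standard argument behind the shortest-processing-time rule and motivates the Priority strategy investigated in section 3.

```latex
% Total waiting time in queue i: substituting Eq. (3) and swapping the
% order of summation shows that the job at position d is counted (J - d)
% times, once for every job queued behind it:
\sum_{j=1}^{J} t^{ij}_q
  = \sum_{j=2}^{J} \sum_{d=1}^{j-1} T_{id}
  = \sum_{d=1}^{J-1} (J - d)\, T_{id}
% The weights (J - d) decrease with d, so the sum is minimized by placing
% the requests with the smallest T_{id} at the earliest positions, i.e. by
% ordering jobs by increasing (estimated) execution time.
```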
3. Simulations
Known strategies [11] for distributing jobs in a queue are:
(i) FIFO (first in, first out).
(ii) LIFO (last in, first out).
(iii) Capacity: the approach is based on establishing sub-queues for “large” and “small” tasks. It assumes concurrent access of several requests to the storage, which leads to overhead costs and slows down the search and retrieval of data in our case.
(iv) Fairness: a model assuming parallel processing of several requests and processing of requests with interruptions, which is unacceptable within the framework of the system under consideration.
(v) Priority: a priority of the job is evaluated based on known parameters and the jobs are ranked by the assigned priority. There are various ways to assess the priority of a task; in this paper, ranking by task execution time was considered (see the sketch below).
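A minimal sketch of the comparison performed here is given below, under assumptions that are not the exact GRADLC simulation parameters: job execution times are drawn from an exponential distribution, the priority rule ranks jobs by estimated execution time (shortest first), and each estimation is charged a uniform overhead $t_c$ as in step (0), while FIFO pays no estimation cost.

```python
import random

def total_waiting_time(durations, t_c_bound=0.0):
    """Sum of waiting times when jobs run in the given order.

    An optional per-job estimation overhead (uniform in [0, t_c_bound])
    models the cost of calling C(p) before scheduling.
    """
    waited, elapsed = 0.0, 0.0
    for d in durations:
        waited += elapsed                          # this job waits for all earlier ones
        elapsed += d + random.uniform(0.0, t_c_bound)
    return waited

random.seed(42)
jobs = [random.expovariate(1.0) for _ in range(1000)]        # assumed durations, seconds

fifo = total_waiting_time(jobs)                              # arrival order, no estimation
priority = total_waiting_time(sorted(jobs), t_c_bound=0.05)  # shortest-first, pays t_c

print(f"FIFO total wait:     {fifo:12.1f} s")
print(f"Priority total wait: {priority:12.1f} s")
```

With many jobs the shortest-first ordering wins by a wide margin despite the estimation overhead, while for a handful of jobs the overhead can cancel the gain, which matches the behavior discussed below.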
Within this study, we performed a simulation of the model; the results are shown in Fig. 5. The figure shows that for a relatively small number of tasks, the overhead of calculating the expected duration of a task and of redistributing tasks in the queue affects the performance of the ranking algorithm in such a way that it shows no significant advantage.
Figure 5. Comparison of task execution times for the FIFO algorithm (blue line) and the
priority one (orange line).
However, as the number of tasks grows, the ranking algorithm begins to clearly outperform the FIFO approach. Qualitatively this happens because the first jobs contribute more to the total waiting time of all requests than the last ones, so it is better to process shorter jobs at the beginning. For example, for two jobs taking 1 s and 10 s, running the short job first gives a total waiting time of 1 s, while the reverse order gives 10 s. We expect that this advantage may be even more significant for more complex cases of simultaneous queries to multiple storages and dynamic priority rebalancing.
4. Results and discussion
This work considers a system of aggregated data collection and access with a metadata catalog and $S$ distributed independent storages. For the case of user requests addressing only one of the storages, a system of $S$ parallel queues was proposed. A procedure was proposed for estimating the time it takes to complete the individual stages of user request processing, an algorithm for processing user requests by the system was described, a mathematical model of the described process was developed, and the problem of minimizing the user waiting time in the queues was formulated. To investigate the possibilities of an optimal solution to this problem, simulation modeling was performed, which has shown the advantage of the chosen Priority method of task ordering compared to the FIFO and LIFO approaches. More complex cases of system behavior and other approaches to determining task priorities will be considered in further studies.
Acknowledgements: The author acknowledges the help of colleagues from the projects Astroparticle Physics Data Storage (APPDS) and German-Russian Astroparticle Physics Data Life Cycle Initiative (GRADLCI), especially A. Haungs.
References
[1] Mons B, Neylon C, Velterop J, Dumontier M, da Silva Santos L O B and Wilkinson M D 2017 Information
services & use 37(1) 49-56
[2] Reiser L, Harper L, Freeling M, Han B and Luan S 2018 Molecular plant 11(9) 1105-08
[3] Lagoze C and Van de Sompel H 2005 Implementation guidelines for the open archives initiative protocol for metadata harvesting Protocol version 2.0 of 2002-06-14, document version 2005/05/03T22:51:00Z http://www.openarchives.org/OAI/2.0/guidelines.htm
[4] Yiotis K 2005 Information technology and libraries 24(4) 157-62
[5] van Wezel J, Streit A, Jung C, Stotzka R, Halstenberg S, Rigoll F et al. 2012 arXiv preprint arXiv:1212.5596
[6] Bychkov I, Demichev A, Dubenskaya J, Fedorov O, Haungs A, Heiss A et al. 2018 Data 3(4) 56.
[7] Schörner T et al. The PAHN-PaN NFDI Consortium. The binding Letter of Intent of the PAHN-PaN NFDI Consortium https://www.dfg.de/download/pdf/foerderung/programme/nfdi/absichtserklaerungen_2019/2019_pahn_pan.pdf Last visited 05.11.20
[8] Shvachko K, Kuang H, Radia S and Chansler R 2010 The hadoop distributed file system IEEE 26th
symposium on mass storage systems and technologies (MSST) 1-10
[9] Zaharia M, Xin R S, Wendell P, Das T, Armbrust M, Dave A et al. 2016 Apache spark: a unified engine for
big data processing Communications of the ACM 59(11) 56-65.
[10] Kacsuk P and Kiss T 2007 Towards a scientific workflow-oriented computational World Wide Grid CoreGRID Technical report https://westminsterresearch.westminster.ac.uk/item/91934/towards-a-scientific-workflow-oriented-computational-world-wide-grid Last visited 05.11.20
[11] Kruse R L 1987 Data Structures and Program Design (2nd ed) (Prentice-Hall, Inc. div. of Simon & Schuster) p 150