Content uploaded by Artem Kramov
Author content
All content in this area was uploaded by Artem Kramov on Mar 10, 2021
Content may be subject to copyright.
Available via license: CC BY-ND 4.0
Content may be subject to copyright.
S.L. Kryvyi, S.D. Pogorilyy, M.S. Slynko, A.A. Kramov, 2020
Системні дослідження та інформаційні технології, 2020, № 3 7
T
I
Д
C
ПРОГРЕСИВНІ ІНФОРМАЦІЙНІ ТЕХНОЛОГІЇ,
ВИСОКОПРОДУКТИВНІ КОМП’ЮТЕРНІ
СИСТЕМИ
UDC 004.4
DOI: 10.20535/SRIT.2308-8893.2020.3.01
METHOD OF SEMANTIC APPLICATION VERIFICATION
IN GPGPU TECHNOLOGY
S.L. KRYVYI, S.D. POGORILYY, M.S. SLYNKO, A.A. KRAMOV
Abstract. An application development and verification method for massively paral-
lel systems using NVIDIA GPUs is proposed. The method allows creating models at
different levels of abstraction using the apparatus of marked transition systems. The
compositions (product) of such systems are transformed into a Petri net, which are
then analyzed by appropriate means. The proposed method allows specifying model
properties by temporal logic formulas. This allows studying the properties of mas-
sively parallel systems which is almost impossible to analyze manually, since the
number of execution threads in the latest NVIDIA video adapter architectures (Pas-
cal, Volta, Turing, Ampere) is measured in hundreds of thousands or millions.
Keywords: CUDA, graphical processing units (GPU), General Purpose Graphics
Computing (GPGPU), transition system, Petri net, model design.
INTRODUCTION
The main feature of the latest graphical processing units (GPU) is the availability
of a set of streaming multiprocessors (SM) that were used previously in image
processing algorithms and tasks only. General Purpose Graphics Computing
(GPGPU) technology is based on the use of a combination of GPUs working in
parallel to process data using general-purpose algorithms (scientific or other, but
not necessarily related to image processing). The latest graphics architectures
from NVIDIA include Pascal, Volta, Amper, Thuring [1]. For this day, Volta is
one of the most powerful GPU architectures, which is an indicator of achieve-
ments in the field of high-performance artificial intelligence calculations (the
GTX TITAN Z graphics adapter, built on the top of two powerful GK110 cores,
can provide peak performance up to 8 teraflops, and each core can implement
2880 stream processors, which in total gives 5760 stream processors).
NVIDIA, in addition, develops a series of video adapters focused on scien-
tific applications and use in high-performance (cluster) computing. These GPUs
lack some graphic-specific features and are widely used in the scientific field.
This led to a significant increase in the number of supercomputers included in the
TOP500 world most efficient computers [2] that utilize NVIDIA video adapters.
Nowadays high-performance computing (HPC) trends get shifted from using
clusters consisting of general-purpose modules to more specialized accelerated
S.L. Kryvyi, S.D. Pogorilyy, M.S. Slynko, A.A. Kramov
ISSN 1681–6048 System Research & Information Technologies, 2020, № 3
8
components (in other words, from universal CPUs to other units — GPU, FPGA,
etc.), that is, to less functional and less power consuming modules. Accelerators,
unlike universal CPUs, can not run an operating system, and rely on external sys-
tems for I/O operations or task scheduling. Their advantage in productivity lies
solely in the fact that these elements are used in large groups simultaneously. The
article proposes methods of modelling the properties of GPU accelerator architec-
tures at different abstraction levels.
In terms of programming, success of the most common GPGPU technologies
(CUDA in particular) lies in the fact that they encapsulate the SIMD nature of
GPU hardware. In most cases, the developer deals with individual streams that
work with scalar data instead of warps [1] working with vectors.
Development of the complicated, massively parallel systems that utilize
video adapters requires new scientific methods to justify both the architecture of
the system and applications for them. Since modern multi-threaded applications
have hundreds of thousands and millions of threads, solving parallelization tasks
for such systems makes impossible to use traditional engineering design ap-
proaches and requires the use of a mathematical apparatus and formalization
methods to substantiate the decisions made. Model justification is an effective
way of algorithmic study of the parallel algorithms properties. One of the options
for implementing such a justification is the use of the apparatus of algorithmic
algebras [3], which allow formulating schemes of algorithms in the form of
algebraic expressions that depend on various parameters, including software and
hardware platforms, paradigms of parallel programming, etc. This paper focuses
on using the transition systems (TS) [4] and their compositions as the main
mathematical model, which allow creating models at different abstraction levels.
Their properties can be investigated by translating in Petri networks (PN) [5] and
can be specified by temporal logic formulas [6].
MODERN NVIDIA GPU ARCHITECTURES
CUDA (Computational Unified Device Architecture) is a parallel computing ar-
chitecture developed by NVIDIA to facilitate the GPGPU programming by using
high-level APIs. Since the pilot CUDA platform had been introduced more than
10 years ago, each new generation of NVIDIA GPUs provided better application
performance (for example, in floating point operations), increased energy effi-
ciency, added important new computing capabilities, and streamlined graphics
processor programming. Today, NVIDIA GPUs are leading computing devices
that have, in some way, defined the artificial intelligence revolution (AI). Nowa-
days GPGPU technology accelerates deep learning; high-precision text, voice,
and other media data recognition systems; are used in the areas of molecular mo-
deming, modelling of medical products, medical diagnosis, financial modelling,
and others.
Application instructions that are executed in a GPU-based heterogeneous
environment can be logically decomposed into the following parts:
“host” instructions — blocks which are executed on the CPU;
“kernel” functions — instruction blocks which are executed on the GPUs.
Host blocks define the context of the kernel functions execution, transferring
data between the computer’s RAM and GPU memory.
Method of semantic application verification in GPGPU technology
Системні дослідження та інформаційні технології, 2020, № 3 9
All NVIDIA GPU architectures execute instructions in groups of 32 thread
(known as warps) using the SIMT model (Single Instruction, Multiple Thread),
that is, one instruction is executed by many threads simultaneously, although be-
haviour of each individual thread is not limited by anything. However, architec-
tures prior to Pascal include, among others, a software counter and a mask com-
mon to all threads of the warp that determines which threads are active at any
given moment. This means that in the case of a execution ow branching, each
execution path uses only a subset of all threads, while the rest are deactivated.
Once execution paths are converged, threads of a warp start being executed simul-
taneously again.
Such an implementation model gets rid of the necessity to track each indi-
vidual thread state separately. However, tracing only a warp in general means re-
ducing the level of parallelism if branching is present, as described above. In turn,
this prevents the data exchange between the threads within a single warp, if those
threads are at different execution stages, or if they execute the instructions in dif-
ferent branches. That means that threads of different warps may execute instruc-
tions in parallel, but threads of a single warp sometimes have to execute instruc-
tions sequentially. Thus, algorithms that require data sharing at a high level of
detail or that utilize synchronization tools (such as mutexes) can suffer from dead-
locks. Therefore, developers have to rely on algorithms with minimal blocking
support while using the NVIDIA GPUs of Pascal architecture or earlier (that is,
more than 60% of devices, because Volta architecture was released only in 2017
and Turing — in 2018).
More modern architectures, starting from Volta, include independent flow
planning, which stores instruction counters and call stacks for each thread sepa-
rately, which can be utilized for optimal resource usage or to allow one thread to
wait for data from another. To increase the level of parallelism, Volta includes an
optimizer that defines how to group active threads within the warp within the
SIMT module. This maintains the high performance of the SIMT approach, as in
previous NVIDIA GPUs, but with much greater flexibility: the threads can now
perform various branching paths within a single warp. The verification methods
proposed are illustrated on a simple system model analysis example.
MODEL JUSTIFICATION AND VERIFICATION
Testing process has always been the main method of increasing the reliability of
programs, developed using traditional methods. Edsger Dijkstra once said: “Pro-
gram testing can be used to show the presence of bugs, but never to show their
absence!”. In addition, testing can not detect typical synchronization errors of
parallel programs. Parallel programs may for years retain errors that manifest
themselves after a long usage period as a reaction to a specific combination of
numerous factors that have arisen (for example, due to the unpredictable rates of
individual threads/processes execution in parallel programs). However, if any of
the system properties can be expressed formally, for example, in the form of a
mathematical logic formula, then analysis of this property can be performed by
verification methods. Normally process of the system verification consists of the
following parts:
1. Construction of a mathematical model of the system under analysis.
S.L. Kryvyi, S.D. Pogorilyy, M.S. Slynko, A.A. Kramov
ISSN 1681–6048 System Research & Information Technologies, 2020, № 3
10
2. Definition of the properties to be checked in the form of a formal text (al-
so known as specification).
3. Building a formal proof of the presence or absence of the property being
verified.
Usually, mathematical model of a system is a graph whose vertices are
called states and represent situations (or situation classes), in which the system
may be present at different times; whose edges can have labels depicting the ac-
tions system can perform. The functioning of the system in this model is repre-
sented by transitions along the edges of the graph from one state to another. If the
passable edge has a label, then this label represents the action of the system,
executed when passing from the state at the beginning of the edge to the state
at its end.
The choice of the abstraction level for the system modelling depends on
many factors (algorithmic solvability, astronomical dimensions of the model, the
absence of effective methods of formal analysis of properties etc.). In this regard,
the informal rules of this choice are reduced to the following: system model
should not be over-specified, because the excessive model complexity may cause
significant computational problems during its formal analysis. On the other hand,
system model should not be oversimplified: it should reflect those aspects of the
system that are relevant to the properties being verified, and preserve all the
properties of the simulated system which are of interest for analysis.
Model checking (MC, [7]) approach is used to find a formal proof that the
model does not meet its specification. In this paper we propose a new method of
justifying that the model satisfies the specification. This paper focuses on the
method of checking the conformity of a model an its specification, which uses the
apparatus of TS and PN. Definition of a simple and labeled TS is given below,
while definition of PN is well known and can be found in [7] if necessary.
Definition 1. A simple transition system is ),,,(
RSA , where
S — finite or infinite set of states;
R
— finite or infinite set of transitions;
, — two mappings from S to
R
, which make a correspondence between
a transition Rt
and two states )(t
and )(t
, which are called respectively the
beginning and end of the transition .t
The transition t with beginning
s
and end s
is written as follows: .ss
Sometimes transitions may have a common beginning or end or both. This means
that the pair SR
:, is not necessarily an injective function. TS A is called
finite if sets RS ,are finite. If the set of states defines an initial state, such TS is
labeled as ),,,,( 0
sRSA
and is called initial TS. Such model as a simple TS
may be enough to study properties of a model at a certain level of abstraction, but
when it is necessary to carry out a more detailed analysis, labeled TS is more ap-
propriate.
Definition 2. Let
X
be an alphabet. The labeled transition system (LTS) is
an ordered six ),,,,,( 0hsTSA where ),,,,( 0
sTS
represent a TS, and h is
a mapping from T to ,X which makes a correspondence between each transition
t and its label .)( Xth LTS is finite if sets ,S ,T
X
are finite. Transition label
)(th may also be called an action, and the transition itself is written as follows:
Method of semantic application verification in GPGPU technology
Системні дослідження та інформаційні технології, 2020, № 3 11
)),(,( sths or .
)( ss th
Transition label set is often accompanied by a special
label , which represents an internal action of the system that is not visible at
a given level of modelling.
Using LTS as a model of a real system allows analysing properties of actions
associated with transitions, which is impossible by using a simple TS. Analysis
and verification of the applications for graphic video adapters (NVIDIA in par-
ticular) was chosen as a subject of the study, because this area perfectly illustrates
the impossibility of manual verification, as the number of threads allocated for
solving the problem is measured by hundreds of thousands (in Pascal/Volta archi-
tectures). The use of LTS to construct a high-level model for substantiating the
properties of a CUDA application was described in [8].
CUDA APPLICATION EXECUTION MODEL
A generalized execution model in the NVIDIA CUDA architecture, based on LTS
and Petri nets was presented in [9]. The main emphasis was put on obtaining a
high-level model without a detailed examination of the labels semantics of each
transition system. The following briefly recalls the main details: three LTS were
emerged as a result of CUDA application decomposition:
LTS ),,,,},,,,{( 1011132101 haRaaaaSA
— represents the warp that
contains a set of instructions and sequentially executes them, where
};,,,{ 43211 rrrrX
LTS ),,,,},,,,,,{( 202225432102 hbRbbbbbbSB
— represents a
generalized information instruction, where 2 123456
{, , , , , };
X
pppp pp
LTS ),,,,},,,,{( 3033332103 hcRccccSC
— represents a thread block
execution process (warp scheduling) on the SM, where }.,,,{ 43213 qqqqX
Transition label functions 321 ,, hhh are shown below in the charts (Fig. 1, 2, 3).
The main activities modelled at this level of abstractions include planning, choos-
ing warp and instruction for it and providing exclusive access to computational
resources for the execution time of each warp-instruction tuple. This paper shows
creation of a model of the lower level of abstraction, which will allow us to ana-
lyze the properties of actions associated with transitions, and the properties of the
model in particular states.
F
ig. 1. LTS
B
of the information instruction
b1
b2b4
b2
b0b5
p
2
p
5
p
6
p
3
p
1
p
4
τ
S.L. Kryvyi, S.D. Pogorilyy, M.S. Slynko, A.A. Kramov
ISSN 1681–6048 System Research & Information Technologies, 2020, № 3
12
Definition 1. General TS (GTS) is a TS ),,,,,,( 0LAPsRXSA
where
S — set of states;
X
— set of actions associated with transitions;
SXSR — transition relation;
0
s — initial state from S;
AP
— set of propositional formulas associated with states;
:()LS BAP — state label function, where ()
B
AP is the power set
of .AP
We introduce a set of Boolean variables:
},,,,,{
1ninstrMemFiinstrArFinecinstrMemExcinstrArExeinstrMeminstrArV ,
where
,0,0,0,0{)(
1
instrMeminstrArFincinstrArExeinstrArh
}0,0
ninstrMemFiecinstrMemEx ;
}1{)( 11 instrArph — arithmetical instruction selection;
}1{)( 41 instrMemph — global memory access selection;
}1{)( 21 cinstrArExeph — arithmetical instruction execution;
}1{)( 51
ecinstrMemExph — memory access execution;
}1{)( 31 instrArFinph — retrieving results of an arithmetical operation;
}1{)( 61 ninstrMemFiph — retrieving results of memory access operation.
We define the state label function :
1
L
};0{)( 01
instrMeminstrArbL
11
() { 1};L b instrAr
12
(){ 1};L b instrArExec
};1{)( 31 instrMembL
};1{)( 41
ecinstrMemExbL
}.1{)( 51
instrArFinninstrMemFibL
We introduce a set of Boolean variables ,,{
2warpBusywarpActiveV
}
warpFin that are set by the following transition labels (Fig. 2):
21
() { 1}h r warpActive— warp activation — given warp gains access to
SM resources to execute a single instruction;
}1{)( 22 warpBusyrh — instruction is being executed by a warp;
}1{)( 32 warpFinrh — instruction execution is finished;
а1 а2а3 а0 r1 r2r3
r4
F
ig. 2. LTS A of the instruction execution within a warp
Method of semantic application verification in GPGPU technology
Системні дослідження та інформаційні технології, 2020, № 3 13
}0,0,0{)( 42
warpFinwarpBusywarpActiverh — warp deactivation.
State label function 2
L is defined as following:
};0{)( 02
warpFinwarpBusywarpActiveaL
};1{)( 12 warpActiveaL
};1{)( 22 warpBusyaL
}.1{)( 32 warpFinaL
We introduce a set of Boolean variables ,,{
3xecwarpInstrEelwarpInstrSV
}
inwarpInstrF that are set by the following transition labels (Fig. 3):
}1{)( 13 elwarpInstrSqh — warp and instruction selection;
}1{)( 23
xecwarpInstrEqh — execution of the selected instruction by a
selected warp;
}1{)( 33 inwarpInstrFqh — confirmation of instruction execution finish;
}1,1,0{)( 43
inwarpInstrFxecwarpInstrEelwarpInstrSqh — transi-
tion to the next iteration.
State label function 3
L is defined as following:
};0{)( 03
inwarpInstrFxecwarpInstrEelwarpInstrScL
};1{)( 13 elwarpInstrScL
};1{)( 23
xecwarpInstrEcL
}.1{)( 33 inwarpInstrFcL
Integration of multiple TS into a holistic system that orchestrates the joint
work of all subsystems is performed depending on the requirements of the com-
ponent interaction model (synchronous, asynchronous, parallel, sequential). These
interaction methods are introduced using different TS composition types and the
general notion of TS. To analyze the model properties, we consider the concept of
parallel composition of the TS.
There are several options of TS composition that model parallel functioning
of multiple TS. The simplest one is the composition in which TSs work in paral-
lel, but do not interact with each other. Such a composition is based on the con-
cept of alternating actions (interleaving), which are performed by different sub-
systems of the composition. In this case, the order of actions performed by each
TS is preserved. This is a common way of modelling parallel interactions, which
is based on the assumption that the result of a parallel execution of operations
coincides with the result of their sequential execution. The formal definition of
this composition is as follows.
Let 0
(, , , , , )
i
iiii ii
A
SXRs APL — transition systems, where .,...,2,1 ni
Parallel composition of TS n
AAA ,...,, 21 with interleaving is TS
а1c2c3 c0 q1 q2q3
q4
F
ig. 3. LTS C of the warp scheduler
S.L. Kryvyi, S.D. Pogorilyy, M.S. Slynko, A.A. Kramov
ISSN 1681–6048 System Research & Information Technologies, 2020, № 3
14
12 0
... ( , , , , , ),
n
A
AA A SXRsAPL where n
SSSS
...
21 ,
0
s
),,...,,( 00
2
0
1n
sss ,...
21 n
XXXX and transitions from
R
are defined
as follows: transition Rssxss nn
)),...,(,),,...,(( 11 if and only if state 1
( ,..., )
n
ss
differs from 1
( ,..., )
n
ss by the value of not more than one component, that is there
exists {1, 2 , . . . , }in such that ,),,( iii Rsxs
where }.{
i
Xx
Note that in this composition TS can be repeated, that is, a given i
A may be
included in TS
A
more than once. In addition, some of the TS i
A may be compo-
sitions of other TS: ....
21 iniii AAAA Therefore one can describe the whole sys-
tem in a structured from by using a parallel composition. However, there may be
achievable states in the resulting TS which are not desirable. Therefore a parallel
composition with alternating actions does not always reflect the real situation
when the TSs should interact with each other. Handshaking composition concept
is a more adequate approach to describe parallel interaction. Such option describes a
situation when different TS are synchronized by the actions in which multiple TSs
participate simultaneously (data exchange etc.). Those actions are indicated by the
same symbol in alphabets of all TS which take part in the interaction.
Definition 2. Let 0
(, , , , , )
i
iiii ii
A
SXRs APL — transition systems, where
.2,1i Parallel composition of TS 21,AA with handshaking is a TS
12 0
(, , , , , ),
A
AA SXRsAPL where ,
21 SSS
12
000
(, ),sss
.
21 XXX Transitions that belong to
T
are defined as following:
if 21 XXx and 11 1
(,,) ,sxs R
22 2
(,, ) ,sxs R
then
12 12
(( , ), ,( , ))ss x ss R
;
if 21 \XXx and 11 1
(,,) ,sxs R
then 12 12
(( , ), ,( , ))ss x ss R
;
if 12 \XXx and 22 2
(,, ) ,sxs R
then 12 12
(( , ), ,( , ))ss x ss R
.
The following composition summarizes both previous concepts and is called
a synchronized parallel composition or a synchronous product of TS (Fig. 4).
b3,a2,c2
b0,a0,c0
b1,a2,c2
Inst
r
b4,a2,c2
b0,a1,c1
b0,a3,c3
b2,a2,c2
b5,a2,c2
t0
t0
t2
t5
t3
t6
t4
t7
s0
s2
s3s4
s7
s1
s5s6
t9 t8
s8
F
ig. 4. GTS A of synchronous product of CBA LTSLTSLTS
Method of semantic application verification in GPGPU technology
Системні дослідження та інформаційні технології, 2020, № 3 15
Definition 3. nsssss AALAPsRXSA
...),,,,,( 10 , where 1
SSs
n
S, 1
00 0
( ,..., )
n
ss s is called a synchronous product of TS n
AAA ,...,, 21
where 0
(, , , , , ).
i
iiii ii
A
SARs APL Set of transitions is divided into two classes:
asynchronous and synchronous. When synchronous transition from state n
ss ,...,
1
occurs, some of its components change simultaneously while the rest remains un-
changed. To describe this the symbol
is added to labels of each transition set
i
R (i.e. {}
i
X
) along with a corresponding transition (,, ).
ii
ss
Such a sym-
bol shows that state does not change during a particular transition. Subset
R
of
n
RR ...
1 is called the synchronization constraints set. If TS is labeled, then the
set of global transitions
R
corresponds to the set of transition labels .X In other
words, ,...
1n
XXX where i
X is an alphabet of LTS ,
i
A .,...,1 ni
An arbi-
trary element 1
( ,..., , )
n
AA AR of the set
X
is called the LTS synchronization
vector.
We build a model of application execution in the CUDA architecture in the
form of synchronous product with the following global transitions (initial and fi-
nal states of the transitions are omitted, since they are present in the model):
),,(),,,(),,,(),,,({ 34232212111
ptptqrptqrtR ;
)}.,,(),,,(),,,(),,,(),,,( 44933867562245 qrtqrtptptqrpt
DETERMINATION OF SEMANTIC CORRECTNESS OF THE MODEL ACTIONS
Definition function of i
L allow to find the context of correct system functioning
at each transition in each state. For example, if B
LTS is in state ,
0
b values of
both instrAr and instrMem should be .0 If system was moved to a state 0
bafter
a number of allowed transitions and instrMem flag is set, this will mean that sys-
tem is not semantically correct, even if the corresponding PN meets all reliability
criteria.
To analyze the semantic correctness of a model, it is necessary to determine
the context of each global state and consider its conformity to the system under
analysis. As shown above, transition system state context is specified by the state
label function ,
i
L and the global state context can be retrieved by combining lo-
cal state contexts of the components of the synchronous products. Each of the
global states can be described as follows:
)()()()( 0302011 cLaLbLsL
}0{}0{}0{
elwarpInstrSwarpActiveinstrMeminstrAr —
instruction is absent, warp is not active, scheduler is ready for planning;
)()()()( 1312012 cLaLbLsL
}1{}1{}0{
elwarpInstrSwarpActiveinstrMeminstrAr —
warp and instruction are selected, but instruction type is not yet determined;
)()()()( 2322113 cLaLbLsL
}1{}1{}1{
xecwarpInstrEwarpBusyinstrAr —
S.L. Kryvyi, S.D. Pogorilyy, M.S. Slynko, A.A. Kramov
ISSN 1681–6048 System Research & Information Technologies, 2020, № 3
16
arithmetic instruction is selected for a given warp;
)()()()( 2322214 cLaLbLsL
}1{}1{}1{
xecwarpInstrEwarpBusycinstrArExe —
given warp executes arithmetic operation;
)()()()( 2322315 cLaLbLsL
}1{}1{}1{
xecwarpInstrEwarpBusyinstrMem —
memory access instruction is selected for a given warp;
)()()()( 2322416 cLaLbLsL
}1{}1{}1{
xecwarpInstrEwarpBusyecinstrMemEx —
given warp executes memory access instruction;
)()()()( 2322517 cLaLbLsL
}1{}1{}1{
inwarpInstrFwarpBusyninstrMemFiinstrArFin —
current instruction is executed regardless of its type;
)()()()( 3332018 cLaLbLsL
}1{}1{}0{
inwarpInstrFwarpFininstrMeminstrAr —
instruction is absent, warp and scheduler finished their work.
All received contexts are semantically correct in accordance with the materials
[1], thus, the set of global transitions of synchronous products is defined correctly.
In addition to the above, the semantic correctness of the model is provided
by the following properties:
mutual exclusion: a single instruction can be of one type only — either
arithmetic or memory access;
fairness: if the warp is active, scheduler must provide an instruction to be
executed;
liveliness: if the warp is active, one of the available types of instruction
should be executed;
deactivation: warp scheduler does not activate the warp if there are no in-
structions for execution.
Sequence nnsss
...
2110 is called a finite execution in GTS, where
1
(, , )
iii
ss T
for .1,...,1,0
ni
The following finite execution paths exist in the received LTS:
11122334478891
()*Path stststst ststs
;
21125566778891
()*,Path s t s t s t s t s t s t s
where denotes the iterative operation of regular language. The language that cor-
responds to the paths above is:
1 2 1 234 567 89
( ) ( ( ) ( ))* ( ( ) )*.LPath LPath LPath t ttt ttt tt
And this language must correspond to the formula:
1234 5678
()()()() ()()()().Ls Ls Ls Ls Ls Ls Ls Ls
Method of semantic application verification in GPGPU technology
Системні дослідження та інформаційні технології, 2020, № 3 17
Let's check the semantic correctness properties of the model:
mutual exclusion: the following expression is always true:
3522321113
() () ( ) ( )( () ());Ls Ls L a L c L b L b
fairness: always 21 33
() ();La Lc
liveliness: always 2 1 11 22 32 13 22 32
() ()()() ()()();La LbLaLc LbLaLc
deactivation: 30 10 2 0
() () ( ).Lc LbLa
FORMALIZATION OF THE GTS ANALYSIS PROCESS
Analysis process presented in the previous section can be generalized and reused
for any GTS.
Definition 1. The subset of a set ()*,
B
AP where ()
B
AP is the power set
of the set ,AP is called the linear-temporal property
P
over the set of atomic
propositional formulas .AP Consequently, (( ))*.PBAP
In our case set (( ))*BAP is a set of words of finite length constructed from
concatenated formulas of ().
B
AP Assume there is (, , ,, , )ASXRIAPL
and
n
sss ...
10
is a sequence of states. Such sequence is called a path fragment
if 1()
ii
sPosts
where () (,),
ii
Post s Post s x
and :{),( SsxsPost i
},),,( Rsxsi
.Xx Sequence n
sss ...
10
for which
)( n
sPost is called
the maximal path fragment.
Definition 2. Word 01
( ) ( )... ( )
n
Ls Ls Ls is called a trace ))((
trace of a fi-
nite sequence .
Consequently, the set of traces is a set of finite words over the
alphabet of the propositional formulas ()
B
AP which are executed in states of this
sequence. Denote },)({)( tracetrace () ( ())trace s trace Path s
and
() ()
sI
Traces A trace s
where )(sPath is maximal fragment of path that
begins in state .
s
In our case, the set of
AP
propositional formulas includes the following items:
,,,,,,{ ninstrMemFiinstrArFinecinstrMemExcinstrArExeinstrMeminstrArAP
}.,,,,, inwarpInstrFwarpFinxecwarpInstrEwarpBusyelwarpInstrSwarpActive
The following words are traces of GTS
A
:
1 234781
()()()()()()
p
Ls Ls Ls Ls Ls Ls;
2256781
()()()()()()
p
Ls Ls Ls Ls Ls Ls.
Let's define bad prefixes in these words as prefixes that violate the truth of
),( i
sL which mean the following words:
)(...)()( 1321 sLsLsLp
;
)(...)()( 1322 sLsLsLp
;
)(...)()( 1523 sLsLsLp
;
... ... ... ... ... ... ... ... ... ...
)()( 12 sLsLpk
.
S.L. Kryvyi, S.D. Pogorilyy, M.S. Slynko, A.A. Kramov
ISSN 1681–6048 System Research & Information Technologies, 2020, № 3
18
Therefore language )...(*)()( 121 k
ppppABadPref
is regular and
is accepted by a finite automaton ),,,,( 0FQfAPQB
where
FQ0 that
is shown at Fig. 5.
We will connect the GTS from Fig. 4 and automaton
B
with such a product
B
A
that produces GTS A as result so that:
,QSS
R
— the smallest relation defined by the rule
,
),(),(
)(
qsqs
qqss
x
sLx
},:),{( )(
000000 0qqQqIsqsI sL
,QPA
)(QBQLS where (, ) {}.Lsq q
Then the correctness of GTS
A
functioning is expressed as a condition
,)()()()(
BLATracesABadPrefATraces
where ()LB is the language accepted by the automaton .B Consequently, if
P
is
a property whose execution guarantees the correct functioning of GTS ,A then
)()( PBadPrefATraces . Thus, GTS A will look as shown on Fig. 6. There
is a transition from each vertex, except for 00
(, ),sq to the state 2
(,*).sq Such
transitions represent one of the bad prefix traces. For the sake of clarity only 4
examples of bad prefixes are left in Fig. 6.
F
ig. 5. Automaton
B
that accepts bad prefixes, ,
00 Qq
,
0Fq
Fq
*
p1 , p2
**
1,, k
pp
q0 q*
s0, q0 s2, q0 s3, q0s4, q0
s2, q* s1, q0s8, q0
s5, q0s6, q0
s7, q0
not(L(s2))
not(L(s1))
no
t(
L
(
s
5
)
no
t(
L
(
s
3
)
t9
t8
t6
t7
t0
t4
t2t3
t1
t5
F
ig. 6. GTS of B
A
product, where Fq
*
Method of semantic application verification in GPGPU technology
Системні дослідження та інформаційні технології, 2020, № 3 19
DETERMINATION OF MODEL CORRECTNESS ON THE HIGH LEVEL OF
ABSTRACTION
In addition to semantic verification, it is important to check the redundancy of the
system, deadlock/trap balance etc. To do this, we use verification at the highest
level of abstraction, without need of transition label semantics analysis.
The translation of the received synchronous product into the PN gives the
network shown in Fig. 7. It is described in [10], [11] that the TS product seman-
tics and the semantics of the corresponding PN are consistent in the sense that a
sequence of global transitions represents the global history of the TS product
A
if
and only if it is an admissible sequence of transitions in the PN. Accordingly, el-
ements of the set
R
become transitions of the PN, and the global states of the TS
product (the set of states of each TS involved in the synchronous product before
or after the global transition) become the places of the received network. We
build a PN by using the synchronization constraints set, and such a network simu-
lates the interoperability of all subsystems. Recall that NVIDIA video adapters
operate with multiple warps at the same time, so there is a situation of synchro-
nous and asynchronous execution, since different instructions may have different
execution times and will not reach synchronization location at the same time.
Therefore using the Petri net apparatus is expedient.
Let's analyze the presence and number of dead transitions in PN for
one warp case. To do this, we form a state equation
k
MMxA 0
b1,a2,c2
b0,a3,c3
b0,a1,c1
instr b5,a2,c2
b2,a2,c2
b4,a2,c2
b3,a2,c2
b0,a0,c0 *
1
t
**
1
t
t2
t9
t3
t5t6
t4
t8
t7
t1
s
2
s
8
s
3
s
5
s
4
s
7
s
6
s
0
s
1
…
F
ig.
7
. PN that represents synchronous product CBA LTSLTSLTS
S.L. Kryvyi, S.D. Pogorilyy, M.S. Slynko, A.A. Kramov
ISSN 1681–6048 System Research & Information Technologies, 2020, № 3
20
0 dxA where 0(1,1,0,0,0,0,0,0,0),M (0,1,0,0,0,0,0,0,0),
k
M
)( 0
MMd k (see Table 1).
Table 1. Petri net state equation matrix
t
s
1
t 2
t 3
t 4
t 5
t 6
t 7
t 8
t 9
t 0
()
k
M
M
0
s -1 0 0 0 0 0 0 0 0 1
1
s -1 0 0 0 0 0 0 0 1 0
2
s 1 -1 0 0 -1 0 0 0 0 0
3
s 0 1 -1 0 0 0 0 0 0 0
4
s 0 0 1 -1 0 0 0 0 0 0
5
s 0 0 0 0 1 -1 0 0 0 0
6
s 0 0 0 0 0 1 -1 0 0 0
7
s 0 0 0 1 0 0 1 -1 0 0
8
s 0 0 0 0 0 0 0 1 -1 0
Applying the TSS-algorithm [9] to solve the state equation with the above
matrix, we obtain the following solutions (Table 2).
Table 2. Petri net state equation solutions
1
t 2
t 3
t 4
t 5
t 6
t 7
t 8
t 9
t
1 1 1 1 0 0 0 1 1
1 0 0 0 1 1 1 1 1
As can be seen from the set of solutions, all transitions in the PN with the
initial and final markings given above are alive (the value corresponding to each
transition is positive at least in one of the solutions). In addition, the property of
mutual exclusion has been verified: transitions that correspond only to one of the
possible types of instructions are performed at each point of time.
The analysis of the properties of the received PN also showed the absence of
deadlocks, verified limitation and controllability [10]. As described above, this
analysis did not take into account the properties of actions in transitions and
properties associated with the states of the model.
CONCLUSION
This paper proposes the use of the TS mathematical apparatus to obtain a for-
malized system specification in the GPGPU technology. The advantages of the
existing model include the ability to reduce synchronous product to a PN, which
allows for further verification by automated means. The ability to study the char-
acteristics of the model created by the combination of LTS and PN apparatuses is
shown. The model was analyzed to verify there are no dead transitions and places
(without taking into account the semantics of the transition labels), and a separate
analysis was performed to verify the semantic correctness of the model actions.
Method of semantic application verification in GPGPU technology
Системні дослідження та інформаційні технології, 2020, № 3 21
As a result of these actions, the proof of the correct construction of the model was
obtained. The developed approach allows to simplify and reduce the processes of
verification and testing of multi-threaded applications in computer systems that
utilize video adapters.
REFERENCES
1. Nvidia Data Center – Nvidia, 2018. [Online]. Available: https://www.nvidia.com/
en-us/data-center/. Accessed on: 2019, March 14.
2. TOP500 Lists - TOP500 Supercomputer Sites, 2018. [Online]. Available:
https://www.top500.org/lists. Accessed on: 2019, March 14.
3. A.V. Anisimov, S.D. Pogorilyy, and D.Yu. Vitel, “About the Issue of Algorithms
formalized Design for Parallel Computer Architectures”, Applied and Computational
Mathematics, vol. 12, no. 2, pp.140–151, 2013.
4. A. Arnold, Finite Transition Systems: Semantics of Communicating Systems. Paris,
France: Prentice Hall, 1994, 177 p.
5. T. Murata, “Petri nets: properties, analysis and applications”, in Proc. of the IEEE,
77:541.80, 1989.
6. M. Ben-Ari, Mathematical Logic for Computer Science. UK: Prentice Hall Interna-
tional Ltd, 1993, 305 p.
7. E.M. Clarke, Jr., O. Grumberg, and D.A. Peled, Model Checking. USA: MIT Press,
1999.
8. S.D. Pogorilyy, S.L. Kryvyi, and M.S. Slynko, “Model justification of GPU-based
applications”, Control Systems and Computers, vol. 4, pp. 46–56, 2018.
9. S.L. Kryvyi, Linear Diophantine constraints and their applications. Chernivtsi: Buk-
rek Publishing House, 2015.
10. S.L. Kryvyi, S.D. Pogorilyy, and M.S. Slynko, “Transition systems as method of de-
signing applications in GPGPU technology”, in Proc. 11-th international scientific
and practical conference on programming UkrPROG’2018.
11. S.L. Kryvyi et al., “Design of Grid Structures on the Basis of Transition Systems
with the Substantiation of the Correctness of Their Operation”, Cybernetics and Sys-
tems Analysis, vol. 53, no. 1, pp.105–114, New York, USA:Springer Science +
Business Media, January 2017.
Received 15.10.2020
INFORMATION ON THE ARTICLE
Serhii L. Kryvyi, ORCID: 0000-0003-4231-0691, Taras Shevchenko National University
of Kyiv, Ukraine, e-mail: sl.krivoi@gmail.com
Sergiy D. Pogorilyy, ORCID: 0000-0002-6497-5056, Taras Shevchenko National Uni-
versity of Kyiv, Ukraine, e-mail: sdp77@i.ua, sdp@univ.net.ua
Maksym S. Slynko, ORCID: 0000-0001-9667-8729, Taras Shevchenko National Univer-
sity of Kyiv, Ukraine, e-mail: maxim.slinko@gmail.com
Artem A. Kramov, ORCID: 0000-0003-3631-1268, Taras Shevchenko National Univer-
sity of Kyiv, Ukraine, e-mail: artemkramov@gmail.com
МЕТОД СЕМАНТИЧНОЇ ВЕРИФІКАЦІЇ ЗАСТОСУВАНЬ У ТЕХНОЛОГІЇ GPGPU /
С.Л. Кривий, С.Д. Погорілий, М.С. Слинько, А.А. Крамов
Анотація. Запропоновано метод розроблення та верифікації застосувань для
систем з масовим паралелізмом на основі відеоадаптерів від компанії NVIDIA,
S.L. Kryvyi, S.D. Pogorilyy, M.S. Slynko, A.A. Kramov
ISSN 1681–6048 System Research & Information Technologies, 2020, № 3
22
який дозволяє створювати абстракції різних рівнів за допомогою апарата роз-
мічених транзиційних систем. Композиції таких систем трансформуються в
мережі Петрі, які далі аналізуються відповідними засобами. Метод також дає
змогу створювати моделі на різних рівнях абстракції, а їх властивості можуть
специфікуватися формулами темпоральної логіки. Це дозволяє досліджувати
властивості систем з масовим паралелізмом, які майже неможливо аналізувати
вручну, оскільки кількість потоків у новітніх архітектурах відеоадаптерів
(Pascal, Volta, Amper, Тюрінг), виділених для виконання коду, вимірюється
сотнями тисяч або мільйонами.
Ключові слова: CUDA, графічні процесори (GPU), графічні обчислення
загального призначення (GPGPU), транзиційна система, мережа Петрі, побу-
дова моделі.
МЕТОД СЕМАНТИЧЕСКОЙ ВЕРИФИКАЦИИ ПРИЛОЖЕНИЙ В ТЕХНОЛОГИИ
GPGPU / С.Л. Крывый, С.Д. Погорелый, М.С. Слинько, А.А. Крамов
Аннотация. Предложен метод разработки и верификации приложений для
систем с массовым параллелизмом на основе видеоадаптеров от компании
NVIDIA, который позволяет создавать абстракции различных уровней с по-
мощью аппарата размеченных транзиционных систем. Композиции таких сис-
тем трансформируются в сети Петри, которые далее анализируются соответст-
вующими средствами. Метод позволяет создавать модели на различных
уровнях абстракции, а их свойства могут специфицироваться формулами тем-
поральной логики. Это позволяет исследовать свойства систем с массовым па-
раллелизмом, которые практически невозможно анализировать вручную, так
как количество потоков в новейших архитектурах видеоадаптеров (Pascal,
Volta, Amper, Тьюринг), выделенных для выполнения кода, измеряется сотня-
ми тысяч или миллионами.
Ключевые слова: CUDA, графические процессоры (GPU), графические вы-
числения общего назначения (GPGPU), транзиционная система, сеть Петри,
построение модели.