Causality and Epistemic Reasoning in Byzantine Multi-Agent Systems


Abstract

Causality is an important concept both for proving impossibility results and for synthesizing efficient protocols in distributed computing. For asynchronous agents communicating over unreliable channels, causality is well studied and understood. This understanding, however, relies heavily on the assumption that agents themselves are correct and reliable. We provide the first epistemic analysis of causality in the presence of byzantine agents, i.e., agents that can deviate from their protocol and, thus, cannot be relied upon. Using our new framework for epistemic reasoning in fault-tolerant multi-agent systems, we determine the byzantine analog of the causal cone and describe a communication structure, which we call a multipede, necessary for verifying preconditions for actions in this setting.
L.S. Moss (Ed.): TARK 2019
EPTCS 297, 2019, pp. 293–312, doi:10.4204/EPTCS.297.19
© R. Kuznets, L. Prosperi, U. Schmid & K. Fruzsa
This work is licensed under the
Creative Commons Attribution License.
Roman Kuznets*
Embedded Computing Systems
TU Wien
Vienna, Austria
roman@logic.at
Laurent Prosperi
ENS Paris-Saclay
Cachan, France
laurent.prosperi@ens-cachan.fr
Ulrich Schmid
Embedded Computing Systems
TU Wien
Vienna, Austria
s@ecs.tuwien.ac.at
Krisztina Fruzsa
Embedded Computing Systems
TU Wien
Vienna, Austria
krisztina.fruzsa@tuwien.ac.at
1 Introduction
Reasoning about knowledge has been a valuable tool for analyzing distributed systems for decades [5, 9],
and has provided a number of fundamental insights. As crisply formulated by Moses [17] in the form of
the Knowledge of Preconditions Principle, a precondition for action must be known in order to be action-
able. In a distributed environment, where agents only communicate by exchanging messages, an agent
can only learn about events happening to other agents via messages (or sometimes the lack thereof [8]).
In asynchronous systems, where the absence of communication is indistinguishable from delayed
communication, agents can only rely on messages they receive. Lamport’s seminal definition of the
happened-before relation [14] establishes the causal structure for asynchronous agents in the agent–time
graph describing a run of a system. This structure is often referred to as a causal cone, whereby causal
links are either time transitions from past to future for one agent or messages from one agent to another.
As demonstrated by Chandy and Misra [2], the behavior of an asynchronous agent can only be affected
by events from within its causal cone.
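Lamport's happened-before relation can be viewed as reachability in the agent–time graph, and the causal cone of a node as the set of nodes that can reach it. The following Python sketch is our own toy illustration of this idea (the graph encoding via `(agent, time)` pairs and message edges is an assumption, not the paper's formalism):

```python
from collections import deque

def causal_cone(node, messages):
    """Return all agent-time nodes (j, t) in the causal past of `node`.

    `node` is a pair (agent, time); `messages` is a set of edges
    ((sender, t_send), (receiver, t_recv)) with t_send < t_recv.
    Causal links are local time steps (j, t-1) -> (j, t) and messages.
    """
    # Reverse adjacency: which nodes can causally influence (j, t) directly?
    preds = {}
    for (src, dst) in messages:
        preds.setdefault(dst, []).append(src)
    cone, frontier = set(), deque([node])
    while frontier:
        j, t = frontier.popleft()
        if (j, t) in cone:
            continue
        cone.add((j, t))
        if t > 0:                          # local predecessor on j's timeline
            frontier.append((j, t - 1))
        for src in preds.get((j, t), []):  # message predecessors
            frontier.append(src)
    return cone
```

With a single message from agent 1 at time 0 delivered to agent 2 at time 2, the cone of agent 2's node at time 3 contains agent 1's node at time 0 but nothing later on agent 1's timeline.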
The standard way of showing that an agent does not know of an event is to modify a given run by
removing the event in question in such a way that the agent cannot detect the change. By Hintikka’s
definition of knowledge [11], the agent thinks it possible that the event has not occurred and, hence, does
not know of the event to have occurred. Chandy and Misra's result shows that in order for agent i to learn
of an event happening to another agent j, there must exist a chain of successfully delivered messages
*Supported by the Austrian Science Fund (FWF) projects RiSE/SHiNE (S11405) and ADynNet (P28182).
PhD student in the FWF doctoral program LogiCS (W1255).
leading from the moment of agent j observing the event to some past or present state of agent i. This observation remains valid in asynchronous distributed systems where messages could be lost and/or where agents may stop operating (i.e., crash) [4, 10, 19].
In synchronous systems, if message delays are upper-bounded, agents can also learn from the absence
of communication (communication-by-time). As shown in [1], Lamport’s happened-before relation must
then be augmented by causal links indicating no communication within the message delay upper bound
to also capture causality induced via communication-by-time, leading to the so-called syncausality rela-
tion. Its utility has been demonstrated using the ordered response problem, where agents must perform
a sequence of actions in a given order: both the necessary and sufficient knowledge and a necessary and
sufficient communication structure (called a centipede) have been determined in [1]. It is important to
note, however, that syncausality works only in fault-free distributed systems with reliable communica-
tion. Although it has recently been shown in [8] that silent choirs are a way to extend it to distributed
systems where agents may crash, the idea does not generalize to less benign faults.
Unfortunately, all the above ways of capturing causality and the resulting simplicity of determining
the causal cone completely break down if agents may be byzantine faulty [15]. Byzantine faulty agents
may behave arbitrarily, in particular, need not adhere to their protocol and may, hence, send arbitrary
messages. It is common to limit the maximum number of agents that ever act byzantine in a distributed system by some number f, which is typically much smaller than the total number n of agents. Prompted by the ever-growing number of faulty hardware and software components with real-world negative, sometimes life-critical, consequences, capturing causality and providing ways for determining the causal cone in byzantine fault-tolerant distributed systems is both an important and scientifically challenging task. To the best of our knowledge, this challenge has not been addressed in the literature before.¹
In a nutshell, for f > 0, the problem of capturing causality becomes complicated by the fact that a simple causal chain of messages is no longer sufficient: a single byzantine agent in the chain could manufacture “evidence” for anything, both false negatives and false positives. And indeed, obvious generalizations of message chains do not work. For example, it is a folklore result that, in the case of direct communication, at least f+1 confirmations are necessary because f of them could be false.
When information is transmitted along arbitrary, possibly branching and intersecting chains of messages,
the situation is even more complex and defies simplistic direct analysis. In particular, as shown by the
counterexample in [16, Fig. 1], one cannot rely on Menger’s Theorem [3] for separating nodes in the
two-dimensional agent–time graph.
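The folklore f+1 rule for direct communication fits in one line; the sketch below is only an illustration under the assumption that confirmations arrive directly from distinct reporters (as the text explains, no such simple rule survives once chains may branch and intersect):

```python
def confirmed(reporters, f):
    """Folklore rule for direct reports: an event can be trusted only if
    at least f+1 distinct agents vouch for it, since up to f of the
    reporters may be byzantine and lying.  `reporters` is any iterable
    of agent identifiers; duplicates do not count twice."""
    return len(set(reporters)) >= f + 1
```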
Major contributions: In this paper, we generalize the causality structure of asynchronous distributed
systems described above to multi-agent systems involving byzantine faulty agents. Relying on our novel
byzantine runs-and-systems framework [12] (described in full detail in [13]), we utilize some generic
epistemic analysis results for determining the shape of the byzantine analog of Lamport’s causal cone.
Since knowledge of an event is too strong a precondition in the presence of byzantine agents, it has to be
relaxed to something more akin to belief relative to correctness [18], for which we coined the term hope.
We show that hope can only be achieved via a causal message chain that passes solely through correct
agents (more precisely, through agents still correct while sending the respective messages). While the
result looks natural enough, its formal proof is quite involved technically and paints an instructive pic-
ture of how byzantine agents can affect the information flow. We also establish a necessary condition for
detecting an event, and a corresponding communication structure (called a multipede), which is severely
complicated by the fact that the reliable causal cones of indistinguishable runs may be different.
Paper organization: In Sect. 2, we succinctly introduce the features of our byzantine runs-and-
¹Despite having “Byzantine” in the title, [4, 10] only address benign faults (crashes, send/receive omissions of messages).
systems framework [13] and state some generic theorems and lemmas needed for proving the results
of the paper. In Sect. 3, we describe the mechanism of run modifications, which are used to remove
events an agent should not know about from a run, without the agent noticing. Our characterization of the byzantine causal cone is provided in Sect. 4; the necessary conditions for establishing hope for an occurrence of an event and the underlying multipede structure can be found in Sect. 5. Some conclusions in Sect. 6 round off the paper.
2 Runs-and-Systems Framework for Byzantine Agents
First, we describe the modifications of the runs-and-systems framework [5] necessary to account for byzantine behavior. To prevent wasting space on multiple definition environments, we give the following series of formal definitions as ordinary text, marking defined objects by italics; consult [13] for the same definitions in a fully spelled-out format. As a further space-saving measure, instead of repeating every time “actions and/or events,” we use haps² as a general term referring to either actions or events.
The goal of all these definitions is to formally describe a system where asynchronous agents 1, ..., n perform actions according to their protocols, observe events, and exchange messages within an environment represented as a special agent ε. Unlike the environment, agents only have limited local information; in particular, being asynchronous, they do not have access to the global clock. No assumptions apart from liveness are made about the communication. Messages can be lost, arbitrarily delayed, and/or delivered in the wrong order. This part of the system is a fairly standard asynchronous system with unreliable communication. The novelty is that the environment may additionally cause at most f agents to become faulty in arbitrary ways. A faulty agent can perform any of its actions irrespective of its protocol and observe events that did not happen, e.g., receive unsent or corrupted messages. It can also have false memories about actions it has performed. At the same time, much like the global clock, such malfunctions are not directly visible to an agent, especially when it mistakenly thinks it acted correctly.
We fix a finite set A = {1, ..., n} of agents. Agent i ∈ A can perform actions a ∈ Actions_i, e.g., send messages, and witness events e ∈ Events_i such as message delivery. We denote Haps_i := Actions_i ⊔ Events_i. The action of sending a copy numbered k of a message µ ∈ Msgs to an agent j ∈ A is denoted send(j, µ_k), whereas a receipt of such a message from i ∈ A is recorded locally as recv(i, µ).³
Agent i records actions from Actions_i and observes events from Events_i without dividing them into correct and faulty. The environment ε, on the contrary, always knows if the agent acted correctly or was forced into byzantine behavior. Hence, the syntactic representations of each hap for agents (local view) and for the environment (global view) must differ, with the latter containing more information. In particular, the global view syntactically distinguishes correct haps from their byzantine counterparts. While there is no way for an agent to distinguish a real event from its byzantine duplicate, it can analyze its recorded actions and compare them with its protocol. Sometimes, this information might be sufficient for the agent to detect its own malfunctions.
All of Actions := ⋃_{i∈A} Actions_i, Events := ⋃_{i∈A} Events_i, and Haps := Actions ∪ Events represent the local view of haps. All haps taking place after a timestamp t ∈ T := ℕ and no later than t+1 are grouped into a round denoted t+½ and are treated as happening simultaneously. To model asynchronous agents, we exclude these system timestamps from the local format of Haps. At the same time, the environment ε incorporates the current timestamp t into the global format of every correct action a ∈ Actions_i, as initiated by agent i in the local format, via a one-to-one function global(i, t, a). Timestamps are especially crucial for proper message processing with global(i, t, send(j, µ_k)) := gsend(i, j, µ, id(i, j, µ, k, t)) for some one-to-one function id : A × A × Msgs × ℕ × T → ℕ that assigns each sent message a unique global message identifier (GMI). We chose not to model agent-to-agent channels explicitly. With all messages effectively sent through one system-wide channel, these GMIs are needed to ensure the causality of message delivery, i.e., that only sent messages can be delivered correctly. The sets GActions_i := {global(i, t, a) | t ∈ T, a ∈ Actions_i} of all possible correct actions for each agent in global format are pairwise disjoint due to the injectivity of global. We set GActions := ⨆_{i∈A} GActions_i.
²Cf. “Till I know ’tis done, Howe’er my haps, my joys were ne’er begun.” W. Shakespeare, Hamlet, Act IV, Scene 3.
³Thus, it is possible to send several copies of the same message in the same round. If one or more of such copies are received in the same round, however, the recipient does not know which copy it has received, nor that there have been multiple copies.
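The local-to-global translation and its inverse can be illustrated in Python; this is a toy encoding of our own (tuples standing in for the paper's gsend, grecv, and local), with the GMI built as the tuple of the id function's arguments, which makes it trivially one-to-one:

```python
def global_send(i, j, msg, k, t):
    """Translate i's local action send(j, msg_k) at time t into the
    global format, attaching a GMI.  Building the GMI as the tuple of
    all arguments makes the id function trivially injective."""
    gmi = (i, j, msg, k, t)
    return ('gsend', i, j, msg, k, gmi)

def grecv(j, i, msg, gmi):
    """Correct delivery of msg (with identifier gmi) from i to j,
    produced by the environment directly in the global format."""
    return ('grecv', j, i, msg, gmi)

def local(ghap):
    """Convert a correct global hap to the local format: for sends this
    reverses global_send; for deliveries the GMI is stripped, since it
    encodes the sending time, which agents must not see."""
    if ghap[0] == 'gsend':
        _, i, j, msg, k, _gmi = ghap
        return ('send', j, msg, k)
    _, j, i, msg, _gmi = ghap
    return ('recv', i, msg)
```

Note that two deliveries with different GMIs collapse to the same local record, mirroring the fact that the recipient cannot tell which copy it received.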
Unlike correct actions, correct events witnessed by agents are generated by the environment ε and, hence, can be assumed to be produced already in the global format \overline{GEvents}_i. We define \overline{GEvents} := ⨆_{i∈A} \overline{GEvents}_i assuming them to be pairwise disjoint, and \overline{GHaps} = \overline{GEvents} ⊔ GActions. We do not consider the possibility of the environment violating its protocol, which is meant to model the fundamental physical laws of the system. Thus, all events that can happen are considered correct. A byzantine event is, thus, a subjective notion. It is an event that was perceived by an agent despite not taking place. In other words, each correct event E ∈ \overline{GEvents}_i has a faulty counterpart fake(i, E), and agent i cannot distinguish the two. An important type of correct global events of agent j is the delivery grecv(j, i, µ, id) ∈ \overline{GEvents}_j of message µ with GMI id ∈ ℕ sent by agent i. Note that the GMI, which is used by the global format to ensure causality, must be removed before the delivery is recorded by the agent in the local format because GMIs contain the time of sending, which should not be accessible to agents. To strip this information before updating local histories, we employ a function local : \overline{GHaps} → Haps converting correct haps from the global into the local format in such a way that for actions local reverses global, i.e., local(global(i, t, a)) := a. For message deliveries, local(grecv(j, i, µ, id)) := recv(i, µ), i.e., agent j only knows that it received message µ from agent i. It is, thus, possible for two distinct correct global events, e.g., grecv(j, i, µ, id) and grecv(j, i, µ, id′), representing the delivery of different copies of the same message µ, possibly sent by i at different times, to be recorded by j the same way, as recv(i, µ).
Therefore, correct actions are initiated by agents in the local format and translated into the global format by the environment. Correct and byzantine events are initiated by the environment in the global format and translated into the local format before being recorded by agents.⁴ We will now turn our attention to byzantine actions.
While a faulty event is purely an error of perception, actions can be faulty in another way: they can violate the protocol. The crucial question is: who should be responsible for such violations? With agents’ actions governed by their protocols while everything else is up to the environment, it seems that errors, especially unintended errors, should be the environment’s responsibility. A malfunctioning agent tries to follow its protocol but fails for reasons outside of its control, i.e., due to environment interference. A malicious agent tries to hide its true intentions from other agents by pretending to follow its expected protocol and, thus, can also be modeled via environment interference. Thus, we model faulty actions as byzantine events of the form fake(i, A ↦ A′) where A, A′ ∈ GActions_i ⊔ {noop} for a special non-action noop in global format. Here A is the action (or, in case of noop, inaction) actually performed, while A′ represents the action (inaction) perceived instead by the agent. More precisely, the agent either records a′ = local(A′) ∈ Actions_i if A′ ∈ GActions_i or has no record of this byzantine action if A′ = noop. The byzantine inaction fail(i) := fake(i, noop ↦ noop) is used to make agent i faulty without performing any actions and without leaving a record in i’s local history. The set of all i’s byzantine events, corresponding to both faulty events and faulty actions, is denoted BEvents_i, with BEvents := ⨆_{i∈A} BEvents_i.
⁴This has already been described for correct events. A byzantine event is recorded the same way as its correct counterpart.
To prevent our asynchronous agents from inferring the global clock by counting rounds, we make
waking up for a round contingent on the environment issuing a special system event go(i) for the agent in question. Agent i’s local view of the system immediately after round t+½, referred to as (process-time or agent-time) node (i, t+1), is recorded in i’s local state r_i(t+1), also called i’s local history. Nodes (i, 0) correspond to initial local states r_i(0) ∈ Σ_i, with G(0) := ∏_{i∈A} Σ_i. If a round contains neither go(i) nor any event to be recorded in i’s local history, then the said history r_i(t+1) = r_i(t) remains unchanged, denying the agent the knowledge of the round just passed. Otherwise, r_i(t+1) = X : r_i(t), for X ⊆ Haps_i, the set of all actions and events perceived by i in round t+½, where : stands for concatenation. The exact definition will be given via the update_i function, to be described shortly. Thus, the local history r_i(t) is a list of all haps as perceived by i in rounds it was active in. The set of all local states of i is L_i.
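The update rule for i’s local history can be sketched as follows; this is a toy model of our own in which histories are Python lists with the newest round first (matching the concatenation X : r_i(t)) and the presence of a record-forcing system event is abstracted to a single boolean:

```python
def update_local(history, perceived, forced_awake):
    """One round of agent i's local history.  If nothing was perceived
    and no system event forces a record, the history -- and hence i's
    notion of time -- is unchanged; otherwise the set X of perceived
    haps, possibly empty, is prepended (newest round first)."""
    if not perceived and not forced_awake:
        return history                       # round passes unnoticed
    return [frozenset(perceived)] + history  # X : r_i(t)
```

An empty round with a forced wake-up appends the empty set, so the agent learns that a round passed without learning the global time.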
While not necessary for asynchronous agents, for future backwards compatibility, we add more system events for each agent, to serve as faulty counterparts to go(i). Commands sleep(i) and hibernate(i) signify a failure to activate the agent’s protocol and differ in that the former enforces waking up the agent (and thus recording time) notwithstanding. These commands will be used, e.g., for synchronous systems. None of the system events SysEvents_i := {go(i), sleep(i), hibernate(i)} is directly detectable by agents.
To summarize, GEvents_i := \overline{GEvents}_i ⊔ BEvents_i ⊔ SysEvents_i with GEvents := ⨆_{i∈A} GEvents_i and GHaps := GEvents ⊔ GActions. Throughout the paper, horizontal bars signify phenomena that are correct. Note that the absence of this bar means the absence of a claim of correctness. It does not necessarily imply a fault. Later, this would also apply to formulas, e.g., \overline{occurred}_i(e) demands a correct occurrence of an event e whereas occurred_i(e) is satisfied by either correct or faulty occurrence.
We now turn to the description of runs and protocols for our byzantine-prone asynchronous agents. A run r is a sequence of global states r(t) = (r_ε(t), r_1(t), ..., r_n(t)) of the whole system consisting of the state r_ε(t) of the environment and local states r_i(t) of every agent. We already discussed the composition of local histories. Similarly, the environment’s history r_ε(t) is a list of all haps that happened, this time faithfully recorded in the global format. Accordingly, r_ε(t+1) = X : r_ε(t) for the set X ⊆ GHaps of all haps from round t+½. The set of all global states is denoted G.
What happens in each round is determined by protocols P_i of agents, protocol P_ε of the environment, and chance, the latter implemented as the adversary part of the environment. Agent i’s protocol P_i : L_i → ℘(℘(Actions_i)) \ {∅} provides a range P_i(r_i(t)) of sets of actions based on i’s current local state r_i(t), with the view of achieving some collective goal. Recall that the global timestamp t is not part of r_i(t). The control of all events (correct, byzantine, and system) lies with the environment ε via its protocol P_ε : T → ℘(℘(GEvents)) \ {∅}, which can depend on a timestamp t ∈ T but not on the current state. The environment’s protocol is thus kept impartial by denying it an agenda based on the global history so far. Other parts of the environment must, however, have access to the global history, in particular, to ensure causality. Thus, the environment’s protocol provides a range P_ε(t) of sets of events. Protocols P_i and P_ε are non-deterministic and always provide at least one option. The choice among the options (if more than one) is arbitrarily made by the already mentioned adversary part of the environment. It is also required that all events from P_ε(t) be mutually compatible at time t. These t-coherency conditions are: (a) no more than one system event go(i), sleep(i), and hibernate(i) per agent i at a time; (b) a correct event perceived as e by agent i is never accompanied by a byzantine event that i would also perceive as e, i.e., an agent cannot be mistaken about witnessing an event that did happen; (c) the GMI of a byzantine sent message is the same as if a copy of the same message were sent correctly in the same round. Note that the prohibition (b) does not extend to correct actions.
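Conditions (a) and (b) of t-coherency are easy to check mechanically; the sketch below is a toy checker of our own (condition (c), on GMIs of byzantine sends, is omitted, and the event encoding as tagged tuples is an assumption):

```python
def t_coherent(events):
    """Check conditions (a) and (b) of t-coherency on a candidate event
    set.  Events are tuples ('sys', i, name), ('correct', i, e), or
    ('fake', i, e), where e is i's local perception of the event."""
    sys_seen = set()
    for kind, i, _x in events:
        if kind == 'sys':
            if i in sys_seen:       # (a) at most one system event per agent
                return False
            sys_seen.add(i)
    correct = {(i, e) for kind, i, e in events if kind == 'correct'}
    fake = {(i, e) for kind, i, e in events if kind == 'fake'}
    return not (correct & fake)     # (b) no correct/fake clash for same e
```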
Both the global run r : T → G and its local parts r_i : T → L_i provide a sequence of snapshots of the system and local states respectively. Given the joint protocol P := (P_1, ..., P_n) and the environment’s protocol P_ε, we focus on τ_{f,P_ε,P}-transitional runs r that result from following these protocols and are built according to a transition relation τ_{f,P_ε,P} ⊆ G × G for asynchronous agents at most f ≥ 0 of which may
[Figure 1: a schematic of round t+½ of a τ_{f,P_ε,P}-transitional run r, showing the five phases in order: the protocol phase (protocols P_ε and P_1, ..., P_n produce the ranges P_ε(t) and P_i(r_i(t))), the adversary phase (the adversary picks X_ε and X_1, ..., X_n), the labeling phase (global translates each X_i into α^t_i(r), with X_ε = α^t_ε(r)), the filtering phase (filter_ε and filter_i produce β^t_ε(r) and β^t_i(r)), and the updating phase (update_ε and update_i produce r(t+1)).]
Figure 1: Details of round t+½ of a τ_{f,P_ε,P}-transitional run r.
become faulty in a given run. In this paper, we only deal with generic f, P_ε, and P. Hence, whenever safe, we write τ in place of τ_{f,P_ε,P}. Each transitional run begins in some initial global state r(0) ∈ G(0) and progresses by ensuring that r(t) τ r(t+1), i.e., (r(t), r(t+1)) ∈ τ, for each timestamp t ∈ T. Given f, P_ε, and P, the transition relation τ_{f,P_ε,P} consisting of five consecutive phases is graphically represented in Figure 1 and described in detail below:
1. Protocol phase. A range P_ε(t) ⊆ ℘(GEvents) of t-coherent sets of events is determined by the environment’s protocol P_ε; for each i ∈ A, a range P_i(r_i(t)) ⊆ ℘(Actions_i) of sets of i’s actions is determined by the agents’ joint protocol P.
2. Adversary phase. The adversary non-deterministically picks a t-coherent set X_ε ∈ P_ε(t) and a set X_i ∈ P_i(r_i(t)) for each i ∈ A.
3. Labeling phase. Locally represented actions in the X_i’s are translated into the global format: α^t_i(r) := {global(i, t, a) | a ∈ X_i} ⊆ GActions_i. In particular, correct sends are supplied with GMIs.
4. Filtering phase. Functions filter_ε and filter_i for each i ∈ A remove all causally impossible attempted events from α^t_ε(r) := X_ε and actions from α^t_i(r).
4.1. First, filter_ε filters out causally impossible events based (a) on the current global state r(t), which could not have been accounted for by the protocol P_ε, (b) on α^t_ε(r), and (c) on all α^t_i(r), not accessible for P_ε either. Specifically, two kinds of events are causally impossible for asynchronous agents with at most f byzantine failures and are removed by filter_ε in two stages as follows (formal definitions can be found in the appendix, Definitions A.16–A.17; cf. also [13] for details):
(1) in the 1st stage, all byzantine events are removed by filter^f_ε if they would have resulted in more than f faulty agents in total;
(2) in the 2nd stage, correct receives without matching sends (either in the history r(t) or in the current round) are removed by filter^B_ε.
The resulting set of events to actually occur in round t+½ is denoted β^t_ε(r) := filter_ε(r(t), α^t_ε(r), α^t_1(r), ..., α^t_n(r)).
4.2. After events are filtered, filter_i for each agent i removes all i’s actions iff go(i) ∉ β^t_ε(r). The resulting sets of actions to be actually performed by agents in round t+½ are β^t_i(r) := filter_i(α^t_1(r), ..., α^t_n(r), β^t_ε(r)). We have β^t_i(r) ⊆ α^t_i(r) ⊆ GActions_i and β^t_ε(r) ⊆ α^t_ε(r) ⊆ GEvents.
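The filtering phase can be sketched end to end in Python; this is a toy data model of our own (tagged tuples for events and actions, strings for GMIs), not the formal Definitions A.16–A.17:

```python
def filter_round(f, faulty_so_far, events, actions_by_agent, sent_gmis):
    """Sketch of the filtering phase.  Stage 1 drops byzantine events
    that would push the number of faulty agents above f; stage 2 drops
    correct receives whose GMI matches no send in the history or the
    current round; finally, agent i's actions are dropped wholesale
    unless go(i) survived among the events."""
    # Stage 1: cap the number of faulty agents at f.
    survivors, faulty = [], set(faulty_so_far)
    for ev in events:
        if ev[0] == 'fake':
            i = ev[1]
            if len(faulty | {i}) <= f:
                faulty.add(i)
                survivors.append(ev)
        else:
            survivors.append(ev)
    # Stage 2: causality of delivery -- receives need a matching send.
    gmis = set(sent_gmis) | {a[-1] for acts in actions_by_agent.values()
                             for a in acts if a[0] == 'gsend'}
    survivors = [ev for ev in survivors
                 if ev[0] != 'grecv' or ev[-1] in gmis]
    # Agent filter: without go(i), agent i performs no actions this round.
    woken = {ev[1] for ev in survivors if ev[0] == 'sys' and ev[2] == 'go'}
    performed = {i: acts if i in woken else []
                 for i, acts in actions_by_agent.items()}
    return survivors, performed
```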
5. Updating phase. The resulting mutually causally consistent sets of events β^t_ε(r) and of actions β^t_i(r) are appended to the global history r(t); for each i ∈ A, all non-system events from β^t_{ε→i}(r) := β^t_ε(r) ∩ GEvents_i as perceived by the agent and all correct actions β^t_i(r) are appended in the local form to the local history r_i(t), which may remain unchanged if no action or event triggers an update or be appended with the empty set if an update is triggered only by a system event go(i) or sleep(i):
r_ε(t+1) := update_ε(r_ε(t), β^t_ε(r), β^t_1(r), ..., β^t_n(r)); (1)
r_i(t+1) := update_i(r_i(t), β^t_i(r), β^t_ε(r)). (2)
Formal definitions of update_ε and update_i are given in Def. A.19 in the appendix.
The protocols P and P_ε only affect phase 1, so we group the operations in the remaining phases 2–5 into a transition template τ_f that computes a transition relation τ_{f,P_ε,P} for any given P and P_ε. This transition template, primarily via the filtering functions, represents asynchronous agents with at most f faults. The template can be modified independently from the protocols to capture other distributed scenarios.
Liveness and similar properties that cannot be ensured on a round-by-round basis are enforced by restricting the allowable runs by admissibility conditions Ψ, which formally are subsets of the set R of all transitional runs. For example, since no goal can be achieved without allowing agents to act from time to time, it is standard to impose the Fair Schedule (FS) admissibility condition, which for byzantine agents states that an agent can only be delayed indefinitely through persistent faults:
FS := {r ∈ R | (∀i ∈ A)(∀t ∈ T)(∃t′ ≥ t) β^{t′}_ε(r) ∩ SysEvents_i ≠ ∅}.
In scheduling terms, FS ensures that each agent be considered for using CPU time infinitely often. Denying any of these requests constitutes a failure, represented by a sleep(i) or hibernate(i) system event.
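FS is a liveness property of infinite runs, so it cannot be decided from a finite prefix; a common bounded proxy is to demand a system event for every agent in every window of w consecutive rounds, which, if it persists forever, implies FS. The sketch below checks this proxy on a finite prefix (the window criterion and the data model are our own assumptions, not part of the paper):

```python
def fair_schedule_window(rounds, agents, w):
    """Bounded proxy for FS on a finite prefix: every agent must receive
    some system event (go/sleep/hibernate) in every window of w
    consecutive rounds.  `rounds` is a list of sets of (agent, sysevent)
    pairs, one set per round; prefixes shorter than w pass vacuously."""
    for i in agents:
        for start in range(len(rounds) - w + 1):
            window = rounds[start:start + w]
            if not any(a == i for evs in window for a, _ in evs):
                return False
    return True
```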
We now combine all these parts in the notions of context and agent-context:
Definition 1. A context γ = (P_ε, G(0), τ_f, Ψ) consists of an environment’s protocol P_ε, a set of global initial states G(0), a transition template τ_f for f ≥ 0, and an admissibility condition Ψ. For a joint protocol P, we call χ = (γ, P) an agent-context. A run r : T → G is called weakly χ-consistent if r(0) ∈ G(0) and the run is τ_{f,P_ε,P}-transitional. A weakly χ-consistent run r is called (strongly) χ-consistent if r ∈ Ψ. The set of all χ-consistent runs is denoted R_χ ⊆ R. Agent-context χ is called non-excluding if any finite prefix of a weakly χ-consistent run can be extended to a χ-consistent run.
We are also interested in narrower types of faults. Let FEvents_i := BEvents_i ⊔ {sleep(i), hibernate(i)}.
Definition 2. Environment’s protocol P_ε makes an agent i ∈ A:
1. correctable if X ∈ P_ε(t) implies that X \ FEvents_i ∈ P_ε(t);
2. delayable if X ∈ P_ε(t) implies X \ GEvents_i ∈ P_ε(t);
3. error-prone if X ∈ P_ε(t) implies that, for any Y ⊆ FEvents_i, the set Y ⊔ (X \ FEvents_i) ∈ P_ε(t) whenever it is t-coherent;
4. gullible if X ∈ P_ε(t) implies that, for any Y ⊆ FEvents_i, the set Y ⊔ (X \ GEvents_i) ∈ P_ε(t) whenever it is t-coherent;
5. fully byzantine if agent i is both error-prone and gullible.
In other words, correctable agents can always be made correct for the round by removing all their byzantine events; delayable agents can always be forced to skip a round completely (which does not make them faulty); error-prone (gullible) agents can exhibit any faults in addition to (in place of) correct events, thus, implying correctability (delayability); fully byzantine agents’ faults are unrestricted. Common types of faults, e.g., crash or omission failures, can be obtained by restricting allowable sets Y in the definition of gullible agents.
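The closure conditions of Definition 2 can be illustrated by explicitly closing a protocol range under the corresponding removals; the sketch below is a toy model of our own in which events are strings and P_ε(t) is a set of frozensets:

```python
def make_correctable(protocol_range, fevents_i):
    """Close a range of candidate event sets under removing agent i's
    failure events FEvents_i, as required for i to be correctable."""
    closed = set(protocol_range)
    closed.update(frozenset(X - fevents_i) for X in protocol_range)
    return closed

def make_delayable(protocol_range, gevents_i):
    """Likewise for delayability: i can always be made to skip the
    round by removing all of i's events GEvents_i."""
    closed = set(protocol_range)
    closed.update(frozenset(X - gevents_i) for X in protocol_range)
    return closed
```

Error-proneness and gullibility would additionally close the range under splicing in arbitrary t-coherent sets Y ⊆ FEvents_i, which we omit here.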
Now that our byzantine version of the runs-and-systems framework is laid out, we define interpreted systems in this framework in the usual way, i.e., as special kinds of Kripke models for multi-agent distributed environments [5]. For an agent-context χ, we consider pairs (r, t) ∈ R_χ × T of a χ-consistent run r and timestamp t. A valuation function π : Prop → ℘(R_χ × T) determines whether an atomic proposition from Prop is true in run r at time t. The determination is arbitrary except for a small set of designated atomic propositions whose truth value at (r, t) is fully determined. More specifically, for i ∈ A, o ∈ Haps_i, and t′ ∈ T such that t′ ≤ t,
• correct(i, t′) is true at (r, t), or node (i, t′) is correct in run r, iff no faulty event happened to i by timestamp t′, i.e., no event from FEvents_i appears in the r_ε(t′) prefix of the r_ε(t) part of r(t);
• correct_i is true at (r, t) iff correct(i, t) is;
• fake(i, t′)(o) is true at (r, t) iff i has a faulty reason to believe that o ∈ Haps_i occurred in round t′−½, i.e., o ∈ r_i(t′) because (at least in part) of some O ∈ BEvents_i ∩ β^{t′−1}_ε(r);
• \overline{occurred}(i, t′)(o) is true at (r, t) iff i has a correct reason to believe o ∈ Haps_i occurred in round t′−½, i.e., o ∈ r_i(t′) because (at least in part) of O ∈ (\overline{GEvents}_i ∩ β^{t′−1}_ε(r)) ⊔ β^{t′−1}_i(r);
• \overline{occurred}_i(o) is true at (r, t) iff at least one of \overline{occurred}(i, m)(o) for 1 ≤ m ≤ t is;
• \overline{occurred}(o) is true at (r, t) iff at least one of \overline{occurred}_i(o) for i ∈ A is;
• occurred_i(o) is true at (r, t) iff either \overline{occurred}_i(o) is or at least one of fake(i, m)(o) for 1 ≤ m ≤ t is.
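For a single agent, the interplay between the occurred- and fake-propositions can be sketched as follows; the round-by-round record of correct and faulty perceptions is a toy stand-in of our own for the β-sets above:

```python
def evaluate(history, o):
    """Evaluate the designated propositions for one agent from a toy
    record: history[t] is a pair (correct, faulty) of sets of locally
    perceived haps from round t+1/2.  Returns the truth values of
    occurred-bar_i(o), 'some fake(i,m)(o)', and occurred_i(o)."""
    occurred_bar = any(o in corr for corr, _ in history)
    fake_any = any(o in flt for _, flt in history)
    occurred_i = occurred_bar or fake_any  # perceived, rightly or not
    return occurred_bar, fake_any, occurred_i
```

Note how occurred_i(o) (no bar) makes no claim of correctness: it holds as soon as the hap was perceived, for a correct or a faulty reason.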
An interpreted system is a pair I = (R_χ, π). The epistemic language
φ ::= p | ¬φ | (φ ∧ φ) | K_i φ
where p ∈ Prop and i ∈ A and derived Boolean connectives are defined in the usual way. Truth for these (epistemic) formulas is defined in the standard way, in particular, for a run r ∈ R_χ, timestamp t ∈ T, atomic proposition p ∈ Prop, agent i ∈ A, and formula φ we have (I, r, t) |= p iff (r, t) ∈ π(p) and (I, r, t) |= K_i φ iff (I, r′, t′) |= φ for any r′ ∈ R_χ and t′ ∈ T such that r_i(t) = r′_i(t′). A formula φ is valid in I, written I |= φ, iff (I, r, t) |= φ for all r ∈ R_χ and t ∈ T.
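The truth clause for K_i can be model-checked over a finite stand-in for R_χ × T; the following sketch is our own toy illustration (local states are looked up in a dict, and φ is a Python predicate):

```python
def knows(points, local_state, i, phi):
    """Semantics of K_i over a finite set of points: agent i knows phi
    at (r, t) iff phi holds at every point (r', t') that i cannot
    distinguish, i.e., every point with the same local state r_i(t).
    `points` is a list of (run, time) pairs; `local_state` maps
    (run, time, agent) to r_i(t); `phi` maps (run, time) to a bool."""
    def K(run, t):
        here = local_state[(run, t, i)]
        return all(phi(r2, t2) for (r2, t2) in points
                   if local_state[(r2, t2, i)] == here)
    return K
```

If two points share i's local state and φ fails at one of them, then K_i φ fails at both, which is exactly the mechanism the run modifications of Sect. 3 exploit.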
Due to the t-coherency of all allowed protocols P_ε, an agent cannot be both right and wrong about any local event e ∈ Events_i, i.e., I |= ¬(\overline{occurred}(i, t)(e) ∧ fake(i, t)(e)). Note that for actions this can happen.
Following the concept from [6] of global events that are local for an agent, we define conditions under which formulas can be treated as such local events. A formula φ is called localized for i within an agent-context χ iff r_i(t) = r′_i(t′) implies (I, r, t) |= φ ⇔ (I, r′, t′) |= φ for any I = (R_χ, π), runs r, r′ ∈ R_χ, and timestamps t, t′ ∈ T. By these definitions, we immediately obtain:
Lemma 3. The following statements are valid for any formula φ localized for an agent i ∈ A within an agent-context χ and any interpreted system I = (R_χ, π): I |= φ → K_i φ and I |= ¬φ → K_i ¬φ.
The knowledge of preconditions principle [17] postulates that in order to act on a precondition an agent must be able to infer it from its local state. Thus, Lemma 3 shows that formulas localized for i can always be used as preconditions. Our first observation is that the agent’s perceptions of a run are one example of such epistemically acceptable (though not necessarily reliable) preconditions:
Lemma 4. For any agent-context χ, agent i ∈ A, and local hap o ∈ Haps_i, the formula occurred_i(o) is localized for i within χ.
It can be shown that correctness of these perceptions is not localized for i and, hence, cannot be the basis for actions. In fact, Theorem 12 will reveal that no agent can establish its own correctness.
3 Run modifications

We will now introduce the pivotal technique of run modifications, which is used to show that an agent does not know ϕ by constructing an indistinguishable run where ϕ is false.
Definition 5. A function ρ : R^χ → 2^{GActions_i} × 2^{GEvents_i} is called an i-intervention for an agent-context χ and agent i ∈ A. A joint intervention B = (ρ_1, ..., ρ_n) consists of i-interventions ρ_i for each agent i ∈ A. An adjustment [B_t; ...; B_0] is a sequence of joint interventions B_0, ..., B_t to be performed at rounds from ½ to t+½.
We consider an i-intervention ρ(r) = (X, X_ε) applied to round t+½ of a given run r to be a meta-action by the system designer, intended to modify the results of this round for i in such a way that β^t_i(r′) = X and β^t_{ε→i}(r′) = β^t_ε(r′) ∩ GEvents_i = X_ε in the artificially constructed new run r′. For ρ(r) = (X, X_ε), we denote a_ρ(r) := X and e_ρ(r) := X_ε. Accordingly, a joint intervention (ρ_1, ..., ρ_n) prescribes actions β^t_i(r′) = a_{ρ_i}(r) for each agent i and events β^t_ε(r′) = ⨆_{i∈A} e_{ρ_i}(r) for the round in question. Thus, an adjustment [B_t; ...; B_0] fully determines actions and events in the initial t+1 rounds of run r′:
Definition 6. Let adj = [B_t; ...; B_0] be an adjustment where B_m = (ρ^m_1, ..., ρ^m_n) for each 0 ≤ m ≤ t and each ρ^m_i is an i-intervention for an agent-context χ = ((P_ε, G(0), τ_f, Ψ), P). A run r′ is obtained from r ∈ R^χ by adjustment adj iff for all t′ ≤ t, all T > t, and all i ∈ A,

(a) r′(0) := r(0),
(b) r′_i(t′+1) := update_i(r′_i(t′), a_{ρ^{t′}_i}(r), ⨆_{j∈A} e_{ρ^{t′}_j}(r)),
(c) r′_ε(t′+1) := update_ε(r′_ε(t′), ⨆_{j∈A} e_{ρ^{t′}_j}(r), a_{ρ^{t′}_1}(r), ..., a_{ρ^{t′}_n}(r)),
(d) r′(T) →_{τ_f, P_ε, P} r′(T+1).

R(τ_f, P_ε, P, r, adj) is the set of all runs obtained from r by adj.
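A minimal sketch of how items (a)–(c) of Def. 6 pin down the first rounds of r′, assuming the interventions have already been reduced to their (a_ρ, e_ρ) pairs and that stand-in update functions are supplied; all names and data structures here are illustrative, not the framework's own.

```python
def apply_adjustment(r0_local, r0_env, adjustment, update_local, update_env):
    """Build local and environment states of r' round by round (Def. 6 (a)-(c)).

    adjustment: list of joint interventions B_0 ... B_{t-1}; each B_m maps an
    agent i to the pair (a_rho, e_rho) prescribed for round m + 1/2.
    """
    local = {i: [s0] for i, s0 in r0_local.items()}   # (a): r'_i(0) := r_i(0)
    env = [r0_env]                                    #      r'_eps(0) := r_eps(0)
    for B in adjustment:
        # the big disjoint union of all e_rho sets for this round
        all_events = set().union(*(ev for (_, ev) in B.values()))
        for i, (actions, _) in B.items():
            # (b): r'_i(m+1) := update_i(r'_i(m), a_rho_i, round events)
            local[i].append(update_local(i, local[i][-1], actions, all_events))
        # (c): r'_eps(m+1) := update_eps(r'_eps(m), round events, all actions)
        env.append(update_env(env[-1], all_events,
                              {i: a for i, (a, _) in B.items()}))
    return local, env
```

Item (d) is deliberately absent: whether the resulting prefix extends to a transitional run is exactly what Lemma 10 below has to prove.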
Note that adjusted runs need not a priori be transitional, i.e., obey (d), for t′ ≤ t. Of course, we intend to use adjustments in such a way that r′ is a transitional run, but this requires a separate proof. In order to improve the readability of these proofs, we allow ourselves (and have already used) a small abuse of notation. The β-sets β^t_ε(r) and β^t_i(r) were initially defined only for transitional runs as the result of filtering. But they also represent the sets of events and i's actions, respectively, happening in round t+½. This alternative definition is equivalent for transitional runs and, in addition, can be used for adjusted runs r′. This is what we mean whenever we write β-sets for runs obtained by adjustments.
In order to minimize an agent's knowledge in accordance with the structure of its (soon to be defined) reliable (or byzantine) causal cone, we will use several types of i-interventions that copy round t+½ of the original run to various degrees: (a) CFreeze denies i all actions and events, (b) FakeEcho^t_i reproduces all messages sent by i but in byzantine form, (c) X-Focus^t_i (for an appropriately chosen set X ⊆ A×T) faithfully reproduces all actions and as many events as causally possible.
Definition 7. For an agent-context χ, i ∈ A, and r ∈ R^χ, we define the following i-interventions:

CFreeze(r) := (∅, ∅). (3)

FakeEcho^t_i(r) := (∅, {fail(i)} ⊔ {fake(i, gsend(i,j,µ,id) ↦ noop) | gsend(i,j,µ,id) ∈ β^t_i(r) ∨ (∃A) fake(i, gsend(i,j,µ,id) ↦ A) ∈ β^t_ε(r)}). (4)

X-Focus^t_i(r) := (β^t_i(r), β^t_{ε→i}(r) \ {grecv(i,j,µ,id(j,i,µ,k,m)) | (j,m) ∉ X, k ∈ ℕ}). (5)
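The three interventions can be sketched over a simplified round representation as follows. The tuple encodings of sends, fakes, and receives below are our own simplification of the framework's haps: sends are ("gsend", i, j, msg, mid), byzantine sends are ("fake_gsend", i, j, msg, mid), and receives are ("grecv", i, j, msg, mid, k, m), where (j, m) identifies the sender node, mirroring id(j,i,µ,k,m) in (5).

```python
def c_freeze(beta_actions, beta_events):
    """CFreeze (3): deny agent i all actions and events."""
    return set(), set()

def fake_echo(i, beta_actions, beta_events):
    """FakeEcho (4): replay every (correct or byzantine) send of i in byzantine
    form, plus a fail(i) event marking i as faulty."""
    sends = {a[1:] for a in beta_actions if a[0] == "gsend"} | \
            {e[1:] for e in beta_events if e[0] == "fake_gsend"}
    return set(), {("fail", i)} | {("fake_gsend",) + s for s in sends}

def focus(X, beta_actions, beta_events):
    """X-Focus (5): keep all actions, drop receives whose sender node (j, m)
    lies outside the focus area X."""
    kept = {e for e in beta_events
            if e[0] != "grecv" or (e[2], e[6]) in X}
    return set(beta_actions), kept
```

Note how FakeEcho discards the action component entirely, so that echoed messages exist only as byzantine events, while Focus is the only intervention that inspects the causal origin of a hap.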
4 The Reliable Causal Cone

Before giving formal definitions and proofs, we first explain the intuition behind our byzantine analog of Lamport's causal cone and the particular adjustments used for constructing runs with identical reliable causal cones in the main Lemma 10 of this section.

In the absence of faults [2], the only way information, say, about an event e that happens to an agent j can reach another agent i, or more precisely, its node (i,t), is via a causal chain (time progression and delivered messages) originating from j after e happened and reaching i no later than timestamp t. The set of beginnings of causal chains, together with all causal links, is called the causal cone of (i,t). The standard way of demonstrating the necessity of such a chain for i to learn about e, when expressed in our terminology, is by using an adjustment that removes all events and actions outside the causal cone. Once an adjusted run with no haps outside the causal cone is shown to be transitional and the local state of i at timestamp t is shown to be the same as in the given run, it follows that i considers it possible that e did not happen and, hence, does not know that e happened. This well-known proof is carried out in our framework in [12] (see also [13] for an extended version).
However, one subtle aspect of our formalization is also relevant for the byzantine case. We illustrate it using a minimal example. Suppose, in the given run, j_s sent exactly one message to j_r during round m+½ and it was correctly received by j_r in round l+½. In the same round, j_r itself sent its last ever message, and sent it to i. If this message to i arrived before t, then (j_r, l) is a node within the causal cone of (i,t). On the other hand, neither (j_r, l+1) nor (j_s, m) are within the causal cone. Thus, the run adjustment discussed in the previous paragraph removes the action of sending the message from j_s to j_r, which happened outside the causal cone, and, hence, makes it causally impossible for j_r to receive it despite the fact that the receipt happened within, or more precisely, on the boundary of the causal cone. On the other hand, the message sent by j_r in the same round cannot be suppressed without i noticing. Thus, suppressing all haps on the boundary of the causal cone is not an option. These considerations necessitate the use of X-Focus^l_{j_r} to remove such “ghosts” of messages instead of making an exact copy of round l+½ of the given run. To obtain Chandy–Misra's result, one needs to set X to be the entire causal cone.⁵
We now explain the complications created by the presence of byzantine faults. Because byzantine agents can lie, the existence of a causal chain is no longer sufficient for reliable delivery of information. Causal chains can now be reliable, i.e., involve only correct agents, or unreliable, whereby a byzantine agent can corrupt the transmitted information or even initiate the whole communication while pretending to be part of a longer chain. If several causal chains link a node (j,m) witnessing an event with (i,t), where the decision based on this event is to be made, then, intuitively, the information about the event can only be relied upon if at least one of these causal chains is reliable. In effect, all correct nodes, i.e., nodes (j,m) such that (I,r,t) |= correct_(j,m), are divided into three categories: those without any causal chains to (i,t), i.e., nodes outside Lamport's causal cone; those with causal chains, but only unreliable ones; and those with at least one reliable causal chain. There is, of course, a fourth category consisting of byzantine nodes, i.e., nodes (j,m) such that (I,r,t) ⊭ correct_(j,m). Since there is no way for nodes without reliable causal chains to make themselves heard, we call these nodes silent masses and apply to them the CFreeze intervention: since they cannot have an impact, they need not act. The nodes with at least one reliable causal chain to (i,t), which must be correct themselves, form the reliable causal cone and are treated the same way as Lamport's causal cone in the fault-free case, except that the removal of “ghost” messages is more involved in this case. Finally, the remaining nodes are byzantine and form a fault buffer in the
⁵ This treatment of the cone's boundary could be perceived as overly pedantic, but in our view this is preferable to being insufficiently precise.
way of reliable information. Their role is to make the run appear the same regardless of what the silent masses do. We will show that FakeEcho^m_j suffices, since only messages sent from the fault buffer matter.

Before stating our main Lemma 10, which constructs an adjusted run that leaves agent i at t in the same position while removing as many haps as possible, it should be noted that our analysis relies on knowing which agents are byzantine in the given run, which may easily change without affecting local histories. This assumption will be dropped in the following section.
First we define simple causal links among nodes as binary relations on A×T in infix notation:

Definition 8. For all i ∈ A and t ∈ T, we have (i,t) →_l (i,t+1). Additionally, for a run r, we have (i,m) →^r_c (j,l) iff there are µ ∈ Msgs and id ∈ ℕ such that grecv(j,i,µ,id) ∈ β^{l−1}_ε(r) and either gsend(i,j,µ,id) ∈ β^m_i(r) or fake(i, gsend(i,j,µ,id) ↦ A) ∈ β^m_ε(r) for some A ∈ {noop} ⊔ GActions_i.

Causal r-links →_r := →_l ∪ →^r_c are either local or communication related. A causal r-path for a run r is a sequence ξ = ⟨θ_0, θ_1, ..., θ_k⟩, k ≥ 0, of nodes connected by causal r-links, i.e., such that θ_l →_r θ_{l+1} for each 0 ≤ l < k. This causal r-path is called reliable iff node (j_l, t_l+1) is correct in r for each θ_l = (j_l, t_l) with 0 ≤ l < k and, additionally, node θ_k = (j_k, t_k) is correct in r. We also write θ_0 ⇝^r_ξ θ_k to denote the fact that path ξ connects node θ_0 to θ_k in run r, or simply θ_0 ⇝^r θ_k to state that such a causal r-path exists.
Note that neither receives nor sends of messages forming a reliable causal r-path can be byzantine.
The latter is guaranteed by the immediate future of nodes on the path being correct.
Definition 9. The reliable causal cone RC^r_θ of node θ in run r consists of all nodes ζ ∈ A×T such that ζ ⇝^r_ξ θ for some reliable causal r-path ξ. The fault buffer FB^r_θ of node θ = (i,t) in run r consists of all nodes (j,m) with m < t such that (j,m) ⇝^r θ and (j,m+1) is not correct. Abbreviating CB^r_θ := RC^r_θ ⊔ FB^r_θ, the silent masses of node θ in run r are all the remaining nodes SM^r_θ := (A×T) \ CB^r_θ.
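Defs. 8–9 suggest a simple backward search for computing this partition from a log of delivered messages. The following sketch assumes an illustrative log format (a set of communication links between nodes) and, for finiteness, restricts attention to nodes with timestamps ≤ t; none of the names below are the framework's own.

```python
def causal_partition(theta, agents, msg_links, correct):
    """Partition nodes (j, m), m <= t, into (reliable cone, fault buffer,
    silent masses) for theta = (i, t), following Defs. 8-9.

    msg_links: set of ((i,m),(j,l)) communication links, m < l, meaning a
    message sent by i in round m+1/2 reached j's state at timestamp l.
    correct(node): whether the node is correct in the run.
    """
    i, t = theta
    def preds(node):                           # causal r-links into `node`
        j, m = node
        out = {(j, m - 1)} if m > 0 else set() # local link
        out |= {src for (src, dst) in msg_links if dst == node}
        return out
    # plain reachability: all nodes with some causal path to theta
    reach, frontier = {theta}, [theta]
    while frontier:
        for p in preds(frontier.pop()):
            if p not in reach:
                reach.add(p); frontier.append(p)
    # reliable cone: restrict the search to nodes whose immediate future is correct
    cone, frontier = {theta}, [theta]
    while frontier:
        for (k, z) in preds(frontier.pop()):
            if (k, z) not in cone and correct((k, z + 1)):
                cone.add((k, z)); frontier.append((k, z))
    buffer = {(j, m) for (j, m) in reach if m < t and not correct((j, m + 1))}
    silent = {(j, m) for j in agents for m in range(t + 1)} - cone - buffer
    return cone, buffer, silent
```

On a toy run where agent 1 messages agent 2, agent 2 messages agent 3, and agent 2 turns faulty, the search puts agent 3's own past into the cone and the faulty relay node into the fault buffer, as in the discussion above.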
Here RC^r_θ signifies reliable communication, FB^r_θ represents a barrier for correct information, whereas SM^r_θ depicts correct information isolated from its destination. We can now state the main result of this section:
Lemma 10 (Cone-equivalent run construction). For f ∈ ℕ, for a non-excluding agent-context χ = ((P_ε, G(0), τ_f, Ψ), P) such that all agents are gullible, correctable, and delayable, for any τ_f, P_ε, P-transitional run r, and for a node θ = (i,t) ∈ A×T correct in r, let adjustment adj = [B_{t−1}; ...; B_0], where B_m = (ρ^m_1, ..., ρ^m_n) for each 0 ≤ m ≤ t−1, be such that

ρ^m_j := CB^r_θ-Focus^m_j   if (j,m) ∈ RC^r_θ,
         FakeEcho^m_j       if (j,m) ∈ FB^r_θ,
         CFreeze            if (j,m) ∈ SM^r_θ. (6)

Then each r′ ∈ R(τ_f, P_ε, P, r, adj) satisfies the following properties:

(A) ((j,m) ∈ RC^r_θ) ⟹ r′_j(m) = r_j(m);
(B) (∀m ≤ t) r′_i(m) = r_i(m);
(C) for any m ≤ t, we have that β^{m−1}_ε(r′) ∩ FEvents_j ≠ ∅ iff both (j, m−1) ⇝^r θ and (j,m) is not correct in r;
(D) for any m ≤ t, any node (j,m) correct in r is also correct in r′;
(E) the number of agents byzantine by any m ≤ t in run r′ is not greater than that in run r and is ≤ f;
(F) r′ is τ_f, P_ε, P-transitional.
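The case split (6) can be sketched as a selection function over the partition of nodes; the intervention constructors below are placeholder tags standing in for Def. 7's CFreeze, FakeEcho, and X-Focus, and all data structures are illustrative.

```python
def choose_intervention(node, cone, buffer, focus_area):
    """Return a tag describing which i-intervention of (6) applies at `node`."""
    if node in cone:
        return ("focus", focus_area, node)   # X-Focus with X = cone + buffer
    if node in buffer:
        return ("fake_echo", node)
    return ("freeze",)                       # silent masses

def build_adjustment(theta, agents, cone, buffer):
    """Assemble the adjustment [B_{t-1}; ...; B_0] of Lemma 10 as a list of
    joint interventions B_0 ... B_{t-1}, each mapping agents to a tag."""
    i, t = theta
    focus_area = frozenset(cone | buffer)    # CB = RC + FB
    return [{j: choose_intervention((j, m), cone, buffer, focus_area)
             for j in agents}
            for m in range(t)]
```

Note that the focus area is the cone together with the fault buffer, so that byzantine echoes from the buffer are still received inside the cone while everything originating from the silent masses is cut off.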
Proof sketch. The following properties follow from the definitions:

RC^r_θ ∩ FB^r_θ = ∅,   θ ∈ RC^r_θ, (7)

(j,m) ∈ RC^r_θ & (k, m′) →_r (j,m) ⟹ (k, m′) ∈ CB^r_θ. (8)
Note that for θ_k = (j_k, m_k) with k = 1, 2, we have that θ_1 →_r θ_2 implies m_1 < m_2 and θ_1 ⇝^r θ_2 implies m_1 ≤ m_2. Thus, all parts of the lemma except for Statement (F) only concern m ≤ t, and even this last statement for m > t is a trivial corollary of Def. 6(d). Thus, we focus on m ≤ t.

Statement (A) can be proved by induction on m using the following auxiliary lemma for the given transitional run r and the adjusted run r′, which is also constructed using the standard update functions.
Lemma 11. If r_j(m) = r′_j(m), and β^m_j(r) = a_{ρ^m_j}(r), and β^m_{ε→j}(r) = e_{ρ^m_j}(r), then r_j(m+1) = r′_j(m+1).

Proof. This statement follows from (2) for the transitional run r, Def. 6(b) for the adjusted run r′, and the fact that update_j only depends on events of agent j, in particular, on the presence of go(j) or sleep(j) (see Def. A.19 in the appendix for details).
The third condition of Lemma 11 is satisfied for ρ^m_j(r) = CB^r_θ-Focus^m_j within RC^r_θ by (8). Further, if (j,m) ∈ RC^r_θ, then so are all (j, m′) with m ≤ m′ ≤ t. In particular, (i,m) ∈ RC^r_θ for any m ≤ t. Thus, Statement (B) follows from Statement (A), which we have already proved.

Statement (C) is due to the fact that (a) CB^r_θ-Focus^m_j does not produce any new byzantine events relative to β^m_{ε→j}(r), which contains none for (j,m) ∈ RC^r_θ, (b) CFreeze never produces byzantine events, whereas (c) FakeEcho^m_j always contains at least fail(j) ∈ BEvents_j. Statements (D) and (E) are direct corollaries of Statement (C).
The bulk of the proof concerns Statement (F), or, more precisely, transitionality up to timestamp t. For each m < t, we need sets α^m_ε(r′) ∈ P_ε(m) and α^m_j(r′) = {global(j,m,a) | a ∈ X_j} for some X_j ∈ P_j(r′_j(m)) for each j ∈ A such that, for β^m_ε(r′) = ⨆_{j∈A} e_{ρ^m_j}(r) and β^m_j(r′) = a_{ρ^m_j}(r) for all j ∈ A,

β^m_ε(r′) = filter_ε(r′(m), α^m_ε(r′), α^m_1(r′), ..., α^m_n(r′)), (9)

β^m_j(r′) = filter_j(α^m_1(r′), ..., α^m_n(r′), β^m_ε(r′)). (10)
The construction of such α-sets and the proof of (9)–(10) for them is by induction on m. Note that r′_ε(0) = r_ε(0) and r′_j(0) = r_j(0) for all j ∈ A by Def. 6(a). We will show that it suffices to choose

α^m_j(r′) := α^m_j(r) if (j,m) ∈ RC^r_θ, and otherwise
α^m_j(r′) := {global(j,m,a) | a ∈ X_j} for some X_j ∈ P_j(r′_j(m)), (11)

with the choice in the latter case possible by P_j(r′_j(m)) ≠ ∅, and

α^m_ε(r′) := filter^f_ε(r′(m), α^m_ε(r), α^m_1(r), ..., α^m_n(r)) \ ⨆_{(l,m) ∈ SM^r_θ ⊔ FB^r_θ} GEvents_l
  ⊔ {fail(l) | (l,m) ∈ FB^r_θ}
  ⊔ {fake(l, gsend(l,j,µ,id) ↦ noop) | (l,m) ∈ FB^r_θ & (gsend(l,j,µ,id) ∈ β^m_l(r) ∨ (∃A) fake(l, gsend(l,j,µ,id) ↦ A) ∈ β^m_ε(r))}. (12)
Informally, according to (11), in r′, we just repeat the choices made in r within the reliable causal cone and make arbitrary choices elsewhere. According to (12), events are chosen in a more complex way. First, mimicking the 1st-stage filtering in the given run r, the originally chosen α^m_ε(r) ∈ P_ε(m) is preventively purged of all byzantine events whenever they would have caused more than f agents to become faulty in r′. Note that, in our transitional simulation of the adjusted run r′, this is done prior to filtering (9) by exploiting the correctability of all agents. Secondly, for all agents l outside the reliable causal cone at the current timestamp m, i.e., with (l,m) ∈ SM^r_θ ⊔ FB^r_θ, all events are removed, to comply with the total freeze among the silent masses SM^r_θ and to make room for byzantine communication in the fault buffer FB^r_θ. The resulting set complies with P_ε because all agents are delayable. For the silent masses, this is the desired result. For the fault buffer, on the other hand, byzantine sends are added for every correct or byzantine send in r, thus ensuring that the incoming information in the reliable causal cone in r′ is the same as in r. For the case when a fault-buffer node (l,m) sent no messages in the original run, fail(l) is added to make the immediate future (l,m+1) byzantine despite its silence, which is crucial for fulfilling Statement (C) and simplifying bookkeeping for byzantine agents.
The proof of (9)–(10) is by induction on m = 0, ..., t−1. To avoid overlong formulas, we abbreviate the right-hand side of (9) by ϒ^m_ε and the right-hand sides of (10) for each j ∈ A by Ξ^m_j for the specific α^m_j(r′) and α^m_ε(r′) defined in (11) and (12) respectively. Thus, it only remains to show that β^m_ε(r′) = ϒ^m_ε and (∀j ∈ A) β^m_j(r′) = Ξ^m_j, or equivalently, further abbreviating ϒ^m_j := ϒ^m_ε ∩ GEvents_j, that

β^m_{ε→j}(r′) = ϒ^m_j   and   β^m_j(r′) = Ξ^m_j

for all j ∈ A, by simultaneous induction on m.
Induction step for the silent masses (j,m) ∈ SM^r_θ. By (12), α^m_{ε→j}(r′) := α^m_ε(r′) ∩ GEvents_j = ∅, and filtering it yields ϒ^m_j = ∅ = β^m_{ε→j}(r′) as prescribed by CFreeze. In particular, go(j) ∉ β^m_{ε→j}(r′), thus ensuring that filtering α^m_j(r′), whatever it is, yields Ξ^m_j = ∅ = β^m_j(r′), once again in compliance with CFreeze applied within SM^r_θ.
Before proceeding with the induction step for the remaining nodes, observe that events in α^m_ε(r′), if added to r′(m), do not cross the byzantine-agent threshold f, meaning that the 1st-stage filtering does not affect α^m_ε(r′):

filter^f_ε(r′(m), α^m_ε(r′), α^m_1(r′), ..., α^m_n(r′)) = α^m_ε(r′). (13)

Indeed, there are two sources of byzantine events in α^m_ε(r′): byzantine events from α^m_ε(r) that survived filter^f_ε in (12) and those pertaining to nodes in the fault buffer FB^r_θ. The former were also present in β^m_ε(r) in the original run because the 2nd-stage filter filter^B_ε only removes correct (receive) events. At the same time, for any (l,m) ∈ FB^r_θ, the immediate future (l,m+1) was a faulty node in r by the definition of FB^r_θ. In either case, any agent faulty in r′ based on α^m_ε(r′) was also faulty by timestamp m+1 in r. Additionally, any agent already faulty in r′(m) was also faulty in r(m) by Statement (D). Since the number of agents faulty by m+1 in the original transitional run r could not exceed f, adding α^m_ε(r′) to r′(m) does not exceed this threshold either. It follows from (13) that

ϒ^m_j = filter^B_ε(r′(m), α^m_ε(r′), α^m_1(r′), ..., α^m_n(r′)) ∩ GEvents_j. (14)
Induction step for the fault buffer (j,m) ∈ FB^r_θ. For these nodes, the α^m_{ε→j}(r′) part of α^m_ε(r′) contains no correct events, hence filter^B_ε, which only removes correct receives, has no effect. In other words,

ϒ^m_j = α^m_ε(r′) ∩ GEvents_j = {fail(j)} ⊔ {fake(j, gsend(j,h,µ,id) ↦ noop) | gsend(j,h,µ,id) ∈ β^m_j(r) ∨ (∃A) fake(j, gsend(j,h,µ,id) ↦ A) ∈ β^m_ε(r)} = β^m_{ε→j}(r′)

as prescribed by FakeEcho^m_j. As in the case of the silent masses, go(j) ∉ β^m_ε(r′) guarantees that the Ξ^m_j = ∅ = β^m_j(r′) requirement is fulfilled within FB^r_θ.
Induction step for the reliable causal cone (j,m) ∈ RC^r_θ. The case of the nodes with a reliable causal path to θ, whose immediate future remains correct in r′, is the final and also most complex induction step. Recall that α^m_j(r′) = α^m_j(r), which complies with P_j(r′_j(m)) = P_j(r_j(m)) because, within RC^r_θ, r′(m) = r(m) by Statement (A).
Thus, our choice of α^m_j(r′) in (11) is in compliance with transitionality. Since (j, m+1) is correct, the α^m_{ε→j}(r) := α^m_ε(r) ∩ GEvents_j part of α^m_ε(r) contained no byzantine events and, hence, is unchanged by (12). For the same reason it is not affected by 1st-stage filtering in either run. Thus, the same set of j's events undergoes the 2nd-stage filtering in both the original run r and in our transitional simulation of the adjusted run r′. Let us call this set of j's events ∆_j.

Since both filter^B_ε and CB^r_θ-Focus^m_j can only remove receive events, it immediately follows that β^m_{ε→j}(r), β^m_{ε→j}(r′), and ϒ^m_j agree on all non-receive events. Importantly, this includes go(j) events, thus ensuring that Ξ^m_j = β^m_j(r) = β^m_j(r′).
A receive event U = grecv(j,k,µ,id) ∈ ∆_j is retained in either run iff it is causally grounded by a matching send, correct or byzantine. Due to the uniqueness of the GMI id, as ensured by the injectivity of both the id and global functions, as well as Condition (c) of the t-coherency of sets produced by P_ε, there is at most one node of agent k where such a matching send can originate. If id is not well-formed and no such send can exist, U is filtered out from both β^m_ε(r) and ϒ^m_ε, the former ensuring U ∉ β^m_ε(r′). The reasoning in the case such a node η = (k,z) exists depends on where timestamp z lies relative to m and where η falls in our partition of nodes. Generally, to retain U in β^m_{ε→j}(r′) and ϒ^m_j, one must find either a correct send V := gsend(k,j,µ,id) or a faulty send W_A := fake(k, V ↦ A) for some A ∈ GActions_k ⊔ {noop}.
If z > m is in the future of m, then U is filtered out from both β^m_ε(r) and ϒ^m_ε; hence, U ∉ β^m_ε(r′).
If z ≤ m and η ∈ SM^r_θ, then, independently of filtering in r, hap U ∉ e_{CB^r_θ-Focus^m_j}(r) = β^m_{ε→j}(r′) because the message's origin is outside the focus area. At the same time, no actions or events are scheduled at η in r′ (for z = m it follows from the already proven induction step for silent masses). Without either V or W_A, event U is filtered out from ϒ^m_ε.
If z < m and η ∈ FB^r_θ, then, by (4) in the definition of FakeEcho^z_k, only W_noop can save U in ϒ^m_j, and W_noop ∈ β^z_{ε→k}(r′) iff either V ∈ β^z_k(r) or W_A ∈ β^z_{ε→k}(r) for some A. Thus, filtering U yields the same result in both runs, and CB^r_θ-Focus^m_j does not affect U because η ∈ CB^r_θ.
If z = m and η ∈ FB^r_θ, again only W_noop can save U in ϒ^m_j, this time by construction (12) of α^m_ε(r′) ∌ go(k). Here W_noop ∈ α^m_{ε→k}(r′) iff either V ∈ β^m_k(r) or W_A ∈ β^m_{ε→k}(r) for some A. Thus, filtering U yields the same result in both runs, and CB^r_θ-Focus^m_j does not affect U because η ∈ CB^r_θ.
If z < m and η ∈ RC^r_θ, then (k, z+1) is still correct in both r and r′; hence, no byzantine events such as W_A are present in either r(m) or r′(m). Accordingly, only V can save U in this case. Since β^z_k(r′) = β^z_k(r) by construction (5) of CB^r_θ-Focus^z_k, filtering U yields the same result in both runs, and CB^r_θ-Focus^m_j does not affect U because η ∈ CB^r_θ.
If z = m and η ∈ RC^r_θ, again (k, m+1) is correct in r′, meaning this time that no W_A are present in ∆_k. Again, only V can save U from filtering. Since α^m_k(r′) = α^m_k(r) by construction (11) and the sets of events being filtered agree on the presence of go(k) in ∆_k, here too filtering U yields the same result in both runs, and CB^r_θ-Focus^m_j does not affect U because η ∈ CB^r_θ.

This case analysis completes the induction step for the reliable causal cone, the induction proof, the proof of Statement (F), and the proof of the whole Lemma 10.
5 Preconditions for Actions: Multipedes

Arguably the most important application of Lemma 10 and, hence, of causal cones is to derive preconditions for agents' actions, cp. [1]. While relatively simple in traditional settings, where events can be preconditions according to the knowledge of preconditions principle [17] and where Lamport's causal cone suffices, this is no longer true in byzantine settings. As Theorem 12 reveals, if f > 0, an asynchronous agent can learn neither that it is (still) correct nor that a particular event⁶ really occurred.
Theorem 12 ([12]). If f ≥ 1, then for any o ∈ Events_i and for any interpreted system I = (R^χ, π) with any non-excluding agent-context χ = ((P_ε, G(0), τ_f, Ψ), P) where i is gullible and every j ≠ i is delayable,

I |= ¬K_i occurred(o)   and   I |= ¬K_i correct_i. (15)
These validities can be shown by modeling the infamous brain-in-a-vat scenario (see [12] for details).

Theorem 12 obviously implies that knowledge of simple preconditions, e.g., events, is never achievable if byzantine agents are present. Settling for the next best thing, one could investigate whether i knows o has happened relative to its own correctness, i.e., whether K_i(correct_i → occurred(o)) holds (cf. [18]), a kind of non-factive belief in o. This means that i can be mistaken about o due to its own faults (in which case it cannot rely on any information anyway), but not due to being misinformed by other agents. It is, however, sometimes overly restrictive to assume that K_i(correct_i → occurred(o)) holds in situations when i is, in fact, faulty: typical specifications, e.g., for distributed agreement [15], do not restrict the behavior of faulty agents, and agents might sometimes learn that they are faulty. We therefore introduced the hope modality

H_i ϕ := correct_i → K_i(correct_i → ϕ),

which was shown in [7] to be axiomatized by adding to K45 the axioms correct_i → (H_i ϕ → ϕ), and ¬correct_i → H_i ϕ, and H_i correct_i.

The following Theorem 13 shows that hope is also closely connected to reliable causal cones, in the sense that events an agent can hope for must lie within the reliable causal cone.
Theorem 13. For a non-excluding agent-context χ = ((P_ε, G(0), τ_f, Ψ), P) such that all agents are gullible, correctable, and delayable, for a correct node θ = (i,t), and for an event o ∈ Events, if all occurrences of O ∈ GEvents such that local(O) = o happen outside the reliable causal cone RC^r_θ of a run r ∈ R^χ, i.e., if O ∈ β^m_ε(r) ∩ GEvents_j & local(O) = o implies (j,m) ∉ RC^r_θ, then for any I = (R^χ, π),

(I, r, t) ⊭ H_i occurred(o).
Proof. Constructing the first t rounds according to the adjustment from Lemma 10 and extending this prefix to an infinite run r′ ∈ R^χ using the non-exclusiveness of χ, we obtain a run with no correct events recorded as o. Indeed, in r′, there are no events originating from SM^r_θ, no correct events from FB^r_θ, and all events originating from RC^r_θ, though correct, were also present in r and, hence, do not produce o in local histories. At the same time, r_i(t) = r′_i(t) by Lemma 10(B), making (I, r′, t) indistinguishable for i, and (I, r′, t) |= correct_i by Lemma 10(D).
It is interesting to compare the results and proofs of Theorems 12 and 13. Essentially, in the run r′ modeling the brain in a vat in the former, i is a faulty agent that perceives events while none really happen. Therefore, K_i occurred(o) can never be attained. In the run r′ constructed by Lemma 10 in Theorem 13, on the other hand, i remains correct. The reason that H_i occurred(o) fails here is that o does not occur within the reliable causal cone.
Theorem 13 shows that, in order to act based on the hope that an event occurred, it is necessary that the event originates from the reliable causal cone. Unfortunately, this is not sufficient. Consider the case of a run r where no agent exhibits a fault: every causal message chain is reliable, and the ordinary and reliable causal cones coincide. However, since up to f agents could be byzantine, it is trivial to modify r by seeding fail(j) events in round ½ for several agents j in a way that is indistinguishable for agent i trying to hope for the occurrence of o. This would enlarge the fault buffer and shrink the reliable causal cone in the so-constructed adjusted run r̂. Obviously, by making different sets of agents byzantine (without violating f, of course), one can fabricate multiple adjusted runs where r̂_i(t) = r_i(t) is exactly the same but fault buffers and reliable causal cones vary in size and shape. Any single one of those r̂ satisfying the conditions of Theorem 13, in the sense that all occurrences of o happen outside its reliable causal cone, dashes the hope of i for o in r.

⁶ Actually, the reasoning in this section also extends to actions, i.e., arbitrary haps.
Thus, in order for i to have hope at (i,t) in run r that o really occurred, it is necessary that some correct global version O of o (not necessarily the same one) is present somewhere (not necessarily at the same node) in the reliable causal cone of every run r̂ that ensures r_i(t) = r̂_i(t). This gives rise to the definition of a multipede, which ensures (I,r,t) |= H_i occurred(o) according to Theorem 13:
Definition 14 (Multipede). We say that a run r in a non-excluding agent-context χ = ((P_ε, G(0), τ_f, Ψ), P) contains a multipede for event o ∈ Events at some node θ = (i,t) iff, for all runs r̂ ∈ R^χ with r_i(t) = r̂_i(t), event o happens inside the reliable causal cone of r̂, i.e.,

(∃(j,m) ∈ RC^r̂_θ)(∃O ∈ GEvents_j)  O ∈ β^m_ε(r̂) & local(O) = o.
We obtain the following necessary condition for the existence of a multipede:

Theorem 15 (Necessary condition for a multipede). Given an arbitrary non-excluding agent-context χ = ((P_ε, G(0), τ_f, Ψ), P) such that all agents are gullible, correctable, and delayable, and given any run r ∈ R^χ in any interpreted system I = (R^χ, π), if (I,r,t) |= H_i occurred(o) for a correct node θ = (i,t), i.e., if there is a multipede for o at θ in r, then the following must hold: Let Byz^r_θ := {j ∈ A | (∃m)(j,m) ∈ FB^r_θ}. For any S ⊆ A \ ({i} ∪ Byz^r_θ) such that |S| = f − |Byz^r_θ|, there must exist a witness w_S ∈ A of some correct event O_S ∈ β^{m_S}_ε(r) ∩ GEvents_{w_S} such that local(O_S) = o and such that there is a causal path (w_S, m_S) ⇝^r_{ξ_S} θ that does not involve agents from S ∪ Byz^r_θ.
Proof. Since, by Lemma 10, the adjusted run r′ ∈ R^χ, and since the only faults up to t occur in r′ in the fault buffer FB^r_θ, i.e., pertain to agents from Byz^r_θ, for any S described above one can construct the first t rounds by setting β^0_ε(r_S) := β^0_ε(r′) ⊔ {fail(j) | j ∈ S ∪ Byz^r_θ} and keeping the rest of r′ intact. These first t rounds can be extended to complete infinite runs r_S ∈ R^χ indistinguishable for i at θ from either r or r′ because the addition of fail(j) is imperceptible for agents and does not affect protocols. The only potentially affected element could have been filter_ε in the part ensuring that byzantine agents do not exceed f in number, but it also behaves the same way as in r′ because |S| + |Byz^r_θ| = f. Since r_i(t) = r′_i(t) = (r_S)_i(t), we have (I, r_S, t) |= H_i occurred(o). Node θ remains correct in these runs because i ∉ S. Thus, by Theorem 13, each run r_S must have a requisite correct event O_S ∈ β^{m_S}_{ε→w_S}(r_S) with (w_S, m_S) ∈ RC^{r_S}_θ. It remains to note that any such correct event from r_S must be present in r′ and in r, and any causal path in r_S exists already in r′ and r, according to the construction from Lemma 10. Thus, there must exist a causal path ξ_S in r from (w_S, m_S) to θ such that ξ_S is reliable in r_S. Finally, since all f byzantine agents in r_S, namely S ∪ Byz^r_θ, are made faulty from round ½, path ξ_S being reliable in r_S means not involving these agents.
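The necessary condition of Theorem 15 suggests the following brute-force check, sketched under the assumption that causal-path queries and the set of correct witnesses of o are supplied externally (e.g., computed from the run's message log); all names are illustrative, and we assume f ≥ |Byz^r_θ|.

```python
from itertools import combinations

def multipede_necessary(theta, agents, f, byz, witnesses, path_exists):
    """Check the necessary condition of Theorem 15.

    byz: agents appearing in the fault buffer (Byz^r_theta).
    witnesses: set of nodes (agent, time) carrying a correct global version of o.
    path_exists(w, blocked): does a causal path from node w to theta avoid
    every agent in `blocked`?
    """
    i, t = theta
    candidates = sorted(agents - {i} - byz)
    # every way of completing byz to f byzantine agents must leave a witness
    for S in combinations(candidates, f - len(byz)):
        blocked = set(S) | byz
        if not any(w[0] not in blocked and path_exists(w, blocked)
                   for w in witnesses):
            return False      # hope for o is refuted by this choice of S
    return True
```

On the chain scenario discussed below (information from agent 1 must pass through agent 2, which sits in the fault buffer), any witness reachable only through the buffer fails the check, while a single extra fault budget spent elsewhere leaves an independent witness intact.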
From the perspective of protocol design, sufficient conditions for the existence of a multipede in a given run are arguably of more interest. Whereas a sufficient condition could, of course, be obtained directly from Def. 14, identifying all the transitional runs r̂ with r_i(t) = r̂_i(t) is far from computable in general. In fact, we conjecture that sufficient conditions cannot be formulated in a protocol-independent way at all. Unfortunately, however, protocol-dependence cannot be expected to be simple either. For instance, even just varying the number and location of faults in r for suppressing occurred(o) in a modified run r̂ can be non-trivial. If k agents are already faulty in run r, at least f − k further agents can freely be used for this purpose. However, some of the k byzantine faults in r may also be re-located in r̂, as agents that only become faulty after timestamp t cannot be part of any fault buffer. Rather than making them faulty, it would suffice to just freeze them.
For instance, consider the following communication structure with f = 2 and agents 1 and 2 being byzantine (we omit the time dimension for simplicity's sake):

1 → 2 → 3 ← 4

Here both 1 and 2 would participate in the fault buffer, whereas 2 alone would already suffice: even were 1 correct, the observed communication does not give it a chance to get past 2. Depending on 1's protocol, it might be possible to reassign 1 to the silent masses, thereby allowing us to consider 4 as the second faulty agent and, thus, showing the impossibility for 3 to act in this situation. An opposite outcome is possible in the following scenario:
in the following scenario:
A1.1A2.1
I1A1.2C A2.2I2
A1.3A2.3
Let f = 2 and the faulty agents be A2.1 and A1.1. While the sufficient condition forces C to consider
the case of both I1 and I2 being compromised and information originating from them unreliable, our
necessary condition does not rule out C's ability to make a decision. Indeed, suppose I1 and I2 are
investigators sending in their reports via three aides each. Having received 4 identical correct reports
from A1.2, A1.3, A2.2, and A2.3 and only 2 fake reports from A1.1 and A2.1, agent C would have
been able to choose the correct version if the possibility of both investigators being compromised were
off the table. Our method of adjusting the run does not allow us to move the faulty agent from A1.1 to I1
because it is not clear how A1.1 would have behaved were it correct and had it received a fake report
from I1. By designing a protocol in such a way that A1.1's correct behavior in such a hypothetical situ-
ation is different, we can eliminate the possibility of investigators being compromised and, thus, resolve
the situation for C.
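To make the counting argument concrete, here is a minimal sketch (ours, not part of the paper's formal framework) of the decision rule that becomes available to C once the possibility of compromised investigators is eliminated: with at most f = 2 faulty aides and correct investigators, any report value occurring more than f times must be genuine.

```python
from collections import Counter

def decide(reports, f=2):
    """Pick a report that C can safely act on.

    reports: the report values relayed by the six aides.
    Assuming the investigators themselves are correct, every correct aide
    relays the genuine report, and at most f aides are faulty.  Hence any
    value occurring more than f times must be the genuine report.
    """
    value, count = Counter(reports).most_common(1)[0]
    return value if count > f else None  # None: C cannot decide safely

# The scenario from the text: A1.2, A1.3, A2.2, A2.3 relay the correct
# report, while the faulty A1.1 and A2.1 relay a fake one.
print(decide(["ok", "ok", "ok", "ok", "fake", "fake"]))  # -> ok
```

If both investigators may be compromised, the 4-versus-2 majority proves nothing, which is exactly why eliminating that possibility resolves the situation for C.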
6 Conclusions
The main contribution of this paper is the characterization of the analog of Lamport's causal cone in
asynchronous multi-agent systems with byzantine faulty agents. Relying on our novel byzantine runs-
and-systems framework, we provided an accurate epistemic characterization of causality and of the induced
reliable causal cone in the presence of asynchronous byzantine agents. Despite the quite natural final
shape of the reliable causal cone, it does not lead to simple conditions for ascertaining preconditions:
detecting what we called a multipede is considerably more complex than verifying the existence of a
single causal path in the fault-free case. Since in byzantine fault-tolerant protocols like [20] the agents'
actions depend on the shape of multiple alternative reliable causal cones, however, there is no alternative
but to detect multipedes.
Developing practical sufficient conditions for the existence of a multipede poses exciting challenges,
which are currently being addressed in the context of the epistemic analysis of some real byzantine
fault-tolerant protocols. This context-dependency is unavoidable, since the agent that tries to detect a
multipede in a run lacks global information such as the actual members of the fault buffer. On the other
hand, the gap between the necessary and sufficient conditions can potentially be minimized by designing
protocols based on the insights into the causality structure we have uncovered. For instance, while our
necessary conditions for a multipede treated all error-creating nodes as part of the fault buffer, it is
sometimes possible to relegate redundant parts of the fault buffer to the silent masses. Since this would
make it possible to re-locate byzantine faults so as to intercept more causal paths, one may design
protocols in a way that rules out such re-location.
A larger and more long-term goal is to extend our study to syncausality and the reliable syncausal
cone in the context of synchronous byzantine fault-tolerant multi-agent systems, and to possibly incor-
porate protocols explicitly into the logic.
Acknowledgments. We are grateful to Yoram Moses, Hans van Ditmarsch, and Moshe Vardi for
fruitful discussions and valuable suggestions that have helped shape the final version of this paper. We
also thank the anonymous reviewers for their comments and suggestions.
References
[1] Ido Ben-Zvi & Yoram Moses (2014): Beyond Lamport's Happened-before: On Time Bounds and the Ordering of Events in Distributed Systems. Journal of the ACM 61(2:13), doi:10.1145/2542181.
[2] K. M. Chandy & Jayadev Misra (1986): How processes learn. Distributed Computing 1(1), pp. 40–52, doi:10.1007/BF01843569.
[3] Reinhard Diestel (2017): Graph Theory, Fifth edition. Springer, doi:10.1007/978-3-662-53622-3.
[4] Cynthia Dwork & Yoram Moses (1990): Knowledge and Common Knowledge in a Byzantine Environment: Crash Failures. Information and Computation 88(2), pp. 156–186, doi:10.1016/0890-5401(90)90014-9.
[5] Ronald Fagin, Joseph Y. Halpern, Yoram Moses & Moshe Y. Vardi (1995): Reasoning About Knowledge. MIT Press.
[6] Ronald Fagin, Joseph Y. Halpern, Yoram Moses & Moshe Y. Vardi (1999): Common knowledge revisited. Annals of Pure and Applied Logic 96(1–3), pp. 89–105, doi:10.1016/S0168-0072(98)00033-5.
[7] Krisztina Fruzsa (2019): Hope for Epistemic Reasoning with Faulty Agents! In: Proceedings of ESSLLI 2019 Student Session. (To appear).
[8] Guy Goren & Yoram Moses (2018): Silence. In: PODC '18, Proceedings of the 2018 ACM Symposium on Principles of Distributed Computing, ACM, pp. 285–294, doi:10.1145/3212734.3212768.
[9] Joseph Y. Halpern & Yoram Moses (1990): Knowledge and Common Knowledge in a Distributed Environment. Journal of the ACM 37(3), pp. 549–587, doi:10.1145/79147.79161.
[10] Joseph Y. Halpern, Yoram Moses & Orli Waarts (2001): A characterization of eventual Byzantine agreement. SIAM Journal on Computing 31(3), pp. 838–865, doi:10.1137/S0097539798340217.
[11] Jaakko Hintikka (1962): Knowledge and Belief: An Introduction to the Logic of the Two Notions. Cornell University Press.
[12] Roman Kuznets, Laurent Prosperi, Ulrich Schmid & Krisztina Fruzsa (2019): Epistemic Reasoning with Byzantine-Faulty Agents. In: Proceedings of FroCoS 2019. (To appear).
[13] Roman Kuznets, Laurent Prosperi, Ulrich Schmid, Krisztina Fruzsa & Lucas Gréaux (2019): Knowledge in Byzantine Message-Passing Systems I: Framework and the Causal Cone. Technical Report TUW-260549, TU Wien. Available at https://publik.tuwien.ac.at/files/publik_260549.pdf.
[14] Leslie Lamport (1978): Time, Clocks, and the Ordering of Events in a Distributed System. Communications of the ACM 21(7), pp. 558–565, doi:10.1145/359545.359563.
[15] Leslie Lamport, Robert Shostak & Marshall Pease (1982): The Byzantine Generals Problem. ACM Transactions on Programming Languages and Systems 4(3), pp. 382–401, doi:10.1145/357172.357176.
[16] Alexandre Maurer, Sébastien Tixeuil & Xavier Defago (2015): Reliable Communication in a Dynamic Network in the Presence of Byzantine Faults. eprint 1402.0121, arXiv. Available at https://arxiv.org/abs/1402.0121.
[17] Yoram Moses (2015): Relating Knowledge and Coordinated Action: The Knowledge of Preconditions Principle. In R. Ramanujam, editor: Proceedings of TARK 2015, pp. 231–245, doi:10.4204/EPTCS.215.17.
[18] Yoram Moses & Yoav Shoham (1993): Belief as defeasible knowledge. Artificial Intelligence 64(2), pp. 299–321, doi:10.1016/0004-3702(93)90107-M.
[19] Yoram Moses & Mark R. Tuttle (1988): Programming Simultaneous Actions Using Common Knowledge. Algorithmica 3, pp. 121–169, doi:10.1007/BF01762112.
[20] T. K. Srikanth & Sam Toueg (1987): Optimal Clock Synchronization. Journal of the ACM 34(3), pp. 626–645, doi:10.1145/28869.28876.
Appendix
Filter functions
Definition A.16. The filtering function filter_ε for asynchronous agents with at most f ≥ 0 byzantine
faults is defined as follows.
First, we define a subfilter filter^B_ε : G × ℘(GEvents) × ∏_{i=1}^n ℘(GActions_i) → ℘(GEvents) that re-
moves impossible receives: for a global state h ∈ G, a set X_ε ⊆ GEvents, and sets X_i ⊆ GActions_i,

  filter^B_ε(h, X_ε, X_1, …, X_n) := X_ε ∖ { grecv(j, i, µ, id) |
      gsend(i, j, µ, id) ∉ h_ε ∧
      (∀A ∈ {noop} ⊔ GActions_i) fake(i, gsend(i, j, µ, id) ↦ A) ∉ h_ε ∧
      (gsend(i, j, µ, id) ∉ X_i ∨ go(i) ∉ X_ε) ∧
      (∀A ∈ {noop} ⊔ GActions_i) fake(i, gsend(i, j, µ, id) ↦ A) ∉ X_ε },

where h_ε is the environment's record of all haps in the global state h and O ∈ h_ε (O ∉ h_ε) states that the
hap O ∈ GHaps is (isn't) present in this record of all past rounds, X_ε represents all events attempted by
the environment, and the X_i represent all actions attempted by the agents i in the current round.
Second, using X^B_{ε,i} := X_ε ∩ (BEvents_i ⊔ {sleep(i), hibernate(i)}) and defining A(Failed(h)) to be
the set of agents who have already exhibited faulty behavior in the global state h, we define a subfilter
filter^f_ε : G × ℘(GEvents) × ∏_{i=1}^n ℘(GActions_i) → ℘(GEvents) that removes all byzantine events in the
situation when having them would have exceeded the f threshold:

  filter^f_ε(h, X_ε, X_1, …, X_n) :=
      X_ε                        if |A(Failed(h)) ∪ {i | X^B_{ε,i} ≠ ∅}| ≤ f,
      X_ε ∖ ⨆_{i∈A} X^B_{ε,i}    otherwise.
The filter filter_ε : G × ℘(GEvents) × ∏_{i=1}^n ℘(GActions_i) → ℘(GEvents) is obtained by composing
these two subfilters, with the f subfilter applied first:

  filter_ε(h, X_ε, X_1, …, X_n) := filter^B_ε(h, filter^f_ε(h, X_ε, X_1, …, X_n), X_1, …, X_n).
The composition in the opposite order could violate causality: a message receipt could be preserved by
filter^B_ε based on a byzantine send in the same round that is subsequently removed by filter^f_ε.
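For illustration only, the threshold subfilter and the order of composition can be sketched in set-based Python. This is a toy encoding of haps as tagged tuples; all names (`byz_events_of`, `justified_ids`, the tags `"fake"`, `"recv"`, etc.) are our own stand-ins, not the paper's notation:

```python
def byz_events_of(i, X_eps):
    """Byzantine events attributed to agent i in this round's attempts
    (a stand-in for X^B_{eps,i})."""
    return {e for e in X_eps
            if e[0] in ("fake", "sleep", "hibernate") and e[1] == i}

def filter_f(X_eps, failed, agents, f):
    """filter^f (sketch): if keeping this round's byzantine events would
    make more than f agents faulty overall, drop all of them."""
    would_fail = failed | {i for i in agents if byz_events_of(i, X_eps)}
    if len(would_fail) <= f:
        return X_eps
    return X_eps - {e for i in agents for e in byz_events_of(i, X_eps)}

def filter_B(X_eps, justified_ids):
    """filter^B (sketch): remove receives for which no matching send
    exists -- neither recorded in the past nor attempted this round,
    including byzantine fake sends inside X_eps itself."""
    ids = justified_ids | {e[2] for e in X_eps if e[0] == "fake"}
    return {e for e in X_eps if e[0] != "recv" or e[2] in ids}

def filter_eps(X_eps, failed, agents, f, justified_ids):
    """Composition with filter^f applied FIRST: a receive cannot be
    justified by a fake send that filter^f is about to remove."""
    return filter_B(filter_f(X_eps, failed, agents, f), justified_ids)

# With f = 0 no byzantine event may survive: the fake send of message
# "m1" is dropped, and with it the receive it would have justified.
round_events = {("fake", 1, "m1"), ("recv", 2, "m1")}
print(filter_eps(round_events, set(), {1, 2}, 0, set()))  # -> set()
```

Composing in the opposite order on the same input leaves the receive of "m1" in place even though its only justifying (fake) send is removed, reproducing the causality violation described above.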
Definition A.17. The filters filter_i : ∏_{j=1}^n ℘(GActions_j) × ℘(GEvents) → ℘(GActions_i) for agents' ac-
tions are defined as follows: for X_ε representing all of the environment's events and X_i representing all
actions attempted by agent i in the current round,

  filter_i(X_1, …, X_n, X_ε) :=
      X_i    if go(i) ∈ X_ε,
      ∅      otherwise.
Update functions
Before defining the update functions, we need several auxiliary functions:
Definition A.18. We use a function local : GHaps → Haps converting correct haps from the global
format into the local formats of the respective agents in such a way that, for any i, j ∈ A, any t ∈ T, any
a ∈ Actions_i, any µ ∈ Msgs, and any M ∈ ℕ:

  1. local(GActions_i) = Actions_i;
  2. local(GEvents_i) = Events_i;
  3. local(global(i, t, a)) = a;
  4. local(grecv(i, j, µ, M)) = recv(j, µ).

For all other haps, the localization cannot be done on a hap-by-hap basis because system events and
byzantine events fake(i, A ↦ noop) do not create a local record. Accordingly, we define a localization
function σ : ℘(GHaps) → ℘(Haps) as follows: for each X ⊆ GHaps,

  σX := local( (X ∩ GHaps) ∪ {E ∈ GEvents | (∃i) fake(i, E) ∈ X} ∪
               {A ∈ GActions | (∃i)(∃A′) fake(i, A′ ↦ A) ∈ X} ).
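The case distinction in σ can be illustrated by a small Python sketch. This is a toy encoding of haps as tagged tuples; the tags and the simplified `local` below are our own, not the paper's actual encoding:

```python
def local(hap):
    """Toy version of `local`: a global action ("gaction", i, t, a)
    localizes to a; a receive ("grecv", i, j, mu, M) to ("recv", j, mu);
    anything else (e.g. system events) leaves no local record."""
    if hap[0] == "gaction":
        return hap[3]
    if hap[0] == "grecv":
        return ("recv", hap[2], hap[3])
    return None

def sigma(X):
    """Toy version of the localization sigma: correct haps localize
    directly; a byzantine event fake(i, E) leaves the record of the
    event E; a byzantine action fake(i, A' -> A) leaves the record of
    the action A, so that fake(i, A' -> noop) leaves none."""
    recorded = set()
    for hap in X:
        if hap[0] == "fake_event":          # ("fake_event", i, E)
            recorded.add(hap[2])
        elif hap[0] == "fake_action":       # ("fake_action", i, A_done, A_rec)
            if hap[3] != "noop":
                recorded.add(hap[3])
        else:                               # correct hap
            recorded.add(hap)
    return {local(h) for h in recorded} - {None}

# A faked receive is locally indistinguishable from a real one:
real = sigma({("grecv", 1, 2, "m", 0)})
faked = sigma({("fake_event", 1, ("grecv", 1, 2, "m", 0))})
print(real == faked)  # -> True
```

The last two lines show the intended effect: the agent's local record cannot tell a byzantine-injected event from a genuinely occurring one.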
Definition A.19. We abbreviate X_{ε,i} := X_ε ∩ GEvents_i for performed events X_ε ⊆ GEvents and actions
X_i ⊆ GActions_i for each i ∈ A. Given a global state r(t) = (r_ε(t), r_1(t), …, r_n(t)) ∈ G, we define
agent i's update_i : L_i × ℘(GActions_i) × ℘(GEvents) → L_i that outputs a new local state from L_i based
on i's actions X_i and events X_ε:

  update_i(r_i(t), X_i, X_ε) :=
      r_i(t)                         if σ(X_{ε,i}) = ∅ and X_{ε,i} ∩ {go(i), sleep(i)} = ∅,
      (σ(X_{ε,i} ⊔ X_i)) : r_i(t)    otherwise

(note that in transitional runs, update_i is always used after the action filter filter_i; thus, in the absence
of go(i), it is always the case that X_i = ∅).
Similarly, the environment's state update function update_ε : L_ε × ℘(GEvents) × ∏_{i=1}^n ℘(GActions_i) → L_ε
outputs a new state of the environment based on events X_ε and all actions X_i:

  update_ε(r_ε(t), X_ε, X_1, …, X_n) := (X_ε ⊔ X_1 ⊔ ··· ⊔ X_n) : r_ε(t).
Accordingly, the global update function update : G × ℘(GEvents) × ∏_{i=1}^n ℘(GActions_i) → G modifies
the global state as follows:

  update(r(t), X_ε, X_1, …, X_n) := ( update_ε(r_ε(t), X_ε, X_1, …, X_n),
      update_1(r_1(t), X_1, X_ε), …, update_n(r_n(t), X_n, X_ε) ).
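As a final illustration, the update functions can be sketched as follows. This is again a toy encoding, with histories as Python lists (newest round first); `sigma` is passed in as a parameter, and `events_of(i, X_eps)` stands in for X_ε ∩ GEvents_i — all names are our own:

```python
def update_i(local_state, X_i, X_eps_i, sigma):
    """update_i (sketch): agent i's history grows by one round exactly
    when i perceives something -- a non-empty localized record or a
    go(i)/sleep(i) system event; otherwise it is left unchanged."""
    signals = {e for e in X_eps_i if e[0] in ("go", "sleep")}
    if not sigma(X_eps_i) and not signals:
        return local_state
    return [frozenset(sigma(X_eps_i | X_i))] + local_state

def update_env(env_state, X_eps, X_agents):
    """update_eps (sketch): the environment records every event and
    every action of the round, unfiltered and unlocalized."""
    return [frozenset(set(X_eps).union(*X_agents))] + env_state

def update(global_state, X_eps, X_agents, events_of, sigma):
    """Global update (sketch): componentwise application to the
    environment's state and each agent's local state."""
    env, locs = global_state
    return (update_env(env, X_eps, X_agents),
            [update_i(locs[i], X_agents[i], events_of(i, X_eps), sigma)
             for i in range(len(locs))])

# A round in which agent 0 receives go(0) and records an observation:
env, locs = update((["e0"], [[]]), {("go", 0)}, [{("obs", "a")}],
                   lambda i, X: X,
                   lambda X: {e for e in X if e[0] == "obs"})
print(locs)  # -> [[frozenset({('obs', 'a')})]]
```

Note how the asymmetry of the definitions is preserved: the environment's history grows every round, while an agent's history only grows in rounds the agent can perceive.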