On conditional Sibson’s α-Mutual Information
Amedeo Roberto Esposito, Diyuan Wu, Michael Gastpar
School of Computer and Communication Sciences
EPFL, Lausanne, Switzerland
{amedeo.esposito, diyuan.wu, michael.gastpar}@epfl.ch
Abstract—In this work, we analyse how to define a conditional version of Sibson's α-Mutual Information. Several such definitions can be advanced, and they all lead to different information measures with different (but similar) operational meanings. We analyse one such definition in detail, compute a closed-form expression for it, and endow it with an operational meaning, while also considering some applications. The alternative definitions are also mentioned and compared.
Index Terms—Rényi Divergence, Sibson's Mutual Information, Conditional Mutual Information, Information Measures
I. INTRODUCTION
Sibson's α-Mutual Information is a generalization of Shannon's Mutual Information with several applications in probability, information, and learning theory [1]. In particular, it has been used to provide concentration inequalities in settings where the random variables are not independent, with applications to learning theory [1]. The measure is also connected to Gallager's exponent function, a central object in the channel coding problem both for rates below and above capacity [2], [3]. Moreover, a new operational meaning has been given to the measure with α = +∞: a novel measure of information leakage, named Maximal Leakage, has been proposed in [4]. Similarly to $I_\alpha$, Maximal Leakage has recently found applications in learning and probability theory [1]. However, while Maximal Leakage has a corresponding conditional form [4], Sibson's α-Mutual Information lacks an agreed-upon conditional version. In this work we analyse a path that could be taken in defining such a measure and focus on one specific choice, given in Definition 4 below. We discuss key properties of this choice and endow it with an operational meaning as the error exponent in a suitably defined hypothesis testing problem. Moreover, we hint at applications of this measure to other settings as well. The choice we make is not unique, and we explain how making different choices leads to different information measures, all of them equally meaningful. A conditional version of Sibson's $I_\alpha$ has been presented in [5]. We briefly present their measure in Sec. III-B, along with a new result that we believe to be of interest. We then present in Sec. III-C a different choice for conditional $I_\alpha$. We show some properties of this measure, compare the two objects in Sec. III-E, and then discuss a general approach to associate an operational meaning with these measures in Sec. IV. Alternative routes have been considered in [6], where Arimoto's generalisation of Mutual Information is considered and a conditional version is given.
II. BACKGROUND AND DEFINITIONS
Given a function $f:\mathbb{R}\to[-\infty,+\infty]$, we can define its convex conjugate $f^\star:\mathbb{R}\to[-\infty,+\infty]$ as follows:
$$f^\star(\lambda) = \sup_{x}\big(\lambda x - f(x)\big). \qquad (1)$$
Given a function $f$, $f^\star$ is guaranteed to be lower semi-continuous and convex. We can re-apply the conjugation operator to $f^\star$ and obtain $f^{\star\star}$. If $f$ is convex and lower semi-continuous then $f=f^{\star\star}$; otherwise, all we can say is that $f^{\star\star}(x)\le f(x)$ for every $x\in\mathbb{R}$. Throughout, $\log$ denotes the natural logarithm.
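As a quick illustration of the definition (a worked example of ours, not from the original text), take $f(x)=x^2/2$:
$$f^\star(\lambda)=\sup_{x}\left(\lambda x-\frac{x^2}{2}\right)=\frac{\lambda^2}{2},$$
the supremum being attained at $x=\lambda$; since this $f$ is convex and lower semi-continuous, $f^{\star\star}=f$, in line with the statement above.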
A. Sibson’s α-Mutual Information
Introduced by Rényi as a generalization of entropy and of the KL-divergence, the α-divergence has found many applications, ranging from hypothesis testing to guessing and several other statistical inference and coding problems [7]. Indeed, it has several useful operational interpretations (e.g., hypothesis testing, and the cut-off rate in block coding [8], [9]). It can be defined as follows [8].
Definition 1. Let $(\Omega,\mathcal{F},\mathcal{P})$, $(\Omega,\mathcal{F},\mathcal{Q})$ be two probability spaces. Let $\alpha>0$ be a positive real number different from $1$. Consider a measure $\mu$ such that $\mathcal{P}\ll\mu$ and $\mathcal{Q}\ll\mu$ (such a measure always exists, e.g. $\mu=(\mathcal{P}+\mathcal{Q})/2$) and denote with $p,q$ the densities of $\mathcal{P},\mathcal{Q}$ with respect to $\mu$. The α-Divergence of $\mathcal{P}$ from $\mathcal{Q}$ is defined as follows:
$$D_\alpha(\mathcal{P}\|\mathcal{Q}) = \frac{1}{\alpha-1}\log\int p^{\alpha}q^{1-\alpha}\,d\mu. \qquad (2)$$
Remark 1. The definition is independent of the chosen measure $\mu$. It is indeed possible to show that $\int p^{\alpha}q^{1-\alpha}\,d\mu = \int\left(\frac{q}{p}\right)^{1-\alpha}d\mathcal{P}$, and that whenever $\mathcal{P}\ll\mathcal{Q}$ or $0<\alpha<1$, we have $\int p^{\alpha}q^{1-\alpha}\,d\mu = \int\left(\frac{p}{q}\right)^{\alpha}d\mathcal{Q}$, see [8].
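For finite alphabets the integral in (2) reduces to a sum. The following minimal sketch (our illustration, not part of the original paper; it assumes Python with NumPy and two hypothetical pmfs p, q) evaluates $D_\alpha$ and checks that it approaches the KL divergence as α → 1:

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """D_alpha(P||Q) = 1/(alpha-1) * log sum_x p(x)^alpha q(x)^(1-alpha)
    for discrete distributions given as probability vectors (alpha > 0, alpha != 1)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing for alpha > 0
    return np.log(np.sum(p[mask] ** alpha * q[mask] ** (1 - alpha))) / (alpha - 1)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
kl = np.sum(p * np.log(p / q))            # D(P||Q), the alpha -> 1 limit
print(renyi_divergence(p, q, 2.0))        # divergence of order 2
print(renyi_divergence(p, q, 1.001), kl)  # close to the KL divergence
```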
It can be shown that if $\alpha>1$ and $\mathcal{P}\not\ll\mathcal{Q}$, then $D_\alpha(\mathcal{P}\|\mathcal{Q})=\infty$. The behaviour of the measure for $\alpha\in\{0,1,\infty\}$ can be defined by continuity. In general, one has $D_1(\mathcal{P}\|\mathcal{Q})=D(\mathcal{P}\|\mathcal{Q})$, but if $D(\mathcal{P}\|\mathcal{Q})=\infty$ or there exists $\beta>1$ such that $D_\beta(\mathcal{P}\|\mathcal{Q})<\infty$, then $\lim_{\alpha\downarrow 1}D_\alpha(\mathcal{P}\|\mathcal{Q})=D(\mathcal{P}\|\mathcal{Q})$ [8, Theorem 5]. For an extensive treatment of α-divergences and their properties, we refer the reader to [8]. Starting from Rényi's divergence and the geometric averaging that it involves, Sibson built the notion of Information Radius [10]:
Definition 2. Let $(\mu_1,\ldots,\mu_n)$ be a family of probability measures and $(w_1,\ldots,w_n)$ be a set of weights such that $w_i\ge 0$ for $i=1,\ldots,n$ and $\sum_{i=1}^n w_i>0$. Let $\alpha\ge 1$; the information radius of order α is defined as:
$$\frac{1}{\alpha-1}\min_{\nu\,\ll\,\sum_i w_i\mu_i}\log\left(\sum_i w_i\exp\big((\alpha-1)D_\alpha(\mu_i\|\nu)\big)\right).$$
Suppose now that we have two random variables $X,Y$ jointly distributed according to $P_{XY}$. It is possible to generalise Def. 2 and see that the information radius is a special case of the following quantity [7]:
$$I_\alpha(X,Y) = \min_{Q_Y} D_\alpha(P_{XY}\|P_X Q_Y). \qquad (3)$$
$I_\alpha(X,Y)$ represents a generalisation of Shannon's Mutual Information and possesses many interesting properties [7]. Indeed, $\lim_{\alpha\to 1}I_\alpha(X,Y)=I(X;Y)$. On the other hand, when $\alpha\to\infty$ we get
$$I_\infty(X,Y) = \log\mathbb{E}_{P_Y}\!\left[\sup_{x:P_X(x)>0}\frac{P_{XY}(\{x,Y\})}{P_X(\{x\})P_Y(\{Y\})}\right] = \mathcal{L}(X\to Y),$$
where $\mathcal{L}(X\to Y)$ denotes the Maximal Leakage from $X$ to $Y$, a recently defined information measure with an operational meaning in the context of privacy and security [4]. For more details on Sibson's α-MI, as well as a closed-form expression, we refer the reader to [7]; for Maximal Leakage the reader is referred to [4].
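As an illustration, on finite alphabets the closed form of Sibson's $I_\alpha$ reported in [7] can be written as $I_\alpha(X,Y)=\frac{\alpha}{\alpha-1}\log\sum_y\big(\sum_x P_X(x)P_{Y|X}(y|x)^{\alpha}\big)^{1/\alpha}$. The sketch below (ours; Python with NumPy and a hypothetical joint pmf) evaluates it and checks the α → 1 limit against Shannon's $I(X;Y)$:

```python
import numpy as np

def sibson_mi(p_xy, alpha):
    """Sibson's I_alpha(X,Y) from a joint pmf p_xy[x, y] on finite alphabets,
    via alpha/(alpha-1) * log sum_y (sum_x P_X(x) P_{Y|X}(y|x)^alpha)^(1/alpha).
    Assumes P_X(x) > 0 for every x."""
    p_xy = np.asarray(p_xy, float)
    p_x = p_xy.sum(axis=1, keepdims=True)             # P_X(x)
    p_y_given_x = p_xy / p_x                          # P_{Y|X}(y|x)
    inner = (p_x * p_y_given_x ** alpha).sum(axis=0)  # sum_x P_X(x) P_{Y|X}(y|x)^alpha
    return alpha / (alpha - 1) * np.log(np.sum(inner ** (1 / alpha)))

p_xy = np.array([[0.3, 0.1],
                 [0.1, 0.5]])
p_x, p_y = p_xy.sum(1, keepdims=True), p_xy.sum(0, keepdims=True)
shannon_mi = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))
print(sibson_mi(p_xy, 1.001), shannon_mi)  # nearly equal
print(sibson_mi(p_xy, 50.0))               # large alpha approaches L(X -> Y)
```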
III. DEFINITION
A. Introduction
The characterisation expressed in (3) represents the foundation of this work. Indeed, using (3) as the definition of Sibson's α-MI allows us to draw parallels with Shannon's Mutual Information. This, in turn, allows us to define, drawing inspiration from Shannon's measures, an analogous conditional version of Sibson's $I_\alpha$. It is well known that $I(X;Y)=D(P_{XY}\|P_X P_Y)$ as well as $I(X;Y|Z)=D(P_{XYZ}\|P_Z P_{X|Z}P_{Y|Z})$. We can thus follow a similar approach in defining a conditional α-Mutual Information: we will estimate the (Rényi) divergence of the joint $P_{XYZ}$ from a distribution characterised by the Markov chain $X-Z-Y$ via α-divergences. Mimicking (3), we will also minimise such a divergence with respect to a family of measures. Having three random variables, we can think of three natural factorisations for $P_{XYZ}$ (assuming that $X-Z-Y$ holds): $P_X P_{Z|X}P_{Y|Z}$, $P_Y P_{Z|Y}P_{X|Z}$, $P_Z P_{Y|Z}P_{X|Z}$. The question then is: over which measure should we minimise in order to define $I_\alpha(X,Y|Z)$? Natural candidates are the minimisations with respect to $Q_Z$, $Q_{Y|Z}$ and $Q_Y$. The matter is strongly connected to the operational meaning that the information measure acquires, along with the applications it can provide. Each of the definitions can be useful in specific settings. Keeping this in mind, the purpose of this work is not to compare different definitions in order to find the best one, but rather to highlight properties of the different definitions with an operationally driven approach. Each of these measures can be associated with a hypothesis testing problem and with a bound relating different measures of the same event (typically a joint and a Markov chain-like distribution). Different applications require different conditional $I_\alpha$'s. With this drive, let us make a specific choice for the minimisation and draw parallels with the others along the way. The random variable whose measure¹ we choose to minimise will be denoted as a superscript.
B. $I_\alpha^{Y|Z}(X,Y|Z)$
In [5], conditional α-mutual information was defined as follows:
Definition 3. Let $X,Y,Z$ be three random variables jointly distributed according to $P_{XYZ}$. For $\alpha>0$, a conditional Sibson's mutual information of order α between $X$ and $Y$ given $Z$ is defined as:
$$I_\alpha^{Y|Z}(X,Y|Z) = \min_{Q_{Y|Z}} D_\alpha(P_{XYZ}\|P_{X|Z}Q_{Y|Z}P_Z). \qquad (4)$$
It is possible to find a closed-form expression for Def. 3 [5, Section IV.C.2]. This definition is interesting, as setting $Z$ equal to a constant allows us to retrieve $I_\alpha(X,Y)$. Moreover, starting from Definition 3 and its closed-form expression, one can retrieve the following result.
Theorem 1. Let $(\mathcal{X}\times\mathcal{Y}\times\mathcal{Z},\mathcal{F},P_{XYZ})$ be a probability space. Let $P_Z$ and $P_{X|Z}$ be the induced marginal and conditional distributions. Assume that $P_{XYZ}\ll P_Z P_{Y|Z}P_{X|Z}$. Given $E\in\mathcal{F}$ and $z\in\mathcal{Z}$, $y\in\mathcal{Y}$, let $E_{z,y}=\{x:(x,y,z)\in E\}$. Then, for fixed $\alpha\ge 1$:
$$P_{XYZ}(E) \le \left(\mathbb{E}_Z\!\left[\operatorname*{ess\,sup}_{P_{Y|Z}}P_{X|Z}(E_{Z,Y})\right]\right)^{\frac{\alpha-1}{\alpha}}\cdot\exp\!\left(\frac{\alpha-1}{\alpha}I_\alpha^{Y|Z}(X,Y|Z)\right). \qquad (5)$$
Proof.
\begin{align}
P_{XYZ}(E) &= \mathbb{E}_{P_Z P_{Y|Z}P_{X|Z}}\!\left[\frac{dP_{XYZ}}{dP_Z P_{Y|Z}P_{X|Z}}\mathbb{1}_E\right] \tag{6}\\
&\le \mathbb{E}_{P_Z}^{\frac{1}{\alpha''}}\!\left[\mathbb{E}_{P_{Y|Z}}^{\frac{\alpha''}{\alpha'}}\!\left[\mathbb{E}_{P_{X|Z}}^{\frac{\alpha'}{\alpha}}\!\left[\left(\frac{dP_{XYZ}}{dP_Z P_{Y|Z}P_{X|Z}}\right)^{\!\alpha}\right]\right]\right]\cdot\mathbb{E}_{P_Z}^{\frac{1}{\gamma''}}\!\left[\mathbb{E}_{P_{Y|Z}}^{\frac{\gamma''}{\gamma'}}\!\left[\mathbb{E}_{P_{X|Z}}^{\frac{\gamma'}{\gamma}}\!\left[\mathbb{1}_E^{\gamma}\right]\right]\right] \tag{7}\\
&\le \left(\mathbb{E}_Z\!\left[\operatorname*{ess\,sup}_{P_{Y|Z}}P_{X|Z}(E_{Z,Y})\right]\right)^{\frac{\alpha-1}{\alpha}}\cdot\exp\!\left(\frac{\alpha-1}{\alpha}I_\alpha^{Y|Z}(X,Y|Z)\right). \tag{8}
\end{align}
The first inequality follows from applying Hölder's inequality three times; the six parameters are such that $\frac{1}{\alpha''}+\frac{1}{\gamma''}=\frac{1}{\alpha'}+\frac{1}{\gamma'}=\frac{1}{\alpha}+\frac{1}{\gamma}=1$. (8) follows from setting $\alpha''=\alpha$ and $\alpha'=1$, which imply $\gamma''=\gamma$ and $\gamma'\to\infty$.
Another property of $I_\alpha^{Y|Z}$ is that, similarly to the unconditional $I_\alpha$ [4], taking the limit $\alpha\to\infty$ we have that $I_\alpha^{Y|Z}(X,Y|Z)\xrightarrow{\alpha\to\infty}\mathcal{L}(X\to Y|Z)$, leading us to the following:

¹It is clearly possible to minimise over more than one random variable at once, as has been done in [5], [11] in the context of both regular $I_\alpha(X,Y)$ and conditional $I_\alpha(X,Y|Z)$.
Corollary 1. Under the same assumptions of Theorem 1:
$$P_{XYZ}(E) \le \mathbb{E}_Z\!\left[\operatorname*{ess\,sup}_{P_{Y|Z}}P_{X|Z}(E_{Z,Y})\right]\exp\big(\mathcal{L}(X\to Y|Z)\big). \qquad (9)$$
C. $I_\alpha^{Z}(X,Y|Z)$
As discussed in Section III-A, another natural candidate definition of conditional α-mutual information is the following:
Definition 4. Under the same assumptions of Definition 3:
$$I_\alpha^{Z}(X,Y|Z) = \min_{Q_Z} D_\alpha(P_{XYZ}\|P_{X|Z}P_{Y|Z}Q_Z). \qquad (10)$$
To the best of our knowledge, Definition 4 has not been considered elsewhere. As for $I_\alpha^{Y|Z}(X,Y|Z)$, it is possible to compute a closed-form expression for $I_\alpha^{Z}(X,Y|Z)$. We will limit ourselves to discrete random variables for simplicity.
Theorem 2. Let $\alpha>0$ and let $X,Y,Z$ be three discrete random variables. Then
$$I_\alpha^{Z}(X,Y|Z) = \frac{\alpha}{\alpha-1}\log\sum_{z}P_Z(z)\left(\sum_{x,y}P_{XY|Z=z}(x,y)^{\alpha}\big(P_{X|Z=z}(x)P_{Y|Z=z}(y)\big)^{1-\alpha}\right)^{\frac{1}{\alpha}}.$$
The proof follows from the definition of $I_\alpha^{Z}(X,Y|Z)$ and Sibson's identity [9, Eq. (12)].
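The following minimal sketch (our own, not from the paper; Python with NumPy, finite alphabets, and a randomly generated pmf) evaluates the closed form of Theorem 2 directly:

```python
import numpy as np

def i_alpha_z(p_xyz, alpha):
    """I^Z_alpha(X,Y|Z) via the closed form of Theorem 2,
    for a discrete joint pmf p_xyz[x, y, z]."""
    p_xyz = np.asarray(p_xyz, float)
    p_z = p_xyz.sum(axis=(0, 1))                   # P_Z(z)
    total = 0.0
    for z in range(p_xyz.shape[2]):
        p_xy = p_xyz[:, :, z] / p_z[z]             # P_{XY|Z=z}(x, y)
        p_x = p_xy.sum(axis=1, keepdims=True)      # P_{X|Z=z}(x)
        p_y = p_xy.sum(axis=0, keepdims=True)      # P_{Y|Z=z}(y)
        inner = np.sum(p_xy ** alpha * (p_x * p_y) ** (1 - alpha))
        total += p_z[z] * inner ** (1 / alpha)
    return alpha / (alpha - 1) * np.log(total)

rng = np.random.default_rng(0)
p = rng.random((2, 2, 3)); p /= p.sum()            # a random joint pmf P_XYZ
print(i_alpha_z(p, 2.0))
```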
Mirroring Section III-B, we can state an analogue of Theorem 1 for $I_\alpha^{Z}$:
Theorem 3. Let $(\mathcal{X}\times\mathcal{Y}\times\mathcal{Z},\mathcal{F},P_{XYZ})$ be a probability space. Let $P_{Y|Z}$ and $P_{X|Z}$ be the induced conditional distributions. Assume that $P_{XYZ}\ll P_Z P_{Y|Z}P_{X|Z}$. Given $E\in\mathcal{F}$ and $z\in\mathcal{Z}$, let $E_z=\{(x,y):(x,y,z)\in E\}$. Then, for fixed $\alpha\ge 1$:
$$P_{XYZ}(E) \le \left(\operatorname*{ess\,sup}_{P_Z}P_{X|Z}P_{Y|Z}(E_Z)\right)^{\frac{\alpha-1}{\alpha}}\cdot\exp\!\left(\frac{\alpha-1}{\alpha}I_\alpha^{Z}(X,Y|Z)\right). \qquad (11)$$
This type of result is useful as it allows us to approximate the probability of $E$ under a joint distribution with the probability of $E$ under a different measure encoding some independence (typically easier to analyse); in this specific case, the measure induced by a Markov chain. Such bounds represent, for us, the main application-oriented employment of these measures [1]. Notice that, other than using $I_\alpha^{Z}$ instead of $I_\alpha^{Y|Z}$, Theorem 3 involves a different essential supremum as compared to Theorem 1. Moving on with the comparison, unlike the measure in Definition 3, the information measure we are defining here is symmetric. Moreover, setting $Z$ to a constant in Definition 4 does not allow us to retrieve $I_\alpha(X,Y)$, but rather $D_\alpha(P_{XY}\|P_X P_Y)$.
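To illustrate how such a bound is used in practice, the following self-contained sketch (ours; Python with NumPy, a random discrete joint pmf and a random event E) evaluates both sides of (11):

```python
import numpy as np

def i_alpha_z(p_xyz, alpha):
    """Closed form of Theorem 2 for a discrete joint pmf p_xyz[x, y, z]."""
    p_z = p_xyz.sum(axis=(0, 1))
    acc = 0.0
    for z in range(p_xyz.shape[2]):
        p_xy = p_xyz[:, :, z] / p_z[z]
        p_x, p_y = p_xy.sum(1, keepdims=True), p_xy.sum(0, keepdims=True)
        acc += p_z[z] * np.sum(p_xy ** alpha * (p_x * p_y) ** (1 - alpha)) ** (1 / alpha)
    return alpha / (alpha - 1) * np.log(acc)

alpha = 2.0
rng = np.random.default_rng(1)
p = rng.random((3, 3, 2)); p /= p.sum()        # joint pmf P_XYZ
E = rng.random((3, 3, 2)) < 0.4                # a random event E (boolean mask)

p_z = p.sum(axis=(0, 1))
lhs = p[E].sum()                               # P_XYZ(E)
prod_mass = []                                 # (P_{X|Z=z} x P_{Y|Z=z})(E_z) for each z
for z in range(p.shape[2]):
    p_xy = p[:, :, z] / p_z[z]
    p_x, p_y = p_xy.sum(1, keepdims=True), p_xy.sum(0, keepdims=True)
    prod_mass.append(((p_x * p_y) * E[:, :, z]).sum())
ess_sup = max(prod_mass)                       # ess sup over P_Z (all z have positive mass)
rhs = ess_sup ** ((alpha - 1) / alpha) * np.exp((alpha - 1) / alpha * i_alpha_z(p, alpha))
print(lhs, "<=", rhs)                          # Theorem 3 predicts lhs <= rhs
```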
D. An additive SDPI-like inequality
Definition 4 shares some interesting properties with $I_\alpha(X,Y)$. One such property is a rewriting of $I_\alpha(X,Y)$ in terms of $D_\alpha$. This allows us to leverage the strong data processing inequality (SDPI) for Hellinger integrals of order α, which in turn allows us to provide an SDPI-like result for $I_\alpha^{Z}$. A definition of SDPIs can be found in [12, Def. 3.1]. More precisely, we can write
\begin{align}
I_\alpha^{Z}(X,Y|Z) &= \frac{\alpha}{\alpha-1}\log\mathbb{E}_Z\!\left[\exp\!\left(\frac{\alpha-1}{\alpha}D_\alpha(P_{XY|Z}\|P_{X|Z}P_{Y|Z})\right)\right]\nonumber\\
&= \frac{\alpha}{\alpha-1}\log\mathbb{E}_Z\!\left[D_{f_\alpha}(P_{XY|Z}\|P_{X|Z}P_{Y|Z})^{1/\alpha}\right], \tag{12}
\end{align}
where $D_{f_\alpha}$ denotes the Hellinger integral of order α, i.e., given two measures $\mathcal{P},\mathcal{Q}$, $D_{f_\alpha}(\mathcal{P}\|\mathcal{Q})=\mathbb{E}_{\mathcal{Q}}\!\left[\left(\frac{d\mathcal{P}}{d\mathcal{Q}}\right)^{\alpha}\right]$.
Leveraging Eq. (12) we can state the following.
Theorem 4. Let $\alpha>1$ and let $X,Y,W,Z$ be four random variables such that $(Z,W)-X-Y$ is a Markov chain. Then
$$I_\alpha^{Z}(W,Y|Z) \le \frac{1}{\alpha-1}\log\eta_{f_\alpha}(P_{Y|X}) + I_\alpha^{Z}(W,X|Z), \qquad (13)$$
where we denote by $\eta_{f_\alpha}(P_{Y|X})$ the contraction parameter of the Hellinger integral of order α, i.e., for a given Markov kernel $K$, $\eta_{f_\alpha}(K)=\sup_{\mu,\nu\neq\mu}\frac{D_{f_\alpha}(K\mu\|K\nu)}{D_{f_\alpha}(\mu\|\nu)}$ [12, Def. III.1].
The proof follows from Eq. (12) and a reasoning similar to [13, Lemma 3], but applied to the $D_{f_\alpha}$-divergence instead of the KL-divergence.
Remark 2. Notice that data processing inequalities are simply a consequence of the convexity of $f$ [14, Thm 4.2], and $f_\alpha(x)=x^{\alpha}$ is indeed convex. Hence, although the Hellinger integral is not normalised to be $0$ whenever the measures are the same, it does satisfy a DPI. Moreover, the contraction parameter of a strong data-processing inequality is always less than or equal to $1$. Hence, $\log(\eta_{f_\alpha}(K))\le 0$.
An analogue of Theorem 4 for Definition 3 does not seem possible.
Remark 3. One can state a result similar to Theorem 4 for the unconditional $I_\alpha$. Specifically, we can write
$$I_\alpha(X,Y) = \frac{\alpha}{\alpha-1}\log\mathbb{E}_Y\!\left[D_{f_\alpha}^{1/\alpha}(P_{X|Y}\|P_X)\right].$$
Since $I_\alpha$ is an asymmetric quantity, we only get the SDPI-like result in one direction. Namely, given the Markov chain $W-X-Y$, we can relate $I_\alpha(W,Y)$ and $I_\alpha(X,Y)$ via an SDPI (but, for instance, not $I_\alpha(W,X)$ and $I_\alpha(X,Y)$), as follows:
$$I_\alpha(W,Y) \le \frac{1}{\alpha-1}\log\eta_{f_\alpha}(P_{W|X}) + I_\alpha(X,Y). \qquad (14)$$
Theorem 4 and Eq. (14) represent an SDPI-like inequality of an unusual form: the (function of the) η parameter is added to the information measure rather than multiplied. However, one of the main applications of $I_\alpha$ (conditional or not) in bounds requires exponentiating the quantity, which brings us back to a multiplicative form. To make this statement more precise, let us state the following:
Corollary 2. Under the same assumptions of Theorem 4 we have that:
$$P_{WYZ}(E) \le \left(\operatorname*{ess\,sup}_{P_Z}P_{W|Z}P_{Y|Z}(E_Z)\right)^{\frac{\alpha-1}{\alpha}}\cdot\eta_{f_\alpha}(P_{Y|X})^{1/\alpha}\cdot\exp\!\left(\frac{\alpha-1}{\alpha}I_\alpha^{Z}(W,X|Z)\right).$$
Corollary 2 follows directly from Theorem 3 and Theorem 4.
Remark 4. A similar result can be derived for the unconditional $I_\alpha$, starting from (14) and [1, Corollary 1].
E. Discussion on $I_\alpha^{Z}$ and $I_\alpha^{Y|Z}$
Let us now use Theorems 1 and 3 as a means of comparison for the two conditional $I_\alpha$'s. These results are useful whenever we want to control the joint measure of some event $E$ but we only know how to control it (e.g., via an upper bound) under some hypothesis of independence [1]. Consider the factorisation of $P_{XYZ}$ under $X-Z-Y$ to be fixed. In the context of Theorems 1 and 3, according to the measure we know how to control, different conditional $I_\alpha$'s will appear on the right-hand side of the bound (cf. Eqs. (5), (9) and (11)). For instance, if we assume that we are able to control $\operatorname{ess\,sup}_{P_Z}P_{X|Z}P_{Y|Z}(E_Z)$, then Theorem 3 tells us that $I_\alpha^{Z}(X,Y|Z)$ is the measure to study. If we assume instead that we are able to control terms of the form $\mathbb{E}_{P_Z}[\operatorname{ess\,sup}_{P_{Y|Z}}P_{X|Z}(E_{Z,Y})]$, then $I_\alpha^{Y|Z}(X,Y|Z)$ would be the measure to analyse. (Quantities like $\mathbb{E}_{P_Z}[\operatorname{ess\,sup}_{P_{Y|Z}}P_{X|Z}(E_{Z,Y})]$, for specific choices of $E$, are known in the literature as "small-ball probabilities" and have found applications in distributed estimation problems and distributed function computation [13], [15].) More generally, we can find a duality between the measure over which we take the (essential) supremum on the right-hand side of the bounds and the corresponding minimisation in the definition of conditional $I_\alpha$. The same measures also play a fundamental role in defining the hypothesis testing problem that endows the information measure with its operational meaning, as we will see in the next section.
IV. OPE RATIONAL MEANING
Drawing inspiration from [5], [11], [16], let us consider the
following composite hypothesis testing problem. Fix a pmf
PXY Z , observing a sequence of triples {(Xi, Yi, Zi)}n
i=1 we
want to decide whether:
0) {(Xi, Yi, Zi)}n
i=1 is sampled in an iid fashion from
PXY Z (null hypothesis);
1) {(Xi, Yi, Zi)}n
i=1 is sampled in an iid fashion from
QZPX|ZPY|Z, where QZis an arbitrary pmf over the
space Z(alternative hypothesis).
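To make the two hypotheses concrete, the following sketch (ours; Python with NumPy, finite alphabets, a randomly generated $P_{XYZ}$ and an arbitrarily chosen $Q_Z$) generates a sequence of triples under each hypothesis:

```python
import numpy as np

rng = np.random.default_rng(0)
p_xyz = rng.random((2, 2, 3)); p_xyz /= p_xyz.sum()   # fixed pmf P_XYZ
p_z = p_xyz.sum(axis=(0, 1))                          # P_Z
p_xy_given_z = p_xyz / p_z                            # P_{XY|Z=z}(x, y), indexed [x, y, z]
q_z = np.array([0.2, 0.3, 0.5])                       # an arbitrary pmf Q_Z (alternative)

def sample(n, null=True):
    """n iid triples: from P_XYZ under the null, from Q_Z P_{X|Z} P_{Y|Z} under the alternative."""
    out = []
    for _ in range(n):
        z = rng.choice(3, p=p_z if null else q_z)
        p_xy = p_xy_given_z[:, :, z]
        if null:
            x, y = divmod(rng.choice(4, p=p_xy.ravel()), 2)   # (x, y) ~ P_{XY|Z=z}
        else:
            x = rng.choice(2, p=p_xy.sum(axis=1))             # x ~ P_{X|Z=z}
            y = rng.choice(2, p=p_xy.sum(axis=0))             # y ~ P_{Y|Z=z}, independent of x
        out.append((int(x), int(y), int(z)))
    return out

print(sample(5, null=True))   # a sample from the null hypothesis
print(sample(5, null=False))  # a sample from the alternative hypothesis
```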
We can relate $I_\alpha^{Z}(X,Y|Z)$ to the error exponent of the hypothesis testing problem just defined. This can be seen as a more lenient test for Markovity, where the measure of $Z$ is allowed to vary. Similarly to before, there is a link between which measure is allowed to vary and the minimisation in the definition of conditional $I_\alpha$. Choosing, for instance, to minimise over $Q_X$ allows this measure to vary in the alternative hypothesis. Using Theorem 3 we can already connect $I_\alpha^{Z}$ to the problem in question. Given a test $T_n:\{\mathcal{X}\times\mathcal{Y}\times\mathcal{Z}\}^n\to\{0,1\}$, we will denote by $p_n^1$ (Type-1 error) the probability of wrongfully choosing hypothesis 1 given that the sequence is distributed according to $P_{XYZ}^{\otimes n}$, i.e. $p_n^1 = P_{XYZ}^{\otimes n}\big(T_n(\{(X_i,Y_i,Z_i)\}_{i=1}^n)=1\big)$, and by $p_n^2$ (Type-2 error) the maximum probability of wrongfully choosing hypothesis 0 given that the sequence is distributed according to $(Q_Z P_{X|Z}P_{Y|Z})^{\otimes n}$ for some $Q_Z$, i.e. $p_n^2 = \sup_{Q_Z\in\mathcal{P}(\mathcal{Z})}(Q_Z P_{X|Z}P_{Y|Z})^{\otimes n}\big(T_n(\{(X_i,Y_i,Z_i)\}_{i=1}^n)=0\big)$.
Theorem 5. Let $n>0$ and let $T_n:\{\mathcal{X}\times\mathcal{Y}\times\mathcal{Z}\}^n\to\{0,1\}$ be a deterministic test that, upon observing the sequence $\{(X_i,Y_i,Z_i)\}_{i=1}^n$, chooses either the null or the alternative hypothesis. Assume that there exists $R>0$ such that for every $Q_Z\in\mathcal{P}(\mathcal{Z})$ we have $(Q_Z P_{X|Z}P_{Y|Z})^{\otimes n}\big(T_n(\{(X_i,Y_i,Z_i)\}_{i=1}^n)=0\big)\le\exp(-nR)$. Then, for $\alpha\ge 1$,
$$1-p_n^1 \le \exp\!\left(-\frac{\alpha-1}{\alpha}n\big(R - I_\alpha^{Z}(X,Y|Z)\big)\right). \qquad (15)$$
Proof. We have that $1-p_n^1 = P_{XYZ}^{\otimes n}\big(T_n(\{(X_i,Y_i,Z_i)\}_{i=1}^n)=0\big)$. Starting from Theorem 3:
$$1-p_n^1 \le \left(\operatorname*{ess\,sup}_{P_Z^n}P_{X|Z}^n P_{Y|Z}^n(E_Z^n)\right)^{1/\gamma}\cdot\exp\!\left(\frac{\alpha-1}{\alpha}I_\alpha^{Z}(X^n,Y^n|Z^n)\right). \qquad (16)$$
Since we assumed the exponential decay of $(Q_Z P_{X|Z}P_{Y|Z})^{\otimes n}\big(T_n(\{(X_i,Y_i,Z_i)\}_{i=1}^n)=0\big)$ for every $Q_Z$, we also have that $\operatorname{ess\,sup}_{P_Z^n}P_{X|Z}^n P_{Y|Z}^n(E_Z^n)\le\exp(-nR)$ (consider a measure $\tilde{Q}_Z$ that puts all the mass on the sequence achieving the essential supremum in (16)). Given the assumption of independence on the triples $\{(X_i,Y_i,Z_i)\}_{i=1}^n$ and following a reasoning similar to the one in Eqn. (49) in [7], we have that $I_\alpha^{Z}(X^n,Y^n|Z^n)=nI_\alpha^{Z}(X,Y|Z)$. The conclusion then follows from algebraic manipulations of (16).
This result implies that if we assume an exponential decay for the Type-2 error $p_n^2$ and $R>I_\alpha^{Z}(X,Y|Z)$, then we have an exponential decay of the probability of correctly choosing the null hypothesis as well. Moreover, for every $n>0$:
$$\frac{1}{n}\log(1-p_n^1) \le -\frac{\alpha-1}{\alpha}\big(R - I_\alpha^{Z}(X,Y|Z)\big). \qquad (17)$$
We can conclude that:
$$\limsup_{n\to\infty}\frac{1}{n}\log(1-p_n^1) \le -\sup_{\alpha\in(1,+\infty]}\frac{\alpha-1}{\alpha}\big(R - I_\alpha^{Z}(X,Y|Z)\big).$$
A. Error exponents
Following the approach undertaken in [11] we can also
define an achievable error-exponent pair for the hypothesis
testing problem in question.
Definition 5. A pair of error exponents $(E_P,E_Q)\in\mathbb{R}^2$ is called achievable w.r.t. the above hypothesis testing problem if there exists a sequence of tests $\{T_n\}_{n=1}^{\infty}$ such that²:
$$\liminf_{n\to\infty}-\frac{1}{n}\log P_{XYZ}^{\otimes n}\big(T_n(\{(X_i,Y_i,Z_i)\}_{i=1}^n)=1\big) > E_P,$$
$$\liminf_{n\to\infty}\inf_{Q_Z}-\frac{1}{n}\log(Q_Z P_{X|Z}P_{Y|Z})^{\otimes n}\big(T_n(\{(X_i,Y_i,Z_i)\}_{i=1}^n)=0\big) > E_Q.$$
We can then define the error-exponent functions [11] $E_P:\mathbb{R}\to\mathbb{R}\cup\{+\infty\}$ and $E_Q:\mathbb{R}\to\mathbb{R}\cup\{+\infty\}$ as follows:
$$E_P(E_Q) = \sup\{E_P\in\mathbb{R}:(E_P,E_Q)\text{ is achievable}\}, \qquad (18)$$
$$E_Q(E_P) = \sup\{E_Q\in\mathbb{R}:(E_P,E_Q)\text{ is achievable}\}. \qquad (19)$$
It is now possible to relate $I_\alpha^{Z}(X,Y|Z)$, for $\alpha\in(0,1]$, with both the Fenchel conjugate $E_P^{\star}(\cdot)$ of $E_P(\cdot)$ and with $E_P^{\star\star}(\cdot)$. First, let us characterise $E_P^{\star}$.
Lemma 1.
$$E_P^{\star}(\lambda) = \begin{cases}+\infty, & \text{if }\lambda>0,\\[2pt] \lambda I_{\frac{1}{1-\lambda}}(X,Y|Z), & \text{otherwise.}\end{cases} \qquad (20)$$
Proof. Assume $\lambda\le 0$. Then
\begin{align*}
E_P^{\star}(\lambda) &= \sup_{E_Q\in\mathbb{R}}\big[\lambda E_Q - E_P(E_Q)\big]\\
&= \sup_{E_Q\in\mathbb{R}}\Big[\lambda E_Q - \inf_{\substack{R_{XYZ}:\\ D(R_{XYZ}\|R_Z P_{X|Z}P_{Y|Z})\le E_Q}} D(R_{XYZ}\|P_{XYZ})\Big]\\
&\overset{(a)}{=} \sup_{E_Q\in\mathbb{R}}\ \sup_{\substack{R_{XYZ}:\\ D(R_{XYZ}\|R_Z P_{X|Z}P_{Y|Z})\le E_Q}}\big[\lambda E_Q - D(R_{XYZ}\|P_{XYZ})\big]\\
&= \sup_{R_{XYZ}}\ \sup_{\substack{E_Q\in\mathbb{R}:\\ E_Q\ge D(R_{XYZ}\|R_Z P_{X|Z}P_{Y|Z})}}\big[\lambda E_Q - D(R_{XYZ}\|P_{XYZ})\big]\\
&\overset{(b)}{=} \sup_{R_{XYZ}}\big[\lambda D(R_{XYZ}\|R_Z P_{X|Z}P_{Y|Z}) - D(R_{XYZ}\|P_{XYZ})\big]\\
&\overset{(c)}{=} (\lambda-1)\inf_{Q_Z}\inf_{R_{XYZ}}\Big[-\frac{\lambda}{1-\lambda}D(R_{XYZ}\|Q_Z P_{X|Z}P_{Y|Z}) + \frac{1}{1-\lambda}D(R_{XYZ}\|P_{XYZ})\Big]\\
&\overset{(d)}{=} \lambda\inf_{Q_Z}D_{\frac{1}{1-\lambda}}(P_{XYZ}\|Q_Z P_{X|Z}P_{Y|Z})\\
&\overset{(e)}{=} \lambda I_{\frac{1}{1-\lambda}}(X,Y|Z).
\end{align*}
Step (a) follows from an analogue of [11, Corollary 2] for our testing problem; step (b) follows because, given that $\lambda\le 0$, the supremum over $E_Q$ is achieved at $E_Q = D(R_{XYZ}\|R_Z P_{X|Z}P_{Y|Z})$. Step (c) follows from an analogue of [11, Lemma 4], (d) follows from [8, Theorem 3] and, to conclude, (e) follows from Definition 4. For $\lambda>0$ the reasoning is identical to [11, Lemma 12].
²As pointed out in [11], despite having bounds like in Theorem 5 decaying with two rates $E_P,E_Q$, we cannot conclude anything on the achievability of the pair.
Now, we can prove the connection to $E_P^{\star\star}(\cdot)$.³
Theorem 6. Given $E_Q,E_P\in\mathbb{R}$,
$$E_P^{\star\star}(E_Q) = \sup_{\alpha\in(0,1]}\frac{1-\alpha}{\alpha}\big(I_\alpha(X,Y|Z)-E_Q\big), \qquad (21)$$
$$E_Q^{\star\star}(E_P) = \sup_{\alpha\in(0,1]}\left(I_\alpha(X,Y|Z)-\frac{\alpha}{1-\alpha}E_P\right). \qquad (22)$$
Proof.
\begin{align}
E_P^{\star\star}(E_Q) &= \sup_{\lambda\in\mathbb{R}}\big(\lambda E_Q - E_P^{\star}(\lambda)\big) \tag{23}\\
&\overset{(f)}{=} \sup_{\lambda\le 0}\big(\lambda E_Q - E_P^{\star}(\lambda)\big) \tag{24}\\
&\overset{(g)}{=} \sup_{\lambda\le 0}\lambda\big(E_Q - I_{\frac{1}{1-\lambda}}(X,Y|Z)\big) \tag{25}\\
&\overset{(h)}{=} \sup_{\alpha\in(0,1]}\frac{1-\alpha}{\alpha}\big(I_\alpha(X,Y|Z)-E_Q\big). \tag{26}
\end{align}
Here (f) follows from $E_P^{\star}(\lambda)=+\infty$ for $\lambda>0$, (g) follows from Lemma 1, and (h) follows by setting $\alpha=\frac{1}{1-\lambda}$. The proof of (22) follows from similar arguments.
V. CONCLUSIONS
We have considered the problem of defining a conditional version of Sibson's α-Mutual Information. Drawing inspiration from an equivalent formulation of $I_\alpha(X,Y)$ as $\min_{Q_Y}D_\alpha(P_{XY}\|P_XQ_Y)$, we saw how several such propositions can be made for an $I_\alpha(X,Y|Z)$. Two have already been analysed in [5]. We proposed here a general approach that allows us to associate with each such measure:
1) a bound, allowing us to approximate the probability $P_{XYZ}(E)$ with the probability of $E$ under a product distribution induced by the Markov chain $X-Z-Y$;
2) an operational meaning as the error exponent of a hypothesis testing problem where the alternative hypothesis is a Markov-like distribution and some measures are allowed to vary.
A simple relationship between the hypothesis testing problem and the information measure can already be found using the bound described in 1), without requiring any extra machinery. To conclude, the usefulness of a measure clearly comes from its applications and ease of computability. While the latter remains the same for all the possible conditional $I_\alpha$'s, the former can vary according to the definition. With this in mind, the various definitions are equally meaningful and it seems reasonable to use the conditional $I_\alpha$ that best suits the specific application at hand.
ACKNOWLEDGMENT
The work in this paper was supported in part by the
Swiss National Science Foundation under Grants 169294 and
200364.
³Notice that $E_P^{\star\star}(\cdot)$ is not guaranteed to be equal to $E_P(\cdot)$. Indeed, it is possible to find examples where the function is not convex and thus all we retrieve is a lower bound on $E_P$ [11, Example 14].
REFERENCES
[1] A. R. Esposito, M. Gastpar, and I. Issa, "Generalization error bounds via Rényi-, f-divergences and maximal leakage," accepted for publication in IEEE Transactions on Information Theory, 2021. [Online]. Available: http://arxiv.org/abs/1912.01439
[2] R. Gallager, “A simple derivation of the coding theorem and some
applications,” IEEE Transactions on Information Theory, vol. 11, no. 1,
pp. 3–18, 1965.
[3] R. G. Gallager, Information Theory and Reliable Communication. USA:
John Wiley & Sons, Inc., 1968.
[4] I. Issa, A. B. Wagner, and S. Kamath, “An operational approach to
information leakage,” IEEE Transactions on Information Theory, vol. 66,
no. 3, pp. 1625–1657, 2020.
[5] M. Tomamichel and M. Hayashi, "Operational interpretation of Rényi information measures via composite hypothesis testing against product and Markov distributions," IEEE Transactions on Information Theory, vol. 64, no. 2, pp. 1064–1082, 2018.
[6] J. Liao, L. Sankar, O. Kosut, and F. P. Calmon, “Robustness of maximal
α-leakage to side information,” in 2019 IEEE International Symposium
on Information Theory (ISIT), 2019, pp. 642–646.
[7] S. Verdú, "α-mutual information," in 2015 Information Theory and Applications Workshop, ITA 2015, San Diego, CA, USA, February 1-6, 2015, pp. 1–6.
[8] T. van Erven and P. Harremoës, "Rényi divergence and Kullback-Leibler divergence," IEEE Trans. Inf. Theory, vol. 60, no. 7, pp. 3797–3820, July 2014.
[9] I. Csiszár, "Generalized cutoff rates and Rényi's information measures," IEEE Transactions on Information Theory, vol. 41, no. 1, pp. 26–34, Jan 1995.
[10] R. Sibson, "Information radius," Z. Wahrscheinlichkeitstheorie und verwandte Gebiete, vol. 14, pp. 149–160, 1969.
[11] A. Lapidoth and C. Pfister, "Testing against independence and a Rényi information measure," in 2018 IEEE Information Theory Workshop (ITW), 2018, pp. 1–5.
[12] M. Raginsky, "Strong data processing inequalities and Φ-Sobolev inequalities for discrete channels," IEEE Transactions on Information Theory, vol. 62, no. 6, pp. 3355–3389, 2016.
[13] A. Xu and M. Raginsky, “Information-theoretic lower bounds for
distributed function computation,” IEEE Transactions on Information
Theory, vol. 63, no. 4, pp. 2314–2337, 2017.
[14] Y. Wu, “Lecture notes on: Information-theoretic methods for high-
dimensional statistics,” 2020.
[15] A. Xu and M. Raginsky, "Information-theoretic lower bounds on Bayes risk in decentralized estimation," IEEE Transactions on Information Theory, vol. 63, no. 3, pp. 1580–1600, 2017.
[16] A. Lapidoth and C. Pfister, “Two measures of dependence,” Entropy,
vol. 21, no. 8, 2019. [Online]. Available: https://www.mdpi.com/1099-
4300/21/8/778