On conditional Sibson's α-Mutual Information

Amedeo Roberto Esposito, Diyuan Wu, Michael Gastpar
School of Computer and Communication Sciences
EPFL, Lausanne, Switzerland
{amedeo.esposito, diyuan.wu, michael.gastpar}@epfl.ch
Abstract—In this work, we analyse how to define a conditional version of Sibson's α-Mutual Information. Several such definitions can be advanced and they all lead to different information measures with different (but similar) operational meanings. We will analyse in detail one such definition, compute a closed-form expression for it and endow it with an operational meaning, while also considering some applications. The alternative definitions will also be mentioned and compared.
Index Terms—Rényi Divergence, Sibson's Mutual Information, Conditional Mutual Information, Information Measures
I. INTRODUCTION
Sibson’s α-Mutual Information is a generalization of Shan-
non’s Mutual Information with several applications in prob-
ability, information and learning theory [1]. In particular, it
has been used to provide concentration inequalities in settings
where the random variables are not independent, with applica-
tions to learning theory [1]. The measure is also connected to
Gallager’s exponent function, a central object in the channel
coding problem both for rates below and above capacity [2],
[3]. Moreover, a new operational meaning has been given
to the measure with α= +when a novel measure of
information leakage has been proposed in [4], under the name
of Maximal leakage. Similarly to Iα, Maximal Leakage has
recently found applications in learning and probability theory
[1]. Howerever, while Maximal Leakage has a corresponding
conditional form [4], Sibson’s α-Mutual Information lacks an
agreed upon conditional version. In this work we analyse a
path that could be taken in defining such a measure and will
focus on one specific choice, given in Definition 4 below.
We discuss key properties of this choice and endow it with
an operational meaning as the error-exponent in a properly
defined hypothesis testing problem. Moreover, we hint at some
application of this measure to other settings as well. The
choice we make is not unique and we will explain how making
different choices leads to different information measures, all
of them equally meaningful. A conditional version of Sibson’s
Iαhas been presented in [5]. We briefly present their measure
in Sec. III-B along with a new result that we believe to
be of interest. We then present in Sec. III-C a different
choice for conditional Iα. We show some properties of this
measure, compare the two objects in Sec. III-E and then
discuss a general approach to associate an operational meaning
to these measures in Sec. IV. Alternative routes have been
considered in [6] where Arimoto’s generalisation of the Mutual
Information has been considered and a conditional version has
been given.
II. BACKGROUND AND DEFINITIONS
Given a function $f:\mathbb{R}\to[-\infty,+\infty]$ we can define its convex conjugate $f^\star:\mathbb{R}\to[-\infty,+\infty]$ as follows:
$$f^\star(\lambda) = \sup_{x}\,\big(\lambda x - f(x)\big). \qquad (1)$$
Given a function $f$, $f^\star$ is guaranteed to be lower semi-continuous and convex. We can re-apply the conjugation operator to $f^\star$ and obtain $f^{\star\star}$. If $f$ is convex and lower semi-continuous then $f=f^{\star\star}$; otherwise, all we can say is that $f^{\star\star}(x)\le f(x)$ for every $x\in\mathbb{R}$. Throughout, $\log$ denotes the natural logarithm.
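As a quick illustration (a standard example, not taken from this paper): for $f(x)=x^2/2$ the supremum in (1) is attained at $x=\lambda$, giving $f^\star(\lambda)=\lambda^2/2$; since this $f$ is convex and lower semi-continuous, $f^{\star\star}=f$. The conjugate pair $(E_P, E^\star_P)$ appearing in Section IV-A is to be read in exactly this sense.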
A. Sibson’s α-Mutual Information
Introduced by Rényi as a generalization of entropy and KL-divergence, α-divergence has found many applications ranging from hypothesis testing to guessing and several other statistical inference and coding problems [7]. Indeed, it has several useful operational interpretations (e.g., hypothesis testing, and the cut-off rate in block coding [8], [9]). It can be defined as follows [8].
Definition 1. Let $(\Omega,\mathcal{F},P)$, $(\Omega,\mathcal{F},Q)$ be two probability spaces. Let $\alpha>0$ be a positive real number different from 1. Consider a measure $\mu$ such that $P\ll\mu$ and $Q\ll\mu$ (such a measure always exists, e.g. $\mu=(P+Q)/2$) and denote with $p,q$ the densities of $P,Q$ with respect to $\mu$. The α-Divergence of $P$ from $Q$ is defined as follows:
$$D_\alpha(P\|Q) = \frac{1}{\alpha-1}\log\int p^\alpha q^{1-\alpha}\,d\mu. \qquad (2)$$
Remark 1. The definition is independent of the chosen measure $\mu$. It is indeed possible to show that $\int p^\alpha q^{1-\alpha}\,d\mu = \int\big(\tfrac{q}{p}\big)^{1-\alpha}dP$, and that whenever $P\ll Q$ or $0<\alpha<1$, we have $\int p^\alpha q^{1-\alpha}\,d\mu = \int\big(\tfrac{p}{q}\big)^{\alpha}dQ$; see [8].
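For discrete distributions the integral in (2) is a finite sum, which makes $D_\alpha$ straightforward to evaluate. The following sketch (our own illustration, using hypothetical distributions) computes it and checks the behaviour near α = 1 against the KL divergence.

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """D_alpha(P||Q) of Definition 1 for discrete P, Q given as 1-D arrays.

    Assumes alpha > 0, alpha != 1; terms with p(x) = 0 contribute nothing.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.log(np.sum(p[mask] ** alpha * q[mask] ** (1 - alpha))) / (alpha - 1)

# Hypothetical example distributions (illustration only).
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

for alpha in [0.5, 2.0, 10.0]:
    print(f"D_{alpha}(P||Q) = {renyi_divergence(p, q, alpha):.4f}")

# As alpha -> 1, D_alpha(P||Q) approaches the KL divergence D(P||Q).
print(renyi_divergence(p, q, 1.001), np.sum(p * np.log(p / q)))
```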
It can be shown that if $\alpha>1$ and $P\not\ll Q$ then $D_\alpha(P\|Q)=\infty$. The behaviour of the measure for $\alpha\in\{0,1,\infty\}$ can be defined by continuity. In general, one has that $D_1(P\|Q)=D(P\|Q)$, but if $D(P\|Q)=\infty$ or there exists $\beta$ such that $D_\beta(P\|Q)<\infty$, then $\lim_{\alpha\to1}D_\alpha(P\|Q)=D(P\|Q)$ [8, Theorem 5]. For an extensive treatment of α-divergences and their properties we refer the reader to [8]. Starting from Rényi's Divergence and the geometric averaging that it involves, Sibson built the notion of Information Radius [10]:
Definition 2. Let $(\mu_1,\ldots,\mu_n)$ be a family of probability measures and $(w_1,\ldots,w_n)$ be a set of weights s.t. $w_i\ge 0$ for $i=1,\ldots,n$ and such that $\sum_{i=1}^n w_i>0$. Let $\alpha\ge 1$; the information radius of order α is defined as:
$$\frac{1}{\alpha-1}\min_{\nu\ll\sum_i w_i\mu_i}\log\left(\sum_i w_i\exp\big((\alpha-1)D_\alpha(\mu_i\|\nu)\big)\right).$$
Suppose now we have two random variables $X,Y$ jointly distributed according to $P_{XY}$. It is possible to generalise Def. 2 and see that the information radius is a special case of the following quantity [7]:
$$I_\alpha(X,Y) = \min_{Q_Y} D_\alpha(P_{XY}\|P_X Q_Y). \qquad (3)$$
$I_\alpha(X,Y)$ represents a generalisation of Shannon's Mutual Information and possesses many interesting properties [7]. Indeed, $\lim_{\alpha\to 1}I_\alpha(X,Y)=I(X;Y)$. On the other hand, when $\alpha\to\infty$ we get
$$I_\infty(X,Y) = \log\,\mathbb{E}_{P_Y}\left[\sup_{x:P_X(x)>0}\frac{P_{XY}(\{x,Y\})}{P_X(\{x\})P_Y(\{Y\})}\right] = \mathcal{L}(X\to Y),$$
where $\mathcal{L}(X\to Y)$ denotes the Maximal Leakage from $X$ to $Y$, a recently defined information measure with an operational meaning in the context of privacy and security [4]. For more details on Sibson's α-MI, as well as a closed-form expression, we refer the reader to [7]; as for Maximal Leakage, the reader is referred to [4].
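For discrete alphabets, the closed-form expression referenced from [7] can be written (to our understanding) as $I_\alpha(X,Y)=\frac{\alpha}{\alpha-1}\log\sum_y\big(\sum_x P_X(x)P_{Y|X}(y|x)^\alpha\big)^{1/\alpha}$, and the Maximal Leakage of [4] as $\mathcal{L}(X\to Y)=\log\sum_y\max_{x:P_X(x)>0}P_{Y|X}(y|x)$. The short sketch below (our own, on a hypothetical binary channel) evaluates both and illustrates that $I_\alpha$ approaches $\mathcal{L}(X\to Y)$ as α grows.

```python
import numpy as np

def sibson_mi(p_x, p_y_given_x, alpha):
    """Sibson's I_alpha(X, Y) for discrete X, Y via the closed form above.

    p_x: shape (nx,); p_y_given_x: shape (nx, ny), rows summing to one.
    Assumes alpha > 0 and alpha != 1.
    """
    inner = np.sum(p_x[:, None] * p_y_given_x ** alpha, axis=0)  # one term per y
    return alpha / (alpha - 1) * np.log(np.sum(inner ** (1 / alpha)))

def maximal_leakage(p_x, p_y_given_x):
    """L(X -> Y) for discrete alphabets, restricted to the support of P_X."""
    return np.log(np.sum(np.max(p_y_given_x[p_x > 0], axis=0)))

# Hypothetical example: X uniform on {0, 1}, Y the output of a binary channel.
p_x = np.array([0.5, 0.5])
p_y_given_x = np.array([[0.9, 0.1],
                        [0.2, 0.8]])

for alpha in [1.001, 2.0, 10.0, 1000.0]:
    print(f"I_{alpha}(X,Y) = {sibson_mi(p_x, p_y_given_x, alpha):.4f}")
print(f"L(X->Y)      = {maximal_leakage(p_x, p_y_given_x):.4f}")
```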
III. DEFINITION
A. Introduction
The characterisation expressed in (3) represents the foundation of this work. Indeed, using (3) as the definition of Sibson's α-MI allows us to draw parallels with Shannon's Mutual Information. This, in turn, allows us to define, drawing inspiration from Shannon's measures, an analogous conditional version of Sibson's Iα. It is very well known that $I(X;Y)=D(P_{XY}\|P_XP_Y)$ as well as $I(X;Y|Z)=D(P_{XYZ}\|P_ZP_{X|Z}P_{Y|Z})$. We can thus follow a similar approach in defining a conditional α-Mutual Information: we will estimate the (Rényi) divergence of the joint $P_{XYZ}$ from a distribution characterised by the Markov chain $X-Z-Y$ via α-Divergences. Mimicking (3), we will also minimise such a divergence with respect to a family of measures. Having now three random variables, we can think of three natural factorisations for $P_{XYZ}$ (assuming that $X-Z-Y$ holds): $P_XP_{Z|X}P_{Y|Z}$, $P_YP_{Z|Y}P_{X|Z}$, $P_ZP_{Y|Z}P_{X|Z}$.

The question then is: with respect to which measure should we minimise in order to define $I_\alpha(X,Y|Z)$? Natural candidates seem to be the minimisations with respect to $Q_Z$, $Q_{Y|Z}$ and $Q_Y$. The matter is strongly connected to the operational meaning that the information measure acquires, along with the applications it can provide. Each of the definitions can be useful in specific settings. Keeping this in mind, the purpose of this work is not to compare different definitions in order to find the best one, but rather to highlight properties of the different definitions with an operationally driven approach. Each of these measures can be associated with a hypothesis testing problem and a bound relating different measures of the same event (typically a joint and a Markov chain-like distribution). Different applications require different conditional Iα's. With this drive, let us make a specific choice for the minimisation and draw a parallel with the others along the way. The random variable whose measure¹ we choose to minimise will be denoted as a superscript.
B. $I^{Y|Z}_\alpha(X,Y|Z)$
In [5], conditional α-mutual information was defined as follows:

Definition 3. Let $X,Y,Z$ be three random variables jointly distributed according to $P_{XYZ}$. For $\alpha>0$, a conditional Sibson's mutual information of order α between $X$ and $Y$ given $Z$ is defined as:
$$I^{Y|Z}_\alpha(X,Y|Z) = \min_{Q_{Y|Z}} D_\alpha(P_{XYZ}\|P_{X|Z}Q_{Y|Z}P_Z). \qquad (4)$$
It is possible to find a closed-form expression for Def. 3 [5, Section IV.C.2]. This definition is interesting as setting $Z$ equal to a constant allows us to retrieve $I_\alpha(X,Y)$. Moreover, starting from Definition 3 and its closed-form expression, one can retrieve the following result.

Theorem 1. Let $(\mathcal{X}\times\mathcal{Y}\times\mathcal{Z},\mathcal{F},P_{XYZ})$ be a probability space. Let $P_Z$ and $P_{X|Z}$ be the induced conditional and marginal distributions. Assume that $P_{XYZ}\ll P_ZP_{Y|Z}P_{X|Z}$. Given $E\in\mathcal{F}$ and $z\in\mathcal{Z}$, $y\in\mathcal{Y}$, let $E_{z,y}=\{x:(x,y,z)\in E\}$. Then, for fixed $\alpha\ge 1$:
$$P_{XYZ}(E) \le \mathbb{E}_{Z}\left[\operatorname*{ess\,sup}_{P_{Y|Z}}P_{X|Z}(E_{Z,Y})\right]^{\frac{\alpha-1}{\alpha}}\cdot\exp\left(\frac{\alpha-1}{\alpha}I^{Y|Z}_\alpha(X,Y|Z)\right). \qquad (5)$$
Proof.
$$P_{XYZ}(E) = \mathbb{E}_{P_ZP_{Y|Z}P_{X|Z}}\left[\frac{dP_{XYZ}}{dP_ZP_{Y|Z}P_{X|Z}}\,\mathbb{1}_E\right] \qquad (6)$$
$$\le \mathbb{E}^{\frac{1}{\alpha''}}_{P_Z}\!\left[\mathbb{E}^{\frac{\alpha''}{\alpha'}}_{P_{Y|Z}}\!\left[\mathbb{E}^{\frac{\alpha'}{\alpha}}_{P_{X|Z}}\!\left[\left(\frac{dP_{XYZ}}{dP_ZP_{Y|Z}P_{X|Z}}\right)^{\!\alpha}\right]\right]\right]\cdot\mathbb{E}^{\frac{1}{\gamma''}}_{P_Z}\!\left[\mathbb{E}^{\frac{\gamma''}{\gamma'}}_{P_{Y|Z}}\!\left[\mathbb{E}^{\frac{\gamma'}{\gamma}}_{P_{X|Z}}\!\left[\mathbb{1}^{\gamma}_E\right]\right]\right] \qquad (7)$$
$$\le \mathbb{E}_{Z}\left[\operatorname*{ess\,sup}_{P_{Y|Z}}P_{X|Z}(E_{Z,Y})\right]^{\frac{\alpha-1}{\alpha}}\cdot\exp\left(\frac{\alpha-1}{\alpha}I^{Y|Z}_\alpha(X,Y|Z)\right). \qquad (8)$$
The first inequality follows from applying Hölder's inequality three times, with the six parameters chosen such that $\frac{1}{\alpha''}+\frac{1}{\gamma''}=\frac{1}{\alpha'}+\frac{1}{\gamma'}=\frac{1}{\alpha}+\frac{1}{\gamma}=1$. (8) follows from setting $\alpha''=\alpha$ and $\alpha'=1$, which imply $\gamma''=\gamma$ and $\gamma'\to\infty$.
Another property of $I^{Y|Z}_\alpha$ is that, similarly to the unconditional $I_\alpha$ [4], taking the limit $\alpha\to\infty$ we have that $I^{Y|Z}_\alpha(X,Y|Z)\xrightarrow{\alpha\to\infty}\mathcal{L}(X\to Y|Z)$, leading us to the following:

¹It is clearly possible to minimise over more than one random variable at once, as has been done in [5], [11] in the context of both the regular $I_\alpha(X,Y)$ and the conditional $I_\alpha(X,Y|Z)$.
Corollary 1. Under the same assumptions of Theorem 1:
$$P_{XYZ}(E) \le \mathbb{E}_{Z}\left[\operatorname*{ess\,sup}_{P_{Y|Z}}P_{X|Z}(E_{Z,Y})\right]\exp\big(\mathcal{L}(X\to Y|Z)\big). \qquad (9)$$
C. $I^{Z}_\alpha(X,Y|Z)$
As discussed in Section III-A, another natural candidate definition of conditional α-mutual information is the following:

Definition 4. Under the same assumptions of Definition 3:
$$I^{Z}_\alpha(X,Y|Z) = \min_{Q_Z} D_\alpha(P_{XYZ}\|P_{X|Z}P_{Y|Z}Q_Z). \qquad (10)$$

To the best of our knowledge, Definition 4 has not been considered elsewhere. As for $I^{Y|Z}_\alpha(X,Y|Z)$, it is possible to compute a closed-form expression for $I^{Z}_\alpha(X,Y|Z)$. We will limit ourselves to discrete random variables for simplicity.

Theorem 2. Let $\alpha>0$ and let $X,Y,Z$ be three discrete random variables. Then
$$I^{Z}_\alpha(X,Y|Z) = \frac{\alpha}{\alpha-1}\log\sum_{z}P_Z(z)\left(\sum_{x,y}P_{XY|Z=z}(x,y)^\alpha\big(P_{X|Z=z}(x)P_{Y|Z=z}(y)\big)^{1-\alpha}\right)^{\frac{1}{\alpha}}.$$
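As a quick numerical sanity check of this closed form (a sketch of our own, on a randomly generated, hypothetical joint pmf), the snippet below evaluates Theorem 2 and verifies that the result never exceeds $D_\alpha(P_{XYZ}\|P_{X|Z}P_{Y|Z}Q_Z)$ for a few particular choices of $Q_Z$, as the minimisation in Definition 4 requires.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint pmf P_XYZ on a 3 x 3 x 2 alphabet (illustration only).
p_xyz = rng.random((3, 3, 2))
p_xyz /= p_xyz.sum()

p_z = p_xyz.sum(axis=(0, 1))                      # P_Z(z)
p_xy_g_z = p_xyz / p_z                            # P_{XY|Z=z}(x, y)
p_x_g_z = p_xy_g_z.sum(axis=1)                    # P_{X|Z=z}(x)
p_y_g_z = p_xy_g_z.sum(axis=0)                    # P_{Y|Z=z}(y)

def i_alpha_Z(alpha):
    """I^Z_alpha(X, Y|Z) via the closed form of Theorem 2."""
    prod = p_x_g_z[:, None, :] * p_y_g_z[None, :, :]          # P_{X|Z=z}(x) P_{Y|Z=z}(y)
    inner = (p_xy_g_z ** alpha * prod ** (1 - alpha)).sum(axis=(0, 1))
    return alpha / (alpha - 1) * np.log(np.sum(p_z * inner ** (1 / alpha)))

def d_alpha(p, q, alpha):
    """Renyi divergence of order alpha between two pmfs on the same finite set."""
    return np.log(np.sum(p ** alpha * q ** (1 - alpha))) / (alpha - 1)

alpha = 2.0
closed_form = i_alpha_Z(alpha)

# Definition 4 is a minimum over Q_Z, so the closed form should never exceed
# D_alpha(P_XYZ || P_{X|Z} P_{Y|Z} Q_Z) for any particular Q_Z we try.
for q_z in [p_z, np.array([0.5, 0.5]), np.array([0.9, 0.1])]:
    ref = (p_x_g_z[:, None, :] * p_y_g_z[None, :, :]) * q_z   # P_{X|Z} P_{Y|Z} Q_Z
    print(closed_form, "<=", d_alpha(p_xyz.ravel(), ref.ravel(), alpha))
```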
The proof of Theorem 2 follows from the definition of $I^{Z}_\alpha(X,Y|Z)$ and Sibson's identity [9, Eq. (12)]. Mirroring Section III-B, we can state an analogue of Theorem 1 for $I^{Z}_\alpha$:
Theorem 3. Let $(\mathcal{X}\times\mathcal{Y}\times\mathcal{Z},\mathcal{F},P_{XYZ})$ be a probability space. Let $P_{Y|Z}$ and $P_{X|Z}$ be the induced conditional distributions. Assume that $P_{XYZ}\ll P_ZP_{Y|Z}P_{X|Z}$. Given $E\in\mathcal{F}$ and $z\in\mathcal{Z}$, let $E_z=\{(x,y):(x,y,z)\in E\}$. Then, for fixed $\alpha\ge 1$:
$$P_{XYZ}(E) \le \left[\operatorname*{ess\,sup}_{P_Z}P_{X|Z}P_{Y|Z}(E_Z)\right]^{\frac{\alpha-1}{\alpha}}\cdot\exp\left(\frac{\alpha-1}{\alpha}I^{Z}_\alpha(X,Y|Z)\right). \qquad (11)$$
This type of result is useful as it allows us to approximate the probability of $E$ under a joint with the probability of $E$ under a different measure encoding some independence (typically easier to analyse); in this specific case, the measure induced by a Markov chain. Such bounds represent, for us, the main application-oriented employment of these measures [1]. Notice that, other than using $I^{Z}_\alpha$ instead of $I^{Y|Z}_\alpha$, Theorem 3 involves a different essential supremum as compared to Theorem 1. Moving on with the comparison, differently from Definition 3, the information measure we are defining here is symmetric in $X$ and $Y$. Moreover, setting $Z$ to a constant in Definition 4 does not allow us to retrieve $I_\alpha(X,Y)$, but rather $D_\alpha(P_{XY}\|P_XP_Y)$.
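To illustrate how the bound in (11) can be used in practice, here is a small numerical sketch (our own, on a randomly generated, hypothetical joint pmf and event $E$): it evaluates both sides of Theorem 3 for a few values of α, reusing the closed form of Theorem 2.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical joint pmf P_XYZ on a 3 x 3 x 2 alphabet and a random event E.
p_xyz = rng.random((3, 3, 2)); p_xyz /= p_xyz.sum()
event = rng.random((3, 3, 2)) < 0.3                # indicator of E

p_z = p_xyz.sum(axis=(0, 1))
p_xy_g_z = p_xyz / p_z
p_x_g_z, p_y_g_z = p_xy_g_z.sum(axis=1), p_xy_g_z.sum(axis=0)
prod = p_x_g_z[:, None, :] * p_y_g_z[None, :, :]   # P_{X|Z=z}(x) P_{Y|Z=z}(y)

def i_alpha_Z(alpha):
    """Closed form of Theorem 2."""
    inner = (p_xy_g_z ** alpha * prod ** (1 - alpha)).sum(axis=(0, 1))
    return alpha / (alpha - 1) * np.log(np.sum(p_z * inner ** (1 / alpha)))

lhs = p_xyz[event].sum()                           # P_XYZ(E)
# ess sup over P_Z of P_{X|Z}P_{Y|Z}(E_Z); for full-support discrete Z it is a max.
ess_sup = max((prod[:, :, z] * event[:, :, z]).sum() for z in range(2) if p_z[z] > 0)

for alpha in [1.5, 2.0, 10.0]:
    c = (alpha - 1) / alpha
    rhs = ess_sup ** c * np.exp(c * i_alpha_Z(alpha))   # right-hand side of (11)
    print(f"alpha={alpha}: {lhs:.4f} <= {rhs:.4f}")
```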
D. An additive SDPI-like inequality
Definition 4 shares some interesting properties with $I_\alpha(X,Y)$. One such property is a rewriting of $I_\alpha(X,Y)$ in terms of $D_\alpha$. This allows us to leverage the strong data processing inequality (SDPI) for Hellinger integrals of order α, which in turn allows us to provide an SDPI-like result for $I^{Z}_\alpha$. A definition for SDPIs can be found in [12, Def. 3.1]. More precisely, we can write
$$I^{Z}_\alpha(X,Y|Z) = \frac{\alpha}{\alpha-1}\log\,\mathbb{E}_Z\left[\exp\left(\frac{\alpha-1}{\alpha}D_\alpha\big(P_{XY|Z}\|P_{X|Z}P_{Y|Z}\big)\right)\right]$$
$$= \frac{\alpha}{\alpha-1}\log\,\mathbb{E}_Z\left[D_{f_\alpha}\big(P_{XY|Z}\|P_{X|Z}P_{Y|Z}\big)^{\frac{1}{\alpha}}\right], \qquad (12)$$
where $D_{f_\alpha}$ denotes the Hellinger integral of order α, i.e., given two measures $P,Q$, $D_{f_\alpha}(P\|Q)=\mathbb{E}_{Q}\big[\big(\tfrac{dP}{dQ}\big)^{\alpha}\big]$. Leveraging Eq. (12) we can state the following.
Theorem 4. Let $\alpha>1$ and let $X,Y,W,Z$ be four random variables such that $(Z,W)-X-Y$ is a Markov chain. Then
$$I^{Z}_\alpha(W,Y|Z) \le \frac{1}{\alpha-1}\log\big(\eta_{f_\alpha}(P_{Y|X})\big)+I^{Z}_\alpha(W,X|Z), \qquad (13)$$
where we denote with $\eta_{f_\alpha}(P_{Y|X})$ the contraction parameter of the Hellinger integral of order α, i.e., for a given Markov kernel $K$, $\eta_{f_\alpha}(K)=\sup_{\mu,\nu\ne\mu}\frac{D_{f_\alpha}(\mu K\|\nu K)}{D_{f_\alpha}(\mu\|\nu)}$ [12, Def. III.1].

The proof follows from Eq. (12) and a reasoning similar to [13, Lemma 3], but applied to the $D_{f_\alpha}$-divergence instead of the KL-divergence.
Remark 2. Notice that data processing inequalities are simply a consequence of the convexity of $f$ [14, Thm. 4.2], and $f_\alpha(x)=x^\alpha$ is indeed convex. Hence, although the Hellinger integral is not normalised to be $0$ whenever the measures are the same, it does satisfy a DPI. Moreover, the contraction parameter of a strong data-processing inequality is always less than or equal to $1$. Hence, $\log(\eta_{f_\alpha}(K))\le 0$.

An analogue of Theorem 4 for Definition 3 does not seem possible.
Remark 3. One can state a result similar to Theorem 4 for the unconditional $I_\alpha$. Specifically, we can write
$$I_\alpha(X,Y) = \frac{\alpha}{\alpha-1}\log\,\mathbb{E}_Y\left[D_{f_\alpha}\big(P_{X|Y}\|P_X\big)^{\frac{1}{\alpha}}\right].$$
Since $I_\alpha$ is an asymmetric quantity, we only get the SDPI-like result in one direction. Namely, given the Markov chain $W-X-Y$, we can relate via SDPI $I_\alpha(W,Y)$ and $I_\alpha(X,Y)$ (but, for instance, not $I_\alpha(W,X)$ and $I_\alpha(X,Y)$), as follows:
$$I_\alpha(W,Y) \le \frac{1}{\alpha-1}\log\big(\eta_{f_\alpha}(P_{W|X})\big)+I_\alpha(X,Y). \qquad (14)$$

Theorem 4 and Eq. (14) represent an SDPI-like inequality of a different flavour from the usual one: the (function of the) η parameter is added to the information measure rather than multiplied. However, one of the main applications of $I_\alpha$ (conditional and not) in bounds requires the exponentiation of the quantity, which brings us back to a multiplicative form. To make this statement more precise, let us state the following:
Corollary 2. Under the same assumptions of Theorem 4 we have that:
$$P_{WYZ}(E) \le \operatorname*{ess\,sup}_{P_Z}\big(P_{W|Z}P_{Y|Z}(E_Z)\big)^{\frac{\alpha-1}{\alpha}}\cdot\eta_{f_\alpha}(P_{Y|X})^{\frac{1}{\alpha}}\cdot\exp\left(\frac{\alpha-1}{\alpha}I^{Z}_\alpha(W,X|Z)\right).$$
Corollary 2 follows directly from Theorem 3 and Theorem 4.

Remark 4. A similar result can be derived for the unconditional $I_\alpha$, starting from (14) and [1, Corollary 1].
E. Discussion on $I^{Z}_\alpha$ and $I^{Y|Z}_\alpha$
Let us now use Theorems 1 and 3 as a means of comparison for the two conditional Iα's. These results are useful whenever we want to control the joint measure of some event $E$ but we only know how to control it (e.g., via an upper bound) under some hypothesis of independence [1]. Consider the factorisation of $P_{XYZ}$ under $X-Z-Y$ to be fixed. In the context of Theorems 1 and 3, according to the measure we know how to control, different conditional Iα's will appear on the right-hand side of the bound (cf. Eqs. (5), (9) and (11)). For instance, if we assume to be able to control $\operatorname*{ess\,sup}_{Q_Z}P_{X|Z}P_{Y|Z}(E_Z)$, then Theorem 3 tells us that $I^{Z}_\alpha(X,Y|Z)$ is the measure to study. If we assume instead that we are able to control terms of the form $\mathbb{E}_{P_Z}[\operatorname*{ess\,sup}_{P_{Y|Z}}P_{X|Z}(E_{Z,Y})]$, then $I^{Y|Z}_\alpha(X,Y|Z)$ would be the measure to analyse. (Quantities like $\mathbb{E}_{P_Z}[\operatorname*{ess\,sup}_{P_{Y|Z}}P_{X|Z}(E_{Z,Y})]$, for specific choices of $E$, are known in the literature as "small-ball probabilities" and have found applications in distributed estimation problems and distributed function computation [13], [15].) More generally, we can find a duality between the measure over which we supremise (on the right-hand side of the bounds) and the corresponding minimisation in the definition of conditional Iα. The same measures also have a fundamental role in defining the hypothesis testing problem that endows the information measure with its operational meaning, as we will see in the next section.
IV. OPERATIONAL MEANING
Drawing inspiration from [5], [11], [16], let us consider the following composite hypothesis testing problem. Fix a pmf $P_{XYZ}$; observing a sequence of triples $\{(X_i,Y_i,Z_i)\}_{i=1}^n$, we want to decide whether:
0) $\{(X_i,Y_i,Z_i)\}_{i=1}^n$ is sampled in an iid fashion from $P_{XYZ}$ (null hypothesis);
1) $\{(X_i,Y_i,Z_i)\}_{i=1}^n$ is sampled in an iid fashion from $Q_ZP_{X|Z}P_{Y|Z}$, where $Q_Z$ is an arbitrary pmf over the space $\mathcal{Z}$ (alternative hypothesis).
We can relate $I^{Z}_\alpha(X,Y|Z)$ to the error exponent of the hypothesis testing problem just defined. This can be seen as a more lenient test for Markovity, where the measure of $Z$ is allowed to vary. Similarly to before, there is a link between which measure is allowed to vary and the minimisation in the definition of conditional Iα. Choosing, for instance, to minimise over $Q_X$ allows this measure to vary in the alternative hypothesis. Using Theorem 3 we can already connect $I^{Z}_\alpha$ to the problem in question. Given a test $T_n:\{\mathcal{X}\times\mathcal{Y}\times\mathcal{Z}\}^n\to\{0,1\}$, we will denote with $p^1_n$ (Type-1 error) the probability of wrongfully choosing hypothesis 1 given that the sequence is distributed according to $P^n_{XYZ}$, i.e. $p^1_n=P^n_{XYZ}(T_n(\{(X_i,Y_i,Z_i)\}_{i=1}^n)=1)$, and with $p^2_n$ (Type-2 error) the maximum probability of wrongfully choosing hypothesis 0 given that the sequence is distributed according to $(Q_ZP_{X|Z}P_{Y|Z})^n$ for some $Q_Z$, i.e. $p^2_n=\sup_{Q_Z\in\mathcal{P}(\mathcal{Z})}(Q_ZP_{X|Z}P_{Y|Z})^n(T_n(\{(X_i,Y_i,Z_i)\}_{i=1}^n)=0)$.
Theorem 5. Let $n>0$ and let $T_n:\{\mathcal{X}\times\mathcal{Y}\times\mathcal{Z}\}^n\to\{0,1\}$ be a deterministic test that, upon observing the sequence $\{(X_i,Y_i,Z_i)\}_{i=1}^n$, chooses either the null or the alternative hypothesis. Assume that there exists $R>0$ such that for all $Q_Z\in\mathcal{P}(\mathcal{Z})$ we have $(Q_ZP_{X|Z}P_{Y|Z})^n(T_n(\{(X_i,Y_i,Z_i)\}_{i=1}^n)=0)\le\exp(-nR)$. Let also $\alpha\ge 1$. Then
$$1-p^1_n \le \exp\left(-\frac{\alpha-1}{\alpha}n\big(R-I^{Z}_\alpha(X,Y|Z)\big)\right). \qquad (15)$$
Proof. We have that $1-p^1_n=P^n_{XYZ}(T_n(\{(X_i,Y_i,Z_i)\}_{i=1}^n)=0)$. Starting from Theorem 3, applied to the event $E=T_n^{-1}(\{0\})$:
$$1-p^1_n \le \left[\operatorname*{ess\,sup}_{P^n_Z}P^n_{X|Z}P^n_{Y|Z}(E^n_Z)\right]^{1-\frac{1}{\alpha}}\cdot\exp\left(\frac{\alpha-1}{\alpha}I^{Z}_\alpha(X^n,Y^n|Z^n)\right). \qquad (16)$$
Since we assumed the exponential decay of $(Q_ZP_{X|Z}P_{Y|Z})^n(T_n(\{(X_i,Y_i,Z_i)\}_{i=1}^n)=0)$ for every $Q_Z$, we also have that $\operatorname*{ess\,sup}_{P^n_Z}P^n_{X|Z}P^n_{Y|Z}(E^n_Z)\le\exp(-nR)$ (consider a measure $\tilde{Q}_Z$ that puts all the mass on the sequence achieving the essential supremum in (16)). Given the assumption of independence on the triples $\{(X_i,Y_i,Z_i)\}_{i=1}^n$ and following a reasoning similar to the one in Eqn. (49) in [7], we have that $I^{Z}_\alpha(X^n,Y^n|Z^n)=nI^{Z}_\alpha(X,Y|Z)$. The conclusion then follows from algebraic manipulations of (16).
This result implies that if we assume an exponential decay for the Type-2 error $p^2_n$ and $R>I^{Z}_\alpha(X,Y|Z)$, we have an exponential decay of the probability of correctly choosing the null hypothesis as well. Moreover, taking logarithms in (15) and dividing by $n$, for every $n>0$:
$$\frac{1}{n}\log(1-p^1_n) \le -\frac{\alpha-1}{\alpha}\big(R-I^{Z}_\alpha(X,Y|Z)\big). \qquad (17)$$
We can conclude that:
$$\limsup_{n\to\infty}\frac{1}{n}\log(1-p^1_n) \le -\sup_{\alpha\in(1,+\infty]}\frac{\alpha-1}{\alpha}\big(R-I^{Z}_\alpha(X,Y|Z)\big).$$
A. Error exponents
Following the approach undertaken in [11] we can also define an achievable error-exponent pair for the hypothesis testing problem in question.

Definition 5. A pair of error exponents $(E_P,E_Q)\in\mathbb{R}^2$ is called achievable w.r.t. the above hypothesis testing problem if there exists a sequence of tests $\{T_n\}_{n=1}^{\infty}$ such that²:
$$\liminf_{n\to\infty} -\frac{1}{n}\log P^n_{XYZ}\big(T_n(\{(X_i,Y_i,Z_i)\}_{i=1}^n)=1\big) > E_P,$$
$$\liminf_{n\to\infty}\ \inf_{Q_Z} -\frac{1}{n}\log\big(Q_ZP_{X|Z}P_{Y|Z}\big)^n\big(T_n(\{(X_i,Y_i,Z_i)\}_{i=1}^n)=0\big) > E_Q.$$
We can then define the error-exponent functions [11] $E_P:\mathbb{R}\to\mathbb{R}\cup\{+\infty\}$ and $E_Q:\mathbb{R}\to\mathbb{R}\cup\{+\infty\}$ as follows:
$$E_P(E_Q) = \sup\{E_P\in\mathbb{R}:(E_P,E_Q)\text{ is achievable}\} \qquad (18)$$
$$E_Q(E_P) = \sup\{E_Q\in\mathbb{R}:(E_P,E_Q)\text{ is achievable}\} \qquad (19)$$
It is now possible to relate $I^{Z}_\alpha(X,Y|Z)$, for $\alpha\in(0,1]$, with both the Fenchel conjugate $E^\star_P(\cdot)$ of $E_P(\cdot)$ and with $E^{\star\star}_P(\cdot)$. First, let us characterise $E^\star_P$.

Lemma 1.
$$E^\star_P(\lambda) = \begin{cases} +\infty, & \text{if }\lambda>0,\\ \lambda\, I_{\frac{1}{1-\lambda}}(X,Y|Z), & \text{otherwise.}\end{cases} \qquad (20)$$
Proof. Assume $\lambda\le 0$. Then
$$E^\star_P(\lambda) = \sup_{E_Q\in\mathbb{R}}\big[\lambda E_Q-E_P(E_Q)\big]$$
$$= \sup_{E_Q\in\mathbb{R}}\Big[\lambda E_Q-\inf_{R_{XYZ}:\,D(R_{XYZ}\|R_ZP_{X|Z}P_{Y|Z})\le E_Q} D(R_{XYZ}\|P_{XYZ})\Big]$$
$$\stackrel{(a)}{=} \sup_{E_Q\in\mathbb{R}}\ \sup_{R_{XYZ}:\,D(R_{XYZ}\|R_ZP_{X|Z}P_{Y|Z})\le E_Q}\big[\lambda E_Q-D(R_{XYZ}\|P_{XYZ})\big]$$
$$= \sup_{R_{XYZ}}\ \sup_{E_Q\in\mathbb{R}:\,E_Q\ge D(R_{XYZ}\|R_ZP_{X|Z}P_{Y|Z})}\big[\lambda E_Q-D(R_{XYZ}\|P_{XYZ})\big]$$
$$\stackrel{(b)}{=} \sup_{R_{XYZ}}\big[\lambda D(R_{XYZ}\|R_ZP_{X|Z}P_{Y|Z})-D(R_{XYZ}\|P_{XYZ})\big]$$
$$\stackrel{(c)}{=} (\lambda-1)\inf_{Q_Z}\inf_{R_{XYZ}}\Big[\frac{-\lambda}{1-\lambda}D(R_{XYZ}\|Q_ZP_{X|Z}P_{Y|Z})+\frac{1}{1-\lambda}D(R_{XYZ}\|P_{XYZ})\Big]$$
$$\stackrel{(d)}{=} \lambda\inf_{Q_Z}D_{\frac{1}{1-\lambda}}(P_{XYZ}\|Q_ZP_{X|Z}P_{Y|Z})$$
$$\stackrel{(e)}{=} \lambda\, I_{\frac{1}{1-\lambda}}(X,Y|Z).$$
Step (a) follows from an analogue of [11, Corollary 2] for our testing problem; step (b) follows because, given that $\lambda\le 0$, the supremum over $E_Q$ is achieved at $E_Q=D(R_{XYZ}\|R_ZP_{X|Z}P_{Y|Z})$. Step (c) follows from an analogue of [11, Lemma 4], (d) follows from [8, Theorem 3] and, to conclude, (e) follows from Definition 4. For $\lambda>0$ the reasoning is identical to [11, Lemma 12].
²As pointed out in [11], despite having bounds like the one in Theorem 5 decaying with the two rates $E_P,E_Q$, we cannot conclude anything about the achievability of the pair.
Now we can prove the connection to $E^{\star\star}_P(\cdot)$.³

Theorem 6. Given $E_Q,E_P\in\mathbb{R}$:
$$E^{\star\star}_P(E_Q) = \sup_{\alpha\in(0,1]}\frac{1-\alpha}{\alpha}\big(I_\alpha(X,Y|Z)-E_Q\big), \qquad (21)$$
$$E^{\star\star}_Q(E_P) = \sup_{\alpha\in(0,1]}\Big(I_\alpha(X,Y|Z)-\frac{\alpha}{1-\alpha}E_P\Big). \qquad (22)$$
Proof.
$$E^{\star\star}_P(E_Q) = \sup_{\lambda\in\mathbb{R}}\big(\lambda E_Q-E^\star_P(\lambda)\big) \qquad (23)$$
$$\stackrel{(f)}{=} \sup_{\lambda\le 0}\big(\lambda E_Q-E^\star_P(\lambda)\big) \qquad (24)$$
$$\stackrel{(g)}{=} \sup_{\lambda\le 0}\big(\lambda E_Q-\lambda I_{\frac{1}{1-\lambda}}(X,Y|Z)\big) \qquad (25)$$
$$\stackrel{(h)}{=} \sup_{\alpha\in(0,1]}\frac{1-\alpha}{\alpha}\big(I_\alpha(X,Y|Z)-E_Q\big), \qquad (26)$$
where (f) follows from $E^\star_P(\lambda)=+\infty$ for $\lambda>0$, (g) follows from Lemma 1 and (h) by setting $\alpha=\frac{1}{1-\lambda}$ (so that $\lambda\le 0$ corresponds to $\alpha\in(0,1]$ and $\lambda=-\frac{1-\alpha}{\alpha}$). The proof of (22) follows from similar arguments.
V. CONCLUSIONS
We have considered the problem of defining a conditional version of Sibson's α-Mutual Information. Drawing inspiration from an equivalent formulation of $I_\alpha(X,Y)$ as $\min_{Q_Y}D_\alpha(P_{XY}\|P_XQ_Y)$, we saw how several such propositions can be made for $I_\alpha(X,Y|Z)$. Two have already been analysed in [5]. We proposed here a general approach that allows us to connect to each such measure:
1) a bound, allowing us to approximate the probability $P_{XYZ}(E)$ with the probability of $E$ under a product distribution induced by the Markov chain $X-Z-Y$;
2) an operational meaning as the error exponent of a hypothesis testing problem where the alternative hypothesis is a Markov-like distribution and some measures are allowed to vary.
A simple relationship between the hypothesis testing problem and the information measure can already be found using the bound described in 1), without requiring any extra machinery. To conclude, the usefulness of a measure clearly comes from its applications and ease of computability. While the latter remains the same for all the possible conditional Iα, the former can vary according to the definition. With this in mind, the various definitions are equally meaningful and it seems reasonable to use the conditional Iα that best suits the specific application at hand.
ACKNOWLEDGMENT
The work in this paper was supported in part by the
Swiss National Science Foundation under Grants 169294 and
200364.
³Notice that $E^{\star\star}_P(\cdot)$ is not guaranteed to be equal to $E_P(\cdot)$. Indeed, it is possible to find examples where the function is not convex and thus all we retrieve is a lower bound on $E_P$ [11, Example 14].
REFERENCES
[1] A. R. Esposito, M. Gastpar, and I. Issa, "Generalization error bounds via Rényi-, f-divergences and maximal leakage," Accepted for publication in IEEE Transactions on Information Theory, 2021. [Online]. Available: http://arxiv.org/abs/1912.01439
[2] R. Gallager, "A simple derivation of the coding theorem and some applications," IEEE Transactions on Information Theory, vol. 11, no. 1, pp. 3–18, 1965.
[3] R. G. Gallager, Information Theory and Reliable Communication. USA: John Wiley & Sons, Inc., 1968.
[4] I. Issa, A. B. Wagner, and S. Kamath, "An operational approach to information leakage," IEEE Transactions on Information Theory, vol. 66, no. 3, pp. 1625–1657, 2020.
[5] M. Tomamichel and M. Hayashi, "Operational interpretation of Rényi information measures via composite hypothesis testing against product and Markov distributions," IEEE Transactions on Information Theory, vol. 64, no. 2, pp. 1064–1082, 2018.
[6] J. Liao, L. Sankar, O. Kosut, and F. P. Calmon, "Robustness of maximal α-leakage to side information," in 2019 IEEE International Symposium on Information Theory (ISIT), 2019, pp. 642–646.
[7] S. Verdú, "α-mutual information," in 2015 Information Theory and Applications Workshop (ITA), San Diego, CA, USA, Feb. 2015, pp. 1–6.
[8] T. van Erven and P. Harremoës, "Rényi divergence and Kullback-Leibler divergence," IEEE Trans. Inf. Theory, vol. 60, no. 7, pp. 3797–3820, July 2014.
[9] I. Csiszár, "Generalized cutoff rates and Rényi's information measures," IEEE Transactions on Information Theory, vol. 41, no. 1, pp. 26–34, Jan 1995.
[10] R. Sibson, "Information radius," Z. Wahrscheinlichkeitstheorie verw. Gebiete, vol. 14, pp. 149–160, 1969.
[11] A. Lapidoth and C. Pfister, "Testing against independence and a Rényi information measure," in 2018 IEEE Information Theory Workshop (ITW), 2018, pp. 1–5.
[12] M. Raginsky, "Strong data processing inequalities and φ-Sobolev inequalities for discrete channels," IEEE Transactions on Information Theory, vol. 62, no. 6, pp. 3355–3389, 2016.
[13] A. Xu and M. Raginsky, "Information-theoretic lower bounds for distributed function computation," IEEE Transactions on Information Theory, vol. 63, no. 4, pp. 2314–2337, 2017.
[14] Y. Wu, "Lecture notes on: Information-theoretic methods for high-dimensional statistics," 2020.
[15] A. Xu and M. Raginsky, "Information-theoretic lower bounds on Bayes risk in decentralized estimation," IEEE Transactions on Information Theory, vol. 63, no. 3, pp. 1580–1600, 2017.
[16] A. Lapidoth and C. Pfister, "Two measures of dependence," Entropy, vol. 21, no. 8, 2019. [Online]. Available: https://www.mdpi.com/1099-4300/21/8/778