Fair Structure Learning in Heterogeneous Graphical Models
Davoud Ataee Tarzanagh, Laura Balzano, and Alfred O. Hero

Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109, USA (E-mail: {tarzanaq, girasole}@umich.edu, hero@eecs.umich.edu).
Abstract

Inference of community structure in probabilistic graphical models may not be consistent with fairness constraints when nodes have demographic attributes. Certain demographics may be over-represented in some detected communities and under-represented in others. This paper defines a novel $\ell_1$-regularized pseudo-likelihood approach for fair graphical model selection. In particular, we assume there is some community or clustering structure in the true underlying graph, and we seek to learn a sparse undirected graph and its communities from the data such that demographic groups are fairly represented within the communities. Our optimization approach uses the demographic parity definition of fairness, but the framework is easily extended to other definitions of fairness. We establish statistical consistency of the proposed method for both a Gaussian graphical model and an Ising model for, respectively, continuous and binary data, proving that our method can recover the graphs and their fair communities with high probability.
Key words: Graphical Models; Fairness; Community Detection; Generalized Pseudo-Likelihood
1 Introduction
Probabilistic graphical models have been applied in a wide range of machine learning problems to infer dependency relationships among random variables. Examples include gene expression (Peng et al., 2009; Wang et al., 2009), social interaction networks (Tan et al., 2014; Tarzanagh and Michailidis, 2018), computer vision (Hassner and Sklansky, 1981; Laferté et al., 2000; Manning and Schutze, 1999), and recommender systems (Kouki et al., 2015; Wang et al., 2015). Since in most applications the number of model parameters to be estimated far exceeds the available sample size, it is necessary to impose structure, such as sparsity or community structure, on the estimated parameters to
make the problem well-posed. With the increasing application of structured graphical models and community detection algorithms in human-centric contexts (Tan et al., 2013; Song et al., 2011; Glassman et al., 2014; Burke et al., 2011; Pham et al., 2011; Das et al., 2014), there is a growing concern that, if left unchecked, they can lead to discriminatory outcomes for protected groups. For instance, the proportion of a minority group assigned to some community can be far from its underlying proportion, even if detection algorithms do not take the sensitive attribute of the minority into account in decision making (Chierichetti et al., 2017). Such an outcome may, in turn, lead to unfair treatment of minority groups. For example, in precision medicine, patient-patient similarity networks over a biomarker feature space can be used to cluster a cohort of patients and support treatment decisions on particular clusters (Parimbelli et al., 2018; Lafit et al., 2019). If the clusters learned by the algorithm are demographically imbalanced, this treatment assignment may unfairly exclude under-represented groups from effective treatments.
To the best of our knowledge, the estimation of fair structured graphical models has not previously been addressed. However, there is a vast body of literature on learning structured probabilistic graphical models. Typical approaches to impose structure in graphical models, such as $\ell_1$-regularization, encourage a sparsity structure that is uniform throughout the network and may therefore not be the most suitable choice for many real-world applications where the data have clusters or communities, i.e., groups of graph nodes with similar connectivity patterns or stronger connections within the group than to the rest of the network. Graphical models with these properties are called heterogeneous.

It is known that if the goal is structured heterogeneous graph learning, structure or community inference and graph weight estimation should be done jointly. In fact, performing structure inference before weight estimation results in a sub-optimal procedure (Marlin and Murphy, 2009). To overcome this issue, some of the initial work focused on either inferring connectivity information or performing graph estimation when the connectivity or community information is known a priori (Danaher et al., 2014; Guo et al., 2011b; Gan et al., 2019; Ma and Michailidis, 2016; Lee and Liu, 2015). Recent developments consider the two tasks jointly and estimate structured graphical models arising from heterogeneous observations (Kumar et al., 2020; Hosseini and Lee, 2016; Hao et al., 2018; Tarzanagh and Michailidis, 2018; Kumar et al., 2019; Gheche and Frossard, 2020; Pircalabelu and Claeskens, 2020; Cardoso et al., 2020; Eisenach et al., 2020).
In this paper, we develop a provably convergent penalized pseudo-likelihood method to induce fairness into clustered probabilistic graphical models. More specifically, we

- Formulate a novel version of probabilistic graphical modeling that takes fairness/bias into consideration. In particular, we assume there is some community structure in our graph, and we seek to learn an undirected graph from the data such that demographic groups are fairly represented within the communities of the graph.

- Provide a rigorous analysis of our algorithms showing that they can recover fair communities with high probability. Furthermore, we show that the estimators are asymptotically consistent in high-dimensional settings for both a Gaussian graphical model and an Ising model under standard regularity assumptions.

- Conclude by giving experimental results on synthetic and real-world datasets where proportional clustering can be a desirable goal, comparing the proportionality and objective value of standard graphical models to our methods. Our experiments confirm that our algorithms tend to better estimate graphs and their fair communities compared to standard graphical models.

The remainder of the paper is organized as follows: Section 2 gives a general framework for fair structure learning in graphs. Section 3 gives a detailed statement of the proposed fair graphical models for continuous and binary datasets. In Sections 4 and 5, we illustrate the proposed framework on a number of synthetic and real data sets, respectively. Section 6 provides some concluding remarks.
Notation. For a set $S$, $|S|$ is its cardinality and $S^c$ is its complement. The reals and nonnegative reals are denoted by $\mathbb{R}$ and $\mathbb{R}_+$, respectively. We use lower-case and upper-case bold letters such as $x$ and $X$ to represent vectors and matrices, respectively, with $x_i$ and $x_{ij}$ denoting their elements. If all coordinates of a vector $x$ are nonnegative, we write $x \geq 0$. The notations $x > 0$, as well as $X \geq 0$ and $X > 0$ for matrices, are defined similarly. For a symmetric matrix $X \in \mathbb{R}^{n \times n}$, we write $X \succ 0$ if $X$ is positive definite, and $X \succeq 0$ if it is positive semidefinite. $I_p$, $J_p$, and $0_p$ denote the $p \times p$ identity matrix, matrix of all ones, and matrix of all zeros, respectively. We use $\Lambda_i(X)$, $\Lambda_{\max}(X)$, and $\Lambda_{\min}(X)$ to denote the $i$-th, maximum, and minimum singular values of $X$, respectively. For any matrix $X$, we define $\|X\|_\infty := \max_{ij}|x_{ij}|$, $\|X\|_1 := \sum_{ij}|x_{ij}|$, $\|X\| := \Lambda_{\max}(X)$, and $\|X\|_F := \sqrt{\sum_{ij} x_{ij}^2}$.
2 Fair Structure Learning in Graphical Models
We introduce a fair graph learning method that simultaneously accounts for fair community detection
and estimation of heterogeneous graphical models.
Let $Y$ be an $n \times p$ matrix with columns $y_1, \ldots, y_p$. We associate to each column of $Y$ a node in a graph $G = (V, E)$, where $V = \{1, 2, \ldots, p\}$ is the vertex set and $E \subseteq V \times V$ is the edge set. We consider a simple undirected graph, without self-loops, whose edge set contains only distinct pairs. Graphs are conveniently represented by a $p \times p$ matrix, denoted by $\Theta$, whose nonzero entries correspond to edges in the graph. The precise definition of this matrix usually depends on modeling assumptions, properties of the desired graph, and the application domain.
In order to obtain a sparse and interpretable graph estimate, many authors have considered the problem
$$\underset{\Theta}{\text{minimize}}\; \mathcal{L}(\Theta; Y) + \rho_1 \|\Theta\|_{1,\text{off}} \quad \text{subj. to } \Theta \in \mathcal{M}. \tag{1}$$
Here, $\mathcal{L}$ is a loss function; $\rho_1 \|\Theta\|_{1,\text{off}}$ is the $\ell_1$-norm regularization applied to the off-diagonal elements of $\Theta$ with parameter $\rho_1 > 0$; and $\mathcal{M}$ is a convex constraint subset of $\mathbb{R}^{p \times p}$. For instance, in the case of a Gaussian graphical model, we could take $\mathcal{L}(\Theta; Y) = -\log\det(\Theta) + \operatorname{trace}(S\Theta)$, where $S = n^{-1}\sum_{i=1}^n y_i y_i^\top$ and $\mathcal{M}$ is the set of $p \times p$ positive definite matrices. The solution to (1) can then be interpreted as a sparse estimate of the inverse covariance matrix (Banerjee et al., 2008; Friedman et al., 2008). Throughout, we assume that $\mathcal{L}(\Theta; Y)$ and $\mathcal{M}$ are a convex function and a convex set, respectively.
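For concreteness, the following minimal sketch (ours, not from the paper; NumPy assumed) evaluates the penalized Gaussian objective in (1) with the log-determinant loss above:

```python
import numpy as np

def glasso_objective(Theta, Y, rho1):
    """Evaluate the l1-penalized Gaussian loss of (1) with
    L(Theta; Y) = -log det(Theta) + trace(S Theta); a sketch for intuition."""
    n, p = Y.shape
    S = Y.T @ Y / n                                  # sample covariance
    sign, logdet = np.linalg.slogdet(Theta)
    if sign <= 0:
        return np.inf                                # outside M (not positive definite)
    off_l1 = np.sum(np.abs(Theta)) - np.sum(np.abs(np.diag(Theta)))
    return -logdet + np.trace(S @ Theta) + rho1 * off_l1
```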
2.1 Model Framework

We build our fair graph learning framework using (1) as a starting point. Let $V$ denote the set of nodes. Suppose there exist $K$ disjoint communities of nodes, $V = C_1 \cup \cdots \cup C_K$, where $C_k$ is the subset of nodes from $G$ that belong to the $k$-th community. To each candidate partition of the $p$ nodes into $K$ communities, we associate a partition matrix $Q \in [0,1]^{p \times p}$ such that $q_{ij} = 1/|C_k|$ if and only if nodes $i$ and $j$ are assigned to the $k$-th community. Let $\mathcal{Q}_{pK}$ be the set of all such partition matrices, and $\bar{Q}$ the true partition matrix associated with the ground-truth clusters $\{\bar{C}_k\}_{k=1}^K$.

Assume the set of nodes contains $H$ demographic groups such that $V = D_1 \cup \cdots \cup D_H$, potentially with overlap between the groups. Chierichetti et al. (2017) proposed a model for fair clustering requiring the representation in each cluster to preserve the global fraction of each demographic group $D_h$, i.e.,
$$\frac{|D_h \cap C_k|}{|C_k|} = \frac{|D_h|}{p} \quad \text{for all } k \in [K]. \tag{2}$$
Let $R \in \{0,1\}^{p \times p}$ be such that $r_{ij} = 1$ if and only if nodes $i$ and $j$ are assigned to the same demographic group, with the convention that $r_{ii} = 1$ for all $i$. One will notice that (2) is equivalent to $R(I - \mathbf{1}\mathbf{1}^\top/p)Q = 0$. Let $A_1 := R(I - \mathbf{1}\mathbf{1}^\top/p)$ and $B_1 := \operatorname{diag}(\epsilon)J_p$ for some $\epsilon > 0$ that controls how close we are to exact demographic parity. Under this setting, we introduce a general optimization framework for fair structured graph learning via a trace regularization and a fairness constraint on the partition matrix $Q$ as follows:
$$\underset{\Theta, Q}{\text{minimize}}\; \mathcal{L}(\Theta; Y) + \rho_1\|\Theta\|_{1,\text{off}} + \rho_2\operatorname{trace}\big((S + Q)\,G(\Theta)\big) \quad \text{subj. to } \Theta \in \mathcal{M},\; 0 \le A_1 Q \le B_1,\; \text{and } Q \in \bigcup_{K}\mathcal{Q}_{pK}. \tag{3}$$
Here, $G(\Theta) : \mathcal{M} \to \mathcal{M}$ is a function of $\Theta$ (introduced in Sections 3.1 and 3.2).
We clarify the purpose of each component of the minimization (3). The term $\rho_1\|\Theta\|_{1,\text{off}}$ shrinks small entries of $\Theta$ to 0, thus enforcing sparsity in $\Theta$ and consequently in $G$. This term controls the presence of edges between any two nodes irrespective of the community they belong to, with higher values of $\rho_1$ forcing sparser estimators. The polyhedral constraint is the fairness constraint, enforcing that every community contains the $\epsilon$-approximate proportion of elements from each demographic group $D_h$, $h \in [H]$, matching the overall proportion. The term $\rho_2\operatorname{trace}((S + Q)G(\Theta))$ enforces community structure in a similarity graph, $G(\Theta)$. A similar linear trace term in $Q$ is used as an objective function in (Cai et al., 2015; Amini et al., 2018; Hosseini and Lee, 2016; Pircalabelu and Claeskens, 2020; Eisenach et al., 2020) when estimating communities of networks. However, the perturbation of the membership matrix with either a sample covariance whose population inverse covariance satisfies Assumption (A2) or some positive definite matrix is necessary for developing a consistent fair graphical model.
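As an illustration of the fairness constraint, here is a small sketch (ours, with hypothetical group labels) that builds $R$, $A_1$, and $B_1$ from demographic labels and checks how far a partition matrix $Q$ is from demographic parity, i.e., how large $A_1 Q$ is:

```python
import numpy as np

def fairness_constraint_matrices(groups, eps):
    """Build R, A1 = R (I - 11^T/p), and B1 = eps * J_p from demographic
    labels (a sketch; groups[i] is the protected group of node i)."""
    groups = np.asarray(groups)
    p = len(groups)
    R = (groups[:, None] == groups[None, :]).astype(float)   # r_ii = 1
    A1 = R @ (np.eye(p) - np.ones((p, p)) / p)
    B1 = eps * np.ones((p, p))
    return R, A1, B1

# Example: a partition mixing both groups equally satisfies A1 Q = 0.
groups = np.array([0, 0, 1, 1])
z = np.array([0, 1, 0, 1])                 # clusters {0, 2} and {1, 3}
Q = (z[:, None] == z[None, :]) / 2.0       # q_ij = 1/|C_k| within a cluster
R, A1, B1 = fairness_constraint_matrices(groups, eps=1e-3)
print(np.max(np.abs(A1 @ Q)))              # ~0: demographic parity (2) holds
```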
2.2 Relaxation

The problem (3) is in general NP-hard due to its constraint on $Q$. However, it can be relaxed to a computationally feasible problem. To do so, we exploit algebraic properties of a community matrix $Q$. By definition, $Q$ must have the form $Q = \Psi\Gamma\Psi^\top$, where $\Gamma$ is block diagonal with $p_k \times p_k$ blocks on the diagonal, the $k$-th block (associated with the $k$-th community) having all entries equal to $1/p_k$; $\Psi$ is some permutation matrix; and the number of communities $K$ is unknown. The set of all matrices $Q$ of this form is non-convex. The key observation is that any such $Q$ satisfies several convex constraints: (i) all entries of $Q$ are nonnegative, (ii) all diagonal entries of $Q$ are 1, and (iii) $Q$ is positive semi-definite (Cai et al., 2015; Amini et al., 2018; Li et al., 2021). Without loss of generality, we assume that the permutation matrix corresponding to the ground-truth communities is the identity, i.e., $\Psi = I$. Now, let
$$A := [A_1; I_p] \quad \text{and} \quad B := [B_1; J_p].$$
Thus, we propose the following relaxation:
$$\underset{\Theta, Q}{\text{minimize}}\; \mathcal{L}(\Theta; Y) + \rho_1\|\Theta\|_{1,\text{off}} + \rho_2\operatorname{trace}\big((S + Q)\,G(\Theta)\big) \quad \text{subj. to } \Theta \in \mathcal{M} \text{ and } Q \in \mathcal{N}, \tag{4a}$$
where
$$\mathcal{N} = \big\{Q \in \mathbb{R}^{p\times p} : Q \succeq 0,\; 0 \le AQ \le B,\; q_{ii} = 1 \text{ for } 1 \le i \le p\big\}. \tag{4b}$$
The solution of (4) jointly learns the fair community matrix $Q$ and the network estimate $\Theta$. We highlight the following attractive properties of the formulation (4): (i) the communities are allowed to have significantly different sizes; (ii) the number of communities $K$ may grow as $p$ increases; (iii) knowledge of $K$ is not required for fair community detection; and (iv) the objective function (4a) is convex in $\Theta$ given $Q$ and conversely.
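To make the relaxed feasible set concrete, the following sketch (ours) checks whether a candidate $Q$ lies in the set $\mathcal{N}$ of (4b), given $A_1$ and $B_1$ as constructed above:

```python
import numpy as np

def in_relaxed_set(Q, A1, B1, tol=1e-8):
    """Check membership in N of (4b): Q PSD, unit diagonal, and 0 <= A Q <= B,
    with A = [A1; I_p] and B = [B1; J_p] stacked row-wise (a sketch)."""
    p = Q.shape[0]
    A = np.vstack([A1, np.eye(p)])
    B = np.vstack([B1, np.ones((p, p))])
    psd = np.linalg.eigvalsh((Q + Q.T) / 2).min() >= -tol
    unit_diag = np.allclose(np.diag(Q), 1.0, atol=tol)
    box = np.all(A @ Q >= -tol) and np.all(A @ Q <= B + tol)
    return psd and unit_diag and box
```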
2.3 Algorithm

In order to solve (4), we use an alternating direction method of multipliers (ADMM) algorithm (Boyd et al., 2011). ADMM is an attractive algorithm for this problem, as it allows us to decouple some of the terms in (4) that are difficult to optimize jointly. In order to develop an ADMM algorithm for (4) with guaranteed convergence, we reformulate it as follows:
$$\underset{Q, \Theta, \Omega}{\text{minimize}}\; \mathcal{L}(\Theta; Y) + \rho_2\operatorname{trace}\big(S\,G(\Theta)\big) + \rho_1\|\Omega\|_{1,\text{off}} + \rho_2\operatorname{trace}\big(Q\,G(\Omega)\big) \quad \text{subj. to } \Theta = \Omega,\; \Theta \in \mathcal{M},\; \text{and } Q \in \mathcal{N}. \tag{5}$$
The scaled augmented Lagrangian function for (5) takes the form
$$\Upsilon_\gamma(\Theta, \Omega, Q, W) := \mathcal{L}(\Theta; Y) + \rho_2\operatorname{trace}\big(S\,G(\Theta)\big) + \rho_1\|\Omega\|_{1,\text{off}} + \rho_2\operatorname{trace}\big(Q\,G(\Omega)\big) + \frac{\gamma}{2}\|\Theta - \Omega + W\|_F^2, \tag{6}$$
where $\Theta \in \mathcal{M}$, $\Omega$, and $Q \in \mathcal{N}$ are the primal variables; $W$ is the dual variable; and $\gamma > 0$ is a dual parameter. We note that the scaled augmented Lagrangian can be derived from the usual Lagrangian by adding a quadratic term and completing the square (Boyd et al., 2011, Section 3.1.1). The proposed ADMM algorithm requires the following updates:
$$Q^{(t+1)} \leftarrow \operatorname{argmin}_{Q \in \mathcal{N}} \Upsilon_\gamma\big(Q, \Omega^{(t)}, \Theta^{(t)}, W^{(t)}\big), \tag{7a}$$
$$\Omega^{(t+1)} \leftarrow \operatorname{argmin}_{\Omega} \Upsilon_\gamma\big(Q^{(t+1)}, \Omega, \Theta^{(t)}, W^{(t)}\big), \tag{7b}$$
$$\Theta^{(t+1)} \leftarrow \operatorname{argmin}_{\Theta \in \mathcal{M}} \Upsilon_\gamma\big(Q^{(t+1)}, \Omega^{(t+1)}, \Theta, W^{(t)}\big), \tag{7c}$$
$$W^{(t+1)} \leftarrow W^{(t)} + \Theta^{(t+1)} - \Omega^{(t+1)}. \tag{7d}$$
Algorithm 1 Fair Graph Learning via Alternating Direction Method of Multipliers

Initialize primal variables $\Theta^{(0)}$, $Q^{(0)}$, and $\Omega^{(0)}$; dual variable $W^{(0)}$; and positive constants $\gamma, \nu$. Iterate until the stopping criterion
$$\max\left\{\frac{\|\Theta^{(t+1)} - \Theta^{(t)}\|_F^2}{\|\Theta^{(t)}\|_F^2},\; \frac{\|Q^{(t+1)} - Q^{(t)}\|_F^2}{\|Q^{(t)}\|_F^2}\right\} \le \nu$$
is met, where $\Theta^{(t+1)}$ and $Q^{(t+1)}$ are the values of $\Theta$ and $Q$, respectively, obtained at the $t$-th iteration:

S1. $Q^{(t+1)} \leftarrow \operatorname{argmin}_{Q \in \mathcal{N}}\; \operatorname{trace}\big(Q\,G(\Omega^{(t)})\big)$.

S2. $\Omega^{(t+1)} \leftarrow \operatorname{argmin}_{\Omega}\; \rho_2\operatorname{trace}\big(Q^{(t+1)}G(\Omega)\big) + \rho_1\|\Omega\|_1 + \frac{\gamma}{2}\big\|\Theta^{(t)} - \Omega + W^{(t)}\big\|_F^2$.

S3. $\Theta^{(t+1)} \leftarrow \operatorname{argmin}_{\Theta \in \mathcal{M}}\; \mathcal{L}(\Theta; Y) + \rho_2\operatorname{trace}\big(S\,G(\Theta)\big) + \frac{\gamma}{2}\big\|\Theta - \Omega^{(t+1)} + W^{(t)}\big\|_F^2$.

S4. $W^{(t+1)} \leftarrow W^{(t)} + \Theta^{(t+1)} - \Omega^{(t+1)}$.
A general algorithm for solving (4) is provided in Algorithm 1. Note that the updates for $\Theta$, $Q$, and $\Omega$ depend on the form of the functions $\mathcal{L}$ and $G$; they are addressed in Sections 3.1 and 3.2. We also note that the $Q$ sub-problem in S1. can be solved via a variety of convex optimization methods such as CVX (Grant and Boyd, 2014) and ADMM (Cai et al., 2015; Amini et al., 2018). In the following sections, we consider special cases of (4) that lead to estimation of a Gaussian graphical model and an Ising model for, respectively, continuous and binary data.
We have the following global convergence result for Algorithm 1.

Theorem 1. Algorithm 1 converges globally for any sufficiently large $\gamma$,¹ i.e., starting from any $(\Theta^{(0)}, \Omega^{(0)}, Q^{(0)}, W^{(0)})$, it generates a sequence $(\Theta^{(t)}, \Omega^{(t)}, Q^{(t)}, W^{(t)})$ that converges to a stationary point of (6).

¹The lower bound is given in (Wang et al., 2019, Lemma 9).
In Algorithm 1, S1. dominates the computational complexity of each ADMM iteration (Cai et al., 2015; Amini et al., 2018). In fact, an exact implementation of this subproblem requires a full SVD, whose computational complexity is $O(p^3)$. When $p$ is as large as hundreds of thousands, the full SVD is computationally impractical. An open question is how to speed up the implementation, or whether there exists a computationally inexpensive surrogate. A possible remedy is to apply an iterative approximation method in which, in each iteration of ADMM, the full SVD is replaced by a partial SVD where only the leading eigenvalues and eigenvectors are computed. Although this type of method may get stuck in local minimizers, given that the SDP implementation can be viewed as preprocessing before K-means clustering, such a low-rank iterative method might be helpful. It is worth mentioning that when the number of communities $K$ is known, the computational complexity of ADMM is much smaller than $O(p^3)$; see Remark 4 for further discussion.
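To make the update structure of Algorithm 1 concrete, the following sketch (ours, not the authors' released code) mirrors steps S1-S4 in Python/NumPy; the subproblem solvers `solve_Q`, `solve_Omega`, and `solve_Theta` are hypothetical callables supplied by the caller and would be specialized as in Sections 3.1 and 3.2:

```python
import numpy as np

def fair_graph_admm(Y, solve_Q, solve_Omega, solve_Theta,
                    rho1=0.1, rho2=0.05, gamma=1.0, nu=1e-6, max_iter=500):
    """Generic ADMM loop of Algorithm 1 (a sketch; the three subproblem
    solvers are assumptions, e.g. the FCONCORD updates of Section 3.1)."""
    n, p = Y.shape
    S = Y.T @ Y / n                      # sample covariance
    Theta, Omega, Q = np.eye(p), np.eye(p), np.eye(p)   # primal variables
    W = np.zeros((p, p))                 # scaled dual variable
    for t in range(max_iter):
        Q_new = solve_Q(Omega)                                    # S1
        Omega = solve_Omega(Q_new, Theta, W, rho1, rho2, gamma)   # S2
        Theta_new = solve_Theta(S, Omega, W, rho2, gamma)         # S3
        W = W + Theta_new - Omega                                 # S4 (dual ascent)
        # relative-change stopping criterion from Algorithm 1
        crit = max(
            np.linalg.norm(Theta_new - Theta) ** 2 / max(np.linalg.norm(Theta) ** 2, 1e-12),
            np.linalg.norm(Q_new - Q) ** 2 / max(np.linalg.norm(Q) ** 2, 1e-12))
        Theta, Q = Theta_new, Q_new
        if crit <= nu:
            break
    return Theta, Q
```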
2.4 Related Work

To the best of our knowledge, the fair graphical model proposed here is the first model that can jointly learn fair communities simultaneously with the structure of a conditional dependence network. Related work falls into two categories: graphical model estimation and fairness.

Estimation of graphical models. There is a substantial body of literature on methods for estimating network structures from high-dimensional data, motivated by important biomedical and social science applications (Liljeros et al., 2001; Robins et al., 2007; Guo et al., 2011c; Danaher et al., 2014; Friedman et al., 2008; Tan et al., 2014; Guo et al., 2015; Tarzanagh and Michailidis, 2018). Since in most applications the number of model parameters to be estimated far exceeds the available sample size, the assumption of sparsity is made and imposed through regularization of the learned graph. An $\ell_1$ penalty on the parameters encoding the network edges is the most common choice (Friedman et al., 2008; Meinshausen et al., 2006; Karoui, 2008; Cai and Liu, 2011; Xue et al., 2012; Khare et al., 2015; Peng et al., 2009). This approach encourages uniformly sparse network structures, which may not be the most suitable choice for real-world applications whose networks are not uniformly sparse. As argued in (Danaher et al., 2014; Guo et al., 2011b; Tarzanagh and Michailidis, 2018), many networks exhibit different structures at different scales. An example is a densely connected community in the social networks literature. Such structures in social interaction networks may correspond to groups of people sharing common interests or being co-located (Tarzanagh and Michailidis, 2018), while in biological systems they correspond to groups of proteins responsible for regulating or synthesizing chemical products, and in precision medicine the communities may be patients with common disease susceptibilities. An important part of the literature therefore deals with the estimation of hidden communities of nodes, by which it is meant that certain nodes are linked more often to other nodes that are similar, rather than to dissimilar nodes. In this way, the nodes form communities that are more homogeneous within each community than between communities, where there is a larger degree of heterogeneity. Some of the initial work focused on either inferring connectivity information (Marlin and Murphy, 2009) or performing graph estimation when the connectivity or community information is known a priori (Danaher et al., 2014; Guo et al., 2011b; Gan et al., 2019; Ma and Michailidis, 2016; Lee and Liu, 2015), but not both tasks simultaneously. Recent developments consider the two tasks jointly and estimate structured graphical models arising from heterogeneous observations (Kumar et al., 2020; Hosseini and Lee, 2016; Hao et al., 2018; Tarzanagh and Michailidis, 2018; Kumar et al., 2019; Gheche and Frossard, 2020; Pircalabelu and Claeskens, 2020; Cardoso et al., 2020; Eisenach et al., 2020).
Fairness. There is a growing body of work on fairness in machine learning. Much of the research is on fair supervised methods; see Chouldechova and Roth (2018); Barocas et al. (2019); Donini et al. (2018) and references therein. Our paper adds to the literature on fair methods for unsupervised learning tasks (Chierichetti et al., 2017; Celis et al., 2017; Samadi et al., 2018; Tantipongpipat et al., 2019; Oneto and Chiappa, 2020; Caton and Haas, 2020; Kleindessner et al., 2019). We discuss the work on fairness most closely related to our paper. Chierichetti et al. (2017) proposed the notion of fairness for clustering underlying our paper: namely, that each cluster has proportional representation from different demographic groups (Feldman et al., 2015; Zafar et al., 2017). Chierichetti et al. (2017) provide approximation algorithms that incorporate this fairness notion into $K$-center as well as $K$-median clustering. Kleindessner et al. (2019) extend this to $K$-means and provide a provably fair spectral clustering method; they implement $K$-means on the subspace spanned by the smallest fair eigenvectors of the Laplacian matrix. Unlike these works, which assume that the graph structure and/or the number of communities is given in advance, an appealing feature of our method is that it learns fair community structure while estimating heterogeneous graphical models.
3 The Fair Graphical Models

In the following subsections, we consider two special cases of (4) that lead to estimation of graphical models for continuous and binary data.
3.1 Fair Pseudo-Likelihood Graphical Model

Suppose $y_i = (y_i^1, \ldots, y_i^p)$, $i = 1, \ldots, n$, are i.i.d. observations from $\mathcal{N}(0, \Sigma)$. Denote the sample of the $i$-th variable as $y^i = (y^i_1, \ldots, y^i_n)$. Let $\omega_{ij} = -\theta_{ij}/\theta_{ii}$ for all $j \neq i$. We note that the set of nonzero coefficients $\{\omega_{ij}\}_{j\neq i}$ coincides with the set of nonzero entries of the $i$-th row vector $(\theta_{ij})_{j\neq i}$, which defines the set of neighbors of node $i$. Using an $\ell_1$-penalized regression, Meinshausen et al. (2006) estimate the zeros in $\Theta$ by fitting separate Lasso regressions for each variable $y^i$ given the other variables, as follows:
$$F_i(\Theta; Y) = \Big\|y^i - \sum_{j\neq i}\omega_{ij}y^j\Big\|^2 + \rho_1\sum_{1\le i<j\le p}|\omega_{ij}|, \quad \text{where } \omega_{ij} = -\theta_{ij}/\theta_{ii}. \tag{8}$$
These individual Lasso fits give neighborhoods that link each variable to others. Peng et al. (2009) improve this neighborhood selection method by taking the natural symmetry of the problem into account (i.e., $\theta_{ij} = \theta_{ji}$), and propose the following joint objective function (called SPACE):
$$\begin{aligned}
F(\Theta; Y) &= \frac{1}{2}\sum_{i=1}^p\Big(-n\log\theta_{ii} + w_i\Big\|y^i - \sum_{j\neq i}\dot{\omega}_{ij}\sqrt{\frac{\theta_{jj}}{\theta_{ii}}}\,y^j\Big\|^2\Big) + \rho_1\sum_{1\le i<j\le p}|\dot{\omega}_{ij}| \\
&= \frac{1}{2}\sum_{i=1}^p\Big(-n\log\theta_{ii} + w_i\Big\|y^i + \sum_{j\neq i}\frac{\theta_{ij}}{\theta_{ii}}\,y^j\Big\|^2\Big) + \rho_1\sum_{1\le i<j\le p}|\dot{\omega}_{ij}|, \tag{9}
\end{aligned}$$
where $\{w_i\}_{i=1}^p$ are nonnegative weights and $\dot{\omega}_{ij} = -\theta_{ij}/\sqrt{\theta_{ii}\theta_{jj}}$ denotes the partial correlation between the $i$-th and $j$-th variables for $1 \le i \neq j \le p$. Note that $\dot{\omega}_{ij} = \dot{\omega}_{ji}$ for $i \neq j$.
It is shown in (Khare et al., 2015) that the above expression is not convex. Setting $w_i = \theta_{ii}^2$ and putting the $\ell_1$-penalty term on the partial covariances $\theta_{ij}$ instead of on the partial correlations $\dot{\omega}_{ij}$, they obtain a convex pseudo-likelihood approach with good model selection properties called CONCORD. Their objective takes the form
$$F(\Theta; Y) = \sum_{i=1}^p\Big(-n\log\theta_{ii} + \frac{1}{2}\Big\|\theta_{ii}y^i + \sum_{j\neq i}\theta_{ij}y^j\Big\|_2^2\Big) + \rho_1\sum_{1\le i<j\le p}|\theta_{ij}|. \tag{10}$$
Note that the penalized matrix version of the CONCORD objective can be obtained by setting $\mathcal{L}(\Theta; Y) = n/2\,\big[-\log|\operatorname{diag}(\Theta)^2| + \operatorname{trace}(S\Theta^2)\big]$ in (1).
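As a sanity check on (10), the following sketch (ours) evaluates the CONCORD objective directly; it uses the identity that column $i$ of $Y\Theta$ equals $\theta_{ii}y^i + \sum_{j\neq i}\theta_{ij}y^j$ for symmetric $\Theta$:

```python
import numpy as np

def concord_objective(Theta, Y, rho1):
    """Evaluate the CONCORD objective (10); a sketch for intuition, not the
    optimized solver. Theta: (p, p) symmetric with positive diagonal."""
    n, p = Y.shape
    # -n * sum_i log(theta_ii): pseudo-likelihood log term
    log_term = -n * np.sum(np.log(np.diag(Theta)))
    # 0.5 * sum_i ||theta_ii y^i + sum_{j != i} theta_ij y^j||^2 = 0.5 ||Y Theta||_F^2
    quad_term = 0.5 * np.linalg.norm(Y @ Theta) ** 2
    # l1 penalty on off-diagonal entries (each pair counted once)
    l1_term = rho1 * np.sum(np.abs(Theta[np.triu_indices(p, k=1)]))
    return log_term + quad_term + l1_term
```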
Our proposed fair graphical model formulation (called FCONCORD) is a fair version of CONCORD from (10). In particular, letting $G(\Theta) = \Theta^2$ and
$$\mathcal{M} = \big\{\Theta \in \mathbb{R}^{p\times p} : \theta_{ij} = \theta_{ji} \text{ and } \theta_{ii} > 0 \text{ for every } 1 \le i, j \le p\big\}$$
in (4), our problem takes the form
$$\underset{\Theta, Q}{\text{minimize}}\; F(\Theta, Q; Y) := -\frac{n}{2}\log\big|\operatorname{diag}(\Theta)^2\big| + \operatorname{trace}\big(((1+\rho_2)S + \rho_2 Q)\,\Theta^2\big) + \rho_1\|\Theta\|_{1,\text{off}} \quad \text{subj. to } \Theta \in \mathcal{M} \text{ and } Q \in \mathcal{N}. \tag{11}$$
Here, $\mathcal{M}$ and $\mathcal{N}$ are the graph adjacency and fairness constraints, respectively.
Remark 2. When $\rho_2 = 0$, i.e., without a fairness constraint and the second trace term, the objective in (11) reduces to the objective of the CONCORD estimator, and is similar to those of SPACE (Peng et al., 2009), SPLICE (Rocha et al., 2008), and SYMLASSO (Friedman et al., 2010). Our framework is a generalization of these methods to fair graph learning and community detection, when the demographic group representation holds.
Problem (11) can be solved using Algorithm 1. The updates for $\Omega$ and $\Theta$ in S2. and S3. can be derived by minimizing
$$\Upsilon_1(\Omega) := \frac{\rho_2}{2}\operatorname{trace}\big(Q\Omega^2\big) + \rho_1\|\Omega\|_{1,\text{off}} + \frac{\gamma}{2}\|\Theta - \Omega + W\|_F^2, \tag{12a}$$
$$\Upsilon_2(\Theta) := -\frac{n}{2}\log\big|\operatorname{diag}(\Theta)^2\big| + \operatorname{trace}\big((1+\rho_2)S\Theta^2\big) + \frac{\gamma}{2}\|\Theta - \Omega + W\|_F^2, \tag{12b}$$
with respect to $\Omega$ and $\Theta$, respectively.
For $1 \le i \le j \le p$, define the matrix functions $T_{ij} : \mathcal{M} \to \mathcal{M}$ by
$$T_{ij}(\Omega) \in \operatorname{argmin}_{\tilde{\Omega}}\big\{\Upsilon_1(\tilde{\Omega}) : \tilde{\omega}_{kl} = \omega_{kl}\ \text{for all}\ (k,l) \neq (i,j)\big\}, \tag{13a}$$
$$T_{ij}(\Theta) \in \operatorname{argmin}_{\tilde{\Theta}}\big\{\Upsilon_2(\tilde{\Theta}) : \tilde{\theta}_{kl} = \theta_{kl}\ \text{for all}\ (k,l) \neq (i,j)\big\}. \tag{13b}$$
For each $(i,j)$, $T_{ij}(\Omega)$ and $T_{ij}(\Theta)$ update the $(i,j)$-th entry with the minimizer of (12a) and (12b) with respect to $\omega_{ij}$ and $\theta_{ij}$, respectively, holding all other variables constant. Given $T_{ij}(\Omega)$ and $T_{ij}(\Theta)$, the updates for $\Omega$ and $\Theta$ in S2. and S3. can be obtained by a coordinate-wise descent algorithm similar to those proposed in Peng et al. (2009); Khare et al. (2015). Closed-form updates for $T_{ij}(\Omega)$ and $T_{ij}(\Theta)$ are provided in Lemma 3.
Lemma 3. Let $\gamma_n := \gamma/n$. For $1 \le i \le p$, define
$$a_i := (1+\rho_2)s_{ii} + \gamma_n, \qquad b_i := (1+\rho_2)\sum_{j\neq i}\theta_{ij}s_{ij} + \gamma_n(\omega_{ii} - w_{ii}),$$
$$c_i := \rho_2 q_{ii} + \gamma_n, \qquad d_i := \rho_2\sum_{j\neq i}q_{ij}\omega_{ij} + \gamma_n(w_{ii} + \theta_{ii}).$$
Further, for $1 \le i < j \le p$, let
$$a_{ij} := (1+\rho_2)(s_{ii} + s_{jj}) + \gamma_n, \qquad b_{ij} := (1+\rho_2)\Big(\sum_{j'\neq j}\theta_{ij'}s_{jj'} + \sum_{i'\neq i}\theta_{i'j}s_{ii'}\Big) + \gamma_n(w_{ij} - \omega_{ij}),$$
$$c_{ij} := \rho_2(q_{ii} + q_{jj}) + \gamma_n, \qquad d_{ij} := \rho_2\Big(\sum_{j'\neq j}\omega_{ij'}q_{jj'} + \sum_{i'\neq i}\omega_{i'j}q_{ii'}\Big) + \gamma_n(w_{ij} + \theta_{ij}).$$
Then, we have
$$T_{ii}(\Omega)_{ii} = \frac{d_i}{c_i}, \qquad T_{ii}(\Theta)_{ii} = \frac{-b_i + \sqrt{b_i^2 + 4a_i}}{2a_i}, \qquad \text{for all } 1 \le i \le p, \tag{14a}$$
$$T_{ij}(\Omega)_{ij} = \mathcal{S}\Big(\frac{d_{ij}}{c_{ij}}, \frac{\rho_1}{\gamma_n}\Big), \qquad T_{ij}(\Theta)_{ij} = -\frac{b_{ij}}{a_{ij}}, \qquad \text{for all } 1 \le i < j \le p, \tag{14b}$$
where $\mathcal{S}(\alpha, \beta) := \operatorname{sign}(\alpha)\max(|\alpha| - \beta, 0)$.
Remark 4. When $K$ and $H$ are known, the complexity of Algorithm 1 is of the same order as those of the CONCORD estimator, SPACE, SPLICE, and SYMLASSO. In fact, computing the fair clustering matrix $Q$ requires $O((p - H + 1)^2 K)$ operations. On the other hand, it follows from (Khare et al., 2015, Lemma 5) that the $\Theta$ and $\Omega$ updates can be performed with complexity $\min\{O(np^2), O(p^3)\}$. This shows that when the number of communities is known, the computational cost of each iteration of FCONCORD is $\max\big\{\min\{O(np^2), O(p^3)\},\; O((p - H + 1)^2 K)\big\}$.
3.1.1 Large Sample Properties of FCONCORD

We show that under suitable conditions, the FCONCORD estimator achieves both model selection consistency and estimation consistency.

As in other studies (Khare et al., 2015; Peng et al., 2009), for the convergence analysis we assume that the diagonals of the graph matrix $\Theta$ and the partition matrix $Q$ are known. Let $\theta^o = (\theta_{ij})_{1\le i<j\le p}$ and $q^o = (q_{ij})_{1\le i<j\le p}$ denote the vectors of off-diagonal entries of $\Theta$ and $Q$, respectively. Let $\theta^d$ and $q^d$ denote the vectors of diagonal entries of $\Theta$ and $Q$, respectively. Let $\bar{\theta}^o$, $\bar{\theta}^d$, $\bar{q}^o$, and $\bar{q}^d$ denote the true values of $\theta^o$, $\theta^d$, $q^o$, and $q^d$, respectively. Let $\mathcal{B}$ denote the set of non-zero entries in the vector $\bar{\theta}^o$ and define
$$q := |\mathcal{B}|, \qquad \Psi(p, H, K) := (p - H + 1)\big((p - H + 1)/K - 1\big). \tag{15}$$
In our consistency analysis, we let the regularization parameters $\rho_1 = \rho_{1n}$ and $\rho_2 = \rho_{2n}$ vary with $n$. The following standard assumptions are required:

Assumption A

(A1) The random vectors $y_1, \ldots, y_n$ are i.i.d. sub-Gaussian for every $n \ge 1$, i.e., there exists $M > 0$ such that $\|u^\top y_i\|_{\psi_2} \le M\sqrt{\mathbb{E}(u^\top y_i)^2}$ for all $u \in \mathbb{R}^p$. Here, $\|y\|_{\psi_2} = \sup_{t\ge 1}(\mathbb{E}|y|^t)^{1/t}/\sqrt{t}$.

(A2) There exist constants $\tau_1, \tau_2 \in (0, \infty)$ such that
$$\tau_1 < \Lambda_{\min}(\bar{\Theta}) \le \Lambda_{\max}(\bar{\Theta}) < \tau_2.$$

(A3) There exists a constant $\tau_3 \in (0, \infty)$ such that
$$0 \le \Lambda_{\min}(\bar{Q}) \le \Lambda_{\max}(\bar{Q}) < \tau_3.$$

(A4) For any $K, H \in [p]$, we have $K \le p - H + 1$.

(A5) There exists a constant $\delta \in (0, 1]$ such that, for any $(i,j) \in \mathcal{B}^c$,
$$\big|\bar{H}_{ij,\mathcal{B}}\,\bar{H}^{-1}_{\mathcal{B},\mathcal{B}}\,\operatorname{sign}(\bar{\theta}^o_{\mathcal{B}})\big| \le (1 - \delta),$$
where, for $1 \le i, j, t, s \le p$ satisfying $i < j$ and $t < s$,
$$\bar{H}_{ij,ts} := \mathbb{E}_{\bar{\theta}^o}\left[\frac{\partial^2\mathcal{L}(\bar{\theta}^d, \theta^o; Y)}{\partial\theta_{ij}\,\partial\theta_{ts}}\right]\Bigg|_{\theta^o = \bar{\theta}^o}. \tag{16}$$
Assumptions (A2)-(A3) guarantee that the eigenvalues of the true graph matrix $\bar{\Theta}$ and those of the true membership matrix $\bar{Q}$ are well-behaved. Assumption (A4) links how $H$, $K$, and $p$ can grow with $n$. Note that $K$ is limited in order for the fairness constraints to be meaningful; if $K > p - H + 1$, then there can be no community with $H$ nodes among which we enforce fairness. Assumption (A5) corresponds to the incoherence condition in Meinshausen et al. (2006), which plays an important role in proving model selection consistency of $\ell_1$-penalization problems. Zhao and Yu (2006) show that such a condition is almost necessary and sufficient for model selection consistency in Lasso regression, and they provide some examples where this condition is satisfied. Note that Assumptions (A1), (A2), and (A5) are identical to Assumptions (C0)-(C2) in Peng et al. (2009). Further, it follows from Peng et al. (2009) that under Assumption (A5), for any $(i,j) \in \mathcal{B}^c$,
$$\big\|\bar{H}_{ij,\mathcal{B}}\,\bar{H}^{-1}_{\mathcal{B},\mathcal{B}}\big\| \le M(\bar{\theta}^o) \tag{17}$$
for some finite $M(\bar{\theta}^o)$.
Next, inspired by Peng et al. (2009); Khare et al. (2015), we prove estimation consistency for the nodewise FCONCORD.

Theorem 5. Suppose Assumptions (A1)-(A5) are satisfied. Assume further that $p = O(n^\kappa)$ for some $\kappa > 0$, $\rho_{1n} = O(\sqrt{\log p/n})$, $n > O(q\log p)$ as $n \to \infty$, $\rho_{2n} = O(\sqrt{\log(p - H + 1)/n})$, $\rho_{2n} \le \delta\rho_{1n}/\big((1 + M(\bar{\theta}^o))\tau_2\tau_3\big)$, and $\epsilon = 0$. Then, there exist finite constants $C(\bar{\theta}^o)$ and $D(\bar{q}^o)$ such that for any $\eta > 0$, the following events hold with probability at least $1 - O(\exp(-\eta\log p))$:

- There exists a minimizer $(\hat{\theta}^o, \hat{q}^o)$ of (11) such that
$$\max\big\{\|\hat{\theta}^o - \bar{\theta}^o\|,\, \|\hat{q}^o - \bar{q}^o\|\big\} \le \max\Big\{C(\bar{\theta}^o)\,\rho_{1n}\sqrt{q}/(1+\rho_{2n}),\; D(\bar{q}^o)\,\rho_{2n}\sqrt{\Psi(p, H, K)}\Big\},$$
where $q$ and $\Psi(p, H, K)$ are defined in (15).

- If $\min_{(i,j)\in\mathcal{B}}|\bar{\theta}_{ij}| \ge 2C(\bar{\theta}^o)\,\rho_{1n}\sqrt{q}/(1+\rho_{2n})$, then $\hat{\theta}^o_{\mathcal{B}^c} = 0$.

Theorem 5 provides sufficient conditions on the quadruple $(n, p, H, K)$ and the model parameters for FCONCORD to succeed in consistently estimating the neighborhood of every node in the graph and the communities simultaneously. Notice that if $H = 1$ (no fairness) and $K = p$ (no clustering), we recover the results of Khare et al. (2015); Peng et al. (2009).
3.2 Fair Ising Graphical Model

In the previous section, we studied the fair estimation of graphical models for continuous data. Next, we focus on estimating an Ising Markov random field (Ising, 1925), suitable for binary or categorical data. Let $y = (y_1, \ldots, y_p) \in \{0,1\}^p$ denote a binary random vector. The Ising model specifies the probability mass function
$$p(y) = \frac{1}{W(\Theta)}\exp\Big(\sum_{j=1}^p\theta_{jj}y_j + \sum_{1\le j<j'\le p}\theta_{jj'}y_jy_{j'}\Big). \tag{18}$$
Here, $W(\Theta)$ is the partition function, which ensures that the probability mass function in (18) sums to one; $\Theta$ is a $p \times p$ symmetric matrix that specifies the graph structure: $\theta_{jj'} = 0$ implies that the $j$-th and $j'$-th variables are conditionally independent given the remaining ones.
Several sparse estimation procedures for this model have been proposed. Lee et al. (2007) considered maximizing an $\ell_1$-penalized log-likelihood for this model. Due to the difficulty of computing the log-likelihood with its expensive partition function, alternative approaches have been considered. For instance, Ravikumar et al. (2010) proposed a neighborhood selection approach which involves solving $p$ logistic regressions separately (one for each node in the network), which leads to an estimated parameter matrix that is in general not symmetric. In contrast, others have considered maximizing an $\ell_1$-penalized pseudo-likelihood with a symmetric constraint on $\Theta$ (Guo et al., 2011c,a; Tan et al., 2014; Tarzanagh and Michailidis, 2018). Under the probability model above, the negative log-pseudo-likelihood for $n$ observations takes the form
$$\mathcal{L}(\Theta; Y) = -\sum_{j=1}^p\sum_{j'=1}^p\theta_{jj'}s_{jj'} + \frac{1}{n}\sum_{i=1}^n\sum_{j=1}^p\log\Big(1 + \exp\Big(\theta_{jj} + \sum_{j'\neq j}\theta_{jj'}y_{ij'}\Big)\Big). \tag{19}$$
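The following sketch (ours) evaluates the loss in (19) directly; it vectorizes the per-node logits $\theta_{jj} + \sum_{j'\neq j}\theta_{jj'}y_{ij'}$:

```python
import numpy as np

def ising_pseudolikelihood(Theta, Y):
    """Negative log-pseudo-likelihood (19) of an Ising model; a sketch.
    Y: (n, p) binary matrix, Theta: (p, p) symmetric."""
    n, p = Y.shape
    S = Y.T @ Y / n                              # second-moment matrix s_{jj'}
    linear = -np.sum(Theta * S)                  # -sum_{j,j'} theta_{jj'} s_{jj'}
    # logits[i, j] = theta_jj + sum_{j' != j} theta_{jj'} y_{ij'}
    logits = Y @ Theta - Y * np.diag(Theta) + np.diag(Theta)
    log_partition = np.mean(np.sum(np.log1p(np.exp(logits)), axis=1))
    return linear + log_partition
```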
We propose to additionally impose the fairness constraints on $\Theta$ in (19) in order to obtain a sparse binary network with fair communities. This leads to the criterion
$$\begin{aligned}
\underset{\Theta, Q}{\text{minimize}} \quad & F(\Theta, Q; Y) + \rho_1\|\Theta\|_{1,\text{off}} := -n\sum_{j=1}^p\sum_{j'=1}^p\theta_{jj'}(s_{jj'} + \rho_2 q_{jj'}) \\
&\quad + \sum_{i=1}^n\sum_{j=1}^p\log\Big(1 + \exp\Big(\theta_{jj} + \sum_{j'\neq j}\theta_{jj'}y_{ij'}\Big)\Big) + \rho_1\sum_{1\le i<j\le p}|\theta_{ij}|, \\
\text{subj. to} \quad & \Theta \in \mathcal{M} \text{ and } Q \in \mathcal{N}. \tag{20}
\end{aligned}$$
Here, $\mathcal{M}$ and $\mathcal{N}$ are the graph and fairness constraints, respectively.

We refer to the solution of (20) as the Fair Binary Network (FBN). An interesting connection can be drawn between our technique and a fair variant of the Ising block model discussed in Berthet et al. (2016), which is a perturbation of the mean-field approximation of the Ising model known as the Curie-Weiss model: the sites are partitioned into two blocks of equal size, and the interaction between sites within the same block is stronger than across blocks, to account for more order within each block. One can easily see that the Ising block model is a special case of (20).
An ADMM algorithm for solving (20) is given in Algorithm 1. The update for $\Omega$ in S2. can be obtained from (12a) by replacing $\Omega^2$ with $\Omega$. We solve the update for $\Theta$ in S3. using a relaxed variant of the Barzilai-Borwein method (Barzilai and Borwein, 1988). The details are given in (Tarzanagh and Michailidis, 2018, Algorithm 2).
3.2.1 Large Sample Properties of FBN

In this section, we present the model selection consistency property of the separate regularized logistic regressions. The spirit of the proof is similar to Ravikumar et al. (2010), but their model does not include the membership matrix $Q$ and the fairness constraints, which are significant differences.
Similar to Section 3.1.1, let $\theta^o = (\theta_{ij})_{1\le i<j\le p}$ and $q^o = (q_{ij})_{1\le i<j\le p}$ denote the vectors of off-diagonal entries of $\Theta$ and $Q$, respectively. Let $\theta^d$ and $q^d$ denote the vectors of diagonal entries of $\Theta$ and $Q$, respectively. Let $\bar{\theta}^o$, $\bar{\theta}^d$, $\bar{q}^o$, $\bar{q}^d$, $\bar{\Theta}$, and $\bar{Q}$ denote the true values of $\theta^o$, $\theta^d$, $q^o$, $q^d$, $\Theta$, and $Q$, respectively. Let $\mathcal{B}$ denote the set of non-zero entries in the vector $\bar{\theta}^o$, and let $q = |\mathcal{B}|$.
Denote the log-pseudo-likelihood loss for the $i$-th observation by
$$\mathcal{L}_i(\theta^d, \theta^o; Y) = \sum_{j=1}^p\Big[-y_{ij}\Big(\theta_{jj} + \sum_{j'\neq j}\theta_{jj'}y_{ij'}\Big) + \log\Big(1 + \exp\Big(\theta_{jj} + \sum_{j'\neq j}\theta_{jj'}y_{ij'}\Big)\Big)\Big]. \tag{21}$$
The population Fisher information matrix of $\mathcal{L}$ at $(\bar{\theta}^d, \bar{\theta}^o)$ can be expressed as $\bar{H} = \mathbb{E}\big(\nabla^2\mathcal{L}_i(\bar{\theta}^d, \bar{\theta}^o; Y)\big)$, with sample counterpart $\bar{H}^n = \frac{1}{n}\sum_{i=1}^n\nabla^2\mathcal{L}_i(\bar{\theta}^d, \bar{\theta}^o; Y)$. Let
$$v_{ij} = \dot{v}_{ij}(1 - \dot{v}_{ij}), \quad \text{where} \quad \dot{v}_{ij} = \frac{\exp\big(\theta_{jj} + \sum_{j'\neq j}\theta_{jj'}y_{ij'}\big)}{1 + \exp\big(\theta_{jj} + \sum_{j'\neq j}\theta_{jj'}y_{ij'}\big)}.$$
Let $\tilde{y}_j = (v_{1j}y_{1j} - \dot{y}_j, \ldots, v_{nj}y_{nj} - \dot{y}_j)^\top$, where $\dot{y}_j = \frac{1}{n}\sum_{i=1}^n v_{ij}y_{ij}$. We use $\tilde{X} = (\tilde{X}^{(1,2)}, \ldots, \tilde{X}^{(p-1,p)})$ to denote an $np \times p(p-1)/2$ matrix, with
$$\tilde{X}^{(j,j')} = \big(0_n, \ldots, 0_n, \underbrace{\tilde{y}_j}_{j\text{-th block}}, 0_n, \ldots, 0_n, \underbrace{\tilde{y}_{j'}}_{j'\text{-th block}}, 0_n, \ldots, 0_n\big)^\top,$$
where $0_n$ is an $n$-dimensional column vector of zeros. Let $\tilde{X}_{(i,j)}$ denote the $[(j-1)n + i]$-th row of $\tilde{X}$, and let $\tilde{X}_{(i)} = (\tilde{X}_{(i,1)}, \ldots, \tilde{X}_{(i,p)})$. Let $T = \mathbb{E}\big(\tilde{X}_{(i)}\tilde{X}_{(i)}^\top\big)$, with $T^n = \frac{1}{n}\sum_{i=1}^n\tilde{X}_{(i)}\tilde{X}_{(i)}^\top$ as its sample counterpart.

Our results rely on Assumptions (A3)-(A4) and the following regularity conditions:

Assumption B

(B1) There exist constants $\tau_4, \tau_5 \in (0, \infty)$ such that
$$\Lambda_{\min}(\bar{H}_{\mathcal{B}\mathcal{B}}) \ge \tau_4 \quad \text{and} \quad \Lambda_{\max}(T) \le \tau_5.$$

(B2) There exists a constant $\delta \in (0, 1]$ such that
$$\big\|\bar{H}_{\mathcal{B}^c\mathcal{B}}\,\bar{H}_{\mathcal{B}\mathcal{B}}^{-1}\big\|_\infty \le (1 - \delta). \tag{22}$$
Under these assumptions and (A3)-(A4), we have the following result:

Theorem 6. Suppose Assumptions (A3)-(A4) and (B1)-(B2) are satisfied. Assume further that $\rho_{1n} = O(\sqrt{\log p/n})$, $n > O(q^3\log p)$ as $n \to \infty$, $\rho_{2n} = O(\sqrt{\log(p - H + 1)/n})$, $\rho_{2n} \le \delta\rho_{1n}/\big(4(2-\delta)\tau_3\|\bar{\theta}^o\|\big)$, and $\epsilon = 0$. Then, there exist finite constants $\check{C}(\bar{\theta}^o)$, $\check{D}(\bar{q}^o)$, and $\eta$ such that the following events hold with probability at least $1 - O(\exp(-\eta\rho_{1n}^2 n))$:

- There exists a local minimizer $(\hat{\theta}^o, \hat{q}^o)$ of (20) such that
$$\max\big\{\|\hat{\theta}^o - \bar{\theta}^o\|,\, \|\hat{q}^o - \bar{q}^o\|\big\} \le \max\Big\{\check{C}(\bar{\theta}^o)\,\rho_{1n}\sqrt{q},\; \check{D}(\bar{q}^o)\,\rho_{2n}\sqrt{\Psi(p, H, K)}\Big\}, \tag{23}$$
where $q$ and $\Psi(p, H, K)$ are defined in (15).

- If $\min_{(i,j)\in\mathcal{B}}|\bar{\theta}^o_{ij}| \ge 2\check{C}(\bar{\theta}^o)\,\rho_{1n}\sqrt{q}$, then $\hat{\theta}^o_{\mathcal{B}^c} = 0$.

Theorem 6 provides sufficient conditions on the quadruple $(n, p, H, K)$ and the model parameters for FBN to succeed in consistently estimating the neighborhood of every node in the graph and the communities simultaneously. We note that if $H = 1$ (no fairness) and $K = p$ (no clustering), we recover the results of Ravikumar et al. (2010).
3.3 Consistency of Fair Community Labeling in Graphical Models

In this section, we aim to show that our algorithms recover the fair ground-truth community structure in the graph. Let $\hat{V}$ and $\bar{V}$ contain as columns the orthonormal eigenvectors corresponding to the $K$ leading eigenvalues of $\hat{Q}$ and $\bar{Q}$, respectively. It follows from (Lei et al., 2015, Lemma 2.1) that if any rows of the matrix $\bar{V}$ are the same, then the corresponding nodes belong to the same cluster. Consequently, we want to show that, up to some orthogonal transformation, the rows of $\hat{V}$ are close to the rows of $\bar{V}$, so that we can simply apply K-means clustering to the rows of the matrix $\hat{V}$. In particular, we consider the K-means approach of Lei et al. (2015), defined as
$$(\hat{U}, \hat{O}) = \operatorname{argmin}_{U, O}\;\|UO - \hat{V}\|_F^2, \quad \text{subject to } U \in \mathbb{M}_{p,K},\; O \in \mathbb{R}^{K\times K}, \tag{24}$$
where $\mathbb{M}_{p,K}$ is the set of $p \times K$ matrices that have a single 1 in each row, indicating the fair community to which the node belongs, with all other entries of the row set to 0, since a node belongs to only one community. Finding a global minimizer of problem (24) is NP-hard (Aloise et al., 2009). However, there are polynomial-time approaches (Kumar et al., 2004) that find an approximate solution $(\hat{U}, \hat{O}) \in \mathbb{M}_{p,K} \times \mathbb{R}^{K\times K}$ such that
$$\|\hat{U}\hat{O} - \hat{V}\|_F^2 \le (1 + \xi)\min_{(U,O)\in\mathbb{M}_{p,K}\times\mathbb{R}^{K\times K}}\|UO - \hat{V}\|_F^2. \tag{25}$$
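As a sketch of this labeling step (ours, using scikit-learn's KMeans as a stand-in for a $(1+\xi)$-approximate solver):

```python
import numpy as np
from sklearn.cluster import KMeans

def fair_community_labels(Q_hat, K, seed=0):
    """Recover community labels from an estimated fair partition matrix:
    take the K leading eigenvectors of Q_hat and cluster their rows.
    KMeans stands in for a (1 + xi)-approximate K-means solver (a sketch)."""
    eigvals, eigvecs = np.linalg.eigh((Q_hat + Q_hat.T) / 2)
    V_hat = eigvecs[:, np.argsort(eigvals)[::-1][:K]]   # K leading eigenvectors
    return KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(V_hat)
```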
Next, similar to (Lei et al., 2015, Theorem 3.1), we quantify the errors made when performing $(1+\xi)$-approximate K-means clustering on the rows of $\hat{V}$ to estimate the community membership matrix. To do so, let $E_k$ denote the set of misclassified nodes from the $k$-th community. By $\bar{C} = \cup_{k\in[K]}(C_k \setminus E_k)$ we denote the set of all nodes correctly classified across all communities, and by $\bar{V}_{\bar{C}}$ we denote the submatrix of $\bar{V}$ formed by retaining only the rows indexed by the set $\bar{C}$ of correctly classified nodes and all columns. Theorem 7 relates the sizes of the sets of misclassified nodes for each fair community and specifies conditions on the interplay between $n$, $p$, $H$, and $K$.

Theorem 7. Let $\hat{U}$ be the output of the $(1+\xi)$-approximate K-means given in (25). If
$$(2 + \xi)\,\varphi(p, H, K)\sqrt{\frac{K}{n}} < \pi$$
for some constant $\pi > 0$, then there exist subsets $E_k \subseteq C_k$ for $k = 1, \ldots, K$, and a permutation matrix $\Phi$ such that $\hat{V}_{\bar{C}}\Phi = \bar{V}_{\bar{C}}$, and
$$\sum_{k=1}^K|E_k|/|C_k| \le \pi^{-1}(2 + \xi)\,\varphi(p, H, K)\sqrt{\frac{K}{n}}$$
with probability tending to 1.
4 Simulation Study

4.1 Tuning Parameter Selection

We consider a Bayesian information criterion (BIC)-type quantity for tuning parameter selection in (3). Recall from Section 2.1 that the objective function (3) decomposes the parameter of interest into $(\Theta, Q)$ and places $\ell_1$ and trace penalties on $\Theta$ and $Q$, respectively. Specifically, for the graphical Lasso, i.e., the problem in (3) with $\rho_2 = 0$, Yuan and Lin (2006) proposed to select the tuning parameter $\rho_1$ such that $\hat{\Theta}$ minimizes the following quantity:
$$-n\log\det(\hat{\Theta}) + n\cdot\operatorname{trace}(S\hat{\Theta}) + \log(n)\cdot|\hat{\Theta}|.$$
Here, $|\hat{\Theta}|$ is the cardinality of $\hat{\Theta}$, i.e., the number of unique non-zeros in $\hat{\Theta}$. Using a similar idea, we consider minimizing the following BIC-type criterion for selecting the set of tuning parameters $(\rho_1, \rho_2)$ for (3):
$$\mathrm{BIC}(\hat{\Theta}, \hat{Q}) := \sum_{k=1}^K\Big(-n_k\log\big|\hat{\Theta}_k^2\big| + \operatorname{trace}\big(\big((1+c)S_k + \hat{Q}_k\big)\hat{\Theta}_k^2\big) + \log(n_k)\cdot\big|\hat{\Theta}_k\big|\Big),$$
where $\hat{\Theta}_k$ is the $k$-th estimated inverse covariance matrix.

Note that when the constant $c$ is large, $\mathrm{BIC}(\hat{\Theta}, \hat{Q})$ will favor more graph estimation in $\hat{\Theta}$. Throughout, we take $c = 0.25$.
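A minimal sketch of this criterion (ours; the per-community inputs `Theta_list`, `Q_list`, `S_list`, `n_list` are hypothetical names for the $k$-th estimated precision matrix, partition block, sample covariance, and sample size):

```python
import numpy as np

def bic_fair_gm(Theta_list, Q_list, S_list, n_list, c=0.25):
    """BIC-type criterion of Section 4.1, following the displayed formula."""
    total = 0.0
    for Theta, Q, S, nk in zip(Theta_list, Q_list, S_list, n_list):
        Theta2 = Theta @ Theta
        sign, logdet = np.linalg.slogdet(Theta2)       # log |Theta_k^2|
        card = np.count_nonzero(np.triu(Theta))        # unique non-zeros
        total += (-nk * logdet
                  + np.trace(((1 + c) * S + Q) @ Theta2)
                  + np.log(nk) * card)
    return total
```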
4.2 Notation and Measures of Performance

We define several measures of performance that will be used to numerically compare the various methods. To assess the clustering performance, we compute the following clustering error (CE), which measures the distance between an estimated community assignment $\hat{z}_i$ and the true assignment $z_i$ of the $i$-th node:
$$\mathrm{CE} := \frac{1}{p}\big|\{(i, j) : \mathbf{1}(\hat{z}_i = \hat{z}_j) \neq \mathbf{1}(z_i = z_j),\; i < j\}\big|.$$
To measure the estimation quality, we calculate the proportion of correctly estimated edges (PCEE):
$$\mathrm{PCEE} := \frac{\sum_{j'<j}\mathbf{1}\{|\hat{\theta}_{jj'}| > 10^{-5}\text{ and }|\theta_{jj'}| \neq 0\}}{\sum_{j'<j}\mathbf{1}\{|\theta_{jj'}| \neq 0\}}.$$
Finally, we use balance as a fairness metric to reflect the distribution of the fair clustering (Chierichetti et al., 2017). Let $N_i = \{j : r_{ij} = 1\}$ be the set of neighbors of node $i$ in $R$. For a set of communities $\{C_k\}_{k=1}^K$, the balance coefficient is defined as
$$\mathrm{Balance} := \frac{1}{p}\sum_{i=1}^p\min_{k,\ell\in[K]}\frac{|C_k \cap N_i|}{|C_\ell \cap N_i|}.$$
The balance coefficient, called simply the balance, is used to quantify how well the selected edges can eliminate discrimination: the selected edges are considered fairer if they lead to a balanced community structure that preserves the proportions of protected attributes.
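For reference, a sketch (ours) of these three metrics in NumPy, following the displayed formulas:

```python
import numpy as np

def clustering_error(z_hat, z):
    """CE: count of pairs whose co-membership disagrees between the estimated
    and true assignments, normalized as in the displayed formula."""
    p = len(z)
    same_hat = z_hat[:, None] == z_hat[None, :]
    same_true = z[:, None] == z[None, :]
    iu = np.triu_indices(p, k=1)
    return np.sum(same_hat[iu] != same_true[iu]) / p

def pcee(Theta_hat, Theta, tol=1e-5):
    """Proportion of true edges recovered above the threshold tol."""
    iu = np.triu_indices(Theta.shape[0], k=1)
    true_edges = np.abs(Theta[iu]) != 0
    found = (np.abs(Theta_hat[iu]) > tol) & true_edges
    return found.sum() / true_edges.sum()

def balance(z, R):
    """Balance: average over nodes of the worst-case ratio of demographic
    neighbor counts across communities."""
    p, vals = len(z), []
    for i in range(p):
        Ni = np.where(R[i] == 1)[0]
        counts = np.array([np.sum(z[Ni] == k) for k in np.unique(z)])
        m = counts.max()
        vals.append(counts.min() / m if m > 0 else 0.0)
    return np.mean(vals)
```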
4.3 Data Generation

In order to demonstrate the performance of the proposed algorithms, we create several synthetic datasets based on a special random graph with community and group structures. The baseline and proposed algorithms are then used to recover the graphs (i.e., graph-based models) from the artificially generated data. To create a dataset, we first construct a graph; then its associated matrix $\Theta$ is used to generate independent data samples from the distribution $\mathcal{N}(0, \Theta^{-1})$. A graph (i.e., $\Theta$) is constructed in two steps. In the first step, we determine the graph structure (i.e., connectivity) based on the random modular graph, also known as the stochastic block model (SBM) (Holland et al., 1983; Kleindessner et al., 2019).

The SBM takes as input a function $\pi_c : [p] \to [K]$ that assigns each vertex $i \in V$ to one of the $K$ clusters. Then, independently for all node pairs $(i, j)$ such that $i > j$, $P(a_{ij} = 1) = b_{\pi_c(i)\pi_c(j)}$, where $B \in [0,1]^{K\times K}$ is a symmetric matrix. Each $b_{k\ell}$ specifies the probability of a connection between two nodes that belong to clusters $C_k$ and $C_\ell$, respectively. A commonly used variant of SBM assumes $b_{kk} = \xi_1$ and $b_{k\ell} = \xi_2$ for all $k, \ell \in [K]$ such that $k \neq \ell$. Let $\pi_d : [p] \to [H]$ be another function that assigns each vertex $i \in V$ to one of the $H$ protected groups. We consider a variant of SBM with the following probabilities:
$$P(a_{ij} = 1) = \begin{cases} \zeta_4 & \text{if } \pi_c(i) = \pi_c(j) \text{ and } \pi_d(i) = \pi_d(j), \\ \zeta_3 & \text{if } \pi_c(i) \neq \pi_c(j) \text{ and } \pi_d(i) = \pi_d(j), \\ \zeta_2 & \text{if } \pi_c(i) = \pi_c(j) \text{ and } \pi_d(i) \neq \pi_d(j), \\ \zeta_1 & \text{if } \pi_c(i) \neq \pi_c(j) \text{ and } \pi_d(i) \neq \pi_d(j). \end{cases} \tag{26}$$
Here, $1 \ge \zeta_{i+1} \ge \zeta_i \ge 0$ are the probabilities used for sampling edges. In our implementation, we set $\zeta_i = 0.1 i$ for all $i = 1, \ldots, 4$. We note that when vertices $i$ and $j$ belong to the same community, they have a higher probability of connection for a fixed value of $\pi_d$; see (Kleindessner et al., 2019) for further discussion.

In the second step, the graph weights (i.e., node and edge weights) are randomly selected based on a uniform distribution on the interval $[0.1, 3]$, and the associated (Laplacian) matrix $\Theta$ is constructed. Finally, given the graph matrix $\Theta$, we generate the data matrix $Y$ according to $y_1, \ldots, y_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \Theta^{-1})$.
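A sketch of this generator (ours; the uniform label assignment and the small diagonal shift that makes the Laplacian invertible are our assumptions, not specified in the text):

```python
import numpy as np

def sample_fair_sbm(p=100, K=2, H=2, zetas=(0.1, 0.2, 0.3, 0.4), n=200, seed=0):
    """Generate data from the SBM variant (26); a sketch of the pipeline."""
    rng = np.random.default_rng(seed)
    pi_c = rng.integers(0, K, size=p)        # community labels (assumption: uniform)
    pi_d = rng.integers(0, H, size=p)        # protected-group labels
    A = np.zeros((p, p))
    for i in range(p):
        for j in range(i):
            same_c, same_d = pi_c[i] == pi_c[j], pi_d[i] == pi_d[j]
            prob = (zetas[3] if same_c and same_d else   # zeta_4
                    zetas[2] if same_d else              # zeta_3
                    zetas[1] if same_c else zetas[0])    # zeta_2 / zeta_1
            A[i, j] = A[j, i] = rng.random() < prob
    # weighted Laplacian-style precision matrix with weights in [0.1, 3]
    W = np.triu(A * rng.uniform(0.1, 3.0, size=(p, p)), 1)
    W = W + W.T
    Theta = np.diag(W.sum(axis=1)) - W + 0.1 * np.eye(p)  # shift: ensure PD (assumption)
    Y = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Theta), size=n)
    return Y, Theta, pi_c, pi_d
```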
4.4 Comparison to Graphical Lasso and Neighbourhood Selection Methods

We consider four setups for comparing our methods with community-based graphical models (GM):

I. Three-stage approach for which we (i) use a GM to estimate the precision matrix $\hat{\Theta}$, (ii) apply a community detection approach (Cai et al., 2015) to compute the partition matrix $\hat{Q}$, and (iii) employ K-means clustering to obtain clusters.

II. Two-stage approach for which we (i) use (4) without the fairness constraint to simultaneously estimate the precision and partition matrices and (ii) employ K-means clustering to obtain clusters.

FI. Three-stage approach for which we (i) use a GM to estimate the precision matrix $\hat{\Theta}$, (ii) apply a community detection approach (Cai et al., 2015) to compute the partition matrix $\hat{Q}$, and (iii) employ fair K-means clustering (Chierichetti et al., 2017) to obtain clusters.

FII. Two-stage approach for which we (i) use (4) to simultaneously estimate the precision and partition matrices and (ii) employ K-means clustering to obtain clusters.
The main goal of Setups I. and II. is to compare the community detection errors without fairness constraints under different settings of the $\mathcal{L}$ and $G$ functions.

We consider three types of GMs in Setups I.-FII.:

A. A graphical Lasso-type method (Friedman et al., 2008) implemented using $\mathcal{L}(\Theta; Y) = n/2\,[-\log\det(\Theta) + \operatorname{trace}(S\Theta)]$ and $G(\Theta) = \Theta$.

B. A neighborhood selection-type method (Khare et al., 2015) implemented using $\mathcal{L}(\Theta; Y) = n/2\,[-\log|\operatorname{diag}(\Theta^2)| + \operatorname{trace}(S\Theta^2)]$ and $G(\Theta) = \Theta^2$.

C. A neighborhood selection-type method (Ravikumar et al., 2010) implemented using $\mathcal{L}(\Theta; Y) = -\sum_{j=1}^p\sum_{j'=1}^p\theta_{jj'}s_{jj'} + \frac{1}{n}\sum_{i=1}^n\sum_{j=1}^p\log\big(1 + \exp\big(\theta_{jj} + \sum_{j'\neq j}\theta_{jj'}y_{ij'}\big)\big)$ and $G(\Theta) = \Theta$.

In our results, we name each method "GM-Type-Setup" to refer to the GM type and the setup above. For example, GM-A.II. refers to a graphical Lasso used in the first step of the two-stage approach in II. It is worth mentioning that GM-A.II. can be seen as a variant of the recently developed cluster-based GLASSO (Pircalabelu and Claeskens, 2020; Kumar et al., 2020; Hosseini and Lee, 2016).
Sample Size   Method                  CE             PCEE           Balance
n = 300       GM-B.I.                 0.464 (0.015)  0.773 (0.010)  0.143 (0.005)
              GM-B.II.                0.426 (0.015)  0.791 (0.010)  0.152 (0.005)
              GM-B.FI.                0.197 (0.005)  0.773 (0.010)  0.301 (0.005)
              GM-B.FII. (FCONCORD)    0.060 (0.005)  0.835 (0.010)  0.516 (0.010)
n = 450       GM-B.I.                 0.423 (0.015)  0.813 (0.010)  0.187 (0.009)
              GM-B.II.                0.418 (0.011)  0.863 (0.010)  0.207 (0.009)
              GM-B.FI.                0.163 (0.015)  0.813 (0.010)  0.335 (0.010)
              GM-B.FII. (FCONCORD)    0.011 (0.005)  0.891 (0.012)  0.520 (0.010)

Table 1: Simulation results of neighborhood selection-type GMs on the SBM network. The results are for p = 600, H = 3, and K = 2.
We repeated the procedure 10 times and report the averaged clustering errors, proportion of correctly estimated edges, and balance. Tables 1 and 2 are for the SBM with p = 300. As shown in Tables 1 and 2, GM-A.I., GM-A.II., GM-B.I., and GM-B.II. have the largest clustering errors and the smallest proportions of correctly estimated edges. GM-A.II. and GM-B.II. improve on the performance of GM-A.I. and GM-B.I. in the precision matrix estimation. However, they still incur a relatively large clustering error and a low balance, since they ignore the similarity across different community matrices and employ standard K-means clustering. In contrast, our FCONCORD and FGLASSO algorithms achieve the best clustering accuracy, estimation accuracy, and balance in both scenarios. This is due to our simultaneous clustering and estimation strategy, as well as the consideration of the fairness of precision matrices across clusters. This experiment shows that a satisfactory fair community detection algorithm is critical to achieving accurate estimates of heterogeneous and fair GMs, and, conversely, good estimation of GMs can also improve fair community detection performance. This explains the success of our simultaneous method in terms of both fair clustering and GM estimation.
Sample Size   Method                  CE             PCEE           Balance
n = 300       GM-A.I.                 0.431 (0.015)  0.731 (0.011)  0.139 (0.005)
              GM-A.II.                0.416 (0.015)  0.761 (0.011)  0.167 (0.005)
              GM-A.FI.                0.182 (0.006)  0.731 (0.011)  0.322 (0.010)
              GM-A.FII. (FGLASSO)     0.097 (0.006)  0.815 (0.050)  0.504 (0.010)
n = 450       GM-A.I.                 0.470 (0.011)  0.792 (0.010)  0.154 (0.005)
              GM-A.II.                0.406 (0.011)  0.803 (0.010)  0.187 (0.015)
              GM-A.FI.                0.163 (0.005)  0.792 (0.010)  0.366 (0.005)
              GM-A.FII. (FGLASSO)     0.050 (0.005)  0.863 (0.012)  0.513 (0.005)

Table 2: Simulation results of graphical Lasso-type algorithms on the SBM network. The results are for p = 600, H = 3, and K = 2.
Next, we consider a natural composition of the SBM and the Ising model called the Stochastic Ising Block Model (SIBM); see, e.g., Berthet et al. (2016); Ye (2020) for more details. In SIBM, we take an SBM similar to (26), where the $p$ vertices are divided into clusters and subgroups and the edges are connected independently with probabilities $\{\zeta_i\}_{i=1}^4$. Then, we use the graph $G$ generated by the SBM as the underlying graph of the Ising model and draw $n$ i.i.d. samples from it. The objective is to exactly recover the fair clusters in SIBM from the samples generated by the Ising model, without observing the graph $G$.

Table 3 reports the averaged clustering errors, the proportion of correctly estimated edges, and the balance for SIBM with $p = 100$. The standard GM-C.I. method has the largest clustering error and the lowest balance due to its ignorance of the network structure in the precision matrices. GM-C.FI. improves the clustering performance of GM-C.I. by using the method of (Ravikumar et al., 2010) for the precision matrix estimation and the robust community detection approach (Cai et al., 2015) for computing the partition matrix $\hat{Q}$. GM-C.FII. achieves the best balance and clustering performance due to its procedure of simultaneous fair clustering and heterogeneous GM estimation.
Sample Size   Method                  CE             PCEE           Balance
n = 200       GM-C.I.                 0.489 (0.011)  0.601 (0.010)  0.104 (0.005)
              GM-C.II.                0.461 (0.005)  0.651 (0.010)  0.107 (0.005)
              GM-C.FI.                0.219 (0.005)  0.601 (0.010)  0.212 (0.010)
              GM-C.FII. (FBN)         0.117 (0.009)  0.738 (0.005)  0.385 (0.010)
n = 400       GM-C.I.                 0.434 (0.010)  0.637 (0.010)  0.113 (0.005)
              GM-C.II.                0.461 (0.010)  0.681 (0.010)  0.107 (0.005)
              GM-C.FI.                0.219 (0.005)  0.637 (0.010)  0.382 (0.010)
              GM-C.FII. (FBN)         0.104 (0.005)  0.796 (0.005)  0.405 (0.010)

Table 3: Simulation results of binary neighborhood selection-type GMs on the SIBM network. The results are for p = 100, H = 2, and K = 2.
5 Real Data Application

Recommender systems (RS) model user-item interactions to provide personalized item recommendations that suit the user's taste. Broadly speaking, two types of methods are used in such systems: content-based and collaborative filtering. Content-based approaches model interactions through user and item covariates. Collaborative filtering (CF), on the other hand, refers to a set of techniques that model user-item interactions based on users' past responses.

A popular class of methods in RS is based on clustering users and/or items (Ungar and Foster, 1998; O'Connor and Herlocker, 1999; Sarwar et al., 2001; Schafer et al., 2007). Indeed, it is more natural to model the users and the items using clusters (communities), where each cluster includes a set of like-minded users or the subset of items that they are interested in. The overall procedure of this method, called cluster CF (CCF), contains two main steps. First, it finds clusters of users and/or items, where each cluster includes a group of like-minded users or a set of items that these users are particularly interested in. Second, within each cluster, it applies traditional CF methods to learn users' preferences over the items in that cluster. Despite the efficiency and scalability of these methods, in many human-centric applications, using CCF in its original form can result in unfavorable and even harmful clustering and prediction outcomes for some demographic groups in the data.

It is shown in Schafer et al. (2007); Mnih and Salakhutdinov (2008) that item-item similarities based on "who-rated-what" information are strongly correlated with how users explicitly rate items. Hence, using this information as user covariates helps improve predictions of explicit ratings. Further, one can derive an item graph whose edge weights represent movie similarities based on the global "who-rated-what" matrix (Kouki et al., 2015; Wang et al., 2015; Agarwal et al., 2011; Mazumder and Agarwal, 2011). Imposing sparsity on such a graph and finding its fair communities is attractive, since it is intuitive that an item is generally related to only a few other items. This can be achieved through our fair GMs. Such a graph gives a fair neighborhood structure that can also help better predict explicit ratings. In addition to providing useful information for predicting ratings, we note that who-rated-what information also makes it possible to study the fair relationships among items based on user ratings.

The goal of our analysis is to understand the balance and prediction accuracy of fair GMs on RS datasets, as well as the relationships among the items in these datasets. We compare the performance of our fair GMs implemented in the framework of standard CCF and its fair K-means variant. In particular, we consider the following algorithms:

FGLASSO (FCONCORD)+CF: A two-stage approach for which we first use FGLASSO (FCONCORD) to obtain the fair clusters and then apply traditional CF to learn users' preferences over the items within each cluster. We set $\rho_1 = 1$, $\rho_2 = 0.05$, $\gamma = 0.01$, and $\epsilon = 10^{-3}$ in our implementations.

CCF (Fair CCF): A two-stage approach for which we first use K-means (fair K-means, Chierichetti et al. (2017)) clustering to obtain the clusters and then apply CF to learn users' preferences within each cluster (Ungar and Foster, 1998).
5.1 MovieLens Data

We use the MovieLens 10K dataset.² Following previous works (Koren, 2009; Kamishima et al., 2012; Chen et al., 2020), we use the year of the movie as a sensitive attribute and consider movies released before 1991 as old movies; more recent ones are considered new movies. Koren (2009) showed that older movies have a tendency to be rated higher, perhaps because only masterpieces have survived. When adopting year as a sensitive attribute, we show that our fair graph-based RS enhances neutrality with respect to this masterpiece bias. The clustering balance and root mean squared error (RMSE) are used to evaluate the different modeling methods on this dataset. Since reducing RMSE is the goal, statistical models assume the response (ratings) to be Gaussian for this data (Kouki et al., 2015; Wang et al., 2015; Agarwal et al., 2011).

²http://www.grouplens.org
[Figure 1 about here.]

Figure 1: RMSE (left) and Balance (right) of standard CCF, Fair CCF, FGLASSO+CF, and FCONCORD+CF on MovieLens 10K data, plotted against the number of communities.
Experimental results are shown in Figure 1. As expected, the baseline with no notion of fairness, CCF, attains the best overall RMSE, with our two approaches (FGLASSO+CF and FCONCORD+CF) providing performance fairly close to CCF. Figure 1 (right) demonstrates that, compared to fair CCF, FGLASSO+CF and FCONCORD+CF significantly improve the clustering balance. Hence, our fair graph-based RSs successfully enhance neutrality without seriously sacrificing prediction accuracy.

Fair GMs also provide information for studying the relationships among items based on user ratings. To illustrate this, the movie pairs with the top-5 absolute values of partial correlations are shown in Table 4. If we look for the movies most highly related to a specific movie in the precision matrix, we find that FCONCORD enhances the balance by assigning higher correlations to more recent movies such as "The Wrong Trousers" (1993) and "A Close Shave" (1995).

GM-B.II.
The pair of movies                                                                                         Partial correlation
The Godfather (1972) & The Godfather: Part II (1974)                                                       0.592
Grumpy Old Men (1993) & Grumpier Old Men (1995)                                                            0.514
Patriot Games (1992) & Clear and Present Danger (1994)                                                     0.484
The Wrong Trousers (1993) & A Close Shave (1995)                                                           0.448
Toy Story (1995) & Toy Story 2 (1999)                                                                      0.431
Star Wars: Episode IV - A New Hope (1977) & Star Wars: Episode V - The Empire Strikes Back (1980)          0.415

GM-B.FII. (FCONCORD)
The pair of movies                                                                                         Partial correlation
The Godfather (1972) & The Godfather: Part II (1974)                                                       0.534
Grumpy Old Men (1993) & Grumpier Old Men (1995)                                                            0.520
Austin Powers: International Man of Mystery (1997) & Austin Powers: The Spy Who Shagged Me (1999)          0.491
Toy Story (1995) & Toy Story 2 (1999)                                                                      0.475
Patriot Games (1992) & Clear and Present Danger (1994)                                                     0.472
The Wrong Trousers (1993) & A Close Shave (1995)                                                           0.453

Table 4: Pairs of movies with the top-5 absolute values of partial correlations in the precision matrices from GM-B.II. and GM-B.FII. (FCONCORD).
In addition, the estimated communities for two sub-graphs of movies are shown in Figure 2. In both networks, we can see that the estimated communities mainly consist of mass-marketed commercial movies, dominated by action films. Note that these movies are usually characterized by high production budgets, state-of-the-art visual effects, and famous directors and actors. Examples in these communities include "The Godfather" (1972), "Terminator 2" (1991), "Return of the Jedi" (1983), "Raiders of the Lost Ark" (1981), etc. As expected, movies within the same series are most strongly associated. Figure 2 (right) shows that FCONCORD enhances neutrality with respect to the old-movies bias by replacing old movies with new ones such as "Jurassic Park" (1993), "The Wrong Trousers" (1993), and "A Close Shave" (1995).
[Figure 2 about here.]

Figure 2: Subgraphs of the precision matrices estimated by GM-B.II. (left) and GM-B.FII. (right). Nodes represent movies labeled by their titles. Circle markers denote movies within a single community in each subgraph, and square markers denote isolated movies. Blue nodes are new movies and red nodes are old movies within each community. The width of a link is proportional to the magnitude of the corresponding partial correlation. GM-B.FII. (FCONCORD) enhances neutrality with respect to year bias by replacing old movies within each community with new ones.

5.2 Music Data

Music RSs are designed to give personalized recommendations of songs, playlists, or artists to a user, thereby reflecting and further complementing individual users' specific music preferences. Although accuracy metrics have been widely applied to evaluate recommendations in the music RS literature, evaluating a user's music utility from other impact-oriented perspectives, including their potential for discrimination, is still a novel evaluation practice (Epps-Darling et al., 2020; Chen et al., 2020; Shakespeare et al., 2020). Next, we center our attention on artists' gender bias, for which we want to estimate whether standard music RSs may exacerbate its impact.
To illustrate the impact of artists’ gender bias in RSs, we use the freely available binary LFM-360K music dataset.3 The LFM-360K consists of the listening histories of approximately 360,000 Last.fm users, collected during Fall 2008. We generate recommendations for a sample of all users for which gender can be identified; due to computational constraints, we limit this sample to a randomly chosen 10% of all male and female users in the whole dataset. Let $\mathcal{U}$ be the set of $n$ users, $\mathcal{I}$ the set of $p$ items, and $Y$ the $n \times p$ input matrix, where $y_{ui} = 1$ if user $u$ has selected item $i$, and zero otherwise.

3 http://www.last.fm
Figure 3: Input preference ratio (PR) distributions for the LFM-360K dataset.
Figure 4: RMSE (left) and Balance (right) of standard CCF, Fair CCF, and FBN+CF on LFM-360K music data, as a function of the number of communities (10–40).
Given the matrix $Y$, the input preference ratio (PR) of user group $\mathcal{D}$ on item category $\mathcal{C}$ is the fraction of the items liked by group $\mathcal{D}$ that fall in category $\mathcal{C}$, defined as follows (Shakespeare et al., 2020):
$$\mathrm{PR}(\mathcal{D}, \mathcal{C}) := \frac{\sum_{u\in\mathcal{D}}\sum_{i\in\mathcal{C}} y_{ui}}{\sum_{u\in\mathcal{D}}\sum_{i\in\mathcal{I}} y_{ui}}. \tag{27}$$
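A direct implementation of (27) is sketched below; the layout (a users-by-items binary matrix with boolean masks selecting the group $\mathcal{D}$ and the category $\mathcal{C}$) is an assumption of the sketch. Per-user PR distributions, as in Figure 3, follow by applying the same ratio row by row.

```python
import numpy as np

def preference_ratio(Y, group_mask, category_mask):
    """Input preference ratio PR(D, C) as in (27): among all items liked by
    the user group D, the fraction that falls in the item category C.

    Y: (n, p) binary matrix, Y[u, i] = 1 if user u selected item i.
    group_mask: length-n boolean mask selecting the user group D.
    category_mask: length-p boolean mask selecting the item category C.
    """
    liked_in_C = Y[group_mask][:, category_mask].sum()
    liked_total = Y[group_mask].sum()
    return liked_in_C / liked_total
```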
Figure 3 shows the distributions of users’ input PR toward the male and female artist groups: only around 20% of users have a PR toward male artists lower than 0.8, whereas about 80% of users have a PR toward female artists lower than 0.2. This suggests that commonly deployed state-of-the-art CF algorithms may act to further increase or decrease artist gender bias in user–artist RSs.
Next, we study the balance and prediction accuracy of fair GMs on music RSs. Figure 4 indicates that, among the fairness-aware methods, the proposed FBN+CF has the best performance in terms of both RMSE and balance. As expected, the baseline with no notion of fairness (CCF) attains the best overall accuracy. Of the two fairness-aware approaches, the fair K-means based approach (Fair CCF) performs considerably below FBN+CF. This suggests that recommendation quality can be preserved, but leaves open the question of whether fairness can be improved, so we turn to the impact of the three approaches on fairness. Figure 4 (right) presents the balance: the fairness-aware approaches (Fair CCF and FBN+CF) improve the balance substantially in comparison with standard CCF. For RMSE, FBN+CF achieves a much smaller ratings difference than Fair CCF, indicating that we can induce aggregate statistics that are fair between the two sides of the sensitive attribute (male vs. female).
6 Conclusion
In this work we developed a novel approach to learning fair GMs with community structure. Our
goal is to motivate a new line of work for fair community learning in GMs that can begin to alleviate
fairness concerns in this important subtopic within unsupervised learning. Our optimization approach
used the demographic parity definition of fairness, but the framework is easily extended to other
definitions of fairness. We established statistical consistency of the proposed method for both a
Gaussian GM and an Ising model, proving that our method can recover the graphs and their fair
communities with high probability.
We applied the proposed framework to the tasks of estimating a Gaussian graphical model and a
binary network. The proposed framework can also be applied to other types of graphical models,
such as the Poisson graphical model (Allen and Liu,2012) or the exponential family graphical model
(Yang et al.,2012).
Acknowledgment
D. Ataee Tarzanagh and L. Balzano were supported by NSF BIGDATA award #1838179, ARO YIP
award W911NF1910027, and NSF CAREER award CCF-1845076. A.O. Hero was supported by US
Army Research Office grants #W911NF-15-1-0479 and #W911NF-19-1-0269.
References
Agarwal, D., L. Zhang, and R. Mazumder (2011). Modeling item-item similarities for personalized
recommendations on Yahoo! front page. The Annals of Applied Statistics, 1839–1875.
Allen, G. I. and Z. Liu (2012). A log-linear graphical model for inferring genetic networks from
high-throughput sequencing data. In 2012 IEEE International Conference on Bioinformatics and
Biomedicine, pp. 1–6. IEEE.
Aloise, D., A. Deshpande, P. Hansen, and P. Popat (2009). NP-hardness of Euclidean sum-of-squares clustering. Machine Learning 75 (2), 245–248.
Amini, A. A., E. Levina, et al. (2018). On semidefinite relaxations for the block model. The Annals
of Statistics 46 (1), 149–179.
Banerjee, O., L. El Ghaoui, and A. d’Aspremont (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. The Journal of Machine Learning Research 9, 485–516.
Barocas, S., M. Hardt, and A. Narayanan (2019). Fairness and Machine Learning. fairmlbook.org.
http://www.fairmlbook.org.
Barzilai, J. and J. M. Borwein (1988). Two-point step size gradient methods. IMA Journal of Numerical Analysis 8 (1), 141–148.
Berthet, Q., P. Rigollet, and P. Srivastava (2016). Exact recovery in the Ising block model. arXiv preprint arXiv:1612.03880.
Boyd, S., N. Parikh, E. Chu, B. Peleato, and J. Eckstein (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3 (1), 1–122.
Burke, R., A. Felfernig, and M. H. Göker (2011). Recommender systems: An overview. AI Magazine 32 (3), 13–18.
Cai, T. and W. Liu (2011). Adaptive thresholding for sparse covariance matrix estimation. Journal
of the American Statistical Association 106 (494), 672–684.
Cai, T. T., X. Li, et al. (2015). Robust and computationally feasible community detection in the
presence of arbitrary outlier nodes. Annals of Statistics 43 (3), 1027–1059.
Cardoso, J. V. d. M., J. Ying, and D. P. Palomar (2020). Algorithms for learning graphs in financial
markets. arXiv preprint arXiv:2012.15410 .
Caton, S. and C. Haas (2020). Fairness in machine learning: A survey. arXiv preprint
arXiv:2010.04053 .
Celis, L. E., D. Straszak, and N. K. Vishnoi (2017). Ranking with fairness constraints. arXiv preprint
arXiv:1704.06840 .
Chen, J., H. Dong, X. Wang, F. Feng, M. Wang, and X. He (2020). Bias and debias in recommender
system: A survey and future directions. arXiv preprint arXiv:2010.03240.
Chierichetti, F., R. Kumar, S. Lattanzi, and S. Vassilvitskii (2017). Fair clustering through fairlets.
In Advances in Neural Information Processing Systems, pp. 5029–5037.
Chouldechova, A. and A. Roth (2018). The frontiers of fairness in machine learning. arXiv preprint
arXiv:1810.08810 .
Danaher, P., P. Wang, and D. M. Witten (2014). The joint graphical lasso for inverse covariance
estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical
Methodology) 76 (2), 373–397.
Das, J., P. Mukherjee, S. Majumder, and P. Gupta (2014). Clustering-based recommender system
using principles of voting theory. In 2014 International Conference on Contemporary Computing and Informatics (IC3I), pp. 230–235. IEEE.
Donini, M., L. Oneto, S. Ben-David, J. Shawe-Taylor, and M. Pontil (2018). Empirical risk
minimization under fairness constraints. arXiv preprint arXiv:1802.08626 .
Eisenach, C., F. Bunea, Y. Ning, and C. Dinicu (2020). High-dimensional inference for cluster-based
graphical models. Journal of Machine Learning Research 21 (53).
Epps-Darling, A., R. T. Bouyer, and H. Cramer (2020). Artist gender representation in music
streaming. In Proceedings of the 21st International Society for Music Information Retrieval Conference (ISMIR 2020), Montréal, Canada, pp. 248–254.
Feldman, M., S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian (2015). Certifying
and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 259–268.
Friedman, J., T. Hastie, and R. Tibshirani (2008). Sparse inverse covariance estimation with the
graphical lasso. Biostatistics 9 (3), 432–441.
Friedman, J., T. Hastie, and R. Tibshirani (2010). Applications of the lasso and grouped lasso to
the estimation of sparse graphical models. Technical report.
Gan, L., X. Yang, N. N. Narisetty, and F. Liang (2019). Bayesian joint estimation of multiple
graphical models. Advances in Neural Information Processing Systems 32.
Gheche, M. E. and P. Frossard (2020). Multilayer clustered graph learning. arXiv preprint
arXiv:2010.15456 .
Glassman, E. L., R. Singh, and R. C. Miller (2014). Feature engineering for clustering student
solutions. In Proceedings of the First ACM Conference on Learning @ Scale, pp. 171–172.
Grant, M. and S. Boyd (2014). Cvx: Matlab software for disciplined convex programming, version
2.1.
Guo, J., J. Cheng, E. Levina, G. Michailidis, and J. Zhu (2015). Estimating heterogeneous graphical
models for discrete data with an application to roll call voting. The Annals of Applied Statistics 9 (2),
821.
Guo, J., E. Levina, G. Michailidis, and J. Zhu (2010). Joint structure estimation for categorical Markov networks. Unpublished manuscript.
Guo, J., E. Levina, G. Michailidis, and J. Zhu (2011a). Asymptotic properties of the joint
neighborhood selection method for estimating categorical Markov networks. arXiv preprint math.PR/0000000.
Guo, J., E. Levina, G. Michailidis, and J. Zhu (2011b). Joint estimation of multiple graphical models.
Biometrika 98 (1), 1–15.
Hao, B., W. W. Sun, Y. Liu, and G. Cheng (2018). Simultaneous clustering and estimation of
heterogeneous graphical models. Journal of Machine Learning Research.
Hassner, M. and J. Sklansky (1981). The use of Markov random fields as models of texture. In
Image Modeling, pp. 185–198. Elsevier.
Holland, P. W., K. B. Laskey, and S. Leinhardt (1983). Stochastic blockmodels: First steps. Social
Networks 5 (2), 109–137.
Hosseini, M. J. and S.-I. Lee (2016). Learning sparse Gaussian graphical models with overlapping
blocks. In Advances in Neural Information Processing Systems, Volume 30, pp. 3801–3809.
Ising, E. (1925). Beitrag zur theorie des ferromagnetismus. Zeitschrift für Physik A Hadrons and
Nuclei 31 (1), 253–258.
Kamishima, T., S. Akaho, H. Asoh, and J. Sakuma (2012). Enhancement of the neutrality in
recommendation. Citeseer.
Karoui, N. E. (2008). Operator norm consistent estimation of large-dimensional sparse covariance
matrices. The Annals of Statistics, 2717–2756.
Khare, K., S.-Y. Oh, and B. Rajaratnam (2015). A convex pseudolikelihood framework for high
dimensional partial correlation estimation with convergence guarantees. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 803–825.
Kleindessner, M., S. Samadi, P. Awasthi, and J. Morgenstern (2019). Guarantees for spectral
clustering with fairness constraints. In International Conference on Machine Learning, pp. 3458–
3467. PMLR.
Koren, Y. (2009). Collaborative filtering with temporal dynamics. In Proceedings of the 15th ACM
SIGKDD international conference on Knowledge discovery and data mining, pp. 447–456.
Kouki, P., S. Fakhraei, J. Foulds, M. Eirinaki, and L. Getoor (2015). Hyper: A flexible and
extensible probabilistic framework for hybrid recommender systems. In Proceedings of the 9th
ACM Conference on Recommender Systems, pp. 99–106.
Kumar, A., Y. Sabharwal, and S. Sen (2004). A simple linear time (1 + ε)-approximation algorithm for k-means clustering in any dimensions. In 45th Annual IEEE Symposium on Foundations of Computer Science, pp. 454–462. IEEE.
Kumar, S., J. Ying, J. V. de Miranda Cardoso, and D. P. Palomar (2019). Structured graph learning via Laplacian spectral constraints. In Advances in Neural Information Processing Systems, pp. 11651–11663.
Kumar, S., J. Ying, J. V. de Miranda Cardoso, and D. P. Palomar (2020). A unified framework for
structured graph learning via spectral constraints. Journal of Machine Learning Research 21 (22),
1–60.
Laferté, J.-M., P. Pérez, and F. Heitz (2000). Discrete Markov image modeling and inference on the quadtree. IEEE Transactions on Image Processing 9 (3), 390–404.
Lafit, G., F. Tuerlinckx, I. Myin-Germeys, and E. Ceulemans (2019). A partial correlation screening
approach for controlling the false positive rate in sparse Gaussian graphical models. Scientific Reports 9 (1), 1–24.
Lee, S.-I., V. Ganapathi, and D. Koller (2007). Efficient structure learning of Markov networks using
$\ell_1$-regularization. In Advances in Neural Information Processing Systems, pp. 817–824.
Lee, W. and Y. Liu (2015). Joint estimation of multiple precision matrices with common structures.
The Journal of Machine Learning Research 16 (1), 1035–1062.
Lei, J., A. Rinaldo, et al. (2015). Consistency of spectral clustering in stochastic block models.
Annals of Statistics 43 (1), 215–237.
Li, X., Y. Chen, and J. Xu (2021). Convex relaxation methods for community detection. Statistical
Science 36 (1), 2–15.
Liljeros, F., C. R. Edling, L. A. N. Amaral, H. E. Stanley, and Y. Åberg (2001). The web of human
sexual contacts. Nature 411 (6840), 907–908.
Ma, J. and G. Michailidis (2016). Joint structural estimation of multiple graphical models. The
Journal of Machine Learning Research 17 (1), 5777–5824.
Manning, C. and H. Schutze (1999). Foundations of statistical natural language processing. MIT
press.
Marlin, B. M. and K. P. Murphy (2009). Sparse Gaussian graphical models with unknown block
structure. In Proceedings of the 26th Annual International Conference on Machine Learning, pp.
705–712.
Mazumder, R. and D. K. Agarwal (2011). A flexible, scalable and efficient algorithmic framework
for primal graphical lasso. arXiv preprint arXiv:1110.5508.
Meinshausen, N., P. Bühlmann, et al. (2006). High-dimensional graphs and variable selection with
the lasso. Annals of Statistics 34 (3), 1436–1462.
Mnih, A. and R. R. Salakhutdinov (2008). Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, pp. 1257–1264.
Oneto, L. and S. Chiappa (2020). Fairness in machine learning. In Recent Trends in Learning From
Data, pp. 155–196. Springer.
O’Connor, M. and J. Herlocker (1999). Clustering items for collaborative filtering. In Proceedings of
the ACM SIGIR workshop on recommender systems, Volume 128. Citeseer.
Parimbelli, E., S. Marini, L. Sacchi, and R. Bellazzi (2018). Patient similarity for precision medicine:
A systematic review. Journal of Biomedical Informatics 83, 87–96.
Peng, J., P. Wang, N. Zhou, and J. Zhu (2009). Partial correlation estimation by joint sparse
regression models. Journal of the American Statistical Association 104 (486), 735–746.
Pham, M. C., Y. Cao, R. Klamma, and M. Jarke (2011). A clustering approach for collaborative
filtering recommendation using social network analysis. J. UCS 17 (4), 583–604.
Pircalabelu, E. and G. Claeskens (2020). Community-based group graphical lasso. Journal of
Machine Learning Research 21 (64), 1–32.
Ravikumar, P., M. J. Wainwright, and J. D. Lafferty (2010). High-dimensional Ising model selection using $\ell_1$-regularized logistic regression. The Annals of Statistics 38 (3), 1287–1319.
Robins, G., P. Pattison, Y. Kalish, and D. Lusher (2007). An introduction to exponential random graph (p*) models for social networks. Social Networks 29 (2), 173–191.
Rocha, G. V., P. Zhao, and B. Yu (2008). A path following algorithm for sparse pseudo-likelihood
inverse covariance estimation (splice). arXiv preprint arXiv:0807.3734.
Samadi, S., U. Tantipongpipat, J. Morgenstern, M. Singh, and S. Vempala (2018). The price of fair
pca: One extra dimension. arXiv preprint arXiv:1811.00103 .
Sarwar, B., G. Karypis, J. Konstan, and J. Riedl (2001). Item-based collaborative filtering recom-
mendation algorithms. In Proceedings of the 10th international conference on World Wide Web,
pp. 285–295.
Schafer, J. B., D. Frankowski, J. Herlocker, and S. Sen (2007). Collaborative filtering recommender
systems. In The adaptive web, pp. 291–324. Springer.
Shakespeare, D., L. Porcaro, E. Gómez, and C. Castillo (2020). Exploring artist gender bias in music
recommendation. arXiv preprint arXiv:2009.01715.
Song, Q., J. Ni, and G. Wang (2011). A fast clustering-based feature subset selection algorithm for
high-dimensional data. IEEE Transactions on Knowledge and Data Engineering 25 (1), 1–14.
Tan, K. M., P. London, K. Mohan, S.-I. Lee, M. Fazel, and D. M. Witten (2014). Learning graphical
models with hubs. Journal of Machine Learning Research 15 (1), 3297–3331.
Tan, P.-N., M. Steinbach, and V. Kumar (2013). Data mining cluster analysis: basic concepts and
algorithms. Introduction to data mining, 487–533.
Tantipongpipat, U., S. Samadi, M. Singh, J. H. Morgenstern, and S. S. Vempala (2019). Multi-criteria
dimensionality reduction with applications to fairness. Advances in neural information processing
systems (32).
Tarzanagh, D. A. and G. Michailidis (2018). Estimation of graphical models through structured
norm minimization. Journal of Machine Learning Research 18 (209), 1–48.
Ungar, L. H. and D. P. Foster (1998). Clustering methods for collaborative filtering. In AAAI
workshop on recommendation systems, Volume 1, pp. 114–129. Menlo Park, CA.
Wang, H., N. Wang, and D.-Y. Yeung (2015). Collaborative deep learning for recommender systems.
In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1235–1244.
Wang, P., D. L. Chao, and L. Hsu (2009). Learning networks from high dimensional binary data:
An application to genomic instability data. arXiv preprint arXiv:0908.3882 .
Wang, Y., W. Yin, and J. Zeng (2019). Global convergence of admm in nonconvex nonsmooth
optimization. Journal of Scientific Computing 78 (1), 29–63.
Xue, L., S. Ma, and H. Zou (2012). Positive-definite $\ell_1$-penalized estimation of large covariance matrices. Journal of the American Statistical Association 107 (500), 1480–1491.
Yang, E., P. Ravikumar, G. I. Allen, and Z. Liu (2012). Graphical models via generalized linear
models. In NIPS, Volume 25, pp. 1367–1375.
Ye, M. (2020). Exact recovery and sharp thresholds of stochastic Ising block model. arXiv preprint arXiv:2004.05944.
Yu, Y., T. Wang, and R. J. Samworth (2015). A useful variant of the Davis–Kahan theorem for
statisticians. Biometrika 102 (2), 315–323.
Yuan, M. and Y. Lin (2006). Model selection and estimation in regression with grouped variables.
Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68(1), 49–67.
Zafar, M. B., I. Valera, M. G. Rogriguez, and K. P. Gummadi (2017). Fairness constraints:
Mechanisms for fair classification. In Artificial Intelligence and Statistics, pp. 962–970. PMLR.
Zhao, P. and B. Yu (2006). On model selection consistency of lasso. The Journal of Machine
Learning Research 7, 2541–2563.
Supplementary Material to “Fair Structure Learning in
Heterogeneous Graphical Models”
Davoud Ataee Tarzanagh, Laura Balzano, and Alfred O. Hero
Appendix A.1 provides some preliminaries used in the proofs of the main theorems.
Appendix A.2 provides large sample properties of FCONCORD, i.e., the proof of Theorem 5.
Appendix A.3 provides large sample properties of FBN, i.e., the proof of Theorem 6.
Appendix A.4 gives the consistency of fair community labeling in graphical models.
Appendix A.5 provides the detailed derivation of the updates for Algorithm 1.
A.1 Preliminaries
Let the notation $F_n(\theta^d, \theta^o, q^d, q^o; Y)$ stand for the $F_n$ in (11). We introduce a restricted version of criterion (11) as follows:
$$\underset{\theta^o,\, q^o}{\text{minimize}}\;\; F_n(\bar\theta^d, \theta^o, \bar q^d, q^o; Y) + \rho_{1n}\|\theta^o\|_1, \quad \text{subj. to}\;\; \theta^o_{B^c} = 0. \tag{A.1}$$
We define a linear operator $\mathcal{A}: \mathbb{R}^{p(p-1)/2} \to \mathbb{R}^{p\times p}$, $w \mapsto \mathcal{A}w$, satisfying
$$[\mathcal{A}w]_{ij} = \begin{cases} w_{i+d_j}, & i > j,\\ [\mathcal{A}w]_{ji}, & i < j,\\ \sum_{i\ne j}[\mathcal{A}w]_{ij}, & i = j, \end{cases} \tag{A.2}$$
where $d_j = -j + \frac{j-1}{2}(2p-j)$.
An example of $\mathcal{A}w$ for a vector $w = [w_1, w_2, \cdots, w_6]^\top$ is given below:
$$\mathcal{A}w = \begin{bmatrix} 0 & w_1 & w_2 & w_3\\ w_1 & 0 & w_4 & w_5\\ w_2 & w_4 & 0 & w_6\\ w_3 & w_5 & w_6 & 0 \end{bmatrix}.$$
We derive the adjoint operator $\mathcal{A}^*$ of $\mathcal{A}$ by requiring $\mathcal{A}^*$ to satisfy $\langle \mathcal{A}w, z\rangle = \langle w, \mathcal{A}^*z\rangle$; see Kumar et al. (2020) for more details.
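For intuition, a minimal sketch of $\mathcal{A}$ and its adjoint is given below, restricted to what the worked example above pins down (the symmetric embedding of the $p(p-1)/2$ off-diagonal weights, taken here with zero diagonal as in the example, since the diagonal case of (A.2) is ambiguous in the source); the entry ordering follows the row-major upper triangle, and the adjoint follows directly from $\langle \mathcal{A}w, Z\rangle = \langle w, \mathcal{A}^*Z\rangle$.

```python
import numpy as np

def A_op(w, p):
    """Map w in R^{p(p-1)/2} to a symmetric p x p matrix whose strict
    upper (and lower) triangle is filled with the entries of w."""
    M = np.zeros((p, p))
    M[np.triu_indices(p, k=1)] = w
    return M + M.T  # zero diagonal, as in the worked example above

def A_adj(Z):
    """Adjoint A* satisfying <A w, Z> = <w, A* Z>: each coordinate w_k
    touches the two entries (i, j) and (j, i) of the matrix."""
    iu = np.triu_indices(Z.shape[0], k=1)
    return Z[iu] + Z.T[iu]
```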
Let $Q^o = \mathrm{diag}(\bar q^d) + \mathcal{A}q^o$ and $\Theta^o = \mathrm{diag}(\bar\theta^d) + \mathcal{A}\theta^o$. Since by our assumption the fairness tolerance is $0$, we obtain
$$\begin{aligned} \underset{\theta^o,\, q^o}{\text{minimize}}\;\; & L_n(\bar\theta^d, \theta^o; Y) + \rho_{1n}\|\theta^o\|_1 + \rho_{2n}\,\mathrm{trace}(\Theta^o Q^o \Theta^o),\\ \text{subj. to}\;\; & \theta^o_{B^c} = 0,\;\; Q^o \succeq 0,\;\; \mathcal{A}_1 Q^o = 0,\;\; 0 \preceq Q^o \preceq J_p. \end{aligned} \tag{A.3}$$
It is easy to see that $\mathrm{rank}(\mathcal{A}_1) = H - 1$. Let $N \in \mathbb{R}^{p\times(p-H+1)}$ be a matrix whose columns form an orthonormal basis of the nullspace of $\mathcal{A}_1$. We can substitute $Q^o = N R^o N^\top$ for $R^o \in \mathbb{R}^{(p-H+1)\times(p-H+1)}$, and then, using $N^\top N = I_{p-H+1}$, Problem (A.3) becomes
$$\underset{\theta^o,\, r^o}{\text{minimize}}\;\; F_n(\bar\theta^d, \theta^o, \bar r^d, r^o; Y) + \rho_{1n}\|\theta^o\|_1 \quad \text{subj. to}\;\; \theta^o_{B^c} = 0,\;\; Q^o \succeq 0,\;\; 0 \preceq Q^o \preceq J_p, \tag{A.4}$$
where $F_n(\bar\theta^d, \theta^o, \bar r^d, r^o; Y) = L_n(\bar\theta^d, \theta^o; Y) + \rho_{2n}\,\mathrm{trace}(\Theta^o N R^o N^\top \Theta^o)$.
Throughout, we use $\bar g^n$ and $\bar H^n$ to denote the gradient and the Hessian of $L_n(\bar\theta^d, \theta^o; Y)$. We also define the population gradient and Hessian as follows: for $1 \le i < j \le p$,
$$\bar g_{ij} := \mathbb{E}_{\bar\theta^o}\left[\frac{\partial L(\bar\theta^d, \theta^o; Y)}{\partial\theta^o_{ij}}\bigg|_{\theta^o = \bar\theta^o}\right],$$
and for $1 \le i < j \le p$ and $1 \le t < s \le p$,
$$\bar H_{ij,ts} := \mathbb{E}_{\bar\theta^o}\left[\frac{\partial^2 L(\bar\theta^d, \theta^o; Y)}{\partial\theta^o_{ij}\,\partial\theta^o_{ts}}\bigg|_{\theta^o = \bar\theta^o}\right].$$
A.2 Large Sample Properties of FCONCORD
We list some properties of the loss function.

Lemma 8. (Peng et al., 2009) The following hold for the loss function:
(L1) There exist constants $0 < M_1 \le M_2 < \infty$ such that
$$M_1(\bar\theta^o) \le \Lambda_{\min}(\bar H) \le \Lambda_{\max}(\bar H) \le M_2(\bar\theta^o).$$
(L2) There exists a constant $M_3(\bar\theta^o) < \infty$ such that for all $1 \le i < j \le p$, $\bar H_{ij,ij} \le M_3(\bar\theta^o)$.
(L3) There exist constants $M_4(\bar\theta^o)$ and $M_5(\bar\theta^o) < \infty$ such that for any $1 \le i < j \le p$,
$$\mathrm{Var}_{\bar\theta^o}(\bar g^n_{ij}) \le M_4(\bar\theta^o), \qquad \mathrm{Var}_{\bar\theta^o}(\bar H^n_{ij,ij}) \le M_5(\bar\theta^o).$$
(L4) There exists a constant $0 < M_6(\bar\theta^o) < \infty$ such that for all $(i,j) \in B$,
$$\bar H_{ij,ij} - \bar H_{ij,B_{ij}}\,\bar H^{-1}_{B_{ij},B_{ij}}\,\bar H_{B_{ij},ij} \ge M_6(\bar\theta^o), \quad \text{where } B_{ij} := B \setminus \{(i,j)\}.$$
(L5) There exists a constant $M_7(\bar\theta^o) < \infty$ such that for any $(i,j) \in B^c$,
$$\big\|\bar H_{ij,B}\,\bar H^{-1}_{B,B}\big\| \le M_7(\bar\theta^o). \tag{A.5}$$
Lemma 9. (Peng et al., 2009) Suppose Assumptions (A1)–(A2) hold. Then for any $\eta > 0$ there exist constants $c_0$–$c_3$ such that for any $v \in \mathbb{R}^q$ the following events hold with probability at least $1 - O(\exp(-\eta\log p))$ for sufficiently large $n$:
(L1) $\|\bar g^n_B\| \le c_0\sqrt{q\log p/n}$.
(L2) $|v^\top\bar g^n_B| \le c_1\|v\|\sqrt{q\log p/n}$.
(L3) $|v^\top(\bar H^n_{B,B} - \bar H_{B,B})v| \le c_2\|v\|^2\sqrt{q\log p/n}$.
(L4) $\|(\bar H^n_{B,B} - \bar H_{B,B})v\| \le c_3\|v\|\sqrt{q\log p/n}$.
Lemma 10. Suppose Assumptions (A1)–(A4) are satisfied. Assume further that $\rho_{1n} = O(\sqrt{\log p/n})$, $n > O(q\log p)$, $\rho_{2n} = O(\sqrt{\log(p-H+1)/n})$, and the fairness tolerance is $0$. Then, there exist finite constants $C_1(\bar\theta^o)$ and $D_1(\bar q^o)$ such that for any $\eta > 0$, there exists a (local) minimizer of the restricted problem (A.1) within the disc
$$\Big\{(\hat\theta^o, \hat q^o) : \max\big(\|\hat\theta^o_B - \bar\theta^o_B\|,\ \|\hat q^o - \bar q^o\|\big) \le \max\big(C_1(\bar\theta^o)\rho_{1n}\sqrt{q},\ D_1(\bar q^o)\rho_{2n}\sqrt{\Psi(p,H,K)}\big)\Big\} \tag{A.6}$$
with probability at least $1 - O(\exp(-\eta\log p))$ for sufficiently large $n$.
Proof. Let $\mu_{1n} = \rho_{1n}\sqrt{q}$ with $q = |B|$, and let $\mu_{2n} = \rho_{2n}\sqrt{\Psi(p,H,K)}$. Let $C_1 > 0$ and $w \in \mathbb{R}^{p(p-1)/2}$ be such that $w_{B^c} = 0$ and $\|w\|_2 = C_1$. Further, let $z \in \mathbb{R}^{(p-H+1)(p-H)/2}$ be an arbitrary vector with finite entries and $\|z\| = D_1$. Write
$$F_n(\bar\theta^d, \bar\theta^o + \mu_{1n}w, \bar r^d, \bar r^o + \mu_{2n}z; Y) - F_n(\bar\theta^d, \bar\theta^o, \bar r^d, \bar r^o; Y) = I_1 + I_2 + I_3,$$
where
$$\begin{aligned} I_1 &:= L_n(\bar\theta^d, \bar\theta^o + \mu_{1n}w; Y) - L_n(\bar\theta^d, \bar\theta^o; Y),\\ I_2 &:= \rho_{1n}\big(\|\bar\theta^o + \mu_{1n}w\|_1 - \|\bar\theta^o\|_1\big),\\ I_3 &:= \rho_{2n}\,\mathrm{trace}\Big((\bar\Theta + \mu_{1n}\mathcal{A}w)^2\big(S + N(\bar R + \mu_{2n}\mathcal{A}z)N^\top\big) - \bar\Theta^2\bar Q\Big). \end{aligned}$$
Following Peng et al. (2009), we first provide bounds for $I_1$ and $I_2$. For the term $I_1$, it follows from Lemma 9 that
$$w_B^\top\bar H_{B,B}w_B \ge \Lambda_{\min}(\bar H_{B,B})\|w_B\|_2^2 \ge M_1C_1^2, \tag{A.7}$$
which together with Lemma 9 gives
$$\begin{aligned} I_1 &= \mu_{1n}w_B^\top\bar g^n_B + \tfrac{1}{2}\mu_{1n}^2\, w_B^\top\bar H^n_{B,B}w_B\\ &= \mu_{1n}w_B^\top\bar g^n_B + \tfrac{1}{2}\mu_{1n}^2\, w_B^\top(\bar H^n_{B,B} - \bar H_{B,B})w_B + \tfrac{1}{2}\mu_{1n}^2\, w_B^\top\bar H_{B,B}w_B\\ &\ge \tfrac{1}{2}\mu_{1n}^2 M_1C_1^2 - \mu_{1n}c_1\|w_B\|_2\sqrt{\frac{q\log p}{n}} - \tfrac{1}{2}\mu_{1n}^2 c_2\|w_B\|_2^2\sqrt{\frac{q\log p}{n}}. \end{aligned}$$
For sufficiently large $n$, by the assumptions that $\rho_{1n}\sqrt{n/\log p} \to \infty$ if $p \to \infty$ and that $\sqrt{q\log p/n} = o(1)$, the second term in the last line above is $o(\mu_{1n}^2)$, and so is the last term. Thus, for sufficiently large $n$, we have
$$I_1 \ge \tfrac{1}{2}\mu_{1n}^2 M_1C_1^2. \tag{A.8}$$
For the term $I_2$, by the Cauchy–Schwarz and triangle inequalities, we have
$$|I_2| = \rho_{1n}\big|\|\bar\theta^o + \mu_{1n}w\|_1 - \|\bar\theta^o\|_1\big| \le \rho_{1n}\mu_{1n}\|w\|_1 \le C_1\rho_{1n}\mu_{1n}\sqrt{q} = C_1\mu_{1n}^2. \tag{A.9}$$
Next, we provide an upper bound for $I_3$. Note that
$$\begin{aligned} I_3 &= \rho_{2n}\,\mathrm{trace}\big((\bar\Theta + \mu_{1n}\mathcal{A}w)^2 S\big) + \rho_{2n}\,\mathrm{trace}\big((\bar\Theta + \mu_{1n}\mathcal{A}w)^2 N(\bar R + \mu_{2n}\mathcal{A}z)N^\top\big) - \rho_{2n}\,\mathrm{trace}(\bar\Theta^2\bar Q)\\ &= \rho_{2n}\,\mathrm{trace}(\bar\Theta^2\bar S) + \rho_{2n}\,(I_{3,0} + I_{3,1} + I_{3,2} + I_{3,3}), \end{aligned}$$
where, using Assumption (A2),
$$\begin{aligned} I_{3,0} &:= \mathrm{trace}\big((\bar\Theta + \mu_{1n}\mathcal{A}w)^2(S - \bar S)\big) \le O(\|S - \bar S\|),\\ I_{3,1} &:= \mu_{1n}^2\,\mathrm{trace}\big((\mathcal{A}w)^2 N\bar RN^\top\big) + \mu_{1n}^2\mu_{2n}\,\mathrm{trace}\big((\mathcal{A}w)^2 N\mathcal{A}zN^\top\big) \le \tau_3^2C_1^2\mu_{1n}^2 + C_1^2D_1\mu_{1n}^2\mu_{2n},\\ I_{3,2} &:= 2\mu_{1n}\,\mathrm{trace}\big(\bar\Theta\,\mathcal{A}w\,N\bar RN^\top\big) + 2\mu_{1n}\mu_{2n}\,\mathrm{trace}\big(\bar\Theta\,\mathcal{A}w\,N\mathcal{A}zN^\top\big) \le 2\tau_2\tau_3C_1\mu_{1n} + 2\mu_{1n}\mu_{2n}\tau_2C_1D_1,\\ I_{3,3} &:= \mathrm{trace}\big(\bar\Theta^2\big(N(\bar R + \mu_{2n}\mathcal{A}z)N^\top - \bar Q\big)\big) \le \tau_2^2C_1^2\mu_{2n}. \end{aligned}$$
Hence, for sufficiently large $n$, we have
$$I_3 \le \tau_1\rho_{2n} - o(\rho_{2n}). \tag{A.10}$$
Now, by combining (A.8)–(A.10), for sufficiently large $n$, we obtain
$$F_n(\bar\theta^d, \bar\theta^o + \mu_{1n}w, \bar r^d, \bar r^o + \mu_{2n}z; Y) - F_n(\bar\theta^d, \bar\theta^o, \bar r^d, \bar r^o; Y) \ge \tfrac{1}{2}M_1\mu_{1n}^2C_1^2 - \mu_{1n}^2C_1 + \tau_1\rho_{2n} \ge 0.$$
Here, the first inequality follows from $\rho_{1n} = \sqrt{\log p/n}$, and the last inequality follows by setting $C_1 > 2/M_1$.
Now, let $S_{w,z} = \{(w,z) : w_{B^c} = 0,\ \|w\| = C_1,\ \|z\| = D_1\}$. Then, for $n$ sufficiently large, the following holds:
$$\inf_{(w,z)\in S_{w,z}} F_n(\bar\theta^d, \bar\theta^o + \mu_{1n}w, \bar r^d, \bar r^o + \mu_{2n}z; Y) > F_n(\bar\theta^d, \bar\theta^o, \bar r^d, \bar r^o; Y),$$
with probability at least $1 - O(\exp(-\eta\log p))$.
Thus, any solution to the problem defined in (A.1) is within the disc (A.6) with probability at least $1 - O(\exp(-\eta\log p))$. Finally, since $\bar Q = N\bar RN^\top$ and $N^\top N = I_{p-H+1}$, we have $\|\hat q^o - \bar q^o\| = \|\hat r^o - \bar r^o\|$. This completes the proof.
Lemma 11. Assume the conditions of Lemma 12 hold and $\rho_{2n} < \delta\rho_{1n}/(\tau_2\tau_3)$. Then, there exists a constant $C_2 > 0$ such that for any $\eta > 0$ and sufficiently large $n$, the following event holds with probability at least $1 - O(\exp(-\eta\log p))$: for any $\theta^o$ satisfying
$$\|\theta^o - \bar\theta^o\| \ge C_2\sqrt{q}\,\rho_{1n}, \qquad \theta^o_{B^c} = 0, \tag{A.11}$$
we have $\|\nabla_{\theta^o}F_n(\bar\theta^d, \hat\theta^o_B, \bar r^d, \hat r^o; Y)\| > \sqrt{q}\,\rho_{1n}$.
Proof. The proof follows the idea of (Peng et al., 2009, Lemma S-4). For $\theta^o = \hat\theta^o$ satisfying (A.11), we have $\hat\theta^o = \bar\theta^o + \mu_{1n}w$ with $w_{B^c} = 0$ and $\|w\| \ge C_2$. We have
$$\begin{aligned} \nabla_{\theta^o}F_n(\bar\theta^d, \hat\theta^o_B, \bar r^d, \hat r^o; Y) &= (1+\rho_{2n})\hat g^n_B + \rho_{2n}\big[(N\hat R^oN^\top)\mathcal{A}^*\mathcal{A}\hat\theta^o\big]_B\\ &= (1+\rho_{2n})\bar g^n_B + (1+\rho_{2n})\mu_{1n}\bar H^n_{B,B}w_B + \rho_{2n}\big[(N\hat R^oN^\top)\mathcal{A}^*\mathcal{A}\bar\theta^o\big]_B + \rho_{2n}\mu_{1n}\big[(N\hat R^oN^\top)\mathcal{A}^*\mathcal{A}w\big]_B, \end{aligned}$$
where the second equality follows from a Taylor expansion of $\nabla_{\theta^o}L_n(\bar\theta^d, \hat\theta^o_B; Y)$.
Let
$$\hat A := N\hat R^oN^\top\mathcal{A}\mathcal{A}^\top\hat\theta^o \quad\text{and}\quad \bar A := N\bar R^oN^\top\mathcal{A}\mathcal{A}^\top\bar\theta^o. \tag{A.12}$$
Then, we have
$$\rho_{2n}\|\hat A_B\| \le \rho_{2n}\|N\bar R^oN^\top\|\,\|\mathcal{A}\mathcal{A}^\top\bar\theta^o\| + \rho_{2n}\|\bar A - \hat A\| \le 2\tau_2\tau_3\sqrt{q}\,\rho_{2n}, \tag{A.13}$$
where the last inequality follows since
$$\|\bar A\| \le \Lambda_{\max}(N\bar R^oN^\top)\,\|\mathcal{A}\mathcal{A}^\top\bar\theta^o\| \le \tau_2\tau_3\sqrt{q}, \qquad \|\bar A - \hat A\| = o(\rho_{2n}).$$
Now, let $\mu_{1n} = \sqrt{q}\,\rho_{1n}$. By the triangle inequality and similar proof strategies as in Lemma 10, for sufficiently large $n$, we obtain
$$\begin{aligned} \|\nabla_{\theta^o}F_n(\bar\theta^d, \hat\theta^o_B, \bar r^d, \hat r^o; Y)\| &\ge (1+\rho_{2n})\mu_{1n}\|\bar H_{B,B}w_B\| - c_0(1+\rho_{2n})\sqrt{q}\,n^{-1/2}\sqrt{\log p}\\ &\quad - c_3(1+\rho_{2n})\|w_B\|_2\big(\mu_{1n}\sqrt{q}\,n^{-1/2}\sqrt{\log p}\big) - 2\tau_2\tau_3\sqrt{q}\,\rho_{2n} - o(\rho_{2n})\\ &\ge (1+\rho_{2n})M_1C_2\sqrt{q}\,\rho_{1n} - 2\tau_2\tau_3\sqrt{q}\,\rho_{2n}, \end{aligned}$$
with probability at least $1 - O(\exp(-\eta\log p))$. Here, the first inequality uses Lemma 9, and the last inequality follows from Lemma 8, which gives $\|\bar H_{B,B}w_B\| \ge M_1\|w_B\|$. Now, taking
$$C_2 = \frac{1+2\delta}{(1+\rho_{2n})M_1} + \epsilon \tag{A.14}$$
for some $\epsilon > 0$ completes the proof.
Next, inspired by Peng et al. (2009) and Khare et al. (2015), we prove estimation consistency for the nodewise FCONCORD restricted to the true support, i.e., $\theta^o_{B^c} = 0$.
Lemma 12. Suppose Assumptions (A1)–(A4) are satisfied. Assume further that $\rho_{1n} = O(\sqrt{\log p/n})$, $n > O(q\log p)$ as $n \to \infty$, $\rho_{2n} = O(\sqrt{\log(p-H+1)/n})$, $\rho_{2n} \le \delta\rho_{1n}/\big((1+M_7(\bar\theta^o))\tau_2\tau_3\big)$, and the fairness tolerance is $0$. Then, there exist finite constants $C(\bar\theta^o)$ and $D(\bar q^o)$ such that for any $\eta > 0$, the following events hold with probability at least $1 - O(\exp(-\eta\log p))$:
(i) There exists a local minimizer $(\hat\theta^o_B, \hat q^o)$ of (A.1) such that
$$\max\big(\|\hat\theta^o_B - \bar\theta^o_B\|,\ \|\hat q^o - \bar q^o\|\big) \le \max\big(C(\bar\theta^o)\rho_{1n}\sqrt{q}/(1+\rho_{2n}),\ D(\bar q^o)\rho_{2n}\sqrt{\Psi(p,H,K)}\big),$$
where $q$ and $\Psi(p,H,K)$ are defined in (15).
(ii) If $\min_{(i,j)\in B}\bar\theta_{ij} \ge 2C(\bar\theta^o)\rho_{1n}\sqrt{q}/(1+\rho_{2n})$, then $\mathrm{sign}(\hat\theta^o_{ij}) = \mathrm{sign}(\bar\theta^o_{ij})$ for all $(i,j)\in B$.
Proof. By the KKT conditions, any solution $(\hat\theta^o, \hat r^o)$ of (A.1) satisfies
$$\|\nabla_{\theta^o}F_n(\bar\theta^d, \hat\theta^o_B, \bar r^d, \hat r^o; Y)\|_\infty \le \rho_{1n}.$$
Thus, for $n$ sufficiently large, we have
$$\|\nabla_{\theta^o}F_n(\bar\theta^d, \hat\theta^o_B, \bar r^d, \hat r^o; Y)\| \le \sqrt{q}\,\rho_{1n}.$$
Let $C(\bar\theta^o) := C_2$. Using (A.14) and Lemma 11, we obtain
$$\|\hat\theta^o - \bar\theta^o\| \le 2C(\bar\theta^o)\sqrt{q}\,\rho_{1n}/(1+\rho_{2n}), \qquad \theta^o_{B^c} = 0,$$
with probability at least $1 - O(\exp(-\eta\log p))$.
Now, if $\min_{(i,j)\in B}\bar\theta^o_{ij} \ge 2C(\bar\theta^o)\sqrt{q}\,\rho_{1n}/(1+\rho_{2n})$, then
$$\begin{aligned} 1 - O(\exp(-\eta\log p)) &\le P_{\bar\theta^o}\Big(\|\hat\theta^o_B - \bar\theta^o_B\| \le C(\bar\theta^o)\sqrt{q}\,\rho_{1n}/(1+\rho_{2n}),\ \min_{(i,j)\in B}\bar\theta^o_{ij} \ge 2C(\bar\theta^o)\sqrt{q}\,\rho_{1n}/(1+\rho_{2n})\Big)\\ &\le P_{\bar\theta^o}\big(\mathrm{sign}(\hat\theta^o_{ij}) = \mathrm{sign}(\bar\theta^o_{ij}),\ \forall (i,j)\in B\big). \end{aligned}$$
The following Lemma 13 shows that no wrong edge is selected with probability tending to one.

Lemma 13. Suppose that the conditions of Lemma 12 and Assumption (A5) are satisfied. Suppose further that $p = O(n^\kappa)$ for some $\kappa > 0$. Then for any $\eta > 0$ and $n$ sufficiently large, the solution of (A.1) satisfies
$$P\Big(\|\nabla_{\theta^o}F_n(\bar\theta^d, \hat\theta^o_{B^c}, \bar r^d, \hat r^o; Y)\|_\infty < \rho_{1n}\Big) \ge 1 - O(\exp(-\eta\log p)). \tag{A.15}$$
Proof. Let $E_n = \{\mathrm{sign}(\hat\theta^o_{ij,B}) = \mathrm{sign}(\bar\theta^o_{ij,B})\}$. Then by Lemma 12, $P_{\bar\theta^o}(E_n) \ge 1 - O(\exp(-\eta\log p))$ for large $n$. Define the sign vector $\hat t$ for $\hat\theta^o$ to satisfy the following properties:
$$\hat t_{ij} = \mathrm{sign}(\hat\theta^o_{ij}) \;\text{ if } \hat\theta^o_{ij} \ne 0, \qquad |\hat t_{ij}| \le 1 \;\text{ if } \hat\theta^o_{ij} = 0, \tag{A.16}$$
for all $1 \le i < j \le p$.
On $E_n$, by the KKT condition and the expansion of $F_n$ at $(\bar\theta^d, \hat\theta^o, \bar r^d, \hat r^o)$, we have
$$(1+\rho_{2n})\hat g^n_B + \rho_{1n}\hat t_B + \rho_{2n}\big[(N\hat R^oN^\top)\mathcal{A}^*\mathcal{A}\hat\theta^o\big]_B = 0, \tag{A.17}$$
where $\hat g^n = \nabla_{\theta^o}L_n(\bar\theta^d, \hat\theta^o; Y)$. Then, we can write
$$(1+\rho_{2n})(\hat g^n_B - \bar g^n_B) = -\rho_{1n}\hat t_B - \rho_{2n}\big[N\hat R^oN^\top\mathcal{A}\mathcal{A}^\top\hat\theta^o\big]_B - (1+\rho_{2n})\bar g^n_B.$$
Let $\tilde\theta^o$ denote a point on the line segment connecting $\hat\theta^o$ and $\bar\theta^o$. Applying a Taylor expansion, we obtain
$$(1+\rho_{2n})\bar H\big(\hat\theta^o_B - \bar\theta^o_B\big) = -(1+\rho_{2n})\bar g^n_B - \rho_{1n}\hat t_B + L^n_B - \rho_{2n}\hat A_B, \tag{A.18}$$
where $L^n := (\bar H^n - \bar H)(\hat\theta^o - \bar\theta^o)$ and $\hat A := N\hat R^oN^\top\mathcal{A}\mathcal{A}^\top\hat\theta^o$.
Now, utilizing the fact that $\hat\theta^o_{B^c} = \bar\theta^o_{B^c} = 0$, we have
$$(1+\rho_{2n})\bar H_{B^cB}\big(\hat\theta^o_B - \bar\theta^o_B\big) = -(1+\rho_{2n})\bar g^n_{B^c} - \rho_{1n}\hat t_{B^c} + L^n_{B^c} - \rho_{2n}\hat A_{B^c}, \tag{A.19}$$
$$(1+\rho_{2n})\bar H_{BB}\big(\hat\theta^o_B - \bar\theta^o_B\big) = -(1+\rho_{2n})\bar g^n_B - \rho_{1n}\hat t_B + L^n_B - \rho_{2n}\hat A_B. \tag{A.20}$$
Since $\bar H^n_{BB}$ is invertible by assumption, combining (A.19) and (A.20) gives
$$\rho_{1n}\|\hat t_{B^c}\|_\infty = \Big\|\bar H_{B^cB}(\bar H_{BB})^{-1}\big((1+\rho_{2n})\bar g^n_B + \rho_{1n}\hat t_B - L^n_B + \rho_{2n}\hat A_B\big) - (1+\rho_{2n})\bar g^n_{B^c} - L^n_{B^c} + \rho_{2n}\hat A_{B^c}\Big\|_\infty. \tag{A.21}$$
Now: (i) by the incoherence condition outlined in Assumption (A5), for any $(i,j)\in B^c$, we have
$$\big|\bar H_{ij,B}\bar H^{-1}_{B,B}\,\mathrm{sign}(\bar\theta^o_B)\big| \le (1-\delta) < 1;$$
(ii) by Lemma 8, for any $(i,j)\in B^c$, $\|\bar H_{ij,B}\bar H^{-1}_{BB}\| \le M_7(\bar\theta)$; and (iii) by steps similar to those in the proof of Lemma 11,
$$\rho_{2n}\big\|\bar H_{B^cB}(\bar H_{B,B})^{-1}\big\|\,\|\hat A_B\| \le 2\big(1 + M_7(\bar\theta)\big)\tau_2\tau_3\,\rho_{2n}. \tag{A.22}$$
Thus, following the proofs of (Peng et al., 2009, Theorem 2) (with the modification that we consider the edge set $B$ here), the remaining terms in (A.21) can be shown to be all $o(\rho_{1n})$, and the event
$$\rho_{1n}\|\hat t_{B^c}\|_\infty \le \rho_{1n}(1-\delta) + 4(1+M_7)\tau_2\tau_3\,\rho_{2n} \le \rho_{1n}(1 - 3\delta/4)$$
holds with probability at least $1 - O(\exp(-\eta\log p))$ for sufficiently large $n$ and $\rho_{2n} \le \delta\rho_{1n}/\big(16(1+M_7(\bar\theta))\tau_2\tau_3\big)$. This proves that, for sufficiently large $n$, no wrong edge is included for the true edge set $B$.
A.2.1 Proof of Theorem 5
Proof.
By Lemmas 12 and 13, with probability tending to 1, there exists a local minimizer of the
restricted problem that is also a minimizer of the original problem. This completes the proof.
A.3 Large Sample Properties of FBN
The proof bears some similarities to the proofs of Ravikumar et al. (2010) and Guo et al. (2010) for the neighborhood selection method, who in turn adapted the proof of Meinshausen et al. (2006) to binary data; however, there are also important differences, since all conditions and results here concern fair clustering and joint estimation, and many of our bounds need to be more precise than those given by Ravikumar et al. (2010); Guo et al. (2010).
Following the literature, we prove the main theorem in two steps: first, we prove that the result holds when Assumptions (B1’) and (B2’) hold for $\bar H^n$ and $T^n$, the sample versions of $\bar H$ and $T$. Then, we show that if (B1’) and (B2’) hold for the population versions $\bar H$ and $T$, they also hold for $\bar H^n$ and $T^n$ with high probability (Lemma 17).
(B1’) There exist constants $\tau_4, \tau_5 \in (0,\infty)$ such that
$$\Lambda_{\min}(\bar H^n_{BB}) \ge \tau_4 \quad\text{and}\quad \Lambda_{\max}(T^n) \le \tau_5.$$
(B2’) There exists a constant $\delta \in (0,1]$ such that
$$\big\|\bar H^n_{B^cB}\big(\bar H^n_{BB}\big)^{-1}\big\|_\infty \le (1-\delta). \tag{A.23}$$
We first list some properties of the loss function.
Lemma 14. For $\delta \in (0,1]$, we have
$$P\left(\frac{2-\delta}{\rho_{1n}}\,\|\bar g^n\|_\infty \ge \frac{\delta}{4}\right) \le 2\exp\left(-\frac{\rho_{1n}^2\,\delta^2\,n}{128\,(2-\delta)^2} + \log p\right),$$
where $\bar g^n := \nabla_{\theta^o}L(\bar\theta^d, \bar\theta^o; Y)$. This probability goes to $0$ as long as $\rho_{1n} \ge \frac{16(2-\delta)}{\delta}\sqrt{\frac{\log p}{n}}$.
Lemma 15. Suppose Assumption (B1’) holds and $n > Cq^2\log p$ for some positive constant $C$. Then for any $\delta \in (0,1]$, we have
$$\Lambda_{\min}\Big(\big[\nabla^2_{\theta^o}L(\bar\theta^d, \bar\theta^o + \delta w_B; Y)\big]_{BB}\Big) \ge \frac{\tau_4}{2}.$$
Lemma 16. For $\delta \in (0,1]$, if $\rho_{1n}q \le \frac{\tau_4^2}{100\tau_5}\frac{\delta}{2-\delta}$ and $\|\bar g^n\|_\infty \le \frac{\rho_{1n}}{4}$, then
$$\Big\|\big(\bar H^n_{BB} - \hat H^n_{BB}\big)\big(\hat\theta^o - \bar\theta^o\big)\Big\|_\infty \le \frac{\delta\rho_{1n}^2}{4(2-\delta)}.$$
Lemma 17. If $\bar H$ and $T$ satisfy (B1’) and (B2’), then the following hold for any $\alpha > 0$ and some positive constant $C$:
$$P\big(\Lambda_{\min}(\bar H^n_{BB}) \le \tau_4 - \alpha\big) \le 2\exp\Big(-\frac{\alpha^2 n}{2q^2} + 2\log q\Big),$$
$$P\big(\Lambda_{\max}(T^n_{BB}) \ge \tau_5 + \alpha\big) \le 2\exp\Big(-\frac{\alpha^2 n}{2q^2} + 2\log q\Big),$$
$$P\Big(\big\|\bar H^n_{B^cB}\big(\bar H^n_{BB}\big)^{-1}\big\|_\infty \ge 1 - \frac{\delta}{2}\Big) \le 12\exp\Big(-\frac{Cn}{q^3} + 4\log p\Big).$$
We omit the proofs of Lemmas 14–17, which are very similar to those of Ravikumar et al. (2010); Guo et al. (2010).
Lemma 18. Suppose Assumptions (A3)–(A4) and (B1’)–(B2’) are satisfied by $\bar H^n$ and $T^n$. Assume further that $\rho_{1n} \ge 16(2-\delta)\sqrt{\log p/n}$ and $n > Cq^2\log p$ for some positive constant $C$. Then, with probability at least $1 - 2\exp(-C\rho_{1n}^2 n)$, there exists a (local) minimizer of the restricted problem (11) within the disc
$$\Big\{(\hat\theta^o, \hat q^o) : \max\big(\|\hat\theta^o_B - \bar\theta^o_B\|_2,\ \|\hat q^o - \bar q^o\|\big) \le \max\big(\check C(\bar\theta^o)\rho_{1n}\sqrt{q},\ \check D(\bar q^o)\rho_{2n}\sqrt{\Psi(p,H,K)}\big)\Big\}$$
for some finite constants $\check C(\bar\theta^o)$ and $\check D(\bar q^o)$. Here, $q$ and $\Psi(p,H,K)$ are defined in (15).
Proof. The proof is similar to that of Lemma 10. Let $\mu_{1n} = \rho_{1n}\sqrt{q}$ with $q = |B|$, and let $\mu_{2n} = \rho_{2n}\sqrt{\Psi(p,H,K)}$. Let $C_1 > 0$ and $w \in \mathbb{R}^{p(p-1)/2}$ be such that $w_{B^c} = 0$ and $\|w\|_2 = C_1$. Further, let $z \in \mathbb{R}^{(p-H+1)(p-H)/2}$ be an arbitrary vector with finite entries and $\|z\| = D_1$. Write
$$F_n(\bar\theta^d, \bar\theta^o + \mu_{1n}w, \bar r^d, \bar r^o + \mu_{2n}z; Y) - F_n(\bar\theta^d, \bar\theta^o, \bar r^d, \bar r^o; Y) = I_1 + I_2 + I_3,$$
with $I_1$, $I_2$, and $I_3$ defined as in the proof of Lemma 10.
It follows from a Taylor expansion that
$$I_1 = \mu_{1n}w_B^\top\bar g^n_B + \mu_{1n}^2\, w_B^\top\big[\nabla^2_{\theta^o}L(\bar\theta^d, \bar\theta^o + \delta w_B; Y)\big]_{BB}w_B$$
for some $\delta \in (0,1]$. Now, let $\rho_{1n} \ge 16(2-\delta)\sqrt{\log p/n}$. It follows from Lemma 14 that
$$|w_B^\top\bar g^n_B| \le \|\bar g^n_B\|_\infty\,\|w_B\|_1 \le \mu_{1n}\frac{C_1}{4}, \tag{A.24}$$
where the last inequality follows since $\|\bar g^n_B\|_\infty \le \rho_{1n}/4$.
Further, using our assumption on the sample size, it follows from Lemma 15 that
$$w_B^\top\big[\nabla^2_{\theta^o}L(\bar\theta^d, \bar\theta^o + \delta w_B; Y)\big]_{BB}w_B \ge \frac{\tau_4C_1^2}{2}. \tag{A.25}$$
For the second term, it can easily be seen that
$$|I_2| = \rho_{1n}\big|\|\bar\theta^o_B + \mu_{1n}w_B\|_1 - \|\bar\theta^o\|_1\big| \le \mu_{1n}^2C_1. \tag{A.26}$$
In addition, by an argument similar to that in the proof of Lemma 10, we obtain
$$I_3 \le O(\rho_{2n}) \to 0. \tag{A.27}$$
Now, by combining (A.24)–(A.27), we obtain
$$F_n(\bar\theta^d, \bar\theta^o + \mu_{1n}w, \bar r^d, \bar r^o + \mu_{2n}z; Y) - F_n(\bar\theta^d, \bar\theta^o, \bar r^d, \bar r^o; Y) \ge C_1^2\,\frac{q\log p}{n}\Big(\frac{\tau_4}{2} - \frac{1}{C_1} - \frac{1}{4C_1}\Big) + O(\rho_{2n}) \ge 0.$$
The last inequality uses the condition $C_1 > 5/(2\tau_4)$. The proof follows by setting $\check C(\bar\theta^o) = C_1$ and $\check D(\bar q^o) = D_1$.
Lemma 19. Suppose Assumptions (A3)–(A4) hold. If (B1’) and (B2’) are satisfied by $\bar H^n$ and $T^n$, $\rho_{1n} = O(\sqrt{\log p/n})$, $n > O(q^2\log p)$ as $n \to \infty$, $\rho_{2n} = O(\sqrt{\log(p-H+1)/n})$, $\rho_{2n} \le \delta\rho_{1n}/\big(4(2-\delta)\tau_3\|\bar\theta^o\|\big)$, the fairness tolerance is $0$, and $\min_{(i,j)\in B}\bar\theta^o_{ij} \ge 2\check C(\bar\theta^o)\sqrt{q}\,\rho_{1n}$, then the result of Theorem 6 holds.
Proof. Define the sign vector $\hat t$ for $\hat\theta$ to satisfy (A.16). For $\hat\theta$ to be a solution of (20), the sub-gradient at $\hat\theta$ must be $0$, i.e.,
$$(1+\rho_{2n})\hat g^n + \rho_{1n}\hat t + \rho_{2n}N\hat R^oN^\top\mathcal{A}^*\mathcal{A}\hat\theta^o = 0, \tag{A.28}$$
where $\hat g^n = \nabla_{\theta^o}L(\bar\theta^d, \hat\theta^o; Y)$. Then we can write
$$(1+\rho_{2n})(\hat g^n - \bar g^n) = -\rho_{1n}\hat t - \rho_{2n}N\hat R^oN^\top\mathcal{A}\mathcal{A}^\top\hat\theta^o - (1+\rho_{2n})\bar g^n.$$
Let $\tilde\theta$ denote a point on the line segment connecting $\hat\theta$ and $\bar\theta$. Applying the mean value theorem gives
$$(1+\rho_{2n})\bar H^n\big(\hat\theta^o - \bar\theta^o\big) = -(1+\rho_{2n})\bar g^n - \rho_{1n}\hat t + L^n - \rho_{2n}A^n, \tag{A.29}$$
where $L^n = (\bar H^n_{BB} - \tilde H^n_{BB})(\hat\theta^o - \bar\theta^o)$ and $A^n = N\hat R^oN^\top\mathcal{A}\mathcal{A}^\top\hat\theta^o$.
Let $\hat\theta_B$ be the solution of the restricted problem (A.4), and let $\hat\theta_{B^c} = 0$. We will show that this $\hat\theta$ is the optimal solution and is sign consistent with high probability. To do so, let $\rho_{1n} = \frac{16(2-\delta)}{\delta}\sqrt{\frac{\log p}{n}}$. By Lemma 14, we have
$$\|\bar g^n\|_\infty \le \frac{\delta\rho_{1n}}{4(2-\delta)} \le \frac{\rho_{1n}}{4}$$
with probability at least $1 - 4\exp(-C\rho_{1n}^2 n)$. Choosing
$$n \ge \frac{100^2\,\tau_5^2\,(2-\delta)^2}{\tau_4^4\,\delta^2}\,q^2\log p,$$
we have $\rho_{1n}q \le \frac{\tau_4^2}{100\tau_5}\frac{\delta}{2-\delta}$; thus the conditions of Lemma 16 hold.
Now, by rewriting (A.29) and utilizing the fact that $\hat\theta_{B^c} = \bar\theta_{B^c} = 0$, we have
$$(1+\rho_{2n})\bar H^n_{B^cB}\big(\hat\theta^o_B - \bar\theta^o_B\big) = -(1+\rho_{2n})\bar g^n_{B^c} - \rho_{1n}\hat t_{B^c} + L^n_{B^c} - \rho_{2n}A^n_{B^c}, \tag{A.30}$$
$$(1+\rho_{2n})\bar H^n_{BB}\big(\hat\theta^o_B - \bar\theta^o_B\big) = -(1+\rho_{2n})\bar g^n_B - \rho_{1n}\hat t_B + L^n_B - \rho_{2n}A^n_B. \tag{A.31}$$
Since $\bar H^n_{BB}$ is invertible by assumption, combining (A.30) and (A.31) gives
$$\bar H^n_{B^cB}(\bar H^n_{BB})^{-1}\big(-(1+\rho_{2n})\bar g^n_B - \rho_{1n}\hat t_B + L^n_B - \rho_{2n}A^n_B\big) = -(1+\rho_{2n})\bar g^n_{B^c} - \rho_{1n}\hat t_{B^c} + L^n_{B^c} - \rho_{2n}A^n_{B^c}. \tag{A.32}$$
Now, using the results of Lemmas 14 and 16, we obtain
$$\begin{aligned} \rho_{1n}\|\hat t_{B^c}\|_\infty &= \Big\|\bar H^n_{B^cB}(\bar H^n_{BB})^{-1}\big((1+\rho_{2n})\bar g^n_B + \rho_{1n}\hat t_B - L^n_B + \rho_{2n}A^n_B\big) - (1+\rho_{2n})\bar g^n_{B^c} - L^n_{B^c} + \rho_{2n}A^n_{B^c}\Big\|_\infty\\ &\le \big\|\bar H^n_{B^cB}(\bar H^n_{BB})^{-1}\big\|_\infty\Big((1+\rho_{2n})\|\bar g^n_B\|_\infty + \rho_{1n} + \|L^n\|_\infty + \rho_{2n}\|A^n_{B^c}\|_\infty\Big)\\ &\quad + (1+\rho_{2n})\|\bar g^n_B\|_\infty + \|L^n\|_\infty + \rho_{2n}\|A^n_{B^c}\|_\infty\\ &\le \rho_{1n}\Big(1 - \frac{\delta}{2}\Big) + (2-\delta)\|\bar\theta^o\|\,\tau_3\,\rho_{2n}\\ &\le \rho_{1n}\Big(1 - \frac{\delta}{2}\Big) + \frac{\delta}{4}\,\rho_{1n}. \end{aligned}$$
The result follows by using Lemma 18 and our assumption that $\min_{(i,j)\in B}\bar\theta^o_{ij} \ge 2\check C(\bar\theta^o)\rho_{1n}\sqrt{q}$, where $\check C(\bar\theta^o)$ is defined in Lemma 18.
A.3.1 Proof of Theorem 6
Proof. With Lemmas 17 and 19 in hand, the proof of Theorem 6 is straightforward. By assumption, (B1) and (B2) are satisfied by $\bar H$ and $T$, $\rho_{1n} = O(\sqrt{\log p/n})$, and $q\sqrt{(\log p)/n} = o(1)$. Thus, the conditions of Lemma 19 hold, and therefore the results in Theorem 6 hold.
A.4 Proof of Theorem 7
Proof. We want to bound $\min_{O}\|Z\hat V - Z\bar V O\|_F$, where the minimum is over $O \in \mathbb{R}^{K\times K}$ with $O^\top O = OO^\top = I_K$. For any such $O$, since $Z^\top Z = I_{p-H+1}$, we have
$$\|Z\hat V - Z\bar V O\|_F^2 = \|Z(\hat V - \bar V O)\|_F^2 = \|\hat V - \bar V O\|_F^2.$$
Hence,
$$\min_{O\in\mathbb{R}^{K\times K}:\,O^\top O = OO^\top = I_K}\|Z\hat V - Z\bar V O\|_F = \min_{O\in\mathbb{R}^{K\times K}:\,O^\top O = OO^\top = I_K}\|\hat V - \bar V O\|_F. \tag{A.33}$$
We proceed similarly to Lei et al. (2015). By the Davis–Kahan theorem (Yu et al., 2015, Theorem 1),
$$\|\hat V - \bar V O\|_F \le \frac{2^{3/2}\,\min\big(\sqrt{s-r+1}\,\|\hat V\hat V^\top - \bar V\bar V^\top\|,\ \|\hat V\hat V^\top - \bar V\bar V^\top\|_F\big)}{\min(\Lambda_{r-1} - \Lambda_r,\ \Lambda_s - \Lambda_{s+1})},$$
where $s$ and $r$ denote the positions of the ordered (from largest to smallest) eigenvalues of the matrix $\bar V\bar V^\top$. Using Lemma 12, we have that
$$\|\hat V\hat V^\top - \bar V\bar V^\top\| \le 2\,\|\hat V\hat V^\top - \bar V\bar V^\top\|_F = O\Big(n^{-1/2}\sqrt{(p-H+1)^2/\big(K(p-H+1)\big)}\Big).$$
This implies that
$$\|\hat V - \bar V O\|_F \le \kappa\,\phi(p,H,K)\sqrt{\frac{K}{n}}$$
for some $\kappa > 0$.
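The minimization over orthogonal $O$ in (A.33) is an orthogonal Procrustes problem with a closed-form solution via an SVD; the sketch below, with our own names, evaluates the aligned subspace error.

```python
import numpy as np

def aligned_subspace_error(V_hat, V_bar):
    """min over orthogonal O of ||V_hat - V_bar O||_F; the optimal O is
    U W^T, where U S W^T is the SVD of V_bar^T V_hat (Procrustes)."""
    U, _, Wt = np.linalg.svd(V_bar.T @ V_hat)
    O = U @ Wt
    return np.linalg.norm(V_hat - V_bar @ O, ord="fro")
```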
The rest of the proof follows as in the proof of (Lei et al., 2015, Theorem 1). More specifically, it follows from (Lei et al., 2015, Lemma 5.3) that
$$\sum_{k=1}^{K}\frac{|S_k|}{|C_k|} \le 4(4+2\xi)\,\|\hat V\hat V^\top - \bar V\bar V^\top\| \le 4(4+2\xi)\,\kappa\,\phi(p,H,K)\sqrt{\frac{K}{n}} = \frac{2+\xi}{\pi}\,\phi(p,H,K)\sqrt{\frac{K}{n}}, \tag{A.34}$$
for some $\pi > 0$.
A.5 Updating Parameters and Convergence of Algorithm 1
A.5.1 Proof of Theorem 1
Proof. The convergence of Algorithm 1 follows from Wang et al. (2019), who, using a generalized version of the ADMM algorithm, propose optimizing a general constrained nonconvex problem of the form $f(x) + g(y)$ subject to $x = y$. More precisely, for sufficiently large $\gamma$ (the lower bound is given in (Wang et al., 2019, Lemma 9)), and starting from any $(\Theta^{(0)}, \Omega^{(0)}, Q^{(0)}, W^{(0)})$, Algorithm 1 generates a sequence that is bounded and has at least one limit point, and each limit point $(\Theta^{(\infty)}, \Omega^{(\infty)}, Q^{(\infty)}, W^{(\infty)})$ is a stationary point of (6), i.e., $\nabla\Upsilon_\gamma(\Theta^{(\infty)}, \Omega^{(\infty)}, Q^{(\infty)}, W^{(\infty)}) = 0$.
The global convergence of Algorithm 1 uses the Kurdyka–Łojasiewicz (KL) property of $L_\gamma$. Indeed, the KL property has been shown to hold for a large class of functions, including subanalytic and semi-algebraic functions such as indicator functions of semi-algebraic sets, vector (semi)norms $\|\cdot\|_p$ with $p \ge 0$ any rational number, and matrix (semi)norms (e.g., the operator, trace, and Frobenius norms). Since the loss function $L$ is convex and the other functions in (6) are either subanalytic or semi-algebraic, the augmented Lagrangian $L_\gamma$ satisfies the KL property. The remainder of the proof is similar to (Wang et al., 2019, Theorem 1).
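For reference, the generic scheme from Wang et al. (2019) that the proof invokes can be sketched as below for $\min f(x) + g(y)$ subject to $x = y$; the proximal operators are left abstract, so this is a structural sketch rather than the paper's Algorithm 1.

```python
import numpy as np

def nonconvex_admm(prox_f, prox_g, x0, gamma=10.0, iters=500):
    """ADMM for  min f(x) + g(y)  s.t.  x = y,  with penalty gamma.
    prox_f / prox_g: callables (v, gamma) -> argmin h(.) + (gamma/2)||. - v||^2.
    For sufficiently large gamma, the iterates converge to a stationary
    point under the conditions of Wang et al. (2019)."""
    x = y = np.asarray(x0, dtype=float).copy()
    u = np.zeros_like(x)              # scaled dual variable
    for _ in range(iters):
        x = prox_f(y - u, gamma)      # x-update
        y = prox_g(x + u, gamma)      # y-update
        u = u + x - y                 # dual ascent on the constraint x = y
    return x, y
```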
A.5.2 Proof of Lemma 3
Proof. The proof follows the idea of (Khare et al., 2015, Lemma 4). Note that for $1 \le i \le p$,
$$\Upsilon_2(\Theta) = -n\log\theta_{ii} + \frac{n(1+\rho_2)}{2}\Big(\theta_{ii}^2 s_{ii} + 2\theta_{ii}\sum_{j\ne i}\theta_{ij}s_{ij}\Big) + \frac{\gamma}{2}(\theta_{ii} - \omega_{ii} + w_{ii})^2 + \text{terms independent of }\theta_{ii},$$
where $s_{ij} = y_i^\top y_j/n$. Hence,
$$\frac{\partial}{\partial\theta_{ii}}\Upsilon_2(\Theta) = 0 \iff -\frac{1}{\theta_{ii}} + \theta_{ii}\Big((1+\rho_2)s_{ii} + \frac{\gamma}{n}\Big) + (1+\rho_2)\sum_{j\ne i}\theta_{ij}s_{ij} + \frac{\gamma}{n}(w_{ii} - \omega_{ii}) = 0.$$
Hence,
$$0 = \theta_{ii}^2\underbrace{\Big((1+\rho_2)s_{ii} + \frac{\gamma}{n}\Big)}_{=:a_i} + \theta_{ii}\underbrace{\Big((1+\rho_2)\sum_{j\ne i}\theta_{ij}s_{ij} + \frac{\gamma}{n}(w_{ii} - \omega_{ii})\Big)}_{=:b_i} - 1,$$
which gives (14b). Note that since $\theta_{ii} > 0$, the positive root has been retained as the solution.
Also, for $1 \le i < j \le p$,
$$\Upsilon_2(\Theta) = \frac{n(1+\rho_2)}{2}(s_{ii}+s_{jj})\theta_{ij}^2 + n(1+\rho_2)\Big(\sum_{j'\ne j}\theta_{ij'}s_{jj'} + \sum_{i'\ne i}\theta_{i'j}s_{ii'}\Big)\theta_{ij} + \frac{\gamma}{2}(\theta_{ij} - \omega_{ij} + w_{ij})^2 + \text{terms independent of }\theta_{ij}. \tag{A.35}$$
Setting the derivative with respect to $\theta_{ij}$ to zero gives
$$0 = \underbrace{\Big((1+\rho_2)(s_{ii}+s_{jj}) + \frac{\gamma}{n}\Big)}_{=:a_{ij}}\,\theta_{ij} + \underbrace{\Big((1+\rho_2)\Big(\sum_{j'\ne j}\theta_{ij'}s_{jj'} + \sum_{i'\ne i}\theta_{i'j}s_{ii'}\Big) + \frac{\gamma}{n}(w_{ij} - \omega_{ij})\Big)}_{=:b_{ij}},$$
which implies (14a). The proof for updating $\Omega$ follows similarly.
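The scalar algebra above yields closed-form coordinate updates, sketched below: the diagonal update keeps the positive root of $a_i\theta^2 + b_i\theta - 1 = 0$, and the off-diagonal update is the stationary point of the quadratic in $\theta_{ij}$. Whether (14a)–(14b) in the main text carry any additional terms is not visible here, so treat this only as a sketch of the derivation above.

```python
import numpy as np

def update_theta_ii(a_i, b_i):
    """Positive root of a_i * t**2 + b_i * t - 1 = 0 (a_i > 0), matching
    the diagonal stationarity condition above (cf. (14b))."""
    return (-b_i + np.sqrt(b_i**2 + 4.0 * a_i)) / (2.0 * a_i)

def update_theta_ij(a_ij, b_ij):
    """Root of a_ij * t + b_ij = 0, matching the off-diagonal stationarity
    condition above (cf. (14a))."""
    return -b_ij / a_ij
```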