Evaluating re-identification risks with respect to the HIPAA privacy rule

Kathleen Benitez (1), Bradley Malin (1,2)
ABSTRACT
Objective Many healthcare organizations follow data
protection policies that specify which patient identifiers
must be suppressed to share “de-identified” records.
Such policies, however, are often applied without
knowledge of the risk of “re-identification”. The goals of
this work are: (1) to estimate re-identification risk for data
sharing policies of the Health Insurance Portability and
Accountability Act (HIPAA) Privacy Rule; and (2) to
evaluate the risk of a specific re-identification attack using
voter registration lists.
Measurements We define several risk metrics: (1)
expected number of re-identifications; (2) estimated
proportion of a population in a group of size g or less, and
(3) monetary cost per re-identification. For each US state,
we estimate the risk posed to hypothetical datasets,
protected by the HIPAA Safe Harbor and Limited Dataset
policies by an attacker with full knowledge of patient
identifiers and with limited knowledge in the form of voter
registries.
Results The percentage of a state’s population
estimated to be vulnerable to unique re-identification (ie,
g=1) when protected via Safe Harbor and Limited
Datasets ranges from 0.01% to 0.25% and 10% to 60%,
respectively. In the voter attack, this number drops for
many states, and for some states is 0%, due to the
variable availability of voter registries in the real world.
We also find that re-identification cost ranges from $0 to
$17 000, further confirming risk variability.
Conclusions This work illustrates that blanket
protection policies, such as Safe Harbor, leave
different organizations vulnerable to re-identification at
different rates. It provides justification for locally
performed re-identification risk estimates prior to
sharing data.
INTRODUCTION
Advances in health information technology have facilitated the collection of large quantities of finely detailed personal data [1], which, in addition to supporting innovative healthcare operations, has become a vital component of numerous secondary endeavors, including novel comparative quality research and the validation of published findings [2,3]. Historically, data collection and processing efforts were performed internally by the same organization, but sharing data beyond the borders of the organization has become a vital component of emerging biomedical systems [2,3]. In fact, it is of such importance that in the United States, some federal agencies such as the National Institutes of Health (NIH) have adopted policies that mandate sharing data generated or studied with federal funding [4,5].
To realize the benefits of sharing data while minimizing privacy concerns, many healthcare organizations have turned to de-identification, a technique that strips explicit identifying information, such as personal names or Social Security Numbers, from disclosed records. Healthcare organizations often employ multiple tiers of de-identification policies, the appropriateness of which is usually dependent on the recipient and intended use. Each policy specifies a set of features that must be suppressed from the data. Presently, healthcare organizations tend to employ at least two policy tiers: (1) public use; and (2) restricted access research. The public use policy removes a substantial number of explicit identifiers and quasi-identifying, or potentially identifying, attributes. The resulting dataset is thought to contain records that are sufficiently resistant to privacy threats. In contrast, the restricted access research policy retains more detailed features, such as dates and geocodes. In return for additional information, oversight or explicit approval from the originating organization is required.
Though de-identification is a widely invoked approach to privacy protection, there have been limited investigations into the effectiveness of such policies. Anecdotal evidence suggests that concerns over the strength of such protections may be warranted. In 1996, for instance, Sweeney was able to merge publicly available de-identified hospital discharge records with identified voter registration records on the common fields of date of birth, gender and residential zip code to re-identify the medical record for the governor of Massachusetts, uncovering the reason for a mysterious hospital stay [6]. In subsequent investigations, it was estimated that somewhere between 63% and 87% of the US population is unique on the combination of such demographics [6,7]. However, both investigations assumed that an attacker has ready access to a resource with names and demographics for the entire population.
There are several primary goals and contributions of this paper. First, we extend earlier work [6,7] by defining and applying several computational metrics to determine the extent to which de-identification policies in the Privacy Rule of the Health Insurance Portability and Accountability Act (HIPAA) [8] leave populations susceptible to re-identification. In particular, we focus on the Safe Harbor and Limited Dataset policies, which, akin to the policy tiers mentioned earlier, define public use and restricted use datasets. In the process, we illustrate how to compare the re-identification risk tradeoffs between competing policies. We perform this analysis in a generative manner and assume that an attacker has access to all the identifying information on the de-identified population. Second, we demonstrate how to model concerns in a more realistic setting and consider the context of a limited knowledge attacker. Specifically, while the analysis mentioned in the first part of the paper assumes access to identifying information for the entire population, the accessibility of such data cannot be taken for granted. And, while voter registration lists have been exploited in one known instance and are cited as a source of identified data, such an attack may not be feasible in all situations. We investigate how the real world availability of voter registration resources influences the re-identification risks. Voter information is often managed at the state level, and thus we perform our analysis on a state-by-state basis to determine how blanket federal-level data sharing policies (ie, HIPAA) are affected by regional variability. Our results show that differences in risk are magnified when the wide spread of state voter registration policies is taken into account. Overall, our study provides evidence that the risks vary greatly and an attacker's likelihood of re-identification success is dependent on the population from which the released dataset is drawn.
Supplementary appendices are published online only at http://jamia.bmj.com/content/vol17/issue2
(1) Department of Biomedical Informatics, School of Medicine, Vanderbilt University, Nashville, Tennessee, USA
(2) Department of Electrical Engineering and Computer Science, School of Engineering, Vanderbilt University, Nashville, Tennessee, USA
Correspondence to Bradley Malin, 2525 West End Avenue, Suite 600, Department of Biomedical Informatics, School of Medicine, Vanderbilt University, Nashville, TN 37203, USA; b.malin@vanderbilt.edu
Received 4 April 2009
Accepted 14 December 2009
J Am Med Inform Assoc 2010;17:169–177. doi:10.1136/jamia.2009.000026
Research paper
BACKGROUND
In this section, we review the foundations of de-identification and re-identification. We examine previous privacy risk analysis approaches and illustrate the concepts with a motivating example.
From de-identification to re-identification
Consider the hypothetical situation outlined in figure 1. In this setting, a healthcare provider maintains identified, patient-level clinical information in its private medical records. For various reasons, the provider needs to share aspects of this data with a third party, but certain fields in the dataset are sensitive, and therefore an administrator must take steps to protect the privacy of the patients. The de-identification policy of the provider forbids the disclosure of personal names and geographic attributes, so these fields are suppressed to create the released dataset. The residual information, however, may still be susceptible to re-identification.
In this work, we are concerned with attacks that re-identify as many records as possible, which in prior publications have been called marketer attacks (footnote i). A large-scale attack requires an identified dataset having fields in common with the de-identified dataset, such as the fictional voter list in figure 1. A re-identification, also known in the literature as an identity disclosure [9], is accomplished when an attacker can make a likely match between a de-identified record and the corresponding record in the identified dataset. For simplicity, we assume that identified public records contain data on everyone in the de-identified release, making the identified population a superset of the de-identified dataset. We acknowledge this is a simplification and point out that it results in a worst-case risk analysis; that is, an upper bound on the number of possible re-identifications. The online appendix elaborates on this component of the problem.
Unique individuals are most vulnerable to re-identification precisely because matches are certain in the eyes of an attacker. In figure 1, for instance, there is only one person in the population who is a male born in 1953. As a result, since he is a patient in the released dataset, his identity, which is reported in the voter list, can easily be linked to his record in the released dataset. However, it is important for the reader to recognize that uniqueness is only a sufficient, and not a necessary, condition for achieving re-identification. Anytime there is a level of individuality, or distinctiveness as we shall call it, there is the potential for re-identification. Notice, again in figure 1, that there are two records in the released dataset for male patients born in 1955. Similarly, there are also two males born in 1955 in the population at large. While these records are non-unique, an attacker who linked the identities to the sensitive records through a random assignment procedure would be correct half of the time.
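The linkage described above can be sketched in a few lines of Python. This is only an illustration: the voter names, diagnoses, and the helper `link` are hypothetical stand-ins for the contents of figure 1, and the probability of a correct match is taken to be 1/k when k identified people share the same (gender, year of birth) combination, as in the random assignment argument above.

```python
from collections import defaultdict

# Hypothetical identified voter list: (name, gender, year of birth).
VOTERS = [
    ("Voter A", "F", 1960),
    ("Voter B", "M", 1953),
    ("Voter C", "M", 1955),
    ("Voter D", "M", 1955),
]

# Hypothetical released dataset: (gender, year of birth, diagnosis).
RELEASED = [
    ("M", 1953, "diagnosis-1"),
    ("M", 1955, "diagnosis-2"),
    ("M", 1955, "diagnosis-3"),
]


def link(released, voters):
    """Match each released record to the voters sharing its demographics."""
    candidates_by_key = defaultdict(list)
    for name, gender, year in voters:
        candidates_by_key[(gender, year)].append(name)

    results = []
    for gender, year, diagnosis in released:
        candidates = candidates_by_key.get((gender, year), [])
        # A random guess among k candidates is correct with probability 1/k.
        p_correct = 1.0 / len(candidates) if candidates else 0.0
        results.append((diagnosis, candidates, p_correct))
    return results


results = link(RELEASED, VOTERS)
expected_reidentifications = sum(p for _, _, p in results)
```

In this toy data the male born in 1953 is matched with certainty, while each of the two males born in 1955 is matched correctly half of the time, so the expected number of correct re-identifications is 2.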
Identified datasets and the use of voter registration records
The key to successfully achieving a large-scale re-identification attack is the availability of an identified dataset with broad population coverage. In this sense, public records can provide for an easily accessible resource that often includes richly-detailed demographic features. While identified records with features linkable to de-identified data could be obtained through illegitimate means, such as the theft of a laptop that stores such lists on an unencrypted hard drive (eg, see Tennessee [10]) or hacking a state-owned website (eg, see Illinois [11]), lawful avenues make it possible for potential attackers to obtain some public records, such as voter registration lists, without committing any crime. Moreover, access to such records can, in some cases, be obtained without a formally executed data use agreement.
In this paper we focus on voter registration information as a route of potential re-identification for several reasons. First, as mentioned in the introduction of this paper, this resource was applied in one of the most famous re-identification studies to date: the case study by Sweeney [6]. Second, following in the footsteps of this case study, there have been a significant number of publications by the academic and policy communities that suggest such records are a particularly enticing resource for would-be attackers [12–21]. However, allusions to the potential uses of voter lists rarely acknowledge the complexity of data access intricacies, or the economics, of the attack. Rather, they tend to make an implicit assumption that a universal set of demographic attributes tied to personal identity is available to all potential adversaries for a nominal fee. But the reality of the situation is that, if not the absolute contrary, the ability to apply such a resource for re-identification is not universal. Consider a 2002 survey of voter registration data gathering and privacy policies, which documented that, while all but one state required voters to provide their date of birth, 11 states redacted certain features associated with date of birth prior to making records available to secondary users [21]. The accessibility of identifying resources, such as voter registration lists, is made even more complex by the fact that state-level access policies for identified records are dynamic and change over time. To generate results that are relevant to the current climate, this paper updates the aforementioned survey.
Re-identification risk measures
Most risk evaluation metrics for individual level data focus on one of the following factors: (1) the number, or proportion, of unique individuals; or (2) the worst case scenario, that is, the identifiability of the most vulnerable record in the dataset.
Of those that consider the first factor, the most common approach simply analyzes the proportion of records that are unique within a particular population [22,23]. Alternative approaches that have been proposed add nuance, for instance not just considering unique links, but the probability that a unique link between sensitive and identified datasets is correct. This accounts for the complexities of the relationship between the populations represented (further details on this matter are provided in online Appendix B) [24]. The second body of work comes into play when none of the records is likely to be unique [9]. These approaches define disclosure risk as the probability that a re-identification can be achieved.
Footnote i: For further discussion of the types of attacks and types of re-identifications, see online Appendix A.
For the evaluation offered in this paper, we adopt a measure proposed by Truta et al [25], which offers an advantage over the narrow focus on either unique individuals or the most susceptible individuals. This measure incorporates risk estimates for all records in the dataset, regardless of their level of distinctiveness.
METHODS
Materials
We utilized the following resources for our evaluation: (1) HIPAA policies for secondary data sharing to determine the fields available in released datasets; (2) real voter registration access policies for each US state to determine the fields available to an attacker; and (3) demographic summary statistics from the 2000 US Census as population descriptors. We describe each of these resources in the following sections.
Sensitive data policies
Medical and health-related records are considered to contain sensitive information by many people [26]. The unauthorized disclosure of an individual's private health data, such as a positive HIV test result, can have adverse effects on medical insurance, employment, and reputation [27,28]. Yet, health data sharing is vital to further healthcare research, and thus there are various mechanisms for doing so in a de-identified format. As part of HIPAA, for instance, the Privacy Rule regulates the use and disclosure of what is termed Protected Health Information [8]. Of particular interest to our study are two de-identification policies specified by the Privacy Rule, namely Safe Harbor and Limited Dataset, which permit the dissemination of patient-level records without the need for explicit consent.
The Safe Harbor policy enumerates 18 identifiers that must be removed from health data, including personal names, web addresses, and telephone numbers. This process creates a public-use dataset, such that once data has been de-identified under this policy, there are no restrictions on its use. As in many data sharing regulations in the USA and around the world, Safe Harbor contains a special threshold provision for geographic area [29]. When a geographic area (eg, zip code) contains at least 20 000 people, it may be included in Safe Harbor protected datasets, otherwise it must be removed (footnote ii). Therefore, the threshold of 20 000 is significant for an analysis of population distinctiveness, which we explicitly investigate in the following evaluation. In contrast, the Limited Dataset policy specifies a subset of 16 identifiers that must be removed, creating a research dataset. In order to obtain this dataset, recipients must sign a data use agreement, a contract that restricts the use of the data. Such agreements often explicitly prohibit attempts to re-identify or contact the subjects.
Footnote ii: For simplicity, we assume no geographic detail beyond “US state” is made available through Safe Harbor.
Figure 1 Example of de-identification and re-identification using public records.
In this paper, we focus explicitly on demographic information, which is particularly relevant to risk analysis because of its wide availability in health and public records, especially in the form of voter registration lists. We assume that an unmodified dataset managed by a healthcare entity includes (Name, Address, Date of Birth, Gender, Race). When filtered through Safe Harbor, a released dataset will contain only (Year of Birth, Gender, Race), while a Limited Dataset release will also include (County, Date of Birth).
Voter registration information
Information regarding voter registration lists is available from several sources. Most US state websites maintain online, unofficial versions of their regulatory codes, which contain the policies that govern the use and administration of voter registration lists (eg, Alabama [30]). In some states this information is sufficient to learn which fields are specifically permitted in public releases of the voter registration lists. In other states, the regulations are prohibitory, simply stating which fields cannot be part of the public record. We deemed that a survey of each state's elections office was the most reliable source for information regarding the current contents and prices of voter registration lists. We conducted this survey (results in online Appendix C) in the fall of 2008 by making inquiries with election offices and interpreting a variety of voter registration forms and legal paperwork because there is no standard form or procedure for obtaining state voter lists. Information available in both private health data and voter registration information consists mainly of demographics, such as age, gender, or race (footnote iii). Thus, we defined the potential fields of intersection as (Date of Birth, Year of Birth, Race, Gender, County of Residence).
Population information
The census is a natural place to turn for population descriptions subdivided by the aforementioned demographic features. The 2000 US Census is one of the most complete population records to date with an undercount rate estimated to be between 0.96% and 1.4% [31]. Many of the results of the census are freely available online through the Census Bureau's American Fact Finder website [32]. Tables PCT12 A–G detail the number of people of each gender, by age, in a particular geographic division, each table representing one of the Census's seven race classifications: White alone, Black alone, American Indian or Alaska Native alone, Asian alone, Native Hawaiian or Pacific Islander alone, Some other race alone, and Two or more races. This information is available for many geographic breakdowns, but as we defined the fields of intersection to include only information as specific as county, the most appropriate division was each table for the 3219 US counties and county equivalents. We created tables for each state and an additional table to translate between field names and the age ranges, genders, and races they represent, so that populations with fields in common could be combined where warranted.
While the census provides the majority of the information needed, it is not a perfect fit. In particular, the census partitions the population by gender and age, whereas voter registration data include year of birth, for which we assume age is a proxy. However, there are additional challenges. For instance, ages over 100 are grouped by the US Census into 5-year age groups (100–104, 105–110). Additionally, information on date of birth is not reported. To overcome such limitations, we leverage a statistical estimation technique proposed by Golle, which is based on the assumption that members of the group are distributed uniformly at random in the larger group [7]. This implies that an individual is as likely to be born on January 5 as January 6, and likewise, that an individual in the age group 100–104 is as likely to be 100 as 101. More generally, given an aggregated group with n individuals who could correspond to b possible subgroups, or bins, the number of bins with i individuals is estimated as:
\[ f_n(i) = \binom{n}{i} \, b^{1-n} \, (b-1)^{n-i} \tag{1} \]
As an example, if there are 200 individuals in a group, say 24-year-old Asian alone males in County X, then $200 \times 365^{-199} \times 364^{199} \approx 116$ are expected to have a unique birth date.
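Equation (1) and the worked example can be checked numerically. The sketch below is one possible implementation, not the authors' code; the function name `expected_bins` is hypothetical, and the computation is done in log space (via `math.lgamma`) only because a term such as 365 raised to the power 199 overflows a floating point number.

```python
from math import exp, lgamma, log


def expected_bins(n, b, i):
    """Expected number of bins holding exactly i of n individuals (equation 1).

    Evaluates C(n, i) * b**(1 - n) * (b - 1)**(n - i) in log space.
    """
    if i < 0 or i > n:
        return 0.0
    if b == 1:
        # With a single bin, that bin holds all n individuals.
        return 1.0 if i == n else 0.0
    log_binomial = lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
    return exp(log_binomial + (1 - n) * log(b) + (n - i) * log(b - 1))


# Worked example from the text: 200 people spread over 365 possible birth dates.
unique_birth_dates = expected_bins(200, 365, 1)
```

For n = 200 and b = 365 this evaluates to about 115.9, which rounds to the 116 quoted above. As a sanity check, the bin counts weighted by bin size sum back to n.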
Risk estimation metrics
We developed two risk estimation metrics that we believe provide a compromise between focusing on likely re-identifications and accepting that there is some probability of re-identification for every record in a released dataset. They are termed g-distinct and total risk and are defined as follows.
g-Distinct
An individual is said to be unique when he or she has a combination of characteristics that no one else has, and we say an individual is g-distinct if their combination of characteristics is identical to g−1 or fewer other people in the population. Therefore, uniqueness is the base case of 1-distinct. In general, the number of g-distinct individuals is obtained by summing, over bin sizes i from 1 to g, the number of individuals who fall in bins with i individuals, which is computed as:
\[ h_n(g) = \sum_{i=1}^{g} i \, f_n(i) \tag{2} \]
Of the 200 individuals above, approximately 199.95 would be 5-distinct. It is useful to think of these numbers in terms of proportions rather than absolute numbers. In this case, 99.975% of the group is 5-distinct. Therefore, if a released dataset contained three Asian only 24-year-old males, 2.999 of them would be expected to be 5-distinct. Formally, given j members of a group of n, the expected number that will be g-distinct is given as follows:
\[ \hat{h}_n^j(g) = \frac{j}{n} \, h_n(g) \tag{3} \]
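Equations (2) and (3) can be sketched in the same way. The function names below are hypothetical, and `expected_bins` is repeated from equation (1) so that the block is self-contained; this is an illustrative reading of the formulas rather than the authors' implementation.

```python
from math import exp, lgamma, log


def expected_bins(n, b, i):
    """Expected number of bins holding exactly i of n individuals (equation 1)."""
    if i < 0 or i > n:
        return 0.0
    if b == 1:
        return 1.0 if i == n else 0.0
    log_binomial = lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
    return exp(log_binomial + (1 - n) * log(b) + (n - i) * log(b - 1))


def g_distinct(n, b, g):
    """Expected number of the n individuals in bins of size g or less (equation 2)."""
    return sum(i * expected_bins(n, b, i) for i in range(1, min(g, n) + 1))


def g_distinct_in_release(j, n, b, g):
    """Expected number of g-distinct records among j released from a group of n (equation 3)."""
    return (j / n) * g_distinct(n, b, g)


five_distinct = g_distinct(200, 365, 5)
five_distinct_of_three = g_distinct_in_release(3, 200, 365, 5)
```

With n = 200 and b = 365 the first value is approximately 199.95 and the second approximately 2.999, in line with the figures in the text.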
Total risk
We extend the notion of g-distinct to cover all possible values of g to create a measure of total risk. This is similar to the DR_max metric proposed by Truta et al [25] and quantifies the likelihood of re-identification for each member of a group. When summed over all groups, it reveals the expected number of re-identifications for the whole dataset. Specifically, given j members of a group of n, the expected number of re-identifications (ie, the total risk) is computed as:
\[ \hat{r}_n^j(g) = \frac{j}{n} \, b^{1-n} \left( b^n - (b-1)^n \right) \tag{4} \]
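A minimal numerical sketch of equation (4) follows. It relies on the algebraic identity b^(1−n)(b^n − (b−1)^n) = b(1 − ((b−1)/b)^n), which is the expected number of non-empty bins; the rewritten form is used here only to avoid overflow. The function name `total_risk` is hypothetical.

```python
from math import expm1, log1p


def total_risk(j, n, b):
    """Expected re-identifications among j records drawn from a group of n (equation 4).

    Computes (j / n) * b * (1 - ((b - 1) / b) ** n), which equals
    (j / n) * b**(1 - n) * (b**n - (b - 1)**n).
    """
    if n == 0:
        return 0.0
    if b == 1:
        occupied_bins = 1.0  # everyone shares the single bin
    else:
        # -expm1(x) evaluates 1 - exp(x) accurately when x is close to zero.
        occupied_bins = -b * expm1(n * log1p(-1.0 / b))
    return (j / n) * occupied_bins


risk_whole_group = total_risk(200, 200, 365)
```

For the running example (all 200 group members released, 365 possible birth dates) this gives roughly 154 expected re-identifications. With a single bin the result is j/n, reflecting one expected correct guess spread over n indistinguishable people.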
Process
The risk analysis estimation consists of a three step process: (1) determine the fields available to an attacker; (2) group the Census data according to these fields; and (3) apply a risk estimation metric to each resulting group and sum the results, normalizing by the total population. The interplay of the data is illustrated in figure 2, which depicts the relationship between our simulation of re-identification (top) and the expected approach of an attacker (bottom).
Footnote iii: While voter history is available from many states' voter registration lists, and is not explicitly prohibited by either of the privacy policies under consideration, it is certainly not likely to turn up in a medical record.
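The three steps can be sketched as follows. Everything here is an assumption made for illustration: the census cells are invented, the helper names are hypothetical, the whole population is assumed to be released (j = n in equation 4), and `bins_per_group` stands for the number of unreported subgroups within each cell (for example 365 when date of birth is available to the attacker, 1 otherwise).

```python
from collections import defaultdict
from math import expm1, log1p

# Hypothetical census cells: (county, gender, age, race) -> head count.
CELLS = {
    ("X", "M", 24, "Asian alone"): 200,
    ("X", "F", 24, "Asian alone"): 180,
    ("Y", "M", 24, "Asian alone"): 3,
    ("Y", "M", 51, "White alone"): 1,
}
FIELD_INDEX = {"County": 0, "Gender": 1, "Age": 2, "Race": 3}


def group_counts(cells, fields):
    """Step 2: merge census cells that agree on the attacker's fields."""
    positions = [FIELD_INDEX[name] for name in fields]
    groups = defaultdict(int)
    for key, count in cells.items():
        groups[tuple(key[p] for p in positions)] += count
    return dict(groups)


def expected_reidentifications(n, b):
    """Equation (4) with j = n: expected number of occupied bins."""
    if n == 0:
        return 0.0
    if b == 1:
        return 1.0
    return -b * expm1(n * log1p(-1.0 / b))


def population_risk(cells, fields, bins_per_group=1):
    """Step 3: sum the metric over groups and normalize by total population."""
    groups = group_counts(cells, fields)
    population = sum(groups.values())
    expected = sum(expected_reidentifications(n, bins_per_group)
                   for n in groups.values())
    return expected / population


risk_age_only = population_risk(CELLS, ["Age"])
risk_all_fields = population_risk(CELLS, ["County", "Gender", "Age", "Race"])
```

Step 1, choosing `fields`, is what distinguishes one attacker model from another; the grouping and summation stay the same.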
We consider two types of risk for the purposes of this work, which we call GENERAL and VOTER. GENERAL is the risk associated with a fully informed attacker and corresponds to the worst-case scenario. It assumes that the attacker has access to identifying information for each individual and all the relevant fields for linkage for the entire population from which the disclosed records were derived. To determine the fields available to a GENERAL attacker, consider the data protection policy and assume the attacker has access to all the demographic data permitted by that policy. In figure 1, the released dataset has fields (Gender, Year of Birth, Diagnosis), so we assume that the attacker has identifying information containing (Gender, Year of Birth), and would use these fields to re-identify the released dataset. The GENERAL attacker is the typical risk model applied today. The second model, VOTER, is tempered in that it considers the availability of a specific identified resource. Specifically, the fields available to a VOTER attacker are derived from the data de-identification policy and the voter registration access policy of the relevant state.
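One way to read this derivation is as a set intersection between the fields a policy releases and the fields a state's voter list exposes. The sketch below is an assumption about how that could be coded, not the authors' procedure. The policy field sets follow the Sensitive data policies section (with Year of Birth listed under Limited Dataset because it is implied by Date of Birth), and the Ohio voter fields are those reported in the g-Distinct analysis; the function and dictionary names are hypothetical.

```python
# Demographic fields left in a release under each HIPAA policy.
# Date of Birth implies Year of Birth, so LIMITED lists both.
POLICY_FIELDS = {
    "SAFE": {"Year of Birth", "Gender", "Race"},
    "LIMITED": {"Year of Birth", "Gender", "Race", "County", "Date of Birth"},
}

# Linkable fields in a state's voter list; only Ohio is filled in here.
VOTER_FIELDS = {
    "Ohio": {"County", "Year of Birth"},
}


def attacker_fields(policy, attack, state=None):
    """Return the fields an attacker can use for linkage."""
    released = POLICY_FIELDS[policy]
    if attack == "GENERAL":
        # Fully informed attacker: every released demographic field is usable.
        return set(released)
    if attack == "VOTER":
        # Limited attacker: only fields present in both sources are usable.
        return released & VOTER_FIELDS[state]
    raise ValueError("attack must be 'GENERAL' or 'VOTER'")


limited_voter = attacker_fields("LIMITED", "VOTER", "Ohio")
safe_voter = attacker_fields("SAFE", "VOTER", "Ohio")
```

For Ohio this yields (County, Year of Birth) under Limited Dataset and (Year of Birth) under Safe Harbor.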
Post-analysis calculations
Trust differential
We use the re-identification risk estimates to compare the protective capability of data sharing policies through a mechanism we call the trust differential. This term stems from the practice of using several policies to govern the disclosure of the same dataset. In the case of the public and research datasets, the latter contain more information because the researchers are more trusted or are discouraged, through various penalties, from violating a use agreement. Formally, we model the differential as the ratio of policy-specific risks as $R_{j,g}(A)/R_{j,g}(B)$, where $R_{j,g}(X)$ is the risk measure for the group size g under policy X as computed by re-identification metric j. Imagine that policy A corresponds to Limited Dataset and policy B corresponds to Safe Harbor. Then, the resulting ratio quantifies the extent to which researchers are more trusted than the general public. Calculation of the trust differential specifies the degree to which the latter policy better protects the data.
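As a small illustration of the ratio, the sketch below divides one policy's risk by another's. The function name is hypothetical, and returning infinity when the second risk is zero is a choice made here, not something specified in the paper. The inputs are the rounded Ohio percentages of 1-distinct individuals reported in the Results (18.7% under Limited Dataset and 0.0003% under Safe Harbor), so the output is only as precise as that rounding.

```python
def trust_differential(risk_a, risk_b):
    """Ratio of the risk under policy A to the risk under policy B."""
    if risk_b == 0:
        # Assumption: treat a zero denominator as an unbounded differential.
        return float("inf")
    return risk_a / risk_b


# Rounded Ohio values for g = 1 in the GENERAL scenario, in percent.
ohio_general_g1 = trust_differential(18.7, 0.0003)
```

With these rounded inputs the differential is a little over 62 000.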
Cost analysis
While an economic analysis does not fit strictly into the diagram in figure 1, it is a logical and practical aspect of the voter attack to study. Cost acts as a deterrent in computer security-related incidents [33], such that an attack on privacy will only be attempted if the net gain is greater than the net cost. Voter registration lists, along with many other identified datasets, may be available to an attacker, but at a certain price. An economic analysis with respect to any of the above measures is then the price in dollars for the resource normalized by the result of the re-identification risk metric, that is C/R, where C is the cost for the resource, and R is the expected risk to the dataset from an attacker using that resource as computed in equation (4). For example, total risk conveys essentially the expected number of re-identifications. Thus, the economic analysis with respect to total risk will be an estimate of the price the attacker pays for each successful re-identification. All things being equal, we assume an attacker will be more drawn to an attack with a lower cost to success ratio.
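The C/R calculation is simple enough to state directly. In the sketch below the function name and the example price and risk are hypothetical; they are not values from the survey. Returning infinity when no re-identifications are expected is an assumption made here to represent an attack that cannot succeed at any price.

```python
def cost_per_reidentification(price_dollars, expected_reidentifications):
    """Price of the identified resource divided by the expected number of re-identifications."""
    if expected_reidentifications <= 0:
        # Assumption: an attack with no expected successes has unbounded unit cost.
        return float("inf")
    return price_dollars / expected_reidentifications


# Hypothetical example: a $500 voter list expected to yield 250 re-identifications.
example_cost = cost_per_reidentification(500.0, 250.0)
```

A freely available list gives a cost of $0 per re-identification whenever the expected risk is positive, which corresponds to the lower end of the range reported in the abstract.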
RESULTS
For each US state we set g equal to 1, 3, 5, and 10 and for one state, we performed a more detailed analysis, such that g was evaluated over the range 1 through 20 000. We performed a cost analysis using the total risk measure over the same range. For presentation purposes, we have divided the major results of the evaluation to first report results computed with g-distinct, and then results calculated by total risk measures.
In general, we use a combination of factors to perform our risk analysis and use the <Policy, Attack> pair to summarize the specific evaluation. Policy refers to the health data sharing policy and corresponds to either the Safe Harbor (SAFE) or Limited Dataset (LIMITED) policy. Attack refers to the information we assume is available to the adversary and refers to the GENERAL or VOTER scenario.
g-Distinct analysis
The g-distinct analysis enables data managers to inspect a particular cross-section of the population, namely the individuals whose records are most vulnerable to re-identification by virtue of being the most distinctive. The plots in figure 3 illustrate the results for the state of Ohio. The analysis of this state is particularly interesting because its voter registration list includes (County, Year of Birth) and is thus different from either of the two HIPAA policies. The risk analysis for <LIMITED, GENERAL> measures the re-identification risks associated with the Ohio population using the attributes of (County, Gender, Date of Birth, Race), and <LIMITED, VOTER> using the attributes (County, Year of Birth), while the risk analysis for <SAFE, GENERAL> uses (Gender, Year of Birth, Race), and <SAFE, VOTER> uses (Year of Birth).
Figure 2 Interplay of data sources in re-identification.
Figure 3 g-Distinct risk analysis for the state of Ohio. (A) g=1 to 5 (B) g=1 to 20 000.
Both plots in figure 3 represent the same result, but at different granularities. The plot on the left focuses on the population that is particularly distinct, those identical to 5 or fewer people. We focus on this cut-off because it is a common risk threshold adopted by many healthcare and statistical agencies. We observe that there is a large gap between the risk associated with Limited Dataset and the other risks measured. Under Limited Dataset, 18.7% of the population is 1-distinct, or unique, and 59.7% are 5-distinct. In contrast, under Safe Harbor, 0.0003% are 1-distinct and 0.002% are 5-distinct. When these patterns are inspected over a wider range of values of g, as shown in the plot on the right, the pattern continues, such that the risk under Limited Dataset rises quickly, surpassing 99.9% by g=31. In other words, fewer than 0.1% of the population in Ohio is expected to share the combination of (County, Gender, Date of Birth, Race) with more than 31 people.
The sheer number of distinct individuals can be startling. If a researcher receives a dataset drawn at random from the population of Ohio under Limited Dataset provisions, more than 1 out of 6 of those represented would be unique based on demographic information. Remember, though, that uniqueness is not sufficient to claim re-identification. There is still need for an identified dataset and VOTER reflects this reality. While higher than the risk under Safe Harbor, <LIMITED, VOTER> is significantly lower than <LIMITED, GENERAL>, particularly for smaller values of g. According to <LIMITED, VOTER>, only 0.002% of the population is 1-distinct and 0.01% is 5-distinct. As we increase g, we find that more than 50% of the population is 3500-distinct under the same constraints. In other words, very few individuals are readily identifiable with any certainty. In comparison, less than 1% of the population is 20 000-distinct for <SAFE, VOTER>. Either way, the probability of re-identification is small, but non-zero.
We can see more precisely how the two policies compare in
figure 4, which displays the trust differential for both GENERAL
and VOTER. In GENERAL, the trust differential for the two
policies ranges from approximately 5 to 90 000, while the VOTER
trust differential ranges from approximately 67 to more than 3.9
trillion. The extremely high values are found for the lowest
values of g, where small differences in values are sufficient to
make the differential oscillate, as can be seen in the plot.
Consistently, however, the trust differential is large even with g
equal to 20 000. It is perhaps an important feature that the trust
differential is greatest for low values of g, again, for the
individuals who are most susceptible to re-identification.
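As a rough illustration, the sketch below treats the trust differential as the ratio of the Limited Dataset risk to the Safe Harbor risk at the same value of g. This reading is an assumption on our part, although it is consistent with the magnitudes reported here. The function name is hypothetical, and the inputs are the rounded Ohio GENERAL percentages quoted earlier.

```python
import math


def trust_differential(limited_risk, safe_risk):
    """Ratio of the risk under Limited Dataset to the risk under Safe Harbor.

    Assumption: both risks are expressed in the same units (for example,
    percent of the population that is g-distinct). When the Safe Harbor
    risk is zero the ratio is undefined; infinity is returned here purely
    as a convention.
    """
    if safe_risk == 0:
        return math.inf
    return limited_risk / safe_risk


# Rounded Ohio GENERAL figures quoted in the text (percent of population).
ohio_g1 = trust_differential(18.7, 0.0003)  # 1-distinct
ohio_g5 = trust_differential(59.7, 0.002)   # 5-distinct
print(round(ohio_g1), round(ohio_g5))
```

Both values fall inside the range of approximately 5 to 90 000 reported for GENERAL. Because the inputs are rounded, the ratios are only indicative.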
While the above results demonstrate the power of the g-
distinct analysis and the effects of different choices of g, they are
not necessarily representative of the results for other states.
Thus, figure 5 shows the range of vulnerabilities for selected small
values of g for all 50 states (details for all states are in online
Appendix D). True to the results found in Ohio, vulnerabilities
under Safe Harbor are lower than those under Limited
Dataset. Safe Harbor vulnerabilities, however, are spread over
a wide range of small values, sufficient to create outliers, seen in
both of the Safe Harbor analyses in figure 6. Additionally, notice
the reduction of risk when attack-specific information is
introduced. While the 10-distinctiveness of the states ranges from
0.44 to nearly 1, with a median of 0.925, the attack-specific
10-distinctiveness ranges from 0 to 0.99, with a median of 0.36. In
other words, considering the actual attack tends to yield much lower
risk estimates, particularly when analyzing a less restrictive
policy.
Figures 6 and 7 provide another perspective on the results in
figure 5. In these plots, we show the two most vulnerable and
two least vulnerable states according to 1-distinct, for their
respective risk estimate and policy. These results summarize how
the states' re-identification risk changes for various g (values for
each US state are provided in online Appendix E). Our goal was to
characterize how changes in re-identification risk related to each
other across states. In other words, we wanted to determine how
decisions made for risk thresholds affected the re-identification
estimates of the states. For the most part, the rankings remain
fairly consistent, but not universally. In particular, we observed
that the most substantial change within the range g less than 10
is the state of Kentucky for <LIMITED,VOTER>. This state had
the second greatest percentage of 1-distinct individuals, but is
ninth at the 10-distinct level. Thus, an attacker may shift focus
from one state to another depending on the policy and risk
threshold.
Total risk analysis
While g-distinct estimates enable analysts to determine which
states are the most vulnerable given a particular policy, the total
risk measure estimates the number of re-identifications that
could theoretically be achieved by an attacker. It is important to
recognize that each record has some non-zero probability of
being re-identified, even if very small. The total risk measure
aggregates these probabilities.
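One way to picture this aggregation is the following minimal sketch. It assumes that a record in a group of k indistinguishable people is correctly re-identified with probability 1/k; this is an illustrative reading of the measure rather than a restatement of its formal definition. The function name and data are hypothetical.

```python
from collections import Counter


def total_risk(records):
    """Expected number of correct re-identifications, assuming each record
    in a group of size k is re-identified with probability 1/k."""
    group_sizes = Counter(records)
    return sum(1.0 / group_sizes[record] for record in records)


# Hypothetical toy population of (county, year of birth) tuples.
population = [
    ("Franklin", 1970),
    ("Franklin", 1970),
    ("Franklin", 1982),
    ("Athens", 1955),
    ("Athens", 1990),
    ("Athens", 1990),
    ("Athens", 1990),
]

print(total_risk(population))
```

Under this assumption each group contributes exactly one expected re-identification, so the total equals the number of distinct attribute combinations (four in the toy population).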
Table 1 displays the results of the total risk analysis for the
states with the top three and bottom three trust differentials for
GENERAL and VOTER. A complete list of states and their total
risk measures under these policies and types of analysis can be
found in online Appendix E. In contrast to the state of Ohio, as
previously discussed, the state of Texas's voter registration policy
includes all of the fields available in Limited Dataset releases.
Therefore, the health record policy is the limiting factor, meaning
that GENERAL and VOTER are identical. For the rest of the states
the voter registration policy is the limiting factor, and thus the
GENERAL and VOTER are different. For some states, this is
a slight difference, such as Virginia, whereas for others it is several
orders of magnitude different, such as Alaska. In states where the
voter registration policy is more restrictive than the health data
sharing policy, administrators might consider data release policies
that favor more information.
Figure 4 Trust differential (plotted on log scale) between Limited
Dataset and Safe Harbor for the state of Ohio.
The difference between the Safe Harbor and Limited Dataset
risks can be seen in the trust differential, also shown in table 1.
While the trust differential calculated for GENERAL displays
a wide range, the extent of the differences is several orders of
magnitude less than that of the trust differential for VOTER.
For administrators using the trust differential to make data
sharing decisions, this difference highlights the importance of
VOTER analysis when making policies that will apply across states.
Cost analysis
The estimated price per re-identification for VOTER is shown in
table 2. The top of the table shows the states with the three
minimum and maximum costs per re-identification under Limited
Dataset, while the bottom shows the same for Safe Harbor. Details
for all states are provided in online Appendix E. The estimated cost
per re-identification under Limited Dataset ranges from $0 to more
than $800. Among the states with no charge for their voter
registration lists, Virginia has the highest total risk, with an
estimated 3.1 million re-identifications possible. Under Safe Harbor,
the estimated cost per re-identification again starts at $0, though
this time with a maximum total risk of 1431 expected
re-identifications in North Carolina, and reaches a high of $17 000
per re-identification in West Virginia. This analysis highlights not
only what is possible with a particular attack, but also what is
likely given these real-world constraints. Particularly for the
marketer attack model, the cost and effort involved in achieving
re-identifications are an important consideration.
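The arithmetic behind table 2 can be sketched as the price of a state's voter registration list divided by the expected number of re-identifications (the total risk). This formula is an assumption that is consistent with the table's rows. The West Virginia list price of $17 000 used below is inferred from the Safe Harbor row (total risk of 1) rather than stated directly in this section, and the function name is hypothetical.

```python
def cost_per_reidentification(list_price, expected_reidentifications):
    """Price of the identified resource divided by the expected number of
    re-identifications. Returns None when no re-identification is expected,
    since the cost per re-identification is then undefined."""
    if expected_reidentifications <= 0:
        return None
    return list_price / expected_reidentifications


# Inferred West Virginia list price in US dollars (see the assumption above).
wv_price = 17000
print(round(cost_per_reidentification(wv_price, 55)))  # Limited Dataset row
print(round(cost_per_reidentification(wv_price, 1)))   # Safe Harbor row
print(cost_per_reidentification(0, 3159764))           # free list, Virginia
```

With these inputs the sketch gives 309 and 17000 for West Virginia, matching the two table rows, and 0.0 for a state whose list is free.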
DISCUSSION
In this paper, we introduced methods for estimating
re-identification risk for various de-identification data sharing
policies. We also evaluated the risk of re-identification from a known
attack in the form of voter registration records. Our evaluation
revealed that the differences in population distributions of US
states and their policies for disseminating voter registries lead to
varying re-identification risks. Use of risk estimation approaches
has the potential to improve design and implementation of data
sharing policies. Here, we elaborate on some of the more
pressing issues and future directions.
Figure 5 Distribution of g-distinct computations for all US states,
clockwise from top left: (A) <SAFE,GENERAL>; (B) <LIMITED,GENERAL>;
(C) <SAFE,VOTER>; and (D) <LIMITED,VOTER>.
Figure 6 Ranks for top and bottom two states. (A) <LIMITED,GENERAL>;
(B) <LIMITED,VOTER>.
Figure 7 Ranks for top and bottom two states. (A) <SAFE,GENERAL>;
(B) <SAFE,VOTER>.
From theory to application
Our analysis provides a basis for comparing different privacy
protection schemes both theoretically and with respect to
real-world attacks. As such, the approach may be useful to privacy
officials defining new policies. The difference between the
GENERAL risk and VOTER risk analysis shows a wide gap
between a perceived problem (the threat of re-identification using
voter registration lists) and the actual results of such an attack.
Furthermore, the performance of such an analysis on a
state-by-state level shows that the results vary widely across the
country. Data administrators in a state with a more permissive voter
registration policy may wish to be more conservative in the data
released, knowing the wealth of demographic information
available in this single source. Comparatively, administrators in
states with more restrictive voter registration policies might be
interested in performing similar analyses for other available
sources of identified data. They may ultimately conclude that the
identified data sources that are readily available in their area are
such that additional information may be included in a de-identified
dataset without greatly increasing the re-identification risk. In
essence, there are (at least) three different policy-making bodies
that must be aware of one another: the medical data-sharing policy
makers, the public records policy makers, and the data administrators
making decisions about particular datasets. When making
new policies or other policy-related decisions, the different
policy-making bodies should be aware that their separate policies
interact and their combined actions influence privacy.
Therefore, we take a moment to sketch an approach for policy
makers to set appropriate protections. First, to set a specific
policy, analysts should test several different policy options and
document their effects on the whole population. The results of
this analysis would enable the policy maker to compare policies
and also to create a target identifiability range. This would define
the acceptable level of risk permitted by the policy. Second,
when an actual dataset is ready for release, the policy should be
reexamined in light of that specific dataset. If a simple
application of the policy as written leads to a risk outside the
acceptable identifiability range, that dataset would be subject to
further transformation before release, requiring additional
suppression or retraction of certain fields. Alternatively, policy
makers could authorize the release of additional fields if the
estimated risk was found to be below the acceptable threshold.
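The second step of this approach can be expressed as a simple decision rule. The sketch below is only illustrative: the function name, the returned labels, and the numeric range are hypothetical, and in practice the range would come from the population-level analysis in the first step.

```python
def release_decision(estimated_risk, acceptable_low, acceptable_high):
    """Compare a dataset's estimated re-identification risk with a target
    identifiability range and suggest the next action."""
    if acceptable_low > acceptable_high:
        raise ValueError("acceptable_low must not exceed acceptable_high")
    if estimated_risk > acceptable_high:
        # Above the range: suppress or generalize fields before release.
        return "transform"
    if estimated_risk < acceptable_low:
        # Below the range: releasing additional fields may be acceptable.
        return "consider_additional_fields"
    return "release"


# Hypothetical target range, expressed as a fraction of 1-distinct records.
print(release_decision(0.187, 0.0001, 0.01))
print(release_decision(0.005, 0.0001, 0.01))
print(release_decision(0.00001, 0.0001, 0.01))
```

The three calls return "transform", "release", and "consider_additional_fields", respectively.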
Limitations and future work
The general approach of this work is limited by certain
assumptions and simplifications. First, the estimates computed
for the case study are only as complete as our population
information. Although the US Census Bureau reports that the
2000 Census is more accurate and complete than previous
censuses, the undercount rate is close to 1%.³⁰ Second, we used
the 2000 Census as an estimate of the current population as
opposed to the current population density. Third, we conflated
the age reported in the Census with the year of birth reported in
voter registration lists and sensitive records. For date of birth, we
used a statistical model that assumes uniform distribution of
birth dates. Yet, reports have shown that this may not be
accurate,³³ so our estimates may misrepresent the number of distinct
individuals.
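To make the uniform birth date assumption concrete, the following sketch computes the expected number of g-distinct people among n individuals who already share all other attributes (for example, county, gender, year of birth, and race), when each person's birthday is drawn uniformly and independently from 365 days. It is a standard binomial calculation offered for illustration and is not necessarily the exact statistical model used in this study; the function name is hypothetical.

```python
from math import comb


def expected_g_distinct(n, g, days=365):
    """Expected number of people, out of n, who share their birth date with
    at most g - 1 others, assuming uniform and independent birthdays."""
    if n <= 0:
        return 0.0
    p = 1.0 / days
    # Probability that at most g - 1 of the other n - 1 people share one's day.
    prob = sum(
        comb(n - 1, k) * p**k * (1 - p) ** (n - 1 - k)
        for k in range(min(g, n))
    )
    return n * prob


# Expected number of unique individuals in hypothetical bins of varying size.
for n in (1, 50, 365, 2000):
    print(n, round(expected_g_distinct(n, 1), 2))
```

For g=1 the expression reduces to n(1 - 1/365)^(n-1), so the share of unique individuals falls as the bin grows. If real birth dates are not uniform, this calculation would tend to overstate uniqueness on common birth dates and understate it on rare ones.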
Nonetheless, the idea provides several future research
opportunities. First, we performed analysis for populations as
a whole, but not for specific datasets. We believe a similar
approach that defines the fields of intersection would be useful
for dataset-specific analysis. An evaluation using a specific
sensitive dataset, or multiple datasets, would allow for
comparison of the theoretical risk types we evaluated here
with more concrete measures. Second, this work focuses on
the attack-specific risk posed by publicly available voter
registration lists. While our survey provides accurate
information on statewide lists, in some states voter registries
are available from county governments. In Arizona, for
instance, county governments are the only source for
voter registration lists. Further research could show whether
small counties, with more distinctive populations, or larger
counties, with a lower cost per entry in the voter registries, are
more vulnerable to re-identification attacks. Additionally,
similar analysis could be performed with myriad other public
datasets which an attacker might use for re-identification
purposes.
Finally, a hurdle to the adoption of any new evaluation tool is
its implementation. The risk analysis process described here can
be replicated, but the implementation of such a system may be
a burden. A software tool can be developed to automate the
process of analyzing either a general population or a particular
dataset with regard to its distinctiveness and its susceptibility to
a predetermined set of attack models. We imagine that such
a tool would have information on multiple attack models, and
could include different tools for estimating distinctiveness; we
are in the process of developing such a tool.
Table 1 Percentage of state population vulnerable to re-identification
and the trust differential according to the total risk measure
Differential rank   State   Limited Dataset   Safe Harbor   Trust differential
General
50                  DE      37.58             0.16          229
49                  RI      35.25             0.13          275
48                  AK      62.51             0.21          297
3                   NY      25.69             0.01          3251
2                   CA      19.28             <0.01         4291
1                   TX      36.90             0.01          5172
Voter
50                  HI      0.01              <0.01         22
49                  ND      12.38             0.01          884
48                  AZ      24.61             0.02          1177
3                   PA      15.31             <0.01         13088
2                   VA      8.20              <0.01         12507
1                   MO      36.90             0.01          5171
Table 2 Estimated cost per re-identification
State   Rank   Total risk   Price per re-id
Limited Dataset
VA      50     3159764      US$0
NY      49     2905697      US$0
SC      48     2231973      US$0
WI      3      72           US$174
WV      2      55           US$309
NH      1      10           US$827
Safe Harbor
NC      50     1431         US$0
SC      49     1386         US$0
NY      48     221          US$0
WI      3      2            US$6250
NH      2      1            US$8267
WV      1      1            US$17 000
CONCLUSION
This research provided a set of approaches for estimating the
likelihood that de-identified information can be re-identified in
the context of data sharing policies associated with the HIPAA
Privacy Rule. The approaches are amenable to various levels of
estimation, such that policy makers and data administrators can
evaluate policies and determine the potential impact on
re-identification risk. Moreover, we demonstrated that such
approaches enable comparison of disparate data protection
policies such that risk tradeoffs can be formally calculated. We
demonstrated the effectiveness of the approach by evaluating
the re-identification risks associated with real population
demographics at the level of the US state. Furthermore, this
work demonstrates the importance of considering not just what
is possible, but also what is likely. In this regard, we considered
how de-identification policies fare in the context of the well
publicized voter registration linkage attack, and demonstrated
that risk fluctuates across states as a result of differing public
record sharing policies. We believe that with the methods
proposed above and awareness of how different policies interact
to affect privacy, a policy maker can make more informed policy
decisions tailored to the needs and concerns of particular
datasets. Finally, we have outlined several routes for improvement
and extension of the framework, including the incorporation of
up-to-date population distribution information and application
development.
Acknowledgments We thank the Steering Committee of the Electronic Medical
Record & Genomics Project, particularly Ellen Clayton, Teri Manolio, Dan Masys, Dan
Roden, and Jeff Streuwing for discussion and their insightful comments, from which
this work greatly benefited. We also thank Aris Gkoulalas-Divanis, Grigorios Loukides,
and John Paulett for reviewing an earlier version of the manuscript.
Funding This research was supported in part by grants from the Vanderbilt Stahlman
Faculty Scholar program and the National Human Genome Research Institute
(1U01HG00460301).
Competing interests None.
Provenance and peer review Not commissioned; externally peer reviewed.
REFERENCES
1. Blumenthal D. Stimulating the adoption of health information technology. N Engl
J Med 2009;360:1477-9.
2. Safran C, Bloomrosen M, Hammond E, et al. Toward a national framework for the
secondary use of health data: an American Medical Informatics Association white
paper. J Am Med Inform Assoc 2007;14:1-9.
3. Weiner M, Embi P. Toward reuse of clinical data for research and quality
improvement: the end of the beginning? Ann Intern Med 2009;151:359-60.
4. National Institutes of Health. Final NIH statement on sharing research data
NOT-OD-03-032. February 26, 2003.
5. National Institutes of Health. Policy for sharing of data obtained in NIH supported
or conducted genome-wide association studies (GWAS) NOT-OD-07-088. August 28,
2007.
6. Sweeney L. Uniqueness of simple demographics in the U.S. population.
Working paper LIDAP-WP4. Pittsburgh, PA: Data Privacy Lab, Carnegie Mellon
University, 2000.
7. Golle P. Revisiting the uniqueness of simple demographics in the US population. In:
Proc 5th ACM Workshop on Privacy in Electronic Society. 2006:77-80.
8. U.S. Dept. of Health and Human Services. Standards for privacy of individually
identifiable health information, Final Rule. Federal Register 2002; 45 CFR, Parts
160-4.
9. Lambert D. Measures of disclosure risk and harm. J Off Stat 1993;9:407-26.
10. Maynord A. New details reveal numerous mistakes prior to election commission
break-in. Nashville City Paper January 4, 2008.
11. Golab A. Social Security data puts 1.3 mil. voters at risk: suit. Chicago Sun-Times
January 23, 2007:13.
12. Agrawal R, Johnson C. Securing electronic health records without impeding the flow
of information. Int J Med Inf 2007;76:471-9.
13. Fung B, Wang K, Yu P. Anonymizing classification data for privacy preservation. IEEE
Trans Knowl Data Eng 2007;19:711-25.
14. Gionis A, Tassa T. k-anonymization with minimal loss of information. IEEE Trans
Knowl Data Eng 2009;21:206-19.
15. Jiang W, Atzori M. Secure distributed k-anonymous pattern mining data. Proc 6th
IEEE International Conference on Data Mining 2006:319-29.
16. Machanavajjhala A, Gehrke J, Kifer D, et al. l-diversity: privacy beyond k-anonymity.
ACM Trans Knowl Discov Data 2007;1:3.
17. McGuire A, Gibbs T. Genetics: no longer de-identified. Science 2006;312:370-1.
18. National Research Council. State voter registration databases: immediate actions
and future improvements, interim report. Washington, DC: National Academy of
Sciences, 2008.
19. Samarati P. Protecting respondents’ identities in microdata release. IEEE Trans
Knowl Data Eng 2001;13:1010-27.
20. Sweeney L. Weaving technology and policy together to maintain confidentiality.
J Law Med Ethics 1997;25:98-110.
21. Alexander K, Mills K. Voter privacy in the digital age. Report from the California Voter
Foundation. Davis, CA: California Voter Foundation, 2004.
22. Greenberg B, Voshell L. Relating risk of disclosure for microdata and geographic area
size. Proc Section on Survey Research Methods, American Stat Assoc 1990:450-55.
23. Skinner C, Holmes D. Estimating the re-identification risk per record in microdata.
J Off Stat 1998;14:361-72.
24. Skinner C, Elliot M. A measure of disclosure risk for microdata. J R Stat Soc
2002;64:855-67.
25. Truta T, Fotouhi F, Barth-Jones D. Disclosure risk measures for microdata. Proc 15th
International Conference on Scientific and Statistical Database Management.
2003:15-22.
26. Princeton Survey Research Associates. Medical privacy and confidentiality survey 1999.
27. Reidpath K, Chan K. HIV discrimination: integrating the results from a six-country
situational analysis in the Asia Pacific. AIDS Care 2005;17:195-204.
28. Parker R, Aggleton P. HIV and AIDS-related stigma and discrimination: a conceptual
framework and implications for action. Soc Sci Med 2003;57:13-24.
29. El Emam K, Brown A, AbdelMalik P. Evaluating predictors of geographic area
population size cut-offs to manage re-identification risk. J Am Med Inform Assoc
2009;16:256-66.
30. Alabama Administrative Code. http://www.alabamaadministrativecode.state.al.us/
alabama.html.
31. Mulry M. Summary of accuracy and coverage evaluation for census 2000. Research
Report Statistics #21006-3 for U.S. Census Bureau 2006.
32. U.S. Census Bureau. American FactFinder. http://factfinder.census.gov/.
33. Schechter SE. Toward econometric models of the security risk from remote attacks.
IEEE Security and Privacy Magazine 2005;3:40-4.
Article
The secondary use of treatment data for research has great potential to expand biomedical knowledge and improve patient care. At the same time, various challenges must be overcome to realize this potential more fully. This is particularly true in Germany, where, compared with other countries such as Denmark or Finland, the secondary research use of treatment data is underdeveloped. Intensifying the use of data from the diagnosis and treatment of patients, and developing the structures this requires in Germany, is an ethical and political imperative: for improving healthcare through new insights, therapies, and quality assurance. For this reason, the authors call on all actors in the healthcare system, from the federal and state governments to health insurers, hospitals, physicians, and patients, to help establish a culture of secondary research use of treatment data. Based on the results of a three-year collaborative project, the authors, who come from the fields of medicine, law, ethics, and the social sciences, answer the question of what is needed so that patients' data in Germany can be, and actually are, used systematically for medical research and for analyzing the quality of clinical care. They set out ten central demands and present a list of 23 recommendations with concrete measures that the named actors should implement.
Article
Market research companies collect extensive data on purchasing, travel, and app and media usage behaviors of consumers, prescriptions written by physicians, and so forth. Although the companies provide assurances of anonymity to the study participants, there is a significant concern about the vulnerability of these data. Could a motivated intruder match the pattern of purchases with the name and other personal and potentially sensitive details of an individual? We find that 17% to 94% of market research panelists in 15 frequently bought consumer goods categories are subject to high risk of reidentification through a potential record linkage attack based on their unique purchasing histories, even when their identities are anonymized. We also demonstrate that the risk of reidentification in such data is vastly understated by the conventional measure, unicity, and propose a new measure, termed “sno-unicity.” To protect the privacy of panelists, we consider the well-known privacy notion of k-anonymity and develop a new approach called “graph-based minimum movement k-anonymization” that is designed especially for retaining the usefulness of panel data. We show that our approach works well in protecting participants’ privacy without substantially altering the information that marketers need for sound marketing decisions.
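To make the "unicity" baseline mentioned in this abstract concrete, the following is a minimal sketch that assumes a simplified reading of the term: the share of records whose combination of chosen attributes appears exactly once in the dataset. The function name `unicity`, the field names, and the toy panel are illustrative assumptions, not taken from the cited work, and the sketch does not implement "sno-unicity" or the graph-based k-anonymization the authors propose.

```python
from collections import Counter


def unicity(records, keys):
    """Share of records whose combination of `keys` values appears exactly once.

    Simplified, assumed definition for illustration only.
    """
    if not records:
        return 0.0
    # One signature per record: the tuple of values an intruder could match on.
    signatures = [tuple(r[k] for k in keys) for r in records]
    counts = Counter(signatures)
    unique = sum(1 for s in signatures if counts[s] == 1)
    return unique / len(records)


# Hypothetical purchase records (made-up values).
panel = [
    {"store": "A", "category": "coffee", "week": 1},
    {"store": "A", "category": "coffee", "week": 1},
    {"store": "B", "category": "tea", "week": 2},
    {"store": "C", "category": "coffee", "week": 3},
]

# Two of the four records are unique on (store, category, week).
print(unicity(panel, ["store", "category", "week"]))  # 0.5
```

Under this reading, a record counted as unique corresponds to a group of size g = 1 in the terminology of the main article.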
Article
The availability of large, deidentified health datasets has enabled significant innovation in using machine learning (ML) to better understand patients and their diseases. However, questions remain regarding the true privacy of this data, patient control over their data, and how we regulate data sharing in a way that does not encumber progress or further potentiate biases for underrepresented populations. After reviewing the literature on potential reidentifications of patients in publicly available datasets, we argue that the cost of slowing ML progress, measured in terms of access to future medical innovations and clinical software, is too great to limit sharing data through large publicly available databases for concerns of imperfect data anonymization. This cost is especially great for developing countries, where the barriers preventing inclusion in such databases will continue to rise, further excluding these populations and increasing existing biases that favor high-income countries. Preventing artificial intelligence’s progress towards precision medicine and sliding back to clinical practice dogma may pose a larger threat than concerns of potential patient reidentification within publicly available datasets. While the risk to patient privacy should be minimized, we believe this risk will never be zero, and society has to determine an acceptable risk threshold below which data sharing can occur, for the benefit of a global medical knowledge system.
Publishing data about individuals without revealing sensitive information about them is an important problem. In recent years, a new definition of privacy called k-anonymity has gained popularity. In a k-anonymized dataset, each record is indistinguishable from at least k − 1 other records with respect to certain identifying attributes. In this article, we show using two simple attacks that a k-anonymized dataset has some subtle but severe privacy problems. First, an attacker can discover the values of sensitive attributes when there is little diversity in those sensitive attributes. This is a known problem. Second, attackers often have background knowledge, and we show that k-anonymity does not guarantee privacy against attackers using background knowledge. We give a detailed analysis of these two attacks, and we propose a novel and powerful privacy criterion called ℓ-diversity that can defend against such attacks. In addition to building a formal foundation for ℓ-diversity, we show in an experimental evaluation that ℓ-diversity is practical and can be implemented efficiently.
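The k-anonymity definition in this abstract, and the homogeneity problem it describes, can be illustrated with a short sketch. The helper names, field names, and toy table below are assumptions made for illustration. The ℓ-diversity check counts distinct sensitive values per group, which is only the simplest reading of the criterion and may differ from the formal variants defined in the cited article.

```python
from collections import defaultdict


def group_by_quasi_identifiers(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        groups[key].append(record)
    return groups


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every record shares its quasi-identifiers with at least k - 1 others."""
    groups = group_by_quasi_identifiers(records, quasi_identifiers)
    return all(len(group) >= k for group in groups.values())


def is_distinct_l_diverse(records, quasi_identifiers, sensitive, l):
    """True if every group contains at least l distinct sensitive values."""
    groups = group_by_quasi_identifiers(records, quasi_identifiers)
    return all(
        len({record[sensitive] for record in group}) >= l
        for group in groups.values()
    )


# Hypothetical generalized table (made-up values).
table = [
    {"zip": "130**", "age": "<30", "diagnosis": "heart disease"},
    {"zip": "130**", "age": "<30", "diagnosis": "heart disease"},
    {"zip": "148**", "age": ">=40", "diagnosis": "cancer"},
    {"zip": "148**", "age": ">=40", "diagnosis": "flu"},
]
qi = ["zip", "age"]

print(is_k_anonymous(table, qi, 2))                      # True
print(is_distinct_l_diverse(table, qi, "diagnosis", 2))  # False
```

The toy table is 2-anonymous, yet everyone in the first group has the same diagnosis, so knowing a person's ZIP prefix and age band reveals the sensitive value. This is the first of the two attacks the abstract mentions.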
Article
A measure of re-identification risk at the record level has a variety of potential uses in statistical disclosure control for microdata. The conceptual basis of such a measure is considered. The risk is conceived of broadly as the evidence in support of a link between the record and the unit in the population from which it is derived. For discrete key variables subject to no measurement error, a measure is derived which reflects the probability that the record is unique in the population. Under certain assumptions, two approaches are described for estimating this measure from the microdata. These approaches are applied to a 10% sample of microdata from the 1991 Census in Great Britain. It is found that the resulting risk measures can indeed be used successfully to establish whether sample unique records are unique in the population. The implications of these findings are discussed.