Medical Data, Reconciling Research and Data Protection
Christine Hilcenko1,2,3*, Tara Taubman-Bassirian4
1Cambridge Institute for Medical Research, Cambridge, CB2 0XY, UK.
2Department of Haematology, University of Cambridge, Cambridge, CB2 0XY,
3Wellcome Trust-Medical Research Council Stem Cell Institute, University of
Cambridge, Cambridge, UK.
Summary: Most of our medical records are being processed electronically,
centralised and easily accessible. In this paper, we will discuss the advantages of
the system for research, as well as its potential challenges.
Keywords: health data, privacy, data protection, GDPR, security, information,
research, EMR, EHR, European Data Hub.
I - The purpose and benefits of digital health data
1. Electronic Health Record (EHR) or Electronic Medical Record
An electronic medical record (EMR) is a digital version of a patient chart stored in a computer. An electronic health record (EHR) is the systematised collection of patient and population health information stored electronically in a digital format. These records are shared across different countries and different medical departments. EHRs are real-time, patient-centred records that make information available instantly and securely to authorised users. Hand-written data is replaced with electronic records to maintain a constant flow of patient-related data. Since their introduction in 2009, over 80% of points of medical care use this technology for billing and as the entry point for medical information. EHRs are
built to easily share information with other health care providers and organisations
such as laboratories, specialists, medical imaging facilities and pharmacies, allowing better diagnostics and patient outcomes based on the patient's medical history. It also improves patient participation and care coordination, and is cost saving. For
instance, EHR alerts can be used to notify providers when a patient has visited the
hospital, allowing them to proactively follow up with the patient. Every health
provider can have the same accurate and up-to-date information about a patient.
This is especially important with patients who are seeing multiple specialists,
receiving treatment in emergency settings or making transitions between care settings. It can reduce medical errors and unnecessary tests, and reduce the chance that one specialist will not know about an unrelated (but relevant) condition being
managed by another specialist. An EHR not only keeps a record of a patient's
medications or allergies, it also automatically checks for problems whenever a
new medication is prescribed and alerts the clinician to potential conflicts. It can
help providers quickly and systematically identify and correct operational
problems. In a paper-based setting, identifying such problems is much more
difficult, and correcting them can take years.
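The automated check described above can be sketched in a few lines. The drug names, the interaction table and the function names below are purely hypothetical illustrations, not clinical data or any vendor's actual API:

```python
# Illustrative sketch only: a toy conflict check of the kind an EHR might
# run when a new prescription is entered. All names and pairs are invented.

# Hypothetical interaction table: unordered pairs of conflicting drugs.
INTERACTIONS = {
    frozenset({"warfarin", "aspirin"}),
    frozenset({"simvastatin", "clarithromycin"}),
}

def check_new_prescription(current_meds, new_drug, allergies):
    """Return a list of alert strings for the clinician (empty if none)."""
    alerts = []
    if new_drug in allergies:
        alerts.append(f"ALLERGY: patient is allergic to {new_drug}")
    for med in current_meds:
        if frozenset({med, new_drug}) in INTERACTIONS:
            alerts.append(f"INTERACTION: {new_drug} conflicts with {med}")
    return alerts

# A patient on warfarin is prescribed aspirin: the clinician is alerted.
alerts = check_new_prescription(["warfarin"], "aspirin", {"penicillin"})
```

A production system would of course draw on a curated, regularly updated interaction database rather than a hard-coded table; the point here is only that the check is a cheap lookup performed at prescription time.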
2. Individual patients’ electronic records processing for wider scientific
research projects
In data protection terms, when data collected for one purpose is then deployed for a new, unrelated purpose, we are engaging in secondary processing. The
secondary processing of personal health data for scientific research in the medical
field is coveted for research and new treatments that improve public health. Its
legal status, however, is far from unproblematic. The deployment of new artificial
intelligence technologies is leading to a rethinking of biomedical research from
what was traditionally organised. This shift is important as classical research tends
to be questioned following various health scandals; fewer people are willing to participate in medical trials. In digital research, AI tools are preferably deployed
in the exploitation of databases to explore patient records, via medical imaging or
connected medical devices. The analysis of the various data sets with more
efficient computing capabilities makes it possible to increase the speed of
discoveries, acting as an accelerator of research. Clinical procedures are
simplified, time is spared, and, above all, the physical integrity of individuals is
preserved. This shift, for instance, allowed the discovery in February 2020 of an antibiotic molecule capable of bypassing antibiotic resistance.
The primary advantage of digitalising research data is the increased ease of
participation in research. Many studies struggle to recruit and retain sufficient
participants. Fewer and fewer patients take part in clinical trials. A survey indicates that physical distance to research sites is one of the main barriers to increased participation.
The second advantage is the amount of data from a larger number of participants that a decentralised study can generate. Finally, decentralised clinical trials attract a more inclusive variety of participants. If the advantages are obvious, they are not without drawbacks. Apart from the digital divide excluding populations with less access to technology, data quality, security and strong medical ethics are required. Big data analytics, based on larger data sets and using dedicated tools or algorithms, is a big asset for medical and scientific research.
Guarda, P. & Bincoletto, G.4 have examined the secondary processing of personal health data for scientific research in the medical field, looking at the controllers' obligations to comply with the data protection framework to safeguard fundamental rights and freedoms. After comparing the implementation of EU regulation into the French and Italian national legislations, they suggest a proactive, legal-technical e-health solution that "complies with the rules and principles of the legal frameworks and empowers the individual's control over personal health data while promoting medical research. To this end, the data protection by design concept plays a central role, and an interdisciplinary approach is fundamental in combining legal and technical perspectives."
The case of Yoti illustrates both the potential for research and the dangers involved. For many years, immigration departments have struggled with the physical determination of the age of immigrants claiming to be minors. Such determinations by physical examination have never been accurate. The algorithm the company Yoti has created claims to have achieved the goal with high accuracy. To do so, a large worldwide data set of children aged between 12 and 19 years old has been created. Admittedly, some parents have received financial compensation in exchange for pictures of their children with month and year of birth. By creating a data set of children's faces from various world ethnicities, the algorithm can determine the age of a minor. So far, the software has been commercialised in supermarkets for the sale of alcohol or cigarettes. It is expanding to cinemas and other places
where age verification is necessary.
Scientific research could certainly benefit from such a database to study the ageing
process in various ethnic groups. The dilemma here is how to create and store
large scale biometrics data allowing a sensible secondary use of the information.
3. Big data and scientific research: friend or foe?
As indicated above, scientific research benefits from large scale data from various
sources. This has been made easier and more cost effective thanks to digitalisation
of information and cheap data storage. Big Data is defined by Zulkarnain, N. & Anshari, M.6 as "datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze". In Kaisler et al.7, Big data is "data too big to be handled and analyzed by traditional database protocols such as SQL". Big Data are particularly large, complex datasets with high analytical potential that automated processing can analyse at higher speed than traditional processing. Big data analytics in medicine and healthcare allows analysis of large datasets from thousands of patients, identifying clusters and correlations between datasets, and improves predictive models using data mining techniques.
Big Data has great potential to help medical research create new growth opportunities, including in predictive healthcare. This is not without posing significant challenges, such as the loss of privacy and confidentiality. Privacy and integrity are key concerns for individuals and all corporations involved.
As open access to health data is beneficial to scientific research, the following sections will draw data protection lessons from several European initiatives.
4. The requirement of quality data: an essential parameter to integrate
Data accuracy is paramount for quality research results. Human or algorithmic errors and biases, or sometimes technology failures, can alter results. This issue has
been pointed out by the European Union Agency for Fundamental Rights (FRA)8
that raised serious concerns about the quality of medical data and the resulting risk
of medical errors. The quality of data in EMR/EHR also raises some concern: studies where patients were shown their medical files and asked about their accuracy found that up to 50% of the information was incomplete or erroneous. Much of the important data in EMR/EHR is unstructured, in the form of free text, which further reduces data quality. See also Miller, D.D.9.
To be validly processed, raw medical data must undergo an initial 'cleansing'10.
In this paper, the authors take an interdisciplinary look at some of the technical
and legal challenges of data cleansing against the background of the European
medical device law, with the key message that technical and legal aspects must
always be considered together in such a sensitive context.
These authors first enumerate the typical data quality issues they suggest the cleansing operation should tackle: absence of data, dummy/default values, noise (a.k.a. the "butterfly effect"), wrong data, inconsistent data, cryptic data, duplicate primary keys, non-unique identifiers, multipurpose fields, and violation of (business) rules.
With reference to the ECJ and the German Supreme Court (Bundesgerichtshof), there are potential legal consequences for faulty data in medical AI, engaging the liability of the notified body towards third persons such as patients in case an assessment procedure has been carried out without sufficient diligence11.
The authors then suggest five necessary steps to transform the raw data into information that can be worked with in the subsequent analytical steps: Parsing, Correcting, Standardizing, Matching, Consolidating.
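The five steps above can be sketched as a minimal pipeline. This is an illustrative sketch only, applied to invented free-text records; the field names and rules are assumptions for the example, not the cited authors' actual method:

```python
# Minimal sketch of the five cleansing steps (parsing, correcting,
# standardising, matching, consolidating) on hypothetical records.

RAW = [
    "P001 ; myocardial infarction ",
    "p001 ; Myocardial Infarction",   # duplicate of the first record
    "P002 ; N/A",                     # dummy/default value
]

def parse(line):
    """Parsing: split a raw line into (patient_id, diagnosis)."""
    pid, diag = [part.strip() for part in line.split(";")]
    return pid, diag

def correct(pid, diag):
    """Correcting: drop dummy/default values such as 'N/A'."""
    return None if diag.upper() in {"N/A", ""} else (pid, diag)

def standardise(pid, diag):
    """Standardising: normalise case so records become comparable."""
    return pid.upper(), diag.lower()

def consolidate(records):
    """Matching + consolidating: merge records sharing the same key."""
    merged = {}
    for pid, diag in records:
        merged.setdefault(pid, set()).add(diag)
    return merged

cleaned = []
for line in RAW:
    rec = correct(*parse(line))
    if rec is not None:
        cleaned.append(standardise(*rec))

result = consolidate(cleaned)
```

After standardisation, the two spellings of the same record match and collapse into one entry, and the dummy value is discarded, leaving a single consolidated patient record.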
Without going into further detail on these techniques: they suggest applying standardisation and corrections to data sets, sometimes derived from unstructured information such as manual diagnosis data, then comparing and matching records to avoid duplication, and finally resolving inconsistencies. The quality of this cleansing process has legal consequences: as they point out, any damage caused by malfunctions or other service failures due to faulty data will create a liability for the manufacturer of the medical device, mostly under the Medical Devices Regulation (MDR). It is required that the product in question "shall be safe and effective and shall not compromise the clinical condition or the safety of patients". The GDPR's data accuracy requirement will mostly not apply to the training data, as such data is usually anonymised.
This has led the European Parliament to propose that developers should, where
feasible, implement quality checks of the external sources of data and should put
oversight mechanisms in place regarding their collection, storage, processing, and
use of data12.
Also, in its report on the safety and liability implications of Artificial Intelligence,
the Internet of Things and Robotics, the European Commission has noted that
Union product safety legislation does not currently explicitly address the risks to safety derived from faulty data. However, according to the 'use' of the product, producers [in terms of the MDR, manufacturers] should anticipate during the design and testing phases the data accuracy and its relevance for safety functions13. The Report and the Whitepaper also call for product safety
legislation to “provide for specific requirements addressing the risks to safety of
faulty data at the design stage as well as mechanisms to ensure that quality of
data is maintained throughout the use of the AI products and systems.”14
5. When Clinical Trials Are Digital, Ethics Is Needed
The pandemic accelerated the digitalisation of clinical research, and widespread decentralised clinical trials have brought many improvements, while privacy and ethics challenges took a back seat.
Artificial intelligence's potential in healthcare is undeniable: from AI medical imaging and scanning patient health records to predict illness, to monitoring devices, systems that help track disease outbreaks, and access to medical speciality consultations to help evaluate symptoms in remote areas, to mention just a few15.
The outbreak of the COVID pandemic also saw the surge of controversial tracking
technologies. The Bluetooth technology widely used proved inadequate for contact tracing. In Singapore, the government admitted that the health data collected was repurposed beyond its original goal. Additionally, some databases were hacked, and medical data were accessed by criminals16.
In June 2021, the World Health Organization published guidance on Ethics & Governance of Artificial Intelligence for Health17, outlining six key principles for the ethical use of artificial intelligence in health. Leading experts recommended
the technology must put ethics and human rights at the heart of its design,
deployment, and use. The report warned about AI tools developed by private
technology companies (like Google and Chinese company Tencent) that have
large resources but not always the necessary ethical incentives. Their focus may
be toward profit, rather than the public good. While these companies may offer
innovative approaches, there is concern that they might eventually exercise too
much power in relation to governments, providers and patients18.
The report recommended six ethical principles: protect autonomy, promote human
safety, ensure transparency, foster accountability, ensure equity, promote an AI
that is sustainable.
In parallel, the new European regulation on AI, the AI Act19, which could come into effect in late 2024, is a proposed European law on artificial intelligence, the first law on AI by a major regulator anywhere. The law assigns applications of AI to three risk categories.
The Artificial Intelligence in Healthcare report20 supports the European
Commission in identifying and addressing any issues that might be hindering the
wider adoption of AI technologies in the healthcare sector. The study has
highlighted six categories where the European Commission is suggested to focus
to support the development and adoption of AI technologies in the healthcare
sector across the EU. These include:
1. a policy and legal framework supporting the further development and
adoption of AI aimed at the healthcare sector in particular;
2. initiatives supporting further investment in the area;
3. actions and initiatives that will enable the access, use and exchange of
healthcare data with a view to using AI;
4. initiatives to upskill healthcare professionals and to educate AI developers
on current clinical practices and needs;
5. actions addressing culture issues and building trust in the use of AI in the
healthcare sector;
6. policies supporting the translation of research into clinical practice.
Additionally, the GDPR allows for codes of conduct to be approved by national supervisory authorities. These can strengthen data protection compliance in clinical research.
The Spanish Code of Conduct21
A first national Code of Conduct was promoted by Farmaindustria in Spain
regulating the processing of personal data in the field of clinical trials and other
clinical research and pharmacovigilance. This Code regulates how the promoters
of clinical studies with medicines and the CROs that decide to adhere thereto must
apply the data protection regulations. Data controllers and data processors that
adhere to the code of conduct are obliged to comply with its provisions.
The UK Code of Conduct for data-driven health and care tech
Following a consultation, the UK Government has published a code of conduct for
data-driven health and care technology to enable the development and adoption of
safe, ethical and effective data-driven health and care technologies 22.
The Government has set out the behaviours expected from those developing,
deploying and using data-driven technologies in the health and care systems:
1. Understand users, their needs and the context;
2. Define the outcome and how the technology will contribute to it;
3. Use data that is in line with appropriate guidelines for the purpose for
which it is being used;
4. Be fair, transparent and accountable about what data is being used;
5. Make use of open standards;
6. Be transparent about the limitations of the data used and algorithms deployed;
7. Show what type of algorithm is being developed or deployed, the ethical examination of how the data is used, how its performance will be validated and how it will be integrated into health and care provision;
8. Generate evidence of effectiveness for the intended use and value for money;
9. Make security integral to the design; and
10. Define the commercial strategy and consider only entering into
commercial terms in which the benefits of the partnerships between
technology companies and health and care providers are shared fairly.
6. The case of the French Health Data Hub and the COVID research
The French government has contracted the processing of health data to Microsoft, a US corporation. Microsoft is said to have processing capabilities that currently no European company can match. Scientific research needs data
on COVID patients to study the spread of the virus, the scale of long term COVID,
the category of the population mostly affected, or the effects of the various
vaccines administered.
Several legal limitations had to be considered. First, there is the question of the transfer of European data outside the European Union, the United States being deemed a country of non-adequate data protection since the invalidation of the Privacy Shield agreement that allowed the flow of data between the two continents. The French data protection supervisory authority, the Commission Nationale de l'Informatique et des Libertés (CNIL), published an opinion expressing its concerns. It recommended that this measure be only temporary and required further guarantees from Microsoft. The French Health Minister requested the data to be stored within the EU, a measure of limited protection since, regardless of the localisation of the data, Section 702 of the US FISA, the US CLOUD Act and Executive Order 12333 would still apply.
These technical necessities are a consequence of the loss of data sovereignty that
the European countries are facing. Dependence on foreign corporations for
processing health data is by itself an issue. The quasi-monopolistic position of
these corporations gives them excessive power of control over data. Data is
processed, and possibly monetised. The security of the data can be compromised, as any database created is data at risk.
Big data analytics and algorithms are by essence thirsty for data: scientific research benefits from the largest data sets. How can the legitimate needs of research be reconciled with data protection principles?
7. Technical means of protecting medical data: encryption, anonymisation or pseudonymisation
With data becoming increasingly digitalised and widely accessible, securing
health data is paramount. The first supplementary measure to secure data when
transferred abroad is to apply a strong encryption, keeping the encryption key out
of the reach of the US internet communication services. This is in fact difficult to
realise when data needs further processing.
Cybersecurity attacks and ransomware against hospitals are becoming a common threat. Health data have high value and are regularly targeted by cyber criminals. Cyberattacks in healthcare environments have tripled since 2018, reaching 45 million individual victims in 2021.
Medical data is collected from various sources:
- imaging techniques for diagnosis;
- electronic health records;
- robotics in surgical procedures;
- telehealth for efficiency or reaching patients in more remote locations;
- wearables to monitor individuals' health, via various Internet-connected devices with often weak security.
The use of open data sources is also instrumental in the field of genomics, where data related to genetic makeup, biomarkers and bioinformatics is used to derive better therapeutic solutions.
European healthcare systems require stronger protection and security measures to protect health data.
Pseudonymised personal data are data where identifiers such as names are
replaced by codes the research institutions keep. A ‘code key’ that is the link to
the individual person is kept separately from the research data in order to protect
the privacy of patients. Only fully anonymised data escape the General Data Protection Regulation (GDPR) requirements. In the case of data for research, full anonymisation might render the data useless; therefore, it is often not a viable option.
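The code-key scheme described above can be sketched in a few lines. This is a minimal illustration, assuming the arrangement the text describes (random codes replacing names, with the name-to-code mapping held in a separate store); the record fields and function names are invented for the example:

```python
# Minimal pseudonymisation sketch: direct identifiers are replaced by
# random codes, and the 'code key' linking codes back to individuals is
# kept separately from the research data set, under stricter access control.
import secrets

def pseudonymise(records, code_key):
    """Replace the 'name' field with a code; the name->code mapping
    (the code key) is stored separately from the research data."""
    out = []
    for rec in records:
        name = rec["name"]
        if name not in code_key:
            code_key[name] = "P-" + secrets.token_hex(4)
        pseudo = dict(rec)          # copy so the original is untouched
        pseudo["name"] = code_key[name]
        out.append(pseudo)
    return out

code_key = {}   # kept in a separate, access-controlled store
research_data = pseudonymise(
    [{"name": "Alice Smith", "diagnosis": "asthma"},
     {"name": "Alice Smith", "diagnosis": "eczema"}],
    code_key,
)
```

Because the same person always receives the same code, records can still be linked longitudinally for research, while re-identification requires access to the separately held code key.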
Pseudonymised data and even anonymised data are not immune from re-
identification. Even when name, date of birth or national security numbers are
“anonymised”, a full health history will reveal patients’ age, gender, the places
where they have lived, their family relationships and aspects of their lifestyle.
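This re-identification risk can be made concrete with a k-anonymity check: count how many records share each combination of quasi-identifiers (age, gender, place); any group of size one is unique and hence linkable. The records below are invented for illustration:

```python
# Sketch of why "anonymised" records remain re-identifiable: if the
# quasi-identifiers single a patient out, the record can be linked back.
from collections import Counter

def smallest_group(records, quasi_identifiers):
    """Return the size of the smallest group of records sharing the same
    quasi-identifier values (the data set's k in k-anonymity terms)."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

records = [
    {"age": 34, "gender": "F", "city": "Lyon", "condition": "diabetes"},
    {"age": 34, "gender": "F", "city": "Lyon", "condition": "asthma"},
    {"age": 71, "gender": "M", "city": "Nice", "condition": "cancer"},
]

k = smallest_group(records, ["age", "gender", "city"])
# k == 1 here: the 71-year-old man from Nice is unique on his
# quasi-identifiers, so his "anonymised" condition can be linked to him.
```

Raising k (e.g. by generalising age to bands or city to region) reduces the linkage risk at the cost of data precision, which is exactly the research-utility trade-off discussed in this section.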
Privacy-enhancing technologies such as homomorphic encryption, differential
privacy, federated analyses and use of synthetic data offer new ways for protecting
the privacy of individuals. New promises are seen in synthetic data23, still not
exempt of criticism24. Issues of anonymisation and de-identification need to be
addressed and appropriately managed25.
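Of the privacy-enhancing technologies just listed, differential privacy is the simplest to sketch. The example below is a minimal, illustrative Laplace-mechanism release of a patient count; the epsilon value and the query are assumptions for the example, not a recommendation:

```python
# Minimal differential-privacy sketch: release a noisy count via the
# Laplace mechanism. A counting query has sensitivity 1, so the noise
# scale is 1/epsilon. Values below are illustrative only.
import math
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise by inverse transform sampling."""
    u = rng.random() - 0.5            # uniform on [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_count(true_count, epsilon, rng):
    """Differentially private counting query (sensitivity 1)."""
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)                # seeded for reproducibility
noisy = dp_count(1234, epsilon=0.5, rng=rng)
```

The published figure is close to, but not exactly, the true count, so no single patient's presence or absence can be confidently inferred from the output; smaller epsilon means more noise and stronger privacy.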
Health data, being more sensitive, requires extra layers of protection and appropriate security measures. The GDPR's applicability to the data set has
implications in the free flow of data to countries outside the European Economic
Area (EEA). This is a problem, for example, with researchers at federal research
institutions in the United States. Transfers to international organisations such as
the World Health Organization are similarly affected26. The European scientific
academies have recently published a report explaining the consequences of stalled
data transfers and pushing for responsible solutions 27,28.
- The case of the Canadian PHIPA experience29
PHIPA Decision 17530 details an investigation into the sale of de-identified data by a health information entity to a third-party corporation. The data protection supervisory authority became aware of the situation through a news article and launched an investigation.
De-identification does not usually necessitate individual consent. However, data controllers processing medical data have to be transparent and must clearly and explicitly inform patients about their practices in their privacy notice. Moreover, since de-identification is itself considered a data processing operation, patients' consent is required in most cases.
We will now take a deeper dive into the specific situation of European health data and its challenges: how to reconcile the need to share data and open free access to databases with keeping data secure?
II - The paradigm of the European data sovereignty
Over the last few decades, we have witnessed major developments in the fields of internet communication and data digitalisation. Europe has lagged behind US and Chinese corporations. Our dependency on US corporations for cloud storage, computer operating systems and data analytics is undeniable. It was with the COVID pandemic that EU governments had a wake-up call, realising the complexities of creating a contact-tracing app without the help of a Google app or its ID
verification Captcha. Since the European Court of Justice decision in July 2020
invalidating data transfers to the US, the use of US internet communication
services became problematic. European countries are re-thinking data processing
and data storage solutions including in the fields of health data and research.
1. European health data initiatives promise to open new perspectives
for medical research
A- EU Health Data
New initiatives are underway for an EU health data centre and a common health data space to be accessed more efficiently.
A study by Henrique Martins of the ISCTE-Lisbon University Institute and the Faculty of Medical Sciences, UBI, Portugal, made public at the request of the Panel for the Future of Science and Technology (STOA) and managed by the Scientific Foresight Unit within the Directorate-General for Parliamentary Research Services (EPRS) of the Secretariat of the European Parliament, is staggering31. The study noted the absence of a clear data architecture, a lack of harmonisation and the absence of an EU-level centre for data analysis capable of a better response to health data crises such as the COVID-19 pandemic: the "EU must have the capacity to use data very effectively in order to make data-supported public health policy proposals and inform political decisions". The study considered in detail the use of advanced technologies such as AI. A new model, a Health Emergency Preparedness and Response Authority (HERA)32, was suggested.
On 3 May 2022, the Commission published the Proposal for a Regulation of the
European Parliament and of the Council on the European Health Data Space Act.
The European Data Protection Board (EDPB) and the European Data Protection Supervisor (EDPS) were required to give their opinions on this project33.
Since the European Data Strategy called for the creation of a single European data space for the primary use of medical data, all EU citizens should have access to their electronic health records by 2030, thanks to the EU's central eHealth platform linking national contact points to the MyHealth@EU infrastructure and efficient national digital health authorities.
B - Joint Action Towards the European Health Data Space
The European Health Data Space (EHDS) could enormously impact health research if it can overcome barriers to cross-border data sharing34. The TEHDAS project, based on the European Commission's Health Programme 2020, develops European principles for the secondary use of health data; it is carried out by 25 European countries and co-ordinated by the Finnish Innovation Fund, Sitra35. The project will ease the sharing of data flows, with standardisation and the creation of a central
authority. Free access to this data will benefit innovation at the European and
international level.
III- Access to data for research - Re-using data and open science
1- International cooperation
1.1 - Among the national and international initiatives supporting medical health data is the International Medical Informatics Association (IMIA) Open-Source Working Group (OSWG)36, 'a voluntary group supported by IMIA that brings together researchers and practitioners from multiple countries with a diverse range of informatics experience but a common interest in the adoption of open approaches to advancing the use of informatics to improve healthcare.' This has
led to the development of an open access database of Free, Libre, and Open-
Source Software (FLOSS), called MedFLOSS37 to apply in the medical domain
to accelerate medical research.
1.2 - In the field of haematology, the HARMONY Alliance promises to accelerate scientific research. HARMONY claims to be "next-generation science, sharing data and knowledge": a cooperation between all healthcare stakeholders to treat disease faster and better. The HARMONY Alliance38 has created a transparent and secure repository for data from various clinical studies, where "everybody can contribute and help fight HMs" (haematologic malignancies). One example is the HARMONY research project entitled 'Use of Big Data to improve outcomes for patients with Acute Lymphoblastic Leukemia (ALL)'39.
A- Call to Remove obstacles to sharing health data with researchers
outside of the European Union40
Scientific academies in Europe (the European Academies Science Advisory
Council, the Federation of European Academies of Medicine, and the European
Federation of Academies of Sciences and Humanities) 41 have joined forces to call
attention to the challenges that affect not only European scientists but
collaborators worldwide.
In their paper 'Remove obstacles to sharing health data with researchers outside of the European Union'42, the researchers from the University of Oslo develop the necessity of lifting data-sharing barriers: "International sharing of pseudonymized personal data among researchers is key to the advancement of health research and is an essential prerequisite for studies of rare diseases or subgroups of common diseases to obtain adequate statistical power."
Certainly, the way forward for more efficient international data sharing requires a move from the US: "The United States should be encouraged to establish enforceable data subject rights and effective legal remedies for European and other non-US research participants whose data are processed by US researchers."
The voice of the health-research community must be heard by decision-makers at
the national level, at the EDPB, and within the EU Commission Directorates-
General involved, such as in the areas of justice, health and research. Without a
quick resolution, European research potential will not be realized, and European
citizens will fall behind.
We saw the importance for EU research of benefiting from EU health data. Currently, European health data is targeted by US corporations, a situation that has drawn criticism.
B- GAFAM US corporations' access to EU health data
1- The case of the NHS data agreements with DeepMind
The UK national Health and Social Care Information Centre (HSCIC) databases held all medical data in order to combine all healthcare records stored by general practitioners with all information stored by social services and hospitals. The Hospital Episode Statistics dataset, on the other hand, collects and curates data from 125 million individuals in England every year. While this data set has huge potential for research, it raises data retention methodology questions.
The UK has been criticised for sharing data with pharmaceutical companies, insurance companies, health charities, hospital trusts, think tanks, and other private companies. In 2014, it was disclosed that anonymous, pseudonymous, and identifiable data was sold to 160 organisations. In response to a Freedom of Information request43, HSCIC stated: "We recognise that there will however remain a latent risk that when combined with other sources of data, the identity of the individual may be ascertained". The care.data programme was closed in 2016 following general criticism and public opt-outs.
In June 2017, Taunton and Somerset NHS Foundation Trust and DeepMind
Healthcare signed a 5-year contract to develop and evaluate a system able to detect early signs of kidney failure44. Over 1.6 million live NHS data records were given to Google, via DeepMind.
In November 2017, the UK Information Commissioner ruled that London's
Royal Free Hospital had failed to comply with the Data Protection Act when it
handed over the personal data of 1.6 million patients to DeepMind45. A DeepMind
spokesperson said the firm had "underestimated the complexity of the NHS and of
the rules around patient data". According to the Information Commissioner,
Elizabeth Denham, the investigation "found a number of shortcomings in the way
patient records were shared for this trial. Patients would not have reasonably
expected their information to have been used in this way, and the Trust could and
should have been far more transparent with patients as to what was happening."
The ICO warned that such work should never be a choice between privacy or
innovation. Despite privacy advocates' hopes, this ruling did not exclude the use of the app.
Streams has since been rolled out to other British hospitals, and DeepMind
has also branched out into other clinical trials, including a project aimed at using
machine-learning techniques to improve the diagnosis of diabetic retinopathy46, and
another aimed at using similar techniques to better prepare radiotherapists for
treating head and neck cancers47.
Google's DeepMind Health systems have potential benefits for patients, nurses,
and doctors. The DeepMind Streams app allows clinicians to be informed in real
time when a patient's vital signs deteriorate, using data from patient-monitoring
technology. While these medical applications have major benefits, they raise
concerns about data ownership, secondary use and ethics.
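Streams' internal logic is not public. As a rough, hypothetical sketch of the general idea behind such deterioration alerts, a simple early-warning score can flag a patient when several vital signs leave their normal ranges; the bands and threshold below are illustrative only, not clinical values and not the actual Streams algorithm.

```python
# Simplified early-warning-score sketch (NOT the proprietary Streams
# algorithm): each vital sign contributes a point when it falls outside an
# illustrative "normal" band, and an aggregate score at or above a
# threshold triggers a clinician alert.

ILLUSTRATIVE_BANDS = {
    # vital sign: (low, high) illustrative normal range
    "resp_rate": (12, 20),        # breaths per minute
    "spo2": (96, 100),            # % oxygen saturation
    "heart_rate": (51, 90),       # beats per minute
    "systolic_bp": (111, 219),    # mmHg
    "temperature": (36.1, 38.0),  # degrees Celsius
}

def warning_score(vitals: dict) -> int:
    """Count how many vital signs fall outside their illustrative band."""
    score = 0
    for sign, (low, high) in ILLUSTRATIVE_BANDS.items():
        value = vitals.get(sign)
        if value is not None and not (low <= value <= high):
            score += 1
    return score

def should_alert(vitals: dict, threshold: int = 2) -> bool:
    """Alert when at least `threshold` vital signs are out of range."""
    return warning_score(vitals) >= threshold

deteriorating = {"resp_rate": 26, "spo2": 91, "heart_rate": 118,
                 "systolic_bp": 98, "temperature": 38.6}
print(should_alert(deteriorating))  # True
```

Real systems such as NEWS2-based scoring use graded point bands per vital sign rather than a simple in/out count, but the alerting principle is the same.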
Julia Powles, from the Faculty of Law and Computer Laboratory, University of
Cambridge, was one of the early scholars warning about the ethical issues of the
Streams app48 sharing NHS data with Google DeepMind. Her paper "Google
DeepMind and healthcare in an age of algorithms"49 examines both the promise
such collaborations hold for improving healthcare systems and services, and their
risks. In 2016, DeepMind announced its first major health project: a collaboration
with the Royal Free London NHS Foundation Trust to assist in the management of
acute kidney injury. Initially received with great enthusiasm, the collaboration
suffered from a lack of clarity and openness, with issues of privacy and power
emerging as potent challenges as the project unfolded. The DeepMind-Royal Free
case study underlines the policy implications of sharing large datasets of patients'
medical data with private companies.
In December 2019, The Observer revealed50 how UK medical data was allegedly
sold to American drug companies with little transparency or accountability around
the process. US drug giants, including Merck (known outside the US and
Canada as MSD, Merck Sharp and Dohme), Bristol-Myers Squibb and Eli Lilly,
have paid the Department of Health and Social Care, which holds data derived
from GPs' surgeries, for licences costing up to £330,000 each in return for
anonymised data to be used for research.
NHS Digital has announced that GP medical records in England would be collected
via a new service called General Practice Data for Planning and
Research (GPDPR)51, which will replace the General Practice Extraction Service
(GPES), in operation for over 10 years.
After DeepMind, it was revealed that Palantir had been awarded a £23m deal to
continue work on the NHS Covid-19 Data Store52. The two-year contract was first
reported by OpenDemocracy and Foxglove, who have campaigned for transparency
surrounding deals between the NHS and big tech firms. OpenDemocracy53 and
Foxglove claim the contract was "secretly" signed "in apparent violation of their
[the government's] prior promise to conduct future contracts between the NHS
and big tech via a full and open public tender".
In a more recent episode, the NHS body responsible for delivering a transparent
IT strategy was again criticised, this time for leaving patients unaware that the
medical data held by their GPs would be copied into a central database and shared
with third parties unless they opted out by 23 June 2022. These revelations raise
big questions over the transparency, and the claims of anonymity, in NHS data
transfers. Conflicting messaging has overshadowed NHS Digital's attempts to
inform the public about patient data sharing54, and the public has lost its trust.
When the new UK project came up, as reported by The Guardian55: "more than a
million people opted out of NHS data-sharing in one month in a huge backlash56
against government plans to make patient data available to private companies,
the Observer can reveal. The General Practice Data for Planning and Research
scheme is now on hold with no new date for implementation, and NHS Digital has
made a series of concessions to campaigners to try to salvage it."
2- Medical data harvesting in France
Palantir's projects are not limited to the UK. It has been reported that the US
data giant Palantir is "on a mission to seduce France's start-ups". Fears might not
be unfounded, as Palantir is said to be one of the most secretive companies in the
world. Palantir57 has expertise in big data analytics, having initially worked for the
US armed forces and intelligence services.
More recently, French investigative journalists58 revealed how medical data,
including records of drugs sold by pharmacists, was sold to Iqvia59. The group had
been tracking each patient via a unique identifier number to carry out "analyses of
sales of health products aggregated by typologies of pharmacies, by main types of
prescribers and by geographical areas".
Following the broadcast of the programme, the French data protection authority,
the CNIL, referred the case to the Paris Judicial Court60, asking Internet Service
Providers (ISPs) to block access to a site hosting the health data of nearly 500,000
people. The CNIL, which has already carried out three inspections in relation to
this data leak, is continuing its investigations.
In April 2022, the French CNIL fined Dedalus 1.5 million euros for the lack of
security measures that led to the breach of the medical data of 500,000 individuals.
3- Amazon, the US online retailer, going big in the health data market62
Amazon is entering the health data market, combining medical information with
various other data, including data from Amazon Echo63. The acquisition of One
Medical gives Amazon access to even more: One Medical built its own electronic
medical records system and holds 15 years' worth of medical and health-system
data that Amazon could tap.
A balance must be struck between protecting individual patients' fundamental
rights to privacy and dignity, the need for research to access data, and the need
for industry to study the impact of medical data on health improvement. Applying
privacy by design, privacy-enhancing technologies, anonymisation and encryption,
in full transparency and with respect for individual needs and wishes, is paramount.
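As one concrete illustration of such privacy-enhancing measures, pseudonymisation can be sketched as replacing a direct identifier with a keyed hash. The key, field names and record below are hypothetical, and this alone does not amount to anonymisation.

```python
# Minimal pseudonymisation sketch: replace a direct identifier (an
# NHS-number-style string, hypothetical here) with a keyed hash. Only the
# holder of the secret key can re-compute the mapping, so the published
# record carries no direct identifier. This is one privacy-enhancing
# measure, not full anonymisation: quasi-identifiers left in the record
# may still allow re-identification.

import hmac
import hashlib

SECRET_KEY = b"kept-by-the-data-controller-only"  # illustrative key

def pseudonymise(identifier: str) -> str:
    """Derive a stable pseudonym from an identifier with HMAC-SHA256."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

record = {"nhs_number": "943 476 5919", "diagnosis": "asthma"}
published = {"pid": pseudonymise(record["nhs_number"]),
             "diagnosis": record["diagnosis"]}
print(published)  # no direct identifier; the same input always yields the same pid
```

Because the mapping is deterministic, the same patient can be tracked consistently across datasets for research, while re-identification requires the key held by the data controller.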
Health data has huge value. Regulators' vigilance in monitoring data handling is
essential. EU health data sovereignty will not only improve patients' privacy and
the protection of their data; it will also strengthen European research capabilities.
3 Patient Willingness to Join Clinical Trials Drops Dramatically, New Data
Show -
4Guarda, P. & Bincoletto, G. A proactive GDPR-compliant solution for fostering
medical scientific research as a secondary use of personal health data (2021).
Trento Law and Technology Research Group Research Paper n. 46.
6Zulkarnain, N. & Anshari, M. Big Data: Concept, Applications, & Challenges,
(2016). International Conference on Information Management and Technology,
pp. 3073.
7Kaisler, S.H., Armour, F.J., and Espinosa, A.J. Introduction to the big data and
analytics: concepts, techniques, methods, and applications minitrack (2016).
Proceedings of the Annual Hawaii International Conference on System Sciences,
pp. 1059-1060.
8FRA, Getting the Future Right, Artificial Intelligence and Fundamental
Rights (European Union Agency for Fundamental Rights 2020) 39.
9Miller, D. D. The medical AI insurgency: what physicians must know about
data to practice with intelligent machines (2019). Article number: 62 (2019).
10Stöger, K., Schneeberger, D., Kieseberg, P., Holzinger, A. Legal aspects of
data cleansing in medical AI (2021). Computer Law & Security Review, 42.
11Case C-219/15 Elisabeth Schmitt v TÜV Rheinland LGA Products
GmbH ECLI:EU:C:2017:128; BGH 17 February 2020, VII ZR 151/18.
12Art. 17 para. 3 European Parliament, Framework of Ethical Aspects of
Artificial Intelligence (n 46).
13Commission, ‘Report on the Safety and Liability Implications of Artificial
Intelligence’ (n 18).
14Commission, COM (2020) 65 final (n 3) 15
15Wetsman, N. Artificial Intelligence aims to improve cancer screening in Kenya
(2019). Nature Medicine, 25, 1630-1631.
24Chen, R.J., Lu, M.Y., Chen, T.Y., Williamson, D.F.K., Mahmood, F. Synthetic
data in machine learning for medicine and healthcare (2021). Nature
Biomedical Engineering. 5, 493-497.
25The Royal Society, Protecting Privacy in Practice. The current use,
development, and limits of Privacy Enhancing Technologies in data analysis
26European Data Protection Board.
05/edpb_letter_out2021-0086_un_en.pdf (2021)
27The European Academies Science Advisory Council, the Federation of
European Academies of Medicine & the European Federation of Academies of
Sciences and Humanities. (2021).
28Bentzen, H.B., Castro, R., Fears, R., Griffin, G., Meulen, V.ter, Ursin, G.
Remove obstacles to sharing health data with researchers outside of the
European Union (2021). Nature Medicine, 27, 1329-1333.
31Study 21-09-
41 (2021
43Temperton J. DeepMind's new AI ethics unit is the company's next big move
[Internet]. [cited 2017 Nov 21]. Available
44Temperton J. DeepMind's new AI ethics unit is the company's next big move
[Internet]. [cited 2017 Nov 21]. Available
45Royal Free breached UK data law in 1.6m patient deal with Google's
46Google DeepMind pairs with NHS to use machine learning to fight blindness -
47Google DeepMind and UCLH collaborate on AI-based radiotherapy treatment
48Scaling Streams with Google -
50Patient data from GP surgeries sold to US companies
51General Practice Data for Planning and Research (GPDPR)
52Palantir awarded £23m deal to continue work on NHS Covid-19 Data Store -
53Open Democracy Controversial ‘spy tech’ firm Palantir lands £23m NHS data
54Conflicting messaging overshadows NHS Digital's attempts to inform public
about patient data slurp -
55NHS data grab on hold as millions opt out -
56 GPs warn over plans to share patient data with third parties in England -
57Pearltrees curation on Palantir -