The Text and Data Mining exception in the Proposal for a Directive on Copyright in the
Digital Single Market: Why it is not what EU copyright law needs
Thomas Margoni & Martin Kretschmer
25 April 2018
Permanent link: http://www.create.ac.uk/blog/2018/04/25/why-tdm-exception-copyright-directive-
The Proposal for a Directive on Copyright in the Digital Single Market (the Proposal) contains a
number of provisions intended to modernise EU copyright law and to make it “fit for the digital
age”.1 Some of these provisions have been the object of a lively scholarly debate because of their
controversial nature (the proposed adjustment of intermediary liability for copyright purposes
contained in Art. 13, see here at p. 7) or because they propose to introduce a new right within the
already variegated EU neighbouring right landscape (i.e. the protection for press publishers contained
in Art. 11).
Far less attention has been paid to the provision contained in Art. 3 of the Proposal dedicated to “Text
and data mining” (however, see here and here). The goal of Art. 3 is to introduce a mandatory
exception in EU copyright law which will exempt acts of reproduction made by research
organisations in order to carry out text and data mining for the purposes of scientific research. In
this blog Thomas Margoni and Martin Kretschmer discuss Art. 3 and explain why its formulation –
although underpinned by the right innovation policy goal – is wrong.
2) Text and Data Mining. Or the creation of new knowledge from existing information (but not
in the EU)
It has been calculated that the global research community generates over 1.5 million new scholarly
articles per annum (The STM report (2009) p. 5) or approximately one new paper every 30
seconds (Spangler et al, Automated Hypothesis Generation based on Mining Scientific Literature,
(2014), p. 1877). It is quite clear that the scientific community as a whole is not able to maintain an
adequate level of understanding of all the scientific knowledge produced. This is not only bad for
science but is also bad for the economy because resources are spent to duplicate knowledge that
probably already exists but has not been found. Data confirm this by showing that some 90% of all
published scientific papers are never cited, whereas 50% of them are never read by anyone other
than their authors, referees and journal editors (Lokman I. Meho, The rise and rise of citation
analysis (2007)). It would be fantastic if it were possible to “hire” an additional 1 million well
trained and well paid researchers willing to cover all this wealth of knowledge. But this is unlikely
to happen any time soon. Nevertheless, there is a solution that could be put in place right now: to
use the power of computers and the “intelligence” of modern machine learning algorithms to
perform that job. The low cost of computer hardware and software, together with their speed and
tireless energy, could easily be harnessed to allow the scientific community to fix the problem that half of the scientific
knowledge currently produced goes unread. There is only one little problem: this is a copyright
infringement. At least in the EU, since in other more “innovation oriented” economies, TDM is
generally considered a lawful activity.
There are many examples of how TDM may significantly improve the quality of research and boost
its development, including in ways that could not be covered even by the 1 million new researchers
hired in the hypothetical example above. In the EU, in fields such as linguistics, the
ability to develop automated translation tools is currently limited mostly to the official documents
produced by the European Union, which are translated into all official EU languages and, most
importantly, are generally openly available and reusable. Imagine what it would mean for these
types of applications if the original data sources were not limited to the official texts of the EU
bodies but, thanks to properly devised copyright laws, included all information available on the
Internet (you might even get an EU start-up to finally compete on a level playing field with the
likes of Google, Facebook, Amazon, Twitter, etc). Similar examples can be found in the possibility
to TDM the web and the online archives of journals, libraries and collections in order to verify the
historical accuracy of certain facts and thus to combat fake news (something not covered by Art. 3
because journalists are not research organisations operating for research purposes). Or to favour
new developments within the field of TDM such as deep learning, knowledge discovery, machine
learning and so on.
It is worth noting that the large majority (if not the totality) of these cases are restricted by copyright
law in the EU, but are considered lawful in countries such as the US (and other countries
implementing similar investment-innovation balancing approaches), mainly thanks to flexible
norms, e.g. the fair use doctrine, which considers most of these uses transformative. In particular,
under US copyright law the more transformative the new work, the likelier it is to constitute fair
use, and courts have found that text and data mining is inherently transformative.
2.1) A TDM definition
Text and data mining is a term used to refer to a variety of analytical tools normally based on the
use of digital technologies, big data and the Internet. The Proposal defines TDM as “any automated
analytical technique aiming to analyse text and data in digital form in order to generate information
such as patterns, trends and correlations” (Art. 2(2) of the Proposal) as well as “the automated
computational analysis of information in digital form, such as text, sounds, images or data” enabled
by new technologies (Recital 8).
Importantly, TDM allows the creation of new knowledge from any sort of information, in particular
from already existing structured and unstructured data such as texts, images, sounds or databases
which often were created for other purposes (e.g. a public agency maintaining a log of temperature
measurements in a given location, or the dataset collected for a now concluded research project).
Furthermore and perhaps crucially, what TDM enables is the correlation of the most diverse sets of
information by combining data that would have otherwise never been combined just because no one
would have thought that any correlation or pattern could be identified. This type of analysis is
usually very time- and labour-intensive and involves a certain degree of risk (it does not guarantee
that any pattern or correlation will be identified), but if a properly trained algorithm can do this
efficiently (i.e. at a marginal cost tending to zero) then the risk is significantly reduced, if not
completely eliminated. In cases like these, where scientific achievements can offer new
opportunities for socio-economic and cultural development, the legal system must offer a clear set of
rules within which science can move confidently.
2.2) Where is the problem?
The main problem is that EU copyright law considers most TDM activities as a copyright
infringement. It is noteworthy that other more innovation-oriented jurisdictions (such as the US,
Singapore, Japan) consider TDM lawful; the scientific and economic sectors in those
jurisdictions have therefore been employing TDM for a number of years, leaving the EU behind.
The reason for this situation can be found in a broad definition of protected rights (especially, but
not exclusively, the right of reproduction, i.e. to make copies) which is not counterbalanced by a
similarly broad definition of limitations to copyright (especially, but not exclusively, to the right of
reproduction). The right of reproduction is defined as any “direct or indirect, temporary or
permanent reproduction by any means and in any form, in whole or in part” by Art. 2 of Directive
2001/29/EC (InfoSoc Directive). As is the norm with digital technologies, in order to
“text-and-data-mine” information it is usually necessary to make (temporary) copies of the original data and
datasets in order to extract information (see here for a paper describing a machine learning example).
It is important to note that TDM is a type of “non-consumptive use” of copyright material. The work
is not used as a work; only the information, ideas and facts contained therein are used.
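For readers less familiar with the technical side, the point can be illustrated with a minimal sketch. It is not drawn from any particular TDM system: the function name mine_term_counts, the sample sentence and the choice of simple word counting are all assumptions made for illustration. The sketch makes a temporary working copy of a text in memory, extracts only aggregate facts about it (word frequencies), and returns those facts without the text itself.

```python
import re
from collections import Counter


def mine_term_counts(document_text, top_n=3):
    """Return the top_n most frequent words in document_text.

    Hypothetical helper for illustration only. The working copy exists
    only inside this function; what is returned are aggregate counts,
    i.e. facts about the text rather than the text itself.
    """
    # Temporary reproduction: a lower-cased working copy of the source text.
    working_copy = document_text.lower()
    # Split the working copy into words made of the letters a-z.
    tokens = re.findall(r"[a-z]+", working_copy)
    counts = Counter(tokens)
    # The working copy and the token list are discarded when the function returns.
    return counts.most_common(top_n)


# Sample sentence invented for this sketch.
sample = "Copyright protects expression. Copyright does not protect facts, and facts are free."
result = mine_term_counts(sample, top_n=2)
print(result)
```

The copy held in working_copy is the kind of temporary reproduction that falls within the broad definition of Art. 2 InfoSoc, even though the output contains no expression taken from the source. Real TDM pipelines are of course far more elaborate, but the pattern of copying, analysing and discarding is the same.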
In the light of the broad definition of the right of reproduction the copies made during TDM
analysis possess the potential to infringe copyright. This infringement, however, could be exempted
on the basis of an exception or limitation to copyright. After all, as explained in the Preamble of the
same InfoSoc Directive, the broadly defined EU right of reproduction would make the very
act of browsing the Internet a copyright infringement (because of the temporary copies of web pages made in
the cache memory of computers) were it not for the mandatory exception of Art. 5(1) InfoSoc that
allows certain temporary acts of reproduction.
2.3) Is the exception for temporary acts of reproduction of Art. 5(1) available to acts of TDM?
Partially. The CJEU had the occasion to clarify that temporary acts of reproduction made during
“data capture” processes can be covered by the exemption of Art. 5(1) under the cumulative
conditions that they:
1) constitute an integral and essential part of a technological process;
2) pursue a sole purpose, namely to enable the lawful use of a protected work; and
3) do not have an independent economic significance provided that:
3.1) the implementation of those acts does not enable the generation of an additional profit going
beyond that derived from the lawful use of the protected work;
3.2) the acts of temporary reproduction do not lead to a modification of that work.
These conditions, which, as the Court of Justice of the European Union (CJEU) pointed out, have to
be interpreted narrowly, are not always easy to meet in TDM processes and, with reference to n. 2
and n. 3.1, are difficult to interpret. Therefore, whereas Art. 5(1) constitutes an important exception
for TDM activities, the cumulative, narrow and uncertain nature of those conditions does not offer a
clear and efficient legal framework within which science can move confidently. In other words, the
current EU copyright law framework is failing to meet the goal of efficiently balancing the
promotion of innovation and the protection of investments.
2.4) Is a dedicated TDM exception therefore necessary?
This is a good question. There are two levels at which this question should be answered: the
copyright theory level and the copyright law level.
On the theoretical level, copyright protects the original expression of ideas, not ideas themselves or
facts or data. Therefore, the extraction of factual information or ideas from textual or data sources is
simply outwith copyright’s scope (this, together with other aspects of this blog, is analysed in
greater detail in a forthcoming paper).
However, for a number of reasons that cannot be analysed in depth here but that mostly relate to
the development that copyright law has undergone as a consequence of the digital revolution – a
development mainly in the direction of resisting it rather than understanding and exploiting it – a
machine learning algorithm analysing a poem constitutes almost certainly a copyright infringement.
This is due to the temporary copy that the data capture process of the machine learning – or most
other TDM – procedure creates. The copyright infringement can be avoided in one of two cases: the
authorisation of the copyright holder (e.g. a copyright licence, see here) or the authorisation of the
law in the form of a copyright exception. This exception could be Art. 5(1) InfoSoc, although, as
seen above, the restrictiveness and uncertain boundaries of the exception do not really offer a
satisfying answer. Other exceptions to copyright are available but due to their narrow or fragmented
nature, they likewise do not offer an adequate answer.
Therefore, in practice, in the current state of EU copyright law a TDM exception is necessary.
3) Is the TDM exception as drafted in Art. 3 of the Proposal the right solution?
No, it is not. The main argument against the current formulation of Art. 3 of the Proposal is that it
introduces a double limitation for TDM: it can only be performed by research organisations and
only for the purpose of scientific research. Therefore, a commercial enterprise will not be able to
benefit from the exception. Nor will a university acting for any purpose other than research (e.g.
commercial). Other purposes commonly accepted as fundamental in democratic societies are also
excluded, such as journalism, criticism or review.
In the opinion of the drafter of Art. 3 of the Proposal, the current wording is less restrictive
than the “non-commercial” limitation (which is instead found in the UK TDM exception), as
set out in the analysis developed in the Impact Assessment at pages 108-9. It seems, however,
that Art. 3’s double limitation is very close to the non-commercial requirement and in certain
respects even more restrictive, in the sense that a “non-commercial” limitation would allow a
business acting for non-commercial purposes (e.g. research, criticism, news reporting, etc) to
benefit from the exception, something that is not possible under Art. 3 (although Public-Private
Partnerships are explicitly allowed). This is a major and unjustified limit that excludes important
economic sectors and SMEs from benefiting from a crucially important innovation tool. This clearly
contrasts with fundamental rights such as the freedom of expression and the freedom to conduct a
business (Arts. 11 and 16 of the Charter of Fundamental Rights of the European Union), even
though in the same Proposal this contrast has been explicitly, although somewhat superficially,
excluded (see page 9 of the Proposal).
Many have suggested that the Commission should have opted for the so-called “option four” (see
page 8 of the current Proposal, or pages 108 – 109 of the Impact Assessment), that is to say a
TDM exception not limited to any beneficiary nor to any type of purpose. Option four would
certainly have been a much better option but, and this is something that is not fully addressed in the
current debate around Art. 3, it would still have been insufficient.
Even if Art. 3 did not contemplate the reported double limitation (only research organisations and
for the purposes of scientific research), there are a number of additional problems with that
3.1) Reproductions and distributions
The main problem with the structure of Art. 3 is that it only exempts the right of reproduction but not
the right of distribution or communication to the public, nor the right of adaptation (although the
latter is not the object of harmonisation at the EU level, with some limited exceptions).
This means that in all situations where the results of an act of TDM include a protected part of
the original “mined” work (and the CJEU clarified that excerpts as short as 11 consecutive words
could be protected) these results cannot be communicated to the public or redistributed. In certain
areas this will not be a major concern; in other areas, however, e.g. natural language processing, the
fact that certain models trained on a number of copyright protected corpora (i.e. texts) could
include 11 consecutive words means that those models, the result of the research
conducted by the research organisation, cannot be shared with anyone. Of course, the test is not “11
consecutive words” but is whether those 11 (or 15 or 8?) consecutive words are the “author’s own
intellectual creation”, an answer that will depend on each specific case, making the situation even
Therefore, a properly formulated TDM exception should cover not only the right of reproduction
but also the rights that cover the human (and computer) activities connected with the sharing of
those results, such as redistribution and communication to the public. An exemption so devised
would only apply to TDM activities; therefore only the communication to the public of parts of the
original work which are necessary for TDM purposes would be covered by the exemption, nothing
else. This would not be too different from what currently happens with the exception for parody or
quotation, where the original work is redistributed but only as part of the parody or quotation. Once
again, more flexible and innovation-friendly solutions (e.g. fair use doctrines) already cover all the
3.2) Contractual overridability and technological overridability
Art. 3 in its current formulation clarifies that contractual provisions contrary to the TDM exception
shall be unenforceable. This is a good provision, as many times access to scientific databases is
provision contrary to the TDM exception is expressed through a Technological Protection Measure,
the exception ceases to prevail, as there is no direct reference to Technological Protection
Measures in Art. 3.
In other words, a result (contracting out of the TDM exception) that the law forbids can in fact be
reintroduced by other means (the Technological Protection Measure), as the current formulation
fails to cover this case. It is worth recalling here that there is no basis in EU law to circumvent an
illegitimate technological protection measure, that is to say a technological measure that prevents
someone from doing what an exception to copyright allows. This is contradictory, creates legal
uncertainty and frustrates the policy goals of Art. 3 paragraph 2. The EU legislature is fully aware
of this contradiction but failed to address it properly. In fact, Art. 6 of the Proposal (“common
provisions”) clarifies that the provisions of the first, third and fifth subparagraphs of Art. 6(4)
InfoSoc Directive apply. In plain English this means that if a user qualifies for an exception to
copyright (e.g. TDM) but a Technological Protection Measure prevents them from relying on it, Member
States have an obligation to take appropriate measures to ensure that right holders make available to
the beneficiary the means of benefiting from that exception or limitation. In the almost 20 years since the InfoSoc Directive
was enacted, the UKIPO, which has correctly put in place a specific procedure for this type of
situation, has received fewer than a handful of requests.
The current formulation of Art. 3 is unsatisfactory and lacks ambition. In the Commission proposal
there are some good elements that properly reflect the copyright theory behind TDM, which – it
should be stressed – provides that ideas, facts and mere data are not the object of copyright protection.
Copyright protects authors’ original expressions, which ensures a creativity-innovation equilibrium
leading to the maximum level of socio-economic welfare.
The good elements of the Commission proposal are: the mandatory nature of the exception, the fact
that it cannot be limited by contract and that no remuneration scheme is set.
Nevertheless, a number of outstanding issues remain in the current Proposal, in particular the
limitation to research institutes for research purposes, the absence of a prohibition on circumventing the
exception through technological measures, and the fact that it only exempts acts of reproduction.
What is needed is a broad and flexible EU-wide exception that covers not only TDM but also
any other similar future technological development. Otherwise, each time a new technology is
developed EU copyright law will need to go through a lengthy and likely contested
legislative process in order to create an exception. During this period, which usually lasts years, other
jurisdictions (that have the necessary flexibility to address the natural tension between the
protection of investments and favouring innovation designed into their copyright laws) will leave
the EU further behind. If the Proposal, as it appears at the current stage, is not able to address these
as well as other concerns, perhaps it should be abandoned altogether.2
2 Acknowledgements: Thomas Margoni is coordinator of the legal interoperability working group of the EU H2020
OpenMinTeD project (H2020-EINFRA-2014-2) under grant agreement No. 654021 (OpenMinTeD). The opinions
expressed belong exclusively to the authors.