
A Legal Perspective on Training Models for Natural Language Processing



Richard Eckart de Castilho, Giulia Dore, Thomas Margoni,
Penny Labropoulou, Iryna Gurevych
Technische Universität Darmstadt, Department of Computer Science, UKP Lab
University of Glasgow, School of Law - CREATe Centre
Athena RC, Institute for Language and Speech Processing

A significant concern in processing natural language data is the often unclear legal status of the input and output data/resources. In this paper, we investigate this problem by discussing a typical activity in Natural Language Processing: the training of a machine learning model from an annotated corpus. We examine which legal rules apply at relevant steps and how they affect the legal status of the results, especially in terms of copyright and copyright-related rights.
Keywords: Copyright, Licensing, Machine Learning, Annotated Corpora
1. Introduction
The state-of-the-art in many areas of Natural Language Pro-
cessing (NLP) and Text Mining (TM) is based on Machine
Learning (ML). Algorithms learn abstract probabilistic mod-
els from texts annotated with labels (e.g. named entities,
part-of-speech tags, sentiment tags, etc.) in order to predict
such labels on unseen text. NLP tasks usually require the
deployment of multiple components each using specialised
models. As training models can be tedious and computation-
ally intensive, pre-trained models are a valuable resource.
However, the legal status of these models is often dubious,
as in many cases it is unclear (a) whether a model can be
trained from a corpus in absence of specific authorisation,
(b) which licence (if any) can or must be assigned to them,
and (c) if and in which cases the licence(s) of the original
corpus and annotations affect the licensing of a model. This
legal uncertainty often constitutes a hurdle, if not a real
barrier for the development of research infrastructures and
repositories, such as CLARIN or OpenMinTeD, where models are shared and used.
In this paper, we explore the process of training a model
both from an NLP and a legal perspective. We discuss under
which circumstances annotated corpora may be used for
training a model and if their legal status may restrict the
choice of licences that can be applied to the trained model.
We use EU copyright law as the main reference, although
the analysis may find application beyond the EU (and may
need some degree of adjustment in different EU Member States).
2. Background
This section first introduces the typical actions and resources involved in the construction and deployment of ML models and then discusses the relevant legal concepts.
2.1. NLP perspective
Models are constructed through a training process involving
a learning algorithm and training data to learn from. The
model captures abstract probabilistic characteristics from
the training data, which can then be used to predict the
learned labels on unseen data. For illustration purposes, we
focus on Named Entity Recognition (NER), an example of
a sequence classification task.
In general, constructing a model consists of the following
steps: (1) corpus compilation, (2) corpus pre-processing, (3)
corpus annotation, and (4) training of the model. Depending
on the availability of an annotated corpus, one or more of
these steps can be skipped. We briefly describe all of these
steps here and elaborate on the training step in the main
parts of our investigation.
Corpus compilation.
Each corpus is compiled to capture
specific aspects of real world language. For best results, the
ML algorithm must be trained on a corpus (i.e. set of texts)
that is similar to the corpus to which it is later applied; i.e.,
it must be of the same language and domain or text type
and annotated with the appropriate labels, e.g., "English", "Social Sciences", "scholarly publications" and "named entities" (NE), respectively. The corpus texts are selected and
obtained from one or more sources (e.g. publishers, journals,
web sites, etc.).
Corpus pre-processing.
This involves all kinds of (usually automatic) processes required to convert the textual content into
a format that can be further processed by the NLP tools,
such as conversion of PDF or HTML files into plain text,
removal of images, tables, etc.
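As a concrete illustration of this pre-processing step, the sketch below converts HTML content to plain text using only the Python standard library. The class and function names are ours; real pipelines typically use more robust converters that also handle PDF conversion and the removal of images and tables.

```python
# Illustrative sketch only: strip HTML markup to obtain plain text.
# Hypothetical helper names; real pre-processing pipelines use more
# robust tools (and also handle PDF, images, tables, etc.).
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects character data, skipping <script> and <style> content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    # Normalise whitespace left behind by the removed markup.
    return " ".join(" ".join(parser.parts).split())

print(html_to_text("<p>Named <b>entities</b> occur in text.</p>"))
```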
Corpus annotation.
This is the task of manual or automatic enrichment of texts with labels relevant to the target task, possibly further corrected by experts. Annotations are often
arranged in layers, e.g. grammatical categories, morpho-
logical or syntactic features, etc. In all cases, the human
annotator or the tool “reads” the text which is segmented
into units (e.g. words, phrases) and assigns to some or all
of them the appropriate labels. The inventory of labels is
defined in annotation resources, such as tagsets, ontologies,
thesauri, etc. The assignment usually follows instructions
(e.g. guidelines, grammar rules, statistical data) that define
when to assign labels and how to disambiguate if multiple
candidate labels exist.
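The label-assignment step can be sketched as follows. This is a deliberately minimal, hypothetical gazetteer-based annotator (the names GAZETTEER, LABEL_INVENTORY and annotate are ours, not from any cited tool); real annotation follows detailed guidelines and disambiguation rules and is often corrected by experts.

```python
# Toy sketch of automatic annotation: mark character spans in a text
# with labels drawn from a fixed inventory, using a tiny gazetteer.
# Hypothetical names; real annotators are far more sophisticated.
LABEL_INVENTORY = {"person", "location", "organisation"}
GAZETTEER = {"Ada Lovelace": "person", "London": "location"}

def annotate(text):
    """Return sorted (start, end, label) character spans found in text."""
    annotations = []
    for name, label in GAZETTEER.items():
        assert label in LABEL_INVENTORY  # labels must come from the inventory
        pos = text.find(name)
        if pos != -1:
            annotations.append((pos, pos + len(name), label))
    return sorted(annotations)

print(annotate("Ada Lovelace lived in London."))
```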
Training.
The training tool is a software programme that
implements an ML algorithm which is applied to the anno-
tated corpus, analyses its features and extracts from it the
appropriate probabilistic and statistical characteristics. The
model can thus be regarded as an abstraction of the anno-
tated corpus based on statistical observations which can then
be used with a second software tool (tagger) to predict the
learned labels (e.g. NEs) on unseen text.
For the present analysis, we assume that ML tools (trainers
and taggers) are governed by licenses that do not impose
restrictions on the models they create. A similar assumption
is made in relation to the use of predictions ML tools make.
Accordingly, the analysis focuses on the licensing terms of
the annotated corpora and the actions performed on them
while training a model.
2.2. Legal perspective
Before proceeding to the discussion of the three scenarios,
this section clarifies some basic copyright law concepts.
Texts and literary works.
Most corpora employed in
NLP consist of web pages, publications, articles, newspaper
texts, blog posts or even tweets, annotated or not. All these
resources possess the potential to be protected by copyright
law. To be eligible for copyright protection a work must be
original. The originality standard has been harmonised by
EU law at the level of the author’s own intellectual creation.
Current legislation and relevant case law indicate that
this harmonised level of originality, despite the evocative
formula employed, is placed at a rather low level (Margoni,
2016) and the Court of Justice of the EU (CJEU) has held
that 11 consecutive words can in certain circumstances be
considered the author’s own intellectual creation, thus pro-
tected by copyright. This must be verified on a case-by-case
basis and it is achieved when an author is able to put their
personal stamp onto the work through free and creative
choices. Therefore it cannot be excluded that even single
sentences, if original, can be the object of copyright protection.
In conclusion, it can be assumed that most corpora used
for TM/NLP, especially those of a literary and scientific
character, such as scholarly articles, are protected by copyright.
Databases.
Under EU law, as well as under the law of many other countries,
databases are defined as collections
of independent works, data or other materials arranged in a
systematic or methodical way and individually accessible
by electronic or other means.
Copyright exists if originality
is found in the selection or arrangement of the content, i.e.,
the “intellectual creation” has to be found in the database
structure. Consequently, copyright in databases protects
only the structure and does not extend to the content. The
content, in turn, can be autonomously protected by copyright
(a database of scholarly articles), related rights (a database
of sound recordings), or be in the public domain (a database
of unprotected facts or of medieval texts).
E.g. Directive 2009/24/EC, OJ L 111, 5.5.2009, 16–22, Article
E.g. Judgment of 16 July 2009, Infopaq International v. Danske Dagblades Forening, C-5/08, ECLI:EU:C:2009:465
General Agreement on Trade-Related Aspects of Intellectual Property (TRIPS), 1869 U.N.T.S. 299, 33 I.L.M. 1197; WIPO Copyright Treaty (WCT), 105-17 (1997), 36 I.L.M. 65 (1997)
6 Directive 96/9/EC, OJ L 77, 27.3.1996, Article 1
In addition, EU law, unlike the law of most other countries in
the world, has introduced a new right protecting non-original
databases when a substantial investment has been put in
the obtaining, verification or presentation of the data – but
importantly not in the creation of the data. In this case, the
database maker (usually the person or entity who bears the
financial risk) enjoys a sui generis database right (SGDR),
which protects the content of the database from substantial
extractions. In other words, even databases of unprotected
facts could become the object of a proprietary right that extends
to the database content in the light of the aforementioned
substantial investment (Hugenholtz, 2016; Guibault and
Wiebe, 2013). Therefore, certain collections of corpora (e.g.
the database of Institute X that over the years has collected
public domain corpora investing substantial time and work
resources in the process) could be protected by the SGDR.
Copyright and database rights are probably the two most
relevant rights potentially covering the annotated and un-
annotated corpora forming the basis for any training activity
(Stamatoudi and Torremans, 2000; Firdhous, 2012; Payne
and Landry, 2012; Borghi and Karapapa, 2013; Tsiavos
et al., 2014; Truyens and Van Eecke, 2014; Triaille et al.,
2014; Handke et al., 2015). In the following sections, we
examine this aspect more closely in the context of three
basic scenarios. Specific attention will be paid to: (a) the
right of reproduction (making copies) of the resources in
question and whether the distinction between temporary
and permanent copies matters; (b) the right of adaptation
and/or translation, i.e. the creation of works based on those
resources, what is often (but imprecisely from an EU law point of view) called a derivative work; and (c) the specific
licence types under which annotated corpora are distributed.
Nevertheless, no specific attention will be dedicated to the
annotation process as many corpora are already available in
an annotated form.
3. Scenario I: Liberally Licensed Corpora
We start our investigation with a straightforward scenario:
training a model on an annotated corpus with a liberal
TM/NLP-friendly licence carrying only an attribution clause.
Here, we choose the popular Creative Commons Public Li-
cence with the Attribution clause in the latest version avail-
able (CC BY 4.0). This licence is particularly “friendly” for
TM/NLP activities because it authorises licensees to perform
all the aforementioned rights (reproduction, redistribution,
communication to the public, adaptation, etc.) under the
only main condition that attribution be maintained.
3.1. Scenario description
Corpus.
Despite version 4.0 of the CC licences being
available since 2013, many of the existing CC BY licensed
corpora are still distributed under older versions such as
CC BY 2.5 (e.g. the Wikinews texts of the GUM corpus
(Zeldes, 2017)) or CC BY 3.0 (e.g. the IULA Spanish LSP
Treebank (Marimon et al., 2014) or the CRAFT corpus
(Verspoor et al., 2012)).
We only consider the latest 4.0 version in the present analysis, but it should be noted that different licence versions could lead to different assessments, especially in relation to the SGDR. (It is notably difficult to locate corpora under specific CC licence versions; e.g., at the time of writing, META-SHARE lists over 200 corpora under CC BY but does not carry information about the licence version. The LINDAT/CLARIN repository includes the licence version metadata, but the search interface does not allow filtering resources by it.) Examples of corpora under this licence version are the recent GermEVAL 2014 dataset for NER (Benikova et al., 2014) or the Coptic Treebank (Schroeder and Zeldes, 2016).
Process.
In our scenarios, we train a NER model with the
Stanford NER tool (Manning et al., 2014). To determine the
relation of the trained model to the original data, we examine
which information goes into the model. We describe the
process in the present and following scenarios at increasing
levels of detail, as required by the respective legal analyses.
The training process requires the creation of a usually tem-
porary copy (i.e. a reproduction) of the original data and
usually its transformation into the training data format. The
training data format used by the Stanford NER tool is very
simple: a two-column format in which the first column con-
tains a token (word or punctuation mark) and the second
column contains a label. Sentences are separated by a blank
line. If the word is a named entity, the label indicates the
entity type (e.g. person, organisation, etc.), otherwise it con-
tains a special “no category” label. To handle cases where
a NE consists of multiple tokens, the entity type is prefixed with B- for the first token of the NE and with I- for the subsequent tokens, while tokens outside any NE carry the label O. This so-called BIO encoding is a technical convention allowing the NER tool to learn how to correctly detect multi-token NEs.
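The two-column training format described above can be illustrated with a short sketch. The helper to_bio is ours, not part of the Stanford NER distribution.

```python
# Illustrative sketch: emit the two-column training format from token
# and entity-span annotations. `to_bio` is a hypothetical helper, not
# part of the Stanford NER distribution.
def to_bio(tokens, entities):
    """tokens: list of strings; entities: (start, end, type) token spans,
    with `end` exclusive. Returns one label per token."""
    labels = ["O"] * len(tokens)  # "no category" by default
    for start, end, etype in entities:
        labels[start] = "B-" + etype      # first token of the NE
        for i in range(start + 1, end):
            labels[i] = "I-" + etype      # remaining tokens of the NE
    return labels

tokens = ["Angela", "Merkel", "visited", "Paris", "."]
entities = [(0, 2, "PER"), (3, 4, "LOC")]
for token, label in zip(tokens, to_bio(tokens, entities)):
    print(f"{token}\t{label}")
```

In the actual training file, each sentence would additionally be followed by a blank line.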
Before considering the training process in more detail in
Scenario II, we investigate whether the process up to this
point is permitted by the licence.
3.2. Scenario analysis
Is reproduction permitted?
According to the terms of
the CC BY 4.0 licence, the act of making reproductions is
expressly permitted, as per Section 2 of the licence text.9
In the present case, thanks to the liberal conditions es-
tablished by the licence, it is not necessary to investigate
whether the results of the training activity constitute a re-
production of the original corpora. The applicable licence
permits any type of reproduction, being it the transient re-
production necessary for the conversion of the corpora into
a machine processable format, or the final results of the training process. Therefore, annotated corpora under these
licences may be reproduced as part of the model training
process on the basis of the licence (in those cases when this
act is not covered by applicable exceptions and limitations, a
situation that would not trigger the terms of the CC licence,
see Scenario III below).
Is the result an adaptation (derivative work)?
As stated
above, the scenario is predicated on the assumption that all
input resources (raw texts and annotations) are covered by a
CC BY 4.0 license.
We should note here that we have not investigated whether the assignment of these licences to the respective corpora is indeed valid.
Section 2a of this licence expressly permits the creation of
adaptations and defines it contractually, although ultimately
referring to the applicable copyright law. Since the creation
of an adaptation is explicitly permitted under the terms of
the licences, if the act of training a model constitutes a
derivative work, which is indeed not trivial to determine (see
Scenario II below), in the present scenario this activity is
permitted, under a mere attribution condition. It is worth
pointing out that the attribution requirement in the present
case would require: a) retaining attribution, copyright and
licence notices, and providing a URI or hyperlink to the
licensed material to the extent reasonably practicable; b)
indicating modifications to the licensed material and retaining an indication of any previous modifications; c) indicating
the licensed material is licensed under this public license,
including the text of, or the URI or hyperlink to the licence.
No other limitations apply to the model. In principle, the
model could also be re-licensed under any arbitrary licence
if it qualifies for copyright protection on its own right, an
aspect that will be discussed below in Scenario II.
In conclusion, training models on the basis of liberally li-
censed corpora does not present major legal obstacles, al-
though proper attribution should be given.
4. Scenario II: Corpora with a Reciprocal Licence
We continue our investigation with a slightly more com-
plicated case: training a model on an annotated corpus
with a TM/NLP-friendly licence which includes a share-
alike clause, such as the Creative Commons Attribution-
ShareAlike 4.0 (CC BY-SA 4.0). This means that works
adapted (derived) from the original work must carry the
same licence as the original work. Since this is a reciprocal
condition, it is important to determine whether a model is
an adaptation (derivative work) of the corpora under the
conditions of the licence.
4.1. Scenario description
Corpus.
We consider the corpus to be licensed under the CC BY-SA 4.0 licence, such as the latest version of the SETimes.HR dataset (Agić and Ljubešić, 2014).10
Process.
The basic scenario is the same NER training process we have already started to describe in Scenario I.
Assuming the corpus at hand has already been transformed
into the two-column format described in Scenario I, the pro-
cess of training a model is rather straightforward (although
it may require significant computational resources).
As a first step, a configuration file for the Stanford NER tool
needs to be created. This file contains the names of the files
that comprise the training corpus and a name to be used for
the output file (i.e. the model to be created), as well as a
set of parameters controlling which features are extracted
from the training data and used for training the classifier.
An example parameter file (Figure 1) can be found in the
documentation of the Stanford NER tool.11
10 (Agić and Ljubešić, 2014) is based on texts from the Croatian translation of the SETimes portal, which were freely shared with attribution to the source.
trainFile = training-data.col
serializeTo = ner-model.ser.gz
map = word=0,answer=1
Figure 1: Example parameter file params.prop.
As a second step, the Stanford NER tool is started in training mode using the command java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop params.prop.
From this point on, the process runs fully automatically with-
out further interaction from the person training the model.
4.2. Scenario analysis
Is reproduction permitted?
Once again, the act of mak-
ing copies (reproductions) is expressly allowed by the
CC BY-SA 4.0 in the same terms analysed in Scenario I.
Is the result an adaptation (derivative work)?
The cre-
ation of adapted materials is also expressly permitted, as it
was with CC BY 4.0 in the previous scenario. However, the
SA clause that applies in the present scenario requires that
distribution of the adapted material be made under the terms
of the same licence or a later version with the same terms.
Therefore, it is important to determine whether the trained
model is an adaptation of the original annotated corpora. If
it is, the SA clause requires that the same licence be applied
to the trained model.
What constitutes adapted material is defined in Section 1(a)
of the licence
as “... material subject to Copyright and Similar Rights that is derived from or based upon the Licensed Material and in which the Licensed Material is translated, altered, arranged, transformed, or otherwise modified in a manner requiring permission under the Copyright and Similar Rights held by the Licensor”. To establish when this
happens is a determination that can be done only against a
specific domestic legal framework even within the EU. In
fact, differently from other rights, the right of adaptation is
not harmonised by EU law.13
(A more extensive list of parameters can be found in the documentation of the Stanford NER Java class NERFeatureFactory.)
13 Judgement of 22 Jan 2015, Art and Allposters International BV, Case C-419/13, ECLI:EU:C:2015:27
However, not all modifications lead to the creation of an adapted work. EU law seems to suggest that such a new (adapted) work is created only when the process of modification involves an original contribution (in the sense of the author's own intellectual creation). This appears to be
also the view of the licences' developers.
Absent enough
originality in the modification, the unoriginally modified ma-
terial does not constitute a new copyright protected work, but
rather a mere reproduction (even if partial or “in any form”),
which, unless authorised by law or by contract, infringes the
copyright (the right of reproduction) in the original work.
Given this definition of adapted material, we need to con-
sider whether the act of training a model using the outlined
procedure meets the licence requirements.
The simplicity of the training process outlined above and the
limited choices in the parametrisation of a largely automated
process suggest that there is no space for the free and creative
choices that allow the author to express their personality
into the work. In particular, it seems that even when certain
choices in parametrisation are available these are dictated
mostly by technical considerations and by the “rules of the
game” of model training in a way that any equally skilled
technician would achieve a similar or identical result. Under
this interpretation, the model is not a creative adaptation
of the underlying annotated text corpora and thus does not qualify as adapted material under the SA clause of the CC licence.
This means that the trained model, not being an adaptation
of the underlying corpora, does not trigger the SA clause.
Training a model, as seen above, requires other types of
copyright relevant acts, namely reproduction, which must
be authorised or excused –statutorily or contractually– to
avoid infringement. In the present case, it means that if the
trained model, which does not qualify as adapted material, is
nonetheless a “reproduction in part” of the original corpora,
that part –and only that part– remains under the conditions
of the original licence, in the present case a CC BY-SA 4.0.
Similar conclusions would be reached if the resources em-
ployed in the training process were licensed under a CC BY-
ND 4.0. The licence in question, in fact, although allowing
the creation and reproduction of adapted works, does not
allow for their distribution (alias sharing), as specified in
its Section 2(a).
If the model is not an adapted work in
the meaning outlined above, then the NoDerivatives (ND)
clause will not be triggered.
Finally, does this mean that the trained model can be arbitrar-
ily licensed by its developer? In the present case, the trained
model lacks sufficient originality to qualify as a derivative
as well as an independent work. Therefore, it is not a work
of authorship for copyright law purposes. In theory a licence could still be applied, but this would only have contractual effects and not be based on an underlying property right (a very relevant difference that cannot be explained here; suffice it to say that most copyright licences are based on a valid underlying property right: if this is not present, the effects of the licence are limited. In the case of CC licences, as well as most "open" licences, if the licence is applied to something that is not protected by copyright or related rights, the licence is not triggered).
Figure 2: Feature excerpt from a CoreNLP NER model.
5. Scenario III: Corpora with Unclear Licence Statements or Restrictive Licences
In the previous scenarios we discussed texts and annota-
tions under licences which explicitly allow reproduction.
However, it is much more common that only annotations
are under such a licence. Obtaining corpora under simi-
lar licences is much more difficult. Most of the texts that
can be found online do not carry any licence at all or are
part of commercial offers which do not permit reproduction.
Thus, in the present scenario we investigate if such texts can
still be used for training models. We do not investigate the
relationship between corpora and annotation.
5.1. Scenario description
Corpus.
An example of a text corpus obtained from the web and enriched with annotations under a CC licence is the English part of the Universal Dependency Treebank (UDT-EN). While the annotations are provided under CC BY-
SA 4.0, the texts come from the English Web Treebank
which has been collected by Google from online weblogs,
newsgroups, emails, reviews and question-answering web-
sites. As the UDT-EN website states, the copyright of por-
tions of the texts may reside with “Google Inc., Yahoo! Inc.,
Trustees of the University of Pennsylvania and/or other original authors”. Thus, the licensing status of the individual
texts is not entirely transparent, and should be prudently
considered to be under an “all rights reserved” status.
Process.
Again, we consider the same process of training
a NER model described in Scenarios I and II. However, this
time we take a closer look at the training process in order
to assess whether the model reproduces significant parts of
the original document, an activity reserved to the copyright holder.
During the training process, the NER training tool extracts
so-called features from the input data. This is the critical
step in the training process, as it determines how much of
the original text and annotations is retained. Figure 2 shows
a sample of feature values generated by the Stanford NER.
The Stanford NER uses a sequence classifier based on con-
ditional random fields (CRFs). The tool runs through the
text token-by-token and consumes features that have been
extracted for each token, such as the token string, a config-
urable number of characters forming the prefix/suffix of the
token, the left and right context of each token, e.g. the fact that token X appears left of token Y, and similar information. The context captured in the features is very limited and usually includes only the current word and a preceding and following word. E.g., the feature essentially-still-PSEQW2|CpC indicates that the sequence “essentially still” was included in the training corpus. Additionally, the CRF learns a set
of weights encoding the probability of an NE occurring in
the presence of the specific features (Finkel et al., 2005). It
is important that the features and weights capture only limited information about the tokens and their annotations. They discard wider context details, because the ML algorithm needs to learn a generalised model that does not overfit the training data.
The features and algorithm for NER-like tasks are designed
in such a way that the trained model represents an abstraction
of the training data. It is generally not possible to reconstruct
the original text from this abstraction.
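The limited, per-token nature of the features discussed above can be sketched as follows; the feature names and selection are illustrative only and do not reproduce the actual NERFeatureFactory output.

```python
# Rough sketch of per-token feature extraction: only the token itself,
# short affixes and the immediate neighbours are kept, so the wider
# document context is discarded. Feature names are illustrative.
def token_features(tokens, i, affix_len=3):
    token = tokens[i]
    return {
        "word": token,
        "prefix": token[:affix_len],
        "suffix": token[-affix_len:],
        "prev": tokens[i - 1] if i > 0 else "<S>",      # sentence start marker
        "next": tokens[i + 1] if i < len(tokens) - 1 else "</S>",
    }

print(token_features(["Paris", "is", "essentially", "still", "beautiful"], 2))
```

A model trained on such features stores statistics over these small fragments rather than the running text, which is why the original documents generally cannot be reconstructed from it.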
5.2. Scenario analysis
Is reproduction permitted?
As briefly observed above,
the right of reproduction, defined as “any direct or indirect,
temporary or permanent reproduction by any means and in
any form, in whole or in part” is reserved to the right holder
of copyright works in all EU countries by Art. 2 InfoSoc
Directive and its national implementations.18
The CJEU had the opportunity to clarify that certain acts
of temporary reproduction carried out during a “data cap-
ture” process fulfil the requirements of the exception for
temporary copies (Art. 5(1) InfoSoc) under the cumulative
conditions that those acts:
- must constitute an integral and essential part of a technological process (this condition is satisfied notwithstanding the fact that initiating and terminating that process involves human intervention);
- must pursue a sole purpose, namely to enable the lawful use of a protected work; and
- must not have an independent economic significance, meaning that the implementation of those acts does not enable the generation of an additional profit going beyond that derived from the lawful use of the protected work, and that the acts of temporary reproduction do not lead to a modification of that work (Case C-5/08 Infopaq I and C-302/10 Infopaq II).19
Under these conditions, temporary acts of reproduction are
permitted by EU law.
A brief description of the facts of the Infopaq case may
be helpful. The decision, referring to the compilation, ex-
traction, indexing and printing of newspaper articles and
keywords, identifies five phases relevant in the process of
18 Directive 2001/29/EC, OJ L 167, 22.6.2001
19 Judgement of 16 Jul 2009, Infopaq International A/S v Danske Dagblades Forening, Case C-5/08, ECLI:EU:C:2009:465 and Judgement of 17 Jan 2012, Infopaq International A/S v Danske Dagblades Forening, C-302/10, ECLI:EU:C:2012:16
data capture: (1) newspaper publications are registered man-
ually in an electronic registration database; (2) sections of
the publications are selectively scanned, allowing the cre-
ation of a Tagged Image File Format (TIFF) file for each
page of the publication and transferring it to an Optical
Character Recognition (OCR) server; (3) the OCR server
processes this TIFF file digitally and translates the image of
each letter into a character code recognisable by the com-
puter and all data are saved as a text file, while the TIFF file
is then deleted; (4) the text file is processed to find a search
word defined beforehand, identifying possible matches and
capturing five words before and after the search word (i.e. a snippet of 11 words), before the text file is deleted; (5) at the
end of the data capture process, a cover sheet is printed out
containing all the matching pages as well as the text snippets
extracted from these pages.
The Court found that the exception of Art. 5(1), which
covers acts of temporary reproduction, only exempts the
activities listed in points 1) to 4) above, whereas the activity
of point 5), i.e. printing, constitutes a permanent act of
reproduction which is therefore not covered by the exception
for temporary copies.
It should further be noted that, in point 5), what is printed
is not the entire literary text, but only 11 consecutive words.
Only if these 11 consecutive words constitute a “reproduc-
tion in part” of the original work, copyright would be in-
fringed. In this regard, the CJEU found that “it cannot be
excluded” that 11 consecutive words constitute the author’s
own intellectual creation and therefore represent a partial
(and thus infringing) permanent reproduction. The 11 words
threshold should not be taken as a strict parameter. The
real test is that of the author’s own intellectual creation. Ac-
cordingly, there will be shorter extracts that meet such a
condition, and longer extracts that do not meet it.
As a result, it can be argued that current EU law permits
the temporary copy of non-licensed copyright works for
purposes such as “data capturing” as long as the cumulative
conditions of Art. 5(1) as interpreted by the Court are met.
However, when the results of the data capturing process lead to the permanent reproduction of the author's own intellectual creation, this constitutes an infringement of the right of reproduction.
Nevertheless, it must be stressed that the conditions of Art.
5(1) are not only cumulative (i.e. all must be met) but also
partially unclear (especially regarding the exact meaning
of “independent economic significance”) and must be interpreted strictly. These considerations have led many commentators to the conclusion that Art. 5(1) is not suitable as a general solution for TDM purposes (Triaille et al., 2014).
This conclusion is certainly correct; nevertheless, until a proper TDM exception is introduced at the EU level, the suitability of Art. 5(1) for unlicensed corpora should be explored further for specific ML/NLP cases, as in the present scenario. The remainder of this section will attempt such an analysis.
It seems that the ML/NLP steps described in Scenario III are substantially similar to those described by the CJEU in the reported case law. In particular:
1. transforming the text corpus and the annotations into the input format of the Stanford NER tool is arguably equivalent to converting a TIFF image into text using OCR, but much less sophisticated;

2. inspecting each word in the text in turn in order to create an ML feature representation capturing information from the word, its immediate left and right neighbours, and the annotation on the word is arguably equivalent to extracting the search term and the words before and after it, although only one word before/after is extracted instead of 5, for a total of 3 words instead of 11;

3. creating a probabilistic report (i.e. a model) about the data obtained in this way is arguably equivalent to printing a cover sheet containing the matching pages, although the report consists of a numeric matrix encoding probabilities of observed features and correlations, as well as a feature and label dictionary containing encoded features covering words, word pairs, parts of words, and word shapes observed in the text.
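The feature-extraction step in point 2. can be sketched in code. This is a deliberately minimal illustration with hypothetical function names, not the actual Stanford NER feature set (which is considerably richer):

```python
# Illustrative sketch of the windowed feature extraction described above.
# Simplified relative to real CRF-based NER tools; names are hypothetical.

def word_shape(token):
    """Map characters to a coarse shape, e.g. 'Paris' -> 'Xxxxx'."""
    return "".join("X" if c.isupper() else "x" if c.islower()
                   else "d" if c.isdigit() else c for c in token)

def extract_features(tokens, i):
    """Features for token i: the token itself, its immediate left and
    right neighbours, and its shape -- a window of at most 3 words."""
    return {
        "w0": tokens[i],
        "w-1": tokens[i - 1] if i > 0 else "<S>",
        "w+1": tokens[i + 1] if i < len(tokens) - 1 else "</S>",
        "shape0": word_shape(tokens[i]),
    }

tokens = ["Infopaq", "operates", "in", "Denmark"]
print(extract_features(tokens, 0))
```

Note that each feature dictionary only ever encodes the word, its two immediate neighbours, and derived shapes; the original running text is not retained.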
It seems plausible that the temporary copies created in points 1. and 2. are transient or incidental if they are kept only for the time justified by the proper completion of the technological process and are automatically destroyed at the end of the process. It is also arguable that the act of reproduction is an integral and essential part of a technological process (the conversion of the text into data) which is necessary to enable a lawful use (statistical analysis is arguably as lawful as the preparation of summaries and is not a right reserved to the right holder by EU copyright law; however, if the right holder contractually limits this operation and domestic law allows it, this condition would probably not be met). The requirement of absence of independent economic significance is probably harder to assess. Independent economic significance is present if the author of the reproduction is likely to make a profit out of the economic exploitation of the temporary copy. This profit has to be distinct from the efficiency gains that the technological process allows (see Infopaq II, 51).
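From an engineering standpoint, the transience condition discussed above corresponds to a familiar implementation pattern: the intermediate working copy exists only for the duration of the process and is destroyed automatically when it completes. A minimal sketch follows (hypothetical helper names; naturally, this is an illustration of the pattern, not a statement about legal sufficiency):

```python
import os
import tempfile

def process_transiently(raw_text, analyse):
    """Run `analyse` over a temporary working copy of the text and ensure
    the copy is deleted when the technological process ends, even on error."""
    with tempfile.NamedTemporaryFile("w+", suffix=".txt", delete=False) as f:
        f.write(raw_text)
        path = f.name
    try:
        with open(path) as f:
            return analyse(f.read())
    finally:
        os.remove(path)  # the temporary copy does not outlive the process

# Example: a purely statistical analysis (word count) of a protected text.
word_count = process_transiently("Some protected text.", lambda t: len(t.split()))
print(word_count)  # 3
```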
Point 3. above refers to the results of the training process (the creation of a model), which are permanent by definition. Therefore, point 3. cannot be exempted on the basis of acts of temporary reproduction. It must be assessed, however, whether the model constitutes a “reproduction in part” within the meaning of Art. 2 InfoSoc. If it does not, there is simply no copyright-relevant activity and thus no need to rely on an exception.
In the present scenario, the trained model contains at most three consecutive words of the original “all rights reserved” corpora. While the test to be applied is not 11 vs. 3 consecutive words, but that of the “author's own intellectual creation”, it seems plausible that three consecutive words are too insubstantial to constitute a “reproduction in part” of the original corpora. Therefore, the trained model does not reproduce the original corpora in part.
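The three-word upper bound can be made concrete with a small sketch. Assuming, hypothetically, that every windowed feature could be read back out of the model, the longest contiguous run of original text recoverable from any single feature is still only three words, regardless of corpus length:

```python
# Hypothetical read-out of a model trained with a +/-1 word context window:
# no single feature ever carries more than 3 consecutive words of the corpus.

def windowed_features(tokens):
    """All (left neighbour, word, right neighbour) triples seen in training."""
    padded = ["<S>"] + tokens + ["</S>"]
    return [tuple(padded[i - 1:i + 2]) for i in range(1, len(padded) - 1)]

corpus = "the quick brown fox jumps over the lazy dog".split()
features = windowed_features(corpus)

# Longest contiguous extract of the original reproduced by any one feature:
longest = max(sum(1 for w in triple if w not in ("<S>", "</S>"))
              for triple in features)
print(longest)  # 3
```

The bound holds by construction: widening the context window (or storing longer n-grams) would raise it, which is one reason different parametrisations may call for separate legal analysis.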
In conclusion, it can be argued that Art. 5(1) has the potential to be useful when the technological process is similar to the one described in this scenario. However, given the cumulative, strict and partially unclear conditions that qualify it, a very careful case-by-case assessment should be performed before deciding to rely on this exception, in view of the unavoidable degree of risk involved.
Are the results a derivative work?
The CJEU in Allposters clarified that the right of adaptation has not been the object of EU harmonisation. Nevertheless, it must be observed that cases where adaptation does not require the reproduction, at least in part, of the original work may be difficult to conceptualise (illustratively, jurisdictions such as France and the Netherlands classify the right of adaptation as a type of reproduction). That said, there are situations
where there is adaptation without reproduction. An obvious
example is the translation of a literary work into a different
language. Technically speaking, there is no direct reproduc-
tion of the sentences in the original language. Consequently,
it should not come as a surprise that the right of translation
was the first right to be included in the minimum standard
of protection in the oldest copyright international treaty, the
Berne Convention.
However, training a model does not seem to possess the characteristics of a translation. Therefore, excluding the cases where this activity constitutes a reproduction (see above), it should be ascertained whether and under which circumstances training a model creates an adaptation.
Definition of derivative works in legislation
At the international level, the Berne Convention grants copyright protection to translations, adaptations, arrangements of music and other alterations, without prejudice to the original work to which they refer (Berne Convention, Art. 2, 8 and 12).
The Convention contains no explicit definition of adaptation, arrangement or other alterations of a work, a definitional lacuna that has stimulated some debate over the possible meanings of the specific wording (cf. Ricketson and Ginsburg, 2006, p. 480). Some have suggested that adaptation means recasting a work from one format to another, whereas arrangement means modification within the same format (Goldstein and Hugenholtz, 2001, p. 252), while others have underlined how the effort of defining them may even be unnecessary, if the law treats them essentially the same in terms of protection (Chow and Lee, 2006, p. 181).
Illustratively, the notion of derivative work is instead explicitly defined by US law, which, under the Copyright Act (17 U.S. Code) § 106(2), regulates the exclusive right to prepare derivative works.
As already pointed out, EU law does not harmonise the right
of adaptation (except in the case of software and databases),
therefore a proper analysis should look at how this right is
regulated at the national level, thereby introducing an addi-
tional layer of complexity especially for scientific initiatives
which are often international.
The Court of Justice, in Allposters (Judgment of 22 January 2015, Art and Allposters International BV, Case C-419/13, ECLI:EU:C:2015:27), avoided defining derivative works by substantially referring to the right of reproduction, purporting that a new work incorporating a pre-existing protected work is “an alteration of the copy of the protected work, which provides a result closer to the original” and so constitutes “a new reproduction of that work” that remains within the exclusive rights of its right holder.
Furthermore, what seems determinative is that the new (often called secondary) work reproduces, adapts or alters what constitutes the intellectual creation of the pre-existing (primary) work while adding an “authorial contribution” (Margoni, 2014).
Given the lack of EU harmonisation of the right of adaptation, the analysis should focus on the domestic law of EU Member States (MS). This type of inquiry would need to be conducted across 28 jurisdictions with a depth of analysis that is not possible in the present paper. Nevertheless, it seems arguable that, as opposed to the broader US notion of derivative works, the EU counterparts tend to define adaptation in a narrower way, “even narrower than the original Berne formulation” (Bently and Sherman, 2014, p. 170, fn. 220). Domestic courts have held that adaptations must show “some quality or character which the raw material did not possess and which differentiates the product from the raw material” (Bently and Sherman, 2014, pp. 112–113, and Interlego v. Tyco Industries [1989] AC 217), and that, in order for the right to be infringed, the elaboration should reveal the pre-existing work in its own individuality (ex multis, Corte di Cassazione, 29 May 2003, n. 8597).
In the light of the above elements, which (it must be restated) constitute a merely superficial exploration of EU MS domestic law on adaptations, it seems at least arguable that when the elaboration (the trained model) neither reproduces the original (the corpora) nor reveals “its individuality”, no infringement should be found. From a policy point of view, training a model should be considered a free use. Future work should concentrate on this aspect.
6. Conclusion
This paper underlines the complexities in the relationship between copyright and science in the context of ML/NLP. The legal analysis has been based on three specific scenarios which all revolve around the task of training models for NER from annotated texts. The same legal principles can be applied to training models for other ML/NLP tasks (e.g. POS tagging), but depending on the specific variables the conclusions may differ. The conclusions of the three scenarios presented in this paper can be summarised as follows. The use of
corpora licensed under TM/NLP friendly licences such as
CC BY 4.0 guarantees that activities such as model training
are lawful. When no TM/NLP-friendly licence is present, the operation of certain exceptions to copyright (e.g. Art. 5(1) InfoSoc) can represent the only proper legal basis for proceeding with ML activities. Nevertheless, a
considerable level of uncertainty surrounds the applicability
of the exception for temporary uses of Art. 5(1) InfoSoc
and a proper analysis of each case should be performed
before relying on it. Still, once the aspect of reproduction is properly addressed, we suggest refraining from defining trained models as derivative/adapted works, with the consequence that licensing restrictions (e.g. all rights reserved, ND or SA) imposed on the input training resources may not find application in the resulting output. At the same
time, we acknowledge that the scope of TM/NLP is too
broad to be handled homogeneously and that different types
of algorithms and parametrisations require dedicated legal
analysis, for example based on the level of abstraction they
attain over the input data and the type of original material
that is reproduced in the trained model.
Acknowledgements
This work has received funding from the European Union's Horizon 2020 research and innovation programme (H2020-EINFRA-2014-2) under grant agreement No. 654021 (OpenMinTeD). It reflects only the authors' views and the EU is not liable for any use that may be made of the information contained therein. It was further supported by the German Federal Ministry of Education and Research (BMBF) under the promotional reference 01UG1816B (CEDIFOR) and by the German Research Foundation as part of the RTG Adaptive Preparation of Information from Heterogeneous Sources (AIPHES) under grant No. GRK 1994/1. Additional thanks go to Mark Perry for review and comments.
Bibliographical References
Agić, Ž. and Ljubešić, N. (2014). The SETimes.HR Linguistically Annotated Corpus of Croatian. In Nicoletta Calzolari, et al., editors, Proceedings of LREC 2014, pages 1724–1727, Reykjavik, Iceland. ELRA.
Benikova, D., Biemann, C., and Reznicek, M. (2014). NoSta-D Named Entity Annotation for German: Guidelines and Dataset. In Nicoletta Calzolari, et al., editors, Proceedings of LREC 2014, Reykjavik, Iceland. ELRA.
Bently, L. and Sherman, B. (2014). Intellectual Property
Law. Oxford University Press, 4th edition.
Borghi, M. and Karapapa, S. (2013). Copyright and Mass
Digitization: A cross-jurisdictional perspective. Oxford
University Press.
Chow, D. and Lee, E. (2006). International Intellectual
Property. Problems, Cases and Materials. West Aca-
demic Publishing.
Finkel, J. R., Grenager, T., and Manning, C. (2005). Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In Proceedings of ACL 2005, pages 363–370, Stroudsburg, PA, USA. ACL.
Firdhous, M. (2012). Automating legal research through
data mining. Int. Journal of Advanced Computer Science
and Applications, 1(6):9–16.
Goldstein, P. and Hugenholtz, P. B. (2001). International
Copyright: Principles, Law, and Practice. Oxford Uni-
versity Press.
Lucie Guibault et al., editors. (2013). Safe to be open: Study on the protection of research data and recommendations for access and usage. Universitätsverlag Göttingen.
Handke, C., Guibault, L., and Vallbé, J.-J. (2015). Is Europe Falling Behind? Copyright's Impact on Data Mining in Academic Research. In Birgit Schmidt et al., editors, New Avenues for Electronic Publishing in the Age of Infinite Collections and Citizen Science: Scale, Openness and Trust – Proceedings of Elpub 2015, pages 120–130.
Hugenholtz, P. (2016). Something Completely Different: Europe's Sui Generis Database Right, volume 37 of Information Law Series, pages 205–222. Kluwer Law Int.
Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S.,
and McClosky, D. (2014). The Stanford CoreNLP Natu-
ral Language Processing Toolkit. In Proceedings of ACL
2014: System Demonstrations, pages 55–60, Baltimore,
Maryland. ACL.
Margoni, T. (2014). The Digitisation of Cultural Heritage:
Originality, Derivative Works & (Non) Original Pho-
tographs. IVIR Institute for Inf. Law, pages 18–25.
Margoni, T. (2016). The harmonisation of EU copyright
law: The originality standard. In Mark Perry, editor,
Global Governance of Intellectual Property in the 21st
Century, pages 85–105. Springer, Switzerland.
Marimon, M., Bel, N., Fisas, B., Arias, B., Vázquez, S., Vivaldi, J., Morell, C., and Lorente, M. (2014). The IULA Spanish LSP Treebank. In Nicoletta Calzolari, et al., editors, Proceedings of LREC 2014, Reykjavik, Iceland. ELRA.
Payne, D. and Landry, B. J. L. (2012). A Composite Strategy for the Legal and Ethical Use of Data Mining. Int. Journal of Management, Knowledge and Learning.
Ricketson, S. and Ginsburg, J. C. (2006). International
Copyright and Neighbouring Rights. The Berne Conven-
tion and Beyond. Oxford University Press.
Schroeder, C. T. and Zeldes, A. (2016). Raiders of the lost
corpus. Digital Humanities Quarterly, 10(2).
Irina Stamatoudi et al., editors. (2000). Copyright in the
New Digital Environment: The Need to Redesign Copy-
right, volume 8 of Perspectives on Intellectual Property
Law. Sweet & Maxwell.
Triaille, J.-P., de Meeûs d'Argenteuil, J., and de Francquen, A. (2014). Study on the legal framework of text and data mining (TDM). Luxembourg: Publications Office.
Truyens, M. and Van Eecke, P. (2014). Legal aspects of text
mining. Computer Law & Security Rev., 30(2):153–170.
Tsiavos, P., Piperidis, S., Gavrilidou, M., Labropoulou, P.,
and Patrikakos, T. (2014). Legal framework of textual
data processing for Machine Translation and Language
Technology research and development activities. QTLP
report & wikibook.
Verspoor, K., Cohen, K. B., Lanfranchi, A., Warner, C.,
Johnson, H. L., Roeder, C., Choi, J. D., Funk, C.,
Malenkiy, Y., Eckert, M., Xue, N., Baumgartner, W. A.,
Bada, M., Palmer, M., and Hunter, L. E. (2012). A cor-
pus of full-text journal articles is a robust evaluation tool
for revealing differences in performance of biomedical
natural language processing tools. BMC Bioinformatics.
Zeldes, A. (2017). The GUM Corpus: Creating Multilayer Resources in the Classroom. Lang. Resour. Eval.
... The line where the original rights cease to apply has to be somewhere between these points, and researchers and developers need to know where. The authors present arguments that copyright and personal data restrictions covering language data usually do not affect language models. 1 The article develops further the previous legal research conducted in the field of language technologies (see Eckart de Castilho et al. 2018;Kelli et al. 2016;Kelli et al. 2018a;Ilin and Kelli 2019;Klavan et al. 2018). ...
... Some models (like a simple frequency list) may also be too simple or too limited in options (cf. Eckart de Castilho et al. 2018). In nontrivial cases, the de facto situation is that models are made available together with the research papers describing them and the software tools used in their creation. ...
... There is no uniform definition of derivative work at EU and international levels. Different jurisdictions have their approaches (for further discussion, see Birštonas and Usonienė, 2013;Eckart de Castilho et al. 2018). ...
Full-text available
The authors address the legal issues relating to the creation and use of language models. The article begins with an explanation of the development of language technologies. The authors analyse the technological process within the framework copyright, related rights and personal data protection law. The authors also cover commercial use of language models. The authors' main argument is that legal restrictions applicable to language data containing copyrighted material and personal data usually do not apply to language models. Language models are generally not considered derivative works. Due to a wide range of language models, this position is not absolute. This work is licenced under a Creative Commons Attribution 4.0 International Licence. Licence details:
... Any website with lists, prices, news, email addresses, etc., can have those items segregated as individual data elements and used for research or marketing purposes 23 . One organization, Lindat/Clariah-CZ, is even freely offering the data they scraped for use in NLP learning 24 . Because of this, the rights and duties for working with scraped web data are different from traditional works. ...
... Explanation and Examples | ParseHub" 2021) 22 Some websites only use standard HTML tags, others use CSS styles, and the most advanced use JSON-LD tagging for searching and other research capabilities. 23 ("What Is Web Scraping and How to Use It?" 2020; "What Is Web Scraping and What Is It Used For? | ParseHub" 2021)("What Is Web Scraping and How to Use It?" 2020) 24 ("License NLPC-WeC" n.d.) ...
Technical Report
Full-text available
Machines don't read works or data. Machines need to first abstract and then format data for learning and then apply tagging and other metadata to model the data into something the machine can "understand." Legal protections aren't purpose-built to allow machines to abstract data from a work, process it, model it, and then represent it. Most licenses aren't purpose-built for that either. This document walks the reader through all the known protections and licenses as to whether they cover machine learning practices. It then postulates a proposed license structure for that purpose. 1. The Problem Matt is writing a paper on real estate market trends. He starts with reading what already exists, organizing his thoughts as he goes, writing notes into one of the popular note taking applications. He uses Zotero bibliography tracker to create a bibliography of everything he's reading, and then uses his note taking software to copy certain web pages to his computer for offline reading and annotation 1. Then he adds to the information with his own research by checking sales records in available databases, creating tables of information he found in the records, and placing the data into his note taking software (all three shown below).
... 1 For cases where the source material is, like this paper, available under a Creative Commons license, seeEckart de Castilho et al. (2018) for guidance. ...
Conference Paper
Full-text available
N-grams are of utmost importance for modern linguistics and language technology. The legal status of n-grams, however, raises many practical questions. Traditionally, text snippets are considered copyrightable if they meet the originality criterion, but no clear indicators as to the minimum length of original snippets exist; moreover, the solutions adopted in some EU Member States (the paper cites German and French law as examples) are considerably different. Furthermore, recent developments in EU law (the CJEU's Pelham decision and the new right of press publishers) also provide interesting arguments in this debate. The paper presents the existing approaches to the legal protection of n-grams and tries to formulate some clear guidelines as to the length of n-grams that can be freely used and shared.
... Any text could be protected by copyright law and it is not always easy to find suitable corpora that are free from copyright issues. Indeed, the relationship between copyright of texts and their use in natural language processing is complex (Eckart de Castilho et al., 2018). Nonetheless, it pays off to make some effort by searching for corpora that are free or in the public domain (Ide et al., 2010). ...
Full-text available
Meaning banking--creating a semantically annotated corpus for the purpose of semantic parsing or generation--is a challenging task. It is quite simple to come up with a complex meaning representation, but it is hard to design a simple meaning representation that captures many nuances of meaning. This paper lists some lessons learned in nearly ten years of meaning annotation during the development of the Groningen Meaning Bank (Bos et al., 2017) and the Parallel Meaning Bank (Abzianidze et al., 2017). The paper's format is rather unconventional: there is no explicit related work, no methodology section, no results, and no discussion (and the current snippet is not an abstract but actually an introductory preface). Instead, its structure is inspired by work of Traum (2000) and Bender (2013). The list starts with a brief overview of the existing meaning banks (Section 1) and the rest of the items are roughly divided into three groups: corpus collection (Section 2 and 3, annotation methods (Section 4-11), and design of meaning representations (Section 12-30). We hope this overview will give inspiration and guidance in creating improved meaning banks in the future.
... Any text could be protected by copyright law and it is not always easy to find suitable corpora that are free from copyright issues. Indeed, the relationship between copyright of texts and their use in natural language processing is complex (Eckart de Castilho et al., 2018). Nonetheless, it pays off to make some effort by searching for corpora that are free or in the public domain (Ide et al., 2010). ...
Full-text available
There are two types of Contribution environments that have been widely written about in the last decade-closed environments controlled by the promulgator and open access environments seemingly controlled by everyone and no-one at the same time. In closed environments, the promulgator has the sole discretion to control both the intellectual property at hand and the integrity of that content.
Réalisée en collaboration avec le GIE Hopsis et les employés des Hospices Civils de Lyon (HCL), cette thèse a pour objectif de proposer une réflexion sur les contraintes et les enjeux liés à l’introduction d’outils d’aide à la décision dans le processus de travail de médecins lors de consultations médicales. Nos travaux se sont organisés autour de trois axes principaux. Une étude des outils actuellement employés pour soutenir le personnel soignant dans leurs processus de décision, qui nous a permis de mettre en évidence les limites des approches actuelles pour l’aide aux décisions médicales. En second lieu, une analyse du processus de décision de médecins travaillant aux HCL, qui nous a permis de mettre en évidence le besoin en informations des médecins afin de prendre des décisions concernant leurs patients. Et enfin, la proposition d’un outil d’aide à la décision, qui vise à l’apprentissage et à l’anticipation des besoins en informations des médecins durant leurs consultations médicales coutumières.
Advances in generative algorithms have enhanced the quality and accessibility of artificial intelligence (AI) as a tool in building synthetic datasets. By generating photorealistic images and videos, these networks can pose a major technological disruption to a broad range of industries from medical imaging to virtual reality. However, as artwork developed by generative algorithms and cognitive robotics enters the arena, the notion of human-driven creativity has been thoroughly tested. When creativity is automated by the programmer, in a style determined by the trainer, using features from information available in public and private datasets, who is the proprietary owner of the rights in AI-generated artworks and designs? This Perspective seeks to provide an answer by systematically exploring the key issues in copyright law that arise at each phase of artificial creativity, from programming to deployment. Ultimately, four guiding actions are established for artists, programmers and end users that utilize AI as a tool such that they may be appropriately awarded the necessary proprietary rights. As artists are beginning to employ deep learning techniques to create new and interesting art, questions arise about how copyright and ownership apply to those works. This Perspective discusses how artists, programmers and users can ensure clarity about the ownership of their creations.
Full-text available
The purpose of this paper is to explore the legal consequences of the digitisation of cultural heritage institutions' archives and in particular to establish whether digitisation processes involve the originality required to trigger new copyright or copyright-related protection. As the European Commission and many MS reported, copyright and in particular "photographers rights" are cause of legal uncertainty during digitisation processes. A major role in this legally uncertain field is played by the standard of originality which is one of the main requirements for copyright protection. Only when a subject matter achieves the requested level of originality, it can be considered a work of authorship. Therefore, a first key issue analysed in this study is whether – and under which conditions – digitisation activities can be considered to be original enough as to constitute works (usually a photographic work) in their own right. A second element of uncertainty is connected with the type of work eventually created by acts of digitisation. If the process of digitisation of a (protected) work can be considered authorial, then the resulting work will be a derivative composed by two works: the original work digitally reproduced and the – probably – photographic work reproducing it. Finally, a third element of uncertainty is found in the protection afforded to "other photographs" by the last sentence of Art. 6 Term Directive and implemented in a handful of European countries. Accordingly, the paper is structured as follows: Part I is dedicated to the analysis of copyright law key concepts such as the originality standard, the definition of derivative works and the forms of protection available in cases of digital (or film-based) representations of objects (photographs). The second part of the study is devoted to a survey of a selection of EU Member States in an attempt to verify how the general concepts identified in Part I are applied by national legislatures and courts. 
The selected countries are Germany, France, Spain, Italy, Poland, the Netherlands and the UK. The country analysis fulfils a double function: on the one hand it provides a specific overview of the national implementation of the solutions found at international and EU level. On the other hand, it constitutes the only possible approach in order to analyse the protection afforded by some MS to those "other photographs" (also called non original photographs or mere/simple photographs) provided for by the last sentence of Art. 6 Copyright Term Directive. Part III presents some conclusions and recommendations for cultural heritage institutions and for legislatures.
Full-text available
Coptic represents the last phase of the Egyptian language and is pivotal for a wide range of disciplines, such as linguistics, biblical studies, the history of Christianity, Egyptology, and ancient history. It was also essential for "cracking the code" of the Egyptian hieroglyphs. Although digital humanities has been hailed as distinctly interdisciplinary, enabling new forms of knowledge by combining multiple forms of disciplinary investigation, technical obtacles exist for creating a resource useful to both linguists and historians, for example. The nature of the language (outside of the Indo­European family) also requires its own approach. This paper will present some of the challenges ­­ both digital and material ­­ in creating an online, open source platform with a database and tools for digital research in Coptic. It will also propose standards and methodologies to move forward through those challenges. This paper should be of interest not only to scholars in Coptic but also others working on what are traditionally considered more "marginal" language groups in the pre­modern world, and researchers working with corpora that have been removed from their original ancient or medieval repositories and fragmented or dispersed. The dry desert of Egypt has preserved for centuries the parchment and papyri that provide us with a glimpse into the economy, literature, religion, and daily life of ancient Egyptians. During the Roman period of Egyptian history, many texts were written in the Coptic language. Coptic is the last phase of the ancient Egyptian language family and is derived ultimately from the ancient Egyptian hieroglyphs of the pharaonic era. Digital and computational methods hold promise for research in the many disciplines that use Coptic literature as primary sources: biblical studies, church history, Egyptology, linguistics, to name a few. Yet few digital resources exist to enable such research. 
This essay outlines the challenges to developing a digital corpus of Coptic texts for interdisciplinary research — challenges that are both material (arising from the history and politics of the physical corpus itself) and theoretical (arising from recent efforts to digitize the corpus). We also sketch out some solutions and possibilities, which we are developing in our project Coptic SCRIPTORIUM. Digital Humanities has defined itself as a field that can enable research on a new scale, whether distant reading of large text corpora, aggregation of large visual media collections, or enabling discovery in future querying and algorithmic research [Moretti 2013] [Greenhalgh 2008] [Witmore 2012]. Critical Digital Humanities scholars remind us that digitization initiatives sometimes replicate the Western canon rather than expand it, and that digitization is not in and of itself a more equitable mode of scholarship existing outside of politics [Wernimont 2013] [Wilkins 2012]. Digital tools and corpora for Coptic language and literature, we argue, can expand humanistic research not merely in terms of scale but also scope, especially in ancient studies and literature. Large English, Greek, and Latin corpora — as well as the tools to create, curate, and query them — have been foundational for work in the Digital Humanities. Computational studies on the documents from late antique Egypt can facilitate academic inquiry across traditional disciplines as well as transform our canon of Digital Classics and Digital Humanities scholarship.
Full-text available
The first European Union Directive in the field of copyright was enacted nearly 25 years ago. Similarly to many other directives that followed, that Directive was “vertical” in scope, meaning that its “harmonising” effects were limited to the specific subject matter therein regulated (in this case, software). Other examples of “vertical harmonisation” are found in the field of photographs and databases as well as in many other European Union directives in the field of copyright, making this fragmented approach a typical trait of European Union Copyright law harmonisation. The reason for what could be labelled ‘piecemeal legislation’ can be linked to the limited power that the European Union had, until recently, in regulating copyright. As it can be easily verified from their preambles, all European Union Copyright Directives are mainly grounded in the smooth functioning of the internal market. It is the internal market—rather than copyright—that has driven the harmonisation of European Union copyright law to date. Nevertheless, if we look at the entire body of European Union copyright law today (the so called acquis communautaire) it certainly appears much more harmonised than what may be suggested by the above. The reason for this “unexpected” situation can most likely be found in the fundamental role that the Court of Justice of the European Union has played in interpreting and—some would argue—in creating European Union copyright law. Using the example of the originality standard, this paper offers an overview of the past and current state of European Union copyright, of the case law that has allowed the Court of Justice of the European Union to develop and affirm its own concepts and indicates what could and should be expected for the future of European Union copyright law.
This paper presents the methodology, design principles and detailed evaluation of a new freely available multilayer corpus, collected and edited via classroom annotation using collaborative software. After briefly discussing corpus design for open, extensible corpora, five classroom annotation projects are presented, covering structural markup in TEI XML, multiple part-of-speech tagging, constituent and dependency parsing, information-structural and coreference annotation, and Rhetorical Structure Theory analysis. The layers are inspected for annotation quality, and together they coalesce into a richly annotated corpus that can be used to study the interactions between different levels of linguistic description. The evaluation indicates the quality to be expected of a corpus created by students with relatively little training. A multifactorial example study on lexical NP coreference likelihood is also presented, illustrating some applications of the corpus. The results of this project show that high-quality, richly annotated resources can be created effectively as part of a linguistics curriculum, opening new possibilities not just for research but also for corpora in linguistics pedagogy.
We describe the design and use of the Stanford CoreNLP toolkit, an extensible pipeline that provides core natural language analysis. The toolkit is widely used, both in the research NLP community and among commercial and government users of open-source NLP technology. We suggest that this popularity follows from a simple, approachable design, straightforward interfaces, the inclusion of robust, good-quality analysis components, and the avoidance of a large amount of associated baggage.
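The "simple, approachable design" the abstract refers to can be illustrated with a minimal sketch of the CoreNLP pipeline API: a `Properties` object selects the annotators, and a `CoreDocument` exposes the resulting annotations. This assumes the `stanford-corenlp` library (and its models) is on the classpath; the sample sentence is illustrative only.

```java
import java.util.Properties;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class PipelineExample {
    public static void main(String[] args) {
        // Configure a pipeline with tokenisation, sentence splitting,
        // and part-of-speech tagging.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // Annotate a document and read off the predicted labels.
        CoreDocument doc = new CoreDocument("Models predict labels on unseen text.");
        pipeline.annotate(doc);
        for (CoreLabel token : doc.tokens()) {
            System.out.println(token.word() + "\t" + token.tag());
        }
    }
}
```

Note that the pre-trained POS model loaded here is exactly the kind of resource whose legal status the main paper examines.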
Openness has become a common concept in a growing number of scientific and academic fields. Expressions such as Open Access (OA) or Open Content (OC) are often employed for publications of papers and research results, or are contained as conditions in tenders issued by a number of funding agencies. More recently the concept of Open Data (OD) is of growing interest in some fields, particularly those that produce large amounts of data – which are not usually protected by standard legal tools such as copyright. However, a thorough understanding of the meaning of Openness – especially its legal implications – is usually lacking. Open Access, Public Access, Open Content, Open Data, Public Domain. All these terms are often employed to indicate that a given paper, repository or database does not fall under the traditional “closed” scheme of default copyright rules. However, the differences between all these terms are often largely ignored or misrepresented, especially when the scientist in question is not familiar with the law generally and copyright in particular – a very common situation in all scientific fields. On 17 July 2012 the European Commission published its Communication to the European Parliament and the Council entitled “Towards better access to scientific information: Boosting the benefits of public investments in research”. As the Commission observes, “discussions of the scientific dissemination system have traditionally focused on access to scientific publications – journals and monographs. However, it is becoming increasingly important to improve access to research data (experimental results, observations and computer-generated information), which forms the basis for the quantitative analysis underpinning many scientific publications”. 
The Commission believes that fuller and wider access to scientific publications and data will accelerate the pace of innovation and foster collaboration among researchers, avoiding duplication of effort. Moreover, open research data will allow other researchers to build on previous research results and will enable the involvement of citizens and society in the scientific process. In the Communication the Commission makes explicit reference to open access models of publication and dissemination of research results, and the reference is not only to access and use but, most significantly, to the reuse of publications as well as research data. The Communication marks an official new step on the road to open access to publicly funded research results in science and the humanities in Europe. Scientific publications are no longer the only element of the open access policy: research data upon which publications are based should now also be made available to the public. As noble as the open access goal is, however, the expansion of the open access policy to publicly funded research data raises a number of legal and policy issues that are often distinct from those concerning the publication of scientific articles and monographs. Since open access to research data – rather than publications – is a relatively new policy objective, less attention has been paid to the specific features of research data. An analysis of the legal status of such data, and of how to make it available under the correct licence terms, is therefore the subject of the following sections.
This book details the history and development of the major international agreements affecting copyright and related rights. In particular, it examines the interpretation and application of the following conventions: the Berne Convention for the Protection of Literary and Artistic Works 1886–1970; the Rome Convention for the Protection of Performers, Producers of Phonograms and Broadcasting Organizations 1961; the WIPO Copyright and Performances and Phonograms Treaties 1996; and the TRIPS Agreement (so far as it affects copyright and related rights). Doctrinal analysis exposes gaps and ambiguities in the current text of the Berne Convention and considers the extent to which subsequent international instruments have resolved those questions. Issues concerning new technologies and digital networks thus receive in-depth treatment. The book also addresses questions of subject matter coverage, copyright ownership, duration, nature and scope of rights, and exceptions and limitations to copyright protection. Moreover, it considers private international law matters, looking at problems of international jurisdiction and choice of law.