
Semantic Similarity based Clustering of License Excerpts for Improved End-User Interpretation


Najmeh Mousavi Nejad
University of Bonn / Fraunhofer IAIS
Simon Scerri
University of Bonn / Fraunhofer IAIS
Sören Auer
University of Bonn / Fraunhofer IAIS
ABSTRACT
With the omnipresent availability and use of cloud services, software tools, Web portals or services, legal contracts in the form of license agreements or terms and conditions regulating their use are of paramount importance. Often the textual documents describing these regulations comprise many pages and cannot reasonably be assumed to be read and understood by humans. In this work, we describe a method for extracting and clustering relevant parts of such documents, including permissions, obligations, and prohibitions. The clustering is based on semantic similarity, employing a distributional semantics approach on a large word embeddings database. An evaluation shows that it can significantly improve human comprehension and that improved feature-based clustering has the potential to further reduce the time required for EULA digestion. Our implementation is available as a web service, which can directly be used to process and prepare legal usage contracts.
CCS CONCEPTS
• Information systems → Information extraction; Clustering and classification; • Computing methodologies → Natural language processing;

KEYWORDS
End-User License Agreements; EULA; Semantic Similarity; Clustering; Distributional Approach; Word Embeddings
ACM Reference format:
Najmeh Mousavi Nejad, Simon Scerri, and Sören Auer. 2018. Semantic
Similarity based Clustering of License Excerpts for Improved End-User
Interpretation. In Proceedings of ACM Conference, Washington, DC, USA,
July 2017 (Conference’17), 8 pages.
1 INTRODUCTION
The ever increasing use of online services, Web portals and software tools has also resulted in a proliferation of extensive legal documents governing their use. Users tend to ignore these documents due to the time required for reading them and the cumbersome legal language.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Conference’17, July 2017, Washington, DC, USA
©2018 Association for Computing Machinery.
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM. . . $15.00
However, it is crucial to be aware of the policies associated with online services and software tools. Each service has a set of terms and conditions, often called an End-User License Agreement (EULA), which must be agreed to by users prior to their usage of that service. Unsurprisingly, based on recent surveys, it is very common for people to ignore EULAs. According to online research commissioned by Skandia, only 7% read online EULAs when signing up for products and services. Of those who were surveyed, 21% said that they felt uncomfortable as a result of ticking an EULA acceptance box without reading the document: 10% found themselves locked into a longer-term contract than they expected and 5% lost money by not being able to cancel or amend subscriptions and bookings.
In order to ease the process of license analysis for users, we can apply text mining techniques to extract important segments and sentences from EULAs and provide a compressed and summarized version to end-users. We introduce a novel approach that exploits the semantic similarity between short text excerpts to cluster similar policies in EULAs. In a previous phase of our project, the important segments and sentences were extracted from natural language licenses and were categorized into three important classes (i.e., permissions, duties and prohibitions). During our experiments, we noticed that some sentences in the classes have similar or close meaning to each other and can therefore be categorized together. Table 1 shows some similar extracted excerpts from the Apache License and the Apple Website EULA. The colored words in this table have close or similar meanings and therefore the segments can be grouped together.
It should be claried here that our approach does not intend
to remove any extracted segment from similar ones, since this
may lead to losing vital information in EULAs. Instead our goal
is to provide a brief summary for each cluster. If the end-user is
concerned about a specic policy, they can browse the list of items
in each cluster and see the details.
In this study, we have taken the extracted excerpts as the input and extracted key features from them in order to perform clustering. The key features include the action of the policy (e.g., "copy", "share", "remove", etc.); the condition on which a specific action is granted, forbidden or obliged; and the type of policy, which can be a "copyright", "patent" or "intellectual property right". After feature extraction, a semantic similarity framework based on word embeddings is used to compute the similarity between the different features of each class; then, by summation of the feature similarities, the final similarity score for
Table 1: Example of Similar Extracted Excerpts
Apache Duty 1
You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of the
Derivative Works.
Apache Duty 2
If the Work includes a NOTICE text file as part of its
distribution, then any Derivative Works that You distribute
must include a readable copy of the attribution notices
contained within such NOTICE file, excluding those notices
that do not pertain to any part of the Derivative Works.
Apple Prohibition 1
You may not attempt to gain unauthorized access to any
portion or feature of the Site, or any other systems or
networks connected to the Site or to any Apple server, or to
any of the services oered on or through the Site, by hacking,
password “mining” or any other illegitimate means.
Apple Prohibition 2
You may not use any “deep-link”, [...] to access, acquire, copy
or monitor any portion of the Site or [...] or any Content, to
obtain or attempt to obtain any materials, documents or
information through any means not purposely made available
through the Site.
each pair is calculated and a symmetric similarity matrix is built
for each class. Finally, the most similar segments in each class
are grouped together with a hierarchical agglomerative clustering
(HAC) algorithm and the procedure goes on until a certain threshold
is reached. Our contributions are in particular:
Identication of crucial features with JAPE grammar rules [
specically tailored for EULAs.
Creation of a large domain-specic word embedding data-
base with 1,000 EULAs.
Implementation of a license analysis framework integrating
dierent APIs including GATE [
], Stanford CoreNLP [
and DISCO [13], which comprises a web API and UI.
A comprehensive evaluation of our approach benchmarking
it against human judgment.
The rest of the paper is organized as follows: we provide a short literature review in Section 2; in Section 3 we explain our approach and its implementation in the EULAide service; Section 4 evaluates the clustering alternatives as well as the usability of the EULAide service; and Section 5 concludes with a list of possible directions for future work.
2 RELATED WORK
An in-depth literature survey revealed that there is no automated approach for the required EULA interpretation and summarization to support end-users with their comprehension. Previous related efforts have focused on general document clustering and summarization, and did not consider the legal references and characteristics of EULAs. The only comparable service assisting users to understand common licenses is based on a crowdsourcing strategy and is manually supported by experts in the field. Our implemented methods in EULAide attempt to eliminate the dependency on human involvement.
Over the years, many clustering methods have been proposed for the summarisation of texts. Survey papers such as [ ] indicate that a large variety of methods can be pursued. Efforts investigating semantic similarity for summarisation purposes have either relied on a corpus-based approach or have implemented a new algorithm to compute the semantic similarity between words and short texts. Kenter & Rijke investigated whether it is possible to compute similarity between two short texts relying on just semantic features of the texts [ ], using machine learning methods. Tsatsaronis et al. proposed a new approach for computing semantic relatedness between words based on a word thesaurus [ ]. A framework presented in [ ] proposes semantic role labeling for the summarization task. In a similar study, R. M. Aliguliyev [ ] proposed a sentence-clustering approach for extractive document summarization, using an evolutionary algorithm and an appropriate ranking function.
Although examples like the above abound, the uniqueness of our target subject data precludes any method from being considered superior. EULAs are legal contracts and therefore have a more or less defined structure and terminology. Due to the nature of the target subject data, we have considered semantics-based clustering of extracts as a base for an intuitive Web user interface that supports human analysis of licenses. Rather than starting from scratch, we rely on an ontology-based Information Extraction (OBIE) method to extract specific excerpts from license documents. The relevant OBIE method [ ] relies on the Open Digital Rights Language (ODRL), a mature and comprehensive vocabulary developed by a dedicated W3C Community Group. ODRL was designed specifically for digital content and its adequacy has been validated through a number of related efforts in the field [ ]. The existing OBIE method considers a number of classes from ODRL, namely the Rule class and its three subclasses: Permission, Prohibition and Duty. Properties of the Rule superclass include action (an operation relating to the asset) and constraints (constraints which affect the validity of actions). The OBIE method was implemented as a GATE OBIE pipeline, and is available for further adaptation and extension. It can take legal text as input and generate three annotation types based on the above subclasses. The pipeline, which comprises common NLP tasks, an ontology-based gazetteer as well as a JAPE transducer for processing the text, obtained an F-measure of over 70%, which is satisfactory since human inter-annotator agreement for the same task was calculated at 90%.
The above OBIE method is taken as a basis for our approach in this paper. However, a number of shortcomings have been identified and addressed. In particular, the JAPE rules were extended to extract major features of policies for the clustering task. Moreover, since the ontology-based gazetteer in the pipeline annotates the concepts based on the root attribute of tokens, the morphological analyzer of the pipeline was improved to detect the stems more
precisely. The next section provides more details regarding this enhancement.
Given that the OBIE results are satisfactory, we then consider an appropriate clustering method based on the input data: sets of extracted excerpts together with a number of extracted features. In recent years, distributional semantic approaches have been extensively studied for word and text similarity [ ]. Distributional semantic approaches benefit from the observation that words in similar contexts tend to have similar meanings. A very popular distributional approach is word2vec and its extension paragraph2vec, which produces paragraph embeddings using a deep learning technique [ ]. However, since the authors have not released the source code and it was not possible to recreate their method due to incomplete descriptions, we could not implement their approach. Alternatively, DISCO is an open source Java application which retrieves the semantic similarity between short texts and phrases [ ]. In addition, it allows users to build their own word embedding database from a text corpus. Over the years, DISCO has received high community endorsement [ ]. Thus, our approach attempts to combine the extended OBIE method with the DISCO clustering method for EULA summarisation.
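DISCO itself is a Java library; as a language-agnostic illustration of the underlying distributional idea (not the authors' implementation), the following Python sketch scores two short excerpts by averaging per-word vectors and taking the cosine of the result. The three-dimensional toy vectors are invented for the example; in EULAide they would come from the word space built over the EULA corpus.

```python
import math

# Toy word vectors standing in for a real word-embedding database
# (in EULAide these would come from the DISCO word space).
VECTORS = {
    "copy":      [0.9, 0.1, 0.0],
    "reproduce": [0.8, 0.2, 0.1],
    "remove":    [0.1, 0.9, 0.2],
    "notice":    [0.0, 0.3, 0.9],
}

def text_vector(text):
    """Average the vectors of the known words in a short text."""
    vecs = [VECTORS[w] for w in text.lower().split() if w in VECTORS]
    if not vecs:
        return None
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def short_text_similarity(t1, t2):
    """Distributional similarity between two short excerpts."""
    v1, v2 = text_vector(t1), text_vector(t2)
    if v1 is None or v2 is None:
        return 0.0
    return cosine(v1, v2)

print(short_text_similarity("copy the notice", "reproduce the notice"))
```

With such a measure, "copy" scores much closer to "reproduce" than to "remove", which is exactly the behavior the clustering relies on.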
3 APPROACH
In this section we explain our approach and its implementation in EULAide. Figure 1 shows the architecture of the framework. Each of the following subsections will explain how each contribution fits within this framework, i.e., i) modifications to the existing GATE OBIE pipeline, ii) word space creation for the proposed clustering method and iii) the EULAide service itself. The latter fits within the front-end, the former two within the back-end shown.
3.1 Modied OBIE pipeline for EULAs
Our eorts are based on modications to the cited GATE OBIE
pipeline tailored for processing EULAs. In an empirical study of 20
common licenses, we observed that many policy excerpts returned
for each class could be thematically grouped into clusters. As an
example, Table 2 shows three segments which have been extracted
for the Apache License. The colored words have
the same or very similar meaning and can therefore be grouped
together. After clustering, users can focus on a summary for each
cluster and if interested in that specic policy; browse all segments
in each cluster.
Table 2: Example of Annotated Permissions in Apache
You may reproduce, prepare Derivative Works of, publicly
display, publicly perform, sublicense, and distribute the Work
and such Derivative Works in Source or Object form.
You may reproduce and distribute copies of the Work or
Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form.
You may add Your own attribution notices within Derivative
Works that You distribute, alongside or as an addendum to the
NOTICE text from the Work.
Table 3: Example of Feature Extraction

Excerpt: "If you join a Dropbox for Business account, use it in compliance with your employer's terms and conditions" — annotated features: condition, action

Excerpt: "each Contributor grants to You a patent license to make, use, sell, import, and transfer (the Work)" — annotated features: type of policy, action
As mentioned in the previous section, the GATE OBIE pipeline component has been significantly improved. The improvements include: i) extracting specific features from annotation types (e.g., actions, conditions and policy types), ii) improving the GATE morphological analyzer and iii) adding the Stanford dependency parser [ ] to the end of the pipeline for extracting the main object of each excerpt. The extracted features carry important information in EULAs and play an important role in clustering similar segments. Features include the sequence of actions for each segment, e.g., 'copy, reproduce', 'share', 'remove', etc.; the condition on which a specific action is granted, forbidden or obliged; and the type of policy, which can be a 'copyright', 'patent' or 'intellectual property right'. Table 3 shows two examples of expected results of the feature extraction phase.
In order to extract features more precisely, we added some rules to the GATE morphological processing resource. This resource specifies the root of each token and, in most cases, the stems of nouns are identified almost as the original noun itself, e.g., the lemmas for 'distributions', 'attribution' or 'attachment' are 'distribution', 'attribution' and 'attachment'. In this regard, the OBIE pipeline cannot relate these words to the ontology concepts, because the ontology-based gazetteer annotates the text based on the root of each token. Consequently, the JAPE rules will fail to extract these tokens as features. However, after customizing the morphological analyzer, the accuracy of stem identification has improved significantly.
The number of concepts annotated by the ontology-based gazetteer has increased from 9,630 to 9,927 for 20 licenses. As a result, the pipeline is now, for example, able to extract the relevant action in: "Activities other than distribution of the Work are not covered by this license..."
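The kind of stemming rule added here can be illustrated with a small sketch. The suffix rules below are our own illustrative assumptions, not the actual GATE/JAPE rules: nominalizations are reduced to the verb roots that the ontology-based gazetteer can actually match.

```python
# Illustrative suffix rules mapping nominalizations to verb roots,
# so that tokens like "distributions" can match an ontology concept
# such as "distribute". These are examples, not the paper's actual
# GATE morphological rules.
SUFFIX_RULES = [
    ("utions", "ute"),  # distributions -> distribute
    ("ution",  "ute"),  # attribution   -> attribute
    ("ments",  ""),     # attachments   -> attach
    ("ment",   ""),     # attachment    -> attach
]

def to_verb_root(token):
    """Reduce a (possibly plural) nominalization to a verb root."""
    t = token.lower()
    for suffix, replacement in SUFFIX_RULES:
        if t.endswith(suffix):
            return t[: -len(suffix)] + replacement
    return t

print(to_verb_root("distributions"))  # prints distribute
```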
In addition, the rationale for adding a dependency parser to the end of the pipeline was to resolve the main object of each excerpt, in order to generate a short and simple summary for each cluster by concatenating the 'action' and 'object' of all segments in one cluster. It is worth mentioning that integrating GATE with the Stanford NLP tools was quite a challenge, since both have defined quite different structures for annotations and related concepts in their APIs.
Once the three annotation type classes are built with the respective features, they are passed to a semantic similarity measurement
component. This component builds a symmetric matrix for each
class and passes it to the clustering algorithm. Finally, the clustering
Figure 1: Architecture of the EULAide system
component groups the segments based on their similarities, and the resulting clusters are shown to the end-user. The next section provides more details regarding the similarity computation and its usage for text clustering.
3.2 Word Space Creation & Semantic Similarity
Out of a number of available open source tools supporting corpus-based semantic similarity between short texts, the DISCO API [ ] coupled with its word space builder was identified as the most appropriate for our Java-based application. DISCO stands for extracting DIstributionally related words using CO-occurrences. Distributional approaches need a large corpus to build the word embeddings database. In order to create a domain-specific word space for our approach, we used a dataset comprising 1,000 EULAs [ ]. The dataset is passed to a method which generates a lemmatized text file consisting of three columns: token, part-of-speech tag, and base form (lemma). We execute the DISCO builder with the default configuration on the lemmatized file. The builder's output contains word vectors for each token. Finally, DISCO takes the word space and two short texts as input and generates a real value between zero and one indicating their semantic similarity.
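The three-column input file for the word space builder can be sketched as follows; the tiny lookup table is a hand-rolled stand-in for the real POS tagging and lemmatization step, which the paper does not specify in detail.

```python
# Write a token / POS-tag / lemma file of the shape consumed by the
# word space builder. The lookup table below is a toy stand-in for a
# real POS tagger and lemmatizer.
TOY_ANALYSES = {
    "you":        ("PRP", "you"),
    "may":        ("MD",  "may"),
    "distribute": ("VB",  "distribute"),
    "copies":     ("NNS", "copy"),
}

def lemmatize_corpus(tokens, path):
    """Emit one tab-separated line per token: token, POS tag, lemma."""
    with open(path, "w", encoding="utf-8") as f:
        for tok in tokens:
            pos, lemma = TOY_ANALYSES.get(tok.lower(), ("NN", tok.lower()))
            f.write(f"{tok}\t{pos}\t{lemma}\n")

lemmatize_corpus(["You", "may", "distribute", "copies"], "corpus.lemma.txt")
```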
Algorithm 1 shows a sketch of our clustering approach. The semantic similarity is computed for all features of the extracted segments and the final similarity score is calculated by summing all four values. Once we have obtained the similarity matrix for each class, a clustering algorithm is required to group similar segments. Hierarchical agglomerative clustering (HAC) is an established, well-known technique which has been shown to be a successful method for text and document clustering [ ]. Furthermore, among different HAC methods, average linkage has proven to be the most suitable one for text categorization [ ]. Once the proper clustering technique is identified, we can pass the similarity matrices to the clustering component. The HAC process continues until it reaches a pre-defined threshold.
3.3 EULAide Framework and Web Service
We developed a comprehensive license analysis framework, EULAide, which also provides a Web service interface. As represented in the architecture, the client side Web interface,
Algorithm 1 Sketch of Semantic Clustering Algorithm
Require: permissions, prohibitions, duties with features
1: for the three classes do
2:   for all pairs in each class do
3:     A ← similarity between actions
4:     B ← similarity between conditions
5:     C ← similarity between policy types
6:     D ← similarity between the remainders of segments
7:     finalSim ← A + B + C + D
8:     add finalSim to the corresponding matrix cell
9:   end for
10:  do HAC clustering for the matrix with a threshold
11: end for
implemented in JavaScript using the AngularJS framework, sends a request to the back-end service and receives a JSON or XML object if the request is valid. In the back-end, several open-source APIs are orchestrated to provide an accurate result.
Figure 2 shows an example of EULAide output, applied to the Google terms of service. The number of excerpts extracted by the OBIE pipeline is 14, whereas the number of clusters is reduced to 9. The head of each accordion (block) contains the concatenation of 'action' and 'object' (extracted by the Stanford dependency parser) of all members in the cluster, which could be replaced with a more sophisticated summarization algorithm in the future. As represented in the figure, the numbered excerpts are grouped together. In addition, there is a tooltip "more info" at the bottom of each block which the user can hover over to see the complete paragraph regarding that policy. The user can also see all the details by clicking on the "Open All" button or choose to expand a specific accordion by clicking on the corresponding header (e.g., summary). Since EULAide is implemented using a two-layer architecture, it is platform independent, e.g., any client such as a mobile app can easily communicate with our server.
In order to evaluate the eciency of EULAide as a summarization
tool, three types of experiments were carried out. First we designed
a test to measure the appropriateness of our clustering approach.
Second we created an experiment to verify two hypotheses:
EULAide needs less time and eort for EULA comprehension;
Figure 2: EULAide Platform Web interface showing the permission duty and prohibition clusters for a user provided EULA
ii) semi-automatic information extraction and summarization may lead to information loss.
As the last experiment, the usefulness of EULAide as perceived by users was estimated through a common usability test. The first experiment is presented in subsection 4.1 and the second and third experiments are reported jointly in subsection 4.2.
4.1 The Clustering Approach Evaluation
4.1.1 Setup
The evaluation carried out investigates two hypotheses:
- to identify whether the proposed clustering method based on the OBIE-derived classes is generally useful (i.e., helps readers better digest and comprehend EULAs); and
- to identify whether the devised feature-extraction method offers improved results.
When available, the quality of clustering methods is best measured by comparing machine-generated clusters with a reliable gold standard. However, considering the complexity and broadness of EULAs, it is very difficult to obtain or compile a suitable gold standard that is agreed and accepted by a majority. Therefore, as an alternative method, we designed an experiment that compares clustering preferences between human evaluators (to approximate inter-annotator agreement), and compares them with those generated computationally. To determine the usefulness of feature extraction, a second machine-generated clustering was included for a secondary comparison. The input data for the evaluation consisted of a number of instances for the three EULA classes prohibitions, duties and permissions from four carefully-selected EULAs. The choice of the latter considered both a good balance between the three identified classes as well as sufficient brevity. EULAs that were very long were not suitable for this task, since the human annotators' clustering
Table 4: Total Extracted Instances & Machine-generated
Clusters with (Mf) and without (M) Considering Features
#Instances #Clusters-M #Clusters-Mf
Permission 30 18 20
Duty 27 24 20
Prohibition 40 32 35
task increases signicantly in terms of complexity. The selected
EULAs yielded the total number of class instances shown in Table 4.
Here, one can note that the feature-based clustering method yields
a marginally higher amount of clusters for two of the three classes
). A higher amount of clusters
can be interpreted as a more ne-grained result. However, this can
only be conrmed by comparison of these results with those of our
human evaluators.
To carry out the human evaluation, five subjects were asked to cluster the OBIE-derived excerpts of the three classes within the context of each selected EULA. Since the target users of EULAide are regular people who are not expected to be particularly acquainted with legal text and jargon, the five volunteers selected have different levels of higher education (under- and postgraduates) and were selected from a university campus. The only confirmed common interest between the individuals is an understanding of the need for EULA summarization methods such as the ones we propose. At the same time, to ensure that the task is properly and equally understood by all evaluators, an introduction to the EULAide tool, its vision, goals and the relevant concepts behind the input data was provided. However, the evaluators were not given instructions on how to cluster the results but were rather asked to devise their own clustering criteria as they best deemed fit. It was also explained to them that the input excerpts were
Table 5: Clustering Results for Permissions
h1 h2 h3 h4 h5 M Mf
h1 (1) 0.9 0.65 0.83 0.9 0.86 0.8
h2 *** (1) 0.64 0.93 0.9 0.85 0.89
h3 *** *** (1) 0.65 0.65 0.7 0.72
h4 *** *** *** (1) 0.83 0.91 0.78
h5 *** *** *** *** (1) 0.85 0.88
semi-automatically extracted and could therefore contain some errors. The EULAs they considered contained an average of 7.5 permissions, 10 prohibitions and 6.7 duties.
4.1.2 Human-Machine Comparisons & Discussion
The above experiment yielded two result sets: i) the two machine-generated clusterings and ii) the five human-generated clusterings. In order to compare the two sets, we considered common methods for measuring clustering quality, i.e., the Rand index and F-measure [16].
Applying the F-measure to clustering methods is similar to other information retrieval approaches, and considers the (dis-)similarity of excerpts within clusters as a base for true/false positives/negatives. However, the F-measure disregards true negatives, i.e., "the occurrence of dissimilar excerpts in different clusters". Thus, it does not take into account the proportion of correct non-clustering of unrelated excerpts, which is also very important in measuring the success criteria for this task.
Alternatively, the Rand index formula shown in Equation 1 is used to measure the percentage of correct decisions (accuracy), which is more appropriate for our evaluation because it also factors in true negatives.

RI = (TP + TN) / (TP + FP + FN + TN)    (1)

The interpretation of positive or negative decisions in clustering is drawn from decision series theory in arithmetic. Having N elements, the space of all pairs of elements is computed by Equation 2.

N (N − 1) / 2    (2)
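As a concrete check of Equation 1, a small Python computation of the Rand index over two clusterings, iterating over all N(N−1)/2 pairs and counting agreement on "same cluster" versus "different cluster", might look like this:

```python
from itertools import combinations

def rand_index(clustering_a, clustering_b):
    """Rand index between two clusterings, each given as a dict mapping
    element -> cluster label. Counts agreements (TP + TN) over all
    N(N-1)/2 pairwise decisions, as in Equations 1 and 2."""
    elements = sorted(clustering_a)
    tp = tn = fp = fn = 0
    for x, y in combinations(elements, 2):
        same_a = clustering_a[x] == clustering_a[y]
        same_b = clustering_b[x] == clustering_b[y]
        if same_a and same_b:
            tp += 1
        elif not same_a and not same_b:
            tn += 1
        elif same_b:  # together in B but not in A
            fp += 1
        else:         # together in A but not in B
            fn += 1
    return (tp + tn) / (tp + fp + fn + tn)

# Identical clusterings give RI = 1.0; these two disagree on element "d".
a = {"a": 1, "b": 1, "c": 2, "d": 2}
b = {"a": 1, "b": 1, "c": 2, "d": 1}
print(rand_index(a, b))  # prints 0.5
```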
In our result set we do not have one 'correct' clustering, but rather five subjective variations from each evaluator. Therefore, we applied the Rand index to calculate the cross-accuracy of clusters, using each evaluator's clusters in turn as the correct standard. We then simply considered the two machine-generated clusterings as alternate clusterings, and extended the cross-accuracy computations to cover both result sets. These computations generate a (symmetric) matrix of results, separated by class (permissions, duties and prohibitions), as shown in Table 5, Table 6 and Table 7.
The ve human evaluators are enumerated h1-h5 and the two
machine-generated clusters as M (without feature inclusion, i.e.,
passing the entire excerpts to the algorithm) and Mf (passing the
entire excerpts with annotated features for feature-based cluster-
The results shown per class indicate that there is a high-level
of human agreement (ranging from a low of 0.61 to a high of 0.95).
Table 6: Clustering Results for Duties
h1 h2 h3 h4 h5 M Mf
h1 (1) 0.71 0.89 0.91 0.71 0.93 0.97
h2 *** (1) 0.61 0.63 0.95 0.7 0.68
h3 *** *** (1) 0.92 0.63 0.86 0.86
h4 *** *** *** (1) 0.65 0.85 0.88
h5 *** *** *** *** (1) 0.7 0.67
Table 7: Clustering Results for Prohibitions
h1 h2 h3 h4 h5 M Mf
h1 (1) 0.95 0.75 0.85 0.89 0.87 0.9
h2 *** (1) 0.75 0.85 0.92 0.91 0.92
h3 *** *** (1) 0.73 0.76 0.76 0.75
h4 *** *** *** (1) 0.89 0.8 0.81
h5 *** *** *** *** (1) 0.86 0.86
Table 8: Average Results
Human M Mf
Permission 0.79 0.83 0.81
Duty 0.76 0.81 0.81
Prohibition 0.83 0.84 0.85
There is also a high level of human-machine agreement (ranging from a low of 0.67 to a high of 0.97). This indicates that, in general, there was higher disagreement between humans than between machine and humans. To further help with interpreting the above results, Table 8 lists the agreement between the evaluators (Human), between the benchmark method and the evaluators (M) and between the feature-based clustering and the evaluators (Mf); once again organised per class.
The above results indicate that the applied clustering method based on the OBIE-derived instances is relatively successful, considering the level of human disagreement when performing the same task manually; but that there is no significant difference between the two clustering methods. On the other hand, according to Table 8, the feature-based results are closer to human agreement.
The accumulated deviation of Mf from the human agreement is 9%, whereas the aggregated deviation of M is 10%. Although the difference is minor, the results reconfirm our previous assumption that the feature-based approach generates more fine-grained clusters and is more attuned to human intuition and perception. While we are encouraged by the former result, we see the potential to further improve the feature-based clustering algorithm. The said features are already being used to improve the results shown in EULAide: the cluster titles shown in Figure 2 are based on the derived actions.
To improve the results of the feature-based method, we will strive to extract the EULA-tailored features more accurately. Since all of them are identified by semi-automatic rules, they are prone to some error. Increasing the accuracy of the feature extraction phase will lead to better results.
4.2 EULAide Usability Experiments
4.2.1 Setup
In order to conduct our evaluation, the four EULAs from our previous experiment were again chosen for the current task. A legal expert was asked to design five multiple choice questions for each EULA (i.e., twenty in total). All questions are related to the three policy classes. Afterwards, six people from the university campus were selected to take part in our experiment. Each person studied all four EULAs: they had to read two EULAs in natural text and exploit the EULAide service for the other two EULAs. With this setting, each EULA in natural text was read by three different students. Similarly, each EULA was browsed and studied with EULAide by three individuals. The participants were aware that their tasks were related to the licenses' policies and they were asked to understand and digest each EULA in order to answer the questions.
The question-answering phase was split into two stages: first, the
participants had to answer the questions from memory,
without looking at the EULA; second, if they were unsure about some
questions, they could check the EULA and use search tools to find
the answer. The rationale behind this setting was to measure how
well users can remember policies and also how fast they can search
for information in the EULA. The primary purpose of EULAide is
to provide an overview of an EULA before accepting it. In practice,
when one agrees to terms and conditions, they should try
to remember the important parts of the license agreement in order to
avoid infringing its regulations.
4.2.2 Evaluation & Discussion
In this section, we report the average results of our ex-
periment, i.e., for each EULA, in either natural text mode or
EULAide mode, the average of three values (corresponding to the three
participants) was computed. This yields eight average
values: four EULAs in natural text mode plus the same four EULAs
in EULAide mode. For simplicity and to avoid confusion,
only the average value of each mode is presented here.
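The aggregation just described can be sketched as follows. The per-participant measurements below are invented for illustration (chosen so the mode average matches the 1185-second reading time reported in Table 9), but the two-level averaging scheme follows the text.

```python
# Two-level averaging: per (EULA, mode) cell we average three participants'
# measurements, then average the four per-EULA values into one figure per mode.
def mode_average(cells):
    # cells: four EULAs, each a list of three per-participant values
    per_eula = [sum(c) / len(c) for c in cells]
    return sum(per_eula) / len(per_eula)

# Illustrative reading times (seconds) for the full-text mode.
full_text_times = [
    [1100, 1250, 1180],
    [1230, 1150, 1210],
    [1120, 1260, 1170],
    [1150, 1200, 1200],
]
print(round(mode_average(full_text_times)))  # 1185
```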
Table 9 shows the average time for each step of the experiment.
According to the table, using EULAide to study and understand
an EULA takes significantly less time than reading the EULA in natural
text, which was indeed an expected result. Furthermore, as already
mentioned, we divided the answering phase into two
stages: the first is based on memory and the second allows the
users to search for the answer in the EULA. Not surprisingly, the
average times of the first stage are very similar, because regardless of
the mode (natural text or EULAide), reading the questions and
their multiple-choice answers takes roughly equal time. However,
in the second stage, using EULAide to search for answers is one
minute and 15 seconds faster than finding the desired information in
the natural text license. Once again, this was an expected outcome,
since searching in a structured text is simpler.
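The arithmetic behind these timing observations can be reproduced directly from the per-phase averages reported in Table 9:

```python
# Per-phase average times (seconds) from Table 9.
full = {"reading": 1185, "answer_phase1": 75, "answer_phase2": 152}
eulaide = {"reading": 315, "answer_phase1": 72, "answer_phase2": 77}

# The search-assisted second phase is 75 s (one minute and 15 seconds) faster.
assert full["answer_phase2"] - eulaide["answer_phase2"] == 75

# End to end, EULAide is roughly three times faster.
total_full = sum(full.values())        # 1412 s
total_eulaide = sum(eulaide.values())  # 464 s
print(round(total_full / total_eulaide, 2))  # 3.04
```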
The second part of the current evaluation is concerned with the cor-
rectness of the answers provided by the participants. Table 10 shows
the average percentage of results. According to this table, the cor-
rectness of answers in the natural text mode is 5% higher than in
EULAide mode, which is a reasonable result. If the end-user bears
Table 9: Average Time (in seconds)

              Reading   Answering (Phase 1)   Answering (Phase 2)
EULA-Full        1185                    75                   152
EULA-EULAide      315                    72                    77

Table 10: Average Percentage of Question Results (%)

              Correct (P1)   Incorrect (P1)   Correct (P2)   Incorrect (P2)   Unanswered
EULA-Full               67                8           18.5               5          1.5
EULA-EULAide            62               15            6.5             4.5           12
with the cumbersome legal lingua in the EULA and spends time
studying it, he/she can understand the important parts of the li-
cense. However, as stated in the introduction, only a few people
read EULAs, and EULAide is an attempt to motivate them to be aware
of what they are agreeing to. Our results show that if end-users
exploit our tool, they can get on average 62% of the questions right,
which is indeed very encouraging. Finally, there are some unan-
swered questions in the first phase, which leads us to the next phase. In
the second phase of the answering process, the participants were allowed to
search for information in the EULA. According to the table, of the
25% unanswered questions in the full-text mode, people found
correct answers for 18.5% after re-reading. On the other hand, of the 23%
unanswered questions in EULAide mode, correct answers were found
for only 6.5%. Consequently, 12% of the unsure questions remained
unanswered. This is due to the semi-automatic information
extraction process: the F-measure of the OBIE pipeline is around 75%,
and not all permissions, prohibitions and duties can be ex-
tracted with the pipeline. Therefore, not all of the questions were
covered in EULAide, and the participants could not find the answers.
In summary, with EULAide, people give 19.5% incorrect answers,
while with the full text and search, they give 13.3%, i.e., EULAide
has on average a 6% higher error rate. Similarly, there are
on average 10.5% more unanswered questions with our approach.
While we expect that every (semi-)automatic approach leads to a
potential information loss, we note that the error rate of
EULAide, as well as its unanswered questions, are consequences
of the automatic extraction and summarization. The information loss
of (10.5+6)% incurred by EULAide is a reasonable cost for the time gained
(using EULAide is on average about three times faster than reading the full
text) and for the increased incentive to familiarize oneself with an
EULA rather than simply accepting it.
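The percentages quoted in this discussion can be reconciled with Table 10 as follows. The split of the table's columns into Phase 1 and Phase 2 outcomes is our reading of the results, chosen so that each mode accounts for exactly 100% of the questions:

```python
# Outcome percentages per mode: correct/incorrect after Phase 1 (memory),
# additional correct/incorrect found in Phase 2 (search), and still unanswered.
full = {"correct1": 67, "incorrect1": 8,
        "correct2": 18.5, "incorrect2": 5, "unanswered": 1.5}
eulaide = {"correct1": 62, "incorrect1": 15,
           "correct2": 6.5, "incorrect2": 4.5, "unanswered": 12}

for mode in (full, eulaide):
    assert sum(mode.values()) == 100  # every question is accounted for

# Aggregates quoted in the text:
incorrect_eulaide = eulaide["incorrect1"] + eulaide["incorrect2"]  # 19.5%
extra_unanswered = eulaide["unanswered"] - full["unanswered"]      # 10.5%
print(incorrect_eulaide, extra_unanswered)
```

Note that summing the full-text incorrect columns gives 13%, slightly below the 13.3% quoted in the text, presumably because the table entries are rounded.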
4.2.3 Usability Test
The participants' last task was filling in a usability questionnaire for
the EULAide evaluation. We reused a very common, publicly available usability form.
Table 11: Average Scores of Six Participants for the
Usability Questionnaire (Max = 7)

Usefulness   Ease of Use   Ease of Learning   Satisfaction
      6.14          6.11               6.75            6.0
There are thirty questions categorized into four groups: 8
questions for usefulness, 11 for ease of use, 4 for ease of learning
and 7 for satisfaction. There are seven options for each question, rang-
ing from 1 (strongly dissatisfied) to 7 (strongly satisfied). Table 11
shows the average scores of the six participants for each category. The
results are promising and imply that end-users are quite
satisfied using EULAide. Furthermore, some participants offered a
few remarks regarding the service. The positive feedback includes:
a nice and friendly user interface, fast response time (less than
one minute), a significant time reduction in EULA digestion, the
summary of each cluster, the grouping of similar policies and the ability to
expand a specific cluster. The improvements suggested by participants
point to some interesting ideas for future work. Two participants
recommended including other aspects of the EULA in the summary, e.g.,
what are the agreements between them and the service provider?
Three users noted that not all relevant policies are covered
by EULAide. Last but not least, almost all partic-
ipants were pleased with the summarization idea and encouraged
us to improve the approach in the future.
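A minimal sketch of how the per-category averages in Table 11 can be computed from raw questionnaire responses. The question-to-category assignment follows the counts given in the text (8 + 11 + 4 + 7 = 30), while the response values themselves are made up for illustration:

```python
# Question indices per category, matching the counts stated in the text.
categories = {
    "usefulness": range(0, 8),
    "ease_of_use": range(8, 19),
    "ease_of_learning": range(19, 23),
    "satisfaction": range(23, 30),
}

def category_averages(responses):
    # responses: one list of 30 Likert scores (1..7) per participant.
    return {
        name: round(
            sum(r[q] for r in responses for q in idx)
            / (len(responses) * len(idx)), 2)
        for name, idx in categories.items()
    }

# Three illustrative participants giving uniform scores.
responses = [[6] * 30, [7] * 30, [5] * 30]
print(category_averages(responses))  # every category averages to 6.0
```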
We presented a holistic approach for the analysis and preparation
of end-user license agreements. To the best of our knowledge, this
is the first comprehensive approach for EULA interpretation and
comprehension. The approach comprises a comprehensive ontol-
ogy representing the relevant terms, an ontology-based information
extraction pipeline, a clustering method for the extracted excerpts and a
Web-based user interface for self-service license analysis. Our evaluation showed
that the clustering is effective and significantly reduces the num-
ber of relevant terms for users to focus on initially. In addition,
according to the usability study responses, EULAide makes EULAs more
visual and simpler to digest and saves around 75% of the time.
However, we are aware that this comes at the marginal price of a 10.5%
loss of valuable information, which is an acceptable trade-off, con-
sidering the amount of time saved by users, especially since the
full EULAs are read by very few users. We deem this work
a significant step forward in making the descriptions of the rules
and regulations governing online services, software tools, portals
and apps more user-friendly.
In future work, we aim to expand the application of the approach
to other types of legal documents (e.g., contracts in general or regu-
latory documents). In addition, we aim to expand the work around
the manually crafted ontology and the automatically extracted terms
and conditions into a methodology for the creation of comprehensive
legal knowledge graphs. Furthermore, we aim to provide authors of
EULAs with the means to accompany their licenses with a machine-readable
version, which would further increase the precision and recall of
license analytics.
References
C. C. Aggarwal and C. Zhai. A survey of text clustering algorithms. In C. C.
Aggarwal and C. Zhai, editors, Mining Text Data, pages 77–128. Springer, 2012.
R. M. Aliguliyev. A new sentence similarity measure and sentence based extrac-
tive technique for automatic text summarization. Expert Syst. Appl., 36(4):7764–
7772, May 2009.
S. Bhagwani, S. Satapathy, and H. Karnick. Semantic textual similarity using
maximal weighted bipartite graph matching. In Proceedings of the First Joint
Conference on Lexical and Computational Semantics - Volume 1: Proceedings of
the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth
International Workshop on Semantic Evaluation, SemEval ’12, pages 579–585,
Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.
E. Cabrio, A. P. Aprosio, and S. Villata. These are your rights - a natural lan-
guage processing approach to automated rdf licenses generation. In V. Presutti,
C. d’Amato, F. Gandon, M. d’Aquin, S. Staab, and A. Tordai, editors, ESWC, volume
8465 of Lecture Notes in Computer Science, pages 255–269. Springer, 2014.
D. Chen and C. D. Manning. A fast and accurate dependency parser using neural
networks. In EMNLP, 2014.
H. Cunningham, D. Maynard, and K. Bontcheva. Text Processing with GATE
(Version 8). Gateway Press CA, 2011.
H. Cunningham, D. Maynard, and V. Tablan. JAPE: a Java Annotation Patterns
Engine (Second Edition). Research Memorandum CS–00–10, Department of
Computer Science, University of Sheffield, November 2000.
E. Daga, M. d’Aquin, E. Motta, and A. Gangemi. A Bottom-Up Approach for Licences
Classification and Selection, pages 257–267. Springer International Publishing,
Cham, 2015.
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman.
Indexing by latent semantic analysis. Journal of the American Society
for Information Science, 41(6):391–407, 1990.
P. A. Jamkhedkar and G. L. Heileman. A formal conceptual model for rights. In
Proceedings of the 8th ACM Workshop on Digital Rights Management, DRM ’08,
pages 29–38, New York, NY, USA, 2008. ACM.
T. Kenter and M. de Rijke. Short text similarity with word embeddings. In
Proceedings of the 24th ACM International on Conference on Information and
Knowledge Management, CIKM ’15, pages 1411–1420, New York, NY, USA, 2015.
A. Khan, N. Salim, and Y. Jaya Kumar. A framework for multi-document ab-
stractive summarization based on semantic role labelling. Appl. Soft Comput.,
30(C):737–747, May 2015.
P. Kolb. Disco: A multilingual database of distributionally similar words. Pro-
ceedings of KONVENS-2008, Berlin, 2008.
N. Lavesson, M. Boldt, P. Davidsson, and A. Jacobsson. Learning to detect spyware
using end user license agreements. Knowl. Inf. Syst., 26(2):285–307, Feb. 2011.
Q. V. Le and T. Mikolov. Distributed representations of sentences and documents.
In Proceedings of the 31th International Conference on Machine Learning, ICML
2014, Beijing, China, 21-26 June 2014, pages 1188–1196, 2014.
C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval.
Cambridge University Press, New York, NY, USA, 2008.
C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky.
The Stanford CoreNLP natural language processing toolkit. In Association for
Computational Linguistics (ACL) System Demonstrations, pages 55–60, 2014.
N. K. Nagwani and S. Verma. A frequent term and semantic similarity based
single document text summarization algorithm. International Journal of Computer
Applications (0975–8887) Volume, pages 36–40, 2011.
N. M. Nejad, S. Scerri, S. Auer, and E. M. Sibarani. Eulaide: Interpretation of
end-user license agreements using ontology-based information extraction. In
Proceedings of the 12th International Conference on Semantic Systems, SEMANTiCS
2016, pages 73–80, New York, NY, USA, 2016. ACM.
C. Nguyen Ngoc, A. Roussanaly, and A. Boyer. Learning Resource Recommen-
dation: An Orchestration of Content-Based Filtering, Word Semantic Similarity
and Page Ranking. In EC-TEL 2014 : 9th European Conference on Technology
Enhanced Learning, Open Learning and Teaching in Educational Communities,
pages 302–316, Gratz, Austria, Sept. 2014. European Association of Technology
Enhanced Learning, Springer.
A. Qadir, P. N. Mendes, D. Gruhl, and N. Lewis. Semantic lexicon induction
from twitter with pattern relatedness and flexible term length. In Proceedings
of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, pages
2432–2439. AAAI Press, 2015.
S. Steyskal and A. Polleres. Defining expressive access policies for linked data
using the odrl ontology 2.0. In Proceedings of the 10th International Conference on
Semantic Systems, SEM ’14, pages 20–23, New York, NY, USA, 2014. ACM.
G. Tsatsaronis, I. Varlamis, and M. Vazirgiannis. Text relatedness based on a
word thesaurus. J. Artif. Int. Res., 37(1):1–40, Jan. 2010.
Y. Zhao, G. Karypis, and U. Fayyad. Hierarchical clustering algorithms for
document datasets. Data Mining and Knowledge Discovery, 10(2):141–168, 2005.