Regulatory Compliance through Doc2Doc Information Retrieval:
A case study in EU/UK legislation where text similarity has limitations
Ilias Chalkidis † ‡
Manos Fergadiotis † ‡
Nikolaos Manginas
Eva Katakalou §
Prodromos Malakasiotis † ‡
EY AI Centre of Excellence in Document Intelligence, NCSR “Demokritos”
Department of Informatics, Athens University of Economics and Business
§ Department of International, European and Area Studies, Panteion University
Abstract

Major scandals in corporate history have underscored the need for regulatory compliance, where organizations need to ensure that their controls (processes) comply with relevant laws, regulations, and policies. However, keeping track of the constantly changing legislation is difficult, thus organizations are increasingly adopting Regulatory Technology (RegTech) to facilitate the process. To this end, we introduce regulatory information retrieval (REG-IR), an application of document-to-document information retrieval (DOC2DOC IR), where the query is an entire document, making the task more challenging than traditional IR where the queries are short. Furthermore, we compile and release two datasets based on the relationships between EU directives and UK legislation. We experiment on these datasets using a typical two-step pipeline approach comprising a pre-fetcher and a neural re-ranker. Experimenting with various pre-fetchers, from BM25 to k nearest neighbors over representations from several BERT models, we show that fine-tuning a BERT model on an in-domain classification task produces the best representations for IR. We also show that neural re-rankers under-perform due to contradicting supervision, i.e., similar query-document pairs with opposite labels. Thus, they are biased towards the pre-fetcher's score. Interestingly, applying a date filter further improves the performance, showcasing the importance of the time dimension.
The contribution of Ms. Eva Katakalou was restricted to the creation and the validation of the datasets as well as to the authoring of the corresponding parts of the manuscript.
Figure 1: Number of legislative acts issued by the EU
per year. The gold color of the bars indicates how many
of the published acts are amendments to older ones.
1 Introduction
Major scandals in corporate history, from Enron to Tyco International, Olympus, and Tesco, have led to the emergence of stricter regulatory mandates and highlighted the need for regulatory compliance, where organizations need to ensure that they comply with relevant laws, regulations, and policies (Lin, 2016). However, keeping track of the constantly changing legislation (Figure 1) is hard, thus organizations are increasingly adopting Regulatory Technology (RegTech) to facilitate the process.
Typically, a compliance regimen includes three distinct but related types of measures: corrective, detective, and preventive (Sadiq and Governatori, 2015). Corrective measures are usually undertaken when new regulations are introduced, to update existing controls. Detective measures ensure “after-the-fact” compliance, i.e., following a procedure, a manual or automated check is carried out to ensure that every step of the procedure complied with the corresponding regulations. Finally, preventive measures ensure compliance “by design”, i.e., during the creation of new controls. All types of measures include an underlying information retrieval (IR) task, where laws need to be retrieved given a control or vice versa. We identify two use cases:

l/21/the-worlds-biggest-accounting-scandals-toshiba-enron-olympus

arXiv:2101.10726v1 [cs.CL] 26 Jan 2021
• Given a new law, retrieve all the controls of the organization affected by this law. The organization can then apply corrective measures to ensure compliance for these controls.

• Given a control, retrieve all relevant laws the control should comply with. This is useful for ensuring compliance after a procedure has been carried out (detective measures) or when creating new controls (preventive measures).
Regulatory information retrieval (REG-IR), similarly to other applications of document-to-document (DOC2DOC) IR, is much more challenging than traditional IR, where the query typically contains a few informative words and the documents are relatively small (Table 1). In DOC2DOC IR the query is a long document (e.g., a regulation) containing thousands of words, most of which are uninformative. Consequently, matching the query with other long documents, where the informative words are also sparse, becomes extremely difficult.
Although legislation is available, organizations’ controls are strictly private and very hard to obtain. Fortunately, the European Union (EU) has a legislation scheme analogous to regulatory compliance for organizations. According to the Treaty on the Functioning of the European Union (TFEU),2 published EU directives must take effect at the national level. Thus, all EU member states must adopt a law to transpose a newly issued directive within the period set by the directive (typically 2 years). Notably, the United Kingdom (UK), having a high compliance level with the EU (Figure 2), is a good test-bed for REG-IR. Thus we compile and release two datasets for REG-IR, EU2UK and UK2EU, containing EU directives and UK regulations, which can serve both as queries and documents, under the ground truth assumption that a UK law is relevant to the EU directives it transposes and vice versa.

2 Articles 291 (1) and 288 paragraph 3.
Data for Figures 1 and 2 obtained from /internal_market/scoreboard/performance_by_governance_tool/eu_pilot.

Figure 2: The percentage of EU directives transposed by UK legislation per year. Over 98% of the published EU directives have been transposed.
Dataset                            Domain      q̃        d̃
IR datasets in the literature
TREC ROBUST (Voorhees, 2005)       News        3 / 14   254
BIOASQ (Tsatsaronis et al., 2015)  Biomedical  9        197
IR datasets with verbose queries
GOV2 (Clarke et al., 2004)         Web         11 / 57  682
WT10G (Chiang et al., 2005)        Web         11 / 35  457
Regulatory Compliance datasets
EU2UK (ours)                       Law         2,642    1,849
UK2EU (ours)                       Law         1,849    2,642

Table 1: Statistics for query (q̃) and document (d̃) length for IR datasets used in the literature.
Since REG-IR is a new task, our starting point is the two-step pipeline approach followed by most modern neural information retrieval systems (Guo et al., 2016; Hui et al., 2017; McDonald et al., 2018). First, a conventional IR system (pre-fetcher) retrieves the top-k most prominent documents. Then a neural model attempts to rank relevant documents higher than irrelevant ones. In most approaches, the pre-fetcher is based on Okapi BM25 (Robertson et al., 1995), a bag-of-words scoring function that does not consider possible synonyms or contextual information. To overcome the first limitation, we follow Brokos et al. (2016) who employed k nearest neighbors over tf-idf weighted centroids of word embeddings, without however improving the results, probably because the centroids are noisy, considering many uninformative words. Furthermore, we employ BERT (Devlin et al., 2019) to extract contextualized representations for queries and documents, but again the results are worse than BM25. We also experiment with S-BERT (Reimers and Gurevych, 2019) and LEGAL-BERT (Chalkidis et al., 2020), a model specialized in the legal domain. Both models perform better than BERT but are still worse than or comparable to BM25. The inability of BERT-based models motivated us to find an auxiliary task that will result in better representations for REG-IR. Following Chalkidis et al. (2019), we fine-tune BERT to predict EUROVOC concepts that describe the core subjects of each text. As expected, this model (C-BERT) is the best pre-fetcher by a large margin on EU2UK, while being comparable to BM25 on UK2EU. To summarize, our contributions are:

Query: DIRECTIVE 2006/66/EC OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL of 6 September 2006 on batteries and accumulators and waste batteries and accumulators and repealing Directive 91/157/EEC

BM25 rank  Relevant  Document title
1          No        The Batteries and Accumulators (Placing on the Market) (Amendment) Regulations 2012
2          No        The Batteries and Accumulators (Containing Dangerous Substances) (Amendment) Regulations 2000
3          No        The Batteries and Accumulators (Placing on the Market) (Amendment) Regulations 2015
4          No        The Batteries and Accumulators (Containing Dangerous Substances) Regulations 1994
5          No        The Waste Batteries and Accumulators (Amendment) Regulations 2015
6          Yes       The Waste Batteries and Accumulators Regulations 2009
12         Yes       The Batteries and Accumulators (Placing on the Market) Regulations 2008

Table 2: Example from the EU2UK dataset where the retrieved UK laws are ranked by BM25. The top-5 documents seem similar to the query but are not relevant. Documents ranked 1st, 3rd, and 5th are amendments of the relevant documents, i.e., UK laws that transpose the query.
(a) We introduce REG-IR, an application of DOC2DOC IR, which is a new family of IR tasks, where both queries and documents are long, typically containing thousands of words.

(b) We compile and release the two first publicly available datasets, EU2UK and UK2EU, suitable for REG-IR and DOC2DOC IR in general.

(c) We show that fine-tuning BERT on an in-domain classification task produces the best document representations with respect to IR and improves pre-fetching results.
2 Datasets curation

2.1 Data sources

EU/UK Legislation: We have downloaded approx. 56K pieces of EU legislation (approx. 3.9K directives) from the EUR-LEX portal. EU laws are 2,642 words long on average and are structured in three major parts: the title (Table 2, query), the recitals consisting of references in the legal background of the act, and the main body. We have also downloaded approx. 52K UK laws, publicly available from the official UK legislation portal. UK laws are 1,849 words long on average and contain the title (Table 2, document title) and the main body.

Transpositions: We have retrieved all transposition relations (approx. 3.7K) between EU directives and UK laws from the CELLAR database. CELLAR only provides the mapping between the CELLAR ids of EU directives and the title of each UK law. Therefore we aligned the CELLAR ids with the official UK ids based on the law title. One or more UK laws may transpose one or more EU directives.

The datasets are available at g/details/eacl2021_regir_datasets.
2.2 Datasets compilation

Let E and U be the sets of EU directives and UK laws, respectively. We define REG-IR as the task where the query q is a document, e.g., an EU directive, and the objective is to retrieve a set of relevant documents, Rq, from the pool of all available documents, e.g., all UK laws. We create two datasets:

EU2UK: q ∈ E, Rq = {ri : ri ∈ U, ri transposes q}
UK2EU: q ∈ U, Rq = {ri : ri ∈ E, q transposes ri}

Table 3 shows the statistics for the two datasets, which are split in three parts, train, development, and test, retaining a chronological order for the queries.7 EU2UK has a much larger pool of available documents than UK2EU (52.5K vs. 3.9K), which may impose an extra difficulty during retrieval. More importantly, the average number of relevant documents per query is small (at most 2) for both datasets, as our ground truth assumption is strict, i.e., relevant documents are those linked to the query with a transposition relation. Also, EU legislation is frequently amended (Figure 1), which also imposes difficulty in the retrieval task. Let d1 ∈ E be a directive transposed by u1 ∈ U, and d2 ∈ E a directive amending d1. The UK must adopt a law, u2, to transpose d2. Both u1 and u2 cover similar concepts to those of d1 and d2 (d2 is an amendment of d1 and u2 must comply with it), but, strictly speaking, u2 is relevant only to d2. Table 2 shows an example from EU2UK, where the top-5 documents seem very similar to the query but are not considered relevant. Note that the documents ranked 1st, 3rd and 5th are amendments of the relevant documents.

7 See Appendix A for details on the dataset curation.

Dataset   Documents   Train                Development          Test
          in pool     Queries  Avg. rel.   Queries  Avg. rel.   Queries  Avg. rel.
EU2UK     52,515      1,400    1.79        300      2.09        300      1.74
UK2EU     3,930       1,500    1.90        300      1.46        300      1.29

Table 3: Detailed statistics for EU2UK and UK2EU. Both datasets have a relatively small number of relevant documents per query, while EU2UK also has a large pool, which may impose extra difficulties in retrieval.
3 IR pipelines

Modern neural IR systems usually follow a two-step pipeline approach. First, a conventional IR system (pre-fetcher) retrieves the top-k most prominent documents, aiming to maximize its recall. Then a neural model attempts to re-rank the documents by scoring relevant ones higher than irrelevant ones. While this configuration is widely adopted in the literature, the re-ranking step could be omitted provided an effective pre-fetching mechanism, i.e., the pre-fetcher would act as an end-to-end IR system.
3.1 Document pre-fetching

Okapi BM25 (Robertson et al., 1995) is a bag-of-words scoring function estimating the relevance of a document d to a query q, based on the query terms appearing in d, regardless of their proximity within d:

BM25(q, d) = Σi idf(qi) · tf(qi, d) · (k1 + 1) / (tf(qi, d) + k1 · (1 − b + b · L/L̄))   (1)

where qi is the i-th query term, with idf(qi) its inverse document frequency and tf(qi, d) its term frequency in d. L is the length of d in words, L̄ is the average length of the documents in the collection, k1 is a parameter that favors high tf scores, and b is a parameter penalizing long documents.8
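As a minimal sketch, the BM25 scoring function above can be implemented as follows. This is purely illustrative (our experiments use Elasticsearch instead), and the smoothed idf variant is an implementation choice of this sketch:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a tokenized query with Okapi BM25.
    `corpus` is a list of tokenized documents, used for idf and average length."""
    N = len(corpus)
    avg_len = sum(len(d) for d in corpus) / N
    # document frequency of each term in the corpus
    df = Counter()
    for d in corpus:
        for term in set(d):
            df[term] += 1
    tf = Counter(doc_terms)
    score = 0.0
    for q in query_terms:
        if tf[q] == 0:
            continue  # term absent from the document contributes nothing
        idf = math.log(1 + (N - df[q] + 0.5) / (df[q] + 0.5))  # smoothed idf
        denom = tf[q] + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf[q] * (k1 + 1) / denom
    return score
```

Ranking then amounts to sorting the document pool by this score for a given query.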
W2V-CENT: Following Brokos et al. (2016), we represent query/document terms with pre-trained word embeddings. For each query/document we calculate the tf-idf weighted centroid of its embeddings:

cent(t) = Σi=1..l xi · tf(ti, t) · idf(ti) / Σi=1..l tf(ti, t) · idf(ti)   (2)

where t is a text (query or document) and ti is the i-th text term, with embedding xi. The documents are ranked, with respect to the query, by a k nearest neighbours (k-NN) algorithm with cosine distance:

cosd(q, d) = 1 − cent(q) · cent(d) / (‖cent(q)‖ ‖cent(d)‖)   (3)

8 We use Elasticsearch, a widely used IR engine with the BM25 scoring function.
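A minimal sketch of the tf-idf weighted centroid (Eq. 2) and the cosine distance used for k-NN ranking, assuming every term has a pre-trained embedding:

```python
import numpy as np
from collections import Counter

def tfidf_centroid(terms, embeddings, idf):
    """tf-idf weighted centroid of a text's word embeddings (Eq. 2).
    `embeddings` maps term -> vector; `idf` maps term -> idf weight.
    Assumes every term in `terms` has an entry in `embeddings`."""
    tf = Counter(terms)
    dim = len(next(iter(embeddings.values())))
    num, den = np.zeros(dim), 0.0
    for term, freq in tf.items():
        w = freq * idf.get(term, 0.0)
        num += w * np.asarray(embeddings[term])
        den += w
    return num / den if den > 0 else num

def cosine_distance(u, v):
    """1 - cosine similarity, the k-NN distance used for ranking."""
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
```

Documents are then sorted by `cosine_distance` between their centroid and the query centroid.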
BERT, similarly to W2V-CENT, relies on pre-trained representations, which now are extracted from BERT, thus being context-aware. A text can be represented by its [cls] token or by the centroid of its token embeddings. In the latter case the embeddings can be extracted from any of the 12 layers of BERT.9 Note that the texts in our datasets do not entirely fit in BERT. We thus split them into chunks (2 to 3 per text) and pass each chunk through BERT to obtain a list of token embeddings per layer (i.e., the concatenation of the chunks’ token embedding lists) or [cls] tokens. The final representation is either the centroid of the token embeddings or the centroid of the [cls] tokens.
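The chunking scheme can be sketched as follows, with a stand-in `encode` function in place of an actual BERT forward pass (the stand-in is an assumption of this sketch):

```python
import numpy as np

MAX_LEN = 512  # BERT's maximum input length in sub-word tokens

def chunks(tokens, max_len=MAX_LEN):
    """Split a long token sequence into consecutive chunks that fit in BERT."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def text_representation(tokens, encode, use_cls=True):
    """Encode each chunk separately, then average over chunks.
    `encode(chunk) -> (cls_vector, token_matrix)` stands in for a BERT
    forward pass returning the [cls] vector and per-token embeddings."""
    cls_vecs, token_mats = [], []
    for c in chunks(tokens):
        cls_vec, token_mat = encode(c)
        cls_vecs.append(cls_vec)
        token_mats.append(token_mat)
    if use_cls:
        # centroid of the per-chunk [cls] vectors
        return np.mean(cls_vecs, axis=0)
    # centroid of all token embeddings across chunks
    return np.concatenate(token_mats, axis=0).mean(axis=0)
```

In practice `encode` would be a (frozen) BERT model applied per chunk, with the layer to extract from chosen on development data.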
S-BERT (Reimers and Gurevych, 2019) is a BERT model fine-tuned for NLI. According to the authors, training S-BERT for NLI results in better representations than BERT for tasks involving text comparison, like IR. We use the same setting as in BERT.
LEGAL-BERT: Our datasets come from the legal domain, which has distinct characteristics compared to generic corpora, such as specialized vocabulary, particularly formal syntax, semantics based on extensive domain-specific knowledge, etc., to the extent that legal language is often classified as a ‘sub-language’ (Tiersma, 1999; Williams, 2007; Haigh, 2018). BERT and S-BERT were trained on generic corpora and may fail to capture the nuances of legal language. Thus we used a BERT model further pre-trained on EU legislation (Chalkidis et al., 2020), dubbed here LEGAL-BERT, in a similar fashion.

9 BERT is not fine-tuned during this process.
C-BERT: EU laws are annotated with EUROVOC concepts covering the core subjects of EU legislation (e.g., environment, trade, etc.). Our intuition is that a UK law transposing an EU directive will most probably cover the same subjects. Thus we expect that a BERT model, fine-tuned to predict EUROVOC concepts, will learn rich representations describing these concepts, which may be useful for pre-fetching. We fine-tune BERT following Chalkidis et al. (2019) and use the resulting model to extract query and document representations, similarly to the previous BERT-based methods.
ENSEMBLE is simply a combination of our best two pre-fetchers, C-BERT and BM25:

ENS(q, d) = α · CB(q, d) + (1 − α) · BM25(q, d)   (4)

where CB is the score of C-BERT, α is tuned on development data, and the scores of the pre-fetchers are normalized in [0, 1].
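Eq. 4 can be sketched as follows; min-max normalization is an assumption of this sketch, since the text only requires scores normalized in [0, 1]:

```python
def minmax(scores):
    """Normalize a list of scores to [0, 1] (min-max normalization)."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

def ensemble(cbert_scores, bm25_scores, alpha):
    """Eq. 4: convex combination of normalized C-BERT and BM25 scores
    for the same candidate list; `alpha` is tuned on development data."""
    cb, bm = minmax(cbert_scores), minmax(bm25_scores)
    return [alpha * c + (1 - alpha) * b for c, b in zip(cb, bm)]
```

Tuning reduces to sweeping `alpha` over a grid and keeping the value with the best development R@100.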
3.2 Document re-ranking

Modern neural re-rankers operate on pairs of the form (q, d) to produce a relevance score, rel(q, d), for a document d with respect to a query q. Note, however, that the main objective is to rank relevant documents higher than irrelevant ones. Thus, during training the loss is calculated as:

L = max(0, 1 − rel(q, d+) + rel(q, d−))   (5)

where d+ is a relevant document and d− is an irrelevant one. We have experimented with several neural re-ranking methods, each having a function that produces a relevance score sr for each of the top-k documents returned by the best pre-fetcher. The final relevance score of a document is calculated as:

rel(q, d) = ws · sr + wp · sp

where sp is the normalized score of the pre-fetcher and ws, wp are learned during training.
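The pairwise hinge loss (Eq. 5) and the combined relevance score can be sketched as follows (function names and score values are illustrative):

```python
def hinge_loss(rel_pos, rel_neg, margin=1.0):
    """Eq. 5: pairwise hinge loss, pushing a relevant document's score
    at least `margin` above an irrelevant one's."""
    return max(0.0, margin - rel_pos + rel_neg)

def combined_relevance(s_r, s_p, w_s, w_p):
    """Weighted sum of the re-ranker's neural score (s_r) and the
    normalized pre-fetcher score (s_p); w_s and w_p are learned."""
    return w_s * s_r + w_p * s_p
```

When wp dominates ws, `combined_relevance` essentially reproduces the pre-fetcher's ranking, which is the degenerate behavior discussed below.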
Given the concerns on the strictness of the ground truth assumption raised in Section 2.2, we hypothesize that re-rankers will eventually over-utilize the pre-fetcher score, sp, when calculating document relevance, rel(q, d). As shown in Table 2, in many cases both relevant and irrelevant documents may have high similarity with the query. This in turn may confuse and therefore degenerate the re-ranker’s term matching mechanism, i.e., MLPs or CNNs over term similarity matrices.

We use all EU laws, excluding EU directives that exist in our development and test sets.
DRMM (Guo et al., 2016) uses pre-trained word embeddings to represent query and document terms. A histogram captures the cosine similarities of a query term, qi, with all the terms of a particular document. Then an MLP consumes the histograms to produce a document-aware score for each qi, which is weighted by a gating mechanism assessing the importance of qi. The sum of the weighted scores is the relevance score of the document. A caveat of DRMM is that it completely ignores the context of the terms, which could be of particular importance in our datasets, where texts are long.
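The matching-histogram step of DRMM can be sketched as follows (the bin count is illustrative; DRMM also has normalized and log-count histogram variants):

```python
import numpy as np

def matching_histogram(q_vec, doc_vecs, n_bins=30):
    """Count-based histogram of cosine similarities between one query term's
    embedding and all document term embeddings, binned over [-1, 1]."""
    sims = [float(q_vec @ d) / (np.linalg.norm(q_vec) * np.linalg.norm(d))
            for d in doc_vecs]
    hist, _ = np.histogram(sims, bins=n_bins, range=(-1.0, 1.0))
    return hist
```

One such histogram per query term is then fed to the MLP that produces the per-term, document-aware scores.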
PACRR (Hui et al., 2017) represents query and document terms with pre-trained embeddings and calculates a matrix S containing the cosine similarities of all query-document term pairs. A row-wise k-max pooling operation on S keeps the k highest similarities per query term (matrix S1). Then, wide convolutions of different kernel (filter) sizes (n × n), with multiple filters per size, are applied on S. Each filter of size n attempts to capture n-gram similarities between queries and documents. A max-pooling operation keeps the strongest signals across filters and a row-wise k-max pooling keeps the strongest signals per query n-gram, resulting in the matrix Sn. Subsequently, a row-wise concatenation of S1 with all Sn matrices (for different values of n) is performed, and a column containing the idf scores of the query terms is concatenated to the resulting matrix (C). In effect, each row of the matrix contains different n-gram based similarity views of the corresponding query term, qi, along with an idf-based importance score. The relevance score is produced as the last hidden state of an LSTM with one hidden unit, which consumes the rows of C. PACRR tries to take into account the context of the query and document terms using n-grams, but this context sensitivity is weak and we do not expect much benefit in our datasets, which contain long texts.
BERT-based re-rankers: Recent work tries to exploit BERT to improve re-ranking. Following MacAvaney et al. (2019), we use DRMM and PACRR on top of contextualized embeddings derived from BERT. Based on the results of Figure 4, we use C-BERT as the most promising BERT model. We call these two models C-BERT-DRMM and C-BERT-PACRR. We also experiment with two settings depending on whether C-BERT weights are updated (tuned) or not (frozen) during training.
Figure 3: Heatmaps showing R@100 for different values of k1 and b on EU2UK (left) and UK2EU (right). The selected optimal values (green boxes) are outside the ranges proposed in the literature (blue boxes).
4 Experimental setup

4.1 Pre-trained resources

As several methods rely on word embeddings, we trained a new WORD2VEC model (Mikolov et al., 2013) on both corpora (EU and UK legislation) to better accommodate legal language.11 Preliminary experiments showed that domain-specific embeddings perform better than generic 200-dimensional GloVe embeddings (Pennington et al., 2014) on development data (EU2UK: 66.5 vs. 59.3 at R@100, and UK2EU: 72.6 vs. 69.8 at R@100).

All BERT (pre-fetching) encoders and BERT-based re-rankers use the BERT-BASE version, i.e., 12 layers, 768 hidden units and 12 attention heads, similar to the one of Devlin et al. (2019).12
4.2 Pre-processing - document denoising

One of the major challenges in DOC2DOC IR, as opposed to traditional IR, is the length of the queries and the documents, which may induce noise (many uninformative words) during retrieval. Thus we applied several filters (stop-word, punctuation and digit elimination) on both queries and documents and reduced their length by approx. 55% (778 words for UK laws and 1,222 words for EU directives on average). Further on, we filtered both queries and documents by eliminating words with an idf score less than the average idf score of the stop-words. Our intuition is that words (e.g., regulation, EU, law, etc.) with such a small idf score are uninformative. Still, the texts are much longer (387 words for UK laws and 631 words for EU directives on average) than the queries used in traditional IR (Table 1). As an alternative to drastically decrease the query size, we experimented with using only the title of a legislative act as a query, but the results were worse, i.e., approx. 5-20% lower R@100 on average across datasets, indicating that the full text is more informative, although the information is sparse. Hence, we only consider the full text, including the title, for the rest of the experiments.

11 See also the discussion of legal language in Section 3.1.
12 See Appendix B for more details.
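The idf-based filter can be sketched as follows, assuming a precomputed idf table and stop-word list (keeping out-of-vocabulary words is a choice of this sketch):

```python
def denoise(tokens, idf, stopwords):
    """Drop stop-words, non-alphabetic tokens (punctuation, digits), and words
    whose idf is below the average idf of the stop-words, which are treated
    as uninformative."""
    sw_idfs = [idf[w] for w in stopwords if w in idf]
    threshold = sum(sw_idfs) / len(sw_idfs)
    return [t for t in tokens
            if t.isalpha()
            and t not in stopwords
            and idf.get(t, threshold) >= threshold]  # keep unknown words
```

The same filter is applied to queries and documents before indexing and retrieval.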
4.3 Evaluation measures

Pre-fetching aims to bring all the relevant documents into the top-k, thus we report R@k. We observe that for k > 100 the best pre-fetchers have no significant gains in performance on development data, thus we select k = 100 as a reasonable threshold.

For re-ranking we report R@20, nDCG@20, and R-Precision (RP), following the literature (Manning et al., 2009). We report the average and standard deviation across three runs, considering the best set of hyper-parameters on development data for neural re-rankers.
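R@k for a single query can be computed as:

```python
def recall_at_k(ranked_ids, relevant_ids, k=100):
    """R@k: fraction of a query's relevant documents found in the top-k
    of the ranked list."""
    top = set(ranked_ids[:k])
    relevant = set(relevant_ids)
    return len(top & relevant) / len(relevant)
```

The reported score is this value averaged over all test queries.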
4.4 Tuning BM25: The case of DOC2DOC IR

The effectiveness of BM25 is highly dependent on properly selecting the values of k1 and b. In traditional (ad-hoc) IR, k1 is typically evaluated over a range of small values, while b needs to be in [0, 1] (Taylor et al., 2006; Trotman et al., 2014; Lipani et al., 2015). As a general rule of thumb, BM25 with k1=1.2 and b=0.75 seems to give good results in most cases (Trotman et al., 2014). We observe that in the case of DOC2DOC IR, where the queries are much longer, the optimal values are outside the proposed ranges (Figure 3). In both datasets the optimal values for k1 and b are relatively high, favoring terms with high tf, while penalizing long documents. In effect, this setting acts as a denoising regularizer, forcing BM25 to over-utilize highly frequent query terms normalized by document length.

See Appendix A.3 for an extended performance evaluation on pre-fetching.

Figure 4: Heatbars showing R@100 (on development data) for text representations extracted from different layers of the various BERT-based pre-fetchers we experimented with.
4.5 Extracting representations from BERT

Recently there has been a lot of research on understanding the effectiveness of BERT’s different layers (Liu et al., 2019; Hewitt and Manning, 2019; Jawahar et al., 2019; Goldberg, 2019; Kovaleva et al., 2019; Lin et al., 2019). Figure 4 shows heatbars comparing representations extracted from different layers of the various BERT-based pre-fetchers we experimented with. LEGAL-BERT and C-BERT, which have been adapted to the legal domain, perform much better than BERT and S-BERT, which were trained on generic corpora. An interesting observation is that the [cls] token is a powerful representation only in C-BERT, where it was trained to predict EUROVOC concepts. Also, in UK2EU the embedding layer produces the best representations in all BERT variants except C-BERT, where the embedding layer achieves comparable results to the top-2 representations ([cls], Layer-12). This is an indication that the context in this dataset is not as important as in EU2UK.
4.6 Implementation details

All neural models were implemented using the Tensorflow 2 framework. Hyper-parameters were tuned on development data, using early stopping and the Adam optimizer (Kingma and Ba, 2015).

Recall that a text can be represented by its [cls] token or by the centroid of its token embeddings, which can be extracted from any of the 12 layers of BERT.
Method                               EU2UK    UK2EU
                                     R@100    R@100
BM25 (Robertson et al., 1995)        57.5     93.7
W2V-CENT (Brokos et al., 2016)       50.6     88.2
BERT (Devlin et al., 2019)           54.0     85.1
S-BERT (Reimers and Gurevych, 2019)  57.7     84.8
LEGAL-BERT (Chalkidis et al., 2020)  57.6     90.1
C-BERT (ours)                        83.8     92.9
ENSEMBLE (BM25 + C-BERT)             86.5     95.0

Table 4: Pre-fetching results across test datasets.
5 Experimental results

Table 4 shows R@100 on the test datasets for the various pre-fetchers considered. On EU2UK, C-BERT is the best method by a large margin, followed by S-BERT and LEGAL-BERT, verifying our assumption that the concept classification task is a good proxy for obtaining rich representations with respect to IR. Both S-BERT and LEGAL-BERT are better than BERT for different reasons. LEGAL-BERT was adapted to the legal domain and is, therefore, able to capture the nuances of the legal language. S-BERT was trained to produce representations suitable for comparing texts with cosine similarity, a task highly related to IR. Nonetheless, having been trained on generic corpora with small texts, it performs much worse than C-BERT. Interestingly, BM25 is comparable to both S-BERT and LEGAL-BERT despite its simplicity. As expected, combining C-BERT with BM25 further improves the results. In UK2EU, performance is much higher compared to EU2UK, probably because of the shorter queries. Also, as discussed in Section 4.5, the contextual information is not so critical in this dataset, thus we expect the context-unaware BM25 and W2V-CENT to perform well. Indeed, BM25 achieves the best results, followed closely by C-BERT and LEGAL-BERT, while W2V-CENT outperforms S-BERT and BERT. Again the ENSEMBLE improves the results.

                           EU2UK                                            UK2EU
Method                     wp    ws    R@20        nDCG@20     RP           wp    ws    R@20        nDCG@20     RP
BM25                       -     -     45.8        34.4        25.5         -     -     87.5        66.8        49.4
C-BERT (ours)              -     -     55.7        37.9        21.8         -     -     79.7        53.0        33.1
ENSEMBLE (BM25 + C-BERT)   -     -     54.1        43.1        29.6         -     -     88.0        67.7        49.3
+DRMM                      +1.1  -0.8  59.9 (±3.2) 41.7 (±2.4) 24.3 (±2.9)  +1.3  -0.8  86.3 (±1.1) 61.6 (±1.1) 40.1 (±1.5)
+PACRR                     +4.2  +0.6  54.3 (±0.2) 43.3 (±0.2) 30.1 (±0.4)  +4.0  +0.1  88.0 (±0.0) 67.7 (±0.0) 49.3 (±0.0)
+C-BERT-DRMM (frozen)      +3.3  -1.6  57.9 (±3.4) 43.1 (±0.3) 27.3 (±2.2)  +3.5  -1.0  88.3 (±0.4) 67.3 (±0.6) 48.5 (±1.3)
+C-BERT-PACRR (frozen)     +4.6  +0.9  54.1 (±0.0) 43.1 (±0.0) 29.6 (±0.0)  +2.9  -0.9  89.6 (±0.4) 66.5 (±0.5) 46.0 (±0.9)
+C-BERT-DRMM (tuned)       +1.9  -0.5  54.1 (±0.0) 43.1 (±0.0) 29.6 (±0.0)  +1.2  +0.5  88.0 (±0.0) 67.7 (±0.0) 49.3 (±0.0)
+C-BERT-PACRR (tuned)      +1.8  -0.6  54.1 (±0.0) 43.1 (±0.0) 29.6 (±0.0)  +2.0  +2.1  88.0 (±0.0) 67.7 (±0.0) 49.3 (±0.0)
+ORACLE                    -     -     86.5        87.7        86.5         -     -     95.0        95.3        95.0

Applying date filtering on top of predictions (year range: ±5 years for EU2UK, ±15 years for UK2EU)
ENSEMBLE (BM25 + C-BERT)   -     -     76.6        54.6        37.1         -     -     86.2        68.2        50.0
+DRMM (pre-filtering)      +1.1  -0.8  81.4        56.5        35.4         +1.3  -0.8  85.3        62.6        42.3
+DRMM (post-filtering)     +1.1  -0.8  75.7        49.2        31.1         +1.3  -0.8  83.6        63.5        44.2
+PACRR (pre-filtering)     +4.2  +0.6  76.6        54.8        37.6         +4.0  +0.1  86.2        68.2        50.0
+PACRR (post-filtering)    +4.2  +0.6  74.2        52.9        36.5         +4.0  +0.1  85.5        67.6        49.6

Table 5: Re-ranking results across test datasets. The upper zone shows the results of neural re-rankers on top of the best pre-fetcher, along with the learned weights (ws, wp). It also reports re-ranking results of the best pre-fetchers. The lower zone reports the re-ranking results after applying temporal filtering.
Table 5 shows the re-ranking results on test data for EU2UK and UK2EU. We also report results for BM25, C-BERT, and the ENSEMBLE, as well as an ORACLE, which re-ranks the top-k documents returned by the pre-fetcher, placing all relevant documents at the top. On EU2UK, the ENSEMBLE performs better than the other two pre-fetchers. Interestingly, neural re-rankers fall short of improving performance and are comparable (or even identical) to the ENSEMBLE in most cases, possibly because very similar documents may be relevant or not (Section 2.2, Table 2), leading to contradicting supervision.16 As we hypothesized (Section 3.2), re-rankers over-utilize the pre-fetcher score when calculating document relevance, as a defense mechanism (bias) against contradicting supervision, which eventually leads to the degeneration of the re-ranker’s term matching mechanism. Inspecting the corresponding weights of the models, we observe that wp >> ws across all methods. This effect seems more intense in BERT-based re-rankers (C-BERT + DRMM or PACRR), especially those that fine-tune C-BERT, possibly because these models perform term matching considering sub-word units instead of full words. In other words, relying on the neural relevance score (sr) is catastrophic. Similar observations can be made for UK2EU. In both datasets all methods have a large performance gap compared to the ORACLE, indicating that there is still large room for improvement, possibly utilizing information beyond text.

16 By contradicting supervision we mean similar training query-document pairs with opposite labels.
Figure 5: Relevant documents according to their chronological difference with the query on EU2UK development data.
Filtering by year: We have already highlighted the difficulties imposed on our datasets by the frequently amended EU directives (Section 2.2, Table 2). Also, recall that each EU directive defines a deadline (typically 2 years) for the transposition to take place. On the other hand, as we observe in Figure 5, EU directives may already be transposed by earlier legislative acts of member states (the member states act proactively), or the transposition may be delayed for political reasons. In effect, the relevance of a document to a query depends both on the textual content and on the time the laws were published. Thus, we filter out documents that lie outside a predefined distance (in years) from the query, in two ways: pre-filtering and post-filtering. Pre-filtering is applied to the pre-fetcher, i.e., prior to re-ranking, while post-filtering is applied after the re-ranking. Note that our main goal is to improve re-ranking; we thus apply the filtering scheme to ENSEMBLE, DRMM and PACRR. The lower zone of Table 5 shows the results of the whole process. In EU2UK, the harder of the two datasets, time filtering has a positive impact, improving the results by a large margin. On the other hand, filtering seems to have a minor effect on UK2EU.
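The two filtering modes can be sketched as follows (a simplified illustration with hypothetical helper names; the released implementation may differ). Pre-filtering restricts the pre-fetcher's candidate pool before re-ranking, while post-filtering prunes the already re-ranked list:

```python
def within_window(query_year, doc_year, max_dist):
    """True if the document lies within max_dist years of the query."""
    return abs(query_year - doc_year) <= max_dist

def pre_filter(candidates, query_year, max_dist):
    """Applied to the pre-fetcher's output, i.e., prior to re-ranking."""
    return [d for d in candidates if within_window(query_year, d["year"], max_dist)]

def post_filter(ranked, query_year, max_dist):
    """Applied after re-ranking, to the final ranked list."""
    return [d for d in ranked if within_window(query_year, d["year"], max_dist)]
```

Both filters use the same symmetric window, since (as Figure 5 shows) relevant documents may precede or follow the query document in time.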
5.1 EU2UK ≠ UK2EU

Across experiments, we observe that best practices vary between the EU2UK and UK2EU datasets. EU2UK benefits from C-BERT representations, while in UK2EU the context-unaware and domain-agnostic BM25 has comparable or better performance than C-BERT. Similarly, we observe that time filtering further improves the performance in EU2UK, while it has a contradictory effect in UK2EU. Given the overall results, we conclude that the two datasets have quite different characteristics. Thus, it is important to consider EU2UK and UK2EU independently, although one may initially consider them to be symmetric.
6 Related work
IR in the legal domain is widely connected with
the Competition on Legal Information Extrac-
tion/Entailment (COLIEE). From 2015 to 2017
(Kim et al., 2015, 2016; Kano et al., 2017), the
task was to retrieve Japanese Civil Code articles
given a question, while in COLIEE 2018 and 2019
(Kano et al., 2018; Rabelo et al., 2019), the task
was to retrieve supporting cases given a short de-
scription of an unseen case. However, the texts
of these competitions are small compared to our
datasets. Also, most submitted systems do not con-
sider recent advances in IR, i.e., neural ranking mod-
els (Guo et al., 2016; Hui et al., 2017; McDonald
et al., 2018; MacAvaney et al., 2019), which have
recently managed to improve rankings of conven-
tional IR, or end-to-end neural models which have
recently been proposed (Fan et al., 2018; Khattab
and Zaharia, 2020). Again, these end-to-end meth-
ods were applied on small texts. On the other hand,
there has been some work trying to cope with larger
queries, i.e., verbose or expanded queries, (Paik
and Oard, 2014; Gupta and Bendersky, 2015; Cum-
mins, 2016). Nonetheless, the considered queries
are at most 60 tokens long, contrary to our datasets
where, depending on the setting, the average query
length is 1.8K or 2.6K tokens (Table 1). Neural
methods greatly rely on text representations, thus
Reimers and Gurevych (2019) proposed S-BERT, which is trained to compare texts for an NLI task
and could thus be used to extract representations
suitable for IR. Towards the same direction, Chang
et al. (2020) experimented with several auxiliary
tasks to extract better representations. However, the
latter two methods have been evaluated on datasets
with much smaller texts than the ones we consider.
7 Conclusions and future work
We proposed DOC2DOC IR, a new family of IR
tasks, where the query is an entire document, thus
being more challenging than traditional IR. This
family of tasks is particularly useful in regulatory
compliance, where organizations need to ensure
that their controls comply with the existing legisla-
tion. In the absence of publicly available DOC2DOC
datasets, we compile and release two datasets, con-
taining EU directives and UK laws transposing
these directives. Experimenting with conventional (BM25) and neural pre-fetchers, we showed that a BERT model fine-tuned on an in-domain classification task, i.e., predicting EUROVOC concepts, is by far the best pre-fetcher on our datasets. We also showed that neural re-rankers fail to improve the performance, as their term matching mechanisms degenerate and they over-utilize the pre-fetcher score.
In the future, we would like to investigate alternatives for exploiting additional information that may be critical in the newly introduced tasks (EU2UK, UK2EU). In this direction, naively utilizing chronological information already leads to a vast performance improvement on the EU2UK dataset. One possible direction is to model the cross-document relations (e.g., amendments) using Graph Convolutional Networks (Kipf and Welling, 2016), while better modeling the dimension of time (i.e., the chronological difference between a query and a document) is also crucial. Further on, to better deal with long documents, we plan to investigate text summarization by employing a state-of-the-art neural summarizer, e.g., BART (Lewis et al., 2020), or sentence selection techniques, e.g., rationale extraction (Lei et al., 2016; Chang et al., 2019), to find the most important sections or sentences and create shorter, more informative versions of queries/documents.
References

Georgios-Ioannis Brokos, Prodromos Malakasiotis,
and Ion Androutsopoulos. 2016. Using centroids
of word embeddings and word mover’s distance
for biomedical document retrieval in question an-
swering. In Proceedings of the 15th Workshop on
Biomedical Natural Language Processing (BioNLP
2016), at the 54th Annual Meeting of the Association
for Computational Linguistics (ACL 2016), pages
114–118, Berlin, Germany.
Ilias Chalkidis, Emmanouil Fergadiotis, Prodromos
Malakasiotis, and Ion Androutsopoulos. 2019.
Large-Scale Multi-Label Text Classification on EU
Legislation. In Proceedings of the 57th Annual
Meeting of the Association for Computational Lin-
guistics, pages 6314–6322, Florence, Italy.
Ilias Chalkidis, Manos Fergadiotis, Prodromos Malaka-
siotis, Nikolaos Aletras, and Ion Androutsopoulos.
2020. LEGAL-BERT: The muppets straight out of
law school. In Findings of the Association for Com-
putational Linguistics: EMNLP 2020, pages 2898–
2904, Online. Association for Computational Linguistics.
Shiyu Chang, Yang Zhang, Mo Yu, and Tommi S.
Jaakkola. 2019. A Game Theoretic Approach
to Class-wise Selective Rationalization. In Ad-
vances in Neural Information Processing Systems
(NeurIPS), Vancouver, Canada.
Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yim-
ing Yang, and Sanjiv Kumar. 2020. Pre-training
Tasks for Embedding-based Large-scale Retrieval.
In International Conference on Learning Representations.
Wei-Tsen Milly Chiang, Markus Hagenbuchner, and
Ah Chung Tsoi. 2005. The wt10g dataset and the
evolution of the web. In Special Interest Tracks
and Posters of the 14th International Conference on
World Wide Web, WWW ’05, page 938–939, New
York, NY, USA. Association for Computing Machinery.
Charles Clarke, Nick Craswell, and Ian Soboroff. 2004. Overview of the TREC 2004 Terabyte Track. In TREC.
Ronan Cummins. 2016. A study of retrieval models for
long documents and queries in information retrieval.
In Proceedings of the 25th International Conference
on World Wide Web, WWW ’16, page 795–805, Re-
public and Canton of Geneva, CHE. International
World Wide Web Conferences Steering Committee.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training of
Deep Bidirectional Transformers for Language Un-
derstanding. Proceedings of the Annual Conference
of the North American Chapter of the Association
for Computational Linguistics: Human Language
Technologies, abs/1810.04805.
Yixing Fan, Jiafeng Guo, Yanyan Lan, Jun Xu, Chengx-
iang Zhai, and Xueqi Cheng. 2018. Modeling Di-
verse Relevance Patterns in Ad-Hoc Retrieval. In
The 41st International ACM SIGIR Conference on
Research & Development in Information Retrieval,
SIGIR ’18, page 375–384, New York, NY, USA. As-
sociation for Computing Machinery.
Yoav Goldberg. 2019. Assessing BERT’s Syntactic
Abilities. CoRR, abs/1901.05287.
Jiafeng Guo, Yixing Fan, Qingyao Ai, and W. Bruce
Croft. 2016. A deep relevance matching model for
ad-hoc retrieval. In Proceedings of the 25th ACM
International on Conference on Information and
Knowledge Management, CIKM ’16, page 55–64,
New York, NY, USA. Association for Computing Machinery.
Manish Gupta and Michael Bendersky. 2015. Informa-
tion retrieval with verbose queries. In Proceedings
of the 38th International ACM SIGIR Conference on
Research and Development in Information Retrieval,
SIGIR ’15, page 1121–1124, New York, NY, USA.
Association for Computing Machinery.
Rupert Haigh. 2018. Legal English. Routledge.
John Hewitt and Christopher D. Manning. 2019. A
structural probe for finding syntax in word repre-
sentations. In Proceedings of the 2019 Conference
of the North American Chapter of the Association
for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers),
pages 4129–4138, Minneapolis, Minnesota. Associ-
ation for Computational Linguistics.
Kai Hui, Andrew Yates, Klaus Berberich, and Ger-
ard de Melo. 2017. PACRR: A position-aware neu-
ral IR model for relevance matching. In Proceed-
ings of the 2017 Conference on Empirical Methods
in Natural Language Processing, pages 1049–1058,
Copenhagen, Denmark. Association for Computa-
tional Linguistics.
Ganesh Jawahar, Benoît Sagot, and Djamé Seddah.
2019. What Does BERT Learn about the Structure
of Language? In Proceedings of the 57th Annual
Meeting of the Association for Computational Lin-
guistics, pages 3651–3657, Florence, Italy. Associa-
tion for Computational Linguistics.
Yoshinobu Kano, Mi-Young Kim, Randy Goebel, and
Ken Satoh. 2017. Overview of COLIEE 2017. In COLIEE@ICAIL, pages 1–8.
Yoshinobu Kano, Mi-Young Kim, Masaharu Yosh-
ioka, Yao Lu, Juliano Rabelo, Naoki Kiyota, Randy
Goebel, and Ken Satoh. 2018. Coliee-2018: Eval-
uation of the competition on legal information ex-
traction and entailment. In JSAI International Sym-
posium on Artificial Intelligence, pages 177–192.
Omar Khattab and Matei Zaharia. 2020. ColBERT: Ef-
ficient and Effective Passage Search via Contextual-
ized Late Interaction over BERT.
Mi-Young Kim, Randy Goebel, Yoshinobu Kano, and
Ken Satoh. 2016. Coliee-2016: evaluation of the
competition on legal information extraction and
entailment. In International Workshop on Juris-
informatics (JURISIN 2016).
Mi-Young Kim, Randy Goebel, and S Ken. 2015.
Coliee-2015: evaluation of legal question answer-
ing. In Ninth International Workshop on Juris-
informatics (JURISIN 2015).
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations.
Thomas N. Kipf and Max Welling. 2016. Semi-
Supervised Classification with Graph Convolutional
Networks. CoRR, abs/1609.02907.
Olga Kovaleva, Alexey Romanov, Anna Rogers, and
Anna Rumshisky. 2019. Revealing the dark secrets
of BERT. In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Natu-
ral Language Processing (EMNLP-IJCNLP), pages
4365–4374, Hong Kong, China. Association for
Computational Linguistics.
Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2016.
Rationalizing Neural Predictions. In Proceedings of
the 2016 Conference on Empirical Methods in Nat-
ural Language Processing, pages 107–117, Austin, Texas. Association for Computational Linguistics.
Mike Lewis, Yinhan Liu, Naman Goyal, Mar-
jan Ghazvininejad, Abdelrahman Mohamed, Omer
Levy, Veselin Stoyanov, and Luke Zettlemoyer.
2020. BART: Denoising sequence-to-sequence pre-
training for natural language generation, translation,
and comprehension. In Proceedings of the 58th An-
nual Meeting of the Association for Computational
Linguistics, pages 7871–7880, Online. Association
for Computational Linguistics.
Tom C. W. Lin. 2016. Compliance, technology, and modern finance. Brook. J. Corp. Fin. & Com. L.
Yongjie Lin, Yi Chern Tan, and Robert Frank. 2019.
Open sesame: Getting inside BERT’s linguistic
knowledge. In Proceedings of the 2019 ACL Work-
shop BlackboxNLP: Analyzing and Interpreting Neu-
ral Networks for NLP, pages 241–253, Florence,
Italy. Association for Computational Linguistics.
Aldo Lipani, Mihai Lupu, Allan Hanbury, and Akiko
Aizawa. 2015. Verboseness fission for bm25 doc-
ument length normalization. In Proceedings of the
2015 International Conference on The Theory of In-
formation Retrieval, ICTIR ’15, page 385–388, New
York, NY, USA. Association for Computing Machinery.
Nelson F. Liu, Matt Gardner, Yonatan Belinkov,
Matthew E. Peters, and Noah A. Smith. 2019. Lin-
guistic Knowledge and Transferability of Contextual
Representation. In Proceedings of the 2019 Confer-
ence of the North American Chapter of the Associ-
ation for Computational Linguistics: Human Lan-
guage Technologies, Volume 1 (Long and Short Pa-
pers), pages 1073–1094, Minneapolis, Minnesota.
Association for Computational Linguistics.
Sean MacAvaney, Andrew Yates, Arman Cohan, and
Nazli Goharian. 2019. Cedr: Contextualized embed-
dings for document ranking. In Proceedings of the
42nd International ACM SIGIR Conference on Re-
search and Development in Information Retrieval,
SIGIR’19, page 1101–1104, New York, NY, USA.
Association for Computing Machinery.
Christopher D. Manning, Prabhakar Raghavan, and
Hinrich Schütze. 2009. Introduction to Information
Retrieval. Cambridge University Press.
Ryan McDonald, Georgios-Ioannis Brokos, and Ion
Androutsopoulos. 2018. Deep relevance ranking us-
ing enhanced document-query interactions. CoRR.
T. Mikolov, K. Chen, G. Corrado, and J. Dean. 2013.
Efficient Estimation of Word Representations in Vec-
tor Space. In International Conference on Learning
Representations, Scottsdale, AZ.
Jiaul H. Paik and Douglas W. Oard. 2014. A fixed-
point method for weighting terms in verbose infor-
mational queries. In Proceedings of the 23rd ACM
International Conference on Conference on Infor-
mation and Knowledge Management, CIKM ’14,
page 131–140, New York, NY, USA. Association for
Computing Machinery.
Jeffrey Pennington, Richard Socher, and Christopher D.
Manning. 2014. GloVe: Global Vectors for Word
Representation. In Proceedings of the 2014 Con-
ference on Empirical Methods in Natural Language
Processing (EMNLP), pages 1532–1543.
Juliano Rabelo, Mi-Young Kim, Randy Goebel, Masa-
haru Yoshioka, Yoshinobu Kano, and Ken Satoh.
2019. A summary of the COLIEE 2019 competition.
Nils Reimers and Iryna Gurevych. 2019. Sentence-
BERT: Sentence embeddings using Siamese BERT-
networks. In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Natu-
ral Language Processing (EMNLP-IJCNLP), pages
3982–3992, Hong Kong, China. Association for
Computational Linguistics.
S. Robertson, S. Walker, S. Jones, M. M. Hancock-
Beaulieu, and M. Gatford. 1995. Okapi at TREC-3. In Overview of the Third Text REtrieval Conference, pages 109–126.
Shazia Sadiq and Guido Governatori. 2015. Manag-
ing Regulatory Compliance in Business Processes,
pages 265–288. Springer Berlin Heidelberg, Berlin, Heidelberg.
Michael Taylor, Hugo Zaragoza, Nick Craswell,
Stephen Robertson, and Chris Burges. 2006. Opti-
misation methods for ranking functions with multi-
ple parameters. In Proceedings of the 15th ACM In-
ternational Conference on Information and Knowl-
edge Management, CIKM ’06, page 585–593, New
York, NY, USA. Association for Computing Machinery.
Peter M Tiersma. 1999. Legal language. University of
Chicago Press.
Andrew Trotman, Antti Puurula, and Blake Burgess.
2014. Improvements to bm25 and language models
examined. In Proceedings of the 2014 Australasian
Document Computing Symposium, ADCS ’14, page
58–65, New York, NY, USA. Association for Com-
puting Machinery.
George Tsatsaronis, Georgios Balikas, Prodromos
Malakasiotis, Ioannis Partalas, Matthias Zschunke,
Michael R. Alvers, Dirk Weissenborn, Anastasia
Krithara, Sergios Petridis, Dimitris Polychronopou-
los, Yannis Almirantis, John Pavlopoulos, Nico-
las Baskiotis, Patrick Gallinari, Thierry Artières, Axel-Cyrille Ngonga Ngomo, Norman Heino, Éric Gaussier, Liliana Barrio-Alvers, Michael Schroeder,
Ion Androutsopoulos, and Georgios Paliouras. 2015.
An overview of the BIOASQ large-scale biomedical
semantic indexing and question answering competi-
tion. BMC Bioinformatics, 16(138).
Ellen M. Voorhees. 2005. The TREC Robust Retrieval
Track. SIGIR Forum, 39(1):11–20.
Christopher Williams. 2007. Tradition and change in
legal English: Verbal constructions in prescriptive
texts, volume 20. Peter Lang.
A Dataset Compilation: Technical Details
In this section, we present the technical details
associated with the compilation of both datasets
described in the main paper. More specifically
we present the procedure of creating both corpora
as well as modelling the transposition relations
between EU and U K entries.
A.1 EU corpus
The compilation of the EU corpus is more straightforward than its UK counterpart, but involves some in-domain knowledge to filter unwanted legislation.
We initially download the core metadata associated with each document in the EU corpus by utilizing the SPARQL endpoint of the EU Publications Office and the EUR-Lex platform, used as a RESTful API.
Following the metadata collection, we proceed to filter out documents based on their type, in order to retain only EU directives and regulations. This involves excluding corrigendums and decisions, both of which are irrelevant to our use case. Corrigendums introduce corrections to prior EU legislation; usually these corrections are minimal and change single phrases, e.g., "In Regulation X, for: '. . . 4 July 2019 . . . ', read: '. . . 4 July 2015 . . . '." Thus, these documents lack the context to be both classified and correlated with other documents. The final EU corpus contains approximately 60k entries.
A.2 UK corpus
Compiling the UK corpus is not as trivial, since the legislation.gov.uk API is not as evolved, and we therefore have to manually crawl large parts of the database to build our corpus.
The collected UK laws from the legislation.gov.uk portal form the initial corpus, which includes approximately 100k documents.
Similarly to our processing of the EU corpus, we only retain documents of specific legislation types (UK Public General Acts, UK Local Acts, UK Statutory Instruments and UK Ministerial Acts). We then eliminate laws that aim to align English legislation with that of the rest of the United Kingdom, more specifically Scotland, Northern Ireland and Wales. The final UK corpus includes 52k UK entries.
A.3 EU2UK Transpositions

Transpositions are relations between entries in the EU and UK corpora, which we use to define relevance for our retrieval tasks. Processing these relations is the most challenging aspect of compiling our datasets and involves several steps.
We use the aforementioned SPARQL endpoint to retrieve the transpositions between EU directives and the corresponding UK regulations that implement them (see CELEX:32004L0038R(02) as an example). We initially collect approximately 10k EU2UK pairs. In these pairs, the transposed EU law is referred to by its unique portal ID, but the transposing UK law is referred to by its title. This is the primary challenge in modelling the transposition relations, since mapping legislation titles to unique entries in our UK corpus is not trivial. We hypothesize that these relations are manually inserted in the database, and that human errors therefore often make exact matches impossible. Apart from the matching difficulties, some of the pairs in the pool are inserted mistakenly and hence need to be filtered.

Figure 6: Recall@k, where k ∈ [0, 2000], across the three best pre-fetchers (i.e., BM25, C-BERT and ENSEMBLE) on the development dataset.
We first filter the noisy pairs. Pairs are considered noisy either because they are duplicates or because they do not meet some manually set criteria. In turn, duplication can occur either because identical pairs are inserted more than once or because pairs in which the UK title is mildly paraphrased are erroneously considered different. Our pool is reduced to 8k pairs after resolving the former and to 7k pairs after also resolving the latter. We further reduce the pool size by filtering pairs in which the UK title refers to non-English legislation (Scotland, Northern Ireland, Wales or Gibraltar), or in which the title does not contain certain keywords (e.g., Act, Regulation, Order, Rule). Non-English legislation usually has an almost identical counterpart within the purely English legislation. Documents whose titles do not contain any of these keywords are not officially published in the legislation.gov.uk portal; most of these are official releases from national governmental bodies, e.g., Ministries. For instance, the First Annual Report of the Inter-Departmental Ministerial Group on Human Trafficking is not part of the UK's national legislation.
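The de-duplication and keyword/region filters above can be sketched as follows (a simplified illustration with hypothetical constants; the actual criteria are broader and the paraphrase-level de-duplication is omitted):

```python
KEYWORDS = ("act", "regulation", "order", "rule")   # illustrative keyword list
REGIONS = ("scotland", "northern ireland", "wales", "gibraltar")

def filter_pairs(pairs):
    """Drop duplicate and noisy EU2UK transposition pairs.

    pairs: list of (eu_id, uk_title) tuples.
    """
    seen, kept = set(), []
    for eu_id, uk_title in pairs:
        key = (eu_id, uk_title.strip().lower())   # exact-duplicate removal
        if key in seen:
            continue
        seen.add(key)
        title = uk_title.lower()
        if any(region in title for region in REGIONS):   # non-English legislation
            continue
        if not any(kw in title for kw in KEYWORDS):      # simplified substring check
            continue
        kept.append((eu_id, uk_title))
    return kept
```

A production version would also collapse mildly paraphrased titles (e.g., by normalizing them first) rather than relying on exact lower-cased matches.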
To resolve the matching challenge, we employ a complex matching scheme, where for each pair we gradually normalize the UK title until we find either a single match or multiple ones. In the latter case, we resolve the matches with heuristics. Our normalizations include lower-casing, leading and trailing phrase removal, punctuation elimination, date removal and manually inserted substitutions. After reducing our pair pool and then applying our matching scheme, we can present, with high confidence, 4k transposition pairs, which we use in our datasets.
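A minimal sketch of this normalization cascade (the actual phrase lists and manual substitutions are more extensive; the trailing-phrase pattern below is purely illustrative):

```python
import re
import string

def normalize_title(title):
    """Gradually normalize a UK legislation title for matching
    (simplified sketch of the cascade described above)."""
    t = title.lower()                                   # lower-casing
    t = re.sub(r"^the\s+|\s+\(revoked\)$", "", t)       # leading/trailing phrase removal
    t = t.translate(str.maketrans("", "", string.punctuation))  # punctuation elimination
    t = re.sub(r"\b(19|20)\d{2}\b", "", t)              # date (year) removal
    return " ".join(t.split())                          # collapse whitespace

def match(uk_title, corpus):
    """Return corpus entries whose normalized title equals the
    normalized UK title; multiple matches would be resolved by heuristics."""
    key = normalize_title(uk_title)
    return [doc_id for doc_id, title in corpus.items()
            if normalize_title(title) == key]
```

Each normalization step is applied in sequence, so titles that differ only in casing, punctuation, or the year of publication map to the same key.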
B BERT Models

All BERT variants (BERT, S-BERT, LEGAL-BERT) are publicly available from Hugging Face:

BERT: The original BERT, pre-trained for Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) on English Wikipedia and the Books corpus. Available at https://huggingface.co/bert-base-uncased.

S-BERT: The original BERT fine-tuned on the STS-B and NLI datasets. Available from Hugging Face.

LEGAL-BERT: The original BERT further pre-trained on EU legislation. Available at https://huggingface.co/nlpaueb/bert-base-uncased-eurlex.
C Selecting k for pre-fetching

In Section 4.1, we stated that we report results for k = 100 in order to evaluate and compare pre-fetching methods. In Figure 6, we present the performance of the best pre-fetching methods (i.e., BM25, C-BERT and ENSEMBLE) for different values of k on the development set. We observe that after k = 100, the ENSEMBLE pre-fetcher has no significant gains in performance; thus we select k = 100 as a reasonable threshold.
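Recall@k, as plotted in Figure 6, can be computed with a helper like the following (our own sketch, not taken from the paper's released code):

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents found in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

# Selecting k: increase k until the recall curve flattens out
# (here, around k = 100 for the ENSEMBLE pre-fetcher).
```

Averaging this value over all development queries for increasing k reproduces curves of the kind shown in Figure 6.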