Query-Based Retrieval of German Regulatory
Documents for Internal Auditing Purposes
1st Gerrit Schumann
University of Oldenburg
dept. of business informatics
Oldenburg, Germany
2nd Katharina Meyer
University of Oldenburg
dept. of business informatics
Oldenburg, Germany
3rd Jorge Marx Gómez
University of Oldenburg
dept. of business informatics
Oldenburg, Germany
Abstract — Among the countless documents internal auditors
face in practice, regulatory documents are by far the most important materials. Auditors need them for every audit engagement to
familiarize themselves with the target state of the objects they are
auditing. Due to their textual contents, auditors typically work
through these documents by hand, which is very time consuming
– especially if the documents are complex, structured differently,
or are frequently modified. In this context, obtaining relevant
regulatory information for an upcoming audit can be challenging,
as auditors often have to find the materials they need among
hundreds of documents. In this study, we therefore adopted the
idea of document retrieval and evaluated it using a text corpus of
709 German regulatory documents, 13 different retrieval variants,
and 16 search queries. The results show that the combination of a lexical and a semantic search approach in particular provides the best results.
Keywords — Document Retrieval, Regulatory Documents,
Internal Auditing
I. INTRODUCTION
One of the main tasks of internal auditing is to examine
processes and structures within the entire company [1]. In the
course of these audits, target and actual objects are compared
with each other in order to identify possible deviations between
them. As a prerequisite for such a comparison, the target objects
are defined in various regulatory documents such as manuals,
guidelines or work instructions. These documents thus represent
the basis for every audit [1] [2] [3].
In order to identify relevant information for an upcoming
audit, internal auditors typically have to work through the regulatory documents manually. Especially in large companies, this
manual work represents an enormous time commitment due to
the usually large number of regulatory documents. Since internal
auditors audit all departments of a company together with their
various processes and structures, this also means that they have
to familiarize themselves with regulatory documents that cover
a wide variety of topics [1] [3]. In addition, the rules and processes of a company – and thus also regulatory documents such
as operational procedures or policy documents – are subject to
constant change due to the development and progress of the
company, the requirements of the market, and external regulations [4]. Accordingly, even auditors who have been working in
the same company for a long time have to deal with new or
revised regulatory documents, which can make it difficult for
them to find the right information for an upcoming audit.
One way to simplify work with regulatory documents is by
using Natural Language Processing (NLP), which is a field of
research and application that investigates how computers can be
used to understand and manipulate natural language text or
speech [5]. When preparing an audit, NLP offers the potential to
provide relevant regulatory information in an automated way,
which then could help to reduce the time and cost of an audit. At
the same time, such automated document processing could take
into account a company's entire regulatory information base,
which can reduce the risk of overlooking relevant regulations.
To date, several studies have already applied NLP to regulatory documents [6] [7] [8] [9]. However, they mostly did so for
the purpose of information extraction, which typically involves
recognizing normative textual phrases in regulatory documents
and mapping them to specific annotation schemas, such as
“subject, object, target, action”. Despite the benefits of such
information extraction, the upstream identification of actually
relevant regulatory documents – from which information could
then be extracted – is also crucial for auditors. In this context, a
critical aspect is the nature of the documents. Previous studies
have mostly used standardized or normed regulatory documents,
such as ISO standards [6] [8]. However, the regulatory documents used in internal auditing are typically characterized by a
higher degree of variance. This variance can result, for example,
from the subject matter addressed, the vocabulary used, the level
of detail, and the document structures. Moreover, the regulatory
documents used in research are usually in English, which means
that the suitability of the presented techniques for regulatory
documents in other languages, such as German, can only be
assumed, but not guaranteed. Furthermore, there is a general lack
of studies that examine NLP with stronger reference to auditing
practice. In this context, Gepp et al. [10] and Brown-Liburd et
al. [11] recommend more exchange with auditors to collect
requirements from a practical perspective. In this way, applications could be developed that are specific to the auditing context
and take into account the auditors’ requirements [10] [11].
With regard to NLP-supported audit preparation, there is a
particular lack of exchange with internal auditors to determine
which aspects need to be taken into account when dealing with
a company's regulatory documents. As a result of this lack of exchange, it has not yet been determined what specific requirements exist in practice, what techniques are appropriate for processing these documents, and how they can help auditors find relevant regulatory material for an upcoming audit.
The main objectives of this paper are therefore:
• Determining whether certain requirements exist from an
internal audit perspective that need to be considered when
obtaining regulatory documents for an upcoming audit.
• Identifying and testing NLP techniques that may provide a potential solution to the identified requirements.
• Creating an appropriate data set which can be used to
evaluate the potential techniques.
In order to collect requirements from the perspective of
internal audit practice, we conducted a semi-structured interview with five auditors from an international automotive manufacturer (Section III). Based on the knowledge gained from this
interview, the concept of document retrieval was taken up and
examined using 13 model variants. As a prerequisite for the
evaluation of these model variants (Section VII), a German text
corpus was created (Section IV-V), which consists of 709
German regulatory documents that were annotated with respect
to their relevance to 16 search queries (Section VI). In Section
VIII, a conclusion is drawn on the results.
II. RELATED WORK
Information retrieval is about finding unstructured material
that satisfies an information need within large document collections [12]. In this context, “document retrieval” specifies the
nature of the material as a complete document rather than a
single text passage. However, in research, both terms are often
used as synonyms [13]. In the past, several studies have already
applied information retrieval on different regulatory materials.
For example, Sugathadasa et al. [14] combined semantic word
measures and document embeddings with NLP methods such as
lemmatizing and case-folding to retrieve legal case documents.
Cheng et al. [15] presented an approach that used vector-based
similarity measures to compute a semantic relatedness between
concepts from various domain-specific taxonomies and mapped
them to regulations to make them easier to find. Collarana et al.
[16] developed a QA-system for regulatory documents based on
natural language questions, in which they compared an information-retrieval-based approach with a text-understanding-based
approach. To enable a comparison of related provisions from
different regulatory documents, Lau et al. [17] used information retrieval techniques to combine generic features, such as
exceptions, definitions, or concepts with domain knowledge.
Most of these studies used regulatory documents that were
issued by a central (usually public) entity. In contrast, regulatory
documents originating from companies, which correspond to the
diversity of regulatory material used by internal auditors, are
rarely considered. Furthermore, to the best of our knowledge, a
text corpus of German regulatory documents written from a
corporate perspective has not yet been compiled, annotated and
used for document retrieval.
III. REQUIREMENTS
In order to determine what requirements exist from an internal
audit perspective for dealing with regulatory documents, we
conducted a semi-structured interview with auditors from the
field. This allowed us to gain a better understanding of how these
documents are characterized and how internal auditors proceed
to find documents that are relevant to an upcoming audit.
A. Interview Setting and Results
The semi-structured interview was conducted with five internal
auditors of an internationally operating automotive manufacturer. Each participant has been working as an auditor for at least
three years and regularly performs audits using various regula-
tory documents. Participants were asked to answer the following
two questions: (1) What types of regulatory documents do you
typically deal with in the course of your work and how do these
documents differ? (2) How do you proceed to identify regulatory
documents that are relevant to an upcoming audit? The auditors'
responses to these questions are summarized below.
1) Results for Question 1:
The auditors described that in the case of their company, the
regulatory documents can be divided into different levels on the
basis of their scope (e.g., group | brand | department). These
document levels include, but are not limited to, “group policies”,
“brand policies”, “organizational policies”, “work instructions”,
“process standards”, and “training documents”. With regard to
the characteristic features of these different document levels, the
auditors described that, in addition to the thematic content, the
regulatory documents can also differ significantly in terms of
length, structure and the level of detail. For example, unlike
corporate instructions, work instructions are written in great
detail. Regulatory documents from the legal department, on the
other hand, tend to be more formal and often contain longer
sentences than regulatory documents from other departments. In
addition, the auditors pointed out that the vocabulary can also
differ significantly between the documents, as it is influenced by
the technical terms used in the respective business segment.
2) Results for Question 2
The auditors described that at the beginning of an audit
engagement they often already receive some relevant regulatory
documents directly from the department under audit. However,
these documents are usually only the starting point for an audit.
This means that the auditors have to search for additional regulatory documents whose relevance only becomes apparent in the course of the audit preparation or execution. The auditors also
illustrated this with an example: A regulatory document sent by
the department under audit could describe how sensitive data
should be handled in a particular system. However, the actual
definition of sensitive data within the corporate environment
may be defined in another regulatory document, which the
auditors must then obtain on their own. This downstream search
for additional regulatory documents is often performed in an exploratory manner and can take some time. This is especially the case when
the documents can only be searched on the basis of an exact
match between the search terms and the content of the document.
In this context, the auditors clarified the potential benefit of a
system that returns relevant documents not only when the search
terms are exactly matched, but also when there is semantic
proximity. They added that the context of a term often also plays
an essential role, which is why a combination of several search
terms might become necessary.
B. Derivation of the Requirements
The desired target state that emerged from the interview is
the ability of a system to find audit-relevant material in a pool of
regulatory documents that may differ greatly in terms of subject
matter as well as structure, vocabulary, and level of detail, based
on a search query consisting of one or more search terms. In the
remainder of this paper, we will address different techniques
from the field of document retrieval that could offer the potential
to approach this target state. For our study, this objective results
in requirements that concern two aspects:
• Diversity of the text corpus: Since the regulatory documents encountered in practice can have a high degree of diversity in terms of their subject matter, document structure, vocabulary, and level of detail, the text corpus used in this study should also exhibit these characteristics.
• Semantic, multi-keyword-based document retrieval:
Since relevant information in different regulatory documents does not necessarily bear the exact same designation, and further context may also be crucial for determining the relevant document in each case, the document
retrieval approach should not only be able to consider
multiple terms, but also return documents when terms are
lexically different but semantically similar.
IV. DATA GATHERING
In order to investigate to what extent a query-based retrieval system for German regulatory documents can be realized with NLP and document retrieval techniques, a German text corpus of heterogeneous regulatory documents is required. For this purpose, we first considered German ISO documents as well as German legal texts. While the latter contain many normative expressions that are in principle characteristic of regulatory documents, this type of document is rarely used by internal auditors. ISO documents, on the other hand, represent highly standardized regulatory documents that do not have the thematic and structural diversity of the regulatory documents used in internal auditing.
Independent of these two data sources, we did not find alternative data sources that could reflect the variety of internal regulatory documents described by the auditors during the interview (1st requirement: “Diversity of the text corpus” | Section III.B). For this reason, we decided to create a new text corpus by collecting publicly available regulatory documents from various
German companies and institutions in different subject areas.
The procedure used for this data collection is divided into the
following four steps (Fig. 1). First, suitable search terms had to
be created that could be used to locate regulatory documents on
the Internet. Here, the more similar the search terms are to the
exact names of the regulatory documents, the more likely they
are to be found. In the second step, these search terms were
passed to a web scraper, which searched the resulting websites
for embedded documents in PDF format. Thus, the result of the
second step is a set of PDF files which, according to their title,
could be considered as regulatory documents. In order to
evaluate the suitability of the compiled documents for the final
text corpus, they were checked in the third step against four
criteria: (1) The document must be a German regulatory document. (2) The document must be written in natural language. (3) The content of the document must be extractable as text (no scans). (4) The content of the regulatory document must fit thematically to the auditing work of our practice partner or be neutral to it (i.e., regulatory documents that regulate, e.g., surgical procedures were excluded). Steps 1-3 were done iteratively, as new search terms were identified by reviewing the regulatory documents, which in turn were used to scrape additional documents. In the final step, duplicates were captured and removed.
Fig. 1. Procedure for the collection of regulatory documents
As initial search terms, we first used the document designations mentioned by the auditors during the interview. The terms and descriptions of the documents that resulted from this initial search were then used to define additional search terms. These include, for example, 'Konzernrichtlinie' (Group Policy), 'Unternehmensrichtlinie' (Company Policy), 'Verfahrensrichtlinie' (Procedural Guideline), 'Antibestechungsrichtlinie' (Anti-bribery Policy), 'Prüfanweisung' (Audit Instruction), 'Arbeitsanweisung' (Work Instruction), 'Konzernnorm' (Group Norm) as well as 'Werknorm' (Factory Standard). As a result of this
iterative search for document designations, a total of 110 terms
were entered into the web scraper, which then searched 13,400
websites and downloaded 2,986 documents. Using the four
assessment criteria, we then manually assessed the documents,
leaving a total of 834 regulatory documents. Finally, 125
duplicates were identified, which reduced the set of relevant
regulatory documents to 709.
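The final deduplication step can be sketched in a few lines. This is an illustrative sketch only: the paper does not state how the 125 duplicates were detected, so the content-hash criterion and the `deduplicate` helper below are our assumptions.

```python
import hashlib

def deduplicate(documents):
    """Remove exact duplicates from a list of (name, text) pairs.

    Duplicates are detected via a hash of the whitespace-normalized,
    lowercased text, so re-downloads of the same PDF under different
    file names are caught as well. (Hypothetical criterion; the paper
    does not specify its duplicate detection.)
    """
    seen, unique = set(), []
    for name, text in documents:
        normalized = " ".join(text.split()).lower()
        fingerprint = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append((name, text))
    return unique
```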
V. DATA EXTRACTION AND UNDERSTANDING
Since the collected regulatory documents are in PDF format,
the text they contain cannot be processed directly. PDF files are
binary coded and more complex than pure text files as they
contain different fonts as well as tables or graphics. To process
the text, it therefore first had to be extracted from the PDF and
converted to unicode characters (e.g., UTF-8). For this task we
used the library pdfplumber1.
The topics covered by the regulatory documents include, among others, general and specific codes of conduct (e.g., on dealing with bribery), occupational health and safety, information security, factory standards, and supplier guidelines (Fig. 2-③). In total, they represent 40 different subject areas, which are treated in varying degrees of detail depending on the document type. On average, the documents comprise 15 pages, with the smallest document containing one page and the largest document containing 241 pages. However, about 87% of all documents contain fewer than 25 pages (Fig. 2-①). Sentence length is measured in tokens (words and punctuation marks) and averaged per document, with the most common average sentence length being 9-11 tokens (Fig. 2-②). After removing punctuation, the text corpus contains a total of 2,577,928 tokens, of which 140,700 are unique in their original form and 129,635 after conversion to lowercase.
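Token figures of this kind can be reproduced in principle with a simple count over the extracted texts. The following is a minimal sketch; the `corpus_statistics` helper and its regex tokenizer are illustrative assumptions, not the paper's exact tokenization.

```python
import re

def corpus_statistics(documents):
    """Count tokens over a list of document strings.

    Tokens are word-like character runs after punctuation removal;
    uniqueness is reported before and after lowercasing, mirroring
    the two corpus figures given in the text.
    """
    tokens = []
    for text in documents:
        # keep letters (incl. German umlauts/ß) and inner hyphens
        tokens.extend(re.findall(r"[\wäöüÄÖÜß-]+", text))
    return {
        "total_tokens": len(tokens),
        "unique_tokens": len(set(tokens)),
        "unique_lowercased": len(set(t.lower() for t in tokens)),
    }
```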
1 https://pypi.org/project/pdfplumber/
Fig. 2. Pages, sentence length, and topics of the collected regulatory documents
VI. DATA ANNOTATION
An evaluation of document retrieval systems that allows
neutral and repeatable comparisons between the systems
requires an annotated test dataset [12] [13]. In order to create
such a test set, a text corpus is required as well as a set of
information needs that can be formulated as search queries. The
annotations are then used to determine whether a document
within the text corpus is relevant to an information need [12].
In this study, we used the regulatory documents collected in
Section IV as our text corpus and a total of 16 information needs
that we formulated as search queries (Section VI.A). For the
actual annotation of the relevance of each regulatory document
to these 16 search queries, we considered two approaches.
In the first approach, each document is manually annotated
for its relevance to a search query. This was used, for example,
to create the Cranfield dataset [18], which is one of the first
datasets that enabled a quantitative measure of the effectiveness
of document retrieval systems. For this purpose, several experts
evaluated 1,398 abstracts of journal articles on aerodynamics
with respect to their relevance to 225 questions, resulting in a
total of 314,550 annotations. In the case of our 16 queries,
11,344 manual annotations would need to be made for our 709
regulatory documents. Since our regulatory documents average
15 pages in length, this manual effort was not feasible with the
resources we had available for this study. For this reason, we
chose to use an annotation approach that is typically applied to larger datasets, where only a subset of the documents is manually annotated.
Fig. 3. Concept for the annotation of the regulatory documents
The most common method for this is “pooling”
[12] [13]. The idea of pooling is based on the fact that for most search queries only a small subset of documents is actually relevant. It is assumed that a document set contains on average k relevant documents for a search query and that an optimal retrieval system would place the relevant documents on the first k ranks. For this reason, only the top k search results are manually evaluated for each search query. However, the disadvantage of this approach is that a real retrieval system will not place all relevant documents on the top ranks, which means that some relevant documents would be ignored [19]. To minimize this limitation, the
first k most relevant results from different retrieval systems are
collected in a pool for each search query (Fig. 3-①). In this
pool, the documents are stored in a random order, reduced by
duplicates and then manually evaluated by experts with regard
to their actual relevance to the respective search query (Fig. 3-④). All documents that are not in the pool after this evaluation are then automatically classified as not relevant [12] [13] (Fig. 3-⑤). In this way, it is possible to annotate even large datasets.
For this study, we slightly modified the pooling approach
(Fig. 3, shaded gray). This is due to the classical procedure in
pooling, according to which the search results per pool are
provided by retrieval systems from different developers – which
is not the case in our study, since only we developed the retrieval
systems. However, to reduce the potential risk that relevant
documents might not be considered by our retrieval systems, we
[Fig. 2 data: ① histogram of pages (x) per document (y); ② histogram of Ø sentence length (x) per document (y); ③ topics addressed within the regulatory documents (number of documents): Code of Conduct/General (110), Code of Conduct/Corruption & Bribery (59), Safety & Health/Contractors (53), Information Security (45), Factory Standard (42), Supplier/Supplier Manual (37), Procedural Instructions (31), No assignment (29), Data Protection (25), Supplier/Supplier Policy (25), IT Security (23), Supplier/Quality/Subcontractor (21), Occupational Safety & Health/General (20), Supplier/Logistics Policy (19), Code of Conduct/Conflicts of Interest (17), Work Instructions (17), Code of Conduct/Environment (16), Supplier/Material Data Sheets (14), Supplier/Goods Delivery (13), Supplier/General (9), Supplier/Supplier Rating (7), VDA Guideline (7), Compliance (7), Organizational Instruction (7), Code of Conduct/Donations (6), Supplier/Suppliers Code of Conduct (6), Code of Conduct/Corruption & Bribery/Gifts (6), Supplier/Quality/General (5), Corona (4), Transport/Dangerous Goods (4), Securities Investor (3), Transport/General (3), Transport/Load Securing (3), Code of Conduct/Cartel (3), Code of Conduct/Ethics (3), Certification (3), Site Regulations (2), Social Media (2), Supplier/Purchasing Conditions (2), Export & Econ. Sanctions Directive (1).]
manually presorted all documents by broad topics, such as
supplier, compliance, data protection, information security, or
safety & health (Fig. 3-②). Analogous to pooling, we followed
the assumption that only a subset of all documents is actually
relevant for a search query. Therefore, our presorting aims to
create hand-made subsets, which can then be used to expand the
pools with additional documents (Fig. 3-③). For example, if a
search query addresses the topic “supply”, then all documents
that have been assigned to the topic “supplier” during presorting
are added to the pool. In addition to these preselected documents, the top k search results of our 13 retrieval systems are
also added to the pool – according to the classical pooling
method (Fig. 3-①). Then duplicates are removed from the pool
and the relevance of all remaining documents to the respective
search query is evaluated manually. Analogous to the classical
pooling method, all documents that are not in the pool are
classified as “not relevant” (Fig. 3-④,⑤). By extending the
pooling approach in this way, additional documents that may not
be found by the document retrieval systems in the first k search
results can be manually identified beforehand and included in
the evaluation. In this way, the pooling approach can be complemented by a “manual retrieval system” that also takes into
account – even if not in detail – the totality of all documents.
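The extended pooling procedure described above can be sketched as follows. The document IDs and the manual relevance judgment (passed in as a function) are hypothetical; the sketch only illustrates the pool construction and the "everything outside the pool is not relevant" rule.

```python
def build_pool(results_per_system, k, presorted_subset):
    """Pooling with the manual presorting extension (sketch).

    results_per_system: ranked document-id lists, one per retrieval
    system; the top-k of each ranking enters the pool, together with
    the manually presorted, topic-matching documents. The pool is a
    set, so order is meaningless and duplicates are removed.
    """
    pool = set(presorted_subset)
    for ranking in results_per_system:
        pool.update(ranking[:k])
    return pool

def annotate(pool, all_doc_ids, judge):
    """Judge pooled documents manually; all others become 'not relevant'."""
    return {doc: (judge(doc) if doc in pool else False) for doc in all_doc_ids}
```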
A. Definition of Search Queries
In order to implement the different retrieval systems for our
annotation approach shown in Fig. 3, we first needed concrete
information needs. For this purpose, we formulated 16 audit-
related search queries, which are based on exemplary information needs mentioned by the auditors during the interview. The
resulting search queries consist of 2-3 German keywords from
different subject areas, such as “suppliers”, “occupational health
and safety”, “data protection and privacy”, “conflicts of
interest”, as well as “IT security” (Table I).
TABLE I. SEARCH QUERIES USED FOR DOCUMENT RETRIEVAL
B. Modeling of Document Retrieval Systems
Using the search queries listed in Table I as well as our text
corpus (Section IV), we implemented 13 different document
retrieval variants. Despite the second requirement (Section III.B)
that a system should also be able to provide relevant documents
based on semantic similarity to the search query, these 13 retrieval variants also include two entirely lexical search approaches. This allowed us to make a direct comparison between semantic and lexical approaches – especially because one of the lexical methods (viz. BM25) was extended into a semantic method.
1) Lexical Search
To implement lexical search, we used TF-IDF (term frequency-inverse document frequency) and BM25 [20]. Both methods are based on the bag-of-words (BOW) approach and assume an exact match between the search terms and the terms occurring in the documents. Since the BOW representation does not take into account word order or sentence structure, we first removed punctuation marks (except hyphens and apostrophes), stop words, multiple spaces, line breaks, special characters, single letters and numbers, and then normalized all characters to lowercase.
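The preprocessing steps listed above can be sketched as follows. The regular expression and the abbreviated stop-word list are illustrative assumptions; the paper does not publish its exact implementation.

```python
import re

# Illustrative excerpt; the study would use a full German stop-word list.
STOP_WORDS = {"der", "die", "das", "und", "für", "von"}

def preprocess(text):
    """Normalize a document for the bag-of-words models.

    Mirrors the steps in the text: lowercasing, removal of punctuation
    (keeping hyphens and apostrophes), special characters, numbers,
    single letters, stop words, and extra whitespace/line breaks.
    """
    text = text.lower()
    # replace everything except letters, apostrophes, and hyphens
    text = re.sub(r"[^a-zäöüß'\- ]", " ", text)
    return [t for t in text.split()            # split() collapses spaces
            if len(t) > 1 and t not in STOP_WORDS]
```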
a) TF-IDF and Cosine Similarity
TF-IDF weights the words in a document with consideration
of all word frequencies in the entire corpus. If words occur rarely
in the entire corpus but frequently in a document, then they are
weighted higher than words that occur frequently in the corpus
(Fig. 4-②). On this basis, the cosine similarity between the
search query and the documents is calculated in order to identify
the relevant documents to the search query [12]. To speed up the
retrieval task, this process can be divided into an offline and an
online phase. Within the offline phase, we used the TfidfVectorizer2, which we trained on the preprocessed regulatory documents. During this training, the TfidfVectorizer learned the vocabulary as well as the IDF values of the regulatory documents and converted them into a document-term matrix. In the online phase, the previously trained TfidfVectorizer is then used to convert the search query into a document-term matrix as well. After
that, the cosine similarity between the document-term matrices
of the regulatory documents and the respective search query is
calculated. In the final step, the regulatory documents are ranked
based on the calculated similarities (Fig. 4-①).
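The offline/online split can be illustrated with a minimal pure-Python re-implementation. This is a sketch for intuition only: the study uses scikit-learn's TfidfVectorizer, whose exact weighting, smoothing, and normalization differ in detail from the simplified variant below.

```python
import math
from collections import Counter

def vectorize(tokens, idf):
    """Map a token list to a sparse TF-IDF vector (term -> weight)."""
    tf = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in tf.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def fit_tfidf(docs):
    """Offline phase: learn IDF values and vectorize all documents.

    Simplified IDF (log(N/df) + 1); scikit-learn's default differs.
    """
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    return idf, [vectorize(doc, idf) for doc in docs]

def rank(query_tokens, idf, doc_vectors):
    """Online phase: vectorize the query and rank documents by cosine."""
    q = vectorize(query_tokens, idf)
    scores = [(i, cosine(q, v)) for i, v in enumerate(doc_vectors)]
    return sorted(scores, key=lambda s: s[1], reverse=True)
```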
Fig. 4. Lexical search using TF-IDF and cosine similarity
2 https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
Fig. 4 shows the following pipeline. Offline phase: regulatory documents → preprocessing → TF-IDF vectorization → document-term matrix. Online phase: search query → preprocessing → vectorization with the pretrained vectorizer → calculation of cosine similarity → ranking → ranked documents. The TF-IDF weight (Fig. 4-②) is computed as

tfidf(t, D) = tf(t, D) · log(N / n_t)

where tf(t, D) is the number of occurrences of term t in document D, N is the number of documents in the corpus, and n_t is the number of documents containing term t.
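The offline/online TF-IDF retrieval described above can be sketched with scikit-learn on toy documents (note that TfidfVectorizer computes a smoothed IDF variant of the textbook formula):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for the preprocessed regulatory documents
docs = [
    "erstellung materialdatenblätter produktion",
    "arbeitsschutz fremdfirmen baustelle",
    "passwörter anforderungen it sicherheit",
]

# Offline phase: learn vocabulary and IDF values, build the document-term matrix
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(docs)

# Online phase: vectorize the query with the pretrained vectorizer and rank by cosine similarity
query_vec = vectorizer.transform(["materialdatenblätter erstellen"])
scores = cosine_similarity(query_vec, doc_matrix)[0]
ranking = scores.argsort()[::-1]          # document indices, most relevant first
print(ranking[0])
```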
Search Query Term(s)
Description: Search for regulatory
documents that:
„Materialdatenblätter erstellen“
contain information on the creation of
material data sheets.
„Arbeitsschutz von
Fremdfirmen“
regulate the occupational health and
safety of contractors.
„Maßnahmen
Interessenkonflikte“
describe measures to avoid conflicts of
interest.
„Verarbeitung personen-
bezogener Daten“
describe how personal data must be
processed.
„Anforderungen für Passwörter“
contain requirements about passwords.
„Prüfung Erstbemusterung“
state what should be considered when
testing initial samples.
„Audit Datenschutz“
contain info. about data privacy audits.
„Definition sensible Daten“
describe how sensitive data is defined.
„Datenträger vernichten“
describe when and how data must be
destroyed.
„Backups erstellen“
contain instructions on how to create
data backups.
„Bezeichnung Profiling“
describe profiling.
„Schutz vor Schadsoftware“
describe the protection against malware
or computer viruses.
„Pseudonymisierung von Daten“
describe the pseudonymization of data.
„Ladungssicherung Transport“
contain information on cargo securing of
shipments.
„Persönliche Schutzausrüstung“
describe the use of personal protective
equipment or clothing.
„Palette Gesamtgewicht“
regulate the maximum weight a pallet
can carry.
1"
2"
Fig. 5. Lexical search using BM25
b) BM25
For the second lexical retrieval system, we applied the robust
and widely used ranking function BM25 (BM = best match; 25
= 25th version), which is also based on TF-IDF, but additionally
takes into account the term frequency saturation and the length
of a document (Fig. 5-②) [12]. Analogous to TF-IDF, this
approach can also be divided into an offline and an online phase.
In the offline phase, the preprocessed documents are first
tokenized using the German NLTK-tokenizer3. Using the
tokenized documents, a BM25 object is created that indexes the
documents based on their tokens. In the online phase, the
relevance value to a search query is then calculated for each
indexed and tokenized document using the BM25 object. After
that, the documents are ranked based on the determined rele-
vance values (Fig. 5-①). For the implementation of BM25 we
used the module gensim.summarization.bm254 , for which we
adopted the default parameter setting.
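Since the gensim.summarization.bm25 module used here was removed in gensim 4.x, the scoring can also be written out directly. The following self-contained sketch implements Okapi BM25 with gensim's default parameters (k1 = 1.5, b = 0.75) and a common non-negative IDF variant; it is an illustration, not the exact library code:

```python
import math
from collections import Counter

K1, B = 1.5, 0.75  # defaults: term-frequency saturation and document-length influence

def bm25_scores(query_tokens, tokenized_docs):
    """Score each tokenized document against the query with Okapi BM25."""
    N = len(tokenized_docs)
    avg_len = sum(len(d) for d in tokenized_docs) / N
    df = Counter()                                  # document frequency per term
    for doc in tokenized_docs:
        df.update(set(doc))
    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        score = 0.0
        for q in query_tokens:
            if q not in tf:
                continue
            idf = math.log(1 + (N - df[q] + 0.5) / (df[q] + 0.5))
            score += idf * tf[q] * (K1 + 1) / (tf[q] + K1 * (1 - B + B * len(doc) / avg_len))
        scores.append(score)
    return scores

docs = [["backup", "erstellen", "server"],
        ["passwort", "richtlinie"],
        ["backup", "wiederherstellung", "backup"]]
scores = bm25_scores(["backup", "erstellen"], docs)
print(max(range(len(scores)), key=scores.__getitem__))  # index of the best-matching document
```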
2) Semantic Search
Unlike lexical search, semantic search takes into account the
semantic meaning of documents and search queries. In this
study, we used two different methods for semantic search. The
first method is a query expansion that uses word embeddings to
expand the original search terms with semantically similar terms
such as synonyms or acronyms. The second method is a trans-
former-based “retrieval and re-ranking pipeline” that, unlike the
first method, also takes into account the context of the words.
a) Query Expansion using Word Embeddings
In practice, it cannot be assumed that an auditor would
always specify the exact matching term in a search query. Even
if relevant documents are returned in response to the search
query, it is also possible that other relevant documents are not
provided by the retrieval system because they contain seman-
tically similar but lexically different terms (e.g., “MDB” as an
abbreviation for “Materialdatenblatt”; eng: “Material data
sheet”). Since the lexical document retrieval systems using TF-
IDF and BM25 need an exact match between the terms in the
search query and the terms in the regulatory documents, they
3 https://www.nltk.org/api/nltk.tokenize.html
4 https://radimrehurek.com/gensim_3.8.3/summarization/bm25.html
are not able to consider such semantically similar terms. To solve
this problem, we added a query expansion to our BM25 docu-
ment retrieval system.
The basic architecture of our query expansion-based system
follows the structure of the retrieval system with BM25 (Fig. 5).
The newly added query expansion is performed after the search
query is tokenized (Fig. 6). As part of this step, word embeddings
are used to determine the most similar word vectors for each term
in the search query. For this similarity assessment, we again used
cosine similarity. To ensure an appropriate selection of the most
similar word vectors, only those words that have a similarity
value above a certain threshold are added to the query. For this
purpose, we tested different word embedding models for which
we set different thresholds (ranging from 0.75 to 0.9).
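The expansion step can be illustrated with a toy vocabulary of hand-made word vectors; in the study, the vectors came from the (fine-tuned) FastText models instead:

```python
import math

# Toy word vectors standing in for FastText embeddings
vectors = {
    "materialdatenblatt": [0.9, 0.1, 0.0],
    "mdb":                [0.85, 0.15, 0.05],
    "palette":            [0.0, 0.9, 0.4],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def expand_query(tokens, most_similar=1, threshold=0.9):
    """Add up to `most_similar` terms per query token whose similarity exceeds the threshold."""
    expanded = list(tokens)
    for token in tokens:
        if token not in vectors:                      # out-of-vocabulary terms stay unexpanded
            continue
        candidates = sorted(
            ((cosine(vectors[token], vec), word) for word, vec in vectors.items() if word != token),
            reverse=True,
        )
        expanded += [w for sim, w in candidates[:most_similar] if sim >= threshold]
    return expanded

print(expand_query(["materialdatenblatt", "erstellen"]))
```

The threshold plays the same role as the 0.75–0.9 values tested in the study: it keeps near-synonyms and abbreviations while filtering out merely related words.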
To implement our query expansion, we used two German
pretrained word embeddings from FastText [21] [22]. We chose
this library because, unlike Word2Vec [23] or GloVe [24], Fast-
Text considers N-grams at the character level rather than at the
word level, which allows it to create word vectors that are “out
of vocabulary” [21] [22]. The first model5 we used was devel-
oped by [25] and made open source by Facebook, while the
second one was developed by Deepset6. Both models were
trained on Wikipedia articles so that they represent a general
vocabulary. To teach these pretrained word embeddings the
vocabulary of regulatory documents, we further trained them on
our text corpus presented in Section IV. During this fine-tuning,
we used the same default parameters that were used for the
original training of each model (Table II). Deepset's pretrained
FastText model was trained using skipgram, while Facebook
Open Source's FastText model was trained using the CBOW
architecture (continuous bag-of-words). Within a window size,
the skipgram model predicts the context words for a target word,
while the CBOW model predicts the target word based on the
context words. In this regard, the window size indicates the
maximum distance between the context words and the predicted
word in a sentence. For both pretrained FastText models, we
used a window size of 5. The size of the embedding indicates
how many dimensions each of the word vectors consists of – in
Fig. 6. Semantic search using BM25 and query expansion
5 https://fasttext.cc/docs/en/crawl-vectors.html
6 https://deepset.ai/german-word-embeddings
Fig. 5 shows the following pipeline. Offline phase: regulatory documents → preprocessing → tokenization → BM25 object. Online phase: search query → preprocessing → tokenization → calculation of the BM25 scores → ranking → ranked documents. The BM25 score (Fig. 5-②) is computed as

score(D, Q) = Σ_{i=1}^{n} IDF(q_i) · [ tf(q_i, D) · (k1 + 1) ] / [ tf(q_i, D) + k1 · (1 − b + b · L_d / L_ave) ]

where Q is the search query, q_i is the i-th term of query Q, D is a document, tf(q_i, D) is the frequency of search term q_i in document D, L_d is the length of document D in tokens, L_ave is the average number of tokens of all documents, k1 is a free parameter to set the saturation of tf(q_i, D), and b is a free parameter to set the influence of the document length.
Fig. 6 extends the pipeline of Fig. 5 by a query-expansion step that follows the tokenization of the search query; the BM25 scores are calculated as in Fig. 5-②. The query expansion enriches the original search term(s) with semantically similar terms using eight model variants: Deepset (original) ms1 / ms2, Deepset (fine-tuned) ms1 / ms2, Facebook (original) ms1 / ms2, and Facebook (fine-tuned) ms1 / ms2.
TABLE II. FASTTEXT MODELS FOR QUERY EXPANSION

                 | Deepset   | Deepset (fine-tuned)         | Facebook  | Facebook (fine-tuned)
Training data    | Wikipedia | Wikipedia + Regulatory docs. | Wikipedia | Wikipedia + Regulatory docs.
Architecture     | skipgram  | skipgram                     | CBOW      | CBOW
Vocabulary size  | 1,319,232 | 1,325,117                    | 2,000,000 | 2,013,511
Embeddings size  | 100       | 100                          | 300       | 300
Min. frequency   | 10        | 10                           | 5         | 5
Window size      | 5         | 5                            | 5         | 5
this case 100 for Deepset's model and 300 for Facebook Open
Source's model. The minimum frequency defines how often a
word must occur in the text corpus in order for a word vector to
be created for that word. In this case, Deepset's model was
trained with a minimum frequency of 10, while Facebook Open
Source's model was trained with a minimum frequency of 5.
By further fine-tuning the two original FastText models with
our text corpus of regulatory documents, new domain-specific
words such as “Materialdatenblätter” (Material data sheets) or
“Informationssicherheitsleitlinie” (Information security guide-
line) were added to the vocabulary of both models. In this way,
the vocabulary of Deepset's FastText model was expanded by
5,885 words and that of Facebook Open Source's FastText
model by 13,511 words (Table II).
Using the four models shown in Table II, we created a total
of eight model variants that added a query expansion to the
BM25 document retrieval system. These model variants result
from the two different word embeddings (Facebook | Deepset),
the two training levels (with and without domain specializa-
tion), as well as different numbers of semantically similar words
added to the search query. Depending on the variant, the most
similar one (ms1) or two (ms2) terms from the respective word
embedding were added per search term – provided that the
similarity did not fall below a specific threshold. This threshold
in turn depends on the word embedding. For the original
Facebook model, this value was set to 0.75, for the original
Deepset model to 0.85, and for the two fine-tuned models to 0.9.
b) Retrieval and Re-Ranking
The documentation of the framework “Sentence-Transformers”7 presents a pipeline for semantic search that combines a retrieval system with a re-ranking system based on BERT [26]. In this pipeline, a retrieval system first identifies relevant documents for a query, and a cross-encoder is then applied to re-rank the previously identified documents by relevance (Fig. 7).
The retrieval task can be realized by a lexical search, such as
BM25, or by a semantic search using a bi-encoder. The latter
separately generates sentence embeddings for a given search
query and the documents. The embeddings of the search query
are then compared to the embeddings of the documents using
cosine similarity in order to retrieve the documents that are close
in vector space. The cross-encoder then passes the query and the
previously identified documents to the transformer network
simultaneously and returns a value between 0 and 1, which
indicates the relevance of the documents to the query [27] [28].
In contrast to bi-encoders, cross-encoders have the advan-
tage that they generalize better across domains – for example,
when the cross-encoder is trained on a particular domain, but
7 https://github.com/UKPLab/sentence-transformers
then applied to a different domain. In addition, cross-encoders
generally require less data and often provide more accurate
search results if the search queries are formulated in more detail.
On the other hand, a disadvantage is that they are computation-
ally expensive when dealing with large texts, since the search
query and the documents are loaded into the transformer network
simultaneously [29]. However, within the described pipeline,
this disadvantage is mitigated by the fact that the bi-encoder
independently pre-encodes the search query and documents into
sentence embeddings and then provides a set of potentially rele-
vant documents. This preselection can then be used by the cross-
encoder to perform the relevance evaluation for a smaller set of
query-document-pairs. In this way, the cross-encoder can help to
filter out possible noise from the retrieval system's output [27].
To realize the described retrieval and re-ranking pipeline, we
implemented three different variants. In the first and second
variants, a bi-encoder is used as a retrieval system that searches
either at the document level (Variant 1) or at the sentence level
(Variant 2). The third variant contains a retrieval system that uses
the BM25 approach (Fig. 5) instead of the bi-encoder.
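The two-stage logic common to all three variants can be sketched independently of the concrete models. Below, `retrieve` and `rescore` are hypothetical stand-ins for the bi-encoder (or BM25) and the cross-encoder, respectively:

```python
def retrieve_and_rerank(query, documents, retrieve, rescore, k=20):
    """Two-stage pipeline: cheap first-stage retrieval of top-k candidates,
    then an expensive pairwise rescoring of only those candidates."""
    # Stage 1: score all documents with the fast retriever (bi-encoder or BM25)
    first_stage = sorted(range(len(documents)),
                         key=lambda i: retrieve(query, documents[i]), reverse=True)
    top_k = first_stage[:k]
    # Stage 2: re-rank only the k candidates with the expensive scorer (cross-encoder)
    return sorted(top_k, key=lambda i: rescore(query, documents[i]), reverse=True)

# Toy stand-ins: term overlap as "retriever", length-normalized overlap as "cross-encoder"
docs = ["backup erstellen server", "passwort richtlinie", "backup strategie erstellen prüfen"]
overlap = lambda q, d: len(set(q.split()) & set(d.split()))
weighted = lambda q, d: overlap(q, d) / (1 + len(d.split()))
print(retrieve_and_rerank("backup erstellen", docs, overlap, weighted, k=2))
```

The design point is that the expensive scorer only ever sees k query-document pairs, which is exactly how the cross-encoder's computational cost is mitigated in the pipeline described above.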
For the implementation of the bi- and the cross-encoder we
used pretrained models of the framework Sentence-Transformer.
When selecting such pretrained models for a semantic search, a
general distinction can be made between symmetric and asymmetric semantic search. In symmetric semantic search, the query
and the entries in the corpus are approximately the same length.
Since our collected regulatory documents are significantly larger
than the search queries, we performed an asymmetric semantic
search. For this type of search, transformer models that are
trained on the MS-Marco8 dataset are suitable, as it contains
many keyword queries and longer document passages [30]. For
our bi-encoder, we therefore chose the model “msmarco-distil-
bert-multilingual-en-de-v2-tmp-lng-aligned” 9, which was trained
using the English model “msmarco-distilbert” and made multi-
lingual using “knowledge distillation”. The latter is based on the
idea that a translated sentence is mapped to the same place in the
vector space as its original English sentence [31].
Fig. 7. Semantic search using retrieval and re-ranking
8 https://microsoft.github.io/msmarco/
9 https://huggingface.co/sentence-transformers/msmarco-distilbert-
multilingual-en-de-v2-tmp-lng-aligned
Fig. 7 shows the following pipeline. Offline phase: regulatory documents → preprocessing → encoding to sentence embeddings. Online phase: search query → preprocessing → encoding to sentence embeddings → retrieval using the bi-encoder (assessment by cosine similarity, returning the documents that are close in vector space; two variants: sentence level and document level) → top-k documents → re-ranking by the cross-encoder → ranked documents. In the third variant, the bi-encoder retrieval is replaced by BM25.
For our cross-encoder, we used the pretrained model “amberoad/bert-multilingual-passage-reranking-msmarco”10. This model was also trained on the MS-Marco dataset and returns a value between -10 and 10 depending on how relevant a sentence is to the search query [32].
To preserve as much information as possible about the
structure of the sentence and the context of the words, we only
removed line breaks, double spaces and symbols from the
regulatory documents. For the second system variant, we also
split the regulatory documents into sentences. Both the search
query and the preprocessed regulatory documents were then encoded at the document level (Variant 1) and at the sentence level (Variant 2) using the bi-encoders and evaluated for relevance.
The most relevant search results are then evaluated a second
time using the cross-encoder. However, since the cross-encoder
only evaluates at the sentence level, we used the highest rele-
vance score occurring in each document as the decisive score for
the final ranking of the documents. In the last step, these scores
were then used to sort the documents according to their rele-
vance to each search query.
The implementation of the third system variant was analo-
gous to the procedure presented in Fig. 7. Only the preprocessing
and the bi-encoder were replaced by the BM25 retrieval system.
C. Final Annotation Step
With the implementation of all 13 retrieval systems, the first
prerequisite for our annotation concept was realized. The second
prerequisite was to presort the documents by broad topics, which
in the case of our 709 regulatory documents resulted in 40 topics
(Fig. 2). For each search query, we then created a pool in which
we first included all documents that fit the topic of the search
query based on the presorting. After that, we added at least 20 of
the highest-rated documents from each retrieval system to each
pool (k = 20). In this context, “at least” refers to the fact that we
successively increased k to collect more documents in the
respective pool if the presorting revealed that more documents
might be relevant to the search query. In the final step, we
removed duplicates and then manually assessed and annotated
all documents in each pool in terms of their actual relevance to
the respective search query. After this manual annotation, an
average of 30 relevant documents remained in each pool.
Following the classic pooling approach, we then labeled all
documents that were not in the pools as “not relevant”.
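The pool construction described above can be sketched as a set union. The data below is hypothetical; `topic_docs` and `system_rankings` stand in for the 40-topic presorting and the outputs of the 13 retrieval systems:

```python
def build_pool(topic_docs, system_rankings, k=20):
    """Pool = all presorted documents for the query's topic,
    plus the top-k documents of every retrieval system, without duplicates."""
    pool = set(topic_docs)                       # documents matching the query's topic
    for ranking in system_rankings:              # one ranked list of doc IDs per system
        pool.update(ranking[:k])                 # top-k of each system (k is raised if needed)
    return pool

topic_docs = ["doc_03", "doc_17"]
system_rankings = [["doc_17", "doc_42", "doc_05"],
                   ["doc_42", "doc_99", "doc_03"]]
pool = build_pool(topic_docs, system_rankings, k=2)
print(sorted(pool))
```

All documents in the pool are then annotated by hand; everything outside the pool is labeled “not relevant”, as in classic pooling.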
VII. EVALUATION
The evaluation was performed using the exact same search
terms and retrieval systems as those used in Section VI.
However, this time the outputs of the systems are not used to fill
the pools with potentially relevant documents, but to evaluate
the actual retrieval quality of the 13 different system variants. In
order to measure this quality, we used the final annotations made
in Section VI.C as ground truth and compared them to the initial
outputs of the 13 retrieval systems (which we interpret now as
“predictions”). These system outputs are represented by
numeric scores that provide the relevance of a document to a
search query. By comparing the scores of different documents
to the same search query, the documents can be ranked accor-
10 https://huggingface.co/amberoad/bert-multilingual-passage-reranking-
msmarco/tree/main
ding to their relevance. Note that the more documents are
considered in the ranking of a retrieval system, the higher the
likelihood that more relevant documents will be found – which
then also means, that the recall tends to be higher for the
respective system. In contrast, the fewer documents are consi-
dered in the ranking, the more likely it is that a system will reach
a higher precision. As a result, document retrieval systems
cannot be optimized for precision and recall at the same time.
We therefore calculated the mean average precision across all
queries (map@709). This metric is intended to measure the
ability of the retrieval system to rank the most relevant docu-
ments across all 709 documents. In addition, we have also
calculated the map@k in which the parameter k represents the
number of all actually relevant documents for a search query. For
example, k for the search query "Materialdatenblätter erstellen"
is 25, while k for the search query "Prüfung Erstbemusterung" is
45. The map@k metric can thus be used to more precisely assess
the retrieval systems' ability to rank relevant documents in the
top ranks.
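Mean average precision at a cutoff k can be computed as follows. This is a common formulation; the study's exact implementation is not given in the text, so details such as the normalization term may differ:

```python
def average_precision_at_k(ranked_ids, relevant_ids, k):
    """AP@k: mean of the precision values at each rank where a relevant document appears."""
    relevant = set(relevant_ids)
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank          # precision at this rank
    return precision_sum / min(len(relevant), k) if relevant else 0.0

def mean_average_precision(rankings, relevance, k):
    """map@k averaged over all queries."""
    return sum(average_precision_at_k(r, rel, k)
               for r, rel in zip(rankings, relevance)) / len(rankings)

# Toy example: one query, 5 ranked documents, 2 of them relevant (ranks 1 and 3)
print(average_precision_at_k(["a", "b", "c", "d", "e"], {"a", "c"}, k=5))
```

With k = 709 this evaluates the full ranking (map@709); with k set to the number of actually relevant documents per query, it corresponds to the map@k used in Table III.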
As can be seen in Table III, the system that best ranks the
regulatory documents according to both map@709 and map@k
is “BM25 + Cross Encoder”. In contrast, “TF-IDF + Cosine Simi-
larity” scored the lowest on both metrics. The system “BM25”
and the two combinations of “BI-Encoder + Cross Encoder” also
achieve high ratings at map@k. However, the BI-Encoder on
document level, despite its high rating at map@k, has the
second-worst rating at map@709. This indicates that the system
ranks many relevant documents in the top ranks, but also ranks
several relevant documents in the middle or lower ranks.
Adding a query expansion (QE) to the BM25 system did not improve the results at map@k and map@709 equally. Although it led to better ratings at map@709 in four cases (“fine-tuned deepset model ms1”, “original facebook model ms1 / ms2”, “fine-tuned facebook model ms1”), an improvement at map@k was achieved in only one case (“original facebook model ms1”). In terms
of domain specialization, it can also be seen that only the fine-
tuning of the deepset models produces better results at map@709
(both fine-tuned models outperform the original model but only
the “fine-tuned deepset model ms1” also outperforms the BM25
system without any query expansion). In contrast, an improve-
TABLE III. DOCUMENT RETRIEVAL RESULTS ACROSS ALL QUERIES
(QE = Query expansion | F = Figure number | L = lexical search | S = semantic search | ms1/2 = one most similar / two most similar terms added)

Retrieval System                                 | L/S | F | map@709 | map@k
TF-IDF + Cosine Similarity                       | L   | 4 | 0.566   | 0.734
BM25                                             | L   | 5 | 0.728   | 0.896
BM25 + QE using original deepset model ms1       | S   | 6 | 0.720   | 0.867
BM25 + QE using original deepset model ms2       | S   | 6 | 0.680   | 0.805
BM25 + QE using fine-tuned deepset model ms1     | S   | 6 | 0.734   | 0.866
BM25 + QE using fine-tuned deepset model ms2     | S   | 6 | 0.726   | 0.849
BM25 + QE using original facebook model ms1      | S   | 6 | 0.738   | 0.898
BM25 + QE using original facebook model ms2      | S   | 6 | 0.729   | 0.876
BM25 + QE using fine-tuned facebook model ms1    | S   | 6 | 0.731   | 0.876
BM25 + QE using fine-tuned facebook model ms2    | S   | 6 | 0.674   | 0.802
BM25 + Cross Encoder                             | S   | 7 | 0.799   | 0.934
BI-Encoder (document level) + Cross Encoder      | S   | 7 | 0.622   | 0.900
BI-Encoder (sentence level) + Cross Encoder      | S   | 7 | 0.7519  | 0.8850
ment at map@k compared to the BM25 system without query
expansion could not be achieved by any of the fine-tuned QE-
models. However, we noticed that when searching with very
specific terms, the fine-tuned models placed more relevant
documents in the top-k ranks than the original model.
Overall, it can be seen that semantic searches produce better
results than lexical searches and that the systems with the cross-
encoder, especially with BM25 as the retrieval system, achieve
the best results. This can be explained by the fact that, unlike
other systems, the cross-encoder recognizes not only synonyms,
variations and acronyms but also the context of the words. How-
ever, regardless of the very good results of the cross-encoder in
combination with BM25, the relevance scores of the query
expansion are easier to follow, as they are calculated based on
an exact match of the enriched search terms and the documents.
VIII. CONCLUSION
The aim of this study was to identify and test different docu-
ment retrieval techniques that can assist internal auditors in
finding relevant regulatory documents for an upcoming audit.
To gain a better understanding of how internal auditors deal with regulatory documents in practice, we first conducted an interview with five auditors from an internationally operating automotive manufacturer. An important finding of this interview was that
internal auditors often face regulatory documents that can vary
widely in terms of subject matter, structure and level of detail.
In addition, finding the right regulatory documents can some-
times be difficult because, on the one hand, additional terms may
be needed to further define the context of the topic and, on the
other hand, possible search terms do not always correspond to
the terms contained in the regulatory documents. As a result of
these domain-specific requirements, the concept of document
retrieval was taken up and investigated using 13 different model
variants and 16 search queries, which we applied to a text corpus
consisting of 709 German regulatory documents11.
The evaluation of the different model variants showed that a
semantic search approach achieves better results than the lexical
search due to the consideration of variations, synonyms, acro-
nyms as well as the context of the words contained in the search
query and the documents. In particular, the combination of a
lexical-based BM25 system and a transformer-based cross-
encoder proved to be the best model approach, as it was very
good at generalizing across the regulatory document topics. In
this context, future research could focus on creating a cross-
encoder specifically for the domain of internal regulatory
documents to further improve search results. Besides the cross-
encoder, the combination of a BM25 system with a query
expansion also provided good results. It could be shown that through domain specialization the corresponding word embeddings learned some useful domain-specific terms (Table II). To further
advance the domain specialization, future work could continue
training with more internal regulatory documents. At this point,
we also recommend the use of a model with query expansion
over a cross-encoder if the explainability of the search results is
of higher importance than the number of relevant documents.
The resulting findings can be of importance to auditors from
the field or developers in the auditing context, as they can com-
pare the requirements collected in this study with their own
needs and, if they are similar, derive actions or assessments that
could be crucial for transferring the approaches to their own
company. From a research perspective, we provide a study that
addresses concrete requirements from auditing practice, the
creation and annotation of a domain-specific text corpus, and its
use for the evaluation of 13 retrieval systems. A limitation of our paper is the rather small number of search queries and the fact
that the annotation, despite the advantages given by our modified
pooling approach, could not be additionally validated by other
independent developers – as it is usually intended for pooling.
Future research could address this aspect by including more
regulatory texts in collaborative pooling projects.
REFERENCES
[1] Peemöller, V. H., and Kregel, J., Grundlagen der Internen Revision: Standards, Aufbau und Führung [Fundamentals of Internal Auditing: Standards, Structure and Management]. 2nd ed. Berlin: Erich Schmidt Verlag, 2014.
[2] Berwanger, J., and Kullmann, S., Interne Revision: Funktion,
Rechtsgrundlagen und Compliance [Internal auditing: function, legal
basis and compliance]. 2nd ed. Wiesbaden: Springer Fachmedien
Wiesbaden, 2012.
[3] Knapp, E., Interne Revision und Corporate Governance: Aufgaben und
Entwicklungen für die Überwachung [Internal audit and corporate
governance: tasks and developments for monitoring]. 2nd ed. Berlin:
Erich Schmidt Verlag, 2009.
[4] Mitra, S., and Chittimalli, P. K., “A Systematic Review of Methods for Consistency Checking in SBVR-based Business Rules,” in DIAS/EDUDM@ISEC, 2017.
[5] Chowdhury, G. G., “Natural language processing,” Annual review of
information science and technology, vol. 37, no. 1, pp. 51–89, 2003.
[6] Janpitak, N., Sathitwiriyawong, C., and Pipatthanaudomdee, P., “Infor-
mation Security Requirement Extraction from Regulatory Documents
using GATE/ANNIC,” in 2019 7th International Electrical Engineering
Congress (iEECON). Hua Hin, Thailand: IEEE, 2019, pp. 1–4.
[7] Xu, X., Cai, H. “Semantic Frame-Based Information Extraction from
Utility Regulatory Documents to Support Compliance Checking” in
Mutis, I., Hartmann, T. (eds) Advances in Informatics and Computing in
Civil and Construction Engineering. Springer, Cham, 2019.
[8] Winter, K., and Rinderle-Ma, S., “Detecting constraints and their relations
from regulatory documents using nlp techniques,” in H. Panetto, C.
Debruyne, H. A. Proper, C. A. Ardagna, D. Roman und R. Meersman
(eds.) On the Move to Meaningful Internet Systems. OTM 2018
Conferences, Bd. 11229. Cham: Springer Int. Publishing (Lecture Notes
in Computer Science), pp. 261–278, 2018.
[9] Zhang, J., and El-Gohary, N. M., “Semantic NLP-based information
extraction from construction regulatory documents for automated
compliance checking,” Journal of Computing in Civil Engineering, vol.
30 no. 2, pp. 1943-5487, 2015.
[10] Gepp, A. Linnenluecke M. K., O’Neill, J., and Smith, T., “Big data tech-
niques in auditing research and practice: Current trends and future oppor-
tunities,” Journal of Accounting Literature, vol. 40, pp. 102–115, 2018.
[11] Brown-Liburd, H., Issa, H., and Lombardi, D., “Behavioral Implications
of Big Data's Impact on Audit Judgment and Decision Making and Future
Research Directions,” Accounting Horizons, vol. 29, no. 2, pp. 451–468,
2015.
[12] Manning, C. D., Raghavan, P., and Schütze, H., “Introduction to
information retrieval,” Re-printed. Cambridge: Cambridge University
Press, 2009.
[13] Croft, W. B., Metzler, D., and Strohman, T., Search Engines: Information
Retrieval in Practice. London: Pearson Education, Inc., 2015.
[14] Sugathadasa, K., Ayesha, B., de Silva, N., Perera, A. S., Jayawardana, V.,
Lakmal, D., and Perera, M., “Legal document retrieval using document
vector embeddings and deep learning,” arXiv preprint:1805.10685, 2018.
[15] Cheng, C. P., Lau, G. T., Law, K. H., Pan, J., and Jones, A., “Regulation
retrieval using industry specific taxonomies,” Artificial Intelligence and
Law, vol. 16, no. 3, pp. 277–303, 2008.
11 https://github.com/schumanng/german_regulatory_documents
[16] Collarana, D., Heuss, T., Lehmann, J., Lytra, I., Maheshwari, G.,
Nedelchev, R., Schmidt, T., and Trivedi, P., “A question answering
system on regulatory documents,” in Legal Knowledge and Information
Systems, vol. 313, pp. 41–50, 2018.
[17] Lau, G. T., Law, K. H., and Wiederhold, G., “Legal information retrieval
and application to e-rulemaking,” in Proceedings of the 10th international
conference on Artificial intelligence and law (ICAIL 2005), Bologna,
Italy, 2005, pp. 146–154.
[18] Cleverdon, C., “The Cranfield Tests on Index Language Devices,” Aslib
Proceedings, vol. 19, no. 6, pp. 173–194, 1967.
[19] Glavaš, G., “10. Evaluation in Information Retrieval.” University of Mannheim, 2020. Accessed on: October 11, 2021 [online]. Available: https://www.uni-mannheim.de/media/Einrichtungen/dws/Files_People/Profs/goran/10-Evaluation-FSS20.pdf
[20] Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M. M., and
Gatford, M. “Okapi at TREC-3”. Nist Special Publication Sp, pp. 109-
126, 1995.
[21] Joulin, A., Grave, E., Bojanowski, P. and Mikolov, T., “Bag of tricks for
efficient text classification,” arXiv preprint arXiv:1607.01759, 2016.
[22] Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T., “Enriching word
vectors with subword information,” Transactions of the Association for
Computational Linguistics, vol. 5, pp. 135–146, 2017.
[23] Mikolov, T., Chen, K., Corrado, G., and Dean, J., “Efficient estimation of
word representations in vector space,” arXiv preprint arXiv:1301.3781,
2013.
[24] Pennington, J., Socher, R., and Manning, C., “GloVe: Global Vectors for
Word Representation,” in Proceedings of the 2014 Conference on
Empirical Methods in Natural Language Processing (EMNLP). Doha,
Qatar: Association for Computational Linguistics, 2014, pp. 1532–1543.
[25] Grave, E., Bojanowski, P., Gupta, P., Joulin, A., and Mikolov, T.,
“Learning Word Vectors for 157 Languages,” in Proceedings of the
International Conference on Language Resources and Evaluation (LREC
2018), Miyazaki, Japan, 2018.
[26] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K., “BERT: Pre-
training of Deep Bidirectional Transformers for Language Under-
standing,” arXiv preprint arXiv:1810.04805, 2018.
[27] Reimers, N., Retrieve & Re-Rank. Sentence-Transformers Dokumenta-
tion, 2021. Accessed on: November 14, 2021 [online]. Available: https://
www.sbert.net/examples/applications/retrieve_rerank/README.html
[28] Reimers, N., and Gurevych, I., “Sentence-bert: Sentence embeddings
using siamese bert-networks,” arXiv preprint arXiv:1908.10084, 2019.
[29] Thakur, N., Reimers, N., Daxenberger, J., and Gurevych, I., “Augmented
SBERT: Data Augmentation Method for Improving Bi-Encoders for
Pairwise Sentence Scoring Tasks,” arXiv preprint arXiv:2010.08240,
2020.
[30] Reimers, N., “Multilingual Information Retrieval (MS-Marco Bi-
Encoders)”, 2021. Accessed on: November 15, 2021 [online]. Available:
https://github.com/UKPLab/sentence-transformers/issues/695
[31] Reimers, N., and Gurevych, I., “Making monolingual sentence
embeddings multilingual using knowledge distillation,” arXiv preprint
arXiv:2004.09813, 2020.
[32] Reissel, P., and Manaj, I., Passage Reranking Multilingual BERT,
2020, Accessed on: November 14, 2021 [online]. Available:
https://huggingface.co/amberoad/bert-multilingual-passage-reranking-
msmarco