ExDocS: Evidence based Explainable Document Search
Sayantan Polley*¹, Atin Janki*¹, Marcus Thiel¹, Juliane Hoebel-Mueller¹ and Andreas Nuernberger¹
¹Otto von Guericke University Magdeburg, Universitätsplatz 2, 39106 Magdeburg, Germany (authors marked with * contributed equally)
Abstract
We present an explainable document search system (ExDocS), based on a re-ranking approach, that uses textual and visual
explanations to explain document rankings to non-expert users. ExDocS attempts to answer questions such as “Why is
document X ranked at Y for a given query?”, “How do we compare multiple documents to understand their relative rankings?”.
The contribution of this work is on re-ranking methods based on various interpretable facets of evidence such as term
statistics, contextual words, and citation-based popularity. Contribution from the user interface perspective consists of
providing intuitive accessible explanations such as: “document X is at rank Y because of matches found like Z” along with
visual elements designed to compare the evidence and thereby explain the rankings. The quality of our re-ranking approach
is evaluated on benchmark data sets in an ad-hoc retrieval setting. Due to the absence of ground truth of explanations, we
evaluate the aspects of interpretability and completeness of explanations in a user study. ExDocS is compared with a recent
baseline explainable search system (EXS), which uses a popular post-hoc explanation method called LIME. In line with the
“no free lunch” theorem, we find statistically significant results showing that ExDocS provides explanations for rankings
that are understandable and complete, but the explanation comes at the cost of a drop in ranking quality.
Keywords
Explainable Rankings, XIR, XAI, Re-ranking
1. Introduction
Explainability in Artificial Intelligence (XAI) is currently a vibrant research topic that attempts to make AI systems transparent and trustworthy to the concerned stakeholders. Research in the XAI domain is interdisciplinary but is primarily led by the development of methods from the machine learning (ML) community. From the classification perspective, e.g., in a diagnostic setting, a doctor may be interested to know how a prediction for a disease is made by the AI-driven solution. XAI methods in ML are typically based on exploiting features associated with a class label, add-on model-specific methods like LRP [2], model-agnostic approaches such as LIME [3], or causality-driven methods [4]. The explainability problem in IR is inherently different from a classification setting. In IR, the user may be interested to know how a certain document is ranked for the given query or why a certain document is ranked higher than others [5]. Often an explanation is an answer to a why-question [6].
In this work, Explainable Document Search (ExDocS),
we focus on a non-web ad-hoc text retrieval setting and
aim to answer the following research questions:
1. Why is a document X ranked at Y for a given query?
The 1st International Workshop on Causality in Search and Recommendation (CSR’21), July 15, 2021, Online
Contact: sayantan.polley@ovgu.de (S. Polley*); atin.janki@ovgu.de (A. Janki*); marcus.thiel@ovgu.de (M. Thiel); juliane.hoebel@ovgu.de (J. Hoebel-Mueller); andreas.nuernberger@ovgu.de (A. Nuernberger)
2. How do we compare multiple documents to understand their relative rankings?
3. Are the explanations provided interpretable and complete?
There have been works [5], [7] in the recent past that attempted to address related questions such as "Why is a document relevant to the query?" by adapting XAI methods such as LIME [3], primarily for neural rankers. We argue that the idea of relevance has deeper connotations related to the semantic and syntactic notions of similarity in text. Hence, we try to tackle the XAI problem from a ranking perspective. Based on interpretable facets, we provide a simple re-ranking method that is agnostic of the retrieval model. ExDocS provides local textual explanations for each document (Part D in Fig. 1). The re-ranking approach enables us to display the “math behind the rank” for each of the retrieved documents (Part E in Fig. 1). In addition, we provide a global explanation in the form of a comparative view of multiple retrieved documents (Fig. 4).
We discuss relevant work for explainable rankings in section two. We describe our contribution to the re-ranking approach and the methods used to generate explanations in section three. Next, in section four, we discuss the quantitative evaluation of rankings on benchmark data sets and a comparative qualitative evaluation against an explainable search baseline in a user study. To our knowledge, this is one of the first works comparing two explainable search systems in a user study. In section five, we conclude that ExDocS provides explanations that are interpretable and complete; the results are statistically significant in a Wilcoxon signed-rank test. However, the explanations come at the cost of reduced ranking performance, paving the way for future work. The ExDocS system is online¹ and the source code is available on request for reproducible research.

Figure 1: The ExDocS search interface. A local textual explanation, marked (D), explains the rank of a document with a simplified mathematical score (E) used for re-ranking. A query-term bar, marked (C), for each document signifies the contribution of each query term. Other facets of the local explanation can be seen in Fig. 2 and 3. A running column on the left, marked (B), shows a gradual fading of color shade with decreasing rank. Global explanation via document comparison, marked here as (A), is shown in Fig. 4. The figure shows search results for the sample query ‘wine market’ on the EUR-Lex [1] dataset.
2. Related Work
The earliest attempts at making search results explainable can be seen in visualization paradigms [8, 9, 10] that aimed at explaining term distributions and statistics. Mi and Jiang [11] noted that IR systems were among the earliest across research fields to offer interpretations of system decisions and outputs, through search result summaries. The areas of product search [12] and personalized professional search [13] have explored explanations for search results by creating knowledge graphs based on users’ logs. In [14], Melucci made a preliminary study and suggested that structural equation models from a causal perspective can be used to generate explanations for search systems. Related to explainability, the perspective of ethics and fairness [15, 16] is also often encountered in IR, whereby the retrieved data may relate to disadvantaged people or groups. In [17], a categorization of fairness in rankings is devised based on the use of pre-processing, in-processing, or post-processing strategies.

¹ https://tinyurl.com/ExDocSearch

Figure 2: Contribution of query terms for relevance

Figure 3: Coverage of matched terms in a document
Recently there has been a rise in the study of the interpretability of neural rankers [5, 7, 18]. While [5] uses LIME, [7] uses DeepSHAP for generating explanations, and the two differ considerably. Neural ranking can be thought of as an ordinal classification problem, thereby making it easier to leverage XAI concepts from the ML community to generate explanations. Moreover, [18] generates explanations through visualization, using term statistics and highlighting important passages within the retrieved documents. Apart from this, [19] offers a tool built upon Lucene to explain the internal workings of the Vector Space Model, BM25, and the Language Model, but it is aimed at assisting researchers and is still far from an end user’s understanding. ExDocS also focuses on explaining the internal operations of the search system, similar to [19]; however, it uses a custom ranking approach.
Singh and Anand’s EXS [5] comes closest to ExDocS in terms of the questions it aims to answer through explanations, such as "Why is a document relevant to the query?" and "Why is a document ranked higher than another?". EXS uses DRMM (Deep Relevance Matching Model), a pointwise neural ranking model that uses a deep architecture at the query-term level for relevance matching. For generating explanations it employs LIME [3]. We consider the explanations from EXS a fair baseline and compare them with ExDocS in a user study.
3. Concept: Re-ranking via Interpretable Facets
The concept behind ExDocS is based on re-ranking with interpretable facets of evidence such as term statistics, contextual words, and citation-based popularity. Each of these facets is also a selectable search criterion in the search interface. Our motivation is to provide a simple, intuitive mathematical explanation of each rank with reproducible results. Hence, we start with a common TF-IDF based vector space model (VSM, as in out-of-the-box Apache Solr) with cosine similarity (ClassicSimilarity). The VSM helped us separate the contributions of query terms, enabling us to analytically explain the ranks. BM25 was not deemed suitable for explaining the rankings to a user, since it cannot be interpreted completely analytically. On receiving a user query, we expand the query and search the index. The top hundred results are passed to the re-ranker (refer to Algorithm 1) to get the final results. Term count is taken as the first facet of evidence, since we assumed that it is relatively easy to explain analytically to a non-expert end-user as: “document X has P% relative occurrences compared to the best matching document” (refer to Part E in Fig. 1). The assumption on term count is also in line with a recent work [18] on explainable rankings.
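As an illustration of this term-count evidence and the percentage shown in Part E, a minimal sketch follows; the normalization relative to the best-matching document follows the description in Section 4.3, while the function names and toy documents are ours:

from collections import Counter

def term_count_evidence(query_terms, doc_tokens):
    """Sum of raw query-term counts in a tokenized document (term-statistics facet)."""
    counts = Counter(doc_tokens)
    return sum(counts[t] for t in query_terms)

def relative_occurrence_pct(query_terms, ranked_docs):
    """Evidence of each document as a percentage of the best-matching document."""
    evidence = [term_count_evidence(query_terms, d) for d in ranked_docs]
    best = max(evidence, default=0) or 1  # avoid division by zero
    return [100.0 * e / best for e in evidence]

# Toy example: the second document has about 33% relative occurrences
docs = [["wine", "market", "wine", "eu"], ["market", "report"]]
print(relative_occurrence_pct(["wine", "market"], docs))  # [100.0, 33.33...]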
Skip-gram word embeddings are used to determine contextual words. About two to three nearest-neighbor words are used to expand the query. Additionally, the WordNet thesaurus is used to detect synonyms (a sketch of this expansion step is given after the facet list below). The optimal ratio of word embeddings versus synonyms is determined empirically from ranking performance. Re-ranking is performed based on the proportion of co-occurring words. This enables us to provide local explanations such as “document X is ranked at position Y because of matches found for synonyms like A and contextual words like B”. Citation analysis is performed by making multiple combinations of weighted in-links, PageRank, and HITS scores for each document. Citation analysis was selected and deemed an interpretable facet that we named “document popularity”. We argue that this can be used to generate understandable explanations such as: “document X is ranked at Y because of its popularity”. Finally, we re-rank using the following facets:
- Keyword Search: ‘term statistics’ (term count)
- Contextual Search: ‘context words’ (term count of query words + contextual words expanded via word embeddings)
- Synonym Search: ‘contextual words’ (term count of query words + expanded contextual words; the contextual words are WordNet synonyms in this case)
- Contextual and Synonym Search: ‘contextual words’ (term count of query words + expanded contextual words; the contextual words are word-embedding neighbors + synonyms in this case)
- Keyword Search with Popularity score: ‘citation-based popularity’ (popularity score of a document)
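A minimal sketch of the query-expansion step described above, assuming gensim word vectors and NLTK’s WordNet; the embedding model, the neighbor count, and the de-duplication step are our assumptions (the paper only states that two to three nearest neighbors plus WordNet synonyms are used):

from gensim.models import KeyedVectors
from nltk.corpus import wordnet  # requires: nltk.download('wordnet')

def expand_query(query_terms, kv, n_neighbors=3):
    """Expand a query with embedding neighbors (contextual words) and WordNet synonyms."""
    expanded = list(query_terms)
    for term in query_terms:
        # Contextual words: nearest neighbors in the skip-gram embedding space
        if term in kv.key_to_index:
            expanded += [w for w, _ in kv.most_similar(term, topn=n_neighbors)]
        # Synonyms from the WordNet thesaurus
        for syn in wordnet.synsets(term):
            expanded += [lemma.name().replace("_", " ") for lemma in syn.lemmas()]
    # De-duplicate while preserving order
    return list(dict.fromkeys(expanded))

# Usage (hypothetical pre-trained skip-gram vectors, e.g. trained on the corpus):
# kv = KeyedVectors.load("skipgram.kv")
# print(expand_query(["wine", "market"], kv))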
Based on benchmark ranking performance, we empirically determine a weighted combination of these facets, which is also available as a search-criterion choice in the interface. Additionally, we provide local and global visual explanations: local ones in the form of visualizing the contribution of features (expanded query terms) for each document, and global ones comparing them across multiple documents (refer to the Evidence Graph in the lower part of Fig. 4).
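A sketch of how such a weighted combination of facets could look; the weights below are purely illustrative, as the paper only states that they are tuned empirically on benchmark performance:

def combined_evidence(facet_scores, weights):
    """Weighted combination of normalized facet evidence for one document.

    facet_scores: dict facet_name -> normalized evidence for this document
    weights:      dict facet_name -> weight (tuned empirically on benchmarks)
    """
    return sum(weights[f] * facet_scores[f] for f in weights)

# Hypothetical weights and scores for one document
weights = {"term statistics": 0.5, "contextual words": 0.3, "citation-based popularity": 0.2}
doc = {"term statistics": 0.8, "contextual words": 0.4, "citation-based popularity": 0.1}
print(combined_evidence(doc, weights))  # 0.54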
Input: query q = {w1, w2, ..., wn}, document set D = {d1, d2, ..., dm}, facet
Output: a re-ranked document list

Select the top-k docs from D using cosine similarity, giving Dk = {d1, d2, ..., dk}
for i ← 1 to k do
    if facet == ‘term statistics’ or facet == ‘contextual words’ then
        evidence(di) ← Σ_{w ∈ q} count(w, di)   // count(w, di) is the count of term w in di
    end
    if facet == ‘citation-based popularity’ then
        evidence(di) ← popularityScore(di)      // popularityScore(di) can be the in-link count, PageRank, or HITS score of di
    end
end
Rerank all docs in Dk using evidence
return Dk

Algorithm 1: Re-ranking algorithm
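A runnable Python sketch of Algorithm 1; the document representation (pre-tokenized texts) and the precomputed popularity scores are our assumptions:

def rerank(query_terms, top_k_docs, facet, popularity_score=None):
    """Re-rank the top-k documents retrieved by cosine similarity (Algorithm 1).

    top_k_docs: list of (doc_id, tokens) pairs
    facet: 'term statistics', 'contextual words', or 'citation-based popularity'
    popularity_score: dict doc_id -> score (in-link count, PageRank, or HITS),
                      required for the popularity facet
    """
    evidence = {}
    for doc_id, tokens in top_k_docs:
        if facet in ("term statistics", "contextual words"):
            # Sum of counts of the (possibly expanded) query terms in the document
            evidence[doc_id] = sum(tokens.count(w) for w in query_terms)
        elif facet == "citation-based popularity":
            evidence[doc_id] = popularity_score[doc_id]
    # Re-rank by evidence, highest first
    return sorted(top_k_docs, key=lambda d: evidence[d[0]], reverse=True)

# Usage with a hypothetical top-100 list from the VSM retrieval step:
# reranked = rerank(expanded_query, top100, facet="term statistics")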
4. Evaluation
We have two specific focus areas in evaluation. The first is related to the quality of the rankings and the second is related to the explainability aspect. We leave the evaluation of the popularity score model for future work.
4.1. Evaluation of re-ranking algorithm
We experimented with the re-ranking algorithm on the TREC Disks 4 & 5 (-CR) dataset. The evaluations were carried out using the trec_eval [20] package. We used TREC-6 ad-hoc queries (topics 301-350) and used only the ‘Title’ of the topics as the query. We noticed that Keyword Search, Contextual Search, Synonym Search, and Contextual and Synonym Search were unable to beat the ‘Baseline ExDocS’ (out-of-the-box Apache Solr) on metrics such as MAP, R-Precision, and NDCG (refer to Table 1). We benchmark our retrieval performance against [21] and confirm that our ranking approach needs improvement to at least match the baseline performance metrics.
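The scores in Table 1 were computed with the trec_eval package; a roughly equivalent sketch using the pytrec_eval Python bindings is shown below (the qrels and run here are toy placeholders, not TREC-6 data):

import pytrec_eval

# Toy qrels (query -> doc -> graded relevance) and a run (query -> doc -> score);
# in the paper these would come from the TREC-6 qrels and the ExDocS result lists.
qrels = {"301": {"FBIS3-1": 1, "FBIS3-2": 0}}
run = {"301": {"FBIS3-1": 12.3, "FBIS3-2": 7.1}}

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"map", "Rprec", "ndcg"})
per_query = evaluator.evaluate(run)

# Average the per-query scores, as trec_eval reports them
for measure in ("map", "Rprec", "ndcg"):
    mean = sum(q[measure] for q in per_query.values()) / len(per_query)
    print(measure, round(mean, 4))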
4.2. Evaluation of explanations
We performed a user study to qualitatively evaluate the explanations and to compare ExDocS’s explanations with those of EXS; for this we integrated EXS’s explanation model into our interface. By keeping the look and feel of both systems alike, we tried to reduce users’ bias towards either system.
4.2.1. User study setup
A total of 32 users participated in a lab-controlled user study. 30 users were from a computer science background, while 26 users had fair knowledge of information retrieval systems. Each user was asked to test both systems, and the questionnaire was formatted in a Latin-block design. The names of the systems were masked as System-A (EXS) and System-B (ExDocS).
4.2.2. Metrics for evaluation
We use the existing definitions ([6] and [22]) of Interpretability, Completeness, and Transparency in the community with respect to evaluation in XAI. The following factors are used for evaluating the quality and effectiveness of explanations:

- Interpretability: describing the internals of a system in human-understandable terms [6].
- Completeness: describing the operation of a system accurately and allowing the system’s behavior to be anticipated in the future [6].
- Transparency: an IR system should be able to demonstrate to its users and other interested parties why and how the proposed outcomes were achieved [22].
4.3. Results and Discussion
We discuss the results of our experiments and draw conclusions to answer the research questions.
RQ1. Why is a document X ranked at Y for a given query?
We answer this question by providing an individual textual explanation for every document (refer to Part D of Fig. 1) on the ExDocS interface. The “math behind the rank” (refer to Part E of Fig. 1) of a document is explained as a percentage of the evidence with respect to the best matching document.
Figure 4: Global explanation by comparison of evidence for multiple documents (increasing ranks from left to right). A title-body image is provided, marked (A), to indicate whether the query term was found in the title and/or body. The column marked (B) represents the attributes for comparison.
Table 1: MAP, R-Precision, and NDCG values for ExDocS search systems against TREC-6 benchmark values* [21]

IR System                        MAP     R-Precision   NDCG
csiro97a3*                       0.126   0.1481        NA
DCU97vs*                         0.194   0.2282        NA
mds603*                          0.157   0.1877        NA
glair61*                         0.177   0.2094        NA
Baseline ExDocS                  0.186   0.2106        0.554
Keyword Search                   0.107   0.1081        0.462
Contextual Search                0.080   0.0955        0.457
Synonym Search                   0.078   0.0791        0.411
Contextual and Synonym Search    0.046   0.0526        0.405
RQ2. How do we compare multiple documents to understand their relative rankings?
We provide an option to compare multiple documents through visual and textual paradigms (refer to Fig. 4). The evidence can be compared and contrasted, helping users understand why a document’s rank is higher or lower than that of others.
RQ3. Are the generated explanations interpretable and complete?
We evaluate the quality of the explanations in terms of their interpretability and completeness. Empirical evidence from the user study on interpretability:

1. 96.88% of the users understood the textual explanations of ExDocS.
2. 71.88% of the users understood the relation between the query term and the features (synonyms or contextual words) shown in the explanation.
3. Users gave a mean rating of 4 out of 5 (standard deviation = 1.11) to ExDocS on the understandability of the percentage calculation for rankings, shown as part of the explanations.
When users were explicitly asked whether they could “gather an understanding of how the system functions based on the given explanations”, they gave a positive response with a mean rating of 3.84 out of 5 (standard deviation = 0.72). The above-mentioned empirical evidence indicates that the ranking explanations provided by ExDocS can be deemed interpretable.
Empirical evidence from the user study on completeness:

1. All users found the features shown in the explanation of ExDocS to be reasonable (i.e., sensible or fairly good).
2. 90.63% of the users understood, through the comparative explanations of ExDocS, why a particular document was ranked higher or lower than other documents.

Moreover, 78.13% of all users claimed that they could anticipate ExDocS behavior in the future based on the understanding gathered through the explanations (individual and comparative). Based on the above empirical evidence, we argue that the ranking explanations generated by ExDocS can be assumed to be complete.
Transparency: We investigate whether the explanations make ExDocS more transparent [22] to the user. Users gave ExDocS a mean rating of 3.97 out of 5 (standard deviation = 0.86) on ‘Transparency’ based on the individual (local) explanations. In addition, 90.63% of all users indicated that ExDocS became more transparent after reading the comparative (global) explanations. This indicates that the explanations make ExDocS more transparent to the user.
Figure 5: Comparison of explanations from EXS and ExDocS on different XAI metrics. All values shown are scaled to [0, 1] for simplicity.
Comparison of explanations between ExDocS and EXS: Both systems performed similarly in terms of Transparency and Completeness. However, users found the ExDocS explanations to be more interpretable than those of EXS (refer to Fig. 5), and this comparison was statistically significant in the WSR test (|W| = 5.5 < W_critical = 10, with α = 0.05 and N_r = 10).
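For reference, a paired Wilcoxon signed-rank comparison of this kind can be computed as sketched below; the ratings are invented for illustration, and SciPy reports a p-value rather than the critical-value table used in the paper:

from scipy.stats import wilcoxon

# Hypothetical paired interpretability ratings for the two masked systems
exdocs_ratings = [4, 5, 4, 3, 5, 4, 4, 5, 3, 4]
exs_ratings = [3, 4, 3, 3, 4, 3, 4, 4, 2, 3]

# Paired, two-sided Wilcoxon signed-rank test on the rating differences
stat, p_value = wilcoxon(exdocs_ratings, exs_ratings)
print(f"W = {stat}, p = {p_value:.4f}")  # reject H0 if p < 0.05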
5. Conclusion and Future Work
In this work, we present an Explainable Document Search (ExDocS) system that attempts to explain document rankings to a non-expert user using a combination of textual and visual elements. We make use of word embeddings and the WordNet thesaurus to expand the user query. We use various interpretable facets such as term statistics, contextual words, and citation-based popularity. Re-ranking the results of a simple vector space model with such interpretable facets helps us explain the “math behind the rank” to an end-user. We evaluate the explanations by comparing ExDocS with another explainable search baseline in a user study. We find statistically significant results that ExDocS provides interpretable and complete explanations, although it was difficult to find a clear winner between the two systems in all aspects. In line with the “no free lunch” theorem, the results show a drop in ranking quality on benchmark data sets as the cost of obtaining comprehensible explanations. This paves the way for ongoing research to include user feedback to adapt the rankings and explanations. ExDocS is currently being evaluated in domain-specific search settings such as law search, where explainability is a key factor in gaining user trust.
References
[1] E. L. Mencia, J. Fürnkranz, Efficient multilabel classification algorithms for large-scale problems in the legal domain, in: Semantic Processing of Legal Texts, Springer, 2010, pp. 192–215.
[2] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, W. Samek, On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation, PloS one 10 (2015) e0130140.
[3] M. T. Ribeiro, S. Singh, C. Guestrin, "Why Should I Trust You?": Explaining the Predictions of Any Classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, Association for Computing Machinery, New York, NY, USA, 2016, pp. 1135–1144.
[4] J. Pearl, et al., Causal inference in statistics: An overview, Statistics Surveys 3 (2009) 96–146.
[5] J. Singh, A. Anand, EXS: Explainable Search Using Local Model Agnostic Interpretability, in: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM ’19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 770–773.
[6] L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, L. Kagal, Explaining explanations: An overview of interpretability of machine learning, in: 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), IEEE, 2018, pp. 80–89.
[7] Z. T. Fernando, J. Singh, A. Anand, A Study on the Interpretability of Neural Retrieval Models Using DeepSHAP, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1005–1008.
[8] M. A. Hearst, TileBars: Visualization of Term Distribution Information in Full Text Information Access, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’95, ACM Press/Addison-Wesley Publishing Co., USA, 1995, pp. 59–66.
[9] O. Hoeber, M. Brooks, D. Schroeder, X. D. Yang, TheHotMap.Com: Enabling Flexible Interaction in Next-Generation Web Search Interfaces, in: Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 01, WI-IAT ’08, IEEE Computer Society, USA, 2008, pp. 730–734.
[10] M. A. Soliman, I. F. Ilyas, K. C.-C. Chang, URank: Formulation and Efficient Evaluation of Top-k Queries in Uncertain Databases, in: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, SIGMOD ’07, Association for Computing Machinery, New York, NY, USA, 2007, pp. 1082–1084.
[11] S. Mi, J. Jiang, Understanding the Interpretability of Search Result Summaries, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 989–992.
[12] Q. Ai, Y. Zhang, K. Bi, W. B. Croft, Explainable Product Search with a Dynamic Relation Embedding Model, ACM Trans. Inf. Syst. 38 (2019).
[13] S. Verberne, Explainable IR for personalizing professional search, in: ProfS/KG4IR/Data:Search@SIGIR, 2018.
[14] M. Melucci, Can Structural Equation Models Interpret Search Systems?, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’19, Association for Computing Machinery, New York, NY, USA, 2019. URL: https://ears2019.github.io/Melucci-EARS2019.pdf.
[15] A. J. Biega, K. P. Gummadi, G. Weikum, Equity of attention: Amortizing individual fairness in rankings, in: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 2018, pp. 405–414.
[16] S. C. Geyik, S. Ambler, K. Kenthapadi, Fairness-Aware Ranking in Search and Recommendation Systems with Application to LinkedIn Talent Search, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 2221–2231. doi:10.1145/3292500.3330691.
[17] C. Castillo, Fairness and Transparency in Ranking, SIGIR Forum 52 (2019) 64–71.
[18] V. Chios, Helping results assessment by adding explainable elements to the deep relevance matching model, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, New York, NY, USA, 2020. URL: https://ears2020.github.io/accept_papers/2.pdf.
[19] D. Roy, S. Saha, M. Mitra, B. Sen, D. Ganguly, I-REX: A Lucene Plugin for EXplainable IR, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM ’19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 2949–2952.
[20] C. Buckley, et al., The trec_eval evaluation package, 2004.
[21] D. K. Harman, E. Voorhees, The Sixth Text REtrieval Conference (TREC-6), US Department of Commerce, Technology Administration, National Institute of Standards and Technology (NIST), 1998.
[22] A. Olteanu, J. Garcia-Gathright, M. de Rijke, M. D. Ekstrand, Workshop on Fairness, Accountability, Confidentiality, Transparency, and Safety in Information Retrieval (FACTS-IR), in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 1423–1425.