
Developing a complex query to build a specialised corpus: Reducing the issue of polysemous query terms.

Abstract

In this paper, I discuss an approach to formulating a complex query for a topic-specific corpus. Specifically, I outline how a dual query-term-group model can be used to counter the issue of polysemy of query terms when compiling a corpus from a database. I also show how the Relative Query Term Relevance statistical measure (Gabrielatos 2007) can be adapted to this model to aid in the final selection of candidate query terms. This approach was employed in formulating the query used to compile the corpus for my PhD project 'Constructing the Lone Wolf Terrorist: A corpus-driven study of the British press'. This paper was presented at the Corpora and Discourse International Conference 2020 (https://corporadiscourse.com/).
Aim
To show how dual query-term groups can reduce the problem of polysemous query terms, and how the Query Term Relevance (QTR) measure can be adapted to this model.
Case study
Task: compile a corpus from a database for the PhD project 'Constructing the Lone Wolf Terrorist: A corpus-driven study of the British press'.
Developing a Specialised-corpus Query
Specialised corpora
Precision: the proportion of retrieved texts that are relevant.
Recall: the proportion of relevant texts that are retrieved.
Text Relevance = Query Relevance (Gabrielatos 2007)
Developing a Specialised-corpus Query: The Query
(Gabrielatos 2007)
Working Definition
… single … ideologically motivated perpetrator … acts without any direct support … terrorist …
Issue of Polysemous Query Terms
Dictionary Definition
Lone wolf, n. (… U.S.): figurative (a) …; (b) …; attributive …; v. intransitive …
Issue of Polysemous Query Terms
Problem
Example Domain: Sport
[Four press examples in which "lone wolf" appears in sports coverage]
Dual Query-term Groups
Solution to Polysemy Issue
Group A: lone wolf (lone wolf OR lone actor OR …)
AND
Group B: terror* (terrorism OR extremism OR …)
Identifying Candidate Terms
1. Academic Literature
2. Pilot study
Pilot query: lone-wolf AND terror*
Identifying Candidate Terms: Keyness Analysis of Pilot Corpus
Concordance Check
Identifying Candidate Terms: Binomials
Modifiers  Nouns
Query Term Relevance (QTR) Measure
What?
A statistical measure of the relevance of a candidate query term (Gabrielatos 2007).
Purpose?
To aid the final selection of candidate query terms.
Query Term Relevance (QTR) Measure: Formula
QTR = (number of texts containing both the candidate term, CT, and the core query, CQ) / (number of texts containing the candidate term, T)
QTR ranges from 0 (the candidate term never co-occurs with the core query) to 1 (it always co-occurs).
(Gabrielatos 2007)
QTR Baseline
Baseline for Dual Query-term Groups
Each group's baseline is the QTR of its core term measured against the core term of the other group.
Group A Baseline
lone wolf AND terror*  T (lone-wolf)  QTR
2732                   4516           0.605
Group B Baseline
terror* AND lone wolf  T (terror*)  QTR
2732                   …            0.004
Relative Query Term Relevance (RQTR): Candidate Terms
RQTR compares the QTR of a candidate term with the baseline QTR.
Positive: the candidate term is more relevant than the baseline.
(Gabrielatos 2007)
CT (Group A)  CT(A) & CQ(B)  T  QTR  RQTR  RQTRn
lone actor 299 311 0.961 58.922 90.233
lone offender 5 15 0.333 -44.900 -44.900
lone radical 0 2 0 -100 -100
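The figures in the table can be reproduced with a short sketch, assuming the definitions in Gabrielatos (2007): QTR is the share of a candidate term's texts that also match the other group's core query, RQTR is the percentage difference between that QTR and the baseline, and RQTRn rescales positive scores so that the maximum is 100. The function names are illustrative, and the baseline is back-calculated here from the "lone offender" row rather than taken from the slides.

```python
def qtr(both, total):
    """Share of the candidate term's texts that also match the core query."""
    return both / total if total else 0.0


def rqtr(score, baseline):
    """Percentage difference between a term's QTR and the baseline QTR."""
    return 100.0 * (score - baseline) / baseline


def rqtr_n(score, baseline):
    """Normalised RQTR: positive scores are rescaled to peak at +100."""
    if score > baseline:
        return 100.0 * (score - baseline) / (1.0 - baseline)
    return rqtr(score, baseline)


# Rows from the table: term -> (texts with CT and CQ, texts with CT).
rows = {
    "lone actor": (299, 311),
    "lone offender": (5, 15),
    "lone radical": (0, 2),
}

# Assumption: baseline inferred from the "lone offender" row,
# i.e. solving RQTR = -44.900 for the baseline QTR.
baseline = qtr(5, 15) / (1.0 - 44.900 / 100.0)

for term, (both, total) in rows.items():
    score = qtr(both, total)
    print(term, round(score, 3), round(rqtr(score, baseline), 3),
          round(rqtr_n(score, baseline), 3))
```

With the baseline inferred this way (about 0.605), the "lone actor" row comes out at the table's QTR, RQTR and RQTRn values to within rounding, which suggests the assumed definitions match those used on the slide.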
Relative Query Term Relevance (RQTR)
RQTR  Interpretation
+100  always co-occurs
0     equal relevance
-100  never co-occurs
Review
Dual query-groups: reduce the issue of polysemous query terms.
Query Term Relevance: aids the final selection of candidate query terms.
Final Query: …
References
&.<<33B#3VWN 4C$1%"-8
2<DCountering Lone-Actor Terrorism Series
&3 449$'.-%
<:
-,,,:<%,)
&G,,31 44M$"X
DDiscourse in the professions: Perspectives from
corpus linguistics3D113LL
&+:3' 44>$DSelecting query terms to build a specialised corpus from a
restricted-access database.D)'N3LC%MM-:/L/L%
C%MM
&+:3' 44$'%8%
:Corpus Linguistics Advanced Research Education and Training (CLARET)
13M44-,,,C55C5
Gill, P. (2015) Lone-actor terrorists: A behavioural analysis. Routledge.
&3(B(/3#G 4>$DThe age of lone wolf terrorismI,Y<-':
*
&H3344.)-=6HP'
$3"#:<'11-#399%>5
&== 45$Z3/Z=7*
-,,,,4559M[GQR, 9
:45$