
CLARIAH Chaining Search: A Platform for Combined Exploitation of Multiple Linguistic Resources

Abstract

We introduce CLARIAH chaining search, a Python library and Jupyter web interface to easily combine exploration of linguistic resources published in the CLARIN/CLARIAH infrastructure, such as corpora, lexica and treebanks. We describe the architecture of our framework and give a number of code examples. Finally, we present a case study to show how the platform can be used in linguistic research. We identify words which are distinctive for a number of sociolinguistic variables (gender, social class and era) in the Letters as Loot corpus of sailing letters from the 17th and 18th century.
CLARIAH Chaining Search:
A Platform for Combined Exploitation of
Multiple Linguistic Resources
Peter Dekker, Mathieu Fannee, Jesse de Does
Dutch Language Institute (INT)
October 1, 2019
CLARIN Annual Conference 2019
Leipzig, Germany
Why CLARIAH Chaining search?
Perspective: linguist (with some computational knowledge)
Goal: combine heterogeneous CLARIN resources: corpora,
lexica, treebanks
Current flawed options
Access web interfaces
Download entire dataset
Current option: Web interface
Corpus Gysseling: 13th century Dutch
Web interface:
http://gysseling.corpus.taalbanknederlands.inl.nl/gysseling/page/search
Current option: Download entire dataset
Corpus Gysseling: 13th century Dutch
Download dataset:
https://ivdnt.org/taalmaterialen/102-taalmaterialen/2014-tstc-corpus-gysselingh
CLARIAH searchability
CLARIAH: digital research infrastructure for arts and
humanities in The Netherlands
Levels of searchability:
Local search: resource available as web service
Federated search: resources of same type queried as single
resource
Chaining search: heterogeneous sources combined, sequential
search workflows
CLARIAH chaining search
A Python library and Jupyter web interface to easily combine
exploration of linguistic resources published in the
CLARIN/CLARIAH infrastructure, such as corpora, lexica and
treebanks.
Implementation: earlier idea
Large SPARQL query to query multiple resources
But:
Not all resources available as Linked Open Data, queried with
SPARQL
Query becomes highly complex
Implementation: now
Python library
Based on pandas DataFrames (McKinney, 2011)
Four modules: search, process, ui and utils
Jupyter notebooks
Examples notebook
Sandbox: start coding yourself
Sailing leers case study
Resources: Supported protocols
Corpora
CLARIN Federated Content Search endpoint (Stehouwer et al., 2012)
Queried via CQL
More than 1200 freely available corpora
BlackLab corpus search engine (De Does et al., 2017)
Queried via CQL
More search options, e.g. metadata filtering
Lexica
Linked Open Data, Ontolex lexicon model (McCrae et al., 2017)
Queried via SPARQL
Internal INT lexicon service API
Treebanks
CGN (corpus of spoken Dutch) and Lassy (written Dutch)
Queried via XPath query
Library: search
Search in linguistic resource:
results = create_lexicon(lexicon_name).lemma(word).search()
results = create_corpus(corpus_name).pattern(pattern).search()
results = create_treebank(treebank_name).pattern(pattern).search()
Create results table (pandas DataFrame) with keywords-in-context:
df = results.kwic()
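To make the idea of a keywords-in-context table concrete, here is a minimal pandas-only sketch. It does not call the chaining search library. The helper name kwic_table, the window parameter and the context column names are assumptions for illustration, and the token list is invented.

```python
import pandas as pd


def kwic_table(tokens, keyword, window=2):
    """Build a keywords-in-context table for every occurrence of `keyword`.

    Hypothetical helper: the column names are assumptions, not the
    library's documented output format.
    """
    rows = []
    for i, token in enumerate(tokens):
        if token == keyword:
            rows.append({
                # Up to `window` tokens before the hit
                "left context": " ".join(tokens[max(0, i - window):i]),
                "word 0": token,
                # Up to `window` tokens after the hit
                "right context": " ".join(tokens[i + 1:i + 1 + window]),
            })
    return pd.DataFrame(rows, columns=["left context", "word 0", "right context"])


# Invented example sentence
tokens = "een groot schip en een klein schip".split()
df = kwic_table(tokens, "schip")
print(df)
```

Each hit becomes one row, so the usual DataFrame operations (filtering, grouping, joining) can be applied to search results directly.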
Library: utils
utils provides functions for general operations on search results tables.
diff = column_difference(df1["lemma 0"], df2["lemma 0"])
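The slides do not state what a column difference returns, so the following is only a sketch of one plausible reading: a set comparison between two columns. The function name column_difference_sketch and its three-part return value are assumptions, and the data is invented.

```python
import pandas as pd


def column_difference_sketch(col_a, col_b):
    """Compare two columns as sets (hypothetical stand-in, not the library's code).

    Returns the values only in col_a, the values only in col_b,
    and the values the two columns share.
    """
    a = set(col_a.dropna())
    b = set(col_b.dropna())
    return a - b, b - a, a & b


# Invented results tables from two different searches
df1 = pd.DataFrame({"lemma 0": ["schip", "brief", "man"]})
df2 = pd.DataFrame({"lemma 0": ["brief", "suiker"]})

only_1, only_2, shared = column_difference_sketch(df1["lemma 0"], df2["lemma 0"])
print(only_1, only_2, shared)
```

A comparison like this is what lets one resource be checked against another, for example to find lemmata that occur in a corpus but are missing from a lexicon.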
Library: process
process handles linguistic processing of data.
df_lexicon = extract_lexicon(df_corpus)
Library: ui
Operations regarding showing the user interface and loading/saving
data
save_dataframe(df, "test.csv")
df = load_dataframe("test.csv")
Chaining in practice: adjectives without -e
Look for adjectives that should take the ending -e after the determiner een
but lack it.
First, search in corpus.
df_corp = create_corpus("openchn").pattern('[pos="DET" & lemma="een"][word=".*[^e]$" & pos="AA.*degree=pos.*"][pos="NOU.*gender=[fm].*"]').search().kwic()
Chaining in practice: adjectives without -e
Search in lexicon.
df_lex = create_lexicon("molex").lemma('(.+)[^e]$').pos('ADJ(degree=pos)').search().kwic()
final_e_condition = df_filter(df_lex["wordform"], 'e$')
df_lexicon_form_e = df_lex[final_e_condition]
Chaining in practice: adjectives without -e
Filter corpus using results from lexicon.
e_forms = set(df_lexicon_form_e.lemma)
cond = df_filter(df_corp["word 1"], pattern=e_forms, method="isin")
result_df = df_corp[cond]
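The chaining step itself can be mimicked with plain pandas on toy data. This sketch does not call the chaining search library: the regex match and the isin test stand in for df_filter, and both tables are invented stand-ins for the corpus hits and lexicon entries.

```python
import pandas as pd

# Invented corpus hits: determiner, adjective without -e, noun
df_corp = pd.DataFrame({
    "word 0": ["een", "een", "een"],
    "word 1": ["groot", "klein", "houten"],
    "word 2": ["stad", "vrouw", "tafel"],
})

# Invented lexicon entries: one row per (lemma, word form) pair
df_lex = pd.DataFrame({
    "lemma": ["groot", "groot", "klein", "klein", "houten"],
    "wordform": ["groot", "grote", "klein", "kleine", "houten"],
})

# Step 1: keep lexicon rows whose word form ends in -e
has_final_e = df_lex["wordform"].str.contains(r"e$", regex=True)
e_forms = set(df_lex.loc[has_final_e, "lemma"])

# Step 2: keep corpus hits whose adjective has such an -e form in the lexicon
result_df = df_corp[df_corp["word 1"].isin(e_forms)]
print(result_df)
```

In this toy data "houten" is dropped because the lexicon lists no form of it ending in -e, while "groot" and "klein" are kept as candidates for a missing -e.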
Jupyter notebook
Case study: Social class and gender in sailing letters
Letters as Loot corpus (Van der Wal et al., 2012):
17th and 18th century letters from Dutch sailors, annotated with metadata
Research question: Which vocabulary is specific for:
Social class (low or high)
Gender (male or female)
Time period (17th or 18th century)
Use CLARIAH chaining search to:
1. Retrieve data from corpus
2. Filter and split data on metadata (class/gender/era)
3. Compute relative frequencies of words per metadata category
4. Compute dierences (ratios) between relative frequencies
Notebook Case study paper.ipynb in GitHub repository
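Steps 2 to 4 can be sketched with pandas on invented data. The exact ratio formula is not given on the slides; adding 1 to each relative frequency before dividing is an assumption here, chosen because it avoids division by zero and reproduces the tabulated values in the results (for example (1 + 0.023) / (1 + 0.011) ≈ 1.012 for vriend). The column names are also assumptions.

```python
import pandas as pd

# Invented toy data: one row per token, with its lemma and a metadata value
df = pd.DataFrame({
    "lemma": ["schip", "brief", "schip", "heer", "brief", "heer", "heer", "schip"],
    "era":   ["17th", "17th", "17th", "17th", "18th", "18th", "18th", "18th"],
})

# Steps 2-3: split on the metadata value and compute, per category,
# the relative frequency of each lemma (each column sums to 1)
rel_freq = pd.crosstab(df["lemma"], df["era"], normalize="columns")

# Step 4: ratio between the relative frequencies.
# Assumption: 1 is added to both sides so that a zero frequency is harmless.
rel_freq["diff"] = (1 + rel_freq["17th"]) / (1 + rel_freq["18th"])

# Highest ratios are most specific for the 17th century, lowest for the 18th
ranked = rel_freq.sort_values("diff", ascending=False)
print(ranked)
```

Sorting the ratio in descending order gives the lemmata most specific for the first category at the top and those most specific for the second category at the bottom.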
Time period: 17th vs 18th century (high class)
Lemmata most specific for time period (highest difference in relative frequency)
Most specific for the 17th century
lemma        rel. freq. 17th   rel. freq. 18th   diff
huisvrouw    0.016             0.002             1.013
vriend       0.023             0.011             1.012
man          0.020             0.010             1.010
gezondheid   0.016             0.007             1.010
goedenacht   0.009             0.000             1.009
schipper     0.009             0.000             1.009
brief        0.022             0.014             1.008
suiker       0.010             0.002             1.008
monsieur     0.008             0.001             1.006
schip        0.023             0.017             1.006
Most specific for the 18th century
lemma        rel. freq. 17th   rel. freq. 18th   diff
heer         0.006             0.021             0.985
mijnheer     0.009             0.018             0.991
edele        0.000             0.008             0.992
jaar         0.005             0.012             0.993
mejuffrouw   0.001             0.006             0.995
kapitein     0.012             0.017             0.995
zuster       0.008             0.013             0.995
achting      0.000             0.005             0.995
familie      0.002             0.006             0.996
liefde       0.001             0.006             0.996
Social class: low vs high (17th century)
Lemmata most specific for social class (highest difference in relative frequency)
Most specific for low class
lemma        rel. freq. low   rel. freq. high   diff
goedenacht   0.026            0.009             1.017
hart         0.027            0.012             1.015
man          0.033            0.020             1.013
brief        0.034            0.022             1.012
Heer         0.022            0.011             1.011
gezondheid   0.026            0.016             1.009
zuster       0.016            0.008             1.008
huisvrouw    0.024            0.016             1.008
kind         0.020            0.013             1.007
zoon         0.016            0.011             1.005
Most specific for high class
lemma        rel. freq. low   rel. freq. high   diff
kapitein     0.005            0.012             0.993
monsieur     0.000            0.008             0.993
suiker       0.002            0.010             0.993
sinjeur      0.001            0.006             0.995
mijnheer     0.004            0.009             0.995
schipper     0.005            0.009             0.996
heer         0.002            0.006             0.996
pond         0.000            0.004             0.997
december     0.002            0.005             0.997
rekening     0.001            0.004             0.997
Gender: male vs female (17th century, low class)
Lemmata most specific for gender (highest difference in relative frequency)
Most specific for male
lemma          rel. freq. male   rel. freq. female   diff
suiker         0.002             0.000               1.002
schip          0.024             0.022               1.002
oom            0.003             0.002               1.001
vriend         0.022             0.021               1.001
december       0.002             0.001               1.001
maand          0.003             0.002               1.001
huis           0.004             0.003               1.001
port           0.002             0.001               1.001
vaderland      0.002             0.001               1.001
stuk           0.001             0.000               1.001
Most specific for female
lemma          rel. freq. male   rel. freq. female   diff
man            0.033             0.041               0.992
brief          0.034             0.038               0.996
hart           0.027             0.031               0.996
Heer           0.022             0.023               0.998
zoon           0.016             0.018               0.998
allerliefste   0.005             0.006               0.998
mens           0.005             0.006               0.998
gezondheid     0.026             0.027               0.999
huisvrouw      0.024             0.025               0.999
reis           0.010             0.011               0.999
Gender: male vs female (17th century, high class)
Lemmata most specific for gender (highest difference in relative frequency)
Most specific for male
lemma        rel. freq. male   rel. freq. female   diff
suiker       0.010             0.001               1.009
monsieur     0.008             0.001               1.006
vriend       0.023             0.019               1.005
sinjeur      0.006             0.002               1.004
december     0.005             0.001               1.004
goed         0.011             0.007               1.004
vracht       0.003             0.000               1.003
mr.          0.003             0.000               1.003
cargasoen    0.003             0.000               1.003
dienaar      0.002             0.000               1.00
Most specific for female
lemma        rel. freq. male   rel. freq. female   diff
man          0.020             0.046               0.975
brief        0.022             0.034               0.988
kind         0.013             0.022               0.991
zoon         0.011             0.018               0.993
goedenacht   0.009             0.016               0.993
zuster       0.008             0.015               0.994
gezondheid   0.016             0.022               0.994
dochter      0.004             0.010               0.995
genade       0.008             0.013               0.995
hart         0.012             0.017               0.995
Conclusion
Case study shows dierence in vocabulary across
sociolinguistic variables, especially for gender in higher social
class
Case study proves: CLARIAH Chaining search facilitates and
accelerates linguistic research
Discussion
Minimum of programming knowledge is needed
Long computations: invoke library directly, without notebook
Future work: optimization for very large data sets
Interested?
Demonstration during poster session
Try it yourself:
https://github.com/INL/chaining-search
Local install
Cloud instance on Azure
Full API reference in documentation:
https://chaining-search.readthedocs.io/en/latest/
References
De Does, J., Niestadt, J., and Depuydt, K. (2017). Creating Research
Environments with BlackLab. In Utrecht University, NL and
Odijk, J., editors, CLARIN in the Low Countries, pages 245–257.
Ubiquity Press.
McCrae, J. P., Bosque-Gil, J., Gracia, J., Buitelaar, P., and Cimiano, P.
(2017). The ontolex-lemon model: development and applications.
In Proceedings of eLex 2017 conference, pages 19–21.
McKinney, W. (2011). pandas: a Foundational Python Library for
Data Analysis and Statistics. Python for High Performance and
Scientific Computing, 14:9.
Stehouwer, H., Durco, M., Auer, E., and Broeder, D. (2012). Federated
Search: Towards a Common Search Infrastructure. LREC 2012,
page 5.
Van der Wal, M. J., Rutten, G., and Simons, T. (2012). Letters as loot:
Confiscated Letters filling major gaps in the History of Dutch. In
Dossena, M. and Del Lungo Camiciotti, G., editors, Pragmatics &
Beyond New Series, volume 218, pages 139–162. John Benjamins
Publishing Company, Amsterdam.