
ELAG2017

Automatic NLP-based Classification of Jewish Studies Titles
Maral Dadvar
Web-based Information Systems and Services
Stuttgart Media University
ELAG2017 - 08 June 2017
A collaboration:
University Library Johann Christian Senckenberg
Stuttgart Media University
Aims to:
Create a central access point
Offer a high-performance information infrastructure
And also:
Contextualize the digital Judaica collections
Enrich the metadata
Connect different data sources as Linked Open Data
FID Judaica
Specialized Information Service for Jewish Studies
Identified titles with subject code:
Compact-memory dataset
Freimann dataset
Unidentified titles without subject code:
Retro dataset:
A pool of titles from a variety of subjects
Including Jewish studies titles
FID Judaica
Digital Collections
A smart algorithm that automatically and accurately identifies Jewish studies titles among the unidentified titles.
Classification of Jewish Studies Titles
Experimental Framework
[Diagram: Input records -> Preparation (Language Detection, Stopword Removal, Noise Removal, Encoding Adaptation) -> Features (OCR elements, Jewish Authors, Jewish Keywords) -> Classifier -> Jewish Records / Non-Jewish Records]
Retro Dataset
Consisting of 578,807 records
Different disciplines, including Jewish studies titles, from the period 1600 to 1970, without subject code
Available fields
Titles
Authors’ names and family names
Raw OCR
IPN, PPN
Date
Classification of Jewish Studies Titles
Input Records
Language Detection
Python Langdetect package
55 languages
Current version: English, German
Hebrew titles in transliterated form
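The language detection step could be sketched roughly as follows. The slides name the Python langdetect package and the two languages handled so far; the function names, the `SUPPORTED` set and the decision to skip titles in other languages are assumptions made for illustration, not the authors' code.

```python
from typing import Callable, Optional

# Languages handled in the current version, per the slide.
SUPPORTED = {"en", "de"}


def default_detector() -> Callable[[str], str]:
    """Return langdetect's detect(); requires `pip install langdetect`."""
    from langdetect import DetectorFactory, detect

    DetectorFactory.seed = 0  # makes langdetect's results reproducible
    return detect


def route_by_language(title: str, detect_fn: Callable[[str], str]) -> Optional[str]:
    """Return the language code if the title should be processed, else None.

    Hypothetical helper: titles in unsupported languages are skipped.
    """
    try:
        code = detect_fn(title)
    except Exception:
        # langdetect raises when a text has no usable features (e.g. digits only)
        return None
    return code if code in SUPPORTED else None
```

A call such as `route_by_language("Der jüdische Lehrer sein Wirken und Leben", default_detector())` would hand the title to the German branch of the pipeline if langdetect recognises it as German. Short titles are hard to detect reliably, so some routing errors should be expected.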
Stopword Removal
Removes words that carry little information or are insignificant for search queries
Language specific
Example:
English: “The Sufferings of the Jews during the Middle Ages”
German: “Der jüdische Lehrer sein Wirken und Leben”
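A minimal sketch of language-specific stopword removal, applied to the two example titles above. The stopword sets here are tiny hypothetical samples; the slides do not say which stopword lists were actually used.

```python
from typing import List

# Hypothetical, deliberately small stopword sets for illustration only.
STOPWORDS = {
    "en": {"the", "of", "during", "and", "a", "an", "in"},
    "de": {"der", "die", "das", "sein", "und", "ein", "eine"},
}


def remove_stopwords(title: str, lang: str) -> List[str]:
    """Split a title on whitespace and drop the stopwords of the given language."""
    stop = STOPWORDS.get(lang, set())
    return [word for word in title.split() if word.lower() not in stop]


print(remove_stopwords("The Sufferings of the Jews during the Middle Ages", "en"))
# ['Sufferings', 'Jews', 'Middle', 'Ages']
print(remove_stopwords("Der jüdische Lehrer sein Wirken und Leben", "de"))
# ['jüdische', 'Lehrer', 'Wirken', 'Leben']
```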
Classification of Jewish Studies Titles
Preparation
Noise Removal
Removes unwanted spaces and characters
Example: ‘?’, ‘!’, ‘/’, ...
Character Encoding Harmonization
Makes all characters machine-readable
Encodes in UTF-8
Example: ‘é’, ‘á’, ‘í’, ‘Ü’, ‘Ö’, ...
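Noise removal and encoding harmonization might look like the sketch below. It uses only the Python standard library. The Latin-1 fallback and the NFC normalization are assumptions, since the slides only state that the output is encoded in UTF-8.

```python
import re
import unicodedata
from typing import Union

# The characters named on the slide; the real list is presumably longer.
NOISE = re.compile(r"[?!/]")


def clean_title(raw: Union[bytes, str]) -> str:
    """Decode to text, normalize accented characters, strip noise and extra spaces."""
    if isinstance(raw, bytes):
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            # Assumed fallback for legacy records; Latin-1 never fails to decode.
            text = raw.decode("latin-1")
    else:
        text = raw
    # NFC composes 'e' + combining accent into a single 'é' code point.
    text = unicodedata.normalize("NFC", text)
    text = NOISE.sub(" ", text)
    return " ".join(text.split())  # collapse runs of whitespace


print(clean_title("  Wer  ist's?/ Ja! "))   # Wer ist's Ja
print(clean_title("Über".encode("latin-1")))  # Über
```

The cleaned string can then be written out with `text.encode("utf-8")`.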
Classification of Jewish Studies Titles
Preparation cont’d.
Jewish Studies Keywords
Stage 1:
80 words related to Jewish studies, compiled intuitively
Stage 2:
Dataset with subject codes 770 and 760
Consisting of 155,496 records
Wider range of information compared to Retro
Titles and keywords fields
Term frequency–inverse document frequency (TF-IDF) value of the words
Top 100 most informative words
Merged with Stage 1
143 keywords
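The Stage 2 keyword selection can be illustrated with a small standard-library TF-IDF sketch. How the per-document TF-IDF values were aggregated into one ranking is not stated on the slide; summing them per term, as done here, is an assumption, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def tfidf_scores(docs: List[List[str]]) -> Dict[str, float]:
    """Sum of TF-IDF values per term over all tokenized documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores: Dict[str, float] = Counter()
    for doc in docs:
        if not doc:
            continue
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[term])
            scores[term] += tf * idf
    return dict(scores)


def top_terms(docs: List[List[str]], k: int) -> List[str]:
    """The k highest-scoring terms; ties are broken alphabetically."""
    scores = tfidf_scores(docs)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def merge_keywords(manual: Iterable[str], derived: Iterable[str]) -> List[str]:
    """Union of the Stage 1 and Stage 2 lists, without duplicates."""
    return sorted(set(manual) | set(derived))
```

Merging removes duplicates, which would be consistent with 80 manual and 100 derived words yielding 143 keywords rather than 180.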
Classification of Jewish Studies Titles
Features
Jewish Studies Authors
A list of 1,578 authors’ names
OCR elements
Hebr
Jud, Jew, Jewish
Place of publication: Tel-Aviv, Israel
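The three feature groups could be computed per record along these lines. This is an illustrative sketch: the function name, the feature names and the simple substring matching are assumptions, and the slides do not describe how the classifier combines the features.

```python
from typing import Dict, Set

# OCR markers taken from the slide, lower-cased ("jew" also covers "jewish").
OCR_MARKERS = ("hebr", "jud", "jew", "tel-aviv", "tel aviv", "israel")


def extract_features(
    title: str,
    author: str,
    raw_ocr: str,
    keywords: Set[str],
    known_authors: Set[str],
) -> Dict[str, bool]:
    """Return one boolean per feature group for a single record.

    `keywords` and `known_authors` are expected in lower case.
    """
    title_tokens = set(title.lower().split())
    ocr_lower = raw_ocr.lower()
    return {
        "jewish_keyword": bool(title_tokens & keywords),
        "jewish_author": author.strip().lower() in known_authors,
        "ocr_element": any(marker in ocr_lower for marker in OCR_MARKERS),
    }


features = extract_features(
    title="Der jüdische Lehrer",
    author="Max Mustermann",        # hypothetical author
    raw_ocr="Frankfurt 1900 Hebr.",  # hypothetical OCR text
    keywords={"jüdische"},
    known_authors={"jenny aloni"},
)
print(features)
# {'jewish_keyword': True, 'jewish_author': False, 'ocr_element': True}
```

Plain substring matching is crude: a marker such as "jud" also matches unrelated words like "judge", so a real implementation would need stricter matching.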
Classification of Jewish Studies Titles
Features Cont’d.
Jewish Authors
Word level
Jewish Keywords
Syntax level
OCR elements
Word level
Classification of Jewish Studies Titles
Classification
Unclassified Literature Titles:
578,806 titles
Classified as Jewish Studies Titles:
22,140 titles
~4% of the titles
Independent Manually Labelled Titles:
~18,000 titles
Accuracy Measures:
Precision (positive predictive value) 0.97
Recall (probability of detection) 0.91
F1 score 0.94
Overall accuracy 89%
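As a quick consistency check of the reported figures, the F1 score is the harmonic mean of precision and recall, and the share of titles classified as Jewish studies follows from the two counts. The helper functions are illustrative; the confusion-matrix counts behind the reported values are not given on the slide.

```python
def precision(tp: int, fp: int) -> float:
    """Share of predicted positives that are correct."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Share of actual positives that were found."""
    return tp / (tp + fn)


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


print(round(f1_score(0.97, 0.91), 2))  # 0.94
print(f"{22_140 / 578_806:.1%}")       # 3.8%
```

Both results agree with the slide: an F1 score of 0.94 and roughly 4% of the titles.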
Classification of Jewish Studies Titles
Evaluation
Sources of misclassification
Misspelling
Example: “holocaust” -> “holocast”
Misleading keywords
Example: “Der polnische Jude” by “Karl Weis”
Example: “The Legend of the Wandering Jew” by “Gustave Doré”
Example: “Ueber Lord Byrons, Hebrew Melodies” by “Karl Adolf Beutler”
Requires expert knowledge:
Example: “Der Wartesaal” by “Jenny Aloni”
Example: “Our Crowd” by “Stephen Birmingham”
Classification of Jewish Studies Titles
Evaluation Cont’d.
Adapt to more languages
Keywords
Stop words
...
Spell Checker
Identify and correct the misspellings
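One simple way the planned spell checker might work is fuzzy matching of title tokens against the keyword list, sketched here with the standard library's `difflib`. The slides do not specify an approach; the function name, the vocabulary and the cutoff value are assumptions.

```python
import difflib
from typing import List


def correct_token(token: str, vocabulary: List[str], cutoff: float = 0.85) -> str:
    """Return the closest vocabulary word, or the token unchanged if none is close."""
    matches = difflib.get_close_matches(token.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


vocabulary = ["holocaust", "talmud", "synagoge"]  # hypothetical keyword list
print(correct_token("holocast", vocabulary))  # holocaust
print(correct_token("wartesaal", vocabulary))  # wartesaal
```

A high cutoff keeps unrelated words from being "corrected" into keywords, at the cost of missing heavier misspellings.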
Classification of Jewish Studies Titles
Next steps
FID Judaica
Contextualization of Digital Collections
Provide support to publish and interlink existing reference works on Jewish culture and history as Linked Open Data
Enrichment of metadata
Finding the relevant data sources
Extracting the required information
Adding to the library data collection
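To make the last enrichment step concrete, the sketch below serializes links between a library record and external entities as N-Triples, a plain-text Linked Open Data format. All URIs under example.org are hypothetical placeholders, and using `dcterms:subject` as the linking property is an assumption; the slides do not describe the actual data model.

```python
from typing import Dict, List

# Dublin Core "subject" property, used here as an assumed linking predicate.
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"


def to_ntriples(record_uri: str, links: Dict[str, List[str]]) -> List[str]:
    """Serialize {predicate: [object URIs]} for one record as N-Triples lines."""
    lines = []
    for predicate, objects in links.items():
        for obj in objects:
            lines.append(f"<{record_uri}> <{predicate}> <{obj}> .")
    return lines


triples = to_ntriples(
    "http://example.org/record/1",  # hypothetical record URI
    {DCTERMS_SUBJECT: ["http://example.org/entity/a"]},  # hypothetical entity URI
)
print("\n".join(triples))
```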
A Knowledge-base for Jewish Culture and Literature
www.judaicalink.org
Make unstructured data sources, such as online encyclopedias, available as structured data
Identify and collect relevant subsets of general-purpose knowledge bases such as DBpedia
Function as a single hub for the contextualization process
Acknowledgments
Web-based Information Systems and Services (WISS), Stuttgart Media University, Stuttgart, Germany
Prof. Dr Kai Eckert
University Library Johann Christian Senckenberg, Goethe University, Frankfurt am Main, Germany
Dr Rachel Heuberger
Annette Sasse
German Research Foundation (DFG)
Automatic NLP-based Classification of Jewish Studies Titles
Maral Dadvar
dadvar@hdm-stuttgart.de
Web-based Information Systems and Services
Stuttgart Media University
THANK YOU
ELAG2017 - 08 June 2017


Content uploaded by Maral Dadvar