Automatic NLP-based
Classification of
Jewish Studies Titles
Maral Dadvar
Web-based Information Systems and Services
Stuttgart Media University
ELAG2017 - 08 June 2017
FID Judaica
Specialized Information Service for Jewish Studies
●A collaboration:
○University Library Johann Christian Senckenberg
○Stuttgart Media University
●Aims to:
○Create a central access point
○Offer a high-performance information infrastructure
●And also:
○Contextualize the digital Judaica collections
○Enrich the metadata
○Connect different data sources as Linked Open Data
FID Judaica
Digital Collections
●Identified titles with subject code:
○Compact-memory dataset
○Freimann dataset
●Unidentified titles without subject code:
○Retro dataset:
■A pool of data with titles from a variety of subjects
■Including Jewish studies titles
A smart algorithm to automatically and accurately identify Jewish studies titles among the unidentified titles.
Classification of Jewish Studies Titles
Experimental Framework
[Diagram: Input records → Preparation (Language Detection, Stopword Removal, Noise Removal, Encoding Adaptation) → Features (OCR elements, Jewish Authors, Jewish Keywords) → Classifier → Jewish Records / Non-Jewish Records]
Input Records
Retro Dataset
●Consisting of 578,807 records
○Titles from different disciplines, including Jewish studies, from the period 1600 to 1970, without subject codes
●Available fields
○Titles
○Authors’ names and family names
○Raw OCR
○IPN, PPN
○Date
Preparation
●Language Detection
○Python langdetect package
○Supports 55 languages
○Current version handles English and German
○Hebrew titles appear in transliterated form
●Stopword Removal
○Removes words that carry little information
○or are insignificant for search queries
○Language specific
○Examples:
■English: “The Sufferings of the Jews during the Middle Ages”
■German: “Der jüdische Lehrer sein Wirken und Leben”
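As a rough illustration of the stopword-removal step, the sketch below filters a title against a small per-language stopword list. The lists and the function name `remove_stopwords` are assumptions made for this example, not the project's actual resources. The language code is taken as given here; in practice it would come from a detector such as the langdetect package named above.

```python
# Tiny illustrative stopword lists (assumed; real lists would be much longer).
STOPWORDS = {
    "en": {"the", "of", "during", "a", "an", "and", "in"},
    "de": {"der", "die", "das", "sein", "und", "ein", "eine"},
}

def remove_stopwords(title, lang):
    """Return the words of `title` that are not stopwords for `lang`."""
    stopwords = STOPWORDS.get(lang, set())  # unknown language: remove nothing
    return [word for word in title.split() if word.lower() not in stopwords]

print(remove_stopwords("The Sufferings of the Jews during the Middle Ages", "en"))
# ['Sufferings', 'Jews', 'Middle', 'Ages']
print(remove_stopwords("Der jüdische Lehrer sein Wirken und Leben", "de"))
# ['jüdische', 'Lehrer', 'Wirken', 'Leben']
```

Keeping one list per language code is what makes the step language specific: adding a language only means adding another entry to the table.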
Preparation cont’d.
●Noise Removal
○Removes unwanted spaces and characters
■Example: ‘?’, ‘!’, ‘/’, ...
●Character Encoding Harmonization
○Makes all characters machine readable
○Encodes in UTF-8
■Example: ‘é’, ‘á’, ‘í’, ‘Ü’, ‘Ö’, ...
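The two clean-up steps above could be combined roughly as follows. This is a minimal sketch using only the Python standard library; the function name `clean_title`, the exact set of noise characters, and the choice of NFC normalization are assumptions for illustration.

```python
import re
import unicodedata

def clean_title(raw):
    """Decode to text, normalize Unicode, and strip noise characters."""
    if isinstance(raw, bytes):
        # Assumption: input bytes are UTF-8; undecodable bytes are replaced.
        raw = raw.decode("utf-8", errors="replace")
    # NFC composes sequences such as "e" + combining accent into a single "é".
    text = unicodedata.normalize("NFC", raw)
    # Replace the example noise characters listed above with spaces.
    text = re.sub(r"[?!/]", " ", text)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()

print(clean_title("Der  jüdische Lehrer?! / Leben "))
# Der jüdische Lehrer Leben
```

Normalizing before matching means that an accented character typed as one code point and the same character typed as a base letter plus combining mark compare equal in later steps.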
Features
●Jewish Studies Keywords
○Stage 1:
■80 words related to Jewish studies, compiled intuitively
○Stage 2:
■Dataset with subject codes 770 and 760
■Consisting of 155,496 records
■Wider range of information compared to Retro
■Titles and keywords fields
■Term frequency–inverse document frequency (TF-IDF) value of the words
■Top 100 most informative words
■Merged with Stage 1
○Result: 143 keywords
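Stage 2 can be sketched as below. The slides do not say how the per-document TF-IDF values were aggregated into a single ranking, so summing the scores over all documents, the whitespace tokenization, and the function name `top_tfidf_terms` are all assumptions of this example.

```python
import math
from collections import Counter

def top_tfidf_terms(documents, k=100):
    """Rank terms by their TF-IDF score summed over all documents."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = Counter()
    for tokens in tokenized:
        if not tokens:
            continue
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)                   # term frequency
            idf = math.log(n_docs / doc_freq[term])    # inverse document frequency
            scores[term] += tf * idf
    return [term for term, _ in scores.most_common(k)]

# Hypothetical toy corpus; "talmud" occurs in every title, so its IDF is 0.
docs = ["talmud study", "talmud commentary", "history of talmud"]
print(top_tfidf_terms(docs, k=10)[-1])  # talmud

# Merging with a hand-made Stage 1 list is then a set union.
stage1 = {"talmud", "synagogue"}        # hypothetical Stage 1 words
keywords = stage1 | set(top_tfidf_terms(docs, k=2))
```

If the merge was a plain union like this, 143 keywords from 80 + 100 words would mean the two lists had 37 words in common.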
Features Cont’d.
●Jewish Studies Authors
○A list of 1578 authors’ names
●OCR elements
○Hebr
○Jud, Jew, Jewish
○Place of publication: Tel Aviv, Israel
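One way to turn the author list and the OCR elements into per-record features is sketched below. The author entry is a placeholder, the marker set is just the examples given above, and whole-token matching, the tokenizer, and the function names are assumptions of this sketch rather than the project's implementation.

```python
import re

# Hypothetical stand-in for the real list of 1578 author names.
JEWISH_STUDIES_AUTHORS = {"example author"}
# OCR markers taken from the examples above, lower-cased.
OCR_MARKERS = {"hebr", "jud", "jew", "jewish", "tel-aviv", "israel"}

def tokenize(text):
    """Lower-case the text and split it into word tokens (hyphens kept)."""
    return re.findall(r"[\w-]+", text.lower())

def extract_features(author, raw_ocr):
    """Return boolean features for one record."""
    return {
        "known_author": author.strip().lower() in JEWISH_STUDIES_AUTHORS,
        # Whole tokens only, so "Judo" does not count as a hit for "jud".
        "ocr_marker": any(tok in OCR_MARKERS for tok in tokenize(raw_ocr)),
    }

print(extract_features("Example Author", "Verlag, Tel-Aviv 1950"))
# {'known_author': True, 'ocr_marker': True}
```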
Classification
●Jewish Authors
○Word level
●Jewish Keywords
○Syntax level
●OCR elements
○Word level
Evaluation
●Unclassified Literature Titles:
○578,806 titles
●Classified as Jewish Studies Titles:
○22,140 titles
○~4% of the titles
●Independent Manually Labelled Titles:
○~18,000 titles
●Accuracy Measures:
○Precision (positive predictive value): 0.97
○Recall (probability of detection): 0.91
○F1 score: 0.94
○Overall accuracy: 89%
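For reference, the measures above relate to each other as in the following sketch. The function names are illustrative, and the counts in the last call are made up to show the arithmetic; only the 0.97 and 0.91 values come from the slides.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)   # share of flagged titles that are correct
    recall = tp / (tp + fn)      # share of relevant titles that were flagged
    return precision, recall

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# The reported precision and recall reproduce the reported F1 score.
print(round(f1_score(0.97, 0.91), 2))  # 0.94

# Hypothetical counts, for illustration only.
p, r = precision_recall(tp=8, fp=2, fn=2)
```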
Evaluation Cont’d.
●Sources of misclassification
○Misspelling
■Example: “holocaust” -> “holocast”
○Misleading keywords
■Example: “Der polnische Jude” by “Karl Weis”
■Example: “The Legend of the Wandering Jew” by “Gustave Doré”
■Example: “Ueber Lord Byrons, Hebrew Melodies” by “Karl Adolf Beutler”
○Requires expert knowledge:
■Example: “Der Wartesaal” by “Jenny Aloni”
■Example: “Our Crowd” by “Stephen Birmingham”
Next steps
●Adapt to more languages
○Keywords
○Stop words
○...
●Spell Checker
○Identify and correct the misspellings
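A simple form of the planned spell checking could map each title word to its closest keyword before matching. The sketch below uses `difflib.get_close_matches` from the Python standard library; the keyword list, the 0.8 similarity cutoff, and the function name `correct_word` are assumptions for illustration.

```python
import difflib

# Hypothetical excerpt of the keyword list.
KEYWORDS = ["holocaust", "talmud", "synagogue"]

def correct_word(word, vocabulary=KEYWORDS, cutoff=0.8):
    """Return the closest vocabulary entry, or the word itself if none is close enough."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(correct_word("holocast"))  # holocaust
print(correct_word("xyz"))       # xyz
```

A high cutoff keeps unrelated words untouched, at the price of missing heavier misspellings; the right value would have to be tuned on the manually labelled titles.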
FID Judaica
Contextualization of Digital Collections
●Provide support to publish and interlink existing reference works on Jewish culture and history as Linked Open Data
●Enrichment of metadata
○Finding the relevant data sources
○Extracting the required information
○Adding it to the library data collection
A Knowledge-base for Jewish Culture and Literature
www.judaicalink.org
●Make unstructured data sources, such as online encyclopedias, available as structured data
●Identify and collect relevant subsets of general-purpose knowledge bases such as DBpedia
●Function as a single hub for the contextualization process
Acknowledgments
●Web-based Information Systems and Services (WISS), Stuttgart Media University, Stuttgart, Germany
○Prof. Dr Kai Eckert
●University Library Johann Christian Senckenberg, Goethe University, Frankfurt am Main, Germany
○Dr Rachel Heuberger
○Annette Sasse
●German Research Foundation (DFG)
Maral Dadvar
dadvar@hdm-stuttgart.de
THANK YOU
ELAG2017 - 08 June 2017