A Workflow for Creating Publication Databases from Scratch
Jenny Oltersdorf1, Asja Mironenko2 and Jochen Gläser3
firstname.lastname@example.org, TU Berlin, HBS 7, Hardenbergstr.16-18, 10623 Berlin (Germany)
email@example.com, TU Berlin, HBS 7, Hardenbergstr.16-18, 10623 Berlin (Germany)
firstname.lastname@example.org, TU Berlin, HBS 7, Hardenbergstr.16-18, 10623 Berlin (Germany)
In many cases, the utilization of bibliometric methods for solving research problems or
practical problems depends on building a dedicated database because commercial databases
such as the Web of Science (WoS) or Scopus do not sufficiently cover the literature of interest.
In particular, researchers studying the social sciences and humanities often base their
analyses on dedicated databases which they created manually (e.g. Ardanuy et al. 2009,
Colavizza et al. 2018). In this extended poster abstract, we report part of our workflow for
building such a database, which ideally should contain all German publications from a
humanities field (Art History) and a social science field (International Relations) in a specific
period. We will use the database to study the communication behaviour of the two fields
through analyses of citation networks and citation contexts (Gläser and Oltersdorf 2019).
Therefore, our dedicated database must include the publications’ references and full texts (or
at least the text surrounding the citations). The overall workflow for creating the database is
based on the following algorithm (Figure 1):
1) Construction of a seed data set from publication lists of university academics.
2) Expansion of seed data by
   a) including publications from the target period that cite seed publications, and
   b) including publications from the target period that are cited by seed publications
   until saturation is reached.
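The expansion loop of this algorithm can be sketched in Python as follows. This is a minimal illustration, not the project's implementation: the function names, the callbacks for looking up citing and cited publications, the saturation threshold `min_new`, and the toy citation graph are all assumptions made for the example. It also assumes that each round expands only from the publications added in the previous round.

```python
from typing import Callable, Iterable, Set


def expand_seed(
    seed: Iterable[str],
    get_citing: Callable[[str], Set[str]],
    get_cited: Callable[[str], Set[str]],
    in_period: Callable[[str], bool],
    min_new: int = 1,
    max_rounds: int = 50,
) -> Set[str]:
    """Grow a seed set along citation links until saturation."""
    corpus = set(seed)
    frontier = set(seed)
    for _ in range(max_rounds):
        candidates: Set[str] = set()
        for pub in frontier:
            candidates |= get_citing(pub)  # publications that cite `pub`
            candidates |= get_cited(pub)   # publications cited by `pub`
        # Keep only publications from the target period that are new.
        new = {p for p in candidates if in_period(p)} - corpus
        if len(new) < min_new:  # saturation: too few additions
            break
        corpus |= new
        frontier = new
    return corpus


# Toy citation graph (invented): publication -> publications it cites.
CITES = {"A": {"B"}, "B": {"C"}, "C": set(), "D": {"A"}, "E": {"Z"}}
YEARS = {"A": 2015, "B": 2012, "C": 1990, "D": 2018, "E": 2016, "Z": 2011}


def toy_citing(pub: str) -> Set[str]:
    return {p for p, refs in CITES.items() if pub in refs}


def toy_cited(pub: str) -> Set[str]:
    return set(CITES.get(pub, set()))


def toy_in_period(pub: str) -> bool:
    return 2010 <= YEARS[pub] <= 2020
```

In the toy graph, starting from seed publication "A" adds "B" (cited by the seed) and "D" (citing the seed); "C" is reached but discarded because it lies outside the assumed period 2010 to 2020, and the loop stops because nothing new was added.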
Oltersdorf, J., A. Mironenko and J. Gläser (2020). A Workflow for Creating Publication Databases from Scratch. Lockdown
Bibliometrics: Papers not submitted to the STI Conference 2020 in Aarhus. J. Gläser. Berlin: 37-50.
Figure 1: Workflow for creating the database
Workflow for creating the seed data set
In our extended poster abstract and poster we describe the first step, i.e. the construction of
a seed data set, for the field of International Relations. The workflow includes the following
steps (Figure 2):
[Figure 1 box labels: Construct seed data set for period [t1, t2]; Include publications from [t1, t2] cited by seed; Include publications from [t1, t2] citing the seed; Check saturation (number of publications added in previous two steps)]
Figure 2: Detailed workflow for creating the seed data set
1. Field delineation
Since the common approach to delineating a field by selecting publications does not work if
no publication database with sufficient coverage is available, we first established criteria for
the delineation of the two fields by consulting scholars in the field and library specialists, and
by reading introductory texts. We decided to start from individuals and their affiliation to
universities. For the field ‘International Relations’ (IR) we identified scholars who 1) have at
least a master's degree and 2) are affiliated to a German university that 3) maintains a
department with a research focus on IR. We disregarded nationality, country of graduation,
and publication languages. This means that for creating the seed data set, we considered
German IR to be represented by scholars who are currently located at German universities.
This decision could be challenged and merits further discussion. A closer look reveals that
delineating a national sub-community is by no means trivial because nationality, publication
language and geographic location do not coincide. In the case of Art History, the complexities
are illustrated by the question whether German and non-German scholars working at the
German Max-Planck-Institute ‘Kunsthistorisches Institut in Florenz’ in Florence or the German
Max-Planck-Institute ‘The Bibliotheca Hertziana’ in Rome, who publish not only in German but
also in other languages, should be considered as members of the German national Art History
community.
2. Harvest publication lists from websites
We visited the university websites of all identified researchers in IR and used a Python script
to convert publication lists on websites into PDF documents. The automated conversion of
publication lists was sometimes hampered by complicated website architectures (see Figure
3), access restrictions (see Figure 4), and other structural properties of websites. A total of 17%
of the websites had to be converted manually. About 22% of researchers do not provide
publication lists on their university websites, or their listed publications are not in the period of
interest. For these researchers, we conducted a search in Scopus and Microsoft Academic (MA)
(see step 5) in order to include at least some of their publications. Further publications from
these 22% of authors will be added with the expansion of the seed data set (see Figure 1).
All scripts mentioned in this poster abstract were written by Asja Mironenko, and can be accessed on GitHub
Figure 3: Example of a website that defied automated conversion (problem: website
Figure 4: Example of a website that defied automated conversion (problem: access only with
3. Extract and filter bibliographic records
We extracted bibliographic records together with information about publication types from
the PDF documents with a second Python script. The script was successful on 70% of the PDF
documents. We assessed the extraction quality of the script by precision (0.996) and recall
(0.957), calculated on 403 records of 25 authors. Recall is the percentage of records that the
script identified. Precision is the percentage of extracted records that are correct, i.e. that
belong to the period of interest and do not contain other text from the website such as
biographical information or information on teaching.
For the 30% of publication lists that needed manual adjustment, the manual processing of a
publication list took 6 minutes on average. Foreign-language titles increased the processing
time.
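The precision and recall indicators used in step 3 can be computed as in the following sketch. It uses the standard definitions and assumes that a manually checked list of correct records (here called `gold`) exists for each author; the record identifiers are invented, and the exact counting rules behind the reported values are not reproduced here.

```python
from typing import Iterable, Tuple


def precision_recall(extracted: Iterable[str], gold: Iterable[str]) -> Tuple[float, float]:
    """Return (precision, recall) of extracted records against a gold list."""
    extracted_set, gold_set = set(extracted), set(gold)
    true_positives = len(extracted_set & gold_set)
    # Precision: share of extracted records that are correct.
    precision = true_positives / len(extracted_set) if extracted_set else 0.0
    # Recall: share of correct records that were extracted.
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


# Toy example: three of four extracted records are correct,
# and three of the five correct records were found.
toy_extracted = ["r1", "r2", "r3", "biography-text"]
toy_gold = ["r1", "r2", "r3", "r4", "r5"]
toy_precision, toy_recall = precision_recall(toy_extracted, toy_gold)
```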
4. Tokenise bibliographic records
The result of step 3 was a spreadsheet for each publication list that included bibliographic
records as strings. We then tokenised the records, i.e. we split them into their bibliographic
elements (author, title, year of publication, etc.). For this tokenisation we prepared the data
by applying a third Python script, which removed irrelevant characters and built a consistent
structure. The cleaned strings were processed using AnyStyle. AnyStyle is an open source
parser for academic references which uses heuristics based on Conditional Random Fields.
We assessed the performance of AnyStyle on the bibliographic elements author, title, and year
by creating a scoring scheme. Each field was scored individually at 1 if the extraction was
correct, at 0.5 if the element included additional characters, at 0.25 if characters were missing,
and at 0 if the element could not be detected at all. The average scores for 100 randomly
selected bibliographic records using AnyStyle ‘as is’ were 0.83 for author, 0.71 for title and 0.55
for year. After adjustment of AnyStyle’s code, the result for the year improved to 0.79. Even on
the basis of partially incorrectly tokenised records, a matching algorithm could be applied (see
step 6).
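The scoring scheme can be illustrated with a short Python sketch. It is an approximation under stated assumptions: the abstract does not say how the categories were assigned, so simple string containment stands in for that judgement here, and an extraction that neither contains nor is contained in the correct value is treated as not detected. The function name and the toy values are hypothetical.

```python
from statistics import mean
from typing import Optional


def score_field(extracted: Optional[str], truth: str) -> float:
    """Score one tokenised element against its correct value."""
    if extracted is None or not extracted.strip():
        return 0.0  # element not detected at all
    got, want = extracted.strip(), truth.strip()
    if got == want:
        return 1.0  # correct extraction
    if want in got:
        return 0.5  # correct value plus additional characters
    if got in want:
        return 0.25  # characters are missing
    return 0.0  # assumption: anything else counts as not detected


# Toy author fields (invented) scored against the correct value "Doe, J."
toy_scores = [
    score_field("Doe, J.", "Doe, J."),
    score_field("Doe, J. (2015)", "Doe, J."),
    score_field("Doe", "Doe, J."),
    score_field("", "Doe, J."),
]
toy_average = mean(toy_scores)
```

Averaging such scores per element over a sample of records gives one quality value each for author, title, and year.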
5. Conduct an author search in Scopus and Microsoft Academic
The two databases have the enormous advantage of providing tokenised bibliographic
records, which in the case of Scopus include the references. This is why obtaining records from
these databases for as many publications as possible is the most efficient approach. Retrieval
of author names was conducted in the format ‘lastname, initial of first name’ in the databases
Scopus and Microsoft Academic (MA). We considered several name spelling alternatives if
author names included umlauts or “ß”. The umlauts ä, ö and ü were replaced by ae and a,
oe and o, and ue and u, respectively, because English-language journals and books use both
versions of transliteration. The letter “ß” was replaced by “s” and “ss”. Thus, the German name
Müßer would have to be searched for in the four variants Muser, Musser, Mueser, and Muesser.
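The generation of spelling variants can be sketched as follows. The replacement table follows the rules described above; the function name and the handling of upper-case umlauts are assumptions added for the example.

```python
from itertools import product
from typing import List

# Each special character maps to both transliterations used in the literature.
REPLACEMENTS = {
    "ä": ("ae", "a"), "ö": ("oe", "o"), "ü": ("ue", "u"),
    "Ä": ("Ae", "A"), "Ö": ("Oe", "O"), "Ü": ("Ue", "U"),
    "ß": ("ss", "s"),
}


def name_variants(name: str) -> List[str]:
    """Return all transliterated spellings of a name, sorted alphabetically."""
    # For every character, list its possible replacements (or itself).
    options = [REPLACEMENTS.get(ch, (ch,)) for ch in name]
    # Combine the options of all characters in every possible way.
    return sorted({"".join(parts) for parts in product(*options)})


variants = name_variants("Müßer")
# variants == ['Mueser', 'Muesser', 'Muser', 'Musser']
```

A name with n special characters yields 2^n variants, so each additional umlaut doubles the number of author searches.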
A significant problem of the databases is homonyms. For authors who had publication lists on
their websites, we eliminated homonyms by matching database entries with the tokenised
records from harvested publication lists. If the results from the databases matched with a
tokenised record, we considered the author as relevant and saved information on subject
categories.
For the 22% of authors that had no publication list on their university websites, we only
searched by author names and reduced the likelihood of homonyms by limiting results to the
subject categories we derived from successfully matched records (see below, 6.).
Publications in MA are tagged “[…] with fields of study, or topics, using artificial intelligence
and semantic understanding of content. Topics are organized in a non-mutually exclusive
hierarchy with 19 top-level fields of study.” (Microsoft 2020).
We searched for author names
and filtered the results based on the most often used fields of study.
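Filtering name-search results by the most frequent fields of study might look like the following sketch. The record layout (a dictionary with a `fields` list), the function names, and the cut-off `n` are assumptions for illustration; the example data are invented and do not reproduce MA's actual field labels or API.

```python
from collections import Counter
from typing import Dict, List, Set


def top_fields(matched_records: List[Dict], n: int = 3) -> Set[str]:
    """Return the n fields of study used most often in matched records."""
    counts = Counter(f for rec in matched_records for f in rec["fields"])
    return {field for field, _ in counts.most_common(n)}


def filter_by_fields(candidates: List[Dict], allowed: Set[str]) -> List[Dict]:
    """Keep candidates that carry at least one of the allowed fields."""
    return [rec for rec in candidates if allowed & set(rec["fields"])]


# Toy data: fields taken from records that were matched to publication lists.
toy_matched = [
    {"fields": ["Political science", "Sociology"]},
    {"fields": ["Political science"]},
    {"fields": ["Law"]},
]
# Toy results of a name-only search, possibly including homonyms.
toy_candidates = [
    {"title": "Toy paper 1", "fields": ["Political science"]},
    {"title": "Toy paper 2", "fields": ["Chemistry"]},
]
toy_allowed = top_fields(toy_matched, n=1)
toy_kept = filter_by_fields(toy_candidates, toy_allowed)
```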
6. Match Scopus and MA records with tokenized records from publication lists
Matching was done by a fourth Python script, which utilises author names, publication titles,
and publication years in an implementation of the Ratcliff/Obershelp pattern recognition
algorithm (Black 2004). The matching procedure identified 264 authors in Scopus. In addition,
47 authors with publication lists on their websites but no title match in Scopus and 29
authors without publication lists on their websites could be identified by an author-name
search that was limited to relevant subject categories in order to exclude homonyms. Relevant
subject categories were derived from the 264 successfully matched items and include
categories like ‘Political Science and International Relations’, ‘Social Sciences (all)’ or
‘Sociology and Political Science’. All in all, 340 or 57% of the 594 researchers are covered by
Scopus (see Figure 6).
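A matching step of this kind can be sketched with Python's standard library. `difflib.SequenceMatcher` implements a close relative of the Ratcliff/Obershelp algorithm, so it is used here as a stand-in; the thresholds, the record layout, and the rule that a missing year does not block a match are assumptions for illustration and not the project's actual script.

```python
from difflib import SequenceMatcher
from typing import Dict


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two strings, ignoring case and padding."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def is_match(
    list_rec: Dict,
    db_rec: Dict,
    title_threshold: float = 0.85,
    author_threshold: float = 0.8,
) -> bool:
    """Decide whether a tokenised list record and a database record agree."""
    # A missing year is tolerated because year was the weakest tokenised element.
    years_known = list_rec.get("year") is not None and db_rec.get("year") is not None
    if years_known and list_rec["year"] != db_rec["year"]:
        return False
    return (
        similarity(list_rec["title"], db_rec["title"]) >= title_threshold
        and similarity(list_rec["author"], db_rec["author"]) >= author_threshold
    )


# Toy records (invented) with small formatting differences.
toy_list_rec = {"author": "Doe, J", "title": "An Example Title on Foreign Policy", "year": 2015}
toy_db_rec = {"author": "Doe, J.", "title": "an example title on foreign policy.", "year": 2015}
toy_other = {"author": "Doe, J.", "title": "Gothic cathedrals of France", "year": 2015}
```

Using a similarity threshold rather than exact equality is what allows matching even when the tokenised records contain stray or missing characters.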
We used the same procedure in MA and identified 344 authors through matching, as well as
another 26 authors who provide a publication list and 48 authors who do not provide one through
a name search that was limited to relevant subject categories. In this process, 418 or 70% of
the researchers were identified in MA (see Figure 7).
The overlap of authors indexed in Scopus and in MA can be found in Figure 8. Interestingly, the
matching rate with MA varies considerably between publication types. We were able to match
70% of tokenised records for monographs, 58% of journal articles and 21% of working papers.
Further information about the modeling process of fields of study can be found online
Figure 6: Publication list - Scopus overlap
Figure 7: Publication list - MA overlap
Figure 8: Scopus - MA overlap
7. Save records and references in csv format
From Scopus we saved 1293 records and references from 340 authors for further processing.
In addition, we saved 2182 records from MA that are not in Scopus. In the case of MA, only
records were saved because reference information is provided only in the form of links to other
items indexed in MA. Relying on these links, we would have missed bibliographic information
about MA’s non-source items.
8. Add records by Zotero Browser Plugin
Those records that were only on the publication lists and could not be retrieved from the
databases were stored in Zotero using the Zotero browser plugin (Zotero Connector). Zotero is
an open source reference management system. The Zotero Connector imports bibliographic meta
data from, e.g., library catalogues or websites to the personal Zotero library. From there, data
can be exported in several formats for further processing. The quality of imported data
depends heavily on the sources used and varies among library catalogues, repositories,
and websites. Therefore, all entries had to be checked manually and corrected if necessary.
9. Harvest full texts
Harvesting full texts was necessary for two reasons. First, many publications are not covered
by Scopus, which also provides references. For these publications, references must be
extracted from full texts. Secondly, full texts (or at least the citation contexts, which need to
be extracted from full texts) are needed for the planned citation context analysis.
We combined several approaches to harvest full texts:
- We utilised information from publication lists on websites. Some researchers added links to
full texts of at least some of their publications.
- We used the DOI, which is provided by MA and Scopus, to search for full texts.
- We placed national and international library loan requests to get hold of printed
publications. We digitised the relevant parts (full texts of articles and chapters, references
and citation contexts for books) for further processing.
- In a final step, we circulated letters to all researchers. We introduced the project and
announced a follow up email. After two weeks, we sent them emails with the literature lists
we created and asked if they would check for completeness, enhance the list if necessary,
and provide us with the missing full texts that were labeled in the list. None of the scientists
raised any objections to the procedure. So far, about 40% of them have supported our project.
10. Parse references from full texts
Automatic reference detection in social sciences and humanities (SSH) publications still
constitutes a major challenge. Most tools that extract references are based on referencing
patterns of publications from the sciences and ignore the respective information for SSH
publications. Furthermore, these tools cannot be applied directly to images, i.e. scanned
documents. SSH publications that are not available in digital form must be scanned and the
images must be processed with OCR software. This creates an additional source of errors
which has a direct effect on the performance of the tools. To make things worse, the
application of automated tools for reference extraction requires the existence of standardised
reference sections at the end of a publication. In some SSH publications, however, full
references are given in footnotes or side notes, and no reference section exists at the end of
the publication.
For these reasons, none of the available tools met our requirements out of the box. A
combination of tools with text-based and layout-based approaches in an iterative procedure
that included manual adjustment turned out to be the best solution to extract references of
publications in the field of International Relations.
First tests indicated that the problems multiply for the field of Art History. The publication
behaviour of German Art History is still dominated by print publications. There is no consistent
pattern of the positioning or phrasing of references across publications. For example,
references may occur at the end of a publication in a separate reference list, in endnotes,
footnotes, side notes, or insertions in the text. Bibliographic meta data in footnotes or
endnotes are embedded into text (see Figure 9) or complex abbreviations are used to refer to
other sources (see Figure 10). The current state of our investigations suggests that references
cannot be extracted from publications in German Art History by any of the state-of-the-art
tools such as CERMINE, GROBID, and ParsCit on its own. A combination of tools and
additional manual adjustments is needed. Another promising approach appears to be the
layout-based approach to extract references from images that is based on complex neural
networks (Rizvi et al., 2019).
Figure 9: Reference section from journal ‘International Relations’ – A reference section where
bibliographic information is first given as full bibliographic record and subsequently either as
short titles or in the form “Author (year)” without links to the full reference.
Figure 10: Reference section from 'Allgemeines Künstlerlexikon' - Complex abbreviations that
require external sources for decoding
11. Save records and references in csv format
Records and references were saved in csv format for further processing.
12. Save full texts and link them to bibliographic records
To organise our data, we assigned identifiers to each author and each publication. Full texts
were labelled with the publication ID and saved in separate folders for each author on an
external hard disk.
13. Save bibliographic records and meta data of references in SQL database
We set up an offline SQL database for further analysis in the project.
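A database of this kind could be laid out as in the following sketch. The abstract does not describe the schema or the SQL engine, so everything here is hypothetical: SQLite is assumed because it works offline, and the table and column names are invented to mirror the author IDs, publication IDs, full-text links, and references mentioned in the previous steps.

```python
import sqlite3

# Hypothetical schema: authors, publications, their link, and references.
SCHEMA = """
CREATE TABLE authors (
    author_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE publications (
    pub_id        INTEGER PRIMARY KEY,
    title         TEXT NOT NULL,
    year          INTEGER,
    source        TEXT,  -- e.g. 'Scopus', 'MA' or 'Zotero'
    fulltext_path TEXT   -- location of the stored full text, if any
);
CREATE TABLE authorship (
    author_id INTEGER NOT NULL REFERENCES authors(author_id),
    pub_id    INTEGER NOT NULL REFERENCES publications(pub_id),
    PRIMARY KEY (author_id, pub_id)
);
CREATE TABLE cited_refs (
    ref_id     INTEGER PRIMARY KEY,
    citing_pub INTEGER NOT NULL REFERENCES publications(pub_id),
    raw_string TEXT NOT NULL,  -- reference as extracted
    cited_pub  INTEGER REFERENCES publications(pub_id)  -- NULL until resolved
);
"""


def create_database(path: str = ":memory:") -> sqlite3.Connection:
    """Create the tables and return an open connection."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the REFERENCES clauses
    conn.executescript(SCHEMA)
    return conn
```

Keeping the raw reference string next to an optional link to a publication record allows references to be stored before they have been matched to items in the database.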
With this paper, we present for discussion a workflow for a type of project that has been
occurring repeatedly in bibliometric research. Many theoretically interesting and politically
important problems cannot be studied with commercial databases due to the latter’s
exclusion of publications from the Global South, from the Social Sciences and Humanities,
and in languages other than English. So far, each scholar appears to have wrestled in isolation
with the many practical problems involved in creating databases from scratch. We should
discuss these practical problems in order to create a more efficient approach.
This research was supported by the German Federal Ministry of Education and Research
(Grant 01PU17022). As our workflow demonstrates, building a database such as ours requires
tireless and very precise manual labour, for which we thank our student assistants Elaheh
Sadat Ahmadi, Liesa Houben, and Lisa Jura.
Ardanuy, J., C. Urbano and L. Quintana (2009). A citation analysis of Catalan literary studies (1974–
2003): Towards a bibliometrics of humanities studies in minority languages. Scientometrics
Colavizza, G., M. Romanello and F. Kaplan (2018). The references of references: a method to enrich
humanities library catalogs with citation data. International Journal on Digital Libraries
Gläser, J. and J. Oltersdorf (2019). Persistent Problems for a Bibliometrics of Social Sciences and
Humanities and How to Overcome Them. Proceedings of the 17th International Conference on
Scientometrics & Informetrics, Rome, September 2-5, 2019: 1056-1567.
Black, P. E. (2004). Ratcliff/Obershelp pattern recognition. In: P. E. Black (ed.), Dictionary of
Algorithms and Data Structures [online]. 17 December 2004. (accessed TODAY) Available from:
Microsoft (2020). Microsoft Academic. Accessed 25 August 2020.
Rizvi, S. T. R., A. Lucieri, A. Dengel and S. Ahmed (2019). Benchmarking Object Detection Networks for
Image Based Reference Detection in Document Images. 2019 Digital Image Computing:
Techniques and Applications (DICTA), 1–8. https://doi.org/10.1109/DICTA47822.2019.8945991
Lockdown Bibliometrics: Papers not submitted to the STI
Conference 2020 in Aarhus
Sociology of Science Discussion Papers
SoS Discussion Paper 2/2020
10 September 2020
Copyright remains with the authors.
This discussion paper serves to disseminate the research results of work in progress
prior to publication to encourage the exchange of ideas and academic debate. The
inclusion of a paper in the discussion paper series does not constitute publication
and should not limit publication in other venues.
Editor: Jochen Gläser
Social Studies of Science and Technology