
Giovanni Colavizza
- PhD
- Assistant Professor at the University of Amsterdam
About
- 104 publications
- 36,021 reads
- 1,646 citations
Publications (104)
Data sharing is fundamental to scientific progress, enhancing transparency, reproducibility, and innovation across disciplines. Despite its growing significance, the variability of data-sharing practices across research fields remains insufficiently understood, limiting the development of effective policies and infrastructure. This study investigat...
Author Name Disambiguation (AND) is a critical task for digital libraries aiming to link existing authors with their respective publications. Due to the lack of persistent identifiers used by researchers and the presence of intrinsic linguistic challenges, such as homonymy, the development of Deep Learning algorithms to address this issue has becom...
Calls to make scientific research more open have gained traction with a range of societal stakeholders. Open Science practices include but are not limited to the early sharing of results via preprints and openly sharing outputs such as data and code to make research more reproducible and extensible. Existing evidence shows that adopting Open Scienc...
Wikipedia is a well-known platform for disseminating knowledge, and scientific sources, such as journal articles, play a critical role in supporting its mission. The open access movement aims to make scientific knowledge openly available, and we might intuitively expect open access to help further Wikipedia’s mission. However, the extent of this re...
Wikipedia is an essential component of the open science ecosystem, yet it is poorly integrated with academic open science initiatives. Wikipedia Citations is a project that focuses on extracting and releasing comprehensive datasets of citations from Wikipedia. A total of 29.3 million citations were extracted from English Wikipedia in May 2020. Foll...
With the recognition of content discoverability and information analytics in the scientific publishing industry, more and more effort is dedicated to digitization and automated analysis of scholarly publications. As part of this effort, the authors designed a pipeline to extract structured information from bibliography and index lists of existing s...
Purpose
Wikipedia's inclusive editorial policy permits unrestricted participation, enabling individuals to contribute and disseminate their expertise while drawing upon a multitude of external sources. News media outlets constitute nearly one-third of all citations within Wikipedia. However, embracing such a radically open approach also poses the c...
Sparked by issues of quality and lack of proper documentation for datasets, the machine learning community has begun developing standardised processes for establishing datasheets for machine learning datasets, with the intent to provide context and information on provenance, purposes, composition, the collection process, recommended uses or societa...
Wikipedia is a well-known platform for disseminating knowledge, and scientific sources, such as journal articles, play a critical role in supporting its mission. The open access movement aims to make scientific knowledge openly available, and we might intuitively expect open access to help further Wikipedia's mission. However, the extent of this re...
The article presents an open educational resource (OER) to introduce humanities students to data analysis with Python. The article begins by positioning the OER within wider pedagogical debates in the digital humanities. The OER is built from our research encounters and committed to computational thinking rather than technicalities. Furthermore,...
The digital transformation of the scientific publishing industry has led to dramatic improvements in content discoverability and information analytics. Unfortunately, these improvements have not been uniform across research areas. The scientific literature in the arts, humanities and social sciences (AHSS) still lags behind, in part due to the scal...
We propose a decentralised “local2global” approach to graph representation learning that can be used a priori to scale any embedding technique. Our local2global approach proceeds by first dividing the input graph into overlapping subgraphs (or “patches”) and training local representations for each patch independently. In a second step, we combine...
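The two-step structure described above lends itself to a compact illustration. Below is a minimal sketch of the general idea, not the authors' reference implementation: each patch is embedded independently (here with a toy SVD of its adjacency matrix), and the patches are then stitched together by aligning them on the nodes they share via orthogonal Procrustes. The graph, the patch split, and the embedding choice are all illustrative assumptions.

```python
# Illustrative sketch of the local2global idea, not the paper's reference
# implementation: embed overlapping patches independently, then align them
# into one coordinate system using the nodes the patches share.
import numpy as np
import networkx as nx
from scipy.linalg import orthogonal_procrustes
from sklearn.decomposition import TruncatedSVD

def embed_patch(graph, nodes, dim=2):
    """Step 1: embed one patch on its own (toy choice: adjacency SVD)."""
    A = nx.to_numpy_array(graph.subgraph(nodes), nodelist=nodes)
    return TruncatedSVD(n_components=dim).fit_transform(A)

def align(ref_emb, patch_emb, ref_idx, patch_idx):
    """Step 2: rotate a patch onto the reference via shared nodes.
    (Rotation only; the full method also handles translation and scale.)"""
    R, _ = orthogonal_procrustes(patch_emb[patch_idx], ref_emb[ref_idx])
    return patch_emb @ R

G = nx.karate_club_graph()
nodes = list(G.nodes())
patch_a, patch_b = nodes[:22], nodes[12:]        # overlap: nodes 12..21
emb_a = embed_patch(G, patch_a)
emb_b = embed_patch(G, patch_b)
shared = [n for n in patch_a if n in patch_b]
emb_b_global = align(emb_a, emb_b,
                     [patch_a.index(n) for n in shared],
                     [patch_b.index(n) for n in shared])
```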
Purpose
This paper aims to expand the scope and mitigate the biases of extant archival indexes.
Design/methodology/approach
The authors use automatic entity recognition on the archives of the Dutch East India Company to extract mentions of underrepresented people.
Findings
The authors release an annotated corpus and baselines for a shared task an...
Non-Fungible Tokens (NFTs) have recently surged to mainstream attention by allowing the exchange of digital assets via blockchains. NFTs have also been adopted by artists to sell digital art. One of the promises of NFTs is broadening participation to the art market, a traditionally closed and opaque system, to sustain a wider and more diverse set o...
Wikipedia is the largest online encyclopedia: its open contribution policy allows everyone to edit and share their knowledge. A challenge of radical openness is that it facilitates introducing biased contents or perspectives in Wikipedia. Wikipedia relies on numerous external sources such as journal articles, books, news media, and more. News media...
Non-Fungible Tokens (NFTs) have recently surged to mainstream attention by allowing the exchange of digital assets via blockchains. NFTs have also been adopted by artists to sell digital art. One of the promises of NFTs is broadening participation to the arts market, a traditionally closed and opaque system, to sustain a wider and more diverse set...
Colonial archives are at the center of increased interest from a variety of perspectives, as they contain traces of historically marginalized people. Unfortunately, like most archives, they remain difficult to access due to significant persisting barriers. We focus here on one of them: the biases to be found in historical finding aids, such as ind...
We present a brief review of literature related to blogs and news sites; our focus is on publications related to COVID-19. We primarily focus on the role of blogs and news sites in disseminating research on COVID-19 to the wider public, that is knowledge transfer channels. The review is for researchers and practitioners in scholarly communication a...
Citation indexes are by now part of the research infrastructure in use by most scientists: a necessary tool in order to cope with the increasing amounts of scientific literature being published. Commercial citation indexes are designed for the sciences and have uneven coverage and unsatisfactory characteristics for humanities scholars, while no com...
This report has been prepared by the “Bibliographical Data” Working Group of the DARIAH-ERIC consortium, which develops public digital research infrastructure for the arts and humanities. The Group consists of more than 30 members from 15 countries, most of whom are researchers and curators in the public sector who are engaged in bibliographical da...
This study presents the results of an experiment we performed to measure the coverage of Digital Humanities (DH) publications in mainstream open and proprietary bibliographic data sources, by further highlighting the relations among DH and other disciplines. We created a list of DH journals based on manual curation and bibliometric data. We used th...
The digital transformation of the scientific publishing industry has led to dramatic improvements in content discoverability and information analytics. Unfortunately, these improvements have not been uniform across research areas. The scientific literature in the arts, humanities and social sciences (AHSS) still lags behind, in part due to the scal...
Neural language models are the backbone of modern-day natural language processing applications. Their use on textual heritage collections which have undergone Optical Character Recognition (OCR) is therefore also increasing. Nevertheless, our understanding of the impact OCR noise could have on language models is still limited. We perform an assessm...
We propose a decentralised "local2global" approach to graph representation learning that can be used a priori to scale any embedding technique. Our local2global approach proceeds by first dividing the input graph into overlapping subgraphs (or "patches") and training local representations for each patch independently. In a second step, we combine...
The digital transformation is turning archives, both old and new, into data. As a consequence, automation in the form of artificial intelligence techniques is increasingly applied both to scale traditional recordkeeping activities, and to experiment with novel ways to capture, organise, and access records. We survey recent developments at the inter...
In recent decades, the rapid growth of Internet adoption is offering opportunities for convenient and inexpensive access to scientific information. Wikipedia, one of the largest encyclopedias worldwide, has become a reference in this respect, and has attracted widespread attention from scholars. However, a clear understanding of the scientific sour...
Citation indexes are by now part of the research infrastructure in use by most scientists: a necessary tool in order to cope with the increasing amounts of scientific literature being published. Commercial citation indexes are designed for the sciences and have uneven coverage and unsatisfactory characteristics for humanities scholars, while no com...
We present four types of neural language models trained on a large historical dataset of books in English, published between 1760 and 1900, and comprised of ≈5.1 billion tokens. The language model architectures include word type embeddings (word2vec and fastText) and contextualized models (BERT and Flair). For each architecture, we trained a model...
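For a flavor of what training one of these architectures involves, here is a minimal, hedged sketch using gensim's word2vec; the two-sentence corpus is a toy stand-in for the ≈5.1 billion-token book collection, and all hyperparameters are illustrative rather than those used in the study.

```python
# Toy sketch of training a word2vec instance (one of the static
# architectures named above) with gensim; the corpus and hyperparameters
# stand in for the real historical book collection and training setup.
from gensim.models import Word2Vec

corpus = [
    ["the", "steam", "engine", "drives", "the", "mill"],
    ["the", "mill", "grinds", "corn", "by", "steam", "power"],
]
model = Word2Vec(sentences=corpus, vector_size=50, window=3,
                 min_count=1, sg=1, epochs=50, seed=42)
print(model.wv.most_similar("steam", topn=2))
```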
Purpose. This study presents the results of an experiment we performed to measure the coverage of Digital Humanities (DH) publications in mainstream open and proprietary bibliographic data sources, by also highlighting the relation that exists between DH and other disciplines. Methodology. We created a list of DH journals based on manual curation a...
We propose a decentralised "local2global" approach to graph representation learning that can be used a priori to scale any embedding technique. Our local2global approach proceeds by first dividing the input graph into overlapping subgraphs (or "patches") and training local representations for each patch independently. In a second step, we combine...
COVID-19 is having a dramatic impact on research and researchers. The pandemic has underlined the severity of known challenges in research and surfaced new ones, but also accelerated the adoption of innovations and manifested new opportunities. This review considers early trends emerging from meta-research on COVID-19. In particular, it focuses on...
We present four types of neural language models trained on a large historical dataset of books in English, published between 1760 and 1900 and comprised of ~5.1 billion tokens. The language model architectures include static (word2vec and fastText) and contextualized models (BERT and Flair). For each architecture, we trained a model instance using the...
The digital transformation is turning archives, both old and new, into data. As a consequence, automation in the form of artificial intelligence techniques is increasingly applied both to scale traditional recordkeeping activities, and to experiment with novel ways to capture, organise and access records. We survey recent developments at the inters...
By linking to external websites, Wikipedia can act as a gateway to the Web. To date, however, little is known about the amount of traffic generated by Wikipedia's external links. We fill this gap in a detailed analysis of usage logs gathered from Wikipedia users' client devices. Our analysis proceeds in three steps: First, we quantify the level of...
As the COVID-19 pandemic unfolds, researchers from all disciplines are coming together and contributing their expertise. CORD-19, a dataset of COVID-19 and coronavirus publications, has been made available alongside calls to help mine the information it contains and to create tools to search it more effectively. We analyse the delineation of the pu...
Wikipedia’s content is based on reliable and published sources. To date, relatively little is known about what sources Wikipedia relies on, in part because extracting citations and identifying cited sources is challenging. To close this gap, we release Wikipedia Citations, a comprehensive data set of citations extracted from Wikipedia. A total...
Wikipedia is one of the most visited sites on the Web and a common source of information for many users. As an encyclopedia, Wikipedia was not conceived as a source of original information, but as a gateway to secondary sources: according to Wikipedia’s guidelines, facts must be backed up by reliable sources that reflect the full spectrum of views...
Crypto art is limited-edition digital art, cryptographically registered with a token on a blockchain. Tokens represent a transparent, auditable origin and provenance for a piece of digital art. Blockchain technologies allow tokens to be held and securely traded without the involvement of third parties. Crypto art draws its origins from conceptual a...
A variety of schemas and ontologies are currently used for the machine-readable description of bibliographic entities and citations. This diversity, and the reuse of the same ontology terms with different nuances, generates inconsistencies in data. Adoption of a single data model would facilitate data integration tasks regardless of the data suppli...
Wikipedia is one of the most visited sites on the internet and a common source of information for many users. As an encyclopedia, Wikipedia was conceived not as a source of original (definitive) scientific information, but rather as a gateway to deeper and more accurate sources. In accordance with Wikipedia's core principles...
The plague, an infectious disease caused by the bacterium Yersinia pestis, is widely considered to be responsible for the most devastating and deadly pandemics in human history. Starting with the infamous Black Death, plague outbreaks are estimated to have killed around 100 million people over multiple centuries, with local mortality rates as high...
Timely access to accurate information is crucial during the COVID-19 pandemic. Prompted by key stakeholders' cautioning against an "infodemic", we study information sharing on Twitter from January through May 2020. We observe an overall surge in the volume of general as well as COVID-19-related tweets around peak lockdown in March/April 2020. With...
Wikipedia is one of the main sources of free knowledge on the Web. During the first few months of the pandemic, over 5,200 new Wikipedia pages on COVID-19 have been created and have accumulated over 400M pageviews by mid-June 2020. At the same time, an unprecedented amount of scientific articles on COVID-19 and the ongoing pandemic have been publ...
Wikipedia's contents are based on reliable and published sources. To date, little is known about what sources Wikipedia relies on, in part because extracting citations and identifying cited sources is challenging. To close this gap, we release Wikipedia Citations, a comprehensive dataset of citations extracted from Wikipedia. A total of 29.3M...
Citation impact is commonly assessed using direct, first-order citation relations. We consider here instead the indirect influence of publications on new publications via citations. We present a novel method to quantify the higher-order citation influence of publications, considering both direct, or first-order, and indirect, or higher-order citati...
Citation impact is commonly assessed using direct, first-order citation relations. We consider here instead the indirect influence of publications on new publications via citations. We present a novel method to quantify the higher-order citation influence of publications, considering both direct, or first-order, and indirect, or higher-order citati...
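To make "higher-order citation influence" concrete, here is a hypothetical sketch of one way to score it: a Katz-style sum in which a citation chain of length k contributes with attenuation alpha^k. It illustrates the general idea of combining first-order and higher-order citations, not the paper's exact method; the matrix and parameters are toy assumptions.

```python
# Hypothetical Katz-style illustration of higher-order citation influence;
# A[i, j] = 1 means publication j cites publication i. Chains of length k
# are attenuated by alpha**k. This sketches the general idea, not the
# paper's exact method.
import numpy as np

def higher_order_influence(A, alpha=0.5, max_order=4):
    total = np.zeros_like(A, dtype=float)
    power = np.eye(A.shape[0])
    for k in range(1, max_order + 1):
        power = power @ A                 # counts citation chains of length k
        total += (alpha ** k) * power
    return total.sum(axis=1)              # per-publication influence score

A = np.array([[0, 1, 1, 0],               # pub 0 is cited by pubs 1 and 2
              [0, 0, 1, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)
print(higher_order_influence(A))          # pub 0 scores highest
```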
A variety of schemas and ontologies are currently used for the machine-readable description of bibliographic entities and citations. This diversity, and the reuse of the same ontology terms with different nuances, generates inconsistencies in data. Adoption of a single data model would facilitate data integration tasks regardless of the data suppli...
Wikipedia is one of the main sources of free knowledge on the Web. During the first few months of the pandemic, over 4,500 new Wikipedia pages on COVID-19 have been created and have accumulated close to 250M pageviews by early April 2020. At the same time, an unprecedented amount of scientific articles on COVID-19 and the ongoing pandemic have been...
Consider the situation where a data analyst wishes to carry out an analysis on a given dataset. It is widely recognized that most of the analyst's time will be taken up with data engineering tasks such as acquiring, understanding, cleaning and preparing the data. In this paper we provide a description and classification of such tasks into hi...
Efforts to make research results open and reproducible are increasingly reflected by journal policies encouraging or mandating authors to provide data availability statements. As a consequence of this, there has been a strong uptake of data availability statements in recent literature. Nevertheless, it is still unclear what proportion of these stat...
As the COVID-19 pandemic unfolds, researchers from all disciplines are coming together and contributing their expertise. CORD-19, a dataset of COVID-19 and coronavirus publications, has recently been published alongside calls to help mine the information it contains, and to create tools to search it more effectively. Here, we focus on the delineati...
The plague, an infectious disease caused by the bacterium Yersinia pestis, is widely considered to be responsible for the most devastating and deadly pandemics in human history. Starting with the infamous Black Death, plague outbreaks are estimated to have killed around 100 million people over multiple centuries, with local mortality rates as high...
There are many different relatedness measures, based for instance on citation relations or textual similarity, that can be used to cluster scientific publications. We propose a principled methodology for evaluating the accuracy of clustering solutions obtained using these relatedness measures. We formally show that the proposed methodology has an i...
A growing volume of heritage data is being digitized and made available as text via optical character recognition (OCR). Scholars and libraries are increasingly using OCR-generated text for retrieval and analysis. However, the process of creating text through OCR introduces varying degrees of error to the text. The impact of these errors on natural...
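One simple way to study such effects is to inject OCR-like noise into clean text and compare model behavior before and after. The sketch below is illustrative only; the character-confusion table and noise rate are assumptions, not the noise model used in the study.

```python
# Illustrative OCR-noise injector for robustness probes; the confusion
# pairs and noise rate are assumptions, not the study's actual noise model.
import random

OCR_CONFUSIONS = {"e": "c", "l": "1", "o": "0", "s": "5", "i": "l"}

def add_ocr_noise(text, rate=0.1, seed=0):
    rng = random.Random(seed)
    return "".join(
        OCR_CONFUSIONS.get(ch, ch) if rng.random() < rate else ch
        for ch in text
    )

clean = "the last of the mohicans"
print(add_ocr_noise(clean, rate=0.3))   # corrupted variant of the string
```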
The ambition of scholarship in the humanities is to systematically understand the human condition in all its aspects and times. To this end, humanists are more apt to interpret specific phenomena than generalize to previously unseen observations. When we consider scholarship as a collective effort, this has consequences. I argue that most of the hu...
Wikipedia, the free online encyclopedia that anyone can edit, is one of the most visited sites on the Web and a common source of information for many users. As an encyclopedia, Wikipedia is not a source of original information, but was conceived as a gateway to secondary sources: according to Wikipedia's guidelines, facts must be backed up by relia...
The development of models to capture large-scale dynamics in human history is one of the core contributions of cliodynamics. Most often, these models are assessed by their predictive capability on some macro-scale and aggregated measure and compared to manually curated historical data. In this report, we consider the model from Turchin et al. (2013...
We present the Scholar Index: a platform to index the literature and primary sources of the arts and humanities through citations. These resources are becoming increasingly digital, thanks in part to digitization campaigns and a shift towards digital publishing. Nevertheless, the coverage of commercial citation indexes is still poor and mostly limi...
As in other Italian cities, Venetian apprenticeship was primarily ruled by private contract between the master and his pupil and their guardians. A new data set of almost 6,000 contracts from the late sixteenth to the mid-seventeenth century for the first time allows a representative view of the profile of Venetian apprentices and apprenticeships....
This work presents defoe, a new scalable and portable digital eScience toolbox that enables historical research. It allows for running text mining queries across large datasets, such as historical newspapers and books in parallel via Apache Spark. It handles queries against collections that comprise several XML schemas and physical representations....
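For a sense of the kind of parallel query such a toolbox runs, here is a hedged PySpark sketch of a keyword-frequency query over a toy document collection; defoe's actual query API, XML handling, and data model differ.

```python
# Hedged sketch of a parallel text-mining query in PySpark (keyword
# frequencies over a document collection); the two "newspapers" below
# are toy data, not defoe's real inputs.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("keyword-query").getOrCreate()
docs = spark.sparkContext.parallelize([
    ("paper_1860_01", "the harvest failed and bread prices rose"),
    ("paper_1860_02", "bread riots followed the failed harvest"),
])
counts = (docs.flatMap(lambda kv: kv[1].split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))
print(sorted(counts.collect(), key=lambda t: -t[1])[:3])
spark.stop()
```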
Even large citation indexes such as the Web of Science, Scopus, or Google Scholar cover only a small fraction of the literature in the humanities. This coverage sensibly decreases going backwards in time. Citation mining of humanities publications, defined as an instance of bibliometric data mining and as a means to the end of building comprehensive...
Success in art markets is difficult to quantify objectively, as it also relies on complex social networks and exchanges of reputation among different actors of the system. We discuss the general task of developing art metrics that are able to capture the different roles actors play in art markets, in particular artists and collectors, are time-awar...
Efforts to make research results open and reproducible are increasingly reflected by journal policies encouraging or mandating authors to provide data availability statements. As a consequence of this, there has been a strong uptake of data availability statements in recent literature. Nevertheless, it is still unclear what proportion of these stat...
This is a decentralized position paper on crypto art, which includes viewpoints from different actors of the system: artists, collectors, galleries, art scholars, data scientists. The writing process went as follows: a general definition of the topic was put forward by two of the authors (Franceschet and Colavizza), and used as reference to ask to...
The development of models to capture large-scale dynamics in human history is one of the core contributions of the cliodynamics field. Crucially and most often, these models are assessed by their predictive capability on some macro-scale and aggregated measure, compared to manually curated historical data. We consider the model predicting large-sca...
The promise of digitization of historical archives lies in their indexation at the level of contents. Unfortunately, this kind of indexation does not scale, if done manually. In this article we present a method to bootstrap the deployment of a content-based information system for digitized historical archives, relying on historical indexing tools....
We propose an operationalization of the rural and urban analogy introduced in Becher and Trowler (2001). According to them, a specialism is rural if it is organized into many, smaller topics of research, with higher author mobility among them, lower rate of collaboration and productivity, lower competition for resources and citation recognitions co...
There are many different relatedness measures, based for instance on citation relations or textual similarity, that can be used to cluster scientific publications. We propose a principled methodology for evaluating the accuracy of clustering solutions obtained using these relatedness measures. We formally show that the proposed methodology has an i...
The humanities are often characterized by sociologists as having a low mutual dependence among scholars and high task uncertainty. According to Fuchs’ theory of scientific change, this leads over time to intellectual and social fragmentation, as new scholarship accumulates in the absence of shared unifying theories. We consider here a set of specia...
Historiography is undergoing incessant expansion in the number of publications and active scholars, as is the case with the humanities and sciences in general. Little is known about what effects this has on the research activity and ways of publishing of historians, often stemming from long-established practices. Yet it seems recurrent that during...
The advent of large-scale citation indexes has greatly impacted the retrieval of scientific information in several domains of research. The humanities have largely remained outside of this shift, despite their increasing reliance on digital means for information seeking. Given that publications in the humanities have a longer than average life-span...
A tradition of scholarship discusses the characteristics of different areas of knowledge, in particular after modern academia compartmentalized them into disciplines. The academic approach is often put to question: are there two or more cultures? Is an ever-increasing specialization the only way to cope with information abundance or are holistic...
The humanities are often characterized by sociologists as having a low mutual dependence among scholars and high task uncertainty. According to Fuchs' theory of scientific change, this leads over time to intellectual and social fragmentation, as new scholarship accumulates in the absence of shared unifying theories. We consider here a set of specia...
We consider the task of reference mining: the detection, extraction and classification of references within the full text of scholarly publications. Reference mining brings forward specific challenges, such as the need to capture the morphology of highly abbreviated words and the dependence among the elements of a reference, both following codified...
We propose an operationalization of the rural and urban analogy introduced in Becher and Trowler [2001]. According to them, a specialism is rural if it is organized into many, smaller topics of research, with higher author mobility among them, lower rate of collaboration and productivity, lower competition for resources and citation recognitions co...
We report characteristics of in-text citations in over five million full text articles from two large databases - the PubMed Central Open Access subset and Elsevier journals - as functions of time, textual progression, and scientific field. The purpose of this study is to understand the characteristics of in-text citations in a detailed way prior t...
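"Textual progression" here refers to where in the running text a citation occurs. Below is a minimal sketch of that measurement; the bracketed-number regex is an assumption for illustration, whereas the study itself works on parsed full text from the two databases named above.

```python
# Minimal sketch of measuring the textual progression of in-text citations:
# each marker's relative position in the full text (0 = start, 1 = end).
# The bracketed-number regex is an illustrative assumption.
import re

def citation_positions(full_text, marker=r"\[\d+\]"):
    n = len(full_text)
    return [m.start() / n for m in re.finditer(marker, full_text)]

text = "Introduction [1]. Methods follow [2] and [3]. We conclude [4]."
print([round(p, 2) for p in citation_positions(text)])
```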
We present a systematic non-invasive investigation of a large corpus of early printed books, exploiting multiple techniques. This work is part of a broader project – Argeia – aiming to study early printing technologies, their evolution and, potentially, the identification of physical/chemical fingerprints of different manufactures and/or printing d...
We publish a dataset containing more than 40,000 manually annotated references from a broad corpus of books and journal articles on the history of Venice. References were drawn from both reference lists and footnotes, and include primary and secondary sources, in full or abbreviated form. The dataset comprises references from publications from the...
We investigated the similarities of pairs of articles that are cocited at the different cocitation levels of the journal, article, section, paragraph, sentence, and bracket. Our results indicate that textual similarity, intellectual overlap (shared references), author overlap (shared authors), proximity in publication time all rise monotonically as...
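As an illustration of the textual-similarity side of this measurement, the sketch below computes the cosine similarity of TF-IDF vectors for one co-cited pair; the abstracts are toy stand-ins, and the study's exact similarity measure may differ.

```python
# Illustrative TF-IDF cosine similarity for one co-cited pair of articles;
# the abstracts are toy stand-ins, not data from the study.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

abstracts = [
    "plague outbreaks spread along trade routes in early modern europe",
    "trade networks shaped the spread of plague in european cities",
]
X = TfidfVectorizer().fit_transform(abstracts)
print(cosine_similarity(X[0], X[1])[0, 0])
```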
Rating has become a common practice of modern science. No rating system can be considered as final, but instead several approaches can be taken, which magnify different aspects of the fabric of science. We introduce an approach for rating scholars which uses citations in a dynamic fashion, allocating ratings by considering the relative position of...
We report characteristics of in-text citations in over five million full text articles from two large databases - the PubMed Central Open Access subset and Elsevier journals - as functions of time, textual progression, and scientific field. The purpose of this study is to understand the characteristics of in-text citations in a detailed way prior t...
Over the past decades, the humanities have been accumulating a growing body of literature at an increasing pace. How does this impact their traditional organization into disciplines and fields of research therein? This article considers history, by examining a citation network among recent monographs on the history of Venice. The resulting network...
We investigate the similarities of pairs of articles which are co-cited at the different co-citation levels of the journal, article, section, paragraph, sentence and bracket. Our results indicate that textual similarity, intellectual overlap (shared references), author overlap (shared authors), proximity in publication time all rise monotonically a...
The intellectual landscapes of the humanities are mostly uncharted territory. Little is known on the ways published research of humanist scholars defines areas of intellectual activity. An open question relates to the structural role of core literature: highly cited sources, naturally playing a disproportionate role in the definition of intellectua...
A sample of contracts of apprenticeship from three periods in the history of early modern Venice is analysed, as recorded in the archive of the Giustizia Vecchia, a Venetian magistracy. The periods are the end of the 16th century, the 1620s and the 1650s. A set of findings is discussed. First, the variety of professions represented in the dataset r...
In recent years, many cultural institutions have engaged in large-scale newspaper digitization projects and large amounts of historical texts are being acquired (via transcription or OCRization). Beyond document preservation, the next step consists in providing enhanced access to the content of these digital resources. In this regard, the proce...
Massive digitization of archival material, coupled with automatic document processing techniques and data visualization tools offers great opportunities for reconstructing and exploring the past. Unprecedented wealth of historical data (e.g. names of persons, places, transaction records) can indeed be gathered through the transcription and annotati...
Marvel et al. [12] recently argued that the pre-modern contact world was physically and, by set inclusion, socially not small-world. Since the Black Death and similar plagues used to spread in well-defined waves, the argument goes, the underlying contact network could not have been small-world. I counter here that small-world contact networks were...
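The small-world property at issue can be demonstrated with a standard Watts-Strogatz construction: rewiring a small fraction of a ring lattice's edges collapses the average path length while clustering stays high. A hedged sketch with illustrative parameters, using networkx:

```python
# Watts-Strogatz demonstration of the small-world property under discussion:
# a little random rewiring of a ring lattice sharply shortens average paths
# while clustering stays high. Parameters are illustrative.
import networkx as nx

lattice = nx.connected_watts_strogatz_graph(n=1000, k=10, p=0.0, seed=1)
small_world = nx.connected_watts_strogatz_graph(n=1000, k=10, p=0.05, seed=1)

for name, g in [("lattice", lattice), ("small-world", small_world)]:
    print(name,
          "avg path:", round(nx.average_shortest_path_length(g), 2),
          "clustering:", round(nx.average_clustering(g), 2))
```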
The advent of large-scale citation services has greatly impacted the retrieval of scientific information for several domains of research. The Humanities have largely remained outside of this shift despite their increasing reliance on digital means for information seeking. Given that publications in the Humanities probably have a longer than average...
Led by an interdisciplinary consortium, the Garzoni project undertakes the study of apprenticeship, work and society in early modern Venice by focusing on a specific archival source, namely the Accordi dei Garzoni from the Venetian State Archives. The project revolves around two main phases with, in the first instance, the design and the developmen...
We present preliminary results from the Linked Books project, which aims at analysing citations from the historiography on Venice. A preliminary goal is to extract and parse citations from any location in the text, especially footnotes, both to primary and secondary sources. We detail a pipeline for these tasks based on a set of classifie...