Article

Toward a computational history of universities: Evaluating text mining methods for interdisciplinarity detection from PhD dissertation abstracts

Authors:
Nanni, Dietz & Ponzetto

Abstract

For the first time, historians of higher education have at their disposal large data sets of primary sources that reflect the complete output of academic institutions. To analyze this unprecedented abundance of digital materials, scholars can draw on a large suite of computational methods developed in the field of Natural Language Processing. However, when the intention is to move beyond exploratory studies and use the results of such analyses as quantitative evidence, historians need to take into account the reliability of these techniques. The main goal of this article is to investigate the performance of different text mining methods for a specific task: the automatic identification of interdisciplinary works in a corpus of PhD dissertation abstracts. Based on the output of our study, we provide the research community with a new data set for analyzing recent changes in interdisciplinary practices in a large sample of European universities. We show the potential of this collection by tracking the growth in adoption of computational approaches across different research fields during the past 30 years.
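In text-mining terms, the task described above can be framed as binary text classification over abstracts. The following is a minimal, hypothetical sketch of such a comparison between lexical-feature classifiers (a linear SVM and a Rocchio-style centroid classifier) using scikit-learn; the toy abstracts, labels, and parameter choices are invented for illustration and are not the authors' actual pipeline or data.

```python
# Hypothetical sketch: interdisciplinarity detection as binary text classification.
# Toy abstracts and labels are invented; they do not come from the paper's corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestCentroid   # used here as a Rocchio-style classifier
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

abstracts = [
    "Convolutional neural networks for dating medieval manuscripts",
    "A critical edition of an unpublished Latin chronicle",
    "Agent-based simulation of early modern trade networks",
    "Close reading of metaphor in modernist poetry",
    "Topic models applied to nineteenth-century parliamentary debates",
    "An archival study of monastic land ownership",
]
labels = [1, 0, 1, 0, 1, 0]  # 1 = interdisciplinary, 0 = mono-disciplinary (toy labels)

models = {
    "linear SVM":    make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC()),
    "Rocchio-style": make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), NearestCentroid()),
}
for name, model in models.items():
    f1 = cross_val_score(model, abstracts, labels, cv=3, scoring="f1").mean()
    print(f"{name}: mean F1 = {f1:.2f}")
```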

... It proposed using the National Science Foundation (NSF) topic model and the NSF's institutional structure by examining research grant proposals and awards rather than publications. Nanni et al. (2018) compared the performance of LDA with the outcomes obtained using other text mining methods, such as lexical features within a support vector machine (SVM) or a Rocchio classifier, for the automatic identification of interdisciplinary works in a corpus of doctoral dissertation abstracts. Considering that, we intend to verify the usefulness of topic modelling for identifying interdisciplinarity in articles. ...
Conference Paper
Full-text available
Topic modelling is one of the most widely investigated areas in Natural Language Processing. One of the techniques used for topic modelling is Latent Dirichlet Allocation (LDA), an unsupervised machine learning technique which induces topics from a collection of documents based on words or n-grams with similar meaning. In this paper, we applied a Structural Topic Model with LDA to extract topics from scientific papers in Social Science. A structural topic modelling of 3663 articles from the Web of Science Core Collection from 1999 to 2019 was conducted. The obtained results indicate that the optimal number of topics coincides with the existing number of research areas defined in Social Science or with an integer multiple of it. This opens an area for research into the comparison between the existing taxonomy and the taxonomy proposed by the LDA model, and for the future identification of interdisciplinarity.
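The "optimal number of topics" question above is typically approached by fitting LDA for several candidate values of K and comparing topic-coherence scores. Below is a minimal sketch of that procedure, assuming the gensim library; the tokenised documents and parameter choices are placeholders, not the cited study's actual configuration.

```python
# Illustrative sketch: selecting the number of LDA topics by topic coherence.
# Assumes gensim is installed; the tiny tokenised "documents" are placeholders.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

texts = [
    ["interdisciplinary", "research", "higher", "education", "policy"],
    ["topic", "model", "latent", "dirichlet", "allocation", "corpus"],
    ["social", "science", "survey", "education", "policy"],
    ["machine", "learning", "text", "classification", "corpus"],
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

for k in (2, 3, 4):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   random_state=0, passes=10)
    coherence = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                               coherence="c_v").get_coherence()
    print(f"K={k}: c_v coherence = {coherence:.3f}")
```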
Article
Color preference in Chinese folksongs is examined from the perspectives of theme, ethnicity, and geographical environment. The results show that the self-organization property of the language system plays a role in color use, and that color preference varies with theme, ethnicity, and geographical environment. Specifically, the color white is preferred by twenty-three ethnic minorities, while the color red is much more popular among the Han. Only in love songs does the preference for white and red exhibit an approximate north-south dimension. The study shows that digital approaches to color in folklore are an effective and promising tool for exploring humans' responses to colors.
Article
Goal: The growing complexity of problems induces the use of multi- and interdisciplinary approaches in their solution. This situation occurs in a number of fields, including Education. The results of interdisciplinary research in Education are presented in several journals addressing different subjects, which prevents a holistic view of the development of this area. In order to fill this gap, this article studies how the concept of interdisciplinarity has been applied in education. Design / Methodology / Approach: The study consists of a bibliographical survey of articles indexed in the SCOPUS database. The selection was limited to a cross-sectional survey of the literature from 2014 to 2018, using the following keywords: interdisciplinarity and higher education. Limitations of the investigation: Through the methodology used, 60 articles were selected. Results: Few articles were related to interdisciplinary practices, demonstrating the need for research to cover this gap. Practical implications: Although the subject emerged in the 1970s, there is still much to be researched regarding interdisciplinarity in education, to allow its better dissemination and practice, so that students gain a systemic view of current complexity. Originality / Value: The study points out a gap in the literature, and the quantitative results suggest a particular lack of works on applying the interdisciplinary approach for the improvement of society.
Article
Full-text available
Purpose – The purpose of this paper is to identify criteria for and definitions of disciplinarity, and how they differ between different types of literature. Design/methodology/approach – This synthesis is achieved through a purposive review of three types of literature: explicit conceptualizations of disciplinarity; narrative histories of disciplines; and operationalizations of disciplinarity. Findings – Each angle of discussing disciplinarity presents distinct criteria. However, there are a few common axes upon which conceptualizations, disciplinary narratives, and measurements revolve: communication, social features, topical coherence, and institutions. Originality/value – There is considerable ambiguity in the concept of a discipline. This is of particular concern in a heightened assessment culture, where decisions about funding and resource allocation are often discipline-dependent (or focussed exclusively on interdisciplinary endeavors). This work explores the varied nature of disciplinarity and, through synthesis of the literature, presents a framework of criteria that can be used to guide science policy makers, scientometricians, administrators, and others interested in defining, constructing, and evaluating disciplines.
Article
Full-text available
In this presentation we argue that the core research activities of scientometrics fall into four interrelated areas: science and technology indicators, information systems on science and technology, the interaction between science and technology, and cognitive as well as socio-organisational structures in science and technology. We emphasize that an essential condition for the healthy development of the field is a careful balance between application and basic work, in which the applied side is the driving force. In other words, scientometrics is primarily a field of applied science. This means that the interaction with users is at least as important as the interaction with colleague scientists. We state that this situation is very stimulating: it strengthens methodology and activates basic work. We consider the idea that scientometrics lacks theoretical content, or is otherwise in a 'crisis-like' situation, to be groundless. Scientometrics is in a typical developmental stage in which the creativity of its individual researchers and the 'climate' and facilities of their institutional environments determine the progress of the field and, particularly, its relation with other disciplines. These aspects also contribute substantially to the reputation of scientometrics as a research field respected by the broader scientific community. This latter point is important, both to let quantitative studies of science and technology take more advantage of an academic environment and to keep the field innovative, and thus attractive in terms of applications, in the longer term.
Article
Full-text available
In nearly all domains of Global Change Research (GCR), the role of humans is a key factor as a driving force, a subject of impacts, or an agent in mitigating impacts and adapting to change. While advances have been made in the conceptualisation and practice of interdisciplinary Global Change Research in fields such as climate change and sustainability, approaches have tended to frame interdisciplinarity as actor-led, rather than recognizing that complex problems which cut across disciplines may require new epistemological frameworks and methodological practices that exceed any one discipline. GCR studies must involve from their outset the social, human, natural and technical sciences in creating the spaces of interdisciplinarity, its terms of reference and forms of articulation. We propose a framework for funding excellence in interdisciplinary studies, named the Radically Inter- and Trans-disciplinary Environments (RITE) framework. RITE includes the need for a realignment of funding strategies to ensure that national and international research bodies and programmes road-map their respective strengths and identified areas for radical interdisciplinary research, and then ensure that these areas can be and are appropriately funded and staffed by talented individuals who want to apply their creative scientific talents to broader issues than their own field over the long term, rather than on limited-scope (5 years or less) research projects. While our references are mostly to Europe, the recommendations may be applicable elsewhere.
Conference Paper
Full-text available
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
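As a concrete illustration of the last point (topic probabilities as an explicit document representation), the following minimal sketch, assuming scikit-learn's variational-inference implementation of LDA, fits a two-topic model on invented toy documents and prints each document's topic mixture; it is not tied to the original paper's experiments.

```python
# Minimal sketch: an LDA model represents each document as a mixture over topics.
# Toy documents; scikit-learn's variational-inference LDA is assumed.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "gene expression in yeast cells",
    "protein folding and gene regulation",
    "stock market returns and volatility",
    "interest rates and market risk",
]
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)   # one row per document; rows sum to 1
print(doc_topic.round(2))
```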
Article
Full-text available
The first section of this article concerns the theme of Humanities Computing teaching. Most experts agree that Humanities Computing is an independent discipline - which studies the problems of formalisation and models, cutting across all humanities disciplines (linguistics, literature, history, archaeology, history of art, history of music) - and as such it should be introduced into the Faculties of Humanities. Academic organisations are beginning to acknowledge the importance of teaching computer applications to students, but their approach is far from consistent. The full proposal for a new independent scientific-disciplinary sector, submitted by a group of experts for approval by the Italian CUN (Consiglio Universitario Nazionale), is therefore presented. The second part of the article deals with the results of a survey, carried out in 21 Italian universities, on how Humanities Computing is being introduced into the curricula of the Faculties of Humanities. Relevant quantitative data are presented, which clearly illustrate both the need to distinguish between basic computer literacy and the teaching of applications for research, and the urgency of solving, in this sector of studies, the problem of teachers on temporary contracts.
Article
Full-text available
Innovation policy is increasingly informed from the perspective of a national innovation system (NIS), but, despite the fact that research findings emphasize the importance of national differences in the framing conditions for innovation, policy prescriptions tend to be uniform. Justifications for innovation policy by organizations such as the OECD generally relate to notions of market failure, and the USA, with its focus on the commercialization of public sector research and entrepreneurship, is commonly portrayed as the best model for international emulation. In this paper we develop a broad framework for NIS analysis, involving free market, coordination and complex-evolutionary system approaches. We argue that empirical evidence supporting the hypothesis that the ‘free market’ can be relied upon to promote innovation is limited, even in the USA, and the global financial crisis provides us with new opportunities to consider alternatives. The case of Australia is particularly interesting: a successful economy, but one that faces continuing productivity and innovation challenges. Drawing on information and analysis collected for a major review of Australia’s NIS, and the government’s 10-year plan in response to it, we show how the free market trajectory of policy-making of past decades is being extended, complemented and refocused by new approaches to coordination and complex-evolutionary system thinking. These approaches are shown to emphasize the importance of systemic connectivity, evolving institutions and organizational capabilities. Nonetheless, despite the fact that there has been much progress in this direction in the Australian debate, the predominant logic behind policy choices still remains one of addressing market failure, and the primary focus of policy attention continues to be science and research rather than demand-led approaches. We discuss how the development and elaboration of notions of systems failure, rather than just market failure, can …
Article
Future historians will describe the rise of the World Wide Web as the turning point of their academic profession. Thanks to an unprecedented number of digitization projects and to the preservation of born-digital sources, for the first time they have at their disposal a gigantic collection of traces of our past. However, to understand trends and obtain useful insights from these very large amounts of data, historians will need ever more fine-grained techniques. This will be especially true if their objective turns to hypothesis-testing studies, in order to build arguments by employing their deep in-domain expertise. For this reason, we focus this paper on a set of computational techniques, namely semi-supervised computational methods, which could potentially provide a methodological turning point for this change. Because these approaches can affirm themselves as both knowledge driven and data driven at the same time, they could become a solid alternative to some of the most widely employed unsupervised techniques of today. However, historians who intend to employ them as evidence to support a claim can no longer use computational methods as black boxes, but must treat them as a series of well-understood methodological approaches. For this reason, we believe that while developing computational skills will be important for historians, a solid background knowledge of the most important data analysis and result evaluation procedures will be even more crucial.
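A hedged sketch of the kind of semi-supervised setup advocated above (not the article's specific method): a handful of labelled abstracts is extended over unlabelled ones via self-training, here using scikit-learn's SelfTrainingClassifier; all documents, labels, and threshold values are invented placeholders.

```python
# Hedged sketch of a semi-supervised (self-training) text classifier; not the
# article's actual method. Documents and labels are invented placeholders;
# -1 marks unlabelled examples, per scikit-learn's convention.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.semi_supervised import SelfTrainingClassifier

docs = [
    "neural networks for dating medieval manuscripts",    # labelled interdisciplinary
    "critical edition of a Latin chronicle",              # labelled mono-disciplinary
    "network analysis of citation flows in chemistry",    # unlabelled
    "close reading of modernist poetry",                  # unlabelled
]
labels = [1, 0, -1, -1]

model = make_pipeline(
    TfidfVectorizer(),
    SelfTrainingClassifier(LogisticRegression(), threshold=0.6),
)
model.fit(docs, labels)
print(model.predict(["statistical analysis of archaeological survey data"]))
```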
Article
As the National Science Foundation (NSF) implements new cross-cutting initiatives and programs, interest in assessing the success of these experiments in fostering interdisciplinarity grows. A primary challenge in measuring interdisciplinarity is identifying and bounding the discrete disciplines that comprise interdisciplinary work. Using statistical text-mining techniques to extract topic bins, the NSF recently developed a topic map of all of their awards issued between 2000 and 2011. These new data provide a novel means for measuring interdisciplinarity by assessing the language or content of award proposals. Using the Directorate for Social, Behavioral, and Economic Sciences as a case study and drawing on the new topic model of the NSF's awards, this paper explores new methods for quantifying interdisciplinarity in the NSF portfolio.
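One generic way to turn such topic-model output into an interdisciplinarity score, not necessarily the measure used in the cited NSF study, is to compute the entropy of each award's topic mixture; a minimal sketch:

```python
# Generic illustration: scoring interdisciplinarity as the Shannon entropy of a
# document's topic proportions (higher entropy = more evenly mixed topics).
# This is a common operationalization, not necessarily the cited study's measure.
import numpy as np

def topic_entropy(proportions):
    """Shannon entropy (bits) of a document's topic distribution."""
    p = np.asarray(proportions, dtype=float)
    p = p[p > 0]                     # ignore zero-probability topics
    return float(-(p * np.log2(p)).sum())

focused = [0.90, 0.05, 0.05]         # mostly one topic   -> low entropy
mixed   = [0.40, 0.35, 0.25]         # spread over topics -> high entropy
print(topic_entropy(focused), topic_entropy(mixed))
```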
Article
Relationships between authors based on characteristics of published literature have been studied for decades. Author cocitation analysis using mapping techniques has been most frequently used to study how closely related two authors are thought to be in intellectual space, based on how members of the research community co-cite their works. Other approaches exist to study author relatedness based more directly on the text of their published works. In this study we present static and dynamic word-based approaches using vector space modeling, as well as a topic-based approach based on latent Dirichlet allocation, for mapping author research relatedness. Vector space modeling is used to define an author space consisting of works by a given author. Outcomes for the two word-based approaches and a topic-based approach for 50 prolific authors in library and information science are compared with more traditional author cocitation analysis using multidimensional scaling and hierarchical cluster analysis. The two word-based approaches produced similar outcomes, except where two authors were frequent co-authors for the majority of their articles. The topic-based approach produced the most distinctive map.
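A minimal sketch of the kind of word-based vector-space comparison described above: each author is represented by the concatenated text of their works in a TF-IDF space, and relatedness is taken as cosine similarity. The author names and texts below are invented placeholders.

```python
# Sketch of a word-based author-relatedness measure: TF-IDF author profiles
# compared by cosine similarity. Author names and texts are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

author_texts = {
    "author_a": "information retrieval ranking evaluation test collections",
    "author_b": "retrieval models ranking relevance feedback evaluation",
    "author_c": "scholarly communication citation analysis bibliometrics",
}
names = list(author_texts)
vectors = TfidfVectorizer().fit_transform(author_texts.values())
similarity = cosine_similarity(vectors)

for i, a in enumerate(names):
    for j in range(i + 1, len(names)):
        print(f"{a} ~ {names[j]}: {similarity[i, j]:.2f}")
```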
Article
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
Book
Class-tested and coherent, this textbook teaches classical and web information retrieval, including web search and the related areas of text classification and text clustering from basic concepts. It gives an up-to-date treatment of all aspects of the design and implementation of systems for gathering, indexing, and searching documents; methods for evaluating systems; and an introduction to the use of machine learning methods on text collections. All the important ideas are explained using examples and figures, making it perfect for introductory courses in information retrieval for advanced undergraduates and graduate students in computer science. Based on feedback from extensive classroom experience, the book has been carefully structured in order to make teaching more natural and effective. Slides and additional exercises (with solutions for lecturers) are also available through the book's supporting website to help course instructors prepare their lectures.
Workshop on tool criticism in the DH
  • Traub