Article

Improving the accuracy of co-citation clustering using full text

Authors:
  • Kevin W. Boyack, SciTech Strategies, Inc., United States
  • Henry Small, SciTech Strategies, Inc., United States
  • Richard Klavans, SciTech Strategies, Inc., United States

Abstract

Historically, co-citation models have been based only on bibliographic information. Full-text analysis offers the opportunity to significantly improve the quality of the signals upon which these co-citation models are based. In this work we study the effect of reference proximity on the accuracy of co-citation clusters. Using a corpus of 270,521 full text documents from 2007, we compare the results of traditional co-citation clustering using only the bibliographic information to results from co-citation clustering where proximity between reference pairs is factored into the pairwise relationships. We find that accounting for reference proximity from full text can increase the textual coherence (a measure of accuracy) of a co-citation cluster solution by up to 30% over the traditional approach based on bibliographic information.
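The proximity weighting described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's exact scheme: the linear decay, the 5,000-character scale, and all function and variable names are assumptions of this example.

```python
from collections import defaultdict
from itertools import combinations

def cocitation_weights(papers, max_dist=5000):
    """Proximity-weighted co-citation counts.

    `papers` maps each citing paper to a list of (reference_id, char_offset)
    pairs marking where that reference is cited in the full text.  A pair of
    references cited close together contributes more weight than a distant
    pair; pairs farther apart than `max_dist` characters contribute nothing.
    """
    weights = defaultdict(float)
    for refs in papers.values():
        for (r1, o1), (r2, o2) in combinations(refs, 2):
            if r1 == r2:
                continue  # repeated citation of the same reference
            w = max(0.0, 1.0 - abs(o1 - o2) / max_dist)
            weights[tuple(sorted((r1, r2)))] += w
    return dict(weights)

# Two citing papers; offsets are character positions of in-text citations.
papers = {
    "P1": [("A", 100), ("B", 150), ("C", 4900)],
    "P2": [("A", 10), ("B", 4510)],
}
w = cocitation_weights(papers)
# ("A", "B") is cited closely in P1 and far apart in P2, so its combined
# weight exceeds that of the near-threshold pairs involving C.
```

A traditional co-citation count is recovered by replacing the decay with a constant weight of 1 per co-occurrence.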


... Content-independent approaches (such as collaborative filtering [11][12][13], graph-based [14], and hybrid approaches [7,8,15]) have also been proposed in recommender systems. Collaborative filtering is based on the idea that "like-minded people like/dislike the same things". ...
... Citation analysis and content-based heuristics have also been combined to recommend more related papers. In these models, researchers exploited citation behavior within the text [8,15,19]. Researchers analyzed the context of citations [18], the number of repeated citations [28], and citations within sections [29]. The majority of these approaches have investigated co-citation proximity [8,15], while some are based on bibliographic coupling. ...
... Notwithstanding the acceptance of DCA in unravelling knowledge structures across academic fields, several researchers have discussed the limitations of DCA (e.g. Elkiss et al. 2008; Chen et al. 2010; Aljaber et al. 2010; Small and Klavans 2011; Boyack et al. 2013; Knoth and Khadka 2017). The main limitation of the conventional co-citation analysis pertains to the underlying assumption of equal weightage to each citation, without considering the contents of the cited documents. ...
... From the existing research aimed at improving DCA, we observe that the focus has been pinned on the proximity of the citations in the main text to modify the DCA (e.g. Boyack et al. 2013; Colavizza et al. 2018; Amador Penichet et al. 2018). The hybrid techniques which utilise the concept of semantic similarity in improving the results of the conventional co-citation method are, however, still in a developmental stage (Boyack et al. 2013). ...
Article
Document co-citation analysis (DCA) is employed across various academic disciplines and contexts to characterise the structure of knowledge. Since the introduction of the method for DCA by Small (J Am Soc Inf Sci 24(4):265-269, 1973), a variety of modifications towards optimising its results have been proposed by several researchers. We recommend a new approach to improve the results of DCA by integrating the concept of document similarity into it. Our proposed method modifies DCA by incorporating semantic similarity, computed using latent semantic analysis on the abstracts of the top-cited documents. The interaction of these two measures results in a new measure that we call the semantic similarity adjusted co-citation index. The effectiveness of the proposed method is evaluated through an empirical study of the tourism supply chain (TSC), where we employ network and cluster analysis techniques. The study also comprehensively explores the knowledge structures resulting from both methods. The results of our case study suggest that the clustering quality and knowledge map of the domain can be improved by considering document similarity along with co-citation strength.
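The semantic-similarity-adjusted index can be illustrated with a small sketch. Here a plain term-frequency cosine stands in for the LSA similarity used in the abstract, and the multiplicative combination of textual similarity and co-citation count is one plausible reading, not the paper's exact formula.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def adjusted_cocitation(cocit_count, abstract_a, abstract_b):
    """Semantic similarity adjusted co-citation: the raw co-citation count
    scaled by the textual similarity of the two documents' abstracts."""
    va = Counter(abstract_a.lower().split())
    vb = Counter(abstract_b.lower().split())
    return cocit_count * cosine(va, vb)
```

With identical abstracts the adjusted index equals the raw count; with fully disjoint vocabularies it drops to zero, so textually unrelated but frequently co-cited pairs are down-weighted.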
... Co-citation considers two research articles as related if one or more articles have cited both of them. Researchers extended co-citation to include content analysis (Boyack et al., 2013). The inclusion of content analysis in co-citation has a constructive effect in confirming the accuracy of the recommended research articles. ...
... That is, co-citation approaches using content analysis yield more relevant articles than traditional co-citation approaches. However, co-citation considers the relationship between articles based on their association with other articles, without considering the content of the cited articles (Boyack et al., 2013). ...
... PACA. They also presented a comparison of the distance-based co-citation analysis introduced by Boyack, Small, and Klavans (2013) with traditional ACA and CACA and with the newly introduced concept of normalized paragraph-similarity-based ACA (PACA). ...
... Unlike other approaches, our approach of incorporating location showed subject disciplines at a more fine-grained level and also helped us to identify a larger number of disciplines. In future work, we plan to leverage state-of-the-art NLP approaches (Blondel et al. 2008; Boyack, Small, and Klavans 2013; Hair et al. 1998; Hsiao and Chen 2017) to improve on the employed TF-IDF similarity methods, enabling better clustering to identify researchers working in interdisciplinary areas and to better understand working relationships across institutions for more sustainable collaborations. ...
Article
This paper proposes two novel approaches to measure the similarity of co-cited authors for the task of document clustering: a) paragraph-level content-based author co-citation analysis (PCACA) and b) section-level content-based author co-citation analysis (SCACA), by mining the textual cited references at the paragraph and the section level within a given scientific publication, respectively. Using over 2000 full-text publications in the field of Computer and Information Sciences, indexed in PLOS.org, we extract useful information from each full-text publication, such as citing sentences, the location of citing sentences, the cited first author's name, and the title of the cited work. We show that our proposed SCACA method outperforms existing clustering methods by exhibiting more optimal clusters, with a minimum graph density of 0.327 and an average degree of 27.143. Finally, we show that SCACA produces the optimum number of clusters that comprehensively explains the sub-disciplines of co-cited author pairs.
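The two cluster statistics reported above, graph density and average degree, are straightforward to compute for an undirected co-cited-author graph; a minimal sketch with an illustrative toy graph:

```python
def density_and_avg_degree(edges):
    """Density (2m / n(n-1)) and average degree (2m / n) of an undirected
    graph given as a list of node-pair edges, the two cluster quality
    statistics reported for SCACA above."""
    nodes = {u for edge in edges for u in edge}
    n, m = len(nodes), len(edges)
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0
    avg_degree = 2 * m / n if n else 0.0
    return density, avg_degree

# Toy co-cited-author graph: 4 authors, 4 edges.
d, k = density_and_avg_degree([("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")])
```

A lower density with a moderate average degree, as reported for SCACA, indicates sparser but still well-connected clusters.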
... Using CCA to refine relational citation analysis is a promising research direction because citation relevance has much room for in-depth exploration and refinement using full-text information. At the paper level, Gipp and Beel (2009), Liu and Chen (2011), Boyack et al. (2013), and Colavizza et al. (2018) found that integrating the proximity of co-citation sentences improved accuracy in document retrieval or clustering; in addition, Habib and Afzal (2019) used the citation's section location to refine BC, which resulted in better document recommendation. At the author level, Jeong et al. (2014) proposed the content-based ACA method, which weights author co-citation strength by the TF-IDF vector similarity between co-citation sentences. ...
... An evaluation method for network clustering based on node content characteristics was applied to demonstrate the difference between ABCA and EABCA, which refers to Boyack et al. (2013). In our study, the coherence gain is defined as (10): Gain(G) = Σ_{c∈G} [S(c) − S(c′)], where G = {c_1, c_2, …, c_k} denotes the communities that have been discovered, S(c) denotes the mean cosine similarity of the MeSH pool vectors between all nodes in community c and the centroid of c, and c′ denotes a randomly generated community with the same size as c. ...
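The coherence-gain evaluation described in the excerpt can be sketched directly from its definitions. This sketch assumes the gain for each community is the difference between S(c) and the S value of a randomly drawn community of the same size, summed over communities; vector handling and all names are illustrative.

```python
import math
import random

def mean_sim_to_centroid(vectors):
    """S(c): mean cosine similarity between each member's vector and the
    community centroid."""
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = (math.sqrt(sum(x * x for x in a))
               * math.sqrt(sum(y * y for y in b)))
        return num / den if den else 0.0
    return sum(cos(v, centroid) for v in vectors) / len(vectors)

def coherence_gain(communities, all_vectors, rng):
    """For each discovered community, S(c) minus S of a random community of
    the same size, summed over all communities."""
    gain = 0.0
    for c in communities:
        random_c = rng.sample(all_vectors, len(c))
        gain += mean_sim_to_centroid(c) - mean_sim_to_centroid(random_c)
    return gain

vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
```

A community whose members all share one vector has S(c) = 1; a "community" that is just a random sample of the whole corpus gains nothing.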
Article
Full-text available
Author bibliographic coupling analysis (ABCA) is an extension of bibliographic coupling theory at the author level and is widely used in mapping intellectual structures and scholarly communities. However, the assumption of equal citations and the complete dependence on explicit counts may affect its effectiveness in today's complex context of discipline development. This research proposes a new approach that uses multiple types of full-text data to improve ABCA, called enhanced author bibliographic coupling analysis (EABCA). By mining the semantic and syntactic information of citations, the new approach considers more diverse dimensions as the basis of author bibliographic coupling strength. Comparative empirical research was then conducted in the field of oncology. The results show that the new approach can more accurately reveal the relevant relations between authors and map a more detailed domain intellectual structure.
... As representation learning techniques improve, textual corpora may serve as another important supplement. Citations in full text have been shown to offer the opportunity to significantly improve the quality of the signals upon which many co-citation models are based (Liu et al. 2013; Boyack et al. 2013). Previous research is widely cited in scholarly articles, and similar research tends to be presented in close groupings when authors write their manuscripts. ...
... Wang et al. (2019) identified cited text spans with an improved balanced ensemble model. Boyack et al. (2013) found that accounting for reference proximity from full text can increase the textual coherence of a co-citation cluster solution over approaches based on bibliographic information alone. Liu et al. (2013) extracted citation contexts from a large number of full-text publications, and then used publication and citation topic distributions to generate a citation graph with vertex prior and edge transition probability distributions.
Article
Full-text available
Scholarly community detection has important applications in various fields. Current studies rely heavily on structured scholar networks, which have high computational complexity and are challenging to construct in practice. We propose a novel approach that can detect disjoint and overlapping scholarly communities directly from large textual corpora. To the best of our knowledge, this is the first study intended to detect communities directly from unstructured texts. In general, academic articles tend to mention related work and researchers. Researchers that are more closely related to each other are mentioned in closer groupings in lines of academic text. Based on this correlation, we propose an intuitive method that measures the mutual relatedness of researchers through their textual distance. First, we extract and disambiguate the researcher names from academic articles. Then, we embed each researcher as an implicit vector and measure the relatedness of researchers by their vector distance. Finally, the communities are identified by vector clusters. We develop and evaluate our method on several real-world datasets. The experimental results demonstrate that our method achieves comparable performance with several state-of-the-art methods.
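The core intuition above, that researchers mentioned close together in text are related, can be sketched without embeddings. Here a toy proximity score replaces the vector-distance measure of the paper; the window size, the 1/(1+gap) decay, and all names are assumptions of this example.

```python
from collections import defaultdict
from itertools import combinations

def name_relatedness(token_positions, window=20):
    """Relatedness of researcher names by textual proximity: every pair of
    mentions within `window` tokens of each other adds 1/(1 + gap) to the
    pair's score.  A toy stand-in for the embedding-based measure."""
    scores = defaultdict(float)
    for a, b in combinations(list(token_positions), 2):
        for pa in token_positions[a]:
            for pb in token_positions[b]:
                gap = abs(pa - pb)
                if gap <= window:
                    scores[tuple(sorted((a, b)))] += 1.0 / (1.0 + gap)
    return dict(scores)

# Token positions of disambiguated name mentions in a corpus line.
pos = {"Small": [5, 40], "Boyack": [8], "Garfield": [100]}
rel = name_relatedness(pos)
```

Communities would then be obtained by clustering this relatedness matrix; names never mentioned near each other simply receive no score.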
... The same notion was reinforced by distinctive research studies such as Moravcsik et al. [10], who found that 40% of citations were perfunctory. Therefore, in recent times, these models have been studied and improved by exploiting citations' textual details [11,12]. Furthermore, the Direct Citation model has been found to be more accurate in representing knowledge taxonomies than Co-citation and Bibliographic Coupling [13]. ...
... The current state-of-the-art systems treat all citations equally [3,4,11]. All the citations are not equally important for the citing paper. ...
Article
Full-text available
Citation-based relevant research paper recommendations can be generated primarily with the assistance of three citation models: (1) Bibliographic Coupling, (2) Co-Citation, and (3) Direct Citation. Millions of new scholarly articles are published every year. This flux of scientific information has made it a challenging task to devise techniques that could help researchers find the most relevant research papers for the paper at hand. In this study, we have deployed an in-text citation analysis that extends the Direct Citation Model to discover the degree-of-relevancy relationship among scientific papers. For this purpose, the relationship between citing and cited articles is categorized into three categories: weak, medium, and strong. As an experiment, around 5,000 research papers were crawled from CiteSeerX. These research papers were parsed for the identification of in-text citation frequencies. Subsequently, 0.1 million references from those articles were extracted, and their in-text citation frequencies were computed. A comprehensive benchmark dataset was established based on a user study. Afterwards, the results were validated with the help of the Least Squares Approximation by Quadratic Polynomial method. It was found that the degree-of-relevancy between scientific papers is a quadratically increasing/decreasing polynomial with respect to the increase/decrease in the in-text citation frequency of a cited article. Furthermore, the results of the proposed model were compared with state-of-the-art techniques by utilizing a well-known measure known as normalized Discounted Cumulative Gain (nDCG). The proposed method received an nDCG score of 0.89, whereas state-of-the-art models such as the Content, Bibliographic-Coupling, and Metadata-based Models acquired nDCG values of 0.65, 0.54, and 0.51, respectively. These results indicate that the proposed mechanism may be applied in future information retrieval systems for better results.
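The nDCG measure used for evaluation above can be computed as follows. This sketch uses the linear-gain variant of DCG, since the abstract does not specify which gain function was applied.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain over a ranked list of relevance grades,
    with linear gains and a log2(rank + 1) discount."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    """nDCG: DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal else 0.0
```

A perfectly ordered ranking scores 1.0; placing low-relevance items first pushes the score below 1.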
... cit.), section information, and words indicating how an author feels about a reference (i.e., citation contexts or sentiments). Full text also contains a relatively high level of detail about motivation, methods, data, instruments, results, and conclusions that authors typically report when documenting and submitting their work for publication (Boyack et al. 2013). ...
... Hu et al. 2013; Boyack et al. 2018), proximity of cited references (e.g. Gipp and Beel 2009; Liu and Chen 2012; Boyack et al. 2013; Kim et al. 2016), citation contexts or sentiments (e.g. Small 2011; Liu and Chen 2013; Ding et al. 2014; Lu et al. 2017), and citation motivation or behavior (e.g. ...
Article
Views and downloads of academic articles have become important supplementary indicators of scholarly impact. It is assumed that linguistic characteristics have some influence on article views and downloads. To understand this relationship, this study selected 63,002 full-text articles published from 2014 to 2015 in the PLoS (Public Library of Science) journals (PLoS Biology, PLoS Computational Biology, PLoS Genetics, PLoS Medicine, PLoS Neglected Tropical Diseases, PLoS One and PLoS Pathogens), and introduced seven indicators (title length, abstract length, full-text length, sentence length, lexical diversity, lexical density and lexical sophistication) to measure the linguistic characteristics of articles, grouped into Top 20% viewed and downloaded (a proxy for highly browsed and downloaded articles), total, and Bottom 20% viewed and downloaded categories. The results suggested that most linguistic characteristics played little role in article views and downloads in our data sets in general, but some characteristics (e.g. title length and average sentence length) played a certain role for specific PLoS journals and platforms (the PLoS platform or the PubMed Central platform). Journal and platform differences regarding the linguistic characteristics of highly viewed and downloaded articles also existed.
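A few of the seven linguistic indicators can be operationalized simply. The tokenization and sentence-splitting rules below are assumptions of this sketch, not the study's exact definitions.

```python
import re

def lexical_indicators(text):
    """Three illustrative indicators: full-text length in tokens, lexical
    diversity as the type/token ratio, and average sentence length."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s]
    tokens = re.findall(r"[a-z]+", text.lower())
    diversity = len(set(tokens)) / len(tokens) if tokens else 0.0
    avg_sentence_len = len(tokens) / len(sentences) if sentences else 0.0
    return {"length": len(tokens),
            "diversity": diversity,
            "avg_sentence_len": avg_sentence_len}

ind = lexical_indicators("Views matter. Downloads matter too.")
```

Lexical density and sophistication would additionally require part-of-speech tags and a word-frequency list, which this sketch omits.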
... They compared different citation-based relatedness measures (Boyack & Klavans, 2010; Klavans & Boyack, 2017), including relatedness measures that take advantage of full-text data (Boyack, Small, & Klavans, 2013), as well as different text-based relatedness measures (Boyack et al., 2011). To evaluate the accuracy of clustering solutions, they used grant data, textual similarity (Boyack & Klavans, 2010; Boyack et al., 2011, 2013), and more recently also the reference lists of 'authoritative' publications, defined as publications with at least 100 references (Klavans & Boyack, 2017). Our aim in this paper is to introduce a principled methodology for performing analyses similar to the ones mentioned above. We restrict ourselves to the use of one specific clustering technique, namely the technique introduced in the bibliometric literature by Waltman and Van Eck (2012), but we allow the use of any measure of the relatedness of publications. ...
Preprint
Full-text available
There are many different relatedness measures, based for instance on citation relations or textual similarity, that can be used to cluster scientific publications. We propose a principled methodology for evaluating the accuracy of clustering solutions obtained using these relatedness measures. We formally show that the proposed methodology has an important consistency property. The empirical analyses that we present are based on publications in the fields of cell biology, condensed matter physics, and economics. Using the BM25 text-based relatedness measure as evaluation criterion, we find that bibliographic coupling relations yield more accurate clustering solutions than direct citation relations and co-citation relations. The so-called extended direct citation approach performs similarly to or slightly better than bibliographic coupling in terms of the accuracy of the resulting clustering solutions. The other way around, using a citation-based relatedness measure as evaluation criterion, BM25 turns out to yield more accurate clustering solutions than other text-based relatedness measures.
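The BM25 text-based relatedness measure used as an evaluation criterion above treats one publication's terms as a query against another publication. A self-contained sketch, using standard BM25 with the common +1 idf smoothing; the parameter values are textbook defaults, not necessarily the paper's:

```python
import math
from collections import Counter

def bm25_score(query_doc, target_doc, corpus, k1=1.2, b=0.75):
    """BM25 relatedness of two publications: `query_doc`'s terms are scored
    against `target_doc`, with idf computed over the whole corpus."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(target_doc)
    score = 0.0
    for term in set(query_doc):
        df = sum(1 for d in corpus if term in d)
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
        f = tf[term]
        score += (idf * f * (k1 + 1)
                  / (f + k1 * (1 - b + b * len(target_doc) / avgdl)))
    return score

# Toy corpus of tokenized publications.
d1 = ["cell", "biology", "cluster"]
d2 = ["cell", "biology", "method"]
d3 = ["economics", "market", "price"]
corpus = [d1, d2, d3]
```

Publications sharing no terms score zero, so BM25 relatedness naturally yields a sparse publication-similarity network for clustering.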
... CPA considers a set of citations more closely related to one another when they occur within the same sentence than when they merely occur within the same section. In addition, Boyack et al. [31] presented techniques that use the distance between citations. However, instead of using sentence structure to measure distance, they used character or byte offsets, proposing four different schemes for the same objective. ...
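Character-offset weighting admits several natural shapes. The four below (threshold, linear, inverse, exponential) are illustrative candidates only, not the four schemes of Boyack et al. [31]; the 5,000-character scale is likewise an assumption.

```python
import math

def weight(d, scheme="linear", scale=5000):
    """Turn a character-offset distance d between two in-text citations
    into a co-citation weight, under four illustrative schemes."""
    if scheme == "binary":
        return 1.0 if d <= scale else 0.0       # hard cutoff
    if scheme == "linear":
        return max(0.0, 1.0 - d / scale)        # linear decay to zero
    if scheme == "inverse":
        return 1.0 / (1.0 + d / scale)          # slow hyperbolic decay
    if scheme == "exponential":
        return math.exp(-d / scale)             # smooth exponential decay
    raise ValueError(scheme)
```

All four agree that co-located citations get full weight and differ only in how quickly weight falls off with distance.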
... Garfield and Merton (1979) highlight citation analysis as an effective way to identify the most important publications in a research field. Co-citation analysis, introduced by Small (1973), explores connections among publications that co-occur in other articles' reference lists and is an essential process in cluster analysis (Boyack et al., 2013). Callon et al. (1983) introduce co-word analysis to explain the relationships between various stages of innovation and to determine whether fundamental or applied research is the driving force behind these stages. ...
Article
Full-text available
The field of quantitative finance has been rapidly growing in both academia and practice. This article applies bibliometric analysis to investigate the current state of quantitative finance research. A comprehensive dataset of 2,723 publications from the Web of Science Core Collection database, from 1992 to 2022, is collected and analyzed. CiteSpace and VOSViewer are adopted to visualize the bibliometric analysis. The article identifies the most relevant research in quantitative finance by journals, articles, research areas, authors, institutions, and countries. The study further identifies emerging research topics in quantitative finance, e.g. deep learning, neural networks, quantitative trading, and reinforcement learning. This article contributes to the literature by providing a systematic overview of the developments, trajectories, objectives, and potential future research topics in the field of quantitative finance.
... They concluded that the best citation-based and text-based approaches have similar accuracy, but the hybrid approach outperformed both. In their later work, which considered the relationship between reference similarity and reference proximity (their relative positions in the text) (Gipp & Beel, 2009), Boyack, Small, and Klavans (2013) found an increase in accuracy when incorporating reference proximity into the co-citation model. ...
Article
Full-text available
Scientific research is an essential stage of the innovation process. However, it remains unclear how a scientific idea becomes applied knowledge and, after that, a commercial product. This paper describes a hypothesis of innovation based on the emergence of new research fields from more mature research fields after interactions between the latter. We focus on graphene, a rising field in materials science, as a case study. First, we used a co-clustering method on titles and abstracts of graphene papers to organize them into four meaningful and robust topics (theory and experimental tests, synthesis and functionalization, sensors, supercapacitors and electrocatalysts). We also demonstrated that they emerged in the order listed. We then tested all topics against the literature on nanotubes and batteries, the possible parent fields of theory and experimental tests, as well as supercapacitors and electrocatalysts. We found incubation signatures for all topics in the nanotube papers collection and weaker incubation signatures for supercapacitors and electrocatalysts in the battery papers collection. Surprisingly, we found and confirmed that the 2004 breakthrough in graphene created a stir in both the nanotube and battery fields. Our findings open the door for a better understanding of how and why new research fields coalesce. Peer Review https://publons.com/publon/10.1162/qss_a_00193
... The methodology used in the branch of bibliometrics deals with the use of quantitative tools for the evaluation of scientific production in indexed databases, using various metric indicators of production, collaboration, and impact [29]. The objective is to perform a distribution by dividing the elements to be evaluated (journals, keywords, authors, countries, and documents) into different groups to produce a visual representation of the classification obtained [30]. Bibliometric studies are important because they have the advantage of introducing a review process with systematic, reproducible, and transparent analysis, resulting in an improvement in the quality and prestige of the reviews. ...
Article
Full-text available
Objective: To perform a bibliometric analysis of the scientific research on the development of vaccines against dental caries. Methods: Scientific production published on the development of vaccines against dental caries between 2011 and 2020 was extracted from the Scopus database. Microsoft Excel was used for the elaboration of tables and SciVal for the bibliometric analysis of the data, which were divided into indicators of production, impact, and collaboration. Finally, VOSviewer was used for co-occurrence analysis of keywords and collaborative networks. Results: 106 studies conducted on the development of dental caries vaccines within the years 2011-2020 were retrieved from the Scopus database. Wuhan University, in China, was the university with the highest scientific production on the subject, with 4 publications. Regarding the most productive journals, the first place was occupied by the Journal of Dental Research with 7 publications. The highest percentage of the documents analyzed was in quartile 1 journals and in the national collaboration pattern. Conclusion: Most of the manuscripts regarding the development of vaccines against dental caries were published in China and in Q1 journals. In addition, Yan Huimin, Yang Jingyi, Zhou Dihan, Yang Yi, Li Yuhong and Fan Mingwen were found to top the list of most productive authors. The Journal of Dental Research was also identified as the most productive and most cited journal.
... To analyze the structure and relationships of our scientific research field, bibliometric techniques such as co-citation and bibliographic coupling will be employed, and to discover the main concepts and topics covered in our research field, techniques such as co-word analysis and conceptual or strategic maps will be applied. Furthermore, this research uses science mapping techniques (Boyack et al., 2013; Calero Medina & Van Leeuwen, 2012; Small, 1999) to represent the cognitive structure of our research field based on relationships and graphs determined by the aforementioned analyses, such as co-citation, coupling, or co-word, similar to the analysis of social networks. ...
Article
Full-text available
This article aims to show how the application of Big Data techniques in accounting to monitor international cooperation projects is a greenfield in the academic world. To obtain an exhaustive vision of the state of the art in academic research in this field, a bibliometric analysis has been carried out, based on multiple Web of Science searches, with a focus on international development, Big Data, and accounting, adding the holistic vision of the 17 SDGs or “Sustainable Development Goals” of the UN Agenda 2030. Research on Big Data, international development, and accounting is a new field that started in 2015, although academic literature is still scarce. Publications related to the SDGs also begin on that date, but with much more prolific academic literature, without explicit references to the use of Big Data in accounting. The article finds deficiencies in existing academic research compared to other enterprise fields in which Big Data techniques are much more developed; international organization reports lead this line of research, as opposed to the scholarly world. The main practical implication derived from the paper is the need to examine real cases of use outside the academic sphere as a starting point to develop this line of research. The development of this research area will help NPOs and governments achieve better accounting to evaluate the impact of their initiatives and cooperation projects. In addition to the bibliometric techniques used for the analysis of the main publications, authors, and relevant topics in this area of study, the authors consider it a challenge and an opportunity to take the plunge into this field from the academic world, which will undoubtedly improve decision-making in international development, emphasizing the need to gain momentum given the field's current greenfield state.
... Co-citation of references means that two (or more) papers are cited by one or more subsequent papers at the same time [37], which is vitally important for demonstrating the research front with greater detail and accuracy [29]. To better understand the research literature's influence and connections in the field, the top 10 most cited papers were identified and are listed in Table 5. ...
Article
Full-text available
Objective: This study aimed to analyze the progression and trends of research on multimorbidity in the elderly in China and internationally from a bibliometric perspective, and to compare differences in hotspots and research fronts. Methods: Publications between January 2001 and August 2021 were retrieved from the WOS and CNKI databases. EndNote 20 and VOSviewer 1.6.8 were used to summarize bibliometric features, including publication years, journals, and keywords, and co-occurrence maps of countries, institutions, and keywords were drawn. Results: 3857 research papers in English and 664 research papers in Chinese were included in this study. The development trends of research on multimorbidity in the elderly are fully synchronized in China and other countries, and can be divided into a germination period, a development period, and a prosperity period. The research literature in English was found to be mainly focused on public health, with high impact factors; in China, however, most research papers are in general medicine and geriatrics, with fewer core journals. Co-occurrence analysis based on countries and institutions showed that the most productive areas were the United States, Canada, the United Kingdom, and Australia, while Chinese researchers have made little contribution. The clustering analysis of high-frequency keywords in China and around the globe shows that the hotspots have shifted from individual multimorbidity to group multimorbidity management. Among the top 10 highly cited articles and authors, Barnett K's article published in The Lancet in 2012 is regarded as a milestone in the field. Conclusion: Multimorbidity in the elderly is attracting more attention worldwide. Although China lags behind global research, the movement of research fronts from disease-centered to patient-centered care, and from individual management to population management, is consistent.
... In several similar analyses, content-based analysis is incorporated into co-citation analysis, and it gives better accuracy for research article recommendation [45]. Herein, co-citation analysis considers two articles related when a particular article cites them both. ...
Article
Full-text available
A research article recommendation approach aims to recommend appropriate research articles to analogous researchers to help them better grasp a new topic in a particular research area. Given the sheer number of research articles accessible on the web, it is tedious to recommend a relevant article to a researcher who strives to understand a particular article. Most of the existing approaches for recommending research articles are metadata-based, citation-based, bibliographic-coupling-based, content-based, or collaborative-filtering-based. They require a large amount of data and do not recommend reference articles to a researcher who wants to understand a particular article by going through its reference articles. Therefore, an approach that can recommend reference articles for a given article is needed. In this paper, a new multi-level chronological learning-based approach is proposed for recommending research articles to understand the topics/concepts of an article in detail. The proposed method utilizes the TeKET keyphrase extraction technique, which performs better than other unsupervised techniques in extracting keyphrases from articles. Cosine and Jaccard similarity measures are employed to calculate the similarity between the parent article and its reference articles using the extracted keyphrases. The cosine similarity measure outperforms the Jaccard similarity measure for finding and recommending relevant articles. The performance of the recommendation approach is satisfactory, with an NDCG value of 0.87. The proposed approach can play an essential role alongside other existing approaches to recommend research articles.
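The cosine and Jaccard comparison over extracted keyphrases can be sketched directly. Treating each keyphrase as an atomic token is a simplifying assumption of this example, as are the sample phrase lists.

```python
import math
from collections import Counter

def jaccard(phrases_a, phrases_b):
    """Jaccard similarity of two keyphrase sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(phrases_a), set(phrases_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(phrases_a, phrases_b):
    """Cosine similarity of keyphrase count vectors."""
    ca, cb = Counter(phrases_a), Counter(phrases_b)
    num = sum(ca[t] * cb[t] for t in ca)
    den = (math.sqrt(sum(v * v for v in ca.values()))
           * math.sqrt(sum(v * v for v in cb.values())))
    return num / den if den else 0.0

# Hypothetical keyphrases for a parent article and one of its references.
parent_phrases = ["keyphrase extraction", "article recommendation", "cosine similarity"]
ref_phrases = ["keyphrase extraction", "article recommendation", "ranking"]
```

Cosine additionally rewards repeated keyphrases, which Jaccard's set view ignores; that difference is one plausible reason the two measures rank reference articles differently.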
... Our approach can also be understood as a generalized BC and CC that refines the longitudinal coupling (Small, 1997) and the relative BC/CC (Egghe and Rousseau, 2002), which enhances science mapping with indirect connections using temporal groups and lattice properties, respectively. Although non-citation-based similarities have also been used to measure indirect similarities, such as textual similarity (Boyack et al., 2013) and second-order similarity (Colliander and Ahlgren, 2012; Thijs et al., 2013), it is difficult to consider them as fundamental solutions to the problem of missing citation linkages. They are meta-similarities of first-order similarities, essentially subordinate to first-order similarities. ...
Preprint
Bibliographic coupling (BC) and co-citation (CC) are the two most common citation-based coupling measures of similarity between scientific items. One can interpret these measures as second-neighbor relations distinguished by the direction of the citation: BC is a similarity between two citing items, whereas CC is that between two cited items. A previous study proposed a two-layer node split network that can emulate clusters of coupling measures in a computationally efficient manner; however, the lack of intralayer links makes it impossible to obtain exact similarities. Here, we propose novel methods to estimate intralayer similarity on a node split network using personalized PageRank and neural embedding. We demonstrate that the proposed measures are strongly correlated with the coupling measures. Moreover, our proposed method can yield precise similarities between items even if they are distant from each other. We also show that many links with high similarity are missing in the original BC/CC network, which suggests that it is essential to consider long-range similarities. Comparative experiments on global and local edge sampling suggest that local sampling is stable for both similarities in node split networks. This analysis offers valuable insights into the process of searching for significantly related items regarding each coupling measure.
... They proposed automatic processing of the citation context of the cited paper to find the most relevant documents (Simone, Siddharthan & Tidhar, 2006; Dain, Iida & Tokunaga, 2009). Moreover, citation proximity, citation order analysis, and bytecode usage of in-text citations of the cited papers in the citing paper have also been proposed recently to identify relevant documents (Khan, Shahid & Afzal, 2018; Mehmood et al., 2019; Raja & Afzal, 2019; Boyack, Small & Klavans, 2013). ...
Article
Full-text available
For the past half-century, identification of relevant documents has been an active area of research due to the rapid increase of data on the web. The traditional models to retrieve relevant documents are based on bibliographic information such as bibliographic coupling, co-citations, and direct citations. However, in the recent past, the scientific community has started to employ textual features to improve existing models' accuracy. In our previous study, we found that analysis of citations at a deep level (i.e., content level) can play a paramount role in finding more relevant documents than analysis at the surface level (i.e., just bibliography details). We found that cited and citing papers have a high degree of relevancy when the in-text citation frequency of the cited paper in the citing paper's text is more than five. This paper is an extension of our previous study in terms of its evaluation on a comprehensive dataset. Moreover, the study results are also compared with other state-of-the-art approaches, i.e., content, metadata, and bibliography. For evaluation, a user study is conducted on selected papers from 1,200 documents (comprising about 16,000 references) of an online journal, Journal of Computer Science (J.UCS). The evaluation results indicate that in-text citation frequency attains higher precision in finding relevant papers than other state-of-the-art techniques such as content, bibliographic coupling, and metadata-based techniques. The use of in-text citations may help in enhancing the quality of existing information systems and digital libraries. Further, more sophisticated measures may be defined by considering the use of in-text citations.
... The analysis has resulted in four clusters, and the cluster size is small, indicating increased accuracy. [22] A majority of the cited references are in the red and green clusters, with 29 items and 19 items, respectively. The red cluster focused mainly on the onset of lean and lean thinking. ... The proximity of the clusters indicates a strong relationship, as they are frequently cited. ...
Article
Full-text available
The study aims to investigate lean manufacturing from a bibliometric perspective. A systematic analysis was performed on 1893 documents extracted from the Clarivate Analytics Web of Science (WoS) Collection under the “Lean Manufacturing” (LM) subject category between 1970 and 2020. The R programming language is utilized for determining the evolution and progress of lean through the topmost contributing publications, authors, journals, and countries. The network representation is performed using VOS viewer, highlighting the focus areas of LM with co-occurrence, co-citation, and bibliographic coupling of the extracted data. The results indicate LM was initially focused predominantly on topics such as lean thinking and organizational performance by various authors. Later, the LM topic was integrated with six sigma, sustainability, and environmental assessments for overall process improvements. The major contributors in this research area are the United States of America and India. The study attempts to fully comprehend LM, its application, and trends to promote further research. Copyright © The Author(s). 2021 This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
... The edge cut value of B is 0.85 and the edge cut value of C is 0.9. When there is a large number of nodes, it is common practice to keep the top 15 highest-weight edges per node in order to reduce the computational cost while maintaining as much structure as possible (Boyack & Klavans, 2014; Boyack, Small, & Klavans, 2013; Boyack et al., 2011). The Ochiai coefficient (Zhou & Leydesdorff, 2016) was used for the normalization of co-citation relatedness. ...
Article
Full-text available
Purpose The goal of this study is to explore whether deep learning based embedding models can provide a better visualization solution for large citation networks. Design/methodology/approach Our team compared the visualization approach borrowed from the deep learning community with the well-known bibliometric network visualization for large scale data. 47,294 highly cited papers were visualized by using three network embedding models plus the t-SNE dimensionality reduction technique. In addition, three base maps were created with the same dataset for evaluation purposes. All base maps used the classic OpenOrd method with different edge cutting strategies and parameters. Findings The network embedding maps with t-SNE preserve a very similar global structure to the full-edges classic force-directed map, while the maps vary in local structure. Among them, the Node2Vec model has the best overall visualization performance: the local structure has been significantly improved, and the maps’ layout has very high stability. Research limitations The computational and time costs of training network embedding models to obtain high-dimensional latent vectors are very high. Only one dimensionality reduction technique was tested. Practical implications This paper demonstrates that network embedding models are able to accurately reconstruct a large bibliometric network in the vector space. In the future, apart from network visualization, many classical vector-based machine learning algorithms can be applied to network representations for solving bibliometric analysis tasks. Originality/value This paper provides the first systematic comparison of classical science mapping visualization with network embedding based visualization on a large scale dataset. We showed that a deep learning based network embedding model with t-SNE can provide a richer, more stable science map. We also designed a practical evaluation method to investigate and compare maps.
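The Ochiai normalization and top-k edge pruning mentioned in the excerpt above can be sketched as follows. This is a hedged illustration of the standard formulas, not the cited papers' code; `ochiai` and `prune_top_k` are hypothetical helper names:

```python
import math
from collections import defaultdict

def ochiai(co_count, count_i, count_j):
    """Ochiai (cosine-type) normalization of a raw co-citation count:
    co-occurrences divided by the geometric mean of the two citation counts."""
    if count_i == 0 or count_j == 0:
        return 0.0
    return co_count / math.sqrt(count_i * count_j)

def prune_top_k(edges, k=15):
    """Keep, for each node, its k highest-weight incident edges.
    edges: dict mapping an (u, v) pair to its weight. An edge survives
    if it is in the top k of either endpoint (the usual union semantics)."""
    incident = defaultdict(list)
    for (u, v), w in edges.items():
        incident[u].append(((u, v), w))
        incident[v].append(((u, v), w))
    keep = set()
    for node, lst in incident.items():
        lst.sort(key=lambda item: -item[1])
        keep.update(edge for edge, _ in lst[:k])
    return {edge: edges[edge] for edge in keep}
```

For example, a paper cited 9 times and one cited 4 times, co-cited 3 times, get an Ochiai relatedness of 3 / sqrt(36) = 0.5.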
... Full text contains additional information that has not been available in bibliographic data. Full text also contains a relatively high level of detail about motivation, methods, data, instruments, results, and conclusions that authors typically report when documenting and submitting their work for publication (Boyack et al. 2013). ...
Article
Full-text available
Scientific writings, as one essential part of human culture, have evolved over centuries into their current form. Knowing how scientific writings evolved is particularly helpful in understanding how trends in scientific culture developed. It also allows us to better understand how scientific culture was interwoven with human culture generally. The availability of massive digitized texts and the progress in computational technologies today provides us with a convenient and credible way to discern the evolutionary patterns in scientific writings by examining the diachronic linguistic changes. The linguistic changes in scientific writings reflect the genre shifts that took place with historical changes in science and scientific writings. This study investigates a general evolutionary linguistic pattern in scientific writings. It does so by merging two credible computational methods: relative entropy, and word-embedding concreteness and imageability. It thus creates a novel quantitative methodology and applies this to the examination of diachronic changes in the Philosophical Transactions of the Royal Society (PTRS, 1665-1869). The data from the two computational approaches can be well mapped to support the argument that this journal followed the evolutionary trend of increasing professionalization and specialization. But it also shows that language use in this journal was greatly influenced by historical events and other socio-cultural factors. This study, as a "culturomic" approach, demonstrates that the linguistic evolutionary patterns in scientific discourse have been interrupted by external factors even though this scientific discourse would likely have cumulatively developed into a professional and specialized genre. The approaches proposed by this study can make a great contribution to full-text analysis in scientometrics.
... Clustering by co-citation also considers weak inter-cluster interactions that involve modifications to standard clustering approaches (Boyack and Klavans 2010; Boyack et al. 2013; Small and Griffith 1974; Small and Sweeney 1985). We used a modification of variable level clustering combined with agglomerative clustering, an approach developed by Small and Sweeney (1985) for co-citation analysis. ...
Article
Computer science has experienced dramatic growth and diversification over the last twenty years. Towards a current understanding of the structure of this discipline, we analyze a large sample of the computer science literature from the DBLP database. For insight on the features of this cohort and the relationship within its components, we have constructed article level clusters based on either direct citations or co-citations, and reconciled them with major and minor subject categories in the All Science Journal Classification. We describe complementary insights from clustering by direct citation and co-citation, and both point to the increase in computer science publications and their scope. Our analysis reveals cross-category clusters, some that interact with external fields, such as the biological sciences, while others remain inward looking. Overall, we document an increase in computer science publications and their scope.
... Cluster analysis and classification of citations using networks were conducted by Small et al. (1985), Kejzjar et al. (2010), Boyack et al. (2013), Shiau et al. (2017), and many others. Apart from these studies, we take a different approach and study the frequency of citations using a Poisson mixture model with concomitant variables. ...
Article
This paper explores the role of gender gap in the actuarial research community with advanced data science tools. The web scraping tools were employed to create a database of publications that encompasses six major actuarial journals. This database includes the article names, authors’ names, publication year, volume, and the number of citations for the time period 2005–2018. The advanced tools built as part of the R software were used to perform gender classification based on the author’s name. Further, we developed a social network analysis by gender in order to analyze the collaborative structure and other forms of interaction within the actuarial research community. A Poisson mixture model was used to identify major clusters with respect to the frequency of citations by gender across the six journals. The analysis showed that women’s publishing and citation networks are more isolated and have fewer ties than male networks. The paper contributes to the broader literature on the “Matthew effect” in academia. We hope that our study will improve understanding of the gender gap within the actuarial research community and initiate a discussion that will lead to developing strategies for a more diverse, inclusive, and equitable community.
... It was successfully applied to certain applications, such as support of transdisciplinary research [27], professional similarity analysis for authors of articles [28], classification of webpages [2,29], and patent analysis [30]. A typical way to improve CC is to consider the proximity of a1 and a2 in the citing articles [31][32][33][34] and the context passages for a1 and a2 in the citing articles [35]. However, the proximity and the context passages need to be collected from the full texts of the articles, which are often not publicly available. ...
Article
Full-text available
Bibliographic coupling (BC) is a similarity measure for scientific articles. It is based on the expectation that two articles that cite a similar set of references may focus on related (or even the same) research issues. For analysis and mapping of scientific literature, BC is an essential measure, and it can also be integrated with different kinds of measures. Further improvement of BC is thus of both practical and technical significance. In this paper, we propose a novel measure that improves BC by tackling its main weakness: two related articles may still cite different references. Category-based cocitation (category-based CC) is proposed to estimate how these different references are related to each other, based on the assumption that two different references may be related if they are cited by articles in the same categories about specific topics. The proposed measure is thus named BCCCC (Bibliographic Coupling with Category-based Cocitation). The performance of BCCCC is evaluated by experimentation and a case study. The results show that BCCCC performs significantly better than state-of-the-art variants of BC in identifying highly related articles, which report conclusive results on the same specific topics. An experiment also shows that BCCCC provides helpful information to further improve a biomedical search engine. BCCCC is thus an enhanced version of BC, which is a fundamental measure for retrieval and analysis of scientific literature.
... Furthermore, there are studies combining citation-based approaches and content-based approaches [19], [20]. Klavans and Boyack [4] combined citation-based and content-based approaches, in which 91,726 clusters (topics) created by clustering a direct citation network are located on a map based on content similarities of the topics, and 314,000 project titles and descriptions retrieved from STAR METRICS are classified into the topics based on the content similarities. ...
Article
Full-text available
Maps of science visualizing the structure of science help us analyze the current spread of science, technology, and innovation (ST&I). ST&I enterprises can use the maps of science as competitive technical intelligence to anticipate changes, especially those initiated in their immediate vicinity. Research laboratories and universities can understand their environmental changes and use the map for their research management. However, traditional maps based on bibliometrics, such as citation and cocitation, have difficulty in representing recently published papers and ongoing projects that have few or no references; thus, maps based on contents, i.e., text-mining, have been developed in recent years for locating research papers/projects, for example, using word and paragraph vectors. The content-based maps, however, still pose difficulty in comparing documents in different languages. Therefore, aiming to construct a bilingual (English and Japanese) content-based map of science for the analyses of ST&I information resources in different languages, this article proposes a method for creating word and paragraph vectors corresponding to bilingual textual information in the same multidimensional space. In a comparison of 11 methods for generating document vectors, we confirmed that the best method achieved 87% accuracy of the bilingual content matching based on 1000 IEEE papers. Finally, we published a map of approximately 15,000 funding projects of the National Science Foundation, Japan Society for the Promotion of Science, and Japan Science and Technology agency from 2013 to 2017.
... Schwarzer et al. (2016) used this approach to search for Wikipedia articles, where co-links are used instead of co-citation linkages. Also, this co-citation context has recently been used in the research area of 'mapping science', which is not document search but another main usage of co-citation relationships (Boyack et al., 2013;Chu & Yeh, 2016;Hsiao & Chen 2017;Kim et al., 2016;Liu & Chen 2012). ...
Article
Full-text available
This study proposes a novel extended co-citation search technique, which is graph-based document retrieval on a co-citation network containing citation context information. The proposed search expands the scope of the target documents by repetitively spreading the relationship of co-citation in order to obtain relevant documents that are not identified by traditional co-citation searches. Specifically, this search technique is a combination of (a) applying a graph-based algorithm to compute the similarity score on a complicated network, and (b) incorporating co-citation contexts into the process of calculating similarity scores to reduce the negative effects of an increasing number of irrelevant documents. To evaluate the search performance of the proposed search, 10 proposed methods (five representative graph-based algorithms applied to co-citation networks weighted with/without contexts) are compared with two kinds of baselines (a traditional co-citation search with/without contexts) in information retrieval experiments based on two test collections (biomedicine and computer linguistic articles). The experiment results showed that the normalized discounted cumulative gain (nDCG) scores of the proposed methods using co-citation contexts tended to be higher than those of the baselines. In addition, the combination of the random walk with restart (RWR) algorithm and the network weighted with contexts achieved the best search performance among the 10 proposed methods. Thus, it is clarified that the combination of graph-based algorithms and co-citation contexts is effective in improving the performance of co-citation search techniques, and that sole use of a graph-based algorithm is not enough to enhance search performance over the baselines.
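The nDCG metric used in these retrieval experiments can be sketched generically. This is the standard formula, not code from the study; function and variable names are illustrative:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: graded relevance discounted by log2 of
    (rank + 1), with ranks starting at 1."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """Normalized DCG: DCG of the observed ranking divided by the DCG of
    the ideal (descending) ranking of the same relevance grades."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A perfect ranking scores 1.0; placing highly relevant documents lower in the list pulls the score below 1, which is why nDCG rewards methods that surface relevant co-cited documents early.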
... Other related issues of CC have also been discussed in the following decades (Boyack, Small, & Klavans, 2013; Callahan, Hockema, & Eysenbach, 2010; Eom, 2008; Gmür, 2003). In another study, Zhao & Strotmann map the intellectual structure of information science by ABCA and ACA, as well as compare these results. They report that ABCA provides a more realistic description within a research field and that ACA reveals the external and internal as well as recent and historical influences in a research field (Zhao & Strotmann, 2008b). ...
Conference Paper
This study investigates how to measure subject relationships based on bibliographic coupling strength. Since the 1960s, researchers have used citation analysis methods to discover the relationships between different works and authors. However, how to apply citation-based methods to measure the relationships between various subjects remains unknown. We propose a novel method to measure relationships between subjects based on bibliographic coupling strengths. Our dataset is composed of 7,692 articles published in 10 core information science journals from 2008 to 2017. The result shows that our method provides another viewpoint for exploring the development of science. Furthermore, our method can identify the core works in different subjects and help to judge how similar different subjects are.
... DSI is the citation proximity analysis which is similar as being used in [36]. Another extension to CPA has been proposed by Boyack et al. in paper [37]. Although this technique also contemplates closeness of citation tags, however, the method of finding proximity has been reformed by the authors. ...
Article
Full-text available
For the past several years, finding relevant documents from a plethora of web repositories has been a prime focus of the scientific community. To find relevant research articles, state-of-the-art techniques employ content, metadata, citation, and collaborative filtering based approaches. Among these, citation-based approaches hold strong potential because, most of the time, authors cite relevant papers. Bibliographic coupling is one of the well-known citation-based approaches for recommending relevant papers; it harnesses the number of common references between a pair of documents as the similarity measure, while the distribution of in-text citations within the text is not analyzed. In this paper, we present an approach, SwICS, that explores the in-text citation frequencies within the contents of a paper and the in-text citation patterns between different logical sections for bibliographically coupled papers. For evaluation, the employed data set contains 1,150 research documents obtained from a well-known autonomous citation index, CiteSeer. A comprehensive user study was conducted to build a gold standard for comparing the proposed approach. The approach is compared with the state-of-the-art bibliographic coupling and content similarity based techniques. The comparison results reveal that the proposed approach performs significantly better than the contemporary approaches. The comparison with the gold standard yielded an average correlation of 0.73, and the average gain achieved by the proposed approach over state-of-the-art bibliographic coupling is 60%, whereas the correlation between the gold standard and the content-based approach remains 20%. The proposed approach can play a significant role for search engines and citation indexers in terms of improving the quality of their results.
... In applying a reference-writing style, there are conventions for each type of source, such as books, journal articles, theses, and so on. Accurate use of references maintains the cohesion of the writing (Boyack, Small, & Klavans, 2013). ...
Presentation
Full-text available
This article presents steps and examples for composing a literature synthesis to serve as a reference in article writing. Three steps are used, beginning with searching the literature. Next, the main idea of each article is identified. Finally, the material is written in paraphrase style, not as direct quotation. In this way, when writing an article, the author refers to ideas that have already been published. In its second part, this article also presents examples of composing syntheses in literature reviews.
... Luckily, impressive advances in computational linguistics over the last two decades have made it possible to carry out analysis on the full content of large collections of patent texts. It is clear that text mining analysis of patents can be a game-changing source of information for designers (Boyack et al., 2013). ...
Article
Full-text available
The importance of affordance in engineering design is well established. Artifacts that are able to activate spontaneous and immediate users’ reactions are considered the outcome of good design practice. A huge effort has been made by researchers to understand affordances, yet these efforts have been somewhat elusive. In particular, they have been limited to case studies and experimental studies, usually involving a small subset of affordances. No systematic effort has been carried out to list all known affordance effects. This paper offers preliminary steps toward such an ambitious effort. We propose a set of three different Natural Language Processing approaches for extracting meaningful affordance information from the full text of patents: 1) a simple word search, 2) a lexicon of affordances, and 3) a rule-based system. The results give in-depth measures of how rare affordances are in patents, and a fine-grained analysis of the linguistic construction of affordances. Finally, we show an interesting output of our method, which has detected affordances for disabled people, demonstrating the ability of our system to automatically collect design-relevant knowledge.
... Most of the recently published approaches are hybrid [34][35][36][37] by combining the techniques mentioned above in different ways as well as adding additional features of their own. To some researchers [38,39], the proximity of citations helps locate the related article. They proposed that two articles are probably similar to each other if many articles in nearby areas cite them. ...
Article
Scholars routinely search relevant papers to discover and put a new idea into proper context. Despite ongoing advances in scholarly retrieval technologies, locating relevant papers through keyword queries is still quite challenging due to the massive expansion in the size of the research paper repository. To tackle this problem, we propose a novel real-time feedback query expansion technique, which is a two-stage interactive scholarly search process. Upon receiving the initial search query, the retrieval system provides a ranked list of results. In the second stage, a user selects a few relevant papers, from which useful terms are extracted for query expansion. The newly expanded query is run against the index in real time to generate the final list of research papers. In both stages, citation analysis is involved in further improving the quality of the results. The novelty of the approach lies in the combined exploitation of query expansion and citation analysis that may bring the most relevant papers to the top of the search results list. The experimental results on the Association of Computational Linguistics (ACL) Anthology Network data set demonstrate that this technique is effective and robust for locating relevant papers regarding normalised discounted cumulative gain (nDCG), precision and recall rates than several state-of-the-art approaches.
... However, Liu and Chen [16] did not explain the weight of each level of proximity. Boyack et al. [17] selected relative distance rather than absolute distance to represent the cited scientific documents' proximity. They compared the proximity-based co-citation model with the traditional co-citation model and concluded that the use of the cited scientific documents' proximity information increases the accuracy of co-citation clustering. ...
Article
Full-text available
With the number of published scientific papers increasing exponentially, scientific document clustering is becoming a challenging task. Therefore, a scientific document clustering model with high quality is needed. In this study, we propose an extended citation model for scientific document clustering. On the one hand, the proposed model considers that 1) the high frequency and the wide distribution of a scientific document cited in other documents will result in high similarity between the citing and the cited documents; and 2) the close location of two scientific documents cited in a scientific document will also result in high similarity between these two documents. On the other hand, the proposed model combines citation networks and a textual similarity network to enhance the performance of scientific document clustering. To evaluate the performance of our proposed model, we collect scientific documents from the PMC and PubMed databases in the field of oncology as a case study. It is shown that our proposed model obtains reasonable clustering results by comparing it with traditional scientific document clustering models, such as the traditional bibliographic coupling model and the textual similarity model, according to the indices of precision, recall, and F1-score.
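A proximity-weighted co-citation relationship of the kind discussed above can be sketched as follows. This is a minimal illustration under assumed conventions (character offsets of in-text citations, a linear decay by relative distance), not the exact weighting scheme of any of the cited papers:

```python
from itertools import combinations

def proximity_cocitation(positions, max_pos):
    """Proximity-weighted co-citation within one citing paper.
    positions: dict mapping a reference id to the list of offsets at which
    it is cited in the text; max_pos: the document length in the same units.
    Each reference pair is weighted by 1 - (minimum distance / max_pos),
    so references cited close together count more than distant ones."""
    weights = {}
    for r1, r2 in combinations(sorted(positions), 2):
        dmin = min(abs(p - q)
                   for p in positions[r1]
                   for q in positions[r2])
        weights[(r1, r2)] = max(0.0, 1.0 - dmin / max_pos)
    return weights
```

Two references cited in the same sentence thus receive a weight near 1, while references cited at opposite ends of the paper contribute only weakly, which is the intuition behind proximity-based refinements of plain co-citation counting.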
... The hybrid approaches [43,[57][58][59] combine the best of the citation graph-based and content-based techniques to compute the relevance of documents to the search query. The proximity of citations is supportive in locating related articles [60,61]. Two articles may be similar to each other if many articles in nearby locations cite them. ...
Article
Full-text available
The enormous growth in the size of scholarly literature makes its retrieval challenging. To address this challenge, researchers and practitioners have developed several solutions. These include indexing solutions (e.g., ResearchGate, the Directory of Open Access Journals (DOAJ), and the Digital Bibliography & Library Project (DBLP)), research paper repositories (e.g., arXiv.org and Zenodo), digital libraries, scholarly retrieval systems (e.g., Google Scholar, Microsoft Academic Search, and Semantic Scholar), and publisher websites. Among these, the scholarly retrieval systems, the main focus of this article, employ efficient information retrieval techniques and other search tactics. However, they still fall short of fully meeting users' information needs. This brief review paper is an attempt to identify the main reasons behind this failure by reporting the current state of scholarly retrieval systems. The findings of this study suggest that the existing scholarly retrieval systems should differentiate scholarly users from ordinary users and identify their needs. Citation network analysis should be made an essential part of the retrieval system to improve search precision and accuracy. The paper also identifies several research challenges and opportunities that may lead to better scholarly retrieval systems.
... These are important indicators of scientometrics. Citation analysis is widely applied to identify knowledge structures in many disciplines [33][34][35]. The highest citation frequency indicates that the article has a significant impact in the field, has been highly valued by international peers, and has a high academic level. ...
Article
Full-text available
China's solar energy industry is developing rapidly and China's solar energy research is experiencing a high speed of development alongside it. Is China's solar energy research growth quantity-driven (paper-driven) or quality-driven (citation-driven)? Answering this question is important for China's solar research field and industrial sector, and has implications for China's other renewable research programs. Applying statistical methods, the citation analysis method, and Web of Science data, this study investigated China's solar energy research between 2007 and 2015 from two perspectives: quantity (numbers of papers) and quality (number of paper citations). The results show that the number of Science Citation Index Expanded (SCI-E) papers on solar energy in China has grown rapidly, surpassing the United States to become the world leader in 2015. However, the growth rate in scientific production was consistently higher than the growth rate of the number of times cited. When considering the average number of times a paper was cited among the top ten countries researching solar energy, China was in last place from 2007 to 2015. Further, the impact and effectiveness of China's papers were below the world average from 2010 to 2015, and experienced a sharp decreasing trend. These results suggest that China's solar energy research is a quantitatively driven model, with a mismatch between quantity and quality. New policies should be introduced to encourage high-quality research and achieve a balance between quantity and quality.
Article
Bibliographic coupling (BC) and co-citation (CC) are the two most common citation-based coupling measures of similarity between scientific items. One can interpret these measures as second-neighbor relations distinguished by the direction of the citation: BC is a similarity between two citing items, whereas CC is that between two cited items. A previous study proposed a two-layer node split network that can emulate clusters of coupling measures in a computationally efficient manner; however, the lack of intralayer links makes it impossible to obtain exact similarities. Here, we propose novel methods to estimate intralayer similarity on a node split network using personalized PageRank (PPR) and neural embedding (EMB). We demonstrate that PPR is strongly correlated with the coupling measures. Moreover, our proposed method can yield precise similarities between items even if they are distant from each other. We also show that many links with high similarity are missing in the original BC/CC network, which suggests that it is essential to consider long-range similarities. Comparative experiments on global and local edge sampling suggest that local sampling is stable for PPR in node split networks. This analysis offers valuable insights into the process of searching for significantly related items regarding each coupling measure.
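The personalized PageRank (PPR) similarity idea described above can be illustrated with a minimal sketch: a plain power-iteration PPR on a tiny, invented citation graph (a toy illustration, not the paper's node-split construction; the graph, the bidirectional links, and all values are made up). Two papers that cite the same references should receive a higher PPR score from each other than from an unrelated paper.

```python
def personalized_pagerank(adj, seed, alpha=0.85, iters=100):
    """Power-iteration PPR. adj: {node: [neighbours]}; seed: restart node."""
    nodes = list(adj)
    rank = {n: 0.0 for n in nodes}
    rank[seed] = 1.0
    for _ in range(iters):
        new = {n: 0.0 for n in nodes}
        for n in nodes:
            out = adj[n]
            if not out:                # dangling node: return mass to seed
                new[seed] += alpha * rank[n]
                continue
            share = alpha * rank[n] / len(out)
            for m in out:
                new[m] += share
        new[seed] += 1.0 - alpha       # teleport back to the seed
        rank = new
    return rank

# Toy graph: papers a and b both cite x and y (bibliographic coupling),
# so PPR from a should rank b above the unrelated paper c.
graph = {
    "a": ["x", "y"],
    "b": ["x", "y"],
    "c": ["z"],
    "x": ["a", "b"],   # citation links made bidirectional so mass flows back
    "y": ["a", "b"],
    "z": ["c"],
}
scores = personalized_pagerank(graph, seed="a")
print(scores["b"] > scores["c"])  # True
```

Because the teleport step always returns to the seed, the resulting scores are similarities relative to that one item, which is what makes PPR usable for finding related items even when they are several hops apart.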
Article
The co-opinionatedness measure, that is, the similarity of cociting documents in their opinions about their cocited articles, has been recently proposed. The present study uses a wider range of baselines and benchmarks to investigate the measure’s effectiveness in retrieval ranking that was previously confirmed in a pilot study. A test collection was built including 30 seed documents and their 4702 cocited articles. Their citances and full-texts were analysed using natural language processing (NLP) and opinion mining techniques. Cocitation values, syntactical similarity and contexts similarity were used as baselines. The distributional semantic similarity and the linear and hierarchical Medical Subject Headings (MeSH) similarities served as benchmarks to evaluate the effect of the co-opinionatedness as a boosting factor on the performance of the baselines. The improvements in the rankings were measured by normalised discounted cumulative gain (nDCG). According to the findings, there existed significant differences between the nDCG mean values obtained before and after weighting the baselines by the co-opinionatedness measures. The results of the generalisability study corroborated the reliability and generalisability of the systems. Accordingly, the similarity in the opinions of the cociting papers towards their cocited articles can explain the cocitation relation in the scientific papers network and can be effectively utilised for improving the results of the cocitation-based retrieval systems.
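The nDCG measure used above to quantify ranking improvements is straightforward to compute. A minimal sketch with invented relevance grades (a reranking that moves the most relevant items up should score higher):

```python
import math

def dcg(relevances):
    # rel_i / log2(i + 1), with rank positions starting at 1
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(ranked_rels):
    """Normalised DCG: DCG of the ranking divided by DCG of the ideal order."""
    ideal = dcg(sorted(ranked_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0

baseline = [1, 0, 2, 0, 1]   # relevance grades in baseline ranking order
boosted  = [2, 1, 1, 0, 0]   # same items reranked after the boosting factor
print(ndcg(boosted) > ndcg(baseline))  # True
```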
Chapter
The paper aims to offer a solution for the identification and annotation of impacts in the domain of arts and culture. We explore available (ex post) narratives of impactful interventions in society, such as those contained in the body of scientific papers dealing with topics related to arts and culture, and try to disentangle some meaningful descriptions of impact generation mechanisms, using NLP (Natural Language Processing) techniques based on semantic similarity principles. The type of texts analysed so far are academic papers from peer-reviewed journals focused on the societal impacts of cultural policies and practices. However, the method easily lends itself to being extended to pilot studies and policy documents. Three main categories of societal impact, borrowed from the New European Agenda for Culture, have been considered: impacts on personal well-being, on social cohesion and on urban renovation. Based on prior literature findings, a collection of possible societal impacts was gathered in the form of 100 phrases of two up to eight words. Then we expanded the semantic neighbourhood of each impact utilising continuous space word representations by cosine similarity measures. We show that impacts can be clustered into well-separated and well-defined groups of related concepts. This can be interpreted in two ways: first, the European Agenda points at three, largely independent, impact areas for cultural interventions, with little overlap with one another; second, little is left out of these categories that could still be considered as a separate impact area. Finally, we show that the proposed procedure can be successfully applied to the task of automatic annotation of documents in the domain of arts and culture. Keywords: Impacts of arts and culture; Semantic similarity; Word2vec embeddings; Clustering
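The cosine-similarity expansion step described above can be sketched in a few lines. The 3-dimensional "embeddings" below are invented for illustration (real word2vec vectors have hundreds of dimensions); the point is that phrases from the same impact area should have a higher cosine similarity than phrases from different areas.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical embeddings for impact phrases (values are made up).
vecs = {
    "personal well-being": [0.9, 0.1, 0.0],
    "life satisfaction":   [0.8, 0.2, 0.1],
    "urban renovation":    [0.0, 0.1, 0.9],
}
sim_close = cosine(vecs["personal well-being"], vecs["life satisfaction"])
sim_far = cosine(vecs["personal well-being"], vecs["urban renovation"])
print(sim_close > sim_far)  # True
```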
Chapter
A small number of studies have conducted patent research by building knowledge graphs, but they have neither constructed patent knowledge graphs from patent documents nor combined the latest natural language processing methods to mine the rich semantic relationships hidden in existing patents and predict possible new patents. In this paper, we propose a new patent vacancy prediction approach named PatentMiner to mine rich semantic knowledge and predict new potential patents based on a knowledge graph (KG) and a graph attention mechanism. Firstly, a patent knowledge graph over time (e.g., by year) is constructed by carrying out named entity recognition and relation extraction on patent documents. Secondly, the Common Neighbor Method (CNM), Graph Attention Networks (GAT) and Context-enhanced Graph Attention Networks (CGAT) are proposed to perform link prediction in the constructed knowledge graph to dig out potential triples. Finally, patents are defined on the knowledge graph by means of the co-occurrence relationship; that is, each patent is represented as a fully connected subgraph containing all its entities and their co-occurrence relationships in the knowledge graph. Furthermore, we propose a new patent prediction task that predicts a fully connected subgraph with newly added prediction links as a new patent. The experimental results demonstrate that our proposed patent prediction approach can correctly predict new patents and that Context-enhanced Graph Attention Networks are much better than the baseline.
Article
Background: The recent surge in clinical and nonclinical health-related data has been accompanied by a concomitant increase in personal health data (PHD) research across multiple disciplines such as medicine, computer science, and management. There is now a need to synthesize the dynamic knowledge of PHD in various disciplines to spot potential research hotspots. Objective: The aim of this study was to reveal the knowledge evolutionary trends in PHD and detect potential research hotspots using bibliometric analysis. Methods: We collected 8281 articles published between 2009 and 2018 from the Web of Science database. The knowledge evolution analysis (KEA) framework was used to analyze the evolution of PHD research. The KEA framework is a bibliometric approach that is based on 3 knowledge networks: reference co-citation, keyword co-occurrence, and discipline co-occurrence. Results: The findings show that the focus of PHD research has evolved from medicine centric to technology centric to human centric since 2009. The most active PHD knowledge cluster is developing knowledge resources and allocating scarce resources. The field of computer science, especially the topic of artificial intelligence (AI), has been the focal point of recent empirical studies on PHD. Topics related to psychology and human factors (eg, attitude, satisfaction, education) are also receiving more attention. Conclusions: Our analysis shows that PHD research has the potential to provide value-based health care in the future. All stakeholders should be educated about AI technology to promote value generation through PHD. Moreover, technology developers and health care institutions should consider human factors to facilitate the effective adoption of PHD-related technology.
These findings indicate opportunities for interdisciplinary cooperation in several PHD research areas: (1) AI applications for PHD; (2) regulatory issues and governance of PHD; (3) education of all stakeholders about AI technology; and (4) value-based health care including “allocative value,” “technology value,” and “personalized value.”
Article
Purpose: Co-citation frequency, defined as the number of documents co-citing two articles, is considered a quantitative, and thus efficient, proxy of the subject relatedness or prestige of the co-cited articles. Despite its quantitative nature, it is found effective in retrieving and evaluating documents, signifying its linkage with the related documents' contents. To better understand the dynamism of the citation network, the present study aims to investigate the various content features giving rise to the measure. Design/methodology/approach: The present study examined the interaction of different co-citation features in explaining the co-citation frequency. The features include the co-cited works' similarities in their full texts, Medical Subject Headings (MeSH) terms, co-citation proximity, opinions and co-citances. A test collection was built using the CITREC dataset. The data were analyzed using natural language processing (NLP) and opinion mining techniques. A linear model was developed to regress the objective and subjective content-based co-citation measures against the natural log of the co-citation frequency. Findings: The dimensions of co-citation similarity, either subjective or objective, play significant roles in predicting co-citation frequency. The model can predict about half of the co-citation variance. The interaction of co-opinionatedness and non-co-opinionatedness is the strongest factor in the model. Originality/value: This is the first study to reveal that both objective and subjective similarities can significantly predict the co-citation frequency. The findings re-confirm the citation analysis assumption claiming the connection between the cognitive layers of cited documents and citation measures in general and the co-citation frequency in particular. Peer review: The peer review history for this article is available at https://publons.com/publon/10.1108/OIR-04-2020-0126.
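The regression setup described above, content features predicting the natural log of co-citation frequency, can be sketched with a one-feature ordinary least squares fit. The data below are invented for illustration; the actual study uses several objective and subjective features.

```python
import math

def ols(xs, ys):
    """Simple one-feature OLS; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Invented data: full-text similarity of co-cited pairs vs. their
# co-citation frequency; the model targets log(frequency) as in the study.
similarity = [0.1, 0.2, 0.4, 0.5, 0.7, 0.9]
frequency  = [2, 3, 6, 8, 15, 30]
log_freq = [math.log(f) for f in frequency]
slope, intercept = ols(similarity, log_freq)
print(slope > 0)  # higher similarity predicts higher co-citation frequency
```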
Chapter
Web citation analysis is emerging as an important subject of research in web mining, information retrieval, library science, and related areas. Scientific publications form a significant part of research, and a citation is a reference to a published or unpublished source. Citation analysis is used to evaluate the significance or impact of an author or publication, assessed by the number of times the author or publication has been cited by other related works. It is useful in ascertaining the impact of a research article and in learning more about an area of knowledge. The main objective of this chapter is to provide knowledge about web citation analysis. A brief overview of web citation indexes, citation styles, citation-based metrics, and research challenges is given.
Article
Purpose: This paper aims to present a longitudinal and visualizing study using scientometric approaches to depict the historical changes in the academic community, intellectual base and research hotspots within the business domain. Design/methodology/approach: Two mapping methods are used, namely, co-citation analysis and co-occurrence analysis. Both the co-citation analysis and co-occurrence analysis in this study are conducted using CiteSpace, a Java-based scientific visualization software. Findings: This paper detects changes in academic communities in 24 business journals chosen by the University of Texas at Dallas as leading journals (UTD24) and identifies the research hotspots such as corporate governance, organizational research and capital research. Many authors and academic communities appear in two or even three periods, which indicates the lasting academic vitality of scholars in this field. This paper determines the evolution of scholars' research interests by identifying high-frequency keywords during the entire period. Originality/value: This paper reveals a systematic and holistic picture of the developmental landscape of the business domain, which can provide a potential guide for future research. Furthermore, based on empirical data and knowledge visualization, the intellectual structure and evolution of the business domain can be identified more objectively.
Article
This study explores weighted author co-citation analysis (ACA) through a comparison of results from four weighted citation counting methods. The data set used comprises full-text research articles published in four top-tier library and information science (LIS) journals from 2011 to 2018. It finds that in-text frequency-weighted counting performs as well as traditional counting in identifying major dimensions of the LIS field but also shows more detail. Re-citation-based counting appears to highlight well-integrated specialties and weaken the presence of more fragmented ones compared to traditional counting. In-text frequency-weighted re-citation counting, expected to highlight “deep” impact, appears to effectively zoom into the field to show intense streams of research within it, but fails to identify major dimensions of the field, essentially providing a telescopic view of the LIS field instead of the panoramic one that the other three methods provide. Measuring deep impact may be interesting and important for research evaluation but fails to retain the broader context that makes the visualizations of research fields so informative. It appears that what may be “noise” when considering the impact of individuals can provide the context that allows us to see the forest for the trees when examining intellectual structures of research fields, as in the case of traditional ACA.
Article
Full-text available
There are many different relatedness measures, based for instance on citation relations or textual similarity, that can be used to cluster scientific publications. We propose a principled methodology for evaluating the accuracy of clustering solutions obtained using these relatedness measures. We formally show that the proposed methodology has an important consistency property. The empirical analyses that we present are based on publications in the fields of cell biology, condensed matter physics, and economics. Using the BM25 text-based relatedness measure as evaluation criterion, we find that bibliographic coupling relations yield more accurate clustering solutions than direct citation relations and co-citation relations. The so-called extended direct citation approach performs similarly to or slightly better than bibliographic coupling in terms of the accuracy of the resulting clustering solutions. The other way around, using a citation-based relatedness measure as evaluation criterion, BM25 turns out to yield more accurate clustering solutions than other text-based relatedness measures.
Article
The present study tests a citation counting method that filters out citations in the introduction and background sections and then weights the remaining citations by their in-text frequency. The dataset used comprises articles on bibliometrics available in full text in PubMed Central. This method was inspired by findings from previous studies that in-text frequency indicates the importance of citations and that citations in the Methodology, Results, Discussion, and Conclusions sections tend to be more important to a citing article. We found that this method makes a large difference in author ranking, as suggested by a 0.4 correlation between ranking by this method and that by traditional citation counting. Generally, this method has ranked authors concerned with biomedical issues higher and those focused on bibliometrics or science communication issues lower compared to traditional citation counting. This rank change pattern suggests that this method makes essential citations stand out more, i.e., citations that studies concerning biomedicine are expected to draw on more heavily. This method has also ranked guidelines and theoretical or methodological frameworks for systematic reviews, meta-analyses, knowledge translation, and scoping studies much higher, indicating that bibliometrics has been mostly employed in these types of studies in biomedical fields. However, citation network analysis does not seem to have been employed much, as indicated by key authors in science mapping being ranked much lower by this method, although it has been shown to be informative for these types of studies.
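The counting scheme described above can be sketched in a few lines (section names and citation identifiers are illustrative, not from the study): drop in-text citations appearing in introduction or background sections, then count each cited work once per remaining mention.

```python
from collections import Counter

SKIP_SECTIONS = {"introduction", "background"}

def weighted_counts(in_text_citations):
    """in_text_citations: list of (section_name, cited_id) mentions."""
    counts = Counter()
    for section, cited in in_text_citations:
        if section.lower() not in SKIP_SECTIONS:
            counts[cited] += 1   # weight by in-text frequency
    return counts

# Invented mentions from one citing article.
mentions = [
    ("Introduction", "smith2010"), ("Introduction", "jones2012"),
    ("Methods", "smith2010"), ("Methods", "smith2010"),
    ("Results", "smith2010"), ("Discussion", "jones2012"),
]
counts = weighted_counts(mentions)
print(counts["smith2010"], counts["jones2012"])  # 3 1
```

Under this scheme a work cited repeatedly in Methods and Results outranks one mentioned only in passing, which is the effect the study attributes to "essential" citations.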
Article
Document relational networks have been effective in retrieving and evaluating papers. Despite their effectiveness, relational measures, including co-citation, are far from ideal and need improvement. The assumption underlying the co-citation relation is the content relevance and opinion relatedness of cited and citing papers. This may imply the existence of some kind of co-opinionatedness between co-cited papers, which may be effective in improving the measure. Therefore, the present study tries to test the existence of this phenomenon and its role in improving information retrieval. To do so, based on CITREC, a medical test collection was developed consisting of 30 queries (seed documents) and 4823 of their co-cited papers. Using NLP techniques, the co-citances of the queries and their co-cited papers were analyzed and their similarities were computed by a 4-gram similarity measure. Opinion scores were extracted from co-citances using SentiWordNet. Also, nDCG values were calculated and then compared in terms of the citation proximity index (CPI) and co-citedness measures before and after being normalized by the co-opinionatedness measure. The reliability of the test collection was measured by generalizability theory. The findings suggested that a majority of the co-citations exhibited a high level of co-opinionatedness in that they were mostly similar either in their opinion strengths or in their polarities. Although anti-polar co-citations were not trivial in number, a significantly higher number of the co-citations were co-polar, with a majority being positive. The evaluation of the normalization of the CPI and co-citedness by the co-opinionatedness indicated a generally significant improvement in retrieval effectiveness. While anti-polar similarity reduced the effectiveness of the measure, the co-polar similarity proved to be effective in improving the co-citedness.
Consequently, the co-opinionatedness can be presented as a new document relation and used as a normalization factor to improve retrieval performance and research evaluation.
Article
Digital libraries suffer from the problem of information overload due to immense proliferation of research papers in journals and conference papers. This makes it challenging for researchers to access the relevant research papers. Fortunately, research paper recommendation systems offer a solution to this dilemma by filtering all the available information and delivering what is most relevant to the user. Researchers have proposed numerous approaches for research paper recommendation which are based on metadata, content, citation analysis, collaborative filtering, etc. Approaches based on citation analysis, including co-citation and bibliographic coupling, have proven to be significant. Researchers have extended the co-citation approach to include content analysis and citation proximity analysis and this has led to improvement in the accuracy of recommendations. However, in co-citation analysis, similarity between papers is discovered based on the frequency of co-cited papers in different research papers that can belong to different areas. Bibliographic coupling, on the other hand, determines the relevance between two papers based on their common references. Therefore, bibliographic coupling has inherited the benefits of recommending relevant papers; however, traditional bibliographic coupling does not consider the citing patterns of common references in different logical sections of the citing papers. Since the use of citation proximity analysis in co-citation has improved the accuracy of paper recommendation, this paper proposes a paper recommendation approach that extends the traditional bibliographic coupling by exploiting the distribution of citations in logical sections in bibliographically coupled papers. Comprehensive automated evaluation utilizing Jensen Shannon Divergence was conducted to evaluate the proposed approach. The results showed significant improvement over traditional bibliographic coupling and content-based research paper recommendation.
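The section-aware bibliographic coupling proposed above can be illustrated with a minimal sketch. The weighting rule and all values below are invented: a common reference cited in the same logical section of both papers simply counts more than one cited in different sections.

```python
def section_coupling(refs_a, refs_b, same_section_weight=2.0):
    """Bibliographic coupling strength; refs_*: {reference_id: section_name}."""
    strength = 0.0
    for ref, sec_a in refs_a.items():
        if ref in refs_b:
            # Reward common references cited in the same logical section.
            strength += same_section_weight if refs_b[ref] == sec_a else 1.0
    return strength

# Toy papers: a and b cite r1 in the same section (methods); a and c share
# the same references but cite them in different sections.
paper_a = {"r1": "methods", "r2": "introduction", "r3": "results"}
paper_b = {"r1": "methods", "r2": "results"}
paper_c = {"r1": "results", "r2": "results"}
print(section_coupling(paper_a, paper_b) > section_coupling(paper_a, paper_c))  # True
```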
Article
Full-text available
We document an open-source toolbox for drawing large-scale undirected graphs. This toolbox is based on a previously implemented closed-source algorithm known as VxOrd. Our toolbox, which we call OpenOrd, extends the capabilities of VxOrd to large graph layout by incorporating edge-cutting, a multi-level approach, average-link clustering, and a parallel implementation. At each level, vertices are grouped using force-directed layout and average-link clustering. The clustered vertices are then re-drawn and the process is repeated. When a suitable drawing of the coarsened graph is obtained, the algorithm is reversed to obtain a drawing of the original graph. This approach results in layouts of large graphs which incorporate both local and global structure. A detailed description of the algorithm is provided in this paper. Examples using datasets with over 600K nodes are given. Code is available at www.cs.sandia.gov/~smartin.
Article
Full-text available
Traditional co-citation analysis has not taken the proximity of co-cited references into account. As long as two references are cited by the same article, they are treated equally regardless of the distance between where the citations appear in the article. Little is known about what additional insights into citation and co-citation behaviours one might gain from studying distributions of co-citations in terms of such proximity. How are citations distributed in an article? What insights does the proximity of co-citation provide? In this article, the proximity of a pair of co-cited references is defined as the nearest instance of the co-citation relation in text. We investigate the proximity of co-citation in the full text of scientific publications at four levels, namely, the sentence level, the paragraph level, the section level, and the article level. We conducted four studies of co-citation patterns in the full text of articles published in 22 open access journals from BioMed Central. First, we compared the distributions of co-citation instances at four proximity levels in journal articles to the traditional article-level co-citation counts. Second, we studied the distributions of co-citations of various proximities across organizational sections in articles. Third, the distribution of co-citation proximity in different co-citation frequency groups is investigated. Fourth, we identified the occurrences of co-citations at different proximity levels with reference to the corresponding traditional co-citation network.
The results show that (1) the majority of co-citations are loosely coupled at the article level, (2) a higher proportion of sentence-level co-citations is found in high co-citation frequencies than in low co-citation frequencies, (3) tightly coupled sentence-level co-citations not only preserve the essential structure of the corresponding traditional co-citation network but also form a much smaller subset of the entire co-citation instances typically considered by traditional co-citation analysis. Implications for improving our understanding of underlying factors concerning co-citations and developing more efficient co-citation analysis methods are discussed.
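The four proximity levels above suggest a simple classification rule: assign each co-citation instance the narrowest textual unit that contains both in-text citations. A minimal sketch (the location encoding is an assumption made for illustration):

```python
def proximity_level(loc_a, loc_b):
    """Narrowest shared unit; loc_*: (section, paragraph, sentence) indices."""
    if loc_a == loc_b:
        return "sentence"       # both citations in the same sentence
    if loc_a[:2] == loc_b[:2]:
        return "paragraph"      # same section and paragraph
    if loc_a[0] == loc_b[0]:
        return "section"        # same section only
    return "article"            # co-cited but loosely coupled

print(proximity_level((2, 1, 3), (2, 1, 3)))  # sentence
print(proximity_level((2, 1, 3), (2, 1, 5)))  # paragraph
print(proximity_level((2, 1, 3), (2, 4, 0)))  # section
print(proximity_level((1, 0, 0), (3, 2, 1)))  # article
```

Counting instances per level over a full-text corpus would reproduce the kind of distribution the study reports, where most co-citations fall only at the article level.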
Article
Full-text available
Purpose – The purpose of this paper is to present a narrative review of studies on the citing behavior of scientists, covering mainly research published in the last 15 years. Based on the results of these studies, the paper seeks to answer the question of the extent to which scientists are motivated to cite a publication not only to acknowledge intellectual and cognitive influences of scientific peers, but also for other, possibly non‐scientific, reasons. Design/methodology/approach – The review covers research published from the early 1960s up to mid‐2005 (approximately 30 studies on citing behavior‐reporting results in about 40 publications). Findings – The general tendency of the results of the empirical studies makes it clear that citing behavior is not motivated solely by the wish to acknowledge intellectual and cognitive influences of colleague scientists, since the individual studies reveal also other, in part non‐scientific, factors that play a part in the decision to cite. However, the results of the studies must also be deemed scarcely reliable: the studies vary widely in design, and their results can hardly be replicated. Many of the studies have methodological weaknesses. Furthermore, there is evidence that the different motivations of citers are “not so different or ‘randomly given’ to such an extent that the phenomenon of citation would lose its role as a reliable measure of impact”. Originality/value – Given the increasing importance of evaluative bibliometrics in the world of scholarship, the question “What do citation counts measure?” is a particularly relevant and topical issue.
Article
Full-text available
We propose the use of the text of the sentences surrounding citations as an important tool for semantic interpretation of bioscience text. We hypothesize several different uses of citation sentences (which we call citances), including the creation of training and testing data for semantic analysis (especially for entity and relation recognition), synonym set creation, database curation, document summarization, and information retrieval generally. We illustrate some of these ideas, showing that citations to one document in particular align well with what a hand-built curator extracted. We also show preliminary results on the problem of normalizing the different ways that the same concepts are expressed within a set of citances, using and improving on existing techniques in automatic paraphrase generation.
Conference Paper
Full-text available
We present the results of experiments using terms from citations for scientific literature search. To index a given document, we use terms used by citing documents to describe that document, in combination with terms from the document itself. We find that the combination of terms gives better retrieval performance than standard indexing of the document terms alone and present a brief analysis of our results. This paper marks the first experimental results from a new test collection of scientific papers, created by us in order to study citation-based methods for IR.
Article
Full-text available
In the past several years studies have started to appear comparing the accuracies of various science mapping approaches. These studies primarily compare the cluster solutions resulting from different similarity approaches, and give varying results. In this study we compare the accuracies of cluster solutions of a large corpus of 2,153,769 recent articles from the biomedical literature (2004–2008) using four similarity approaches: co-citation analysis, bibliographic coupling, direct citation, and a bibliographic coupling-based citation-text hybrid approach. Each of the four approaches can be considered a way to represent the research front in biomedicine, and each is able to successfully cluster over 92% of the corpus. Accuracies are compared using two metrics—within-cluster textual coherence as defined by the Jensen-Shannon divergence, and a concentration measure based on the grant-to-article linkages indexed in MEDLINE. Of the three pure citation-based approaches, bibliographic coupling slightly outperforms co-citation analysis using both accuracy measures; direct citation is the least accurate mapping approach by far. The hybrid approach improves upon the bibliographic coupling results in all respects. We consider the results of this study to be robust given the very large size of the corpus, and the specificity of the accuracy measures used. © 2010 Wiley Periodicals, Inc.
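The two citation-based measures compared above can be computed directly from a citation list: bibliographic coupling counts the references two citing papers share, while co-citation counts the papers that cite two references together. A toy sketch (paper and reference identifiers are invented):

```python
from itertools import combinations
from collections import Counter

citations = {  # citing paper -> set of references it cites (toy data)
    "p1": {"r1", "r2"},
    "p2": {"r1", "r2", "r3"},
    "p3": {"r3"},
}

# Bibliographic coupling: number of references two citing papers share.
bc = {pair: len(citations[pair[0]] & citations[pair[1]])
      for pair in combinations(sorted(citations), 2)}

# Co-citation: number of papers that cite a pair of references together.
cc = Counter()
for refs in citations.values():
    for pair in combinations(sorted(refs), 2):
        cc[pair] += 1

print(bc[("p1", "p2")], cc[("r1", "r2")])  # 2 2
```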
Article
Full-text available
It is proposed that citation contexts, the text surrounding references in scientific papers, be analyzed in terms of an expanded notion of sentiment, defined to include attitudes and dispositions toward the cited work. Maps of science at both the specialty and global levels are used as the basis of this analysis. Citation context samples are taken at these levels and contrasted for the appearance of cue word sets, analyzed with the aid of methods from corpus linguistics. Sentiments are shown to vary within a specialty and can be understood in terms of cognitive and social factors. Within-specialty and between-specialty co-citations are contrasted and in some cases suggest a correlation of sentiment with structural location. For example, the sentiment of “uncertainty” is important in interdisciplinary co-citation links, while “utility” is more prevalent within the specialty. Suggestions are made for linking sentiments to technical terms, and for developing sentiment “baselines” for all of science.
Conference Paper
Full-text available
This paper presents an approach for identifying similar documents that can be used to assist scientists in finding related work. The approach, called Citation Proximity Analysis (CPA), is a further development of co-citation analysis, but in addition considers the proximity of citations to each other within an article's full text. The underlying idea is that the closer citations are to each other, the more likely it is that they are related. In comparison to existing approaches, such as bibliographic coupling, co-citation analysis or keyword-based approaches, the advantages of CPA are a higher precision and the possibility to identify related sections within documents. Moreover, CPA allows a more precise automatic document classification. CPA is used as the primary approach to analyse the similarity and to classify the 1.2 million publications contained in the research paper recommender system Scienstein.org.
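The CPA idea above can be sketched with an inverse-distance weight on each co-citation instance. The weighting function below is illustrative only; published CPA descriptions use level-based weights (e.g., same sentence counted more than same paragraph, and so on), not a continuous distance.

```python
def cpa_weight(pos_a, pos_b):
    """Illustrative CPA strength; pos_*: token (or character) offsets
    of the two in-text citations within the same full text."""
    return 1.0 / (1 + abs(pos_a - pos_b))

close_pair = cpa_weight(100, 120)   # citations about 20 tokens apart
far_pair = cpa_weight(100, 5000)    # citations in distant sections
print(close_pair > far_pair)  # True: nearby citations couple more strongly
```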
Article
Full-text available
Due to the nature of scientific methodology, research articles are rich in speculative and tentative statements, also known as hedges. We explore a linguistically motivated approach to the problem of recognizing such language in biomedical research articles. Our approach draws on prior linguistic work as well as existing lexical resources to create a dictionary of hedging cues and extends it by introducing syntactic patterns. Furthermore, recognizing that hedging cues differ in speculative strength, we assign them weights in two ways: automatically using the information gain (IG) measure and semi-automatically based on their types and centrality to hedging. Weights of hedging cues are used to determine the speculative strength of sentences. We test our system on two publicly available hedging datasets. On the fruit-fly dataset, we achieve a precision-recall breakeven point (BEP) of 0.85 using the semi-automatic weighting scheme and a lower BEP of 0.80 with the information gain weighting scheme. These results are competitive with the previously reported best results (BEP of 0.85). On the BMC dataset, using semi-automatic weighting yields a BEP of 0.82, a statistically significant improvement (p < 0.01) over the previously reported best result (BEP of 0.76), while information gain weighting yields a BEP of 0.70. Our results demonstrate that speculative language can be recognized successfully with a linguistically motivated approach and confirm that the selection of hedging devices affects the speculative strength of the sentence, which can be captured reasonably by weighting the hedging cues. The improvement obtained on the BMC dataset with a semi-automatic weighting scheme indicates that our linguistically oriented approach is more portable than machine-learning based approaches. The lower performance obtained with the information gain weighting scheme suggests that this method may benefit from a larger, manually annotated corpus for automatically inducing the weights.
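The weighted-cue idea above can be sketched with a tiny dictionary-based scorer. The cue list and weights below are hand-set for illustration (the paper derives them semi-automatically or via information gain, and also uses syntactic patterns, which this sketch omits): a sentence's speculative strength is the sum of the weights of its matched cues.

```python
# Illustrative hedging cues with hand-set weights (not the paper's values).
HEDGE_CUES = {"may": 0.6, "might": 0.7, "suggest": 0.8,
              "possibly": 0.9, "appears": 0.5}

def speculative_strength(sentence, threshold=0.5):
    """Return (score, is_speculative) for a sentence."""
    tokens = sentence.lower().split()
    score = sum(w for cue, w in HEDGE_CUES.items() if cue in tokens)
    return score, score >= threshold

hedged = "These results suggest the protein may regulate transcription"
factual = "The protein binds DNA at the promoter site"
print(speculative_strength(hedged)[1], speculative_strength(factual)[1])  # True False
```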
Article
Full-text available
Background: We investigate the accuracy of different similarity approaches for clustering over two million biomedical documents. Clustering large sets of text documents is important for a variety of information needs and applications such as collection management and navigation, summary and analysis. The few comparisons of clustering results from different similarity approaches have focused on small literature sets and have given conflicting results. Our study was designed to seek a robust answer to the question of which similarity approach would generate the most coherent clusters of a biomedical literature set of over two million documents. Methodology: We used a corpus of 2.15 million recent (2004-2008) records from MEDLINE, and generated nine different document-document similarity matrices from information extracted from their bibliographic records, including titles, abstracts and subject headings. The nine approaches comprised five different analytical techniques with two data sources. The five analytical techniques are cosine similarity using term frequency-inverse document frequency vectors (tf-idf cosine), latent semantic analysis (LSA), topic modeling, and two Poisson-based language models, BM25 and PMRA (PubMed Related Articles). The two data sources were a) MeSH subject headings, and b) words from titles and abstracts. Each similarity matrix was filtered to keep the top-n highest similarities per document and then clustered using a combination of graph layout and average-link clustering. Cluster results from the nine similarity approaches were compared using (1) within-cluster textual coherence based on the Jensen-Shannon divergence, and (2) two concentration measures based on grant-to-article linkages indexed in MEDLINE.
Conclusions: PubMed's own related article approach (PMRA) generated the most coherent and most concentrated cluster solution of the nine text-based similarity approaches tested, followed closely by the BM25 approach using titles and abstracts. Approaches using only MeSH subject headings were not competitive with those based on titles and abstracts.
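The tf-idf cosine step with top-n similarity filtering described above can be sketched as follows. This is a toy illustration under assumptions: the function names and the tiny corpus are invented, and the actual study ran over 2.15 million MEDLINE records.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Sparse tf-idf vectors (term -> weight dicts) for tokenized documents."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))   # document frequency
    idf = {t: math.log(n / d) for t, d in df.items()}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def top_n_similarities(docs, n_keep=5):
    """Keep only the top-n highest similarities per document, the filtering
    step applied to each similarity matrix before clustering."""
    vecs = tfidf_vectors(docs)
    return {
        i: sorted(((cosine(u, v), j) for j, v in enumerate(vecs) if j != i),
                  reverse=True)[:n_keep]
        for i, u in enumerate(vecs)
    }
```

The retained top-n edges per document form the sparse similarity graph that the graph-layout and average-link clustering steps would then operate on.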
Article
Full-text available
Citations are widely used in scientific literature. The traditional model of referencing considers all citations to be the same; however, semantically, citations play different roles. By studying the context in which citations appear, it is possible to determine the role that they play. Here, we report on the development of an eight-category classification scheme, annotation using that scheme, and development and evaluation of supervised machine-learning classifiers using the annotated data. We annotated 1,710 sentences using the annotation schema and our trained classifier obtained an average F1-score of 76.5%. The classifier is available for free as a Java API from http://citation.askhermes.org.
Article
To compare citation history and contextual importance, 11 highly cited articles, four slowly aging (Type 1) and seven quickly aging (Type 2), were ranked using an aggregate citation context measure, the Mean Utility Index. Based on citations in late (PY 6 & 7) source articles, methods papers consistently ranked higher than papers cited for research results and theoretical implications, and Type 1 methods papers ranked above all Type 2 papers. A Type 1 paper representing an important theoretical concept could not be distinguished from Type 2 papers using citation context alone.
Article
Discusses whether there is a difference in the value of a citation depending on where in the body of the citing article it occurs; and whether those cited articles to which reference is made more than once within a citing article are more valuable to the user than those cited only once. (Author)
Conference Paper
Citation function is defined as the author's reason for citing a given paper (e.g. acknowledgement of the use of the cited method). The automatic recognition of the rhetorical function of citations in scientific text has many applications, from improvement of impact factor calculations to text summarisation and more informative citation indexers. We show that our annotation scheme for citation function is reliable, and present a supervised machine learning framework to automatically classify citation function, using both shallow and linguistically-inspired features. We find, amongst other things, a strong relationship between citation function and sentiment classification.
Article
Measuring the relatedness between bibliometric units (journals, documents, authors, or words) is a central task in bibliometric analysis. Relatedness measures are used for many different tasks, among them the generating of maps, or visual pictures, showing the relationship between all items from these data. Despite the importance of these tasks, there has been little written on how to quantitatively evaluate the accuracy of relatedness measures or the resulting maps. The authors propose a new framework for assessing the performance of relatedness measures and visualization algorithms that contains four factors: accuracy, coverage, scalability, and robustness. This method was applied to 10 measures of journal–journal relatedness to determine the best measure. The 10 relatedness measures were then used as inputs to a visualization algorithm to create an additional 10 measures of journal–journal relatedness based on the distances between pairs of journals in two-dimensional space. This second step determines robustness (i.e., which measure remains best after dimension reduction). Results show that, for low coverage (under 50%), the Pearson correlation is the most accurate raw relatedness measure. However, the best overall measure, both at high coverage, and after dimension reduction, is the cosine index or a modified cosine index. Results also showed that the visualization algorithm increased local accuracy for most measures. Possible reasons for this counterintuitive finding are discussed. © 2006 Wiley Periodicals, Inc.
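The Pearson correlation can be viewed as the cosine index applied to mean-centered vectors, which is one way to see why the two relatedness measures compared in this study behave similarly at high coverage. A minimal sketch (the journal citation vectors are invented):

```python
import math

def cosine(u, v):
    """Cosine index between two relatedness vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def pearson(u, v):
    """Pearson correlation, i.e. the cosine of the mean-centered vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return cosine([a - mu for a in u], [b - mv for b in v])

# Hypothetical journal-to-journal citation count vectors.
j1 = [10, 0, 3, 5]
j2 = [8, 1, 2, 6]
```

Mean-centering is the only difference between the two measures, which makes their divergence at low coverage (many zero entries shifting the means) easier to interpret.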
Article
In this work, a novel method of cocitation analysis, coined “contextual cocitation analysis,” is introduced and described in comparison to traditional methods of cocitation analysis. Equations for quantifying contextual cocitation strength are introduced and their implications explored using theoretical examples alongside the application of contextual cocitation to a series of BioMed Central publications and their cited resources. Based on this work, the implications of contextual cocitation for understanding the granularity of the relationships created between cited published research and methods for its analysis are discussed. Future applications and improvements of this work, including its extended application to the published research of multiple disciplines, are then presented with rationales for their inclusion. © 2010 Wiley Periodicals, Inc.
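A proximity-weighted co-citation count in the spirit of contextual cocitation might look like the sketch below. The level weights and the data layout are illustrative assumptions, not the equations from the paper:

```python
from collections import defaultdict
from itertools import combinations

# Illustrative proximity weights; the published equations use their own scheme.
WEIGHTS = {"sentence": 4.0, "paragraph": 3.0, "section": 2.0, "document": 1.0}

def narrowest_level(a, b):
    """Narrowest structural unit shared by two in-text citations.

    `a` and `b` are (section, paragraph, sentence) index triples.
    """
    if a == b:
        return "sentence"
    if a[:2] == b[:2]:
        return "paragraph"
    if a[0] == b[0]:
        return "section"
    return "document"

def cocitation_strengths(citing_papers):
    """Sum proximity weights for every reference pair across citing papers.

    Each element of `citing_papers` is a list of (ref_id, position) pairs,
    one per in-text citation in that paper.
    """
    strengths = defaultdict(float)
    for citations in citing_papers:
        for (r1, p1), (r2, p2) in combinations(citations, 2):
            if r1 == r2:           # repeated citation of the same reference
                continue
            pair = tuple(sorted((r1, r2)))
            strengths[pair] += WEIGHTS[narrowest_level(p1, p2)]
    return dict(strengths)
```

Under this scheme, a pair of references cited in the same sentence contributes more strength than the same pair cited in different sections, which is the granularity distinction contextual cocitation is designed to capture.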
Article
The old Asian legend about the blind men and the elephant comes to mind when looking at how different authors of scientific papers describe a piece of related prior work. It turns out that different citations to the same paper often focus on different aspects of that paper and that neither provides a full description of its full set of contributions. In this article, we will describe our investigation of this phenomenon. We studied citation summaries in the context of research papers in the biomedical domain. A citation summary is the set of citing sentences for a given article and can be used as a surrogate for the actual article in a variety of scenarios. It contains information that was deemed by peers to be important. Our study shows that citation summaries overlap to some extent with the abstracts of the papers and that they also differ from them in that they focus on different aspects of these papers than do the abstracts. In addition to this, co-cited articles (which are pairs of articles cited by another article) tend to be similar. We show results based on a lexical similarity metric called cohesion to justify our claims.
Article
We describe two general approaches to creating document-level maps of science. To create a local map, one defines and directly maps a sample of data, such as all literature published in a set of information science journals. To create a global map of a research field, one maps “all of science” and then locates a literature sample within that full context. We provide a deductive argument that global mapping should create more accurate partitions of a research field than does local mapping, followed by practical reasons why this may not be so. The field of information science is then mapped at the document level using both local and global methods to provide a case illustration of the differences between the methods. Textual coherence is used to assess the accuracies of both maps. We find that document clusters in the global map have significantly higher coherence than do those in the local map, and that the global map provides unique insights into the field of information science that cannot be discerned from the local map. Specifically, we show that information science and computer science have a large interface and that computer science is the more progressive discipline at that interface. We also show that research communities in temporally linked threads have a much higher coherence than do isolated communities, and that this feature can be used to predict which threads will persist into a subsequent year. Methods that could increase the accuracy of both local and global maps in the future also are discussed.
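The textual coherence used in this line of work to assess cluster accuracy can be approximated by the mean Jensen-Shannon divergence between each document's term distribution and the cluster's aggregate distribution (lower divergence corresponds to higher coherence). A rough sketch under that assumption, not the exact formulation from the paper:

```python
import math
from collections import Counter

def _kl(p, q):
    """Kullback-Leibler divergence in bits; q must cover p's support."""
    return sum(pi * math.log2(pi / q[t]) for t, pi in p.items() if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two term distributions (dicts)."""
    terms = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in terms}
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

def distribution(tokens):
    """Normalized term frequency distribution of a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def cluster_coherence(docs):
    """Mean JSD of each document from the cluster's aggregate distribution.

    Lower values indicate a more textually coherent cluster.
    """
    dists = [distribution(d) for d in docs]
    centroid = distribution([t for d in docs for t in d])
    return sum(js_divergence(d, centroid) for d in dists) / len(dists)
```

A cluster whose documents share vocabulary scores near zero, while a cluster mixing unrelated topics scores higher, which is the direction of the global-versus-local comparison reported above.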
Article
Finding a particular scientific document amidst a sea of thousands of other documents can often seem like an insurmountable task. The Structure of Scientific Articles shows how linguistic theory can provide a solution by analyzing rhetorical structures to make information retrieval easier and faster. Through the use of an improved citation indexing system, this indispensable volume applies empirical discourse studies to pressing issues of document management, including attribution, the author’s stance towards other work, and problem-solving processes.
Article
The revolution the Web has brought to information dissemination is not so much due to the availability of data (huge amounts of information have long been available in libraries) but rather to the improved efficiency of accessing that information. The Web promises to make more scientific articles more easily available. By making the context of citations easily and quickly browsable, autonomous citation indexing can help to evaluate the importance of individual contributions more accurately and quickly. Digital libraries incorporating ACI can help organize scientific literature and may significantly improve the efficiency of dissemination and feedback. ACI may also help speed the transition to scholarly electronic publishing
Article
We present CiteSeer: an autonomous citation indexing system which indexes academic literature in electronic format (e.g. Postscript files on the Web). CiteSeer understands how to parse citations, identify citations to the same paper in different formats, and identify the context of citations in the body of articles. CiteSeer provides most of the advantages of traditional (manually constructed) citation indexes (e.g. the ISI citation indexes), including: literature retrieval by following citation links (e.g. by providing a list of papers that cite a given paper), the evaluation and ranking of papers, authors, journals, etc. based on the number of citations, and the identification of research trends. CiteSeer has many advantages over traditional citation indexes, including the ability to create more up-to-date databases which are not limited to a preselected set of journals or restricted by journal publication delays, completely autonomous operation with a corresponding reduction in cost...
Identifying scientific breakthroughs by combining co-citation analysis and citation context
  • H Small
  • R Klavans
Small, H., & Klavans, R. (2011). Identifying scientific breakthroughs by combining co-citation analysis and citation context. 13th International Conference of the International Society for Scientometrics and Informetrics, 783–793.
Can citation indexing be automated? Essays of an Information Scientist
  • E Garfield
Garfield, E. (1962). Can citation indexing be automated? Essays of an Information Scientist, 1, 84–90.
Automatically classifying the role of citations in biomedical articles
  • S Agarwal
  • L Choubey
  • H Yu
Agarwal, S., Choubey, L., & Yu, H. (2010). Automatically classifying the role of citations in biomedical articles. AMIA 2010 Symposium Proceedings, 11-15.
Citeseer: An automatic citation indexing system
  • C L Giles
  • K Bollacker
  • S Lawrence
Giles, C. L., Bollacker, K., & Lawrence, S. (1998). Citeseer: An automatic citation indexing system. In Proceedings of the Third ACM Conference on Digital Libraries (DL '98).