The growing popularity of bibliometric indexes (whose most famous example is the h index by J. E. Hirsch [J. E. Hirsch, Proc. Natl. Acad. Sci. U.S.A. 102, 16569–16572 (2005)]) is opposed by those claiming that one’s scientific impact cannot be reduced to a single number. Some even believe that our complex reality fails to submit to any quantitative description. We argue that neither of the two controversial extremes is true. By assuming that some citations are distributed according to the rich get richer rule (success breeds success, preferential attachment) while some others are assigned totally at random (all in all, a paper needs a bibliography), we have crafted a model that accurately summarizes citation records with merely three easily interpretable parameters: productivity, total impact, and how lucky an author has been so far.
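The mixture of the two allocation rules described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' exact process: the function name `simulate_citations`, the parameter `m` (citations handed out per step), and the rule that one new uncited paper enters at each step are all hypothetical choices made for illustration. The parameter `rho` plays the role of the share of rich-get-richer citations.

```python
import random


def simulate_citations(n_papers, m, rho, seed=0):
    """Toy mixture of preferential and accidental citation allocation.

    Assumed update rule (hypothetical, for illustration only): at each step
    one new paper with zero citations is added, then m citations are handed
    out. Each citation goes to a paper chosen proportionally to its current
    citation count with probability rho (rich get richer), and to a
    uniformly chosen paper otherwise (sheer chance).
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_papers):
        counts.append(0)
        for _ in range(m):
            total = sum(counts)
            if total > 0 and rng.random() < rho:
                # Preferential rule: weight papers by citations held so far.
                idx = rng.choices(range(len(counts)), weights=counts, k=1)[0]
            else:
                # Accidental rule: every paper is equally likely.
                idx = rng.randrange(len(counts))
            counts[idx] += 1
    # Return the citation record sorted from most to least cited.
    return sorted(counts, reverse=True)


record = simulate_citations(n_papers=50, m=3, rho=0.5, seed=1)
productivity = len(record)   # number of papers
total_impact = sum(record)   # total number of citations
```

In this sketch, productivity and total impact are fixed by `n_papers` and `m`, so only `rho` changes the shape of the sorted record; larger values concentrate citations on fewer papers.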

... The search for patterns and universalities in sorted data (Holme, 2022; Newman, 2005) remains a fundamental, multidisciplinary research topic. This includes the study of ranking dynamics (Iñiguez et al., 2022) and the distribution of their static snapshots, from the most straightforward Zipf power-law models (Newman, 2005) to more complex ones (Petersen et al., 2011; Siudem et al., 2020, 2022; Singh et al., 2022). ...

... In this work, we revisit the recently proposed 3DI model (three dimensions of impact; Siudem et al., 2020, 2022), which can be considered a rank-size approach to the problem of describing the mechanisms governing the growth of bibliographic and other networks studied originally by Price (1965). ...

... We have ( −1) ... Relation to the 3DI model. To strengthen the underlying fundaments further, let us return to the 3DI (three dimensions of impact) model (Siudem et al., 2020) mentioned in the introduction. Let \(X_k(t)\) denote the impact of the \(k\)-th richest entity at time step \(t\) (e.g., the number of citations to the \(k\)-th most cited paper). ...

Inequality is an inherent part of our lives: we see it in the distribution of incomes, talents, resources, and citations, amongst many others. Its intensity varies across different environments: from relatively evenly distributed ones, to where a small group of stakeholders controls the majority of the available resources. We would like to understand why inequality naturally arises as a consequence of the natural evolution of any system. Studying simple mathematical models governed by intuitive assumptions can bring many insights into this problem. In particular, we recently observed (Siudem et al., PNAS 117:13896-13900, 2020) that impact distribution might be modelled accurately by a time-dependent agent-based model involving a mixture of the rich-get-richer and sheer chance components. Here we point out its relationship to an iterative process that generates rank distributions of any length and a predefined level of inequality, as measured by the Gini index. Many indices quantifying the degree of inequality have been proposed. Which of them is the most informative? We show that, under our model, indices such as the Bonferroni, De Vergottini, and Hoover ones are equivalent. Given one of them, we can recreate the value of any other measure using the derived functional relationships. Also, thanks to the obtained formulae, we can understand how they depend on the sample size. An empirical analysis of a large sample of citation records in economics (RePEc) as well as countrywise family income data, confirms our theoretical observations. Therefore, we can safely and effectively remain faithful to the simplest measure: the Gini index.
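For readers who want to compute two of the indices named above on their own data, here is a minimal sketch using the textbook definitions of the Gini and Hoover indices. It assumes the plain (non-bias-corrected) sample versions; the normalisation used in the paper may differ, and the function names are illustrative only.

```python
def gini(values):
    """Gini index of non-negative values (plain sample version, no correction)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive value")
    # Sorted-sample formula: G = sum_i (2i - n - 1) * x_(i) / (n * sum(x)).
    weighted = sum((2 * i - n - 1) * v for i, v in enumerate(xs, start=1))
    return weighted / (n * total)


def hoover(values):
    """Hoover index: share of the total that would need redistributing."""
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive value")
    mean = total / n
    return sum(abs(v - mean) for v in values) / (2 * total)


# One entity holds everything: both indices equal (n - 1) / n for n = 4.
print(gini([0, 0, 0, 1]), hoover([0, 0, 0, 1]))
# Perfect equality: both indices are zero.
print(gini([1, 1, 1, 1]), hoover([1, 1, 1, 1]))
```

Both functions take the raw vector (incomes, citation counts), so they can be applied to samples of different sizes to see how the values depend on sample size.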

... Our analysis can be considered an extension of the idea presented by Bertoli-Barsotti and Lando in (2017a) and (2017b), where the h-index has been expressed (analytically) by means of 4 other sample statistics for a few models known from the literature. In this paper, however, we utilise a new model that we have recently derived in (Siudem et al., 2020). Due to the complexity of our enterprise, we shall present the results of a numerical study. ...

... In the recent paper (Siudem et al., 2020) we have introduced the so-called 3DSI model (3 dimensions of scientific impact). It is an agent-based model inspired by (Ionescu and Chopard, 2013; Żogała-Siudem et al., 2016) that captures the evolution of an author's citation record, which we represent with a vector \((X_1, \dots, X_N)\), where \(X_k\) denotes the number of citations received by the \(k\)-th most referenced paper. ...

... In Siudem et al. (2020) we have shown that, for given \(N\), \(C\), and \(\rho \in (0, 1)\), the \(k\)-th most cited paper is expected to receive ...

We demonstrate that by using a triple of simple numerical summaries: an author’s productivity, their overall impact, and a single other bibliometric index that aims to capture the shape of the citation distribution, we can reconstruct other popular metrics of bibliometric impact with a sufficient degree of precision. We thus conclude that the use of many indices may be unnecessary: entities should not be multiplied beyond necessity. Such a study was possible thanks to our new agent-based model (Siudem et al. in Proc Natl Acad Sci 117:13896–13900, 2020, 10.1073/pnas.2001064117), which not only assumes that citations are distributed according to a mixture of the rich-get-richer rule and sheer chance, but also fits real bibliometric data quite well. We investigate which bibliometric indices have good discriminative power, which measures can be easily predicted as functions of other ones, and what implications our findings have for research evaluation practice.

... The Price model was studied in, amongst others, [17, Chap. 14] and frequently appears in the literature under different names and modifications [18–23] and in different contexts: e.g., resistance to random failures and intentional attacks in complex networks [18] or computation of longest paths in random graphs [24]. ...

... The most typical approach (e.g., [21]) to deriving the citation distribution and thus the preferential-to-accidental ratio in the Price model is via master equations (see [25]). In this work, however, we shall apply a rank-size (order statistics) approach which was inspired by our earlier work [23], where we studied citation vectors of individual scientists, i.e., in small scale, but with similar accidental and preferential contributors. Here we shall modify the model's boundary conditions so that we can focus on papers which have obtained a sufficiently large number of citations (e.g., 1, to avoid problems with computing and drawing on the log scale). ...

... We assume \(X_k(k-1) = \delta\) for every \(k\), i.e., the \(k\)-th publication enters the system with \(\delta \in [0, m)\) citations. Solving the above (similarly as we did in [23] but with a less general boundary condition) leads to ...

We consider a version of D. J. Price's model for the growth of a bibliographic network, where in each iteration a constant number of citations is randomly allocated according to a weighted combination of accidental (uniformly distributed) and preferential (rich-get-richer) rules. Instead of relying on the typical master equation approach, we formulate and solve this problem in terms of the rank-size distribution. We show that, asymptotically, such a process leads to a Pareto-type 2 distribution with an appealingly interpretable parametrisation. We prove that the solution to the Price model expressed in terms of the rank-size distribution coincides with the expected values of order statistics in an independent Paretian sample. We study the bias and the mean squared error of three well-behaving estimators of the underlying model parameters. An empirical analysis of a large repository of academic papers yields a good fit not only in the tail of the distribution (as is usually the case in the power law-like framework), but also across the whole domain. Interestingly, the estimated models indicate a higher degree of preferentially attached citations and a smaller share of randomness than previous studies.

... It is interesting to note that in this framework, the model given by Eq. (5) was already used in and in (Siudem et al., 2020) in empirical studies. In these works, the authors considered the sources as the articles produced by a researcher and the items as the citations received, and interpreted the constant , under the constraint > 1∕2 (that corresponds to > 0, in their parametrisation; note that the recent paper by Biró et al., 2023 suggests that > 1∕2 is the most prevalent case for citation data), as a parameter explaining the preferential attachment mechanism -that is an author's specific tendency to produce articles more or less attractive for citations. ...

... where in ⋆, we apply the substitution = 1 2− ; compare (Bertoli-Barsotti, 2023). Equation (11) is equivalent (with a slightly different notation) to Eq. (8) in (Siudem et al., 2020), where ( , ) is given by Eq. (11) therein, with = 1 and = 2 − 1 , which results in exactly our Eq. (5) for ≠ 1 2 ( ≠ 0). ...

We study an iterative discrete information production process (IPP) where we can extend ordered normalised vectors by new elements based on a simple affine transformation, while preserving the predefined level of inequality, G, as measured by the Gini index. Then, we derive the family of Lorenz curves of the corresponding vectors and prove that it is stochastically ordered with respect to both the sample size and G which plays the role of the uncertainty parameter. A case study of family income data in nine countries shows a very good fit of our model. Moreover, we show that asymptotically, we obtain all, and only, Lorenz curves generated by a new, intuitive parametrisation of the finite-mean Generalised Pareto Distribution (GPD) that unifies three other families, namely: the Pareto Type II, exponential, and scaled beta ones. The family is not only ordered with respect to the parameter G, but also, thanks to our derivations, has a nice underlying interpretation. Our result may thus shed new light on the genesis of this family of distributions.

... Although scholars have distinguished different types of citations in citation network analysis, most research has focused on the bibliographic information (Liu and Fang, 2020; Cai et al, 2018; Siudem et al, 2020), while few studies concentrate on the citation contents. At the same time, the number of suspected citations is increasing; these are established intentionally to enhance the impact of publications or authors rather than to disseminate prior scientific advances contributing to the publication. ...

... Liu et al (2018) trace the scientific publications that scientists produced and quantitatively describe the hot streak phenomenon in their careers. Siudem et al (2020) propose a model to recreate citation records from three perspectives, i.e., the number of publications, citations, and the degree of randomness in the citation patterns. In the field of medicine, Liao et al (2018) explore the current status of medicine by visualizing and analyzing the citation network constructed from publications related to medical big data. ...

Citation network analysis attracts increasing attention from the disciplines of complex network analysis and science of science. One big challenge in this regard is that there are unreasonable citations in citation networks, i.e., cited papers are not relevant to the citing paper. Existing research on citation analysis has primarily concentrated on the contents and ignored the complex relations between academic entities. In this paper, we propose a novel research topic, that is, how to detect anomalous citations. To be specific, we first define anomalous citations and propose a unified framework, named ACTION, to detect anomalous citations in a heterogeneous academic network. ACTION is established based on non-negative matrix factorization and network representation learning, which considers not only the relevance of citation contents but also the relationships among academic entities including journals, papers, and authors. To evaluate the performance of ACTION, we construct two anomalous citation datasets. Experimental results demonstrate the effectiveness of the proposed method. Detecting anomalous citations carries profound significance for academic fairness.

... In addition to citations, mechanistic models have been developed to understand the formation of collaborations [136,180–183], knowledge discovery and diffusion [184,185], topic selection [186,187], career dynamics [30,31,188,189], the growth of scientific fields [190] and the dynamics of failure in science and other domains [178]. ...

The advent of large-scale datasets that trace the workings of science has encouraged researchers from many different disciplinary backgrounds to turn scientific methods into science itself, cultivating a rapidly expanding 'science of science'. This Review considers this growing, multidisciplinary literature through the lens of data, measurement and empirical methods. We discuss the purposes, strengths and limitations of major empirical approaches, seeking to increase understanding of the field's diverse methodologies and expand researchers' toolkits. Overall, new empirical developments provide enormous capacity to test traditional beliefs and conceptual frameworks about science, discover factors associated with scientific productivity, predict scientific outcomes and design policies that facilitate scientific progress.

... A promising resource that meets some of these conditions already is the Open Commons of Phenomenology (ophen.org), which provides free access to many of the primary texts as well as clean meta-data and author ids. [Footnote 34: The literature on the rich-get-richer effect in bibliometrics is reviewed in (Siudem et al., 2020).] ...

More has been written about phenomenology than could possibly be read in a single person’s lifetime, or even in several lifetimes. Despite its unwieldy size, this vast “horizon” of literary output has a tractable structure. We leverage the tools of bibliometrics to study the structure of the phenomenology literature, and test several hypotheses about it. We create an author-wise co-citation network, a graph of nodes and connections, where each node corresponds to an author who has written a document with the word “Phenomenology” in it, and where two nodes are connected if the corresponding authors have cited each other. By applying clustering algorithms and other techniques to this network, certain structural features of the field emerge. The main areas of research since 1970 conform fairly well to an intuitive understanding of the literature, though there are some surprises.

... The quality of a scientist's work is commonly quantified by two different, but related, measures, namely, their number of papers and the number of citations thereof (summarized in the h-index [Hirsch, 2005;Siudem, Żogała-Siudem et al., 2020]). The vast majority of investigations about the scientific publication process are focused on the citation side. ...

Scholarly publications represent at least two benefits for the study of the scientific community as a social group. First, they attest to some form of relation between scientists (collaborations, mentoring, heritage, …), useful to determine and analyze social subgroups. Second, most of them are recorded in large databases, easily accessible and including a lot of pertinent information, easing the quantitative and qualitative study of the scientific community. Understanding the underlying dynamics driving the creation of knowledge in general, and of scientific publication in particular, can contribute to maintaining a high level of research, by identifying good and bad practices in science. In this article, we aim to advance this understanding by a statistical analysis of publication within peer-reviewed journals. Namely, we show that the distribution of the number of papers published by an author in a given journal is heavy-tailed, but has a lighter tail than a power law. Interestingly, we demonstrate (both analytically and numerically) that such distributions match the result of a modified preferential attachment process, where, on top of a Barabási-Albert process, we take the finite career span of scientists into account.

... In a recent analysis, Siudem et al. 24 proposed that there are other dimensions to scientific impact, including productivity and total impact (citations), as well as 'luck' (i.e., the 'rich get richer' rule) to some degree. While luck is difficult to assess in bibliometrics, total impact should be measured through both total citations of a body of work and the citation per publication of that body of work (the h-index. ...

Clinical Relevance
Clinicians, researchers, funding agencies and indeed the general public can benefit from knowledge of the most highly cited papers and most impactful authors, institutions, countries and journals in the field of keratoconus.
Background
Bibliometrics relating to the keratoconus literature were derived to enable identification of the most impactful papers published, as well as the leading authors, institutions, countries and journals.
Methods
A search was undertaken of the titles of papers on the Scopus database to identify keratoconus-related articles. The 20 most highly cited papers were determined from the total list of 4,419 papers found. Rank-order lists by count were assembled for the ‘top 20’ in each of four categories: authors, institutions, countries and journals. A subject-specific keratoconus-related h-index (hKC-index) was derived for each constituent of each category to serve as a measure of impact in the field. The top 10 constituents of each category were ranked by hKC-index and tabulated for consideration.
Results
The hKC-index of the keratoconus field is 125. The 4,419 papers have been cited a total of 98,010 times, and 18.5% of these papers have never been cited. The most highly cited paper is a general review of keratoconus by Yaron Rabinowitz, who is also the most impactful author in the field (hKC = 31). The Cedars Sinai Medical Center in the United States produces the most impactful keratoconus-related papers (hKC = 36), and the United States is the most impactful country (hKC = 91). The Journal of Cataract and Refractive Surgery is the most impactful journal (hKC = 55).
Conclusion
Keratoconus is a topic of high interest in the clinical and scientific literature. Highly cited papers and impactful authors, institutions, countries and journals are identified.
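The h-index used throughout these studies follows Hirsch's standard definition: the largest h such that h papers have at least h citations each. A minimal sketch of that computation is given below; the function name and the sample citation counts are illustrative and are not taken from the keratoconus data. A subject-specific variant such as the hKC-index would, presumably, apply the same rule to the subset of an entity's papers that fall within the field.

```python
def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            # Counts only decrease from here on, so h cannot grow further.
            break
    return h


# Illustrative citation counts for five papers.
print(h_index([10, 8, 5, 4, 3]))  # 4
```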

... Bibliometric assessments are considered more objective criteria, but there is no consensus on which indexes are more suitable for evaluating academic performance, and many believe that the intrinsic nature of scientific processes can only be precisely quantified by multidimensional features [14,15]. This data-driven culture of performance evaluation has amassed much criticism [6,16–18], and it also exerts enormous pressure on scholars (particularly on young scientists [19]) for publishing in large quantities, in prestigious journals, and developing highly cited research [5,20,21]. ...

The association between productivity and impact of scientific production is a long-standing debate in science that remains controversial and poorly understood. Here we present a large-scale analysis of the association between yearly publication numbers and average journal-impact metrics for the Brazilian scientific elite. We find this association to be discipline specific, career age dependent, and similar among researchers with outlier and nonoutlier performance. Outlier researchers either outperform in productivity or journal prestige, but they rarely do so in both categories. Nonoutliers also follow this trend and display negative correlations between productivity and journal prestige but with discipline-dependent intensity. Our research indicates that academics are averse to simultaneous changes in their productivity and journal-prestige levels over consecutive career years. We also find that career patterns concerning productivity and journal prestige are discipline-specific, having in common a rise in productivity with career age for most disciplines and a higher chance of outperforming in journal impact during early career stages.

... Bibliometric assessments are considered more objective criteria, but there is no consensus on which indexes are more suitable for evaluating academic performance, and many believe that the intrinsic nature of scientific processes can only be precisely quantified by multidimensional features [14,15]. This data-driven culture of per- ...

The association between productivity and impact of scientific production is a long-standing debate in science that remains controversial and poorly understood. Here we present a large-scale analysis of the association between yearly publication numbers and average journal-impact metrics for the Brazilian scientific elite. We find this association to be discipline-specific, career-age dependent, and similar among researchers with outlier and non-outlier performance. Outlier researchers either outperform in productivity or journal prestige, but they rarely do so in both categories. Non-outliers also follow this trend and display negative correlations between productivity and journal prestige but with discipline-dependent intensity. Our research indicates that academics are averse to simultaneous changes in their productivity and journal-prestige levels over consecutive career years. We also find that career patterns concerning productivity and journal prestige are discipline-specific, having in common a rise in productivity with career age for most disciplines and a higher chance of outperforming in journal impact during early career stages.

... One measure to assess the research quality would be to analyze citations, but given the comparably short time in which thousands of manuscripts have been published, a more comprehensive analysis can be expected in the future. If citation numbers grow, this will allow further analyses, according to three easily interpretable parameters: productivity, total impact, and how successful an author has been so far, as proposed in a recent study [19]. Regarding the number of COVID-19 cases and related deaths, we relied on published data from official authorities. ...

Background
The COVID-19 pandemic, caused by the novel coronavirus SARS-CoV-2, has instigated immediate and massive worldwide research efforts. Rapid publication of research data may be desirable but also carries the risk of quality loss.
Objective
This analysis aimed to correlate the severity of the COVID-19 outbreak with its related scientific output per country.
Methods
All articles related to the COVID-19 pandemic were retrieved from Web of Science and analyzed using the web application SciPE (science performance evaluation), allowing for large data scientometric analyses of the global geographical distribution of scientific output.
Results
A total of 7185 publications, including 2592 articles, 2091 editorial materials, 2528 early access papers, 1479 letters, 633 reviews, and other contributions were extracted. The top 3 countries involved in COVID-19 research were the United States, China, and Italy. The confirmed COVID-19 cases or deaths per region correlated with scientific research output. The United States was most active in terms of collaborative efforts, sharing a significant amount of manuscript authorships with the United Kingdom, China, and Italy. The United States was China’s most frequent collaborative partner, followed by the United Kingdom.
Conclusions
The COVID-19 research landscape is rapidly developing and is driven by countries with a generally strong prepandemic research output but is also significantly affected by countries with a high prevalence of COVID-19 cases. Our findings indicate that the United States is leading international collaborative efforts.

... A scientist's work is commonly evaluated by two different, but related, quantities, namely, their number of publications and the number of citations thereof. These quantities are summarized in the criticized, but widely spread, h-index [9,10]. Naturally, a vast majority of investigations about the scientific publication process is focussed on the citation side. ...

The community of scientists is characterized by their need to publish in peer-reviewed journals, in an attempt to avoid the "perish" side of the famous maxim. Accordingly, almost all researchers have authored some scientific articles. Scholarly publications represent at least two benefits for the study of the scientific community as a social group. First, they attest to some form of relation between scientists (collaborations, mentoring, heritage, ...), useful to determine and analyze social subgroups. Second, most of them are recorded in large databases, easily accessible and including a lot of pertinent information, easing the quantitative and qualitative study of the scientific community. Understanding the underlying dynamics driving the creation of knowledge in general, and of scientific publication in particular, in addition to its interest from the social science point of view, can contribute to maintaining a high level of research, by identifying good and bad practices in science. In this manuscript, we attempt to advance this understanding by a statistical analysis of publications within peer-reviewed journals. Namely, we show that the distribution of the number of articles published by an author in a given journal is heavy-tailed, but has a lighter tail than a power law. Moreover, we observe some anomalies in the data that pinpoint underlying dynamics of the scholarly publication process.

Evaluating the influence of interdisciplinary research is important to the development of science. This work considers large and small disciplines, calculates the interdisciplinary distance, and analyzes the influence of interdisciplinary behavior and interdisciplinary distance in the academic network. The results show that, in the large discipline, the risk of interdisciplinary behavior is more significant than the benefits. Peers in the small disciplines tend to agree with the results of the small discipline across the large discipline. We further confirm this conclusion using PSM-DID. The analysis of interdisciplinary distance and scientists’ influence shows that certain risks accompany any distance between disciplines. However, there still exists a “Sweet Spot” which could bring significant rewards. Overall, this work provides a feasible approach to studying and understanding interdisciplinary behaviors in science.

This paper aims to find the reasons why some citation models can predict a set of specific bibliometric indices extremely well. We show why fitting a model that preserves the total sum of a vector can be beneficial in the case of heavy-tailed data that are frequently observed in informetrics and similar disciplines. Based on this observation, we introduce the reparameterised versions of the discrete generalised beta distribution (DGBD) and power law models that preserve the total sum of elements in a citation vector and, as a byproduct, they enjoy much better predictive power when predicting many bibliometric indices as well as partial cumulative sums. This also results in the underlying model parameters’ being easier to fit numerically. Moreover, they are also more interpretable. Namely, just like in our recently-introduced 3DSI (three dimensions of scientific impact) model, we have a clear distinction between the coefficients determining the total productivity (size), total impact (sum), and those that affect the shape of the resulting theoretical curve.

In a recent contribution in this journal, Gagolewski et al. (Scientometrics 127(5):2829–2845, 2022) study a new model—the so-called 3 dimensions of scientific impact (3DSI) model—for representing a rank size distribution. The model depends on three parameters/dimensions: the total number of papers, the total number of citations and a third parameter, \(\rho\), recognized by the authors as a shape parameter. We prove that \(\rho\) is an equivalent Gini coefficient.

We study an agent-based model for generating citation distributions in complex networks of scientific papers, where a fraction of citations is allotted according to the preferential attachment rule (rich get richer) and the remainder is allocated accidentally (purely at random, uniformly). Previously, we derived and analysed such a process in the context of describing individual authors, but now we apply it to scientific journals in computer and information sciences. Based on the large DBLP dataset as well as the CORE (Computing Research and Education Association of Australasia) journal ranking, we find that the impact of journals is correlated with the degree of accidentality of their citation distribution. Citations to impactful journals tend to be more preferential, while citations to lower-ranked journals are distributed in a more accidental manner. Further, applied fields of research such as artificial intelligence seem to be driven by a stronger preferential component – and hence have a higher degree of inequality – than the more theoretical ones, e.g., mathematics and computation theory.

We consider a version of D. Price’s model for the growth of a bibliographic network, where in each iteration, a constant number of citations is randomly allocated according to a weighted combination of the accidental (uniformly distributed) and the preferential (rich-get-richer) rule. Instead of relying on the typical master equation approach, we formulate and solve this problem in terms of the rank–size distribution. We show that, asymptotically, such a process leads to a Pareto-type 2 distribution with a new, appealingly interpretable parametrisation. We prove that the solution to the Price model expressed in terms of the rank–size distribution coincides with the expected values of order statistics in an independent Paretian sample. An empirical analysis of a large repository of academic papers yields a good fit not only in the tail of the distribution (as is usually the case in the power law-like framework), but also across a significantly larger fraction of the data domain.

In a recent paper in Scientometrics, Gagolewski et al. (2022) elaborated on a 3DSI model (3 dimensions of scientific impact) using a triplet of primary indicators (N, the number of papers for an author’s productivity, C, the number of citations for overall impact, and ρ for the shape of the citation distribution). The 3DSI model is an agent-based model (Siudem et al. in Proc Natl Acad Sci 117:13896–13900, 2020, https://doi.org/10.1073/pnas.2001064117), which assumes that citations are distributed according to a mixture of the rich-get-richer rule and sheer chance.

We analyse the usefulness of Jain’s fairness measure and the related Prathap’s bibliometric z-index as proxies when estimating the parameters of the 3DSI (three dimensions of scientific impact) model.
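For reference, Jain's fairness measure has a simple closed form: the squared sum of the values divided by the number of values times the sum of their squares. The sketch below implements only that standard formula; it does not reproduce Prathap's z-index or the estimation procedure studied in the paper, and the function name is illustrative.

```python
def jain_fairness(values):
    """Jain's fairness measure: (sum x)^2 / (n * sum x^2), in the range [1/n, 1]."""
    n = len(values)
    sum_sq = sum(v * v for v in values)
    if n == 0 or sum_sq == 0:
        raise ValueError("need at least one non-zero value")
    total = sum(values)
    return (total * total) / (n * sum_sq)


# Equal citation counts give the maximum value of 1.
print(jain_fairness([1, 1, 1, 1]))
# All citations on one of four papers gives the minimum value of 1/n.
print(jain_fairness([0, 0, 0, 1]))
```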

Evaluating academic papers and groups is important in scholar evaluation and literature retrieval. However, current evaluation indices, which pay excessive attention to the citation number rather than the citation importance and unidirectionality, are relatively simple. This study proposes new evaluation indices for papers and groups. First, an improved PageRank (PR) algorithm introducing citation importance is proposed to obtain a new citation-based paper index (CPI) via a pre-ranking and fine-tuning strategy. Second, to evaluate the paper’s influence inside and outside its research field, the focus citation-based paper index (FCPI) and diversity citation-based paper index (DCPI) are proposed based on topic similarity and diversity, respectively. Third, aside from the statistical indices for academic papers, we propose a foreign academic degree of dependence (FAD) to characterise the dependence between two academic groups. Finally, artificial intelligence (AI) papers from 2005 to 2019 are utilised for a case study.

Although the citation relationships among papers can help in tracking and understanding the development of knowledge, few studies have noted that the content and sentiments of citations of a paper differ. Here, we use sentiment-labeled citation data to construct a directed signed citation network, in which an author may agree with or criticize the cited paper and these represent different ways of inheriting knowledge. The dataset we use consists of 9,038 papers in the field of Computational Linguistics, including 25,275 citations, with 20.8% positive citations, 8.6% negative citations and 70.6% neutral citations. We systematically quantify the structural patterns of negative citations, impact assortativity of involved papers, occurrence time distribution and consequences of receiving negative attention. Remarkably, we find that papers with different impacts have a similar probability of receiving negative citations, and highly cited papers tend to give negative citations to low-impact papers around but avoid giving negative citations to high-impact papers. Our research also reveals the random occurrence rules and colocation patterns of negative citation distribution. In addition, we show that, in the short term, around 60% of multiple negative citations are positively related to the impact of the cited paper, while more than 80% are negatively related to the impact in the long run. Our findings explain the pattern by which negative citations occur and deepen the understanding of negative citations.

Question‐and‐answer (Q&A) sites improve access to information and ease transfer of knowledge. In recent years, they have grown in popularity and importance, enabling research on behavioral patterns of their users. We study the dynamics related to the casting of 7 M votes across a sample of 700 k posts on Stack Overflow, a large community of professional software developers. We employ log‐Gaussian mixture modeling and Markov chains to formulate a simple yet elegant description of the considered phenomena. We indicate that the interevent times can naturally be clustered into 3 typical time scales: those which occur within hours, weeks, and months and show how the events become rarer and rarer as time passes. It turns out that the posts' popularity in a short period after publication is a weak predictor of its overall success, contrary to what was observed, for example, in case of YouTube clips. Nonetheless, the sleeping beauties sometimes awake and can receive bursts of votes following each other relatively quickly.

There are many approaches to the modelling of citation vectors of individual authors. Models may serve different purposes, but usually they are evaluated with regards to how well they align to citation distributions in large networks of papers. Here we compare a few leading models in terms of their ability to correctly reproduce the values of selected bibliometric indices of individual authors. Our recently-proposed three-dimensional model of scientific impact serves this purpose equally well as the discrete generalised beta distribution and the log-normal models, but has fewer parameters which additionally are all easy to interpret. We also indicate which indices can be predicted with high accuracy and which are more difficult to model.

Understanding the reasons associated with successful proposals is of paramount importance to improve evaluation processes. In this context, we analyzed whether bibliometric features are able to predict the success of research grants. We extracted features aiming at characterizing the academic history of Brazilian researchers, including research topics, affiliations, number of publications and visibility. The extracted features were then used to predict grant productivity via machine learning in three major research areas, namely Medicine, Dentistry and Veterinary Medicine. We found that research subject and publication history play a role in predicting productivity. In addition, institution-based features turned out to be relevant when combined with other features. While the best results outperformed text-based attributes, the evaluated features were not highly discriminative. Our findings indicate that predicting grant success, at least with the considered set of bibliometric features, is not a trivial task.

Recent studies in complexity science have uncovered temporal regularities in the dynamics of impact along scientific and other creative careers, but they did not extend the obtained insights to firms. In this paper, we show that firms' technological impact patterns cannot be captured by the state-of-the-art dynamical models for the evolution of scientists' research impact, such as the Q model. Therefore, we propose a time-varying returns model which integrates the empirically-observed relation between patent order and technological impact into the Q model. The proposed model can reproduce the timing pattern of firms' highest-impact patents accurately. Our results shed light on modeling the differences behind the impact dynamics of researchers and firms.

This study uses the thorough data provided by the Leiden Ranking 2020 to support the claim that percentile-based indicators are linked by a power law function. A constant calculated from this function, ℯp, and the total number of papers fully characterize the percentile distribution of publications. According to this distribution, the probability that a publication from a country or institution is in the global xth percentile can be calculated from a simple equation: P = ℯp^(2 − lg x). By taking the Leiden Ranking PPtop10%/100 as an approximation of the ℯp constant, our results demonstrate that other PPtopx% indicators can be calculated applying this equation. Consequently, given a PPtopx% indicator, all the others are redundant with it. Even accepting that the total number of papers and a single PPtopx% indicator are sufficient to fully characterize the percentile distribution of papers, the results of comparisons between universities and research institutions differ depending on the percentile selected for the comparison. We discuss which Ptopx% and PPtopx% indicators are the most convenient for these comparisons in order to obtain reliable information that can be used in research policy.
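
The equation in this abstract can be evaluated directly. The sketch below assumes that the flattened formula reads P = ℯp^(2 − lg x), with lg the base-10 logarithm; this reading is consistent with the statement that PPtop10%/100 approximates ℯp, since x = 10 then gives P = ℯp. The function name and the example value of ℯp are hypothetical.

```python
import math


def prob_in_top_percentile(e_p, x):
    """P = e_p ** (2 - lg x): probability that a paper is in the global top x%.

    Assumed reading of the abstract's equation; x is a percentile in (0, 100].
    """
    if not 0 < x <= 100:
        raise ValueError("x must lie in (0, 100]")
    return e_p ** (2 - math.log10(x))


e_p = 0.12  # hypothetical value, i.e. a PPtop10% of 12%
estimates = {x: prob_in_top_percentile(e_p, x) for x in (1, 5, 10, 50)}
```

Under this reading, a single PPtopx% value fixes ℯp and therefore every other percentile share, which is the redundancy the abstract points out.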

The whys and wherefores of SciSci
The science of science (SciSci) is based on a transdisciplinary approach that uses large data sets to study the mechanisms underlying the doing of science—from the choice of a research problem to career trajectories and progress within a field. In a Review, Fortunato et al. explain that the underlying rationale is that with a deeper understanding of the precursors of impactful science, it will be possible to develop systems and policies that improve each scientist's ability to succeed and enhance the prospects of science as a whole.
Science, this issue p. eaao0185

Streptococcus pneumoniae is commonly found in the human nasopharynx and is the causative agent of multiple diseases. Since invasive pneumococcal infections are associated with encapsulated pneumococci, the capsular polysaccharide is the target of licensed pneumococcal vaccines. However, there is an increasing distribution of non-vaccine serotypes, as well as nonencapsulated S. pneumoniae (NESp). Both encapsulated and nonencapsulated pneumococci possess the polyamine oligo-transport operon (potABCD). Previous research has shown inactivation of the pot operon in encapsulated pneumococci alters protein expression and leads to a significant reduction in pneumococcal murine colonization, but the role of the pot operon in NESp is unknown. Here, we demonstrate deletion of potD from the NESp NCC1 strain MNZ67 does impact expression of the key proteins pneumolysin and PspK, but it does not inhibit murine colonization. Additionally, we show the absence of potD significantly increases biofilm production, both in vitro and in vivo. In a chinchilla model of otitis media (OM), the absence of potD does not significantly affect MNZ67 virulence, but it does significantly reduce the pathogenesis of the virulent encapsulated strain TIGR4 (serotype 4). Deletion of potD also significantly reduced persistence of TIGR4 in the lungs but increased persistence of PIP01 in the lungs. We conclude the pot operon is important for the regulation of protein expression and biofilm formation in both encapsulated and NCC1 nonencapsulated Streptococcus pneumoniae. However, in contrast to encapsulated pneumococcal strains, polyamine acquisition via the pot operon is not required for MNZ67 murine colonization, persistence in the lungs, or full virulence in a model of OM. Therefore, NESp virulence regulation needs to be further established to identify potential NESp therapeutic targets.

The distribution of scientific citations for publications selected with different rules (author, topic, institution, country, journal, etc.) collapses on a single curve if one plots the citations relative to their mean value. We find that the distribution of shares for Facebook posts rescales in the same manner to the very same curve as scientific citations. This finding suggests that citations are subjected to the same growth mechanism as Facebook popularity measures, being influenced by a statistically similar social environment and selection mechanism. In a simple master-equation approach, the exponential growth of the number of publications and a preferential selection mechanism lead to a Tsallis-Pareto distribution offering an excellent description of the observed statistics. Based on our model and on the data derived from PubMed, we predict that, according to the present trend, the average number of citations per scientific publication relaxes exponentially to about 4.

I show that the social stratification of academic science can arise as a result of academics' preference for reading work of high epistemic value. This is consistent with a view on which academic superstars are highly competent academics, but also with a view on which superstars arise primarily due to luck. I argue that stratification is beneficial if most superstars are competent, but not if most superstars are lucky. I also argue that it is impossible to tell whether most superstars are in fact competent or lucky, or which group a given superstar belongs to, and hence whether stratification is overall beneficial.

To quantify the mechanism of a complex network growth we focus on the network of citations of scientific papers and use a combination of the theoretical and experimental tools to uncover microscopic details of this network growth. Namely, we develop a stochastic model of citation dynamics based on copying/redirection/triadic closure mechanism. In a complementary and coherent way, the model accounts both for statistics of references of scientific papers and for their citation dynamics. Originating in empirical measurements, the model is cast in such a way that it can be verified quantitatively in every aspect. Such verification is performed by measuring citation dynamics of Physics papers. The measurements revealed nonlinear citation dynamics, the nonlinearity being intricately related to network topology. The nonlinearity has far-reaching consequences including non-stationary citation distributions, diverging citation trajectory of similar papers, runaways or "immortal papers" with infinite citation lifetime etc. Thus, our most important finding is nonlinearity in complex network growth. In a more specific context, our results can be a basis for quantitative probabilistic prediction of citation dynamics of individual papers and of the journal impact factor.

How to quantify the impact of a researcher's or an institution's body of work is a matter of increasing importance to scientists, funding agencies, and hiring committees. The use of bibliometric indicators, such as the h-index or the Journal Impact Factor, has become widespread despite their known limitations. We argue that most existing bibliometric indicators are inconsistent, biased, and, worst of all, susceptible to manipulation. Here, we pursue a principled approach to the development of an indicator to quantify the scientific impact of both individual researchers and research institutions grounded on the functional form of the distribution of the asymptotic number of citations. We validate our approach using the publication records of 1,283 researchers from seven scientific and engineering disciplines and the chemistry departments at the 106 U.S. research institutions classified as "very high research activity". Our approach has three distinct advantages. First, it accurately captures the overall scientific impact of researchers at all career stages, as measured by asymptotic citation counts. Second, unlike other measures, our indicator is resistant to manipulation and rewards publication quality over quantity. Third, our approach captures the time-evolution of the scientific impact of research institutions.

Hirsch's h-index is perhaps the most popular citation-based measure of scientific excellence. In 2013 G. Ionescu and B. Chopard proposed an agent-based model for this index to describe a publications and citations generation process in an abstract scientific community. With such an approach one can simulate a single scientist's activity, and by extension investigate the whole community of researchers. Even though this approach predicts the h-index from bibliometric data quite well, only a solution based on simulations was given. In this paper, we complete their results with exact, analytic formulas. What is more, due to our exact solution we are able to simplify the Ionescu-Chopard model, which allows us to obtain a compact formula for the h-index. Moreover, a simulation study designed to compare both the approximated and the exact solutions is included. The last part of this paper presents an evaluation of the obtained results on a real-world data set.
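
As a reminder of the quantity being modelled, the h-index of a citation record is the largest h such that h papers have at least h citations each. The sketch below computes it from raw citation counts; it illustrates the standard definition only and is not the Ionescu-Chopard model or the analytic formula derived in the paper. The function name `h_index` is chosen here for illustration.

```python
def h_index(citations):
    """Largest h such that at least h papers have >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank      # the rank-th paper still has enough citations
        else:
            break         # counts only decrease from here on
    return h


example = h_index([10, 8, 5, 4, 3])
```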

Citation between papers can be treated as a causal relationship. In addition, some citation networks have a number of similarities to the causal networks in network cosmology, e.g., similar in- and out-degree distributions. Hence, it is possible to model the citation network using network cosmology. The causal network models built on homogeneous spacetimes have some restrictions when describing some phenomena in citation networks, e.g., that hot papers receive more citations than other simultaneously published papers. We propose an inhomogeneous causal network model to model the citation network, the connection mechanism of which well expresses some features of citation. The node growth trend and degree distributions of the generated networks also fit those of some citation networks well.

The distribution of the number of academic publications as a function of citation count for a given year is remarkably similar from year to year. We measure this similarity as a width of the distribution and find it to be approximately constant from year to year. We show that simple citation models fail to capture this behaviour. We then provide a simple three-parameter citation network model using a mixture of local and global search processes which can reproduce the correct distribution over time. We use the citation network of papers from the hep-th section of arXiv to test our model. For this data, around 20% of citations use global information to reference recently published papers, while the remaining 80% are found using local searches. We note that this is consistent with other studies though our motivation is very different from previous work. Finally, we also find that the fluctuations in the size of an academic publication's bibliography are important for the model. This is not addressed in most models and needs further work.

The Matthew effect describes the phenomenon that in societies, the rich tend to get richer and the potent even more powerful. It is closely related to the concept of preferential attachment in network science, where the more connected nodes are destined to acquire many more links in the future than the auxiliary nodes. Cumulative advantage and success-breeds-success also both describe the fact that advantage tends to beget further advantage. The concept is behind the many power laws and scaling behaviour in empirical data, and it is at the heart of self-organization across social and natural sciences. Here, we review the methodology for measuring preferential attachment in empirical data, as well as the observations of the Matthew effect in patterns of scientific collaboration, socio-technical and biological networks, the propagation of citations, the emergence of scientific progress and impact, career longevity, the evolution of common English words and phrases, as well as in education and brain development. We also discuss whether the Matthew effect is due to chance or optimization, for example related to homophily in social systems or efficacy in technological systems, and we outline possible directions for future research.

Significance
Social scientists have long debated why similar individuals often experience drastically different degrees of success. Some scholars have suggested such inequality merely reflects hard-to-observe personal differences in ability. Others have proposed that one fortunate success may trigger another, thus producing arbitrary differentiation. We conducted randomized experiments through intervention in live social systems to test for success-breeds-success dynamics. Results show that different kinds of success (money, quality ratings, awards, and endorsements) when bestowed upon arbitrarily selected recipients all produced significant improvements in subsequent rates of success as compared with the control group of nonrecipients. However, greater amounts of initial success failed to produce much greater subsequent success, suggesting limits to the distortionary effects of social feedback.

Modeling distributions of citations to scientific papers is crucial for understanding how science develops. However, there is a considerable empirical controversy on which statistical model fits the citation distributions best. This paper is concerned with rigorous empirical detection of power-law behaviour in the distribution of citations received by the most highly cited scientific papers. We have used a large, novel data set on citations to scientific papers published between 1998 and 2002 drawn from Scopus. The power-law model is compared with a number of alternative models using a likelihood ratio test. We have found that the power-law hypothesis is rejected for around half of the Scopus fields of science. For these fields of science, the Yule, power-law with exponential cut-off and log-normal distributions seem to fit the data better than the pure power-law model. On the other hand, when the power-law hypothesis is not rejected, it is usually empirically indistinguishable from most of the alternative models. The pure power-law model seems to be the best model only for the most highly cited papers in “Physics and Astronomy”. Overall, our results seem to support theories implying that the most highly cited scientific papers follow the Yule, power-law with exponential cut-off or log-normal distribution. Our findings suggest also that power laws in citation distributions, when present, account only for a very small fraction of the published papers (less than 1% for most fields of science) and that the power-law scaling parameter (exponent) is substantially higher (from around 3.2 to around 4.7) than found in the older literature.

Reputation is an important social construct in science, enabling informed quality assessments of both publications and careers in the absence of complete systemic information. However, the relation between reputation and career growth remains poorly understood, despite the rapid growth of quantitative methods developed for research evaluation. We develop an original framework for measuring how citation paths are influenced by two distinct factors -- the scientific merit of each individual paper versus the reputation of its authors within the scientific community. To estimate their relative strength, we perform a longitudinal analysis of publication data for 450 leading scientists and find a citation crossover $c_{\times}$ which distinguishes the strength of the reputation effect. For papers with citations $c$ below $c_{\times}$, the author reputation dominates the citation rate; if $c\geq c_{\times}$ then the paper quality dominates the citation rate. Hence, papers may gain a significant early citation advantage if coauthored by authors already having high reputations in the scientific community. Concomitant to this reputation mechanism, we observe for top scientists a non-decreasing growth in both publications and citations, a pattern which reflects the amplifying role of social processes. As quantitative measures become increasingly common in the evaluation of scientific careers, we show that it is important to account for author reputation when estimating the intrinsic quality of research. Furthermore, our results also indicate a strong role played by reputation in the mentor matching process within academic institutions, in the effectiveness of double blinding in peer-review, and in other generic reputation systems.

Networks grow and evolve when new nodes and links are added in. There are two methods to add the links: uniform attachment and preferential attachment. We take account of the addition of links with mixed attachment between uniform attachment and preferential attachment in proportion. By using numerical simulations and analysis based on a continuum theory, we obtain that the degree distribution P(k) has an extended power-law form $P(k) \sim (k + k_0)^{-\gamma}$. When the number of edges k of a node is much larger than a certain value $k_0$, the degree distribution reduces to the power-law form $P(k) \sim k^{-\gamma}$; and when k is much smaller than $k_0$, the degree distribution degenerates into an exponential form. It has been found that degree distribution possesses this extended power-law form for many real networks, such as the movie actor network, the citation network of scientific papers and diverse protein interaction networks.
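
The two limits mentioned in this abstract follow from a short expansion of the extended power law. The lines below are a sketch of that reasoning, not a formula quoted from the paper: the exponential form written here is what the expansion gives, and the paper's own expression (missing from the text above) may be normalised differently.

```latex
% Extended power law, factored so that the ratio k/k_0 appears explicitly
P(k) \sim (k + k_0)^{-\gamma} = k_0^{-\gamma}\left(1 + \frac{k}{k_0}\right)^{-\gamma}

% Large-degree limit, k \gg k_0: the shift k_0 is negligible
P(k) \sim k^{-\gamma}

% Small-degree limit, k \ll k_0: use \ln(1 + u) \approx u with u = k/k_0
\left(1 + \frac{k}{k_0}\right)^{-\gamma}
  = \exp\!\left[-\gamma \ln\!\left(1 + \frac{k}{k_0}\right)\right]
  \approx \exp\!\left(-\frac{\gamma k}{k_0}\right)
```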

The principle that 'popularity is attractive' underlies preferential attachment, which is a common explanation for the emergence of scaling in growing networks. If new connections are made preferentially to more popular nodes, then the resulting distribution of the number of connections possessed by nodes follows power laws, as observed in many real networks. Preferential attachment has been directly validated for some real networks (including the Internet), and can be a consequence of different underlying processes based on node fitness, ranking, optimization, random walks or duplication. Here we show that popularity is just one dimension of attractiveness; another dimension is similarity. We develop a framework in which new connections optimize certain trade-offs between popularity and similarity, instead of simply preferring popular nodes. The framework has a geometric interpretation in which popularity preference emerges from local optimization. As opposed to preferential attachment, our optimization framework accurately describes the large-scale evolution of technological (the Internet), social (trust relationships between people) and biological (Escherichia coli metabolic) networks, predicting the probability of new links with high precision. The framework that we have developed can thus be used for predicting new links in evolving networks, and provides a different perspective on preferential attachment as an emergent phenomenon.

The concept of preferential attachment is behind the hubs and power laws seen in many networks. New results fuel an old debate about its origin, and beg the question of whether it is based on randomness or optimization. See Letter p.537

Recent science of science research shows that scientific impact measures for journals and individual articles have quantifiable regularities across both time and discipline. However, little is known about the scientific impact distribution at the scale of an individual scientist. We analyze the aggregate production and impact using the rank-citation profile $c_i(r)$ of 200 distinguished professors and 100 assistant professors. For the entire range of paper rank r, we fit each $c_i(r)$ to a common distribution function. Since two scientists with equivalent Hirsch h-index can have significantly different $c_i(r)$ profiles, our results demonstrate the utility of the $\beta_i$ scaling parameter in conjunction with $h_i$ for quantifying individual publication impact. We show that the total number of citations $C_i$ tallied from a scientist's $N_i$ papers scales as [Formula: see text]. Such statistical regularities in the input-output patterns of scientists can be used as benchmarks for theoretical models of career progress.

In this paper we examine a number of methods for probing and understanding the large-scale structure of networks that evolve over time. We focus in particular on citation networks, networks of references between documents such as papers, patents, or court cases. We describe three different methods of analysis, one based on an expectation-maximization algorithm, one based on modularity optimization, and one based on eigenvector centrality. Using the network of citations between opinions of the United States Supreme Court as an example, we demonstrate how each of these methods can reveal significant structural divisions in the network and how, ultimately, the combination of all three can help us develop a coherent overall picture of the network's shape.

Random networks with complex topology are common in Nature, describing systems as diverse as the world wide web or social and business networks. Recently, it has been demonstrated that most large networks for which topological information is available display scale-free features. Here we study the scaling properties of the recently introduced scale-free model, that can account for the observed power-law distribution of the connectivities. We develop a mean-field method to predict the growth dynamics of the individual vertices, and use this to calculate analytically the connectivity distribution and the scaling exponents. The mean-field method can be used to address the properties of two variants of the scale-free model, that do not display power-law scaling.

Recently we proposed a model in which when a scientist writes a manuscript, he picks up several random papers, cites them and also copies a fraction of their references (cond-mat/0305150). The model was stimulated by our discovery that a majority of scientific citations are copied from the lists of references used in other papers (cond-mat/0212043). It accounted quantitatively for several properties of empirically observed distribution of citations. However, important features, such as power-law distribution of citations to papers published during the same year and the fact that the average rate of citing decreases with aging of a paper, were not accounted for by that model. Here we propose a modified model: when a scientist writes a manuscript, he picks up several random recent papers, cites them and also copies some of their references. The difference with the original model is the word recent. We solve the model using methods of the theory of branching processes, and find that it can explain the aforementioned features of citation distribution, which our original model couldn't account for. The model can also explain "sleeping beauties in science", i.e., papers that are little cited for a decade or so, and later "awake" and get a lot of citations. Although much can be understood from purely random models, we find that to obtain a good quantitative agreement with empirical citation data one must introduce Darwinian fitness parameter for the papers.

This paper addresses several key issues in the ArnetMiner system, which aims at extracting and mining academic social networks. Specifically, the system focuses on: 1) Extracting researcher profiles automatically from the Web; 2) Integrating the publication data into the network from existing digital libraries; 3) Modeling the entire academic network; and 4) Providing search services for the academic network. So far, 448,470 researcher profiles have been extracted using a unified tagging approach. We integrate publications from online Web databases and propose a probabilistic framework to deal with the name ambiguity problem. Furthermore, we propose a unified modeling approach to simultaneously model topical aspects of papers, authors, and publication venues. Search services such as expertise search and people association search have been provided based on the modeling results. In this paper, we describe the architecture and main features of the system. We also present the empirical evaluation of the proposed methods.

Citation distributions are crucial for the analysis and modeling of the activity of scientists. We investigated bibliometric data of papers published in journals of the American Physical Society, searching for the type of function which best describes the observed citation distributions. We used the goodness of fit with Kolmogorov-Smirnov statistics for three classes of functions: log-normal, simple power law and shifted power law. The shifted power law turns out to be the most reliable hypothesis for all citation networks we derived, which correspond to different time spans. We find that citation dynamics is characterized by bursts, usually occurring within a few years since publication of a paper, and the burst size spans several orders of magnitude. We also investigated the microscopic mechanisms for the evolution of citation networks, by proposing a linear preferential attachment with time dependent initial attractiveness. The model successfully reproduces the empirical citation distributions and accounts for the presence of citation bursts as well.

For decades, we tacitly assumed that the components of such complex systems as the cell, the society, or the Internet are randomly wired together. In the past decade, an avalanche of research has shown that many real networks, independent of their age, function, and scope, converge to similar architectures, a universality that allowed researchers from different disciplines to embrace network theory as a common paradigm. The decade-old discovery of scale-free networks was one of those events that had helped catalyze the emergence of network science, a new research field with its distinct set of challenges and accomplishments.

We study the distributions of citations received by a single publication within several disciplines, spanning broad areas of science. We show that the probability that an article is cited c times has large variations between different disciplines, but all distributions are rescaled on a universal curve when the relative indicator $c_f = c/c_0$ is considered, where $c_0$ is the average number of citations per article for the discipline. In addition we show that the same universal behavior occurs when citation distributions of articles published in the same field, but in different years, are compared. These findings provide a strong validation of $c_f$ as an unbiased indicator for citation performance across disciplines and years. Based on this indicator, we introduce a generalization of the h index suitable for comparing scientists working in different fields.
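
The relative indicator in this abstract is simply each paper's citation count divided by the average for its discipline (and year). A minimal sketch follows, assuming the input list already contains only papers from one discipline and year; the function name `relative_citations` and the sample numbers are hypothetical.

```python
def relative_citations(citations):
    """Rescale citation counts by their mean: c_f = c / c_0.

    `citations` is assumed to hold the counts for one discipline and year,
    so c_0 is that group's average number of citations per article.
    """
    if not citations:
        raise ValueError("need at least one paper")
    c0 = sum(citations) / len(citations)
    if c0 == 0:
        raise ValueError("average citation count is zero; c_f is undefined")
    return [c / c0 for c in citations]


rescaled = relative_citations([2, 4, 6])  # c_0 = 4
```

After this rescaling the mean of the returned values is 1 by construction, which is what allows distributions from differently cited fields to be compared on one curve.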

In this study we introduce and analyze the statistical structural properties of a model of growing networks which may be relevant to social networks. At each step a new node is added which selects k possible partners from the existing network and joins them with probability delta by undirected edges. The "activity" of the node ends here; it will get new partners only if it is selected by a newcomer. The model produces an infinite-order phase transition when a giant component appears at a specific value of delta, which depends on k. The average component size is discontinuous at the transition. In contrast, the network behaves significantly differently for k=1. There is no giant component formed for any delta and thus in this sense there is no phase transition. However, the average component size diverges for delta ≥ 1/2.

We introduce a dynamical network model which unifies a number of network families which are individually known to exhibit q-exponential degree distributions. The present model dynamics incorporates static (nongrowing) self-organizing networks, preferentially growing networks, and (preferentially) rewiring networks. Further, it exhibits a natural random graph limit. The proposed model generalizes network dynamics to rewiring and growth modes which depend on internal topology as well as on a metric imposed by the space they are embedded in. In all of the networks emerging from the presented model we find q-exponential degree distributions over a large parameter space. We comment on the parameter dependence of the corresponding entropic index q for the degree distributions, and on the behavior of the clustering coefficients and neighboring connectivity distributions.

Changes in citation distributions over 100 years can reveal much about the evolution of the scientific communities or disciplines. The prevalence of uncited papers or of highly-cited papers, with respect to the bulk of publications, provides important clues as to the dynamics of scientific research. Using 25 million papers and 600 million references from the Web of Science over the 1900-2006 period, this paper proposes a simple model based on a random selection process to explain the "uncitedness" phenomenon and its decline in recent years. We show that the proportion of uncited papers is a function of 1) the number of articles published in a given year (the competing papers) and 2) the number of articles subsequently published (the citing papers) and the number of references they contain. Using uncitedness as a departure point, we demonstrate the utility of the stretched-exponential function and a form of the Tsallis function to fit complete citation distributions over the 20th century. As opposed to simple power-law fits, for instance, both these approaches are shown to be empirically well-grounded and robust enough to better understand citation dynamics at the aggregate level. Based on an expansion of these models, on our new understanding of uncitedness and on our large dataset, we are able to provide clear quantitative evidence and provisional explanations for an important shift in citation practices around 1960, unmatched in the 20th century. We also propose a revision of the "citation classic" category as a set of articles which is clearly distinguishable from the rest of the field.

Numerical data for the distribution of citations are examined for: (i) papers
published in 1981 in journals which are catalogued by the Institute for
Scientific Information (783,339 papers) and (ii) 20 years of publications in
Physical Review D, vols. 11-50 (24,296 papers). A Zipf plot of the number of
citations to a given paper versus its citation rank appears to be consistent
with a power-law dependence for leading rank papers, with exponent close to
-1/2. This, in turn, suggests that the number of papers with x citations, N(x),
has a large-x power law decay N(x)~x^{-alpha}, with alpha approximately equal
to 3.
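The step from the Zipf-plot exponent to the exponent of N(x) can be spelled out as follows. This is a standard rank-size argument written under the assumption of a continuous approximation; the symbols x(r), r(x) and β are introduced here for illustration only.

```latex
% Zipf plot: citations of the paper of rank r decay as a power law.
x(r) \sim r^{-\beta}, \qquad \beta \approx \tfrac{1}{2}.
% Inverting gives the rank, i.e. the number of papers with at least x citations.
r(x) = \int_x^{\infty} N(x')\,\mathrm{d}x' \sim x^{-1/\beta}.
% Differentiating with respect to x yields the citation distribution.
N(x) = -\frac{\mathrm{d}r}{\mathrm{d}x} \sim x^{-(1 + 1/\beta)}
\quad\Longrightarrow\quad
\alpha = 1 + \frac{1}{\beta} \approx 3 .
```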

The desire to predict discoveries—to have some idea, in advance, of what will be discovered, by whom, when, and where—pervades nearly all aspects of modern science, from individual scientists to publishers, from funding agencies to hiring committees. In this Essay, we survey the emerging and interdisciplinary field of the “science of science” and what it teaches us about the predictability of scientific discovery. We then discuss future opportunities for improving predictions derived from the science of science and its potential impact, positive and negative, on the scientific community.

Although statistical models fit many citation data sets reasonably well with the best fitting models being the hooked power law and discretised lognormal distribution, the fits are rarely close. One possible reason is that there might be more uncited articles than would be predicted by any model if some articles are inherently uncitable. Using data from 23 different Scopus categories, this article tests the assumption that removing a proportion of uncited articles from a citation dataset allows statistical distributions to have much closer fits. It also introduces two new models, zero inflated discretised lognormal distribution and the zero inflated hooked power law distribution and algorithms to fit them. In all 23 cases, the zero inflated version of the discretised lognormal distribution was an improvement on the standard version and in 15 out of 23 cases the zero inflated version of the hooked power law was an improvement on the standard version. Without zero inflation the discretised lognormal models fit the data better than the hooked power law distribution 6 out of 23 times and with it, the discretised lognormal models fit the data better than the hooked power law distribution 9 out of 23 times. Apparently uncitable articles seem to occur due to the presence of academic-related magazines in Scopus categories. In conclusion, future citation analysis and research indicators should take into account uncitable articles, and the best fitting distribution for sets of citation counts from a single subject and year is either the zero inflated discretised lognormal or zero inflated hooked power law.

There is no agreement over which statistical distribution is most appropriate for modelling citation count data. This is important because if one distribution is accepted then the relative merits of different citation-based indicators, such as percentiles, arithmetic means and geometric means, can be more fully assessed. In response, this article investigates the plausibility of the discretised lognormal and hooked power law distributions for modelling the full range of citation counts, with an offset of 1. The citation counts from 23 Scopus subcategories were fitted to hooked power law and discretised lognormal distributions but both distributions failed a Kolmogorov–Smirnov goodness of fit test in over three quarters of cases. The discretised lognormal distribution also seems to have the wrong shape for citation distributions, with too few zeros and not enough medium values for all subjects. The cause of poor fits could be the impurity of the subject subcategories or the presence of interdisciplinary research. Although it is possible to test for subject subcategory purity indirectly through a goodness of fit test in theory with large enough sample sizes, it is probably not possible in practice. Hence it seems difficult to get conclusive evidence about the theoretically most appropriate statistical distribution.

Identifying the statistical distribution that best fits citation data is
important to allow robust and powerful quantitative analyses. Whilst previous
studies have suggested that both the hooked power law and discretised lognormal
distributions fit better than the power law and negative binomial
distributions, no comparisons so far have covered all articles within a
discipline, including those that are uncited. Based on an analysis of 26
different Scopus subject areas in seven different years, this article reports
comparisons of the discretised lognormal and the hooked power law with citation
data, adding 1 to citation counts in order to include zeros. The hooked power
law fits better in two thirds of the subject/year combinations tested for
journal articles that are at least three years old, including most medical,
life and natural sciences, and for virtually all subject areas for younger
articles. Conversely, the discretised lognormal tends to fit best for arts,
humanities, social science and engineering fields. The difference between the
fits of the distributions is mostly small, however, and so either could
reasonably be used for modelling citation data. For regression analyses,
however, the best option is to use ordinary least squares regression applied to
the natural logarithm of citation counts plus one, especially for sets of
younger articles, because of the increased precision of the parameters.
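The regression recommended at the end of this abstract can be sketched without external libraries. The example assumes a single numeric predictor per article; the predictor, the function name `ols_log_citations` and the synthetic data are hypothetical and only illustrate the ln(1 + c) transform followed by ordinary least squares.

```python
import math


def ols_log_citations(x, citations):
    """Fit ln(1 + citations) = intercept + slope * x by ordinary least squares.

    x is one numeric predictor per article (hypothetical, e.g. number of
    authors); citations are raw counts, so zeros are allowed.
    """
    y = [math.log1p(c) for c in citations]  # natural log of (count + 1)
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic data generated so that ln(1 + c) = 1 + 0.5 * x exactly.
x = [0, 1, 2, 3]
citations = [math.exp(1 + 0.5 * xi) - 1 for xi in x]
print(ols_log_citations(x, citations))
```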

The citations to a set of academic articles are typically unevenly shared, with many articles attracting few citations and few attracting many. It is important to know more precisely how citations are distributed in order to help statistical analyses of citations, especially for sets of articles from a single discipline and a small range of years, as normally used for research evaluation. This article fits discrete versions of the power law, the lognormal distribution and the hooked power law to 20 different Scopus categories, using citations to articles published in 2004 and ignoring uncited articles. The results show that, despite its popularity, the power law is not a suitable model for collections of articles from a single subject and year, even for the purpose of estimating the slope of the tail of the citation data. Both the hooked power law and the lognormal distributions fit best for some subjects but neither is a universal optimal choice and parameter estimates for both seem to be unreliable. Hence only the hooked power law and discrete lognormal distributions should be considered for subject-and-year-based citation analysis in future and parameter estimates should always be interpreted cautiously.

We adapt and use methods from the causal set approach to quantum gravity to
analyse the structure of citation networks from academic papers on the arXiv,
supreme court judgements from the US, and patents. We exploit the causal
structure of citation networks to measure the dimension of the Minkowski
space in which these directed acyclic graphs can most easily be embedded,
explicitly taking time into account as one of the dimensions we are measuring.
We show that seemingly similar networks have measurably different dimensions.
Our interpretation is that a high dimension corresponds to diverse citation
behaviour while a low dimension indicates a narrow range of citations in a
field.

In this paper we deal with the problem of aggregating numeric sequences of arbitrary length that represent e.g. citation records of scientists. Impact functions are the aggregation operators that express as a single number not only the quality of individual publications, but also their author's productivity.
We examine some fundamental properties of these aggregation tools. It turns out that each impact function which always gives indisputable valuations must necessarily be trivial. Moreover, it is shown that for any set of citation records in which none is dominated by the other, we may construct an impact function that gives any a priori-established authors' ordering. Theoretically then, there is considerable room for manipulation in the hands of decision makers.
We also discuss the differences between the impact function-based and the multicriteria decision making-based approach to scientific quality management, and study how the introduction of new properties of impact functions affects the assessment process. We argue that simple mathematical tools like the h- or g-index (as well as other bibliometric impact indices) may not necessarily be a good choice when it comes to assessing scientific achievements.

We model a virtual scientific community in which authors publish and cite articles. Citations are attributed according to a preferential attachment mechanism. From the numerical simulations, the h-index can be computed. This bottom-up approach reproduces real bibliometric data well. We consider two versions of our model. (1) The single-scientist model is controlled by two parameters which can be tuned to reproduce the value of the h-index of many real scientists. Moreover, this model shows how the h-index grows with the number of citations, for a fixed number of articles. We also define an average h-index that can be used to compare the scientific productivity of institutions of different sizes. (2) The multi-scientist model considers a population of scientists and allows us to study the impact of removing citations from the low h-index researchers on the community. Simulations on real bibliometric data, as well as the predictions of the model, show that the h-index eco-system can be strongly affected by such a filtering.
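A preferential attachment mechanism of this kind can be illustrated with a toy simulation for one author. It is a sketch under stated assumptions and not the model of the paper: the number of papers is fixed, each new citation goes to a paper with probability proportional to its current count plus one (the offset lets uncited papers be chosen), and the names `simulate_citations` and `seed` are hypothetical.

```python
import random


def simulate_citations(n_papers, n_citations, seed=0):
    """Distribute n_citations over n_papers by preferential attachment.

    Assumption: a paper is picked with probability proportional to
    (current citations + 1). Returns the list of final citation counts.
    """
    rng = random.Random(seed)
    counts = [0] * n_papers
    for _ in range(n_citations):
        weights = [c + 1 for c in counts]
        (winner,) = rng.choices(range(n_papers), weights=weights, k=1)
        counts[winner] += 1
    return counts


counts = simulate_citations(n_papers=30, n_citations=500)
print(sorted(counts, reverse=True))
```

The sorted output is typically strongly skewed, with a few papers collecting a large share of the citations, which is the rich-get-richer behaviour the abstract refers to.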

The citation distribution of papers of selected individual authors was analyzed using five mathematical functions: power-law, stretched exponential, logarithmic, binomial and Langmuir-type. The former two functions have previously been proposed in the literature whereas the remaining three are novel and are derived following the concepts of growth kinetics of crystals in the presence of additives which act as inhibitors of growth. Analysis of the data of citation distribution of papers of the authors revealed that the value of the goodness-of-the-fit parameter R^2 was the highest for the empirical binomial relation, it was high and comparable for stretched exponential and Langmuir-type functions, relatively low for power law but it was the lowest for the logarithmic function. In the Langmuir-type function a parameter K, defined as Langmuir constant, characterizing the citation behavior of the authors has been identified. Based on the Langmuir-type function an expression for cumulative citations L relating the extrapolated value of citations l_0 corresponding to rank n = 0 for an author and his/her constant K and the number N of papers receiving l ≥ 1 citations is also proposed.

The success-breeds-success phenomenon is described by single- and multiple-urn models. It is shown that these models lead to a negative binomial distribution for the total number of successes and a Zipf-Mandelbrot law for the number of sources contributing a specified number of successes.

Two-dimensional informetrics is defined in the general context of sources that produce items and examples are given. These systems are called "Information Production Processes" (IPPs). They can be described by a size-frequency function f or, equivalently, by a rank-frequency function g. If f is a decreasing power law then we say that this function is the law of Lotka and it is equivalent to the power law g which is called the law of Zipf. Examples in WWW are given. Next we discuss the scale-free property of f also allowing for the interpretation of a Lotkaian IPP (i.e. for which f is the law of Lotka) as a self-similar fractal. Then we discuss dynamical aspects of (Lotkaian) IPPs by introducing an item-transformation ϕ and a source-transformation ψ. If these transformations are power functions we prove that the transformed IPP is Lotkaian and we present a formula for the exponent of the Lotka law. Applications are given on the evolution of WWW and on IPPs without low productive sources (e.g. sizes of countries, municipalities or databases). Lotka's law is then used to model the cumulative first citation distribution and examples of good fit are given. Finally, Lotka's law is applied to the study of performance indices such as the h-index (Hirsch) or the g-index (Egghe). Formulas are given for the h- and g-index in Lotkaian IPPs and applications are given.

A Cumulative Advantage Distribution is proposed which models statistically the situation in which success breeds success. It differs from the Negative Binomial Distribution in that lack of success, being a non-event, is not punished by increased chance of failure. It is shown that such a stochastic law is governed by the Beta Function, containing only one free parameter, and this is approximated by a skew or hyperbolic distribution of the type that is widespread in bibliometrics and diverse social science phenomena. In particular, this is shown to be an appropriate underlying probabilistic theory for the Bradford Law, the Lotka Law, the Pareto and Zipf Distributions, and for all the empirical results of citation frequency analysis. As side results one may derive also the obsolescence factor for literature use. The Beta Function is peculiarly elegant for these manifold purposes because it yields both the actual and the cumulative distributions in simple form, and contains a limiting case of an inverse square law to which many empirical distributions conform.

Recently several authors have proposed stochastic evolutionary models for the growth of complex networks that give rise to power-law distributions. These models are based on the notion of preferential attachment leading to the “rich get richer” phenomenon. Despite the generality of the proposed stochastic models, there are still some unexplained phenomena, which may arise due to the limited size of networks such as protein, e-mail, actor and collaboration networks. Such networks may in fact exhibit an exponential cutoff in the power-law scaling, although this cutoff may only be observable in the tail of the distribution for extremely large networks. We propose a modification of the basic stochastic evolutionary model, so that after a node is chosen preferentially, say according to the number of its inlinks, there is a small probability that this node will become inactive. We show that as a result of this modification, by viewing the stochastic process in terms of an urn transfer model, we obtain a power-law distribution with an exponential cutoff. Unlike many other models, the current model can capture instances where the exponent of the distribution is less than or equal to two. As a proof of concept, we demonstrate the consistency of our model empirically by analysing the Mathematical Research collaboration network, the distribution of which has been shown to be compatible with a power law with an exponential cutoff.

We propose a model of the evolution of the networks of scientific citations. The model takes an out-degree distribution (distribution of number of citations) and two parameters as input. The parameters capture the two main ingredients of the model: the aging of the relevance of papers and the formation of triangles when new papers cite old. We compare our model to three network structural quantities of an empirical citation network. We find a unique point in parameter space optimizing the match between the real and model data for all quantities. The optimal parameter values suggest that the impact of scientific papers, at least in the empirical data set we model, is proportional to the inverse of the number of papers since they were published.

This account of the Matthew effect is another small exercise in the psychosociological analysis of the workings of science as a social institution. The initial problem is transformed by a shift in theoretical perspective. As originally identified, the Matthew effect was construed in terms of enhancement of the position of already eminent scientists who are given disproportionate credit in cases of collaboration or of independent multiple discoveries. Its significance was thus confined to its implications for the reward system of science. By shifting the angle of vision, we note other possible kinds of consequences, this time for the communication system of science. The Matthew effect may serve to heighten the visibility of contributions to science by scientists of acknowledged standing and to reduce the visibility of contributions by authors who are less well known. We examine the psychosocial conditions and mechanisms underlying this effect and find a correlation between the redundancy function of multiple discoveries and the focalizing function of eminent men of science, a function which is reinforced by the great value these men place upon finding basic problems and by their self-assurance. This self-assurance, which is partly inherent, partly the result of experiences and associations in creative scientific environments, and partly a result of later social validation of their position, encourages them to search out risky but important problems and to highlight the results of their inquiry. A macrosocial version of the Matthew principle is apparently involved in those processes of social selection that currently lead to the concentration of scientific resources and talent (50).

This paper presents a model for author-paper networks, which is based on the assumption that authors are organized into groups and that, for each research topic, the number of papers published by a group is based on a success-breeds-success model. Collaboration between groups is modeled as random invitations from a group to an outside member. To analyze the model, a number of different metrics that can be obtained in author-paper networks were extracted. A simulation example shows that this model can effectively mimic the behavior of a real-world author-paper network, extracted from a collection of 900 journal papers in the field of complex networks.

I propose the index h, defined as the number of papers with citation number ≥h, as a useful index to characterize the scientific output of a researcher.
• citations
• impact
• unbiased
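The definition of the index h given above translates directly into code. The following is a minimal sketch; the function name `h_index` and the sample citation record are illustrative and not taken from the paper.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


# Hypothetical citation record of five papers.
print(h_index([10, 8, 5, 4, 3]))  # prints 4
```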

Reputation and impact in academic careers

- A M Petersen

Evolution of the social network of scientific collaborations

- A Barabási