Conference Paper

The Distribution of References in Scientific Papers: an Analysis of the IMRaD Structure

Marc Bertin¹, Iana Atanassova¹, Vincent Lariviere² and Yves Gingras³
¹ marc.bertin@paris-sorbonne.fr; iana.atanassova@paris-sorbonne.fr
Sens, Texte, Informatique, Histoire (STIH), Paris-Sorbonne University, 1 rue Victor Cousin,
75230 Paris cedex (France) and Observatoire des Sciences et des Technologies (OST), Centre Interuniversitaire
de Recherche sur la Science et la Technologie (CIRST), Université du Quebec à Montreal, CP 8888, Succ.
Centre-Ville, Montreal, QC. H3C 3P8 (Canada)
² vincent.lariviere@umontreal.ca
École de bibliothéconomie et des sciences de l’information, Université de Montréal, C.P. 6128,
Succ. Centre-Ville, Montréal, QC. H3C 3J7 (Canada) and Observatoire des Sciences et des Technologies (OST),
Centre Interuniversitaire de Recherche sur la Science et la Technologie (CIRST), Université du Quebec à
Montreal, CP 8888, Succ. Centre-Ville, Montreal, QC. H3C 3P8 (Canada)
³ gingras.yves@uqam.ca
Observatoire des Sciences et des Technologies (OST), Centre Interuniversitaire de Recherche sur la Science et la
Technologie (CIRST), Université du Quebec à Montreal, CP 8888, Succ. Centre-Ville, Montreal, QC. H3C 3P8
(Canada)
Abstract
The organization of scientific articles typically follows a standardized pattern, the well-known IMRaD structure
(Introduction, Methods, Results and Discussion). Using the PLOS series of journals as a case study, this paper
looks at how the bibliographic references are distributed along the different sections of papers. We use the
section titles of the articles to categorize the sections matching the IMRaD structure. We then identify the
variations in the basic IMRaD structure of the different PLOS journals. The results show that, though dominant,
the IMRaD structure often changes in some journals and these differences must be taken into account in order to
compare the distribution of references along the text using an invariant measure, here the number of sentences in
the texts. We examine the different distributions of the references in the articles in different journals and show
that these distributions are relatively stable and maybe even invariant when taking into account the inversions of
sections identified in some journals.
Introduction
The organization of scientific articles typically follows a standardized pattern, the well-known
IMRaD structure (Introduction, Methods, Results and Discussion). This structure imposed
itself in most major scientific journals in the mid-twentieth century and became the main
standard in the 1970s (Sollaci and Pereira, 2004). Many studies have focused on various
aspects of this structure: automatic classification of sentences in full-text (Agarwal and Yu,
2009), the effects of the use of the IMRaD style (Oriokot et al., 2011), creation of guidelines
for scientific writing (Kucer, 1985; Meadows, 1985; Day and Gastel, 2006), providing
structured abstracts (Nakayama et al., 2005) and editorial requirements (Barron, 2006).
Research question
This article investigates, from the viewpoint of bibliometrics, the relationships that exist
between cited references and the structure of the text. We are interested in the nature of the
distribution of references in scientific articles and, more precisely, in whether there exists a
typology of scientific writing and referencing practices. These characteristics of scientific papers
are studied here using the seven journals published by the Public Library of Science (PLOS),
which are peer-reviewed open-access publications covering all disciplines of the sciences and
social sciences. The free access to full text gives us the opportunity to use the PLOS journals
as a test corpus to establish the relation between the distribution of references throughout an
article and its structure. Our analysis consists of several steps: categorization of the sections of
the text according to section titles, segmentation into sentences in order to obtain the
distribution of the references according to the text progression, reconstruction of the IMRaD
structure, and examination of the distribution of the references in the different journals.
Our results provide an overview of the types of articles in the PLOS journals and show some
properties of the structure of research articles related to the sections and section titles. We
explore the relations between the types of articles and the IMRaD structure, and also the
relations between the types of sections and the references in the texts. Finally, we obtain a
graphical representation of the distribution of references in an article. The next section
presents the corpus of data and its structure. Then, we describe the processing carried out in
order to relate the IMRaD structure to the distribution of references in the articles.
Methods
We first categorize the sections, which allows us to work with the different types of sections
and reorder the sections in a text if necessary. This categorization aims to verify the coherence
of the corpus with the IMRaD structure. We then process the text content of all paragraphs in
order to segment them into sentences. This segmentation allows us to work with text elements
that are smaller than paragraphs so that we can associate the references with a given sentence
of the text and obtain their distribution along the text. Finally, our algorithm counts the
number of references in each sentence. This task is not trivial, as we will discuss later.
Data source
Founded in 2006, the Public Library of Science (PLOS) is an Open Access publisher of seven
peer-reviewed academic journals, mostly in the fields of biology and medicine. PLOS ONE,
the publisher's general journal, however, covers all fields of science and social sciences. For
this study, we have used the entire PLOS corpus up to September/October 2012. Table 1
presents the number of articles processed for each journal, as well as the average number of
sections and sentences per article. More than 47,000 journal articles were analyzed. As these 7
journals follow the same publication model but are in different scientific fields, our aim is to
observe the different uses of bibliographic references in these fields and their relation to the
structure of the articles. Table 1 shows that the average number of sections per article varies
between 3.48 and 4.74 depending on the journal. We can also observe that the average length
of articles differs: 125 sentences on average for PLOS Medicine, compared to 278
sentences on average for PLOS Computational Biology. The table also shows the relative
importance of PLOS ONE: papers published in this journal account for more than 71% of all
papers in the corpus.
Table 1. Descriptive Statistics on PLOS Journals
Data structure
PLOS provides access to the articles in the XML format. The set of XML elements and
attributes that are used for the representation of journal articles are known as Journal Article
Tag Suite (JATS), which is an application of Z39.96-2012 (ANSI, 2012). Some studies
(Carter, Funk and Mooney, 2012) give various applications of this standard. Technology
evolves quickly and we have to take into consideration that JATS is a continuation of the
NLM Archiving and Interchange DTD work by NCBI (http://dtd.nlm.nih.gov/).
The JATS structure of an article consists of three main elements, <front>, <body> and <back>,
where the textual content of the article is in the <body> element, which is further divided into
sections and paragraphs. The <front> element contains some traditional metadata fields (title,
authors, etc.) as well as the article type.
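The <front>/<body>/<back> layout described above can be illustrated with a minimal sketch using Python's standard xml.etree.ElementTree module. The sample document is hypothetical and heavily shortened, and the function name top_level_sections is our own; this is not the Java pipeline used in the study.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened JATS-like document (not taken from the corpus).
SAMPLE = """<article article-type="research-article">
  <front><article-meta><title-group>
    <article-title>Sample title</article-title>
  </title-group></article-meta></front>
  <body>
    <sec><title>Introduction</title><p>First paragraph.</p><p>Second paragraph.</p></sec>
    <sec><title>Results</title><p>Only paragraph.</p></sec>
  </body>
  <back><ref-list><ref id="b1"/></ref-list></back>
</article>"""

def top_level_sections(xml_text):
    """Return (title, paragraphs) for each top-level <sec> of <body>."""
    root = ET.fromstring(xml_text)
    sections = []
    for sec in root.findall("./body/sec"):
        title = (sec.findtext("title") or "").strip()
        # itertext() flattens inline markup such as <xref> or <italic> into plain text;
        # iter("p") also collects paragraphs of nested subsections.
        paragraphs = ["".join(p.itertext()).strip() for p in sec.iter("p")]
        sections.append((title, paragraphs))
    return sections

sections = top_level_sections(SAMPLE)
for title, paragraphs in sections:
    print(title, len(paragraphs))
```

Paragraphs of nested subsections are attached to their top-level section here, which is one possible choice when only the four IMRaD sections matter.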
Labels and section titles processing
The sections of the texts are categorized automatically by analyzing the section titles in order
to match the existing sections with one of the section types in the IMRaD structure. To do
this, we have examined the types of articles present in the corpus, where the typology is given
in the article’s metadata.
Segmentation processing
The first stage of the processing consists of parsing the XML trees and segmenting the text into
sentences. The JATS structure used by PLOS provides paragraph elements <p> as the finest
level of text segments. For our analysis, we needed segmentation into sentences and we
parsed the initial JATS trees in order to extract the relevant text segments from the article
body, as well as other elements such as sections, section titles, section numbers, paragraphs
and the bibliography. These data were stored in the DocBook format that was used as the
basis for the further processing.
Each paragraph was segmented into sentences by analyzing the punctuation of the text
following a set of typographic rules. All the occurrences of symbols denoting sentence
boundaries (point, exclamation mark, etc.) were examined and disambiguated. Figure 1 gives
some examples of points that occur within sentences but do not end them. Indeed, the
occurrence of a point in a text does not necessarily signal the end of a sentence,
because in many cases it is part of an abbreviation, a reference, a genus or species name, a
numeric value, etc.
1. , SE = 0.44, 0.041); and gene diversity from 0.39 (EMX-4) to 0.69 (LafMS03
2. the plastid genome is 0.92±0.03
3. an additional 115.0 ml
4. (Nyakaana and Arctander 1998; Fernando et al. 2001) and compared them
5. HB3 strain of P. falciparum, we demonstrate that at least 60%
6. (i.e., the kinase phosphatase
Figure 1. Examples of occurrences of ‘point’ that do not signal sentence ends.
We used a set of finite-state automata in order to determine the contexts in which the points
signal sentence ends. For this purpose, we have developed a Java application based on the
work of Mourad (2001). The algorithm uses a rule-based approach which disambiguates the
use of punctuation marks by examining the close context of their occurrences. All punctuation
marks in the text are thus labeled as “sentence end” or “no sentence end”. Some of the results
are presented in Table 2. These results illustrate a more general problem in NLP. Once we
have identified the sentence boundaries in the corpus, we can consider the sentences as the
finest textual unit and examine the number of references in each sentence. In fact, a sentence
can contain one or more references or an enumeration of references, which is rather frequent
in the background section or the introduction.
Table 2. Segmentation into sentences according to typographic rules
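The kind of context-based disambiguation described above can be sketched in a few lines of Python. This is only a toy illustration under our own assumptions: the abbreviation list, the rule on the following character and the function names are hypothetical, and the code does not reproduce the finite-state automata of the Java application or the rules of Mourad (2001).

```python
import re

# Illustrative abbreviation list; a realistic rule set would be much larger.
ABBREVIATIONS = {"al", "e.g", "i.e", "etc", "fig", "vs", "cf"}

# A boundary is expected to be followed by whitespace and then an uppercase
# letter, an opening quote or an opening bracket.
FOLLOWS_BOUNDARY = re.compile(r'\s+[A-Z"“(\[]')

def is_sentence_end(text, i):
    """Decide whether the '.', '!' or '?' at index i closes a sentence."""
    after = text[i + 1:]
    if after.strip() and not FOLLOWS_BOUNDARY.match(after):
        return False  # e.g. "0.92", "et al. 2001", "P. falciparum", "i.e.,"
    if text[i] != ".":
        return True
    # Inspect the alphabetic token just before the point.
    m = re.search(r"([A-Za-z.]+)$", text[:i])
    token = m.group(1) if m else ""
    if token.lower() in ABBREVIATIONS:
        return False
    if len(token) == 1 and token.isupper():
        return False  # single capital: an initial or an abbreviated genus
    return True

def split_sentences(text):
    """Split text at the punctuation marks labeled as sentence ends."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]", text):
        if is_sentence_end(text, m.start()):
            sentences.append(text[start:m.end()].strip())
            start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

# Hypothetical sample built from fragments similar to those in Figure 1.
sample = ("We used an additional 115.0 ml (Fernando et al. 2001). "
          "In the HB3 strain of P. falciparum, at least 60% survived "
          "(i.e., most cells). This was expected.")
for sentence in split_sentences(sample):
    print(sentence)
```

The sketch deliberately errs on the side of not splitting: a sentence that really ends with "et al." would be merged with the next one, which is the kind of case a fuller rule set has to handle.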
Reference processing
Our algorithm examines each sentence and counts the number of references it contains.
The input data is in the XML format, where the references are represented by
<xref> tags. However, counting these tags is not a reliable method to obtain the reference
counts and could bias the results. As shown in the example in Figure 2, some typographic
conventions for writing references mean that the XML structure does not mark up all of the
actual references. In this example, three sources are cited (51, 52 and 53), but only two
<xref> tags are present, which delimit a range from 51 to 53. As these cases are rather frequent
in the corpus (on average more than once per article), they must be taken into consideration.
Our algorithm covers all possible typographic variations for reference ranges and infers the
missing data from the input XML. As a result, we obtain the list of sentences in the text,
each associated with a reference count and a list of reference identifiers
corresponding to the bibliography entries.
“… during differentiation [<xref ref-type="bibr" rid="pbio-0030356-b51">51</xref>–<xref
ref-type="bibr" rid="pbio-0030356-b53">53</xref>]. This prediction …”
Figure 2. Example of a reference range rendered in XML
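The range inference can be sketched as follows. This is a minimal illustration, not the algorithm of the study: it assumes that the two <xref> tags delimiting a range are separated only by a dash character, and that the identifiers of the intermediate references can be rebuilt by incrementing the numeric suffix of the rid attribute. The function name cited_references is our own.

```python
import re
import xml.etree.ElementTree as ET

DASHES = {"-", "\u2013", "\u2014"}  # hyphen, en dash, em dash
RID_PATTERN = re.compile(r"(.*?)(\d+)$")  # prefix + trailing number

def cited_references(sentence_xml):
    """Return the rid of every cited reference, expanding ranges such as 51 to 53."""
    root = ET.fromstring("<s>" + sentence_xml + "</s>")
    rids = []
    range_open = False  # True when the previous <xref> was followed by a dash
    for xref in root.iter("xref"):
        if xref.get("ref-type") != "bibr":
            continue
        rid = xref.get("rid")
        if range_open and rids:
            start, end = RID_PATTERN.match(rids[-1]), RID_PATTERN.match(rid)
            if start and end and start.group(1) == end.group(1):
                width = len(start.group(2))
                # Add the identifiers strictly between the two range bounds.
                for n in range(int(start.group(2)) + 1, int(end.group(2))):
                    rids.append(f"{start.group(1)}{n:0{width}d}")
        rids.append(rid)
        range_open = (xref.tail or "").strip() in DASHES
    return rids

fragment = ('during differentiation [<xref ref-type="bibr" '
            'rid="pbio-0030356-b51">51</xref>\u2013<xref ref-type="bibr" '
            'rid="pbio-0030356-b53">53</xref>]. This prediction')
print(cited_references(fragment))
```

Enumerations separated by commas are left untouched, since each cited source then already has its own tag; only the dash between two tags triggers the expansion.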
Results
Article Level
Table 3 presents the different article types in the PLOS corpus, exploiting the metadata
present in the XML documents. The article types are identified using the contents of the
<article-meta> tag in the JATS structure. This table shows, as should be expected, that the
‘Research Article’ type is dominant, with 94% of the papers published. We notice, however, that
PLOS Medicine offers a wider variety of article types, such as ‘Perspective’, ‘Correspondence’,
‘Essay’ or ‘Policy Forum’.
Table 3. PLOS article typology study
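A count of article types such as the one summarized in Table 3 could be obtained along the following lines. The exact location of the article type inside <article-meta> is not stated in the text; the sketch assumes it is the <subject> of a <subj-group> with subj-group-type="heading" under <article-categories>, and the two sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed location of the article type inside <article-meta>.
TYPE_PATH = ("./front/article-meta/article-categories/"
             "subj-group[@subj-group-type='heading']/subject")

def article_type(xml_text):
    """Return the article type found in the metadata, or 'Unknown'."""
    root = ET.fromstring(xml_text)
    return (root.findtext(TYPE_PATH) or "Unknown").strip()

def make_sample(kind):
    """Build a hypothetical, minimal JATS-like document of the given type."""
    return ("<article><front><article-meta><article-categories>"
            "<subj-group subj-group-type='heading'>"
            f"<subject>{kind}</subject>"
            "</subj-group></article-categories></article-meta></front>"
            "<body/><back/></article>")

corpus = [make_sample("Research Article"),
          make_sample("Research Article"),
          make_sample("Essay")]
type_counts = Counter(article_type(doc) for doc in corpus)
print(type_counts.most_common())
```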
Section Level
We now concentrate our analysis on research articles, which account for the vast majority of
documents published by PLOS journals. The number of sections in the texts is particularly
important for our study and we first match the section titles with the section position in the
four sections of the IMRaD structure. Table 4 presents the results of the categorization of the
sections for the seven PLOS journals. We have analyzed all section titles that are present as a
separate element in the XML documents. We determine whether the section is part of the
IMRaD structure or not by identifying occurrences of “Introduction”, “Method”, “Result” and
“Discussion” with all possible variations, plurals, combinations, etc. Thus, we have created a
set of criteria for the categorization that cover the majority of the observed section titles. After
normalization, we have considered the subset of titles present in all journals, except for
“Supporting information” which was not considered because this type of sections is not part
of the scientific argumentation and serves as complementary information. Finally, to produce
Tables 4 to 6, we look at the position of the titles for each section. We check whether
Introduction corresponds to section one, Method to section two, Result to section three and
Discussion to section four.
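The categorization of titles and the position check can be sketched as below. The four regular expressions are illustrative assumptions, since the full set of criteria used in the study is not listed here; titles matching several types (for example a combined "Results and Discussion") are simply skipped in this sketch, which is our own simplification.

```python
import re
from collections import Counter

# Illustrative patterns only; the actual criteria are not given in the text.
PATTERNS = [
    ("I", re.compile(r"introduction", re.IGNORECASE)),
    ("M", re.compile(r"method", re.IGNORECASE)),
    ("R", re.compile(r"result", re.IGNORECASE)),
    ("D", re.compile(r"discussion", re.IGNORECASE)),
]

def categorize(title):
    """Return the tuple of IMRaD types whose keyword occurs in the title."""
    return tuple(kind for kind, rx in PATTERNS if rx.search(title))

def position_table(articles):
    """Count (type, position) pairs over a list of lists of section titles."""
    counts = Counter()
    for titles in articles:
        for position, title in enumerate(titles, start=1):
            kinds = categorize(title)
            if len(kinds) == 1:  # skip uncategorized and combined titles
                counts[(kinds[0], position)] += 1
    return counts

# Hypothetical section title sequences.
articles = [
    ["Introduction", "Materials and Methods", "Results", "Discussion"],
    ["Introduction", "Results", "Discussion", "Methods"],
    ["Introduction", "Results and Discussion", "Methods", "Supporting Information"],
]
table = position_table(articles)
for kind in "IMRD":
    print(kind, [table[(kind, pos)] for pos in range(1, 5)])
```

Dividing each row by the number of sections of that type gives percentages comparable to a type-by-position matrix, with the IMRaD order on the diagonal.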
Table 4 shows that PLOS Medicine and PLOS Neglected Tropical Diseases essentially follow
the IMRaD structure. The values on the diagonal of the matrix for PLOS Neglected Tropical
Diseases are well above 85%, which means that virtually all the articles follow the IMRaD
standard. In the case of PLOS Medicine, the values on the diagonal show that about half of the
papers follow the IMRaD structure, while the other half use section titles that did not allow
the automatic categorization of the sections. For both journals, the high values on the
diagonals indicate clearly that in almost all of the papers that include sections categorized as
Introduction, Method, Result and Discussion, these sections appear in the order defined by the
IMRaD structure. Hence, the first column, which corresponds to section one, never includes
Method, Results and Discussion. This is coherent with the structure generally presented in the
literature.
Table 4. Relation between position of section and title of section for PLOS Medicine and PLOS
Neglected Tropical Diseases
Table 5 shows the relation between the position of the sections and section titles for PLOS
ONE and PLOS Computational Biology. While the first value on the diagonal is
more than 99%, the other values on the diagonal are much lower (close to 50%), which indicates
that the usual order of sections in IMRaD is in fact changed. The Method section (second row)
can be found not only in section 2, as expected with IMRaD, but also in section 4, usually
reserved for Discussion. The standardization proposed for the extraction of titles takes
such variations into account. This inversion explains why the Results section often appears in
section 2 instead of 3, and why the methods are presented at the end of the article (section 4).
These papers do not completely respect the IMRaD structure and can therefore be expected to
present some variations in the distributions of references.
Table 5. Relation between position of section and title of section for PLOS ONE and PLOS
Computational Biology
Finally, Table 6 shows the equivalent results for PLOS Genetics, PLOS Pathogens and PLOS
Biology. We note that the distribution of sections and titles for these journals also differs from
IMRaD, with Methods coming last instead of second and Discussion third instead of fourth as
in the standard IMRaD structure.
Table 6. Relation between position of section and title of section for PLOS Genetics, PLOS
Pathogens and PLOS Biology
Knowing the structure of the text in terms of section headings and having reordered the
various texts in order to have a consistent order of sections, we can now present the
distribution of references along the texts of papers of the different journals. To do this, we
have used a subset of the corpus which contains only those research articles that contain the
four types of sections of the IMRaD structure. All the articles in this smaller corpus have at
least four sections that correspond to the types Introduction, Method, Result and Discussion
but these sections are not necessarily present in the same order in the text. Table 7 shows the
number of articles that fulfill these criteria. We can observe that this new corpus represents
82.98% of the corpus.
Table 7. Research articles containing the four section types
of the IMRaD structure
Distribution of References at the Sentence Level
Figure 3 presents the normalized distributions of the references throughout the texts for two
PLOS journals. The horizontal axis presents the text progression from 0 to 100 percent based
on the segmentation into sentences. The vertical axis gives the average percentage of the
number of references at a given point of the text for each corpus. We can observe that the first
10 percent of the texts in these corpora contain a relatively large number of references. The
three vertical lines on the graph indicate the average positions of the section boundaries.
Figure 3. Distribution of References in PLOS Medicine and PLOS Neglected Tropical Diseases
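One possible reading of the normalization used for these curves is sketched below: each article is divided into equal bins of text progression based on sentence positions, the share of the article's references falling in each bin is computed, and the shares are averaged over the articles. The number of bins and the function names are assumptions made for the illustration.

```python
def reference_profile(ref_counts, bins=100):
    """Percentage of an article's references falling in each progression bin.

    ref_counts holds the number of references of each sentence, in text order.
    """
    n, total = len(ref_counts), sum(ref_counts)
    profile = [0.0] * bins
    if n == 0 or total == 0:
        return profile
    for k, count in enumerate(ref_counts):
        b = k * bins // n  # sentence k of n mapped to a bin index
        profile[b] += 100.0 * count / total
    return profile

def average_profile(articles, bins=100):
    """Average the per-article profiles, ignoring articles without references."""
    profiles = [reference_profile(a, bins) for a in articles if sum(a) > 0]
    if not profiles:
        return [0.0] * bins
    return [sum(column) / len(profiles) for column in zip(*profiles)]

# Two hypothetical articles, 4 and 8 sentences long, with 4 bins for readability.
print(average_profile([[2, 0, 0, 2], [1, 1, 0, 0, 0, 0, 0, 0]], bins=4))
```

Because every article contributes percentages that sum to 100, long and short articles weigh equally in the average, which is what makes the number of sentences usable as a common scale.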
These results are consistent with what might be expected: references are more concentrated in
the introduction. The comparison of Tables 4, 5 and 6 with Figures 3, 4 and 5 shows that the
distributions of references are similar in the sets of journals having the same structure of
section titles. Indeed, Figure 3 shows that section 2, which according to Table 4 corresponds
to the Method section in a majority of articles in these two journals, contains fewer references
than the other sections. On the other hand, Table 6 shows that the Method section tends to be
at the end of the articles for three of the journals. This is consistent with the distribution of
references in Figure 5, where we can observe that the fourth section contains a smaller
number of references than the first three sections. These observations suggest that, if we take
into account the variations in the positions of sections, the distribution of references could be
very stable and nearly invariant.
[Figure 3 plot. Horizontal axis: Text progression (%), 0 to 100. Vertical axis: Percentage of references, 0 to 2.5. Series: PLOS Medicine, PLOS Neglected Tropical Diseases. Vertical lines separate Section 1 to Section 4; the boundary labels 13 and 45 are recoverable.]
Figure 4. Distribution of References in PLOS Computational Biology and PLOS ONE
Figure 5. Distribution of Reference in PLOS Genetics, PLOS Pathogens and PLOS Biology
Distribution of References for the ordered IMRaD structure
In order to study the distribution of references independently of the order in which the
sections of the IMRaD structure appear in the texts, we have reordered the sections in all
articles with respect to the order Introduction, Method, Result, Discussion. The reordered
articles were then used to produce the new distribution of references. Figure 6 shows the
distributions of references that were obtained for the 7 PLOS journals. We can observe that
the distributions for all seven journals share practically the same properties. The Introduction
sections contain a relatively large number of references, with a bigger concentration in the
first part of the Introduction. The Method section is characterized by a relatively smaller
number of references which grows bigger towards the Results and Discussion sections. The
“PLOS” curve on this graph corresponds to the distribution of references in the entire corpus.
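The reordering step can be sketched as a stable sort of the categorized sections, after which the per-sentence reference counts are concatenated in the new order. The one-letter section codes and the function names are our own conventions for the illustration.

```python
IMRAD_RANK = {"I": 0, "M": 1, "R": 2, "D": 3}

def reorder_imrad(sections):
    """Sort (type, per-sentence reference counts) pairs into IMRaD order.

    sorted() is stable, so sections of the same type keep their relative order.
    """
    return sorted(sections, key=lambda section: IMRAD_RANK[section[0]])

def flatten_counts(sections):
    """Concatenate the per-sentence reference counts of all sections."""
    return [count for _, counts in sections for count in counts]

# Hypothetical article in which Methods comes last.
article = [("I", [3, 1]), ("R", [0, 1]), ("D", [2]), ("M", [0, 0])]
reordered = reorder_imrad(article)
print([kind for kind, _ in reordered])
print(flatten_counts(reordered))
```

The flattened list can then be fed to the same binning used for the unordered texts, so that the curves of all journals are compared on identical section sequences.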
[Figure 4 plot. Horizontal axis: Text progression (%), 0 to 100. Vertical axis: Percentage of references, 0 to 3. Series: PLOS Computational Biology, PLOS ONE. Vertical lines separate Section 1 to Section 4.]
[Figure 5 plot. Horizontal axis: Text progression (%), 0 to 100. Vertical axis: Percentage of references, 0 to 2.5. Series: PLOS Genetics, PLOS Pathogens, PLOS Biology. Vertical lines separate Section 1 to Section 4.]
Figure 6. Distribution of References of PLOS journals, following the IMRaD structure
Conclusion
In this paper, we have shown that we can measure the distribution of references along the text
of articles using sentences as the counting unit. We have also shown that this distribution
seems quite stable and maybe even invariant if we take into account the changes that occur in
some journals in the positions of the different sections in the text of the articles. Knowing the
structure of the articles, we are now in a position to connect the references with their position
in the text in order to better characterize the kinds of references in terms of the nature of the
section in which they appear. It is plausible, for example, that the kinds of references present
in the Introduction section differ from those cited in the Method section.
While this could be done by hand using a small sample, the methods presented here are
applicable to very large data sets.
The results of this study might be of interest for citation context analysis or in case one wants
to assign different weights to citations according to their place in the document (see Bonzi,
1982; Rousseau, 1987). Our future work will focus on citation context analysis, as well as
examining the other correlations that might exist between the position in the text and the
nature of the references: their publication year or the subject category of the reference
journals.
References
Agarwal, S. and Yu, H. (2009). Automatically classifying sentences in full-text biomedical articles
into Introduction, Methods, Results and Discussion. Bioinformatics, 25(23): 3174-3180.
American National Standards Institute (2012). JATS: Journal Article Tag Suite. ANSI/NISO
Z39.96-2012, 9 August 2012. National Information Standards Organization (NISO). Available at:
http://www.niso.org/apps/group_public/download.php/8975/z39.96-2012.pdf
Barron, J. (2006). The Uniform Requirements for Manuscripts Submitted to Biomedical Journals
Recommended by the International Committee of Medical Journal Editors. Chest, 129(4):
1098-1099.
Bonzi, S. (1982). Characteristics of a Literature as Predictors of Relatedness Between Cited and Citing
Works. Journal of the American Society for Information Science and Technology (JASIST), 33(4):
208-216.
Carter, R., Funk, K., and Mooney, R. (2012). The Front Matters: Capturing Journal Front Matter with
JATS. Journal Article Tag Suite Conference (JATS-Con) Proceedings 2012. Bethesda (MD):
National Center for Biotechnology Information (US); 2012. Available at:
http://www.ncbi.nlm.nih.gov/books/NBK100353/
Day, R. A. and Gastel, B. (2006). How to Write and Publish Scientific Papers. Cambridge: Cambridge
University Press.
Kucer, S. (1985). The Making of Meaning: Reading and Writing as Parallel Processes. Written
Communication, 2(3): 317-336.
Meadows, A. (1985). The scientific paper as an archaeological artefact. Journal of Information
Science, 11(1): 27-30.
Mourad, G. (2001). Analyse informatique des signes typographiques pour la segmentation de textes et
l’extraction automatique des citations. Réalisation des Applications informatiques : SegATex et
CitaRE. PhD thesis, Univ. Paris-Sorbonne.
Nakayama, T., Hirai, N., Yamazaki, S., and Naito, M. (2005). Adoption of structured abstracts by general
medical journals and format for a structured abstract. Journal of the Medical Library Association,
93(2): 237.
Oriokot, L., Buwembo, W., Munabi, I., and Kijjambu, S. (2011). The introduction, methods, results
and discussion (IMRAD) structure: a Survey of its use in different authoring partnerships in a
students' journal. BMC Research Notes, 4(1): 250.
Rousseau, R. (1987). The Gozinto theorem: Using citations to determine influences on a scientific
publication. Scientometrics, 11(3-4): 217-229.
Sollaci, L. and Pereira, M. (2004). The introduction, methods, results, and discussion (IMRAD)
structure: a fifty-year survey. Journal of the Medical Library Association, 92(3): 364.
Article
In the field of bioinformatics, a large number of classical software becomes a necessary research tool. To measure the influence of scientific software as one kind of important intellectual products, a few strategies have been proposed to identify the software names from full texts of papers to collect the usage data of packages in bioinformatics research. However, the performance of these strategies is limited because of the highly imbalance of data in the full texts. This study proposes EnsembleSVMs-CRF, a two-step refinement strategy based on ensemble learning that gradually increases the sentences that contain software mentions to improve the performance of named entity recognition. The experiment on the bioinformatics corpus shows that the performance of EnsembleSVMs-CRF, in terms of the local F1 (78.81%) and the global F1-A (73.49%), is superior to the rule-based bootstrapping method and direct CRF. Application of this strategy to the articles published between 2013 and 2017 in 27 bioinformatics journals extracted 8,239 unique packages. The most popular 50 packages thus identified demonstrate that most of them are professional software which generally requires inter-discipline knowledge, rather than programming skill. Meanwhile, we found that researchers in bioinformatics tend to use free scientific software, and the application of general software is increasing compared with professional software.
Article
Science mapping using document networks comes often with the implicit assumption that scientific papers are indivisible units with unique links to neighbour documents. Research on proximity in co-citation analysis and the study of lexical properties of sections and citation contexts indicate that this assumption doesn’t always hold. Moreover, the meaning of words and co-words depends on the context in which they appear. This study proposes the use of a neural network architecture for word and paragraph embeddings (Doc2Vec) for the measurement of similarity among those smaller units of analysis. It is shown that paragraphs in the “Introduction” and the “Discussion” Section are more similar to the abstract, that the similarity among paragraphs is related to -but not linearly- the distance between the paragraphs. The “Methodology” Section is least similar to the other sections. Abstracts of citing-cited documents are more similar than random pairs and the context in which a reference appears is most similar to the abstract of the cited document. This novel approach with higher granularity can be used for bibliometric aided retrieval and to assist in measuring interdisciplinarity through the application of network-based centrality measures.
Article
Full-text available
Digital strategies for dissemination to decision makers of the results of the researchers in the public health field
Article
Digital libraries suffer from the problem of information overload due to immense proliferation of research papers in journals and conference papers. This makes it challenging for researchers to access the relevant research papers. Fortunately, research paper recommendation systems offer a solution to this dilemma by filtering all the available information and delivering what is most relevant to the user. Researchers have proposed numerous approaches for research paper recommendation which are based on metadata, content, citation analysis, collaborative filtering, etc. Approaches based on citation analysis, including co-citation and bibliographic coupling, have proven to be significant. Researchers have extended the co-citation approach to include content analysis and citation proximity analysis and this has led to improvement in the accuracy of recommendations. However, in co-citation analysis, similarity between papers is discovered based on the frequency of co-cited papers in different research papers that can belong to different areas. Bibliographic coupling, on the other hand, determines the relevance between two papers based on their common references. Therefore, bibliographic coupling has inherited the benefits of recommending relevant papers; however, traditional bibliographic coupling does not consider the citing patterns of common references in different logical sections of the citing papers. Since the use of citation proximity analysis in co-citation has improved the accuracy of paper recommendation, this paper proposes a paper recommendation approach that extends the traditional bibliographic coupling by exploiting the distribution of citations in logical sections in bibliographically coupled papers. Comprehensive automated evaluation utilizing Jensen Shannon Divergence was conducted to evaluate the proposed approach. The results showed significant improvement over traditional bibliographic coupling and content-based research paper recommendation.
Article
The multiple schema for the classification of soils rely on differing criteria but the major soil science systems, including the United States Department of Agriculture (USDA) and the international harmonized World Reference Base for Soil Resources soil classification systems, are primarily based on inferred pedogenesis. Largely these classifications are compiled from individual observations of soil characteristics within soil profiles, and the vast majority of this pedologic information is contained in non-quantitative text descriptions. We present initial text mining analyses of parsed text in the digitally available USDA soil taxonomy documentation and the Soil Survey Geographic database. Previous research has shown that latent information structure can be extracted from scientific literature using Natural Language Processing techniques, and we show that this latent information can be used to expedite query performance by using syntactic elements and part-of-speech tags as indices. Technical vocabulary often poses a text mining challenge due to the rarity of its diction in the broader context. We introduce an extension to the common English vocabulary that allows for nearly-complete indexing of USDA Soil Series Descriptions.
Conference Paper
Information within published papers around the world in scientific journals are structured in the format of Introduction, Methodology, Results, and Conclusion (IMRaD). Human ability to read and analyze is not capable of processing these large amounts of information. If we could identify the structure and consequently extract it to a user who needs a part of the structure, particularly an article in a foreign language, time will be saved as result. Computational approaches like Machine Learning (ML) and Natural Language Processing (NLP) have been widely used for similar purposes. However, it is very important to identify which one, or which group of classifiers work better for a specific kind of problem. The objective of this work is to identify applicable classifiers by analyzing and comparing results produced by different ML classifiers used in locating and classifying sentences from abstract of a paper into the IMRaD structure. This work demonstrates the possibility of integrating ML and NLP for the articles' sentence classification based on the IMRaD structure. It also verifies that it is possible to achieve good results with simple implementations without the need of too many computational resources.
Article
This article presents an investigation of the role of social relations in the writing of scientific articles through the study of in‐text citations. Does the fact that the author of an article knows the author whose work he or she cites have an impact on the context of the citation? Because citations are commonly used as criteria for research evaluation, it is important to question their social background to better understand how it impacts textual features. We studied a collection of science articles (N = 123) from 5 disciplines and interviewed their authors (N = 84) to: (a) identify the social relations between citing and cited authors; and (b) measure the correlation between a set of features related to in‐text citations (N = 6,956) and the identified social relations. Our pioneering work, mixing sociological and linguistic results, shows that social relations between authors can partly explain the variations of citations in terms of frequency, position and textual context.
Article
Globally, the role of universities as providers of research education, in addition to leading in mainstream research, is gaining importance with the demand for evidence-based practices. This paper describes the effect of various student and faculty authoring partnerships on the use of the IMRAD style of writing for a university student journal. This was an audit of the Makerere University Students' Journal publications over an 18-year period. Details of the authors' affiliation, year of publication, composition of the authoring teams and use of IMRAD formatting were noted. Data analysis gave results summarised as frequencies and as effect sizes from correlations and nonparametric tests. There were 209 articles found, the earliest from 1990 and the latest from 2007, of which 48.3% were authored by faculty-only teams, 41.1% by student-only teams, 6.2% by student and faculty teams, and 4.3% had no contribution from the above-mentioned teams. There were significant correlations between the different teams and the years of publication (rs = -0.338, p < 0.01, one-tailed). Use of the IMRAD formatting was significantly affected by the composition of the teams (χ2 (2 df) = 25.621, p < 0.01), especially when comparing the student-only teams to the faculty-only teams (U = 3165, r = -0.289). There was a significant trend towards student-only teams over the years sampled (z = -4.764, r = -0.34). In the surveyed publications, there was evidence of reduced faculty-student authoring teams, as shown by the trends towards student-only authoring teams and reduced use of IMRAD formatting in articles published in the students' journal. Since the university is expected to lead in the teaching of research, there is a need for increased support for undergraduate research as a starting point for research education.
Article
The use of a structured abstract has been recommended in reporting medical literature to quickly convey necessary information to editors and readers. The use of structured abstracts increased during the mid-1990s; however, recent practice has yet to be analyzed. This article explored actual reporting patterns of abstracts recently published in selected medical journals and examined what these journals required of abstracts (structured or otherwise and, if structured, which format). The top thirty journals according to impact factors noted in the "Medicine, General and Internal" category of the ISI Journal Citation Reports (2000) were sampled. Articles of original contributions published by each journal in January 2001 were examined. Cluster analysis was performed to classify the patterns of structured abstracts objectively. Journals' instructions to authors for writing an article abstract were also examined. Among 304 original articles that included abstracts, 188 (61.8%) had structured and 116 (38.2%) had unstructured abstracts. One hundred twenty-five (66.5%) of the abstracts used the introduction, methods, results, and discussion (IMRAD) format, and 63 (33.5%) used the 8-heading format proposed by Haynes et al. Twenty-one journals requested structured abstracts in their instructions to authors; 8 journals requested the 8-heading format; and 1 journal requested it only for intervention studies. Even in recent years, not all abstracts of original articles are structured. The eight-heading format was neither commonly used in actual reporting patterns nor noted in journal instructions to authors.
Article
Scientific research is typically communicated via papers in journals. To an outsider, the contents of these papers appear to be mystic and wonderful: to an insider, they convey rapidly and efficiently information about the research that has been done. Even to scientists, it may not be obvious that their papers provide the simplest way of communicating research. However, a detailed study of why papers are constructed as they are suggests that the layout is a consequence of a long evolution aimed at simplifying the complexity of scientific communication.
Article
A preliminary investigation was conducted to explore which characteristics of citing and cited works may aid in determining relatedness between documents. Thirteen variables were tested on 31 library/information science articles containing nearly 500 citations. Analysis indicates that source of cited work, source of citing work, number of times a work is cited in text, and type of citing article show promise of predicting relatedness between citing and cited works.
Article
This paper gives a mathematical technique to study influences, using citations. Taking into account both the publications that have a direct influence and those that have an indirect influence, we obtain the total influence measure on a fixed paper.
Article
Biomedical texts can be typically represented by four rhetorical categories: Introduction, Methods, Results and Discussion (IMRAD). Classifying sentences into these categories can benefit many other text-mining tasks. Although many studies have applied different approaches for automatically classifying sentences in MEDLINE abstracts into the IMRAD categories, few have explored the classification of sentences that appear in full-text biomedical articles. We first evaluated whether sentences in full-text biomedical articles could be reliably annotated into the IMRAD format and then explored different approaches for automatically classifying these sentences into the IMRAD categories. Our results show an overall annotation agreement of 82.14% with a Kappa score of 0.756. The best classification system is a multinomial naïve Bayes classifier trained on manually annotated data that achieved 91.95% accuracy and an average F-score of 91.55%, which is significantly higher than baseline systems. A web version of this system is available online at http://wood.ims.uwm.edu/full_text_classifier/. Contact: hongyu@uwm.edu
Article
The scientific article in the health sciences evolved from the letter form and purely descriptive style in the seventeenth century to a very standardized structure in the twentieth century known as introduction, methods, results, and discussion (IMRAD). The pace at which this structure began to be used, and when it became the most used standard of today's scientific discourse in the health sciences, is not well established. The purpose of this study is to point out the period in time during which the IMRAD structure was definitively and widely adopted in medical scientific writing. In a cross-sectional study, the frequency of articles written under the IMRAD structure was measured from 1935 to 1985 in a randomly selected sample of articles published in four leading journals in internal medicine: the British Medical Journal, JAMA, The Lancet, and the New England Journal of Medicine. The IMRAD structure, in those journals, began to be used in the 1940s. In the 1970s, it reached 80% and, in the 1980s, was the only pattern adopted in original papers. Although recommended since the beginning of the twentieth century, the IMRAD structure was adopted as a majority only in the 1970s. The influence of other disciplines and the recommendations of editors are among the facts that contributed to authors adhering to it.
Carter, R., Funk, K., and Mooney, R. (2012). The Front Matters: Capturing Journal Front Matter with JATS. Journal Article Tag Suite Conference (JATS-Con) Proceedings 2012. Bethesda (MD): National Center for Biotechnology Information (US); 2012. Available at: http://www.ncbi.nlm.nih.gov/books/NBK100353/