The Distribution of References in Scientific Papers: an Analysis of the
IMRaD Structure
Marc Bertin (1), Iana Atanassova (1), Vincent Lariviere (2) and Yves Gingras (3)
(1) marc.bertin@paris-sorbonne.fr; iana.atanassova@paris-sorbonne.fr
Sens, Texte, Informatique, Histoire (STIH), Paris-Sorbonne University, 1 rue Victor Cousin
75230 Paris cedex (France) and Observatoire des Sciences et des Technologies (OST), Centre Interuniversitaire
de Recherche sur la Science et la Technologie (CIRST), Université du Quebec à Montreal, CP 8888, Succ.
Centre-Ville, Montreal, QC. H3C 3P8 (Canada)
(2) vincent.lariviere@umontreal.ca
École de bibliothéconomie et des sciences de l’information, Université de Montréal, C.P. 6128,
Succ. Centre-Ville, Montréal, QC. H3C 3J7 (Canada) and Observatoire des Sciences et des Technologies (OST),
Centre Interuniversitaire de Recherche sur la Science et la Technologie (CIRST), Université du Quebec à
Montreal, CP 8888, Succ. Centre-Ville, Montreal, QC. H3C 3P8 (Canada)
(3) gingras.yves@uqam.ca
Observatoire des Sciences et des Technologies (OST), Centre Interuniversitaire de Recherche sur la Science et la
Technologie (CIRST), Université du Quebec à Montreal, CP 8888, Succ. Centre-Ville, Montreal, QC. H3C 3P8
(Canada)
Abstract
The organization of scientific articles typically follows a standardized pattern, the well-known IMRaD structure
(Introduction, Methods, Results and Discussion). Using the PLOS series of journals as a case study, this paper
looks at how the bibliographic references are distributed along the different sections of papers. We use the
section titles of the articles to categorize the sections matching the IMRaD structure. We then identify the
variations in the basic IMRaD structure of the different PLOS journals. The results show that, though dominant,
the IMRaD structure often changes in some journals and these differences must be taken into account in order to
compare the distribution of references along the text using an invariant measure, here the number of sentences in
the texts. We examine the different distributions of the references in the articles in different journals and show
that these distributions are relatively stable and maybe even invariant when taking into account the inversions of
sections identified in some journals.
Introduction
The organization of scientific articles typically follows a standardized pattern, the well-known
IMRaD structure (Introduction, Methods, Results and Discussion). This structure imposed
itself in most major scientific journals in the mid-twentieth century and became the main
standard in the 1970s (Sollaci and Pereira, 2004). Many studies have focused on various
aspects of this structure: automatic classification of sentences in full-text (Agarwal and Yu,
2009), the effects of the use of the IMRaD style (Oriokot et al., 2011), creation of guidelines
for scientific writing (Kucer, 1985; Meadows, 1985; Day and Gastel, 2006), providing
structured abstracts (Nakayama et al., 2005) and editorial requirements (Barron, 2006).
Research question
This article investigates, from the viewpoint of bibliometrics, the relationships that exist
between cited references and the structure of the text. What interests us is the nature of the
distribution of references in scientific articles and, more precisely, whether there exists a typology of
scientific writing and referencing practices. These characteristics of scientific papers are
studied here using the seven journals published by the Public Library of Science (PLOS),
which are peer-reviewed open-access publications covering all disciplines of sciences and
social sciences. The free access to full text gives us the opportunity to use the PLOS journals
as a test corpus to establish the relation between the distribution of references throughout an
article and its structure. Our analysis consists of several steps: categorisation of the sections of
the text according to section titles, segmentation into sentences in order to obtain the
distribution of the references according to the text progression, reconstruction of the IMRaD
structure and the examination of the distribution of the references in the different journals.
Our results provide an overview of the types of articles in the PLOS journals and show some
properties of the structure of research articles related to the sections and section titles. We
explore the relations between the types of articles and the IMRaD structure, and also the
relations between the types of sections and the references in the texts. Finally, we obtain a
graphical representation of the distribution of references in an article. The next section
presents the corpus of data and its structure. Then, we describe the processing carried out in
order to relate the IMRaD structure to the distribution of references in the articles.
Methods
We first categorize the sections, which allows us to work with the different types of sections
and reorder the sections in a text if necessary. This categorization aims to verify the coherence
of the corpus with the IMRaD structure. We then process the text content of all paragraphs in
order to segment them into sentences. This segmentation allows us to work with text elements
that are smaller than paragraphs so that we can associate the references with a given sentence
of the text and obtain their distribution along the text. Finally, our algorithm counts the
number of references in each sentence. This task is not trivial, as we will discuss later.
Data source
Founded in 2006, the Public Library of Science (PLOS) is an Open Access publisher of seven
peer-reviewed academic journals, mostly in the fields of biology and medicine. PLOS ONE,
the publisher’s general journal, however, covers all fields of science and social sciences. For
this study, we have used the entire PLOS corpus up to September/October 2012. Table 1
presents the number of articles processed for each journal, as well as the average number of
sections and sentences per article. More than 47,000 journal articles were analyzed. As these 7
journals follow the same publication model but are in different scientific fields, our aim is to
observe the different uses of bibliographic references in these fields and their relation to the
structure of the articles. Table 1 shows that the average number of sections per article varies
between 3.48 and 4.74 according to the journal. We can also observe that the average length
of articles differs: 125 sentences on average for PLOS Medicine, compared to 278
sentences on average for PLOS Computational Biology. The table also shows the relative
importance of PLOS ONE: papers published in this journal account for more than 71% of all
papers in the corpus.
Table 1. Descriptive Statistics on PLOS Journals
Data structure
PLOS provides access to the articles in the XML format. The set of XML elements and
attributes that are used for the representation of journal articles are known as Journal Article
Tag Suite (JATS), which is an application of Z39.96-2012 (ANSI, 2012). Some studies
(Carter, Funk and Mooney, 2012) give various applications of this standard. Technology
evolves quickly and we have to take into consideration that JATS is a continuation of the
NLM Archiving and Interchange DTD work by NCBI (http://dtd.nlm.nih.gov/).
The JATS structure of an article consists of three main elements front – body – back, where
the textual content of the article is in the body element, which is further divided into sections
and paragraphs. The <front> tag contains some traditional fields of metadata (title, authors,
etc.) as well as the article type.
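The front, body and back layout described above can be illustrated with a short Python sketch. It is not the processing chain used in this study: the sample document is hand-written, the function name read_article is hypothetical, and reading the article-type attribute of the root element is only one possible way to obtain the article type.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written JATS-like document (not taken from the PLOS corpus).
SAMPLE = """<article article-type="research-article">
  <front>
    <article-meta>
      <title-group><article-title>A sample title</article-title></title-group>
    </article-meta>
  </front>
  <body>
    <sec><title>Introduction</title><p>First paragraph.</p><p>Second one.</p></sec>
    <sec><title>Results</title><p>Only paragraph.</p></sec>
  </body>
  <back><ref-list><ref id="b1"/></ref-list></back>
</article>"""


def read_article(xml_text):
    """Return the article type, title and top-level sections of a JATS document."""
    root = ET.fromstring(xml_text)
    sections = []
    # Only direct children of <body> are treated as top-level sections.
    for sec in root.findall("./body/sec"):
        title = sec.findtext("title", default="").strip()
        # itertext() flattens inline markup (e.g. <xref>) inside each paragraph.
        paragraphs = ["".join(p.itertext()).strip() for p in sec.findall("p")]
        sections.append({"title": title, "paragraphs": paragraphs})
    return {
        "article_type": root.get("article-type"),
        "title": root.findtext("./front/article-meta/title-group/article-title"),
        "sections": sections,
    }
```

Calling read_article(SAMPLE) yields two sections, "Introduction" with two paragraphs and "Results" with one, together with the article type read from the root element.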
Labels and section titles processing
The sections of the texts are categorized automatically by analyzing the section titles in order
to match the existing sections with one of the section types in the IMRaD structure. To do
this, we have examined the types of articles present in the corpus, where the typology is given
in the article’s metadata.
Segmentation processing
The first stage of the processing consists of parsing the XML trees and segmenting the text into
sentences. The JATS structure used by PLOS provides paragraph elements <p> as the finest
level of text segments. For our analysis, we needed segmentation into sentences and we
parsed the initial JATS trees in order to extract the relevant text segments from the article
body, as well as other elements such as sections, section titles, section numbers, paragraphs
and the bibliography. These data were stored in the DocBook format that was used as the
basis for the further processing.
Each paragraph was segmented into sentences by analysing the punctuation of the text
following a set of typographic rules. All the occurrences of symbols denoting sentence
boundaries (period, exclamation mark, etc.) were examined and disambiguated. Figure 1 gives
examples of periods that occur inside sentences without ending them. The occurrence of a
period in a text does not necessarily mark a sentence end, because in many cases it is part of
an abbreviation, a reference, a genus-species name, a numeric value, etc.
1. , SE = 0.44, 0.041); and gene diversity from 0.39 (EMX-4) to 0.69 (LafMS03
2. the plastid genome is 0.92±0.03
3. an additional 115.0 ml
4. (Nyakaana and Arctander 1998; Fernando et al. 2001) and compared them
5. HB3 strain of P. falciparum, we demonstrate that at least 60%
6. (i.e., the kinase phosphatase
Figure 1. Examples of occurrences of a period that do not signal sentence ends.
We used a set of finite-state automata to determine the contexts in which periods
signal sentence ends. For this purpose, we have developed a Java application based on the
work of Mourad (2001). The algorithm uses a rule-based approach which disambiguates the
use of punctuation marks by examining the close context of their occurrences. All punctuation
marks in the text are thus labeled as “sentence end” or “no sentence end”. Some of the results
are presented in Table 2. These results illustrate a more general problem in NLP. Once we
have identified the sentence boundaries in the corpus, we can consider the sentence as the
finest textual unit and examine the number of references in each sentence. A sentence
can contain one or more references or an enumeration of references, which is rather frequent
in the background section or the introduction.
Table 2. Segmentation into sentences according to typographic rules
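The kind of context-based disambiguation described in this section can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the Java application or the finite-state automata used in the study: the abbreviation list, the two rules and the function names are assumptions chosen to handle cases similar to those in Figure 1.

```python
import re

# Abbreviations after which a period does not end a sentence (illustrative list).
ABBREVIATIONS = {"al", "e.g", "i.e", "etc", "fig", "vs", "cf"}

# Candidate boundary: ".", "!" or "?" followed by whitespace and then an
# uppercase letter, an opening bracket or quote, or by the end of the text.
# Periods inside numbers ("0.44") or before a lowercase word never match.
CANDIDATE = re.compile(r'[.!?](?=\s+[A-Z(\["“]|\s*$)')


def is_sentence_end(text, pos):
    """Label the punctuation mark at text[pos] as sentence end (True) or not."""
    if text[pos] != ".":
        return True
    before = text[:pos].split()
    token = before[-1] if before else ""
    word = token.lstrip("([").lower()
    if word in ABBREVIATIONS:
        return False  # "et al.", "i.e.", ...
    if len(word) == 1 and token[-1].isupper():
        return False  # single capital initial, e.g. "J." or a genus initial
    return True


def split_sentences(text):
    """Split a paragraph into sentences using the rules above."""
    sentences, start = [], 0
    for match in CANDIDATE.finditer(text):
        if is_sentence_end(text, match.start()):
            sentences.append(text[start:match.end()].strip())
            start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

For instance, the paragraph "The HB3 strain of P. falciparum was used (Fernando et al. 2001). The plastid genome is 0.92. See Smith et al. In contrast, J. Doe disagrees." is split into three sentences. The last one shows a limitation of such simple rules: a sentence that really ends with "et al." is merged with the following one.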
Reference processing
Our algorithm examines each sentence and counts the number of references present in
it. The input data is in the XML format, where the references are represented by
<xref> tags. However, counting these tags is not a reliable method to obtain the reference
counts and would bias the results. As shown in the example in Figure 2, some typographic
conventions for writing references mean that the XML structure does not render all of the
actual references. In this example, three sources are cited (51, 52 and 53), but only two
<xref> tags are present, which delimit a range from 51 to 53. As these cases are rather frequent
in the corpus (on average more than once per article), they must be taken into consideration.
Our algorithm covers all possible typographic variations for reference ranges and infers the
missing data from the input XML. As a result, we obtain the list of sentences in the text, where
each sentence is associated with a reference count as well as a list of reference identifiers
corresponding to the bibliography entries.
“… during differentiation [<xref ref-type="bibr" rid="pbio-0030356-b51">51</xref>–<xref
ref-type="bibr" rid="pbio-0030356-b53">53</xref>]. This prediction …”
Figure 2. Example of a reference range rendered in XML
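The range problem of Figure 2 can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not the algorithm used in the study: it assumes numeric citation labels, it treats a lone hyphen or dash between two consecutive <xref> tags as a range marker, and the function name cited_numbers is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

RANGE_DASHES = {"-", "\u2013", "\u2014"}  # hyphen, en dash, em dash


def cited_numbers(p_xml):
    """Return the sorted reference numbers cited in one <p> element.

    Assumes numeric citation labels, as in the Figure 2 example.
    """
    p = ET.fromstring(p_xml)
    cited = set()
    range_start = None  # set when the previous <xref> was followed by a dash
    for xref in p.iter("xref"):
        if xref.get("ref-type") != "bibr":
            continue  # skip cross-references to figures, tables, etc.
        match = re.search(r"\d+", "".join(xref.itertext()))
        if match is None:
            range_start = None
            continue
        number = int(match.group())
        if range_start is not None and range_start <= number:
            # Fill in the references implied by the range, e.g. 52 in 51-53.
            cited.update(range(range_start, number + 1))
        else:
            cited.add(number)
        # The text between this tag and the next one is stored in .tail.
        tail = (xref.tail or "").strip()
        range_start = number if tail in RANGE_DASHES else None
    return sorted(cited)
```

Applied to the fragment of Figure 2 wrapped in a <p> element, the function returns [51, 52, 53], whereas counting the <xref> tags alone would give two references.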
Results
Article Level
Table 3 presents the different article types in the PLOS corpus, exploiting the metadata
present in the XML documents. The article types are identified using the contents of the
<article-meta> tag in the JATS structure. This table shows, as expected, that the
‘Research Article’ type is dominant, with 94% of the papers published. We notice, however, that
PLOS Medicine offers a wider variety of article types, such as ‘Perspective’, ‘Correspondence’, ‘Essay’
or ‘Policy Forum’.
Table 3. PLOS article typology study
Section Level
We now concentrate our analysis on research articles, which account for the vast majority of
documents published by PLOS journals. The number of sections in the texts is particularly
important for our study, and we first match the section titles and section positions against the
four sections of the IMRaD structure. Tables 4 to 6 present the results of the categorization of the
sections for the seven PLOS journals. We have analyzed all section titles that are present as a
separate element in the XML documents. We determine whether a section is part of the
IMRaD structure or not by identifying occurrences of “Introduction”, “Method”, “Result” and
“Discussion” with all possible variations, plurals, combinations, etc. Thus, we have created a
set of criteria for the categorization that covers the majority of the observed section titles. After
normalization, we have considered the subset of titles present in all journals, except for
“Supporting information”, which was not considered because this type of section is not part
of the scientific argumentation and serves as complementary information. Finally, to produce
Tables 4 to 6, we look at the position of the title of each section. We check whether Introduction
corresponds to section one, Method to section two, Result to section three and Discussion to
section four.
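The title matching described above can be sketched as follows. The regular expressions are illustrative assumptions and do not reproduce the set of criteria actually used in the study; the function name categorize_title is hypothetical.

```python
import re

# One pattern per IMRaD category; plurals are accepted and case is ignored.
# These patterns are assumptions made for the purpose of illustration.
IMRAD_PATTERNS = {
    "Introduction": re.compile(r"\bintroduction\b", re.IGNORECASE),
    "Method": re.compile(r"\bmethods?\b", re.IGNORECASE),
    "Result": re.compile(r"\bresults?\b", re.IGNORECASE),
    "Discussion": re.compile(r"\bdiscussions?\b", re.IGNORECASE),
}


def categorize_title(title):
    """Return the IMRaD categories named in a section title.

    Combined titles yield several categories; an empty list means that the
    title could not be matched to the IMRaD structure.
    """
    return [name for name, pattern in IMRAD_PATTERNS.items()
            if pattern.search(title)]
```

With these patterns, "Materials and Methods" is categorized as Method, "Results and Discussion" as both Result and Discussion, and "Supporting Information" remains uncategorized.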
Table 4 shows that PLOS Medicine and PLOS Neglected Tropical Diseases essentially follow
the IMRaD structure. The values on the diagonal of the matrix for PLOS Neglected Tropical
Diseases are well above 85%, which means that virtually all the articles follow the IMRaD
standard. In the case of PLOS Medicine, the values on the diagonal show that about half of the
papers follow the IMRaD structure, while the other half use section titles that did not allow
the automatic categorization of the sections. For both journals, the high values on the
diagonals clearly indicate that in almost all of the papers that include sections categorized as
Introduction, Method, Result and Discussion, these sections appear in the order defined by the
IMRaD structure. Hence, the first column, which corresponds to section one, never includes
Method, Results or Discussion. This is consistent with the structure generally presented in the
literature.
Table 4. Relation between position of section and title of section for PLOS Medicine and PLOS
Neglected Tropical Diseases
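A matrix relating section titles to section positions can be computed along the following lines. This is a sketch only: it assumes that each article is given as the list of its section categories in order of appearance (None for an uncategorized title) and that percentages are taken over all articles of a journal, which may differ from the normalization used for the tables. The function name position_matrix is hypothetical.

```python
from collections import Counter

CATEGORIES = ("Introduction", "Method", "Result", "Discussion")


def position_matrix(articles, n_positions=4):
    """Percentage of articles in which each category occupies each position.

    articles: list of lists of category labels, in order of appearance.
    Returns {category: [percentage at position 1, ..., position n_positions]}.
    """
    counts = Counter()
    for sections in articles:
        for position, category in enumerate(sections[:n_positions]):
            if category in CATEGORIES:
                counts[(category, position)] += 1
    total = len(articles) or 1  # avoid division by zero on an empty corpus
    return {
        category: [100.0 * counts[(category, p)] / total
                   for p in range(n_positions)]
        for category in CATEGORIES
    }
```

For two articles, one in the IMRaD order and one with the Method section placed last, the Method row is [0.0, 50.0, 0.0, 50.0]: the diagonal value drops to 50% and the remainder appears in the fourth position.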
Table 5 shows the relation between the position of the sections and section titles for PLOS
ONE and PLOS Computational Biology. While the first value on the diagonal is
more than 99%, the other values on the diagonal are very low (close to 50%), which indicates that
the usual IMRaD order of sections is in fact changed. The Method section (second row of the table)
can be found not only in section 2, as expected with IMRaD, but also in section 4, which is usually
reserved for Discussion. The standardization proposed for the extraction of titles takes
such variations into account. This inversion explains why the Results section often appears in
section 2 instead of section 3, and why the methods are presented at the end of the article (section 4).
These papers do not completely respect the IMRaD structure and should therefore present
some variations in the distribution of references.
Table 5. Relation between position of section and title of section for PLOS ONE and PLOS
Computational Biology
Finally, Table 6 shows the equivalent results for PLOS Genetics, PLOS Pathogens and PLOS
Biology. We note that the distribution of sections and titles for these journals also differs from
IMRaD, with Methods coming last instead of second and Discussion third instead of fourth as
in the standard IMRaD structure.
Table 6. Relation between position of section and title of section for PLOS Genetics, PLOS
Pathogens and PLOS Biology
Knowing the structure of the text in terms of section headings – and having reordered the
various texts in order to have a consistent order of sections – we can now present the
distribution of references along the texts of papers of the different journals. To do this, we
have used a subset of the corpus which contains only those research articles that contain the
four types of sections of the IMRaD structure. All the articles in this smaller corpus have at
least four sections that correspond to the types Introduction, Method, Result and Discussion
but these sections are not necessarily present in the same order in the text. Table 7 shows the
number of articles that fulfill these criteria. We can observe that this new corpus represents
82.98% of the corpus.
Table 7. Research articles containing the four section types
of the IMRaD structure
Distribution of References at the Sentence Level
Figure 3 presents the normalized distributions of the references throughout the texts for two
PLOS journals. The horizontal axis presents the text progression from 0 to 100 percent based
on the segmentation into sentences. The vertical axis gives the average percentage of the
number of references at a given point of the text for each corpus. We can observe that the first
10 percent of the texts in these corpora contain a relatively large number of references. The
three vertical lines on the graph indicate the average positions of the section boundaries.
Figure 3. Distribution of References in PLOS Medicine and PLOS Neglected Tropical Diseases
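One way to obtain such a normalized distribution from per-sentence reference counts is sketched below. The binning into 100 equal steps of text progression and the averaging of per-article percentages are assumptions about the computation, and the function names are hypothetical.

```python
def reference_profile(counts_per_sentence, n_bins=100):
    """Percentage of an article's references falling in each progression bin.

    counts_per_sentence: number of references in each sentence, in text order.
    Sentence i of n is assigned to bin floor(i * n_bins / n).
    """
    profile = [0.0] * n_bins
    total = sum(counts_per_sentence)
    if total == 0:
        return profile  # an article without references contributes zeros
    n = len(counts_per_sentence)
    for i, count in enumerate(counts_per_sentence):
        profile[i * n_bins // n] += 100.0 * count / total
    return profile


def average_profile(articles, n_bins=100):
    """Average the per-article profiles over a corpus (one value per bin)."""
    profiles = [reference_profile(a, n_bins) for a in articles]
    return [sum(column) / len(profiles) for column in zip(*profiles)]
```

As a small check with four bins, an article with counts [2, 0, 1, 1] gives the profile [50.0, 0.0, 25.0, 25.0], and averaging it with an article whose four references all occur in the last sentence gives [25.0, 0.0, 12.5, 62.5]. Normalizing each article to 100% before averaging keeps long articles from dominating the curve.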
These results are consistent with what might be expected: references are more concentrated in
the introduction. The comparison of Tables 4, 5 and 6 with Figures 3, 4 and 5 shows that the
distributions of references are similar in the sets of journals having the same structure of
section titles. In fact, Figure 3 shows that section 2, which according to Table 4 corresponds
to the Method section in a majority of articles in these two journals, contains fewer references than the
other sections. On the other hand, Table 6 shows that the Method section tends to be at the
end of the articles for three of the journals. This is consistent with the distribution of
references in Figure 5, where we can observe that the fourth section contains a smaller
number of references than the first three sections. These observations suggest that, if we take
into account the variations in the positions of sections, the distribution of references could be
very stable and nearly invariant.
[Figure 3 plot. Horizontal axis: text progression (%), 0 to 100. Vertical axis: percentage of references. Series: PLOS Medicine, PLOS Neglected Tropical Diseases. Vertical lines separate Sections 1 to 4.]
Figure 4. Distribution of References in PLOS Computational Biology and PLOS ONE
Figure 5. Distribution of References in PLOS Genetics, PLOS Pathogens and PLOS Biology
Distribution of References for the ordered IMRaD structure
In order to study the distribution of references independently of the order in which the
sections of the IMRaD structure appear in the texts, we have reordered the sections in all
articles with respect to the order Introduction, Method, Result, Discussion. The reordered
articles were then used to produce the new distribution of references. Figure 6 shows the
distributions of references that were obtained for the 7 PLOS journals. We can observe that
the distributions for all seven journals share practically the same properties. The Introduction
sections contain a relatively large number of references, with a bigger concentration in the
first part of the Introduction. The Method section is characterized by a relatively smaller
number of references which grows bigger towards the Results and Discussion sections. The
“PLOS” curve on this graph corresponds to the distribution of references in the entire corpus.
[Figure 4 plot. Horizontal axis: text progression (%), 0 to 100. Vertical axis: percentage of references. Series: PLOS Computational Biology, PLOS ONE. Vertical lines separate Sections 1 to 4.]
[Figure 5 plot. Horizontal axis: text progression (%), 0 to 100. Vertical axis: percentage of references. Series: PLOS Genetics, PLOS Pathogens, PLOS Biology. Vertical lines separate Sections 1 to 4.]
Figure 6. Distribution of References of PLOS journals, following the IMRaD structure
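The reordering step can be sketched as follows, assuming that each article is available as a list of (category, per-sentence reference counts) pairs in order of appearance. Dropping uncategorized sections and merging repeated categories are assumptions of this sketch, and the function name reorder_sections is hypothetical.

```python
IMRAD_ORDER = ("Introduction", "Method", "Result", "Discussion")


def reorder_sections(sections):
    """Rearrange an article's sections into the IMRaD order.

    sections: list of (category, counts) pairs in order of appearance, where
    counts holds the number of references in each sentence of the section.
    Returns (reordered_counts, boundaries), where boundaries[k] is the index
    of the sentence that follows the k-th IMRaD section, or None if one of
    the four section types is missing.
    """
    by_category = {}
    for category, counts in sections:
        if category in IMRAD_ORDER:  # other sections are left out
            by_category.setdefault(category, []).extend(counts)
    if set(by_category) != set(IMRAD_ORDER):
        return None
    reordered, boundaries = [], []
    for category in IMRAD_ORDER:
        reordered.extend(by_category[category])
        boundaries.append(len(reordered))
    return reordered, boundaries
```

For an article whose Method section comes last, with sections Introduction [3, 1], Result [0, 2, 1], Discussion [1] and Method [0, 0], the reordered counts are [3, 1, 0, 0, 0, 2, 1, 1] with boundaries [2, 4, 7, 8]. The reordered counts can then be used to compute the distribution of references in the same way as for the original order.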
Conclusion
In this paper, we have shown that we can measure the distribution of references along the text
of articles using sentences as the counting unit. We have also shown that this distribution
seems quite stable and maybe even invariant if we take into account the changes that occur in
some journals in the positions of the different sections in the text of the articles. Knowing the
structure of the articles, we are now in a position to connect the references with their position
in the text in order to better characterize the kinds of references in terms of the nature of the
section in which they appear. For it is plausible that the kinds of references present in the
introductory section may differ from the ones mentioned in the Method section, for example.
While this could be done by hand using a small sample, the methods presented here are
applicable to very large data sets.
The results of this study might be of interest for citation context analysis or in case one wants
to assign different weights to citations according to their place in the document (see Bonzi,
1982; Rousseau, 1987). Our future work will focus on citation context analysis, as well as
examining the other correlations that might exist between the position in the text and the
nature of the references: their publication year or the subject category of the reference
journals.
References
Agarwal, S. and Yu, H. (2009). Automatically classifying sentences in full-text biomedical articles
into Introduction, Methods, Results and Discussion. Bioinformatics, 25(23): 3174–3180.
American National Standards Institute (2012). JATS: Journal Article Tag Suite. ANSI/NISO Z39.96-2012,
9 August 2012. National Information Standards Organization (NISO). Available at:
http://www.niso.org/apps/group_public/download.php/8975/z39.96-2012.pdf
Barron, J. (2006). The Uniform Requirements for Manuscripts Submitted to Biomedical Journals
Recommended by the International Committee of Medical Journal Editors. Chest, 129(4): 1098–
1099.
Bonzi, S. (1982). Characteristics of a Literature as Predictors of Relatedness Between Cited and Citing
Works. Journal of the American Society for Information Science and Technology (JASIST), 33(4):
208–216.
Day, R. A. and Gastel, B. (2006). How to Write and Publish a Scientific Paper. Cambridge: Cambridge
University Press.
Carter, R., Funk, K., and Mooney, R. (2012). The Front Matters: Capturing Journal Front Matter with
JATS. Journal Article Tag Suite Conference (JATS-Con) Proceedings 2012. Bethesda (MD):
National Center for Biotechnology Information (US); 2012. Available at:
http://www.ncbi.nlm.nih.gov/books/NBK100353/
Kucer, S. (1985). The Making of Meaning Reading and Writing as Parallel Processes. Written
Communication, 2(3): 317–336.
Meadows, A. (1985). The scientific paper as an archaeological artefact. Journal of information
science, 11(1): 27–30.
Mourad, G. (2001). Analyse informatique des signes typographiques pour la segmentation de textes et
l’extraction automatique des citations. Réalisation des Applications informatiques : SegATex et
CitaRE. PhD thesis, Univ. Paris-Sorbonne.
Nakayama, T., Hirai, N., Yamazaki, S., and Naito, M. (2005). Adoption of structured abstracts by general
medical journals and format for a structured abstract. Journal of the Medical Library Association,
93(2): 237.
Oriokot, L., Buwembo, W., Munabi, I., and Kijjambu, S. (2011). The introduction, methods, results
and discussion (IMRAD) structure: a Survey of its use in different authoring partnerships in a
students' journal. BMC research notes, 4(1): 250.
Rousseau, R. (1987). The Gozinto theorem: Using citations to determine influences on a scientific
publication. Scientometrics, 11(3-4): 217–229.
Sollaci, L. and Pereira, M. (2004). The introduction, methods, results, and discussion (IMRAD)
structure: a fifty-year survey. Journal of the Medical Library Association, 92(3): 364.