A Systematic Approach to Map the Research Articles’ Sections to IMRAD

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2020.3009021, IEEE Access
IBRAR AHMED1, MUHAMMAD TANVIR AFZAL2
1Capital University of Science and Technology Islamabad Pakistan (e-mail: ibrar.ahmad@gmail.com)
2Capital University of Science and Technology Islamabad Pakistan (e-mail: mafzal@cust.edu.pk)
Corresponding author: Ibrar Ahmed (e-mail: ibrar.ahmad@gmail.com).
ABSTRACT The number of scientific publications is believed to double every five years. Citation indexes and digital libraries store these publications as complete PDF files, as terms extracted from the documents, or both. This indexing practice makes it difficult for the scientific community and for digital repositories to handle advanced user requirements. For instance, a query such as “Give me those papers that contain the term ‘Pagerank’ in their result section” cannot be answered unless the papers are indexed section-wise. The issue has attracted the attention of researchers and of prestigious international challenges at top venues, such as the Semantic Publishing Challenge at ESWC. An important piece of metadata to extract from research papers is the section information, such as IMRAD (Introduction, Methodology, Results, and Discussion). Researchers have presented different approaches to identify section-headings and map them to IMRAD sections. Existing studies have employed parameters such as dictionary terms, the template of a paper, and in-text citation frequency to map section-headings onto logical sections. A critical analysis of the state-of-the-art revealed that some highly promising features, which might lead to more accurate mapping, have been ignored. In this study, we propose a novel approach that employs new features along with previously well-known features to map section-headings to IMRAD. The newly proposed features are: (1) a variant of in-text citation count, (2) figure count, (3) table count, and (4) implicit subheading mapping. The employed data set contains 5000 research papers collected from CiteSeer. The evaluation of the proposed approach and its comparison with three state-of-the-art approaches revealed an improvement in average precision of 18.96%, 21.77%, and 9.50% over Ding et al., Shahid et al., and Habib et al., respectively. This research has significant implications for citation indexes and digital libraries.
INDEX TERMS IMRAD, Relational Database, Section Mapping, Scientific Document Classification.
I. INTRODUCTION
COMMUNICATION in science is realized through scientific publications. Owing to recent advances in science, a tremendous increase has been reported in the number of publications on the WWW. The number is believed to double every five years [1]. This scientific literature is disseminated on the web through venues such as conferences, journals, and workshops. These venues publish their research corpora on the web, where they are indexed by digital repositories such as CiteSeer, Google Scholar, and Scopus. A user exploits information retrieval (IR) systems such as search engines, citation indexes, or digital libraries to extract the desired information, and poses a query expecting to obtain the most relevant information. For instance, a research scholar searching for papers to conduct a literature survey wishes to retrieve as many strongly relevant research papers as possible for the topic posed in the query. However, existing IR systems index the data in a semi-structured format. PDF is one of the most widely employed semi-structured formats; it was developed under the Camelot Project to share documents that include text and images. Owing to the improper indexing of PDF files, existing IR systems are unable to handle advanced user queries. Examples of advanced queries are: (1) find all research papers containing the term “Data Science” in the Methodology section, or (2) find all papers that have calculated “F-measure” in the “Results” section. Queries of this nature can only be addressed if research papers are indexed section-wise. The community has proposed solutions in the form of defining the semantic structure of PDF documents and performing their section-wise mapping [2].
Initially, research papers were written in the letter-style format. In the early 20th century, a standard format was presented in the form of IMRAD (Introduction, Methods, Results, And Discussion) [3].
The origin of IMRAD is vague. However, according to Gitanjali Batmanabane [4], Louis Pasteur was the first person to use the format, and it was later used by Sir Austin Bradford Hill. The structure states that a research paper should comprise logical sections such as Introduction, Methods, Results, and Discussion. The majority of the scientific community currently favors the IMRAD structure when preparing research papers. Identifying the logical sections of a research paper has already been the focus of prestigious international challenges at top venues, such as the Semantic Publishing Challenge at ESWC [5], and of the research community [2], [6]–[8].
It should be noted that the names authors give to section-headings are usually not identical to the names used by the IMRAD structure (i.e., Introduction, Methods, Results, and Discussion). For example, the state-of-the-art study [2], which performed experiments on 1,833 section-headings, concluded that none of the methodology sections was named “Methods/Methodology” by the authors of the respective papers. Only 1% of the papers used the term “Result” for the results section.
Over the years, researchers have presented different section identification techniques [2], [6], [8]. Ding et al. [8] studied the distribution of in-text citations across sections. For this task, they identified the section headings and mapped them onto logical sections, using an extensive set of dictionary terms to identify each section. They applied their technique to 866 full-text articles containing 6866 sections and achieved 81% accuracy. A. Shahid and M. T. Afzal [2] extended the technique of Ding et al. [8] with different dictionary terms along with research paper templates and layout to identify section headings and map them to the IMRAD structure. They applied the technique to 1200 papers containing 12,180 sections and obtained 0.78 precision and 0.79 recall. Raja Habib and M. T. Afzal [6] used the frequency of in-text citations to identify sections; they applied the technique to 5000 bibliographically coupled papers and achieved 90% accuracy.
However, on critical investigation, we identified two key shortcomings. First, in some scenarios, the contemporary approaches are unable to differentiate between the sections and subsections of a research paper. For example, a logical section “1. Introduction” contains a subsection named “1.2 Background”. The existing approaches treat both as independent headings and map them individually to the IMRAD structure, which increases the chances of inaccurate mapping. Some studies do consider subheadings but rely on the heading tags <h1... hn>, which can fail in some cases; we explain the issue in detail in section 2 of this paper. Second, besides subheadings, there exists a list of potential section-identifying parameters that have been ignored by the existing state-of-the-art.
In this paper, we present a comprehensive approach that intelligently maps the sections of research articles to IMRAD. The proposed approach accurately identifies subsections and maps them to IMRAD headings based on the mapping of their main section, which yields better results. Furthermore, the proposed approach exploits novel features that have great potential to improve the performance of IR systems. The features include: (1) In-Text Citations Count, (2) Figures Count, and (3) Tables Count. These features were chosen on the assumption that their frequency may hint at the association of a heading with a specific logical section. The proposed methodology is evaluated on the logical sections of 5000 research papers in PDF taken from CiteSeer, containing 39420 section headings.
We compared the proposed approach with all three section-mapping techniques [2], [6], [8] on the same dataset of 5000 papers. The proposed technique outperformed all three, gaining 18.96%, 21.77%, and 9.50% improvement in the average precision over all sections compared with Ding et al. [8], A. Shahid and M. T. Afzal [2], and Raja Habib and M. T. Afzal [6], respectively.
The rest of the paper is organized as follows: Section II presents a critical analysis of state-of-the-art systems. Section III highlights the lessons learned from empirical experiments on research papers. Section IV describes the dataset, followed by Section V, which elaborates the methodology proposed to map sections onto the IMRAD structure. Section VI discusses the results and analysis. Finally, the performance evaluation is presented in Section VII, followed by the conclusion in Section VIII.
II. LITERATURE REVIEW
Since 1665, the letter-style format has been followed by the research community to prepare research documents [9]. In the early 20th century, a standard structure designed specifically for writing research papers was presented in the form of IMRAD (Introduction, Methodology, Results, and Discussion). Gradually, the popularity of the structure increased and most of the research community adopted it for preparing research documents [9]. The community believed that if the structure of research papers is defined by mapping their logical sections to IMRAD, the performance of IR systems can be significantly enhanced. In this regard, the scientific community has made various efforts, from defining the structure of PDF files to mapping them to IMRAD. These studies have focused on adding semantic structure to the content of PDF files so that information can be retrieved in an intelligent manner. A semantically defined document structure can be used for different information retrieval applications, such as generating summaries or processing advanced queries. Studies that focus on defining the structure of a research paper’s content are referred to as discourse analysis. The scientific community has presented section-based studies of research papers in two dimensions: some have employed the logical sections of research papers to identify different aspects pertaining to bibliometric analysis, and others have mapped logical sections to the IMRAD structure.
A. TECHNIQUES USING LOGICAL SECTIONS
The study by Teufel and Moens [10] presented an Argumentative Zoning (AZ) system that uses elements of scientific argumentation [10]. These scientific arguments are referred to as “own work”, “others’ work”, and “contrast” [11]. Another similar scheme, Core Scientific Concepts (CoreSC), extracts generic concepts such as Hypothesis, Model, and Experiment from research papers [12]. Besides defining the structure, IMRAD has also proven useful in various other contexts. For instance, Teufel et al. [13] stated that investigating a citation count with respect to its location in a logical section can produce good results in discerning the sentiment of the citing author. Similarly, another study [14] performed citation analysis by exploiting in-text citations in the different sections (introduction, methods, results, and discussion) of a research paper. The importance of logical sections for discovering semantic relationships between research articles has also been delineated by A. Shahid and M. T. Afzal [2]. Another approach [15] presented an ontology (DEo) to define the logical structures of scientific documents. The semantic indexing of research documents holds great potential for identifying implicit knowledge. In the study [16], the authors exploited in-text citation frequencies and their patterns in the logical sections of research papers; the evaluation was done on a data set of scientific papers taken from CiteSeer. Kafkas et al. [17] presented a study wherein section-based search functionality was provided for the research papers published in the Journal of Biomedical. The approach presented by A. Shahid and M. T. Afzal [2] is closest to that of Kafkas et al. [17]; the difference is that the approach of Kafkas et al. [17] manually extracts the sections by utilizing designed rules.
Contemporary IR systems are unable to semantically index the structure of PDF files, due to which advanced queries cannot be processed. Examples of advanced queries are: (1) find all research papers containing the term “Data Science” in the Methodology section, or (2) find all papers that have calculated “F-measure” in the “Results” section. Addressing such queries is a pressing need, especially given the huge amount of research literature on the web. However, defining a semantic structure that is able to address such queries is challenging, because authors use different vocabulary to name the same logical sections. For instance, some authors use the term “Literature Review” while others use “Related Work” for the section that presents state-of-the-art studies. In such scenarios, it becomes difficult to semantically distinguish the terms. Such issues have also been reported by A. Shahid and M. T. Afzal [2], wherein 329 papers were manually assessed and it was revealed that none of the papers used the term “Methodology” for the methodology section and only 1% of the papers used the term “Result” for the “Result” section. Besides these, other researchers have also used logical sections to identify important citations, for example Nazir et al. [18], Hasan et al. [19], and Pride et al. [20].
B. TECHNIQUES MAPPING LOGICAL SECTIONS
Ding et al. [8] used extensive dictionary terms to identify the logical sections and map them to IMRAD. The technique was tested on 866 full-text articles containing 6866 sections and achieved 81% accuracy. A. Shahid and M. T. Afzal [2] extended the study of Ding et al. [8] and used different dictionary terms and the templates of the papers to identify the sections and map them to IMRAD. Their study mapped research articles from the domain of Computer Science and automatically extracted the sections using the DEo ontology. It then rigorously compared the approach with various ML techniques and, unlike the study presented by Kafkas et al. [17], formally presented the proposed algorithm. The approach of A. Shahid and M. T. Afzal [2] maps the sections of research papers into six logical sections of IMRAD (“Introduction”, “Related Work”, “Methods”, “Results”, “Discussion”, and “Conclusion”). The approach [21] harnessed two main features, the layout information of scientific documents and dictionary terms, to form heuristics that map research articles section-wise onto the IMRAD structure. The evaluation was done on 329 papers from the domain of Computer Science published in the Journal of Universal Computer Science (J.UCS). Raja Habib and M. T. Afzal [6] used in-text citation frequency to identify the sections. This technique achieved 90% accuracy when tested on 5000 papers.
A critical analysis of the studies presented above illustrates that the contemporary state-of-the-art has proposed several methods to semantically index research documents according to some predefined structure. A few efforts have been made to define the structure of research documents from the domain of Computer Science. The approach proposed by A. Shahid and M. T. Afzal [2] is the most recent approach that performs section-wise mapping of research articles from the domain of Computer Science, utilizing heuristics formed from layout information and content information. However, the approach has various deficiencies that can adversely impact the performance of IR systems. We critically scrutinized the approach by manual investigation and identified existing gaps, which are the focus of the proposed study. The following section, “Lessons Learned”, presents an in-depth analysis of the identified issues with the help of examples taken from a real data set.
III. LESSONS LEARNED
As explained earlier, our work is closest to the work presented by A. Shahid and M. T. Afzal [2]. Their approach maps the logical sections of research papers to IMRAD by using template information and dictionary-based rules. We implemented the approach of A. Shahid and M. T. Afzal [2] and discovered some patterns that led us to formulate the proposed research questions. Let us look into the patterns with the help of real examples taken from the PDF files of the employed data set.
A. SUBHEADINGS MAPPING
In the PDF file shown in Figure 1, the Introduction section of a research paper, “1. Introduction”, contains three subsections: “1.1. Biology needs computation”, “1.2. Genes and cells”, and “1.3. GemCell”. In the XML file of this PDF, shown in Figure 2, the main heading (i.e., 1. Introduction) is represented with the tag <h1> and all the sub-headings are represented with the tag <h2>. The content between the opening and closing heading tags is considered the name of an independent logical section. All the remaining subheadings are also represented with the same tag <h1>. This manual inspection revealed that the approach [2] is unable to differentiate between the main section and the subsections of a paper; rather, it treats all of them as independent headings. This deficiency causes another major issue, explained below. Ding et al. [8] already used the heading tags <h1> and <h2>. However, as Figure 3 shows, there is a sub-section named “1.3 Related Work”. When mapped to the IMRAD structure, this sub-section is considered an independent section and gets mapped to the “Related Work” section of IMRAD. In reality, however, it does not belong to the literature review section of IMRAD. Such issues result in false mapping, which further compromises the performance of IR systems. On careful examination of the heading content, we identified that most headings start with an outline number. For instance, the main heading starts with “1.”, followed by the sub-headings with “1.1”, “1.2”, “1.3”, and “1.4”. We argue that the inability of the XML to differentiate the logical structure can be addressed with regular expressions that differentiate between sections and sub-sections on the basis of these numbering patterns.
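A minimal Python sketch of how such numbering patterns might be told apart with a regular expression is given here. The pattern and the helper names `parse_heading` and `is_subheading` are illustrative assumptions, not the expressions used in the proposed approach; headings without a leading outline number are simply left unclassified.

```python
import re

# Assumed pattern: a leading outline number such as "1.", "1.3" or "1.1."
# followed by whitespace and the heading title.
NUMBERED_HEADING = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+(\S.*)$")


def parse_heading(text):
    """Return (outline_numbers, title), or None when the heading is unnumbered."""
    match = NUMBERED_HEADING.match(text)
    if match is None:
        return None
    numbers = tuple(int(part) for part in match.group(1).split("."))
    return numbers, match.group(2).strip()


def is_subheading(text):
    """A heading with more than one outline component (e.g. 1.3) is a subheading."""
    parsed = parse_heading(text)
    return parsed is not None and len(parsed[0]) > 1
```

Under these assumptions, “1. Introduction” parses to a single outline component and is treated as a main heading, whereas “1.3 Related Work” has two components and is treated as a subheading.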
To recapitulate, the examples discussed above show that the contemporary state-of-the-art on logical section mapping fails to map subsections intelligently. Treating all sub-sections as independent sections and explicitly mapping them to the IMRAD structure can have an adverse impact on the overall precision of IR systems. Our study overcomes the mentioned issues by utilizing regular expressions to implicitly map each subsection to the same IMRAD section as its main section. As explained earlier, the proposed study also employs potential features such as In-Text Citations count, Figures count, and Tables count to determine the association of a section with specific logical sections of IMRAD. The justification for harnessing these parameters is given in the following subsections.
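The implicit mapping of subsections can be illustrated with a short Python sketch. It assumes that headings arrive in document order as (outline number, title) pairs; the function `map_with_inheritance` and the toy dictionary mapper are hypothetical and stand in for whatever rule maps a main heading to IMRAD.

```python
# Toy dictionary of main-heading titles; a hypothetical stand-in for real mapping rules.
TOY_DICTIONARY = {
    "introduction": "Introduction",
    "related work": "Related Work",
}


def toy_mapper(title):
    """Map a main-heading title to an IMRAD label, or None when unknown."""
    return TOY_DICTIONARY.get(title.strip().lower())


def map_with_inheritance(headings, map_main_heading=toy_mapper):
    """Label main headings explicitly; subheadings inherit the parent's label.

    headings: list of (outline_number, title) in document order,
              e.g. ("1.3", "Related Work").
    """
    labels_by_top_number = {}
    mapped = []
    for number, title in headings:
        parts = number.rstrip(".").split(".")
        top = parts[0]
        if len(parts) == 1:
            # Main heading: map it explicitly.
            labels_by_top_number[top] = map_main_heading(title)
        # A subheading such as 1.3 takes the label of main heading 1
        # (None if that main heading has not been seen).
        mapped.append((number, title, labels_by_top_number.get(top)))
    return mapped


example = map_with_inheritance([
    ("1.", "Introduction"),
    ("1.3", "Related Work"),
    ("2.", "Related Work"),
])
```

In this toy run, the subheading “1.3 Related Work” inherits the label of “1. Introduction” instead of being mapped to “Related Work” on its own, while the main heading “2. Related Work” is mapped explicitly.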
B. FIGURES AND TABLES COUNT
A scientific article illustrates results in the form of figures or tables. There is a high probability that the frequency of figures or tables hints at the association of a section with a particular logical section of IMRAD. For instance, a “Methodology” or “Result” section contains a comparatively higher “Figure count” or “Table count” than other sections. To the best of our knowledge, the contemporary approach [2] has not given due importance to the potential parameters “Figure count” and “Table count”. In this study, we consider both parameters, based on the assumption that the counts of figures and tables are linked to specific sections of the IMRAD structure. In the XML files, figures and tables are represented in the form of objects, as shown in Figure 4 and Figure 5.
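As a rough illustration, per-section figure and table counts could be gathered from an XML file as sketched in the following Python snippet. The element names (`section`, `h1`, `figure`, `table`) and the sample document are assumptions made for this example only; the actual XML produced from the PDF files may use different tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified XML layout used only for this illustration.
SAMPLE_XML = """
<article>
  <section><h1>1. Introduction</h1><p>Text.</p></section>
  <section><h1>3. Method</h1><figure id="f1"/><p>Text.</p></section>
  <section><h1>4. Results</h1><figure id="f2"/><figure id="f3"/><table id="t1"/></section>
</article>
"""


def count_objects(xml_text):
    """Return {heading: {"figures": n, "tables": m}} for every <section>."""
    root = ET.fromstring(xml_text)
    counts = {}
    for section in root.iter("section"):
        heading = section.findtext("h1", default="").strip()
        counts[heading] = {
            "figures": sum(1 for _ in section.iter("figure")),
            "tables": sum(1 for _ in section.iter("table")),
        }
    return counts


object_counts = count_objects(SAMPLE_XML)
```

The resulting counts can then be used as features alongside the heading text when deciding which logical section a heading belongs to.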
C. IN-TEXT CITATIONS FREQUENCY
Similar to the assumption made for the “Figure count” and “Table count” parameters, the “In-Text Citation count” can also serve as an important indicator of the association with a specific logical section. Ding et al. [8] employed the frequency of in-text citations in all the logical sections. We manually investigated research papers of the employed data set and found that the number of in-text citations is not equal across sections. For instance, the “Literature Review” section contains more in-text citations than the other sections. Considering this aspect, our approach maps the logical section having the highest number of in-text citations to the Literature Review section. Raja Habib and M. T. Afzal [6] considered this parameter for mapping logical sections to the IMRAD structure, building on the study of Ding et al. [8]. Similar to figures and tables, citations are also represented in the form of objects in the XML files, as shown in Figure 6.
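The rule of mapping the section with the highest number of in-text citations to the Literature Review can be sketched in a few lines of Python. The function name `pick_literature_review`, the tie-breaking behavior (the first section with the maximum count wins), and the sample counts are assumptions for illustration only.

```python
def pick_literature_review(citation_counts):
    """Return the heading with the most in-text citations, or None.

    citation_counts: dict mapping a main-heading title to its citation count.
    """
    if not citation_counts:
        return None
    heading = max(citation_counts, key=citation_counts.get)
    # With no citations anywhere there is no evidence to base the mapping on.
    return heading if citation_counts[heading] > 0 else None


# Toy counts, not taken from the data set.
toy_counts = {
    "1. Introduction": 6,
    "2. Background": 21,
    "3. Approach": 4,
    "4. Evaluation": 2,
}
literature_review_heading = pick_literature_review(toy_counts)
```

Here the heading “2. Background” would be mapped to the Literature Review section even though its title does not match any of the usual dictionary terms.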
The important aspect overlooked by contemporary studies is that they have not given adequate importance to potential features such as “In-Text Citation count”, “Figure count”, and “Table count”, even though, as explained above, these parameters could be potential contributors to section identification. In this study, our focus is to overcome the stated deficiencies in order to improve the performance of IR systems.
IV. DATA COLLECTION
Since an appropriate data set plays a crucial role in determining the significance of a proposed study, we collected the data set in a way that ensures the validity of the proposed study on papers published in diverse domains.
To verify our approach, we require a comprehensive dataset from diverse domains, covering different authors and different journals. We found a dataset with the required characteristics in Raja Habib and M. T. Afzal [6]. This data set is freely available. We used the 17 queries listed in Table 1, which are adapted from Raja Habib and M. T. Afzal [6], to collect the data from CiteSeer. CiteSeer indexes a vast number of research papers in diversified disciplines of Computer Science. The employed dataset contains 5000 papers of different journals, comprising 39420 sections.
FIGURE 1. Subheading example in form of PDF file.
FIGURE 2. Subheading Example 1 (XML)
FIGURE 3. Subheading Example 2 (XML)
FIGURE 4. Figures in Research Publications
FIGURE 5. Tables in Research Publications.
FIGURE 6. Example of a figure caption.
V. METHODOLOGY
This section presents the details of the proposed methodology. It works in four modules to map the logical sections of research articles to the IMRAD structure: (1) Schema Generation Engine (SGE), (2) Data Extraction Engine (DEE), (3) Data Mapping Engine (DME), and (4) Mapping View Engine (MVE). First, the data set containing the PDF files of research papers is collected from the digital library CiteSeer. The PDF files are converted into XML using the PDFX [22] tool. The Schema Generation Engine (SGE) generates the schema of the XML files, which is maintained in PostgreSQL to parse and insert the XML data. Thereafter, the Data Extraction Engine (DEE) extracts headings and subheadings from the sections of research papers, along with other objects such as citations, figures, and tables. The Data Mapping Engine (DME) maps the extracted headings and subheadings to IMRAD with the help of the devised Algorithm 1. The last module, the Mapping View Engine (MVE), is used to visualize the resultant mapping using XPath/XQuery expressions. In the end, the mapped sections are evaluated against the benchmark data set, which contains section annotations formed with the help of a user study. The overall structure of the proposed methodology is shown in Figure 7 and Figure 8. The proposed algorithm is formally represented below in the form of different modules (Algorithms 1, 2, 4, 5, 6, and 7). All the modules implemented in the proposed methodology are explained in detail in the following sections.
TABLE 1. Queries used to collect dataset [6]
Number  Query
1       Social network
2       Information retrieval
3       Bayesian networks
4       Feature selection
5       Collaborative recommendation
6       Recommendation system
7       Content based filtering
8       Black box testing
9       Automatic generation
10      Regression testing
11      Query processing
12      Sensor networks
13      Wireless communications
14      Opinion mining
15      Subjectivity analysis
16      Online marketing
17      Graph theory
A. SCHEMA GENERATION ENGINE (SGE)
As explained earlier, the research papers in the employed data set were in PDF form and were converted into XML format. The requisite parameters from the XML files need to be stored in a meaningful format so that extensive queries can be processed on them. In this regard, we developed the Schema Generation Engine (SGE), wherein we stored that information in different tables of the database and created links between them. The XML publications were inserted into PostgreSQL, a renowned relational database, using the generated schema. The schema is a true mapping of research publications to the relational database. We chose PostgreSQL to parse and insert the XML data because it provides powerful support for manipulating XML data.
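To make the idea of linked tables concrete, the following Python sketch creates a small relational schema for papers, headings, and section objects. It is only an illustration under stated assumptions: the table and column names are hypothetical, and the standard-library `sqlite3` module is used as a stand-in so that the example runs without a database server, whereas the proposed system uses PostgreSQL.

```python
import sqlite3

# Hypothetical schema; names are illustrative, not those of the proposed SGE.
SCHEMA = """
CREATE TABLE paper (
    paper_id INTEGER PRIMARY KEY,
    title    TEXT NOT NULL
);
CREATE TABLE heading (
    heading_id INTEGER PRIMARY KEY,
    paper_id   INTEGER NOT NULL REFERENCES paper(paper_id),
    parent_id  INTEGER REFERENCES heading(heading_id),  -- NULL for main headings
    position   INTEGER NOT NULL,                        -- order within the paper
    title      TEXT NOT NULL,
    imrad      TEXT                                     -- filled in by the mapping step
);
CREATE TABLE section_object (
    object_id  INTEGER PRIMARY KEY,
    heading_id INTEGER NOT NULL REFERENCES heading(heading_id),
    kind       TEXT NOT NULL CHECK (kind IN ('citation', 'figure', 'table'))
);
"""


def create_schema(connection):
    """Create the three linked tables on an open database connection."""
    connection.executescript(SCHEMA)


schema_conn = sqlite3.connect(":memory:")
create_schema(schema_conn)
```

The `parent_id` column is one possible way to record which main heading a subheading belongs to, and `section_object` keeps one row per citation, figure, or table so that counts per section can be obtained with a simple aggregate query.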
B. DATA EXTRACTION ENGINE (DEE)
The Data Extraction Engine (DEE) pre-processes the data and
removes noise. By “noise”, we mean papers that do not follow
the paragraph and heading syntax of research papers. In the
XML files, sub-heading tags are not properly nested within
the tags of their main headings; instead, all headings are
represented as independent elements. Using the DEE, we
extracted all heading tags from the XML to determine their
position in the paper (i.e., main heading or subheading). The
DEE extracted the headings and subheadings using the XML
heading structure and populated the database accordingly.
Afterward, it identified citations and objects such as figures
and tables in the XML files. Algorithm 2 shows the process
of extracting data from XML.
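As an illustration of this step, the sketch below reads heading tags that appear as independent siblings and rebuilds their hierarchy from the tag level. The sample XML layout and the function name extract_headings are assumptions made for this example; real converter output is richer and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified XML: headings are siblings, not nested.
SAMPLE_XML = """
<article>
  <body>
    <h1>Introduction</h1>
    <region>Some text.</region>
    <h2>Motivation</h2>
    <h1>Methodology</h1>
    <h2>Data set</h2>
    <h3>Queries</h3>
  </body>
</article>
"""

HEADING_LEVELS = {"h1": 1, "h2": 2, "h3": 3}


def extract_headings(xml_text):
    """Return headings in document order with the index of their parent."""
    root = ET.fromstring(xml_text)
    headings = []
    latest = {}  # level -> index of the most recent heading at that level
    for elem in root.iter():
        level = HEADING_LEVELS.get(elem.tag)
        if level is None:
            continue
        # The parent is the most recent heading at the closest shallower level.
        parent = None
        for shallower in range(level - 1, 0, -1):
            if shallower in latest:
                parent = latest[shallower]
                break
        headings.append(
            {"level": level, "title": (elem.text or "").strip(), "parent": parent}
        )
        latest[level] = len(headings) - 1
        # A new heading closes every deeper heading that was still open.
        for deeper in [k for k in latest if k > level]:
            del latest[deeper]
    return headings


headings = extract_headings(SAMPLE_XML)
```

Here "Motivation" receives "Introduction" as its parent, and "Queries" receives "Data set", even though none of the tags are nested in the XML.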
C. DATA MAPPING ENGINE (DME)
The Data Extraction Engine described above extracts data
from the XML files and, after some processing, inserts it
into the database in relational form. The relational format
makes the data easy to manipulate and simplifies information
extraction, but the sections still have to be identified and
mapped to IMRAD. The Data Mapping Engine (DME) uses
XQuery/XPath/SQL queries to extract each section and map it
to IMRAD. After carefully examining the structure of the
XML files, we designed a set of rules for mapping sections
to IMRAD. Algorithms 5, 6, and 7 show the process of
mapping sections to IMRAD.
1) Subheadings Mappings
Research publications consist of headings and subheadings,
and it is important to separate the two. When mapping
headings to sections, some studies take subheadings into
consideration, whereas others ignore them entirely. For
instance, if a heading <h1> contains two sub-headings <h2>
and <h3>, existing studies treat all three tags as
independent sections. As a result, an IMRAD heading can
mistakenly be mapped to a subheading.
Our proposed approach distinguishes main headings from
subheadings with the help of regular expressions (Algorithm
3). Unlike other studies, our approach does not map
sub-headings to the IMRAD structure independently. Instead,
each sub-heading is mapped to the IMRAD section of its
parent heading in the hierarchy. The proposed approach
handles the following two cases, which have been overlooked
by the scientific community.
1 - Section subheadings are distinguishable in the XML
through heading tags such as <h2> and <h3>.
2 - Section subheadings are not distinguishable in the XML,
and a regular expression (Algorithm 3) is needed to
distinguish headings by their bullets.
The regular expression in Algorithm 3 is used to detect
headings and subheadings in cases where separate heading
tags are not present in the XML.
We observed that in some cases headings and sub-headings
are not explicitly defined in the XML. In such cases, it
becomes difficult to detect where a main heading <h1> starts
and ends. We analyzed such XML files in the data set and
found that the XML element <region
class="DoCO:TextChunk"> can be used where headers are
missing in the XML files.
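The rule that a subheading takes the IMRAD section of its parent can be sketched in a few lines. The tuple layout and the function name inherit_parent_section are hypothetical; the sketch assumes that heading levels and the labels of main headings have already been determined by earlier steps.

```python
def inherit_parent_section(headings):
    """Give every subheading the IMRAD label of the main heading above it.

    headings: list of (level, title, label) tuples in document order, where
    label is the IMRAD label of a main heading (level 1) and None otherwise.
    """
    mapped = []
    current_label = None  # label of the most recent main heading
    for level, title, label in headings:
        if level == 1:
            current_label = label
        mapped.append((title, current_label))
    return mapped


example = [
    (1, "Introduction", "Introduction"),
    (2, "Motivation", None),
    (1, "Proposed Methodology", "Methods"),
    (2, "Data Mapping Engine", None),
    (3, "Citation Count", None),
]
mapped = inherit_parent_section(example)
```

With this rule, "Citation Count" is never offered as a candidate IMRAD heading of its own; it is simply reported under "Methods".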
2) Sections Sequences
In recent years, the scholarly community has followed
IMRAD-based rules when preparing research documents,
including the rule of sequence. If the rules are followed
properly, it becomes much easier for IR systems to identify
the sections correctly. However, we analyzed the PDF files
in the employed data set and observed that some research
papers do not follow the rule of sequence. Alternative
methods are therefore required to identify the sections of
papers that are not structured according to the rules.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2020.3009021, IEEE Access
FIGURE 7. System Design and Component
3) Sections Known Names
Typically, the scholarly community uses the names listed
below for the sections of research papers. The presence of
these names makes it easier to identify the logical
sections. The names include:
Abstract
Keywords
Introduction
Literature Review / History /Related work
Methodology
Results / Discussion
Conclusions and future work
References
In this study, we also identified synonyms of these section
names. Although the scientific community follows the IMRAD
structure, authors do not strictly adhere to the names
listed above and usually pick synonyms of their own choice.
For instance, in the PDF files of the employed data set,
some authors named the “Results/Discussion” section
“Evaluation”, while others named it “Analysis”. We
extracted synonyms from all 5000 research articles.
Wherever the exact section name is not found, the proposed
methodology matches the synonyms for section mapping.
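A minimal sketch of this lookup is given below. The dictionary is a small illustrative subset rather than the synonym list extracted from the data set, and the target labels, the numbering pattern, and the function name map_heading are assumptions made for this example.

```python
import re

# Illustrative subset only. "History", "Related work", "Evaluation", and
# "Analysis" are named in the text; the target labels are assumptions.
SECTION_NAMES = {
    "introduction": "Introduction",
    "literature review": "Literature Review",
    "history": "Literature Review",
    "related work": "Literature Review",
    "methodology": "Methodology",
    "results": "Results/Discussion",
    "discussion": "Results/Discussion",
    "evaluation": "Results/Discussion",
    "analysis": "Results/Discussion",
}

# Leading numbering such as "1.", "2.1", or "VI." followed by whitespace.
NUMBERING_RE = re.compile(r"^(?:\d+(?:\.\d+)*\.?|[IVX]+\.)\s+")


def map_heading(heading):
    """Return the logical section for a heading, or None if it is unknown."""
    name = NUMBERING_RE.sub("", heading.strip()).lower()
    if name in SECTION_NAMES:  # exact name or exact synonym
        return SECTION_NAMES[name]
    for known, label in SECTION_NAMES.items():  # whole-word fallback
        if re.search(r"\b" + re.escape(known) + r"\b", name):
            return label
    return None
```

For example, map_heading("5 Evaluation") yields "Results/Discussion", while a heading such as "Acknowledgment" yields None and is left to the other features.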
4) Citation Count
Citations are among the most powerful bibliometric
indicators and are used to reveal facts about various
aspects of the scientific literature. A citation count here
is the number of times citations appear in the body of a
document, also known as in-text citations. We use the
citation count as a measure to detect logical sections, on
the assumption that the number of citations found in a
logical section is a strong clue to its role. For instance,
the “Literature Review” section typically contains more
in-text citations than other sections, so a section with a
comparatively high citation count can be mapped to the
“Literature Review/Related Work” section.
The XML files contain elements such as Xref, ref, and
section. An Xref element with ref-type = ’bibr’ denotes an
in-text citation and can be linked to the corresponding
’ref’ tag via an attribute. After extracting all these
tags, we computed the citation counts. Thereafter, the
citation anchors were detected using CAD [23].
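The counting step can be illustrated as follows. The sample XML is a hypothetical simplification (lower-case xref tags, one h1 per section element), and the function names are chosen for this example only; the sketch does not reproduce the anchor detection of CAD [23].

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified XML; only xref elements with ref-type="bibr"
# are treated as in-text citations.
SAMPLE_XML = """
<article>
  <section>
    <h1>Introduction</h1>
    <region>Prior work <xref ref-type="bibr" rid="R1">[1]</xref> is shown in
      <xref ref-type="fig" rid="F1">Figure 1</xref>.</region>
  </section>
  <section>
    <h1>Related Work</h1>
    <region>See <xref ref-type="bibr" rid="R2">[2]</xref>,
      <xref ref-type="bibr" rid="R3">[3]</xref>, and
      <xref ref-type="bibr" rid="R1">[1]</xref>.</region>
  </section>
</article>
"""


def citation_counts(xml_text):
    """Count in-text citations under each section heading."""
    counts = {}
    for section in ET.fromstring(xml_text).iter("section"):
        title = section.findtext("h1", default="").strip()
        counts[title] = sum(
            1 for xref in section.iter("xref") if xref.get("ref-type") == "bibr"
        )
    return counts


def citation_shares(counts):
    """Express each count as a fraction of all in-text citations in the paper."""
    total = sum(counts.values())
    return {title: (n / total if total else 0.0) for title, n in counts.items()}


counts = citation_counts(SAMPLE_XML)
shares = citation_shares(counts)
```

In this toy document, "Related Work" holds three of the four in-text citations, which is the kind of signal the citation-count feature relies on.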
FIGURE 8. System Flow
5) The occurrence of objects
Most research articles contain multiple figures and tables,
which the XML files represent as objects. We use these
objects as a measure to identify logical sections,
following the same reasoning as for the citation count. The
results/discussion section contains comparatively more
objects than other sections. These objects are detected
using XML tags such as “TableBox” and “FigureBox”, as shown
in Figure 4 and Figure 5.
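Counting these objects per section can be sketched in the same way as counting citations. The sample XML below is a hypothetical simplification in which figures and tables appear as region elements whose class attribute is "DoCO:FigureBox" or "DoCO:TableBox"; the function name count_objects is likewise an assumption.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, simplified XML with figure and table regions.
SAMPLE_XML = """
<article>
  <section>
    <h1>Methodology</h1>
    <region class="DoCO:FigureBox"/>
    <region class="DoCO:TextChunk">Some text.</region>
  </section>
  <section>
    <h1>Results</h1>
    <region class="DoCO:FigureBox"/>
    <region class="DoCO:FigureBox"/>
    <region class="DoCO:TableBox"/>
  </section>
</article>
"""


def count_objects(xml_text):
    """Count figure and table regions under each section heading."""
    counts = {}
    for section in ET.fromstring(xml_text).iter("section"):
        title = section.findtext("h1", default="").strip()
        kinds = Counter(region.get("class") for region in section.iter("region"))
        counts[title] = {
            "figures": kinds["DoCO:FigureBox"],
            "tables": kinds["DoCO:TableBox"],
        }
    return counts


object_counts = count_objects(SAMPLE_XML)

# Share of all figures that fall into each section.
total_figures = sum(c["figures"] for c in object_counts.values())
figure_shares = {
    title: (c["figures"] / total_figures if total_figures else 0.0)
    for title, c in object_counts.items()
}
```

Here "Results" holds two of the three figures and the only table, so it would be the strongest candidate under the object-based features.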
D. MAPPING VIEW ENGINE (MVE)
The Mapping View Engine (MVE) is designed to present the
resulting mapping for analysis. Once all sections have been
mapped to IMRAD, views are needed to retrieve that
information from the database. Retrieving it directly
requires SQL/XPath queries with regular expressions, which
is not always easy; therefore, the MVE provides different
views of the results.
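One way such a view could look is sketched below, again with Python's sqlite3 module standing in for PostgreSQL. The table layout, the sample rows, and the view name imrad_overview are hypothetical and are not taken from the paper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical result table produced by the mapping step.
    CREATE TABLE heading (
        heading_id INTEGER PRIMARY KEY,
        paper_id   INTEGER NOT NULL,
        title      TEXT NOT NULL,
        imrad      TEXT
    );
    INSERT INTO heading VALUES
        (1, 1, 'Introduction', 'Introduction'),
        (2, 1, 'Motivation',   'Introduction'),
        (3, 1, 'Evaluation',   'Results');

    -- The view hides the grouping query behind a simple name.
    CREATE VIEW imrad_overview AS
        SELECT paper_id, imrad, COUNT(*) AS n_headings
        FROM heading
        GROUP BY paper_id, imrad;
    """
)

overview = conn.execute(
    "SELECT imrad, n_headings FROM imrad_overview ORDER BY imrad"
).fetchall()
```

An analyst can then query imrad_overview directly instead of rewriting the grouping logic for every inspection.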
VI. RESULTS AND EVALUATION
Once all the modules of the proposed methodology had been
implemented, the next step was to evaluate the approach on
the benchmark data set. The outcomes are evaluated in two
phases: (1) we identified a one-layer hierarchy of
sub-headings and then mapped the sections to IMRAD; (2)
afterwards, we determined the collective contribution of
all the parameters in a hybrid manner by combining the
parameters “citation count”, “figures count”
Input: XML publications data set
Result: IMRAD-mapped sections
for each xml_i ∈ XML do
    lh = extractHeaders(xml_i)                  // Algorithm 2
    ml = mapSectionUsingDictionary(lh, ml)
    ml = mapSectionUsingTemplate(lh, ml)        // A. Shahid and M. T. Afzal [2]
    ml = mapSectionUsingCitationFreq(lh, ml)
    ml = mapSectionUsingFigureFreq(lh, ml)
    ml = mapSectionUsingTableFreq(lh, ml)
end for
Algorithm 1: Section Mapping to IMRAD
Input: XML file: xml
Output: Header/Section list hl
T <= xml                                        // extract text from xml
for each heading ∈ T do
    hl <= xPathQuery(getH1)                     // get heading <h1>
    hl <= xPathQuery(getH2)                     // get heading <h2>
    hl <= xPathQuery(getH3)                     // get heading <h3>
    bl <= extractBullets(hl)                    // get bullets
    hl <= getSectionClass(DoCO:SectionTitle)
end for
Algorithm 2: extractHeader
Input: Header/Section list hl
Output: Bullet list bl
bl <= (?:^|(?<=\s))\d\.?(?:\d+)?(?=\s)|\*(?=\s)
return bl                                       // bullet list
Algorithm 3: extractBullet
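The bullet pattern can be tried out with Python's re module. The sketch assumes that the expression in Algorithm 3 reads (?:^|(?<=\s))\d\.?(?:\d+)?(?=\s)|\*(?=\s), which is our reading of the printed form, and the rule in heading_level (a dot inside the bullet marks a subheading) is a hypothetical illustration rather than the rule used in the paper.

```python
import re

# Assumed reading of the expression in Algorithm 3: a number such as "1",
# "1." or "2.1", or a "*" bullet, each followed by whitespace.
BULLET_RE = re.compile(r"(?:^|(?<=\s))\d\.?(?:\d+)?(?=\s)|\*(?=\s)")


def extract_bullet(heading):
    """Return the bullet at the start of a heading, or None if there is none."""
    match = BULLET_RE.match(heading.strip())
    return match.group(0) if match else None


def heading_level(heading):
    """Hypothetical rule: a dot inside the bullet marks a subheading."""
    bullet = extract_bullet(heading)
    if bullet is None:
        return None
    return 2 if "." in bullet.rstrip(".") else 1
```

Under this reading, "1. Introduction" is treated as a main heading and "2.1 Related Work" as a subheading, while an unnumbered heading yields no bullet and must be handled by the tag-based case.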
Input: XML file: xml, Header list: lh
Result: Section mapping list: sml, Header list: lh
for each l_j ∈ lh do
    for each imrad_i ∈ IMRAD do
        if l_j ≠ imrad_i then
            return sml                          // section mapping list
        end if
    end for
end for
Algorithm 4: mapSectionUsingDictionary
Input: XML file: xml, Section mapping list: sml
Result: Section mapping list: sml
for each l_j ∈ lh do
    (CT_i, CTP_i) = String-Citation_Anchor_Detection(t)     // CAD [23]
    (CN_i, CNP_i) = Numeric-Citation_Anchor_Detection(t)    // CAD [23]
    for each imrad_i ∈ IMRAD do
        if l_j ≠ imrad_i then
            if CF_i ≥ 50% then
                sml_discussion <= true          // a DISCUSSION section
                return sml                      // section mapping list
            end if
        else
            if CF_i ≥ 30% then
                sml_intro <= true               // an INTRODUCTION section
                return sml                      // section mapping list
            end if
        end if
    end for
end for
Algorithm 5: mapSectionUsingCitationFrequency
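The threshold idea behind Algorithm 5 can be sketched as follows. This is a simplified illustration, not a transcription of the algorithm: it omits the dictionary check and the early returns, it assumes that the comparisons are "greater than or equal to", and the function name map_by_citation_share and the sample counts are hypothetical. Only the 50% and 30% thresholds and their target labels are taken from Algorithm 5.

```python
def map_by_citation_share(citation_counts, high=0.50, low=0.30):
    """Label headings by their share of the paper's in-text citations.

    Assumed rule: share >= high -> "Discussion", otherwise
    share >= low -> "Introduction"; other headings stay unlabeled.
    """
    total = sum(citation_counts.values())
    labels = {}
    if total == 0:
        return labels
    for heading, count in citation_counts.items():
        share = count / total
        if share >= high:
            labels[heading] = "Discussion"
        elif share >= low:
            labels[heading] = "Introduction"
    return labels


# Hypothetical counts for headings that the dictionary step could not map.
labels = map_by_citation_share(
    {"Background": 14, "Comparison with prior work": 22, "Setup": 2, "Summary": 2}
)
```

Headings with a small share, such as "Setup" here, remain unlabeled and are left to the figure-based and table-based features.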
Input: XML file: xml, Section mapping list: sml
Result: Section mapping list: sml
for each l_j ∈ lh do
    FP_i = getRegionClass(DoCO:FigureBox)
    for each imrad_i ∈ IMRAD do
        if l_j ≠ imrad_i then
            if FP_i ≥ 60% then
                sml_results <= true             // a RESULT section
                return sml                      // section mapping list
            end if
        else
            if FP_i ≥ 30% then
                sml_meth <= true                // a METHODOLOGY section
                return sml                      // section mapping list
            end if
        end if
    end for
end for
Algorithm 6: mapSectionUsingFigureFreq
Input: XML file: xml, Section mapping list: sml
Result: Section mapping list: sml
for each l_j ∈ lh do
    TP_i = getRegionClass(DoCO:TableBox)
    for each imrad_i ∈ IMRAD do
        if l_j ≠ imrad_i then
            if TP_i ≥ 70% then
                sml_results <= true             // a RESULT section
                return sml                      // section mapping list
            end if
        else
            if TP_i ≥ 20% then
                sml_meth <= true                // a METHODOLOGY section
                return sml                      // section mapping list
            end if
        end if
    end for
end for
Algorithm 7: mapSectionUsingTableFrequency
and “tables count”. Standard evaluation measures, namely
precision, recall, and F-measure, were employed. We
calculated the precision, recall, and F-measure of each
section separately for both of the aforementioned phases.
Figure 12 shows the combined average precision, recall, and
F-measure over all sections, and Figures 9, 10, and 11 show
the individual scores for each section. The formulas used
for these measures are given below.
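\[
P_x = \frac{TP_x}{TP_x + FP_x}, \qquad
R_x = \frac{TP_x}{TP_x + FN_x}, \qquad
x_0 = \text{Introduction},\; x_1 = \text{Methods},\; x_2 = \text{Results},\; x_3 = \text{Discussion}
\tag{1}
\]
\[
P_{avg} = \sum_{x=0}^{3} \frac{P_x}{4} \quad (P_{avg} = \text{Average Precision}), \qquad
R_{avg} = \sum_{x=0}^{3} \frac{R_x}{4} \quad (R_{avg} = \text{Average Recall})
\tag{2}
\]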
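Equations (1) and (2) can be checked with a short computation. The true-positive, false-positive, and false-negative counts below are made-up numbers used only to exercise the formulas, and the F-measure is computed with the standard harmonic-mean definition, which is an assumption because the text does not print that formula.

```python
SECTIONS = ["Introduction", "Methods", "Results", "Discussion"]


def precision(tp, fp):
    """P_x = TP_x / (TP_x + FP_x), as in Equation (1)."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """R_x = TP_x / (TP_x + FN_x), as in Equation (1)."""
    return tp / (tp + fn) if tp + fn else 0.0


def f_measure(p, r):
    """Standard harmonic mean of precision and recall (assumed definition)."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Hypothetical confusion counts per section, for illustration only.
counts = {
    "Introduction": {"tp": 9, "fp": 1, "fn": 1},
    "Methods": {"tp": 8, "fp": 2, "fn": 2},
    "Results": {"tp": 6, "fp": 4, "fn": 2},
    "Discussion": {"tp": 7, "fp": 3, "fn": 0},
}

per_section = {
    name: (
        precision(counts[name]["tp"], counts[name]["fp"]),
        recall(counts[name]["tp"], counts[name]["fn"]),
    )
    for name in SECTIONS
}

# Equation (2): unweighted averages over the four sections.
p_avg = sum(p for p, _ in per_section.values()) / len(SECTIONS)
r_avg = sum(r for _, r in per_section.values()) / len(SECTIONS)
```

Because Equation (2) averages the four sections with equal weight, a section with few instances influences the average as much as a frequent one.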
FIGURE 9. Section wise comparison of precision
Figure 9 compares the precision of our approach with that
of the state-of-the-art approaches. The graph shows that
all three existing approaches mapped the Introduction with
high accuracy but failed to map the other sections
accurately. In contrast, our approach achieved nearly the
same precision across all sections. Further, the other
approaches were unable to identify the Results or
Discussion sections. Similarly, Figure 10 compares the
recall of our proposed approach with that of the three
state-of-the-art approaches.
FIGURE 10. Section wise comparison of recall
FIGURE 11. Section wise comparison of F-Measure
Since the proposed approach is intended for information
retrieval (IR) systems, where the main concern is that most
of the responses returned for a query are correct, we draw
comparisons here on the basis of the precision score. The
evaluation results show that our approach improves
precision, and when all the features were assessed in a
hybrid manner, the proposed approach clearly outperformed
the approaches in [2], [6], and [8]. We believe that this
improvement is due to the consideration of objects in the
XML files.
The analysis above shows that our study outperformed all
three existing state-of-the-art studies on section mapping,
achieving improvements of 8.96%, 21.77%, and 9.50% over
Ding et al. [8], Raja Habib and M. T. Afzal [6], and A.
Shahid and M. T. Afzal [2], respectively. This demonstrates
the potential of the novel parameters, namely in-text
citation count, figure count, and table count. We argue
that the value of XML objects must not be overlooked in
section detection. The collective contribution of all the
employed parameters, together with the designed regular
expressions, can overcome the deficiencies of the existing
state-of-the-art in section mapping.
FIGURE 12. Comparison of average precision of combined sections
VII. PERFORMANCE EVALUATION COMPARISON
We compared the running time of our approach with that of
the state-of-the-art techniques. All experiments were
performed in the same environment on AWS medium instances.
We ran the experiments in multiple iterations and report
the median values. Table 2 shows the times in seconds. The
approach of Ding et al. [8] consumed less time than all
other approaches because it uses only dictionary terms to
identify sections and map them to IMRAD. The approach of A.
Shahid and M. T. Afzal [2] took slightly longer than that
of Ding et al. [8] because it uses templates along with the
dictionary terms. The approach of Raja Habib and M. T.
Afzal [6] took longer than the previous two because
counting the citations and parsing the whole text are
time-consuming. Our proposed approach consumed the most
time because it applies multiple techniques for section
identification and mapping. This is expected, since our
approach uses all the features employed by the
state-of-the-art techniques [2], [6], [8] along with the
newly proposed features. The whole process was done
offline, and all the information is maintained in the
database. Although the proposed technique takes more time
than the state-of-the-art techniques, this is not critical,
because all of the processing is done on the backend and
the results are precomputed before they are made available
to the online system. Consequently, when a user issues a
query, there is no difference in response time between the
approaches; all of them serve the user from the
preprocessed data.
TABLE 2. Performance comparison with state-of-the-art techniques
Technique                        Time in seconds
Ding et al. [8]                  10
A. Shahid and M. T. Afzal [2]    14
Habib et al. [6]                 21
Proposed                         34
VIII. CONCLUSION
This paper has presented a novel method to map the logical
sections of PDF-formatted research papers to the IMRAD
structure. Unlike the contemporary state-of-the-art, the
approach was designed after a critical manual investigation
of one of the most recent approaches to logical section
mapping. Our study identified a set of novel features and
justified their potential by evaluating them on a benchmark
data set. These features include (1) in-text citation
count, (2) figure count, and (3) table count. The study
generated XML files from PDF files using PDFX [22] and
exploited XML headers and objects with the help of regular
expressions to detect logical sections and sub-sections and
to map them to their corresponding IMRAD sections. The
study outperformed contemporary studies by attaining a
precision of 0.97 (i.e., 0.85 vs. 0.97) and a recall of
0.98 (i.e., 0.86 vs. 0.98). This improvement in precision
and recall is due to the incorporation of features that
have been overlooked by the contemporary state-of-the-art.
PDFX was unable to convert approximately 10% of the PDF
files into XML files. Future work may address section
identification in the cases where PDFX [22] fails.
ACKNOWLEDGMENT
The authors thank Ms. Faiza Qayyum for her help with the
linguistic review of the paper and for feedback that made
it more readily understandable.
REFERENCES
[1] P. O. Larsen and M. von Ins, “The rate of growth in scientific
publication and the decline in coverage provided by science citation
index,” Scientometrics, vol. 84, no. 3, pp. 575–603, Mar. 2010. [Online].
Available: https://doi.org/10.1007/s11192-010-0202-z
[2] A. Shahid and M. T. Afzal, “Section-wise indexing and retrieval of
research articles,” Cluster Computing, vol. 21, no. 1, pp. 481–492, May
2017. [Online]. Available: https://doi.org/10.1007/s10586-017-0914-4
[3] L. B. Sollaci and M. G. Pereira, “The introduction, methods, results, and
discussion (IMRAD) structure: a fifty-year survey,” J Med Libr Assoc,
vol. 92, no. 3, pp. 364–367, Jul 2004.
[4] G. Batmanabane, “The IMRAD structure,” in Reporting and Publishing
Research in the Biomedical Sciences. Springer Singapore, Dec. 2017, pp.
1–4. [Online]. Available: https://doi.org/10.1007/978-981-10-7062-4_1
[5] C. Lange and A. Di Iorio, “Semantic publishing challenge – assessing the
quality of scientific output,” vol. 475, 08 2014.
[6] R. Habib and M. T. Afzal, “Sections-based bibliographic coupling for
research paper recommendation,” Scientometrics, vol. 119, no. 2, pp.
643–656, Mar. 2019. [Online]. Available:
https://doi.org/10.1007/s11192-019-03053-8
[7] A. Khan, A. Shahid, and M. Afzal, “Extending co-citation using sections
of research articles,” Turkish Journal of Electrical Engineering and
Computer Sciences, vol. 26, pp. 3346–3356, 11 2018.
[8] Y. Ding, X. Liu, C. Guo, and B. Cronin, “The distribution of references
across texts: Some implications for citation analysis,” Journal of
Informetrics, vol. 7, no. 3, pp. 583–592, Jul. 2013. [Online]. Available:
https://doi.org/10.1016/j.joi.2013.03.003
[9] B. Fecher and S. Friesike, Open Science: One Term, Five Schools of
Thought. Cham: Springer International Publishing, 2014, pp. 17–47.
[10] S. Teufel, A. Siddharthan, and C. Batchelor, “Towards domain-
independent argumentative zoning: Evidence from chemistry and
computational linguistics,” in Proceedings of the 2009 Conference
on Empirical Methods in Natural Language Processing. Singapore:
Association for Computational Linguistics, Aug. 2009, pp. 1493–1502.
[Online]. Available: https://www.aclweb.org/anthology/D09-1155
[11] S. Teufel and M. Moens, “Summarizing scientific articles: Experiments
with relevance and rhetorical status,” Computational Linguistics,
vol. 28, no. 4, pp. 409–445, Dec. 2002. [Online]. Available:
https://doi.org/10.1162/089120102762671936
[12] M. Liakata, S. Teufel, A. Siddharthan, and C. Batchelor, “Corpora for the
conceptualisation and zoning of scientific papers,” in Proceedings of the
Seventh International Conference on Language Resources and Evaluation
(LREC’10). Valletta, Malta: European Language Resources Association
(ELRA), May 2010.
[13] S. Teufel, “Citations and sentiment. in: Workshop on text mining for
scholarly communications and repositories, university of manchester, uk
(2009),” 2009.
[14] S. Maričić, J. Spaventi, L. Pavičić, and G. Pifat-Mrzljak, “Citation
context versus the frequency counts of citation histories,” Journal of
the American Society for Information Science, vol. 49, no. 6, pp.
530–540, 1998. [Online]. Available:
https://doi.org/10.1002/(sici)1097-4571(19980501)49:6<530::aid-asi5>3.0.co;2-8
[15] S. Peroni, D. Shotton, and F. Vitali, “Faceted documents,”
in Proceedings of the 2012 ACM symposium on Document
engineering - DocEng '12. ACM Press, 2012. [Online]. Available:
https://doi.org/10.1145/2361354.2361396
[16] Y. Guo, A. Korhonen, M. Liakata, I. Silins, J. Hogberg, and U. Stenius,
“A comparison and user-based evaluation of models of textual information
structure in the context of cancer risk assessment,” BMC bioinformatics,
vol. 12, p. 69, 03 2011.
[17] Ş. Kafkas, X. Pi, N. Marinos, F. Talo’, A. Morrison, and J. R.
McEntyre, “Section level search functionality in Europe PMC,” Journal
of Biomedical Semantics, vol. 6, no. 1, p. 7, 2015. [Online]. Available:
https://doi.org/10.1186/s13326-015-0003-7
[18] S. Nazir, M. Asif, S. Ahmad, F. Bukhari, M. T. Afzal, and H. Aljuaid,
“Important citation identification by exploiting content and section-wise
in-text citation count,” PLOS ONE, vol. 15, no. 3, p. e0228885, Mar.
2020. [Online]. Available: https://doi.org/10.1371/journal.pone.0228885
[19] S.-U. Hassan, M. Imran, S. Iqbal, N. R. Aljohani, and R. Nawaz, “Deep
context of citations using machine-learning models in scholarly full-text
articles,” Scientometrics, vol. 117, no. 3, pp. 1645–1662, Oct. 2018.
[Online]. Available: https://doi.org/10.1007/s11192-018-2944-y
[20] D. Pride and P. Knoth, “Incidental or influential? - challenges in auto-
matically detecting citation importance using publication full texts,” in
Research and Advanced Technology for Digital Libraries. Springer
International Publishing, 2017, pp. 572–578.
[21] J. Beel and B. Gipp, “Google scholar’s ranking algorithm: An introductory
overview,” 07 2009.
[22] A. Constantin, S. Pettifer, and A. Voronkov, “PDFX,” in
Proceedings of the 2013 ACM symposium on Document
engineering - DocEng '13. ACM Press, 2013. [Online]. Available:
https://doi.org/10.1145/2494266.2494271
[23] R. Ahmad and M. T. Afzal, “CAD: an algorithm for citation-anchors
detection in research papers,” Scientometrics, vol. 117, no. 3, pp.
1405–1423, Sep. 2018. [Online]. Available:
https://doi.org/10.1007/s11192-018-2920-6
Ibrar Ahmed earned his master’s in computer
science from International Islamic University Is-
lamabad. Later he earned his MS in Computer
Engineering Degree from the University of Sci-
ence and Technology Taxila. He is working in
a research and development organization as a
Database Specialist. Currently, he is pursuing his
Ph.D. in computer science. He authored multiple
books in the database field. His research interest
includes Databases, digital libraries, and Data Sci-
ence.
MUHAMMAD TANVIR AFZAL received the
Ph.D. degree with highest distinction in Computer
Science from the Graz University of Technology,
Austria, and secured a gold medal in his M.Sc. in
Computer Science from Quaid-i-Azam University,
Islamabad, Pakistan. He has been associated with
academia and industry at various levels for the last
20 years, and currently, he is serving as Professor
in the Department of Computer Science at Capital
University of Science and Technology, Islamabad.
He is also serving as editor-in-chief for a reputed impact factor journal
known as: Journal of Universal Computer Science. Dr. Afzal authored more
than 100 research papers in leading peer-reviewed journals and confer-
ences in the field of Data Science, Information retrieval and visualization,
Semantics, Digital Libraries, and Scientometrics. He has authored two
books and has edited two books in Computer Science. His cumulative
impact factor is 60+, with citations over 500. He played a pivotal role
in making collaborations between MAJU-JUCS, MAJU-IICM, and TUG-
UNIMAS. He served as a Ph.D. symposium chair, session chair, finance
chair, committee member, and editor of several IEEE, ACM, Springer,
Elsevier international conferences, and journals. Dr. Afzal conducted more
than 100 curricular, co-curricular, and extra-curricular activities in the last
5 years including seminars, workshops, national competitions (ExcITeCup),
and invited international and national speakers from Google, Oracle, IICM,
IFIS, SEGA Europe, etc. Under his supervision, more than 60 post-grad
students (MS and Ph.D.) have defended their research theses successfully
and a number of Ph.D. and MS students are pursuing their research with
him.
VOLUME 4, 2016 13
... For instance, some of the papers published in ACL anthology have used the term "motivation" for "Introduction" section, "model" for "Methodology" section and "evaluation" for "results" etc. In the scientific literature, IMRaD (Introduction, Methods, Results, And Discussion) is considered a standard for writing research articles, which suggests that all of the main logical section of the papers must be named Introduction, Methods, Results and Discussion (IMRaD) Ahmed and Afzal (2020). In this study, we employed the model from Ahmed and Afzal (2020) to map logical sections to IMRAD. ...
... In the scientific literature, IMRaD (Introduction, Methods, Results, And Discussion) is considered a standard for writing research articles, which suggests that all of the main logical section of the papers must be named Introduction, Methods, Results and Discussion (IMRaD) Ahmed and Afzal (2020). In this study, we employed the model from Ahmed and Afzal (2020) to map logical sections to IMRAD. The reason of harnessing this model is that it is one of the most recent and effective models that has introduced (1) variants of In-text citation count (2) ...
... The structure of a research paper follows a section-wise hierarchical structure encompassing sections such as (1) Abstract, (2) Introduction, (3) Related Work, (4) Methods, (5) Results and Conclusion. Since we have already mapped research articles to IMRaD, so as per that structure, "Introduction" section and "Literature review" section have in-text citations in higher count Ahmed and Afzal (2020). These sections delineate form a link with earlier studies. ...
Article
Full-text available
Citation analysis-based systems are premised on assuming that all citations are equally important. The scientific community argues that a citation may hold divergent reasons and, thus, should not be treated at par. In this regard, a plethora of existing studies classify citations for varying reasons. Presently, the community has a propensity toward binary citation classification with the notion of contemplating only important reasons while employing quantitative analysis-based measures. We argue that outcomes yielded by the contemporary state-of-the-art models cannot be deemed ideal as the plethora of them has been evaluated on a data set with a minimal number of instances, due to which the outcomes cannot be generalized. The scope of results from such approaches is restricted to a single domain only, which may exhibit entirely different behaviour for the different data sets. Most of the studies are ruled by the content-based features evaluated by harnessing traditional classification models like Support Vector Machine (SVM) and random forest (RF), while an inconsiderable number of studies employ metadata which holds the potential to serve as a quintessential indicator, to tackle meaningful citations. In this study, we introduce a Multilayer perceptron artificial neural network (MLP-ANN) binary citation classifier, which exploits the best combinations of features formed using both sources. We also introduce a new benchmark data set from the electrical engineering domain, which is consolidated with two existing benchmark data sets for model evaluation. The outcomes reveal that the results produced by the proposed MLP model outperform the contemporary models achieving a precision of 0.92.
... Some other studies employed systematic methods to integrate various rules to enhance the identification ability. For example, Ahmed et al. (2020) proposed a system framework containing compelling features that are often overlooked, such as the number of citations, figures, tables, and other essential features, which improved the identification effect of previous work. ...
... History was excluded because the characteristic will introduce predictive information in classifying the test set, which may cause the accumulation of errors and impact the overall classification performance. We further incorporated some potential characteristics mentioned by Ahmed and Afzal (2020), which are often ignored, including the number of citations, the number of figures, the number of tables and so on. We viewed the sum of tables and figures as another characteristic explored in the experiments to simplify the additional characteristics. ...
Article
Full-text available
With the enrichment of literature resources, researchers are facing the growing problem of information explosion and knowledge overload. To help scholars retrieve literature and acquire knowledge successfully, clarifying the semantic structure of the content in academic literature has become the essential research question. In the research on identifying the structure function of chapters in academic articles, only a few studies used the deep learning model and explored the optimization for feature input. This limits the application, optimization potential of deep learning models for the research task. This paper took articles of the ACL conference as the corpus. We employ the traditional machine learning models and deep learning models to construct the classifiers based on various feature input. Experimental results show that (1) Compared with the chapter content, the chapter title is more conducive to identifying the structure function of academic articles. (2) Relative position is a valuable feature for building traditional models. (3) Inspired by (2), this paper further introduces contextual information into the deep learning models and achieved significant results. Meanwhile, our models show good migration ability in the open test containing 200 sampled non-training samples. We also annotated the ACL main conference papers in recent five years based on the best practice performing models and performed a time series analysis of the overall corpus. This work explores and summarizes the practical features and models for this task through multiple comparative experiments and provides a reference for related text classification tasks. Finally, we indicate the limitations and shortcomings of the current model and the direction of further optimization.
... Some other studies employed systematic methods to integrate various rules to enhance the identification ability. For example, Ahmed et al. (2020) proposed a system framework containing compelling features that are often overlooked, such as the number of citations, figures, tables, and other essential features, which improved the identification effect of previous work. ...
... The IMRaD format, standing for Introduction, Methods, Results, and Discussion [6], is one of the most widely accepted structures for scientific papers, particularly in the fields of life sciences, engineering, and increasingly in Information Technology (IT), as illustrated in Figure 1. The format follows a linear narrative that mirrors the research process, guiding readers through a logical flow from the background of the research to its conclusion. ...
Article
Full-text available
The effectiveness of scientific communication is vital in the field of Information Technology, where researchers strive to present their results clearly and systematically. This paper aims to explore the potential of using the IMRaD format (Introduction, Methods, Results, and Discussion) to improve the effectiveness of scientific papers in this field. The IMRaD format is characterized by its ability to enhance clarity and organization, making it easier to understand the complex information that distinguishes IT research. Through a comparative analysis of papers that follow the IMRaD format, the study evaluates the impact of this structure on citation rates, reader engagement, and acceptance of research within the academic community. Data was collected through surveys and reviews of published papers in the field of Information Technology, revealing that papers adopting the IMRaD format led to significant improvements in the clarity and comprehension of research messages. Furthermore, the research highlights the importance of adopting the IMRaD format in enhancing collaboration between researchers and practitioners, facilitating the exchange of knowledge and expertise. By improving the accessibility and understandability of research, the adoption of this format can contribute to driving innovation across various areas within Information Technology. Thus, utilizing the IMRaD format represents an important step toward enhancing the effectiveness of scientific research and practical applications in this dynamic field.
... The common practice in reporting research in journal articles follows the Introduction, Methods, Results and Discussion (IMRAD) structure (Wolfe, Britt & Poe Alexander, 2011; Oriokot et al., 2011; Bertin et al., 2016; Ribeiro, Yao & Rezende, 2018; Ahmed & Afzal, 2020). IMRAD has been used in reporting research findings since the last century (Teodosiu, 2010; Nair & Nair, 2014). ...
Article
Full-text available
The primary aim of this article is to document the process of turning research data into a peer-reviewed journal article in the built environment disciplines. This is necessary to educate prospective authors who wish to convert their research output into quality journal articles. The article first explains how to do the write-up in a professional manner. The write-up is made up of three key parts: preliminaries (title, abstract, key words), main body (introduction, literature review, methods, findings, discussion, conclusion, notes, and references) and appendage. Once the write-up is complete, the article guides the prospective author on dos and don'ts (the professional ethics of publishing) during the pre-submission and post-submission periods. This practice would help lessen the burden on peer review systems and help the prospective author achieve successful publication in a peer-reviewed journal.
... They found that the Voting Feature Intervals algorithm performed best, and the best result of the accuracy value is 71.38%. Ahmed and Afzal (2020) thought that term-based literature retrieval could not meet special needs. For example, the current retrieval system cannot return the literature containing a specific term (e.g., "PageRank") in the structure of the results. ...
Preprint
Full-text available
Purpose: The purpose of this paper is to explore which structures of academic articles referees would pay more attention to, what specific content referees focus on, and whether the distribution of PRC is related to the citations. Design/methodology/approach: Firstly, utilizing the feature words of section title and hierarchical attention network model (HAN) to identify the academic article structures. Secondly, analyzing the distribution of PRC in different structures according to the position information extracted by rules in PRC. Thirdly, analyzing the distribution of feature words of PRC extracted by the Chi-square test and TF-IDF in different structures. Finally, four correlation analysis methods are used to analyze whether the distribution of PRC in different structures is correlated to the citations. Findings: The count of PRC distributed in Materials and Methods and Results section is significantly more than that in the structure of Introduction and Discussion, indicating that referees pay more attention to the Material and Methods and Results. The distribution of feature words of PRC in different structures is obviously different, which can reflect the content of referees' concern. There is no correlation between the distribution of PRC in different structures and the citations. Research limitations/implications: Due to the differences in the way referees write peer review reports, the rules used to extract position information cannot cover all PRC. Originality/value: The paper finds a pattern in the distribution of PRC in different academic article structures proving the long-term empirical understanding. It also provides insight into academic article writing: researchers should ensure the scientificity of methods and the reliability of results when writing academic article to obtain a high degree of recognition from referees.
... The studies focusing on defining the structure of a research paper's content are referred to as discourse analysis [5]. This methodology provides a clear guide with broad milestones and a structure for the researcher or student to follow and produce a high-quality, scientifically valuable project in the least amount of time and effort. ...
Conference Paper
In the information technology discipline, a graduation project is normally a compulsory task required to fulfill the university graduation requirements. The project is usually completed within one or two semesters and therefore represents a real challenge for many students: the time is limited, and the students are not expected to have gained adequate research experience yet. They lack a defined plan for the project phases and the activities involved in their research. This paper proposes a new methodology named IMGSIE, which stands for Introduction, Method, Generalization, Specification, Implementation, and Evaluation. It functions as a research lifecycle for structuring and organizing graduation projects. The methodology is already in use; three different questionnaires were distributed to supervisors and students, and the feedback was highly satisfactory and positive about its effectiveness for all students who used the methodology.
Conference Paper
Full-text available
The great development in the field of information technology and the emergence of modern technologies have played a leading role in advancing several areas of our daily lives. This growth is expected to continue over the coming years, so it has become necessary to focus on the research and technical side in order to keep pace with the great development in computer science. In this paper, we focus on solving a problem faced by many graduation-project students and postgraduate students: the lack of a clear methodology or work plan that deals directly with the research side and fits the technical nature of these projects. This has led to major delays in completing projects and to projects that do not appear in a form that reflects the essence of the project idea. Accordingly, a clear methodology is proposed that sets out the broad outlines enabling a researcher who follows it to obtain a project of high quality and good scientific value in the least time and effort. The methodology is named IMGSIE, short for Introduction, Methods, General, Specific, Implementation and Evaluation, as it comprises six basic stages in building a research project: the introduction, the methods used (research methodology), the project topic in general, the project topic in detail, the practical implementation, and the evaluation stage. To measure the quality and effectiveness of this methodology, two questionnaire forms were prepared, covering a large number of professors and students, both those who had already graduated and those still working on their research projects in the field of information technology, in order to shed light on the most important problems and obstacles they faced in their research projects. The first questionnaire was for those who completed their research projects without using the IMGSIE methodology, and the second was for those who used it. A large share of the researchers who used the methodology reported satisfaction with it; for example, among students who completed their graduation projects using the IMGSIE methodology, about 90.3% expressed satisfaction (strongly agree, somewhat agree, or agree) in terms of speed of completion, accuracy of work, and quality of the graduation projects.
Article
Full-text available
A citation is deemed a potential parameter for determining the linkage between research articles. The parameter has been employed extensively for multifarious academic purposes, such as calculating the impact factor of journals and the h-index of researchers, allocating research grants, and finding the latest research trends. The current state-of-the-art contends that not all citations are of equal importance. Based on this argument, the current trend in the citation classification community is to categorize citations as important or non-important. The community has proposed different approaches to extract important citations, such as citation count, context-based, metadata-based, and text-based approaches. The contemporary state-of-the-art in citation classification ignores potentially significant features that can play a vital role in citation classification. This research presents a novel approach for binary citation classification that exploits section-wise in-text citation frequencies, similarity score, and overall citation count-based features. The study also introduces a novel approach, based on machine learning algorithms, for assigning appropriate weights to the logical sections of research papers. The weights are allocated to the citations with respect to their sections. To perform the classification, we used three classification techniques: Support Vector Machine, Kernel Linear Regression, and Random Forest. The experiment was performed on two annotated benchmark datasets that contain 465 and 311 citation pairs of research articles respectively. The results revealed that the proposed approach attained an improved precision (0.84 vs 0.72) over the contemporary state-of-the-art approach.
Article
Full-text available
The excessive amount of digital information has made it crucial to extract the relevant information. This hinders researchers in finding documents pertaining to their research. There exist various state-of-the-art techniques, such as co-citation, bibliographic coupling, and their recent extensions like citation proximity analysis and citation order analysis, that recommend the relevant documents against the posed query. Most of these approaches are statistical in nature and thus can further be extended by incorporating some semantics to enhance the results. In this paper, we present an extension of a co-citation-based technique to identify the most relevant documents to co-cited document(s). The proposed system explores in-text citation frequencies and in-text citation patterns of co-cited documents within the different logical sections of cited-by papers. Furthermore, we have evaluated the proposed approach with the co-citation approach and citation proximity analysis (CPA) approach on a dataset acquired from CiteSeer. The outcomes revealed that most of the time the proposed approach outperformed other state-of-the-art techniques. The average correlation of the proposed approach is increased by 68% as compared to the co-citation-based approach. In comparison with CPA approach, the average correlation of the proposed approach is increased by 39% with respect to gold-standard rankings.
Article
Full-text available
This work looks in depth at several studies that have attempted to automate the process of citation importance classification based on the publications' full text. We analyse a range of features that have been previously used in this task. Our experimental results confirm that the number of in-text references is highly predictive of influence. Contrary to the work of Valenzuela et al., we find abstract similarity to be one of the most predictive features. Overall, we show that many of the features previously described in the literature are not particularly predictive. Consequently, we discuss challenges and potential improvements in the classification pipeline, provide a critical review of the performance of individual features and address the importance of constructing a large-scale gold-standard reference dataset.
Article
Full-text available
Relevant information extraction is a dire need of the scholarly community. A number of systems are available for finding relevant information in scientific literature, such as search engines, citation indexes, and digital libraries. For a search query, a long list of irrelevant documents is presented to users, mainly because of the huge number of full-text documents available and because of the unstructured nature of indexed scientific resources. Contemporary systems have formally defined the structure of scientific documents. However, populating this enriched scientific structure from unstructured or semi-structured scientific documents has not been addressed previously. In this research paper, we have designed, implemented, and evaluated an automated technique that tags each paper's content with the logical sections appearing in the scientific document. The proposed system has been evaluated against the benchmark and has also been compared with machine learning techniques that may be used for the same task. It has been empirically shown that the overall correctness and completeness of our proposed technique are 0.78 and 0.79 respectively, giving an overall accuracy of about 0.78. The achieved results are good compared with machine learning based classification. The developed system may help future information retrieval systems, digital libraries, and citation indexes to index, retrieve, rank and visualize the most relevant scientific documents for the scientific community.
Article
Full-text available
As the availability of open access full text research articles increases, so does the need for sophisticated search services that make the most of this new content. Here, we present a new feature available in Europe PMC that allows selected sections of full text articles to be searched, including figures and reference lists. Users can now search particular parts of an article, reducing noise and allowing fine-tuning of searches. To the best of our knowledge, Europe PMC is the first service that provides a granular literature search by allowing users to target their search to particular sections of articles. This new functionality is based on a heuristic algorithm that identifies and categorises article sections into 17 pre-defined categories based on the section heading. The tagger's performance is measured against a manually curated dataset consisting of 100 full text articles with an F-score of 98.02%. The section search is available from the advanced search within Europe PMC (http://europepmc.org). The source code is freely available from http://europepmc.org/ftp/oa/SectionTagger/.
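The idea of categorising sections from their headings alone, as described in the abstract above, can be sketched roughly as follows. This is only an illustration under stated assumptions: the keyword lists, the four IMRAD-style categories, and the `tag_heading` helper are hypothetical and are not the Europe PMC tagger's actual rules or its 17 categories.

```python
import re

# Hypothetical keyword lists, for illustration only (not the Europe PMC rule set).
HEADING_KEYWORDS = {
    "Introduction": ("introduction", "background"),
    "Methods": ("method", "materials", "experimental setup"),
    "Results": ("result", "findings", "evaluation"),
    "Discussion": ("discussion", "conclusion"),
}


def tag_heading(heading):
    """Return the first category whose keyword occurs in the heading, else None."""
    # Drop leading numbering such as "3.", "2.1" or "IV." before matching.
    text = re.sub(r"^\s*(\d+(\.\d+)*\.?|[IVXLC]+\.)\s+", "", heading).lower()
    for category, keywords in HEADING_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return None


for heading in ("3. Materials and Methods", "IV. Results and Discussion", "Acknowledgements"):
    print(heading, "->", tag_heading(heading))
```

Because the first matching category wins, a combined heading such as "Results and Discussion" is assigned to a single category here; a real tagger would need an explicit policy for such headings.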
Article
Digital libraries suffer from the problem of information overload due to immense proliferation of research papers in journals and conference papers. This makes it challenging for researchers to access the relevant research papers. Fortunately, research paper recommendation systems offer a solution to this dilemma by filtering all the available information and delivering what is most relevant to the user. Researchers have proposed numerous approaches for research paper recommendation which are based on metadata, content, citation analysis, collaborative filtering, etc. Approaches based on citation analysis, including co-citation and bibliographic coupling, have proven to be significant. Researchers have extended the co-citation approach to include content analysis and citation proximity analysis and this has led to improvement in the accuracy of recommendations. However, in co-citation analysis, similarity between papers is discovered based on the frequency of co-cited papers in different research papers that can belong to different areas. Bibliographic coupling, on the other hand, determines the relevance between two papers based on their common references. Therefore, bibliographic coupling has inherited the benefits of recommending relevant papers; however, traditional bibliographic coupling does not consider the citing patterns of common references in different logical sections of the citing papers. Since the use of citation proximity analysis in co-citation has improved the accuracy of paper recommendation, this paper proposes a paper recommendation approach that extends the traditional bibliographic coupling by exploiting the distribution of citations in logical sections in bibliographically coupled papers. Comprehensive automated evaluation utilizing Jensen Shannon Divergence was conducted to evaluate the proposed approach. The results showed significant improvement over traditional bibliographic coupling and content-based research paper recommendation.
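For readers unfamiliar with the Jensen-Shannon divergence mentioned above, a minimal sketch of the measure is given below. The abstract does not say exactly how it was applied, so comparing two papers' section-wise citation counts is an assumption made for illustration, and the function names and the sample counts are hypothetical.

```python
import math


def normalise(counts):
    """Turn raw counts into a probability distribution (the total must be > 0)."""
    total = sum(counts)
    return [c / total for c in counts]


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (base-2 logs) of two equal-length distributions."""
    def kl(a, b):
        # Kullback-Leibler divergence; terms with a zero probability contribute 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Illustrative citation counts of one reference per logical section
# (Introduction, Methods, Results, Discussion) in two citing papers.
paper_a = normalise([4, 1, 0, 1])
paper_b = normalise([1, 3, 2, 0])
print(jensen_shannon_divergence(paper_a, paper_b))
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (distributions with no overlap), which makes scores comparable across pairs of papers.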
Article
Information retrieval systems for scholarly literature rely heavily not only on text matching but on semantic- and context-based features. Readers nowadays are deeply interested in how important an article is, its purpose and how influential it is in follow-up research work. Numerous techniques to tap the power of machine learning and artificial intelligence have been developed to enhance retrieval of the most influential scientific literature. In this paper, we compare and improve on four existing state-of-the-art techniques designed to identify influential citations. We consider 450 citations from the Association for Computational Linguistics corpus, classified by experts as either important or unimportant, and further extract 64 features based on the methodology of four state-of-the-art techniques. We apply the Extra-Trees classifier to select 29 best features and apply the Random Forest and Support Vector Machine classifiers to all selected techniques. Using the Random Forest classifier, our supervised model improves on the state-of-the-art method by 11.25%, with 89% Precision-Recall area under the curve. Finally, we present our deep-learning model, the Long Short-Term Memory network, that uses all 64 features to distinguish important and unimportant citations with 92.57% accuracy.
Article
Citations are important parameters used in many decisions, such as the ranking of researchers, institutions, and countries, and in measuring the relationship between research papers. All of these require accurate counting of citations and of their occurrences (in-text citation counts) within the citing papers. Citation anchors are the citations made within the full text of the citing paper, for example ‘[1]’, ‘(Afzal et al, 2015)’, and ‘[Afzal, 2015]’. Identifying citation anchors in plain text is very challenging because of the various styles and formats of citations. Recently, Shahid et al. highlighted some of the problems in automatically identifying in-text citation frequencies, such as commonality in content, wrong allotment, mathematical ambiguities, and string variations. The paper proposes an algorithm, CAD, that identifies citation anchors and their in-text citation frequencies based on different rules. For a comprehensive analysis, a dataset of research papers was prepared from two digital libraries: (1) the Journal of Universal Computer Science (J.UCS) and (2) CiteSeer. We conducted two experiments. In the first experiment, the proposed approach is compared with the state-of-the-art technique on both datasets. The J.UCS dataset consists of 1200 research papers with 16,000 citation strings or references, while the CiteSeer dataset consists of 52 research papers with 1850 references. The total dataset thus comprises 1252 citing documents and 17,850 references. The experiments showed that the CAD algorithm improved the F-score by 44% and 37% on the J.UCS and CiteSeer datasets respectively over the contemporary technique (Shahid et al. in Int J Arab Inf Technol 12:481–488, 2014). The average score is 41% on both datasets. In the second experiment, the proposed approach is further analyzed against the existing state-of-the-art tools CERMINE and GROBID. According to our results, the proposed approach performs best with an F1 of 0.99, followed by GROBID (F1 0.89) and CERMINE (F1 0.82).
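To make the notion of citation anchors and in-text citation frequency concrete, here is a minimal regex-based sketch covering only the three anchor styles quoted in the abstract above. It is not the CAD algorithm: the patterns and the `count_anchors` helper are assumptions for illustration, and none of the ambiguities the abstract lists (for instance bracketed numbers in mathematical text) are handled.

```python
import re
from collections import Counter

# Numeric anchors such as [1] or [1, 2].
NUMERIC = re.compile(r"\[(\d+(?:\s*,\s*\d+)*)\]")
# Author-year anchors such as (Afzal et al, 2015) or [Afzal, 2015].
AUTHOR_YEAR = re.compile(r"[\[(]([A-Z][A-Za-z-]+(?: et al\.?)?),? (\d{4})[\])]")


def count_anchors(text):
    """Count how often each cited reference is anchored in the text."""
    counts = Counter()
    for match in NUMERIC.finditer(text):
        # A grouped anchor like [1, 2] counts once for each listed reference.
        for number in match.group(1).split(","):
            counts[number.strip()] += 1
    for match in AUTHOR_YEAR.finditer(text):
        counts[f"{match.group(1)} {match.group(2)}"] += 1
    return counts


sample = "PageRank [1] was extended (Afzal et al, 2015). Later work [1, 2] and [Afzal, 2015] agree."
print(count_anchors(sample))
```

Note that the two author-year anchors in the sample are counted under different keys; deciding whether they refer to the same reference requires matching against the reference list, which this sketch leaves out.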
Chapter
IMRAD refers to the format in which most biomedical journals publish an original research paper. This framework for a scientific paper spells out how a manuscript should be presented. The letter I stands for Introduction, the M for Methods, the R for Results, the A for And and the D for Discussion. The origin of this format is somewhat hazy; however, Louis Pasteur is said to be the first person who published his work in this format. (1) The format was later made more popular by the famous British statistician Sir Austin Bradford Hill, (2) who worked with the Medical Research Council of the UK and was also a statistical consultant for the British Medical Journal.
Conference Paper
This work looks in depth at several studies that have attempted to automate the process of citation importance classification based on the publications’ full text. We analyse a range of features that have been previously used in this task. Our experimental results confirm that the number of in-text references is highly predictive of influence. Contrary to the work of Valenzuela et al. (2015) [1], we find abstract similarity to be one of the most predictive features. Overall, we show that many of the features previously described in the literature are not particularly predictive. Consequently, we discuss challenges and potential improvements in the classification pipeline, provide a critical review of the performance of individual features and address the importance of constructing a large-scale gold-standard reference dataset.