Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 20–24, Avignon, France, April 23–27, 2012. © 2012 Association for Computational Linguistics
SWAN – Scientific Writing AssistaNt
A Tool for Helping Scholars to Write Reader-Friendly Manuscripts
http://cs.joensuu.fi/swan/
Tomi Kinnunen, Henri Leisma, Monika Machunik, Tuomo Kakkonen, Jean-Luc Lebrun
Abstract

Difficulty of reading scholarly papers is significantly reduced by reader-friendly writing principles. Writing reader-friendly text, however, is challenging because it is difficult to recognize problems in one's own writing. To help scholars identify and correct potential writing problems, we introduce the SWAN (Scientific Writing AssistaNt) tool. SWAN is a rule-based system that gives feedback based on quality metrics distilled from years of experience in scientific writing classes involving 960 scientists from various backgrounds: life sciences, engineering sciences and economics. In our first experiences, users have perceived SWAN as helpful in identifying problematic sections in text and in increasing the overall clarity of manuscripts.
1 Introduction
A search on “tools to evaluate the quality of writing” often leads to sites that assess only one quality of writing: its readability. Measuring ease of reading is indeed useful for determining whether your writing meets the reading level of your targeted reader, but for scientific writing, statistical formulae and readability indices such as Flesch-Kincaid lose their usefulness.

In a way, readability is subjective and dependent on how familiar the reader is with the specific vocabulary and the written style. Scientific papers target an audience at ease with a more specialized vocabulary, an audience expecting sentence-lengthening precision in writing. The readability index would require recalibration for such a specific audience. But the need for readability indices is not questioned here. “Science is often hard to read” (Gopen and Swan, 1990), even for scientists.

(T. Kinnunen, H. Leisma, M. Machunik and T. Kakkonen are with the School of Computing, University of Eastern Finland (UEF), Joensuu, Finland; e-mail: tkinnu@cs.joensuu.fi. Jean-Luc Lebrun is an independent trainer of scientific writing and can be contacted at jllebrun@me.com.)
Science is also hard to write, and finding fault with one's own writing is even more challenging, since we understand ourselves perfectly, at least most of the time. To gain objectivity, scientists turn away from silent readability indices and find more direct help in checklists, such as the peer review form proposed by Bates College (http://abacus.bates.edu/~ganderso/biology/resources/peerreview.html), or scoring sheets that assess the quality of a scientific paper. These organise a systematic and critical walk through each part of a paper, from its title to its references, in peer-review style. They integrate readability criteria that far exceed those covered by statistical lexical tools. For example, they examine how the text structure frames the contents under headings and subheadings that are consistent with the title and abstract of the paper. They test whether or not the writer fluidly meets the expectations of the reader. Written by expert reviewers (and readers), they represent those reviewers, their needs and concerns, and act as their proxy. Such manual tools effectively improve writing (Chuck and Young, 2004).

Computer-assisted tools that support manual assessment based on checklists require natural language understanding. Due to the complexity of language, today's natural language processing (NLP) techniques mostly enable computers to deliver shallow language understanding when the
vocabulary is large and highly specialized, as is the case for scientific papers. Nevertheless, they are mature enough to be embedded in tools assisted by human input to increase depth of understanding. SWAN (Scientific Writing AssistaNt) is such a tool (Fig. 1). It is based on metrics tested on 960 scientists working for the research institutes of the Agency for Science, Technology and Research (A*STAR) in Singapore since 1997.

The evaluation metrics used in SWAN are described in detail in a book written by the designer of the tool (Lebrun, 2011). In general, SWAN focuses on the areas of a scientific paper that create the first impression on the reader. Readers, and in particular reviewers, will always read these sections of a paper: the title, abstract, introduction, conclusion, and the headings and subheadings. SWAN does not assess the overall quality of a scientific paper; it assesses its fluidity and cohesion, two of the attributes that contribute to the overall quality of the paper. It also helps identify other types of potential problems, such as lack of text dynamism, overly long sentences and judgmental words.
Figure 1: Main window of SWAN.
2 Related Work
Automatic assessment of student-authored texts is an active area of research. Hundreds of research publications related to this topic have been published since Page's pioneering work on automatic grading of student essays (Page, 1966). Research on using NLP to support the writing of scientific publications has, however, received much less attention in the research community.

Amadeus (Aluisio et al., 2001) is perhaps the system most similar to the work outlined in this system demonstration. However, the Amadeus system focuses mostly on non-native speakers of English who are learning to write scientific publications. SWAN is targeted at a more general audience of users.

Helping Our Own (HOO) is an initiative that could in the future spark new interest in research on using NLP to support scientific writing (Dale and Kilgarriff, 2010). As the name suggests, the shared task (HOO, 2011) focuses on supporting non-native English speakers in writing articles related specifically to NLP and computational linguistics. The focus of this initiative is on what the authors themselves call “domain-and-register-specific error correction”, i.e., correction of grammatical and spelling mistakes.

Some NLP research has been devoted to applying NLP techniques to scientific articles. Paquot and Bestgen (2009), for instance, extracted keywords from research articles.
3 Metrics Used in SWAN
We outline the evaluation metrics used in SWAN; a detailed description is given in (Lebrun, 2011). Rather than duplicating the English grammar and spell-checking included in most modern word processors, SWAN gives feedback on the core elements of any scientific paper: the title, abstract, introduction and conclusions. In addition, SWAN gives feedback on the fluidity of writing and on paper structure.

SWAN includes two types of evaluation metrics: automatic and manual. Automatic metrics are implemented solely as text analysis of the original document using NLP tools; an example is locating judgemental word patterns such as suffers from, or locating sentences in passive voice. The manual metrics, in turn, require the user's input for tasks that are difficult, if not impossible, to automate; an example is highlighting the title keywords that reflect the core contribution of the paper, or highlighting in the abstract the sentences that cover the relevant background.
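To make the distinction concrete, an automatic metric of this kind can be approximated with simple pattern matching. The sketch below is our own minimal illustration, not SWAN's actual rule set; the pattern lists are assumptions.

```python
import re

# Hypothetical pattern lists -- illustrative assumptions, not SWAN's rules.
JUDGMENTAL_PATTERNS = [r"\bsuffers? from\b", r"\bfails? to\b", r"\bpoorly\b"]
# Crude passive-voice cue: a form of "to be" followed by a word ending in -ed.
PASSIVE_PATTERN = re.compile(r"\b(?:is|are|was|were|been|being)\s+\w+ed\b",
                             re.IGNORECASE)

def flag_sentence(sentence):
    """Return the warnings an automatic metric might attach to one sentence."""
    warnings = []
    if any(re.search(p, sentence, re.IGNORECASE) for p in JUDGMENTAL_PATTERNS):
        warnings.append("judgmental phrase")
    if PASSIVE_PATTERN.search(sentence):
        warnings.append("passive voice")
    return warnings
```

A real implementation would rely on part-of-speech tags rather than surface patterns, since the “-ed” cue also matches adjectives and misses irregular participles.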
Many of the evaluation metrics are strongly interconnected with each other, for example:

- Checking that the abstract and title are consistent: frequently used abstract keywords should also be found in the title, and the title should not include keywords absent from the abstract.

- Checking that all title keywords are also found in the paper structure (in headings or subheadings), so that the paper structure is self-explanatory.
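A consistency check of this kind can be sketched with simple keyword counting. The following is a hypothetical illustration; the stopword list and the frequency-based keyword extraction are our assumptions, not SWAN's method:

```python
from collections import Counter

STOPWORDS = {"a", "an", "the", "of", "for", "and", "in", "on", "to",
             "with", "from", "by", "we", "is", "are"}

def keywords(text, top_n=10):
    """Crude keyword extraction: the most frequent non-stopword tokens."""
    tokens = [w.strip(".,:;()").lower() for w in text.split()]
    counts = Counter(w for w in tokens if w and w not in STOPWORDS)
    return {w for w, _ in counts.most_common(top_n)}

def consistency_report(title, abstract):
    """Keywords shared by title and abstract, and those found only in the title."""
    title_kw, abstract_kw = keywords(title), keywords(abstract)
    return {"shared": title_kw & abstract_kw,
            "title_only": title_kw - abstract_kw}
```

A non-empty `title_only` set signals title keywords that never appear in the abstract, the first inconsistency described above; the same comparison can be run against headings to check the second.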
An important part of the paper quality metrics is assessing text fluidity. By fluidity we mean the ease with which the text can be read. This, in turn, depends on how much the reader needs to memorize about what they have read so far in order to understand new information. This memorizing need is greatly reduced if consecutive sentences do not contain rapid changes in topic. The aim of the text fluidity module is to detect possible topic discontinuities within and across paragraphs, and to suggest ways of improving these parts, for example by rearranging the sentences. The suggestions, while already useful, will improve in future versions of the tool with a better understanding of word meanings thanks to WordNet and lexical semantics techniques.

Fluidity evaluation is difficult to fully automate. Manual fluidity evaluation relies on the reader's understanding of the text. It is therefore superior to the automatic evaluation, which relies on a set of heuristics that endeavor to identify text fluidity based on the concepts of topic and stress developed in (Gopen, 2004). These heuristics require an analysis of the sentence, for which the Stanford parser is used. The heuristics are imperfect, but they already allow the identification of sentences that disrupt text fluidity; more fluidity problems would be revealed through manual fluidity evaluation.
Simply put, the topic here refers to the main focus of the sentence (e.g. the subject of the main clause), while the stress is the secondary sentence focus, which often becomes the topic of one of the following sentences. SWAN compares the position of topic and stress across consecutive sentences, as well as their position inside the sentence (i.e. among its subclauses). SWAN assigns each sentence to one of four possible fluidity classes:

1. Fluid: the sentence maintains a connection with the previous sentences.

2. Inverted topic: the sentence is connected to a previous sentence, but the connection only becomes apparent at the very end of the sentence (“The cropping should preserve all critical points. Images of the same size should also be kept by the cropping”).

3. Out-of-sync: the sentence is connected to a previous one, but there are disconnected sentences in between the connected sentences (“The cropping should preserve all critical points. The face features should be normalized. The cropping should also preserve all critical points”).

4. Disconnected: the sentence is not connected to any of the previous sentences, or there are too many sentences in between.
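The classification above can be imitated with a toy heuristic that approximates a sentence's topic by its content words and looks back for lexical overlap. This is only an illustrative sketch under our own assumptions: SWAN derives topic and stress positions from a full parse, which is also what the inverted-topic class requires, so that class is omitted here.

```python
STOPWORDS = {"the", "a", "an", "of", "should", "be", "also", "by", "all"}

def content_words(sentence):
    """Approximate a sentence's topic by its set of non-stopword tokens."""
    return {w.strip(".,").lower() for w in sentence.split()} - STOPWORDS

def classify(sentences, max_gap=2):
    """Label each sentence after the first as fluid, out-of-sync, or disconnected."""
    labels = []
    for i in range(1, len(sentences)):
        words = content_words(sentences[i])
        if words & content_words(sentences[i - 1]):
            labels.append("fluid")  # connected to the immediately preceding sentence
            continue
        label = "disconnected"
        for gap, j in enumerate(range(i - 2, -1, -1), start=1):
            if words & content_words(sentences[j]):
                # connected, but with disconnected sentences in between;
                # too large a gap counts as disconnected
                label = "out-of-sync" if gap <= max_gap else "disconnected"
                break
        labels.append(label)
    return labels
```

On the out-of-sync example above, this heuristic labels “The face features should be normalized.” as disconnected and the third sentence as out-of-sync, matching the intuition behind the classes.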
The tool also alerts the writer when transition words such as in addition, on the other hand, or even the familiar however are used. Even though these expressions are effective when correctly used, they often betray the lack of a logical or semantic connection between consecutive sentences (“The cropping should preserve all critical points. However, the face features should be normalized”). SWAN displays all the sentences that could potentially break fluidity (Fig. 2) and suggests ways of rewriting them.
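Flagging transition words is plain string matching; the sketch below is a hypothetical illustration, and its transition list mixes the phrases quoted above with additions of our own.

```python
# The last two entries are our own additions, not taken from SWAN.
TRANSITIONS = ("in addition", "on the other hand", "however",
               "moreover", "furthermore")

def transition_alerts(sentences):
    """Return the indices of sentences that open with a transition phrase."""
    alerts = []
    for i, sentence in enumerate(sentences):
        head = sentence.lstrip().lower()
        if any(head.startswith(t) for t in TRANSITIONS):
            alerts.append(i)
    return alerts
```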
Figure 2: Fluidity evaluation result in SWAN.
4 The SWAN Tool
4.1 Inputs and outputs
SWAN operates in two evaluation modes: simple and full. In simple evaluation mode, the inputs to the tool are the title, abstract, introduction and conclusions of a manuscript. These sections can be copy-pasted as plain text into the input fields.
In full evaluation mode, which generally provides more feedback, the user provides a full paper as input. This mode includes semi-automatic import of the manuscript from certain standard document formats, such as TeX, MS Office and OpenOffice, as well as semi-automatic structure detection of the manuscript. For Adobe's well-known portable document format (PDF), we use the freely available state-of-the-art PDFBox extractor (http://pdfbox.apache.org/). Unfortunately, the PDF format was originally designed for layout and printing, not for structured text interchange. Most of the time, a simple copy & paste from a source document into the simple evaluation fields is sufficient.
When the text sections have been input to the tool, clicking the Evaluate button triggers the evaluation process. This has been observed to complete in at most a minute or two on a modern laptop; the evaluation metrics themselves are straightforward, and most of the processing time is spent in the NLP tools. After the evaluation is complete, the results are shown to the user.

SWAN provides constructive feedback on the evaluated sections of the paper. The tool also highlights problematic words or sentences in the manuscript text and generates graphs of sentence features (see Fig. 2). The results can be saved and reloaded into the tool, or exported to HTML format for sharing. The feedback includes tips on how to maintain authoritativeness and how to convince the scientist reader. The use of powerful and precise sentences is emphasized, together with the strategic and logical placement of key information.
In addition to these two main evaluation modes, the tool also includes a manual fluidity assessment exercise in which the writer goes through a given text passage, sentence by sentence, to see whether the next sentence can be predicted from the previous ones.
4.2 Implementation and External Libraries
The tool is a desktop application written in Java. It uses external libraries for natural language processing from Stanford, namely the Stanford POS Tagger (Toutanova et al., 2003) and the Stanford Parser (Klein and Manning, 2003). The latter is one of the most accurate and robust parsers available, and it is implemented in Java, as is the rest of our system. Other external libraries include Apache Tika (http://tika.apache.org/), which we use for extracting textual content from files; JFreeChart (http://www.jfree.org/jfreechart/), used for generating graphs; and XStream (http://xstream.codehaus.org/), used for saving and loading inputs and results.
5 Initial User Experiences of SWAN
Since its release in June 2011, the tool has been used in scientific writing classes in doctoral schools in France, Finland, and Singapore, as well as in 16 research institutes of A*STAR (Agency for Science, Technology and Research). Participants in the classes routinely enter into SWAN either parts of, or the whole of, a paper they wish to evaluate immediately. SWAN is designed to work on multiple platforms, and it relies completely on freely available tools. The feedback given by the participants after the course reveals the following benefits of using SWAN:

1. Identification and removal of the inconsistencies that make clear identification of the scientific contribution of the paper difficult.

2. Applicability of the tool across vast domains of research (life sciences, engineering sciences, and even economics).

3. Increased clarity of expression through the identification of text fluidity problems.

4. Enhanced paper structure, leading to a more readable paper overall.

5. A more authoritative, more direct and more active writing style.

Novice writers, and even senior writers, already appreciate SWAN's functionality, although the evidence remains anecdotal. At this early stage, SWAN's capabilities are narrow in scope. We continue to enhance the existing evaluation metrics, and we are eager to include a new, already tested metric that reveals problems in how figures are used.
Acknowledgments

The work of T. Kinnunen and T. Kakkonen was supported by the Academy of Finland. The authors would like to thank Arttu Viljakainen, Teemu Turunen and Zhengzhe Wu for implementing various parts of SWAN.
References
[Aluisio et al. 2001] S.M. Aluisio, I. Barcelos, J. Sampaio, and O.N. Oliveira Jr. 2001. How to learn the many “unwritten rules” of the game of the academic discourse: a hybrid approach based on critiques and cases to support scientific writing. In Proc. IEEE International Conference on Advanced Learning Technologies, Madison, Wisconsin, USA.

[Chuck and Young 2004] Jo-Anne Chuck and Lauren Young. 2004. A cohort-driven assessment task for scientific report writing. Journal of Science, Education and Technology, 13(3):367–376, September.

[Dale and Kilgarriff 2010] R. Dale and A. Kilgarriff. 2010. Text massaging for computational linguistics as a new shared task. In Proc. 6th Int. Natural Language Generation Conference, Dublin, Ireland.

[Gopen and Swan 1990] George D. Gopen and Judith A. Swan. 1990. The science of scientific writing. American Scientist, 78(6):550–558, November–December.

[Gopen 2004] George D. Gopen. 2004. Expectations: Teaching Writing From the Reader's Perspective. Longman.

[HOO 2011] 2011. HOO – Helping Our Own. Webpage, September. http://www.clt.mq.edu.au/research/projects/hoo/.

[Klein and Manning 2003] Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proc. 41st Meeting of the Association for Computational Linguistics, pages 423–430.

[Lebrun 2011] Jean-Luc Lebrun. 2011. Scientific Writing 2.0 – A Reader and Writer's Guide. World Scientific Publishing Co. Pte. Ltd., Singapore.

[Page 1966] E. Page. 1966. The imminence of grading essays by computer. In Phi Delta Kappan, pages 238–243.

[Paquot and Bestgen 2009] M. Paquot and Y. Bestgen. 2009. Distinctive words in academic writing: A comparison of three statistical tests for keyword extraction. In A.H. Jucker, D. Schreier, and M. Hundt, editors, Corpora: Pragmatics and Discourse, pages 247–269. Rodopi, Amsterdam, Netherlands.

[Toutanova et al. 2003] Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proc. HLT-NAACL, pages 252–259.
... Similarly, there are other tools meant for a specialized task or audience. For example, FLOW [10] is an interactive writing assistant for people who learn english as a foreign language, Creative Help [48] and LISA [49] help with story writing, SWAN [34] helps with scientific writing, etc. ...
Preprint
Full-text available
Stuttering is a speech disorder which impacts the personal and professional lives of millions of people worldwide. To save themselves from stigma and discrimination, people who stutter (PWS) may adopt different strategies to conceal their stuttering. One of the common strategies is word substitution where an individual avoids saying a word they might stutter on and use an alternative instead. This process itself can cause stress and add more burden. In this work, we present Fluent, an AI augmented writing tool which assists PWS in writing scripts which they can speak more fluently. Fluent embodies a novel active learning based method of identifying words an individual might struggle pronouncing. Such words are highlighted in the interface. On hovering over any such word, Fluent presents a set of alternative words which have similar meaning but are easier to speak. The user is free to accept or ignore these suggestions. Based on such user interaction (feedback), Fluent continuously evolves its classifier to better suit the personalized needs of each user. We evaluated our tool by measuring its ability to identify difficult words for 10 simulated users. We found that our tool can identify difficult words with a mean accuracy of over 80% in under 20 interactions and it keeps improving with more feedback. Our tool can be beneficial for certain important life situations like giving a talk, presentation, etc. The source code for this tool has been made publicly accessible at github.com/bhavyaghai/Fluent.
... Some examples of WAS for general language writing include: CANDLE (Corpora And NLP for Digital Learning of English), an online English learning environment for learners in Taiwan (Chang & Chang, 2004) and FLOW (First-Language-Oriented Writing Assistant System), an interactive system for assisting EFL (English as a Foreign Language) writers (Chen, Huang, Hsieh, Kao, & Chang, 2012). Among WAS for specific purposes, we can find AMADEUS (AMiable Article DEvelopment for User Support), aimed at assisting non-native English users in scientific writing (Oliveira, Zucolotto, & Aluísio, 2006), SWAN (Scientific Writing AssistaNt), a tool to help scholars to identify and correct potential writing problems in their scientific papers (Kinnunen, Leisma, Machunik, Kakkonen, & Lebrun, 2012), the Scientific_Abstract_Generator, a text generator for producing abstracts in the bio sciences (López- Arroyo & Roberts, 2014a,b), and BEAR (Building English Abstracts by Ricoh), a tool based on rhetorical templates extracted from a parallel corpus and developed to help Japanese software engineers to write abstracts in English (Narita, Kurokawa, & Utsuro, 2002). ...
Article
The present paper describes the construction of a text generator (CDG: Cheese Descriptions Generator), prompted by the need to assist Spanish-speaking professionals in the dairy industry in writing promotional cheese descriptions in English, the current lingua franca in business. This tool aims at bridging the gap between descriptive studies on second language writing and applications in real life contexts. For building this text generator, we compiled a specialized comparable corpus of online cheese descriptions in English and Spanish. The corpus was tagged rhetorically and explored using specific custom-made software to retrieve all the necessary linguistic information. The application offers both professionals in this field and business students the main rhetorical, phraseological and lexical features of this particular text type: a) general guidelines on the structure of these texts, b) an inventory of ready-made phraseological units to be used when describing cheeses, and c) an English-Spanish glossary of specialized and semi-specialized terms in this domain. The CDG generator is a writing assistance system designed for current and future Spanish-speaking professionals in the dairy industries to help them promote their products internationally and export their cheeses to the rest of the world. Key words: cheese descriptions, text generator, specialized corpora, promotional texts, second language writing.
... Writing assistance systems in a foreign language are directed towards semiautomatic text generators, e.g., Aluisio and Oliveira (1996), Oliveira et al. (2001), Aluísio et al. (2011) and Chang and Chang (2004). One of the best-known programs to assist users in writing with a specific goal is SWAN (Kinnunen et al. 2012). This tool helps scholars to identify and correct potential writing problems in their scientific papers, and is intended to aid writers with the content of a document, not only with grammar or spelling. ...
Article
Full-text available
In today’s globalised world where there is a growing need for international communication, non-native speakers (NNS) from a wide range of professional fields are increasingly called upon to write specialised texts in English. More often than not, however, the linguistic competence required to do so is well beyond that of the majority of NNS. While software applications can serve to assist NNS in their English writing tasks, most of the applications available are designed for users of English for general purposes as opposed to English for professional purposes. Therefore, these applications lack the specific vocabulary, style guidelines and common structures required in more specialised documents. Necessary modifications to meet the needs of English for professional purposes tend to be viewed as representing an overly complex and expensive task. To overcome these challenges, we present a software called O-WEAA (Ontology-Writing English Assistant Architecture) which makes use of an ontology that represents the knowledge which, according to our formalisation, is required to write most types of specialised professional documents in the English language. Our formalisation of the required knowledge is based on an exhaustive linguistic analysis of several written genres. The proposed software is composed of two parts: i) a web application named Acquisition Interface Module, which allows experts to populate the ontology with new data and ii) a user-friendly, general web interface named Writing Assistant Interface Module which guides the user throughout the writing process of the English document in the specific domain described in the ontology.
... Writing assistance systems in a foreign language are directed towards semiautomatic text generators, e.g., [9] [10] [11] [12]. One of the best-known programs to assist users in writing with a specific goal is SWAN [13]. This tool helps scholars to identify and correct potential writing problems in their scientific papers, and is intended to aid writers with the content of a document, not only with grammar or spelling. ...
Conference Paper
In today's globalised world where there is a growing need for international communication, non-native speakers (NNS) from a wide range of professional fields are increasingly called upon to write specialised texts in English. More often than not, however, the linguistic competence required to do so is well beyond that of the majority of NNS. While software applications can serve to assist NNS in their English writing tasks, most of the applications available are designed for users of English for general purposes as opposed to English for professional purposes. Therefore, these applications lack the specific vocabulary, style guidelines and common structures required in more specialised documents. Necessary modifications to meet the needs of English for professional purposes tend to be viewed as representing an overly complex and expensive task. To overcome these challenges, we present a software called O-WEAA (Ontology-Writing English Assistant Architecture) which makes use of an ontology that represents the knowledge which, according to our formalisation, is required to write most types of specialised professional documents in the English language. Our formalisation of the required knowledge is based on an exhaustive linguistic analysis of several written genres. The proposed software is composed of two parts: i) a web application named Acquisition Interface Module, which allows experts to populate the ontology with new data and ii) a user-friendly, general web interface named Writing Assistant Interface Module which guides the user throughout the writing process of the English document in the specific domain described in the ontology.
... Writing assistance systems in a foreign language are directed towards semiautomatic text generators, e.g., [9] [10] [11] [12]. One of the best-known programs to assist users in writing with a specific goal is SWAN [13]. This tool helps scholars to identify and correct potential writing problems in their scientific papers, and is intended to aid writers with the content of a document, not only with grammar or spelling. ...
Presentation
In today’s globalized world where there is a growing need for international communication, non-native speakers (NNS) from a wide range of professional fields are increasingly called upon to write specialised texts in English. More often than not, however, the linguistic competence required to do so is well beyond that of the majority of NNS. While software applications can serve to assist NNS in their English writing tasks, most available applications are designed to serve users of English for general purposes as opposed to English for professional purposes. Therefore, these applications lack the specific vocabulary, style guidelines and common structures required in more specialised documents. Necessary modifications to meet the needs of English for professional purposes tend to be viewed as representing an overly complex and expensive task. To overcome these challenges, we present a software called O-WEAA (Ontology-Writing English Assistant Architecture) which makes use of an ontology that represents the knowledge which, according to our formalisation, is required to write most types of specialised professional documents in the English language. Our formalisation of the required knowledge is based on an exhaustive linguistic analysis of several written genres. The proposed architecture is composed of two parts: i) a web application named Acquisition Interface Module, that allows experts to populate the ontology with new data; ii) a user-friendly, general web interface named Writing Assistant Interface Module which guides the user throughout the writing process of the English document in the specific domain described in the ontology.
... To date, tools for writing texts have often been designed for general subject areas and included information on orthographic, grammatical and/or lexical aspects of the writing process. NLP researchers have tended not to study systems for structuring and writing specialized texts, although a few researchers have bucked this trend: Kinnunen et al. (2012) developed a system to identify and correct writing problems in English in several domains; Aluisio et al. ...
Chapter
Error-free scientific research articles are more likely to be accepted for publication than those permeated with errors. This chapter identifies, describes, and explains how to avoid 22 common language errors. Scientists need to master the genre of scientific writing to conform to the generic expectations of the community of practice. Based on a systematic analysis of the pedagogic literature, five categories of errors were identified in scientific research articles namely accuracy, brevity, clarity, objectivity, and formality. To gain a more in-depth understanding of the errors, a corpus investigation of scientific articles was conducted. A corpus of 200 draft research articles submitted for internal review at a research institute with university status was compiled, annotated, and analyzed. This investigation showed empirically the types of errors within these categories that may impinge on publication success. In total, 22 specific types of language errors were identified. These errors are explained, and ways for overcoming each of them are described.
Article
Full-text available
In this paper we propose a modeling of contextual information around a terminological unit, for the needs of a scientific writing aid tool in the biomedical field. We focus more specifically on the modeling of significant phraseic contexts that we formalize as semantically characterized argument patterns. This modeling is based on a large corpus of biomedical scientific articles and relay on semantic types specified in a domain ontology, Unified Medical Language System. Key words: terminology, term, scientific writing, scientific writing aid tool, context, modeling, ontology, corpus
Conference Paper
Non-native researchers encountered a problem of lacking academic writing skill. During the writing, we may accidentally use a word repeatedly due to our familiarity that reduces a quality of writing. To solve the problem, a paraphrase is a good option. It helps the manuscript read more flowery by reducing duplicate words and refining sentence alignments. In this study, we propose a novel idea to use a sentence dependency ontology to suggest possible verbs replaceable on existing context without influence to the original intention. We created an ontology-based system and designed a new ontology. To discover a list of verb choices, our idea is based on sentence dependency, especially a dependency between subject and verb (nsubj) as well as a relationship between verb and object (dobj). We chose them because these two dependencies had a strong relationship to the verb of the sentence. To evaluate the system, we compared the efficiencies of two different systems, i.e., a tradition system utilizing synonyms as word choices and our ontology-based system. As the results, ours provided the better performance rather than the traditional system. This clarifies that our system should be a proper solution for studies on paraphrasing.
Thesis
This thesis presents various projects undertaken with the goal of designing writing-assistance tools for texts in specialized languages. It discusses the importance of linguistic modeling, the constraints involved in automating certain tasks, the role of the expert, and the need to take end users into account. A central role is given to the lexicon and its contextualization, a topic which, I believe, is a matter of consensus among researchers working on specialized-language writing. I present four software tools designed to give the author of a specialized text greater autonomy in writing. Some of these tools are aimed directly at writers (Compagnon LiSe and SARS, Système d’Aide à la Rédaction Scientifique). Others provide indirect support, either for teaching and learning specialized academic vocabulary, or for building lexicons for writing in controlled languages (Station Sensunique). These tools are a modest attempt to answer a genuine societal need, namely the need to produce high-quality technical texts by writers who are not always writing professionals. The specific profile of these occasional writers, whether of technical or scientific texts, imposes real constraints on the design of writing-assistance tools, arising from apparently contradictory requirements: on the one hand, the need for tools that are simple yet precise, and on the other, the complexity of the information to be conveyed.
Article
Most studies that make use of keyword analysis rely on log-likelihood ratio or chi-square tests to extract words that are particularly characteristic of a corpus (e.g. Scott and Tribble 2006). These measures are computed on the basis of absolute frequencies and cannot account for the fact that “corpora are inherently variable internally” (Gries 2006: 110). To overcome this limitation, measures of dispersion are sometimes used in combination with keyness values (e.g. Rayson 2003; Oakes and Farrow 2007). Some scholars have also suggested using other statistical measures (e.g. Wilcoxon-Mann-Whitney test) but these techniques have not gained corpus linguists' favour (yet?). One possible explanation for this lack of enthusiasm is that statistical tests for keyword extraction have rarely been compared. In this article, we make use of the log-likelihood ratio, the t-test and the Wilcoxon-Mann-Whitney test in turn to compare the academic and the fiction sub-corpora of the British National Corpus and extract words that are typical of academic discourse. We compare the three lists of academic keywords on a number of criteria (e.g. number of keywords extracted by each measure, percentage of keywords that are shared in the three lists, frequency and distribution of academic keywords in the two corpora) and explore the specificities of the three statistical measures. We also assess the advantages and disadvantages of these measures for the extraction of general academic words.
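The log-likelihood ratio (G²) keyness measure discussed above is straightforward to compute from word frequencies in two corpora. The sketch below follows the standard Dunning formulation on a 2×2 contingency of observed vs. expected frequencies; the function name and example counts are illustrative.

```python
import math

def log_likelihood(freq_a, total_a, freq_b, total_b):
    """Log-likelihood ratio (G2) for the keyness of one word in
    corpus A (freq_a out of total_a tokens) versus corpus B."""
    # Expected frequencies under the null hypothesis that the word
    # has the same relative frequency in both corpora.
    combined = (freq_a + freq_b) / (total_a + total_b)
    expected_a = total_a * combined
    expected_b = total_b * combined
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # 0 * log(0) is taken as 0
            g2 += observed * math.log(observed / expected)
    return 2 * g2
```

A word occurring 100 times in a 10,000-token corpus versus 20 times in an equally sized corpus yields G² well above the 3.84 threshold (p < 0.05, 1 d.f.), so it would be extracted as a keyword; equal relative frequencies yield G² = 0.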
Conference Paper
In this paper, we propose a new shared task called HOO: Helping Our Own. The aim is to use tools and techniques developed in computational linguistics to help people writing about computational linguistics. We describe a text-to-text generation scenario that poses challenging research questions, and delivers practical outcomes that are useful in the first case to our own community and potentially much more widely. Two specific factors make us optimistic that this task will generate useful outcomes: one is the availability of the ACL Anthology, a large corpus of the target text type; the other is that CL researchers who are non-native speakers of English will be motivated to use prototype systems, providing informed and precise feedback in large quantity. We lay out our plans in detail and invite comment and critique with the aim of improving the nature of the planned exercise.
Article
We demonstrate that an unlexicalized PCFG can parse much more accurately than previously shown, by making use of simple, linguistically motivated state splits, which break down false independence assumptions latent in a vanilla treebank grammar.
Article
We present a new part-of-speech tagger that demonstrates the following ideas: (i) explicit use of both preceding and following tag contexts via a dependency network representation, (ii) broad use of lexical features, including jointly conditioning on multiple consecutive words, (iii) effective use of priors in conditional loglinear models, and (iv) fine-grained modeling of unknown word features. Using these ideas together, the resulting tagger gives a 97.24% accuracy on the Penn Treebank WSJ, an error reduction of 4.4% on the best previous single automatically learned tagging result.
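The relationship between the two figures reported above (97.24% accuracy, 4.4% relative error reduction) can be checked with a little arithmetic. This is a sketch: the "previous best" accuracy is back-calculated from the abstract's numbers, not taken from the paper.

```python
def relative_error_reduction(new_accuracy, old_accuracy):
    """Fraction of the old error rate eliminated by the new system."""
    old_error = 1.0 - old_accuracy
    new_error = 1.0 - new_accuracy
    return (old_error - new_error) / old_error

# Back-calculate the previous best accuracy implied by 97.24% accuracy
# and a 4.4% relative error reduction.
new_error = 1.0 - 0.9724
old_error = new_error / (1.0 - 0.044)
print(f"implied previous accuracy: {1.0 - old_error:.4f}")  # 0.9711
```

Note that a 4.4% *relative* error reduction corresponds to only about a 0.13 percentage-point gain in absolute accuracy, which is why tagging results are usually compared in terms of error rather than accuracy.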
Article
A formative assessment task was developed to improve the scientific report writing skills of university students. Students undertaking this task typically possessed varying levels of scientific literacy and were drawn from a cohort of mixed abilities. The assessment task involved the construction of a scientific report that included feedback from instructor to students before final submission of the assessment piece. After initial submission of a scientific report, the instructor developed a cohort-specific marking scheme based on the deficiencies that were evident within the class group. Using a mixture of peer and self-review against specific criteria, the students were required to resubmit an amended report. This resulted in elevated marks compared with those that would have been obtained after first submission, thus rewarding the student for the application of feedback. This technique proved to be efficient for both parties and also resulted in improvement of skills of the entire student population, regardless of the ability of the student prior to the assessment task. Using this methodology, students of varying aptitudes were able to measure their own skill improvement against tangible criteria, and enjoy a degree of learning success independent of the ranking within their group.
Book
The book helps scientists write papers for scientific journals. Using the key parts of typical scientific papers (Title, Abstract, Introduction, Visuals, Structure, and Conclusions), it shows through numerous examples, how to achieve the essential qualities required in scientific writing, namely being clear, concise, convincing, fluid, interesting, and organized. To enable the writer to assess whether these parts are well written from a reader's perspective, the book also offers practical metrics in the form of six checklists, and even an original Java application to assist in the evaluation. The focus of the book is on self- and reader-assisted assessment of the scientific journal article. It is also the first time that a book on scientific writing takes a human factor view of the reading task and the reader scientist. By revealing and addressing the physiological causes that create substantial reading difficulties, namely limited reader memory, attention span, and patience, the book guarantees that writing will gain the much coveted reader-centered quality.
Conference Paper
We present the computational and composition-theoretical bases for the design of a collaborative writing tool, based on the critiquing approach, to assist non-native novice researchers in understanding and producing the structure of scientific papers. This critiquing tool is embedded in a suite named AMADEUS that caters for various needs of non-native English users producing a first draft of a paper, relying on the reuse of contextualized linguistic material as input for the user. Our emphasis is on the architecture and methodology for building the linguistic resources for the critiquing tool. Though originally targeted at non-native authors, the critiquing tool may also be useful for novice native English writers and as a teaching resource for English for Academic Purposes practitioners.