Natural-Language Neutrality in Programming
Languages: Bridging the Knowledge Divide in Software
Ivan Ruby1, Salomão David2
1Osmania University, India
2Universitá Della Svizzera Italiana (USI), Switzerland
Abstract. This paper introduces an approach to allow English Language Learners
(ELLs) to collaborate in the Software Engineering field using their individual
native languages. Natural-Language Neutrality (NLN) aims to bridge the
Knowledge Divide in Software Engineering by providing tools and methodolo-
gies to allow speakers of different Natural Languages to learn, practice and col-
laborate in an environment that is Natural-Language agnostic.
A Knowledge Divide in Software Engineering is constituted by the differences
in the knowledge assimilation capability, between native English-speakers and
ELLs, due to the English-language barrier.
NLN intends to provide standardized methods to enable already-existing and
new Programming Languages to be accessible to learners in their Natural-
language context. The tools created to achieve this purpose, Glotter, Glotation
and the Collaborative Model, are described.
Keywords: Human Computer Interaction, Computer Science Education, Learn-
ing & Collaboration Technologies, Programming Languages
A Programming Language (PL) is a formal constructed language used to create a pro-
gram, a list of instructions, to perform a task.
Although a PL specifies a notation (Aaby, 1996) to write programs, these are often
written with a combination of mathematical and everyday language characters, words
According to World Language Statistics (SIL International, 2015), English is the 3rd
most spoken language in the world, with 5.43% of speakers, behind Chinese (a group of
related language varieties spoken in China) and Spanish, with 14.4% and 6.15%,
respectively. Nonetheless, a survey of the Syntax, Semantics, Standard Library and
Runtime System of the most used PLs (TIOBE Software BV, 2015) indicates that the
most popular are all English-based.
Although Non-English-based PLs exist (Miller, Vandome, & McBrewster, 2012),
currently the most used have syntax, learning resources, Runtime, and Development
Environments that are developed with an English-speaking audience in mind.
Hypothetically, in a world of more than 7 billion people, to make use of the
speed and computational capacity of machines to solve problems, approximately 94%
of the people would have to express their instructions to the computer in
English, even though they do not speak it as a native language.
Software Engineering is a fast changing and evolving field. Thus, it is a challenge to
translate and distribute the learning material in languages other than English, keeping
pace with the technology development. This fact often categorizes a non-native
English-speaking student of Software Engineering as an English Language Learner (ELL),
since the learning process makes use of material and tools that are in English,
regardless of whether the medium of instruction is English or not.
Three factors together create a Knowledge barrier, or Knowledge Divide, for ELLs in
Software Engineering: the discrepancy between English not being the most spoken
Natural Language yet being the one most widely used in the most popular PLs, the
inability of ELLs to use their native languages, and the constraint of being taught
in one language while practicing the concepts (programming) in a different one.
To keep pace with innovations and generate ideas, people need to be able to produce
and manage knowledge. However, the increase in the 21st century of access to infor-
mation has resulted in an uneven overall ability to assimilate it.
Knowledge Divide is a term that denotes the differences between those who have
access to knowledge and can assimilate it, participating in knowledge-sharing and using
it as a tool for development, and others who are impaired in this process (Bindé &
Matsuura, 2005).
A Knowledge Divide in Software Engineering is constituted by the differences in
Software Engineering-related knowledge assimilation capabilities between native Eng-
lish-speakers and ELLs, due to the English-language barrier.
ELLs need to develop English language and literacy skills in the context of the sub-
jects being taught to keep up with English-speaking students (Lee, 2005). However, the
linguistic knowledge that students already possess is often not taken into consideration
By allowing students to employ their existing language skills, the Knowledge Divide
can be decreased.
Thus, this paper proposes a methodology to bridge the aforementioned Knowledge Divide.
2 Data Collection
During April 2015, a survey was conducted among 78 students of the University
College of Engineering, Osmania University. The students were from different
streams of Engineering but had common introductory Programming courses in C and
The sample was split into two groups, of 34 and 44 students to have a representative
sample, and the questions presented to the students intended to study the following
Perceived importance of comments in source code
Perceived importance and difficulty in understanding source code written in a native
Students found a program without comments easier to understand, but when presented
with a choice they favored the version with comments. When asked about the importance
they attribute to comments, the majority (54%) of the students were neutral. This
inconsistency might suggest that comments are under-used, although they are of
considerable importance in reading and understanding source code.
Fig. 1. Perceived Difficulty in understanding a program’s source code
Regarding the difficulty and importance of using native languages in source code,
a similar scenario was observed.
58% of the students found understanding a program written in their mother tongue
difficult. Although a small portion would prefer reading a program written in their
native language, when asked about its importance 45% were neutral.
[Fig. 1 categories: Very difficult, Difficult, Normal, Easy, Very easy. Series: Without Comments, With Comments.]
Fig. 2. Perceived importance of a Native-language in understanding source code
It is this considerably undecided portion of the sample that led to the following ques-
tions being raised regarding students’ perceptions:
Are students aware of the resources available to them?
Are the resources being presented and contextualized to suit the students’ learning
What determines the outcome of the learning process in Software Development: stu-
dents’ usage of their existing resources or their ability to adapt to the already estab-
lished required resources?
With these questions in mind, we decided to construct a learning and practice model
that would highlight the importance of using students’ existing resources.
4 Multilingual vs. NLN Programming Languages
On the one hand, Multilingual PLs, also called International PLs, allow the usage of more
than one Natural Language for writing programs. Such are the cases of ALGOL 68
(Wijngaarden, Mailloux, Peck, & Koster, 1969) and BabylScript (Iu, 2011).
ALGOL 68 is the 1968 version of the Algorithmic Language. It is an imperative PL,
which succeeds ALGOL 60, and provides translations of its Standard in Russian, Ger-
man, French, Bulgarian, Japanese and Chinese. The translations allow the internation-
alization of the PL.
BabylScript is an open-source, multilingual scripting language that compiles to Ja-
vaScript. It is implemented using the Java PL, by modifying the open-source Mozilla
objects, and function names are translated into non-English languages. With this
feature, it allows programmers to write programs in languages other than English.
BabylScript also allows a mixed language model, in which the same source code can
contain code written in more than one language.
At the time of writing, BabylScript has 17 language translations including Chinese,
Hindi, Swahili, Spanish and Russian.
Although Multilingual PLs reduce the initial language barrier, being Natural-Language
isolated threatens their development and adoption. A larger audience can be engaged,
but ultimately only speakers of the same language can collaborate.
So far, the approaches used to create Multilingual PLs have not been standardized,
and no single approach for enabling the feature in different PLs, existing as well as
newly created, has been identified.
On the other hand, NLN is an approach that intends to provide tools and methodol-
ogies to allow speakers of different Natural-Languages to learn, practice and collabo-
rate in an environment that is Natural-Language-agnostic.
By allowing learners with different Native languages to interact in a unified plat-
form, the Single Natural Language (English) knowledge requirement can be reduced.
The English-language is still required in Software Engineering. However, re-estab-
lishing a balance between its usage as a Lingua franca and native languages is desirable,
recognizing the existing linguistic diversity (Bindé & Matsuura, 2005).
NLN can be integrated into an already existing or newly created PL, taking ad-
vantage of the most used English-based ones.
Fig. 3. Natural-Language Neutrality Model
At the core of the NLN approach is a Natural-Language Translation mechanism. The
required translation is only of the PL’s keywords, not of the complete source code.
Hence, we came up with a Translation mechanism that can be further exploited.
The proposed tools address source code keywords, comments and collaboration between
programmers.
Although each tool is designed to iterate over the elements of Bloom’s Taxonomy
Cognitive Dimension (Bloom, Engelhart, Furst, Hill, & Krathwohl, 1956), they mainly
intend to stimulate the Affective Dimension elements in students, through the inclusion
of their existing linguistic knowledge in the problem-solving process.
5.1 Glotter: A compiler-level Natural-Language Neutrality Enabler
A Glotter is a Lexical Analysis tool that converts the source code Lexical Units (tokens)
from a Source to a Target Natural-Language.
A Source Natural Language can be any existing Natural Language while the Target
Natural-Language is a predefined Bridge Natural Language, a Lingua Franca, which
will enable all other Source Natural Languages to be translated to and from it.
The name is derived from the Greek root glot, which means Language, and the English
word enabler. Therefore, a Glotter is a Language Enabler.
The Glotter receives a list of tokens, a list of Language Dictionaries and a selected
Natural-Language, which serves as the context for the translation.
Its integration to a compiler enables the possibility of different (translated) versions
of the same keywords being compiled into a single version. This process ultimately
serves the purpose of enabling a single PL to be used with various Natural Languages,
while maintaining all the syntactic and semantic structure and rules.
Upon processing, if the keyword is present in the selected Language Dictionary, its
value is replaced by the matching value. Otherwise, it is left intact. Although it is
possible to report an error when a keyword has no version in the selected Language,
this feature might exceed the responsibilities of a Lexical Analyzer. Furthermore,
this error can be reported by the Syntax Analyzer.
Embedded. By integrating the Glotter into the compiler, tokens can be translated one at
a time. This approach is more flexible and does not add a performance impact on the
normal working of the compiler. An Embedded Glotter requires modification of the
compiler source code for an existing PL, which is a disadvantage when seamless
integration is expected.
Fig. 4. Embedded Glotter Implementation
Standalone. In this alternative method, the Glotter is separated from the Compiler. The
complete source code (input) is parsed by the Glotter, in a process that involves a Lex-
ical Analysis (Tokenization) of the given code. Therefore, the source code is Tokenized
twice. This process requires no modification of an existing compiler's source code, a
fact that constitutes an advantage to enabling NLN in already existing PLs.
Fig. 5. Standalone Glotter Implementation
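The Standalone variant can be sketched as a pre-processing pass that runs before an unmodified compiler. The sketch below is only an illustration under stated assumptions: the Portuguese keyword table, the regular-expression tokenizer and the function name standalone_glotter are hypothetical and do not come from the paper, and English is assumed to be the Bridge Natural Language.

```python
import re
from typing import Dict

# Hypothetical dictionary: Portuguese keyword spellings -> bridge (English) keywords.
PT_TO_BRIDGE: Dict[str, str] = {
    "se": "if",
    "senao": "else",
    "enquanto": "while",
    "retorna": "return",
}

# Minimal tokenizer: a double-quoted string literal, a run of word characters,
# or any other single character (whitespace and punctuation included).
TOKEN_PATTERN = re.compile(r'"(?:\\.|[^"\\])*"|\w+|.', re.DOTALL)


def standalone_glotter(source: str, dictionary: Dict[str, str]) -> str:
    """Return source with dictionary keywords replaced by their bridge form."""
    pieces = []
    for match in TOKEN_PATTERN.finditer(source):
        text = match.group(0)
        # In this sketch, string literals keep their quotes and therefore never
        # equal a dictionary key, so they pass through unchanged.
        pieces.append(dictionary.get(text, text))
    return "".join(pieces)


program = 'se (x > 0) { retorna "se"; } senao { retorna 0; }'
print(standalone_glotter(program, PT_TO_BRIDGE))
```

The translated text can then be handed to the existing compiler, which tokenizes it a second time. Note that this simplified pass cannot tell a keyword from an identifier with the same spelling; a fuller version would need the token types that the Lexical Analyzer provides.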
Input: A List of Lexical Units L, a List of Language
Dictionaries LD, a selected Language S
Output: A List of Lexical Units L
if S = null, then
    return L
end if
for each token in L, do
    translatedToken := null
    if token(type) = “Keyword”, then
        // Translate is the lookup procedure listed next (name assumed)
        translatedToken := Translate(LD, S, token)
        if translatedToken != null, then
            token(value) := translatedToken
        end if
    end if
end for
return L

Input: A List of Language Dictionaries LD, a selected
Language S, a Lexical Unit token
Output: A String representing the translated token, or null
for each selectedDictionaryToken in LD(S), do
    if selectedDictionaryToken(token) != null, then
        return selectedDictionaryToken(token)
    end if
end for
return null
It is assumed that the List of Lexical Units consists of objects with at least type
and value properties, and that null is returned when no entry or entry value is found
in a dictionary.
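The procedure described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the Token class, the sample dictionaries and the function names glotter and translate are assumptions introduced here, with English assumed as the Bridge Natural Language and None standing in for null.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Token:
    type: str   # e.g. "Keyword", "Identifier", "Number"
    value: str


# Hypothetical dictionaries: language code -> {source keyword: bridge keyword}.
LANGUAGE_DICTIONARIES: Dict[str, Dict[str, str]] = {
    "pt": {"se": "if", "senao": "else", "enquanto": "while"},
    "fr": {"si": "if", "sinon": "else", "tantque": "while"},
}


def translate(dictionaries: Dict[str, Dict[str, str]],
              language: str, token: Token) -> Optional[str]:
    """Look the token up in the selected dictionary; None if there is no entry."""
    return dictionaries.get(language, {}).get(token.value)


def glotter(tokens: List[Token],
            dictionaries: Dict[str, Dict[str, str]],
            language: Optional[str]) -> List[Token]:
    """Translate keyword tokens in place and return the same list."""
    if language is None:
        return tokens
    for token in tokens:
        if token.type == "Keyword":
            translated = translate(dictionaries, language, token)
            if translated is not None:
                token.value = translated  # unknown keywords are left intact
    return tokens


tokens = [Token("Keyword", "se"), Token("Identifier", "se"), Token("Keyword", "xyz")]
print([t.value for t in glotter(tokens, LANGUAGE_DICTIONARIES, "pt")])
```

Only tokens already typed as keywords are translated, so an identifier that happens to share a keyword's spelling is untouched, and a keyword with no dictionary entry is passed on for the Syntax Analyzer to reject.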
5.2 Glotation: Natural-Language annotated comments
A Glotation is a special kind of comment that includes a source Natural Language at-
tribute and the comment message.
The name is derived from the Greek root glot, which means Language, and the
English word Annotation, metadata attached to text (in this case, attached to source
code).
The source language attribute can later be used to translate the comment message to
a different Natural Language.
Syntax. @xx message
@ is a symbol that denotes a Glotation, xx is a two-letter lowercase ISO 639-1
language code, and message is the comment message or text.
@en This is a Glotation
@pt Esta é uma Glotação
@fr Ceci est un Glotation
The example above creates Glotations in English, Portuguese and French with the
equivalents of “This is a Glotation”. Each time a user accesses the source code, an
option to translate the Glotations (Glotate) can trigger the translations, provided
the user specifies the Environment (target) Language. Therefore, although the comments
can be written in different languages, a user can choose to visualize all comments in
The @ symbol is desirable since its usage is not common among the most used PLs.
Therefore, it is possible to avoid confusion between a general comment and a Glotation.
A Glotation translation can be achieved using a Third Party translation service,
which might require an internet connection.
To implement Glotations, the rules of the Syntax Analyzer (Parser) should be modified.
The rules should detect a Glotation by the symbol @ and build an Abstract Syntax Tree
(AST) node with the following properties:
─ type: “Glotation”
─ language: two-letter language code (content immediately following the @ symbol)
─ value: the message text (separated from the language code by a whitespace)
Therefore, the rules for a well-formed Glotation can be deduced as:
1. Starts with the @ symbol
2. Has no space between the symbol and the following text
3. The text immediately following the @ symbol consists of a two-character string
4. Immediately following the two character string, there is a whitespace
5. After the whitespace follows the comment message with alphanumeric and special
characters, including whitespace
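The five rules above can be checked with a single regular expression. The following is a sketch under assumptions: the function name parse_glotation and the dictionary-shaped node are hypothetical, the comment delimiters of the host PL are assumed to have been stripped already, and the pattern validates only the shape of the language code, not whether it is a registered ISO 639-1 code.

```python
import re
from typing import Dict, Optional

# "@", two lowercase letters, one whitespace character, then a non-empty message.
GLOTATION_PATTERN = re.compile(r"@([a-z]{2})\s(.+)", re.DOTALL)


def parse_glotation(comment: str) -> Optional[Dict[str, str]]:
    """Return a Glotation node for a well-formed comment body, else None."""
    match = GLOTATION_PATTERN.fullmatch(comment)
    if match is None:
        return None
    return {
        "type": "Glotation",
        "language": match.group(1),
        "value": match.group(2),
    }


print(parse_glotation("@pt Esta é uma Glotação"))
print(parse_glotation("@ en not a Glotation"))  # space after @ breaks rule 2
```

A comment that fails the pattern is returned as None here, so a caller could keep treating it as a general comment.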
A message should only be translated if the Glotation language is different from the
Language currently being used in the Development Environment by the user. There-
fore, there should be a mechanism to obtain the Development Environment language.
Note: an Abstract Syntax Tree is a tree representation of the abstract syntactic structure of source code written in a programming language.
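Input: List of Comment nodes LC, Environment Language L
Output: List of Comments LC
for each comment in LC, do
    if comment(type) = “Glotation”, then
        if comment(lang) != L, then
            // arguments after the message (source, then target language) are assumed
            comment(value) := Translate(comment(value), comment(lang), L)
        end if
    end if
end for
return LC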
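The Glotate step can be sketched in Python with the translation service passed in as a function. Everything named here is an assumption for illustration: glotate, table_translate and the PHRASES table are hypothetical, the lookup table merely stands in for a third-party translation service, and updating the language attribute after translation is a design choice of this sketch rather than something the paper specifies.

```python
from typing import Callable, Dict, List, Tuple

Comment = Dict[str, str]
# (text, source_language, target_language) -> translated text
Translator = Callable[[str, str, str], str]


def glotate(comments: List[Comment], environment_language: str,
            translate: Translator) -> List[Comment]:
    """Translate every Glotation whose language differs from the environment's."""
    for comment in comments:
        if (comment.get("type") == "Glotation"
                and comment["language"] != environment_language):
            comment["value"] = translate(
                comment["value"], comment["language"], environment_language)
            # Assumption: record that the message is now in the target language.
            comment["language"] = environment_language
    return comments


# Stand-in for a real translation service: a fixed phrase table.
PHRASES: Dict[Tuple[str, str, str], str] = {
    ("pt", "en", "Esta é uma Glotação"): "This is a Glotation",
}


def table_translate(text: str, source: str, target: str) -> str:
    return PHRASES[(source, target, text)]  # raises KeyError for unknown phrases


comments = [
    {"type": "Glotation", "language": "pt", "value": "Esta é uma Glotação"},
    {"type": "Glotation", "language": "en", "value": "Already in English"},
    {"type": "Comment", "value": "a general comment"},
]
for comment in glotate(comments, "en", table_translate):
    print(comment)
```

Passing the translator as a parameter keeps the loop independent of any particular service, which matters because such a service may require an internet connection.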
5.3 Natural-Language Neutrality Collaborative Model for Programming
Using the Glotter and Glotations, a collaborative model can be implemented to allow
dissimilar Natural-Languages to be used in a programming environment. Such a model
should employ a mechanism that allows a user to write a program with keywords and
comments in their own Natural-Language, provided that the same program can be
understood by a user with a different Natural-Language.
Translation of keywords and comments can be achieved by the Glotter and Glota-
tions, respectively, but the key factor lies in the data format being used when storing
and exchanging the program among the users.
Fig. 6. NLN Collaborative Environment Workflow
Upon creation, the source code to be exchanged should ideally contain only Glotations,
rather than only general comments or a mix of both. Such a process can be automated in
the source code editor by automatically replacing general comments with Glotations,
provided that the user has already permitted the functionality and chosen the
environment Natural-Language. Similarly, the source code should always be stored with
the keywords in the target Natural-Language.
Such a source code file, with Glotations and keywords in the target Natural-Language,
will serve as the intermediate file format, the essence of the collaborative model.
When a different user receives this same source code, the process of contextualiza-
tion can be performed by applying the Glotter and Glotation functionalities, joint or
Therefore, the inverse process can take place by the second author editing the source
code file, storing it in the intermediate file format and sending it back to the first author.
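The exchange described above can be illustrated with a small round trip between two users. This is a deliberately simplified sketch: the keyword tables and the names to_intermediate and from_intermediate are hypothetical, English is assumed as the target (bridge) Natural-Language, Glotations are assumed to occupy whole lines starting with @, and string literals are not handled.

```python
import re
from typing import Dict

# Hypothetical keyword tables: local spelling -> bridge (English) spelling.
KEYWORDS: Dict[str, Dict[str, str]] = {
    "pt": {"se": "if", "senao": "else"},
    "fr": {"si": "if", "sinon": "else"},
}

WORD = re.compile(r"\w+")


def _swap(source: str, table: Dict[str, str]) -> str:
    """Replace words found in table, leaving Glotation lines untouched."""
    lines = []
    for line in source.splitlines():
        if line.lstrip().startswith("@"):
            lines.append(line)  # Glotations are handled by the Glotate step
        else:
            lines.append(WORD.sub(lambda m: table.get(m.group(0), m.group(0)), line))
    return "\n".join(lines)


def to_intermediate(source: str, language: str) -> str:
    """Store: local keywords -> bridge keywords."""
    return _swap(source, KEYWORDS[language])


def from_intermediate(source: str, language: str) -> str:
    """Contextualize: bridge keywords -> the reader's local keywords."""
    inverse = {bridge: local for local, bridge in KEYWORDS[language].items()}
    return _swap(source, inverse)


pt_source = "@pt verifica o sinal\nse (x > 0) { y = 1; } senao { y = 2; }"
stored = to_intermediate(pt_source, "pt")   # what is saved and exchanged
fr_view = from_intermediate(stored, "fr")   # what the second author sees
print(stored)
print(fr_view)
```

When the second author saves their edits, applying to_intermediate with their own language yields the same intermediate form again, which is what allows the file to travel in both directions.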
Language plays a critical role in a student's effective education. This process also de-
pends on the teaching institutions taking into consideration the sociocultural aspects of
the learners, such as their identity and experiences (Janzen, 2008; Lee, 2005).
Making the current trends and developments in the Software Engineering field available
should be accompanied by processes, tools, and resources that enable or, at least,
ease the assimilation of this knowledge by the underprivileged. This increase
in literacy would benefit not only the disadvantaged but the society as a whole since
more people would be brought to an acceptable level of literacy and employability,
becoming active contributors in combating poverty.
Although Multilingual PLs exist, a standardized and methodological approach is re-
quired to explore the context of Bridging the Knowledge Divide in Software Engineer-
ing thoroughly. Such can be accomplished through the proposed NLN approach.
Further research should be undertaken to understand its underlying factors, provide
quantitative as well as qualitative indicators of its effectiveness and to incorporate new
tools and methodologies to support it.
1. Aaby, A. (1996). Introduction to programming languages. Walla Walla College, Computer
Science Department. Retrieved from
2. Bindé, J., & Matsuura, K. (2005). Towards Knowledge Societies. UNESCO world report
3. Bloom, B. S., Engelhart, M. D., Furst, E. J., Hill, W. H., & Krathwohl, D. R. (1956). Taxonomy
of Educational Objectives: Handbook 1 Cognitive Domain. London: Longmans, Green and
the ACM international conference companion on Object oriented programming systems
languages and applications companion (pp. 197–198).
5. Janzen, J. (2008). Teaching English Language Learners in the Content Areas. Review of
Educational Research, 78(4), 1010–1038. http://doi.org/10.3102/0034654308325580
6. Lee, O. (2005). Science Education With English Language Learners: Synthesis and
Research Agenda. Review of Educational Research, 75(4), 491–530.
7. Miller, F. P., Vandome, A. F., & McBrewster, J. (2012). Non-English Based Programming
Languages. Alphascript Publishing.
8. SIL International. (2015). Summary by language size |Ethnologue Languages of the World.
Retrieved May 13, 2015, from https://www.ethnologue.com/statistics/size
9. TIOBE Software BV. (2015). TIOBE Index for June 2015. Retrieved April 23, 2015, from
10. Wijngaarden, A. van, Mailloux, B. J., Peck, J. E. L., & Koster, C. H. A. (1969). Report on
the Algorithmic Language ALGOL 68. Mathematisch Centrum.
11. Wikipedia. (2015). Non-English-based programming languages. Retrieved May 8, 2015,