Through the Limits of Newspeak: an Analysis of
the Vector Representation of Words in
George Orwell’s 1984
I. Dunđer* and M. Pavlovski*
*Department of Information and Communication Sciences, Faculty of Humanities and Social Sciences, University of
Zagreb, Ivana Lučića 3, 10000 Zagreb, Croatia
ivandunder@gmail.com, mpavlovs@ffzg.hr
Abstract - The era of fake news, media manipulation and information wars has been beneficial to the lasting fame and continuous acclaim of George Orwell’s 1984. The novel, published in 1949, has influenced the terminology of various political and social analysts to the present day through the use of its fictional language called “Newspeak”. The
question arises – can the inner connections of the concepts
present in Orwell’s 1984, when analysing the text on a
semantic level of the words in their contextual environment,
be used to further the understanding of the inner-workings
of the novel’s language itself? More specifically, the aim of
this paper is to examine whether a reader without
knowledge of the subject of a fictional work of art,
exemplified by Orwell’s 1984, could gain deeper
comprehension of the text just from analysing word vector
representations, without the use of external resources and
only through an overview of the established similarities on
the semantic level of words in a given text. In fact, word
vector representations, as a form of word embeddings in a
vector space model, are a machine learning technique
sometimes applied in natural language processing, which
attempts to identify semantically similar words.
Keywords - natural language processing (NLP); word
vector representation; word2vec; word embeddings; vector
space model; George Orwell’s 1984; dystopia; Newspeak;
machine learning; information and communication sciences
I. INTRODUCTION AND MOTIVATION
George Orwell’s novel 1984 has had a wide influence
on the terminology of various political and social analysts
to the present day. Its fictional language “Newspeak”, as
well as the Orwellian definition of concepts such as war
or power are all relatable to the present era of fake news,
media manipulation and information wars.
When analysing the text of Orwell’s 1984 on a
semantic level of the words in their contextual
environment, can the inner connections of its concepts be
used to further the comprehension of the inner-workings
of the language itself?
Could a reader, without any knowledge of the subject
of a novel, play or other fictional works of art, gain a
deeper understanding of the text just from a survey of
words that were identified as semantically most similar by
vector representations of words?
These vector representations on the semantic level,
more specifically, an overview of the established
similarities on the semantic level of words in a given text,
could aid students to investigate the text without any
topic-related knowledge or the use of additional resources.
The application of such a method to a complex literary text could serve as a proof of concept for all other possible applications to simpler texts.
This approach, when applied with natural language processing techniques, could therefore enhance methods of text examination in literary studies, higher education and academic curricula.
II. MOTIVATION
The aim of this paper is to study, on the example of
Orwell’s 1984, if a reader could comprehend the meaning
of the text just from a brief survey of its “semantic
network”, the vector representation of words, without the
use of any other resources, and only through an overview
of the established similarities between words in the text.
Word vector representations, as a form of word
embeddings in a vector space model, are a machine
learning technique sometimes applied in natural language
processing, which attempts to discover semantically
comparable words.
This paper is a proof of concept: if one can successfully examine the “semantic network” of a fictional work of art, in this case a complex novel with various important concepts and a whole fictional language used by the characters and invented specifically for it, then one could presumably apply this approach to less complicated, non-fictional texts for various analytical purposes.
III. RELATED WORK
The terminology of Orwell’s 1984 has attracted a lot of attention and has been examined by a wide range of scientists in higher education, not only from the field of
literary studies but also from politics, media, information
and communication sciences, linguistics and sociology,
just to name a few.
In recent years, publications concerning the use of Orwellian terminology in politics and in the media have
MIPRO 2019/CE
691
been on the rise due to the so-called era of fake news. Concerning the relation between language, ideology and power in Orwell’s 1984, [1] asserts that “ideology gains shape and
manifests itself only through a semiotic system that is
called “language”, and thus, whoever controls the
language, controls the ideology, and consequently the
power structures of a given society. By devising the
concept of Newspeak in 1984, Orwell manifests his
profound awareness of the relationship between language
and the power relations in society”. Reference [2] uses the
example of the novel’s famous quote “The object of
power is power”, to reveal “tautological symbolic lack [...]
as an unnerving point of convergence between the
totalitarian state envisioned by Orwell and the modern
social democratic structures”, a claim which the paper
exemplifies by examining British Prime Minister Theresa
May’s tautology “Brexit means Brexit”. “Inspired by
Orwell’s chilling account of brainwashing, propaganda
and the obliteration of the lines between fiction and
truth...”, [3] discusses the topic of fake news, alternative
facts and “[...] the challenges that the advent of fake news
organisations and the dissemination of alternative facts as
truth pose to deliberative civics education, specifically
how the trend of a diminishing space of truth and fact
undermines efforts to teach students how to engage
effectively and productively in democratic deliberations”.
Within the field of future studies, [4] tries “to depict how
the state of inverted totalitarianism is emerging in post-
postnormal times, and to illustrate how it shares many of
the same features of the totalitarianism depicted in the
novels Brave New World (A. Huxley) and 1984 (G.
Orwell)”. In the field of semiotics, [5] states that “the interesting aspect encountered in Orwell’s 1984 is the
vicious unended cycle and the war that will never end
between the stated groups within the framework of the
ideology/axiology perspective”. When it comes to word
embeddings applied to literary arts, [6] extracted a social
network from a literary text and “the experiments suggest
that specific types of word embeddings like word2vec are
well-suited for the task at hand and the specific
circumstances of literary fiction text”. One paper [7]
demonstrates that building word embeddings on annotated
literary texts of 19th century fiction “can provide us with
an insight into how characters group differently under
different conditions, allowing us to make comparisons
across different novels and authors. These results suggest
that word embeddings can potentially provide a useful
tool in supporting a quantitative literary analysis”. Research conducted by [8] on applying word embeddings
and semantic lexicons to literary texts “demonstrates the
importance of examining implicit assumptions around
default strategies, when using embeddings with literary
texts, and highlights the potential of quantitative analysis
to inform critical analysis”. Furthermore, concerning the LAPPS Grid project, [9] states that the platform, “providing access to a vast array of language processing tools and resources for the purposes of research and development in natural language processing (NLP), has recently expanded to enhance its usability by non-technical users, such as those in the Digital Humanities community (DH)”, but “it is only recently that
Computational Linguistics (CL) methods and tools have
begun to be made more accessible to non-technical users,
and are beginning to be widely adopted by the DH
community; however, there remains considerable work to
be done to fully adapt CL tools and methods in order to be
used by DH scholars”. One paper [10] tried to
demonstrate, through the use of a computational analysis
of frequencies of dystopian terminology in the text of
George Orwell’s 1984, that it is possible to measure how
the Orwellian concept is created, constructed and
structured in the novel. Such a computational approach,
when made with word embeddings, can help to create a
“semantic network” of words from Orwell’s novel.
Another study [11] showed how conducting a
computational concordance analysis of Orwell’s fictional
literary work could be applied for the purpose of analysing
terms related to the core Orwellian concept. To analyse
specific textual corpora, statistical approaches have been
proposed [12, 13], with special emphasis on concordances
[14, 15], terminology extraction [16, 17] and language
modelling [18].
IV. RESEARCH
This section is divided into two subsections – Data Set
Acquisition and Preprocessing, and Word Embeddings
and Vector Representations. The first subsection deals
with the process of acquiring the experimental data set,
and the following preprocessing phase. The second
subsection discusses the applied research approach and
selected natural language processing method.
The authors of this paper decided to employ an
objective, precise and cost-effective mathematical way to
analyse the chosen data set by constructing a vector space
model in order to identify semantically similar words in
the data set. This paper concentrates on exploring the possibilities of utilising word embeddings in the form of word vector representations for the purpose of analysing a very specific data set: a dystopian novel written by George Orwell. This paper analyses not only the quantitative aspects
of the resulting word vectors, but also evaluates the results
on a qualitative level. The foundations of this research
originate in the studies of artificial intelligence and,
mainly, natural language processing.
Both studies and their corresponding techniques are applicable in higher education and for the purposes of academic curricula, especially in selected courses in
information and communication sciences, and computer
science, that deal, among others, with various aspects of
natural language understanding and generation,
computational language analyses, knowledge and
information extraction, intelligent machine behaviour etc.
A. Data Set Acquisition and Preprocessing
The experimental data set contained the whole content
of George Orwell’s well-known novel 1984, which came
out in 1949. No other data was added to the data set.
The already preprocessed version of the data set was
initially used for the purposes of [10, 11]. The
preprocessing step encompassed various tasks: conversion of the file format from HTML to plain TXT with UTF-8 character encoding, removal of style and text formatting etc. Superfluous and undesired characters were
deleted from the data set using regexes. The data set
content was then split into sentences or segments (text
lines not ending with a sentence delimiter), tokenised
afterwards using custom tokenisation rules, and manually
inspected for any potential errors, e.g. orthographic
mistakes, various oversights, missing or faulty characters,
such as apostrophes, quotation marks, dashes, hyphens,
brackets and other punctuation marks and typographic
symbols.
Most of the preprocessing was performed using Python, Perl, sed and awk. Once the data set was ready, it
was used as the input for generating distributed word
vector representations.
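As an illustration, the segmentation and tokenisation step could be sketched as follows; the regular expressions below are simplified stand-ins for the authors’ custom tokenisation rules, which are not published in this paper:

```python
import re

def preprocess(raw_text):
    """Clean raw text and split it into tokenised, lowercased segments.
    The cleanup and splitting rules here are purely illustrative."""
    # Strip leftover formatting characters (illustrative pattern)
    text = re.sub(r"[^\w\s'.!?,;:-]", " ", raw_text)
    # Naive segment split on end-of-sentence punctuation
    segments = re.split(r"(?<=[.!?])\s+", text)
    # Lowercase and tokenise each non-empty segment
    return [re.findall(r"[a-z']+", seg.lower()) for seg in segments if seg.strip()]

sample = "War is peace. Freedom is slavery! Ignorance is strength."
print(preprocess(sample))
```

The resulting list of token lists is exactly the input shape that Gensim’s word2vec implementation expects.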
B. Word Embeddings and Vector Representations
The vector space model [19] represents an algebraic
model for the simplified representation of text or its
constituent parts in the form of vectors.
Word embeddings are a language modelling and
feature learning technique. They allow transforming, i.e.
mapping of distinct words from a text (so-called
vocabulary) into numbers – generally, real numbers [20-
22]. This is needed since machine learning algorithms
usually rely on numeric values. Namely, such algorithms
are limited by the type of qualified or desired input, and
usually cannot work with plain text straight away. Using
input that is represented by a corresponding vector of
continuous values in a predefined vector space is only one
of the possible approaches to this problem in natural
language processing. But this way the complexity of the text can be reduced to a purely mathematical problem. Not only does this method reduce the dimensionality of the problem [22], it also allows one to inspect the contextual similarity
of vectors [20] – namely, the word context serves as the
principal feature in the word representation model. In very
low-dimensional vector spaces, the values in a vector can
be interpreted right away by humans.
The classical bag-of-words (BOW) [23] approach
typically generates enormous and sparse vectors for a
textual input, due to the one-hot word representation
strategy. Here the vectors’ dimensions are determined by
the vocabulary size of the textual input, as each dimension
corresponds to a separate word. If a word appears in a
textual input, the matching value in the resulting vector
will be a non-zero value. Besides computational
complexity as a result of the large number of dimensions,
with BOW there is always a risk of model overfitting, a
common problem in machine learning – when a model’s
prediction performance is very good in a small number of
specific situations but terrible in most other cases due to
the large amounts of noise that sparse vectors hold. Also,
here the vectors are generally unaware of the underlying
similarities between words [23].
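A minimal sketch of the count-based BOW representation described above, with a toy vocabulary and document (both hypothetical, not taken from the novel):

```python
def bow_vector(tokens, vocabulary):
    """Count-based bag-of-words vector: one dimension per vocabulary word."""
    return [tokens.count(word) for word in vocabulary]

# Toy vocabulary; a real corpus would yield thousands of dimensions
vocab = sorted({"big", "brother", "is", "watching", "you", "war", "peace"})
doc = ["big", "brother", "is", "watching", "you"]

vec = bow_vector(doc, vocab)
print(vocab)
print(vec)  # words absent from the document contribute zero-valued dimensions
```

Note that the vector length equals the vocabulary size, which is what makes BOW vectors sparse and high-dimensional for realistic corpora.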
With word embeddings, machines can learn how to create word vector representations, i.e. how to map each word to one specific vector with values that can be inferred from the ways in which individual words are used. They
can preserve the contextual surroundings of a word and,
hence, its contextual similarity [20, 21, 23]. Namely,
words that appear nearby in a text will also be in close
proximity in a vector space [23]. Put differently, words
that come from the same or similar context and are
utilised in a comparable way can be associated with each
other, since they share a similar vector representation,
which encapsulates and explicitly encodes their semantic
or syntactic similarity, many linguistic regularities,
analogies and patterns, or the mutual interrelations of
words [24]. Likewise, words that do not share any
similarities tend to not share the same context.
That words with similar context share similar meaning
is, in fact, not a novel finding. According to the well-
known distributional semantics hypothesis in linguistics,
“words that are used and occur in the same contexts tend
to purport similar meanings” [25]. According to [26], “a
word is characterised by the company it keeps”. In other
words, the ways of using a particular word define its
distinct meaning.
Word vectors are low-dimensional dense vectors of
fixed length which makes them computationally very
efficient due to the low space and time complexity [23].
The fixed length arises from the fixed and limited
vocabulary size of a textual input. Each word is represented by a relatively small vector with, usually, hundreds of
dimensions. This is much lower than thousands and,
possibly, millions (or more) dimensions that are needed
for sparse word representations like in the BOW model. A
huge benefit of the dense representations is generalisation
power [27].
Since word vectors are just a numerical representation
of contextual similarities between words, they can be
treated and mathematically processed just like regular
mathematical vectors. For instance, one could measure the
vector similarity by calculating the cosine angle between
two non-zero vectors, which is called cosine similarity
[23]. The cosine of 0° equals 1, and it is less than 1 for any angle in the interval (0, π] radians. So, no vector similarity (similarity of 0) is present at a 90° angle. Total similarity (similarity of 1) is expressed at a 0° angle; those are vectors with the same orientation and complete overlap. Two vectors that are diametrically opposed have a similarity of -1.
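The cosine similarity computation can be written directly from its definition; a stdlib-only sketch using the three cases just described:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1, 0], [1, 0]))   # same orientation -> 1.0
print(cosine_similarity([1, 0], [0, 1]))   # orthogonal (90°) -> 0.0
print(cosine_similarity([1, 0], [-1, 0]))  # diametrically opposed -> -1.0
```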
In natural language processing, one popular algorithm
for generating high-quality word embeddings is word2vec
[24]. Word2vec is, in fact, a neural network that is capable
of recognising similarities within text and automatically
producing a set of vectors [19, 20]. But it is not a deep neural network; it can only build numerical representations that can then be interpreted by deep neural networks [20].
The word2vec algorithm uses statistics and linear
algebra (calculating dot products in co-occurrence
matrices, matrix transposition etc.) to learn word vector
representations from textual input [19, 20, 22, 24]. It is
able to capture the syntactic and semantic coherences in a
language. Each relationship is characterised by a relation-
specific vector offset, which allows vector-oriented
reasoning based on the offsets between words [28]. If
sufficient data and context are provided, word2vec can
predict the meaning and associations of words quite
accurately.
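The offset-based reasoning from [28] can be illustrated with hand-picked toy vectors; real word2vec offsets are learned, and the classic king/queen analogy used here is a textbook example, not data from the novel:

```python
# Toy 2-dimensional vectors chosen by hand so that the "royalty" offset
# (king - man) equals (queen - woman); learned vectors only approximate this.
vectors = {
    "man":   [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king":  [3.0, 0.0],
    "queen": [3.0, 1.0],
}

def add_vec(u, v):
    return [a + b for a, b in zip(u, v)]

def sub_vec(u, v):
    return [a - b for a, b in zip(u, v)]

# Vector-oriented reasoning: king - man + woman should land near queen
result = add_vec(sub_vec(vectors["king"], vectors["man"]), vectors["woman"])
print(result)  # [3.0, 1.0], i.e. exactly the "queen" vector in this toy setup
```

With real embeddings the result would only be close to the target, so the nearest vector by cosine similarity is taken as the answer.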
There are two different models within word2vec used for learning word embeddings: the Continuous Bag-of-Words model (CBOW) and the Continuous Skip-Gram Model
(CSGM) [19]. CBOW modelling means learning word
embeddings for predicting an individual word by
observing the word’s adjacent context, whereas CSGM works the other way around: for a provided individual word, the corresponding neighbouring words (context) are predicted [24]. Both models base their findings on the
local context window, which, as a parameter in the model,
can be adjusted to the user’s specific needs.
The context window size strongly impacts the vector similarities. Large context windows generate vectors that reflect topical similarities, whereas smaller context windows tend to produce vectors that exhibit more functional and syntactic similarities [27].
In this research, all of the experiments were done on a
multicore machine with a 64-bit operating system, an Intel
Core i7 processor (4 cores, 8 threads) and 16 GB of RAM.
All available threads were used to train the model in order
to decrease training time. The authors constructed a 100-
dimensional vector space model, which means that for
every word present in the model there were 100 features
available. The authors used Python 3, Gensim [29] and
word2vec for that purpose. Gensim (short for “generate
similar”) is a robust open-source vector space modelling
and topic modelling toolkit implemented in Python. It was
specifically designed to handle large text collections using
data streaming and efficient incremental algorithms.
The basis for constructing the model was the
preprocessed data set containing George Orwell’s novel
1984. The data set was also lowercased in order to avoid
multiple versions of the same word. As for the training
algorithm, CBOW was chosen. Window context size, i.e.
the maximum distance between the current and predicted
word within a sentence, was set to 5, meaning that at the
same time 5 words were taken into consideration. Words
with a total frequency less than 3 were ignored (minimal
count). Generating word vector representations proved to be very CPU-intensive.
Then, for every single keyword, chosen freely by the authors, the 20 semantically most similar words (identified through word vector representations) from the same data
set were extracted. Similarity was obtained by computing
the cosine similarity between a simple mean of the
projection weight vectors of the given words and the
vectors for each word in the model [29]. Afterwards, the
authors carried out a detailed manual qualitative analysis
on 6 keywords and, in total, 120 words that were unveiled
as their most similar corresponding words – 20 words for
every chosen keyword. Among the 120 words, most of the function words were ignored during the evaluation phase, as they usually convey very little information and provide limited semantics (meaning), such as articles, adpositions, auxiliary verbs, conjunctions, interjections, junctions, particles, expletives, pronouns, pro-sentences etc. The remaining content words were then thoroughly
analysed.
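The similarity ranking step can be illustrated without Gensim by ranking hand-made toy vectors with cosine similarity; in the actual experiments Gensim’s built-in similarity routine was used, and the vectors below are purely illustrative, not trained values:

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hand-made 3-dimensional toy vectors standing in for trained word2vec output
toy_vectors = {
    "newspeak": [0.9, 0.1, 0.0],
    "power":    [0.8, 0.2, 0.1],
    "table":    [0.0, 0.9, 0.4],
    "julia":    [0.1, 0.2, 0.9],
}

def most_similar(keyword, vectors, topn=20):
    """Rank all other words by cosine similarity to the keyword's vector."""
    query = vectors[keyword]
    scores = [(w, cosine(query, v)) for w, v in vectors.items() if w != keyword]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar("newspeak", toy_vectors))
```

In this toy setup, “power” ranks first for “newspeak”, mirroring the kind of ranked output the qualitative analysis was based on.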
V. RESULTS AND DISCUSSION
Here the authors present and discuss the results of the
experiment, as well as its implications. Additionally, the
authors point out the various downsides and limitations of
the experiment in one of the following subsections.
A. Results of the Experiment
At first, the authors chose to obtain word vector
representations for 40 keywords in total – 24 were related
to the Newspeak dictionary and the Orwellian concept
[30], whereas 16 belonged to the general domain.
Keywords related to the Newspeak dictionary and the
Orwellian concept were: “newspeak”, “brother” (refers to
“big brother”), “thoughtcrime”, “101” (refers to “room
101”), “winston”, “party” (refers to “inner party”, “outer
party”, “party members”), “telescreen”, “julia”, “oceania”,
“eurasia”, “ingsoc”, “goldstein”, “eastasia”,
“doublethink”, “brotherhood”, “vaporized”, “oldspeak”,
“crimestop”, “crimethink”, “minipax”, “minitrue”,
“duckspeak”, “prole” and “speakwrite”.
Keywords from the general domain were: “war”,
“moment”, “people”, “world”, “voice”, “word”, “free”,
“freedom”, “time”, “hate”, “love”, “youth”, “best”,
“worst”, “good” and “bad”.
From this base of 40 keywords, the authors decided to perform a manual qualitative analysis on a sample of 6 word vector representations (15%). Those 6 keywords were
chosen arbitrarily by the authors, and 5 of them were words related to the Newspeak dictionary and the Orwellian concept: “newspeak” – is the fictional language in Orwell’s novel 1984; it is “politically correct” speech taken to its maximum extent; “doublethink” – refers to reality control; “winston” – refers to Winston Smith, the main protagonist that the reader most identifies with in Orwell’s novel 1984; “telescreen” – refers to a fictional surveillance and communication device, which is operated by the ruling Party in a totalitarian system, in order to keep its citizens under permanent observation; “julia” – refers to the fictional character Julia, who pretends to be supporting Big Brother and the ruling Party, but, in fact, despises the system; and “war”.
Fig. 1 shows the 100-dimensional vector
representation of the word “newspeak”. Since the numbers
in the vector are cosine values they can range from -1 to 1.
Figure 1. Vector representation of the word “newspeak”.
The results of this research have shown some
interesting correlations between the selected words.
The word “newspeak” can semantically be associated
mostly with the word “power”. Newspeak, the fictional
language used by the characters of the novel, is the official
language of Oceania. It is based on standard English, with
words describing “unorthodox” political ideas removed. It
is the principal means of the totalitarian ruling Party to
retain power, through the control of language. In this
context, the correlation between “newspeak” and “power”
is very precise. Fig. 2 shows the top 20 words that are
most similar to the word “newspeak”. All words are
ranked according to semantic similarity (vector
similarity).
Figure 2. Words most similar to the word “newspeak” (according to
vector similarity).
The word “doublethink” is semantically connected
mainly with “proles”. Doublethink is the power to hold
two completely contradictory beliefs in one’s mind
simultaneously, and to accept both of them. “Proles” are
proletarians, ca. 85% of Oceania’s population. Although
not as closely and rigidly observed as members of the
Party, proles, as it is stated in the novel, were taught to be
inferior through the principles of doublethink.
The word “winston” predominantly corresponds to the word “julia”. Winston Smith and Julia are the two
protagonists of the novel and tragic lovers. Their
relationship is precisely shown on the semantic level.
The word “telescreen” is primarily related to the adjective “white”, but, amongst others, also to the words “table”, “floor”, “street”, “front”, “bed”, “down”, “middle”, “out” and “body”, giving a sense that the telescreen, a two-way television and tool of control by the Party, is always unavoidably present.
The word “julia” is mostly linked with the word “it”,
but second by similarity is the word “winston”, which
confirms their relationship also on the semantic level, as
with the previous example of the word “winston”.
The word “war” is most closely affiliated with the
word “newspeak”. The Orwellian concept of an ongoing
war correlates well with that of the Newspeak language,
as the two concepts are the two means of control by the
Party, and this is also confirmed on the semantic level.
B. Downsides of the Experiment
The disadvantages and shortcomings of the experiment
are highlighted in this subsection. As the authors chose to
use all of the unoccupied CPU power (8 threads) in order
to increase the training speed, Python’s seed()
functionality was, unfortunately, not available due to the
limitations of Gensim and controlling hash randomisation
[29], and the problems with thread scheduling in common
operating systems. Seed is essential for the initialisation of
the random number generator and, hence, for the
initialisation of vectors for each word with a hash of the
concatenation of a given word and the seed value [29].
Using seed() would allow one to exactly and fully deterministically reproduce all of the experiments with the vector space model, which is not the case with this research.
Nevertheless, the authors undertook many training
runs, and the differences in the resulting vectors were not
significant. Only slight modifications were observed, mostly changes in word ranking within vectors, but semantic similarity, seen in general, was not altered considerably.
VI. FUTURE RESEARCH
The authors plan to significantly increase the
dimensionality of the word vectors and analyse the impact
on the word vector representations, in the form of a cost-
benefit analysis (consumption of CPU power, model
training time and performance, memory requirements
etc.). This should result in additional generalisation power.
Furthermore, the plan is to extend the data set with additional dystopian texts (novels) in order to increase the data set size. It would also be interesting to inspect which exact words contribute positively, and which negatively, to the overall word vector similarities. The authors also plan
to increase the context size (context window), and the
minimal number of word occurrences that should be
considered during the process of generating word vectors.
Furthermore, experiments with the Continuous Skip-
Gram Model (CSGM) and evaluating the performance
differences between the hierarchical softmax and negative
sampling approaches should be carried out [24]. The
possibilities of the related doc2vec algorithm should also
be investigated [23]. Algorithms for vector visualisation,
such as the t-Distributed Stochastic Neighbor Embedding
(t-SNE) [31], should be tested as well. Alternative models,
such as GloVe (Global Vectors for Word Representation)
and Facebook’s fastText should also be applied on this
data set. GloVe trains on global word-word co-occurrence
counts and makes efficient use of statistics [32], whereas
the main improvement of fastText over the original
word2vec implementation of vectors is the inclusion of
character n-grams [21], which allows computing word
representations for words that did not appear in the
training data (OOV, “out-of-vocabulary” words).
The authors would also like to examine the seed
functionality with regard to the cost of training speed and
make the generated models freely available online in order
to enable experiment reproducibility. Source code could
possibly also be rewritten to allow experimenting without
lowercasing, with lemmatisation, and implementing stop
words and lexicons of function words for filtering
purposes.
VII. CONCLUSION
Word vector representations are just one of the many possible approaches to computationally representing words. In order to analyse George Orwell’s novel 1984 on
a semantic level, the authors chose to build a 100-
dimensional vector representation model from the novel’s
content, which preserved the relations between the words
and its contextual similarities. This approach was chosen
due to the fact that words with similar meanings often have similar word embeddings in the form of word vectors. This method has shown potential for application in literary studies, academic curricula and other fields of higher education. For instance, students could
investigate a literary work and gain valuable knowledge
about it only through the observation of semantically most
similar words identified through word vectors, and
without using any other external resources. Students could
then, without having any prior knowledge of the literary
work, study the internal connections, the work’s narrative
(time and space), language and style, the context and
meaning of individual words or (fictional) characters
within a literary work and, therefore, try to summarise its
key concepts on a semantic level.
REFERENCES
[1] M. Shadi, “The Principles of Newspeak or How Language Defines
Reality in Orwell’s 1984”, Journal of International Social
Research, vol. 11, no. 59, pp. 180–186, 2018.
[2] P. Anson, “‘The object of power is power’: tautology, paranoia,
and George Orwell’s Nineteen Eighty-Four”, Textual Practice
Journal, DOI: 10.1080/0950236X.2018.1508066, pp. 1–20, 2018.
[3] G. Mordechai, “Lying in Politics: Fake News, Alternative Facts,
and the Challenges for Deliberative Civics Education”,
Educational Theory Journal, vol. 68, no. 1, pp. 49–64, 2018.
[4] D. R. Morgan, “Inverted totalitarianism in (post) postnormal
accelerated dystopia: the arrival of Brave New World and 1984 in
the twenty-first century”, Foresight Journal, Emerald Publishing
Limited, vol. 20, no. 3, pp. 221–236, 2018.
[5] M. Kalelioğlu, “Creating Society in Orwell’s 1984”, Chinese
Semiotic Studies Journal, vol. 14, no. 4, pp. 481–503, 2018.
[6] G. Wohlgenannt, E. Chernyak and D. Ilvovsky, “Extracting social
networks from literary text with word embedding tools”,
Proceedings of the Workshop on Language Technology Resources
and Tools for Digital Humanities (LT4DH), pp. 18–25, 2016.
[7] S. Grayson, M. Mulvany, K. Wade, G. Meaney and D. Greene,
“Novel2Vec: Characterising 19th Century Fiction via Word
Embeddings”, Proceedings of the 24th Irish Conference on
Artificial Intelligence and Cognitive Science (AICS’16), 2016.
[8] S. Leavy, K. Wade, G. Meaney and D. Greene, “Navigating
Literary Text with Word Embeddings and Semantic Lexicons”,
Proceedings of the Workshop on Computational Methods in the
Humanities (COMHUM 2018). Lausanne, 2018.
[9] N. Ide, K. Suderman and J. Pustejovsky, “The Language
Application Grid as a Platform for Digital Humanities Research”,
Proceedings of the Workshop on Corpora in the Digital
Humanities (CDH@ TLT). Bloomington, pp. 71–76. 2017.
[10] M. Pavlovski and I. Dunđer, “Is Big Brother Watching You? A
Computational Analysis of Frequencies of Dystopian Terminology
in George Orwell’s 1984”, Proceedings of the 41st International
Convention on Information and Communication Technology,
Electronics and Microelectronics (MIPRO 2018). Croatian Society
for Information and Communication Technology, Electronics and
Microelectronics - MIPRO (Rijeka). Opatija, pp. 714–719, 2018.
[11] I. Dunđer and M. Pavlovski, “Computational Concordance
Analysis of Fictional Literary Work”, Proceedings of the 41st
International Convention on Information and Communication
Technology, Electronics and Microelectronics (MIPRO 2018).
Croatian Society for Information and Communication Technology,
Electronics and Microelectronics - MIPRO (Rijeka). Opatija, pp.
720–724, 2018.
[12] I. Dunđer, M. Horvat and S. Lugović, “Exploratory Study of
Words and Emotions in Tweets of UK Start-up Founders”,
Proceedings of the Second International Scientific Conference
“Communication Management Forum” (CMF2017). The Edward
Bernays College of Communication Management. Zagreb, pp.
201–224, 2017.
[13] I. Dunđer, M. Horvat and S. Lugović, “Word Occurrences and
Emotions in Social Media: Case Study on a Twitter Corpus”,
Proceedings of the 39th International Convention on Information
and Communication Technology, Electronics and
Microelectronics (MIPRO 2016). Croatian Society for Information
and Communication Technology, Electronics and
Microelectronics - MIPRO (Rijeka). Opatija, pp. 1557–1560,
2016.
[14] R. Jaworski, I. Dunđer and S. Seljan, “Usability Analysis of the
Concordia Tool Applying Novel Concordance Searching”, Lecture
Notes in Computer Science (LNCS). Springer, p. 11, 2016, in
press.
[15] S. Seljan, I. Dunđer and A. Gašpar, “From Digitisation Process to
Terminological Digital Resources”, Proceedings of the 36th
International Convention on Information and Communication
Technology, Electronics and Microelectronics (MIPRO 2013).
Croatian Society for Information and Communication Technology,
Electronics and Microelectronics - MIPRO (Rijeka). Opatija, pp.
1329–1334, 2013.
[16] S. Seljan, H. Stančić and I. Dunđer, “Extracting Terminology by
Language Independent Methods”, Proceedings of the 2nd
International TRANSLATA Conference (2014). Translation
Studies and Translation Practice: Forum Translationswissenschaft,
Peter Lang GmbH, pp. 141–147, 2017.
[17] I. Dunđer, S. Seljan and H. Stančić, “The concept of the automatic
classification of the registry and archival records” (Koncept
automatske klasifikacije registraturnoga i arhivskoga gradiva).
Proceedings of the 48. savjetovanje hrvatskih arhivista (HAD) /
Zaštita arhivskoga gradiva u nastajanju. Hrvatsko arhivističko
društvo. Topusko, pp. 195–211, 2015.
[18] I. Dunđer, “Statistical Machine Translation System and
Computational Domain Adaptation” (Sustav za statističko strojno
prevođenje i računalna adaptacija domene) / doctoral dissertation,
University of Zagreb, p. 281, 2015.
[19] T. Mikolov, K. Chen, G. Corrado and J. Dean, “Efficient
Estimation of Word Representations in Vector Space”,
arXiv:1301.3781 [cs.CL], 2013.
[20] F. T. Asr, R. Zinkov and M. N. Jones, “Querying Word
Embeddings for Similarity and Relatedness”, Proceedings of
NAACL-HLT 2018, ACL. New Orleans, pp. 675–684, 2018.
[21] P. Bojanowski, E. Grave, A. Joulin and T. Mikolov, “Enriching
Word Vectors with Subword Information”, arXiv:1607.04606
[cs.CL], 2017.
[22] Z. Yin and Y. Shen, “On the Dimensionality of Word Embedding”,
arXiv:1812.04224 [cs.LG], 2018.
[23] Q. Le and T. Mikolov, “Distributed representations of sentences
and documents”, Proceedings of the 31st International Conference
on Machine Learning (ICML 2014), pp. 1188–1196, 2014.
[24] T. Mikolov, I. Sutskever, K. Chen, G. Corrado and J. Dean,
“Distributed Representations of Words and Phrases and their
Compositionality”, arXiv:1310.4546 [cs.CL], 2013.
[25] Z. S. Harris, “Distributional Structure”, Word, vol. 10, no. 2–3,
DOI: 10.1080/00437956.1954.11659520, pp. 146–162, 1954.
[26] J. R. Firth, “A synopsis of linguistic theory 1930-1955”, Studies in
Linguistic Analysis: 1–32. Reprinted in F. R. Palmer, Ed. (1968).
Selected Papers of J. R. Firth 1952-1959. London: Longman,
1957.
[27] Y. Goldberg, “Neural Network Methods in Natural Language
Processing”, Synthesis Lectures on Human Language
Technologies (Book 37). San Rafael, CA: Morgan & Claypool
Publishers, 2017.
[28] T. Mikolov, W. Yih and G. Zweig, “Linguistic Regularities in
Continuous Space Word Representations”, Proceedings of
NAACL-HLT 2013, ACL. Atlanta, pp. 746–751, 2013.
[29] R. Řehůřek and P. Sojka, “Software Framework for Topic
Modelling with Large Corpora”, Proceedings of the LREC 2010
workshop New Challenges for NLP Frameworks. Valletta, pp. 46–
50, 2010.
[30] M. Rose, “1984 - Newspeak Dictionary. Newspeak and other
terminology found in 1984”. Available at (15.01.2019.):
http://moellerlit.weebly.com/uploads/1/0/2/4/10248653/1984_--_newspeak_dictionary.pdf
[31] L. van der Maaten and G. Hinton, “Visualizing Data Using
t-SNE”, Journal of Machine Learning Research (JMLR), vol. 9,
pp. 2579–2605, 2008.
[32] J. Pennington, R. Socher and C. D. Manning, “GloVe: Global
Vectors for Word Representation”, Proceedings of the 2014
Conference on Empirical Methods in Natural Language
Processing (EMNLP), ACL. Doha, pp. 1532–1543, 2014.
MIPRO 2019/CE