Symbolic, Distributed and Distributional Representations for Natural Language Processing in the Era of Deep Learning: a Survey
Lorenzo Ferrone¹ and Fabio Massimo Zanzotto¹*
¹ Department of Enterprise Engineering, University of Rome Tor Vergata, Rome, Italy
*Correspondence:
Fabio Massimo Zanzotto
fabio.massimo.zanzotto@uniroma2.it
ABSTRACT
Natural language is inherently a discrete symbolic representation of human knowledge. Recent advances in machine learning (ML) and in natural language processing (NLP) seem to contradict the above intuition: discrete symbols are fading away, erased by vectors or tensors called distributed and distributional representations. However, there is a strict link between distributed/distributional representations and discrete symbols, the first being an approximation of the second. A clearer understanding of the strict link between distributed/distributional representations and symbols may certainly lead to radically new deep learning networks. In this paper we make a survey that aims to renew the link between symbolic representations and distributed/distributional representations. This is the right time to revitalize the area of interpreting how discrete symbols are represented inside neural networks.
1 INTRODUCTION
Natural language is inherently a discrete symbolic representation of human knowledge. Sounds are transformed into letters or ideograms, and these discrete symbols are composed to obtain words. Words then form sentences, and sentences form texts, discourses, dialogs, which ultimately convey knowledge, emotions, and so on. This composition of symbols into words and of words into sentences follows rules that both the hearer and the speaker know (Chomsky, 1957). Hence, thinking of natural language understanding systems that are not based on discrete symbols seems extremely odd.
Recent advances in machine learning (ML) applied to natural language processing (NLP) seem to
contradict the above intuition: discrete symbols are fading away, erased by vectors or tensors called
distributed and distributional representations. In ML applied to NLP, distributed representations are
pushing deep learning models (LeCun et al., 2015; Schmidhuber, 2015) towards amazing results in many
high-level tasks such as image generation (Goodfellow et al., 2014), image captioning (Vinyals et al.,
2015b; Xu et al., 2015), machine translation (Bahdanau et al., 2014; Zou et al., 2013), syntactic parsing
(Vinyals et al., 2015a; Weiss et al., 2015) and in a variety of other NLP tasks (Devlin et al., 2018). In more traditional NLP, distributional representations are pursued as a more flexible way to represent the semantics of natural language, the so-called distributional semantics (see (Turney and Pantel, 2010)). Words as well as sentences are represented as vectors or tensors of real numbers. Vectors for words are obtained by observing how these words co-occur with other words in document collections. Moreover, as in traditional
compositional representations, vectors for phrases (Mitchell and Lapata, 2008; Baroni and Zamparelli,
2010; Clark et al., 2008; Grefenstette and Sadrzadeh, 2011; Zanzotto et al., 2010) and sentences (Socher
et al., 2011, 2012; Kalchbrenner and Blunsom, 2013) are obtained by composing vectors for words.
The success of distributed and distributional representations over symbolic approaches is mainly due
to the advent of new parallel paradigms that pushed neural networks (Rosenblatt, 1958; Werbos, 1974)
towards deep learning (LeCun et al., 2015; Schmidhuber, 2015). Massively parallel algorithms running
on Graphic Processing Units (GPUs) (Chetlur et al., 2014; Cui et al., 2015) crunch vectors, matrices and
tensors faster than decades ago. The back-propagation algorithm can be now computed for complex and
large neural networks. Symbols are not needed any more during “reasoning”, that is, the neural network
learning and its application. Hence, discrete symbols only survive as inputs and outputs of these wonderful
learning machines.
However, there is a strict link between distributed/distributional representations and symbols, being
the first an approximation of the second (Fodor and Pylyshyn, 1988; Plate, 1994, 1995; Ferrone et al.,
2015). The representation of the input and the output of these networks is not that far from their internal
representation. The similarity and the interpretation of the internal representation is clearer in image
processing (Zeiler and Fergus, 2014a). In fact, networks are generally interpreted visualizing how subparts
represent salient subparts of target images. Both input images and subparts are tensors of real numbers.
Hence, these networks can be examined and understood. The same does not apply to natural language
processing with its discrete symbols.
A clearer understanding of the strict link between distributed/distributional representations and discrete
symbols is needed (Jang et al., 2018; Jacovi et al., 2018) to understand how neural networks treat information
and to propose novel deep learning architectures. Model interpretability is becoming an important topic in
machine learning in general (Lipton, 2016). This clearer understanding is then the dawn of a new range of
possibilities: understanding what part of the current symbolic techniques for natural language processing
have a sufficient representation in deep neural networks; and, ultimately, understanding whether a more
brain-like model – the neural networks – is compatible with methods for syntactic parsing or semantic
processing that have been defined in these decades of studies in computational linguistics and natural
language processing. There is thus a tremendous opportunity to understand whether and how symbolic
representations are used and emitted in a brain model.
In this paper we make a survey that aims to draw the link between symbolic representations and
distributed/distributional representations. This is the right time to revitalize the area of interpreting how
symbols are represented inside neural networks. In our opinion, this survey will help to devise new
deep neural networks that can exploit existing and novel symbolic models of classical natural language
processing tasks.
The paper is structured as follows: first, we give an introduction to the very general concept of representations and the difference between local and distributed representations (Plate, 1995). After that, we present each technique in detail. Afterwards, we focus on distributional representations (Turney and Pantel, 2010), which we treat as a specific example of a distributed representation. Finally, we discuss in more depth
the general issue of compositionality, analyzing three different approaches to the problem: compositional
distributional semantics (Clark et al., 2008; Baroni et al., 2014), holographic reduced representations (Plate,
1994; Neumann, 2001), and recurrent neural networks (Kalchbrenner and Blunsom, 2013; Socher et al.,
2012).
2 SYMBOLIC AND DISTRIBUTED REPRESENTATIONS: INTERPRETABILITY AND
CONCATENATIVE COMPOSITIONALITY
Distributed representations put symbolic expressions in metric spaces where similarity among examples is
used to learn regularities for specific tasks by using neural networks or other machine learning models.
Given two symbolic expressions, their distributed representation should capture their similarity along
specific features useful for the final task. For example, two sentences such as $s_1$ = “a mouse eats some cheese” and $s_2$ = “a cat swallows a mouse” can be considered similar in many different ways: (1) number of words in common; (2) realization of the pattern “ANIMAL EATS FOOD”. The key point is to decide, or to let an algorithm decide, which is the best representation for a specific task.
Distributed representations are thus replacing long-lasting, successful discrete symbolic representations in representing knowledge for learning machines, but these representations are less human-interpretable. Hence, discussing the basic, obvious properties of discrete symbolic representations is not useless, as these properties may guarantee distributed representations a success similar to that of discrete symbolic representations.
Discrete symbolic representations are human interpretable as symbols are not altered in expressions.
This is one of the most important, obvious features of these representations. Infinite sets of expressions,
which are sequences of symbols, can be interpreted as these expressions are obtained by concatenating
a finite set of basic symbols according to some concatenative rules. During concatenation, symbols
are not altered and, then, can be recognized. By using the principle of semantic compositionality, the
meaning of expressions can be obtained by combining the meaning of the parts and, hence, recursively, by
combining the meaning of the finite set of basic symbols. For example, given the set of basic symbols
D
=
{
mouse,cat,a,swallows,(,)
}
, expressions like
s1
=“a cat swallows a mouse” or
t1
=((a cat) (swallows (a
mouse))) are totally plausible and interpretable given rules for producing natural language utterances or for
producing tree structured representations in parenthetical form, respectively. This strongly depends on the
fact that individual symbols can be recognized.
Distributed representations instead seem to alter symbols when applied to symbolic inputs and, thus, are
less interpretable. In fact, symbols as well as expressions are represented as vectors in these metric spaces.
Observing distributed representations, symbols and expressions do not immediately emerge. Moreover,
these distributed representations may be transformed by using matrix multiplication or by using non-
linear functions. Hence, it is generally unclear: (1) what is the relation between the initial symbols or
expressions and their distributed representations and (2) how these expressions are manipulated during
matrix multiplication or when applying non-linear functions. In other words, it is unclear whether symbols
can be recognized in distributed representations.
Hence, a debated question is whether discrete symbolic representations and distributed representations are two very different ways of encoding knowledge because of this difference in altering symbols. The debate dates back to the late 1980s. For Fodor and Pylyshyn (1988), distributed representations in Neural
Network architectures are “only an implementation of the Classical approach” where classical approach is
related to discrete symbolic representations. Whereas, for Chalmers (1992), distributed representations give
the important opportunity to reason “holistically” about encoded knowledge. This means that decisions
over some specific part of the stored knowledge can be taken without retrieving the specific part but acting
on the whole representation. However, this does not solve the debated question as it is still unclear what is
in a distributed representation.
To contribute to the above debated question, Gelder (1990) has formalized the property of altering symbols
in expressions by defining two different notions of compositionality: concatenative compositionality and
functional compositionality. Concatenative compositionality explains how discrete symbolic representations
compose symbols to obtain expressions. In fact, the mode of combination is an extended concept of
juxtaposition that provides a way of linking successive symbols without altering them as these form
expressions. Concatenative compositionality explains discrete symbolic representations no matter the means used to store expressions: a piece of paper or a computer memory. Concatenation is sometimes expressed with an operator like $\oplus$, which can be used in an infix or a prefix notation, that is, as a sort of function with arguments $\oplus(w_1, \ldots, w_n)$. By using the operator for concatenation, the two above examples $s_1$ and $t_1$ can be represented as:
$$a \oplus cat \oplus swallows \oplus a \oplus mouse$$
which represents a sequence with the infix notation, and
$$\oplus(\oplus(a, cat), \oplus(swallows, \oplus(a, mouse)))$$
which represents a tree with the prefix notation. Functional compositionality instead explains distributed representations. In functional compositionality, the mode of combination is a function $\Phi$ that gives a reliable, general process for producing expressions given its constituents. Within this perspective, semantic compositionality is a special case of functional compositionality where the target of the composition is a way to represent meaning (Blutner et al., 2003).
Local distributed representations (as referred to in (Plate, 1995)), or one-hot encodings, are the easiest way to visualize how functional compositionality acts on distributed representations. Local distributed representations give a first, simple encoding of discrete symbolic representations in a metric space. Given a set of symbols $D$, a local distributed representation maps the $i$-th symbol in $D$ to the $i$-th base unit vector $e_i$ in $\mathbb{R}^n$, where $n$ is the cardinality of $D$. Hence, the $i$-th unit vector represents the $i$-th symbol. In functional compositionality, expressions $s = w_1 \ldots w_k$ are represented by vectors $\mathbf{s}$ obtained with a possibly recursive function $\Phi$ applied to the vectors $e_{w_1} \ldots e_{w_k}$. The function $\Phi$ may be very simple, such as the sum, or more complex. In case the function $\Phi$ is the sum, that is:
$$func_\Sigma(s) = \sum_{j=1}^{k} e_{w_j} \qquad (1)$$
the derived vector is the classical bag-of-words vector space model (Salton, 1989). More complex functions $\Phi$ can range from different vector-to-vector operations, like circular convolution in Holographic Reduced Representations (Plate, 1995), to matrix multiplications plus non-linear operations in models such as recurrent neural networks (Schuster and Paliwal, 1997; Hochreiter and Schmidhuber, 1997) or neural networks with attention (Vaswani et al., 2017; Devlin et al., 2018). The above example can be useful to describe concatenative and functional compositionality. The set $D = \{mouse, cat, a, swallows, eats, some, cheese, (, )\}$ may be represented with the base vectors $e_i \in \mathbb{R}^9$, where $e_1$ is the base vector for mouse, $e_2$ for cat, $e_3$ for a, $e_4$ for swallows, $e_5$ for eats, $e_6$ for some, $e_7$ for cheese, $e_8$ for (, and $e_9$ for ). The additive functional composition of the expression $s_1$ = “a cat swallows a mouse” is then:
the expression, written in the $e_i$ basis, corresponds to the sequence $e_3, e_2, e_4, e_3, e_1$, and its additive functional composition is
$$func_\Sigma(s_1) = e_3 + e_2 + e_4 + e_3 + e_1 = \begin{pmatrix} 1 & 1 & 2 & 1 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}^T$$
where the concatenative operator has been substituted with the sum $+$. Notably, in the additive functional composition $func_\Sigma(s_1)$, symbols are still visible but the sequence is lost. Hence, it is difficult to reproduce the initial discrete symbolic expression. However, the additive composition function, for example, gives the possibility to compare two expressions. Given the expression $s_1$ and $s_2$ = “a mouse eats some cheese”, the dot product between $func_\Sigma(s_1)$ and $func_\Sigma(s_2) = \begin{pmatrix} 1 & 0 & 1 & 0 & 1 & 1 & 1 & 0 & 0 \end{pmatrix}^T$ counts the common words between the two expressions. In a functional composition with a generic function $\Phi$, the expression $s_1$ may instead become $func_\Phi(s_1) = \Phi(\Phi(\Phi(\Phi(e_3, e_2), e_4), e_3), e_1)$ by following the concatenative compositionality of the discrete symbolic expression. The same functional compositional principle can be applied to discrete symbolic trees such as $t_1$, producing the distributed representation $\Phi(\Phi(e_3, e_2), \Phi(e_4, \Phi(e_3, e_1)))$. Finally, in the functional composition with a generic recursive function $func_\Phi(s_1)$, the function $\Phi$ is crucial to determine whether symbols can be recognized and the sequence is preserved.
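To make the additive functional composition concrete, the following minimal NumPy sketch builds the one-hot vectors for the vocabulary of the running example, composes the two sentences with the sum, and compares them with a dot product (the vocabulary order and the function names are ours, introduced only for illustration):

```python
import numpy as np

# Vocabulary of the running example: each symbol is a one-hot vector in R^9.
D = ["mouse", "cat", "a", "swallows", "eats", "some", "cheese", "(", ")"]
E = np.eye(len(D))                       # E[i] is the one-hot vector for the i-th symbol
idx = {w: i for i, w in enumerate(D)}

def func_sum(sentence):
    """Additive functional composition: sum of the one-hot vectors of the words."""
    return sum(E[idx[w]] for w in sentence.split())

s1 = "a cat swallows a mouse"
s2 = "a mouse eats some cheese"

v1, v2 = func_sum(s1), func_sum(s2)
print(v1)                # [1. 1. 2. 1. 0. 0. 0. 0. 0.] -> bag of words, word order is lost
print(np.dot(v1, v2))    # the dot product counts the (weighted) word overlap between s1 and s2
```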
Distributed representations in their general form are more ambitious than local distributed representations and tend to encode the basic symbols of $D$ as vectors in $\mathbb{R}^d$ where $d \ll n$. These vectors generally alter symbols as there is not a direct link between symbols and dimensions of the space. Given a local distributed representation $e_w$ of a symbol $w$, the encoder for a distributed representation is a matrix $W_{d \times n}$ that transforms $e_w$ into $y_w = W_{d \times n} e_w$. As an example, the encoding matrix $W_{d \times n}$ can be built by modeling the words in $D$ along three dimensions: number of vowels, number of consonants and, finally, number of non-alphabetic symbols. Given these dimensions, the matrix $W_{3 \times 9}$ for the example is:
$$W_{3 \times 9} = \begin{pmatrix} 3 & 1 & 1 & 2 & 2 & 2 & 3 & 0 & 0 \\ 2 & 2 & 0 & 6 & 2 & 2 & 3 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \end{pmatrix}$$
This is a simple example of a distributed representation. In a distributed representation (Plate, 1995; Hinton
et al., 1986) the informational content is distributed (hence the name) among multiple units, and at the
same time each unit can contribute to the representation of multiple elements. Distributed representation
has two evident advantages with respect to a local distributed representation: it is more efficient (in the example, the representation uses only 3 numbers instead of 9) and it does not treat each element as being equally different from any other. In fact, mouse and cat are more similar in this representation than mouse and a. In other words, this representation captures by construction something interesting about the set of symbols. The drawback is that symbols are altered and, hence, it may be difficult to interpret which symbol a given distributed representation stands for. In the example, the distributed representations for eats and some are exactly the same vector: $W_{3 \times 9} e_5 = W_{3 \times 9} e_6$.
Even for distributed representations in the general form, it is possible to define concatenative composition and functional composition to represent expressions. Vectors $e_i$ in the definitions of concatenative and functional compositionality should be replaced with vectors $W_{d \times n} e_i$. The concatenative composition is then translated to:
$$Y_s = W_{d \times n}\, conc(s) = [W_{d \times n} e_{w_1} \ldots W_{d \times n} e_{w_k}]$$
and Equation (1) for additive functional compositionality becomes:
$$y_s = W_{d \times n}\, func_\Sigma(s) = \sum_{j=1}^{k} W_{d \times n} e_{w_j}$$
In the running example, the additive functional composition of sentence $s_1$ is:
$$y_{s_1} = W_{3 \times 9}\, func_\Sigma(s_1) = \begin{pmatrix} 8 \\ 10 \\ 0 \end{pmatrix}$$
Clearly, in this case, it is extremely difficult to derive back the discrete symbolic sequence $s_1$ that has generated the final distributed representation.
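A small sketch of the same composition after the embedding matrix $W_{3 \times 9}$ is applied; the feature-counting helper is an illustrative stand-in that reproduces the vowel/consonant/non-alphabetic encoding described above:

```python
import numpy as np

D = ["mouse", "cat", "a", "swallows", "eats", "some", "cheese", "(", ")"]

def features(w):
    """Three dimensions: vowels, consonants, non-alphabetic characters."""
    vowels = sum(c in "aeiou" for c in w)
    consonants = sum(c.isalpha() and c not in "aeiou" for c in w)
    other = sum(not c.isalpha() for c in w)
    return [vowels, consonants, other]

W = np.array([features(w) for w in D]).T      # the 3x9 encoding matrix of the example
f_s1 = np.array([1, 1, 2, 1, 0, 0, 0, 0, 0])  # additive composition of "a cat swallows a mouse"

y_s1 = W @ f_s1
print(y_s1)                                   # [ 8 10  0]

# Trying to decode with the pseudo-inverse does not recover the original counts:
# several different bags of symbols map onto the same low-dimensional vector.
print(np.linalg.pinv(W) @ y_s1)
```

The pseudo-inverse at the end illustrates the point made in the text: with $d \ll n$ the encoding is not invertible, so the original sequence cannot be recovered.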
Summing up, a distributed representation $y_s$ of a discrete symbolic expression $s$ is obtained by using an encoder that acts in two ways:
- it transforms symbols $w_i$ into vectors by using an embedding matrix $W_{d \times n}$ and the local distributed representation $e_i$ of $w_i$;
- it transposes the concatenative compositionality of the discrete symbolic expression $s$ into a functional compositionality by defining the composition function that is used.
When defining a distributed representation, we therefore need to define two elements:
- an embedding matrix $W$ that should balance two different aims: (1) maximize interpretability, that is, inversion; (2) maximize similarity among different symbols for specific purposes;
- the functional composition model: additive, holographic reduced representations (Plate, 1995), recurrent neural networks (Schuster and Paliwal, 1997; Hochreiter and Schmidhuber, 1997) or networks with attention (Vaswani et al., 2017; Devlin et al., 2018).
The final questions are then: What is inside a distributed representation? What exactly is encoded? How is this information used to take decisions? Hence, the debated question becomes: how concatenative is the functional compositionality in the distributed representations behind neural networks? Can we retrieve discrete symbols and rebuild sequences?
To answer the above questions, we describe two properties of distributed representations, interpretability and concatenative compositionality. These two properties aim to measure how far distributed representations are from symbolic representations.
Interpretability is the possibility of decoding distributed representations, that is, of extracting the embedded symbolic representations. This is an important characteristic, but it must be noted that it is not a simple yes-or-no property: it is rather a degree associated with specific representations. In fact, even if each component of a vector representation does not have a specific meaning, this does not mean that the representation is not interpretable as a whole, or that symbolic information cannot be recovered from it. For this reason, we can categorize the degree of interpretability of a representation as follows:
- human-interpretable – each dimension of the representation has a specific meaning;
- decodable – the representation may be obscure, but it can be decoded into an interpretable, symbolic representation.
Concatenative Compositionality for distributed representations
is the possibility of composing basic
distributed representations with strong rules and of decomposing back composed representations with
inverse rules. Generally, in NLP, basic distributed representations refer to basic symbols.
The two axes of Interpretability and Concatenative Compositionality for distributed representations
will be used to describe the presented distributed representations as we are interested in understanding
whether or not a representation can be used to represent structures or sequences and whether it is possible
to extract back the underlying structure or sequence given a distributed representation. It is clear that a local
distributed representation is more interpretable than a distributed representation. Yet, both representations lack concatenative compositionality when sequences or structures are collapsed into vectors or tensors that do not depend on the length of the represented sequences or structures. For example, the bag-of-words local representation does not take into consideration the order of the symbols in the sequence.
3 STRATEGIES TO OBTAIN DISTRIBUTED REPRESENTATIONS FROM SYMBOLS
There is a wide range of techniques to transform symbolic representations into distributed representations. When combining natural language processing and machine learning, this is a major issue: transforming symbols, sequences of symbols or symbolic structures into vectors or tensors that can be used in learning machines. These techniques generally propose a function $\eta$ to transform a local representation with a large number of dimensions into a distributed representation with a lower number of dimensions:
$$\eta: \mathbb{R}^n \to \mathbb{R}^d$$
This function is often called encoder.
We propose to categorize techniques to obtain distributed representations into two broad categories, which show some degree of overlap:
- representations derived from dimensionality reduction techniques;
- learned representations.
In the rest of the section, we will introduce the different strategies according to the proposed categorization. Moreover, for each representation and its related function $\eta$, we will emphasize its degree of interpretability by answering two questions:
- Does a specific dimension in $\mathbb{R}^d$ have a clear meaning?
- Can we decode an encoded symbolic representation? In other words, assuming a decoding function $\delta: \mathbb{R}^d \to \mathbb{R}^n$, how far is $v \in \mathbb{R}^n$, which represents a symbolic representation, from $v' = \delta(\eta(v))$?
Instead, composability of the resulting representations will be analyzed in Sec. 5.
3.1 Dimensionality reduction with Random Projections
Random projection (RP) (Bingham and Mannila, 2001; Fodor, 2002) is a technique based on random matrices $W_d \in \mathbb{R}^{d \times n}$. Generally, the rows of the matrix $W_d$ are sampled from a Gaussian distribution with zero mean and normalized so as to have unit length (Johnson and Lindenstrauss, 1984), or are drawn from even less complex random distributions (Achlioptas, 2003). Random projections from Gaussian distributions approximately preserve pairwise distances between points (see the Johnson-Lindenstrauss Lemma (Johnson and Lindenstrauss, 1984)), that is, for any pair of vectors $x, y \in X$:
$$(1 - \varepsilon)\,\|x - y\|^2 \leq \|Wx - Wy\|^2 \leq (1 + \varepsilon)\,\|x - y\|^2$$
where the approximation factor $\varepsilon$ depends on the dimension of the projection; namely, to ensure that the approximation factor is $\varepsilon$, the dimension $k$ must be chosen such that:
$$k \geq \frac{8 \log(m)}{\varepsilon^2}$$
where $m$ is the number of data points.
Constraints for building the matrix $W$ can be significantly relaxed to less complex random vectors (Achlioptas, 2003). Rows of the matrix can be sampled from very simple zero-mean distributions such as:
$$W_{ij} = \sqrt{3} \cdot \begin{cases} +1 & \text{with probability } 1/6 \\ -1 & \text{with probability } 1/6 \\ 0 & \text{with probability } 2/3 \end{cases}$$
without the need to manually ensure unit-length of the rows, and at the same time providing a significant
speed up in computation due to the sparsity of the projection.
Unfortunately, vectors $\eta(v)$ are not human-interpretable: even if their dimensions represent linear combinations of the dimensions in the original local representation, these dimensions do not have an interpretation or particular properties.
On the contrary, vectors $\eta(v)$ are decodable. The decoding function is:
$$\delta(v') = W_d^T v'$$
and $W_d^T W_d \approx I$ when $W_d$ is derived using Gaussian random vectors. Hence, distributed vectors in $\mathbb{R}^d$ can be approximately decoded back into the original symbolic representation, with a degree of approximation that depends on the chosen dimension $d$.
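A sketch of Gaussian random projection and its approximate decoding, under the assumption that rows are sampled from $\mathcal{N}(0, 1/d)$ (one common way to obtain nearly orthonormal rows; the dimensions and indices below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10000, 300                      # size of the vocabulary, reduced dimension

# Gaussian random projection: rows with entries N(0, 1/d) are nearly orthonormal.
W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, n))

v = np.zeros(n)                        # a local (one-hot style) representation
v[[12, 345, 6789]] = 1.0               # a small bag of three symbols

encoded = W @ v                        # eta(v): distributed representation in R^d
decoded = W.T @ encoded                # delta(v') = W^T v': approximate inverse

# The original components stand out against the reconstruction noise.
print(decoded[[12, 345, 6789]])        # values close to 1
print(np.abs(decoded).mean())          # much smaller on average
```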
The major advantage of RP with respect to PCA is that the matrix $X$ of all the data points is not needed to derive the matrix $W_d$. Moreover, the matrix $W_d$ can be produced à la carte, starting from the symbols encountered so far in the encoding procedure. In fact, it is sufficient to generate new Gaussian vectors for new symbols when they appear.
3.2 Learned representation
Learned representations differ from dimensionality reduction techniques in that: (1) encoding/decoding functions may not be linear; (2) learning can optimize objective functions that are different from the target of PCA; and (3) solutions are not derived in a closed form but are obtained using optimization techniques such as stochastic gradient descent.
Learned representations can be further classified into:
- task-independent representations, learned with a standalone algorithm (as in autoencoders (Socher et al., 2011; Liou et al., 2014)) which is independent from any task and which learns a representation that only depends on the dataset used;
- task-dependent representations, learned as the first step of another algorithm (this is called end-to-end training), usually as the first layer of a deep neural network. In this case the new representation is driven by the task.
3.2.1 Autoencoder
Autoencoders are a task-independent technique to learn a distributed representation encoder $\eta: \mathbb{R}^n \to \mathbb{R}^d$ by using local representations of a set of examples (Socher et al., 2011; Liou et al., 2014). The distributed representation encoder $\eta$ is half of an autoencoder.
An autoencoder is a neural network that aims to reproduce an input vector in $\mathbb{R}^n$ as output by passing through one or more hidden layers in $\mathbb{R}^d$. Given $\eta: \mathbb{R}^n \to \mathbb{R}^d$ and $\delta: \mathbb{R}^d \to \mathbb{R}^n$ as the encoder and the decoder, respectively, an autoencoder aims to minimize the following reconstruction loss:
$$L(x, x') = \|x - x'\|^2$$
where
$$x' = \delta(\eta(x))$$
The encoding and decoding modules are two neural networks, which means that they are functions depending on a set of parameters $\theta$ of the form
$$\eta_\theta(x) = s(Wx + b)$$
$$\delta_{\theta'}(y) = s(W'y + b')$$
where the parameters of the entire model are $\theta, \theta' = \{W, b, W', b'\}$, with $W, W'$ matrices, $b, b'$ vectors and $s$ a function that can be either a sigmoid-shaped non-linearity or, in some cases, the identity function. In some variants the matrices $W$ and $W'$ are constrained to $W^T = W'$. This model is different with respect to PCA due to the target loss function and the use of non-linear functions.
Autoencoders have been further improved with denoising autoencoders (Vincent et al., 2010, 2008; Masci et al., 2011), a variant of autoencoders where the goal is to reconstruct the input from a corrupted version. The intuition is that higher-level features should be robust to small noise in the input. In particular, the input $x$ gets corrupted via a stochastic function:
$$\tilde{x} = g(x)$$
and then one minimizes again the reconstruction error, but with regard to the original (uncorrupted) input:
$$L(x, x') = \|x - \delta(\eta(g(x)))\|^2$$
Usually $g$ can be either:
- additive Gaussian noise: $g(x) = x + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, \sigma I)$;
- masking noise: a given fraction $\nu$ of the components of the input is set to 0.
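The following is a minimal sketch of a denoising autoencoder with tied weights ($W^T = W'$), sigmoid non-linearities, masking noise and plain gradient descent, written in NumPy on toy data; it is meant only to make the encoder/decoder structure and the corrupted-input loss tangible, not to be a faithful reimplementation of the cited systems:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5                                       # input size, hidden (code) size

# Toy data: random sparse binary vectors (local bag-of-symbols representations).
X = (rng.random((500, n)) < 0.15).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tied weights: encoder W, decoder W^T (a variant mentioned in the text).
W = rng.normal(0, 0.1, (d, n))
b = np.zeros(d)
c = np.zeros(n)

lr = 0.5
for epoch in range(200):
    x_tilde = X * (rng.random(X.shape) > 0.3)      # masking noise: drop 30% of inputs
    H = sigmoid(x_tilde @ W.T + b)                 # eta(g(x))
    X_rec = sigmoid(H @ W + c)                     # delta(eta(g(x)))
    err = X_rec - X                                # compare with the uncorrupted input
    g_out = err * X_rec * (1 - X_rec)              # gradient w.r.t. decoder pre-activation
    g_hid = (g_out @ W.T) * H * (1 - H)            # gradient w.r.t. encoder pre-activation
    W -= lr * (g_hid.T @ x_tilde + H.T @ g_out) / len(X)
    b -= lr * g_hid.mean(axis=0)
    c -= lr * g_out.mean(axis=0)

print(((X_rec - X) ** 2).mean())                   # reconstruction error on the clean input
```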
As far as interpretability is concerned, as for random projections, the distributed representations $\eta(v)$ obtained with encoders from autoencoders and denoising autoencoders are not human-interpretable, but they are decodable, as this is the very nature of autoencoders.
Moreover, composability is not covered by this formulation of autoencoders.
3.2.2 Embedding layers
Embedding layers are generally the first layers of more complex neural networks and are responsible for transforming an initial local representation into the first internal distributed representation. The main difference with autoencoders is that these layers are shaped by the entire overall learning process, which is generally task-dependent. Hence, these first embedding layers depend on the final task.
It is argued that each layer learns a higher-level representation of its input. This is particularly visible with convolutional networks (Krizhevsky et al., 2012) applied to computer vision tasks. In suggestive visualizations (Zeiler and Fergus, 2014b), the hidden layers are seen to correspond to abstract features of the image, starting from simple edges (in lower layers) up to faces in the higher ones.
However, these embedding layers produce encoding functions and, thus, distributed representations that
are not interpretable when applied to symbols. In fact, these distributed representations are not human-
interpretable as dimensions are not clearly related to specific aggregations of symbols. Moreover, these
embedding layers do not naturally provide decoders. Hence, this distributed representation is not decodable.
4 DISTRIBUTIONAL REPRESENTATIONS AS ANOTHER SIDE OF THE COIN
Distributional semantics is an important area of research in natural language processing that aims to
describe meaning of words and sentences with vectorial representations (see (Turney and Pantel, 2010) for
a survey). These representations are called distributional representations.
It is a strange historical accident that two similar-sounding names – distributed and distributional – have been given to two concepts that should not be confused with one another. This has probably happened because the two concepts are definitely related. We argue that distributional representations are nothing more than a subset of distributed representations and, in fact, can be categorized neatly into the divisions presented in the previous section.
Distributional semantics is based on a famous slogan – “you shall judge a word by the company it keeps”
(Firth, 1957) – and on the distributional hypothesis (Harris, 1964) – words have similar meaning if used in
similar contexts, that is, words with the same or similar distribution. Hence, the name distributional as well
as the core hypothesis comes from a linguistic rather than computer science background.
Distributional vectors represent words by describing information related to the contexts in which they
appear. Put in this way it is apparent that a distributional representation is a specific case of a distributed
representation, and the different name is only an indicator of the context in which these techniques originated. Representations for sentences are generally obtained by combining vectors representing words.
Hence, distributional semantics is a special case of distributed representations with a restriction on what
can be used as features in vector spaces: features represent a bit of contextual information. Then, the largest
body of research is on what should be used to represent contexts and how it should be taken into account.
Once this is decided, large matrices $X$ representing words in context are collected and, then, dimensionality reduction techniques are applied to obtain more tractable and more discriminative vectors.
In the rest of the section, we present how to build matrices representing words in context, we will shortly
recap on how dimensionality reduction techniques have been used in distributional semantics, and, finally,
we report on word2vec (Mikolov et al., 2013), which is a novel distributional semantic technique based on deep learning.
4.1 Building distributional representations for words from a corpus
The major issue in distributional semantics is how to build distributional representations for words by
observing word contexts in a collection of documents. In this section, we will describe these techniques
using the example of the corpus in Table 1.
$s_1$: a cat catches a mouse
$s_2$: a dog eats a mouse
$s_3$: a dog catches a cat

Table 1. A very small corpus
A first and simple distributional semantic representation of words is given by word vs. document matrices, as is typical in information retrieval (Salton, 1989). Word contexts are represented by document indexes. Then, words are similar if they appear similarly across documents. This is generally referred to as topical similarity (Landauer and Dumais, 1997), as words belonging to the same topic tend to be more similar. Such a word vs. document matrix is already a distributional and a distributed representation for words, which are represented as vectors in its rows.
A second strategy to build distributional representations for words is to build word vs. contextual feature
matrices. These contextual features represent proxies for semantic attributes of modeled words (Baroni and
Lenci, 2010). For example, contexts of the word dog will somehow have relation with the fact that a dog
has four legs, barks, eats, and so on. In this case, these vectors capture a similarity that is more related to a
co-hyponymy, that is, words sharing similar attributes are similar. For example, dog is more similar to cat
than to car, as dog and cat share more attributes than dog and car. This is often referred to as attributional
similarity (Turney, 2006).
A simple example of this second strategy are word-to-word matrices obtained by observing n-word
windows of target words. For example, a word-to-word matrix obtained for the corpus in Table 1 by
considering a 1-word window is the following:
$$X = \begin{array}{c|cccccc} & a & cat & dog & mouse & catches & eats \\ \hline a & 0 & 2 & 2 & 2 & 2 & 1 \\ cat & 2 & 0 & 0 & 0 & 1 & 0 \\ dog & 2 & 0 & 0 & 0 & 1 & 1 \\ mouse & 2 & 0 & 0 & 0 & 0 & 0 \\ catches & 2 & 1 & 1 & 0 & 0 & 0 \\ eats & 1 & 0 & 1 & 0 & 0 & 0 \end{array} \qquad (2)$$
Hence, the word cat is represented by the vector $\mathbf{cat} = \begin{pmatrix} 2 & 0 & 0 & 0 & 1 & 0 \end{pmatrix}$ and the similarity between cat and dog is higher than the similarity between cat and mouse, as the cosine similarity $\cos(\mathbf{cat}, \mathbf{dog})$ is higher than the cosine similarity $\cos(\mathbf{cat}, \mathbf{mouse})$.
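A short sketch that rebuilds this word-to-word matrix from the corpus of Table 1 with a 1-word window and compares the resulting vectors with cosine similarity (function and variable names are ours):

```python
import numpy as np

corpus = ["a cat catches a mouse",
          "a dog eats a mouse",
          "a dog catches a cat"]

vocab = ["a", "cat", "dog", "mouse", "catches", "eats"]
idx = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(vocab), len(vocab)))

# 1-word window: count how often two words appear next to each other.
for sentence in corpus:
    words = sentence.split()
    for w1, w2 in zip(words, words[1:]):
        X[idx[w1], idx[w2]] += 1
        X[idx[w2], idx[w1]] += 1

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

cat, dog, mouse = X[idx["cat"]], X[idx["dog"]], X[idx["mouse"]]
print(cos(cat, dog), cos(cat, mouse))   # cat is closer to dog than to mouse
```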
The research on distributional semantics focuses on two aspects: (1) the best features to represent contexts;
(2) the best correlation measure among target words and features.
How to represent contexts is a crucial problem in distributional semantics. This problem is strictly
correlated to the classical question of feature definition and feature selection in machine learning. A wide
variety of features have been tried. Contexts have been represented as sets of relevant words, sets of relevant syntactic triples involving target words (Pado and Lapata, 2007; Rothenhäusler and Schütze, 2009) and
sets of labeled lexical triples (Baroni and Lenci, 2010).
Finding the best correlation measure among target words and their contextual features is the other issue.
Many correlation measures have been tried. The classical measures are term frequency-inverse document
frequency (tf-idf) (Salton, 1989) and point-wise mutual information (pmi). These, among other measures, are used to better capture the importance of contextual features for representing the distributional semantics of words.
This first formulation of distributional semantics is a distributed representation that is interpretable. In
fact, features represent contextual information which is a proxy for semantic attributes of target words
(Baroni and Lenci, 2010).
4.2 Compacting distributional representations
As distributed representations, distributional representations can undergo the process of dimensionality reduction with Principal Component Analysis and Random Indexing. This process is used for two reasons. The first is the classical problem of reducing the dimensions of the representation to obtain more compact representations. The second is to help the representation focus on more discriminative dimensions. This latter reason relates to feature selection and feature merging, which is an important task in making these representations more effective on the final task of similarity detection.
Principal Component Analysis (PCA) is largely applied in compacting distributional representations: Latent Semantic Analysis (LSA) is a prominent example (Landauer and Dumais, 1997). LSA was born in Information Retrieval with the idea of reducing word-to-document matrices. Hence, in this compact representation, word contexts are documents and distributional vectors of words report on the documents where words appear. This or similar matrix reduction techniques have then been applied to word-to-word matrices.
Principal Component Analysis (PCA) (Markovsky, 2012; Pearson, 1901) is a linear method which reduces the number of dimensions by projecting $\mathbb{R}^n$ into the “best” linear subspace of a given dimension $d$ by using a set of data points. The “best” linear subspace is a subspace whose dimensions maximize the variance of the data points in the set. PCA can be interpreted either as a probabilistic method or as a matrix approximation, and it is then usually known as truncated singular value decomposition. We are here interested in describing PCA as a probabilistic method, as this view is related to the interpretability of the resulting distributed representation.
As a probabilistic method, PCA finds an orthogonal projection matrix $W_d \in \mathbb{R}^{d \times n}$ such that the variance of the projected set of data points is maximized. The set of data points is referred to as a matrix $X \in \mathbb{R}^{m \times n}$ where each row $x_i^T \in \mathbb{R}^n$ is a single observation. Hence, the projected dataset whose variance is maximized is $\hat{X}_d = X W_d^T \in \mathbb{R}^{m \times d}$.
More specifically, let us consider the first weight vector $w_1$, which maps an element $x$ of the dataset into a single number $\langle x, w_1 \rangle$. Maximizing the variance means that $w_1$ is such that:
$$w_1 = \arg\max_{\|w\|=1} \sum_i (\langle x_i, w \rangle)^2$$
and it can be shown that the optimal value is achieved when $w$ is the eigenvector of $X^T X$ with the largest eigenvalue. This then produces the projected dataset:
$$\hat{X}_1 = X W_1^T = X w_1$$
The algorithm can then compute iteratively the second and further components by first subtracting the components already computed from $X$:
$$X - X w_1 w_1^T$$
and then proceeding as before. However, it turns out that all subsequent components are related to the eigenvectors of the matrix $X^T X$, that is, the $d$-th weight vector is the eigenvector of $X^T X$ with the $d$-th largest corresponding eigenvalue.
The encoding matrix for distributed representations derived with a PCA method is the matrix:
$$W_d = \begin{pmatrix} w_1^T \\ w_2^T \\ \vdots \\ w_d^T \end{pmatrix} \in \mathbb{R}^{d \times n}$$
where the $w_i$ are eigenvectors with eigenvalues decreasing with $i$. Hence, local representations $v \in \mathbb{R}^n$ are represented as distributed representations in $\mathbb{R}^d$ as:
$$\eta(v) = W_d v$$
Hence, vectors $\eta(v)$ are human-interpretable as their dimensions represent linear combinations of dimensions in the original local representation, and these dimensions are ordered according to their importance in the dataset, that is, their variance. Moreover, each dimension is a linear combination of the original symbols. Then, the matrix $W_d$ reports on which combinations of the original symbols are more important to distinguish data points in the set.
Moreover, vectors $\eta(v)$ are decodable. The decoding function is:
$$\delta(v') = W_d^T v'$$
and $W_d^T W_d = I$ if $d$ is the rank of the matrix $X$; otherwise it is a degraded approximation (for more details refer to (Fodor, 2002; Sorzano et al., 2014)). Hence, distributed vectors in $\mathbb{R}^d$ can be decoded back into the original symbolic representation with a degree of approximation that depends on the distance between $d$ and the rank of the matrix $X$.
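A minimal sketch of this PCA-based encoder/decoder, computing $W_d$ from the eigenvectors of $X^T X$ on random centered data (the data is synthetic and only serves to show the shapes involved):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 9))             # m=100 observations of local vectors in R^n, n=9
X = X - X.mean(axis=0)                    # center the data

# Eigenvectors of X^T X, sorted by decreasing eigenvalue, give the rows of W_d.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]
d = 3
W_d = eigvecs[:, order[:d]].T             # W_d in R^{d x n}

v = X[0]                                  # a data point (local representation)
encoded = W_d @ v                         # eta(v) in R^d
decoded = W_d.T @ encoded                 # delta(v') = W_d^T v'

# The reconstruction error depends on how much variance the first d components keep.
print(np.linalg.norm(v - decoded))
```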
The compelling limit of PCA is that all the data points have to be used in order to obtain the encoding/decoding matrices. This is not feasible in two cases. First, when the model has to deal with big data.
Figure 1. word2vec: CBOW model
Second, when the set of symbols to be encoded is extremely large. In this latter case, local representations cannot be used to produce matrices $X$ for applying PCA.
In distributional semantics, random indexing has been used to solve some issues that arise naturally with PCA when working with large vocabularies and large corpora. PCA has some scalability problems:
- the original co-occurrence matrix is very costly to obtain and store, and it is only needed as an intermediate step to be later transformed;
- dimensionality reduction is also very costly and, with the dimensions at hand, it can only be done with iterative methods;
- the entire method is not incremental: if we want to add new words to our corpus, we have to recompute the entire co-occurrence matrix and then re-perform the PCA step.
Random Indexing (Sahlgren, 2005) solves these problems: it is an incremental method (new words can be easily added at any time at low computational cost) which creates word vectors of reduced dimension without the need to create the full-dimensional matrix.
Interpretability of compacted distributional semantic vectors is comparable to the interpretability of
distributed representations obtained with the same techniques.
4.3 Learning representations: word2vec
Recently, the distributional hypothesis has invaded neural networks: word2vec (Mikolov et al., 2013) uses contextual information to learn word vectors. Hence, we discuss this technique in the section devoted to distributional semantics.
The name word2vec comprises two similar techniques, called skip-gram and continuous bag of words (CBOW). Both methods are neural networks: the former takes a word as input and tries to predict its context, while the latter does the reverse, predicting a word from the words surrounding it. With these techniques there is no explicitly computed co-occurrence matrix, nor an explicit association feature between pairs of words; instead, the regularities and the distribution of the words are learned implicitly by the network.
We describe only CBOW because it is conceptually simpler and because the core ideas are the same in
both cases. The full network is generally realized with two layers $W^1_{n \times k}$ and $W^2_{k \times n}$ plus a softmax layer to reconstruct the final vector representing the word. In the learning phase, the input and the output of the network are local representations for words. In CBOW, the network aims to predict a target word given its context words. For example, given the sentence $s_1$ of the corpus in Table 1, the network has to predict catches given its context (see Figure 1).
Hence, CBOW offers an encoder $W^1_{n \times k}$, that is, a linear word encoder learned from data, where $n$ is the size of the vocabulary and $k$ is the size of the distributional vectors. This encoder models contextual information learned by maximizing the prediction capability of the network. A nice description of how this approach is related to previous techniques is given in (Goldberg and Levy, 2014).
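A minimal NumPy sketch of the CBOW forward pass on the toy vocabulary of Table 1; there is no training loop here, and the initialization and sizes are arbitrary, so the prediction is meaningless until $W^1$ and $W^2$ are learned:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["a", "cat", "dog", "mouse", "catches", "eats"]
idx = {w: i for i, w in enumerate(vocab)}
n, k = len(vocab), 4                         # vocabulary size, embedding size

W1 = rng.normal(0, 0.1, (n, k))              # encoder: local -> distributed
W2 = rng.normal(0, 0.1, (k, n))              # decoder towards the softmax layer

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cbow_forward(context_words):
    """Predict a target word from the average of its context embeddings."""
    h = np.mean([W1[idx[w]] for w in context_words], axis=0)   # hidden layer
    return softmax(h @ W2)                                     # distribution over the vocabulary

# For s1 = "a cat catches a mouse", predict "catches" from its context.
p = cbow_forward(["a", "cat", "a", "mouse"])
print(vocab[int(np.argmax(p))], p)
# Training would adjust W1 and W2 so that p concentrates on "catches";
# the rows of the learned W1 are the word2vec word vectors.
```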
Clearly, CBOW distributional vectors are not easily interpretable, either by humans or by machines. In fact, specific dimensions of the vectors do not have a particular meaning and, differently from what happens for autoencoders (see Sec. 3.2.1), these networks are not trained to be invertible.
5 COMPOSING DISTRIBUTED REPRESENTATIONS
In the previous sections, we described how one symbol or a bag of symbols can be transformed into distributed representations, focusing on whether these distributed representations are interpretable. In this section, we want to investigate a second and important aspect of these representations: do they offer the concatenative compositionality of symbolic representations? And, once these representations are composed, are they still interpretable?
Concatenative Compositionality is the ability of a symbolic representation to describe sequences or
structures by composing symbols with specific rules. In this process, symbols remain distinct and composing
rules are clear. Hence, final sequences and structures can be used for subsequent steps as knowledge
repositories.
Concatenative Compositionality is an important aspect for any representation and, then, for a distributed
representation. Understanding to what extent a distributed representation has concatenative compositionality
and how information can be recovered is then a critical issue. In fact, this issue has been strongly posed by Plate (Plate, 1995, 1994), who analyzed how some specific distributed representations encode structural information and how this structural information can be recovered back.
Current approaches for treating distributed/distributional representation of sequences and structures mix
two aspects in one model: a “semantic” aspect and a representational aspect. Generally, the semantic aspect is the predominant one and the representational aspect is left aside. By “semantic” aspect, we refer to
the reason why distributed symbols are composed: a final task in neural network applications or the need
to give a distributional semantic vector for sequences of words. This latter is the case for compositional
distributional semantics (Clark et al., 2008; Baroni et al., 2014). For the representational aspect, we
refer to the fact that composed distributed representations are in fact representing structures and these
representations can be decoded back in order to extract what is in these structures.
Although the “semantic” aspect seems to be predominant in models-that-compose, the convolution
conjecture (Zanzotto et al., 2015) hypothesizes that the two aspects coexist and the representational aspect
always plays a crucial role. According to this conjecture, structural information is preserved in any model
that composes and structural information emerges back when comparing two distributed representations
with dot product to determine their similarity.
Hence, given the convolution conjecture, models-that-compose produce distributed representations for structures that can be interpreted back. Interpretability is a very important feature of these models-that-compose and it will drive our analysis.
In this section we will explore the issues faced with the compositionality of representations, and the
main “trends”, which correspond somewhat to the categories already presented. In particular we will start
from the work on compositional distributional semantics, then we review the work on holographic reduced representations (Plate, 1995; Neumann, 2001) and, finally, we analyze the recent approaches with recurrent and recursive neural networks. Again, these categories are not entirely disjoint, and methods presented in one class can often be interpreted as belonging to another class.
5.1 Compositional Distributional Semantics
In distributional semantics, models-that-compose have the name of compositional distributional semantics
models (CDSMs) (Baroni et al., 2014; Mitchell and Lapata, 2010) and aim to apply the principle of
compositionality (Frege, 1884; Montague, 1974) to compute distributional semantic vectors for phrases.
These CDSMs produce distributional semantic vectors of phrases by composing distributional vectors
of words in these phrases. These models generally exploit structured or syntactic representations of
phrases to derive their distributional meaning. Hence, CDSMs aim to give a complete semantic model for
distributional semantics.
As in distributional semantics for words, the aim of CDSMs is to produce similar vectors for semantically
similar sentences regardless of their lengths or structures. For example, words and word definitions in
dictionaries should have similar vectors as discussed in (Zanzotto et al., 2010). As usual in distributional
semantics, similarity is captured with dot products (or similar metrics) among distributional vectors.
The applications of these CDSMs encompass multi-document summarization, recognizing textual
entailment (Dagan et al., 2013) and, obviously, semantic textual similarity detection (Agirre et al., 2013).
At first sight, these CDSMs seem to be far from having concatenative compositionality, that is, from producing distributed representations that can be interpreted back. In some sense, by their nature, the resulting vectors forget how they were obtained and focus on the final distributional meaning of phrases. There is some evidence, however, that this is not exactly the case.
The convolution conjecture (Zanzotto et al., 2015) suggests that many CDSMs produce distributional vectors where structural information and vectors for individual words can still be interpreted. Hence, many CDSMs have the concatenative compositionality property and are interpretable.
In the rest of this section, we will show some classes of these CDSMs and we will focus on describing how these models are interpretable.
5.1.1 Additive Models
Additive models for compositional distributional semantics are important examples of models-that-compose where the semantic and the representational aspects are clearly separated. Hence, these models can be highly interpretable.
These additive models have been formally captured in the general framework for two-word sequences proposed by Mitchell&Lapata (Mitchell and Lapata, 2008). The general framework for composing
distributional vectors of a two-word sequence “u v” is the following:
$$p = f(u, v; R; K) \qquad (3)$$
where $p \in \mathbb{R}^n$ is the composition vector, $u$ and $v$ are the vectors for the two words u and v, $R$ is the grammatical relation linking the two words and $K$ is any other additional knowledge used in the composition operation. In the additive model, this equation has the following form:
$$p = f(u, v; R; K) = A_R u + B_R v \qquad (4)$$
where $A_R$ and $B_R$ are two square matrices depending on the grammatical relation $R$, which may be learned from data (Zanzotto et al., 2010; Guevara, 2010).
Before investigating whether these models are interpretable, let us introduce a recursive formulation of additive models which can be applied to structural representations of sentences. For this purpose, we use dependency trees. A dependency tree can be defined as a tree whose nodes are words and whose typed links are the relations between two words. The root of the tree represents the word that governs the meaning of the sentence. A dependency tree $T$ is then a word if it is a final node, or it has a root $r_T$ and links $(r_T, R, C_i)$ where $C_i$ is the $i$-th subtree of the node $r_T$ and $R$ is the relation that links the node $r_T$ with $C_i$. The dependency trees of two example sentences are reported in Figure 2. The recursive formulation is then the following:
$$f_r(T) = \sum_i (A_R\, \mathbf{r}_T + B_R\, f_r(C_i))$$
According to the recursive definition of the additive model, the function $f_r(T)$ results in a linear combination of elements $M_s \mathbf{w}_s$, where $M_s$ is a product of matrices that represents the structure and $\mathbf{w}_s$ is the distributional meaning of one word in this structure, that is:
$$f_r(T) = \sum_{s \in S(T)} M_s \mathbf{w}_s$$
where $S(T)$ are the relevant substructures of $T$. In this case, $S(T)$ contains the link chains. For example, the first sentence in Fig. 2 has a distributed vector defined in this way:
$$f_r(\textrm{cows eat animal extracts}) = A_{VN}\mathbf{eat} + B_{VN}\mathbf{cows} + A_{VN}\mathbf{eat} + B_{VN} f_r(\textrm{animal extracts}) = A_{VN}\mathbf{eat} + B_{VN}\mathbf{cows} + A_{VN}\mathbf{eat} + B_{VN}A_{NN}\mathbf{extracts} + B_{VN}B_{NN}\mathbf{animal}$$
Each term of the sum has a part that represents the structure and a part that represents the meaning, for example:
$$\underbrace{B_{VN}B_{NN}}_{\textrm{structure}}\ \underbrace{\mathbf{beef}}_{\textrm{meaning}}$$
Figure 2. A sentence and its dependency graph
Hence, this recursive additive model for compositional semantics is a model-that-composes which, in principle, can be highly interpretable. By selecting matrices $M_s$ such that:
$$M_{s_1}^T M_{s_2} \approx \begin{cases} I & s_1 = s_2 \\ 0 & s_1 \neq s_2 \end{cases} \qquad (5)$$
it is possible to recover the distributional semantic vectors related to words that are in specific parts of the structure. For example, the main verb of the sample sentence in Fig. 2 can be recovered with the matrix $A_{VN}^T$, that is:
$$A_{VN}^T f_r(\textrm{cows eat animal extracts}) \approx 2\,\mathbf{eat}$$
In general, matrices derived for compositional distributional semantic models (Guevara, 2010; Zanzotto et al., 2010) do not have this property, but it is possible to obtain matrices with this property by applying the Johnson-Lindenstrauss transform (Johnson and Lindenstrauss, 1984) or similar techniques, as also discussed in (Zanzotto et al., 2015).
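A sketch of this recovery, where random Gaussian matrices play the role of $A_R$ and $B_R$ (an assumption: such matrices approximately satisfy Eq. 5, while the matrices of the cited models are learned) and random vectors stand in for the distributional vectors of the words:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024                                           # dimension of the distributional space

def jl_matrix():
    """Random matrix with nearly orthonormal columns: M1.T @ M2 ~ I if M1 is M2, ~ 0 otherwise."""
    return rng.normal(0, 1.0 / np.sqrt(d), (d, d))

A_VN, B_VN, A_NN, B_NN = (jl_matrix() for _ in range(4))

# Random stand-ins for the distributional vectors of the words.
eat, cows, extracts, animal = (rng.normal(0, 1.0 / np.sqrt(d), d) for _ in range(4))

# Recursive additive composition of "cows eat animal extracts" (see the expansion above).
f = (A_VN @ eat + B_VN @ cows + A_VN @ eat
     + B_VN @ (A_NN @ extracts + B_NN @ animal))

recovered = A_VN.T @ f                             # approximately 2 * eat plus noise terms
print(recovered @ eat, recovered @ cows)           # ~2 * |eat|^2 versus ~0: eat is the head verb
```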
5.1.2 Lexical Functional Compositional Distributional Semantic Models
Lexical Functional Models are compositional distributional semantic models where words are tensors and
each type of word is represented by tensors of different order. Composing meaning is then composing these
tensors to obtain vectors. These models have a solid mathematical background linking Lambek pregroup theory, formal semantics and distributional semantics (Coecke et al., 2010). Lexical Function models are concatenative compositional, yet, in the following, we will examine whether these models produce vectors that may be interpreted.
To determine whether these models produce interpretable vectors, we start from a simple Lexical Function model applied to two-word sequences. This model has been largely analyzed in (Baroni and Zamparelli,
2010) as matrices were considered better linear models to encode adjectives.
In Lexical Functional models over two-word sequences, one of the two words acts as a tensor of order 2 (that is, a matrix) while the other word is represented by a vector. For example, in adjective-noun sequences, adjectives are matrices and nouns are vectors (Baroni and Zamparelli, 2010). Hence, adjective-noun sequences like “black cat” or “white dog” are represented as:
$$f(\textrm{black cat}) = \mathbf{BLACK}\,\mathbf{cat}$$
$$f(\textrm{white dog}) = \mathbf{WHITE}\,\mathbf{dog}$$
where $\mathbf{BLACK}$ and $\mathbf{WHITE}$ are matrices representing the two adjectives and $\mathbf{cat}$ and $\mathbf{dog}$ are the two vectors representing the two nouns.
These two-word models are partially interpretable: knowing the adjective, it is possible to extract the noun, but not vice-versa. In fact, if the matrices for adjectives are invertible, there is the possibility of extracting which noun has been related to a particular adjective. For example, if $\mathbf{BLACK}$ is invertible, the inverse matrix $\mathbf{BLACK}^{-1}$ can be used to extract the vector of cat from the vector $f(\textrm{black cat})$:
$$\mathbf{cat} = \mathbf{BLACK}^{-1} f(\textrm{black cat})$$
This contributes to the interpretability of the model. Moreover, if the matrices for adjectives are built using Johnson-Lindenstrauss transforms (Johnson and Lindenstrauss, 1984), that is, matrices with the property in Eq. 5, it is possible to pack different pieces of sentences into a single vector and, then, select only the relevant information, for example:
$$\mathbf{cat} \approx \mathbf{BLACK}^T (f(\textrm{black cat}) + f(\textrm{white dog}))$$
On the contrary, knowing the noun vectors, it is not possible to extract back the adjective matrices. This is a strong limitation in terms of interpretability.
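A short sketch of this asymmetry with random stand-ins for the adjective matrices and the noun vectors (an assumption; the actual matrices of Baroni and Zamparelli (2010) are learned from corpus data):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 1024

# Nearly orthonormal random matrices playing the role of the adjective matrices.
BLACK = rng.normal(0, 1.0 / np.sqrt(d), (d, d))
WHITE = rng.normal(0, 1.0 / np.sqrt(d), (d, d))
cat = rng.normal(0, 1.0 / np.sqrt(d), d)
dog = rng.normal(0, 1.0 / np.sqrt(d), d)

black_cat = BLACK @ cat                      # f(black cat)
white_dog = WHITE @ dog                      # f(white dog)

# Knowing the adjective, the noun can be (approximately) recovered...
recovered = BLACK.T @ (black_cat + white_dog)
print(recovered @ cat, recovered @ dog)      # ~|cat|^2 versus ~0

# ...but knowing only the noun vectors, there is no analogous way to recover BLACK.
```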
Lexical Functional models for larger structures are concatenative compositional but not interpretable at all. In fact, in general these models have tensors in the middle, and these tensors are the only parts that can be inverted; hence, in general, these models are not interpretable. However, using the convolution conjecture (Zanzotto et al., 2015), it is possible to know whether subparts are contained in the final vectors obtained with these models.
5.2 Holographic Representations
Holographic reduced representations (HRRs) are models-that-compose expressly designed to be interpretable (Plate, 1995; Neumann, 2001). In fact, these models encode flat structures representing assertions, and these assertions should then be searched in order to recover the pieces of knowledge they contain. For example, these representations have been used to encode logical propositions such as eat(John, apple). In this case, each atomic element has an associated vector and the vector for the compound is obtained by combining these vectors. The major concern here is to build encoding functions that can be decoded, that is, it should be possible to retrieve the composing elements from final distributed vectors such as the vector of eat(John, apple).
In HRRs, nearly orthogonal unit vectors (Johnson and Lindenstrauss, 1984) for the basic symbols, circular convolution $\otimes$ and circular correlation $\oplus$ guarantee composability and interpretability. HRRs are the extension of Random Indexing (see Sec. 3.1) to structures. Hence, symbols are represented with vectors sampled from a multivariate normal distribution $N(\mathbf{0}, \frac{1}{d}I_d)$. The composition function is the circular convolution, indicated as $\otimes$ and defined as:
$$z_j = (\mathbf{a} \otimes \mathbf{b})_j = \sum_{k=0}^{d-1} a_k b_{j-k}$$
where subscripts are modulo $d$. Circular convolution is commutative and bilinear. This operation can also be computed using circulant matrices:
$$\mathbf{z} = \mathbf{a} \otimes \mathbf{b} = A\mathbf{b} = B\mathbf{a}$$
where $A$ and $B$ are the circulant matrices of the vectors $\mathbf{a}$ and $\mathbf{b}$. Given the properties of the vectors $\mathbf{a}$ and $\mathbf{b}$, the matrices $A$ and $B$ have the property in Eq. 5. Hence, circular convolution is approximately invertible with the circular correlation function ($\oplus$), defined as follows:
$$c_j = (\mathbf{z} \oplus \mathbf{b})_j = \sum_{k=0}^{d-1} b_k\, z_{j+k}$$
where again subscripts are modulo $d$. Circular correlation corresponds to multiplying by the transposed circulant matrix $B^{\top}$, which approximates the inverse of $B$. In the decoding with $\oplus$, parts of the structures can be derived in an approximate way, that is:
$$(\mathbf{a} \otimes \mathbf{b}) \oplus \mathbf{b} \approx \mathbf{a}$$
Hence, circular convolution $\otimes$ and circular correlation $\oplus$ make it possible to build interpretable representations. For example, given the vectors $\mathbf{e}$, $\mathbf{J}$, and $\mathbf{a}$ for eat, John and apple, respectively, the following encoding and decoding produces a vector that approximates the original vector for John:
$$\mathbf{J} \approx (\mathbf{J} \otimes \mathbf{e} \otimes \mathbf{a}) \oplus (\mathbf{e} \otimes \mathbf{a})$$
The “invertibility” of these representations is important because it allows us not to treat these representations as black boxes.
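The following numpy sketch (our own illustration; the FFT is used only as a fast way of computing the two operations defined above) encodes eat(John, apple) and decodes an approximation of John:

```python
import numpy as np

d = 2048
rng = np.random.default_rng(2)

def symbol():
    # basic symbols: vectors sampled from N(0, (1/d) I_d)
    return rng.normal(0.0, 1.0 / np.sqrt(d), size=d)

def cconv(a, b):
    # circular convolution a (x) b, computed in the frequency domain
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def ccorr(z, b):
    # circular correlation z (+) b: approximately undoes convolution with b
    return np.real(np.fft.ifft(np.conj(np.fft.fft(b)) * np.fft.fft(z)))

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

e, J, a = symbol(), symbol(), symbol()      # eat, John, apple

trace = cconv(cconv(J, e), a)               # encoding of eat(John, apple)
J_hat = ccorr(trace, cconv(e, a))           # decode John by correlating away e (x) a

print(cos(J, J_hat))                        # clearly positive: a noisy copy of John
print(cos(a, J_hat))                        # much lower: apple is not what gets retrieved
```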
However, holographic representations have severe limitations, as they can encode and decode only simple, flat structures. In fact, these representations are based on circular convolution, which is a commutative function; this implies that the representation cannot keep track of compositions of objects where the order matters, a phenomenon that is particularly important when encoding nested structures.
Distributed trees (Zanzotto and Dell’Arciprete, 2012) have shown that the principles behind holographic representations can be applied to encode larger structures, overcoming the problem of reliably encoding the order in which elements are composed by using the shuffled circular convolution function as the composition operator. Distributed trees are encoding functions that transform trees into low-dimensional vectors that also contain the encoding of every substructure of the tree. These distributed trees are thus particularly attractive, as they can be used to represent structures in linear learning machines, which are computationally efficient.
Distributed trees and, in particular, distributed smoothed trees (Ferrone and Zanzotto, 2014) represent an interesting middle way between compositional distributional semantic models and holographic representations.
5.3 Compositional Models in Neural Networks
When neural networks are applied to sequences or structured data, they are in fact models-that-compose. However, the resulting models-that-compose are not interpretable: composition functions are trained for specific tasks and not for the possibility of reconstructing the structured input, except in some rare cases (Socher et al., 2011). The inputs of these networks are sequences or structured data whose basic symbols are embedded in local representations or in distributed representations obtained with word embeddings (see Sec. 4.3). The outputs are distributed vectors derived for specific tasks. Hence, these models-that-compose are not interpretable in our sense, both because of their task-specific aim and because non-linear functions are adopted in the specification of the networks.
In this section, we review some prominent neural network architectures that can be interpreted as models-that-compose: recurrent neural networks (Krizhevsky et al., 2012; He et al., 2016; Vinyals et al., 2015a; Graves, 2013) and recursive neural networks (Socher et al., 2012).
5.3.1 Recurrent Neural Networks
Recurrent neural networks form a very broad family of neural network architectures that deal with the representation (and processing) of complex objects. At its core, a recurrent neural network (RNN) is a network that takes as input the current element of a sequence and processes it based on an internal state which depends on previous inputs. At the moment, the most powerful network architectures are convolutional neural networks (Krizhevsky et al., 2012; He et al., 2016) for vision-related tasks and LSTM-type networks for language-related tasks (Vinyals et al., 2015a; Graves, 2013).
A recurrent neural network takes as input a sequence $\mathbf{x} = (\mathbf{x}_1 \dots \mathbf{x}_n)$ and produces as output a single vector $\mathbf{y} \in \mathbb{R}^n$, which is a representation of the entire sequence. At each step¹ $t$, the network takes as input the current element $\mathbf{x}_t$ and the previous output $\mathbf{h}_{t-1}$, and performs the following operation to produce the current output $\mathbf{h}_t$:
$$\mathbf{h}_t = \sigma(W[\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}) \tag{6}$$
where $\sigma$ is a non-linear function such as the logistic function or the hyperbolic tangent, and $[\mathbf{h}_{t-1}; \mathbf{x}_t]$ denotes the concatenation of the vectors $\mathbf{h}_{t-1}$ and $\mathbf{x}_t$. The parameters of the model are the matrix $W$ and the bias vector $\mathbf{b}$.
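A minimal numpy sketch of the recurrence in Eq. 6 (with randomly initialized, untrained parameters, just to show the mechanics):

```python
import numpy as np

d_in, d_h = 50, 100                      # input and hidden dimensions (illustrative)
rng = np.random.default_rng(3)

W = rng.normal(0.0, 0.1, size=(d_h, d_h + d_in))   # parameters: the matrix W ...
b = np.zeros(d_h)                                   # ... and the bias vector b

def rnn_step(h_prev, x_t):
    # Eq. 6: h_t = sigma(W [h_{t-1}; x_t] + b), with sigma = tanh here
    return np.tanh(W @ np.concatenate([h_prev, x_t]) + b)

x = rng.normal(size=(7, d_in))           # a sequence of 7 input vectors
h = np.zeros(d_h)
for x_t in x:
    h = rnn_step(h, x_t)                 # h ends up representing the whole sequence
print(h.shape)                           # (100,)
```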
Hence, a recurrent neural network is effectively a learned composition function, which dynamically depends on its current input, on all of its previous inputs and on the dataset on which it is trained. However, this learned composition function is basically impossible to analyze or interpret in any way. Sometimes an “intuitive” explanation is given about what the learned weights represent, with some weights representing information that must be remembered or forgotten.
Even more complex recurrent neural networks, such as long short-term memory (LSTM) networks (Hochreiter and Schmidhuber, 1997), have the same interpretability problem. LSTMs are a recent and successful way for neural networks to deal with longer sequences of inputs, overcoming some difficulties that RNNs face in the training phase. As with RNNs, an LSTM network takes as input a sequence $\mathbf{x} = (\mathbf{x}_1 \dots \mathbf{x}_n)$ and produces as output a single vector $\mathbf{y} \in \mathbb{R}^n$, which is a representation of the entire sequence. At each step $t$, the network takes as input the current element $\mathbf{x}_t$ and the previous output $\mathbf{h}_{t-1}$, and performs the following operations to produce the current output $\mathbf{h}_t$ and update the internal state $\mathbf{c}_t$:
$$\begin{aligned}
\mathbf{f}_t &= \sigma(W_f[\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_f)\\
\mathbf{i}_t &= \sigma(W_i[\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_i)\\
\mathbf{o}_t &= \sigma(W_o[\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_o)\\
\tilde{\mathbf{c}}_t &= \tanh(W_c[\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_c)\\
\mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t\\
\mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t)
\end{aligned}$$
¹ We can usually think of this as a timestep, but not all applications of recurrent neural networks have a temporal interpretation.
where $\odot$ stands for element-wise multiplication, and the parameters of the model are the matrices $W_f, W_i, W_o, W_c$ and the bias vectors $\mathbf{b}_f, \mathbf{b}_i, \mathbf{b}_o, \mathbf{b}_c$.
Figure 3. A simple binary tree
Figure 4. Recursive Neural Networks
Generally, the interpretation offered for recurrent neural networks is functional or “psychological” and does not concern the content of the intermediate vectors. For example, an interpretation of the parameters of the LSTM is the following:
$\mathbf{f}_t$ is the forget gate: at each step it takes into consideration the new input and the output computed so far to decide which information in the internal state must be forgotten (that is, set to 0);
$\mathbf{i}_t$ is the input gate: it decides which positions in the internal state will be updated, and by how much;
$\tilde{\mathbf{c}}_t$ is the proposed new internal state, which then effectively updates the state in combination with the previous gates;
$\mathbf{o}_t$ is the output gate: it decides how to modulate the internal state to produce the output.
These models-that-compose have high performance on final tasks but are definitely not interpretable.
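To make the above equations concrete, here is a bare numpy sketch of a single LSTM cell (random, untrained parameters); it also makes apparent that each step is just a handful of matrix products and element-wise gates, whose intermediate vectors carry no directly readable symbolic content:

```python
import numpy as np

d_in, d_h = 50, 100
rng = np.random.default_rng(4)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one weight matrix and one bias per gate (f, i, o) plus the candidate state c~
Wf, Wi, Wo, Wc = (rng.normal(0.0, 0.1, size=(d_h, d_h + d_in)) for _ in range(4))
bf, bi, bo, bc = (np.zeros(d_h) for _ in range(4))

def lstm_step(h_prev, c_prev, x_t):
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}; x_t]
    f = sigmoid(Wf @ z + bf)                 # forget gate
    i = sigmoid(Wi @ z + bi)                 # input gate
    o = sigmoid(Wo @ z + bo)                 # output gate
    c_tilde = np.tanh(Wc @ z + bc)           # proposed new internal state
    c = f * c_prev + i * c_tilde             # update the internal state
    h = o * np.tanh(c)                       # produce the output
    return h, c

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(7, d_in)):
    h, c = lstm_step(h, c, x_t)
print(h.shape, c.shape)                      # (100,) (100,)
```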
5.3.2 Recursive Neural Network
The last class of models-that-compose that we present is the class of recursive neural networks (Socher et al., 2012). These networks are applied to data structures such as trees and are in fact applied recursively on the structure. Generally, the aim of the network is a final task such as sentiment analysis or paraphrase detection.
A recursive neural network is then a basic block (see Fig. 4) which is recursively applied on trees like the one in Fig. 3. The formal definition is the following:
$$\mathbf{p} = f_{U,V}(\mathbf{u}, \mathbf{v}) = f(V\mathbf{u}, U\mathbf{v}) = g\!\left(W \begin{bmatrix} V\mathbf{u} \\ U\mathbf{v} \end{bmatrix}\right)$$
where $g$ is a component-wise sigmoid function or $\tanh$, and $W$ is a matrix that maps the concatenated vector $\begin{bmatrix} V\mathbf{u} \\ U\mathbf{v} \end{bmatrix}$ back to the original dimension.
This method deals naturally with recursion: given a binary parse tree of a sentence $s$, the algorithm creates vector and matrix representations for each node, starting from the terminal nodes. Words are represented by distributed representations or local representations. For example, the tree in Fig. 3 is processed by the recursive network in the following way. First, the network in Fig. 4 is applied to the pair (animal, extracts) and $f_{U,V}(\text{animal}, \text{extracts})$ is obtained. Then, the network is applied to the result and eat, obtaining $f_{U,V}(\text{eat}, f_{U,V}(\text{animal}, \text{extracts}))$, and so on.
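A bare-bones numpy sketch of this recursive application (random parameters, random word vectors, and the bracketing of Fig. 3 hard-coded; only meant to mirror the equation above):

```python
import numpy as np

d = 50
rng = np.random.default_rng(5)

U, V = rng.normal(0.0, 0.1, size=(d, d)), rng.normal(0.0, 0.1, size=(d, d))
W = rng.normal(0.0, 0.1, size=(d, 2 * d))    # maps the concatenation back to dimension d

def compose(u, v):
    # p = g(W [V u ; U v]) with g = tanh
    return np.tanh(W @ np.concatenate([V @ u, U @ v]))

# word vectors (random stand-ins for distributed or local representations)
words = {w: rng.normal(size=d) for w in ["cows", "eat", "animal", "extracts"]}

# bottom-up traversal of the tree in Fig. 3: (cows (eat (animal extracts)))
np_node = compose(words["animal"], words["extracts"])   # NP
vp_node = compose(words["eat"], np_node)                # VP
s_node  = compose(words["cows"], vp_node)               # S
print(s_node.shape)                                     # (50,)
```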
Recursive neural networks are not easily interpretable, even though they are quite similar to the additive compositional distributional semantic models presented in Sec. 5.1.1. In fact, it is the non-linear function $g$ that makes the final vectors less interpretable.
6 CONCLUSIONS
Natural language is a symbolic representation. Thinking of natural language understanding systems that are not based on symbols seems extremely odd. However, recent advances in machine learning (ML) and in natural language processing (NLP) seem to contradict this intuition: symbols are fading away, erased by vectors or tensors called distributed and distributional representations.
We wrote this survey to show the unsurprising link between symbolic representations and distributed/distributional representations. This is the right time to revitalize the area of interpreting how symbols are represented inside neural networks. In our opinion, this survey will help to devise new deep neural networks that can exploit existing and novel symbolic models of classical natural language processing tasks. We believe that a clearer understanding of the strict link between distributed/distributional representations and symbols may lead to radically new deep learning networks.
REFERENCES
Achlioptas, D. (2003). Database-friendly random projections: Johnson-Lindenstrauss with binary coins.
Journal of Computer and System Sciences 66, 671–687
Agirre, E., Cer, D., Diab, M., Gonzalez-Agirre, A., and Guo, W. (2013). *sem 2013 shared task: Semantic
textual similarity. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume
1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity (Atlanta,
Georgia, USA: Association for Computational Linguistics), 32–43
Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and
translate. arXiv preprint arXiv:1409.0473
Baroni, M., Bernardi, R., and Zamparelli, R. (2014). Frege in space: A program of compositional
distributional semantics. LiLT (Linguistic Issues in Language Technology) 9
Baroni, M. and Lenci, A. (2010). Distributional memory: A general framework for corpus-based semantics.
Comput. Linguist. 36, 673–721. doi:10.1162/coli_a_00016
Baroni, M. and Zamparelli, R. (2010). Nouns are vectors, adjectives are matrices: Representing adjective-
noun constructions in semantic space. In Proceedings of the 2010 Conference on Empirical Methods in
Natural Language Processing (Cambridge, MA: Association for Computational Linguistics), 1183–1193
Belkin, M. and Niyogi, P. (2001). Laplacian eigenmaps and spectral techniques for embedding and
clustering. In NIPS. vol. 14, 585–591
Bellman, R. and Corporation, R. (1957). Dynamic Programming. Rand Corporation research study
(Princeton University Press)
Bingham, E. and Mannila, H. (2001). Random projection in dimensionality reduction: applications to image
and text data. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge
discovery and data mining (ACM), 245–250
Blutner, R., Hendriks, P., and de Hoop, H. (2003). A new hypothesis on compositionality. In Proceedings
of the joint international conference on cognitive science
Chalmers, D. J. (1992). Syntactic Transformations on Distributed Representations (Dordrecht: Springer
Netherlands). 46–55. doi:10.1007/978-94-011-2624-3_3
Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., et al. (2014). cudnn: Efficient
primitives for deep learning. arXiv preprint arXiv:1410.0759
Chomsky, N. (1957). Aspect of Syntax Theory (Cambridge, Massachusetts: MIT Press)
Clark, S., Coecke, B., and Sadrzadeh, M. (2008). A compositional distributional model of meaning.
Proceedings of the Second Symposium on Quantum Interaction (QI-2008) , 133–140
Coecke, B., Sadrzadeh, M., and Clark, S. (2010). Mathematical foundations for a compositional
distributional model of meaning. CoRR abs/1003.4394
Cui, H., Ganger, G. R., and Gibbons, P. B. (2015). Scalable deep learning on distributed GPUs with a
GPU-specialized parameter server. Tech. rep., CMU PDL Technical Report (CMU-PDL-15-107)
Dagan, I., Roth, D., Sammons, M., and Zanzotto, F. M. (2013). Recognizing Textual Entailment: Models and
Applications. Synthesis Lectures on Human Language Technologies (Morgan & Claypool Publishers)
Daum, F. and Huang, J. (2003). Curse of dimensionality and particle filters. In Aerospace Conference,
2003. Proceedings. 2003 IEEE (IEEE), vol. 4, 4 1979–4 1993
Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2018). BERT: pre-training of deep bidirectional
transformers for language understanding. CoRR abs/1810.04805
Ferrone, L. and Zanzotto, F. M. (2014). Towards syntax-aware compositional distributional semantic
models. In Proceedings of COLING 2014, the 25th International Conference on Computational Lin-
guistics: Technical Papers (Dublin, Ireland: Dublin City University and Association for Computational
Linguistics), 721–730
Ferrone, L., Zanzotto, F. M., and Carreras, X. (2015). Decoding distributed tree structures. In Statistical
Language and Speech Processing - Third International Conference, SLSP 2015, Budapest, Hungary,
November 24-26, 2015, Proceedings. 73–83. doi:10.1007/978-3-319-25789-1_8
Firth, J. R. (1957). Papers in Linguistics. (London: Oxford University Press.)
Fodor, I. (2002). A Survey of Dimension Reduction Techniques. Tech. rep.
Fodor, J. A. and Pylyshyn, Z. W. (1988). Connectionism and cognitive architecture: A critical analysis.
Cognition 28, 3 – 71. doi:https://doi.org/10.1016/0010-0277(88)90031-5
Frege, G. (1884). Die Grundlagen der Arithmetik (The Foundations of Arithmetic): eine logisch-mathematische Untersuchung über den Begriff der Zahl (Breslau)
Friedman, J. H. (1997). On bias, variance, 0/1—loss, and the curse-of-dimensionality. Data mining and
knowledge discovery 1, 55–77
Gelder, T. V. (1990). Compositionality: A connectionist variation on a classical theme. Cognitive Science
14, 355–384. doi:10.1207/s15516709cog1403_2
Goldberg, Y. and Levy, O. (2014). word2vec explained: deriving mikolov et al.’s negative-sampling
word-embedding method. arXiv preprint arXiv:1402.3722
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al. (2014). Generative
adversarial nets. In Advances in Neural Information Processing Systems. 2672–2680
Graves, A. (2013). Generating sequences with recurrent neural networks. CoRR abs/1308.0850
Grefenstette, E. and Sadrzadeh, M. (2011). Experimental support for a categorical compositional distri-
butional model of meaning. In Proceedings of the Conference on Empirical Methods in Natural
Language Processing (Stroudsburg, PA, USA: Association for Computational Linguistics), EMNLP ’11,
1394–1404
Guevara, E. (2010). A regression model of adjective-noun compositionality in distributional semantics. In
Proceedings of the 2010 Workshop on GEometrical Models of Natural Language Semantics (Uppsala,
Sweden: Association for Computational Linguistics), 33–37
Harris, Z. (1964). Distributional structure. In The Philosophy of Linguistics, eds. J. J. Katz and J. A. Fodor
(New York: Oxford University Press)
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Identity mappings in deep residual networks. arXiv preprint
arXiv:1603.05027
Hinton, G. E., McClelland, J. L., and Rumelhart, D. E. (1986). Distributed representations. In Parallel
Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations, eds.
D. E. Rumelhart and J. L. McClelland (MIT Press, Cambridge, MA.)
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation 9, 1735–1780
Jacovi, A., Shalom, O. S., and Goldberg, Y. (2018). Understanding convolutional neural networks for text classification. 56–65
Jang, K.-r., Kim, S.-b., and Corp, N. (2018). Interpretable word embedding contextualization. 341–343
Johnson, W. and Lindenstrauss, J. (1984). Extensions of lipschitz mappings into a hilbert space. Contemp.
Math. 26, 189–206
Kalchbrenner, N. and Blunsom, P. (2013). Recurrent convolutional neural networks for discourse com-
positionality. Proceedings of the 2013 Workshop on Continuous Vector Space Models and their
Compositionality
Keogh, E. and Mueen, A. (2011). Curse of dimensionality. In Encyclopedia of Machine Learning
(Springer). 257–258
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional
neural networks. In Advances in neural information processing systems. 1097–1105
Landauer, T. K. and Dumais, S. T. (1997). A solution to plato’s problem: The latent semantic analysis
theory of acquisition, induction, and representation of knowledge. Psychological Review 104, 211–240
LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature 521, 436–444
Liou, C.-Y., Cheng, W.-C., Liou, J.-W., and Liou, D.-R. (2014). Autoencoder for words. Neurocomputing
139, 84 – 96. doi:http://dx.doi.org/10.1016/j.neucom.2013.09.055
Lipton, Z. C. (2016). The Mythos of Model Interpretability doi:10.1145/3233231
Markovsky, I. (2012). Low rank approximation: Algorithms, implementation, applications
Masci, J., Meier, U., Cireşan, D., and Schmidhuber, J. (2011). Stacked convolutional auto-encoders for hierarchical feature extraction. In International Conference on Artificial Neural Networks (Springer), 52–59
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in
vector space. CoRR abs/1301.3781
Mitchell, J. and Lapata, M. (2008). Vector-based models of semantic composition. In Proceedings of
ACL-08: HLT (Columbus, Ohio: Association for Computational Linguistics), 236–244
Mitchell, J. and Lapata, M. (2010). Composition in distributional models of semantics. Cognitive Science
doi:10.1111/j.1551-6709.2010.01106.x
Montague, R. (1974). English as a formal language. In Formal Philosophy: Selected Papers of Richard
Montague, ed. R. Thomason (New Haven: Yale University Press). 188–221
Neumann, J. (2001). Holistic processing of hierarchical structures in connectionist networks. Ph.D. thesis,
University of Edinburgh
Pado, S. and Lapata, M. (2007). Dependency-based construction of semantic space models. Computational
Linguistics 33, 161–199
Pearson, K. (1901). Principal components analysis. The London, Edinburgh and Dublin Philosophical
Magazine and Journal 6, 566
Plate, T. A. (1994). Distributed Representations and Nested Compositional Structure. Ph.D. thesis
Plate, T. A. (1995). Holographic reduced representations. IEEE Transactions on Neural Networks 6,
623–641. doi:10.1109/72.377968
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in
the brain. Psychological Reviews 65, 386–408
Rothenhäusler, K. and Schütze, H. (2009). Unsupervised classification with dependency based word spaces. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics (Stroudsburg, PA, USA: Association for Computational Linguistics), GEMS ’09, 17–24
Sahlgren, M. (2005). An introduction to random indexing. In Proceedings of the Methods and Applications
of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge
Engineering TKE (Copenhagen, Denmark)
Salton, G. (1989). Automatic text processing: the transformation, analysis and retrieval of information by
computer (Addison-Wesley)
Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks 61, 85–117
Schuster, M. and Paliwal, K. (1997). Bidirectional recurrent neural networks. Trans. Sig. Proc. 45,
2673–2681. doi:10.1109/78.650093
Socher, R., Huang, E. H., Pennington, J., Ng, A. Y., and Manning, C. D. (2011). Dynamic pooling
and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information
Processing Systems 24
Socher, R., Huval, B., Manning, C. D., and Ng, A. Y. (2012). Semantic Compositionality Through
Recursive Matrix-Vector Spaces. In Proceedings of the 2012 Conference on Empirical Methods in
Natural Language Processing (EMNLP)
Sorzano, C. O. S., Vargas, J., and Montano, A. P. (2014). A survey of dimensionality reduction techniques.
arXiv preprint arXiv:1403.2877
Turney, P. D. (2006). Similarity of semantic relations. Comput. Linguist. 32, 379–416. doi:10.1162/coli.2006.32.3.379
Turney, P. D. and Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. J. Artif.
Intell. Res. (JAIR) 37, 141–188
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). Attention is
all you need. In Advances in Neural Information Processing Systems 30, eds. I. Guyon, U. V. Luxburg,
S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Curran Associates, Inc.). 5998–6008
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing robust
features with denoising autoencoders. In Proceedings of the 25th international conference on Machine
learning (ACM), 1096–1103
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denoising
autoencoders: Learning useful representations in a deep network with a local denoising criterion. J.
Mach. Learn. Res. 11, 3371–3408
Vinyals, O., Kaiser, L. u., Koo, T., Petrov, S., Sutskever, I., and Hinton, G. (2015a). Grammar as a foreign
language. In Advances in Neural Information Processing Systems 28, eds. C. Cortes, N. D. Lawrence,
D. D. Lee, M. Sugiyama, and R. Garnett (Curran Associates, Inc.). 2755–2763
Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015b). Show and tell: A neural image caption generator.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3156–3164
Weiss, D., Alberti, C., Collins, M., and Petrov, S. (2015). Structured training for neural network
transition-based parsing. arXiv preprint arXiv:1506.06158
Werbos, P. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., et al. (2015). Show, attend and tell:
Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044
Zanzotto, F. M. and Dell’Arciprete, L. (2012). Distributed tree kernels. In Proceedings of International
Conference on Machine Learning
Zanzotto, F. M., Ferrone, L., and Baroni, M. (2015). When the whole is not greater than the combination
of its parts: A “decompositional” look at compositional distributional semantics. Comput. Linguist. 41, 165–173. doi:10.1162/COLI_a_00215
Zanzotto, F. M., Korkontzelos, I., Fallucchi, F., and Manandhar, S. (2010). Estimating linear models
for compositional distributional semantics. In Proceedings of the 23rd International Conference on
Computational Linguistics (COLING)
Zeiler, M. D. and Fergus, R. (2014a). Visualizing and understanding convolutional networks. In Computer
Vision – ECCV 2014, eds. D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars (Cham: Springer International
Publishing), 818–833
Zeiler, M. D. and Fergus, R. (2014b). Visualizing and understanding convolutional networks. In European
Conference on Computer Vision (Springer), 818–833
Zou, W. Y., Socher, R., Cer, D. M., and Manning, C. D. (2013). Bilingual word embeddings for
phrase-based machine translation. In EMNLP. 1393–1398