Textual Description for Mathematical Equations
Ajoy Mondal and C V Jawahar
Centre for Visual Information Technology,
International Institute of Information Technology, Hyderabad, India,
Email: ajoy.mondal@iiit.ac.in and jawahar@iiit.ac.in
Abstract—Reading mathematical expressions or equations in document images is very challenging due to the large variability of mathematical symbols and expressions. In this paper, we pose the reading of a mathematical equation as the task of generating a textual description which interprets the internal meaning of the equation. Inspired by the natural image captioning problem in computer vision, we present a mathematical equation description (MED) model, a novel end-to-end trainable deep neural network based approach that learns to generate a textual description for reading mathematical equation images. Our MED model consists of a convolution neural network as an encoder that extracts features of input mathematical equation images and a recurrent neural network with an attention mechanism which generates descriptions related to the input mathematical equation images. Due to the unavailability of mathematical equation image data sets with textual descriptions, we generate two data sets for experimental purposes. To validate the effectiveness of our MED model, we conduct a real-world experiment to see whether students are able to write equations by only reading or listening to their textual descriptions. Experiments conclude that the students are able to write most of the equations correctly by reading their textual descriptions alone.
Keywords—Mathematical symbols; mathematical expressions; mathematical equation description; document image; convolution neural network; attention; recurrent neural network.
I. INTRODUCTION
Learning mathematics is necessary for students at every stage of their education, and solving mathematical problems is one way to develop and improve their mathematical skills. Unfortunately, blind and visually impaired (VI) students face particular difficulties in learning mathematics due to their limitations in reading and writing mathematical formulas. In general, human readers help these students to access and interpret materials or documents of mathematics courses. However, it is often impossible and impractical for these students to have a human reader, because of the cost and the limited availability of trained personnel. Braille is a popular and more convenient way for blind and VI students to access documents. Unfortunately, many documents are not available in Braille, since the conversion of mathematical documents into Braille is difficult and complicated [1]. Moreover, it is also difficult for students who are only comfortable reading literary Braille transcriptions [2].
Figure 1: Our model treats reading of a mathematical equation in a document image as generation of a textual description which interprets the internal meaning of this equation.

Other than Braille, sound-based representation of documents is also an important and popular way for these students to access information. In this direction, DAISY books and talking books are commonly used audio materials for understanding documents. However, these books are seldom prepared for mathematical expressions or equations (MEs) [2]. Recently, text-to-speech (TTS) systems have been widely used by blind and VI students to read electronic text through computers. TTS systems convert digital text into synthetic speech. Unfortunately, most available TTS systems can read only plain text. They fail to generate appropriate speech when they come across mathematical equations.
Many researchers have realized the importance of enhancing the accessibility of mathematical materials for blind and VI students and have developed TTS-based mathematical expression reading systems [3]–[5]. Some of these systems need extra words to read an ME. Because of the extra words, the rendered audio is very long, and the students may not always be able to grasp the main point of the expression. Moreover, most of these existing automatic math reading systems take MEs as input in the form of LaTeX or other similar markup languages. Unfortunately, not all available mathematical documents are in LaTeX or any other markup language, and generating LaTeX or other markup corresponding to mathematical documents is itself challenging [2].
In this paper, our goal is to develop a framework, called the mathematical equation description (MED) model, which can help blind and VI students to read/interpret the internal meaning of MEs present in document images. We pose reading of an ME as a problem of generating a natural language description. Our proposed MED model automatically generates a textual (natural language) description which can interpret the internal meaning of the ME. For example, ∫ sin x dx is an ME and its textual description is "integration of sin x with respect to x".
Figure 2: Example of the sensitivity of variables, operators and their positions while reading equations. Only '3' in (a) changes position in (b); 'x' in (a) changes position in (c); the 'differentiation operator' in (d) is changed to an 'integration operator' in (e) and to a 'finite integration operator' in (f); the '+' operator in the denominator of (g) is changed to '−' in (h); the '+' operator in the numerator of (g) is changed to '−' in (i); the '+' operators in both numerator and denominator of (g) are changed to '−' in (j); the variable 'x' in (g) is changed to 'z' in (k); the constant 'a' in (g) is changed to 'b' in (l); the limit value 'a' in (g) is replaced by 'a−' in (m); and the limit value 'a' in (g) is changed to 'a+' in (n).
With the textual description, the blind and VI students can easily read/interpret the MEs. Figure 1 shows the generated textual description of an MEI using our MED model. This task is closely related to the image description/captioning task [6]. However, the description of an equation is very sensitive to variables, operators and their positions. Figure 2 illustrates the sensitivity of variables, operators and their positions during the generation of textual descriptions
for MEs. For example, 3x in Figure 2(a) can be read as "three times x". When the '3' changes its position, e.g. x^3 in Figure 2(c), the sentence for reading becomes "x power of three", which is totally different from the previous equation. To the best of our knowledge, this is the first work in which reading/interpreting of MEs is posed as a textual description generation task.
The main inspiration for our work comes from image captioning, a recent advancement in computer vision. In this paper, we propose an end-to-end trainable deep network to generate natural language descriptions for MEs which can read/interpret the internal meaning of these expressions. The network consists of two modules: an encoder and a decoder. The encoder encodes the ME images using a Convolution Neural Network (CNN). A Long Short-Term Memory (LSTM) network serves as the decoder and takes the intermediate representation to generate textual descriptions of the corresponding ME images. The attention mechanism impels the decoder to focus on specific parts of the input image. The encoder, decoder and attention mechanism are trained jointly. We refer to this network as the Mathematical Equation Description (MED) network.
In particular, our contributions are as follows.
• We present an end-to-end trainable deep network combining vision and language models to generate descriptions of MEs for reading/interpreting MEs.
• We generate two data sets with ME images and their corresponding natural language descriptions for our experiments.
• We conduct a real-world experiment to establish the effectiveness of the MED model for reading/interpreting mathematical equations.
II. RELATED WORK
A. Mathematical Expression Recognition
Automatic recognition of MEs is one of the major tasks towards transcribing documents into digital form in the scientific and engineering fields. This task mainly consists of two major steps: symbol recognition and structural analysis [7]. In the case of symbol recognition, the initial task is to segment the symbols and then to recognize the segmented symbols. Finally, structural analysis of the recognized symbols is performed to recognize the mathematical expression. These two problems can be solved either sequentially [8] or within a single (global) framework [9]. However, both the sequential and the global approaches have several limitations: (i) segmentation of mathematical symbols is challenging for both printed and handwritten documents, as they contain a mix of text, expressions and figures; (ii) symbol recognition is difficult because of the large number of symbols, fonts, typefaces and font sizes [7]; (iii) for structural analysis, a two-dimensional context-free grammar is commonly used, which requires prior knowledge to define the math grammar [10]; and (iv) the complexity of the parsing algorithm increases with the size of the math grammar [11].
Due to the success of deep neural networks in computer vision tasks, researchers have adopted deep neural models to recognize mathematical symbols [12], [13] and expressions [14]–[18]. In [12], [13], the authors considered a CNN along with a bidirectional LSTM to recognize mathematical symbols, whereas [14], [15] explored the use of attention-based image-to-text models for generating structured markup for MEs. These models consist of a multi-layer convolution network to extract image features and an attention-based recurrent neural network as a decoder for generating structured markup text. In the same direction, Zhang et al. [18] proposed a novel end-to-end neural network based approach that learns to recognize handwritten mathematical expressions (HMEs) in a two-dimensional layout and produces output as one-dimensional character sequences in LaTeX format. Here, a CNN is used as the encoder to extract features from HME images and a recurrent neural network is employed as the decoder to generate LaTeX sequences.

Figure 3: Overview of the mathematical expression description network. Our model uses an end-to-end trainable network consisting of a CNN followed by a language-generating LSTM. It generates a textual description of an input mathematical expression image in natural language which interprets its internal meaning.
B. Image Captioning
Image captioning is the task of automatically describing the content of an image using properly formed English sentences. Although it is a very challenging task, it helps visually impaired people to better understand the content of images on the web. Recently, a large variety of deep models [6], [19]–[22] have been proposed to generate textual descriptions of natural images. All these models use a recurrent neural network (RNN) as a language model conditioned on image features extracted by a convolution neural network and sample from it to generate text. Instead of generating a caption for the whole image, a handful of approaches generate captions for image regions [6], [23], [24]. In contrast to generating a single sentence, various models have also been introduced in the literature to generate a paragraph describing the content of an image [25], [26] by considering a hierarchy of language models.
III. MATHEMATICAL EQUATION DESCRIPTION
A. Overview
Our MED model takes a mathematical expression image (MEI) as input and generates a natural language sentence that describes the internal meaning of the expression. Figure 3 provides an overview of our model. It consists of encoder and decoder networks. The encoder extracts deep features that richly represent the equation image. The decoder uses the intermediate representation to generate a sentence describing the meaning of the ME. The attention mechanism impels the decoder to focus on specific parts of the input image. Each of these networks is discussed in detail in the following subsections.
B. Feature Extraction using Encoder
The MED model takes an MEI and generates its textual description Y, encoded as a sequence of 1-of-K encoded words:

Y = \{y_1, y_2, \ldots, y_T\}, \quad y_i \in \mathbb{R}^K,    (1)

where K is the size of the vocabulary and T is the length of the description. We consider a Convolution Neural Network (CNN) as an encoder in order to extract a set of feature vectors. We assume that the output of the CNN encoder is a three-dimensional array of size H × W × D, and consider the output as a variable-length grid of L vectors, L = H × W, referred to as annotation vectors. Each of these vectors is a D-dimensional representation corresponding to a local region of the input image:

A = \{a_1, a_2, \ldots, a_L\}, \quad a_i \in \mathbb{R}^D.    (2)

We extract features from a lower convolution layer in order to obtain a correspondence between the feature vectors and regions of the image. This allows the decoder to selectively focus on certain regions of the input image by selecting a subset of all these feature vectors.
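As a concrete illustration, the following PyTorch sketch shows one way such annotation vectors can be extracted with a pre-trained ResNet-152 from torchvision; the truncation point, input resolution and resulting grid size are illustrative assumptions rather than the exact released configuration.

```python
# Sketch: turn an equation image into a grid of L = H x W annotation vectors.
import torch
import torch.nn as nn
import torchvision.models as models

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = models.resnet152(pretrained=True)   # ImageNet weights
        # Drop the average-pooling and fully connected layers so the output
        # keeps its spatial layout (a D-dimensional vector per location).
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, images):                        # images: (B, 3, H_in, W_in)
        feats = self.backbone(images)                 # (B, D, H, W) with D = 2048
        batch, depth, height, width = feats.shape
        # Flatten the spatial grid into L = H * W annotation vectors a_i in R^D.
        return feats.view(batch, depth, height * width).permute(0, 2, 1)  # (B, L, D)

encoder = Encoder()
annotations = encoder(torch.randn(1, 3, 224, 224))    # e.g. (1, 49, 2048)
```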
C. Sentence Generation using Decoder
We employ an LSTM [27] as a decoder that produces a sentence by generating one word at every time step, conditioned on a context vector \hat{z}_t, the hidden state h_t and the previously generated word y_{t-1}. It produces a word at time step t using the following equation:

p(y_t \mid y_1, y_2, \ldots, y_{t-1}, x) = f(y_{t-1}, h_t, \hat{z}_t),    (3)

where x denotes the input MEI and f denotes a multi-layered perceptron (MLP) which is expanded in Eq. (7). The hidden state h_t of the LSTM is computed using the following equations:

i_t = \sigma(W_{yi} E y_{t-1} + U_{hi} h_{t-1} + V_{zi} \hat{z}_t)
f_t = \sigma(W_{yf} E y_{t-1} + U_{hf} h_{t-1} + V_{zf} \hat{z}_t)
o_t = \sigma(W_{yo} E y_{t-1} + U_{ho} h_{t-1} + V_{zo} \hat{z}_t)
g_t = \tanh(W_{yc} E y_{t-1} + U_{hc} h_{t-1} + V_{zc} \hat{z}_t)
c_t = f_t \odot c_{t-1} + i_t \odot g_t
h_t = o_t \odot \tanh(c_t).    (4)
Here, i_t, f_t, c_t, o_t and h_t are the input gate, forget gate, memory cell, output gate and hidden state of the LSTM, respectively. The vector \hat{z}_t is a context vector which captures the visual information of a particular image region. The context vector \hat{z}_t in Eq. (4) is a dynamic representation of the relevant part of the input image at time step t. We consider the soft attention defined by Bahdanau et al. [28], which computes a weight \alpha_{ti} for each annotation vector a_i conditioned on the previous LSTM hidden state h_{t-1}. Here, we parameterize the attention as an MLP which is jointly trained:

e_{ti} = v_a^T \tanh(W_a h_{t-1} + U_a a_i)
\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{L} \exp(e_{tk})}.    (5)

Let n' be the dimension of the attention; then v_a \in \mathbb{R}^{n'}, W_a \in \mathbb{R}^{n' \times n} and U_a \in \mathbb{R}^{n' \times D}. After computation of the weights \alpha_{ti}, the context vector \hat{z}_t is calculated as follows:

\hat{z}_t = \sum_{i=1}^{L} \alpha_{ti} a_i.    (6)

The weight \alpha_{ti} tells the decoder which part of the input image is the most suitable place to attend to when generating the next predicted word, by assigning a higher weight to the corresponding annotation vectors a_i. Let m and n denote the dimensions of the embedding and the LSTM, respectively; E \in \mathbb{R}^{m \times K} is the embedding matrix, \sigma is the sigmoid activation function and \odot denotes element-wise multiplication.
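A minimal PyTorch sketch of this soft attention module is given below, assuming the ResNet-152 feature dimension (D = 2048) and the LSTM size of 1024 stated in the implementation details; the attention dimension of 512 is an assumption.

```python
# Sketch of Eqs. (5)-(6): additive (Bahdanau-style) attention producing the
# context vector z_t as a weighted sum of annotation vectors.
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=1024, attn_dim=512):
        super().__init__()
        self.W_a = nn.Linear(hidden_dim, attn_dim)   # projects h_{t-1}
        self.U_a = nn.Linear(feat_dim, attn_dim)     # projects each a_i
        self.v_a = nn.Linear(attn_dim, 1)            # scores each location

    def forward(self, annotations, h_prev):
        # annotations: (B, L, feat_dim), h_prev: (B, hidden_dim)
        scores = self.v_a(torch.tanh(
            self.W_a(h_prev).unsqueeze(1) + self.U_a(annotations)))   # (B, L, 1)
        alpha = torch.softmax(scores.squeeze(-1), dim=1)              # Eq. (5)
        context = (alpha.unsqueeze(-1) * annotations).sum(dim=1)      # Eq. (6)
        return context, alpha
```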
Finally, the probability of each predicted word at time t is computed from the context vector \hat{z}_t, the current LSTM hidden state h_t and the previously predicted word y_{t-1} using the following equation:

p(y_t \mid y_1, y_2, \ldots, y_{t-1}, x) = g(W_o(E y_{t-1} + W_h h_t + W_z \hat{z}_t)),    (7)

where g denotes a softmax activation function over all the words in the vocabulary; E, W_o \in \mathbb{R}^{K \times m}, W_h \in \mathbb{R}^{m \times n} and W_z \in \mathbb{R}^{m \times D} are learned parameters initialized randomly. The initial memory state c_0 and hidden state h_0 of the LSTM are predicted from the average of the annotation vectors fed through two separate MLPs (f_{init,c}, f_{init,h}):

c_0 = f_{init,c}\left(\frac{1}{L} \sum_{i=1}^{L} a_i\right)
h_0 = f_{init,h}\left(\frac{1}{L} \sum_{i=1}^{L} a_i\right).    (8)
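The following sketch puts Eqs. (4), (7) and (8) together as a decoder module; an LSTM cell whose input is the concatenation of the word embedding and the context vector is mathematically equivalent to giving each gate separate weight matrices for E y_{t-1} and \hat{z}_t. The embedding size (512) and LSTM size (1024) follow the implementation details below; the remaining layer names are illustrative.

```python
# Sketch of the decoder: MLP-initialised states (Eq. 8), one LSTM step (Eq. 4)
# and the deep-output word distribution (Eq. 7).
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=512, hidden_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)            # E
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)   # gates of Eq. (4)
        self.init_h = nn.Linear(feat_dim, hidden_dim)               # f_init,h
        self.init_c = nn.Linear(feat_dim, hidden_dim)               # f_init,c
        self.W_h = nn.Linear(hidden_dim, embed_dim)
        self.W_z = nn.Linear(feat_dim, embed_dim)
        self.W_o = nn.Linear(embed_dim, vocab_size)

    def init_states(self, annotations):
        # Eq. (8): initialise h_0 and c_0 from the mean annotation vector.
        mean_a = annotations.mean(dim=1)
        return self.init_h(mean_a), self.init_c(mean_a)

    def step(self, y_prev, context, h, c):
        # One decoding step conditioned on y_{t-1}, h_{t-1} and z_t.
        e = self.embed(y_prev)                                      # (B, embed_dim)
        h, c = self.lstm(torch.cat([e, context], dim=1), (h, c))    # Eq. (4)
        logits = self.W_o(e + self.W_h(h) + self.W_z(context))      # Eq. (7)
        return torch.log_softmax(logits, dim=1), h, c
```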
Expression                          Description
10x                                 ten times x
x^2                                 x square or second power of x
\sqrt{x}                            second root of x
x/10                                x over ten
(x+y)^2                             second power of all x plus y
\log_2 x                            log x to base two
x/y                                 x over y
x/(y+z)                             x over y plus z
(x+y)/z                             x plus y all over z
x/y^2                               x over y square
x + y/z                             x all plus y over z
(x-y)/(y+z)                         x minus y all over y plus z
x^2/y                               x square over y
(x-y)/z + t                         x minus y all over z all plus t
e^{(1+x)}                           exponential of all one plus x
e^x + 1                             exponential of x all plus one
e^{(1+x)} - 1                       exponential of all one plus x all minus one
\int x \, dx                        integral of x with respect to x
\int_0^1 x \, dx                    integral of x with respect to x from lower limit zero to upper limit one
\lim_{x \to 0} \frac{\sin x}{x}     limit of sin x over x as x approaches to zero
\frac{d}{dx}(x^2)                   differentiation of x square with respect to x
x - 6 = 3                           x minus six equal to three
x + 7 > 10                          x plus seven greater than ten
x + 2y = 7, x - y = 3               x plus two times y equal to seven and x minus y equal to three

Table I: Natural language phrases used to uniquely describe mathematical equations.
D. Implementation Details
Pre-processing: Textual descriptions of the MEs are pre-processed with a basic tokenization algorithm, keeping all words that appear at least 4 times in the training set.
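A small sketch of this pre-processing step is shown below; the special-token names are assumptions and are not taken from the released code.

```python
# Build the word vocabulary: whitespace tokenisation, keeping only words that
# appear at least min_count (= 4) times in the training descriptions.
from collections import Counter

def build_vocab(descriptions, min_count=4):
    counts = Counter(w for desc in descriptions for w in desc.lower().split())
    vocab = {"<pad>": 0, "<start>": 1, "<end>": 2, "<unk>": 3}
    for word, count in counts.items():
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab

# Toy usage with two descriptions (min_count lowered so every word is kept).
vocab = build_vocab(["integral of x with respect to x",
                     "x plus seven greater than ten"], min_count=1)
```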
Training Details: The training objective of our MED model is to maximize the predicted word probability given in Eq. (7). We use cross entropy as the objective function:

O = -\sum_{t=1}^{T} \log p(y_t^{gt} \mid y_{<t}, x),    (9)

where y_t^{gt} represents the ground truth word at time step t. We consider 299K images and their corresponding textual descriptions to train the model. We use a ResNet-152 model [29] pre-trained on ImageNet [30] in all the experiments. We train the network with a batch size of 50 for 60 epochs. We use stochastic gradient descent (SGD) with a fixed learning rate of 10^{-4}, momentum of 0.5 and weight decay of 0.0001. All the weights are randomly initialized except for the CNN. We use 512 dimensions for the embedding and 1024 for the size of the LSTM memory. We apply a dropout layer after each convolution layer with the dropout probability set to 0.5. The best trained model is selected by BLEU-4 score on the validation set. For further implementation and architecture details, please refer to the source code at: https://github.com/ajoymondal/Equation-Description-PyTorch.
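For readers who prefer code, a minimal training-loop sketch matching these hyper-parameters is given below; it assumes the Encoder, SoftAttention and Decoder sketches above and a data loader that yields image batches with <start>/<end>-delimited, padded word-index captions. The full pipeline is in the released repository.

```python
# Sketch: teacher-forced training with SGD (lr 1e-4, momentum 0.5,
# weight decay 1e-4) and per-word cross entropy (Eq. 9).
import torch
import torch.nn as nn

def train(encoder, attention, decoder, loader, vocab, epochs=60, device="cuda"):
    params = (list(encoder.parameters()) + list(attention.parameters())
              + list(decoder.parameters()))
    optimizer = torch.optim.SGD(params, lr=1e-4, momentum=0.5, weight_decay=1e-4)
    criterion = nn.NLLLoss(ignore_index=vocab["<pad>"])   # decoder outputs log-probs
    for _ in range(epochs):
        for images, captions in loader:                   # captions: (B, T) indices
            images, captions = images.to(device), captions.to(device)
            annotations = encoder(images)
            h, c = decoder.init_states(annotations)
            loss = 0.0
            for t in range(captions.size(1) - 1):
                # Teacher forcing: condition on the ground-truth previous word.
                context, _ = attention(annotations, h)
                log_probs, h, c = decoder.step(captions[:, t], context, h, c)
                loss = loss + criterion(log_probs, captions[:, t + 1])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```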
Figure 4: A few sample MEIs and their corresponding textual descriptions from the Math-Exp-Syn data set.
Decoding: In the decoding stage, our main aim is to generate the most likely textual description for a given MEI:

\hat{y} = \arg\max_{y} \log p(y \mid x).    (10)

Unlike in the training procedure, the ground truth of the previously predicted word is not available. We employ a beam search [31] of size 20 during the decoding procedure. A set of 20 partial hypotheses beginning with the start-of-sentence token <start> is maintained. At each time step, each partial hypothesis in the beam is expanded with every possible word, and only the 20 most likely beams are kept. This process is repeated until the output word becomes the symbol corresponding to the end-of-sentence <end>; when the <end> token is encountered, the hypothesis is removed from the beam and added to the set of complete hypotheses.
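A compact sketch of this beam search, simplified to a single image and reusing the encoder, attention and decoder sketches above, could look as follows; the maximum-length cut-off is an assumption.

```python
# Beam search (Eq. 10): keep the beam_size most likely partial hypotheses,
# expand each with the top words at every step, and move hypotheses that
# emit <end> into the completed set.
import torch

def beam_search(encoder, attention, decoder, image, vocab, beam_size=20, max_len=50):
    inv_vocab = {i: w for w, i in vocab.items()}
    annotations = encoder(image.unsqueeze(0))                 # (1, L, D)
    h, c = decoder.init_states(annotations)
    beams = [([vocab["<start>"]], 0.0, h, c)]                 # (words, log-prob, h, c)
    completed = []
    for _ in range(max_len):
        candidates = []
        for words, score, h, c in beams:
            y_prev = torch.tensor([words[-1]])
            context, _ = attention(annotations, h)
            log_probs, h_new, c_new = decoder.step(y_prev, context, h, c)
            top_lp, top_ix = log_probs[0].topk(min(beam_size, log_probs.size(1)))
            for lp, ix in zip(top_lp.tolist(), top_ix.tolist()):
                candidates.append((words + [ix], score + lp, h_new, c_new))
        candidates.sort(key=lambda cand: cand[1], reverse=True)
        beams = []
        for cand in candidates[:beam_size]:
            (completed if cand[0][-1] == vocab["<end>"] else beams).append(cand)
        if not beams:
            break
    best = max(completed or beams, key=lambda cand: cand[1])
    return " ".join(inv_vocab[i] for i in best[0]
                    if inv_vocab[i] not in ("<start>", "<end>"))
```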
IV. DATA SETS AND EVALUATION METRICS
A. Data Sets
The unavailability of mathematical equation image data sets with textual descriptions inspired us to generate data sets for experimental purposes. Various issues must be considered when generating an unambiguous textual description of a mathematical equation. One important issue is that the same textual description can lead to different expressions. For example, the textual description "x plus y over z" could describe two possible equations: either (x+y)/z or x + y/z. Thus, an algorithm should be carefully designed so that it generates an unambiguous textual description corresponding to exactly one expression. To the best of our knowledge, no mathematical expression data set with textual descriptions is available for experiments. We create a data set, referred to as Math-Exp-Syn, with a large number of synthetically generated MEIs and their descriptions. For this purpose, we create sets of predefined functions (e.g. linear equation, limit, etc.), variables (e.g. x, y, z, etc.), operators (e.g. +, −, etc.) and constants (e.g. 10, 1, etc.), together with sets of their corresponding textual descriptions. We develop Python code which randomly selects a function, variable, operator and constant from the corresponding predefined sets and automatically generates a mathematical equation as an image along with a corresponding textual description in text format. We make our Math-Exp-Syn data generation code available at: https://github.com/ajoymondal/Equation-Description-PyTorch. We also create another data set, referred to as Math-Exp, by manually annotating a limited number of MEIs. During the creation of both data sets, we take care to preserve the uniqueness of the equations and their descriptions. We use the natural language phrases listed in Table I to uniquely describe the internal meaning of the equations.

In this work, we limit ourselves to seven categories of MEs: linear equation, inequality, pair of linear equations, limit, differentiation, integral and finite integral. Table II displays the category-wise statistics of these data sets. Figure 4 shows a few sample images and their descriptions from the Math-Exp-Syn data set.

Data set       Division     LE    IE     PLE    LT     DI     IN    FIN    Total
Math-Exp-Syn   Training     41K   43K    43K    44K    39K    40K   36K    299K
Math-Exp-Syn   Validation   5K    5K     5K     6K     4K     5K    4K     37K
Math-Exp-Syn   Test         5K    5K     5K     6K     4K     5K    4K     37K
Math-Exp       Test         1K    0.64K  0.05K  0.68K  0.06K  0.2K  0.1K   2.7K

Table II: Category-level statistics of the considered data sets. LE: linear equation, IE: inequality, PLE: pair of linear equations, LT: limit, DI: differentiation, IN: integral and FIN: finite integral.
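The core of the generator can be illustrated with the toy sketch below: a template is drawn at random, its slots are filled from the predefined sets, and both a renderable LaTeX string and the matching unambiguous description are emitted. The templates and sets shown here are illustrative; the released code covers all seven categories.

```python
# Toy sketch of Math-Exp-Syn style sample generation.
import random

VARIABLES = ["x", "y", "z"]
CONSTANTS = ["1", "2", "3", "7", "10"]

TEMPLATES = [
    # (LaTeX pattern, textual description pattern)
    ("{a}{v} + {b} = {c}", "{a} times {v} plus {b} equal to {c}"),
    ("\\int {v} \\, d{v}", "integral of {v} with respect to {v}"),
    ("\\lim_{{{v} \\to 0}} \\frac{{\\sin {v}}}{{{v}}}",
     "limit of sin {v} over {v} as {v} approaches to zero"),
]

def generate_sample():
    latex, description = random.choice(TEMPLATES)
    slots = {"v": random.choice(VARIABLES),
             "a": random.choice(CONSTANTS),
             "b": random.choice(CONSTANTS),
             "c": random.choice(CONSTANTS)}
    return latex.format(**slots), description.format(**slots)

latex_eq, text_desc = generate_sample()
# latex_eq is then rendered (e.g. with LaTeX) to obtain the MEI; text_desc is
# its ground-truth textual description.
```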
B. Evaluation Metrics
In this work, we evaluate the generated descriptions for MEIs with respect to three metrics: BLEU [32], CIDEr [33] and ROUGE [34], which are popularly used in natural language processing (NLP) and image captioning tasks. All these metrics measure the similarity of a generated sentence against a set of ground truth sentences written by humans. Higher values of these metrics indicate that the generated sentence (text) is more similar to the ground truth sentence (text).
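As an example, the BLEU score between a generated description and its reference can be computed with NLTK as shown below (assuming NLTK is installed); CIDEr and ROUGE are computed analogously with the standard captioning evaluation toolkits.

```python
# BLEU-4 between a candidate description and one reference description.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "x plus seven greater than ten".split()
candidate = "x plus seven greater than ten".split()

bleu4 = sentence_bleu([reference], candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {bleu4:.3f}")   # 1.000 for an exact match
```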
V. EXPERIMENTS AND RESULTS
An extensive set of experiments is performed to assess the effectiveness of our MED model using several metrics on the ME data sets.
Figure 5: Visual illustration of sample results on the Math-Exp-Syn test set produced by our MED framework. GT: ground truth description; OP1: description generated by ResNet152+LSTM; OP2: description generated by ResNet152+LSTM+Attn.; OP3: description generated by ResNet152*+LSTM; OP4: description generated by ResNet152*+LSTM+Attn.; LSTM: description generated by ResNet152*+LSTM+Attn.; GRU: description generated by ResNet152*+GRU+Attn.; RNN: description generated by ResNet152*+RNN+Attn. '*' indicates fine-tuning and Attn. denotes attention in the decoder. Red colored text indicates wrongly generated text.
A. Ablation Study
A number of ablation experiments are conducted to quantify the importance of each component of our algorithm and to justify various design choices in the context of mathematical equation description. The Math-Exp-Syn data set is used for this purpose.
Models                     B-1     B-2     B-3     B-4     C       R
ResNet-18*+LSTM+Attn.      0.971   0.949   0.927   0.906   0.960   9.014
ResNet-34*+LSTM+Attn.      0.973   0.952   0.931   0.910   0.962   9.058
ResNet-50*+LSTM+Attn.      0.978   0.959   0.940   0.922   0.968   9.172
ResNet-101*+LSTM+Attn.     0.979   0.960   0.941   0.923   0.968   9.179
ResNet-152*+LSTM+Attn.     0.981   0.962   0.941   0.923   0.971   9.184

Table III: Deeper pre-trained models give better representations and improve textual description accuracy with respect to the evaluation measures BLEU-1 (B-1), BLEU-2 (B-2), BLEU-3 (B-3), BLEU-4 (B-4), CIDEr (C) and ROUGE (R) on the test set. The number along with the model name refers to the depth of the corresponding model. '*' denotes that the encoder is fine-tuned during training. An LSTM with attention is used as the decoder.
Pre-trained Encoder: It is well known that deeper networks are beneficial for large-scale image classification tasks. We conduct an experiment with pre-trained models of different depths to analyze their performance on the mathematical equation description task. Detailed equation description scores for the various pre-trained models are listed in Table III.
Models                     B-1     B-2     B-3     B-4     C       R
ResNet-152+LSTM            0.976   0.956   0.937   0.918   0.965   9.156
ResNet-152+LSTM+Attn.      0.977   0.956   0.937   0.918   0.966   9.163
ResNet-152*+LSTM           0.978   0.960   0.940   0.922   0.968   9.182
ResNet-152*+LSTM+Attn.     0.981   0.962   0.941   0.923   0.971   9.184

Table IV: Quantitative illustration of the effectiveness of fine-tuning the encoder and of attention in the decoder for the MED task (test performance). '*' denotes fine-tuning.
Fine-tuned vs. Without Fine-tuned Encoder and Attention vs. Without Attention in Decoder: The considered encoder, ResNet-152 pre-trained on ImageNet [30], is not effective without fine-tuning due to domain heterogeneity (natural images vs. MEIs). We perform an experiment to establish the benefit of fine-tuning on the equation description task. The attention mechanism tells the decoder to focus on a particular region of the image while generating the description related to that region. We also perform an experiment to analyze the effectiveness of the attention mechanism on the mathematical equation description task. The observations of these experiments are quantitatively reported in Table IV, which highlights the effectiveness of fine-tuning and attention in the mathematical equation description task. The first row of Figure 5 visually illustrates the effectiveness of fine-tuning the pre-trained ResNet-152 and of the LSTM with attention for the MED task.
RNN vs. GRU vs. LSTM: We also conduct an experiment to analyze the performance of LSTM, Gated Recurrent Unit (GRU) and vanilla Recurrent Neural Network (RNN) decoders on generating captions for mathematical equation images. In this experiment, we consider a pre-trained ResNet-152 as the encoder, which is fine-tuned during training, and different decoders with the attention mechanism: RNN, GRU and LSTM. Table V displays the numerical comparison between the three decoder models. The table highlights that the LSTM is more effective than the other two models for the mathematical equation description task. The second and third rows of Figure 5 display the visual outputs. The figure highlights that the LSTM is able to generate text most similar to the ground truth.
Models                     B-1     B-2     B-3     B-4     C       R
ResNet-152+RNN+Attn.       0.977   0.958   0.939   0.920   0.967   9.179
ResNet-152+GRU+Attn.       0.979   0.959   0.939   0.920   0.968   9.182
ResNet-152+LSTM+Attn.      0.981   0.962   0.941   0.923   0.971   9.184

Table V: Performance comparison between RNN, GRU and LSTM decoders with attention mechanism on the mathematical equation description task (test performance). The encoder is fine-tuned during the training process.
Model    Data set        Division    B-1     B-2     B-3     B-4     C       R
MED      Math-Exp-Syn    test set    0.981   0.962   0.941   0.923   0.971   9.184
MED      Math-Exp        test set    0.975   0.956   0.936   0.917   0.966   9.146

Table VI: Quantitative results of our MED model on standard evaluation metrics for both the Math-Exp-Syn and Math-Exp data sets. In both cases, MED is trained on the training set of the Math-Exp-Syn data set.
B. Quantitative Analysis of Results
The quantitative results obtained using our MED model for both the Math-Exp-Syn and Math-Exp data sets are listed in Table VI.
C. Real-world Experiments
We conduct a real-world experiment to see whether students are able to write equations by only reading or listening to their textual descriptions. For this purpose, we create a test set of mathematical equation images cropped from an NCERT class V mathematics book¹. The test set consists of 398 cropped equation images covering various types of equations: integer, decimal, fraction, addition, subtraction, multiplication and division. Figure 6 shows sample cropped mathematical equation images from the NCERT class V mathematics book. Our MED system generates a textual description for each of these equations. The list of descriptions is given to the students, who are asked to write the corresponding mathematical equations within 1 hour. Twenty students participate in this test. If any of the students writes an incorrect equation after only reading or listening to its textual description, the answer is counted as wrong; otherwise it is counted as correct. Among the 398 equations, the students are able to correctly write 359 equations within the allotted time by reading the textual descriptions generated by our MED model. For the remaining 39 equations, our MED model generates wrong descriptions due to the presence of other structural elements (e.g. triangle, square, etc.). Table VII highlights a few results of this test; where the descriptions generated by the MED model are wrong, the students write wrong equations by reading those wrongly generated descriptions. From this test, we conclude that our MED model is effective for reading equations by generating their textual descriptions.

Figure 6: Sample cropped mathematical equation images from the NCERT class V mathematics book used for the real-world experiment.

Textual description generated by MED       Equation written by students
five by sixteen                            5/16
eleven into eleven                         11 × 11
three hundred and sixty three              363
two and one by two                         2 1/2
one hundred and eighty degree              180°
eight plus nine plus three                 8 + 9 + 3
one thousand three hundred and eleven      1131
thirty two point                           32.

Table VII: Summary of the real-world experiment. First column: textual descriptions generated by the MED model from the cropped equation images and given to the students, who are asked to write the corresponding equations by reading the descriptions. Second column: equations written by the students. The cropped input images themselves are not reproduced in this table.

¹ https://www.ncertbooks.guru/ncert-maths-books/
VI. CONCLUSIONS
In this paper, we introduce a novel mathematical equation description (MED) model for reading mathematical equations to blind and visually impaired students by generating textual descriptions of the equations. The unavailability of mathematical equation images with textual descriptions inspired us to generate two data sets for our experiments. A real-world experiment concludes that students are able to write mathematical expressions by reading or listening to the descriptions generated by the MED network. This experiment establishes the effectiveness of the MED framework for reading mathematical equations to blind and VI students.
REFERENCES
[1] V. Moço and D. Archambault, "Automatic conversions of mathematical Braille: A survey of main difficulties in different languages," in ICCHP, 2004.
[2] W. Wongkia, K. Naruedomkul, and N. Cercone, "i-Math: Automatic math reader for Thai blind and visually impaired students," Computers & Mathematics with Applications, 2012.
[3] N. Soiffer, “MathPlayer: web-based math accessibility,” in
International ACM SIGACCESS Conference on Computers
and Accessibility, 2005.
[4] R. D. Stevens, A. D. Edwards, and P. A. Harling, “Access
to mathematics for visually disabled students through multi-
modal interaction,” HCI, 1997.
[5] S. Medjkoune, H. Mouchère, S. Petitrenaud, and C. Viard-Gaudin, "Combining speech and handwriting modalities for mathematical expression recognition," IEEE Trans. on Human-Machine Systems, 2017.
[6] A. Karpathy and L. Fei-Fei, “Deep visual-semantic align-
ments for generating image descriptions,” in CVPR, 2015.
[7] K.-F. Chan and D.-Y. Yeung, “Mathematical expression
recognition: a survey,” IJDAR, 2000.
[8] R. Zanibbi, D. Blostein, and J. R. Cordy, “Recognizing
mathematical expressions using tree transformation,” TPAMI,
2002.
[9] F. Álvaro, J.-A. Sánchez, and J.-M. Benedí, "An integrated grammar-based approach for mathematical expression recognition," PR, 2016.
[10] F. Álvaro, J.-M. Benedí et al., "Recognition of printed mathematical expressions using two-dimensional stochastic context-free grammars," in ICDAR, 2011.
[11] F. Julca-Aguilar, H. Mouchère, C. Viard-Gaudin, and N. S. Hirata, "Top-down online handwritten mathematical expression parsing with graph grammar," in IberoAmerican Congress on PR, 2015.
[12] H. Dai Nguyen, A. D. Le, and M. Nakagawa, “Deep neural
networks for recognizing online handwritten mathematical
symbols,” in ACPR, 2015.
[13] ——, "Recognition of online handwritten math symbols using deep neural networks," IEICE Trans. on Inform. and Sys., 2016.
[14] Y. Deng, A. Kanervisto, and A. M. Rush, "What you get is what you see: A visual markup decompiler," CoRR, 2016.
[15] Y. Deng, A. Kanervisto, J. Ling, and A. M. Rush, “Image-
to-markup generation with coarse-to-fine attention,” in ICML,
2017.
[16] J. Zhang, J. Du, and L. Dai, “Multi-scale attention with dense
encoder for handwritten mathematical expression recogni-
tion,” arXiv preprint arXiv:1801.03530, 2018.
[17] ——, "A GRU-based encoder-decoder approach with attention for online handwritten mathematical expression recognition," in ICDAR, 2017.
[18] J. Zhang, J. Du, S. Zhang, D. Liu, Y. Hu, J. Hu, S. Wei, and
L. Dai, “Watch, attend and parse: An end-to-end neural net-
work based approach to handwritten mathematical expression
recognition,” PR, 2017.
[19] X. Chen and C. Lawrence Zitnick, “Mind’s eye: A recurrent
visual representation for image caption generation,” in CVPR,
2015.
[20] J. Donahue, L. Anne Hendricks, S. Guadarrama,
M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell,
“Long-term recurrent convolutional networks for visual
recognition and description,” in CVPR, 2015.
[21] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and
tell: A neural image caption generator,” in CVPR, 2015.
[22] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, “Image
captioning with semantic attention,” in CVPR, 2016.
[23] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov,
R. Zemel, and Y. Bengio, “Show, attend and tell: Neural
image caption generation with visual attention,” in ICML,
2015.
[24] J. Johnson, A. Karpathy, and L. Fei-Fei, “Densecap: Fully
convolutional localization networks for dense captioning,” in
CVPR, 2016.
[25] H. Yu, J. Wang, Z. Huang, Y. Yang, and W. Xu, “Video
paragraph captioning using hierarchical recurrent neural net-
works,” in CVPR, 2016.
[26] J. Krause, J. Johnson, R. Krishna, and L. Fei-Fei, "A hierarchical approach for generating descriptive image paragraphs," arXiv preprint arXiv:1611.06607, 2016.
[27] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, 1997.
[28] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine
translation by jointly learning to align and translate,” arXiv
preprint arXiv:1409.0473, 2014.
[29] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning
for image recognition,” in CVPR, 2016.
[30] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei,
“ImageNet: A Large-Scale Hierarchical Image Database,” in
CVPR, 2009.
[31] K. Cho, “Natural language understanding with distributed
representation,” arXiv preprint arXiv:1511.07916, 2015.
[32] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in ACL, 2002.
[33] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, "CIDEr: Consensus-based image description evaluation," in CVPR, 2015.
[34] C.-Y. Lin, “Rouge: A package for automatic evaluation of
summaries,” Text Summarization Branches Out, 2004.