
Neural Multi-Step Reasoning for Question Answering on Semi-Structured Tables

Abstract

Advances in natural language processing tasks have gained momentum in recent years due to the increasingly popular neural network methods. In this paper, we explore deep learning techniques for answering multi-step reasoning questions that operate on semi-structured tables. Challenges here arise from the level of logical compositionality expressed by questions, as well as the domain openness. Our approach is weakly supervised, trained on question-answer-table triples without requiring intermediate strong supervision. It performs two phases: first, machine understandable logical forms (programs) are generated from natural language questions following the work of [Pasupat and Liang, 2015]. Second, paraphrases of logical forms and questions are embedded in a jointly learned vector space using word and character convolutional neural networks. A neural scoring function is further used to rank and retrieve the most probable logical form (interpretation) of a question. Our best single model achieves 34.8% accuracy on the WikiTableQuestions dataset, while the best ensemble of our models pushes the state-of-the-art score on this task to 38.7%, thus slightly surpassing both the engineered feature scoring baseline, as well as the Neural Programmer model of [Neelakantan et al., 2016].
Till Haug
Veezoo AG
Zurich, Switzerland
till@veezoo.com
Octavian-Eugen Ganea
ETH Zurich
Zurich, Switzerland
octavian.ganea@inf.ethz.ch
Paulina Grnarova
ETH Zurich
Zurich, Switzerland
paulina.grnarova@inf.ethz.ch
1 Introduction
Teaching computers to answer complex questions expressed in natural language is a long-standing goal of artificial intelligence that requires sophisticated reasoning and human language understanding. In this work, we investigate generic natural language interfaces for semi-structured databases. Typical questions for this task are topic independent and require multi-step reasoning involving discrete operations such as aggregation, comparison, superlatives or arithmetic. One example is the question "What was the difference in average attendance between 2010 and 2001?", which has to be answered based on a table containing information from soccer games (see footnote 1). This line of research is practically relevant for automated systems that support interactions between non-expert users and databases without requiring specific programming knowledge.
Question-Answering (QA) systems are often faced with a
trade-off between the openness of the domain and the depth
of logical compositionality hidden in questions. At one extreme
are systems able to answer complex questions about a specific
topic (e.g. [Wang et al., 2015]). Unsurprisingly, these systems
often struggle to generalize to other, more open domains. At
the other extreme, topic-independent QA systems that can potentially
interrogate large databases are usually limited to simple
look-up operations (e.g. [Bordes et al., 2014a]).
Here, we propose a novel weakly supervised model for
natural language interfaces operating on semi-structured ta-
bles. Our deep learning approach eliminates the need for ex-
pensive feature engineering in the candidate scoring phase,
while being able to generalize well to previously unseen data.
Each natural language question is translated into a set of com-
puter understandable candidate representations, called logical
forms, based on the work of [Pasupat and Liang, 2015]. Fur-
ther, the most likely such program is selected in two steps: i)
using a simple algorithm, logical forms are transformed back
into paraphrases (textual representations) understandable by
non-expert users, ii) next, these raw strings are further embed-
ded together with the respective questions in a jointly learned
vector space using convolutional neural networks over char-
acter and word embeddings. Multi-layer neural networks and
bilinear mappings are employed as effective similarity mea-
sures and combined to score the candidate interpretations. Fi-
nally, the highest ranked logical form is executed against the
input data to retrieve the answer that will be shown to the
user. Our method uses only weak supervision from question-
answer-table input triples, without requiring expensive annotations
of logical forms or latent operation sequences.
We empirically validate our approach in a series of experiments on WikiTableQuestions [Pasupat and Liang, 2015], a real-world dataset containing 22,033 pairs of questions and their corresponding manually retrieved answers on 2,108 randomly selected Wikipedia tables. The inherent challenges of this dataset include i) the small number of training examples, ii) the complexity of questions that generally require compositionality over multiple simpler operations, iii) generalization to completely unseen tables and domains at test time, and iv) lack of strong supervision. An ensemble of our best models reaches a state-of-the-art accuracy of 38.7% on this task, showing that neural networks are promising at outperforming engineered systems when dealing with complex questions. We provide further insight into our method by comparing against a set of baselines (including engineered feature systems) and model variations, as well as by presenting ablation and error analysis studies.

1 Taken from the WikiTableQuestions dataset: http://nlp.stanford.edu/software/sempre/wikitable/viewer/#204-590

arXiv:1702.06589v1 [cs.CL] 21 Feb 2017
2 Related Work
We now briefly highlight prior research relevant to our
work. There are two main types of QA systems: semantic
parsing-based and embedding-based.
Semantic parsing-based methods perform a functional parse of the question that is further converted to a machine understandable program and executed on a knowledge base or database. A big obstacle for semantic parsers is the need for annotated logical forms when dealing with new domains. To tackle this problem, our method follows recent work of [Reddy et al., 2014; Kwiatkowski et al., 2013] that relies solely on weak supervision through question-answer-table triples. In the context of QA for semi-structured tables and dealing with multi-compositional queries, [Pasupat and Liang, 2015] generate and rank candidate logical forms with a log-linear model trained on question-answer pairs. The logical form with the highest model probability is then considered as the correct interpretation and is executed. In this work, we generate logical form candidates in the same way as [Pasupat and Liang, 2015]. While they resort to hand-crafted features to determine the relevance of the candidates for a question, we use automatic learning of representations, thus benefiting from generalization and flexibility. To do so, we embed each question and the paraphrases of the respective candidate logical forms into the same vector space, making use of similarity metrics for scoring. Paraphrases have been successfully used to facilitate semantic parsers [Wang et al., 2015; Berant and Liang, 2014]. While [Berant and Liang, 2014] is suited for factoid questions with a modest amount of compositionality, [Wang et al., 2015] targets more complicated questions. Both of these paraphrase-driven QA systems differ from our work as their scoring relies on hand-crafted features.
[Neelakantan et al., 2016] also focus on compositional questions, but instead of generating and ranking multiple logical forms, they propose a model that directly constructs a logical form from an embedding of the question. A list of discrete operations is manually defined and each operation is parametrized by a real-valued vector that is learned during training. A separate recurrent neural network (RNN) is used for modeling the history of selected operations and the question representation. The probability distributions over operations and columns are induced using the question embedding, the history and an attention vector. While their model performs well on semi-structured tables, understanding how it generated the logical form is not trivial. Recently, [Yin et al., 2015] propose Neural Enquirer, a fully neural, end-to-end differentiable network that executes queries across multiple tables. They use a synthetic dataset to demonstrate the abilities of the model to deal with compositionality in the questions.
Embedding-based methods represent the question and the answer as semantic vectors. Compatibility between a question-answer pair is then determined by their similarity in the shared vector space. Existing approaches often represent questions and knowledge base constituents in a single vector using simple bag-of-words (BOW) models [Bordes et al., 2014a; Bordes et al., 2014b] under the framework of memory networks. [Dong et al., 2015] propose a multi-column convolutional neural network to account for the word order and higher order n-grams. Even though we embed questions alongside paraphrases instead of answers, our method relates to embedding-based models since the key challenge is learning representations. However, whereas these models are applied on datasets that require little compositional reasoning, our work targets questions whose answers require complex multi-step deductions and operate on semi-structured tables instead of structured knowledge bases. Representation learning using deep learning architectures has been widely explored in other domains, e.g. in the context of sentiment classification [Kim, 2014; Socher et al., 2013], or for image-hashtag prediction [Denton et al., 2015].
QA systems also differ in the knowledge structure they reason on, which can impose additional challenges. Systems vary from operating on structured knowledge bases [Bordes et al., 2014b; Bordes et al., 2014a] to semi-structured tables [Pasupat and Liang, 2015; Neelakantan et al., 2016; Jauhar et al., 2016] and completely unstructured text, which is related to information extraction [Clark et al., 2016]. We focus on semi-structured tables, which strike a trade-off between degree of structure and ubiquity.
3 Model
We now proceed with the detailed description of our QA system. In a nutshell, our model runs through the following stages. For every question $q$: i) a set of candidate logical forms $\{z_i\}_{i \in I_q}$ is generated using the method of [Pasupat and Liang, 2015]; ii) each such candidate program $z_i$ is paraphrased in a textual representation $t_i$ that offers accuracy gain, interpretability and comprehensibility; iii) all textual forms $t_i$ are scored against the input question $q$ using a neural network model; iv) the logical form $z_i^*$ corresponding to the highest ranked $t_i^*$ is selected as the machine-understandable translation of question $q$; v) $z_i^*$ is executed on the input table and its answer is returned to the user. Our contributions are the novel models that perform the steps ii) and iii), while for i), iv) and v) we rely on the work of [Pasupat and Liang, 2015] (henceforth: PL2015).
We detail each of these steps in the subsequent sections.
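
The five stages above can be summarized as a short control-flow sketch. This is only an illustration under stated assumptions: the function name `answer_question` and the four callables passed into it are hypothetical placeholders for the PL2015 candidate generator, the paraphrasing algorithm, the neural scorer and the logical form executor; none of them is taken from the released code.

```python
from typing import Any, Callable, Iterable, Optional


def answer_question(
    question: str,
    table: Any,
    generate: Callable[[str, Any], Iterable[Any]],   # stage i (hypothetical)
    paraphrase: Callable[[Any], str],                # stage ii (hypothetical)
    score: Callable[[str, str], float],              # stage iii (hypothetical)
    execute: Callable[[Any, Any], Any],              # stage v (hypothetical)
) -> Optional[Any]:
    """Run stages i) to v) for one question; return None if no candidate exists."""
    candidates = list(generate(question, table))            # i) candidate logical forms
    if not candidates:
        return None
    paraphrases = [paraphrase(z) for z in candidates]       # ii) textual forms
    scores = [score(question, t) for t in paraphrases]      # iii) neural scores
    best = max(range(len(candidates)), key=scores.__getitem__)  # iv) top-ranked form
    return execute(candidates[best], table)                 # v) answer from the table
```

Passing the stages in as callables keeps the sketch independent of any particular parser or scoring network.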
3.1 Candidate Logical Form Generation
We use the method of PL2015 to generate a set of candidate logical forms with respect to a question. Specifically, given a pair of input table and question to be answered, the table is first transformed into a knowledge graph (KG). Next, information from the KG facilitates the process of parsing a question into a set of candidate logical forms. This is done using a semantic parser that recursively builds up logical forms by repeatedly applying deduction rules. Each candidate logical form is represented in Lambda DCS form [Liang, 2013] and can be transformed into a SPARQL query, whose execution against the KG yields an answer.

For instance, the question "How many people attended the last Rolling Stones concert?" will be translated into a set of candidate logical forms, among which the correct one is: R[λx[Attendance.Number.x]].argmax(Act.RollingStones, Index).

Algorithm 1 Recursive paraphrasing of a Lambda DCS logical form. The + operation is defined as string concatenation with spaces. Details about Lambda DCS language can be found in [Liang, 2013].

procedure PARAPHRASE(z)    ▷ z is the root of a Lambda DCS logical form
    switch z do
        case Aggregation    ▷ e.g. count, max, min...
            t ← AGGREGATION(z) + PARAPHRASE(z.child)
        case Join    ▷ join on relations, e.g. λx.Country(x, Australia)
            t ← PARAPHRASE(z.relation) + PARAPHRASE(z.child)
        case Reverse    ▷ reverses a binary relation
            t ← PARAPHRASE(z.child)
        case LambdaFormula    ▷ lambda expression λx.[...]
            t ← PARAPHRASE(z.body)
        case Arithmetic or Merge    ▷ e.g. plus, minus, union...
            t ← PARAPHRASE(z.left) + OPERATION(z) + PARAPHRASE(z.right)
        case Superlative    ▷ e.g. argmax(x, value)
            t ← OPERATION(z) + PARAPHRASE(z.value) + PARAPHRASE(z.relation)
        case Value    ▷ i.e. constants
            t ← z.value
    return t    ▷ t is the textual paraphrase of the Lambda DCS logical form
end procedure
3.2 Converting Logical Forms to Text
We describe our proposed paraphrasing algorithm to trans-
form logical forms into textual representations. These, and
not the original logical forms, are further scored against the
input question. Besides the advantages of interpretability and
comprehensibility, we also observe a quality gain when using
paraphrases in comparison with ranking directly based on the
string representations of the original Lambda DCS expres-
sions (details in section 4.4).
Usability and extensibility of QA systems may benefit from revealing the translation of the human-language question into machine language (e.g. Lambda DCS in our case). This is called transparency, i.e. the property of revealing the generated program from an input question in its raw format. However, this might not achieve comprehensibility, the characteristic of a system to be understandable by non-technical users. Paraphrasing the logical form $z$ into a textual representation $t$ satisfies both of these properties as it yields a non-expert understandable description of the executed program based on an input question.

Given a logical form $z$ in Lambda DCS, we paraphrase it using a simple algorithm that recursively traverses the tree representation of $z$ starting at the root. The translation operations associated with each node type can be seen in Algorithm 1. As an example of how this algorithm works, the correct candidate logical form for the question mentioned in Section 3.1, namely "How many people attended the last Rolling Stones concert?", will be mapped to the paraphrase "Attendance as number of last row where act is Rolling Stones".
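
A minimal Python sketch of the recursion in Algorithm 1 may help make the traversal concrete. The `Node` class, its `kind` labels and the `join` helper are hypothetical names introduced here for illustration; real Lambda DCS trees produced by PL2015 carry more structure, and the exact wording each operation contributes to the paraphrase is not specified by the algorithm.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Node:
    """Hypothetical, simplified tree node of a Lambda DCS logical form."""
    kind: str                     # "aggregation", "join", "value", ...
    text: str = ""                # operation name or constant
    children: Dict[str, "Node"] = field(default_factory=dict)


def join(*parts: str) -> str:
    """The '+' of Algorithm 1: concatenate with spaces, skipping empty parts."""
    return " ".join(p for p in parts if p)


def paraphrase(z: Node) -> str:
    """Recursively turn a logical form tree into a textual paraphrase."""
    c = z.children
    if z.kind == "aggregation":
        return join(z.text, paraphrase(c["child"]))
    if z.kind == "join":
        return join(paraphrase(c["relation"]), paraphrase(c["child"]))
    if z.kind == "reverse":
        return paraphrase(c["child"])
    if z.kind == "lambda":
        return paraphrase(c["body"])
    if z.kind in ("arithmetic", "merge"):
        return join(paraphrase(c["left"]), z.text, paraphrase(c["right"]))
    if z.kind == "superlative":
        return join(z.text, paraphrase(c["value"]), paraphrase(c["relation"]))
    if z.kind == "value":
        return z.text
    raise ValueError(f"unknown node kind: {z.kind}")
```

On a toy tree such as `Node("arithmetic", "minus", {"left": Node("value", "a"), "right": Node("value", "b")})`, the function returns `"a minus b"`.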
3.3 Joint Embedding Model
For each question $q$ we generate a set of logical forms $z_i$ and apply Algorithm 1 to retrieve their corresponding paraphrases $t_i$. Subsequently, questions and paraphrases are embedded in a jointly learned vector space. Each $t_i$ is scored based on its similarity with question $q$, defined by a neural network acting on top of their corresponding embeddings. Features used by our scoring system are learned automatically without the need for hand-engineering. We use two separate convolutional neural networks (CNNs) for question and paraphrase embeddings, on top of which a max-pooling operation is applied. The CNNs receive as input token embeddings consisting of the concatenation of word and character vectors. The details of our model are outlined in the following sections. For readability, hyper-parameter values are deferred to Section 4.2.
Token Embedding
We now detail the neural architecture used to embed tokens
of an input piece of text (e.g. question, paraphrase). This
model is depicted in Figure 1. Every token in our vocabulary
is parametrized by a word and a character embedding, both
learned during training.
We find that word vector initialization affects both the convergence speed and quality of our method. A typical choice for this initialization is to use pre-trained vectors learned from unsupervised textual data. These models are known to encode both syntactic and semantic information, e.g. similar or related words like synonyms are mapped to close points in the vector space. We experiment with two different popular methods, namely GloVe [Pennington et al., 2014] and Word2vec [Mikolov et al., 2013], comparing them also with random initializations. Embeddings for tokens not in the vocabulary are randomly initialized by sampling each component from a uniform distribution $U[-0.25, 0.25]$.

Figure 1: Proposed architecture to convert a sentence into a token embedding matrix. Each token vector is a concatenation of the corresponding word and character embeddings. The latter is obtained using a character CNN consisting of filters of different character widths (depicted with sizes 2, 3 and 4). Parts of figure taken from [Kim et al., 2015].
When comparing the textual representations of the log-
ical forms to the input question, we notice that in many
cases similar numbers or dates are written in different ways,
e.g. 1,000,000 vs 1000000. Other common sources
of noise are rare words and word misspellings. These are
different or unknown tokens in the vocabulary for which
word embeddings alone would perform poorly. One technique
to mitigate these issues, inspired by [Kim et al., 2015;
Zhang et al., 2015], is to use character embeddings in addition
to word vectors. Token vectors are then obtained using
a CNN over the constituent characters. Our CNN model uses
multiple filter widths, followed by a max-over-time pooling
layer and concatenation with the respective word vector.
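
The token embedding described above can be sketched in plain Python as follows. This is an illustration only: the names `char_cnn` and `token_embedding` are hypothetical, biases and non-linearities are omitted, and a token shorter than a filter simply yields a zero feature here, whereas a real implementation would more likely pad the character sequence.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def char_cnn(chars: List[Vector], filters: List[List[Vector]]) -> Vector:
    """One feature per filter: the maximum response over all character windows.

    chars   -- one embedding vector per character of the token
    filters -- each filter is a list of `width` weight vectors
    """
    features = []
    for filt in filters:
        width = len(filt)
        responses = [
            sum(dot(w, c) for w, c in zip(filt, chars[i:i + width]))
            for i in range(len(chars) - width + 1)
        ]
        # Assumption: tokens shorter than the filter contribute 0.0.
        features.append(max(responses) if responses else 0.0)
    return features


def token_embedding(word_vec: Vector, chars: List[Vector],
                    filters: List[List[Vector]]) -> Vector:
    """Concatenate the word vector with the max-pooled character features."""
    return word_vec + char_cnn(chars, filters)
```

Because the pooled vector has one entry per filter, the character part of the embedding has a fixed size regardless of token length.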
Sentence Embedding
We map both the question $q$ and the paraphrase $t$ into a joint vector space using sentence embeddings obtained from two jointly trained CNNs. The inputs to the CNNs are tensors in $\mathbb{R}^{s \times d}$, containing an embedding for each token, where $s$ is the maximum input length and $d$ is the dimension of the token embedding. We use filters spanning a varying number of tokens, with widths from a set $L$. For each filter width $l \in L$, we learn $n$ different filters, each in $\mathbb{R}^{l \times d}$. After the convolution layer, we apply max-over-time pooling on the resulting feature matrices, which yields, per filter width, a vector of dimension $n$. Next, we concatenate the resulting max-over-time pooling vectors of the different filter widths in $L$ to form our sentence embedding. The final embedding size produced by this architecture is thus $n|L|$. As non-linearity we use Exponential Linear Units (ELUs) [Clevert et al., 2015].

Figure 2: Design of our best single model, CNN-FC-BILINEAR-SEP. Embedding the question and paraphrase using two CNNs yields two vectors. The final score for the paraphrase is a linear combination of the outputs of two networks: a bilinear mapping and a fully connected network.
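
To illustrate why the sentence embedding has size $n|L|$, the following plain-Python sketch applies $n$ filters per width and max-pools each response over time. It is a simplified illustration, not the TensorFlow implementation: the function names are hypothetical, biases default to zero, and applying the ELU to the convolution output before pooling is an assumption, since the text does not state where the non-linearity sits.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[Vector]


def elu(x: float) -> float:
    """Exponential Linear Unit with alpha = 1."""
    return x if x > 0 else math.exp(x) - 1.0


def conv_max_pool(tokens: Matrix, filt: Matrix, bias: float = 0.0) -> float:
    """Slide one l x d filter over the s x d input; return the max ELU response."""
    l = len(filt)
    if len(tokens) < l:
        raise ValueError("input shorter than filter width; pad the input first")
    best = -math.inf
    for start in range(len(tokens) - l + 1):
        window = tokens[start:start + l]
        act = bias + sum(w * x
                         for f_row, t_row in zip(filt, window)
                         for w, x in zip(f_row, t_row))
        best = max(best, elu(act))
    return best


def sentence_embedding(tokens: Matrix,
                       filters_by_width: Dict[int, List[Matrix]]) -> Vector:
    """Concatenate pooled features over all widths: n filters per width -> n*|L| dims."""
    return [conv_max_pool(tokens, filt)
            for width in sorted(filters_by_width)
            for filt in filters_by_width[width]]
```

With two widths and two filters per width, the returned vector has four entries, matching $n|L| = 2 \cdot 2$.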
3.4 Neural Similarity Measures
We denote the sentence embedding of the question $q$ by $u \in \mathbb{R}^k$ and of the paraphrase $t$ by $v \in \mathbb{R}^k$, respectively. We experiment with various neural-based similarity scores between $u$ and $v$, denoted as follows:
1. $u^T v$ (DOTPRODUCT)
2. $u^T S v$, with $S \in \mathbb{R}^{k \times k}$ a parameter matrix learned during training. (BILINEAR)
3. $(u, v)$ concatenated, followed by two sequential fully connected layers. (FC)
4. BILINEAR concatenated with $u$ and $v$ and followed by fully connected layers. (FC-BILINEAR)
5. BILINEAR and FC linearly combined with learned weights (Figure 2). (FC-BILINEAR-SEP)
The best performing model is FC-BILINEAR-SEP, as shown in Table 1. Intuitively, BILINEAR and FC are able to extract different interaction features between the two input vectors, while their linear combination retains the best of both models.
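
The similarity measures listed above can be written down compactly. The sketch below is illustrative only: function and parameter names (`W1`, `W2`, `w_out`, `alpha`, `beta`) are hypothetical, biases and dropout are omitted, and the final linear projection `w_out` after the two hidden layers is an assumption about how the FC network produces a scalar.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def matvec(M: Matrix, x: Vector) -> Vector:
    return [dot(row, x) for row in M]


def elu(x: float) -> float:
    return x if x > 0 else math.exp(x) - 1.0


def dot_product(u: Vector, v: Vector) -> float:
    """DOTPRODUCT: u^T v."""
    return dot(u, v)


def bilinear(u: Vector, S: Matrix, v: Vector) -> float:
    """BILINEAR: u^T S v with a learned k x k matrix S."""
    return dot(u, matvec(S, v))


def fc(u: Vector, v: Vector, W1: Matrix, W2: Matrix, w_out: Vector) -> float:
    """FC: concatenate (u, v), apply two ELU layers, project to a scalar."""
    h1 = [elu(a) for a in matvec(W1, u + v)]
    h2 = [elu(a) for a in matvec(W2, h1)]
    return dot(w_out, h2)


def fc_bilinear_sep(u: Vector, v: Vector, S: Matrix, W1: Matrix, W2: Matrix,
                    w_out: Vector, alpha: float, beta: float) -> float:
    """FC-BILINEAR-SEP: learned linear combination of the two scores."""
    return alpha * bilinear(u, S, v) + beta * fc(u, v, W1, W2, w_out)
```

Setting `S` to the identity matrix makes BILINEAR coincide with DOTPRODUCT, which shows that the bilinear form is a strict generalization of the dot product.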
Training Algorithm
For training, we define two sets $P$ (positive) and $N$ (negative) that contain pairs $(q, t)$ of questions and paraphrases of logical forms that, when executed on their respective tables, give the correct or incorrect answer (see footnote 2), respectively. Our models presented in Section 3.4 map such a pair to a real-valued score representing question-logical form similarity, denoted here by the function $\Phi : Q \times T \to \mathbb{R}$. We use a max-margin loss function (with margin $\theta$) aiming to rank pairs in $P$ above pairs in $N$:

$$L(P, N) = \frac{1}{|P||N|} \sum_{p \in P} \sum_{n \in N} \max(0, \theta - \Phi(p) + \Phi(n))$$

We also experiment with a cross-entropy loss function, but achieve significantly worse results.
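
As a concrete reading of the max-margin objective, the following sketch averages the hinge term over all positive-negative score pairs. The function name `max_margin_loss` is hypothetical; the default margin of 0.2 is the value reported in Section 4.2, and the inputs are assumed to be the scores $\Phi(p)$ and $\Phi(n)$ already computed by one of the similarity networks.

```python
from typing import Sequence


def max_margin_loss(pos_scores: Sequence[float],
                    neg_scores: Sequence[float],
                    margin: float = 0.2) -> float:
    """Mean over all (p, n) pairs of max(0, margin - score(p) + score(n))."""
    if not pos_scores or not neg_scores:
        raise ValueError("both score sets must be non-empty")
    total = sum(max(0.0, margin - p + n)
                for p in pos_scores
                for n in neg_scores)
    return total / (len(pos_scores) * len(neg_scores))
```

The loss is zero once every positive pair outscores every negative pair by at least the margin, so well-separated pairs stop contributing gradient.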
4 Experiments
4.1 Dataset
We use the train-validation-test split of the WikiTableQuestions dataset containing 9,659, 1,200 and 4,344 questions, respectively. We obtain about 3.8 million training $(q, t, l)$ triples from PL2015, where $l$ is a binary indicator of correctness (whether the logical form gives the correct answer when executed). During training we ignore questions for which not a single matching pair $(q, t)$ is present. The percentage of questions for which a candidate logical form exists that evaluates to the correct answer is called the oracle score. PL2015 report an oracle score of 76.7%, but a manual annotation by [Pasupat and Liang, 2015] reveals that PL2015 can answer only 53.5% of the questions correctly. The difference can be explained by incorrect logical forms that give the correct answer by chance.
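
The oracle score and the training-time filter can be expressed in a few lines. This sketch assumes the data is available as a mapping from each question to the list of correctness indicators $l$ of its candidates, and it interprets a "matching pair" as a candidate whose indicator is true; both the data layout and the function names are hypothetical.

```python
from typing import Dict, List


def oracle_score(labels_per_question: Dict[str, List[bool]]) -> float:
    """Fraction of questions with at least one candidate yielding the correct answer."""
    if not labels_per_question:
        raise ValueError("no questions given")
    answerable = sum(1 for labels in labels_per_question.values() if any(labels))
    return answerable / len(labels_per_question)


def trainable_questions(labels_per_question: Dict[str, List[bool]]) -> List[str]:
    """Questions kept for training: those with at least one correct candidate."""
    return [q for q, labels in labels_per_question.items() if any(labels)]
```

Note that this automatic score counts candidates that reach the right answer by chance, which is why it can exceed the manually annotated figure.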
4.2 Training Details
The neural network models are implemented using TensorFlow [Abadi et al., 2016] and trained on a single Tesla P100 GPU. Training takes approximately 6 hours, during which 50,000 mini-batches are processed. Generating the textual representations from logical forms for all the questions with PL2015 takes about 14 hours on a 2016 MacBook Pro computer.

The vocabulary contains 14,151 tokens. We obtain the best results when initializing the word embeddings with the 200-dimensional GloVe vectors. Using higher-dimensional vectors does not result in significant gains in accuracy. The size of the token embeddings $x_{emb} = (x_{glove}, x_{char})$ is set to $d = d_{glove} + d_{char} = 200 + 192 = 392$. The sentence embedding CNNs span multiple tokens with widths of $L = \{2, 4, 6, 8\}$, while for the character CNN we use widths spanning 1, 2 and 3 characters. The two fully connected layers in the FC models have 500 hidden neurons, which we regularize using dropout [Srivastava et al., 2014] with a keep probability of 0.8. We use a mini-batch size of 100, each batch containing 50 different questions $q$ with one positive $t_1$ and one negative $t_2$ paraphrase. We set the margin $\theta$ of the loss function to 0.2. Loss minimization is done using the Adam optimizer [Kingma and Ba, 2014] with a learning rate of $7 \times 10^{-4}$. All hyperparameters are tuned on the development data split of the WikiTableQuestions dataset. We evaluate the model every 500 steps on the validation set, and choose the best performing model after reaching 50,000 training steps using the early stopping procedure. Each model variant is trained eight times and the best one of each variant is eventually run against the test set.

Baselines
System                                        Accuracy
Neural Programmer                             34.2%
Neural Programmer Ensemble (15 models)        37.7%
PL2015                                        37.1%

Our Models
System                                        Accuracy
CNN-DOTPRODUCT                                31.4%
CNN-BILINEAR                                  20.4%
CNN-FC                                        30.4%
CNN-FC-BILINEAR                               33.3%
RNN-FC-BILINEAR-SEP                           29.6%
CNN-FC-BILINEAR-SEP (best single)             34.8%
Ensemble (15 models)                          38.7%

Table 1: Results on the WikiTableQuestions dataset of different systems. Notation used for our models: <sentence embedding type>-<similarity network>.

2 In some cases it happens that, when executed on a particular table, a logical form gives the correct answer by chance without being the real translation of the input question.
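
The mini-batch composition described in Section 4.2 (50 questions, each contributing one positive and one negative paraphrase, for 100 pairs in total) could be sampled roughly as follows. The data layout, the function name `sample_batch` and the uniform sampling of questions and paraphrases are assumptions made for this sketch; the text does not specify the sampling strategy.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical layout: question -> (correct paraphrases, incorrect paraphrases)
Data = Dict[str, Tuple[List[str], List[str]]]


def sample_batch(data: Data, num_questions: int = 50,
                 rng: random.Random = random.Random()) -> List[Tuple[str, str, int]]:
    """Return (question, paraphrase, label) triples: one positive, one negative each."""
    eligible = [q for q, (pos, neg) in data.items() if pos and neg]
    questions = rng.sample(eligible, min(num_questions, len(eligible)))
    batch = []
    for q in questions:
        pos, neg = data[q]
        batch.append((q, rng.choice(pos), 1))   # paraphrase giving the correct answer
        batch.append((q, rng.choice(neg), 0))   # paraphrase giving a wrong answer
    return batch
```

Pairing each positive with a negative from the same question keeps every hinge term of the loss comparable within a single question.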
4.3 Results
Table 1 shows the performance of our models compared to the Neural Programmer [Neelakantan et al., 2016] and PL2015 [Pasupat and Liang, 2015] baselines. The best performing single model is a linear combination of the BILINEAR and FC models, namely CNN-FC-BILINEAR-SEP, which gives an accuracy of 34.8%. One explanation for this is that the two methods are able to recover different types of errors. Our best final model is an ensemble of 15 single models, reaching a state-of-the-art accuracy for this task of 38.7%. The score of the ensemble is calculated by averaging over the normalized scores of its constituents. The significant increase in performance of the ensemble over the single model suggests that the different models learn complementary features.
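
The ensemble scoring can be sketched as below. The text only states that normalized scores are averaged; using a softmax over each model's candidate scores as the normalization is an assumption of this sketch, and the function names are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one model's candidate scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def ensemble_rank(model_scores: List[List[float]]) -> Tuple[int, List[float]]:
    """Average normalized scores across models; return best index and averages.

    model_scores[m][i] is the raw score model m assigns to candidate i.
    """
    normalized = [softmax(s) for s in model_scores]   # assumed normalization
    n_cand = len(model_scores[0])
    avg = [sum(m[i] for m in normalized) / len(normalized) for i in range(n_cand)]
    best = max(range(n_cand), key=avg.__getitem__)
    return best, avg
```

Normalizing before averaging prevents a single model with a larger score range from dominating the ensemble decision.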
In an additional experiment, we use a recurrent neural network (RNN) for the sentence embedding, observing that the model RNN-FC-BILINEAR-SEP performs significantly worse than the corresponding CNN variant. RNNs are known to work well with sequence data, while CNNs can capture patterns in a bag-of-n-grams manner, which is more suitable for the paraphrases produced by Algorithm 1.

There are a few reasons for the low accuracy obtained on this task by various methods (including ours) compared to other NLP problems. Weak supervision, small training size and a high percentage of unanswerable questions (see footnote 3) contribute to this difficulty.

Question                                                          Paraphrase
Which association entered last?                                   association of last row
                                                                  association of row with highest number of joining year
What is the total of all the medals?                              count all rows
                                                                  number of total of nation is total
How many episodes were originally aired before December 1965?     count original air date as date <= 12 1965
                                                                  count original air date as date < 12 1965

Table 2: Example questions highlighting common mistakes our model makes. Correctness of a logical form is indicated by green color, whereas red color represents an incorrect logical form.

System                                Accuracy (Dev)
CNN-FC-BILINEAR-SEP                   34.1%
  without Dropout                     33.3%
  without Character Embeddings        33.8%
  without GloVe                       32.4%
  without Paraphrasing                33.1%

Table 3: Contributions of each component of our model.
4.4 Ablation Studies
For a better understanding of our model, we investigate
the usefulness of various components with an ablation study
shown in Table 3. Leaving out the character embeddings has a
marginal effect on accuracy. Regularizing the fully connected
layers using dropout is important. However, the biggest im-
pact on accuracy comes from using GloVe pre-trained word
vectors to initialize the token embeddings, since switching to
random initializations significantly decreases the accuracy.
In order to test the effect of the paraphrasing on the quality
of the results, we conduct an additional experiment by re-
placing the paraphrase with the raw strings of the Lambda
DCS expressions. The results are worse by a small margin,
confirming that the paraphrasing method is not inducing additional
errors. Moreover, our neural network component has
the biggest impact on the success of our method.
4.5 Analysis of Correct Answers
To gain further insight into our approach, we analyze how well our best single model, CNN-FC-BILINEAR-SEP, performs on various question types. We manually annotate 80 randomly chosen questions that are correctly answered by our model. Results are shown in Table 4. The biggest contribution to accuracy stems from questions containing aggregation, next or previous operations (see footnote 4), even though they account for only 20.5% of all the questions (see footnote 5).

Question type                     Amount (%)
Lookup                            10.8%
Aggregation + next, previous      39.8%
Superlatives                      30.1%
Arithmetic and Comparisons        19.3%

Table 4: Type distribution of correctly answered questions.

3 [Pasupat and Liang, 2015] state that 21% of questions cannot be answered because of various issues like annotation errors or tables requiring advanced normalization.
4 Next/previous fetch the table row below/above the current one.
4.6 Error Analysis
The questions our models do not answer correctly can be split into two categories: either a correct logical form is not generated, or our scoring models do not rank the correct one at the top. In many cases, the correctness of a logical form depends highly on the table structure. We perform a qualitative analysis, presented in Table 2, to reveal common question types our models often rank incorrectly. The first two examples show questions whose correct logical form depends on the structure of the table. In these cases a bias towards the more general logical form is often exhibited. The third example shows that our model has difficulty distinguishing operators that differ only slightly (e.g. "smaller than" and "smaller than or equal to"), which may be due to weak supervision. Since we do not have access to the ground truth logical form during training, but only to the correct answer, queries using such slightly different operators yield the same answer except in edge cases.
5 Conclusion
In this paper we propose a two stage QA system for semi-
structured tables. The first stage consists of a standard method
for generating candidate logical forms and a simple approach
for transforming logical forms into textual paraphrases un-
derstandable by non-expert users. The second stage is a fully
neural model that ranks the candidate logical forms indirectly
through their respective paraphrases, eliminating the need for
manually designed features. Experiments show that an en-
semble of our models reaches state-of-the-art accuracy on the
WikiTableQuestions dataset, thus indicating its capability to
answer complex, multi-compositional questions. In the future
we plan to advance this work by extending it to reason
on queries across multiple tables and to work on an
end-to-end approach where a joint training of both stages can
be achieved.

Our code is publicly available at https://github.com/dalab/neural_qa.

5 The distribution of the various question types in the WikiTableQuestions dataset can be found at: http://cs.stanford.edu/~ppasupat/resource/ACL2015-poster.pdf
References
[Abadi et al., 2016] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[Berant and Liang, 2014] Jonathan Berant and Percy Liang. Semantic parsing via paraphrasing. In ACL (1), pages 1415–1425, 2014.
[Bordes et al., 2014a] Antoine Bordes, Sumit Chopra, and Jason Weston. Question answering with subgraph embeddings. arXiv preprint arXiv:1406.3676, 2014.
[Bordes et al., 2014b] Antoine Bordes, Jason Weston, and Nicolas Usunier. Open question answering with weakly supervised embedding models. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 165–180. Springer, 2014.
[Clark et al., 2016] Peter Clark, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter D Turney, and Daniel Khashabi. Combining retrieval, statistics, and inference to answer elementary science questions. In AAAI, pages 2580–2586, 2016.
[Clevert et al., 2015] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015.
[Denton et al., 2015] Emily Denton, Jason Weston, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. User conditional hashtag prediction for images. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1731–1740. ACM, 2015.
[Dong et al., 2015] Li Dong, Furu Wei, Ming Zhou, and Ke Xu. Question answering over freebase with multi-column convolutional neural networks. In ACL (1), pages 260–269, 2015.
[Jauhar et al., 2016] Sujay Kumar Jauhar, Peter D Turney, and Eduard Hovy. Tables as semi-structured knowledge for question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, volume 1, pages 474–483, 2016.
[Kim et al., 2015] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. Character-aware neural language models. arXiv preprint arXiv:1508.06615, 2015.
[Kim, 2014] Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
[Kingma and Ba, 2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[Kwiatkowski et al., 2013] Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke Zettlemoyer. Scaling semantic parsers with on-the-fly ontology matching. In Proceedings of EMNLP. Citeseer, 2013.
[Liang, 2013] Percy Liang. Lambda dependency-based compositional semantics. arXiv preprint arXiv:1309.4408, 2013.
[Mikolov et al., 2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
[Neelakantan et al., 2016] Arvind Neelakantan, Quoc V Le, Martin Abadi, Andrew McCallum, and Dario Amodei. Learning a natural language interface with neural programmer. arXiv preprint arXiv:1611.08945, 2016.
[Pasupat and Liang, 2015] Panupong Pasupat and Percy Liang. Compositional semantic parsing on semi-structured tables. arXiv preprint arXiv:1508.00305, 2015.
[Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543, 2014.
[Reddy et al., 2014] Siva Reddy, Mirella Lapata, and Mark Steedman. Large-scale semantic parsing without question-answer pairs. Transactions of the Association for Computational Linguistics, 2:377–392, 2014.
[Socher et al., 2013] Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, Christopher Potts, et al. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the conference on empirical methods in natural language processing (EMNLP), pages 1631–1642. Citeseer, 2013.
[Srivastava et al., 2014] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[Wang et al., 2015] Yushi Wang, Jonathan Berant, Percy Liang, et al. Building a semantic parser overnight. In ACL (1), pages 1332–1342, 2015.
[Yin et al., 2015] Pengcheng Yin, Zhengdong Lu, Hang Li, and Ben Kao. Neural enquirer: Learning to query tables with natural language. arXiv preprint arXiv:1512.00965, 2015.
[Zhang et al., 2015] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in neural information processing systems, pages 649–657, 2015.