# Proceedings of the AAAI Conference on Artificial Intelligence

Published by Association for the Advancement of Artificial Intelligence (AAAI)

Print ISSN: 2159-5399

Learning probabilistic predictive models that are well calibrated is critical for many prediction and decision-making tasks in artificial intelligence. In this paper we present a new non-parametric calibration method called Bayesian Binning into Quantiles (BBQ) which addresses key limitations of existing calibration methods. The method post-processes the output of a binary classification algorithm; thus, it can be readily combined with many existing classification algorithms. The method is computationally tractable and empirically accurate, as evidenced by the set of experiments reported here on both real and simulated datasets.
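As a rough illustration of the binning idea, the sketch below post-processes classifier scores with a single equal-frequency (quantile) binning. BBQ itself averages over many binning models under a Bayesian score, so this is a simplified stand-in, and the miscalibrated scores are synthetic:

```python
import numpy as np

def quantile_bin_calibrate(scores, labels, n_bins=10):
    """Histogram-binning calibration with equal-frequency (quantile) bins.

    A simplified sketch of the binning idea behind BBQ: BBQ averages over
    many binnings with Bayesian model scores; here we use one quantile
    binning for illustration.
    """
    # Bin edges at empirical quantiles so each bin holds ~equal mass.
    edges = np.quantile(scores, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = 0.0, 1.0
    # Calibrated probability per bin = empirical positive rate in that bin.
    idx = np.clip(np.searchsorted(edges, scores, side="right") - 1, 0, n_bins - 1)
    bin_rate = np.array([labels[idx == b].mean() if np.any(idx == b) else 0.5
                         for b in range(n_bins)])

    def calibrate(new_scores):
        j = np.clip(np.searchsorted(edges, new_scores, side="right") - 1,
                    0, n_bins - 1)
        return bin_rate[j]

    return calibrate

# Synthetic miscalibrated classifier: the true positive rate is score**2.
rng = np.random.default_rng(0)
s = rng.uniform(size=5000)
y = (rng.uniform(size=5000) < s**2).astype(float)
cal = quantile_bin_calibrate(s, y)
# Near score 0.55 the calibrated output should fall toward ~0.3, not 0.55.
p = cal(np.array([0.55]))[0]
```

The calibrated value tracks the empirical positive rate per bin, which is exactly the quantity a calibrated model should report.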

…

Linear Dynamical System (LDS) is an elegant mathematical framework for modeling and learning Multivariate Time Series (MTS). However, in general, it is difficult to set the dimension of an LDS's hidden state space. A small number of hidden states may not be able to model the complexities of an MTS, while a large number of hidden states can lead to overfitting. In this paper, we study learning methods that impose various regularization penalties on the transition matrix of the LDS model and propose a regularized LDS learning framework (rLDS) which aims to (1) automatically shut down an LDS's spurious and unnecessary dimensions, and consequently, address the problem of choosing the optimal number of hidden states; (2) prevent the overfitting problem given a small amount of MTS data; and (3) support accurate MTS forecasting. To learn the regularized LDS from data, we incorporate a second order cone program and a generalized gradient descent method into the Maximum a Posteriori framework and use Expectation Maximization to obtain a low-rank transition matrix of the LDS model. We propose two priors for modeling the matrix which lead to two instances of our rLDS. We show that our rLDS recovers the intrinsic dimensionality of the time series dynamics well and improves predictive performance over baselines on both synthetic and real-world MTS datasets.
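One standard building block for obtaining a low-rank transition matrix under a trace-norm-style penalty is singular-value soft-thresholding, the proximal operator of the nuclear norm. The sketch below illustrates only this building block on a made-up transition matrix, not the paper's full EM / second-order-cone solver:

```python
import numpy as np

def svt(A, tau):
    """Singular-value soft-thresholding: prox of tau * ||A||_* (nuclear norm).

    Shrinks every singular value by tau and zeroes those below it, which is
    how a trace-norm penalty can 'shut down' weak hidden-state dimensions.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

# Illustrative transition matrix: two strong modes and three weak ones.
A = np.diag([3.0, 2.0, 0.3, 0.2, 0.1])
A_hat = svt(A, tau=0.5)
# The weak dimensions are eliminated, leaving an effective rank of 2.
rank_hat = int(np.sum(np.linalg.svd(A_hat, compute_uv=False) > 1e-8))
```

In a full rLDS-style learner this operator would be applied repeatedly inside the M-step rather than once.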

…

Pairwise Markov Networks (PMN) are an important class of Markov networks which, due to their simplicity, are widely used in many applications such as image analysis, bioinformatics, sensor networks, etc. However, learning of Markov networks from data is a challenging task; there are many possible structures one must consider, and each of these structures comes with its own parameters, making it easy to overfit the model with limited data. To deal with this problem, recent learning methods build upon L1 regularization to express a bias towards sparse network structures. In this paper, we propose a new and more flexible framework that lets us bias the structure and can, for example, encode a preference for networks with certain local substructures which as a whole exhibit some special global structure. We experiment with and show the benefit of our framework on two types of problems: learning of modular networks and learning of traffic network models.

…

Partially Observable Markov Decision Processes have been studied widely as a model for decision making under uncertainty, and a number of methods have been developed to find the solutions for such processes. Such studies often involve calculation of the value function of a specific policy, given a model of the transition and observation probabilities, and the reward. These models can be learned using labeled samples of on-policy trajectories. However, when using empirical models, some bias and variance terms are introduced into the value function as a result of imperfect models. In this paper, we propose a method for estimating the bias and variance of the value function in terms of the statistics of the empirical transition and observation model. Such error terms can be used to meaningfully compare the value of different policies. This is an important result for sequential decision-making, since it will allow us to provide more formal guarantees about the quality of the policies we implement. To evaluate the precision of the proposed method, we provide supporting experiments on problems from the field of robotics and medical decision making.

…

Subset selection from massive data with noisy information is increasingly popular for various applications. This problem is still highly challenging as current methods are generally slow and sensitive to outliers. To address these two issues, we propose an accelerated robust subset selection (ARSS) method. Specifically, in the subset selection area, this is the first attempt to employ the $\ell_{p}\left(0<p\leq1\right)$-norm based measure for the representation loss, preventing large errors from dominating our objective. As a result, the robustness against outlier elements is greatly enhanced. Moreover, data size is generally much larger than feature length, i.e. $N\!\gg\! L$. Based on this observation, we propose a speedup solver (via ALM and equivalent derivations) to greatly reduce the computational cost, theoretically from $O\left(N^{4}\right)$ to $O\left(N^{2}L\right)$. Extensive experiments on ten benchmark datasets verify that our method not only outperforms state-of-the-art methods, but also runs 10,000+ times faster than the most related method.
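The robustness argument can be seen in one dimension: fitting a single representative value under the squared loss is dragged toward a gross outlier, while an $\ell_1$ (and more generally $\ell_p$, $p\leq 1$) loss is not. This toy grid search is an analogue of the representation-loss choice, not the ARSS algorithm itself:

```python
import numpy as np

def fit_rep(x, p, grid):
    """Pick the grid value c minimizing sum_i |x_i - c|^p."""
    losses = [np.sum(np.abs(x - c) ** p) for c in grid]
    return grid[int(np.argmin(losses))]

# Five inliers near 1.0 plus one gross outlier at 100.
x = np.array([1.0, 1.1, 0.9, 1.05, 0.95, 100.0])
grid = np.linspace(0, 100, 10001)

c_l2 = fit_rep(x, 2.0, grid)  # squared loss: pulled to the mean (~17.5)
c_l1 = fit_rep(x, 1.0, grid)  # l_1 loss: stays with the inlier median
```

The squared-loss fit lands near the mean 17.5, dominated by the single outlier; the $\ell_1$ fit stays near 1.0.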

…

Abstract argumentation framework (\AFname) is a unifying framework able to
encompass a variety of nonmonotonic reasoning approaches, logic programming and
computational argumentation. Yet, efficient approaches for most of the decision
and enumeration problems associated with \AFname s are missing, thus potentially
limiting the efficacy of argumentation-based approaches in real domains. In
this paper, we present an algorithm for enumerating the preferred extensions of
abstract argumentation frameworks which exploits parallel computation. To this
purpose, the SCC-recursive semantics definition schema is adopted, where
extensions are defined at the level of specific sub-frameworks. The algorithm
shows significant performance improvements in large frameworks, in terms of
number of solutions found and speedup.

…

Unlike unsupervised approaches such as autoencoders that learn to reconstruct
their inputs, this paper introduces an alternative approach to unsupervised
feature learning called divergent discriminative feature accumulation (DDFA)
that instead continually accumulates features that make novel discriminations
among the training set. Thus DDFA features are inherently discriminative from
the start even though they are trained without knowledge of the ultimate
classification problem. Interestingly, DDFA also continues to add new features
indefinitely (so it does not depend on a hidden layer size), is not based on
minimizing error, and is inherently divergent instead of convergent, thereby
providing a unique direction of research for unsupervised feature learning. In
this paper the quality of its learned features is demonstrated on the MNIST
dataset, where its performance confirms that indeed DDFA is a viable technique
for learning useful features.

…

Higher-order tensors are becoming prevalent in many scientific areas such as
computer vision, social network analysis, data mining and neuroscience.
Traditional tensor decomposition approaches face three major challenges: model
selection, gross corruptions and computational efficiency. To address these
problems, we first propose a parallel trace norm regularized tensor
decomposition method, and formulate it as a convex optimization problem. This
method does not require the rank of each mode to be specified beforehand, and
can automatically determine the number of factors in each mode through our
optimization scheme. By considering the low-rank structure of the observed
tensor, we analyze the equivalent relationship of the trace norm between a
low-rank tensor and its core tensor. Then, we cast a non-convex tensor
decomposition model into a weighted combination of multiple much smaller-scale
matrix trace norm minimization problems. Finally, we develop two parallel alternating
direction methods of multipliers (ADMM) to solve our problems. Experimental
results verify that our regularized formulation is effective, and our methods
are robust to noise or outliers.

…

Recently, there has been a growing interest in modeling planning with
information constraints. Accordingly, an agent maximizes a regularized expected
utility known as the free energy, where the regularizer is given by the
information divergence from a prior to a posterior policy. While this approach
can be justified in various ways, including from statistical mechanics and
information theory, it is still unclear how it relates to decision-making
against adversarial environments. This connection has previously been suggested
in work relating the free energy to risk-sensitive control and to extensive
form games. Here, we show that a single-agent free energy optimization is
equivalent to a game between the agent and an imaginary adversary. The
adversary can, by paying an exponential penalty, generate costs that diminish
the decision maker's payoffs. It turns out that the optimal strategy of the
adversary consists in choosing costs so as to render the decision maker
indifferent among its choices, which is a defining property of a Nash
equilibrium, thus tightening the connection between free energy optimization
and game theory.
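The single-agent objective described here, expected utility regularized by an information divergence from a prior policy, has a closed-form maximizer: the Boltzmann (softmax) policy. The check below verifies this numerically on an illustrative three-action problem (the utilities, prior, and inverse temperature are made up):

```python
import numpy as np

# Free-energy objective: F(pi) = E_pi[U] - (1/beta) * KL(pi || prior).
# Its maximizer is pi*(a) proportional to prior(a) * exp(beta * U(a)).

U = np.array([1.0, 2.0, 0.5])      # payoffs for three actions (illustrative)
prior = np.array([1/3, 1/3, 1/3])  # uniform prior policy
beta = 2.0

def free_energy(pi):
    kl = np.sum(pi * np.log(pi / prior))
    return pi @ U - kl / beta

# Closed-form optimum: Boltzmann policy.
pi_star = prior * np.exp(beta * U)
pi_star /= pi_star.sum()
F_star = free_energy(pi_star)

# No randomly drawn policy should achieve a higher free energy.
rng = np.random.default_rng(1)
best_rand = max(free_energy(rng.dirichlet(np.ones(3))) for _ in range(2000))
```

In the adversarial reading of the abstract, the exponential penalty paid by the imaginary adversary is exactly what produces this soft-max rather than a hard max.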

…

Online advertising is a large and important industry. Knowledge of website
attributes can contribute greatly to business strategies for ad-targeting,
content display, inventory purchase, or revenue prediction. Classical inference
on users and sites poses a challenge, because the data is voluminous, sparse,
high-dimensional and noisy. In this paper, we introduce a stochastic blockmodel
for the website relations induced by the event of online user visitation. We
propose two clustering algorithms to discover the connection structures of
websites, and compare their performance with a goodness-of-fit method and a
deterministic graph partitioning method. We demonstrate the effectiveness of
our algorithms on both simulated data and an AOL website dataset.

…

Mobile geo-location advertising, where mobile ads are targeted based on a
user's location, has been identified as a key growth factor for the mobile
market. As with online advertising, a crucial ingredient for their success is
the development of effective economic mechanisms. An important difference is
that mobile ads are shown sequentially over time and information about the user
can be learned based on their movements. Furthermore, ads need to be shown
selectively to prevent ad fatigue. To this end, we introduce, for the first
time, a user model and suitable economic mechanisms which take these factors
into account. Specifically, we design two truthful mechanisms which produce an
advertisement plan based on the user's movements. One mechanism is allocatively
efficient, but requires exponential compute time in the worst case. The other
requires polynomial time, but is not allocatively efficient. Finally, we
experimentally evaluate the tradeoff between compute time and efficiency of our
mechanisms.

…

We address the problem of planning collision-free paths for multiple agents
using optimization methods known as proximal algorithms. Recently this approach
was explored in Bento et al. 2013, which demonstrated its ease of
parallelization and decentralization, the speed with which the algorithms
generate good quality solutions, and its ability to incorporate different
proximal operators, each ensuring that paths satisfy a desired property.
Unfortunately, the operators derived only apply to paths in 2D and require that
any intermediate waypoints we might want agents to follow be preassigned to
specific agents, limiting their range of applicability. In this paper we
resolve these limitations. We introduce new operators to deal with agents
moving in arbitrary dimensions that are faster to compute than their 2D
predecessors and we introduce landmarks, space-time positions that are
automatically assigned to the set of agents under different optimality
criteria. Finally, we report the performance of the new operators in several
numerical experiments.

…

Machine learning algorithms have been applied to predict agent behaviors in
real-world dynamic systems, such as advertiser behaviors in sponsored search
and worker behaviors in crowdsourcing. The behavior data in these systems are
generated by live agents: once the systems change due to the adoption of the
prediction models learnt from the behavior data, agents will observe and
respond to these changes by changing their own behaviors accordingly. As a
result, the behavior data will evolve and will not be identically and
independently distributed, posing great challenges to the theoretical analysis
on the machine learning algorithms for behavior prediction. To tackle this
challenge, in this paper, we propose to use Markov Chain in Random Environments
(MCRE) to describe the behavior data, and perform generalization analysis of
the machine learning algorithms on its basis. Since the one-step transition
probability matrix of MCRE depends on both previous states and the random
environment, conventional techniques for generalization analysis cannot be
directly applied. To address this issue, we propose a novel technique that
transforms the original MCRE into a higher-dimensional time-homogeneous Markov
chain. The new Markov chain involves more variables but is more regular, and
thus easier to deal with. We prove the convergence of the new Markov chain when
time approaches infinity. Then we prove a generalization bound for the machine
learning algorithms on the behavior data generated by the new Markov chain,
which depends on both the Markovian parameters and the covering number of the
function class compounded by the loss function for behavior prediction and the
behavior prediction model. To the best of our knowledge, this is the first work
that performs the generalization analysis on data generated by complex
processes in real-world dynamic systems.

…

Many multi-agent coordination problems can be represented as DCOPs. Motivated
by task allocation in disaster response, we extend standard DCOP models to
consider uncertain task rewards where the outcome of completing a task depends
on its current state, which is randomly drawn from unknown distributions. The
goal of solving this problem is to find a solution for all agents that
minimizes the overall worst-case loss. This is a challenging problem for
centralized algorithms because the search space grows exponentially with the
number of agents, and it is nontrivial for existing DCOP algorithms. To
address this, we propose a novel decentralized algorithm that incorporates
Max-Sum with iterative constraint generation to solve the problem by passing
messages among agents. By so doing, our approach scales well and can solve
instances of the task allocation problem with hundreds of agents and tasks.

…

We study the problem of eliciting and aggregating probabilistic information
from multiple agents. In order to successfully aggregate the predictions of
agents, the principal needs to elicit some notion of confidence from agents,
capturing how much experience or knowledge led to their predictions. To
formalize this, we consider a principal who wishes to elicit predictions about
a random variable from a group of Bayesian agents, each of whom has privately
observed some independent samples of the random variable, and hopes to
aggregate the predictions as if she had directly observed the samples of all
agents. Leveraging techniques from Bayesian statistics, we represent confidence
as the number of samples an agent has observed, which is quantified by a
hyperparameter from a conjugate family of prior distributions. This then allows
us to show that if the principal has access to a few samples, she can achieve
her aggregation goal by eliciting predictions from agents using proper scoring
rules. In particular, if she has access to one sample, she can successfully
aggregate the agents' predictions if and only if every posterior predictive
distribution corresponds to a unique value of the hyperparameter. Furthermore,
this uniqueness holds for many common distributions of interest. When this
uniqueness property does not hold, we construct a novel and intuitive mechanism
where a principal with two samples can elicit and optimally aggregate the
agents' predictions.

…

This paper introduces a principled approach for the design of a scalable general reinforcement learning agent. This approach is based on a direct approximation of AIXI, a Bayesian optimality notion for general reinforcement learning agents. Previously, it has been unclear whether the theory of AIXI could motivate the design of practical algorithms. We answer this hitherto open question in the affirmative, by providing the first computationally feasible approximation to the AIXI agent. To develop our approximation, we introduce a Monte Carlo Tree Search algorithm along with an agent-specific extension of the Context Tree Weighting algorithm. Empirically, we present a set of encouraging results on a number of stochastic, unknown, and partially observable domains.

…

During broadcast events such as the Superbowl, the U.S. Presidential and
Primary debates, etc., Twitter has become the de facto platform for crowds to
share perspectives and commentaries about them. Given an event and an
associated large-scale collection of tweets, there are two fundamental research
problems that have been receiving increasing attention in recent years. One is
to extract the topics covered by the event and the tweets; the other is to
segment the event. So far these problems have been viewed separately and
studied in isolation. In this work, we argue that these problems are in fact
inter-dependent and should be addressed together. We develop a joint Bayesian
model that performs topic modeling and event segmentation in one unified
framework. We evaluate the proposed model both quantitatively and qualitatively
on two large-scale tweet datasets associated with two events from different
domains to show that it improves significantly over baseline models.

…

Word alignment is an important natural language processing task that
indicates the correspondence between natural languages. Recently, unsupervised
learning of log-linear models for word alignment has received considerable
attention as it combines the merits of generative and discriminative
approaches. However, a major challenge still remains: it is intractable to
calculate the expectations of non-local features that are critical for
capturing the divergence between natural languages. We propose a contrastive
approach that aims to differentiate observed training examples from noises. It
not only introduces prior knowledge to guide unsupervised learning but also
cancels out partition functions. Based on the observation that the probability
mass of log-linear models for word alignment is usually highly concentrated, we
propose to use top-n alignments to approximate the expectations with respect to
posterior distributions. This allows for efficient and accurate calculation of
expectations of non-local features. Experiments show that our approach achieves
significant improvements over state-of-the-art unsupervised word alignment
methods.
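The top-n approximation exploits exactly the concentration property the abstract mentions: when almost all posterior mass sits on a few configurations, renormalizing over only those configurations approximates the full expectation. A toy numeric check (scores and feature values are made up, not an alignment model):

```python
import numpy as np

# Log-potentials of six configurations; mass is concentrated on the top two.
scores = np.array([9.0, 8.5, 3.0, 2.0, 1.0, 0.5])
f = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0])  # a feature value per configuration

# Exact expectation under the normalized log-linear distribution.
p = np.exp(scores - scores.max())
p /= p.sum()
exact = p @ f

# Top-n approximation: renormalize over the n best-scoring configurations.
n = 2
top = np.argsort(scores)[::-1][:n]
p_top = np.exp(scores[top] - scores[top].max())
p_top /= p_top.sum()
approx = p_top @ f[top]
```

Because the discarded configurations carry negligible probability, the two expectations agree closely, while the top-n version never touches the full partition function.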

…

We study propagation algorithms for the conjunction of two AllDifferent constraints. Solutions of an AllDifferent constraint can be seen as perfect matchings in the variable/value bipartite graph. Therefore, we investigate the problem of finding simultaneous bipartite matchings. We present an extension of Hall's famous theorem which characterizes when simultaneous bipartite matchings exist. Unfortunately, finding such matchings is NP-hard in general. However, we prove the surprising result that finding a simultaneous matching on a convex bipartite graph takes just polynomial time. Based on this theoretical result, we provide the first polynomial-time bound consistency algorithm for the conjunction of two AllDifferent constraints. We identify a pathological problem on which this propagator is exponentially faster than existing propagators. Our experiments show that this new propagator can offer significant benefits over existing methods. (Published in the Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2010.)
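The starting point, that one AllDifferent constraint is satisfiable exactly when the variable/value bipartite graph has a matching covering all variables, can be checked with a short augmenting-path routine. This is the single-constraint foundation only, not the simultaneous-matching algorithm; the domains below are illustrative:

```python
# Feasibility of one AllDifferent constraint = existence of a matching that
# covers every variable in the variable/value bipartite graph.

def has_perfect_matching(domains, values):
    """Hopcroft-Karp-style augmenting paths (simple recursive variant)."""
    match = {v: None for v in values}  # value -> variable currently using it

    def augment(var, dom, seen):
        for v in dom:
            if v in seen:
                continue
            seen.add(v)
            # Take v if free, or if its current owner can be re-matched.
            if match[v] is None or augment(match[v], domains[match[v]], seen):
                match[v] = var
                return True
        return False

    return all(augment(x, dom, set()) for x, dom in domains.items())

# x, y, z must take pairwise-different values from their domains: feasible.
ok = has_perfect_matching({"x": [1, 2], "y": [2, 3], "z": [1, 3]}, [1, 2, 3])
# A Hall violator: three variables sharing only two values: infeasible.
bad = has_perfect_matching({"x": [1, 2], "y": [1, 2], "z": [1, 2]}, [1, 2])
```

Hall's condition (every set of variables sees at least as many values) is what the second example violates, and the paper's contribution is extending this picture to two AllDifferent constraints simultaneously.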

…

Simple board games, like Tic-Tac-Toe and CONNECT-4, play an important role
not only in the development of mathematical and logical skills, but also in
emotional and social development. In this paper, we address the problem of
generating targeted starting positions for such games. This can facilitate new
approaches for bringing novice players to mastery, and also leads to discovery
of interesting game variants. We present an approach that generates starting
states of varying hardness levels for player~$1$ in a two-player board game,
given rules of the board game, the desired number of steps required for
player~$1$ to win, and the expertise levels of the two players. Our approach
leverages symbolic methods and iterative simulation to efficiently search the
extremely large state space. We present experimental results that include
discovery of states of varying hardness levels for several simple grid-based
board games. The presence of such states for standard game variants like $4
\times 4$ Tic-Tac-Toe opens up new games to play: the default start state is
heavily biased, so these variants have rarely been played otherwise.

…

We explore the idea that authoring a piece of text is an act of maximizing
one's expected utility. To make this idea concrete, we consider the societally
important decisions of the Supreme Court of the United States. Extensive past
work in quantitative political science provides a framework for empirically
modeling the decisions of justices and how they relate to text. We incorporate
into such a model texts authored by amici curiae ("friends of the court"
separate from the litigants) who seek to weigh in on the decision, then
explicitly model their goals in a random utility model. We demonstrate the
benefits of this approach in improved vote prediction and the ability to
perform counterfactual analysis.

…

Answering conjunctive queries (CQs) over $\mathcal{EL}$ knowledge bases (KBs)
with complex role inclusions is PSPACE-hard and in PSPACE in certain cases;
however, if complex role inclusions are restricted to role transitivity, the
tight upper complexity bound has so far been unknown. Furthermore, the existing
algorithms cannot handle reflexive roles, and they are not practicable.
Finally, the problem is tractable for acyclic CQs and $\mathcal{ELH}$, and
NP-complete for unrestricted CQs and $\mathcal{ELHO}$ KBs. In this paper we
complete the complexity landscape of CQ answering for several important cases.
In particular, we present a practicable NP algorithm for answering CQs over
$\mathcal{ELHO}^s$ KBs---a logic containing all of OWL 2 EL, but with complex
role inclusions restricted to role transitivity. Our preliminary evaluation
suggests that the algorithm can be suitable for practical use. Moreover, we
show that, even for a restricted class of so-called arborescent acyclic
queries, CQ answering over $\mathcal{EL}$ KBs becomes NP-hard in the presence
of either transitive or reflexive roles. Finally, we show that answering
arborescent CQs over $\mathcal{ELHO}$ KBs is tractable, whereas answering
acyclic CQs is NP-hard.

…

We study techniques to incentivize self-interested agents to form socially
desirable solutions in scenarios where they benefit from mutual coordination.
Towards this end, we consider coordination games where agents have different
intrinsic preferences but they stand to gain if others choose the same strategy
as them. For non-trivial versions of our game, stable solutions like Nash
Equilibrium may not exist, or may be socially inefficient even when they do
exist. This motivates us to focus on designing efficient algorithms to compute
(almost) stable solutions like Approximate Equilibrium that can be realized if
agents are provided some additional incentives. Our results apply in many
settings like adoption of new products, project selection, and group formation,
where a central authority can direct agents towards a strategy but agents may
defect if they have better alternatives. We show that for any given instance,
we can either compute a high quality approximate equilibrium or a near-optimal
solution that can be stabilized by providing small payments to some players. We
then generalize our model to encompass situations where player relationships
may exhibit complementarities and present an algorithm to compute an
Approximate Equilibrium whose stability factor is linear in the degree of
complementarity. Our results imply that a little influence is necessary in
order to ensure that selfish players coordinate and form socially efficient
solutions.

…

This paper presents a technique for approximating, up to any precision, the set of subgame-perfect equilibria (SPE) in discounted repeated games. The process starts with a single hypercube approximation of the set of SPE. Then the initial hypercube is gradually partitioned into a set of smaller adjacent hypercubes, while those hypercubes that cannot contain any point belonging to the set of SPE are simultaneously withdrawn. Whether a given hypercube can contain an equilibrium point is verified by an appropriate mathematical program. Three different formulations of the algorithm for both approximately computing the set of SPE payoffs and extracting players' strategies are then proposed: the first two do not assume any external coordination between players, while the third assumes a certain level of coordination during game play for convexifying the set of continuation payoffs after any repeated-game history. Special attention is paid to the question of extracting players' strategies and their representability in the form of finite automata, an important feature for artificial agent systems.

…

The expressive power of a Gaussian process (GP) model comes at a cost of poor
scalability in the data size. To improve its scalability, this paper presents a
low-rank-cum-Markov approximation (LMA) of the GP model that is novel in
leveraging the dual computational advantages stemming from complementing a
low-rank approximate representation of the full-rank GP based on a support set
of inputs with a Markov approximation of the resulting residual process; the
latter approximation is guaranteed to be closest in the Kullback-Leibler
distance criterion subject to some constraint and is considerably more refined
than that of existing sparse GP models utilizing low-rank representations due
to its more relaxed conditional independence assumption (especially with larger
data). As a result, our LMA method can trade off between the size of the
support set and the order of the Markov property to (a) incur lower
computational cost than such sparse GP models while achieving predictive
performance comparable to them and (b) accurately represent features/patterns
of any scale. Interestingly, varying the Markov order produces a spectrum of
LMAs with PIC approximation and full-rank GP at the two extremes. An advantage
of our LMA method is that it is amenable to parallelization on multiple
machines/cores, thereby gaining greater scalability. Empirical evaluation on
three real-world datasets in clusters of up to 32 computing nodes shows that
our centralized and parallel LMA methods are significantly more time-efficient
and scalable than state-of-the-art sparse and full-rank GP regression methods
while achieving comparable predictive performances.

…

We have carefully instrumented a large portion of the population living in a
university graduate dormitory by giving participants Android smart phones
running our sensing software. In this paper, we propose the novel problem of
predicting mobile application (known as "apps") installation using social
networks and explain its challenges. Modern smartphones, like the ones used in
our study, are able to collect different social networks using built-in
sensors (e.g., the Bluetooth proximity network and the call log network). While
this information is accessible to app market makers such as the iPhone
AppStore, it has not yet been studied how app market makers can use this
information for
marketing research and strategy development. We develop a simple computational
model to better predict app installation by using a composite network computed
from the different networks sensed by phones. Our model also captures
individual variance and exogenous factors in app adoption. We show the
importance of considering all these factors in predicting app installations,
and we observe the surprising result that app installation is indeed
predictable. We also show that our model achieves the best results compared
with generic approaches: our results are four times better than random
guessing, and we predict almost 45% of all apps users install with almost 45%
precision (F1 score = 0.43).

…

We consider the problem of learning soft assignments of $N$ items to $K$
categories given two sources of information: an item-category similarity
matrix, which encourages items to be assigned to categories they are similar to
(and to not be assigned to categories they are dissimilar to), and an item-item
similarity matrix, which encourages similar items to have similar assignments.
We propose a simple quadratic programming model that captures this intuition.
We give necessary conditions for its solution to be unique, define an
out-of-sample mapping, and derive a simple, effective training algorithm based
on the alternating direction method of multipliers. The model predicts
reasonable assignments from even a few similarity values, and can be seen as a
generalization of semisupervised learning. It is particularly useful when items
naturally belong to multiple categories, as for example when annotating
documents with keywords or pictures with tags, with partially tagged items, or
when the categories have complex interrelations (e.g. hierarchical) that are
unknown.
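The model's intuition can be sketched with a simplified quadratic program: keep the soft assignments $Z$ (items $\times$ categories) close to the item-category similarity matrix while a graph Laplacian built from the item-item similarity smooths the assignments of similar items. This drops the paper's constraints and ADMM training, keeping only the unconstrained quadratic, which has a closed-form solution; all matrices below are illustrative:

```python
import numpy as np

# Simplified objective:  min_Z ||Z - B||_F^2 + lam * tr(Z^T L Z)
# First-order condition:  (I + lam * L) Z = B   (solved column-wise).

B = np.array([[1.0, 0.0],   # item 0 resembles category 0
              [0.0, 1.0],   # item 1 resembles category 1
              [0.0, 0.0]])  # item 2: no direct category information
W = np.array([[0, 0, 1],    # item 2 is similar to item 0 only
              [0, 0, 0],
              [1, 0, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W  # unnormalized graph Laplacian
lam = 5.0

Z = np.linalg.solve(np.eye(3) + lam * L, B)
```

The item with no category information (item 2) inherits a soft assignment to category 0 through its similarity to item 0, which is the semisupervised-learning behavior the abstract describes.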

…

Lifted probabilistic inference algorithms have been successfully applied to a
large number of symmetric graphical models. Unfortunately, most real-world
graphical models are asymmetric. This is even the case for relational
representations when evidence is given. Therefore, more recent work in the
community moved to making the models symmetric and then applying existing
lifted inference algorithms. However, this approach has two shortcomings.
First, all existing over-symmetric approximations require a relational
representation such as Markov logic networks. Second, the induced symmetries
often change the distribution significantly, making the computed probabilities
highly biased. We present a framework for probabilistic sampling-based
inference that only uses the induced approximate symmetries to propose steps in
a Metropolis-Hastings style Markov chain. The framework, therefore, leads to
improved probability estimates while remaining unbiased. Experiments
demonstrate that the approach outperforms existing MCMC algorithms.
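The core mechanism, using a biased "symmetrized" approximation only to propose moves while the Metropolis-Hastings acceptance test corrects it, can be shown on a two-state toy chain (the target and proposal numbers are illustrative, not a lifted model):

```python
import numpy as np

target = np.array([0.9, 0.1])    # exact, asymmetric distribution
proposal = np.array([0.5, 0.5])  # over-symmetric approximation

rng = np.random.default_rng(0)
state, counts = 0, np.zeros(2)
for _ in range(20000):
    cand = int(rng.choice(2, p=proposal))
    # Independence-MH acceptance ratio: corrects the proposal's bias so the
    # chain's stationary distribution is the exact target, not the proposal.
    a = (target[cand] / target[state]) * (proposal[state] / proposal[cand])
    if rng.uniform() < a:
        state = cand
    counts[state] += 1

est = counts / counts.sum()  # empirical state frequencies
```

Despite proposing from the symmetric 50/50 approximation, the chain's empirical frequencies converge to the asymmetric target, mirroring the abstract's claim of improved estimates that remain unbiased.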

…

Probabilistic models can learn users' preferences from the history of their
item adoptions on a social media site, and in turn, recommend new items to
users based on learned preferences. However, current models ignore
psychological factors that play an important role in shaping online social
behavior. One such factor is attention, the mechanism that integrates
perceptual and cognitive features to select the items the user will consciously
process and may eventually adopt. Recent research has shown that people have
finite attention, which constrains their online interactions, and that they
divide their limited attention non-uniformly over other people. We propose a
collaborative topic regression model that incorporates limited, non-uniformly
divided attention. We show that the proposed model learns more accurate user
preferences than state-of-the-art models that do not take human cognitive
factors into account. Specifically, we analyze voting on news items on a
social news aggregator and show that our model predicts held-out votes better
than alternative models. Our study demonstrates that
psycho-socially motivated models are better able to describe and predict
observed behavior than models which only consider latent social structure and
content.

…

This paper presents the beginnings of an automatic statistician, focusing on
regression problems. Our system explores an open-ended space of possible
statistical models to discover a good explanation of the data, and then
produces a detailed report with figures and natural-language text. Our approach
treats unknown functions nonparametrically using Gaussian processes, which has
two important consequences. First, Gaussian processes model functions in terms
of high-level properties (e.g. smoothness, trends, periodicity, changepoints).
Taken together with the compositional structure of our language of models, this
allows us to automatically describe functions through a decomposition into
additive parts. Second, the use of flexible nonparametric models and a rich
language for composing them in an open-ended manner also results in
state-of-the-art extrapolation performance evaluated over 13 real time series
data sets from various domains.
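The compositional idea — a small language of base kernels, combined by sums and products, each encoding a high-level property — can be sketched as follows; the specific kernels and hyperparameters here are illustrative and not the system's actual grammar.

```python
import numpy as np

# Base kernels over 1-D inputs; hyperparameters are illustrative.
def se(x, y, ell=1.0):               # smoothness
    return np.exp(-0.5 * (x - y) ** 2 / ell ** 2)

def per(x, y, period=2.0, ell=0.7):  # periodicity
    return np.exp(-2 * np.sin(np.pi * abs(x - y) / period) ** 2 / ell ** 2)

def lin(x, y):                       # linear trend
    return x * y

# Sums and products of valid kernels are valid kernels, so a composition
# like "a periodic component that grows linearly, plus smooth deviations"
# remains a Gaussian process covariance and can be described additively.
def composed(x, y):
    return lin(x, y) * per(x, y) + se(x, y)

X = np.linspace(0.1, 5.0, 40)
K = np.array([[composed(a, b) for b in X] for a in X])

# A valid covariance matrix must be (numerically) positive semi-definite.
min_eig = np.linalg.eigvalsh(K).min()
print(min_eig > -1e-8)
```

The additive structure is what lets each part of the fit be reported in natural language separately.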

…

The Rao-Blackwell theorem is utilized to analyze and improve the scalability of inference in large probabilistic models that exhibit symmetries. A novel marginal density estimator is introduced and shown both analytically and empirically to outperform standard estimators by several orders of magnitude. The developed theory and algorithms apply to a broad class of probabilistic models, including statistical relational models considered not susceptible to lifted probabilistic inference.

…

Topic modeling of textual corpora is an important and challenging problem. In
most previous work, the "bag-of-words" assumption is made, which ignores the
ordering of words. This assumption simplifies the computation, but it
unrealistically discards the ordering information and the semantics of words in the
context. In this paper, we present a Gaussian Mixture Neural Topic Model
(GMNTM) which incorporates both the ordering of words and the semantic meaning
of sentences into topic modeling. Specifically, we represent each topic as a
cluster of multi-dimensional vectors and embed the corpus into a collection of
vectors generated by the Gaussian mixture model. Each word is affected not only
by its topic, but also by the embedding vector of its surrounding words and the
context. The Gaussian mixture components and the topic of documents, sentences
and words can be learnt jointly. Extensive experiments show that our model can
learn better topics and more accurate word distributions for each topic.
Quantitatively, compared with state-of-the-art topic modeling approaches,
GMNTM obtains significantly better performance in terms of perplexity,
retrieval accuracy, and classification accuracy.

…

Given a CNF formula and a weight for each assignment of values to variables,
two natural problems are weighted model counting and distribution-aware
sampling of satisfying assignments. Both problems have a wide variety of
important applications. Due to the inherent complexity of the exact versions of
the problems, interest has focused on solving them approximately. Prior work in
this area scaled only to small problems in practice, or failed to provide
strong theoretical guarantees, or employed a computationally expensive maximum
a posteriori probability (MAP) oracle that assumes prior knowledge of a
factored representation of the weight distribution. We present a novel approach
that works with a black-box oracle for weights of assignments and requires only
an {\NP}-oracle (in practice, a SAT-solver) to solve both the counting and
sampling problems. Our approach works under mild assumptions on the
distribution of weights of satisfying assignments, provides strong theoretical
guarantees, and scales to problems involving several thousand variables. We
also show that the assumptions can be significantly relaxed while improving
computational efficiency if a factored representation of the weights is known.

…

Nanson's and Baldwin's voting rules select a winner by successively
eliminating candidates with low Borda scores. We show that these rules have a
number of desirable computational properties. In particular, with unweighted
votes, it is NP-hard to manipulate either rule with one manipulator, whilst
with weighted votes, it is NP-hard to manipulate either rule with a small
number of candidates and a coalition of manipulators. As only a couple of other
voting rules are known to be NP-hard to manipulate with a single manipulator,
Nanson's and Baldwin's rules appear to be particularly resistant to
manipulation from a theoretical perspective. We also propose a number of
approximation methods for manipulating these two rules. Experiments demonstrate
that both rules are often difficult to manipulate in practice. These results
suggest that elimination style voting rules deserve further study.
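For concreteness, here is a minimal sketch of Baldwin's rule (repeatedly eliminate the single candidate with the lowest Borda score; Nanson's rule instead eliminates every candidate with at most the average score). The alphabetical tie-breaking choice is ours, not the paper's.

```python
from collections import defaultdict

def borda_scores(profile, candidates):
    # A candidate gets k points from a vote if it beats k of the
    # remaining candidates in that vote's ranking.
    scores = defaultdict(int)
    for vote in profile:
        remaining = [c for c in vote if c in candidates]
        for points, candidate in enumerate(reversed(remaining)):
            scores[candidate] += points
    return scores

def baldwin_winner(profile):
    # Baldwin's rule: successively eliminate the candidate with the
    # lowest Borda score (ties broken alphabetically, for determinism).
    candidates = set(profile[0])
    while len(candidates) > 1:
        scores = borda_scores(profile, candidates)
        loser = min(sorted(candidates), key=lambda c: scores[c])
        candidates.discard(loser)
    return candidates.pop()

# Three rankings (best to worst) over candidates a, b, c.
profile = [["a", "b", "c"], ["b", "c", "a"], ["a", "c", "b"]]
print(baldwin_winner(profile))  # → a (c is eliminated first, then b)
```

In this profile, a is also the Condorcet winner, consistent with Baldwin's rule being Condorcet-consistent.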

…

In this paper, we investigate the hybrid tractability of binary Quantified
Constraint Satisfaction Problems (QCSPs). First, a basic tractable class of
binary QCSPs is identified by using the broken-triangle property. In this
class, the variable ordering for the broken-triangle property must be the same
as that in the prefix of the QCSP. Second, we relax this restriction so that
existentially quantified variables can be shifted within or out of their
blocks, and thus identify some novel tractable classes by introducing the
broken-angle property. Finally, we identify a more general tractable class,
namely the min-of-max extendable class for QCSPs.

…

Set and multiset variables in constraint programming have typically been
represented using subset bounds. However, this is a weak representation that
neglects potentially useful information about a set such as its cardinality.
For set variables, the length-lex (LL) representation successfully provides
information about the length (cardinality) and position in the lexicographic
ordering. For multiset variables, where elements can be repeated, we consider
richer representations that take into account additional information. We study
eight different representations in which we maintain bounds according to one of
the eight different orderings: length-(co)lex (LL/LC), variety-(co)lex (VL/VC),
length-variety-(co)lex (LVL/LVC), and variety-length-(co)lex (VLL/VLC)
orderings. These representations integrate information about the
cardinality, variety (the number of distinct elements in the multiset), and
position in some total ordering. Theoretical and empirical comparisons of
expressiveness and compactness of the eight representations suggest that
length-variety-(co)lex (LVL/LVC) and variety-length-(co)lex (VLL/VLC) usually
give tighter bounds after constraint propagation. We implement the eight
representations and evaluate them against the subset bounds representation with
cardinality and variety reasoning. Results demonstrate that they offer
significantly better pruning and runtime.
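As a rough illustration of the orderings themselves (the actual representations maintain interval bounds in these orderings, not mere sort keys), comparison keys for three of them might look like:

```python
# Multisets are given as tuples; each key realizes a total ordering by
# comparing length (cardinality), variety (number of distinct elements),
# and lexicographic position with different priorities.
def ll_key(ms):   # length-lex (LL)
    return (len(ms), tuple(sorted(ms)))

def vl_key(ms):   # variety-lex (VL)
    return (len(set(ms)), tuple(sorted(ms)))

def vll_key(ms):  # variety-length-lex (VLL)
    return (len(set(ms)), len(ms), tuple(sorted(ms)))

a, b = (1, 2), (3, 3, 3)
# LL orders by cardinality first, VLL by variety first, so they disagree:
print(sorted([a, b], key=ll_key))   # → [(1, 2), (3, 3, 3)]
print(sorted([a, b], key=vll_key))  # → [(3, 3, 3), (1, 2)]
```

Which priority gives tighter propagation is exactly what the eight-way comparison in the paper investigates.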

…

Neuroimage analysis usually involves learning thousands or even millions of
variables using only a limited number of samples. In this regard, sparse
models, e.g. the lasso, are applied to select the optimal features and achieve
high diagnostic accuracy. The lasso, however, usually results in independent,
unstable features. Stability, a manifestation of the reproducibility of
statistical results under reasonable perturbations of the data and the model,
is an important focus in statistics, especially in the analysis of high-dimensional
data. In this paper, we explore a nonnegative generalized fused lasso model for
stable feature selection in the diagnosis of Alzheimer's disease. In addition
to sparsity, our model incorporates two important pathological priors: the
spatial cohesion of lesion voxels and the positive correlation between the
features and the disease labels. To optimize the model, we propose an efficient
algorithm by proving a novel link between total variation and fast network flow
algorithms via conic duality. Experiments show that the proposed nonnegative
model performs much better at uncovering the intrinsic structure of the data
by selecting stable features than other state-of-the-art methods.

…

Symmetry arises in many combinatorial problems. One way of dealing with it is
to add constraints that eliminate symmetric solutions. We survey recent
results in this area, focusing especially on two common and useful cases:
symmetry-breaking constraints for row and column symmetry, and
symmetry-breaking constraints for eliminating value symmetry.

…

In machine learning and statistics, probabilistic inference involving multimodal distributions is quite difficult. This is especially true in high dimensional problems, where most existing algorithms cannot easily move from one mode to another. To address this issue, we propose a novel Bayesian inference approach based on Markov Chain Monte Carlo. Our method can effectively sample from multimodal distributions, especially when the dimension is high and the modes are isolated. To this end, it exploits and modifies the Riemannian geometric properties of the target distribution to create wormholes connecting modes in order to facilitate moving between them. Further, our proposed method uses the regeneration technique in order to adapt the algorithm by identifying new modes and updating the network of wormholes without affecting the stationary distribution. To find new modes, as opposed to rediscovering those previously identified, we employ a novel mode searching algorithm that explores a residual energy function obtained by subtracting an approximate Gaussian mixture density (based on previously discovered modes) from the target density function.

…

Restricted Boltzmann machines (RBMs) are powerful machine learning models,
but learning and some kinds of inference in the model require sampling-based
approximations, which, in classical digital computers, are implemented using
expensive MCMC. Physical computation offers the opportunity to reduce the cost
of sampling by building physical systems whose natural dynamics correspond to
drawing samples from the desired RBM distribution. Such a system avoids the
burn-in and mixing cost of a Markov chain. However, hardware implementations
of this kind usually entail limitations such as low precision and limited
range of the parameters, and restrictions on the size and topology of the RBM. We
conduct software simulations to determine how harmful each of these
restrictions is. Our simulations are designed to reproduce aspects of the
D-Wave quantum computer, but the issues we investigate arise in most forms of
physical computation.
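A minimal software simulation in this spirit: block Gibbs sampling from a tiny RBM, with a simple weight-quantization step standing in for a low-precision hardware restriction. The model sizes, parameters, and quantization scheme are illustrative, not the paper's D-Wave setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny RBM with illustrative random parameters.
nv, nh = 6, 4
W = rng.normal(0.0, 0.5, (nv, nh))
b, c = np.zeros(nv), np.zeros(nh)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_visible_means(W, steps=2000):
    # Block Gibbs sampling: alternate h | v and v | h, then average v.
    v = rng.integers(0, 2, nv).astype(float)
    vs = []
    for _ in range(steps):
        h = (rng.random(nh) < sigmoid(v @ W + c)).astype(float)
        v = (rng.random(nv) < sigmoid(h @ W.T + b)).astype(float)
        vs.append(v)
    return np.mean(vs, axis=0)

def quantize(W, levels=8, wmax=1.5):
    # Emulate limited parameter precision and range in hardware.
    step = 2.0 * wmax / (levels - 1)
    return np.clip(np.round(W / step) * step, -wmax, wmax)

exact = gibbs_visible_means(W)
coarse = gibbs_visible_means(quantize(W))
distortion = np.max(np.abs(exact - coarse))
print(distortion)  # marginal error introduced by limited precision
```

Varying `levels` and `wmax` gives a crude handle on how harmful each restriction is, which is the question the paper's simulations address at scale.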

…

RDF and Description Logics work in an open-world setting where absence of
information is not information about absence. Nevertheless, Description Logic
axioms can be interpreted in a closed-world setting and in this setting they
can be used for both constraint checking and closed-world recognition against
information sources. When the information sources are expressed in well-behaved
RDF or RDFS (i.e., RDF graphs interpreted in the RDF or RDFS semantics) this
constraint checking and closed-world recognition is simple to describe.
Further, this constraint checking can be implemented as SPARQL querying and
thus effectively performed.
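As a toy illustration of the closed-world reading (a sketch, not the paper's formalization), the check below treats the absence of a `hasEmail` triple as a violation of the axiom "every Person has an email"; in SPARQL 1.1 the same check could be phrased with `FILTER NOT EXISTS`.

```python
# A small RDF-like graph as (subject, predicate, object) triples.
triples = {
    ("alice", "type", "Person"),
    ("bob", "type", "Person"),
    ("alice", "hasEmail", "alice@example.org"),
}

def violations(triples):
    # Closed-world reading of "every Person has at least one email":
    # a missing hasEmail triple is a constraint violation, whereas the
    # open-world reading would treat it as merely unknown.
    persons = {s for (s, p, o) in triples if p == "type" and o == "Person"}
    with_email = {s for (s, p, o) in triples if p == "hasEmail"}
    return sorted(persons - with_email)

print(violations(triples))  # → ['bob']
```

Under the open-world semantics of RDF and Description Logics, bob's missing email would say nothing; the closed-world interpretation is what turns the axiom into a checkable constraint.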

…

Dimensionality reduction (DR) is often used as a preprocessing step in
classification, but usually one first fixes the DR mapping, possibly using
label information, and then learns a classifier (a filter approach). Best
performance would be obtained by optimizing the classification error jointly
over DR mapping and classifier (a wrapper approach), but this is a difficult
nonconvex problem, particularly with nonlinear DR. Using the method of
auxiliary coordinates, we give a simple, efficient algorithm to train a
combination of nonlinear DR and a classifier, and apply it to an RBF mapping
with a linear SVM. This alternates steps where we train the RBF mapping and a
linear SVM as usual regression and classification, respectively, with a
closed-form step that coordinates both. The resulting nonlinear low-dimensional
classifier achieves classification errors competitive with the state-of-the-art
but is fast at training and testing, and allows the user to trade off runtime
for classification accuracy easily. We then study the role of nonlinear DR in
linear classification, and the interplay between the DR mapping, the number of
latent dimensions and the number of classes. When trained jointly, the DR
mapping takes an extreme role in eliminating variation: it tends to collapse
classes in latent space, erasing all manifold structure, and to lay out class
centroids so that they are linearly separable with maximum margin.

…

Sparse coding can learn representations that are robust to noise and can model
higher-order representations for image classification. However, the inference
algorithm is computationally expensive even when supervised signals are used
to learn compact and discriminative dictionaries. Fortunately, a simplified
neural network module (SNNM) has been proposed to learn discriminative
dictionaries directly, avoiding the expensive inference. However, the SNNM
module ignores sparse representations.
Therefore, we propose a sparse SNNM module by adding the mixed-norm
regularization (l1/l2 norm). The sparse SNNM modules are further stacked to
build a sparse deep stacking network (S-DSN). In the experiments, we evaluate
S-DSN on four databases: Extended YaleB, AR, 15 scene, and Caltech101.
Experimental results show that our model outperforms related classification
methods using only a linear classifier. It is worth noting that we reach
98.8% recognition accuracy on the 15 scene dataset.

…

We propose a voted dual averaging method for online classification problems
with explicit regularization. This method employs the update rule of the
regularized dual averaging (RDA) method, but only on the subsequence of
training examples where a classification error is made. We derive a bound on
the number of mistakes made by this method on the training set, as well as its
generalization error rate. We also introduce the concept of relative strength
of regularization, and show how it affects the mistake bound and generalization
performance. We experimented with the method using $\ell_1$ regularization on a
large-scale natural language processing task, and obtained state-of-the-art
classification performance with fairly sparse models.
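A rough sketch of the method's two ingredients — updating only on classification mistakes, and the closed-form l1 soft-thresholding step of regularized dual averaging — on synthetic data; the loss, step sizes, and data are our illustrative choices, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linearly separable data; only the first two features matter.
n, d = 500, 20
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])

lam, gamma = 0.05, 1.0   # l1 strength and step parameter (illustrative)
gbar = np.zeros(d)       # running average of subgradients
w = np.zeros(d)
t = 0
for xi, yi in zip(X, y):
    if yi * (w @ xi) <= 0:   # update only when a mistake is made
        t += 1
        g = -yi * xi         # perceptron-style subgradient
        gbar += (g - gbar) / t
        # Closed-form RDA step with l1: soft-threshold the averaged gradient.
        w = -(np.sqrt(t) / gamma) * np.sign(gbar) * np.maximum(0.0, np.abs(gbar) - lam)

errors = np.mean(np.sign(X @ w) != y)
sparsity = np.mean(w == 0.0)
print(errors, sparsity)
```

Raising `lam` strengthens the regularization relative to the data, zeroing more coordinates; this trade-off is what the relative strength of regularization captures.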

…

Click prediction is one of the fundamental problems in sponsored search. Most
existing studies have used machine learning approaches to predict the ad click
for each ad-view event independently. However, as observed in real-world
sponsored search systems, a user's behavior on ads depends strongly on how the
user behaved in the past, especially in terms of what queries she submitted,
what ads she clicked or ignored, and how long she spent on the landing pages
of clicked ads. Inspired by these
observations, we introduce a novel framework based on Recurrent Neural Networks
(RNN). Compared to traditional methods, this framework directly incorporates
the dependency on the user's sequential behaviors into the click prediction
process through the recurrent structure of the RNN. Large-scale evaluations on the
click-through logs from a commercial search engine demonstrate that our
approach can significantly improve the click prediction accuracy, compared to
sequence-independent approaches.

…

Associative memories are data structures that allow retrieval of stored
messages from part of their content. They thus behave similarly to the human
brain, which is capable, for instance, of retrieving the end of a song given
its beginning. Among the different families of associative memories, sparse
ones are known to provide the best efficiency (the ratio of the number of bits
stored to the number of bits used). Nevertheless, it is well known that
non-uniformity of the stored messages can lead to a dramatic decrease in
performance. We introduce
several strategies to allow efficient storage of non-uniform messages in
recently introduced sparse associative memories. We analyse and discuss the
methods introduced. We also present a practical application example.

…

In this paper we investigate clustering in the weighted setting, in which
every data point is assigned a real valued weight. We conduct a theoretical
analysis of the influence of weighted data on standard clustering algorithms
in both the partitional and hierarchical settings, characterising the precise
conditions under which such algorithms react to weights, and classifying
clustering methods into three broad categories: weight-responsive,
weight-considering, and weight-robust. Our analysis raises several interesting
questions and can be directly mapped to the classical unweighted setting.

…

Spectral clustering is a fundamental technique in the field of data mining
and information processing. Most existing spectral clustering algorithms
integrate dimensionality reduction into the clustering process assisted by
manifold learning in the original space. However, the manifold in
reduced-dimensional subspace is likely to exhibit altered properties in
contrast with the original space. Thus, applying manifold information obtained
from the original space to the clustering process in a low-dimensional subspace
is prone to inferior performance. Aiming to address this issue, we propose a
novel convex algorithm that mines the manifold structure in the low-dimensional
subspace. In addition, our unified learning process makes the manifold learning
particularly tailored for clustering. Compared with other related methods,
the proposed algorithm produces more structured clustering results. To
validate the efficacy of the proposed algorithm, we perform extensive
experiments on several benchmark datasets in comparison with some
state-of-the-art clustering approaches. The experimental results demonstrate
that the proposed algorithm has quite promising clustering performance.

…

In recent years, spectral clustering has become a standard method for data
analysis used in a broad range of applications. In this paper we propose a new
class of algorithms for multiway spectral clustering based on optimization of a
certain "contrast function" over a sphere. These algorithms are simple to
implement, efficient and, unlike most of the existing algorithms for multiclass
spectral clustering, are not initialization-dependent. Moreover, they are
applicable without modification for normalized and un-normalized clustering,
which are two common variants of spectral clustering.
Geometrically, the proposed algorithms can be interpreted as recovering a
discrete weighted simplex by means of function optimization. We give complete
necessary and sufficient conditions on contrast functions for the optimization
to guarantee recovery of clusters. We show how these conditions can be
interpreted in terms of certain "hidden convexity" of optimization over a
sphere.

…

Many networks are complex dynamical systems, where both attributes of nodes
and topology of the network (link structure) can change with time. We propose a
model of co-evolving networks where both node attributes and network
structure evolve under mutual influence. Specifically, we consider a mixed
membership stochastic blockmodel, where the probability of observing a link
between two nodes depends on their current membership vectors, while those
membership vectors themselves evolve in the presence of a link between the
nodes. Thus, the network is shaped by the interaction of stochastic processes
describing the nodes, while the processes themselves are influenced by the
changing network structure. We derive an efficient variational inference
procedure for our model, and validate the model on both synthetic and
real-world data.

…
