ABSTRACT: We describe two new approaches to human pose estimation. Both can quickly and accurately predict the 3D positions of body joints from a single depth image without using any temporal information. The key to both approaches is the use of a large, realistic, and highly varied synthetic set of training images. This allows us to learn models that are largely invariant to factors such as pose, body shape, field-of-view cropping, and clothing. Our first approach employs an intermediate body parts representation, designed so that an accurate per-pixel classification of the parts will localize the joints of the body. The second approach instead directly regresses the positions of body joints. By using simple depth pixel comparison features and parallelizable decision forests, both approaches can run at super-real-time rates on consumer hardware. Our evaluation investigates many aspects of our methods, and compares the approaches to each other and to the state of the art. Results on silhouettes suggest broader applicability to other imaging modalities.
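The depth pixel comparison features mentioned above admit a very small sketch. The general form (the difference of two depth probes, with offsets normalized by the depth at the reference pixel) follows the abstract's description; the array layout, offset values, and background constant below are illustrative assumptions.

```python
import numpy as np

def depth_feature(depth, x, u, v, background=1e6):
    """Depth comparison feature at pixel x: difference of two probe depths.

    The probe offsets u and v are divided by the depth at x, making the
    feature approximately invariant to how far the body is from the camera.
    Probes that fall outside the image read a large constant (background).
    """
    def probe(offset):
        d_x = depth[x[1], x[0]]                # depth at the reference pixel
        px = int(x[0] + offset[0] / d_x)       # depth-normalized probe location
        py = int(x[1] + offset[1] / d_x)
        h, w = depth.shape
        if 0 <= px < w and 0 <= py < h:
            return depth[py, px]
        return background
    return probe(u) - probe(v)
```

A decision tree thresholds many such features; each is cheap enough to evaluate per pixel in parallel, which is what makes the super-real-time speeds plausible.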
ABSTRACT: This paper presents a novel meta algorithm, Partition-Merge (PM), which takes
existing centralized algorithms for graph computation and makes them
distributed and faster. In a nutshell, PM divides the graph into small
subgraphs using our novel randomized partitioning scheme, runs the centralized
algorithm on each partition separately, and then stitches the resulting
solutions to produce a global solution. We demonstrate the efficiency of the PM
algorithm on two popular problems: computation of Maximum A Posteriori (MAP)
assignment in an arbitrary pairwise Markov Random Field (MRF), and modularity
optimization for community detection. We show that the resulting distributed
algorithms for these problems essentially run in time linear in the number of
nodes in the graph, and perform as well as -- or even better than -- the original
centralized algorithm as long as the graph has geometric structures. Here we
say a graph has geometric structures, or polynomial growth property, when the
number of nodes within distance r of any given node grows no faster than a
polynomial function of r. More precisely, if the centralized algorithm is a
C-factor approximation with constant C \ge 1, the resulting distributed
algorithm is a (C+\delta)-factor approximation for any small \delta>0; but if
the centralized algorithm is a non-constant (e.g. logarithmic) factor
approximation, then the resulting distributed algorithm becomes a constant
factor approximation. For general graphs, we compute explicit bounds on the
loss of performance of the resulting distributed algorithm with respect to the centralized algorithm.
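The high-level flow of PM can be sketched as follows. This is a minimal stand-in: the BFS-ball partitioning around random centers is an illustrative placeholder for the paper's randomized partitioning scheme, and `solve` stands for any centralized algorithm returning a per-node labelling.

```python
import random

def partition_merge(nodes, adj, solve, radius=2, seed=0):
    """Partition-Merge sketch: split the graph into small subgraphs,
    solve each independently, and stitch the partial solutions.

    `solve(subgraph_nodes)` is any centralized algorithm returning a
    dict {node: label}; `adj` maps each node to its neighbors.
    """
    rng = random.Random(seed)
    unassigned = set(nodes)
    solution = {}
    order = list(nodes)
    rng.shuffle(order)
    for center in order:
        if center not in unassigned:
            continue
        # grow a BFS ball of bounded radius around the center
        ball, frontier = {center}, {center}
        for _ in range(radius):
            frontier = {v for u in frontier for v in adj[u]
                        if v in unassigned and v not in ball}
            ball |= frontier
        ball &= unassigned
        unassigned -= ball
        # run the centralized algorithm on this piece and stitch
        solution.update(solve(ball))
    return solution
```

Because each piece is small and the `solve` calls are independent, they can run in parallel, which is where the distributed speedup comes from; on graphs with polynomial growth the balls stay small, matching the abstract's linear running time claim.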
ABSTRACT: This paper makes two contributions: the first is the proposal of a new model - the associative hierarchical random field (AHRF), and a novel algorithm for its optimisation; the second is the application of this model to the problem of semantic segmentation. Most methods for semantic segmentation are formulated as a labelling problem for variables that might correspond to either pixels or segments such as super-pixels. It is well known that the generation of super-pixel segmentations is not unique. This has motivated many researchers to use multiple super-pixel segmentations for problems such as semantic segmentation or single view reconstruction. These super-pixels have not yet been combined in a principled manner; this is a difficult problem, as they may overlap, or be nested in such a way that the segmentations form a segmentation tree. Our new hierarchical random field model allows information from all of the multiple segmentations to contribute to a global energy. MAP inference in this model can be performed efficiently using powerful graph cut based move making algorithms.
IEEE Transactions on Software Engineering 08/2013; · 2.59 Impact Factor
ABSTRACT: Many successful applications of computer vision to image or video manipulation are interactive by nature. However, parameters of such systems are often trained neglecting the user. Traditionally, interactive systems have been treated in the same manner as their fully automatic counterparts. Their performance is evaluated by computing the accuracy of their solutions under some fixed set of user interactions. In this paper, we study the problem of evaluating and learning interactive segmentation systems which are extensively used in the real world. The key questions in this context are how to measure (1) the effort associated with a user interaction, and (2) the quality of the segmentation result as perceived by the user. We conduct a user study to analyze user behavior and answer these questions. Using the insights obtained from these experiments, we propose a framework to evaluate and learn interactive segmentation systems which brings the user in the loop. The framework is based on the use of an active robot user—a simulated model of a human user. We show how this approach can be used to evaluate and learn parameters of state-of-the-art interactive segmentation systems. We also show how simulated user models can be integrated into the popular max-margin method for parameter learning and propose an algorithm to solve the resulting optimisation problem.
International Journal of Computer Vision 08/2013; 100(3). · 3.62 Impact Factor
ABSTRACT: The Markov and conditional random fields (CRFs) used in computer vision typically model only local interactions between variables, as this is generally thought to be the only case that is computationally tractable. In this paper we consider a class of global potentials defined over all variables in the CRF. We show how they can be readily optimised using standard graph cut algorithms at little extra expense compared to a standard pairwise field. This result can be directly used for the problem of class based image segmentation which has seen increasing recent interest within computer vision. Here the aim is to assign a label to each pixel of a given image from a set of possible object classes. Typically these methods use random fields to model local interactions between pixels or super-pixels. One of the cues that helps recognition is global object co-occurrence statistics, a measure of which classes (such as chair or motorbike) are likely to occur in the same image together. There have been several approaches proposed to exploit this property, but all of them suffer from different limitations and typically carry a high computational cost, preventing their application on large images. We find that the new model we propose produces a significant improvement in the labelling compared to just using a pairwise model and that this improvement increases as the number of labels increases.
International Journal of Computer Vision 08/2013; 103(2). · 3.62 Impact Factor
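A minimal sketch of the kind of energy involved: per-pixel unary terms, a Potts pairwise term, and a global term over all variables. Here the global term is a simple per-label cost paid once for each class used anywhere in the labelling (one basic form of co-occurrence potential; the potentials considered in the paper are more general than this).

```python
def labelling_energy(labels, unary, edges, pairwise_weight, label_cost):
    """Energy = per-pixel unary costs + Potts penalty on cut edges +
    a global cost paid once for every label that appears in the labelling."""
    e = sum(unary[i][labels[i]] for i in range(len(labels)))
    e += sum(pairwise_weight for (i, j) in edges if labels[i] != labels[j])
    e += sum(label_cost[l] for l in set(labels))  # global co-occurrence term
    return e
```

The global term couples all pixels at once, which is why naive optimisation is expensive; the abstract's point is that such potentials can still be handled by standard graph cut machinery at little extra cost.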
ABSTRACT: Energy minimization algorithms, such as graph cuts, enable the computation of
the MAP solution under certain probabilistic models such as Markov random
fields. However, for many computer vision problems, the MAP solution under the
model is not the ground truth solution. In many problem scenarios, the system
has access to certain statistics of the ground truth. For instance, in image
segmentation, the area and boundary length of the object may be known. In these
cases, we want to estimate the most probable solution that is consistent with
such statistics, i.e., satisfies certain equality or inequality constraints.
The above constrained energy minimization problem is NP-hard in general, and
is usually solved using Linear Programming formulations, which relax the
integrality constraints. This paper proposes a novel method that finds the
discrete optimal solution of such problems by maximizing the corresponding
Lagrangian dual. This method can be applied to any constrained energy
minimization problem whose unconstrained version is polynomial time solvable,
and can handle multiple, equality or inequality, and linear or non-linear
constraints. We demonstrate the efficacy of our method on the
foreground/background image segmentation problem, and show that it produces
impressive segmentation results with less error, and runs more than 20 times
faster than the state-of-the-art LP relaxation based approaches.
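The Lagrangian-dual idea can be illustrated at toy scale. The sketch below does projected subgradient ascent on the dual of a problem with one linear inequality constraint g(x) <= b; brute-force enumeration stands in for the polynomial-time unconstrained solver, and the names and step-size schedule are my own choices, not the paper's.

```python
def dual_ascent(states, E, g, b, steps=100, step_size=0.5):
    """Maximize L(lam) = min_x [E(x) + lam * (g(x) - b)] over lam >= 0.

    At each step the unconstrained oracle minimizes the Lagrangian at the
    current lam; the subgradient of L at lam is g(x) - b. The best feasible
    primal solution encountered along the way is returned."""
    lam, best = 0.0, None
    for t in range(steps):
        x = min(states, key=lambda s: E(s) + lam * (g(s) - b))  # oracle call
        if g(x) <= b and (best is None or E(x) < E(best)):
            best = x
        lam = max(0.0, lam + step_size / (t + 1) * (g(x) - b))  # project to lam >= 0
    return best, lam
```

Each dual iteration costs only one unconstrained minimization, which is the point: if the unconstrained problem is polynomial-time solvable (e.g. by graph cuts), so is every step of the dual ascent.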
ABSTRACT: In this paper, we propose novel algorithms for inferring the Maximum a
Posteriori (MAP) solution of discrete pairwise random field models under
multiple constraints. We show how this constrained discrete optimization
problem can be formulated as a multi-dimensional parametric mincut problem via
its Lagrangian dual, and prove that our algorithm isolates all constraint
instances for which the problem can be solved exactly. These multiple solutions
enable us to even deal with `soft constraints' (higher order penalty
functions). Moreover, we propose two practical variants of our algorithm to
solve problems with hard constraints. We also show how our method can be
applied to solve various constrained discrete optimization problems such as
submodular minimization and shortest path computation. Experimental evaluation
using the foreground-background image segmentation problem with statistical
constraints reveals that our method is faster and its results are closer to the
ground truth labellings than those of popular continuous relaxation based approaches.
ABSTRACT: This chapter introduces a new random field model for discrete image labeling tasks, the Decision Tree Field (DTF), that combines and generalizes decision forests and conditional random fields (CRFs), both of which have been widely used in computer vision. In a typical CRF model the unary potentials are derived from sophisticated forest or boosting based classifiers; however, the pairwise potentials are assumed to (1) have a simple parametric form with a pre-specified and fixed dependence on the image data, and (2) be defined on the basis of a small and fixed neighborhood. In contrast, in the DTF, local interactions between multiple variables are determined by means of decision trees evaluated on the image data, allowing the interactions to be adapted to the image content. This results in powerful graphical models which are able to represent complex label structure. Our key technical contribution is to show that the DTF model can be trained efficiently and jointly using a convex approximate likelihood function, enabling us to learn over a million free model parameters. We show experimentally that for applications which have a rich and complex label structure, our model achieves excellent results.
ABSTRACT: We describe two new approaches to human pose estimation. Both can quickly and accurately predict the 3D positions of body joints from a single depth image, without using any temporal information. The key to both approaches is the use of a large, realistic, and highly varied synthetic set of training images. This allows us to learn models that are largely invariant to factors such as pose, body shape, and field-of-view cropping. Our first approach employs an intermediate body parts representation, designed so that an accurate per-pixel classification of the parts will localize the joints of the body. The second approach instead directly regresses the positions of body joints. By using simple depth pixel comparison features and parallelizable decision forests, both approaches can run at super-real-time rates on consumer hardware. Our evaluation investigates many aspects of our methods, and compares the approaches to each other and to the state of the art.
ABSTRACT: Image partitioning is an important preprocessing step for many of the state-of-the-art algorithms used for performing high-level computer vision tasks. Typically, partitioning is conducted without regard to the task at hand. We propose a task-specific image partitioning framework to produce a region-based image representation that leads to higher task performance than that reached using any task-oblivious partitioning framework or the few existing supervised partitioning frameworks. The proposed method partitions the image by means of correlation clustering, maximizing a linear discriminant function defined over a superpixel graph. The parameters of the discriminant function that define task-specific similarity/dissimilarity among superpixels are estimated with a structured support vector machine (S-SVM) using task-specific training data. The S-SVM learning leads to better generalization ability, while the construction of the superpixel graph used to define the discriminant function allows a rich set of features to be incorporated to improve discriminability and robustness. We evaluate the learnt task-aware partitioning algorithms on three benchmark datasets. Results show that task-aware partitioning leads to better labeling performance than the partitioning computed by state-of-the-art general-purpose and supervised partitioning algorithms. We believe that the task-specific image partitioning paradigm is widely applicable to improving performance in high-level image understanding tasks.
IEEE Transactions on Image Processing 09/2012; · 3.20 Impact Factor
ABSTRACT: Hough transform based methods for detecting multiple objects use non-maximum suppression or mode-seeking to locate and distinguish peaks in Hough images. Such postprocessing requires tuning of many parameters and is often fragile, especially when objects are located spatially close to each other. In this paper, we develop a new probabilistic framework for object detection which is related to the Hough transform. It shares the simplicity and wide applicability of the Hough transform but, at the same time, bypasses the problem of multiple peak identification in Hough images and permits detection of multiple objects without invoking non-maximum suppression heuristics. Our experiments demonstrate that this method results in a significant improvement in detection accuracy both for the classical task of straight line detection and for a more modern category-level (pedestrian) detection problem.
IEEE Transactions on Software Engineering 03/2012; · 2.59 Impact Factor
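For reference, the classical Hough transform for straight lines, whose peak-finding step the probabilistic framework above replaces, can be sketched as follows. The normal-form parameterization rho = x*cos(theta) + y*sin(theta) is standard; the discretization choices are mine, and a real system would detect several peaks with the suppression heuristics the abstract criticizes rather than take a single argmax.

```python
import math

def hough_lines(points, n_theta=180, rho_max=100, n_rho=200):
    """Vote each point into a (theta, rho) accumulator using the normal
    parameterization rho = x*cos(theta) + y*sin(theta); return the
    strongest cell as (theta, rho, votes)."""
    acc = [[0] * n_rho for _ in range(n_theta)]
    for (x, y) in points:
        for ti in range(n_theta):
            theta = math.pi * ti / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            ri = int(round((rho + rho_max) / (2 * rho_max) * (n_rho - 1)))
            if 0 <= ri < n_rho:
                acc[ti][ri] += 1
    # take the single best cell; finding multiple peaks is the fragile part
    ti, ri = max(((t, r) for t in range(n_theta) for r in range(n_rho)),
                 key=lambda tr: acc[tr[0]][tr[1]])
    theta = math.pi * ti / n_theta
    rho = ri / (n_rho - 1) * 2 * rho_max - rho_max
    return theta, rho, acc[ti][ri]
```

Returning only the strongest cell sidesteps, rather than solves, the multiple-peak problem: when several objects vote into nearby cells, distinguishing their peaks is exactly where the tuning-heavy heuristics come in.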
ABSTRACT: Markov Networks are widely used throughout computer vision and machine
learning. An important subclass is the Associative Markov Network, which is
used in a wide variety of applications. For these networks a good approximate
minimum cost solution can be found efficiently using graph cut based move
making algorithms such as alpha-expansion. Recently a related model has been
proposed, the associative hierarchical network, which provides a natural
generalisation of the Associative Markov Network for higher order cliques (i.e.
clique size greater than two). This method provides a good model for the object
class segmentation problem in computer vision. Within this paper we briefly
describe the associative hierarchical network and provide a computationally
efficient method for approximate inference based on graph cuts. Our method
performs well on networks containing hundreds of thousands of variables, with
higher order potentials defined over cliques containing tens of thousands of
variables. At this scale, standard linear programming techniques are
inapplicable. We show that our method achieves an approximation bound of 4 for
the general associative hierarchical network with arbitrary clique size; we
note that few such bounds exist for the labelling of Markov Networks with
higher order cliques.
ABSTRACT: Most image labeling problems such as segmentation and image reconstruction
are fundamentally ill-posed and suffer from ambiguities and noise. Higher order
image priors encode high level structural dependencies between pixels and are
key to overcoming these problems. However, these priors in general lead to
computationally intractable models. This paper addresses the problem of
discovering compact representations of higher order priors which allow
efficient inference. We propose a framework for solving this problem which uses
a recently proposed representation of higher order functions where they are
encoded as lower envelopes of linear functions. Maximum a Posteriori inference
on our learned models reduces to minimizing a pairwise function of discrete
variables, which can be done approximately using standard methods. Although
this is a primarily theoretical paper, we also demonstrate the practical
effectiveness of our framework on the problem of learning a shape prior for
image segmentation and reconstruction. We show that our framework can learn a
compact representation that approximates a prior that encourages low curvature
shapes. We evaluate the approximation accuracy, discuss properties of the
trained model, and show various results for shape inpainting and image segmentation.
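The lower-envelope representation mentioned above is easy to state concretely: a higher-order potential is stored as the pointwise minimum of a few linear functions of the labelling. The particular two-piece prior in the usage below (a concave function of the foreground count) is my illustrative choice, not a potential learned in the paper.

```python
def lower_envelope_potential(linear_funcs, x):
    """Evaluate a higher-order potential stored as the lower envelope
    (pointwise minimum) of linear functions of the labelling x.

    Each function is a pair (weights, constant): f(x) = sum_i w_i * x_i + c.
    Because the potential is a min over linear pieces, MAP inference can
    introduce one auxiliary switching variable per potential that selects
    the active piece, reducing the model to a pairwise function."""
    return min(sum(w * xi for w, xi in zip(ws, x)) + c for ws, c in linear_funcs)
```

With weights all 1 in one piece and all -1 plus a constant N in the other, the envelope equals min(n, N - n) of the foreground count n: a concave prior that is cheapest at pure labellings, yet needs only two linear pieces to represent.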
ABSTRACT: Representation languages for coalitional games are a key research area in
algorithmic game theory. There is an inherent tradeoff between how general a
language is, allowing it to capture more elaborate games, and how hard it is
computationally to optimize and solve such games. One prominent such language
is the simple yet expressive Weighted Graph Games (WGGs) representation, which
maintains knowledge about synergies between agents in the form of an edge
weighted graph.
We consider the problem of finding the optimal coalition structure in WGGs.
The agents in such games are vertices in a graph, and the value of a coalition
is the sum of the weights of the edges present between coalition members. The
optimal coalition structure is a partition of the agents into coalitions that
maximizes the sum of utilities obtained by the coalitions. We show that finding
the optimal coalition structure is not only hard for general graphs, but is
also intractable for restricted families such as planar graphs, which are
amenable to many other combinatorial problems. We then provide algorithms with
constant factor approximations for planar, minor-free and bounded degree graphs.
ABSTRACT: This paper presents a new adaptive graph-cut based move-making algorithm for energy minimization. Traditional move-making algorithms such as Expansion and Swap operate by searching for better solutions in some predefined move spaces around the current solution. In contrast, our algorithm uses the primal-dual interpretation of the Expansion-move algorithm to adaptively compute the best move-space to search over. At each step, it tries to greedily find the move-space that will lead to the biggest decrease in the primal-dual gap. We test different variants of our algorithm on a variety of image labelling problems such as object segmentation and stereo. Experimental results show that our adaptive strategy significantly outperforms the conventional Expansion-move algorithm, in some cases cutting the runtime by 50%.
ABSTRACT: We present a new approach to general-activity human pose estimation from depth images, building on Hough forests. We extend existing techniques in several ways: real time prediction of multiple 3D joints, explicit learning of voting weights, vote compression to allow larger training sets, and a comparison of several decision-tree training objectives. Key aspects of our work include: regression directly from the raw depth image, without the use of an arbitrary intermediate representation; applicability to general motions (not constrained to particular activities) and the ability to localize occluded as well as visible body joints. Experimental results demonstrate that our method produces state of the art results on several data sets including the challenging MSRC-5000 pose estimation test set, at a speed of about 200 frames per second. Results on silhouettes suggest broader applicability to other imaging modalities.
IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011; 01/2011
ABSTRACT: This paper presents a method for joint stereo matching and object segmentation. In our approach a 3D scene is represented as a collection of visually distinct and spatially coherent objects. Each object is characterized by three different aspects: a color model, a 3D plane that approximates the object’s disparity distribution, and a novel 3D connectivity property. Inspired by Markov Random Field models of image segmentation, we employ object-level color models as a soft constraint, which can aid depth estimation in powerful ways. In particular, our method is able to recover the depth of regions that are fully occluded in one input view, which to our knowledge is new for stereo matching. Our model is formulated as an energy function that is optimized via fusion moves. We show high-quality disparity and object segmentation results on challenging image pairs as well as standard benchmarks. We believe our work not only demonstrates a novel synergy between the areas of image segmentation and stereo matching, but may also inspire new work in the domain of automatic and interactive object-level scene manipulation.
The 24th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011, Colorado Springs, CO, USA, 20-25 June 2011; 01/2011
ABSTRACT: Conditional Random Fields (CRFs) are popular models in computer vision for solving labeling problems such as image denoising. This paper tackles the rarely addressed but important problem of learning the full form of the potential functions of pairwise CRFs. We examine two popular learning techniques, maximum likelihood estimation and maximum margin training. The main focus of the paper is on models such as pairwise CRFs that are simplistic (misspecified) and do not fit the data well. We empirically demonstrate that for misspecified models maximum-margin training with MAP prediction is superior to maximum likelihood estimation with any other prediction method. Additionally, we examine the common belief that MLE is better at producing predictions matching image statistics.
Pattern Recognition - 33rd DAGM Symposium, Frankfurt/Main, Germany, August 31 - September 2, 2011. Proceedings; 01/2011