Article

Stages as Models of Scene Geometry

Authors: Nedovic, Smeulders, Redert, and Geusebroek

Abstract

Reconstruction of 3D scene geometry is an important element for scene understanding, autonomous vehicle and robot navigation, image retrieval, and 3D television. We propose accounting for the inherent structure of the visual world when trying to solve the scene reconstruction problem. Consequently, we identify geometric scene categorization as the first step toward robust and efficient depth estimation from single images. We introduce 15 typical 3D scene geometries called stages, each with a unique depth profile, which roughly correspond to a large majority of broadcast video frames. Stage information serves as a first approximation of global depth, narrowing down the search space in depth estimation and object localization. We propose different sets of low-level features for depth estimation, and perform stage classification on two diverse data sets of television broadcasts. Classification results demonstrate that stages can often be efficiently learned from low-dimensional image representations.
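As a loose illustration of how a stage label can act as a first approximation of global depth, the sketch below assigns a rough per-pixel depth to a few example stages; the layer depths and the handling of the stage names used here ('sky+bkg+grd', 'sky+grd', 'nodepth') are illustrative assumptions, not the depth profiles defined in the paper.

```python
import numpy as np

def coarse_depth_from_stage(stage, shape=(240, 320)):
    """Return a rough per-pixel depth map for a few example stages.

    Hypothetical layer depths chosen for illustration only; the actual
    per-stage depth profiles are defined in the paper.
    """
    h, w = shape
    depth = np.zeros(shape, dtype=np.float32)
    if stage == "sky+bkg+grd":
        depth[: h // 3] = 100.0                        # sky: far
        depth[h // 3 : 2 * h // 3] = 30.0              # background: mid-range
        depth[2 * h // 3 :] = np.linspace(30.0, 1.0, h - 2 * h // 3)[:, None]  # ground ramp
    elif stage == "sky+grd":
        depth[: h // 2] = 100.0                        # sky: far
        depth[h // 2 :] = np.linspace(50.0, 1.0, h - h // 2)[:, None]          # ground ramp
    elif stage == "nodepth":
        depth[:] = 10.0                                # frontal plane at constant depth
    return depth
```

Such a coarse profile can then constrain per-pixel depth estimation or object localization to a narrower search space, which is the role the abstract describes for stage information.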


... Indoor images mostly contain human-made structures, e.g., room, kitchen, office, tunnel, etc., whereas outdoor images are natural scenes without borders, e.g., sky-mountain-ground, building and sky, etc. These images are divided into different categories [5][6][7][8][9] defined on the basis of image scene structure. For example, image scenes are divided into twelve rough 3D structures, also called 3D scene geometries [5], shown in Fig. 1a-d. ...
... These images are divided into different categories [5][6][7][8][9] defined on the basis of image scene structure. For example, image scenes are divided into twelve rough 3D structures, also called 3D scene geometries [5], shown in Fig. 1a-d. Scene recognition is important for many computer vision applications, such as 3D TV, navigation systems, and video categorization [5,10,11]. ...
... For example, image scenes are divided into twelve rough 3D structures, also called 3D scene geometries [5], shown in Fig. 1a-d. Scene recognition is important for many computer vision applications, such as 3D TV, navigation systems, and video categorization [5,10,11]. It is useful for a computer vision system to understand the scene type of an input image because this benefits scene understanding and extraction of the image layout [5]. ...
Article
Full-text available
Scene recognition is a challenging problem due to intra-class variations and inter-class similarities. Traditional methods and convolutional neural networks (CNNs) represent the global spatial structure, which is suitable for general scene classification and object recognition, but they perform poorly on particular indoor or outdoor medium-scale scene datasets. In this manuscript, we study the local and global structures of an image scene and then combine both types of information for indoor and outdoor scenes to improve scene recognition accuracy. The local region structure indicates a sub-part of the scene, such as sky or ground, and the global structure indicates the whole scene structure, such as the sky-background-ground outdoor scene type. For this purpose, the multi-layer convolutional features of inception- and residual-based architectures are used at intermediate and higher layers to preserve both local and global structures of the image scene. Each layer used for feature extraction is connected with global average pooling to obtain a discriminative representation of the image scene. In this way, the local structure is explored at the intermediate convolutional layers, and the global spatial structure is obtained from the higher layers. The proposed method is evaluated on the 8-scene, 15-scene, UMC-21, MIT67, and 12-scene challenging datasets, achieving 98.51%, 96.49%, 99.05%, 80.31%, and 84.88%, respectively, significantly outperforming state-of-the-art approaches.
... These regularities have also been proposed for use in machine vision. Nedovic et al. [3] grouped these regularities into a limited set of scene geometries for 2D images, called 'stages'. Stages may contain a straight background (like a wall, the façade of a building, or a mountain range) or a box with three sides, e.g., a corridor, tunnel, closed street, etc. A stage is a rough model of the 3D geometry of a scene image, with small objects ignored. ...
... At steps 11-12, the center value of each component is saved for later use. The M components of a template are given to the active contour algorithm [3] as initial contours to generate a segment, Seg, of the image in steps 13-17. The set of generated segments depends on the number of components; e.g., Fig. 3(b) shows that the template captures the basic geometry of the stage 'sky-background-ground', which has three components (n=3): the upper (k=1), middle (k=2), and bottom (k=3) parts. ...
... Hoiem et al. [7] defined it as follows: the coarse geometric properties of a scene can be estimated by learning appearance-based models of geometric classes. Nedovic et al. [3] call these classes 'stages'. The stages describe the 3D structure of an image region with respect to the camera. ...
Preprint
Full-text available
3D layout extraction from a single image is an important aspect of artificial intelligence (AI) applications, e.g., 3D TV, object localization, and scene understanding. The pixel-level 3D layout of a single image can be extracted by using prior information about the image scene geometry. The state-of-the-art method of Lou introduces predefined templates corresponding to 3D scene geometries, which are used as prior information to extract the pixel-level 3D layout from a single image. The 3D scene geometries, called stages, are rough models of indoor and outdoor scenes. We introduce a new method of 3D layout extraction that uses template-based segmentation as prior information; the template-based segmentation is further used as a weighting map in a random walk method to enhance the edges between two regions. Next, for each sub-region, multiple seeds are initialized by the random walk method, and their output segments are combined to generate the 3D layout of a single image. The proposed algorithm is evaluated on an indoor and outdoor image dataset. The experimental results show that the proposed method improves accuracy by 5.31% over the state-of-the-art methods. Additionally, the algorithm is extended to road scene extraction, where the magnitude of the Gaussian derivative is used as the weighting map instead of the template-based segmentation. Evaluated on the KITTI road scene dataset and compared quantitatively with previous efforts on a publicly available benchmark, it provides superior results.
... For computer vision, it is an important task to identify regularities in 2D images representing 3D scenes and then exploit them for the purpose of 3D scene reconstruction. Nedovic et al. [1] categorize images into a limited set of 3D scene geometries, called 'stages', according to their structural regularities; examples such as sky-background-ground, sky-ground, and box are shown in Fig. 1. Once the stage type is known to the system, the image can be aligned with its corresponding template, and the scene or objects can be placed in this 3D setup like cardboard figures. ...
... Oliva et al. [9] used the global image structure, based on the magnitude of the global Fourier transform and local wavelet transforms, to estimate the average scene depth. Nedovic et al. [1] took a more generic approach and classified images into a limited set of scene categories. They argued that scenes can be categorized into a limited number of 3D scene geometries, called stages. ...
... Thus, a particular scene image can be classified into one of the geometric stages. Nedovic et al. [1] used uniform grid regions to extract a feature set, including Weibull distribution parameters [28] (4 features), atmospheric scattering features (5 features), and perspective line features (8 features), for stage classification. Other approaches to scene recognition are based on the bag-of-visual-words (BoW) model [29]; e.g., J. Sánchez et al. [30] used dense SIFT features in the Fisher kernel (FK) framework as an alternative patch encoding strategy and obtained a reasonable accuracy of 47.2% with 50% of the SUN dataset [31] used for training. ...
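As a loose illustration of the Weibull-type texture statistics mentioned above, the following sketch fits a two-parameter Weibull distribution to the gradient magnitudes of each cell in a uniform grid; the grid size, the use of SciPy's weibull_min, and the feature layout are assumptions for illustration, not the exact feature definition of the cited papers.

```python
import numpy as np
from scipy.stats import weibull_min

def weibull_grid_features(gray, grid=(4, 4)):
    """Per-cell Weibull (scale, shape) features from gradient magnitudes.

    Sketch only: grid size and fitting choices are illustrative assumptions.
    """
    gy, gx = np.gradient(gray.astype(np.float64))
    mag = np.hypot(gx, gy)
    h, w = mag.shape
    feats = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            cell = mag[i * h // grid[0]:(i + 1) * h // grid[0],
                       j * w // grid[1]:(j + 1) * w // grid[1]].ravel()
            cell = cell[cell > 0]                       # drop zeros before fitting
            if cell.size < 10:                          # degenerate (flat) cell
                feats.extend([0.0, 0.0])
                continue
            gamma, _, beta = weibull_min.fit(cell, floc=0)  # shape, loc, scale
            feats.extend([beta, gamma])                 # beta ~ contrast, gamma ~ grain
    return np.asarray(feats)
```

Such per-cell (beta, gamma) vectors could then be concatenated with other cues and fed to any off-the-shelf classifier (e.g., an SVM) to predict the stage label.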
Article
Full-text available
Stage classification is an important task for scene understanding, 3D TV, autonomous vehicles, and object localization. Images can be categorized into a limited number of 3D scene geometries, called stages, each of which has a unique depth pattern that provides a specific context for stage objects. Moreover, convolutional neural networks (CNNs) have shown high performance in scene classification due to their powerful feature learning and reasoning. We found that an edge-preserving Laplacian filter (LF) based on Laplacian pyramids enhances the edge details of the image scene and can thereby improve the performance of stage classification. We introduce a novel method of stage classification based on a two-stream CNN model, in which one stream is encoded by the LF, the other stream takes normal RGB images, and their outputs are fused at the decision level. The proposed method is evaluated on two different stage datasets: the first, 'stage-1209', contains 1209 images, and the second, '12-scene', contains 12,000 images. Results show that LF-encoded images have a positive influence on stage classification accuracy. Moreover, when using the product rule, the proposed method obtains the most significant improvement in stage classification for both datasets; in particular, it improves stage accuracy by 7.96% on the 12-scene image dataset compared to the state-of-the-art method.
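For concreteness, decision-level fusion by the product rule, as described above, can be sketched as follows; the two probability vectors stand for the softmax outputs of the RGB stream and the Laplacian-filtered stream, and the function name and example values are illustrative assumptions rather than the paper's code.

```python
import numpy as np

def product_rule_fusion(p_rgb, p_lf, eps=1e-12):
    """Fuse two class-probability vectors by the product rule."""
    fused = (p_rgb + eps) * (p_lf + eps)   # element-wise product of posteriors
    fused /= fused.sum()                   # renormalize to a distribution
    return int(np.argmax(fused)), fused

# Example: two 3-class softmax outputs that mostly agree.
label, probs = product_rule_fusion(np.array([0.6, 0.3, 0.1]),
                                   np.array([0.5, 0.4, 0.1]))
```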
... Image scene geometry recognition is an important task in many computer vision applications, such as video categorization, autonomous vehicles, 3D TV [1], and image retrieval [1,2]. Typically, scene images are either indoor scenes (e.g., room, corridor) or outdoor scenes (e.g., beach, mountain). ...
... Typically, scene images are either indoor scenes (e.g., room, corridor) or outdoor scenes (e.g., beach, mountain). These images can be divided into semantic categories according to their rough geometric structure types, called '3D scene geometries' [1]. Each 3D scene geometry represents a rough model of a scene, with small objects ignored. ...
Article
Image scene geometry recognition is an important element in reconstructing the 3D scene geometry of a single image. It is useful for computer vision applications such as 3D TV, video categorization, and robot navigation systems. A 3D scene geometry with a unique depth profile represents the rough structure of a 2D image. Achieving an efficient implementation and high recognition accuracy for 3D scene geometry remains a significant challenge in the computer vision domain. Existing approaches attempt to use pre-trained deep convolutional neural network (CNN) models as feature extractors and also explore the benefits of multi-layer feature representations for small or medium-size datasets. However, these studies pay little attention to building a discriminative feature representation by exploring the benefits of fusing low-level features with multi-layer features from a single CNN model. To address this problem, we propose a novel model of image scene geometry recognition that integrates low-level handcrafted features with deep CNN multi-stage features (HF-MSF) using feature-fusion and score-level fusion strategies. The low-level features contain rich discriminative information about the 3D scene geometry, including shape, color, and depth estimation. In feature fusion, the multi-layer features at different stages and the handcrafted features are fused at an early phase; in score-level fusion, the handcrafted features are integrated with the multi-layer features of a single CNN model at different stages, each stage is connected with a classifier, and the score-level fusion of these classifiers is performed to recognize the scene geometry type. For validation and comparison, two well-known deep learning architectures, GoogLeNet and ResNet, are employed as backbones of the proposed model. Experimental results show that, by taking advantage of both types of fusion, the proposed HF-MSF model improves recognition accuracy by 12.21% and 4.96% compared to the G-MS2F model for the 12-Scene and 15-Scene image datasets, respectively. Similarly, it improves accuracy by 3.85% compared with the FTOTLM model for the 15-Scene dataset.
... Image scene geometry recognition is an important task for many computer vision applications, such as 3D TV, video categorization, and scene understanding. Indoor and outdoor images can be divided into a limited set of image scene geometries called 'stages' [1]. Each image scene geometry represents a rough model of a scene, with small objects ignored. ...
... Each image scene geometry represents a rough model of a scene, with small objects ignored. The geometry type of an image can be used for 3D scene information extraction [1][2][3][4]. ...
... Many existing methods, such as [1][2][3][4], show promising results for image scene geometry recognition but cannot properly predict the geometry type when the scene geometry is cluttered with multiple objects; this is because the involvement of humans in feature design and extraction significantly limits the representation capacity for scene images. ...
Conference Paper
Image scene geometry recognition is an important task for reconstructing the 3D information of a single image, which is beneficial for computer vision applications such as 3D TV and video categorization. In this paper, a novel architecture for image scene geometry recognition based on the feature-level fusion of convolutional neural network (CNN) features and low-level texture gradient features is presented. The main advantages of using low-level features are that they are simple to extract and contain rich information about the image scene geometry. The approach is evaluated on a novel scene dataset constructed from twelve different image scene geometries (1000 samples per category), and experimental results show that the proposed system achieves higher accuracy than applying the CNN alone. Additionally, by utilizing the extreme learning machine (ELM) as a classifier, the proposed system achieves 86.29% recognition accuracy, which is superior to the existing baseline methods.
... These regularities have also been proposed for use in machine vision systems. Nedovic et al. [4] identified several image-level 3D scene geometries on 2D images called 'stages'. They may contain a straight background (like a wall, the façade of a building, or a mountain range) or a box with three sides, e.g., a corridor, tunnel, closed street, etc.; Figure 1 illustrates some of them. ...
... As stages are limited in number, they can be used for 3D scene information extraction, e.g., for an autonomous vehicle, a robot navigation system, or scene understanding. In the literature, Nedovic et al. [4] and Lou et al. [5] introduce stage recognition algorithms using low-level feature sets, with the features combined at each local region. Lou et al. [5] use predefined template-based segmentation to extract features, namely histograms of oriented gradients (HOG) [6], mean color values (hue, saturation, and value (HSV), and red (R), green (G), and blue (B)), and Weibull distribution parameters [7] for each image patch, and introduce a graphical model to learn the mapping from image features to a stage. ...
... The state-of-the-art methods do not share their stage datasets publicly. Thus, we construct a new stage dataset by following the 12 stages proposed in [4]. Secondly, we utilize the '15-scene image dataset' introduced in [23], whose categories are based on global image structure. ...
Article
Full-text available
Image-level structural recognition is an important problem for many applications of computer vision such as autonomous vehicle control, scene understanding, and 3D TV. A novel method is proposed that uses image features extracted by exploiting predefined templates, each associated with an individual classifier. A template, reflecting a symmetric structure consisting of a number of components, represents a stage, i.e., a rough structure of the image geometry. The following image features are used: histogram of oriented gradients (HOG) features capturing the overall object shape, colors representing scene information, parameters of the Weibull distribution reflecting relations between image statistics and scene structure, and local binary pattern (LBP) and entropy (E) values representing texture and scene depth information. Each of the individual classifiers learns a discriminative model, and their outcomes are fused using the sum rule to recognize the global structure of an image. The proposed method achieves an 86.25% recognition accuracy on the stage dataset and a 92.58% recognition rate on the 15-scene dataset, both of which are significantly higher than those of other state-of-the-art methods.
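The feature types listed above can be sketched for a single image region as follows; the specific parameter values (HOG cells, LBP radius, histogram bins) are illustrative assumptions, not the settings of the cited paper.

```python
import numpy as np
from skimage.color import rgb2gray, rgb2hsv
from skimage.feature import hog, local_binary_pattern
from skimage.measure import shannon_entropy

def region_features(region_rgb):
    """HOG + mean color + LBP histogram + entropy for one image region.

    Sketch only; parameter choices are assumptions for illustration.
    """
    gray = rgb2gray(region_rgb)
    hog_vec = hog(gray, orientations=9, pixels_per_cell=(8, 8),
                  cells_per_block=(2, 2), feature_vector=True)
    mean_rgb = region_rgb.reshape(-1, 3).mean(axis=0)           # mean R, G, B
    mean_hsv = rgb2hsv(region_rgb).reshape(-1, 3).mean(axis=0)  # mean H, S, V
    lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
    ent = shannon_entropy(gray)                                 # texture / depth cue
    return np.concatenate([hog_vec, mean_rgb, mean_hsv, lbp_hist, [ent]])
```

One such feature vector per template component would then feed the component's classifier, and the per-classifier scores would be combined with the sum rule.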
... We propose the Generic SP approach to obtain structural image representations by exploiting the 3D scene geometry of images. The 13 stages of [7] are used as 3D priors. After the image scene geometry is estimated, the most appropriate stage per object category is selected as the spatial pyramid. ...
... A number of methods have been proposed to estimate rough scene geometry from single images [11], [12], [13]. We use the scheme that derives scene information for a wider range of generic scene categories by using stages [7]. Stages are defined as a set of prototypes of often recurring scene configurations. ...
... These models depend on the inherent geometric structure of images. In this paper, 13 different stages are used, excluding noDepth and tab+pers+bkg, as these stages are specific to the data set used in [7]. ...
Article
Full-text available
The Bag-of-Words (BoW) approach has been successfully applied in the context of category-level image classification. To incorporate spatial image information in the BoW model, Spatial Pyramids (SPs) are used. However, spatial pyramids are rigid in nature and are based on pre-defined grid configurations. As a consequence, they often fail to coincide with the underlying spatial structure of images from different categories, which may negatively affect the classification accuracy. The aim of this paper is to use the 3D scene geometry to steer the layout of spatial pyramids for category-level image classification (object recognition). The proposed approach provides an image representation by inferring the constituent geometrical parts of a scene. As a result, the image representation retains the descriptive spatial information needed to yield a structural description of the image. Large-scale experiments on Pascal VOC2007 and Caltech101 show that the SPs obtained by the proposed Generic SP approach outperform the standard SPs.
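As background, the rigid spatial-pyramid baseline discussed above pools visual-word histograms over a fixed grid at several levels; the following is a minimal sketch of that standard construction (not of the proposed geometry-steered variant), with assumed level sizes.

```python
import numpy as np

def spatial_pyramid_histogram(word_ids, xy, image_size, vocab_size, levels=(1, 2, 4)):
    """Bag-of-words histogram pooled over rigid spatial-pyramid cells.

    word_ids: (N,) visual-word index per local feature.
    xy: (N, 2) feature coordinates (x, y).
    levels: grid subdivisions per pyramid level, e.g. 1x1, 2x2, 4x4 (assumed).
    """
    w, h = image_size
    hists = []
    for g in levels:
        cx = np.minimum((xy[:, 0] * g / w).astype(int), g - 1)
        cy = np.minimum((xy[:, 1] * g / h).astype(int), g - 1)
        for i in range(g):
            for j in range(g):
                in_cell = (cx == i) & (cy == j)
                hists.append(np.bincount(word_ids[in_cell], minlength=vocab_size))
    hist = np.concatenate(hists).astype(np.float64)
    return hist / max(hist.sum(), 1.0)   # L1-normalized pyramid histogram
```

The Generic SP idea replaces the fixed grid above with cell layouts derived from the estimated stage, so the pooling regions follow the scene's 3D structure.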
... Deriving the pixel-level 3D layout from 2D still images is important for many computer vision tasks such as object localization, image understanding and video segmentation [1], [2], [3], [4]. ...
... Nedovic et al. [1] take a generic approach in which scenes are categorized into a limited set of image-level 3D geometry classes called 'stages'. Stages represent the general 3D geometric structure of scenes, such as sky-background-ground, sky-ground, and box; see Fig. 1. ...
... Instead of considering scene categories, Nedovic et al. [1] take a more abstract approach to classify 3D geometries. They show that scenes can be categorized into a limited number of 3D geometries called stages. ...
Article
Full-text available
Extracting the pixel-level 3D layout from a single image is important for different applications such as object localization and image and video categorization. Traditionally, the 3D layout is derived by solving a pixel-level classification problem. However, the image-level 3D structure can be very beneficial for extracting the pixel-level 3D layout since it implies how pixels in the image are organized. In this paper, we propose an approach that first predicts the global image structure and then uses the global structure for fine-grained pixel-level 3D layout extraction. Specifically, image features are extracted based on multiple layout templates. We then learn a discriminative model for classifying the global layout at the image level. By using latent variables, we implicitly model the sub-level semantics of the image, which enriches the expressiveness of our model. After the image-level structure is obtained, it is used as prior knowledge to infer the pixel-wise 3D layout. Experiments show that our model outperforms the state-of-the-art methods by 11.7% for 3D structure classification. Moreover, we show that employing the 3D structure prior information yields accurate 3D scene layout segmentation.
... Image statistics and scene depth: The relation between depth patterns and natural image statistics is studied in [14], [15]. They show that, in the case of a dominant structure (object or background), gradient histograms correspond to a decaying power-law distribution. ...
... On the contrary, when scene depth becomes smaller, object surfaces become larger and coarser, showing more contrasting details. In this case, natural image statistics computed from the image follow a Weibull distribution with increasing β and γ [15], which is consistent with the observations in [16]. Hence, a relation exists between natural image statistics and the depth patterns of a scene [14], [15]. ...
... In this case, natural image statistics computed from the image follow a Weibull distribution with increasing β and γ [15], which is consistent with the observations in [16]. Hence, a relation exists between natural image statistics and depth patterns of a scene [14], [15]. ...
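For reference, a standard two-parameter Weibull density with scale β (related to contrast) and shape γ (related to texture grain) is shown below; the cited work fits an integrated variant of this family to gradient magnitudes, so this is an illustrative form rather than the exact model used.

$$
f(x;\beta,\gamma) \;=\; \frac{\gamma}{\beta}\left(\frac{x}{\beta}\right)^{\gamma-1}
\exp\!\left[-\left(\frac{x}{\beta}\right)^{\gamma}\right], \qquad x \ge 0 .
$$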
Article
Full-text available
The aim of color constancy is to remove the effect of the color of the light source. As color constancy is inherently an ill-posed problem, most of the existing color constancy algorithms are based on specific imaging assumptions (e.g. grey-world and white patch assumption). In this paper, 3D geometry models are used to determine which color constancy method to use for the different geometrical regions (depth/layer) found in images. The aim is to classify images into stages (rough 3D geometry models). According to stage models; images are divided into stage regions using hard and soft segmentation. After that, the best color constancy methods is selected for each geometry depth. To this end, we propose a method to combine color constancy algorithms by investigating the relation between depth, local image statistics and color constancy. Image statistics are then exploited per depth to select the proper color constancy method. Our approach opens the possibility to estimate multiple illuminations by distinguishing nearby light source from distant illuminations. Experiments on state-of-the-art data sets show that the proposed algorithm outperforms state-of-the-art single color constancy algorithms with an improvement of almost 50% of median angular error. When using a perfect classifier (i.e, all of the test images are correctly classified into stages); the performance of the proposed method achieves an improvement of 52% of the median angular error compared to the best-performing single color constancy algorithm.
... As listed in Table II, the region depth model is either a pixel-wise non-parametric model (i.e., depth values in a region are computed or manually labeled pixel-wise) or a parametric graphics model (such as planar surfaces, polyhedra, stage models [29], and some specific models like human faces and human bodies [1]). Such region depth models are assigned in the 2D-to-stereo video conversion system (listed in Table I), either interactively by the system users or automatically by the estimation modules in the system. ...
... 2) Stage Model (μ_j = 4): According to research on visual perception [11], the partial order of distance is sufficient to describe the spatial relations among far-away objects, rather than accurate depth values. Hence, in many real scenes, background regions are approximated by a generic model, namely the stage model [29]. As shown in Fig. 3(c), a stage model consists of several planar surfaces and is a special case of a polyhedron. ...
Article
Full-text available
We propose a novel representation for stereo videos, namely 2D-plus-depth-cue. This representation is able to encode stereo videos compactly by leveraging the by-product of a stereo video conversion process. Specifically, the depth cues are derived from an interactive labeling process during 2D-to-stereo video conversion: they are contour points of image regions and their corresponding depth models. Using such cues and the image features of 2D video frames, the scene depth can be reliably recovered. Experimental results demonstrate that about 10%-50% of the bit rate can be saved in coding a stereo video compared with multi-view video coding and 2D-plus-depth methods. In addition, since the objects are segmented in the conversion process, it is convenient to adopt region-of-interest (ROI) coding in the proposed stereo video coding system. Experimental results show that with ROI coding the bit rate is reduced by 30%-40%, or the video quality is increased by 1.5 dB-4 dB at a fixed bit rate.
... These models are used to select the best unitary method. Typical 3D scene geometries, called stages, are proposed by Nedovic et al. [33]. Each stage has a certain depth layout, and 13 different stages are used in Lu's method [10]. ...
... The database of labels is made available on-line at 'www.cs.sfu.ca/~colour/data/'. Following Nedovic et al. [33], the 15 typical 3D stages are used: sky+bkg+grd (sbg), bkg+grd (bg), sky+grd (sg), grd (g), nodepth (n), grd+Tbkg(LR) (gtl), grd+Tbkg(RL) (gtr), Tbkg(LR) (tl), Tbkg(RL) (tr), tbl+Prs+bkg (tpb), 1sd+wall(LR) (wl), 1sd+wall(RL) (wr), corner (ce), corridor (cd), and prs+bkg (pb). ...
Article
Illumination estimation is an important component of color constancy and automatic white balancing. A number of methods of combining illumination estimates obtained from multiple subordinate illumination estimation methods now appear in the literature. These combinational methods aim to provide better illumination estimates by fusing the information embedded in the subordinate solutions. The existing combinational methods are surveyed and analyzed here with the goals of determining: (1) the effectiveness of fusing illumination estimates from multiple subordinate methods, (2) the best method of combination, (3) the underlying factors that affect the performance of a combinational method, and (4) the effectiveness of combination for illumination estimation in multiple-illuminant scenes. The various combinational methods are categorized in terms of whether or not they require supervised training and whether or not they rely on high-level scene content cues (e.g., indoor versus outdoor). Extensive tests and enhanced analyses using 3 data sets of real-world images are conducted. For consistency in testing, the images were labeled according to their high-level features (3D stages, indoor/outdoor) and this label data is made available on-line. The tests reveal that the trained combinational methods (direct combination by support vector regression in particular) clearly outperform both the non-combinational methods and those combinational methods based on scene content cues.
... On the other hand, dynamic weights vary with image features to guide the selection or combination of unitary estimates [10,11,18]. Various scene characteristics, including low-level properties (e.g., visual properties [10], 3D geometry [28,29]), mid-level initial illumination estimates [30], and high-level semantic content (e.g., semantic likelihood [31], indoor/outdoor classification [9]), can be used to find the best combination. Furthermore, the weights may be obtained by different algorithms, such as machine learning [19,32,33,34], a fuzzy model [24], multi-objective optimization [20], graph-based semi-supervised learning [30], or methods without prior training [9,21]. ...
Article
Full-text available
Computational color constancy (CCC) is a fundamental prerequisite for many computer vision tasks. The key to CCC is to estimate the illuminant color so that the image of a scene under varying illumination can be normalized to an image under the canonical illumination. As one type of solution, combination algorithms generally try to reach a better illuminant estimate by weighting other unitary algorithms for a given image. However, due to the diversity of image features, applying the same weighting combination strategy to different images might result in unsound illuminant estimation. To address this problem, this study provides an effective option. A two-step strategy is first employed to cluster the training images; then, for each cluster, ANFIS (adaptive neuro-fuzzy inference system) models are trained to map image features to illuminant color. Given a test image, fuzzy weights measuring the degree to which the image belongs to each cluster are calculated, and a reliable illuminant estimate is obtained by weighting all ANFIS predictions. The proposed method allows the illuminant estimate to be a dynamic combination of initial illumination estimates from several unitary algorithms, relying on the powerful learning and reasoning capabilities of ANFIS. Extensive experiments on typical benchmark datasets demonstrate the effectiveness of the proposed approach. In addition, although there is an initial observation that some learning-based methods outperform even the most carefully designed and tested combinations of statistical and fuzzy inference systems, the proposed method is good practice for illuminant estimation, considering that fuzzy inference is easy to implement in imaging signal processors with if-then rules and low computational effort.
... Other approaches to image modeling have explicitly parametrized indoor scenes as 3D boxes (or as collections of orthogonal planes) [124,84,104,178,148]. In our work, we use appearance features to infer image depth, but augment this inference with priors based on geometric reasoning about the scene. ...
Preprint
From a single picture of a scene, people can typically grasp the spatial layout immediately and even make good guesses at materials properties and where light is coming from to illuminate the scene. For example, we can reliably tell which objects occlude others, what an object is made of and its rough shape, regions that are illuminated or in shadow, and so on. It is interesting how little is known about our ability to make these determinations; as such, we are still not able to robustly "teach" computers to make the same high-level observations as people. This document presents algorithms for understanding intrinsic scene properties from single images. The goal of these inverse rendering techniques is to estimate the configurations of scene elements (geometry, materials, luminaires, camera parameters, etc) using only information visible in an image. Such algorithms have applications in robotics and computer graphics. One such application is in physically grounded image editing: photo editing made easier by leveraging knowledge of the physical space. These applications allow sophisticated editing operations to be performed in a matter of seconds, enabling seamless addition, removal, or relocation of objects in images.
... The main drawback of the TIP model lies in its limited applicability to curved-floor conditions. Nedovic et al. [20] introduce typical 3D scene geometries called stages, each with a unique depth profile. The stage information serves as a first step to infer the global depth. ...
Article
Full-text available
Road scene model construction is an important aspect of intelligent transportation system research. This paper proposes an intelligent framework that can automatically construct road scene models from image sequences. The road and foreground regions are detected at superpixel level via a new kind of random walk algorithm. The seeds for different regions are initialized by trapezoids that are propagated from adjacent frames using optical flow information. The superpixel level region detection is implemented by the random walk algorithm, which is then refined by a fast two-cycle level set method. After this, scene stages can be specified according to a graph model of traffic elements. These then form the basis of 3D road scene models. Each technical component of the framework was evaluated and the results confirmed the effectiveness of the proposed approach.
... Lu et al. (2009) proposed a 3D Scene Geometry (SG) GC method. This method models the image based on the 3D scene geometries called stages, as in Nedovic, Smeulders, Redert, and Geusebroek (2010), and for each stage a depth layout is estimated. The SG algorithm selects the best unitary illuminant estimation algorithm for the whole image based on these 3D geometry stages and also selects the best algorithm for each image region based on the depth information. ...
Article
Color constancy algorithms aim to estimate the color of the light source. Many computer vision applications, such as object detection and scene understanding, benefit from color constancy. Since a traditional color constancy algorithm uses either a statistical assumption or a trained regression function, none of these methods is a universal illuminant estimator. As a solution, researchers have proposed combination color constancy algorithms that combine the estimates of several statistical or learning-based unitary algorithms. Traditional combination methods either use static weights to combine the estimates of the unitary methods or choose the best unitary algorithm for the input image. The former fails because static weights cannot correctly reflect the underlying relationship for a wide range of scenes, and the latter has difficulty training a multi-class model with limited training data. This paper addresses these limitations of combination methods and proposes a hybrid multi-class dynamic weight model with an ensemble of classifiers. The proposed method classifies images into several groups and uses a distinct dynamic weight generation model (DWM) for each group. The DWM generates dynamic weights using image features that correlate with the capability of the unitary algorithms used for combination. Experiments on the Gehler-Shi and National University of Singapore color constancy benchmark datasets show that the proposed method outperforms the state of the art.
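To make the idea of weighting unitary estimates concrete, here is a minimal sketch with two classical unitary estimators (gray-world and max-RGB/white-patch) combined by caller-supplied weights; the weight values themselves would come from a learned dynamic-weight model, which is not reproduced here.

```python
import numpy as np

def gray_world(img):
    """Gray-world unitary estimate: illuminant ~ mean RGB."""
    e = img.reshape(-1, 3).mean(axis=0)
    return e / np.linalg.norm(e)

def max_rgb(img):
    """White-patch / max-RGB unitary estimate: illuminant ~ per-channel maximum."""
    e = img.reshape(-1, 3).max(axis=0)
    return e / np.linalg.norm(e)

def combine_estimates(img, weights):
    """Weighted combination of unitary illuminant estimates.

    weights: (2,) weights, e.g. produced by a learned dynamic-weight model;
    here they are simply supplied by the caller (illustrative sketch only).
    """
    estimates = np.vstack([gray_world(img), max_rgb(img)])
    e = weights @ estimates
    return e / np.linalg.norm(e)
```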
... If a newly emerging category is not included in the training dataset, the prediction model cannot achieve an accurate recognition rate. For better outdoor scene modeling, Nedovic et al. [11] made an in-depth study of outdoor scene reconstruction using 3D visual information. ...
Article
Multiple images have been widely used for scene understanding and navigation of unmanned ground vehicles in long-term operations. However, as the amount of visual data in multiple images is huge, the cumulative error in many cases becomes untenable. This paper proposes a novel method that can extract features from a large dataset of multiple images efficiently. Membership K-means clustering is then applied to the high-dimensional features, and the large dataset is divided into N sub-datasets to train N superpixel-based conditional random field (CRF) models. A softmax sub-dataset selector is used to decide which of the N CRF models is chosen as the prediction model for labeling images. Furthermore, experiments are conducted to evaluate the feasibility and performance of the proposed approach.
... The GIST descriptor (Oliva and Torralba, 2001) is among the first attempts to characterize the global arrangement of geometric structures using simple image features such as color, texture and gradients. Following this seminal work, a large number of supervised machine learning methods have been developed to infer approximate 3D structures or depth maps from the image using carefully designed models (Hoiem et al., 2005, 2007; Gould et al., 2009; Saxena et al., 2009; Nedovic et al., 2010) or grammars (Gupta et al., 2010; Han and Zhu, 2009). In addition, models tailored for specific scenarios have been studied, such as indoor scenes (Lee et al., 2009; Hedau et al., 2009, 2010) and urban scenes (Barinova et al., 2008). ...
Article
Full-text available
The capacity to automatically model photographic composition is valuable for many real-world machine vision applications such as digital photography, image retrieval, image understanding, and image aesthetics assessment. The triangle technique is among those indispensable composition methods on which professional photographers often rely. This paper proposes a system that can identify prominent triangle arrangements in two major categories of photographs: natural or urban scenes, and portraits. For natural or urban scene pictures, the focus is on the effect of linear perspective. For portraits, we carefully examine the positioning of human subjects in a photo. We show that line analysis is highly advantageous for modeling composition in both categories. Based on the detected triangles, new mathematical descriptors for composition are formulated and used to retrieve similar images. Leveraging the rich source of high-aesthetics photos online, similar approaches can potentially be incorporated in future smart cameras to enhance a person's photo composition skills.
... In a man-made scene, the entities are aligned mainly along three orthogonal directions (the Manhattan assumption), providing a cue about the observer's orientation or viewpoint [29,31]. Furthermore, the majority of man-made scenes can be summarized by a stage model [15,24] composed of a few dominant planes surrounding the viewing camera. Indoor scenes, for example, have a box stage with 5 planes: floor, ceiling, and left, center, and right walls. ...
Conference Paper
Leveraging the Manhattan assumption, we generate metrically rectified novel views from a single image, even for non-box scenarios. Our novel views enable already-trained classifiers to handle views missing from the training data (blind spots) without additional training. We demonstrate this on end-to-end scene text spotting under perspective. Additionally, utilizing our fronto-parallel views, we discover unsupervised invariant mid-level patches given a few widely separated training examples (the small-data domain). These invariant patches outperform various baselines on a small-data image retrieval challenge.
... Along this line of research, Gupta et al. [7] used physics-based constraints to model 3D scenes based on stability and mechanical properties. By categorizing scenes into 15 different scene geometries or stages with unique depth profiles, we can reduce the space of solutions in the 3D reconstruction problem [21]. ...
Conference Paper
Full-text available
Junctions are strong cues for understanding the geometry of a scene. In this paper, we consider the problem of detecting junctions and using them for recovering the spatial layout of an indoor scene. Junction detection has always been challenging due to missing and spurious lines. We work in a constrained Manhattan world setting where the junctions are formed by only line segments along the three principal orthogonal directions. Junctions can be classified into several categories based on the number and orientations of the incident line segments. We provide a simple and efficient voting scheme to detect and classify these junctions in real images. Indoor scenes are typically modeled as cuboids and we formulate the problem of the cuboid layout estimation as an inference problem in a conditional random field. Our formulation allows the incorporation of junction features and the training is done using structured prediction techniques. We outperform other single view geometry estimation methods on standard datasets.
Article
Spatiotemporal analysis of road scenes is a hot research topic in the communities of computer vision and intelligent transportation systems. In this paper, we propose a new framework for spatiotemporal analysis of static and dynamic traffic elements from road scenes. In the first stage, a bottom-up analysis method for static traffic elements is proposed based on a hierarchical spatiotemporal model using hidden conditional random fields (HCRF). The bottom-level features are extracted from sub-regions in the hierarchical model, and the local and global features of the image sequence are then fully combined for spatial and temporal layers. In the second stage, a lightweight multi-stream 3DCNN network is developed for the behavior classification of dynamic traffic elements, which is composed of three parts. Firstly, a SELayer-3DCNN is designed to extract the appearance, motion and edge information from the image sequences. Secondly, the channel attention fusion strategy (CAF) is introduced to enhance the feature fusion ability. Finally, the 3D-RFB module is incorporated to expand the receptive field of the convolution kernel. The experimental results well demonstrate the effectiveness of the proposed framework.
Chapter
Most successful approaches to image classification apply the Bag-of-Words (BoW) approach in the context of category-level image classification. To incorporate spatial image information in the BoW model, Spatial Pyramids (SPs) are used. However, spatial pyramids are rigid in nature and are based on predefined grid configurations. As a consequence, they often fail to coincide with the underlying spatial structure of images from different categories, which may negatively affect the classification accuracy.
Conference Paper
This paper presents a semi-automatic method for hybrid depth map generation using a fusion of monocular cues. Depth estimation is generally done using stereoscopic cameras; it is a difficult task to estimate depth from a single-view camera. In this paper, the fusion of four monocular cues (motion, linear perspective, aerial perspective, and defocus) is proposed for depth estimation using a single camera. A bilateral filter is used to obtain a continuous depth map. The results show that the present system for depth estimation based on monocular cues achieves better performance.
Conference Paper
Automatic understanding of photo composition is a valuable technology in multiple areas including digital photography, multimedia advertising, entertainment, and image retrieval. In this paper, we propose a method to model geometrically the compositional effects of linear perspective. Compared with existing methods, which have focused on basic rules of design such as simplicity, visual balance, the golden ratio, and the rule of thirds, our new quantitative model is more comprehensive whenever perspective is relevant. We first develop a new hierarchical segmentation algorithm that integrates classic photometric cues with a new geometric cue inspired by perspective geometry. We then show how these cues can be used directly to detect the dominant vanishing point in an image without extracting any line segments, a technique with implications for multimedia applications beyond this work. Finally, we demonstrate an interesting application of the proposed method for providing on-site composition feedback through an image retrieval system.
Article
Depth estimation is an important step in generating 3D structure. A number of algorithms have been explored to estimate depth based on binocular cues (using two or more images), but much less work has been done on estimating depth from monocular cues (using a single image). This paper uses defocus as a monocular cue to estimate depth. We also compare different edge detection algorithms used for depth estimation with the monocular defocus cue from a single image. Edge detection is an important step in image processing; the edge detectors used are the Roberts kernel, the Prewitt kernel, the Sobel kernel, and the Canny edge detector. The experimental results show that the Canny edge detector is superior to the others and yields better depth for a given image.
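As a small illustration of the four detectors compared above, the following OpenCV sketch computes their edge maps on a grayscale image; the kernel definitions for Roberts and Prewitt and the Canny thresholds are assumed values, not the settings of the cited paper.

```python
import cv2
import numpy as np

def edge_maps(gray):
    """Roberts, Prewitt, Sobel, and Canny edge maps for an 8-bit grayscale image."""
    roberts_x = np.array([[1, 0], [0, -1]], dtype=np.float32)
    roberts_y = np.array([[0, 1], [-1, 0]], dtype=np.float32)
    prewitt_x = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=np.float32)
    prewitt_y = prewitt_x.T

    g = gray.astype(np.float32)
    roberts = np.hypot(cv2.filter2D(g, -1, roberts_x), cv2.filter2D(g, -1, roberts_y))
    prewitt = np.hypot(cv2.filter2D(g, -1, prewitt_x), cv2.filter2D(g, -1, prewitt_y))
    sobel = np.hypot(cv2.Sobel(g, cv2.CV_32F, 1, 0, ksize=3),
                     cv2.Sobel(g, cv2.CV_32F, 0, 1, ksize=3))
    canny = cv2.Canny(gray, 100, 200)   # expects 8-bit input; thresholds assumed
    return roberts, prewitt, sobel, canny
```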
Article
This work is a contribution to understanding multi-object traffic scenes from video sequences. All data is provided by a camera system which is mounted on top of the autonomous driving platform AnnieWAY. The proposed probabilistic generative model reasons jointly about the 3D scene layout as well as the 3D location and orientation of objects in the scene. In particular, the scene topology, geometry as well as traffic activities are inferred from short video sequences. © 2013 Karlsruher Institut fur Technologie (KIT). All rights reserved.
Article
A multi-cue illumination estimation method based on tree-structured group joint sparse representation is proposed. Tests show that the proposed method works better than existing methods, most of which are based on using only a single cue type, for example, a binarized color histogram or simple image statistic such as the mean RGB. Most existing illumination estimation methods make their estimates using only one of three kinds of cues. They differ in which cue type they use, but the chosen cue is either based on (1) properties of the low-level RGB color distribution, (2) mid-level initial illuminant estimates provided by subordinate methods, or (3) high-level knowledge of scene content (e.g., indoor versus outdoor scene). The proposed multi-cue method combines the information provided by cues of all three of these types within the framework of a tree-structured group joint sparse representation (TGJSR). In TGJSR, the training data is grouped into a tree of subgroups. A test image under an unknown illuminant has its features reconstructed in terms of a joint sparse representation model derived from the grouped training data. The test image’s illumination is then estimated based on the weights involved in the joint sparse representation model. As a general framework, the proposed TGJSR framework can also easily be extended to incorporate any new features or cues that might be discovered in the future for illumination estimation.
Conference Paper
Full-text available
Extracting the 3D geometry plays an important part in scene understanding. Recently, robust visual descriptors have been proposed for extracting the indoor scene layout from a passive agent's perspective, specifically from a single image. Their robustness is mainly due to modelling the physical interaction of the underlying room geometry with the objects and the humans present in the room. In this work, we add the physical constraints coming from acoustic echoes, generated by an audio source, to this visual model. Our audio-visual 3D geometry descriptor improves over the state of the art in passive perception models, as we show in our experiments.
Conference Paper
Full-text available
Nowadays, real-time visual Simultaneous Localization And Mapping (SLAM) algorithms exist and rely on consistent measurements across multiple views. In indoor environments, where the majority of a robot's activity takes place, severe occlusions can occur, e.g., when turning around a corner or moving from one room to another. In these situations, SLAM algorithms cannot establish correspondences across views, which leads to failures in camera localization or map construction. This work takes advantage of the recent scene box layout descriptor to make the above-mentioned SLAM systems occlusion-aware. This room box reasoning helps the sequential tracker reason about possible occlusions and therefore look for matches only among potentially visible features instead of the entire map. This increases the life of the tracker, as it does not consider itself lost under occlusion. Additionally, focusing on the potentially visible portion of the map, i.e., the current room features, improves the computational efficiency without compromising accuracy. Finally, this room-level reasoning helps in better image selection for bundle adjustment. An image bundle coming from the same room has little occlusion, which leads to better dense reconstruction. We demonstrate the superior performance of layout-aware SLAM on several long monocular sequences acquired in difficult indoor situations, specifically in room-to-room transitions and when turning around a corner.
Article
In general, computational methods to estimate the color of the light source are based on single, low-level image cues such as pixel values and edges. Only a few methods have been proposed that exploit multiple cues for color constancy by incorporating pixel values, edge information, and higher-order image statistics. However, expanding color constancy beyond these low-level image statistics (pixels, edges, and n-jets) to include high-level cues, and integrating all these cues into a unified framework, has not been explored.
Conference Paper
Estimating depth information from a single image has recently attracted great attention in various vision-based applications such as mobile robot navigation. Although there are numerous depth map generation methods, little effort has been devoted to depth estimation from a single indoor scene. In this paper, we propose a novel method for estimating depth from a single indoor image via nonlinear diffusion and image segmentation techniques. One important advantage of our approach is that no learning scheme is required to estimate a depth map. Based on the proposed method, we obtain visually plausible depth estimation results even in the presence of occlusions or clutter in a single indoor image. From experimental results, we confirm that the proposed algorithm provides reliable depth information under various indoor environments.
Conference Paper
Full-text available
What types of developers do active software projects have? This paper presents a study of the characteristics of developers' activities in open source software development. GitHub, a widely-used hosting service for software development projects, provides APIs for collecting various kinds of GitHub data. To clarify the characteristics of developers' activities, we used these APIs to investigate GitHub events generated by each developer. Using this information, we categorized developers based on measures such as whether they prefer communication by coding or comments, or whether they are specialists or generalists. Our study indicates that active software projects have various kinds of developers characterized by different types of development activities.
Article
Codebooks are a widely accepted technique to recognise objects by sets of local features. The method has been applied to many classes of objects, even very abstract ones. But although state of the art recognition rates have been reported, the method is still far away from being reliable in any sense that is related to human vision. The literature on this topic emphasises detailed descriptions of statistical estimators over a basic analysis of the data. A deeper understanding of the data is however needed to achieve a further development of the field. In this paper, we therefore present a set of quantitative experiments on codebooks of the popular SIFT descriptors. The results discourage the use of illustrative but overly simplifying descriptions of the visual words approach. It is in particular demonstrated that (1) there are more visually distinct patterns than can be listed in a codebook, (2) one element of a codebook represents a set of many, visually distinct patterns, and (3) there are no single, selective SIFT descriptors to serve as codebook elements. This makes us wonder why the method works after all. We discuss several options.
Conference Paper
We address the problem of geometric and semantic consistent video segmentation for outdoor scenes. With no assumption on camera movement, we jointly model the semantic-geometric class of spatio-temporal regions (supervoxels) and geometric scene layout in each frame. Our main contribution is to propose a stage scene model to efficiently capture the dependency between the semantic and geometric labels. We build a unified CRF model on supervoxel labels and stage parameters, and design an alternating inference algorithm to minimize the resulting energy function. We also extend smoothing based on hierarchical image segmentation to spatio-temporal setting and show it achieves better performance than a pairwise random field model. Our method is evaluated on the CamVid dataset and achieves state-of-the-art per-pixel as well as per-class accuracy in predicting both semantic and geometric labels.
Conference Paper
In this paper, we focus on recovering a 3-D depth map from a single image via ground-vertical boundary analysis. First, we generate a ground map from the input image based on the spectral matting method, followed by a spatial geometric inference. After that, we derive the depth information for the ground-vertical boundaries. Unlike conventional approaches which generally use plane models to reconstruct a 3-D structure that fits the estimated boundaries, we infer a dense depth map by solving a Maximum-A-Posteriori (MAP) estimation problem. In this MAP problem, we use a generalized spatial-coherence prior model based on the Matting Laplacian (ML) matrix in order to provide a more robust solution for depth inference. We demonstrate that this approach can produce more pleasant depth maps for cluttered scenes.
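One common way to write such a spatial-coherence MAP objective, sketched here as a hedged reconstruction rather than the paper's exact formulation, uses the Matting Laplacian L, the dense depth d, the boundary depth estimates d_0, a diagonal matrix Λ selecting the constrained pixels, and a regularization weight λ; the closed-form minimizer then solves a sparse linear system:

$$
\hat{d} \;=\; \arg\min_{d}\; d^{\top} L\, d \;+\; \lambda\,(d - d_0)^{\top}\Lambda\,(d - d_0),
\qquad
(L + \lambda\,\Lambda)\,\hat{d} \;=\; \lambda\,\Lambda\, d_0 .
$$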
Conference Paper
In this paper, we propose an approach to detect scene geometric structure given only one monocular image. Several typical scene geometries are investigated and corresponding models are built. A scene geometry reasoning system is set up based on image statistical features and scene geometric features. This system is able to find the best-fitting geometric models for most of the images from the benchmark dataset. Scene categorization can reveal important three-dimensional information contained in an image. We demonstrate how this valuable information can be used to reason about the depth profile of a specific scene. The planes co-constructing the scene can be detected and located. Experiments have been done to roughly restore the structure of the scene to verify system performance. With our approach, a computer can interpret a single image in terms of its geometry straightforwardly, avoiding the usual semantic overlap and deficiency problems.
Conference Paper
Interpreting 3D structure from 2D images is a long-standing problem in the field of computer vision. Prior work has tackled this issue mainly in two different ways: depth estimation from multiple-view images based on geometric triangulation, and depth reasoning from a single image based on monocular depth cues. Both solutions do not involve direct depth map information. In this work, we captured an RGBD dataset using the Microsoft Kinect depth sensor. Approximate depth information is acquired as the fourth channel and employed as an extra reference for 3D scene geometry reasoning. It helps to achieve better estimation accuracy. We define nine basic geometric models for general indoor restricted-view scenes. Then we extract low/medium-level colour and depth features from all four of the RGBD channels. A sequential minimal optimization (SMO) SVM is used in this work as an efficient classification tool. Experiments are implemented to compare the results of this approach with previous work that does not have the depth channel as input.
Conference Paper
Recently there have been studies to predict 3D geometric scene categories from a single monocular image, where estimation of 3D scene structure is formulated as a geometric scene categorization (GSC) problem. This paper employs the GSC method to build a depth map generation system for 2D-to-3D image conversion. We show how the GSC algorithm can be effectively extended to solve the problem of estimating depth map from a single image. Experimental results demonstrate that our proposed method gives perceptually plausible depth maps, which can be applied to 2D-to-3D image conversion.
Conference Paper
Holistic scene understanding is a major goal of recent computer vision research. To address this task, reasoning about the 3D relationships of components in a scene has been identified as one of the key problems. We study this problem in terms of structural reconstruction of a 3D scene from a single-view image. Our first step concentrates on geometric layout analysis of the scene using low-level features. We assign images to seven recurring and stable geometry classes. This classification labels the image with rough knowledge of its scene geometry. Then, based on this geometry label, we propose an adaptive autonomous scene reconstruction algorithm which adopts specific approaches for different scene types. We show experimentally that, given the correct geometry label, low-quality uncalibrated monocular images from the benchmark dataset can be structurally reconstructed in 3D space efficiently. This robust approach does not require high-quality or high-complexity input images. We demonstrate the effectiveness of this approach in this paper.
Article
Estimating depth information from a single image has recently attracted great attention in 3D-TV applications, such as 2D-to-3D conversion, owing to an insufficient supply of 3-D content. In this paper, we present a new framework for estimating depth from a single image via scene classification techniques. Our goal is to produce perceptually reasonable depth for human viewers; we refer to this as pseudo depth estimation. Since the human visual system relies heavily on structural information and salient objects in understanding scenes, we propose a framework that combines two depth maps: an initial pseudo depth map (PDM) and a focus depth map. We use machine learning based scene classification to classify the image into one of two classes, namely, object-view and non-object-view. The initial PDM is estimated by segmenting salient objects (in the case of object-view) and by analyzing scene structures (in the case of non-object-view). The focus blur is locally measured to improve the initial PDM. The two depth maps are combined, and a simple filtering method is employed to generate the final PDM. Simulation results show that the proposed method outperforms other state-of-the-art approaches for depth estimation in 2D-to-3D conversion, both quantitatively and qualitatively. Furthermore, we discuss how the proposed method can effectively be extended to image sequences by employing depth propagation techniques.
Conference Paper
Video streams generate the better part of internet traffic owing to platforms such as YouTube and Youku in China. A novel genre-adaptive solution is presented in this paper, benefiting from an interdisciplinary approach that combines image and sound processing features from Pattern Recognition with high-level concepts from Mass Communications. It is shown that videos can be analyzed automatically for bundles of syntactical features which represent semantic high-level concepts typical of certain genres. In this way, semantic concepts called “Key Visuals” [1] as well as generic semantic concepts with obvious relevance to certain genres can be identified and classified in videos. Once identified, the video shots assigned to these semantic concepts can be used to create video abstracts through a dramaturgical synthesis of shots, serving as trailers that inform prospective viewers about the content of videos. Other applications include semantic video retrieval based on the identified semantic features. We describe the concepts of a system capable of automatically generating quite convincing video trailers for several important genres by using low-level audio and video processing algorithms, context-based knowledge and a rule-based system.
Article
Full-text available
Construction of three-dimensional structures from video sequences has wide applications for intelligent video analysis. This paper summarizes the key issues of the theory and surveys the recent advances in the state of the art. Reconstruction of a scene object from video sequences often takes the basic principle of structure from motion with an uncalibrated camera. This paper lists the typical strategies and summarizes the typical solutions or algorithms for modeling of complex three-dimensional structures. Open difficult problems are also suggested for further study.
Article
This paper proposes an Interval Type-2 Fuzzy Kernel based Support Vector Machine (IT2FK-SVM) for scene classification of a humanoid robot. Type-2 fuzzy sets have been shown to be a promising way to model uncertainty. Kernel design is a key component of many kernel-based methods. By integrating kernel design with type-2 fuzzy sets, a systematic design methodology for IT2FK-SVM classification of scene images is presented to improve robustness and selectivity in humanoid robot vision, involving feature extraction, dimensionality reduction and classifier learning. First, scene images are represented as high-dimensional vectors extracted from intensity, edge and orientation feature maps by a biological-vision feature extraction method. Furthermore, a novel three-domain Fuzzy Kernel-based Principal Component Analysis (3DFK-PCA) method is proposed to select the prominent variables from the high-dimensional scene image representation. Finally, an IT2FK-SVM classifier is developed for comprehensive learning of scene images in complex environments. Noise, differing view angles, and variations in lighting conditions can be treated as the uncertainties in scene images. Compared to traditional SVM classifiers with an RBF kernel, an MLP kernel, and a Weighted Kernel (WK), respectively, the proposed method performs much better than the conventional WK method owing to its integration of IT2FK, and the WK method performs better than the single-kernel methods (SVM with an RBF or MLP kernel). IT2FK-SVM is able to deal with uncertainties when scene images are corrupted by various noises and captured from different view angles. The proposed IT2FK-SVM method yields over 92% classification rates for all cases, and even achieves a 98% classification rate on the newly built dataset with the common-light case.
Article
In this paper, we present a novel probabilistic generative model for multi-object traffic scene understanding from movable platforms which reasons jointly about the 3D scene layout as well as the location and orientation of objects in the scene. In particular, the scene topology, geometry and traffic activities are inferred from short video sequences. Inspired by the impressive driving capabilities of humans, our model does not rely on GPS, lidar or map knowledge. Instead, it takes advantage of a diverse set of visual cues in the form of vehicle tracklets, vanishing points, semantic scene labels, scene flow and occupancy grids. For each of these cues we propose likelihood functions that are integrated into a probabilistic generative model. We learn all model parameters from training data using contrastive divergence. Experiments conducted on videos of 113 representative intersections show that our approach successfully infers the correct layout in a variety of very challenging scenarios. To evaluate the importance of each feature cue, experiments using different feature combinations are conducted. Furthermore, we show how by employing context derived from the proposed method we are able to improve over the state-of-the-art in terms of object detection and object orientation estimation in challenging and cluttered urban environments.
Article
Significant advances have recently been made in the development of computational methods for predicting 3D scene structure from a single monocular image. However, their computational complexity severely limits the adoption of such technologies to various computer vision and pattern recognition applications. In this paper, we address the problem of inferring 3D scene geometry from a single monocular image of man-made environments. Our goal is to estimate the 3D structure of a scene in real-time with a level of accuracy useful in certain real applications. Towards this end, we decompose the three-dimensional world space into a set of geometrically inspired primitive subspaces. One important advantage of our approach is that the complex estimation problem can be systematically broken down into a sequence of subproblems, which are easier to solve and more reliable even with the presence of occlusion or clutter, without loss of generality. The proposed algorithm also serves as the technical foundation for effective representation of the 3D scene geometry based on a simple description of the textural patterns present in the image and their spatial arrangement. Extensive experiments have been conducted on a large scale challenging dataset of real-world images. Our results demonstrate that the proposed method remarkably outperforms the recent state-of-the-art algorithms with respect to speed and accuracy.
Book
Full-text available
The image of the world projected onto the retina is essentially two-dimensional. From this image we recover information about the shapes of objects in a three-dimensional world. How is this done? The answer lies in part in the variation of brightness, or shading, often exhibited in a region of an image. In a photograph of a face, for example, there are variations in brightness, even though the reflecting properties of the skin presumably do not vary much from place to place. It may be concluded that shading effects arise primarily because some parts of a surface are oriented so as to reflect more of the incident light toward the viewer than are others. It should be pointed out right away that the recovery of shape from shading is by no means trivial. We cannot simply associate a given image brightness with a particular surface orientation. The problem is that there are two degrees of freedom to surface orientation - it takes two numbers to specify the direction of a unit vector perpendicular to the surface. Since we have only one brightness measurement at each picture cell, we have one equation in two unknowns at every point in the image. Additional constraint must therefore be brought to bear. One way to provide the needed constraint is to assume that the surface is continuous and smooth, so that the surface orientations of neighboring surface patches are not independent. Note that there is no magic at work here: we are not recovering a function of three variables given only a function of two variables. The distribution of some absorbing material in three-dimensional space cannot be recovered from a single two-dimensional projection. The techniques of tomographic reconstruction can be applied to that problem, but only if a large number of images taken from many different viewpoints are available. Why then are we able to learn so much about the three-dimensional world from merely two-dimensional images?
Article
Full-text available
This paper discusses the question: Can we improve the recognition of objects by using their spatial context? We start from Bag-of-Words models and use the Pascal 2007 dataset. We use the rough object bounding boxes that come with this dataset to investigate the fundamental gain context can bring. Our main contributions are: (I) The result of Zhang et al. in CVPR07 that context is superfluous, derived from the Pascal 2005 data set of 4 classes, does not generalize to this dataset. For our larger and more realistic dataset context is important indeed. (II) Using the rough bounding box to limit or extend the scope of an object during both training and testing, we find that the spatial extent of an object is determined by its category: (a) well-defined, rigid objects have the object itself as the preferred spatial extent. (b) Non-rigid objects have an unbounded spatial extent: all spatial extents produce equally good results. (c) Objects primarily categorised based on their function have the whole image as their spatial extent. Finally, (III) using the rough bounding box to treat object and context separately, we find that the upper bound of improvement is 26% (12% absolute) in terms of Mean Average Precision, and this bound is likely to be higher if the localisation is done using segmentation. It is concluded that object localisation, if done sufficiently precise, helps considerably in the recognition of objects for the Pascal 2007 dataset.
Conference Paper
Full-text available
We study the statistics of an ensemble of images taken in the woods. Distributions of local quantities such as contrast are scale invariant and have nearly exponential tails. Power spectra exhibit scaling with a nontrivial exponent. These data limit the information content of natural images and point to the importance of gain-control strategies in visual processing.
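As a concrete illustration of the power-spectrum scaling mentioned above, the Python sketch below estimates the log-log slope of a radially averaged power spectrum; for natural images the slope is typically near -2, whereas white noise gives a slope near 0. This is a rough, assumption-laden sketch, not the authors' procedure:

import numpy as np

def spectrum_slope(img):
    """Fit the log-log slope of the radially averaged power spectrum."""
    img = img - img.mean()
    power = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = img.shape
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h / 2, xx - w / 2).astype(int)
    # Average power within each integer-frequency ring
    radial = np.bincount(r.ravel(), weights=power.ravel()) / np.bincount(r.ravel())
    freqs = np.arange(1, min(h, w) // 2)          # skip DC, stay below Nyquist
    slope, _ = np.polyfit(np.log(freqs), np.log(radial[freqs]), 1)
    return slope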
Conference Paper
Full-text available
This paper discusses the question: Can we improve the recognition of objects by using their spatial context? We start from Bag-of-Words models and use the Pascal 2007 dataset. We use the rough object bounding boxes that come with this dataset to investigate the fundamental gain context can bring. Our main contributions are: (I) The result of Zhang et al. in CVPR07 that context is superfluous derived from the Pascal 2005 data set of 4 classes does not generalize to this dataset. For our larger and more realistic dataset context is important indeed. (II) Using the rough bounding box to limit or extend the scope of an object during both training and testing, we find that the spatial extent of an object is determined by its category: (a) well-defined, rigid objects have the object itself as the preferred spatial extent. (b) Non-rigid objects have an unbounded spatial extent : all spatial extents produce equally good results. (c) Objects primarily categorised based on their function have the whole image as their spatial extent. Finally, (III) using the rough bounding box to treat object and context separately, we find that the upper bound of improvement is 26% (12% absolute) in terms of mean average precision, and this bound is likely to be higher if the localisation is done using segmentation. It is concluded that object localisation, if done sufficiently precise, helps considerably in the recognition of objects for the Pascal 2007 dataset.
Conference Paper
Full-text available
In this paper we describe our TRECVID 2008 video retrieval experiments. The MediaMill team participated in three tasks: concept detection, automatic search, and interactive search. Rather than continuing to increase the number of concept detectors available for retrieval, our TRECVID 2008 experiments focus on increasing the robustness of a small set of detectors using a bag-of-words approach. To that end, our concept detection experiments emphasize in particular the role of visual sampling, the value of color invariant features, the influence of codebook construction, and the effectiveness of kernel-based learning parameters. For retrieval, a robust but limited set of concept detectors necessitates relying on as many auxiliary information channels as possible. Therefore, our automatic search experiments focus on predicting which information channel to trust given a certain topic, leading to a novel framework for predictive video retrieval. To improve the video retrieval results further, our interactive search experiments investigate the roles of visualizing preview results for a certain browse dimension and of active learning mechanisms that learn to solve complex search topics by analysis of user browsing behavior. The 2008 edition of the TRECVID benchmark has been the most successful MediaMill participation to date, resulting in the top ranking for both concept detection and interactive search, and a runner-up ranking for automatic retrieval. Again a lot has been learned during this year’s TRECVID campaign; we highlight the most important lessons at the end of this paper.
Article
Full-text available
We consider the task of 3-d depth estimation from a single still image. We take a supervised learning approach to this problem, in which we begin by collecting a training set of monocular images (of unstructured indoor and outdoor environments which include forests, sidewalks, trees, buildings, etc.) and their corresponding ground-truth depthmaps. Then, we apply supervised learning to predict the value of the depthmap as a function of the image. Depth estimation is a challenging problem, since local features alone are insufficient to estimate depth at a point, and one needs to consider the global context of the image. Our model uses a hierarchical, multiscale Markov Random Field (MRF) that incorporates multiscale local- and global-image features, and models the depths and the relation between depths at different points in the image. We show that, even on unstructured scenes, our algorithm is frequently able to recover fairly accurate depthmaps. We further propose a model that incorporates both monocular cues and stereo (triangulation) cues, to obtain significantly more accurate depth estimates than is possible using either monocular or stereo cues alone.
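A heavily simplified Python sketch of this supervised formulation: per-patch texture-energy features (plus image row, a strong monocular cue) are regressed onto log-depth. The actual model uses much richer multiscale features and an MRF coupling neighboring depths; the ridge regressor here is only a stand-in, and the training data (images with aligned ground-truth depth maps) are assumed given:

import numpy as np
from scipy import ndimage
from sklearn.linear_model import Ridge

def patch_features(img, patch=16):
    responses = [ndimage.sobel(img, axis=0), ndimage.sobel(img, axis=1),
                 ndimage.laplace(img),
                 ndimage.gaussian_filter(img, 2) - ndimage.gaussian_filter(img, 4)]
    h, w = img.shape
    feats = []
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            feats.append([np.abs(r[i:i + patch, j:j + patch]).mean() for r in responses]
                         + [i / h])                # vertical position in the image
    return np.asarray(feats)

def patch_log_depth(depth, patch=16):
    h, w = depth.shape
    return np.asarray([np.log(depth[i:i + patch, j:j + patch].mean())
                       for i in range(0, h - patch + 1, patch)
                       for j in range(0, w - patch + 1, patch)])

def train_depth_regressor(images, depth_maps, patch=16):
    X = np.vstack([patch_features(im, patch) for im in images])
    y = np.hstack([patch_log_depth(d, patch) for d in depth_maps])
    return Ridge(alpha=1.0).fit(X, y)              # linear stand-in for the MRF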
Article
Full-text available
The TREC Video Retrieval Evaluation (TRECVid) is an international benchmarking activity to encourage research in video information retrieval by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results. TRECVid completed its fifth annual cycle at the end of 2005 and in 2006 TRECVid will involve almost 70 research organizations, universities and other consortia. Throughout its existence, TRECVid has benchmarked both interactive and automatic/manual searching for shots from within a video corpus, automatic detection of a variety of semantic and low-level video features, shot boundary detection and the detection of story boundaries in broadcast TV news. This paper will give an introduction to information retrieval (IR) evaluation from both a user and a system perspective, highlighting that system evaluation is by far the most prevalent type of evaluation carried out. We also include a summary of TRECVid as an example of a system evaluation benchmarking campaign and this allows us to discuss whether such campaigns are a good thing or a bad thing. There are arguments for and against these campaigns and we present some of them in the paper concluding that on balance they have had a very positive impact on research progress.
Article
Full-text available
This paper presents the semantic pathfinder architecture for generic indexing of multimedia archives. The semantic pathfinder extracts semantic concepts from video by exploring different paths through three consecutive analysis steps, which we derive from the observation that produced video is the result of an authoring-driven process. We exploit this authoring metaphor for machine-driven understanding. The pathfinder starts with the content analysis step. In this analysis step, we follow a data-driven approach of indexing semantics. The style analysis step is the second analysis step. Here, we tackle the indexing problem by viewing a video from the perspective of production. Finally, in the context analysis step, we view semantics in context. The virtue of the semantic pathfinder is its ability to learn the best path of analysis steps on a per-concept basis. To show the generality of this novel indexing approach, we develop detectors for a lexicon of 32 concepts and we evaluate the semantic pathfinder against the 2004 NIST TRECVID video retrieval benchmark, using a news archive of 64 hours. Top ranking performance in the semantic concept detection task indicates the merit of the semantic pathfinder for generic indexing of multimedia archives.
Conference Paper
Full-text available
When we look at a picture, our prior knowledge about the world allows us to resolve some of the ambiguities that are inherent to monocular vision, and thereby infer 3d information about the scene. We also recognize different objects, decide on their orientations, and identify how they are connected to their environment. Focusing on the problem of autonomous 3d reconstruction of indoor scenes, in this paper we present a dynamic Bayesian network model capable of resolving some of these ambiguities and recovering 3d information for many images. Our model assumes a "floorwall" geometry on the scene and is trained to recognize the floor-wall boundary in each column of the image. When the image is produced under perspective geometry, we show that this model can be used for 3d reconstruction from a single image. To our knowledge, this was the first monocular approach to automatically recover 3d reconstructions from single indoor images.
Conference Paper
Full-text available
We present a generic and robust approach for scene categorization. A complex scene is described by proto-concepts like vegetation, water, fire, sky etc. These proto-concepts are represented by low level features, where we use natural images statistics to compactly represent color invariant texture information by a Weibull distribution. We introduce the notion of contextures which preserve the context of textures in a visual scene with an occurrence histogram (context) of similarities to proto-concept descriptors (texture). In contrast to a codebook approach, we use the similarity to all vocabulary elements to generalize beyond the code words. Visual descriptors are attained by combining different types of contexts with different texture parameters. The visual scene descriptors are generalized to visual categories by training a support vector machine. We evaluate our approach on 3 different datasets: 1) 50 categories for the TRECVID video dataset; 2) the Caltech 101-object images; 3) 89 categories being the intersection of the Corel photo stock with the Art Explosion photo stock. Results show that our approach is robust over different datasets, while maintaining competitive performance.
Conference Paper
Full-text available
We present a new approach to model visual scenes in image collections, based on local invariant features and probabilistic latent space models. Our formulation provides answers to three open questions: (1) whether the invariant local features are suitable for scene (rather than object) classification; (2) whether unsupervised latent space models can be used for feature extraction in the classification task; and (3) whether the latent space formulation can discover visual co-occurrence patterns, motivating novel approaches for image organization and segmentation. Using a 9500-image dataset, our approach is validated on each of these issues. First, we show with extensive experiments on binary and multi-class scene classification tasks that a bag-of-visterm representation, derived from local invariant descriptors, consistently outperforms state-of-the-art approaches. Second, we show that probabilistic latent semantic analysis (PLSA) generates a compact scene representation, discriminative for accurate classification, and significantly more robust when less training data are available. Third, we have exploited the ability of PLSA to automatically extract visually meaningful aspects, to propose new algorithms for aspect-based image ranking and context-sensitive image segmentation.
Chapter
Bayesian probability theory has emerged not only as a powerful tool for building computational theories of vision, but also as a general paradigm for studying human visual perception. This 1996 book provides an introduction to and critical analysis of the Bayesian paradigm. Leading researchers in computer vision and experimental vision science describe general theoretical frameworks for modelling vision, detailed applications to specific problems and implications for experimental studies of human perception. The book provides a dialogue between different perspectives both within chapters, which draw on insights from experimental and computational work, and between chapters, through commentaries written by the contributors on each others' work. Students and researchers in cognitive and visual science will find much to interest them in this thought-provoking collection.
Article
Given a single picture which is a projection of a three-dimensional scene onto the two-dimensional picture plane, we usually have definite ideas about the 3-D shapes of objects. To do this we need to use assumptions about the world and the image formation process, since there exist a large number of shapes which can produce the same picture. The purpose of this paper is to identify some of these assumptions - mostly geometrical ones - by demonstrating how the theory and techniques which exploit such assumptions can provide a systematic shape-recovery method. -from Author
Conference Paper
We derive the decomposition of the anisotropic Gaussian into a one-dimensional Gaussian filter in the x-direction followed by a one-dimensional filter in a non-orthogonal direction p. Thus the anisotropic Gaussian, too, can be decomposed by dimension, which is extremely efficient from a computing perspective. An implementation scheme for normal convolution and for recursive filtering is proposed, and directed derivative filters are demonstrated. For the recursive implementation, filtering a 512 x 512 image is performed within 65 msec, independent of the standard deviations and orientation of the filter. Accuracy of the filters remains reasonable when compared to truncation error or recursive approximation error. The anisotropic Gaussian filtering method allows fast calculation of edge and ridge maps, with high spatial and angular accuracy. For tracking applications, the normal anisotropic convolution scheme is more advantageous, with applications in the detection of dashed lines in engineering drawings. The recursive implementation is more attractive in feature detection applications, for instance in affine-invariant edge and ridge detection in computer vision. The proposed computational filtering method enables the practical applicability of orientation scale-space analysis.
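For reference, a brute-force Python sketch of what an oriented anisotropic Gaussian computes; the paper's contribution is the far faster separable/recursive scheme (an x-filter followed by a filter along a non-orthogonal direction), which this naive kernel-based version deliberately does not reproduce:

import numpy as np
from scipy.ndimage import convolve

def aniso_gauss_kernel(sigma_u=6.0, sigma_v=2.0, theta=np.deg2rad(30), size=31):
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate coordinates into the filter's principal axes (u, v)
    u = x * np.cos(theta) + y * np.sin(theta)
    v = -x * np.sin(theta) + y * np.cos(theta)
    k = np.exp(-0.5 * ((u / sigma_u) ** 2 + (v / sigma_v) ** 2))
    return k / k.sum()

def aniso_gauss_filter(img, **kw):
    return convolve(img, aniso_gauss_kernel(**kw), mode="nearest")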
Conference Paper
This chapter presents a new framework of video analysis and associated techniques to automatically parse long programs, extract story structures, and identify story units. Content-based browsing and navigation in digital video collections have centered on sequential and linear presentation of images. To facilitate such applications, nonlinear and nonsequential access into video documents is essential, especially with long programs. For many programs, this can be achieved by identifying underlying story structures, which are reflected both by visual content and by the temporal organization of composing elements. The proposed analysis and representation contribute to the extraction of scenes and story units, each representing a distinct locale or event, which cannot be achieved by shot boundary detection alone. Analysis is performed on MPEG-compressed video and without prior models. In addition, the building of story structure gives nonsequential and nonlinear access to a featured program and facilitates browsing and navigation. The result is a compact representation that serves as a summary of the story and allows hierarchical organization of video documents. Story units representing distinct events or locales from several types of video programs have been successfully segmented, and the results are promising. The video is decomposed into a hierarchy of story units and scenes, clusters of similar shots, and shots at the lowest level, which helps in further organization.
Conference Paper
In this paper, we wish to build a high quality database of images depicting scenes, along with their real-world three-dimensional (3D) coordinates. Such a database is useful for a variety of applications, including training systems for object detection and validation of 3D output. We build such a database from images that have been annotated with only the identity of objects and their spatial extent in images. Important for this task is the recovery of geometric information that is implicit in the object labels, such as qualitative relationships between objects (attachment, support, occlusion) and quantitative ones (inferring camera parameters). We describe a model that integrates cues extracted from the object labels to infer the implicit geometric information. We show that we are able to obtain high quality 3D information by evaluating the proposed approach on a database obtained with a laser range scanner. Finally, given the database of 3D scenes, we show how it can find better scene matches for an unlabeled image by expanding the database through viewpoint interpolation to unseen views.
Article
We report a six-stimulus basis for stochastic texture perception. Fragmentation of the scene by a chaotic process causes the spatial scene statistics to conform to a Weibull-distribution. The parameters of the Weibull distribution characterize the spatial structure of uniform stochastic textures of many different origins completely. In this paper, we report the perceptual significance of the Weibull parameters. We demonstrate the parameters to be sensitive to orthogonal variations in the imaging conditions, specifically to the illumination conditions, camera magnification and resolving power, and the texture orientation. Apparently, the Weibull parameters form a six-stimulus basis for stochastic texture description. The results indicate that texture perception can be approached like the experimental science of colorimetry.
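A minimal Python sketch of the kind of fit implied above, assuming the local statistics of interest are gradient magnitudes and using scipy's generic Weibull estimator rather than the authors' procedure:

import numpy as np
from scipy import ndimage, stats

def weibull_texture_params(img):
    gx = ndimage.sobel(img, axis=1, output=float)
    gy = ndimage.sobel(img, axis=0, output=float)
    mag = np.hypot(gx, gy).ravel()
    mag = mag[mag > 0]                           # Weibull support is (0, inf)
    shape, loc, scale = stats.weibull_min.fit(mag, floc=0)
    return shape, scale                          # shape and scale parameters of the fit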
Article
A texture operator for use in computer vision programs is described. The operator classifies texture according to characteristics of the Fourier transform of local image windows. Gradients of the texture are found by comparing and associating quantitative and qualitative values of adjacent windows. The gradients are then interpreted as a depth cue for longitudinal (receding) surfaces. Experimental results with natural, outdoor scenes are reported.
Article
We present a system for exploring large collections of photos in a virtual 3D space. Our system does not assume the photographs are of a single real 3D location, nor that they were taken at the same time. Instead, we organize the photos in themes, such as city streets or skylines, and let users navigate within each theme using intuitive 3D controls that include move left/right, zoom and rotate. Themes allow us to maintain a coherent semantic meaning of the tour, while visual similarity allows us to create a “being there” impression, as if the images were of a particular location. We present results on a collection of several million images downloaded from Flickr and broken into themes that consist of a few hundred thousand images each. A byproduct of our system is the ability to construct extremely long panoramas, as well as image taxi, a program that generates a virtual tour between a user supplied start and finish images. The system, and its underlying technology can be used in a variety of applications such as games, movies and online virtual 3D spaces like Second Life.
Article
LIBSVM is a library for support vector machines (SVM). Its goal is to help users to easily use SVM as a tool. In this document, we present all its implementation details. For the use of LIBSVM, the README file included in the package and the LIBSVM FAQ provide the information.
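LIBSVM ships command-line tools and C/Java/Python interfaces; one common way to reach the same solver from Python is scikit-learn's SVC, which wraps libsvm internally. A toy usage sketch with placeholder features and labels:

import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X = np.random.rand(200, 16)             # e.g. low-level scene features (placeholder)
y = np.random.randint(0, 2, 200)        # e.g. binary scene labels (placeholder)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
clf.fit(X[:150], y[:150])
print("accuracy:", clf.score(X[150:], y[150:]))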
Article
Given a single picture which is a projection of a three-dimensional scene onto the two-dimensional picture plane, we usually have definite ideas about the 3-D shapes of objects. To do this we need to use assumptions about the world and the image formation process, since there exist a large number of shapes which can produce the same picture. The purpose of this paper is to identify some of these assumptions—mostly geometrical ones—by demonstrating how the theory and techniques which exploit such assumptions can provide a systematic shape-recovery method. The method consists of two parts. The first is the application of the Origami theory, which models the world as a collection of plane surfaces and recovers the possible shapes qualitatively. The second is the technique of mapping image regularities into shape constraints for recovering the probable shapes quantitatively. Actual shape recovery from a single view is demonstrated for the scenes of an object such as a box and a chair. Given a single image, the method recovers the 3-D shapes of an object in it, and generates images of the same object as we would see it from other directions.
Article
Grouping images into semantically meaningful categories using low-level visual features is a challenging and important problem in content-based image retrieval. Based on these groupings, effective indices can be built for an image database. In this paper, we show how a specific high-level classification problem (city images vs landscapes) can be solved from relatively simple low-level features geared for the particular classes. We have developed a procedure to qualitatively measure the saliency of a feature towards a classification problem based on the plot of the intra-class and inter-class distance distributions. We use this approach to determine the discriminative power of the following features: color histogram, color coherence vector, DCT coefficient, edge direction histogram, and edge direction coherence vector. We determine that the edge direction-based features have the most discriminative power for the classification problem of interest here. A weighted k-NN classifier is used for the classification which results in an accuracy of 93.9% when evaluated on an image database of 2716 images using the leave-one-out method. This approach has been extended to further classify 528 landscape images into forests, mountains, and sunset/sunrise classes. First, the input images are classified as sunset/sunrise images vs forest & mountain images (94.5% accuracy) and then the forest & mountain images are classified as forest images or mountain images (91.7% accuracy). We are currently identifying further semantic classes to assign to images as well as extracting low level features which are salient for these classes. Our final goal is to combine multiple 2-class classifiers into a single hierarchical classifier.
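A compact Python sketch in the spirit of the edge-direction feature and weighted k-NN classifier described above; the bin count, edge threshold, and k are illustrative choices rather than the paper's settings:

import numpy as np
from scipy import ndimage
from sklearn.neighbors import KNeighborsClassifier

def edge_direction_histogram(img, bins=36, thresh=0.1):
    gx = ndimage.sobel(img, axis=1, output=float)
    gy = ndimage.sobel(img, axis=0, output=float)
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)[mag > thresh * mag.max()]   # keep strong edges only
    hist, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi))
    return hist / max(hist.sum(), 1)

def train_knn(images, labels, k=5):
    X = np.array([edge_direction_histogram(im) for im in images])
    return KNeighborsClassifier(n_neighbors=k, weights="distance").fit(X, labels)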
Conference Paper
We consider the task of depth estimation from a single monocular image. We take a supervised learning approach to this problem, in which we begin by collecting a training set of monocular images (of unstructured outdoor environments which include forests, trees, buildings, etc.) and their corresponding ground-truth depthmaps. Then, we apply supervised learning to predict the depthmap as a function of the image. Depth estimation is a challenging problem, since local features alone are insufficient to estimate depth at a point, and one needs to consider the global context of the image. Our model uses a discriminatively-trained Markov Random Field (MRF) that incorporates multiscale local- and global-image features, and models both depths at individual points as well as the relation between depths at different points. We show that, even on unstructured scenes, our algorithm is frequently able to recover fairly accurate depthmaps.
Conference Paper
Understanding how line drawings convey tri-dimensionality is of fundamental importance in explaining surface perception when photometry is either uninformative or too complex to model analytically. We put forward here a computational model for interpreting line drawings as three-dimensional surfaces, based on constraints on local surface orientation along extremal and discontinuity boundaries. Specific techniques are described for two key processes: recovering the three-dimensional conformation of a space curve (e.g., a surface boundary) from its two-dimensional projection in an image, and interpolating smooth surfaces from orientation constraints along extremal boundaries. The relevance of the model to a general theory of low-level vision is discussed.
Conference Paper
A stochastic optimization approach to stereo matching is presented. Unlike conventional correlation matching and feature matching, the approach provides a dense array of disparities, eliminating the need for interpolation. First, the stereo matching problem is defined in terms of finding a disparity map that satisfies two competing constraints: (1) matched points should have similar image intensity, and (2) the disparity map should be smooth. These constraints are expressed in an "energy" function that can be evaluated locally. A simulated annealing algorithm is used to find a disparity map that has very low energy (i.e., in which both constraints have simultaneously been approximately satisfied). Annealing allows the large-scale structure of the disparity map to emerge at higher temperatures, and avoids the problem of converging too quickly on a local minimum. Results are shown for a sparse random-dot stereogram, a vertical aerial stereogram (shown in comparison to ground truth), and an oblique ground-level scene with occlusion boundaries.
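A toy Python sketch of this annealing formulation: each pixel's disparity is resampled under a Metropolis rule so that an energy combining the intensity-match and smoothness constraints decreases as the temperature is lowered. The schedule, weights, and disparity range are illustrative only:

import numpy as np

def local_energy(left, right, disp, i, j, d, lam=2.0):
    w = left.shape[1]
    jr = int(np.clip(j - d, 0, w - 1))
    data = abs(float(left[i, j]) - float(right[i, jr]))
    smooth = 0.0
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < disp.shape[0] and 0 <= nj < disp.shape[1]:
            smooth += abs(int(d) - int(disp[ni, nj]))
    return data + lam * smooth

def anneal_stereo(left, right, max_disp=16, sweeps=50, t0=10.0, cool=0.95, seed=0):
    rng = np.random.default_rng(seed)
    disp = rng.integers(0, max_disp + 1, size=left.shape)
    temp = t0
    for _ in range(sweeps):
        for i in range(left.shape[0]):
            for j in range(left.shape[1]):
                cand = rng.integers(0, max_disp + 1)
                delta = (local_energy(left, right, disp, i, j, cand)
                         - local_energy(left, right, disp, i, j, disp[i, j]))
                if delta < 0 or rng.random() < np.exp(-delta / temp):
                    disp[i, j] = cand
        temp *= cool                    # geometric cooling schedule
    return disp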
Conference Paper
We propose a novel approach to learn and recognize natural scene categories. Unlike previous work [9,17], it does not require experts to annotate the training set. We represent the image of a scene by a collection of local regions, denoted as codewords obtained by unsupervised learning. Each region is represented as part of a "theme". In previous work, such themes were learnt from hand-annotations of experts, while our method learns the theme distributions as well as the codewords distribution over the themes without supervision. We report satisfactory categorization performances on a large set of 13 categories of complex scenes.
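A Python sketch of the unsupervised codeword step described above: random local patches are clustered with k-means and each image is summarized as a codeword histogram. The theme (topic) layer learned on top of these histograms is not reproduced here:

import numpy as np
from sklearn.cluster import KMeans

def sample_patches(img, patch=11, n=200, seed=0):
    rng = np.random.default_rng(seed)
    h, w = img.shape
    ys = rng.integers(0, h - patch, n)
    xs = rng.integers(0, w - patch, n)
    return np.array([img[y:y + patch, x:x + patch].ravel() for y, x in zip(ys, xs)])

def build_codebook(images, k=128):
    patches = np.vstack([sample_patches(im) for im in images])
    return KMeans(n_clusters=k, n_init=4, random_state=0).fit(patches)

def bow_histogram(img, codebook):
    words = codebook.predict(sample_patches(img))
    hist = np.bincount(words, minlength=codebook.n_clusters)
    return hist / hist.sum()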
Conference Paper
We consider the problem of estimating 3-d structure from a single still image of an outdoor urban scene. Our goal is to efficiently create 3-d models which are visually pleasing. We choose an appropriate 3-d model structure and formulate the task of 3-d reconstruction as a model fitting problem. Our 3-d models are composed of a number of vertical walls and a ground plane, where the ground-vertical boundary is a continuous polyline. We achieve computational efficiency by special preprocessing together with a stepwise search over the 3-d model parameters, dividing the problem into two smaller sub-problems on chain graphs. The use of Conditional Random Field models for both sub-problems allows various cues to be incorporated, and we infer the orientation of the vertical walls of the 3-d model from vanishing points.
Conference Paper
Given a set of images of scenes containing multiple object categories (e.g. grass, roads, buildings) our objective is to discover these objects in each image in an unsupervised manner, and to use this object distribution to perform scene classification. We achieve this discovery using probabilistic Latent Semantic Analysis (pLSA), a generative model from the statistical text literature, here applied to a bag of visual words representation for each image. The scene classification on the object distribution is carried out by a k-nearest neighbour classifier. We investigate the classification performance under changes in the visual vocabulary and number of latent topics learnt, and develop a novel vocabulary using colour SIFT descriptors. Classification performance is compared to the supervised approaches of Vogel & Schiele [19] and Oliva & Torralba [11], and the semi-supervised approach of Fei Fei & Perona [3] using their own datasets and testing protocols. In all cases the combination of (unsupervised) pLSA followed by (supervised) nearest neighbour classification achieves superior results. We show applications of this method to image retrieval with relevance feedback and to scene classification in videos.
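scikit-learn offers no pLSA, so the sketch below substitutes LatentDirichletAllocation as the latent-topic stage; the k-NN classification on each image's topic mixture follows the same idea as above. Inputs are assumed to be visual-word count histograms, one row per image:

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.neighbors import KNeighborsClassifier

def fit_topics(word_counts, n_topics=25):
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    z = lda.fit_transform(word_counts)            # per-image topic mixture
    return lda, z

def scene_classifier(word_counts, labels, n_topics=25, k=10):
    lda, z = fit_topics(word_counts, n_topics)
    knn = KNeighborsClassifier(n_neighbors=k).fit(z, labels)
    return lambda counts: knn.predict(lda.transform(counts))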
Conference Paper
Recently, methods for estimating 3D scene geometry or absolute scene depth information from 2D image content have been proposed. However, general applicability of these methods in depth estimation may not be realizable, as inconsistencies may be introduced due to a large variety of possible pictorial content. We identify scene categorization as the first step towards efficient and robust depth estimation from single images. To that end, we describe a limited number of typical 3D scene geometries, called stages, each having a unique depth pattern and thus providing a specific context for stage objects. This type of scene information narrows down the possibilities with respect to individual objects' locations, scales and identities. We show how these stage types can be efficiently learned and how they can lead to robust extraction of depth information. Our results indicate that stages without much variation and object clutter can be detected robustly, with up to 60% success rate.
Conference Paper
Occlusion reasoning, necessary for tasks such as navigation and object search, is an important aspect of everyday life and a fundamental problem in computer vision. We believe that the amazing ability of humans to reason about occlusions from one image is based on an intrinsically 3D interpretation. In this paper, our goal is to recover the occlusion boundaries and depth ordering of free-standing structures in the scene. Our approach is to learn to identify and label occlusion boundaries using the traditional edge and region cues together with 3D surface and depth cues. Since some of these cues require good spatial support (i.e., a segmentation), we gradually create larger regions and use them to improve inference over the boundaries. Our experiments demonstrate the power of a scene-based approach to occlusion reasoning.
Conference Paper
Image scene classification is an integral part of many aspects of image processing. Indoor and outdoor classification is a fundamental part of scene processing as it is the starting point of many semantic scene evaluation approaches. Many novel techniques have been developed to tackle this problem, but each technique relies on its own database of images, thus reducing the confidence in the success of each method. We attempt here to look at the current field of indoor/outdoor scene classification and develop a benchmark model for evaluating current methods.
Article
Using known camera motion to estimate depth from image sequences is an important problem in robot vision. Many applications of depth-from-motion, including navigation and manipulation, require algorithms that can estimate depth in an on-line, incremental fashion. This requires a representation that records the uncertainty in depth estimates and a mechanism that integrates new measurements with existing depth estimates to reduce the uncertainty over time. Kalman filtering provides this mechanism. Previous applications of Kalman filtering to depth-from-motion have been limited to estimating depth at the location of a sparse set of features. In this paper, we introduce a new, pixel-based (iconic) algorithm that estimates depth and depth uncertainty at each pixel and incrementally refines these estimates over time. We describe the algorithm and contrast its formulation and performance to that of a feature-based Kalman filtering algorithm. We compare the performance of the two approaches by analyzing their theoretical convergence rates, by conducting quantitative experiments with images of a flat poster, and by conducting qualitative experiments with images of a realistic outdoor-scene model. The results show that the new method is an effective way to extract depth from lateral camera translations. This approach can be extended to incorporate general motion and to integrate other sources of information, such as stereo. The algorithms we have developed, which combine Kalman filtering with iconic descriptions of depth, therefore can serve as a useful and general framework for low-level dynamic vision.
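The iconic (per-pixel) Kalman idea described above reduces to a scalar predict/update step at every pixel; a minimal Python sketch follows, with the frame-to-frame warping of estimates omitted and all variances chosen purely for illustration:

import numpy as np

def kalman_depth_update(depth, var, meas, meas_var, process_var=1e-4):
    """One predict/update step per pixel (all arrays share the image shape)."""
    var_pred = var + process_var                 # prediction inflates uncertainty
    gain = var_pred / (var_pred + meas_var)      # elementwise Kalman gain
    depth_new = depth + gain * (meas - depth)
    var_new = (1.0 - gain) * var_pred
    return depth_new, var_new

# Usage: start from a high-variance prior, then fold in per-frame measurements.
depth = np.zeros((240, 320))
var = np.full_like(depth, 1e3)
# for meas in depth_measurements:   # one motion-derived depth map per frame
#     depth, var = kalman_depth_update(depth, var, meas, meas_var=np.full_like(depth, 4.0))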
Article
Humans have an amazing ability to instantly grasp the overall 3D structure of a scene—ground orientation, relative positions of major landmarks, etc.—even from a single image. This ability is completely missing in most popular recognition algorithms, which pretend that the world is flat and/or view it through a patch-sized peephole. Yet it seems very likely that having a grasp of this “surface layout” of a scene should be of great assistance for many tasks, including recognition, navigation, and novel view synthesis. In this paper, we take the first step towards constructing the surface layout, a labeling of the image into geometric classes. Our main insight is to learn appearance-based models of these geometric classes, which coarsely describe the 3D scene orientation of each image region. Our multiple segmentation framework provides robust spatial support, allowing a wide variety of cues (e.g., color, texture, and perspective) to contribute to the confidence in each geometric label. In experiments on a large set of outdoor images, we evaluate the impact of the individual cues and design choices in our algorithm. We further demonstrate the applicability of our method to indoor images, describe potential applications, and discuss extensions to a more complete notion of surface layout.
Article
In this paper, we propose a computational model of the recognition of real world scenes that bypasses the segmentation and the processing of individual objects or regions. The procedure is based on a very low dimensional representation of the scene, that we term the Spatial Envelope. We propose a set of perceptual dimensions (naturalness, openness, roughness, expansion, ruggedness) that represent the dominant spatial structure of a scene. Then, we show that these dimensions may be reliably estimated using spectral and coarsely localized information. The model generates a multidimensional space in which scenes sharing membership in semantic categories (e.g., streets, highways, coasts) are projected closed together. The performance of the spatial envelope model shows that specific information about object shape or identity is not a requirement for scene categorization and that modeling a holistic representation of the scene informs about its probable semantic category.
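A crude gist-like Python sketch of "spectral and coarsely localized information": oriented gradient energy pooled over a coarse spatial grid. The actual spatial-envelope model uses Gabor filter banks and learned perceptual dimensions, so this is only an approximation of the idea:

import numpy as np
from scipy import ndimage

def gist_like(img, grid=4, n_orient=8):
    gx = ndimage.sobel(img, axis=1, output=float)
    gy = ndimage.sobel(img, axis=0, output=float)
    mag, ang = np.hypot(gx, gy), np.mod(np.arctan2(gy, gx), np.pi)
    h, w = img.shape
    desc = []
    for gi in range(grid):
        for gj in range(grid):
            sl = (slice(gi * h // grid, (gi + 1) * h // grid),
                  slice(gj * w // grid, (gj + 1) * w // grid))
            hist, _ = np.histogram(ang[sl], bins=n_orient, range=(0, np.pi),
                                   weights=mag[sl])
            desc.extend(hist)
    d = np.asarray(desc)
    return d / (np.linalg.norm(d) + 1e-8)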
Article
Image understanding requires not only individually estimating elements of the visual world but also capturing the interplay among them. In this paper, we provide a framework for placing local object detection in the context of the overall 3D scene by modeling the interdependence of objects, surface orientations, and camera viewpoint. Most object detection methods consider all scales and locations in the image as equally likely. We show that with probabilistic estimates of 3D geometry, both in terms of surfaces and world coordinates, we can put objects into perspective and model the scale and location variance in the image. Our approach reflects the cyclical nature of the problem by allowing probabilistic object hypotheses to refine geometry and vice-versa. Our framework allows painless substitution of almost any object detector and is easily extended to include other aspects of image understanding. Our results confirm the benefits of our integrated approach.
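The geometric core of putting objects "into perspective" can be illustrated with standard pinhole ground-plane relations: the image row of an object's ground contact point relative to the horizon fixes its depth, and its physical height then fixes its expected image height. The Python sketch below uses generic symbols (focal length in pixels, camera height, horizon row), not the paper's notation:

def ground_plane_depth(v_contact, v_horizon, focal_px, camera_height_m):
    """Depth (m) of a ground contact point observed 'v_contact - v_horizon' rows below the horizon."""
    dv = v_contact - v_horizon
    if dv <= 0:
        raise ValueError("contact point must lie below the horizon")
    return focal_px * camera_height_m / dv

def expected_image_height(object_height_m, depth_m, focal_px):
    """Expected image height (pixels) of an upright object at the given depth."""
    return focal_px * object_height_m / depth_m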
Article
Current vision systems are designed to perform in clear weather. Needless to say, in any outdoor application, there is no escape from “bad” weather. Ultimately, computer vision systems must include mechanisms that enable them to function (even if somewhat less reliably) in the presence of haze, fog, rain, hail and snow. We begin by studying the visual manifestations of different weather conditions. For this, we draw on what is already known about atmospheric optics, and identify effects caused by bad weather that can be turned to our advantage. Since the atmosphere modulates the information carried from a scene point to the observer, it can be viewed as a mechanism of visual information coding. We exploit two fundamental scattering models and develop methods for recovering pertinent scene properties, such as three-dimensional structure, from one or two images taken under poor weather conditions. Next, we model the chromatic effects of the atmospheric scattering and verify it for fog and haze. Based on this chromatic model we derive several geometric constraints on scene color changes caused by varying atmospheric conditions. Finally, using these constraints we develop algorithms for computing fog or haze color, depth segmentation, extracting three-dimensional structure, and recovering “clear day” scene colors, from two or more images taken under different but unknown weather conditions.
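The attenuation-plus-airlight model underlying this line of work, I = J exp(-beta d) + A (1 - exp(-beta d)), gives beta*d = ln((A - J)/(A - I)). The Python sketch below assumes the airlight A and a clear-day reference J are known, whereas the paper recovers structure from two or more images taken under different, unknown weather conditions:

import numpy as np

def scaled_depth_from_fog(foggy, clear, airlight, eps=1e-6):
    """Return beta*d, i.e. depth up to the unknown scattering coefficient beta."""
    num = np.clip(airlight - clear, eps, None)
    den = np.clip(airlight - foggy, eps, None)
    return np.log(num / den)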
Article
Understanding how line drawings convey tri-dimensionality is of fundamental importance in explaining surface perception when photometry is either uninformative or too complex to model analytically.We put forward here a computational model for interpreting line drawings as three-dimensional surfaces, based on constraints on local surface orientation along extremal and discontinuity boundaries. Specific techniques are described for two key processes: recovering the three-dimensional conformation of a space curve (e.g., a surface boundary) from its two-dimensional projection in an image, and interpolating smooth surfaces from orientation constraints along extremal boundaries. The relevance of the model to a general theory of low-level vision is discussed.
Article
The subjective visual space perceived by humans does not reflect a simple transformation of objective physical space; rather, perceived space has an idiosyncratic relationship with the real world. To date, there is no consensus about either the genesis of perceived visual space or the implications of its peculiar characteristics for visually guided behavior. Here we used laser range scanning to measure the actual distances from the image plane of all unoccluded points in a series of natural scenes. We then asked whether the differences between real and apparent distances could be explained by the statistical relationship of scene geometry and the observer. We were able to predict perceived distances in a variety of circumstances from the probability distribution of physical distances. This finding lends support to the idea that the characteristics of human visual space are determined probabilistically.
Conference Paper
In the last decade, graph-cut optimization has been popular for a variety of pixel labeling problems. Typically graph-cut methods are used to incorporate a smoothness prior on a labeling. Recently several methods incorporated ordering constraints on labels for the application of object segmentation. An example of an ordering constraint is prohibiting a pixel with a "car wheel" label to be above a pixel with a "car roof" label. We observe that the commonly used graph-cut based alpha-expansion is more likely to get stuck in a local minimum when ordering constraints are used. For certain models with ordering constraints, we develop new graph-cut moves which we call order-preserving moves. Order-preserving moves act on all labels, unlike alpha-expansion. Although the global minimum is still not guaranteed, optimization with order-preserving moves performs significantly better than alpha-expansion. We evaluate order-preserving moves for the geometric class scene labeling (introduced by Hoiem et al.) where the goal is to assign each pixel a label such as "sky", "ground", etc., so ordering constraints arise naturally. In addition, we use order-preserving moves for certain simple shape priors in graph-cut segmentation, which is a novel contribution in itself.
Conference Paper
Inferring the 3D spatial layout from a single 2D image is a fundamental visual task. We formulate it as a grouping problem where edges are grouped into lines, quadrilaterals, and finally depth-ordered planes. We demonstrate that the 3D structure of planar objects in indoor scenes can be fast and accurately inferred without any learning or indexing.
Conference Paper
We develop an integrated, probabilistic model for the appearance and three-dimensional geometry of cluttered scenes. Object categories are modeled via distributions over the 3D location and appearance of visual features. Uncertainty in the number of object instances depicted in a particular image is then achieved via a transformed Dirichlet process. In contrast with image-based approaches to object recognition, we model scale variations as the perspective projection of objects in different 3D poses. To calibrate the underlying geometry, we incorporate binocular stereo images into the training process. A robust likelihood model accounts for outliers in matched stereo features, allowing effective learning of 3D object structure from partial 2D segmentations. Applied to a dataset of office scenes, our model detects objects at multiple scales via a coarse reconstruction of the corresponding 3D geometry.
Conference Paper
This paper presents a method for recognizing scene categories based on approximate global geometric correspondence. This technique works by partitioning the image into increasingly fine sub-regions and computing histograms of local features found inside each sub-region. The resulting "spatial pyramid" is a simple and computationally efficient extension of an orderless bag-of-features image representation, and it shows significantly improved performance on challenging scene categorization tasks. Specifically, our proposed method exceeds the state of the art on the Caltech-101 database and achieves high accuracy on a large database of fifteen natural scene categories. The spatial pyramid framework also offers insights into the success of several recently proposed image descriptions, including Torralba’s "gist" and Lowe’s SIFT descriptors.
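A short Python sketch of the pyramid construction itself: codeword histograms over increasingly fine grids, concatenated with the customary level weights. `words` is assumed to be a per-pixel (or dense per-keypoint) codeword map; the pyramid-match kernel step is omitted:

import numpy as np

def spatial_pyramid(words, vocab_size, levels=3):
    h, w = words.shape
    feats = []
    for l in range(levels):
        cells = 2 ** l
        # Coarsest level and level 1 share the smallest weight; finer levels count more
        weight = 1.0 / 2 ** (levels - l) if l > 0 else 1.0 / 2 ** (levels - 1)
        for i in range(cells):
            for j in range(cells):
                block = words[i * h // cells:(i + 1) * h // cells,
                              j * w // cells:(j + 1) * w // cells]
                feats.append(weight * np.bincount(block.ravel(), minlength=vocab_size))
    f = np.concatenate(feats).astype(float)
    return f / (f.sum() + 1e-8)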
Conference Paper
Many computer vision algorithms limit their performance by ignoring the underlying 3D geometric structure in the image. We show that we can estimate the coarse geometric properties of a scene by learning appearance-based models of geometric classes, even in cluttered natural scenes. Geometric classes describe the 3D orientation of an image region with respect to the camera. We provide a multiple-hypothesis framework for robustly estimating scene structure from a single image and obtaining confidences for each geometric label. These confidences can then be used to improve the performance of many other applications. We provide a thorough quantitative evaluation of our algorithm on a set of outdoor images and demonstrate its usefulness in two applications: object detection and automatic single-view reconstruction.