Conference Paper

Microsoft COCO: Common Objects in Context

Authors:
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, C. Lawrence Zitnick

Abstract

We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in understanding an object's precise 2D location. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old, along with per-instance segmentation masks. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
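For readers who want to inspect the released annotations, below is a minimal sketch (file paths are placeholders, not part of the paper) of how the per-instance segmentations and bounding boxes are typically accessed through the public pycocotools API:

```python
# A minimal sketch of loading COCO-style annotations with pycocotools;
# the annotation path is a hypothetical local file.
from pycocotools.coco import COCO

coco = COCO("annotations/instances_val2017.json")

# Category, image, and per-instance annotation lookups.
cat_ids = coco.getCatIds(catNms=["dog"])          # category ids for a name
img_ids = coco.getImgIds(catIds=cat_ids)          # images containing that category
ann_ids = coco.getAnnIds(imgIds=img_ids[0], catIds=cat_ids, iscrowd=None)
anns = coco.loadAnns(ann_ids)

# Each annotation carries a bounding box and a per-instance segmentation,
# which can be rasterized into a binary mask.
for ann in anns:
    x, y, w, h = ann["bbox"]
    mask = coco.annToMask(ann)                    # HxW binary mask
    print(ann["category_id"], (x, y, w, h), int(mask.sum()))
```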


... We demonstrate its generality by implementing DEGC's efficient dynamic image graph construction and flexible local-global graph representation learning with standard GCN backbones. • ClusterViG reaches state-of-the-art (SOTA) performance across three representative CV tasks: ImageNet image classification [32], COCO object detection [33], and COCO instance segmentation [33], when compared to its SOTA counterparts (CNN, MLP, ViT or ViG based), with comparable total model parameters and GMACs. • ClusterViG reaches this SOTA performance with a hardware-friendly isotropic architecture. ...
... Object Detection and Instance Segmentation. We validate the generalization of ClusterViG by using it as a backbone for object detection and instance segmentation downstream tasks on the MS COCO 2017 dataset, using the Mask-RCNN framework [33]. We use a Feature Pyramid Network (FPN) [75] as our neck to extract multi-scale feature maps following prior works [26]- [29]. ...
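As an illustration of the evaluation setup this excerpt describes, the sketch below uses torchvision's stock Mask R-CNN with a ResNet-50 FPN backbone; it is a stand-in for the ClusterViG backbone, not the cited work's implementation:

```python
# Rough sketch of the standard Mask R-CNN + FPN pipeline on COCO-style data,
# using torchvision's stock ResNet-50 backbone as a placeholder.
import torch
import torchvision

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# Inference on a dummy image; a real pipeline would feed COCO 2017 images.
image = torch.rand(3, 480, 640)
with torch.no_grad():
    prediction = model([image])[0]     # dict with boxes, labels, scores, masks
print(prediction["boxes"].shape, prediction["masks"].shape)
```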
Preprint
Full-text available
Convolutional Neural Networks (CNN) and Vision Transformers (ViT) have dominated the field of Computer Vision (CV). Graph Neural Networks (GNN) have performed remarkably well across diverse domains because they can represent complex relationships via unstructured graphs. However, the applicability of GNNs for visual tasks was unexplored till the introduction of Vision GNNs (ViG). Despite the success of ViGs, their performance is severely bottlenecked due to the expensive k-Nearest Neighbors (k-NN) based graph construction. Recent works addressing this bottleneck impose constraints on the flexibility of GNNs to build unstructured graphs, undermining their core advantage while introducing additional inefficiencies. To address these issues, in this paper, we propose a novel method called Dynamic Efficient Graph Convolution (DEGC) for designing efficient and globally aware ViGs. DEGC partitions the input image and constructs graphs in parallel for each partition, improving graph construction efficiency. Further, DEGC integrates local intra-graph and global inter-graph feature learning, enabling enhanced global context awareness. Using DEGC as a building block, we propose a novel CNN-GNN architecture, ClusterViG, for CV tasks. Extensive experiments indicate that ClusterViG reduces end-to-end inference latency for vision tasks by up to 5× when compared against a suite of models such as ViG, ViHGNN, PVG, and GreedyViG, with a similar model parameter count. Additionally, ClusterViG reaches state-of-the-art performance on image classification, object detection, and instance segmentation tasks, demonstrating the effectiveness of the proposed globally aware learning strategy. Finally, input partitioning performed by DEGC enables ClusterViG to be trained efficiently on higher-resolution images, underscoring the scalability of our approach.
... Datasets and architectures. We mainly evaluate our method across a wide range of tasks, including ImageNet-1K [12] for the classification, ADE20K [77] for the semantic segmentation, COCO [39] for the object detection, and BoolQ [9], PIQA [3], SIQA [56], HellaSwag [75], and ARC [10] for the commonsense reasoning tasks. To verify the scalability of our approach, we conduct experiments on various architectures with parameter counts ranging from several to hundred million. ...
... We also investigate the generalization of our approach to semantic segmentation as well as object detection and instance segmentation tasks. We choose ADE20K [77] and COCO [39] as our benchmark datasets. For semantic segmentation, following Zhao et al. [76], we adopt UperNet [72] as the segmentation model and train it on ADE20K to prepare checkpoints. ...
Preprint
Parameter generation has struggled to scale up for a long time, significantly limiting its range of applications. In this study, we introduce \textbf{R}ecurrent diffusion for large-scale \textbf{P}arameter \textbf{G}eneration, called \textbf{RPG}. We first divide the trained parameters into non-overlapping parts, after which a recurrent model is proposed to learn their relationships. The recurrent model's outputs, as conditions, are then fed into a diffusion model to generate the neural network parameters. Using only a single GPU, recurrent diffusion enables us to generate popular vision and language models such as ConvNeXt-L and LoRA parameters of LLaMA-7B. Meanwhile, across various architectures and tasks, the generated parameters consistently perform comparable results over trained networks. Notably, our approach also shows the potential to generate models for handling unseen tasks, which largely increases the practicality of parameter generation. Our code is available \href{https://github.com/NUS-HPC-AI-Lab/Recurrent-Parameter-Generation}{here}.
... In recent years, deep convolutional neural networks (DCNNs) have rapidly developed and achieved significant breakthroughs in numerous computer vision tasks [1][2][3][4][5]. However, these models typically require long training periods on large datasets to achieve excellent performance [6]. In many real-world applications, especially in healthcare and endangered species research, collecting and annotating large amounts of data is highly challenging. ...
... The few-shot classification loss can be calculated using (3). To regularize the feature space, the contrastive prototype loss is calculated using (6). Thus the total learning objective of the proposed CPL-DFNet is finally given by: ...
Article
Full-text available
Metric-based few-shot image classification methods generally perform classification by comparing the distances between the query sample features and the prototypes of each class. These methods often focus on constructing prototype representations for each class or learning a metric, while neglecting the significance of the feature space itself. In this paper, we redirect the focus to feature space construction, with the goal of constructing a discriminative feature space for few-shot image classification tasks. To this end, we designed a contrastive prototype loss that incorporates the distribution of query samples with respect to class prototypes in the feature space, emphasizing intra-class compactness and inter-class separability, thereby guiding the model to learn a more discriminative feature space. Based on this loss, we propose a contrastive prototype loss based discriminative feature network (CPL-DFNet) to address few-shot image classification tasks. CPL-DFNet enhances sample utilization by fully leveraging the distance relationships between query samples and class prototypes in the feature space, creating more favorable conditions for few-shot image classification tasks and significantly improving classification performance. We conducted extensive experiments on both general and fine-grained few-shot image classification benchmark datasets to validate the effectiveness of the proposed CPL-DFNet method. The experimental results show that CPL-DFNet can effectively perform few-shot image classification tasks and outperforms many existing methods across various task scenarios, demonstrating significant performance advantages.
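For context, here is a hedged sketch of the prototype-distance classification this abstract builds on; the cited paper's contrastive prototype loss is its own contribution, so the compactness/separability regularizer below is only an illustrative stand-in:

```python
# Illustrative sketch of prototype-based few-shot classification. The
# regularizer is a generic stand-in, not the paper's contrastive prototype loss.
import torch
import torch.nn.functional as F

def prototype_logits(query_feats, prototypes):
    # Negative squared Euclidean distance to each class prototype.
    return -torch.cdist(query_feats, prototypes) ** 2

def few_shot_loss(query_feats, query_labels, prototypes, reg_weight=0.1):
    logits = prototype_logits(query_feats, prototypes)
    cls_loss = F.cross_entropy(logits, query_labels)
    # Stand-in regularizer: pull queries toward their own prototype
    # (intra-class compactness) and push prototypes apart (inter-class separability).
    own = prototypes[query_labels]
    compact = (query_feats - own).pow(2).sum(dim=1).mean()
    separate = -torch.pdist(prototypes).mean()
    return cls_loss + reg_weight * (compact + separate)
```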
... Then, the original data are filtered and processed through strip imaging. Finally, the SAR images are cropped into fixed-pixel-size images, and the data are annotated manually or semi-manually to construct a target detection dataset in a dedicated format, such as the MS COCO format [28] or the PASCAL VOC format [29]. ...
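For reference, the MS COCO detection annotation format mentioned in the excerpt is a single JSON file with images, annotations, and categories tables; a minimal skeleton with placeholder values:

```python
# Skeleton of the MS COCO detection annotation format; field names follow the
# public COCO spec, values are placeholders for illustration only.
import json

coco_style = {
    "images": [
        {"id": 1, "file_name": "sar_patch_0001.png", "width": 512, "height": 512}
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,
            "bbox": [120.0, 86.0, 40.0, 22.0],   # [x, y, width, height]
            "area": 880.0,
            "iscrowd": 0,
        }
    ],
    "categories": [{"id": 1, "name": "vehicle", "supercategory": "vehicle"}],
}

with open("annotations.json", "w") as f:
    json.dump(coco_style, f)
```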
Article
Full-text available
With the advent of high-quality SAR images and the rapid development of computing technology, object detection algorithms based on convolutional neural networks have attracted a lot of attention in the field of SAR object detection. At present, the main datasets for SAR target detection in China focus on ships; there is a lack of SAR vehicle detection datasets, and complex ground scenes can affect vehicle detection performance. To solve these problems, we propose a lightweight SAR vehicle detection algorithm, aiming to improve vehicle detection accuracy and simplify model complexity. First, we constructed a multi-band SAR vehicle detection dataset (SVDD) with annotations as the training dataset of the object detection model. Then, we introduce dual conv into the RT-DETR model. Dual conv uses group convolution to filter the convolutional network and reduce model parameters, so we can achieve lightweight, real-time, end-to-end detection. Finally, we use the mmdetection framework as a performance benchmark and test robustness under different conditions. Experimental results show that the AP50 of our proposed method reaches 98.5%, achieving excellent detection performance.
... [Dataset table excerpt: LSP [62], 2010, 2,000 images; COCO [92], 2014, 328,000 images; Expressive hands and faces dataset [129][63], 2015, 297,000 frames; PennAction [192], 2013, 2,326 videos; PoseTrack [7], 2018, 1,337 videos.] ... Lastly, achieving real-time recognition with low latency while maintaining high accuracy is a persistent challenge, especially for early-stage gesture detection. ...
Preprint
Full-text available
Hand gesture recognition has become an important research area, driven by the growing demand for human-computer interaction in fields such as sign language recognition, virtual and augmented reality, and robotics. Despite the rapid growth of the field, there are few surveys that comprehensively cover recent research developments, available solutions, and benchmark datasets. This survey addresses this gap by examining the latest advancements in hand gesture and 3D hand pose recognition from various types of camera input data including RGB images, depth images, and videos from monocular or multiview cameras, examining the differing methodological requirements of each approach. Furthermore, an overview of widely used datasets is provided, detailing their main characteristics and application domains. Finally, open challenges such as achieving robust recognition in real-world environments, handling occlusions, ensuring generalization across diverse users, and addressing computational efficiency for real-time applications are highlighted to guide future research directions. By synthesizing the objectives, methodologies, and applications of recent studies, this survey offers valuable insights into current trends, challenges, and opportunities for future research in human hand gesture recognition.
... With the proposal of various new types of networks, deep learning-based detectors have achieved great success on multiple large-scale object detection datasets 30,31 , and corresponding methods [32][33][34][35] for concealed object detection have been actively explored. Xiao et al. combined Faster R-CNN's preprocessing and structural optimization to propose a fast detection framework called R-PCNN 32 , which effectively improves the efficiency and precision of object detection in human THz images. ...
Article
Full-text available
The terahertz (THz) security scanner offers advantages such as non-contact inspection and the ability to detect various types of dangerous goods, playing an important role in preventing terrorist attacks. We aim to accurately and quickly detect concealed objects in THz security images. However, current object detection algorithms face many challenges when applied to THz images. The main reasons for the detection difficulty are that the concealed objects are small, the image resolution is low, and there is background noise. Many methods often ignore the contextual dependency of the objects, hindering the effective capture of the object’s features. To address this task, this paper first proposes an adaptive context-aware attention network (ACAN), which models global contextual association features in both spatial and channel dimensions. By dynamically combining local features and their global relationships, contextual association information can be obtained from the input features, and enhanced attention features can be achieved through feature fusion to enable precise detection of concealed objects. Secondly, we improved the adaptive convolution and developed the dynamic adaptive convolution block (DACB). DACB can adaptively adjust convolution filter parameters and allocate the filters to the corresponding spatial regions, then filter the feature maps to suppress interference information. Finally, we integrated these two components into YOLOv8, resulting in Adaptation-YOLO. Through wide-ranging experiments on the active THz image dataset, the results demonstrate that the suggested method effectively improves the accuracy and efficiency of object detectors.
... Finally, we also randomly sampled 450 non-OSN images containing human faces from the COCO2017 dataset [49] to construct Dataset 2.B. The same three authors who labeled images for Dataset 2.A did the labeling work for the 450 images, and people in photos were labeled as bystanders only when both or all three authors agreed. ...
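A rough sketch of this kind of sampling using the public COCO API follows; note that COCO annotates whole person instances rather than faces, so a person-based filter only approximates the face-based selection described, and the paths are placeholders:

```python
# Approximate sketch of sampling COCO 2017 images that contain people,
# as candidates for manual face/bystander labeling. Paths are placeholders.
import random
from pycocotools.coco import COCO

coco = COCO("annotations/instances_val2017.json")
person_ids = coco.getCatIds(catNms=["person"])
img_ids = coco.getImgIds(catIds=person_ids)   # images containing people

random.seed(0)
sampled = random.sample(img_ids, 450)
print(len(sampled), "candidate images for manual labeling")
```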
Preprint
Full-text available
Online users often post facial images of themselves and other people on online social networks (OSNs) and other Web 2.0 platforms, which can lead to potential privacy leakage of people whose faces are included in such images. There is limited research on understanding face privacy in social media while considering user behavior. It is crucial to consider privacy of subjects and bystanders separately. This calls for the development of privacy-aware face detection classifiers that can distinguish between subjects and bystanders automatically. This paper introduces such a classifier trained on face-based features, which outperforms the two state-of-the-art methods by a significant margin (by 13.1% and 3.1% for OSN images, and by 17.9% and 5.9% for non-OSN images). We developed a semi-automated framework for conducting a large-scale analysis of the face privacy problem by using our novel bystander-subject classifier. We collected 27,800 images, each including at least one face, shared by 6,423 Twitter users. We then applied our framework to analyze this dataset thoroughly. Our analysis reveals eight key findings of different aspects of Twitter users' real-world behaviors on face privacy, and we provide quantitative and qualitative results to better explain these findings. We share the practical implications of our study to empower online platforms and users in addressing the face privacy problem efficiently.
... For the pipelines, we employ Llama-3-8B [1] as the LLM and LLaVA-1.6 [22] as the MLLM. For datasets, we use COCO [21] as the image captioning dataset and VQAv2 [12] for the VQA dataset. ...
Preprint
While CLIP has significantly advanced multimodal understanding by bridging vision and language, the inability to grasp negation - such as failing to differentiate concepts like "parking" from "no parking" - poses substantial challenges. By analyzing the data used in the public CLIP model's pre-training, we posit this limitation stems from a lack of negation-inclusive data. To address this, we introduce data generation pipelines that employ a large language model (LLM) and a multimodal LLM to produce negation-inclusive captions. Fine-tuning CLIP with data generated from our pipelines, we develop NegationCLIP, which enhances negation awareness while preserving the generality. Moreover, to enable a comprehensive evaluation of negation understanding, we propose NegRefCOCOg, a benchmark tailored to test VLMs' ability to interpret negation across diverse expressions and positions within a sentence. Experiments on various CLIP architectures validate the effectiveness of our data generation pipelines in enhancing CLIP's ability to perceive negation accurately. Additionally, NegationCLIP's enhanced negation awareness has practical applications across various multimodal tasks, demonstrated by performance gains in text-to-image generation and referring image segmentation.
... In the first stage, the input image x is converted into a target image x′ with a light level l, and then in the second stage, the target image x′ is restored to a reconstructed image with the original image light level. Because the model supports training with datasets of non-low-light images, we use the small dataset of Zero-DCE, the full dataset of EnlightenGAN, and the COCO2017 Val dataset [25], with a total of 8,508 images, for our research. Later we compare the results of the model trained using only the COCO2017 Val dataset. ...
Article
Full-text available
Deep learning-based methods have achieved remarkable success in the problem of low-light image enhancement. However, previous works mainly focused on model training using paired or unpaired datasets. Are they still competitive in the absence of accurate classification of image light levels? Instead of classifying image data based on illumination, the approach presented in this paper directly uses images with mixed light levels to train the model. We propose an Unsupervised Low-light Enhancement network, dubbed ULE-Net, that inserts a Light Modulation Module (LMM) into the network to dynamically control the light level of the output image during the calculation process. In the training phase, the available light space of the image is traversed to realize the learning of multi-light levels. The binary conversion problem of low-light image to normal image is successfully converted to a discrete/continuous light conversion problem in the image light space. Through extensive experiments, our proposed method outperforms recent methods in various metrics of visual quality. Meanwhile, our enhancement results are likewise competitive with the most recent state-of-the-art methods when training with the COCO dataset in the field of non-low-light images. Additionally, our approach demonstrates that for the issue of low-light image enhancement, the light level requirement of the training image is completely arbitrary.
... COCO [85], ADE20K [200], Cityscapes [200]: object detection, semantic segmentation ...
Preprint
We present a survey paper on methods and applications of digital twins (DT) for urban traffic management. While the majority of studies on the DT focus on its "eyes," i.e., emerging sensing and perception capabilities such as object detection and tracking, what really distinguishes the DT from a traditional simulator lies in its "brain": the prediction and decision-making capabilities of extracting patterns and making informed decisions from what has been seen and perceived. In order to add value to urban transportation management, DTs need to be powered by artificial intelligence and complemented by low-latency, high-bandwidth sensing and networking technologies. We will first review the DT pipeline leveraging cyber-physical systems and propose our DT architecture deployed on a real-world testbed in New York City. This survey paper can be a pointer to help researchers and practitioners identify challenges and opportunities for the development of DTs; a bridge to initiate conversations across disciplines; and a road map to exploiting the potential of DTs for diverse urban transportation applications.
... It is well-known for its real-time object detection capabilities, outperforming Fast and Faster Region-based CNN (R-CNN) [42] for tool detection (see Figure 3). While YOLO comes pre-trained on general datasets like COCO [43] or ImageNet [44], its direct application in surgical video analysis may be suboptimal. Surgical instruments and operating environments possess unique characteristics, necessitating fine-tuning of the model for optimal performance. ...
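The fine-tuning workflow described here is commonly run with the ultralytics package; a hedged sketch with a placeholder dataset configuration and hyperparameters (not the cited study's actual settings):

```python
# Sketch of fine-tuning a COCO-pretrained YOLO model on a custom
# surgical-tool dataset; the YAML path and hyperparameters are placeholders.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                 # weights pre-trained on COCO
model.train(data="surgical_tools.yaml",    # hypothetical dataset config
            epochs=50, imgsz=640)
results = model.predict("frame_000123.jpg", conf=0.5)
print(results[0].boxes)
```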
Preprint
Full-text available
The interest in leveraging Artificial Intelligence (AI) for surgical procedures to automate analysis has witnessed a significant surge in recent years. One of the primary tools for recording surgical procedures and conducting subsequent analyses, such as performance assessment, is through videos. However, these operative videos tend to be notably lengthy compared to other fields, spanning from thirty minutes to several hours, which poses a challenge for AI models to effectively learn from them. Despite this challenge, the foreseeable increase in the volume of such videos in the near future necessitates the development and implementation of innovative techniques to tackle this issue effectively. In this article, we propose a novel technique called Kinematics Adaptive Frame Recognition (KAFR) that can efficiently eliminate redundant frames to reduce dataset size and computation time while retaining useful frames to improve accuracy. Specifically, we compute the similarity between consecutive frames by tracking the movement of surgical tools. Our approach follows these steps: i) Tracking phase: a YOLOv8 model is utilized to detect tools presented in the scene, ii) Similarity phase: Similarities between consecutive frames are computed by estimating variation in the spatial positions and velocities of the tools, iii) Classification phase: An X3D CNN is trained to classify segmentation. We evaluate the effectiveness of our approach by analyzing datasets obtained through retrospective reviews of cases at two referral centers. The Gastrojejunostomy (GJ) dataset covers procedures performed from 2017 to 2021, while the Pancreaticojejunostomy (PJ) dataset spans from 2011 to 2022 at the same centers. By adaptively selecting relevant frames, we achieve a tenfold reduction in the number of frames while improving accuracy by 4.32% (from 0.749 to 0.7814).
... An overview of patch indexes and corresponding causal mask from raster-scan, concentric, and all-one position encoding on an example from COCO (Lin et al., 2014). ...
Preprint
Full-text available
Vision-language Models (VLMs) have shown remarkable capabilities in advancing general artificial intelligence, yet the irrational encoding of visual positions persists in inhibiting the models' comprehensive perception performance across different levels of granularity. In this work, we propose Pyramid-descent Visual Position Encoding (PyPE), a novel approach designed to enhance the perception of visual tokens within VLMs. By assigning visual position indexes from the periphery to the center and expanding the central receptive field incrementally, PyPE addresses the limitations of traditional raster-scan methods and mitigates the long-term decay effects induced by Rotary Position Embedding (RoPE). Our method reduces the relative distance between interrelated visual elements and instruction tokens, promoting a more rational allocation of attention weights and allowing for a multi-granularity perception of visual elements and countering the over-reliance on anchor tokens. Extensive experimental evaluations demonstrate that PyPE consistently improves the general capabilities of VLMs across various sizes. Code is available at https://github.com/SakuraTroyChen/PyPE.
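One simple reading of the periphery-to-center indexing described in the abstract, sketched for a square patch grid; this is an illustration, not the released PyPE implementation:

```python
# Minimal numpy sketch of a "periphery-to-center" position indexing scheme:
# outer rings receive smaller indexes, the central patch the largest.
import numpy as np

def concentric_indexes(grid: int) -> np.ndarray:
    rows, cols = np.indices((grid, grid))
    center = (grid - 1) / 2.0
    # Chebyshev ring distance from the center patch.
    ring = np.maximum(np.abs(rows - center), np.abs(cols - center))
    # Rank patches so that larger ring distance (periphery) comes first.
    order = np.argsort(-ring.ravel(), kind="stable")
    idx = np.empty(grid * grid, dtype=int)
    idx[order] = np.arange(grid * grid)
    return idx.reshape(grid, grid)

print(concentric_indexes(5))
```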
... To simulate a simplified social media scenario where posts typically contain relatively short text alongside images, we leverage the MS COCO2017 dataset [84]. This dataset offers a rich collection of both image and text data, with each image accompanied by five human-written descriptions. ...
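A brief sketch of pairing COCO 2017 images with their five human-written captions via the public COCO API, roughly the image-text pairing this excerpt describes (the annotation path is a placeholder):

```python
# Sketch of loading COCO caption annotations with pycocotools.
from pycocotools.coco import COCO

caps = COCO("annotations/captions_val2017.json")
img_id = caps.getImgIds()[0]
ann_ids = caps.getAnnIds(imgIds=img_id)
captions = [a["caption"] for a in caps.loadAnns(ann_ids)]
print(img_id, captions)   # typically five captions per image
```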
Article
Full-text available
The proliferation of text generation applications in social networks has raised concerns about the authenticity of online content. Large language models like GPTs can now produce increasingly indistinguishable text from human-written content. While learning-based classifiers can be trained to differentiate between human-written and machine-generated text, their robustness is often questionable. This work first demonstrates the vulnerability of pre-trained human-written text detectors to simple mutation-based adversarial attacks. We then propose a novel black-box defense strategy to enhance detector robustness on such attacks without requiring any knowledge about the attacking method. Our experiments demonstrate that the proposed black-box method significantly enhances detector performance in discerning human-authored from machine-generated text, achieving comparable results to white-box defense strategies.
... Fig. 1 (caption): A detailed spectrogram depicting almost all wavelength and frequency ranges (from gamma ray, X-ray, ultraviolet, and infrared through microwave, radar, and radio broadcast bands), particularly expanding the range of the human visual system and annotating corresponding computer vision and image fusion datasets [5]. ...
Preprint
Full-text available
Infrared-visible image fusion (IVIF) is a critical task in computer vision, aimed at integrating the unique features of both infrared and visible spectra into a unified representation. Since 2018, the field has entered the deep learning era, with an increasing variety of approaches introducing a range of networks and loss functions to enhance visual performance. However, challenges such as data compatibility, perception accuracy, and efficiency remain. Unfortunately, there is a lack of recent comprehensive surveys that address this rapidly expanding domain. This paper fills that gap by providing a thorough survey covering a broad range of topics. We introduce a multi-dimensional framework to elucidate common learning-based IVIF methods, from visual enhancement strategies to data compatibility and task adaptability. We also present a detailed analysis of these approaches, accompanied by a lookup table clarifying their core ideas. Furthermore, we summarize performance comparisons, both quantitatively and qualitatively, focusing on registration, fusion, and subsequent high-level tasks. Beyond technical analysis, we discuss potential future directions and open issues in this area. For further details, visit our GitHub repository: https://github.com/RollingPlain/IVIF_ZOO.
... Additionally, note that the MS COCO [44] weights have been used, which were passed as input parameters to the YOLO algorithm. ...
Article
Full-text available
This work presents an automated welding inspection system based on a neural network trained through a series of 2D images of welding seams obtained in the same study. The object detection method follows a geometric deep learning model based on convolutional neural networks. Following an extensive review of available solutions, algorithms, and networks based on this convolutional strategy, it was determined that the You Only Look Once algorithm in its version 8 (YOLOv8) would be the most suitable for object detection due to its performance and features. Consequently, several models have been trained to enable the system to predict specific characteristics of weld beads. Firstly, the welding strategy used to manufacture the weld bead was predicted, distinguishing between Flux-Cored Arc Welding (FCAW) and Gas Metal Arc Welding (GMAW), two of the predominant welding processes used in many industries, including shipbuilding, automotive, and aeronautics. In a subsequent experiment, the distinction between a well-manufactured weld bead and a defective one was predicted. In a final experiment, it was possible to predict whether a weld seam was well-manufactured or not, distinguishing between three possible welding defects. The study demonstrated high performance in three experiments, achieving top results in both binary classification (in the first two experiments) and multiclass classification (in the third experiment). The average prediction success rate exceeded 97% in all three experiments.
... For instance, spliced images from COLUMBIA [103,104] are made by randomly copying and pasting pixel areas between images, while spliced images from CASIA were manually created by people using Photoshop. DEFACTO, however, was automatically generated from the MS-COCO dataset [157], pasting objects from 7 different semantic classes into random areas of images consistent with the semantics of the selected objects. Hence, it is expected to see a significant drop in performances when a detector trained on one of these datasets is tested on another; see references [121,158]. ...
Thesis
Full-text available
Today, it is easier than ever to manipulate images for unethical purposes. This practice is therefore increasingly prevalent in social networks and advertising. Malicious users can for instance generate convincing deep fakes in a few seconds to lure a naive public. Alternatively, they can also communicate secretly by hiding illegal information in images. Such abilities raise significant security concerns regarding misinformation and clandestine communications. The Forensics community thus actively collaborates with Law Enforcement Agencies worldwide to detect image manipulations. The most effective methodologies for image forensics rely heavily on convolutional neural networks meticulously trained on controlled databases. These databases are actually curated by researchers to serve specific purposes, resulting in a great disparity from the real-world datasets encountered by forensic practitioners. This data shift poses a clear challenge for practitioners, hindering the effectiveness of standardized forensics models when applied in practical situations. Through this thesis, we aim to improve the efficiency of forensics models in practical settings, designing strategies to mitigate the impact of data shift. It starts by exploring literature on out-of-distribution generalization to find existing strategies already helping practitioners to make efficient forensic detectors in practice. Two main frameworks notably hold promise: the implementation of models inherently able to learn how to generalize on images coming from a new database, or the construction of a representative training base allowing forensics models to generalize effectively on scrutinized images. Both frameworks are covered in this manuscript. When faced with many unlabeled images to examine, domain adaptation strategies matching training and testing bases in latent spaces are designed to mitigate data shifts encountered by practitioners. Unfortunately, these strategies often fail in practice despite their theoretical efficiency, because they assume that scrutinized images are balanced, an assumption unrealistic for forensic analysts, as suspects might be for instance entirely innocent. Additionally, such strategies are typically tested assuming that an appropriate training set has been chosen from the beginning, to facilitate adaptation on the new distribution. Trying to generalize on a few images is more realistic but inherently much more difficult. We precisely deal with this scenario in the second part of this thesis, gaining a deeper understanding of data shifts in digital image forensics. Exploring the influence of traditional processing operations on the statistical properties of developed images, we formulate several strategies to select or create training databases relevant for a small amount of images under scrutiny. Our final contribution is a framework leveraging statistical properties of images to build relevant training sets for any testing set in image manipulation detection. This approach improves by far the generalization of classical steganalysis detectors on practical sets encountered by forensic analysts and can be extended to other forensic contexts.
... The dataset is all shot on an optical platform under normal conditions. Existing methods typically use UCF101 [37] as the training dataset for this type of VCM and COCO [38] as the training dataset for the object detection model. However, since video classification and object detection requirements are specific to different human demonstration videos and objects, a dataset is created. ...
Article
Robot pick-and-place for unknown objects is still a very challenging research topic. This paper proposes a multi-modal learning method for robot one-shot imitation of pick-and-place tasks. This method aims to enhance the generality of industrial robots while reducing the amount of data and training costs the one-shot imitation method relies on. The method first categorizes human demonstration videos into different tasks, and these tasks are classified into six types to symbolize as many types of pick-and-place tasks as possible. Second, the method generates multi-modal prompts and finally predicts the action of the robot and completes the symbolic pick-and-place task in industrial production. A carefully curated dataset is created to complement the method. The dataset consists of human demonstration videos and instance images focused on real-world scenes and industrial tasks, which fosters adaptable and efficient learning. Experimental results demonstrate favorable success rates and loss results both in simulation environments and real-world experiments, confirming its effectiveness and practicality.
... To achieve this goal, we use YOLOv10 [35] in UVtrack, which is an efficient framework for target detection. In UVtrack, the YOLOv10 network is pretrained on the modified COCO dataset [36], which is committed to fast and robust recognition and segmentation in any scene of human activities. Using YOLOv10 to detect pedestrians can realize robust and accurate pedestrian detection. ...
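A hedged sketch of COCO-pretrained YOLO inference restricted to the person class, approximating the pedestrian-detection front end described above; the weight file name and image path are placeholders:

```python
# Sketch of pedestrian-only detection with a COCO-pretrained YOLO model;
# class id 0 is "person" in COCO ordering. Weight name and path are placeholders.
from ultralytics import YOLO

detector = YOLO("yolov10n.pt")                 # assumed COCO-pretrained weights
results = detector.predict("camera_frame.jpg",
                           classes=[0],        # keep only "person" detections
                           conf=0.5)
for box in results[0].boxes:
    print(box.xyxy, float(box.conf))
```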
Article
Full-text available
High-precision and robust indoor positioning systems have a broad range of applications in the area of mobile computing. Due to the advancement of image processing algorithms, the prevalence of ambient surveillance cameras shows promise for offering sub-meter accuracy localization services. The tracking performance in dynamic contexts is still unreliable for ambient camera-based methods, despite their general ability to pinpoint pedestrians in video frames at fine-grained levels. Conversely, ultra-wideband-based technology can continuously track pedestrians, but it is frequently susceptible to non-line-of-sight (NLOS) errors caused by the surrounding environment. We see a chance to combine these two most viable approaches in order to get beyond the aforementioned drawbacks and return to the pedestrian localization issue from a different angle. In this article, we propose UVtrack, a localization system based on UWB and ambient cameras that achieves centimeter accuracy and improved reliability. The key innovation of UVtrack is a well-designed particle filter which adopts UWB and vision results in the weight update of the particle set, and an adaptive distance variance weighted least squares method (DVLS) to improve UWB sub-system robustness. We deploy UVtrack on common smartphones and test its effectiveness in three different situations. The results demonstrated that UVtrack attains an outstanding localization accuracy of 7 cm.
... Dataset 1 was selected from the COCO 2017 (Lin et al., 2014) dataset with object categories related to autonomous driving, consisting of 10 categories, 35,784 images for training, and 2431 images for validation. Dataset 2 combines the original categories from the PASCAL VOC 2012 (Everingham et al., 2010) dataset, which include Person, Car, Train, Motorcycle, Bicycle, and Other, with 11,540 images for training and 2913 images for validation. ...
Article
Full-text available
Object detection is a critical component in the development of autonomous driving technology and has demonstrated significant growth potential. To address the limitations of current techniques, this paper presents an improved object detection method for autonomous driving based on a detection transformer (DETR). First, we introduce a multi-scale feature and location information extraction method, which solves the inadequacy of the model for multi-scale object localization and detection. In addition, we developed a transformer encoder based on the group axial attention mechanism. This allows for efficient attention range control in the horizontal and vertical directions while reducing computation, ultimately enhancing the inference speed. Furthermore, we propose a novel dynamic hyperparameter tuning training method based on Pareto efficiency, which coordinates the training state of the loss functions through dynamic weights, overcoming issues associated with manually setting fixed weights and enhancing model convergence speed and accuracy. Experimental results demonstrate that the proposed method surpasses others, with improvements of 3.3%, 4.5%, and 3% in average precision on the COCO, PASCAL VOC, and KITTI datasets, respectively, and an 84% increase in FPS.
... To ensure consistency and comparability, we employ a unified training regimen. The training datasets from GOT-10k [73], LaSOT [74], COCO [75], and TrackingNet [76] are utilized. In the model training process, the AdamW optimizer [77] is used, with weight decay set to 10 −4 . ...
Article
Full-text available
Driven by the rapid advancement of Unmanned Aerial Vehicle (UAV) technology, the field of UAV object tracking has witnessed significant progress. This study introduces an innovative single-stream UAV tracking architecture, dubbed NT-Track, which is dedicated to enhancing the efficiency and accuracy of real-time tracking tasks. Addressing the shortcomings of existing tracking systems in capturing temporal relationships between consecutive frames, NT-Track meticulously analyzes the positional changes in targets across frames and leverages the similarity of the surrounding areas to extract feature information. Furthermore, our method integrates spatial and temporal information seamlessly into a unified framework through the introduction of a temporal feature fusion technique, thereby bolstering the overall performance of the model. NT-Track also incorporates a spatial neighborhood feature extraction module, which focuses on identifying and extracting features within the neighborhood of the target in each frame, ensuring continuous focus on the target during inter-frame processing. By employing an improved Transformer backbone network, our approach effectively integrates spatio-temporal information, enhancing the accuracy and robustness of tracking. Our experimental results on several challenging benchmark datasets demonstrate that NT-Track surpasses existing lightweight and deep learning trackers in terms of precision and success rate. It is noteworthy that, on the VisDrone2018 benchmark, NT-Track achieved a precision rate of 90% for the first time, an accomplishment that not only showcases its exceptional performance in complex environments, but also confirms its potential and effectiveness in practical applications.
Article
Pre-trained vision models (PVMs) have become a dominant component due to their exceptional performance when fine-tuned for downstream tasks. However, the presence of backdoors within PVMs poses significant threats. Unfortunately, existing studies primarily focus on backdooring PVMs for the classification task, neglecting potential inherited backdoors in downstream tasks such as detection and segmentation. In this paper, we propose the Pre-trained Trojan attack, which embeds backdoors into a PVM, enabling attacks across various downstream vision tasks. We highlight the challenges posed by cross-task activation and shortcut connections in successful backdoor attacks. To achieve effective trigger activation in diverse tasks, we stylize the backdoor trigger patterns with class-specific textures, enhancing the recognition of task-irrelevant low-level features associated with the target class in the trigger pattern. Moreover, we address the issue of shortcut connections by introducing a context-free learning pipeline for poison training. In this approach, triggers without contextual backgrounds are directly utilized as training data, diverging from the conventional use of clean images. Consequently, we establish a direct shortcut from the trigger to the target class, mitigating the shortcut connection issue. We conducted extensive experiments to thoroughly validate the effectiveness of our attacks on downstream detection and segmentation tasks. Additionally, we showcase the potential of our approach in more practical scenarios, including large vision models and 3D object detection in autonomous driving. This paper aims to raise awareness of the potential threats associated with applying PVMs in practical scenarios. Our codes are available at https://github.com/Veee9/Pre-trained-Trojan.
Article
Full-text available
The modern society generates vast amounts of digital content, whose credibility plays a pivotal role in shaping public opinion and decision-making processes. The rapid development of social networks and generative technologies, such as deepfakes, significantly increases the risk of disinformation through image manipulation. This article aims to review methods for verifying images’ integrity, particularly through deep learning techniques, addressing both passive and active approaches. Their effectiveness in various scenarios has been analyzed, highlighting their advantages and limitations. This study reviews the scientific literature and research findings, focusing on techniques that detect image manipulations and localize areas of tampering, utilizing both statistical properties of images and embedded hidden watermarks. Passive methods, based on analyzing the image itself, are versatile and can be applied across a broad range of cases; however, their effectiveness depends on the complexity of the modifications and the characteristics of the image. Active methods, which involve embedding additional information into the image, offer precise detection and localization of changes but require complete control over creating and distributing visual materials. Both approaches have their applications depending on the context and available resources. In the future, a key challenge remains the development of methods resistant to advanced manipulations generated by diffusion models and further leveraging innovations in deep learning to protect the integrity of visual content.
Preprint
Images depicting complex, dynamic scenes are challenging to parse automatically, requiring both high-level comprehension of the overall situation and fine-grained identification of participating entities and their interactions. Current approaches use distinct methods tailored to sub-tasks such as Situation Recognition and detection of Human-Human and Human-Object Interactions. However, recent advances in image understanding have often leveraged web-scale vision-language (V&L) representations to obviate task-specific engineering. In this work, we propose a framework for dynamic scene understanding tasks by leveraging knowledge from modern, frozen V&L representations. By framing these tasks in a generic manner - as predicting and parsing structured text, or by directly concatenating representations to the input of existing models - we achieve state-of-the-art results while using a minimal number of trainable parameters relative to existing approaches. Moreover, our analysis of dynamic knowledge of these representations shows that recent, more powerful representations effectively encode dynamic scene semantics, making this approach newly possible.
Preprint
Full-text available
Diffusion models achieve superior performance in image generation tasks. However, they incur significant computation overheads due to their iterative structure. To address these overheads, we analyze this iterative structure and observe that adjacent time steps in diffusion models exhibit high value similarity, leading to narrower differences between consecutive time steps. We adapt these characteristics to a quantized diffusion model and reveal that the majority of these differences can be represented with reduced bit-width, and even zero. Based on our observations, we propose the Ditto algorithm, a difference processing algorithm that leverages temporal similarity with quantization to enhance the efficiency of diffusion models. By exploiting the narrower differences and the distributive property of layer operations, it performs full bit-width operations for the initial time step and processes subsequent steps with temporal differences. In addition, Ditto execution flow optimization is designed to mitigate the memory overhead of temporal difference processing, further boosting the efficiency of the Ditto algorithm. We also design the Ditto hardware, a specialized hardware accelerator, fully exploiting the dynamic characteristics of the proposed algorithm. As a result, the Ditto hardware achieves up to 1.5x speedup and 17.74% energy saving compared to other accelerators.
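The core arithmetic behind the temporal-difference idea can be checked numerically: for a linear layer, the current step's output equals the previous output plus the layer applied to the (narrow, cheaply quantizable) difference. Below is a small numpy sketch with illustrative shapes and an assumed 8-bit difference, not the Ditto implementation itself:

```python
# Numerical sketch of the distributive-property idea behind temporal
# difference processing. Shapes and bit-widths are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)
x_prev = rng.standard_normal(256).astype(np.float32)
x_curr = x_prev + 0.01 * rng.standard_normal(256).astype(np.float32)  # similar steps

y_prev = W @ x_prev                      # full-precision work at the first step
delta = x_curr - x_prev                  # narrow-range temporal difference
scale = np.max(np.abs(delta)) / 127.0
delta_q = np.round(delta / scale).astype(np.int8)   # quantize the difference to 8 bits
y_curr = y_prev + W @ (delta_q.astype(np.float32) * scale)

print(np.max(np.abs(y_curr - W @ x_curr)))          # small reconstruction error
```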
Article
Full-text available
Accurate and efficient object detection in UAV images is a challenging task due to the diversity of target scales and the massive number of small targets. This study investigates the enhancement in the detection head using sparse convolution, demonstrating its effectiveness in achieving an optimal balance between accuracy and efficiency. Nevertheless, the sparse convolution method encounters challenges related to the inadequate incorporation of global contextual information and exhibits network inflexibility attributable to its fixed mask ratios. To address the above issues, the MFFCESSC-SSD, a novel single-shot detector (SSD) with multi-scale feature fusion and context-enhanced spatial sparse convolution, is proposed in this paper. First, a global context-enhanced group normalization (CE-GN) layer is developed to address the issue of information loss resulting from the convolution process applied exclusively to the masked region. Subsequently, a dynamic masking strategy is designed to determine the optimal mask ratios, thereby ensuring compact foreground coverage that enhances both accuracy and efficiency. Experiments on two datasets (i.e., VisDrone and ARH2000; the latter dataset was created by the researchers) demonstrate that the MFFCESSC-SSD remarkably outperforms the performance of the SSD and numerous conventional object detection algorithms in terms of accuracy and efficiency.
Article
In today’s society, dissemination of information among individuals occurs very rapidly due to the widespread usage of social media platforms like Twitter (nowadays known as X). However, information may pose challenges to maintaining a healthy online environment because it often contains harmful content. This paper presents a novel approach to identify different categories of offensive posts such as hate speech, profanity, targeted insult, and derogatory commentary by analyzing multi-modal image and text data, collected from Twitter. We propose a comprehensive deep learning framework, “Value Mixed Cross Attention Transformer” (VMCA-Trans), that leverages a combination of computer vision and natural language processing methodologies to effectively classify the posts into four classes with binary labels. We have created an in-house dataset (OffenTweet) comprising Twitter posts having textual content, accompanied by images, to build the proposed model. The dataset is carefully annotated by several experts with offensive labels such as hate speech, profanity, targeted insult, and derogatory commentary. VMCA-Trans utilizes fine-tuned state-of-the-art transformer-based backbones such as ViT, BERT, RoBERTa, etc. The combined representation of image and text embeddings obtained by these fine-tuned transformer encoders is fed into a classifier to categorize the posts into offensive and non-offensive classes. To assess its effectiveness, we extensively evaluate the VMCA-Trans model using various performance metrics. The results indicate that the proposed multi-modal approach achieves superior performance compared to traditional unimodal methods.
Article
In recent years, object detection has significantly advanced by using deep learning, especially convolutional neural networks. Most of the existing methods have focused on detecting objects under favorable weather conditions and achieved impressive results. However, object detection in the presence of rain remains a crucial challenge owing to the visibility limitation. In this paper, we introduce an Amalgamating Knowledge Network (AK-Net) to deal with the problem of detecting objects hampered by rain. The proposed AK-Net obtains performance improvement by associating object detection with visibility enhancement, and it is composed of five subnetworks: rain streak removal (RSR) subnetwork, raindrop removal (RDR) subnetwork, foggy rain removal (FRR) subnetwork, feature transmission (FT) subnetwork, and object detection (OD) subnetwork. Our approach is flexible; it can adopt different object detection models to construct the OD subnetwork for the final inference of objects. The RSR, RDR, and FRR subnetworks are responsible for producing clean features from rain streak, raindrop, and foggy rain images, respectively, and offer them to the OD subnetwork through the FT subnetwork for efficient object prediction. Experimental results indicate that the mean average precision (mAP) achieved by our proposed AK-Net was up to 19.58% and 26.91% higher than those produced using competitive methods on published iRain and RID datasets, respectively, while preserving the fast-running time of the baseline detector.
Article
Full-text available
Insect pests strongly affect crop growth and value globally. Fast and precise pest detection and counting are crucial measures in the management and mitigation of pest infestations. In this area, deep learning technologies have come to represent the method with the most potential. However, for small-sized crop pests, recent deep-learning-based detection attempts have not accomplished accurate recognition and detection due to the challenges posed by feature extraction and positive and negative sample selection. Therefore, to overcome these limitations, we first designed a co-ordinate-attention-based feature pyramid network, termed CAFPN, to extract the salient visual features that distinguish small insects from each other. Subsequently, in the network training stage, a dynamic sample selection strategy using positive and negative weight functions, which considers both high classification scores and precise localization, was introduced. Finally, several experiments were conducted on our constructed large-scale crop pest datasets, the AgriPest 21 dataset and the IP102 dataset, achieving accuracy scores of 77.2% and 29.8% for mAP (mean average precision), demonstrating promising detection results when compared to other detectors.
Article
Generalized Few-Shot Segmentation (GFSS) aims to segment both base and novel classes in a query image, conditioning on richly annotated data of base classes and limited exemplars from novel classes. The learning of novel classes undoubtedly faces a disadvantage in this competition due to the highly unbalanced data, which skews the learned feature space towards the base classes. In this paper, we present an innovative idea termed as “learning from orthogonal space” to avoid the conflict in the process of learning novel classes. Specifically, we first utilize textual modal information from labels to provide more distinguishable initial prototypes for different categories, ensuring that the prototypes for base and novel classes have distinct initial separations. Then, a simple but effective Feature Separating Module (FSM) is introduced to enhance the model’s ability to differentiate between base and novel classes through learning the novel features from orthogonal space. In addition, we propose a Trigger-Promoting Framework (TPF) during the testing stage to further boost performance. The prediction results from the FSM serve as a multimodal prompt to leverage information residing in large models, such as CLIP and SAM, to enhance performance. Comprehensive experiments on two benchmarks demonstrate that our method achieves superior performance on novel classes without sacrificing accuracy on base classes. Notably, our Feature Separating with Trigger-Prompting Network (FS-TPNet) outperforms the current state-of-the-art method by 12.8% overall IoU on novel classes on PASCAL-5^i under the 1-shot scenario. Our codes will be available at https://github.com/returnZXJ/FS-TPNet.
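As a loose illustration of the "learning from orthogonal space" idea (not the paper's Feature Separating Module), one way to impose such a constraint is to project novel-class features onto the orthogonal complement of the base-class prototype subspace:

```python
# Illustrative sketch: remove the base-prototype-subspace component from
# novel-class features. Shapes are arbitrary placeholders.
import torch

def orthogonal_complement_projection(feats, base_prototypes):
    # Orthonormal basis of the span of the base prototypes (D x K).
    Q, _ = torch.linalg.qr(base_prototypes.t())
    # Remove the component of each feature lying in that span.
    return feats - (feats @ Q) @ Q.t()

base = torch.randn(15, 512)      # 15 base-class prototypes, 512-d features
novel_feats = torch.randn(8, 512)
sep = orthogonal_complement_projection(novel_feats, base)
# Near-zero inner products: the separated features lie outside the base subspace.
print(sep.matmul(base.t()).abs().max())
```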
Article
Teleoperated robots are attracting attention as a solution to the pressing labor shortage. To reduce the burden on the operators of teleoperated robots and improve manpower efficiency, research is underway to make these robots more autonomous. However, end-to-end imitation learning models that directly map camera images to actions are vulnerable to changes in image background and lighting conditions. To improve robustness against these changes, we modified the learning model to handle segmented images where only the arm and the object are preserved. The task success rate for the demonstration data and the environment with different backgrounds was 0.0% for the model with the raw image input and 66.0% for the proposed model with segmented image input, with the latter having achieved a significant improvement. However, the grasping force of this model was stronger than that during the demonstration. Accordingly, we added haptics information to the observation input of the model. Experimental results show that this can reduce the grasping force.
Patent
Full-text available
Iconic images for a given object or object category may be identified in a set of candidate images by using a learned probabilistic composition model to divide each candidate image into a most probable rectangular object region and a background region, ranking the candidate images according to the maximal composition score of each image, removing non-discriminative images from the candidate images, clustering highest-ranked candidate images to form clusters, wherein each cluster includes images having similar object regions according to a feature match score, selecting a representative image from each cluster as an iconic image of the object category, and causing display of the iconic image. The composition model may be a Naïve Bayes model that computes composition scores based on appearance cues such as hue, saturation, focus, and texture. Iconic images depict an object or category as a relatively large object centered on a clean or uncluttered contrasting background.
Article
Full-text available
Datasets for training object recognition systems are steadily growing in size. This paper investigates the question of whether existing detectors will continue to improve as data grows, or if models are close to saturating due to limited model complexity and the Bayes risk associated with the feature spaces in which they operate. We focus on the popular paradigm of scanning-window templates defined on oriented gradient features, trained with discriminative classifiers. We investigate the performance of mixtures of templates as a function of the number of templates (complexity) and the amount of training data. We find that additional data does help, but only with correct regularization and treatment of noisy examples or "outliers" in the training data. Surprisingly, the performance of problem domain-agnostic mixture models appears to saturate quickly (∼10 templates and ∼100 positive training examples per template). However, compositional mixtures (implemented via composed parts) give much better performance because they share parameters among templates, and can synthesize new templates not encountered during training. This suggests there is still room to improve performance with linear classifiers and the existing feature space by improved representations and learning algorithms.
Article
Full-text available
We present an integrated framework for using Convolutional Networks for classification, localization and detection. We show how a multiscale and sliding window approach can be efficiently implemented within a ConvNet. We also introduce a novel deep learning approach to localization by learning to predict object boundaries. Bounding boxes are then accumulated rather than suppressed in order to increase detection confidence. We show that different tasks can be learnt simultaneously using a single shared network. This integrated framework is the winner of the localization task of the ImageNet Large Scale Visual Recognition Challenge 2013 (ILSVRC2013), and produced near state-of-the-art results for the detection and classification tasks. Finally, we release a feature extractor from our best model called OverFeat.
Conference Paper
Full-text available
We study the problem of object classification when training and test classes are disjoint, i.e. no training examples of the target classes are available. This setup has hardly been studied in computer vision research, but it is the rule rather than the exception, because the world contains tens of thousands of different object classes, and image collections have been formed and annotated with suitable class labels for only a very few of them. In this paper, we tackle the problem by introducing attribute-based classification. It performs object detection based on a human-specified high-level description of the target objects instead of training images. The description consists of arbitrary semantic attributes, like shape, color or even geographic information. Because such properties transcend the specific learning task at hand, they can be pre-learned, e.g. from image datasets unrelated to the current task. Afterwards, new classes can be detected based on their attribute representation, without the need for a new training phase. In order to evaluate our method and to facilitate research in this area, we have assembled a new large-scale dataset, "Animals with Attributes", of over 30,000 animal images that match the 50 classes in Osherson's classic table of how strongly humans associate 85 semantic attributes with animal classes. Our experiments show that by using an attribute layer it is indeed possible to build a learning object detection system that does not require any training images of the target classes.
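As a concrete illustration of attribute-based classification, the toy sketch below ranks unseen classes by how well predicted attribute probabilities agree with each class's attribute description. The attribute list, class table, and probability values are invented for illustration and are not taken from the Animals with Attributes dataset.

```python
import numpy as np

# Hypothetical setup: 5 binary attributes, 3 unseen classes described only
# by which attributes they have (no training images of these classes).
attributes = ["furry", "striped", "hooves", "swims", "large"]   # attribute order
class_attribute_table = {
    "zebra":   np.array([1, 1, 1, 0, 1]),
    "dolphin": np.array([0, 0, 0, 1, 1]),
    "hamster": np.array([1, 0, 0, 0, 0]),
}

def zero_shot_predict(attribute_probs):
    """Rank unseen classes by how well predicted attribute probabilities
    match each class's attribute vector (a simplified attribute-based score)."""
    scores = {}
    for cls, a in class_attribute_table.items():
        # probability that each attribute prediction agrees with the class description
        agreement = a * attribute_probs + (1 - a) * (1 - attribute_probs)
        scores[cls] = float(np.prod(agreement))
    return max(scores, key=scores.get), scores

# Attribute probabilities that a bank of pre-learned attribute classifiers
# might output for a test image (values made up for illustration).
p = np.array([0.9, 0.8, 0.7, 0.1, 0.8])
print(zero_shot_predict(p))   # -> ('zebra', {...})
```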
Article
Full-text available
The quantitative evaluation of optical flow algorithms by Barron et al. (1994) led to significant advances in performance. The challenges for optical flow algorithms today go beyond the datasets and evaluation methods proposed in that paper. Instead, they center on problems associated with complex natural scenes, including nonrigid motion, real sensor noise, and motion discontinuities. We propose a new set of benchmarks and evaluation methods for the next generation of optical flow algorithms. To that end, we contribute four types of data to test different aspects of optical flow algorithms: (1) sequences with nonrigid motion where the ground-truth flow is determined by tracking hidden fluorescent texture, (2) realistic synthetic sequences, (3) high frame-rate video used to study interpolation error, and (4) modified stereo sequences of static scenes. In addition to the average angular error used by Barron et al., we compute the absolute flow endpoint error, measures for frame interpolation error, improved statistics, and results at motion discontinuities and in textureless regions. In October 2007, we published the performance of several well-known methods on a preliminary version of our data to establish the current state of the art. We also made the data freely available on the web at http://vision.middlebury.edu/flow/. Subsequently a number of researchers have uploaded their results to our website and published papers using the data. A significant improvement in performance has already been achieved. In this paper we analyze the results obtained to date and draw a large number of conclusions from them. Keywords: Optical flow, Survey, Algorithms, Database, Benchmarks, Evaluation, Metrics
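The two error measures mentioned above are simple enough to state in code. The sketch below computes average endpoint error and average angular error for dense flow fields stored as (H, W, 2) arrays; it follows the standard definitions rather than the benchmark's exact evaluation scripts.

```python
import numpy as np

def endpoint_error(flow_est, flow_gt):
    """Average endpoint error (EPE): Euclidean distance between estimated
    and ground-truth flow vectors, averaged over pixels.
    Flow arrays have shape (H, W, 2) with (u, v) components."""
    return np.mean(np.linalg.norm(flow_est - flow_gt, axis=-1))

def angular_error(flow_est, flow_gt):
    """Average angular error between the space-time vectors (u, v, 1),
    in the spirit of the Barron et al. evaluation, in degrees."""
    u, v = flow_est[..., 0], flow_est[..., 1]
    ug, vg = flow_gt[..., 0], flow_gt[..., 1]
    num = u * ug + v * vg + 1.0
    den = np.sqrt(u**2 + v**2 + 1.0) * np.sqrt(ug**2 + vg**2 + 1.0)
    return np.degrees(np.mean(np.arccos(np.clip(num / den, -1.0, 1.0))))

# Toy check: a constant 1-pixel horizontal error gives EPE = 1.0 and AE = 45 deg
gt = np.zeros((4, 4, 2)); est = gt.copy(); est[..., 0] += 1.0
print(endpoint_error(est, gt), angular_error(est, gt))
```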
Conference Paper
Full-text available
The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called "ImageNet", a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.
Article
Full-text available
This paper details a new approach for learning a discriminative model of object classes, incorporating texture, layout, and context information efficiently. The learned model is used for automatic visual understanding and semantic segmentation of photographs. Our discriminative model exploits texture-layout filters, novel features based on textons, which jointly model patterns of texture and their spatial layout. Unary classification and feature selection is achieved using shared boosting to give an efficient classifier which can be applied to a large number of classes. Accurate image segmentation is achieved by incorporating the unary classifier in a conditional random field, which (i) captures the spatial interactions between class labels of neighboring pixels, and (ii) improves the segmentation of specific object instances. Efficient training of the model on large datasets is achieved by exploiting both random feature selection and piecewise training methods.
Article
Full-text available
The Pascal Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset have become accepted as the benchmark for object detection. This paper describes the dataset and evaluation procedure. We review the state-of-the-art in evaluated methods for both classification and detection, analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and what the methods find easy or confuse. The paper concludes with lessons learnt in the three-year history of the challenge, and proposes directions for future improvement and extension.
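Since the VOC evaluation procedure centers on average precision, here is a minimal sketch of the 11-point interpolated AP used in the early challenges, applied to a made-up precision/recall curve. The official evaluation additionally specifies how detections are matched to ground truth; that step is not shown here.

```python
import numpy as np

def voc_ap_11pt(recall, precision):
    """11-point interpolated average precision: the mean, over recall
    thresholds {0, 0.1, ..., 1.0}, of the maximum precision achieved
    at recall >= threshold."""
    ap = 0.0
    for t in np.arange(0.0, 1.1, 0.1):
        mask = recall >= t
        p = np.max(precision[mask]) if mask.any() else 0.0
        ap += p / 11.0
    return ap

# Toy precision/recall curve for one class (values are illustrative)
recall = np.array([0.1, 0.2, 0.4, 0.6, 0.8])
precision = np.array([1.0, 0.9, 0.75, 0.6, 0.5])
print(round(voc_ap_11pt(recall, precision), 3))   # 0.6 for this toy curve
```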
Article
Full-text available
The GIST descriptor has recently received increasing attention in the context of scene recognition. In this paper we evaluate the search accuracy and complexity of the global GIST descriptor for two applications, for which a local description is usually preferred: same location/object recognition and copy detection. We identify the cases in which a global description can reasonably be used. The comparison is performed against a state-of-the-art bag-of-features representation. To evaluate the impact of GIST's spatial grid, we compare GIST with a bag-of-features restricted to the same spatial grid as in GIST. Finally, we propose an indexing strategy for global descriptors that optimizes the trade-off between memory usage and precision. Our scheme provides reasonable accuracy in some widespread application cases together with very high efficiency: in our experiments, querying an image database of 110 million images takes 0.18 seconds per image on a single machine. For common copyright attacks, this efficiency is obtained without noticeably sacrificing the search accuracy compared with state-of-the-art approaches.
Article
Full-text available
Caltech-UCSD Birds 200 (CUB-200) is a challenging image dataset annotated with 200 bird species. It was created to enable the study of subordinate categorization, which is not possible with other popular datasets that focus on basic level categories (such as PASCAL VOC, Caltech-101, etc). The images were downloaded from the website Flickr and filtered by workers on Amazon Mechanical Turk. Each image is annotated with a bounding box, a rough bird segmentation, and a set of attribute labels.
Conference Paper
We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
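The descriptor choices highlighted above map directly onto the parameters of off-the-shelf HOG implementations. The snippet below uses scikit-image's hog function on a bundled sample image; the parameter values echo the design choices discussed (fine orientation binning, relatively coarse spatial cells, overlapping contrast-normalized blocks) rather than reproducing the exact detector configuration.

```python
from skimage import data, color
from skimage.feature import hog

# Grayscale test image from scikit-image's bundled samples
image = color.rgb2gray(data.astronaut())

# HOG with settings in the spirit of the description above
features = hog(
    image,
    orientations=9,            # fine orientation binning
    pixels_per_cell=(8, 8),    # relatively coarse spatial binning
    cells_per_block=(2, 2),    # overlapping normalization blocks
    block_norm="L2-Hys",       # local contrast normalization
)
print(features.shape)          # one long descriptor for the whole window
```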
Conference Paper
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
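To make the region-based pipeline concrete, here is a schematic sketch of the recipe: warp each external region proposal to a fixed size, extract CNN features, and score them with a linear classifier. The proposal function is a placeholder, the classifier is untrained, and weights are left uninitialized so the snippet runs offline; it illustrates the data flow, not the authors' implementation.

```python
import torch
import torchvision
from torchvision import transforms

# Schematic region-based detection: proposals -> warp -> CNN features -> linear scores.
# weights=None keeps this runnable offline; the real pipeline uses ImageNet-pretrained weights.
cnn = torchvision.models.alexnet(weights=None).features.eval()
warp = transforms.Resize((227, 227))
classifier = torch.nn.Linear(256 * 6 * 6, 21)   # e.g. 20 VOC classes + background

def propose_regions(image):
    # Placeholder for a real proposal method such as selective search;
    # returns a few fixed boxes (x1, y1, x2, y2).
    return [(0, 0, 100, 100), (50, 50, 200, 200)]

@torch.no_grad()
def region_scores(image):
    scores = []
    for (x1, y1, x2, y2) in propose_regions(image):
        crop = warp(image[:, :, y1:y2, x1:x2])   # warp the proposal to a fixed size
        feat = cnn(crop).flatten(1)              # CNN features for the region
        scores.append(classifier(feat))          # per-class scores
    return torch.stack(scores)

image = torch.rand(1, 3, 300, 300)               # dummy RGB image
print(region_scores(image).shape)                # (num_proposals, 1, 21)
```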
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Article
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
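The architecture summarized above (five convolutional layers, max-pooling, three fully connected layers, non-saturating ReLU units, dropout, and a 1000-way output) can be written down compactly. The sketch below follows the widely used single-GPU torchvision-style variant, which differs in minor details from the original two-GPU model (for example, it omits local response normalization).

```python
import torch
from torch import nn

# Compact AlexNet-like network: 5 conv layers (some followed by max-pooling),
# 3 fully connected layers with dropout, and 1000-way output logits.
alexnet_like = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Flatten(),
    nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),              # logits for the 1000 ImageNet classes
)

logits = alexnet_like(torch.rand(1, 3, 224, 224))
print(logits.shape)                      # torch.Size([1, 1000])
```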
Article
We study strategies for scalable multi-label annotation, or for efficiently acquiring multiple labels from humans for a collection of items. We propose an algorithm that exploits correlation, hierarchy, and sparsity of the label distribution. A case study of labeling 200 objects using 20,000 images demonstrates the effectiveness of our approach. The algorithm results in up to 6x reduction in human computation time compared to the naive method of querying a human annotator for the presence of every object in every image.
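As a toy illustration of exploiting hierarchy and sparsity, the sketch below asks a group-level question first and only drills down into individual labels when the group answer is positive. The label hierarchy, the simulated worker, and the question wording are all invented; the paper's actual algorithm additionally models correlation and the expected value of each query.

```python
# Toy sketch: save human questions by querying a label group before its members.
label_hierarchy = {
    "animal": ["dog", "cat", "horse"],
    "vehicle": ["car", "bus", "bicycle"],
}

def annotate(ask):
    """`ask(question) -> bool` stands in for a crowd worker's answer."""
    positives, questions = [], 0
    for group, labels in label_hierarchy.items():
        questions += 1
        if ask(f"Is there any {group} in the image?"):   # cheap group question
            for label in labels:                          # only then ask per label
                questions += 1
                if ask(f"Is there a {label} in the image?"):
                    positives.append(label)
    return positives, questions

# Simulated worker answering from a ground-truth label set
truth = {"dog"}
def worker(question):
    if any(lbl in question for lbl in truth):
        return True
    return any(grp in question and truth & set(lbls)
               for grp, lbls in label_hierarchy.items())

print(annotate(worker))   # (['dog'], 5): 2 group questions + 3 animal questions
```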
Article
April 8, 2009. Groups at MIT and NYU have collected a dataset of millions of tiny colour images from the web. It is, in principle, an excellent dataset for unsupervised training of deep generative models, but previous researchers who have tried this have found it difficult to learn a good set of filters from the images. We show how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex. Using a novel parallelization algorithm to distribute the work among multiple machines connected on a network, we show how training such a model can be done in reasonable time. A second problematic aspect of the tiny images dataset is that there are no reliable class labels which makes it hard to use for object recognition experiments. We created two sets of reliable labels. The CIFAR-10 set has 6000 examples of each of 10 classes and the CIFAR-100 set has 600 examples of each of 100 non-overlapping classes. Using these labels, we show that object recognition is significantly improved.
Conference Paper
Entry level categories - the labels people will use to name an object - were originally defined and studied by psychologists in the 1980s. In this paper we study entry-level categories at a large scale and learn the first models for predicting entry-level categories for images. Our models combine visual recognition predictions with proxies for word "naturalness" mined from the enormous amounts of text on the web. We demonstrate the usefulness of our models for predicting nouns (entry-level words) associated with images by people. We also learn mappings between concepts predicted by existing visual recognition systems and entry-level concepts that could be useful for improving human-focused applications such as natural language image description or retrieval.
Conference Paper
This paper shows how to analyze the influences of object characteristics on detection performance and the frequency and impact of different types of false positives. In particular, we examine effects of occlusion, size, aspect ratio, visibility of parts, viewpoint, localization error, and confusion with semantically similar objects, other labeled objects, and background. We analyze two classes of detectors: the Vedaldi et al. multiple kernel learning detector and different versions of the Felzenszwalb et al. detector. Our study shows that sensitivity to size, localization error, and confusion with similar objects are the most impactful forms of error. Our analysis also reveals that many different kinds of improvement are necessary to achieve large gains, making more detailed analysis essential for the progress of recognition research. By making our software and annotations available, we make it effortless for future researchers to perform similar analysis.
Conference Paper
Crowd-sourcing approaches such as Amazon's Mechanical Turk (MTurk) make it possible to annotate or collect large amounts of linguistic data at a relatively low cost and high speed. However, MTurk offers only limited control over who is allowed to participate in a particular task. This is particularly problematic for tasks requiring free-form text entry. Unlike multiple-choice tasks there is no correct answer, and therefore control items for which the correct answer is known cannot be used. Furthermore, MTurk has no effective built-in mechanism to guarantee workers are proficient English writers. We describe our experience in creating corpora of images annotated with multiple one-sentence descriptions on MTurk and explore the effectiveness of different quality control strategies for collecting linguistic data using MTurk. We find that the use of a qualification test provides the greatest improvement in quality, whereas refining the annotations through follow-up tasks works rather poorly. Using our best setup, we construct two image corpora, totaling more than 40,000 descriptive captions for 9000 images.
Conference Paper
The growth of detection datasets and the multiple directions of object detection research provide both an unprecedented need and a great opportunity for a thorough evaluation of the current state of the field of categorical object detection. In this paper we strive to answer two key questions. First, where are we currently as a field: what have we done right, what still needs to be improved? Second, where should we be going in designing the next generation of object detectors? Inspired by the recent work of Hoiem et al. on the standard PASCAL VOC detection dataset, we perform a large-scale study on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) data. First, we quantitatively demonstrate that this dataset provides many of the same detection challenges as the PASCAL VOC. Due to its scale of 1000 object categories, ILSVRC also provides an excellent test bed for understanding the performance of detectors as a function of several key properties of the object classes. We conduct a series of analyses looking at how different detection methods perform on a number of image-level and object-class-level properties such as texture, color, deformation, and clutter. We learn important lessons of the current object detection methods and propose a number of insights for designing the next generation object detectors.
Conference Paper
We present an approach to interpret the major surfaces, objects, and support relations of an indoor scene from an RGBD image. Most existing work ignores physical interactions or is applied only to tidy rooms and hallways. Our goal is to parse typical, often messy, indoor scenes into floor, walls, supporting surfaces, and object regions, and to recover support relationships. One of our main interests is to better understand how 3D cues can best inform a structured 3D interpretation. We also contribute a novel integer programming formulation to infer physical support relations. We offer a new dataset of 1449 RGBD images, capturing 464 diverse indoor scenes, with detailed annotations. Our experiments demonstrate our ability to infer support relations in complex scenes and verify that our 3D scene cues and inferred support lead to better object segmentation.
Article
The appearance of surfaces in real-world scenes is determined by the materials, textures, and context in which the surfaces appear. However, the datasets we have for visualizing and modeling rich surface appearance in context, in applications such as home remodeling, are quite limited. To help address this need, we present OpenSurfaces, a rich, labeled database consisting of thousands of examples of surfaces segmented from consumer photographs of interiors, and annotated with material parameters (reflectance, material names), texture information (surface normals, rectified textures), and contextual information (scene category, and object names). Retrieving usable surface information from uncalibrated Internet photo collections is challenging. We use human annotations and present a new methodology for segmenting and annotating materials in Internet photo collections suitable for crowdsourcing (e.g., through Amazon's Mechanical Turk). Because of the noise and variability inherent in Internet photos and novice annotators, designing this annotation engine was a key challenge; we present a multi-stage set of annotation tasks with quality checks and validation. We demonstrate the use of this database in proof-of-concept applications including surface retexturing and material and image browsing, and discuss future uses. OpenSurfaces is a public resource available at http://opensurfaces.cs.cornell.edu/.
Conference Paper
In this paper we present the first large-scale scene attribute database. First, we perform crowd-sourced human studies to find a taxonomy of 102 discriminative attributes. Next, we build the “SUN attribute database” on top of the diverse SUN categorical database. Our attribute database spans more than 700 categories and 14,000 images and has potential for use in high-level scene understanding and fine-grained scene recognition. We use our dataset to train attribute classifiers and evaluate how well these relatively simple classifiers can recognize a variety of attributes related to materials, surface properties, lighting, functions and affordances, and spatial envelope properties.
Conference Paper
In this paper, we propose an approach to accurately localize detected objects. The goal is to predict which features pertain to the object and define the object extent with segmentation or bounding box. Our initial detector is a slight modification of the DPM detector by Felzenszwalb et al., which often reduces confusion with background and other objects but does not cover the full object. We then describe and evaluate several color models and edge cues for local predictions, and we propose two approaches for localization: learned graph cut segmentation and structural bounding box prediction. Our experiments on the PASCAL VOC 2010 dataset show that our approach leads to accurate pixel assignment and large improvement in bounding box overlap, sometimes leading to large overall improvement in detection accuracy.
Article
We develop and demonstrate automatic image description methods using a large captioned photo collection. One contribution is our technique for the automatic collection of this new dataset – performing a huge number of Flickr queries and then filtering the noisy results down to 1 million images with associated visually relevant captions. Such a collection allows us to approach the extremely challenging problem of description generation using relatively simple non-parametric methods and produces surprisingly effective results. We also develop methods incorporating many state of the art, but fairly noisy, estimates of image content to produce even more pleasing results. Finally we introduce a new objective performance measure for image captioning.
Article
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
Article
We propose to shift the goal of recognition from naming to describing. Doing so allows us not only to name familiar objects, but also: to report unusual aspects of a familiar object (“spotty dog”, not just “dog”); to say something about unfamiliar objects (“hairy and four-legged”, not just “unknown”); and to learn how to recognize new objects with few or no visual examples. Rather than focusing on identity assignment, we make inferring attributes the core problem of recognition. These attributes can be semantic (“spotty”) or discriminative (“dogs have it but sheep do not”). Learning attributes presents a major new challenge: generalization across object categories, not just across instances within a category. In this paper, we also introduce a novel feature selection method for learning attributes that generalize well across categories. We support our claims by thorough evaluation that provides insights into the limitations of the standard recognition paradigm of naming and demonstrates the new abilities provided by our attribute-based framework.
Article
We formulate a layered model for object detection and image segmentation. We describe a generative probabilistic model that composites the output of a bank of object detectors in order to define shape masks and explain the appearance, depth ordering, and labels of all pixels in an image. Notably, our system estimates both class labels and object instance labels. Building on previous benchmark criteria for object detection and image segmentation, we define a novel score that evaluates both class and instance segmentation. We evaluate our system on the PASCAL 2009 and 2010 segmentation challenge data sets and show good test results with state-of-the-art performance in several categories, including segmenting humans.
Article
We demonstrate that it is possible to automatically find representative example images of a specified object category. These canonical examples are perhaps the kind of images that one would show a child to teach them what, for example, a horse is – images with a large object clearly separated from the background. Given a large collection of images returned by a web search for an object category, our approach proceeds without any user-supplied training data for the category. First, images are ranked according to a category-independent composition model that predicts whether they contain a large clearly depicted object, and outputs an estimated location of that object. Then local features calculated on the proposed object regions are used to eliminate images not distinctive to the category and to cluster images by similarity of object appearance. We present results and a user evaluation on a variety of object categories, demonstrating the effectiveness of the approach.
Article
Stereo matching is one of the most active research areas in computer vision. While a large number of algorithms for stereo correspondence have been developed, relatively little work has been done on characterizing their performance. In this paper, we present a taxonomy of dense, two-frame stereo methods. Our taxonomy is designed to assess the different components and design decisions made in individual stereo algorithms. Using this taxonomy, we compare existing stereo methods and present experiments evaluating the performance of many different variants. In order to establish a common software platform and a collection of data sets for easy evaluation, we have designed a stand-alone, flexible C++ implementation that enables the evaluation of individual components and that can easily be extended to include new algorithms. We have also produced several new multi-frame stereo data sets with ground truth and are making both the code and data sets available on the Web. Finally, we include a comparative evaluation of a large set of today's best-performing stereo algorithms.
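A minimal local stereo method makes the taxonomy's stages tangible: a per-pixel matching cost, window-based aggregation, and winner-take-all disparity selection. The sketch below implements a basic SSD block matcher on a synthetic image pair; it is a baseline illustration under these assumptions, not one of the evaluated algorithms.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def ssd_disparity(left, right, max_disp=16, window=5):
    """Baseline local stereo: squared differences as the matching cost,
    box-filter aggregation over a square window, and winner-take-all
    disparity selection (no sub-pixel refinement)."""
    h, w = left.shape
    cost = np.full((max_disp, h, w), np.inf)
    for d in range(max_disp):
        diff = np.full((h, w), 1e6)                      # large cost where the shift leaves the image
        diff[:, d:] = (left[:, d:] - right[:, : w - d]) ** 2
        cost[d] = uniform_filter(diff, size=window)      # SSD aggregated over the window
    return np.argmin(cost, axis=0)                       # winner-take-all disparity map

left = np.random.rand(40, 60)
right = np.roll(left, -3, axis=1)                        # synthetic 3-pixel disparity
print(np.median(ssd_disparity(left, right)))             # ≈ 3 away from the image borders
```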
Conference Paper
Datasets are an integral part of contemporary object recognition research. They have been the chief reason for the considerable progress in the field, not just as source of large amounts of training data, but also as means of measuring and comparing performance of competing algorithms. At the same time, datasets have often been blamed for narrowing the focus of object recognition research, reducing it to a single benchmark performance number. Indeed, some datasets, that started out as data capture efforts aimed at representing the visual world, have become closed worlds unto themselves (e.g. the Corel world, the Caltech-101 world, the PASCAL VOC world). With the focus on beating the latest benchmark numbers on the latest dataset, have we perhaps lost sight of the original purpose? The goal of this paper is to take stock of the current state of recognition datasets. We present a comparison study using a set of popular datasets, evaluated based on a number of criteria including: relative data bias, cross-dataset generalization, effects of closed-world assumption, and sample value. The experimental results, some rather surprising, suggest directions that can improve dataset collection as well as algorithm evaluation protocols. But more broadly, the hope is to stimulate discussion in the community regarding this very important, but largely neglected issue.
Article
Current computational approaches to learning visual object categories require thousands of training images, are slow, cannot learn in an incremental manner and cannot incorporate prior information into the learning process. In addition, no algorithm presented in the literature has been tested on more than a handful of object categories. We present a method for learning object categories from just a few training images. It is quick and it uses prior information in a principled way. We test it on a dataset composed of images of objects belonging to 101 widely varied categories. Our proposed method is based on making use of prior information, assembled from (unrelated) object categories which were previously learnt. A generative probabilistic model is used, which represents the shape and appearance of a constellation of features belonging to the object. The parameters of the model are learnt incrementally in a Bayesian manner. Our incremental algorithm is compared experimentally to an earlier batch Bayesian algorithm, as well as to one based on maximum likelihood. The incremental and batch versions have comparable classification performance on small training sets, but incremental learning is significantly faster, making real-time learning feasible. Both Bayesian methods outperform maximum likelihood on small training sets.
Article
In this paper we present a comprehensive and critical survey of face detection algorithms. Face detection is a necessary first-step in face recognition systems, with the purpose of localizing and extracting the face region from the background. It also has several applications in areas such as content-based image retrieval, video coding, video conferencing, crowd surveillance, and intelligent human–computer interfaces. However, it was not until recently that the face detection problem received considerable attention among researchers. The human face is a dynamic object and has a high degree of variability in its appearance, which makes face detection a difficult problem in computer vision. A wide variety of techniques have been proposed, ranging from simple edge-based algorithms to composite high-level approaches utilizing advanced pattern recognition methods. The algorithms presented in this paper are classified as either feature-based or image-based and are discussed in terms of their technical approach and performance. Due to the lack of standardized tests, we do not provide a comprehensive comparative evaluation, but in cases where results are reported on common datasets, comparisons are presented. We also give a presentation of some proposed applications and possible application areas.
Conference Paper
This paper presents a quantitative comparison of several multi-view stereo reconstruction algorithms. Until now, the lack of suitable calibrated multi-view image datasets with known ground truth (3D shape models) has prevented such direct comparisons. In this paper, we first survey multi-view stereo algorithms and compare them qualitatively using a taxonomy that differentiates their key properties. We then describe our process for acquiring and calibrating multiview image datasets with high-accuracy ground truth and introduce our evaluation methodology. Finally, we present the results of our quantitative comparison of state-of-the-art multi-view stereo reconstruction algorithms on six benchmark datasets. The datasets, evaluation details, and instructions for submitting new models are available online at http://vision.middlebury.edu/mview.
Conference Paper
We present an approach for object recognition that combines detection and segmentation within an efficient hypothesize/test framework. Scanning-window template classifiers are the current state-of-the-art for many object classes such as faces, cars, and pedestrians. Such approaches, though quite successful, can be hindered by their lack of explicit encoding of object shape/structure - one might, for example, find faces in trees. We adopt the following strategy: we first use these systems as attention mechanisms, generating many possible object locations by tuning them for low missed-detections and high false-positives. At each hypothesized detection, we compute a local figure-ground segmentation using a window of slightly larger extent than that used by the classifier. This segmentation task is guided by top-down knowledge. We learn offline from training data those segmentations that are consistent with true positives. We then prune away those hypotheses with bad segmentations. We show this strategy leads to significant improvements (10-20%) over established approaches such as Viola-Jones and Dalal-Triggs on a variety of benchmark datasets including the PASCAL challenge, LabelMe, and the INRIA Person dataset.
Conference Paper
Scene categorization is a fundamental problem in computer vision. However, scene understanding research has been constrained by the limited scope of currently-used databases which do not capture the full variety of scene categories. Whereas standard databases for object categorization contain hundreds of different classes of objects, the largest available dataset of scene categories contains only 15 classes. In this paper we propose the extensive Scene UNderstanding (SUN) database that contains 899 categories and 130,519 images. We use 397 well-sampled categories to evaluate numerous state-of-the-art algorithms for scene recognition and establish new bounds of performance. We measure human scene classification performance on the SUN database and compare this with computational methods. Additionally, we study a finer-grained scene representation to detect scenes embedded inside of larger scenes.
Conference Paper
In this paper, we propose techniques to make use of two complementary bottom-up features, image edges and texture patches, to guide top-down object segmentation towards higher precision. We build upon the part-based poselet detector, which can predict masks for numerous parts of an object. For this purpose we extend poselets to 19 other categories apart from person. We non-rigidly align these part detections to potential object contours in the image, both to increase the precision of the predicted object mask and to sort out false positives. We spatially aggregate object information via a variational smoothing technique while ensuring that object regions do not overlap. Finally, we propose to refine the segmentation based on self-similarity defined on small image patches. We obtain competitive results on the challenging Pascal VOC benchmark. On four classes we achieve the best numbers to-date.
Conference Paper
The sliding window approach of detecting rigid objects (such as cars) is predicated on the belief that the object can be identified from the appearance in a small region around the object. Other types of objects of amorphous spatial extent (e.g., trees, sky), however, are more naturally classified based on texture or color. In this paper, we seek to combine recognition of these two types of objects into a system that leverages “context” toward improving detection. In particular, we cluster image regions based on their ability to serve as context for the detection of objects. Rather than providing an explicit training set with region labels, our method automatically groups regions based on both their appearance and their relationships to the detections in the image. We show that our things and stuff (TAS) context model produces meaningful clusters that are readily interpretable, and helps improve our detection ability over state-of-the-art detectors. We also present a method for learning the active set of relationships for a particular dataset. We present results on object detection in images from the PASCAL VOC 2005/2006 datasets and on the task of overhead car detection in satellite images, demonstrating significant improvements over state-of-the-art detectors.
Conference Paper
We address the classic problems of detection, segmentation and pose estimation of people in images with a novel definition of a part, a poselet. We postulate two criteria: (1) it should be easy to find a poselet given an input image, and (2) it should be easy to localize the 3D configuration of the person conditioned on the detection of a poselet. To permit this we have built a new dataset, H3D, of annotations of humans in 2D photographs with 3D joint information, inferred using anthropometric constraints. This enables us to implement a data-driven search procedure for finding poselets that are tightly clustered in both 3D joint configuration space as well as 2D image appearance. The algorithm discovers poselets that correspond to frontal and profile faces, pedestrians, head and shoulder views, among others. Each poselet provides examples for training a linear SVM classifier which can then be run over the image in a multiscale scanning mode. The outputs of these poselet detectors can be thought of as an intermediate layer of nodes, on top of which one can run a second layer of classification or regression. We show how this permits detection and localization of torsos or keypoints such as left shoulder, nose, etc. Experimental results show that we obtain state of the art performance on people detection in the PASCAL VOC 2007 challenge, among other datasets. We are making publicly available both the H3D dataset as well as the poselet parameters for use by other researchers.
Article
Visual object analysis researchers are increasingly experimenting with video, because it is expected that motion cues should help with detection, recognition, and other analysis tasks. This paper presents the Cambridge-driving Labeled Video Database (CamVid) as the first collection of videos with object class semantic labels, complete with metadata. The database provides ground truth labels that associate each pixel with one of 32 semantic classes. The database addresses the need for experimental data to quantitatively evaluate emerging algorithms. While most videos are filmed with fixed-position CCTV-style cameras, our data was captured from the perspective of a driving automobile. The driving scenario increases the number and heterogeneity of the observed object classes. Over 10 min of high quality 30 Hz footage is being provided, with corresponding semantically labeled images at 1 Hz and in part, 15 Hz. The CamVid Database offers four contributions that are relevant to object analysis researchers. First, the per-pixel semantic segmentation of over 700 images was specified manually, and was then inspected and confirmed by a second person for accuracy. Second, the high-quality and large resolution color video images in the database represent valuable extended duration digitized footage to those interested in driving scenarios or ego-motion. Third, we filmed calibration sequences for the camera color response and intrinsics, and computed a 3D camera pose for each frame in the sequences. Finally, in support of expanding this or other databases, we present custom-made labeling software for assisting users who wish to paint precise class-labels for other images and videos. We evaluate the relevance of the database by measuring the performance of an algorithm from each of three distinct domains: multi-class object recognition, pedestrian detection, and label propagation.
Article
Pedestrian detection is a key problem in computer vision, with several applications that have the potential to positively impact quality of life. In recent years, the number of approaches to detecting pedestrians in monocular images has grown steadily. However, multiple data sets and widely varying evaluation protocols are used, making direct comparisons difficult. To address these shortcomings, we perform an extensive evaluation of the state of the art in a unified framework. We make three primary contributions: 1) We put together a large, well-annotated, and realistic monocular pedestrian detection data set and study the statistics of the size, position, and occlusion patterns of pedestrians in urban scenes, 2) we propose a refined per-frame evaluation methodology that allows us to carry out probing and informative comparisons, including measuring performance in relation to scale and occlusion, and 3) we evaluate the performance of sixteen pretrained state-of-the-art detectors across six data sets. Our study allows us to assess the state of the art and provides a framework for gauging future efforts. Our experiments show that despite significant progress, performance still has much room for improvement. In particular, detection is disappointing at low resolutions and for partially occluded pedestrians.
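A common way to summarize the kind of per-frame evaluation described above is the log-average miss rate over a range of false positives per image (FPPI). The sketch below computes that summary for a toy miss-rate/FPPI curve; it follows the usual definition (geometric mean of the miss rate sampled at nine FPPI points between 0.01 and 1) and is not the benchmark's official evaluation code.

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Summary metric: sample the miss rate at FPPI values evenly spaced in
    log space between 1e-2 and 1e0 and return their geometric mean
    (lower is better). Assumes fppi is sorted in increasing order."""
    refs = np.logspace(-2.0, 0.0, num_points)
    samples = []
    for r in refs:
        valid = fppi <= r
        # If the detector never reaches this FPPI, fall back to the highest miss rate
        samples.append(miss_rate[valid].min() if valid.any() else miss_rate.max())
    samples = np.maximum(samples, 1e-10)        # avoid log(0)
    return float(np.exp(np.mean(np.log(samples))))

# Toy curve: miss rate falling as FPPI increases (values are illustrative)
fppi = np.array([0.01, 0.03, 0.1, 0.3, 1.0])
miss = np.array([0.80, 0.65, 0.50, 0.40, 0.30])
print(round(log_average_miss_rate(fppi, miss), 3))   # ≈ 0.53 for this toy curve
```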