Article

Histograms of Oriented Gradients for Human Detection

... Object detection is one of the basic tasks in the field of computer vision; its goal is to recognize and localize the objects in an image. Traditional object detection is mostly based on hand-designed feature extractors, such as HOG [5], DPM [6], etc. These feature extractors scan the entire image to find the areas with the largest feature response. ...
... Most traditional algorithms for vehicle detection are based on moving-object detection and manual feature extraction. LIU et al. [22] use HOG [5] to detect vehicles in images. Choudhur et al. [3] use Haar-like features [18] for vehicle detection. ...
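Several of the excerpts on this page cite HOG [5] as the canonical hand-designed feature. As a concrete reference point, here is a minimal extraction sketch using scikit-image; the parameter values (9 orientation bins, 8×8-pixel cells, L2-Hys block normalization) follow the commonly used Dalal–Triggs configuration and are illustrative rather than taken from any citing paper.

```python
from skimage import data
from skimage.feature import hog

# Dalal-Triggs-style HOG: gradient orientation histograms over small
# cells, contrast-normalized over overlapping blocks.
image = data.astronaut()[:, :, 0]        # any grayscale image works here
descriptor = hog(
    image,
    orientations=9,                      # 9 unsigned orientation bins
    pixels_per_cell=(8, 8),
    cells_per_block=(2, 2),
    block_norm='L2-Hys',                 # normalization scheme from the paper
)
print(descriptor.shape)                  # one flat feature vector
```

In a sliding-window detector, this descriptor would be computed for every candidate window and scored by a linear SVM, keeping the windows with the largest responses.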
Article
Full-text available
Engineering vehicle detection is a key issue in raw material warehouse scenes. Through engineering vehicle detection, the working conditions of engineering vehicles in the raw material warehouse can be intelligently managed to prevent large-scale smoke pollution and the danger of smoke and dust. In this paper, we propose an intelligent method based on the YOLOv4-Tiny framework for locating and identifying engineering vehicles. In our detection task, the monitoring scenes are complex, contain a lot of interference, and cover a large area. To address these challenges, we introduce the Split-attention module into the network, which adaptively extracts important image information and enlarges the receptive field of the detector. In addition, we introduce the Dynamic ReLU function, which allows the network to adaptively learn more suitable activation parameters from the input. We also collect a large number of images from front-end cameras to create a self-built dataset of engineering vehicles. We test our method on the COCO dataset and the self-built engineering vehicle dataset. Experimental results show that the proposed method detects engineering vehicles with higher accuracy and faster speed, and can be used for engineering vehicle detection in raw material storage warehouse scenes.
... DPM decomposes the object into a collection of parts, following the idea of part-based image models introduced in the 1970s, enforces a set of geometric constraints among them, and treats the potential object center as a latent variable. DPM excelled at object detection tasks (localizing objects with bounding boxes), beating template matching and the other object detection methods popular at the time. The Histogram of Oriented Gradients (HOG) [54] feature was used to generate a corresponding "filter" for each object; the HOG filter records the edge and contour information of the object and is used to filter at various positions in different pictures. ...
Thesis
Imaging flow cytometry is a high-throughput tool widely used for bioparticle analysis in various applications. Nevertheless, the vast number of images generated by imaging flow cytometry poses a great challenge for data analysis. Although various learning algorithms have been optimized to achieve high prediction accuracies, they overlooked the trade-off between speed and hardware requirements, a major hurdle for mass deployment of these learning algorithms in commercial devices for high-throughput bioparticle analysis due to their high cost and power consumption. In this thesis, we developed an efficient neural network named MCellNet for rapid, accurate and high-throughput label-free bioparticle detection using imaging flow cytometry. MCellNet achieved a classification accuracy over 99.6% and a processing speed of 346+ images per second on embedded platforms, outperforming MobileNetV2 (251 frames per second) with similar classification accuracy. In addition, deep metric learning for rare bioparticle detection was also studied. Furthermore, a machine-learning-based pipeline was established for high-accuracy bioparticle sizing. The pipeline consists of an image segmentation module for measuring the pixel size of the bioparticle and a machine learning model for accurate pixel-to-size conversion. The sizing algorithm showed significantly more accurate sizing capability and promises great potential for a wide range of bioparticle sizing applications. The proposed methods could also be applied to other high-throughput, real-time bioparticle analyses for biomedical diagnosis and environmental monitoring.
... Both MOSSE [4] and CSK [19], which introduced the circulant matrix and kernel functions, adopted simple grayscale features, which are easily disturbed by the external environment and lead to inaccurate tracking. To improve tracker performance, many methods based on HOG [11] have been proposed. HOG is the most commonly used feature and is robust to motion blur and illumination variation. ...
Article
Full-text available
Object tracking in videos has been a hot research topic for decades. Many approaches have been applied to improve visual tracking, a challenging task in computer vision. Compared with other state-of-the-art methods, correlation filters have achieved more significant performance in visual object tracking. However, they are not very flexible at robust scale estimation. In this paper, we improve tracking performance with high discrimination power and explore an energy-efficient approach to design a simple, superior tracker. First, instead of a single feature extraction, we utilize multi-feature channels from the color space and convolutional layers, respectively, and establish a corresponding weighted formulation to fuse the multiple features. Through optimization, the tracker can effectively obtain the latest position estimate of the target object. Furthermore, a scale-space correlation filter is investigated in a tracking-by-detection structure to distinguish the scale variation of the target object according to the updated position estimate. Additionally, we employ a fusion approach to merge the multi-channel response maps into an optimal tracking result, which ensures that our model supplies sufficient tracking information. Compared with existing tracking approaches, we reduce the computational complexity. On the OTB dataset, our tracker significantly improves the baseline, with a gain of 3.4% in the experimental evaluation. Both quantitative and qualitative evaluations on multiple benchmark sequences demonstrate that the proposed algorithm outperforms state-of-the-art approaches.
... See Local Feature Detection and Extraction (Computer Vision Toolbox) for more information about suitable techniques. In this illustration, rescaled pixel intensities are used as the predictor variables [18]. ...
Article
Full-text available
Abstract. Nanotube pictures offer a new way for bioactivation of the dental titanium implant surface. An important aspect of determining content consistency is the measurement of selected criteria. The first step is to segment and detect the pictures. The classification of objects is an important point in separating objects in the following step. Diverse classification techniques could be used, one of which is the neural network. The paper introduces a method for classifying artefacts based on the competitive application of a neural network. The first part explains the core concepts of the neural network. In the following segment, a picture is evaluated using the chosen classification process.
... These approaches are based on extracting discriminative and efficient features for general object detection, like Adaboost's face detector [21], the histogram of oriented gradients (HOG) [22] or covariance features [23] for pedestrian detection. Additionally, the deformable part model (DPM) [24] has been proposed, accelerating a coarse-to-fine search over possible locations. ...
Article
Full-text available
Small object detection techniques have been developed for decades, but one of the key remaining open challenges is detecting tiny objects in wild or natural scenes. While recent work on deep learning techniques has shown a promising direction for common object detection in the wild, its accuracy and robustness on tiny object detection in the wild are still unsatisfactory. In this paper, we target the problem of tiny pest detection in the wild and propose a new, effective deep learning approach. It builds a globally activated feature pyramid network on a convolutional neural network backbone for detecting tiny pests across a large range of scales, over both positions and pyramid levels. The network can retrieve depth and spatial intensity information over different levels of the feature pyramid, making variances or changes of spatially or depth-sensitive features in tiny pest images more visible. Besides, a hard example enhancement strategy is also proposed to implement fast and efficient training in this approach. The approach is evaluated on our newly built large-scale wild tiny pest dataset containing 27.8K images with 145.6K manually labelled pest objects. The results show that our approach performs well on pest detection with over 71% mAP, outperforming other state-of-the-art object detection methods.
... Early object detection usually integrated handcrafted features [34], [35], [36] and machine learning approaches [37], [38] to recognize the objects of interest. Methods following this sophisticated philosophy perform catastrophically poorly on small objects due to their limited capability to handle scale variation. ...
Preprint
With the rise of deep convolutional neural networks, object detection has achieved prominent advances in past years. However, such prosperity cannot camouflage the unsatisfactory situation of Small Object Detection (SOD), one of the notoriously challenging tasks in computer vision, owing to the poor visual appearance and noisy representation caused by the intrinsic structure of small targets. In addition, the lack of a large-scale dataset for benchmarking small object detection methods remains a bottleneck. In this paper, we first conduct a thorough review of small object detection. Then, to catalyze the development of SOD, we construct two large-scale Small Object Detection dAtasets (SODA), SODA-D and SODA-A, which focus on the Driving and Aerial scenarios respectively. SODA-D includes 24704 high-quality traffic images and 277596 instances of 9 categories. For SODA-A, we harvest 2510 high-resolution aerial images and annotate 800203 instances over 9 classes. To our knowledge, the proposed datasets are the first attempt at large-scale benchmarks with a vast collection of exhaustively annotated instances tailored for multi-category SOD. Finally, we evaluate the performance of mainstream methods on SODA. We expect the released benchmarks to facilitate the development of SOD and spawn more breakthroughs in this field. Datasets and codes will be available soon at: \url{https://shaunyuan22.github.io/SODA}.
... To learn more semantic features, MaskFeat [39] introduces low-level local features (HOG [12]) as the reconstruction target, while CIM [17] opts for a more complex input. ...
Preprint
Full-text available
Masked image modeling (MIM) has achieved promising results on various vision tasks. However, the limited discriminability of the learned representations shows there is still plenty of room for building a stronger vision learner. Towards this goal, we propose Contrastive Masked Autoencoders (CMAE), a new self-supervised pre-training method for learning more comprehensive and capable vision representations. By elaborately unifying contrastive learning (CL) and MIM through novel designs, CMAE leverages their respective advantages and learns representations with both strong instance discriminability and local perceptibility. Specifically, CMAE consists of two branches where the online branch is an asymmetric encoder-decoder and the target branch is a momentum-updated encoder. During training, the online encoder reconstructs original images from latent representations of masked images to learn holistic features. The target encoder, fed with the full images, enhances the feature discriminability via contrastive learning with its online counterpart. To make CL compatible with MIM, CMAE introduces two new components, i.e. pixel shift for generating plausible positive views and a feature decoder for complementing the features of contrastive pairs. Thanks to these novel designs, CMAE effectively improves the representation quality and transfer performance over its MIM counterpart. CMAE achieves state-of-the-art performance on highly competitive benchmarks of image classification, semantic segmentation and object detection. Notably, CMAE-Base achieves $85.3\%$ top-1 accuracy on ImageNet and $52.5\%$ mIoU on ADE20k, surpassing previous best results by $0.7\%$ and $1.8\%$ respectively. Codes will be made publicly available.
... The speed of MOSSE is quite fast because computations are processed in the frequency domain. DCF/KCF [15] utilizes a circulant matrix of the area around the object to collect positive and negative samples, converts the matrix operations into Hadamard products of vectors to decrease the computation, and applies HOG features [16] to obtain a more stable representation. ...
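The speed argument in this excerpt (correlation becomes an elementwise, i.e. Hadamard, product in the Fourier domain) can be made concrete with a simplified, single-channel MOSSE-style sketch; KCF's kernelized, multi-channel HOG variant builds on the same identity. The regularization weight `lam` and the Gaussian-shaped `target` are assumptions of this sketch, not values from the cited papers.

```python
import numpy as np

def train_filter(patch, target, lam=1e-2):
    # Closed-form ridge regression in the Fourier domain:
    # H = conj(F) * G / (conj(F) * F + lam), all products elementwise.
    F = np.fft.fft2(patch)
    G = np.fft.fft2(target)              # desired (e.g. Gaussian) response map
    return np.conj(F) * G / (np.conj(F) * F + lam)

def respond(H, patch):
    # Tracking step: one FFT, one Hadamard product, one inverse FFT.
    response = np.real(np.fft.ifft2(H * np.fft.fft2(patch)))
    return np.unravel_index(response.argmax(), response.shape)  # peak location
```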
Article
Full-text available
Existing thermal infrared (TIR) trackers based on correlation filters cannot adapt to the abrupt scale variation of nonrigid objects. This deficiency could even lead to tracking failure. To address this issue, we propose a TIR tracker, called ECO_LS, which improves the performance of efficient convolution operators (ECO) via the level set method. We first utilize the level set to segment the local region estimated by the ECO tracker to gain a more accurate size of the bounding box when the object changes its scale suddenly. Then, to accelerate the convergence speed of the level set contour, we leverage its historical information and continuously encode it to effectively decrease the number of iterations. In addition, our variant, ECOHG_LS, also achieves better performance via concatenating histogram of oriented gradient (HOG) and gray features to represent the object. Furthermore, experimental results on three infrared object tracking benchmarks show that the proposed approach performs better than other competing trackers. ECO_LS improves the EAO by 20.97% and 30.59% over the baseline ECO on VOT-TIR2016 and VOT-TIR2015, respectively.
... The Caltech101-7 and Caltech101-20 datasets, with sizes of 1474 and 2386, are generated from subsets of Caltech101 containing 7 and 20 categories, respectively. Both of them have six views consisting of the Gabor feature (48-D), the wavelet-moment feature [178] (40-D), the Centrist feature [179] (254-D), the HOG feature [180] (1984-D), the GIST feature [181] (512-D), and the LBP feature (928-D). The Caltech101 [127] dataset is composed of four views, whose dimensions are, respectively, 2048, 4800, 3540, and 1240. ...
Article
Full-text available
Multi-view clustering (MVC) has attracted more and more attention in recent years by making full use of complementary and consensus information between multiple views to cluster objects into different partitions. Although there are two existing surveys of MVC, neither jointly takes the recently popular deep learning-based methods into consideration. Therefore, in this paper, we conduct a comprehensive survey of MVC from the perspective of representation learning. It covers a large number of multi-view clustering methods, including the deep learning-based models, and provides a novel taxonomy of the MVC algorithms. Furthermore, the representation learning-based MVC methods can be divided into two main categories, i.e., shallow representation learning-based MVC and deep representation learning-based MVC, where the deep learning-based models are capable of handling more complex data structures and show better expressiveness. In the shallow category, according to the means of representation learning, we further split the methods into two groups, i.e., multi-view graph clustering and multi-view subspace clustering. To be more comprehensive, basic research materials on MVC are provided for readers, including introductions to the commonly used multi-view datasets with download links and an open source code library. In the end, some open problems are pointed out for further investigation and development.
... For studying masked prediction, [23] follows the same two-stage approach as BEiT and investigates various target tokenizers. Interestingly, it is found that handcrafted HOG features [47] achieve competitive performance, suggesting that a target tokenizer generated by a dVAE might be unnecessary. ...
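As a rough illustration of HOG serving as a reconstruction target, the sketch below computes one HOG vector per image patch with scikit-image; the cell-to-patch grouping follows the spirit of that approach, but the function name and parameter values are our assumptions.

```python
import numpy as np
from skimage.feature import hog

def hog_targets(image, patch=16, cell=8, orientations=9):
    """One HOG vector per (patch x patch) region, usable as the regression
    target for masked patches; image dims are assumed multiples of `patch`."""
    h = hog(image, orientations=orientations,
            pixels_per_cell=(cell, cell), cells_per_block=(1, 1),
            feature_vector=False)            # (H/cell, W/cell, 1, 1, bins)
    h = h.squeeze(axis=(2, 3))
    c = patch // cell                        # cells per patch side
    hc, wc, o = h.shape
    t = h.reshape(hc // c, c, wc // c, c, o).transpose(0, 2, 1, 3, 4)
    return t.reshape(hc // c, wc // c, c * c * o)
```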
Preprint
Full-text available
Masked autoencoders are scalable vision learners, as the title of MAE \cite{he2022masked} suggests, implying that self-supervised learning (SSL) in vision might follow a similar trajectory as in NLP. Specifically, generative pretext tasks with masked prediction (e.g., BERT) have become a de facto standard SSL practice in NLP. By contrast, early attempts at generative methods in vision were buried by their discriminative counterparts (like contrastive learning); however, the success of masked image modeling has revived the masked autoencoder (often termed a denoising autoencoder in the past). As a milestone bridging the gap with BERT in NLP, the masked autoencoder has attracted unprecedented attention for SSL in vision and beyond. This work conducts a comprehensive survey of masked autoencoders to shed insight on a promising direction of SSL. As the first to review SSL with masked autoencoders, this work focuses on its application in vision by discussing its historical developments, recent progress, and implications for diverse applications.
... However, the performance of histogram differencing degrades in the presence of illumination variation. The gradient orientation feature is invariant to photometric and geometric changes [9]. Motivated by the histogram-differencing approach to AT detection and by the gradient orientation feature, we developed a joint histogram of gradient magnitude and orientation (JHMOG); the similarity between two consecutive frames is measured based on the sum of differences of the JHMOGs of the two frames. ...
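A hedged sketch of the joint-histogram idea in this excerpt: one 2-D histogram over gradient magnitude and orientation per frame, with consecutive frames compared by summed absolute difference. The bin counts and per-frame normalization here are our assumptions, not the paper's settings.

```python
import numpy as np

def jhmog(frame, mag_bins=8, ori_bins=18):
    # Joint histogram of gradient magnitude and (unsigned) orientation.
    gy, gx = np.gradient(frame.astype(float))
    mag = np.hypot(gx, gy)
    ori = np.mod(np.arctan2(gy, gx), np.pi)
    hist, _, _ = np.histogram2d(
        mag.ravel(), ori.ravel(), bins=(mag_bins, ori_bins),
        range=((0.0, float(mag.max()) + 1e-9), (0.0, np.pi)))
    return hist / max(hist.sum(), 1.0)   # normalize so frames are comparable

def dissimilarity(f1, f2):
    # Large values suggest an abrupt transition (AT) between the frames.
    return np.abs(jhmog(f1) - jhmog(f2)).sum()
```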
Article
Full-text available
An efficient video shot boundary detection is highly desirable for subsequent semantic video content analysis and retrieval applications. The major challenge of the shot boundary detection problem is an appropriate choice of features to handle the illumination variation and motion artifacts of frames while finding the boundary in a shot. In this paper, to improve the efficacy of shot boundary detection in the presence of the aforementioned challenges, the strength of the gradient feature is explored to develop a dual-detection framework for automatic shot boundary detection. In the first phase, abrupt transition (AT) detection is addressed in the presence of illumination variation and motion in the frames of a shot by generating a combined feature through a joint histogram of the gradient magnitude and gradient orientation features of each frame. In the second phase, gradual transition (GT) detection is applied only to the frames between two AT frames satisfying specific frame-distance criteria. Handling both AT and GT in the proposed simple gradient-based framework is the uniqueness of this work. Moreover, the proposed method is fully ubiquitous, independent of the video content and free from any training process. Exhaustive simulations are carried out on different databases to validate the proposed approach. The performance of the proposed feature-based shot boundary framework, in terms of average F1 measure, is 94% for AT detection, 84% for GT detection and 90.01% for overall detection.
... For handcrafted-feature-based models, Viola et al. proposed a real-time human face detection method based on the AdaBoost algorithm and a "cascade" of features [14]. After that, Dalal et al. first applied the histogram of oriented gradients (HOG) to object detection in [15], which achieves good performance for person detection. Cheng et al. proposed a rapid object detection method using binarized normed gradients (BING) features [13]. ...
Article
Full-text available
Convolutional neural networks (CNNs) have been widely applied in the image quality assessment (IQA) field, but the size of the IQA databases severely limits the performance of the CNN-based IQA models. The most popular method to extend the size of database in previous works is to resize the images into patches. However, human visual system (HVS) can only perceive the qualities of objects in an image rather than the qualities of patches in it. Motivated by this fact, we propose a CNN-based algorithm for no-reference image quality assessment (NR-IQA) based on object detection. The network has three parts: an object detector, an image quality prediction network, and a self-correction measurement (SCM) network. First, we detect objects from input image by the object detector. Second, a ResNet-18 network is applied to extract features of the input image and a fully connected (FC) layer is followed to estimate image quality. Third, another ResNet-18 network is used to extract features of both the images and its detected objects, where the features of the objects are concatenated to the features of the image. Then, another FC layer is followed to compute the correction value of each object. Finally, the predicted image quality is amended by the SCM values. Experimental results demonstrate that the proposed NR-IQA model has state-of-the-art performance. In addition, cross-database evaluation indicates the great generalization ability of the proposed model.
... Minkyu Cheon et al. [135] have proposed a method using a knowledge-based hypothesis generation (HG) process that uses shadow regions to extract hypotheses. For hypothesis verification (HV), the system utilizes the histogram of oriented gradients (HOG) [136,137] and HOG symmetry vectors [138] to represent the feature vectors, and Total Error Rate Minimization Using Reduced Model (TER-RM) [139] for classification. A new strategy for vehicle detection has been developed by Daniel Alonso et al. [140], based on the classification of multidimensional probability metrics. ...
Research
Full-text available
Smart traffic and information systems require the collection of traffic data from respective sensors for the regulation of traffic. In this regard, surveillance cameras have been installed for the monitoring and control of traffic in the last few years. Several studies have been carried out on video surveillance technologies using image processing techniques for traffic management. Video processing of traffic data obtained through surveillance cameras is one application for advance warning or data extraction for real-time analysis of vehicles. This paper presents a detailed review of the literature on vehicle detection and classification techniques and also discusses the open challenges to be addressed in this area of research. It also reviews the various vehicle datasets used for evaluating the proposed techniques in these studies.
Article
Traffic sign detection is a challenging task. Although existing deep learning techniques have made great progress in detecting traffic signs, there are still many unsolved challenges. We propose a novel traffic sign detection network named ReYOLO that learns rich contextual information and senses scale variations to efficiently detect small and ambiguous traffic signs in the wild. Specifically, we first replace the conventional convolutional block with modules that are built by structural reparameterization methods and are embedded into bigger structures, thus decoupling the training structures and the inference structures using parameter transformation, and allowing the model to learn more effective features. We then design a novel weighting mechanism which can be embedded into a feature pyramid to exploit foreground features at different scales to narrow the semantic gap between multiple scales. To fully evaluate the proposed method, we conduct experiments on a traditional traffic sign dataset GTSDB as well as two new traffic sign datasets TT100K and CCTSDB2021, achieving 97.2%, 68.3% and 83.9% mAP (Mean Average Precision) for the three-class detection challenge in these three datasets.
Article
This paper is about recognizing multiple-person actions occurring in videos, including individual actions, interactions, and group activities. In an environment, multiple people perform group actions, such as walking in groups and talking by facing each other. The model is developed by retrieving individual person actions from video sequences and representing interactive contextual features among multiple people. The novelty of the proposed framework is the development of interactive action context (IAC) descriptors and the classification of group activities using machine learning. Each individual person's and other nearby people's relative action scores are encoded by IAC in the video frame. Individual person action descriptors are important clues for recognizing multiple-person activity by developing interaction context. An action retrieval technique based on KNN was formulated for individual action classification scores. This model also introduces a Fully Connected Conditional Random Field (FCCRF) to learn interaction context information among multiple people. FCCRF regularizes activity categorization through the spatial-temporal model. This paper also presents threshold processing to improve the performance of the context descriptors. The experimental results are compared to state-of-the-art approaches and demonstrate improved performance for group activity recognition.
Article
Recently, contrastive learning has gained increasing attention as a research topic for image-clustering tasks. However, most contrastive learning-based clustering models focus only on the similarity of embedded features or divergence of cluster assignments, without considering the semantic distribution of instances, undermining the performance of clustering. Therefore, an improved deep clustering model based on semantic consistency (DCSC) was proposed in this study, motivated by the assumption that the semantic probability distribution of various augmentations of the same instance should be similar and that of different instances should be orthogonal. The DCSC fully exploits instance-level differentiation, cluster-level discrimination, and semantic consistency of instances to design the objective function. Compared with existing contrastive learning-based clustering models, the proposed model is more cluster-sensitive to differentiate semantic concepts owing to the incorporation of cluster structure discovering loss. Extensive experimental results on six benchmark datasets illustrate that the proposed DCSC achieves superior performance compared to the state-of-the-art clustering models, with an improved accuracy of 9.3% for CIFAR-100 and 22.1% for tiny-ImageNet. The visualization results show that the DCSC produces geometrically well-separated cluster embeddings defined by the Euclidean distance, verifying the effectiveness of the proposed DCSC.
Article
Tomato leaf infections are a common threat to long-term tomato production that affects many farmers worldwide. Early detection and treatment of tomato leaf diseases are critical for promoting healthy tomato plant growth and ensuring ample supply and food security for the world's geometrically growing population. The detection of plant leaf disease using computer-assisted technologies is prevalent these days. In this work, 1610 tomato leaf images of different classes from the PlantVillage standard repository are used for object localization. An effective Deep Learning (DL) modified Mask Region Convolutional Neural Network (Mask R-CNN) is proposed for the autonomous segmentation and detection of tomato plant leaf disease. Intending to conserve memory space and computational expense, the suggested model adds a light-head Region Convolutional Neural Network (R-CNN). By varying the proportions of anchors in the RPN network and changing the feature extraction topology, the detection accuracy and metric performance are improved. The proposed technique is compared to existing state-of-the-art models to check whether it is viable and robust. The proposed model achieved a Mean Average Precision (mAP), F1-score, and accuracy of 0.88, 0.912, and 0.98, respectively. Furthermore, as the model's capacity increases with some parameters, the detection time for lesion detection is reduced to half that of the existing models.
Article
Detection and identification of macroplastic debris in aquatic environments are crucial to understanding and countering the growing emergence of macroplastics and current developments in their distribution and deposition. In this context, close-range remote sensing approaches revealing the spatial and spectral properties of macroplastics are very beneficial. To date, field surveys and visual census approaches are the broadly acknowledged methods for acquiring information, but since 2018, techniques based on remote sensing and artificial intelligence have been on the rise. Despite their proven efficiency, speed and wide applicability, there are still obstacles to overcome, especially regarding the availability and accessibility of data. Thus, our review paper looks into state-of-the-art research on the visual recognition and identification of different sorts of macroplastics. The focus is on both data acquisition techniques and evaluation methods, including Machine Learning and Deep Learning, but resulting products and published data are also taken into account. Our aim is to provide a critical overview and outlook at a time when this research direction is thriving fast. This study shows that most Machine Learning and Deep Learning approaches are still in their infancy regarding accuracy and detail when compared to visual monitoring, even though their results look very promising.
Article
Kinship verification refers to comparing similarities between two different individuals through their facial images. In this context, feature descriptors play a crucial role, yet few feature descriptors exist in the literature to extract kin features from facial images. In this paper, we propose a binary cross-coupled discriminant analysis (BC2DA) based feature descriptor which is able to extract effective kin features from input facial image pairs. This method reduces the discrimination between kin pairs at the feature extraction stage itself. BC2DA converts original kin image pairs to encoded image pairs to reduce the discrimination between them. To make better use of tri-subject kin relations, we further propose multi cross-coupled discriminant analysis (MC2DA). This method reduces the discrimination between the child's and both parents' images at the feature extraction stage. Extensive experiments were conducted on six kinship datasets, namely KinfaceW-I/II, Cornell, FIW, TSKinface and UBKinface, to show the efficacy of the proposed algorithm.
Article
Full-text available
Since the start of the COVID-19 pandemic, social distancing (SD) has played an essential role in controlling and slowing down the spread of the virus in smart cities. To ensure the respect of SD in public areas, visual SD monitoring (VSDM) provides promising opportunities by (i) controlling and analyzing the physical distance between pedestrians in real-time, (ii) detecting SD violations among the crowds, and (iii) tracking and reporting individuals violating SD norms. To the authors’ best knowledge, this paper proposes the first comprehensive survey of VSDM frameworks and identifies their challenges and future perspectives. Typically, we review existing contributions by presenting the background of VSDM, describing evaluation metrics, and discussing SD datasets. Then, VSDM techniques are carefully reviewed after dividing them into two main categories: hand-crafted feature-based and deep-learning-based methods. A significant focus is paid to convolutional neural networks (CNN)-based methodologies as most of the frameworks have used either one-stage, two-stage, or multi-stage CNN models. A comparative study is also conducted to identify their pros and cons. Thereafter, a critical analysis is performed to highlight the issues and impediments that hold back the expansion of VSDM systems. Finally, future directions attracting significant research and development are derived.
Chapter
Human communication relies heavily on facial expressions. Although detecting emotion from facial expressions has always been a simple task for humans, doing so with computer techniques is rather difficult. It is now possible to discern emotions from images because of recent advances in computer vision and machine learning. Facial expressions frequently disclose people's true emotional states beyond their spoken language. Furthermore, visual pattern-based understanding of human affect is a critical component of any human-machine interaction system, which is why the task of Facial Expression Recognition (FER) attracts both scientific and corporate interest. Deep Learning (DL) approaches have recently achieved very high performance on FER by utilizing several architectures and learning paradigms. We consider two types of datasets here, static images and live streaming, and compare their performance using a convolutional neural network (CNN) strategy.
Article
The complexity metric of infrared image sequences is crucial to the prediction and evaluation of single-object tracker performance, and it is a research hotspot in the field of computer vision. However, the accuracy and comprehensiveness of the existing complexity metrics are limited. In this paper, an effective method is proposed to quantify the single-object tracking difficulty of infrared image sequences. First, based on the classification and analysis of trackers, the influencing factors of infrared target tracking are summarized. Then, five metrics combining deep features and shallow features are proposed to characterize the complexity of infrared image sequences. Finally, a synthesis complexity metric is designed for comprehensive evaluations of infrared image sequences. Experimental results indicate that our method performs better than traditional methods on comprehensiveness, and the proposed metrics can more accurately reflect the performance of trackers on different infrared image sequences.
Article
This paper proposes a periocular recognition algorithm that uses region-specific and sub-image-based neighbor gradient feature extraction to achieve better recognition results. The approach initially segments the periocular region into four sub-regions, namely the eyebrow, the eye corner regions, and the upper and lower eye fold regions, after detecting the left and right eye corner points. The KAZE feature extraction algorithm is used to extract features from the upper eye fold region, while the HOG feature extraction algorithm is used to extract features from the eyebrow and eye corner regions. The approach also estimates the shape of the eyebrow from the distances of N points on the eyebrow region to the eye corner midpoint; the eyebrow shape feature also contains the width and height measures through the N points on the eyebrow that give its shape. The proposed approach further introduces a sub-image-based neighbor gradient (SING) feature extraction that extracts neighbor gradient features from 3×3 sub-images. Finally, the extracted features are trained using a Naïve Bayes classifier. The experimental evaluation was done on the AR dataset, the CASIA Iris distance dataset, and the UBIPr dataset using metrics such as rank-1 and rank-5 recognition accuracies, the area under the ROC curve (AUC), and the equal error rate (EER). The proposed scheme provides a rank-1 recognition rate of 92.32%, 97.41%, and 97.87% for the UBIPr, CASIA-Iris, and AR datasets respectively. The experimental results reveal that the proposed method outperforms traditional periocular recognition algorithms.
Article
Full-text available
Unmanned Aerial Vehicles (UAVs) have many applications and uses in commerce and recreation, so perceiving and visualizing the state of UAVs is of prime importance. In this paper, the authors address the primary objective of capturing and detecting drones and deriving valuable position and coordinate data. The wide diffusion of drones increases the hazards of their misuse in many illegitimate actions, for example drug smuggling and terrorism. Thereby, drone surveillance and automated detection are crucial for protecting restricted areas, special zones, and regions from illegal drone entry. However, under low-illumination scenes, the designed detectors may lose the capability to discover valuable data, which may lead to wrong and inaccurate results. To alleviate this, some works consider using infrared (IR) videos and images for object detection and tracking. The crucial drawback of infrared images is that they generally possess low resolution, which provides inadequate information for trackers. Considering the above analysis, fusing RGB (visible) data with infrared image data is essential for capturing and detecting drones: it leverages more than a single mode of data, which is useful and advantageous for learning precise drone detectors. This paper introduces an automated video- and image-based drone tracking and detection system which utilizes an advanced deep-learning-based object detection and tracking method known as You Only Look Once (YOLOv5) to protect restricted areas or special zones from unlawful drone entry and intervention. YOLOv5, one of the single-stage detectors, has among the best detection and tracking performance, balancing accuracy and speed by collecting in-depth, high-level extracted features. Based on YOLOv5, this paper improves the model to track and detect UAVs more accurately and precisely, and it is one of the first works to introduce a YOLOv5-based algorithm for UAV tracking and detection for anti-UAV purposes. It adopts four scales of feature maps instead of the previous three to predict bounding boxes, which delivers more texture and contour information needed to track and detect small objects. At the same time, to reduce computation, the size of the UAV in the four-scale feature maps is calculated according to the input data, and the number of anchor boxes is adjusted accordingly. Therefore, the proposed UAV tracking and detection technology can be applied in the anti-UAV field. Accordingly, an effective double-training strategy has been developed for drone detection and capture.
Trained on class and instance segmentation spanning moving frames and image series, the detector learns accurate segment information and derives distinct instantaneous and class-order characteristics.
Article
Full-text available
This paper presents a novel attention-based adversarial autoencoder network (A3N) that consists of a two-stream decoder to detect abnormal events in video sequences. The first stream of the decoder is a reconstructive model responsible for recreating the input frame sequence. However, the second stream is a future predictive model used to predict the future frame sequence through adversarial learning. A global attention mechanism is employed at the decoder side that helps to decode the encoded sequences effectively. The training of A3N is carried out on normal video data. The attention-based reconstructive model is used during the inference stage to compute the anomaly score. A3N delivers a considerable average speed of 0.0227 s (∼ 44 fps ) for detecting anomalies in the testing phase on used datasets. Several experiments and ablation analyses have been performed on UCSD Pedestrian, CUHK Avenue and ShanghaiTech datasets to validate the efficiency of the proposed model.
Article
Full-text available
This study combined SIFT and SSA to propose a novel algorithm for real-time object tracking. The proposed algorithm utilizes an intermediate fixed-size buffer and a modified SSA algorithm. Since the complete reconstruction step of the SSA algorithm was unnecessary, it was considerably simplified. In addition, the execution time of a Matlab implementation of the SSA algorithm was compared with a respective C++ implementation. Moreover, the performance of two different matching algorithms in the detection step, the FlannBasedMatcher and Brute-Force matcher of the OpenCV library, was compared.
Article
The brute-force behaviour of High Efficiency Video Coding (HEVC) is the biggest hurdle in the communication of multimedia content. Therefore, two novel methods are presented here to expedite the intra mode decision process of HEVC. In the first algorithm, the feasibility of the Histogram of Oriented Gradients (HOG) for early intra mode decision is established using statistical evidence. The HOG of the current block and of the 35 intra predictions are then obtained, and the intra prediction that gives the least sum of absolute differences (SAD) with the HOG of the current block is selected as the termination point. In the second algorithm, the difference between the Hadamard costs of intra modes is modeled to achieve a fast intra mode decision. The proposed algorithms accelerate the HEVC encoding process by 5% and 35.57%, while their Bjontegaard Delta Bit Rate (BD-BR) is 1.09% and 1.61%, respectively.
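The first algorithm's selection rule (compare the HOG of the current block with the HOGs of the 35 intra predictions and keep the one with the least SAD) can be sketched as follows. The HOG cell size is an illustrative assumption, and `predictions` stands in for the 35 HEVC intra predictions of the block.

```python
import numpy as np
from skimage.feature import hog

def early_intra_mode(block, predictions):
    """Index of the intra prediction whose HOG has the least sum of
    absolute differences (SAD) with the HOG of the current block."""
    def block_hog(b):
        return hog(b, orientations=9, pixels_per_cell=(4, 4),
                   cells_per_block=(1, 1))
    target = block_hog(block)
    sads = [np.abs(target - block_hog(p)).sum() for p in predictions]
    return int(np.argmin(sads))          # early-termination mode index
```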
Article
Full-text available
The risk of death incurred by breast cancer is rising exponentially, especially among women. This makes early breast cancer detection a crucial problem. In this paper, we propose a computer-aided diagnosis (CAD) system, called CADNet157, for mammography breast cancer based on transfer learning and fine-tuning of well-known deep learning models. Firstly, we applied a hand-crafted-features-based learning model using four extractors (local binary pattern, gray-level co-occurrence matrix, Gabor, and histogram of oriented gradients) with four selected machine learning classifiers (K-nearest neighbors, support vector machine, random forests, and artificial neural networks). Then, we performed some modifications on the basic CNN model and fine-tuned three pre-trained deep learning models: VGGNet16, InceptionResNetV2, and ResNet152. Finally, we conducted a set of experiments using two benchmark datasets: the Digital Database for Screening Mammography (DDSM) and INbreast. The results of the conducted experiments showed that for the hand-crafted-features-based CAD system, we achieved an area under the ROC curve (AUC) of 95.28% for DDSM using random forests and 98.10% for INbreast using a support vector machine with the histogram of oriented gradients extractor. On the other hand, the CADNet157 model (i.e., fine-tuned ResNet152) was the best performing deep model, with an AUC of 98.90% (sensitivity: 97.72%, specificity: 100%) and 98.10% (sensitivity: 100%, specificity: 96.15%) for, respectively, DDSM and INbreast. The CADNet157 model overcomes the limitations of traditional CAD systems by providing early detection of breast cancer and reducing the risk of false diagnosis.
Chapter
Recently, object detection algorithms have been widely used in various fields. In the highway monitoring scene, the performance of existing pedestrian detection algorithms degrades rapidly as pedestrian size decreases. To enhance detector performance, we designed a method named DSANet, which combines super-resolution with an object detection algorithm so that the detection network can capture more detailed features. Compared with existing super-resolution algorithms, we integrate degeneration learning and a self-attention module to make the super-resolution algorithm better fit pedestrian detection. In particular, we introduce the MSAF module to fuse self-attention information from different numbers of heads. The proposed super-resolution method provides better support for pedestrian detection. The experimental results show that the reconstructed SR images have richer detail features, which improves the accuracy of pedestrian detection.
Article
Video face detection is a crucial first step in many facial recognition and face analysis systems. It should serve post-processing steps as much as possible while achieving high-accuracy real-time detection. In this paper, we first introduce the constrained aspect ratio loss (CARLoss) for better facial box regression and incorporate it into the modified FH-YOLOv4; then an IoU-Tracker-based video face image deduplication algorithm is proposed at the detection level. Extensive experiments and comparative tests show the effectiveness of our method.
Article
Emotion categorization has become an important area of research due to the increasing number of intelligent systems, such as robots, interacting with humans. This includes deep learning models, which have performed remarkably well on many classification-based tasks. However, due to their homogeneous representation of knowledge, deep learning models are vulnerable to different kinds of attacks. The hypothesis is that emotions displayed in facial images are more than patterns of pixels. Thus, the objective of this work is to propose a novel heterogeneous facial landmark-based emotion categorization (LEmo) method that is robust to distractor and adversarial attacks. Moreover, we compared the proposed LEmo method with seven state-of-the-art methods, including neural networks (i.e. the residual neural network (ResNet), the Visual Geometry Group (VGG), and the Inception-ResNet models), emotion categorization tools (i.e. Py-Feat and LightFace), as well as anti-attack-based methods (i.e. Adv-Network and DLP-CNN). To test the robustness of the LEmo method, three different types of adversarial attacks and a distractor attack were launched at the data. Unlike other methods, which exhibited large performance decreases (up to 79%), the LEmo method was strongly resistant to all attacks, achieving high accuracy with only a small (< 9.3%) or no decrease under the different changes made to the images of the CK+ and KDEF databases. Furthermore, the LEmo method showed a considerably lower execution time compared to all other methods.
Article
Full-text available
We present a new landmark detection problem on the upper body of a clothed person for tailoring purposes. This landmark detection problem is unknown in the literature; it is in the same domain as, but different from, the 'fashion' landmark detection problem, where the landmarks are for classifying clothing. An existing 'attentive fashion network' (AFN) was trained using 800,000 annotated images of the DeepFashion dataset, with a base network of VGG16 pre-trained on the ImageNet dataset to provide initial weights. Training a network for 'body' landmark detection would normally require a similarly sized dataset. We propose a deep neural network for body landmark detection where the knowledge from an existing network was transferred and trained with an extremely small dataset of just 99 images annotated with body landmarks. A baseline model was tested where only the fashion landmark branch was used, but retrained for body landmarks. This produced a testing error of 0.068 (normalised mean distance between the predicted landmarks and ground truth). The error was significantly reduced by adopting the fashion landmark branch and the attention unit of AFN, but substituting the classification branch with a new body landmark detection branch, for the proposed Attention-based Fashion-to-Body landmark Network (AFBN). We tested 6 variants of the proposed AFBN model with different convolutional block designs and auto-encoders for enforcing landmark relations. The trained model had a low testing error ranging from 0.022 to 0.028 over these variants. The variant with an increased number of channels and inception units with residual connections had the best overall performance. Although AFBN and its variants were trained with a limited dataset, their performance exceeds that of the state-of-the-art attentive fashion network AFN (0.0534). The principle of transfer learning demonstrated here is relevant where labelled domain data are scarce, providing a low solution cost through faster training of a deep neural network with a significantly small dataset.
Article
For decades, researchers have investigated how to recognize facial images. This study reviews the development of different face recognition (FR) methods, namely holistic learning, handcrafted local feature learning, shallow learning, and deep learning (DL). With the development of these methods, the accuracy of recognizing faces in the Labeled Faces in the Wild (LFW) database has increased: the accuracy of holistic learning is 60%, that of handcrafted local feature learning increases to 70%, and that of shallow learning is 86%. Finally, DL achieves human-level performance (97% accuracy). This enhanced accuracy is driven by large datasets and graphics processing units (GPUs) with massively parallel processing capabilities. Furthermore, FR challenges and current research studies are discussed to understand future research directions. The results of this study show that recognition accuracy on the LFW database has now reached 99.85%.
Article
Multi-class geospatial object detection with remote sensing imagery has broad prospects in urban planning, natural disaster warning, industrial production, military surveillance and other applications. Accuracy and efficiency are two common measures for evaluating object detection models, and it is often difficult to achieve both at the same time. Developing a practical remote sensing object detection algorithm that balances accuracy and efficiency is thus a big challenge in the Earth observation community. Here, we propose a comprehensive high-speed multi-class remote sensing object detection method. Firstly, we obtain a multi-volume YOLO (You Only Look Once) v4 model for balancing speed and accuracy, based on a pruning strategy for the convolutional neural network (CNN) and the one-stage object detection network YOLO v4. Moreover, we apply the Manhattan-Distance Intersection over Union (MIOU) loss function to the multi-volume YOLO v4 to further improve accuracy without additional computational burden. Secondly, mainly due to computing limitations, a remote sensing image, which is large relative to a natural image, must first be divided into multiple smaller tiles, which are then detected separately; finally, the detection results are spliced back to match the original image. In the process of slicing, a large number of truncated objects appear at the edges of tiles, which produce many false results in the subsequent detection steps. To solve this problem, we propose a Truncated Non-Maximum Suppression (NMS) algorithm to filter out repeated and false detection boxes arising from truncated targets in the spliced detection results. We compare the proposed algorithm with state-of-the-art methods on the Dataset for Object deTection in Aerial images (DOTA) and DOTA v2. Quantitative evaluations show that mAP and FPS reach 77.3 and 35 on DOTA, and 61.0 and 74 on DOTA v2. Overall, our method reaches an optimal balance between efficiency and accuracy and realizes high-speed remote sensing object detection.
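The tile-then-splice step that creates the truncated-object problem can be sketched as below; the tile size and overlap are illustrative values, and a standard NMS (or the paper's Truncated NMS) would then suppress the duplicate and truncated boxes in the spliced results.

```python
import numpy as np

def tiles(img, tile=1024, overlap=200):
    # Yield (offset, crop) pairs covering a large remote sensing image.
    step = tile - overlap
    h, w = img.shape[:2]
    for y in range(0, h, step):
        for x in range(0, w, step):
            yield (x, y), img[y:y + tile, x:x + tile]

def splice(per_tile_dets):
    # Shift per-tile boxes (x1, y1, x2, y2, score) back to full-image
    # coordinates before duplicate suppression.
    boxes = [(x1 + ox, y1 + oy, x2 + ox, y2 + oy, s)
             for (ox, oy), dets in per_tile_dets
             for (x1, y1, x2, y2, s) in dets]
    return np.asarray(boxes)
```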
Chapter
The behavior of animals reflects their internal state. Changes in behavior, such as a lack of sleep, can be detected as early warning signs of health issues. Zoologists are often required to use video recordings to study animal activity. These videos are generally not sufficiently indexed, so the process is long and laborious, and the observation results may vary between observers. This study addresses the difficulty of measuring elephant sleep stages from surveillance videos of the elephant barn at night. To assist zoologists, we propose using deep learning techniques to automatically locate elephants in each surveillance camera view and then map the detected elephants onto the barn plan. Instead of watching all of the videos, zoologists can examine the mapping history, allowing them to measure elephant sleeping stages faster. Overall, our approach monitors elephants in their barn with a high degree of accuracy.
Article
Full-text available
The development of face recognition improvements still lacks knowledge about which parts of the face are important. In this article, the authors present a face-parts analysis to obtain important recognition information in certain areas of the face, beyond just the eye or eyebrow, from a black-box perspective. In addition, the authors propose a more advanced way to select parts without introducing artifacts, using the average face and morphing. Furthermore, multiple face recognition systems are used to analyze the contribution of each face component. Finally, the results show that the four deep face recognition systems behave differently in each experiment. However, the eyebrows remain the most important part for deep face recognition systems. In addition, the face texture plays a deeper role than the face shape.
Article
Full-text available
Many works have employed Machine Learning (ML) techniques in the detection of Diabetic Retinopathy (DR), a disease that affects the human eye. However, the accuracy of most DR detection methods still needs improvement. Gray Wolf Optimization-Extreme Learning Machine (GWO-ELM) is one of the most popular ML algorithms and can be considered an accurate classification algorithm, but it has not been used for DR detection. Therefore, this work aims to apply the GWO-ELM classifier and employ one of the most popular feature extractions, Histogram of Oriented Gradients-Principal Component Analysis (HOG-PCA), to increase the accuracy of a DR detection system. Although HOG-PCA has been tested in many image processing domains, including medical domains, it has not yet been tested for DR. GWO-ELM can prevent overfitting and solve multi-class and binary classification problems, and it performs like a kernel-based Support Vector Machine with a Neural Network structure, whilst HOG-PCA has the ability to extract the most relevant features with low dimensionality. Therefore, the combination of the GWO-ELM classifier and HOG-PCA features might produce an effective technique for DR classification and feature extraction. The proposed GWO-ELM is evaluated on two different datasets, namely APTOS-2019 and the Indian Diabetic Retinopathy Image Dataset (IDRiD), in both binary and multi-class classification. The experimental results have shown excellent performance of the proposed GWO-ELM model, which achieved an accuracy of 96.21% for multi-class and 99.47% for binary classification on the APTOS-2019 dataset, as well as 96.15% for multi-class and 99.04% for binary classification on the IDRiD dataset. This demonstrates that the combination of GWO-ELM and HOG-PCA is an effective classifier for detecting DR and might be applicable to other image data types.
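A minimal sketch of the HOG-PCA front end described above, using scikit-image and scikit-learn; since GWO-ELM has no off-the-shelf implementation here, an RBF SVM stands in for the classifier, and the HOG parameters and PCA dimensionality are assumptions, not the paper's settings.

```python
import numpy as np
from skimage.feature import hog
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def hog_features(images):
    # One HOG descriptor per grayscale fundus image.
    return np.array([hog(im, orientations=9, pixels_per_cell=(16, 16),
                         cells_per_block=(2, 2)) for im in images])

# HOG -> PCA -> classifier; SVC is a stand-in for the GWO-ELM classifier.
model = make_pipeline(PCA(n_components=50), SVC(kernel='rbf'))
# model.fit(hog_features(train_images), train_labels)
# accuracy = model.score(hog_features(test_images), test_labels)
```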
Article
Proposed is a discriminative feature modeling technique in three orthogonal planes (TOP) for human action recognition (HAR). Pyramidal histogram of orientation gradients-TOP (PHOG-TOP) and dense optical flow-TOP (DOF-TOP) techniques are utilized for salient motion estimation and description, representing the human action in a compact but distinct manner. The contribution of the work is to explicitly learn the gradual change of visual patterns using a fusion of the PHOG-TOP and DOF-TOP techniques to discover the nature of the action. With this feature representation, dimensionality reduction is achieved by deep stacked autoencoders. The encoded feature representation is used in long short-term memory (LSTM) classification for HAR. Experiments with better recognition rates demonstrate the discriminative power of the proposed descriptor. Moreover, the proposed modeling and LSTM classification outperform state-of-the-art methods for HAR.
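As a hedged illustration of the three-orthogonal-planes idea, a 2D descriptor can be applied to the central XY, XT, and YT slices of a video volume and the results concatenated; plain HOG stands in below for the pyramidal PHOG of the paper, and all parameters are placeholders.

```python
# Sketch of a descriptor on Three Orthogonal Planes (TOP): apply a 2D
# descriptor to the central XY, XT, and YT slices of a video volume.
# Plain HOG stands in for the paper's pyramidal PHOG.
import numpy as np
from skimage.feature import hog

def top_descriptor(volume):
    """volume: grayscale video clip as an array of shape (T, H, W)."""
    t, h, w = (s // 2 for s in volume.shape)
    planes = [volume[t, :, :],   # XY plane: appearance
              volume[:, h, :],   # XT plane: horizontal motion over time
              volume[:, :, w]]   # YT plane: vertical motion over time
    return np.concatenate([
        hog(p, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        for p in planes
    ])
```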
Article
Full-text available
X-ray imaging systems have enabled security personnel to identify potential threats within baggage and cargo since the early 1970s. However, manually screening for threatening items is time-consuming and vulnerable to human error. Hence, researchers have utilized recent advancements in computer vision, revolutionized by machine learning models, to aid baggage security threat identification via 2D X-ray and 3D CT imagery. The performance of these approaches is severely affected by heavy occlusion, class imbalance, and limited labeled data, further complicated by ingeniously concealed emerging threats. The research community must therefore devise suitable approaches, leveraging the findings of the existing literature, to move in new directions. Towards that goal, we present a structured survey providing systematic insight into state-of-the-art advances in baggage threat detection. We also give an accessible overview of X-ray-based imaging systems and the challenges faced within the threat identification domain. We include a taxonomy to classify the approaches proposed for 2D and 3D CT X-ray based baggage security threat screening, and provide a comparative analysis of the performance of the methods evaluated on four benchmarks. Finally, we discuss current open challenges and potential future research avenues.
Article
Full-text available
This paper presents a general, trainable system for object detection in unconstrained, cluttered scenes. The system derives much of its power from a representation that describes an object class in terms of an overcomplete dictionary of local, oriented, multiscale intensity differences between adjacent regions, efficiently computable as a Haar wavelet transform. This example-based learning approach implicitly derives a model of an object class by training a support vector machine classifier using a large set of positive and negative examples. We present results on face, people, and car detection tasks using the same architecture. In addition, we quantify how the representation affects detection performance by considering several alternate representations including pixels and principal components. We also describe a real-time application of our person detection system as part of a driver assistance system.
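The core primitive here, local oriented intensity differences between adjacent regions, can be computed cheaply with an integral image; the sketch below shows one vertical-edge Haar-like response under that standard construction (function names and window parameters are illustrative, not the paper's dictionary).

```python
# Sketch: oriented intensity differences (Haar-like responses) computed
# in constant time per window from an integral image.
import numpy as np

def integral_image(img):
    return img.astype(float).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] using the integral image ii."""
    total = ii[r1 - 1, c1 - 1]
    if r0 > 0:
        total -= ii[r0 - 1, c1 - 1]
    if c0 > 0:
        total -= ii[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total

def haar_vertical(ii, r, c, h, w):
    """Vertical-edge response: left half-window minus right half-window."""
    left = rect_sum(ii, r, c, r + h, c + w // 2)
    right = rect_sum(ii, r, c + w // 2, r + h, c + w)
    return left - right
```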
Article
Full-text available
The ability to recognize humans and their activities by vision is key for a machine to interact intelligently and effortlessly with a human-inhabited environment. Because of many potentially important applications, “looking at people” is currently one of the most active application domains in computer vision. This survey identifies a number of promising applications and provides an overview of recent developments in this domain. The scope of this survey is limited to work on whole-body or hand motion; it does not include work on human faces. The emphasis is on discussing the various methodologies; they are grouped in 2-D approaches with or without explicit shape models and 3-D approaches. Where appropriate, systems are reviewed. We conclude with some thoughts about future directions.
Conference Paper
Full-text available
We describe a novel method for human detection in single images which can detect full bodies as well as close-up views in the presence of clutter and occlusion. Humans are modeled as flexible assemblies of parts, and robust part detection is the key to the approach. The parts are represented by co-occurrences of local features which capture the spatial layout of the part's appearance. Feature selection and the part detectors are learnt from training images using AdaBoost. The detection algorithm is very efficient as (i) all part detectors use the same initial features, (ii) a coarse-to-fine cascade approach is used for part detection, and (iii) a part assembly strategy reduces the number of spurious detections and the search space. The results outperform existing human detectors.
Conference Paper
Full-text available
This paper describes a pedestrian detection system that integrates image intensity information with motion information. We use a detection style algorithm that scans a detector over two consecutive frames of a video sequence. The detector is trained (using AdaBoost) to take advantage of both motion and appearance information to detect a walking person. Past approaches have built detectors based on appearance information, but ours is the first to combine both sources of information in a single detector. The implementation described runs at about 4 frames/second, detects pedestrians at very small scales (as small as 20×15 pixels), and has a very low false positive rate. Our approach builds on the detection work of Viola and Jones. Novel contributions of this paper include: i) development of a representation of image motion which is extremely efficient, and ii) implementation of a state of the art pedestrian detection system which operates on low resolution images under difficult conditions (such as rain and snow).
Article
Full-text available
This paper describes a pedestrian detection system that integrates image intensity information with motion information. We use a detection-style algorithm that scans a detector over two consecutive frames of a video sequence. The detector is trained (using AdaBoost) to take advantage of both motion and appearance information to detect a walking person. Past approaches have built detectors based on motion information or on appearance information, but ours is the first to combine both sources of information in a single detector. The implementation described runs at about 4 frames/second, detects pedestrians at very small scales (as small as 20×15 pixels), and has a very low false positive rate. Our approach builds on the detection work of Viola and Jones. Novel contributions of this paper include: (i) development of a representation of image motion which is extremely efficient, and (ii) implementation of a state-of-the-art pedestrian detection system which operates on low resolution images under difficult conditions (such as rain and snow).
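The motion representation is built from differences between shifted versions of consecutive frames; the sketch below is one plausible reading of that construction (the single-pixel shift, direction labels, and wrap-around border handling are assumptions for illustration).

```python
# Sketch of shifted-difference motion images from two consecutive frames:
# each channel responds preferentially to motion in one direction.
import numpy as np

def motion_images(f0, f1):
    """f0, f1: consecutive grayscale frames as float arrays."""
    shifted = {
        'U': np.roll(f1, -1, axis=0), 'D': np.roll(f1, 1, axis=0),
        'L': np.roll(f1, -1, axis=1), 'R': np.roll(f1, 1, axis=1),
    }
    images = {'delta': np.abs(f0 - f1)}  # overall motion magnitude
    for name, s in shifted.items():
        images[name] = np.abs(f0 - s)    # directional motion cue
    return images                        # np.roll wraps at borders (sketch only)
```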
Article
Full-text available
In this paper we propose a novel approach for detecting interest points invariant to scale and affine transformations. Our scale and affine invariant detectors are based on the following recent results: (1) Interest points extracted with the Harris detector can be adapted to affine transformations and give repeatable results (geometrically stable). (2) The characteristic scale of a local structure is indicated by a local extremum over scale of normalized derivatives (the Laplacian). (3) The affine shape of a point neighborhood is estimated based on the second moment matrix.
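In the standard formulation of results (2) and (3), stated here as an assumption since the abstract itself gives no formulas, the second moment matrix and the scale-selection operator take the following forms:

```latex
% Second moment matrix at x, with integration scale \sigma_I and
% differentiation scale \sigma_D (L_x, L_y: Gaussian derivatives):
\mu(\mathbf{x},\sigma_I,\sigma_D) = \sigma_D^2\, g(\sigma_I) *
\begin{pmatrix}
  L_x^2(\mathbf{x},\sigma_D) & L_x L_y(\mathbf{x},\sigma_D) \\
  L_x L_y(\mathbf{x},\sigma_D) & L_y^2(\mathbf{x},\sigma_D)
\end{pmatrix}

% Characteristic scale: local extremum over \sigma of the normalized Laplacian
\sigma^2 \left| L_{xx}(\mathbf{x},\sigma) + L_{yy}(\mathbf{x},\sigma) \right|
```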
Article
Full-text available
The retinotopic mapping of the visual field to the surface of the striate cortex is characterized as a logarithmic conformal mapping. This summarizes in a concise way the observed curve of cortical magnification, the linear scaling of receptive field size with eccentricity, and the mapping of global visual field landmarks. It is shown that if this global structure is reiterated at the local level, then the sequence regularity of the simple cells of area 17 may be accounted for as well. Recently published data on the secondary visual area, the medial visual area, and the inferior pulvinar of the owl monkey suggest that the same global logarithmic structure holds for these areas as well. The available data on the structure of the somatotopic mapping (area S-1) support a similar analysis. The possible relevance of the analytical form of the cortical receptotopic maps to perception is examined, and a brief discussion of the developmental implications of these findings is presented.
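The mapping is usually written as a complex logarithm; the form below is a standard rendering of this model, given here as an assumption since the abstract provides no formula.

```latex
% Complex-logarithm model of the retinocortical map: z is retinal position
% (as a complex number), w is cortical position, a and k are constants.
w = k \log(z + a)

% For |z| \gg a the cortical magnification falls off as M(E) \propto 1/E,
% consistent with receptive field size growing linearly with eccentricity E.
```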
Article
Full-text available
In this paper, we compare the performance of descriptors computed for local interest regions, as, for example, extracted by the Harris-Affine detector. Many different descriptors have been proposed in the literature. It is unclear which descriptors are more appropriate and how their performance depends on the interest region detector. The descriptors should be distinctive and at the same time robust to changes in viewing conditions as well as to errors of the detector. Our evaluation uses as criterion recall with respect to precision and is carried out for different image transformations. We compare shape context, steerable filters, PCA-SIFT, differential invariants, spin images, SIFT, complex filters, moment invariants, and cross-correlation for different types of interest regions. We also propose an extension of the SIFT descriptor and show that it outperforms the original method. Furthermore, we observe that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best. Moments and steerable filters show the best performance among the low dimensional descriptors.
Conference Paper
Full-text available
This paper presents the results of the first large-scale field tests on vision-based pedestrian protection from a moving vehicle. Our PROTECTOR system combines pedestrian detection, trajectory estimation, risk assessment and driver warning. The paper pursues a "system approach" to the detection component: an optimization scheme models the system as a succession of individual modules and finds a good overall parameter setting by combining individual ROCs using a convex-hull technique. On the experimental side, we present a methodology for validating pedestrian detection performance in an actual vehicle setting, and we hope this test methodology will contribute towards the establishment of benchmark testing, enabling this application to mature. We validate the PROTECTOR system using the proposed methodology and present quantitative results based on tens of thousands of images from hours of driving. Although the results are promising, more research is needed before such systems can be placed in the hands of ordinary vehicle drivers.
Conference Paper
Full-text available
We present a novel approach to measuring similarity between shapes and exploit it for object recognition. In our framework, the measurement of similarity is preceded by (1) solving for correspondences between points on the two shapes, and (2) using the correspondences to estimate an aligning transform. In order to solve the correspondence problem, we attach a descriptor, the shape context, to each point. The shape context at a reference point captures the distribution of the remaining points relative to it, thus offering a globally discriminative characterization. Corresponding points on two similar shapes will have similar shape contexts, enabling us to solve for correspondences as an optimal assignment problem. Given the point correspondences, we estimate the transformation that best aligns the two shapes; regularized thin-plate splines provide a flexible class of transformation maps for this purpose. Dissimilarity between two shapes is computed as a sum of matching errors between corresponding points, together with a term measuring the magnitude of the aligning transform. We treat recognition in a nearest-neighbor classification framework. Results are presented for silhouettes, trademarks, handwritten digits and the COIL dataset.
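A minimal sketch of the descriptor itself, assuming numpy: the shape context of a reference point is a log-polar histogram of the other points' relative positions (the bin counts and radial range below are illustrative choices, not the paper's settings).

```python
# Sketch of the shape context at one reference point: a log-polar histogram
# of the positions of the remaining points relative to it.
import numpy as np

def shape_context(points, i, n_r=5, n_theta=12):
    """points: (N, 2) array of shape points; i: index of the reference point."""
    rel = np.delete(points, i, axis=0) - points[i]
    r = np.linalg.norm(rel, axis=1)
    r = r / r.mean()                             # normalize for scale invariance
    theta = np.arctan2(rel[:, 1], rel[:, 0])     # angles in (-pi, pi]
    r_edges = np.logspace(np.log10(0.125), np.log10(2.0), n_r + 1)
    t_edges = np.linspace(-np.pi, np.pi, n_theta + 1)
    hist, _, _ = np.histogram2d(r, theta, bins=[r_edges, t_edges])
    return hist.ravel()                          # points beyond the outer radius are ignored
```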
Conference Paper
Full-text available
This paper presents an efficient shape-based object detection method based on Distance Transforms and describes its use for real-time vision on-board vehicles. The method uses a template hierarchy to capture the variety of object shapes; efficient hierarchies can be generated offline for given shape distributions using stochastic optimization techniques (e.g., simulated annealing). Online, matching involves a simultaneous coarse-to-fine approach over the shape hierarchy and over the transformation parameters. Very large speed-up factors are typically obtained when comparing this approach with the equivalent brute-force formulation; we have measured gains of several orders of magnitude. We present experimental results on the real-time detection of traffic signs and pedestrians from a moving vehicle. Because of the highly time-sensitive nature of these vision tasks, we also discuss some hardware-specific implementations of the proposed method as far as SIMD parallelism is concerned.
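The underlying matching step is chamfer-style: a template placement is scored by how far its edge pixels fall from the nearest image edge under a distance transform. A minimal sketch, assuming scipy and binary edge maps; the template hierarchy and coarse-to-fine search are not shown.

```python
# Sketch of chamfer matching: the score of a template placement is the mean
# distance from each template edge pixel to the nearest image edge.
import numpy as np
from scipy.ndimage import distance_transform_edt

def chamfer_score(image_edges, template_pts, offset):
    """image_edges: boolean edge map; template_pts: (N, 2) integer edge
    coordinates; offset: (row, col) placement of the template."""
    dt = distance_transform_edt(~image_edges)  # distance to nearest edge pixel
    pts = template_pts + np.asarray(offset)
    return dt[pts[:, 0], pts[:, 1]].mean()     # lower score = better match

# In practice the distance transform is computed once per image and reused
# across all template placements explored by the coarse-to-fine search.
```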
Article
Full-text available
We present a general example-based framework for detecting objects in static images by components. The technique is demonstrated by developing a system that locates people in cluttered scenes. The system is structured with four distinct example-based detectors that are trained to separately find the four components of the human body: the head, legs, left arm, and right arm. After ensuring that these components are present in the proper geometric configuration, a second example-based classifier combines the results of the component detectors to classify a pattern as either a "person" or a "nonperson." We call this type of hierarchical architecture, in which learning occurs at multiple stages, an adaptive combination of classifiers (ACC). We present results showing that this system performs significantly better than a similar full-body person detector. This suggests that the improvement in performance is due to the component-based approach and the ACC data classification architecture. The algorithm is also more robust than the full-body person detection method in that it is capable of locating partially occluded views of people and people whose body parts have little contrast with the background.
Conference Paper
Full-text available
Detecting people in images is a key problem for video indexing, browsing and retrieval. The main difficulties are the large appearance variations caused by action, clothing, illumination, viewpoint and scale. Our goal is to find people in static video frames using learned models of both the appearance of body parts (head, limbs, hands) and of the geometry of their assemblies. We build on Forsyth & Fleck's general `body plan' methodology and Felzenszwalb & Huttenlocher's dynamic programming approach for efficiently assembling candidate parts into `pictorial structures'. However, we replace the rather simple part detectors used in these works with dedicated detectors learned for each body part using Support Vector Machines (SVMs) or Relevance Vector Machines (RVMs). We are not aware of any previous work using SVMs to learn articulated body plans; however, they have been used to detect both whole pedestrians and combinations of rigidly positioned subimages (typically upper body, arms, and legs) in street scenes, under a wide range of illumination, pose and clothing variations. RVMs are SVM-like classifiers that offer a well-founded probabilistic interpretation and improved sparsity for reduced computation. We demonstrate their benefits experimentally in a series of results showing great promise for learning detectors in more general situations.
Article
This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
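For a concrete feel of the matching stage, the sketch below uses OpenCV's SIFT implementation with the ratio test; RANSAC homography fitting stands in for the paper's Hough clustering and least-squares pose verification, and img1/img2 are assumed to be preloaded grayscale images.

```python
# Sketch of SIFT matching with the ratio test; RANSAC homography fitting
# stands in for Hough clustering plus least-squares pose verification.
import cv2
import numpy as np

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)   # img1, img2: grayscale images
kp2, des2 = sift.detectAndCompute(img2, None)

matcher = cv2.BFMatcher()
good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
        if m.distance < 0.75 * n.distance]       # Lowe's ratio test

src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
```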
Patent
This patent describes a method of recognizing a pattern in a test object. The method consists of: specifying properties characteristic of the pattern; specifying discrete ranges of values of the properties; measuring the values of the properties in the test object; arranging the measured values in at least one test histogram; determining a reference set of values of the properties and arranging the set as at least a first reference histogram; and comparing the test and reference histograms by determination of the value of a function which provides a measure of the amount of information necessary to express the at least one test histogram in terms of the optimum code for describing at least the first reference histogram.
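One plausible instantiation of the comparison function, offered as an assumption rather than the patented formula: measure the number of bits needed to encode the test histogram under a code that is optimal for the reference histogram.

```python
# One plausible reading (an assumption, not the patented formula): the bits
# needed to encode the test histogram under the reference's optimal code.
import numpy as np

def code_length_bits(test_hist, ref_hist, eps=1e-9):
    q = (ref_hist + eps) / (ref_hist + eps).sum()  # reference code model
    return -(test_hist * np.log2(q)).sum()         # smaller = closer match
```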
Article
Finding people in pictures presents a particularly difficult object recognition problem. We show how to find people by finding candidate body segments and then constructing assemblies of segments that are consistent with the constraints on the appearance of a person that result from kinematic properties. Since a reasonable model of a person requires at least nine segments, it is not possible to inspect every group, due to the huge combinatorial complexity. We propose two approaches to this problem. In one, the search can be pruned by using projected versions of a classifier that accepts groups corresponding to people. We describe an efficient projection algorithm for one popular classifier, and demonstrate that our approach can be used to determine whether images of real scenes contain people. The second approach employs a probabilistic framework, so that we can draw samples of assemblies with probabilities proportional to their likelihood, which allows us to draw human-like assemblies more often than non-person ones. The main performance bottleneck is the segmentation of images, but the overall results of both approaches on real images of people are encouraging.
Conference Paper
The appeal of computer games may be enhanced by vision-based user inputs. The high speed and low cost requirements for near-term, mass-market game applications make system design challenging. The response time of the vision interface should be less than a video frame time, and the interface should cost less than $50 U.S. We meet these constraints with algorithms tailored to particular hardware. We have developed a special detector, called the artificial retina chip, which allows for fast, on-chip image processing. We describe two algorithms, based on image moments and orientation histograms, which exploit the capabilities of the chip to provide interactive response to the player's hand or body positions at 10 msec frame time and at low cost. We show several possible game interactions.
Article
In this paper we describe a trainable object detector and its instantiations for detecting faces and cars at any size, location, and pose. To cope with variation in object orientation, the detector uses multiple classifiers, each spanning a different range of orientation. Each of these classifiers determines whether the object is present at a specified size within a fixed-size image window. To find the object at any location and size, these classifiers scan the image exhaustively. Each classifier is based on the statistics of localized parts. Each part is a transform from a subset of wavelet coefficients to a discrete set of values. Such parts are designed to capture various combinations of locality in space, frequency, and orientation. In building each classifier, we gathered the class-conditional statistics of these part values from representative samples of object and non-object images. We trained each classifier to minimize classification error on the training set by using AdaBoost with Confidence-Weighted Predictions (Schapire and Singer, 1999). In detection, each classifier computes the part values within the image window and looks up their associated class-conditional probabilities. The classifier then makes a decision by applying a likelihood ratio test. For efficiency, the classifier evaluates this likelihood ratio in stages. At each stage, the classifier compares the partial likelihood ratio to a threshold and makes a decision about whether to cease evaluation, labeling the input as non-object, or to continue further evaluation. The detector orders these stages of evaluation from a low-resolution to a high-resolution search of the image. Our trainable object detector achieves reliable and efficient detection of human faces and passenger cars with out-of-plane rotation.
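The staged likelihood-ratio evaluation can be sketched as follows; this is a minimal sketch with placeholder part models and thresholds, not the paper's trained detector.

```python
# Sketch of staged likelihood-ratio evaluation: accumulate per-part
# log-likelihood ratios coarse-to-fine and reject as soon as the partial
# ratio drops below a stage threshold. Tables and thresholds are placeholders.

def cascade_detect(part_values, log_lr_tables, stage_thresholds):
    """part_values[i]: quantized value of part i in the image window;
    log_lr_tables[i][v]: log P(v|object) - log P(v|non-object)."""
    partial = 0.0
    for value, table, threshold in zip(part_values, log_lr_tables,
                                       stage_thresholds):
        partial += table[value]
        if partial < threshold:
            return False   # cease evaluation: label the window non-object
    return True            # survived every stage: label the window object
```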
Article
Training a support vector machine (SVM) leads to a quadratic optimization problem with bound constraints and one linear equality constraint. Despite the fact that this type of problem is well understood, there are many issues to consider in designing an SVM learner. In particular, for large learning tasks with many training examples, off-the-shelf optimization techniques for general quadratic programs quickly become intractable in their memory and time requirements. SVM light is an implementation of an SVM learner which addresses the problem of large tasks. This chapter presents algorithmic and computational results developed for SVM light V2.0, which make large-scale SVM training more practical. The results give guidelines for the application of SVMs to large domains.
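The quadratic program in question is the standard SVM dual, which has exactly the structure the abstract describes: bound constraints on each variable and one linear equality constraint.

```latex
% Dual form of SVM training: a quadratic program with bound constraints
% and one linear equality constraint.
\max_{\alpha}\;\; \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
      \alpha_i \alpha_j\, y_i y_j\, K(x_i, x_j)
\qquad \text{s.t.} \quad 0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0
```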
Stable local feature detection and representation is a fundamental component of many image registration and object recognition algorithms. Mikolajczyk and Schmid (June 2003) recently evaluated a variety of approaches and identified the SIFT [D. G. Lowe, 1999] algorithm as being the most resistant to common image deformations. This paper examines (and improves upon) the local image descriptor used by SIFT. Like SIFT, our descriptors encode the salient aspects of the image gradient in the feature point's neighborhood; however, instead of using SIFT's smoothed weighted histograms, we apply principal components analysis (PCA) to the normalized gradient patch. Our experiments demonstrate that the PCA-based local descriptors are more distinctive, more robust to image deformations, and more compact than the standard SIFT representation. We also present results showing that using these descriptors in an image retrieval application results in increased accuracy and faster matching.
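A minimal sketch of the descriptor construction, assuming numpy and a projection basis learned offline with PCA; the patch size and output dimensionality below are placeholders rather than the paper's exact settings.

```python
# Sketch of a PCA-SIFT style descriptor: concatenate the x/y gradients of a
# patch around the keypoint, normalize, and project onto a PCA basis
# learned offline from many training patches.
import numpy as np

def pca_sift_descriptor(patch, basis):
    """patch: (41, 41) grayscale patch; basis: (d, 2*39*39) PCA row vectors."""
    gy, gx = np.gradient(patch.astype(float))
    v = np.concatenate([gx[1:-1, 1:-1].ravel(), gy[1:-1, 1:-1].ravel()])
    v /= np.linalg.norm(v) + 1e-9   # normalize against illumination changes
    return basis @ v                # e.g. d = 36 components
```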
A pictorial structure is a collection of parts arranged in a deformable configuration. Each part is represented using a simple appearance model, and the deformable configuration is represented by spring-like connections between pairs of parts. While pictorial structures were introduced a number of years ago, they have not been broadly applied to matching and recognition problems. This has been due in part to the computational difficulty of matching pictorial structures to images. In this paper we present an efficient algorithm for finding the best global match of a pictorial structure to an image. The running time of the algorithm is optimal, and it takes only a few seconds to match a model with five to ten parts. With this improved algorithm, pictorial structures provide a practical and powerful framework for qualitative descriptions of objects and scenes, and are suitable for many generic image recognition problems. We illustrate the approach using simple models of a person and a car.
Article
We present a method to recognize hand gestures, based on a pattern recognition technique developed by McConnell [16] employing histograms of local orientation. We use the orientation histogram as a feature vector for gesture classification and interpolation. This method is simple and fast to compute, and offers some robustness to scene illumination changes. We have implemented a real-time version, which can distinguish a small vocabulary of about 10 different hand gestures. All the computation occurs on a workstation.
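A minimal numpy sketch of such a feature, assuming a grayscale input image: gradient orientations, weighted by gradient magnitude, are accumulated into a fixed number of angular bins (the bin count is an illustrative choice).

```python
# Sketch of a local orientation histogram: gradient orientations, weighted
# by gradient magnitude, accumulated into a fixed number of angular bins.
import numpy as np

def orientation_histogram(img, n_bins=36):
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)                        # angles in (-pi, pi]
    hist, _ = np.histogram(ang, bins=n_bins,
                           range=(-np.pi, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-9)               # contrast normalization
```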
Matching shapes. The 8th ICCV
  • S Belongie
  • J Malik
  • J Puzicha
S. Belongie, J. Malik, and J. Puzicha. Matching shapes. The 8th ICCV, Vancouver, Canada, pages 454–461, 2001.
Efficient pedestrian detection: a test case for svm based categorization
  • V De Poortere
  • J Cant
  • B Van Den Bosch
  • J De Prins
  • F Fransens
  • L Van Gool
V. de Poortere, J. Cant, B. Van den Bosch, J. de Prins, F. Fransens, and L. Van Gool. Efficient pedestrian detection: a test case for svm based categorization. Workshop on Cognitive Vision, 2002. Available online: http://www.vision.ethz.ch/cogvis02/.
Probabilistic methods for finding people
  • S Ioffe
  • D A Forsyth
S. Ioffe and D. A. Forsyth. Probabilistic methods for finding people. IJCV, 43(1):45–68, 2001.
Making large-scale svm learning practical
  • T Joachims
T. Joachims. Making large-scale svm learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. The MIT Press, Cambridge, MA, USA, 1999.
Computer vision for computer games. 2nd International Conference on Automatic Face and Gesture Recognition
  • W T Freeman
  • K Tanaka
  • J Ohta
  • K Kyuma
W. T. Freeman, K. Tanaka, J. Ohta, and K. Kyuma. Computer vision for computer games. 2nd International Conference on Automatic Face and Gesture Recognition, Killington, VT, USA, pages 100–105, October 1996.