ABSTRACT: We investigate the self-localisation problem of an ad-hoc network of randomly distributed and independent devices in an open-space environment with low reverberation but heavy noise (e.g. smartphones recording videos of an outdoor event). Assuming a sufficient number of sound sources, we estimate the distance between a pair of devices from the extreme (minimum and maximum) time differences of arrival (TDOAs) from the sources to the pair of devices, without knowing the time offset. The obtained inter-device distances are then exploited to derive the geometrical configuration of the network. In particular, we propose a robust audio fingerprinting algorithm for noisy recordings and perform landmark matching to construct a histogram of the TDOAs of multiple sources, from which the extreme TDOAs can be estimated. By using audio fingerprinting features, the proposed algorithm works robustly in very noisy environments. Experiments with free-field simulations and open-space recordings prove the effectiveness of the proposed algorithm.
IEEE Transactions on Audio Speech and Language Processing 10/2015; 23(10):1623 - 1636. DOI:10.1109/TASLP.2015.2442417 · 2.48 Impact Factor
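The geometric relation behind the extreme-TDOA distance estimate in the abstract above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the far-field assumption (sources reaching both endfire directions of the device pair), the 343 m/s speed of sound, and all function names are assumptions made here.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s at ~20 degrees C (assumed)

def distance_from_extreme_tdoas(tdoas):
    """Estimate the inter-device distance from per-source TDOAs.

    For far-field sources spread around a device pair, each TDOA (in
    seconds) lies in [-d/c, +d/c] plus a constant unknown clock offset,
    so the spread of the extremes reveals the distance d:
        tdoa_max - tdoa_min = 2 d / c  =>  d = c (tdoa_max - tdoa_min) / 2
    The unknown offset cancels in the difference.
    """
    return SPEED_OF_SOUND * (max(tdoas) - min(tdoas)) / 2.0

# Synthetic check: devices 2 m apart, unknown clock offset of 0.5 s,
# sources at angles 0..180 degrees (including both endfire directions).
d_true, offset = 2.0, 0.5
angles = [i * math.pi / 18 for i in range(19)]
tdoas = [offset + (d_true / SPEED_OF_SOUND) * math.cos(a) for a in angles]
d_est = distance_from_extreme_tdoas(tdoas)
```

In practice the extreme TDOAs would be read off the landmark-matching histogram rather than from clean per-source values as here.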
ABSTRACT: We use audio fingerprinting to solve the synchronization problem between multiple recordings from an ad-hoc array consisting of randomly placed wireless microphones or hand-held smartphones. Synchronization is crucial when employing conventional microphone array techniques such as beamforming and source localization. We propose a fine audio landmark fingerprinting method that detects the time differences of arrival (TDOAs) of multiple sources in the acoustic environment. By estimating the maximum and minimum TDOAs, the proposed method can accurately calculate the unknown time offset between a pair of microphone recordings. Experimental results demonstrate that the proposed method significantly improves the synchronization accuracy of conventional audio fingerprinting methods and achieves performance comparable to the generalized cross-correlation method.
European Signal Processing Conference (EUSIPCO), Nice, France; 09/2015
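The offset calculation alluded to above can be sketched in the same extreme-TDOA framework: if the acoustic part of the TDOA spans a symmetric interval, the clock offset is the midpoint of the extremes. A minimal sketch under that symmetry assumption; the setup and names are hypothetical, not the paper's code.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s (assumed)

def offset_from_extreme_tdoas(tdoas):
    """If sources reach both endfire directions of the microphone pair,
    the acoustic TDOA spans [-d/c, +d/c] symmetrically around the clock
    offset, so the offset is the midpoint of the observed extremes."""
    return (max(tdoas) + min(tdoas)) / 2.0

# Synthetic check: 2 m spacing, 0.5 s unknown offset, sources at
# angles covering 0..180 degrees so both extremes are reached.
d_true, offset_true = 2.0, 0.5
angles = [i * math.pi / 18 for i in range(19)]
tdoas = [offset_true + (d_true / SPEED_OF_SOUND) * math.cos(a)
         for a in angles]
offset_est = offset_from_extreme_tdoas(tdoas)
```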
ABSTRACT: We present a generic formulation of self- and cross-correcting Bayesian trackers using a Dynamic Bayesian Network. Correction operations in a tracker such as parameter tuning, model updates and re-initialization are represented using hidden variables together with the target state and measurement variables in the Dynamic Bayesian network model. The representation allows one to model different self- and cross-correcting tracking frameworks under the same formulation and facilitates comparison and the design of new trackers. The proposed model is demonstrated with three state-of-the-art trackers that are based on different principles to implement online correction of target tracking.
IEEE International Conference on Advanced Video and Signal based Surveillance (AVSS); 08/2015
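The correction-as-hidden-variable idea above can be sketched as a generic tracking loop in which a per-frame decision, driven by an online performance measure, triggers a correction operation. A toy 1-D sketch under assumptions made here (hard re-initialization as the correction; all names hypothetical), not the paper's Dynamic Bayesian Network inference.

```python
def track_with_correction(frames, predict_update, perf, reinit, init,
                          threshold=0.5):
    """Generic self-correcting loop: a hidden correction decision fires
    per frame when the online performance measure drops below a
    threshold. Here the correction is a hard re-initialization, but
    parameter tuning or model updates fit the same slot."""
    states, state = [], init
    for f in frames:
        state = predict_update(state, f)
        if perf(state, f) < threshold:   # correction variable fires
            state = reinit(f)            # correction operation
        states.append(state)
    return states

# Toy 1-D tracker: the estimate lags the measurement; a large jump in
# the measurement drops the performance score and triggers re-init.
states = track_with_correction(
    frames=[1.0, 2.0, 10.0, 11.0],
    predict_update=lambda s, f: 0.5 * s + 0.5 * f,
    perf=lambda s, f: 1.0 / (1.0 + abs(s - f)),
    reinit=lambda f: f,
    init=0.0,
)
```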
ABSTRACT: We propose ViComp, an automatic audio-visual camera selection framework for composing uninterrupted recordings from multiple user-generated videos (UGVs) of the same event. We design an automatic audio-based cut-point selection method to segment the UGV. ViComp combines segments of UGVs using a rank-based camera selection strategy by considering audio-visual quality and camera selection history. We analyze the audio to maintain audio continuity. To filter video segments which contain visual degradations, we perform spatial and spatio-temporal quality assessment. We validate the proposed framework with subjective tests and compare it with state-of-the-art methods.
Multimedia Tools and Applications 06/2015; DOI:10.1007/s11042-015-2641-2 · 1.35 Impact Factor
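The rank-based selection described above (quality combined with selection history) can be sketched as a scoring rule: prefer the highest-quality camera, discounted by how heavily it has already been used. A minimal sketch with assumed score ranges and a hypothetical penalty weight; ViComp's actual ranking is more involved.

```python
def select_camera(quality, history, switch_penalty=0.1):
    """Pick the camera with the best quality score, discounted by its
    share of past segments, to avoid monotonous cuts.

    quality: {cam_id: audio-visual quality in [0, 1]} (assumed range)
    history: {cam_id: fraction of past segments from that camera}
    """
    def score(cam):
        return quality[cam] - history.get(cam, 0.0) * switch_penalty
    return max(quality, key=score)

# Camera 'a' is marginally better but has dominated the edit so far,
# so the history discount hands the next segment to 'b'.
chosen = select_camera({"a": 0.9, "b": 0.85}, {"a": 0.8})
```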
ABSTRACT: We propose a tracker-level fusion framework for robust visual tracking. The framework combines trackers addressing different tracking challenges to improve the overall performance. A novelty of the proposed framework is the inclusion of an online performance measure to identify the track quality level of each tracker so as to guide the fusion. The fusion is then based on appropriately mixing the prior state of the trackers. Moreover, the track-quality level is used to update the target appearance model. We demonstrate the framework with two Bayesian trackers on video sequences with various challenges and show its robustness compared to the independent use of the two individual trackers, and also compared to state-of-the-art trackers that use tracker-level fusion.
IEEE Transactions on Circuits and Systems for Video Technology 05/2015; 25(5):776-789. DOI:10.1109/TCSVT.2014.2360027 · 2.62 Impact Factor
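The quality-guided mixing of tracker priors described above can be sketched, for a 1-D Gaussian state, as moment matching of a quality-weighted mixture. This is an illustrative simplification under assumptions made here (scalar Gaussians, weights proportional to the quality score), not the paper's exact mixing rule.

```python
def fuse_priors(means, variances, qualities):
    """Mix per-tracker Gaussian priors with weights proportional to each
    tracker's online track-quality score, then moment-match the mixture
    back to a single Gaussian (mean and variance)."""
    total = sum(qualities)
    w = [q / total for q in qualities]
    mean = sum(wi * mi for wi, mi in zip(w, means))
    var = sum(wi * (vi + (mi - mean) ** 2)
              for wi, vi, mi in zip(w, variances, means))
    return mean, var

# Two equally trusted trackers disagreeing on position: the fused prior
# sits between them and its variance grows to reflect the disagreement.
mean, var = fuse_priors([0.0, 2.0], [1.0, 1.0], [1.0, 1.0])
```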
ABSTRACT: Quadcopters are highly maneuverable and can provide an effective means for an agile dynamic positioning of sensors such as cameras. In this paper we propose a method for the self-positioning of a team of camera-equipped quadcopters (flying cameras) around a moving target. The self-positioning task is driven by the maximization of the monitored surface of the moving target based on a dynamic flight model combined with a collision avoidance algorithm. Each flying camera only knows the relative distance of neighboring flying cameras and its desired position with respect to the target. Given a team of up to 12 flying cameras, we show they can achieve a stable time-varying formation around a moving target without collisions.
ABSTRACT: Target re-identification approaches generally perform association between camera pairs only. However, in a multi-camera system many-to-one camera associations are needed when targets transit from multiple source-cameras to a destination-camera. To address this problem, we propose a person re-identification approach that generates camera-invariant object matching scores, which are based on re-identification score variations in multiple camera pairs. Each camera pair is represented with two parametric distribution models obtained by curve fitting on intra-class and inter-class target matching scores. These two models are combined to generate the likelihood of a correct match between a new target in the destination-camera and those in all source-cameras. We show the improvement in the performance of the proposed re-identification approach compared to existing pairwise approaches on two publicly available datasets.
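The two-model combination described above can be sketched with Gaussians as the parametric distributions: given intra-class and inter-class score models fitted for a camera pair, a raw matching score is converted into the probability of a correct match. A minimal sketch; the Gaussian choice and equal priors are assumptions made here, not necessarily the paper's fitted models.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (
        sigma * math.sqrt(2.0 * math.pi))

def match_likelihood(score, intra, inter):
    """Probability that 'score' comes from a correct (intra-class) match,
    given (mu, sigma) models fitted per camera pair and equal priors."""
    p_intra = gaussian_pdf(score, *intra)
    p_inter = gaussian_pdf(score, *inter)
    return p_intra / (p_intra + p_inter)

# High raw score near the intra-class mode -> near-certain match;
# score near the inter-class mode -> near-certain non-match.
p_hi = match_likelihood(0.90, intra=(0.8, 0.1), inter=(0.3, 0.1))
p_lo = match_likelihood(0.35, intra=(0.8, 0.1), inter=(0.3, 0.1))
```

Normalizing scores through the pair-specific models is what makes the outputs comparable across camera pairs, enabling the many-to-one association.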
ABSTRACT: We propose an approach to create camera coalitions in resource-constrained camera networks and demonstrate it for collaborative target tracking. We cast coalition formation as a decentralized resource allocation process where the best cameras among those viewing a target are assigned to a coalition based on marginal utility theory. A manager is dynamically selected to negotiate with cameras whether they will join the coalition and to coordinate the tracking task. This negotiation is based not only on the utility brought by each camera to the coalition, but also on the associated cost (i.e. additional processing and communication). Experimental results and comparisons using simulations and real data show that the proposed approach outperforms related state-of-the-art methods by improving tracking accuracy in cost-free settings. Moreover, under resource limitations, the proposed approach controls the tradeoff between accuracy and cost, and achieves energy savings with only a minor reduction in accuracy.
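The marginal-utility admission rule above can be sketched as a greedy loop: admit the camera with the largest utility-minus-cost margin while the margin stays positive and the resource budget is not exhausted. A minimal centralized sketch under assumptions made here (scalar utility and cost per camera); the paper's negotiation is decentralized via a dynamically selected manager.

```python
def form_coalition(cameras, budget):
    """Greedy coalition formation by marginal utility.

    cameras: {cam_id: (utility, cost)} for cameras viewing the target
    budget:  total resource budget of the coalition (assumed scalar)
    """
    coalition, total_cost = [], 0.0
    remaining = dict(cameras)
    while remaining:
        # Candidate with the largest marginal (utility - cost)
        cam = max(remaining, key=lambda c: remaining[c][0] - remaining[c][1])
        utility, cost = remaining.pop(cam)
        if utility - cost <= 0.0 or total_cost + cost > budget:
            break  # no camera left that pays for itself within budget
        coalition.append(cam)
        total_cost += cost
    return coalition

# 'b' costs more than the utility it brings, so it is kept out.
coalition = form_coalition(
    {"a": (1.0, 0.2), "b": (0.5, 0.6), "c": (0.8, 0.3)}, budget=1.0)
```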
ABSTRACT: We propose N-consensus, an algorithm that reduces the cost of the consensus process for distributed visual target tracking without compromising on tracking accuracy. N-consensus fuses target posteriors computed by viewing nodes (i.e. the cameras viewing the same target) only and limits the number of nodes participating in consensus to those within a specified number of hops from the viewing nodes. The number of hops is computed based on viewing and communication ranges to identify all nodes within twice the viewing range from the viewing nodes. Unlike average consensus, the proposed N-consensus does not require prior knowledge of node connectivity because we employ an improved fast covariance intersection algorithm during consensus update.
IEEE Tenth International Conference on Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP), 2015, Singapore; 04/2015
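The covariance-intersection update mentioned above can be sketched in the scalar case: two estimates in information form are combined with a single weight, yielding a fused estimate that is consistent even when the inputs share unknown correlations. The information-ratio weight used here is one common fast-CI heuristic and an assumption of this sketch, not necessarily the paper's improved variant (the matrix case typically uses traces or determinants).

```python
def covariance_intersection(y1, Y1, y2, Y2):
    """Fuse two scalar estimates in information form (Y = 1/P, y = Y*x)
    with a fast covariance-intersection weight.

    Returns the fused state estimate and its variance.
    """
    w = Y1 / (Y1 + Y2)           # fast CI weight (information ratio)
    Y = w * Y1 + (1.0 - w) * Y2  # fused information matrix (scalar)
    y = w * y1 + (1.0 - w) * y2  # fused information vector (scalar)
    return y / Y, 1.0 / Y

# Two equally informative estimates (x=1 and x=3, variance 1 each):
# the fusion averages the states but, unlike a naive Kalman update,
# does NOT shrink the variance, which keeps it safe under unknown
# cross-correlation between the nodes' estimates.
x_fused, p_fused = covariance_intersection(1.0, 1.0, 3.0, 1.0)
```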
ABSTRACT: We present a framework for multitarget detection and tracking that infers candidate target locations in videos containing a high density of homogeneous targets. We propose a gradient-climbing technique and an isocontour slicing approach for intensity maps to localize targets. The former uses Markov chain Monte Carlo to iteratively fit a shape model onto the target locations, whereas the latter uses the intensity values at different levels to find consistent object shapes. We generate trajectories by recursively associating detections with a hierarchical graph-based tracker on temporal windows. The solution to the graph is obtained with a greedy algorithm that accounts for false-positive associations. The edges of the graph are weighted with a likelihood function based on location information. We evaluate the performance of the proposed framework on challenging datasets containing videos with high density of targets and compare it with six alternative trackers.
IEEE Transactions on Circuits and Systems for Video Technology 04/2015; 25(4):623-637. DOI:10.1109/TCSVT.2014.2344509 · 2.62 Impact Factor
ABSTRACT: We present a framework for improving probabilistic tracking of an extended object with a set of model points. The framework combines the tracker with an online performance measure and a correction technique. We correlate model point trajectories online to improve the accuracy of a failed or uncertain tracker. A model point tracker gets assistance from neighboring trackers whenever a degradation in its performance is detected using the online performance measure. The correction of the model point state is based on correlation information from the state of other trackers. Partial Least Squares (PLS) regression is used to adaptively model the correlation of point tracker states from short windowed trajectories. Experimental results on data obtained from optical motion capture systems show the improvement in tracking performance of the proposed framework compared to the baseline tracker and other state-of-the-art trackers.
ABSTRACT: We propose a methodology to quantitatively compare the relative performance of tracking evaluation measures. The proposed methodology is based on determining the probabilistic agreement between tracking result decisions made by measures and those made by humans. We use tracking results on publicly available datasets with different target types and varying challenges, and collect the judgments of 90 skilled, semi-skilled and unskilled human subjects using a web-based performance assessment test. The analysis of the agreements allows us to highlight the variation in performance of the different measures and the most appropriate ones for the various stages of tracking performance evaluation.
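At its simplest, the agreement computation described above reduces to the fraction of tracking-result judgements on which a measure's decision matches the human decision. A minimal sketch assuming binary success/failure decisions; the paper's probabilistic formulation across 90 subjects is richer than this.

```python
def agreement(measure_decisions, human_decisions):
    """Fraction of tracking-result judgements (e.g. 1 = 'acceptable',
    0 = 'failed') on which an evaluation measure agrees with the human
    judgement for the same tracking result."""
    assert len(measure_decisions) == len(human_decisions)
    same = sum(1 for m, h in zip(measure_decisions, human_decisions)
               if m == h)
    return same / len(measure_decisions)

# A measure that disagrees with humans on 1 of 4 results scores 0.75.
score = agreement([1, 0, 1, 1], [1, 1, 1, 1])
```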
ABSTRACT: We present an end-to-end approach for trajectory clustering from aerial videos that enables the extraction of motion patterns in urban scenes. Camera motion is first compensated by mapping object trajectories on a reference plane. Then clustering is performed based on statistics from the Discrete Wavelet Transform coefficients extracted from the trajectories. Finally, motion patterns are identified by distance minimization from the centroids of the trajectory clusters. The experimental validation on four datasets shows the effectiveness of the proposed approach in extracting trajectory clusters. We also make available two new real-world aerial video datasets together with the estimated object trajectories and ground-truth cluster labeling.
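The DWT-statistics pipeline above can be sketched with a one-level Haar transform: compute detail coefficients per axis, summarize them with simple statistics, and assign trajectories to the nearest cluster centroid. The Haar wavelet, the (mean, energy) statistics and the nearest-centroid step are illustrative choices of this sketch, not necessarily those of the paper.

```python
def haar_dwt(signal):
    """One level of the Haar DWT: approximation and detail coefficients
    of an even-length sequence."""
    approx = [(signal[i] + signal[i + 1]) / 2
              for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2
              for i in range(0, len(signal), 2)]
    return approx, detail

def trajectory_descriptor(xs, ys):
    """Per-axis statistics (mean, energy) of Haar detail coefficients."""
    feats = []
    for axis in (xs, ys):
        _, d = haar_dwt(axis)
        feats += [sum(d) / len(d), sum(c * c for c in d) / len(d)]
    return feats

def assign_to_cluster(desc, centroids):
    """Nearest-centroid assignment by squared Euclidean distance."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    return min(range(len(centroids)), key=lambda k: dist(desc, centroids[k]))

# A straight, constant-speed trajectory lands near the first centroid.
desc = trajectory_descriptor([0.0, 1.0, 2.0, 3.0], [0.0, 0.0, 0.0, 0.0])
cluster = assign_to_cluster(desc, [[-0.5, 0.25, 0.0, 0.0],
                                   [5.0, 5.0, 5.0, 5.0]])
```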
ABSTRACT: We present an interactive visualizer that enables the exploration, measurement, analysis and manipulation of trajectories. Trajectories can be generated either automatically by multi-target tracking algorithms or manually by human annotators. The visualizer helps in understanding the behavior of targets, correcting tracking results and quantifying the performance of tracking algorithms. The input video can be overlaid to compare ideal and estimated target locations. The code of the visualizer (C++ with openFrameworks) is open source.
ABSTRACT: Face images in a video sequence should be registered accurately before being analysed, otherwise registration errors may be interpreted as facial activity. Subpixel accuracy is crucial for the analysis of subtle actions. In this paper we present PSTR (Probabilistic Subpixel Temporal Registration), a framework that achieves high registration accuracy.
Asian Conference on Computer Vision (ACCV'14), Singapore; 11/2014
ABSTRACT: The choice of the most suitable fusion scheme for smart camera networks depends on the application as well as on the available computational and communication resources. In this paper we discuss and compare the resource requirements of five fusion schemes, namely centralised fusion, flooding, consensus, token passing and dynamic clustering. The Extended Information Filter is applied to each fusion scheme to perform target tracking. Token passing and dynamic clustering involve negotiation among viewing nodes (cameras observing the same target) to decide which node should perform the fusion process, whereas flooding and consensus do not include this negotiation. Negotiation helps limit the number of participating cameras and reduces the resources required for the fusion process itself, but requires additional communication. Consensus has the highest communication and computation costs, but it is the only scheme that can be applied when not all viewing nodes are connected directly and routing tables are not available.
8th ACM / IEEE International Conference on Distributed Smart Cameras (ICDSC 2014); 11/2014
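The Extended Information Filter used above is convenient for fusion because each node's measurement enters as an additive contribution to the information vector and matrix. A scalar sketch of the measurement update under assumptions made here (1-D state, hypothetical names); the real filter works with vectors and Jacobian matrices.

```python
def eif_update(y, Y, z, h, H, R):
    """Scalar (extended) information-filter measurement update.

    y, Y: prior information vector and matrix (Y = 1/P, y = Y*x)
    z:    measurement; h: predicted measurement h(x_prior)
    H:    measurement Jacobian; R: measurement noise variance

    The local contributions i and I are additive, which is what the
    fusion schemes exchange: a node sums contributions from its peers.
    """
    x = y / Y                            # current state estimate
    i = H * (z - h + H * x) / R          # information contribution
    I = H * H / R                        # information-matrix contribution
    return y + i, Y + I

# Linear case (h = x, H = 1, R = 1): prior N(0, 1) fused with a
# measurement z = 2 of the same precision gives the midpoint estimate.
y_new, Y_new = eif_update(y=0.0, Y=1.0, z=2.0, h=0.0, H=1.0, R=1.0)
x_new, p_new = y_new / Y_new, 1.0 / Y_new
```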
ABSTRACT: Automatic affect analysis has attracted great interest in various contexts including the recognition of action units and basic or non-basic emotions. In spite of major efforts, there are several open questions on what the important cues to interpret facial expressions are and how to encode them. In this paper, we review the progress across a range of affect recognition applications to shed light on these fundamental questions. We analyse the state-of-the-art solutions by decomposing their pipelines into fundamental components, namely face registration, representation, dimensionality reduction and recognition. We discuss the role of these components and highlight the models and new trends that are followed in their design. Moreover, we provide a comprehensive analysis of facial representations by uncovering their advantages and limitations; we elaborate on the type of information they encode and discuss how they deal with the key challenges of illumination variations, registration errors, head-pose variations, occlusions, and identity bias. This survey allows us to identify open issues and to define future directions for designing real-world affect recognition systems.
IEEE Transactions on Pattern Analysis and Machine Intelligence 10/2014; DOI:10.1109/TPAMI.2014.2366127 · 5.78 Impact Factor
ABSTRACT: We propose a framework for the automatic grouping and alignment of unedited multi-camera User-Generated Videos (UGVs) within a database. The proposed framework analyzes the sound in order to match and cluster UGVs that capture the same spatio-temporal event and estimate their relative time-shift to temporally align them. We design a descriptor derived from the pairwise matching of audio chroma features of UGVs. The descriptor facilitates the definition of a classification threshold for automatic query-by-example event identification. We evaluate the proposed identification and synchronization framework on a database of 263 multi-camera recordings of 48 real-world events and compare it with state-of-the-art methods. Experimental results show the effectiveness of the proposed approach in the presence of various audio degradations.
Information Sciences 08/2014; 302:108-121. DOI:10.1016/j.ins.2014.08.026 · 4.04 Impact Factor
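The time-shift estimation step above can be sketched as cross-correlation over integer lags between two audio feature sequences: the lag maximizing the correlation is the relative shift. A minimal sketch on 1-D sequences under assumptions made here (frame-level features, exhaustive lag search); the paper's descriptor works on chroma features and adds a matching/classification stage.

```python
def best_time_shift(a, b, max_lag):
    """Estimate the relative shift (in frames) between two feature
    sequences by maximizing cross-correlation over integer lags.
    A negative result means b leads a by that many frames."""
    def xcorr(lag):
        return sum(a[i] * b[i + lag] for i in range(len(a))
                   if 0 <= i + lag < len(b))
    return max(range(-max_lag, max_lag + 1), key=xcorr)

# The spike in b occurs two frames earlier than in a, so the
# correlation peaks at lag -2.
lag = best_time_shift([0, 0, 1, 0, 0], [1, 0, 0, 0, 0], max_lag=3)
```

Real chroma sequences are 12-dimensional per frame, so the inner product would run over the chroma bins as well, but the lag search is unchanged.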
ABSTRACT: Networks of smart cameras share large amounts of data to accomplish tasks such as reidentification. We propose a feature-selection method that minimizes the data needed to represent the appearance of objects by learning the most appropriate feature set for the task at hand (person reidentification). The computational cost for feature extraction and the cost for storing the feature descriptor are considered jointly with feature performance to select cost-effective good features. This selection allows us to improve intercamera reidentification while reducing the bandwidth that is necessary to share data across the camera network. We also rank the selected features in the order of effectiveness for the task to enable a further reduction of the feature set by dropping the least effective features when application constraints require this adaptation. We compare the proposed approach with state-of-the-art methods on the iLIDS and VIPeR datasets and show that the proposed approach considerably reduces network traffic due to intercamera feature sharing while keeping the reidentification performance at an equivalent or better level compared with the state of the art.
IEEE Transactions on Circuits and Systems for Video Technology 08/2014; 24(8):1362-1374. DOI:10.1109/TCSVT.2014.2305511 · 2.62 Impact Factor
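The joint performance/cost selection above can be sketched as a budgeted greedy rule: rank features by accuracy gain per unit cost (extraction plus storage) and take them in order until the budget runs out. The ratio criterion and independent-gain assumption are simplifications of this sketch; the paper's learned selection accounts for feature interactions.

```python
def select_features(features, max_cost):
    """Greedily pick features by accuracy gain per unit cost until the
    cost budget (bandwidth/compute, assumed scalar) is exhausted.

    features: {name: (accuracy_gain, cost)} with cost > 0
    Returns the chosen names in selection order, which also serves as
    the effectiveness ranking for later feature dropping.
    """
    order = sorted(features,
                   key=lambda f: features[f][0] / features[f][1],
                   reverse=True)
    chosen, spent = [], 0.0
    for f in order:
        gain, cost = features[f]
        if spent + cost <= max_cost:
            chosen.append(f)
            spent += cost
    return chosen

# 'texture' is cheap for its gain, so it outranks the stronger but
# costlier 'color'; 'shape' no longer fits in the budget.
chosen = select_features(
    {"color": (0.4, 1.0), "texture": (0.3, 0.5), "shape": (0.1, 1.0)},
    max_cost=1.5)
```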