Journal of Visual Communication and Image Representation

Published by Elsevier
Online ISSN: 1095-9076
Print ISSN: 1047-3203
In a recent paper in this journal, Kesidis and Papamarkos state that "A new method for the exact reconstruction of any gray-scale image from its projections is proposed." In this note we point out that this method is a special case of a well-known approach (peeling) and that it can produce exact reconstructions only under assumptions that are not realistic for practical methods of data collection. Further, we point out that some statements made in the paper regarding disadvantages of the algebraic reconstruction techniques (ART), as compared to the method of the paper, are false.
We consider the problem of allocating bits among pictures in an MPEG video coder to equalize the visual quality of the coded pictures, while meeting buffer and channel constraints imposed by the MPEG video buffering verifier. We address this problem within a framework that consists of three components: (1) a bit production model for the input pictures, (2) a set of bit-rate constraints imposed by the video buffering verifier, and (3) a novel lexicographic criterion for optimality. Under this framework, we derive simple necessary and sufficient conditions for optimality that lead to efficient algorithms.
This paper deals with the recovery of optical flow, that is to say, with the identification of a vector field, defined on some subset of the image plane, which accounts for the infinitesimal time evolution of the image of a particular object. Our formulation is general in that it allows for the vector field to be expressed as a linear combination of an arbitrary (but chosen in advance) finite collection of vector fields, and it allows the measurements to include (a) the velocity of feature points, (b) the velocity normal to an evolving contour, and/or (c) the velocity tangent to an intensity gradient. The method is based on least squares and an explicit formula for the generalized inverse of a class of integral operators. It involves a gramian whose invertibility is necessary and sufficient for the identification of a unique best fitting vector field. Various important subcases have been studied earlier and reported in the computer vision literature; the emphasis here is on the systematic development of a general tool.
When browsing binary objects in a network environment, data transmission rate, progressive display capability, and view modification under rotation, scaling, and/or translation (R/S/T) changes are the major factors in selecting an appropriate representation model of binary objects. A new half-plane-based representation and display method for 2D binary objects is proposed. Within this modeling framework, a binary object approximated by the shape of a polygon can be represented as a collection of half planes defined over the edges of the polygon under operations of union and intersection. The basic shape attributes of the object model are the slope and the y-intercept of the boundary line of the constituent half planes. It is shown that the representation parameters at the parent node are recursively related to those at the child nodes. This recursive relation is crucial for deriving the color of the nodes for progressive object display. Simulation results are provided to illustrate the performance of our method.
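The half-plane construction above can be sketched in a few lines. The following toy code (illustrative names; it assumes a convex polygon with counter-clockwise vertices, whereas the described model handles general polygons via unions of such intersections) derives one half plane per edge and tests membership:

```python
def half_planes(vertices):
    """Derive one half plane per polygon edge.

    Each half plane is stored as coefficients (a, b, c) of the
    inequality a*x + b*y + c >= 0, which holds for interior points
    when the vertices are listed counter-clockwise.
    """
    planes = []
    n = len(vertices)
    for i in range(n):
        (x1, y1), (x2, y2) = vertices[i], vertices[(i + 1) % n]
        # Inward normal of the edge (x1,y1)->(x2,y2) for CCW order.
        a, b = y1 - y2, x2 - x1
        c = -(a * x1 + b * y1)
        planes.append((a, b, c))
    return planes


def inside(planes, x, y):
    """A point lies in the convex object iff it satisfies every half plane."""
    return all(a * x + b * y + c >= 0 for a, b, c in planes)
```

For the unit square with vertices (0,0), (1,0), (1,1), (0,1), the four derived half planes are y ≥ 0, x ≤ 1, y ≤ 1, and x ≥ 0, and their intersection is exactly the square.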
Traditional image steganalysis techniques are conducted with respect to the entire image. In this work, we aim to differentiate a stego image from its cover image based on steganalysis results of decomposed image blocks. As a natural image often consists of heterogeneous regions, its decomposition will lead to smaller image blocks, each of which is more homogeneous. We classify these image blocks into multiple classes and find a classifier for each class to decide whether a block is from a cover or stego image. Consequently, the steganalysis of the whole image can be conducted by fusing steganalysis results of all image blocks through a voting process. Experimental results will be given to show the advantage of the proposed block-based image steganalysis approach.
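The fusion step described above amounts to a vote over per-block decisions. A minimal sketch, assuming each block classifier has already emitted a binary cover/stego decision (the function name is illustrative, not from the paper):

```python
def fuse_block_decisions(block_votes):
    """Fuse per-block steganalysis decisions by majority vote.

    block_votes: iterable of 1 ("stego") or 0 ("cover"), one vote
    per decomposed image block. The whole image is declared stego
    when more than half of its blocks vote stego; ties go to cover.
    """
    votes = list(block_votes)
    return "stego" if sum(votes) * 2 > len(votes) else "cover"
```

In practice the per-class classifiers could also emit soft scores, with weighted rather than plain voting, but the majority-vote form shows the fusion idea.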
In this paper, we present a multiple hierarchical coding and transmission scheme in which both the fragile structure of set partitioning in hierarchical trees (SPIHT) source coding and the unequal importance of bits in the coded bitstream are taken into account. Multiple subsampling is applied to split the wavelet coefficients of the original image source into multiple subsources so that each subsource is a crude representation of the original image. The sequential dependence of the coded bitstream is thereby broken. As a result, error propagation is limited to a single subsource, which is also coded with SPIHT to achieve the desired coding performance. Cyclic redundancy check/rate-compatible punctured convolutional (CRC/RCPC) channel coding is used to offer unequal error protection (UEP) to the multiple coded bitstreams. In each substream, all the bits in the same bit layer are protected with the same channel code; the higher the bit layer, the stronger the channel protection. Experimental results show that this new scheme has good error-resilience performance over wireless channels with time-varying and bursty error characteristics. In particular, the reconstructed images often demonstrate good visual quality.
We address the challenging problem of face recognition under scenarios where both training and test data are possibly contaminated with spatial misalignments. A supervised sparse coding framework is developed in this paper towards a practical solution to misalignment-robust face recognition. Each probe face image is uniformly divided into a set of local patches. We propose to sparsely reconstruct each probe image patch from the patches of all gallery images, while the reconstructions for all patches of the probe image are jointly regularized by a term that enforces sparsity on the subjects of the selected patches. The reconstruction coefficients derived by ℓ<sub>1</sub>-norm minimization are then utilized to fuse the subject information of the patches for identifying the probe face. Such a supervised sparse coding framework provides a unique solution to face recognition. Extensive face recognition experiments on three benchmark face datasets demonstrate the advantages of the proposed framework over holistic sparse coding and conventional subspace-learning-based algorithms in terms of robustness to spatial misalignments and image occlusions.
The use of shape as a cue for indexing in pictorial databases has traditionally been based on global invariant statistics and deformable templates, on the one hand, and local edge correlation on the other. This paper proposes an intermediate approach based on a characterization of the symmetry in edge maps. The use of symmetry matching as a joint correlation measure between pairs of edge elements further constrains the comparison of edge maps. In addition, a natural organization of groups of symmetry into a hierarchy leads to a graph-based representation of the relational structure of components of shape that allows for deformations by changing attributes of this relational graph. A graduated assignment graph matching algorithm is used to match symmetry structures in images to stored prototypes or sketches. The results of matching sketches and grey-scale images against a small database consisting of a variety of fish, planes, tools, etc., are depicted.
This paper details work undertaken on the application of an algorithm for visual attention (VA) to region of interest (ROI) coding in JPEG 2000 (JP2K). In this way, an “interest ordered” progressive bit-stream is produced where the regions highlighted by the VA algorithm are presented first in the bit-stream. The paper briefly outlines the terminology used in JP2K, the packet structure of the bit-stream, and the methods available to achieve ROI coding in JP2K (tiling, coefficient scaling, and code-block selection). The paper then describes how the output of the VA algorithm is post-processed so that an ROI is produced that can be efficiently coded using coefficient scaling in JP2K. Finally, a two-alternative forced choice (2AFC) visual trial is undertaken to compare the visual quality of images encoded using the proposed VA ROI algorithm and conventional JP2K. The experimental results show that, while there is no overall preference for the VA ROI encoded images, there is an improvement in perceived image quality at low bit rates (below 0.25 bits per pixel). It is concluded that an overall increase in image quality only occurs when the increase in quality of the ROI more than compensates for the decrease in quality of the image background (i.e., non-ROI).
A format-agnostic framework for content adaptation allows reaching a maximum number of users in heterogeneous multimedia environments. Such a framework typically relies on the use of scalable bitstreams. In this paper, we investigate the use of bitstreams compliant with the scalable extension of the H.264/MPEG-4 AVC standard in a format-independent framework for content adaptation. These bitstreams are scalable along the temporal, spatial, and SNR axes. To adapt these bitstreams, a format-independent adaptation engine is employed, driven by the MPEG-21 Bitstream Syntax Description Language (BSDL). MPEG-21 BSDL is a specification that allows generating high-level XML descriptions of the structure of a scalable bitstream. As such, the complexity of the adaptation of scalable bitstreams can be moved to the XML domain. Unfortunately, the current version of MPEG-21 BSDL cannot be used to describe the structure of large video bitstreams because the bitstream parsing process suffers from steadily increasing memory consumption and decreasing description generation speed. Therefore, in this paper, we describe a number of extensions to the MPEG-21 BSDL specification that make it possible to optimize the processing of bitstreams. Moreover, we also introduce a number of additional extensions necessary to describe the structure of scalable H.264/AVC bitstreams. Our performance analysis demonstrates that our extensions enable the bitstream parsing process to translate the structure of scalable bitstreams into an XML document several times faster. Further, a constant and low memory consumption is obtained during the bitstream parsing process.
The quality of real-time audio and video information transmitted via today's Internet suffers severely from often significant packet losses. While this problem is well understood and solved for existing audio coding schemes, support from the video coding standards themselves is required for video streams. This paper presents the newly introduced error resilience mechanisms built into the second version of H.263 (1998), known under its working name H.263+, and addresses the corresponding packetization format issues that together significantly improve the image quality at packet loss rates up to 20%. In particular, it is support from the video coding algorithm itself, paired with appropriate transport layer mechanisms, that leads to significant improvements of perceived image quality for communicative as well as retrieval applications at moderate bit rates up to some 100 kbit/s.
Video transcoding is one of the key technologies in implementing dynamic adaptation of the bit-rate of a coded video bit-stream to the available bandwidth over various networks. Many fast transcoder architectures have been proposed to achieve fast processing. However, they suffer from quality degradation caused by the drift error. In this paper, we investigate the drift caused by the fast transcoder architectures for transcoding H.263 bit-streams. We discuss the limitations of the fast transcoder architectures and the flexibility that can be offered by a cascaded pixel-domain transcoder. Since the cascaded pixel-domain transcoder can achieve drift-free performance, we also propose methods to reduce the computational complexity of the drift-free cascaded pixel-domain transcoder.
Real-time transmission of video data in network environments, such as wireless and Internet, is a challenging task, as it requires high compression efficiency and network-friendly design. H.264/AVC is the newest international video coding standard, jointly developed by groups from ISO/IEC and ITU-T, which aims at achieving improved compression performance and a network-friendly video representation for different types of applications, such as conversational, storage, and streaming. In this paper, we discuss various error resiliency schemes employed by H.264/AVC. The related topics such as non-normative error concealment and network environment are also described. Some experimental results are discussed to show the performance of error resiliency schemes.
A fast H.264 Intra-prediction mode selection scheme is proposed in this work. The objective is to reduce the encoder complexity without significant rate–distortion performance degradation. The proposed method uses spatial and transform domain features of the target block jointly to filter out the majority of candidate modes. This is justified by examining the posterior error probability and the average rate–distortion loss. For the final mode selection, either the feature-based or the RDO (rate–distortion optimization)-based method is applied to 2–3 candidate modes. It is demonstrated by experimental results that the proposed scheme demands only 7–10% of the complexity of the RDO (rate–distortion optimized) mode decision scheme with little quality degradation.
This paper presents a revised rate control scheme based on an improved frame complexity measure. The rate control schemes adopted by both MPEG-4 VM18 and H.264/AVC use a quadratic rate–distortion (R–D) model that determines quantization parameters (QPs). The classical quadratic R–D model is suitable for MPEG-4, but it performs poorly for H.264/AVC because one of its important parameters, the mean absolute difference (MAD), is predicted through a linear model, whereas the MAD used in MPEG-4 VM18 is the actual MAD. An inaccurately predicted MAD results in a wrong QP and consequently degrades rate–distortion optimization (RDO) performance in H.264. To overcome this limitation of existing rate control schemes, we introduce an enhanced linear model for predicting MAD that utilizes knowledge of the current frame's complexity. Moreover, we propose a more accurate frame complexity measure, namely normalized MAD, to replace the current use of the MAD parameter. Normalized MAD has a stronger correlation with optimally allocated bits than the predicted MAD. To minimize video quality variations, we also propose a novel long-term QP limiter (LTQPL). Finally, a dynamic bit allocation scheme among basic units is implemented. Extensive simulation results show that our method, at little additional computational cost, improves the average peak signal-to-noise ratio (PSNR) and reduces video quality variations considerably.
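For reference, the quadratic R–D model underlying such schemes expresses the frame bit budget T as T = X1·MAD/Q + X2·MAD/Q², which can be solved for the quantization step Q in closed form. A minimal sketch (variable names are illustrative; real rate controllers add QP clipping and buffer constraints on top of this):

```python
import math


def quantization_step(target_bits, mad, x1, x2):
    """Solve the quadratic R-D model  T = x1*MAD/Q + x2*MAD/Q**2  for Q.

    Multiplying by Q**2 and rearranging gives the quadratic
    T*Q**2 - x1*MAD*Q - x2*MAD = 0; the positive root is kept.
    """
    a, b, c = target_bits, -x1 * mad, -x2 * mad
    disc = b * b - 4 * a * c
    return (-b + math.sqrt(disc)) / (2 * a)
```

Substituting the returned Q back into the model reproduces the target bit count, which is a quick sanity check on the closed-form solution.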
A fast inter-prediction mode decision and motion search algorithm is proposed for the H.264 video coding standard. The multi-resolution motion estimation scheme and an adaptive rate-distortion model are employed with early termination rules to accelerate the search process. With the new algorithm, the amount of computation involved in the motion search can be substantially reduced. Experimental results show that the proposed algorithm can achieve a speed-up factor ranging from 60 to 150 times as compared to the full-search algorithm with little quality degradation.
In order to achieve high computational performance and low power consumption, many modern microprocessors are equipped with special multimedia instructions and multi-core processing capabilities, and the number of cores on a single chip roughly doubles every three years. Therefore, besides complexity reduction by smart algorithms such as fast macroblock mode selection, an effective algorithm for parallelizing H.264/AVC is also crucial in implementing a real-time encoder on a multi-core system. Such an algorithm serves to uniformly distribute the H.264/AVC encoding workload over several slower and simpler processor cores on a single chip. In this paper, we propose a new adaptive slice-size selection technique for efficient slice-level parallelism of H.264/AVC encoding on a multi-core processor, using fast macroblock mode selection as a pre-processing step. To this end, we propose a method for estimating the computational complexity of each macroblock from the macroblock mode pre-selection. Simulation results with a number of test video sequences show that, without any noticeable degradation, the proposed fast macroblock mode selection reduces the total encoding time by about 57.30%. The proposed adaptive slice-level parallelism also shows good parallel performance compared to conventional fixed slice-size parallelism. The proposed method can be applied to many multi-core systems for real-time H.264 video encoding.
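The adaptive slice-size idea can be sketched as a greedy partition of consecutive macroblocks into contiguous slices of roughly equal estimated complexity. A simplified illustration, not the paper's exact algorithm; the per-macroblock cost estimates are assumed to be given by the pre-processing step:

```python
def partition_slices(mb_costs, num_slices):
    """Split consecutive macroblock cost estimates into up to num_slices
    contiguous slices of roughly equal total complexity (slices must stay
    contiguous in raster order for H.264 slice-level parallelism).
    """
    total = sum(mb_costs)
    target = total / num_slices
    boundaries = []  # index of the last macroblock in each finished slice
    acc = 0.0
    for i, cost in enumerate(mb_costs):
        acc += cost
        # Cut when the running cost reaches the per-core target, as long
        # as cuts remain available and at least one macroblock is left.
        if acc >= target and len(boundaries) < num_slices - 1 \
                and i < len(mb_costs) - 1:
            boundaries.append(i)
            acc = 0.0
    slices, start = [], 0
    for b in boundaries:
        slices.append(list(range(start, b + 1)))
        start = b + 1
    slices.append(list(range(start, len(mb_costs))))
    return slices
```

With costs [1, 1, 1, 1, 2, 2] and two cores, the greedy cut lands after the fourth macroblock, giving two slices of equal estimated complexity (4 vs. 4).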
Motion estimation (ME) and compensation is the most important technique in video coding. In H.264, the motion vector (MV) of a variable size block is determined by performing the ME search procedure on integer-pixel positions, followed by fractional-pixel refinement. With fast integer-pixel ME algorithms, ME can on average be done by examining fewer than 5 search positions. With the conventional fractional-pixel ME algorithm, however, 8, 16, and 24 fractional-pixel search positions must be examined for the best MV at 1/2-, 1/4-, and 1/8-pixel accuracy, respectively. That is, the computational complexity of fractional-pixel ME becomes comparable to that of fast integer-pixel ME, so an efficient fractional-pixel ME algorithm is greatly desirable. In this study, a fast fractional-pixel ME algorithm is proposed, in which a “degenerate” quadratic function is used to precisely determine the “best” quantized predictive motion vector (PMV) at 1/4-pixel accuracy for a variable size block. Based on the partial probability distributions of the sum of absolute component differences between the best MV at 1/4-pixel accuracy determined by the conventional 2-stage full search ME algorithm and the “best” quantized PMV determined by the proposed algorithm, the search range of local fractional-pixel ME can be well determined. If the best quantized PMV determined by the proposed algorithm and that determined by the center-biased fractional-pixel search algorithm are both (0, 0), then (0, 0) is directly taken as the MV at 1/4-pixel accuracy for the variable size block, without applying the small diamond search pattern (SDSP). Otherwise, the SDSP at 1/4-pixel accuracy is used to determine the final result, and the SDSP is applied at most three times.
Based on the experimental results obtained in this study, the four ME performance measures of the proposed algorithm are better than those of the four comparison approaches, with only slight degradations in average PSNR and bit rate.
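The SDSP refinement step can be sketched as follows. A toy cost function stands in for the block SAD, and the cap of three iterations follows the description above:

```python
# Small diamond search pattern: the centre plus its four nearest
# neighbours, in quarter-pel units.
SDSP = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]


def sdsp_refine(cost, start, max_iters=3):
    """Refine a quarter-pel MV with the small diamond search pattern.

    cost: function mapping an (x, y) candidate (in quarter-pel units)
    to a matching cost such as the SAD. The search stops early when
    the diamond centre is already best, or after max_iters iterations.
    """
    best = start
    for _ in range(max_iters):
        candidates = [(best[0] + dx, best[1] + dy) for dx, dy in SDSP]
        new_best = min(candidates, key=cost)
        if new_best == best:  # centre of the diamond is already best
            break
        best = new_best
    return best
```

With a quadratic toy cost centred at (2, 1), refinement from (0, 0) walks the diamond to the minimum within the three allowed iterations.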
For entropy-coded H.264/AVC video frames, a transmission error in a codeword will not only affect the underlying codeword but also may affect subsequent codewords, resulting in a great degradation of the received video frames. In this study, an error resilient coding scheme for H.264/AVC video transmission is proposed. At the encoder, for an H.264/AVC intra-coded I frame, the important data for each macroblock (MB) are extracted and embedded into the next frame by the proposed MB-interleaving slice-based data embedding scheme for I frames. For an H.264/AVC inter-coded P frame, two types of important data with different error recovery capabilities for each MB are extracted and embedded into the next frame by the proposed MB-interleaving slice-based data embedding scheme for P frames. At the decoder, if the important data for a corrupted MB can be correctly extracted, the extracted important data for the corrupted MB will facilitate the employed error concealment scheme to conceal the corrupted MB; otherwise, the employed error concealment scheme is simply used to conceal the corrupted MB. As compared with some recent error resilient approaches based on data embedding, in this study, the important data selection mechanism for different types of MBs, the detailed data embedding mechanism, and the error detection and concealment scheme performed at the decoder are well developed to design an integrated error resilient coding scheme. Additionally, two types of important data with different transmission error recovery capabilities for each MB in P frames can provide more reliable error resiliency. Based on the simulation results obtained in this study, the proposed scheme can recover high-quality H.264/AVC video frames from the corresponding corrupted video frames up to a video packet loss rate of 20%.
Intra-frame mode selection and inter-frame mode selection are new features introduced in the H.264 standard. Intra-frame mode selection dramatically reduces spatial redundancy in I-frames, while inter-frame mode selection significantly affects the output quality of P-/B-frames by selecting an optimal block size with motion vector(s) or a mode for each macroblock. Unfortunately, this feature requires a large amount of encoding time, especially when a brute-force full-search method is utilised. In this paper, we propose fast mode selection algorithms tailored for both intra-frames and inter-frames. The success of the intra-frame algorithm is achieved by reducing the computational complexity of the Lagrangian rate-distortion optimisation evaluation. Two proposed fast inter-frame mode algorithms incorporate several robust and reliable predictive factors, including the intrinsic complexity of the macroblock, mode knowledge from the previous frame(s), temporal similarity detection and the detection of different moving features within a macroblock. This information is used to reduce the number of search operations. Extensive simulations on different classes of test sequences demonstrate a speed-up in encoding time of up to 86% compared with the H.264 benchmark. This is achieved without any significant degradation in picture quality and compression ratio.
In this work, we present a novel approach for optimizing H.264/AVC video compression by dynamically allocating computational complexity (e.g., a number of CPU clock cycles) and bits for encoding each coding element (basic unit) within a video sequence, according to its predicted MAD (mean absolute difference). Our approach is based on a computational complexity–rate–distortion (C–R–D) analysis, which adds a complexity dimension to the conventional rate–distortion (R–D) analysis. We show, both theoretically and experimentally, that the proposed dynamic allocation achieves better results, and that optimal computational complexity allocation combined with optimal bit allocation outperforms constant computational complexity allocation combined with optimal bit allocation. In addition, we present a method and system for implementing the proposed approach and for controlling computational complexity and bit allocation in real-time and off-line video coding. We divide each frame into one or more basic units, where each basic unit consists of at least one macroblock (MB) whose contents are related to a number of coding modes. We determine how much computational complexity and how many bits should be allocated for encoding each basic unit, and then allocate a corresponding group of coding modes and a quantization step-size, according to the estimated distortion of each basic unit (calculated by a linear regression model) and according to the computational complexity and bits remaining for the remaining basic units. For allocating the corresponding group of coding modes and the quantization step-size, we develop computational complexity–complexity step–rate (C–I–R) and rate–quantization step-size–computational complexity (R–Q–C) models.
The recent video coding standard H.264/AVC shows substantially higher coding efficiency than any previous standard. H.264/AVC can achieve over 50% bit-rate savings at the same quality using the rate–distortion optimization process, but this comes at high computational complexity. In this paper, we propose an algorithm that reduces the complexity of the codec by adaptively pruning the block mode decision process. The block mode decision process in H.264/AVC consists of an inter mode decision process and an intra mode decision process, and we address the reduction of both. An efficient method is proposed to reduce the inter mode decision complexity using direct prediction based on block correlation and an adaptive rate–distortion cost threshold for early stopping. A fast intra mode reduction algorithm based on inter mode information is also proposed to further reduce the computational complexity. The experimental results show that the proposed algorithm achieves a 63.34–77.39% speed-up ratio with little PSNR loss, and the increase in bit rate is barely noticeable.
In this paper a novel method is presented to detect moving objects in H.264/AVC [T. Wiegand, G. Sullivan, G. Bjontegaard, A. Luthra, Overview of the H.264/AVC video coding standard, IEEE Transactions on Circuits and Systems for Video Technology 13 (7) (2003) 560–576] compressed video surveillance sequences. Related work within the H.264/AVC compressed domain analyses the motion vector field to find moving objects. However, motion vectors are created from a coding perspective, and additional complexity is needed to clean the noisy field. Hence, an alternative approach is presented here, based on the size (in bits) of the blocks and transform coefficients used within the video stream. The system is restricted to the syntax level and achieves high execution speeds, up to 20 times faster than the related work. To show the good detection results, a detailed comparison with related work is presented for different challenging video sequences. Finally, the influence of different encoder settings is investigated to show the robustness of our system.
This paper explores design options and evaluates implementations of in-network, RTP/RTSP based adaptation MANEs (Media Aware Network Elements) for H.264/SVC content streaming. The obvious technique to be employed by such an adaptation MANE is to perform SVC-specific bitstream extraction or truncation. Another mechanism that can be used is description (metadata) driven, coding-format-independent adaptation based on generic Bitstream Syntax Descriptions (gBSD), as specified within MPEG-21 Digital Item Adaptation (DIA). Adaptation MANE architectures for both approaches are developed and presented, implemented in end-to-end streaming/adaptation prototype systems, and experimentally evaluated and compared. For the gBSD based solution, open issues like the granularity of bitstream descriptions and of bitstream adaptation, metadata overhead, metadata packetization and transport options, and error resilience in case of metadata losses are addressed. The experimental results indicate that a simple SVC-specific adaptation MANE clearly outperforms the gBSD based adaptation variants. Yet, the conceptual advantages of the description driven approach, like coding format independence and flexibility, may outweigh the performance drawbacks in specific applications.
The consideration of better motion compensation techniques for inter-frame prediction is one of the key reasons why the new H.264 (MPEG-4 AVC) video coding standard achieves considerably better coding efficiency than older standards such as MPEG-2/4 and H.263. These techniques include the use of multiple references and block sizes, a better interpolation filter for subpixel motion compensation, and more efficient exploitation of the spatio-temporal correlation between motion vectors of adjacent blocks through the consideration of SKIP and DIRECT modes. In this paper, we introduce additional methods into H.264 that further enhance motion compensation and can lead to additional improvements in coding efficiency. This is achieved by further exploiting motion vector temporal correlation through the introduction of a new DIRECT macroblock type and an enhancement to the existing SKIP macroblock type within predictive (P) slices. These new macroblock types can lead to a considerable reduction in the bits required for encoding motion information, while retaining or even improving quality under a rate–distortion optimization framework. Our simulation results suggest that the proposed improvements can lead to up to a 7.6% average bitrate reduction, or equivalently a 0.39 dB quality improvement, over the current H.264 standard.
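Temporal DIRECT-style modes derive a block's motion by scaling the co-located reference MV by temporal distances, so no motion bits need to be sent for the block. A floating-point sketch of that derivation (H.264 itself uses fixed-point scaling with clipping; names here are illustrative):

```python
def temporal_direct_mvs(colocated_mv, td, tb):
    """Derive forward/backward MVs for a direct-mode block by scaling
    the co-located block's MV.

    td: temporal distance between the two reference pictures.
    tb: temporal distance from the current picture to its forward
        reference. The backward MV follows as mv_fwd - colocated_mv.
    (Floating-point sketch; the standard uses fixed-point scaling.)
    """
    mvx, mvy = colocated_mv
    fwd = (mvx * tb / td, mvy * tb / td)
    bwd = (mvx * (tb - td) / td, mvy * (tb - td) / td)
    return fwd, bwd
```

For a co-located MV of (4, −2) with td = 2 and tb = 1, the derived forward and backward MVs point halfway along and halfway back along the co-located motion trajectory, respectively.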
This paper presents a real-time spatiotemporal segmentation approach to extract video objects in the H.264 compressed domain. The only exploited segmentation cue is the motion vector (MV) field extracted from the H.264 compressed video. The MV field is first temporally and spatially normalized and then accumulated by an iteratively backward projection scheme to enhance the salient motion. Then global motion compensation is performed on the accumulated MV field, which is also moderately segmented into different motion-homogeneous regions by a modified statistical region growing algorithm. Hypothesis testing using the block residuals of global motion compensation is employed for intra-frame classification of segmented regions, and the projection is exploited for inter-frame tracking of previous video objects. Using the above results of intra-frame classification and inter-frame tracking as input, a correspondence matrix based spatiotemporal segmentation approach is proposed to segment video objects under different situations, including appearing and disappearing objects, splitting and merging objects, stopping moving objects, multiple object tracking and scene change, in a unified and efficient way. Experimental results for several H.264 compressed video sequences demonstrate the real-time performance and good segmentation quality of the proposed approach.
H.264 is an emerging video coding standard, which aims at compressing high-quality video contents at low bit rates. While the new encoding and decoding processes are similar to many previous standards, the new standard includes a number of new features and thus requires much more computation than most existing standards do. The complexity of the H.264 standard poses significant challenges to implementing the encoder/decoder in real time in software on personal computers. This work analyzes software implementation of the H.264 encoder and decoder on general-purpose processors with media instructions and multi-threading capabilities. Specifically, we discuss how to optimize the algorithms of H.264 encoders and decoders on Intel Pentium 4 processors. We first analyze the reference implementation to identify the time-consuming modules, and present optimization methods using media instructions to improve the speed of these modules. After appropriate optimizations, the speed of the codec improves by more than 3×. Nonetheless, the H.264 encoder is still too complicated to be implemented in real time on a single processor. Thus, we also study how to partition the H.264 encoder into multiple threads, which can then be run on systems with multiple processors or multi-threading capabilities. We analyze different multi-threading schemes with different quality/performance trade-offs, and propose a scheme with good scalability (i.e., speed) and good quality. Our encoder obtains another 3.8× speedup on a four-processor system, or 4.6× on a four-processor system with Hyper-Threading Technology. This work demonstrates that hardware-specific algorithm modifications can speed up the H.264 decoder and encoder substantially. The performance improvement techniques on modern microprocessors demonstrated in this work can be applied not only to H.264, but also to other video and multimedia processing applications.
H.264/AVC is a new standard for digital video compression jointly developed by ITU-T's Video Coding Experts Group (VCEG) and ISO/IEC's Moving Picture Experts Group (MPEG). Besides its numerous tools for efficient video coding, the H.264/AVC specification defines some new error resilience tools. One of them, flexible macroblock ordering (FMO), is the main focus of this paper. An in-depth overview is given of the internals of FMO, and experiments are presented that demonstrate the benefits of FMO as an error resilience tool in the case of packet loss over IP networks. The flexibility of FMO comes with a certain overhead, or cost; a quantitative assessment of this cost is presented for a number of scenarios. Besides pure error resilience, FMO can also be used for other purposes, which this paper addresses as well.
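The FMO idea can be illustrated with a dispersed (checkerboard-style) macroblock-to-slice-group map. The sketch below is a simplified stand-in, not one of the exact map types defined in the H.264/AVC specification:

```python
def dispersed_slice_groups(mb_cols, mb_rows, num_groups):
    """Assign each macroblock (scanned row by row) to a slice group in
    a checkerboard-like pattern, so that losing one slice group's
    packet leaves every missing macroblock surrounded by received
    neighbours that the decoder can use for concealment."""
    return [(x + y) % num_groups
            for y in range(mb_rows) for x in range(mb_cols)]
```

With two slice groups, horizontally and vertically adjacent macroblocks always land in different groups, which is what makes concealment after a packet loss effective.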
In this paper, an efficient one-pass frame-level rate control algorithm is proposed for H.264/AVC, in which the two essential problems in rate control, budget allocation (BA) and quantization parameter determination (QPD), are both considered. First, an efficient BA scheme is designed with special consideration of inter-frame dependency; accordingly, the error propagation caused by improper QP assignment in the motion compensation process is reduced and the total distortion is kept at a low level. Second, a better QPD method is developed, based on an accurate rate model and a second feedback mechanism, so that high rate accuracy is guaranteed. Simulations verify the performance of the proposed algorithm. Compared with fixed-QP coding and the two recommended rate control algorithms (G012-MB and G012-Frame) in the H.264/AVC reference software, gains of up to 1.50, 1.12, and 0.94 dB, respectively, were achieved with higher rate accuracy in the coding efficiency test defined by ITU-T VCEG. A more significant improvement was observed in slow-movement scenarios: compared with G012-MB, a gain of 0.80 dB on average and 1.43 dB at maximum was obtained with a 66.0% reduction in average rate mismatch; compared with G012-Frame, the average and maximum gains are 0.34 and 1.06 dB, respectively, while the average rate mismatch was reduced by 90.4%. Given its low computational cost, the proposed algorithm is quite appealing for real-time video applications.
In this paper, we present a new adaptive video coding control for a real-time H.264/AVC encoding system. The main techniques are: (1) an initial quantization parameter (QP) decision scheme based on Laplacian of Gaussian (LoG) operators; (2) MB-level QP calculation based on spatio-temporal correlation, which requires less computation than the quadratic model used by H.264/AVC; (3) an adaptive GOP structure, in which the I-frame is adaptively replaced by an enhancement P-frame to improve coding efficiency; and (4) scene change detection based on the complexity of adjacent inter-frames, with an appropriate QP re-calculated for the scene-change frame. The proposed algorithm not only reduces computational complexity but also improves coding quality. Compared to the JM12.4 reference on various test sequences, the proposed algorithm decreases coding time by 64.5% and improves PSNR by 1.52 dB while keeping the same bit rate.
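A minimal sketch of the LoG operator used in technique (1). The kernel below is a standard sampled Laplacian-of-Gaussian; how its response is mapped to an initial QP is the paper's own contribution and is not reproduced here:

```python
import math

def log_kernel(size=5, sigma=1.0):
    """Sampled Laplacian-of-Gaussian kernel. Convolving a frame with
    this kernel measures its spatial activity; a frame with high LoG
    energy is harder to code, which is the kind of evidence an
    initial-QP decision can be based on (the exact decision rule
    is an assumption, not taken from the paper)."""
    c = size // 2
    return [[((x * x + y * y - 2 * sigma ** 2) / sigma ** 4)
             * math.exp(-(x * x + y * y) / (2 * sigma ** 2))
             for x in range(-c, c + 1)]
            for y in range(-c, c + 1)]
```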
A Connectivity-Guided Adaptive Wavelet Transform-based mesh compression framework is proposed. The transform uses the connectivity information of the 3D model to exploit inter-pixel correlations. Orthographic projection is used to convert the 3D mesh into a 2D image-like representation; this conversion does not change the connectivity among the vertices of the 3D model. The pixels of the composed image are correlated because of the connectivity of the 3D mesh, yet standard image compression tools cannot take advantage of these correlations. The proposed wavelet transform therefore uses an adaptive predictor that exploits the connectivity information of the 3D model. The wavelet-transformed data is then encoded using a zero-tree wavelet based method. Since the encoder creates a hierarchical bitstream, the proposed technique is a progressive mesh compression technique. Experimental results show that the proposed method has better rate-distortion performance than the MPEG-3DGC/MPEG-4 mesh coder.
Estimation of local second-degree variation should be a natural first step in computerized image analysis, just as it seems to be in human vision. A prevailing obstacle is that the second derivatives entangle three features: signal strength (i.e., magnitude or energy), orientation, and shape. To disentangle these features, we propose a technique in which the orientation of an arbitrary pattern f is identified with the rotation required to align the pattern with its prototype p; more strictly, this is formulated as solving the derotating equation. The set of all possible prototypes spans the shape space of second-degree variation, which is one-dimensional for 2D images and two-dimensional for 3D images. The derotation decreases the original dimensionality of the response vector from 3 to 2 in the 2D case and from 6 to 3 in the 3D case, in both cases leaving room only for magnitude and shape in the prototype. Solving the derotation, and fully understanding the result, requires (i) mapping the derivatives of the pattern f onto the orthonormal basis of spherical harmonics, and (ii) identifying the eigenvalues of the Hessian with the derivatives of the prototype p. Once the shape space is established, however, the possibilities for putting together independent discriminators for magnitude, orientation, and shape are almost limitless.
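A rough 2D sketch of the disentangling step: eigen-analysis of the Hessian separates magnitude, orientation, and a one-dimensional shape coordinate. The parameterization below (polar coordinates of the eigenvalue pair) is an illustrative choice, not the paper's spherical-harmonic formulation:

```python
import math

def disentangle(fxx, fyy, fxy):
    """Split the three second derivatives at a point into magnitude,
    orientation and shape. The orientation is the rotation that aligns
    the pattern with an axis-aligned prototype; the eigenvalues of the
    Hessian then play the role of the prototype's derivatives."""
    mean = 0.5 * (fxx + fyy)
    diff = 0.5 * (fxx - fyy)
    r = math.hypot(diff, fxy)
    lam1, lam2 = mean + r, mean - r            # Hessian eigenvalues
    magnitude = math.hypot(lam1, lam2)         # signal strength
    orientation = 0.5 * math.atan2(fxy, diff)  # rotation to the prototype
    shape = math.atan2(lam2, lam1)             # 1D shape coordinate (2D case)
    return magnitude, orientation, shape
```

For a pure second derivative in x (fxx = 2, fyy = fxy = 0) the pattern is already axis-aligned, so the orientation and shape coordinates are both zero and all the information sits in the magnitude.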
Image retrieval based on spatial content is an appealing task in many image database applications. 2D strings provide a natural way to construct spatial indexing for images and support effective picture queries. Nevertheless, the 2D string is deficient in describing the spatial knowledge of nonzero-sized objects that overlap. In this paper, we use an ordered labeled tree, the 2D C-tree, as the spatial representation for images, and propose a tree-matching algorithm for similarity retrieval. The distance between 2D C-trees is used to measure the similarity of images. The proposed tree comparison algorithm is also modified to compute a partial tree distance for subpicture queries. Experimental results verifying the effectiveness of similarity retrieval by 2D C-tree matching are presented.
In this paper, a new recognition algorithm for 2D object contours, based on the decimated wavelet transform, is presented, with emphasis on the starting-point dependency problem. The proposed matching algorithm consists of two parts. First, we present new data structures for the decimated wavelet representation and a search algorithm to estimate the misalignment between the starting points of the reference model and the unknown object; we also adopt a polynomial approximation technique and propose a fast search algorithm. Then, matching is performed in the aligned condition on the multiresolution wavelet representation. By employing a variable-rate decimation scheme, we achieve fast and accurate recognition, even in the presence of heavy noise. We provide an analysis of the computational complexity, showing that our approach requires less than 25% of the computational load of the conventional method [1]. Various experimental results on both synthetic and real imagery demonstrate the performance of the proposed algorithm. The simulation results show that the proposed algorithm successfully estimates the misalignment and classifies 2D object contours, even at an input SNR of 5 dB.
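The starting-point problem can be illustrated with a brute-force search over circular shifts of a contour signature; the paper's contribution is performing this search far faster on the decimated wavelet representation, which this sketch does not reproduce:

```python
def estimate_start_shift(ref, obs):
    """Find the circular shift of the observed contour signature that
    best aligns it with the reference, by minimizing the squared
    distance over all possible starting points. O(n^2) brute force,
    shown only to make the alignment objective concrete."""
    n = len(ref)
    def cost(s):
        return sum((ref[i] - obs[(i + s) % n]) ** 2 for i in range(n))
    return min(range(n), key=cost)
```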
We present a geometry-based indexing approach for the retrieval of video databases. It consists of two modules: 3D object shape inference from video data and geometric modeling from the reconstructed shape structure. A motion-based segmentation algorithm employing feature-block tracking and principal component splitting is used to classify and segment the motion of multiple moving objects. After segmentation, the feature blocks of each object are used to reconstruct its motion and structure through a factorization method. The estimated shape structure and motion parameters are then used to generate an implicit polynomial model of the object. Video data is retrieved using the geometric structure of objects and their spatial relationships; to compactly encode these spatial relationships, we generalize the 2D string to 3D.
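The structure-from-motion step can be sketched as a rank-3 SVD factorization in the style of Tomasi-Kanade (an assumption about which factorization method is meant; the recovered motion and shape are only defined up to an affine ambiguity):

```python
import numpy as np

def factorize(W):
    """Rank-3 factorization of a 2F x P measurement matrix (stacked
    2D feature tracks over F frames) into motion (2F x 3) and shape
    (3 x P), up to an affine ambiguity."""
    W = W - W.mean(axis=1, keepdims=True)    # register to the centroid
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    M = U[:, :3] * np.sqrt(s[:3])            # camera motion
    S = np.sqrt(s[:3])[:, None] * Vt[:3]     # 3D shape
    return M, S
```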
In this work, we make use of 3D contours and the relations between them (namely coplanarity, cocolority, distance, and angle) for four different applications in computer vision and vision-based robotics. Our multi-modal contour representation covers both geometric and appearance information. We show the potential of reasoning with global entities in the context of visual scene analysis for driver assistance, depth prediction, robotic grasping, and grasp learning. We argue that such 3D global reasoning processes complement widely used 2D local approaches such as bag-of-features, since 3D relations are invariant under camera transformations and 3D information can be directly linked to actions. We therefore stress the necessity of including both global and local features, with different spatial dimensions, within a representation. We also discuss the importance of efficient use of the uncertainty associated with the features and relations, and of their applicability in a given context.
We previously proposed a new spatio-temporal knowledge structure, the 3D C-string, to represent symbolic videos, together with string generation and video reconstruction algorithms. In this paper, we extend the idea behind similarity retrieval of images in the 2D C+-string to the 3D C-string. Our extended approach consists of two phases. First, we infer the spatial relation sequence and temporal relations for each pair of objects in a video. Second, we use the inferred relations to define various types of similarity measures and propose a similarity retrieval algorithm. By providing various types of similarity between videos, the proposed similarity retrieval algorithm has discriminating power with respect to different criteria. Finally, experiments are performed to show the efficiency of the proposed approach.
The demand for computer-assisted game study in sports is growing dramatically. This paper presents a practical video analysis system to facilitate semantic content understanding. A physics-based algorithm is designed for ball tracking and 3D trajectory reconstruction in basketball videos, from which shooting-location statistics can be obtained. The 2D-to-3D inference is intrinsically challenging because 3D information is lost in the projection to 2D frames. One significant contribution of the proposed system is an integrated scheme that incorporates domain knowledge and the physical characteristics of ball motion into object tracking to overcome this problem. With the 2D trajectory extracted and the camera parameters calibrated, the physical characteristics of ball motion are used to reconstruct the 3D trajectories and estimate the shooting locations. Our experiments on broadcast basketball videos show promising results. We believe the proposed system will greatly assist intelligence collection and statistical analysis in basketball games.
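The physics prior can be sketched as a least-squares fit of a ball's height samples to the constant-gravity model. This toy version fits a single coordinate already in world units, whereas the actual system works through the calibrated camera projection:

```python
def fit_ballistic(t, z, g=9.81):
    """Least-squares fit of height samples to the ballistic model
    z(t) = z0 + v0*t - 0.5*g*t**2, returning (z0, v0)."""
    # Subtracting the known gravity term leaves a model linear in t.
    y = [zi + 0.5 * g * ti * ti for ti, zi in zip(t, z)]
    n = len(t)
    st, sy = sum(t), sum(y)
    stt = sum(ti * ti for ti in t)
    sty = sum(ti * yi for ti, yi in zip(t, y))
    v0 = (n * sty - st * sy) / (n * stt - st * st)  # slope = launch speed
    z0 = (sy - v0 * st) / n                          # intercept = launch height
    return z0, v0
```

The recovered launch height and vertical speed are what make the shooting location estimable once the horizontal track is known.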
Efficient compression of multi-view images and videos is an open and interesting research issue that has attracted the attention of both the academic and industrial worlds in recent years. The considerable amount of information produced by multi-camera acquisition systems requires effective coding algorithms that reduce the transmitted data while granting good visual quality in the reconstructed sequence. The classical approach to multi-view coding is based on an extension of the H.264/AVC standard and still relies on motion prediction techniques. In this paper we present a novel approach that tries to fully exploit the redundancy between different views of the same scene, considering both texture and geometry information. The proposed scheme replaces the motion prediction stage with a 3D warping procedure based on depth information. After the warping step, a joint 3D-DCT encoding of all the warped views is performed, taking advantage of the strong correlation among them. Finally, the transformed coefficients are quantized and entropy coded. Occluded regions are also taken into account with ad hoc interpolation and coding strategies. Experimental results with a preliminary version of the proposed approach show that at low bitrates it outperforms the H.264 MVC coding scheme on both real and synthetic datasets. Performance at high bitrates is also satisfactory, provided that accurate depth information is available.
This paper presents a 3D structure extraction coding scheme that first computes 3D structural properties such as the 3D shape, motion, and location of objects, and then codes image sequences by utilizing this 3D information. The goal is to achieve efficient and flexible coding while avoiding visual distortions, through the use of the 3D scene characteristics inherent in image sequences. To accomplish this, we present two multiframe algorithms for the robust estimation of such 3D structural properties, one from motion and one from stereo. The approach taken in these algorithms is to successively estimate 3D information from a longer sequence for a significant reduction in error. Three variations of 3D structure extraction coding are then presented, namely 3D motion interpolative coding, 3D motion compensation coding, and "viewpoint" compensation stereo image coding, to suggest that the approach can be viable for high-quality visual communications.
Three-dimensional (3D) meshes are widely used in graphics applications to represent 3D objects, and in raw form they often require a huge amount of data for storage and/or transmission. Since most applications demand compact storage, fast transmission, and efficient processing of 3D meshes, many algorithms for compressing them efficiently have been proposed since the early 1990s. In this survey paper, we examine the 3D mesh compression technologies developed over the last decade, with the main focus on triangular mesh compression. We classify the various algorithms into classes, describe the main ideas behind each class, and compare the advantages and shortcomings of the algorithms in each class. Finally, we address some trends in 3D mesh compression technology development.
This paper proposes a technique for generating the quantization values for 3D-DCT coefficients. The distribution of AC coefficients inside a transform cube is characterized by two regions, the shifted complement hyperboloid and the shifted hyperboloid, which capture the dominant and the less significant coefficients, respectively. An exponential function is used to determine the appropriate quantization values for the two regions, and a quantization volume for the 3D-DCT is generated from this function. The paper also describes a novel procedure for deriving the scan order for the quantized 3D-DCT coefficients. The proposed quantization volume has been tested on various standard test video sequences. The experiments show that 3D-DCT video compression using the proposed quantization values produces high compression ratios with good visual quality for the reconstructed video frames; if desired, the parameter settings of the function can be further tuned for better visual quality. The proposed scan order was also found to be superior, in terms of compression ratio, to the 3D zigzag approach, an extension of the traditional 2D zigzag.
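A minimal sketch of an exponential quantization volume: the step size grows with the coefficient's distance from the DC corner, so dominant low-frequency coefficients are quantized finely and high-frequency ones coarsely. The specific function and constants below are illustrative assumptions; the paper fits its own exponential and splits the cube into two hyperboloid regions:

```python
import math

def quant_volume(n=8, q_min=8.0, alpha=0.35):
    """Build an n x n x n quantization volume for 3D-DCT coefficients.
    The step at index (u, v, w) grows exponentially with the Euclidean
    distance from the DC corner (0, 0, 0)."""
    return [[[q_min * math.exp(alpha * math.sqrt(u * u + v * v + w * w))
              for w in range(n)]
             for v in range(n)]
            for u in range(n)]
```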
This paper presents a novel hardware implementation of a disparity estimation scheme targeted to real-time Integral Photography (IP) image and video sequence compression. The software developed for IP image compression achieves high compression ratios over classic methodologies by exploiting the inherent redundancy present in IP images. However, the software approach faces time constraints that must be confronted in order to address real-time applications. Our main effort is to achieve real-time performance by implementing the most time-consuming parts of the compression algorithm in hardware. The proposed digital architecture features minimized memory read operations and extensive parallel processing, while taking into account the memory and data bandwidth limitations of a single-FPGA implementation. Our results demonstrate that the implemented hardware system can process high-resolution IP video sequences in real time, addressing a vast range of applications, from mobile systems to demanding desktop displays.
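The core computation such an architecture accelerates can be sketched in software as block matching with a sum-of-absolute-differences (SAD) search; the block size and search range below are illustrative, not taken from the paper:

```python
def disparity_sad(left, right, y, x, block=4, max_d=8):
    """Estimate the disparity of the block at (y, x) in the left view
    by searching the right view for the horizontal shift with minimum
    sum of absolute differences. This per-block SAD search is the
    regular, memory-bound kernel that maps naturally to hardware."""
    def sad(d):
        return sum(abs(left[y + i][x + j] - right[y + i][x + j - d])
                   for i in range(block) for j in range(block))
    return min(range(0, min(max_d, x) + 1), key=sad)
```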
A method for transmitting high-resolution still pictures at 64 kbit/s using subband coding and a modified H.261 video codec is proposed. It consists of applying a 4-band subband decomposition to the picture, generating four subbands with a quarter of the original resolution, and transmitting each band using a modified H.261 codec. A theoretical analysis of the interaction between DCT-based codecs and the subband analysis/synthesis is given. The proposed scheme is compared to one in which the image is divided into four parts, each transmitted using an H.261 codec alone. Simulation results show that the proposed method can reconstruct an image up to four times faster than the H.261 codec alone. In addition, the progressive build-up obtained with this method is very pleasant to the human observer.
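The 4-band split can be sketched with one level of a separable Haar analysis, the simplest filter bank with this quarter-resolution property (the paper's actual analysis/synthesis filters are not specified here):

```python
def haar_subbands(img):
    """One level of 2D Haar analysis: split an even-sized grayscale
    image into four quarter-resolution subbands LL, LH, HL, HH
    (approximation, horizontal, vertical and diagonal detail)."""
    h, w = len(img), len(img[0])
    LL, LH, HL, HH = ([[0.0] * (w // 2) for _ in range(h // 2)]
                      for _ in range(4))
    for y in range(0, h, 2):
        for x in range(0, w, 2):
            a, b = img[y][x], img[y][x + 1]
            c, d = img[y + 1][x], img[y + 1][x + 1]
            LL[y // 2][x // 2] = (a + b + c + d) / 4.0  # local average
            LH[y // 2][x // 2] = (a - b + c - d) / 4.0  # horizontal detail
            HL[y // 2][x // 2] = (a + b - c - d) / 4.0  # vertical detail
            HH[y // 2][x // 2] = (a - b - c + d) / 4.0  # diagonal detail
    return LL, LH, HL, HH
```

Transmitting the LL band first is what gives the progressive build-up the abstract mentions: a quarter-resolution preview is available before the detail bands arrive.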
IEEE 802.16 networks are designed on the differentiated services concept to provide better Quality of Service (QoS) support for a wide range of applications, from multimedia to typical web services, and therefore require a fair and efficient scheduling scheme. However, scheduling is not addressed in the standard. In this paper we present a new fair scheduling scheme that fulfills the negotiated QoS parameters of different connections while providing fairness among the connections within each class of service. The scheme models scheduling as a knapsack problem, in which a fairness parameter reflecting the specific requirements of the connections is defined and used in the optimization criterion. The proposed scheduler is evaluated through simulation in terms of delay, throughput, and fairness index. The results show that the scheduling scheme is fair to all connections while the network's guarantees for those connections are fulfilled.
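The optimization the scheme reduces to can be sketched as a 0/1 knapsack solved by dynamic programming. Here `utility` stands in for the paper's fairness parameter, whose exact form is not given in the abstract:

```python
def schedule(requests, capacity):
    """Pick the set of pending requests, each a (size, utility) pair,
    that maximizes total utility within the frame capacity. Classic
    0/1 knapsack DP with backtracking to recover the granted set."""
    n = len(requests)
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i, (size, utility) in enumerate(requests, 1):
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]          # skip request i-1
            if size <= c:                         # or grant it
                best[i][c] = max(best[i][c],
                                 best[i - 1][c - size] + utility)
    # Backtrack to recover which requests were granted.
    chosen, c = [], capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= requests[i - 1][0]
    return best[n][capacity], sorted(chosen)
```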
Object-based bit allocation can significantly improve the perceptual quality of heavily compressed video. However, real-time video object detection in large-format, high-fidelity video is computationally daunting. Most algorithms begin with extensive classical bit analysis and thus remain computationally heavy. Building on recent results in human visual perception, this paper presents an experimental visual region tracking algorithm designed specifically for perceptual stream transcoding. It exploits the cue order observed in human visual perception to achieve very high computation speed as well as tracking efficiency. Rather than starting from the pixel level (or using pixel-level processing at all), it employs high-level motion-cue and block-shape-cue analysis to identify, on the motion vector set, the signatures of relative movements between the object of interest, the scene background, and the camera, and from these it identifies objects. It then uses predictive filters to track the regions. The result is a fast yet highly effective perceptual region tracking algorithm that operates at stream rate and tracks regions of perceptually significant objects despite camera movements such as zoom, panning, and translation. The technique is not specific to any particular class of objects. We have implemented this algorithm in a live ISO-13818/MPEG-2 perceptual transcoder and report on its performance in this paper. This fast, object-aware video rate transcoder is particularly suitable for live streaming and can convert a regular stream into a perceptually coded video stream.
In many video-based applications, it is essential to precisely control the bit rate of video streams for transmission over different networks and channel bandwidths. One critical element of a bit rate control algorithm is the bit production model, which predicts the number of bits produced when a certain quantization parameter is used. In this paper, we present a novel bit allocation and rate control algorithm for compressed-domain video transcoding; the specific transcoding issue addressed is referred to as bit rate adaptation or rate shaping. We first review the architectures of different bit rate adaptation transcoders and the generic rate control problem. We then propose and formulate an approximately linear bit allocation model based on experimental results, and on top of this model we propose an adaptive bit estimation and allocation scheme for video transcoding. Implementation results show that the proposed algorithm provides accurate bit allocation, stable buffer occupancy, and improved video quality compared to existing approaches. This rate control scheme can be used to provide flexible video bit rate adaptation and stable transmission of video streams over heterogeneous networks.
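The role of a bit production model can be sketched with a simple linear form, bits ≈ k · complexity / q, whose slope is refined by feedback after each frame. The specific form and the smoothed update are illustrative assumptions, not the paper's fitted model:

```python
class LinearBitModel:
    """Running estimate of a linear bit-production model
    bits ~ k * (complexity / q), where q is the quantization
    parameter and k is learned from measured frame statistics."""

    def __init__(self, k=1.0):
        self.k = k

    def predict(self, complexity, q):
        """Bits expected for a frame at quantizer q."""
        return self.k * complexity / q

    def update(self, complexity, q, actual_bits, rate=0.5):
        """Refine k from the bits the encoder actually produced."""
        measured_k = actual_bits * q / complexity
        self.k += rate * (measured_k - self.k)  # smoothed feedback

    def pick_q(self, complexity, target_bits):
        """Invert the model to hit a per-frame bit budget."""
        return self.k * complexity / target_bits
```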
This paper introduces a cepstral approach for the automatic detection of landmines and underground utilities from acoustic and ground penetrating radar (GPR) images, treating the task as a pattern recognition problem. Cepstral features are extracted from a group of images, which are first transformed to 1-D signals by lexicographic ordering. Mel-frequency cepstral coefficients (MFCCs) and polynomial shape coefficients are extracted from these 1-D signals to form a database of features, which is used to train a neural network. Target detection is then performed by extracting features from a new image with the same method used in the training phase and testing them with the neural network to decide whether a target exists. Different transform domains are tested and compared for efficient feature extraction from the lexicographically ordered 1-D signals. Experimental results show the success of the proposed cepstral approach for landmine detection from both acoustic and GPR images at low as well as high signal-to-noise ratios (SNRs). The results also show that the discrete cosine transform (DCT) is the most appropriate domain for feature extraction.
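The front end of the described pipeline can be sketched as follows: flatten the 2D image row by row (lexicographic ordering) and keep the leading DCT-II coefficients of the resulting 1-D signal as a feature vector. This stops short of the MFCC/polynomial features and the neural network classifier, and the coefficient count is an illustrative choice:

```python
import math

def lex_to_dct_features(img, n_coeffs=8):
    """Lexicographically order a 2D image into a 1-D signal and return
    its first DCT-II coefficients as a feature vector (unnormalized)."""
    s = [v for row in img for v in row]  # lexicographic (row-major) ordering
    n = len(s)
    return [sum(s[i] * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i in range(n))
            for k in range(min(n_coeffs, n))]
```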
This paper presents a system for providing interactive broadcast services for live soccer video based on instant semantics acquisition. We have currently implemented two such interactive services: live event alert and on-the-fly language selection. The live event alert service has a small time lag: about 30 s for a short video clip of a live event to reach the viewer, and at most 1.5 min for a long clip. The on-the-fly language selection service allows users to choose their preferred content and preferred language. The motivation for this work is that such interactive services greatly increase the value of live soccer video. Similar systems currently attempt to derive the semantics of a soccer game from a gamelog in free-text format and from low-level features of the video, which is a challenging task. In this paper, we tackle this challenge with a combination of a gamelog input tool and targeted algorithms. Our system is powered by our proposed semantic gamelog input tool, which facilitates fast and accurate input of a semantic gamelog containing basic semantic information about atomic events. When an interesting event occurs, our system detects its boundaries by combining features extracted from the video with additional information from the semantic gamelog; this additional information allows the system to achieve accurate and very fast boundary detection in support of the live event alert service. Our system also implements a gamelog translation machine, which translates the semantic gamelog (encoded in a game-specific code) into any natural language for which a configuration file exists. Combining this translation machine with existing text-to-speech technology provides the on-the-fly language selection service; currently, our system supports English, Chinese, and Malay.
Top-cited authors
Artzai Picon
  • Tecnalia
David Pardo
  • Universidad del País Vasco / Euskal Herriko Unibertsitatea
Adrian Galdran
  • École de Technologie Supérieure
Aitor Alvarez-Gila
Chang-Su Kim
  • Korea University