E. Cheng

RMIT University, Melbourne, Victoria, Australia

Publications (23) · 9.26 Total Impact

  • Li Ling, E. Cheng, I.S. Burnett
    ABSTRACT: This paper proposes the use of the Iterated Extended Kalman Filter (IEKF) in a real-time 3D mapping framework applied to Microsoft Kinect RGB-D data. Standard EKF techniques typically used for 3D mapping are susceptible to errors introduced during state prediction linearization and measurement prediction. When models are highly nonlinear due to measurement errors, e.g., outliers, occlusions and feature initialization errors, these errors propagate and directly result in divergence and estimation inconsistencies. To prevent linearized error propagation, this paper proposes repeated linearization of the nonlinear measurement model to provide a running estimate of camera motion. The effects of the IEKF are experimentally simulated with synthetic map and landmark data on a range-and-bearing camera model. Results show that the IEKF measurement update outperforms the EKF update when the state causes nonlinearities in the measurement function. In a real indoor 3D mapping experiment, the IEKF demonstrated more robust convergence behaviour, whilst the EKF updates failed to converge.
    Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on; 01/2013
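The iterated measurement update this abstract describes can be sketched as follows. This is a generic IEKF sketch on a hypothetical toy model (2D position state, scalar range measurement), not the paper's Kinect pipeline; all names and values are illustrative:

```python
import numpy as np

def iekf_update(x_prior, P, z, h, H_jac, R, n_iter=5):
    """Iterated EKF measurement update: relinearize h() about the
    current iterate each pass, instead of only about the prior."""
    x_i = x_prior.copy()
    for _ in range(n_iter):
        H = H_jac(x_i)                          # Jacobian at current iterate
        S = H @ P @ H.T + R                     # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)          # Kalman gain
        # IEKF correction: innovation evaluated at x_i, plus a term
        # accounting for the distance between the prior and x_i
        x_i = x_prior + K @ (z - h(x_i) - H @ (x_prior - x_i))
    H = H_jac(x_i)
    P_post = (np.eye(len(x_prior)) - K @ H) @ P
    return x_i, P_post

# Toy example: 2D position state, scalar range measurement (hypothetical values)
h = lambda x: np.array([np.hypot(x[0], x[1])])
H_jac = lambda x: np.array([[x[0], x[1]]]) / np.hypot(x[0], x[1])
x0 = np.array([3.5, 4.5])          # prior estimate (range ~5.70)
P = np.eye(2) * 0.5
z = np.array([5.0])                # observed range
R = np.array([[0.01]])
x_post, P_post = iekf_update(x0, P, z, h, H_jac, R)
```

Each pass re-evaluates the Jacobian at the refined estimate, which is what limits the linearization error the abstract attributes to the standard single-pass EKF update.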
  • Li Ling, I.S. Burnett, E. Cheng
    ABSTRACT: Current approaches for 3D reconstruction from image feature points are classed as sparse or dense techniques. However, sparse approaches are insufficient for surface reconstruction since only sparsely distributed feature points are produced. Further, existing dense reconstruction approaches require pre-calibrated camera orientations, which limits their applicability and flexibility. This paper proposes a one-stop 3D reconstruction solution that reconstructs a highly dense surface from an uncalibrated video sequence: the camera orientations and the surface reconstruction are simultaneously computed from new dense point features using an approach motivated by Structure from Motion (SfM) techniques. Further, this paper presents a flexible automatic method with the simple interface of 'videos to 3D model'. These improvements are essential to practical applications in 3D modelling and visualization. The reliability of the proposed algorithm has been tested on various data sets, and its accuracy and performance are compared with both sparse and dense reconstruction benchmark algorithms.
    Multimedia and Expo Workshops (ICMEW), 2012 IEEE International Conference on; 01/2012
  • ABSTRACT: Multimedia is ubiquitously available online, with large amounts of video increasingly consumed through websites such as YouTube or Google Video. However, online multimedia typically limits users to audiovisual stimuli, with on-screen visual media accompanied by audio. The recently introduced MPEG-V standard proposes multi-sensory user experiences in multimedia environments, such as enriching video content with so-called sensory effects like wind, vibration and light. In MPEG-V, these sensory effects are represented as Sensory Effect Metadata (SEM), which is additionally associated with the multimedia content. This paper presents three user studies that utilize the sensory effects framework of MPEG-V, investigating users' emotional responses and the enhancement of Quality of Experience (QoE) for Web video sequences from a range of genres, with and without sensory effects. In particular, the user studies were conducted in Austria and Australia to investigate whether geographic and cultural differences affect users' elicited emotional responses and QoE.
    Quality of Multimedia Experience (QoMEX), 2012 Fourth International Workshop on; 01/2012
  • ABSTRACT: This paper investigates how minimal user interaction paradigms and markerless image recognition technologies can be applied to matching print media content to online digital proofs. By linking print material to online content, users can enhance their experience of traditional print media with updated online content, videos, interactive online features, etc. The proposed approach extracts features from images/text captured by a mobile device camera to form 'fingerprints' that are used to find matching images/text within a limited test set. An important criterion for these applications is to ensure that the user Quality of Experience (QoE), particularly in terms of matching accuracy and time, is robust to the variety of conditions typically encountered in practical scenarios. In this paper, the performance of a number of computer vision techniques that extract the image features and form the fingerprints is analysed and compared. Both computer simulation tests and mobile device experiments in realistic user conditions are conducted to study the effectiveness of the techniques under the scale, rotation, blur and lighting variations typically encountered by a user.
    Quality of Multimedia Experience (QoMEX), 2012 Fourth International Workshop on; 01/2012
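As a minimal illustration of image fingerprinting and matching by distance, the sketch below computes a difference hash ("dHash") in NumPy and compares fingerprints by Hamming distance. This is a generic technique chosen for brevity, not necessarily one of the descriptors the paper evaluates; the images and the noise model are hypothetical:

```python
import numpy as np

def dhash(img, hash_size=8):
    """Difference hash: block-average the image down to
    hash_size x (hash_size + 1), then compare horizontally adjacent
    cells to form a binary fingerprint of hash_size**2 bits."""
    h, w = img.shape
    rows = np.linspace(0, h, hash_size + 1, dtype=int)
    cols = np.linspace(0, w, hash_size + 2, dtype=int)
    small = np.array([[img[rows[i]:rows[i+1], cols[j]:cols[j+1]].mean()
                       for j in range(hash_size + 1)]
                      for i in range(hash_size)])
    bits = small[:, 1:] > small[:, :-1]     # gradient direction per cell
    return bits.flatten()

def hamming(f1, f2):
    """Number of differing fingerprint bits."""
    return int(np.count_nonzero(f1 != f2))

# Hypothetical usage: a mildly distorted copy should match far better
# than an unrelated image
rng = np.random.default_rng(0)
img = rng.random((64, 64))
noisy = img + rng.normal(0, 0.02, img.shape)   # mild lighting-like distortion
other = rng.random((64, 64))
d_same = hamming(dhash(img), dhash(noisy))
d_diff = hamming(dhash(img), dhash(other))
```

Because the hash encodes only local gradient directions over coarse blocks, it is inherently tolerant of the blur and lighting variations the abstract mentions, though (unlike the paper's feature-based techniques) this simple form is not rotation-invariant.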
  • ABSTRACT: There has been much recent interest, from both industry and research communities, in 3D video technologies and processing techniques. However, whilst the standardisation of 3D video coding is well underway and researchers are studying 3D multimedia delivery and users' quality of multimedia experience in 3D video environments, few publicly available databases of 3D video content exist. Further, there are even fewer sources of uncompressed 3D video content for flexible use in research studies and applications. This paper thus presents a preliminary version of RMIT3DV: an uncompressed HD 3D video database currently composed of 31 video sequences that encompass a range of environments, lighting conditions, textures, motion, etc. The database was natively filmed on a professional HD 3D camera, and this paper describes the 3D film production workflow in addition to the database distribution and potential future applications of the content. The database is freely available online under a Creative Commons license, and researchers are encouraged to contribute 3D content to grow the resource for the (HD) 3D video research community.
    Quality of Multimedia Experience (QoMEX), 2012 Fourth International Workshop on; 01/2012
  • 4th Semantic Ambient Media Experience (SAME) Workshop, in conjunction with the 5th International Conference on Communities and Technologies; 06/2011
  • Advances in Sound Localization, 04/2011; ISBN: 978-953-307-224-1
  • Eva Cheng, Ian S. Burnett
    ABSTRACT: The recent ubiquity of mobile telephony has posed the challenge of forensic speech analysis on compressed speech content. Whilst existing research studies have investigated the effect of mobile speech compression on speaker and speech parameters, this paper addresses the effect of speech compression on parameters when an interfering background speaker is present in clean and noisy conditions. Preliminary evaluations presented in this paper study the effect of the Adaptive Multi-Rate (AMR) and Adaptive Multi-Rate Wideband (AMR-WB) speech coders on the Linear Prediction (LP) speech spectrum, Line Spectral Frequencies (LSFs), and Mel Frequency Cepstral Coefficients (MFCCs). Results indicate that due caution should be employed for the forensic analysis of mobile telephony speech: speech coder parameters are significantly degraded when an interfering speaker or noise is present, compared to parameters obtained from the main speaker alone. Moreover, at high SNR the speech parameters exhibit values that gradually transition from those ideally and independently obtained from the main speaker to those of the background speaker as the amplitude of the background interfering speaker increases.
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011, May 22-27, 2011, Prague Congress Center, Prague, Czech Republic; 01/2011
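One of the parameter sets compared in this study, MFCCs, can be computed per frame as sketched below. This is the textbook pipeline (power spectrum, mel filterbank, log, DCT-II) in NumPy with illustrative sizes, not the paper's exact analysis configuration:

```python
import numpy as np

def mfcc_frame(frame, fs=8000, n_filt=20, n_ceps=12):
    """MFCCs for one frame: Hamming window -> power spectrum ->
    mel-spaced triangular filterbank -> log energies -> DCT-II."""
    n_fft = len(frame)
    mag = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2
    # Mel-spaced filterbank edges in Hz
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = imel(np.linspace(mel(0.0), mel(fs / 2.0), n_filt + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_filt, len(mag)))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising slope
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling slope
    logE = np.log(fbank @ mag + 1e-10)
    # DCT-II of the log filterbank energies gives the cepstral coefficients
    n = np.arange(n_filt)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filt))
    return dct @ logE

# Hypothetical usage: one 32 ms frame of a 440 Hz tone at 8 kHz
frame = np.sin(2 * np.pi * 440 * np.arange(256) / 8000)
ceps = mfcc_frame(frame)
```

In a study like the one above, such coefficients would be extracted from the original and the AMR/AMR-WB coded speech and then compared, e.g. by per-coefficient distance, to quantify coding degradation.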
  • ABSTRACT: Efficient content-based access to large multimedia collections requires annotations that are human-meaningful, and user tagging of media is one means to obtain such semantic metadata. Tags can also act as user feedback essential for quality of multimedia experience assessment; however, tags can lack user context and become ambiguous between different users. Further, user tagging is a deliberate and discrete event, and a user's response to the media can vary significantly between tagging events. This paper extends the authors' social multimedia adaptation framework to explore the use of EEG biosignals obtained from consumer EEG headsets, both to form context around explicit tagging activities and as user emotional feedback in-between tagging events. Preliminary user studies investigating grouped participant responses indicate that the most indicative emotional states are short-term excitement, engagement and frustration, in addition to gyroscope information.
    Third International Workshop on Quality of Multimedia Experience, QoMEX 2011, Mechelen, Belgium, September 7-9, 2011; 01/2011
  • Li Ling, Ian S. Burnett, Eva Cheng
    ABSTRACT: This paper proposes a flexible, markerless registration method that addresses the problem of realistic virtual object placement at any position in a video sequence. The registration consists of two steps: first, four points are specified by the user to build the world coordinate system in which the virtual object is rendered; second, a proposed self-calibration camera tracking algorithm recovers the camera viewpoint frame-by-frame, such that the virtual object can be dynamically and correctly rendered according to camera movement. The proposed registration method needs no reference fiducials and no knowledge of the camera parameters or the user environment; the virtual object can be placed in any environment, even one without distinct features. Experimental evaluations demonstrate low errors for several camera rotations around the X and Y axes for the self-calibration algorithm. Finally, virtual object rendering applications in different user environments are evaluated.
    IEEE 13th International Workshop on Multimedia Signal Processing (MMSP 2011), Hangzhou, China, October 17-19, 2011; 01/2011
  • Li Ling, Eva Cheng, Ian S. Burnett
    ABSTRACT: This paper considers a self-calibration approach to the estimation of motion parameters for an unknown camera used for video-based augmented reality. Whilst existing systems derive four SVD solutions of the essential matrix, which encodes the epipolar geometry between two camera views, this paper presents eight possible solutions derived from mathematical computation and geometrical analysis. The eight solutions reflect not only the position and orientation of the camera in static displacement but also the dynamic, relative orientation between the camera and an object in continuous motion. This paper details a novel algorithm that introduces three geometric constraints to determine the rotation and translation matrix from the eight possible essential matrix solutions. An OpenGL camera motion simulator is used to demonstrate and evaluate the reliability of the proposed algorithms; this directly visualizes the abstract computer vision parameters in real 3D. Although Wang et al. [2] introduced the possibility of eight essential matrix solutions, the relative geometrical meaning of the eight possible solutions and the application environments of these solutions were not explored. Moreover, existing research has yet to address a complete graphical interpretation and develop an algorithm to find the correct rotation and translation matrix from the multiple essential matrix solutions. The resulting ambiguity of the camera position and orientation is a significant obstacle to the development of virtual world interaction in augmented reality technology. This paper presents mathematical and geometrical explanations of all possible relative camera/object orientations and presents a novel algorithm for the derivation of the correct essential matrix solution. An OpenGL camera motion simulator that demonstrates the reliability of the proposed eight essential matrix solutions has also been developed and evaluated.
    Proceedings of the 2011 IEEE International Conference on Multimedia and Expo, ICME 2011, 11-15 July, 2011, Barcelona, Catalonia, Spain; 01/2011
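For context, the classical SVD decomposition of an essential matrix yields the four (R, t) candidates that this paper extends to eight. A sketch of the standard four-solution decomposition, on a synthetic motion with hypothetical values, is:

```python
import numpy as np

def decompose_essential(E):
    """Standard SVD decomposition of an essential matrix into the four
    classical (R, t) candidates; the paper above argues for eight when
    relative object motion is also considered."""
    U, _, Vt = np.linalg.svd(E)
    # Enforce proper rotations (determinant +1)
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt
    t = U[:, 2]                      # translation direction, up to sign/scale
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]

def skew(t):
    """Cross-product (skew-symmetric) matrix of a 3-vector."""
    return np.array([[0, -t[2], t[1]], [t[2], 0, -t[0]], [-t[1], t[0], 0]])

# Build a synthetic E = [t]_x R from a known motion and recover candidates
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.],
                   [np.sin(theta),  np.cos(theta), 0.],
                   [0., 0., 1.]])
t_true = np.array([1.0, 0.5, 0.2])
t_true /= np.linalg.norm(t_true)
E = skew(t_true) @ R_true
candidates = decompose_essential(E)
```

In the classical pipeline the correct candidate is then chosen by a cheirality (points-in-front-of-both-cameras) check; the paper instead proposes geometric constraints to disambiguate its larger solution set.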
  • ABSTRACT: With new social media technologies arising daily, this paper reports on a pilot user survey that studies how tertiary-educated users are engaging with social media. The results indicate sporadic use of social media by the tertiary-educated users studied; they are generally aware of the key social media sites and facilities, but are not actively utilizing these services. The reasons for, and implications of, this lack of engagement in the egalitarian and participatory environments intrinsic to social media are discussed. Further, the paper suggests potential technological barriers that might be at the root of such a lack of engagement amongst tertiary-educated users.
    Technology and Society (ISTAS), 2010 IEEE International Symposium on; 07/2010
  • ABSTRACT: Teleconferencing systems are becoming increasingly realistic and pleasant for users interacting with geographically distant meeting participants. Video screens display a complete view of the remote participants, using technology such as wraparound or multiple video screens. However, the corresponding audio does not offer the same sophistication: often only a mono or stereo track is presented. This paper proposes a teleconferencing audio recording and playback paradigm that captures the spatial location of the geographically distributed participants for rendering of the remote soundfields at the user's end. Utilizing standard 5.1 surround sound playback, this paper proposes a surround rendering approach that 'squeezes' the multiple soundfields recorded at remote teleconferencing sites, to help the user disambiguate multiple speakers from different participating sites.
    Telecommunication Networks and Applications Conference, 2008. ATNAC 2008. Australasian; 01/2009
  • E. Cheng, I.S. Burnett, C. Ritz
    ABSTRACT: Effective and efficient access to multiparty meeting recordings requires techniques for meeting analysis and indexing. Since meeting participants are generally stationary, speaker location information may be used to identify meeting events, e.g., to detect speaker changes. Time-delay estimation (TDE) utilizing cross-correlation of multichannel speech recordings is a common approach for deriving speech source location information. Recent research improved TDE by calculating it from linear prediction (LP) residual signals obtained from LP analysis of each individual speech channel. This paper investigates the use of LP residuals for speech TDE where the residuals are obtained by jointly modeling the multiple speech channels. Experiments conducted with a simulated reverberant room and real room recordings show that jointly modeled LP better predicts the LP coefficients compared to LP applied to individual channels. The individually and jointly modeled LP exhibit similar TDE performance, and both outperform TDE on the speech alone, especially with the real recordings.
    Signal-Image Technologies and Internet-Based System, 2007. SITIS '07. Third International IEEE Conference on; 01/2008
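The basic pipeline above — LP analysis per channel, then cross-correlating the residuals to estimate the time delay — can be sketched as follows. This is a generic single-pair illustration on a synthetic AR signal, not the paper's jointly modeled variant or its room setup; the delay and model orders are hypothetical:

```python
import numpy as np

def lp_residual(x, order=10):
    """LP analysis via the autocorrelation (Yule-Walker) method; returns
    the prediction error e[n] = x[n] - sum_k a_k x[n-k]."""
    r = np.correlate(x, x, 'full')[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])           # LP coefficients
    pred = np.convolve(x, np.concatenate(([0.], a)))[:len(x)]
    return x - pred

def tde(x1, x2, order=10):
    """Time-delay estimate (in samples) from the cross-correlation peak
    of the two channels' LP residuals."""
    e1, e2 = lp_residual(x1, order), lp_residual(x2, order)
    xc = np.correlate(e1, e2, 'full')
    return int(np.argmax(xc)) - (len(e2) - 1)

# Synthetic test: a speech-like AR(2) signal observed at two 'microphones'
# with a 7-sample relative delay
rng = np.random.default_rng(1)
exc = rng.normal(size=2000)
sig = np.zeros(2000)
for n in range(2, 2000):                             # simple AR(2) resonance
    sig[n] = 1.3 * sig[n - 1] - 0.8 * sig[n - 2] + exc[n]
delay = 7
x1, x2 = sig[:-delay], sig[delay:]                   # x1[n] = x2[n - 7]
```

The LP residual is close to the white excitation, so its cross-correlation peak is much sharper than that of the raw (strongly autocorrelated) speech — which is the motivation for residual-domain TDE discussed in the abstract.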
  • ABSTRACT: Recent research in speech localization and dereverberation introduced processing of the multichannel linear prediction (LP) residual of speech recorded with multiple microphones. This paper investigates the novel use of intra- and inter-channel speech prediction by proposing a multichannel LP model derived from multivariate autoregression (MVAR), where current LP approaches are based on univariate autoregression (AR). Experiments were conducted on simulated anechoic and reverberant synthetic speech vowels and real speech sentences; results show that, especially at low reverberation times, the MVAR model exhibits greater prediction gains over the residual signal compared to residuals obtained from univariate AR models for individually or jointly modelled speech channels. In addition, the MVAR model more accurately models the speech signal when compared to univariate LP of a similar prediction order, and when a smaller number of microphones is deployed.
    International Workshop on Multimedia Signal Processing, MMSP 2008, October 8-10, 2008, Shangri-la Hotel, Cairns, Queensland, Australia; 01/2008
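A minimal sketch of MVAR-style multichannel prediction: matrix coefficients are fitted jointly over all channels by least squares, so each channel is predicted from the past of every channel. The two-channel construction and prediction order below are hypothetical, chosen so the inter-channel redundancy is obvious:

```python
import numpy as np

def mvar_residual(X, order=2):
    """Multichannel LP: jointly predict each sample vector x[n] (one entry
    per channel) from the `order` previous vectors, via least squares."""
    C, N = X.shape
    # Regressor matrix: the past `order` frames of ALL channels, stacked
    Phi = np.vstack([X[:, order - k - 1:N - k - 1] for k in range(order)])
    Y = X[:, order:]                       # prediction targets
    A = Y @ np.linalg.pinv(Phi)            # (C, C*order) matrix coefficients
    return Y - A @ Phi                     # multichannel residual

# Two synthetic 'microphone' channels sharing one AR source: channel 2 is
# channel 1 delayed by one sample, so it is fully inter-channel predictable
rng = np.random.default_rng(2)
e = rng.normal(size=1000)
s = np.zeros(1000)
for n in range(2, 1000):
    s[n] = 1.2 * s[n - 1] - 0.6 * s[n - 2] + e[n]
X = np.vstack([s, np.roll(s, 1)])
res = mvar_residual(X, order=2)
gain = 10 * np.log10(np.var(X[:, 2:]) / np.var(res))   # prediction gain, dB
```

Because the regressors include the other channel's past, the delayed channel is predicted almost exactly, illustrating the inter-channel gains the abstract reports over univariate per-channel LP.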
  • Faculty of Creative Arts - Papers. 01/2008;
  • E. Cheng, I. Burnett, C. Ritz
    ABSTRACT: Multiparty meetings generally involve stationary participants. Participant location information can thus be used to segment the recorded meeting speech into each speaker's 'turn' for meeting 'browsing'. To represent speaker location information from speech, previous research showed that the most reliable time delay estimates are extracted from the Hilbert envelope of the linear prediction residual signal. The authors' past work proposed the use of spatial audio cues to represent speaker location information. This paper proposes extracting spatial audio cues from the Hilbert envelope of the speech residual to indicate changing speaker location for meeting speech segmentation. Experiments conducted on recordings of a real acoustic environment show that spatial cues from the Hilbert envelope are more consistent across frequency subbands and can clearly distinguish between spatially distributed speakers, compared to spatial cues estimated from the recorded speech or residual signal.
    Signal Processing, 2006 8th International Conference on; 02/2006
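The Hilbert envelope used above is the magnitude of the analytic signal. A minimal sketch of the generic FFT-based construction, demonstrated on a synthetic amplitude-modulated tone rather than the paper's meeting recordings:

```python
import numpy as np

def hilbert_envelope(x):
    """Magnitude of the analytic signal, via the frequency-domain
    construction: zero the negative frequencies, double the positive."""
    N = len(x)
    X = np.fft.fft(x)
    h = np.zeros(N)
    h[0] = 1.0
    if N % 2 == 0:
        h[N // 2] = 1.0       # Nyquist bin kept once for even N
        h[1:N // 2] = 2.0
    else:
        h[1:(N + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(X * h))

# The envelope of an amplitude-modulated tone should track the modulator
t = np.arange(0, 1, 1 / 1000)                    # 1 s at 1 kHz (hypothetical)
modulator = 1.0 + 0.5 * np.cos(2 * np.pi * 3 * t)
carrier = np.sin(2 * np.pi * 100 * t)
env = hilbert_envelope(modulator * carrier)
```

Applied to an LP residual, this envelope emphasises the glottal excitation peaks, which is why (per the abstract) cues derived from it are more robust than cues taken from the raw residual.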
  • ABSTRACT: This paper explores user-centered metadata delivery through the example of hierarchically organized meeting audio metadata. Audio annotations that describe meeting scenarios can vary from low-level signal-based descriptors to high-level semantics. Users of meeting metadata also have widely varying requirements, and hence want metadata at varying levels of detail. Thus, for efficient metadata access, it is vital to provide customization or choice of the metadata to be delivered, using, e.g., regions of interest and annotation detail specification. As well as proposing a user-centered metadata organization strategy, this paper introduces the use of a bi-directional XML protocol for metadata delivery. The combination provides advantages in terms of bandwidth efficiency, as examined with an example meeting metadata browser application with practical user interfaces.
  • ABSTRACT: Meetings, common to many business environments, generally involve stationary participants. Thus, participant location information can be used to segment meeting speech recordings into each speaker's 'turn'. The authors' previous work proposed the use of spatial audio cues to represent the speaker locations. This paper studies the validity of using spatial audio cues for meeting speech segmentation by investigating the effect of varying the microphone pattern on the spatial cues. Experiments conducted on recordings of a real acoustic environment indicate that the relationship between speaker location and spatial audio cues strongly depends on the microphone pattern.
    Advances in Multimedia Information Processing - PCM 2006, 7th Pacific Rim Conference on Multimedia, Hangzhou, China, November 2-4, 2006, Proceedings; 01/2006
  • J. Lukasiak, C. McElroy, E. Cheng
    ABSTRACT: A new low level audio descriptor that represents the psychoacoustic noise floor shape of an audio frame is proposed. Results presented indicate that the proposed descriptor is far more resilient to compression noise than any of the MPEG-7 low level audio descriptors. In fact, across a wide range of files, on average the proposed scheme fails to uniquely identify only five frames in every ten thousand. In addition, the proposed descriptor maintains a high resilience to compression noise even when decimated to use only one quarter of the values per frame to represent the noise floor. This characteristic indicates the proposed descriptor presents a truly scalable mechanism for transparently describing the characteristics of an audio frame.
    Multimedia and Expo, 2005. ICME 2005. IEEE International Conference on; 08/2005

Publication Stats

27 Citations
9.26 Total Impact Points


  • 2010–2013
    • RMIT University
      • School of Electrical and Computer Engineering
      Melbourne, Victoria, Australia
  • 2005–2009
    • University of Wollongong
      • School of Electrical, Computer and Telecommunications Engineering (SECTE)
      City of Greater Wollongong, New South Wales, Australia