Eva Cheng

RMIT University, Melbourne, Victoria, Australia


Publications (26) · 10.33 Total Impact

  • ABSTRACT: This note examines the relative importance of room shape and fine surface structures for the sound quality inside small meeting rooms, in terms of reverberation time, sound field distribution and speech transmission index, with similar room volume, surface area and absorption coefficients. First, differently shaped rooms with smooth walls are modeled and simulated to investigate the effect of room shape on sound quality; hyperboloid cells are then added to the walls, in both regular and random arrangements, to examine the influence of fine structural surfaces. It is found that the reverberation time is affected significantly by the room shape but is not sensitive to the hyperboloid cells. The sound field distribution is affected little by either the room shape or the hyperboloid cells, and the difference is smaller than the Just-Noticeable-Difference in most cases. The impact of room shape and fine structural surface on the speech transmission index lies mainly in the transition area between the direct and reverberant sound. The reliability of the simulation results is confirmed by experiments carried out in two different meeting rooms. The main conclusion of the note is that when room volume, surface area and absorption coefficients are kept constant, room shape and fine structural surface have little impact on the sound field distribution and speech intelligibility inside small rooms with ordinary surface absorption, while the reverberation time is affected significantly by room shape but only slightly by the fine structural surface.
    Applied Acoustics 06/2015; 93. DOI:10.1016/j.apacoust.2015.01.020 · 1.07 Impact Factor
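As a rough point of reference for the quantities this note holds constant (volume, surface area, absorption), the classical Sabine formula ties them to reverberation time. The sketch below is illustrative only, with made-up room numbers; it is not the detailed simulation method used in the paper.

```python
# Sabine's reverberation-time estimate: RT60 = 0.161 * V / (S * alpha).
# Illustrative only; the paper relies on detailed room-acoustic simulation.
def rt60_sabine(volume_m3: float, surface_m2: float, alpha: float) -> float:
    """Estimate RT60 (seconds) for a room with uniform absorption alpha."""
    return 0.161 * volume_m3 / (surface_m2 * alpha)

# Rooms with equal volume, surface area and absorption share the same
# Sabine RT60 regardless of shape -- consistent with the note's premise.
print(round(rt60_sabine(volume_m3=60.0, surface_m2=94.0, alpha=0.2), 2))
```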
  • Li Ling, E. Cheng, I.S. Burnett
    ABSTRACT: This paper proposes the use of the Iterated Extended Kalman Filter (IEKF) in a real-time 3D mapping framework applied to Microsoft Kinect RGB-D data. Standard EKF techniques typically used for 3D mapping are susceptible to errors introduced during state prediction linearization and measurement prediction. When models are highly nonlinear due to measurement errors (e.g., outliers and occlusions) and feature initialization errors, the errors propagate and directly result in divergence and estimation inconsistencies. To prevent linearized error propagation, this paper proposes repeated relinearization of the nonlinear measurement model to provide a running estimate of camera motion. The IEKF is experimentally evaluated with synthetic map and landmark data on a range-and-bearing camera model; the IEKF measurement update is shown to outperform the EKF update when the state causes nonlinearities in the measurement function. In a real indoor 3D mapping experiment, the IEKF demonstrated more robust convergence behavior, whilst the EKF updates failed to converge.
    2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 01/2013
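The iterated measurement update at the heart of the IEKF can be sketched in a few lines. The scalar example below (quadratic measurement h(x) = x² with made-up numbers) is a generic textbook IEKF illustration, not the authors' Kinect implementation.

```python
def iekf_update(x_pred, P, z, R, h, H_jac, n_iter=10):
    """Iterated EKF measurement update for a scalar state.
    Relinearises h() about the running estimate x_i instead of x_pred."""
    x_i = x_pred
    for _ in range(n_iter):
        H = H_jac(x_i)                 # Jacobian at the current iterate
        S = H * P * H + R              # innovation covariance
        K = P * H / S                  # Kalman gain
        # IEKF correction: innovation evaluated at x_i, not x_pred
        x_i = x_pred + K * (z - h(x_i) - H * (x_pred - x_i))
    return x_i, (1.0 - K * H) * P

# Strongly nonlinear measurement h(x) = x**2 with a poor prior (true x = 3):
h = lambda x: x ** 2
H_jac = lambda x: 2.0 * x
x_iekf, _ = iekf_update(x_pred=1.0, P=1.0, z=9.0, R=0.01, h=h, H_jac=H_jac)
x_ekf, _ = iekf_update(x_pred=1.0, P=1.0, z=9.0, R=0.01, h=h, H_jac=H_jac,
                       n_iter=1)      # n_iter=1 reduces to the standard EKF
print(round(x_iekf, 2), round(x_ekf, 2))  # IEKF converges near 3, EKF overshoots
```

With one iteration the update is the ordinary EKF and overshoots badly; iterating the linearization pulls the estimate close to the true state, mirroring the behaviour the abstract reports.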
  • Li Ling, I.S. Burnett, Eva Cheng
    ABSTRACT: Current approaches for 3D reconstruction from feature points of images are classed as sparse and dense techniques. However, sparse approaches are insufficient for surface reconstruction since they yield only sparsely distributed feature points. Further, existing dense reconstruction approaches require pre-calibrated camera orientation, which limits their applicability and flexibility. This paper proposes a one-stop 3D reconstruction solution that reconstructs a highly dense surface from an uncalibrated video sequence; the camera orientations and surface reconstruction are simultaneously computed from new dense point features using an approach motivated by Structure from Motion (SfM) techniques. Further, this paper presents a flexible automatic method with the simple interface of 'videos to 3D model'. These improvements are essential to practical applications in 3D modeling and visualization. The reliability of the proposed algorithm has been tested on various data sets, and the accuracy and performance are compared with both sparse and dense reconstruction benchmark algorithms.
    2012 IEEE International Conference on Multimedia and Expo Workshops (ICMEW); 01/2012
  • ABSTRACT: This paper investigates how minimal user interaction paradigms and markerless image recognition technologies can be applied to matching print media content to online digital proofs. By linking print material to online content, users can enhance their experience of traditional forms of print media with updated online content, videos, interactive online features etc. The proposed approach is based on extracting features from images/text in mobile device camera images to form 'fingerprints' that are used to find matching images/text within a limited test set. An important criterion for these applications is to ensure that the user Quality of Experience (QoE), particularly in terms of matching accuracy and time, is robust to a variety of conditions typically encountered in practical scenarios. In this paper, the performance of a number of computer vision techniques that extract the image features and form the fingerprints are analysed and compared. Both computer simulation tests and mobile device experiments in realistic user conditions are conducted to study the effectiveness of the techniques when considering scale, rotation, blur and lighting variations typically encountered by a user.
    Fourth International Workshop on Quality of Multimedia Experience (QoMEX 2012); 01/2012
  • ABSTRACT: There has been much recent interest, from both industry and research communities, in 3D video technologies and processing techniques. However, with the standardisation of 3D video coding well underway and researchers studying 3D multimedia delivery and users' quality of multimedia experience in 3D video environments, there exist few publicly available databases of 3D video content. Further, there are even fewer sources of uncompressed 3D video content for flexible use in a number of research studies and applications. This paper thus presents a preliminary version of RMIT3DV: an uncompressed HD 3D video database currently composed of 31 video sequences that encompass a range of environments, lighting conditions, textures, motion, etc. The database was natively filmed on a professional HD 3D camera, and this paper describes the 3D film production workflow in addition to the database distribution and potential future applications of the content. The database is freely available online under a Creative Commons license, and researchers are encouraged to contribute 3D content to grow the resource for the (HD) 3D video research community.
    Fourth International Workshop on Quality of Multimedia Experience (QoMEX 2012); 01/2012
  • ABSTRACT: Multimedia is ubiquitously available online, with large amounts of video increasingly consumed through Web sites such as YouTube or Google Video. However, online multimedia typically limits users to visual/auditory stimulus, with onscreen visual media accompanied by audio. The recent introduction of MPEG-V proposed multi-sensory user experiences in multimedia environments, such as enriching video content with so-called sensory effects like wind, vibration, light, etc. In MPEG-V, these sensory effects are represented as Sensory Effect Metadata (SEM), which is additionally associated with the multimedia content. This paper presents three user studies that utilize the sensory effects framework of MPEG-V, investigating the emotional response of users and the enhancement of Quality of Experience (QoE) for Web video sequences from a range of genres with and without sensory effects. In particular, the user studies were conducted in Austria and Australia to investigate whether geographic and cultural differences affect users' elicited emotional responses and QoE.
    Fourth International Workshop on Quality of Multimedia Experience (QoMEX 2012); 01/2012
  • 4th Semantic Ambient Media Experience (SAME) Workshop, in conjunction with the 5th International Conference on Communities and Technologies; 06/2011
  • Advances in Sound Localization, 04/2011; ISBN: 978-953-307-224-1
  • Li Ling, Eva Cheng, Ian S. Burnett
    ABSTRACT: This paper considers a self-calibration approach to the estimation of motion parameters for an unknown camera used for video-based augmented reality. Whilst existing systems derive four SVD solutions of the essential matrix, which encodes the epipolar geometry between two camera views, this paper presents eight possible solutions derived from mathematical computation and geometrical analysis. The eight solutions reflect not only the position and orientation of the camera in static displacement but also the dynamic, relative orientation between the camera and an object in continuous motion. This paper details a novel algorithm that introduces three geometric constraints to determine the rotation and translation matrix from the eight possible essential matrix solutions. An OpenGL camera motion simulator is used to demonstrate and evaluate the reliability of the proposed algorithms; this directly visualizes the abstract computer vision parameters in real 3D. Although Wang et al. [2] introduced the possibility of eight essential matrix solutions, the relative geometrical meaning of the eight possible solutions and the application environments of these solutions were not explored. Moreover, existing research has yet to provide a complete graphical interpretation or an algorithm to find the correct rotation and translation matrix from the multiple essential matrix solutions. The resulting ambiguity of camera position and orientation is a significant obstacle to the development of virtual world interaction in augmented reality technology.
    Proceedings of the 2011 IEEE International Conference on Multimedia and Expo, ICME 2011, 11-15 July, 2011, Barcelona, Catalonia, Spain; 01/2011
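For context, the standard four-solution SVD factorisation of the essential matrix that this paper extends to eight solutions can be sketched as follows. The motion below is synthetic and the code is the textbook decomposition, not the authors' algorithm.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# A known relative motion: rotation about Z, unit-norm translation.
theta = np.deg2rad(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([1.0, 0.5, 0.2])
t_true /= np.linalg.norm(t_true)
E = skew(t_true) @ R_true            # essential matrix E = [t]_x R

# SVD factorisation; fix signs so U and V are proper rotations.
U, _, Vt = np.linalg.svd(E)
if np.linalg.det(U) < 0:
    U = -U
if np.linalg.det(Vt) < 0:
    Vt = -Vt
W = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
candidates = [(R_c, t_c)
              for R_c in (U @ W @ Vt, U @ W.T @ Vt)   # two rotations
              for t_c in (U[:, 2], -U[:, 2])]          # translation up to sign

# The true motion is among the four candidates (t up to sign and scale).
ok = any(np.allclose(R_c, R_true, atol=1e-6) and
         np.allclose(abs(t_c @ t_true), 1.0, atol=1e-6)
         for R_c, t_c in candidates)
print(ok)
```

Only one of the four candidates places the reconstructed points in front of both cameras; disambiguating among them (and among the larger solution set) is exactly the problem the paper's geometric constraints address.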
  • Eva Cheng, Ian S. Burnett
    ABSTRACT: The recent ubiquity of mobile telephony has posed the challenge of forensic speech analysis on compressed speech content. Whilst existing research studies have investigated the effect of mobile speech compression on speaker and speech parameters, this paper addresses the effect of speech compression on parameters when an interfering background speaker is present in clean and noisy conditions. Preliminary evaluations presented in this paper study the effect of the Adaptive Multi-Rate (AMR) and Adaptive Multi-Rate Wideband (AMR-WB) speech coders on the Linear Prediction (LP) speech spectrum, Line Spectral Frequencies (LSFs), and Mel Frequency Cepstral Coefficients (MFCCs). Results indicate that due caution should be employed for the forensic analysis of mobile telephony speech: speech coder parameters are significantly degraded when an interfering speaker or noise is present, compared to parameters obtained from the main speaker alone. Moreover, at high SNR the speech parameters exhibit values that gradually transition from those ideally and independently obtained from the main speaker to those of the background speaker as the amplitude of the background interfering speaker increases.
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011, May 22-27, 2011, Prague Congress Center, Prague, Czech Republic; 01/2011
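The LP parameters examined in this paper are conventionally obtained via the autocorrelation method and the Levinson-Durbin recursion. A minimal sketch on a synthetic AR(2) process (not coded speech; the coefficients are made up) is:

```python
import numpy as np

def lpc(x, order):
    """LP coefficients via the autocorrelation method + Levinson-Durbin.
    Returns a = [1, a1, ..., ap] and the final prediction error energy."""
    n = len(x)
    r = np.array([x[: n - k] @ x[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + a[1:i] @ r[1:i][::-1]   # correlation with current model
        k = -acc / err                        # reflection coefficient
        a_prev = a.copy()
        a[i] = k
        a[1:i] = a_prev[1:i] + k * a_prev[1:i][::-1]
        err *= (1.0 - k * k)
    return a, err

# Synthetic AR(2) process: x[n] = 0.9 x[n-1] - 0.5 x[n-2] + e[n],
# so the LP polynomial should approach [1, -0.9, 0.5].
rng = np.random.default_rng(0)
e = rng.standard_normal(20000)
x = np.zeros_like(e)
for n in range(2, len(x)):
    x[n] = 0.9 * x[n - 1] - 0.5 * x[n - 2] + e[n]
a, _ = lpc(x, order=2)
print(np.round(a, 2))   # close to [1, -0.9, 0.5]
```

The abstract's observation is that such estimates degrade once an interfering speaker or noise enters the mix before coding, which is what makes forensic use of coded speech delicate.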
  • ABSTRACT: Efficient content-based access to large multimedia collections requires annotations that are human-meaningful, and user tagging of media is one means to obtain such semantic metadata. Tags can also act as user feedback essential for quality of multimedia experience assessment; however, tags can lack user context and become ambiguous between different users. Further, user tagging is a deliberate and discrete event, and a user's response to the media can vary significantly in-between tagging events. This paper extends the authors' social multimedia adaptation framework to explore the use of EEG biosignals obtained from consumer EEG headsets to form context around explicit tagging activities and to provide user emotional feedback in-between tagging events. Preliminary user studies investigating grouped participant responses indicate the most indicative emotional states to be short-term excitement, engagement and frustration, in addition to gyroscope information.
    Third International Workshop on Quality of Multimedia Experience, QoMEX 2011, Mechelen, Belgium, September 7-9, 2011; 01/2011
  • Li Ling, Ian S. Burnett, Eva Cheng
    ABSTRACT: This paper proposes a flexible, markerless registration method that addresses the problem of realistic virtual object placement at any position in a video sequence. The registration consists of two steps: first, four points are specified by the user to build the world coordinate system in which the virtual object is rendered; a self-calibration camera tracking algorithm then recovers the camera viewpoint frame-by-frame, so that the virtual object can be dynamically and correctly rendered according to camera movement. The proposed registration method requires no reference fiducials or prior knowledge of the camera parameters or user environment, and the virtual object can be placed even in environments without distinct features. Experimental evaluations demonstrate low errors for several camera rotations around the X and Y axes for the self-calibration algorithm. Finally, virtual object rendering applications in different user environments are evaluated.
    IEEE 13th International Workshop on Multimedia Signal Processing (MMSP 2011), Hangzhou, China, October 17-19, 2011; 01/2011
  • F. Salim, E. Cheng, S. L. Choy
    ABSTRACT: An earlier version of this paper was presented at the 2011 IEEE International Symposium on Technology and Society (ISTAS) at Saint Xavier University in Chicago, Illinois (and printed in the 2011 ISTAS proceedings). This paper describes a proposed mobile platform, Transafe, that captures and analyses public perceptions of safety to deliver 'crowdsourced' collective intelligence about places in the City of Melbourne, Australia, and their affective states at various times of the day. Public perceptions of crime on public transport in Melbourne are often mismatched with actual crime statistics and such perceptions thus can act as social barriers to visitors and locals traversing within and through the city. Using interactive mobile applications and social media, the visualization of this crowdsourced safety perception information will increase the commuter's awareness of various situations in the City of Melbourne. In addition, through social behavioral analysis and ethnographic research, the collective public intelligence will also help inform the stakeholders of the city for future policy-making and policing strategies for safety perception management. At the centre of the proposed platform is the design and development of a mobile phone application that can contribute to people feeling safer by supporting users to report crimes and misdemeanors that they witness, and provide information about transportation and emergency services around where the users are located. The proposed application can also act as a crime deterrent with one feature that enables user tracking by up to three nominated friends if the user opts to activate tracking when feeling unsafe while roaming the city.
    ACM SIGCAS Computers and Society 01/2011; DOI:10.1145/2095272.2095275
  • ABSTRACT: This paper discusses a new approach to computer music synthesis where music is composed specifically for performance using mobile handheld devices. Open source cross-platform computer music synthesis software initially developed for composing on desktop computers has been used to program a Linux phone. The work presented here allows mobile devices to draw on these resources and compares the strengths of each program in a mobile phone environment. The motivation is driven by the aspirations of the first author, who seeks to further develop creative mobile music performance applications first developed in the 1980s using purpose-built hardware and, later, J2ME phones. The paper focuses on two different musical implementations of his microtonal composition entitled Butterfly Dekany, which was initially implemented in Csound and later programmed using Pure Data. Each implementation represents one of the two programming paradigms that have dominated computer music composition for desktop computers, namely music synthesis using scripting and GUI-based music synthesis. Implementing the same work in two different open source languages offers a way to understand different approaches to composition, as well as providing a point of reference for evaluating the performance of mobile hardware. Keywords: Linux, mobile phone, Csound, Pure Data, ChucK, Synthesis ToolKit
    08/2010; pages 101-110
  • ABSTRACT: With new social media technologies arising daily, this paper reports on a pilot user survey studying how tertiary educated users engage with social media. The results indicate sporadic use of social media by the tertiary educated users studied; they are generally aware of the key social media sites and facilities, but are not actively utilizing these services. The reasons for, and implications of, a lack of tertiary educated users in the egalitarian and participatory environments intrinsic to social media are discussed. Further, the paper suggests potential technological barriers that might be at the root of such a lack of engagement amongst tertiary educated users.
    2010 IEEE International Symposium on Technology and Society (ISTAS); 07/2010
  • ABSTRACT: Teleconferencing systems are becoming increasingly realistic and pleasant for users interacting with geographically distant meeting participants. Video screens display a complete view of the remote participants, using technology such as wraparound or multiple video screens. However, the corresponding audio does not offer the same sophistication: often only a mono or stereo track is presented. This paper proposes a teleconferencing audio recording and playback paradigm that captures the spatial location of the geographically distributed participants for rendering of the remote soundfields at the users' end. Utilizing standard 5.1 surround sound playback, this paper proposes a surround rendering approach that 'squeezes' the multiple recorded soundfields from remote teleconferencing sites to assist the user in disambiguating multiple speakers from different participating sites.
    Australasian Telecommunication Networks and Applications Conference (ATNAC 2008); 01/2009
  • E. Cheng, I.S. Burnett, C. Ritz
    ABSTRACT: Effective and efficient access to multiparty meeting recordings requires techniques for meeting analysis and indexing. Since meeting participants are generally stationary, speaker location information may be used to identify meeting events, e.g., to detect speaker changes. Time-delay estimation (TDE) utilizing cross-correlation of multichannel speech recordings is a common approach for deriving speech source location information. Recent research improved TDE by calculating it from linear prediction (LP) residual signals obtained from LP analysis on each individual speech channel. This paper investigates the use of LP residuals for speech TDE where the residuals are obtained by jointly modeling the multiple speech channels. Experiments conducted with a simulated reverberant room and real room recordings show that jointly modeled LP better predicts the LP coefficients, compared to LP applied to individual channels. Both the individually and jointly modeled LP exhibit similar TDE performance, and both outperform TDE on the speech alone, especially with the real recordings.
    Third International IEEE Conference on Signal-Image Technologies and Internet-Based System (SITIS '07); 01/2008
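The cross-correlation TDE underpinning this work can be sketched generically. The example below recovers a known inter-channel delay from a white-noise stand-in for an LP residual; it is plain cross-correlation, not the paper's jointly modelled LP front end, and the signal and delay are made up.

```python
import numpy as np

def estimate_delay(x, y):
    """Estimate the delay of y relative to x (in samples) from the peak
    of the full cross-correlation sequence."""
    xc = np.correlate(y, x, mode="full")
    return int(np.argmax(xc)) - (len(x) - 1)

rng = np.random.default_rng(1)
src = rng.standard_normal(4000)                 # stand-in for an LP residual
delay = 25                                      # true inter-microphone delay
mic2 = np.concatenate([np.zeros(delay), src[:-delay]])
mic2 += 0.05 * rng.standard_normal(len(mic2))   # sensor noise
print(estimate_delay(src, mic2))                # recovers the 25-sample delay
```

In reverberant rooms the correlation peak smears, which is why the paper pre-whitens via LP residuals before correlating.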
  • ABSTRACT: Recent research in speech localization and dereverberation introduced processing of the multichannel linear prediction (LP) residual of speech recorded with multiple microphones. This paper investigates the novel use of intra- and inter-channel speech prediction by proposing the use of a multichannel LP model derived from multivariate autoregression (MVAR), where current LP approaches are based on univariate autoregression (AR). Experiments were conducted on simulated anechoic and reverberant synthetic speech vowels and real speech sentences; results show that, especially at low reverberation times, the MVAR model exhibits greater prediction gains from the residual signal, compared to residuals obtained from univariate AR models for individually or jointly modelled speech channels. In addition, the MVAR model more accurately models the speech signal when compared to univariate LP of a similar prediction order and when a smaller number of microphones are deployed.
    International Workshop on Multimedia Signal Processing, MMSP 2008, October 8-10, 2008, Shangri-la Hotel, Cairns, Queensland, Australia; 01/2008
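The prediction-gain comparison between jointly modelled (MVAR) and univariate AR channels can be illustrated on synthetic data. The coupling matrix below is made up and the fit is plain least squares, a stand-in for the paper's estimator on real multichannel speech.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic 2-channel VAR(1) signal with cross-channel coupling, a crude
# stand-in for inter-microphone dependence: x[n] = A x[n-1] + e[n].
A_true = np.array([[0.6, 0.3], [0.2, 0.5]])
e = 0.1 * rng.standard_normal((5000, 2))
x = np.zeros_like(e)
for n in range(1, len(x)):
    x[n] = A_true @ x[n - 1] + e[n]

# MVAR(1) fit by least squares: predict x[n] from BOTH channels at n-1.
X, Y = x[:-1], x[1:]
B = np.linalg.lstsq(X, Y, rcond=None)[0]        # B approximates A_true.T
mvar_resid = Y - X @ B

# Univariate AR(1) per channel: each channel predicted from itself only.
uni_resid = np.empty_like(Y)
for c in range(2):
    a_c = (X[:, c] @ Y[:, c]) / (X[:, c] @ X[:, c])
    uni_resid[:, c] = Y[:, c] - a_c * X[:, c]

# Joint modelling captures the cross-channel terms the univariate fit misses.
gain = uni_resid.var() / mvar_resid.var()
print(gain > 1.0)
```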
  • E. Cheng, I. Burnett, C. Ritz
    ABSTRACT: Multiparty meetings generally involve stationary participants. Participant location information can thus be used to segment the recorded meeting speech into each speaker's 'turn' for meeting 'browsing'. To represent speaker location information from speech, previous research showed that the most reliable time delay estimates are extracted from the Hilbert envelope of the linear prediction residual signal. The authors' past work has proposed the use of spatial audio cues to represent speaker location information. This paper proposes extracting spatial audio cues from the Hilbert envelope of the speech residual to indicate changing speaker location for meeting speech segmentation. Experiments conducted on recordings of a real acoustic environment show that spatial cues from the Hilbert envelope are more consistent across frequency subbands and can clearly distinguish between spatially distributed speakers, compared to spatial cues estimated from the recorded speech or residual signal.
    2006 8th International Conference on Signal Processing; 02/2006
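The Hilbert envelope used here is the magnitude of the analytic signal. A minimal FFT-based sketch on a synthetic AM tone (not meeting speech; the frequencies are made up and chosen to be periodic in the analysis window) is:

```python
import numpy as np

def hilbert_envelope(x):
    """Magnitude of the analytic signal, computed via the FFT by zeroing
    negative frequencies and doubling positive ones."""
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(X * h))

# Amplitude-modulated tone: the envelope should track 1 + 0.5*cos(...).
# fs = 2048 Hz over 1 s, so the 200 Hz carrier and 4 Hz modulation are
# exactly periodic in the window and edge effects vanish.
t = np.arange(2048) / 2048.0
am = 1.0 + 0.5 * np.cos(2 * np.pi * 4 * t)
x = am * np.sin(2 * np.pi * 200 * t)
env = hilbert_envelope(x)
print(np.max(np.abs(env - am)) < 1e-6)   # envelope matches the modulation
```

The envelope discards the carrier's fine structure, which is why it yields time-delay and spatial cues that are more stable across subbands than the raw residual.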

Publication Stats

41 Citations
10.33 Total Impact Points

Institutions

  • 2010–2015
    • RMIT University
      • School of Electrical and Computer Engineering
      Melbourne, Victoria, Australia
  • 2012
    • University of Vic
      Vic, Catalonia, Spain
  • 2005–2009
    • University of Wollongong
      • School of Electrical, Computer and Telecommunications Engineering (SECTE)
      City of Greater Wollongong, New South Wales, Australia