Conference Paper

AmbientSense: A real-time ambient sound recognition system for smartphones

Abstract

This paper presents the design, implementation, and evaluation of AmbientSense, a real-time ambient sound recognition system for smartphones. AmbientSense continuously recognizes user context by analyzing ambient sounds sampled from a smartphone's microphone and gives the user real-time feedback on the recognized context. AmbientSense is implemented as an Android app and works in two modes: in autonomous mode, all processing is performed on the smartphone; in server mode, recognition is done by transmitting audio features to a server and receiving the classification results back. We evaluated both modes on a set of 23 daily-life ambient sound classes and report recognition performance, phone CPU load, and recognition delay. With a fully charged battery, the application runs for up to 13.75 h on a Samsung Galaxy SII smartphone and up to 12.87 h on a Google Nexus One phone. Runtime and CPU load were similar for the autonomous and server modes.
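The abstract outlines a frame-based pipeline: ambient sound is sampled from the smartphone microphone, audio features are extracted, and a classifier maps each analysis window to one of the sound classes, either on the phone or on a server. Below is a minimal Python sketch of such an autonomous-mode loop, assuming MFCC features and an SVM classifier as reported in the citing excerpts further down; the window length, feature summary, and classifier settings are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of an AmbientSense-style autonomous-mode loop (not the authors' code).
# Assumes 16 kHz mono audio, MFCC features, and an SVM classifier, as suggested by the
# citing works below; frame/window sizes here are illustrative placeholders.
import numpy as np
import librosa
from sklearn.svm import SVC

SR = 16000          # sampling rate reported for the system
WINDOW_S = 1.0      # analysis window length (assumption)

def window_features(audio_window, sr=SR):
    """Summarize one analysis window by the mean and std of its MFCCs."""
    mfcc = librosa.feature.mfcc(y=audio_window, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def train(recordings, labels):
    """Offline training on labelled recordings of the ambient sound classes."""
    X = np.stack([window_features(r) for r in recordings])
    clf = SVC(kernel="rbf")
    clf.fit(X, labels)
    return clf

def recognize_stream(clf, window_iter):
    """Autonomous mode: classify each incoming window locally and yield the label."""
    for window in window_iter:          # e.g. blocks pulled from the microphone
        yield clf.predict([window_features(window)])[0]
```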

... The problem of identifying the speakers in any audio data set is addressed by Yu et al. [41]. Also, the ambient sound of the environment can be recognized and identified as presented in the work by Rossi et al. [42]. Thus, the identification of different verbal interactions has been addressed by many researchers, and this ability enables the interpretation of indoor group-level situations. ...
... [Flattened comparison table from the citing work: rows are groups of references, including AmbientSense [39,42], marked against the categories Wearable, Device-free, Non-invasive, Physical, Verbal, Collective, Atomic, and Group; the row-column alignment of the marks is not recoverable from the extracted text.] ...
... In terms of verbal activity identification accuracy, HumanSense is similar to Lu et al. [39] and performs with greater accuracy than the works of Rossi et al. [42] and Zhan et al. [63]. The accuracy of our work is also comparable to the recent works of Tran et al. [47] and Vafeiadis et al. [64], though the verbal activities identified differ from ours. ...
Article
Full-text available
Non-intrusive identification of human activity that considers social interactions and group dynamics has been a fundamental problem and a challenging area of research. In real life, it is required for designing human-centric applications like assisted living, health care, and smart home environments. As human beings spend 90% of their time indoors, such a system will be helpful to monitor the behavioral anomalies of the inhabitants. Existing approaches have used intrusive or invasive methods like cameras or wearable devices. In this work, we present a device-free, non-invasive, and non-intrusive sensing framework called HumanSense using a heterogeneous sensor grid for human activity monitoring. The sensor grids, comprising ultrasonic and sound sensors, have been deployed for collective sensing, combining a person’s physical activity and verbal interaction information. The proposed system senses a stream of events when the occupant(s) perform different physical activities categorized as atomic and group activities like sitting, standing, and walking. Simultaneously, it also tracks person-person verbal interactions such as monologue and discussion. Both types of information are then integrated into a single framework to understand the overall behavioral scenario of the indoor environment. The experimental results have shown that HumanSense can detect different activities with an accuracy of more than 90% and also improves overall identification accuracy compared to existing works. Our developed system can be further evolved into ready-to-deploy smart sensing panels, which can be effective for human activity monitoring in an indoor environment.
... SVM [121] is a supervised classification algorithm. It is widely used to solve pattern recognition problems such as sound recognition [110] and human activity recognition [13]. In [129], Zhai et al. presented their research on stress recognition using four physiological signals while the user was interacting with the computer. ...
... In our study, we used the SVM with a Gaussian kernel [27] for classification. Since the Gaussian kernel introduces the cost parameter C and the kernel parameter γ, which must be determined in the training process, a parameter sweep was used to find the optimized C and γ [110]. C and γ were evaluated in the range from 2⁻² to 2². ...
... In [110], M. Rossi et al. presented a real-time ambient sound recognition system running on an Android smartphone. The sound was acquired from a smartphone's microphone with a sampling frequency of 16 kHz at 16 bit. ...
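The excerpts above describe choosing the Gaussian-kernel SVM's cost parameter C and kernel parameter γ by a parameter sweep over the range 2⁻² to 2². A small scikit-learn sketch of such a sweep is shown below; the feature matrix X, labels y, and the 5-fold cross-validation are assumptions added for illustration.

```python
# Illustrative parameter sweep for an RBF (Gaussian) kernel SVM, assuming features X
# and labels y are already extracted; the 2**-2 .. 2**2 range follows the sweep
# described in the excerpts above.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C":     2.0 ** np.arange(-2, 3),   # 0.25, 0.5, 1, 2, 4
    "gamma": 2.0 ** np.arange(-2, 3),
}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
# search.fit(X, y)                      # X: feature vectors, y: class labels
# print(search.best_params_)            # optimized C and gamma
```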
Thesis
In modern society, stress has been found to be a common problem for individuals. Continuous stress can lead to various mental and physical problems, especially for people who routinely face emergency situations (e.g., firefighters): it may alter their actions and put them in danger. Therefore, it is meaningful to provide an assessment of an individual's stress. Based on this idea, the Psypocket project was proposed, aimed at making a portable system able to accurately analyze the stress state of an individual based on his or her physiological, psychological, and behavioural modifications. It should then offer feedback solutions to regulate this state. The research in this thesis is an essential part of the Psypocket project. In this thesis, we discuss the feasibility and the interest of stress recognition from heterogeneous data. Not only physiological signals, such as electrocardiography (ECG), electromyography (EMG), and electrodermal activity (EDA), but also reaction time (RT) is adopted to recognize different stress states of an individual. For stress recognition, we propose an approach based on an SVM (Support Vector Machine) classifier. The results obtained show that reaction time can be used to estimate an individual's stress level, with or without the physiological signals. Besides, we discuss the feasibility of an embedded system which would realize the complete data processing. Therefore, this thesis can contribute to making a portable system that recognizes an individual's stress in real time by adopting heterogeneous data such as physiological signals and RT.
... Audio, on the other hand, offers much promise in this respect; many daily activities generate characteristic sounds that can be captured with any off-the-shelf device with a microphone. Hence, researchers have proposed several different types of audio event recognition frameworks over the years, from applications on wearable and mobile devices [30,37] to home-based sensor systems [5,21]. With the development of deep neural networks in recent years, several efforts have been made by researchers to model large-scale acoustic events. ...
... With rapidly improving smartphones in recent years, phone-based acoustic sensing also shows great promise for activity recognition tasks. The AmbientSense application [30] is an example. It is an Android app that can process ambient sound data in real time either on the device or on a server. ...
Article
Over the years, activity sensing and recognition has been shown to play a key enabling role in a wide range of applications, from sustainability and human-computer interaction to health care. While many recognition tasks have traditionally employed inertial sensors, acoustic-based methods offer the benefit of capturing rich contextual information, which can be useful when discriminating complex activities. Given the emergence of deep learning techniques and leveraging new, large-scale multimedia datasets, this paper revisits the opportunity of training audio-based classifiers without the onerous and time-consuming task of annotating audio data. We propose a framework for audio-based activity recognition that can make use of millions of embedding features from public online video sound clips. Based on the combination of oversampling and deep learning approaches, our framework does not require further feature processing or outliers filtering as in prior work. We evaluated our approach in the context of Activities of Daily Living (ADL) by recognizing 15 everyday activities with 14 participants in their own homes, achieving 64.2% and 83.6% averaged within-subject accuracy in terms of top-1 and top-3 classification respectively. Individual class performance was also examined in the paper to further study the co-occurrence characteristics of the activities and the robustness of the framework.
... Hence, in recent years researchers have proposed several different types of audio sensing approaches, from developments in wearable and mobile devices [6,7] to home-based sensor systems [8,9]. Their work greatly broadens the range of activity types considered in activity recognition research and significantly improves recognition performance when combined with traditional methods. ...
... With the development of smartphones in recent years, phone-based acoustic sensing also shows great capability in activity recognition tasks. The AmbientSense application [7] is an example. It is an Android app that can process ambient sound data in real time either on the user's device or on an online server. ...
Preprint
Activity sensing and recognition have been demonstrated to be critical in health care and smart home applications. Compared to traditional methods such as using accelerometers or gyroscopes for activity recognition, acoustic-based methods can collect rich information about human activities together with the activity context, and are therefore more suitable for recognizing high-level compound activities. However, audio-based activity recognition in practice always suffers from the tedious and time-consuming process of collecting ground-truth audio data from individual users. In this paper, we propose a new mechanism for audio-based activity recognition that is entirely free from user training data, using millions of embedding features from general YouTube video sound clips. Based on a combination of oversampling and deep learning approaches, our scheme does not require further feature extraction or outlier filtering for implementation. We developed our scheme for the recognition of 15 common home-related activities and evaluated its performance in dedicated scenarios and in-the-wild scripted scenarios. In the dedicated recording test, our scheme yielded 81.1% overall accuracy and an 80.0% overall F-score for all 15 activities. In the in-the-wild scripted tests, we obtained an averaged top-1 classification accuracy of 64.9% and an averaged top-3 classification accuracy of 80.6% for 4 subjects in an actual home environment. Several design considerations, including the association between dataset labels and target activities, the effect of segmentation size, and privacy concerns, are also discussed in the paper.
... These new applications were initially focused on content-based audio classification and retrieval [10,12]. However, nowadays there is an increasing number of applications such as medical telemonitoring [13], ambient sound recognition [14], or audio surveillance (e.g. monitoring of wildlife areas [15] or classification of aircraft noise [16]). Despite the fact that automatic detection and analysis of speech are still active research areas [17], their methods are not always directly applicable to other AED problems [6] for two main reasons. ...
... Table 5 shows the selected features for (8 × 8) Km (similar selections were observed for other moments). The number of selected features follows the energy patterns for each moment order as in Table 2, i.e. (p, q) = (0, 0) → [1, 13], (1, 0) → [14, 26], …, (0, 2) → … ...
Article
Cough detection has recently been identified as of paramount importance to fully exploit the potential of telemedicine in respiratory conditions and release some of the economic burden of respiratory care in national health systems. Current audio-based cough detection systems are either uncomfortable or not suitable for continuous patient monitoring, since the audio processing methods implemented therein fail to cope with noisy environments such as those where the acquiring device is carried in the pocket (e.g. smartphone). Moment theory has been widely applied in a number of complex problems involving image processing, computer vision, and pattern recognition. Their invariance properties and noise robustness make them especially suitable as “signature” features enabling character recognition or texture analysis. A natural extension of moment theory to one-dimensional signals is the identification of meaningful patterns in audio signals. However, to the best of our knowledge only marginal attempts have been made in this direction. This paper applies moment theory to perform cough detection in noisy audio signals. Our proposal adopts the first steps used to extract Mel frequency cepstral coefficients (time-frequency decomposition and application of a filter bank defined in the Mel scale) while the innovation is introduced in the second step, where energy patterns defined for specific temporal frames and frequency bands are characterised using moment theory. Our results show the feasibility of using moment theory to solve such a problem in a variety of noise conditions, with sensitivity and specificity values around 90%, significantly outperforming popular state-of-the-art feature sets.
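The abstract above describes characterizing energy patterns over temporal frames and Mel frequency bands with moment theory instead of the DCT step of MFCC extraction. The numpy sketch below computes plain two-dimensional raw moments m_pq of an energy patch; the specific moment family, orders, and normalization used in the paper are not given here, so this is only an illustration of the idea.

```python
# Illustrative 2-D raw moments over a log-Mel energy patch E (frames x bands).
# The specific moment family, orders, and normalization used in the paper are
# not stated in this abstract; this only shows the general idea.
import numpy as np

def raw_moments(E, max_order=2):
    """Return {(p, q): m_pq} for a 2-D energy patch E."""
    t = np.arange(E.shape[0])[:, None]       # temporal frame index
    f = np.arange(E.shape[1])[None, :]       # Mel band index
    moments = {}
    for p in range(max_order + 1):
        for q in range(max_order + 1 - p):
            moments[(p, q)] = float(np.sum((t ** p) * (f ** q) * E))
    return moments

# Example: a random 8-frame x 8-band patch, matching the "(8 x 8)" configuration
# mentioned in the excerpt above.
E = np.abs(np.random.randn(8, 8))
features = raw_moments(E)
```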
... The Support Vector Machine (SVM) [19] is a supervised classification algorithm. It is widely used to solve pattern recognition problems such as sound recognition [20] and human activity recognition [21]. In [17], Zhai et al. presented their research on stress recognition using four physiological signals while the user was interacting with the computer. ...
... In this paper, we used the SVM with a Gaussian kernel [22] for classification. Since the Gaussian kernel introduces the cost parameter C and the kernel parameter γ, which must be determined in the training process, a parameter sweep was used to find the optimized C and γ [20]. C and γ were evaluated in the range from 2⁻² to 2². ...
Article
This paper investigates the potential of stress recognition using data from heterogeneous sources. Not only physiological signals but also reaction time (RT) is used to recognize different stress states. To acquire data related to an individual's stress, we design experiments with two different stressors: a visual stressor (Stroop test) and an auditory stressor. During the experiments, the subjects perform an RT task. Three physiological signals, electrodermal activity (EDA), electrocardiography (ECG), and electromyography (EMG), as well as RTs, are recorded. We develop classifiers based on Support Vector Machines (SVM) for stress recognition from the physiological signals and RT, respectively. An overall good recognition performance of the SVM classifier is obtained. In addition, we present a recognition strategy using decision fusion. The recognition is thus achieved by fusing the classification results of the physiological signals and RT with a voting method, and a further improvement in recognition accuracy is observed. Results indicate that RT is effective for stress recognition and that the fusion of physiological signals and RT can yield a more satisfactory recognition performance.
... Previous works on sound (captured by the microphone of a smartphone) have mainly focused on the following application cases: environment assessment [80,81], proximity sensing [82,83], or indoor positioning [84,85]. The sources of sound are either fine-tuned tags or the surroundings. ...
Article
Full-text available
Human activity recognition (HAR) has become an intensive research topic in the past decade because of the pervasive user scenarios and the overwhelming development of advanced algorithms and novel sensing approaches. Previous HAR-related sensing surveys were primarily focused either on a specific branch such as wearable sensing and video-based sensing or on a full-stack presentation of both sensing and data processing techniques, resulting in a weak focus on HAR-related sensing techniques. This work presents a thorough, in-depth survey of the state-of-the-art sensing modalities in HAR tasks, to give younger researchers in the community a solid understanding of the various sensing principles. First, we categorized the HAR-related sensing modalities into five classes: mechanical kinematic sensing, field-based sensing, wave-based sensing, physiological sensing, and hybrid/others. Specific sensing modalities are then presented in each category, with a thorough description of the sensing techniques and the latest related works. We also discussed the strengths and weaknesses of each modality across the categorization so that newcomers can get a better overview of the characteristics of each sensing modality for HAR tasks and choose the proper approaches for their specific application. Finally, we summarized the presented sensing techniques with a comparison concerning selected performance metrics and proposed a few outlooks on future sensing techniques for HAR tasks.
... "Sounds envelope and reverberate deeply within bodies in ways which are specific both to their phenomenal properties and to historically constituted modes of listening, understanding and interpretation." [165] Soundscapes are a field of interest in ubiquitous computing [171], where classes ranging from forest to railway station are recognized via machine learning, as well as brushing teeth and raining. ...
Thesis
Full-text available
The ubiquity of smart devices is increasingly shaping our daily lives. Data processing of natural communication with computers, the goal of Social Signal Processing, is also moving beyond controlled settings with the use of mobile computers. Instead of executing data collection in the lab, it is now realized "in the wild". This means that data can now be collected, processed, and evaluated in everyday situations. The challenges of this thesis lie on the one hand in classical Social Signal Processing transferred into "the wild", studied through laughter recognition. On the other hand, the challenges go beyond classical Social Signal Processing into affect recognition in relation to urban environments viewed through local climate zones. Throughout this thesis MobileSSI, an open-source framework for real-time recognition of social signals, is developed. It builds upon the well-established SSI framework and thus provides multi-sensor data recording and machine-learning capabilities. MobileSSI brings those features to a variety of platforms (Android, Linux, Windows) and extends the capabilities of SSI with interactive machine learning for increased personalization and privacy. Using rapid prototyping in configuration and mobile user interfaces, MobileSSI forms the technical contribution that is employed throughout different scenarios "in the wild" to measure and improve wellbeing. As a first field of application, group enjoyment was considered as an aspect of wellbeing. Mobile sensors were employed to recognize laughter based on accelerometer and audio data. Asynchronous fusion was used to aggregate laughter events even when they occur staggered. The technique led to a live demonstration of group enjoyment recognition. Drinking activity, as a representation of health-related behavior in everyday living, was used as a second scenario. Smartwatches were used to record and recognize drinking activity. Since it is not feasible to retrospectively annotate motion data recorded with smartwatches, the annotation process has to be adapted so that it can be executed "in the wild". Therefore, interactive machine learning combined with Active Learning was implemented to limit the labeling effort to selected data that has the biggest training effect on the machine learning model. Moreover, a user interface for a smartwatch was created that allows the comfortable correction of predictions by the system. The evaluation of the system "in the wild" was realized with bodystorming by users with a prototype. Bodystorming is a common practice in usability engineering with a focus on embodied experience, to foster insights for the design and development process of technology. As a third scenario, mobile sensors were used to measure wellbeing in the context of urban forests in collaboration with the Institute of Geography of the University of Augsburg. In detail, physiological and audio data were analyzed for the recognition of local climate zones and the users' wellbeing. The study is based on models of urban climate (heat, humidity). The contribution of this thesis includes the design and implementation of techniques and methods to collect and annotate environment-related data "in the wild" that have been validated with users walking along routes comprising varying urban structural types.
... 19 Will Powell's prototype was not developed with the mobile technologies standardized by major manufacturers such as Google, Apple, or Microsoft, but it is assembled in a way that allows considerable mobility. 20 AmbientSense provides a sound recognition function with processing executed entirely on the device, in addition to an option that runs on a server. 21 The Python prototype implements a simulation of the sound-source localization functionality. ...
Thesis
Full-text available
Acoustic insulation can greatly reduce the situational awareness of an individual, which is important for social life, the performance of tasks, and one's own security. This research addresses the problem from the perspective of the difficulties faced by deaf individuals, conducting an exploratory study in the field of assistive mobile computing. The study specifies requirements and presents solutions to problems that emerged during the development of an environmental sound recognition system that aims to expand the situational awareness of deaf individuals. The proposed application performs all processing on the mobile device itself, from signal capture to spectrogram display, audio feature extraction, and classification. During the research, an experiment with smartphones was conducted, and a knowledge base was produced with instances corresponding to 300 audio recordings distributed across 30 classes. With these data, classification tests were performed using the nearest neighbor, naive Bayes, Bayes network, and random decision forest ensemble algorithms. The results were quite satisfactory compared to those obtained in other similar experiments on desktop platforms. To minimize problems observed when running the application in unmonitored environments, an equation was elaborated to indicate the confidence level of the classification, adding supporting information to the sound recognition process. The personalization of the knowledge base was another solution proposed by the study, in order to meet the sound recognition demands of all potential users of the system. In addition, an alpha test was conducted with a group of deaf users, providing inspiring feedback and reporting a positive evaluation of the proposed system. Published under the name VSom, the application is available online for free.
... Here the sound signal shows excellent potential, since it can be captured by the microphones on all smartphones. Previous work on sound captured by smartphone microphones has mainly focused on indoor localization [32,33], proximity detection [34,35], or other environmental assessment applications [36,37], with the sound sources either fine-tuned sound tags or the surroundings. Under ideal conditions, it has been shown that sound can be used to accurately measure distance using smartphones alone. ...
Article
Full-text available
We propose to use ambient sound as a privacy-aware source of information for COVID-19-related social distance monitoring and contact tracing. The aim is to complement currently dominant Bluetooth Low Energy Received Signal Strength Indicator (BLE RSSI) approaches. These often struggle with the complexity of Radio Frequency (RF) signal attenuation, which is strongly influenced by specific surrounding characteristics. This in turn renders the relationship between signal strength and the distance between transmitter and receiver highly non-deterministic. We analyze spatio-temporal variations in what we call “ambient sound fingerprints”. We leverage the fact that ambient sound received by a mobile device is a superposition of sounds from sources at many different locations in the environment. Such a superposition is determined by the relative position of those sources with respect to the receiver. We present a method for using the above general idea to classify proximity between pairs of users based on Kullback–Leibler distance between sound intensity histograms. The method is based on intensity analysis only, and does not require the collection of any privacy sensitive signals. Further, we show how this information can be fused with BLE RSSI features using adaptive weighted voting. We also take into account that sound is not available in all windows. Our approach is evaluated in elaborate experiments in real-world settings. The results show that both Bluetooth and sound can be used to differentiate users within and out of critical distance (1.5 m) with high accuracies of 77% and 80% respectively. Their fusion, however, improves this to 86%, making evident the merit of augmenting BLE RSSI with sound. We conclude by discussing strengths and limitations of our approach and highlighting directions for future work.
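The abstract describes classifying pairwise proximity from the Kullback–Leibler distance between ambient sound intensity histograms and fusing it with BLE RSSI by adaptive weighted voting. The sketch below shows a symmetrized KL distance and a simple weighted vote; the smoothing constant, threshold, and weights are illustrative placeholders rather than the authors' values.

```python
# Minimal sketch: symmetrized KL distance between two sound-intensity histograms
# and a weighted vote with a BLE-based decision. Thresholds, smoothing, and
# weights are illustrative placeholders, not the values used in the paper.
import numpy as np

def kl_distance(h1, h2, eps=1e-9):
    """Symmetrized Kullback-Leibler distance between two intensity histograms."""
    p = (h1 + eps) / np.sum(h1 + eps)
    q = (h2 + eps) / np.sum(h2 + eps)
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def proximity_vote(sound_hist_a, sound_hist_b, ble_close,
                   w_sound=0.5, w_ble=0.5, kl_threshold=1.0):
    """Fuse a sound-based and a BLE-based proximity decision by weighted voting."""
    sound_close = kl_distance(sound_hist_a, sound_hist_b) < kl_threshold
    score = w_sound * float(sound_close) + w_ble * float(ble_close)
    return score >= 0.5   # True: classified as within the critical distance
```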
... The measurement of audio signals has been used in a vast array of applications. Notable applications include underwater environmental acoustic tomography [1], [2], the development of acoustic arrays for sound imaging [3] and sound localization [4], ambient sound classification and monitoring [5], voice monitoring [6] and talker classification [7], and evaluation of the effect of noise and reverberation on audio signals [8]. Human health monitoring applications include cough identification [9], [10], bowel sound monitoring [11], and snore classification [12]. ...
... Given an audio recording, sound event detection consists not only of detecting which sound events have occurred during the recording but also when they occurred [1]. SED and the broader domain of ambient sound analysis have received a lot of attention in the past few years because of their numerous potential applications, including smart home consumer electronics, smartphones, earbuds, assisted living, smart cities, or security [2][3][4][5][6][7][8]. ...
Preprint
The ranking of sound event detection (SED) systems may be biased by assumptions inherent to evaluation criteria and to the choice of an operating point. This paper compares conventional event-based and segment-based criteria against the Polyphonic Sound Detection Score (PSDS)'s intersection-based criterion, over a selection of systems from DCASE 2020 Challenge Task 4. It shows that, by relying on collars, the conventional event-based criterion introduces different strictness levels depending on the length of the sound events, and that the segment-based criterion may lack precision and be application dependent. Alternatively, PSDS's intersection-based criterion overcomes the dependency of the evaluation on sound event duration and provides robustness to labelling subjectivity, by allowing valid detections of interrupted events. Furthermore, PSDS enhances the comparison of SED systems by measuring sound event modelling performance independently from the systems' operating points.
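The abstract contrasts collar-based event matching with PSDS's intersection-based criterion, under which a detection counts if it overlaps the ground-truth event sufficiently in both directions. The sketch below shows a simplified single-pair version of such an intersection test; the full PSDS definition aggregates these ratios (ρ_DTC, ρ_GTC) over whole sets of events and classes, so this is only the core idea.

```python
# Simplified illustration of an intersection-based validity test for one detected
# event against one ground-truth event of the same class. PSDS applies such ratios
# (rho_DTC, rho_GTC) over whole sets of events; the values here are illustrative.
def intersection(a_start, a_end, b_start, b_end):
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def valid_detection(det, gt, rho_dtc=0.5, rho_gtc=0.5):
    """det, gt: (start, end) tuples in seconds."""
    inter = intersection(det[0], det[1], gt[0], gt[1])
    dtc = inter / (det[1] - det[0])   # fraction of the detection overlapping ground truth
    gtc = inter / (gt[1] - gt[0])     # fraction of the ground truth covered
    return dtc >= rho_dtc and gtc >= rho_gtc

# Example: a detection covering most of a long event counts as valid even though
# its onset/offset are misaligned, which collar-based scoring would penalize.
print(valid_detection((1.0, 4.0), (0.5, 5.0)))   # True
```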
... There are works (e.g. [42]) which aim to identify a user's context and whereabouts based on the ambient noise detected by their smartphone, e.g., restaurant, street, office, and so on. Some contexts are loud enough, and may have a distinct enough fingerprint in the low-frequency range, to be detectable using a gyroscope, for example railway station, shopping mall, highway, and bus. ...
Conference Paper
Full-text available
We show that the MEMS gyroscopes found on modern smartphones are sufficiently sensitive to measure acoustic signals in the vicinity of the phone. The resulting signals contain only very low-frequency information (<200Hz). Nevertheless, we show, using signal processing and machine learning, that this information is sufficient to identify speaker information and even parse speech. Since iOS and Android require no special permissions to access the gyro, our results show that apps and active web content that cannot access the microphone can nevertheless eavesdrop on speech in the vicinity of the phone.
... In Reference [32], the authors achieved an accuracy of 78.4% by using the SVM method for keystrokes identification with MFCC as features. Furthermore, the SVM method has been used by the authors of Reference [33] for the identification of several sounds, including beach, forest, street, shaver, crowd football, birds, dog, sink, dishwasher, washing machine, brushing teeth, speech, bus, car, restaurant, phone ringing, train station, chair, vacuum cleaner, coffee machine, raining and computer keyboard, using MFCC as features and reporting an accuracy around 80%. The SVM method is also used for the recognition of sleeping using MFCC and sound pressure level (SPL) as features, reporting accuracies between 75% and 81% [34,35]. ...
Article
Full-text available
The identification of Activities of Daily Living (ADL) is intrinsic with the user’s environment recognition. This detection can be executed through standard sensors present in every-day mobile devices. On the one hand, the main proposal is to recognize users’ environment and standing activities. On the other hand, these features are included in a framework for the ADL and environment identification. Therefore, this paper is divided into two parts—firstly, acoustic sensors are used for the collection of data towards the recognition of the environment and, secondly, the information of the environment recognized is fused with the information gathered by motion and magnetic sensors. The environment and ADL recognition are performed by pattern recognition techniques that aim for the development of a system, including data collection, processing, fusion and classification procedures. These classification techniques include distinctive types of Artificial Neural Networks (ANN), analyzing various implementations of ANN and choosing the most suitable for further inclusion in the following different stages of the developed system. The results present 85.89% accuracy using Deep Neural Networks (DNN) with normalized data for the ADL recognition and 86.50% accuracy using Feedforward Neural Networks (FNN) with non-normalized data for environment recognition. Furthermore, the tests conducted present 100% accuracy for standing activities recognition using DNN with normalized data, which is the most suited for the intended purpose.
... Audio recognition and classification consist of extracting different features from a sample audio file and feeding these features into a machine-learning algorithm, to detect classes of the present sounds. This topic has been studied for ambient sound classification [18][19][20], noise signal classification [21,22], speech/music classification [23], music genre classification [24], human accent or language classification, speaker recognition [25,26], and indoor localization [27]. Hinton et al. [28] and McLoughlin et al. [29] used Deep Neural Network (DNN) to develop an automated speech recognition system and a robust sound event classification, respectively. ...
Article
Full-text available
Automatically recognizing and tracking construction equipment activities is the first step towards performance monitoring of a job site. Recognizing equipment activities helps construction managers to detect the equipment downtime/idle time in a real-time framework, estimate the productivity rate of each equipment based on its progress, and efficiently evaluate the cycle time of each activity. Thus, it leads to project cost reduction and time schedule improvement. Previous studies on this topic have been based on single sources of data (e.g., kinematic, audio, video signals) for automated activity-detection purposes. However, relying on only one source of data is not appropriate, as the selected data source may not be applicable under certain conditions and fails to provide accurate results. To tackle this issue, the authors propose a hybrid system for recognizing multiple activities of construction equipment. The system integrates two major sources of data—audio and kinematic—through implementing a robust data fusion procedure. The presented system includes recording audio and kinematic signals, preprocessing data, extracting several features, as well as dimension reduction, feature fusion, equipment activity classification using Support Vector Machines (SVM), and smoothing labels. The proposed system was implemented in several case studies (i.e., ten different types and equipment models operating at various construction job sites) and the results indicate that a hybrid system is capable of providing up to 20% more accurate results, compared to cases using individual sources of data.
... The authors of [31] used the SVM method for the recognition of several sounds, including beach, crowd football, shaver, birds, dishwasher, sink, brushing teeth, dog, speech, bus, forest, street, car, phone ringing, chair, train station, vacuum cleaner, coffee machine, raining, washing machine, computer keyboard and restaurant, using MFCC as features and reporting an accuracy around 80%. The SVM method is also used for the recognition of sleeping using MFCC and sound pressure level (SPL) as features, reporting accuracies between 75% and 81% [32,33]. ...
Article
The detection of the environment where the user is located is of extreme use for the identification of Activities of Daily Living (ADL). ADL can be identified using the sensors available in many off-the-shelf mobile devices, including magnetic and motion sensors, and the environment can also be identified using acoustic sensors. The main objective of this study is to recognize the environments and some standing ADL to include in the development of a framework for the recognition of ADL and its environments. The study presented in this paper is divided in two parts: firstly, we discuss the recognition of the environment using acoustic sensors (i.e., the microphone), and secondly, we fuse this information with motion and magnetic sensors for the recognition of standing ADL. The recognition of the environments and the ADL is performed using pattern recognition techniques, in order to develop a system that includes data acquisition, data processing, data fusion, and classification methods. The classification methods explored in this study are composed of different types of Artificial Neural Networks (ANN), comparing the different types of ANN and selecting the best methods to implement in the different stages of the developed system. Conclusions point to the use of Deep Neural Networks (DNN) with normalized data for the identification of ADL with 85.89% accuracy, the use of Feedforward Neural Networks (FNN) with non-normalized data for the identification of the environments with 86.50% accuracy, and the use of the DNN method with normalized data for the identification of standing activities with 100% accuracy.
... Smartphone inertial sensors have been exploited to infer keystrokes on the phone itself as well as external keyboards [10,14,33,39,52], to track user movements and locations [20,22,37], to infer private user activities [38] and to decode human speech [34]. Similarly, smartphone microphone and/or magnetometer have also been exploited to infer private user information [42] or trade secrets (such as 3D-printer designs) [18,23,44], private user activities [41] and natural handwriting [54]. Recently, aggregate power usage over a period of time available from the smartphone's power meter was used to track user movements and locations [35]. ...
Conference Paper
Wrist-wearables such as smartwatches and fitness bands are equipped with a variety of high-precision sensors that support novel contextual and activity-based applications. The presence of a diverse set of on-board sensors, however, also expose an additional attack surface which, if not adequately protected, could be potentially exploited to leak private user information. In this paper, we investigate the feasibility of a new attack that takes advantage of a wrist-wearable's motion sensors to infer input on mechanical devices typically used to secure physical access, for example, combination locks. We outline an inference framework that attempts to infer a lock's unlock combination from the wrist motion captured by a smartwatch's gyroscope sensor, and uses a probabilistic model to produce a ranked list of likely unlock combinations. We conduct a thorough empirical evaluation of the proposed framework by employing unlocking-related motion data collected from human subject participants in a variety of controlled and realistic settings. Evaluation results from these experiments demonstrate that motion data from wrist-wearables can be effectively employed as a side-channel to significantly reduce the unlock combination search-space of commonly found combination locks, thus compromising the physical security provided by these locks.
... An SVM (support vector machine) classifier [22], which adopted the recorded reaction time, electromyography (EMG), electrodermal activity (EDA), and ECG to recognize the stress states of the subject, was proposed. SVM [26] is a classification algorithm which is widely used to solve pattern recognition problems [1,23,28]. In [8], Irick et al. presented a hardware-efficient Gaussian radial basis SVM architecture for FPGA. ...
Article
Full-text available
This work is part of the Psypocket project, which aims to conceive an embedded system able to recognize the stress state of an individual based on physiological and behavioural modifications. In this paper, one of the physiological data sources, the electrocardiographic (ECG) signal, is focused on. The QRS complex is the most significant segment in this signal. By detecting its position, the heart rate can be learnt. In this paper, a field-programmable gate array (FPGA) architecture for QRS complex detection is proposed. The detection algorithm adopts the integer Haar transform for ECG signal filtering and a maximum-finding strategy to detect the location of the R peak of the QRS complex. The ECG data are originally recorded as double-precision decimals with a sampling frequency of 2000 Hz. For the FPGA implementation, they should be converted to integers with a rounding operation. To find the best multiplying factor for rounding, the comparison is performed in MATLAB. Besides, to reduce the computation load in the FPGA, the feasibility of a reduction in the sampling frequency is tested in MATLAB. The FPGA Cyclone EP3C5F256C6 is used as the target chip, and all the components of the system are implemented in VHSIC hardware description language. The testing results show that the proposed FPGA architecture achieves a high detection accuracy (98.41%) and a good design efficiency in terms of silicon consumption and operation speed. The proposed architecture will be adopted as a core unit to make an FPGA system for stress recognition.
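The abstract describes filtering the ECG with an integer Haar transform and locating R peaks with a maximum-finding strategy. The Python sketch below shows one Haar decomposition level and a windowed maximum search; the decomposition depth, window length, and the lack of any amplitude threshold are simplifications added for illustration, not the FPGA design itself.

```python
# Illustrative QRS-detection sketch: one level of an integer Haar transform for
# smoothing, followed by windowed maximum finding. Decomposition depth and window
# length are placeholders, not the parameters of the actual FPGA design.
import numpy as np

def integer_haar_level(x):
    """One level of the integer Haar transform (approximation, detail)."""
    x = x[: len(x) // 2 * 2]                   # truncate to even length
    approx = (x[0::2] + x[1::2]) // 2          # integer averaging (low-pass)
    detail = x[0::2] - x[1::2]                 # differences (high-pass)
    return approx, detail

def find_r_peaks(ecg_int, fs=2000, window_s=0.6):
    """Return sample indices of the maximum in each non-overlapping window."""
    approx, _ = integer_haar_level(np.asarray(ecg_int, dtype=np.int64))
    win = int(window_s * fs / 2)               # samples per window at the halved rate
    peaks = []
    for start in range(0, len(approx) - win + 1, win):
        local = start + int(np.argmax(approx[start:start + win]))
        peaks.append(local * 2)                # map back to the original sampling grid
    return peaks
```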
... Activities and locations can be inferred based on characteristic ambient sound patterns, e.g. walking on the streets, or eating in a restaurant [19]. Unauthorized access to GPS sensors can pose obvious risks related to loss of location privacy, such as revealing home/work locations, stalking and location-targeted advertisements [20]. ...
Article
Full-text available
Smartwatches enable many novel applications and are fast gaining popularity. However, the presence of a diverse set of on-board sensors provides an additional attack surface to malicious software and services on these devices. In this paper, we investigate the feasibility of key press inference attacks on handheld numeric touchpads by using smartwatch motion sensors as a side-channel. We consider different typing scenarios, and propose multiple attack approaches to exploit the characteristics of the observed wrist movements for inferring individual key presses. Experimental evaluation using commercial off-the-shelf smartwatches and smartphones show that key press inference using smartwatch motion sensors is not only fairly accurate, but also comparable with similar attacks using smartphone motion sensors. Additionally, hand movements captured by a combination of both smartwatch and smartphone motion sensors yields better inference accuracy than either device considered individually.
... Smartphone inertial sensors have been exploited to infer keystrokes on the phone itself as well as external keyboards [10], [31], [12], [50], [37], to track user movements and locations [18], [20], [35], to infer private user activities [36] and to decode human speech [32]. Similarly, smartphone microphone and/or magnetometer have also been exploited to infer private user information [40] or trade secrets (such as 3D-printer designs) [21], [42], [17], private user activities [39] and natural handwriting [52]. Recently, aggregate power usage over a period of time, available from the smartphone's power meter, was used to track user movements and locations [33]. ...
Article
Full-text available
Wrist-wearables such as smartwatches and fitness bands are equipped with a variety of high-precision sensors that enable collection of rich contextual information related to the wearer and his/her surroundings and support a variety of novel context- and activity-based applications. The presence of such a diverse set of on-board sensors, however, also expose an additional attack surface which, if not adequately protected, could be potentially exploited to leak private user information. In this paper, we comprehensively investigate the feasibility of a new vulnerability that attempts to take advantage of a wrist-wearable's seemingly innocuous and poorly regulated motion sensors to infer a user's input on mechanical devices typically used to secure physical access, for example, combination locks. In this direction, we outline two motion-based inference frameworks: i) a deterministic attack framework that attempts to infer a lock's unlock combination from the wrist motion (specifically, angular displacement) data obtained from a wrist-wearable's gyroscope sensor, and ii) a probabilistic attack framework that extends the output of the deterministic framework to produce a ranked list of likely unlock combinations. Further, we conduct a thorough empirical evaluation of the proposed frameworks by employing unlocking-related motion data collected from human subject participants in a variety of controlled and realistic settings. Evaluation results from these experiments demonstrate that motion data from wrist-wearables can be effectively employed as an information side-channel to significantly reduce the unlock combination search-space of commonly-found combination locks, thus compromising the physical security provided by these locks.
... Moreover, the AmbientSense system [119] can recognize 23 different contexts (e.g., coffee machine, raining, restaurant, dishwasher, toilet flush, etc.) by analyzing ambient sounds sampled from the phone. In addition, the RoomSense system in [120] uses active sound probing to classify the type of room (e.g., corridor, kitchen, lecture room, etc.) where the user is located. ...
... Efforts to duplicate these abilities on computers have been particularly intense in the area of audio signal recognition. These efforts began with speech-based applications (Rabiner and Juang, 1993) and were later extended to other audio recognition tasks, ranging from music analysis (Muller et al., 2011) to the problem of analyzing general "ambient" audio (Rossi et al., 2013). ...
Thesis
Full-text available
Nowadays, there are many applications related to machine vision and hearing which try to reproduce human capabilities on machines. These problems largely reduce to temporal signal classification, hence our interest in this subject. In fact, we were interested in two distinct problems: human gait recognition and audio signal recognition, covering both environmental and music signals. In the former, we proposed a novel method to automatically learn and select the dynamic human body parts to tackle the problem of intra-class variations, in contrast to state-of-the-art methods which relied on predefined knowledge. To achieve this, a group fused lasso algorithm is applied to segment the human body into parts with coherent motion values across the subjects. In the latter, since no conventional feature representation has shown the ability to tackle both environmental and music problems, we propose to model audio classification as a supervised dictionary learning problem. This is done by learning a dictionary per class and encouraging dissimilarity between the dictionaries by penalizing their pairwise similarities. In addition, the coefficients of a signal's representation over these dictionaries are sought to be as sparse as possible. The experimental evaluations provide promising and encouraging results.
... Many studies on recognizing an environment with acoustic features have been reported. Rossi et al. [9] developed a real-time ambient sound recognition system for smartphones. Heittola et al. [3] presented a method for audio context recognition for classification of everyday environments. ...
Conference Paper
Full-text available
In wearable computing environments, various sensors enable the recording of user context as a life log. For enhancing the experience in a museum or exhibition, a voice recorder is useful for recording users' comments and providing explanations of the exhibits. However, to utilize the sound log after the event, the voice recorder should attach appropriate tags to the sound data so that the intended data can be extracted from the long-term sound log. In a previous study, we proposed an activity and context recognition method where the user carries a neck-worn receiver comprising a microphone and wears small speakers on his/her wrists that generate ultrasound. Our system also recognizes the location of the user and the people who are near the user by ultrasonic ID signals generated from speakers placed in rooms and on people. In this study, we focus on embedding location/person IDs into sound logs by using ultrasonic IDs. We used the proposed method during a real event, and we confirmed that the wearing position of the microphone affects the accuracy of ultrasonic ID recognition because the acquired volume of ultrasound differs at each position. We propose a new method for recognizing the position of the microphone to improve the accuracy of ID recognition, and we used the improved method at another real event. Evaluation results confirmed that the accuracy of position recognition was 84.7%, and the accuracy of ID recognition was 68.0% for the proposed method, improved from 55.2% for the conventional method.
... Ambient sensors like temperature, humidity, pressure, and light have been used to label the user's location directly as being in the kitchen, bedroom, bathroom, or living room [30]. Moreover, the AmbientSense system [37] can recognize 23 different contexts (e.g., coffee machine, raining, restaurant, dishwasher, toilet flush, etc.) by analyzing ambient sounds sampled from the phone. In addition, the RoomSense system in [38] uses active sound probing to classify the type of room (e.g., corridor, kitchen, lecture room, etc.) where the user is located. ...
Conference Paper
Full-text available
We present TransitLabel, a crowd-sensing system for automatic enrichment of transit stations indoor floorplans with different semantics like ticket vending machines, entrance gates, drink vending machines, platforms, cars' waiting lines, restrooms, lockers, waiting (sitting) areas, among others. Our key observations show that certain passengers' activities (e.g., purchasing tickets, crossing entrance gates, etc) present identifiable signatures on one or more cell-phone sensors. TransitLabel leverages this fact to automatically and unobtrusively recognize different passengers' activities, which in turn are mined to infer their uniquely associated stations semantics. Furthermore, the locations of the discovered semantics are automatically estimated from the inaccurate passengers' positions when these semantics are identified. We evaluate TransitLabel through a field experiment in eight different train stations in Japan. Our results show that TransitLabel can detect the fine-grained stations semantics accurately with 7.7% false positive rate and 7.5% false negative rate on average. In addition , it can consistently detect the location of discovered semantics accurately, achieving an error within 2.5m on average for all semantics. Finally, we show that TransitLabel has a small energy footprint on cell-phones, could be generalized to other stations, and is robust to different phone placements; highlighting its promise as a ubiquitous indoor maps enriching service.
... To our knowledge, only a few researchers have focused on using mobile phone sensors to recognize scenes. AmbientSense [9] is such a system, which was implemented on smartphones. Based on audio features collected by smartphones, AmbientSense could recognize 23 daily-life ambient sound classes. ...
Conference Paper
Smartphones evolve rapidly and are becoming more powerful in computing capabilities. More importantly, they are becoming smarter as more sensors such as the accelerometer, gyroscope, compass, and camera have been embedded on the digital board. In this paper, we propose a novel framework to recognize public scenes based on the sensors embedded in mobile phones. We first build individual models for audio, light, WiFi, and Bluetooth, then integrate these sub-models using dynamically weighted majority voting. We consider two factors when deciding the voting weight. One factor is the recognition rate of each sub-model and the other is the recognition precision of the sub-model in specific scenes. We built the data-collecting app on an Android phone and implemented the recognition algorithm on a Linux server. Evaluation on data collected in a bar, cafe, elevator, library, subway station, and office shows that the ensemble recognition model is more accurate and robust than each individual sub-model. We achieved 83.33% recognition accuracy (13.33% higher than the audio sub-model) when we evaluated the ensemble model on the test dataset.
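The abstract combines the audio, light, WiFi, and Bluetooth sub-models by dynamically weighted majority voting, with weights derived from each sub-model's recognition rate and its precision in specific scenes. The sketch below is one plausible reading of that rule; the exact weight formula is not stated in the abstract, so the product of accuracy and per-scene precision is an assumption.

```python
# Plausible sketch of dynamically weighted majority voting over sub-models
# (audio, light, wifi, bluetooth). Here a sub-model's vote is weighted by its
# overall accuracy times its precision for the scene it currently predicts;
# the paper's exact formula is not given in the abstract.
from collections import defaultdict

def weighted_vote(predictions, accuracy, precision):
    """
    predictions: {"audio": "cafe", "light": "bar", ...}
    accuracy:    {"audio": 0.70, ...}                  overall recognition rate
    precision:   {"audio": {"cafe": 0.8, ...}, ...}    per-scene precision
    """
    scores = defaultdict(float)
    for model, scene in predictions.items():
        scores[scene] += accuracy[model] * precision[model].get(scene, 0.0)
    return max(scores, key=scores.get)

# Example with made-up numbers
predictions = {"audio": "cafe", "light": "cafe", "wifi": "library", "bluetooth": "cafe"}
accuracy = {"audio": 0.70, "light": 0.55, "wifi": 0.65, "bluetooth": 0.50}
precision = {"audio": {"cafe": 0.8}, "light": {"cafe": 0.6},
             "wifi": {"library": 0.7}, "bluetooth": {"cafe": 0.5}}
print(weighted_vote(predictions, accuracy, precision))   # -> "cafe"
```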
... For example, in Ref. [3], bathroom activities such as showering, flushing, and urination were recognized using microphone data. Also, several studies recognize daily activities with microphones in smartphones by recognizing environmental sounds such as the sound of vacuuming and the sound of running water [19], [23]. ...
Article
This paper presents a method for evaluating toothbrushing performance using audio data collected by a smartphone. This method first conducts activity recognition on the audio data to classify segments of the data into several classes based on the brushing location and type of brush stroke. These recognition results are then used to compute several independent variables which are used as input to an SVM regression model, with the dependent variables for the SVM model derived from evaluation scores assigned to each session of toothbrushing by a dentist who specializes in dental care instruction. Using this combination of audio-based activity recognition and SVM regression, our method is able to take smartphone audio data as input and output evaluation score estimates that closely correspond to the evaluation scores assigned by the dentist participating in our research.
... There are two perspectives in context-aware applications, that is, the user's context and the phone's context. The user's context mainly focuses on the detection of the user's current location (e.g., home, office, railway station, restaurant, street, supermarket, etc.) [10] [11] and their real-time activities (such as sitting, walking, running, bicycling, driving, cutting bread, making coffee, watching television, working at a laptop, taking lunch, using a water tap, brushing teeth, etc.) [7][8] [9] using a smartphone. ...
Conference Paper
Full-text available
In recent times, researchers have proposed numerous approaches that allow smartphones to determine user current locations (e.g., home, office, railway station, restaurant, street, supermarket etc.) and their activities (such as sitting, walking, running, bicycling, driving, cutting bread, making coffee, watching television, working at laptop, taking lunch, using water tap, brushing teeth, flushing toilet etc.) in real-time. But, to infer much richer story of context-aware applications, it is necessary to recognize the smartphone surfaces - for example on the sofa, inside the backpack, on the plastic chair, in a drawer or in your pant-pocket. This paper presents SurfaceSense, a two-tier, simple, inexpensive placement-aware technique, that uses smartphone’s embedded accelerometer, gyroscope, magnetometer, microphone and proximity sensor to infer where phone is placed. It does not require any external hardware and provides 91.75% recognition accuracy on 13 different surfaces.
Article
Many applications utilize sensors on mobile devices and apply deep learning for diverse applications. However, they have rarely enjoyed mainstream adoption due to many different individual conditions users encounter. Individual conditions are characterized by users’ unique behaviors and different devices they carry, which collectively make sensor inputs different. It is impractical to train countless individual conditions beforehand and we thus argue meta-learning is a great approach in solving this problem. We present MetaSense that leverages “seen” conditions in training data to adapt to an “unseen” condition (i.e., the target user). Specifically, we design a meta-learning framework that learns “how to adapt” to the target via iterative training sessions of adaptation. MetaSense requires very few training examples from the target (e.g., one or two) and thus requires minimal user effort. In addition, we propose a similar condition detector (SCD) that identifies when the unseen condition has similar characteristics to seen conditions and leverages this hint to further improve the accuracy. Our evaluation with 10 different datasets shows that MetaSense improves the accuracy of state-of-the-art transfer learning and meta learning methods by 15 and 11 percent, respectively. Furthermore, our SCD achieves additional accuracy improvement (e.g., 15 percent for human activity recognition).
Chapter
The evergrowing populations and increasing human activities are associated with a major increase in environmental pollution. However, their impact is still poorly understood due to the pervasiveness and spatiotemporal complexity of the phenomenon. Conventional approaches to environmental monitoring are based on networks of sparse measurement stations or in situ human-operated measurements. However, these are prohibitively expensive to capture the spatiotemporal heterogeneity of most of the environmental pollution phenomenon. Current advancements in the wireless sensor network (WSN) technology are radically changing the conventional approach, allowing for the capture of real-time information in a capillary form.
Chapter
The contextual status of mobile devices is fundamental information for many smart city applications. In this paper we present AudioIO, an active sound probing based method to tackle the problem of Indoor Outdoor (IO) detection for smartphones. We utilize the embedded speaker and microphone to emit a probing signal and collect the reverberation of the surrounding environment. An SVM classifier is trained on the features extracted from the reverberation. We test its performance in various scenarios with different probing signals (MLS and chirp), noise levels, and device types. AudioIO achieves above 90% accuracy for both MLS and chirp signals at all tested noise levels and with all device types.
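AudioIO emits a probing signal (MLS or chirp) from the speaker, records the room's reverberation with the microphone, and classifies indoor versus outdoor with an SVM on features of the reverberation. The sketch below generates a chirp probe and a crude energy-decay feature; the probe duration, frequency band, and feature are assumptions for illustration, not the paper's configuration.

```python
# Illustrative sketch of chirp-based active probing: generate a probe and summarize
# the recorded reverberation by its energy decay. Probe duration, frequency band,
# and the decay feature are assumptions, not the setup described in the paper.
import numpy as np
from scipy.signal import chirp

FS = 44100
t = np.linspace(0.0, 0.05, int(0.05 * FS), endpoint=False)
probe = chirp(t, f0=2000.0, f1=18000.0, t1=t[-1], method="linear")

def decay_feature(recording, n_segments=10):
    """Log-energy of consecutive segments after the probe: a crude decay curve."""
    seg = len(recording) // n_segments
    energies = [np.sum(recording[i * seg:(i + 1) * seg] ** 2) + 1e-12
                for i in range(n_segments)]
    return np.log(np.asarray(energies))
# The decay-curve features would then be fed to an SVM classifier (indoor vs. outdoor).
```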
Chapter
From biometric image acquisition to matching to decision making, designing a selfie biometric system is riddled with security, privacy, and usability challenges. In this chapter, we provide a discussion of some of these challenges, examine some real-world examples, and discuss both existing solutions and potential new solutions. The majority of these issues will be discussed in the context of mobile devices, as they comprise a major platform for selfie biometrics; face, voice, and fingerprint biometric modalities are the most popular modalities used with mobile devices.
Preprint
Full-text available
Human perception of surrounding events is strongly dependent on audio cues. Thus, acoustic insulation can seriously impact situational awareness. We present an exploratory study in the domain of assistive computing, eliciting requirements and presenting solutions to problems found in the development of an environmental sound recognition system that aims to assist deaf and hard-of-hearing people in the perception of sounds. To take advantage of smartphones' computational ubiquity, we propose a system that executes all processing on the device itself, from audio feature extraction to recognition and visual presentation of results. Our application also presents the confidence level of the classification to the user. A test of the system conducted with deaf users provided important and inspiring feedback from participants.
Chapter
In this paper, we propose a framework called conversational partner inference using nonverbal information (abbreviated as CFN). We use a wrist-worn wearable device with an accelerometer to detect the user's hand movement. In addition, we propose three different methods, named leading CFN, trailing CFN, and leading-trailing CFN, to integrate the detected movement behaviors with the sound data sensed by microphones to effectively infer conversational partners. In experiments, we collect real data to evaluate the proposed framework. The experimental results show that the accuracy of leading CFN is better than that of trailing CFN and leading-trailing CFN. Moreover, our approach achieves higher accuracy than the state-of-the-art approach for conversational partner inference.
Conference Paper
In this paper, we describe an automatic audio quality recognition architecture for radio broadcast news based on audio feature selection, using the discrimination ability of the audio descriptors as the selection criterion. Specifically, we labeled streams of broadcast news transmissions according to their audio quality based on human auditory perception. Parameterization algorithms extract a large set of audio descriptors, and data-driven criteria rank the descriptors' relevance. The selected feature subsets are then fed to machine learning algorithms for classification. This methodology showed that the k-nearest neighbor classifier provides good results in terms of achieved accuracy. Moreover, the experimental framework verifies the assumption that discarding irrelevant audio descriptors before the classification stage improves overall identification performance.
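A minimal sketch of the described pipeline, feature ranking followed by a k-nearest-neighbor classifier, could be written with scikit-learn as below; the ANOVA F-score ranking and the parameter values are illustrative assumptions, not the paper's exact data-driven criteria.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier

def build_quality_classifier(k_best=20, k_neighbors=5):
    """Rank descriptors, keep the k_best most discriminative, classify with k-NN.
    X (descriptor matrix) and y (perceived-quality labels) are assumed given:
        build_quality_classifier().fit(X, y)"""
    return make_pipeline(
        SelectKBest(score_func=f_classif, k=k_best),    # data-driven descriptor ranking
        KNeighborsClassifier(n_neighbors=k_neighbors),  # k-NN on the reduced feature set
    )
```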
Article
Automatic recognition of behavioral context (location, activities, body posture, etc.) can serve health monitoring, aging care, and many other domains. Recognizing context in-the-wild is challenging because of great variability in behavioral patterns, and it requires a complex mapping from sensor features to predicted labels. Data collected in-the-wild may be unbalanced and incomplete, with cases of missing labels or missing sensors. We propose using the multiple layer perceptron (MLP) as a multi-task model for context recognition. Based on features from multi-modal sensors, the model simultaneously predicts many diverse context labels. We analyze the advantages of the model's hidden layers, which are shared among all sensors and all labels, and provide insight into the behavioral patterns that these hidden layers may capture. We demonstrate how recognition of new labels can be improved when utilizing a model that was trained for an initial set of labels, and show how to train the model to withstand missing sensors. We evaluate context recognition on the previously published ExtraSensory Dataset, which was collected in-the-wild. Compared to previously suggested models, the MLP improves recognition, even with fewer parameters than a linear model. The ability to train a good model using data that has incomplete, unbalanced labeling and missing sensors encourages further research with uncontrolled, in-the-wild behavior.
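A compact way to realize a shared-hidden-layer multi-task MLP that tolerates missing labels is to mask the per-label loss, as in the following PyTorch sketch; the layer sizes and the masking scheme are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class MultiTaskMLP(nn.Module):
    """Hidden layers shared by all sensors and labels; one sigmoid output per label."""
    def __init__(self, n_features, n_labels, hidden=(128, 64)):
        super().__init__()
        layers, d = [], n_features
        for h in hidden:
            layers += [nn.Linear(d, h), nn.ReLU()]
            d = h
        self.body = nn.Sequential(*layers)
        self.head = nn.Linear(d, n_labels)

    def forward(self, x):
        return self.head(self.body(x))          # logits, one per context label

def masked_bce(logits, targets, mask):
    """Ignore missing labels: mask[i, j] = 1 only where label j is reported for sample i."""
    loss = nn.functional.binary_cross_entropy_with_logits(logits, targets,
                                                          reduction="none")
    return (loss * mask).sum() / mask.sum().clamp(min=1)
```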
Article
Full-text available
Detecting the environment in which the user is located is highly useful for the identification of Activities of Daily Living (ADL). ADL can be identified using the sensors available in many off-the-shelf mobile devices, including magnetic and motion sensors, while the environment can also be identified using acoustic sensors. The study presented in this paper is divided into two parts: firstly, we discuss the recognition of the environment using acoustic sensors (i.e., the microphone), and secondly, we fuse this information with motion and magnetic sensor data for the recognition of standing activities of daily living. The recognition of the environments and the ADL is performed using pattern recognition techniques, in order to develop a system that includes data acquisition, data processing, data fusion, and artificial intelligence methods. The artificial intelligence methods explored in this study comprise different types of Artificial Neural Networks (ANN); the different types of ANN are compared and the best methods selected for implementation in the different stages of the developed system. Conclusions point to the use of Deep Neural Networks (DNN) with normalized data for the identification of ADL with 85.89% accuracy, the use of feedforward neural networks with non-normalized data for the identification of environments with 86.50% accuracy, and the use of DNN with normalized data for the identification of standing activities with 100% accuracy.
Chapter
Motivated by applications in nutritional epidemiology and food journaling, computing researchers have proposed numerous techniques for automating dietary monitoring over the years. Although progress has been made, a truly practical system that can automatically recognize what people eat in real-world settings remains elusive. Eating detection is a foundational element of automated dietary monitoring (ADM) since automatically recognizing when a person is eating is required before identifying what and how much is being consumed. Additionally, eating detection can serve as the basis for new types of dietary self-monitoring practices such as semi-automated food journaling. This chapter discusses the problem of automated eating detection and presents a variety of practical techniques for detecting eating activities in real-world settings. These techniques center on three sensing modalities: first-person images taken with wearable cameras, ambient sounds, and on-body inertial sensors [34, 35, 36, 37]. The chapter begins with an analysis of how first-person images reflecting everyday experiences can be used to identify eating moments using two approaches: human computation and convolutional neural networks. Next, we present an analysis showing how certain sounds associated with eating can be recognized and used to infer eating activities. Finally, we introduce a method for detecting eating moments with on-body inertial sensors placed on the wrist.
Article
Full-text available
Audio sensing has been applied in various mobile applications for sensing personal and environmental information to improve users' quality of life. However, the quality of audio sensing is seriously degraded when the sensing service operates in an incorrect context or when the acoustic sensing capability is limited (e.g., aging effects of the microphone or interference due to hand covering). To address this challenge, we present CondioSense, a CONtext-aware service for auDIO SENSing, which identifies the current phone context (i.e., pocket, bag, car, indoor, and outdoor) and detects the microphone sensing state. The main idea behind context detection is to extract multipath features from an actively generated acoustic signal, since space size and materials differ among contexts. The sound of physical vibration is exploited for detecting the microphone sensing state, leveraging the fact that the frequency response of the recorded vibration sound changes when signal propagation through the air is blocked by covering the microphone. We prototype CondioSense on smartphones as an application and perform extensive evaluations. It recognizes the various phone contexts with an accuracy exceeding 92% and detects microphone sensing states with an accuracy exceeding 90%.
Conference Paper
Full-text available
Fine-grained monitoring of everyday appliances can provide better feedback to consumers and motivate them to change behavior in order to reduce their energy usage. It also helps detect abnormal power consumption events, long-term appliance malfunctions, and potential safety concerns. Commercially available plug meters can be used for individual appliance monitoring, but instrumenting an entire house with such individual plug meters is expensive and tedious to set up. Alternative methods relying on Non-Intrusive Load Monitoring techniques help disaggregate electricity consumption data and learn individual appliances' power states and signatures. However, fine-grained events (e.g., appliance malfunctions, abnormal power consumption) remain undetected, and thus inferred contexts (such as safety hazards) become invisible. In this work, we correlate an appliance's inherent acoustic noise with its energy consumption pattern, both individually and in the presence of multiple appliances. We initially investigate classification techniques to establish the relationship between appliance power and acoustic states for efficient energy disaggregation and abnormal event detection. While promising, this approach fails when multiple appliances are simultaneously in the 'ON' state. To further improve the accuracy of our energy disaggregation algorithm, we propose a probabilistic graphical model, based on a variation of the Factorial Hidden Markov Model (FHMM), for multi-appliance energy disaggregation. We combine our probabilistic model with the appliances' acoustic analytics and postulate a hybrid model for energy disaggregation. Our approach helps improve the performance of energy disaggregation algorithms and provides critical insights on appliance longevity, abnormal power consumption, consumer behavior, and everyday lifestyle activities. We evaluate the performance of our proposed algorithms on real data traces and show that the fusion of acoustic and power signatures can successfully detect a number of appliances with 95% accuracy.
Conference Paper
Smartphones have become ubiquitous in recent years and offer many useful services to their users, such as notifications about incoming calls and messages, or real-time news updates. These notifications, however, do not consider the current user's and phone's context. As a result, they can disturb users in important meetings or remain unnoticed in noisy environments. In this paper, we therefore propose an approach to infer the phone's context based on its vibration motor. To this end, we trigger the phone's vibration motor for short time periods and measure the response of its environment using the built-in microphone and/or accelerometers. Our evaluation shows that leveraging accelerometers allows the current phone context to be recognized with an accuracy of more than 99%. As a result, our proposed solution outperforms our previous work based on played and recorded ringtones in terms of classification performance, user annoyance, and potential privacy threats.
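The abstract does not detail the features computed from the accelerometer response, so the sketch below only illustrates the general recipe (simple time- and frequency-domain statistics of the response to a vibration pulse, fed to an off-the-shelf classifier); all names, parameters, and context labels are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def response_features(accel_window):
    """Simple statistics of the accelerometer response to a vibration pulse.
    accel_window: (N, 3) array of x/y/z samples recorded while the motor vibrates."""
    mag = np.linalg.norm(accel_window, axis=1)          # per-sample magnitude
    spec = np.abs(np.fft.rfft(mag - mag.mean()))        # coarse spectrum of the response
    return np.concatenate([[mag.mean(), mag.std(), mag.max()], spec[:16]])

def train_context_classifier(windows, labels):
    """windows: list of accelerometer windows, one per vibration probe;
    labels: contexts such as 'table', 'pocket', 'bag' (illustrative)."""
    X = np.vstack([response_features(w) for w in windows])
    return RandomForestClassifier(n_estimators=100).fit(X, labels)
```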
Conference Paper
The development of ambient assistance systems and the optimization of energy consumption in home environments are among the main goals of ambient intelligence systems. In this work we propose a wearable standalone solution that combines the assistance task and the energy optimization task. For this purpose we develop a real-time mobile sound-based device and activity recognizer that senses the audible part of the environment to support its owner during daily tasks and to help optimize them in terms of resource consumption.
Article
Sound captured by a mobile phone's microphone is a rich source of contextual information about activity, location, and social events. In this paper, we present a robust sound classification system for recognizing the real-time context of a smartphone user. Our system can reduce unnecessary computations by discarding frames containing silence or white noise from the input audio stream in the pre-processing step. It also improves the classification performance on low energy sounds by amplifying them. Moreover, for efficient learning and application of HMM classification models, our system executes the dimension reduction and discretization on the set of multi-dimensional continuous-valued feature vectors through k-means clustering. We collected a large set of sound examples of 8 different types from daily life in a university office environment and then conducted experiments using them. Through these experiments, our system showed high classification performance.
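The k-means based dimension reduction and discretization step described above can be illustrated as vector quantization of frame-level features into a symbol codebook, as in the sketch below; the codebook size is an assumption, and the subsequent discrete-HMM training per sound class is not shown.

```python
from sklearn.cluster import KMeans

def build_codebook(frame_features, n_symbols=32):
    """Vector-quantize continuous multi-dimensional feature vectors into discrete
    symbols (the k-means step before training discrete HMMs, one per sound class).
    frame_features: (n_frames, n_dims) array pooled over the training recordings."""
    return KMeans(n_clusters=n_symbols, n_init=10, random_state=0).fit(frame_features)

def quantize(codebook, frames):
    """Map a sequence of feature frames to a sequence of codebook indices."""
    return codebook.predict(frames)
```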
Conference Paper
We developed a new feature extraction algorithm based on the Amplitude Modulation Spectrum (AMS), which mainly consists of two filter bank stages composed of low-order recursive filters. The passband range of each filter was optimized using the Covariance Matrix Adaptation Evolution Strategy (CMA-ES). The classification task was accomplished by a Linear Discriminant Analysis (LDA) classifier. To evaluate the performance of the proposed acoustic scene classifier based on AMS features, we tested it with the publicly available dataset provided by the IEEE AASP Challenge 2013. Using only 9 optimized AMS features, we achieved 85% classification accuracy, outperforming the best previously available approaches by 10%.
Article
A lot of personal daily contexts and activities may be inferred by analyzing acoustic signals in vicinity. Conversations play an important role in one's social communications. In this work, we consider the inference of conversation partners via acoustic sensing conducted by a group of smartphones in vicinity. By considering the continuity and overlap of speeches, we propose novel inference methods to identify conversational relationships among co-located users. In our system, each smartphone individually processes the acoustic data to understand its owner's talking turns and emotions. Via direct wireless communications, smartphones then cooperatively conduct the inference to retrieve conversational groups. Compared to existing work, which only exploits peer-to-peer conversational relationships, our approach is able to capture group conversational relationships in a more real-time manner. A prototype on Android smartphones is demonstrated to verify the feasibility of our approach. We also collect conversation data from movie clips and real life with 2 to 14 speakers to validate our result, which shows promising performance.
Conference Paper
Full-text available
This paper presents a system for acoustic event detection in recordings from real-life environments. The events are modeled using a network of hidden Markov models; their size and topology are chosen based on a study of isolated event recognition. We also studied the effect of ambient background noise on event classification performance. On real-life recordings, we tested recognition of isolated sound events and event detection. For event detection, the system performs recognition and temporal positioning of a sequence of events. An accuracy of 24% was obtained in classifying isolated sound events into 61 classes. This corresponds to the accuracy of classifying between 61 events when mixed with ambient background noise at 0 dB signal-to-noise ratio. In event detection, the system is capable of recognizing almost one third of the events, and the temporal positioning of the events is not correct 84% of the time.
Conference Paper
Full-text available
We present the design, implementation, evaluation, and user experiences of the CenceMe application, which represents the first system that combines the inference of the presence of individuals using off-the-shelf, sensor-enabled mobile phones with sharing of this information through social networking applications such as Facebook and MySpace. We discuss the system challenges for the development of software on the Nokia N95 mobile phone. We present the design and tradeoffs of split-level classification, whereby personal sensing presence (e.g., walking, in conversation, at the gym) is derived from classifiers which execute in part on the phones and in part on the backend servers to achieve scalable inference. We report performance measurements that characterize the computational requirements of the software and the energy consumption of the CenceMe phone client. We validate the system through a user study where twenty-two people, including undergraduates, graduates and faculty, used CenceMe continuously over a three-week period in a campus town. From this user study we learn how the system performs in a production environment and what uses people find for a personal sensing system.
Article
Full-text available
This paper discusses the interaction techniques developed for Nomadic Radio, a wearable computing platform for managing voice and text-based messages in a nomadic environment. Nomadic Radio employs an auditory user interface, which synchronizes speech recognition, speech synthesis, non-speech audio and spatial presentation of digital audio, for navigating among messages as well as asynchronous alerting of newly arrived messages. Emphasis is placed on an auditory modality as Nomadic Radio is designed to be used while performing other tasks in a user's everyday environment; a range of auditory cues provide peripheral awareness of incoming messages. Notification is adaptive and context sensitive; messages are presented as more or less obtrusive based on importance inferred from content filtering, whether the user is engaged in conversation and her recent responses to prior messages. Auditory notifications are dynamically scaled from ambient sound through recorded voice cues up to message
Conference Paper
Full-text available
In content-based audio retrieval, the goal is to find sound recordings (audio documents) based on their acoustic features. This content-based approach differs from retrieval approaches that index media files using metadata such as file names and user tags. In this paper, we propose a machine learning approach for retrieving sounds that is novel in that it (1) uses free-form text queries rather than sound-sample-based queries, (2) searches by audio content rather than via textual metadata, and (3) can scale to a very large number of audio documents and a very rich query vocabulary. We address the problem of handling generic sounds, including a wide variety of sound effects, animal vocalizations, and natural scenes. We test a scalable approach based on a passive-aggressive model for image retrieval (PAMIR), and compare it to two state-of-the-art approaches: Gaussian mixture models (GMM) and support vector machines (SVM). We test our approach on two large real-world datasets: a collection of short sound effects, and a noisier and larger collection of user-contributed, user-labeled recordings (25K files, 2000-term vocabulary). We find that all three methods achieved very good retrieval performance. For instance, a positive document is retrieved in the first position of the ranking more than half the time, and on average there are more than 4 positive documents in the first 10 retrieved, for both datasets. PAMIR completed both training and retrieval of all data in less than 6 hours for both datasets, on a single machine. It was one to three orders of magnitude faster than the competing approaches. This approach should therefore scale to much larger datasets in the future. NOTE: Please do not redistribute without permission.
Conference Paper
Full-text available
The paper presents a prototype of a wearable, sound-analysis based, user activity recognition device. It focuses on low-power realization suitable for a miniaturized implementation. We describe a tradeoff analysis between recognition performance and computation complexity. Furthermore, we present the hardware prototype and the experimental evaluation of its recognition performance. This includes frame by frame recognition, event detection in a continuous data stream and the influence of background noise.
Article
LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users to easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
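For readers who want to try LIBSVM without calling its C API directly, scikit-learn's SVC is built on LIBSVM and exposes the probability estimates and parameters discussed in the paper; the grid values below are illustrative assumptions.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC   # scikit-learn's SVC wraps LIBSVM internally

def fit_rbf_svm(X, y):
    """Grid-search C and gamma and enable probability estimates, two of the
    practical issues discussed in the LIBSVM paper."""
    grid = {"C": [1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]}
    search = GridSearchCV(SVC(kernel="rbf", probability=True), grid, cv=5)
    return search.fit(X, y).best_estimator_
```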
Article
LIBSVM is a library for support vector machines (SVM). Its goal is to help users to easily use SVM as a tool. In this document, we present all its implementation details. For the use of LIBSVM, the README file included in the package and the LIBSVM FAQ provide the information.
Article
This paper introduces a collaborative personal speaker identification system to annotate conversations and meetings using speech-independent speaker modeling and one audio channel. This system can operate in standalone and collaborative modes, and can learn online about speakers that were detected as unknown. In collaborative mode, the system exchanges current speaker information with the personal systems of others to improve identification performance. Our collaboration concept is based on distributed personal systems only; hence it does not require a specific infrastructure to operate. We present a generalized description of collaboration situations and derive three use scenarios in which the system was subsequently evaluated. Compared to standalone operation, collaboration among four personal identification systems increased system performance by up to 9% for 4 relevant speakers and up to 21% for 24 relevant speakers. Allowing unknown speakers in a conversation did not impede the performance gains of collaboration. In a scenario where individual systems had nonidentical speaker sets, collaboration gains were 16% for 24 relevant speakers.
Conference Paper
We investigate how different locations inside clothing influence the ability of a system to recognize activity relevant sounds. Specifically, we consider the recognition of sounds from 9 household and office appliances recorded using an iPhone placed in 2 trouser pockets, 2 jacket pockets, a belt holster and the users' hand. The aim is not to demonstrate good recognition rates on the above sounds (which has been done many times before) but to compare recognition rates from the individual locations and to understand how to best train the system to be location invariant.
Conference Paper
Top-end mobile phones include a number of specialized (e.g., accelerometer, compass, GPS) and general-purpose sensors (e.g., microphone, camera) that enable new people-centric sensing applications. Perhaps the most ubiquitous and unexploited sensor on mobile phones is the microphone - a powerful sensor that is capable of making sophisticated inferences about human activity, location, and social events from sound. In this paper, we exploit this untapped sensor not in the context of human communications but as an enabler of new sensing applications. We propose SoundSense, a scalable framework for modeling sound events on mobile phones. SoundSense is implemented on the Apple iPhone and represents the first general-purpose sound sensing system specifically designed to work on resource-limited phones. The architecture and algorithms are designed for scalability, and SoundSense uses a combination of supervised and unsupervised learning techniques to classify both general sound types (e.g., music, voice) and discover novel sound events specific to individual users. The system runs solely on the mobile phone with no back-end interactions. Through implementation and evaluation of two proof-of-concept people-centric sensing applications, we demonstrate that SoundSense is capable of recognizing meaningful sound events that occur in users' everyday lives.
Conference Paper
In this paper an automated bathroom activity monitoring system based on acoustics is described. The system is designed to recognize and classify major activities occurring within a bathroom based on sound. Carefully designed HMM parameters using MFCC features are used for accurate and robust bathroom sound event classification. Experiments to validate the utility of the system were performed firstly in a constrained setting as a proof-of-concept and later in an actual trial involving real people using their bathroom in the normal course of their daily lives. Preliminary results are encouraging with the accuracy rate for most sound categories being above 84%. We sincerely believe that the system contributes towards increased understanding of personal hygiene behavioral problems that significantly affect both informal care-giving and clinical care of dementia patients.
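A minimal sketch of the MFCC front end described above, using librosa; the sampling rate and number of coefficients are assumptions, and the per-category HMM training that would follow is omitted.

```python
import librosa

def mfcc_frames(path, sr=16000, n_mfcc=13):
    """Per-frame MFCC features from one recorded bathroom sound clip; one HMM per
    sound category would then be trained on sequences of such frames."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape (n_mfcc, n_frames)
    return mfcc.T                                            # (n_frames, n_mfcc)
```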
Conference Paper
Automatically identifying the person you are talking with using continuous audio sensing has the potential to enable many pervasive computing applications from memory assistance to annotating life logging data. However, a number of challenges, including energy efficiency and training data acquisition, must be addressed before unobtrusive audio sensing is practical on mobile devices. We built SpeakerSense, a speaker identification prototype that uses a heterogeneous multi-processor hardware architecture that splits computation between a low power processor and the phone’s application processor to enable continuous background sensing with minimal power requirements. Using SpeakerSense, we benchmarked several system parameters (sampling rate, GMM complexity, smoothing window size, and amount of training data needed) to identify thresholds that balance computation cost with performance. We also investigated channel compensation methods that make it feasible to acquire training data from phone calls and an automatic segmentation method for training speaker models based on one-to-one conversations.
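The GMM-based identification step can be sketched as one Gaussian mixture per enrolled speaker, scored over a window of frames, as below; the number of mixture components and the feature frames are assumptions, and the channel compensation and low-power duty-cycling described in the paper are not shown.

```python
from sklearn.mixture import GaussianMixture

def train_speaker_models(features_per_speaker, n_components=16):
    """One diagonal-covariance GMM per enrolled speaker, trained on that speaker's
    frame-level features (e.g., MFCCs)."""
    return {spk: GaussianMixture(n_components=n_components,
                                 covariance_type="diag").fit(frames)
            for spk, frames in features_per_speaker.items()}

def identify(models, frames):
    """Return the speaker whose GMM gives the highest average log-likelihood over
    a smoothing window of frames."""
    scores = {spk: gmm.score(frames) for spk, gmm in models.items()}
    return max(scores, key=scores.get)
```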
Article
The aim of this paper is to investigate the feasibility of an audio-based context recognition system. Here, context recognition refers to the automatic classification of the context or an environment around a device. A system is developed and compared to the accuracy of human listeners in the same task. Particular emphasis is placed on the computational complexity of the methods, since the application is of particular interest in resource-constrained portable devices. Simplistic low-dimensional feature vectors are evaluated against more standard spectral features. Using discriminative training, competitive recognition accuracies are achieved with very low-order hidden Markov models (1-3 Gaussian components). Slight improvement in recognition accuracy is observed when linear data-driven feature transformations are applied to mel-cepstral features. The recognition rate of the system as a function of the test sequence length appears to converge only after about 30 to 60 s. Some degree of accuracy can be achieved even with less than 1-s test sequence lengths. The average reaction time of the human listeners was 14 s, i.e., somewhat smaller, but of the same order as that of the system. The average recognition accuracy of the system was 58% against 69%, obtained in the listening tests in recognizing between 24 everyday contexts. The accuracies in recognizing six high-level classes were 82% for the system and 88% for the subjects.
Collaborative personal speaker identification: A generalized approach. Pervasive and Mobile Computing
  • M Rossi
  • O Amft
  • G Tröster