Conference Paper

Alexa vs. Siri vs. Cortana vs. Google Assistant: A Comparison of Speech-Based Natural User Interfaces


Abstract

Natural User Interfaces (NUIs) are supposed to be usable by humans in a very natural, logical way. However, the industry's rush to deploy speech-based NUIs has taken a large toll on the naturalness of such interfaces. This paper presents a usability test of the most prestigious and internationally used speech-based NUIs (i.e., Alexa, Siri, Cortana, and Google Assistant). A comparison of the services each one provides was also performed, considering access to music services, agenda, news, weather, to-do lists, and maps or directions, among others. The test was designed by two Human-Computer Interaction experts and executed by eight people. Results show that even though many services are available, there is a lot to do to improve the usability of these systems, especially in separating them from the traditional use of computers (based on applications that require parameters to function) and getting closer to real NUIs.


... Understandably, working under such situations is often hard to deal with, as it requires teams to put in additional effort to work together effectively. To deal with scenarios such as these, some have suggested that incorporating voice-enabled intelligent assistants (and other AI-enabled technology) can encourage humans to use them for their tasks and perform better (Personal Digital Assistant et al. 2019; López et al. 2018). However, there exists a gap in the research literature that could help us understand how time scarcity interacts with the availability of an intelligent assistant to influence team behaviors and outcomes. ...
... While AI-enabled technologies can be of various types, ranging from robots to algorithms, our focus is on one type: the intelligent assistant. Intelligent assistants often rely on verbal commands that allow users to delegate assignments to them without entirely disengaging from one's current task (Luger and Sellen 2016; Goksel and Mutlu 2016; López et al. 2018). Given these unique abilities and functions of this technology, it is understandable that intelligent assistants are touted as machines that help users do more in less time (Personal Digital Assistant et al. 2019; CIMON brings AI to International Space Station 2019). ...
... The deployment of intelligent technology is often justified with the reason that it can help maximize productivity and improve outcomes (Personal Digital Assistant et al. 2019; López et al. 2018). This implies that among the two alternatives (i.e., using an intelligent assistant or not), the decision to use the technology is the more rational and appropriate choice. ...
Article
Full-text available
Time and technology permeate the fabric of teamwork across a variety of settings to affect outcomes which have a wide range of consequences. However, there is a limited understanding about the interplay between these factors for teams, especially as applied to artificial intelligence (AI) technology. With the increasing integration of AI into human teams, we need to understand how environmental factors such as time scarcity interact with AI technology to affect team behaviors. To address this gap in the literature, we investigated the interaction between the availability of intelligent technology and time scarcity in teams. Drawing from the theoretical perspective of computers are social actors and extant research on the use of heuristics and human–AI interaction, this study uses behavioral data from 56 teams who participated in a between-subjects 2 (intelligent assistant available vs. no intelligent assistant control) × 2 (time scarcity vs. no time scarcity control) lab experiment. Results show that teams working under time scarcity used the intelligent assistant more often and underperformed on a creative task compared to teams without the temporal constraints. Further, teams who had an intelligent assistant available to them had fewer interactions between members compared to teams who did not have the technology. Implications for research and applications are discussed.
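As a rough illustration only (not the authors' analysis script; the column names, performance measure, and toy data below are invented, and the real study had 56 teams), a 2 × 2 between-subjects design like this one is commonly analyzed with a two-way ANOVA including the interaction term:

```python
# Hypothetical sketch of a two-way ANOVA for a 2x2 between-subjects design.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# One row per team; both factors coded 0/1 (toy data, not the study's).
df = pd.DataFrame({
    "assistant":   [0, 0, 1, 1, 0, 1, 0, 1],   # intelligent assistant available?
    "scarcity":    [0, 1, 0, 1, 1, 0, 0, 1],   # time scarcity induced?
    "performance": [7.2, 5.1, 6.8, 4.3, 5.5, 6.9, 7.0, 4.6],
})

# Main effects plus the assistant-by-scarcity interaction.
model = ols("performance ~ C(assistant) * C(scarcity)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```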
... Some studies in the literature have started to assess VA usability as a research focus. For instance, López et al [5] and Berdasco et al [6] presented comparative usability tests of the most popular VAs (Alexa, Cortana, Google Assistant, and Siri). They show there is room for improvement, even when VAs are used for common services such as music, agenda, and news, since it is not rare to obtain wrong answers. ...
... Finally, Siri was accessed on an iPhone 12. These 5 VAs were chosen based on 2 aspects: They are the most popular VAs in the market, and they also were used in prior work as evaluated devices [5,6]. ...
Article
Background: Voice assistants (VAs) are devices that respond to human voices and can be commanded to do a variety of tasks. Nowadays, VAs are being used to obtain health information, which has become a critical point of analysis for researchers in terms of question understanding and quality of response. In particular, the COVID-19 pandemic has severely affected, and is still affecting, people worldwide, which demands studies on how VAs can be used as a tool to provide useful information. Objective: This work aims to perform a quality analysis of different VAs' responses regarding the current and important subject of COVID-19 vaccines. We focus on this important subject since vaccines are now available and society urgently needs to immunize the population. Methods: The proposed study is based on questions that were collected from the official World Health Organization website. These questions were submitted to the five dominant VAs (Alexa, Bixby, Cortana, Google Assistant, and Siri), and responses were evaluated according to a rubric based on the literature. We focus this study on the Portuguese language as an additional contribution, since previous works mainly focus on the English language, and we believe that VAs may not be as well optimized for other languages. Results: Results show that Google Assistant has the best overall performance, and only this VA and Samsung Bixby achieved high scores on question understanding in the Portuguese language. Regarding the obtained answers, the study also shows the best overall performance for Google Assistant. Conclusions: Under the urgent context of COVID-19 vaccination, this work can help to understand how VAs must be improved to be more useful to society and how careful people must be when considering VAs as a source of health information. VAs have been demonstrated to perform well regarding comprehension and user-friendliness. However, this work has found that they must be better integrated with their information sources to be useful as health information tools.
... Benlian et al. (2019) suggested, for example, that humanisation can reduce perceived intrusiveness. Smart home speakers can increase the integration of humanisation elements, such as human-like tone and pacing (López et al., 2017). Some researchers have found that humanising the assistant's voice can lead to greater social presence and trust in the virtual assistant, which can influence final recommendations within firms (Chérif and Lemoine, 2019). ...
... Due to voice interaction between users and smart home speakers, the effect of personalised messages on the benefits and risks of disclosed information can vary depending on how this interaction is developed. Smart speakers can be humanised through a human-like tone and pacing (López et al., 2017), and even a sense of humour that is added to the device. Another important anthropomorphic aspect is responsiveness, which refers to the smart speaker's ability to provide users with quick and effective responses (Bavaresco et al., 2020). ...
Article
This article examines the personalisation–privacy paradox through the privacy calculus lens in the context of smart home speakers. It also considers the direct and moderating role of humanisation in the personalisation–privacy paradox. This characteristic refers to how human the device is perceived to be, given its voice’s tone and pacing, original responses, sense of humour, and recommendations. The model was tested on a sample of 360 users of different brands of smart home speakers. These users were heterogeneous in terms of age, gender, income, and frequency of use of the device. The results confirm the personalisation–privacy paradox and verify uncanny valley theory, finding the U-shaped effect that humanisation has on risks of information disclosure. They also show that humanisation increases benefits, which supports the realism maximisation theory. Specifically, they reveal that users will perceive the messages received as more useful and credible if the devices seem human. However, the human-likeness of these devices should not exceed certain levels as it increases perceived risk. These results should be used to highlight the importance of the human-like communication of smart home speakers.
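For illustration, a U-shaped effect of the kind reported here is often tested with a quadratic regression term; the sketch below uses invented variable names and simulated data, not the study's 360-user sample:

```python
# Hypothetical sketch: testing a U-shaped effect via a quadratic term.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
humanisation = rng.uniform(1, 7, 360)                    # e.g., 7-point scale
risk = 2 + 0.8 * (humanisation - 4) ** 2 + rng.normal(0, 1, 360)

X = sm.add_constant(np.column_stack([humanisation, humanisation ** 2]))
fit = sm.OLS(risk, X).fit()
# A significantly positive coefficient on the squared term is consistent
# with the U-shape predicted by uncanny valley theory.
print(fit.params, fit.pvalues)
```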
... In recent years, various non-traditional interfaces have been developed to exploit alternative ways of human-computer interaction, such as gaze or gesture detection and tracking, haptic and tangible interfaces, and voice or conversational interfaces [5]. Voice or conversational interfaces use natural language that resembles a conversation [6]; they have been implemented with great success in different environments, such as intelligent voice assistants. This success is due both to their ease of use and to the continuous, substantial improvements they experience in speech pattern recognition and in the expansion into languages other than English [7]. ...
... As for intelligent voice assistants, there are studies comparing different virtual assistants performing the same tasks, as well as reviewing security and privacy issues [16]. Other studies focus on evaluating UX from an emotional point of view when using these devices [17], others on how correct or good the answer given by an intelligent voice assistant is [18], and others on assessing the naturalness of these assistants' answers [6]. For its part, [7] shows how intelligent voice assistants help reduce the technological gap that may exist for users with some type of disability or physical impairment that restricts or prevents them from using products or devices. ...
Chapter
Standardized questionnaires are widely used instruments to evaluate UX and their capture mechanism has been implemented in written form, either on paper or in digital format. This study aims to determine if the UX evaluations obtained in the standardized UEQ questionnaire (User Experience Questionnaire) are equivalent if the response capture mechanism is implemented using the traditional written form (digitally) or if a conversational voice interface is used. Having a UX evaluation questionnaire whose capture mechanism is implemented by voice could provide an alternative to collect user responses, preserving the advantages present in standardized questionnaires (quantitative results, statistically validated, self-reported by users) and adding the ease of use and growing adoption of conversational voice interfaces. The results of the case study described in this paper show that, with an adequate number of participants, there are no significant differences in the results of the six scales that make up UEQ when using either of the two response capture mechanisms.
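A minimal sketch of the per-scale comparison such a study might run, assuming invented data and sample sizes (the chapter's actual statistical procedure may differ):

```python
# Hypothetical sketch: comparing written vs. voice capture on the six UEQ scales.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
scales = ["Attractiveness", "Perspicuity", "Efficiency",
          "Dependability", "Stimulation", "Novelty"]
for scale in scales:
    written = rng.normal(1.2, 0.8, 30)   # toy per-participant scale scores
    voice = rng.normal(1.1, 0.8, 30)
    t, p = stats.ttest_ind(written, voice)
    print(f"{scale}: t={t:.2f}, p={p:.3f}")   # large p: no detected difference
```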
... Therefore, standardization of voice commands across all skills might reduce errors and the unwanted quitting of skills, especially if the IPA better assisted users in not losing track of the skill they are currently operating. With previous research on IPAs reporting interaction issues [7,8,21,33], it is likely that IPAs based on similar eco-systems or business models, such as Google Assistant, share related root causes of global and local hierarchy. While the Echo Show offers 3rd-party applications, this might still be different for other eco-systems, which also provide various skills or actions but have stricter language specifications for publishing, e.g. ...
... However, IPAs lack such infrastructural responsibilities and do not meet the users' expectations, one of the potential reasons for their abandonment [7]. Nevertheless, related work has already reported that users refrain from using the full scope of functions offered by IPAs [11,21,22,33], potentially due to a lack of personalization [7]. Therefore, allowing users to customize their IPA, for example by letting them choose their own activation words [7] or specific vocabulary for their daily routines, might be perceived as beneficial. ...
Conference Paper
Intelligent Personal Assistants (IPAs) are advertised as reliable companions in everyday life that simplify household tasks. Due to speech-based usability issues, users struggle to engage deeply with current systems. The capabilities of newer generations of standalone devices are even extended by a display, partly to address weaknesses such as memorizing auditory information. So far, it is unclear how the potential of a multimodal experience is realized by designers and appropriated by users. Therefore, we observed 20 participants in a controlled setting, planning a dinner with the help of an audio-visual IPA, namely the Alexa Echo Show. Our study reveals ambiguous mental models of perceived and experienced device capabilities, leading to confusion. Meanwhile, the additional visual output channel could not counterbalance the weaknesses of voice interaction. Finally, we aim to illustrate users' conceptual understandings of IPAs and provide implications to rethink audiovisual output for voice-first standalone devices.
... A study [95] therefore carried out a functional and usability test of the most prestigious voice-activated personal assistants on the market, such as Amazon Alexa, Apple Siri, Microsoft Cortana and Google Assistant. ...
... While there are many services available in popular virtual assistants, there is still a lot of work to be done to improve the usability of these systems [95]. ...
Article
Full-text available
Natural user interfaces are increasingly popular these days. One of the most common of these user interfaces today is the voice-activated interface, in particular intelligent voice assistants such as Google Assistant, Alexa, Cortana, and Siri. However, the results show that although there are many services available, there is still a lot to be done to improve the usability of these systems. Speech recognition, contextual understanding, and human interaction are issues that are not yet solved in this field. In this context, this research paper focuses on the state of the art and knowledge of work on intelligent voice interfaces, and on the challenges and issues related to this field, in particular interaction quality, usability, and security. As such, the study also examines voice assistant architecture components following the expansion of the use of technologies such as wearable computing in order to improve the user experience. Moreover, the presentation of new emerging technologies in this field is the subject of a dedicated section of this work. The main contributions of this paper are therefore: (1) an overview of existing research, (2) an analysis and exploration of the field of intelligent voice assistant systems, with details at the component level, (3) identification of areas that require further research and development, with the aim of increasing their use, (4) various proposals for research directions and orientations for future work, and finally, (5) a study of the feasibility of designing a new type of voice assistant and a general presentation of the latter, whose realisation will be the subject of a thesis.
... Synthesized speech can also be used for particular functions such as spelling and pronunciation teaching for various languages. Nowadays, most smartphones are capable of listening to questions from end-users and answering back through an intelligent personal assistant: Cortana (Microsoft), Siri (iPhone), or Google Assistant (Android) [2]. Speech synthesis has been a mainstream topic in research on Artificial Intelligence (AI). ...
... The second step is to create the speech waveforms. The techniques used for speech synthesis can be partitioned into two broad categories: (1) Traditional machine learning-based techniques and (2) Deep machine learning-based techniques. In traditional machine learning, two specific methods are used for TTS: concatenative speech synthesis [3] and parametric speech synthesis [4], [5]. ...
... Google Assistant is a digital assistant developed by Google that consolidates features from the earlier Google Now service [48,49]. Google Assistant can be accessed via the Google Home smart speaker in a similar manner as Alexa can be accessed via the Echo smart speaker [50]. Like Amazon's offering, Google Assistant also offers public APIs that can be used to connect Google Assistant to new devices. ...
... However, increasingly, assistant platforms associate devices outside their original platform, such as TV sets and even microwave ovens [35]. Today, AI-based assistant platforms mediate access to an increasing number of services and devices via a voice-based interface [36]. The AI-based platform with the largest market share is Amazon Alexa closely followed by Google Assistant [37]. ...
... The traditional keyboard and mouse, and even the currently popular touchscreens are being left behind in favor of more intuitive and yet sophisticated interfaces based on cognitive applications such as image, text, and speech recognition. Commercial examples of such applications include Virtual Personal Assistants (VPA) [5][6][7] such as Google's Assistant, Apple's Siri, Microsoft's Cortana, and Amazon's Alexa. VPAs are meant to interact with an end user in a natural way (i.e., voice, text, or images), to answer questions, follow a conversation, and accomplish different tasks. ...
Article
Full-text available
Data-intensive workloads and applications, such as machine learning (ML), are fundamentally limited by traditional computing systems based on the von Neumann architecture. As data movement operations and energy consumption become key bottlenecks in the design of computing systems, the interest in unconventional approaches such as Near-Data Processing (NDP), machine learning, and especially neural network (NN)-based accelerators has grown significantly. Emerging memory technologies, such as ReRAM and 3D-stacked memory, are promising for efficiently architecting NDP-based accelerators for NNs due to their capabilities to work as both high-density/low-energy storage and in/near-memory computation/search engines. In this paper, we present a survey of techniques for designing NDP architectures for NNs. By classifying the techniques based on the memory technology employed, we underscore their similarities and differences. Finally, we discuss open challenges and future perspectives that need to be explored in order to improve and extend the adoption of NDP architectures for future computing platforms. This paper will be valuable for computer architects, chip designers, and researchers in the area of machine learning.
... We defined a continuum of attackers in increasing order of their knowledge about the anonymization scheme and found that the knowledgeable attackers are more successful in performing the re-identification attack. This chapter considers a slightly different threat model than the one described in Section 3.1, where individuals use the speech-to-text service provided by digital assistants [183,151]. In this context, the speech signal is sent from the user's device to a cloud-based service, as shown in Figure 4.1, where ASR and natural language understanding are performed in order to address the user request. [Fig. 4.1: Threat model related to speech-to-text provided by cloud-based services.] ...
Thesis
Large-scale centralized storage of speech data poses severe privacy threats to the speakers. Indeed, the emergence and widespread usage of voice interfaces, starting from telephones to mobile applications and now digital assistants, have enabled easier communication between customers and service providers. Massive speech data collection allows its users, for instance researchers, to develop tools for human convenience, like voice passwords for banking, personalized smart speakers, etc. However, centralized storage is vulnerable to cybersecurity threats which, when combined with advanced speech technologies like voice cloning, speaker recognition, and spoofing, may endow a malicious entity with the capability to re-identify speakers and breach their privacy by gaining access to their sensitive biometric characteristics, emotional states, personality attributes, pathological conditions, etc. Individuals and the members of civil society worldwide, and especially in Europe, are becoming aware of this threat. With firm backing by the GDPR, several initiatives are being launched, including the publication of white papers and guidelines, to spread mass awareness and to regulate voice data so that citizens' privacy is protected. This thesis is a timely effort to bolster such initiatives and propose solutions to remove the biometric identity of speakers from speech signals, thereby rendering them useless for re-identifying the speakers who spoke them. Besides the goal of protecting the speaker's identity from malicious access, this thesis aims to explore solutions which do so without degrading the usefulness of speech. We present several anonymization schemes based on voice conversion methods to achieve this two-fold objective. The output of such schemes is a high-quality speech signal that is usable for publication and a variety of downstream tasks. All the schemes are subjected to a rigorous evaluation protocol which is one of the major contributions of this thesis. This protocol led to the finding that the previous approaches do not effectively protect privacy and thereby directly inspired the VoicePrivacy initiative, which is an effort to gather individuals, industry, and the scientific community to participate in building a robust anonymization scheme. We introduce a range of anonymization schemes under the purview of the VoicePrivacy initiative and empirically prove their superiority in terms of privacy protection and utility. Finally, we endeavor to remove the residual speaker identity from the anonymized speech signal using techniques inspired by differential privacy. Such techniques provide provable analytical guarantees for the proposed anonymization schemes and open up promising perspectives for future research. In practice, the tools developed in this thesis are an essential component to build trust in any software ecosystem where voice data is stored, transmitted, processed, or published. They aim to help organizations comply with the rules mandated by civil governments and give a choice to individuals who wish to exercise their right to privacy.
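For illustration only, the sketch below shows the input/output contract of an anonymization scheme using naive pitch shifting; as the thesis itself finds for simple approaches, such transformations do not effectively protect privacy, and its actual schemes rely on voice conversion. File names are hypothetical:

```python
# Naive signal-level "anonymization" by pitch shifting (illustrative only).
import librosa
import soundfile as sf

y, sr = librosa.load("speech.wav", sr=None)                 # hypothetical input
y_anon = librosa.effects.pitch_shift(y, sr=sr, n_steps=4)   # shift up 4 semitones
sf.write("speech_anon.wav", y_anon, sr)                     # publishable output
```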
... Past research used the above-mentioned maxims and classification variants to locate and evaluate voice assistant developments from a technical point of view [5], [6] to show the rapid progress in the areas of Natural Language Understanding (NLU) and Natural Language Processing (NLP). In this context, even with the latest technical developments, voice assistants cannot accurately predict how the user responds. ...
Article
Full-text available
Voice assistants have manifested their presence in the vehicle over the last decade. Further technological developments in the area of voice recognition and interaction with users open up the opportunity for new customer-centric user scenarios. In the following work, eight dialog use cases and two different interaction types were examined in detail. The focus of this work is to answer the research question of to what extent users prefer task-oriented multi-turn dialogs over question-and-answer single-turn dialogs in certain driving situations. In a three-step online survey conducted in 2020, participants were asked about their preferences for the assistant's interaction type and use case, as well as the perceived usefulness and pleasantness. The authors found that users preferred multi-turn conversations over single-turn conversations for all defined use case scenarios. As further challenges for the future development of voice assistants in the automotive context, the changed driving situation due to, e.g., progress in autonomous driving and the focus on integrating the voice modality as a direct function should be considered.
... Virtual digital assistants, such as Google Home, Amazon Echo/Alexa, and Apple HomePod, are becoming increasingly accessible and available to the general public [311]. If the home assistant is asked to perform a task, for example, setting an alarm for the next morning, the natural language signal produced by a microphone is converted into data through statistical extraction [312], and following this, classification is performed (that is, what did the user say?). ...
Thesis
Full-text available
In modern Human-Robot Interaction, much thought has been given to accessibility regarding robotic locomotion, specifically the enhancement of awareness and lowering of cognitive load. On the other hand, with social Human-Robot Interaction considered, published research is far sparser given that the problem is less explored than pathfinding and locomotion. This thesis studies how one can endow a robot with affective perception for social awareness in verbal and non-verbal communication. This is possible by the creation of a Human-Robot Interaction framework which abstracts machine learning and artificial intelligence technologies which allow for further accessibility to non-technical users compared to the current State-of-the-Art in the field. These studies thus initially focus on individual robotic abilities in the verbal, non-verbal and multimodality domains. Multimodality studies show that late data fusion of image and sound can improve environment recognition, and similarly that late fusion of Leap Motion Controller and image data can improve sign language recognition ability. To alleviate several of the open issues currently faced by researchers in the field, guidelines are reviewed from the relevant literature and met by the design and structure of the framework that this thesis ultimately presents. The framework recognises a user's request for a task through a chatbot-like architecture. Through research in this thesis that recognises human data augmentation (paraphrasing) and subsequent classification via language transformers, the robot's more advanced Natural Language Processing abilities allow for a wider range of recognised inputs. That is, as examples show, phrases that could be expected to be uttered during a natural human-human interaction are easily recognised by the robot. This allows for accessibility to robotics without the need to physically interact with a computer or write any code, with only the ability of natural interaction (an ability which most humans have) required for access to all the modular machine learning and artificial intelligence technologies embedded within the architecture. Following the research on individual abilities, this thesis then unifies all of the technologies into a deliberative interaction framework, wherein abilities are accessed from long-term memory modules and short-term memory information such as the user's tasks, sensor data, retrieved models, and finally output information. In addition, algorithms for model improvement are also explored, such as through transfer learning and synthetic data augmentation and so the framework performs autonomous learning to these extents to constantly improve its learning abilities. It is found that transfer learning between electroencephalographic and electromyographic biological signals improves the classification of one another given their slight physical similarities. Transfer learning also aids in environment recognition, when transferring knowledge from virtual environments to the real world. In another example of non-verbal communication, it is found that learning from a scarce dataset of American Sign Language for recognition can be improved by multi-modality transfer learning from hand features and images taken from a larger British Sign Language dataset. 
Data augmentation is shown to aid in electroencephalographic signal classification by learning from synthetic signals generated by a GPT-2 transformer model, and, in addition, augmenting training with synthetic data also shows improvements when performing speaker recognition from human speech. Given the importance of platform independence due to the growing range of available consumer robots, four use cases are detailed, and examples of behaviour are given by the Pepper, Nao, and Romeo robots as well as a computer terminal. The use cases involve a user requesting their electroencephalographic brainwave data to be classified by simply asking the robot whether or not they are concentrating. In a subsequent use case, the user asks if a given text is positive or negative, to which the robot correctly recognises the task of natural language processing at hand and then classifies the text, this is output and the physical robots react accordingly by showing emotion. The third use case has a request for sign language recognition, to which the robot recognises and thus switches from listening to watching the user communicate with them. The final use case focuses on a request for environment recognition, which has the robot perform multimodality recognition of its surroundings and note them accordingly. The results presented by this thesis show that several of the open issues in the field are alleviated through the technologies within, structuring of, and examples of interaction with the framework. The results also show the achievement of the three main goals set out by the research questions; the endowment of a robot with affective perception and social awareness for verbal and non-verbal communication, whether we can create a Human-Robot Interaction framework to abstract machine learning and artificial intelligence technologies which allow for the accessibility of non-technical users, and, as previously noted, which current issues in the field can be alleviated by the framework presented and to what extent.
... Recently, deep learning has dramatically improved the performance of QA systems [24,25], especially in the sub-domain of visual QA [26,27]. Voice-assisted devices, like Alexa and Google Assistant, partially solve the problem by answering questions for which the information can be retrieved from the web [28,29]. Nevertheless, even the state of the art is far from an ideal QA system. ...
Article
Full-text available
The role of a human assistant, such as receptionist, is to provide specific information to the public. Questions asked by the public are often context dependent and related to the environment where the assistant is situated. Should similar behaviour and questions be expected when a social robot offers the same assistant service to visitors? Would it be sufficient for the robot to answer only service-specific questions, or is it necessary to design the robot to answer more general questions? This paper aims to answer these research questions by investigating the question-asking behaviour of the public when interacting with a question-answering social robot. We conducted the study at a university event that was open to the public. Results demonstrate that almost no participants asked context-specific questions to the robot. Rather, unrelated questions were common and included queries about the robot’s personal preferences, opinions, thoughts and emotional state. This finding contradicts popular belief and common sense expectations from what is otherwise observed during similar human–human interactions. In addition, we found that incorporating non-context-specific questions in a robot’s database increases the success rate of its question-answering system.
... Intelligent Virtual (Personal) Assistants (IVAs): IVAs are hands-free, voice-controlled devices that can accomplish numerous tasks such as voice interaction, playing music, managing to-do lists, web browsing, setting alarms, placing orders, and even controlling other devices such as smart locks, light bulbs, thermostats, etc. Amazon Alexa, Google Assistant, Apple Siri, and Microsoft Cortana are the most common and extensively used IVA systems [22]. ...
Article
Full-text available
Over the last few years, the explosive growth of Internet of Things (IoT) has revolutionized the way we live and interact with each other as well as with various types of systems and devices which form part of the Information Communication Technology (ICT) infrastructure. IoT is having a significant impact on various application domains including healthcare, smart home, transportation, energy, agriculture, manufacturing, and many others. We focus on the smart home environment which has attracted a lot of attention from both academia and industry recently. The smart home provides a lot of convenience to home users but it also opens up various risks that threaten both the security and privacy of the users. In contrast to previous works on smart home security and privacy, we present an overview of smart homes from both academic and industry perspectives. Next we discuss the security requirements, challenges and threats associated with smart homes. Finally, we discuss countermeasures that can be deployed to mitigate the identified threats.
... The proposed conversational interface was evaluated by 2,025 customers for various use cases and showed promising results. With the recent advancements in voice assistant technologies such as Microsoft Cortana, Apple Siri, Amazon Alexa, and Google Assistant [52], such conversational interfaces hold great potential for future applications. Zhou et al. studied the impact of emojis in instant messaging as a supporting element or as an alternative to text [53]. ...
Article
Full-text available
User interface is an essential element of an information system from the user perspective. The use of text in the user interface presents a challenge to some users, such as illiterate users, the elderly, non-local language speakers, or users with disabilities. Numerous studies have proposed text-free user interfaces based on non-textual aids such as pictograms, audio, video, or specialized hardware. Such rich elements in the user interface have implications for an information system's storage and processing requirements. We propose using concept hierarchies to present a dynamic interface to a user based upon user characteristics. A drill-down approach may present more specific interface elements to a user, and a roll-up scheme can identify more generalized interface elements. The proposed scheme can be used to manage the internal data of large user interfaces with a rich set of features more effectively, as it is computationally less expensive than sequential processing techniques. We also explore various physical and psychological characteristics of elements and evaluate user preference for them. The results of a survey of 200 respondents show that users consider the use of various colors, alignments, mental models, and look & feel to be essential criteria in interface design. We also conclude that the size of interface elements and the use of characters for expressing emotions are significantly important characteristics in interface design. Future work will apply the proposed scheme to a full spectrum of applications ranging from enterprise systems to mobile applications.
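A toy sketch of the drill-down/roll-up idea over a concept hierarchy; the structure and labels below are invented for illustration and are not the paper's data model:

```python
# Hypothetical concept hierarchy: drill-down returns more specific interface
# elements, roll-up returns the more general parent concept.
hierarchy = {
    "media": ["audio", "video"],
    "audio": ["music", "podcast"],
    "video": ["movie", "clip"],
}
parent = {child: p for p, kids in hierarchy.items() for child in kids}

def drill_down(concept):
    return hierarchy.get(concept, [])   # more specific elements

def roll_up(concept):
    return parent.get(concept)          # more generalized element

print(drill_down("media"))   # ['audio', 'video']
print(roll_up("podcast"))    # 'audio'
```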
... Virtual assistants, e.g., Google Assistant and Siri, are one of the most important applications of deep learning. Every time you interact with a virtual assistant, its experience with your accent and voice increases, and by using this experience it gives you a human-like feeling [15]. The concept of deep learning is applied in the backend of virtual assistants so that their knowledge about your personal life, such as your favorite food, actors, or subjects, can be increased. Natural human language is evaluated for the understanding and execution of human commands [16]. Virtual assistants can perform translation of speech to text, create notes for human beings, and book appointments. ...
Article
Full-text available
The current era is the golden era of Artificial Intelligence. Machine learning is used in most applications of Artificial Intelligence (AI) and has proven to be a great tool for making AI strong. As an advanced form of machine learning, Deep Learning has demonstrated its popularity and success in different applications at the top level, and its high forecasting accuracy is very important for the corporate world; its leading role cannot be underestimated. It is used to develop systems that mimic the human knowledge-acquisition process using neural networks. In this paper, we discuss innovative developments in the application areas of deep learning.
... Siri, Alexa, Cortana, and Google Assistant: there exists a plethora of conversational assistants on the market (see [248,249] for a comparison of these tools) which are capable of answering natural language questions. Recently, these assistants have gained the ability to interact with IoT devices, with Ammari et al. identifying IoT as the third most common use case of voice assistants [247]. ...
Article
The current complexity of IoT systems and devices is a barrier to reach a healthy ecosystem, mainly due to technological fragmentation and inherent heterogeneity. Meanwhile, the field has scarcely adopted any engineering practices currently employed in other types of large-scale systems. Although many researchers and practitioners are aware of the current state of affairs and strive to address these problems, compromises have been hard to reach, making them settle for sub-optimal solutions. This paper surveys the current state of the art in designing and constructing IoT systems from the software engineering perspective, without overlooking hardware concerns, revealing current trends and research directions.
... Apple Siri, Amazon Alexa, Microsoft Cortana, YouTube captions, and Google Assistant [6] deploy speech recognition systems that work based on these designs. Google and Microsoft [7] use deep neural network-based algorithms that convert sound to text through speech recognition, process the text, and respond accordingly. Typically, deep learning algorithms process such data as 1D, since audio is recorded and represented as a 1D waveform [8]. ...
Article
Full-text available
Deep learning advancements have greatly improved the performance of speech recognition systems, and most recent systems are based on the Recurrent Neural Network (RNN). Overall, the RNN works fine with small sequence data but suffers from the gradient vanishing problem in the case of large sequences. Transformer networks have neutralized this issue and have shown state-of-the-art results on sequential or speech-related data. Generally, in speech recognition, the input audio is converted into an image using a Mel-spectrogram to illustrate frequencies and intensities. The image is classified by the machine learning mechanism to generate a classification transcript. However, the audio frequency in the image has low resolution, causing inaccurate predictions. This paper presents a novel end-to-end binary-view transformer-based architecture for speech recognition to cope with the frequency resolution problem. First, the input audio signal is transformed into a 2D image using a Mel-spectrogram. Second, modified universal transformers utilize multi-head attention to derive contextual information and different speech-related features. Moreover, a feedforward neural network is deployed for classification. The proposed system has generated robust results on Google's speech command dataset with an accuracy of 95.16% and minimal loss. The binary-view transformer reduces the likelihood of over-fitting by deploying a multiview mechanism to diversify the input data, and multi-head attention captures multiple contexts from the data's feature map.
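A brief sketch of the preprocessing step this abstract describes, turning a 1D waveform into the 2D Mel-spectrogram that the classifier consumes; the file name and parameters are assumptions, not the paper's settings:

```python
# Waveform -> log-Mel-spectrogram "image" (hypothetical parameters).
import librosa

y, sr = librosa.load("command.wav", sr=16000)   # hypothetical spoken command
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
log_mel = librosa.power_to_db(mel)              # log scaling, as is typical
print(log_mel.shape)   # (n_mels, n_frames): the 2D input to the classifier
```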
... accessed on 30 August 2022), and Google Assistant (see https://assistant.google.com/, accessed on 30 August 2022) [62]. ...
Article
Full-text available
The present information age is characterized by an ever-increasing digitalization. Smart devices quantify our entire lives. These collected data provide the foundation for data-driven services called smart services. They are able to adapt to a given context and thus tailor their functionalities to the user’s needs. It is therefore not surprising that their main resource, namely data, is nowadays a valuable commodity that can also be traded. However, this trend does not only have positive sides, as the gathered data reveal a lot of information about various data subjects. To prevent uncontrolled insights into private or confidential matters, data protection laws restrict the processing of sensitive data. One key factor in this regard is user-friendly privacy mechanisms. In this paper, we therefore assess current state-of-the-art privacy mechanisms. To this end, we initially identify forms of data processing applied by smart services. We then discuss privacy mechanisms suited for these use cases. Our findings reveal that current state-of-the-art privacy mechanisms provide good protection in principle, but there is no compelling one-size-fits-all privacy approach. This leads to further questions regarding the practicality of these mechanisms, which we present in the form of seven thought-provoking propositions.
... There remains an open area for the discoverability of speech-based V-NLI. The tone of voice can also provide insights into the user's sentiments [147]. ...
Article
Full-text available
Utilizing Visualization-oriented Natural Language Interfaces (V-NLI) as a complementary input modality to direct manipulation for visual analytics can provide an engaging user experience. It enables users to focus on their tasks rather than having to worry about how to operate visualization tools on the interface. In the past two decades, leveraging advanced natural language processing technologies, numerous V-NLI systems have been developed in academic research and commercial software, especially in recent years. In this article, we conduct a comprehensive review of the existing V-NLIs. In order to classify each paper, we develop categorical dimensions based on a classic information visualization pipeline with the extension of a V-NLI layer. The following seven stages are used: query interpretation, data transformation, visual mapping, view transformation, human interaction, dialogue management, and presentation. Finally, we also shed light on several promising directions for future work in the V-NLI community.
... Spoken Language Understanding (SLU) systems have become ubiquitous with the introduction of personal assistants such as Amazon Alexa, Google Home, Microsoft Cortana and Apple Siri [1,2]. A typical design for an SLU system consists of an ASR component, pipelined with a language understanding component. ...
Preprint
Spoken Language Understanding (SLU) systems typically consist of a set of machine learning models that operate in conjunction to produce an SLU hypothesis. The generated hypothesis is then sent to downstream components for further action. However, it is desirable to discard an incorrect hypothesis before sending it downstream. In this work, we present two designs for SLU hypothesis rejection modules: (i) scheme R1 that performs rejection on domain specific SLU hypothesis and, (ii) scheme R2 that performs rejection on hypothesis generated from the overall SLU system. Hypothesis rejection modules in both schemes reject/accept a hypothesis based on features drawn from the utterance directed to the SLU system, the associated SLU hypothesis and SLU confidence score. Our experiments suggest that both the schemes yield similar results (scheme R1: 2.5% FRR @ 4.5% FAR, scheme R2: 2.5% FRR @ 4.6% FAR), with the best performing systems using all the available features. We argue that while either of the rejection schemes can be chosen over the other, they carry some inherent differences which need to be considered while making this choice. Additionally, we incorporate ASR features in the rejection module (obtaining an 1.9% FRR @ 3.8% FAR) and analyze the improvements.
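An illustrative sketch of such a rejection module; the features, data, model, and threshold below are invented, and the paper's feature sets and models differ:

```python
# Toy accept/reject module over SLU hypothesis features, with FRR/FAR.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Per-hypothesis features: [ASR confidence, SLU confidence, utterance length].
X = np.array([[0.9, 0.8, 5], [0.4, 0.3, 12], [0.7, 0.9, 3], [0.2, 0.5, 20]])
y = np.array([1, 0, 1, 0])   # 1 = hypothesis correct (accept), 0 = reject

clf = LogisticRegression().fit(X, y)
scores = clf.predict_proba(X)[:, 1]

# The threshold is tuned on held-out data to trade off FRR (correct
# hypotheses rejected) against FAR (incorrect hypotheses accepted).
accept = scores >= 0.5
frr = np.mean(~accept[y == 1])
far = np.mean(accept[y == 0])
print(f"FRR={frr:.2f}, FAR={far:.2f}")
```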
... In addition to the impact seen in the academic literature (presented in detail in Chapter 3), machine-learning-based products and services already available in the market have gained popularity. Supported by these technologies, companies such as Google, Amazon and Apple have built assistants that trigger atomic requests and answer simple questions by voice (López et al., 2017). For instance, the Amazon team defined a language for the virtual assistant Alexa to represent natural language commands based on an ontology of actions, types, properties and roles. ...
Thesis
Programming is a key skill in a world where businesses are driven by digital transformations. Although much of the programming demand can be addressed by a simple set of instructions composing libraries and services available on the web, non-technical professionals, such as domain experts and analysts, are still unable to construct their own programs due to the intrinsic complexity of coding. Among other types of end-user development, natural language programming has emerged to allow users to program without the formalism of traditional programming languages, where a tailored semantic parser can translate a natural language utterance into a formal command representation able to be processed by a computational machine. Currently, semantic parsers are typically built on top of a learning method that defines its behaviour based on the patterns behind large training data, whose production is frequently costly and time-consuming. Our research is devoted to studying and proposing a semantic parser for natural language commands targeting a scenario with low availability of training data. The proposed semantic parser follows a multi-component architecture, composed of a specialised shallow parser that associates natural language commands with predicate-argument structures, integrated with a distributional ranking model that matches the command to a function signature available from an API knowledge base. Systems developed with statistical learning models and complex linguistic resources, like the proposed semantic parser, do not natively provide an easy way to associate a single feature of the input data with its impact on system behaviour. In this scenario, end-user explanations for intelligent systems have become a strong requirement to increase user confidence and system literacy. Thus, our research designed an explanation model for the proposed semantic parser that fits the heterogeneity of its multi-component architecture. The explanation model explores a hierarchical representation with an increasing degree of technical depth, providing higher-level explanations in the initial layers and going gradually to those that demand technical knowledge, applying different explanation strategies to better express the approach behind each component. With the support of a user-centred experiment, we compared the utility of different types of explanations and the impact of background knowledge on user preferences.
... Voice assistants can recognize human speech and interpret commands and questions [21]. The most famous voice assistants are either embedded in smartphones, such as Google's Assistant [22], or standalone devices, such as Amazon's Alexa [23]. ...
Preprint
Full-text available
Understanding human behavior and monitoring mental health are essential to maintaining the safety of the community and society. As there has been an increase in unmanaged mental health problems during the COVID-19 pandemic, early detection of mental issues is crucial. Nowadays, the usage of Intelligent Virtual Personal Assistants (IVAs) has increased worldwide. Individuals use their voices to control these devices to fulfill requests and acquire different services. This paper proposes a novel deep learning model based on a gated recurrent neural network and a convolutional neural network to understand human emotion from speech, in order to improve IVA services and monitor users' mental health.
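A hypothetical sketch of the CNN-plus-gated-recurrent-network architecture family named in the abstract; layer sizes and the input format are assumptions, not the paper's exact model:

```python
# Toy CNN + GRU classifier for speech emotion recognition.
import torch
import torch.nn as nn

class EmotionNet(nn.Module):
    def __init__(self, n_mels=40, hidden=64, n_classes=4):
        super().__init__()
        # Convolution over the time axis of Mel-spectrogram frames.
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.gru = nn.GRU(64, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):        # x: (batch, n_mels, frames)
        h = self.conv(x)         # (batch, 64, frames)
        h = h.transpose(1, 2)    # (batch, frames, 64) for the GRU
        _, last = self.gru(h)    # final hidden state: (1, batch, hidden)
        return self.fc(last.squeeze(0))

logits = EmotionNet()(torch.randn(8, 40, 100))   # 8 utterances, 100 frames each
print(logits.shape)                              # (8, 4) emotion-class scores
```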
Preprint
Full-text available
This paper is a proposal for research on the governance of artificial intelligence algorithms and data, based on a comparative study of relevant works and also on experience in Internet governance. Complementarily, the results from this study will be made available for discussion and improvement in a decentralized control environment, implemented especially for this purpose.
... Multiple widely used neural networks are composed of two parts: the first part projects data points into another space, and the other part of the model does further regression/classification upon this space. By transforming raw data features to another potentially more tractable space, deep learning models have recently shown potential in many areas, ranging from dialogue systems (Vinyals & Le, 2015;López et al., 2017;Chen et al., 2017), medical image analysis (Kononenko, 2001;Ker et al., 2017;Erickson et al., 2017;Litjens et al., 2017;Razzak et al., 2018;Bakator & Radosav, 2018) to robotics (Peters et al., 2003;Kober et al., 2013;Pierson & Gashler, 2017;Sünderhauf et al., 2018). ...
Preprint
Full-text available
Deep learning models often tackle the intra-sample structure, such as the order of words in a sentence and of pixels in an image, but have not paid much attention to the inter-sample relationship. In this paper, we show that explicitly modeling the inter-sample structure to be more discretized can potentially improve a model's expressivity. We propose a novel method, Atom Modeling, that can discretize a continuous latent space by drawing an analogy between a data point and an atom, which is naturally spaced away from other atoms at distances depending on their intra-structures. Specifically, we model each data point as an atom composed of electrons, protons, and neutrons and minimize the potential energy caused by the interatomic force among data points. Through experiments with qualitative analysis of our proposed Atom Modeling on synthetic and real datasets, we find that Atom Modeling can improve performance by maintaining the inter-sample relation and can capture an interpretable intra-sample relation by mapping each component in a data point to an electron/proton/neutron.
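A toy sketch of the underlying idea: an auxiliary loss that spaces latent points apart through a pairwise repulsive potential. This is a simplification for illustration, not the paper's Atom Modeling formulation:

```python
# Pairwise repulsion over latent points, by analogy with interatomic forces.
import torch

def repulsion_loss(z, eps=1e-6):
    # z: (batch, dim) latent representations.
    d = torch.cdist(z, z) + eps                   # pairwise distances
    mask = ~torch.eye(len(z), dtype=torch.bool)   # ignore self-distances
    return (1.0 / d[mask]).mean()                 # energy grows as points collapse

z = torch.randn(16, 8, requires_grad=True)
loss = repulsion_loss(z)   # in training, add this to the task loss
loss.backward()
```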
... 14 https://uniswap.org/community. [Flattened table of the citing paper's references grouped by topic: Internet, Algorithms, DAO, Economics, Others.] ...
Preprint
This paper proposes research on the governance of artificial intelligence algorithms and data, drawing on a comparative study of relevant works and on experience in Internet governance. Complementarily, the results of this study will be made available for discussion and improvement in a decentralized control environment implemented especially for this purpose. 1 Introduction and justification, with a synthesis of the fundamental bibliography. Artificial Intelligence (AI) is the term used for computational systems that attempt to imitate aspects of human intelligence, including functions we intuitively associate with intelligence, such as learning, problem solving, thinking, and rational action [1]. In general, and regardless of the application, these systems are considered a black box, resulting in information asymmetry between their developers and their consumers [2]. One of the saddest examples of the consequences of this asymmetry is the design of the MCAS system of the Boeing 737 MAX, which led to two crashes with 346 deaths, in October 2018 (Lion Air) and March 2019 (Ethiopian Airlines). When the angle-of-attack sensor failed, the embedded algorithms forced the aircraft's nose down, resisting the confused pilots' repeated attempts to pull the nose back up. Ben Shneiderman, in his book Human-Centered AI, which discusses the two Boeing 737 MAX accidents, argues that the future of these AI algorithms is human-centered, above all as super-tools that amplify human abilities, empowering people in remarkable ways while at the same time ensuring human control [3]. Shneiderman named these algorithms HCAI, an acronym of his book's title. There are countless other applications using AI, for example those that inhabit the Internet, which behave in disproportionate ways. A detailed description of so-called algorithmic biases can be found in Safiya Noble's book Algorithms of Oppression [4]. Asymmetric information, biases, and other issues are troubling developers, researchers, and other stakeholders, all determined to discover what is missing [5]! Perspectives associated with ethics [6, 7, 8, 9], ...
... Depression detection [6]-[7], medical science [13], call centres, dialogue systems such as Alexa, Cortana, Siri, Google Voice [10]-[11], and human-robot social interaction [3]-[4]. Sophia [27]-[28], a humanoid robot that can perform speech recognition, speech synthesis, face tracking, and emotion recognition, and can mimic facial expressions, is a notable example of machine emotional intelligence. ...
Article
Emotion recognition is a rapidly growing research field. Emotions can be effectively expressed through speech and can provide insight into a speaker's intentions. Although humans can easily interpret emotions through speech, physical gestures, and eye movement, training a machine to do the same with similar precision is quite a challenging task. SER systems can improve human-machine interaction when used with automatic speech recognition, as emotions have the tendency to change the semantics of a sentence. Many researchers have contributed extremely impressive work in this research area, leading to the development of numerous classification techniques, feature selection and extraction methods, and emotional speech databases. This paper reviews recent accomplishments in the area of speech emotion recognition. It also presents a detailed review of various types of emotional speech databases, of different classification techniques that can be used individually or in combination, and a brief description of various speech features for emotion recognition.
... This may indicate an over-trust of Alexa, depending on the actual correctness of the device (although we leave this as a question for future research). Since different agents show varying levels of correctness [30], different agents should be trusted differently. ...
Preprint
Full-text available
A majority of researchers who develop design guidelines have WEIRD, adult perspectives. This means we may not have technology developed appropriately for people from non-WEIRD countries and children. We present five design recommendations to empower designers to consider diverse users' desires and perceptions of agents. For one, designers should consider the degree of task-orientation of agents appropriate to end-users' cultural perspectives. For another, designers should consider how competence, predictability, and integrity in agent-persona affect end-users' trust of agents. We developed these recommendations following our study, which analyzed the perspectives of children and parents from WEIRD and non-WEIRD countries on agents as they created them. We found that different subsets of participants perceived agents differently. For instance, non-WEIRD and child perspectives emphasized agent artificiality, whereas WEIRD and parent perspectives emphasized human-likeness. Children also consistently felt agents were warmer and more human-like than parents did. Finally, participants generally trusted technology, including agents, more than people.
... Most of our devices employ some form of ASR. Google Assistant, Amazon Alexa, Apple Siri, and Cortana from Microsoft are some examples in which ASR is employed [31]. The research literature contains many studies on this topic, which reflects the breadth of the research area. ...
Article
Full-text available
Aphasia is a type of speech disorder that can cause speech defects in a person. Identifying the severity level of an aphasia patient is critical for the rehabilitation process. In this research, we identify ten aphasia severity levels, motivated by specific speech therapies, based on the presence or absence of identified characteristics in aphasic speech, in order to give more specific treatment to the patient. In the aphasia severity level classification process, we experiment with different speech feature extraction techniques, lengths of input audio samples, and machine learning classifiers to assess classification performance. Aphasic speech is captured by an audio sensor, recorded, divided into audio frames, and passed through an audio feature extractor before being fed into the machine learning classifier. According to the results, the mel frequency cepstral coefficient (MFCC) is the most suitable audio feature extraction method for the aphasic speech level classification process, as it outperformed the mel-spectrogram, chroma, and zero-crossing-rate features by a large margin. Furthermore, classification performance is higher when 20 s audio samples are used compared with 10 s chunks, although the performance gap is narrow. Finally, the deep neural network approach yielded the best classification performance, slightly better than both K-nearest neighbor (KNN) and random forest classifiers and significantly better than decision tree algorithms. The study therefore shows that aphasia level classification can be completed with accuracy, precision, recall, and F1-score values of 0.99 using MFCC features of 20 s audio samples with the deep neural network approach, in order to recommend the corresponding speech therapy for the identified level. A web application was developed for English-speaking aphasia patients to self-diagnose their severity level and engage in speech therapies.
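As a rough illustration of the pipeline the abstract describes (MFCC extraction over 20 s samples followed by a neural classifier), here is a hedged sketch using librosa and scikit-learn; the file list, label array, and network sizes are assumptions, not the paper's configuration.

```python
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def mfcc_vector(path: str, duration: float = 20.0, n_mfcc: int = 13) -> np.ndarray:
    """Load up to `duration` seconds of audio and summarize its MFCCs."""
    y, sr = librosa.load(path, duration=duration)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, n_frames)
    # Mean and std over frames give a fixed-length vector per sample.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical training flow over labeled aphasic-speech recordings:
# X = np.stack([mfcc_vector(p) for p in audio_paths])
# clf = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=500).fit(X, levels)
```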
... Much of this progress has come from advances in "deep learning, " which refers to multilayer network-style models that emulate the working principles of the brain. Today, AI can outperform humans on certain narrow tasks previously thought to require human expertise, such as playing chess, poker and Go (Schrittwieser et al., 2020), object recognition (LeCun et al., 2015), natural language understanding (He et al., 2021), and speech recognition (López et al., 2017;Bengio et al., 2021). In addition, self-driving cars, goods transport robots, and unmanned aircrafts will soon be a part of normal traffic (Hancock et al., 2019). ...
Article
Full-text available
Despite the success of artificial intelligence (AI), we are still far from AI that models the world as humans do. This study focuses on explaining human behavior from the perspective of intuitive mental models. We describe how behavior arises in biological systems and how a better understanding of these systems can lead to advances in the development of human-like AI. Humans build intuitive models from physical, social, and cultural situations. In addition, we follow Bayesian inference to combine intuitive models and new information to make decisions. We should build similar intuitive models and Bayesian algorithms for the new AI. We suggest that probability calculation in the Bayesian sense is sensitive to semantic properties of the combinations of objects formed by observation and prior experience. We call this brain process computational meaningfulness; it is closer to the Bayesian ideal when the occurrence probabilities of these objects are believable. How does the human brain form models of the world and apply these models in its behavior? We outline the answers from three perspectives. First, intuitive models support an individual in using information in meaningful ways in the current context. Second, neuroeconomics proposes that the valuation network in the brain has an essential role in human decision making; it combines psychological, economic, and neuroscientific approaches to reveal the biological mechanisms by which decisions are made. Then, the brain is an over-parameterized modeling organ that produces optimal behavior in a complex world. Finally, progress in AI data analysis techniques has allowed us to decipher how the human brain valuates different options in complex situations. By combining big datasets with machine learning models, it is possible to gain insight from complex neural data beyond what was possible before. We describe these solutions by reviewing current research from this perspective. In this study, we outline the basic aspects of human-like AI and discuss how science can benefit from AI. The better we understand the human brain's mechanisms, the better we can apply this understanding to building new AI. Development of AI and understanding of human behavior go hand in hand.
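The abstract's central mechanism, combining an intuitive model (the prior) with new information (the likelihood), is just Bayes' rule; a toy rendering, with illustrative numbers:

```python
def bayes_posterior(prior: float, likelihood: float, evidence: float) -> float:
    """Bayes' rule: P(h | obs) = P(obs | h) * P(h) / P(obs).

    Here the intuitive model supplies the prior and a new observation
    supplies the likelihood, as the abstract describes.
    """
    return likelihood * prior / evidence

# Example (assumed values): prior = 0.3, P(obs | h) = 0.8, P(obs) = 0.5
# -> posterior = 0.48, i.e. the observation strengthens the intuitive belief.
```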
... The design of intelligent systems (Wilamowski and Irwin 2015) is a relevant research field, including for example industrial (Irani and Kamal 2014; Ngai et al. 2014), military (Ma'sum et al. 2013; Yoo et al. 2014) and social (Magnisalis et al. 2011; Bernardini et al. 2014; Adamson et al. 2019) applications. In the telecare domain, personal assistants (Matsuyama et al. 2016; López et al. 2018) and artificial companionship (Chumkamon et al. 2016; Abdollahi et al. 2017) are relevant. ...
Article
Full-text available
Previous researchers have proposed intelligent systems for the therapeutic monitoring of cognitive impairments. However, most existing practical approaches for this purpose are based on manual tests. This raises issues such as excessive caretaking effort and the white-coat effect. To avoid these issues, we present an intelligent conversational system for entertaining elderly people with news of interest to them that monitors cognitive impairment transparently. Automatic chatbot dialogue stages allow assessing content description skills and detecting cognitive impairment with Machine Learning algorithms. We create these dialogue flows automatically from updated news items using Natural Language Generation techniques. The system also infers the gold standard of the answers to the questions, so it can assess cognitive capabilities automatically by comparing these answers with the user responses. It employs a similarity metric with values in [0, 1], in increasing order of similarity. To evaluate the performance and usability of our approach, we conducted field tests with a group of 30 elderly people in the earliest stages of dementia, under the supervision of gerontologists. In the experiments, we analysed the effect of stress and concentration on these users. Those without cognitive impairment performed up to five times better. In particular, the similarity metric varied between 0.03, for stressed and unfocused participants, and 0.36, for relaxed and focused users. Finally, we developed a Machine Learning algorithm based on textual analysis features for automatic cognitive impairment detection, which attained accuracy, F-measure and recall levels above 80%. We have thus validated the automatic approach to detecting cognitive impairment in elderly people based on entertainment content. The results suggest that the solution has strong potential for long-term user-friendly therapeutic monitoring of elderly people.
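The paper does not spell out its similarity metric beyond its [0, 1] range; one common way to realize such a metric is TF-IDF cosine similarity between the inferred gold-standard answer and the user's response, sketched below (purely illustrative, not the authors' implementation):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def answer_similarity(gold: str, response: str) -> float:
    """Similarity in [0, 1] between a gold-standard answer and a user response.

    TF-IDF weights are nonnegative, so the cosine stays within [0, 1].
    """
    tfidf = TfidfVectorizer().fit_transform([gold, response])
    return float(cosine_similarity(tfidf[0:1], tfidf[1:2])[0, 0])

# answer_similarity("the mayor opened the new library",
#                   "a library was opened by the mayor")  # -> high score
```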
... Second, we provide evidence on branding effectiveness in voice shopping. Prior research on voice devices has been limited to the correctness and naturalness of voice assistants (López, Quesada, and Guerrero 2017), mood prediction (Halbauer and Klarmann 2022), personification of smart speakers (Purington et al. 2017), and privacy (Edu, Such, and Suarez-Tangi 2020). ...
Article
Nearly half of US households own a smart speaker with voice shopping functionality. Voice shopping product presentation is inherently sequential due to the audio delivery of information, which may give retailers the opportunity to influence customer decisions through the order in which brands are presented. This research examines the effect of brand order presentation in voice shopping and its impact on high-equity versus low-equity brands. Moreover, this research considers the moderating effect of product presentation format (simultaneous vs. sequential, audio vs. visual) on the impact of brand presentation order. The results of six experiments with more than 1,000 participants provide evidence that consumers attempt to balance competing concerns about risk in voice shopping with search costs because products are presented sequentially and information is reduced. If high-equity brands are presented first, the choice distribution in voice shopping is unimodal, with a peak at the first-presented products. However, a bimodal choice distribution results if low-equity brands are presented first. Importantly, choice distribution in voice shopping differs markedly from choice distribution when products are presented simultaneously and visually, as in online shopping.
... Machine learning (ML) has become a pervasive technology, driving most of the impressive technological breakthroughs within the field of artificial intelligence (AI) in recent years. A child growing up with the current digital ecosystems will likely meet ML models in both public and private spheres, as ML systems increasingly advise doctors in medicine [18], play important roles in criminal justice and terrorism prevention [82], curate media intake and advertisements on streaming sites, social media, and news platforms [35], and even enter our homes in the form of digital assistants [48]. ...
... Despite this, humans are well able to distinguish different sounds from multiple sources. In recent years, communicating with speech-oriented technological devices has become part of daily usage for billions of people around the globe in the form of voice assistant applications: Amazon Alexa, Apple Siri, Google Assistant, Microsoft Cortana [2,3], etc. Moreover, humans generally feel more comfortable communicating in their native language, and thus make use of their devices in real-life applications: military operations, education and medical research. ...
Article
Full-text available
Development of a robust native-language ASR framework is very challenging as well as an active area of research. Effective front-end and back-end approaches are required to tackle environmental differences, large training complexity, and inter-speaker variability in building a successful recognition system. In this paper, four front-end approaches, mel-frequency cepstral coefficients (MFCC), Gammatone frequency cepstral coefficients (GFCC), relative spectral-perceptual linear prediction (RASTA-PLP) and power-normalized cepstral coefficients (PNCC), have been investigated to generate unique and robust feature vectors at different SNR values. Furthermore, to handle the large training data complexity, parameter optimization has been performed with sequence-discriminative training techniques: maximum mutual information (MMI), minimum phone error (MPE), boosted MMI (bMMI), and state-level minimum Bayes risk (sMBR). This is demonstrated by selecting optimal parameter values using lattice generation and by adjusting learning rates. In the proposed framework, four different systems have been tested by analyzing various feature extraction approaches (with or without speaker normalization through the Vocal Tract Length Normalization (VTLN) approach in the test set) and classification strategies on the train dataset with or without artificial extension. To compare the performance of each system, matched (adult train and test, S1; child train and test, S2) and mismatched (adult train and child test, S3; adult + child train and child test, S4) systems have been demonstrated on a large adult and a very small Punjabi clean speech corpus. Consequently, gender-based in-domain data augmentation is used to moderate acoustic and phonetic variations across adult and children's speech under mismatched conditions. The experimental results show that an effective framework built on the PNCC + VTLN front-end approach using a TDNN-sMBR-based model with parameter optimization yields relative improvements (RI) of 40.18%, 47.51%, and 49.87% in the matched, mismatched, and gender-based in-domain augmented systems under typical clean and noisy conditions, respectively.
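The abstract reports relative improvement (RI) percentages without stating the formula; a common definition, shown here as a sketch, is the percentage reduction of the baseline error rate (the example WER values are assumptions, not the paper's numbers):

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative improvement (RI) in percent between two error rates."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# e.g. relative_improvement(30.0, 17.95) -> ~40.2, the order of the
# matched-system RI reported above (the exact WERs are illustrative).
```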
Conference Paper
Full-text available
This study aims to present an overview of the various applications of artificial intelligence, considered one of the most important outputs of the Fourth Industrial Revolution, as a basic entry point for advancing and promoting the tourism service so that it becomes a modern, viable service able to survive and compete in this highly competitive sector. We also present the difficulties artificial intelligence faces in the tourism sector. Through this research paper, we conclude that the tourism sector needs to adopt and activate artificial intelligence techniques to ensure the provision of tourism services that fit modern global trends.
Chapter
Physical rehabilitation and physiotherapy constitute a science that develops over time and makes increasing use of modern technology. To optimize the exercises prescribed for the home, various mobile applications are used, which present some deficiencies in human-computer interaction. A voice recognition module was added to a previously developed mobile application that records the movements made by a patient in physical therapy. The module recognizes a phrase spoken by the user that triggers the application's movement recognition so that the exercise routine can start, and likewise a phrase to end the routine. This work shows the development of the speech recognition module using the Google Android Speech API in a movement recognition application for physiotherapy; the corresponding architecture was developed together with an explanation of the pseudocode and the different interfaces. Results were obtained from 10 test subjects of different ages and sexes: 82% of the tests resulted in successful recognition, while 18% were not recognized due to factors such as noise and lack of training for older users. The module recognized the users' voice patterns and the mobile application worked correctly.
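The chapter uses the Google Android Speech API; as a language-neutral illustration of the phrase-trigger flow it describes, here is a sketch in Python using the speech_recognition package, with hypothetical trigger phrases:

```python
import speech_recognition as sr  # stand-in for the Android Speech API used in the chapter

START_PHRASE = "start exercise"  # hypothetical trigger phrases
STOP_PHRASE = "end exercise"

recognizer = sr.Recognizer()

def heard_phrase() -> str:
    """Capture one utterance and return its lowercase transcription ('' on failure)."""
    with sr.Microphone() as source:
        audio = recognizer.listen(source)
    try:
        return recognizer.recognize_google(audio).lower()  # Google speech backend
    except sr.UnknownValueError:
        return ""  # noise or unrecognized speech, as reported for 18% of tests

# Wait for the start phrase before launching the movement-recognition routine:
# while START_PHRASE not in heard_phrase():
#     pass
```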
Chapter
Artificial Intelligence (AI) is one of the fastest-emerging technologies of the past decade, leading the way in human-machine interaction in terms of efficiency, accuracy and overall value gained. This paper describes an intent-lean AI chatbot solution that handles user queries posed in natural language over business-related documents from a single domain of Hellenic Telecommunications Organization S.A. (HTO). Unlike traditional chatbot solutions that strictly rely on intent identification, our approach infers the implicit user need in order to provide the most relevant documents, and the text snippets within them, as the answer. To do this, we proceeded with a custom implementation based on the Elasticsearch engine and common NLP techniques tailored to our needs, e.g., tokenization, lowercase filtering, stop-word removal, stemming, fuzzy searching, and synonyms. The main challenges, as well as the architectural models that strive to overcome them, are described in detail. Finally, the effectiveness of the proposed solution is measured and the identified areas for improvement are presented.
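As an illustration of the retrieval core such an intent-lean chatbot might use, here is a hedged sketch of a fuzzy Elasticsearch query returning highlighted snippets via the official Python client; the node URL, index name, and field name are assumptions, and the analyzers for stemming, stop words, and synonyms would be configured on the index itself:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical local node

def retrieve_answer(question: str, index: str = "hto-docs", k: int = 5):
    """Fuzzy full-text search returning highlighted snippets as the 'answer'."""
    resp = es.search(
        index=index,
        query={"match": {"content": {"query": question, "fuzziness": "AUTO"}}},
        highlight={"fields": {"content": {}}},  # text snippets within the hits
        size=k,
    )
    return [hit.get("highlight", {}).get("content", [])
            for hit in resp["hits"]["hits"]]
```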
Chapter
This chapter, based on a prior bibliographic review, offers guidance from consultants as well as previously identified, documented professional practices concerning the current overall impact of Artificial Intelligence (AI) on business, economics and innovation. It discusses the role of AI in the future economy, considering increases in productivity, innovation and technological maturity. Attention is paid to business-oriented design, AI tools for business process modelling, and the benefits of AI technologies. It also shows how business leaders can remain competitive in the new economic environment by developing the skills required to understand the economic implications of AI, considering the changes that businesses will need to make to address the economic and social implications of large-scale applications of AI. In addition, we highlight the importance, benefits and applications of Machine Learning in business. Finally, the conclusions propose a future research agenda for AI in certain areas (Strategy, Relationship Marketing, Servicescape, Customer acceptance, Social acceptance, Management, Workforce and Transhumanism). Keywords: Business, Economics, Innovation, Artificial Intelligence, Machine learning, Applications, Benefits
Article
This paper proposes an approach to designing a visual chatbot to enhance the virtual knowledge sharing process. Existing chatbots are either textual or vocal, and their performance has not exceeded 60%. However, in various fields a textual description is no longer sufficient, and it is essential for users to exchange images to better express their preferences, sparing them from having to describe image content individually and transmit it in writing, which is not always straightforward. This work developed a preliminary version of a visual chatbot called SIRSBot (Smart Information Retrieval System roBot). The objective of this paper is to carry out experiments to identify the main challenges that visual information identification may face. The role of the visual chatbot is (1) to understand the user request, (2) to extract the characteristics of each object in the image that ultimately represent the user's preferences, and finally (3) to find a response that meets the user's needs.
Article
Full-text available
Automatic speech recognition (ASR) has enabled a convenient and fast mode of communication between humans and computers, and it has become more accurate over time. However, in the majority of ASR systems, the models have been trained on native English accents. While they serve native English speakers best, their accuracy drops drastically for non-native English accents. Our proposed model addresses this limitation for non-native English accents. We fine-tuned the DeepSpeech2 model, pretrained on the native-English-accent LibriSpeech dataset, and retrained it on a subset of the Common Voice dataset containing only South Asian accents, using the proposed novel loss function. We experimented with three different layer configurations of the model to learn the best features for South Asian accents. Three evaluation parameters were used: word error rate (WER), match error rate (MER), and word information loss (WIL). The results show that DeepSpeech2 can perform significantly well for South Asian accents if the weights of the initial convolutional layers are retained while the weights of deeper layers in the model (i.e., RNN and fully connected layers) are updated. Our model achieved a WER of 18.08%, the minimum error achieved for non-native English accents in comparison with the original model.
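The paper's key finding, retaining the initial convolutional weights while updating the RNN and fully connected layers, corresponds to a standard partial-freezing pattern; a minimal PyTorch sketch, assuming conv-layer parameter names start with a `conv` prefix (an illustrative assumption, not the actual DeepSpeech2 attribute names):

```python
import torch

def freeze_frontend(model: torch.nn.Module, prefix: str = "conv") -> None:
    """Retain initial convolutional weights; fine-tune only deeper layers.

    Assumes a DeepSpeech2-style model whose convolutional submodule
    parameters are named with a `conv` prefix (illustrative only).
    """
    for name, param in model.named_parameters():
        param.requires_grad = not name.startswith(prefix)

# Optimizer over the still-trainable (RNN / fully connected) parameters:
# opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)
```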
Article
Intelligent assistants have proliferated across various settings, spurring research and ideas. However, there is a lack of clarity on how they can be defined and identified. This definitional and terminological incoherence manifests in a narrow and misinformed understanding of the technology and thus poses threats to theoretical and empirical endeavors in AI. To overcome these challenges, this paper presents a definitional framework that defines intelligent assistants and distinguishes them from other technologies. This framework is informed by tracing the historical evolution of this technology over the past 70 years, which saw the rise of three constitutive features in machines: AI, interaction, and assistance. A review of the technology's definitions published since 2010 and associated terminologies was also conducted to unpack their conceptual limitations and support the framework. A new definition is then presented, which contends that a technology can only be characterized as an intelligent assistant if it is AI-enabled, interactive, and assistive.
Thesis
Full-text available
The aim of the study was to critically analyse teachers' pedagogical approaches, how voice technology was used by students as a more knowledgeable other, and the extent to which it affected students' epistemic curiosity. Using an exploratory ethnographic approach, Amazon's Echo Dot voice technology was studied in lessons at Hillview School. Data was collected through participant observation, informal interviews and recordings of students' interactions with 'Alexa'. Students asked questions to Alexa in large numbers: Alexa was asked 87 questions during two lessons, suggesting that Alexa was a digital more knowledgeable other. The types of questions asked of Alexa, such as 'Can fish see water?', were epistemic questions and suggestive of epistemic curiosity. Teachers used the Echo Dots infrequently and in a limited number of ways. Teachers relied upon a pedagogical approach and talk oriented around performance which overlooked students' learning talk. The answer to why students might not be curious was not found. However, evidence to understand how and why they might appear not curious was revealed. The study makes contributions to knowledge through the novel use of the Echo Dots to collect data and through a new data visualisation technique called 'heatmaps'. The study contributes to knowledge by proposing three tentative notions that emerged inductively from the research: 'performance-oriented talk', 'metricalisation' and 'regulativity'. The study aims to make a further contribution to knowledge by suggesting evidence of a 'pedagogy of performance'. The study recommends 'learning-oriented talk' and the development of Alexa 'Skills' as a way to disrupt the pedagogy of performance and as an area for further research.
Chapter
5G changes the landscape of mobile networks profoundly, with an evolved architecture supporting unprecedented capacity, spectral efficiency, and increased flexibility. The MARSAL project targets the development and evaluation of a complete framework for the management and orchestration of network resources in 5G and beyond by utilizing a converged optical-wireless network infrastructure in the access and fronthaul/midhaul segments. In this paper, we present a conceptual view of the MARSAL architecture, as well as a wide range of experimentation scenarios.
Article
The recent pandemic forced substantial changes in our lives, including the way we interact with physical objects. For example, voice-activated systems that enable users to communicate with them through speech commands are becoming more pervasive. At the same time, recent technology developments have delivered voice capability to Internet of Things (IoT) devices with low-power audio transducers. Voice-activated IoT devices have the potential to engage patients and caregivers in new and cost-efficient ways, from telehealth and digital health to portable diagnostics and remotely delivered care. In this brief, we review voice-activated IoT devices, discuss their trends, and identify unique challenges when these devices are used in the healthcare sector. Furthermore, we discuss some future application scenarios and their characteristics.
Article
Full-text available
Large class sizes significantly limit EFL (English as a Foreign Language) learners' speaking opportunities in class, and it is also quite difficult for them to have a regular language partner if they want to practice English after class. Microsoft Xiaoying, a chatbot available on smart phones, offers EFL learners a virtual partner for them to conduct conversational practice with, and engage in language learning anytime and anywhere. This study evaluates 50 EFL learners’ accuracy, errors, and improvements in oral tasks during 28 days of training employing this chatbot, which located their grammar and pronunciation errors. The subjects’ grammar and pronunciation accuracy increased, and their oral performance was significantly improved. In addition, the chatbot was positively perceived by most participants. This study suggests that this chatbot could be one useful learning tool for after-class oral practice and that students might significantly improve their grammar and pronunciation through conversational practice with this chatbot.
Chapter
Full-text available
Introduction and Motivation; How to Conceptualize Emotions; Why to Integrate Emotions into Conversational Agents; Making the Virtual Human Max Emotional; Examples and Experiences; Conclusions; References
Article
Full-text available
To study relations between speech and emotion, it is necessary to have methods of describing emotion. Finding appropriate methods is not straightforward, and there are difficulties associated with the most familiar. The word emotion itself is problematic: a narrow sense is often seen as “correct”, but it excludes what may be key areas in relation to speech––including states where emotion is present but not full-blown, and related states (e.g., arousal, attitude). Everyday emotion words form a rich descriptive system, but it is intractable because it involves so many categories, and the relationships among them are undefined. Several alternative types of description are available. Emotion-related biological changes are well documented, although reductionist conceptions of them are problematic. Psychology offers descriptive systems based on dimensions such as evaluation (positive or negative) and level of activation, or on logical elements that can be used to define an appraisal of the situation. Adequate descriptive systems need to recognise the importance of both time course and interactions involving multiple emotions and/or deliberate control. From these conceptions of emotion come various tools and techniques for describing particular episodes. Different tools and techniques are appropriate for different purposes.
Article
Full-text available
Recent research indicates that people respond socially to computers and perceive them as having personalities. Software agents are artifacts that particularly embody those qualities most likely to elicit social responses: fulfilling a social role, using language, and exhibiting contingent behavior. People's disposition to respond socially can be so strong that they may perceive software agents as having a personality, even when none was intended. The following is a discussion about intentionally designing personalities for social agents. To design personalities, it is necessary to consider the nature of personality and its role in interactions between people and artifacts. In addition, a case study of designing a social software agent is presented. The conclusions from this experience are summarized as guidelines for future agent developers. Personality is a fundamental linchpin of social relationships. In the context of human interaction, people automatically and unintentionally organize the behavior of their partners into simplifying traits (Uleman et al., 1996), and people tend to agree about which partners are best described by particular traits (Moskowitz, 1988). Beyond categorization, personality shapes the very nature of social relationships, even impacting how satisfying an interaction is for the participants (Dryer & Horowitz, 1997).
Chapter
How do Cognitive Things assist individuals? How do they support shopping and buying? How do they take care of the elderly? How do they operate as administrative assistants? Could they help with travel and provide concierge services?
Conference Paper
With rapid advances in natural language generation (NLG), voice has become an indispensable modality for interaction with smartphones. Most smartphone manufacturers have designed their Voice Assistant applications with some form of personalization to enhance user experience. However, these designs differ significantly in terms of usage support, features, naturalness, and the personality of the voice assistant avatar or character. Therefore the question remains: what kind of Voice Assistant would users prefer? In this study we followed a User Centered Design approach for the design of a Voice Assistant from scratch. Our primary objectives were to define the personality of a Voice Assistant avatar and to formulate design guidelines for natural dialogues and expressions for it. The aim was to design the voice assistant avatar with optimally natural, human-like aspects and behavior. This paper provides a summary of our journey and details of the methodology used in realizing the design of a natural voice assistant. As a research contribution, apart from the methodology, we also share some of the guidelines and design decisions which may be very useful for related research.
Article
More than another friendly face, Rea knows how to have a conversation with living, breathing human users with a wink, a nod, and a sidelong glance.
Amazon Inc.: Alexa Skills Kit, https://developer.amazon.com/public/solutions
Google: Google Assistant, https://assistant.google.com/
A. Sathi: Cognitive devices as human assistants. In: Cognitive (Internet of) Things: Collaboration to Optimize Action
Google: Google Home, https://madeby.google.com/home/
Microsoft Corporation: Cortana, https://www.microsoft.com/en-us/mobile/experiences/cortana/