Chapter

Digital Enhancement of Cultural Experience and Accessibility for the Visually Impaired


Abstract

Visual impairment restricts everyday mobility and limits the accessibility of places that the non-visually impaired take for granted. A short walk to a nearby destination, such as a market or a school, becomes an everyday challenge. In this chapter, we present a novel solution to this problem that can evolve into an everyday visual aid for people with limited sight or total blindness. The proposed solution is a digital system, wearable in the form of smart glasses, equipped with cameras. An intelligent system module, incorporating efficient deep learning and uncertainty-aware decision-making algorithms, interprets the video scenes, translates them into speech, and describes them to the user through audio. The user can interact almost naturally with the system via a speech-based user interface, which is also capable of understanding the user’s emotions. The capabilities of this system are investigated in the context of accessibility and guidance in outdoor environments of cultural interest, such as the historic triangle of Athens. A survey of relevant state-of-the-art systems, technologies and services is performed, identifying critical system components that best adapt to the goals of the system and to user needs and requirements, toward a user-centered architecture design.


... Existing literature reviews in the field, i.e., Cho (2021) and Butler et al. (2021), differ from the present one in that they focus on a specific area (e.g., touch-based methods or the issue of synaesthesia) and do not adopt a broader perspective, as is the case with the present article. Likewise, other relevant publications that include extensive literature surveys, namely Meliones and Sampson (2018) and Iakovidis et al. (2020), focus on navigation-related approaches. Our aim is to present a broader perspective of the methodological developments and identify patterns that provide a sense of how future research may, or should, evolve. ...
... In this respect, this is a point of convergence with the present study, which foregrounds the potential of emerging methods to benefit visually impaired people and the average museum visitor. Meliones and Sampson (2018) and Iakovidis et al. (2020) both address the issue of ICT-based facilitation of navigating spaces in a self-guiding manner. Both include literature reviews that are extensive and informative but serve to contextualise the presentation of a proposed methodological approach or case study. ...
... Therefore, although relevant, these are not systematic literature reviews proper. Iakovidis et al. (2020), more specifically, present the concept of a multifaceted, complex, and wearable system that facilitates navigation and provides informative material for VI people in cultural sites. This publication is supported by a thorough literature review that surveys technological advancements relevant to the proposed concept. ...
Article
This literature review surveys published research that aims to foster blind and visually impaired (VI) people’s engagement with cultural heritage. The reviewed papers cover a broad area of methodological approaches, outlining challenges and solutions for empowering VI people to enjoy cultural heritage sites and museums, mainly with the use of Information and Communication Technologies (ICT). Fifty (50) papers are included in this survey, published between 2008 and 2021. They mostly focus on multisensory, multimodal and interdisciplinary, ICT-based assistive human-centered computing approaches. These methods often optimize the VI user experience through an effort to gain an in-depth understanding of this special group’s needs. Moreover, several publications include user evaluations and empirical studies. Findings suggest that approaches tend to become more complex, multimodal, and multidisciplinary as time progresses. This leads to a discussion and conclusions suggesting future research directions in this domain. The proposed way forward relates to a synthesis of methodologies, scientific domains, and approaches into a more holistic, comprehensive, and synergetic model.
... Image-based obstacle detection is a component of major importance for assistive navigation systems for the VCP. A user requirement analysis [17] revealed that users need a system that aims at real-time performance and mainly detects vertical objects, e.g., trees, humans, stairs, and ground anomalies. ...
... This enables CNNs to automatically extract features from the entire image, instead of relying on hand-crafted features, such as color and texture. Multiple CNN architectures have been proposed in recent years, each contributing unique characteristics [17]. Although conventional CNN architectures, such as the Visual Geometry Group Network (VGGNet) [40], offer great classification performance, they usually require large, high-end workstations equipped with Graphical Processing Units (GPUs) to execute them. ...
... The wearable device, in the form of smart glasses, was designed using CAD software according to the user requirements listed in [17]. The requirements most relevant to the design state that the wearable system should be attractive and elegant, possibly with a selection of different colors, but in a minimalist rather than attention-grabbing way. ...
Article
Full-text available
Every day, visually challenged people (VCP) face mobility restrictions and accessibility limitations. A short walk to a nearby destination, which for other individuals is taken for granted, becomes a challenge. To tackle this problem, we propose a novel visual perception system for outdoor navigation that can evolve into an everyday visual aid for VCP. The proposed methodology is integrated in a wearable visual perception system (VPS). The proposed approach efficiently combines deep learning object recognition models with an obstacle detection methodology based on human eye-fixation prediction using Generative Adversarial Networks. An uncertainty-aware modeling of the obstacle risk assessment and spatial localization has been employed, following a fuzzy logic approach, for robust obstacle detection. This combination can translate the position and the type of detected obstacles into descriptive linguistic expressions, allowing users to easily understand where obstacles lie in the environment and avoid them. The performance and capabilities of the proposed method are investigated in the context of safe navigation of VCP in outdoor environments of cultural interest through obstacle recognition and detection. Additionally, a comparison between the proposed system and relevant state-of-the-art systems for the safe navigation of VCP, focused on design and user-requirements satisfaction, is performed.
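As a rough illustration of the fuzzy, linguistic layer described above, the following sketch maps a detected obstacle's distance and bearing to spoken-style phrases by picking the fuzzy set with the highest membership degree. The term labels and membership boundaries are illustrative assumptions, not the authors' actual parameters.

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function with support [a, c] and peak at b."""
    return float(max(min((x - a) / (b - a + 1e-9), (c - x) / (c - b + 1e-9)), 0.0))

def describe_obstacle(distance_m, bearing_deg):
    """Translate an obstacle's measured distance and bearing into a
    linguistic description (highest-membership term wins)."""
    distance_terms = {           # illustrative term boundaries (metres)
        "very close": tri(distance_m, 0.0, 0.5, 1.5),
        "close":      tri(distance_m, 0.5, 1.5, 3.0),
        "far":        tri(distance_m, 1.5, 4.0, 8.0),
    }
    bearing_terms = {            # illustrative bearings (degrees, 0 = ahead)
        "to your left":  tri(bearing_deg, -90, -45, 0),
        "ahead":         tri(bearing_deg, -30, 0, 30),
        "to your right": tri(bearing_deg, 0, 45, 90),
    }
    d = max(distance_terms, key=distance_terms.get)
    b = max(bearing_terms, key=bearing_terms.get)
    return f"obstacle {d}, {b}"

print(describe_obstacle(1.2, 10.0))  # "obstacle close, ahead"
```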
... Another important term optimized by the INLP model is the route smoothness term, which avoids abrupt and repetitive route deviations and changes, since people with disabilities, such as VIIs, should always feel safe, walking along routes that are as straight as possible, which makes it easier for them to maintain their orientation [15,16]. Also, the cultural experience term maximizes the cultural experience of the visitor by penalizing multiple crossovers of paths, especially of those with low cultural value. ...
... The proposed framework has been implemented as a module to be integrated in the ENORASI system, aiming at the navigation of people with disabilities, such as VIIs, in outdoor cultural spaces [15,16]. Future work includes the development of a voice-enabled interface for the interaction of the users with the RP and other modules of the system that combine sonification of the visual information and audio descriptions [44]. (Table 7 reports experimental results under various validation scenarios: the cumulative tour collision risk is the sum of the collision risk factors along the generated route; the tour duration, given in time periods, is how long the user needs to walk the generated route; and the number of route changes is how many turns the user should perform while walking it.) ...
Article
Full-text available
Route planning (RP) enables individuals to navigate in unfamiliar environments. Current RP methodologies generate routes that optimize criteria relevant to the traveling distance or time, whereas most of them do not consider personal preferences or needs. Also, most current smart wearable assistive navigation systems offer limited support to individuals with disabilities by providing obstacle avoidance instructions, while often neglecting their special requirements with respect to route quality. Motivated by the mobility needs of such individuals, this study proposes a novel RP framework for assistive navigation that copes with these open issues. The framework is based on a novel mixed 0–1 integer nonlinear programming model for solving the RP problem with constraints originating from the needs of individuals with disabilities; unlike previous models, it minimizes: (1) the collision risk with obstacles within a path, by prioritizing safer paths; (2) the walking time; (3) the number of turns, by constructing smooth paths; and (4) the loss of cultural interest, by penalizing multiple crossovers of the same paths, while satisfying user preferences, such as points of interest to visit and a desired tour duration. The proposed framework is applied to the development of a system module for safe navigation of visually impaired individuals (VIIs) in outdoor cultural spaces. The module is evaluated in a variety of navigation scenarios with different parameters. The results demonstrate the comparative advantage of our RP model over relevant state-of-the-art models, by generating safer and more convenient routes for the VIIs.
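To make the four objective terms concrete, here is a minimal evaluation function that scores a candidate route as a weighted sum of collision risk, walking time, turn count, and cultural loss. It is a sketch under assumed data structures and weights; the paper's actual model is a mixed 0–1 integer nonlinear program solved as an optimization problem, not a scoring function.

```python
import math

def route_cost(route, xy, risk, walk_time, cultural_value,
               w_risk=1.0, w_time=0.2, w_turn=0.5, w_culture=0.3):
    """Weighted-sum score of a candidate route (a list of node ids).
    xy: node -> (x, y) coordinates; risk, walk_time, cultural_value:
    (u, v) edge -> float. Weights and term shapes are illustrative."""
    edges = list(zip(route[:-1], route[1:]))
    total_risk = sum(risk[e] for e in edges)          # (1) collision risk
    total_time = sum(walk_time[e] for e in edges)     # (2) walking time
    turns = 0                                         # (3) smoothness: heading changes
    for a, b, c in zip(route, route[1:], route[2:]):
        h1 = math.atan2(xy[b][1] - xy[a][1], xy[b][0] - xy[a][0])
        h2 = math.atan2(xy[c][1] - xy[b][1], xy[c][0] - xy[b][0])
        d = abs(h2 - h1)
        if min(d, 2 * math.pi - d) > math.radians(30):  # turn threshold assumed
            turns += 1
    seen, culture_loss = set(), 0.0                   # (4) penalize re-crossing
    for e in edges:                                   #     low-value edges
        key = frozenset(e)
        if key in seen:
            culture_loss += 1.0 / (1.0 + cultural_value[e])
        seen.add(key)
    return (w_risk * total_risk + w_time * total_time
            + w_turn * turns + w_culture * culture_loss)
```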
... In the context of the project ENORASI, we are developing a novel wearable system aiming to offer guidance to VIIs in open environments of cultural value and touristic interest [49]. An outline of the ENORASI system architecture is illustrated in Fig. 2 with the main hardware modules being listed in Table 1. ...
... In this study, a thorough literature review was conducted, including a total of 29 studies on the user requirements for SASs for VIIs, significantly extending the preliminary review we presented in [49]. The requirements collected from the literature review of the assistive system presented in this study are summarized in Table 11. ...
Article
Full-text available
The marginalization of people with disabilities, such as visually impaired individuals (VIIs), has driven scientists to take advantage of the fast growth of smart technologies and develop smart assistive systems (SASs) to bring VIIs back to social life, education and even culture. Our research focuses on developing a human–computer interactive system that will guide VIIs in outdoor cultural environments by offering universal access to cultural information, social networking and safe navigation, among other services. The VI users interact with the computer-based SAS to control the system during its operation, while having access to a remote connection with non-VIIs for external guidance and company. The development of such a system needs a user-centered design (UCD) that incorporates the elicitation of the necessary requirements for a satisfying operation for the VI users. In this paper, we present a novel SAS system for VIIs and its design considerations, which follow a UCD approach to determine a set of operational, functional, ergonomic, environmental and optional requirements of the system. Both VIIs and non-VIIs took part in a series of interviews and questionnaires, from which data were analyzed to form the requirements of the system for both on-site and remote use. The final requirements were tested in trials, and their evaluation and results are presented. The experimental investigations gave significant feedback for the development of the system throughout the design process. The most important contribution of this study is the derivation of requirements applicable not only to the specific system under investigation, but also to other relevant SASs for VIIs.
... The application of sonification for visually impaired individuals has shown promise in the past, with ongoing research seeking to expand access to different experiences for these individuals (Iakovidis et al., 2020; Sekhavat et al., 2022). Dynamic data sonification has also been an area of research in the past, with applications for artistic experiences at aquariums (Jeon et al., 2012) as well as use with visually impaired individuals (Ji et al., 2021) or with gesture sonification (Vatavu, 2017). ...
Article
Advances in the fields of data processing and sonification have been applied to transcribe a variety of visual experiences into an auditory format. Although image sonification examples exist, the application of these principles to visual art has not been examined thoroughly. We sought to develop and evaluate a set of guidelines for the sonification of visual artworks. Through conducting expert interviews (N = 11), we created an initial sonification algorithm that accounts for art style, lightness, and color diversity to modulate the sonified output in terms of tempo and pitch. This algorithm was evaluated through user evaluations (N = 22). User study responses supported expert interview findings, the notion that sonification can be designed to match the experience of viewing an artwork, and showed interesting interaction effects among art styles, visual components, and musical parameters. We suggest the proposed guidelines can augment visitor experiences at art exhibits and provide the basis for further experimentation.
... where $d$ represents the closest distance of a VII to an obstacle, and $d_s$ represents the distance to an obstacle that poses no danger of collision. For our experiments, $d_s = 2$ m, as the desired detection distance for the early avoidance of an obstacle, according to the requirements of the visually impaired users, is up to 2 m [6]. Moreover, in order to quantify safety, the first norm ($p = 1$) is calculated, according to (7), and taken as a performance criterion. ...
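A minimal sketch of how such an L1 (p = 1) safety criterion could be computed over a sequence of per-frame closest-obstacle distances follows. The symbol names ($d$, $d_s$) and the exact aggregation are assumptions, since the extraction stripped the original notation and equation (7) is not reproduced in the excerpt.

```python
import numpy as np

D_SAFE = 2.0  # metres: desired early-detection distance per the user requirements [6]

def safety_l1(closest_distances):
    """L1 criterion over per-frame closest-obstacle distances: sums how far
    inside the safety margin the user came in each frame, so 0.0 means the
    margin was never violated."""
    d = np.asarray(closest_distances, dtype=float)
    violations = np.clip(D_SAFE - d, 0.0, None)  # 0 when d >= D_SAFE
    return float(violations.sum())

print(safety_l1([2.5, 1.8, 0.9]))  # 0.0 + 0.2 + 1.1 = approx. 1.3
```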
Chapter
Full-text available
Motivated by the brainstorming process of human beings, a novel learning Fuzzy Cognitive Map (FCM) model named Brainstorming Fuzzy Cognitive Map (BFCM) is proposed. The proposed model is based on a state-of-the-art optimization algorithm, named Determinative Brain Storm Optimization, which is utilized to automatically adapt the weights of the FCM structure. In this study, BFCM is applied for safe outdoor navigation of visually impaired individuals. This application ensures the avoidance of static obstacles in an unknown environment, by taking into consideration the output of an obstacle detection system based on a depth camera. The simulation results show that the proposed model can effectively assist the users to avoid static obstacles and safely reach a desired destination, and they promise a wider applicability of the model to other domains, such as robotics.
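For readers unfamiliar with FCM inference, the sketch below shows the standard synchronous update rule A_i(t+1) = f(A_i(t) + sum_j w_ji * A_j(t)) that such a model iterates. The toy concepts and fixed weights are illustrative assumptions; the BFCM's actual contribution, adapting the weights with Determinative Brain Storm Optimization, is not shown here.

```python
import numpy as np

def fcm_step(A, W):
    """One synchronous Fuzzy Cognitive Map update with sigmoid squashing:
    A_i(t+1) = sigmoid( A_i(t) + sum_j W[j, i] * A_j(t) )."""
    return 1.0 / (1.0 + np.exp(-(A + A @ W)))

# Toy map (weights hand-picked, not BFCM-learned): obstacle proximity
# excites the "turn" concept and inhibits the "advance" concept.
W = np.array([[0.0,  0.8, -0.7],   # proximity -> turn, advance
              [0.0,  0.0,  0.0],
              [0.0,  0.0,  0.0]])
A = np.array([0.9, 0.1, 0.6])      # activations: proximity, turn, advance
for _ in range(5):
    A = fcm_step(A, W)
print(A)                           # "turn" settles above "advance"
```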
... On the other hand, the accessibility of outdoor sites of cultural interest has been less explored, despite the significance of such a venture. In the ENORASI project [36,19], a pre-commercial digital system has been investigated to assist VI individuals in navigating safely in outdoor environments of cultural interest, e.g., archeological sites. While providing information concerning the sights in a descriptive manner, the system supports the user with audible guidance and instructions for obstacle avoidance. ...
Chapter
The assistive navigation of visually impaired individuals requires the development of different algorithms for obstacle detection, recognition, avoidance, and path planning. The assessment and optimization of such algorithms in the real world is a painstaking process that requires repetitive measurements under stable conditions, which is usually difficult to achieve and costly. To this end, digital twin environments can be used to replicate relevant real-life situations, enabling the evaluation and optimization of algorithms through adjustable and cost-effective simulations. This chapter presents a digital twin framework for the simulation and evaluation of assistive navigation systems, and its application in the context of a camera-based wearable system for visually impaired individuals in an outdoor cultural space. The system incorporates an obstacle avoidance algorithm based on fuzzy logic. The utility and the effectiveness of this framework are demonstrated with an indicative simulation study.
... Iakovidis [23] conducted an empirical study to investigate the experience of blind users with mobile screen readers. Users faced difficulties with inaccessible content, but they were able to overcome the challenges. ...
Article
Full-text available
The Internet of Things approaches applied in the context of home automation have been an important promise to improve the lives of users with visual impairments. However, there are few research studies that result in well-established techniques and guidelines for making these applications accessible for use with screen reading software or other adjustments required by visually impaired users. This article presents an analysis of design features to help design more affordable mobile home automation applications for visually impaired users. The analysis was carried out through tests by seven visually impaired users on a prototype developed in previous studies. Despite the limitations in the proof-of-concept prototype used in this study, users found the application promising and highlighted the need for improvements in the field, highlighting the opportunities to provide home automation applications to enhance independent living scenarios. The article also shows the challenges to be faced in these applications, considering the current limitations with support for mobile interaction through screen readers for applications that are very intensive in terms of dynamic elements and the need for immediate and accurate feedback.
... The wide use of automated robotics and vehicles in various applications has created the need for autonomous path planning strategies that remove the human presence from decision-making [1][2][3]. Typical path planning strategies rely on single-objective models, such as minimization of traveled distance, time, or energy consumption, among others, where traditional and popular optimization algorithms are employed for finding an optimal solution. ...
Article
Full-text available
Advances in robotic motion and computer vision have contributed to the increased use of automated and unmanned vehicles in complex and dynamic environments for various applications. Unmanned surface vehicles (USVs) have attracted a lot of attention from scientists seeking to consolidate the wide use of USVs in maritime transportation. However, most traditional path planning approaches are single-objective approaches that mainly find the shortest path. Dynamic and complex environments impose the need for multi-objective path planning, where an optimal path should be found that satisfies contradicting objective terms. To this end, a swarm intelligence graph-based pathfinding algorithm (SIGPA) has been proposed in the recent literature. This study aims to enhance the performance of the SIGPA algorithm by integrating fuzzy logic, in order to cope with the multiple objectives and generate quality solutions. A comparative evaluation is conducted among SIGPA and the two most popular fuzzy inference systems, Mamdani (SIGPAF-M) and Takagi–Sugeno–Kang (SIGPAF-TSK). The results showed that each methodology suits different application needs: SIGPA remains a reliable approach for real-time applications due to its low computational effort; SIGPAF-M generates better paths; and SIGPAF-TSK reaches a better trade-off between solution quality and computation time.
... • Similarly, natural language processing is utilized during the stages between document analysis and document-to-audio, in order to retain meta-information on the semantics of the rendered information [35,36]. • Computer vision methods for the disabled, especially the visually impaired, aim to process the visual world and present the information to the human user or to an assistive system or application [37]. Such methods include obstacle detection and scene recognition, as well as distance and object size calculation. ...
Chapter
Full-text available
This work presents the universal access design principles and methods for natural language communication design in e-learning for the disabled. It unfolds a theoretical perspective on the design-for-all methodology and provides a framework description for technologies for creating accessible content for educational content communication. Main concerns include the identification of design issues for universal accessibility of spoken material, the primary pedagogical aspects that such content implementation should follow, and a look into the state of the most popular e-learning platforms on which educators create and communicate educational content in an e-learning environment. Existing types of content on massive open online course platforms are examined in order to understand the challenges of bridging the gap between the modern design of rich courses and universal accessibility. The paper looks into the existing technologies for accessibility and a frame for analysis, including methodological and design issues, available resources, and implementation using the existing technologies for accessibility, from the perspective of both the designer and the user. Finally, a study to inform and assess how potential educators may perceive the accessibility factor shows that accessible content is a major requirement toward a successful path to universally accessible e-learning.
... In recent years, with the development of sensors and mobile computing, a wide variety of portable navigation systems have been proposed to assist VIP to avoid obstacles (2019), navigate (Jayakody et al., 2020; Donati et al.), and perceive the environment (Iakovidis et al., 2018). Positioning plays an important role in assisting VIP. ...
Article
Personalized tourist route planning (TRP) and navigation are online or real-time applications whose mathematical modeling leads to complex optimization problems. These problems are usually formulated with mathematical programming and can be described as NP-hard problems. Moreover, state-of-the-art (SOA) path search algorithms do not perform efficiently in solving multi-objective optimization (MO) problems, making them inappropriate for real-time processing. To address the above limitations and the need for online processing, a swarm intelligence graph-based pathfinding algorithm (SIGPA) for MO route planning was developed. SIGPA generates a population whose individuals move in a greedy manner based on the A* algorithm to search the solution space from different directions. It can be used to find an optimal path for every graph-based problem under various objectives. To test SIGPA, a generic MOTRP formulation is proposed. A generic TRP formulation remains a challenge, since it has not been studied thoroughly in the literature. To this end, a novel mixed binary quadratic programming model is proposed for generating personalized TRPs based on multi-objective criteria and user preferences, supporting, also, electric vehicles and sensitive social groups in outdoor cultural environments. The model aims to optimize the route under various factors that the user can choose, such as travelled distance, smoothness of the route without multiple deviations, safety, and cultural interest. The proposed model was compared to five SOA models for addressing TRP problems in 120 different scenarios solved with the CPLEX solver and SIGPA. SIGPA was also tested in real scenarios with the A* algorithm. The results proved the effectiveness of our model in terms of optimality, as well as the efficiency of SIGPA in terms of computing time. The convergence and fitness landscape analyses showed that SIGPA achieved quality solutions with stable convergence.
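The greedy, A*-guided move that each SIGPA individual performs can be sketched as ordinary A* over a scalarized multi-objective edge cost, as below. The weights, the three objective terms, and the graph encoding are assumptions for illustration; the actual algorithm runs a population of such searches from different directions rather than a single one.

```python
import heapq
from itertools import count

def multiobjective_astar(graph, start, goal, h, w=(1.0, 0.5, 0.3)):
    """Single-individual sketch of a SIGPA-style greedy move: plain A*
    over an edge cost scalarizing three objectives (distance, risk,
    cultural loss). graph: node -> [(neighbor, (dist, risk, loss))];
    h: admissible heuristic on the scalarized cost."""
    tie = count()  # tie-breaker so the heap never compares path lists
    frontier = [(h(start), next(tie), 0.0, start, [start])]
    best = {start: 0.0}
    while frontier:
        _, _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        for nxt, (dist, risk, loss) in graph[node]:
            g2 = g + w[0] * dist + w[1] * risk + w[2] * loss
            if g2 < best.get(nxt, float("inf")):
                best[nxt] = g2
                heapq.heappush(frontier,
                               (g2 + h(nxt), next(tie), g2, nxt, path + [nxt]))
    return None, float("inf")
```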
Chapter
Obstacle detection addresses the detection of an object, of any kind, that interferes with the canonical trajectory of a subject, such as a human or an autonomous robotic agent. Prompt obstacle detection can be critical for the safety of visually impaired individuals (VII). In this context, we propose a novel methodology for obstacle detection, which is based on a Generative Adversarial Network (GAN) model, trained with human eye fixations to predict saliency, and the depth information provided by an RGB-D sensor. A method based on fuzzy sets is used to translate the 3D spatial information into linguistic values easily comprehensible by VII. Fuzzy operators are applied to fuse the spatial information with the saliency information for the purpose of detecting and determining whether an object may interfere with the safe navigation of the VII. For the evaluation of our method we captured outdoor video sequences of 10,170 frames in total, with obstacles including rocks, trees and pedestrians. The results showed that the use of fuzzy representations results in enhanced obstacle detection accuracy, reaching 88.1%.
Article
Full-text available
The Symmetric Traveling Salesman Problem (sTSP) is an intensively studied NP-hard problem. It has many important real-life applications, such as logistics, planning, microchip manufacturing and DNA sequencing. In this paper we propose a cluster-level incremental tour construction method called the Intra-cluster Refinement Heuristic (IntraClusTSP). The proposed method can be used both to extend the tour with a new node and to improve the existing tour. The refinement step generates a locally optimal tour for a cluster of neighbouring nodes, and this locally optimal tour is then merged into the global tour. Based on the evaluation tests performed, the proposed IntraClusTSP method provides efficient incremental tour generation and can improve tour quality for every tested state-of-the-art method, including the most efficient Chained Lin-Kernighan refinement algorithm. As an application example, we apply IntraClusTSP to automatically determine the optimal number of clusters in a cluster analysis problem. Standard methods for estimating the number of clusters, such as the Silhouette index, the Elbow method or the Gap statistic, support only partitional (single-level) clustering, while in many application areas hierarchical (multi-level) clustering provides a better clustering model. Our proposed method can discover hierarchical clustering structure and provides outstanding performance in both accuracy and execution time.
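The cluster-level refinement idea can be illustrated with a toy stand-in: take a small neighbourhood of consecutive tour nodes, solve that sub-tour optimally, and splice the local optimum back into the global tour. This brute-force sketch is not the authors' heuristic, only the shape of the operation.

```python
from itertools import permutations

def refine_segment(tour, dist, i, k=6):
    """Toy stand-in for a cluster-level refinement step: brute-force the
    optimal ordering of the k consecutive tour nodes starting at index i,
    keeping the segment endpoints fixed, then splice the local optimum
    back into the global tour. dist[a][b] is the distance between nodes."""
    seg = tour[i:i + k]
    if len(seg) < 4:                      # nothing interior to reorder
        return tour
    first, last, inner = seg[0], seg[-1], seg[1:-1]
    def seg_len(order):
        path = [first, *order, last]
        return sum(dist[a][b] for a, b in zip(path, path[1:]))
    best = min(permutations(inner), key=seg_len)
    return tour[:i + 1] + list(best) + tour[i + k - 1:]
```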
Conference Paper
Full-text available
People with visual impairments (PVI) have shown interest in visiting museums and enjoying visual art. Based on this knowledge, some museums provide tactile reproductions of artworks, specialized tours for PVI, or enable them to schedule accessible visits. However, the ability of PVI to visit museums is still dependent on the assistance they get from their family and friends or from the museum personnel. In this paper, we surveyed 19 PVI to understand their opinions and expectations about visiting museums independently, as well as the requirements of user interfaces to support it. Moreover, we increase the knowledge about the previous experiences, motivations and accessibility issues of PVI in museums.
Article
Full-text available
Localization systems play an important role in assisted navigation. Precise localization makes visually impaired people aware of ambient environments and prevents them from coming across potential hazards. The majority of visual localization algorithms, which are applied to autonomous vehicles, are not completely adaptable to the scenarios of assisted navigation. Those vehicle-based approaches are vulnerable to viewpoint, appearance and route changes (between database and query images) caused by the wearable cameras of assistive devices. Facing these practical challenges, we propose Visual Localizer, which is composed of a ConvNet descriptor and global optimization, to achieve robust visual localization for assisted navigation. The performance of five prevailing ConvNets is comprehensively compared, and GoogLeNet is found to feature the best performance on environmental invariance. By concatenating two compressed convolutional layers of GoogLeNet, we use only thousands of bytes to represent each image efficiently. To further improve the robustness of image matching, we utilize the network flow model as a global optimization of image matching. Extensive experiments using images captured by visually impaired volunteers illustrate that the system performs well in the context of assisted navigation.
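A minimal sketch of the descriptor idea: pull two convolutional feature maps out of a pretrained GoogLeNet, compress each, and concatenate them into one compact image descriptor. The layer pair (inception4e, inception5b) and the use of global average pooling as the compression step are assumptions; the paper's exact layers and compression are not reproduced in the excerpt.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.googlenet(weights=models.GoogLeNet_Weights.DEFAULT).eval()

feats = {}
def hook(name):
    def fn(module, inputs, output):
        feats[name] = output.detach()   # capture the intermediate feature map
    return fn

model.inception4e.register_forward_hook(hook("4e"))
model.inception5b.register_forward_hook(hook("5b"))

@torch.no_grad()
def describe(image_batch):              # (N, 3, 224, 224), ImageNet-normalized
    model(image_batch)
    parts = [feats[k].mean(dim=(2, 3)) for k in ("4e", "5b")]  # GAP compression
    desc = torch.cat(parts, dim=1)      # (N, 832 + 1024) compact descriptor
    return F.normalize(desc, dim=1)     # unit length for cosine matching

print(describe(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1856])
```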
Chapter
Full-text available
Over the past years, convolutional neural networks (CNNs) have not only demonstrated impressive capabilities in computer vision but also created new possibilities for providing navigational assistance to people with visual impairment. In addition to obstacle avoidance and mobile localization, it is helpful for visually impaired people to perceive kinetic information about their surroundings. A road barrier, a specific obstacle as well as a sign of an entrance or exit, is a ubiquitous hazard in daily environments. To address road barrier recognition, this paper proposes a novel convolutional neural network named KrNet, which is able to execute scene classification on mobile devices in real time. The architecture of KrNet not only features depthwise separable convolution and the channel shuffle operation to reduce computational cost and latency, but also takes advantage of Inception modules to maintain accuracy. Experimental results are presented to demonstrate qualified performance for the meaningful and useful applications of navigational assistance within residential and working areas.
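The two efficiency tricks named above are standard building blocks, sketched here in generic form (this is not KrNet itself; layer widths and placement are illustrative):

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    """Interleave channels across groups so that grouped convolutions can
    mix information (the shuffle operation borrowed from ShuffleNet)."""
    n, c, h, w = x.size()
    return (x.view(n, groups, c // groups, h, w)
             .transpose(1, 2).contiguous().view(n, c, h, w))

class DepthwiseSeparable(nn.Module):
    """Depthwise 3x3 followed by pointwise 1x1: the cheap replacement for a
    full convolution that cuts computational cost and latency."""
    def __init__(self, cin, cout, stride=1):
        super().__init__()
        self.dw = nn.Conv2d(cin, cin, 3, stride, 1, groups=cin, bias=False)
        self.pw = nn.Conv2d(cin, cout, 1, bias=False)
        self.bn = nn.BatchNorm2d(cout)
    def forward(self, x):
        return torch.relu(self.bn(self.pw(self.dw(x))))

x = torch.randn(2, 32, 56, 56)
y = DepthwiseSeparable(32, 64)(channel_shuffle(x, groups=4))
print(y.shape)  # torch.Size([2, 64, 56, 56])
```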
Chapter
Full-text available
The Smart Glass is a potential aid for people who are visually impaired that might lead to improvements in their quality of life. The smart glass is for people who need to navigate independently and feel socially comfortable and secure while they do so. It is based on the simple idea that blind people do not want to stand out while using tools for help. This paper focuses on the significant work done in the field of wearable electronics and the features that come as add-ons. The Smart Glass consists of ultrasonic sensors to detect objects ahead in real time and feeds the Raspberry Pi, which analyzes whether the object is an obstacle or a person. It can also determine whether the object is closing in very fast and, if so, provide a warning through vibrations in the recognized direction. It has the added feature of GSM, which can assist the person in making a call during an emergency situation. The software framework of the whole system is managed using the Robot Operating System (ROS), developed in a ROS catkin workspace with the necessary packages and nodes; ROS was loaded onto the Raspberry Pi running Ubuntu MATE.
Article
Full-text available
Several studies have addressed the problem of abnormality detection in medical images using computer-based systems. The impact of such systems in clinical practice and in the society can be high, considering that they can contribute to the reduction of medical errors and the associated adverse events. Today, most of these systems are based on binary classification algorithms that are “strongly” supervised, in the sense that the abnormal training images need to be annotated in detail, i.e., with pixel-level annotations indicating the location of the abnormalities. However, this approach usually does not take into account the diversity of the image content, which may include a variety of structures and artifacts. In the context of gastrointestinal video-endoscopy, addressed in this study, the semantics of the normal contents of the endoscopic video frames include normal mucosal tissues, bubbles, debris and the hole of the lumen, whereas the abnormal video frames may include additional semantics corresponding to lesions or blood. This observation motivated us to investigate various multi-label classification methods, aiming to a richer semantic interpretation of the endoscopic images. Among them, an image-saliency enabled bag-of-words approach and a convolutional neural network architecture enabling multi-scale feature extraction (MM-CNN) are presented. Weakly-supervised learning is implemented using only semantic-level annotations, i.e., meaningful keywords, thus, avoiding the need for the resource demanding pixelwise annotation of the training images. Experiments were performed on a diverse set of wireless capsule endoscopy images. The results of the experiments validate that the weakly-supervised multi-label classification can provide enhanced discrimination of the gastrointestinal abnormalities, with MM-CNN method to provide the best performance.
Article
Full-text available
Speech-music discrimination is a traditional task in audio analytics, useful for a wide range of applications, such as automatic speech recognition and radio broadcast monitoring, that focuses on segmenting audio streams and classifying each segment as either speech or music. In this paper we investigate the capabilities of Convolutional Neural Networks (CNNs) with regard to the speech-music discrimination task. Instead of representing the audio content using handcrafted audio features, as traditional methods do, we use deep structures to learn visual feature dependencies as they appear in the spectrogram domain (i.e., we train a CNN using audio spectrograms as input images). The main contribution of our work concerns the potential of using pre-trained deep architectures along with transfer learning to train robust audio classifiers for the particular task of speech-music discrimination. We highlight the superiority of the proposed methods, compared both to typical audio-based and deep-learning methods that adopt handcrafted features, and we evaluate our system in terms of classification success and run-time execution. To our knowledge this is the first work that investigates CNNs for the task of speech-music discrimination and the first that exploits transfer learning across very different domains for audio modeling using deep learning in general. In particular, we fine-tune a deep architecture originally trained for the ImageNet classification task, using a relatively small amount of data (almost 80 minutes of training audio samples) along with data augmentation. We evaluate our system through extensive experimentation against three different datasets. Firstly, we experiment on a real-world dataset of more than 10 h of uninterrupted radio broadcasts and secondly, for comparison purposes, we evaluate our best method on two publicly available datasets that were designed specifically for the task of speech-music discrimination. Our results indicate that CNNs can significantly outperform the current state-of-the-art in terms of performance, especially when transfer learning is applied, in all three test datasets. All the discussed methods, along with the whole experimental setup and the respective datasets, are openly provided for reproduction and further experimentation.
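The transfer-learning recipe described above amounts to swapping the classification head of an ImageNet-pretrained network and fine-tuning on spectrogram "images". A minimal sketch follows; the backbone choice (ResNet-18) and hyperparameters are assumptions, not the paper's setup.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Start from ImageNet weights, replace the head with a 2-way classifier.
net = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
net.fc = nn.Linear(net.fc.in_features, 2)   # speech vs. music

optimizer = torch.optim.SGD(net.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(spectrograms, labels):
    """One fine-tuning step on a batch of spectrogram images
    shaped (N, 3, 224, 224) with integer labels shaped (N,)."""
    net.train()
    optimizer.zero_grad()
    loss = criterion(net(spectrograms), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```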
Article
Full-text available
This paper proposes a novel methodology for automatic detection and localization of gastrointestinal (GI) anomalies in endoscopic video frame sequences. Training is performed with weakly annotated images, using only image-level, semantic labels instead of detailed, pixel-level annotations. This makes it a cost-effective approach for the analysis of large videoendoscopy repositories. Other advantages of the proposed methodology include its capability to suggest possible locations of GI anomalies within the video frames, and its generality, in the sense that abnormal frame detection is based on automatically derived image features. It is implemented in three phases: a) It classifies the video frames into abnormal or normal using a Weakly Supervised Convolutional Neural Network (WCNN) architecture; b) detects salient points from deeper WCNN layers, using a Deep Saliency Detection (DSD) algorithm; and c) localizes GI anomalies using an Iterative Cluster Unification (ICU) algorithm. ICU is based on a Pointwise Cross-Feature-Map (PCFM) descriptor extracted locally from the detected salient points using information derived from the WCNN. Results from extensive experimentation using publicly available collections of gastrointestinal endoscopy video frames, are presented. The datasets used include a variety of GI anomalies. Both the anomaly detection and the localization performance achieved, in terms of the Area Under receiver operating Characteristic (AUC), were >80%. The highest AUC for anomaly detection was obtained on conventional gastroscopy images, reaching 96%, and the highest AUC for anomaly localization was obtained on wireless capsule endoscopy images, reaching 88%.
Article
Full-text available
Navigational assistance aims to help visually-impaired people to ambulate the environment safely and independently. This topic becomes challenging as it requires detecting a wide variety of scenes to provide higher level assistive awareness. Vision-based technologies with monocular detectors or depth sensors have sprung up within several years of research. These separate approaches have achieved remarkable results with relatively low processing time and have improved the mobility of impaired people to a large extent. However, running all detectors jointly increases the latency and burdens the computational resources. In this paper, we put forward seizing pixel-wise semantic segmentation to cover navigation-related perception needs in a unified way. This is critical not only for the terrain awareness regarding traversable areas, sidewalks, stairs and water hazards, but also for the avoidance of short-range obstacles, fast-approaching pedestrians and vehicles. The core of our unification proposal is a deep architecture, aimed at attaining efficient semantic understanding. We have integrated the approach in a wearable navigation system by incorporating robust depth segmentation. A comprehensive set of experiments prove the qualified accuracy over state-of-the-art methods while maintaining real-time speed. We also present a closed-loop field test involving real visually-impaired users, demonstrating the effectivity and versatility of the assistive framework.
Article
Full-text available
Feature extraction is vital for face recognition. In this paper, we focus on a general feature extraction framework for robust face recognition. We collect about 300 papers regarding face feature extraction. While some works apply handcrafted features, other works employ statistical learning methods. We believe that a general framework for face feature extraction consists of four major components: filtering, encoding, spatial pooling and holistic representation. We analyze each component in detail. Each component can be applied in a task at multiple levels. Then, we provide a brief review of deep learning networks, which can be seen as a hierarchical extension of the framework above. Finally, we provide a detailed performance comparison of various features on the LFW and FERET face databases.
Article
Full-text available
The introduction of RGB-D sensors is a revolutionary development, offering a portable, versatile and cost-effective solution for navigational assistance for the visually impaired. RGB-D sensors on the market, such as the Microsoft Kinect, Asus Xtion and Intel RealSense, are mature products, but all have a minimum detection distance of about 800 mm. This results in the loss of depth information and the omission of short-range obstacles, posing a significant risk to navigation. This paper puts forward a simple and effective approach to reduce the minimum range, enhancing the reliability and safety of navigational assistance. Over-dense regions of IR speckles in two IR images are exploited as a stereo pair to generate short-range depth, together with fusion of the original depth image and the RGB image to eliminate misjudgment. In addition, a seeded growing algorithm for obstacle detection with extended depth information is presented. Finally, the minimum range of the Intel RealSense R200 is decreased by approximately 75%, from 650 mm to 165 mm. Experimental results show the capacity to detect obstacles from 165 mm to more than 5000 mm and improved performance of navigational assistance with the expanded detection range. The presented approach proves to be of qualified accuracy and speed for guiding the visually impaired.
Article
Full-text available
In this paper, we introduce the so-called DEEP-SEE framework, which jointly exploits computer vision algorithms and deep convolutional neural networks (CNNs) to detect, track and recognize, in real time, objects encountered during navigation in the outdoor environment. A first feature concerns an object detection technique designed to localize both static and dynamic objects without any a priori knowledge about their position, type or shape. The methodological core of the proposed approach relies on a novel object tracking method based on two convolutional neural networks trained offline. The key principle consists of alternating between tracking using motion information and predicting the object location in time based on visual similarity. The validation of the tracking technique is performed on standard benchmark VOT datasets, and shows that the proposed approach returns state-of-the-art results while minimizing the computational complexity. Then, the DEEP-SEE framework is integrated into a novel assistive device, designed to improve the cognition of VI people and to increase their safety when navigating in crowded urban scenes. The validation of our assistive device is performed on a video dataset with 30 elements acquired with the help of VI users. The proposed system shows high accuracy (>90%) and robustness (>90%) scores regardless of the scene dynamics.
Article
Full-text available
Recent work has shown that convolutional networks can be substantially deeper, more accurate and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper we embrace this observation and introduce the Dense Convolutional Network (DenseNet), where each layer is directly connected to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections, one between each layer and its subsequent layer (treating the input as layer 0), our network has L(L+1)/2 direct connections. For each layer, the feature maps of all preceding layers are treated as separate inputs whereas its own feature maps are passed on as inputs to all subsequent layers. Our proposed connectivity pattern has several compelling advantages: it alleviates the vanishing gradient problem and strengthens feature propagation; despite the increase in connections, it encourages feature reuse and leads to a substantial reduction of parameters; its models tend to generalize surprisingly well. We evaluate our proposed architecture on five highly competitive object recognition benchmark tasks. The DenseNet obtains significant improvements over the state-of-the-art on all five of them (e.g., yielding 3.74% test error on CIFAR-10, 19.25% on CIFAR-100 and 1.59% on SVHN).
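The connectivity pattern described above is easy to see in code: every layer receives the concatenation of all preceding feature maps, so n layers yield the L(L+1)/2 direct connections the abstract counts. A minimal dense block sketch (the growth rate and layer count are illustrative):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Minimal dense block: layer i consumes the concatenation of the input
    and all earlier layers' outputs, and contributes growth_rate new maps."""
    def __init__(self, cin, growth_rate=12, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(n_layers):
            c = cin + i * growth_rate          # channels seen by layer i
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(c), nn.ReLU(inplace=True),
                nn.Conv2d(c, growth_rate, 3, padding=1, bias=False)))
    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)      # all maps flow onward

out = DenseBlock(16)(torch.randn(1, 16, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32]) = 16 + 4 * 12 channels
```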
Article
Full-text available
Blind or visually impaired (BVI) individuals are capable of identifying an object in their hands by combining the available visual cues (if any) with manipulation. It is harder for them to associate the object with a specific brand, model, or type. Starting from this observation, we propose a collaborative system designed to deliver visual feedback automatically and to help the user fill this semantic gap. Our visual recognition module is implemented by means of an image retrieval procedure that provides real-time feedback, performs the computation locally on the device, and is scalable to new categories and instances. We carry out a thorough experimental analysis of the visual recognition module, which includes a comparative analysis with the state of the art. We also present two different system implementations, which we test with the help of BVI users to evaluate the technical soundness, the usability, and the effectiveness of the proposed concept.
Article
Full-text available
Visually impaired people are often unaware of dangers in front of them, even in familiar environments. Furthermore, in unfamiliar environments, such people require guidance to reduce the risk of colliding with obstacles. This study proposes a simple smartphone-based guiding system that solves the navigation problems of visually impaired people and achieves obstacle avoidance, enabling them to travel smoothly from a starting point to a destination with greater awareness of their surroundings. In this study, a computer image recognition system and a smartphone application were integrated to form a simple assisted guiding system. Two operating modes, online and offline, can be chosen depending on network availability. When the system begins to operate, the smartphone captures the scene in front of the user and sends the captured images to the backend server to be processed. The backend server uses the Faster R-CNN (faster region convolutional neural network) algorithm or the YOLO (you only look once) algorithm to recognize multiple obstacles in every image, and subsequently sends the results back to the smartphone. The obstacle recognition accuracy in this study reached 60%, which is sufficient for assisting visually impaired people in recognizing the types and locations of obstacles around them.
Article
Full-text available
Emotion recognition from speech may play a crucial role in many applications related to human–computer interaction or to understanding the affective state of users in certain tasks, where other modalities such as video or physiological parameters are unavailable. In general, a human’s emotions may be recognized using several modalities, such as analyzing facial expressions, speech, or physiological parameters (e.g., electroencephalograms, electrocardiograms). However, measuring these modalities may be difficult, obtrusive or require expensive hardware. In that context, speech may be the best alternative modality in many practical applications. In this work we present an approach that uses a Convolutional Neural Network (CNN) functioning as a visual feature extractor and trained using raw speech information. In contrast to traditional machine learning approaches, CNNs are responsible for identifying the important features of the input, thus making hand-crafted feature engineering optional in many tasks. In this paper no extra features are required other than the spectrogram representations, and hand-crafted features were extracted only for validation purposes. Moreover, the approach does not require any linguistic model and is not specific to any particular language. We compare the proposed approach using cross-language datasets and demonstrate that it is able to provide superior results vs. traditional approaches that use hand-crafted features.
Article
Full-text available
The World Health Organization (WHO) reported that there are 285 million visually-impaired people worldwide. Among these individuals, there are 39 million who are totally blind. There have been several systems designed to support visually-impaired people and to improve the quality of their lives. Unfortunately, most of these systems are limited in their capabilities. In this paper, we present a comparative survey of the wearable and portable assistive devices for visually-impaired people in order to show the progress in assistive technology for this group of people. Thus, the contribution of this literature survey is to discuss in detail the most significant devices that are presented in the literature to assist this population and highlight the improvements, advantages, disadvantages, and accuracy. Our aim is to address and present most of the issues of these systems to pave the way for other researchers to design devices that ensure safety and independent mobility to visually-impaired people.
Article
In this paper, we propose a novel Fully Convolutional Neural Network (FCN) architecture aiming to aid the detection of abnormalities, such as polyps, ulcers and blood, in gastrointestinal (GI) endoscopy images. The proposed architecture, named Look-Behind FCN (LB-FCN), is capable of extracting multi-scale image features by using blocks of parallel convolutional layers with different filter sizes. These blocks are connected by Look-Behind (LB) connections, so that the features they produce are combined with features extracted from earlier layers, thus preserving the respective information. Furthermore, it has a smaller number of free parameters than conventional Convolutional Neural Network (CNN) architectures, which makes it suitable for training with smaller datasets. This is particularly useful in medical image analysis, since data availability is usually limited due to ethicolegal constraints. The performance of LB-FCN is evaluated on both flexible and wireless capsule endoscopy datasets, reaching 99.72% and 93.50%, respectively, in terms of Area Under the receiver operating Characteristic curve (AUC).
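The multi-scale block idea, parallel convolutions with different filter sizes over the same input, concatenated so fine and coarse features coexist, can be sketched as below. Filter sizes and branch widths are assumptions, and the Look-Behind connections to earlier layers are omitted.

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Parallel convolutions with different kernel sizes over one input,
    concatenated channel-wise: a generic multi-scale feature block in the
    spirit of LB-FCN's parallel-layer blocks (not the exact architecture)."""
    def __init__(self, cin, cout_per_branch=16, sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(cin, cout_per_branch, k, padding=k // 2) for k in sizes)
    def forward(self, x):
        return torch.relu(torch.cat([b(x) for b in self.branches], dim=1))

y = MultiScaleBlock(3)(torch.randn(1, 3, 64, 64))
print(y.shape)  # torch.Size([1, 48, 64, 64]): 3 branches x 16 channels
```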
Article
Our paper presents the development of a real-time system based on detection, classification, and position estimation of objects in an outdoor environment, to provide visually impaired individuals with voice-output-based scene perception. The system is low-cost, lightweight, simple, and easily wearable. An Odroid board integrated with a USB camera and a USB laser is utilized for the purpose. To reduce usability problems, a user-centered design approach was adopted, in which feedback from various individuals was obtained to understand their problems and requirements. The valuable insights gained from the feedback were then used to modify the system to best suit the requirements of the user. The object detection framework exploits a multimodal feature-fusion-based deep learning architecture using edge, multiscale, and optical flow information. Fusing edge information with raw data is motivated by the fact that stronger edge regions result in a higher number of activated neurons, hence inducing better feature representations. Learning deep features at multiple scales, as well as using motion dynamics at the feature level, leads to better semantic and discriminative representations, thus providing robustness to the detection framework. Experimental results obtained using the PASCAL VOC 2007 dataset, the Caltech dataset, as well as captured real-time data are demonstrated.
Conference Paper
We present region-based, fully convolutional networks for accurate and efficient object detection. In contrast to previous region-based detectors such as Fast/Faster R-CNN [6, 18] that apply a costly per-region subnetwork hundreds of times, our region-based detector is fully convolutional with almost all computation shared on the entire image. To achieve this goal, we propose position-sensitive score maps to address a dilemma between translation-invariance in image classification and translation-variance in object detection. Our method can thus naturally adopt fully convolutional image classifier backbones, such as the latest Residual Networks (ResNets) [9], for object detection. We show competitive results on the PASCAL VOC datasets (e.g., 83.6% mAP on the 2007 set) with the 101-layer ResNet. Meanwhile, our result is achieved at a test-time speed of 170ms per image, 2.5-20× faster than the Faster R-CNN counterpart. Code is made publicly available at: https://github.com/daijifeng001/r-fcn.
Article
The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design and train a simple dense detector we call RetinaNet. Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors. Code is at: https://github.com/facebookresearch/Detectron.
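The loss reshaping described above has a compact form: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), where the (1 - p_t)^gamma factor shrinks the contribution of well-classified examples. A standard binary formulation, with the commonly cited defaults alpha = 0.25 and gamma = 2:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy examples so the rare, hard
    foreground examples dominate training instead of the flood of easy
    negatives. targets are 0/1 floats with the same shape as logits."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)          # prob. of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

logits = torch.tensor([3.0, -2.0, 0.1])
targets = torch.tensor([1.0, 0.0, 1.0])
print(focal_loss(logits, targets))  # easy examples contribute almost nothing
```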
Article
Robotic endoscopic systems offer a minimally invasive approach to the examination of internal body structures, and their application is rapidly extending to cover the increasing needs for accurate therapeutic interventions. In this context, it is essential for such systems to be able to perform measurements, such as measuring the distance travelled by a wireless capsule endoscope, so as to determine the location of a lesion in the gastrointestinal (GI) tract, or to measure the size of lesions for diagnostic purposes. In this paper, we investigate the feasibility of performing contactless measurements using a computer vision approach based on neural networks. The proposed system integrates a deep convolutional image registration approach and a multilayer feed-forward neural network in a novel architecture. The main advantage of this system, with respect to the state-of-the-art ones, is that it is more generic in the sense that it is: i) unconstrained by specific models, ii) more robust to non-rigid deformations, and iii) adaptable to most of the endoscopic systems and environments, while enabling measurements of enhanced accuracy. The performance of this system is evaluated in ex-vivo conditions using a phantom experimental model and a robotically-assisted test bench. The results obtained promise a wider applicability and impact in endoscopy in the era of big data.
Conference Paper
This paper introduces an end-to-end solution for dynamic adaptation of the learning experience for learners with different personal needs, based on their behavioural and affective reactions to the learning activities. Personal needs refer to what learners already know, what they need to learn, their intellectual and physical capacities, and their learning styles.
Conference Paper
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
Article
We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
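Written out, the minimax game the abstract describes is the paper's well-known value function, with D trained to tell real samples from generated ones and G trained to fool it:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

At the unique equilibrium, G reproduces the training data distribution and D(x) = 1/2 everywhere, as the abstract notes.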
Technical Report
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
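The extra depth comes from stacking small 3×3 convolutions: two stacked 3×3 layers cover a 5×5 receptive field with fewer parameters than a single 5×5 layer. A sketch of one such VGG-style block in PyTorch (illustrative, not the original configuration files):

    import torch.nn as nn

    def vgg_block(in_ch, out_ch, n_convs):
        """Stack of 3x3 convolutions followed by 2x2 max-pooling."""
        layers = []
        for i in range(n_convs):
            layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                                 kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
        layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        return nn.Sequential(*layers)

    # VGG-16's convolutional trunk: 2+2+3+3+3 = 13 weight layers;
    # adding the 3 fully connected layers gives the 16 of the name.
    trunk = nn.Sequential(vgg_block(3, 64, 2), vgg_block(64, 128, 2),
                          vgg_block(128, 256, 3), vgg_block(256, 512, 3),
                          vgg_block(512, 512, 3))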
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
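The top-1 and top-5 error rates quoted above are the standard ImageNet metrics: the fraction of test images whose true label is absent from the 1 (or 5) highest-scoring predictions. A generic helper showing how they are computed (not tied to the original evaluation code):

    import torch

    def topk_error(logits, targets, k=5):
        """Fraction of samples whose true label is not among the
        k highest-scoring predictions."""
        topk = logits.topk(k, dim=1).indices            # (N, k)
        hit = (topk == targets.unsqueeze(1)).any(dim=1)
        return 1.0 - hit.float().mean().item()

    logits = torch.randn(8, 1000)   # fake scores over 1000 classes
    targets = torch.randint(0, 1000, (8,))
    print(topk_error(logits, targets, k=1), topk_error(logits, targets, k=5))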
Conference Paper
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [7] and Fast R-CNN [5] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model [18], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
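A minimal sketch of the RPN head described above: a shared 3×3 convolution slides over the backbone feature map, followed by two sibling 1×1 convolutions predicting, for each of k anchors per position, 2 objectness scores and 4 box offsets. Channel widths here follow the VGG-16 variant of the paper, but the code is an illustration, not the released implementation:

    import torch
    import torch.nn as nn

    class RPNHead(nn.Module):
        def __init__(self, in_ch=512, k=9):   # k anchors per location
            super().__init__()
            self.conv = nn.Conv2d(in_ch, 512, kernel_size=3, padding=1)
            self.cls = nn.Conv2d(512, 2 * k, kernel_size=1)   # object / not
            self.reg = nn.Conv2d(512, 4 * k, kernel_size=1)   # box offsets

        def forward(self, feature_map):
            h = torch.relu(self.conv(feature_map))
            return self.cls(h), self.reg(h)

    # Feature map from a VGG-16-like backbone: (batch, 512, H/16, W/16).
    scores, deltas = RPNHead()(torch.randn(1, 512, 38, 50))
    print(scores.shape, deltas.shape)  # (1, 18, 38, 50), (1, 36, 38, 50)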
Article
Wireless capsule endoscopy (WCE) is performed with a miniature swallowable endoscope enabling the visualization of the whole gastrointestinal (GI) tract. One of the most challenging problems in WCE is the localization of the capsule endoscope (CE) within the GI lumen. Contemporary, radiation-free localization approaches are mainly based on the use of external sensors and transit time estimation techniques, with practically low localization accuracy. Latest advances for the solution of this problem include localization approaches based solely on visual information from the CE camera. In this paper we present a novel visual localization approach based on an intelligent artificial-neural-network architecture, which implements a generic visual odometry (VO) framework capable of estimating the motion of the CE in physical units. Unlike conventional geometric VO approaches, the proposed one is adaptive to the geometric model of the CE used; therefore, it does not require any prior knowledge about the camera or its intrinsic parameters. Furthermore, it exploits color as a cue to increase localization accuracy and robustness. Experiments were performed using a robotic-assisted setup providing ground truth information about the actual location of the CE. The lowest average localization error achieved is 2.70 ± 1.62 cm, which is significantly lower than the error obtained with the geometric approach. This result constitutes a promising step towards the in-vivo application of VO, which will open new horizons for accurate local treatment, including drug infusion and surgical interventions.
Chapter
The importance of involving the persons who will ultimately use a design, already during the design process leading up to the final product or service, is increasingly acknowledged. This chapter is intended to provide both inspiration and practical suggestions for anyone interested in designing for and with persons with visual impairments. The text focuses on co-design, but many of the adaptations and materials presented can also be used in more traditional design activities, such as usability testing. The chapter rests on an inclusive mindset. In other words, we focus on how to expand and enhance existing methods regarding who is involved, and how to provide means for participation to wider target groups, rather than how to create “special” methods for “special” users with “special” needs.
Conference Paper
This paper studies the monocular visual odometry (VO) problem. Most existing VO algorithms are developed under a standard pipeline including feature extraction, feature matching, motion estimation, local optimisation, etc. Although some of them have demonstrated superior performance, they usually need to be carefully designed and specifically fine-tuned to work well in different environments. Some prior knowledge is also required to recover an absolute scale for monocular VO. This paper presents a novel end-to-end framework for monocular VO by using deep Recurrent Convolutional Neural Networks (RCNNs). Since it is trained and deployed in an end-to-end manner, it infers poses directly from a sequence of raw RGB images (videos) without adopting any module from the conventional VO pipeline. Based on the RCNNs, it not only automatically learns effective feature representations for the VO problem through Convolutional Neural Networks, but also implicitly models sequential dynamics and relations using deep Recurrent Neural Networks. Extensive experiments on the KITTI VO dataset show competitive performance to state-of-the-art methods, verifying that the end-to-end Deep Learning technique can be a viable complement to traditional VO systems.
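A compact sketch of the recurrent-convolutional idea: a CNN encodes each pair of consecutive frames, an LSTM models the sequence, and a linear layer regresses a 6-DoF pose per step. All layer sizes below are illustrative placeholders, not the paper's FlowNet-based configuration:

    import torch
    import torch.nn as nn

    class DeepVOSketch(nn.Module):
        def __init__(self):
            super().__init__()
            # Two stacked RGB frames -> 6 input channels.
            self.cnn = nn.Sequential(
                nn.Conv2d(6, 16, 7, stride=2, padding=3), nn.ReLU(),
                nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4))
            self.rnn = nn.LSTM(32 * 16, 128, batch_first=True)
            self.pose = nn.Linear(128, 6)  # translation + rotation per step

        def forward(self, pairs):          # (batch, time, 6, H, W)
            b, t = pairs.shape[:2]
            f = self.cnn(pairs.flatten(0, 1)).flatten(1).view(b, t, -1)
            h, _ = self.rnn(f)
            return self.pose(h)            # (batch, time, 6)

    poses = DeepVOSketch()(torch.randn(1, 5, 6, 96, 128))
    print(poses.shape)  # torch.Size([1, 5, 6])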
Article
We introduce an extremely computation-efficient CNN architecture named ShuffleNet, designed specially for mobile devices with very limited computing power (e.g., 10-150 MFLOPs). The new architecture utilizes two proposed operations, pointwise group convolution and channel shuffle, to greatly reduce computation cost while maintaining accuracy. Experiments on ImageNet classification and MS COCO object detection demonstrate the superior performance of ShuffleNet over other structures, e.g., lower top-1 error (absolute 6.7%) than the recent MobileNet system on ImageNet classification under the computation budget of 40 MFLOPs. On an ARM-based mobile device, ShuffleNet achieves roughly 13× actual speedup over AlexNet while maintaining comparable accuracy.
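The channel shuffle operation at the heart of the architecture is a simple reshape-transpose that lets information flow across the channel groups used by the pointwise group convolutions. A sketch in PyTorch (the operation is as described in the paper; variable names are ours):

    import torch

    def channel_shuffle(x, groups):
        """Interleave channels across groups so that subsequent group
        convolutions see inputs from every group."""
        b, c, h, w = x.shape
        assert c % groups == 0
        x = x.view(b, groups, c // groups, h, w)   # split into groups
        x = x.transpose(1, 2).contiguous()         # interleave them
        return x.view(b, c, h, w)

    x = torch.arange(8.0).view(1, 8, 1, 1)
    print(channel_shuffle(x, groups=2).flatten())
    # tensor([0., 4., 1., 5., 2., 6., 3., 7.])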
Article
Convolutional sparse coding (CSC) can model local connections between image content and reduce code redundancy when compared with patch-based sparse coding. However, CSC needs a complicated optimization procedure to infer the codes (i.e., feature maps). In this brief, we propose a convolutional sparse auto-encoder (CSAE), which leverages the structure of the convolutional auto-encoder and incorporates max-pooling to heuristically sparsify the feature maps for feature learning. Together with competition over feature channels, this simple sparsifying strategy makes the stochastic gradient descent algorithm work efficiently for CSAE training; thus, no complicated optimization procedure is involved. We employed the features learned in the CSAE to initialize convolutional neural networks for classification and achieved competitive results on benchmark data sets. In addition, by building connections between the CSAE and CSC, we propose a strategy to construct local descriptors from the CSAE for classification. Experiments on Caltech-101 and Caltech-256 clearly demonstrated the effectiveness of the proposed method and verified that the CSAE, as a CSC model, has the ability to explore connections between neighboring image content for classification tasks.
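A minimal sketch of a convolutional auto-encoder that sparsifies its feature maps with max-pooling, in the spirit of the CSAE (a generic illustration under our own assumptions, not the authors' exact model): pooling followed by unpooling keeps only the pooled maxima and zeroes everything else before decoding.

    import torch
    import torch.nn as nn

    class SparseConvAE(nn.Module):
        def __init__(self):
            super().__init__()
            self.enc = nn.Conv2d(1, 16, kernel_size=5, padding=2)
            self.pool = nn.MaxPool2d(2, return_indices=True)   # sparsifier
            self.unpool = nn.MaxUnpool2d(2)                    # zeros elsewhere
            self.dec = nn.ConvTranspose2d(16, 1, kernel_size=5, padding=2)

        def forward(self, x):
            f = torch.relu(self.enc(x))
            p, idx = self.pool(f)
            sparse = self.unpool(p, idx)       # only pooled maxima survive
            return self.dec(sparse)

    model = SparseConvAE()
    x = torch.randn(4, 1, 28, 28)
    loss = nn.functional.mse_loss(model(x), x)  # reconstruction objective
    loss.backward()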
Conference Paper
This paper proposes a novel approach to measuring object size using a regular digital camera. Remote object-size measurement is crucial to many multimedia applications. Our computer-aided, automatic object-size measurement technique is based on a new depth-information extraction (range-finding) scheme using a regular digital camera. Conventional range finding is often carried out with passive methods, such as stereo cameras, or active methods, such as ultrasonic and infrared equipment; these either require a cumbersome set-up or deal only with point targets. The proposed approach requires only a digital camera with certain image processing techniques and relies on the basic principles of visible light. Experiments are conducted to evaluate the performance of the proposed object-size measurement mechanism. The average error percentage of the method is below 2%, demonstrating its striking effectiveness.
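The underlying geometry is the standard pinhole-camera relation: an object of physical size S at distance Z projects to s = f·S/Z pixels for a focal length f expressed in pixels, so once depth is estimated the size follows directly. A short sketch of that final step (the formula is standard; the paper's actual depth-extraction scheme is more involved):

    def object_size(pixel_extent, distance, focal_length_px):
        """Pinhole model: physical size = distance * pixel extent / focal length.
        Pixel quantities in pixels; distance and result in metres."""
        return distance * pixel_extent / focal_length_px

    # Example: a 150 px tall object, 2 m away, focal length 800 px -> ~0.375 m.
    print(object_size(150, 2.0, 800))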