IOP Conference Series: Materials Science and Engineering
PAPER • OPEN ACCESS
Human Identification through Kinect’s Depth, RGB, and Sound Sensor
To cite this article: P Shunmugam et al 2019 IOP Conf. Ser.: Mater. Sci. Eng. 705 012040
View the article online for updates and enhancements.
Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution
of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
Published under licence by IOP Publishing Ltd
5th International Conference on Man Machine Systems
IOP Conf. Series: Materials Science and Engineering 705 (2019) 012040
IOP Publishing
doi:10.1088/1757-899X/705/1/012040
Human Identification through Kinect’s Depth, RGB, and
Sound Sensor
P Shunmugam1, K Kamarudin1,2,a, A A A Mosed1,2 and S A A Shukor1
1School of Mechatronics Engineering, Universiti Malaysia Perlis (UniMAP), Arau,
Malaysia,
2Centre of Excellence for Advanced Sensor Technology (CEASTech), Universiti
Malaysia Perlis, Arau, Malaysia
akamarulzaman@unimap.edu.my
Abstract. Human identification is an important and widely researched subject in computing. This paper proposes a human identification system based on each of the Kinect's individual sensors, as well as on a combination of all its available sensors. In the first part of the project, each sensor on the Kinect (i.e. the IR depth sensor, the RGB camera and the microphone array) was used for skeleton recognition, face recognition and speech recognition respectively. These individual recognition methods were then combined, as a step-by-step process, into a multi-sensor recognition system. Several experiments were carried out to test the reliability of the developed human identification systems. The results show that the multi-sensor human identification system is far more reliable than any single-sensor system, because it involves several recognition stages, each of which requires a different piece of the user's biometric information.
1. Introduction
Human identification has promising applications and potential economic value. It is a systematic process based on biometric technology, i.e. on measurable human biological characteristics, and is beneficial in many fields, mainly authentication and security systems [1].
The Microsoft Kinect, one of the most significant inventions of 2010, provides high-resolution depth sensing, a four-microphone array, and visual (RGB) imaging at a much lower cost than other 3D cameras such as stereo and time-of-flight cameras [2]. It has been used in applications ranging from playing a virtual violin to health care and physical therapy, retail, education, and training [3]. Figure 1 shows various biometric technologies [1][4][5][6] and the components of the Kinect sensor. The Kinect's specifications are: an RGB camera with 640 x 480-pixel resolution; an infrared (IR) emitter and an IR depth sensor to capture depth images; an array of four microphones with 16-kHz, 24-bit mono pulse code modulation (PCM); and a 3-axis accelerometer to determine the orientation of the Kinect [2].
Figure 1. (a) Various biometric technologies [1][4] [5] [6]; (b) Components of the Kinect sensor [2].
The variety of sensors available on the Kinect can be exploited for multi-typed sensor data based human identification, where more data can be collected, resulting in more accurate and efficient identification. Although the Kinect contains many sensors, its price is much lower than that of other multi-sensor devices [7], so a low-cost, efficient and accurate human identification system can be built around it. In this research, a speech recognition system using the microphone array, a skeleton recognition system using the IR depth sensor, and a face recognition system using the RGB camera were developed for single-typed sensor data based human identification. The speech recognizer matches a spoken command, the skeleton recognizer matches the mean lengths of the user's 19 bones, and the face recognizer matches templates of the user's face, eyes, nose and lips. For the multi-typed sensor data based system, speech, skeleton and face recognition are combined into one system that runs step by step. The reliability of the developed human identification methods was then tested and analyzed.
2. Related Work
In a previous study, Mel-frequency cepstrum coefficients, logarithmic power and related values were calculated from a person's voice [8]. Another study [9] used facial depth data of a speaking subject, captured by the Kinect, as an additional speech-informative modality for a traditional audiovisual automatic speech recognizer. The authors presented a feature extraction algorithm for both the visual and the accompanying depth modalities, based on a discrete cosine transform of the mouth region-of-interest data, further transformed by a two-stage linear discriminant analysis projection to incorporate speech dynamics and improve classification. An autonomous voice- and motion-controlled video camera system was developed by Jody (Pritchett) Shu [10]. There, the Kinect data is processed in real time by software on a laptop to track the lecturer's movement and automatically change focus to the board, projector screen, or other visual aids. In addition to posture and voice-command recognition, the system filters out background sounds and acts as a directional microphone to pick up the lecturer's speech in an optimal fashion for voice recognition [10].
3. Methodologies
3.1. Single-typed sensor data based human identification
3.1.1. Speech Recognition System (Microphone)
This system uses automatic speech recognition, i.e. a speech-to-text technique. Its flow chart is shown in Figure 2. The Kinect's microphone array collects 24-bit audio data, which is pre-processed to remove background noise and echoes using an automatic echo cancellation (AEC) algorithm [11]. With multiple microphones, the time at which sound from a source arrives at each microphone differs slightly. The captured audio is fed to a preliminary processor, where beamforming and sound localization techniques determine the direction of the sound source, so that the set of microphones acts as a directional microphone.
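The arrival-time idea above can be sketched in a few lines. This is an illustrative, simplified model, not the Kinect SDK's implementation: with two microphones a known distance apart, the inter-microphone lag (found here by brute-force cross-correlation) gives the direction of arrival. The sample rate, microphone spacing and function names are assumptions for the sketch.

```python
import math

def best_lag(sig_a, sig_b, max_lag):
    """Lag (in samples) of sig_b relative to sig_a that maximizes
    their cross-correlation."""
    def corr(lag):
        return sum(sig_a[i] * sig_b[i + lag]
                   for i in range(len(sig_a))
                   if 0 <= i + lag < len(sig_b))
    return max(range(-max_lag, max_lag + 1), key=corr)

def source_angle(lag, fs=16_000, mic_spacing=0.113, c=343.0):
    """Direction of arrival (degrees from broadside) implied by the lag,
    assuming a far-field source and plane-wave propagation."""
    delay = lag / fs                      # lag in seconds
    s = max(-1.0, min(1.0, c * delay / mic_spacing))
    return math.degrees(math.asin(s))
```

A zero lag corresponds to a source directly in front of the array; larger lags steer the estimate toward one side.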
Figure 2. Speech Recognition system flow chart
The speech module consists of two parts: the Microsoft Speech Recognition Engine [12] and the Microsoft speech recognition grammar. The engine matches vocal inputs against words and phrases defined by grammar rules. The speech recognition grammar offers two kinds of rules: a simple rule that recognizes short words and commands, and a richer rule that can recognize and organize semantic content across various user accents. In this system we implemented the simple rule: the user adds a word to the command list, and the word is then initialized. The system was built with the grammar builder and speech recognizer from the "System.Speech (4.0.0.0)" .NET constructor in LabVIEW [13].
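The simple-rule behaviour described above amounts to accepting a recognized phrase only when it is in the registered command list. A minimal sketch of that gating step follows; the function and variable names, and the confidence threshold, are illustrative assumptions, not the System.Speech API:

```python
from typing import Optional

# Registered command words, as added by the user to the command list.
COMMANDS = {"measure", "scan"}

def recognize_command(phrase: str, confidence: float,
                      threshold: float = 0.7) -> Optional[str]:
    """Accept a recognized phrase only if it is a registered command
    and the engine's confidence score exceeds the threshold."""
    word = phrase.strip().lower()
    if word in COMMANDS and confidence >= threshold:
        return word
    return None

recognize_command("Measure", 0.9)   # -> "measure"
recognize_command("measure", 0.4)   # -> None (confidence too low)
recognize_command("jump", 0.95)     # -> None (not a registered command)
```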
3.1.2. Skeleton Recognition System (IR Depth sensor)
The depth sensor is used to recognize the human skeleton, via the Kinesthesia toolkit. The system utilizes the "Distance and Displacement between Joints VI", which obtains the coordinates of the skeleton's 20 joints [14], from head to feet. When the Microsoft Kinect starts operating, the subject to be recognized must stand at least 1.5 meters from the Kinect. The 20 joints are then detected and their coordinates obtained. Adjacent joints are connected, so that 19 joint pairs make up the bones of the whole skeleton. The distance between each joint pair (i.e. bone) is calculated with the Euclidean distance (Equation 1) [15]:
$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$    (1)
where d is the distance between the two joints, (x1, y1, z1) are the coordinates of the first joint and (x2, y2, z2) are the coordinates of the second. Applying this to every joint pair yields the 19 absolute distances between the 20 adjacent joints. The data is written continuously to one spreadsheet file. After the Euclidean distances have been calculated, the mean of each of the 19 absolute distances is computed using Equation 2:
$\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i$    (2)
where n is the total number of readings. The data is saved under the subject's name, registering that person as a user. During recognition, the 19 mean distances stored in the .csv file are compared, within a ±1 tolerance, against the 19 real-time Euclidean distances of the subject to be recognized. Each of the 19 comparisons outputs logic 1 or logic 0, and the outputs are summed using the "Compound Arithmetic" function with the "Add" option. If the score is equal to or greater than the threshold of 15, the subject is recognized as the registered user.
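The whole skeleton-matching pipeline above (Equation 1, Equation 2, the ±1 tolerance and the score-of-15 rule) can be sketched as follows. Joint data and the tolerance unit are illustrative assumptions; the paper's implementation runs inside LabVIEW:

```python
import math

def bone_lengths(joints):
    """Euclidean distance (Equation 1) between each of the 19
    adjacent joint pairs of a 20-joint skeleton."""
    return [math.dist(joints[i], joints[i + 1])
            for i in range(len(joints) - 1)]

def mean_lengths(frames):
    """Per-bone mean over n recorded frames (Equation 2)."""
    n = len(frames)
    per_frame = [bone_lengths(f) for f in frames]
    return [sum(col) / n for col in zip(*per_frame)]

def match_score(registered_means, live_lengths, tol=1.0):
    """Count bones whose live length lies within +/-tol of the mean;
    each comparison is the paper's logic-1/logic-0 output."""
    return sum(abs(m - d) <= tol
               for m, d in zip(registered_means, live_lengths))

def is_recognized(registered_means, live_lengths, threshold=15):
    """Recognized if at least 15 of the 19 bones match."""
    return match_score(registered_means, live_lengths) >= threshold
```

Registering a user amounts to storing `mean_lengths(frames)`; recognition compares that record against the live `bone_lengths`.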
3.1.3. Face Recognition System (RGB camera)
One of the most remarkable abilities of human vision is face recognition [16]. In this part of the work, the RGB camera is used to recognize faces. The recognizer was developed with the IMAQ Vision Development Toolkit [17] and uses its built-in colour and pattern matching capabilities on the Kinect camera stream. Given the right templates and adequate settings, it performs well considering its simplicity. The approach splits into two main parts: face detection and point following. Face detection is accomplished by matching eye, nose and lips templates against the camera stream image. Figure 3 shows sample eye, nose and lips templates in .png format.
Figure 3. Sample of pattern matching templates: (a) face (b) eye (c) nose (d) lips.
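The template-matching idea can be sketched as sliding each template over the frame and scoring every position. The sketch below uses a simple sum-of-squared-differences score over grayscale values; the IMAQ toolkit uses its own colour/pattern matching, and the score scale of the threshold of 700 used in the experiments is specific to that toolkit:

```python
def best_match(frame, template):
    """Slide the template over the frame and return
    (best_score, row, col), where a lower sum-of-squared-differences
    score means a closer match."""
    fh, fw = len(frame), len(frame[0])
    th, tw = len(template), len(template[0])
    best = None
    for r in range(fh - th + 1):
        for c in range(fw - tw + 1):
            ssd = sum((frame[r + i][c + j] - template[i][j]) ** 2
                      for i in range(th) for j in range(tw))
            if best is None or ssd < best[0]:
                best = (ssd, r, c)
    return best
```

A face is declared detected when the best scores for the eye, nose and lips templates all clear the chosen acceptance threshold.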
3.2. Multi-sensor based Human Identification
This system was developed for human identification based on multiple types of sensor data, specifically from the depth, RGB and sound sensors. It combines the speech, skeleton and face recognition approaches described above, executed step by step. First, the user registers his or her speech command, skeleton information and face template in the system. To be recognized, a subject must speak the registered command. If the command is recognized, the subject proceeds to the second step, skeleton recognition; if the skeleton is recognized, the subject goes through face recognition as the last step. The user is identified and authorized by the system only if all three recognition stages are passed. If any stage fails, the subject is not recognized and the system ends the process.
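The step-by-step gating above reduces to a short loop: each stage must pass before the next runs, and a single failure aborts identification. The stage functions here are stand-ins for the three recognizers described in Section 3.1:

```python
def identify(subject, stages):
    """Run the recognition stages in order; authorize only if all pass.
    `stages` is a list of (name, predicate) pairs, one per recognizer."""
    for name, stage in stages:
        if not stage(subject):
            return False  # this stage failed: end the process
    return True  # speech, skeleton and face all passed: authorized

# Illustrative stand-in recognizers for a registered user "vishnu".
stages = [
    ("speech",   lambda s: s == "vishnu"),
    ("skeleton", lambda s: s == "vishnu"),
    ("face",     lambda s: s == "vishnu"),
]
```

With this structure an impostor who defeats one modality is still rejected by the remaining stages, which is what the experiments in Section 4.2 demonstrate.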
4. Experimental Results
Several experiments were carried out on eight subjects to test the reliability of both the single- and multi-typed sensor data based human identification systems. The results are presented and discussed in this section.
4.1. Single-typed sensor data based human identification system
4.1.1. Speech Recognition System (Microphone)
The efficiency of the speech recognition system was tested with the "measure" and "scan" commands; the results are shown in Table 1. Each test consists of two rounds, and in each round the subject may say the command up to a maximum of 10 attempts. If the command is still not recognized after 10 attempts, the subject fails and the system ends the recognition operation.
Table 1. "Measure" and "Scan" command tests on the speech recognition system (number of attempts before recognition).

Subject    "Measure"              "Scan"
           1st round  2nd round   1st round  2nd round
Osamah     Fail       Fail        1          1
Fiqa       Fail       3           8          7
Izyan      6          2           1          1
Ling       4          3           4          2
Khei       Fail       1           3          2
Raj        1          2           2          1
Teven      2          1           1          1
Puva       3          2           1          1
From Table 1, we can observe that Osamah failed on the "measure" command: as a native Arabic speaker, he had difficulty pronouncing it correctly. Raj, Teven and Puva, by contrast, pronounced "measure" correctly and were recognized easily in both rounds. Osamah, Izyan, Teven and Puva pronounced "scan" easily and were recognized on the very first attempt, while a few subjects such as Fiqa took several attempts because of her very soft, quiet voice. This shows that the subject must speak the command boldly and confidently. The weakness of this system is that any subject who says the right command is recognized as the user. To improve it, the command should be an unpredictable word that is difficult to pronounce and is kept secret by the user.
4.1.2. Skeleton Recognition System (IR Depth sensor)
Several experiments were conducted on eight subjects to test the reliability of the skeleton recognition method. The matching threshold was fixed at 15, and each subject was tested against the skeleton information of two randomly chosen users. The results are shown in Table 2.
Table 2. Testing random user data on random subjects.

Subject   User    Match score  Recognized?   User    Match score  Recognized?
Osamah    Puva    7            No            Khei    10           No
Fiqa      Puva    10           No            Osamah  16           Yes
Izyan     Puva    8            No            Fiqa    17           Yes
Ling      Puva    9            No            Izyan   14           No
Khei      Ling    15           Yes           Fiqa    5            No
Raj       Khei    11           No            Fiqa    15           Yes
Teven     Raj     11           No            Osamah  12           No
Puva      Ling    8            No            Fiqa    12           No
From Table 2, we can observe that a misrecognition occurred when the subject Khei was tested against the user Ling's skeleton data, and further misrecognitions occurred for Fiqa, Izyan and Raj. These happened because the subject and the user had similar height and body build; this is the main weakness of the skeleton-based system.
4.1.3. Face Recognition System (RGB camera)
The efficiency of the face recognition system was determined by testing random users' face template data on random subjects, with each subject tested twice. The results are shown in Table 3.
Table 3. Result of testing random user data on random subjects.

          Experiment 1            Experiment 2
Subject   User     Recognized?    User     Recognized?
Osamah    Puva     No             Khei     Yes
Fiqa      Puva     No             Osamah   No
Izyan     Puva     No             Fiqa     No
Ling      Izyan    Yes            Fiqa     No
Khei      Ling     No             Fiqa     No
Raj       Fiqa     No             Khei     No
Teven     Raj      No             Osamah   No
Puva      Osamah   Yes            Teven    No
From Table 3, we can observe that a misrecognition occurred when the subject Ling was tested with the user Izyan's face template data; likewise, Osamah was misrecognized against Khei's data and Puva against Osamah's. In these cases the user and subject both had fair skin and similar facial features such as eyes, nose and lips. The experiments suggest that 700 is the best matching threshold, as genuine users are recognized easily; nevertheless, misrecognitions still occurred when the subject and the user shared similar facial features.
4.2. Multi-sensor based human identification system
Two experiments were carried out to test the reliability of the multi-sensor human identification system. In the first, eight subjects tried to access the system of the user Vishnu. The user chose "measure" as the voice command password, since it is difficult for others to pronounce but easy for him, and his skeleton information and face template were registered in the system. The results are shown in Table 4.
Table 4. Eight subjects tested against the user's (Vishnu) system.
From Table 4, we can observe that only four subjects (Izyan, Ling, Teven and Puva) passed the first stage (speech recognition). Of those four, only Puva passed the second stage (skeleton recognition), and Puva then failed the final stage (face recognition). In the end, the multi-typed sensor based human identification system could be authorized only by the user, Vishnu. The second experiment used a cross-user method in which five people acted as both users and subjects, testing the system 25 times in total; the results are shown in Table 5. Figure 4 shows an example of the process in which the user Puva is successfully recognized by the multi-sensor data based human identification system.
Subject   User     Stage 1: Speech       Stage 2: Skeleton   Stage 3: Face      Authorised?
                   (Command: Measure)    (Match score: 15)   (Threshold: 700)
Osamah    Vishnu   Failed                X                   X                  X
Fiqa      Vishnu   Failed                X                   X                  X
Izyan     Vishnu   Yes                   Failed              X                  X
Ling      Vishnu   Yes                   Failed              X                  X
Khei      Vishnu   Failed                X                   X                  X
Raj       Vishnu   Failed                X                   X                  X
Teven     Vishnu   Yes                   Failed              X                  X
Puva      Vishnu   Yes                   Yes                 Failed             X
Vishnu    Vishnu   Yes                   Yes                 Yes                Yes
Table 5. Five users/subjects cross-tested the system 25 times.

          Subjects
User      Osamah      Fiqa        Izyan       Ling        Puva
Osamah    Recognised  Not         Not         Not         Not
Fiqa      Not         Recognised  Not         Not         Not
Izyan     Not         Not         Recognised  Not         Not
Ling      Not         Not         Not         Recognised  Not
Puva      Not         Not         Not         Not         Recognised
Figure 4. Puva was recognized by the multi-sensor data based human identification system.
From Table 5, we can observe that the system did not misrecognize any subject. Together, these experiments show that the multi-sensor human identification system is more reliable than the single-sensor systems, since only the correct user was recognized and authorized.
5. Conclusion
A multi-typed sensor human identification algorithm was developed, analyzed and implemented successfully, with a detailed evaluation for real-life scenarios such as efficient, low-cost user authentication. From all the experiments carried out, we can conclude that single-typed sensor data based human identification systems are less reliable than the multi-typed sensor data based system.
The system has several limitations. First, the voice command password must be kept secret. Second, the person to be identified and tracked must keep still for 10 s. Third, for face recognition the person must maintain the same appearance: for example, someone who wore glasses in the template image must also wear glasses during recognition. Finally, because the Kinect relies on infrared, a camera and microphones, the system can only be used indoors, away from direct sunlight and under controlled brightness.
The Kinect can natively detect six people and track two at the same time, but with the Kinesthesia toolkit in LabVIEW only one person can be recognized and tracked, which reduces the practicality of the developed algorithm. As future work, the Kinesthesia toolkit could be extended to recognize all six detectable people and track two, enhancing the practicality of the system.
6. Acknowledgment
The authors would like to acknowledge the support of the Fundamental Research Grant Scheme (FRGS) under grant number FRGS/1/2018/TK04/UNIMAP/02/15 from the Ministry of Education Malaysia.
REFERENCES
[1] Bhattacharyya D, Ranjan R, Alisherov F and Choi M 2009 Biometric authentication: a review Int. J. of u- and e-Service, Sci. Technol. 2 13-28
[2] Bilal D K 2013 Human Recognition, Identification and Tracking using Microsoft Kinect Interfaced with DaNI Robot
[3] Tashev I 2013 Kinect development kit: a toolkit for gesture- and speech-based human-machine interaction [Best of the Web] IEEE Signal Process. Mag. 30 129-31
[4] Kumar M S N and Babu R V 2013 Human gait recognition using depth camera 1-6
[5] Kandi K K 2013 Face Recognition and Registration for Home Surveillance System Using Iterative Closest Point and Haar Cascade Algorithm M.Sc. thesis, University of Houston-Clear Lake
[6] Anon 2013 Person Authentication Using Face and Voice 719
[7] Rahman M W, Zohra F T and Gavrilova M L 2018 Rank level fusion for Kinect gait and face biometric identification 2017 IEEE Symp. Ser. on Computational Intelligence (SSCI 2017) 1-7
[8] Kita E, Zuo Y, Saito F and Feng X 2017 Personal identification with face and voice features extracted through Kinect sensor IEEE Int. Conf. on Data Mining Workshops (ICDMW) 545-51
[9] Galatas G, Potamianos G and Makedon F 2012 Audio-visual speech recognition incorporating facial depth information captured by the Kinect 2714-7
[10] Shu J P 2013 Autonomous Voice and Motion Controlled Video Camera System for Instructional Technology
[11] Rathnayake K A S V, Diddeniya S I A P, Wanniarachchi W K I L, Nanayakkara W H K P and Gunasinghe H N 2017 Voice operated home automation system based on Kinect sensor 2016 IEEE Int. Conf. on Information and Automation for Sustainability (ICIAfS 2016) 1-5
[12] Rami M, Svitlana M, Lyashenko V and Belova N 2017 Speech recognition systems: a comparative review IOSR J. Comput. Eng. 19 71-9
[13] Ertugrul N 2000 with LabVIEW TM Simulation
[14] Kadiwal S 2017 (thesis)
[15] Lip C J, Yeon A S A, Kamarudin L M, Kamarudin K, Visvanathan R, Zaidi A F A, Mamduh S M, Zakaria A and Nooriman W M 2018 Human 3D reconstruction and identification using Kinect sensor 2018 Int. Conf. on Computational Approach in Smart Systems Design and Applications (ICASSDA) 1-7
[16] Brunelli R and Poggio T 1993 Face recognition: features versus templates IEEE Trans. Pattern Anal. Mach. Intell. 15 1042-52
[17] Adiloğlu S 2016 Heavy Metal Removal with Phytoremediation (IntechOpen)