A Framework for Recognizing and Estimating Human Concentration Levels

A PREPRINT
Woodo Lee
Department of Physics
Korea University
Seoul, Republic of Korea, 02481
woodolee@korea.ac.kr
Jakyung Koo
Department of Computer Science and Engineering
Korea University
Seoul, Republic of Korea, 02481
lawkelvin33@korea.ac.kr
Nokyung Park
Department of Computer Science and Engineering
Korea University
Seoul, Republic of Korea, 02481
noparkee@korea.ac.kr
Pilgu Kang
Department of Computer Science and Engineering
Korea University
Seoul, Republic of Korea, 02481
rkd903@korea.ac.kr
Jeakwon Shim
Department of Computer Science and Engineering
Korea University
Seoul, Republic of Korea, 02481
jaekwoun.shim@gmail.com
April 26, 2021
ABSTRACT
One of the major tasks in online education is to estimate the concentration level of each student.
Previous studies are limited in that they classify the levels using discrete states only. The purpose of
this paper is to estimate subtle levels as specified states using a minimal amount of body
movement data. This is done by a framework composed of a Deep Neural Network and a Kalman
Filter. Using this framework, we successfully extracted the concentration levels, which can be used
to aid lecturers and can be extended to other areas.
Keywords
Computer vision, Data mining, Education, Educational programs, Human computer interaction, Signal analysis.
1 Introduction
There have been many attempts to measure students’ concentration levels using various methods, such as taking skin
temperature [1], recognizing visual attention, and detecting electroencephalogram (EEG) signals [2-4]. However,
these attempts lack detail because they classify the concentration levels into discrete states. Here, we develop a new
framework, DNN-K, named after its architecture of a Deep Neural Network (DNN) and a Kalman Filter (KF) [5], to overcome
this limitation. DNN-K defines the concentration level as the probability of "high concentration," which is derived
from the DNN, and suggests that the standard deviations of designated points are a core factor in measuring this
probability. KF is also incorporated into DNN-K to estimate the concentration levels over the course of the measurement.
Corresponding author.
arXiv:2104.11421v1 [cs.LG] 23 Apr 2021
A PREPRINT - APRIL 26, 2021
Figure 1: The points measured by OpenPose are shown. The upper and middle body are measured with ten
points.
2 Motivation
The motivation of this paper is that a person’s body movements can be a factor in recognizing his or her condition. We
propose that the standard deviations of designated points are a core factor in measuring the concentration levels. To
measure the points, we used OpenPose [6-9], as shown in Fig. 1; the coordinates of the points range
from 0 to 1. We then check the distributions of the points when people are concentrating and when they are not, as
shown in Fig. 2. The distribution in (a) shows that the entries are gathered around the body points, while that in (b)
is more widely spread. The difference is visible, but the histograms alone do not quantify it well enough to recognize
the concentration levels. We therefore analyze the distributions using DNN-K; the details of the method are discussed
in the following sections.
3 Methods
3.1 Overview
Figure 3 shows the overview of DNN-K. In the first step, video data of the student taken by a camera is
pre-processed. The pre-processed data is labeled with one of two states, high or low concentration, based on the
students’ intent. The labeled data is used to train a DNN model in the recognition step. The trained DNN model
recognizes continuous concentration levels, which are defined as the recognition levels ($S_r$). The concentration
levels are then estimated by KF in the estimation step; these are called the estimation levels ($S_e$). Finally,
lecturers can give feedback to students based on $S_r$ and $S_e$.
3.2 Step 1. Data pre-processing
The first step of DNN-K is to extract the standard deviations from video data. Ten points of the human body are
measured and divided into the top part (points 0-4) and the middle part (points 5-9). Then, for every window of 50
frames, the standard deviations of the X and Y coordinates are calculated for the top and the middle parts,
respectively. Note that we assume the standard deviations of the points are the core factor in measuring the
concentration levels. Table 1 shows the notation for the results of the pre-processing step.
Algorithm 1 shows the entire process of the pre-processing step. Through this algorithm, the standard deviations are
obtained and become the input data for the DNN, discussed in the next section.
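The windowing described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes each coordinate track (for example top.X) has already been reduced to one value per frame, uses the population standard deviation, and drops a trailing incomplete window. None of these choices is stated in the paper, and the names `windowed_std` and `preprocess` are hypothetical.

```python
from statistics import pstdev

WINDOW = 50  # frames per window, following Algorithm 1


def windowed_std(values, window=WINDOW):
    """Return one standard deviation per complete window of `window` samples.

    Assumption: population s.d. (pstdev); the paper does not say which
    estimator is used. A trailing partial window is ignored.
    """
    return [
        pstdev(values[start:start + window])
        for start in range(0, len(values) - window + 1, window)
    ]


def preprocess(tracks):
    """Map each coordinate track name to its list of windowed s.d. values.

    `tracks` is a dict such as {"top.X": [...], "top.Y": [...], ...} where
    every list holds one normalised coordinate (0 to 1) per frame.
    """
    return {name: windowed_std(series) for name, series in tracks.items()}


# Hypothetical input: 50 still frames followed by 50 frames of jitter.
example = preprocess({"top.X": [0.5] * 50 + [0.4, 0.6] * 25})
```

Each 50-frame window yields one value per track, so four tracks give the four-dimensional feature vector that is passed to the DNN.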
Figure 4 shows the difference in the standard deviations between the groups. Nevertheless, there remain unexplained
aspects, such as ambiguous patterns whose correlation with the concentration levels is not clear. To this end,
Figure 2: (a) The 2D histogram in the case of high-concentration is shown. (b) The 2D histogram in the case of
low-concentration is shown.
Figure 3: The overview of our research is shown. DNN-K consists of several packages to recognize and estimate the
levels.
Columns             Description
$\sigma^X_{Top}$    The standard deviation of the top part’s X coordinate
$\sigma^Y_{Top}$    The standard deviation of the top part’s Y coordinate
$\sigma^X_{Mid}$    The standard deviation of the middle part’s X coordinate
$\sigma^Y_{Mid}$    The standard deviation of the middle part’s Y coordinate
Table 1: The definition of the data and its description are shown.
a DNN is applied, as it is an appropriate method for obtaining nonlinear combinations of
features. This allows us to uncover hidden features that we could not otherwise identify.
3.3 Step 2. Recognition
Algorithm 2 shows the overall structure of the recognition step. Through our DNN, the recognition levels ($S_r$) are
obtained. The DNN consists of four layers. ReLU is used as the activation function in the first and second hidden
layers. Sigmoid is used at the output layer so that the output lies between 0 and 1 and can be read as a probability.
The Adaptive Moment Estimation (Adam) optimizer [10] is applied, and the initial learning rate is set to 0.1% (i.e.,
0.001), which is the optimal value for the DNN. The loss function ($L$) is defined as

$$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right], \qquad (1)$$

where $N$ is the number of labels, which is two in this case, $y_i$ is the label of the data, and $\hat{y}_i$ is the
predicted probability for the data labeled $y_i$. As the output value is a probability for binary classification, we
use binary cross-entropy to calculate the loss for every epoch during training.
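To make the layer sizes and Eq. (1) concrete, the following is a small pure-Python sketch of a forward pass through a 4-8-8-1 network, together with the binary cross-entropy. It is an illustration under stated assumptions, not the trained model: the weights are random and untrained, the helper names are hypothetical, training with Adam is omitted, and the loss is averaged over the examples passed to the function.

```python
import math
import random


def relu(vector):
    return [max(0.0, x) for x in vector]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(weights, biases, inputs):
    """Affine layer: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Random, untrained parameters (placeholder for trained weights)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(params, features):
    """4 inputs -> 8 (ReLU) -> 8 (ReLU) -> 1 (sigmoid), as in Algorithm 2."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h1 = relu(dense(w1, b1, features))
    h2 = relu(dense(w2, b2, h1))
    return sigmoid(dense(w3, b3, h2)[0])


def binary_cross_entropy(labels, probs, eps=1e-12):
    """Eq. (1), with probabilities clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(labels)


rng = random.Random(0)
params = [init_layer(4, 8, rng), init_layer(8, 8, rng), init_layer(8, 1, rng)]
# Hypothetical feature vector: the four windowed standard deviations.
s_r = forward(params, [0.01, 0.02, 0.03, 0.01])
```

The sigmoid output `s_r` plays the role of a recognition level: a value between 0 and 1 read as the probability of "high concentration".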
K-fold cross-validation is applied to compensate for the limited amount of data. The accuracy of 5-fold training
ranges from 85% to 95%, with a median of 90.62%.
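As a reminder of how a 5-fold split partitions a small data set, the sketch below generates the training and validation indices for each fold. It is a generic illustration, not the authors' code: the function name is hypothetical, the folds are contiguous and unshuffled, and the sample count in the example is arbitrary.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once.

    Assumption: contiguous, unshuffled folds. The first `n_samples % k`
    folds receive one extra sample.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


# Hypothetical data set of 32 windows split into 5 folds.
folds = list(k_fold_indices(32, k=5))
```

A model is trained once per fold on `train` and scored on `val`, which gives five accuracy values from which a range and a median can be reported.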
3.4 Step 3. Estimation
Algorithm 1: Data Pre-Processing Step of DNN-K
Input: top.X, top.Y, mid.X, mid.Y
for each Data in {top.X, top.Y, mid.X, mid.Y} do
    for $D_i = \bigcup_{50i}^{50(i+1)} \mathrm{Data}$ do
        $\sigma_{Data}$.append(s.d.($D_i$))
    end for
end for
Output: $\sigma^X_{Top}$, $\sigma^Y_{Top}$, $\sigma^X_{Mid}$, $\sigma^Y_{Mid}$
Figure 4: (a) - (d) show the distribution of the standard deviation for each part, respectively. The red histograms
show the high concentration case, and the blue histograms show the low concentration case.
Algorithm 2: Recognition Step of DNN-K
Input: $\sigma^X_{Top}$, $\sigma^Y_{Top}$, $\sigma^X_{Mid}$, $\sigma^Y_{Mid}$
Input layer: $\mathbb{R}^4$
1st hidden layer: $\mathbb{R}^8$ (activation function: ReLU)
2nd hidden layer: $\mathbb{R}^8$ (activation function: ReLU)
Output layer: $\mathbb{R}^1$ (activation function: Sigmoid)
Output: Concentration Levels $s_t \in S$
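Algorithm 3: Estimation Step of DNN-K
Input: $s_t \in S$
for all $s_t \in S$ do
    $x^{est}_t \leftarrow s_t$
    /* Predicting */
    $x^{pre}_{t+1} = A \cdot x^{est}_t$
    $P^{pre}_t = A \cdot P_t \cdot A^T + Q$
    /* Updating */
    $K = P^{pre}_t \cdot H / (H \cdot P^{pre}_t \cdot H + R)$
    /* Estimating */
    $x^{est}_{t+1} = x^{pre}_{t+1} + K \cdot (m_t - H \cdot x^{pre}_{t+1})$
    $P_{t+1} = P^{pre}_t - K \cdot H \cdot P^{pre}_t$
end for
/* Analyzing */
Fitting the distribution of $x^{est}_{t+1}$ by lecturers’ defined function
Output: Concentration Levels ($\Psi$)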
Figure 5: Estimation and measurement levels are shown. The black dots ($S_r$) and the green dots ($S_e$) show a low
concentration case. The blue dots ($S_r$) and the red dots ($S_e$) show a high concentration case.
The estimation step of DNN-K uses KF to obtain $S_e$. Algorithm 3 shows the overall process. The algorithm involves
three states, the prediction state ($x^{pre}_t$), the estimation state ($x^{est}_t$), and the measurement state
($m_t$), together with the error covariance matrix ($P_t$) and the transition weight matrix ($A$). The students’ state
starts from $x^{est}_0 = 0.5$, because the students’ concentration level is assumed to be 50% at the beginning.
$P_0 = 0.9$ is the system error, which comes from the DNN discussed in the previous section.
In the predicting step, $A$ is a transition matrix and $Q$ is an external noise matrix, which can be modified by
teachers. $A$ is set to 1 and $Q$ is set to 0 as an ideal case, in which the students maintain their concentration
levels and there are no external disturbances during the lecture. In the updating step, the Kalman gain ($K$) is
obtained at every step. $H$ is a scale matrix, which is set to 1 to simplify the problem. $x^{est}_{t+1}$ and
$P_{t+1}$ are updated with $K$. Finally, in the estimating step, the next estimated state $x^{est}_{t+1}$ is
recursively updated.
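The predict, update, and estimate steps can be sketched as a scalar filter in Python. This is a hedged illustration rather than the authors' code: it treats each recognition level as the measurement $m_t$, uses the stated values $x^{est}_0 = 0.5$, $P_0 = 0.9$, $A = 1$, $Q = 0$, and $H = 1$, and picks a measurement-noise value `r = 0.1` purely as a placeholder, since the paper does not report $R$. The function name `kalman_estimate` is hypothetical.

```python
def kalman_estimate(measurements, x0=0.5, p0=0.9, a=1.0, q=0.0, h=1.0, r=0.1):
    """Scalar Kalman filter over a sequence of recognition levels.

    Assumptions: `r` (measurement noise) is a placeholder value, and the
    recognition levels are used directly as the measurements m_t.
    Returns the list of estimated levels, one per measurement.
    """
    x_est, p = x0, p0
    estimates = []
    for m in measurements:
        # Predicting
        x_pre = a * x_est
        p_pre = a * p * a + q
        # Updating: Kalman gain
        k = p_pre * h / (h * p_pre * h + r)
        # Estimating
        x_est = x_pre + k * (m - h * x_pre)
        p = p_pre - k * h * p_pre
        estimates.append(x_est)
    return estimates


# Hypothetical, strongly fluctuating recognition levels.
noisy_levels = [0.9, 0.1] * 10
smoothed = kalman_estimate(noisy_levels)
```

With $Q = 0$ the error covariance shrinks at every step, so the gain decreases and later measurements move the estimate less and less. This is the smoothing behaviour the estimation step relies on; a nonzero $Q$ would keep the filter more responsive to changes.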
Figure 5 shows the estimation and measurement results every 2.5 seconds. Even though the measurement fluctuates
widely every 2.5 seconds, KF enables users to track the levels smoothly, as shown by the green and the red dots.
Figure 6 shows the histogram of the green dots. In this case, the distribution consists of two dominant modes, which
can be analyzed by fitting a function as follows. Each mode is described by a Gaussian function $N_k$, and the
bimodal distribution $X$ is written as their sum:

$$N_k(\mu_k, \sigma_k^2, A_k) = A_k \exp\left(-\frac{(x - \mu_k)^2}{2\sigma_k^2}\right) \qquad (2)$$

$$X = N_1(\mu_1, \sigma_1^2, A_1) + N_2(\mu_2, \sigma_2^2, A_2) \qquad (3)$$

where $\sigma_1$ and $\sigma_2$ are the standard deviations, $\mu_1$ and $\mu_2$ are the mean values, $A_1$ and $A_2$
are the amplitudes, and $x$ is the input data.
Figure 6 shows the distribution of $\Psi$ in the low concentration case ($\Psi_{low}$). In this distribution, $\mu_1$
and $\mu_2$ are obtained as 0.09 and 0.16, respectively.
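For illustration, the two-mode function of Eqs. (2) and (3) can be evaluated as follows. Only the two means (0.09 and 0.16) come from the text; the widths and amplitudes below are hypothetical placeholders, because the fitted values are not reported, and the function names are illustrative.

```python
import math


def gaussian_mode(x, mu, sigma, amplitude):
    """Single mode, Eq. (2): amplitude * exp(-(x - mu)^2 / (2 sigma^2))."""
    return amplitude * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def bimodal(x, mu1, sigma1, a1, mu2, sigma2, a2):
    """Sum of two modes, Eq. (3)."""
    return gaussian_mode(x, mu1, sigma1, a1) + gaussian_mode(x, mu2, sigma2, a2)


MU1, MU2 = 0.09, 0.16    # means reported for the low concentration case
SIGMA1 = SIGMA2 = 0.01   # hypothetical widths (not given in the paper)
A1 = A2 = 1.0            # hypothetical amplitudes (not given in the paper)

# Evaluate the curve on a grid of estimated levels from 0.00 to 0.25.
curve = [(i / 100, bimodal(i / 100, MU1, SIGMA1, A1, MU2, SIGMA2, A2))
         for i in range(26)]
```

In practice the six parameters would be obtained by fitting this function to the histogram of estimated levels; the paper does not state which fitting routine was used.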
4 Conclusion and future work
We devised a model to aid lecturers in estimating students’ concentration levels using cameras in online classes. Our
system presents the level every 2.5 seconds with an accuracy of 90.62% and estimates the next level of concentration by
using KF. In contrast to previous research, such as work using VGG16 [11], our model takes a different approach to
Figure 6: The fitting result in the low-concentration case is shown. The two separated states can be described by Eq. (2).
quantify the levels by capturing the variance of detected points on humans in the current state. Additionally, we
estimate and track the level for the next time window. In practical terms, our model offers a tool to monitor the level
more precisely and to aid lecturers in estimating it. Academically, our model offers a novel approach to analyzing
complex human states and the concentration level.
In future work, we plan to use not only body movement data but also emotion data [12] and skin thermal data [1, 13]
to improve the prediction of human concentration levels. The measuring method used in this paper and the
conventional measuring techniques will be combined and processed using deep learning. This work is expected to provide
useful information on students’ concentration levels and thus assist lecturers.
References
[1] Nomura, S., Hasegawa-Ohira, M., Kurosawa, Y., Hanasaka, Y., Yajima, K., & Fukumura, Y. (2012). Skin
temperature as a possible indicator of student’s involvement in e-learning sessions. International Journal of
Electronic Commerce Studies, 3(1), 101-110.
[2] Al-Musawi, A. S. (2018). Concentration level monitoring in education and healthcare. Basic and Clinical
Pharmacology and Toxicology, 124(s2), 36.
[3] Marouane, S., Najlaa, S., Abderrahim, T., & Eddine, E. K. (2015). Towards measuring learner’s concentration in
E-learning systems. International Journal of Computer Techniques, 2(5), 27-29.
[4] Liu, N. H., Chiang, C. Y., & Chu, H. C. (2013). Recognizing the degree of human attention using EEG signals
from mobile sensors. Sensors, 13(8), 10273-10286.
[5] Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. ASME Journal of Basic
Engineering, 82(1), 35-45.
[6] Cao, Z., Hidalgo, G., Simon, T., Wei, S. E., & Sheikh, Y. (2019). OpenPose: Realtime multi-person 2D pose
estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(1),
172-186.
[7] Simon, T., Joo, H., Matthews, I., & Sheikh, Y. (2017). Hand keypoint detection in single images using multiview
bootstrapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1145-1153).
[8] Cao, Z., Simon, T., Wei, S. E., & Sheikh, Y. (2017). Realtime multi-person 2D pose estimation using part affinity
fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7291-7299).
[9] Mohammadpour, M., Khaliliardali, H., Hashemi, S. M. R., & AlyanNezhadi, M. M. (2017). Facial emotion recognition
using deep convolutional networks. In 2017 IEEE 4th International Conference on Knowledge-Based Engineering
and Innovation (KBEI), Tehran (pp. 0017-0021). doi: 10.1109/KBEI.2017.8324974.
[10] Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[11] Ruvinga, C., Malathi, D., & Dorathi Jayaseeli, J. D. (2020). Human concentration level recognition based on VGG16
CNN architecture. International Journal of Advanced Science and Technology, 29(6s), 1364-1373. Retrieved
from http://sersc.org/journals/index.php/IJAST/article/view/9271
[12] Wei, S. E., Ramakrishna, V., Kanade, T., & Sheikh, Y. (2016). Convolutional pose machines. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4724-4732).
[13] Nummenmaa, L., Glerean, E., Hari, R., & Hietanen, J. K. (2014). Bodily maps of emotions. PNAS,
111(2), 646-651.