Available via license: CC BY 4.0
A FRAMEWORK FOR RECOGNIZING AND ESTIMATING HUMAN
CONCENTRATION LEVELS
A PREPRINT
Woodo Lee
Department of Physics
Korea University
Seoul, Republic of Korea, 02481
woodolee@korea.ac.kr
Jakyung Koo
Department of Computer Science and Engineering
Korea University
Seoul, Republic of Korea, 02481
lawkelvin33@korea.ac.kr
Nokyung Park
Department of Computer Science and Engineering
Korea University
Seoul, Republic of Korea, 02481
noparkee@korea.ac.kr
Pilgu Kang
Department of Computer Science and Engineering
Korea University
Seoul, Republic of Korea, 02481
rkd903@korea.ac.kr
Jeakwon Shim ∗
Department of Computer Science and Engineering
Korea University
Seoul, Republic of Korea, 02481
jaekwoun.shim@gmail.com
April 26, 2021
ABSTRACT
One of the major tasks in online education is to estimate the concentration level of each student.
Previous studies are limited in that they classify the levels using discrete states only. The purpose of
this paper is to estimate subtle levels as specified states by using a minimal amount of body
movement data. This is done by a framework composed of a Deep Neural Network and a Kalman
Filter. Using this framework, we successfully extracted the concentration levels, which can be used
to aid lecturers and can be extended to other areas.
Keywords
Computer vision, Data mining, Education, Educational programs, Human computer interaction, Signal analysis.
1 Introduction
There have been many attempts to measure students’ concentration levels using various methods, such as taking skin
temperature [1], recognizing visual attention, and detecting electroencephalogram (EEG) signals [2–4]. However,
these attempts lack detail because they classify concentration levels into discrete states. Here, we develop a new
framework, DNN-K, named after its architecture of a Deep Neural Network (DNN) and a Kalman Filter (KF) [5], to overcome
this limitation. DNN-K defines the concentration level as the probability of "high concentration," which is derived
from the DNN, and suggests that the standard deviations of designated body points are a core factor in measuring this
probability. A KF is also incorporated into DNN-K to estimate the concentration levels in addition to measuring them.
∗Corresponding author.
arXiv:2104.11421v1 [cs.LG] 23 Apr 2021
A PREPRINT - APRIL 26, 2021
Figure 1: The points measured by OpenPose are shown. Ten points in total are measured on the upper and middle
body.
2 Motivation
The motivation of this paper is the fact that a person’s body movements can be a factor in recognizing his or her condition. We
propose that the standard deviations of designated points are a core factor in measuring the concentration levels. To
measure the points, we used OpenPose [6–9], as shown in Fig. 1, with the coordinates of the points ranging
from 0 to 1. We then check the distributions of the points when people are concentrating and when they are not, as shown in Fig.
2. The distribution in (a) shows that the entries are gathered around the body points, while the one in (b) is more widely spread. The
difference is visible, but it cannot be quantified directly to recognize the concentration levels. We analyze the distributions
by using DNN-K, and the details of the method are discussed in the following sections.
3 Methods
3.1 Overview
Figure 3 shows the overview of DNN-K. In the first step of DNN-K, the student’s video data taken by a camera is
pre-processed. The pre-processed data is labeled with two states, high and low concentration, based on the students’
intent. The labeled data is used for training a DNN model in the recognition step. The trained DNN model recognizes
continuous concentration levels, which are defined as the recognition levels ($S_r$). The concentration levels are then estimated
by KF in the estimation step, and these are called the estimation levels ($S_e$). At the end, lecturers can give feedback to
students based on $S_r$ and $S_e$.
3.2 Step 1. Data pre-processing
The first step of DNN-K is to extract the standard deviations from video data. Ten points of the human body are
measured and grouped into windows of 50 frames; the points are classified as the top part (0 - 4) and the middle part (5 - 9). Then, the standard
deviations of the X and Y coordinates are calculated for the top and the middle parts, respectively. Note that we assume
the standard deviations of the points are the core factor in measuring the concentration levels. Table 1 shows the
notations of the results in the pre-processing step.
Algorithm 1 shows the entire process of the pre-processing step. Through this algorithm, the standard deviations are
obtained and become the input data for the DNN, discussed in the next section.
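The windowed computation can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and it assumes that each coordinate series is a list of frames holding the coordinates of the five points of one part, that all values in a 50-frame window are pooled, that a trailing partial window is dropped, and that the population standard deviation is used.

```python
from statistics import pstdev

WINDOW = 50  # frames per window, as stated in the pre-processing step


def windowed_std(frames, window=WINDOW):
    """Return one standard deviation per full, non-overlapping window.

    frames: list of per-frame coordinate lists, e.g. the X values of
    points 0-4 (assumed layout; the paper does not specify it).
    """
    n_windows = len(frames) // window  # assumption: drop a trailing partial window
    result = []
    for i in range(n_windows):
        chunk = frames[i * window:(i + 1) * window]
        pooled = [value for frame in chunk for value in frame]
        result.append(pstdev(pooled))  # assumption: population s.d.
    return result


def preprocess(tracks):
    """Map each series name (e.g. 'top.X') to its list of windowed s.d. values."""
    return {name: windowed_std(frames) for name, frames in tracks.items()}
```

Calling `preprocess` with the four series `top.X`, `top.Y`, `mid.X`, and `mid.Y` would give one four-value feature vector per window.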
Figure 4 shows the differences in the standard deviations between the groups. Nevertheless, there remain unexplained
aspects, such as ambiguous patterns whose correlation with the concentration levels is not clear. To this end,
Figure 2: (a) The 2D histogram in the case of high-concentration is shown. (b) The 2D histogram in the case of
low-concentration is shown.
Figure 3: The overview of our research is shown. DNN-K consists of several packages to recognize and estimate the
levels.
Column              Description
$\sigma^X_{Top}$    The standard deviation of the top part’s X coordinate
$\sigma^Y_{Top}$    The standard deviation of the top part’s Y coordinate
$\sigma^X_{Mid}$    The standard deviation of the middle part’s X coordinate
$\sigma^Y_{Mid}$    The standard deviation of the middle part’s Y coordinate
Table 1: The definition of the data and its description are shown.
a DNN is applied to address these problems, as a DNN is an appropriate method for obtaining nonlinear combinations of
features. This allows us to unveil hidden features that we cannot readily identify ourselves.
3.3 Step 2. Recognition
Algorithm 2 shows the overall structure of the recognition step. Through our DNN, the recognition levels ($S_r$) are
obtained. The DNN consists of four layers. ReLU is used as the activation function in the first and
second hidden layers. Sigmoid is used in the output layer so that the output is a probability between 0 and 1. The ADAptive
Moment estimation (ADAM) optimizer [10] is applied, and the initial learning rate is set to 0.1%, which is the optimal
value for the DNN. The loss function ($L$) is defined as
$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right], \tag{1}$$
where $N$ is the number of labels, which is two in this case, $y_i$ is the label of the data, and $\hat{y}_i$ is the
predicted probability for the data labeled $y_i$. As the output value is a probability for binary classification, we use binary
cross-entropy to calculate the loss at every epoch during training.
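As a worked illustration of Eq. (1), the sketch below evaluates the binary cross-entropy for lists of labels and predicted probabilities. The function name and the clipping constant `eps` are assumptions added for numerical safety; they are not part of the paper.

```python
import math


def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross-entropy over paired labels and predicted probabilities.

    eps is an assumed clipping constant that keeps log() away from zero.
    """
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


# A confident correct prediction gives a smaller loss than a confident wrong one.
loss_good = binary_cross_entropy([1, 0], [0.9, 0.1])
loss_bad = binary_cross_entropy([1, 0], [0.1, 0.9])
```

The loss approaches zero as each predicted probability approaches its label and grows without bound as it approaches the opposite label.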
K-fold cross-validation is applied to compensate for the limited amount of data. The accuracy of 5-fold training ranges from 85% to 95%, with a
median of 90.62%.
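The splitting behind 5-fold training can be sketched as below. This is an illustrative sketch only: the function name, the shuffling, and the seed are assumptions, and the paper does not state how its folds were formed.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, validation) index lists for k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumption: samples are shuffled once
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, validation
```

Each sample is used for validation exactly once and for training in the other four folds, so one accuracy value is obtained per fold.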
3.4 Step 3. Estimation
Algorithm 1: Data Pre-Processing Step of DNN-K
Input: top.X, top.Y, mid.X, mid.Y
for each Data ∈ {top.X, top.Y, mid.X, mid.Y} do
    for $D_i = \bigcup_{50i}^{50(i+1)} \mathrm{Data}$ do
        $\sigma_{Data}$.append(s.d.($D_i$))
    end for
end for
Output: $\sigma^X_{Top}$, $\sigma^Y_{Top}$, $\sigma^X_{Mid}$, $\sigma^Y_{Mid}$
Figure 4: (a) - (d) show the distribution of the standard deviation for each part, respectively. The red histogram shows the high
concentration case, and the blue histogram shows the low concentration case.
Algorithm 2: Recognition Step of DNN-K
Input: $\sigma^X_{Top}$, $\sigma^Y_{Top}$, $\sigma^X_{Mid}$, $\sigma^Y_{Mid}$
Input layer: $\in \mathbb{R}^4$
1st hidden layer: $\in \mathbb{R}^8$ (activation function: ReLU)
2nd hidden layer: $\in \mathbb{R}^8$ (activation function: ReLU)
Output layer: $\in \mathbb{R}^1$ (activation function: Sigmoid)
Output: Concentration Levels $s_t \in S$
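The layer structure of Algorithm 2 can be sketched as a forward pass in plain Python. This is a minimal illustration under stated assumptions: the weights are random placeholders rather than trained values, the helper names are hypothetical, and training with ADAM is not shown.

```python
import math
import random


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Random placeholder parameters; a trained model would replace these."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(features, params):
    """features: the four standard deviations; returns a value in (0, 1)."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h1 = relu(dense(features, w1, b1))   # 4 -> 8, ReLU
    h2 = relu(dense(h1, w2, b2))         # 8 -> 8, ReLU
    return sigmoid(dense(h2, w3, b3)[0])  # 8 -> 1, Sigmoid


rng = random.Random(0)
params = [init_layer(4, 8, rng), init_layer(8, 8, rng), init_layer(8, 1, rng)]
s_r = forward([0.01, 0.02, 0.03, 0.01], params)  # hypothetical input values
```

The single sigmoid output plays the role of the recognition level for one 50-frame window.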
Algorithm 3: Estimation Step of DNN-K
Input: $s_t \in S$
for all $s_t \in S$ do
    $x^{est}_t \leftarrow s_t$
    /* Predicting */
    $x^{pre}_{t+1} = A \cdot x^{est}_t$
    $P^{pre}_t = A \cdot P_t \cdot A^T + Q$
    /* Updating */
    $K = P^{pre}_t \cdot H / (H \cdot P^{pre}_t \cdot H + R)$
    /* Estimating */
    $x^{est}_{t+1} = x^{pre}_{t+1} + K \cdot (m_t - H \cdot x^{pre}_{t+1})$
    $P_{t+1} = P^{pre}_t - K \cdot H \cdot P^{pre}_t$
end for
/* Analyzing */
Fitting the distribution of $x^{est}_{t+1}$ by lecturers’ defined function
Output: Concentration Levels ($\Psi$)
Figure 5: Estimation and measurement levels are shown. The black dots ($S_r$) and the green dots ($S_e$) show a low
concentration case. The blue dots ($S_r$) and the red dots ($S_e$) show a high concentration case.
The estimation step of DNN-K uses KF to obtain $S_e$. Algorithm 3 shows the overall process. The algorithm involves three states,
the prediction state ($x^{pre}_t$), the estimation state ($x^{est}_t$), and the measurement state ($m_t$), together with the error
covariance matrix ($P_t$) and the transition weight matrix ($A$). The students’ state starts from $x^{est}_0 = 0.5$, because the
students’ concentration level is assumed to be 50% at the beginning. $P_0 = 0.9$ is the system error, which comes from the
DNN discussed in the previous section.
In the predicting step, $A$ is a transition matrix and $Q$ is an external noise matrix, which can be modified by teachers.
$A$ is set to 1 and $Q$ is set to 0 as an ideal case, in which the students maintain their concentration levels and there are no
external disturbances while they take lectures. In the updating step, the Kalman gain ($K$) is obtained at every step.
$H$ is a scale matrix, which is set to 1 to simplify the problem. $x^{est}_{t+1}$ and $P_{t+1}$ are updated with $K$. Finally, in the
estimating step, the next estimated state $x^{est}_{t+1}$ is recurrently updated.
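The predict, update, and estimate cycle can be sketched as a scalar Kalman filter. This is a hedged sketch, not the authors' code: it uses the stated values $x^{est}_0 = 0.5$, $P_0 = 0.9$, $A = 1$, $Q = 0$, and $H = 1$, treats each DNN output as the measurement $m_t$ (an assumption about how Algorithm 3 is applied), and picks an arbitrary measurement noise `r`, since the paper does not give a value for $R$.

```python
def kalman_track(measurements, x0=0.5, p0=0.9, a=1.0, q=0.0, h=1.0, r=0.1):
    """Scalar Kalman filter over a sequence of measured levels.

    r is an assumed measurement-noise value; the other defaults follow the text.
    """
    x_est, p = x0, p0
    estimates = []
    for m in measurements:
        # Predicting
        x_pre = a * x_est
        p_pre = a * p * a + q
        # Updating: Kalman gain
        k = p_pre * h / (h * p_pre * h + r)
        # Estimating
        x_est = x_pre + k * (m - h * x_pre)
        p = p_pre - k * h * p_pre
        estimates.append(x_est)
    return estimates


# Hypothetical noisy recognition levels, one per 2.5-second window.
s_r = [0.2, 0.8, 0.3, 0.7, 0.25, 0.75]
s_e = kalman_track(s_r)
```

With $Q = 0$ the gain shrinks at every step, so later measurements move the estimate less and the tracked level becomes smoother over time.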
Figure 5 shows the estimation and measurement results every 2.5 seconds. Even though the measurement fluctuates
widely every 2.5 seconds, KF enables users to track the levels smoothly, as shown by the green and the red dots.
Figure 6 shows the histogram of the green dots. In this case, the distribution consists of two dominant modes, which
can be analyzed with a fitting function as follows. Each mode is described by a Gaussian function $N_k$, and the
bimodal distribution $X$ is written as
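$$N_k(\mu_k, \sigma_k^2, A_k) = A_k \, e^{-\frac{(x - \mu_k)^2}{2\sigma_k^2}} \tag{2}$$
$$X = N_1(\mu_1, \sigma_1^2, A_1) + N_2(\mu_2, \sigma_2^2, A_2) \tag{3}$$
where $\sigma_1$ and $\sigma_2$ are the standard deviations, $\mu_1$ and $\mu_2$ are the mean values, $A_1$ and $A_2$ are the amplitudes, and $x$ is the input data.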
Figure 6 shows the distribution of $\Psi$ in the low concentration case ($\Psi_{low}$). In this distribution, $\mu_1$ and $\mu_2$ are obtained as 0.09 and 0.16,
respectively.
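The fitting function of Eqs. (2) and (3) can be written out as a short sketch. The two means are the values reported above; the widths and amplitudes are placeholder assumptions, since the paper does not report them, and no fitting routine is shown.

```python
import math


def gaussian_mode(x, mu, sigma, amplitude):
    """Single mode, Eq. (2): amplitude * exp(-(x - mu)^2 / (2 * sigma^2))."""
    return amplitude * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def bimodal(x, mu1, sigma1, a1, mu2, sigma2, a2):
    """Sum of two modes, Eq. (3)."""
    return gaussian_mode(x, mu1, sigma1, a1) + gaussian_mode(x, mu2, sigma2, a2)


# Reported means; sigma and amplitude values are placeholders for illustration.
params = dict(mu1=0.09, sigma1=0.01, a1=1.0, mu2=0.16, sigma2=0.01, a2=1.0)
curve = [(i / 100, bimodal(i / 100, **params)) for i in range(26)]
```

In practice the six parameters would be fitted to the histogram of the estimation levels, for example with a least-squares routine.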
4 Conclusion and future work
We devise a model that helps lecturers estimate students’ concentration levels using cameras in online classes. Our
system presents the level every 2.5 seconds with 90.62% accuracy and estimates the next level of concentration by
using KF. In contrast to previous research, such as that using VGG16 [11], our model takes a different approach to
Figure 6: The fitting result in the low-concentration case is shown. The two separated states can be described by Eq. (2).
quantify the levels by capturing the variance of detected points on the human body in the current state. Additionally, we
estimate and track the level for the next time window. Practically, our model offers a tool to monitor the level more
precisely and to help lecturers estimate it. Academically, our model offers a novel approach to analyzing complex
human states and the concentration level.
As future work, we plan to use not only body movement data but also emotion data [12] and skin thermal data [1], [13]
to improve the estimation of human concentration levels. The measuring method used in this paper and
conventional measuring techniques will be combined and processed using deep learning. This work is expected to provide
useful information on students’ concentration levels and thus assist lecturers.
References
[1] Nomura, S., Hasegawa-Ohira, M., Kurosawa, Y., Hanasaka, Y., Yajima, K., & Fukumura, Y. (2012). Skin temperature as a possible indicator of student’s involvement in e-learning sessions. International Journal of Electronic Commerce Studies, 3(1), 101-110.
[2] Al-Musawi, A. S. (2018). Concentration level monitoring in education and healthcare. Basic and Clinical Pharmacology and Toxicology, 124(s2), 36.
[3] Marouane, S., Najlaa, S., Abderrahim, T., & Eddine, E. K. (2015). Towards measuring learner’s concentration in E-learning systems. International Journal of Computer Techniques, 2(5), 27-29.
[4] Liu, N. H., Chiang, C. Y., & Chu, H. C. (2013). Recognizing the degree of human attention using EEG signals from mobile sensors. Sensors, 13(8), 10273-10286.
[5] Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. ASME Journal of Basic Engineering, 82(1), 35-45.
[6] Cao, Z., Hidalgo, G., Simon, T., Wei, S. E., & Sheikh, Y. (2019). OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(1), 172-186.
[7] Simon, T., Joo, H., Matthews, I., & Sheikh, Y. (2017). Hand keypoint detection in single images using multiview bootstrapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1145-1153).
[8] Cao, Z., Simon, T., Wei, S. E., & Sheikh, Y. (2017). Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7291-7299).
[9] Mohammadpour, M., Khaliliardali, H., Hashemi, S. M. R., & AlyanNezhadi, M. M. (2017). Facial emotion recognition using deep convolutional networks. In 2017 IEEE 4th International Conference on Knowledge-Based Engineering and Innovation (KBEI), Tehran (pp. 0017-0021). doi: 10.1109/KBEI.2017.8324974.
[10] Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[11] Ruvinga, C., Malathi, D., & Dorathi Jayaseeli, J. D. (2020). Human concentration level recognition based on VGG16 CNN architecture. International Journal of Advanced Science and Technology, 29(6s), 1364-1373. Retrieved from http://sersc.org/journals/index.php/IJAST/article/view/9271
[12] Wei, S. E., Ramakrishna, V., Kanade, T., & Sheikh, Y. (2016). Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4724-4732).
[13] Nummenmaa, L., Glerean, E., Hari, R., & Hietanen, J. K. (2014). Bodily maps of emotions. PNAS, 111(2), 646-651.