Available via license: CC BY 4.0

A FRAMEWORK FOR RECOGNIZING AND ESTIMATING HUMAN CONCENTRATION LEVELS

A PREPRINT

Woodo Lee

Department of Physics

Korea University

Seoul, Republic of Korea, 02481

woodolee@korea.ac.kr

Jakyung Koo

Department of Computer Science and Engineering

Korea University

Seoul, Republic of Korea, 02481

lawkelvin33@korea.ac.kr

Nokyung Park

Department of Computer Science and Engineering

Korea University

Seoul, Republic of Korea, 02481

noparkee@korea.ac.kr

Pilgu Kang

Department of Computer Science and Engineering

Korea University

Seoul, Republic of Korea, 02481

rkd903@korea.ac.kr

Jeakwon Shim ∗

Department of Computer Science and Engineering

Korea University

Seoul, Republic of Korea, 02481

jaekwoun.shim@gmail.com

April 26, 2021

ABSTRACT

One of the major tasks in online education is to estimate the concentration level of each student. Previous studies are limited in that they classify the levels using discrete states only. The purpose of this paper is to estimate subtle concentration levels as specified states using a minimal amount of body movement data. This is done with a framework composed of a Deep Neural Network and a Kalman Filter. Using this framework, we successfully extracted the concentration levels, which can be used to aid lecturers and can be extended to other areas.

Keywords

Computer vision, Data mining, Education, Educational programs, Human computer interaction, Signal analysis.

1 Introduction

There have been many attempts to measure students’ concentration levels using various methods, such as taking skin temperature [1], recognizing visual attention, and detecting electroencephalogram (EEG) signals [2–4]. However, these attempts lack detail because they classify the concentration levels as discrete states. Here, we develop a new framework, DNN-K, named after its architecture of a Deep Neural Network (DNN) and a Kalman Filter (KF) [5], to address these limitations. DNN-K defines the concentration level as the probability of "high concentration," which is derived from the DNN, and suggests that the standard deviations of designated points are a core factor in measuring these probabilities. KF is also incorporated into DNN-K to estimate the concentration levels over the course of the measurement.

∗Corresponding author.

arXiv:2104.11421v1 [cs.LG] 23 Apr 2021

A PREPRINT - APRIL 26, 2021

Figure 1: The points measured by OpenPose. Ten points covering the upper and middle body are measured.

2 Motivation

The motivation of this paper is that a person’s body movements can be a factor in recognizing his/her condition. We propose that the standard deviations of designated points are a core factor in measuring the concentration levels. To measure the points, we used OpenPose [6–9], as shown in Fig. 1; the coordinates of the points range from 0 to 1. We then check the distributions of the points when people are concentrating and when they are not, which are shown in Fig. 2. The distribution in (a) shows that the entries are gathered around the body points, while that in (b) is more widely spread. The difference is visible, but on its own it cannot be quantified to recognize the concentration levels. We therefore analyze the distributions using DNN-K, and the details of the method are discussed in the following sections.

3 Methods

3.1 Overview

Figure 3 shows the overview of DNN-K. In the first step, the student’s video data taken by a camera is pre-processed. The pre-processed data is labeled with two states, high and low concentration, based on the students’ intent. The labeled data is used to train a DNN model in the recognition step. The trained DNN model recognizes continuous concentration levels, which are defined as the recognition levels ($S_r$). The concentration levels are then estimated by KF in the estimation step; these are called the estimation levels ($S_e$). Finally, lecturers can give feedback to students based on $S_r$ and $S_e$.

3.2 Step 1. Data pre-processing

The first step of DNN-K is to extract the standard deviations from the video data. Ten points of the human body, classified as the top part (points 0-4) and the middle part (points 5-9), are measured every 50 frames. Then, the standard deviations of the X and Y coordinates are calculated for the top and the middle parts, respectively. Note that we assume the standard deviations of the points to be the core factor in measuring the concentration levels. Table 1 shows the notation for the results of the pre-processing step.

Algorithm 1 shows the entire pre-processing step. Through this algorithm, the standard deviations are obtained and become the input data for the DNN, which is discussed in the next section.
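As an illustration only, the windowing in Algorithm 1 can be sketched in Python as follows. The sketch assumes that each track (for example `top.X`) is a flat list with one coordinate value per frame, since the text does not specify how the five points of a part are pooled, and it assumes the population standard deviation. The names `windowed_std` and `preprocess` are hypothetical.

```python
from statistics import pstdev

WINDOW = 50  # frames per window, as in Algorithm 1


def windowed_std(values, window=WINDOW):
    """Standard deviation of each consecutive, non-overlapping window."""
    stds = []
    for start in range(0, len(values) - window + 1, window):
        # Assumption: population standard deviation; the paper does not say which.
        stds.append(pstdev(values[start:start + window]))
    return stds


def preprocess(tracks, window=WINDOW):
    """Map each coordinate track (e.g. 'top.X') to its windowed deviations."""
    return {name: windowed_std(series, window) for name, series in tracks.items()}


# Made-up data: one still window followed by one strongly fluctuating window.
tracks = {"top.X": [0.5] * 50 + [0.0, 1.0] * 25}
sigmas = preprocess(tracks)
```

Each resulting list holds one value per 50-frame window, so the four lists together give one four-dimensional input vector per window.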

Figure 4 shows the differences in the standard deviations between the two groups. Nevertheless, some aspects remain unexplained, such as ambiguous patterns whose correlation with the concentration levels is not clear. To this end, a DNN is applied, since a DNN is an appropriate method for obtaining nonlinear combinations of features. This allows us to uncover hidden features that we cannot otherwise recognize.

Figure 2: 2D histograms in (a) the high-concentration case and (b) the low-concentration case.

Figure 3: The overview of our research. DNN-K consists of several packages to recognize and estimate the levels.

Column              Description
$\sigma^X_{Top}$    The standard deviation of the top part’s X coordinate
$\sigma^Y_{Top}$    The standard deviation of the top part’s Y coordinate
$\sigma^X_{Mid}$    The standard deviation of the middle part’s X coordinate
$\sigma^Y_{Mid}$    The standard deviation of the middle part’s Y coordinate

Table 1: The definition of the data and its description.

3.3 Step 2. Recognition

Algorithm 2 shows the overall structure of the recognition step. The recognition levels ($S_r$) are obtained through our DNN. The DNN consists of four layers. ReLU is used as the activation function in the first and second hidden layers. Sigmoid is used in the output layer so that the output probability lies between 0 and 1. The ADAptive Moment estimation (ADAM) optimizer [10] is applied, and the initial learning rate is set to 0.1%, which is the optimal value for the DNN. The loss function ($L$) is defined as

$$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right], \tag{1}$$

where $N$ is the number of labels, which is two in this case, $y_i$ is the label of the data, and $\hat{y}_i$ is the predicted probability corresponding to $y_i$. As the output value is a probability for binary classification, we use binary cross-entropy to calculate the loss at every epoch during training.
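For illustration, the layer sizes of Algorithm 2 and the loss in Eq. (1) can be sketched in plain Python as below. This is a minimal forward pass only, not the trained model: the random weight initialisation, the helper names (`dense`, `init_layer`, `forward`, `binary_cross_entropy`), and the input values are all assumptions made for the example, and no ADAM training loop is shown.

```python
import math
import random


def relu(v):
    return [max(0.0, x) for x in v]


def sigmoid(x):
    # Numerically stable form for both signs of x.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Hypothetical random initialisation; the paper does not specify one."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(features, params):
    """4 -> 8 (ReLU) -> 8 (ReLU) -> 1 (Sigmoid), as listed in Algorithm 2."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h1 = relu(dense(features, w1, b1))
    h2 = relu(dense(h1, w2, b2))
    return sigmoid(dense(h2, w3, b3)[0])


def binary_cross_entropy(labels, probs, eps=1e-12):
    """Eq. (1), averaged over the given labels; eps guards against log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


rng = random.Random(0)
params = [init_layer(4, 8, rng), init_layer(8, 8, rng), init_layer(8, 1, rng)]
s_r = forward([0.01, 0.02, 0.03, 0.04], params)  # made-up standard deviations
loss = binary_cross_entropy([1, 0], [0.9, 0.2])
```

With untrained weights, `s_r` is only guaranteed to lie strictly between 0 and 1. The loss for the two made-up predictions is the average of $-\log 0.9$ and $-\log 0.8$.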

K-fold cross-validation is applied to compensate for the limited amount of data. The accuracy of 5-fold training ranges from 85% to 95%, with a median of 90.62%.
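A minimal sketch of how a 5-fold split and the median score could be computed is given below. It is an assumption about the procedure rather than the authors' code: the shuffling, the seed, and the names `k_fold_indices`, `cross_validate`, and `train_and_score` are hypothetical, and the scoring callback is left to the user.

```python
import random
from statistics import median


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs that partition range(n_samples) into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


def cross_validate(n_samples, train_and_score, k=5):
    """Run a user-supplied train_and_score(train_idx, val_idx) on every fold."""
    scores = [train_and_score(tr, va) for tr, va in k_fold_indices(n_samples, k)]
    return scores, median(scores)
```

Here `train_and_score` would train the DNN on the training indices and return the accuracy on the validation indices; reporting the median of the five scores mirrors the summary given in the text.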

3.4 Step 3. Estimation

Algorithm 1: Data Pre-Processing Step of DNN-K

Input: top.X, top.Y, mid.X, mid.Y

for each Data $\in$ {top.X, top.Y, mid.X, mid.Y} do

    for $D_i = \bigcup_{50i}^{50(i+1)} \mathrm{Data}$ do

        $\sigma_{Data}$.append(s.d.($D_i$))

    end for

end for

Output: $\sigma^X_{Top}$, $\sigma^Y_{Top}$, $\sigma^X_{Mid}$, $\sigma^Y_{Mid}$


Figure 4: (a) - (d) show the distributions of the standard deviation for each part, respectively. The red histograms show the high concentration case, and the blue histograms show the low concentration case.

Algorithm 2: Recognition Step of DNN-K

Input: $\sigma^X_{Top}$, $\sigma^Y_{Top}$, $\sigma^X_{Mid}$, $\sigma^Y_{Mid}$

    Input layer: $\in \mathbb{R}^4$

    1st hidden layer: $\in \mathbb{R}^8$ (activation function: ReLU)

    2nd hidden layer: $\in \mathbb{R}^8$ (activation function: ReLU)

    Output layer: $\in \mathbb{R}^1$ (activation function: Sigmoid)

Output: Concentration levels $s_t \in S$

Algorithm 3: Estimation Step of DNN-K

Input: $s_t \in S$

for all $s_t \in S$ do

    $x^{est}_t \leftarrow s_t$

    /* Predicting */

    $x^{pre}_{t+1} = A \cdot x^{est}_t$

    $P^{pre}_t = A \cdot P_t \cdot A^T + Q$

    /* Updating */

    $K = P^{pre}_t \cdot H / (H \cdot P^{pre}_t \cdot H + R)$

    /* Estimating */

    $x^{est}_{t+1} = x^{pre}_{t+1} + K \cdot (m_t - H \cdot x^{pre}_{t+1})$

    $P_{t+1} = P^{pre}_t - K \cdot H \cdot P^{pre}_t$

end for

/* Analyzing */

Fit the distribution of $x^{est}_{t+1}$ with a lecturer-defined function

Output: Concentration levels ($\Psi$)


Figure 5: Estimation and measurement levels. The black dots ($S_r$) and the green dots ($S_e$) show a low concentration case. The blue dots ($S_r$) and the red dots ($S_e$) show a high concentration case.

The estimation step of DNN-K uses KF to obtain $S_e$. Algorithm 3 shows the overall process. There are three states in the algorithm, namely the prediction state ($x^{pre}_t$), the estimation state ($x^{est}_t$), and the measurement state ($m_t$), together with the error covariance matrix ($P_t$) and the transition weight matrix ($A$). The students’ state starts from $x^{est}_0 = 0.5$, because the students’ concentration level is assumed to be 50% at the beginning. $P_0 = 0.9$ is the system error, which comes from the DNN discussed in the previous section.

In the predicting step, $A$ is a transition matrix and $Q$ is an external noise matrix, which can be modified by teachers. As an ideal case, $A$ is set to 1 and $Q$ is set to 0; that is, the students are assumed to maintain their concentration levels with no external disturbances while they take lectures. In the updating step, the Kalman gain ($K$) is obtained at every step. $H$ is a scale matrix, which is set to 1 to simplify the problem. $x^{est}_{t+1}$ and $P_{t+1}$ are updated with $K$. Finally, in the estimating step, the next estimated state $x^{est}_{t+1}$ is recurrently updated.
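The predict, update, and estimate steps described above can be sketched as a scalar Kalman filter in Python. This is a hedged illustration, not the authors' implementation: it treats each recognition level $s_t$ as the measurement $m_t$, uses the initial values and the settings for $A$, $Q$, and $H$ stated in the text, and assumes a measurement noise `r=0.1` because the text gives no value for $R$. The function name `kalman_filter` is hypothetical.

```python
def kalman_filter(measurements, x0=0.5, p0=0.9, a=1.0, q=0.0, h=1.0, r=0.1):
    """Scalar Kalman filter; r (measurement noise) is an assumed value."""
    x_est, p = x0, p0
    estimates = []
    for m in measurements:
        # Predicting: propagate the state and the error covariance.
        x_pre = a * x_est
        p_pre = a * p * a + q
        # Updating: compute the Kalman gain.
        k = p_pre * h / (h * p_pre * h + r)
        # Estimating: correct the prediction with the measurement.
        x_est = x_pre + k * (m - h * x_pre)
        p = p_pre - k * h * p_pre
        estimates.append(x_est)
    return estimates


# Made-up recognition levels that fluctuate around 0.2.
noisy = [0.1, 0.3] * 25
smooth = kalman_filter(noisy)
```

With $Q = 0$ the gain shrinks at every step, so later measurements move the estimate less and less, which is why the estimated levels vary far less than the raw recognition levels.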

Figure 5 shows the estimation and measurement results every 2.5 seconds. Even though the measurement fluctuates widely every 2.5 seconds, KF enables users to track the levels smoothly, as shown by the green and the red dots. Figure 6 shows the histogram of the green dots. In this case, the distribution consists of two dominant modes, which can be analyzed with a suitable function, as follows. Each mode can be described by a component of the bimodal distribution $X$, which is written as

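$$N_k(\mu_k, \sigma_k^2, A_k) = A_k \, e^{-\frac{(x - \mu_k)^2}{2\sigma_k^2}} \tag{2}$$

$$X = N_1(\mu_1, \sigma_1^2, A_1) + N_2(\mu_2, \sigma_2^2, A_2) \tag{3}$$

where $\sigma_1$ and $\sigma_2$ are the standard deviations, $\mu_1$ and $\mu_2$ are the mean values, $A_1$ and $A_2$ are the amplitudes, and $x$ is the input data.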

Figure 6 shows the distribution of $\Psi$ in the low concentration case ($\Psi_{low}$). In this distribution, $\mu_1$ and $\mu_2$ are obtained as 0.09 and 0.16, respectively.
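To make the shape of the fitted function concrete, the sum of two modes in Eqs. (2) and (3) can be evaluated as in the sketch below. Only the two means are taken from the text; the widths and amplitudes are made-up placeholders, and the fitting procedure itself is not shown. The names `gaussian_mode` and `bimodal` are hypothetical.

```python
import math


def gaussian_mode(x, mu, sigma, amplitude):
    """Single mode of Eq. (2): amplitude * exp(-(x - mu)^2 / (2 sigma^2))."""
    return amplitude * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def bimodal(x, mu1, sigma1, a1, mu2, sigma2, a2):
    """Sum of two modes, as in Eq. (3)."""
    return gaussian_mode(x, mu1, sigma1, a1) + gaussian_mode(x, mu2, sigma2, a2)


# Means from the low-concentration fit; widths and amplitudes are made up.
params = dict(mu1=0.09, sigma1=0.02, a1=1.0, mu2=0.16, sigma2=0.02, a2=1.0)
curve = [(x / 100, bimodal(x / 100, **params)) for x in range(0, 31)]
```

With these placeholder widths the curve has two peaks near the two means and a dip between them; a fitting routine would adjust all six parameters to match the histogram.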

4 Conclusion and future work

We devise a model that helps lecturers estimate students’ concentration levels using cameras in online classes. Our system presents the level every 2.5 seconds with a median accuracy of 90.62% and estimates the next level of concentration using KF. In contrast to previous research, such as that using VGG16 [11], our model takes a different approach and quantifies the levels by capturing the variance of detected points on the human body in the current state. Additionally, we estimate and track the level for the next time window. Practically, our model offers a tool to monitor the level more precisely and to aid lecturers in estimating it. Academically, our model offers a novel approach to analyzing complex human states and the concentration level.

Figure 6: Fitting result in the low-concentration case. The two separated states can be described by Eq. (2).

As future work, we plan to use not only body movement data but also emotion data [12] and skin thermal data [1, 13] to improve the measurement of human concentration levels. The measuring method used in this paper and the conventional measuring techniques will be combined and processed using deep learning. We expect this work to provide useful information on students’ concentration levels and thus to assist lecturers.

References

[1] Nomura, S., Hasegawa-Ohira, M., Kurosawa, Y., Hanasaka, Y., Yajima, K., & Fukumura, Y. (2012). Skin temperature as a possible indicator of student’s involvement in e-learning sessions. International Journal of Electronic Commerce Studies, 3(1), 101-110.

[2] Al-Musawi, A. S. (2018). Concentration level monitoring in education and healthcare. Basic and Clinical Pharmacology and Toxicology, 124(s2), 36.

[3] Marouane, S., Najlaa, S., Abderrahim, T., & Eddine, E. K. (2015). Towards measuring learner’s concentration in E-learning systems. International Journal of Computer Techniques, 2(5), 27-29.

[4] Liu, N. H., Chiang, C. Y., & Chu, H. C. (2013). Recognizing the degree of human attention using EEG signals from mobile sensors. Sensors, 13(8), 10273-10286.

[5] Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. ASME Journal of Basic Engineering, 82(1), 35-45.

[6] Cao, Z., Hidalgo, G., Simon, T., Wei, S. E., & Sheikh, Y. (2019). OpenPose: Realtime multi-person 2D pose estimation using Part Affinity Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(1), 172-186.

[7] Simon, T., Joo, H., Matthews, I., & Sheikh, Y. (2017). Hand keypoint detection in single images using multiview bootstrapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1145-1153).

[8] Cao, Z., Simon, T., Wei, S. E., & Sheikh, Y. (2017). Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7291-7299).

[9] Mohammadpour, M., Khaliliardali, H., Hashemi, S. M. R., & AlyanNezhadi, M. M. (2017). Facial emotion recognition using deep convolutional networks. In 2017 IEEE 4th International Conference on Knowledge-Based Engineering and Innovation (KBEI), Tehran (pp. 0017-0021). doi: 10.1109/KBEI.2017.8324974.

[10] Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[11] Ruvinga, C., Malathi, D., & Dorathi Jayaseeli, J. D. (2020). Human concentration level recognition based on VGG16 CNN architecture. International Journal of Advanced Science and Technology, 29(6s), 1364-1373. Retrieved from http://sersc.org/journals/index.php/IJAST/article/view/9271

[12] Wei, S. E., Ramakrishna, V., Kanade, T., & Sheikh, Y. (2016). Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4724-4732).

[13] Nummenmaa, L., Glerean, E., Hari, R., & Hietanen, J. K. (2014). Bodily maps of emotions. PNAS, 111(2), 646-651.