Research Article
Journal of Research and Review in Science, 58-65
Volume 5, December 2018
LASU Journal of Research and Review in Science
CONVERSION OF SIGN LANGUAGE TO TEXT AND
SPEECH USING MACHINE LEARNING TECHNIQUES
Victoria A. Adewale 1, Dr. Adejoke O. Olamiti 2
1 Crawford University, Faith-City, Igbesa, Ogun State, Nigeria
2 University of Ibadan, Ibadan, Nigeria
2 Department of Neuroimaging Sciences, Centre for Clinical Brain Sciences, University of Edinburgh, Edinburgh, UK
Correspondence
Victoria A. Adewale, Crawford University,
Faith-City, Igbesa, Ogun State, Nigeria.
Email: bimpsyade@gmail.com
Abstract:
Introduction: Communication with the hearing impaired (deaf/mute)
people is a great challenge in our society today; this can be attributed to
the fact that their means of communication (Sign Language or hand
gestures at a local level) requires an interpreter at every instance.
Conversion of images to text as well as speech can be of great benefit to both
the non-hearing impaired and the hearing impaired (the deaf/mute) in their
day-to-day interaction with images. To effectively achieve this, this research
aimed at converting sign language (ASL, American Sign Language) images to text
as well as speech.
Aims: To convert ASL signed hand gestures into text as well as speech
using unsupervised feature learning, to eliminate the communication barrier
with the hearing impaired and also to provide a teaching aid for sign
language.
Materials and Method: The techniques of image segmentation and
feature detection played a crucial role in implementing this system. We
formulate the interaction between image segmentation and object
recognition in the framework of FAST and SURF algorithms. The system
goes through various phases such as data capturing using KINECT
sensor, image segmentation, feature detection and extraction from ROI,
supervised and unsupervised classification of images with K-Nearest
Neighbour (KNN)-algorithms and text-to-speech (TTS) conversion. The
combination of FAST and SURF with a KNN of 10 also
unsupervised learning classification could determine the best matched
feature from the existing database. In turn, the best match was converted
to text as well as speech.
Results: The introduced system achieved a 78% accuracy of
unsupervised feature learning.
Conclusion: The success of this work can be attributed to the effective
classification that has improved the unsupervised feature learning of
different images. The pre-determination of the ROI of each image using
SURF and FAST, has demonstrated the ability of the proposed algorithm
to limit image modelling to relevant region within the image.
Keywords: Image and Speech processing; Text-to-Speech (TTS);
Unsupervised Learning; FAST and SURF algorithms.
All co-authors agreed to have their names listed as authors.
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and
reproduction in any medium, provided the original work is properly cited.
© 2018 The Authors. Journal of Research and Reviews in Science JRRS, A Publication of Lagos State University
JRRS https://sciencejournal.lasucomputerscience.com
1. INTRODUCTION
Communication has been defined as an act of
conveying intended meanings from one entity or group
to another through the use of mutually understood signs
and semiotic rules. It plays a vital role in the existence
and continuity of humankind. For an individual to progress
in life and coexist with other individuals, there is a need
for effective communication. Effective communication is
an essential skill that enables us to understand and
connect with people around us. It allows us to build
respect and trust, resolve differences and maintain
sustainable development in our environment where
problem solving, caring and creative ideas can thrive.
Poor communication skills are the largest contributor to
conflict in relationships. The indicators of poor
communication include inattentiveness, arguments,
vilification, and language barrier between the
communicators. These factors affect not only the physically fit but also the
physically challenged. Research has shown that a vast number of people all
over the world are physically challenged in terms of communication: blind,
deaf or mute [1]. Investigating the barrier of
communication between the hearing-impaired and the
hearing person has led to the need of providing a means
of bridging this communication gap.
Extant literatures capture some of the dynamics of
solving the problems facing effective communication,
although not without observed missing links.
Academic and industrial researchers have recently
been focusing on analyzing images of people and there
has been a surge of interest in recognizing human
gestures. Research on scene segmentation of images
was carried out using deep learning techniques; the
classification yielded 53.8% accuracy [2]. In relation to
conversion of sign language, there is the need to
explore other image classification techniques to
enhance accurate classification.
Also, a novel method for unsupervised learning of
human action categories was presented by Niebles et al.
to automatically learn the probability distributions of the
spatial-temporal words and the intermediate topics
corresponding to human action categories. This was
achieved by using latent topic models such as the
probabilistic Latent Semantic Analysis (pLSA) model
and Latent Dirichlet Allocation (LDA)[3]. This study was
aimed at recognizing general human actions which can
be said to be ambiguous; a more specific human action
identification approach using unsupervised learning will
yield a better result.
Furthermore, an approach to convert signed ASL
alphabets using unsupervised learning feature
(Gaussian model) has also been used to learn set of
features similar to edges using an autoencoder; a
softmax classifier was used to classify the learned
features. Results showed that the larger the training set, the higher the
accuracy (a 1,200-sample training set produced 95.62% accuracy and a
6,000-sample training set produced 98.20%), as well as a disparity between the
training and test error [4]. Further studies on how to capture and
convert ASL words would be a plus to this study.
2. METHODOLOGY
The aim of the study as earlier stated was to provide an
unsupervised learning feature of signed hand gestures
while the system returns corresponding output as text
and speech. The following were the measurable
methods employed in actualising the aim:
1. Segmentation of captured signed gestures of ASL
as inputs
2. Feature extraction of the segmented images
3. UFL and classification of several images
4. Text and Speech synthesis of classified images
Figure 1 gives an overview of the system.
Fig 1: System Overview
2.1 Segmentation of captured signed gestures
of ASL as inputs
The aim of segmentation was to convert images into more
meaningful and easy to analyse portions. Segmentation does
the job of partitioning an image into multiple segments which
help to locate the objects and boundaries (curves, arcs, lines,
etc.) in an image in binary form. The set of images captured
from the Kinect sensor using the Image Acquisition Tool in
MATLAB would be selected and fed into the Image
Segmenter in MATLAB which is then converted to grayscale
image. The threshold of the images is then obtained by
converting grayscale images into binary images to determine
the high level contrast of the images. Such images can then
be cropped or resized. The segmentation process is
represented as shown in fig. 2.
Fig 2: Image Segmentation
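A minimal MATLAB sketch of this single-image segmentation step, assuming the Image Processing Toolbox and illustrative file and crop values, might look as follows:

% Segmentation sketch: coloured frame -> grayscale -> binary -> cropped region.
% 'ImagesetA1.png' and the crop rectangle are illustrative values.
rgbImage  = imread('ImagesetA1.png');           % coloured frame from the Kinect
grayImage = rgb2gray(rgbImage);                 % convert to grayscale
level     = graythresh(grayImage);              % global (Otsu) threshold
binImage  = imbinarize(grayImage, level);       % grayscale -> binary (high contrast)
binImage  = imcrop(binImage, [100 50 300 300]); % optional crop of the hand region
imshow(binImage);                               % inspect the segmented result

The Image Segmenter app can export a broadly similar sequence of calls when a thresholding workflow is saved as a function.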
2.2 Feature extraction of the segmented
images
Sign Language usually involves movement in the upper
part of the body; head, shoulder, hands and elbows
coordinates are retained while other parts are discarded
[5]. To satisfy this need, key points corresponding to
high-contrast locations such as object edges and
corners were used. These features are intended to be
non-redundant, informative and relevant for the
intended use. Extracting the ROI from an image is challenging, as it forms the
basis for further image analysis, interpretation and classification. A rectangular
ROI whose outline consists of four segments joining the
four corner points is used to make computational
statistics feasible [6].
The vertices of an ROI outline may be positioned anywhere with respect to the
array of image pixels, so the same rectangular ROI superimposed on the pixel
array may appear shifted relative to the underlying pixel grid.
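Since both detectors in the Computer Vision Toolbox accept a rectangular region, restricting detection to the ROI can be sketched as follows (the rectangle is an illustrative [x y width height] value, not a measured one):

% Restrict FAST and SURF detection to a rectangular ROI (illustrative values).
grayImage  = rgb2gray(imread('ImagesetA1.png'));
roi        = [120 40 260 320];                        % [x y width height]
fastPoints = detectFASTFeatures(grayImage, 'ROI', roi);
surfPoints = detectSURFFeatures(grayImage, 'ROI', roi);
imshow(grayImage); hold on;
plot(fastPoints);                                     % detected corners
plot(surfPoints.selectStrongest(50));                 % strongest SURF blobs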
2.3 UFL and classification of several images
The identification of interest points present within the
space of an image is important in the determination of
the image’s ROI, therefore the method being proposed
in this paper maximizes the number of interest points
detected within a sample image through the use of the
combination of FAST corner detector and SURF
detector.
2.3.1 FAST and SURF Points for K-Nearest
Neighbour UFL
If FAST corner points and SURF key points are
respectively represented by the sets F= {f1,f2,f3,…,fL }
and S= {s1,s2,s3,…,sL}, then the combination of these
two algorithms can be represented by a set A (i.e. A = F ∪ S).
The two key criteria which distinguish keypoints
belonging to an ROI from those that do not belong to
the desired region are location and description. This combination takes its
root from the K-Nearest Neighbour (KNN) rule in equation (1), used by [7] to
classify the description of each point as either foreground or background; it
therefore requires training samples.

label(a) = arg max_{c ∈ {foreground, background}} Σ_{a_i ∈ N_k(a)} 1[label(a_i) = c]        (1)

where a ∈ A is a combined FAST/SURF keypoint, N_k(a) is the set of the k
training keypoints whose descriptors are nearest to that of a (k = 10), and
1[·] is the indicator function.
A new set of extracted data will be fed into the system
for training in order to learn the set of unsupervised
features. In the implementation of the KNN, each
feature point to be categorised is allocated the highest
occurring label from the closest 10 neighbours (Medium
KNN), thus the points labeled as the foreground are
grouped together to form the desired Region.
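The combination A = F ∪ S and the medium-KNN vote of equation (1) can be sketched in MATLAB as follows; trainDesc (an N-by-64 matrix of training descriptors) and trainLabel (an N-by-1 vector, 1 = foreground, 0 = background) are assumed to exist and their names are illustrative:

% Combine FAST corners and SURF key points, describe both with 64-D SURF
% descriptors, and label each point by a 10-nearest-neighbour majority vote.
gray = rgb2gray(imread('ImagesetA1.png'));
F = detectFASTFeatures(gray);                       % FAST corner points
S = detectSURFFeatures(gray);                       % SURF key points
[descF, ptsF] = extractFeatures(gray, F, 'Method', 'SURF');
[descS, ptsS] = extractFeatures(gray, S, 'Method', 'SURF');
A = double([descF; descS]);                         % combined descriptor set
idx = knnsearch(double(trainDesc), A, 'K', 10);     % 10 nearest training samples
isForeground = mode(trainLabel(idx), 2) == 1;       % highest occurring label
roiPoints = [ptsF.Location; ptsS.Location];
roiPoints = roiPoints(isForeground, :);             % points grouped as the desired region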
2.4 Text and Speech synthesis of classified
images
Text-to-Speech (TTS) refers to the ability of computers
to read text aloud. A TTS Engine converts written text
to a phonemic representation, and then converts the
phonemic representation to waveforms that can be
output as sound. Speech synthesis is the artificial
production of human speech. A computer system used
for this purpose is called a speech synthesizer, and can
be implemented in software or hardware. After the
successful classification of these features, the important
task is to generate appropriate text and speech output
for every input image using MATLAB Speech
Synthesizer.
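MATLAB has no built-in speech synthesizer, so one common way to realise this step, assuming a Windows machine where the .NET System.Speech assembly is available, is the following sketch (the spoken string is illustrative):

% Text-to-speech via the .NET System.Speech assembly (Windows only).
NET.addAssembly('System.Speech');
synth = System.Speech.Synthesis.SpeechSynthesizer;
synth.Volume = 100;
Speak(synth, 'Letter A');   % read the recognised sign aloud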
3. RESULTS AND DISCUSSION
Sample images of different ASL signs were collected with the Kinect sensor
using the Image Acquisition Toolbox in MATLAB. About five hundred (500) data
samples, with between five and ten (5-10) samples per sign, were collected as
the training data. The reason for this is to make the algorithm robust to
images from the same database and to reduce the rate of misclassification.
Examples of the images collected are shown in Table 1.
Table 1: Coloured images for training (image sets A1-A4 and B1-B4)
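A hedged sketch of this capture step with the Image Acquisition Toolbox, assuming the Kinect for Windows support package is installed and that device index 1 is the colour stream, is shown below (the output file name is illustrative):

% Grab one colour frame from the Kinect and store it as a training sample.
vid = videoinput('kinect', 1);          % colour sensor
preview(vid);                           % live view while the sign is held
rgbFrame = getsnapshot(vid);            % capture a single frame
imwrite(rgbFrame, 'ImagesetA1.png');    % illustrative file name
delete(vid);                            % release the device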
3.1 SEGMENTATION OF IMAGES
Batch segmentation for all training samples was carried
out to convert the coloured images into binary form with
MATLAB Image Segmenter and Batch Processor
toolbox. Only the set of binarised images is useful
in feature detection and extraction. The
segmented form of the images in Table 1 is represented
in Table 2.
Table 2: Segmented image set of the coloured images (SegmentedImageA1-A4 and SegmentedImageB1-B4)
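A batch version of the segmentation, assuming an illustrative folder layout and the Image Processing Toolbox, could be sketched as:

% Batch segmentation: binarise every coloured training image and save it.
imds   = imageDatastore('training_images');      % illustrative folder of coloured samples
outDir = 'segmented_images';
mkdir(outDir);
for k = 1:numel(imds.Files)
    rgbImage = readimage(imds, k);
    binImage = imbinarize(rgb2gray(rgbImage));   % grayscale then Otsu threshold
    [~, name] = fileparts(imds.Files{k});
    imwrite(binImage, fullfile(outDir, ['Segmented' name '.png']));
end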
3.2 FEATURE EXTRACTION
Once the segmentation process has been successfully carried out, the next step
is to load the image database. A for loop is used to read an entire folder of
images and store them in MATLAB's memory for labelling. The Training Image
Labeler app of MATLAB, shown in Figure 3, is employed to do this, and the ROI
generated by equation (1) was then used to select the training set.
Fig 3: ROI Labelling of ASL Images
Fig 4: Object Bounding Box of Labelled ASL images
The bounding boxes in Figure 4 for each image, alongside its path name, are
generated and stored in a .mat file for further processing.
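The stored labelling result can be thought of as a small structure of file names and [x y width height] rectangles; the sketch below uses illustrative names and values, not the actual labelled data:

% Illustrative structure of the exported ROI labels saved for later stages.
gTruth = struct('imageFilename', {}, 'objectBoundingBoxes', {});
gTruth(1).imageFilename       = fullfile('training_images', 'ImagesetA1.png');
gTruth(1).objectBoundingBoxes = [120 40 260 320];      % example rectangle
gTruth(2).imageFilename       = fullfile('training_images', 'ImagesetA2.png');
gTruth(2).objectBoundingBoxes = [115 45 255 318];      % example rectangle
save('ASL_ROI_labels.mat', 'gTruth');                  % .mat file for further processing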
3.3 UNSUPERVISED CLASSIFICATION OF ASL
IMAGES
In the implementation of FAST and SURF points for KNN
as given in equation (1), each feature point to be
classified is allocated the highest occurring label from
the closest 10 neighbours, thus the points labeled as the
foreground are grouped together to form the desired
region.
3.3.1 STAGES FOR UFL AND CLASSIFICATION
1. PREPARE COLLECTION OF IMAGES TO
SEARCH
Read the set of reference images each containing a
different object. Multiple views of the same object are
included in the collection shown in Figure 5 in order to
capture hidden or occluded areas.
Fig 5: Image Collection of different ASL
2. DETECT FEATURE POINTS IN IMAGE
COLLECTION
Feature points are detected and displayed in the first image, as shown in
Figure 6. The use of local features serves two purposes: it makes the search
process more robust to changes in scale and orientation, and it reduces the
amount of data that needs to be stored and analysed. Features are then
detected in the entire image collection.
Fig 6: First Image in Collection
3. BUILD FEATURE DATASET
All of the features from each image are combined into a
matrix. The matrix was then used to initialize
a KDTreeSearcher object from the Statistics Toolbox.
This object allows for fast searching for nearest
neighbours of high-dimensional data. In this case, nearest neighbours of the
combined FAST and SURF descriptors are treated as views of the same point.
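A sketch of this stage, assuming the Computer Vision and Statistics and Machine Learning Toolboxes and the illustrative folder used earlier, is given below; indexIntervals records where each image's rows end in the combined matrix:

% Build one descriptor matrix for the whole collection and a KD-tree searcher.
imds           = imageDatastore('segmented_images');   % illustrative folder
allFeatures    = [];
indexIntervals = 0;                                    % row boundaries, one per image
for k = 1:numel(imds.Files)
    img = readimage(imds, k);
    if size(img, 3) == 3, img = rgb2gray(img); end
    img = im2uint8(img);                               % also handles binary masks
    dF  = extractFeatures(img, detectFASTFeatures(img), 'Method', 'SURF');
    dS  = extractFeatures(img, detectSURFFeatures(img), 'Method', 'SURF');
    allFeatures = [allFeatures; dF; dS];               %#ok<AGROW>
    indexIntervals(end+1) = size(allFeatures, 1);      %#ok<AGROW> last row of image k
end
searcher = KDTreeSearcher(double(allFeatures));        % fast high-dimensional NN queries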
4. CHOOSE QUERY IMAGE
An entirely new set of images, shown in Figure 7, outside the trained images
is supplied. In other words, it is an image set that is not part of the
training set.
Fig 7: New Image to be classified
5. DETECT FEATURE POINTS IN QUERY
IMAGE
The query image is converted into grayscale and thresholded to obtain a
segmented image; the ROI of the image is then captured before the features are
extracted using FAST corner points and SURF key points for effective and
efficient recognition. The features
detected from Figure 8a are represented in Figure 8b.
Fig 8a: ROI Selection
Fig 8b: Detected Features
6. SEARCH IMAGE COLLECTION FOR THE
QUERY IMAGE
For all of the features in the query image, the ten nearest neighbours in the
dataset were considered and the distance to each neighbour was computed. The
knnsearch function returns the nearest neighbours even if none of the features
is a close match. To discard such bad matches, a ratio of the closest
neighbour distances is used. The histogram function was used to count the
number of features matched from each image; each pair of indices in the
indexIntervals of Figure 9 constitutes an index interval that corresponds
to an image.
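Continuing the sketch above (variable names follow the earlier, illustrative code), the query and the per-image vote could look like this:

% Describe the query image, find 10 nearest neighbours per feature and count
% how many of the best matches fall in each image's index interval.
queryImg = imread('query_sign.png');                   % illustrative query file
if size(queryImg, 3) == 3, queryImg = rgb2gray(queryImg); end
queryImg = im2uint8(imbinarize(queryImg));
qF = extractFeatures(queryImg, detectFASTFeatures(queryImg), 'Method', 'SURF');
qS = extractFeatures(queryImg, detectSURFFeatures(queryImg), 'Method', 'SURF');
queryFeatures = double([qF; qS]);
[idx, dist] = knnsearch(searcher, queryFeatures, 'K', 10);
votes = histcounts(idx(:, 1), indexIntervals + 0.5);   % one bin per reference image
[~, bestImage] = max(votes);                           % strongest matching image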
The strength with which each image in the collection matches the query image
can be viewed in the image collection shown in Figure 10. The size of each
image in Figure 11 is proportional to the proximity of matching features. It
was observed that some other images are
still considered as either a strong or weak match. These
are outliers that will be eliminated in the next step.
Fig 9: Histogram of matched images
Fig 10: Set of Matched Images
7. ELIMINATE OUTLIERS USING DISTANCE
TESTS
To prevent false matches, it is important to remove
those nearest neighbour matches that are far from their
query feature. The poorly matched features can be
detected by comparing the distances of the first and
second nearest neighbour. If the distances are similar,
as calculated by their ratio, the match is rejected as
shown in Figure 11. Additionally, matches that are far
apart were ignored. These processes were repeated for other new sets of images
for unsupervised learning and classification.
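A sketch of this outlier-removal step, reusing the idx, dist and indexIntervals variables from the previous sketch (the 0.8 ratio threshold is a common choice and is an assumption here, not a value reported in the paper):

% Reject ambiguous matches whose first and second neighbour distances are
% similar, then repeat the per-image vote on the surviving matches only.
ratio    = dist(:, 1) ./ dist(:, 2);                   % first vs. second neighbour
isStrong = ratio < 0.8;                                % keep only distinctive matches
votes    = histcounts(idx(isStrong, 1), indexIntervals + 0.5);
[~, bestImage] = max(votes);
bestFile = imds.Files{bestImage};                      % best-matched image, later spoken aloud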
Fig 11: Best Matched Feature
3.4 TEXT-TO-SPEECH SYNTHESIS
The text-to-speech synthesizer function of MATLAB was used to convert the
string of the filename of the best-matched feature in the collection to
speech, as shown in Figure 12.
Fig 12: Text-To-Speech
3.4.1 RESULT OF SUPERVISED AND
UNSUPERVISED CLASSIFICATION
Table 3 shows the classification result for each sign (1 indicates the sign was correctly classified, 0 indicates it was not):
Table 3: Result of Classification

Image sample | Number of image samples per sign | Supervised Feature Learning (Classification) | Unsupervised Feature Learning (Classification)
A | 10 | 1 | 1
B | 10 | 1 | 1
C | 10 | 1 | 1
D | 9 | 1 | 0
E | 10 | 1 | 0
F | 10 | 1 | 1
G | 10 | 1 | 1
H | 9 | 1 | 1
I | 10 | 1 | 1
K | 9 | 1 | 1
L | 10 | 1 | 1
M | 10 | 0 | 0
N | 10 | 1 | 0
O | 10 | 1 | 1
P | 10 | 1 | 1
Q | 10 | 1 | 0
R | 10 | 1 | 1
S | 10 | 1 | 1
T | 10 | 1 | 1
U | 10 | 1 | 1
V | 7 | 1 | 1
W | 8 | 1 | 1
X | 10 | 1 | 0
Y | 10 | 1 | 0
Love | 10 | 1 | 1
Master | 9 | 0 | 1
Father | 5 | 1 | 1
Mother | 9 | 1 | 1
You | 9 | 1 | 1
Me | 8 | 1 | 1
Your | 9 | 1 | 1
Start | 10 | 1 | 1
End | 10 | 1 | 1
Man | 10 | 1 | 1
Come | 10 | 1 | 0
To | 10 | 1 | 1
Meat | 10 | 1 | 1
Want | 9 | 1 | 1
Church | 10 | 1 | 1
Name | 8 | 1 | 1
God | 8 | 1 | 1
Food/Eat | 10 | 1 | 1
What | 10 | 1 | 1
Are | 10 | 1 | 1
My/Mine | 10 | 1 | 1
Water | 10 | 1 | 1
House | 10 | 1 | 0
Have | 10 | 1 | 1
The final result in Figure 13 shows 92% correct
classification using supervised learning and 78%
correct classification using unsupervised learning.
Fig 13: Classification of ASL Images
3.5 DISCUSSION
The study findings show that American Sign Language (ASL) is commonly used in
Nigeria by the hearing impaired; hence, five hundred (500) ASL images were
collected as the training set. From the collection of images,
a database of forty-nine (49) different signs was used.
Having subjected the set of images to batch segmentation, features of each
sign were detected and extracted from a specific bounding box of the Region of
Interest (ROI) to aid supervised learning. The
combination of FAST and SURF with a KNN of 10 also
showed that unsupervised learning classification could
determine the best matched feature from the existing
database. In turn, the best match was converted to text
as well as speech. The introduced system achieved a
92% accuracy of supervised feature learning and 78%
of unsupervised feature learning.
REFERENCES
[1] V. Padmanabhan and M. Sornalatha, “Hand
gesture recognition and voice conversion
system for dumb people,” Int. J. Sci. Eng. Res.,
vol. 5, no. 5, 2014.
[2] C. Chen, J. Chen, and A. Ryan, “Scene
Segmentation of 3D Kinect Images with
Recursive Neural Networks,” 2011. [Online].
Available: http://cs.nyu.edu/. [Accessed: 14-Mar-2017].
[3] J. C. Niebles, H. Wang, and L. Fei-Fei,
“Unsupervised Learning of Human Action
Categories Using Spatial-Temporal Words,” Int
J Comput Vis, 2008.
[4] P. A. Ajavon, “An Overview of Deaf Education
in Nigeria,” vol. 109, no. 1, pp. 5–10, 2006.
[5] D. Mart, “Sign Language Translator using
Microsoft Kinect XBOX 360 TM.”
[6] Xinapse, "Region of Interest (ROI) Algorithms," 2018.
[7] A. Li, W. Jiang, W. Yuan, D. Dai, S. Zhang, and Z. Wei, "An Improved FAST + SURF Fast Matching Algorithm," Procedia Comput. Sci., vol. 107, pp. 306-312, 2017.