Reducing Response Time for Multimedia Event Processing using
Domain Adaptation
Asra Aslam
Insight Centre for Data Analytics
NUI Galway, Ireland
asra.aslam@insight-centre.org
Edward Curry
Insight Centre for Data Analytics
NUI Galway, Ireland
edward.curry@insight-centre.org
ABSTRACT
The Internet of Multimedia Things (IoMT) is an emerging concept due to the large amount of multimedia data produced by sensing devices. Existing event-based systems mainly focus on scalar data, and multimedia event-based solutions are domain-specific. Multiple applications may require the handling of numerous known/unknown concepts, which may belong to the same/different domains with an unbounded vocabulary. Although deep neural network-based techniques are effective for image recognition, the limitation of having to train classifiers for unseen concepts leads to an increase in the overall response time for users. Since it is not practical to have all trained classifiers available, it is necessary to address the problem of training classifiers on demand for an unbounded vocabulary. By exploiting transfer learning-based techniques, evaluations showed that the proposed framework can answer within 0.01 min to 30 min of response time with accuracy ranging from 95.14% to 98.53%, even when all subscriptions are new/unknown.
CCS CONCEPTS
• Information systems → Multimedia streaming; • Computing methodologies → Neural networks; • Software and its engineering → Publish-subscribe / event-based architectures.
KEYWORDS
Domain Adaptation, Online Training, Internet of Multimedia Things, Event-Based Systems, Multimedia Stream Processing, Object Detection, Transfer Learning, Smart Cities, Machine Learning
ACM Reference Format:
Asra Aslam and Edward Curry. 2020. Reducing Response Time for Multimedia Event Processing using Domain Adaptation. In Proceedings of the 2020 International Conference on Multimedia Retrieval (ICMR '20), June 8–11, 2020, Dublin, Ireland. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3372278.3390722
1 INTRODUCTION
Due to the ever-increasing shift of data towards multimedia, the inclusion of “multimedia things” in the domain of the Internet of Things (IoT) is a crucial step for the emerging applications of smart cities
[2, 3, 24, 34, 42]. Event processing systems [12, 14] are designed to process data streams (consisting of mostly scalar data, excluding multimedia data). In the case of smart cities, multiple types of multimedia applications may require the handling of multiple subscriptions belonging to multiple domains (like {car, bus, pedestrian, bike} ∈ traffic management, {car, taxi, bike} ∈ parking management, {ball, person} ∈ sports event management, etc.). The high-performance requirements of real-time systems can be met using existing image processing systems, but these are designed only for specific domains, have limited user expressibility, and cannot successfully realize the goal of generalizable multimedia event processing due to their bounded object detection capability.
In our previous work [5, 6] we analyzed the problem of generalized multimedia event processing but recognized the requirement of availability of trained classifiers for unknown concepts/objects within subscriptions (unbounded vocabulary [52]). The online training of a classifier on the request of any new/unknown subscription is an option to be explored, which will help either in switching (transforming) from one classifier to another (like bus → car) or in the construction of a completely new classifier (like ball). Also, existing DNN-based techniques [18] are well known for easy knowledge transfers among domains [16, 30, 46] but focus either on improving accuracy or on testing time. They do not analyze the overall response time of the transfer process and its impact on accuracy.
In this work, we propose an adaptive multimedia event processing model that leverages transfer learning-based techniques for domain adaptation to handle unknown/new subscriptions within an acceptable time frame. An example of multimedia event processing, specifically for the detection of objects, is shown in Fig. 1. The main contributions of this article include a definition of the quality metric “response-time” supporting “unknown subscriptions”, an adaptive approach using online classifier construction to support multiple domain-based subscriptions, and an instantiation of a classifier learning model by transferring knowledge among classifiers using fine-tuning and freezing layers of neural network-based object detection models.
2 BACKGROUND WITH RELATED WORK
Very few multimedia event-based architectures for the Internet of Multimedia Things (IoMT) have been proposed in recent works [2, 42, 44], focused on scalability and multimodal big data. However, augmenting IoT systems with multimedia event-based approaches is not straightforward, and the two have not yet been combined into an end-to-end model.
2.1 Domain-Specific Event Recognition
Figure 1: Generalized Multimedia Event Processing Scenario

Event recognition in multimedia is one of the popular areas of research [21, 32, 51]. Traffic recognition systems [17, 26] are highly efficient in analyzing and predicting traffic events. Detection of interesting events in sports videos [7, 33] is also one of the common event recognition problems. Similarly, other applications like flood detection, surveillance-based systems, cultural events, and/or natural disasters have also been introduced in the literature [1, 31, 43, 50], with medium to high precision and no possibility for domain adaptation. It can be concluded that although these event recognition systems achieve high performance, they have no support for a large vocabulary, which limits their user interface; they also demonstrate the need to merge event-based systems with multimedia methods each time the domain changes, and therefore do not support domain adaptation by themselves.
2.2 Domain Adaptive Event Recognition
As existing approaches to processing multimedia data are domain-specific, research is moving towards the concept of transferring knowledge from one domain to another [8, 11, 49]. Domain adaptation is the ability to utilize the knowledge of old domains to identify unknown domains. The model learns from a source domain consisting of labeled data and from a target domain using unlabeled/labeled data; in most use-cases, far more data is available in the source domain than in the target domain [35]. Many approaches [9, 16, 30, 46] with supervised/unsupervised transfer learning have been proposed for domain adaptation, and they are mainly focused on generalization ability for increasing accuracy, not on the overall response time. Event recognition in still images by transferring object and scene representations has been proposed in [48], where the correlations of the concepts of objects, scenes, and events have been investigated. Similarly, large-scale domain adaptation-based approaches [4, 10, 19, 20, 40] have also been introduced, particularly for the detection of objects, and it is desirable to bring their abilities to the core of multimedia event processing.
3 MULTIMEDIA EVENT PROCESSING
3.1 Problem Formulation
The problem is focused on minimizing the response time for the processing of multimedia events in order to answer user queries consisting of unknown subscriptions (unbounded vocabulary), using an adaptive classifier construction approach while achieving high accuracy. It is primarily based on the following two dimensions, “Response-Time” and “Unknown Subscriptions”:
(i) Response-Time: It can be defined as the difference between the arrival and notification time of a subscription processed using specific classifiers. Challenges with response time in a multimedia event processing system include the following two cases:

Case 1: Classifier for subscription available
This case contains subscriptions (like car, dog, bus) which are previously known to the multimedia event processing system, and whose classifiers are already present in the model. Here the response time will depend only on the testing time, excluding training time.

Case 2: Classifier for subscription not available
This scenario includes subscriptions (like person, truck, traffic_light) for which classifiers are not available and which are unknown to the system. However, by using the similarity of new subscriptions with existing base classifiers, we can further divide the present case as:

(a) Subscriptions require classifiers similar to base classifiers: Consider the example of an unknown subscription “truck”; a classifier for truck can be constructed from the existing “bus” classifier. Hence domain adaptation time contributes to the response time.

(b) Subscriptions require classifiers completely different from base classifiers: In such a scenario, we assume no base classifiers are similar to the incoming subscription, and the response time must include the cost of training from scratch (a sketch of this case analysis follows at the end of this subsection).
(ii) Unknown Subscriptions: This dimension concerns the ability to recognize new subscriptions naming objects that may not belong to the limited vocabulary of the system. The lack of support for unbounded vocabularies is a bottleneck for emerging applications [52], which we refer to as Unknown Subscriptions.
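To make the case analysis concrete, the following is a minimal sketch of how a subscription could be routed through these cases. The helper functions (`similarity`, `adapt_by_freezing`, `train_from_scratch`, `notify`) and the threshold value are hypothetical stand-ins, not the paper's implementation:

```python
import time

SIMILARITY_THRESHOLD = 0.5  # hypothetical cut-off on subscription relatedness

def handle_subscription(concept, classifiers, similarity,
                        adapt_by_freezing, train_from_scratch, notify):
    """Route a subscription through Cases 1, 2a, and 2b of Section 3.1.

    Response time is the difference between the arrival time of the
    subscription and the time the notification pipeline is ready.
    """
    t_arrival = time.time()
    if concept in classifiers:
        model = classifiers[concept]          # Case 1: only testing time counts
    else:
        nearest = max(classifiers, key=lambda c: similarity(c, concept))
        if similarity(nearest, concept) >= SIMILARITY_THRESHOLD:
            # Case 2a: adapt a similar base classifier (e.g. bus -> truck)
            model = adapt_by_freezing(classifiers[nearest], concept)
        else:
            # Case 2b: no similar base classifier; train from scratch
            model = train_from_scratch(concept)
        classifiers[concept] = model
    notify(concept, model)
    return time.time() - t_arrival            # response time in seconds
```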
3.2 Adaptive Multimedia Event Processing
A functional model has been designed for the adaptive multimedia event processing engine (shown in Fig. 2), consisting of the various models discussed below:
Event Matcher analyzes user subscriptions (such as bus, car, dog) and image events, and is responsible for the detection of conditions in image events as specified by the user query and for the preparation of notifications that need to be forwarded to users.
Figure 2: Design for Adaptive Multimedia Event Processing
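As an illustration only, a matcher of this kind could be sketched as follows; the `detector.detect` interface returning `(label, box, score)` tuples is an assumption on our part, not the engine's actual API:

```python
from dataclasses import dataclass

@dataclass
class Notification:
    subscriber: str   # user to notify
    concept: str      # subscribed concept that matched
    frame_id: int     # image event in which it was detected

def match_events(image_events, subscriptions, detector):
    # For every image event, run object detection once and notify each
    # subscriber whose concept appears among the detected labels.
    for frame_id, image in image_events:
        detected = {label for label, box, score in detector.detect(image)}
        for subscriber, concept in subscriptions:
            if concept in detected:
                yield Notification(subscriber, concept, frame_id)
```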
Training and Testing Decision Model is designed to analyze the available classifiers and take the testing and training decision accordingly. It evaluates the relationship of existing classifiers with a new/unknown subscription and chooses the transfer learning technique.
Classier Construction Model phase performs the training of clas-
siers for subscribed classes, and updates the classier in the shared
resources after allowed response-time. The two options of transfer
learning used for classier construction includes ne-tuning and
freezing layers. In the rst approach we are performing ne-tuning
on a pre-trained model (presently ImageNet [
13
]), which uses the
technique of back-propagation with labels for target domain until
validation loss starts to increase. In the second approach, we are
using this previously trained classier to instantiate the network
of another classier required for a similar subscription concept.
In this particular scenario, we are freezing the backbone (convo-
lutional and pooling layers) of the neural network and training
only top dense fully connected layers, where the frozen backbone
is not updated during back-propagation and only ne-tuned layers
are getting updated and retrained during the training of classier.
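In PyTorch terms, the freezing option can be sketched as below; the assumption that the detection model exposes `backbone` and `head` submodules is ours, and the optimizer settings are illustrative:

```python
import torch

def freeze_backbone_and_finetune_head(model, lr=1e-3):
    # Freeze the convolutional/pooling backbone: its weights receive no
    # gradient updates during back-propagation.
    for param in model.backbone.parameters():
        param.requires_grad = False
    # Only the top (dense/head) layers remain trainable and are retrained
    # for the new subscription concept.
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.SGD(trainable, lr=lr, momentum=0.9)
```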
The decision to construct a classifier for “bus” either from pre-trained models (by fine-tuning) or from the “car” classifier (by freezing) is taken with the help of a threshold computed from subscription relatedness (the path operator of WordNet [36]).
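A minimal sketch of this relatedness computation with NLTK's WordNet interface is shown below; taking the first synset of each concept is a simplification on our part, since the paper specifies only that the WordNet path operator is used:

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def relatedness(concept_a, concept_b):
    # Path-based relatedness in (0, 1]; higher means the concepts sit
    # closer together in the WordNet hypernym hierarchy.
    synset_a = wn.synsets(concept_a)[0]
    synset_b = wn.synsets(concept_b)[0]
    return synset_a.path_similarity(synset_b)

# "bus" sits closer to "car" than to "ball", so a new "bus" classifier would
# be built by freezing the "car" classifier rather than fine-tuning a
# generic pre-trained model.
print(relatedness('bus', 'car'), relatedness('bus', 'ball'))
```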
In the Training Data Construction model, if a subscriber subscribes to a class which is not present in any of the smaller object detection datasets (Pascal VOC [15] and Microsoft COCO [28]), then a classifier can be constructed by fetching data from datasets with more classes (ImageNet [13] and OID [23]) using online tools like ImageNet-Utils¹ and OIDv4_ToolKit². Another common approach to online training data construction is to use engines like “Google Images” or the “Bing Image Search API” to search for class names and download images.
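As a sketch of that second approach, assuming `image_urls` has already been produced by one of the tools or search APIs above (the directory layout and naming are our own choices):

```python
import pathlib
import requests

def build_training_set(concept, image_urls, out_dir="training_data"):
    # Save candidate training images under training_data/<concept>/ so they
    # can be annotated and fed to the classifier construction model.
    target = pathlib.Path(out_dir) / concept
    target.mkdir(parents=True, exist_ok=True)
    for i, url in enumerate(image_urls):
        response = requests.get(url, timeout=10)
        if response.ok:
            (target / f"{concept}_{i:05d}.jpg").write_bytes(response.content)
```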
Feature Extraction of Multimedia Events is responsible for the detection of objects in image events using current deep neural network-based object detection models and for incorporating new classifiers. Here we utilize image classification models [18, 37, 45] in the backbone network of the object detection models.
Shared Resources component consists of existing image processing modules and training datasets. We use You Only Look Once (YOLO), Single Shot MultiBox Detector (SSD), and Focal Loss based Dense Object Detection (RetinaNet) as object detection models [27, 29, 38, 39]. We have some base classifiers trained offline using the established Pascal VOC dataset [15], which are used in constructing more classifiers using domain adaptation.

¹ https://github.com/tzutalin/ImageNet_Utils
² https://github.com/EscVM/OIDv4_ToolKit
4 EVALUATION
4.1 Performance With/Without Adaptation
The results of mean Average Precision (mAP) for response times from 0 to 30 min are shown in Table 1. In the case of the arrival of a completely new subscription (Case 2b in Section 3.1), all models are trained from scratch without the use of any pre-trained model. Here, RetinaNet performs better (mAP 0.21) than the other models, and SSD300 does not converge without a pre-trained model. The second and third rows indicate the performance of the proposed model when applying the domain adaptation techniques of fine-tuning/freezing (Case 2a in Section 3.1). The recorded frame rates on our resources for YOLOv3, SSD300, and RetinaNet are 114 fps, 21 fps, and 12 fps respectively, where fps represents the number of frames per second. It can be concluded that domain adaptation via freezing layers can provide acceptable performance (i.e., accuracy 92.74% with precision 0.50 using the YOLOv3 model) in such a short training time (30 min) compared to fine-tuning a pre-trained model, which is crucial to know before deciding between a pre-trained model and the nearest classifier.
4.2 Empirical Analysis for Domain Shift
We analyze the Transfer Loss, Accuracy, and Distribution Discrepancy metrics for domain adaptation. The “transfer loss” has been evaluated on four domain transfers (varying from closely related domains to unrelated domains), depicted in Fig. 3a. The transfer achieved by YOLOv3 is better than the other object detection models in the cases of the football to cricket ball and laptop to mango domain transfers. Here, the transfer loss only indicates how well the transfer works on multiple domains, and lower values are desired. However, the best transfer is achieved by the RetinaNet model on the transfer of the cat to the dog class. Similarly, the single shot detector (SSD) model achieves its best on the transfer of car to bus. Interestingly, the values of transfer loss using the SSD and RetinaNet models on the other domain transfers are quite high, which led us to evaluate accuracy on these domain adaptations.

Table 1: mean Average Precision (mAP) on Initial (α = 0) and Final (β = 30 min) Response-Time With/Without Adaptation

| Training Method | YOLOv3 α | YOLOv3 β | SSD300 α | SSD300 β | RetinaNet α | RetinaNet β |
| --- | --- | --- | --- | --- | --- | --- |
| Training from Scratch | 0.01 | 0.07 | 0.00 | 0.00 | 0.09 | 0.21 |
| Fine-Tuning ImageNet | 0.00 | 0.12 | 0.05 | 0.17 | 0.27 | 0.36 |
| Freezing Similar Classifier | 0.15 | 0.50 | 0.14 | 0.16 | 0.17 | 0.17 |
The accuracy achieved by the object detection models on the same classes of domain transfers is shown in Fig. 3b. It can be clearly seen that all object detection models are able to provide high accuracy when applying transfer learning techniques; however, YOLOv3 achieves the best accuracy on all domain transfers.
In order to estimate the variation in approximate distance (i.e., Distribution Discrepancy) among different domains, we have trained a few binary classifiers that classify source-target pairs of classes like cat and dog, car and bus, etc. It can be seen in the results (Fig. 3c) that the distribution discrepancy (lower is better) for YOLOv3 is relatively smaller across most of the domain transfers than for the other object detection models, which suggests that the YOLOv3 neural network closes the cross-domain gap more effectively, and which also explains its better accuracy compared to the other object detection models.
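For reference, the distribution discrepancy plotted in Fig. 3c is conventionally the proxy A-distance derived from the test error ε of such a binary source-vs-target classifier; we assume the standard convention here:

d̂_A = 2(1 − 2ε)

so ε ≈ 0.5 (domains indistinguishable) gives a distance near 0, while ε ≈ 0 (perfectly separable domains) gives the maximum value of 2.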
4.3 Evaluations on Known/Unknown Domains
As the results of high performance and domain shift are in favor of YOLOv3 with the freezing-layer-based transfer learning technique, we have selected YOLOv3 as the object detection model for performing further experiments on the unknown subscriptions. Table 2 provides a comparison of the average accuracy and response time of the Adaptive Multimedia Event Processing model with existing domain-specific models, considering their best performance. It can be observed that existing multimedia event recognition models are designed only for the detection of specific objects and answer such known subscriptions in low response time, while failing to process any unknown subscription. The average response time of our approach for known subscriptions depends only on the testing time (0.01 min) and accuracy (98.53%) of the object detection model. However, the response time for an unknown subscription includes training (presently 30 min) via domain adaptation and achieves an accuracy of 95.14%.
5 CONCLUSION AND FUTURE WORK
This paper analyzed the problem of processing multimedia events (specifically object detection) for known/unknown subscriptions/concepts while minimizing the response time. We proposed a multimedia event processing model with domain adaptation by utilizing transfer learning-based techniques (fine-tuning and freezing) for the online training of neural network-based models. Experiments on current models evaluated the performance at low response times, along with an empirical analysis of domain shift. The proposed system can achieve accuracy ranging from 95.14% to 98.53% within 0.01 min to 30 min of response time using YOLOv3, even when subscriptions are unknown. In future work, it can be extended to unsupervised/semi-supervised learning to reduce the need for labeled data for new subscriptions.

Figure 3: Analysis for Domain Shift: (a) Transfer Loss, (b) Accuracy, (c) A-Distance

Table 2: Comparison of Proposed with Existing Model(s)

| Approach | Subscription | Response Time | Accuracy |
| --- | --- | --- | --- |
| Vehicle Detection for Traffic [47] | Known | 0.001 min | 97.3% |
| Vehicle Detection for Traffic [47] | Unknown | N/A | 0% |
| Firearm Detection for Security [25] | Known | 0.0001 min | 94.00% |
| Firearm Detection for Security [25] | Unknown | N/A | 0% |
| Stolen Object Detection [41] | Known | 0.0007 min | 93.58% |
| Stolen Object Detection [41] | Unknown | N/A | 0% |
| Car Parking Vacancy Detection [22] | Known | 0.17 min | 97.9% |
| Car Parking Vacancy Detection [22] | Unknown | N/A | 0% |
| Adaptive Multimedia Event Processing Model | Known | 0.01 min | 98.53% |
| Adaptive Multimedia Event Processing Model | Unknown | 29.99 min | 95.14% |
ACKNOWLEDGMENTS
This work was supported by Science Foundation Ireland under grant SFI/12/RC/2289_P2. The Titan Xp GPU used was donated by NVIDIA.
REFERENCES
[1] Sheharyar Ahmad, Kashif Ahmad, Nasir Ahmad, and Nicola Conci. 2017. Convolutional Neural Networks for Disaster Images Retrieval. In MediaEval.
[2] Sufyan Almajali, I Dhiah el Diehn, Haythem Bany Salameh, Moussa Ayyash, and Hany Elgala. 2018. A distributed multi-layer MEC-cloud architecture for processing large scale IoT-based multimedia applications. Multimedia Tools and Applications (2018), 1–22.
[3] Sheeraz A Alvi, Bilal Afzal, Ghalib A Shah, Luigi Atzori, and Waqar Mahmood. 2015. Internet of multimedia things: Vision and challenges. Ad Hoc Networks 33 (2015), 87–111.
[4] Asra Aslam. 2020. Object Detection for Unseen Domains while Reducing Response Time using Knowledge Transfer in Multimedia Event Processing. In Proceedings of the 2020 ACM International Conference on Multimedia Retrieval (ICMR).
[5] Asra Aslam and Edward Curry. 2018. Towards a Generalized Approach for Deep Neural Network Based Event Processing for the Internet of Multimedia Things. IEEE Access 6 (2018), 25573–25587.
[6] Asra Aslam, Souleiman Hasan, and Edward Curry. 2017. Challenges with image event processing: Poster. In Proceedings of the 11th ACM International Conference on Distributed and Event-based Systems. 347–348.
[7] Noboru Babaguchi, Yoshihiko Kawai, and Tadahiro Kitahashi. 2002. Event based indexing of broadcasted sports video by intermodal collaboration. IEEE Transactions on Multimedia 4, 1 (2002), 68–75.
[8] Oscar Beijbom. 2012. Domain adaptations for computer vision applications. arXiv preprint arXiv:1211.4860 (2012).
[9] Yoshua Bengio. 2012. Deep learning of representations for unsupervised and transfer learning. In Proceedings of ICML Workshop on Unsupervised and Transfer Learning. 17–36.
[10] Yuhua Chen, Wen Li, Christos Sakaridis, Dengxin Dai, and Luc Van Gool. 2018. Domain adaptive faster R-CNN for object detection in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3339–3348.
[11] Gabriela Csurka. 2017. Domain adaptation for visual applications: A comprehensive survey. arXiv preprint arXiv:1702.05374 (2017).
[12] Gianpaolo Cugola and Alessandro Margara. 2012. Processing flows of information: From data stream to complex event processing. ACM Computing Surveys (CSUR) 44, 3 (2012), 15.
[13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 248–255.
[14] Patrick Th Eugster, Pascal A Felber, Rachid Guerraoui, and Anne-Marie Kermarrec. 2003. The many faces of publish/subscribe. ACM Computing Surveys (CSUR) 35, 2 (2003), 114–131.
[15] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. 2010. The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision 88, 2 (June 2010), 303–338.
[16] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17, 1 (2016), 2096–2030.
[17] Holger Glasl, David Schreiber, Nikolaus Viertl, Stephan Veigl, and Gustavo Fernandez. 2008. Video based traffic congestion prediction on an embedded system. In Intelligent Transportation Systems, 2008. ITSC 2008. 11th International IEEE Conference on. IEEE, 950–955.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[19] Judith Hoffman. 2016. Adaptive learning algorithms for transferable visual recognition. University of California, Berkeley.
[20] Judy Hoffman, Sergio Guadarrama, Eric S Tzeng, Ronghang Hu, Jeff Donahue, Ross Girshick, Trevor Darrell, and Kate Saenko. 2014. LSDA: Large scale detection through adaptation. In Advances in Neural Information Processing Systems. 3536–3544.
[21] Ling Hu and Qiang Ni. 2017. IoT-driven automated object detection algorithm for urban surveillance systems in smart cities. IEEE Internet of Things Journal 5, 2 (2017), 747–754.
[22] Jermsak Jermsurawong, Mian Umair Ahsan, Abdulhamid Haidar, Haiwei Dong, and Nikolaos Mavridis. 2012. Car parking vacancy detection and its application in 24-hour statistical analysis. In 2012 10th International Conference on Frontiers of Information Technology. IEEE, 84–90.
[23] Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Andreas Veit, et al. 2017. OpenImages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages 2 (2017), 3.
[24] Malaram Kumhar, Gaurang Raval, and Vishal Parikh. 2019. Quality Evaluation Model for Multimedia Internet of Things (MIoT) Applications: Challenges and Research Directions. In International Conference on Internet of Things and Connected Technologies. Springer, 330–336.
[25] Mikolaj E Kundegorski, Samet Akçay, Michael Devereux, Andre Mouton, and Toby P Breckon. 2016. On using feature descriptors as visual words for object detection within x-ray baggage security screening. (2016).
[26] Ching-Hao Lai and Chia-Chen Yu. 2010. An efficient real-time traffic sign recognition system for intelligent vehicles with smart phones. In Technologies and Applications of Artificial Intelligence (TAAI), 2010 International Conference on. IEEE, 195–202.
[27] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision. 2980–2988.
[28] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision. Springer, 740–755.
[29] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. 2016. SSD: Single shot multibox detector. In European Conference on Computer Vision. Springer, 21–37.
[30] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I Jordan. 2015. Learning transferable features with deep adaptation networks. arXiv preprint arXiv:1502.02791 (2015).
[31] Laura Lopez-Fuentes, Joost van de Weijer, Marc Bolanos, and Harald Skinnemoen. 2017. Multi-modal Deep Learning Approach for Flood Detection. In MediaEval.
[32] Badri Mohapatra and Prangya Prava Panda. 2019. Machine learning applications to smart city. ACCENTS Transactions on Image Processing and Computer Vision 4 (14) (Feb 2019). https://doi.org/10.19101/TIPCV.2018.412004
[33] Pirkko Mustamo. 2018. Object detection in sports: TensorFlow Object Detection API case study. University of Oulu.
[34] Ali Nauman, Yazdan Ahmad Qadri, Muhammad Amjad, Yousaf Bin Zikria, Muhammad Khalil Afzal, and Sung Won Kim. 2020. Multimedia Internet of Things: A Comprehensive Survey. IEEE Access 8 (2020), 8202–8250.
[35] Sinno Jialin Pan and Qiang Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22, 10 (2010), 1345–1359.
[36] Ted Pedersen, Siddharth Patwardhan, and Jason Michelizzi. 2004. WordNet::Similarity: measuring the relatedness of concepts. In Demonstration Papers at HLT-NAACL 2004. Association for Computational Linguistics, 38–41.
[37] Joseph Redmon. 2013–2016. Darknet: Open Source Neural Networks in C. http://pjreddie.com/darknet/.
[38] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 779–788.
[39] Joseph Redmon and Ali Farhadi. 2016. YOLO9000: Better, Faster, Stronger. arXiv preprint arXiv:1612.08242 (2016).
[40] Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. 2010. Adapting visual category models to new domains. In European Conference on Computer Vision. Springer, 213–226.
[41] Juan Carlos San Miguel and José M Martínez. 2008. Robust unattended and stolen object detection by fusing simple algorithms. In 2008 IEEE Fifth International Conference on Advanced Video and Signal Based Surveillance. IEEE, 18–25.
[42] Kah Phooi Seng and Li-Minn Ang. 2018. A Big Data Layered Architecture and Functional Units for the Multimedia Internet of Things (MIoT). IEEE Transactions on Multi-Scale Computing Systems (2018).
[43] Chiao-Fe Shu, Arun Hampapur, Max Lu, Lisa Brown, Jonathan Connell, Andrew Senior, and Yingli Tian. 2005. IBM smart surveillance system (S3): an open and extensible framework for event based surveillance. In Advanced Video and Signal Based Surveillance, 2005. AVSS 2005. IEEE Conference on. IEEE, 318–323.
[44] Javier Silvestre-Blanes, Víctor Sempere-Payá, and Teresa Albero-Albero. 2020. Smart Sensor Architectures for Multimedia Sensing in IoMT. Sensors 20, 5 (2020), 1400.
[45] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[46] Baochen Sun, Jiashi Feng, and Kate Saenko. 2016. Return of frustratingly easy domain adaptation. In AAAI, Vol. 6. 8.
[47] Yong Tang, Congzhe Zhang, Renshu Gu, Peng Li, and Bin Yang. 2017. Vehicle detection and recognition for intelligent traffic surveillance system. Multimedia Tools and Applications 76, 4 (2017), 5817–5832.
[48] Limin Wang, Zhe Wang, Yu Qiao, and Luc Van Gool. 2018. Transferring deep object and scene representations for event recognition in still images. International Journal of Computer Vision 126, 2-4 (2018), 390–409.
[49] Mei Wang and Weihong Deng. 2018. Deep visual domain adaptation: A survey. Neurocomputing 312 (2018), 135–153.
[50] Xiu-Shen Wei, Bin-Bin Gao, and Jianxin Wu. 2015. Deep spatial pyramid ensemble for cultural event recognition. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 38–44.
[51] Piyush Yadav and Edward Curry. 2019. VidCEP: Complex Event Processing Framework to Detect Spatiotemporal Patterns in Video Streams. In 2019 IEEE International Conference on Big Data (Big Data). IEEE, 2513–2522.
[52] Yuhao Zhang and Arun Kumar. 2019. Panorama: a data system for unbounded vocabulary querying over video. Proceedings of the VLDB Endowment 13, 4 (2019), 477–491.