
# Object Detection for Unseen Domains while Reducing Response Time using Knowledge Transfer in Multimedia Event Processing

Asra Aslam
(Supervised by Edward Curry)
Insight Centre for Data Analytics,
NUI Galway, Ireland
asra.aslam@insight-centre.org
ABSTRACT
Event recognition is one of the popular areas of smart-city research and has attracted great attention from researchers. Since the Internet of Things (IoT) is mainly focused on scalar data events, research is shifting towards the Internet of Multimedia Things (IoMT), which is still in its infancy. Presently, multimedia event-based solutions provide low response-time, but they are domain-specific and can handle only familiar classes (bounded vocabulary). However, multiple applications within smart cities may require processing of numerous familiar as well as unseen concepts (unbounded vocabulary) in the form of subscriptions. Deep neural network-based techniques are popular for image recognition, but they require classifiers to be trained for unseen concepts and need images annotated with bounding boxes. In this work, we explore the problem of training classifiers for unseen/unknown classes while reducing the response-time of multimedia event processing (specifically object detection). We propose two domain adaptation based models that leverage Transfer Learning (TL) and Large Scale Detection through Adaptation (LSDA). The preliminary results show that the proposed framework can achieve 0.5 mAP (mean Average Precision) within 30 min of response-time for unseen concepts. We expect to improve it further using a modified LSDA that applies the fastest classification (MobileNet) and detection (YOLOv3) networks, along with eliminating the requirement of annotated bounding boxes.
CCS CONCEPTS
• Information systems → Multimedia streaming; • Computing methodologies → Neural networks; • Software and its engineering → Publish-subscribe / event-based architectures.
KEYWORDS
Domain Adaptation, Internet of Multimedia Things, Event-Based
Systems, Object Detection, Transfer Learning, Smart Cities
ACM Reference Format:
Asra Aslam. 2020. Object Detection for Unseen Domains while Reducing Response Time using Knowledge Transfer in Multimedia Event Processing. In Proceedings of the 2020 International Conference on Multimedia Retrieval (ICMR ’20), June 8–11, 2020, Dublin, Ireland. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3372278.3391936

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ICMR ’20, June 8–11, 2020, Dublin, Ireland
ACM ISBN 978-1-4503-7087-5/20/06...\$15.00
https://doi.org/10.1145/3372278.3391936

Figure 1: Architecture for Multimedia Event Processing
1 INTRODUCTION
The revision of the Internet of Things (IoT) for multimedia communication, in the form of the Internet of Multimedia Things (IoMT), is becoming popular in smart cities due to the abrupt increase in multimedia traffic in the last decade [2, 3, 25, 33, 42]. IoT middleware is responsible for providing common services to applications and eases the development process, but it is still applied only to scalar data events [36]. Thus event-based systems cannot natively include multimedia events produced by IoMT-generated data. On the other hand, existing multimedia event-based approaches exhibit high performance but are applicable only to specific domains (like public transportation, surveillance systems, parking management, sports events, etc.), and hence can handle only familiar classes (i.e. they have a bounded/closed vocabulary) [50]. Moreover, the need for training data for unseen categories/classes is also recognized in the literature [17, 18, 44] for Deep Neural Network (DNN) based models, with no consideration for the training time of classifiers. It is not clear in existing approaches how to deal with multiple multimedia applications which may require handling multiple types of subscriptions (i.e. unbounded vocabulary) while providing low response-time.
In this work we utilize the multimedia event processing model of our published work [4–6] and remove the limitation of requiring already-trained classifiers by using the adaptive framework shown in Fig. 1. We propose two different approaches for training DNN based classifiers to recognize unseen/new classes while minimizing the response-time. In the first one, we utilize Transfer Learning (TL) from pre-trained models, which reduces response-time and requires images with bounding boxes. In the second approach, we present a modified version of Large Scale Detection through Adaptation (LSDA), where we expect to reduce response-time and require only image labels without bounding boxes. Our experiments evaluate the first approach using current object detection models and demonstrate that the proposed framework can achieve 0.5 mAP within a 30 min response-time for unseen concepts. However, this approach requires object-level annotations (i.e. annotated bounding boxes). Since image-level annotations (i.e. without bounding boxes) are comparatively easy to acquire, we intend to improve it further using our second approach of modified LSDA, where we will not require bounding boxes for the training of classifiers and expect a decrease in response-time.
The contributions of our work include: formulating the problem of processing multimedia events without detection data while minimizing response-time; an adaptive architecture for the detection of unseen (new) domains; an approach to convert a classifier into a detector with and without bounding box annotations; and evaluations on object detection models to identify the best performance that can be provided in terms of both accuracy and response-time.
2 STATE OF THE ART
Image recognition [20, 31] is now able to detect multimedia events when a camera senses a certain object and to send real-time alerts via text/image messages to users. Scenarios of multimedia events include traffic congestion, road accidents, changes in weather, terrorist attacks, parking problems, security, etc. Event processing systems are designed to process the subscription of a user, expressed in a standard language, in response to events, and mainly comprise three entities: events, subscriptions, and a matcher [12, 48]. Presently, event processing systems focus only on structured (scalar) events for the processing of user subscriptions [9, 15, 49]. In the context of multimedia, “an event is the representation of a change of state in a multimedia item planned and attended” [1]. Existing multimedia event processing systems [8, 22, 27, 32] generally provide high performance but have domain-specific characteristics with bounded vocabulary and high setup cost, and cannot be adapted to multiple domains.
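
The three entities named above (events, subscriptions, and a matcher) can be illustrated with a minimal sketch. It is only an illustration under stated assumptions: the class names, fields, and the `match` function are hypothetical and are not taken from the systems cited here. An event is assumed to carry the set of object classes detected in a frame, and a subscription is assumed to name a single object class.

```python
from dataclasses import dataclass, field


@dataclass
class Subscription:
    subscriber: str    # hypothetical subscriber identifier
    object_class: str  # class the subscriber wants to be notified about


@dataclass
class Event:
    source: str                                 # hypothetical camera identifier
    detected: set = field(default_factory=set)  # classes detected in the frame


def match(event, subscriptions):
    """Return the subscribers whose requested class appears in the event."""
    return [s.subscriber for s in subscriptions if s.object_class in event.detected]


subs = [Subscription("alice", "car"), Subscription("bob", "dog")]
evt = Event(source="camera-1", detected={"car", "bus"})
notified = match(evt, subs)  # only "alice" subscribed to a detected class
```

In this toy setting a subscription for a class that no detector has been trained on can never match, which is the bounded-vocabulary limitation discussed in this section.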
Domain adaptation is the ability to utilize the knowledge of existing domains to identify unknown domains [7, 11, 47]. Since transfer learning makes machine learning algorithms more efficient, knowledge transfer is becoming more desirable in applications of machine learning based techniques [46]. Existing approaches [34] in transfer learning focus on the generalization ability of classifiers to enhance accuracy, but not specifically on the overall response-time. Domain adaptation based methods are also presented in the literature [10, 21, 41] for transferring the appearance models of familiar objects to an unseen object. They are motivated by the fact that annotating bounding boxes for unseen/unknown classes is an expensive and time-consuming process. They further demonstrate that it is possible to transfer an appearance model from one object class to another based on a semantic and/or visual relationship. However, in these works domain shift refers to changes in viewpoint, weather conditions, backgrounds, image quality, sketches, etc.
Object recognition has been an area of extensive research for a long time, resulting in several successful DNN based object detection models. One of the major challenges in training object detection models is the need to collect a large number of images with bounding box annotations. Large Scale Detection through Adaptation (LSDA) [17, 18] is a method that learns the difference between the two tasks (classification and detection) and transfers this knowledge to classifiers to turn them into detectors. Further research [44, 45] incorporates knowledge of object similarities from the visual as well as the semantic domain into the transfer process to improve on LSDA. The motivation behind this work is that visually and semantically similar categories should provide more accurate detectors than dissimilar categories; e.g. a better detector can be constructed for the cat class by learning the differences between a dog classifier and a dog detector than by learning from the violin classifier and detector. However, these approaches do not consider the long training time of classifiers, which increases the overall response-time.
3 PROBLEM FORMULATION
Our work aims to answer the question: can we answer online user queries consisting of familiar (bounded vocabulary) as well as unseen (unbounded vocabulary) subscriptions that involve processing of multimedia events, while achieving high accuracy and minimizing the response-time, where bounding box annotations may or may not be available for the training of classifiers? Here, “Response-Time” is the difference between the time at which the subscription arrives and the time at which the system is ready to notify the subscriber (shown in Fig. 1). Denoting the domain adaptation time by $t_{da}$, the data collection time by $t_{dc}$, and the testing time by $t_t$, we can formally define the response-time ($t_{rt}$) as:

$$t_{rt} = t_{da} + t_{dc} + t_t \tag{1}$$
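
As a worked illustration of Equation (1), the sketch below sums the three components and checks the result against a response-time budget. The component values and the 30 min budget check are hypothetical inputs chosen only for the example; they are not measurements from this work.

```python
def response_time(t_da, t_dc, t_t):
    """Equation (1): response time as the sum of its three components (minutes)."""
    for name, value in (("t_da", t_da), ("t_dc", t_dc), ("t_t", t_t)):
        if value < 0:
            raise ValueError(f"{name} must be non-negative")
    return t_da + t_dc + t_t


# Hypothetical component times in minutes (not measured values).
t_rt = response_time(t_da=25.0, t_dc=4.0, t_t=1.0)
within_budget = t_rt <= 30.0  # compare against an assumed 30 min budget
```

Any reduction in data collection time (for example, by not needing bounding boxes) lowers $t_{rt}$ directly, since the three terms are additive.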
Based on the availability of bounding box annotations for unseen subscriptions, our problem consists of two scenarios:
Scenario-1: Object-Level Annotations Available: This scenario assumes we can collect images with bounding box annotations by using object detection datasets (like Pascal VOC, Microsoft COCO, Open Images Dataset, etc. [14, 23, 29]) or online data collection toolkits¹. However, all of these object detection datasets have a bounded vocabulary consisting of 20, 80, and 600 classes respectively, and thus it is not possible to provide bounding box labels for thousands, or millions, of classes. Moreover, it is much easier to provide image-level annotations, which directs us to Scenario-2.
Scenario-2: Image-Level Annotations Available: In this case, we assume only image labels are available, with no bounding boxes. Such image-level annotations can be obtained using the ImageNet [13] online data collection toolkit² or using image tags on any image web search (e.g. Flickr, Google, and/or Bing).
¹ https://github.com/EscVM/OIDv4_ToolKit
² https://github.com/tzutalin/ImageNet_Utils
Figure 2: Proposed Models (where Bounding Box Annotations may or may not be available). (a) Model-1 for Object-Level Annotations; (b) Model-2 for Image-Level Annotations (No Bounding Boxes).
4 PROPOSED APPROACHES
In this section, we introduce two models (shown in Fig. 2) to address the specified problem. In the first model we utilize transfer learning based domain adaptation techniques and DNN based object detection models to minimize response-time and increase accuracy. In the second model we intend to incorporate LSDA and modify it using MobileNet [19] as the classification model and YOLOv3 [38, 39] as the object detection model. We use visual and semantic knowledge in both models to identify which class is suitable for knowledge transfer.
4.1 Object-Level Annotations Available
Suppose we have a detector for the class “dog” available; how can we then answer a subscription for the class “cat” when we can collect annotated bounding boxes for cat? This represents Scenario-1 (Section 3), where the cat class is previously unseen by the multimedia event processing model. Here, the first step (shown in Fig. 2a) is to utilize visual and semantic knowledge to identify which class (e.g. dog, tiger, piano) among the familiar classes of the model is most suitable for knowledge transfer to construct a detector for the unfamiliar (unseen) class (cat). The inputs to the next step are the chosen source (dog) class detector and the available training data (with annotated bounding boxes) for the destination class (cat). We then convert the source detector into the destination detector using domain adaptation based transfer learning techniques. We use two techniques for transferring knowledge: (i) fine-tuning pre-trained models, and (ii) freezing the backbone [16, 37, 43] of similar classifiers while training only the top layers. We apply the fine-tuning technique only when we do not find any source classifier whose similarity to the destination class exceeds a threshold; in that case we start from a model pre-trained on ImageNet [13]. In the second technique, we instantiate the network of the destination detector using the weights of the source detector and then freeze the backbone (convolutional and pooling layers) while training only the top fully connected layers, with softmax as the output layer. Finally, we expect to construct a more accurate destination class detector with reduced training time.
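
The decision logic of Model-1 described above can be sketched as follows. This is a simplified illustration, not the implementation used in this work: the function names, the strategy labels, the threshold value, the similarity scores, and the toy layer list are all hypothetical. It picks the most similar familiar class, falls back to fine-tuning a pre-trained model when no class reaches the threshold, and marks which layers would remain trainable when the backbone is frozen.

```python
BACKBONE_KINDS = {"conv", "pool"}  # layer kinds treated as backbone in this sketch


def choose_strategy(similarities, threshold=0.1):
    """Pick the source class and transfer technique for an unseen class.

    similarities maps each familiar class to its similarity with the unseen class.
    The threshold value is an assumption made for this example.
    """
    if not similarities:
        return None, "fine-tune-pretrained"
    source = max(similarities, key=similarities.get)
    if similarities[source] >= threshold:
        return source, "freeze-backbone"
    return None, "fine-tune-pretrained"


def trainable_layers(layers, strategy):
    """Return the names of layers that would be updated under the given strategy."""
    if strategy == "freeze-backbone":
        return [name for name, kind in layers if kind not in BACKBONE_KINDS]
    return [name for name, _ in layers]


# Hypothetical similarity scores of familiar classes to the unseen class "cat".
source, strategy = choose_strategy({"dog": 0.20, "piano": 0.05})

# Hypothetical, heavily simplified network description: (layer name, layer kind).
toy_network = [("conv1", "conv"), ("pool1", "pool"), ("fc1", "fc"), ("out", "softmax")]
to_train = trainable_layers(toy_network, strategy)
```

With these toy inputs the dog detector is selected as the source and only the top layers (`fc1`, `out`) are trained, which is where the reduction in training time is expected to come from.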
4.2 Image-Level Annotations Available
Consider a case where we have a classifier as well as a detector for the class “dog” available; how can we then answer a subscription for “cat” when we can collect only image-level labels, and not bounding box annotations, for cat? This addresses the problem of Scenario-2 (Section 3) and is also an important use case of existing models based on LSDA [18, 44, 45]. As stated, LSDA is one of the popular algorithms that learn to transform an image classifier into an object detector. For categories where object-level annotations are available, LSDA utilizes an AlexNet [24] pre-trained on ImageNet and fine-tuned into a detector with bounding boxes by utilizing the RCNN framework [40]. Finally, these classifiers serve as source classifiers for destination classes having only image-level annotations available, and are adapted into detectors by learning a category-specific transformation of the source model parameters. The proposed Model-2 for “Image-Level Annotations Available”, shown in Fig. 2b, follows the same framework as LSDA and the guidelines of Model-1. In this case we are free to collect image-level data from any web source such as Google, Bing, or Flickr images, or ImageNet Utils. This may add a construction step for the destination-class classifier. However, by combining it with our modified LSDA, we expect to obtain more accurate results in less response-time than conventional LSDA. The modified LSDA makes use of MobileNet in place of AlexNet for classification, and also takes advantage of much faster object detection models like YOLOv3 (as compared to RCNN in LSDA). Nevertheless, the extent of the performance improvement from Model-2 is still an open question, has associated risks, and can be validated only after further experimentation.
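
The idea of learning a category-specific transformation can be illustrated with a deliberately simplified sketch. It assumes that each class is represented by a small weight vector, and that the detector weights of a destination class are approximated by adding to its classifier weights the average classifier-to-detector offset observed on the source classes. The function name and the 2-dimensional weights are hypothetical; LSDA itself operates on the layers of a CNN and includes further adaptation steps not shown here.

```python
def adapt_classifier_to_detector(target_cls_w, source_pairs):
    """Shift target classifier weights by the mean classifier-to-detector offset.

    source_pairs holds (classifier_weights, detector_weights) for source classes
    that have both a classifier and a detector available.
    """
    if not source_pairs:
        raise ValueError("need at least one source class")
    dim = len(target_cls_w)
    offset = [0.0] * dim
    for cls_w, det_w in source_pairs:
        for i in range(dim):
            # accumulate the average difference between detector and classifier
            offset[i] += (det_w[i] - cls_w[i]) / len(source_pairs)
    return [w + d for w, d in zip(target_cls_w, offset)]


# Toy 2-dimensional weights (invented for illustration only).
dog_cls, dog_det = [1.0, 2.0], [1.5, 1.0]
cat_cls = [0.0, 4.0]
cat_det = adapt_classifier_to_detector(cat_cls, [(dog_cls, dog_det)])
```

Choosing source classes that are visually or semantically close to the destination class is what makes the averaged offset a plausible estimate; with dissimilar sources (such as violin for cat) the same arithmetic would still run but would be expected to transfer less useful knowledge.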
5 EVALUATIONS TO DATE
This section summarizes the results of our current work, which is in progress or published [4–6]. We compare the detection performance of the proposed model, with adaptation based techniques, across object detection models (YOLOv3, SSD300, and RetinaNet [28, 30, 38]) using the Pascal VOC and Open Images datasets [14, 26]. The results for mean Average Precision (mAP) using Model-1 are summarized in Table 1. The first row shows the detection performance for training (of 30 min) from scratch (i.e. without adaptation), achieving a maximum mAP of 0.21 on RetinaNet. The second row serves as a baseline for domain adaptation, where we use a similar classifier for the detection of an unseen class before any training (i.e. response-time = 0). Here, we assume a similar classifier is available for the unfamiliar/unseen class, so domain adaptation applies. The third row gives the results of fine-tuning a pre-trained model (trained on ImageNet [13]). Finally, the last row shows the detection performance obtained by freezing the backbone and training the top layers for 30 min of response-time. These experiments (with a best mAP of 0.50 on YOLOv3) indicate that the proposed Model-1 can achieve comparatively high accuracy in low response-time.
Table 1: Summarizing mean Average Precision (mAP) on Proposed Model

| Methods | Description of Detectors for Unseen Classes | YOLOv3 | SSD300 | RetinaNet |
|---|---|---|---|---|
| Detector without Adaptation | Construct classifier by training from scratch (assuming no similar classifier or pre-trained model available) | 0.0695 | 0.0000 | 0.2121 |
| Detector with Adaptation | Use similar (i.e. nearest) classifier for unseen class (baseline, as no training is required here, i.e. response-time = 0) | 0.0421 | 0.0635 | 0.0818 |
| Detector with Adaptation | Construct classifier by fine-tuning pre-trained model (presently ImageNet) | 0.1200 | 0.1685 | 0.3638 |
| Detector with Adaptation | Construct classifier by freezing backbone of similar (i.e. nearest) classifier | 0.5000 | 0.1557 | 0.1713 |
Table 2: Detection mAP on Specific Domain Transfers using different Domain Adaptation techniques

| Classes | Semantic Similarity Score | Method of Domain Adaptation | YOLOv3 | SSD300 | RetinaNet |
|---|---|---|---|---|---|
| Mango, Laptop | 0.08 | Laptop Detector tested for Mango (baseline) | NaN | 0.0047 | 0.0046 |
| Mango, Laptop | 0.08 | Mango Detector from pre-trained model | NaN | 0.1439 | 0.1667 |
| Mango, Laptop | 0.08 | Mango Detector from Laptop Detector | 0.2000 | 0.0818 | 0.0973 |
| Dog, Cat | 0.20 | Cat Detector tested for Dog (baseline) | 0.0000 | 0.2123 | 0.2446 |
| Dog, Cat | 0.20 | Dog Detector from pre-trained model | 0.5254 | 0.2120 | 0.2159 |
| Dog, Cat | 0.20 | Dog Detector from Cat Detector | 0.6875 | 0.2504 | 0.2307 |
| Cricket_Ball, Football | 0.33 | Football Detector tested for Cricket ball (baseline) | 0.0000 | 0.0000 | 0.0111 |
| Cricket_Ball, Football | 0.33 | Cricket ball Detector from pre-trained model | NaN | 0.00 | 0.0375 |
| Cricket_Ball, Football | 0.33 | Cricket ball Detector from Football Detector | 0.0000 | 0.0000 | 0.0120 |
| Bus, Car | 0.50 | Car Detector tested for Bus (baseline) | 0.1683 | 0.0371 | 0.0668 |
| Bus, Car | 0.50 | Bus Detector from pre-trained model | 0.7213 | 0.0938 | 0.1110 |
| Bus, Car | 0.50 | Bus Detector from Car Detector | 0.5821 | 0.1127 | 0.0808 |

NaN: Not a Number
Table 3: Detection mAP on Specific Subscriptions using different Object Detection Models

| Object Detection Models | Cat | Dog | Cricket_Ball | Mango | Car | Bus | Bicycle | Laptop | Football |
|---|---|---|---|---|---|---|---|---|---|
| YOLOv3 | 0.8181 | 0.6875 | 0.0000 | 0.2000 | 0.8763 | 0.5821 | 0.8227 | 0.2805 | 0.2857 |
| SSD300 | 0.1749 | 0.2504 | 0.0000 | 0.0818 | 0.6058 | 0.1127 | 0.1463 | 0.0000 | 0.0108 |
| RetinaNet | 0.2633 | 0.2307 | 0.0120 | 0.0973 | 0.6736 | 0.0808 | 0.3501 | 0.0000 | 0.0300 |
To determine relatedness among classes, we presently use WordNet [35] with the path measure. Table 2 shows four examples of domain transfers between classes with different similarity scores. It can be concluded for Model-1 that adapting from one domain (class) to another mostly yields higher performance in low response-time than fine-tuning detectors from a pre-trained model (like one trained on ImageNet). Here we also have a simple baseline where the nearest neighbor is used to detect an unseen class without training, resulting in low mAP but zero response-time.
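
A path-based relatedness score of the kind used above can be sketched on a toy taxonomy. The sketch assumes the common definition of path similarity as 1 / (1 + length of the shortest path between two concepts in the hypernym hierarchy). The `HYPERNYM` table is a small hypothetical fragment written for this example, not WordNet itself, and each concept is assumed to have a single parent.

```python
# Toy hypernym links (child -> parent); a hypothetical fragment, not WordNet itself.
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
    "piano": "instrument",
    "instrument": "artifact",
}


def ancestors(node):
    """Map the node and each of its ancestors to the distance (in edges) from node."""
    dist, d = {}, 0
    while node is not None:
        dist[node] = d
        node = HYPERNYM.get(node)
        d += 1
    return dist


def path_similarity(a, b):
    """1 / (1 + shortest path length); None when no common ancestor exists."""
    da, db = ancestors(a), ancestors(b)
    common = set(da) & set(db)
    if not common:
        return None
    shortest = min(da[n] + db[n] for n in common)
    return 1.0 / (1 + shortest)


score = path_similarity("dog", "cat")  # path dog-canine-carnivore-feline-cat has 4 edges
```

In this toy hierarchy dog and cat are four edges apart, giving a score of 0.2, while unrelated concepts such as dog and piano share no ancestor and return `None`. A score like this can then be compared against a threshold to decide whether a familiar class is close enough to serve as the source for knowledge transfer.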
Table 3 shows the overall performance of the proposed model on specific subscriptions belonging to different applications of smart cities. However, experiments with Model-2 (modified LSDA), which applies the fastest classification (MobileNet [19]) and detection (YOLOv3 [38]) networks, are still in the implementation phase. Lastly, we also aim to improve it further by incorporating visual knowledge transfer along with semantic knowledge in proposed Models 1 and 2.
6 CONCLUSION AND PLANNED WORK
In this work, we analyzed the problem of training classifiers on demand for previously unseen/unknown concepts, with the aim of reducing response-time for multimedia event processing. We presented two models for the domain adaptation of classifiers that leverage transfer learning (fine-tuning pre-trained models, and freezing the backbone while training top layers) and LSDA based techniques. We have conducted experiments using the first approach and achieved 0.5 mAP within 30 min of response-time.
As part of work in progress, the proposed approach for modified LSDA is in its implementation phase, and we are planning to compare it with our first model using existing object detection models with and without domain adaptation. Since the second method will no longer require bounding box annotations, it should reduce the online data collection/construction time, which contributes to response-time. However, we still have two open questions: (1) Will the second, LSDA based model be able to reduce response-time further by decreasing the overall training time? (2) What will be its impact on the accuracy of the unseen/weak class detector? To reach the best case for both questions, we aim to use MobileNet [19] and YOLOv3 [38] in the training network, which are the fastest classification and detection models known to date, while incorporating visual along with semantic knowledge transfer.
ACKNOWLEDGMENTS
This work was supported by Science Foundation Ireland under grant SFI/12/RC/2289_P2. The Titan Xp GPU used was donated by NVIDIA.
REFERENCES
[1] Kashif Ahmad and Nicola Conci. 2019. How Deep Features Have Improved Event Recognition in Multimedia: a Survey. ACM Transactions on Multimedia Computing Communications and Applications (January, 2019).
[2] Sufyan Almajali, I Dhiah el Diehn, Haythem Bany Salameh, Moussa Ayyash, and Hany Elgala. 2018. A distributed multi-layer MEC-cloud architecture for processing large scale IoT-based multimedia applications. Multimedia Tools and Applications (2018), 1–22.
[3] Sheeraz A Alvi, Bilal Afzal, Ghalib A Shah, Luigi Atzori, and Waqar Mahmood. 2015. Internet of multimedia things: Vision and challenges. Ad Hoc Networks 33 (2015), 87–111.
[4] Asra Aslam and Edward Curry. 2018. Towards a Generalized Approach for Deep Neural Network Based Event Processing for the Internet of Multimedia Things. IEEE Access 6 (2018), 25573–25587.
[5] Asra Aslam and Edward Curry. 2020. Reducing Response Time for Multimedia Event Processing using Domain Adaptation. In Accepted for Proceedings of the 2020 ACM on International Conference on Multimedia Retrieval (ICMR).
[6] Asra Aslam, Souleiman Hasan, and Edward Curry. 2017. Challenges with image event processing: Poster. In Proceedings of the 11th ACM International Conference on Distributed and Event-based Systems. 347–348.
[7] Oscar Beijbom. 2012. Domain adaptations for computer vision applications. arXiv preprint arXiv:1211.4860 (2012).
[8] Benjamin Bischke, Patrick Helber, Christian Schulze, Venkat Srinivasan, Andreas Dengel, and Damian Borth. 2017. The Multimedia Satellite Task at MediaEval 2017. In MediaEval.
[9] Antonio Carzaniga, David S Rosenblum, and Alexander L Wolf. 2000. Achieving scalability and expressiveness in an internet-scale event notification service. In Proceedings of ACM symposium on Principles of distributed computing. ACM, 219–227.
[10] Yuhua Chen, Wen Li, Christos Sakaridis, Dengxin Dai, and Luc Van Gool. 2018. Domain adaptive faster r-cnn for object detection in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3339–3348.
[11] Gabriela Csurka. 2017. Domain adaptation for visual applications: A comprehensive survey. arXiv preprint arXiv:1702.05374 (2017).
[12] Gianpaolo Cugola and Alessandro Margara. 2012. Processing flows of information: From data stream to complex event processing. ACM Computing Surveys (CSUR) 44, 3 (2012), 15.
[13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 248–255.
[14] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. 2010. The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision 88, 2 (June 2010), 303–338.
[15] Souleiman Hasan, Sean O’Riain, and Edward Curry. 2012. Approximate semantic matching of heterogeneous events. In Proceedings of the 6th ACM International Conference on Distributed Event-Based Systems. ACM, 252–263.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
[17] Judith Hoffman. 2016. Adaptive learning algorithms for transferable visual recognition. University of California, Berkeley.
[18] Judy Hoffman, Sergio Guadarrama, Eric S Tzeng, Ronghang Hu, Jeff Donahue, Ross Girshick, Trevor Darrell, and Kate Saenko. 2014. LSDA: Large scale detection 3544.
[19] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
[20] Ling Hu and Qiang Ni. 2017. IoT-driven automated object detection algorithm for urban surveillance systems in smart cities. IEEE Internet of Things Journal 5, 2 (2017), 747–754.
[21] Naoto Inoue, Ryosuke Furuta, Toshihiko Yamasaki, and Kiyoharu Aizawa. 2018. Cross-domain weakly-supervised object detection through progressive domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5001–5009.
[22] Jermsak Jermsurawong, Mian Umair Ahsan, Abdulhamid Haidar, Haiwei Dong, and Nikolaos Mavridis. 2012. Car parking vacancy detection and its application in 24-hour statistical analysis. In 2012 10th International Conference on Frontiers of Information Technology. IEEE, 84–90.
[23] Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Andreas Veit, et al. 2017. Openimages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages 2 (2017), 3.
[24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
[25] Malaram Kumhar, Gaurang Raval, and Vishal Parikh. 2019. Quality Evaluation Model for Multimedia Internet of Things (MIoT) Applications: Challenges and Research Directions. In International Conference on Internet of Things and Connected Technologies. Springer, 330–336.
[26] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, et al. 2018. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982 (2018).
[27] Ching-Hao Lai and Chia-Chen Yu. 2010. An efficient real-time traffic sign recognition system for intelligent vehicles with smart phones. In Technologies and Applications of Artificial Intelligence (TAAI), 2010 International Conference on. IEEE, 195–202.
[28] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision. 2980–2988.
[29] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In European conference on computer vision. Springer, 740–755.
[30] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. 2016. Ssd: Single shot multibox detector. In European conference on computer vision. Springer, 21–37.
[31] Badri Mohapatra and Prangya Prava Panda. 2019. Machine learning applications to smart city. ACCENTS Transactions on Image Processing and Computer Vision 4 (14) (Feb 2019). https://doi.org/10.19101/TIPCV.2018.412004
[32] Pirkko Mustamo. 2018. Object detection in sports: TensorFlow Object Detection API case study. University of Oulu.
[33] Muhammad Khalil Afzal, and Sung Won Kim. 2020. Multimedia Internet of Things: A Comprehensive Survey. IEEE Access 8 (2020), 8202–8250.
[34] Sinno Jialin Pan, Qiang Yang, et al. 2010. A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22, 10 (2010), 1345–1359.
[35] Ted Pedersen, Siddharth Patwardhan, and Jason Michelizzi. 2004. WordNet::Similarity: measuring the relatedness of concepts. In Demonstration papers at HLT-NAACL 2004. Association for Computational Linguistics, 38–41.
[36] Clarke. 2015. Middleware for internet of things: a survey. IEEE Internet of things journal 3, 1 (2015), 70–95.
[37] Joseph Redmon. 2013–2016. Darknet: Open Source Neural Networks in C. http://pjreddie.com/darknet/.
[38] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 779–788.
[39] Joseph Redmon and Ali Farhadi. 2018. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018).
[40] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems. 91–99.
[41] Mrigank Rochan and Yang Wang. 2015. Weakly supervised localization of novel objects using appearance transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4315–4324.
[42] Kah Phooi Seng and Li-Minn Ang. 2018. A Big Data Layered Architecture and Functional Units for the Multimedia Internet of Things (MIoT). IEEE Transactions on Multi-Scale Computing Systems (2018).
[43] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[44] Yuxing Tang, Josiah Wang, Boyang Gao, Emmanuel Dellandréa, Robert Gaizauskas, and Liming Chen. 2016. Large scale semi-supervised object detection using visual and semantic knowledge transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2119–2128.
[45] Yuxing Tang, Josiah Wang, Xiaofang Wang, Boyang Gao, Emmanuel Dellandréa, Robert Gaizauskas, and Liming Chen. 2017. Visual and semantic knowledge transfer for large scale semi-supervised object detection. IEEE transactions on pattern analysis and machine intelligence 40, 12 (2017), 3045–3058.
[46] Lisa Torrey and Jude Shavlik. 2010. Transfer learning. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques. IGI Global, 242–264.
[47] Mei Wang and Weihong Deng. 2018. Deep visual domain adaptation: A survey. Neurocomputing 312 (2018), 135–153.
[48] Piyush Yadav and Edward Curry. 2019. VidCEP: Complex Event Processing Framework to Detect Spatiotemporal Patterns in Video Streams. In 2019 IEEE International Conference on Big Data (Big Data). IEEE, 2513–2522.
[49] Liangzhao Zeng and Hui Lei. 2004. A semantic publish/subscribe system. In E-Commerce Technology for Dynamic E-Business, 2004. IEEE International Conference on. IEEE, 32–39.
[50] Yuhao Zhang and Arun Kumar. 2019. Panorama: a data system for unbounded vocabulary querying over video. Proceedings of the VLDB Endowment 13, 4 (2019), 477–491.
... Description: Alam et al. [72,235] have proposed a generalized DNN-based multimedia event processing approach. The work introduces multimedia event operators and uses DNN models to extract image features for event matching. ...
Thesis
Full-text available
With the evolution of the Internet of Things (IoT), there is an exponential rise in sensor devices that are deployed ubiquitously. Due to the extensive usage of IoT applications in smart cities, smart homes, self-driving cars, and social media, there is enormous growth in multimedia data streams like videos and images. We are now transitioning to an era of the Internet of Multimedia Things (IoMT), where unstructured data like videos are continuously streamed from visual sensors like CCTV cameras and smartphones. Video data is highly expressive and has traditionally been very difficult for a machine to interpret. Middleware systems such as Complex Event Processing (CEP) mine patterns from data streams and send notifications to users in a timely fashion. Current CEP systems have inherent limitations to mine event patterns from video streams due to their unstructured data model, lack of expressive query language, and resource intensiveness. This work introduces VidCEP, a data-driven, distributed, on the fly, near real-time complex event matching framework for video streams with five key contributions: 1) Expressive Video Event Query Language- Current event query languages are highly focused on temporal reasoning. A SQL-like declarative Video Event Query Language (VEQL) is proposed which enables state-based spatiotemporal video event matching without needing to focus on low-level video features. The VEQL enables the creation of robust spatiotemporal operators using a hybrid approach. 2) Structured Video Stream Representation- The work introduces Video Event Knowledge Graph (VEKG), a knowledge graph driven model of video data streams using an ensemble of deep learning models and VEQL operators. VEKG creates a semantic knowledge representation of video data by modeling video objects as nodes and their relationship interaction as edges over time and space. 
3) Query and State-aware stream summarization- Objects coexist across multiple frames, leading to the creation of redundant nodes and edges at different time instances that result in high memory usage and increased matching time. The work introduces two novel state and multi-query based spatiotemporal graph summarizations of VEKG streams- Time Aggregated Graph (VEKG-TAG) and Event Aggregated Graph (VEKG-EAG) for an efficient state-based event matching. 4) Resource constraint distributed content-driven windowing- An adaptive and content-driven windowing technique VID-WIN is proposed to improve Quality of Service (QoS). VID-WIN instances are deployed over edge and cloud to perform efficient state-based video event matching. The VID-WIN adopts resource and query-aware runtime optimization strategies to improve the CEP matching performance under limited available resources and application requirements. 5) VidCEP Complex Event Processing Framework: The above techniques are integrated into Video Complex Event Processing framework (VidCEP). A prototype of the framework is developed which performs query-based video event pattern matching in a distributed setting. The VidCEP capability is demonstrated using a real-work traffic estimation service for OpenStreetMap. A total of 18 event rules are defined from the traffic management and activity recognition domain. Extensive experiments have been performed across 20 datasets consisting of more than 3900 video clips to evaluate the performance and efficacy of proposed techniques. Results of this study show that VidCEP achieves a throughput of approximately 70 frames per second (fps) for five parallel streams (at 17fps) with sub-second matching latency. The system successfully detects different spatiotemporal video event patterns with good F-scores (0.51-0.90). The optimization techniques proposed in the thesis improves ~5X search time, ~2.3X throughput, and ~99% bandwidth savings.
... In such cases, we could improve accuracy for IoMT based data using semi-supervised or unsupervised models in different applications of smart cities [142,216,217]. [27,212,218,219]. These LSDA based methods are designed to construct classifiers on new concepts for which we do not have sufficient data or no-annotated data. ...
Article
Full-text available
An enormous amount of sensing devices (scalar or multimedia) collect and generate information (in the form of events) over the Internet of Things (IoT). Present research on IoT mainly focus on the processing of scalar sensor data events and barely considers the challenges posed by multimedia based events. In this paper, we systematically review the existing solutions available for the Internet of Multimedia Things (IoMT) by analyzing sensing, networking, service, and application-level services provided by IoT. We present state-of-the-art event-based middleware methods and their suitability for multimedia event processing methods. We observe that existing IoT event-based middleware solutions focus on structured (scalar) events and possess only domain-specific characteristics for unstructured (multimedia) events. A case study for object detection is also presented to demonstrate the requirements associated with the processing of multimedia events within smart cities, even with common image recognition based applications. In order to validate the existing issues in the detection of objects, we also presented an evaluation of object detection models using existing datasets. At the end of each section, we shed light on trends, gaps, and possible solutions based on our analysis, experiments, and review of the existing research. Finally, we summarize the challenges and future research directions for the generalized multimedia event processing (by taking detection of each and every object as an example) based on applications using IoMT. Our experiments demonstrate that existing models are very slow to respond to any unseen class, and existing rich datasets do not have a sufficient number of classes to meet the requirements of real-time applications of smart cities. 
We show that although there is a significantly large technical literature on IoT, and research on IoMT is also quite actively growing, there have not been much research efforts directed towards the processing of multimedia events. As an example, although deep learning techniques have been shown to achieve impressive performance in applications like image recognition, the methods are deficient in detecting new (previously unseen) objects for multimedia based applications in smart cities. In light of these facts, it becomes imperative to conduct research on bringing together the abilities of event-based middleware for IoMT, and low response-time based online training and adaptation techniques.
... An event recognition in still images by transferring objects and scene representations has been proposed in work [48], where the correlations of the concepts of object, scene, and events have been investigated. Similarly, large scale domain adaptation based approaches [4,10,19,20,40] are also introduced particularly for the detection of objects and it is desirable to bring their abilities to the core of multimedia event processing. ...
Article
Full-text available
The immense increase in multimedia-on-demand traffic that refers to audio, video, and images, has drastically shifted the vision of the Internet of Things (IoT) from scalar to Multimedia Internet of Things (M-IoT). IoT devices are constrained in terms of energy, computing, size, and storage memory. Delay-sensitive and bandwidth-hungry multimedia applications over constrained IoT networks require revision of IoT architecture for M-IoT. This paper provides a comprehensive survey of M-IoT with an emphasis on architecture, protocols, and applications. This article starts by providing a horizontal overview of the IoT. Then, we discuss the issues considering the characteristics of multimedia and provide a summary of related M-IoT architectures. Various multimedia applications supported by IoT are surveyed, and numerous use cases related to road traffic management, security, industry, and health are illustrated to show how different M-IoT applications are revolutionizing human life. We explore the importance of Quality-of-Experience (QoE) and Quality-of-Service (QoS) for multimedia transmission over IoT. Moreover, we explore the limitations of IoT for multimedia computing and present the relationship between the M-IoT and emerging technologies including event processing, feature extraction, cloud computing, Fog/Edge computing and Software-Defined-Networks (SDNs). We also present the need for better routing and Physical-Medium Access Control (PHY-MAC) protocols for M-IoT. Finally, we present a detailed discussion on the open research issues and several potential research areas related to emerging multimedia communication in IoT.
Conference Paper
Full-text available
Video data is highly expressive and has traditionally been very difficult for a machine to interpret. Querying event patterns from video streams is challenging due to its unstructured representation. Middleware systems such as Complex Event Processing (CEP) mine patterns from data streams and send notifications to users in a timely fashion. Current CEP systems have inherent limitations to query video streams due to their unstructured data model and lack of expressive query language. In this work, we focus on a CEP framework where users can define high-level expressive queries over videos to detect a range of spatiotemporal event patterns. In this context, we propose- i) VidCEP, an in-memory, on the fly, near real-time complex event matching framework for video streams. The system uses a graph-based event representation for video streams which enables the detection of high-level semantic concepts from video using cascades of Deep Neural Network models, ii) a Video Event Query language (VEQL) to express high-level user queries for video streams in CEP, iii) a complex event matcher to detect spatiotemporal video event patterns by matching expressive user queries over video data. The proposed approach detects spatiotemporal video event patterns with an F-score ranging from 0.66 to 0.89. VidCEP maintains near real-time performance with an average throughput of 70 frames per second for 5 parallel videos with sub-second matching latency.
Article
Full-text available
The basic need of human is increasing as they interact with different devices and also, they provide many feedbacks. Many smart devices generate high data and that can be retrieved and reviewed by humans. Applications are not fixed as it increases day to day life. Based on these data generated by different smart devices and smart city applications machine learning approach is the best adaptive solution. Rapid development in software, hardware with high speed internet connection provides large data to this physical world. The key contribution of this paper is a machine learning application survey towards smart city.
Article
Full-text available
Event recognition is one of the areas in multimedia that is attracting great attention of researchers. Being applicable in a wide range of applications, from personal to collective events, a number of interesting solutions for event recognition using multimedia information sources have been proposed. On the other hand, following their immense success in classification, object recognition and detection, deep learning has demonstrated to perform well also in event recognition tasks. Thus, a large portion of the literature on event analysis relies nowadays on deep learning architectures. In this paper, we provide an extensive overview of the existing literature in this field, analyzing how deep features and deep learning architectures have changed the performance of event recognition frameworks. The literature on event-based analysis of multimedia contents can be categorized into four groups, namely (i) event recognition in single images; (ii) event recognition in personal photo collections; (iii) event recognition in videos; and (iv) event recognition in audio recordings. In this paper, we extensively review different deep learning-based frameworks for event recognition in these four domains. Furthermore, we also review some benchmark datasets made available to the scientific community to validate novel event recognition pipelines. In the final part of the manuscript, we also provide a detailed discussion on basic insights gathered from the literature review, and identify future trends and challenges.
Article
We present Open Images V4, a dataset of 9.2M images with unified annotations for image classification, object detection and visual relationship detection. The images have a Creative Commons Attribution license that allows to share and adapt the material, and they have been collected from Flickr without a predefined list of class names or tags, leading to natural class statistics and avoiding an initial design bias. Open Images V4 offers large scale across several dimensions: 30.1M image-level labels for 19.8k concepts, 15.4M bounding boxes for 600 object classes, and 375k visual relationship annotations involving 57 classes. For object detection in particular, we provide $$15\times$$ more bounding boxes than the next largest datasets (15.4M boxes on 1.9M images). The images often show complex scenes with several objects (8 annotated objects per image on average). We annotated visual relationships between them, which support visual relationship detection, an emerging task that requires structured reasoning. We provide in-depth comprehensive statistics about the dataset, we validate the quality of the annotations, we study how the performance of several modern models evolves with increasing amounts of training data, and we demonstrate two applications made possible by having unified annotations of multiple types coexisting in the same images. We hope that the scale, quality, and variety of Open Images V4 will foster further research and innovation even beyond the areas of image classification, object detection, and visual relationship detection.
Article
Deep convolutional neural networks (CNNs) achieve state-of-the-art accuracy for many computer vision tasks. But using them for video monitoring applications incurs high computational cost and inference latency. Thus, recent works have studied how to improve system efficiency. But they largely focus on small "closed world" prediction vocabularies even though many applications in surveillance security, traffic analytics, etc. have an ever-growing set of target entities. We call this the "unbounded vocabulary" issue, and it is a key bottleneck for emerging video monitoring applications. We present the first data system for tacking this issue for video querying, Panorama. Our design philosophy is to build a unified and domain-agnostic system that lets application users generalize to unbounded vocabularies in an out-of-the-box manner without tedious manual re-training. To this end, we synthesize and innovate upon an array of techniques from the ML, vision, databases, and multimedia systems literature to devise a new system architecture. We also present techniques to ensure Panorama has high inference efficiency. Experiments with multiple real-world datasets show that Panorama can achieve between 2x to 20x higher efficiency than baseline approaches on in-vocabulary queries, while still yielding comparable accuracy and also generalizing well to unbounded vocabularies.
Article
In this paper, we address two issues by proposing a new architecture for the Multimedia Internet of Things (MIoT) with Big multimodal computation layer. We first introduce MIoT as a novel paradigm in which smart heterogeneous multimedia things can interact and cooperate with one another and with other things connected to the Internet to facilitate multimedia-based services and applications that are globally available to the users. The MIoT architecture consists of six layers. The computation layer is specially designed for Big multimodal analytics. This layer has four important functional units: Data Centralized Unit, Multimodal Data Aggregation Unit, Multimodal Data Divide & Conquer Computation Unit and Fusion & Decision Making Unit. A novel and highly scalable technique called the Divide & Conquer Principal Component Analysis (DC-PCA) for feature extraction in the divide and conquer mechanism is proposed to be used together with the Divide & Conquer Linear Discriminant Analysis (DC-LDA) for multimodal Big data analytics. Experiments are conducted to confirm the good performance of these techniques in the functional units of the Divide & Conquer computational mechanisms. The final section of the paper gives application on a camera sensing IoT platform and real-world data analytics on multicore architecture implementations.