Object Detection for Unseen Domains while Reducing Response Time using Knowledge Transfer in Multimedia Event Processing
Asra Aslam (Supervised by Edward Curry)
Insight Centre for Data Analytics, NUI Galway, Ireland
ABSTRACT
Event recognition is one of the popular areas of smart cities that has attracted great attention from researchers. Since the Internet of Things (IoT) is mainly focused on scalar data events, research is shifting towards the Internet of Multimedia Things (IoMT), which is still in its infancy. Present multimedia event-based solutions provide low response-time, but they are domain-specific and can handle only familiar classes (bounded vocabulary). However, multiple applications within smart cities may require processing of numerous familiar as well as unseen concepts (unbounded vocabulary) in the form of subscriptions. Deep neural network-based techniques are popular for image recognition, but have the limitation of training classifiers for unseen concepts as well as the requirement of annotated bounding boxes with images. In this work, we explore the problem of training classifiers for unseen/unknown classes while reducing the response-time of multimedia event processing (specifically object detection). We propose two domain adaptation based models leveraging Transfer Learning (TL) and Large Scale Detection through Adaptation (LSDA). The preliminary results show that the proposed framework can achieve 0.5 mAP (mean Average Precision) within 30 min of response-time for unseen concepts. We expect to improve it further using modified LSDA while applying the fastest classification (MobileNet) and detection (YOLOv3) networks, along with elimination of the requirement of annotated bounding boxes.
CCS CONCEPTS
• Information systems →; • Software and its engineering → Publish-subscribe / event-based architectures.
KEYWORDS
Domain Adaptation, Internet of Multimedia Things, Event-Based Systems, Object Detection, Transfer Learning, Smart Cities
ACM Reference Format:
Asra Aslam. 2020. Object Detection for Unseen Domains while Reducing
Response Time using Knowledge Transfer in Multimedia Event Processing.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from firstname.lastname@example.org.
ICMR ’20, June 8–11, 2020, Dublin, Ireland
©2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-7087-5/20/06. . . $15.00
Figure 1: Architecture for Multimedia Event Processing
In Proceedings of the 2020 International Conference on Multimedia Retrieval
(ICMR ’20), June 8–11, 2020, Dublin, Ireland. ACM, New York, NY, USA,
5 pages. https://doi.org/10.1145/3372278.3391936
1 INTRODUCTION
The revision of the Internet of Things (IoT) for multimedia communication in the form of the Internet of Multimedia Things (IoMT) is getting popular in smart cities due to the abrupt increase in multimedia traffic in the last decade [
]. IoT middleware is responsible for providing common services to applications and eases the development process, but it is still applied only to scalar data events [
]. Thus event-based systems cannot natively include multimedia events produced by IoMT-generated data. On the other hand, existing multimedia event-based approaches exhibit high performance but are applicable only to specific domains (like public transportation, surveillance systems, parking management, sports events, etc.), and hence can handle only familiar classes (with a bounded/closed vocabulary) [ ]. Moreover, the requirement of availability of training data for unseen categories/classes is also recognized in the literature [
] for Deep Neural Network (DNN) based models, with no consideration for the duration of classifier training time. It is not clear from existing approaches how to deal with multiple multimedia applications which may require handling of multiple types of subscriptions (i.e. unbounded vocabulary) while providing low response-time.
In this work, we utilize the multimedia event processing model of our published work [ ] and remove the limitation of availability of trained classifiers using the adaptive framework shown in Fig. 1. We propose two different approaches for the training of DNN based classifiers to recognize unseen/new classes while minimizing the response-time. In the first one, we utilize Transfer Learning (TL) from pre-trained models, which reduces response-time and requires images with bounding boxes. In the second approach, we present a modified version of Large Scale Detection through Adaptation (LSDA), where we expect to reduce response-time and require only image labels without bounding boxes. Our experiments evaluate the first approach using current object detection models and demonstrate that the proposed framework can achieve 0.5 mAP within a 30 min response-time for unseen concepts. However, this approach requires object-level annotations (i.e. annotated bounding boxes). Since image-level annotations (i.e. without bounding boxes) are comparatively easy to acquire, we intend to improve it further using our second approach of modified LSDA, where we will not require bounding boxes for the training of classifiers and expect a decrease in response-time.
The contributions of our work include: formulating the problem of processing multimedia events without detection data while minimizing response-time; an adaptive architecture for the detection of unseen (new) domains; an approach to convert a classifier into a detector with and without bounding box annotations; and evaluations on object detection models to identify the best performance that can be provided in terms of both accuracy and response time.
2 STATE OF THE ART
Image recognition [ ] is now able to detect multimedia events when a camera senses a certain object and send real-time alerts via text/image messages to users. Multiple scenarios of multimedia events could include traffic congestion, road accidents, changes in weather, terrorist attacks, parking problems, security, etc. Event processing systems are designed to process the subscriptions of a user based on standard languages in response to events, and mainly include the entities: events, subscriptions, and matcher [ ]. Presently, event processing systems are only focused on structured (scalar) events for the processing of subscriptions of a user [ ]. In the context of multimedia, “an event is the representation of a change of state in a multimedia item planned and attended” [ ]. Existing multimedia event processing systems [ ] provide high performance but have domain-specific characteristics with bounded vocabulary and high setup cost, and cannot be adapted to multiple domains.
Domain Adaptation is the ability to utilize the knowledge of existing domains to identify unknown domains [ ]. Since transfer learning makes machine learning algorithms more efficient, knowledge transfer is becoming more desirable within applications of machine learning based techniques [ ]. Existing approaches [ ] in transfer learning are focused on the generalization ability of classifiers for the enhancement of accuracy, but not the overall response-time specifically. Domain adaptation based methods are also presented in the literature [ ] for transferring the appearance models of familiar objects to an unseen object. They are also motivated by the fact that annotation of bounding boxes for unseen/unknown classes is an expensive and time-consuming process. They further demonstrate that it is possible to transfer an appearance model from one object class to another based on semantic and/or visual relationships. However, here domain shift represents changes in viewpoints, weather conditions, backgrounds, image quality, sketches, etc.
Object recognition has been an area of extensive research for a long time, resulting in several successful DNN based object detection models. One of the major challenges in training object detection models is the need to collect a large number of images with bounding box annotations. Large Scale Detection through Adaptation (LSDA) [ ] is a method that learns the difference between the two tasks (classification and detection) and transfers this knowledge to classifiers to turn them into detectors. Another work [ ] incorporates the knowledge of object similarities from visual as well as semantic domains into the transfer process to improve the previous work of LSDA. The motivation behind this work is that visually and semantically similar categories should provide more accurate detectors as compared to dissimilar categories; e.g. a better detector can be constructed for the cat class by learning the differences between a dog classifier and a dog detector than by learning from the violin class classifier and detector. However, these approaches do not consider the long training time of classifiers, which increases the overall response-time.
3 PROBLEM FORMULATION
Our work aims to answer the question: can we answer online user queries consisting of familiar (bounded vocabulary) as well as unseen subscriptions (unbounded vocabulary) that include processing of multimedia events, while achieving high accuracy and minimizing the response-time, where the training of classifiers may or may not have bounding box annotations available? Here, “Response-Time” is the time difference between the time a subscription arrived and the time at which the system is ready to notify the subscriber
(shown in Fig. 1). By assuming domain adaptation time as t_da, data collection time as t_dc, and testing time as t_t, we can formally define response time (t_rt) as:

t_rt = t_da + t_dc + t_t    (1)
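Eq. (1) is a simple additive decomposition; a minimal sketch (with hypothetical timings in minutes, purely for illustration):

```python
def response_time(t_da: float, t_dc: float, t_t: float) -> float:
    """Total response time = domain adaptation + data collection + testing, per Eq. (1)."""
    return t_da + t_dc + t_t

# e.g. 20 min of adaptation, 8 min of data collection, 2 min of testing
total = response_time(20.0, 8.0, 2.0)  # 30.0
```

Minimizing t_rt therefore means shrinking each term: t_da via transfer from a similar class, and t_dc by avoiding bounding-box collection.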
On the basis of availability of bounding box annotations for unseen subscriptions, our problem consists of two scenarios:
Scenario-1: Object-Level Annotations Available: In this case, we assume we can collect images with bounding box annotations by using object detection datasets (like Pascal VOC, Microsoft COCO, Open Images Dataset, etc. [ ]) or online data collection toolkits. However, all of these object detection datasets have a bounded vocabulary consisting of 20, 80, and 600 classes respectively, and thus it is not possible to provide bounding box labels for thousands, or millions, of classes. Moreover, it is much easier to provide image-level annotations, which directs us to Scenario-2.
Scenario-2: Image-Level Annotations Available: In this case, we assume only image labels are available with no bounding boxes. Such image-level annotations can be obtained using ImageNet [ ], an online data collection toolkit, or using image tags on any (i.e. Flickr, Google, and/or Bing) image web search.
(a) Model-1 for Object-Level Annotations (b) Model-2 for Image-Level Annotations (No Bounding Boxes)
Figure 2: Proposed Models (where Bounding Boxes Annotations may or may not be available)
4 PROPOSED APPROACHES
In this section, we introduce two models (shown in Fig. 2) to address the specified problem. In the first model, we utilize transfer learning based domain adaptation techniques and DNN based object detection models for minimizing response-time and increasing accuracy. In the second model, we intend to incorporate LSDA and modify it using MobileNet [ ] as the classification model and YOLOv3 [ ] as the object detection model. We use visual and semantic knowledge in both models to identify which class is suitable for knowledge transfer.
4.1 Object-Level Annotations Available
Suppose we have a detector for class “dog” available; then how can we answer the subscription class “cat” when we can collect annotated bounding boxes for cat? This represents Scenario-1 (Section 3), where the cat class is previously unseen for the multimedia event processing model. Here, the first step (shown in Fig. 2a) is to utilize visual and semantic knowledge to identify which class (e.g. dog, tiger, piano) among the familiar classes of the model is most suitable for knowledge transfer to construct a detector for the unfamiliar (unseen) class (cat). We include the chosen source class (dog) detector as well as the available training data (having annotated bounding boxes) for the destination class (cat). Then we convert the source-detector to a destination-detector using domain adaptation based transfer learning techniques. We use two techniques for transferring knowledge: (i) fine-tuning pre-trained models; (ii) freezing the backbone [ ] of similar classifiers while training only the top layers. We apply the fine-tuning technique only when we do not find any suitable source classifier similar to the destination classifier based on a threshold, where we use ImageNet [ ] as a pre-trained model. In the second technique, we instantiate the network of the destination-detector using the weights of the source-detector and then freeze the backbone (convolutional and pooling layers) while training only the top fully connected layers with softmax as the output layer. Finally, we expect to construct a more accurate destination class detector with reduced training time.
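The freeze-backbone step above can be sketched in a few lines of NumPy. This is a toy stand-in, not the paper's network: the backbone weights (copied from a hypothetical source detector) are never updated, while a logistic top layer is trained for the destination class.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical backbone weights copied from the source (dog) detector; frozen.
W_backbone = rng.normal(size=(8, 4))   # stands in for conv/pooling layers

def features(x):
    """Frozen feature extractor: the backbone is never updated."""
    return np.maximum(x @ W_backbone, 0.0)   # ReLU features

W_top = np.zeros(4)                    # trainable top (fully connected) layer

def train_top(X, y, lr=0.1, epochs=100):
    """Logistic-loss SGD on the top layer only; the backbone stays frozen."""
    global W_top
    for _ in range(epochs):
        F = features(X)
        p = 1.0 / (1.0 + np.exp(-(F @ W_top)))   # sigmoid score per image
        W_top -= lr * F.T @ (p - y) / len(y)     # gradient step on top layer only

# Toy destination-class data: 16 feature vectors with binary labels.
X = rng.normal(size=(16, 8))
y = (X[:, 0] > 0).astype(float)
backbone_before = W_backbone.copy()
train_top(X, y)
```

Because only the small top layer receives gradient updates, training time per subscription stays low, which is exactly the response-time lever exploited by Model-1.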
4.2 Image-Level Annotations Available
Consider a case when we have a classifier as well as a detector for class “dog” available; then how can we answer the subscription “cat” when we can collect only image-level labels, and not bounding box annotations, for cat? This addresses the problem of Scenario-2 (Section 3) and also appears to be an important use case of existing models based on LSDA [ ]. As stated, LSDA is one of the popular algorithms that learns to transform an image classifier into an object detector. For categories where object-level annotations are available, LSDA utilizes an AlexNet pre-trained on ImageNet, fine-tuned into a detector with bounding boxes by utilizing the RCNN framework [ ]. Finally, these classifiers serve as source classifiers for destination classes having only image-level annotations available, and are adapted into detectors by learning a category-specific transformation of source model parameters. The proposed Model-2 for “Image-Level Annotations Available”, shown in Fig. 2b, follows the same framework as LSDA and the guidelines of Model-1. In this case we are free to collect image-level data from any web source like Google, Bing, Flickr images, or ImageNet Utils. It may include a construction step for the destination class classifier. However, by combining it with our modified LSDA, we expect to have more accurate results in less response-time as compared to conventional LSDA. Modified LSDA makes use of MobileNet in place of AlexNet for classification, and also takes advantage of much faster object detection models like YOLOv3 (as compared to RCNN in the LSDA). Nevertheless, the extent of improvement in performance by using Model-2 is still an open question, has associated risks, and can be validated only after further experimentation.
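The category-specific transformation at the heart of LSDA can be illustrated with a toy sketch. All weights here are invented for illustration: the unseen class's detector parameters are formed by shifting its classifier parameters by the mean classifier-to-detector offset observed on classes that have both models.

```python
import numpy as np

# Hypothetical final-layer weights; values are illustrative only.
clf = {"dog": np.array([1.0, 2.0]), "cat": np.array([1.5, 2.5])}
det = {"dog": np.array([1.2, 1.8])}   # only "dog" has bounding-box training

def adapt(target):
    """Detector weights for `target` = its classifier weights plus the mean
    classifier-to-detector offset learned from classes that have both."""
    offsets = [det[c] - clf[c] for c in det]
    return clf[target] + np.mean(offsets, axis=0)

cat_detector = adapt("cat")   # array([1.7, 2.3])
```

The real LSDA learns this transformation inside a deep network rather than as a mean offset, and the semantic/visual similarity step restricts which source classes contribute to it; the sketch only conveys the "classifier plus learned shift" idea.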
5 EVALUATIONS TO DATE
This section summarizes the results of our current work which is in progress or published [ ]. We compare the detection performance of the proposed model with adaptation based techniques against object detection models (YOLOv3, SSD300, and RetinaNet [ ]) using the Pascal VOC and Open Images Datasets [ ]. The results for mean Average Precision (mAP) using Model-1 are summarized in Table 1. The first row shows the detection performance for training (of 30 min) from scratch (i.e. without adaptation), achieving a maximum mAP of 0.21 on RetinaNet. The second row could serve as a baseline for domain adaptation, where we use a similar classifier for the detection of an unseen class before training (i.e. response-time = 0). Here, we assume a similar classifier is available for the unfamiliar/unseen class and thus apply domain adaptation. The third row shows the results of fine-tuning a pre-trained model (trained on ImageNet [ ]). Finally, the last row shows the detection performance obtained by freezing the backbone and training the top layers for 30 min of response-time. These experiments (with a best mAP of 0.50 on YOLOv3) verify the achievement of high accuracy in low response-time for the proposed Model-1.
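The mAP numbers above average per-class average precision; a minimal sketch of the underlying AP computation over a toy ranked list of detections (not the paper's actual evaluation pipeline):

```python
def average_precision(ranked_hits, n_positives):
    """AP for one class: mean of precision at each true-positive rank,
    averaged over all ground-truth positives (the quantity behind mAP)."""
    tp, precisions = 0, []
    for rank, hit in enumerate(ranked_hits, start=1):
        if hit:
            tp += 1
            precisions.append(tp / rank)
    return sum(precisions) / n_positives if n_positives else 0.0

# Detections ranked by confidence; True marks a correct detection (IoU match).
ap = average_precision([True, False, True, False], n_positives=2)  # (1/1 + 2/3)/2
```

mAP is then the mean of such AP values across the classes in Tables 1-3.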
Table 1: Summarizing mean Average Precision (mAP) on Proposed Model

Methods    | Description of Detectors for Unseen Classes                                                                      | YOLOv3 | SSD300 | RetinaNet
           | Construct classifier by training from Scratch (assuming no similar classifier or pre-trained model available)    | 0.0695 | 0.0000 | 0.2121
           | Use Similar (i.e. Nearest) Classifier for Unseen Class (baseline, as no training required, i.e. response-time=0) | 0.0421 | 0.0635 | 0.0818
Adaptation | Construct classifier by fine-tuning Pre-Trained Model (presently ImageNet)                                       | 0.1200 | 0.1685 | 0.3638
           | Construct classifier by freezing backbone of Similar (i.e. Nearest) Classifier                                   | 0.5000 | 0.1557 | 0.1713
Table 2: Detection mAP on Specific Domain Transfers using different Domain Adaptation techniques

Transfer (Similarity Score)    | Method of Domain Adaptation                          | YOLOv3 | SSD300 | RetinaNet
Mango ← Laptop (0.08)          | Laptop Detector tested for Mango (baseline)          | NaN    | 0.0047 | 0.0046
                               | Mango Detector from pre-trained model                | NaN    | 0.1439 | 0.1667
                               | Mango Detector from Laptop Detector                  | 0.2000 | 0.0818 | 0.0973
Dog ← Cat (0.20)               | Cat Detector tested for Dog (baseline)               | 0.0000 | 0.2123 | 0.2446
                               | Dog Detector from pre-trained model                  | 0.5254 | 0.2120 | 0.2159
                               | Dog Detector from Cat Detector                       | 0.6875 | 0.2504 | 0.2307
Cricket_Ball ← Football (0.33) | Football Detector tested for Cricket ball (baseline) | 0.0000 | 0.0000 | 0.0111
                               | Cricket ball Detector from pre-trained model         | NaN    | 0.0000 | 0.0375
                               | Cricket ball Detector from Football Detector         | 0.0000 | 0.0000 | 0.0120
Bus ← Car (0.50)               | Car Detector tested for Bus (baseline)               | 0.1683 | 0.0371 | 0.0668
                               | Bus Detector from pre-trained model                  | 0.7213 | 0.0938 | 0.1110
                               | Bus Detector from Car Detector                       | 0.5821 | 0.1127 | 0.0808
NaN: Not a Number
Table 3: Detection mAP on Specific Subscriptions using different Object Detection Models

Object Detection Models | Cat    | Dog    | Cricket_Ball | Mango  | Car    | Bus    | Bicycle | Laptop | Football
YOLOv3                  | 0.8181 | 0.6875 | 0.0000       | 0.2000 | 0.8763 | 0.5821 | 0.8227  | 0.2805 | 0.2857
SSD300                  | 0.1749 | 0.2504 | 0.0000       | 0.0818 | 0.6058 | 0.1127 | 0.1463  | 0.0000 | 0.0108
RetinaNet               | 0.2633 | 0.2307 | 0.0120       | 0.0973 | 0.6736 | 0.0808 | 0.3501  | 0.0000 | 0.0300
In order to determine relatedness among similar classifiers, we are presently using WordNet [ ] with the path operator. Table 2 shows four examples of classes on domain transfers with different similarity scores. It can be concluded for Model-1 that adapting from one domain (class) to another mostly yields high-performance results in low response-time as compared to fine-tuning of detectors on a pre-trained model (like ImageNet). Here we also have a simple baseline where the nearest neighbors are used to detect an unseen class without training, thus resulting in low mAP but zero response-time.
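The path operator's behaviour can be mimicked on a toy hypernym taxonomy (the taxonomy and scores below are invented, not WordNet's actual hierarchy or the similarity values reported in Table 2): similarity is the reciprocal of one plus the shortest path length between two concepts.

```python
# Toy hypernym taxonomy standing in for WordNet; edges are illustrative only.
parent = {"dog": "animal", "cat": "animal", "animal": "entity",
          "violin": "instrument", "instrument": "entity"}

def path_similarity(a, b):
    """1 / (shortest hierarchy path length + 1), mirroring WordNet's path operator."""
    def ancestors(node):
        dist, d = 0, {node: 0}
        while node in parent:
            node = parent[node]
            dist += 1
            d[node] = dist
        return d
    da, db = ancestors(a), ancestors(b)
    # shortest path runs through the closest common ancestor
    length = min(da[c] + db[c] for c in set(da) & set(db))
    return 1.0 / (length + 1)

path_similarity("dog", "cat")     # 1/3: close, good transfer source
path_similarity("dog", "violin")  # 1/5: distant, poor transfer source
```

A higher score marks a class as a better candidate source detector for knowledge transfer, which is how the source class is chosen in Fig. 2a.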
Table 3 shows the overall performance of the proposed model on specific subscriptions belonging to different applications of smart cities. However, experiments with Model-2 of modified LSDA, applying the fastest classification (MobileNet [ ]) as well as detection (YOLOv3 [ ]) networks, are still in the implementation phase. Lastly, we also aim to improve it further by incorporating visual knowledge transfer along with semantic knowledge in the proposed models 1 & 2.
6 CONCLUSION AND PLANNED WORK
In this work, we analyzed the problem of training classifiers on-demand for previously unseen/unknown concepts along with the aim of reducing response-time for multimedia event processing. We presented two models for the domain adaptation of classifiers leveraging transfer learning (fine-tuning pre-trained models and freezing the backbone) and LSDA based techniques. We have conducted experiments using the first approach and achieved 0.5 mAP within 30 min of response-time.
As part of work in progress, the proposed approach for modified LSDA is in its implementation phase, and we are planning to compare it with our first model using existing object detection models with and without domain adaptation. Since the second method will no longer require bounding box annotations, it should reduce the online data collection/construction time, which contributes towards response-time. However, we still have two open questions: (1) will the second LSDA based model be able to reduce response-time further by decreasing the overall training time? (2) What will be its impact on the accuracy of the unseen/weak class detector? To reach the best case for both questions, we aim to use MobileNet [ ] and YOLOv3 [ ] in the training network, which are among the fastest classification and detection models to date, while incorporating visual along with semantic knowledge transfers.
ACKNOWLEDGMENTS
This work was supported by Science Foundation Ireland under grant SFI/12/RC/2289_P2. The Titan Xp GPU used was donated by NVIDIA.
REFERENCES
Kashif Ahmad and Nicola Conci. 2019. How Deep Features Have Improved Event Recognition in Multimedia: a Survey. ACM Transactions on Multimedia Computing Communications and Applications (January 2019).
Sufyan Almajali, I Dhiah el Diehn, Haythem Bany Salameh, Moussa Ayyash,
and Hany Elgala. 2018. A distributed multi-layer MEC-cloud architecture for
processing large scale IoT-based multimedia applications. Multimedia Tools and
Applications (2018), 1–22.
Sheeraz A Alvi, Bilal Afzal, Ghalib A Shah, Luigi Atzori, and Waqar Mahmood.
2015. Internet of multimedia things: Vision and challenges. Ad Hoc Networks 33
Asra Aslam and Edward Curry. 2018. Towards a Generalized Approach for Deep
Neural Network Based Event Processing for the Internet of Multimedia Things.
IEEE Access 6 (2018), 25573–25587.
Asra Aslam and Edward Curry. 2020. Reducing Response Time for Multimedia
Event Processing using Domain Adaptation. In Accepted for Proceedings of the
2020 ACM on International Conference on Multimedia Retrieval (ICMR).
Asra Aslam, Souleiman Hasan, and Edward Curry. 2017. Challenges with image
event processing: Poster. In Proceedings of the 11th ACM International Conference
on Distributed and Event-based Systems. 347–348.
Oscar Beijbom. 2012. Domain adaptations for computer vision applications. arXiv
preprint arXiv:1211.4860 (2012).
Benjamin Bischke, Patrick Helber, Christian Schulze, Venkat Srinivasan, Andreas
Dengel, and Damian Borth. 2017. The Multimedia Satellite Task at MediaEval
2017.. In MediaEval.
Antonio Carzaniga, David S Rosenblum, and Alexander L Wolf. 2000. Achieving
scalability and expressiveness in an internet-scale event notification service.
In Proceedings of ACM symposium on Principles of distributed computing. ACM,
Yuhua Chen, Wen Li, Christos Sakaridis, Dengxin Dai, and Luc Van Gool. 2018.
Domain adaptive faster r-cnn for object detection in the wild. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition. 3339–3348.
Gabriela Csurka. 2017. Domain adaptation for visual applications: A comprehen-
sive survey. arXiv preprint arXiv:1702.05374 (2017).
Gianpaolo Cugola and Alessandro Margara. 2012. Processing flows of information:
From data stream to complex event processing. ACM Computing Surveys (CSUR)
44, 3 (2012), 15.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Ima-
genet: A large-scale hierarchical image database. In Computer Vision and Pattern
Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 248–255.
M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. 2010.
The Pascal Visual Object Classes (VOC) Challenge. International Journal of
Computer Vision 88, 2 (June 2010), 303–338.
Souleiman Hasan, Sean O’Riain, and Edward Curry. 2012. Approximate semantic
matching of heterogeneous events. In Proceedings of the 6th ACM International
Conference on Distributed Event-Based Systems. ACM, 252–263.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual
learning for image recognition. In Proceedings of the IEEE conference on computer
vision and pattern recognition. 770–778.
Judith Hoffman. 2016. Adaptive learning algorithms for transferable visual recognition. University of California, Berkeley.
Judy Hoffman, Sergio Guadarrama, Eric S Tzeng, Ronghang Hu, Jeff Donahue, Ross Girshick, Trevor Darrell, and Kate Saenko. 2014. LSDA: Large scale detection through adaptation. In Advances in Neural Information Processing Systems. 3536–
Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun
Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. Mobilenets:
Efficient convolutional neural networks for mobile vision applications. arXiv
preprint arXiv:1704.04861 (2017).
Ling Hu and Qiang Ni. 2017. IoT-driven automated object detection algorithm
for urban surveillance systems in smart cities. IEEE Internet of Things Journal 5,
2 (2017), 747–754.
Naoto Inoue, Ryosuke Furuta, Toshihiko Yamasaki, and Kiyoharu Aizawa. 2018. Cross-domain weakly-supervised object detection through progressive domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Jermsak Jermsurawong, Mian Umair Ahsan, Abdulhamid Haidar, Haiwei Dong,
and Nikolaos Mavridis. 2012. Car parking vacancy detection and its application
in 24-hour statistical analysis. In 2012 10th International Conference on Frontiers
of Information Technology. IEEE, 84–90.
Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Andreas Veit, et al. 2017. Openimages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages 2 (2017), 3.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
Malaram Kumhar, Gaurang Raval, and Vishal Parikh. 2019. Quality Evaluation
Model for Multimedia Internet of Things (MIoT) Applications: Challenges and Re-
search Directions. In International Conference on Internet of Things and Connected
Technologies. Springer, 330–336.
Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, et al. 2018. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982 (2018).
Ching-Hao Lai and Chia-Chen Yu. 2010. An efficient real-time traffic sign recognition system for intelligent vehicles with smart phones. In Technologies and Applications of Artificial Intelligence (TAAI), 2010 International Conference on.
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017.
Focal loss for dense object detection. In Proceedings of the IEEE international
conference on computer vision. 2980–2988.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva
Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common
objects in context. In European conference on computer vision. Springer, 740–755.
Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed,
Cheng-Yang Fu, and Alexander C Berg. 2016. Ssd: Single shot multibox detector.
In European conference on computer vision. Springer, 21–37.
Badri Mohapatra and Prangya Prava Panda. 2019. Machine learning applications
to smart city. ACCENTS Transactions on Image Processing and Computer Vision 4
(14) (Feb 2019). https://doi.org/10.19101/TIPCV.2018.412004
Pirkko Mustamo. 2018. Object detection in sports: TensorFlow Object Detection API
case study. University of Oulu.
Ali Nauman, Yazdan Ahmad Qadri, Muhammad Amjad, Yousaf Bin Zikria,
Muhammad Khalil Afzal, and Sung Won Kim. 2020. Multimedia Internet of
Things: A Comprehensive Survey. IEEE Access 8 (2020), 8202–8250.
Sinno Jialin Pan, Qiang Yang, et al. 2010. A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22, 10 (2010), 1345–1359.
Ted Pedersen, Siddharth Patwardhan, and Jason Michelizzi. 2004. WordNet::
Similarity: measuring the relatedness of concepts. In Demonstration papers at
HLT-NAACL 2004. Association for Computational Linguistics, 38–41.
Mohammad Abdur Razzaque, Marija Milojevic-Jevric, Andrei Palade, and Siobhán
Clarke. 2015. Middleware for internet of things: a survey. IEEE Internet of things
journal 3, 1 (2015), 70–95.
Joseph Redmon. 2013–2016. Darknet: Open Source Neural Networks in C. http:
Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 779–788.
Joseph Redmon and Ali Farhadi. 2018. Yolov3: An incremental improvement.
arXiv preprint arXiv:1804.02767 (2018).
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn:
Towards real-time object detection with region proposal networks. In Advances
in neural information processing systems. 91–99.
Mrigank Rochan and Yang Wang. 2015. Weakly supervised localization of novel
objects using appearance transfer. In Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition. 4315–4324.
Kah Phooi Seng and Li-Minn Ang. 2018. A Big Data Layered Architecture and
Functional Units for the Multimedia Internet of Things (MIoT). IEEE Transactions
on Multi-Scale Computing Systems (2018).
Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks
for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
Yuxing Tang, Josiah Wang, Boyang Gao, Emmanuel Dellandréa, Robert
Gaizauskas, and Liming Chen. 2016. Large scale semi-supervised object de-
tection using visual and semantic knowledge transfer. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. 2119–2128.
Yuxing Tang, Josiah Wang, Xiaofang Wang, Boyang Gao, Emmanuel Dellandréa,
Robert Gaizauskas, and Liming Chen. 2017. Visual and semantic knowledge
transfer for large scale semi-supervised object detection. IEEE transactions on
pattern analysis and machine intelligence 40, 12 (2017), 3045–3058.
Lisa Torrey and Jude Shavlik. 2010. Transfer learning. In Handbook of Research on
Machine Learning Applications and Trends: Algorithms, Methods, and Techniques.
IGI Global, 242–264.
Mei Wang and Weihong Deng. 2018. Deep visual domain adaptation: A survey.
Neurocomputing 312 (2018), 135–153.
Piyush Yadav and Edward Curry. 2019. VidCEP: Complex Event Processing
Framework to Detect Spatiotemporal Patterns in Video Streams. In 2019 IEEE
International Conference on Big Data (Big Data). IEEE, 2513–2522.
Liangzhao Zeng and Hui Lei. 2004. A semantic publish/subscribe system. In E-
Commerce Technology for Dynamic E-Business, 2004. IEEE International Conference
on. IEEE, 32–39.
Yuhao Zhang and Arun Kumar. 2019. Panorama: a data system for unbounded
vocabulary querying over video. Proceedings of the VLDB Endowment 13, 4 (2019),