Neuro-symbolic Visual Reasoning for Multimedia Event Processing: Overview, Prospects and Challenges
Muhammad Jaleed Khan
Data Science Institute
National University of Ireland Galway
m.khan12@nuigalway.ie
Edward Curry
Data Science Institute
National University of Ireland Galway
edward.curry@nuigalway.ie
Abstract
Efficient multimedia event processing is a key enabler for real-time and complex decision making over streaming media. The need for expressive queries to detect high-level, human-understandable spatial and temporal events in multimedia streams is inevitable due to the explosive growth of multimedia data in smart cities and on the internet. Recent work in stream reasoning, event processing and visual reasoning inspires the integration of visual and commonsense reasoning into multimedia event processing, which would improve multimedia event processing in terms of the expressivity of event rules and queries. This can be achieved through careful integration of knowledge about entities, relations and rules from rich knowledge bases via reasoning over multimedia streams within an event processing engine. The prospects of neuro-symbolic visual reasoning within multimedia event processing are promising; however, there are several associated challenges, which are highlighted in this paper.
1 Introduction
The Internet of Multimedia Things (IoMT), data analytics and artificial intelligence are continuously improving smart cities and urban environments through their ever-increasing applications, ranging from traffic management to public safety.
Figure 1: (a) Example of a video stream in a smart city. (b) Detection of objects and relations. (c) High-level event of traffic congestion detected as a result of automated reasoning.
As middleware between the Internet of Things and real-time applications, complex event processing (CEP) systems process structured data streams from multiple producers and detect complex events queried by subscribers in real time. The enormous increase in image and video content from surveillance cameras and other sources in IoMT applications has posed several challenges for the real-time processing of multimedia events, which motivated researchers in this area to extend existing CEP engines and to devise new CEP frameworks that support unstructured multimedia streams. Over the past few years, several efforts have been made to mitigate these challenges, both by extending existing CEP engines to handle multimedia events [1] and by developing end-to-end CEP frameworks for multimedia streams [2]. In parallel, research in computer vision has focused on complementing object detection with human-like visual reasoning that allows the prediction of meaningful and useful semantic relations among detected objects based on analogy and commonsense (CS) knowledge [3, 4].
2 Background
In this paper, we discuss the background, prospects and challenges of leveraging existing visual and commonsense reasoning to enhance multimedia event processing in terms of its applicability and the expressivity of multimedia event queries.
Figure 2: (a) Conceptual-level block diagram of a CEP framework supporting visual reasoning. The input stream of images (or video frames) is received from a publisher; the objects are detected using a deep neural network (DNN) and rule-based relations [5] are represented using a graph, followed by automated reasoning that adds new visual relations from a knowledge base [6] and validates those relations using commonsense knowledge [7]. The matcher performs spatial and temporal event matching of these detected objects and relations against the spatial and temporal patterns in the high-level events queried by the subscriber. (b) An example of visual reasoning in multimedia event processing. Suppose a subscriber is interested in the event where a tennis player is either "hitting" or "missing" a shot. This event is not explicitly defined via rules, but it can be predicted via automated reasoning over the detected objects and predicted relations. (Image credits: Visual Genome [6])
The motivation for developing an end-to-end multimedia event processing system that supports automated reasoning over multimedia streams comes from its potential real-time applications in smart cities, internet media and sports. Fig. 1 shows an example of a traffic congestion event detected using visual and commonsense reasoning over the objects, and the relations among the objects, in a video stream. A conceptual-level design and a motivational example of a novel CEP framework supporting visual and commonsense reasoning are presented in Fig. 2.
This section presents a review of recent work in stream reasoning, multimedia event processing and visual reasoning that could be complementary within a proposed neuro-symbolic multimedia event processing system with support for visual reasoning.
2.1 Reasoning over Streams and Knowledge Graphs
Emerging from the semantic web, streaming data is conventionally modelled according to RDF [12], a graph representation. Real-time processing of RDF streams is performed in time-dependent windows that control access to the stream, each containing a small part of the stream over which a task needs to be performed at a certain time instant. Reasoning is performed by applying RDF Schema rules to the graph using the SPARQL query language or its variants. Reasoning over knowledge graphs (KGs) derives new relations among entities, which enriches the knowledge graph and improves its applicability [13]. Neuro-symbolic computing combines symbolic and statistical approaches, i.e., knowledge is represented in symbolic form, whereas learning and reasoning are performed by deep neural networks (DNNs) [14]; it has shown its efficacy in object detection [15] as well as in enhanced feature learning via knowledge infusion into DNN layers from knowledge bases [16]. Temporal KGs allow time-aware representation and tracking of entities and relations [17].
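To make the windowed processing concrete, the following Python sketch (a minimal illustration using rdflib; the example.org entities, the locatedOn predicate and the window parameters are our own assumptions rather than constructs of any cited system) builds a graph from a time-based window over a toy RDF stream and runs a SPARQL aggregation over it:

# Minimal sketch of windowed reasoning over an RDF stream with rdflib;
# entities, predicate and window parameters are illustrative assumptions.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")

# A toy RDF stream: (timestamp in seconds, subject, predicate, object).
stream = [
    (0.0, EX.car1, EX.locatedOn, EX.road1),
    (0.5, EX.car2, EX.locatedOn, EX.road1),
    (1.0, EX.car3, EX.locatedOn, EX.road1),
    (9.0, EX.car4, EX.locatedOn, EX.road2),
]

def window(events, start, length):
    # Select the events that fall inside a time-based window.
    return [e for e in events if start <= e[0] < start + length]

# Build a graph over the current window only.
g = Graph()
for _, s, p, o in window(stream, start=0.0, length=5.0):
    g.add((s, p, o))

# Count co-located vehicles per road segment within the window; a
# downstream rule could flag congestion above a threshold.
q = """
SELECT ?road (COUNT(?v) AS ?n) WHERE {
    ?v <http://example.org/locatedOn> ?road .
} GROUP BY ?road
"""
for road, n in g.query(q):
    print(road, int(n))  # road1 has a count of 3 in this window

A production engine would maintain sliding or tumbling windows incrementally rather than rebuilding the graph for every window.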
2.2 Multimedia Event Representation and Processing
CEP engines inherently lacked support for unstructured multimedia events, which was mitigated by a generalized approach for handling multimedia events as native events in CEP engines, as presented in [1]. Angsuchotmetee et al. [18] presented an ontological approach for modeling complex events and multimedia data with syntactic and semantic interoperability in multimedia sensor networks, which allows subscribers to define application-specific complex events while keeping the low-level network representation generic. Aslam et al. [19] leveraged domain adaptation and online transfer learning in multimedia event processing to extend support to unknown events. The knowledge graph is suitable for semantic representation of, and reasoning over, video streams due to its scalability and maintainability [20], as demonstrated in [5].
Table 1: Available Knowledge Bases for Visual Reasoning

Knowledge Base     | #Images   | #Entity Categories | #Entity Instances | #Relation Categories | #Relation Instances
Open Images V4 [8] | 9,200,000 | 600                | 15,400,000        | 57                   | 375,000
YAGO 4 [9]         | –         | 10,124             | 64,000,000        | –                    | 2 billion
Visual Genome [6]  | 108,077   | 33,877             | 3,843,636         | 42,374               | 2,269,617
COCO-a [10]        | 10,000    | 81                 | 74,000            | 156                  | 207,000
VisKE [11]         | –         | 1,884              | –                 | 1,158                | 12,593
VidCEP [2], a CEP framework for the detection of spatiotemporal video events expressed by subscriber-defined queries, includes a graph-based representation, the Video Event Query Language (VEQL) and a complex event matcher for video data.
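The graph-based representation underlying such frameworks can be illustrated with a minimal per-frame sketch in Python (in the spirit of VEKG [5], but deliberately simplified; the detections, attribute names and the single "left of" rule are illustrative assumptions):

# Minimal per-frame event knowledge graph in the spirit of VEKG [5];
# labels, bounding boxes and the spatial rule are illustrative assumptions.
import networkx as nx

def frame_graph(detections, timestamp):
    # detections: list of dicts with 'id', 'label' and 'bbox' = (x1, y1, x2, y2).
    g = nx.MultiDiGraph(timestamp=timestamp)
    for d in detections:
        g.add_node(d["id"], label=d["label"], bbox=d["bbox"])
    # Rule-based spatial relation: 'left_of' holds when one box centre
    # lies left of another's (a simple stand-in for richer rule sets).
    for a in detections:
        for b in detections:
            if a is not b:
                ax = (a["bbox"][0] + a["bbox"][2]) / 2
                bx = (b["bbox"][0] + b["bbox"][2]) / 2
                if ax < bx:
                    g.add_edge(a["id"], b["id"], relation="left_of")
    return g

dets = [
    {"id": "car1", "label": "car", "bbox": (10, 40, 60, 80)},
    {"id": "person1", "label": "person", "bbox": (100, 30, 130, 90)},
]
g = frame_graph(dets, timestamp=0.04)
print(list(g.edges(data=True)))  # [('car1', 'person1', {'relation': 'left_of'})]

Relations contributed by visual reasoning over a knowledge base could then be added as further typed edges on the same graph.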
2.3 Visual and Commonsense Reasoning
In addition to the objects and their attributes in images, the detection of relations among these objects is crucial for scene understanding, for which compositional models [21], visual phrase models [11] and DNN-based relational networks [22] are available. Visual and semantic embeddings aid large-scale visual relation detection; for example, Zhang et al. [4] employed both visual and textual features to leverage the interactions between objects for relation detection. Similarly, Peyre et al. [3] added a visual phrase embedding space during learning to enable analogical reasoning for unseen relations and to improve robustness to appearance variations of visual relations. Table 1 presents some knowledge bases publicly available for visual reasoning. Wan et al. [7] proposed the use of a commonsense knowledge graph along with visual features to enhance visual relation detection. Rajani et al. [23] leverage human reasoning and language models to generate human-like explanations for DNN-based commonsense question answering. Various commonsense reasoning methods and datasets are available for visual commonsense reasoning [24] and story completion [25].
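The common thread of these embedding-based approaches, scoring a candidate relation in a joint visual-semantic space, can be sketched as follows (a toy illustration; the random features and the weighted cosine fusion are our own assumptions and do not reproduce the architectures of [3] or [4]):

# Toy sketch of fusing visual and semantic similarity to score one
# candidate <subject, predicate, object> triple; the features and the
# fusion scheme are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_relation(visual_feat, triple_emb, word_feat, alpha=0.5):
    # Weighted fusion of visual and semantic similarity for one triple.
    return alpha * cosine(visual_feat, triple_emb) + \
           (1 - alpha) * cosine(word_feat, triple_emb)

# Stand-ins for: a CNN feature of the subject-object union box, a learned
# embedding of the phrase "person riding horse", and averaged word vectors.
visual_feat = rng.normal(size=128)
triple_emb = rng.normal(size=128)
word_feat = rng.normal(size=128)
print(score_relation(visual_feat, triple_emb, word_feat))

In practice, the embedding spaces are learned jointly so that analogous phrases (e.g. "person riding horse" and "person riding bike") land close together, which is what enables transfer to unseen relations [3].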
3 Neuro-symbolic Visual Reasoning in Multimedia Event Processing
3.1 Prospects
Current multimedia event representation methods use a knowledge graph to represent the detected objects, their attributes and the relations among the objects in video streams. Pre-defined spatial-temporal rules are used to form relations among the objects. However, the complex relations that exist among real-world objects also depend on semantic facts and situational variables that cannot be explicitly specified as rules for every possible event. The statistical reasoning methods and knowledge bases discussed in Section 2 have great potential to complement rule-based relation formation in multimedia event processing by injecting semantic knowledge and reasoning to extract more semantically meaningful relations among objects. This advancement will allow subscribers to define abstract, high-level, human-understandable event query rules that can be decomposed into spatial and temporal patterns. The spatio-temporal matching of the queried high-level events will be performed on the objects, the rule-based relations and the relations extracted using visual reasoning. The subscriber will be instantly notified of the high-level event as a combined detection of those spatial-temporal patterns. The idea of developing an end-to-end multimedia event processing system supporting visual reasoning over video streams (Fig. 2) poses several challenges, which are discussed in the next section. This novel approach will give subscribers more expressive power in querying complex events in multimedia streams, and thus increase the scope of real-time applications of multimedia event processing in smart cities as well as in internet media streaming.
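A minimal sketch of this decomposition, for the traffic congestion example of Fig. 1, is given below (illustrative only; the query shape, the thresholds and the per-frame vehicle counts are assumptions rather than actual VEQL constructs):

# Illustrative decomposition of a high-level query into a spatial pattern
# (at least min_vehicles per frame) and a temporal pattern (the spatial
# pattern holds continuously for min_duration seconds); the thresholds
# and query shape are assumptions, not actual VEQL constructs.
from dataclasses import dataclass

@dataclass
class Frame:
    timestamp: float
    vehicle_count: int  # from per-frame object detection and reasoning

def spatial_match(frame, min_vehicles):
    return frame.vehicle_count >= min_vehicles

def temporal_match(frames, min_vehicles, min_duration):
    # Fire when the spatial pattern holds continuously for min_duration.
    start = None
    for f in frames:
        if spatial_match(f, min_vehicles):
            start = f.timestamp if start is None else start
            if f.timestamp - start >= min_duration:
                return True  # notify subscriber: high-level event detected
        else:
            start = None
    return False

stream = [Frame(i * 0.5, c) for i, c in enumerate([2, 6, 7, 8, 9, 9, 3])]
print(temporal_match(stream, min_vehicles=5, min_duration=1.5))  # True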
3.2 Challenges
1. Suitable representation for reasoning. It is crucial to select a generalized and scalable model to represent events and to effectively perform automated reasoning that derives more meaningful and expressive spatiotemporal events.
2. Expressive query definition and matching. Providing subscribers with a generic and human-friendly format for writing expressive, high-level queries would require new constructs. Matching queries against low-level events and relations, along with reasoning via knowledge bases, requires efficient retrieval within the complex event matcher. Real-world complex events can share similar patterns, occur as a cluster of similar events or occur in a hierarchical manner, which requires generalized, adaptive and scalable spatiotemporal constructs to query such events.
3. Labeling and training samples of visual relations. There can be a large number of objects, and of possible relations among them, in images, which can result in a large number of relation categories. It is difficult to annotate all possible relations and to have balanced relation categories in the training data. For example, Visual Genome [6] has a huge number of relations with unbalanced numbers of instances per relation.
4. Consistent integration of knowledge bases. The object labels in object detection datasets and the entity labels in knowledge bases (e.g. person, human, man) are not always the same. Similarly, knowledge bases use different labels for the same entity, and different names for the same attribute (e.g. birthPlace and placeOfBirth) or relation (e.g. 'at left' and 'to left of'). This can cause inconsistency or redundancy when integrating relations from knowledge bases (a minimal normalization sketch is given after this list of challenges). It is important to select a knowledge base and dataset that are consistent and suitable for the combined use of object detection and visual reasoning.
5. Supporting rare or unseen visual relations. Apart from common relations, very rare or unseen relations among objects also appear in certain scenes. It is nearly impossible to collect sufficient training samples for all possible seen and unseen relations. Handling such relations when evaluating models is also a challenge.
6. Temporal processing of objects and relations. Recent methods on this subject address complex inference tasks by decomposing images or scenes into objects and visual relations among the objects. Temporal events, and the temporal tracking of detected objects and predicted relations, have not been explored much, although they are crucial for spatiotemporal event processing.
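As a minimal illustration of the normalization required for challenge 4, the following sketch maps dataset- and knowledge-base-specific labels onto one canonical vocabulary (the alias table is a hand-made assumption; in practice it would be curated or mined from lexical resources):

# Minimal label-normalization sketch for challenge 4; the alias table is
# an illustrative hand-made assumption.
ALIASES = {
    "human": "person",
    "man": "person",
    "woman": "person",
    "birthPlace": "place_of_birth",
    "placeOfBirth": "place_of_birth",
    "at left": "to_left_of",
    "to left of": "to_left_of",
}

def normalize(label):
    # Map a dataset- or KB-specific label onto the canonical vocabulary.
    return ALIASES.get(label, label)

triples = [("man", "at left", "car"), ("human", "to left of", "car")]
canonical = {(normalize(s), normalize(p), normalize(o)) for s, p, o in triples}
print(canonical)  # {('person', 'to_left_of', 'car')} - redundancy collapses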
Acknowledgement
This work was conducted with the financial support of the Science Foundation Ireland Centre for Research Training in Artificial Intelligence under Grant No. 18/CRT/6223.
References
[1] Asra Aslam and Edward Curry. Towards a generalized approach for deep neural network based event processing for the internet of multimedia things. IEEE Access, 6(1):25573–25587, 2018.
[2] Piyush Yadav and Edward Curry. VidCEP: Complex event processing framework to detect spatiotemporal patterns in video streams. In 2019 IEEE International Conference on Big Data (Big Data), pages 2513–2522. IEEE, 2019.
[3] Julia Peyre, Ivan Laptev, Cordelia Schmid, and Josef Sivic. Detecting unseen visual relations using analogies. In Proceedings of the IEEE International Conference on Computer Vision, pages 1981–1990, 2019.
[4] Ji Zhang, Yannis Kalantidis, Marcus Rohrbach, Manohar Paluri, Ahmed Elgammal, and Mohamed Elhoseiny. Large-scale visual relationship understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 9185–9194, 2019.
[5] Piyush Yadav and Edward Curry. VEKG: Video event knowledge graph to represent video streams for complex event pattern matching. In 2019 First International Conference on Graph Computing (GC), pages 13–20. IEEE, 2019.
[6] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
[7] Hai Wan, Jialing Ou, Baoyi Wang, Jianfeng Du, Jeff Z. Pan, and Juan Zeng. Iterative visual relationship detection via commonsense knowledge graph. In Joint International Semantic Technology Conference, pages 210–225. Springer, 2019.
[8] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The Open Images Dataset V4. International Journal of Computer Vision, pages 1–26, 2020.
[9] Thomas Pellissier Tanon, Gerhard Weikum, and Fabian Suchanek. YAGO 4: A reason-able knowledge base. In European Semantic Web Conference, pages 583–596. Springer, 2020.
[10] Matteo Ruggero Ronchi and Pietro Perona. Describing common human visual actions in images. arXiv preprint arXiv:1506.02203, 2015.
[11] Fereshteh Sadeghi, Santosh K. Divvala, and Ali Farhadi. VisKE: Visual knowledge extraction and question answering by visual verification of relation phrases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1456–1464, 2015.
[12] RDF 1.1 concepts and abstract syntax. W3C Recommendation, 2014.
[13] Xiaojun Chen, Shengbin Jia, and Yang Xiang. A review: Knowledge reasoning over knowledge graph. Expert Systems with Applications, 141:112948, 2020.
[14] Weizhuo Li, Guilin Qi, and Qiu Ji. Hybrid reasoning in knowledge graphs: Combing symbolic reasoning and statistical reasoning. Semantic Web, (Preprint):1–10, 2020.
[15] Yuan Fang, Kingsley Kuan, Jie Lin, Cheston Tan, and Vijay Chandrasekhar. Object detection meets knowledge graphs. 2017.
[16] Ugur Kursuncu, Manas Gaur, and Amit Sheth. Knowledge infused learning (K-IL): Towards deep incorporation of knowledge in deep learning. arXiv preprint arXiv:1912.00512, 2019.
[17] Alberto García-Durán, Sebastijan Dumančić, and Mathias Niepert. Learning sequence encoders for temporal knowledge graph completion. arXiv preprint arXiv:1809.03202, 2018.
[18] Chinnapong Angsuchotmetee, Richard Chbeir, and Yudith Cardinale. MSSN-Onto: An ontology-based approach for flexible event processing in multimedia sensor networks. Future Generation Computer Systems, 108:1140–1158, 2020.
[19] Asra Aslam and Edward Curry. Reducing response time for multimedia event processing using domain adaptation. In Proceedings of the 2020 International Conference on Multimedia Retrieval, pages 261–265, 2020.
[20] Luca Greco, Pierluigi Ritrovato, and Mario Vento. On the use of semantic technologies for video analysis. Journal of Ambient Intelligence and Humanized Computing, 2020.
[21] Yikang Li, Wanli Ouyang, Xiaogang Wang, and Xiaoou Tang. ViP-CNN: Visual phrase guided convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1347–1356, 2017.
[22] Bo Dai, Yuqi Zhang, and Dahua Lin. Detecting visual relationships with deep relational networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3076–3086, 2017.
[23] Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. Explain yourself! Leveraging language models for commonsense reasoning. arXiv preprint arXiv:1906.02361, 2019.
[24] Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[25] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.