Conference Paper

# Neuro-symbolic Visual Reasoning for Multimedia Event Processing: Overview, Prospects and Challenges

## Abstract and Figures

Efficient multimedia event processing is a key enabler for real-time and complex decision making in streaming media. The need for expressive queries to detect high-level human-understandable spatial and temporal events in multimedia streams is inevitable due to the explosive growth of multimedia data in smart cities and on the internet. The recent work in stream reasoning, event processing and visual reasoning inspires the integration of visual and commonsense reasoning in multimedia event processing, which would improve and enhance multimedia event processing in terms of expressivity of event rules and queries. This can be achieved through careful integration of knowledge about entities, relations and rules from rich knowledge bases via reasoning over multimedia streams within an event processing engine. The prospects of neuro-symbolic visual reasoning within multimedia event processing are promising; however, there are several associated challenges that are highlighted in this paper.
Data Science Institute
National University of Ireland Galway
m.khan12@nuigalway.ie
Edward Curry
Data Science Institute
National University of Ireland Galway
edward.curry@nuigalway.ie
1 Introduction
Internet of multimedia things (IoMT), data analytics
and artiﬁcial intelligence are continuously improving
smart cities and urban environments with their ever-
increasing applications ranging from traffic management to public safety.

Figure 1: (a) Example of a video stream in a smart city. (b) Detection of objects and relations. (c) High-level event of traffic congestion detected as a result of automated reasoning.

As middleware between the internet of things and real-time applications, complex
event processing (CEP) systems process structured
data streams from multiple producers and detect com-
plex events queried by subscribers in real-time. The
enormous increase in image and video content from surveillance cameras and other sources in IoMT applications has posed several challenges in the real-time processing of multimedia events, which has motivated researchers in this
area to extend the existing CEP engines and to devise
new CEP frameworks to support unstructured multi-
media streams. Over the past few years, several eﬀorts
have been made to mitigate the challenges in multime-
dia event processing by developing techniques for ex-
tension of existing CEP engines for multimedia events
[1] and development of end-to-end CEP frameworks
for multimedia streams [2]. On the other hand, the re-
search in computer vision has focused on complementing object detection with human-like visual reasoning
that allows for prediction of meaningful and useful
semantic relations among detected objects based on
analogy and commonsense (CS) knowledge [3, 4].
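A complex event processing loop of the kind described above can be sketched in a few lines. The event schema, window size and toy "congestion" rule below are illustrative assumptions, not the design of any system cited in this paper:

```python
from collections import deque

# A complex-event rule: fire when the count of "vehicle" detections
# inside a sliding window exceeds a threshold (toy "traffic congestion" rule).
def congestion_rule(window, threshold=3):
    vehicles = [e for e in window if e["label"] == "vehicle"]
    return len(vehicles) >= threshold

def process_stream(events, window_size=4, rule=congestion_rule):
    """Slide a count-based window over primitive events and emit complex events."""
    window = deque(maxlen=window_size)
    detections = []
    for event in events:
        window.append(event)
        if rule(window):
            detections.append(event["t"])  # timestamp at which the rule fired
    return detections

stream = [{"t": t, "label": l} for t, l in
          enumerate(["person", "vehicle", "vehicle", "vehicle", "vehicle", "person"])]
print(process_stream(stream))  # → [3, 4, 5]
```

A real CEP engine would additionally handle time-based windows, out-of-order arrival and subscriber notification; the point here is only the shape of the window-plus-rule pattern.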
2 Background
In this paper, we discuss the background, prospects
and challenges related to leveraging the existing vi-
sual and commonsense reasoning to enhance multime-
dia event processing in terms of its applicability and
expressivity of multimedia event queries.

Figure 2: (a) Conceptual-level block diagram of a CEP framework supporting visual reasoning. The input stream of images (or video frames) is received from a publisher; the objects are detected using a DNN and rule-based relations [5] are represented using a graph; automated reasoning then adds new visual relations from a knowledge base [6] and validates those relations using commonsense knowledge [7]. The matcher performs spatial and temporal event matching of these detected objects and relations against the spatial and temporal patterns in the high-level events queried by the subscriber. (b) An example of visual reasoning in multimedia event processing. Suppose a subscriber is interested in the event where a tennis player is either "hitting" or "missing" a shot. This event is not explicitly defined via rules, but it can be predicted via automated reasoning over detected objects and predicted relations. (Image credits: Visual Genome [6])

The motivation for the development of an end-to-end multimedia
event processing system supporting automated rea-
soning over multimedia streams comes from its poten-
tial real-time applications in smart cities, internet and
sports. Fig. 1 shows an example of traﬃc congestion
event detected using visual and commonsense reason-
ing over the objects and relations among the objects in
the video stream. A conceptual level design and a mo-
tivational example of a novel CEP framework support-
ing visual and commonsense reasoning is presented in
Fig. 2.
This section presents a review of the recent work
in stream reasoning, multimedia event processing and
visual reasoning that could be complementary within a
proposed neuro-symbolic multimedia event processing
system with support for visual reasoning.
2.1 Reasoning over Streams and Knowledge
Graph
Emerging from the semantic web, streaming data is
conventionally modelled according to RDF [12], a
graph representation. The real-time processing of
RDF streams is performed in time-dependent windows, i.e., a small part of the stream over which a task needs to
be performed at a certain time instant. Reasoning is
performed by applying RDF Schema rules to the graph
using SPARQL query language or its variants. Reason-
ing over knowledge graphs (KG) provides new relations
among entities to enrich the knowledge graph and im-
prove its applicability [13]. Neuro-symbolic comput-
ing combines symbolic and statistical approaches, i.e.
knowledge is represented in symbolic form, whereas
learning and reasoning are performed by a DNN [14],
which has shown its eﬃcacy in object detection [15] as
well as enhanced feature learning via knowledge infu-
sion in DNN layers from knowledge bases [16]. Tempo-
ral KG allows time-aware representation and tracking
of entities and relations [17].
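The window-based reasoning described above can be illustrated with a minimal RDFS-style subclass rule applied to a windowed triple stream. The class hierarchy, triples and window bounds below are illustrative assumptions:

```python
# Toy RDFS-style reasoning over a time-windowed stream of
# (subject, predicate, object) triples.
SUBCLASS = {"car": "vehicle", "bus": "vehicle", "vehicle": "thing"}

def superclasses(cls):
    """Transitive closure over the subclass chain (rdfs:subClassOf)."""
    while cls in SUBCLASS:
        cls = SUBCLASS[cls]
        yield cls

def infer(window):
    """Apply the RDFS rule: (x, type, C) and C subClassOf D => (x, type, D)."""
    derived = set(window)
    for s, p, o in window:
        if p == "type":
            derived.update((s, "type", sup) for sup in superclasses(o))
    return derived

# Select a window of the first two seconds from a timestamped triple stream.
stream = [(0, ("obj1", "type", "car")), (1, ("obj2", "type", "bus")),
          (5, ("obj3", "type", "person"))]
window = [t for ts, t in stream if 0 <= ts < 2]
print(("obj1", "type", "vehicle") in infer(window))  # → True
```

In practice this rule application is expressed declaratively in SPARQL (or a streaming variant) rather than hand-coded, but the window-select-then-infer structure is the same.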
2.2 Multimedia Event Representation and
Processing
CEP engines inherently lacked support for un-
structured multimedia events, which was mitigated
by a generalized approach for handling multimedia
events as native events in CEP engines as presented
in [1]. Angsuchotmetee et al. [18] presented
an ontological approach for modeling complex events
and multimedia data with syntactic and semantic in-
teroperability in multimedia sensor networks, which
allows subscribers to deﬁne application-speciﬁc com-
plex events while keeping the low-level network rep-
resentation generic. Aslam et al. [19] leveraged do-
main adaptation and online transfer learning in multi-
media event processing to extend support for unknown
events. A knowledge graph is suitable for semantic rep-
resentation and reasoning over video streams due to its
scalability and maintainability [20], as demonstrated
Table 1: Available Knowledge Bases for Visual Reasoning

| Knowledge Base | #Images | #Entity Categories | #Entity Instances | #Relation Categories | #Relation Instances |
|---|---|---|---|---|---|
| Open Images V4 [8] | 9,200,000 | 600 | 15,400,000 | 57 | 375,000 |
| YAGO 4 [9] | – | 10,124 | 64,000,000 | – | 2 billion |
| Visual Genome [6] | 108,077 | 33,877 | 3,843,636 | 42,374 | 2,269,617 |
| COCO-a [10] | 10,000 | 81 | 74,000 | 156 | 207,000 |
| VisKE [11] | 1,884 | – | – | 1,158 | 12,593 |
in [5]. VidCEP [2], a CEP framework for detection of
spatiotemporal video events expressed by subscriber-
deﬁned queries, includes a graph-based representation,
Video Event Query Language (VEQL) and a complex
event matcher for video data.
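A graph-based frame representation of the kind used by VidCEP [2] and VEKG [5] can be approximated as follows. This is a hedged sketch: the detection schema, bounding-box convention (x, y, w, h) and the single rule-based spatial relation are illustrative assumptions, not the cited systems' actual data model:

```python
# Sketch: detected objects become graph nodes, and a simple rule derives
# a spatial relation between their bounding boxes.
def left_of(box_a, box_b):
    """Rule-based spatial relation: A's right edge lies left of B's left edge."""
    return box_a[0] + box_a[2] <= box_b[0]

def frame_to_graph(detections):
    nodes = {d["id"]: d["label"] for d in detections}
    edges = [(a["id"], "left_of", b["id"])
             for a in detections for b in detections
             if a["id"] != b["id"] and left_of(a["box"], b["box"])]
    return {"nodes": nodes, "edges": edges}

frame = [{"id": "o1", "label": "car", "box": (0, 0, 40, 20)},
         {"id": "o2", "label": "truck", "box": (60, 0, 50, 30)}]
print(frame_to_graph(frame)["edges"])  # → [('o1', 'left_of', 'o2')]
```

The paper's proposal is precisely to augment such rule-derived edges with relations inferred from knowledge bases, since rules like `left_of` cannot cover semantic relations.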
2.3 Visual and Commonsense Reasoning
In addition to the objects and their attributes in im-
ages, detection of relations among these objects is cru-
cial for scene understanding for which compositional
models [21], visual phrase models [11] and DNN based
relational networks [22] are available. Visual and se-
mantic embeddings aid large scale visual relation de-
tection, such as Zhang et al. [4] employed both visual
and textual features to leverage the interactions be-
tween objects for relation detection. Similarly, Peyre
et al. [3] added a visual phrase embedding space
during learning to enable analogical reasoning for un-
seen relations and to improve robustness to appear-
ance variations of visual relations. Table 1 presents
some knowledge bases publicly available for visual rea-
soning. Wan et al. [7] proposed the use of common-
sense knowledge graph along with the visual features
to enhance visual relation detection. Rajani et al.
[23] leverage human reasoning and language models
to generate human-like explanations for DNN-based
commonsense question answering. There are various
commonsense reasoning methods and datasets avail-
able for visual commonsense reasoning [24] and story
completion [25].
3 Neuro-symbolic Visual Reasoning in
Multimedia Event Processing
3.1 Prospects
The current multimedia event representation methods
use knowledge graph to represent the detected objects,
their attributes and relations among the objects in
video streams. Pre-deﬁned spatial-temporal rules are
used to form relations among the objects. However,
the complex relations that exist among real-world ob-
jects also depend on semantic facts and situational
variables that cannot be explicitly specified for every
possible event as rules. The statistical reasoning meth-
ods and knowledge bases discussed in Section 2 have
great potential to complement the rule-based relation
formation in multimedia event processing by inject-
ing some semantic knowledge and reasoning to extract
more semantically meaningful relations among objects.
This advancement will allow subscribers to deﬁne ab-
stract or high-level human-understandable event query
rules that can be decomposed into spatial and tem-
poral patterns. The spatio-temporal matching of the
queried high-level events will be performed on the ob-
jects, rule-based relations and relations extracted us-
ing visual reasoning. The subscriber will be instantly
notiﬁed of the high-level event as a combined detec-
tion of those spatial-temporal patterns. The idea of
developing an end-to-end multimedia event processing
system supporting visual reasoning over video streams
(Fig. 2) poses several challenges that are discussed
in the next section. This novel approach will give
more expressive power to subscribers in querying com-
plex events in multimedia streams, and thus increase
the scope of real-time applications of multimedia event
processing in smart city applications as well as internet
media streaming applications.
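The query decomposition and spatio-temporal matching described above can be sketched as follows. The query structure (a spatial pattern that must persist over a temporal window of consecutive frames) and the relation tuples are illustrative assumptions about how such a matcher might be shaped:

```python
# Sketch: match a high-level query decomposed into a spatial pattern
# (a relation that must hold) and a temporal pattern (it must persist
# for k consecutive frames).
def match_query(frames, relation, min_consecutive):
    """Return True if `relation` holds in `min_consecutive` consecutive frames."""
    run = 0
    for graph in frames:
        run = run + 1 if relation in graph else 0
        if run >= min_consecutive:
            return True
    return False

# Per-frame relation sets produced by detection and reasoning (assumed input).
frames = [{("player", "hitting", "ball")},
          {("player", "hitting", "ball")},
          {("player", "near", "net")}]
print(match_query(frames, ("player", "hitting", "ball"), 2))  # → True
```

A full matcher would support conjunctions, sequencing and negation of such patterns; the sketch shows only the smallest spatio-temporal building block.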
3.2 Challenges
1. Suitable representation for reasoning It is cru-
cial to select a generalized and scalable model to rep-
resent events and eﬀectively perform automated rea-
soning to derive more meaningful and expressive spa-
tiotemporal events.
2. Expressive query deﬁnition and matching
Providing a generic and human-friendly format to sub-
scribers for writing expressive and high-level queries
would require new constructs. Matching queries with
the low-level events and relations along with reason-
ing via knowledge bases requires eﬃcient retrieval
within the complex event matcher. Real-world com-
plex events can share similar patterns, occur as a clus-
ter of similar events or occur in a hierarchical manner,
which requires generalized, adaptive and scalable spa-
tiotemporal constructs to query such events.
3. Labeling and training samples of visual re-
lations There can be a large number of objects and
possible relations among them in images, which can
result in a large number of categories of relations. It is
diﬃcult to annotate all possible relations and to have
balanced categories of relations in the training data.
For example, Visual Genome [6] has a huge number of relation categories with unbalanced instances of each relation.
4. Consistent integration of knowledge bases
The object labels in datasets for object detection and
entity labels in knowledge bases (e.g. person, human,
man) are not always the same. Similarly, knowledge
bases have diﬀerent labels for the same entity, diﬀer-
ent names for the same attribute (e.g. birthPlace and
placeOfBirth) or relation (e.g. ‘at left’ and ‘to left of’).
This can cause inconsistency or redundancy while in-
tegrating relations from the knowledge bases. It is
important to select the knowledge base and dataset
that are consistent and suitable for the combined use
of both object detection and visual reasoning.
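One common mitigation for the label inconsistencies described in this challenge is normalization to a canonical vocabulary before merging. The synonym table below is an illustrative assumption built from the paper's own examples, not taken from any cited knowledge base:

```python
# Sketch: map synonymous entity/attribute/relation labels onto one
# canonical form so merged knowledge bases stay consistent.
CANONICAL = {"human": "person", "man": "person",
             "birthPlace": "placeOfBirth",
             "at left": "to left of"}

def normalize(triple):
    return tuple(CANONICAL.get(term, term) for term in triple)

def merge(kb_a, kb_b):
    """Merge two triple sets, collapsing synonymous labels to avoid duplicates."""
    return {normalize(t) for t in kb_a} | {normalize(t) for t in kb_b}

kb_a = {("man", "birthPlace", "Galway")}
kb_b = {("person", "placeOfBirth", "Galway")}
print(len(merge(kb_a, kb_b)))  # → 1
```

In practice the mapping table itself is the hard part, typically built via ontology alignment or embedding similarity rather than by hand.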
5. Supporting rare or unseen visual relations
Apart from the common relations, very rare or unseen
relations among objects also appear in certain scenes.
It is nearly impossible to collect suﬃcient training sam-
ples for all possible seen and unseen relations. Han-
dling such relations while evaluating the models is also
a challenge.
6. Temporal processing of objects and relations
The recent methods on this subject address complex
inference tasks by decomposing images or scenes into
objects and visual relations among the objects. The
temporal events and temporal tracking of the detected
objects and predicted relations has not been explored
much, which is crucial for spatiotemporal event pro-
cessing.
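One small step toward the temporal tracking this challenge calls for is collapsing per-frame relation observations into intervals, which is also the representation temporal KGs [17] favour. The boolean-per-frame input format below is an illustrative assumption:

```python
# Sketch: collapse per-frame observations of a relation into
# closed (start_frame, end_frame) intervals.
def relation_intervals(observed):
    """observed: list of booleans, one per frame."""
    intervals, start = [], None
    for i, present in enumerate(observed):
        if present and start is None:
            start = i
        elif not present and start is not None:
            intervals.append((start, i - 1))
            start = None
    if start is not None:
        intervals.append((start, len(observed) - 1))
    return intervals

print(relation_intervals([True, True, False, True]))  # → [(0, 1), (3, 3)]
```

The open problems noted above (identity tracking across frames, noisy detections splitting intervals) sit on top of this basic interval construction.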
Acknowledgement
This work was conducted with the ﬁnancial support
of the Science Foundation Ireland Centre for Research
Training in Artiﬁcial Intelligence under Grant No.
18/CRT/6223.
References
[1] Asra Aslam and Edward Curry. Towards a generalized ap-
proach for deep neural network based event processing for the
internet of multimedia things. IEEE Access, 6(1):25573–25587,
2018.
[2] Piyush Yadav and Edward Curry. Vidcep: Complex event pro-
cessing framework to detect spatiotemporal patterns in video
streams. In 2019 IEEE International Conference on Big Data
(Big Data), pages 2513–2522. IEEE, 2019.
[3] Julia Peyre, Ivan Laptev, Cordelia Schmid, and Josef Sivic. De-
tecting unseen visual relations using analogies. In Proceedings
of the IEEE International Conference on Computer Vision,
pages 1981–1990, 2019.
[4] Ji Zhang, Yannis Kalantidis, Marcus Rohrbach, Manohar
Paluri, Ahmed Elgammal, and Mohamed Elhoseiny. Large-
scale visual relationship understanding. In Proceedings of the
AAAI Conference on Artiﬁcial Intelligence, volume 33, pages
9185–9194, 2019.
[5] Piyush Yadav and Edward Curry. Vekg: Video event knowledge
graph to represent video streams for complex event pattern
matching. In 2019 First International Conference on Graph
Computing (GC), pages 13–20. IEEE, 2019.
[6] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson,
Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalan-
tidis, Li-Jia Li, David A Shamma, et al. Visual genome:
Connecting language and vision using crowdsourced dense im-
age annotations. International Journal of Computer Vision,
123(1):32–73, 2017.
[7] Hai Wan, Jialing Ou, Baoyi Wang, Jianfeng Du, Jeﬀ Z Pan,
and Juan Zeng. Iterative visual relationship detection via com-
monsense knowledge graph. In Joint International Semantic
Technology Conference, pages 210–225. Springer, 2019.
[8] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings,
Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov,
Matteo Malloci, Alexander Kolesnikov, et al. The open images
dataset v4. International Journal of Computer Vision, pages
1–26, 2020.
[9] Thomas Pellissier Tanon, Gerhard Weikum, and Fabian
Suchanek. Yago 4: A reason-able knowledge base. In European
Semantic Web Conference, pages 583–596. Springer, 2020.
[10] Matteo Ruggero Ronchi and Pietro Perona. Describing
common human visual actions in images. arXiv preprint
arXiv:1506.02203, 2015.
[11] Fereshteh Sadeghi, Santosh K. Divvala, and Ali Farhadi. Viske: Visual knowledge extraction and question answering by visual verification of relation phrases. In CVPR 2015, pages 1456–1464.
[12] RDF 1.1 concepts and abstract syntax. W3C Recommendation, 2014.
[13] Xiaojun Chen, Shengbin Jia, and Yang Xiang. A review:
Knowledge reasoning over knowledge graph. Expert Systems
with Applications, 141:112948, 2020.
[14] Weizhuo Li, Guilin Qi, and Qiu Ji. Hybrid reasoning in knowl-
edge graphs: Combing symbolic reasoning and statistical rea-
soning. Semantic Web, (Preprint):1–10, 2020.
[15] Yuan Fang, Kingsley Kuan, Jie Lin, Cheston Tan, and Vi-
jay Chandrasekhar. Object detection meets knowledge graphs.
2017.
[16] Ugur Kursuncu, Manas Gaur, and Amit Sheth. Knowledge in-
fused learning (k-il): Towards deep incorporation of knowledge
in deep learning. arXiv preprint arXiv:1912.00512, 2019.
[17] Alberto García-Durán, Sebastijan Dumančić, and Mathias
Niepert. Learning sequence encoders for temporal knowledge
graph completion. arXiv preprint arXiv:1809.03202, 2018.
[18] Chinnapong Angsuchotmetee, Richard Chbeir, and Yudith
Cardinale. Mssn-onto: An ontology-based approach for ﬂex-
ible event processing in multimedia sensor networks. Future
Generation Computer Systems, 108:1140–1158, 2020.
[19] Asra Aslam and Edward Curry. Reducing response time for
multimedia event processing using domain adaptation. In Pro-
ceedings of the 2020 International Conference on Multimedia
Retrieval, pages 261–265, 2020.
[20] Luca Greco, Pierluigi Ritrovato, and Mario Vento. On the use
of semantic technologies for video analysis. Journal of Ambient
Intelligence and Humanized Computing, 2020.
[21] Yikang Li, Wanli Ouyang, Xiaogang Wang, and Xiao’ou Tang.
Vip-cnn: Visual phrase guided convolutional neural network.
In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 1347–1356, 2017.
[22] Bo Dai, Yuqi Zhang, and Dahua Lin. Detecting visual rela-
tionships with deep relational networks. In Proceedings of the
IEEE conference on computer vision and Pattern recognition,
pages 3076–3086, 2017.
[23] Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and
Richard Socher. Explain yourself! Leveraging language models
for commonsense reasoning. arXiv preprint arXiv:1906.02361,
2019.
[24] Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi.
From recognition to cognition: Visual commonsense reasoning.
In The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), June 2019.
[25] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and
Yejin Choi. Hellaswag: Can a machine really ﬁnish your sen-
tence? In Proceedings of the 57th Annual Meeting of the
Association for Computational Linguistics, 2019.
... Recent years have witnessed the mushroom growth of deep learning techniques. These techniques have been widely applied and shown excellent performance in numerous tasks in the field of computer vision, such as image generation [15,17,39], image enhancement [55,56], face recognition [21,49,60], object detection [24,32], and representation learning [19,20]. For the task of image generation, generative adversarial network (GAN) [11] shows good performance by training a group of networks simultaneously. ...
... Thus, we simplify the U into U by using the distances in several directions to measure the size of U . (19) where N l ( p i ) denotes the neighboring pixel of the pixel p i in the lth direction, M is the number of directions that are calculated on. We set M = 8. ...
Full-text available
Article
Sketch-to-photo face generation has recently gained remarkable attention in computer vision and signal processing communities, because the sketches that employ concise lines are easily available and can describe significant facial attributes conveniently. Most existing sketch-to-photo works fail to maintain geometric structures and improve local details simultaneously, which limits their performance. In this work, we propose a two-stage sketch-to-photo generative adversarial network for face generation. In the first stage, we propose a semantic loss to maintain semantic consistency. In the second stage, we define the similar connected component and propose a color refinement loss to generate fine-grained details. Moreover, we introduce a multi-scale discriminator and design a patch-level local discriminator. We also propose a texture loss to enhance the local fidelity of synthesized images. Experiments show that our proposed method can significantly generate better results while preserving facial attributes than the state-of-the-art methods.
... The scale of infusion of commonsense knowledge varies from shallow to deep infusion in ML models. The use of KGs as commmonsense knowledge source within the stateof-the-art neuro-symbolic approaches [19] is a promising research direction in visual understanding and reasoning. SGG techniques can benefit from the related facts and background knowledge of visual concepts in effectively capturing and interpreting detailed semantics in images. ...
Full-text available
Article
Visual understanding involves detecting objects in a scene and investigating rich semantic relationships between the objects, which is required for downstream visual reasoning tasks. Scene graph is widely used for structured scene representation, however, the performance of scene graph generation for visual reasoning is limited due to challenges posed by imbalanced datasets and insufficient attention towards commonsense knowledge infusion. Most of the existing approaches use statistical or language priors for knowledge infusion. Commonsense knowledge infusion using heterogeneous knowledge graphs can help in improving the accuracy, robustness and generalizability of scene graph generation and enable explainable higher-level reasoning by providing rich and diverse background and factual knowledge about the concepts in visual scenes. In this article, we present the background and applications of scene graph generation and the initial approaches and key challenges in commonsense knowledge infusion using heterogeneous knowledge graphs for visual understanding and reasoning.
... In this direction, Scene Graph Generation (SGG) [46,48,3] has attracted significant attention due to its capability to capture the detailed semantics of visual scenes by modelling objects and their relationships in a structured manner. Graph-based structured image representations like scene graphs are used in a wide range of visual understanding tasks including image reconstruction [11], image captioning [61], Visual Question Answering (VQA) [22,25], image retrieval [55], visual storytelling [54] and multimedia event processing [5,20]. The performance of SGG is compromised by challenges including bias and annotation issues in crowd-sourced datasets [23,7]. ...
Full-text available
Conference Paper
Scene graph generation aims to capture the semantic elements in images by modelling objects and their relationships in a structured manner, which are essential for visual understanding and reasoning tasks including image captioning, visual question answering, multimedia event processing, visual storytelling and image retrieval. The existing scene graph generation approaches provide limited performance and expressiveness for higher-level visual understanding and reasoning. This challenge can be mitigated by leveraging commonsense knowledge, such as related facts and background knowledge, about the semantic elements in scene graphs. In this paper, we propose the infusion of diverse commonsense knowledge about the semantic elements in scene graphs to generate rich and expressive scene graphs using a heterogeneous knowledge source that contains commonsense knowledge consolidated from seven different knowledge bases. The graph embeddings of the object nodes are used to leverage their structural patterns in the knowledge source to compute similarity metrics for graph refinement and enrichment. We performed experimental and comparative analysis on the benchmark Visual Genome dataset, in which the proposed method achieved a higher recall rate (R@K = 29.89, 35.4, 39.12 for K = 20, 50, 100) as compared to the existing state-of-the-art technique (R@K = 25.8, 33.3, 37.8 for K = 20, 50, 100). The qualitative results of the proposed method in a downstream task of image generation showed that more realistic images are generated using the commonsense knowledge-based scene graphs. These results depict the effectiveness of commonsense knowledge infusion in improving the performance and expressiveness of scene graph generation for visual understanding and reasoning tasks.
... The new pre-processing methods like changing color space or extracting features from the images for different classifiers using other pre-trained CNNs and other fine-tuning methods can be used to overcome this method. Also, in the future, the currently evolving tools and techniques, such as UAV-based approaches for improved remote sensing capabilities [52], multi-modal techniques for realtime and in-field analysis [53] and neurosymbolic visual reasoning for explainable predictions [54]. ...
Full-text available
Article
With an increase in the consumption of fruits day by day, the yielding and production around the world are also increasing at a steady rate. Meanwhile, the workforce in the field becomes more challenging, there arises a need for automated solutions to maintain consistent output and quality of the product. An accurate, competent and consistent approach to classifying fruits and other agricultural products in precision agriculture is the foundation for a machine vision system to be successful and cost-effective. In this research work, Convolutional Neural Network (CNN)-based intelligent fruits classification utilizing the bilinear pooling with heterogeneous streams is proposed. The fruits classification problem is viewed as a fine-grained visual classification (FGVC) and the heterogeneous bilinear network is developed and compared with the normal implementations. The proposed CNN network is initialized with ImageNet weights and the pre-trained networks are used as components in the Bilinear Pooling CNN (BP-CNN). The CNNs used in the bilinear network function as feature extractors are then combined using the bilinear pooling function. The proposed BP-CNN-based intelligent classifier is trained and tested with Fruits-360, Imagenet and VegFru which are used by many researchers recently. The performance of the proposed BP-CNN model is validated using various metrics and compared with other existing CNN models. It is found that it outperforms all other methods with a classification accuracy of 99.69% and an F1 score of 0.9968.
... 2) Neurosymbolic Visual Reasoning: The integration of visual and commonsense reasoning in MEP could improve and enhance the expressivity of event rules and queries. This can be achieved by carefully integrating knowledge about entities, relations, and rules from rich knowledge bases via reasoning over multimodal streams [61]. 3) Subjectivity: Traditional event processing engines answer objective queries using the objective attributes of events. ...
Full-text available
Article
Modern distributed computing infrastructure need to process vast quantities of data streams generated by a growing number of participants with information generated in multiple formats. With the Internet of Multimedia Things (IoMT) becoming a reality, new approaches are needed to process realtime multimodal event data streams. Existing approaches to event processing have limited consideration for the challenges of multimodal events, including the need for complex content extraction, increased computational and memory costs. The paper explores event processing as a basis for processing real-time IoMT data. The paper introduces the Multimodal Event Processing (MEP) paradigm, which provides a formal basis for native approaches to neural multimodal content analysis (i.e., computer vision, linguistics, and audition) with symbolic event processing rules to support real-time queries over multimodal data streams using the Multimodal Event Processing Language to express single, primitive multimodal, and complex multimodal event patterns. The content of multimodal streams is represented using Multimodal Event Knowledge Graphs to capture the semantic, spatial, and temporal content of the multimodal streams. The approach is implemented and evaluated within an MEP Engine using single and multimodal queries achieving near real-time performance with a throughput of ∼30 fps and sub-second latency of 0.075-0.30 seconds for video streams of 30 fps input rate. Support for high input stream rates (45 fps) is achieved through content-aware load shedding techniques with a ∼127X latency improvement resulting in only a minor decrease in accuracy.
... The authors focused on the "self-attention" and "cross-attention"mechanism to explore spectral and spatial features from HSI and LiDAR. The authors (Khan, & Curry, 2020) provided insights about the various challenges, prospects of concatenating visual and commonsense reasoning in multimedia event processing, which was achieved by careful integration of learning about entities,relations, and rules from rich knowledge bases. ...
Full-text available
Article
Hyperspectral imaging has shown tremendous growth over the past three decades. Hyperspectral imaging was evolved through remote sensing. Along, with the technological enhancements hyperspectral imaging has outgrown, conquering over other various application areas. In addition to it, data enriched data cubes with abundant spectral and spatial information works as perk for capturing, analyzing, reviewing, and interpreting results from data. This review concentrates on emerging application areas of hyperspectral imaging. Emerging application areas are selected in ways where there is a vast scope for future enhancements by exploiting cutting edge technology, that is, deep learning. Applications of hyperspectral imaging techniques in some selected areas (remote sensing, document forgery, history and archaeology conservation, surveillance and security, machine vision for fruit quality inspection, medical imaging) are focused. The review pivots around the publicly available datasets and features used domain wise. This review can act as a baseline for deep learning and machine vision experts, historical geographers, and scholars by providing them a view of how hyperspectral imaging is implemented in multiple domains along with future research prospects. This article is categorized under: Technologies > Machine Learning Technologies > Prediction
... Furthermore, the stereo reconstruction can benefit from semantic priors induced from scene understanding, such as monocular depth estimation [34], semantic image segmentation [14,61] and neuro-symbolic visual reasoning [30]. The reconstruction performance can also been improved if additional information is available, e.g., polarimetric lighting [6] and hyperspectral images [29,33]. However, these methods are not suitable for 3D reconstruction from small motion clips due to high uncertainty caused by small baselines. ...
Full-text available
Article
Small motion can be induced from burst video clips captured by a handheld camera when the shutter button is pressed. Although uncalibrated burst video clip conveys valuable parallax information, it generally has small baseline between frames, making it difficult to reconstruct 3D scenes. Existing methods usually employ a simplified camera parameterization process with keypoint-based structure from small motion (SFSM), followed by a tailored dense reconstruction. However, such SFSM methods are sensitive to insufficient or unreliable keypoint features, and the subsequent dense reconstruction may fail to recover the detailed surface. In this paper, we propose a robust 3D reconstruction pipeline by leveraging both keypoint and line segment features from video clips to alleviate the uncertainty induced by small baseline. A joint feature-based structure from small motion method is first presented to improve the robustness of the self-calibration with line segment constraints, and then, a noise-aware PatchMatch stereo module is proposed to improve the accuracy of the dense reconstruction. Finally, a confidence weighted fusion process is utilized to further suppress depth noise and mitigate erroneous depth. The proposed method can reduce the failure cases of self-calibration when the keypoints are insufficient, while recovering the detailed 3D surfaces. In comparison with state of the arts, our method achieves more robust and accurate 3D reconstruction results for a variety of challenging scenes.
Chapter
Scene graph generation aims to capture the semantic elements in images by modelling objects and their relationships in a structured manner, which is essential for visual understanding and reasoning tasks including image captioning, visual question answering, multimedia event processing, visual storytelling and image retrieval. Existing scene graph generation approaches provide limited performance and expressiveness for higher-level visual understanding and reasoning. This challenge can be mitigated by leveraging commonsense knowledge, such as related facts and background knowledge, about the semantic elements in scene graphs. In this paper, we propose the infusion of diverse commonsense knowledge about the semantic elements in scene graphs to generate rich and expressive scene graphs, using a heterogeneous knowledge source that consolidates commonsense knowledge from seven different knowledge bases. The graph embeddings of the object nodes are used to leverage their structural patterns in the knowledge source to compute similarity metrics for graph refinement and enrichment. We performed experimental and comparative analysis on the benchmark Visual Genome dataset, in which the proposed method achieved a higher recall rate (R@K = 29.89, 35.4, 39.12 for K = 20, 50, 100) than the existing state-of-the-art technique (R@K = 25.8, 33.3, 37.8 for K = 20, 50, 100). The qualitative results of the proposed method in a downstream image generation task showed that more realistic images are generated from commonsense knowledge-based scene graphs. These results demonstrate the effectiveness of commonsense knowledge infusion in improving the performance and expressiveness of scene graph generation for visual understanding and reasoning tasks.
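The embedding-similarity step this abstract describes can be illustrated with a minimal sketch: object nodes in a scene graph are matched against commonsense concepts by comparing embedding vectors. All names, vectors and the top-k scheme below are illustrative assumptions, not the paper's actual embeddings or refinement procedure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical graph embeddings of a detected object node and of
# candidate concepts from a commonsense knowledge source.
node_embeddings = {"dog": [0.9, 0.1, 0.0]}
kb_embeddings = {
    "animal":  [0.8, 0.2, 0.1],
    "vehicle": [0.0, 0.1, 0.9],
}

def enrich(node, k=1):
    """Attach the k most similar KB concepts to a scene-graph node."""
    scores = {c: cosine(node_embeddings[node], e)
              for c, e in kb_embeddings.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(enrich("dog"))  # → ['animal']
```

In the paper's setting the similarity scores would drive graph refinement and enrichment; here they simply rank candidate concepts.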
Full-text available
Article
The analysis of human facial expressions from thermal images captured by Infrared Thermal Imaging (IRTI) cameras has recently gained importance compared to images captured by standard cameras using light in the visible spectrum. This is because infrared cameras work well in low-light conditions, and the infrared spectrum captures the thermal distribution, which is very useful for applications such as robot interaction systems, quantifying cognitive responses from facial expressions, and disease control. In this paper, a deep learning model called IRFacExNet (InfraRed Facial Expression Network) is proposed for facial expression recognition (FER) from infrared images. It utilizes two building blocks, a Residual unit and a Transformation unit, which extract dominant features from the input images specific to the expressions. The extracted features help to accurately detect the emotion of the subjects under consideration. The snapshot ensemble technique is adopted with a cosine annealing learning rate scheduler to improve the overall performance. The performance of the proposed model has been evaluated on a publicly available dataset, namely the IRDatabase developed by RWTH Aachen University. The facial expressions present in the dataset are Fear, Anger, Contempt, Disgust, Happy, Neutral, Sad, and Surprise. The proposed model achieves 88.43% recognition accuracy, better than some state-of-the-art methods considered here for comparison. Our model provides a robust framework for the detection of accurate expressions in the absence of visible light.
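The cosine-annealing schedule that underpins snapshot ensembling can be sketched as follows: the learning rate restarts to its maximum at each cycle boundary and decays towards zero within the cycle, and a model snapshot is saved at the end of each cycle. The cycle length and base learning rate below are illustrative assumptions, not the paper's training settings.

```python
import math

def cosine_annealing(step, cycle_len, lr_max=0.1):
    """Cosine-annealed learning rate with warm restarts every cycle_len steps."""
    t = step % cycle_len               # position within the current cycle
    return lr_max / 2 * (1 + math.cos(math.pi * t / cycle_len))

# LR restarts at steps 0, 5, 10, ... and decays within each cycle;
# snapshots for the ensemble are taken where the LR is lowest.
lrs = [round(cosine_annealing(s, cycle_len=5), 4) for s in range(10)]
print(lrs)
```

The restarts let a single training run visit several local minima, and averaging the per-cycle snapshots gives the ensemble effect at no extra training cost.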
Article
Clostridium sporogenes spores are used as surrogates for Clostridium botulinum to verify thermal exposure and lethality in sterilization regimes in the food industry. Conventional methods to detect spores are time-consuming and labour-intensive. The objectives of this study were to evaluate the feasibility of using hyperspectral imaging (HSI) and deep learning approaches, firstly to identify dead and live forms of C. sporogenes spores and, secondly, to estimate the concentration of spores on culture media plates and on ready-to-eat mashed potato (food matrix). C. sporogenes spores were inoculated by either spread plating or drop plating on sheep blood agar (SBA) and tryptic soy agar (TSA) plates, and by spread plating on the surface of mashed potato. Reflectance in the spectral range of 547-1701 nm from the region of interest was used for principal component analysis (PCA). PCA was successful in distinguishing dead and live spores and different levels of inoculum (10² to 10⁶ CFU/ml) on both TSA and SBA plates; however, it was not effective on the mashed potato (food matrix). Hence, deep learning classification frameworks, namely 1D-convolutional neural networks (CNN) and a random forest (RF) model, were used. The CNN model outperformed the RF model, and the accuracy of spore quantification was improved by 4% and 8% in the presence and absence of dead spores, respectively. The screening system used in this study, a combination of HSI and deep learning modelling, resulted in an overall accuracy of 90-94% when dead/inactivated spores were present and absent, respectively. The only discrepancy detected was during the prediction of samples with low inoculum levels (< 10² CFU/ml). In summary, it was evident that HSI in combination with a deep learning approach shows immense potential as a tool to detect and quantify spores on nutrient media as well as on a specific food matrix (mashed potato). However, the presence of dead spores in any sample is postulated to affect the accuracy and would require replicates and confirmatory assays.
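The PCA step this abstract applies to reflectance spectra can be sketched with synthetic data: mean-center the spectra and project them onto the leading principal components via an SVD. The band count, class means and noise levels below are invented for illustration, not the study's 547-1701 nm measurements.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic reflectance spectra: 20 "live" and 20 "dead" samples, 50 bands.
live = rng.normal(0.6, 0.02, size=(20, 50))
dead = rng.normal(0.3, 0.02, size=(20, 50))
X = np.vstack([live, dead])

Xc = X - X.mean(axis=0)                      # mean-center each band
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T                       # project onto first 2 PCs

# With a clear mean-reflectance gap, PC1 separates the two groups:
# the group means land on opposite sides of the origin.
print(scores[:20, 0].mean() * scores[20:, 0].mean() < 0)
```

When such a linear projection fails to separate classes, as the abstract reports for the food matrix, a learned classifier (here, a 1D-CNN) takes over.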
Full-text available
Article
The rapid proliferation of smart devices, surveillance cameras, and infrastructures and buildings enhanced with Internet of Things (IoT) technologies has led to a huge explosion of content, especially in the video domain, driving ever-increasing interest in the development of methods and tools for automatic analysis and interpretation of video sequences. Through the years, the availability of contextual knowledge has proven to improve video analysis performance in several ways, although the formal representation of semantic content in a shareable and fusion-oriented manner is still an open problem, particularly considering the recent wide diffusion of Fog and Edge computing architectures for video analytics. In this context, an interesting answer has come from Semantic Web (SW) technologies, which opened a new perspective for so-called Knowledge-Based Computer Vision (KBCV), adding novel analytics opportunities, improving accuracy, and facilitating data exchange between video analysis systems in an open, extensible manner. In this work, we propose a survey of papers from the last eighteen years, dating back to when the first applications of semantic technologies to video analytics appeared. The papers, analyzed from different perspectives to give a comprehensive overview of the technologies involved, reveal an interesting trend towards the adoption of SW technologies for video analytics. As a result of our work, some insights about future challenges are also provided.
Full-text available
Article
Knowledge graphs (KGs) contain rich resources that represent human knowledge of the world. There are mainly two kinds of reasoning techniques over knowledge graphs, symbolic reasoning and statistical reasoning; however, each has its own merits and limitations. It is therefore desirable to combine them to provide hybrid reasoning over a knowledge graph. In this paper, we present the first survey of methods for hybrid reasoning in knowledge graphs. We categorize existing methods based on the applications of their reasoning techniques and introduce their key ideas. Finally, we re-examine the remaining research problems and provide an outlook on future directions for hybrid reasoning in knowledge graphs.
Full-text available
Conference Paper
Video data is highly expressive and has traditionally been very difficult for a machine to interpret. Querying event patterns from video streams is challenging due to their unstructured representation. Middleware systems such as Complex Event Processing (CEP) mine patterns from data streams and send notifications to users in a timely fashion. Current CEP systems have inherent limitations in querying video streams due to their unstructured data model and the lack of an expressive query language. In this work, we focus on a CEP framework where users can define high-level expressive queries over videos to detect a range of spatiotemporal event patterns. In this context, we propose: i) VidCEP, an in-memory, on-the-fly, near real-time complex event matching framework for video streams, which uses a graph-based event representation enabling the detection of high-level semantic concepts from video using cascades of Deep Neural Network models; ii) a Video Event Query Language (VEQL) to express high-level user queries over video streams in CEP; and iii) a complex event matcher to detect spatiotemporal video event patterns by matching expressive user queries over video data. The proposed approach detects spatiotemporal video event patterns with an F-score ranging from 0.66 to 0.89. VidCEP maintains near real-time performance with an average throughput of 70 frames per second for 5 parallel videos with sub-second matching latency.
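The kind of spatiotemporal pattern a VEQL-style query expresses, e.g. "a person followed by a car within a window of frames", can be sketched as a toy sequence matcher over a stream of per-frame detections. The event schema, labels and window size below are illustrative assumptions, not VidCEP's actual query language or matching engine.

```python
from collections import deque

def match_sequence(stream, first, second, window):
    """Yield (i, j) frame-index pairs where label `first` is followed by
    label `second` within `window` frames of the detection stream."""
    recent = deque()  # frame indices where `first` was recently seen
    for j, objects in enumerate(stream):
        # Drop occurrences of `first` that fell out of the window.
        while recent and j - recent[0] > window:
            recent.popleft()
        if second in objects:
            for i in recent:
                yield (i, j)
        if first in objects:
            recent.append(j)

# Per-frame object detections (sets of labels produced by a DNN cascade).
stream = [{"person"}, set(), {"car"}, {"person", "car"}]
print(list(match_sequence(stream, "person", "car", window=3)))
# → [(0, 2), (0, 3)]
```

A real engine would match over graph-based events rather than bare label sets, but the sliding-window logic is the same.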
Full-text available
Conference Paper
Complex Event Processing (CEP) is a paradigm to detect event patterns over streaming data in a timely manner. Presently, CEP systems have inherent limitations in detecting event patterns over video streams due to their data complexity and the lack of a structured data model. Modelling complex events in unstructured data such as videos requires not only detecting objects but also the spatiotemporal relationships among objects. This work introduces a novel video representation technique in which an input video stream is converted to a stream of graphs. We propose the Video Event Knowledge Graph (VEKG), a knowledge graph driven representation of video data. VEKG models video objects as nodes and their relationship interactions as edges over time and space. It creates a semantic knowledge representation of video data derived from the detection of high-level semantic concepts from the video using an ensemble of deep learning models. To optimize run-time system performance, we introduce a graph aggregation method, VEKG-TAG, which provides an aggregated view of VEKG for a given time length. We define a set of operators using event rules which can be used as queries and applied over VEKG graphs to discover complex video patterns. The system achieves an F-score ranging between 0.75 and 0.86 for different patterns when queried over VEKG. In the given experiments, pattern search over VEKG-TAG was 2.3× faster than the baseline.
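The representation this abstract describes, objects as nodes and spatiotemporal relations as edges, plus a time-aggregated view, can be sketched with plain Python sets of triples. The relation names, frames and span-based aggregation below are invented for illustration and only approximate the VEKG-TAG idea.

```python
def frame_graph(detections):
    """Build a per-frame graph as a set of (subject, relation, object) edges."""
    return set(detections)

# A short stream of per-frame graphs (labels and relations are made up).
frames = [
    frame_graph([("person", "left_of", "car")]),
    frame_graph([("person", "left_of", "car"), ("dog", "near", "person")]),
    frame_graph([("dog", "near", "person")]),
]

def aggregate(frames):
    """Aggregated view: map each edge to the (first, last) frame it appears in.
    (A real aggregator would also handle edges that vanish and reappear.)"""
    spans = {}
    for t, g in enumerate(frames):
        for edge in g:
            first, _ = spans.get(edge, (t, t))
            spans[edge] = (first, t)
    return spans

agg = aggregate(frames)
print(agg[("person", "left_of", "car")])  # → (0, 1)
```

Queries then run once over the aggregated spans instead of once per frame, which is where the reported speed-up comes from.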
Article
We present Open Images V4, a dataset of 9.2M images with unified annotations for image classification, object detection and visual relationship detection. The images have a Creative Commons Attribution license that allows sharing and adapting the material, and they have been collected from Flickr without a predefined list of class names or tags, leading to natural class statistics and avoiding an initial design bias. Open Images V4 offers large scale across several dimensions: 30.1M image-level labels for 19.8k concepts, 15.4M bounding boxes for 600 object classes, and 375k visual relationship annotations involving 57 classes. For object detection in particular, we provide 15× more bounding boxes than the next largest datasets (15.4M boxes on 1.9M images). The images often show complex scenes with several objects (8 annotated objects per image on average). We annotated visual relationships between them, which support visual relationship detection, an emerging task that requires structured reasoning. We provide in-depth comprehensive statistics about the dataset, validate the quality of the annotations, study how the performance of several modern models evolves with increasing amounts of training data, and demonstrate two applications made possible by having unified annotations of multiple types coexisting in the same images. We hope that the scale, quality, and variety of Open Images V4 will foster further research and innovation even beyond the areas of image classification, object detection, and visual relationship detection.
Article
Mining valuable hidden knowledge from large-scale data relies on the support of reasoning technology. Knowledge graphs, as a new type of knowledge representation, have gained much attention in natural language processing. Knowledge graphs can effectively organize and represent knowledge so that it can be efficiently utilized in advanced applications. Recently, reasoning over knowledge graphs has become a hot research topic, since it can derive new knowledge and conclusions from existing data. Herein we review the basic concepts and definitions of knowledge reasoning and the methods for reasoning over knowledge graphs. Specifically, we dissect the reasoning methods into three categories: rule-based reasoning, distributed representation-based reasoning and neural network-based reasoning. We also review related applications of knowledge graph reasoning, such as knowledge graph completion, question answering, and recommender systems. Finally, we discuss the remaining challenges and research opportunities for knowledge graph reasoning.
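The rule-based category of knowledge graph reasoning this survey covers can be illustrated with a minimal forward-chaining sketch: a transitivity rule, locatedIn(x, y) ∧ locatedIn(y, z) → locatedIn(x, z), is applied until no new facts are derived. The facts and the relation name are toy examples, not from any particular knowledge base.

```python
facts = {
    ("Galway", "locatedIn", "Ireland"),
    ("Ireland", "locatedIn", "Europe"),
}

def forward_chain(facts):
    """Apply the locatedIn transitivity rule to a fixed point."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        # Join every pair of locatedIn facts that share a middle entity.
        new = {(a, "locatedIn", d)
               for (a, r1, b) in derived if r1 == "locatedIn"
               for (c, r2, d) in derived if r2 == "locatedIn" and b == c}
        if not new <= derived:
            derived |= new
            changed = True
    return derived

closure = forward_chain(facts)
print(("Galway", "locatedIn", "Europe") in closure)  # → True
```

Distributed-representation and neural approaches replace this symbolic join with learned embeddings, trading exactness for the ability to generalize to unseen facts.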