Conference Paper

# Neuro-symbolic Visual Reasoning for Multimedia Event Processing: Overview, Prospects and Challenges

## Abstract and Figures

Efficient multimedia event processing is a key enabler for real-time and complex decision making in streaming media. The need for expressive queries to detect high-level human-understandable spatial and temporal events in multimedia streams is inevitable due to the explosive growth of multimedia data in smart cities and on the internet. The recent work in stream reasoning, event processing and visual reasoning inspires the integration of visual and commonsense reasoning in multimedia event processing, which would improve and enhance multimedia event processing in terms of expressivity of event rules and queries. This can be achieved through careful integration of knowledge about entities, relations and rules from rich knowledge bases via reasoning over multimedia streams within an event processing engine. The prospects of neuro-symbolic visual reasoning within multimedia event processing are promising; however, there are several associated challenges that are highlighted in this paper.
Data Science Institute
National University of Ireland Galway
m.khan12@nuigalway.ie
Edward Curry
Data Science Institute
National University of Ireland Galway
edward.curry@nuigalway.ie
1 Introduction
Internet of multimedia things (IoMT), data analytics
and artiﬁcial intelligence are continuously improving
smart cities and urban environments with their ever-
increasing applications ranging from traffic management to public safety.

Figure 1: (a) Example of a video stream in a smart city. (b) Detection of objects and relations. (c) High-level event of traffic congestion detected as a result of automated reasoning.

As middleware between the internet of things and real-time applications, complex
event processing (CEP) systems process structured
data streams from multiple producers and detect com-
plex events queried by subscribers in real-time. The
enormous increase in image and video content from surveillance cameras and other sources in IoMT applications has posed several challenges in the real-time processing of multimedia events, which has motivated researchers in this
area to extend the existing CEP engines and to devise
new CEP frameworks to support unstructured multi-
media streams. Over the past few years, several eﬀorts
have been made to mitigate the challenges in multime-
dia event processing by developing techniques for ex-
tension of existing CEP engines for multimedia events
[1] and development of end-to-end CEP frameworks
for multimedia streams [2]. On the other hand, the re-
search in computer vision has focused on complementing object detection with human-like visual reasoning
that allows for prediction of meaningful and useful
semantic relations among detected objects based on
analogy and commonsense (CS) knowledge [3, 4].
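A complex event processing loop of the kind described above can be sketched in a few lines. The event schema, window size and toy "congestion" rule below are illustrative assumptions, not the design of any system cited in this paper:

```python
from collections import deque

# A complex-event rule: fire when the count of "vehicle" detections
# inside a sliding window exceeds a threshold (toy "traffic congestion" rule).
def congestion_rule(window, threshold=3):
    vehicles = [e for e in window if e["label"] == "vehicle"]
    return len(vehicles) >= threshold

def process_stream(events, window_size=4, rule=congestion_rule):
    """Slide a count-based window over primitive events and emit complex events."""
    window = deque(maxlen=window_size)
    detections = []
    for event in events:
        window.append(event)
        if rule(window):
            detections.append(event["t"])  # timestamp at which the rule fired
    return detections

stream = [{"t": t, "label": l} for t, l in
          enumerate(["person", "vehicle", "vehicle", "vehicle", "vehicle", "person"])]
print(process_stream(stream))  # → [3, 4, 5]
```

A real CEP engine would additionally handle time-based windows, out-of-order arrival and subscriber notification; the point here is only the shape of the window-plus-rule pattern.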
2 Background
In this paper, we discuss the background, prospects
and challenges related to leveraging the existing vi-
sual and commonsense reasoning to enhance multime-
dia event processing in terms of its applicability and
expressivity of multimedia event queries.

Figure 2: (a) Conceptual-level block diagram of a CEP framework supporting visual reasoning. The input stream of images (or video frames) is received from a publisher; the objects are detected using a DNN and rule-based relations [5] are represented using a graph; automated reasoning then adds new visual relations from a knowledge base [6] and validates those relations using commonsense knowledge [7]. The matcher performs spatial and temporal event matching of these detected objects and relations against the spatial and temporal patterns in the high-level events queried by the subscriber. (b) An example of visual reasoning in multimedia event processing. Suppose a subscriber is interested in the event where a tennis player is either "hitting" or "missing" a shot. This event is not explicitly defined via rules, but it can be predicted via automated reasoning over detected objects and predicted relations. (Image credits: Visual Genome [6])

The motivation for the development of an end-to-end multimedia
event processing system supporting automated rea-
soning over multimedia streams comes from its poten-
tial real-time applications in smart cities, internet and
sports. Fig. 1 shows an example of traﬃc congestion
event detected using visual and commonsense reason-
ing over the objects and relations among the objects in
the video stream. A conceptual level design and a mo-
tivational example of a novel CEP framework support-
ing visual and commonsense reasoning is presented in
Fig. 2.
This section presents a review of the recent work
in stream reasoning, multimedia event processing and
visual reasoning that could be complementary within a
proposed neuro-symbolic multimedia event processing
system with support for visual reasoning.
2.1 Reasoning over Streams and Knowledge
Graph
Emerging from the semantic web, streaming data is
conventionally modelled according to RDF [12], a
graph representation. The real-time processing of
RDF streams is performed in time-dependent windows, i.e., a small part of the stream over which a task needs to
be performed at a certain time instant. Reasoning is
performed by applying RDF Schema rules to the graph
using SPARQL query language or its variants. Reason-
ing over knowledge graphs (KG) provides new relations
among entities to enrich the knowledge graph and im-
prove its applicability [13]. Neuro-symbolic comput-
ing combines symbolic and statistical approaches, i.e.
knowledge is represented in symbolic form, whereas
learning and reasoning are performed by a DNN [14],
which has shown its eﬃcacy in object detection [15] as
well as enhanced feature learning via knowledge infu-
sion in DNN layers from knowledge bases [16]. Tempo-
ral KG allows time-aware representation and tracking
of entities and relations [17].
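The window-based reasoning described above can be illustrated with a minimal RDFS-style subclass rule applied to a windowed triple stream. The class hierarchy, triples and window bounds below are illustrative assumptions:

```python
# Toy RDFS-style reasoning over a time-windowed stream of
# (subject, predicate, object) triples.
SUBCLASS = {"car": "vehicle", "bus": "vehicle", "vehicle": "thing"}

def superclasses(cls):
    """Transitive closure over the subclass chain (rdfs:subClassOf)."""
    while cls in SUBCLASS:
        cls = SUBCLASS[cls]
        yield cls

def infer(window):
    """Apply the RDFS rule: (x, type, C) and C subClassOf D => (x, type, D)."""
    derived = set(window)
    for s, p, o in window:
        if p == "type":
            derived.update((s, "type", sup) for sup in superclasses(o))
    return derived

# Select a window of the first two seconds from a timestamped triple stream.
stream = [(0, ("obj1", "type", "car")), (1, ("obj2", "type", "bus")),
          (5, ("obj3", "type", "person"))]
window = [t for ts, t in stream if 0 <= ts < 2]
print(("obj1", "type", "vehicle") in infer(window))  # → True
```

In practice this rule application is expressed declaratively in SPARQL (or a streaming variant) rather than hand-coded, but the window-select-then-infer structure is the same.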
2.2 Multimedia Event Representation and
Processing
CEP engines inherently lacked support for un-
structured multimedia events, which was mitigated
by a generalized approach for handling multimedia
events as native events in CEP engines as presented
in [1]. Angsuchotmetee et al. [18] presented
an ontological approach for modeling complex events
and multimedia data with syntactic and semantic in-
teroperability in multimedia sensor networks, which
allows subscribers to deﬁne application-speciﬁc com-
plex events while keeping the low-level network rep-
resentation generic. Aslam et al. [19] leveraged do-
main adaptation and online transfer learning in multi-
media event processing to extend support for unknown
events. A knowledge graph is suitable for semantic rep-
resentation and reasoning over video streams due to its
scalability and maintainability [20], as demonstrated
Table 1: Available Knowledge Bases for Visual Reasoning

| Knowledge Base | #Images | #Entity Categories | #Entity Instances | #Relation Categories | #Relation Instances |
|---|---|---|---|---|---|
| Open Images V4 [8] | 9,200,000 | 600 | 15,400,000 | 57 | 375,000 |
| YAGO 4 [9] | – | 10,124 | 64,000,000 | – | 2 billion |
| Visual Genome [6] | 108,077 | 33,877 | 3,843,636 | 42,374 | 2,269,617 |
| COCO-a [10] | 10,000 | 81 | 74,000 | 156 | 207,000 |
| VisKE [11] | 1,884 | – | – | 1,158 | 12,593 |
in [5]. VidCEP [2], a CEP framework for detection of
spatiotemporal video events expressed by subscriber-
deﬁned queries, includes a graph-based representation,
Video Event Query Language (VEQL) and a complex
event matcher for video data.
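A graph-based frame representation of the kind used by VidCEP [2] and VEKG [5] can be approximated as follows. This is a hedged sketch: the detection schema, bounding-box convention (x, y, w, h) and the single rule-based spatial relation are illustrative assumptions, not the cited systems' actual data model:

```python
# Sketch: detected objects become graph nodes, and a simple rule derives
# a spatial relation between their bounding boxes.
def left_of(box_a, box_b):
    """Rule-based spatial relation: A's right edge lies left of B's left edge."""
    return box_a[0] + box_a[2] <= box_b[0]

def frame_to_graph(detections):
    nodes = {d["id"]: d["label"] for d in detections}
    edges = [(a["id"], "left_of", b["id"])
             for a in detections for b in detections
             if a["id"] != b["id"] and left_of(a["box"], b["box"])]
    return {"nodes": nodes, "edges": edges}

frame = [{"id": "o1", "label": "car", "box": (0, 0, 40, 20)},
         {"id": "o2", "label": "truck", "box": (60, 0, 50, 30)}]
print(frame_to_graph(frame)["edges"])  # → [('o1', 'left_of', 'o2')]
```

The paper's proposal is precisely to augment such rule-derived edges with relations inferred from knowledge bases, since rules like `left_of` cannot cover semantic relations.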
2.3 Visual and Commonsense Reasoning
In addition to the objects and their attributes in im-
ages, detection of relations among these objects is cru-
cial for scene understanding for which compositional
models [21], visual phrase models [11] and DNN based
relational networks [22] are available. Visual and se-
mantic embeddings aid large scale visual relation de-
tection, such as Zhang et al. [4] employed both visual
and textual features to leverage the interactions be-
tween objects for relation detection. Similarly, Peyre
et al. [3] added a visual phrase embedding space
during learning to enable analogical reasoning for un-
seen relations and to improve robustness to appear-
ance variations of visual relations. Table 1 presents
some knowledge bases publicly available for visual rea-
soning. Wan et al. [7] proposed the use of common-
sense knowledge graph along with the visual features
to enhance visual relation detection. Rajani et al.
[23] leverage human reasoning and language models
to generate human-like explanations for DNN-based
commonsense question answering. There are various
commonsense reasoning methods and datasets avail-
able for visual commonsense reasoning [24] and story
completion [25].
3 Neuro-symbolic Visual Reasoning in
Multimedia Event Processing
3.1 Prospects
The current multimedia event representation methods
use knowledge graph to represent the detected objects,
their attributes and relations among the objects in
video streams. Pre-deﬁned spatial-temporal rules are
used to form relations among the objects. However,
the complex relations that exist among real-world ob-
jects also depend on semantic facts and situational
variables that cannot be explicitly specified for every
possible event as rules. The statistical reasoning meth-
ods and knowledge bases discussed in Section 2 have
great potential to complement the rule-based relation
formation in multimedia event processing by inject-
ing some semantic knowledge and reasoning to extract
more semantically meaningful relations among objects.
This advancement will allow subscribers to deﬁne ab-
stract or high-level human-understandable event query
rules that can be decomposed into spatial and tem-
poral patterns. The spatio-temporal matching of the
queried high-level events will be performed on the ob-
jects, rule-based relations and relations extracted us-
ing visual reasoning. The subscriber will be instantly
notiﬁed of the high-level event as a combined detec-
tion of those spatial-temporal patterns. The idea of
developing an end-to-end multimedia event processing
system supporting visual reasoning over video streams
(Fig. 2) poses several challenges that are discussed
in the next section. This novel approach will give
more expressive power to subscribers in querying com-
plex events in multimedia streams, and thus increase
the scope of real-time applications of multimedia event
processing in smart city applications as well as internet
media streaming applications.
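The query decomposition and spatio-temporal matching described above can be sketched as follows. The query structure (a spatial pattern that must persist over a temporal window of consecutive frames) and the relation tuples are illustrative assumptions about how such a matcher might be shaped:

```python
# Sketch: match a high-level query decomposed into a spatial pattern
# (a relation that must hold) and a temporal pattern (it must persist
# for k consecutive frames).
def match_query(frames, relation, min_consecutive):
    """Return True if `relation` holds in `min_consecutive` consecutive frames."""
    run = 0
    for graph in frames:
        run = run + 1 if relation in graph else 0
        if run >= min_consecutive:
            return True
    return False

# Per-frame relation sets produced by detection and reasoning (assumed input).
frames = [{("player", "hitting", "ball")},
          {("player", "hitting", "ball")},
          {("player", "near", "net")}]
print(match_query(frames, ("player", "hitting", "ball"), 2))  # → True
```

A full matcher would support conjunctions, sequencing and negation of such patterns; the sketch shows only the smallest spatio-temporal building block.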
3.2 Challenges
1. Suitable representation for reasoning It is cru-
cial to select a generalized and scalable model to rep-
resent events and eﬀectively perform automated rea-
soning to derive more meaningful and expressive spa-
tiotemporal events.
2. Expressive query deﬁnition and matching
Providing a generic and human-friendly format to sub-
scribers for writing expressive and high-level queries
would require new constructs. Matching queries with
the low-level events and relations along with reason-
ing via knowledge bases requires eﬃcient retrieval
within the complex event matcher. Real-world com-
plex events can share similar patterns, occur as a clus-
ter of similar events or occur in a hierarchical manner,
which requires generalized, adaptive and scalable spa-
tiotemporal constructs to query such events.
3. Labeling and training samples of visual re-
lations There can be a large number of objects and
possible relations among them in images, which can
result in a large number of categories of relations. It is
diﬃcult to annotate all possible relations and to have
balanced categories of relations in the training data.
For example, Visual Genome [6] has a huge number of relation categories with unbalanced instances of each relation.
4. Consistent integration of knowledge bases
The object labels in datasets for object detection and
entity labels in knowledge bases (e.g. person, human,
man) are not always the same. Similarly, knowledge
bases have diﬀerent labels for the same entity, diﬀer-
ent names for the same attribute (e.g. birthPlace and
placeOfBirth) or relation (e.g. ‘at left’ and ‘to left of’).
This can cause inconsistency or redundancy while in-
tegrating relations from the knowledge bases. It is
important to select the knowledge base and dataset
that are consistent and suitable for the combined use
of both object detection and visual reasoning.
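One common mitigation for the label inconsistencies described in this challenge is normalization to a canonical vocabulary before merging. The synonym table below is an illustrative assumption built from the paper's own examples, not taken from any cited knowledge base:

```python
# Sketch: map synonymous entity/attribute/relation labels onto one
# canonical form so merged knowledge bases stay consistent.
CANONICAL = {"human": "person", "man": "person",
             "birthPlace": "placeOfBirth",
             "at left": "to left of"}

def normalize(triple):
    return tuple(CANONICAL.get(term, term) for term in triple)

def merge(kb_a, kb_b):
    """Merge two triple sets, collapsing synonymous labels to avoid duplicates."""
    return {normalize(t) for t in kb_a} | {normalize(t) for t in kb_b}

kb_a = {("man", "birthPlace", "Galway")}
kb_b = {("person", "placeOfBirth", "Galway")}
print(len(merge(kb_a, kb_b)))  # → 1
```

In practice the mapping table itself is the hard part, typically built via ontology alignment or embedding similarity rather than by hand.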
5. Supporting rare or unseen visual relations
Apart from the common relations, very rare or unseen
relations among objects also appear in certain scenes.
It is nearly impossible to collect suﬃcient training sam-
ples for all possible seen and unseen relations. Han-
dling such relations while evaluating the models is also
a challenge.
6. Temporal processing of objects and relations
The recent methods on this subject address complex
inference tasks by decomposing images or scenes into
objects and visual relations among the objects. The
temporal events and temporal tracking of the detected
objects and predicted relations has not been explored
much, which is crucial for spatiotemporal event pro-
cessing.
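One small step toward the temporal tracking this challenge calls for is collapsing per-frame relation observations into intervals, which is also the representation temporal KGs [17] favour. The boolean-per-frame input format below is an illustrative assumption:

```python
# Sketch: collapse per-frame observations of a relation into
# closed (start_frame, end_frame) intervals.
def relation_intervals(observed):
    """observed: list of booleans, one per frame."""
    intervals, start = [], None
    for i, present in enumerate(observed):
        if present and start is None:
            start = i
        elif not present and start is not None:
            intervals.append((start, i - 1))
            start = None
    if start is not None:
        intervals.append((start, len(observed) - 1))
    return intervals

print(relation_intervals([True, True, False, True]))  # → [(0, 1), (3, 3)]
```

The open problems noted above (identity tracking across frames, noisy detections splitting intervals) sit on top of this basic interval construction.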
Acknowledgement
This work was conducted with the ﬁnancial support
of the Science Foundation Ireland Centre for Research
Training in Artiﬁcial Intelligence under Grant No.
18/CRT/6223.
References
[1] Asra Aslam and Edward Curry. Towards a generalized ap-
proach for deep neural network based event processing for the
internet of multimedia things. IEEE Access, 6(1):25573–25587,
2018.
[2] Piyush Yadav and Edward Curry. Vidcep: Complex event pro-
cessing framework to detect spatiotemporal patterns in video
streams. In 2019 IEEE International Conference on Big Data
(Big Data), pages 2513–2522. IEEE, 2019.
[3] Julia Peyre, Ivan Laptev, Cordelia Schmid, and Josef Sivic. De-
tecting unseen visual relations using analogies. In Proceedings
of the IEEE International Conference on Computer Vision,
pages 1981–1990, 2019.
[4] Ji Zhang, Yannis Kalantidis, Marcus Rohrbach, Manohar
Paluri, Ahmed Elgammal, and Mohamed Elhoseiny. Large-
scale visual relationship understanding. In Proceedings of the
AAAI Conference on Artiﬁcial Intelligence, volume 33, pages
9185–9194, 2019.
[5] Piyush Yadav and Edward Curry. Vekg: Video event knowledge
graph to represent video streams for complex event pattern
matching. In 2019 First International Conference on Graph
Computing (GC), pages 13–20. IEEE, 2019.
[6] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson,
Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalan-
tidis, Li-Jia Li, David A Shamma, et al. Visual genome:
Connecting language and vision using crowdsourced dense im-
age annotations. International Journal of Computer Vision,
123(1):32–73, 2017.
[7] Hai Wan, Jialing Ou, Baoyi Wang, Jianfeng Du, Jeﬀ Z Pan,
and Juan Zeng. Iterative visual relationship detection via com-
monsense knowledge graph. In Joint International Semantic
Technology Conference, pages 210–225. Springer, 2019.
[8] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings,
Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov,
Matteo Malloci, Alexander Kolesnikov, et al. The open images
dataset v4. International Journal of Computer Vision, pages
1–26, 2020.
[9] Thomas Pellissier Tanon, Gerhard Weikum, and Fabian
Suchanek. Yago 4: A reason-able knowledge base. In European
Semantic Web Conference, pages 583–596. Springer, 2020.
[10] Matteo Ruggero Ronchi and Pietro Perona. Describing
common human visual actions in images. arXiv preprint
arXiv:1506.02203, 2015.
[11] Fereshteh Sadeghi, Santosh K. Divvala, and Ali Farhadi. Viske: Visual knowledge extraction and question answering by visual verification of relation phrases. In CVPR 2015, pages 1456–1464.
[12] RDF 1.1 concepts and abstract syntax. W3C Recommendation, 2014.
[13] Xiaojun Chen, Shengbin Jia, and Yang Xiang. A review:
Knowledge reasoning over knowledge graph. Expert Systems
with Applications, 141:112948, 2020.
[14] Weizhuo Li, Guilin Qi, and Qiu Ji. Hybrid reasoning in knowl-
edge graphs: Combing symbolic reasoning and statistical rea-
soning. Semantic Web, (Preprint):1–10, 2020.
[15] Yuan Fang, Kingsley Kuan, Jie Lin, Cheston Tan, and Vi-
jay Chandrasekhar. Object detection meets knowledge graphs.
2017.
[16] Ugur Kursuncu, Manas Gaur, and Amit Sheth. Knowledge in-
fused learning (k-il): Towards deep incorporation of knowledge
in deep learning. arXiv preprint arXiv:1912.00512, 2019.
[17] Alberto García-Durán, Sebastijan Dumančić, and Mathias
Niepert. Learning sequence encoders for temporal knowledge
graph completion. arXiv preprint arXiv:1809.03202, 2018.
[18] Chinnapong Angsuchotmetee, Richard Chbeir, and Yudith
Cardinale. Mssn-onto: An ontology-based approach for ﬂex-
ible event processing in multimedia sensor networks. Future
Generation Computer Systems, 108:1140–1158, 2020.
[19] Asra Aslam and Edward Curry. Reducing response time for
multimedia event processing using domain adaptation. In Pro-
ceedings of the 2020 International Conference on Multimedia
Retrieval, pages 261–265, 2020.
[20] Luca Greco, Pierluigi Ritrovato, and Mario Vento. On the use
of semantic technologies for video analysis. Journal of Ambient
Intelligence and Humanized Computing, 2020.
[21] Yikang Li, Wanli Ouyang, Xiaogang Wang, and Xiao’ou Tang.
Vip-cnn: Visual phrase guided convolutional neural network.
In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 1347–1356, 2017.
[22] Bo Dai, Yuqi Zhang, and Dahua Lin. Detecting visual rela-
tionships with deep relational networks. In Proceedings of the
IEEE conference on computer vision and Pattern recognition,
pages 3076–3086, 2017.
[23] Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and
Richard Socher. Explain yourself! Leveraging language models
for commonsense reasoning. arXiv preprint arXiv:1906.02361,
2019.
[24] Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi.
From recognition to cognition: Visual commonsense reasoning.
In The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), June 2019.
[25] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and
Yejin Choi. Hellaswag: Can a machine really ﬁnish your sen-
tence? In Proceedings of the 57th Annual Meeting of the
Association for Computational Linguistics, 2019.
... Recent years have witnessed the mushroom growth of deep learning techniques. These techniques have been widely applied and shown excellent performance in numerous tasks in the field of computer vision, such as image generation [15,17,39], image enhancement [55,56], face recognition [21,49,60], object detection [24,32], and representation learning [19,20]. For the task of image generation, generative adversarial network (GAN) [11] shows good performance by training a group of networks simultaneously. ...
... Thus, we simplify the U into U by using the distances in several directions to measure the size of U . (19) where N l ( p i ) denotes the neighboring pixel of the pixel p i in the lth direction, M is the number of directions that are calculated on. We set M = 8. ...
Full-text available
Article
Sketch-to-photo face generation has recently gained remarkable attention in computer vision and signal processing communities, because the sketches that employ concise lines are easily available and can describe significant facial attributes conveniently. Most existing sketch-to-photo works fail to maintain geometric structures and improve local details simultaneously, which limits their performance. In this work, we propose a two-stage sketch-to-photo generative adversarial network for face generation. In the first stage, we propose a semantic loss to maintain semantic consistency. In the second stage, we define the similar connected component and propose a color refinement loss to generate fine-grained details. Moreover, we introduce a multi-scale discriminator and design a patch-level local discriminator. We also propose a texture loss to enhance the local fidelity of synthesized images. Experiments show that our proposed method can significantly generate better results while preserving facial attributes than the state-of-the-art methods.
... The scale of infusion of commonsense knowledge varies from shallow to deep infusion in ML models. The use of KGs as commmonsense knowledge source within the stateof-the-art neuro-symbolic approaches [19] is a promising research direction in visual understanding and reasoning. SGG techniques can benefit from the related facts and background knowledge of visual concepts in effectively capturing and interpreting detailed semantics in images. ...
Full-text available
Article
Visual understanding involves detecting objects in a scene and investigating rich semantic relationships between the objects, which is required for downstream visual reasoning tasks. Scene graph is widely used for structured scene representation, however, the performance of scene graph generation for visual reasoning is limited due to challenges posed by imbalanced datasets and insufficient attention towards commonsense knowledge infusion. Most of the existing approaches use statistical or language priors for knowledge infusion. Commonsense knowledge infusion using heterogeneous knowledge graphs can help in improving the accuracy, robustness and generalizability of scene graph generation and enable explainable higher-level reasoning by providing rich and diverse background and factual knowledge about the concepts in visual scenes. In this article, we present the background and applications of scene graph generation and the initial approaches and key challenges in commonsense knowledge infusion using heterogeneous knowledge graphs for visual understanding and reasoning.
... In this direction, Scene Graph Generation (SGG) [46,48,3] has attracted significant attention due to its capability to capture the detailed semantics of visual scenes by modelling objects and their relationships in a structured manner. Graph-based structured image representations like scene graphs are used in a wide range of visual understanding tasks including image reconstruction [11], image captioning [61], Visual Question Answering (VQA) [22,25], image retrieval [55], visual storytelling [54] and multimedia event processing [5,20]. The performance of SGG is compromised by challenges including bias and annotation issues in crowd-sourced datasets [23,7]. ...
Full-text available
Conference Paper
Scene graph generation aims to capture the semantic elements in images by modelling objects and their relationships in a structured manner, which are essential for visual understanding and reasoning tasks including image captioning, visual question answering, multimedia event processing, visual storytelling and image retrieval. The existing scene graph generation approaches provide limited performance and expressiveness for higher-level visual understanding and reasoning. This challenge can be mitigated by leveraging commonsense knowledge, such as related facts and background knowledge, about the semantic elements in scene graphs. In this paper, we propose the infusion of diverse commonsense knowledge about the semantic elements in scene graphs to generate rich and expressive scene graphs using a heterogeneous knowledge source that contains commonsense knowledge consolidated from seven different knowledge bases. The graph embeddings of the object nodes are used to leverage their structural patterns in the knowledge source to compute similarity metrics for graph refinement and enrichment. We performed experimental and comparative analysis on the benchmark Visual Genome dataset, in which the proposed method achieved a higher recall rate (R@K = 29.89, 35.4, 39.12 for K = 20, 50, 100) as compared to the existing state-of-the-art technique (R@K = 25.8, 33.3, 37.8 for K = 20, 50, 100). The qualitative results of the proposed method in a downstream task of image generation showed that more realistic images are generated using the commonsense knowledge-based scene graphs. These results depict the effectiveness of commonsense knowledge infusion in improving the performance and expressiveness of scene graph generation for visual understanding and reasoning tasks.
... The new pre-processing methods like changing color space or extracting features from the images for different classifiers using other pre-trained CNNs and other fine-tuning methods can be used to overcome this method. Also, in the future, the currently evolving tools and techniques, such as UAV-based approaches for improved remote sensing capabilities [52], multi-modal techniques for realtime and in-field analysis [53] and neurosymbolic visual reasoning for explainable predictions [54]. ...
Full-text available
Article
With an increase in the consumption of fruits day by day, the yielding and production around the world are also increasing at a steady rate. Meanwhile, the workforce in the field becomes more challenging, there arises a need for automated solutions to maintain consistent output and quality of the product. An accurate, competent and consistent approach to classifying fruits and other agricultural products in precision agriculture is the foundation for a machine vision system to be successful and cost-effective. In this research work, Convolutional Neural Network (CNN)-based intelligent fruits classification utilizing the bilinear pooling with heterogeneous streams is proposed. The fruits classification problem is viewed as a fine-grained visual classification (FGVC) and the heterogeneous bilinear network is developed and compared with the normal implementations. The proposed CNN network is initialized with ImageNet weights and the pre-trained networks are used as components in the Bilinear Pooling CNN (BP-CNN). The CNNs used in the bilinear network function as feature extractors are then combined using the bilinear pooling function. The proposed BP-CNN-based intelligent classifier is trained and tested with Fruits-360, Imagenet and VegFru which are used by many researchers recently. The performance of the proposed BP-CNN model is validated using various metrics and compared with other existing CNN models. It is found that it outperforms all other methods with a classification accuracy of 99.69% and an F1 score of 0.9968.
... 2) Neurosymbolic Visual Reasoning: The integration of visual and commonsense reasoning in MEP could improve and enhance the expressivity of event rules and queries. This can be achieved by carefully integrating knowledge about entities, relations, and rules from rich knowledge bases via reasoning over multimodal streams [61]. 3) Subjectivity: Traditional event processing engines answer objective queries using the objective attributes of events. ...
Full-text available
Article
Modern distributed computing infrastructure need to process vast quantities of data streams generated by a growing number of participants with information generated in multiple formats. With the Internet of Multimedia Things (IoMT) becoming a reality, new approaches are needed to process realtime multimodal event data streams. Existing approaches to event processing have limited consideration for the challenges of multimodal events, including the need for complex content extraction, increased computational and memory costs. The paper explores event processing as a basis for processing real-time IoMT data. The paper introduces the Multimodal Event Processing (MEP) paradigm, which provides a formal basis for native approaches to neural multimodal content analysis (i.e., computer vision, linguistics, and audition) with symbolic event processing rules to support real-time queries over multimodal data streams using the Multimodal Event Processing Language to express single, primitive multimodal, and complex multimodal event patterns. The content of multimodal streams is represented using Multimodal Event Knowledge Graphs to capture the semantic, spatial, and temporal content of the multimodal streams. The approach is implemented and evaluated within an MEP Engine using single and multimodal queries achieving near real-time performance with a throughput of ∼30 fps and sub-second latency of 0.075-0.30 seconds for video streams of 30 fps input rate. Support for high input stream rates (45 fps) is achieved through content-aware load shedding techniques with a ∼127X latency improvement resulting in only a minor decrease in accuracy.
... The authors focused on the "self-attention" and "cross-attention"mechanism to explore spectral and spatial features from HSI and LiDAR. The authors (Khan, & Curry, 2020) provided insights about the various challenges, prospects of concatenating visual and commonsense reasoning in multimedia event processing, which was achieved by careful integration of learning about entities,relations, and rules from rich knowledge bases. ...
Full-text available
Article
Hyperspectral imaging has shown tremendous growth over the past three decades. Hyperspectral imaging was evolved through remote sensing. Along, with the technological enhancements hyperspectral imaging has outgrown, conquering over other various application areas. In addition to it, data enriched data cubes with abundant spectral and spatial information works as perk for capturing, analyzing, reviewing, and interpreting results from data. This review concentrates on emerging application areas of hyperspectral imaging. Emerging application areas are selected in ways where there is a vast scope for future enhancements by exploiting cutting edge technology, that is, deep learning. Applications of hyperspectral imaging techniques in some selected areas (remote sensing, document forgery, history and archaeology conservation, surveillance and security, machine vision for fruit quality inspection, medical imaging) are focused. The review pivots around the publicly available datasets and features used domain wise. This review can act as a baseline for deep learning and machine vision experts, historical geographers, and scholars by providing them a view of how hyperspectral imaging is implemented in multiple domains along with future research prospects. This article is categorized under: Technologies > Machine Learning Technologies > Prediction
... Furthermore, the stereo reconstruction can benefit from semantic priors induced from scene understanding, such as monocular depth estimation [34], semantic image segmentation [14,61] and neuro-symbolic visual reasoning [30]. The reconstruction performance can also been improved if additional information is available, e.g., polarimetric lighting [6] and hyperspectral images [29,33]. However, these methods are not suitable for 3D reconstruction from small motion clips due to high uncertainty caused by small baselines. ...
Full-text available
Article
Small motion can be induced from burst video clips captured by a handheld camera when the shutter button is pressed. Although uncalibrated burst video clip conveys valuable parallax information, it generally has small baseline between frames, making it difficult to reconstruct 3D scenes. Existing methods usually employ a simplified camera parameterization process with keypoint-based structure from small motion (SFSM), followed by a tailored dense reconstruction. However, such SFSM methods are sensitive to insufficient or unreliable keypoint features, and the subsequent dense reconstruction may fail to recover the detailed surface. In this paper, we propose a robust 3D reconstruction pipeline by leveraging both keypoint and line segment features from video clips to alleviate the uncertainty induced by small baseline. A joint feature-based structure from small motion method is first presented to improve the robustness of the self-calibration with line segment constraints, and then, a noise-aware PatchMatch stereo module is proposed to improve the accuracy of the dense reconstruction. Finally, a confidence weighted fusion process is utilized to further suppress depth noise and mitigate erroneous depth. The proposed method can reduce the failure cases of self-calibration when the keypoints are insufficient, while recovering the detailed 3D surfaces. In comparison with state of the arts, our method achieves more robust and accurate 3D reconstruction results for a variety of challenging scenes.
Chapter
Scene graph generation aims to capture the semantic elements in images by modelling objects and their relationships in a structured manner, which is essential for visual understanding and reasoning tasks including image captioning, visual question answering, multimedia event processing, visual storytelling and image retrieval. Existing scene graph generation approaches provide limited performance and expressiveness for higher-level visual understanding and reasoning. This challenge can be mitigated by leveraging commonsense knowledge, such as related facts and background knowledge, about the semantic elements in scene graphs. In this paper, we propose the infusion of diverse commonsense knowledge about the semantic elements in scene graphs to generate rich and expressive scene graphs, using a heterogeneous knowledge source that consolidates commonsense knowledge from seven different knowledge bases. The graph embeddings of the object nodes are used to leverage their structural patterns in the knowledge source to compute similarity metrics for graph refinement and enrichment. We performed experimental and comparative analysis on the benchmark Visual Genome dataset, in which the proposed method achieved a higher recall rate (R@K = 29.89, 35.4, 39.12 for K = 20, 50, 100) than the existing state-of-the-art technique (R@K = 25.8, 33.3, 37.8 for K = 20, 50, 100). The qualitative results of the proposed method in a downstream image generation task showed that more realistic images are generated from commonsense knowledge-based scene graphs. These results demonstrate the effectiveness of commonsense knowledge infusion in improving the performance and expressiveness of scene graph generation for visual understanding and reasoning tasks.
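The embedding-similarity step this abstract describes can be illustrated with a minimal sketch: object nodes in a scene graph are matched against commonsense concepts by comparing embedding vectors. All names, vectors and the top-k scheme below are illustrative assumptions, not the paper's actual embeddings or refinement procedure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical graph embeddings of a detected object node and of
# candidate concepts from a commonsense knowledge source.
node_embeddings = {"dog": [0.9, 0.1, 0.0]}
kb_embeddings = {
    "animal":  [0.8, 0.2, 0.1],
    "vehicle": [0.0, 0.1, 0.9],
}

def enrich(node, k=1):
    """Attach the k most similar KB concepts to a scene-graph node."""
    scores = {c: cosine(node_embeddings[node], e)
              for c, e in kb_embeddings.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(enrich("dog"))  # → ['animal']
```

In the paper's setting the similarity scores would drive graph refinement and enrichment; here they simply rank candidate concepts.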
Full-text available
Article
The analysis of human facial expressions from thermal images captured by Infrared Thermal Imaging (IRTI) cameras has recently gained importance compared to images captured by standard cameras using light in the visible spectrum. This is because infrared cameras work well in low-light conditions, and the infrared spectrum captures the thermal distribution, which is very useful for applications such as robot interaction systems, quantifying cognitive responses from facial expressions, and disease control. In this paper, a deep learning model called IRFacExNet (InfraRed Facial Expression Network) is proposed for facial expression recognition (FER) from infrared images. It utilizes two building blocks, a Residual unit and a Transformation unit, which extract dominant features from the input images specific to the expressions. The extracted features help to accurately detect the emotion of the subjects under consideration. The snapshot ensemble technique is adopted with a cosine annealing learning rate scheduler to improve the overall performance. The performance of the proposed model has been evaluated on a publicly available dataset, namely the IRDatabase developed by RWTH Aachen University. The facial expressions present in the dataset are Fear, Anger, Contempt, Disgust, Happy, Neutral, Sad, and Surprise. The proposed model achieves 88.43% recognition accuracy, better than some state-of-the-art methods considered here for comparison. Our model provides a robust framework for the detection of accurate expressions in the absence of visible light.
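The cosine-annealing schedule that underpins snapshot ensembling can be sketched as follows: the learning rate restarts to its maximum at each cycle boundary and decays towards zero within the cycle, and a model snapshot is saved at the end of each cycle. The cycle length and base learning rate below are illustrative assumptions, not the paper's training settings.

```python
import math

def cosine_annealing(step, cycle_len, lr_max=0.1):
    """Cosine-annealed learning rate with warm restarts every cycle_len steps."""
    t = step % cycle_len               # position within the current cycle
    return lr_max / 2 * (1 + math.cos(math.pi * t / cycle_len))

# LR restarts at steps 0, 5, 10, ... and decays within each cycle;
# snapshots for the ensemble are taken where the LR is lowest.
lrs = [round(cosine_annealing(s, cycle_len=5), 4) for s in range(10)]
print(lrs)
```

The restarts let a single training run visit several local minima, and averaging the per-cycle snapshots gives the ensemble effect at no extra training cost.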
Article
Clostridium sporogenes spores are used as surrogates for Clostridium botulinum to verify thermal exposure and lethality in sterilization regimes in the food industry. Conventional methods to detect spores are time-consuming and labour-intensive. The objectives of this study were to evaluate the feasibility of using hyperspectral imaging (HSI) and deep learning approaches, firstly to identify dead and live forms of C. sporogenes spores and, secondly, to estimate the concentration of spores on culture media plates and on ready-to-eat mashed potato (food matrix). C. sporogenes spores were inoculated by either spread plating or drop plating on sheep blood agar (SBA) and tryptic soy agar (TSA) plates, and by spread plating on the surface of mashed potato. Reflectance in the spectral range of 547-1701 nm from the region of interest was used for principal component analysis (PCA). PCA was successful in distinguishing dead and live spores and different levels of inoculum (10² to 10⁶ CFU/ml) on both TSA and SBA plates; however, it was not effective on the mashed potato (food matrix). Hence, deep learning classification frameworks, namely 1D-convolutional neural networks (CNN) and a random forest (RF) model, were used. The CNN model outperformed the RF model, and the accuracy of spore quantification was improved by 4% and 8% in the presence and absence of dead spores, respectively. The screening system used in this study, a combination of HSI and deep learning modelling, resulted in an overall accuracy of 90-94% when dead/inactivated spores were present and absent, respectively. The only discrepancy detected was during the prediction of samples with low inoculum levels (< 10² CFU/ml). In summary, it was evident that HSI in combination with a deep learning approach shows immense potential as a tool to detect and quantify spores on nutrient media as well as on a specific food matrix (mashed potato). However, the presence of dead spores in any sample is postulated to affect the accuracy and would require replicates and confirmatory assays.
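The PCA step this abstract applies to reflectance spectra can be sketched with synthetic data: mean-center the spectra and project them onto the leading principal components via an SVD. The band count, class means and noise levels below are invented for illustration, not the study's 547-1701 nm measurements.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic reflectance spectra: 20 "live" and 20 "dead" samples, 50 bands.
live = rng.normal(0.6, 0.02, size=(20, 50))
dead = rng.normal(0.3, 0.02, size=(20, 50))
X = np.vstack([live, dead])

Xc = X - X.mean(axis=0)                      # mean-center each band
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T                       # project onto first 2 PCs

# With a clear mean-reflectance gap, PC1 separates the two groups:
# the group means land on opposite sides of the origin.
print(scores[:20, 0].mean() * scores[20:, 0].mean() < 0)
```

When such a linear projection fails to separate classes, as the abstract reports for the food matrix, a learned classifier (here, a 1D-CNN) takes over.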
Full-text available
Article
The rapid proliferation of smart devices, surveillance cameras, and infrastructures and buildings enhanced with Internet of Things (IoT) technologies has led to a huge explosion of content, especially in the video domain, driving ever-increasing interest in the development of methods and tools for automatic analysis and interpretation of video sequences. Through the years, the availability of contextual knowledge has proven to improve video analysis performance in several ways, although the formal representation of semantic content in a shareable and fusion-oriented manner is still an open problem, particularly considering the recent wide diffusion of Fog and Edge computing architectures for video analytics. In this context, an interesting answer has come from Semantic Web (SW) technologies, which opened a new perspective for so-called Knowledge-Based Computer Vision (KBCV), adding novel analytics opportunities, improving accuracy, and facilitating data exchange between video analysis systems in an open, extensible manner. In this work, we propose a survey of papers from the last eighteen years, dating back to when the first applications of semantic technologies to video analytics appeared. The papers, analyzed from different perspectives to give a comprehensive overview of the technologies involved, reveal an interesting trend towards the adoption of SW technologies for video analytics. As a result of our work, some insights about future challenges are also provided.
Full-text available
Article
Knowledge graphs (KGs) contain rich resources that represent human knowledge of the world. There are mainly two kinds of reasoning techniques over knowledge graphs, symbolic reasoning and statistical reasoning; however, each has its own merits and limitations. It is therefore desirable to combine them to provide hybrid reasoning over a knowledge graph. In this paper, we present the first survey of methods for hybrid reasoning in knowledge graphs. We categorize existing methods based on the applications of their reasoning techniques and introduce their key ideas. Finally, we re-examine the remaining research problems and provide an outlook on future directions for hybrid reasoning in knowledge graphs.
Full-text available
Conference Paper
Video data is highly expressive and has traditionally been very difficult for a machine to interpret. Querying event patterns from video streams is challenging due to their unstructured representation. Middleware systems such as Complex Event Processing (CEP) mine patterns from data streams and send notifications to users in a timely fashion. Current CEP systems have inherent limitations in querying video streams due to their unstructured data model and the lack of an expressive query language. In this work, we focus on a CEP framework where users can define high-level expressive queries over videos to detect a range of spatiotemporal event patterns. In this context, we propose: i) VidCEP, an in-memory, on-the-fly, near real-time complex event matching framework for video streams, which uses a graph-based event representation enabling the detection of high-level semantic concepts from video using cascades of Deep Neural Network models; ii) a Video Event Query Language (VEQL) to express high-level user queries over video streams in CEP; and iii) a complex event matcher to detect spatiotemporal video event patterns by matching expressive user queries over video data. The proposed approach detects spatiotemporal video event patterns with an F-score ranging from 0.66 to 0.89. VidCEP maintains near real-time performance with an average throughput of 70 frames per second for 5 parallel videos with sub-second matching latency.
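The kind of spatiotemporal pattern a VEQL-style query expresses, e.g. "a person followed by a car within a window of frames", can be sketched as a toy sequence matcher over a stream of per-frame detections. The event schema, labels and window size below are illustrative assumptions, not VidCEP's actual query language or matching engine.

```python
from collections import deque

def match_sequence(stream, first, second, window):
    """Yield (i, j) frame-index pairs where label `first` is followed by
    label `second` within `window` frames of the detection stream."""
    recent = deque()  # frame indices where `first` was recently seen
    for j, objects in enumerate(stream):
        # Drop occurrences of `first` that fell out of the window.
        while recent and j - recent[0] > window:
            recent.popleft()
        if second in objects:
            for i in recent:
                yield (i, j)
        if first in objects:
            recent.append(j)

# Per-frame object detections (sets of labels produced by a DNN cascade).
stream = [{"person"}, set(), {"car"}, {"person", "car"}]
print(list(match_sequence(stream, "person", "car", window=3)))
# → [(0, 2), (0, 3)]
```

A real engine would match over graph-based events rather than bare label sets, but the sliding-window logic is the same.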
Full-text available
Conference Paper
Complex Event Processing (CEP) is a paradigm to detect event patterns over streaming data in a timely manner. Presently, CEP systems have inherent limitations in detecting event patterns over video streams due to their data complexity and the lack of a structured data model. Modelling complex events in unstructured data such as videos requires not only detecting objects but also the spatiotemporal relationships among objects. This work introduces a novel video representation technique in which an input video stream is converted to a stream of graphs. We propose the Video Event Knowledge Graph (VEKG), a knowledge graph driven representation of video data. VEKG models video objects as nodes and their relationship interactions as edges over time and space. It creates a semantic knowledge representation of video data derived from the detection of high-level semantic concepts from the video using an ensemble of deep learning models. To optimize run-time system performance, we introduce a graph aggregation method, VEKG-TAG, which provides an aggregated view of VEKG for a given time length. We define a set of operators using event rules which can be used as queries and applied over VEKG graphs to discover complex video patterns. The system achieves an F-score ranging between 0.75 and 0.86 for different patterns when queried over VEKG. In the given experiments, pattern search over VEKG-TAG was 2.3× faster than the baseline.
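The representation this abstract describes, objects as nodes and spatiotemporal relations as edges, plus a time-aggregated view, can be sketched with plain Python sets of triples. The relation names, frames and span-based aggregation below are invented for illustration and only approximate the VEKG-TAG idea.

```python
def frame_graph(detections):
    """Build a per-frame graph as a set of (subject, relation, object) edges."""
    return set(detections)

# A short stream of per-frame graphs (labels and relations are made up).
frames = [
    frame_graph([("person", "left_of", "car")]),
    frame_graph([("person", "left_of", "car"), ("dog", "near", "person")]),
    frame_graph([("dog", "near", "person")]),
]

def aggregate(frames):
    """Aggregated view: map each edge to the (first, last) frame it appears in.
    (A real aggregator would also handle edges that vanish and reappear.)"""
    spans = {}
    for t, g in enumerate(frames):
        for edge in g:
            first, _ = spans.get(edge, (t, t))
            spans[edge] = (first, t)
    return spans

agg = aggregate(frames)
print(agg[("person", "left_of", "car")])  # → (0, 1)
```

Queries then run once over the aggregated spans instead of once per frame, which is where the reported speed-up comes from.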
Article
We present Open Images V4, a dataset of 9.2M images with unified annotations for image classification, object detection and visual relationship detection. The images have a Creative Commons Attribution license that allows sharing and adapting the material, and they have been collected from Flickr without a predefined list of class names or tags, leading to natural class statistics and avoiding an initial design bias. Open Images V4 offers large scale across several dimensions: 30.1M image-level labels for 19.8k concepts, 15.4M bounding boxes for 600 object classes, and 375k visual relationship annotations involving 57 classes. For object detection in particular, we provide 15× more bounding boxes than the next largest datasets (15.4M boxes on 1.9M images). The images often show complex scenes with several objects (8 annotated objects per image on average). We annotated visual relationships between them, which support visual relationship detection, an emerging task that requires structured reasoning. We provide in-depth comprehensive statistics about the dataset, validate the quality of the annotations, study how the performance of several modern models evolves with increasing amounts of training data, and demonstrate two applications made possible by having unified annotations of multiple types coexisting in the same images. We hope that the scale, quality, and variety of Open Images V4 will foster further research and innovation even beyond the areas of image classification, object detection, and visual relationship detection.
Article
Mining valuable hidden knowledge from large-scale data relies on the support of reasoning technology. Knowledge graphs, as a new type of knowledge representation, have gained much attention in natural language processing. Knowledge graphs can effectively organize and represent knowledge so that it can be efficiently utilized in advanced applications. Recently, reasoning over knowledge graphs has become a hot research topic, since it can derive new knowledge and conclusions from existing data. Herein we review the basic concepts and definitions of knowledge reasoning and the methods for reasoning over knowledge graphs. Specifically, we dissect the reasoning methods into three categories: rule-based reasoning, distributed representation-based reasoning and neural network-based reasoning. We also review related applications of knowledge graph reasoning, such as knowledge graph completion, question answering, and recommender systems. Finally, we discuss the remaining challenges and research opportunities for knowledge graph reasoning.
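The rule-based category of knowledge graph reasoning this survey covers can be illustrated with a minimal forward-chaining sketch: a transitivity rule, locatedIn(x, y) ∧ locatedIn(y, z) → locatedIn(x, z), is applied until no new facts are derived. The facts and the relation name are toy examples, not from any particular knowledge base.

```python
facts = {
    ("Galway", "locatedIn", "Ireland"),
    ("Ireland", "locatedIn", "Europe"),
}

def forward_chain(facts):
    """Apply the locatedIn transitivity rule to a fixed point."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        # Join every pair of locatedIn facts that share a middle entity.
        new = {(a, "locatedIn", d)
               for (a, r1, b) in derived if r1 == "locatedIn"
               for (c, r2, d) in derived if r2 == "locatedIn" and b == c}
        if not new <= derived:
            derived |= new
            changed = True
    return derived

closure = forward_chain(facts)
print(("Galway", "locatedIn", "Europe") in closure)  # → True
```

Distributed-representation and neural approaches replace this symbolic join with learned embeddings, trading exactness for the ability to generalize to unseen facts.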