Ioannis Patras

Queen Mary, University of London | QMUL

About

150
Publications
23,848
Reads
3,849
Citations

Publications (150)
Preprint
Full-text available
Capsule Networks have shown tremendous advancement in the past decade, outperforming the traditional CNNs in various tasks due to their equivariant properties. With the use of vector I/O, which provides information on both the magnitude and direction of an object or its parts, there lies an enormous possibility of using Capsule Networks in unsupervised le...
Article
Full-text available
In this paper, we address the problem of high performance and computationally efficient content-based video retrieval in large-scale datasets. Current methods typically propose either: (i) fine-grained approaches employing spatio-temporal representations and similarity calculations, achieving high performance at a high computational cost or (ii) co...
Preprint
Self-supervised learning has recently achieved great success in representation learning without human annotations. The dominant method, contrastive learning, is generally based on instance discrimination tasks, i.e., individual samples are treated as independent categories. However, presuming all the samples are different contradicts the...
Preprint
Automated estimation of human affect and mental state faces a number of difficulties, including learning from labels with poor or no temporal resolution, learning from few datasets with little data (often due to confidentiality constraints), and (very) long, in-the-wild videos. For these reasons, deep learning methodologies tend to overfi...
Preprint
This work addresses the problem of discovering non-linear interpretable paths in the latent space of pre-trained GANs in a model-agnostic manner. In the proposed method, the discovery is driven by a set of pairs of natural language sentences with contrasting semantics, named semantic dipoles, that serve as the limits of the interpretation that we r...
Preprint
Recent advances in the understanding of Generative Adversarial Networks (GANs) have led to remarkable progress in visual editing and synthesis tasks, capitalizing on the rich semantics that are embedded in the latent spaces of pre-trained GANs. However, existing methods are often tailored to specific GAN architectures and are limited to either disc...
Preprint
Full-text available
Despite the large progress in supervised learning with Neural Networks, there are significant challenges in obtaining high-quality, large-scale and accurately labeled datasets. In this context, we address the problem of classification in the presence of label noise and, more specifically, both closed-set and open-set label noise, that i...
Article
Full-text available
Video summarization technologies aim to create a concise and complete synopsis by selecting the most informative parts of the video content. Several approaches have been developed over the last couple of decades, and the current state of the art is represented by methods that rely on modern deep neural network architectures. This work focuses on th...
Preprint
Full-text available
This work addresses the problem of discovering, in an unsupervised manner, interpretable paths in the latent space of pretrained GANs, so as to provide an intuitive and easy way of controlling the underlying generative factors. In doing so, it addresses some of the limitations of the state-of-the-art works, namely, a) that they discover directions...
Preprint
Full-text available
In this paper, we address the problem of high performance and computationally efficient content-based video retrieval in large-scale datasets. Current methods typically propose either: (i) fine-grained approaches employing spatio-temporal representations and similarity calculations, achieving high performance at a high computational cost or (ii) co...
Preprint
Learning to localize actions in long, cluttered, and untrimmed videos is a hard task that in the literature has typically been addressed assuming the availability of large amounts of annotated training samples for each class -- either in a fully-supervised setting, where action boundaries are known, or in a weakly-supervised setting, where only cl...
Preprint
Full-text available
Understanding interactions between objects in an image is an important element for generating captions. In this paper, we propose a relationship-based neural baby talk (R-NBT) model to comprehensively investigate several types of pairwise object interactions by encoding each image via three different relationship-based graph attention networks (GAT...
Preprint
Full-text available
In this technical report we study the problem of propagation of uncertainty (in terms of variances of given uni-variate normal random variables) through typical building blocks of a Convolutional Neural Network (CNN). These include layers that perform linear operations, such as 2D convolutions, fully-connected, and average pooling layers, as well a...
Preprint
Video summarization technologies aim to create a concise and complete synopsis by selecting the most informative parts of the video content. Several approaches have been developed over the last couple of decades and the current state of the art is represented by methods that rely on modern deep neural network architectures. This work focuses on the...
Book
The two-volume set LNCS 12572 and 12573 constitutes the thoroughly refereed proceedings of the 27th International Conference on MultiMedia Modeling, MMM 2021, held in Prague, Czech Republic, in June 2021. Of the 211 submitted regular papers, 40 papers were selected for oral presentation and 33 for poster presentation; 16 special session papers were a...
Book
The two-volume set LNCS 12572 and 12573 constitutes the thoroughly refereed proceedings of the 27th International Conference on MultiMedia Modeling, MMM 2021, held in Prague, Czech Republic, in June 2021. Of the 211 submitted regular papers, 40 papers were selected for oral presentation and 33 for poster presentation; 16 special session papers were a...
Article
This paper presents a new method for unsupervised video summarization. The proposed architecture embeds an Actor-Critic model into a Generative Adversarial Network and formulates the selection of important video fragments (that will be used to form the summary) as a sequence generation task. The Actor and the Critic take part in a game that increme...
Preprint
In this paper, we address the problem of temporal action localization with a single-stage neural network. In the proposed architecture we model the boundary predictions as uni-variate Gaussian distributions in order to model their uncertainties, which, to the best of our knowledge, is the first such approach in this area. We use two uncertainty-aware boundary regr...
Preprint
This work addresses the problem of temporal action localization with Variance-Aware Networks (VAN), i.e., DNNs that use second-order statistics in the input and/or the output of regression tasks. We first propose a network (VANp) that when presented with the second-order statistics of the input, i.e., each sample has a mean and a variance, it propa...
Chapter
This paper presents a new video summarization approach that integrates an attention mechanism to identify the significant parts of the video, and is trained in an unsupervised manner via generative adversarial learning. Starting from the SUM-GAN model, we first develop an improved version of it (called SUM-GAN-sl) that has a significantly reduced number of l...
Article
Full-text available
Foreground/background (fg/bg) classification is an important first step for several video analysis tasks such as people counting, activity recognition and anomaly detection. As is the case for several other Computer Vision problems, the advent of deep Convolutional Neural Network (CNN) methods has led to major improvements in this field. However, d...
Conference Paper
In this paper we present our work on improving the efficiency of adversarial training for unsupervised video summarization. Our starting point is the SUM-GAN model, which creates a representative summary based on the intuition that such a summary should make it possible to reconstruct a video that is indistinguishable from the original one. We buil...
Chapter
This chapter is focused on methods and tools for video fragmentation and reverse search on the web. These technologies can assist journalists when they are dealing with fake news (which nowadays is being rapidly spread via social media platforms) that relies on the reuse of a previously posted video from a past event with the intention to mislead the...
Chapter
Modern newsroom tools offer advanced functionality for automatic and semi-automatic content collection from the web and social media sources to accompany news stories. However, the content collected in this way often tends to be unstructured and may include irrelevant items. An important step in the verification process is to organize this content,...
Chapter
This chapter discusses the problem of Near-Duplicate Video Retrieval (NDVR). The main objective of a typical NDVR approach is: given a query video, retrieve all near-duplicate videos in a video repository and rank them based on their similarity to the query. Several approaches have been introduced in the literature, which can be roughly classified...
Chapter
This chapter presents the techniques researched and developed within InVID for the forensic analysis of videos, and the detection and localization of forgeries within User-Generated Videos (UGVs). Following an overview of state-of-the-art video tampering detection techniques, we observed that the bulk of current research is mainly dedicated to fram...
Preprint
Full-text available
In this paper we introduce ViSiL, a Video Similarity Learning architecture that considers fine-grained Spatio-Temporal relations between pairs of videos -- such relations are typically lost in previous video retrieval approaches that embed the whole frame or even the whole video into a vector descriptor before the similarity estimation. By contrast...
Preprint
In this paper we propose a novel Temporal Attentive Relation Network (TARN) for the problems of few-shot and zero-shot action recognition. At the heart of our network is a meta-learning approach that learns to compare representations of variable temporal length, that is, either two videos of different length (in the case of few-shot action recognit...
Article
Recognition and analysis of human affect has been researched extensively within the field of computer science in the past two decades. However, most of the past research in automatic analysis of human affect has focused on the recognition of affect displayed by people in individual settings and little attention has been paid to the analysis of the...
Chapter
In this paper, we propose a new geometric model based on mixture of Markov Random Fields (MRFs) for human pose estimation. We build on previous work that expresses the global constraints on the relative locations of the body joints using an auto-encoder ConvNet which performs dimensionality reduction on the heat maps, and recovers in this manner a...
Article
Patients with schizophrenia often display impairments in the expression of emotion and speech and those are observed in their facial behaviour. Automatic analysis of patients' facial expressions that is aimed at estimating symptoms of schizophrenia has received attention recently. However, the datasets that are typically used for training and evalu...
Article
Full-text available
This paper introduces the problem of Fine-grained Incident Video Retrieval (FIVR). Given a query video, the objective is to retrieve all associated videos, considering several types of associations that range from duplicate videos to videos from the same incident. FIVR offers a single framework that contains several retrieval tasks as special cases...
Chapter
This chapter analyses the literature and presents the research efforts for improving concept‐based and event‐based video search. It focuses on feature extraction using hand‐crafted and deep convolutional neural networks (DCNN)‐based descriptors, dimensionality reduction using accelerated generalised subclass discriminant analysis (AGSDA), cascades...
Article
Full-text available
In this paper, we present a novel single shot face-related task analysis method, called Face-SSD, for detecting faces and for performing various face-related (classification/regression) tasks including smile recognition, face attribute prediction and valence-arousal estimation in the wild. Face-SSD uses a Fully Convolutional Neural Network (FCNN) t...
Chapter
Full-text available
This paper describes the combination of advanced technologies for social-media-based story detection, story-based video retrieval and concept-based video (fragment) labeling under a novel approach for multimodal video annotation. This approach involves textual metadata, structural information and visual concepts - and a multimodal analytics dashboa...
Conference Paper
Full-text available
This work reports the methodology that the CERTH-ITI team developed to recognize the emotional impact that movies have on their viewers in terms of valence/arousal and fear. More specifically, deep convolutional neural networks and several machine learning techniques are utilized to extract visual features and classify them based on the predicted...
Article
Full-text available
Automatic understanding and analysis of groups has attracted increasing attention in the vision and multimedia communities in recent years. However, little attention has been paid to the automatic analysis of the non-verbal behaviors and how this can be utilized for analysis of group membership, i.e., recognizing which group each individual is part...
Preprint
Full-text available
This paper introduces the problem of Fine-grained Incident Video Retrieval (FIVR). Given a query video, the objective is to retrieve all associated videos, considering several types of association that range from duplicate videos to videos from the same incident. FIVR offers a single framework that contains as special cases several retrieval tasks....
Preprint
Patients with schizophrenia often display impairments in the expression of emotion and speech and those are observed in their facial behaviour. Automatic analysis of patients' facial expressions that is aimed at estimating symptoms of schizophrenia has received attention recently. However, the datasets that are typically used for training and evalu...
Article
In this work we propose a DCNN (Deep Convolutional Neural Network) architecture that addresses the problem of video/image concept annotation by exploiting concept relations at two different levels. At the first level, we build on ideas from multi-task learning, and propose an approach to learn concept-specific representations that are sparse, linear...
Chapter
As multimedia applications have become part of our life, preservation and long-term access to the multimedia elements that are continuously produced is a major consideration, both for many organizations that generate or collect and need to maintain digital content, and for individuals. In this chapter, we focus primarily on the following multimedia...
Article
In this work we explore how the architecture proposed in [8], which expresses the processing steps of the classical Fisher vector pipeline approaches, i.e. dimensionality reduction by principal component analysis (PCA) projection, Gaussian mixture model (GMM) and Fisher vector descriptor extraction as network layers, can be modified into a hybrid n...
Article
Full-text available
In this paper, we propose a maximum margin classifier that deals with uncertainty in data input. More specifically, we reformulate the SVM framework such that each training example can be modeled by a multi-dimensional Gaussian distribution described by its mean vector and its covariance matrix -- the latter modeling the uncertainty. We address the...
Conference Paper
Full-text available
This work addresses the problem of Near-Duplicate Video Retrieval (NDVR). We propose an effective video-level NDVR scheme based on deep metric learning that leverages Convolutional Neural Network (CNN) features from intermediate layers to generate discriminative global video representations in tandem with a Deep Metric Learning (DML) framework with...
Conference Paper
Full-text available
Automatic understanding and analysis of groups has attracted increasing attention in the vision and multimedia communities in recent years. However, little attention has been paid to the automatic analysis of group membership, i.e., recognizing which group the individual in question is part of. This paper presents a novel two-phase Support Vector Ma...
Conference Paper
Full-text available
This paper presents the VideoAnalysis4ALL tool that supports the automatic fragmentation and concept-based annotation of videos, and the exploration of the annotated video fragments through an interactive user interface. The developed web application decomposes the video into two different granularities, namely shots and scenes, and annotates each...
Conference Paper
This paper presents a fully-automatic method that combines video concept detection and textual query analysis in order to solve the problem of ad-hoc video search. We present a set of NLP steps that cleverly analyse different parts of the query in order to convert it to related semantic concepts, we propose a new method for transforming concept-bas...
Conference Paper
Zero-example event detection is a problem where, given an event query as input but no example videos for training a detector, the system retrieves the most closely related videos. In this paper we present a fully-automatic zero-example event detection method that is based on translating the event description to a predefined set of concepts for whic...
Article
We present a database for research on affect, personality traits and mood by means of neuro-physiological signals. Unlike other databases, we elicited affect using both short and long videos in two settings, one with individual viewers and one with groups of viewers. The database allows the multimodal study of the affective responses of indiv...
Conference Paper
Full-text available
The problem of Near-Duplicate Video Retrieval (NDVR) has attracted increasing interest due to the huge growth of video content on the Web, which is characterized by a high degree of near duplicity. This calls for efficient NDVR approaches. Motivated by the outstanding performance of Convolutional Neural Networks (CNNs) over a wide variety of computer...
Conference Paper
This paper presents the VERGE interactive video retrieval engine, which is capable of browsing and searching video content. The system integrates several content-based analysis and retrieval modules, including concept detection, clustering, visual similarity search, object-based search, query analysis, and multimodal and temporal fusion.
Conference Paper
Full-text available
In this study we compare three different fine-tuning strategies in order to investigate the best way to transfer the parameters of popular deep convolutional neural networks that were trained for a visual annotation task on one dataset, to a new, considerably different dataset. We focus on the concept-based image/video annotation problem and use Im...