Enver Sangineto

  • Doctor of Engineering
  • University of Trento

About

108
Publications
30,499
Reads
4,428
Citations
Current institution
University of Trento

Publications (108)
Article
Full-text available
There is a recent growing interest in applying Deep Learning techniques to tabular data in order to replicate the success of other Artificial Intelligence areas in this structured domain. Particularly interesting is the case in which tabular data have a time dependence, such as, for instance, financial transactions. However, the heterogeneity of th...
Preprint
Contrastive Language-Image Pre-Training (CLIP) has refreshed the state of the art for a broad range of vision-language cross-modal tasks. Particularly, it has created an intriguing research line of text-guided image style transfer, dispensing with the need for style reference images as in traditional style transfer methods. However, directly using...
Preprint
Despite the progress made in the style transfer task, most previous work focuses on transferring only relatively simple features like color or texture, while missing more abstract concepts such as overall art expression or painter-specific traits. However, these abstract semantics can be captured by models like DALL-E or CLIP, which have been trained...
Preprint
There is a recent growing interest in applying Deep Learning techniques to tabular data, in order to replicate the success of other Artificial Intelligence areas in this structured domain. Specifically interesting is the case in which tabular data have a time dependence, such as, for instance financial transactions. However, the heterogeneity of th...
Preprint
Full-text available
Denoising Diffusion Probabilistic Models have shown an impressive generation quality, although their long sampling chain leads to high computational costs. In this paper, we observe that a long sampling chain also leads to an error accumulation phenomenon, which is similar to the exposure bias problem in autoregressive text generation. Spe...
Preprint
Multi-domain image-to-image (I2I) translations can transform a source image according to the style of a target domain. One important desired characteristic of these transformations is their graduality, which corresponds to a smooth change between the source and the target image when their respective latent-space representations are linearly inter...
Chapter
Full-text available
Generative Neural Radiance Field (GNeRF) models, which extract implicit 3D representations from 2D images, have recently been shown to produce realistic images representing rigid/semi-rigid objects, such as human faces or cars. However, they usually struggle to generate high-quality images representing non-rigid objects, such as the human body, whi...
Article
This paper proposes a gaze correction and animation method for high-resolution, unconstrained portrait images, which can be trained without the gaze angle and the head pose annotations. Common gaze-correction methods usually require annotating training data with precise gaze, and head pose information. Solving this problem using an unsupervised met...
Preprint
Full-text available
This paper proposes a gaze correction and animation method for high-resolution, unconstrained portrait images, which can be trained without the gaze angle and the head pose annotations. Common gaze-correction methods usually require annotating training data with precise gaze, and head pose information. Solving this problem using an unsupervised met...
Preprint
Recent work has shown that the attention maps of Vision Transformers (VTs), when trained with self-supervision, can contain a semantic segmentation structure which does not spontaneously emerge when training is supervised. In this paper, we explicitly encourage the emergence of this spatial clustering as a form of training regularization, this way...
Preprint
Environments in Reinforcement Learning are usually only partially observable. To address this problem, a possible solution is to provide the agent with information about the past. However, providing complete observations of numerous steps can be excessive. Inspired by human memory, we propose to represent history with only important changes in the...
Preprint
Full-text available
Generative Neural Radiance Field (GNeRF) models, which extract implicit 3D representations from 2D images, have recently been shown to produce realistic images representing rigid objects, such as human faces or cars. However, they usually struggle to generate high-quality images representing non-rigid objects, such as the human body, which is of a...
Preprint
In this paper, we study the problem of Novel Class Discovery (NCD). NCD aims at inferring novel object categories in an unlabeled set by leveraging prior knowledge of a labeled set containing different but related classes. Existing approaches tackle this problem by considering multiple objective functions, usually involving specialized loss t...
Preprint
Full-text available
Image-to-Image (I2I) multi-domain translation models are usually evaluated also using the quality of their semantic interpolation results. However, state-of-the-art models frequently show abrupt changes in the image appearance during interpolation, and usually perform poorly in interpolations across domains. In this paper, we propose a new training...
Conference Paper
Full-text available
Figure 1: Our method generates smooth interpolations within and across domains in various image-to-image translation tasks. Here, we show gender, age and smile translations from CelebA-HQ [20] and animal translations from AFHQ [10]. Abstract: Image-to-Image (I2I) multi-domain translation models are usually evaluated also usin...
Preprint
Full-text available
Visual Transformers (VTs) are emerging as an architectural paradigm alternative to Convolutional networks (CNNs). Differently from CNNs, VTs can capture global relations between image elements and they potentially have a larger representation capacity. However, the lack of the typical convolutional inductive bias makes these models more data-hungry...
Preprint
Full-text available
Controllable person image generation aims to produce realistic human images with desirable attributes (e.g., the given pose, cloth textures or hair style). However, the large spatial misalignment between the source and target images makes the standard architectures for image-to-image translation not suitable for this task. Most of the state-of-the-...
Article
Full-text available
Most domain adaptation methods consider the problem of transferring knowledge to the target domain from a single-source dataset. However, in practical applications, we typically have access to multiple sources. In this paper we propose the first approach for multi-source domain adaptation (MSDA) based on generative adversarial networks. Our method...
Chapter
Continual Learning (CL) aims to develop agents emulating the human ability to sequentially learn new tasks while being able to retain knowledge obtained from past experiences. In this paper, we introduce the novel problem of Memory-Constrained Online Continual Learning (MC-OCL) which imposes strict constraints on the memory overhead that a possible...
Preprint
Full-text available
In this paper we address the problem of unsupervised gaze correction in the wild, presenting a solution that works without the need for precise annotations of the gaze angle and the head pose. We have created a new dataset called CelebAGaze, which consists of two domains X, Y, where the eyes are either staring at the camera or somewhere else. Our m...
Preprint
Full-text available
Continual Learning (CL) aims to develop agents emulating the human ability to sequentially learn new tasks while being able to retain knowledge obtained from past experiences. In this paper, we introduce the novel problem of Memory-Constrained Online Continual Learning (MC-OCL) which imposes strict constraints on the memory overhead that a possible...
Preprint
Recent literature on self-supervised learning is based on the contrastive loss, where image instances which share the same semantic content ("positives") are contrasted with instances extracted from other images ("negatives"). However, in order for the learning to be effective, a lot of negatives should be compared with a positive pair. This is not...
Preprint
Most domain adaptation methods consider the problem of transferring knowledge to the target domain from a single source dataset. However, in practical applications, we typically have access to multiple sources. In this paper we propose the first approach for Multi-Source Domain Adaptation (MSDA) based on Generative Adversarial Networks. Our method...
Preprint
Full-text available
Gaze redirection aims at manipulating a given eye gaze to a desirable direction according to a reference angle and it can be applied to many real life scenarios, such as video-conferencing or taking groups. However, the previous works suffer from two limitations: (1) low-quality generation and (2) low redirection precision. To this end, we propose...
Article
Hashing methods have recently been shown to be very effective in the retrieval of remote sensing (RS) images due to their computational efficiency and fast search speed. Common hashing methods in RS are based on hand-crafted features on top of which they learn a hash function, which provides the final binary codes. However, these features are not o...
Article
In this paper, we address the problem of generating person images conditioned on both pose and appearance information. Specifically, given an image $x_a$ of a person and a target pose $P(x_b)$, extracted from an image $x_b$, we synthesize a new image of that person in pose $P(x_b)$, while preserving the visual details in $x_a$. In ord...
Preprint
We present a generalization of the person-image generation task, in which a human image is generated conditioned on a target pose and a set X of source appearance images. In this way, we can exploit multiple, possibly complementary images of the same person which are usually available at training and at testing time. The solution we propose is main...
Preprint
In this paper, we address the problem of generating person images conditioned on both pose and appearance information. Specifically, given an image xa of a person and a target pose P(xb), extracted from a different image xb, we synthesize a new image of that person in pose P(xb), while preserving the visual details in xa. In order to deal with pixe...
Preprint
Hashing methods have been recently found very effective in retrieval of remote sensing (RS) images due to their computational efficiency and fast search speed. The traditional hashing methods in RS usually exploit hand-crafted features to learn hash functions to obtain binary codes, which can be insufficient to optimally represent the information c...
Preprint
A classifier trained on a dataset seldom works on other datasets obtained under different conditions due to domain shift. This problem is commonly addressed by domain adaptation methods. In this work we introduce a novel deep learning framework which unifies different paradigms in unsupervised domain adaptation. Specifically, we propose domain alig...
Preprint
Batch Normalization (BN) is a common technique used both in discriminative and generative networks in order to speed-up training. On the other hand, the learnable parameters of BN are commonly used in conditional Generative Adversarial Networks for representing class-specific information using conditional Batch Normalization (cBN). In this paper we...
Article
Full-text available
In this paper we address the problem of generating person images conditioned on a given pose. Specifically, given an image of a person and a target pose, we synthesize a new image of that person in the novel pose. In order to deal with pixel-to-pixel misalignments caused by the pose differences, we introduce deformable skip connections in the gener...
Conference Paper
Full-text available
In this paper we address the abnormality detection problem in crowded scenes. We propose to use Generative Adversarial Nets (GANs), which are trained using "normal" frames and corresponding optical-flow images in order to learn an internal representation of the scene "normality". Since our GANs are trained with only normal data, they are not able t...
Preprint
In this paper we address the abnormality detection problem in crowded scenes. We propose to use Generative Adversarial Nets (GANs), which are trained using normal frames and corresponding optical-flow images in order to learn an internal representation of the scene normality. Since our GANs are trained with only normal data, they are not able to ge...
Article
Full-text available
Abnormal crowd behaviour detection attracts a large interest due to its importance in video surveillance scenarios. However, the ambiguity and the lack of sufficient "abnormal" ground truth data makes end-to-end training of large deep networks hard in this domain. In this paper we propose to use Generative Adversarial Nets (GANs), which are trained...
Article
Full-text available
In this paper, we aim to understand whether current language and vision (LaVi) models truly grasp the interaction between the two modalities. To this end, we propose an extension of the MSCOCO dataset, FOIL-COCO, which associates images with both correct and "foil" captions, that is, descriptions of the image that are highly similar to the original...
Article
Full-text available
Most of the crowd abnormal event detection methods rely on complex hand-crafted features to represent the crowd motion and appearance. Convolutional Neural Networks (CNNs) have been shown to be a powerful tool with excellent representational capacities, which can alleviate the need for hand-crafted features. In this paper, we show that keeping track of th...
Article
Full-text available
In a weakly-supervised scenario, object detectors need to be trained using image-level annotation only. Since bounding-box-level ground truth is not available, most of the solutions proposed so far are based on an iterative approach in which the classifier, obtained in the previous iteration, is used to predict the objects' positions which are used...
Preprint
In a weakly-supervised scenario object detectors need to be trained using image-level annotation alone. Since bounding-box-level ground truth is not available, most of the solutions proposed so far are based on an iterative, Multiple Instance Learning framework in which the current classifier is used to select the highest-confidence boxes in each i...
Article
Facial expression and gesture recognition algorithms are key enabling technologies for human-computer interaction (HCI) systems. State of the art approaches for automatic detection of body movements and analyzing emotions from facial features heavily rely on advanced machine learning algorithms. Most of these methods are designed for the average us...
Conference Paper
Full-text available
We present an open source cross platform technology for 3D face tracking and analysis. It contains a full stack of components for complete face understanding: detection, head pose tracking, facial expression and action units recognition. Given a depth sensor, one can combine FaceCept3D modules to fulfill a specific application scenario. Key advanta...
Conference Paper
Full-text available
Most of the facial expression recognition methods assume frontal or near-frontal head poses and usually their accuracy strongly decreases when tested with non-frontal poses. Training a 2D pose-specific classifier for a large number of discrete poses can be time consuming due to the need of many samples per pose. On the other hand, 2D and 3D view-po...
Article
The current state of the art in video classification is based on Bag-of-Words using local visual descriptors. Most commonly these are histogram of oriented gradients (HOG), histogram of optical flow (HOF) and motion boundary histograms (MBH) descriptors. While such an approach is very powerful for classification, it is also computationally expensive....
Conference Paper
Full-text available
The way in which human beings express emotions depends on their specific personality and cultural background. As a consequence, person independent facial expression classifiers usually fail to accurately recognize emotions which vary between different individuals. On the other hand, training a person-specific classifier for each new user is a time...
Conference Paper
Full-text available
Previous works on facial expression analysis have shown that person specific models are advantageous with respect to generic ones for recognizing facial expressions of new users added to the gallery set. This finding is not surprising, due to the often significant inter-individual variability: different persons have different morphological aspects...
Article
Full-text available
The current state of the art in video classification is based on Bag-of-Words using local visual descriptors. Most commonly these are histogram of oriented gradients (HOG), histogram of optical flow (HOF) and motion boundary histograms (MBH) descriptors. While such an approach is very powerful for classification, it is also computationally expensive....
Conference Paper
Full-text available
The increasing interest in automatic adaptation of pedestrian detectors toward specific scenarios is motivated by the drop of performance of common detectors, especially in video-surveillance low resolution images. Different works have been recently proposed for unsupervised adaptation. However, most of these works do not completely solve the drift...
Chapter
Full-text available
This chapter presents a novel scheme for analyzing the crowd behavior from visual crowded scenes. The proposed method starts from the assumption that the interaction force, as estimated by the Social Force Model (SFM), is a significant feature to analyze crowd behavior. We step forward this hypothesis by optimizing this force using Particle Swarm O...
Conference Paper
Full-text available
The recent availability of large scale training sets in conjunction with accurate classifiers (e.g., SVMs) makes it possible to build large sets of "simple" object detectors and to develop new classification approaches in which dictionaries of visual features are substituted by dictionaries of object detectors. The responses of this collection of d...
Article
Full-text available
We present an approach to automatic localization of facial feature points which deals with pose, expression and identity variations combining 3D shape models with local image patch classification. The latter is performed by means of densely extracted SURF-like features, which we call DU-SURF, while the former is based on a multiclass version of the...
Article
Over the past few years there has been a growing interest in visual interfaces based on gestures. Using gestures as a means to communicate with a computer can be helpful in applications such as gaming platforms, domotic environments, augmented reality or sign language interpretation, to name a few. However, a serious bottleneck for such interfaces is...
Article
Two of the most important state-of-the-art challenges in face recognition are: dealing with image acquisition conditions very different between the gallery and the probe set and dealing with large datasets of individuals. In this paper we face both aspects presenting a method which is able to work in "real life" scenarios, in which face images are...
Conference Paper
Full-text available
In this paper we propose an integrated system for face detection and face recognition based on improved versions of state-of-the-art statistical learning techniques such as Boosting and LDA. Both the detection and the recognition processes are performed on facial features (e.g., the eyes, the nose, the mouth, etc) in order to improve the recognitio...
Article
Full-text available
We present a face recognition system based on the Scale Invariant Feature Transform (SIFT) image descriptors recently proposed by Lowe and largely used in generic object recognition tasks. We show how SIFT descriptors can be used in a robust face recognition system coupled with some simple image normalization processes and geometric constraints on...
Conference Paper
Full-text available
We present an approach to viewpoint invariant hand detection which merges model based representation of shape and exact curve matching with graph search in order to achieve a very low false alarm rate system able to work in real time. The method proposed makes few assumptions on the articulated object nature and can be applied to recognize other ar...
Article
We present in this paper an elephant photo identification system based on the shape comparison of the nicks characterizing the elephants’ ears. The method we propose can deal with very cluttered and noisy images as the ones commonly used by zoologists for wild elephant photo identification. Difficult segmentation problems are solved using rough pos...
Article
Full-text available
This paper presents an approach to automatic course generation and student modeling. The method has been developed during the European funded projects Diogene and Intraserv, focused on the construction of an adaptive e-learning platform. The aim of the platform is the automatic generation and personalization of courses, taking into account pedagogi...
Article
In this work we present a system for identikit composition and face recognition based on anatomic landmarks, i.e., a set of points manually set by the user which define the position of the most important face features. Landmarks are used both in the identikit composition, driving the mapping on the current image of possible new facial features take...
Article
Sketch-based image retrieval systems need to handle two main problems. First of all, they have to recognize shapes similar but not necessarily identical to the user’s query. Hence, exact object identification techniques do not fit in this case. The second problem is the selection of the image features to compare with the user’s sketch. In domain-in...
Conference Paper
Full-text available
In this paper we present the results of a two-year research project on automatic people counting in crowded public environments. The aim of the proposed system is to estimate the number of people passing through a gate in a public area such as a metro or a railway station. The problem is particularly challenging due to both the presence of crowd w...
Conference Paper
Zoologists studying elephant populations in wild environments need to recognize different individuals from photos taken in different periods of time. Individuals can be distinguished by the shape of the nicks on their ears. Nevertheless, shape comparison is not trivial due to a highly cluttered background. We propose a method for partially, non-con...
Article
Full-text available
In this chapter we show the technical and methodological aspects of an e-learning platform for automatic course personalization built during the European funded project Diogene. The system we propose is composed of different knowledge modules and some inference tools. The knowledge modules represent the system's information about both the domain-sp...
Chapter
In this chapter we show the technical and methodological aspects of an e-learning platform for automatic course personalization built during the European funded project Diogene. The system we propose is composed of different knowledge modules and some inference tools. The knowledge modules represent the system’s information about both the domain-sp...
Conference Paper
Full-text available
We present in this paper a general-purpose approach for articulated object recognition. We split the recognition process in two distinct phases. In the former we use standard model-based techniques in order to recognize and localize in the input image the rigid components the articulated object is composed of. In the second phase the spatial config...
Chapter
Behaviour protocols are often used to coordinate the action of a group of two or more agents by limiting the nature of the agents' goals and the possibility of backtracking in the negotiation. The solution presented in this paper achieves the maximum flexibility because the negotiation itself is object of the planning activity. We extend the reason...
Article
Full-text available
This paper presents a system for articulated object recognition. The system has been tested in the RoboCup domain (four-legged league), an international competition among autonomous doglike robots playing soccer. Nevertheless, the proposed method does not depend on any specific domain, but is thought to be applicable to generic objects composed of...
Article
Full-text available
Face detection and facial expression recognition are research areas with important application possibilities. Although the two problems are usually dealt with different approaches, we show in this paper how the same recognition process can be used to recognize both a generic “class-face” in a given, possibly complex image, and a specific facial ex...
Article
Among the existing content-based image retrieval (CBIR) techniques for still images based on different perceptual features (e.g., color, texture, etc.), shape-based methods are particularly challenging due to the intrinsic difficulties in dealing with shape localization and recognition problems. Nevertheless, there is no doubt that shape is one of...
Chapter
Among the existing Content Based Image Retrieval (CBIR) techniques for still images based on different perceptual features (e. g., colour, texture, etc.), shape-based methods are particularly challenging due to the intrinsic difficulties in dealing with shape localization and recognition problems. Nevertheless, there is no doubt that shape is one o...
Conference Paper
Full-text available
Robot navigation using only abstract, topological information on the environment is strongly related to the possibility for a robot to unambiguously match information coming from its sensors with the basic elements of the environment. In this paper we present an approach to this challenging problem based on the direct recognition of the topological...
Article
Full-text available
There is a recent growing interest in recognition of biological individuals in images: human being detection, face detection, animal body recognition are examples of relatively new research areas with important practical applications. From the Computer Vision point of view, humans and animals are nonrigid objects, i.e., objects whose shape can unde...
Conference Paper
Full-text available
We present in this paper the results of Diogene, a European-funded project completed in October 2004, whose aim has been the design and the development of a distributed e-learning system able to perform several automatic actions such as course customization and information retrieval on the Semantic Web. The Diogene architecture is based on a networ...
Article
Full-text available
The aim of the European funded project Diogene (which is going to be closed the 31st of October 2004) [Diogene] has been the construction of an automatic “brokering environment” for e-learning in a Semantic Web scenario. The Diogene platform is able to act as an intermediary between the learners and different “content providers”, specialized traini...
Conference Paper
Full-text available
Image retrieval by shape similarity systems usually either focus their attention on images with isolated objects (uniform backgrounds and no occluding objects) or perform time consuming exhaustive initializations to localize the portion of the image possibly containing the searched shape. We propose a content based image retrieval (CBIR) method mer...
Conference Paper
Full-text available
The purpose of this paper is to describe the work in progress related to the design, the implementation and the evaluation of an innovative e-learning platform for ICT individual training in the framework of an EC funded project named Diogene. The present e-learning solution includes several state-of-the-art technologies and methodologies such as:...
Conference Paper
Full-text available
We present a new method for image retrieval by shape similarity able to deal with real images with non-uniform backgrounds and possible touching/occluding objects. First of all we perform a sketch-driven segmentation of the scene by means of a Deformation Tolerant version of the Generalized Hough Transform (DTGHT). Using the DTGHT we select in the i...
Article
In this paper we propose a vision-based system that lets the robot recognize an environment observed through the construction of a perspective structure which characterizes it. The individualization of the most significant characteristics of the perspective structure is performed by a geometric method that, using the information given by the image,...
Conference Paper
Full-text available
We propose a Web tutoring system in which Artificial Intelligence techniques and Semantic Web approaches are integrated in order to provide an automatic tool able both to completely customize learning on the student's needs and to exchange learning material with other Web systems. IWT (Intelligent Web Teacher) is based on an ad hoc knowledge repres...

Questions

Question (1)
Question
Hi,
I’m building a Random Forest (basically, an ensemble of ***non-pruned*** decision trees) and I’m not sure about the criterion for stopping the node-splitting process when a node has too few training samples. The usual recommendation is to stop splitting when “the number of training samples in the node is less than a specified threshold”. However, it seems to me that this rule can produce trees whose leaves are associated with fewer samples than the threshold and are, hence, not statistically meaningful.
For instance, suppose I fix the threshold to N (e.g., N = 10). And suppose that my current node V is associated with N + K samples (e.g., N + K = 13). Since N + K > N, I should split V. The new splitting will produce two more nodes (e.g., V1 and V2) with, respectively, N1 and N2 samples. However, I have no guarantee that N1 >= N and N2 >= N (e.g., N1 = 7, N2 = 6).
A different interpretation of this stop-splitting criterion would be: do not split V unless both resulting children, V1 and V2, have a sample cardinality of at least N.
Which is the more correct and statistically robust interpretation?
Thank you in advance,
Enver
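
The two readings of the stopping rule in the question above can be compared with a small sketch. It is only an illustration under stated assumptions, not an answer on which reading is statistically preferable: the function name `admissible_splits` and the rule labels `"parent"` and `"children"` are hypothetical, and the sketch only enumerates which child sizes each reading would allow, ignoring the impurity criterion that would pick the actual split.

```python
def admissible_splits(n_node, threshold, rule):
    """Return the (n_left, n_right) child sizes a node of n_node samples may produce.

    rule="parent":   split whenever the node itself holds at least `threshold`
                     samples (the first reading in the question).
    rule="children": split only if both children hold at least `threshold`
                     samples (the second reading in the question).
    Both rule names are hypothetical labels used for this illustration.
    """
    # Every way of dividing n_node samples into two non-empty children.
    sizes = [(k, n_node - k) for k in range(1, n_node)]
    if rule == "parent":
        return sizes if n_node >= threshold else []
    if rule == "children":
        return [(left, right) for left, right in sizes
                if left >= threshold and right >= threshold]
    raise ValueError(f"unknown rule: {rule}")


N = 10  # threshold from the question's example

# First reading: a 13-sample node may be split, and (7, 6) is allowed,
# so both leaves end up below the threshold.
print((7, 6) in admissible_splits(13, N, "parent"))

# Second reading: no split of 13 samples leaves both children with >= 10,
# so the node becomes a leaf.
print(admissible_splits(13, N, "children"))

# Under the second reading a node needs at least 2 * N samples to be split.
print(admissible_splits(20, N, "children"))
```

For reference, scikit-learn's tree and forest estimators expose both constraints as separate parameters, `min_samples_split` (a condition on the node being split) and `min_samples_leaf` (a condition on each resulting child), so the two readings are not mutually exclusive in practice.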
