Edwin R. Hancock’s research while affiliated with New York University and other places


Publications (968)


FrameERC: Framelet Transform Based Multimodal Graph Neural Networks for Emotion Recognition in Conversation
  • Article

May 2025

·

27 Reads

·

4 Citations

Pattern Recognition

·

Jiandong Shi

·

·

[...]

·

Edwin R. Hancock

Conceptual comparison of fully decoupled person search (b) with previous decoupled models (a). a Due to the conflicting objectives of the detection and re-id subtasks, a plain model $\mathbf{\theta}_s$ only presents a compromised average solution, and previous decoupled methods improve the performance by narrowing the objective gap ($\mathbf{\theta}_o$) or adding task-oriented prediction modules ($\mathbf{\theta}_d$ and $\mathbf{\theta}_r$).
b Different from previous works, fully decoupled person search avoids sharing between the task-oriented parameters $\mathbf{\theta}'_d$ and $\mathbf{\theta}'_r$ to reach the optimum for both subtasks
a The proposed fully decoupled person search network which consists of a detection side-net, a re-id side-net, and the modules that bridge them. The model is trained incrementally by person detection and re-id tasks, which fully decouples the parameters for the two conflicting sub-tasks. b The Online Representation Distillation procedure that incorporates two streams of data for training. The representation distillation between the two streams mitigates the representation gap between end-to-end and two-step models to facilitate robust person matching
Illustration of the re-id head. Person feature maps are drawn from the output of ‘conv4’ and refined by the ‘conv5’ block. By consecutive global average pooling and batch normalization, this module produces 1-D person feature vectors
Person search performances with different numbers of augmented boxes in SnA
Person search performances under different spatial noise factors


Fully Decoupled End-to-End Person Search: An Approach without Conflicting Objectives
  • Article
  • Publisher preview available

March 2025

·

15 Reads

International Journal of Computer Vision

End-to-end person search aims to jointly detect and re-identify a target person in raw scene images with a unified model. The detection sub-task learns to identify all persons as one category while the re-identification (re-id) sub-task aims to discriminate persons of different identities, resulting in conflicting optimal objectives. Existing works proposed to decouple end-to-end person search to alleviate such conflict. Yet these methods are still sub-optimal on the sub-tasks due to their partially decoupled models, which limits the overall person search performance. To further eliminate the last coupled part in decoupled models without sacrificing the efficiency of end-to-end person search, we propose a fully decoupled person search framework in this work. Specifically, we design a task-incremental network to construct an end-to-end model in a task-incremental learning procedure. Given that the detection sub-task is easier, we start by training a lightweight detection sub-network and expand it with a re-id sub-network trained in another stage. On top of the fully decoupled design, we also enable one-stage training for the task-incremental network. The fully decoupled framework further allows an Online Representation Distillation to mitigate the representation gap between the end-to-end model and two-step models for learning robust representations. Without requiring an offline teacher re-id model, this transfers structured representational knowledge learned from cropped images to the person search model. The learned person representations thus focus more on discriminative clues of foreground persons and suppress the distractive background information. To understand the effectiveness and efficiency of the proposed method, we conduct comprehensive experimental evaluations on two popular person search datasets, PRW and CUHK-SYSU. The experimental results demonstrate that the fully decoupled model outperforms previous decoupled methods. The inference of the model is also shown to be efficient among recent end-to-end methods. The source code is available at https://github.com/PatrickZad/fdps.



Dual-Modal Prior Semantic Guided Infrared and Visible Image Fusion for Intelligent Transportation System

January 2025

·

2 Reads

·

1 Citation

IEEE Transactions on Intelligent Transportation Systems

Infrared and visible image fusion (IVF) plays an important role in intelligent transportation systems (ITS). Early works predominantly focus on boosting the visual appeal of the fused result. Although several recent approaches have tried to combine high-level vision tasks with IVF, they prioritize the design of cascaded structures to seek unified features that suit different tasks. Thus, they tend to bias toward reconstructing raw pixels without considering the significance of semantic features. Therefore, we propose a novel prior semantic guided image fusion method based on a dual-modality strategy, improving the performance of IVF in ITS. Specifically, to explore the independent significant semantics of each modality, we first design two parallel semantic segmentation branches with a refined feature adaptive-modulation (RFaM) mechanism. RFaM can perceive the features that are semantically distinct enough in each semantic segmentation branch. Then, two pilot experiments based on the two branches are conducted to capture the significant prior semantics of the source images, which are then applied to guide the fusion task in the integration of the semantic segmentation branches and the fusion branch. In addition, to aggregate both high-level semantics and impressive visual effects, we further investigate the frequency response of the prior semantics and propose a multi-level representation-adaptive fusion (MRaF) module to explicitly integrate low-frequency prior semantics with high-frequency details. Extensive experiments on two public datasets demonstrate the superiority of our method over state-of-the-art fusion approaches. Our method performs better on four quantitative metrics in the fusion task and achieves the highest mIoU in the semantic segmentation task.


HAQJSK: Hierarchical-Aligned Quantum Jensen-Shannon Kernels for Graph Classification

November 2024

·

25 Reads

·

53 Citations

IEEE Transactions on Knowledge and Data Engineering

In this work, we propose two novel quantum walk kernels, namely the Hierarchical Aligned Quantum Jensen-Shannon Kernels (HAQJSK), between un-attributed graph structures. Different from most classical graph kernels, the proposed HAQJSK kernels can incorporate hierarchical aligned structure information between graphs and transform graphs of arbitrary sizes into fixed-size aligned graph structures, i.e., the Hierarchical Transitive Aligned Adjacency Matrix of vertices and the Hierarchical Transitive Aligned Density Matrix of the Continuous-Time Quantum Walks (CTQW). Given a pair of graphs, the resulting HAQJSK kernels are defined by computing the Quantum Jensen-Shannon Divergence (QJSD) between their transitive aligned graph structures. We show that the proposed HAQJSK kernels not only reflect richer intrinsic whole graph characteristics in terms of the CTQW, but also address the drawback of neglecting structural correspondence information that arises in most R-convolution graph kernels. Moreover, unlike previous graph kernels based on the QJSD and the CTQW, the proposed HAQJSK kernels can simultaneously guarantee the properties of permutation invariance and positive definiteness, explaining the theoretical advantages of the HAQJSK kernels. Experiments indicate the effectiveness of the newly proposed kernels.
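
For readers unfamiliar with the divergence named in this abstract, the standard textbook definition of the Quantum Jensen-Shannon Divergence between two density matrices is sketched below. The symbols $\rho$, $\sigma$, and $H_N$ (the von Neumann entropy) are notation chosen here for illustration and do not come from the listing; the construction of the paper's aligned density matrices is not reproduced.

```latex
H_N(\rho) = -\operatorname{tr}(\rho \log \rho), \qquad
D_{QJS}(\rho, \sigma) = H_N\!\left(\frac{\rho + \sigma}{2}\right)
  - \frac{1}{2} H_N(\rho) - \frac{1}{2} H_N(\sigma)
```

Under this definition the divergence is zero when $\rho = \sigma$ and is bounded above by $\log 2$.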


Learning From Human Attention for Attribute-Assisted Visual Recognition

September 2024

·

41 Reads

·

11 Citations

IEEE Transactions on Pattern Analysis and Machine Intelligence

With prior knowledge of seen objects, humans have a remarkable ability to recognize novel objects using shared and distinct local attributes. This is significant for the challenging tasks of zero-shot learning (ZSL) and fine-grained visual classification (FGVC), where the discriminative attributes of objects have played an important role. Inspired by human visual attention, neural networks have widely exploited the attention mechanism to learn the locally discriminative attributes for challenging tasks. Although they have greatly promoted the development of these fields, existing works mainly focus on learning the region embeddings of different attribute features and neglect the importance of discriminative attribute localization. It is also unclear whether the learned attention truly matches the real human attention. To tackle this problem, this paper proposes to employ real human gaze data for visual recognition networks to learn from human attention. Specifically, we design a unified Attribute Attention Network (A$^2$ Net) that learns from human attention for both ZSL and FGVC tasks. The overall model consists of an attribute attention branch and a baseline classification network. On top of the image feature maps provided by the baseline classification network, the attribute attention branch employs attribute prototypes to produce attribute attention maps and attribute features. The attribute attention maps are converted to gaze-like attentions to be aligned with real human gaze attention. To guarantee the effectiveness of attribute feature learning, we further align the extracted attribute features with attribute-defined class embeddings. To facilitate learning from human gaze attention for the visual recognition problems, we design a bird classification game to collect real human gaze data using the CUB dataset via an eye-tracker device. Experiments on ZSL and FGVC tasks without/with real human gaze data validate the benefits and accuracy of our proposed model. This work supports the promising benefits of collecting human gaze datasets and automatic gaze estimation algorithms learning from human attention for high-level computer vision tasks.


The Ihara zeta function as a partition function for network structure characterisation

August 2024

·

36 Reads

Statistical characterizations of complex network structures can be obtained from both the Ihara Zeta function (in terms of prime cycle frequencies) and the partition function from statistical mechanics. However, these two representations are usually regarded as separate tools for network analysis, without exploiting the potential synergies between them. In this paper, we establish a link between the Ihara Zeta function from algebraic graph theory and the partition function from statistical mechanics, and exploit this relationship to obtain a deeper structural characterisation of network structure. Specifically, the relationship allows us to explore the connection between the microscopic structure and the macroscopic characterisation of a network. We derive thermodynamic quantities describing the network, such as entropy, and show how these are related to the frequencies of prime cycles of various lengths. In particular, the n-th order partial derivative of the Ihara Zeta function can be used to compute the number of prime cycles in a network, which in turn is related to the partition function of Bose–Einstein statistics. The corresponding derived entropy allows us to explore a phase transition in the network structure with critical points at high and low-temperature limits. Numerical experiments and empirical data are presented to evaluate the qualitative and quantitative performance of the resulting structural network characterisations.
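
As a rough illustration of how cycle counts relate to the Ihara zeta function, the sketch below uses a standard identity rather than the paper's derivation: the logarithmic derivative u·d/du log ζ(u) equals the sum over k of N_k·u^k, where N_k = tr(B^k) counts closed non-backtracking walks of length k and B is the Hashimoto (non-backtracking edge) matrix. The function names and the triangle example are assumptions made here for illustration, and the code assumes a simple graph with no self-loops or repeated edges.

```python
def hashimoto_matrix(edges):
    """Non-backtracking (Hashimoto) matrix of a simple undirected graph."""
    # Each undirected edge {u, v} contributes two directed edges.
    directed = [(u, v) for u, v in edges] + [(v, u) for u, v in edges]
    size = len(directed)
    matrix = [[0] * size for _ in range(size)]
    for i, (a, b) in enumerate(directed):
        for j, (c, d) in enumerate(directed):
            # Edge a->b may be followed by c->d if they chain and do not reverse.
            if b == c and d != a:
                matrix[i][j] = 1
    return matrix


def matmul(x, y):
    """Plain-Python product of two square matrices of equal size."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def closed_walk_counts(edges, max_length):
    """Return {k: tr(B^k)} for k = 1..max_length."""
    b = hashimoto_matrix(edges)
    power = b
    counts = {}
    for length in range(1, max_length + 1):
        counts[length] = sum(power[i][i] for i in range(len(b)))
        power = matmul(power, b)
    return counts


triangle = [(0, 1), (1, 2), (2, 0)]
print(closed_walk_counts(triangle, 6))
```

For the triangle, the only nonzero counts up to length 6 are at lengths 3 and 6, each equal to 6 (three starting edges times two orientations). The number of prime cycles of each length can then be recovered from these counts by Möbius inversion, since N_k sums d·π(d) over the divisors d of k.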


HC-GAE: The Hierarchical Cluster-based Graph Auto-Encoder for Graph Representation Learning

May 2024

·

28 Reads

Graph Auto-Encoders (GAEs) are powerful tools for graph representation learning. In this paper, we develop a novel Hierarchical Cluster-based GAE (HC-GAE) that can learn effective structural characteristics for graph data analysis. To this end, during the encoding process, we commence by utilizing the hard node assignment to decompose a sample graph into a family of separated subgraphs. We compress each subgraph into a coarsened node, transforming the original graph into a coarsened graph. On the other hand, during the decoding process, we adopt the soft node assignment to reconstruct the original graph structure by expanding the coarsened nodes. By hierarchically performing the above compressing procedure during the encoding process as well as the expanding procedure during the decoding process, the proposed HC-GAE can effectively extract bidirectionally hierarchical structural features of the original sample graph. Furthermore, we re-design the loss function so that it can integrate the information from either the encoder or the decoder. Since the associated graph convolution operation of the proposed HC-GAE is restricted to each individual separated subgraph and cannot propagate the node information between different subgraphs, the proposed HC-GAE can significantly reduce the over-smoothing problem arising in the classical convolution-based GAEs. The proposed HC-GAE can generate effective representations for either node classification or graph classification, and the experiments demonstrate the effectiveness on real-world datasets.
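
The coarsening step described in this abstract can be pictured with a minimal sketch, assuming a fixed hard assignment of nodes to clusters. This is a generic illustration, not the HC-GAE implementation: the function name `coarsen`, the hand-picked assignment, and the path-graph example are all hypothetical, and the learned assignment and graph convolutions of the paper are omitted.

```python
def coarsen(adjacency, assignment, num_clusters):
    """Collapse a graph under a hard node-to-cluster assignment.

    adjacency: square list-of-lists adjacency matrix.
    assignment: assignment[i] is the cluster index of node i.
    Returns the coarsened adjacency S^T A S, where S is the one-hot
    assignment matrix implied by `assignment`.
    """
    coarse = [[0] * num_clusters for _ in range(num_clusters)]
    for i, row in enumerate(adjacency):
        for j, weight in enumerate(row):
            # Every edge weight is added to the pair of clusters it connects.
            coarse[assignment[i]][assignment[j]] += weight
    return coarse


# Path graph 0-1-2-3, split into clusters {0, 1} and {2, 3}.
path = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
print(coarsen(path, [0, 0, 1, 1], 2))
```

Diagonal entries of the result accumulate the edges inside each cluster (each undirected edge counted twice in a symmetric adjacency matrix), while off-diagonal entries count the edges between clusters.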


ENADPool: The Edge-Node Attention-based Differentiable Pooling for Graph Neural Networks

May 2024

·

36 Reads

Graph Neural Networks (GNNs) are powerful tools for graph classification. One important operation for GNNs is the downsampling or pooling that can learn effective embeddings from the node representations. In this paper, we propose a new hierarchical pooling operation, namely the Edge-Node Attention-based Differentiable Pooling (ENADPool), for GNNs to learn effective graph representations. Unlike the classical hierarchical pooling operation that is based on the unclear node assignment and simply computes the averaged feature over the nodes of each cluster, the proposed ENADPool not only employs a hard clustering strategy to assign each node into a unique cluster, but also compresses the node features as well as their edge connectivity strengths into the resulting hierarchical structure based on the attention mechanism after each pooling step. As a result, the proposed ENADPool simultaneously identifies the importance of different nodes within each separated cluster and edges between corresponding clusters, which significantly addresses the shortcomings of the uniform edge-node based structure information aggregation arising in the classical hierarchical pooling operation. Moreover, to mitigate the over-smoothing problem arising in existing GNNs, we propose a Multi-distance GNN (MD-GNN) model associated with the proposed ENADPool operation, allowing the nodes to actively and directly receive the feature information from neighbors at different random walk steps. Experiments demonstrate the effectiveness of the MD-GNN associated with the proposed ENADPool.


Exploring the Usage of Pre-trained Features for Stereo Matching

May 2024

·

84 Reads

·

9 Citations

International Journal of Computer Vision

For many vision tasks, utilizing pre-trained features results in improved performance and consistently benefits from the rapid advancement of pre-training technologies. However, in the field of stereo matching, the use of pre-trained features has not been extensively researched. In this paper, we present the first systematic exploration into the utilization of pre-trained features for stereo matching. To provide flexible employment for any combination of pre-trained backbones and stereo matching networks, we develop the deformable neck (DN) that decouples the network architectures of these two components. The core idea of DN is to utilize the deformable attention mechanism to iteratively fuse pre-trained features from shallow to deep layers. Empirically, our exploration reveals the crucial factors that influence using pre-trained features for stereo matching. We further investigate the role of instance-level information of pre-trained features, demonstrating that it benefits stereo matching but can be suppressed during convolution-based feature fusion. Built on the attention mechanism, the proposed DN module effectively utilizes the instance-level information in pre-trained features. Besides, we provide an understanding of the efficiency-accuracy tradeoff, concluding that using pre-trained features can also be a good alternative with efficiency consideration.


Citations (54)


... Unlike previous studies [3] that represent human gaze as Gaussian-distributed heatmaps, we model gaze as a sequential trajectory $G \in \mathbb{R}^{176 \times 2}$. The second dimension, which corresponds to spatial coordinates, is normalized to the range [0, 224] to match the input resolution of ViT. ...

Reference:

Gaze-Guided Learning: Avoiding Shortcut Bias in Visual Classification
Learning From Human Attention for Attribute-Assisted Visual Recognition
  • Citing Article
  • September 2024

IEEE Transactions on Pattern Analysis and Machine Intelligence

... Transformer-based methods have gained significant attention in the field of 3D skeleton-based action recognition due to their ability to capture long-range dependencies and global relationships, especially using the multi-head self-attention (MSA) mechanism [66][67][68]. These methods demonstrate superior performance in processing sequences by aggregating spatial-temporal data through attention-based approaches. ...

Exploring the Usage of Pre-trained Features for Stereo Matching

International Journal of Computer Vision

... Our approach aims to address the degree bias problem while preserving the global graph topology without requiring any label information [20]. We achieve this by introducing a self-supervised learning task that preserves the global graph structure, utilizing the transition probability matrix for global graph structure preservation. ...

HAQJSK: Hierarchical-Aligned Quantum Jensen-Shannon Kernels for Graph Classification
  • Citing Article
  • November 2024

IEEE Transactions on Knowledge and Data Engineering

... The many-to-many VC mechanism can be modified to function seamlessly in various VC settings, including any-to-one VC [105], any-to-any VC [106], and similar configurations. 5) Emotional VC: Apart from the conventional types of VC, another domain of VC, named emotional VC, has recently emerged. ...

Any-to-Any Voice Conversion With Multi-Layer Speaker Adaptation and Content Supervision
  • Citing Article
  • January 2023

IEEE/ACM Transactions on Audio Speech and Language Processing

... In contrast to open-vocabulary object detection methods, which rely on the interplay between categories and visual-semantic connections gleaned from vast repositories of big data and large models, zero-shot object detection necessitates the capacity to extrapolate to novel categories from base class data. This is achieved through the utilization of semantic embeddings [23,24], object attributes [25,26], relational reasoning [27,28], generative models [29,30], and cross-modal learning [31,32]. Parallel to the zero-shot paradigm, open-set [33,34] and open-world object detection [35,36] do not rely on supplementary data, although they dispense with the immediate imperative of accurately categorizing unseen classes, thereby alleviating the constraints of classification. ...

Attribute subspaces for zero-shot learning
  • Citing Article
  • August 2023

Pattern Recognition

... [1] Within QML, an even newer subfield called Quantum Graph Learning (QGL) is starting to be explored. As described by Yu et al. [2], QGL has the potential to solve or mitigate several substantial problems in graph learning including the difficulty of storing and processing large graphs and the limitation of the distance across which inferences can be made. QGL should also be able to apply some of the native benefits of QML to graph learning, including a reduction in the number of required training parameters. ...

Quantum Graph Learning: Frontiers and Outlook
  • Citing Preprint
  • February 2023

... For the second stage, numerous deep learning methods have been proposed to extract representations of brain networks to facilitate brain disorder diagnosis nowadays [11]. In particular, methods based on graph neural networks (GNNs) stand out for their ability to comprehensively capture network topology information [12]. ...

Position-aware and Structure Embedding Networks for Deep Graph Matching
  • Citing Article
  • December 2022

Pattern Recognition

... This technique facilitates emotional communication, enhances user experiences in human-machine interactions, and contributes to more immersive virtual environments [2]. Conventional EVC systems primarily rely on autoencoders [3,4], achieving significant improvements in speech quality [5,6]. However, their limited variability in synthesized voices restricts the diversity of emotional expressions [7]. ...

Speaker-Independent Emotional Voice Conversion via Disentangled Representations
  • Citing Article
  • January 2022

IEEE Transactions on Multimedia

... The adaptive neighbors for each data point are selected based on the local connectivity of the data. In [33], the hierarchical aligned quantum Jensen-Shannon kernel has been proposed by measuring the Quantum Jensen-Shannon Divergence (QJSD) between the transitive aligned graph structures. In addition, some enhanced graph learning methods have been proposed and applied to deep learning. ...

HAQJSK: Hierarchical-Aligned Quantum Jensen-Shannon Kernels for Graph Classification

... This observation inspired our use of hierarchical learning to recognize materials. Hierarchical image classification [8,10,28,48,60,87,89,94] has recently been used to take advantage of semantic relationships between classes, e.g., in object recognition, where the hierarchy is defined by WordNet [54]. Several such methods utilize graph neural networks [81,82] to explicitly represent hierarchical relationships, which have demonstrated impressive generalization capabilities [36,56,77,78,89]. ...

Where to Focus: Investigating Hierarchical Attention Relationship for Fine-Grained Visual Classification
  • Citing Chapter
  • November 2022

Lecture Notes in Computer Science