Wei-Hong Li’s research while affiliated with University of Edinburgh and other places


Publications (26)


Overview of our Universal Representation Learning framework (shown in purple in the figure), which learns multiple task-specific neural networks and distills their knowledge into a single universal network by aligning the universal network's features with the task-specific ones. Our method is generic and can be successfully applied to a wide range of problems, including multi-task dense prediction (a), multi-domain many-shot learning (b), and cross-domain few-shot learning (c)
Illustration of universal representation learning. In the first stage (a), we learn a task-specific deep network for each task. In the second stage (b), our goal is to learn a multi-task network that shares the feature encoder across all tasks and builds multiple task-specific decoders on top of the encoder, such that it performs well on these tasks compared to the task-specific models trained in (a). To achieve this, we train the multi-task network by jointly minimizing the task-specific losses and aligning the features and predictions between the multi-task network and the task-specific networks. For the aligned feature distillation, we introduce a set of task-specific adapters that transform the multi-task network's features into each task-specific space before alignment with the task-specific features
Qualitative results on NYU-v2. The first column shows the RGB image; the second column plots the ground truth or predictions with the IoU (↑) score of each method for semantic segmentation; the third column presents the ground truth or predictions with the absolute error (↓); and the last column shows the surface normal predictions with the mean error (↓)
Qualitative analysis of our method on two datasets. Green and red indicate correct and incorrect predictions, respectively (Color figure online)
Universal Representations: A Unified Look at Multiple Task and Domain Learning
  • Article
  • Full-text available

November 2023 · 66 Reads · 21 Citations

International Journal of Computer Vision

Wei-Hong Li · Hakan Bilen

We propose a unified look at jointly learning multiple vision tasks and visual domains through universal representations, a single deep neural network. Learning multiple problems simultaneously involves minimizing a weighted sum of multiple loss functions with different magnitudes and characteristics, which often results in an unbalanced state where one loss dominates the optimization and yields poor results compared to learning a separate model for each problem. To this end, we propose distilling the knowledge of multiple task/domain-specific networks into a single deep neural network after aligning its representations with the task/domain-specific ones through small-capacity adapters. We rigorously show that universal representations achieve state-of-the-art performance in learning multiple dense prediction problems on NYU-v2 and Cityscapes, multiple image classification problems from diverse domains on the Visual Decathlon Dataset, and cross-domain few-shot learning on Meta-Dataset. Finally, we conduct multiple analyses through ablation and qualitative studies.
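As a rough illustration of the distillation scheme the abstract describes, below is a minimal PyTorch sketch of aligning a shared network's features with frozen task-specific teachers through small per-task adapters. The module names, the 1x1-conv adapter, and the L2 alignment loss are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UniversalDistiller(nn.Module):
    """Minimal sketch: distill K frozen task-specific encoders into one
    shared (universal) encoder via small per-task adapters."""

    def __init__(self, shared_encoder, task_encoders, feat_dim):
        super().__init__()
        self.shared = shared_encoder                  # trainable universal encoder
        self.teachers = nn.ModuleList(task_encoders)  # frozen task-specific encoders
        for p in self.teachers.parameters():
            p.requires_grad_(False)
        # one small-capacity adapter per task (here: a 1x1 conv, an assumption)
        self.adapters = nn.ModuleList(
            nn.Conv2d(feat_dim, feat_dim, kernel_size=1) for _ in task_encoders
        )

    def alignment_loss(self, x):
        loss = 0.0
        f_shared = self.shared(x)
        for teacher, adapter in zip(self.teachers, self.adapters):
            with torch.no_grad():
                f_task = teacher(x)
            # map universal features into the task-specific space, then align
            loss = loss + F.mse_loss(adapter(f_shared), f_task)
        return loss
```

In training, this alignment term would be added to the weighted sum of task-specific losses, as the figure caption above describes.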




Universal Representations: A Unified Look at Multiple Task and Domain Learning

April 2022 · 33 Reads · 1 Citation



Learning Relation Models to Detect Important People in Still Images

January 2022 · 4 Reads · 2 Citations

IEEE Transactions on Multimedia

Important people detection aims to identify the most important people (i.e., the people who play the main roles in scenes) in images, which is challenging since people's importance in images depends not only on their appearance but also on their interactions with others (i.e., relations among people) and their roles in the scene (i.e., relations between people and underlying events). In this work, we propose the People Relation Network (PRN) to solve this problem. PRN consists of three modules (i.e., the feature representation, relation and classification modules) to extract visual features, model relations and estimate people's importance, respectively. The relation module contains two submodules to model two types of relations, namely, the person-person relation submodule and the person-event relation submodule. The person-person relation submodule infers the relations among people from the interaction graph, and the person-event relation submodule models the relations between people and events by considering the spatial correspondence between features. With their help, PRN can effectively distinguish important people from other individuals. Extensive experiments on the Multi-Scene Important People (MS) and NCAA Basketball Image (NCAA) datasets show that PRN achieves state-of-the-art performance and generalizes well when available data is limited.
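The person-person relation submodule can be pictured as one round of message passing over an interaction graph built from pairwise feature affinities. The attention-style affinity below is our stand-in, not necessarily how PRN builds its graph; the layer sizes and names are hypothetical.

```python
import torch
import torch.nn as nn

class PersonRelationModule(nn.Module):
    """Sketch of a person-person relation step: build an interaction graph
    from pairwise affinities and refine each person's feature by
    aggregating over the graph."""

    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.update = nn.Linear(dim, dim)

    def forward(self, persons):            # persons: (N, dim), one row per person
        q, k = self.query(persons), self.key(persons)
        # normalized pairwise affinities play the role of graph edge weights
        graph = torch.softmax(q @ k.t() / persons.size(1) ** 0.5, dim=-1)  # (N, N)
        return persons + torch.relu(self.update(graph @ persons))          # refined features
```

A classification head on the refined features would then score each person's importance.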


Learning Multiple Dense Prediction Tasks from Partially Annotated Data

November 2021 · 35 Reads

Despite recent advances in multi-task learning of dense prediction problems, most methods rely on expensive labelled datasets. In this paper, we present a label-efficient approach and look at jointly learning multiple dense prediction tasks on partially annotated data, which we call multi-task partially-supervised learning. We propose a multi-task training procedure that successfully leverages task relations to supervise multi-task learning when data is only partially annotated. In particular, we learn to map each task pair to a joint pairwise task-space, which enables sharing information between the two tasks in a computationally efficient way through another network conditioned on task pairs, and avoids learning trivial cross-task relations by retaining high-level information about the input image. We rigorously demonstrate that our proposed method effectively exploits images with unlabelled tasks and outperforms existing semi-supervised learning approaches and related methods on three standard benchmarks.
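One way to picture the joint pairwise task-space is a small mapping network conditioned on the task pair, with a consistency loss between the mapped outputs of the two tasks. The FiLM-style conditioning and the assumption that task outputs are first projected to a common channel width are ours, not details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairwiseTaskSpace(nn.Module):
    """Sketch: a mapping network conditioned on the task pair (s, t)
    projects task outputs into a joint space where they can be compared."""

    def __init__(self, num_tasks, channels):
        super().__init__()
        self.pair_embed = nn.Embedding(num_tasks * num_tasks, 2 * channels)
        self.mapper = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.num_tasks = num_tasks

    def forward(self, feat, s, t):                     # feat: (B, C, H, W)
        # pair-conditioned scale and shift (FiLM-style, an assumption)
        gamma, beta = self.pair_embed(
            torch.tensor(s * self.num_tasks + t, device=feat.device)
        ).chunk(2)
        feat = feat * gamma.view(1, -1, 1, 1) + beta.view(1, -1, 1, 1)
        return self.mapper(feat)

def consistency_loss(space, pred_s, pred_t, s, t):
    # map both tasks' outputs into the joint (s, t) space and align them;
    # the paper additionally guards against trivial mappings, omitted here
    return F.mse_loss(space(pred_s, s, t), space(pred_t, s, t))
```

On images where task t is unlabelled, this consistency term lets the labelled task s supervise task t's predictions.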



Improving Task Adaptation for Cross-domain Few-shot Learning

July 2021 · 412 Reads

In this paper, we look at the problem of cross-domain few-shot classification, which aims to learn a classifier for previously unseen classes and domains from few labeled samples. We study several strategies, including various adapter topologies and operations, in terms of their performance and efficiency; these adapters can easily be attached to existing methods with different meta-training strategies and adapt them to a given task during the meta-test phase. We show that parametric adapters attached to convolutional layers with residual connections perform the best and significantly improve the performance of the state-of-the-art models on the Meta-Dataset benchmark at minor additional cost. Our code will be available at https://github.com/VICO-UoE/URL.
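The best-performing design the abstract reports, a parametric adapter attached to a convolutional layer with a residual connection, can be sketched as follows. Whether the adapter reads the conv input or its output varies across designs; this sketch (our assumption) applies a zero-initialized 1x1 conv to the output.

```python
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Sketch: wrap a pretrained conv layer with a small residual adapter;
    only the adapter is trained when adapting to a new task."""

    def __init__(self, conv):
        super().__init__()
        self.conv = conv
        for p in self.conv.parameters():
            p.requires_grad_(False)                  # frozen backbone weights
        self.adapter = nn.Conv2d(conv.out_channels, conv.out_channels, 1)
        nn.init.zeros_(self.adapter.weight)          # start as an identity mapping
        nn.init.zeros_(self.adapter.bias)

    def forward(self, x):
        h = self.conv(x)
        return h + self.adapter(h)                   # residual adaptation
```

Updating only the adapter parameters during the meta-test phase keeps the per-task adaptation cost small.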


Figure 2. Illustration of our proposed method for multi-domain feature learning. Given training images from K different domains, we first train K domain-specific networks $f_{\phi_1^*}, \ldots, f_{\phi_K^*}$ and their classifiers $h_{\psi_1^*}, \ldots, h_{\psi_K^*}$
Figure 4. Qualitative comparison to URT on four datasets. Green and red indicate correct and incorrect predictions, respectively.
Comparison of loss functions for knowledge distillation.
Results of the Five-Way One-Shot and Varying-Way Five-Shot settings. Mean accuracies with confidence intervals are reported.
Universal Representation Learning from Multiple Domains for Few-shot Classification

March 2021 · 896 Reads

In this paper, we look at the problem of few-shot classification, which aims to learn a classifier for previously unseen classes and domains from few labeled samples. Recent methods use adaptation networks for aligning their features to new domains or select the relevant features from multiple domain-specific feature extractors. In this work, we propose to learn a single set of universal deep representations by distilling knowledge of multiple separately trained networks after co-aligning their features with the help of adapters and centered kernel alignment. We show that the universal representations can be further refined for previously unseen domains by an efficient adaptation step, in a similar spirit to distance learning methods. We rigorously evaluate our model on the recent Meta-Dataset benchmark and demonstrate that it significantly outperforms the previous methods while being more efficient. Our code will be available at https://github.com/VICO-UoE/URL.
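Centered kernel alignment, one ingredient named above, has a simple linear form; the helper below computes standard linear CKA between two feature matrices (the epsilon guard is our addition).

```python
import torch

def linear_cka(x, y):
    """Linear CKA between feature matrices x: (n, d1) and y: (n, d2)
    computed over the same n examples; values near 1 mean well-aligned."""
    x = x - x.mean(dim=0, keepdim=True)          # center each feature dimension
    y = y - y.mean(dim=0, keepdim=True)
    cross = torch.linalg.matrix_norm(x.t() @ y) ** 2
    self_x = torch.linalg.matrix_norm(x.t() @ x)
    self_y = torch.linalg.matrix_norm(y.t() @ y)
    return cross / (self_x * self_y + 1e-8)
```

During distillation, such an alignment score can be maximized between the universal network's (adapter-transformed) features and each domain-specific teacher's features.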


MINI-Net: Multiple Instance Ranking Network for Video Highlight Detection

November 2020 · 21 Reads · 60 Citations

Lecture Notes in Computer Science

We address the weakly supervised video highlight detection problem: learning to detect the most attractive segments in training videos given only their video event label, without expensive manual annotation of highlight segments. While it avoids manually localizing highlight segments, weakly supervised modeling is challenging, as a video in daily life could contain highlight segments of multiple event types, e.g., skiing and surfing. In this work, we propose to cast weakly supervised video highlight detection for a given event as learning a multiple instance ranking network (MINI-Net). We consider each video as a bag of segments; the proposed MINI-Net learns to enforce a higher highlight score for a positive bag that contains highlight segments of a specific event than for negative bags that are irrelevant. In particular, we form a max-max ranking loss to acquire a reliable relative comparison between the most likely positive segment instance and the hardest negative segment instance. With this max-max ranking loss, our MINI-Net effectively leverages all segment information to acquire a more distinct video feature representation for localizing the highlight segments of a specific event in a video. Extensive experimental results on three challenging public benchmarks clearly validate the efficacy of our multiple instance ranking approach.
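The max-max ranking loss described above reduces to a hinge between the best-scoring segment of a positive bag and the hardest (highest-scoring) segment of a negative bag; a minimal sketch follows, with the margin value as our assumption.

```python
import torch
import torch.nn.functional as F

def max_max_ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Sketch of the max-max ranking loss.

    pos_scores: (num_pos_segments,) highlight scores of a positive bag
    neg_scores: (num_neg_segments,) highlight scores of a negative bag
    The most likely highlight segment in the positive bag should outscore
    the hardest negative segment by at least the margin.
    """
    return F.relu(margin - pos_scores.max() + neg_scores.max())
```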


Citations (17)


... Since the encoder in computer vision usually contains large parameters, sharing the encoder across different tasks can significantly reduce the computational cost but may cause conflicts among tasks, leading to a performance drop in partial tasks. Hence, many methods are proposed to address the issue from the perspective of MOO, which use some scalarization methods to balance multiple losses [155]- [159]. A comprehensive survey on multi-task dense prediction is provided in [4]. ...

Reference:

Gradient-Based Multi-Objective Deep Learning: Algorithms, Theories, Applications, and Beyond
Universal Representations: A Unified Look at Multiple Task and Domain Learning

International Journal of Computer Vision

... POINT [8] approach enhances the ability to model interpersonal interaction relationships, yet multilayered interaction modeling remains relatively limited, and feature aggregation tends to introduce randomness, thereby diminishing the model's stability. PRN [9] attempts to comprehensively capture interpersonal interaction features through global relationship modeling, but it exhibits clear limitations in handling hierarchical and non-linear relationships. Additionally, there is substantial room for improvement in the efficiency of multi-source information fusion [10,11]. ...

Learning Relation Models to Detect Important People in Still Images
  • Citing Article
  • January 2022

IEEE Transactions on Multimedia

... Research in MTL can be broadly classified into two primary categories. The first [5,27,33,45,67,70] focuses on refining network architectures to optimize information sharing across tasks and to enhance task-specific representations. For instance, PolyMax [70] handles diverse prediction tasks effectively, showcasing adaptability and spatial data handling. ...

Learning Multiple Dense Prediction Tasks from Partially Annotated Data
  • Citing Conference Paper
  • June 2022

... Incorporating Adapters [40,41] into transformer layers is one of the parameter-efficient tuning methods for ViT, which is suitable for the fast adaptation with limited labeled samples [42,43,44]. However, when executing CL tasks with adapter-embedded ViT, we find that the performance of the previous task is difficult to maintain when the model adapts fast to novel tasks. ...

Cross-domain Few-shot Learning with Task-specific Adapters
  • Citing Conference Paper
  • June 2022

... The purpose of knowledge distillation is to transfer the knowledge learned by a cumbersome teacher network to a compact student network [38,39]. Consequently, a compact network with stronger feature learning ability can be obtained. ...

Universal Representations: A Unified Look at Multiple Task and Domain Learning
  • Citing Preprint
  • April 2022

... These models only use miniImagenet as the source domain. To further enhance the generalization ability, URL [33] and SUR [34] aimed to fuse a set of general representations from multiple source domains to address cross-domain problems. ISS [35] also introduced multiple source domains, but without requiring additional labels. ...

Universal Representation Learning from Multiple Domains for Few-shot Classification
  • Citing Conference Paper
  • October 2021

... To address this optimization problem, some studies attempt to train independent strategies for each task and use strategy distillation techniques [22,23] or knowledge transfer techniques [12,24] to solve multitasking problems or quickly adapt to new tasks. However, when faced with a large number of tasks, this method can lead to a significant increase in the number of network parameters. ...

Knowledge Distillation for Multi-task Learning
  • Citing Chapter
  • January 2020

Lecture Notes in Computer Science

... Video grounding aims to localize video content provided a custom query. Besides STVG, there exist many other video grounding tasks, such as moment retrieval (e.g., [1,6,18,23,38]), query-based video summarization (e.g., [27,34]), video highlight detection (e.g., [2,9,11,35]), etc. Unlike these tasks for only temporal grounding in the video, OmniSTVG and STVG aim at both spatial and temporal grounding from videos. ...

MINI-Net: Multiple Instance Ranking Network for Video Highlight Detection
  • Citing Chapter
  • November 2020

Lecture Notes in Computer Science

... JSRII-GNN [20] combined social relationships and object information to improve detection performance, but it failed to comprehensively capture multi-level interaction features, leading to limited performance in complex scenarios. Li et al. [21] proposed a pseudo-label learning method for detecting important individuals, which improved model efficiency by reducing labeling requirements. However, it underperforms in modeling complex relationships. ...

Learning to Detect Important People in Unlabelled Images for Semi-Supervised Important People Detection
  • Citing Conference Paper
  • June 2020

... To address this issue, we created a sliding window based method for extracting video key frames that combines local feature point detection (SURF) and global feature extraction (Gist), which is especially intended for low-quality user-generated films. Due to frequent camera movement and unclear edit points, user-generated videos are difficult for traditional shot identification techniques to identify [13][14][15][16][17]. CGRW-KF effectively addresses the previously mentioned issues. ...

MINI-Net: Multiple Instance Ranking Network for Video Highlight Detection
  • Citing Preprint
  • July 2020