Article

When Multitask Learning Meets Partial Supervision: A Computer Vision Review

Abstract

Multitask learning (MTL) aims to learn multiple tasks simultaneously while exploiting their mutual relationships. By using shared resources to compute multiple outputs simultaneously, this learning paradigm has the potential to lower memory requirements and inference times compared to the traditional approach of using a separate method for each task. Previous work in MTL has mainly focused on fully supervised methods, as task relationships (TRs) can be leveraged not only to lower the data dependency of those methods but also to improve performance. However, MTL introduces a set of challenges due to a complex optimization scheme and a higher labeling requirement. This article focuses on how MTL could be utilized under different partial supervision settings to address these challenges. First, this article analyses how MTL traditionally uses different parameter sharing techniques to transfer knowledge between tasks. Second, it presents the challenges arising from such a multiobjective optimization (MOO) scheme. Third, it introduces how task groupings (TGs) can be achieved by analyzing TRs. Fourth, it focuses on how partially supervised methods applied to MTL can tackle the aforementioned challenges. Lastly, this article presents the available datasets, tools, and benchmarking results of such methods. The reviewed articles, categorized following this work, are available at https://github.com/Klodivio355/MTL-CV-Review.
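As a minimal illustration of the parameter sharing this paradigm relies on (a sketch with hypothetical toy tasks and layers, not one of the reviewed architectures), a single shared encoder can feed lightweight task-specific heads so that most computation and memory are shared across tasks:

```python
# Minimal hard-parameter-sharing sketch (illustrative only): one shared backbone,
# several task-specific heads, all outputs computed from a single forward pass.
import torch
import torch.nn as nn

class HardSharingMTL(nn.Module):
    def __init__(self, num_classes_seg: int = 21):
        super().__init__()
        # Shared backbone: computed once per image, reused by every task.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Task-specific heads: semantic segmentation and monocular depth (hypothetical tasks).
        self.seg_head = nn.Conv2d(128, num_classes_seg, 1)
        self.depth_head = nn.Conv2d(128, 1, 1)

    def forward(self, x):
        feats = self.encoder(x)                  # shared computation
        return {"segmentation": self.seg_head(feats),
                "depth": self.depth_head(feats)}

model = HardSharingMTL()
outputs = model(torch.randn(2, 3, 128, 128))
print({k: v.shape for k, v in outputs.items()})
```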

... MTL, a method that follows a "learning-to-learn" approach and uses a system's shared representations to learn, in parallel, the domain-relevant information accumulated from individually relevant tasks, has become a state-of-the-art learning method for many benchmark tasks and even certain real-world problems. This ability to leverage shared representations has been particularly effective in domains such as natural language processing (NLP) [37] and computer vision (CV) [38], where a wide range of tasks often share underlying structures or features. Just as "Jade can be polished with stones from other mountains". ...
Article
Full-text available
Multi-agent reinforcement learning (MARL) has proven to be effective and promising in team collaboration tasks. Knowledge transfer in MARL has also received increasing attention. Compared to knowledge transfer in single-agent tasks, knowledge transfer in multi-agent tasks is more complex due to the need to account for coordination among agents. However, existing knowledge transfer-based methods only focus on strategies or agent-level knowledge in a single task, and the transfer of knowledge in such a specific task to new and different types of tasks is likely to fail. In this paper, we propose a multitask-based training framework termed MTT in cooperative MARL, which aims to learn shared collaborative knowledge across multiple tasks simultaneously and then apply it to solve other related tasks. However, models obtained from multitask learning may fail on other tasks because the gradients from different tasks may conflict with each other. To obtain a model with shared knowledge, we provide conflict-free updates by ensuring a positive dot product between the final update and the gradient of each specific task. It also maintains a consistent optimization rate for all tasks. Experiments conducted in two popular environments, StarCraft II Multi-Agent Challenge and Google Research Football, demonstrate that our method outperforms the baselines, significantly improving the efficiency of team collaboration.
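The conflict-free update idea can be illustrated with a generic gradient-projection sketch (in the style of PCGrad-type methods; this is not the MTT update rule, and the two toy gradients are made up): when two task gradients point in opposing directions, the conflicting component is removed before summing.

```python
# Illustrative gradient-conflict projection, not the MTT algorithm itself: when two
# task gradients conflict (negative dot product), each is projected onto the normal
# plane of the other so the merged update does not harm either task.
import torch

def project_conflicting(grads):
    """grads: list of flat per-task gradient tensors; returns a merged update."""
    projected = [g.clone() for g in grads]
    for i, g_i in enumerate(projected):
        for j, g_j in enumerate(grads):
            if i == j:
                continue
            dot = torch.dot(g_i, g_j)
            if dot < 0:  # conflict: remove the component of g_i that opposes g_j
                g_i -= dot / (g_j.norm() ** 2 + 1e-12) * g_j
    return torch.stack(projected).sum(dim=0)

g_task_a = torch.tensor([1.0, 0.0])
g_task_b = torch.tensor([-0.5, 1.0])   # conflicts with task A along the first axis
update = project_conflicting([g_task_a, g_task_b])
print(update)  # merged direction; non-negative dot product with both gradients here
```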
... This scalability is particularly beneficial in robotic-assisted surgeries, where the visual field is subject to frequent changes, and the interaction between surgical instruments and tissues is complex. By training on extensive datasets, Transformer-based models can adapt to these challenges, ensuring that they remain robust and effective across a variety of surgical tasks [30], [31]. ...
Article
Full-text available
Visual Question Answering (VQA) in robotic surgery is rapidly becoming a pivotal technology in medical AI, addressing the complex challenge of interpreting multimodal surgical data to support real-time decision-making. This comprehensive review synthesizes key advancements in Surgical VQA, highlighting the integration of large language models (LLMs), multimodal fusion techniques, and visual grounding methods. By reviewing 62 key studies selected through a systematic search of major scientific databases, including IEEE Xplore, Google Scholar, SpringerLink, and PubMed, we trace the evolution of VQA systems and their application in surgical environments. Current limitations, including dataset scarcity, multimodal alignment challenges, and issues of interpretability, are critically examined. This survey aims to not only provide a structured overview of the field but also identify critical research gaps and propose future directions to enhance VQA systems for robotic surgery, with the ultimate goal of improving intraoperative performance and patient outcomes.
... Multitask learning [16] aims to improve the generalization performance of all tasks by learning useful information from multiple related tasks. As a learning paradigm in machine learning, multitask learning is widely applied in various fields [17][18][19][20]37]. In computer vision, multitask learning is commonly used for autonomous-driving-related tasks [21,22], such as road segmentation and object detection (OD). ...
Article
Full-text available
In the era of widespread digital image sharing on social media platforms, deep-learning-based watermarking has shown great potential in copyright protection. To address the fundamental trade-off between the visual quality of the watermarked image and the robustness of watermark extraction, we explore the role of structural features and propose a novel edge-aware watermarking framework. Our primary innovation lies in the edge-aware secret hiding module (EASHM), which achieves adaptive watermark embedding by aligning watermarks with image structural features. To realize this, the EASHM leverages knowledge distillation from an edge detection teacher and employs a dual-task encoder that simultaneously performs edge detection and watermark embedding through maximal parameter sharing. The framework is further equipped with a social network noise simulator (SNNS) and a secret recovery module (SRM) to enhance robustness against common image noise attacks. Extensive experiments on three public datasets demonstrate that our framework achieves superior watermark imperceptibility, with PSNR and SSIM values exceeding 40.82 dB and 0.9867, respectively, while maintaining an over 99% decoding accuracy under various noise attacks, outperforming existing methods by significant margins.
... However, there is increasing interest in developing unified models that can predict multiple tasks simultaneously. This approach, known as Multitask Learning (MTL) [1,15,38], aims to improve the efficiency and coherence of predictions in different tasks by leveraging shared information and representations, resulting in substantial advantages over traditional methods [7,29,53]. MTL frameworks allow interactions between tasks at various stages within the model with the aim of enhancing overall multi-task performance. On the one hand, many previous attempts consist in implementing Cross-Task Prediction Coherence, either through distillation [8,27,30] or attention mechanisms [23,46,51]. ...
Preprint
Multi-Task Learning (MTL) involves the concurrent training of multiple tasks, offering notable advantages for dense prediction tasks in computer vision. MTL not only reduces training and inference time as opposed to having multiple single-task models, but also enhances task accuracy through the interaction of multiple tasks. However, existing methods face limitations. They often rely on suboptimal cross-task interactions, resulting in task-specific predictions with poor geometric and predictive coherence. In addition, many approaches use inadequate loss weighting strategies, which do not address the inherent variability in task evolution during training. To overcome these challenges, we propose an advanced MTL model specifically designed for dense vision tasks. Our model leverages state-of-the-art vision transformers with task-specific decoders. To enhance cross-task coherence, we introduce a trace-back method that improves both cross-task geometric and predictive features. Furthermore, we present a novel dynamic task balancing approach that projects task losses onto a common scale and prioritizes more challenging tasks during training. Extensive experiments demonstrate the superiority of our method, establishing new state-of-the-art performance across two benchmark datasets. The code is available at: https://github.com/Klodivio355/MT-CP
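A minimal sketch of one way to project task losses onto a common scale and emphasize harder tasks (an assumed scheme for illustration, not necessarily the balancing approach of this preprint): each loss is normalized by its initial value, and tasks whose relative loss stays high receive larger weights.

```python
# Illustrative loss balancing: losses are divided by their value at the start of
# training (common scale); tasks with a larger relative loss get larger weights.
import torch

class RelativeLossBalancer:
    def __init__(self):
        self.initial = {}

    def combine(self, losses: dict) -> torch.Tensor:
        ratios = {}
        for name, loss in losses.items():
            if name not in self.initial:
                self.initial[name] = loss.detach()
            ratios[name] = loss / (self.initial[name] + 1e-12)   # common scale
        total_ratio = sum(r.detach() for r in ratios.values())
        weights = {n: len(ratios) * r.detach() / total_ratio for n, r in ratios.items()}
        return sum(weights[n] * ratios[n] for n in ratios)

balancer = RelativeLossBalancer()
total = balancer.combine({"seg": torch.tensor(2.0, requires_grad=True),
                          "depth": torch.tensor(0.1, requires_grad=True)})
```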
Article
In the field of Remote Sensing Change Detection (RSCD), accurately identifying significant changes between bitemporal images is essential for environmental monitoring, urban planning, and disaster assessment. In recent years, advancements in deep learning for computer vision have transformed RSCD, significantly enhancing its effectiveness. However, existing methods often overlook the importance of depth information, focusing primarily on two-dimensional information. This limits their ability to capture subtle changes and structural details in three-dimensional space. To address these limitations, we introduce ChangeDA—a depth-augmented multi-task network designed to enhance the effectiveness of RSCD. ChangeDA introduces a depth encoder module to extract implicit depth information from optical images, enabling the utilization of 3D structural information without reliance on external data sources. Through the Depth Infusion Module (DIM), depth information is integrated into the dual-temporal feature maps, significantly enhancing the network’s ability to perceive changes in three-dimensional spatial structures. Additionally, ChangeDA includes a Differential Feature Extractor (DFE) tailored to pinpoint differential features between sequential images, and an Adaptive All feature Fusion (AAFF) strategy that significantly improves recognition accuracy and generalization capability through cross-level feature integration. Performance evaluations on four prominent single-modal datasets—LEVIR-CD, S2-Looking, WHU-CD, and SYSU-CD—yielded state-of-the-art F1-scores of 92.27%, 66.42%, 94.12%, and 82.74%, respectively. Furthermore, ChangeDA also achieved outstanding results on the multi-modal 3DCD dataset, with an F1 score of 63.52% in 2D CD and an RMSE of 1.20 in the 3D CD task. These results demonstrate ChangeDA’s robust adaptability across diverse targets and real-world scenarios.
Article
Full-text available
In recent times, with the exception of sporadic cases, the trend in computer vision is to achieve minor improvements at the cost of considerable increases in complexity. To reverse this trend, we propose a novel method to boost image classification performance without increasing complexity. To this end, we revisited ensembling, a powerful approach that is often not used properly due to its more complex nature and training time, and made it feasible through a specific design choice. First, we trained two EfficientNet-b0 end-to-end models (known to be the architecture with the best overall accuracy/complexity trade-off for image classification) on disjoint subsets of data (i.e., bagging). Then, we built an efficient adaptive ensemble by fine-tuning a trainable combination layer. In this way, we were able to outperform the state of the art by an average of 0.5% in accuracy, with restrained complexity both in terms of the number of parameters (lower by 5–60 times) and floating point operations per second (FLOPS, lower by 10–100 times) on several major benchmark datasets.
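A sketch of the adaptive-ensemble idea (freezing the branches and combining concatenated logits with a linear layer are assumptions made for illustration, not the paper's exact design): two EfficientNet-b0 branches trained on disjoint subsets are merged through a small trainable combination layer that is fine-tuned on its own.

```python
# Illustrative adaptive ensemble: two bagged backbones stay frozen; only a small
# combination layer over their concatenated logits is trained.
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0

class AdaptiveEnsemble(nn.Module):
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.branch_a = efficientnet_b0(num_classes=num_classes)
        self.branch_b = efficientnet_b0(num_classes=num_classes)
        for p in list(self.branch_a.parameters()) + list(self.branch_b.parameters()):
            p.requires_grad = False                       # bagging step already done
        self.combine = nn.Linear(2 * num_classes, num_classes)  # trainable combination layer

    def forward(self, x):
        logits = torch.cat([self.branch_a(x), self.branch_b(x)], dim=1)
        return self.combine(logits)
```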
Article
Full-text available
Training deep networks for semantic segmentation requires large amounts of labeled training data, which presents a major challenge in practice, as labeling segmentation masks is a highly labor-intensive process. To address this issue, we present a framework for semi-supervised and domain-adaptive semantic segmentation, which is enhanced by self-supervised monocular depth estimation (SDE) trained only on unlabeled image sequences. In particular, we utilize SDE as an auxiliary task comprehensively across the entire learning framework: First, we automatically select the most useful samples to be annotated for semantic segmentation based on the correlation of sample diversity and difficulty between SDE and semantic segmentation. Second, we implement a strong data augmentation by mixing images and labels using the geometry of the scene. Third, we transfer knowledge from features learned during SDE to semantic segmentation by means of transfer and multi-task learning. And fourth, we exploit additional labeled synthetic data with Cross-Domain DepthMix and Matching Geometry Sampling to align synthetic and real data. We validate the proposed model on the Cityscapes dataset, where all four contributions demonstrate significant performance gains, and achieve state-of-the-art results for semi-supervised semantic segmentation as well as for semi-supervised domain adaptation. In particular, with only 1/30 of the Cityscapes labels, our method achieves 92% of the fully-supervised baseline performance and even 97% when exploiting additional data from GTA. The source code is available at https://github.com/lhoyer/improving_segmentation_with_selfsupervised_depth.
Article
Full-text available
Morphological and functional changes in retinal vessels are indicators of a variety of chronic diseases, such as diabetes, stroke, and hypertension. However, without a large number of high-quality annotations, existing deep learning-based medical image segmentation approaches may degrade their performance dramatically on the retinal vessel segmentation task. To reduce the demand for high-quality annotations and make full use of massive unlabeled data, we propose a self-supervised multi-task strategy to extract curvilinear vessel features for the retinal vessel segmentation task. Specifically, we use a dense network to extract more vessel features across different layers/slices, which is elaborately designed to train and test efficiently on hardware. Then, we combine three general pre-training tasks (i.e., intensity transformation, random pixel filling, in-painting and out-painting) in an aggregated way to learn rich hierarchical representations of curvilinear retinal vessel structures. Furthermore, a vector classification task module is introduced as another pre-training task to obtain more spatial features. Finally, to make the segmentation network pay more attention to curvilinear structures, a novel dynamic loss is proposed to learn robust vessel details from unlabeled fundus images. These four pre-training tasks greatly reduce the reliance on labeled data. Moreover, our network can learn the retinal vessel features effectively in the pre-training process, which leads to better performance in the target multi-modal segmentation task. Experimental results show that our method provides a promising direction for the retinal vessel segmentation task. Compared with other state-of-the-art supervised deep-learning-based methods, our method requires less labeled data and achieves comparable segmentation accuracy. For instance, we match the accuracy of traditional supervised learning methods on the DRIVE and Vampire datasets without needing any labeled ground-truth images. With elaborate training, we achieve an accuracy of 0.96 on the DRIVE dataset.
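As an illustration of one of the generic pre-training tasks listed above, an in-painting-style corruption might look like the following (the rectangular mask and its size range are assumptions; the paper's exact corruption scheme may differ):

```python
# Illustrative in-painting pretext transform: a random rectangle is blanked out and
# the network is trained to reconstruct the original, unlabeled image.
import torch

def inpainting_pretext(image: torch.Tensor, max_frac: float = 0.3):
    """image: (C, H, W). Returns (corrupted_image, target_image)."""
    c, h, w = image.shape
    mh = int(h * max_frac * torch.rand(1))
    mw = int(w * max_frac * torch.rand(1))
    top = int(torch.randint(0, max(h - mh, 1), (1,)))
    left = int(torch.randint(0, max(w - mw, 1), (1,)))
    corrupted = image.clone()
    corrupted[:, top:top + mh, left:left + mw] = 0.0   # blank out a random region
    return corrupted, image
```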
Chapter
Full-text available
While self-supervised learning (SSL) algorithms have been widely used to pre-train deep models, few efforts [11] have been made to improve representation learning of X-ray image analysis with SSL pre-trained models. In this work, we study a novel self-supervised pre-training pipeline, namely Multi-task Self-supervised Continual Learning (MUSCLE), for multiple medical imaging tasks, such as classification and segmentation, using X-ray images collected from multiple body parts, including heads, lungs, and bones. Specifically, MUSCLE aggregates X-rays collected from multiple body parts for MoCo-based representation learning, and adopts a well-designed continual learning (CL) procedure to further pre-train the backbone on various X-ray analysis tasks jointly. Certain strategies for image pre-processing, learning schedules, and regularization have been used to solve data heterogeneity, over-fitting, and catastrophic forgetting problems for multi-task/dataset learning in MUSCLE. We evaluate MUSCLE using 9 real-world X-ray datasets with various tasks, including pneumonia classification, skeletal abnormality classification, lung segmentation, and tuberculosis (TB) detection. Comparisons against other pre-trained models [7] confirm the proof-of-concept that self-supervised multi-task/dataset continual pre-training could boost the performance of X-ray image analysis. Keywords: X-ray images (X-ray), Self-supervised learning
Conference Paper
Full-text available
This paper introduces the results of Team LingJing's experiments in SemEval-2022 Task 1 Comparing Dictionaries and Word Embeddings (CODWOE). This task aims at comparing two types of semantic descriptions, including the definition modeling and reverse dictionary tracks. Our team focuses on the reverse dictionary track and adopts multi-task self-supervised pre-training for multilingual reverse dictionaries. Specifically, the randomly initialized mDeBERTa-base model is used to perform multi-task pre-training on the multilingual training datasets. The pre-training step is divided into two stages, namely the MLM pre-training stage and the contrastive pre-training stage. As a result, all the experiments are performed on the pre-trained language model during fine-tuning. The experimental results show that the proposed method has achieved good performance in the reverse dictionary track, where we rank 1st on the SGNS targets for the EN and RU languages. All the experimental code is open-sourced at https://github.com/WENGSYX/Semeval.
Article
Full-text available
The Mixture-of-Experts (MoE) architecture is showing promising results in improving parameter sharing in multi-task learning (MTL) and in scaling high-capacity neural networks. State-of-the-art MoE models use a trainable "sparse gate" to select a subset of the experts for each input example. While conceptually appealing, existing sparse gates, such as Top-k, are not smooth. The lack of smoothness can lead to convergence and statistical performance issues when training with gradient-based methods. In this paper, we develop DSelect-k: a continuously differentiable and sparse gate for MoE, based on a novel binary encoding formulation. The gate can be trained using first-order methods, such as stochastic gradient descent, and offers explicit control over the number of experts to select. We demonstrate the effectiveness of DSelect-k on both synthetic and real MTL datasets with up to 128 tasks. Our experiments indicate that DSelect-k can achieve statistically significant improvements in prediction and expert selection over popular MoE gates. Notably, on a real-world, large-scale recommender system, DSelect-k achieves over 22% improvement in predictive performance compared to Top-k. We provide an open-source implementation of DSelect-k.
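For context, a standard Top-k sparse gate, the kind of non-smooth gate DSelect-k is designed to replace, can be sketched as follows (the DSelect-k binary-encoding gate itself is more involved and not reproduced here):

```python
# Illustrative Top-k sparse MoE gate: each example is routed to exactly k experts.
# The discrete selection is not differentiable, which motivates smooth alternatives.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGate(nn.Module):
    def __init__(self, d_in: int, num_experts: int, k: int = 2):
        super().__init__()
        self.w_gate = nn.Linear(d_in, num_experts)
        self.k = k

    def forward(self, x):
        scores = self.w_gate(x)                             # (batch, num_experts)
        topk_vals, topk_idx = scores.topk(self.k, dim=-1)   # hard choice of experts
        masked = torch.full_like(scores, float("-inf")).scatter(-1, topk_idx, topk_vals)
        return F.softmax(masked, dim=-1)                    # sparse mixture weights

gate = TopKGate(d_in=16, num_experts=8, k=2)
weights = gate(torch.randn(4, 16))   # each row has exactly two non-zero expert weights
```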
Article
Full-text available
Transformers have recently led to encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVT v1) by adding three designs: (i) a linear complexity attention layer, (ii) an overlapping patch embedding, and (iii) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linearity and provides significant improvements on fundamental vision tasks such as classification, detection, and segmentation. In particular, PVT v2 achieves comparable or better performance than recent work such as the Swin transformer. We hope this work will facilitate state-of-the-art transformer research in computer vision. Code is available at https://github.com/whai362/PVT.
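One of these ingredients, the overlapping patch embedding, reduces to a strided convolution whose kernel exceeds its stride; a minimal sketch follows (the hyper-parameters shown are assumptions chosen for illustration):

```python
# Illustrative overlapping patch embedding: neighbouring patches overlap because the
# convolution kernel (7) is larger than its stride (4).
import torch
import torch.nn as nn

class OverlapPatchEmbed(nn.Module):
    def __init__(self, in_chans=3, embed_dim=64, patch_size=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size,
                              stride=stride, padding=patch_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        x = self.proj(x)                       # (B, C, H/stride, W/stride)
        b, c, h, w = x.shape
        x = x.flatten(2).transpose(1, 2)       # (B, N, C) token sequence
        return self.norm(x), (h, w)

tokens, (h, w) = OverlapPatchEmbed()(torch.randn(1, 3, 224, 224))
```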
Conference Paper
Full-text available
Medical image segmentation has benefitted significantly from deep learning architectures. Furthermore, semi-supervised learning (SSL) has recently been a growing trend for improving a model's overall performance by leveraging abundant unlabeled data. Moreover, learning multiple tasks within the same model further improves model generalizability. To generate smooth and accurate segmentation masks from 3D cardiac MR images, we present a Multi-task Cross-task learning consistency approach to enforce the correlation between the pixel-level (segmentation) and the geometric-level (distance map) tasks. Our extensive experimentation with varied quantities of labeled data in the training sets demonstrates the effectiveness of our model for the segmentation of the left atrial cavity from Gadolinium-enhanced magnetic resonance (GE-MR) images. With the incorporation of uncertainty estimates to detect failures in the segmentation masks generated by CNNs, our study further showcases the potential of our model to flag low-quality segmentations from a given model.
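A rough sketch of the segmentation-to-distance-map consistency idea is given below; the thresholding, the Euclidean distance transform, and the MSE comparison are assumptions, and because the transform here is non-differentiable this toy version only supervises the distance-map head (the paper's actual loss may be formulated differently):

```python
# Illustrative cross-task consistency term: compare the distance map implied by the
# predicted segmentation with the directly regressed distance map.
import torch
import torch.nn.functional as F
from scipy.ndimage import distance_transform_edt

def cross_task_consistency(seg_logits, pred_dist):
    """seg_logits, pred_dist: (B, 1, H, W)."""
    masks = (torch.sigmoid(seg_logits) > 0.5).squeeze(1).cpu().numpy()
    dist_from_seg = torch.stack([
        torch.from_numpy(distance_transform_edt(m)).float() for m in masks
    ]).unsqueeze(1).to(pred_dist.device)
    # Non-differentiable w.r.t. the segmentation branch in this toy version.
    return F.mse_loss(pred_dist, dist_from_seg)
```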
Article
Full-text available
The automatic analysis of endoscopic images to assist endoscopists in accurately identifying the types and locations of esophageal lesions remains a challenge. In this paper, we propose a novel multi-task deep learning model for automatic diagnosis, which does not simply replace the role of endoscopists in decision making, because endoscopists are expected to correct the false results predicted by the diagnosis system if more supporting information is provided. In order to help endoscopists improve the diagnosis accuracy in identifying the types of lesions, an image retrieval module is added in the classification task to provide an additional confidence level of the predicted types of esophageal lesions. In addition, a mutual attention module is added in the segmentation task to improve its performance in determining the locations of esophageal lesions. The proposed model is evaluated and compared with other deep learning models using a dataset of 1003 endoscopic images, including 290 esophageal cancer, 473 esophagitis, and 240 normal. The experimental results show the promising performance of our model with a high accuracy of 96.76% for the classification and a Dice coefficient of 82.47% for the segmentation. Consequently, the proposed multi-task deep learning model can be an effective tool to help endoscopists in judging esophageal lesions.
Article
Full-text available
Transfer learning enables re-using knowledge learned on a source task to help learn a target task. A simple form of transfer learning is common in current state-of-the-art computer vision models, i.e., pre-training a model for image classification on the ILSVRC dataset and then fine-tuning it on any target task. However, previous systematic studies of transfer learning have been limited and the circumstances in which it is expected to work are not fully understood. In this paper we carry out an extensive experimental exploration of transfer learning across vastly different image domains (consumer photos, autonomous driving, aerial imagery, underwater, indoor scenes, synthetic, close-ups) and task types (semantic segmentation, object detection, depth estimation, keypoint detection). Importantly, these are all complex, structured output task types relevant to modern computer vision applications. In total we carry out over 2000 transfer learning experiments, including many where the source and target come from different image domains, task types, or both. We systematically analyze these experiments to understand the impact of image domain, task type, and dataset size on transfer learning performance. Our study leads to several insights and concrete recommendations for practitioners.
Conference Paper
Full-text available
Multi-task learning (MTL) has been widely used in representation learning. However, naively training all tasks simultaneously may lead to the partial training issue, where specific tasks are trained more adequately than others. In this paper, we propose to learn multiple tasks impartially. Specifically, for the task-shared parameters, we optimize the scaling factors via a closed-form solution, such that the aggregated gradient (sum of raw gradients weighted by the scaling factors) has equal projections onto individual tasks. For the task-specific parameters, we dynamically weigh the task losses so that all of them are kept at a comparable scale. Further, we find the above gradient balance and loss balance are complementary and thus propose a hybrid balance method to further improve the performance. Our impartial multi-task learning (IMTL) can be end-to-end trained without any heuristic hyper-parameter tuning, and is general enough to be applied to all kinds of losses without any distribution assumption. Moreover, our IMTL can converge to similar results even when the task losses are designed to have different scales, and thus it is scale-invariant. We extensively evaluate our IMTL on the standard MTL benchmarks including Cityscapes, NYUv2 and CelebA. It outperforms existing loss weighting methods under the same experimental settings.
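A sketch in the spirit of the loss-balance component (learnable per-task scaling with a regularizer that discourages trivial down-weighting; this exact form and the task names are assumptions, and the closed-form gradient balancing is omitted):

```python
# Illustrative dynamic loss balancing: each task loss is rescaled by a learnable
# factor exp(s_t); the -s_t term keeps the factors from collapsing to zero.
import torch
import torch.nn as nn

class LossBalance(nn.Module):
    def __init__(self, task_names):
        super().__init__()
        self.s = nn.ParameterDict({t: nn.Parameter(torch.zeros(())) for t in task_names})

    def forward(self, losses: dict) -> torch.Tensor:
        return sum(torch.exp(self.s[t]) * losses[t] - self.s[t] for t in losses)

balance = LossBalance(["seg", "depth", "normals"])
total = balance({"seg": torch.tensor(1.2), "depth": torch.tensor(0.03),
                 "normals": torch.tensor(0.4)})
```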
Article
Deep learning approaches have achieved great success in the field of Natural Language Processing (NLP). However, directly training deep neural models often suffers from overfitting and data scarcity problems that are pervasive in NLP tasks. In recent years, Multi-Task Learning (MTL), which can leverage useful information from related tasks to achieve simultaneous performance improvement on these tasks, has been used to handle these problems. In this paper, we give an overview of the use of MTL in NLP tasks. We first review MTL architectures used in NLP tasks and categorize them into four classes, including parallel architecture, hierarchical architecture, modular architecture, and generative adversarial architecture. Then we present optimization techniques on loss construction, gradient regularization, data sampling, and task scheduling to properly train a multi-task model. After presenting applications of MTL in a variety of NLP tasks, we introduce some benchmark datasets. Finally, we conclude and discuss several possible research directions in this field.
Article
Convolutional neural networks (CNNs) and Transformers have their own advantages and both have been widely used for dense prediction in multi-task learning (MTL). Most of the current studies on MTL solely rely on CNNs or Transformers. In this work, we present a novel MTL model that combines the merits of both deformable CNNs and query-based Transformers for multi-task learning of dense prediction. Our method, named DeMT, is based on a simple and effective encoder-decoder architecture (i.e., deformable mixer encoder and task-aware transformer decoder). First, the deformable mixer encoder contains two types of operators: the channel-aware mixing operator leveraged to allow communication among different channels (i.e., efficient channel location mixing), and the spatial-aware deformable operator with deformable convolution applied to efficiently sample more informative spatial locations (i.e., deformed features). Second, the task-aware transformer decoder consists of the task interaction block and task query block. The former is applied to capture task interaction features via self-attention. The latter leverages the deformed features and task-interacted features to generate the corresponding task-specific feature through a query-based Transformer for corresponding task predictions. Extensive experiments on two dense image prediction datasets, NYUD-v2 and PASCAL-Context, demonstrate that our model uses fewer GFLOPs and significantly outperforms current Transformer- and CNN-based competitive models on a variety of metrics. The code is available at https://github.com/yangyangxu0/DeMT.
Chapter
Papers from the 2006 flagship meeting on neural computation, with contributions from physicists, neuroscientists, mathematicians, statisticians, and computer scientists. The annual Neural Information Processing Systems (NIPS) conference is the flagship meeting on neural computation and machine learning. It draws a diverse group of attendees—physicists, neuroscientists, mathematicians, statisticians, and computer scientists—interested in theoretical and applied aspects of modeling, simulating, and building neural-like or intelligent systems. The presentations are interdisciplinary, with contributions in algorithms, learning theory, cognitive science, neuroscience, brain imaging, vision, speech and signal processing, reinforcement learning, and applications. Only twenty-five percent of the papers submitted are accepted for presentation at NIPS, so the quality is exceptionally high. This volume contains the papers presented at the December 2006 meeting, held in Vancouver. The volume is published under the Bradford Books imprint.
Chapter
Self-supervised pre-training appears to be an advantageous alternative to supervised pre-training for transfer learning. By synthesizing annotations on pretext tasks, self-supervision allows pre-training models on large amounts of pseudo-labels before fine-tuning them on the target task. In this work, we assess self-supervision for diagnosing skin lesions, comparing three self-supervised pipelines to a challenging supervised baseline, on five test datasets comprising in- and out-of-distribution samples. Our results show that self-supervision is competitive both in improving accuracies and in reducing the variability of outcomes. Self-supervision proves particularly useful for low training data scenarios (<1500 and <150 samples), where its ability to stabilize the outcomes is essential to provide sound results. Keywords: Self-supervision, Out-of-distribution, Skin lesions, Melanoma, Classification, Small datasets
Article
Recent years have witnessed several advances in developing multi-goal conversational recommender systems (MG-CRS) that can proactively attract users' interests and naturally lead user-engaged dialogues with multiple conversational goals and diverse topics. Four tasks are often involved in MG-CRS, including Goal Planning, Topic Prediction, Item Recommendation, and Response Generation. Most existing studies address only some of these tasks. To handle the whole problem of MG-CRS, modularized frameworks are adopted where each task is tackled independently without considering their interdependencies. In this work, we propose a novel Unified MultI-goal conversational recommeNDer system, namely UniMIND. Specifically, we unify these four tasks with different formulations into the same sequence-to-sequence (Seq2Seq) paradigm. Prompt-based learning strategies are investigated to endow the unified model with the capability of multi-task learning. Finally, the overall learning and inference procedure consists of three stages, including multi-task learning, prompt-based tuning, and inference. Experimental results on two MG-CRS benchmarks (DuRecDial and TG-ReDial) show that UniMIND achieves state-of-the-art performance on all tasks with a unified model. Extensive analyses and discussions are provided, shedding new perspectives on MG-CRS.
Chapter
Multi-task dense scene understanding is a thriving research domain that requires simultaneous perception and reasoning on a series of correlated tasks with pixel-wise prediction. Most existing works are limited to local modeling due to the heavy use of convolution operations, while learning interactions and inference in a global spatial-position and multi-task context is critical for this problem. In this paper, we propose a novel end-to-end Inverted Pyramid multi-task Transformer (InvPT) to perform simultaneous modeling of spatial positions and multiple tasks in a unified framework. To the best of our knowledge, this is the first work that explores designing a transformer structure for multi-task dense prediction for scene understanding. Besides, it is widely demonstrated that a higher spatial resolution is remarkably beneficial for dense predictions, while it is very challenging for existing transformers to go deeper with higher resolutions due to the huge complexity of large spatial sizes. InvPT presents an efficient UP-Transformer block to learn multi-task feature interaction at gradually increased resolutions, which also incorporates effective self-attention message passing and multi-scale feature aggregation to produce task-specific predictions at high resolution. Our method achieves superior multi-task performance on the NYUD-v2 and PASCAL-Context datasets and significantly outperforms the previous state of the art. The code is available at https://github.com/prismformore/InvPT.
Chapter
In this paper, we explore the advantages of utilizing transformer structures for addressing multi-task learning (MTL). Specifically, we demonstrate that models with transformer structures are more appropriate for MTL than convolutional neural networks (CNNs), and we propose a novel transformer-based architecture named MTFormer for MTL. In the framework, multiple tasks share the same transformer encoder and transformer decoder, and lightweight branches are introduced to harvest task-specific outputs, which increases the MTL performance and reduces the time-space complexity. Furthermore, information from different task domains can benefit each other, and we conduct cross-task reasoning. We propose a cross-task attention mechanism for further boosting the MTL results. The cross-task attention mechanism adds few parameters and little computation while introducing extra performance improvements. Besides, we design a self-supervised cross-task contrastive learning algorithm for further boosting the MTL performance. Extensive experiments are conducted on two multi-task learning datasets, on which MTFormer achieves state-of-the-art results with limited network parameters and computations. It also demonstrates significant advantages for few-shot learning and zero-shot learning.
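The cross-task attention idea can be sketched with a generic cross-attention layer (an illustrative stand-in, not MTFormer's exact mechanism; the token shapes and dimensions are assumptions):

```python
# Illustrative cross-task attention: features of one task act as queries and attend
# to the features of another task, letting information flow between task branches.
import torch
import torch.nn as nn

class CrossTaskAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats_task_a, feats_task_b):
        # Task A queries task B's tokens; the result refines task A's representation.
        refined, _ = self.attn(query=feats_task_a, key=feats_task_b, value=feats_task_b)
        return feats_task_a + refined

xattn = CrossTaskAttention()
out = xattn(torch.randn(2, 196, 256), torch.randn(2, 196, 256))
```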
Chapter
In this paper, we explore a novel and ambitious knowledge-transfer task, termed Knowledge Factorization (KF). The core idea of KF lies in the modularization and assemblability of knowledge: given a pretrained network model as input, KF aims to decompose it into several factor networks, each of which handles only a dedicated task and maintains task-specific knowledge factorized from the source network. Such factor networks are task-wise disentangled and can be directly assembled, without any fine-tuning, to produce the more competent combined-task networks. In other words, the factor networks serve as Lego-brick-like building blocks, allowing us to construct customized networks in a plug-and-play manner. Specifically, each factor network comprises two modules: a common-knowledge module that is task-agnostic and shared by all factor networks, along with a task-specific module dedicated to the factor network itself. We introduce an information-theoretic objective, InfoMax-Bottleneck (IMB), to carry out KF by optimizing the mutual information between the learned representations and input. Experiments across various benchmarks demonstrate that the derived factor networks yield gratifying performance not only on the dedicated tasks but also on disentanglement, while enjoying much better interpretability and modularity. Moreover, the learned common-knowledge representations give rise to impressive results on transfer learning. Our code is available at https://github.com/Adamdad/KnowledgeFactor. Keywords: Transfer learning, Knowledge factorization
Chapter
We propose a pre-training strategy called Multi-modal Multi-task Masked Autoencoders (MultiMAE). It differs from standard Masked Autoencoding in two key aspects: I) it can optionally accept additional modalities of information in the input besides the RGB image (hence “multi-modal”), and II) its training objective accordingly includes predicting multiple outputs besides the RGB image (hence “multi-task”). We make use of masking (across image patches and input modalities) to make training MultiMAE tractable as well as to ensure cross-modality predictive coding is indeed learned by the network. We show this pre-training strategy leads to a flexible, simple, and efficient framework with improved transfer results to downstream tasks. In particular, the same exact pre-trained network can be flexibly used when additional information besides RGB images is available or when no information other than RGB is available, in all configurations yielding results competitive with or significantly better than the baselines. To avoid needing training datasets with multiple modalities and tasks, we train MultiMAE entirely using pseudo labeling, which makes the framework widely applicable to any RGB dataset. The experiments are performed on multiple transfer tasks (image classification, semantic segmentation, depth estimation) and datasets (ImageNet, ADE20K, Taskonomy, Hypersim, NYUv2). The results show an intriguingly impressive capability by the model in cross-modal/task predictive coding and transfer. Code, pre-trained models, and interactive visualizations are available at https://multimae.epfl.ch.
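The masking across image patches and input modalities can be illustrated with a simple token-sampling sketch (the uniform sampling strategy, patch count, and visible-token budget are assumptions, not MultiMAE's exact policy):

```python
# Illustrative masking across patches and modalities: a shared budget of visible
# tokens is drawn uniformly over all (modality, patch) pairs; the rest is masked
# and must be reconstructed by the decoder.
import torch

def sample_visible_tokens(num_patches: int, modalities, num_visible: int):
    """Return, per modality, the indices of patches kept visible to the encoder."""
    all_tokens = [(m, i) for m in modalities for i in range(num_patches)]
    perm = torch.randperm(len(all_tokens))[:num_visible]
    visible = {m: [] for m in modalities}
    for idx in perm.tolist():
        m, i = all_tokens[idx]
        visible[m].append(i)
    return visible

keep = sample_visible_tokens(num_patches=196, modalities=["rgb", "depth", "semseg"],
                             num_visible=98)   # keep 1/6 of all tokens in this toy setup
```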
Conference Paper
Tasks in multi-task learning often correlate, conflict, or even compete with each other. As a result, a single solution that is optimal for all tasks rarely exists. Recent papers introduced the concept of Pareto optimality to this field and directly cast multi-task learning as multi-objective optimization problems, but solutions returned by existing methods are typically finite, sparse, and discrete. We present a novel, efficient method that generates locally continuous Pareto sets and Pareto fronts, which opens up the possibility of continuous analysis of Pareto optimal solutions in machine learning problems. We scale up theoretical results in multi-objective optimization to modern machine learning problems by proposing a sample-based sparse linear system, for which standard Hessian-free solvers in machine learning can be applied. We compare our method to state-of-the-art algorithms and demonstrate its use in analyzing local Pareto sets on various multi-task classification and regression problems. The experimental results confirm that our algorithm reveals the primary directions in local Pareto sets for trade-off balancing, finds more solutions with different trade-offs efficiently, and scales well to tasks with millions of parameters.