Qi Tian

Huazhong University of Science and Technology (HUST) · School of Optical and Electronic Information

About

1,011 Publications · 172,699 Reads
64,596 Citations

Publications (1,011)
Article
We propose the integrally pre-trained transformer pyramid network (iTPN), toward jointly optimizing the network backbone and the neck, so that the transfer gap between representation models and downstream tasks is minimal. iTPN is born with two elaborated designs: 1) the first pre-trained feature pyramid upon vision transformer (ViT); 2) multi-stage super...
Article
Reconstructing and rendering 3D objects from highly sparse views is of critical importance for promoting applications of 3D vision techniques and improving user experience. However, images from sparse views only contain very limited 3D information, leading to two significant challenges: 1) Difficulty in building multi-view consistency as images for...
Article
Integrated optics holds great potential to accelerate deep learning tasks with high clock rates, parallelism and low-loss data transmission. Silicon photonic integrated circuits can perform large-scale, low-power optical linear operations by using a weighting mechanism based on linear optics. However, on-chip light attenuation and nonlinea...
Article
Full-text available
Image segmentation plays a critical role in autonomous driving by providing vehicles with a detailed and accurate understanding of their surroundings. Transformers have recently shown encouraging results in image segmentation. However, it is challenging for transformer-based models to strike a balance between performance and efficiency. The comput...
Preprint
Mixture-of-Experts (MoE) models embody the divide-and-conquer concept and are a promising approach for increasing model capacity, demonstrating excellent scalability across multiple domains. In this paper, we integrate the MoE structure into the classic Vision Transformer (ViT), naming it ViMoE, and explore the potential of applying MoE to vision t...
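The core MoE mechanism this abstract refers to, routing each token to the expert chosen by a learned gate, can be illustrated in a few lines of plain Python. This is a minimal sketch of top-1 gating in general; the gate weights and toy experts below are illustrative and not taken from the paper:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of gate scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_layer(token, experts, gate_weights):
    # Gate: one linear score per expert, normalized with softmax.
    scores = softmax([sum(w, ) if False else sum(w * t for w, t in zip(ws, token)) for ws in gate_weights])
    # Top-1 routing: only the highest-scoring expert processes the token.
    best = max(range(len(scores)), key=lambda i: scores[i])
    # Scale the chosen expert's output by its gate probability.
    return [scores[best] * y for y in experts[best](token)]

# Two toy "experts": one doubles the features, one negates them.
experts = [lambda t: [2 * x for x in t], lambda t: [-x for x in t]]
gate = [[1.0, 0.0], [0.0, 1.0]]  # expert 0 scores dim 0, expert 1 scores dim 1
out = moe_layer([3.0, 1.0], experts, gate)  # dim 0 dominates, so expert 0 fires
```

Because only the selected expert runs per token, capacity grows with the number of experts while per-token compute stays roughly constant, which is the scalability argument made above.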
Preprint
Full-text available
The Segment Anything model (SAM) has shown a generalized ability to group image pixels into patches, but applying it to semantic-aware segmentation still faces major challenges. This paper presents SAM-CP, a simple approach that establishes two types of composable prompts beyond SAM and composes them for versatile segmentation. Specifically, given...
Article
Active learning (AL) aims to design label-efficient algorithms by labeling the most representative samples. It reduces annotation cost and has attracted increasing attention from the community. However, previous AL methods suffer from inadequate annotations and unreliable uncertainty estimation. Moreover, we find that they ignore the intra-diversit...
Article
Conventional neural architecture search (NAS) algorithms typically work on search spaces with short-distance node connections. We argue that such designs, though safe and stable, are obstacles to exploring more effective network architectures. In this brief, we explore the search algorithm upon a complicated search space with long-distance connecti...
Preprint
Modeling, understanding, and reconstructing the real world are crucial in XR/VR. Recently, 3D Gaussian Splatting (3D-GS) methods have shown remarkable success in modeling and understanding 3D scenes. Similarly, various 4D representations have demonstrated the ability to capture the dynamics of the 4D world. However, there is a dearth of research fo...
Preprint
Recently, 3D Gaussian splatting (3D-GS) has achieved great success in reconstructing and rendering real-world scenes. To transfer the high rendering quality to generation tasks, a series of research works attempt to generate 3D-Gaussian assets from text. However, the generated assets have not achieved the same quality as those in reconstruction tas...
Preprint
Full-text available
Video generation is a challenging yet pivotal task in various industries, such as gaming, e-commerce, and advertising. One significant unresolved aspect within text-to-video (T2V) generation is the effective visualization of text within generated videos. Despite the progress achieved in T2V generation, current methods still cannot effectively visualize texts...
Preprint
The challenge of open-vocabulary recognition lies in the fact that the model has no clue about the new categories it is applied to. Existing works have proposed different methods to embed category cues into the model, e.g., through few-shot fine-tuning, or by providing category names or textual descriptions to vision-language models. Fine-tuning is time-consuming and degrades...
Preprint
Full-text available
Open-vocabulary semantic segmentation seeks to label each pixel in an image with arbitrary text descriptions. Vision-language foundation models, especially CLIP, have recently emerged as powerful tools for acquiring open-vocabulary capabilities. However, fine-tuning CLIP to equip it with pixel-level prediction ability often suffers from three issues: 1)...
Preprint
Full-text available
This paper focuses on the integration of generative techniques into spatial-temporal data mining, considering the significant growth and diverse nature of spatial-temporal data. With the advancements in RNNs, CNNs, and other non-generative techniques, researchers have explored their application in capturing temporal and spatial dependencies within...
Article
Full-text available
In computer vision, an important challenge to deep neural networks comes from adjusting the varying properties of different image domains. To study this problem, researchers have been investigating a practical setting in which the deep neural networks are trained on a labeled source domain and then transferred to an unlabeled or even unseen target...
Article
Fine-tuning visual models has widely been shown to achieve promising performance on many downstream visual tasks. With the rapid development of pre-trained visual foundation models, visual tuning has jumped out of the standard modus operandi of fine-tuning the whole pre-trained model or just the fully connected layer. Instead, recent advances can achieve su...
Article
Full-text available
Despite recent promising performances across a range of vision tasks, vision Transformers still have an issue of high computational costs. Recently, vision prompt learning has provided an economical solution to this problem without fine-tuning the whole large-scale model. However, the efficiency and effectiveness of existing models are still far fr...
Chapter
Incremental few-shot semantic segmentation aims to extend a semantic segmentation model to novel classes according to only a few labeled data, while preserving its segmentation capability on learned base classes. However, semantic aliasing between base and novel classes severely limits the quality of segmentation results. To alleviate this issue, w...
Article
Fine-grained image retrieval mainly focuses on learning salient features from the seen subcategories as discriminative embedding while neglecting the problems behind zero-shot settings. We argue that retrieving fine-grained objects from unseen subcategories may rely on more diverse clues, which are easily restrained by the salient features learnt f...
Article
Demoiréing is the task of removing moiré patterns, which are commonly caused by the interference between the screen and digital cameras. Although research on single image demoiréing has made great progress, research on video demoiréing has received less attention from the community. Video demoiréing poses a new set of challenges. First, most existi...
Article
In the ever-evolving domain of multimedia, the significance of multi-modality understanding cannot be overstated. As multimedia content becomes increasingly sophisticated and ubiquitous, the ability to effectively combine and analyze the diverse information from different types of data, such as text, audio, image, video and point clouds, will be pa...
Conference Paper
We propose a novel Fabry–Pérot laser diode with temperature-insensitive laser spectral envelope. By introducing a temperature-insensitive optical filter structure into the laser diode resonator, the laser spectral envelope shows substantially reduced temperature sensitivity.
Article
The current state-of-the-art text-to-image (T2I) models have found numerous applications, driven by their ability to produce photorealistic images. Concept learning, as one notable application, aims to enable T2I models to generate personalized content and better enable users to create images according to their interests. Nevertheless, the process...
Article
There are two mainstream approaches for object detection: top-down and bottom-up. The state-of-the-art approaches are mainly top-down methods. In this paper, we demonstrate that bottom-up approaches show competitive performance compared with top-down approaches and have higher recall rates. Our approach, named CenterNet, detects each object as a tr...
Article
The electro-absorption modulated laser (EML) chip-on-carrier (COC) submodule has a complex equivalent circuit model, making it challenging to characterize the EML parameters directly. An efficient theoretical model is presented which closely approximates the actual situation by combining the conventional EML COC equivalent circuit model with the RF...
Article
Multilabel image recognition (MLR) aims to annotate an image with comprehensive labels and suffers from object occlusion or small object sizes within images. Although the existing works attempt to capture and exploit label correlations to tackle these issues, they predominantly rely on global statistical label correlations as prior knowledge for gu...
Article
This paper presents one-bit supervision, a novel setting of learning with fewer labels, for image classification. Instead of training the model using the accurate label of each sample, our setting requires the model to interact with the system by predicting the class label of each sample and learning from the answer whether the guess is correct, which pro...
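The interaction described above, where each query reveals only whether the guessed label is correct, can be sketched as a toy simulation. The function and oracle names here are assumptions for illustration; the actual method builds a staged training procedure around this supervision signal:

```python
def one_bit_supervision(predict, is_correct, samples):
    """Each query spends one bit: was the model's guess right?

    A correct guess recovers the full label; a wrong guess only
    rules one class out, stored as negative supervision.
    """
    positives, negatives = {}, {}
    for x in samples:
        guess = predict(x)
        if is_correct(x, guess):
            positives[x] = guess                       # full label obtained
        else:
            negatives.setdefault(x, set()).add(guess)  # one class eliminated
    return positives, negatives

# Toy oracle over three samples with hidden labels 0, 1, 2,
# queried by a naive model that always guesses class 0.
true = {"a": 0, "b": 1, "c": 2}
pos, neg = one_bit_supervision(lambda x: 0, lambda x, g: true[x] == g,
                               ["a", "b", "c"])
```

A correct guess is worth a full annotation at the cost of one bit, which is where the label efficiency claimed in the abstract comes from.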
Conference Paper
Full-text available
Driven by the progress of large-scale pre-training, parameter-efficient transfer learning has gained immense popularity across different subfields of Artificial Intelligence. The core idea is to adapt the model to downstream tasks with only a small set of parameters. Recently, researchers have leveraged such proven techniques in multimodal tasks and ach...
Article
Existing fine-grained image recognition works have attempted to dig into low-level details for emphasizing subtle discrepancies among sub-categories. However, a potential limitation of these methods is that they integrate the low-level details and high-level semantics directly, and neglect their content complementarity and spatial corresponding c...
Article
As a classical feature compression technique, quantization is usually coupled with inverted indices for scalable image retrieval. Most quantization methods explicitly divide feature space into Voronoi cells, and quantize feature vectors in each cell into the centroids learned from data distribution. However, Voronoi decomposition is difficult to ac...
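The Voronoi view of quantization described above amounts to nearest-centroid assignment. A minimal sketch follows; the centroids here are made up for illustration, whereas real retrieval systems learn them from data (typically with k-means) and pair each cell with an inverted list:

```python
def quantize(vec, centroids):
    # Each centroid implicitly defines a Voronoi cell: the set of points
    # closer to it than to any other centroid. Quantizing a vector means
    # finding that cell and replacing the vector with its centroid.
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    cell = min(range(len(centroids)), key=lambda i: dist2(vec, centroids[i]))
    return cell, centroids[cell]

centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
cell, approx = quantize([0.9, 1.2], centroids)  # nearest centroid is [1.0, 1.0]
```

At query time only the vectors indexed under a few nearby cells need to be scanned, which is why quantization is usually coupled with inverted indices for scalability.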
Article
Exosomal microRNA (miRNA) exerts potential roles in non-small-cell lung cancer (NSCLC). The current study elucidated the role of miR-30b-5p shuttled by bone marrow mesenchymal stem cells (BMSCs)-derived exosomes in treating NSCLC. Bioinformatics analysis was performed with NSCLC-related miRNA microarray GSE169587 and mRNA data GSE74706 obtained for...
Preprint
Over the past decade, deep learning models have exhibited considerable advancements, reaching or even exceeding human-level performance in a range of visual perception tasks. This remarkable progress has sparked interest in applying deep networks to real-world applications, such as autonomous vehicles, mobile devices, robotics, and edge computing....
Article
Full-text available
Fine-grained object recognition (FGOR) aims to learn discriminative features that can identify the subtle distinctions between visually similar objects. However, less effort has been devoted to overcoming the impact of object’s personalized differences, e.g., varying posture or perspective. We argue that the personalized differences could decline t...
Preprint
Full-text available
Transformers have become the primary backbone of the computer vision community due to their impressive performance. However, the unfriendly computation cost impedes their potential in the video recognition domain. To optimize the speed-accuracy trade-off, we propose Semantic-aware Temporal Accumulation score (STA) to prune spatio-temporal tokens in...
Preprint
Owing to the unrestricted nature of the content in the training data, large text-to-image diffusion models, such as Stable Diffusion (SD), are capable of generating images with potentially copyrighted or dangerous content based on the corresponding textual concepts. This includes specific intellectual property (IP), human faces, and various...
Article
Full-text available
Weather forecasting is important for science and society. At present, the most accurate forecast system is the numerical weather prediction (NWP) method, which represents atmospheric states as discretized grids and numerically solves partial differential equations that describe the transition between those states¹. However, this procedure is comput...
Article
Full-text available
An electrically pumped semiconductor laser array operating in the fundamental mode is of vital importance. In general, the competition of multiple transverse modes in a laser array leads to spatiotemporal instability and degradation of beam quality. In this letter, we propose and experimentally demonstrate electrically pumped single transverse-mode s...
Preprint
Representation learning has been evolving from traditional supervised training to Contrastive Learning (CL) and Masked Image Modeling (MIM). Previous works have demonstrated their pros and cons in specific scenarios, i.e., CL and supervised pre-training excel at capturing longer-range global patterns and enabling better feature discrimination, whil...
Article
Full-text available
Low-light video enhancement (LLVE) is an important yet challenging task with many applications such as photographing and autonomous driving. Unlike single image low-light enhancement, most LLVE methods utilize temporal information from adjacent frames to restore the color and remove the noise of the target frame. However, these algorithms, based on...
Article
Fine-grained object retrieval aims to learn discriminative representation to retrieve visually similar objects. However, existing top-performing works usually impose pairwise similarities on the semantic embedding spaces or design a localization sub-network to continually fine-tune the entire model in limited data scenarios, thus resulting in conve...
Article
Text-guided image editing models have shown remarkable results. However, there remain two problems. First, they employ fixed manipulation modules for various editing requirements (e.g., color changing, texture changing, content adding and removing), which results in over-editing or insufficient editing. Second, they do not clearly distinguish betwe...
Article
Offline multi-agent reinforcement learning (MARL) aims to learn effective multi-agent policies from pre-collected datasets, which is an important step toward the deployment of multi-agent systems in real-world applications. However, in practice, each individual behavior policy that generates multi-agent joint trajectories usually has a different le...
Preprint
The AI community has been pursuing algorithms known as artificial general intelligence (AGI) that apply to any kind of real-world problem. Recently, chat systems powered by large language models (LLMs) have emerged and rapidly become a promising direction to achieve AGI in natural language processing (NLP), but the path towards AGI in computer vision (CV...
Article
Style-based GANs achieve state-of-the-art results for generating high-quality images, but lack explicit and precise control over camera poses. Recently proposed NeRF-based GANs have made great progress towards 3D-aware image generation. However, the methods either rely on convolution operators which are not rotationally invariant, or utilize comp...
Preprint
Image compression aims to reduce the information redundancy in images. Most existing neural image compression methods rely on side information from hyperprior or context models to eliminate spatial redundancy, but rarely address the channel redundancy. Inspired by the mask sampling modeling in recent self-supervised learning methods for natural lan...
Preprint
Explainable question answering (XQA) aims to answer a given question and provide an explanation why the answer is selected. Existing XQA methods focus on reasoning on a single knowledge source, e.g., structured knowledge bases, unstructured corpora, etc. However, integrating information from heterogeneous knowledge sources is essential to answer co...
Preprint
Incremental few-shot semantic segmentation (IFSS) aims to incrementally extend a semantic segmentation model to novel classes according to only a few pixel-level annotated data, while preserving its segmentation capability on previously learned base categories. This task faces a severe semantic-aliasing issue between base and novel classes due to d...
Preprint
Full-text available
With the advance of large-scale model technologies, parameter-efficient transfer learning (PETL) has swept across various fields of Artificial Intelligence. Its core idea is to adapt the model to downstream tasks using only a small number of parameters. Recently, some studies have applied these techniques proven effective to multimodal tasks. Howev...
Preprint
This paper discusses the feasibility of continuously training the CLIP model through streaming data. Then, by tracking the directional changes of the representation vectors in the continuously updated CLIP model, we explore and summarize these spatial variations as Spatial Disorder (SD), which can be divided into Intra-modal Rotation and Inter-moda...
Preprint
Full-text available
Fine-tuning visual models has widely been shown to achieve promising performance on many downstream visual tasks. With the rapid development of pre-trained visual foundation models, visual tuning has jumped out of the standard modus operandi of fine-tuning the whole pre-trained model or just the fully connected layer. Instead, recent advances can achieve su...
Article
Few-shot class-incremental learning (FSCIL) faces the challenges of memorizing old class distributions and estimating new class distributions given few training samples. In this study, we propose a learnable distribution calibration (LDC) approach, to systematically solve these two challenges using a unified framework. LDC is built upon a parameter...
Preprint
The Segment Anything Model (SAM) has demonstrated its effectiveness in segmenting any object/part in various 2D images, yet its ability for 3D has not been fully explored. The real world is composed of numerous 3D scenes and objects. Due to the scarcity of accessible 3D data and high cost of its acquisition and annotation, lifting SAM to 3D is a ch...
Preprint
Full-text available
The Mixture of Experts (MoE) model has become an important choice for large language models because of its scalability, with sublinear computational complexity for training and inference. However, existing MoE models suffer from two critical drawbacks: 1) tremendous intra-node and inter-node communication overhead introduced by all-to-all dispa...
Preprint
Full-text available
Recent research on unsupervised person re-identification (reID) has demonstrated that pre-training on unlabeled person images achieves superior performance on downstream reID tasks compared with pre-training on ImageNet. However, those pre-training methods are specifically designed for reID and lack flexible adaptation to other pedestrian analysis tasks...
Preprint
In this paper, we consider the problem of temporal action localization under low-shot (zero-shot and few-shot) scenarios, with the goal of detecting and classifying action instances from arbitrary categories within untrimmed videos, even categories not seen at training time. We adopt a Transformer-based two-stage action localization architecture with cl...
Preprint
Full-text available
Despite recent competitive performance across a range of vision tasks, vision Transformers still have an issue of heavy computational costs. Recently, vision prompt learning has provided an economical solution to this problem without fine-tuning the whole large-scale model. However, the efficiency of existing models is still far from satisfactory d...
Conference Paper
Full-text available
As a typical self-supervised learning strategy, Masked Image Modeling (MIM) is driven by recovering all masked patches from visible ones. However, patches from the same image are highly correlated and it is redundant to reconstruct all the masked patches. We find that this redundancy is neglected by existing MIM based methods and causes non-negligi...
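The masking step that drives MIM, as summarized above, can be sketched in plain Python. The patch representation and mask ratio below are illustrative defaults, not values from the paper:

```python
import random

def mask_patches(patches, mask_ratio=0.75, seed=0):
    """Hide a random subset of patches; an MIM model is then trained to
    reconstruct the masked patches from the visible ones."""
    rng = random.Random(seed)
    n_mask = int(len(patches) * mask_ratio)
    masked = set(rng.sample(range(len(patches)), n_mask))
    visible = [(i, p) for i, p in enumerate(patches) if i not in masked]
    return visible, sorted(masked)

# 8 toy patches; with a 75% ratio, 6 are masked and 2 stay visible.
visible, masked = mask_patches(list(range(8)))
```

Because neighboring patches of one image are highly correlated, reconstructing every masked patch duplicates work, which is the redundancy the abstract argues existing MIM methods overlook.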
Preprint
Prompt tuning, a recently emerging paradigm, enables powerful vision-language pre-training models to adapt to downstream tasks in a parameter- and data-efficient way, by learning "soft prompts" to condition frozen pre-trained models. Though effective, it is particularly problematic in the few-shot scenario, where prompt tuning perfo...
Preprint
In realistic open-set scenarios where labels of a part of testing data are totally unknown, when vision-language (VL) prompt learning methods encounter inputs related to unknown classes (i.e., not seen during training), they always predict them as one of the training classes. The exhibited label bias causes difficulty in open set recognition (OSR),...
Preprint
Prompt learning has achieved great success in efficiently exploiting large-scale pre-trained models in natural language processing (NLP). It reformulates the downstream tasks as the generative pre-training ones to achieve consistency, thus improving the performance stably. However, when transferring it to the vision area, current visual prompt lear...
Chapter
In the past few years, convolutional neural networks (CNNs) have achieved milestones in medical image analysis. In particular, deep neural networks based on U-shaped architectures and skip connections have been widely applied to various medical image tasks. However, although CNNs have achieved excellent performance, they cannot learn global semantic inf...
Chapter
Domain generalization aims to train a robust model on multiple source domains that generalizes well to unseen target domains. While considerable attention has focused on training domain generalizable models, a few recent studies have shifted the attention to test time, i.e., leveraging test samples for better target generalization. To this end, thi...
Article
With convolution operations, Convolutional Neural Networks (CNNs) are good at extracting local features but have difficulty capturing global representations. With cascaded self-attention modules, vision transformers can capture long-distance feature dependencies but unfortunately deteriorate local feature details. In this paper, we propose a...