
Bin ChenTsinghua University | TH · Department of Computer Science and Technology
Bin Chen
Doctor of Philosophy
About
122
Publications
3,245
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
1,029
Citations
Introduction
Publications
Publications (122)
Diffusion models (DMs) have recently demonstrated remarkable generation capability. However, their training generally requires huge computational resources and large-scale datasets. To solve these, recent studies empower DMs with the advanced Retrieval-Augmented Generation (RAG) technique and propose retrieval-augmented diffusion models (RDMs). By...
Self-supervised video hashing (SSVH) is a practical task in video indexing and retrieval. Although Transformers are predominant in SSVH for their impressive temporal modeling capabilities, they often suffer from computational and memory inefficiencies. Drawing inspiration from Mamba, an advanced state-space model, we explore its potential in SSVH t...
Dataset distillation offers an efficient way to reduce memory and computational costs by optimizing a smaller dataset with performance comparable to the full-scale original. However, for large datasets and complex deep networks (e.g., ImageNet-1K with ResNet-101), the extensive optimization space limits performance, reducing its practicality. Recen...
Dataset distillation (DD) aims to minimize the time and memory consumption needed for training deep neural networks on large datasets, by creating a smaller synthetic dataset that has similar performance to that of the full real dataset. However, current dataset distillation methods often result in synthetic datasets that are excessively difficult...
3D Gaussian Splatting (3DGS) has emerged as a pivotal technique for 3D scene representation, providing rapid rendering speeds and high fidelity. As 3DGS gains prominence, safeguarding its intellectual property becomes increasingly crucial since 3DGS could be used to imitate unauthorized scene creations and raise copyright issues. Existing watermark...
Real-time computer vision (CV) plays a crucial role in various real-world applications, whose performance is highly dependent on communication networks. Nonetheless, the data-oriented characteristics of conventional communications often do not align with the special needs of real-time CV tasks. To alleviate this issue, the recently emerged semantic...
Adaptation of pretrained vision-language models such as CLIP to various downstream tasks have raised great interest in recent researches. Previous works have proposed a variety of test-time adaptation (TTA) methods to achieve strong generalization without any knowledge of the target domain. However, existing training-required TTA approaches like TP...
Point clouds, as a primary representation of 3D data, can be categorized into scene domain point clouds and object domain point clouds based on the modeled content. Masked autoencoders (MAE) have become the mainstream paradigm in point clouds self-supervised learning. However, existing MAE-based methods are domain-specific, limiting the model's gen...
With recent advances in large language models (LLMs), there has been emerging numbers of research in developing Semantic IDs based on LLMs to enhance the performance of recommendation systems. However, the dimension of these embeddings needs to match that of the ID embedding in recommendation, which is usually much smaller than the original length....
Model Inversion Attacks (MIAs) aim at recovering privacy-sensitive training data from the knowledge encoded in the released machine learning models. Recent advances in the MIA field have significantly enhanced the attack performance under multiple scenarios, posing serious privacy risks of Deep Neural Networks (DNNs). However, the development of de...
Recent advances in diffusion-based Large Restoration Models (LRMs) have significantly improved photo-realistic image restoration by leveraging the internal knowledge embedded within model weights. However, existing LRMs often suffer from the hallucination dilemma, i.e., producing incorrect contents or textures when dealing with severe degradations,...
An
(n, k, d; r)<sub>q</sub>
-locally repairable code (LRC) is called a Singleton-optimal LRC if it achieves the Singleton-type bound. Analogous to the classical MDS conjecture, the maximal length problem of Singleton-optimal LRCs has attracted a lot of attention in recent years. In this paper, we give an improved upper bound for the length of
q...
Multi-view image compression is vital for 3D-related applications. To effectively model correlations between views, existing methods typically predict disparity between two views on a 2D plane, which works well for small disparities, such as in stereo images, but struggles with larger disparities caused by significant view changes. To address this,...
Compressive Learning (CL) has proven to be highly successful in executing joint signal sampling and inference for intricate vision tasks through resource-limited Internet of Things (IoT) devices. Recent studies have turned their attention towards utilizing the deep neural networks (DNNs) methodology, also known as DeepCL, to enhance performance in...
Traditional QR codes consist of a grid of black-and-white square modules, which lack aesthetic appeal and meaning for human perception. This has motivated recent research to beautify the visual appearance of QR codes. However, there exists a trade-off between the visual quality and scanning-robustness of the image, causing outputs of previous works...
Invertible Rescaling Networks (IRNs) and their variants have witnessed remarkable achievements in various image processing tasks like image rescaling. However, we observe that IRNs with deeper networks are difficult to train, thus hindering the representational ability of IRNs. To address this issue, we propose Invertible Residual Rescaling Models...
Model Inversion (MI) attacks aim to reconstruct privacy-sensitive training data from released models by utilizing output information, raising extensive concerns about the security of Deep Neural Networks (DNNs). Recent advances in generative adversarial networks (GANs) have contributed significantly to the improved performance of MI attacks due to...
Transferable targeted adversarial attacks aim to mislead models into outputting adversary-specified predictions in black-box scenarios. Recent studies have introduced \textit{single-target} generative attacks that train a generator for each target class to generate highly transferable perturbations, resulting in substantial computational overhead w...
Locally repairable codes (LRCs) are linear codes with locality properties for code symbols, which have important applications in distributed storage systems. In this paper, we completely classify all the possible code parameters of optimal ternary LRCs achieving the Singleton-like bound proposed by Gopalan et al. Explicit constructions of optimal t...
Dataset distillation is an emerging dataset reduction method, which condenses large-scale datasets while maintaining task accuracy. Current methods have integrated parameterization techniques to boost synthetic dataset performance by shifting the optimization space from pixel to another informative feature domain. However, they limit themselves to...
Hao Fang Jiawei Kong Wenbo Yu- [...]
Ke Xu
Vision-Language Pre-training (VLP) models trained on large-scale image-text pairs have demonstrated unprecedented capability in many practical applications. However, previous studies have revealed that VLP models are vulnerable to adversarial samples crafted by a malicious adversary. While existing attacks have achieved great success in improving a...
Gradient Inversion Attacks invert the transmitted gradients in Federated Learning (FL) systems to reconstruct the sensitive data of local clients and have raised considerable privacy concerns. A majority of gradient inversion methods rely heavily on explicit prior knowledge (e.g., a well pre-trained generative model), which is often unavailable in...
The pre-trained point cloud model based on Masked Point Modeling (MPM) has exhibited substantial improvements across various tasks. However, these models heavily rely on the Transformer, leading to quadratic complexity and limited decoder, hindering their practice application. To address this limitation, we first conduct a comprehensive analysis of...
Learned visual compression is an important and active task in multimedia. Existing approaches have explored various CNN- and Transformer-based designs to model content distribution and eliminate redundancy, where balancing efficacy (i.e., rate-distortion trade-off) and efficiency remains a challenge. Recently, state-space models (SSMs) have shown p...
Given a text query, partially relevant video retrieval (PRVR) aims to retrieve untrimmed videos containing relevant moments. Due to the lack of moment annotations, the uncertainty lying in clip modeling and text-clip correspondence leads to major challenges. Despite the great progress, existing solutions either sacrifice efficiency or efficacy to c...
Learning 3D representation plays a critical role in masked autoencoder (MAE) based pre-training methods for point cloud, including single-modal and cross-modal based MAE. Specifically, although cross-modal MAE methods learn strong 3D representations via the auxiliary of other modal knowledge, they often suffer from heavy computational burdens and h...
In recent years, vision language pre-training frameworks have made significant progress in natural language processing and computer vision, achieving remarkable performance improvement on various downstream tasks. However, when extended to point cloud data, existing works mainly focus on building task-specific models, and fail to extract universal...
Given a text query, partially relevant video retrieval (PRVR) seeks to find untrimmed videos containing pertinent moments in a database. For PRVR, clip modeling is essential to capture the partial relationship between texts and videos. Current PRVR methods adopt scanning-based clip construction to achieve explicit clip modeling, which is informatio...
MDS array codes have been extensively studied due to their applications in storage systems. In this paper, we first propose a novel method of constructing MDS array codes by deleting one row and one column from the circulant matrices associated to some polynomials. Several new classes of MDS array codes with flexible parameters are constructed. In...
Unsupervised cross-modal hashing (UCMH) has been commonly explored to support large-scale cross-modal retrieval of unlabeled data. Despite promising progress, most existing approaches are developed on convolutional neural network and multilayer perceptron architectures, sacrificing the quality of hash codes due to limited capacity for excavating mu...
Federated Learning (FL) has recently emerged as a promising distributed machine learning framework to preserve clients' privacy, by allowing multiple clients to upload the gradients calculated from their local data to a central server. Recent studies find that the exchanged gradients also take the risk of privacy leakage, e.g., an attacker can inve...
Recognizing characters from low-resolution (LR) text images poses a significant challenge due to the information deficiency as well as the noise and blur in low-quality images. Current solutions for low-resolution text recognition (LTR) typically rely on a two-stage pipeline that involves super-resolution as the first stage followed by the second-s...
Beyond achieving higher compression efficiency over classical image compression codecs, deep image compression is expected to be improved with additional side information, e.g., another image from a different perspective of the same scene. To better utilize the side information under the distributed compression scenario, the existing method only im...
Self-Supervised Video Hashing (SSVH) models learn to generate short binary representations for videos without ground-truth supervision, facilitating large-scale video retrieval efficiency and attracting increasing research attention. The success of SSVH lies in the understanding of video content and the ability to capture the semantic relation amon...
Deep neural networks (DNNs) have witnessed remarkable achievement in image super-resolution (SR), and plenty of DNN-based SR models with elaborated network designs have recently been proposed. However, existing methods usually require substantial computations by operating in spatial domain. To address this issue, we propose a general frequency-orie...
In recent years, vision language pre-training frameworks have made significant progress in natural language processing and computer vision, achieving remarkable performance improvement on various downstream tasks. However, when extended to point cloud data, existing works mainly focus on building task-specific models, and fail to extract universal...
Recently, pre-trained point cloud models have found extensive applications in downstream tasks like object classification. However, these tasks often require {full fine-tuning} of models and lead to storage-intensive procedures, thus limiting the real applications of pre-trained models. Inspired by the great success of visual prompt tuning (VPT) in...
Unsupervised anomaly detection (UAD) has been widely implemented in industrial and medical applications, which reduces the cost of manual annotation and improves efficiency in disease diagnosis. Recently, deep auto-encoder with its variants has demonstrated its advantages in many UAD scenarios. Training on the normal data, these models are expected...
Locally repairable codes (LRCs) have emerged as an important coding scheme in distributed storage systems (DSSs) with relatively low repair cost by accessing fewer non-failure nodes. Theoretical bounds and optimal constructions of LRCs have been widely investigated. Optimal LRCs via cyclic and constacyclic codes provide significant benefit of elega...
Self-Supervised Video Hashing (SSVH) models learn to generate short binary representations for videos without ground-truth supervision, facilitating large-scale video retrieval efficiency and attracting increasing research attention. The success of SSVH lies in the understanding of video content and the ability to capture the semantic relation amon...
A linear code is called a locally repairable code (LRC) with locality r if one can recover an erased code symbol by accessing at most r other code symbols. Constructions of LRCs have been widely investigated in recent years. In this paper, we give a step forward in this direction. Firstly, we propose a novel concept of perfect LRCs whose size exact...
Pre-training on large-scale video data has become a common recipe for learning transferable spatiotemporal representations in recent years. Despite some progress, existing methods are mostly limited to highly curated datasets (e.g., K400) and exhibit unsatisfactory out-of-the-box representations. We argue that it is due to the fact that they only c...
Beyond achieving higher compression efficiency over classical image compression codecs, deep image compression is expected to be improved with additional side information, e.g., another image from a different perspective of the same scene. To better utilize the side information under the distributed compression scenario, the existing method (Ayzik...
Recent trends in cloud computing technology effectively boosted the application of visual inspection. However, most of the available systems work in a human-in-the-loop manner and can not provide long-term support to the online application. To make a step forward, this paper outlines an automatic annotation system called SsaA, working in a self-sup...
Constructions of optimal locally repairable codes (LRCs) achieving Singleton-type bound have been exhaustively investigated in recent years. In this paper, we consider new bounds and constructions of Singleton-optimal LRCs with minmum distance $d=6$, locality $r=3$ and minimum distance $d=7$ and locality $r=2$, respectively. Firstly, we establish e...
The high efficiency in computation and storage makes hashing (including binary hashing and quantization) a common strategy in large-scale retrieval systems. To alleviate the reliance on expensive annotations, unsupervised deep hashing becomes an important research problem. This paper provides a novel solution to unsupervised deep quantization, name...
While online video sharing becomes more popular, it also causes unconscious leakage of personal information in the video retrieval systems like deep hashing. A snoop can collect more users’ private information from the video database by querying similar videos. This paper focuses on bypassing the deep video hashing based retrieval to prevent inform...
Deep product quantization networks (DPQNs) have been successfully used in image retrieval tasks, due to their powerful feature extraction ability and high efficiency of encoding high-dimensional visual features. Recent studies show that deep neural networks (DNNs) are vulnerable to input with small and maliciously designed perturbations (a.k.a., ad...
Locally repairable codes (LRCs) with $(r,\delta)$ locality were introduced by Prakash \emph{et al.} into distributed storage systems (DSSs) due to their benefit of locally repairing at least $\delta-1$ erasures via other $r$ survival nodes among the same local group. An LRC achieving the $(r,\delta)$ Singleton-type bound is called an optimal $(r,\d...
With the recent boom of video-based social platforms (e.g., YouTube and TikTok), video retrieval using sentence queries has become an important demand and attracts increasing research attention. Despite the decent performance, existing text-video retrieval models in vision and language communities are impractical for large-scale Web search because...
Deep hashing has become a popular method in large-scale image retrieval due to its computational and storage efficiency. However, recent works raise the security concerns of deep hashing. Although existing works focus on the vulnerability of deep hashing in terms of adversarial perturbations, we identify a more pressing threat, backdoor attack, whe...
The high efficiency in computation and storage makes hashing (including binary hashing and quantization) a common strategy in large-scale retrieval systems. To alleviate the reliance on expensive annotations, unsupervised deep hashing becomes an important research problem. This paper provides a novel solution to unsupervised deep quantization, name...
Deep hashing approaches, including deep quantization and deep binary hashing, have become a common solution to large-scale image retrieval due to high computation and storage efficiency. Most existing hashing methods can not produce satisfactory results for fine-grained retrieval, because they usually adopt the outputs of the last CNN layer to gene...
Recently, deep neural networks (DNNs) have shown serious vulnerability to adversarial examples with imperceptible perturbation to clean images. To counter this issue, many powerful defensive methods (e.g., ComDefend) focus on rectifying the adversarial examples with well-trained models from a large training dataset (e.g., clean-adversarial image pa...
Knowledge distillation (KD) has been one of the most popular techniques for model compression and acceleration, where a compact student model can be trained under the guidance of a large-capacity teacher model. The key of known KD methods is to explore multiple types of knowledge to direct the training of the student to mimic the teacher’s behaviou...