Bo Li’s research while affiliated with Sichuan University and other places


Publications (197)


M2N2V2: Multi-Modal Unsupervised and Training-free Interactive Segmentation
  • Preprint

March 2025

Markus Karmann · Peng-Tao Jiang · Bo Li

We present Markov Map Nearest Neighbor V2 (M2N2V2), a simple yet effective approach that leverages depth guidance and attention maps for unsupervised, training-free, point-prompt-based interactive segmentation. Following recent trends in supervised multimodal approaches, we carefully integrate depth as an additional modality to create novel depth-guided Markov maps. Furthermore, we observe occasional segment-size fluctuations in M2N2 during the interactive process, which can decrease the overall mIoU. To mitigate this problem, we model prompting as a sequential process and propose a novel adaptive score function that considers the previous segmentation and the current prompt point in order to prevent unreasonable segment-size changes. Using Stable Diffusion 2 and Depth Anything V2 as backbones, we empirically show that M2N2V2 significantly improves the Number of Clicks (NoC) and mIoU compared to M2N2 on all datasets except those from the medical domain. Interestingly, our unsupervised approach achieves NoC results competitive with supervised methods such as SAM and SimpleClick on the more challenging DAVIS and HQSeg44K datasets, reducing the gap between supervised and unsupervised methods.
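
To make the sequential-prompting idea concrete, here is a minimal sketch of what an adaptive score of this kind could look like; the function name, the IoU-based smoothness term, and the size-penalty weight are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: score a candidate mask given the previous
# segmentation and the current click, penalizing implausible size jumps.
import numpy as np

def adaptive_score(candidate_mask, prev_mask, prompt_point, size_penalty=0.5):
    """candidate_mask, prev_mask: boolean HxW arrays; prompt_point: (y, x).
    All names and weightings here are hypothetical."""
    y, x = prompt_point
    if not candidate_mask[y, x]:
        return -np.inf  # a valid candidate must contain the clicked point
    # Smoothness term: overlap with the previous segmentation.
    inter = np.logical_and(candidate_mask, prev_mask).sum()
    union = np.logical_or(candidate_mask, prev_mask).sum()
    iou = inter / union if union > 0 else 0.0
    # Penalize large relative changes in segment size between clicks.
    size_ratio = candidate_mask.sum() / max(prev_mask.sum(), 1)
    size_term = abs(np.log(max(size_ratio, 1e-6)))
    return iou - size_penalty * size_term
```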


GaussHDR: High Dynamic Range Gaussian Splatting via Learning Unified 3D and 2D Local Tone Mapping

March 2025


High dynamic range (HDR) novel view synthesis (NVS) aims to reconstruct HDR scenes by leveraging multi-view low dynamic range (LDR) images captured at different exposure levels. Current training paradigms with 3D tone mapping often result in unstable HDR reconstruction, while training with 2D tone mapping reduces the model's capacity to fit LDR images. Additionally, the global tone mapper used in existing methods can impede the learning of both HDR and LDR representations. To address these challenges, we present GaussHDR, which unifies 3D and 2D local tone mapping through 3D Gaussian splatting. Specifically, we design a residual local tone mapper for both 3D and 2D tone mapping that accepts an additional context feature as input. We then propose combining the dual LDR rendering results from both 3D and 2D local tone mapping at the loss level. Finally, recognizing that different scenes may exhibit varying balances between the dual results, we introduce uncertainty learning and use the uncertainties for adaptive modulation. Extensive experiments demonstrate that GaussHDR significantly outperforms state-of-the-art methods in both synthetic and real-world scenarios.
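
As a rough illustration of combining the dual renderings at the loss level with uncertainty-based modulation, the sketch below uses the standard learned log-variance weighting; the variable names and the L1 photometric term are assumptions, not GaussHDR's actual code.

```python
# Hedged sketch: uncertainty-weighted combination of the two LDR
# rendering losses (3D vs. 2D local tone-mapping branches).
import torch

def dual_ldr_loss(ldr_3d, ldr_2d, ldr_gt, log_var_3d, log_var_2d):
    """ldr_3d / ldr_2d: renders from the 3D and 2D tone-mapping branches;
    log_var_*: learned per-branch log-variances. All names hypothetical."""
    l3d = torch.abs(ldr_3d - ldr_gt).mean()
    l2d = torch.abs(ldr_2d - ldr_gt).mean()
    # A branch with higher uncertainty is down-weighted; the additive
    # log-variance term discourages inflating the uncertainty without bound.
    return (torch.exp(-log_var_3d) * l3d + log_var_3d
            + torch.exp(-log_var_2d) * l2d + log_var_2d)
```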


Fig. 3. Qualitative comparison with zero-shot monocular depth estimation methods across different datasets. Our model demonstrates excellent detail preservation and structure capture. Benefiting from the Feature Alignment module, it avoids overfitting to textures.
Fig. 5. Depth distributions of different depth preprocessing methods on Virtual KITTI. Square-root disparity exhibits the most uniform distribution.
DepthMaster: Taming Diffusion Models for Monocular Depth Estimation
  • Preprint
  • File available

January 2025


Ziyang Song · Zerong Wang · Bo Li · [...] · Tianzhu Zhang

Monocular depth estimation within the diffusion-denoising paradigm demonstrates impressive generalization ability but suffers from low inference speed. Recent methods adopt a single-step deterministic paradigm to improve inference efficiency while maintaining comparable performance. However, they overlook the gap between generative and discriminative features, leading to suboptimal results. In this work, we propose DepthMaster, a single-step diffusion model designed to adapt generative features for the discriminative depth estimation task. First, to mitigate overfitting to texture details introduced by generative features, we propose a Feature Alignment module, which incorporates high-quality semantic features to enhance the denoising network's representation capability. Second, to address the lack of fine-grained details in the single-step deterministic framework, we propose a Fourier Enhancement module to adaptively balance low-frequency structure and high-frequency details. We adopt a two-stage training strategy to fully leverage the potential of the two modules. In the first stage, we focus on learning the global scene structure with the Feature Alignment module, while in the second stage, we exploit the Fourier Enhancement module to improve the visual quality. Through these efforts, our model achieves state-of-the-art performance in terms of generalization and detail preservation, outperforming other diffusion-based methods across various datasets. Our project page can be found at https://indu1ge.github.io/DepthMaster_page.
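
For intuition, a frequency-domain enhancement block of the kind the abstract describes could look like the sketch below: features are transformed with an FFT, rescaled by a learned per-frequency filter, and transformed back. The module name, identity initialization, and shapes are assumptions rather than DepthMaster's implementation.

```python
# Hypothetical Fourier-style enhancement block: rescale feature
# frequencies with a learned filter to re-balance structure vs. detail.
import torch
import torch.nn as nn

class FourierEnhancement(nn.Module):
    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        # One learnable complex weight per rFFT frequency bin,
        # initialized to ones so training starts as the identity map.
        self.weight = nn.Parameter(
            torch.ones(channels, height, width // 2 + 1, dtype=torch.cfloat))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map.
        freq = torch.fft.rfft2(x, norm="ortho")
        freq = freq * self.weight  # adaptively re-weight frequency bands
        return torch.fft.irfft2(freq, s=x.shape[-2:], norm="ortho")
```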


Boosting Adversarial Transferability with Spatial Adversarial Alignment

January 2025


Deep neural networks are vulnerable to adversarial examples that exhibit transferability across various models. Numerous approaches have been proposed to enhance the transferability of adversarial examples, including advanced optimization, data augmentation, and model modifications. However, these methods still show limited transferability, particularly in cross-architecture scenarios such as from CNNs to ViTs. To achieve high transferability, we propose a technique termed Spatial Adversarial Alignment (SAA), which employs an alignment loss and leverages a witness model to fine-tune the surrogate model. Specifically, SAA consists of two key parts: spatial-aware alignment and adversarial-aware alignment. First, we minimize the divergences between the two models' features in both global and local regions, facilitating spatial alignment. Second, we introduce a self-adversarial strategy that leverages adversarial examples to impose further constraints, aligning features from an adversarial perspective. Through this alignment, the surrogate model is trained to concentrate on the common features extracted by the witness model. This facilitates adversarial attacks on these shared features, thereby yielding perturbations with enhanced transferability. Extensive experiments across various architectures on ImageNet show that surrogate models aligned with SAA generate adversarial examples with higher transferability, especially in cross-architecture attacks.
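
A minimal sketch of the two alignment terms, assuming simple mean-pooled global descriptors and average-pooled local patches compared with an MSE divergence; the actual SAA losses may differ.

```python
# Rough sketch (assumptions throughout): match surrogate and witness
# features globally and over local patches.
import torch
import torch.nn.functional as F

def alignment_loss(feat_s: torch.Tensor, feat_w: torch.Tensor,
                   patch: int = 4) -> torch.Tensor:
    """feat_s / feat_w: (B, C, H, W) features from the surrogate and
    witness models (names hypothetical)."""
    # Global term: distance between spatially pooled feature vectors.
    g = F.mse_loss(feat_s.mean(dim=(2, 3)), feat_w.mean(dim=(2, 3)))
    # Local term: distance between average-pooled patch descriptors.
    l = F.mse_loss(F.avg_pool2d(feat_s, patch), F.avg_pool2d(feat_w, patch))
    return g + l

# The adversarial-aware part would evaluate the same loss on adversarial
# examples crafted against the surrogate (self-adversarial), then
# fine-tune the surrogate on the combined objective.
```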



Advancing Comprehensive Aesthetic Insight with Multi-Scale Text-Guided Self-Supervised Learning

December 2024


Image Aesthetic Assessment (IAA) is a vital and intricate task that entails analyzing and assessing an image's aesthetic values and identifying its highlights and areas for improvement. Traditional IAA methods often concentrate on a single aesthetic task and suffer from inadequate labeled datasets, impairing in-depth aesthetic comprehension. Despite efforts to overcome this challenge through the application of Multi-modal Large Language Models (MLLMs), such models remain underdeveloped for IAA purposes. To address this, we propose a comprehensive aesthetic MLLM capable of nuanced aesthetic insight. Central to our approach is an innovative multi-scale text-guided self-supervised learning technique. This technique features a multi-scale feature alignment module and capitalizes on a wealth of unlabeled data in a self-supervised manner to structurally and functionally enhance aesthetic ability. Empirical evidence indicates that, accompanied by extensive instruction tuning, our model sets new state-of-the-art benchmarks across multiple tasks, including aesthetic scoring, aesthetic commenting, and personalized image aesthetic assessment. Remarkably, it also demonstrates zero-shot learning capabilities in the emerging task of aesthetic suggestion. Furthermore, for personalized image aesthetic assessment, we harness the potential of in-context learning and showcase its inherent advantages.
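
As one plausible reading of the multi-scale text-guided objective, the sketch below aligns pooled visual features at several scales with a text embedding via cosine similarity; every component here is an illustrative assumption, not the paper's implementation.

```python
# Purely illustrative: pull per-scale image features toward the
# embedding of an aesthetic text description.
import torch
import torch.nn.functional as F

def multiscale_alignment_loss(img_feats, text_emb):
    """img_feats: list of per-scale (B, C) pooled visual features,
    already projected to the text dimension; text_emb: (B, C)."""
    loss = 0.0
    for f in img_feats:
        # Cosine-similarity alignment at each scale.
        loss = loss + (1 - F.cosine_similarity(f, text_emb, dim=-1)).mean()
    return loss / len(img_feats)
```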


Learning Adaptive Lighting via Channel-Aware Guidance

December 2024


Learning lighting adaptation is a key step in achieving good visual perception and supporting downstream vision tasks. There are multiple light-related tasks (e.g., image retouching and exposure correction), and previous studies have mainly investigated them individually. However, we observe that light-related tasks share fundamental properties: i) different color channels have different light properties, and ii) the channel differences reflected in the time and frequency domains differ. Guided by these common light properties, we propose the Learning Adaptive Lighting Network (LALNet), a unified framework capable of processing different light-related tasks. Specifically, we introduce color-separated features that emphasize the light differences between color channels and combine them with traditional color-mixed features via Light Guided Attention (LGA). LGA utilizes the color-separated features to guide the color-mixed features to focus on channel differences, ensuring visual consistency across channels. We introduce dual-domain channel modulation to generate the color-separated features, and a wavelet transform followed by a vision state-space module to generate the color-mixed features. Extensive experiments on four representative light-related tasks demonstrate that LALNet significantly outperforms state-of-the-art methods on benchmarks while requiring fewer computational resources. We provide an anonymous online demo at https://xxxxxx2025.github.io/LALNet/.
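
To illustrate how color-separated features might guide color-mixed ones, here is a cross-attention sketch; treating the separated features as queries is our assumption, as are the module name and the residual fusion.

```python
# A minimal cross-attention sketch (assumed, not LALNet's code): the
# color-separated features steer where the color-mixed features attend.
import torch
import torch.nn as nn

class LightGuidedAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, mixed: torch.Tensor, separated: torch.Tensor):
        # mixed / separated: (B, N, C) token sequences of color-mixed
        # and color-separated features (flattened spatial grid).
        guided, _ = self.attn(query=separated, key=mixed, value=mixed)
        return mixed + guided  # residual fusion keeps visual consistency
```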


CPA: Camera-pose-awareness Diffusion Transformer for Video Generation

December 2024


Despite the significant advancements made by Diffusion Transformer (DiT)-based methods in video generation, a notable gap remains in controllable camera-pose perspectives. Existing works such as OpenSora do not adhere precisely to anticipated trajectories and physical interactions, limiting flexibility in downstream applications. To alleviate this issue, we introduce CPA, a unified camera-pose-aware text-to-video generation approach that elaborates camera movement and integrates textual, visual, and spatial conditions. Specifically, we deploy a Sparse Motion Encoding (SME) module to transform camera pose information into a spatial-temporal embedding, and activate a Temporal Attention Injection (TAI) module to inject motion patches into each ST-DiT block. Our plug-in architecture accommodates the original DiT parameters, facilitating diverse types of camera poses and flexible object movement. Extensive qualitative and quantitative experiments demonstrate that our method outperforms LDM-based methods for long video generation while achieving optimal performance in trajectory consistency and object consistency.
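
A hedged sketch of the overall mechanism: encode per-frame camera extrinsics into motion embeddings and add them to the video tokens before temporal attention. The pose dimensionality, the MLP encoder, and the additive injection are assumptions, not CPA's actual modules.

```python
# Hypothetical sketch: camera poses -> motion patches -> injection
# into the token stream of a transformer video model.
import torch
import torch.nn as nn

class SparseMotionEncoder(nn.Module):
    def __init__(self, pose_dim: int = 12, embed_dim: int = 256):
        super().__init__()
        # Flattened 3x4 extrinsics per frame -> per-frame embedding.
        self.proj = nn.Sequential(
            nn.Linear(pose_dim, embed_dim), nn.SiLU(),
            nn.Linear(embed_dim, embed_dim))

    def forward(self, poses: torch.Tensor) -> torch.Tensor:
        # poses: (B, T, 12) camera extrinsics per frame.
        return self.proj(poses)  # (B, T, embed_dim) motion patches

def inject_temporal(tokens: torch.Tensor, motion: torch.Tensor):
    # tokens: (B, T, N, D) video tokens; motion: (B, T, D).
    return tokens + motion.unsqueeze(2)  # broadcast over spatial tokens
```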




Citations (56)


... Therefore, a potential direction for future work is to develop a COLMAP-free version of GaussHDR. Second, it is promising to introduce depth priors to enhance geometry reconstruction by utilizing off-the-shelf depth models [23,35,36,46,60,61]. Finally, our method focuses on static scenes and lacks the ability to perform HDR reconstruction in dynamic environments, which is also an area worth exploring. ...

Reference:

GaussHDR: High Dynamic Range Gaussian Splatting via Learning Unified 3D and 2D Local Tone Mapping
Mono-ViFI: A Unified Learning Framework for Self-supervised Single and Multi-frame Monocular Depth Estimation
  • Citing Chapter
  • November 2024

... To mitigate this issue, later techniques propose estimating optical flow to detect motion regions in the LDR images, followed by removal [19] or alignment [22] during the fusion stage. Recent advancements in deep learning have led to the exploration of using CNNs [14,27,29] and Transformers [10,37,52] ... [20,33,42,51,55], primarily for nighttime scenes. This paper focuses on the second group, which employs multi-exposure and multi-view LDR images for training [7,18,21,58,59]. ...

SAFNet: Selective Alignment Fusion Network for Efficient HDR Imaging
  • Citing Chapter
  • October 2024

... Zang et al. [33] proposed ContextDET, a framework combining a visual encoder, pre-trained LLM, and visual decoder for context-aware object detection in human-AI interactions. Lv et al. [30] introduced the Multimodal Camo-Perceptive Framework (MMCPF), using a Chain of Visual Perception strategy to improve zero-shot Camouflaged Object Detection (COD) with LLMs. Zhu et al. [39] developed a depth-aware Transformer to integrate object depth information for improved Visual Commonsense Reasoning (VCR) by considering 3D spatial relationships in visual and textual data. ...

Chain of Visual Perception: Harnessing Multimodal Large Language Models for Zero-shot Camouflaged Object Detection
  • Citing Conference Paper
  • October 2024

... The diffusion model [1]-[3], [24]-[26] consists of a forward process and a backward process. In the forward process, noise is incrementally added to the original image q(x), generating a noisy image through a Markov chain, guided by a predefined variance schedule β_{1:T}. ...
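
For reference, the standard formulation of this forward process (using the same β_{1:T} notation) is:

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right),
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right),
\quad \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s).
```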

Non-uniform Timestep Sampling: Towards Faster Diffusion Model Training
  • Citing Conference Paper
  • October 2024

... Furthermore, the presence of multiple Lewis basic nitrogen groups, such as amides, often reduces the reactivity of peptide acceptors under electrophile-promoted reaction conditions. HFIP is uniquely effective in solubilizing peptides (49, 50); its strong H-bonding ability also minimizes the complexation of peptide basic sites with electrophilic promoters (29). As demonstrated by examples 42 to 44, carboxylic acid groups at either terminal or internal positions of peptides can be glycosylated in moderate to good yield under conditions [A]. ...

Intermolecular crosslinking of phenols and alkyl amines with formaldehyde in HFIP for conjugation: A multi-partner bridging model for HFIP promotion
  • Citing Article
  • October 2024

CCS Chemistry

... They typically identify the generalized factors from a small amount of source domain data to fine-tune FMs and generalize the model to the out-of-domain data. From the perspective of representation learning [70], the mainstream paradigms for improving the generalizability of VFM and MMFM include inserting LoRA [33] module or Adapter [29,81] to fine-tune a small number of parameters, adversarial learning [46], feature decomposition [22,75], and invariance alignment [65]. Despite the general efficacy across different tasks, most existing approaches are customized for specific architectures and tasks. ...
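
For concreteness, the LoRA idea mentioned here amounts to freezing a pretrained weight and learning a low-rank additive update; below is a minimal generic sketch (the standard formulation, not any cited paper's code).

```python
# Generic LoRA sketch: frozen base linear layer plus a trainable
# low-rank update B @ A, initialized so the output is unchanged.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the foundation-model weight
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank  # B starts at zero: identity at init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```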

ASAM: Boosting Segment Anything Model with Adversarial Tuning
  • Citing Conference Paper
  • June 2024

... 7) Poisoning against Defense: Some defense methods, like adversarial training [337], [309], randomized smoothing [48], [338], [339], and backdoor detection [340], [341], have proven effective in defending against poisoning attacks. To counter these defenses, a line of work focuses on integrating defense considerations during perturbation learning [342], [314], [315], [343]. Specifically for backdoor triggers, approaches such as adaptive triggers [304], tailored triggers [296], and latent space perturbations [333] have been developed to help evade detection. ...

Re-Thinking Data Availability Attacks Against Deep Neural Networks
  • Citing Conference Paper
  • June 2024

... As shown in Figure 2, instead of designing complex but incompact network structures [12; 4] that incur a large number of tunable parameters, recent works [13; 5; 14] such as MLoRE and MOELoRA explore the advantages of Mixture-of-Experts (MoE) in extracting task-specific features by enhancing the diversity of parameters and features [15; 16; 17], and Parameter-Efficient Fine-Tuning (PEFT) in reducing the tunable parameters and storage overhead [18; 19; 20; 21]. Nevertheless, MLoRE [5] still relies on a substantial number of additional parameters, limiting the overall efficiency and feasibility of training. MOELoRA [14] adopts a unitary LoRA structure to tune the experts, which weakens the learning capability of individual experts. ...

Multi-Task Dense Prediction via Mixture of Low-Rank Experts
  • Citing Conference Paper
  • June 2024

... Researchers have elucidated distinct receptor activation states through atomic-level simulations, proposing a "G protein-first" activation model [5]. The advent of high-resolution GPCR structural biology, facilitated by techniques like cryo-electron microscopy, is propelling the evolution of precision drugs tailored to these receptors. ...

The G Protein-First Mechanism for Activation of the Class B Glucagon-like Peptide 1 Receptor Coupled to N-Terminal Domain-Mediated Conformational Progression
  • Citing Article
  • September 2024

Journal of the American Chemical Society