Wonjong Rhee’s research while affiliated with Seoul National University and other places


Publications (80)


Task-Specific Preconditioner for Cross-Domain Few-Shot Learning
  • Article

April 2025

Proceedings of the AAAI Conference on Artificial Intelligence

Suhyun Kang · Jungwon Park · Wonseok Lee · Wonjong Rhee

Cross-Domain Few-Shot Learning (CDFSL) methods typically parameterize models with task-agnostic and task-specific parameters. To adapt task-specific parameters, recent approaches have utilized fixed optimization strategies, despite their potential sub-optimality across varying domains or target tasks. To address this issue, we propose a novel adaptation mechanism called Task-Specific Preconditioned gradient descent (TSP). Our method first meta-learns Domain-Specific Preconditioners (DSPs) that capture the characteristics of each meta-training domain, which are then linearly combined using task-coefficients to form the Task-Specific Preconditioner. The preconditioner is applied to gradient descent, making the optimization adaptive to the target task. We constrain our preconditioners to be positive definite, guiding the preconditioned gradient toward the direction of steepest descent. Empirical evaluations on the Meta-Dataset show that TSP achieves state-of-the-art performance across diverse experimental scenarios.
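
The adaptation step described in this abstract can be illustrated with a minimal sketch. It assumes toy 2x2 diagonal preconditioners, non-negative task coefficients, and a quadratic loss; all function names and values are hypothetical and are not taken from the authors' implementation.

```python
# Hypothetical sketch of preconditioned gradient descent with a combined preconditioner.
# Names, shapes, and values are illustrative assumptions, not the authors' code.

def matvec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def combine_preconditioners(dsps, coeffs):
    """Linearly combine domain-specific preconditioners using task coefficients."""
    n = len(dsps[0])
    return [
        [sum(c * p[i][j] for c, p in zip(coeffs, dsps)) for j in range(n)]
        for i in range(n)
    ]

def preconditioned_step(theta, grad, precond, lr):
    """One update: theta <- theta - lr * P @ grad."""
    direction = matvec(precond, grad)
    return [t - lr * d for t, d in zip(theta, direction)]

# Two toy positive definite preconditioners (diagonal, positive entries).
dsp_a = [[2.0, 0.0], [0.0, 1.0]]
dsp_b = [[1.0, 0.0], [0.0, 3.0]]

# Assumed non-negative coefficients: such a combination of positive
# definite matrices is itself positive definite.
coeffs = [0.25, 0.75]
P = combine_preconditioners([dsp_a, dsp_b], coeffs)

# Toy loss L(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta = [1.0, -2.0]
grad = list(theta)
new_theta = preconditioned_step(theta, grad, P, lr=0.1)

# With a positive definite P, grad^T P grad > 0, so -P @ grad is a descent direction.
descent = sum(g * d for g, d in zip(grad, matvec(P, grad)))
```

The positive value of `descent` shows why a positive definite constraint is useful: the preconditioned update always decreases the loss for a sufficiently small step size.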


Harmonizing Visual and Textual Embeddings for Zero-Shot Text-to-Image Customization

April 2025

Proceedings of the AAAI Conference on Artificial Intelligence

Yeji Song · Wonhark Park · [...]

In a surge of text-to-image (T2I) models and their customization methods that generate new images of a user-provided subject, current works focus on alleviating the costs incurred by a lengthy per-subject optimization. These zero-shot customization methods encode the image of a specified subject into a visual embedding which is then utilized alongside the textual embedding for diffusion guidance. The visual embedding incorporates intrinsic information about the subject, while the textual embedding provides a new context. However, the existing methods often 1) generate images with the same pose as an input image, and 2) exhibit deterioration in the subject's identity when facing a pose variation prompt. We first pin down the problem and show that redundant pose information in the visual embedding interferes with the pose indication in the textual embedding. Conversely, the textual embedding also harms the subject's identity which is tightly entangled with the pose in the visual embedding. As a remedy, we propose text-orthogonal visual embedding which effectively harmonizes with the given textual embedding. We also adopt the visual-only embedding and inject the subject's clear features utilizing a self-attention swap. Our method is both effective and robust, offering highly flexible zero-shot generation while effectively maintaining the subject's identity.
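
One simple way to picture a "text-orthogonal" visual embedding is standard vector projection: subtract from the visual vector its component along the textual vector. The sketch below assumes single flat vectors and hypothetical names; the paper's actual construction may differ.

```python
# Hypothetical sketch: plain vector projection, not the authors' implementation.

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def text_orthogonal(visual, text):
    """Remove from `visual` its component along `text`."""
    scale = dot(visual, text) / dot(text, text)
    return [v - scale * t for v, t in zip(visual, text)]

# Toy embeddings (assumed values for illustration only).
visual = [3.0, 4.0, 0.0]
text = [1.0, 0.0, 0.0]

ortho = text_orthogonal(visual, text)
overlap = dot(ortho, text)  # zero when the result is orthogonal to `text`
```

After the projection, the remaining visual vector carries no component in the direction of the textual vector, which is the sense in which the two no longer interfere in this toy setting.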



Figure 2: Illustration of forming a Task-Specific Preconditioner based on three DSPs that have been meta-trained for three meta-training domains.
Hyper-parameters used for training Dataset Classifier on various experimental settings.
Task-Specific Preconditioner for Cross-Domain Few-Shot Learning
  • Preprint
  • File available

December 2024 · 5 Reads

Cross-Domain Few-Shot Learning (CDFSL) methods typically parameterize models with task-agnostic and task-specific parameters. To adapt task-specific parameters, recent approaches have utilized fixed optimization strategies, despite their potential sub-optimality across varying domains or target tasks. To address this issue, we propose a novel adaptation mechanism called Task-Specific Preconditioned gradient descent (TSP). Our method first meta-learns Domain-Specific Preconditioners (DSPs) that capture the characteristics of each meta-training domain, which are then linearly combined using task-coefficients to form the Task-Specific Preconditioner. The preconditioner is applied to gradient descent, making the optimization adaptive to the target task. We constrain our preconditioners to be positive definite, guiding the preconditioned gradient toward the direction of steepest descent. Empirical evaluations on the Meta-Dataset show that TSP achieves state-of-the-art performance across diverse experimental scenarios.


Cross-Attention Head Position Patterns Can Align with Human Visual Concepts in Text-to-Image Generative Models

December 2024 · 8 Reads

Recent text-to-image diffusion models leverage cross-attention layers, which have been effectively utilized to enhance a range of visual generative tasks. However, our understanding of cross-attention layers remains somewhat limited. In this study, we present a method for constructing Head Relevance Vectors (HRVs) that align with useful visual concepts. An HRV for a given visual concept is a vector with a length equal to the total number of cross-attention heads, where each element represents the importance of the corresponding head for the given visual concept. We develop and employ an ordered weakening analysis to demonstrate the effectiveness of HRVs as interpretable features. To demonstrate the utility of HRVs, we propose concept strengthening and concept adjusting methods and apply them to enhance three visual generative tasks. We show that misinterpretations of polysemous words in image generation can be corrected in most cases, five challenging attributes in image editing can be successfully modified, and catastrophic neglect in multi-concept generation can be mitigated. Overall, our work provides an advancement in understanding cross-attention layers and introduces new approaches for fine-controlling these layers at the head level.
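
The idea of a Head Relevance Vector, with one importance score per cross-attention head, can be sketched as follows. The abstract does not specify how heads are weakened, so the rank-then-scale rule below, the scalar head outputs, and all names are assumptions made purely for illustration.

```python
# Hypothetical sketch: rank heads by relevance and weaken the top-k.
# The scaling rule and scalar "head outputs" are illustrative assumptions.

def ordered_weakening(head_outputs, hrv, k, factor=0.0):
    """Scale the k heads with the highest relevance by `factor`; keep the rest."""
    ranked = sorted(range(len(hrv)), key=lambda h: hrv[h], reverse=True)
    weakened = set(ranked[:k])
    return [
        out * factor if h in weakened else out
        for h, out in enumerate(head_outputs)
    ]

# Toy HRV over four heads: one score per head for a single visual concept.
hrv = [0.1, 0.7, 0.05, 0.15]
head_outputs = [1.0, 1.0, 1.0, 1.0]

# Weaken the two most relevant heads (indices 1 and 3 in this toy example).
result = ordered_weakening(head_outputs, hrv, k=2)
```

If the HRV is a meaningful feature, weakening heads in order of relevance should degrade the concept faster than weakening them in random order, which is the intuition behind using such an analysis as a test of interpretability.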



Figure 2: Four image examples of generating datasets with known true MI values. X and Y consist of random images drawn from the MNIST dataset. (a) Basic construction: only the images of class 0 or 1 are considered and X and Y are sampled to share the class information. By choosing C to be either 0 or 1 with probability 0.5, H(C) becomes 1 and therefore I(X; Y) = H(C) = 1. (b) Combining four independent images in 2-D to form a single image: I(X; Y) = 4 because H(C) = H_{left,up} + H_{left,down} + H_{right,up} + H_{right,down} = 4. (c) Combining three independent images in channel dimension to form a color image: I(X; Y) = 3 because H(C) = H_{green} + H_{red} + H_{blue} = 3. (d) Adding nuisance: an independently chosen background image from CIFAR-10 [Krizhevsky et al., 2009] is inserted as nuisance. Because the nuisance is independently chosen for X and Y, it does not affect the true MI [Lee et al., 2023]. Therefore, I(X; Y) = 1.
Figure 5: (a) Example of inserting nuisance to D_vision. (b) Estimation results when the true MI is 2 bits. (c) Estimation results with various values of nuisance strength and true MI for three best-performing estimators. True MI values are on the x-axis and the nuisance strength is on the y-axis.
Figure 6: Estimation results for hidden layers of ResNet-50. Dashed lines indicate boundaries between stages. For full results, see Supplementary D.4.
If deep representations are robust for estimating MI, should this hold across all layers? To address this question, we estimated MI for intermediate layers of ResNet-50 trained on D_vision without nuisance. Results are summarized in Figure 6. According to the data processing inequality, lower-layer MI cannot be smaller than upper-layer MI. However, estimated MI values indicate the opposite. This discrepancy suggests that MI estimations at lower layers are less precise, whereas upper-layer representations yield more accurate estimations. Interestingly, we observe step-wise estimation results; the transition clearly occurs when the output size changes across all types of estimators. It appears that upper layers might capture abstract, high-level features, potentially offering more meaningful information for MI estimation. In contrast, lower layers might contain more noise and less discriminative features, which could lead to poorer accuracy.
Figure 7: Construction of an image dataset with a non-integer MI value. We utilize a binary symmetric channel to corrupt the label of Y.
A Benchmark Suite for Evaluating Neural Mutual Information Estimators on Unstructured Datasets

October 2024 · 13 Reads

Mutual Information (MI) is a fundamental metric for quantifying dependency between two random variables. When we can access only the samples, but not the underlying distribution functions, we can evaluate MI using sample-based estimators. Assessment of such MI estimators, however, has almost always relied on analytical datasets including Gaussian multivariates. Such datasets allow analytical calculations of the true MI values, but they are limited in that they do not reflect the complexities of real-world datasets. This study introduces a comprehensive benchmark suite for evaluating neural MI estimators on unstructured datasets, specifically focusing on images and texts. By leveraging same-class sampling for positive pairing and introducing a binary symmetric channel trick, we show that we can accurately manipulate true MI values of real-world datasets. Using the benchmark suite, we investigate seven challenging scenarios, shedding light on the reliability of neural MI estimators for unstructured datasets.
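
The binary symmetric channel trick mentioned in this abstract rests on a textbook identity: for a uniformly distributed binary class label passed through a channel that flips it with probability p, the mutual information between input and output is 1 - H_b(p) bits, where H_b is the binary entropy. The sketch below computes that value; the function names are hypothetical, and how the benchmark applies the channel to image or text pairs is not shown here.

```python
import math

def binary_entropy(p):
    """Binary entropy H_b(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_mutual_information(flip_prob):
    """MI in bits between a uniform binary label and its copy after a
    binary symmetric channel with the given flip probability."""
    return 1.0 - binary_entropy(flip_prob)

no_noise = bsc_mutual_information(0.0)    # label is preserved exactly
full_noise = bsc_mutual_information(0.5)  # output is independent of the input
partial = bsc_mutual_information(0.11)    # a non-integer MI value
```

Sweeping the flip probability between 0 and 0.5 moves the true MI continuously between 1 bit and 0 bits, which is how a non-integer target value can be set.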



Classification of underlying paroxysmal supraventricular tachycardia types using deep learning of sinus rhythm electrocardiograms

September 2024 · 48 Reads

Background: Obtaining tachycardia electrocardiograms (ECGs) in patients with paroxysmal supraventricular tachycardia (PSVT) is often challenging. Sinus rhythm ECGs are of limited predictive value for PSVT types in patients without preexcitation. This study aimed to explore the classification of atrioventricular nodal reentry tachycardia (AVNRT) and concealed atrioventricular reentry tachycardia (AVRT) using sinus rhythm ECGs through deep learning.
Methods: This retrospective study included patients diagnosed with either AVNRT or concealed AVRT, validated through electrophysiological studies. A modified ResNet-34 deep learning model, pre-trained on a public ECG database, was employed to classify sinus rhythm ECGs with underlying AVNRT or concealed AVRT. Various configurations were compared using ten-fold cross-validation on the training set, and the best-performing configuration was tested on the hold-out test set.
Results: The study analyzed 833 patients with AVNRT and 346 with concealed AVRT. Among ECG features, the corrected QT intervals exhibited the highest area under the receiver operating characteristic curve (AUROC) of 0.602. The performance of the deep learning model significantly improved after pre-training, showing an AUROC of 0.726 compared to 0.668 without pre-training (p < 0.001). No significant difference was found in AUROC between 12-lead and precordial 6-lead ECGs (p = 0.265). On the test set, deep learning achieved modest performance in differentiating the two types of arrhythmias, with an AUROC of 0.708, an AUPRC of 0.875, an F1-score of 0.750, a sensitivity of 0.670, and a specificity of 0.649.
Conclusion: The deep-learning classification of AVNRT and concealed AVRT using sinus rhythm ECGs is feasible, indicating potential for aiding in the non-invasive diagnosis of these arrhythmias.
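
For readers less familiar with the threshold-based metrics reported in this abstract, the sketch below shows how sensitivity, specificity, and F1-score follow from a binary confusion matrix. The counts are hypothetical and are not the study's data.

```python
# Hypothetical sketch: standard metric definitions applied to made-up counts.

def classification_metrics(tp, fp, tn, fn):
    """Compute sensitivity, specificity, precision, and F1 from confusion counts."""
    sensitivity = tp / (tp + fn)  # true positive rate (recall)
    specificity = tn / (tn + fp)  # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }

# Illustrative counts only (not taken from the study).
metrics = classification_metrics(tp=80, fp=40, tn=60, fn=20)
```

Because F1 depends on precision, it can exceed both sensitivity and specificity when the positive class is much larger than the negative class.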



Citations (38)


... Existing implementations of VideoLMs can be broadly categorized into two approaches: The first integrates video and language modalities directly during training, with representative methods including Video-LLaVA [10] and Video-ChatGPT [11], among others; the second leverages existing VLM architectures by transforming videos into structured image inputs through multi-stage processing. For example, IG-VLM [12] samples 6 frames at equal intervals from a video and arranges them into a grid-like image, which is then fed into a standard VLM to model temporal information. Similar approaches include VideoTree [13], LVNet [14], and others. ...

Reference:

AirVista-II: An Agentic System for Embodied UAVs Toward Dynamic Scene Semantic Understanding
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM

IEEE Access
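
The frame sampling described in the citing passage above (6 frames at equal intervals, arranged into a grid) can be sketched as below. Sampling at the centre of each segment and using a 3-column layout are assumptions; the passage only states equal intervals and a grid-like image, and the function names are hypothetical.

```python
# Hypothetical sketch: equal-interval frame index sampling and grid layout.
# Segment-centre sampling and the 3-column layout are illustrative assumptions.

def sample_frame_indices(num_frames, num_samples=6):
    """Pick `num_samples` frame indices at equal intervals (segment centres)."""
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]

def arrange_grid(items, columns=3):
    """Split a flat list into rows of `columns` items, in reading order."""
    return [items[i:i + columns] for i in range(0, len(items), columns)]

indices = sample_frame_indices(60)  # a toy 60-frame video
grid = arrange_grid(indices)        # rows of frame indices for the grid image
```

In a real pipeline, each index would be replaced by the decoded frame, and the rows would be concatenated into a single image before being passed to the VLM.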

... Current diffusion-based personalization models can be broadly divided into two categories: optimization-based models and tuning-free models. Optimization-based models [1,6,9,10,14,15,26,33], such as Textual Inversion [6] and DreamBooth [26], optimize embeddings of text tokens or fine-tune the backbone to implant a new subject into the model's output domain. In contrast, tuning-free models [4,7,13,17,20,21,24,27,38,40] pre-train a separate encoder to extract the subject features, enabling zero-shot personalized image generation without per-subject optimization. ...

Selectively Informative Description can Reduce Undesired Embedding Entanglements in Text-to-Image Personalization
  • Citing Conference Paper
  • June 2024

... To address this gap, we present a method for evaluating MI estimators on any dataset in the absence of underlying distribution functions. Our approach employs same-class sampling as positive pairing [Lee et al., 2023] and binary symmetric channels [Cover, 1999] for precise manipulation of the true MI values. ...

Towards a rigorous analysis of mutual information in contrastive learning
  • Citing Article
  • August 2024

Neural Networks

... Other previously reported predictors of ECV failure are male gender [71], advanced age [72], high body weight [73] and high body surface area [74], diabetes [75], low estimated glomerular filtration rate [76], CHA2DS2-VASc score > 2 [77], left ventricular systolic dysfunction [78], larger LAVi [3], increased LVFPs [79], and AF duration > 3 months [80]. ...

Machine Learning Prediction for the Recurrence After Electrical Cardioversion of Patients With Persistent Atrial Fibrillation
  • Citing Article
  • July 2023

Korean Circulation Journal

... Out-of-distribution generalization is an important and challenging problem in machine learning, arising when there are distribution shifts between the training and test data. Compared to traditional domain adaptation tasks [12,36,89,57,32,99,74], OOD generalization is more critical as it focuses on generalizing to covariate-shifted data distributions that are unseen during training [6,61,116,71,40,69,9,58]. A primary set of approaches to OOD generalization involves extracting domain-invariant representations. ...

VNE: An Effective Method for Improving Deep Representation by Manipulating Eigenvalue Distribution
  • Citing Conference Paper
  • June 2023

... This process produces a preconditioner specifically adapted to the geometric characteristics of the target task's parameter space. By integrating knowledge from multiple seen domains, TSP distinguishes itself from traditional PGD techniques, such as GAP (Kang et al. 2023), which are discussed further in Section 6. Applying our approach to state-of-the-art CDFSL methods, such as TSA or TA2-Net, significantly enhances performance on Meta-Dataset. For example, in multi-domain settings, applying TSP to TA2-Net (Guo et al. 2023) achieves the best performance across all datasets. ...

Meta-Learning with a Geometry-Adaptive Preconditioner
  • Citing Conference Paper
  • June 2023

... We corroborate these findings for the multilingual case. Jung et al. (2023) apply isotropy-improving methods, namely normalising flows and Whitening, in the context of dense retrieval models, and find score improvements on the target task. ...

Isotropic Representation Can Improve Dense Retrieval
  • Citing Chapter
  • May 2023

Lecture Notes in Computer Science

... In addition, to improve the robustness of neural networks against adversarial attacks, attribution guided sharpening (AGS), which uses explainability approaches such as AGS, employs saliency maps derived from a nonrobust model to direct Choi and Hall's sharpening technique, which diminishes noise in input images before classification [76]. Hwang et al. [77] are credited with this approach and introduced AID-Purifier to increase the resilience of adversarially trained networks by refining their inputs. This auxiliary network operates as an extension to an already trained primary classifier and is trained to use binary cross-entropy loss as a discriminator to preserve computational efficiency. ...

AID-purifier: A light auxiliary network for boosting adversarial defense
  • Citing Article
  • April 2023

Neurocomputing

... MOS data encoded into 2-D images using the GAF method improved the robustness and generalization in the DNN processing of MOS signals. Combining this preprocessing with the fine-tuning of the GoogLeNet 156 . Their DL framework was tailored for predicting concentrations in mixed-gas environments and focused on synergistic approaches in data pre-processing, network architectural decisions, and multi-task learning. ...

Deep Learning Framework With Essential Pre-Processing Techniques for Improving Mixed-Gas Concentration Prediction

IEEE Access

... Pruning and quantization were used to compress VGG and ResNet for remote sensing image classification, balancing computational complexity constraints while preserving model accuracy [44]. Low-rank decomposition and quantization were used to compress ResNet and MobileNet, and reduce the computational complexity while preserving high performance [45]. Pruning, quantization, and changing the model architecture were used to design a compact SqueezeNet with competitive accuracy while significantly reducing the number of parameters [46]. ...

An effective low-rank compression with a joint rank selection followed by a compression-friendly training
  • Citing Article
  • April 2023

Neural Networks