Kyungjune Baek’s research while affiliated with Yonsei University and other places


Publications (12)


[Figures from the preprint below: win-rate versus SDXL on the Pick-a-Pic v2 test set; implicit reward accuracy on the Pick-a-Pic v2 validation set; ablation of the reference model update strategy, skewed timestep sampling, and reward scale scheduling (SD1.5 base); and ablation of exploration strategies.]
Rethinking Direct Preference Optimization in Diffusion Models
  • Preprint
  • File available

May 2025

Junyong Kang · Seohyun Lim · Kyungjune Baek · Hyunjung Shim
Aligning text-to-image (T2I) diffusion models with human preferences has emerged as a critical research challenge. While recent advances in this area have extended preference optimization techniques from large language models (LLMs) to the diffusion setting, they often struggle with limited exploration. In this work, we propose a novel and orthogonal approach to enhancing diffusion-based preference optimization. First, we introduce a stable reference model update strategy that relaxes the frozen reference model, encouraging exploration while maintaining a stable optimization anchor through reference model regularization. Second, we present a timestep-aware training strategy that mitigates the reward scale imbalance problem across timesteps. Our method can be integrated into various preference optimization algorithms. Experimental results show that our approach improves the performance of state-of-the-art methods on human preference evaluation benchmarks.
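The abstract describes two mechanisms: relaxing the frozen reference model and rebalancing reward scales across timesteps. As a rough illustration, here is a minimal PyTorch-style sketch of how a DPO-style diffusion objective could combine an EMA-updated reference with a timestep-dependent weight; the EMA decay, the weighting schedule, and the constant beta are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def ema_update(ref_model, model, decay=0.999):
    """Relax the frozen reference: move it slowly toward the policy,
    keeping a stable (but no longer fixed) optimization anchor."""
    with torch.no_grad():
        for p_ref, p in zip(ref_model.parameters(), model.parameters()):
            p_ref.mul_(decay).add_(p, alpha=1.0 - decay)

def timestep_weight(t, t_max=1000):
    """Stand-in schedule that down-weights timesteps whose implicit
    reward scale would otherwise dominate the objective."""
    return 1.0 - t.float() / t_max

def dpo_diffusion_loss(err_w, err_l, ref_err_w, ref_err_l, t, beta=500.0):
    """DPO-style loss on per-sample denoising errors at timestep t.

    err_*     : policy MSE on preferred (w) / dispreferred (l) images
    ref_err_* : the same quantities under the reference model
    """
    # Implicit reward margin: how much more the policy improves on the
    # winner than on the loser, relative to the reference model.
    logits = -((err_w - ref_err_w) - (err_l - ref_err_l))
    return -F.logsigmoid(beta * timestep_weight(t) * logits).mean()
```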


Length-Aware DETR for Robust Moment Retrieval

December 2024 · 2 Reads

Video Moment Retrieval (MR) aims to localize moments within a video based on a given natural language query. Given the prevalent use of platforms like YouTube for information retrieval, the demand for MR techniques is growing significantly. Recent DETR-based models have made notable advances in performance but still struggle to localize short moments accurately. Through data analysis, we identified limited feature diversity in short moments, which motivated the development of MomentMix. MomentMix employs two augmentation strategies, ForegroundMix and BackgroundMix, which enhance the feature representations of the foreground and background, respectively. Additionally, our analysis of prediction bias revealed that models particularly struggle to predict the center positions of short moments accurately. To address this, we propose a Length-Aware Decoder, which conditions on moment length through a novel bipartite matching process. Our extensive studies demonstrate the efficacy of our length-aware approach, especially in localizing short moments, leading to improved overall performance. Our method surpasses state-of-the-art DETR-based methods on benchmark datasets, achieving the highest R1 and mAP on QVHighlights and the highest R1@0.7 on TACoS and Charades-STA (e.g., a 2.46% gain in R1@0.7 and a 2.57% gain in average mAP on QVHighlights). The code is available at https://github.com/sjpark5800/LA-DETR.
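For intuition, here is a toy PyTorch sketch of the two MomentMix augmentations operating on clip-level video features. Only the foreground/background split comes from the abstract; the specific mixing rules (linear blending for ForegroundMix, donor-clip replacement for BackgroundMix) are assumptions about one plausible implementation.

```python
import torch

def foreground_mix(feats_a, span_a, feats_b, span_b, alpha=0.5):
    """Blend sample A's moment (foreground) clips with sample B's,
    after resampling B's foreground to A's moment length.
    feats_*: [T, D] clip features; span_*: (start, end) clip indices."""
    s, e = span_a
    fg_b = feats_b[span_b[0]:span_b[1]]
    # Linearly resample B's foreground to match A's moment length.
    idx = torch.linspace(0, fg_b.size(0) - 1, steps=e - s).long()
    out = feats_a.clone()
    out[s:e] = alpha * feats_a[s:e] + (1 - alpha) * fg_b[idx]
    return out

def background_mix(feats_a, span_a, feats_b):
    """Keep A's ground-truth moment but replace its background clips
    with clips drawn at random from another video B."""
    s, e = span_a
    T = feats_a.size(0)
    donor = feats_b[torch.randint(0, feats_b.size(0), (T,))]
    out = donor.clone()
    out[s:e] = feats_a[s:e]  # the annotated moment itself is preserved
    return out
```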



Logit Mixing Training for More Reliable and Accurate Prediction

July 2022 · 3 Reads · 2 Citations

Duhyeon Bang · Kyungjune Baek · Jiwoo Kim · [...] · Hyunjung Shim

When a person solves a multiple-choice problem, she considers not only what the answer is but also what it is not. By knowing which choices are not the answer and utilizing the relationships between choices, she can improve prediction accuracy. Inspired by this human reasoning process, we propose a new training strategy to fully utilize inter-class relationships, namely LogitMix. Our strategy can be combined with recent data augmentation techniques, e.g., Mixup, Manifold Mixup, CutMix, and PuzzleMix. We then suggest using a mixed logit, i.e., a mixture of two logits, as an auxiliary training objective. Since the logit preserves both positive and negative inter-class relationships, it can guide a network to learn the probability of wrong answers correctly. Our extensive experimental results on image- and language-based tasks demonstrate that LogitMix achieves state-of-the-art performance among recent data augmentation techniques in terms of calibration error and prediction accuracy.
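As a sketch of the idea, the snippet below adds a mixed-logit auxiliary term on top of standard Mixup in PyTorch. The use of MSE for matching logits and the loss weight aux_weight are assumptions; the paper may combine the terms differently.

```python
import torch
import torch.nn.functional as F

def logitmix_loss(model, x, y, alpha=1.0, aux_weight=0.5):
    """Mixup cross-entropy plus an auxiliary mixed-logit matching term."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[perm]        # standard Mixup input

    logits_mix = model(x_mix)                    # logits of the mixed input
    with torch.no_grad():
        logits_a, logits_b = model(x), model(x[perm])
    target = lam * logits_a + (1 - lam) * logits_b   # mixture of two logits

    # Usual Mixup objective on the hard labels ...
    ce = lam * F.cross_entropy(logits_mix, y) \
        + (1 - lam) * F.cross_entropy(logits_mix, y[perm])
    # ... plus the auxiliary term: the mixed input's logits should match
    # the mixed logits, preserving negative inter-class relationships too.
    aux = F.mse_loss(logits_mix, target)
    return ce + aux_weight * aux
```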



Commonality in Natural Images Rescues GANs: Pretraining GANs with Generic and Privacy-free Synthetic Data

April 2022 · 18 Reads

Transfer learning for GANs successfully improves generation performance under low-shot regimes. However, existing studies show that a model pretrained on a single benchmark dataset does not generalize to various target datasets. More importantly, the pretrained model can be vulnerable to copyright or privacy risks as membership inference attacks advance. To resolve both issues, we propose an effective and unbiased data synthesizer, namely Primitives-PS, inspired by the generic characteristics of natural images. Specifically, we utilize as priors 1) the generic statistics of the frequency magnitude spectrum, 2) elementary shapes for representing structural information (i.e., image composition via elementary shapes), and 3) the existence of saliency. Since our synthesizer only considers the generic properties of natural images, a single model pretrained on our dataset can be consistently transferred to various target datasets, and even outperforms previous methods pretrained on natural images in terms of Fréchet inception distance. Extensive analysis, ablation studies, and evaluations demonstrate that each component of our data synthesizer is effective, and provide insights into the desirable properties of a pretrained model for GAN transferability.
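The three priors named above suggest a simple recipe for a synthetic sample. The NumPy sketch below is a loose illustration, not the paper's Primitives-PS generator: rectangles stand in for elementary shapes, and all constants are guesses.

```python
import numpy as np

def noise_with_1_over_f_spectrum(h=128, w=128):
    """Random phase plus a 1/f magnitude spectrum, the generic
    frequency statistic of natural images."""
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    mag = 1.0 / np.maximum(np.sqrt(fx**2 + fy**2), 1.0 / max(h, w))
    phase = np.exp(2j * np.pi * np.random.rand(h, w))
    img = np.fft.ifft2(mag * phase).real
    return (img - img.min()) / (np.ptp(img) + 1e-8)

def add_shapes(img, n_shapes=8):
    """Overlay random rectangles as a stand-in for elementary shapes."""
    rng = np.random.default_rng()
    h, w = img.shape
    for _ in range(n_shapes):
        y, x = rng.integers(0, h - 8), rng.integers(0, w - 8)
        hh, ww = rng.integers(4, h // 4), rng.integers(4, w // 4)
        img[y:y + hh, x:x + ww] = rng.random()
    return img

def add_saliency(img):
    """Brighten one Gaussian blob so the image has a salient region."""
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    cy = np.random.randint(h // 4, 3 * h // 4)
    cx = np.random.randint(w // 4, 3 * w // 4)
    blob = np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * (h / 8) ** 2))
    return np.clip(img + 0.5 * blob, 0.0, 1.0)

sample = add_saliency(add_shapes(noise_with_1_over_f_spectrum()))
```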



GridMix: Strong Regularization Through Local Context Mapping

August 2020 · 90 Reads · 44 Citations · Pattern Recognition

Recently developed regularization techniques improve a network's generalization by considering only the global context. As a result, the network tends to focus on a few of the most discriminative subregions of an image for prediction accuracy, leaving it sensitive to unseen or noisy data. To address this disadvantage, we introduce the concept of local context mapping, which predicts patch-level labels, and combine it with a local data augmentation method based on grid-based mixing, called GridMix. Through our analysis of intermediate representations, we show that GridMix can effectively regularize the network model. Finally, our evaluation results indicate that GridMix outperforms state-of-the-art techniques in classification and adversarial robustness, and achieves comparable performance in weakly supervised object localization.
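Below is a minimal PyTorch sketch of grid-based mixing with patch-level (local) labels, assuming a fixed grid and a random binary cell mask; the actual GridMix formulation may differ in how cells are chosen and how the local labels are supervised.

```python
import torch

def gridmix(x_a, y_a, x_b, y_b, grid=4):
    """Mix two image batches cell-by-cell on a grid x grid layout and
    return per-cell (local) labels alongside the mixed global label."""
    B, C, H, W = x_a.shape
    ch, cw = H // grid, W // grid
    mask = torch.randint(0, 2, (grid, grid), dtype=torch.bool)
    x = x_b.clone()
    patch_labels = torch.empty(B, grid, grid, dtype=torch.long)
    for i in range(grid):
        for j in range(grid):
            if mask[i, j]:
                x[:, :, i*ch:(i+1)*ch, j*cw:(j+1)*cw] = \
                    x_a[:, :, i*ch:(i+1)*ch, j*cw:(j+1)*cw]
            # Each cell carries the label of the image it came from.
            patch_labels[:, i, j] = torch.where(mask[i, j], y_a, y_b)
    lam = mask.float().mean()          # fraction of cells taken from A
    y_global = (y_a, y_b, lam)         # for the usual mixed CE loss
    return x, y_global, patch_labels   # patch_labels supervise the
                                       # local-context (patch) head
```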


Rethinking the Truly Unsupervised Image-to-Image Translation

June 2020 · 166 Reads

Every recent image-to-image translation model uses either image-level (i.e., input-output pairs) or set-level (i.e., domain labels) supervision at minimum. However, even set-level supervision can be a serious bottleneck for data collection in practice. In this paper, we tackle image-to-image translation in a fully unsupervised setting, i.e., with neither paired images nor domain labels. To this end, we propose a truly unsupervised image-to-image translation method (TUNIT) that simultaneously learns to separate image domains via an information-theoretic approach and to generate corresponding images using the estimated domain labels. Experimental results on various datasets show that the proposed method successfully separates domains and translates images across those domains. In addition, our model outperforms existing set-level supervised methods under a semi-supervised setting, where a subset of domain labels is provided. The source code is available at https://github.com/clovaai/tunit.
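The abstract's information-theoretic domain separation can be illustrated with an IIC-style mutual information objective for a guiding (clustering) network, whose argmax output serves as the pseudo domain label for the generator. The sketch below follows common practice for such losses and is not taken from the released TUNIT code.

```python
import torch

def iic_mutual_info_loss(p1, p2, eps=1e-8):
    """Maximize mutual information between cluster assignments of two
    augmented views. p1, p2: [B, K] softmax outputs of the guide net."""
    joint = (p1.unsqueeze(2) * p2.unsqueeze(1)).mean(dim=0)   # [K, K]
    joint = ((joint + joint.t()) / 2).clamp_min(eps)          # symmetrize
    marg_1 = joint.sum(dim=1, keepdim=True)
    marg_2 = joint.sum(dim=0, keepdim=True)
    mi = (joint * (joint.log() - marg_1.log() - marg_2.log())).sum()
    return -mi   # minimizing this maximizes MI

def pseudo_domain(guide, x, x_aug):
    """Estimated domain label = argmax cluster of the guiding network;
    the MI loss trains the guide without any domain annotations."""
    p1 = guide(x).softmax(dim=-1)
    p2 = guide(x_aug).softmax(dim=-1)
    return p1.argmax(dim=-1), iic_mutual_info_loss(p1, p2)
```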


PsyNet: Self-Supervised Approach to Object Localization Using Point Symmetric Transformation

April 2020 · 132 Reads · 33 Citations · Proceedings of the AAAI Conference on Artificial Intelligence

Existing co-localization techniques lose significant performance compared with weakly or fully supervised methods in both accuracy and inference time. In this paper, we overcome common drawbacks of co-localization techniques by utilizing a self-supervised learning approach. The major technical contributions of the proposed method are two-fold. 1) We devise a new geometric transformation, namely the point symmetric transformation, and utilize its parameters as artificial labels for self-supervised learning. This new transformation can also play the role of region-drop-based regularization. 2) We suggest a heat map extraction method, namely class-agnostic activation mapping, for computing the heat map from a network trained by self-supervision; it is done by computing the spatial attention map. Based on extensive evaluations, we observe that the proposed method sets new state-of-the-art performance on three fine-grained datasets for unsupervised object localization. Moreover, we show that the idea can be adopted, in a modified form, to solve the weakly supervised object localization task, outperforming the current state-of-the-art technique by a significant margin.
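For concreteness, the sketch below implements one reading of the two contributions: a point symmetric transformation (a 180° rotation about a random center, whose coordinates become the self-supervision target) and a class-agnostic activation map computed from feature magnitudes. The border handling via torch.roll and the exact CAAM normalization are assumptions.

```python
import torch

def point_symmetric_transform(img, cy, cx):
    """Point symmetry about (cy, cx): flip both axes, then shift so the
    chosen center maps to itself. img: [C, H, W]; cy, cx in pixels."""
    _, H, W = img.shape
    flipped = torch.flip(img, dims=(1, 2))   # 180-degree rotation
    return torch.roll(flipped,
                      shifts=(2 * cy - H + 1, 2 * cx - W + 1),
                      dims=(1, 2))

def caam(features):
    """Class-agnostic activation map: channel-wise mean of absolute
    activations of the last conv block. features: [B, C, h, w]."""
    heat = features.abs().mean(dim=1)        # [B, h, w]
    heat = heat - heat.amin(dim=(1, 2), keepdim=True)
    return heat / heat.amax(dim=(1, 2), keepdim=True).clamp_min(1e-8)

# Self-supervision: sample (cy, cx), transform the image, and train the
# network to regress (cy, cx) back from the transformed image.
```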


Citations (7)


... To address the challenge of noisy labels, numerous research efforts have proposed various techniques to effectively learn from data with noisy labels. These techniques include identifying and correcting mislabeled instances [11,12], adapting learning strategies for noisy labeled data [13][14][15][16], utilizing noise-robust loss functions such as normalized, generalized, symmetrical, and composite losses [13][14][15][16][17], and exploring semi-supervised learning approaches [3,18]. Among these, noise-robust loss functions and semi-supervised learning have gained traction due to their impressive performance in noisy label scenarios and ease of implementation. ...

Reference:

NoRD: A framework for noise-resilient self-distillation through relative supervision
Learning from Better Supervision: Self-distillation for Learning with Noisy Labels
  • Citing Conference Paper
  • August 2022

... Leveraging synthetic data in model training. We adopt two approaches that effectively leverage synthetic data in model training: pre-training and finetuning ('PT-FT') [3,16,22,37] and progressive transformation learning (PTL) [47]. 'PT-FT' is the most widely used transfer learning method, which involves pretraining a model on a synthetic dataset and then fine-tuning it on a real dataset. ...

Commonality in Natural Images Rescues GANs: Pretraining GANs with Generic and Privacy-free Synthetic Data
  • Citing Conference Paper
  • June 2022

... In bias-aligned examples, ground truth labels are correlated with both robust and biased features. In bias-conflicting examples, labels are correlated only with robust features. Clearly, the issues of models trained on biased datasets stem from the prevalence of bias-aligned samples. ...

Logit Mixing Training for More Reliable and Accurate Prediction
  • Citing Conference Paper
  • July 2022

... Introducing a different approach, CycleGAN [74], DiscoGAN [29], and DualGAN [70] utilize unpaired datasets by implementing a cycle consistency loss, which guarantees that the mapping from source to target and back to source retains the original content. Many subsequent models [3,6,7,65,67] utilize cycle consistency for unpaired training. [20,34,62] assume that the representation can be disentangled into domain-invariant semantic structure features and domain-specific style features. ...

Rethinking the Truly Unsupervised Image-to-Image Translation
  • Citing Conference Paper
  • October 2021

... This methodology not only enriches the training data but also encourages models to learn more generalized representations by interpolating between different classes. Techniques such as MixUp, CutMix, GridMix, and RICAP exemplify this strategy [37][38][39][40], each introducing unique variations in how images are blended and how their corresponding labels are combined, leading to improved model robustness and performance on a variety of tasks. ...

GridMix: Strong Regularization Through Local Context Mapping
  • Citing Article
  • August 2020

Pattern Recognition

... We also display some visualization results to analyze the learned feature representations in Fig. 4. We employ class-agnostic activation maps (CAAM) [78] to reveal the spatio-temporal distributions of the extracted features. Generally, vanilla contrastive learning based on SimCLR [26] leads the model to focus on representative background cues, e.g., the soccer field, swimming pool and fitness equipment. ...

PsyNet: Self-Supervised Approach to Object Localization Using Point Symmetric Transformation
  • Citing Article
  • April 2020

Proceedings of the AAAI Conference on Artificial Intelligence

... Wang et al. (2009), EBGAN (energy-based GAN that viewed the generator and discriminator as energy functions) by Zhao et al. (2016), WGAN (Wasserstein GAN based on the Earth Mover distance) by Arjovsky et al. (2017), MVP (Multi-View Perceptron architecture to generate multi-view images) by Zhu et al. (2014), and Editable GAN (a framework that simultaneously generates and manipulates face samples with desired attributes) by Baek et al. (2018). Table 1 summarizes face manipulation methods. ...

Editable Generative Adversarial Networks: Generating and Editing Faces Simultaneously
  • Citing Chapter
  • May 2019

Lecture Notes in Computer Science