SangKeun Lee’s research while affiliated with Korea University and other places


Publications (128)


Figure 1: (a) Frequency histograms of material concepts and general words on 150K materials-related scientific papers. (b) Tokenization results of material concepts using conventional tokenization and MATTER (ours).
Figure 5: Comparison of Macro-F1 scores for ChemDataExtractor and MatDetector across λ values.
Figure 6: Comparison of Micro-F1 scores for ChemDataExtractor and MatDetector across different λ values.
Detailed configuration of the main model and training hyperparameters for the classification task.
Comparison of Extractable Entity Types and Training Data in ChemDataExtractor and MatDetector.


Incorporating Domain Knowledge into Materials Tokenization
  • Preprint
  • File available

June 2025 · 2 Reads

Yerim Oh · Jun-Hyung Park · Junho Kim · [...] · SangKeun Lee

While language models are increasingly utilized in materials science, typical models rely on frequency-centric tokenization methods originally developed for natural language processing. However, these methods frequently produce excessive fragmentation and semantic loss, failing to maintain the structural and semantic integrity of material concepts. To address this issue, we propose MATTER, a novel tokenization approach that integrates material knowledge into tokenization. Based on MatDetector, trained on our materials knowledge base, and a re-ranking method that prioritizes material concepts during token merging, MATTER maintains the structural integrity of identified material concepts and prevents their fragmentation during tokenization, ensuring that their semantic meaning remains intact. The experimental results demonstrate that MATTER outperforms existing tokenization methods, achieving average performance gains of 4% and 2% on generation and classification tasks, respectively. These results underscore the importance of domain knowledge for tokenization strategies in scientific text processing. Our code is available at https://github.com/yerimoh/MATTER.
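To make the re-ranking idea concrete, here is a minimal, hypothetical sketch of how material-concept awareness could be folded into a BPE-style merge loop. The concept set, the λ-style boost, and all function names are illustrative stand-ins; the actual MATTER pipeline relies on a trained MatDetector rather than a lookup table.

```python
# Minimal sketch of concept-aware merge re-ranking in a BPE-style tokenizer.
# MATTER uses a trained MatDetector; here a hand-made concept set and a fixed
# boost (LAMBDA) stand in for it, purely for illustration.
from collections import Counter

MATERIAL_CONCEPTS = {"LiFePO4", "graphene", "perovskite"}  # hypothetical knowledge base
LAMBDA = 2.0  # hypothetical weight given to material-concept merges

def material_score(token: str) -> float:
    """Return a boost if the candidate token is part of a known material concept."""
    return LAMBDA if any(token in concept for concept in MATERIAL_CONCEPTS) else 0.0

def best_merge(words):
    """Pick the next pair to merge, re-ranking frequency counts with the material boost."""
    pair_counts = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += 1
    if not pair_counts:
        return None
    return max(pair_counts, key=lambda p: pair_counts[p] * (1.0 + material_score("".join(p))))

def merge_pair(words, pair):
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

corpus = [list("LiFePO4 cathode".replace(" ", "_"))]  # toy corpus, character-level start
for _ in range(10):  # a handful of merge steps
    pair = best_merge(corpus)
    if pair is None:
        break
    corpus = merge_pair(corpus, pair)
print(corpus)
```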




C2A: Client-Customized Adaptation for Parameter-Efficient Federated Learning

October 2024 · 7 Reads

Despite the versatility of pre-trained language models (PLMs) across domains, their large memory footprints pose significant challenges in federated learning (FL), where the training model has to be distributed between a server and clients. One potential solution to bypass such constraints is parameter-efficient fine-tuning (PEFT) in the context of FL. However, we have observed that typical PEFT tends to suffer severely from client heterogeneity in FL scenarios, resulting in unstable and slow convergence. In this paper, we propose Client-Customized Adaptation (C2A), a novel hypernetwork-based FL framework that generates client-specific adapters by conditioning on client information. Owing to the effectiveness of hypernetworks in generating customized weights by learning to adapt to the differing characteristics of inputs, C2A can maximize the utility of shared model parameters while minimizing the divergence caused by client heterogeneity. To verify the efficacy of C2A, we perform extensive evaluations on FL scenarios involving heterogeneity in label and language distributions. Comprehensive evaluation results clearly support the superiority of C2A in terms of both efficiency and effectiveness in FL scenarios.
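As a rough illustration of the hypernetwork idea, the sketch below generates bottleneck-adapter weights from a client embedding. Dimensions, the client-embedding scheme, and module names are assumptions for illustration, not the paper's exact architecture.

```python
# A minimal sketch of a hypernetwork that emits client-specific adapter weights,
# in the spirit of C2A. All sizes below are illustrative assumptions.
import torch
import torch.nn as nn

class AdapterHyperNet(nn.Module):
    def __init__(self, client_dim=16, hidden=768, bottleneck=32):
        super().__init__()
        self.hidden, self.bottleneck = hidden, bottleneck
        # One shared hypernetwork produces the down- and up-projection weights.
        self.gen_down = nn.Linear(client_dim, hidden * bottleneck)
        self.gen_up = nn.Linear(client_dim, bottleneck * hidden)

    def forward(self, client_embedding):
        w_down = self.gen_down(client_embedding).view(self.bottleneck, self.hidden)
        w_up = self.gen_up(client_embedding).view(self.hidden, self.bottleneck)
        return w_down, w_up

def adapter_forward(x, w_down, w_up):
    """Standard bottleneck adapter with a residual connection."""
    return x + torch.relu(x @ w_down.T) @ w_up.T

hypernet = AdapterHyperNet()
client_embedding = torch.randn(16)          # e.g. derived from local label/language statistics
w_down, w_up = hypernet(client_embedding)   # client-customized adapter weights
hidden_states = torch.randn(4, 128, 768)    # (batch, seq_len, hidden) from a frozen PLM layer
out = adapter_forward(hidden_states, w_down, w_up)
print(out.shape)  # torch.Size([4, 128, 768])
```

Only the hypernetwork (shared across clients) is communicated and updated, which is what lets the shared parameters serve heterogeneous clients.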


Complementary effect of incorporating CleaR into NLL methods on SST-5 (60% symmetric noise).
Peak and Average accuracy (%) on SST-5 under different levels of instance-dependent noise.
Peak and Average accuracy (%) on TREC and 20NewsGroups with 60% symmetric noisy labels. Best results are highlighted in boldface.
CleaR: Towards Robust and Generalized Parameter-Efficient Fine-Tuning for Noisy Label Learning

October 2024 · 5 Reads

Parameter-efficient fine-tuning (PEFT) has enabled the efficient optimization of cumbersome language models in real-world settings. However, as datasets in such environments often contain noisy labels that adversely affect performance, PEFT methods are inevitably exposed to noisy labels. Despite this challenge, the adaptability of PEFT to noisy environments remains underexplored. To bridge this gap, we investigate various PEFT methods under noisy labels. Interestingly, our findings reveal that PEFT has difficulty memorizing noisy labels due to its inherently limited capacity, resulting in robustness. However, we also find that this limited capacity simultaneously makes PEFT more vulnerable to interference from noisy labels, impeding the learning of clean samples. To address this issue, we propose Clean Routing (CleaR), a novel routing-based PEFT approach that adaptively activates PEFT modules. In CleaR, PEFT modules are preferentially exposed to clean data while bypassing the noisy ones, thereby minimizing the noisy influence. To verify the efficacy of CleaR, we perform extensive experiments on diverse configurations of noisy labels. The results convincingly demonstrate that CleaR leads to substantially improved performance in noisy environments.
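The routing idea can be sketched as follows: likely-clean samples pass through the PEFT module, while likely-noisy ones bypass it via the residual path. The small-loss heuristic and hard threshold used here are simple stand-ins; the paper's actual routing rule may differ.

```python
# A minimal sketch of clean-sample routing around a PEFT module, in the spirit of
# CleaR. The clean/noisy decision uses a toy small-loss heuristic, not the paper's rule.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RoutedAdapter(nn.Module):
    def __init__(self, hidden=768, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)

    def forward(self, x, route_mask):
        """route_mask: (batch,) boolean; True = likely-clean sample, use the adapter."""
        delta = self.up(F.relu(self.down(x)))          # (batch, seq, hidden)
        gate = route_mask.float().view(-1, 1, 1)       # bypass (identity) for noisy samples
        return x + gate * delta

# Toy routing decision from per-sample losses (small loss ≈ likely clean).
logits = torch.randn(8, 5)                      # (batch, num_classes) from the backbone
labels = torch.randint(0, 5, (8,))              # possibly noisy labels
per_sample_loss = F.cross_entropy(logits, labels, reduction="none")
route_mask = per_sample_loss < per_sample_loss.median()

adapter = RoutedAdapter()
hidden_states = torch.randn(8, 16, 768)
out = adapter(hidden_states, route_mask)
print(out.shape)  # torch.Size([8, 16, 768])
```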


MELT: Materials-aware Continued Pre-training for Language Model Adaptation to Materials Science

October 2024 · 17 Reads

We introduce a novel continued pre-training method, MELT (MatEriaLs-aware continued pre-Training), specifically designed to efficiently adapt pre-trained language models (PLMs) to materials science. Unlike previous adaptation strategies that focus solely on constructing a domain-specific corpus, MELT comprehensively considers both the corpus and the training strategy, given that the materials science corpus has characteristics distinct from those of other domains. To this end, we first construct a comprehensive materials knowledge base from the scientific corpus by building semantic graphs. Leveraging this extracted knowledge, we integrate a curriculum into the adaptation process that begins with familiar and generalized concepts and progressively moves toward more specialized terms. We conduct extensive experiments across diverse benchmarks to verify the effectiveness and generality of MELT. A comprehensive evaluation convincingly supports the strength of MELT, demonstrating superior performance compared to existing continued pre-training methods. The in-depth analysis also shows that MELT enables PLMs to represent materials entities more effectively than existing adaptation methods, thereby highlighting its broad applicability across a wide spectrum of materials science.
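A minimal sketch of the curriculum idea follows, assuming concept frequency as a proxy for familiarity; the toy corpus, concept list, and two-stage split are illustrative only, not MELT's actual knowledge-base construction.

```python
# A minimal sketch of a frequency-based curriculum over materials concepts,
# loosely in the spirit of MELT: frequent (familiar) concepts are scheduled
# first, rarer (specialized) ones later. All inputs are toy values.
from collections import Counter

corpus = [
    "graphene oxide films show high conductivity",
    "graphene composites with LiFePO4 cathodes",
    "perovskite solar cells degrade under humidity",
]
concepts = ["graphene", "LiFePO4", "perovskite", "graphene oxide"]  # hypothetical KB entries

# Count how often each concept appears; frequency serves as a proxy for familiarity.
freq = Counter()
for sentence in corpus:
    for concept in concepts:
        if concept in sentence:
            freq[concept] += 1

# Order concepts from general (frequent) to specialized (rare) and split into stages.
ordered = [c for c, _ in freq.most_common()]
num_stages = 2
stage_size = (len(ordered) + num_stages - 1) // num_stages
curriculum = [ordered[i:i + stage_size] for i in range(0, len(ordered), stage_size)]

for stage, stage_concepts in enumerate(curriculum, 1):
    # In continued pre-training, each stage would target (e.g. mask) these concepts.
    print(f"stage {stage}: {stage_concepts}")
```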




Zero-shot Commonsense Reasoning over Machine Imagination

October 2024 · 14 Reads

Recent approaches to zero-shot commonsense reasoning have enabled Pre-trained Language Models (PLMs) to learn a broad range of commonsense knowledge without being tailored to specific situations. However, they often suffer from human reporting bias inherent in textual commonsense knowledge, leading to discrepancies in understanding between PLMs and humans. In this work, we aim to bridge this gap by introducing an additional information channel to PLMs. We propose Imagine (Machine Imagination-based Reasoning), a novel zero-shot commonsense reasoning framework designed to complement textual inputs with visual signals derived from machine-generated images. To achieve this, we enhance PLMs with imagination capabilities by incorporating an image generator into the reasoning process. To guide PLMs in effectively leveraging machine imagination, we create a synthetic pre-training dataset that simulates visual question-answering. Our extensive experiments on diverse reasoning benchmarks and analysis show that Imagine outperforms existing methods by a large margin, highlighting the strength of machine imagination in mitigating reporting bias and enhancing generalization capabilities.
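As a rough, hedged stand-in for the "imagine, then reason" pipeline, the sketch below generates an image for a question with an off-the-shelf text-to-image model and scores answer options against it with CLIP. Imagine itself feeds visual signals into a PLM trained on synthetic VQA-style data, so this only illustrates the general idea, not the paper's method; the checkpoint names are examples.

```python
# A rough stand-in for imagination-augmented reasoning: generate an image for the
# question, then ground the answer options in it with CLIP. Not the paper's pipeline.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

question = "Where would you most likely find a penguin?"
options = ["a desert", "an iceberg", "a rainforest"]

# 1) Machine imagination: turn the question into an image (any text-to-image checkpoint works).
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
image = pipe(question).images[0]

# 2) Ground the answer options in the generated image.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
inputs = processor(text=options, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = clip(**inputs).logits_per_image  # (1, num_options)
print(options[logits.argmax().item()])
```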


Performances of teacher, mentor, and student models across four different complex reasoning tasks, where the backbone model is FlanT5. GPT-3.5-Turbo results with an asterisk (*) were excerpted from (Chen et al., 2023). The best and second-best results are highlighted in boldface and underline, respectively.
Ablation study of Mentor-KD on Tracking Shuffled Objects and Last Letter Concatenation. We employ large models of each backbone model as mentors and small models as students.
Statistics of datasets used in our study.
Mentor-KD: Making Small Language Models Better Multi-step Reasoners

October 2024 · 15 Reads

Large Language Models (LLMs) have displayed remarkable performance across various complex tasks by leveraging Chain-of-Thought (CoT) prompting. Recently, studies have proposed a Knowledge Distillation (KD) approach, reasoning distillation, which transfers the reasoning ability of LLMs by fine-tuning smaller language models on multi-step rationales generated by LLM teachers. However, these approaches inadequately consider two challenges regarding insufficient distillation sets from the LLM teacher model, namely 1) data quality and 2) soft label provision. In this paper, we propose Mentor-KD, which effectively distills the multi-step reasoning capability of LLMs into smaller LMs while addressing the aforementioned challenges. Specifically, we exploit a mentor, an intermediate-sized, task-specific fine-tuned model, to augment additional CoT annotations and to provide soft labels for the student model during reasoning distillation. We conduct extensive experiments and confirm Mentor-KD's effectiveness across various models and complex reasoning tasks.
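A minimal sketch of what a mentor-based distillation objective could look like: a hard-label term on mentor-augmented rationale tokens plus a soft-label KL term against the mentor's output distribution. The mixing weight, temperature, and shapes are assumptions, not the paper's exact formulation.

```python
# Sketch of a student objective for mentor-based reasoning distillation:
# cross-entropy on rationale/answer tokens plus KL to the mentor's soft labels.
import torch
import torch.nn.functional as F

def mentor_kd_loss(student_logits, mentor_logits, target_ids, alpha=0.5, temperature=2.0):
    """student_logits, mentor_logits: (batch, seq_len, vocab); target_ids: (batch, seq_len)."""
    vocab = student_logits.size(-1)
    # Hard-label term on rationale/answer tokens augmented (or filtered) by the mentor.
    ce = F.cross_entropy(student_logits.view(-1, vocab), target_ids.view(-1))
    # Soft-label term: match the mentor's tempered output distribution.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(mentor_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * ce + (1 - alpha) * kl

student_logits = torch.randn(2, 10, 32000)
mentor_logits = torch.randn(2, 10, 32000)
target_ids = torch.randint(0, 32000, (2, 10))
print(mentor_kd_loss(student_logits, mentor_logits, target_ids))
```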


Citations (60)


... MolFormer (Ross et al., 2022) scales up this technique while incorporating rotary positional embeddings, efficiently pre-training on SMILES sequences from 1.1 billion molecules. MolTRES (Park et al., 2024) introduces a hierarchical masking strategy for SMILES sequences targeting multiple granularities of chemical substructures, from individual atoms to entire functional groups. However, these SMILES-based transformers often neglect the topological relationships inherent to molecular graphs, sacrificing structurally informed representation learning for modeling efficiency (Nguyen et al., 2024). ...

Reference:

FragmentNet: Adaptive Graph Fragmentation for Graph-to-Sequence Molecular Representation Learning
MolTRES: Improving Chemical Language Representation Learning for Molecular Property Prediction
  • Citing Conference Paper
  • January 2024

... Generalized Knowledge Distillation (GKD) [173] introduces skew KL divergence to stabilize gradients and enhance performance, using an adaptive off-policy approach to minimize noisy feedback and improve efficiency. Black-Box KD relies only on teacher outputs without having access to model internals [48,182,271,360]. Methods like Distilling ...

Mentor-KD: Making Small Language Models Better Multi-step Reasoners
  • Citing Conference Paper
  • January 2024

... Furthermore, recent studies have integrated emerging technologies such as MentorNet [19], iterative learning frameworks [17], O2U-net [44], DivideMix [5], ELR+ [18], PML-NLI [45], contrastive learning [46,47], InstanceGM [48], CoDis [49], and Clean Routing (CleanR) [50]. ...

Towards Robust and Generalized Parameter-Efficient Fine-Tuning for Noisy Label Learning
  • Citing Conference Paper
  • January 2024

... Afterward, we filter the annotations generated by the LLM. Following previous works (Magister et al., 2023; Fu et al., 2023; Lee et al., 2024), we preserve annotations whose final prediction ŷᵗᵢ matches the golden answer yᵢ of a sample. Then, the annotations are reformatted into a question-label format for training mentor and student models. ...

Coconut: Contextualized Commonsense Unified Transformers for Graph-Based Commonsense Augmentation of Language Models
  • Citing Conference Paper
  • January 2024

... Direct prompt tuning methods have developed into four main branches: (1) General approaches that directly optimize prompt parameters, including Prompt Tuning that prepends trainable vectors to input while freezing the language model (Lester et al., 2021), XPrompt that employs hierarchical structured pruning to identify and retain important prompt tokens (Ma et al., 2022), and P-Tuning v2 that introduces deep prompts across all transformer layers (Liu et al., 2022); (2) Encoder-based methods that leverage additional modules, such as P-Tuning that incorporates an encoder to learn dependencies between continuous embeddings (Liu et al., 2023), Residual Prompt Tuning (RPT) that employs a residual part with down/up-projection layers for stable optimization (Razdaibiedina et al., 2023), and Prefix Tuning that prepends trainable key-value pairs at each layer through a reparameterization section (Li and Liang, 2021); (3) Decomposition methods that decompose prompt embeddings, including Decomposed Prompt Tuning (DPT) that applies low-rank matrix decomposition to reduce parameter count (Xiao et al., 2023), and DePT that combines shorter soft prompts with low-rank updates to word embeddings (Shi and Lipani, 2024); and (4) MoE approaches such as Sparse Mixture-of-Prompts (SMoP) that employs multiple shorter prompts with a dynamic router to route inputs to the most suitable soft prompt (Choi et al., 2023). ...

SMoP: Towards Efficient and Effective Prompt Tuning with Sparse Mixture-of-Prompts
  • Citing Conference Paper
  • January 2023
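To illustrate the SMoP-style routing described in the excerpt above, here is a minimal sketch in which a light router selects one of several short soft prompts per input (top-1) and prepends it to the input embeddings. The sizes, mean-pooled routing signal, and gating choice are illustrative assumptions rather than the paper's exact design.

```python
# Minimal sketch of sparse mixture-of-prompts routing: several short soft
# prompts, a light router, top-1 (sparse) selection per input. Toy sizes.
import torch
import torch.nn as nn

class SparsePromptRouter(nn.Module):
    def __init__(self, hidden=768, num_prompts=4, prompt_len=5):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, prompt_len, hidden) * 0.02)
        self.router = nn.Linear(hidden, num_prompts)

    def forward(self, input_embeds):
        # Route on the mean-pooled input representation, top-1 selection.
        pooled = input_embeds.mean(dim=1)                  # (batch, hidden)
        scores = torch.softmax(self.router(pooled), dim=-1)
        idx = scores.argmax(dim=-1)                        # (batch,)
        chosen = self.prompts[idx]                         # (batch, prompt_len, hidden)
        # Scale by the router probability so the router still receives gradients.
        gate = scores.gather(-1, idx.unsqueeze(-1)).unsqueeze(-1)
        return torch.cat([gate * chosen, input_embeds], dim=1)

router = SparsePromptRouter()
input_embeds = torch.randn(3, 20, 768)   # embeddings from a frozen PLM's embedding layer
out = router(input_embeds)
print(out.shape)  # torch.Size([3, 25, 768])
```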

... Token reduction strategies in language modeling have evolved from early optimizations for BERT [37,44,46,47,105] to techniques specifically designed for LLMs. PoWER-BERT [29] introduces progressive word-vector elimination by removing redundant token representations based on selfattention dynamics, improving inference efficiency. ...

Leap-of-Thought: Accelerating Transformers via Dynamic Token Routing
  • Citing Conference Paper
  • January 2023

... Debiased Representations Existing methods focus on weak-learner guided pruning (Meissner et al., 2022), disentangling robust and spurious representations (Gao et al., 2022), decision boundaries (Lyu et al., 2022), and attention patterns with PoE (Wang et al., 2023), training biased models with one-vs-rest approach (Jeon et al., 2023), and amplifying bias in training set with debiased test set (Reif and Schwartz, 2023). ...

Improving Bias Mitigation through Bias Experts in Natural Language Understanding
  • Citing Conference Paper
  • January 2023

... KoLD (Jeong et al., 2022) and K-MHaS specified the target group of the offensive language. Subsequently, KODOLI (Park et al., 2023b) provided labels that refine the degree of offensiveness, and, building upon these efforts, K-HATERS (Park et al., 2023a) was constructed to incorporate the strengths of the preceding datasets. ...

“Why do I feel offended?” - Korean Dataset for Offensive Language Identification
  • Citing Conference Paper
  • January 2023

... Differing from the likes of BERT and BioBERT that randomly mask 15% of the tokens in each batch, we mask only the tokens that form a concept within the previously curated curriculum C while ensuring that no more than 20% of the tokens in any batch are masked. As concepts can span multiple tokens, we follow Lee et al. (2022)'s Whole Concept Masking (WCM) strategy such that all the tokens comprising a single concept are simultaneously masked. As is the standard, we replace 80% of the masked concepts with a mask token, replace another 10% with a random token and do not replace the remaining 10%. ...

Efficient Pre-training of Masked Language Model via Concept-based Curriculum Masking
  • Citing Conference Paper
  • January 2022
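The masking recipe quoted above is concrete enough to sketch. The toy function below masks only curriculum-concept spans, masks each chosen concept as a whole, caps masked tokens at 20% of the sequence, and applies the 80/10/10 mask/random/keep split per masked concept; the token IDs and concept spans are toy values, and this is an illustration of the quoted recipe rather than the authors' implementation.

```python
# Toy Whole Concept Masking: mask entire concept spans, at most 20% of tokens,
# with the usual 80/10/10 mask/random/keep split applied per concept.
import random

MASK_ID, VOCAB_SIZE, MAX_MASK_RATIO = 103, 30000, 0.20

def whole_concept_mask(token_ids, concept_spans):
    """concept_spans: list of (start, end) index ranges covering curriculum concepts."""
    ids = list(token_ids)
    spans = list(concept_spans)
    random.shuffle(spans)
    budget = int(len(ids) * MAX_MASK_RATIO)
    masked = 0
    for start, end in spans:
        span_len = end - start
        if masked + span_len > budget:
            continue  # whole-concept constraint: mask all of a concept or none of it
        r = random.random()
        for i in range(start, end):
            if r < 0.8:
                ids[i] = MASK_ID
            elif r < 0.9:
                ids[i] = random.randrange(VOCAB_SIZE)
            # else: keep the original token (the remaining 10%)
        masked += span_len
    return ids

tokens = list(range(1000, 1030))              # a toy 30-token sequence
spans = [(2, 5), (10, 12), (20, 24)]          # positions of multi-token concepts
print(whole_concept_mask(tokens, spans))
```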

... First, sub-character decomposition simply breaks syllable-characters into subcharacters, capturing crucial subword information (e.g., the -ㄹ future tense marker in 갈 'will go'). This is common in studies using traditional word embeddings (Park et al., 2018; Kim et al., 2022). Second, morpheme analysis, by employing a morphological analyzer, enables the models to detect subword morphemes directly, regardless of subword levels. ...

Break it Down into BTS: Basic, Tiniest Subword Units for Korean
  • Citing Conference Paper
  • January 2022