Lars Kai Hansen’s research while affiliated with Technical University of Denmark and other places


Publications (397)


Figure 1: The overall evaluation setup: A collection of GLLMs, including closed-source (lock symbol), instruct-tuned (bulls-eye), and multilingual (globe) ones, were evaluated in Danish across diverse use-case scenarios.
Figure 6: A screenshot from the leaderboard frontend, which lets users explore how model results change with different metric choices, inspect model output examples, and read further details on the evaluation scenarios.
Figure 7: Model Danish capabilities based on human feedback. Values are normalized Bradley-Terry coefficients; two models are connected if their coefficients are not significantly different in the ranking model.
Danoliteracy of Generative, Large Language Models
  • Preprint
  • File available

October 2024

·

13 Reads

Søren Vejlgaard Holm

·

Lars Kai Hansen

·

Martin Carsten Nielsen

The language technology moonshot moment of Generative, Large Language Models (GLLMs) was not limited to English: These models brought a surge of technological applications, investments and hype to low-resource languages as well. However, the capabilities of these models in languages such as Danish were until recently difficult to verify beyond qualitative demonstrations due to a lack of applicable evaluation corpora. We present a GLLM benchmark to evaluate Danoliteracy, a measure of Danish language and cultural competency, across eight diverse scenarios such as Danish citizenship tests and abstractive social media question answering. This limited-size benchmark is found to produce a robust ranking that correlates with human feedback at ρ ∼ 0.8, with GPT-4 and Claude Opus models achieving the highest rankings. Analyzing these model results across scenarios, we find one strong underlying factor explaining 95% of scenario performance variance for GLLMs in Danish, suggesting a g factor of model consistency in language adaptation.
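As a rough illustration of the two headline analyses (the single-factor share of scenario variance and the correlation with human feedback), the sketch below runs a PCA and a Spearman correlation on a hypothetical model-by-scenario score matrix; all sizes and values are invented placeholders, not the benchmark's results.

```python
# Minimal sketch: a hypothetical model-by-scenario score matrix, the share of
# variance captured by the first principal component ("g factor"), and the
# rank correlation with placeholder human-feedback (Bradley-Terry) scores.
import numpy as np
from scipy.stats import spearmanr
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
scores = rng.uniform(0.2, 0.9, size=(10, 8))   # 10 models x 8 scenarios (placeholder values)
human_bt = rng.normal(size=10)                 # placeholder Bradley-Terry coefficients

g_share = PCA().fit(scores).explained_variance_ratio_[0]   # variance explained by first factor
rho, _ = spearmanr(scores.mean(axis=1), human_bt)          # benchmark vs. human-feedback scores

print(f"first-factor variance share: {g_share:.2f}, Spearman rho: {rho:.2f}")
```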


BiSSL: Bilevel Optimization for Self-Supervised Pre-Training and Fine-Tuning

October 2024

·

21 Reads

In this work, we present BiSSL, a first-of-its-kind training framework that introduces bilevel optimization to enhance the alignment between the pretext pre-training and downstream fine-tuning stages in self-supervised learning. BiSSL formulates the pretext and downstream task objectives as the lower- and upper-level objectives in a bilevel optimization problem and serves as an intermediate training stage within the self-supervised learning pipeline. By more explicitly modeling the interdependence of these training stages, BiSSL facilitates enhanced information sharing between them, ultimately leading to a backbone parameter initialization that is better suited for the downstream task. We propose a training algorithm that alternates between optimizing the two objectives defined in BiSSL. Using a ResNet-18 backbone pre-trained with SimCLR on the STL10 dataset, we demonstrate that our proposed framework consistently achieves improved or competitive classification accuracies across various downstream image classification datasets compared to the conventional self-supervised learning pipeline. Qualitative analyses of the backbone features further suggest that BiSSL enhances the alignment of downstream features in the backbone prior to fine-tuning.
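For intuition only, here is a schematic sketch of alternating between a lower-level pretext objective and an upper-level downstream objective that share one backbone. It is not the BiSSL algorithm itself (which couples the two levels through the bilevel formulation); the toy modules, losses, and data are assumptions made for illustration.

```python
# Schematic alternation between a lower-level (pretext) and an upper-level
# (downstream) objective sharing a backbone. All modules and data are toy stand-ins.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))
pretext_head = nn.Linear(64, 32)      # e.g. a projection head for the SSL task
downstream_head = nn.Linear(64, 10)   # e.g. a linear classifier for the downstream task

opt_lower = torch.optim.SGD(list(backbone.parameters()) + list(pretext_head.parameters()), lr=1e-2)
opt_upper = torch.optim.SGD(list(backbone.parameters()) + list(downstream_head.parameters()), lr=1e-2)

x_ssl = torch.randn(128, 32)                                      # unlabeled pretext data
x_dn, y_dn = torch.randn(128, 32), torch.randint(0, 10, (128,))   # labeled downstream data

for step in range(100):
    # Lower level: pretext objective (here a toy reconstruction loss).
    opt_lower.zero_grad()
    loss_lower = ((pretext_head(backbone(x_ssl)) - x_ssl) ** 2).mean()
    loss_lower.backward()
    opt_lower.step()

    # Upper level: downstream objective, steering the shared backbone.
    opt_upper.zero_grad()
    loss_upper = nn.functional.cross_entropy(downstream_head(backbone(x_dn)), y_dn)
    loss_upper.backward()
    opt_upper.step()
```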




Fig. 1. Conceptual overview of mimicking networks. Left: The general transformer model, consisting of a feature extractor module, the transformer stack, and a classification/probing layer. We denote the part of the network up to layer i by f^(i), the part from layer i to L by g^(i), and the classification layer by h. Middle: A 2-layer mimicking network where m_f^(i) and m^(i)
Fig. 2. Analysis of redundancy of layers using similarity measures and pruning of layers: (a) Similarity between layers of wav2vec2 and wavLM. All three metrics (cosine similarity, CKA, mutual kNN) reveal a block structure. (b) Effect of pruning on performance, using four different pruning objectives. Up to 45% of layers can be pruned while maintaining 95% of accuracy. After pruning most layers, model performance drops to random chance. Uncertainty ranges cover the empirical 2.5 and 97.5 quantiles obtained from N = 5 runs. (c) Visualisation of pruned layers on top of the kNN similarity matrix (with 50% performance threshold). Light layers indicate pruned layers, dark layers are still present. Backward and forward pruning (left + middle) only preserve performance as long as both blocks are still present. Pruning based on kNN-BI (right) prunes layers mainly in the first block.
How Redundant Is the Transformer Stack in Speech Representation Models?

September 2024

·

21 Reads

Self-supervised speech representation models, particularly those leveraging transformer architectures, have demonstrated remarkable performance across various tasks such as speech recognition, speaker identification, and emotion detection. Recent studies on transformer models have revealed high redundancy between layers and the potential for significant pruning, which we investigate here for transformer-based speech representation models. We perform a detailed analysis of layer similarity in speech representation models using three similarity metrics: cosine similarity, centered kernel alignment, and mutual nearest-neighbor alignment. Our findings reveal a block-like structure of high similarity, suggesting two main processing steps and significant redundancy of layers. We demonstrate the effectiveness of pruning transformer-based speech representation models without the need for post-training, achieving up to 40% reduction in transformer layers while maintaining over 95% of the model's predictive capacity. Furthermore, we employ a knowledge distillation method to substitute the entire transformer stack with mimicking layers, reducing the network size by 95-98% and the inference time by up to 94%. This substantial decrease in computational load occurs without considerable performance loss, suggesting that the transformer stack is almost completely redundant for downstream applications of speech representation models.
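As a pointer to how one of the similarity metrics named above can be computed, this is a minimal linear-CKA sketch applied to placeholder activation matrices standing in for two transformer layers; shapes and data are illustrative assumptions, not the models studied in the paper.

```python
# Minimal sketch of linear centered kernel alignment (CKA) between two layers'
# activations, each of shape (n frames, feature dim).
import numpy as np

def linear_cka(x, y):
    """Linear CKA between centered activation matrices x and y of shape (n, d)."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    cross = np.linalg.norm(x.T @ y, "fro") ** 2
    return cross / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro"))

rng = np.random.default_rng(0)
layer_i = rng.normal(size=(500, 768))                   # placeholder layer activations
layer_j = layer_i + 0.1 * rng.normal(size=(500, 768))   # a nearly redundant neighbouring layer
print(f"CKA(i, j) = {linear_cka(layer_i, layer_j):.3f}")
```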


Fig. 3. Correlation of OOOA and convexity across models. High correlation in the first half of models (0.91). Lower and even negative correlation in the second half for pretrained (0.4) and fine-tuned models (-0.54) respectively.
huggingface.co version of all analyzed models.
Correlation scores between OOOA and convexity on a model basis.
Connecting Concept Convexity and Human-Machine Alignment in Deep Neural Networks

September 2024

·

32 Reads

Understanding how neural networks align with human cognitive processes is a crucial step toward developing more interpretable and reliable AI systems. Motivated by theories of human cognition, this study examines the relationship between convexity in neural network representations and human-machine alignment based on behavioral data. We identify a correlation between these two dimensions in pretrained and fine-tuned vision transformer models. Our findings suggest that the convex regions formed in latent spaces of neural networks to some extent align with human-defined categories and reflect the similarity relations humans use in cognitive tasks. While optimizing for alignment generally enhances convexity, increasing convexity through fine-tuning yields inconsistent effects on alignment, which suggests a complex relationship between the two. This study presents a first step toward understanding the relationship between the convexity of latent representations and human-machine alignment.
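For concreteness, below is a minimal sketch of an odd-one-out style alignment score of the kind used to quantify human-machine alignment from behavioral triplet data: the model's odd one out is the triplet item least similar to the other two, and alignment is the agreement rate with human choices. Embeddings and human responses are random placeholders, not the study's data.

```python
# Minimal odd-one-out alignment sketch on placeholder triplet embeddings.
import numpy as np

def odd_one_out(emb_triplet):
    """Return the index of the item with the lowest summed similarity to the other two."""
    sims = emb_triplet @ emb_triplet.T   # cosine-style similarity on normalized vectors
    np.fill_diagonal(sims, 0.0)
    return int(np.argmin(sims.sum(axis=1)))

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 3, 64))                     # 100 triplets of embeddings
embeddings /= np.linalg.norm(embeddings, axis=-1, keepdims=True)
human_choice = rng.integers(0, 3, size=100)                    # placeholder human picks

model_choice = np.array([odd_one_out(t) for t in embeddings])
oooa = (model_choice == human_choice).mean()
print(f"odd-one-out alignment: {oooa:.2f}")
```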


Fig. 1. Convexity of latent representations of words, phonemes, and speakers, evaluated for pretrained and fine-tuned models (word classification and speaker identification), for base models (upper row) and large models (lower row). Models fine-tuned for word classification show increased convexity for word and phoneme representations and decreased convexity for speaker representations, while models fine-tuned for speaker identification show increased convexity for speaker representations and reduced convexity for word and phoneme representations.
Convexity-based Pruning of Speech Representation Models

August 2024

·

10 Reads

Speech representation models based on the transformer architecture and trained by self-supervised learning have shown great promise for solving tasks such as speech and speaker recognition, keyword spotting, emotion detection, and more. Typically, it is found that larger models lead to better performance. However, the significant computational effort involved in such large transformer systems is a challenge for embedded and real-world applications. Recent work has shown that there is significant redundancy in the transformer models for NLP and massive layer pruning is feasible (Sajjad et al., 2023). Here, we investigate layer pruning in audio models. We base the pruning decision on a convexity criterion. Convexity of classification regions has recently been proposed as an indicator of subsequent fine-tuning performance in a range of application domains, including NLP and audio. In empirical investigations, we find a massive reduction in the computational effort with no loss of performance or even improvements in certain cases.
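To illustrate the pruning idea, the sketch below scores each layer's representation and keeps only layers up to the best-scoring one. The score used here (fraction of nearest neighbours sharing a label) is a simplified stand-in for the graph-convexity criterion referred to above, and the activations and labels are random placeholders.

```python
# Sketch of layer selection by a per-layer class-consistency score (a simplified
# proxy for convexity): keep transformer layers up to the best-scoring one.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighbour_consistency(acts, labels, k=10):
    """Fraction of each point's k nearest neighbours that share its label."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(acts)
    _, idx = nn.kneighbors(acts)
    return (labels[idx[:, 1:]] == labels[:, None]).mean()

rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=300)                         # placeholder class labels
layer_acts = [rng.normal(size=(300, 64)) + 0.3 * l * labels[:, None] for l in range(12)]

scores = [neighbour_consistency(a, labels) for a in layer_acts]
keep_until = int(np.argmax(scores))                           # prune all layers above this one
print(f"per-layer scores: {np.round(scores, 2)}; keep layers 0..{keep_until}")
```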


Fig. 3: Distribution of independent component (IC) classifications assigned by ICLabel during preprocessing with the SPEED pipeline on the TUH EEG Corpus. Most ICs are classified as brain or other, as expected.
Fig. 4: UMAP embeddings of the Bhutan dataset. The subplots show the distribution of the 5 artifact classes as scatterplots. The density of the whole dataset is represented by the contour plot. The representations based on SPEED show better alignment with the ground truth labels.
Fig. 5: Validation contrastive accuracy during pretraining for three different versions of preprocessed datasets: SPEED, SPEED w/ ICA, and Baseline. The models with SPEED and SPEED w/ ICA offer more stable training and achieve higher scores.

Downstream results for each combination of pretraining preprocessing (columns) and downstream preprocessing (rows):

Pretraining →         SPEED         SPEED w/ ICA   Baseline
↓ Downstream
MMIDB
  SPEED               0.73 ± 0.01   0.71 ± 0.01    0.64 ± 0.01
  SPEED w/ ICA        0.66 ± 0.01   0.65 ± 0.02    0.59 ± 0.02
  Baseline            0.76 ± 0.01   0.74 ± 0.01    0.70 ± 0.01
BC Bhutan
  SPEED               0.48 ± 0.02   0.43 ± 0.02    0.41 ± 0.02
  SPEED w/ ICA        0.31 ± 0.03   0.34 ± 0.03    0.32 ± 0.03
  Baseline            0.36 ± 0.03   0.34 ± 0.03    0.33 ± 0.02
BCI@NER
  SPEED               0.68 ± 0.03   0.67 ± 0.03    0.62 ± 0.03
  SPEED w/ ICA        0.67 ± 0.02   0.67 ± 0.02    0.64 ± 0.02
  Baseline            0.70 ± 0.03   0.70 ± 0.03    0.65 ± 0.03
SPEED: Scalable Preprocessing of EEG Data for Self-Supervised Learning

August 2024

·

48 Reads

Electroencephalography (EEG) research typically focuses on tasks with narrowly defined objectives, but recent studies are expanding into the use of unlabeled data within larger models, aiming for a broader range of applications. This addresses a critical challenge in EEG research. For example, Kostas et al. (2021) show that self-supervised learning (SSL) outperforms traditional supervised methods. Given the high noise levels in EEG data, we argue that further improvements are possible with additional preprocessing. Current preprocessing methods often fail to efficiently manage the large data volumes required for SSL, due to their lack of optimization, their reliance on subjective manual corrections and validation, or inflexible protocols that limit SSL. We propose a Python-based EEG preprocessing pipeline optimized for self-supervised learning, designed to efficiently process large-scale data. This optimization not only stabilizes self-supervised training but also enhances performance on downstream tasks compared to training with raw data.
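As a rough sketch of the kind of steps such a pipeline automates at scale (filtering, resampling, re-referencing, windowing), the snippet below uses MNE-Python on a hypothetical recording. It is not the SPEED pipeline itself; the file name, filter cutoffs, sampling rate, and window length are illustrative assumptions.

```python
# Much-reduced EEG preprocessing sketch with MNE-Python (not the SPEED pipeline).
import mne

raw = mne.io.read_raw_edf("subject01.edf", preload=True)   # hypothetical TUH-style recording
raw.filter(l_freq=0.5, h_freq=45.0)        # bandpass to the usual EEG band
raw.notch_filter(freqs=50.0)               # remove power-line noise
raw.resample(128)                          # common sampling rate across datasets
raw.set_eeg_reference("average")           # average reference

# Fixed-length windows suitable for self-supervised pretraining.
epochs = mne.make_fixed_length_epochs(raw, duration=4.0, preload=True)
print(epochs.get_data().shape)             # (n_windows, n_channels, n_samples)
```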



Challenges in explaining deep learning models for data with biological variation

June 2024

·

5 Reads

Much machine learning research progress is based on developing models and evaluating them on a benchmark dataset (e.g., ImageNet for images). However, applying such benchmark-successful methods to real-world data often does not work as expected. This is particularly the case for biological data, where we expect variability at multiple temporal and spatial scales. In this work, we use grain data with the goal of detecting diseases and damage. Pink fusarium, skinned grains, and other diseases and damage are key factors in setting the price of grain or in excluding dangerous grains from food production. Apart from challenges stemming from the differences between this data and the standard toy datasets, we also present challenges that need to be overcome when explaining deep learning models. For example, explainability methods have many hyperparameters that can yield different results, and the settings published in the original papers do not transfer to dissimilar images. Other challenges are more general: visualizing and comparing explanations is difficult because the magnitudes of their values differ from method to method. A fundamental open question is how to evaluate explanations. This is a non-trivial task because the "ground truth" is usually missing or ill-defined. Moreover, human annotators may create what they consider an explanation of the task at hand, yet the machine learning model might solve it in a different and perhaps counter-intuitive way. We discuss several of these challenges and evaluate various post-hoc explainability methods on grain data, focusing on robustness, quality of explanations, and similarity to particular "ground truth" annotations made by experts. The goal is to find methods that perform well overall and could be used in this challenging task. We hope the proposed pipeline will serve as a framework for evaluating explainability methods in specific use cases.
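The magnitude-mismatch problem mentioned above can be made concrete with a small sketch: normalize attribution maps from different methods to a common range, then compare them with each other and with an expert annotation mask. The maps and mask below are random placeholders, not grain data or the paper's pipeline.

```python
# Sketch: make attribution maps comparable by normalization, then compare
# method-to-method agreement and overlap with an expert annotation mask.
import numpy as np
from scipy.stats import spearmanr

def normalize(attr):
    """Scale absolute attributions to [0, 1] so methods share a common range."""
    a = np.abs(attr)
    return (a - a.min()) / (a.max() - a.min() + 1e-12)

rng = np.random.default_rng(0)
saliency = normalize(rng.normal(size=(64, 64)))    # e.g. a gradient-based map (placeholder)
lrp_like = normalize(rng.normal(size=(64, 64)))    # e.g. a relevance-based map (placeholder)
expert_mask = rng.integers(0, 2, size=(64, 64))    # expert "ground truth" annotation (placeholder)

rho, _ = spearmanr(saliency.ravel(), lrp_like.ravel())
overlap = (saliency * expert_mask).sum() / (saliency.sum() + 1e-12)  # attribution mass in the mask
print(f"method agreement (Spearman): {rho:.2f}; attribution mass in annotation: {overlap:.2f}")
```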


Citations (48)


... Recent research on large language models (LLMs) has revealed that many layers and neurons can be pruned without significantly impacting performance [8,9,10,11]. Similar findings have been observed in speech representation models, where pruning or informed layer selection can lead to reduced computational requirements and faster inference times while retaining or even improving performance [12,13]. ...

Reference:

How Redundant Is the Transformer Stack in Speech Representation Models?
Convexity Based Pruning of Speech Representation Models
  • Citing Conference Paper
  • September 2024

... Several methods have been proposed to quantitatively compare the learned internal representations of neural networks based on geometrical and statistical properties of the distribution of activations. For example, centered kernel alignment (Kornblith et al. 2019), orthogonal Procrustes distance (Schönemann 1966) and methods based on canonical correlation analysis (Raghu et al. 2017;Morcos, Raghu, and Bengio 2018) have been extensively used to analyze and compare representations (Smith et al. 2017;Raghu et al. 2021;Yadav et al. 2024). However, these structural similarity indices do not take the functionality of the networks into account directly. ...

Masked Autoencoders with Multi-Window Local-Global Attention Are Better Audio Learners

... Recent advancements in deep learning have shown strong predictive capabilities of transformer-based models, which could improve the performance of existing ML models in fields like healthcare [77]. Originally developed for natural language processing, their ability to capture structure in human language could generalise to life-sequences, such as socioeconomic and health data, for classification tasks. ...

Using sequences of life-events to predict human lives

Nature Computational Science

... CAVs were instead applied by Madsen et al. in [55] on an EEG classification model. A comparison for (multi-modal) transformers was carried out in [56], namely between Optimal Transport, which considers activations of different input types, and Label Attribution, which is a variation of TMME (see Section 5.5.4). ...

Concept-Based Explainability for an EEG Transformer Model
  • Citing Conference Paper
  • September 2023

... The sparse quantitative, text-oriented writing research in the form of corpus-linguistic studies has mainly focused on length aspects or other formal aspects of students' texts, because that was largely what it was possible to examine (e.g., Palmér, 2018). This has raised calls for more meaning-oriented tools to capture, for example, textual (Crossley, 2020) and interpersonal aspects of students' writing (Tannert, 2022), aspects of written-language meaning that new technologies are rapidly becoming able to identify (Andersen et al., 2023). ...

Automatic proficiency scoring for early-stage writing
  • Citing Article
  • September 2023

Computers and Education Artificial Intelligence

... Quality evaluation: First, we evaluate the robustness of explanations to various data augmentations. We replicate the experiments from [39] and compare the results for a toy dataset (ImageNet) from [39] and a real-world one (grains). We choose six augmentation methods: changes in brightness, hue, and saturation, as well as rotation, translation, and scale. ...

Robustness of Visual Explanations to Common Data Augmentation Methods
  • Citing Conference Paper
  • June 2023
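The excerpt above describes checking whether explanations survive common data augmentations. Below is a minimal sketch of such a check, with a toy model, a random image, and a brightness change standing in for the full augmentation set; none of this reproduces the cited paper's setup.

```python
# Sketch: apply an augmentation, recompute a gradient saliency map, and measure
# how much the explanation changed. Model, image and factor are toy placeholders.
import torch
import torch.nn as nn
import torchvision.transforms.functional as TF

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))

def saliency(img):
    """Simple gradient saliency: |d max-logit / d input|, summed over channels."""
    x = img.unsqueeze(0).requires_grad_(True)
    model(x).max().backward()
    return x.grad.abs().sum(dim=1).squeeze(0)

image = torch.rand(3, 64, 64)
augmented = TF.adjust_brightness(image, brightness_factor=1.3)

s_orig, s_aug = saliency(image), saliency(augmented)
cos = torch.nn.functional.cosine_similarity(s_orig.flatten(), s_aug.flatten(), dim=0)
print(f"explanation similarity under brightness change: {cos.item():.2f}")
```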

... Numerous studies have indicated abnormal connectivity within the default mode network (DMN) in individuals with schizophrenia, with a particular focus on the connectivity between different large-scale networks such as the DMN, the central-executive network, and the salience network. 10 Previous evidence has suggested that antipsychotics may normalize cortico-striatal functional connectivity and modulate DMN connectivity. 11 However, the effects of NIBS on restoring functional connectivity are not well understood. ...

Clustering of antipsychotic-naïve patients with schizophrenia based on functional connectivity from resting-state electroencephalography

European Archives of Psychiatry and Clinical Neuroscience

... The system could potentially fail to generalize to this mismatching stimulus condition. Alternatively, the visual face may be correlated with audio envelope information (O'Sullivan et al., 2017a; Pedersen et al., 2022), and it may be easier to focus auditory attention on a speaker that can be seen, which may in turn improve decoding (O'Sullivan et al., 2013). Potential visual benefits are important to investigate, for instance in the case where the user wants to switch attention to a previously ignored speaker. ...

Modulation transfer functions for audiovisual speech

... A prominent probabilistic methodology for characterizing a graph in terms of segregated units and their integration is the stochastic block model (SBM), which through Bayesian inference quantifies within-graph node similarities using a set of blocks/clusters (segregated regions) with separate intra- and inter-cluster link densities (integration; Holland, Laskey, & Leinhardt, 1983). This framework has historically been used to elucidate cross-subject FC similarity (Andersen et al., 2014; Baldassano, Beck, & Fei-Fei, 2015; Mørup, Madsen, Dogonowski, Siebner, & Hansen, 2010), SC similarity (Ambrosen, Albers, Dyrby, Schmidt, & Mørup, 2014; Baldassano et al., 2015), and joint similarities both in terms of segregated regions and how they consistently integrate (Andersen et al., 2012) or only in terms of shared segregated regions (Albers et al., 2022). Notably, the SBM is closely related to the widely used modularity optimization procedure for community detection, such that modularity optimization can be considered a special case of maximum likelihood optimization of the degree-corrected SBM (Karrer & Newman, 2011; Newman, 2016). ...

Uncovering Cortical Units of Processing From Multi-Layered Connectomes

... However, even in 1D, quantum simulations of phase diagrams face significant hurdles that suggest the need for new approaches. The first one is the preparation of mixed states on quantum computers [55][56][57][58][59][60][61][62][63][64]. This is difficult since it requires non-unitary evolution or thermal sampling of multiple eigenstates. ...

Noise-assisted variational quantum thermalization