Denis Kuchelev’s research while affiliated with Paderborn University and other places


Publications (6)


Figure 1: Three-level overview of LOLA architecture. The left-most block provides a high-level overview of the layers within LOLA, including the alternating standard and Mixture-of-Experts (MoE)-based decoder blocks. The middle block gives a detailed view of the MoE-based decoder block structure. The right-most block zooms in on the inner workings of each MoE layer, showing how the top-1 gating mechanism selects from multiple expert Feed Forward Networks (FFNs).
Figure 4: Comparison of model sizes across all evaluated models, with our model highlighted in orange. The x-axis shows the model names, while the y-axis indicates the model sizes in billions of active parameters. The models are grouped into three size categories: Category-1, Category-2, and Category-3. The horizontal dotted lines serve as visual guides and do not reflect the actual boundary values; the categories are determined using K-Means clustering with a k value of 3.
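The size categories described above are produced by K-Means with k = 3 over the models' active-parameter counts. A minimal 1-D sketch of that clustering (the sizes below are placeholder values, not the actual evaluated models):

```python
import random

def kmeans_1d(values, k=3, iters=100, seed=0):
    """Plain 1-D k-means: cluster scalar model sizes into k groups."""
    random.seed(seed)
    centroids = random.sample(values, k)          # init from the data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:                          # assign to nearest centroid
            i = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[i].append(v)
        new = [sum(c) / len(c) if c else centroids[i]
               for i, c in enumerate(clusters)]   # recompute centroids
        if new == centroids:                      # converged
            break
        centroids = new
    return centroids, clusters

# Hypothetical model sizes in billions of active parameters.
sizes = [0.5, 1.1, 1.3, 2.8, 3.0, 7.0, 7.5, 8.0]
centroids, clusters = kmeans_1d(sizes, k=3)
```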
Figure 5: Pearson correlation values for distances between languages based on phylogenetic features and LOLA's MoE routing features. The x-axis represents the number of languages included in the comparison. Languages are included in descending order of the number of documents the model has seen for each language.
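The correlation in Figure 5 compares two pairwise-distance vectors computed over the same language pairs. A sketch of the computation (the distance values below are invented placeholders, not the paper's actual features):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical pairwise language distances: one vector derived from
# phylogenetic features, one from the model's MoE routing features,
# flattened over the same language pairs in the same order.
phylo_dist   = [0.1, 0.4, 0.9, 0.3, 0.7, 0.8]
routing_dist = [0.2, 0.5, 0.8, 0.3, 0.6, 0.9]
r = pearson(phylo_dist, routing_dist)
```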
Tasks used to evaluate LOLA, along with the number of languages covered by each task.
Training statistics and model details for LOLA.
LOLA -- An Open-Source Massively Multilingual Large Language Model
  • Preprint
  • File available

September 2024 · 40 Reads · Denis Kuchelev · Tatiana Moteu · [...]

This paper presents LOLA, a massively multilingual large language model trained on more than 160 languages using a sparse Mixture-of-Experts Transformer architecture. Our architectural and implementation choices address the challenge of harnessing linguistic diversity while maintaining efficiency and avoiding the common pitfalls of multilinguality. Our analysis of the evaluation results shows competitive performance in natural language generation and understanding tasks. Additionally, we demonstrate how the learned expert-routing mechanism exploits implicit phylogenetic linguistic patterns to potentially alleviate the curse of multilinguality. We provide an in-depth look at the training process, an analysis of the datasets, and a balanced exploration of the model's strengths and limitations. As an open-source model, LOLA promotes reproducibility and serves as a robust foundation for future research. Our findings enable the development of compute-efficient multilingual models with strong, scalable performance across languages.
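The top-1 expert routing shown in Figure 1 can be sketched in plain Python; the dimensions, random weights, and ReLU FFN below are toy stand-ins for the trained Transformer components, not the model's actual parameters:

```python
import math
import random

random.seed(0)
D, E, H = 4, 3, 8  # hidden size, number of experts, FFN inner size (toy values)

def linear(x, W):
    """Matrix-vector product: W is [d_out][d_in], x is [d_in]."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def rand_mat(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

W_gate = rand_mat(E, D)                      # router: one score per expert
experts = [(rand_mat(H, D), rand_mat(D, H))  # per-expert FFN weight pairs
           for _ in range(E)]

def moe_layer(x):
    """Top-1 MoE: route the token to the single highest-scoring expert FFN."""
    probs = softmax(linear(x, W_gate))
    k = max(range(E), key=lambda i: probs[i])  # top-1 expert index
    W1, W2 = experts[k]
    h = [max(0.0, v) for v in linear(x, W1)]   # expert FFN with ReLU
    y = linear(h, W2)
    return [probs[k] * v for v in y], k        # scale by gate probability

token = [random.uniform(-1, 1) for _ in range(D)]
out, expert_idx = moe_layer(token)
```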



A Topic Model for the Data Web

October 2023 · 7 Reads · Lecture Notes in Computer Science

The usage of knowledge graphs in industry and at Web scale has increased steadily in recent years. However, the decentralized approach to data creation which underpins the popularity of knowledge graphs also comes with significant challenges. In particular, manually gaining an overview of the topics covered by existing datasets becomes a gargantuan, if not impossible, feat. Several dataset catalogs, portals and search engines offer different ways to interact with lists of available datasets. However, these interactions range from keyword searches to manually created tags, and none of these solutions offers easy access to human-interpretable categories. In addition, most of these approaches rely on metadata instead of the dataset itself. We propose to use topic modeling to fill this gap. Our implementation LODCat automatically creates human-interpretable topics and assigns them to RDF datasets. It does not need any metadata and relies solely on the provided RDF dataset. Our evaluation shows that LODCat can be used to identify the topics of hundreds of thousands of RDF datasets. Also, our experiment results suggest that humans agree with the topics that LODCat assigns to RDF datasets. Our code and data are available online.
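To illustrate the general idea of metadata-free topic assignment (this is not LODCat's actual implementation, which learns topics with a topic model over a reference corpus), the following sketch scores the literals of an RDF dataset against hand-written topic word lists:

```python
import re
from collections import Counter

# Illustrative only: these hand-written word sets stand in for learned
# topic-word distributions from a real topic model such as LDA.
topics = {
    "government": {"government", "law", "parliament", "election"},
    "biology":    {"protein", "gene", "species", "cell"},
}

def literals_to_bow(triples):
    """Collect a bag of words from the literal objects of RDF triples."""
    words = []
    for _s, _p, o in triples:
        if o.startswith('"'):  # crude check: quoted objects are literals
            words += re.findall(r"[a-z]+", o.lower())
    return Counter(words)

def assign_topic(triples):
    """Assign the topic whose word list best matches the dataset's literals."""
    bow = literals_to_bow(triples)
    scores = {t: sum(bow[w] for w in ws) for t, ws in topics.items()}
    return max(scores, key=scores.get)

triples = [
    ("ex:d1", "rdfs:label", '"Parliament election results"'),
    ("ex:d1", "rdfs:comment", '"Law passed by the government"'),
    ("ex:d1", "ex:source", "ex:ministry"),  # IRI object, ignored
]
topic = assign_topic(triples)  # → "government"
```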


Figure 1. Overview of the Benchmark components and the flow of data. Orange: Benchmark components; Grey: Synthetic Data Web generated by the benchmark; Dark blue: The benchmarked crawler; Solid arrows: Flow of data; Dotted arrows: Links between RDF datasets; Numbers indicate the order of steps.
ORCA - a Benchmark for Data Web Crawlers

January 2021 · 100 Reads · 2 Citations

The number of RDF knowledge graphs available on the Web grows constantly. Gathering these graphs at large scale for downstream applications hence requires the use of crawlers. Although Data Web crawlers exist, and general Web crawlers could be adapted to focus on the Data Web, there is currently no benchmark to fairly evaluate their performance. Our work closes this gap by presenting the Orca benchmark. Orca generates a synthetic Data Web, which is decoupled from the original Web and enables a fair and repeatable comparison of Data Web crawlers. Our evaluations show that Orca can be used to reveal the different advantages and disadvantages of existing crawlers. The benchmark is open-source and available at https://w3id.org/dice-research/orca.
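The core idea of a synthetic, self-contained Data Web can be sketched as follows; the host name, link predicate, and dataset counts are illustrative assumptions, not Orca's actual configuration:

```python
import random

random.seed(42)

def build_synthetic_data_web(n_datasets=5, links_per_dataset=2):
    """Generate toy RDF datasets whose resources link to other datasets,
    forming a closed, crawlable graph decoupled from the real Web."""
    host = "http://benchmark.local"  # synthetic host: every URI resolves
                                     # inside the benchmark, never outside
    datasets = {}
    for i in range(n_datasets):
        uri = f"{host}/dataset{i}"
        triples = [(f"{uri}#res", "rdfs:label", f'"Dataset {i}"')]
        targets = random.sample([j for j in range(n_datasets) if j != i],
                                links_per_dataset)
        for j in targets:  # links between RDF datasets for the crawler to follow
            triples.append((f"{uri}#res", "rdfs:seeAlso",
                            f"{host}/dataset{j}#res"))
        datasets[uri] = triples
    return datasets

web = build_synthetic_data_web()
```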


Templates of resource URIs to refer to an external resource and its dependency on the external node type. {H} = host name; {F} = file format; {N} = resource ID.
ORCA: a Benchmark for Data Web Crawlers

December 2019 · 218 Reads

The number of RDF knowledge graphs available on the Web grows constantly. Gathering these graphs at large scale for downstream applications hence requires the use of crawlers. Although Data Web crawlers exist, and general Web crawlers could be adapted to focus on the Data Web, there is currently no benchmark to fairly evaluate their performance. Our work closes this gap by presenting the Orca benchmark. Orca generates a synthetic Data Web, which is decoupled from the original Web and enables a fair and repeatable comparison of Data Web crawlers. Our evaluations show that Orca can be used to reveal the different advantages and disadvantages of existing crawlers. The benchmark is open-source and available at https://github.com/dice-group/orca.


Citations (3)


... The Lingua Franca approach [128] aims to improve the method by Perevalov et al. (see paragraph above) by introducing a named entity-aware MT approach combined with mKGQA systems. Lingua Franca leverages symbolic information about named entities stored in Wikidata to preserve their correct translation into the target language. ...

Reference:

Multilingual question answering systems for knowledge graphs – a survey
Lingua Franca – Entity-Aware Machine Translation Approach for Question Answering over Knowledge Graphs
  • Citing Conference Paper
  • December 2023

... Further evaluation criteria include the runtimes for the initial KG construction and for the incremental updates, and perhaps the manual effort to set up the construction pipelines. Ideally, an evaluation platform could be established – similar to other areas [256] – for a comparative evaluation of different pipelines with different implementations of the individual construction steps. ...

HOBBIT: A platform for benchmarking Big Linked Data

Data Science