Valentin Stauber’s scientific contributions


Publications (6)


Leveraging knowledge graphs to update scientific word embeddings using latent semantic imputation
  • Preprint
  • File available

October 2022 · 48 Reads · Valentin Stauber · [...]

The most interesting words in scientific texts will often be novel or rare. This presents a challenge for scientific word embedding models to determine quality embedding vectors for useful terms that are infrequent or newly emerging. We demonstrate how latent semantic imputation (LSI) can address this problem by imputing embeddings for domain-specific words from up-to-date knowledge graphs while otherwise preserving the original word embedding model. We use the MeSH knowledge graph to impute embedding vectors for biomedical terminology without retraining and evaluate the resulting embedding model on a domain-specific word-pair similarity task. We show that LSI can produce reliable embedding vectors for rare and out-of-vocabulary (OOV) terms in the biomedical domain.
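The core of LSI is simple enough to sketch: vectors for known anchor terms are held fixed while vectors for unknown terms are iteratively replaced by weighted averages of their graph neighbours until the values settle. The following is a minimal illustration of that idea on a toy adjacency matrix; the function name and the plain power iteration are illustrative assumptions, not the paper's exact MeSH-based pipeline.

```python
# Minimal sketch of latent semantic imputation over a graph: known vectors
# stay fixed, unknown vectors converge to neighbourhood averages.
# Illustrative only; not the paper's exact MeSH-based pipeline.
import numpy as np

def impute_embeddings(W, known_vecs, known_idx, unknown_idx,
                      n_iter=500, tol=1e-8):
    n, dim = W.shape[0], known_vecs.shape[1]
    # Row-normalise the adjacency so each row is a transition distribution.
    P = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
    E = np.zeros((n, dim))
    E[known_idx] = known_vecs
    for _ in range(n_iter):
        E_next = P @ E                      # average over graph neighbours
        E_next[known_idx] = known_vecs      # anchors keep their embeddings
        if np.linalg.norm(E_next - E) < tol:
            break
        E = E_next
    return E[unknown_idx]

# Toy graph: node 2 is an OOV term connected to two anchor terms, 0 and 1.
W = np.array([[0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 0.]])
anchors = np.array([[1.0, 0.0], [0.0, 1.0]])
print(impute_embeddings(W, anchors, known_idx=[0, 1], unknown_idx=[2]))
# -> roughly [[0.5, 0.5]], the average of its anchor neighbours
```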


ACT2: A multi-disciplinary semi-structured dataset for importance and purpose classification of citations

June 2022 · 72 Reads · 4 Citations

Classifying citations according to their purpose and importance is a challenging task that has gained considerable interest in recent years. This interest has been primarily driven by the need to create more transparent, efficient, merit-based reward systems in academia: systems that go beyond simple bibliometric measures and consider the semantics of citations. Systems that quantify and classify the influence of citations can act as edges linking knowledge nodes into a graph and enable efficient knowledge discovery. While a number of researchers have experimented with a variety of models, these experiments are typically limited to single-domain applications, and the resulting models are hardly comparable. Recently, two Citation Context Classification (3C) shared tasks (at WOSP2020 and SDP2021) created the first benchmark enabling direct comparison of citation classification approaches, revealing the crucial impact of supplementary data on model performance. Reflecting on the findings of these shared tasks, we are releasing a new multidisciplinary dataset, ACT2, an extended version of the SDP 3C shared task dataset. This modified corpus carries annotations for both citation function and importance classes and is newly enriched with supplementary contextual and non-contextual feature sets, the selection of which follows the lists of features used by the more successful teams in these shared tasks. Additionally, we include contextual features for cited papers (e.g. the abstract of the cited paper), which most existing datasets lack but which have considerable potential to improve results. We describe the methodology used for feature extraction and the challenges involved in the process. The feature-enriched ACT2 dataset is available at https://github.com/oacore/ACT2.
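Since the dataset is public, a quick baseline helps make the task concrete. The sketch below assumes ACT2 ships as a CSV with a citation-context column and a purpose label; the file name and column names are guesses, so check the repository's README. It trains a deliberately simple context-only TF-IDF classifier, the kind of model the shared-task findings suggest supplementary features should outperform.

```python
# Hypothetical baseline on ACT2; the file name and column names below are
# assumptions, not the repository's documented schema.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv("ACT2.csv")  # adjust to the actual file in github.com/oacore/ACT2

X_train, X_test, y_train, y_test = train_test_split(
    df["citation_context"], df["citation_class_label"],
    test_size=0.2, random_state=42)

# TF-IDF over the citing sentence alone is intentionally simple; per the
# shared-task results, supplementary features (e.g. the cited paper's
# abstract) are what move performance beyond this level.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(X_train), y_train)
print(classification_report(y_test, clf.predict(vectorizer.transform(X_test))))
```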



Domain-adaptation of spherical embeddings

October 2021 · 44 Reads

Domain adaptation of embedding models, updating a generic embedding to the language of a specific domain, is a proven technique for domains that have insufficient data to train an effective model from scratch. Chemistry publications are one such domain, where scientific jargon and overloaded terminology inhibit the performance of a general language model. The recent spherical embedding model (JoSE), proposed in arXiv:1911.01196, jointly learns word and document embeddings on the multi-dimensional unit sphere during training and performs well on document classification and word correlation tasks. However, we show that non-convergence caused by global rotations of the embedding space during training prevents it from being domain-adapted. In this work, we develop methods to counter the global rotation of the embedding space and propose strategies to update words and documents during domain-specific training. Two new document classification datasets are collated from general and chemistry scientific journals to compare the proposed update training strategies with benchmark models. We show that our strategies are able to reduce the performance cost of domain adaptation to a level similar to that of Word2Vec.
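The abstract does not spell out the counter-rotation method, but a standard tool for removing a global rotation between two embedding spaces is orthogonal Procrustes alignment over a shared vocabulary. The sketch below demonstrates that generic technique on synthetic unit-sphere vectors; it should be read as background illustration, not as the strategy the paper itself proposes.

```python
# Generic orthogonal Procrustes alignment: recover and undo a global
# rotation between two embedding spaces using shared anchor words.
# Background illustration only; not necessarily the paper's method.
import numpy as np

def procrustes_align(source, target):
    """Orthogonal map R minimising ||source @ R - target||_F over anchors."""
    # SVD of the cross-covariance gives the optimal orthogonal map.
    U, _, Vt = np.linalg.svd(source.T @ target)
    return U @ Vt

rng = np.random.default_rng(0)
generic = rng.normal(size=(1000, 100))
generic /= np.linalg.norm(generic, axis=1, keepdims=True)  # unit sphere, as in JoSE
R_true = np.linalg.qr(rng.normal(size=(100, 100)))[0]      # synthetic global rotation
adapted = generic @ R_true                                  # "drifted" embedding space

R = procrustes_align(adapted, generic)
aligned = adapted @ R
print(np.allclose(aligned, generic, atol=1e-8))             # True: rotation removed
```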


Figure 1: This flowchart describes the process during a typical Scithon™ competition. Participants follow the indicated steps to systematically map scientific publications onto research questions, defined by each team in Step (2). The final results produced in Step (5) are submitted to the jury by each team on the provided Scithon™ template (see Table 1 in Appendix A). (a) All steps are monitored by a key-logger that records user engagement with the provided tools. (b) Steps (2) and (4) are guided by the same provided template (Table 1 in Appendix A).
Scithon™: An evaluation framework for assessing research productivity tools

May 2018 · 83 Reads · 2 Citations

International Journal of Scientific and Research Publications

There is currently a scarcity of tested methods for evaluating the performance of artificial intelligence-based science discovery tools. Iris.ai, an international start-up developing text understanding technology and products, has developed a novel framework for performing such evaluation tasks. The framework, organized around live events, involves a systematic and cross-disciplinary comparison that focuses on productivity gains and takes user engagement into account. Under this format, referred to as Scithon™, event participants are asked to address, in a compressed time frame, the early stages of a research challenge put forth by a third party. Submitted results are then evaluated externally by domain experts. The logged data, including user engagement with the system, is compared against the outcome of the Scithon™. In this paper, we present in detail the full mechanics of the Scithon™ and the results obtained from a series of Scithon™ competitions run since 2016, where the presented framework is used to evaluate the productivity gains of Iris.ai's own intelligent research assistant. Initial findings show that, compared to conventional evaluation frameworks for search engines, Scithon™ is a suitable platform for benchmarking intelligent research assistants and is able to identify the advantages and disadvantages of such systems in greater detail and complexity. Iris.ai provides the platform under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license, which means the community is welcome to freely adopt its name and format with appropriate acknowledgement of this paper and its authors.
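As a concrete illustration of the comparison the framework performs, the toy sketch below correlates per-team engagement counts from the key-logger with jury scores. Every field name and number here is invented for illustration; the real Scithon™ analysis is described in the paper itself.

```python
# Toy illustration of the Scithon evaluation idea: relate logged tool
# engagement to external jury scores. All data below is invented.
import numpy as np
from scipy.stats import spearmanr

queries   = np.array([14, 32, 21, 9, 27, 18])         # searches issued per team
docs_read = np.array([40, 85, 60, 25, 70, 55])        # documents opened per team
jury      = np.array([3.1, 4.5, 3.8, 2.2, 4.1, 3.5])  # jury score per team

# Rank correlation is a reasonable first look at whether heavier tool
# engagement tracks better jury outcomes across teams.
for name, metric in [("queries", queries), ("docs_read", docs_read)]:
    rho, p = spearmanr(metric, jury)
    print(f"{name}: rho={rho:.2f}, p={p:.3f}")
```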

Citations (2)


... Pitts.ai and RobotReviewer/RobotSearch also use an SVM for identifying RCTs [46]. Finally, Iris.ai identifies and groups similar papers based on the similarity of their 'fingerprint', a vector representation of the most meaningful words and their synonyms extracted from the abstract [98]. ...

Reference:

Artificial intelligence for literature reviews: opportunities and challenges
Scithon™: An evaluation framework for assessing research productivity tools

International Journal of Scientific and Research Publications

... Zhang et al (2023) argued that LLMs can contribute to citation context analysis by automating citation classification; however, the applicability of LLMs classification is currently unclear and should be addressed in the future. Kunnath et al (2023) compared the performance of LLMs citation classification when performing parameter updating using multiple methods on the public datasets ACL-ARC (Jurgens et al, 2018) and ACT2 (Nambanoor Kunnath et al, 2022). The results show high performance when using some of the methods and that the zero-shot performance of GPT3.5 is high when targeting multiple fields (ACT2) but low when targeting a single field (ACL-ARC). ...

ACT2: A multi-disciplinary semi-structured dataset for importance and purpose classification of citations