Ryohei Sasano’s research while affiliated with Nagoya University and other places


Publications (87)


Redundancy, Isotropy, and Intrinsic Dimensionality of Prompt-based Text Embeddings
  • Preprint

June 2025 · 2 Reads

Hayato Tsukagoshi · Ryohei Sasano

Prompt-based text embedding models, which generate task-specific embeddings upon receiving tailored prompts, have recently demonstrated remarkable performance. However, their resulting embeddings often have thousands of dimensions, leading to high storage costs and increased computational costs for embedding-based operations. In this paper, we investigate how post-hoc dimensionality reduction applied to the embeddings affects the performance of various tasks that leverage these embeddings, specifically classification, clustering, retrieval, and semantic textual similarity (STS) tasks. Our experiments show that even a naive dimensionality reduction, which keeps only the first 25% of the dimensions of the embeddings, results in only a very slight performance degradation, indicating that these embeddings are highly redundant. Notably, for classification and clustering, even when embeddings are reduced to less than 0.5% of the original dimensionality, the performance degradation is very small. To quantitatively analyze this redundancy, we perform an analysis based on the intrinsic dimensionality and isotropy of the embeddings. Our analysis reveals that embeddings for classification and clustering, which are considered to have very high dimensional redundancy, exhibit lower intrinsic dimensionality and less isotropy compared with those for retrieval and STS.
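As a rough illustration of the kind of post-hoc reduction studied here, the sketch below (my own, not the authors' code) truncates embeddings to their leading dimensions and re-normalizes them before computing cosine similarities; the 25% keep-ratio mirrors the naive reduction described in the abstract, while the shapes and data are placeholders.

```python
# Minimal sketch (not the authors' code) of the naive post-hoc reduction: keep
# only the leading dimensions of each embedding and re-normalize before
# similarity computation. Shapes and data below are placeholders.
import numpy as np

def truncate_and_normalize(embeddings: np.ndarray, keep_ratio: float = 0.25) -> np.ndarray:
    k = max(1, int(embeddings.shape[1] * keep_ratio))
    reduced = embeddings[:, :k]                       # keep the first k dimensions
    norms = np.linalg.norm(reduced, axis=1, keepdims=True)
    return reduced / np.clip(norms, 1e-12, None)      # unit length for cosine similarity

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 4096))                      # e.g., 4096-dim prompt-based embeddings
emb_25 = truncate_and_normalize(emb, keep_ratio=0.25) # 1024 dims, i.e., 25%
similarities = emb_25 @ emb_25.T                      # cosine similarities on reduced embeddings
```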


Improving Behavior-aware Driving Video Captioning through Better Use of In-vehicle Sensors and References

April 2025 · 4 Reads

Journal of Information Processing

Driving video captioning aims to automatically generate descriptions for videos from driving recorders. Driving video captions are generally required to describe first-person driving behaviors, which implicitly characterize the driving videos but are challenging to anchor to concrete visual evidence. To generate captions with better driving behavior descriptions, existing work has introduced behavior-related in-vehicle sensors into a captioning model for behavior-aware captioning. However, better methods for fusing the sensor modality with visual modalities have not been fully investigated, and the accuracy and informativeness of generated behavior-related descriptions remain unsatisfactory. In this paper, we compare three modality fusion methods using a Transformer-based video captioning model and propose two training strategies to improve both the accuracy and the informativeness of generated behavior descriptions: 1) jointly training the captioning model with multi-label behavior classification, explicitly using annotated behavior tags; and 2) weighted training that assigns weights to reference captions (references) according to the informativeness of their behavior descriptions. Experiments on a Japanese driving video captioning dataset, City Traffic (CT), show the efficacy and positive interaction of the proposed training strategies. Moreover, larger improvements on out-of-distribution data demonstrate improved generalization ability.
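The two training strategies can be pictured as a single combined loss; the following PyTorch sketch is an assumed formulation (tensor shapes, the weighting scheme, and the mixing coefficient `alpha` are illustrative, not taken from the paper).

```python
# Assumed formulation (not the paper's code) of the two strategies as one loss:
# a caption cross-entropy weighted by the reference's behavior informativeness,
# plus a multi-label behavior-classification term using annotated behavior tags.
import torch
import torch.nn.functional as F

def joint_weighted_loss(caption_logits, caption_targets,
                        behavior_logits, behavior_tags,
                        reference_weight: float, alpha: float = 0.5):
    # caption_logits: (T, vocab); caption_targets: (T,) token ids
    # behavior_logits, behavior_tags: (num_tags,) logits and float multi-hot labels
    caption_loss = F.cross_entropy(caption_logits, caption_targets)
    behavior_loss = F.binary_cross_entropy_with_logits(behavior_logits, behavior_tags)
    return reference_weight * caption_loss + alpha * behavior_loss
```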


Verifying Claims About Metaphors with Large-Scale Automatic Metaphor Identification
  • Article
  • Full-text available

March 2025 · 13 Reads

Journal of Natural Language Processing

There are several linguistic claims about situations in which words are more likely to be used as metaphors. However, few studies have sought to verify such claims against large corpora. This study conducts a large-scale, corpus-based analysis of claims about metaphors by applying automatic metaphor detection to sentences extracted from Common Crawl and using the statistics obtained from the results. Specifically, we verified a total of five claims: three claims concerning the direct objects of verbs used as metaphors and two claims concerning emotional polarity and subjectivity. The verification results support all five claims and indicate that the direct objects of verbs used as metaphors tend to have lower degrees of concreteness, imageability, and familiarity, and that metaphors are more likely to be used in emotional and subjective sentences.
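The verification boils down to comparing psycholinguistic ratings of direct objects under metaphorical versus literal verb uses; below is a hedged sketch of one such statistic, where the `instances` flags and the `concreteness` lexicon are hypothetical stand-ins for the automatic metaphor detector output and rating norms.

```python
# Hedged sketch of one verification statistic (hypothetical inputs): compare the
# mean concreteness rating of direct objects when the governing verb is flagged
# as metaphorical vs. literal by an automatic metaphor detector.
from statistics import mean

def compare_concreteness(instances, concreteness):
    # instances: iterable of (direct_object, is_metaphor) pairs extracted from
    # parsed corpus sentences; concreteness: word -> rating lookup (e.g., norms)
    metaphorical = [concreteness[obj] for obj, is_met in instances
                    if is_met and obj in concreteness]
    literal = [concreteness[obj] for obj, is_met in instances
               if not is_met and obj in concreteness]
    return mean(metaphorical), mean(literal)

# Expectation under the claim: mean(metaphorical) < mean(literal).
```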


Empowering Vision-Language Tuning with Machine Translation for Driving Scene Caption Generation

March 2025 · 10 Reads

IEICE Transactions on Information and Systems

In the Autonomous Driving (AD) scenario, accurate, informative, and understandable descriptions of the traffic conditions and the ego-vehicle motions can increase the interpretability of an autonomous driving system for the vehicle user. End-to-end free-form video captioning is a straightforward vision-to-text task to address such needs. However, insufficient real-world driving scene descriptive data hinders the performance of caption generation under a simple supervised training paradigm. Recently, large-scale Vision-Language Pre-training (VLP) foundation models have attracted much attention from the community, and tuning large foundation models on task-specific datasets has become a prevailing paradigm for caption generation. However, for applications in autonomous driving, we often encounter large gaps between the training data of VLP foundation models and real-world driving scene captioning data, which impedes the immense potential of VLP foundation models. In this paper, we propose to tackle this problem via a unified framework for cross-lingual, cross-domain vision-language tuning empowered by Machine Translation (MT) techniques. We aim to obtain a captioning system for driving scene caption generation in Japanese from a domain-general and English-centric VLP model. The framework comprises two core components: (i) bidirectional knowledge distillation by MT teachers and (ii) fusing objectives for cross-lingual fine-tuning. Moreover, we introduce three schedulers to operate the vision-language tuning process with fusing objectives. Based on GIT [1], we implement our framework and verify its effectiveness on real-world driving scenes with natural caption texts annotated by experienced vehicle users. The caption generation performance with our framework reveals a significant advantage over the baseline settings.
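One way to read "fusing objectives" with a scheduler is as a weighted mixture of losses whose weights change over training; the snippet below is a minimal sketch under that assumption (the linear schedule and the loss names are illustrative, and the paper's three schedulers are not reproduced here).

```python
# Minimal sketch under a stated assumption: a linear scheduler shifts weight
# from an MT-distilled English objective toward the Japanese driving-caption
# objective over training. The paper's actual schedulers may differ.
def fused_loss(loss_en_distill, loss_ja_caption, step: int, total_steps: int):
    w = min(1.0, step / max(1, total_steps))  # one illustrative (linear) schedule
    return (1.0 - w) * loss_en_distill + w * loss_ja_caption
```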


Figures: an illustration of the perspective for investigating the language-arithmetic representational dissociation within language models (LMs), motivated by the question of whether, given that human brain activation patterns differ between linguistic and (non-linguistic) reasoning stimuli, the same holds for LMs; and 2D PCA visualizations of the LANG (en), LANGNUM (en), EQSP (en), GSM8K, and LANG-SHUFFLED (en) clusters in Gemma2-9b-it.
On Representational Dissociation of Language and Arithmetic in Large Language Models

February 2025 · 11 Reads

The association between language and (non-linguistic) thinking ability in humans has long been debated, and recently, neuroscientific evidence from brain activity patterns has been brought to bear on it. Such a scientific context naturally raises an interdisciplinary question -- what about such a language-thought dissociation in large language models (LLMs)? In this paper, as an initial foray, we explore this question by focusing on simple arithmetic skills (e.g., 1+2=?) as a thinking ability and analyzing the geometry of their encoding in LLMs' representation space. Our experiments with linear classifiers and cluster separability tests demonstrate that simple arithmetic equations and general language input are encoded in completely separated regions of LLMs' internal representation space across all the layers, which is also supported by experiments with more controlled stimuli (e.g., spelled-out equations). These results tentatively suggest that arithmetic reasoning is mapped into a region distinct from general language input, in line with the neuroscientific observations of human brain activations, although we also point out their somewhat cognitively implausible geometric properties.
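A linear-probe analysis of the kind mentioned above can be sketched as follows; the hidden-state arrays, layer choice, and classifier settings are assumptions for illustration rather than the paper's exact protocol.

```python
# Minimal linear-probe sketch (assumed data loading): hidden states at one layer
# for arithmetic inputs vs. general-language inputs are separated by a linear
# classifier; near-perfect held-out accuracy indicates linearly separable regions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_separability(arith_states: np.ndarray, lang_states: np.ndarray) -> float:
    X = np.vstack([arith_states, lang_states])
    y = np.concatenate([np.ones(len(arith_states)), np.zeros(len(lang_states))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)  # accuracy on held-out representations
```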


Transformer-based Live Update Generation for Soccer Matches from Microblog Posts

December 2024 · 6 Reads

Journal of Natural Language Processing

When a sports match is broadcast, X users often enjoy sharing comments about it, and it is possible to roughly follow a match's progress by reading these posts. However, because of the diverse nature of posts, it can be challenging to quickly grasp a match's progress. In this study, we focus on soccer matches and build a system that generates live updates from posts so that users can instantly grasp a match's progress. Our system is based on the large language model T5 and outputs updates at certain times, taking as input posts related to a specific match. However, simply applying the model to this task caused two problems: an inappropriate number of generated updates and redundant updates. Therefore, we propose a mechanism that incorporates a classifier to control the number of generated updates and a mechanism that takes previous updates into account to mitigate redundancy.
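The proposed control mechanisms can be pictured as a gated generation loop; the sketch below is an assumed simplification in which `should_update` stands in for the classifier and `generate_update` for the T5-based generator conditioned on recent updates.

```python
# Assumed simplification of the generation loop: a classifier gates whether an
# update is emitted for each time window, and recent updates are passed back in
# so the generator can avoid redundancy. All helpers here are placeholders.
def generate_live_updates(post_windows, should_update, generate_update, max_history=3):
    updates = []
    for posts in post_windows:                      # posts collected in one time window
        if not should_update(posts):                # classifier controls update count
            continue
        history = updates[-max_history:]            # condition on previous updates
        updates.append(generate_update(posts, history))
    return updates
```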


Figure 1: Overview of four citation count prediction methods that leverage a paper's main text.
CiMaTe: Citation Count Prediction Effectively Leveraging the Main Text

October 2024 · 14 Reads

Predicting the future citation counts of papers is increasingly important for finding interesting papers among an ever-growing number of publications. Although a paper's main text is an important factor for citation count prediction, it is difficult to handle in machine learning models because the main text is typically very long; thus, previous studies have not fully explored how to leverage it. In this paper, we propose a BERT-based citation count prediction model, called CiMaTe, that leverages the main text by explicitly capturing a paper's sectional structure. Through experiments with papers from the computational linguistics and biology domains, we demonstrate CiMaTe's effectiveness, outperforming previous methods in Spearman's rank correlation coefficient by 5.1 points in the computational linguistics domain and 1.8 points in the biology domain.
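A minimal sketch of section-wise encoding (a simplification, not the exact CiMaTe architecture) might look like the following: each section is encoded separately with BERT and the section vectors are mean-pooled before being fed to a citation-count regressor; the checkpoint name and pooling choice are placeholders.

```python
# Sketch of the section-wise encoding idea (a simplification, not the exact
# CiMaTe model): encode each section independently with BERT and mean-pool the
# section vectors; a regressor on top would predict the (log) citation count.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode_paper(sections: list[str]) -> torch.Tensor:
    vectors = []
    for text in sections:
        inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
        with torch.no_grad():
            cls = encoder(**inputs).last_hidden_state[:, 0]  # [CLS] vector per section
        vectors.append(cls)
    return torch.cat(vectors).mean(dim=0)  # aggregate over the sectional structure
```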


Ruri: Japanese General Text Embeddings

September 2024 · 12 Reads

We report the development of Ruri, a series of Japanese general text embedding models. While the development of general-purpose text embedding models in English and multilingual contexts has been active in recent years, model development in Japanese remains insufficient. The primary reasons for this are the lack of datasets and the absence of necessary expertise. In this report, we provide a detailed account of the development process of Ruri. Specifically, we discuss the training of embedding models using synthesized datasets generated by LLMs, the construction of the reranker for dataset filtering and knowledge distillation, and the performance evaluation of the resulting general-purpose text embedding models.


Citation Count Prediction for Newly Published Papers

September 2024 · 1 Read

Transactions of the Japanese Society for Artificial Intelligence

Citation count prediction is the task of predicting the future citation counts of academic papers, which is particularly useful for estimating the future impact of an ever-growing number of academic papers. Although there have been many studies on citation count prediction, they are not applicable to predicting the citation counts of newly published papers, because they assume the availability of future citation counts for papers for which not enough time has passed since publication. In this paper, we first identify problems in the settings of existing studies and introduce a realistic citation count prediction task that strictly uses only information available at the time of a target paper's publication. For realistic citation prediction, we then propose two methods that leverage the citation counts of papers shortly after publication to capture the research trends that are important for predicting the citation counts of newly published papers. Through experiments using papers collected from arXiv and bioRxiv, we demonstrate that our methods considerably improve the performance of citation count prediction for newly published papers in this realistic setting.
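One way to operationalize the "research trend" signal is to aggregate the early citation counts of similar, recently published papers; the sketch below assumes that formulation (the similarity function, top-k cutoff, and data fields are illustrative, not the paper's exact method).

```python
# Hedged sketch of the early-citation "trend" signal: aggregate the short-term
# citation counts of the most similar recently published papers and use the
# aggregate as a feature for the target paper. Field names, the similarity
# function, and top_k are illustrative assumptions.
import numpy as np

def trend_feature(recent_papers, target, similarity, top_k=50):
    # recent_papers: list of dicts with "embedding" and "early_citations";
    # target: dict with "embedding" for the newly published paper
    ranked = sorted(recent_papers,
                    key=lambda p: similarity(p["embedding"], target["embedding"]),
                    reverse=True)
    return float(np.mean([p["early_citations"] for p in ranked[:top_k]]))
```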


Are Social Sentiments Inherent in LLMs? An Empirical Study on Extraction of Inter-demographic Sentiments

August 2024 · 5 Reads

Large language models (LLMs) are supposed to acquire unconscious human knowledge and feelings, such as social common sense and biases, by being trained on large amounts of text. However, it is not clear how well the sentiments of specific social groups are captured in various LLMs. In this study, we focus on social groups defined in terms of nationality, religion, and race/ethnicity, and validate the extent to which sentiments between social groups can be captured in and extracted from LLMs. Specifically, we input questions regarding sentiments from one group toward another into LLMs, apply sentiment analysis to the responses, and compare the results with social surveys. The validation results using five representative LLMs showed higher correlations with relatively small p-values for nationalities and religions, for which the number of data points was relatively large. This result indicates that LLM responses reflecting inter-group sentiments align well with actual social survey results.
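The extraction-and-comparison pipeline can be sketched as follows; the prompt wording, the `ask_llm` and `sentiment_score` helpers, and the survey data structure are hypothetical placeholders rather than the study's actual setup.

```python
# Illustrative pipeline (prompt wording and helper functions are assumptions):
# ask an LLM how members of one group feel about another, score the response
# with a sentiment analyzer, and correlate the scores with survey results.
from scipy.stats import spearmanr

def correlate_with_survey(group_pairs, ask_llm, sentiment_score, survey_scores):
    llm_scores = []
    for group_a, group_b in group_pairs:
        prompt = f"How do people in {group_a} generally feel about people in {group_b}?"
        llm_scores.append(sentiment_score(ask_llm(prompt)))  # e.g., polarity in [-1, 1]
    survey = [survey_scores[(a, b)] for a, b in group_pairs]
    rho, p_value = spearmanr(llm_scores, survey)
    return rho, p_value
```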


Citations (40)


... The use of synthetic datasets in training text embeddings is very promising and is actively explored (Sato et al., 2024; Lee et al., 2024b). This is particu- ... [footnote URLs: 2 https://huggingface.co/datasets/hpprc/jawiki; 3 https://huggingface.co/datasets/hpprc/ ...]

Reference:

Ruri: Japanese General Text Embeddings
Improving Sentence Embeddings with Automatic Generation of Training Data Using Few-shot Examples
  • Citing Conference Paper
  • January 2024

... The mainstream methods are mainly based on contextualized word embeddings (QasemiZadeh et al., 2019; Yamada et al., 2021b) such as BERT (Devlin et al., 2019). These methods leverage the observation that words evoking the same semantic frame tend to appear in similar contexts, resulting in their embeddings being grouped into the same cluster (Yamada et al., 2021a, 2023). However, while the frame induction task provides clusters of frames, it lacks interpretability because definitions of these clusters are not provided. ...

Semantic Frame Induction with Deep Metric Learning
  • Citing Conference Paper
  • January 2023

... More recent neural models incorporate textual and early-citation signals via multilayer perceptrons or recurrent neural networks, yielding improved but still limited performance [Ruan et al., 2020, Jamal et al., 2024]. The latest transformer-based predictors leverage pre-trained language embeddings over abstracts or document chunks, but typically only use them as fixed features for downstream regressors rather than in an end-to-end predictive framework [Hirako et al., 2024, van Dongen et al., 2020, Hirako et al., 2023, Wenniger et al., 2023]. A more in-depth analysis of techniques used in the field can be found in Aiza et al. [2024]. ...

Realistic Citation Count Prediction Task for Newly Published Papers
  • Citing Conference Paper
  • January 2023

... As a training strategy, CL gradually increases the complexity of the data, i.e., easy-to-hard, during the training process for faster convergence and better performance (Bengio et al., 2009). CL achieves great success in various NLG tasks, e.g., machine translation (Mohiuddin et al., 2022), medical report generation (Zhang et al., 2022a), dialogue generation, and language modeling (Mi, 2023). "How to define the difficulty of training instances" is a fundamental issue in CL. ...

Cross-Modal Similarity-Based Curriculum Learning for Image Captioning
  • Citing Conference Paper
  • January 2022

Hongkuan Zhang · Saku Sugawara · Akiko Aizawa · [...]

... To start, we wanted to validate the assumption that the sentence embeddings of a larger document can meaningfully be used as a proxy for the original document embedding [28]. To test this, we wanted to determine how much reconstruction loss we would incur from using an optimal linear combination of sentence embedding vectors instead of a full multi-sentence embedding vector. ...

Comparison and Combination of Sentence Embeddings Derived from Different Supervision Signals
  • Citing Conference Paper
  • January 2022

... Natural language processing and artificial intelligence (AI) are used to autonomously retrieve or extract information for the purpose of learning from incident reports [12, 14-18]. Our earlier studies [14, 19] provided systematic methodologies for annotating and classifying incident reports of medication errors in a structured manner; the resulting high-quality gold-standard data have been published [20]. In this study, we devised a data creation workflow and developed a machine annotator to create a large corpus of machine-annotated incident reports of medication errors, possessing useful drug-related concept and attribute extraction. ...

Annotation Guidelines for Medication Errors in Incident Reports: Validation Through a Mixed Methods Approach

... The disease appears to evolve, changing its presentation with varying onsets, levels of severity, and responses to treatment based on the individual [3, 4]. Therefore, a deeper understanding of the underlying pathophysiology and establishing an effective strategy to comprehensively assess changing symptoms become imperative to provide tailored, longitudinal care and to improve patients' quality of life [4-7]. ...

Research impact analysis of international funding agencies in the realm of allergy and immunology

... For example, a user interested in congested traffic scenarios in urban areas at night can filter scenes based on the corresponding attributes. ... [38], nuScenes [2], and Road Hazard Stimuli datasets [39, 40]. Each row represents a label type, and the values in each column represent the distribution of the corresponding labels in the dataset. ...

Driving Behavior Aware Caption Generation for Egocentric Driving Videos Using In-Vehicle Sensors *
  • Citing Conference Paper
  • July 2021

... In Li et al. [26], the authors propose to introduce a multi-head self-attention mechanism into news headline generation (HG) based on the Transformer decoder, and design a decoding selection strategy integrating top-k, top-p, and penalty mechanisms to select important semantic information and then generate news headlines. In Yamada et al. [27], the authors propose a Transformer-based Seq2BF model that alternates forward and backward decoding to generate headlines with a given phrase. In Bukhtiyarov et al. [28], the authors fine-tune two Transformer-based pre-trained models, mBART and BertSumAbs, and achieve good results. ...

Transformer-based Lexically Constrained Headline Generation
  • Citing Conference Paper
  • January 2021

... When focusing on the field of NLP, by experimenting with several heuristics, (Sachan and Xing 2016; Xu et al. 2020) migrated the success of CL to NLU tasks. (Zhou et al. 2021) improved machine translation modeling by carefully designing different curricula. Recently, with the rise of LLMs, (Liu et al. 2024b) discovered the huge potential of CL in in-context learning, while (Wang et al. 2023d) focus on improving CL for LLM pre-training. ...

Self-Guided Curriculum Learning for Neural Machine Translation
  • Citing Conference Paper
  • January 2021