Masataka Goto

National Institute of Advanced Industrial Science and Technology

304 Publications
44,908 Reads
6,837 Citations

Publications (304)
Article
Full-text available
Many people listen to music while working nowadays. However, conventional recommendation systems that are designed for playing songs matching user preferences cannot be applied to such a situation. This is because previous research showed that listeners’ concentration can be negatively affected not only by music that listeners strongly dislike but...
Article
Full-text available
This study introduces self-supervised contrastive learning to acquire feature representations of singing voices. To acquire robust representations in an unsupervised manner, regular self-supervised contrastive learning trains neural networks to make the feature representation of a sample close to those of its computationally transformed versions. S...
Article
Lyric videos, or kinetic typography videos, are music videos showing lyric text in synchronization with the music. The purpose of this paper is to quantitatively and qualitatively analyze lyric videos to understand their design trends via three modalities: word motion, font style, and music style. These trends will not only be helpful as hints for...
Conference Paper
Full-text available
Current deep learning techniques for style transfer would not be optimal for design support since their "one-shot" transfer does not fit exploratory design processes. To overcome this gap, we propose parametric transcription, which transcribes an end-to-end style transfer effect into parameter values of specific transformations available in an exis...
Preprint
Full-text available
Current deep learning techniques for style transfer would not be optimal for design support since their "one-shot" transfer does not fit exploratory design processes. To overcome this gap, we propose parametric transcription, which transcribes an end-to-end style transfer effect into parameter values of specific transformations available in an exis...
Article
Full-text available
This paper describes an automatic singing transcription (AST) method that estimates a human-readable musical score of a sung melody from an input music signal. Because of the considerable pitch and temporal variation of a singing voice, a naive cascading approach that estimates an F0 contour and quantizes it with estimated tatum times cannot avoid...
Article
This paper proposes a statistical approach to 2D pose estimation from human images. The main problems with the standard supervised approach, which is based on a deep recognition (image-to-pose) model, are that it often yields anatomically implausible poses, and its performance is limited by the amount of paired data. To solve these problems, we pro...
Article
Full-text available
We propose a learning-based method of estimating the compatibility between vocal and accompaniment audio tracks, i.e., how well they go with each other when played simultaneously. This task is challenging because it is difficult to formulate hand-crafted rules or construct a large labeled dataset to perform supervised learning. Our method uses s...
Conference Paper
Full-text available
While participating in live concerts is a promising application of virtual reality (VR), it falls short of our participation experience in the real world. In particular, to increase the engagement of participants, previous studies emphasized the importance of social experience among audience members, such as the sense of co-presence elicited by sha...
Preprint
Deep generative models allow even novice composers to generate various melodies by sampling latent vectors. However, finding the desired melody is challenging since the latent space is unintuitive and high-dimensional. In this work, we present an interactive system that supports generative melody composition with human-in-the-loop Bayesian optimiza...
Article
Full-text available
We present a novel concept, audio–visual object removal in 360-degree videos, in which a target object in a 360-degree video is removed in both the visual and auditory domains synchronously. Previous methods have solely focused on the visual aspect of object removal using video inpainting techniques, resulting in videos with unreasonable remaining s...
Chapter
We attempt to recognize and track lyric words in lyric videos. A lyric video is a music video showing the lyric words of a song. The main characteristic of lyric videos is that the lyric words are shown in frames synchronized with the music. The difficulty of recognizing and tracking the lyric words is that (1) the words are often decorated and geom...
Article
Visual design tasks often involve tuning many design parameters. For example, color grading of a photograph involves many parameters, some of which non-expert users might be unfamiliar with. We propose a novel user-in-the-loop optimization method that allows users to efficiently find an appropriate parameter set by exploring such a high-dimensional...
Preprint
We attempt to recognize and track lyric words in lyric videos. A lyric video is a music video showing the lyric words of a song. The main characteristic of lyric videos is that the lyric words are shown in frames synchronized with the music. The difficulty of recognizing and tracking the lyric words is that (1) the words are often decorated and geom...
Preprint
Visual design tasks often involve tuning many design parameters. For example, color grading of a photograph involves many parameters, some of which non-expert users might be unfamiliar with. We propose a novel user-in-the-loop optimization method that allows users to efficiently find an appropriate parameter set by exploring such a high-dimensional...
Article
It has become popular among amateur creators to create new content based on existing original work; such new content is called derivative work. We know that derivative creation is popular, but why are individual derivative works created? Although there are several factors that inspire the creation of derivative works, such factors cannot usua...
Preprint
This paper proposes a statistical approach to 2D pose estimation from human images. The main problems with the standard supervised approach, which is based on a deep recognition (image-to-pose) model, are that it often yields anatomically implausible poses, and its performance is limited by the amount of paired data. To solve these problems, we pro...
Conference Paper
We describe the AIST Dance Video Database (AIST Dance DB), a shared database containing original street dance videos with copyright-cleared dance music. Although dancing is highly related to dance music and dance information can be considered an important aspect of music information, research on dance information processing has not yet received mu...
Conference Paper
In recommender systems, item diversification and explainable recommendations improve users' satisfaction. Unlike traditional explainable recommendations that display a single explanation for each item, explainable hybrid recommendations display multiple explanations for each item and are, therefore, more beneficial for users. When multiple explanat...
Conference Paper
In Web services dealing with user-generated content (UGC), a user can have two roles: a role of a consumer and that of a producer. Since most item recommendation models have only considered the role of a user as a consumer, how to leverage the two roles to improve UGC recommendation accuracy has been underexplored. In this paper, based on the state...
Article
Full-text available
Characters in interactive 3D applications are often animated by creating transitions from one motion clip to another in response to user input. It is not trivial, however, to achieve quick, natural-looking transitions between two arbitrary motion clips, especially when the two motions are dissimilar. To tackle this problem, we present a simple fram...
Conference Paper
Full-text available
We present VocalistMirror, an interactive user interface that enables a singer to avoid their undesirable facial expressions in singing video recordings. Since singers usually focus on singing expressions and do not care about facial expressions, when watching singing videos they recorded, they sometimes notice that some of their facial expressions...
Conference Paper
Audio annotation for music clips is an important task for machine-learning-based music analysis and applications. However, it is a time-consuming task because it often requires repetitive manipulations even though typical audio files often contain repetitive structures (e.g., a song often has similar phrases used multiple times). In this paper we p...
Chapter
Full-text available
This paper presents Query-by-Dancing, a dance music retrieval system that enables a user to retrieve music using dance motions. When dancers search for music to play when dancing, they sometimes find it by referring to online dance videos in which the dancers use motions similar to their own dance. However, previous music retrieval systems could no...
Article
Music analysis based on signal processing offers new ways of creating and listening to music. This article focuses on applications and interfaces that are enabled by advances in automatic music analysis. By using signal processing, some of these applications provide nonexperts the chance to enjoy music in their daily lives, while other applications...
Conference Paper
This paper presents Songle Sync, a web-based platform on which hundreds of Internet-connected devices - including smartphones, computers, and other physical computing devices - can be controlled to synchronize with music playback. It uses music-understanding technologies to dynamically synthesize music-driven multimedia performances from a musical...
Article
Digital paintings are often created by compositing semi‐transparent layers using various advanced color‐blend modes, such as “color‐burn,” “multiply,” and “screen,” which can produce interesting non‐linear color effects. We propose a method of decomposing an input image into layers with such advanced color blending. Unlike previous layer‐decomposit...
Conference Paper
Full-text available
We investigate the potential of music appreciation using spatial mapping techniques, which allow us to "place" audio sources in various locations within a physical space. We consider possible ways of this new appreciation style and list some design variables, such as how to define coordinate systems, how to show visually, and how to place the sound...
Conference Paper
Full-text available
Retrieving the lyrics of a sung recording from a database of text documents is a research topic that has not received much attention so far. Such a retrieval system has many practical applications, e.g. for karaoke applications or for indexing large song databases by their lyric content. We present a new method for lyrics retrieval. An acoustic mod...
Conference Paper
The mission of animators is to create nuanced, high-quality character motions. To achieve this, the careful editing of animation curves (curves that determine how a series of keyframed poses are interpolated over time) is an important task. Manual editing affords full and precise control, but requires tedious and nonintuitive trial and error. N...
Article
Creating new content based on existing original work is becoming popular, especially among amateur creators. Such new content is called derivative work and can be transformed into the next new derivative work. Such derivative work creation is called "N-th order derivative creation." Although derivative creation is popular, the reason an individual d...
Article
This paper addresses the issue of modeling the discourse nature of lyrics and presents the first study aiming at capturing the two common discourse-related notions: storylines and themes. We assume that a storyline is a chain of transitions over topics of segments and a song has at least one entire theme. We then hypothesize that transitions over...
Conference Paper
Automatic music-understanding technologies (automatic analysis of music signals) make possible the creation of intelligent music interfaces that enrich music experiences and open up new ways of listening to music. In the past, it was common to listen to music in a somewhat passive manner; in the future, people will be able to enjoy music in a more...
Conference Paper
Full-text available
This paper proposes FocusMusicRecommender, an automated system recommending background music to listen to while working. Recommendation systems matching user preferences have been widely researched even though research has shown that music that listeners strongly like is not suitable background music because it interferes with their concentration....
Chapter
Full-text available
As social media has matured, uploading video content has increased. Multiple videos of physical performances, such as dance, are difficult to integrate into high-quality videos without knowledge of video editing principles. In this study, we present a system that automatically edits dance-performance videos taken from multiple viewpoints into a mor...
Conference Paper
This paper describes a music exploratory search interface called QueryShare, which provides query searching and recommendation functions for query sharing among users. Most people are not expert users who know how to use various music metadata that include automatically estimated musical features to represent their own information needs as a query....
Conference Paper
Although the exploration of design alternatives is crucial for interaction designers and customization is required for end-users, the current development tools for physical computing devices have focused on single versions of an artifact. We propose the parametric design of devices including their enclosure layouts and programs to address this issu...
Conference Paper
Online music services are increasing in popularity. They enable us to analyze people's music listening behavior based on play logs. Although it is known that people listen to music based on topic (e.g., rock or jazz), we assume that when a user is addicted to an artist, s/he chooses the artist's songs regardless of topic. Based on this assumption,...
Chapter
The purpose of this project is to develop fundamental technologies for building a similarity-aware information environment in which people are able to know similarities among vast amounts of media content. This environment helps establish a “content-symbiotic society” in which media content such as music and video can be created and used in innovat...
Conference Paper
Programmers write source code that compiles to programs, and users execute the programs to benefit from their features. While issue-tracking systems help communication between these two groups of people, feature requests have usually been written in text with optional figures that follows community guidelines and needs human interpretation to under...
Article
Full-text available
There is considerable interest in music-based games and apps. However, in existing games, music generally serves as an accompaniment or as a reward for progress. We set out to design a game where paying attention to the music would be essential to making deductions and solving the puzzle. The result is the CrossSong Puzzle, a novel type of music-ba...
Conference Paper
This paper presents LyriSys, a novel lyric-writing support system. Previous systems for lyric writing can only generate, fully automatically, either a single line of lyrics that satisfies given constraints on accent and syllable patterns or an entire lyric. In contrast to such systems, LyriSys allows users to create and revise their work incrementally in a...
Conference Paper
Authoring videos that demonstrate interactive applications with human-computer and human-robot interactions is not easy. It typically involves many retakes, each of which requires repetitive actions in the authoring tools. To address the issue, we propose a robotic framework that covers not only the camera but also the recording environment. It al...
Article
As one of the techniques enabling individual singers to produce the varieties of voice timbre beyond their own physical constraints, a statistical voice timbre control technique based on the perceived age has been developed. In this technique, the perceived age of a singing voice, which is the age of the singer as perceived by the listener, is used...
Conference Paper
Many amateur creators now create derivative works and put them on the web. Although there are several factors that inspire the creation of derivative works, such factors cannot usually be observed on the web. In this paper, we propose a model for inferring latent factors from sequences of derivative work posting events. We assume a sequence to be a...
Article
This paper describes a web-based multimedia development framework, Songle Widget (http://widget.songle.jp), that makes it possible to control computer-graphic animation and physical devices such as lighting devices and robots in synchronization with music publicly available on the web. Development of applications featuring rigid synchronization wit...
Article
We present a method for music emotion recognition that adaptively aggregates regression models. Music emotion recognition is the task of estimating how music affects the emotion of a listener. The approach works by mapping acoustic features into a space that represents emotions. Previous research has centered on finding effective acoustic features, or a...
Conference Paper
Full-text available
Live Programming allows programmers to gain information about the program continuously during its development. While it has been implemented in various integrated development environments (IDEs) for programmers, its interaction techniques such as slider widgets for continuous parameter tuning are comprehensible for people without any prior knowl...
Article
The programming-with-examples workflow lets developers create interactive applications with the help of example data. It takes a general programming environment and adds dedicated user interfaces for visualizing and managing the data. This lets both programmers and users understand applications and configure them to meet their needs.
Conference Paper
We propose a novel interface that allows the user to interactively change the playback order of multiple songs by choosing one or more criteria. The criteria include not only the song's title and artist name but also its content automatically estimated by music/singing signal processing and artist-level social analysis. The artist-level social info...
Article
This paper proposes a novel concept we call musical commonness, which is the similarity of a song to a set of songs; in other words, its typicality. This commonness can be used to retrieve representative songs from a set of songs (e.g. songs released in the 80s or 90s). Previous research on musical similarity has compared two songs but has not eval...
Conference Paper
Full-text available
During the development of physical computing devices, physical object models and programs for microcontrollers are usually created with separate tools with distinct files. As a result, it is difficult to track the changes in hardware and software without discrepancy. Moreover, the software cannot directly access hardware metrics. Designing hardware...
Article
In the context of singing voice synthesis, expression control manipulates a set of voice features related to a particular emotion, style, or singer. Also known as performance modeling, it has been approached from different perspectives and for different purposes, and different projects have shown a wide extent of applicability. The aim of this arti...
Conference Paper
Music technologies have opened up various music cultures. For example, many musical instruments, such as the guitar, the piano, and sound synthesizers, were originally invented using state-of-the-art music technologies and have had a huge influence on music cultures.