Timo Immanuel Denk
Google Inc.
Bachelor of Science
Working on applied problems in audio and music processing.
About
11 Publications · 13,912 Reads
33 Citations
Publications (11)
Chargrid is a recently proposed approach to understanding documents with 2-dimensional structure. It represents a document with a grid, thereby preserving its spatial structure for the processing model. Text is embedded in the grid with one-hot encoding on character level. With Wordgrid we extend Chargrid by employing a grid on word level. For emb...
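The grid representation described in this abstract can be sketched roughly as follows. This is a minimal illustration of the idea only, not the paper's implementation: the grid size, vocabulary, and bounding-box format are assumptions made for the example.

```python
# Sketch of the Chargrid/Wordgrid idea: a document page is rasterized
# into a 2-D grid, and every cell covered by a token's bounding box is
# filled with that token's one-hot vector, preserving spatial layout.
import numpy as np

def build_grid(boxes, vocab, height, width):
    """boxes: list of (token, row0, row1, col0, col1) bounding boxes.
    Returns an array of shape (height, width, len(vocab)) holding a
    one-hot channel per token; uncovered cells stay all-zero."""
    grid = np.zeros((height, width, len(vocab)), dtype=np.float32)
    index = {tok: i for i, tok in enumerate(vocab)}
    for token, r0, r1, c0, c1 in boxes:
        grid[r0:r1, c0:c1, index[token]] = 1.0
    return grid

# With character tokens this is Chargrid; with word tokens, Wordgrid.
vocab = ["invoice", "total", "42"]
grid = build_grid([("total", 0, 2, 0, 3), ("42", 0, 2, 4, 6)],
                  vocab, height=4, width=8)
```

A word-level grid like this keeps the column layout and word positions available to a downstream convolutional model, which is the property the abstract emphasizes.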
For understanding generic documents, information like font sizes, column layout, and generally the positioning of words may carry semantic information that is crucial for solving a downstream document intelligence task. Our novel BERTgrid, which is based on Chargrid by Katti et al. (2018), represents a document as a grid of contextualized word piec...
Video-to-music generation demands both a temporally localized high-quality listening experience and globally aligned video-acoustic signatures. While recent music generation models excel at the former through advanced audio codecs, the exploration of video-acoustic signatures has been confined to specific visual scenarios. In contrast, our research...
The process of reconstructing experiences from human brain activity offers a unique lens into how the brain interprets and represents the world. In this paper, we introduce a method for reconstructing music from brain activity, captured using functional magnetic resonance imaging (fMRI). Our approach uses either music retrieval or the MusicLM music...
Generating high quality music that complements the visual content of a video is a challenging task. Most existing visually conditioned music generation systems generate symbolic music data, such as MIDI files, instead of raw audio waveform. Given the limited availability of symbolic music data, such methods can only generate music for a few instrumen...
We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts. Two types of diffusion models, a generator model, which generates an intermediate representation conditioned on text, and a cascader model, which generates high-fidelity audio conditioned on the intermediate repr...
We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our ex...
BERT is a popular language model whose main pre-training task is to fill in the blank, i.e., predicting a word that was masked out of a sentence, based on the remaining words. In some applications, however, having an additional context can help the model make the right prediction, e.g., by taking the domain or the time of writing into account. This...
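The fill-in-the-blank objective described in this abstract can be sketched with a toy example. The tokenization and masking rate here are illustrative assumptions, not BERT's actual WordPiece pipeline:

```python
# Sketch of masked language modeling: replace a random subset of tokens
# with [MASK]; the model is trained to predict the originals at those
# positions from the remaining (unmasked) context.
import random

sentence = ["the", "violin", "plays", "a", "calm", "melody"]

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Returns the masked token sequence and a position -> original-token
    map that serves as the prediction targets."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            targets[i] = tok          # label the model must recover
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets
```

The extension the abstract points toward is to prepend an additional context signal (such as domain or time of writing) to the input, so the prediction at each masked position can condition on it.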
Based on an available list of the top 100,000 most popular domains on the web, we define a novel vision-based page rank estimation task: A model is asked to predict the rank of a given web domain purely based on screenshots of its web pages and information about the web link graph that interconnects them. This work is a feasibility study seeking to...
We present a parameterizable neural network meta-architecture for text classification tasks. It is based on one-dimensional separable convolutional layers, followed by a classification head consisting of stacked fully connected layers. The classifier operates on word level, with words represented by word embeddings, which we fine-tune during the tr...