Naoto Yokoya

The University of Tokyo | Todai · Department of Complexity Science and Engineering

Ph.D.


Publications: 251
Reads: 99,375
Citations: 14,601
Introduction
I am working on image processing for computational imaging and image analysis. In particular, my research is focused on intelligent information processing to automatically extract map information, such as land cover labels and elevation models, from remote sensing images acquired by spaceborne and airborne sensors. My work is motivated by applications in Earth observation to respond to global challenges.
Additional affiliations
  • January 2018 - present: RIKEN, Unit Leader
  • July 2013 - December 2017: The University of Tokyo, Research Assistant
  • April 2012 - June 2013: The University of Tokyo, Research Associate
Education
  • October 2010 - March 2013: The University of Tokyo (field of study: Aerospace Engineering)
  • October 2008 - September 2010: The University of Tokyo (field of study: Aerospace Engineering)
  • April 2004 - March 2008: The University of Tokyo (field of study: Aerospace Engineering)

Publications (251)
Preprint
Self-supervised cross-modal super-resolution (SR) can overcome the difficulty of acquiring paired training data, but is challenging because only low-resolution (LR) source and high-resolution (HR) guide images from different modalities are available. Existing methods utilize pseudo or weak supervision in LR space and thus deliver results that are b...
Preprint
Full-text available
We introduce OpenEarthMap, a benchmark dataset for global high-resolution land cover mapping. OpenEarthMap consists of 2.2 million segments of 5000 aerial and satellite images covering 97 regions from 44 countries across 6 continents, with manually annotated 8-class land cover labels at a 0.25–0.5 m ground sampling distance. Semantic segmentation...
Preprint
Full-text available
Global semantic 3D understanding from single-view high-resolution remote sensing (RS) imagery is crucial for Earth Observation (EO). However, this task faces significant challenges due to the high costs of annotations and data collection, as well as geographically restricted data availability. To address these challenges, synthetic data offer a pro...
Preprint
Neural radiance fields have achieved a remarkable breakthrough in novel view synthesis for static 3D scenes. However, in the 4D case (e.g., dynamic scenes), the performance of existing methods is still limited by the capacity of the neural network, typically a multilayer perceptron (MLP). In this paper, we present the m...
Preprint
Full-text available
Disaster events occur around the world and cause significant damage to human life and property. Earth observation (EO) data enables rapid and comprehensive building damage assessment (BDA), an essential capability in the aftermath of a disaster to reduce human casualties and to inform disaster relief efforts. Recent research focuses on the developm...
Article
Full-text available
Learning with limited labelled data is a challenging problem in various applications, including remote sensing. Few-shot semantic segmentation is one approach that can encourage deep learning models to learn from few labelled examples for novel classes not seen during the training. The generalized few-shot segmentation setting has an additional cha...
Preprint
Full-text available
The field of Remote Sensing Domain Generalization (RSDG) has emerged as a critical and valuable research frontier, focusing on developing models that generalize effectively across diverse scenarios. Despite the substantial domain gaps in RS images that are characterized by variabilities such as location, wavelength, and sensor type, research in thi...
Preprint
Full-text available
Remote Sensing (RS) is a crucial technology for observing, monitoring, and interpreting our planet, with broad applications across geoscience, economics, humanitarian fields, etc. While artificial intelligence (AI), particularly deep learning, has achieved significant advances in RS, unique challenges persist in developing more intelligent RS syste...
Preprint
Full-text available
Learning with limited labelled data is a challenging problem in various applications, including remote sensing. Few-shot semantic segmentation is one approach that can encourage deep learning models to learn from few labelled examples for novel classes not seen during the training. The generalized few-shot segmentation setting has an additional cha...
Preprint
We introduce GaussianOcc, a systematic method that investigates the two usages of Gaussian splatting for fully self-supervised and efficient 3D occupancy estimation in surround views. First, traditional methods for self-supervised 3D occupancy estimation still require ground truth 6D poses from sensors during training. To address this limitation, w...
Preprint
Full-text available
Robust and accurate segmentation of scenes has become one core functionality in various visual recognition and navigation tasks. This has inspired the recent development of Segment Anything Model (SAM), a foundation model for general mask segmentation. However, SAM is largely tailored for single-modal RGB images, limiting its applicability to multi...
Article
Full-text available
Popular geo-computer vision works make use of aerial imagery, with sizes ranging from 64 × 64 to 1024 × 1024 pixels without any overlap, although the learning process of deep learning models can be affected by the reduced semantic context or the lack of information near the image boundaries. In this work, the impact of three tile sizes (256 × 256,...
Preprint
Hyperspectral image (HSI) classification has recently reached its performance bottleneck. Multimodal data fusion is emerging as a promising approach to overcome this bottleneck by providing rich complementary information from the supplementary modality (X-modality). However, achieving comprehensive cross-modal interaction and fusion that can be gen...
Article
Full-text available
Convolutional neural networks (CNNs) and Transformers have made impressive progress in the field of remote sensing change detection (CD). However, both architectures have inherent shortcomings: CNNs are constrained by a limited receptive field that may hinder their ability to capture broader spatial contexts, while Transformers are computationally in...
Preprint
Full-text available
Foundation models (FMs) are revolutionizing the analysis and understanding of remote sensing (RS) scenes, including aerial RGB, multispectral, and SAR images. However, hyperspectral images (HSIs), which are rich in spectral information, have not seen much application of FMs, with existing methods often restricted to specific tasks and lacking gener...
Preprint
Full-text available
Convolutional neural networks (CNNs) and Transformers have made impressive progress in the field of remote sensing change detection (CD). However, both architectures have inherent shortcomings. Recently, the Mamba architecture, based on state space models, has shown remarkable performance in a series of natural language processing tasks, which can e...
Article
The foundation model has recently garnered significant attention due to its potential to revolutionize the field of visual representation learning in a self-supervised manner. While most foundation models are tailored to effectively process RGB images for various visual tasks, there is a noticeable gap in research focused on spectral data, which of...
Article
Hyperspectral and multispectral image fusion, denoted as HSI-MSI fusion, involves merging a pair of hyperspectral (HSI) and multispectral (MSI) images to generate a high spatial resolution hyperspectral image (HR-HSI). The primary challenge in HSI-MSI fusion is to find the best way to extract one-dimensional spectral features and two-dimensional (2...
Article
Recently, complex convolutional sparse coding (ComCSC) has demonstrated its effectiveness in interferometric phase restoration, owing to its prominent performance in noise mitigation and detailed phase preservation. By incorporating the estimated coherence into ComCSC as prior knowledge for re-weighting individual complex residues, coherence-guided...
Article
3D occupancy estimation from surrounding-view images is an exciting task in autonomous driving, following the success of Bird's Eye View (BEV) perception. In this work, we present a comprehensive framework for 3D occupancy estimation, which reveals several key components for 3D occupancy estimation, such as network design, optimization, and evaluat...
Article
Full-text available
Optical high-resolution imagery and OpenStreetMap (OSM) data are two important data sources of land-cover change detection (CD). Previous related studies focus on utilizing the information in OSM data to aid the CD on optical high-resolution images. This paper pioneers the direct detection of land-cover changes utilizing paired OSM data and optical...
Article
Hyperspectral image (HSI) classification has recently reached its performance bottleneck. Multimodal data fusion is emerging as a promising approach to overcome this bottleneck by providing rich complementary information from the supplementary modality (X-modality). However, achieving comprehensive cross-modal interaction and fusion that can be gen...
Article
Full-text available
When natural disasters occur, timely and accurate building damage assessment maps are vital for disaster management responders to organize their resources efficiently. Pairs of pre- and post-disaster remote sensing imagery have been recognized as invaluable data sources that provide useful information for building damage identification. Recently, d...
Conference Paper
Full-text available
Synthetic datasets, recognized for their cost effectiveness, play a pivotal role in advancing computer vision tasks and techniques. However, when it comes to remote sensing image processing, the creation of synthetic datasets becomes challenging due to the demand for larger-scale and more diverse 3D models. This complexity is compounded by the diff...
Article
Full-text available
Real-time semantic segmentation of remote sensing imagery is a challenging task that requires a tradeoff between effectiveness and efficiency. It has many applications, including tracking forest fires, detecting changes in land use and land cover, crop health monitoring, and so on. With the success of efficient deep learning methods [i.e., efficien...
Preprint
Full-text available
The foundation model has recently garnered significant attention due to its potential to revolutionize the field of visual representation learning in a self-supervised manner. While most foundation models are tailored to effectively process RGB images for various visual tasks, there is a noticeable gap in research focused on spectral data, which of...
Article
Full-text available
Change detection is a critical task in studying the dynamics of ecosystems and human activities using multi-temporal remote sensing images. While deep learning has shown promising results in change detection tasks, it requires a large number of labeled and paired multi-temporal images to achieve high performance. Pairing and annotating large-scale...
Preprint
Full-text available
Change detection is a critical task in studying the dynamics of ecosystems and human activities using multi-temporal remote sensing images. While deep learning has shown promising results in change detection tasks, it requires a large number of labeled and paired multi-temporal images to achieve high performance. Pairing and annotating large-scale...
Preprint
Full-text available
Optical high-resolution imagery and OpenStreetMap (OSM) data are two important data sources for land-cover change detection. Previous studies in these two data sources focus on utilizing the information in OSM data to aid the change detection on multi-temporal optical high-resolution images. This paper pioneers the direct detection of land-cover ch...
Article
Neural radiance fields have achieved a remarkable breakthrough in novel view synthesis for static 3D scenes. However, in the 4D case (e.g., dynamic scenes), the performance of existing methods is still limited by the capacity of the neural network, typically a multilayer perceptron (MLP). In this paper, we utilize 3D Vo...
Preprint
Full-text available
Synthetic datasets, recognized for their cost effectiveness, play a pivotal role in advancing computer vision tasks and techniques. However, when it comes to remote sensing image processing, the creation of synthetic datasets becomes challenging due to the demand for larger-scale and more diverse 3D models. This complexity is compounded by the diff...
Preprint
Understanding dark scenes based on multi-modal image data is challenging, as both the visible and auxiliary modalities provide limited semantic information for the task. Previous methods focus on fusing the two modalities but neglect the correlations among semantic classes when minimizing losses to align pixels with labels, resulting in inaccurate...
Article
Full-text available
This paper proposes an algorithm based on defined scaled tri-factorization (STF) for fast and accurate tensor ring (TR) decomposition. First, based on the fast tri-factorization approach, we define STF and design a corresponding algorithm that can more accurately represent various matrices while maintaining a similar level of computational time. Se...
Article
Full-text available
Change detection on multimodal remote sensing images has become an increasingly interesting and challenging topic in the remote sensing community, which can play an essential role in time-sensitive applications, such as disaster response. However, the modal heterogeneity problem makes it difficult to compare the multimodal images directly. This pap...
Preprint
Full-text available
The task of estimating 3D occupancy from surrounding-view images is an exciting development in the field of autonomous driving, following the success of Bird's Eye View (BEV) perception. This task provides crucial 3D attributes of the driving environment, enhancing the overall understanding and perception of the surrounding space. However, there is s...
Article
Full-text available
With the advancement of global civilisation, monitoring and managing dumpsites have become essential parts of environmental governance in various countries. Dumpsite locations are difficult to obtain in a timely manner by local government agencies and environmental groups. The World Bank shows that governments need to spend massive labour and econo...
Article
Multi-modality datasets offer advantages for processing frameworks with complementary information, particularly for large-scale cropland mapping. Extensive training datasets are required to train machine learning algorithms, which can be challenging to obtain. To alleviate the limitations, we extract the training samples from the agricultural censu...
Article
An inductive bias induced by an untrained network architecture has been shown to be effective as a deep image prior (DIP) in solving inverse imaging problems. However, it is still unclear what kind of prior is encoded in the network architecture, and early stopping to counter the overfitting of DIP remains a challenge. To address...
Article
Enormous efforts have been recently made to super-resolve hyperspectral (HS) images with the aid of high spatial resolution multispectral (MS) images. Most prior works usually perform the fusion task by means of multifarious pixel-level priors. Yet the intrinsic effects of a large distribution gap between HS-MS data due to differences in the spatia...
Article
State-of-the-art phase linking (PL) methods for distributed scatterer interferometry (DSI) retrieve consistent phase histories from the sample coherence matrix or the one whose magnitudes are calibrated. To unify them, we first propose a framework consisting of sample coherence matrix estimation and Kullback–Leibler (KL) divergence minimization. Wi...
Article
Supervised learning methods assume that training and test data are sampled from the same distribution. However, this assumption is not always satisfied in practical situations of land cover semantic segmentation when models trained in a particular source domain are applied to other regions. This is because domain shifts caused by variations in loca...
Article
Full-text available
Unsupervised multimodal change detection is a practical and challenging topic that can play an important role in time-sensitive emergency applications. To address the challenge that multimodal remote sensing images cannot be directly compared due to their modal heterogeneity, we take advantage of two types of modality-independent structural relatio...
Chapter
Convolutional neural networks have been widely developed for hyperspectral image (HSI) restoration. However, making full use of the spatial-spectral information of HSIs still remains a challenge. In this work, we disentangle the 3D convolution into lightweight 2D spatial and spectral convolutions, and build a spectrum-aware search space for HSI res...
Chapter
Self-supervised cross-modal super-resolution (SR) can overcome the difficulty of acquiring paired training data, but is challenging because only low-resolution (LR) source and high-resolution (HR) guide images from different modalities are available. Existing methods utilize pseudo or weak supervision in LR space and thus deliver results that are b...
Preprint
Full-text available
Unsupervised multimodal change detection is a practical and challenging topic that can play an important role in time-sensitive emergency applications. To address the challenge that multimodal remote sensing images cannot be directly compared due to their modal heterogeneity, we take advantage of two types of modality-independent structural relatio...
Preprint
Full-text available
In the era of deep learning, annotated datasets have become a crucial asset to the remote sensing community. In the last decade, a plethora of different datasets was published, each designed for a specific data type and with a specific task or application in mind. In the jungle of remote sensing datasets, it can be hard to keep track of what is ava...
Conference Paper
In the era of deep learning, annotated datasets have become a crucial asset to the remote sensing community. In the last decade, a plethora of different datasets was published, each designed for a specific data type and with a specific task or application in mind. In the jungle of remote sensing datasets, it can be hard to keep track of what is ava...
Preprint
Full-text available
Enormous efforts have been recently made to super-resolve hyperspectral (HS) images with the aid of high spatial resolution multispectral (MS) images. Most prior works usually perform the fusion task by means of multifarious pixel-level priors. Yet the intrinsic effects of a large distribution gap between HS-MS data due to differences in the spatia...
Preprint
In this paper, a computationally efficient regression framework is presented for estimating the 6D pose of rigid objects from a single RGB-D image; the framework can also handle symmetric objects. It has a simple architecture that efficiently extracts point-wise features from RGB-D data using a fully convolutional network, call...
Article
Full-text available
We present here the scientific outcomes of the 2021 Data Fusion Contest (DFC2021) organized by the Image Analysis and Data Fusion Technical Committee of the IEEE Geoscience and Remote Sensing Society. DFC2021 was dedicated to research on geospatial artificial intelligence (AI) for social good with a global objective of modeling the state and change...
Article
This work critically analyzes the problems arising from differ-modality building semantic segmentation in the remote sensing domain. With the growth of multimodality datasets, such as optical, synthetic aperture radar (SAR), light detection and ranging (LiDAR), and the scarcity of semantic knowledge, the task of learning multimodality information h...
Article
Full-text available
The synergies between Sentinel-3 (S3) and the forthcoming fluorescence explorer (FLEX) mission bring us the opportunity of using S3 vegetation indices (VI) as proxies of the solar-induced chlorophyll fluorescence (SIF) that will be captured by FLEX. However, the highly dynamic nature of SIF demands a very temporally accurate monitoring of S3 VIs to...
Article
A prevalent strategy for hyperspectral image super-resolution (HSI-SR) is to fuse a paired multispectral image (MSI) and hyperspectral image (HSI). However, precisely reconstructing the high spatial resolution hyperspectral image (HR-HSI) through fusion remains a challenging issue. In this article, we propose an iterative regu...
Article
Deep learning has certainly become the dominant trend in hyperspectral (HS) remote sensing (RS) image classification owing to its excellent capabilities to extract highly discriminating spectral–spatial features. In this context, transformer networks have recently shown prominent results in distinguishing even the most subtle spectral differences b...
Article
Interferometric phase restoration is a crucial step in retrieving large-scale geophysical parameters from Synthetic Aperture Radar (SAR) images. Existing noise impacts the accuracy of parameter retrieval as a result of decorrelation effects. Most state-of-the-art filtering methods belong to the group of nonlocal filters. In this paper, we propose a...
Article
The Image Analysis and Data Fusion Technical Committee (IADF TC) of the IEEE Geoscience and Remote Sensing Society (GRSS) has been organizing the annual Data Fusion Contest (DFC) since 2006. The contest promotes the development of methods for extracting geospatial information from large-scale, multisensor, multimodal, and multitemporal data. It aim...
Article
Full-text available
In this paper, we elaborate on the scientific outcomes of the 2021 Data Fusion Contest (DFC2021), which was organized by the Image Analysis and Data Fusion Technical Committee of the IEEE Geoscience and Remote Sensing Society, on the subject of geospatial artificial intelligence (AI) for social good. The ultimate objective of the contest was to mod...
Preprint
Full-text available
Humanitarian organizations must have fast and reliable data to respond to disasters. Deep learning approaches are difficult to implement in real-world disasters because it might be challenging to collect ground truth data of the damage situation (training data) soon after the event. The implementation of recent self-paced positive-unlabeled learnin...
Chapter
Multisource remote sensing image fusion refers to the process of fusing multiple image sources to obtain detailed and accurate information about the surface that cannot be obtained using a single image. A variety of multisource image fusion techniques have been proposed in the field of remote sensing to realize resolution enhancement or super-resol...