Physics-Free Spectrally Multiplexed Photometric
Stereo under Unknown Spectral Composition
Satoshi Ikehata¹,² and Yuta Asano¹
1National Institute of Informatics, Tokyo, Japan
2Tokyo Institute of Technology, Tokyo, Japan
Abstract. In this paper, we present a groundbreaking spectrally mul-
tiplexed photometric stereo approach for recovering surface normals of
dynamic surfaces without the need for calibrated lighting or sensors,
a notable advancement in the field traditionally hindered by stringent
prerequisites and spectral ambiguity. By embracing spectral ambigu-
ity as an advantage, our technique enables the generation of training
data without specialized multispectral rendering frameworks. We in-
troduce a unique, physics-free network architecture, SpectraM-PS, that
effectively processes multiplexed images to determine surface normals
across a wide range of conditions and material types, without relying on
specific physically-based knowledge. Additionally, we establish the first
benchmark dataset, SpectraM14, for spectrally multiplexed photometric
stereo, facilitating comprehensive evaluations against existing calibrated
methods. Our contributions significantly enhance the capabilities for dy-
namic surface recovery, particularly in uncalibrated setups, marking a
pivotal step forward in the application of photometric stereo across var-
ious domains.
Keywords: Spectrally Multiplexed Photometric Stereo · Dynamic Surface Recovery · Multispectral Photometric Stereo
1 Introduction
Recovering detailed normals of dynamic surfaces is essential for monitoring var-
ious processes: in manufacturing, it helps in tracking wear and tear of machine
parts; in agriculture, it allows for the observation of crop growth through changes
in leaf geometry; and in sports engineering, it aids in improving equipment design
and safety by analyzing how surfaces deform upon impact.
Photometric Stereo (PS) [59,66] derives object surface normals from observa-
tions under different lighting conditions at a fixed viewpoint. Despite decades of
progress, the requirement for objects to stay stationary during lighting changes
challenges the recovery of dynamic surfaces, essential for analyzing temporal sur-
face deformations. PS research has employed spectral multiplexing for dynamic surface recovery [11, 22, 34, 39, 53, 64]—a technique originally used in telecommunications and spectroscopy [33]. This technique utilizes the varying wavelengths
Fig. 1: (Left) Illustration of our SpectraM-PS. Our method recovers a surface nor-
mal map from a spectrally multiplexed image. The spectral/spatial composition for
generating the observations is unknown. There is potential for a mismatch between
the sensor’s spectral sensitivity and the light source’s spectral distribution, which may
lead to crosstalk. (Right) By applying our method to individual frames of a video, the
normal map of dynamic surfaces can be recovered.
of light to multiplex and subsequently demultiplex signals within a single sensor,
thereby increasing the capacity for information transmission.
Historically, spectrally multiplexed photometric stereo is often referred to as
color photometric stereo [8,15,22,34,35], specifically when objects are illuminated
with monochromatic red, green, and blue lights from various angles, captured in
the camera’s RGB channels. Each channel is then treated as an observation under
a distinct lighting for PS analysis. This technique has been further extended to
not only RGB but also any number of spectral bands and is specifically referred to
as multispectral photometric stereo [19,20, 48]. These techniques enable dynamic
surface recovery by processing each temporal multi-channel frame separately.
Despite their potential in dynamic surface recovery, current spectrally mul-
tiplexed photometric stereo methods face stringent prerequisites that limit their
practicality. These include the necessity for precisely calibrated directional light-
ing in controlled environments [8, 15, 19, 22,48] and sensors with aligned spec-
tral sensitivities [34,35]. Furthermore, they make strong assumptions about the
surface, requiring it to be convex, integrable, Lambertian, and exhibit uniform
chromaticity [22,39]. By contrast, recent PS methods without spectral multiplex-
ing support non-Lambertian surfaces [31,57], spatially-varying materials [13,25],
and the use of uncalibrated lighting [12,27, 28, 56]. This disparity arises from the
challenge where identical observations are produced by different spectral compositions of light, surface, and sensor [24, 52], a phenomenon absent in conventional PS because their spectral compositions are constant across images. Recently, Guo et al. [19, 20] thoroughly explored how this spectral ambiguity renders spectrally multiplexed photometric stereo ill-posed, necessitating severely impractical conditions on light, surface, and sensor to resolve the ambiguity.
In this work, we propose a spectrally multiplexed photometric stereo method
that recovers normals directly from multiplexed observations produced by un-
known composition of lights, surface, and sensor (See Fig. 1-left), drawing inspi-
ration from recent data-driven photometric stereo methods [27,28]. While prior
works [20, 48] have considered the spectral ambiguity harmful and something
that must be resolved, we demonstrate that it can even be beneficial for a data-
driven approach as it compacts the input space and allows for the generation
of training data without a multispectral rendering framework. Trained on spec-
trally composed observations, our generic, physics-free architecture directly maps
a single multiplexed image with an order-agnostic, arbitrary number of channels
to object surface normals without the need for calibrating lights and sensors,
and without imposing severe constraints on surface reflectance and geometry.
By applying our method to individual frames of a video, dynamic surface recov-
ery via spectrally multiplexed photometric stereo in uncalibrated, uncontrolled
scenarios is achieved as illustrated in Fig. 1 (right).
While numerous benchmarks exist for conventional photometric stereo [54,58,
65], not a single benchmark is available for spectrally multiplexed PS. Therefore,
we have created the first real benchmark dataset, namely SpectraM14, for this
task. For comparative evaluations with calibrated methods such as [20, 48], we
carefully calibrated directional light sources of different wavelengths, including
their directions. We implemented five different difficulty settings by varying the
type of light sources (RGB vs NIR) and whether individual light sources were
illuminated independently or simultaneously, catering to both ideal conditions
without channel crosstalk and more realistic conditions with channel crosstalk.
Our contributions are summarized as follows: (1) We pioneer the use of spec-
trally multiplexed photometric stereo for recovering dynamic surfaces in uncali-
brated setups, employing a data-driven approach to overcome spectral ambiguity,
a significant barrier in prior work. (2) We introduce a unique, physics-free neural
network, SpectraM-PS (Spectrally Multiplexed PS), that recovers surface nor-
mals from a spectrally multiplexed image, capable of handling images with any
number of order-agnostic channels. (3) We demonstrate how spectral ambigu-
ity restricts the input space for training data generation, offering a strategy for
efficient dataset creation without the need for multispectral rendering. (4) We
create the first evaluation benchmark, SpectraM14, for this domain, showing our
method’s superiority over current calibrated spectrally multiplexed photometric
stereo techniques.
2 Related Work
Temporally Multiplexed Photometric Stereo (Conventional). From a
communication perspective, conventional photometric stereo, as originally pro-
posed by Woodham [66], employs a time multiplexing strategy to recover static
surfaces. This method involves temporally varying lighting conditions while cap-
turing images from a fixed viewpoint. Since the same light sources and sensor
always provide observations, image differences stem solely from changes in light
direction and intensity. This approach simplifies addressing complex conditions
such as cast shadows [29,30], non-Lambertian surfaces [18,31], non-convex sur-
faces [25], and uncalibrated lighting [12,38, 56]. Recently, learning-based meth-
ods [12–14, 25–27, 36, 38, 41–44, 46, 55, 60, 61, 70, 71] have emerged as an effec-
tive alternative, addressing challenges faced by traditional, physics-based ap-
proaches [7, 9, 18, 21, 23, 30, 47, 67]. These data-driven methods regress normal
maps from observations utilizing techniques such as observation map regres-
sion [25, 32], set pooling [13, 14], graph neural networks [70], Transformer [26],
and neural rendering for inverse rendering optimization [42,43, 60]. Notably, the
introduction of universal photometric stereo methods [27, 28] has enabled the
handling of unknown, spatially-varying lighting in a purely data-driven frame-
work. Inspired by these advancements, our work aims to regress normals from
observations under unknown light, surface, and sensor conditions.
Spectrally Multiplexed Photometric Stereo. Despite its potential for dy-
namic surface recovery, spectrally multiplexed photometric stereo [16, 39] has
remained less explored than its mainstream counterpart, primarily due to no-
table limitations.
Lighting Constraints: Existing methods necessitate multiple directional
lights in controlled settings, contrasting the flexibility of temporally multiplexed
techniques that adapt to diverse lighting conditions [28, 49, 51]. They typically
require pre-calibrated light directions and distinct light source spectra to prevent
channel crosstalk. In contrast, our approach accommodates uncontrolled lighting
scenarios without the need for predefined or calibrated setups.
Surface and Sensor Constraints: Prior works assume significant limita-
tions on surfaces, such as Lambertian, convex, and uniform properties [16,22,39].
Recent advances like Lv et al. [48] extend to non-Lambertian surfaces but still
require uniform materials. Sensor requirements typically involve narrow-band
spectral responses and a fixed number of channels, limiting flexibility. Our ap-
proach leverages a data-driven model, training neural networks on synthetic data
to handle complex surfaces and varied spectral sensor responses.
Data-driven Methods: To our knowledge, there are few data-driven meth-
ods for this task [34, 35, 48]. Previous studies, such as those by Ju et al. [34, 35],
require identical spectral and spatial lighting conditions during both training and
testing, greatly restricting their practicality. ELIE-Net [48] permits variability
in training and test setups; however, strong assumptions on both light and sur-
faces prohibitively limit its applications. Our model, on the other hand, eschews
explicit lighting models in favor of learning direct input-output relationships, al-
lowing for accurate predictions under varied and unknown spectral compositions
and supporting materials with spatially diverse properties. Furthermore, unlike
ELIE-Net’s reliance on spectral BRDF datasets, our training approach utilizes
assets akin to those employed in conventional photometric stereo.
3 Problem Statement
Given a single image I ∈ ℝ^{h×w×k} captured by a static k-channel orthographic sensor, along with an optional object mask M ∈ ℝ^{h×w}, the objective of spectrally multiplexed photometric stereo is to recover the surface normals of the object, N ∈ ℝ^{h×w×3}.³
Fig. 2: SpectraM-PS involves decomposing a spectrally multiplexed image into in-
dependent channels. The Global Feature Encoder extracts a feature map from each
channel. The surface normal is then recovered by the Dual-scale Surface Normal Decoder at each pixel. We adopt a dual-scale approach to preserve the entire shape, while
employing patch-embedding techniques to enhance local surface details.
The object is assumed to be illuminated by multiple
light sources, each with unique spatial and spectral properties. Previous studies
have typically assumed an equal number of light sources and sensor channels,
with each light source’s wavelength precisely matching the spectral response of
a single channel, thereby precluding any channel crosstalk, and with the di-
rections of lights predetermined. In contrast, we do not presuppose the spatial
distribution of illumination nor require the spectrum of each light source to be
exclusively aligned with the spectral responses of the sensor channels, thus per-
mitting channel crosstalk. This distinction is elaborated in subsequent sections.
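To make the problem interface concrete, the following minimal sketch (in NumPy-style Python; the function name and shape checks are illustrative, not part of our method) spells out what a solver receives and returns; note that no light directions, light spectra, or sensor sensitivities appear as inputs:

```python
import numpy as np

def check_problem_instance(image: np.ndarray, mask: np.ndarray | None,
                           normals: np.ndarray) -> None:
    """Shape contract of spectrally multiplexed PS (illustrative only)."""
    h, w, k = image.shape                    # k-channel multiplexed image
    assert mask is None or mask.shape == (h, w)
    assert normals.shape == (h, w, 3)        # per-pixel surface normals
    # Normals should be unit vectors inside the object mask.
    region = np.ones((h, w), dtype=bool) if mask is None else mask > 0
    lengths = np.linalg.norm(normals, axis=-1)
    assert np.allclose(lengths[region], 1.0, atol=1e-3)
```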
4 Method
We propose and tackle the problem of spectrally multiplexed photometric stereo
from a single image with multiple channels, produced by an unknown spec-
tral/spatial composition of the sensor, light, and surface. To build such a method,
we train neural networks to directly infer the normal map from an image.
Our method addresses two challenges: (1) a physics-free architecture that
accepts a varying number of spectral channels and is agnostic to their order, and
(2) an effective approximation of the spectrally multiplexed image for efficient
training. We consider (2) to be of significant importance, yet it remains largely
unexplored. Synthesizing spectrally multiplexed images in a physically accurate
manner is prohibitively challenging, owing to the increased complexity of their
parameter spaces and the scarcity of 3D assets with detailed spectral properties,
³ It should be noted that unlike conventional PS, reflectance recovery generally falls outside the scope of spectrally multiplexed PS due to its inherently ill-posed nature.
as well as the complex nature of light-surface interactions across different wave-
lengths. To address this issue, developing an efficient approximation method for
rendering spectrally multiplexed images using common RGB image rendering
techniques is crucial.
4.1 Physics-free Spectrally Multiplexed PS Network (SpectraM-PS)
The architecture of SpectraM-PS is illustrated in Fig. 2. Drawing inspiration
from established Transformer-based photometric stereo networks [26–28], we in-
tegrate an encoder to first extract the global features and a decoder to estimate
per-pixel surface normals. The architecture derives normals solely from the in-
put image and mask, without prior light information. This indicates that the
architecture focuses the network’s learning objective on the relationship between
input and output without relying on physics-based principles, unlike prior works.
In our model, all interactions among features from different sensor channels are carried out by a naïve Transformer [63], similar to [26–28]. The Transformer
functions by mapping input features to query, key, and value vectors of equal di-
mensions. These vectors are processed through a multi-head self-attention mech-
anism, utilizing a softmax layer, followed by a feed-forward network comprising
two linear layers. Both the input and output layers maintain identical dimension-
ality, with the inner layer having twice the dimension of the input. Each layer
is surrounded by a residual connection, succeeded by layer normalization [69].
The advantage of employing Transformers in photometric stereo networks lies in
their capability to facilitate complex interactions among intermediate features,
a task unachievable with simple operations like pooling [12–14, 48] and observation maps [25, 32]. Additionally, the token-based attention mechanism allows for a different number of input tokens (i.e., sensor channels) between the training and test phases and ensures that the results are independent of the order of tokens.
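As a concrete illustration, a minimal PyTorch sketch of one such Transformer layer is given below; the class name is illustrative, and the GELU activation is our assumption (the text does not specify the feed-forward activation):

```python
import torch
import torch.nn as nn

class ChannelAxisTransformerLayer(nn.Module):
    """Multi-head self-attention over channel tokens, a feed-forward network
    whose inner layer is twice the token dimension, and residual connections
    each followed by layer normalization (post-LN), as described above."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 2 * dim),  # inner layer: twice the input dimension
            nn.GELU(),                # activation assumed, not specified above
            nn.Linear(2 * dim, dim),
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, k, dim), one token per sensor channel. No positional
        # encoding is added, so k may vary between training and testing and
        # the output is agnostic to the order of the channel tokens.
        attn_out, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm1(tokens + attn_out)          # residual + layer norm
        tokens = self.norm2(tokens + self.ffn(tokens))  # residual + layer norm
        return tokens
```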
Building on the established Transformer-based architecture [28] for tempo-
rally multiplexed photometric stereo, we extend its scope to a spectrally multi-
plexed one. To accommodate a variable number of channels and eliminate de-
pendency on their order, an input spectrally multiplexed image is first split into
individual channels, each of which is concatenated with an object mask (if no mask is provided, a matrix of ones is used) and then fed into a shared encoder. This approach is distinctly different from traditional methods that feed the input image into the network as a whole [34, 35].
Then, during preprocessing, we normalize each channel by dividing it by a random value between its maximum and mean. Each channel and mask are resized or cropped to a resolution (c×c) that is a multiple of 32 before being input into the multi-scale encoder. The Global Feature Encoder first applies a backbone network (i.e., ConvNeXt-T [45]) to individually encode the concatenation of each channel and mask, then uses Transformer layers for channel-axis (i.e., sensor channel) feature communication across scales (the number of Transformer layers is {0, 1, 2, 4} at the {1/4, 1/8, 1/16, 1/32} scales, with hidden dimensions equal to the input dimensions), and finally applies a feature pyramid network [68] to integrate features
at different levels. Note that the design of the encoder is almost the same as in [27, 28], except that images are replaced by sensor channels, so details are omitted.
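A minimal sketch of this channel decomposition and normalization, assuming NumPy (the function name is illustrative; the resizing/cropping to a multiple of 32 is omitted):

```python
import numpy as np

def split_and_normalize(image: np.ndarray, mask: np.ndarray | None,
                        rng: np.random.Generator) -> list[np.ndarray]:
    """image: (h, w, k) multiplexed image; mask: (h, w) object mask or None."""
    h, w, k = image.shape
    if mask is None:
        mask = np.ones((h, w), dtype=image.dtype)  # substitute all-ones mask
    inputs = []
    for c in range(k):  # one encoder input per sensor channel, order-agnostic
        ch = image[..., c]
        # Divide by a random value between the channel's mean and maximum.
        scale = rng.uniform(ch.mean(), ch.max())
        ch = ch / max(scale, 1e-8)
        inputs.append(np.stack([ch, mask], axis=-1))  # concat channel + mask
    return inputs  # list of k arrays, each (h, w, 2), fed to a shared encoder
```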
Given global features in ℝ^{k×(c/4)×(c/4)×256}, our novel Dual-scale Surface Normal Decoder adopts a dual-scale strategy for predicting point-wise surface normals at m (i.e., 2048) sampled locations at the original resolution within the object mask. The first branch recovers low-frequency surface normals at the feature map resolution (c/4 × c/4). Concretely, all global features corresponding to each sample location are processed by five channel-axis Transformer layers (with a 256 hidden dimension) and are pooled via Pooling-by-Multihead-Attention (PMA) [40] using an additional channel-axis Transformer layer (with a 384 hidden dimension). To enhance spatial communication, two spatial-axis Transformer layers (with a 384 hidden dimension) inspired by Ikehata [28] are employed (i.e., the Transformer is applied among samples at different locations), with a final MLP (384→192→3) predicting the low-frequency normals at the sampled locations. The second branch focuses on high-resolution normal recovery, using patch embedding for local context at the same m locations, with w×w patches (w = 21) processed by an MLP (with a 256 hidden dimension) and two layer norms. These patches, concatenated with bilinearly interpolated global features, pass through five channel-axis Transformer blocks (with a 256 hidden dimension), PMA (with a 384 hidden dimension), and are merged with the first-branch output normals into 387-dimensional vectors. Two additional spatial-axis Transformer layers (with a 384 hidden dimension) enable non-local interactions, culminating in a final MLP (384→192→3) for high-resolution normals, normalized to unit vectors. The complete normal map is formed by merging all the vectors from the different sample sets.
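For reference, PMA [40] aggregates the variable number of channel tokens into a single vector by letting a learnable seed attend over them; a simplified PyTorch sketch (omitting the feed-forward refinement of the original Set Transformer formulation):

```python
import torch
import torch.nn as nn

class PMA(nn.Module):
    """Pooling-by-Multihead-Attention [40], simplified: a learnable seed
    vector cross-attends over the k channel tokens, yielding one pooled
    feature per location."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.seed = nn.Parameter(torch.randn(1, 1, dim))  # learnable query
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, k, dim) -> pooled: (batch, dim)
        query = self.seed.expand(tokens.shape[0], -1, -1)
        pooled, _ = self.attn(query, tokens, tokens)
        return pooled.squeeze(1)
```

Because the pooled output comes from attention over an unordered token set, it is invariant to the order and number of channels, matching the requirements above.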
It should be noted that SDM-UniPS [28] targets temporally multiplexed PS with tens of images, and its decoder performs normal estimation purely on a per-pixel basis. In contrast, spectrally multiplexed PS deals with fewer channels (e.g., three with RGB sensors), making a pixel-basis architecture less effective.
Therefore, we use patch embedding at the patch-basis decoder to capture fine
details with a dual-scale architecture for preserving overall shape. Without a
dual-scale design, the recovery of surface normals becomes overly influenced by
local image textures captured through patch embedding. This leads to a failure
in preserving the entire shape, resulting in a significant reduction in accuracy.
Our motivation is supported by Fig. 3 (left), where SDM-UniPS [28] fails to re-
cover fine details with six temporally multiplexed images, while our architecture
produces a more plausible normal map.
4.2 Efficient Training Strategy Utilizing Spectral Ambiguity
Aligning the training and test data domains in neural networks is essential for op-
timal model performance [10,62]. However, rendering spectrally multiplexed data
poses challenges due to the scarcity of multispectral Bidirectional Reflectance
Distribution Functions (BRDFs). In reality, ELIE-Net [48] was trained using
only 51 measured isotropic spectral BRDFs. On the other hand, given the availability of various large isotropic BRDF databases [1–3, 5, 50], we seek to explore
Fig. 3: (Left) Comparison of SpectraM-PS and SDM-UniPS [28] on six temporally
multiplexed PS images. Due to the patch-wise basis of SpectraM-PS, fine details are
better recovered. (Right) Illustration of different lighting conditions in PS-Multiplex.
utilizing these datasets for training our model, leveraging the fact that our net-
work does not distinguish images based on their physically-based principles. In
this section, we highlight how RGB images serve as a practical approximation,
simplifying the complexity inherent in multispectral imaging.
We begin the discussion by characterizing multispectral imaging. Assuming that the surface does not emit light and that only surface reflections are considered, the image formation model is described as follows [37]:

I_{(s,p)} = ∫_Ω (ω_i^⊤ n_p) ∫_0^∞ S_s(λ) f_p(ω_i, ω_o, λ) L_p(ω_i, λ) dλ dω_i.   (1)
In this equation, I_{(s,p)} denotes the incoming spectral radiance at the sensor s (or s-th channel) from a surface point p. The term f_p represents the BRDF, L_p the incident light intensity at the surface point, and λ the wavelength of the incident light. The symbols ω_i and ω_o denote the directions of incident and reflected light, respectively. S_s(λ) refers to the spectral sensitivity of the sensor s at wavelength λ, n_p is the surface normal, and Ω represents the hemisphere over which incident light directions are possible. The integral sums over all incident directions and wavelengths. It is important to note that the incident light intensity L_p depends not only on the direct contribution from light sources but also on the visibility of light (e.g., attached and cast shadows) and indirect illumination.
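To make Eq. (1) concrete, a minimal NumPy sketch of its discretized form is given below; all array contents and names are illustrative placeholders, not the paper's data:

```python
import numpy as np

def observed_radiance(S, f, L, n_p, omega_i, d_lambda, d_omega):
    """Discretized Eq. (1) over incident directions and wavelength bins.
    S: (W,) sensor sensitivity per wavelength bin
    f: (D, W) BRDF per (incident direction, wavelength), fixed outgoing dir.
    L: (D, W) incident light intensity per (direction, wavelength)
    n_p: (3,) unit surface normal; omega_i: (D, 3) incident unit directions."""
    cos_term = np.clip(omega_i @ n_p, 0.0, None)   # (omega_i . n_p), clamped
    integrand = S[None, :] * f * L                 # S(l) f(w_i, w_o, l) L(w_i, l)
    return float(np.sum(cos_term[:, None] * integrand) * d_lambda * d_omega)
```

Only the product S_s · f_p · L_p enters the integrand, which is the source of the ambiguity discussed next.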
Eq. (1) illustrates the concept of spectral ambiguity, showing that an infinite number of combinations of S_s(λ), f_p(ω_i, ω_o, λ), and L_p(ω_i, λ) can result in the same spectral radiance, including narrowband compositions. In other words, with spectral ambiguity, a single observation I_{(s,p)} can encompass the observations for all spectral compositions that satisfy the equation (i.e., metamerism [24, 52]). This perspective justifies the theory of substituting multispectral images, which possess a broad parameter space, with narrowband RGB images. It is worth mentioning that channel crosstalk primarily affects the incident light intensity, consequently distorting the product S_s(λ) · f_p(ω_i, ω_o, λ) · L_p(ω_i, λ) in Eq. (1).
This implies that observations influenced by spectral crosstalk can still be equivalently represented using a narrowband setup under spectral ambiguity. In the experiments, we demonstrate that our model, trained on three narrowband observations, can be applied to multiplexed data with channel crosstalk. To realize this approximation, we rendered a large number of three-channel narrowband
this approximation, we rendered a large number of three-channel narrowband
images using the path-tracing algorithm in Blender [4], where up to 10-bounce
reflections are permitted, based on common 3D assets [2] for RGB rendering.
Following the rendering pipeline described in [28], we rendered objects by com-
bining three different lighting models: directional, point, and environmental (five
combinatorial settings in total as shown in Fig. 3). To simulate spectrally multi-
plexed images, we defined R, G, and B light sources and illuminated the surface
in a multiplexed manner. It is important to note that the rendered RGB images
are decomposed into three grayscale images, each of which was independently fed
into the network; therefore, any wavelength-dependent information is masked.
For material diversity, we adopted the method from [28], categorizing 897 Adobe-
Stock texture maps into three groups: 421 diffuse, 219 specular, and 257 metallic
textures. Four objects from a set of 410 3D AdobeStock models were randomly
selected and textured with these materials. This structured approach led to the
rendering of 106,374 multiplexed images along with their ground truth surface
normal maps, forming the ‘PS-Multiplex’ dataset.
5 SpectraM14 Benchmark Dataset
Due to the lack of a benchmark for spectrally multiplexed PS, we created the first comprehensive evaluation dataset, SpectraM14. This dataset in-
cludes 14 objects, each exhibiting a range of optical properties such as monochro-
matic or multicolored appearances and diffuse or specular reflections, as depicted
in Fig. 4. Our benchmark encompasses tasks under five distinct conditions, as
described later.
Imaging Setup. To acquire our dataset, we utilized a color camera (FLIR GS3-
U3-123S6C-C) and an NIR camera (FLIR GS3-U3-41C6NIR-C), both equipped
with a 50mm lens. For the NIR camera, we used narrowband filters with wave-
lengths of 750nm, 850nm, 880nm, 905nm, and 940nm, and the acquired images
were manually merged. Objects were placed 0.8m from the camera to approxi-
mate orthographic projection. Following conventional PS benchmarks [54,58,65],
data capture occurred in a controlled, dark environment with the scene draped
in black cloth to mitigate interreflection. The camera’s ISO sensitivity was min-
imized to enhance image quality. The imaging area was further isolated using
low-reflectance cloths to suppress inter-reflection. For each illumination condi-
tion, we collected six images under varying exposures to produce HDR input
images. For the evaluations throughout this paper, the images are cropped using
an object mask and resized to 512 px × 512 px.
Lighting. Six LED and three halogen light sources, positioned roughly 1 meter
from the object, provided illumination. We used the “Weeylite S05 RGB Pocket
Lamp” and the “NPI PIS-UHX-AIR” for lighting. This setup enabled the use of
Fig. 4: Objects in SpectraM14.
Fig. 5: Illustration of the five conditions in SpectraM14.
red, green, blue, yellow, magenta, cyan, and NIR lighting, with spectra validated
using a Hamamatsu Photonics Multichannel Analyzer C10027-01.
Calibration and Ground Truth Data. We measured the directions of lights
using specular reflections from a mirror sphere. Light intensity was standardized
across the visible spectrum by averaging RGB values from reflected light on a
white target. The ground truth normals were captured with a SHINING 3D
EinScan-SE scanner.
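For illustration, the light direction can be recovered from the highlight on the mirror sphere by reflecting the viewing direction about the sphere normal at the highlight pixel; a minimal sketch assuming an orthographic camera (names are illustrative):

```python
import numpy as np

def light_direction_from_highlight(highlight_xy, center_xy, radius_px):
    """Estimate a directional light from the specular highlight on a mirror
    sphere, assuming an orthographic camera looking along -z."""
    # Sphere surface normal at the highlight pixel.
    x = (highlight_xy[0] - center_xy[0]) / radius_px
    y = (highlight_xy[1] - center_xy[1]) / radius_px
    z = np.sqrt(max(1.0 - x * x - y * y, 0.0))
    n = np.array([x, y, z])
    v = np.array([0.0, 0.0, 1.0])           # view direction (toward camera)
    l = 2.0 * np.dot(n, v) * n - v          # mirror reflection of v about n
    return l / np.linalg.norm(l)
```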
Evaluation Procedure. The design philosophy of this benchmark is to assess
the robustness and adaptability of spectrally multiplexed PS methods under re-
alistic lighting conditions, accounting for variations in channel numbers and the
presence of spectral crosstalk. For a comprehensive evaluation, we designed tasks
under five distinct conditions as shown in Fig. 5: Condition 1: Color sensor, no
crosstalk condition: Six colors of light (red, green, blue, cyan, yellow, magenta)
were each independently illuminated and observed with an RGB sensor. After-
ward, the channels of RGB were averaged. Condition 2: Color sensor, weak
crosstalk condition: Three colors of light (red, green, blue) were simultaneously
illuminated and observed through each channel of the RGB sensor. Condition
3: Color sensor, strong crosstalk condition: Three colors of light (cyan, yellow,
magenta) were simultaneously illuminated and observed through each channel
of the RGB sensor. Condition 4: NIR sensor, no crosstalk condition: Lights at wavelengths of 750 nm, 850 nm, 880 nm, 905 nm, and 940 nm were each independently illuminated and observed with a monochrome sensor corresponding to each wavelength. Condition 5: NIR sensor, spatially-varying lighting condition:
New images were created by averaging two images taken under the conditions
mentioned above. The combinations were (750 nm, 850 nm), (850 nm, 880 nm),
(880 nm, 905 nm), (905 nm, 940 nm), and (940 nm, 750 nm).
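As a sketch of how the observations for two of these conditions are assembled from the raw captures (assuming NumPy; array names are illustrative):

```python
import numpy as np

def build_condition1(rgb_captures):
    """rgb_captures: list of six (h, w, 3) HDR images, one per LED color.
    Averaging the RGB channels gives one grayscale channel per light,
    yielding a six-channel, crosstalk-free observation of shape (h, w, 6)."""
    return np.stack([img.mean(axis=-1) for img in rgb_captures], axis=-1)

def build_condition5(nir_captures):
    """nir_captures: five (h, w) images at 750/850/880/905/940 nm.
    Averaging cyclic pairs simulates spatially-varying NIR multiplexing."""
    pairs = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
    return np.stack([(nir_captures[i] + nir_captures[j]) / 2
                     for i, j in pairs], axis=-1)
```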
6 Experiment
In this section, we evaluate our method on our SpectraM14. Our method is com-
pared with one SOTA optimization-based method [19] and one SOTA learning-
based method [48]. The former introduces a closed-form solution for spectrally
multiplexed photometric stereo applied to monochromatic surfaces with spa-
tially varying (SV) albedo. The latter presents a Spectral Reflectance Decompo-
sition (SRD) model, which disentangles spectral reflectance into geometric and
spectral components for surface normal recovery under non-Lambertian spectral
reflectance conditions. Unlike both compared methods, which presume the presence of calibrated single directional light sources, our approach does not assume a specific lighting setup.
Training details. SpectraM-PS was trained from scratch on the PS-Multiplex
dataset until convergence using the AdamW optimizer, with a step decay learn-
ing rate schedule that reduced the learning rate by a factor of 0.8 every ten
epochs. We applied learning rate warmup during the first epoch and used a
batch size of 16, an initial learning rate of 0.0001, and a weight decay of 0.05.
Each batch consisted of three input training multiplexed images with three chan-
nels each. The training loss was computed using the Mean Squared Error (MSE)
loss function to measure ℓ2 errors between the predicted surface normal vectors
and the ground truth surface normal vectors. We measured the reconstruction
accuracy of our method by computing the mean angular errors (MAE) between
the predicted and true surface normal maps, expressed in degrees.
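A minimal sketch of the training loss and evaluation metric described above (PyTorch for the loss, NumPy for the metric; function names are illustrative):

```python
import numpy as np
import torch

def mse_normal_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """MSE between predicted and ground-truth unit normals, each (m, 3)."""
    return torch.mean((pred - gt) ** 2)

def mean_angular_error_deg(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean angular error (degrees) between two (h, w, 3) unit normal maps."""
    cos = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())
```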
Computational Cost. The inference time of PS methods varies with the num-
ber of pixels and channels in the input image. For Conditions 2 and 3 with a 512×512×3 image, the mean and standard deviation of inference times (in seconds) over the 14 objects in the SpectraM14 benchmark were: our method (3.42/0.85), Lv et al. [48] (0.46/0.24), and Guo et al. [19] (2.38/1.10). Our architecture leads to higher computational costs; however, none of the methods were suitable for real-time processing (e.g., 15 fps requires about 0.067 sec/frame).
Ablation Study. We first validate the individual technical contributions of our training dataset (i.e., PS-Multiplex) and the physics-free architecture (i.e., SpectraM-PS) using a synthetic evaluation dataset. To validate the efficacy of PS-Multiplex, we adapt an existing universal photometric stereo architecture designed for the conventional task (i.e., SDM-UniPS [28]) to the spectrally multiplexed photometric stereo task. Since both
ours and SDM-UniPS take multiple observations and an object mask as input,
this adaptation straightforwardly involves training the model on PS-Multiplex
by treating each channel of an image as an individual image.
Table 1: Ablation analysis of the contributions of SpectraM-PS and PS-Multiplex. Values are MAE in degrees (std) over 100 scenes per material category; each column pair reports Non-Lambertian / Lambertian results.

Method                       | Uniform                 | Piece-wise uniform      | Non-uniform
SDM-UniPS [28]               | 12.9 (4.5) / 12.4 (4.7) | 15.0 (5.0) / 21.7 (7.0) | 14.4 (5.1) / 15.2 (5.9)
[28] trained on PS-Multiplex | 11.1 (3.6) / 11.0 (3.9) | 10.5 (2.9) / 12.3 (3.9) | 10.6 (3.9) / 11.2 (3.7)
SpectraM-PS (Ours)           |  8.0 (2.7) /  8.4 (2.7) |  7.9 (2.4) /  8.9 (2.9) |  8.2 (3.2) /  8.0 (2.5)
Subsequently, we compare this model against our proposed SpectraM-PS to demonstrate the efficacy of our dual-scale design with local patch embedding.
For evaluating the contribution of our architecture (SpectraM-PS) and train-
ing dataset (PS-Multiplex), we additionally rendered three-channel spectrally
multiplexed images representing six distinct surface material categories: (a) uni-
form, Lambertian; (b) piece-wise uniform, Lambertian; (c) non-uniform, Lam-
bertian; (d) uniform, non-Lambertian; (e) piece-wise uniform, non-Lambertian;
and (f) non-uniform, non-Lambertian. In uniform materials, every point on the
surface within a scene exhibits the same material properties. For piece-wise
uniform materials, each object in a scene is composed of the same material,
yet different objects possess distinct materials. Non-uniform materials feature
unique PBR textures assigned to each object. The rendering process for these
images was identical to that used for the PS-Multiplex datasets in each cate-
gory. We generated 100 scenes for each surface material category, and MAEs
(stds) are averaged over them. The results are presented in Tab. 1. In summary,
SDM-UniPS [28] trained on our PS-Multiplex dataset demonstrates proper adap-
tation to the spectrally multiplexed photometric stereo task. Nonetheless, our
SpectraM-PS method significantly enhanced reconstruction accuracy, showcas-
ing an architecture-level improvement over SDM-UniPS for the spectrally multi-
plexed photometric stereo task, where the number of input channels is typically
much fewer than that of input images for conventional PS.
Comparative Evaluation on SpectraM14. The results are illustrated in Tabs. 2
to 6 and Fig. 6. Although all existing spectrally multiplexed photometric stereo methods assume calibrated, known directional light sources, our proposed method significantly outperformed them. This is because most of the real objects used in our experiment are neither Lambertian nor convex and thus do not conform to those methods' assumptions, whereas our physics-free method recovered the normals stably for these objects. Furthermore, our proposed method enabled robust reconstruction
for all objects, despite having been trained only with RGB color images. This
result supports the efficacy of our approximation. Furthermore, unlike existing
methods that suffer from reduced estimation accuracy with increasing spectral
crosstalk, our approach demonstrates only minimal performance degradation.
Remarkably, our method excels in recovering a more realistic structure with
spatially-varying surface materials. This breakthrough implies that our network
can effectively achieve dynamic surface reconstruction across video frames in a
universal setting. Due to space constraints, not all results can be included here.
However, all results are comprehensively presented in the supplementary materi-
Table 2: Comparison in condition 1. The values are mean angular errors in degrees.
Method Object ID
1 2 3 4 5 6 7 8 9 10 11 12 13 14 Ave.
Ours 8.6 11.2 13.0 7.3 12.2 6.1 11.5 5.0 5.7 7.5 10.3 5.1 6.1 12.2 8.9
Lv et al. [48] 20.4 17.0 21.1 13.9 23.1 10.7 21.2 16.6 10.9 15.9 19.0 16.4 13.2 18.6 17.1
Guo et al. [19] 22.6 15.2 20.7 13.4 27.1 7.2 31.3 24.8 8.0 18.2 24.5 11.4 10.2 29.3 18.9
Table 3: Comparison in condition 2. The values are mean angular errors in degrees.
Method Object ID
1 2 3 4 5 6 7 8 9 10 11 12 13 14 Ave.
Ours 10.1 11.5 13.5 9.0 12.8 7.0 10.9 5.5 6.3 9.3 12.7 5.3 9.8 14.7 10.0
Lv et al. [48] 22.8 26.0 27.0 19.4 30.3 19.5 22.1 18.9 14.3 19.8 23.5 20.8 19.0 21.4 21.7
Guo et al. [19] 31.1 27.5 29.2 20.1 38.0 19.0 33.4 23.5 13.6 26.6 32.7 14.7 17.1 39.3 25.7
Table 4: Comparison in condition 3. The values are mean angular errors in degrees.
Method Object ID
1 2 3 4 5 6 7 8 9 10 11 12 13 14 Ave.
Ours 12.0 12.8 15.8 11.0 16.5 5.9 12.3 9.8 6.9 9.8 14.6 6.3 7.5 20.1 11.6
Lv et al. [48] 38.5 34.1 38.2 32.4 40.1 38.2 36.6 29.6 38.2 35.1 38.2 36.6 38.4 30.9 36.0
Guo et al. [19] 46.0 42.6 56.3 37.6 57.1 45.6 76.0 48.0 29.9 62.1 49.8 50.2 52.8 73.7 51.7
Table 5: Comparison in condition 4. The values are mean angular errors in degrees.
Method Object ID
1 2 3 4 5 6 7 8 9 10 11 12 13 14 Ave.
Ours 10.9 11.3 9.7 6.3 10.8 5.3 12.4 4.3 7.8 6.6 9.1 4.0 7.4 10.6 8.4
Lv et al. [48] 24.1 19.8 20.7 12.0 19.8 12.4 15.1 18.7 11.4 17.1 21.2 17.5 19.1 16.7 17.5
Guo et al. [19] 30.7 16.0 18.3 9.9 29.6 8.4 23.1 25.7 10.2 14.6 24.5 13.9 29.6 13.8 19.1
Table 6: Comparison in condition 5. The values are mean angular errors in degrees.
Method Object ID
1 2 3 4 5 6 7 8 9 10 11 12 13 14 Ave.
Ours 10.9 11.0 9.9 7.3 11.0 4.7 13.3 4.8 7.9 6.5 9.8 4.2 7.4 10.4 8.6
Lv et al. [48] 29.8 26.2 28.9 22.8 28.9 25.8 21.9 24.6 18.7 24.7 27.6 24.8 22.3 27.6 25.0
Guo et al. [19] 40.0 27.3 29.2 23.2 33.3 25.0 29.3 30.1 20.0 26.7 31.6 26.3 25.0 27.2 27.8
als. Additionally, the supplementary materials evaluate the impact of the spatial
distribution of light sources on the performance of the proposed method. We
also offer an in-depth discussion of each experimental condition therein.
7 Conclusion
In this work, we introduce an innovative approach to spectrally multiplexed
photometric stereo under unknown spatial/spectral composition. Turning spec-
tral ambiguity into a benefit, our method allows for the creation of training
data without the need for complex multispectral rendering. Our work signifi-
cantly broadens the scope for dynamic surface analysis, establishing a critical
advancement in the utilization of photometric stereo across multiple sectors. Our
proposed method exhibits several limitations. First, the normal maps reconstructed by our method for dynamic surfaces exhibit unstable temporal variation. This instability arises from factors such as motion blur in certain frames, image noise, or the influence of cast/attached shadows, which become more pronounced compared to conventional photometric stereo methods that utilize numerous images.
Fig. 6: Evaluation on SpectraM14. Full results are available in the supplementary.
To recover clean and temporally stable normal maps,
we may need to consider temporal consistency and more actively utilize monoc-
ular cues. Additionally, while our method targets dynamic surfaces, it currently
requires several seconds to up to ten seconds per RGB image, which is far from
real-time processing. Considering industrial applications in the future, acceler-
ating the processing speed is a crucial challenge.
References
1. 3D Textures - Free seamless PBR textures with Diffuse, Normal, Displacement,
Occlusion, Specularity and Roughness Maps. https://3dtextures.me/, accessed:
2024-03-07
2. Adobe Stock. https://stock.adobe.com/
3. AmbientCG - Free Public Domain PBR Materials. https://ambientcg.com/, ac-
cessed: 2024-03-07
4. Blender. https://www.blender.org/
5. Poliigon - A library of materials, and HDR’s for artists including free textures.
https://www.poliigon.com/, accessed: 2024-03-07
6. weeylitepro app on google play store. https://play.google.com/store/apps/
details?id=com.ruitianzhixin.weeylite2&pli=1, accessed: 2024-03-07
7. Alldrin, N., Mallick, S., Kriegman, D.: Resolving the generalized bas-relief ambi-
guity by entropy minimization. CVPR (2007)
8. Anderson, R., Stenger, B., Cipolla, R.: Color photometric stereo for multicolored
surfaces. ICCV (2011)
9. Basri, R., Jacobs, D., Kemelmacher, I.: Photometric stereo with general, unknown
lighting. International Journal of computer vision 72(3), 239–257 (2007)
10. Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., Vaughan, J.W.:
A theory of learning from different domains. Machine learning 79, 151–175 (2010)
11. Chakrabarti, A., Sunkavalli, K.: Single-image rgb photometric stereo with spatially-
varying albedo. In: 2016 Fourth International Conference on 3D Vision (3DV)
(2016)
12. Chen, G., Han, K., Shi, B., Matsushita, Y., Wong, K.K.K.: Self-calibrating deep
photometric stereo networks. In: 2019 IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR). pp. 8731–8739 (2019)
13. Chen, G., Han, K., Wong, K.Y.K.: Ps-fcn: A flexible learning framework for pho-
tometric stereo. ECCV (2018)
14. Chen, G., Waechter, M., Shi, B., Wong, K.Y.K., Matsushita, Y.: What is learned
in deep uncalibrated photometric stereo? In: European Conference on Computer
Vision. pp. 745–762. Springer (2020)
15. Decker, B., Kautz, J., Mertens, T., Bekaert, P.: Capturing multiple illumination
conditions using time and color multiplexing. CVPR (2009)
16. Drew, M.S.: Shape from color. Technical Report CSS/LCCR TR 92-07, School of
Computing Science, Simon Fraser University, Vancouver, BC (1992)
17. Edmund Optics: Edmund optics - optics, imaging, and photonics technology.
https://www.edmundoptics.com/, accessed: 2024-03-14
18. Goldman, D., Curless, B., Hertzmann, A., Seitz, S.: Shape and spatially-varying
brdfs from photometric stereo. In: ICCV (October 2005)
19. Guo, H., Okura, F., Shi, B.: Multispectral photometric stereo for spatially-varying
spectral reflectances. IJCV 130, 2166–2183 (2022)
20. Guo, H., Okura, F., Shi, B., Funatomi, T., Mukaigawa, Y., Matsushita, Y.: Multi-
spectral photometric stereo for spatially-varying spectral reflectances: A well posed
problem? In: CVPR. pp. 963–971 (2021)
21. Hayakawa, H.: Photometric stereo under a light source with arbitrary motion. JOSA A 11(11), 3079–3089 (1994)
22. Hernández, C., Vogiatzis, G., Brostow, G., Stenger, B., Cipolla, R.: Non-rigid pho-
tometric stereo with colored lights. In: ICCV. pp. 1–8 (2007). https://doi.org/
10.1109/ICCV.2007.4408939
23. Hertzmann, A., Seitz, S.: Example-based photometric stereo: shape reconstruction
with general, varying brdfs. IEEE TPAMI 27(8), 1254–1264 (2005)
24. Hill, B.: Color capture, color management, and the problem of metamerism: does
multispectral imaging offer the solution? In: Proc. SPIE. pp. 2–14. SPIE (1999)
25. Ikehata, S.: Cnn-ps: Cnn-based photometric stereo for general non-convex surfaces.
In: ECCV (2018)
26. Ikehata, S.: Ps-transformer: Learning sparse photometric stereo network using self-
attention mechanism. In: BMVC (2021)
27. Ikehata, S.: Universal photometric stereo network using global lighting contexts.
In: CVPR (2022)
28. Ikehata, S.: Scalable, detailed and mask-free universal photometric stereo. In:
CVPR (2023)
29. Ikehata, S., Aizawa, K.: Photometric stereo using constrained bivariate regression
for general isotropic surfaces. In: CVPR (2014)
30. Ikehata, S., Wipf, D., Matsushita, Y., Aizawa, K.: Robust photometric stereo using
sparse regression. In: CVPR (2012)
31. Ikehata, S., Wipf, D., Matsushita, Y., Aizawa, K.: Photometric stereo using sparse
bayesian regression for general diffuse surfaces. IEEE TPAMI 36(9), 1816–1831
(2014)
32. Ikehata, S.: Does physical interpretability of observation map improve photometric
stereo networks? In: ICIP (2022)
33. Ishio, H., Minowa, J., Nosu, K.: Review and status of wavelength-division-
multiplexing technology and its application. Journal of lightwave technology 2(4),
448–463 (1984)
34. Ju, Y., Dong, X., Wang, Y., Qi, L., Dong, J.: A dual-cue network for multispectral
photometric stereo. Pattern Recognition 100, 107162 (2020)
35. Ju, Y., Qi, L., Zhou, H., Dong, J., Lu, L.: Demultiplexing colored images for multi-
spectral photometric stereo via deep neural networks. IEEE Access 6, 30804–30818
(2018)
36. Ju, Y., Dong, J., Chen, S.: Recovering surface normal and arbitrary images: A dual
regression network for photometric stereo. IEEE Transactions on Image Processing
30, 3676–3690 (2021)
37. Kajiya, J.T.: The rendering equation. SIGGRAPH Comput. Graph. 20(4), 143–150
(1986)
38. Kaya, B., Kumar, S., Oliveira, C., Ferrari, V., Van Gool, L.: Uncalibrated neural inverse rendering for photometric stereo of general surfaces. In: CVPR. pp. 3804–3814 (2021)
39. Kontsevich, L.L., Petrov, A., Vergelskaya, I.: Reconstruction of shape from shading
in color images. Journal of the Optical Society of America pp. 1047–1052 (1994)
40. Lee, J., Lee, Y., Kim, J., Kosiorek, A., Choi, S., Teh, Y.W.: Set transformer: A
framework for attention-based permutation-invariant neural networks. In: ICML.
pp. 3744–3753 (2019)
41. Li, J., Robles-Kelly, A., You, S., Matsushita, Y.: Learning to minify photometric
stereo. In: CVPR (2019)
42. Li, J., Li, H.: Neural reflectance for shape recovery with shadow handling. In:
CVPR (2022)
43. Li, J., Li, H.: Self-calibrating photometric stereo by neural inverse rendering. In:
ECCV (2022)
44. Liu, H., Yan, Y., Song, K., Yu, H.: Sps-net: Self-attention photometric stereo net-
work. IEEE Transactions on Instrumentation and Measurement 70, 1–13 (2021)
45. Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR (2022)
46. Logothetis, F., Budvytis, I., Mecca, R., Cipolla, R.: Px-net: Simple and efficient
pixel-wise training of photometric stereo networks. In: CVPR. pp. 12757–12766
(2021)
47. Lu, F., Matsushita, Y., Sato, I., Okabe, T., Sato, Y.: Uncalibrated photometric
stereo for unknown isotropic reflectances. In: CVPR. pp. 1490–1497 (2013)
48. Lv, J., Guo, H., Chen, G., Liang, J., Shi, B.: Non-lambertian multispectral photometric stereo via spectral reflectance decomposition. In: IJCAI (2023)
49. Mecca, R., Rosman, G., Kimmel, R., Bruckstein, A.: Perspective photometric
stereo with shadows. In: Proc. of 4th International Conference on Scale Space
and Variational Methods in Computer Vision (2013)
50. Mitsubishi Electric Research Laboratories (MERL): MERL BRDF Database.
http://www.merl.com/brdf/, accessed: 2024-03-07
51. Mo, Z., Shi, B., Lu, F., Yeung, S.K., Matsushita, Y.: Uncalibrated photometric stereo under natural illumination. In: CVPR. pp. 2936–2945 (2018)
52. Nayatani, Y., Kurioka, Y., Sobagaki, H.: Study on color rendering and metamerism
(part 8). Journal of the Illuminating Engineering Institute of Japan 56(9), 529–536
(1972). https://doi.org/10.2150/jieij1917.56.9_529
53. Ozawa, K., Sato, I., Yamaguchi, M.: Single color image photometric stereo for
multi-colored surfaces. Computer Vision and Image Understanding 171, 140–149
(2018)
54. Ren, J., Wang, F., Zhang, J., Zheng, Q., Ren, M., Shi, B.: DiLiGenT10²: A photometric stereo benchmark dataset with controlled shape and material variation. In: CVPR. pp. 12581–12590 (June 2022)
55. Santo, H., Samejima, M., Sugano, Y., Shi, B., Matsushita, Y.: Deep photometric
stereo network. In: International Workshop on Physics Based Vision meets Deep
Learning (PBDL) in Conjunction with IEEE International Conference on Com-
puter Vision (ICCV) (2017)
56. Shi, B., Matsushita, Y., Wei, Y., Xu, C., Tan, P.: Self-calibrating photometric
stereo. In: CVPR (2010)
57. Shi, B., Tan, P., Matsushita, Y., Ikeuchi, K.: A biquadratic reflectance model for
radiometric image analysis. In: CVPR (2012)
58. Shi, B., Wu, Z., Mo, Z., Duan, D., Yeung, S.K., Tan, P.: A benchmark dataset and evaluation for non-lambertian and uncalibrated photometric stereo. In: CVPR (2016)
59. Silver, W.M.: Determining shape and reflectance using multiple images. Master’s
thesis, MIT (1980)
60. Taniai, T., Maehara, T.: Neural Inverse Rendering for General Reflectance Photo-
metric Stereo. In: ICML (2018)
61. Tiwari, A., Raman, S.: Deepps2: Revisiting photometric stereo using two differently
illuminated images. ECCV (2022)
62. Tzeng, E., Hoffman, J., Saenko, K., Darrell, T.: Adversarial discriminative domain
adaptation. In: CVPR (2017)
63. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser,
L., Polosukhin, I.: Attention is all you need. In: NeurIPS (2017)
64. Vogiatzis, G., Hernandez, C.: Self-calibrated, multi-spectral photometric stereo for
3d face capture. IJCV 56(97), 91–103 (2012)
65. Wang, F., Ren, J., Guo, H., Ren, M., Shi, B.: DiLiGenT-Pi: A photometric stereo benchmark dataset with controlled shape and material variation. In: ICCV (October 2023)
66. Woodham, R.J.: Photometric method for determining surface orientation from multiple images. Optical Engineering 19(1), 139–144 (1980)
67. Wu, L., Ganesh, A., Shi, B., Matsushita, Y., Wang, Y., Ma, Y.: Robust photometric
stereo via low-rank matrix completion and recovery. In: ACCV (2010)
68. Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene
understanding. In: ECCV. pp. 418–434 (2018)
69. Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan,
Y., Wang, L., Liu, T.: On layer normalization in the transformer architecture. In:
International Conference on Machine Learning. pp. 10524–10533. PMLR (2020)
70. Yao, Z., Li, K., Fu, Y., Hu, H., Shi, B.: Gps-net: Graph-based photometric stereo
network. NeurIPS (2020)
71. Zheng, Q., Jia, Y., Shi, B., Jiang, X., Duan, L.Y., Kot, A.: Spline-net: Sparse
photometric stereo through lighting interpolation and normal estimation networks.
ICCV (2019)
A Details of SpectraM14 Benchmark
In this section, we detail SpectraM14, the first benchmark dataset for spectrally
multiplexed photometric stereo. Firstly, we provide a detailed explanation of the
five task conditions included in the benchmark. Then, we offer comprehensive in-
formation about data acquisition. Finally, we discuss the spectral characteristics
of the sensors and light sources.
A.1 Details of Benchmark Tasks
Our SpectraM14 benchmark encompasses tasks under five distinct conditions,
as illustrated in Fig. 7. The details of these conditions are as follows.
Condition 1 (six channels)
Setup.
Color sensor, no crosstalk condition: Six colors of light (red, green, blue,
cyan, yellow, magenta) were each independently illuminated and ob-
served with an RGB sensor. Afterward, the channels of RGB were aver-
aged.
Motivation.
This condition evaluates the robustness to differences in channel num-
bers during training (i.e., three) and testing (i.e., six). It also simulates
an idealized multiplexing scenario without any channel crosstalk.
Condition 2 (three channels)
Setup.
Color sensor, weak crosstalk condition: Three colors of light (red, green,
blue) were simultaneously illuminated and observed through each chan-
nel of the RGB sensor.
Motivation.
Actual multiplexing is employed using RGB LEDs and a color sensor,
with LEDs’ spectral peaks generally aligning with sensor responses, al-
beit not narrowband, leading to weak channel crosstalk. This setup tests
the method’s ability to handle real-world multiplexing scenarios within
a typical RGB setup.
Condition 3 (three channels)
Setup.
Color sensor, strong crosstalk condition: Three colors of light (cyan,
yellow, magenta) were simultaneously illuminated and observed through
each channel of the RGB sensor.
Motivation.
The light source’s spectral distribution no longer uniquely matches the
RGB channels’ sensitivity, leading to strong channel crosstalk and in-
validating the assumption of a single directional light source. This setup
tests the method under more complex lighting scenarios than those as-
sumed by most existing spectrally multiplexed photometric stereo meth-
ods [19, 48].
Condition 4 (five channels)
Setup.
NIR sensor, no crosstalk condition: Lights at wavelengths of 750 nm, 850 nm, 880 nm, 905 nm, and 940 nm were each independently illuminated and observed with a monochrome sensor corresponding to each wavelength.
Motivation.
Evaluating spectral characteristics beyond visible light can address the
concern that learning-based methods trained on specific narrowband
wavelengths (e.g., RGB images) may struggle to effectively handle char-
acteristics of unknown wavelengths.
Condition 5 (five channels)
Setup.
NIR sensor, spatially-varying lighting condition: New images were cre-
ated by averaging two images taken under the conditions mentioned
above. The combinations were (750 nm, 850 nm), (850 nm, 880 nm),
(880 nm, 905 nm), (905 nm, 940 nm), and (940 nm, 750 nm).
Motivation.
This setup evaluates methods on NIR images under multiple light sources,
causing spatially-varying illumination where the assumption of a single
directional light source is no longer valid. From another perspective,
this simulates strong spectral multiplexing under NIR lighting and sen-
sors. Due to the practical difficulties of real multiplexing under NIR light,
a pseudo-environment is created by averaging multiple NIR images.
A.2 Details of Data Acquisition
In our imaging setup, six LED color light sources (Weeylite S05 RGB Pocket
Lamp) were positioned around the camera for conditions 1 to 3, and three halo-
gen light sources (NPI PIS-UHX-AIR), whose wavelengths range from visible to
NIR, were used for conditions 4 to 5, as shown in Fig. 8. The walls and floor were
covered with cloth made from low-reflectance material to minimize the effects of
inter-reflection. For conditions 4 and 5, an NIR image at each specific wavelength
(i.e., 750 nm, 850 nm, 880 nm, 905 nm, and 940 nm) was captured using halogen
light as the light source and placing a bandpass filter (Edmund Optics [17]) in
front of the camera lens. In condition 5, synthetic data was generated by av-
eraging two images from those captured in condition 4. Note that we utilized
three NIR lights for our convenience to smoothly change lighting conditions. In
reality, only a single NIR light was turned on for each capture.
The colors of the six LEDs used in our dataset—red, green, blue, yellow,
magenta, and cyan—were remotely controlled by the vendor’s software [6], and
their respective spectra, measured by the Hamamatsu Photonics Multichannel
Fig. 7: Illustration of task conditions.
Analyzer C10027-01, are shown in Fig. 9. The spectrum of a NIR light source
is also shown in Fig. 10. Halogen lights emit NIR light strongly, in addition to
visible light. By placing a rotating filter holder with multiple bandpass filters in
front of the camera lens, multiple near-infrared spectral images can be efficiently
acquired without the need for an NIR light source at each specific wavelength. The
bandpass filters have a full width at half maximum of 10 nm, so there is minimal
spectral crosstalk in the NIR images.
We utilized a color sensor (FLIR GS3-U3-123S6C-C) and an NIR sensor
(FLIR GS3-U3-41C6NIR-C), both equipped with a 50 mm lens and having a
linear radiometric response function. For each lighting condition, we captured
six images with different exposure times (50 ms, 100 ms, 150 ms, 200 ms, 300 ms,
and 500 ms), which we then combined to produce a single HDR image. This
process enables precise observation of strong specular reflections and shadowed
areas, and improves the intensity resolution of the image.
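As an illustration of the HDR merging step, with a linear radiometric response each exposure can be normalized by its exposure time and averaged with saturation-aware weights. The following is a minimal sketch of this standard procedure under our own assumptions (the paper does not specify its exact weighting scheme):

```python
import numpy as np

def merge_hdr_linear(images, exposures_ms, sat=0.98, eps=1e-6):
    """Merge multi-exposure captures from a sensor with a linear response.

    images: list of float arrays scaled to [0, 1]; exposures_ms: exposure times.
    Clipped pixels are excluded, so highlights come from short exposures and
    dark regions from long ones.
    """
    num = np.zeros_like(images[0], dtype=np.float64)
    den = np.zeros_like(images[0], dtype=np.float64)
    for img, t in zip(images, exposures_ms):
        w = (img < sat).astype(np.float64)  # ignore saturated observations
        num += w * img / t                  # per-image radiance estimate
        den += w
    return num / np.maximum(den, eps)
```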
To apply the calibrated methods (i.e., all prior methods except ours), the
direction of each light source is measured from the specular highlight on a
plastic sphere (see Fig. 11). Note that LED/halogen lights are
placed approximately 1 m from the object center (the object size is less than
10 cm) in roughly uniform directions, maintaining a light-object distance about
10 times larger than the object size to practically approximate a directional
lighting setup.

Fig. 8: Imaging setup for the SpectraM14 dataset in conditions 1–3: six LED lights (red, green, blue, yellow, magenta, cyan), a color sensor (FLIR GS3-U3-123S6C-C), an NIR sensor (FLIR GS3-U3-41C6NIR-C), a halogen light (NPI PIS-UHX-AIR), the object, and low-reflective cloth. In conditions 4–5, the LED light sources are replaced with halogen lights and a bandpass filter is installed in front of the camera lens.

Fig. 9: Spectra of the six LED light sources (light intensity vs. wavelength).
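To make the highlight-based light calibration concrete: the sphere geometry gives the surface normal at the highlight pixel, and reflecting the viewing direction about that normal yields the light direction. A minimal sketch under an orthographic viewing assumption (our simplification; the paper does not detail its calibration code):

```python
import numpy as np

def light_from_highlight(px, py, cx, cy, radius):
    """Estimate a light direction from a specular highlight on a sphere.

    (px, py): highlight pixel; (cx, cy): sphere center in the image;
    radius: sphere radius in pixels. Assumes an orthographic camera with
    viewing direction v = [0, 0, 1].
    """
    # Surface normal at the highlight from the sphere geometry
    # (image y-axis points down, hence the sign flip).
    x, y = (px - cx) / radius, -(py - cy) / radius
    z = np.sqrt(max(0.0, 1.0 - x * x - y * y))
    n = np.array([x, y, z])
    v = np.array([0.0, 0.0, 1.0])
    # Mirror reflection: the light lies at the reflection of v about n.
    l = 2.0 * np.dot(n, v) * n - v
    return l / np.linalg.norm(l)
```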
A.3 Spectral Analysis
In spectral multiplexing, channel crosstalk refers to the phenomenon where light
from sources of different wavelengths is observed in the same channel of a sensor.
This occurs when the sensor’s wavelength response characteristics are not suffi-
ciently discriminatory across channels, or when the spectra of the light sources
overlap. If channel crosstalk occurs, each channel will observe light from mul-
tiple sources, thus disrupting the single lighting assumption. Conditions 2 and
3 are settings designed to compare performance based on the degree of channel
crosstalk in actual spectral multiplexing scenarios. Indeed, Fig. 12 demonstrates
that while the spectra of the red, green, and blue LEDs hardly overlap, the
spectra of the cyan, yellow, and magenta LEDs significantly overlap, causing
substantial channel crosstalk. It is noteworthy that the occurrence of channel
crosstalk depends on both the spectrum of the light source and the spectral
sensitivity of the sensor. As shown in Fig. 13, there is also overlap in the sensor's
spectral response, which makes the influence of crosstalk between images more
pronounced. All the spectral properties of the lights and sensors were measured
using the Hamamatsu Photonics Multichannel Analyzer C10027-01.

Fig. 10: Spectra of the five NIR light sources with bandpass filters (light intensity vs. wavelength).
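To state the effect of crosstalk explicitly, the per-channel observation under a simplified Lambertian model can be written as follows (notation introduced here for illustration, not quoted from the main paper):

i_c = \sum_{k=1}^{K} w_{c,k} \, \max\!\left(0, \mathbf{n}^{\top}\mathbf{l}_k\right), \qquad w_{c,k} = \int s_c(\lambda)\, e_k(\lambda)\, \rho(\lambda)\, \mathrm{d}\lambda,

where s_c(\lambda) is the sensitivity of channel c, e_k(\lambda) the spectrum of source k, \rho(\lambda) the spectral reflectance, and \mathbf{n}, \mathbf{l}_k the surface normal and the direction of source k. The single-lighting assumption holds only when, for each channel c, the crosstalk weight w_{c,k} is non-negligible for exactly one k; overlapping light spectra (Fig. 12) or sensor responses (Fig. 13) make several w_{c,k} non-negligible at once.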
B Details of Main Results
In this section, we detail the main results of our paper. All inputs are illustrated
in Figs. 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38 and 40, and outputs are
illustrated in Figs. 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39 and 41. For
each figure, we provide a detailed discussion about the object and the results.
Fig. 11: Light source distributions of the LED and halogen light sources. (a) A sphere with strong specular characteristics. (b) Light source directions in conditions 1–3. (c) Light source directions in conditions 4–5 (with averaging).
Fig. 12: Spectra of the LEDs used in conditions 2–3 (light intensity vs. wavelength). (a) Condition 2: red, green, and blue LEDs. (b) Condition 3: yellow, magenta, and cyan LEDs.
C Analysis on Lighting Distributions
In this section, we investigate the impact of light source distribution on the pro-
posed method. Unlike conventional temporally multiplexed problems, spectrally
multiplexed photometric stereo requires consideration not only of the spatial dis-
tribution of light sources but also their spectral distribution. When the spectra
of different light sources are similar, each sensor is more likely to be influenced by
multiple sources (i.e., channel crosstalk). Consequently, the image captured by
each sensor shifts from ones based on directional light sources to those based on
spatially more complex lighting. Furthermore, when light sources are spatially
proximate, the shading variations in images by different sources decreases, com-
plicating the process of photometric stereo. On the other hand, in cases where
light sources are in completely different directions or have no wavelength over-
lap, the system becomes more susceptible to the effects of cast and attached
shadows, thus not necessarily improving performance. To investigate these non-
trivial relationships, we varied the light source distribution both spatially and
spectrally.
The process of controlling the spatial/spectral light source distribution is
illustrated in Fig. 42. In our experiment, three LED sources were used, with
spectral multiplexing.

Fig. 13: Sensor spectral response functions (response ratio vs. wavelength). (a) Color sensor: red, green, and blue channels. (b) NIR sensor.

For varying spatial distribution, the azimuth angle of the
light sources was kept constant, while the elevation angle was manually adjusted
in seven increments from roughly 0 to 90 degrees (see Fig. 42-(b), left). At lower
elevation angles, the light is projected horizontally relative to the object; at
higher angles, it is oriented more vertically, reducing the directional diversity
between sources. Additionally, each LED light source was initially set to max-
imum brightness for one of the primary colors (red, green, or blue) and mini-
mum for the others. Then, the intensity of the white LEDs was incrementally
increased, shifting each spectrum from its initial state to include other spectral
components (see Fig. 42-(b), right). This alteration transformed the lighting
from directional to a spatially more complex distribution. The white LED in-
tensity was increased linearly from zero, reaching its maximum at the eleventh
increment. We captured two objects (ID 11 and ID 13) from our SpectraM14
dataset; the captured images are shown in Fig. 42-(a).
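As an illustration of the spectral control, the sweep can be modeled as a linear blend between each primary spectrum and a white spectrum. A toy sketch follows; the Gaussian and flat spectra below are placeholders for illustration, not the measured LED curves:

```python
import numpy as np

wavelengths = np.linspace(400, 700, 301)
# Toy stand-ins: a narrow red LED peak and a flat white contribution.
red = np.exp(-0.5 * ((wavelengths - 630.0) / 15.0) ** 2)
white = np.full_like(wavelengths, 0.3)

# Eleven increments: zero white light at the first step, maximum at the eleventh.
alphas = np.linspace(0.0, 1.0, 11)
swept_spectra = [red + a * white for a in alphas]  # spectrum used per increment
```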
The results are illustrated in Figs. 43 and 44. The rows of the tables represent
the elevation angles of the light sources (all three share the same elevation
angle), and the columns indicate the amount of added white LED light. Higher
accuracy is generally observed at higher elevation angles than at lower ones.
However, the optimal elevation angle is not the highest but slightly lower,
indicating a performance decline when the directional overlap between light
sources is maximized. Furthermore, the MAE increasing towards the right side
of the table demonstrates that wavelength overlap between light sources also
degrades performance. Consequently, better accuracy is more likely when the
light sources are spectrally well separated (i.e., distinct red, green, and blue)
while being multiplexed. This trend remains consistent even for objects like
Object ID 11, which predominantly reflects red light. Although a general trend
in the spatial and spectral distribution is observable, the optimal combination
varies with each object, making it difficult to consistently identify the best
setup. As a future research direction, automatically determining the optimal
light source distribution is a significant and worthwhile challenge.
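For reference, the MAE reported in these tables is the standard mean angular error between predicted and ground-truth normal maps over the object mask; a minimal sketch (variable names are ours):

```python
import numpy as np

def mean_angular_error_deg(n_pred, n_gt, mask):
    """Mean angular error in degrees over a boolean object mask.

    n_pred, n_gt: (H, W, 3) normal maps; both are re-normalized for safety.
    """
    n_pred = n_pred / np.linalg.norm(n_pred, axis=-1, keepdims=True)
    n_gt = n_gt / np.linalg.norm(n_gt, axis=-1, keepdims=True)
    cos = np.clip(np.sum(n_pred * n_gt, axis=-1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos))[mask].mean())
```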
D Application: Dynamic Surface Recovery
On our project website4, we demonstrate dynamic surface recovery by applying
our method to each frame of a video captured under spectrally multiplexed illu-
mination. We captured dynamic scenes using a Grasshopper3 RGB color sensor
under multiplexed illumination and applied our proposed method to individual
video frames. We employed four Neewer RGB168 lights to create random, non-
uniform spatial and spectral illumination distributions, and captured the scenes
against a black, low-reflective background (though not in a dark room).
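Conceptually, the demonstration requires no temporal information: each frame is a single spectrally multiplexed exposure, so frames are processed independently. A minimal sketch of the per-frame loop (the model interface and the choice of OpenCV for video I/O are our assumptions, not the released code):

```python
import cv2
import numpy as np

def recover_video_normals(model, video_path, mask):
    """Apply a single-shot normal estimator to every frame of a video.

    model: hypothetical callable mapping an (H, W, 3) float RGB frame in
    [0, 1] to an (H, W, 3) normal map; mask: (H, W) boolean object mask.
    """
    cap = cv2.VideoCapture(video_path)
    normals = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
        normals.append(model(rgb) * mask[..., None])
    cap.release()
    return normals
```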
We note that we used a color sensor for this demonstration because we could
not find a budget-friendly multispectral sensor available to us that is suitable
for dynamic surface reconstruction with spectrally multiplexed photometric
stereo. Although multispectral and hyperspectral cameras provide broader
channel measurements, their high cost (e.g., $50,000 for the EBA NH7, as noted
in [19]) and slow frame rates (less than 1 fps) restrict their application to dy-
namic scenes. Consequently, optimizing for affordable sensors with fewer chan-
nels yet greater sensitivity and speed is crucial for practical applications.
4 https://github.com/satoshi-ikehata/SpectraM-PS-ECCV2024
Fig. 14: Input of Object ID 1 (surface appearance: multicolor, broad specular). A toy astronaut with a simplistic yet recognizable space suit design made from a glossy plastic material. The dark blue helmet has a starry design. The object overall lacks intricate details. Each input figure shows the per-channel observations: Cond. 1 (Ch. 1–6), Cond. 2–3 (Ch. 1–3 plus the multiplexed image), and Cond. 4–5 (Ch. 1–5), alongside the surface appearance.
Fig. 15: Output of Object ID 1. Error maps are color-coded from 0° to 90°. MAE in degrees (Ours / Guo2022 / Lv2023): Cond. 1 (separate, 6ch, no crosstalk): 8.6 / 20.4 / 22.6; Cond. 2 (multiplex, 3ch, low crosstalk): 10.1 / 22.8 / 31.1; Cond. 3 (multiplex, 3ch, high crosstalk): 12.0 / 38.5 / 46.0; Cond. 4 (separate, 5ch, no crosstalk): 10.9 / 24.1 / 30.7; Cond. 5 (separate, 5ch, high crosstalk; two channels from Cond. 4 averaged): 10.9 / 29.8 / 40.0. The dark blue helmet exhibits very low reflectivity, showing low brightness across all wavelength ranges. The brightness values decrease further as a result of multiplexing in Cond. 2, 3, and 5, making the problem more challenging due to the lack of uniformity in quality. The proposed method successfully recovers the difficult helmet section under every condition, in contrast to prior methods [19, 48], which show significant errors, especially in the helmet area.
Fig. 16: Input of Object ID 2 (surface appearance: monochrome, diffuse). A keychain resembling a piece of bread. It is made from a soft, foam-like material, giving it a spongy texture, and has a matte finish. The shape is cylindrical with a series of horizontal indentations that mimic the appearance of sliced bread. It has a mostly uniform light orange color.
Fig. 17: Output of Object ID 2. MAE in degrees (Ours / Guo2022 / Lv2023): Cond. 1: 11.2 / 17.0 / 15.2; Cond. 2: 11.5 / 26.0 / 27.5; Cond. 3: 12.8 / 34.1 / 42.6; Cond. 4: 11.3 / 19.8 / 16.0; Cond. 5: 11.0 / 26.2 / 27.3. The surface reflectivity is high at red wavelengths, resulting in high observed intensity in the red channel but relatively low intensity in the other channels. Therefore, in Cond. 2 and 3, the number of reliable channels per pixel is reduced, making this a more challenging object than it appears. Diffuse reflection is dominant, making the object poorly suited to Lv2023 [48], which relies on specular reflection. The proposed method shows stable recovery even under such challenging conditions.
Fig. 18: Input of Object ID 3 (surface appearance: multicolor, sparse specular). A statue of Buddha. The material is glossy and reflective, indicative of a glazed ceramic finish. It is painted in vibrant colors, with the figure dressed in a bright red robe and holding what appears to be a gold ingot. The figure is seated, with exposed belly and feet, adding to the intricate detail of the statuette.
Fig. 19: Output of Object ID 3. MAE in degrees (Ours / Guo2022 / Lv2023): Cond. 1: 13.0 / 21.1 / 20.7; Cond. 2: 13.5 / 27.0 / 29.2; Cond. 3: 15.8 / 38.2 / 56.3; Cond. 4: 9.7 / 20.7 / 18.3; Cond. 5: 9.9 / 28.9 / 29.2. The surface has high red reflectance and appears glossy. Its concave shape enhances shadows and reflections. Contrary to expectations, Lv2023 [48] underperforms Guo2022 [19] on surfaces with specular reflections. The proposed method sees a minor drop in accuracy in the red areas across conditions with a color sensor, but using an NIR sensor improves the results due to the high reflectance. Prior methods [19, 48] struggle with inter-reflections and cast shadows, leading to accuracy loss.
Fig. 20: Input of Object ID 4 (surface appearance: multicolor, sparse specular). A ceramic cat. It has a glossy finish indicative of glazed ceramic. The cat is stylized with a rounded, simplified form. The colors are soft, with pastel pinks and whites, and there are golden accents on the ears, paws, and a medallion on its chest.
Fig. 21: Output of Object ID 4. MAE in degrees (Ours / Guo2022 / Lv2023): Cond. 1: 7.3 / 13.9 / 13.4; Cond. 2: 9.0 / 19.4 / 20.1; Cond. 3: 11.0 / 32.4 / 37.6; Cond. 4: 6.3 / 12.0 / 9.9; Cond. 5: 7.3 / 22.8 / 23.2. This object is characterized by a discrepancy between the continuity of its geometry and that of its material. Networks trained to align texture boundaries with geometric boundaries tend to produce artifacts on such objects. The proposed method achieves very stable recovery of surface normals under all conditions, except for a slight error increase in Cond. 2 and 3. On the other hand, Lv2023 [48] and Guo2022 [19] struggle significantly at texture boundaries, showing the detrimental effects of their uniform-material assumption.
Fig. 22: Input of Object ID 5 (surface appearance: multicolor, diffuse). A plastic figurine depicting a cat dressed as a clown, set in a playful pose atop a pumpkin. The cat is adorned with a clown's hat and collar, painted in vibrant colors such as purple, green, yellow, and red.
Fig. 23: Output of Object ID 5. MAE in degrees (Ours / Guo2022 / Lv2023): Cond. 1: 12.2 / 23.1 / 27.1; Cond. 2: 12.8 / 30.3 / 38.0; Cond. 3: 16.5 / 40.1 / 57.1; Cond. 4: 10.8 / 19.8 / 29.6; Cond. 5: 11.0 / 28.9 / 33.3. This object poses challenges due to its particularly complex reflectance distribution. The high reflectance in the red and green wavelength ranges, combined with a black surface that lowers the signal-to-noise ratio and a complex shape, makes it one of the most challenging of the 14 objects. Indeed, Lv2023 [48] and Guo2022 [19] exhibit significant errors under all conditions. In contrast, the proposed method, while experiencing some accuracy degradation in non-convex areas under Cond. 2 and 3, does not encounter significant issues with the non-uniform materials.
Fig. 24: Input of Object ID 6 (surface appearance: monochrome, diffuse). A plaster Daruma doll. The doll is characterized by a round shape, uniform white color, and a face with simplistic features. The material, being plaster, has a matte finish.
Fig. 25: Output of Object ID 6. MAE in degrees (Ours / Guo2022 / Lv2023): Cond. 1: 6.1 / 10.7 / 7.2; Cond. 2: 7.0 / 19.5 / 19.0; Cond. 3: 5.9 / 38.2 / 45.6; Cond. 4: 5.3 / 12.4 / 8.4; Cond. 5: 4.7 / 25.8 / 25.0. This object is ideal for photometric stereo due to its uniform color, Lambertian material, and mostly convex shape. Indeed, prior methods such as Lv2023 [48] and Guo2022 [19] demonstrate stable surface normal recovery. However, challenges still arise under conditions with channel crosstalk, namely Cond. 2, 3, and 5. The difficulty of obtaining accurate results even under such relatively ideal circumstances highlights the potentially unrealistic constraints under which previous methods were developed.
Fig. 26: Input of Object ID 7 (surface appearance: multicolor, diffuse). A figurine of the artist Vincent van Gogh. This stylized representation features an orange beard, blue hair, and a green coat. The material is diffusive.
Fig. 27: Output of Object ID 7. MAE in degrees (Ours / Guo2022 / Lv2023): Cond. 1: 11.5 / 21.2 / 31.3; Cond. 2: 10.9 / 22.1 / 33.4; Cond. 3: 12.3 / 36.6 / 76.0; Cond. 4: 12.4 / 15.1 / 23.1; Cond. 5: 13.3 / 21.9 / 29.3. At first glance, this object appears to be highly sensitive in the blue and green channels. However, visualization under Cond. 1, 2, and 3 reveals that the reflectivity of the clothes is very low for a color sensor, posing challenges for Lv2023 [48] and Guo2022 [19]. While our method successfully recovers the details of the beard and hat, prior methods such as Lv2023 [48] struggle with these fine details.
Fig. 28: Input of Object ID 8 (surface appearance: almost monochrome, broad specular, metallic). A gold-painted rabbit figurine. It features a simplified, stylized form with minimal details. The golden finish results in broad specular reflections.
Fig. 29: Output of Object ID 8. MAE in degrees (Ours / Guo2022 / Lv2023): Cond. 1: 5.0 / 16.6 / 24.8; Cond. 2: 5.5 / 18.9 / 23.5; Cond. 3: 9.8 / 29.6 / 48.0; Cond. 4: 4.3 / 18.7 / 25.7; Cond. 5: 4.8 / 24.6 / 30.1. The material is almost uniform, yet slight non-uniformity is observed due to the floral pattern. In Cond. 1, 4, and 5, where the number of channels is relatively high, our method achieves particularly high reconstruction accuracy among the objects. In spectral multiplexing setups such as Cond. 2 and 3 with a color sensor, slight decreases in accuracy are observed where the texture changes. Notably, due to its non-uniform and non-Lambertian nature, accurate recovery of surface normals poses challenges for methods like Guo2022 [19] and even Lv2023 [48].
Fig. 30: Input of Object ID 9 (surface appearance: monochrome, sparse specular). A figurine in the shape of a hand. It is made of a white, smooth, and glossy ceramic. The hand is displayed in an open position, with fingers slightly spread.
Fig. 31: Output of Object ID 9. MAE in degrees (Ours / Guo2022 / Lv2023): Cond. 1: 5.7 / 10.9 / 8.0; Cond. 2: 6.3 / 14.3 / 13.6; Cond. 3: 6.9 / 38.2 / 29.9; Cond. 4: 7.8 / 11.4 / 10.2; Cond. 5: 7.9 / 18.7 / 20.0. This object features a completely uniform material and exhibits peaky specular reflections due to its smooth surface. Not only does the proposed method recover surface normals with reasonable accuracy, but prior methods also perform well when a sufficient number of channels (e.g., five or six) is provided (i.e., Cond. 1 and 4). However, a clear degradation in their accuracy is observed in the presence of channel crosstalk, as demonstrated in Cond. 2, 3, and 5.
Fig. 32: Input of Object ID 10 (surface appearance: multicolor, broad specular). A figurine of a reindeer adorned with a red and white Christmas hat. It holds a green ornament and wears a red bow tie. The body of the reindeer is white with a quilted texture, and there is a yellow bell at the center. The material is ceramic with a glossy finish.
Fig. 33: Output of Object ID 10. MAE in degrees (Ours / Guo2022 / Lv2023): Cond. 1: 7.5 / 15.9 / 18.2; Cond. 2: 9.3 / 19.8 / 26.6; Cond. 3: 9.8 / 35.1 / 62.1; Cond. 4: 6.6 / 17.1 / 14.6; Cond. 5: 6.5 / 24.7 / 26.7. This object is distinguished by its complex geometric shape, notably the red hat and the white knitted sweater, and its overall specular reflection. The hat in particular shows low brightness values even when observed with the NIR sensor, posing a challenge for recovery. Nevertheless, the proposed method demonstrates good performance even in Cond. 2 and 3, where the effective channels for the hat section are limited. In contrast, prior methods manage to recover relatively well in Cond. 1 and 4, but significant degradation in accuracy is observed under channel crosstalk.
Fig. 34: Input of Object ID 11 (surface appearance: multicolor, diffuse). A figurine depicting Santa Claus with a child. The materials consist of painted plastic, providing a matte finish and vibrant colors.
Fig. 35: Output of Object ID 11. MAE in degrees (Ours / Guo2022 / Lv2023): Cond. 1: 10.3 / 19.0 / 24.5; Cond. 2: 12.7 / 23.5 / 32.7; Cond. 3: 14.6 / 38.2 / 49.8; Cond. 4: 9.1 / 21.2 / 24.5; Cond. 5: 9.8 / 27.6 / 31.6. This object ranks among the most challenging of the 14 objects due to its non-uniform surface reflectance and complex non-convex shape. In particular, the knee region in Cond. 2 and 3 poses significant recovery challenges due to the combined effects of limited effective channels and cast shadows from the non-convex shape. The proposed method experiences some accuracy degradation in Cond. 2 and 3 compared to the other conditions, yet it still achieves sufficiently accurate results, unlike Lv2023 [48] and Guo2022 [19], whose performance is largely unsatisfactory.
Fig. 36: Input of Object ID 12 (surface appearance: monochrome, diffuse). A sheep figurine crafted from wood. The sheep's shape is stylized and simplistic, featuring soft curves. The surface material is nearly uniform, though the wood grain is slightly visible.
Fig. 37: Output of Object ID 12. MAE in degrees (Ours / Guo2022 / Lv2023): Cond. 1: 5.1 / 16.4 / 11.4; Cond. 2: 5.3 / 20.8 / 14.7; Cond. 3: 6.3 / 36.6 / 50.2; Cond. 4: 4.0 / 17.5 / 13.9; Cond. 5: 4.2 / 24.8 / 26.3. This object, a wooden sheep, is characterized by a mostly uniform diffuse surface, and its shape is relatively simple. In fact, in Cond. 1 and 4, even Lv2023 [48] and Guo2022 [19] manage to achieve somewhat accurate results. However, they exhibit significant performance degradation under conditions with channel crosstalk. In contrast, the proposed method consistently achieves very accurate recovery under all conditions.
Fig. 38: Input of Object ID 13 (surface appearance: monochrome, diffuse). A ceramic rabbit figurine featuring a smooth, glossy finish and a uniform material. While the design lacks intricate details, it includes some non-convex structures.
Fig. 39: Output of Object ID 13. MAE in degrees (Ours / Guo2022 / Lv2023): Cond. 1: 6.1 / 13.2 / 10.2; Cond. 2: 9.8 / 19.0 / 17.1; Cond. 3: 7.5 / 38.4 / 52.8; Cond. 4: 7.4 / 19.1 / 29.6; Cond. 5: 7.4 / 22.3 / 25.0. This object is made from a material that is almost uniform and exhibits very peaky highlights. The proposed method demonstrated very