DeepSelfie: Single-shot Low-light
Enhancement for Selfies
YUCHENG LU1, DONG-WOOK KIM1, AND SEUNG-WON JUNG2, (Senior Member, IEEE)
1Department of Multimedia Engineering, Dongguk University, 04620, Seoul, Republic of Korea
2Department of Electrical Engineering, Korea University, 02841, Seoul, Republic of Korea
Corresponding author: Seung-Won Jung (e-mail: swjung83@korea.ac.kr).
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by
the Ministry of Science, ICT and Future Planning (NRF-2020R1F1A1069009). This work was supported by a Korea University Grant.
ABSTRACT Taking a high-quality selfie photo in a low-light environment is challenging. Because
the foreground and background often have different illumination conditions, they suffer heavily from
over/under-exposure issues and cannot be treated in the same manner when applying image enhancement
algorithms. In this work, we propose DeepSelfie, a learning-based image enhancement framework for
low-light selfie photos. We address selfie enhancement as a dual-layer image enhancement problem. The
foreground and background are thus separately enhanced and combined together via image fusion. To
train the selfie enhancement network, we also introduce a method of synthesizing pairs of noisy and
dark raw selfie images and their corresponding well-illuminated images. Through extensive experiments
of no-reference image quality assessment as well as human subjective evaluation, we show that DeepSelfie
provides better results in comparison to several state-of-the-art methods. The code and datasets can be found
at https://sites.google.com/view/deepselfie.
INDEX TERMS Deep learning, image enhancement, low-light enhancement, selfie.
I. INTRODUCTION
The cameras on mobile phones make it convenient to
capture moments in our lives. In particular, most people
presently use the front cameras on their cellphones to take
selfies. Unlike the rear camera, which is well-optimized
to capture high-resolution images with high signal-to-noise
ratios (SNRs), the front camera is usually equipped with
a small aperture and sensor due to space limitations. This
configuration results in a lower dynamic range and SNR.
Furthermore, without a professional lighting setup, the fore-
ground subjects and background in a selfie photo often have
different illumination conditions, making it challenging to
capture decent selfies in environments with low lighting. The
photos captured in these environments are not only noisy, but
also suffer from over/under-exposure of either the foreground
or background.
Even though one can use commercially available photo
editing software, such as Lightroom [1], to enhance im-
ages through noise reduction, brightness enhancement, tone
adjustment, etc., the resultant image quality is often not
satisfactory for selfies captured in low-light conditions due
to differences in the lighting conditions in the foreground
and background. Recently, several methods [2]–[6] have
been proposed to adjust the local illumination and contrast;
however, they still result in relatively poor quality in terms of
the brightness, color, and details for low-light selfies.
To overcome this problem, we propose an end-to-end
image enhancement framework, called DeepSelfie, which is
designed for low-light selfies. To this end, we address selfie
enhancement as a dual-layer image enhancement problem.
The proposed selfie enhancement network first extracts the
features from a raw input image and then uses them to
estimate the enhancement parameters for the foreground
and background. The foreground and background layers are
separately enhanced and fused to generate the final output. To
train this network, we present a dataset called FiveKNight,
which consists of images captured under low-light environ-
ments from the FiveK dataset [7]. Binary masks are also
provided for human instances to be used for synthesizing
selfie images. Moreover, we design a learning-based image
synthesis method for generating realistic selfies in the raw
image domain with different illuminations, white balance,
and noise levels. By training the proposed selfie enhancement
network with our dataset, images can be obtained with a
higher contrast and brightness than those achievable using
the state-of-the-art methods. In summary, this study makes
three main contributions:
• We address low-light selfie enhancement as a dual-layer image enhancement problem. Thus, the foreground and background layers are selectively enhanced and combined to obtain an output image with satisfactory foreground and background quality.
• We introduce the FiveKNight dataset, which includes 733 images taken in low-light environments and 680 images containing humans from the FiveK dataset [7]. Each image pair in the dataset contains both the original and the expert-enhanced versions. We also provide a binary mask for each portrait image to help the selfie enhancement network process the foreground and background layers in a different manner.
• We propose a learning-based image synthesis method for generating realistic selfies in the raw domain with various illuminations, white balance, and noise levels.
II. RELATED WORKS
Several research topics are related to this work, includ-
ing portrait matting, raw image processing, and low-light
enhancement. This section focuses on deep learning-based
methods and presents a brief review.
A. PORTRAIT MATTING
The objective of portrait matting is to extract the human subject from selfies, in the form of either a per-pixel probability (alpha matte) or a binary mask, which is useful for many
applications such as background removal, style transfer, and
relighting. Based on the early work on semantic segmentation
using a fully convolutional network (FCN) [8], a new net-
work called PortraitFCN+ was proposed to incorporate a por-
trait prior [9]. Specifically, two coordinate maps derived from
the detected facial points and the canonical pose were pro-
vided to the portrait regression network as well as the input
image. Later, a two-stage network was designed [10], where
the first auto-encoder network predicts the coarse alpha map
from the image and its trimap, and the second network
refines the estimated alpha map for improved accuracy and
sharpness. Similarly, two decoder networks were employed
to obtain dual binary masks, and late fusion was performed
after converting the binary masks into soft mattes [11]. A
boundary-aware structure that focuses on the edges using
a boundary attention map was also proposed [12]. In [13],
a light-weight network design was considered for mobile
portrait animation, in which a light dense network predicts
a binary portrait mask and a feathering network converts it
into the final alpha matte.
B. RAW IMAGE PROCESSING
The raw image recorded by a camera sensor contains rich
information about the scene; therefore, direct raw image pro-
cessing is gaining attention in terms of image enhancement,
super-resolution, and denoising. For image denoising, a novel
approach was proposed to synthesize ground-truth raw im-
ages by inverting each step of the image signal processing
(ISP) pipeline and used to train a raw image denoising net-
work [14]. A two-stage network was also developed, where
a Poisson-Gaussian noise map was first estimated using the
noise estimation network and the estimated noise map was
then concatenated to be input to the subsequent denoising
network [15]. In [16], Bayer pattern unification was proposed
to denoise different Bayer patterns of raw images.
From the degraded raw image, multiple image processing
operations can be applied together. For example, in [17],
ISP was considered to involve a large collection of filters
and modeled by using a combination of deep learning and
imaging system simulation. In [18], image restoration and
color transfer were applied separately to avoid one-to-many
mapping issues between raw and color-corrected ground-
truth images. In [19], an end-to-end network was proposed
to perform demosaicing, denoising, and color correction
jointly. To improve the demosaicing and super-resolution
performance, a residual-in-residual dense network was pro-
posed with pixel-shifting that remaps a fixed Bayer pattern
to multiple combinations of color channels [20]. In [21],
image restoration and enhancement were separated into two
uncorrelated tasks and solved by a dedicated network for
each of them. A light-weight yet effective network structure
was also proposed to replace the ISP completely, using a
convolutional neural network (CNN) [22].
The remainder of this paper is organized as follows: Section III presents the proposed framework and the design of each sub-module; Section IV gives details of generating the training dataset; Section V provides the implementation settings and extensive experimental results; and finally, Section VI concludes the paper and discusses future work.
C. LOW-LIGHT ENHANCEMENT
With the remarkable success of CNNs in computer vision
tasks such as image classification and semantic segmentation,
researchers are now actively employing such networks for
low-light image enhancement. Driven by the research on
retinex theory-based image enhancement [23]–[27], several
works have revealed the potential advantages of using CNNs
for low-light image enhancement rather than hand-crafted
methods. In [28], [29], a dataset paired with low and bright
images was collected for training the retinex network that
performs reflectance and illumination decomposition, bright-
ness enhancement, and image denoising. Similarly, a network
was used to extract multi-scale retinex features and recon-
struct high quality images by discrete wavelet transforma-
tion [30]. Based on the illumination-invariant nature of object
reflectance, the decomposition network was trained using the
images captured under different illuminations [3]. In [31],
a dataset with multi-exposure images was collected, which
was used to train the enhancement network with the ground-
truth generated by merging these multi-exposure images. In
[32], a two-stream framework for both global content and
local detail enhancement was proposed with the introduction
of a spatially variant recurrent neural network. In [2], a
multi-channel illumination mapping problem was addressed,
where a down-sized illumination map was estimated for each color channel and bilateral up-sampling was used to obtain the illumination map with the original resolution.

FIGURE 1. Overall structure of the selfie enhancement network.

In [6],
a multi-branch network was trained to extract attention and
noise maps of underexposed regions sequentially. In addi-
tion, multiple parallel sub-networks were used with different
receptive fields to improve the image enhancement perfor-
mance. In [4], low-light image enhancement was modeled
as a set of localized adjustment curves that were estimated
using a Gaussian process and the features extracted from the
network. In [33], a two-way generative adversarial network
(GAN) was adopted to train the image enhancement network
using unpaired low and normal light images. Furthermore, in
[5], a novel local patch discriminator was employed to train
a better network without cycle consistency. In [34], a GAN-
based image enhancement network was applied only to the
illumination channel based on the retinex theory.
Low-light image enhancement can also be performed in
the raw domain. In [35], a dataset of raw images taken
in extremely low-light environments with short and long
exposure was introduced. They used a network based on
U-Net [36] to learn direct mapping from a low-light raw
image to a well-exposed and color-corrected sRGB image.
Their method was further extended to low-light video en-
hancement [37]. To achieve a higher peak signal-to-noise
ratio (PSNR) and a faster speed, direct mapping was changed
to residual mapping with attention modules [38]. In [39],
an image restoration network was proposed to convert a
dark raw image into a well-exposed sRGB image using the
perceptual loss as well as the pixel loss.
However, these methods still have limitations that hinder
their use for low-light selfie enhancement. First, most of
them were developed for sRGB images; thus, their perfor-
mance can be limited due to the information loss caused
by demosaicing, denoising, and other non-linear transfor-
mations in ISP. Second, although several of them directly
process raw images [35], [38], [39], they are not specifically
designed for low-light selfie enhancement and cannot be
effortlessly applied to different camera models. DeepSelfie
directly works in the raw domain and has the capability of
extracting features from the foreground and background of
selfie photos. The proposed dual-layer image enhancement
and fusion method also enables high quality image recon-
struction of the foreground and background. Furthermore,
our proposed learning-based training data synthesis method
can reduce the effort required for applications to different
camera models.
III. NETWORK ARCHITECTURE
This section provides an overview of the proposed framework
and then explains the details of each component.
A. FRAMEWORK OVERVIEW
A straightforward approach to obtaining a bright image in a low-light environment is to increase the exposure time
so that the sensor can receive more photons, although this
approach is sensitive to camera and object motions. Alterna-
tively, with appropriate post-processing, a bright image with
satisfactory quality can be obtained by applying an ampli-
fication factor (i.e., gain) to a raw image captured within
a short exposure time. The latter approach inspires us to
develop a fully end-to-end enhancement framework for low-
light selfies that can predict the gain for the input raw image
and perform post-processing operations in ISP to generate the
final sRGB image.
The overall structure of the proposed low-light selfie en-
hancement network is illustrated in Fig. 1. This primarily
consists of two parts: gain estimation and raw data processing
modules. The gain estimation module is designed to segment
an input raw image into the foreground and background and
to predict a gain for each of them. The two gains are then used
to obtain two intermediate sRGB images from the raw data
processing module, one with the foreground enhanced and
the other with the background enhanced. Finally, these two
sRGB images are fused into a final image with the foreground
and background enhanced.
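To make the data flow concrete, the following is a minimal sketch of this forward pass as we read it from Fig. 1. It is not the authors' implementation: the module objects (gain_net, isp_net, fusion_net), their signatures, and the downsized input resolution are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def deepselfie_forward(raw, wb, gain_net, isp_net, fusion_net):
    """Sketch of the two-branch enhancement pipeline described above.

    raw : packed raw image, (B, 4, H, W), values in [0, 1]
    wb  : camera white balance parameters, (B, 3)
    The three network arguments stand in for the gain estimation,
    ISP, and fusion sub-modules.
    """
    # 1. Predict foreground/background masks and two gains from a downsized copy.
    small = F.interpolate(raw, size=(192, 256), mode="bilinear", align_corners=False)
    fg_mask, bg_mask, gains = gain_net(small)            # gains: (B, 2)

    # 2. Amplify the full-resolution raw image once per layer and run the shared ISP sub-module.
    srgb_a = isp_net(raw * gains[:, 0].view(-1, 1, 1, 1), wb)
    srgb_b = isp_net(raw * gains[:, 1].view(-1, 1, 1, 1), wb)

    # 3. Fuse the two differently exposed sRGB images into the final output.
    return fusion_net(torch.cat([srgb_a, srgb_b], dim=1))
```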
B. GAIN ESTIMATION MODULE
Given a dark input raw image, exposure compensation is
usually performed by estimating a gain for global brightness
adjustment [35], [37]–[39]. In line with this approach, we
designed a gain estimation module to find the proper ampli-
fication ratios for the foreground and background, separately.
The downsized raw image is used as the input, which makes
the module light and efficient. As shown in Fig. 2, a segmen-
tation sub-module extracts two masks, where the pixel values
indicate the probabilities of being in the foreground and
background, respectively. Ideally, if the input raw image is
noise-free and has clear boundaries between the foreground
and background, the extracted masks can be directly used for
image enhancement. However, low-light images are usually
very noisy and have unclear boundaries between the two
layers, making it difficult to predict precise pixel-wise masks.
This investigation revealed that the extracted masks coarsely
represent the main regions of the foreground (e.g., face and
clothes) and background (e.g., lights and buildings). There-
fore, we introduce a prediction sub-module that estimates the
amplification ratios of the foreground and background with
the help of the predicted masks.
FIGURE 2. Structure of the gain estimation module.
For the segmentation sub-module, we adopted the U-
Net [36] as a baseline network architecture and included
channel attention modules [40] in it for better segmentation
performance. Since the segmentation sub-module should be
robust against illumination differences, the structural infor-
mation of the raw image, denoted as ˆ
R, is extracted as
follows:
ˆ
R=Rµ
σ,(1)
where Rrepresents the input raw image. µand σare the
mean and standard deviation of R, respectively. We use ˆ
Ras
the input of the segmentation sub-module, which enables bet-
ter performance than the direct use of R. For the prediction
sub-module, five convolutional layers are used with a stride
of 2, followed by three fully connected layers (see Appendix
for more details).
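For reference, the structural normalization in (1) simply standardizes each raw image by its own statistics. A minimal sketch (the function name and the epsilon guard are ours, not part of the paper):

```python
import torch

def structural_normalize(raw, eps=1e-6):
    """Compute R_hat = (R - mu) / sigma per image, as in Eq. (1).

    raw: packed raw tensor of shape (B, C, H, W). The mean and standard
    deviation are computed per image over all channels and pixels; eps
    avoids division by zero for nearly constant inputs (our addition).
    """
    mu = raw.mean(dim=(1, 2, 3), keepdim=True)
    sigma = raw.std(dim=(1, 2, 3), keepdim=True)
    return (raw - mu) / (sigma + eps)
```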
The output of the gain estimation module consists of the
two amplification ratios for the foreground and background.
To improve the accuracy of the predicted amplification ratios,
we introduce intermediate supervision for the foreground and
background masks as well as supervision of the amplification
ratios. The loss function of the gain estimation module,
denoted as $L_{gain}$, is defined as the combination of the binary cross-entropy loss of the intermediate masks, denoted as $L_{seg}$, and the L2 loss of the predicted amplification ratios, denoted as $L_{amp}$:
$$L_{gain} = L_{seg} + L_{amp}, \quad (2)$$
where
$$L_{seg} = l_{bce}(\bar{M}, M_{fg}) + l_{bce}(1 - \bar{M}, M_{bg}), \quad (3)$$
$$L_{amp} = l_2(\mathbf{a}, \hat{\mathbf{a}}). \quad (4)$$
Here, $l_{bce}$ and $l_2$ measure the binary cross-entropy loss and the L2 loss, respectively. $M_{fg}$ and $M_{bg}$ represent the predicted foreground and background masks, respectively. $\bar{M}$ is the foreground mask of the same size as $M_{fg}$ that is obtained by downsampling the ground-truth foreground mask $M$. $\hat{\mathbf{a}} = [\hat{\alpha}_{fg}, \hat{\alpha}_{bg}]$ is a vector of the predicted amplification ratios of the foreground and background, and $\mathbf{a} = [\alpha_{fg}, \alpha_{bg}]$ is the ground-truth amplification vector. We found that separately predicting the foreground and background masks makes the training more stable than predicting the foreground mask only. The process of generating the ground-truth foreground mask and ground-truth amplification vector is explained in Section IV.
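A minimal sketch of the loss in (2)-(4), assuming the predicted masks are probabilities in [0, 1] and the ground-truth foreground mask has already been downsampled to the mask resolution (names are ours):

```python
import torch.nn.functional as F

def gain_loss(fg_mask, bg_mask, gains_pred, mask_gt_small, gains_gt):
    """L_gain = L_seg + L_amp from Eqs. (2)-(4).

    fg_mask, bg_mask : predicted foreground/background masks, (B, 1, h, w)
    gains_pred       : predicted [alpha_fg, alpha_bg], shape (B, 2)
    mask_gt_small    : downsampled ground-truth foreground mask, (B, 1, h, w)
    gains_gt         : ground-truth amplification ratios, shape (B, 2)
    """
    l_seg = F.binary_cross_entropy(fg_mask, mask_gt_small) \
          + F.binary_cross_entropy(bg_mask, 1.0 - mask_gt_small)
    l_amp = F.mse_loss(gains_pred, gains_gt)
    return l_seg + l_amp
```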
C. RAW DATA PROCESSING MODULE
The two predicted gains account for the illumination com-
pensation required for the foreground and background. When
applying each gain to the input raw image, the converted
sRGB image has only one layer, i.e., the foreground or
background, with proper illumination, whereas the other is
either underexposed or overexposed. Moreover, the input raw
image is noisy and not color-corrected, making the generated
sRGB image visually unpleasing without further processing.
Therefore, the raw data processing module is employed to
perform denoising, demosaicing, color correction, and image
fusion.
The structure of the raw data processing module is illus-
trated in Fig. 3. Unlike the gain estimation module, we use
raw images with the original resolution, which is usually
more than 8 mega-pixels. Thus, we design a compact ISP
sub-module that converts each illumination-compensated raw
image into an sRGB image independently. The two sRGB
images are then combined via the fusion sub-module, which
consists of several convolutional layers. However, due to the
limited receptive field of the network while considering the
ultra-high resolution and poor color of the input in low-
light imaging, our first ISP sub-module design based on a
previous model [22] tended to produce inconsistent colors in
the output images. To solve this issue, we use white balance
parameters estimated by the camera as additional inputs of
the ISP sub-module. Specifically, the white balance parame-
ters are converted into constant channels (WB in Fig. 3) and
concatenated with the illumination-compensated raw images.
We found that the provided white balance parameters help
the ISP sub-module adjust the image colors in a globally
consistent manner.
FIGURE 3. Structure of the raw data processing module.
We define the two illumination-compensated raw images, denoted as $R_{high}$ and $R_{low}$, as follows:
$$R_{high} = \max(\hat{\mathbf{a}}) \cdot R, \qquad R_{low} = \min(\hat{\mathbf{a}}) \cdot R. \quad (5)$$
These two raw images can be directly processed by the ISP sub-module to obtain two sRGB images with different exposures. However, since the amplified values are no longer within the range (0.0, 1.0), simply clipping the maximum value to 1.0 can lead to information loss in the bright regions. Moreover, the outputs of the ISP sub-module from $R_{high}$ and $R_{low}$ are not guaranteed to have a consistent color. To solve these problems, we define the input of the ISP sub-module as follows:
$$\tilde{R}_{high} = \left[\, R_{high} / \|R_{high}\| \;\big|\; F\!\left(\|R_{high}\|\right) \right], \quad (6)$$
where $/$ represents element-wise division, $|$ represents channel-wise concatenation, and $\|R_{high}\|$ is the magnitude map of $R_{high}$. $\tilde{R}_{low}$ is defined in a similar manner from $R_{low}$. First, note that the two normalized raw images are the same, i.e., $R_{high}/\|R_{high}\| = R_{low}/\|R_{low}\|$, because the scalar multiplication in (5) does not change the direction of the vectors. Therefore, the ISP sub-module is expected to restore color consistently from $R_{high}$ and $R_{low}$. Second, the magnitude map of the raw image is further mapped by the function $F$, which shrinks the range of the input magnitude values to (0.0, 1.0); a sigmoid-style function [41] was empirically chosen. This mapping is required not only to preserve the details in the bright regions, but also to avoid the color ambiguity of saturated pixels.
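The sketch below illustrates how the ISP sub-module input of (5)-(6), together with the constant white balance channels of Fig. 3, can be assembled. Interpreting the magnitude map as the per-pixel norm over the packed channels, and substituting a plain logistic curve for the sigmoid-style mapping F of [41], are our assumptions:

```python
import torch

def make_isp_input(raw, gain, wb, eps=1e-6):
    """Build [color | F(magnitude) | WB] following Eqs. (5)-(6).

    raw  : packed raw image, (B, 4, H, W)
    gain : scalar amplification ratio (max or min of the predicted gains)
    wb   : camera white balance parameters, (B, 3)
    """
    amplified = raw * gain                               # Eq. (5)
    magnitude = amplified.norm(dim=1, keepdim=True)      # ||R||, (B, 1, H, W)
    color = amplified / (magnitude + eps)                # direction, shared by R_high and R_low
    mapped = torch.sigmoid(magnitude)                    # placeholder for the sigmoid-style F [41]
    wb_maps = wb.view(-1, 3, 1, 1).expand(-1, -1, raw.shape[2], raw.shape[3])
    return torch.cat([color, mapped, wb_maps], dim=1)    # (B, 8, H, W)
```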
Let $I_{high}$ and $I_{low}$ denote the outputs of the ISP sub-module. Because each raw image has adequate illumination in only the foreground or background, we exclude the layer with inadequate illumination when training the ISP sub-module. To this end, the loss function of the ISP sub-module is defined as follows:
$$L_{high} = l_2(M_{high} \odot I_{high},\, M_{high} \odot I) + l_{vgg}(M_{high} \odot I_{high},\, M_{high} \odot I),$$
$$L_{low} = l_2(M_{low} \odot I_{low},\, M_{low} \odot I) + l_{vgg}(M_{low} \odot I_{low},\, M_{low} \odot I), \quad (7)$$
where $\odot$ represents element-wise multiplication, $I$ is the ground-truth sRGB image, and $l_{vgg}$ measures the perceptual loss using features extracted from the pretrained VGG16 network [42]. $M_{high}$ is the ground-truth binary mask corresponding to $R_{high}$, which is defined as follows:
$$M_{high} = \begin{cases} M, & \text{if } \hat{\alpha}_{fg} \geq \hat{\alpha}_{bg}, \\ 1 - M, & \text{otherwise}, \end{cases} \quad (8)$$
and $M_{low}$, the ground-truth binary mask corresponding to $R_{low}$, is then defined as $M_{low} = 1 - M_{high}$.
The final loss function of the raw data processing module is defined as:
$$L_{recon} = L_{high} + L_{low} + L_{fuse}, \quad (9)$$
where
$$L_{fuse} = l_2(I_{fuse}, I) + l_{vgg}(I_{fuse}, I), \quad (10)$$
and $I_{fuse}$ represents the final fusion output of the enhancement network.
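A sketch of the masked reconstruction loss in (7)-(10); vgg_features stands in for a fixed, pretrained VGG16 feature extractor as in [42], and the perceptual term is taken as an L2 distance between feature maps (an assumption on our part):

```python
import torch.nn.functional as F

def isp_loss(i_pred, i_gt, mask, vgg_features):
    """Masked L2 + perceptual loss used for L_high and L_low in Eq. (7)."""
    masked_pred, masked_gt = mask * i_pred, mask * i_gt
    return F.mse_loss(masked_pred, masked_gt) \
         + F.mse_loss(vgg_features(masked_pred), vgg_features(masked_gt))

def recon_loss(i_high, i_low, i_fuse, i_gt, m_high, vgg_features):
    """L_recon = L_high + L_low + L_fuse from Eqs. (7)-(10)."""
    l_high = isp_loss(i_high, i_gt, m_high, vgg_features)
    l_low = isp_loss(i_low, i_gt, 1.0 - m_high, vgg_features)
    l_fuse = F.mse_loss(i_fuse, i_gt) \
           + F.mse_loss(vgg_features(i_fuse), vgg_features(i_gt))
    return l_high + l_low + l_fuse
```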
IV. TRAINING DATA GENERATION
This section explains the procedure of generating training
image samples and its application to different camera models.
A. OVERALL PROCEDURE
The training of the proposed selfie enhancement net-
work requires noisy raw selfie images captured in low-
light environments as inputs and their corresponding well-
illuminated sRGB images as the ground-truths. Unfortu-
nately, no datasets are currently available that are applicable
for this task to the best of our knowledge. Therefore, we
design a pipeline for synthesizing realistic low-light selfies
with different illuminations, color temperature, and noise
levels. As depicted in Fig. 4, two sRGB images are selected
from collections of portraits and natural scenery images,
respectively, and they are blended to synthesize a realistic
selfie image. The synthetic selfie image is then reverted to
the raw image with the given white balance parameters.
Subsequently, the obtained raw image is separated into the
foreground and background layers. These two layers are
darkened differently to simulate different illumination con-
ditions. Finally, noise is added to the darkened raw image.
To increase the diversity of the training samples, white
balance parameters, darkening ratios, and noise levels are
randomly chosen. We empirically found that the white bal-
ance values of the red and blue channels are linearly related in
the log domain.

FIGURE 4. Overview of the training data synthesis pipeline.

Thus, the synthetic white balance parameters are randomly sampled using the following distributions:
$$\log(w_r) \sim \mathcal{U}(a_w, b_w), \qquad \log(w_b) \sim \mathcal{N}(\mu = s_w \log(w_r) + c_w,\ \sigma_w), \quad (11)$$
where $w_r$ and $w_b$ are the synthetic white balance values of the red and blue channels, respectively, and the white balance value of the green channel is set to 1, i.e., $w_g = 1$. $a_w$, $b_w$, $s_w$, $c_w$, and $\sigma_w$ are camera-dependent parameters. Inspired by previous research [14], [43], the camera shot noise and read-out noise are determined as follows:
$$\log(\lambda_{shot}) \sim \mathcal{U}(a_n, b_n), \qquad \log(\lambda_{read}) \sim \mathcal{N}(\mu = s_n \log(\lambda_{shot}) + c_n,\ \sigma_n), \quad (12)$$
where $\lambda_{shot}$ and $\lambda_{read}$ are the shot noise and read noise parameters, respectively, and $a_n$, $b_n$, $s_n$, $c_n$, and $\sigma_n$ are camera-dependent parameters. The noise-added intensity value $y$ is then determined from the noise-free intensity value $x$ as follows:
$$y \sim \mathcal{N}(\mu = x,\ \sigma^2 = \lambda_{read} + \lambda_{shot}\, x). \quad (13)$$
This darkened and noise-added raw image is used to train
the selfie enhancement network. The applied amplification
ratios (i.e., inverse of the darkening ratios) and binary masks
are also used for intermediate supervision while training the
selfie enhancement network.
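A minimal sketch of the sampling and noise injection in (11)-(13). The camera-dependent parameters are passed in as a plain dictionary whose keys are ours, and sigma_w and sigma_n are assumed to be non-negative standard deviations:

```python
import numpy as np

def sample_white_balance(p, rng):
    """Sample (w_r, w_g, w_b) following Eq. (11)."""
    log_wr = rng.uniform(p["a_w"], p["b_w"])
    log_wb = rng.normal(p["s_w"] * log_wr + p["c_w"], p["sigma_w"])
    return np.exp(log_wr), 1.0, np.exp(log_wb)

def add_shot_read_noise(raw, p, rng):
    """Turn a darkened raw image into a noisy one following Eqs. (12)-(13)."""
    log_shot = rng.uniform(p["a_n"], p["b_n"])
    log_read = rng.normal(p["s_n"] * log_shot + p["c_n"], p["sigma_n"])
    lam_shot, lam_read = np.exp(log_shot), np.exp(log_read)
    variance = lam_read + lam_shot * raw      # signal-dependent noise model [14], [43]
    noisy = rng.normal(raw, np.sqrt(variance))
    return np.clip(noisy, 0.0, 1.0)

# Usage: rng = np.random.default_rng(); noisy = add_shot_read_noise(darkened_raw, params, rng)
```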
B. FIVEKNIGHT DATASET
The training of the proposed selfie enhancement network
requires well-illuminated sRGB images as the ground-truths.
We chose the FiveK dataset [7], which contains original raw
images and corresponding images retouched by five experts.
We used images from expert C to train our network. To
synthesize more realistic selfie images that reflect the low-
light environments encountered in daily life, we only selected
scenery images tagged with “night”, “dawn”, and “dusk”
for the background image collection. For the portrait image
collection, we extracted the regions portraying full human
bodies and cropped the upper parts of the bodies to make
the images look more like selfies after synthesis with the
background. Briefly, the FiveKNight dataset, which contains
733 and 680 images in the foreground and background col-
lections, respectively, is an elaborate subset of the FiveK
dataset [7] developed for low-light selfie enhancement. Fig. 5
shows several examples of the FiveKNight dataset. Owing
to the robustness of the fusion sub-module, automatically
generated but slightly imprecise masks obtained using the
existing method [44] were found to be sufficient for our task.
FIGURE 5. Example images of the FiveKNight dataset. Top: low-light scenery
images; Bottom: segmented portrait regions.
The FiveK dataset provides raw images captured using dif-
ferent camera models. Because the different camera models
have different resolutions, optics, and spectral responses, the
network trained for one camera model may not work well
for the other models. Therefore, instead of using a hand-
crafted method to unprocess the sRGB images to obtain the
raw images, we introduce a network that transforms the input
sRGB image to the camera-specific raw image. This trans-
formation network, called r2rNet, is trained independently by
using images captured by the specified camera.
There are two reasons for using r2rNet to generate the
raw images. First, r2rNet can learn a more direct inverse
mapping from a camera-processed sRGB image to a raw
image, which is advantageous when the built-in camera ISP is
a black-box. Second, after completing the training of r2rNet,
only the FiveKNight dataset is required to train the rest of
the framework. Since collecting raw and sRGB image pairs
for one camera model takes substantially less effort than
collecting expert-retouched images, the proposed framework
can be applied conveniently to different camera models.
For the design of r2rNet, we used a residual dense net-
work [45] but included the white balance parameters as addi-
tional inputs. To reduce the memory usage during training,
we pre-processed the sRGB images so that they have the
same dimensions as the corresponding raw images. Specif-
ically, every second pixel of the red and blue channels is
removed, and the green channel is split into two channels by
odd/even pixel sampling, resulting in half-sized four-channel
sRGB images. The captured raw image may contain saturated
pixels; therefore, we exclude these pixels when calculating
the loss. The loss function of r2rNet, denoted as Lr2r , is
defined as:
$$L_{r2r} = l_1(M_{r2r} \odot R_{r2r},\ M_{r2r} \odot R), \quad (14)$$
where $l_1$ measures the L1 loss, $M_{r2r}$ is the binary saturation mask in which a pixel value of 0 indicates that the corresponding pixel is saturated, and $R_{r2r}$ and $R$ are the output of r2rNet and the raw image from the camera, respectively.
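The sRGB packing described above and the saturation-masked L1 loss in (14) can be sketched as follows; the exact Bayer phase (assumed RGGB here) and the saturation threshold are assumptions:

```python
import torch
import torch.nn.functional as F

def pack_srgb_like_raw(srgb):
    """Pack an sRGB image (B, 3, H, W) into a half-sized four-channel tensor
    (B, 4, H/2, W/2): subsampled R and B plus the two green phases."""
    r, g, b = srgb[:, 0], srgb[:, 1], srgb[:, 2]
    return torch.stack([r[:, 0::2, 0::2],    # R samples
                        g[:, 0::2, 1::2],    # G on the red rows
                        g[:, 1::2, 0::2],    # G on the blue rows
                        b[:, 1::2, 1::2]],   # B samples
                       dim=1)

def r2r_loss(raw_pred, raw_gt, sat_thresh=1.0):
    """Saturation-masked L1 loss of Eq. (14): ground-truth pixels at or above
    the saturation level (mask value 0) are excluded from the loss."""
    mask = (raw_gt < sat_thresh).float()
    return F.l1_loss(mask * raw_pred, mask * raw_gt)
```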
V. EXPERIMENT RESULTS
This section describes the implementation details and com-
pares the performance of the proposed method with that of
several state-of-the-art methods. The training images, test
images, and enhanced images obtained using all of the compared methods are available on our project website.
A. IMPLEMENTATION DETAILS
The proposed framework was implemented in PyTorch and trained on an Nvidia Titan Xp GPU. All the networks were trained using the Adam optimizer [46] with a learning rate of $10^{-4}$. Random cropping and flipping were applied for
data augmentation. The gain estimation module and raw
data processing module were trained independently for two
epochs with an input size of 320 ×240 and a batch size
of four. When synthesizing selfie images, regions containing
humans were resized with random scaling factors between
0.5 and 1.0. The gain range was empirically chosen to be
(1.0, 20.0). Since the gain estimation module requires a large
receptive field to understand the entire scene captured, we
fixed the input size to 256 ×192.
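For concreteness, a minimal sketch of the training loop implied by these settings (Adam, learning rate 1e-4, batch size four, two epochs); the dataset, network, and loss objects are placeholders, and random cropping/flipping is assumed to happen inside the dataset:

```python
import torch
from torch.utils.data import DataLoader

def train_module(network, dataset, loss_fn, epochs=2, batch_size=4, lr=1e-4, device="cuda"):
    """Train one module (gain estimation or raw data processing) independently."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(network.parameters(), lr=lr)
    network.to(device).train()
    for _ in range(epochs):
        for batch in loader:                        # dict of synthesized training tensors
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = network(batch)                # module-specific forward pass
            loss = loss_fn(outputs, batch)          # e.g., L_gain or L_recon
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```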
We chose a DJI Osmo Action camera as a test device since
it is equipped with a front screen for selfies and a CMOS
sensor widely used in mobile devices [47]. We captured
730 images along with their raw images, including day and
night scenes with various white balances, exposures, and ISO
settings. We used 710 images for training r2rNet and 20
images for validation. We trained r2rNet for 1,500 epochs
with an input size of 128 ×128, and a batch size of eight.
Our trained r2rNet achieved an average PSNR of 66.25 dB
for the validation images. Note that the highlight regions in
the sRGB images were clipped to 1.0 and could not be re-
mapped to their original raw values; thus, we excluded these
regions during training and validation.
FIGURE 6. Parameter distributions of the DJI dataset: (a) white balance; (b) noise level.
FIGURE 7. Enhanced images obtained from the networks trained (a) with and (b) without mapping of the amplified values.
The camera-dependent parameters in (11)-(12) were de-
termined from the statistics of the DJI dataset. From the
parameter distributions shown in Fig. 6, the parameters were set as $a_w = -1.02$, $b_w = -0.04$, $s_w = -0.93$, $c_w = -1.24$, $\sigma_w = -0.06$, $a_n = -8.37$, $b_n = -7.05$, $s_n = 2.38$, $c_n = -4.05$, and $\sigma_n = -0.10$.
B. ABLATION STUDY
As discussed in Section III-C, the direct use of the ampli-
fied raw images could cause inaccurate colors in bright and
saturated regions. To demonstrate the effectiveness of the
mapping function $F$ in (6), we trained our network with and
without using F. As illustrated in Fig. 7(a), without special
treatment of the amplified values, the resultant image has
inaccurate and inconsistent colors in the bright regions.
The choice of loss function is also critical to the final
performance. As displayed in Fig. 8, the network trained by
using a single L2 loss produced blurry images with artifacts
in unevenly illuminated regions. In addition, the network
trained by using a single perceptual loss tended to produce
bluish and less vivid colors. The use of the combined L2
loss and perceptual loss enabled the network to learn both
low-level and high-level features, resulting in higher quality images.

FIGURE 8. Comparison of the loss function for the raw data processing module: (a) Camera built-in ISP output and (b)-(d) enhanced images obtained using the L2 loss, perceptual loss, and L2 + perceptual loss, respectively.

FIGURE 9. Example of the high quality predicted foreground mask and enhanced images: (a) Camera ISP output, (b) predicted foreground mask, (c) enhanced image obtained by linear combination, (d) enhanced image obtained by the proposed fusion module.
Last, we investigated the effectiveness of the fusion sub-
module. We found that if a predicted mask was obtained with
high accuracy, as shown in Fig. 9(b), a linear combination
of two enhanced images using the predicted mask could
produce an image with relatively high quality, as depicted in
Fig. 9(c). Even though a pixel-wise accurate mask is desired,
we noticed that such a mask is challenging to obtain in low-
light environments. If a predicted mask was incomplete, as
shown in Fig. 10(a), a simple linear combination of two enhanced images failed to render natural image boundaries, as illustrated in Fig. 10(b). In comparison, our fusion sub-module could robustly handle an inaccurately estimated foreground map, as demonstrated in Fig. 10(c).

FIGURE 10. Example of the low quality predicted foreground mask and enhanced images: (a) Predicted foreground mask, (b) enhanced image obtained by linear combination, (c) enhanced image obtained by the proposed fusion module.
C. VISUAL QUALITY COMPARISON
In order to conduct a quality evaluation, we captured selfies
using a DJI camera in the automatic mode with a maxi-
mum ISO setting of 800. In total, 124 selfie images were
captured at different locations and in various illumination
environments. For performance comparison, we selected four
state-of-the-art methods, RetinexNet [28], Exposure [48],
Enlighten [5], and DeepUPE [2]. Because these methods do
not work in the camera raw domain, we used Lightroom [1]
to obtain 16-bit images in ProPhoto RGB as the inputs for
[48] and sRGB for the rest.
Fig. 11 presents the results for three different cases.
RetinexNet best recovered the visibility of the scenes; how-
ever, it produced unnatural images. Meanwhile, Exposure
and DeepUPE showed limited performance in terms of en-
hancing very dark regions. Enlighten well enhanced the con-
trast and visibility of the scenes. Compared with Enlighten,
the proposed method rendered more natural colors in the
facial regions and exhibited better visibility of the foreground
and background.
FIGURE 11. A visual comparison of the five methods under three illumination cases (Top: dark foreground and bright background, Middle: bright foreground and dark background, Bottom: dark foreground and dark background): (a) Camera built-in ISP output, (b) RetinexNet [28], (c) Exposure [48], (d) Enlighten [5], (e) DeepUPE [2], and (f) Ours. The results are best viewed in the electronic version.
D. NO-REFERENCE OBJECTIVE EVALUATION
Due to the lack of ground-truth selfie images, full-reference
image quality assessment metrics are not applicable for quan-
titative performance evaluation. However, image quality can
still be evaluated using image features such as naturalness
and sharpness that are closely related to human visual percep-
tion. To this end, we chose three no-reference image quality
assessment metrics, namely the naturalness image quality
evaluator (NIQE) [49], autoregressive image sharpness met-
ric (ARISM) [50], and blind image integrity notator using
discrete cosine transform statistics (BLIINDS) [51], [52].
Note that lower values indicate higher quality in all chosen
metrics.
Table 1 shows the average image quality scores of all com-
pared methods. Compared to ARISM, ARISMc additionally
takes chrominance information into consideration [50]. Note
that the default parameter settings from the authors’ provided
source codes were used for the compared methods. All the
evaluations were performed using the input images with full
resolution of 4000 ×3000. It can be seen that the images
generated by the proposed method have lower (better) quality
scores than the other compared methods. The histograms
of the obtained quality scores as shown in Fig. 12 further
demonstrate that the proposed method consistently produced
high-quality images.
TABLE 1. Average quality scores for the compared methods

Method            NIQE   ARISM   ARISMc   BLIINDS
Input             3.20   2.74    3.09     23.64
RetinexNet [28]   5.10   3.04    3.67     41.13
Exposure [48]     3.58   3.05    3.52     34.77
Enlighten [5]     2.80   2.84    3.12     24.50
DeepUPE [2]       3.43   2.85    3.31     22.57
Ours              2.46   2.61    2.85     23.08
E. HUMAN SUBJECTIVE EVALUATION
A user study was conducted for subjective quality evaluation
using 124 selfie images. For this study, 25 non-expert sub-
jects participated in the quality evaluation. For each selfie im-
age, the enhanced images obtained by the proposed method,
as well as the four compared methods, were ranked by the
subjects. The five enhanced images were shown in a random
order and no time restrictions were imposed. The subjects were advised to rank the images according to the contrast, visibility, and naturalness. The experiment was performed in a controlled lab environment.

FIGURE 12. Histograms of the quality scores for the compared methods.
As shown in Fig. 13 and Table 2, the proposed method received the best average ranking, demonstrating its supe-
riority over the other methods for low-light selfie enhance-
ment.
TABLE 2. Average rankings for the compared methods in the user study

Method            RetinexNet [28]   Exposure [48]   Enlighten [5]   DeepUPE [2]   Ours
Average ranking   4.31              3.94            2.19            2.32          1.67
F. NON-SELFIE PHOTOS
In addition to low-light selfies, we tested our framework on
non-selfie photos. As shown in Fig. 14, our framework pro-
vided sufficient quality improvements on non-selfie photos.
We found that when there were no human subjects in the
scene, the two predicted gains were very close to each other, and thus the two ISP sub-module outputs enhanced the background similarly. This indicates that although the proposed framework is specifically designed for low-light selfies, it can also be used for general night scene enhancement applications.
FIGURE 13. Rating distribution for various methods in the user study.
G. FAILURE CASES
Although the above-mentioned experimental results revealed
the low-light selfie enhancement capabilities of the proposed
method, it still failed in some cases. When the lighting con-
ditions of the foreground and background were significantly
different, the fusion result tended to suffer from halo artifacts
near highlight regions, as shown in Fig. 15(a). Fig. 15(b) de-
picts another case in which the foreground was very weakly
illuminated. In this situation, the required gain fell outside the range that the gain estimation network could predict, causing incomplete illumination compensation of the foreground.

FIGURE 14. Experimental results on non-selfie photos: (a), (c) Output of the built-in camera ISP and (b), (d) the results obtained by the proposed framework.
FIGURE 15. Two failure cases: (a) Halo artifacts around highlights and (b) incomplete exposure compensation of the foreground.
VI. CONCLUSIONS
In this paper, we presented a low-light image enhancement
framework for selfies. Considering selfie enhancement as
a dual-layer enhancement problem, we introduced a gain
estimation module that predicts the enhancement parame-
ters for each layer and a raw data processing module that
jointly performs denoising, demosaicing, color correction,
and image fusion. A dataset called FiveKNight, containing
low-light scenery images as the background collection and
segmented human subjects as the foreground collection, was
also constructed from the FiveK dataset. Moreover, we pro-
posed a raw image synthesis network and used it to generate
degraded raw selfie images to train the selfie enhancement
network. Through visual quality comparisons and subjective
quality evaluations, we demonstrated the effectiveness and
robustness of DeepSelfie compared to the state-of-the-art
methods.
Several future studies are being considered. First, in this
study, we did not specifically consider the aesthetic enhance-
ment of facial regions. To provide a better selfie enhancement
solution for end users, we plan to consider facial regions
more specifically. A possible method could use an unpaired GAN to give special attention to the facial regions when generating training samples. Second, interactive image enhance-
ment is necessary from an application perspective. Although
the present method can partly support interactive enhance-
ment by allowing users to change the network-estimated am-
plification parameters, a more intuitive and diverse method of
adjusting image enhancement needs to be developed. Last, a
more efficient and light network design is required for mobile
applications.
REFERENCES
[1] Adobe Inc. (2015) Adobe Lightroom. [Online]. Available: https://lightroom.adobe.com
[2] R. Wang, Q. Zhang, C.-W. Fu, X. Shen, W.-S. Zheng, and J. Jia, “Underex-
posed photo enhancement using deep illumination estimation,” in Proceed-
ings of IEEE Conference on Computer Vision and Pattern Recognition,
2019, pp. 6849–6857.
[3] Y. Zhang, J. Zhang, and X. Guo, “Kindling the darkness: A practical low-
light image enhancer,” in Proceedings of ACM International Conference
on Multimedia, 2019, pp. 1632–1640.
[4] Y. P. Loh, X. Liang, and C. S. Chan, “Low-light image enhancement
using Gaussian process for features retrieval,” Signal Processing: Image
Communication, vol. 74, pp. 175–190, 2019.
[5] Y. Jiang, X. Gong, D. Liu, Y. Cheng, C. Fang, X. Shen, J. Yang, P. Zhou,
and Z. Wang, “Enlightengan: Deep light enhancement without paired
supervision,” arXiv preprint arXiv:1906.06972, 2019.
[6] F. Lv and F. Lu, “Attention-guided low-light image enhancement,” arXiv
preprint arXiv:1908.00682, 2019.
[7] V. Bychkovsky, S. Paris, E. Chan, and F. Durand, “Learning photographic
global tonal adjustment with a database of input / output image pairs,”
in Proceedings of IEEE Conference on Computer Vision and Pattern
Recognition, 2011, pp. 97–104.
[8] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for
semantic segmentation,” in Proceedings of IEEE Conference on Computer
Vision and Pattern Recognition, 2015, pp. 3431–3440.
[9] X. Shen, A. Hertzmann, J. Jia, S. Paris, B. Price, E. Shechtman, and
I. Sachs, “Automatic portrait segmentation for image stylization,” Comput.
Graph. Forum, vol. 35, no. 2, pp. 93–102, 2016.
[10] N. Xu, B. Price, S. Cohen, and T. Huang, “Deep image matting,” in
Proceedings of IEEE Conference on Computer Vision and Pattern Recog-
nition, 2017, pp. 2970–2979.
[11] Y. Zhang, L. Gong, L. Fan, P. Ren, Q. Huang, H. Bao, and W. Xu, “A late
fusion CNN for digital matting,” in Proceedings of IEEE Conference on
Computer Vision and Pattern Recognition, 2019, pp. 7469–7478.
[12] X. Chen, D. Qi, and J. Shen, “Boundary-aware network for fast and high-
accuracy portrait segmentation,” arXiv preprint arXiv:1901.03814, 2019.
[13] B. Zhu, Y. Chen, J. Wang, S. Liu, B. Zhang, and M. Tang, “Fast deep
matting for portrait animation on mobile phone,” in Proceedings of ACM
International Conference on Multimedia, 2017, pp. 297–305.
[14] T. Brooks, B. Mildenhall, T. Xue, J. Chen, D. Sharlet, and J. T. Barron,
“Unprocessing images for learned raw denoising,” in Proceedings of
IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp.
11036–11045.
[15] H. Guan, L. Liu, S. Moran, F. Song, and G. Slabaugh, “Node: Extreme low
light raw image denoising using a noise decomposition network,” arXiv
preprint arXiv:1909.05249, 2019.
[16] J. Liu, C.-H. Wu, Y. Wang, Q. Xu, Y. Zhou, H. Huang, C. Wang, S. Cai,
Y. Ding, H. Fan, et al., “Learning raw image denoising with Bayer pattern unification and Bayer preserving augmentation,” in Proceedings
of IEEE Conference on Computer Vision and Pattern Recognition Work-
shops, 2019, pp. 1–8.
[17] H. Jiang, Q. Tian, J. Farrell, and B. A. Wandell, “Learning the image
processing pipeline,” IEEE Trans. Image Process., vol. 26, no. 10, pp.
5032–5042, 2017.
[18] X. Xu, Y. Ma, and W. Sun, “Towards real scene super-resolution with
raw images,” in Proceedings of IEEE Conference on Computer Vision and
Pattern Recognition, 2019, pp. 1723–1731.
[19] E. Schwartz, R. Giryes, and A. M. Bronstein, “Deepisp: Toward learning
an end-to-end image processing pipeline,” IEEE Trans. Image Process.,
vol. 28, no. 2, pp. 912–923, 2018.
[20] G. Qian, J. Gu, J. S. Ren, C. Dong, F. Zhao, and J. Lin, “Trinity of pixel
enhancement: A joint solution for demosaicking, denoising and super-
resolution,” arXiv preprint arXiv:1905.02538, 2019.
[21] Z. Liang, J. Cai, Z. Cao, and L. Zhang, “Cameranet: A two-stage frame-
work for effective camera ISP learning,” arXiv preprint arXiv:1908.01481,
2019.
[22] S. Ratnasingam, “Deep camera: A fully convolutional neural network for
image signal processing,” in Proceedings of IEEE International Confer-
ence on Computer Vision Workshops, 2019, pp. 1–11.
[23] Z.-u. Rahman, D. J. Jobson, and G. A. Woodell, “Multi-scale retinex
for color image enhancement,” in Proceedings of IEEE International
Conference on Image Processing, 1996, pp. 1003–1006.
[24] D. J. Jobson, Z.-u. Rahman, and G. A. Woodell, “A multiscale retinex
for bridging the gap between color images and the human observation of
scenes,” IEEE Trans. Image Process., vol. 6, no. 7, pp. 965–976, 1997.
[25] Z.-u. Rahman, D. J. Jobson, and G. A. Woodell, “Retinex processing for
automatic image enhancement,” J. Electron. imaging, vol. 13, no. 1, pp.
100–111, 2004.
[26] X. Guo, Y. Li, and H. Ling, “Lime: Low-light image enhancement via
illumination map estimation,” IEEE Trans. Image Process., vol. 26, no. 2,
pp. 982–993, 2016.
[27] M. Li, J. Liu, W. Yang, X. Sun, and Z. Guo, “Structure-revealing low-
light image enhancement via robust retinex model,” IEEE Trans. Image
Process., vol. 27, no. 6, pp. 2828–2841, 2018.
[28] C. Wei, W. Wang, W. Yang, and J. Liu, “Deep retinex decomposition for
low-light enhancement,” arXiv preprint arXiv:1808.04560, 2018.
[29] S. Park, S. Yu, M. Kim, K. Park, and J. Paik, “Dual autoencoder network
for retinex-based low-light image enhancement,” IEEE Access, vol. 6, pp. 22084–22093, 2018.
[30] Y. Guo, X. Ke, J. Ma, and J. Zhang, “A pipeline neural network for low-
light image enhancement,” IEEE Access, vol. 7, pp. 13737–13744, 2019.
[31] J. Cai, S. Gu, and L. Zhang, “Learning a deep single image contrast
enhancer from multi-exposure images,” IEEE Trans. Image Process.,
vol. 27, no. 4, pp. 2049–2062, 2018.
[32] W. Ren, S. Liu, L. Ma, Q. Xu, X. Xu, X. Cao, J. Du, and M.-H. Yang,
“Low-light image enhancement via a deep hybrid network,” IEEE Trans.
Image Process., vol. 28, no. 9, pp. 4364–4375, 2019.
[33] Y.-S. Chen, Y.-C. Wang, M.-H. Kao, and Y.-Y. Chuang, “Deep photo
enhancer: Unpaired learning for image enhancement from photographs
with gans,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2018, pp. 6306–6314.
[34] Y. Shi, X. Wu, and M. Zhu, “Low-light image enhancement algorithm
based on retinex and generative adversarial network,” arXiv preprint
arXiv:1906.06027, 2019.
C. Chen, Q. Chen, J. Xu, and V. Koltun, “Learning to see in the dark,”
in Proceedings of IEEE Conference on Computer Vision and Pattern
Recognition, 2018, pp. 3291–3300.
[36] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks
for biomedical image segmentation,” in Proceedings of International
Conference on Medical Image Computing and Computer-Assisted Inter-
vention, 2015, pp. 234–241.
[37] C. Chen, Q. Chen, M. N. Do, and V. Koltun, “Seeing motion in the dark,” in
Proceedings of IEEE International Conference on Computer Vision, 2019,
pp. 3185–3194.
[38] P. Maharjan, L. Li, Z. Li, N. Xu, C. Ma, and Y. Li, “Improving extreme
low-light image denoising via residual learning,” in Proceedings of IEEE
International Conference on Multimedia and Expo, 2019, pp. 916–921.
[39] S. W. Zamir, A. Arora, S. Khan, F. S. Khan, and L. Shao, “Learning
digital camera pipeline for extreme low-light imaging,” arXiv preprint
arXiv:1904.05939, 2019.
[40] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Pro-
ceedings of IEEE Conference on Computer Vision and Pattern Recogni-
tion, 2018, pp. 7132–7141.
[41] K. Narkowicz. (2016) ACES filmic tone mapping curve. [Online]. Available: https://knarkowicz.wordpress.com/2016/01/06/aces-filmic-tone-mapping-curve/
[42] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style
transfer and super-resolution,” in Proceedings of European Conference on
Computer Vision, 2016, pp. 1–17.
[43] T. Plötz and S. Roth, “Benchmarking denoising algorithms with real
photographs,” in Proceedings of IEEE Conference on Computer Vision and
Pattern Recognition, 2017, pp. 1586–1595.
[44] Kaleido AI GmbH. (2019) Remove image background. [Online].
Available: https://www.remove.bg
[45] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, “Residual dense
network for image super-resolution,” in Proceedings of IEEE Conference
on Computer Vision and Pattern Recognition, 2018, pp. 2472–2481.
[46] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,”
in Proceedings of International Conference on Learning Representations,
2015, pp. 1–41.
[47] DJI LLC. (2019) Osmo Action. [Online]. Available: https://www.dji.com/osmo-action
[48] Y. Hu, H. He, C. Xu, B. Wang, and S. Lin, “Exposure: A white-box photo
post-processing framework,” ACM Trans. Graph., vol. 37, no. 2, p. 26,
2018.
[49] A. Mittal, R. Soundararajan, and A. C. Bovik, “Making a “completely
blind” image quality analyzer,” IEEE Signal Process. Lett., vol. 20, no. 3,
pp. 209–212, 2012.
[50] K. Gu, G. Zhai, W. Lin, X. Yang, and W. Zhang, “No-reference image
sharpness assessment in autoregressive parameter space,” IEEE Trans.
Image Process., vol. 24, no. 10, pp. 3218–3231, 2015.
[51] M. A. Saad, A. C. Bovik, and C. Charrier, “A DCT statistics-based blind
image quality index,” IEEE Signal Process. Lett., vol. 17, no. 6, pp. 583–
586, 2010.
[52] ——, “Blind image quality assessment: A natural scene statistics approach
in the DCT domain,” IEEE Trans. Image Process., vol. 21, no. 8, pp. 3339–
3352, 2012.
YUCHENG LU received the B.S. degree in op-
tical information science and technology from
Hangzhou Dianzi University, Hangzhou, China, in
2016. He is currently pursuing a joint M.S. and Ph.D. degree in multimedia engineering at the Department of Multimedia Engineering, Dongguk University,
Seoul, Korea. His main research interests include
image enhancement, 3D model reconstruction and
machine learning based computer vision applica-
tions.
DONG-WOOK KIM received the B.S. and M.S. degrees in multimedia engineering from Dongguk University, Seoul, Rep. of Korea, in 2017 and 2019, respectively, where he is currently pursuing a Ph.D. degree. His research interests include image restoration, image colorization, and generative adversarial networks.
SEUNG-WON JUNG (S’06-M’11-SM’19) re-
ceived the B.S. and Ph.D. degrees in electrical
engineering from Korea University, Seoul, Ko-
rea, in 2005 and 2011, respectively. He was a
Research Professor with the Research Institute
of Information and Communication Technology,
Korea University, from 2011 to 2012. He was a
Research Scientist with the Samsung Advanced
Institute of Technology, Yongin-si, Korea, from
2012 to 2014. He was an Assistant Professor at the
Department of Multimedia Engineering, Dongguk University, Seoul, Korea,
from 2014 to 2020. He is currently an Assistant Professor at the Department
of Electrical Engineering, Korea University, Seoul, Korea. He has published
over 60 peer-reviewed articles in international journals. His current research
interests include deep learning and computer vision applications.
APPENDIX. NETWORK STRUCTURES
The network structures of the gain estimation module and the raw data processing module are shown in Figs. 16 and 17, respectively. Every number below each block represents the number of channels. FC and ChAtt denote a fully connected layer and a channel attention layer [40], respectively. Unless otherwise specified, convolution layers have a kernel size of 3×3 and a stride of 1. Leaky ReLU is used as the activation function after every 3×3 convolution, and the output is passed through a sigmoid function. We feed $R_{high}$ and $R_{low}$ separately into the ISP sub-module and combine the two outputs using the fusion sub-module; the second ISP sub-module is therefore rendered as semi-transparent. Regarding the input of the ISP sub-module, $\widetilde{R}_{high}$ in (6) is represented as Magnitude, and only this channel is passed to the skip connection block. The intermediate input channels, i.e., $R_{high}/\|R_{high}\|$, are simply denoted as Color. The last input channels, denoted as White Balance, are defined as three constant channels obtained from the three-dimensional white balance parameters. The input from $R_{low}$ for the second ISP sub-module is defined in a similar manner.
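As a concrete illustration of how this input could be put together, the sketch below assembles the Magnitude, Color, and White Balance channels from an amplified raw tensor. This is not the released implementation: the function name, the tensor shapes, and the use of the per-pixel norm of $R_{high}$ as the Magnitude channel are assumptions made for illustration only; the exact definition is the one given by (6).

```python
# A minimal sketch (not the authors' released code) of the ISP sub-module input
# described above. Shapes and the Magnitude approximation are assumptions.
import torch


def build_isp_input(r_high: torch.Tensor, wb_gains: torch.Tensor, eps: float = 1e-6):
    """r_high: (B, C, H, W) amplified raw data; wb_gains: (B, 3) white-balance parameters."""
    # Magnitude: a single channel, also the only channel fed to the skip connection blocks.
    magnitude = r_high.norm(dim=1, keepdim=True)                  # (B, 1, H, W)
    # Color: R_high / ||R_high||, i.e., the per-pixel chromatic direction.
    color = r_high / (magnitude + eps)                            # (B, C, H, W)
    # White Balance: three constant channels broadcast from the 3-D white-balance parameters.
    b, _, h, w = r_high.shape
    white_balance = wb_gains.view(b, 3, 1, 1).expand(b, 3, h, w)  # (B, 3, H, W)
    # Concatenated input of the ISP sub-module; the input from R_low is built the same way.
    return torch.cat([magnitude, color, white_balance], dim=1), magnitude
```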
The structures of the encoder, decoder, and skip connection blocks are further illustrated in Fig. 18. If an encoder (or skip connection) block uses max pooling (or average pooling) for downsampling, its inward arrow is colored green in Figs. 16 and 17. If a decoder block uses upsampling and channel attention layers, its inward arrow is colored red. All other inward arrows are colored grey. Further details can be found in the source code, and a code sketch of these block types is given below.
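For readers who prefer code to diagrams, the following PyTorch sketch shows one plausible realization of the block types just described: an SE-style channel attention layer (ChAtt) [40], a 3×3 convolution with leaky ReLU, an encoder block that downsamples with max pooling, and a decoder block that upsamples and applies channel attention. Channel counts, layer depths, and hyperparameters are placeholders; the exact configurations are those shown in Figs. 16–18 and the released source code.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """SE-style channel attention (ChAtt) [40]: global average pool -> FC layers -> sigmoid gate."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))            # (B, C) per-channel weights
        return x * w.unsqueeze(-1).unsqueeze(-1)   # reweight feature maps


def conv_lrelu(in_ch: int, out_ch: int) -> nn.Sequential:
    """3x3 convolution with stride 1 followed by leaky ReLU, as used throughout the modules."""
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.2, inplace=True))


class EncoderBlock(nn.Module):
    """Convolutions + leaky ReLU, then max pooling for downsampling (green inward arrows)."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.body = nn.Sequential(conv_lrelu(in_ch, out_ch), conv_lrelu(out_ch, out_ch))
        self.down = nn.MaxPool2d(2)

    def forward(self, x):
        feat = self.body(x)
        return self.down(feat), feat               # pooled output and features for the skip path


class DecoderBlock(nn.Module):
    """Upsampling + channel attention (red inward arrows), then convolutions on fused features."""

    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.att = ChannelAttention(in_ch)
        self.body = nn.Sequential(conv_lrelu(in_ch + skip_ch, out_ch), conv_lrelu(out_ch, out_ch))

    def forward(self, x, skip):
        x = self.att(self.up(x))
        return self.body(torch.cat([x, skip], dim=1))
```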
FIGURE 16. Network structure of the gain estimation module.
FIGURE 17. Network structure of the raw data processing module.
FIGURE 18. Structure of the encoder, decoder, and skip connection blocks.