SolarClique: Detecting Anomalies in Residential Solar Arrays

Srinivasan Iyengar, Stephen Lee, Daniel Sheldon, Prashant Shenoy
University of Massachusetts Amherst
ABSTRACT
The proliferation of solar deployments has significantly increased
over the years. Analyzing these deployments can lead to the timely
detection of anomalies in power generation, which can maximize
the benefits from solar energy. In this paper, we propose Solar-
Clique, a data-driven approach that can flag anomalies in power
generation with high accuracy. Unlike prior approaches, our work
neither depends on expensive instrumentation nor does it require
external inputs such as weather data. Rather our approach exploits
correlations in solar power generation from geographically nearby
sites to predict the expected output of a site and flag anomalies. We
evaluate our approach on 88 solar installations located in Austin,
Texas. We show that our algorithm can even work with data from
few geographically nearby sites (>5 sites) to produce results with
high accuracy. Thus, our approach can scale to sparsely populated
regions, where there are few solar installations. Further, among
the 88 installations, our approach reported 76 sites with anomalies
in power generation. Moreover, our approach is robust enough to
distinguish between reduction in power output due to anomalies and
other factors such as cloudy conditions.
CCS CONCEPTS
• Computing methodologies → Anomaly detection; Supervised
learning by regression; • Social and professional topics →
Sustainability;
KEYWORDS
anomaly detection; renewables; solar energy; computational sustain-
ability
ACM Reference format:
Srinivasan Iyengar, Stephen Lee, Daniel Sheldon, Prashant Shenoy. 2018. So-
larClique: Detecting Anomalies in Residential Solar Arrays. In Proceedings
of ACM SIGCAS Conference on Computing and Sustainable Societies (COM-
PASS), Menlo Park and San Jose, CA, USA, June 20–22, 2018 (COMPASS
’18), 10 pages.
DOI: 10.1145/3209811.3209860
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
COMPASS ’18, Menlo Park and San Jose, CA, USA
© 2018 Copyright held by the owner/author(s). Publication rights licensed to ACM.
978-1-4503-5816-3/18/06. . . $15.00
1 INTRODUCTION
Technological advances and economies of scale have significantly
reduced costs and made solar energy a viable renewable alternative.
From 2010 to 2017, the average system cost of solar dropped from
$7.24 per watt to $2.80 per watt, a reduction of approximately
61% [13]. At the same time, the average energy cost of producing
solar is 12.2¢ per kilowatt-hour and is approaching the retail
electricity price of 12¢ per kilowatt-hour [1]. The declining costs
have spurred the adoption of solar among both utilities and
residential owners.
Recent studies have shown that the total capacity of small-scale
residential solar installations in the US reached 7.2 GW in 2016 [3].
Unlike large solar farms, residential installations are not monitored
by professional operators on an ongoing basis. Consequently, anom-
alies or faults that reduce the power output of residential solar arrays
may go undetected for long periods of time, significantly reducing
their economic benefits. Further, large solar farms have extensive
sensor instrumentation to monitor the array continuously, which
enables faults or anomalous output to be determined. In contrast, res-
idential installations have little or no sensor instrumentation beyond
displaying the total power of the array, making sensor-based moni-
toring and anomaly detection infeasible in such contexts. Adding
such instrumentation increases the installation costs and is not eco-
nomically feasible in most cases.
In this paper, we seek to develop a data-driven approach for
detecting anomalous output in small-scale residential installations
without requiring any sensor information for fault detection. Our
key insight is that the solar outputs of nearby installations
are correlated, and thus these correlations can be used to identify
anomalous deviations in a specific installation. We note that the
solar output from multiple sites within a city or region is available.
For instance, Enphase, an energy company, provides access to power
generation information of more than 700,000 sites across different
locations¹. Similarly, other sites² share their energy generation data
from tens of thousands of solar panels through web APIs. Thus, the
availability of such datasets makes our approach feasible. However,
the primary challenge in designing such an application is to handle
intrinsic variability of solar and site-specific idiosyncrasies.
Several factors affect the output of a solar panel, such as weather
conditions, dust, snow cover, temperature, and shade from nearby trees
or structures. We refer to such factors as transient factors since
they temporarily reduce the output of the solar array. For
instance, a passing cloud may briefly decrease the power output of
the panel but doesn’t reduce the solar output permanently. Similarly,
shade from nearby buildings or trees can be considered transient
factors as they reduce the output temporarily and may occur only at
certain periods of the day.
Interestingly, some transient factors, such as overcast conditions,
impact the output of all arrays in a geographical neighborhood,
while other factors such as shade from a nearby tree impact the
output of only a portion of the array. In addition, factors such
as malfunctioning solar modules or electrical faults also reduce
the output of a solar array, and we refer to them as anomalies
since human intervention is needed to correct the problem. Prior
studies have shown that such factors can significantly reduce the
power output by as much as 40% [6, 11, 14]. In our work, we
need to distinguish between the output fluctuations from transient
and anomalous factors. Further, site-specific idiosyncrasies (such
as shade, tilt/orientation of panels) need to be considered when
exploiting the correlation between solar arrays in a region.
¹ Enphase: https://enphase.com/
² PVSense: http://pvsense.com/
Naive approaches, such as labeling a solar installation as anomalous
whenever its power output remains “low” for an extended period,
do not work well. Drops in power output may be caused by cloudy
conditions and, depending on the weather, the solar output
may remain low for days. Labeling such instances as anomalies
may result in many false positives. Since the challenge lies in
differentiating drops in power output due to transient factors (i.e.,
factors that impact power output temporarily) and those that are
anomalies (i.e., factors that may require human intervention), we
propose SolarClique, a new approach for detecting solar anomalies
using geographically nearby sites. In designing, implementing and
evaluating our approach, we make the following contributions:
• We demonstrate how geographically nearby sites can be
  used to detect anomalies in a residential solar installation.
  In our algorithm, we present techniques to account for and
  remove transient seasonal factors such as shade from nearby
  trees and structures.
• Our approach doesn’t require any sensor instrumentation
  for fault/anomaly detection. Rather, it only requires the
  production output of the array and those of nearby arrays
  for performing anomaly detection. Since the power output
  of geographically nearby sites is readily available, our
  approach can be easily applied to millions of residential
  installations that are unmonitored today with little added
  expense.
• We implement and evaluate the performance of our algorithm.
  Our results show that the power output of a candidate
  site can be predicted using geographically nearby
  sites. Moreover, we can achieve high accuracy even when
  few geographically nearby sites (>5 sites) are available.
  This indicates that our approach can be used in sparsely
  populated locations, where there are few solar installations.
• We ran our anomaly detection algorithm on solar installations
  located in Austin, Texas. SolarClique reported power
  generation anomalies in 76 sites, many of which see a solar
  output reduction for weeks or months. Moreover, our
  approach can identify different types of anomalies: (i) no
  production, (ii) underproduction, and (iii) gradual degradation
  of solar output over time, thereby exhibiting the
  real-world utility of our approach.
2 BACKGROUND
Our work focuses on detecting anomalous solar power generation in
a residential solar installation using information from geographically
nearby sites. Unlike power generation from traditional mechanical
generators (e.g., diesel generators), where power output is constant
and controllable, the instantaneous power output from a PV system
is inherently intermittent and uncontrollable. The solar power output
may see sudden changes, going from peak capacity at one moment to
reduced (or zero) output in the next period (see Figure 1). The
change in the power output can be attributed to a number of factors,
and our goal is to determine whether the drop in power can be
attributed to anomalous behavior in the solar installation.
Figure 1: Power generation from three geographically nearby
solar sites. As shown, the power output is intermittent and
correlated for solar arrays within a geographical neighborhood.
2.1 Factors affecting solar output
A primary factor that influences the power generation of a solar panel
is the solar irradiance, i.e., the amount of sunlight that is incident on
the panel. The amount of sunlight a solar panel receives is dependent
on many factors such as time of the day and year, dust, temperature,
cloud cover, shade from nearby buildings or structures, tilt and
orientation of the panel, etc. These factors determine the amount of
power that is generated based on how much light is incident on the
solar modules.
However, a number of other factors, related to hardware, can also
reduce the power output of a solar panel. For instance, the power
output may drop due to defective solar modules, charge controllers,
inverters, strings in PV, wired connections and so on. Clearly, there
are many factors that can cause problems in power generation. Thus,
factors affecting output can be broadly classified into two categories:
(i) transient — factors that have a temporary effect on the power
output (such as cloud cover); and (ii) anomalies — factors that have
a more prolonged impact (e.g., solar module defect) on the power
output.
The transient factors can further be classified into common and
local factors. The common factors, such as weather, affect the power
output of all the solar panels in a given region. Moreover, their effect is
temporary as the output changes with a change in weather conditions.
For instance, overcast weather conditions temporarily reduce the
output of all panels in a given region. The local factors, such as
shade from nearby foliage or buildings, are usually site-specific and
do not affect the power output of other sites. These local factors
may be recurring and reduce the power output at fixed periods in
a day. In contrast, anomalous factors, such as bird droppings or
system malfunctions, reduce power output for prolonged periods
and usually require corrective action to restore normal operation of
the site. Note that both transient and anomalous factors may reduce
the power output of a solar array. Thus, a key challenge in designing
a solar anomaly detection algorithm is to differentiate reductions
in power output due to transient factors from those due to anomalies.
2.2 Anomaly detection in solar installations
Prior approaches have focused on using exogenous factors to predict
the future power generation [8, 20, 28]. A simple approach is to
use such prediction models and report an anomaly in a solar panel if
the power generated is below the predicted value for an extended
period. However, it is known that external factors such as cloud
cover are inadequate to accurately predict power output from solar
installations [17]. Thus, prediction models may over- or under-predict
power generation, and such an approach may not be sufficient
for detecting anomalies.
Prediction models can be improved using additional sensors, but
these can be prohibitively expensive for residential setups [21]. For
instance, drone-mounted cameras can detect occlusions in a panel
but are expensive and require elaborate setup. Other studies use an
ideal model of the solar arrays to detect faults [5, 12]. These studies
rely on various site-specific parameters and assume standard test
condition (STC) values of panels are known. However, site-specific
parameters are often not available. Thus, most large solar farms
usually depend on professional operators to continuously monitor
and maintain their setup to detect faults early³,⁴. Clearly, such
elaborate setups may not be economically feasible in a residential
solar installation. Below, we present our work that focuses on a
data-driven and cost-effective approach for detecting anomalies in a
solar installation.
3 ALGORITHM DESIGN
We first introduce the intuition behind our approach to detect anoma-
lous power generation in a solar installation. Our primary insight is
that other geographically nearby sites can predict the solar output
potential, which can then reveal issues in a given site. Since factors
such as the amount of solar irradiance (e.g., due to cloudy condi-
tions) are similar within a region, the power output of solar arrays in
a geographical neighborhood is usually correlated. This can be seen
in the power output from three different solar installation sites in the
same geographical neighborhood (see Figure 1). As seen, the solar
arrays tend to follow a similar power generation pattern. So we can
use the output of a group of sites to predict the output of a specific
site and flag anomalies if the observed output deviates significantly
from the prediction.
We hypothesize that predicting the output using geographically
nearby sites can “remove” the effects of confounding factors (i.e.,
common factors). By accounting for confounding factors, the
remaining influence on power generation can be attributed to local
factors in the solar installation. The local factors may include both
transient local factors and anomalies. Thus, any irregularity in power
generation, after accounting for confounding and transient local
factors, must be due to anomalies in the installation. For example,
cloudy or overcast conditions in a given location have a similar
impact on all solar panels and will reduce the power output of all sites.
However, a site with a malfunctioning solar module (a local event) will
see a larger drop in power output than the others. If the drop in
power due to cloudy conditions (a confounding factor) along with
transient local factors is accounted for, any further drop in power
can be attributed to anomalies. Our approach follows this intuition
to detect anomalies in a solar installation.
³ ESA Renewables: http://esarenewables.com/
⁴ Affinity Energy: https://www.affinityenergy.com/
Figure 2: Graphical model representation of our setup. C and L are
unobserved; X and Y are observed.
The rest of the section is organized as follows. We present a
graphical model representation for our setup that models the con-
founding variables. Next, we discuss how our algorithm removes
the confounding factors and detects anomalies in solar generation.
3.1 Graphical model representation
Our work is inspired by a study in astronomy, wherein the Half-Sibling
Regression technique was used to remove the effects of confounding
variables (i.e., noise from measuring instruments) from observations
to find exoplanets [27]. We follow a similar approach to model and
detect anomalies in a solar installation.
Let $C$, $L$, $X$ and $Y$ be the random variables (RVs) in our problem.
Here, $Y$ refers to the power generated by a candidate solar installation
site. $X$ represents the power produced by each of the geographically
nearby solar installations (represented in a vector format). While $C$
represents the confounding variables that affect both $X$ and $Y$, the
variable $L$ represents site-specific local factors affecting a candidate
site. These local factors include both transient factors and anomalies
that affect a candidate site. In our setup, both $X$ and $Y$ are observed
variables (as power generation of a site can be easily measured),
whereas $C$ and $L$ are latent unobserved variables. Figure 2 depicts
a directed graphical model (DAG) that illustrates the relationship
between these observed and unobserved random variables.
We are interested in the random variable $L$, which represents
anomalies at a given site. As seen in the figure, since both $L$ and $C$ affect
the observed variable $Y$, without the knowledge of $C$ it is difficult
to calculate the influence of $L$ on $Y$. Clearly, $X$ is independent of
$L$, as variable $L$ impacts only $Y$. However, we note that $C$ impacts
$X$ and, when conditioned on $Y$, $Y$ becomes a collider, and the
variables $X$ and $L$ become dependent [23]. This implies that $X$ contains
information about $L$ and we can recover $L$ from $X$.
To reconstruct the quantity $L$, we impose certain assumptions on
the type of relationship between $Y$ and $C$. Specifically, we assume
that $Y$ can be represented as an additive model denoted as follows:
$$Y = L + f(C) \tag{1}$$
where $f$ is a nonlinear function and its input $C$ is unobserved. Since
$L$ and $X$ are independent, variable $X$ cannot account for the influence
of $L$ on $Y$. However, $X$ can be used to approximate $f(C)$, as $C$ also
affects $X$. If $X$ exactly approximates $f(C)$, then $f(C) = E[f(C)|X]$,
and we can show that $L$ can be recovered completely using (1). Even
if $X$ does not exactly approximate $f(C)$, in our case, $X$ is sufficiently
large to provide a good approximation of $E[f(C)|X]$ up to an offset.
A more detailed description of the approach is given in [27]. Thus,
using $X$ to predict $Y$ (i.e., $E[Y|X]$), $f(C)$ can be approximated and
removed from (1) to estimate $\hat{L}$ as follows:
$$\hat{L} := Y - E[Y|X] \tag{2}$$
where $\hat{L}$ is an estimate of the local factors that may include both
transient local factors and anomalies.
Figure 3: An overview of the key steps in the SolarClique algorithm:
Half-Sibling Regression, Seasonality Removal, and Anomaly Detection.
Inputs: solar data from candidate and colocated sites, moving window
length, and anomaly threshold.
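As a brief check of why the residual in (2) tracks the local factors, the argument can be sketched in three lines. This is a restatement under the additive model (1) and the stated independence of $L$ and $X$, not a result quoted from the paper; the full treatment is in [27].

```latex
\begin{align*}
E[Y \mid X] &= E[L \mid X] + E[f(C) \mid X]
  && \text{by (1) and linearity of expectation} \\
            &= E[L] + E[f(C) \mid X]
  && \text{since } L \text{ and } X \text{ are independent} \\
\hat{L} = Y - E[Y \mid X]
            &= \bigl(L - E[L]\bigr) + \bigl(f(C) - E[f(C) \mid X]\bigr)
\end{align*}
```

When $f(C) = E[f(C)|X]$, the second bracket vanishes and $\hat{L}$ equals $L$ up to the constant offset $E[L]$; otherwise the second bracket is the approximation error left over from predicting $f(C)$ with $X$.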
3.2 SolarClique Algorithm
We now present our anomaly detection algorithm called SolarClique.
Figure 3 depicts an overview of the different steps involved in the
SolarClique algorithm. First, we use the Half-Sibling Regression
approach to build a regression model that predicts the solar gener-
ation of a candidate site using power output from geographically
nearby sites. Next, we remove any seasonal component from the
residual of the above regression model using time series decomposition. Finally,
we detect anomalies by analyzing the deviation in the power output.
Below, we describe these three steps in detail.
3.2.1 Step 1: Remove confounding effects. The first step is to
build a regression model that predicts the power generation output
$Y$ of a candidate site using $X$, a vector of power generation values
from geographically nearby solar installations. As mentioned earlier,
the regression model estimates the $E[Y|X]$ component in the additive
model shown in (2). Since $Y$ is observed, subtracting the $E[Y|X]$
component determines the $L$ component.
Standard regression techniques can be used to build this regression
model. The regression technique learns an estimator that best
fits the training data. Instead of constructing a single regression
model, we use bootstrapping (a technique that uses subsamples
with replacement of the training data), which gives multiple regression
models and the properties of the estimator (such as standard
deviation). We use an ensemble method, wherein the mean of the
regression models is taken to estimate $E[Y|X]$ in the testing data.
Finally, we remove the confounding component $E[Y|X]$ from $Y$ to
obtain an estimate $\hat{L}_t \; \forall t \in T$ in the testing data. The final output
of this step is an estimate $\hat{L}_t$ and the standard deviation ($\sigma_t$) of the
estimators.
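The bootstrap step can be illustrated with a small, standard-library-only sketch. It is an assumption-laden stand-in rather than the authors' implementation: a hand-written k-nearest-neighbor regressor replaces the scikit-learn models, the function names (`knn_predict`, `half_sibling_residuals`) and the synthetic data are hypothetical, and the pointwise spread is taken as the standard deviation of the bootstrap predictions.

```python
import math
import random
import statistics


def knn_predict(train_x, train_y, query, k=3):
    """Average the targets of the k training rows closest to `query`."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def half_sibling_residuals(train_x, train_y, test_x, test_y,
                           n_models=100, frac=0.8, k=3, seed=0):
    """Return (residuals, sigmas) for the test period.

    residuals[t] = test_y[t] minus the mean bootstrap prediction at t
    sigmas[t]    = standard deviation of the bootstrap predictions at t
    """
    rng = random.Random(seed)
    n = len(train_x)
    size = max(k, int(frac * n))
    preds = []  # preds[m][t]: prediction of bootstrap model m at period t
    for _ in range(n_models):
        idx = rng.choices(range(n), k=size)  # sample with replacement
        boot_x = [train_x[i] for i in idx]
        boot_y = [train_y[i] for i in idx]
        preds.append([knn_predict(boot_x, boot_y, q, k) for q in test_x])
    residuals, sigmas = [], []
    for t, y in enumerate(test_y):
        column = [p[t] for p in preds]
        residuals.append(y - statistics.fmean(column))
        sigmas.append(statistics.pstdev(column))
    return residuals, sigmas


# Synthetic demo (assumed data): a shared "cloudiness" factor drives all sites.
clouds = [i / 59 for i in range(60)]
train_x = [[3.0 * c, 5.0 * c] for c in clouds]  # two nearby sites
train_y = [2.0 * c for c in clouds]             # candidate site
test_x = [[3.0 * 0.9, 5.0 * 0.9], [3.0 * 0.9, 5.0 * 0.9]]
test_y = [2.0 * 0.9, 0.0]                       # second period: no production
residuals, sigmas = half_sibling_residuals(
    train_x, train_y, test_x, test_y, n_models=20)
```

In this toy setup the first residual stays close to zero, because the candidate behaves like its neighbors, while the second is strongly negative, because the candidate produces nothing although its neighbors are generating.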
3.2.2 Step 2: Remove seasonal component. As discussed earlier,
the solar output of a site is affected by both common (i.e.,
confounding) and local factors. Using the Half-Sibling Regression approach,
we can account for the transient confounding factors such as weather
changes. However, we also need to account for transient local
factors, such as shade from nearby trees, which may temporarily reduce
the power output at a specific time of the day. Since variable $\hat{L}_t$
includes both transient local factors and anomalies, we need to remove
the transient local factors to determine the anomaly $\hat{A}_t$.
We note that the time period of such occlusions (those from
nearby trees or structures) may not vary much on a daily basis.
This is because the maximum elevation of the sun in the sky varies
by less than 2° over a period of a week⁵ on average. Using time
series decomposition techniques over short time intervals (e.g., one
week), such seasonal components (i.e., the pattern occurring every
fixed period) can be removed. Thus, we perform a time series
decomposition to account for transient local factors as follows. We
compute the seasonal component and remove it from $\hat{L}_t$ only when
$\hat{L}_t$ is outside the $4\sigma$ confidence interval and, on removal of the
seasonal component, $\hat{L}_t$ no longer falls outside the confidence interval.
After removal of the seasonal component, if any, we obtain $\hat{A}_t$ from
$\hat{L}_t$ as our final output.
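The conditional removal rule can be sketched as follows. The seasonal component is approximated here by the mean residual at each phase of the cycle, which is a simplifying assumption standing in for a full time series decomposition; the function names and the toy numbers are hypothetical.

```python
from statistics import fmean


def seasonal_component(residuals, period=24):
    """Mean residual at each phase of the cycle (e.g., each hour of the day)."""
    return [fmean(residuals[phase::period]) for phase in range(period)]


def remove_seasonality(residuals, sigmas, period=24, width=4.0):
    """Apply the conditional removal rule to one window of residuals.

    The seasonal value is subtracted only when the raw residual lies outside
    +/- width * sigma AND the corrected residual falls back inside it.
    Assumes the window holds at least `period` samples.
    """
    seasonal = seasonal_component(residuals, period)
    adjusted = []
    for t, (l_t, s_t) in enumerate(zip(residuals, sigmas)):
        corrected = l_t - seasonal[t % period]
        outside_before = abs(l_t) > width * s_t
        inside_after = abs(corrected) <= width * s_t
        adjusted.append(corrected if outside_before and inside_after else l_t)
    return adjusted


# Toy window: four "days" of four periods each (period=4 keeps it short).
# Phase 2 is shaded every day (-1.0); period 5 has a one-off dip (-3.0).
residuals = [0.0, 0.0, -1.0, 0.0] * 4
residuals[5] = -3.0
sigmas = [0.1] * len(residuals)
adjusted = remove_seasonality(residuals, sigmas, period=4)
```

The recurring shade at phase 2 is absorbed by the seasonal term, so those entries return to zero, while the one-off dip at period 5 remains because subtracting the seasonal value does not bring it back inside the interval.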
3.2.3 Step 3: Detect Anomalies. We use the output $\hat{A}_t$ (from Step
2) and the standard deviation $\sigma_t$ (from Step 1) to detect anomalies in
a candidate site. Specifically, we flag the day as anomalous when
three conditions hold. First, the deviation of $\hat{A}_t$ should be significant,
i.e., greater than four times the standard deviation. Second, the
anomaly should persist for at least $k$ contiguous periods. Finally,
the period $t$ should fall within the daytime (not including twilight).
Thus, an anomaly can be defined as follows:
$$\text{anomaly} = (\hat{A}_t < -4\sigma_t) \wedge \dots \wedge (\hat{A}_{t+k} < -4\sigma_t) \quad \forall t \in T \tag{3}$$
where $T$ denotes the time during the daytime period.
Based on our assumption that $\hat{A}_t$ is Gaussian, it follows that the
odds of an anomaly are very high when the deviation is more than
$4\sigma$. These anomalous values belong to the end-tail of the normal
distribution. The second condition (i.e., contiguous anomaly period)
ensures that the drop in power output is for an extended period. In
practice, depending on the data granularity, the contiguous period
can range from minutes to hours. Clearly, we would like to detect
anomalies during the period when sunlight is abundant. During the
night or twilight, the solar irradiation is too low to provide any
meaningful power generation. Thus, we choose the daytime period
in our algorithm for anomaly detection.
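A minimal sketch of the three detection conditions is given below. Several details are assumptions made for illustration: each period is compared against its own standard deviation, a run of at least `k` flagged daytime periods counts as anomalous, and night or twilight periods are skipped without breaking a run. The function name and sample values are hypothetical.

```python
def detect_anomalies(adjusted, sigmas, is_daytime, k=2, width=4.0):
    """Return sorted indices of periods that belong to an anomalous run."""
    flagged = set()
    run = []  # indices of the current streak of low daytime residuals
    for t, (a_t, s_t, day) in enumerate(zip(adjusted, sigmas, is_daytime)):
        if not day:
            continue  # night/twilight neither extends nor breaks a streak
        if a_t < -width * s_t:
            run.append(t)
            if len(run) >= k:
                flagged.update(run)
        else:
            run = []
    return sorted(flagged)


# Toy series: an isolated dip at period 1, then a sustained drop at 3..5.
adjusted = [0.0, -1.0, 0.0, -1.0, -1.0, -1.0, 0.0]
sigmas = [0.1] * 7
all_day = [True] * 7
flags = detect_anomalies(adjusted, sigmas, all_day, k=2)

# Same series, but period 4 falls outside the daytime window.
with_night = [True, True, True, True, False, True, True]
flags_night = detect_anomalies(adjusted, sigmas, with_night, k=2)
```

The isolated dip at period 1 is ignored because it does not persist, which is the purpose of the contiguity condition.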
4 IMPLEMENTATION
We implemented our SolarClique algorithm in Python using the
SciPy stack [4]. The SciPy stack consists of efficient data processing
and numeric optimization libraries. Further, we use the regression
techniques in the scikit-learn library to learn our models [24]. The
scikit-learn library comprises various regression tools, which take a
vector of input features and learn the parameters that best describe
the relationship between the input and the dependent variable.
Additionally, we use the Seasonal and Trend decomposition using Loess
(STL) technique to remove the seasonality component [9]. The STL
technique performs a time series decomposition on the input and
deconstructs it into trend, seasonal, and noise components.
⁵ The sun directly faces the Tropic of Cancer (+23.5°) on the summer solstice,
whereas it faces the Tropic of Capricorn (-23.5°) on the winter solstice. Thus, over
half the year (26 weeks) the maximum elevation of the sun changes by 47°, i.e., <2°
per week.
Number of solar installations | 88
Solar installation size (kW)  | 0.5 to 9.3
Residential size (sq. ft.)    | 1142 to 3959
Granularity                   | 1 hour
Year                          | 2014, 2015
Table 1: Key characteristics of the dataset.
5 EVALUATION
5.1 Dataset
For evaluating the efficacy of SolarClique, we use a public dataset
available through the Dataport Research Program [2]. The dataset
contains solar power generation from over a hundred residential solar
installations located in the city of Austin, Texas. The power
generation from these installations is available at an hourly granularity.
Table 1 shows the key characteristics of the dataset. For our case
study, we selected those homes that have contiguous solar generation
data, i.e., no missing values, for an overlapping period of at least two
years. Based on these criteria, we had 88 homes for our evaluation in
the years 2014 and 2015.
5.2 Evaluation Methodology
We partitioned our dataset into training and testing periods. We used
the first three months of data to train the model, and the remaining
dataset for testing (21 months). Further, for bootstrapping, we
sample our training dataset by randomly selecting 80% of the training
samples with replacement. These samples are then used to build the
estimator, and we repeated this step 100 times to learn the properties
of the estimator. To build our model, we used five popular regression
techniques, namely Random Forest (RF), k-Nearest Neighbor (kNN),
Decision Trees (DT), Support Vector Regression (SVR), and Linear
Regression (LR). Finally, we selected the contiguous period as $k = 2$
(see Step 3 of our algorithm) since our data granularity is hourly.
Unless stated otherwise, we use all homes in our dataset for our
evaluation.
5.3 Metrics
Since the installation capacity can be different across solar panels, it
may not be meaningful to use a metric such as Root Mean Squared
Error (RMSE). This is because the magnitude of the error may be
different across predictions. Thus, we use Mean Absolute Percentage
Error (MAPE) to measure the regression model’s accuracy in
predicting a candidate’s power generation. MAPE is defined as:
$$\mathrm{MAPE} = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{y_t - p_t}{\bar{y}_t} \right| \tag{4}$$
where $y_t$ and $p_t$ are the actual and predicted value at time $t$,
respectively, $\bar{y}_t$ represents the average of all the values, and $n$ is
the number of samples in the test dataset.
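As a concrete reading of this metric, the sketch below normalizes the absolute errors by the mean of the actual values, following the description of the denominator as "the average of all the values". The function name `mape` and the sample numbers are illustrative assumptions.

```python
from statistics import fmean


def mape(actual, predicted):
    """Mean absolute error as a percentage of the mean actual value."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = fmean(actual)
    if mean_actual == 0:
        raise ValueError("mean of actual values is zero; MAPE is undefined")
    mean_abs_error = fmean(abs(y - p) for y, p in zip(actual, predicted))
    return 100.0 * mean_abs_error / mean_actual


# Absolute errors are 1, 0 and 2 (mean 1.0); the mean actual value is 4.0.
score = mape([2.0, 4.0, 6.0], [1.0, 4.0, 8.0])  # 25.0
```

Dividing by the mean output rather than by each $y_t$ avoids division by zero at night, when the actual generation is zero, and makes sites of different capacity comparable.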
Figure 4: Performance of different regression techniques used
to predict the power generation of a site.
Figure 5: Mean standard deviation of predictions for different
regression techniques
5.4 Results
Below, we summarize the results of using SolarClique on the Data-
port dataset.
5.4.1 Prediction performance using geographically nearby sites.
We compare the accuracy of the five regression techniques used to
predict the power generated at a candidate site ($Y$) using the data
from nearby sites ($X$). Figure 4 shows the spread of the MAPE values
for the regression techniques used for all the 88 sites. Random Forest
and Decision Trees show the best performance, closely followed by
k-NN, with average MAPE values of approximately 7.81%, 7.87%,
and 8.94%, respectively. Linear Regression, on the other hand, shows
poor accuracy with an average MAPE of 19%.
As discussed earlier, our approach uses bootstrapping to generate
the standard deviation values for each prediction. Note that a small
standard deviation means tighter confidence interval and indicates
that the regression technique has a consistent prediction across runs.
Figure 5 shows the mean value of standard deviation over all the
testing samples normalized by the size of the solar installation. We
observe that RF and k-NN have tight confidence intervals, while LR
has considerably wider bounds. In particular, we observe that the
average standard deviation of RF and k-NN is 0.0032 and 0.0059
using all the sites, respectively. In comparison, the average standard
deviation of LR is 0.0078. Since RF performs better than other
regression techniques, we use RF for the rest of our evaluation.
Figure 6: Average MAPE diminishes with increase in the num-
ber of geographically nearby sites.
Figure 7: Standard deviation of MAPE diminishes with in-
crease in the number of geographically nearby sites.
5.4.2 Impact due to the number of geographically nearby sites.
We now focus on understanding the minimum number of geographically
nearby sites needed to accurately predict the power generated at the
candidate site. As discussed earlier, the power outputs of
geographically nearby sites are used as input features to build the regression
model. Since in this experiment we are not interested in analyzing
the confidence intervals, we use the entire training data to build
the model (i.e., no bootstrapping). We vary the number of
geographically nearby sites from 1 to 50 and, for each value, we build
100 different models learned from random combinations of
nearby sites.
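The experimental loop described here can be outlined as below. This is only a scaffold under stated assumptions: `evaluate` is a hypothetical caller-supplied function that would train a model on the chosen nearby sites and return its test MAPE, and `toy_evaluate` is a made-up stand-in so the sketch runs on its own.

```python
import random
from statistics import fmean, pstdev


def sweep_neighbor_counts(site_ids, evaluate, counts, n_models=100, seed=0):
    """Map each neighbor count m to (mean, std) of the score over random subsets.

    `evaluate(subset)` is a caller-supplied (hypothetical) function that
    trains on the given nearby sites and returns the resulting test MAPE.
    """
    rng = random.Random(seed)
    summary = {}
    for m in counts:
        # Draw n_models random combinations of m nearby sites.
        scores = [evaluate(rng.sample(site_ids, m)) for _ in range(n_models)]
        summary[m] = (fmean(scores), pstdev(scores))
    return summary


# Made-up stand-in for training and scoring: error shrinks with more sites.
def toy_evaluate(subset):
    return 30.0 / len(subset)


summary = sweep_neighbor_counts(
    list(range(87)), toy_evaluate, counts=[1, 5, 50], n_models=10)
```

Recording both the mean and the standard deviation per neighbor count mirrors the two quantities reported in Figures 6 and 7.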
Figure 6 shows the spread of average MAPE values as we vary
the number of geographically nearby sites used for all 88 sites. We
use the Random Forest regression technique to build the model. As
expected, the average MAPE value reduces when more
geographically nearby sites are used to predict the output. Note that
as the number of nearby sites increases, the variations in nearby sites
cancel out, which provides a more robust regression model. This suggests that
increasing the number of nearby sites can improve the accuracy of the power
generation model of a candidate site. We also note that the reduction
in MAPE diminishes as the number of geographically nearby sites
increases. With at least five randomly chosen geographically nearby
sites, we observe that the MAPE is around 10%. This indicates that
our algorithm can be effective in sparsely populated regions such as
towns/villages, having few solar installations.
Next, we analyze the variability in performance of the different
models as the number of geographically nearby sites increases. Fig-
ure 7 shows the spread of the standard deviation of the 100 models
with increasing number of geographically nearby sites. As shown in
the figure, we observe that the variability reduces when the number
of nearby sites increases. However, unlike the previous result, the
variability continues to reduce, albeit at a slower rate, even
when the number of nearby sites is greater than five. Thus, with more
nearby sites, the performance of each learned model is closer to the average.
5.4.3 Detection of anomalies. We illustrate the different steps
involved in our algorithm using Figure 8. In the top subplot of the
figure, the blue line depicts the power generation trace from a solar
installation for over a week in August, 2015. The red marker shows
the prediction from the RandomForest regression technique with
data from the remaining 87 sites as features. While the prediction
(i.e., red marker) closely follows the actual power output (i.e., blue
line), there is a significant difference between the actual and predicted
output after 14th August. As seen, there is a sharp drop in the actual power
generated in the late morning of 14th August. The drop in power
is significant, and there is no output recorded in the site for an
extended period until October (not shown in the figure). However,
the regression model forecasts a non-negative power output for the
given site.
The second subplot shows the residual, i.e., the difference be-
tween the actual and the predicted values (i.e., the black line) along
with the confidence interval (i.e., the gray shaded region). The
confidence interval, which is within ±4σ, is calculated using the
pointwise standard deviation obtained from the bootstrap process.
In this figure, we observe that the residual sometimes lies outside the
confidence interval at the same time of the day across multiple days
— which indicates a fixed periodic component.
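The confidence-interval check described above can be sketched as follows. This is a minimal sketch under stated assumptions: the band is centered on the mean of the bootstrap predictions, the half-width is four pointwise standard deviations, and the function names and sample numbers are hypothetical.

```python
from statistics import fmean, stdev


def confidence_band(bootstrap_predictions, width=4.0):
    """Pointwise mean prediction and band half-width from bootstrap models.

    bootstrap_predictions: list of equally long prediction series, one per
    bootstrap model. The half-width is width * pointwise standard deviation.
    """
    per_point = list(zip(*bootstrap_predictions))
    centers = [fmean(values) for values in per_point]
    half_widths = [width * stdev(values) for values in per_point]
    return centers, half_widths


def outside_band(actual, centers, half_widths):
    """Flag points whose residual (actual - predicted) leaves the band."""
    residuals = [a - c for a, c in zip(actual, centers)]
    flags = [abs(r) > h for r, h in zip(residuals, half_widths)]
    return residuals, flags


# Three hypothetical bootstrap models predicting three time steps.
predictions = [[1.0, 2.0, 3.0], [1.2, 2.1, 2.8], [0.8, 1.9, 3.2]]
actual = [1.1, 2.0, 0.0]
centers, half_widths = confidence_band(predictions)
residuals, flags = outside_band(actual, centers, half_widths)
print(flags)
```

Only the third time step, where the actual output collapses to zero, falls outside the band in this example.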
On removing the seasonal component using our approach, we
observe that the residual always lies within the confidence interval,
except when there is an anomaly in power generation. This is shown
in the third subplot of the figure, where the black line (i.e., residual)
lies within the gray shaded region (i.e., the confidence interval).
Finally, the last subplot depicts our anomaly detection algorithm in
action. We observe that our algorithm accurately flags periods of no
output as an anomaly (depicted by the red shaded region).
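The effect of removing a fixed periodic component from the residual can be illustrated with a simplified stand-in. The sketch below subtracts the per-time-slot median across days; this is an assumption chosen for brevity and is not the decomposition used in the paper, and the function name and sample residuals are hypothetical.

```python
from statistics import median


def remove_daily_pattern(residuals_by_day):
    """Subtract the per-slot median across days from each day's residuals.

    residuals_by_day: list of days, each a list with one residual per
    time-of-day slot. A recurring offset at a fixed slot (e.g., daily
    shading) shows up in the median and is removed; a one-off drop is not.
    """
    n_slots = len(residuals_by_day[0])
    seasonal = [median(day[slot] for day in residuals_by_day)
                for slot in range(n_slots)]
    cleaned = [[day[slot] - seasonal[slot] for slot in range(n_slots)]
               for day in residuals_by_day]
    return seasonal, cleaned


# Hypothetical residuals: slot 2 is shaded every day; day 3 has a fault at slot 1.
residuals = [
    [0.0, 0.0, -0.5, 0.0],
    [0.0, 0.0, -0.5, 0.0],
    [0.0, -2.0, -0.5, 0.0],
]
seasonal, cleaned = remove_daily_pattern(residuals)
print(seasonal)
print(cleaned[2])
```

The recurring offset at slot 2 disappears from every day, while the isolated drop on the third day survives and remains available to the anomaly detector.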
6 CASE-STUDY: ANOMALY DETECTION
ANALYSIS
In this case study, we use the solar installations in the Dataport as
they represent a typical setup within a city. We ran our SolarClique
algorithm on the generation output from all solar installations and
obtained the anomalous days in the dataset. Below, we present our
analysis.
6.1 Anomalies in solar installations
Figure 9 shows the total number of anomalous days in each solar
installation site. We observe that our SolarClique algorithm found
anomalous days in 76 solar installations, out of the 88 sites in the
dataset. As seen in the figure, the total number of anomalous days
per site ranges from a single day to several months. Together, the installation sites
had a total of 1906 anomalous days. This indicates a significant
loss of renewable power output. Specifically, we observe that 17 of
the 88 (around 20%) installations had anomalous power generation
for a total of at least one month, which is roughly 5% of
the overall 640 days in the testing period. Anomalies from these
installations account for nearly 80% of all the anomalous days.
Figure 8: An illustrative example that depicts the data-processing and anomaly detection steps in SolarClique.
Figure 9: Number of anomalous days for each site. Installation
sites are plotted in ascending order of anomalous days.
To better understand the anomalous periods, we group them into
short-term and long-term periods. The short-term periods have less
than three contiguous anomalous days, while the long-term periods
have consecutive anomalous days for at least three days. Our results
show the dataset has 587 occurrences of short-term periods spread
over 683 days. Further, we observe 123 occurrences of long-term
periods spread over 1223 days. We also observe that the maximum
contiguous anomalous period found in a site was approximately five
months (i.e., 158 days), with no power output during that period.
Clearly, such a high number of long-term anomalous periods demonstrates
the need for early anomaly detection tools. Additionally, we
note that long-term anomalies are relatively easier to detect than
short-term anomalies. While long-term anomalies represent serious
issues that may need immediate attention, short-term anomalies may
be minor problems that, if left unattended, could become major problems
in the future. The advantage of our approach is that it can detect both
short-term and long-term anomalies.
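The grouping into short-term and long-term periods can be sketched as follows. The three-day cutoff comes from the text; the function names and the sample day indices are hypothetical.

```python
def group_anomalous_days(days):
    """Group day indices into runs of consecutive days."""
    runs = []
    for day in sorted(set(days)):
        if runs and day == runs[-1][1] + 1:
            runs[-1][1] = day          # extend the current run
        else:
            runs.append([day, day])    # start a new run
    return [(start, end) for start, end in runs]


def summarize_periods(days, long_term_min=3):
    """Count occurrences and total days of short- and long-term periods."""
    summary = {"short": {"occurrences": 0, "days": 0},
               "long": {"occurrences": 0, "days": 0}}
    for start, end in group_anomalous_days(days):
        length = end - start + 1
        kind = "long" if length >= long_term_min else "short"
        summary[kind]["occurrences"] += 1
        summary[kind]["days"] += length
    return summary


# Hypothetical day indices flagged as anomalous for one site.
flagged = [1, 2, 5, 6, 7, 8, 20]
print(group_anomalous_days(flagged))
print(summarize_periods(flagged))
```

Summing such per-site summaries over all sites would give dataset-level counts of the kind reported above (occurrences and total days per category).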
Figure 10: Under-production of solar detected using our algo-
rithm.
6.2 Analysis of anomalies detected
Note that the reduction in power output depends on the severity of
an anomaly. This is because some electrical faults (e.g., short-circuit
of a panel) may have localized impact on a solar array, which can
marginally reduce the power output, while other faults (e.g., inverter
faults) may show significant power reduction or completely stop
power generation.
SolarClique detects anomalous days when there is no solar generation
and also when an installation underproduces power. Our
algorithm reported 1099 and 807 anomalous days with under-production
and no solar generation, respectively. Since days with no solar generation
are trivially anomalous, we specifically examine cases of solar
under-production. Figure 10 shows the power output from three
different sites. The top plot shows the power output (depicted by
the blue line) of a site with no anomalous days, while the subplots below show
sites that have anomalous days (depicted by the red marker). Our
results show that the SolarClique algorithm detects anomalies even
when a site underproduces solar power. Note that the site with no
anomaly, which is exposed to the same solar irradiance as the other
sites, continues to produce solar output. However, we observe a
drop in power output for an extended period in the anomalous sites.
Specifically, we observe that the drop in power output is around 75% and
40% in Site 1 and Site 2, respectively, presumably due to factors
such as line faults in the solar installation. Usually, anomalies such
as line faults can cause a significant drop in the power output. In
particular, a 75% drop in Site 1 can be attributed to faults in
three-fourths of the strings (i.e., panels connected in series).
Figure 11: Distribution of the difference between actual and predicted
output on underproducing anomalous days.
Figure 12: Anomalies detected in two sample sites, (a) Site A and (b) Site B, where the
difference between actual and predicted output was less than 5%. The figure
shows a good fit on all days except the anomalous period
highlighted in the circle.
We further examine the reduction in power output in the under-production
cases. Figure 11 shows the distribution of the difference
between actual and predicted power output for anomalous days. Out of
the 1099 under-production days, our algorithm reported 23 days when
the percentage difference was less than or equal to 5%. Typically,
a drop of more than 5% in power output is considered significant. This is
because the malfunctioning of a single panel in a solar array with 20 panels
(see footnote 6) will result in a 5% reduction. Thus, we investigate anomalous
days wherein the difference is less than or equal to 5%. Figure 12
compares the regression fit of anomalous days with two normal days
(adjacent to the anomalous days) from two sample sites where the
difference was less than 5%. Note that the figure shows a good fit
for most periods except during the anomalous period highlighted
in the circle. In comparison to other periods, we observe a drop in
power during the anomalous period, occurring at mid-day.
Even though the percentage difference is small, it represents a
relatively significant drop since the power output is at its peak
at mid-day.
Footnote 6: Typically, an installation with 5 kW capacity has 20 panels, each with a capacity of 250 W.
Figure 13: Accelerated degradation in the power output of a
solar site.
Anomaly type               #Sites  #Days  Avg. power reduction (%)
Single No Production            5    515  98.87
Multiple No Production          3    295  98.65
Single Under Production         2    348  60.22
Multiple Under Production       4    164  43.63
Severe Degradation              3    179  30.67
Table 2: Types of anomaly in sites having more than a month of
anomalous days.
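The percentages quoted in this section follow from simple proportions. The sketch below assumes that every panel or string contributes equally to the output and that a faulty unit produces nothing; the function names and the sample readings are hypothetical.

```python
def expected_reduction_pct(faulty_units, total_units):
    """Percentage of output lost when faulty units produce nothing.

    Assumes every unit (panel or string) contributes equally.
    """
    return 100.0 * faulty_units / total_units


def observed_reduction_pct(actual, predicted):
    """Percentage by which actual output falls short of the prediction."""
    return 100.0 * (predicted - actual) / predicted


print(expected_reduction_pct(1, 20))      # one of 20 panels
print(expected_reduction_pct(3, 4))       # three of four strings
print(observed_reduction_pct(4.75, 5.0))  # hypothetical kW readings
```

Under these assumptions, one failed panel out of 20 corresponds to the 5% threshold, and three failed strings out of four correspond to a 75% drop.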
We observe that our approach also detects anomalies due to degra-
dation in the power output, which usually spans over an extended
time period. Since the drop in power output over the time period
may be small, such changes are more subtle and harder to identify.
Figure 13 shows the degradation in power output of an anomalous
site. Our algorithm reports an increase in the frequency of anoma-
lous days in the installation site over the year, with more anomalous
days in the latter half. To understand the increase in anomalous days,
we plot the difference between the actual and predicted output (seen in the
bottom subplot). We observe that the difference between the actual
and predicted values steadily increases over time. It is known that
the power output of solar installations may reduce over time due to
aging [22] at a rate of around 1% a year. However, the accelerated
degradation seen in Figure 13 is presumably due to occurrences of
hot-spots or increased contact resistance due to corrosion. Early
detection of such conditions can help homeowners take advantage
of product warranties available on solar panels.
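One way to quantify such degradation is to fit a linear trend to the daily shortfall and compare it with the nominal aging rate of around 1% a year. The sketch below is an illustration under assumptions: the least-squares trend, the function names, and the synthetic shortfall series are not taken from the paper.

```python
def slope_per_step(values):
    """Ordinary least-squares slope of values against their index."""
    n = len(values)
    x_mean = (n - 1) / 2.0
    y_mean = sum(values) / n
    numerator = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


def annual_degradation_pct(daily_shortfall_fraction):
    """Convert the daily trend in (predicted - actual) / predicted to %/year."""
    return 100.0 * 365 * slope_per_step(daily_shortfall_fraction)


NOMINAL_AGING_PCT_PER_YEAR = 1.0  # approximate aging rate quoted in the text

# Hypothetical site whose shortfall grows by 0.05 percentage points per day.
shortfall = [0.0005 * day for day in range(365)]
rate = annual_degradation_pct(shortfall)
print(rate > NOMINAL_AGING_PCT_PER_YEAR)
```

A fitted rate far above the nominal aging rate would point to accelerated degradation rather than normal wear.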
We now examine the types of anomalies in the top 17 sites with
more than a month of anomalous days. The power output of anoma-
lous days can be categorized into three types — (i) no production,
(ii) under production, and (iii) degradation over a period. Table 2
summarizes the different types occurring over a period in these sites.
The single period represents a single contiguous period of anomaly,
while the multiple period represents more than one contiguous period.
We observe that the average power reduction during anomalous
periods ranges from 30.67% to 98.87%. We classify “no production
days” as days with no power output for the majority of the
period. Overall, we observe that there are 810 no production days,
a significant loss in renewable output. Although the average power
reduction due to severe degradation is around 30%, it is likely to grow over
time.
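The rule for labeling a "no production day" can be sketched as follows. The text only states that such days have no power output for the majority of the period; the near-zero tolerance, the function name, and the sample readings are assumptions.

```python
from collections import Counter


def classify_anomalous_day(actual, zero_tolerance=1e-3):
    """Label an anomalous day from its daytime power samples.

    Returns "no_production" when more than half of the samples are
    (near) zero, otherwise "under_production".
    """
    near_zero = sum(1 for value in actual if abs(value) <= zero_tolerance)
    if near_zero > len(actual) / 2:
        return "no_production"
    return "under_production"


# Hypothetical daytime samples (kW) for three anomalous days.
anomalous_days = [
    [0.0, 0.0, 0.0, 1.2],
    [1.0, 0.9, 0.0, 1.1],
    [0.0, 0.0, 0.0, 0.0],
]
tally = Counter(classify_anomalous_day(day) for day in anomalous_days)
print(tally)
```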
7 FUTURE EXTENSIONS TO SOLARCLIQUE
As mentioned earlier, several third-party sites exist that host solar
generation data for rooftop installations. While in our approach,
we use power to determine the existence of anomalies in power
generation, several other electrical characteristics such as voltage
and current are available that carry much richer information about
the type of anomaly. This information can be leveraged to further
infer the exact type of anomaly in power generation. For example,
a line fault (broken string) will reduce the current produced by the
overall setup, but the voltage will remain unchanged. Conversely,
covering of dust/bird droppings can impact both the voltage and the
current. Thus, our algorithm can be extended to use multi-modal
data (e.g., voltage, current, and power) to further diagnose the exact
cause of the anomaly.
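The two examples above suggest a simple rule-based diagnosis. The sketch below is a hypothetical illustration of that idea and not an implemented part of SolarClique: the 5% threshold, the function name, and the labels are assumptions.

```python
def diagnose(voltage_drop_pct, current_drop_pct, threshold_pct=5.0):
    """Rule-of-thumb diagnosis from drops relative to expected values.

    threshold_pct is a hypothetical cutoff for calling a drop significant.
    """
    voltage_low = voltage_drop_pct > threshold_pct
    current_low = current_drop_pct > threshold_pct
    if current_low and not voltage_low:
        return "possible line fault (broken string)"
    if current_low and voltage_low:
        return "possible covering (e.g., dust or bird droppings)"
    if voltage_low:
        return "unclassified: voltage drop only"
    return "no significant drop"


print(diagnose(voltage_drop_pct=1.0, current_drop_pct=50.0))
print(diagnose(voltage_drop_pct=20.0, current_drop_pct=30.0))
```

A practical classifier would need thresholds calibrated on labeled faults, which the text leaves to future work.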
Our approach can also be extended to a single solar installation
for detecting anomalies. With the proliferation of micro-inverters in
residential solar installations, power generation data from individual
panels are available. Power output from these colocated panels can
also be used to detect faults in the PV setup, as they can predict
the power output with higher fidelity. This can be used in remote
locations where data from other solar installations are not easily
available. As part of future work, we plan to use the SolarClique
algorithm to discover faults in a single panel by comparing its power
generation with that of other panels in the same setup.
8 RELATED WORK
There has been significant work on predicting the solar output from
solar arrays [7, 8, 16, 20, 28]. While some studies have used site-specific
data such as panel configuration [8, 20] for building the
prediction model, others have used external data such as weather or
historical generation data [17, 28]. Such models can provide short-term
generation forecast (e.g., an hour) to long-term forecast (e.g.,
days or weeks). Although these studies can predict the reduction
in power output, a limitation in these studies is that they cannot
attribute the reduction to anomalies in the solar installation.
Prior work has also focused on anomaly detection in PV panels
[14, 15, 22, 25, 30–32]. These studies propose methods to model
the effects of shades/covering [14, 19], hot-spots [18], degradation
[22, 30] or short-circuit and other faults [15]. However, these
methods require extensive data (such as statistics on different types
of anomalies) [29] or do not focus on hardware-related issues [14].
For instance, [29] proposes a solution to determine probable causes
of anomalies but requires detailed site-specific information along
with pre-defined profiles of anomalies. Unlike prior approaches,
our approach doesn’t require such extensive data or setup and relies
instead on power generation from co-located sites. Thus, it provides
a scalable and cost-effective approach to detect anomalies in
thousands of solar installation sites.
The idea behind our approach is similar to [26, 27]. However,
the authors use the approach in the context of an astronomy application,
wherein systematic errors are removed to detect exoplanets.
In this case, the systematic errors are confounding factors due to
the telescope and spacecraft, which influence the observations from
distant stars. In contrast, our solution uses inputs from other geographically
nearby sites to detect anomalies in solar generation. As discussed
earlier, today, such datasets are easily accessible over the internet,
which makes our approach feasible. Further, using regression on
the data from neighbors has been studied earlier [10]. However,
the main focus of that work was quality control in
climate observations by imputing missing values. In our case, we
use the learned regression model to find anomalous solar generation.
9 CONCLUSION
In this paper, we proposed SolarClique, a data-driven approach to
detect anomalies in power generation of a solar installation. Our
approach requires only power generation data from geographically
nearby sites and doesn’t rely on expensive instrumentation or other
external data. We evaluated SolarClique on the power generation
data over a period of two years from 88 solar installations in Austin,
Texas. We showed that our regression models
are accurate, with tight confidence intervals. Further, we showed
that our approach can generate models with as few as five
geographically nearby sites. We observed that out of the 88 solar
installations, 76 deployments had anomalies in power generation.
Additionally, we found that our approach is powerful enough to
distinguish between reduction in power output due to anomalies and
other factors (such as cloudy conditions). Finally, we presented a
detailed analysis of the different anomalies observed in our dataset.
Acknowledgment
This research is supported by NSF grants
IIP-1534080, CNS-1645952, CNS-1405826, CNS-1253063, CNS-
1505422, CCF-1522054 and the Massachusetts Department of En-
ergy Resources.
REFERENCES
[1]
2016. When Will Rooftop Solar Be Cheaper Than the Grid? https://goo.gl/h1Ayy5.
(2016). Accessed March, 2018.
[2] 2017. Dataport dataset. https://dataport.cloud/. (2017).
[3]
2017. EIA adds small-scale solar photovoltaic forecasts to its monthly Short-Term
Energy Outlook. https://www.eia.gov/todayinenergy/detail.php?id=31992. (2017).
Accessed March, 2018.
[4]
2018. SciPy Stack. http://www.scipy.org/stackspec.html. (Accessed March 2018).
[5]
Mohamed Hassan Ali, Abdelhamid Rabhi, Ahmed El Hajjaji, and Giuseppe M
Tina. 2017. Real Time Fault Detection in Photovoltaic Systems. Energy Procedia
(2017).
[6]
Rob W Andrews, Andrew Pollard, and Joshua M Pearce. 2013. The effects of
snowfall on solar photovoltaic performance. Solar Energy 92 (2013), 84–97.
[7]
Yona Atsushi and Funabashi Toshihisa. 2007. Application of recurrent neural
network to short-term-ahead generating power forecasting for photovoltaic system.
In Power Engineering Society General Meeting. Tampa, Florida, USA.
[8]
Peder Bacher, Henrik Madsen, and Henrik Aalborg Nielsen. 2009. Online short-
term solar power forecasting. Solar Energy (2009).
[9]
Robert B Cleveland, William S Cleveland, and Irma Terpenning. 1990. STL:
A seasonal-trend decomposition procedure based on loess. Journal of Official
Statistics (1990).
[10]
Christopher Daly, Wayne Gibson, Matthew Doggett, Joseph Smith, and George
Taylor. 2004. A probabilistic-spatial approach to the quality control of climate
observations. In Proceedings of the 14th AMS Conference on Applied Climatology,
Amer. Meteorological Soc., Seattle, WA.
[11]
Chris Deline. 2009. Partially shaded operation of a grid-tied PV system. In
Photovoltaic Specialists Conference (PVSC), 2009 34th IEEE. IEEE, 001268–
001273.
[12]
Mahmoud Dhimish, Violeta Holmes, and Mark Dales. 2017. Parallel fault de-
tection algorithm for grid-connected photovoltaic plants. Renewable Energy
(2017).
[13]
Ran Fu, David J Feldman, Robert M Margolis, Michael A Woodhouse, and
Kristen B Ardani. 2017. US solar photovoltaic system cost benchmark: Q1 2017.
Technical Report. National Renewable Energy Laboratory (NREL), Golden, CO
(United States).
[14]
Peter Xiang Gao, Lukasz Golab, and Srinivasan Keshav. 2015. What’s Wrong
with my Solar Panels: a Data-Driven Approach.. In EDBT/ICDT Workshops.
86–93.
[15]
Elyes Garoudja, Fouzi Harrou, Ying Sun, Kamel Kara, Aissa Chouder, and
Santiago Silvestre. 2017. Statistical fault detection in photovoltaic systems. Solar
Energy (2017).
[16]
Rui Huang, Tiana Huang, Rajit Gadh, and Na Li. 2012. Solar generation pre-
diction using the ARMA model in a laboratory-level micro-grid. In Smart Grid
Communications (SmartGridComm), 2012 IEEE Third International Conference
on. IEEE.
[17]
Srinivasan Iyengar, Navin Sharma, David Irwin, Prashant Shenoy, and Krithi
Ramamritham. 2017. A Cloud-Based Black-Box Solar Predictor for Smart Homes.
ACM Transactions on Cyber-Physical Systems 1, 4 (2017), 21.
[18]
Katherine A Kim, Gab-Su Seo, Bo-Hyung Cho, and Philip T Krein. 2016. Photo-
voltaic hot-spot detection for solar panel substrings using ac parameter characteri-
zation. IEEE Transactions on Power Electronics (2016).
[19]
Alexander Kogler and Patrick Traxler. 2016. Locating Faults in Photovoltaic
Systems Data. In International Workshop on Data Analytics for Renewable Energy
Integration. Springer.
[20]
Elke Lorenz, Johannes Hurka, Detlev Heinemann, and Hans Georg Beyer. 2009.
Irradiance forecasting for the power prediction of grid-connected photovoltaic
systems. IEEE Journal of selected topics in applied earth observations and remote
sensing (2009).
[21]
Ricardo Marquez and Carlos FM Coimbra. 2013. Intra-hour DNI forecasting
based on cloud tracking image analysis. Solar Energy (2013).
[22]
Ababacar Ndiaye, Cheikh MF Kébé, Pape A Ndiaye, Abdérafi Charki, Abdessamad
Kobi, and Vincent Sambou. 2013. A novel method for investigating
photovoltaic module degradation. Energy Procedia (2013).
[23] Judea Pearl. 2009. Causality. Cambridge university press.
[24]
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel,
M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos,
D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn:
Machine Learning in Python. Journal of Machine Learning Research 12 (2011).
[25]
M Sabbaghpur Arani and MA Hejazi. 2016. The comprehensive study of electrical
faults in PV arrays. Journal of Electrical and Computer Engineering 2016 (2016).
[26]
Bernhard Schölkopf, David Hogg, Dun Wang, Dan Foreman-Mackey, Dominik
Janzing, Carl-Johann Simon-Gabriel, and Jonas Peters. 2015. Removing systematic
errors for exoplanet search via latent causes. In International Conference on
Machine Learning.
[27]
Bernhard Schölkopf, David W Hogg, Dun Wang, Daniel Foreman-Mackey, Dominik
Janzing, Carl-Johann Simon-Gabriel, and Jonas Peters. 2016. Modeling
confounding by half-sibling regression. Proceedings of the National Academy of
Sciences (2016).
[28]
Navin Sharma, Pranshu Sharma, David Irwin, and Prashant Shenoy. 2011. Pre-
dicting solar generation from weather forecasts using machine learning. In Smart
Grid Communications (SmartGridComm), 2011 IEEE International Conference
on. IEEE, 528–533.
[29]
S Stettler, P Toggweiler, E Wiemken, W Heydenreich, AC de Keizer, WGJHM
van Sark, S Feige, M Schneider, G Heilscher, E Lorenz, and others. 2005. Failure
detection routine for grid-connected PV systems as part of the PVSAT-2 project.
In Proceedings of the 20th European Photovoltaic Solar Energy Conference &
Exhibition, Barcelona, Spain. 2490–2493.
[30]
Ali Tahri, Takashi Oozeki, and Azzedine Draou. 2013. Monitoring and evaluation
of photovoltaic system. Energy Procedia (2013), 456–464.
[31]
Patrick Traxler. 2013. Fault detection of large amounts of photovoltaic systems.
In Proceedings of the ECML/PKDD 2013 Workshop on Data Analytics for Re-
newable Energy Integration.
[32]
Achim Woyte, Mauricio Richter, David Moser, Stefan Mau, Nils Reich, and Ulrike
Jahn. 2013. Monitoring of photovoltaic systems: good practices and systematic
analysis. In Proc. 28th European Photovoltaic Solar Energy Conference.
... More recently, this approach was used in a system called SolarClique [17] to predict the output of an entire array using nearby solar arrays. We draw inspiration from the halfsibling regression paper [28] and SolarClique [17] for Sun-Down's anomaly detection, but point out important differences between the SolarClique method and our approach as shown in table 1. ...
... More recently, this approach was used in a system called SolarClique [17] to predict the output of an entire array using nearby solar arrays. We draw inspiration from the halfsibling regression paper [28] and SolarClique [17] for Sun-Down's anomaly detection, but point out important differences between the SolarClique method and our approach as shown in table 1. First, SolarClique is designed for systemlevel predictions (predicting the total generation of an entire array) and does not have the capability of making fine-grain per-panel predictions, which is the focus of our method. ...
... Next, sinceL contains effects of transient factors such as shade on panels as well anomalies, we must remove the impact of transient factors to obtain the "true" anomalies. We can use time series decomposition to extract the seasonal component that represents shading effect that occur daily at set time periods and remove it fromL [17]. The remainder ofL represents production loss at that panel due to any anomalies. ...
Preprint
There has been significant growth in both utility-scale and residential-scale solar installations in recent years, driven by rapid technology improvements and falling prices. Unlike utility-scale solar farms that are professionally managed and maintained, smaller residential-scale installations often lack sensing and instrumentation for performance monitoring and fault detection. As a result, faults may go undetected for long periods of time, resulting in generation and revenue losses for the homeowner. In this paper, we present SunDown, a sensorless approach designed to detect per-panel faults in residential solar arrays. SunDown does not require any new sensors for its fault detection and instead uses a model-driven approach that leverages correlations between the power produced by adjacent panels to detect deviations from expected behavior. SunDown can handle concurrent faults in multiple panels and perform anomaly classification to determine probable causes. Using two years of solar generation data from a real home and a manually generated dataset of multiple solar faults, we show that our approach has a MAPE of 2.98\% when predicting per-panel output. Our results also show that SunDown is able to detect and classify faults, including from snow cover, leaves and debris, and electrical failures with 99.13% accuracy, and can detect multiple concurrent faults with 97.2% accuracy.
... SolarClique, a data-driven method, is considered by [26] to detect anomalies in the power generation of a solar establishment. The method does not need any sensor apparatus for fault/anomaly detection. ...
Article
Full-text available
The rapid industrial growth in solar energy is gaining increasing interest in renewable power from smart grids and plants. Anomaly detection in photovoltaic (PV) systems is a demanding task. In this sense, it is vital to utilize the latest updates in machine learning technology to accurately and timely disclose different system anomalies. This paper addresses this issue by evaluating the performance of different machine learning schemes and applying them to detect anomalies on photovoltaic components. The following schemes are evaluated: AutoEncoder Long Short-Term Memory (AE-LSTM), Facebook-Prophet, and Isolation Forest. These models can identify the PV system’s healthy and abnormal actual behaviors. Our results provide clear insights to make an informed decision, especially with experimental trade-offs for such a complex solution space.
... To validate the proposed clustering-based method compared with previous approaches, SolarClique [50] was implemented, and the obtained results were evaluated. To give a brief information of SolarClique, it detects anomalies on the basis of prediction by employing solar generation data from geographically nearby sites. ...
Article
Full-text available
This work proposes a fault detection and imputation scheme for a fleet of small-scale photovoltaic (PV) systems, where the captured data includes unlabeled faults. On-site meteorological information, such as solar irradiance, is helpful for monitoring PV systems. However, collecting this type of weather data at every station is not feasible for a fleet owing to the limitation of installation costs. In this study, to monitor a PV fleet efficiently, neighboring PV generation profiles were utilized for fault detection and imputation, as well as solar irradiance. For fault detection from unlabeled raw PV data, K-means clustering was employed to detect abnormal patterns based on customized input features, which were extracted from the fleet PVs and weather data. When a profile was determined to have an abnormal pattern, imputation for the corresponding data was implemented using the subset of neighboring PV data clustered as normal. For evaluation, the effectiveness of neighboring PV information was investigated using the actual rooftop PV power generation data measured at several locations in the Gwangju Institute of Science and Technology (GIST) campus. The results indicate that neighboring PV profiles improve the fault detection capability and the imputation accuracy. For fault detection, clustering-based schemes provided error rates of 0.0126 and 0.0223, respectively, with and without neighboring PV data, whereas the conventional prediction-based approach showed an error rate of 0.0753. For imputation, estimation accuracy was significantly improved by leveraging the labels of fault detection in the proposed scheme, as much as 18.32% reduction in normalized root mean square error (NRMSE) compared with the conventional scheme without fault consideration.
... Many countries have set ambitious goals for the percentage of renewable penetration in their overall energy mix and solar installations continue to play a dominant role in ongoing deployments. Solar deployments vary in size, ranging from large solar farms that are deployed by utilities to small-scale installations by individuals [14]. More than half of the installed solar capacity continue to come from small-scale solar deployments, i.e., arrays with 10kW of capacity of less [18]. ...
Conference Paper
Full-text available
Rooftop solar deployments are an excellent source for generating clean energy. As a result, their popularity among homeowners has grown significantly over the years. Unfortunately, estimating the solar potential of a roof requires homeowners to consult solar consultants, who manually evaluate the site. Recently there have been efforts to automatically estimate the solar potential for any roof within a city. However, current methods work only for places where LIDAR data is available, thereby limiting their reach to just a few places in the world. In this paper, we propose DeepRoof, a data-driven approach that uses widely available satellite images to assess the solar potential of a roof. Using satellite images, DeepRoof determines the roof's geometry and leverages publicly available real-estate and solar irradiance data to provide a pixel-level estimate of the solar potential for each planar roof segment. Such estimates can be used to identify ideal locations on the roof for installing solar panels. Further, we evaluate our approach on an annotated roof dataset, validate the results with solar experts and compare it to a LIDAR-based approach. Our results show that DeepRoof can accurately extract the roof geometry such as the planar roof segments and their orientation, achieving a true positive rate of 91.1% in identifying roofs and a low mean orientation error of 9.3 degree. We also show that DeepRoof's median estimate of the available solar installation area is within 11% of a LIDAR-based approach.
... Prior work has attempted to maximize electricity production by controlling movable solar panels [84,85] or wind turbine blades [86] using RL or Bayesian optimization. Other work has used graphical models to detect faults in rooftop solar panels [87], genetic algorithms to optimally place wind turbines within a wind farm [88], and multi-objective optimization to place hydropower dams in a way that satisfies both energy and ecological objectives [89]. ...
Preprint
Full-text available
Climate change is one of the greatest challenges facing humanity, and we, as machine learning experts, may wonder how we can help. Here we describe how machine learning can be a powerful tool in reducing greenhouse gas emissions and helping society adapt to a changing climate. From smart grids to disaster management, we identify high impact problems where existing gaps can be filled by machine learning, in collaboration with other fields. Our recommendations encompass exciting research questions as well as promising business opportunities. We call on the machine learning community to join the global effort against climate change.
... As to the proactive characteristics of predictive maintenance systems, it is possible to minimize down times as maintenance can be applied ahead of failure of the affected system component(s). Anomalies of system components are detected in such systems, for example, to recognize a failing solar cell in a batch of identically constructed cells in a neighboring environment [12]. Sensed data in predictive maintenance systems includes operating values or their average (instead of detailed time series data). ...
Article
Homeowners are increasingly deploying rooftop solar photovoltaic (PV) arrays due to the rapid decline in solar module prices. However, homeowners may have to spend up to ∼$375 to diagnose their damaged rooftop solar PV system. Thus, there is rising interest in inspecting potential damage on solar PV arrays automatically and passively. Unfortunately, recent approaches that leverage machine learning techniques have difficulty distinguishing solar PV array damage from other causes of solar degradation (e.g., shading, dust, snow). To address this problem, we design a new system, SolarDiagnostics, that can automatically detect and profile damage on rooftop solar PV arrays using rooftop images at a lower cost. In essence, SolarDiagnostics first leverages a K-Means algorithm to isolate rooftop objects and extract the contours in which solar panels reside. Then, SolarDiagnostics employs a Convolutional Neural Network to accurately identify and characterize the damage within each such contour. We evaluate SolarDiagnostics by building a lower-cost prototype and using 60,000 damaged solar PV array images generated by Deep Convolutional Generative Adversarial Networks. We find that SolarDiagnostics is able to detect damaged solar PV arrays with a Matthews Correlation Coefficient (MCC) of 1.0. In addition, pre-trained SolarDiagnostics yields an MCC of 0.95, which is significantly better than other re-trained machine learning-based approaches and similar to the MCC of re-trained SolarDiagnostics. We make the source code and datasets that we use to build and evaluate SolarDiagnostics publicly available.
Article
In this paper, a method for real-time monitoring and fault diagnosis in photovoltaic systems is proposed. The approach compares the performance of a faulty photovoltaic module with that of its accurate model and quantifies the specific differential residue associated with the fault. The electrical signature of each fault is determined from the deformations it induces on the I-V curves. Faults such as interconnection resistance faults and different shading patterns are considered, and the technique can be generalized and extended to more types of faults. Faults are diagnosed by fixing a normal threshold and a fault threshold for each fault. These thresholds are calculated from the Euclidean norm between ideal and normal measurements or between ideal and fault-mode measurements. Each threshold is set in a range bounded by the minimum and maximum values of the differential residue obtained for the considered fault. The proposed approach thus identifies faults by calculating their specific threshold ranges, and it allows instantaneous monitoring of the electrical power delivered by the photovoltaic system.
Article
The rapid growth of the solar industry over the past several years has expanded the significance of photovoltaic (PV) systems. Fault analysis in solar PV arrays is a fundamental task for increasing reliability, efficiency, and safety in PV systems; undetected faults may not only reduce power generation and accelerate system aging but also threaten the availability of the whole system. Due to the current-limiting nature and nonlinear output characteristics of PV arrays, faults in PV arrays may go undetected. In this paper, all possible faults that happen in the PV system have been classified, and six common faults (shading condition, open-circuit fault, degradation fault, line-to-line fault, bypass diode fault, and bridging fault) have been implemented in a 7.5 kW PV farm. Based on the simulation results, both normal operational curves and fault curves have been compared.
Article
The popularity of rooftop solar for homes is rapidly growing. However, accurately forecasting solar generation is critical to fully exploiting the benefits of locally generated solar energy. In this article, we present two machine-learning techniques to predict solar power from publicly available weather forecasts. We use these techniques to develop SolarCast, a cloud-based web service that automatically generates models that provide customized site-specific predictions of solar generation. SolarCast utilizes a “black box” approach that requires only (1) a site’s geographic location and (2) a minimal amount of historical generation data. Since we intend SolarCast for small rooftop deployments, it does not require detailed site- and panel-specific information, which owners may not know, but instead automatically learns these parameters for each site. We evaluate the accuracy of SolarCast’s different algorithms on two publicly available datasets, each containing over 100 rooftop deployments with a variety of attributes (e.g., climate, tilt, orientation). We show that SolarCast learns a more accurate model using much less data (∼1 month) than prior SVM-based approaches, which require ∼3 months of data. SolarCast also provides a programmatic API, enabling developers to integrate its predictions into energy efficiency applications. Finally, we present two case studies of using SolarCast to demonstrate how real-world applications can leverage its predictions. We first evaluate a “sunny” load scheduler, which schedules a dryer’s energy usage to maximally align with a home’s solar generation. We then evaluate a smart solar-powered charging station, which can optimally charge the maximum number of electric vehicles (EVs) on a given day. Our results indicate that a representative home is capable of reducing its grid demand up to 40% by providing a modest amount of flexibility (of ∼5 hours) in the dryer’s start time with opportunistic load scheduling. Further, our charging station uses SolarCast to provide EV owners the amount of energy they can expect to receive from solar energy sources.
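The "sunny" load scheduling idea in the abstract above can be illustrated with a small sketch. This is not SolarCast's scheduler: the function name `best_start`, the hourly forecast values, and the 3 kW dryer load are all hypothetical, and the rule (pick the start hour that maximizes the energy covered by forecast solar) is only one plausible reading of "maximally align".

```python
def best_start(solar_kw, load_kw, duration, earliest, latest):
    """Return the start hour in [earliest, latest] that maximizes the
    load energy covered by forecast solar generation (hypothetical rule)."""
    def covered(start):
        # Energy (kWh) of the load met by solar over the run, hour by hour.
        return sum(min(solar_kw[h], load_kw) for h in range(start, start + duration))
    return max(range(earliest, latest + 1), key=covered)

# Hypothetical hourly solar forecast (kW) for one day, index = hour of day.
forecast = [0] * 6 + [0.5, 1, 2, 3, 3.5, 4, 4, 3.5, 3, 2, 1, 0.5] + [0] * 6

# A 3 kW dryer that runs for 2 hours and may start between 08:00 and 13:00.
start_hour = best_start(forecast, load_kw=3.0, duration=2, earliest=8, latest=13)
print(start_hour)
```

Ties are broken in favor of the earliest start hour, since `max` returns the first maximal element it encounters.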
Article
In this work, we present a new algorithm for detecting faults in grid-connected photovoltaic (GCPV) plants. There are few instances of statistical tools being deployed in the analysis of PV measured data. The main focus of this paper is, therefore, to outline a parallel fault detection algorithm that can diagnose faults on the DC side and AC side of the examined GCPV system based on the t-test statistical analysis method. For a given set of operational conditions (solar irradiance and module temperature), a number of attributes such as the voltage and power ratio of the PV strings are measured using virtual instrumentation (VI) LabVIEW software. The results obtained indicate that the parallel fault detection algorithm can accurately detect and locate different types of faults, such as a faulty PV module, faulty PV string, faulty bypass diode, faulty maximum power point tracking (MPPT) unit, and faulty DC/AC inverter unit. The parallel fault detection algorithm has been validated using experimental climate data, with electrical parameters based on 1.98 and 0.52 kWp PV systems installed at the University of Huddersfield, United Kingdom.
Article
Faults in photovoltaic (PV) systems, which can result in energy loss, system shutdown or even serious safety breaches, are often difficult to avoid. Fault detection in such systems is imperative to improve their reliability, productivity, safety and efficiency. Here, an innovative model-based fault-detection approach for early detection of shading of PV modules and faults on the direct current (DC) side of PV systems is proposed. This approach combines the flexibility and simplicity of a one-diode model with the extended capacity of an exponentially weighted moving average (EWMA) control chart to detect incipient changes in a PV system. The one-diode model, which is easily calibrated due to its limited calibration parameters, is used to predict the healthy PV array’s maximum power coordinates of current, voltage and power using measured temperatures and irradiances. Residuals, which capture the difference between the measurements and the predictions of the one-diode model, are generated and used as fault indicators. Then, the EWMA monitoring chart is applied to the uncorrelated residuals obtained from the one-diode model to detect and identify the type of fault. Actual data from the grid-connected PV system installed at the Renewable Energy Development Center, Algeria, are used to assess the performance of the proposed approach. Results show that the proposed approach successfully monitors the DC side of PV systems and detects temporary shading.
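As a rough illustration of the EWMA monitoring step described above, the sketch below applies a textbook EWMA control chart to a residual series. It is not the authors' implementation: the one-diode model is not reproduced, the residuals are synthetic, and the function name `ewma_alarms`, the smoothing factor 0.2, the 3-sigma width, and the 50-sample fault-free reference window are all assumptions chosen for the example.

```python
import numpy as np

def ewma_alarms(residuals, lam=0.2, width=3.0, n_ref=50):
    """Textbook EWMA chart: z_t = lam * r_t + (1 - lam) * z_{t-1}.
    The mean and standard deviation are estimated from the first n_ref
    samples, which are assumed to be fault-free."""
    r = np.asarray(residuals, dtype=float)
    mu = r[:n_ref].mean()
    sigma = r[:n_ref].std(ddof=1)
    z = np.empty_like(r)
    prev = mu  # start the chart at the in-control mean
    for t, x in enumerate(r):
        prev = lam * x + (1.0 - lam) * prev
        z[t] = prev
    # Asymptotic (steady-state) control limit of the EWMA statistic.
    limit = width * sigma * np.sqrt(lam / (2.0 - lam))
    return z, np.abs(z - mu) > limit

# Synthetic residuals (measured minus predicted power): noise, then a drop
# from sample 100 onward that stands in for a shading event or DC-side fault.
rng = np.random.default_rng(1)
residuals = rng.normal(0.0, 0.1, size=150)
residuals[100:] -= 1.0
z, alarms = ewma_alarms(residuals)
print("first alarm at sample", int(np.argmax(alarms)))
```

Because the EWMA statistic averages over recent samples, it reacts to small persistent shifts that a per-sample threshold on the raw residual could miss, which is the usual motivation for this kind of chart.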
Conference Paper
Faults in photovoltaic systems often result in an energy drop and therefore decrease the efficiency of the system. Detecting and analyzing faults is thus an important problem in the analysis of photovoltaic systems data. We consider the problem of estimating the start time and end time of a fault, i.e., we want to locate the fault in time series data. We assume that the power output, the plane-of-array irradiance, and optionally the module temperature are known. We demonstrate how to use our fault location algorithm to classify shading events. We present results on real data with simulated and real faults.
Article
We describe a method for removing the effect of confounders to reconstruct a latent quantity of interest. The method, referred to as "half-sibling regression," is inspired by recent work in causal inference using additive noise models. We provide a theoretical justification, discussing both independent and identically distributed as well as time series data, respectively, and illustrate the potential of the method in a challenging astronomy application.
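The core idea of half-sibling regression can be sketched as follows: if an observation Y equals a latent signal Q plus the effect of a confounder N, and other observations X are affected by N but independent of Q, then Q can be estimated (up to a constant) as Y − E[Y | X]. The example below is a minimal sketch under those assumptions; it uses ordinary least squares for E[Y | X], and the synthetic data, coefficients, and function name `half_sibling_residual` are illustrative rather than taken from the paper.

```python
import numpy as np

def half_sibling_residual(y, X):
    """Estimate the latent signal as y - E[y | X], where E[y | X] is
    approximated by a linear least-squares fit with an intercept."""
    X1 = np.column_stack([X, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return y - X1 @ coef

rng = np.random.default_rng(0)
n = 500
confounder = rng.normal(size=n)      # shared noise N (e.g. a common systematic effect)
q = np.zeros(n)
q[200:260] = -1.0                    # latent signal Q, independent of N
# "Half-siblings": observations driven by N but not by Q.
X = np.column_stack([a * confounder + 0.05 * rng.normal(size=n)
                     for a in (0.8, 1.1, 1.3)])
y = q + 2.0 * confounder             # observed target: Q plus the effect of N

q_hat = half_sibling_residual(y, X)
print(np.corrcoef(q_hat, q)[0, 1])
```

The residual recovers Q only up to an additive constant, because the intercept in the regression absorbs the mean of Q.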
Article
Solar panels have been improving in efficiency and dropping in price, and are therefore becoming more common and economically viable. However, the performance of solar panels depends not only on the weather, but also on other external factors such as shadow, dirt, dust, etc. In this paper, we describe a simple and practical data-driven method for classifying anomalies in the power output of solar panels. In particular, we propose and experimentally verify (using two solar panel arrays in Ontario, Canada) a simple classification rule based on physical properties of solar radiation that can distinguish between shadows and direct covering of the panel, e.g., by dirt or snow.
Article
Hot spotting is a problem in photovoltaic (PV) systems that reduces panel power performance and accelerates cell degradation. In present-day systems, bypass diodes are used to mitigate hot spotting, but they do not prevent hot spotting or the damage it causes. This paper presents an active hot-spot detection method to detect hot spotting within a series of PV cells, using ac parameter characterization. A PV cell is comprised of series and parallel resistances and parallel capacitance, which are affected by voltage bias, illumination, and temperature. Experimental results have shown that when a PV string is under maximum power point tracking control, hot spotting in a single cell results in a capacitance increase and a dc impedance increase. The capacitance change is detectable by measuring the ac impedance magnitude in the 10–70 kHz frequency range. An impedance value change due to hot spotting can be detected by monitoring one high-frequency measurement in the capacitive region and one low-frequency measurement in the dc impedance region. Alternatively, the dc impedance can also be calculated using dc operating point measurements. The proposed hot-spot detection method can be integrated into a dc–dc power converter that operates at the panel or subpanel level.
Article
We describe a method for removing the effect of confounders in order to reconstruct a latent quantity of interest. The method, referred to as half-sibling regression, is inspired by recent work in causal inference using additive noise models. We provide a theoretical justification and illustrate the potential of the method in a challenging astronomy application.