
SolarClique: Detecting Anomalies in Residential Solar Arrays

Srinivasan Iyengar, Stephen Lee, Daniel Sheldon, Prashant Shenoy

University of Massachusetts Amherst

ABSTRACT

Solar deployments have proliferated significantly over the years. Analyzing these deployments can lead to the timely detection of anomalies in power generation, which can maximize the benefits from solar energy. In this paper, we propose SolarClique, a data-driven approach that can flag anomalies in power generation with high accuracy. Unlike prior approaches, our work neither depends on expensive instrumentation nor requires external inputs such as weather data. Rather, our approach exploits correlations in solar power generation from geographically nearby sites to predict the expected output of a site and flag anomalies. We evaluate our approach on 88 solar installations located in Austin, Texas. We show that our algorithm can work even with data from few geographically nearby sites (>5 sites) to produce results with high accuracy. Thus, our approach can scale to sparsely populated regions, where there are few solar installations. Further, among the 88 installations, our approach reported 76 sites with anomalies in power generation. Moreover, our approach is robust enough to distinguish between reductions in power output due to anomalies and those due to other factors such as cloudy conditions.

CCS CONCEPTS

• Computing methodologies → Anomaly detection; Supervised learning by regression; • Social and professional topics → Sustainability;

KEYWORDS

anomaly detection; renewables; solar energy; computational sustainability

ACM Reference format:

Srinivasan Iyengar, Stephen Lee, Daniel Sheldon, Prashant Shenoy. 2018. SolarClique: Detecting Anomalies in Residential Solar Arrays. In Proceedings of ACM SIGCAS Conference on Computing and Sustainable Societies (COMPASS), Menlo Park and San Jose, CA, USA, June 20–22, 2018 (COMPASS '18), 10 pages.
DOI: 10.1145/3209811.3209860

1 INTRODUCTION

Technological advances and economies of scale have significantly reduced costs and made solar energy a viable renewable alternative. From 2010 to 2017, the average system cost of solar dropped

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
COMPASS '18, Menlo Park and San Jose, CA, USA
© 2018 Copyright held by the owner/author(s). Publication rights licensed to ACM. 978-1-4503-5816-3/18/06...$15.00
DOI: 10.1145/3209811.3209860

from $7.24 per watt to $2.8 per watt, a reduction of approximately 61% [13]. At the same time, the average energy cost of producing solar is 12.2¢ per kilowatt-hour and is approaching the retail electricity price of 12¢ per kilowatt-hour [1]. The declining costs have spurred the adoption of solar among both utilities and residential owners.

Recent studies have shown that the total capacity of small-scale residential solar installations in the US reached 7.2 GW in 2016 [3]. Unlike large solar farms, residential installations are not monitored by professional operators on an ongoing basis. Consequently, anomalies or faults that reduce the power output of residential solar arrays may go undetected for long periods of time, significantly reducing their economic benefits. Large solar farms have extensive sensor instrumentation to monitor the array continuously, which enables faults or anomalous output to be detected. In contrast, residential installations have little or no sensor instrumentation beyond displaying the total power of the array, making sensor-based monitoring and anomaly detection infeasible in such contexts. Adding such instrumentation increases installation costs and is not economically feasible in most cases.

In this paper, we seek to develop a data-driven approach for detecting anomalous output in small-scale residential installations without requiring any sensor information for fault detection. Our key insight is that the solar output from nearby installations is correlated, and thus these correlations can be used to identify anomalous deviations at a specific installation. We note that the solar output from multiple sites within a city or region is available. For instance, Enphase, an energy company, provides access to power generation information from more than 700,000 sites across different locations¹. Similarly, other sites² share their energy generation data from tens of thousands of solar panels through web APIs. Thus, the availability of such datasets makes our approach feasible. However, the primary challenge in designing such an application is to handle the intrinsic variability of solar power and site-specific idiosyncrasies.

Several factors affect the output of a solar panel, such as weather conditions, dust, snow cover, shade from nearby trees or structures, temperature, etc. We refer to these as transient factors, since they temporarily reduce the output of the solar array. For instance, a passing cloud may briefly decrease the power output of the panel but doesn't reduce the solar output permanently. Similarly, shade from nearby buildings or trees can be considered a transient factor, as it reduces the output temporarily and may occur only at certain periods of the day.

Interestingly, some transient factors, such as overcast conditions, impact the output of all arrays in a geographical neighborhood, while other factors, such as shade from a nearby tree, impact the output of only a portion of a single array. In addition, factors such as malfunctioning solar modules or electrical faults also reduce the output of a solar array; we refer to these as anomalies, since human intervention is needed to correct the problem. Prior studies have shown that such factors can significantly reduce the power output, by as much as 40% [6, 11, 14]. In our work, we need to distinguish between output fluctuations caused by transient factors and those caused by anomalous factors. Further, site-specific idiosyncrasies (such as shade and the tilt/orientation of panels) need to be considered when exploiting the correlation between solar arrays in a region.

¹Enphase: https://enphase.com/
²PVSense: http://pvsense.com/

Naive approaches, such as labeling a solar installation as anomalous whenever its power output remains "low" for an extended period, do not work well. Since drops in power output may be caused by cloudy conditions, depending on the weather, the solar output may remain low for days. Labeling such instances as anomalies may result in many false positives. The challenge lies in differentiating drops in power output due to transient factors (i.e., factors that impact power output temporarily) from those due to anomalies (i.e., factors that may require human intervention). We propose SolarClique, a new approach for detecting solar anomalies using geographically nearby sites. In designing, implementing, and evaluating our approach, we make the following contributions:

• We demonstrate how geographically nearby sites can be used to detect anomalies in a residential solar installation. In our algorithm, we present techniques to account for and remove transient seasonal factors such as shade from nearby trees and structures.

• Our approach doesn't require any sensor instrumentation for fault/anomaly detection. Rather, it only requires the production output of the array and those of nearby arrays. Since the power output of geographically nearby sites is readily available, our approach can be easily applied to millions of residential installations that are unmonitored today, with little added expense.

• We implement and evaluate the performance of our algorithm. Our results show that the power output of a candidate site can be predicted using geographically nearby sites. Moreover, we can achieve high accuracy even when few geographically nearby sites (>5 sites) are available. This indicates that our approach can be used in sparsely populated locations, where there are few solar installations.

• We ran our anomaly detection algorithm on solar installations located in Austin, Texas. SolarClique reported power generation anomalies in 76 sites, many of which saw reduced solar output for weeks or months. Moreover, our approach can identify different types of anomalies: (i) no production, (ii) underproduction, and (iii) gradual degradation of solar output over time, thereby exhibiting the real-world utility of our approach.

2 BACKGROUND

Our work focuses on detecting anomalous solar power generation in a residential solar installation using information from geographically nearby sites. Unlike power generation from traditional mechanical generators (e.g., diesel generators), where the power output is constant and controllable, the instantaneous power output from a PV system is inherently intermittent and uncontrollable. The solar power output may see sudden changes, from energy generation at peak capacity at one moment to reduced (or zero) output in the next period (see Figure 1). The change in power output can be attributed to a number of factors, and our goal is to determine whether a drop in power can be attributed to anomalous behavior in the solar installation.

Figure 1: Power generation from three geographically nearby solar sites. As shown, the power output is intermittent and correlated for solar arrays within a geographical neighborhood.

2.1 Factors affecting solar output

A primary factor that influences the power generation of a solar panel is solar irradiance, i.e., the amount of sunlight incident on the panel. The amount of sunlight a solar panel receives depends on many factors, such as the time of day and year, dust, temperature, cloud cover, shade from nearby buildings or structures, and the tilt and orientation of the panel. These factors determine the amount of power generated based on how much light is incident on the solar modules.

However, a number of other factors, related to hardware, can also reduce the power output of a solar panel. For instance, the power output may drop due to defective solar modules, charge controllers, inverters, PV strings, wired connections, and so on. Clearly, many factors can cause problems in power generation. Thus, factors affecting output can be broadly classified into two categories: (i) transient factors, which have a temporary effect on the power output (such as cloud cover); and (ii) anomalies, which have a more prolonged impact on the power output (e.g., a solar module defect).

The transient factors can further be classified into common and local factors. Common factors, such as weather, affect the power output of all the solar panels in a given region. Moreover, their effect is temporary, as the output changes with a change in weather conditions. For instance, overcast weather conditions temporarily reduce the output of all panels in a given region. Local factors, such as shade from nearby foliage or buildings, are usually site-specific and do not affect the power output of other sites. These local factors may be recurring and reduce the power output at fixed periods of the day. In contrast, anomalous factors, such as bird droppings or system malfunctions, reduce power output for prolonged periods and usually require corrective action to restore normal operation of the site. Note that both transient and anomalous factors may reduce the power output of a solar array. Thus, a key challenge in designing a solar anomaly detection algorithm is to differentiate between reductions in power output due to transient factors and those due to anomalies.

2.2 Anomaly detection in solar installations

Prior approaches have focused on using exogenous factors to predict future power generation [8, 20, 28]. A simple approach is to use such prediction models and report an anomaly if the power generated is below the predicted value for an extended period. However, it is known that external factors such as cloud cover are inadequate to accurately predict the power output of solar installations [17]. Thus, prediction models may over- or under-predict power generation, and such an approach may not be sufficient for detecting anomalies.

Prediction models can be improved using additional sensors, but these can be prohibitively expensive for residential setups [21]. For instance, drone-mounted cameras can detect occlusions in a panel but are expensive and require elaborate setup. Other studies use an ideal model of the solar arrays to detect faults [5, 12]. These studies rely on various site-specific parameters and assume that the standard test condition (STC) values of the panels are known. However, site-specific parameters are often not available. Thus, most large solar farms usually depend on professional operators to continuously monitor and maintain their setup to detect faults early³,⁴. Clearly, such elaborate setups may not be economically feasible for a residential solar installation. Below, we present our work, which focuses on a data-driven and cost-effective approach for detecting anomalies in a solar installation.

3 ALGORITHM DESIGN

We first introduce the intuition behind our approach to detecting anomalous power generation in a solar installation. Our primary insight is that other geographically nearby sites can predict the solar output potential of a given site, which can then reveal issues at that site. Since factors such as the amount of solar irradiance (e.g., due to cloudy conditions) are similar within a region, the power output of solar arrays in a geographical neighborhood is usually correlated. This can be seen in the power output from three different solar installation sites in the same geographical neighborhood (see Figure 1). As seen, the solar arrays tend to follow a similar power generation pattern. So we can use the output of a group of sites to predict the output of a specific site and flag anomalies if the actual output deviates significantly from the prediction.

We hypothesize that predicting the output using geographically nearby sites can "remove" the effects of confounding factors (i.e., common factors). By accounting for confounding factors, the remaining influence on power generation can be attributed to local factors at the solar installation. The local factors may include both transient local factors and anomalies. Thus, any irregularity in power generation, after accounting for confounding and transient local factors, must be due to anomalies in the installation. For example, cloudy or overcast conditions in a given location have a similar impact on all solar panels and will reduce the power output of all sites. However, a site with a malfunctioning solar module (a local event) will observe a larger drop in power output than the others. If the drop in power due to cloudy conditions (a confounding factor) along with transient local factors is accounted for, any further drop in power can be attributed to anomalies. Our approach follows this intuition to detect anomalies in a solar installation.

³ESA Renewables: http://esarenewables.com/
⁴Affinity Energy: https://www.affinityenergy.com/

Figure 2: Graphical model representation of our setup. The power outputs X (nearby sites) and Y (candidate site) are observed; the confounders C and local factors L are unobserved.

The rest of the section is organized as follows. We present a graphical model representation of our setup that models the confounding variables. Next, we discuss how our algorithm removes the confounding factors and detects anomalies in solar generation.

3.1 Graphical model representation

Our work is inspired by a study in astronomy, wherein the Half-Sibling Regression technique was used to remove the effects of confounding variables (i.e., noise from measuring instruments) from observations to find exoplanets [27]. We follow a similar approach to model and detect anomalies in a solar installation.

Let C, L, X, and Y be the random variables (RVs) in our problem. Here, Y refers to the power generated by a candidate solar installation site. X represents the power produced by each of the geographically nearby solar installations (represented as a vector). While C represents the confounding variables that affect both X and Y, the variable L represents site-specific local factors affecting the candidate site. These local factors include both transient factors and anomalies that affect the candidate site. In our setup, both X and Y are observed variables (as the power generation of a site can be easily measured), whereas C and L are latent unobserved variables. Figure 2 depicts a directed graphical model (DAG) that illustrates the relationships between these observed and unobserved random variables.

We are interested in the random variable L, which represents anomalies at a given site. As seen in the figure, since both L and C affect the observed variable Y, without knowledge of C it is difficult to calculate the influence of L on Y. Clearly, X is independent of L, as the variable L impacts only Y. However, we note that C impacts X, and when we condition on Y, which is a collider, the variables X and L become dependent [23]. This implies that X contains information about L, and we can recover L from X.

To reconstruct the quantity L, we impose certain assumptions on the type of relationship between Y and C. Specifically, we assume that Y can be represented as an additive model denoted as follows:

Y = L + f(C)    (1)

where f is a nonlinear function and its input C is unobserved. Since L and X are independent, the variable X cannot account for the influence of L on Y. However, X can be used to approximate f(C), as C also affects X. If X exactly approximates f(C), then f(C) = E[f(C)|X], and we can show that L can be recovered completely using (1). Even if X does not exactly approximate f(C), in our case X is sufficiently large to provide a good approximation of E[f(C)|X] up to an offset. A more detailed description of the approach is given in [27]. Thus, using X to predict Y (i.e., E[Y|X]), f(C) can be approximated and removed from (1) to estimate L̂ as follows:

L̂ ≔ Y − E[Y|X]    (2)

where L̂ is an estimate of the local factors that may include both transient local factors and anomalies.

Figure 3: An overview of the key steps in the SolarClique algorithm. Solar data from the candidate and colocated sites passes through Half-Sibling Regression, seasonality removal (parameterized by a moving-window length), and anomaly detection (parameterized by an anomaly threshold).
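To make the recovery argument concrete, the following sketch simulates the additive model on synthetic data. This is an illustration only: the confounder f(C), the per-site gains, and the anomaly magnitude are invented, and ordinary least squares stands in for the regression techniques evaluated later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup mirroring the graphical model: an unobserved confounder C
# (e.g., regional irradiance) drives both the candidate site Y and five
# nearby sites X, while a local factor L (a fault-induced drop) affects
# only Y.
T = 500
C = rng.uniform(0.0, 1.0, size=T)
f_C = np.sin(np.pi * C)                        # nonlinear f(C)
gains = (0.9, 1.1, 1.0, 0.8, 1.2)              # hypothetical per-site scaling
X = np.column_stack([g * f_C + rng.normal(0, 0.01, T) for g in gains])
L = np.where(np.arange(T) >= 400, -0.3, 0.0)   # local anomaly after t=400
Y = L + f_C                                    # additive model (1)

# Estimate E[Y|X] by regressing Y on X over an anomaly-free training window
# (the paper trains on the first three months), then recover L as in (2).
A = np.column_stack([X, np.ones(T)])
coef, *_ = np.linalg.lstsq(A[:300], Y[:300], rcond=None)
L_hat = Y - A @ coef                           # L_hat = Y - E[Y|X]

print(round(L_hat[:400].mean(), 1))            # ~0.0: no local effect
print(round(L_hat[400:].mean(), 1))            # ~-0.3: recovered drop
```

Even though X never observes L directly, the residual isolates the local drop because the nearby sites jointly track f(C).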

3.2 SolarClique Algorithm

We now present our anomaly detection algorithm, called SolarClique. Figure 3 depicts an overview of the different steps involved in the SolarClique algorithm. First, we use the Half-Sibling Regression approach to build a regression model that predicts the solar generation of a candidate site using the power output from geographically nearby sites. Next, we remove any seasonal component from the above regression model using time series decomposition. Finally, we detect anomalies by analyzing the deviation in the power output. Below, we describe these three steps in detail.

3.2.1 Step 1: Remove confounding effects. The first step is to build a regression model that predicts the power generation output Y of a candidate site using X, a vector of power generation values from geographically nearby solar installations. As mentioned earlier, the regression model estimates the E[Y|X] component in the additive model shown in (2). Since Y is observed, subtracting the E[Y|X] component yields the L component.

Standard regression techniques can be used to build this regression model. The regression technique learns an estimator that best fits the training data. Instead of constructing a single regression model, we use bootstrapping, a technique that draws subsamples with replacement from the training data, which gives multiple regression models and the properties of the estimator (such as its standard deviation). We use an ensemble method, wherein the mean of the regression models is taken to estimate E[Y|X] on the testing data. Finally, we remove the confounding component E[Y|X] from Y to obtain an estimate L̂_t ∀t ∈ T on the testing data. The final output of this step is the estimate L̂_t and the standard deviation (s_t) of the estimators.
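Step 1 can be sketched as follows. This is a minimal illustration, not the paper's implementation: plain least squares stands in for the scikit-learn regressors (e.g., Random Forest), while the resampling fraction and ensemble size default to the 80%-with-replacement, 100-model setup used in the evaluation.

```python
import numpy as np

def bootstrap_predict(X_train, y_train, X_test, n_models=100, frac=0.8, seed=0):
    """Fit an ensemble of regressors on bootstrap resamples of the training
    data; return the per-timestep mean prediction (the E[Y|X] estimate) and
    its standard deviation s_t across the ensemble."""
    rng = np.random.default_rng(seed)
    n = len(y_train)
    A_train = np.column_stack([X_train, np.ones(n)])
    A_test = np.column_stack([X_test, np.ones(len(X_test))])
    preds = []
    for _ in range(n_models):
        # Sample a subset of training rows with replacement.
        idx = rng.integers(0, n, size=int(frac * n))
        coef, *_ = np.linalg.lstsq(A_train[idx], y_train[idx], rcond=None)
        preds.append(A_test @ coef)
    preds = np.asarray(preds)
    return preds.mean(axis=0), preds.std(axis=0)
```

Subtracting the returned mean from the observed test-period output gives L̂_t; the returned standard deviation is the s_t used in Steps 2 and 3.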

3.2.2 Step 2: Remove seasonal component. As discussed earlier, the solar output of a site is affected by both common (i.e., confounding) and local factors. Using the Half-Sibling Regression approach, we can account for transient confounding factors such as weather changes. However, we also need to account for transient local factors, such as shade from nearby trees, which may temporarily reduce the power output at a specific time of the day. Since the variable L̂_t includes both transient local factors and anomalies, we need to remove the local factors to determine the anomaly Â_t.

We note that the time period of such occlusions (those from nearby trees or structures) does not vary much on a daily basis. This is because the maximum elevation of the sun in the sky varies by less than 2° over a period of a week⁵ on average. Using time series decomposition techniques over short time intervals (e.g., one week), such seasonal components (i.e., patterns recurring every fixed period) can be removed. Thus, we perform a time series decomposition to account for transient local factors as follows. We compute the seasonal component and remove it from L̂_t only when L̂_t lies outside the confidence interval ±4s_t and, on removal of the seasonal component, L̂_t does not fall outside the confidence interval. After removal of the seasonal component, if any, we obtain Â_t from L̂_t as our final output.
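A simplified sketch of this step follows. The paper uses STL decomposition; here, as a stand-in, the seasonal component is estimated as the mean residual at each hour of the day over the window, and it is subtracted only under the two confidence-interval conditions described above.

```python
import numpy as np

def remove_seasonal(L_hat, s_t, period=24):
    """Estimate a recurring daily pattern in the residual L_hat (simplified
    stand-in for STL) and subtract it only where (a) L_hat lies outside the
    +/-4*s_t band and (b) subtraction brings the point back inside it."""
    seasonal = np.zeros_like(L_hat)
    for h in range(period):
        # Mean residual at this hour of day across the window.
        seasonal[h::period] = L_hat[h::period].mean()
    candidate = L_hat - seasonal
    outside = np.abs(L_hat) > 4 * s_t
    back_inside = np.abs(candidate) <= 4 * s_t
    return np.where(outside & back_inside, candidate, L_hat)  # A_hat
```

A recurring shade dip at a fixed hour is removed, while a non-recurring drop that stays outside the band even after subtraction is preserved as a potential anomaly.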

3.2.3 Step 3: Detect Anomalies. We use the output Â_t (from Step 2) and the standard deviation s_t (from Step 1) to detect anomalies at a candidate site. Specifically, we flag a period as anomalous when three conditions hold. First, the deviation of Â_t should be significant, i.e., greater than four times the standard deviation. Second, the anomaly should persist for at least k contiguous periods. Third, the period t must fall during the daytime (not including twilight). Thus, an anomaly can be defined as follows:

anomaly = (Â_t < −4s_t) ∧ … ∧ (Â_{t+k} < −4s_t), ∀t ∈ T    (3)

where T denotes the set of daytime periods.

Based on our assumption that Â_t is Gaussian, it follows that the odds of an anomaly are very high when the deviation exceeds 4s_t; such values lie in the tail of the normal distribution. The second condition (i.e., the contiguous anomaly period) ensures that the drop in power output persists for an extended period. In practice, depending on the data granularity, the contiguous period can range from minutes to hours. Clearly, we would like to detect anomalies during periods when sunlight is abundant; during the night or twilight, the solar irradiation is too low to provide any meaningful power generation. Thus, we restrict anomaly detection to the daytime period.
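The three conditions can be sketched as follows, assuming hourly data with a precomputed boolean daytime mask (the mask and k are inputs, not derived here).

```python
import numpy as np

def flag_anomalies(A_hat, s_t, daytime, k=2):
    """Flag periods where the de-seasonalized residual A_hat stays below
    -4*s_t for at least k contiguous daytime periods, per condition (3)."""
    below = (A_hat < -4 * s_t) & daytime
    flags = np.zeros(len(A_hat), dtype=bool)
    run = 0
    for t, b in enumerate(below):
        run = run + 1 if b else 0
        if run >= k:
            flags[t - run + 1: t + 1] = True   # mark the whole run
    return flags
```

With hourly data and k=2 (as in the evaluation), an isolated one-hour dip is ignored, while a drop spanning two or more consecutive daytime hours is flagged.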

4 IMPLEMENTATION

We implemented our SolarClique algorithm in Python using the SciPy stack [4]. The SciPy stack consists of efficient data processing and numeric optimization libraries. Further, we use the regression techniques in the scikit-learn library to learn our models [24]. The scikit-learn library comprises various regression tools, which take a vector of input features and learn the parameters that best describe the relationship between the input and the dependent variable. Additionally, we use the Seasonal and Trend decomposition using Loess (STL) technique to remove the seasonality component [9]. The STL technique performs a time series decomposition on the input and deconstructs it into trend, seasonal, and noise components.

⁵The sun directly faces the Tropic of Cancer (+23.5°) on the summer solstice, whereas it faces the Tropic of Capricorn (−23.5°) on the winter solstice. Thus, over half the year (26 weeks), the maximum elevation of the sun changes by ≈47°, i.e., <2° per week.

Number of solar installations    88
Solar installation size (kW)     0.5 to 9.3
Residential size (sq. ft.)       1142 to 3959
Granularity                      1 hour
Years                            2014, 2015
Table 1: Key characteristics of the dataset.

5 EVALUATION

5.1 Dataset

To evaluate the efficacy of SolarClique, we use a public dataset available through the Dataport Research Program [2]. The dataset contains solar power generation from over a hundred residential solar installations located in the city of Austin, Texas. The power generation from these installations is available at an hourly granularity. Table 1 shows the key characteristics of the dataset. For our case study, we selected those homes that have contiguous solar generation data, i.e., no missing values, for an overlapping period of at least two years. Based on these criteria, we had 88 homes for our evaluation over the years 2014 and 2015.

5.2 Evaluation Methodology

We partitioned our dataset into training and testing periods. We used the first three months of data to train the model and the remaining 21 months for testing. Further, for bootstrapping, we sampled our training dataset by randomly selecting 80% of the training samples with replacement. These samples are then used to build the estimator, and we repeated this step 100 times to learn the properties of the estimator. To build our model, we used five popular regression techniques, namely Random Forest (RF), k-Nearest Neighbors (kNN), Decision Trees (DT), Support Vector Regression (SVR), and Linear Regression (LR). Finally, we set the contiguous period to k=2 (see Step 3 of our algorithm), since our data granularity is hourly. Unless stated otherwise, we use all homes in our dataset for our evaluation.

5.3 Metrics

Since the installation capacity differs across solar installations, it may not be meaningful to use a metric such as Root Mean Squared Error (RMSE), because the magnitude of the error may differ across predictions. Thus, we use the Mean Absolute Percentage Error (MAPE) to measure the regression model's accuracy in predicting a candidate site's power generation. MAPE is defined as:

MAPE = (100/n) Σ_{t=1}^{n} |y_t − p_t| / ȳ    (4)

where y_t and p_t are the actual and predicted values at time t, respectively, ȳ is the average of all the actual values, and n is the number of samples in the test dataset.
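Equation (4) translates directly into code (reading ȳ as the mean of the actual values, as defined above); scaling by the overall mean rather than each pointwise actual keeps near-zero nighttime readings from inflating the error.

```python
import numpy as np

def mape(actual, predicted):
    """Mean Absolute Percentage Error as in (4): absolute errors scaled by
    the mean of the actual values, averaged over the test samples."""
    y = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return 100.0 / len(y) * np.sum(np.abs(y - p) / y.mean())

print(mape([2.0, 2.0, 2.0], [1.0, 3.0, 2.0]))   # about 33.33
```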

Figure 4: Performance of different regression techniques used to predict the power generation of a site.

Figure 5: Mean standard deviation of predictions for different regression techniques.

5.4 Results

Below, we summarize the results of using SolarClique on the Dataport dataset.

5.4.1 Prediction performance using geographically nearby sites. We compare the accuracy of the five regression techniques used to predict the power generated at a candidate site (Y) using the data from nearby sites (X). Figure 4 shows the spread of the MAPE values for the regression techniques across all 88 sites. Random Forest and Decision Trees show the best performance, closely followed by k-NN, with average MAPE values of approximately 7.81%, 7.87%, and 8.94%, respectively. Linear Regression, on the other hand, shows poor accuracy, with an average MAPE of 19%.

As discussed earlier, our approach uses bootstrapping to generate standard deviation values for each prediction. Note that a small standard deviation means a tighter confidence interval and indicates that the regression technique produces consistent predictions across runs. Figure 5 shows the mean value of the standard deviation over all the testing samples, normalized by the size of the solar installation. We observe that RF and k-NN have tight confidence intervals, while LR has considerably wider bounds. In particular, using all the sites, the average standard deviation of RF and k-NN is 0.0032 and 0.0059, respectively. In comparison, the average standard deviation of LR is 0.0078. Since RF performs better than the other regression techniques, we use RF for the rest of our evaluation.


Figure 6: Average MAPE diminishes with an increase in the number of geographically nearby sites.

Figure 7: Standard deviation of MAPE diminishes with an increase in the number of geographically nearby sites.

5.4.2 Impact due to the number of geographically nearby sites.

We now focus on understanding the minimum number of geographi-

cally nearby sites to accurately predict the power generated at the

candidate site. As discussed earlier, the power output of geographi-

cally nearby sites are used as input features to build the regression

model. Since in this experiment we are not interested in analyzing

the conﬁdence intervals, we use the entire training data to build

the model (i.e., no bootstrapping). We vary the number of geo-

graphically nearby sites from 1 to 50 and for each value, we build

100 different models learnt from choosing random combinations of

nearby sites.

Figure 6 shows the spread of average MAPE values as we vary

the number of geographically nearby sites used for all 88 sites. We

use the Random Forest regression technique to build the model. As

expected, the average MAPE value reduces when more number of

geographically nearby sites are used to predict the output. Note that

as the nearby sites increase, the variations in nearby sites cancel out,

which provides a more robust regression model. This suggests that

an increase in the nearby site can improve the accuracy of the power

generation model of a candidate site. We also note that the reduction

in MAPE diminishes as the number of geographically nearby sites

increases. With at least ﬁve randomly chosen geographically nearby

sites, we observe that the MAPE is around 10%. This indicates that

our algorithm can be effective in sparsely populated regions such as

towns/villages, having few solar installations.

Next, we analyze the variability in performance of the different

models as the number of geographically nearby sites increases. Fig-

ure 7 shows the spread of the standard deviation of the 100 models

with increasing number of geographically nearby sites. As shown in

the ﬁgure, we observe that the variability reduces when the number

of nearby sites increases. However, unlike the previous result, the

variability continues to reduce, albeit at a slower rate, even
when the number of nearby sites is greater than five. Thus, the
performance of any individual learned model stays close to the average.

5.4.3 Detection of anomalies. We illustrate the different steps

involved in our algorithm using Figure 8. In the top subplot of the

ﬁgure, the blue line depicts the power generation trace from a solar

installation for over a week in August 2015. The red markers show
the predictions from the Random Forest regression technique with

data from the remaining 87 sites as features. While the prediction

(i.e., red marker) closely follows the actual power output (i.e., blue

line), there is a significant difference between the actual and predicted values
after 14th August. As seen, there is a sharp drop in the actual power

generated in the late morning of 14th August. The drop in power

is signiﬁcant, and there is no output recorded in the site for an

extended period until October (not shown in the figure). However,
the regression model continues to forecast a nonzero power output for the
given site.

The second subplot shows the residual, i.e., the difference between
the actual and the predicted values (i.e., the black line), along
with the confidence interval (i.e., the gray shaded region). The

confidence interval, which is within ±4s, is calculated using the
pointwise standard deviation s obtained from the bootstrap process.

In this figure, we observe that the residual sometimes lies outside the
confidence interval at the same time of the day across multiple days,
which indicates a fixed periodic component.

On removing the seasonal component using our approach, we

observe that the residual always lies within the conﬁdence interval,

except when there is an anomaly in power generation. This is shown

in the third subplot of the figure, where the black line (i.e., the residual)
lies within the gray shaded region (i.e., the confidence interval). Finally,
the last subplot depicts our anomaly detection algorithm in

action. We observe that our algorithm accurately ﬂags periods of no

output as an anomaly (depicted by the red shaded region).
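The residual, confidence-band, and seasonal-removal steps above can be sketched as follows. This is an illustrative reconstruction, not the paper's exact code; in particular, it uses a per-time-of-day mean residual as a simple stand-in for a full seasonal-trend decomposition.

```python
import numpy as np

def flag_anomalies(actual, boot_predictions, slots_per_day, k=4.0):
    """Flag readings whose deseasonalized residual falls outside +/- k*s.

    actual           : (T,) observed generation at the candidate site
    boot_predictions : (B, T) predictions from B bootstrap-trained models
    slots_per_day    : readings per day (e.g., 96 for 15-minute data)
    """
    predicted = boot_predictions.mean(axis=0)
    s = boot_predictions.std(axis=0)  # pointwise bootstrap standard deviation
    residual = actual - predicted
    # Remove the fixed periodic component: the mean residual for each
    # time-of-day slot (a simple stand-in for seasonal decomposition).
    slot = np.arange(len(actual)) % slots_per_day
    seasonal = np.array([residual[slot == i].mean() for i in range(slots_per_day)])
    deseasonalized = residual - seasonal[slot]
    return np.abs(deseasonalized) > k * s
```

A run of flagged slots, such as the no-output stretch after 14th August, then shows up as a contiguous anomalous region.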

6 CASE STUDY: ANOMALY DETECTION ANALYSIS

In this case study, we use the solar installations in the Dataport dataset, as
they represent a typical setup within a city. We ran our SolarClique

algorithm on the generation output from all solar installations and

obtained the anomalous days in the dataset. Below, we present our

analysis.

6.1 Anomalies in solar installations

Figure 9 shows the total number of anomalous days in each solar

installation site. We observe that our SolarClique algorithm found

anomalous days in 76 solar installations, out of the 88 sites in the

dataset. As seen in the figure, the total number of anomalous days per site
spans from a single day to several months. Together, all the installation sites

had a total of 1906 anomalous days. This indicates a signiﬁcant

loss of renewable power output. Specifically, we observe that 17 of
the 88 installations (around 20%) had anomalous power generation
for at least a total of one month, which represents more than 5% of
the overall 640 days in the testing period. Anomalies from these
installations account for nearly 80% of all the anomalous days.

SolarClique: Detecting Anomalies in Residential Solar Arrays. COMPASS ’18, June 20–22, 2018, Menlo Park and San Jose, CA, USA.

Figure 8: An illustrative example that depicts the data-processing and anomaly detection steps in SolarClique (the annotation marks where the residual crosses the confidence interval).

Figure 9: Number of anomalous days for each site. Installation sites are plotted in ascending order of anomalous days.

To better understand the anomalous periods, we group them into

short-term and long-term periods. Short-term periods have fewer
than three contiguous anomalous days, while long-term periods
last at least three consecutive days. Our results

show the dataset has 587 occurrences of short-term periods spread

over 683 days. Further, we observe 123 occurrences of long-term

periods spread over 1223 days. We also observe that the maximum

contiguous anomalous period found in a site was approximately ﬁve

months (i.e., 158 days), with no power output during that period.

Clearly, such a high number of long-term anomalous periods demonstrates
the need for early anomaly detection tools. Additionally, we
note that long-term anomalies are relatively easier to detect than
short-term anomalies. While long-term anomalies represent serious
issues that may need immediate attention, short-term anomalies may
be minor problems that, if unattended, could become major problems
in the future. The advantage of our approach is that it can detect both
short-term and long-term anomalies.
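The short/long-term grouping above can be sketched as a simple run-length pass over a per-day anomaly flag series (an illustrative helper, not the paper's code):

```python
def group_anomalous_runs(daily_flags, long_threshold=3):
    """Split a boolean per-day anomaly series into short-term (fewer than
    `long_threshold` contiguous days) and long-term runs.
    Returns two lists of (start_day, length) pairs."""
    short_runs, long_runs = [], []
    start = None
    for i, flag in enumerate(list(daily_flags) + [False]):  # sentinel closes last run
        if flag and start is None:
            start = i                       # a new run begins
        elif not flag and start is not None:
            length = i - start              # the run just ended
            (long_runs if length >= long_threshold else short_runs).append((start, length))
            start = None
    return short_runs, long_runs
```

For example, flags of [F, T, T, F, T, T, T, T, F, T] yield short runs (1, 2) and (9, 1) and one long run (4, 4).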

Figure 10: Under-production of solar detected using our algorithm (anomalous days are marked in the lower two subplots).

6.2 Analysis of anomalies detected

Note that the reduction in power output depends on the severity of

an anomaly. This is because some electrical faults (e.g., short-circuit

of a panel) may have localized impact on a solar array, which can

marginally reduce the power output, while other faults (e.g., inverter

faults) may show signiﬁcant power reduction or completely stop

power generation.

SolarClique detects anomalous days both when there is no solar generation
and when an installation underproduces power. Our
algorithm reported 1099 and 807 anomalous days with underproduction
and no solar generation, respectively. Since no solar generation

days are trivially anomalous, we specifically examine cases of solar
underproduction. Figure 10 shows the power output from three

different sites. The top plot shows the power output (depicted by

the blue line) with no anomalous days, the subplots below show

sites that have anomalous days (depicted by the red marker). Our

results show that the SolarClique algorithm detects anomalies even

when a site underproduces solar power. Note that the site with no
anomaly, which is exposed to the same solar irradiance as the other
sites, continues to produce solar output. However, we observe a
drop in power output for an extended period in the anomalous sites.

Figure 11: Distribution of the difference between actual and predicted output on underproducing anomalous days (the marked region indicates less than 5% error).

Figure 12: Anomalies detected in two sample sites, (a) Site A and (b) Site B, where the difference between actual and predicted was less than 5%. The figure shows a good fit on all days except the anomalous period highlighted in the circle.

Specifically, we observe that the drop in power output is around 75% and
40% in Site 1 and Site 2, respectively, presumably due to factors
such as line faults in the solar installation. Usually, anomalies such
as line faults can cause a significant drop in the power output. In
particular, a 75% drop in Site 1 can be attributed to faults in three-fourths
of the strings (i.e., panels connected in series).

We further examine the reduction in power output in the underproduction
cases. Figure 11 shows the distribution of the difference
between actual and predicted power output for anomalous days. Out of

the 1099 underproduction days, our algorithm reported 23 days where
the difference in percentage was less than or equal to 5%. Typically,
a drop of more than 5% in power output is considered significant. This is
because the malfunctioning of a single panel in a solar array with 20
panels will result in a 5% reduction.6 Thus, we investigate anomalous

days wherein the difference is less than or equal to 5%. Figure 12

compares the regression ﬁt of anomalous days with two normal days

(adjacent to the anomalous days) from two sample sites where the

difference was less than 5%. Note that the figure shows a good fit
for most periods except during the anomalous period highlighted
in the circle. In comparison to other periods, we observe a drop in
power during the anomalous period, occurring during mid-day.

[Footnote 6] Typically, a 5 kW installation has 20 panels, each with a 250 W capacity.

Figure 13: Accelerated degradation in the power output of a solar site (over a year).

Anomaly Type                #Sites   #Days   Avg. power reduction (%)
Single No Production           5      515        98.87
Multiple No Production         3      295        98.65
Single Under Production        2      348        60.22
Multiple Under Production      4      164        43.63
Severe Degradation             3      179        30.67

Table 2: Types of anomaly in sites having more than a month of anomalous days.

Even though the difference in percentage is small, it represents a

relatively signiﬁcant drop since the power output is at its peak during

the mid-day.

We observe that our approach also detects anomalies due to degradation
in the power output, which usually spans an extended
time period. Since the drop in power output over this period

may be small, such changes are more subtle and harder to identify.

Figure 13 shows the degradation in power output of an anomalous

site. Our algorithm reports an increase in the frequency of anomalous
days in the installation site over the year, with more anomalous

days in the latter half. To understand the increase in anomalous days,

we plot the difference between the actual and predicted (seen in the

bottom subplot). We observe that the difference between the actual

and predicted value steadily increases over time. It is known that

the power output of solar installations may reduce over time due to

aging [

22

] at a rate of around 1% a year. However, the accelerated

degradation seen in Figure 13 is presumably due to occurrences of

hot-spots or increased contact resistance due to corrosion. Early

detection of such conditions can help homeowners take advantage

of product warranties available on solar panels.
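One simple way to quantify such accelerated degradation, sketched here under the assumption that daily aggregates of actual and predicted output are available, is to fit a trend line to the gap between them:

```python
import numpy as np

def degradation_slope(actual_daily, predicted_daily):
    """Least-squares slope of the daily (predicted - actual) gap, i.e., the
    average extra energy lost per day. A slope well above the ~1%/year
    expected from normal aging suggests accelerated degradation."""
    gap = np.asarray(predicted_daily) - np.asarray(actual_daily)
    slope, _intercept = np.polyfit(np.arange(len(gap)), gap, 1)
    return slope
```

Applied to the bottom subplot of Figure 13, this would turn the visually widening gap into a single rate that can be compared against normal aging.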

We now examine the types of anomalies in the top 17 sites with
more than a month of anomalous days. The power output of anomalous
days can be categorized into three types: (i) no production,
(ii) under production, and (iii) degradation over a period. Table 2

summarizes the different types occurring over a period in these sites.

The single period represents a single contiguous period of anomaly,

while the multiple period represents more than one contiguous pe-

riod. We observe that the average power reduction during anomalous

periods ranges from 30.6% to 98.8%. We classify “no production
days” as days with no power output for the majority of the
period. Overall, we observe that there are 810 no-production days,
a significant loss in renewable output. Although the average power

reduction due to severe degradation is 30%, it is likely to grow over

time.

7 FUTURE EXTENSIONS TO SOLARCLIQUE

As mentioned earlier, several third-party sites exist that host solar

generation data for rooftop installations. While our approach
uses power output to determine the existence of anomalies in power
generation, several other electrical characteristics, such as voltage
and current, are available that carry much richer information about

the type of anomaly. This information can be leveraged to further

infer the exact type of anomaly in power generation. For example,

a line fault (broken string) will reduce the current produced by the

overall setup, but the voltage will remain unchanged. In contrast,
covering by dust or bird droppings can impact both the voltage and the
current. Thus, our algorithm can be extended to use multi-modal

data (e.g., voltage, current, and power) to further diagnose the exact

cause of the anomaly.
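As a toy illustration of such multi-modal reasoning, the following hypothetical rule of thumb maps relative voltage and current drops to likely causes; the 10% threshold and the rule set are purely illustrative, not part of SolarClique.

```python
def diagnose(voltage_drop, current_drop, threshold=0.10):
    """Map relative drops (0.0-1.0) in array voltage and current to a
    coarse likely cause, following the reasoning above. Threshold is an
    illustrative assumption, not a calibrated value."""
    if current_drop > threshold and voltage_drop <= threshold:
        return "line fault suspected: current down, voltage unchanged"
    if current_drop > threshold and voltage_drop > threshold:
        return "soiling suspected: both voltage and current reduced"
    return "no fault indicated"
```

A real diagnosis layer would replace these hand-set rules with statistics learned from labeled fault data.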

Our approach can also be extended to a single solar installation

for detecting anomalies. With the proliferation of micro-inverters in

residential solar installations, power generation data from individual

panels are available. Power output from these colocated panels can

also be used to detect faults in the PV setup, as they can predict

the power output with higher ﬁdelity. This can be used in remote

locations where data from other solar installations are not easily

available. As part of future work, we plan to use the SolarClique
algorithm to discover faults in a single panel by comparing the power
generated by each panel with the others in the same setup.

8 RELATED WORK

There has been significant work on predicting the solar output from
solar arrays [7, 8, 16, 20, 28]. While some studies have used site-specific
data such as panel configuration [8, 20] for building the
prediction model, others have used external data such as weather or
historical generation data [17, 28]. Such models can provide short-term
generation forecasts (e.g., an hour ahead) to long-term forecasts (e.g.,
days or weeks). Although these studies can predict the reduction
in power output, a limitation is that they cannot
attribute the reduction to anomalies in the solar installation.

Prior work has also focused on anomaly detection in PV panels
[14, 15, 22, 25, 30–32]. These studies propose methods to model
the effects of shades/covering [14, 19], hot-spots [18], degradation
[22, 30], or short-circuits and other faults [15]. However, these
methods require extensive data (such as statistics on different types
of anomalies) [29] or do not focus on hardware-related issues [14].
For instance, [29] proposes a solution to determine probable causes
of anomalies but requires detailed site-specific information along
with pre-defined profiles of anomalies. Unlike prior approaches,
our approach does not require such extensive data or setup and relies
instead on power generation from co-located sites. Thus, it provides
a scalable and cost-effective approach to detect anomalies in
thousands of solar installation sites.

The idea behind our approach is similar to [26, 27]. However,
those authors use the approach in the context of an astronomy application,
wherein systematic errors are removed to detect exoplanets.
In that case, the systematic errors are confounding factors due to the
telescope and spacecraft, which influence the observations of
distant stars. In contrast, our solution uses inputs from other geographically
nearby sites to detect anomalies in solar generation. As discussed
earlier, such datasets are today easily accessible over the internet,
which makes our approach feasible. Further, using regression on
data from neighbors has been studied earlier [10]. However,
the main focus of that work was quality control of
climate observations by imputing missing values. In our case, we
use the learned regression model to find anomalous solar generation.

9 CONCLUSION

In this paper, we proposed SolarClique, a data-driven approach to

detect anomalies in power generation of a solar installation. Our

approach requires only power generation data from geographically

nearby sites and does not rely on expensive instrumentation or other

external data. We evaluated SolarClique on the power generation

data over a period of two years from 88 solar installations in Austin,

Texas. We showed that our regression models for solar installations
are accurate, with tight confidence intervals. Further, we showed
that our approach can generate accurate models with as few as five

geographically nearby sites. We observed that out of the 88 solar

installations, 76 deployments had anomalies in power generation.

Additionally, we found that our approach is powerful enough to

distinguish between reduction in power output due to anomalies and

other factors (such as cloudy conditions). Finally, we presented a

detailed analysis of the different anomalies observed in our dataset.

Acknowledgment

This research is supported by NSF grants IIP-1534080, CNS-1645952, CNS-1405826, CNS-1253063, CNS-1505422, CCF-1522054 and the Massachusetts Department of Energy Resources.

REFERENCES

[1] 2016. When Will Rooftop Solar Be Cheaper Than the Grid? https://goo.gl/h1Ayy5. Accessed March 2018.
[2] 2017. Dataport dataset. https://dataport.cloud/.
[3] 2017. EIA adds small-scale solar photovoltaic forecasts to its monthly Short-Term Energy Outlook. https://www.eia.gov/todayinenergy/detail.php?id=31992. Accessed March 2018.
[4] 2018. SciPy Stack. http://www.scipy.org/stackspec.html. Accessed March 2018.
[5] Mohamed Hassan Ali, Abdelhamid Rabhi, Ahmed El Hajjaji, and Giuseppe M Tina. 2017. Real Time Fault Detection in Photovoltaic Systems. Energy Procedia (2017).
[6] Rob W Andrews, Andrew Pollard, and Joshua M Pearce. 2013. The effects of snowfall on solar photovoltaic performance. Solar Energy 92 (2013), 84–97.
[7] Yona Atsushi and Funabashi Toshihisa. 2007. Application of recurrent neural network to short-term-ahead generating power forecasting for photovoltaic system. In Power Engineering Society General Meeting. Tampa, Florida, USA.
[8] Peder Bacher, Henrik Madsen, and Henrik Aalborg Nielsen. 2009. Online short-term solar power forecasting. Solar Energy (2009).
[9] Robert B Cleveland, William S Cleveland, and Irma Terpenning. 1990. STL: A seasonal-trend decomposition procedure based on loess. Journal of Official Statistics (1990).
[10] Christopher Daly, Wayne Gibson, Matthew Doggett, Joseph Smith, and George Taylor. 2004. A probabilistic-spatial approach to the quality control of climate observations. In Proceedings of the 14th AMS Conference on Applied Climatology. Amer. Meteorological Soc., Seattle, WA.
[11] Chris Deline. 2009. Partially shaded operation of a grid-tied PV system. In Photovoltaic Specialists Conference (PVSC), 2009 34th IEEE. IEEE, 001268–001273.
[12] Mahmoud Dhimish, Violeta Holmes, and Mark Dales. 2017. Parallel fault detection algorithm for grid-connected photovoltaic plants. Renewable Energy (2017).
[13] Ran Fu, David J Feldman, Robert M Margolis, Michael A Woodhouse, and Kristen B Ardani. 2017. US solar photovoltaic system cost benchmark: Q1 2017. Technical Report. National Renewable Energy Laboratory (NREL), Golden, CO (United States).
[14] Peter Xiang Gao, Lukasz Golab, and Srinivasan Keshav. 2015. What's Wrong with my Solar Panels: a Data-Driven Approach. In EDBT/ICDT Workshops. 86–93.
[15] Elyes Garoudja, Fouzi Harrou, Ying Sun, Kamel Kara, Aissa Chouder, and Santiago Silvestre. 2017. Statistical fault detection in photovoltaic systems. Solar Energy (2017).
[16] Rui Huang, Tiana Huang, Rajit Gadh, and Na Li. 2012. Solar generation prediction using the ARMA model in a laboratory-level micro-grid. In Smart Grid Communications (SmartGridComm), 2012 IEEE Third International Conference on. IEEE.
[17] Srinivasan Iyengar, Navin Sharma, David Irwin, Prashant Shenoy, and Krithi Ramamritham. 2017. A Cloud-Based Black-Box Solar Predictor for Smart Homes. ACM Transactions on Cyber-Physical Systems 1, 4 (2017), 21.
[18] Katherine A Kim, Gab-Su Seo, Bo-Hyung Cho, and Philip T Krein. 2016. Photovoltaic hot-spot detection for solar panel substrings using ac parameter characterization. IEEE Transactions on Power Electronics (2016).
[19] Alexander Kogler and Patrick Traxler. 2016. Locating Faults in Photovoltaic Systems Data. In International Workshop on Data Analytics for Renewable Energy Integration. Springer.
[20] Elke Lorenz, Johannes Hurka, Detlev Heinemann, and Hans Georg Beyer. 2009. Irradiance forecasting for the power prediction of grid-connected photovoltaic systems. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (2009).
[21] Ricardo Marquez and Carlos FM Coimbra. 2013. Intra-hour DNI forecasting based on cloud tracking image analysis. Solar Energy (2013).
[22] Ababacar Ndiaye, Cheikh MF Kébé, Pape A Ndiaye, Abdérafi Charki, Abdessamad Kobi, and Vincent Sambou. 2013. A novel method for investigating photovoltaic module degradation. Energy Procedia (2013).
[23] Judea Pearl. 2009. Causality. Cambridge University Press.
[24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011).
[25] M Sabbaghpur Arani and MA Hejazi. 2016. The comprehensive study of electrical faults in PV arrays. Journal of Electrical and Computer Engineering 2016 (2016).
[26] Bernhard Schölkopf, David Hogg, Dun Wang, Dan Foreman-Mackey, Dominik Janzing, Carl-Johann Simon-Gabriel, and Jonas Peters. 2015. Removing systematic errors for exoplanet search via latent causes. In International Conference on Machine Learning.
[27] Bernhard Schölkopf, David W Hogg, Dun Wang, Daniel Foreman-Mackey, Dominik Janzing, Carl-Johann Simon-Gabriel, and Jonas Peters. 2016. Modeling confounding by half-sibling regression. Proceedings of the National Academy of Sciences (2016).
[28] Navin Sharma, Pranshu Sharma, David Irwin, and Prashant Shenoy. 2011. Predicting solar generation from weather forecasts using machine learning. In Smart Grid Communications (SmartGridComm), 2011 IEEE International Conference on. IEEE, 528–533.
[29] S Stettler, P Toggweiler, E Wiemken, W Heydenreich, AC de Keizer, WGJHM van Sark, S Feige, M Schneider, G Heilscher, E Lorenz, and others. 2005. Failure detection routine for grid-connected PV systems as part of the PVSAT-2 project. In Proceedings of the 20th European Photovoltaic Solar Energy Conference & Exhibition, Barcelona, Spain. 2490–2493.
[30] Ali Tahri, Takashi Oozeki, and Azzedine Draou. 2013. Monitoring and evaluation of photovoltaic system. Energy Procedia (2013), 456–464.
[31] Patrick Traxler. 2013. Fault detection of large amounts of photovoltaic systems. In Proceedings of the ECML/PKDD 2013 Workshop on Data Analytics for Renewable Energy Integration.
[32] Achim Woyte, Mauricio Richter, David Moser, Stefan Mau, Nils Reich, and Ulrike Jahn. 2013. Monitoring of photovoltaic systems: good practices and systematic analysis. In Proc. 28th European Photovoltaic Solar Energy Conference.