
Understanding the Relationship between Scheduling

Problem Structure and Heuristic Performance using

Knowledge Discovery

Kate A. Smith-Miles¹, Ross J. W. James², John W. Giffin² and Yiqing Tu¹

¹ School of Engineering and IT, Deakin University, Burwood VIC 3125, Australia
{katesm, ytu}@deakin.edu.au
² Department of Management, University of Canterbury, Christchurch 8140, New Zealand
{ross.james, john.giffin}@canterbury.ac.nz

Abstract. Using a knowledge discovery approach, we seek insights into the

relationships between problem structure and the effectiveness of scheduling

heuristics. A large collection of 75,000 instances of the single machine

early/tardy scheduling problem is generated, characterized by six features, and

used to explore the performance of two common scheduling heuristics. The best

heuristic is selected using rules from a decision tree with accuracy exceeding

97%. A self-organizing map is used to visualize the feature space and generate

insights into heuristic performance. This paper argues for such a knowledge

discovery approach to be applied to other optimization problems, to contribute

to automation of algorithm selection as well as insightful algorithm design.

Keywords: Scheduling; heuristics; algorithm selection; self-organizing map;

performance prediction; knowledge discovery

1 Introduction

It has long been appreciated that knowledge of a problem’s structure and instance

characteristics can assist in the selection of the most suitable algorithm or heuristic [1,

2]. The No Free Lunch theorem [3] warns us against expecting a single algorithm to

perform well on all classes of problems, regardless of their structure and

characteristics. Instead we are likely to achieve better results, on average, across many

different classes of problem, if we tailor the selection of an algorithm to the

characteristics of the problem instance. This approach has been well illustrated by the

recent success of the algorithm portfolio approach on the 2007 SAT competition [4].

As early as 1976, Rice [1] proposed a framework for the algorithm selection

problem. There are four essential components of the model:

• the problem space P represents the set of instances of a problem class;

• the feature space F contains measurable characteristics of the instances

generated by a computational feature extraction process applied to P;

• the algorithm space A is the set of all considered algorithms for tackling the

problem;

• the performance space Y represents the mapping of each algorithm to a set of

performance metrics.

In addition, we need to find a mechanism for generating the mapping from feature

space to algorithm space. The Algorithm Selection Problem can be formally stated as:

For a given problem instance x ∈ P, with features f(x) ∈ F, find the selection mapping

S(f(x)) into algorithm space A, such that the selected algorithm α ∈ A maximizes the

performance mapping y(α(x)) ∈ Y. The collection of data describing {P, A, Y, F} is

known as the meta-data.

There have been many studies in the broad area of algorithm performance

prediction, which is strongly related to algorithm selection in the sense that supervised

learning or regression models are used to predict the performance ranking of a set of

algorithms, given a set of features of the instances. In the AI community, most of the

relevant studies have focused on constraint satisfaction problems like SAT, QBF or

QWH (P, in Rice’s notation), using solvers like DPLL, CPLEX or heuristics (A), and

building a regression model (S) to use the features of the problem structure (F) to

predict the run-time performance of the algorithms (Y). Studies of this nature include

Leyton-Brown and co-authors [5-7], and the earlier work of Horvitz [8] that used a

Bayesian approach to learn the mapping S. In recent years these studies have extended

into the algorithm portfolio approach [4] and a focus on dynamic selection of

algorithm components in real-time [9, 10].

In the machine learning community, research in the field of meta-learning has

focused on classification problems (P), solved using typical machine learning

classifiers such as decision trees, neural networks, or support vector machines (A),

where supervised learning methods (S) have been used to learn the relationship

between the statistical and information theoretic measures of the classification

instance (F) and the classification accuracy (Y). The term meta-learning [11] is used

since the aim is to learn about learning algorithm performance. Studies of this nature

include [12-14] to name only three of the many papers published over the last 15

years.

In the operations research community, particularly in the area of constrained

optimization, researchers appear to have made fewer developments, despite recent

calls for developing greater insights into algorithm performance by studying search

space or problem instance characteristics. According to Stützle and Fernandes [15],

“currently there is still a strong lack of understanding of how exactly the relative

performance of different meta-heuristics depends on instance characteristics”.

Within the scheduling community, some researchers have been influenced by the

directions set by the AI community when solving constraint satisfaction problems.

The dynamic selection of scheduling algorithms based on simple low-level

knowledge, such as the rate of improvement of an algorithm at the time of dynamic

selection, has been applied successfully [16]. Other earlier approaches have focused

on integrating multiple heuristics to boost scheduling performance in flexible

manufacturing systems [17].

For many NP-hard optimization problems, such as scheduling, there is a great deal

we can discover about problem structure which could be used to create a rich set of

features. Landscape analysis (see [18-20]) is one framework for measuring the

characteristics of problems and instances, and there have been many relevant

developments in this direction, but the dependence of algorithm performance on these

measures is yet to be completely determined [20].

Clearly, Rice’s framework is applicable to a wide variety of problem domains. A

recent survey paper [21] has discussed the developments in algorithm selection across

a variety of disciplines, using Rice’s notation as a unifying framework, through which

ideas for cross-fertilization can be explored. Beyond the goal of performance

prediction also lies the ideal of greater insight into algorithm performance, and very

few studies have focused on methodologies for acquiring such insights. Instead the

focus has been on selecting the best algorithm for a given instance, without

consideration for what implications this has for algorithm design or insight into

algorithm behaviour. This paper demonstrates that knowledge discovery processes

can be applied to a rich set of meta-data to develop, not just performance predictions,

but visual explorations of the meta-data and learned rules, with the goal of learning

more about the dependencies of algorithm performance on problem structure and data

characteristics.

In this paper we present a methodology encompassing both supervised and

unsupervised knowledge discovery processes on a large collection of meta-data to

explore the problem structure and its impact on algorithm suitability. The problem

considered is the early/tardy scheduling problem, described in section 2. The

methodology and meta-data are described in section 3, comprising 75,000 instances (P)

across a set of 6 features (F). We compare the performance of two common heuristics

(A), and measure which heuristic produces the lowest cost solution (Y). The mapping

S is learned from the meta-data {P, A, Y, F} using knowledge derived from self-

organizing maps, and compared to the knowledge generated and accuracy of the

performance predictions using the supervised learning methods of neural networks

and decision trees. Section 4 presents the results of this methodology, including

decision tree rules and visualizations of the feature space, and conclusions are drawn

in Section 5.

2 The Early/Tardy Machine Scheduling Problem

Research into the various types of E/T scheduling problems was motivated, in part, by

the introduction of Just-in-Time production, which required delivery of goods to be

made at the time required. Both early and late production are discouraged, as early

production incurs holding costs, and late delivery means a loss of customer goodwill.

A summary of the various E/T problems was presented in [22] which showed the NP-

completeness of the problem.

2.1 Formulation

The E/T scheduling problem we consider is the single machine, distinct due date,

early/tardy scheduling problem where each job has an earliness and tardiness penalty

and due date. The objective is to minimise the total penalty produced by the schedule.

The objective of this problem can be defined as follows:

\min \sum_{i=1}^{n} \left( \alpha_i \, |d_i - c_i|^{+} + \beta_i \, |c_i - d_i|^{+} \right) .    (1)

where n is the number of jobs to be scheduled, c_i is the completion time of job i, d_i is the due date of job i, α_i is the penalty per unit of time when job i is produced early, β_i is the penalty per unit of time when job i is produced tardily, and |x|^+ = x if x > 0, or 0 otherwise. We also define p_i as the processing time of job i.
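As a concrete reading of (1), the objective can be sketched in a few lines of Python (a hypothetical helper written for this illustration, not code from the paper):

```python
def et_objective(c, d, alpha, beta):
    """Total weighted earliness/tardiness penalty of a schedule.

    c[i]     : completion time of job i
    d[i]     : due date of job i
    alpha[i] : earliness penalty per unit of time
    beta[i]  : tardiness penalty per unit of time
    """
    total = 0
    for ci, di, ai, bi in zip(c, d, alpha, beta):
        total += ai * max(di - ci, 0)  # earliness term  alpha_i * |d_i - c_i|^+
        total += bi * max(ci - di, 0)  # tardiness term  beta_i  * |c_i - d_i|^+
    return total
```

For example, a job finishing two units early with α = 1 contributes 2, and a job finishing one unit late with β = 2 also contributes 2.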

The objective of this problem is to schedule the jobs as closely as possible to their

due dates; however the difficulty in formulating a schedule occurs when it is not

possible to schedule all jobs on their due dates, which also causes difficulties in

managing the many tradeoffs between jobs competing for processing at a given time

[23]. Two of the simplest and most commonly used dispatching heuristics for the E/T

scheduling problem are the Earliest Due Date and Shortest Processing Time

heuristics.

2.2 Earliest Due Date (EDD) heuristic

The EDD heuristic orders the jobs based on the date the job is due to be delivered to

the customer. The jobs with the earliest due date are scheduled first, while the jobs

with the latest due date are scheduled last. After the sequence is determined, the

completion times of each job are then calculated using the optimal idle time insertion

algorithm of Fry, Armstrong and Blackstone [24]. For single machine problems the

EDD is known to be the best rule to minimise the maximum lateness, and therefore

tardiness, and also the lateness variance [25]. The EDD has the potential to produce

optimal solutions to this problem, for example when there are few jobs and the due

dates are widely spread so that all jobs may be scheduled on their due date without

interfering with any other jobs. As there are no earliness or tardiness penalties, the

objective value will be 0 and therefore optimal.

2.3 Shortest Processing Time (SPT) heuristic

The SPT heuristic orders the jobs based on their processing time. The jobs with the

smallest processing time are scheduled first, while the jobs with the largest processing

time are scheduled last; this is the fastest way to get most of the jobs completed

quickly. Once the SPT sequence has been determined, the job completion times are

calculated using the optimal idle time insertion algorithm [24]. The SPT heuristic has

been referred to as “the world champion” scheduling heuristic [26], as it produces

schedules for single machine problems that are good at minimising the average time

of jobs in a system, minimising the average number of jobs in the system and

minimising the average job lateness [25]. When the tardiness penalties for the jobs are

similar and the due dates are such that the majority of jobs are going to be late, SPT is

likely to produce a very good schedule for the E/T scheduling problem, as it gets the

jobs completed as quickly as possible. The “weighted” version of the SPT heuristic, where the order is determined by p_i/β_i, is used in part by many E/T heuristics, as this order can be proven to be optimal for parts of a given schedule.
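The two orderings can be sketched as follows (hypothetical Python; the optimal idle-time insertion algorithm of Fry, Armstrong and Blackstone [24] is replaced here by a naive no-idle schedule for brevity):

```python
def edd_sequence(jobs):
    """Earliest Due Date: order jobs by ascending due date 'd'."""
    return sorted(range(len(jobs)), key=lambda i: jobs[i]['d'])

def spt_sequence(jobs):
    """Shortest Processing Time: order jobs by ascending processing time 'p'."""
    return sorted(range(len(jobs)), key=lambda i: jobs[i]['p'])

def no_idle_completion_times(jobs, order):
    """Completion times when jobs run back-to-back from time 0.
    The paper instead applies the optimal idle-time insertion step [24]
    after sequencing; that refinement is omitted in this sketch."""
    t, c = 0, [0] * len(jobs)
    for i in order:
        t += jobs[i]['p']
        c[i] = t
    return c
```

On an instance where the short job is due late, EDD and SPT produce different sequences, which is exactly the situation the meta-data is designed to discriminate.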

2.4 Discussion

Due to their myopic nature, neither the EDD nor the SPT heuristic will consistently produce high-quality solutions to the general E/T scheduling problem. Both heuristics generate solutions very quickly, however, so a large sample of problem instances can be solved in order to demonstrate whether the approach proposed here is useful for exploring the relative performance of two heuristics (or algorithms), and whether it can predict the superiority of one heuristic over another for a given instance.

3 Methodology

In this section we describe the meta-data for the E/T scheduling problem in the form

of {P, A, Y, F}. We also provide a description of the machine learning algorithms

applied to the meta-data to produce rules and visualizations of the meta-data.

3.1 Meta-Data for the E/T Scheduling Problem

The most critical part of the proposed methodology is identification of suitable

features of the problem instances that reflect the structure of the problem and the

characteristics of the instances that might explain algorithm performance. Generally

there are two main approaches to characterizing the instances: the first is to identify

problem dependent features based on domain knowledge of what makes a particular

instance challenging or easy to solve; the second is a more general set of features

derived from landscape analysis [27]. Related to the latter is the approach known in

the meta-learning community as landmarking [28], whereby an instance is

characterized by the performance of simple algorithms which serve as a proxy for

more complicated (and computationally expensive) features. Often a dual approach

makes sense, particularly if the feature set derived from problem dependent domain

knowledge is not rich, and supplementation from landscape analysis can assist in the

characterization of the instances. In the case of the generalised single machine E/T

scheduling problem however, there is sufficient differentiation power in a small

collection of problem dependent features that we can derive rules explaining the

different performance of the two common heuristics. Extending this work to include a

greater set of algorithms (A) may justify the need to explore landscape analysis tools

to derive greater characterisation of the instances.

In this paper, each n-job instance of the generalised single machine E/T scheduling

problem has been characterized by the following features.

1. Number of jobs to be scheduled in the instance, n
2. Mean Processing Time p̄: The mean processing time of all jobs in an instance
3. Processing Time Range p_σ: The range of the processing times of all jobs in the instance

4. Tardiness Factor τ: Defines where the average due date occurs relative to, and as a fraction of, the total processing time of all jobs in the instance. A positive tardiness factor indicates the proportion of the schedule that is expected to be tardy, while a negative tardiness factor indicates the amount of idle time that is expected in the schedule as a proportion of the total processing time of all jobs in the sequence. Mathematically the tardiness factor was defined by Baker and Martin [29] as:

\tau = 1 - \frac{\sum_i d_i}{n \sum_i p_i} .

5. Due Date Range D_σ: Determines the spread of the due dates from the average due date for all jobs in the instance. It is defined as

D_\sigma = \frac{b - a}{\sum_i p_i} ,

where b is the maximum due date in the instance and a is the minimum due date in the instance.

6. Penalty Ratio ρ: The maximum over all jobs in the instance of the ratio of the tardy penalty to the early penalty.
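Under the definitions above, the six features might be computed as follows (a sketch based on the formulas in the text; the function and field names are our own):

```python
def instance_features(p, d, alpha, beta):
    """Six features of an n-job E/T instance.

    p, d         : processing times and due dates
    alpha, beta  : earliness and tardiness penalties per job
    """
    n = len(p)
    total_p = sum(p)
    return {
        'n': n,
        'p_mean': total_p / n,                          # mean processing time
        'p_range': max(p) - min(p),                     # processing time range
        'tau': 1 - sum(d) / (n * total_p),              # tardiness factor [29]
        'D_sigma': (max(d) - min(d)) / total_p,         # due date range
        'rho': max(b / a for a, b in zip(alpha, beta)), # penalty ratio
    }
```

Any instance, whether from the meta-data set or newly generated, maps to one such six-dimensional feature vector.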

Any instance of the problem, whether contained in the meta-data set or generated

at a future time, can be characterized by this set of six features. Since we are

comparing the performance of only two heuristics, we can create a single binary

variable to indicate which heuristic performs best for a given problem instance. Let Y_i = 1 if EDD is the best performing heuristic (lowest objective function) compared to SPT for problem instance i, and Y_i = 0 otherwise (SPT is best). The meta-data then comprises the set of six-feature vectors and heuristic performance measure (Y), for a large number of instances, and the task is to learn the relationship between features and heuristic performance.

In order to provide a large and representative sample of instances for the meta-data,

an instance generator was created to span a range of values for each feature. Problem

instances were then generated for all combinations of parameter values. The

parameter settings used were:

• problem size (number of jobs, n): 20-100 with increments of 20 (5 levels)
• processing times p_i: randomly generated within ranges of 2-10 with increments of 2 (5 levels)
• processing time mean p̄: calculated from the randomly generated p_i
• processing time range p_σ: calculated from the randomly generated p_i
• due dates d_i: randomly generated within ranges of 0.2-1 with increments of 0.2 (5 levels)
• due date range D_σ: calculated from the randomly generated due dates d_i
• tardiness factor τ: calculated based on the randomly generated p_i and d_i
• penalty ratio ρ: 1-10 with increments of 1 (10 levels)

Ten instances using each parameter setting were then generated, giving a total of 5 (size levels) x 5 (processing time range levels) x 6 (tardiness factor levels) x 5 (due date range levels) x 10 (penalty ratio levels) x 10 (instances) = 75,000 instances.

A correlation analysis between the instance features and the Y values across all

75,000 instances reveals that the only instance features that appear to correlate

(linearly) with heuristic performance are the tardiness factor (correlation = -0.59) and

due date range (correlation = 0.44). None of the other instance features appear to have

a linear relationship with algorithm performance. Clearly due date range and tardiness

factor correlate somewhat with the heuristic performances, but it is not clear if these

are non-linear relationships, and if either of these features with combinations of the

others can be used to seek greater insights into the conditions under which one

heuristic is expected to outperform the other.
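The screening step just described amounts to a per-feature Pearson correlation against Y, which can be sketched as follows (toy random data stands in for the 75,000-instance meta-data; the rule generating y here is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 6))                 # stand-in feature matrix (m x 6)
y = (X[:, 3] < 0.5).astype(float)        # toy rule: low "tardiness factor" -> EDD (Y=1)

# Pearson correlation of each feature column with the binary outcome Y
names = ['n', 'p_mean', 'p_range', 'tau', 'D_sigma', 'rho']
for j, name in enumerate(names):
    r = np.corrcoef(X[:, j], y)[0, 1]
    print(f'{name}: {r:+.2f}')
```

As in the paper, a strong linear correlation for one feature (here the column driving the toy rule) and near-zero correlations for the rest would motivate looking for non-linear and interaction effects with the other features.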

Using Rice’s notation, our meta-data can thus be described as:

• P = 75,000 E/T scheduling instances

• A = 2 heuristics (EDD and SPT)

• Y = binary decision variable indicating if EDD is best compared to SPT

(based on objective function which minimizes weighted deviation from due

dates)

• F = 6 instance features (problem size, processing time mean, processing time

range, due date range, tardiness factor and penalty ratio).

Additional features could undoubtedly be derived either from problem dependent domain knowledge, or using problem independent approaches such as landscape analysis [27], landmarking [28], or hyper-heuristics [30]. For now though, we seek to

learn the relationships that might exist in this meta-data.

3.2 Knowledge Discovery on the Meta-Data

When exploring any data-set to discover knowledge, there are two broad approaches.

The first is supervised learning (aka directed knowledge discovery) which uses

training examples – sets of independent variables (inputs) and dependent variables

(outputs) - to learn a predictive model which is then generalized for new examples to

predict the dependent variable (output) based only on the independent variables

(inputs). This approach is useful for building models to predict which algorithm or

heuristic is likely to perform best given only the feature vector as inputs. The second

broad approach to knowledge discovery is unsupervised learning (aka undirected

knowledge discovery) which uses only the independent variables to find similarities

and differences between the structure of the examples, from which we may then be

able to infer relationships between these structures and the dependent variables. This

second approach is useful for our goal of seeking greater insight into why certain

heuristics might be better suited to certain instances, rather than just building predictive models of heuristic performance.

In this section we briefly summarise the machine learning methods we have used

for knowledge discovery on the meta-data.

Neural Networks.

As a supervised learning method [31], neural networks can be used to learn to predict

which heuristic is likely to return the smallest objective function value. A training

dataset is randomly extracted (80% of the 75,000 instances) and used to build a non-

linear model of the relationships between the input set (features F) and the output

(metric Y). Once the model has been learned, its generalisation on an unseen test set

(the remaining 20% of the instances) is evaluated and recorded as a percentage

accuracy in predicting the superior heuristic. This process is repeated ten times for different random extractions of the training and test sets, to ensure that the results are not simply an artifact of the random number seed. This process is known as ten-fold cross validation, and the reported results show the average accuracy on the test set across these ten folds.

For our experimental results, the neural network implementation within the Weka

machine learning platform [32] was used with 6 input nodes, 4 hidden nodes, and 2

output nodes utilising binary encoding. The transfer function for the hidden nodes was

a sigmoidal function, and the neural network was trained with the backpropagation

(BP) learning algorithm with learning rate = 0.3, momentum = 0.2. The BP algorithm

stops when the number of epochs (complete presentation of all examples) reaches a

maximum training time of 500 epochs or the error on the test set does not decrease

after a threshold of 20 epochs.
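The paper's experiments used Weka's neural network implementation; an analogous configuration can be sketched in scikit-learn (toy stand-in data, and no attempt to reproduce Weka's exact early-stopping rule):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((1000, 6))            # stand-in for the 6 instance features
y = (X[:, 3] < 0.5).astype(int)      # stand-in for Y (is EDD best?)

# 4 hidden units with a sigmoidal (logistic) activation, trained by
# backpropagation (SGD) with the paper's learning rate 0.3 and momentum 0.2,
# capped at 500 epochs.
clf = MLPClassifier(hidden_layer_sizes=(4,), activation='logistic',
                    solver='sgd', learning_rate_init=0.3, momentum=0.2,
                    max_iter=500, random_state=0)
scores = cross_val_score(clf, X, y, cv=10)   # ten-fold cross validation
print(scores.mean())
```

The reported figure in the paper corresponds to the mean test-set accuracy across the ten folds.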

Decision Tree

A decision tree [33] is also a supervised learning method, which uses the training data

to successively partition the data, based on one feature at a time, into classes. The

goal is to find features on which to split the data so that the class membership at lower

leaves of the resulting tree is as “pure” as possible. In other words, we strive for

leaves that are comprised almost entirely of one class only. The rules describing each

class can then be read up the tree by noting the features and their splitting points. Ten-

fold cross validation is also used in our experiments to ensure the generalisation of the

rules.

The J4.8 decision tree algorithm, implemented in Weka [32], was used for our

experimental results, with a minimum leaf size of 500 instances. The generated

decision tree is pruned using subtree raising with confidence factor = 0.25.
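J4.8 is Weka's implementation of C4.5 and has no direct scikit-learn equivalent; a comparable CART tree with the same minimum leaf size of 500 can be sketched as follows (toy stand-in data):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.random((5000, 6))            # stand-in for the 6 instance features
y = (X[:, 3] < 0.5).astype(int)      # stand-in for Y (is EDD best?)

# Minimum leaf size of 500 instances keeps the rules few and readable,
# mirroring the setting used with J4.8 in the paper.
tree = DecisionTreeClassifier(min_samples_leaf=500, random_state=0)
tree.fit(X, y)

# The learned splits can be read off as if-then rules.
print(export_text(tree, feature_names=['n', 'p_mean', 'p_range',
                                       'tau', 'D_sigma', 'rho']))
```

The large leaf size trades a little accuracy for interpretability, which is the point of using a tree alongside the neural network.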

Self-Organizing Maps

Self-Organizing Maps (SOMs) are the most well-known unsupervised neural network

approach to clustering. Their advantage over traditional clustering techniques such as

the k-means algorithm lies in the improved visualization capabilities resulting from

the two-dimensional map of the clusters. Often patterns in a high dimensional input

space have a very complicated structure, but this structure is made more transparent

and simple when they are clustered in a lower dimensional feature space. Kohonen

[34] developed SOMs as a way of automatically detecting strong features in large data

sets. SOMs find a mapping from the high dimensional input space to low dimensional

feature space, so any clusters that form become visible in this reduced dimensionality.

The architecture of the SOM is a multi-dimensional input vector connected via

weights to a 2-dimensional array of neurons. When an input pattern is presented to the

SOM, each neuron calculates how similar the input is to its weights. The neuron

whose weights are most similar (minimal distance in input space) is declared the

winner of the competition for the input pattern, and the weights of the winning

neuron, and its neighbours, are strengthened to reflect the outcome. The final set of

weights embeds the location of cluster centres, and is used to recognize to which

cluster a new input vector is closest.

For our experiments we randomly split the 75000 instances into training data

(50000 instances) and test data (25000 instances). We use the Viscovery SOMine

software (www.eudaptics.com) to cluster the instances based only on the six features

as inputs. A map of 2000 nodes is trained for 41 cycles, with the neighbourhood size

diminishing linearly at each cycle. After the clustering of the training instances, the

distribution of Y values is examined within each cluster, and knowledge about the

relationships between instance structure and heuristic performance is inferred and

evaluated on the test data.
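The competitive-learning update described above can be sketched in a few lines of NumPy (a toy SOM for illustration only, not the Viscovery SOMine implementation):

```python
import numpy as np

def train_som(X, grid=(10, 10), cycles=41, lr=0.5, seed=0):
    """Tiny SOM sketch: winner-take-most learning with a neighbourhood
    that shrinks linearly over the training cycles."""
    rng = np.random.default_rng(seed)
    h, w = grid
    W = rng.random((h * w, X.shape[1]))                   # codebook vectors
    coords = np.array([(i, j) for i in range(h) for j in range(w)])
    for t in range(cycles):
        sigma = max(h, w) / 2 * (1 - t / cycles) + 0.5    # neighbourhood radius
        for x in X[rng.permutation(len(X))]:
            bmu = np.argmin(((W - x) ** 2).sum(axis=1))   # best-matching unit
            dist = ((coords - coords[bmu]) ** 2).sum(axis=1)
            nbh = np.exp(-dist / (2 * sigma ** 2))        # Gaussian neighbourhood
            W += lr * (1 - t / cycles) * nbh[:, None] * (x - W)
    return W

X = np.random.default_rng(1).random((200, 6))   # stand-in feature vectors
W = train_som(X, grid=(5, 5), cycles=10)
```

After training, each instance maps to its best-matching node, and cluster membership on the 2-d grid can be inspected against the Y values exactly as described above.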

4 Experimental Evaluation

4.1 Supervised Learning Results

Both the neural network and decision tree algorithms were able to learn the

relationships in the meta-data, achieving greater than 97% accuracy (on ten-fold

cross-validation test sets) in predicting which of the two heuristics would be superior

based only on the six features (inputs). These approaches have an overall

classification accuracy of 97.34% for the neural network and 97.13% for the decision

tree. While the neural network can be expected to learn the relationships in the data

more powerfully, due to its nonlinearity, its limitation is the lack of insight and

explanation of those relationships. The decision tree’s advantage is that it produces a

clear set of rules, which can be explored to see if any insights can be gleaned. The

decision tree rules are presented in the form of pseudo-code in Figure 3, with the

fraction in brackets showing the number of instances that satisfied both the condition

and the consequence (decision) in the numerator, divided by the total number of

instances that satisfied the condition in the denominator. This proportion is equivalent

to the accuracy of the individual rule.

The results allow us to state a few rules with exceptionally high accuracy:

1) If the majority of jobs are expected to be scheduled early (tardiness factor <=

0.5) then EDD is best in 99.8% of instances

2) If the majority of the jobs are expected to be scheduled late (tardiness factor

> 0.7) then SPT is best in 99.5% of instances

3) If slightly more than half of the jobs are expected to be late (tardiness factor between 0.5 and 0.7) then as long as the tardiness penalty ratio is no more than 3 times larger than the earliness penalty (ρ ≤ 3), then EDD is best in 98.9% of the instances with a due date range greater than 0.2.

The first two rules are intuitive and can be justified from what we know about the

heuristics - EDD is able to minimise lateness deviations when the majority of jobs can

be scheduled before their due date, and SPT is able to minimise the time of jobs in the

system and hence tardiness when the majority of jobs are going to be late [25]. The

third rule reveals the kind of knowledge that can be discovered by adopting a machine

learning approach to the meta-data. Of course other rules can also be explored from

Figure 3, with less confidence due to the lower accuracy, but they may still provide

the basis for gaining insight into the conditions under which different algorithms can

be expected to perform well.

If (τ <= 0.7) Then
    If (τ <= 0.5) Then EDD best (44889/45000 = 99.8%)
    If (τ > 0.5) Then
        If (D_σ <= 0.2) Then
            If (ρ <= 3) Then EDD best (615/750 = 82.0%)
            Else SPT best (1483/1750 = 84.7%)
        Else
            If (ρ <= 3) Then EDD best (5190/5250 = 98.9%)
            Else
                If (τ <= 0.6) Then EDD best (8320/8750 = 95.1%)
                Else
                    If (p̄ <= 2) Then EDD best (556/700 = 79.4%)
                    Else
                        If (n <= 60) Then SPT best (1150/1680 = 68.4%)
                        Else EDD best (728/1120 = 65.0%)
Else SPT best (9950/10000 = 99.5%)

Fig. 3. Pseudo-code for the decision tree rule system, showing the accuracy of each rule
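The tree in Figure 3 can be transcribed directly as a selection mapping S(f(x)) (a sketch; the argument names are ours):

```python
def predict_best_heuristic(tau, D_sigma, rho, p_mean, n):
    """Direct transcription of the decision-tree rules in Fig. 3.
    Returns 'EDD' or 'SPT' given an instance's feature values."""
    if tau > 0.7:
        return 'SPT'                          # rule 2: 99.5% accurate
    if tau <= 0.5:
        return 'EDD'                          # rule 1: 99.8% accurate
    # 0.5 < tau <= 0.7
    if D_sigma <= 0.2:
        return 'EDD' if rho <= 3 else 'SPT'
    if rho <= 3:
        return 'EDD'                          # rule 3: 98.9% accurate
    if tau <= 0.6:
        return 'EDD'
    if p_mean <= 2:
        return 'EDD'
    return 'SPT' if n <= 60 else 'EDD'
```

For example, an instance with τ = 0.3 selects EDD, while one with τ = 0.8 selects SPT, matching the two high-accuracy rules.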

4.2 Unsupervised Learning Results

After training the SOM, the converged map shows 5 clusters, each of which contains

similar instances defined by Euclidean distance in feature space. Essentially, the six-

dimensional input vectors have been projected onto a two-dimensional plane, with

topology-preserving properties. The clusters can be inspected to understand what the

instances within each cluster have in common. The statistical properties of the 5

clusters can be seen in Table 1. The distribution of the input variables (features), and

additional variables including the performance of the heuristics, can be visually

explored using the maps shown in Figure 4. A k-nearest neighbour algorithm (with

k=7) is used to distribute additional data instances (from the test set) or extra variables

(Y values) across the map.

Looking first at the bottom row of Table 1, it is clear that clusters 1, 2 and 3

contain instances that are best solved using EDD (Y=1). These clusters are shown

visually in the bottom half of the 2-d self-organizing map (see Figure 4a for cluster

boundaries, and Figure 4b to see the distribution of Y across the clusters). These three

clusters of instances account for 70.2% of the 75,000 instances (see Table 1). The

remaining clusters 4 and 5 are best solved, on average, by SPT. The maps shown in

Figure 4c – 4h enable us to develop a quick visual understanding of how the clusters

differ from each other, and to see which features are prominent in defining instance

structure.

Table 1. Cluster statistics - mean values of input variables, and heuristic performance variable Y, as well as cluster size. The first number in each cell is the value for the training data, and the second number in parentheses is for the test data.

               Cluster 1      Cluster 2      Cluster 3     Cluster 4     Cluster 5      All Data
instances      17117 (8483)   10454 (5236)   7428 (3832)   8100 (4000)   6901 (3449)    50000 (25000)
instances (%)  34.23 (33.93)  20.91 (20.94)  14.86 (15.33) 16.2 (16.0)   13.8 (13.8)    100 (100)
n              60.65 (61.03)  59.73 (59.73)  58.73 (58.96) 57.8 (57.7)   63.39 (61.56)  60.0 (59.97)
p̄              2.77 (2.76)    5.24 (5.22)    5.08 (5.07)   5.12 (5.11)   2.70 (2.71)    4.0 (3.99)
p_σ            3.54 (3.52)    8.48 (8.45)    8.16 (8.13)   8.24 (8.21)   3.41 (3.41)    6.0 (5.99)
τ              0.31 (0.31)    0.36 (0.35)    0.21 (0.21)   0.72 (0.73)   0.72 (0.72)    0.43 (0.42)
D_σ            0.70 (0.70)    0.88 (0.88)    0.38 (0.38)   0.40 (0.39)   0.40 (0.40)    0.6 (0.59)
ρ              5.89 (5.88)    4.93 (4.99)    5.37 (5.41)   5.24 (5.19)   5.87 (5.72)    5.5 (5.49)
Y              1.00 (0.99)    1.00 (1.00)    0.99 (0.99)   0.36 (0.36)   0.42 (0.41)    0.82 (0.82)

By inspecting the maps shown in Figure 4, and the cluster statistics in Table 1, we can

draw some conclusions about whether the variables in each cluster are above or below

average (compared to the entire dataset), and look for correlations with the heuristic

performance metric Y. For instance, cluster 2 is characterized by instances with

above average values of processing time mean and range, below average tardiness

factor, and above average due date range. The EDD heuristic is always best under

these conditions (Y=1). Instances in cluster 3 are almost identical, except that the due

date range tends to be below average. Since cluster 3 instances are also best solved by

the EDD heuristic, one could hypothesize that the due date range does not have much

influence in predicting heuristic performance. An inspection of the maps, however,

shows this is not the case.

The distribution of Y across the map (Figure 4b) shows a clear divide between the

clusters containing instances best solved using EDD (bottom half) and the clusters

containing instances best solved using SPT (top half). Inspecting the distribution of

features across this divide leads to a simple observation: if the tardiness factor τ is below average (around 0.5, shown as white to mid-grey in Figure 4f), then EDD will be best. But there are small islands of high Y values in clusters 4 and 5 that

overlay nicely with the medium grey values of due date range. So we can observe

another rule that EDD will also be best if the tardiness factor is above average and the

due date range is above average. Also of interest, from these maps we can see that

problem size and the penalty ratio do not influence the relative performance of these

heuristics. Since neither heuristic considers the penalty ratio (it is used within the optimal idle time insertion algorithm [24], which is common to both heuristics, but not by the EDD or SPT sequencing rules themselves), it is not surprising that it does not differentiate the clusters.
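Taken together, the two observations read off the maps amount to a small selection rule. The Python sketch below encodes it, using the all-data means from Table 1 (τ ≈ 0.43, due date range ≈ 0.6) as the "average" thresholds; the function name and threshold defaults are ours, added for illustration, not part of the original heuristics.

```python
def pick_heuristic(tardiness_factor, due_date_range,
                   tau_mean=0.43, ddr_mean=0.6):
    """Selection rule read off the SOM: returns "EDD" when the
    earliest-due-date heuristic is predicted to win (Y = 1 in the
    paper's notation), otherwise "SPT"."""
    if tardiness_factor < tau_mean:
        return "EDD"          # below-average tau: EDD always best
    if due_date_range > ddr_mean:
        return "EDD"          # high tau but wide due date range
    return "SPT"              # high tau, narrow due date range

print(pick_heuristic(0.31, 0.70))  # cluster-1-like instance -> EDD
print(pick_heuristic(0.72, 0.40))  # cluster-4/5-like instance -> SPT
```

Note that n and ρ do not appear in the rule at all, matching the observation that problem size and penalty ratio do not influence relative performance.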

[Figure 4 panels (images not recoverable): a) 5 clusters in 2-d space; b) distribution of Y; c) distribution of n; d)–h) distributions of the remaining features, with τ shown in fig. 4f.]

Fig. 4. Self-Organizing Map showing 5 clusters (fig. 4a), the heuristic performance variable Y (fig. 4b), and the distribution of each of the six features across the clusters (fig. 4c – fig. 4h). The grey scale shows each variable at its minimum value as white, and maximum value as black.

Within Viscovery SOMine, specific regions of the map can be selected, and used

as the basis of a classification. In other words, we can define regions and islands to be

predictive of one heuristic excelling based on the training data (50,000 instances). We

can then test the generalization of the predictive model using the remaining 25,000

instances as a test set, applying the k-nearest neighbour algorithm to determine which test instances belong to the selected region. We select the dark-grey to black regions


of the Y map in Figure 4b, and declare that any test instances falling in the selected

area are classified as Y=1, while any instances falling elsewhere in the map are

classified as Y=0. The resulting accuracy on the test set is 95% in predicting which

heuristic will perform better. The self-organizing map has proven to be useful for both

visualization of feature space and predictive modeling of heuristic performance,

although the accuracy is not quite as high as the supervised learning approaches.
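Outside Viscovery SOMine, this region-based classification of the test set can be approximated with a plain k-nearest-neighbour vote over labelled map nodes: nodes inside the selected dark region carry Y = 1, all others Y = 0, and a test instance takes the majority label of its nearest nodes. The node coordinates below are invented purely for illustration.

```python
import math

# Hypothetical 2-d SOM node positions with their region labels:
# 1 = inside the selected dark (EDD-best) region, 0 = elsewhere.
nodes = [((0.0, 0.0), 1), ((0.0, 1.0), 1),
         ((1.0, 0.0), 0), ((1.0, 1.0), 0)]

def classify(point, labelled_nodes, k=3):
    """Majority vote over the k map nodes nearest to the test point."""
    nearest = sorted(labelled_nodes, key=lambda n: math.dist(point, n[0]))
    votes = [label for _, label in nearest[:k]]
    return 1 if sum(votes) > k / 2 else 0

print(classify((0.1, 0.5), nodes))  # falls near the Y=1 region -> 1
```

With the full map, the labelled nodes would be the SOM codebook vectors inside and outside the selected dark-grey region of Figure 4b.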

5 Conclusions and Future Research

In this paper we have illustrated how the concepts of Rice’s Algorithm Selection Problem can be extended within a knowledge discovery framework, and applied to the domain of optimization in order to gain insights into optimization algorithm performance. This paper represents one of the first attempts to apply such an approach in the optimization domain. A large

meta-data set comprising 75,000 instances of the E/T scheduling problem has been

used to explore what can be learned about the relationships between the features of

the problem instances and the performance of heuristics. Both supervised and

unsupervised learning approaches have been presented, each with their own

advantages and disadvantages made clear by the empirical results. The neural network

obtained the highest accuracy for performance prediction, but its weakness is the lack

of explanation or interpretability of the model. Our goal is not merely performance

prediction, but to gain insights into the characteristics of instances that make solution

by one heuristic superior to another. Decision trees are also a supervised learning

method, and the rules produced demonstrate the potential to obtain both accurate

performance predictions and some insights. Finally, the self-organizing map

demonstrated its benefits for visualization of the meta-data and relationships therein.

One of the most important considerations for this approach to be successful for any

arbitrary optimization problem is the choice of features used to characterize the

instances. These features need to be carefully chosen in such a way that they can

characterize instance and problem structure as well as differentiate algorithm

performance. There is little that will be learned via a knowledge discovery process if

the features selected to characterize the instances do not have any differentiation

power. The result will be supervised learning models of algorithm performance that

predict average behaviour with accuracy measures no better than the default

accuracies one could obtain from using a naïve model. Likewise, the resulting self-

organizing map would show no discernible difference between the clusters when

superimposing Y values (unlike Figure 4b, where we obtain a clear difference

between the top and bottom halves of the map). Thus the success of any knowledge

discovery process depends on the quality of the data, and in this case, the meta-data

must use features that serve the purpose of differentiating algorithm performance. In

this paper we have used a small set of problem-dependent features, related to the E/T

Scheduling Problem, which would be of no use to any other optimization problem.
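The "default accuracy of a naïve model" mentioned above is simply the majority-class rate of Y in the meta-data; a learned model must clear this bar before the features can be credited with any differentiation power. A minimal check, with an illustrative label vector:

```python
def majority_baseline(labels):
    """Accuracy of always predicting the most common class."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return max(counts.values()) / len(labels)

# With Y = 0.82 on average over all data (Table 1), always guessing
# "EDD wins" is already right about 82% of the time.
y = [1] * 82 + [0] * 18
print(majority_baseline(y))  # 0.82
```

Reported accuracies (97% for the decision tree, 95% for the SOM regions) are therefore meaningful only relative to this 82% default.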

For other optimization problems, such as graph colouring or the Travelling Salesman

Problem, recent developments in phase transition analysis (e.g. [35]) could form the

foundation of the development of useful features. Landscape analysis [20, 27]

provides a more general (problem independent) set of features, as do ideas from

landmarking [28] and hyper-heuristics [30]. It is natural to expect that the best results

will be obtained from a combination of generic and problem dependent features, and

this will be the focus of our future research. In addition, we plan to extend the

approach to consider the performance of a wider variety of algorithms, especially

meta-heuristics, where we will also be gathering meta-data related to the features of

the meta-heuristics themselves (e.g. hill-climbing capability, tabu list, annealing

mechanism, population-based search, etc.). This will help to close the loop to ensure

that any insights derived from such an approach are able to provide inputs into the

design of new hybrid algorithms that adapt the components of the meta-heuristic

according to the instance features – an extension of the highly successful algorithm

portfolio approach [4].

References

1. Rice, J. R.: The Algorithm Selection Problem. Adv. Comp. 15, 65--118 (1976)

2. Watson, J.P., Barbulescu, L., Howe, A.E., Whitley, L.D.: Algorithm Performance and

Problem Structure for Flow-shop Scheduling. In: Proc. AAAI Conf. on Artificial

Intelligence, pp. 688--694 (1999)

3. Wolpert, D.H., Macready, W.G.: No Free Lunch Theorems for Optimization. IEEE T. Evolut. Comput. 1, 67--82 (1997)

4. Xu, L., Hutter, F., Hoos, H., Leyton-Brown, K.: Satzilla-07: The Design And Analysis Of

An Algorithm Portfolio For SAT. LNCS, vol. 4741, pp. 712--727 (2007)

5. Leyton-Brown, K., Nudelman, E., Shoham, Y.: Learning the Empirical Hardness of

Optimization Problems: The Case of Combinatorial Auctions. LNCS, vol. 2470. pp. 556--

569. Springer, Heidelberg (2002)

6. Leyton-Brown, K., Nudelman, E., Andrew, G., McFadden, J., Shoham, Y.: A Portfolio

Approach to Algorithm Selection. In: Proc. IJCAI. pp. 1542--1543 (2003)

7. Nudelman, E., Leyton-Brown, K., Hoos, H., Devkar, A., Shoham, Y.: Understanding

Random SAT: Beyond the Clauses-To-Variables Ratio. LNCS, vol. 3258, pp. 438--452

(2004)

8. Horvitz, E., Ruan, Y., Gomes, C., Kautz, H., Selman, B., Chickering, M.: A Bayesian

Approach to Tackling Hard Computational Problems. In: Proc. 17th Conf. on Uncertainty

in Artificial Intelligence, pp. 235--244. Morgan Kaufmann, San Francisco (2001)

9. Samulowitz, H., Memisevic, R.: Learning to solve QBF. In: Proc. 22nd AAAI Conf. on

Artificial Intelligence, pp. 255--260 (2007)

10. Streeter, M., Golovin, D., Smith, S. F.: Combining multiple heuristics online. In: Proc.

22nd AAAI Conf. on Artificial Intelligence, pp. 1197--1203 (2007)

11. Vilalta, R., Drissi, Y.: A Perspective View and Survey of Meta-Learning. Artif. Intell.

Rev. 18, 77--95 (2002)

12. Michie, D., Spiegelhalter, D.J., Taylor C.C. (eds.) Machine Learning, Neural and

Statistical Classification. Ellis Horwood, New York (1994)

13. Brazdil, P., Soares, C., Costa, J.: Ranking Learning Algorithms: Using IBL and Meta-

Learning on Accuracy and Time Results. Mach. Learn. 50, 251--277 (2003)

14. Ali, S., Smith, K.: On Learning Algorithm Selection for Classification. Appl. Soft Comp.

6, 119--138 (2006)

15. Stützle, T., Fernandes, S.: New Benchmark Instances for the QAP and the Experimental

Analysis of Algorithms. LNCS, vol. 3004, pp. 199--209 (2004)

16. Carchrae, T., Beck, J.C.: Applying Machine Learning to Low Knowledge Control of

Optimization Algorithms. Comput. Intell. 21, 373--387 (2005)

17. Shaw, M.J., Park, S., Raman, N.: Intelligent Scheduling With Machine Learning

Capabilities: The Induction Of Scheduling Knowledge. IIE Trans. 24, 156--168 (1992)

18. Knowles, J. D., Corne, D. W.: Towards Landscape Analysis to Inform the Design of a

Hybrid Local Search for the Multiobjective Quadratic Assignment Problem. In: Abraham,

A., Ruiz-Del-Solar, J., Koppen M. (eds.) Soft Computing Systems: Design, Management

and Applications, pp. 271--279. IOS Press, Amsterdam (2002)

19. Merz, P.: Advanced Fitness Landscape Analysis and the Performance of Memetic

Algorithms. Evol. Comp., 2, 303--325 (2004)

20. Watson, J., Beck, J. C., Howe, A. E., Whitley, L. D.: Problem Difficulty for Tabu Search

in Job-Shop Scheduling. Artif. Intell. 143, 189--217 (2003)

21. Smith-Miles, K. A.: Cross-Disciplinary Perspectives On Meta-Learning For Algorithm

Selection. ACM Computing Surveys. In press (2009).

22. Baker, K.R., Scudder, G.D.: Sequencing With Earliness and Tardiness Penalties: A

Review. Ops. Res., 38, 22--36 (1990)

23. James, R. J. W., Buchanan, J. T.: A Neighbourhood Scheme with a Compressed Solution

Space for The Early/Tardy Scheduling Problem. Eur. J. Oper.Res. 102, 513--527 (1997)

24. Fry T.D., Armstrong R.D., Blackstone J.H.: Minimizing Weighted Absolute Deviation in

Single Machine Scheduling. IIE Transactions, 19, 445--450 (1987)

25. Vollmann T.E., Berry, W.L., Whybark, D.C., Jacobs, F.R.: Manufacturing Planning and Control for Supply Chain Management. 5th Edition, McGraw Hill, New York (2005)

26. Krajewski, L.J., Ritzman, L.P.: Operations Management: Processes and Value Chains. 7th Edition, Pearson Prentice Hall, New Jersey (2005)

27. Schiavinotto, T., Stützle, T.: A review of metrics on permutations for search landscape

analysis. Comput. Oper. Res. 34, 3143--3153 (2007).

28. Pfahringer, B., Bensusan, H., Giraud-Carrier, C. G.: Meta-Learning by Landmarking Various Learning Algorithms. In: Proc. ICML, pp. 743--750 (2000)

29. Baker K. B., Martin, J. B.: An Experimental Comparison of Solution Algorithms for the

Single Machine Tardiness Problem. Nav. Res. Log. 21, 187--199 (1974)

30. Burke, E., Hart, E., Kendall, G., Newall, J., Ross, P., Schulenburg, S.: Hyper-heuristics:

An Emerging Direction in Modern Search Technology. In: Glover, F., Kochenberger, G.

(eds.) Handbook of Meta-heuristics. pp. 457--474. Kluwer, Norwell MA (2002)

31. Smith, K. A.: Neural Networks for Prediction and Classification. In: Wang, J.(ed.),

Encyclopaedia of Data Warehousing And Mining. vol. 2, pp. 86--869, Information

Science Publishing, Hershey PA (2006)

32. Witten, I. H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques.

2nd Edition. Morgan Kaufmann, San Francisco (2005)

33. Quinlan, J. R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco

(1993)

34. Kohonen, T.: Self-Organized Formation of Topologically Correct Feature Maps. Biol.

Cyber. 43, 59--69 (1982)

35. Achlioptas, D., Naor, A., Peres, Y.: Rigorous Location of Phase Transitions in Hard

Optimization Problems. Nature, 435, 759--764 (2005)